Cart btn

Wednesday, December 8, 2021

4 tricks you should know to parse date columns with Pandas read_csv()

 

4 tricks you should know to parse date columns with Pandas read_csv()

Some of the most helpful Pandas tricks

B. Chen

Importing data is the first step in any data science project. Often, you’ll work with data in CSV files and run into problems at the very beginning.

Among the problems, parse date columns are the most common to us. In this article, we will cover the following most common parse date columns problems:

  1. Reading date columns from a CSV file
  2. Day first format (DD/MM, DD MM or, DD-MM)
  3. Combining multiple columns to a datetime
  4. Customizing a date parser

Please check out my Github repo for the source code.

1. Reading date columns from a CSV file

By default, date columns are represented as object when loading data from a CSV file.

For example, data_1.csv

date,product,price
1/1/2019,A,10
1/2/2020,B,20
1/3/1998,C,30

The date column gets read as an object data type using the default read_csv():

df = pd.read_csv('data/data_1.csv')
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null object
1 product 3 non-null object
2 price 3 non-null int64
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes

To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes

Now, the DataFrame should look like:

2. Day first format (DD/MM, DD MM or, DD-MM)

By default, the argument parse_dates will read date data with month first (MM/DD, MM DD, or MM-DD) format, and this arrangement is relatively unique in the United State.

In most of the rest of the world, the day is written first (DD/MM, DD MM, or DD-MM). If you would like Pandas to consider day first instead of month, you can set the argument dayfirst to True.

pd.read_csv('data/data_1.csv', 
parse_dates=['date'],
dayfirst=True)

Alternatively, you can customize a date parser to handle day first format. Please check out the solution in the “4. Customizing a date parser”.

3. Combining multiple columns to a datetime

Sometimes date is split up into multiple columns, for example, year, month, and day

year,month,day,product,price
2019,1,1,A,10
2019,1,2,B,20
2019,1,3,C,30
2019,1,4,D,40

To combine them into a datetime, we can pass a nested list to parse_dates.

df = pd.read_csv('data/data_4.csv',
parse_dates=[['year', 'month', 'day']])
df.info()
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year_month_day 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 224.0+ bytes

Notice that the column name year_month_day is generated automatically. To specify a custom column name, we can pass a dictionary instead.

df = pd.read_csv('data/data_4.csv',
parse_dates={ 'date': ['year', 'month', 'day'] })
df.info()
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 224.0+ bytes

4. Customizing a date parser

By default, date columns are parsed using the Pandas built-in parser from dateutil.parser.parse. Sometimes, you might need to write your own parser to support a different date format, for example, YYYY-DD-MM HH:MM:SS:

date,product,price
2016-6-10 20:30:0,A,10
2016-7-1 19:45:30,B,20
2013-10-12 4:5:1,C,20

The easiest way is to write a lambda function which can read the data in this format, and pass the lambda function to the argument date_parser.

from datetime import datetimecustom_date_parser = lambda x: datetime.strptime(x, "%Y-%d-%m %H:%M:%S")df = pd.read_csv('data/data_6.csv',
parse_dates=['date'],
date_parser=custom_date_parser)
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes

Now, the date column has been read correctly in Pandas Dataframe.

That’s it

Thanks for reading.

Please check out the notebook on my Github for the source code.

Stay tuned if you are interested in the practical aspect of machine learning.

Some relevant articles

Quantity : Add to Cart

No comments:

Post a Comment

Tag Line