4 tricks you should know to parse date columns with Pandas read_csv()
Importing data is the first step in any data science project. Often, you’ll work with data in CSV files and run into problems at the very beginning.
Among the problems, parse date columns are the most common to us. In this article, we will cover the following most common parse date columns problems:
- Reading date columns from a CSV file
- Day first format (DD/MM, DD MM or, DD-MM)
- Combining multiple columns to a datetime
- Customizing a date parser
Please check out my Github repo for the source code.
1. Reading date columns from a CSV file
By default, date columns are represented as object when loading data from a CSV file.
For example, data_1.csv
date,product,price
1/1/2019,A,10
1/2/2020,B,20
1/3/1998,C,30
The date column gets read as an object data type using the default read_csv()
:
df = pd.read_csv('data/data_1.csv')
df.info()RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null object
1 product 3 non-null object
2 price 3 non-null int64
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
To read the date column correctly, we can use the argument parse_dates
to specify a list of date columns.
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
Now, the DataFrame should look like:
2. Day first format (DD/MM, DD MM or, DD-MM)
By default, the argument parse_dates
will read date data with month first (MM/DD, MM DD, or MM-DD) format, and this arrangement is relatively unique in the United State.
In most of the rest of the world, the day is written first (DD/MM, DD MM, or DD-MM). If you would like Pandas to consider day first instead of month, you can set the argument dayfirst
to True
.
pd.read_csv('data/data_1.csv',
parse_dates=['date'],
dayfirst=True)
Alternatively, you can customize a date parser to handle day first format. Please check out the solution in the “4. Customizing a date parser”.
3. Combining multiple columns to a datetime
Sometimes date is split up into multiple columns, for example, year, month, and day
year,month,day,product,price
2019,1,1,A,10
2019,1,2,B,20
2019,1,3,C,30
2019,1,4,D,40
To combine them into a datetime, we can pass a nested list to parse_dates
.
df = pd.read_csv('data/data_4.csv',
parse_dates=[['year', 'month', 'day']])
df.info()RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year_month_day 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 224.0+ bytes
Notice that the column name year_month_day is generated automatically. To specify a custom column name, we can pass a dictionary instead.
df = pd.read_csv('data/data_4.csv',
parse_dates={ 'date': ['year', 'month', 'day'] })
df.info()RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4 non-null datetime64[ns]
1 product 4 non-null object
2 price 4 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 224.0+ bytes
4. Customizing a date parser
By default, date columns are parsed using the Pandas built-in parser from dateutil.parser.parse
. Sometimes, you might need to write your own parser to support a different date format, for example, YYYY-DD-MM HH:MM:SS:
date,product,price
2016-6-10 20:30:0,A,10
2016-7-1 19:45:30,B,20
2013-10-12 4:5:1,C,20
The
easiest way is to write a lambda function which can read the data in
this format, and pass the lambda function to the argument date_parser
.
from datetime import datetimecustom_date_parser = lambda x: datetime.strptime(x, "%Y-%d-%m %H:%M:%S")df = pd.read_csv('data/data_6.csv',
parse_dates=['date'],
date_parser=custom_date_parser)
df.info()RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
Now, the date column has been read correctly in Pandas Dataframe.
That’s it
Thanks for reading.
Please check out the notebook on my Github for the source code.
Stay tuned if you are interested in the practical aspect of machine learning.
Some relevant articles
- 6 Pandas tricks you should know to speed up your data analysis
- 7 setups you should include at the beginning of a data science project.
No comments:
Post a Comment