Wednesday, December 8, 2021

4 tricks you should know to parse date columns with Pandas read_csv()


Some of the most helpful Pandas tricks

B. Chen

Importing data is the first step in any data science project. Often, you’ll work with data in CSV files and run into problems at the very beginning.

Among these problems, parsing date columns is the one we run into most often. In this article, we will cover the following common date-parsing problems:

  1. Reading date columns from a CSV file
  2. Day first format (DD/MM, DD MM or, DD-MM)
  3. Combining multiple columns to a datetime
  4. Customizing a date parser

Please check out my Github repo for the source code.

1. Reading date columns from a CSV file

By default, date columns are represented as object when loading data from a CSV file.

For example, data_1.csv

date,product,price
1/1/2019,A,10
1/2/2020,B,20
1/3/1998,C,30

The date column gets read as an object data type using the default read_csv():

df = pd.read_csv('data/data_1.csv')
df.info()

RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     3 non-null      object
 1   product  3 non-null      object
 2   price    3 non-null      int64
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes

To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

df = pd.read_csv('data/data_1.csv', parse_dates=['date'])
df.info()

RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     3 non-null      datetime64[ns]
 1   product  3 non-null      object
 2   price    3 non-null      int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes

Now, the date column is stored as datetime values in the DataFrame.
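For reference, printing df should show something like the following (a sketch, assuming the default month-first parsing of data_1.csv above):

print(df)

        date product  price
0 2019-01-01       A     10
1 2020-01-02       B     20
2 1998-01-03       C     30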

2. Day first format (DD/MM, DD MM or, DD-MM)

By default, the argument parse_dates will read date data in month-first (MM/DD, MM DD, or MM-DD) format, an arrangement that is largely unique to the United States.

In most of the rest of the world, the day is written first (DD/MM, DD MM, or DD-MM). If you would like Pandas to consider day first instead of month, you can set the argument dayfirst to True.

pd.read_csv('data/data_1.csv',
            parse_dates=['date'],
            dayfirst=True)

Alternatively, you can write a custom date parser to handle the day-first format. Please check out the solution in section 4, “Customizing a date parser”.
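To see the difference, here is a minimal sketch using the sample data_1.csv above: with dayfirst=True, a value such as 1/2/2020 is parsed as 1 February 2020 instead of 2 January 2020.

df = pd.read_csv('data/data_1.csv',
                 parse_dates=['date'],
                 dayfirst=True)
print(df['date'])

0   2019-01-01
1   2020-02-01
2   1998-03-01
Name: date, dtype: datetime64[ns]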

3. Combining multiple columns to a datetime

Sometimes a date is split across multiple columns, for example, year, month, and day:

year,month,day,product,price
2019,1,1,A,10
2019,1,2,B,20
2019,1,3,C,30
2019,1,4,D,40

To combine them into a datetime, we can pass a nested list to parse_dates.

df = pd.read_csv('data/data_4.csv',
                 parse_dates=[['year', 'month', 'day']])
df.info()

RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   year_month_day  4 non-null      datetime64[ns]
 1   product         4 non-null      object
 2   price           4 non-null      int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 224.0+ bytes

Notice that the column name year_month_day is generated automatically. To specify a custom column name, we can pass a dictionary instead.

df = pd.read_csv('data/data_4.csv',
                 parse_dates={'date': ['year', 'month', 'day']})
df.info()

RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     4 non-null      datetime64[ns]
 1   product  4 non-null      object
 2   price    4 non-null      int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 224.0+ bytes
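If you also want to keep the original year, month, and day columns alongside the combined date, read_csv accepts keep_date_col=True. A minimal sketch (note that newer Pandas releases have been moving away from this combined-parse API, so check your version's documentation):

df = pd.read_csv('data/data_4.csv',
                 parse_dates={'date': ['year', 'month', 'day']},
                 keep_date_col=True)
# df now contains the combined 'date' column plus the original
# 'year', 'month', and 'day' columns (kept as strings)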

4. Customizing a date parser

By default, date columns are parsed using the Pandas built-in parser from dateutil.parser.parse. Sometimes, you might need to write your own parser to support a different date format, for example, YYYY-DD-MM HH:MM:SS:

date,product,price
2016-6-10 20:30:0,A,10
2016-7-1 19:45:30,B,20
2013-10-12 4:5:1,C,20

The easiest way is to write a lambda function which can read the data in this format, and pass the lambda function to the argument date_parser.

from datetime import datetime

custom_date_parser = lambda x: datetime.strptime(x, "%Y-%d-%m %H:%M:%S")

df = pd.read_csv('data/data_6.csv',
                 parse_dates=['date'],
                 date_parser=custom_date_parser)
df.info()

RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date     3 non-null      datetime64[ns]
 1   product  3 non-null      object
 2   price    3 non-null      int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes

Now, the date column has been read correctly into the Pandas DataFrame.
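Note that newer Pandas releases (2.0 and later) deprecate the date_parser argument in favor of date_format. A minimal sketch, assuming Pandas 2.0+:

df = pd.read_csv('data/data_6.csv',
                 parse_dates=['date'],
                 date_format='%Y-%d-%m %H:%M:%S')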

That’s it

Thanks for reading.

Please check out the notebook on my Github for the source code.

Stay tuned if you are interested in the practical aspect of machine learning.

Some relevant articles

All Pandas cut() you should know for transforming numerical data into categorical data


All Pandas cut() you should know for transforming numerical data into categorical data

B. Chen

Please check out the notebook for the source code.

1. Discretizing into equal-sized bins

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
df['age_group'] = pd.cut(df['age'], 3)
df['age_group']

With 3 equal-width bins, the interval size is:

interval = (max_value - min_value) / num_of_bins
         = (99 - 2) / 3
         = 32.33333

and the resulting categories are:

(1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]
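As a quick check, the bin width above can be reproduced directly from the data (a minimal sketch using the df defined in this section):

width = (df['age'].max() - df['age'].min()) / 3   # (99 - 2) / 3 ≈ 32.333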

2. Adding custom bins

df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 61, 100])
df['age_group']

0      (0, 12]
1    (61, 100]
2     (19, 61]
3     (19, 61]
4      (0, 12]
5     (12, 19]
6    (61, 100]
7    (61, 100]
8     (19, 61]
9     (19, 61]
Name: age_group, dtype: category
Categories (4, interval[int64]): [(0, 12] < (12, 19] < (19, 61] < (61, 100]]
df.sort_values('age_group')

df['age_group'].value_counts().sort_index()

(0, 12]      2
(12, 19]     1
(19, 61]     4
(61, 100]    3
Name: age_group, dtype: int64

3. Adding labels to bins

bins=[0, 12, 19, 61, 100]
labels=['<12', 'Teen', 'Adult', 'Older']
df['age_group'] = pd.cut(df['age'], bins, labels=labels)
df['age_group']

0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age_group, dtype: category
Categories (4, object): ['<12' < 'Teen' < 'Adult' < 'Older']
df['age_group'].value_counts().sort_index()

<12      2
Teen     1
Adult    4
Older    3
Name: age_group, dtype: int64

4. Configuring leftmost edge with right=False

pd.cut(df['age'], bins=[0, 12, 19, 61, 100], right=False)

0      [0, 12)
1    [61, 100)
2     [19, 61)
3     [19, 61)
4      [0, 12)
5     [12, 19)
6    [61, 100)
7    [61, 100)
8     [19, 61)
9     [19, 61)
Name: age, dtype: category
Categories (4, interval[int64]): [[0, 12) < [12, 19) < [19, 61) < [61, 100)]

5. Including the lowest value with include_lowest=True

df['age_group'] = pd.cut(df['age'], bins=[2, 12, 19, 61, 100])

df['age_group'] = pd.cut(
    df['age'],
    bins=[2, 12, 19, 61, 100],
    include_lowest=True
)
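The reason include_lowest matters here: with bins starting at 2, the first interval (2, 12] excludes its left edge, so the minimum age of 2 gets no bin and becomes NaN; with include_lowest=True it is included in the first bin. A quick check (minimal sketch):

df.loc[df['age'] == 2, 'age_group']   # NaN without include_lowest, first bin with it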

6. Passing an IntervalIndex to bins

bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])

IntervalIndex([(0, 12], (19, 61], (61, 100]],
              closed='right',
              dtype='interval[int64]')
df['age_group'] = pd.cut(df['age'], bins)
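Note that these intervals leave a gap between 12 and 19, so any age in that range (15 in this sample) is not covered by any interval and becomes NaN. A quick check (minimal sketch):

df.loc[df['age_group'].isna(), 'age']   # ages that fall outside every interval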

7. Returning bins with retbins=True

result, bins = pd.cut(
    df['age'],
    bins=4,          # A single number value
    retbins=True
)

# Print out bins value
bins

array([ 1.903, 26.25 , 50.5 , 74.75 , 99. ])
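One common reason to grab the computed edges is to apply the same binning to new data later. A minimal sketch (new_ages is a hypothetical Series, not part of the original example):

new_ages = pd.Series([5, 45, 90])
pd.cut(new_ages, bins=bins)   # reuse the edges returned by retbins=True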

8. Creating unordered categories

pd.cut(
    df['age'],
    bins=[0, 12, 19, 61, 100],
    labels=['<12', 'Teen', 'Adult', 'Older'],
    ordered=False,
)

0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age, dtype: category
Categories (4, object): ['<12', 'Teen', 'Adult', 'Older']

Conclusion

You may be interested in some of my other Pandas articles:
