
Wednesday, December 8, 2021

All Pandas cut() you should know for transforming numerical data into categorical data

 

B. Chen

Please check out Notebook for the source code.

1. Discretizing into equal-sized bins

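import pandas as pd

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
df['age_group'] = pd.cut(df['age'], 3)

df['age_group']

Passing an integer splits the range of the data into that many bins of equal width:

interval = (max_value - min_value) / num_of_bins
         = (99 - 2) / 3
         = 32.33333

(<--32.3333-->] < (<--32.3333-->] < (<--32.3333-->]
(1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]

The leftmost edge is 1.903 rather than 2 because pandas lowers it by 0.1% of the range (0.001 * 97 = 0.097), so that the minimum value falls inside the first interval.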
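The interval arithmetic above can be checked against pandas itself. The following is a minimal sketch, not the library's actual implementation: it rebuilds the edges by hand with NumPy and compares them with the edges that pd.cut() reports through retbins=True. The names ages, manual_edges and pandas_edges are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30], name="age")

num_of_bins = 3
min_value, max_value = ages.min(), ages.max()

# Evenly spaced edges from the minimum to the maximum value
manual_edges = np.linspace(min_value, max_value, num_of_bins + 1)

# Lower the first edge by 0.1% of the range so the minimum value
# lands inside the first (left-open) interval
manual_edges[0] -= 0.001 * (max_value - min_value)

# Ask pandas for the edges it actually used
_, pandas_edges = pd.cut(ages, num_of_bins, retbins=True)

print(manual_edges)
print(pandas_edges)
```

Both arrays should start at 1.903 and end at 99.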

2. Adding custom bins

df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 61, 100])

df['age_group']

0      (0, 12]
1    (61, 100]
2     (19, 61]
3     (19, 61]
4      (0, 12]
5     (12, 19]
6    (61, 100]
7    (61, 100]
8     (19, 61]
9     (19, 61]
Name: age_group, dtype: category
Categories (4, interval[int64]): [(0, 12] < (12, 19] < (19, 61] < (61, 100]]

df.sort_values('age_group')

df['age_group'].value_counts().sort_index()

(0, 12]      2
(12, 19]     1
(19, 61]     4
(61, 100]    3
Name: age_group, dtype: int64

3. Adding labels to bins

bins = [0, 12, 19, 61, 100]
labels = ['<12', 'Teen', 'Adult', 'Older']
df['age_group'] = pd.cut(df['age'], bins, labels=labels)

df['age_group']

0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age_group, dtype: category
Categories (4, object): ['<12' < 'Teen' < 'Adult' < 'Older']

df['age_group'].value_counts().sort_index()

<12      2
Teen     1
Adult    4
Older    3
Name: age_group, dtype: int64
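As a side note to the labels example above, the labels argument also accepts False, in which case pd.cut() returns the integer position of each bin instead of a category. The sketch below is an illustration rather than part of the original walkthrough; bin_index and recovered are made-up names.

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
bins = [0, 12, 19, 61, 100]
labels = ['<12', 'Teen', 'Adult', 'Older']

# labels=False returns the 0-based index of the bin each value falls into
bin_index = pd.cut(df['age'], bins, labels=False)
print(bin_index.tolist())

# The indices can be mapped back to the text labels when needed
recovered = bin_index.map(dict(enumerate(labels)))
print(recovered.tolist())
```

Integer codes like these can be convenient when the result feeds a numeric pipeline rather than a report.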

4. Configuring leftmost edge with right=False

pd.cut(df['age'], bins=[0, 12, 19, 61, 100], right=False)

0      [0, 12)
1    [61, 100)
2     [19, 61)
3     [19, 61)
4      [0, 12)
5     [12, 19)
6    [61, 100)
7    [61, 100)
8     [19, 61)
9     [19, 61)
Name: age, dtype: category
Categories (4, interval[int64]): [[0, 12) < [12, 19) < [19, 61) < [61, 100)]
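The sample ages above never land exactly on a bin edge, so the effect of right=False is only visible in the bracket style. The sketch below uses a few hypothetical boundary values (edge_ages, not part of the original data) to show where they end up under each setting.

```python
import pandas as pd

# Hypothetical ages that sit exactly on the bin edges
edge_ages = pd.Series([12, 19, 61, 100], name="age")
bins = [0, 12, 19, 61, 100]

right_closed = pd.cut(edge_ages, bins=bins)              # default: right=True
left_closed = pd.cut(edge_ages, bins=bins, right=False)  # intervals closed on the left

comparison = pd.DataFrame({
    "age": edge_ages,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```

With right=False each edge value moves up into the next bin, and 100 no longer belongs to any bin because [61, 100) excludes its right edge, so it comes back as NaN.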

5. Including the lowest value with include_lowest=True

# The first bin is (2, 12], which excludes 2, so age 2 gets NaN
df['age_group'] = pd.cut(df['age'], bins=[2, 12, 19, 61, 100])

# include_lowest=True makes the first bin include its left edge
df['age_group'] = pd.cut(
    df['age'],
    bins=[2, 12, 19, 61, 100],
    include_lowest=True
)

6. Passing an IntervalIndex to bins

bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])
bins

IntervalIndex([(0, 12], (19, 61], (61, 100]],
              closed='right',
              dtype='interval[int64]')

df['age_group'] = pd.cut(df['age'], bins)
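The IntervalIndex above leaves a gap between 12 and 19, so any age in that range has no bin to fall into. A minimal sketch for finding such values, using the illustrative names age_group and uncovered:

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})

# Note the gap: nothing covers (12, 19]
bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])
age_group = pd.cut(df['age'], bins)

# Values that fall outside every interval come back as NaN
uncovered = df.loc[age_group.isna(), 'age']
print(uncovered.tolist())  # [15]
```

Checking for NaN after the call is a cheap way to confirm that a hand-written IntervalIndex covers the data as intended.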

7. Returning bins with retbins=True

result, bins = pd.cut(
    df['age'],
    bins=4,          # A single number value
    retbins=True
)

# Print out bins value
bins

array([ 1.903, 26.25 , 50.5  , 74.75 , 99.   ])
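One common reason to keep the returned edges is to bin a second dataset with exactly the same boundaries. The sketch below assumes a hypothetical second series, new_ages, that is not part of the original example.

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30], name="age")

# Compute 4 equal-width bins and keep the edges
_, edges = pd.cut(ages, bins=4, retbins=True)

# Hypothetical new data, binned with the edges learned above
new_ages = pd.Series([10, 45, 90], name="age")
new_groups = pd.cut(new_ages, bins=edges)

print(new_groups)
print(new_groups.cat.codes.tolist())  # [0, 1, 3]
```

Values in the new data that fall outside the saved edges would come back as NaN, so the range of the second dataset is worth checking first.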

8. Creating unordered categories

pd.cut(
    df['age'],
    bins=[0, 12, 19, 61, 100],
    labels=['<12', 'Teen', 'Adult', 'Older'],
    ordered=False,
)

0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age, dtype: category
Categories (4, object): ['<12', 'Teen', 'Adult', 'Older']
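The only visible difference in the output above is that the categories are no longer joined by "<". The sketch below shows one practical consequence: order-based operations such as max() rely on an ordered categorical. The names ordered_groups and unordered_groups are illustrative, and the TypeError on the unordered case is what pandas is expected to raise here.

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
bins = [0, 12, 19, 61, 100]
labels = ['<12', 'Teen', 'Adult', 'Older']

ordered_groups = pd.cut(df['age'], bins=bins, labels=labels)
unordered_groups = pd.cut(df['age'], bins=bins, labels=labels, ordered=False)

print(ordered_groups.cat.ordered)    # True
print(unordered_groups.cat.ordered)  # False

# With an ordered result, max() follows the label order
print(ordered_groups.max())          # Older

# Without an order, pandas is expected to refuse the same operation
try:
    unordered_groups.max()
except TypeError as exc:
    print(f"max() is not available: {exc}")
```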

Conclusion
