You have 2 free member-only stories left this month.

All Pandas cut() you should know for transforming numerical data into categorical data

Feb 8·6 min read

All Pandas cut() you should know for transforming numerical data into categorical data (Image by author using canva.com)

Numerical data is common in data analysis. Often you have numerical data that is continuous, very large scales, or highly skewed. Sometimes, it can be easier to bin those data into discrete intervals. This is helpful to perform descriptive statistics when values are divided into meaningful categories. For example, we can divide the exact age into Toddler, Child, Adult, and Elder.

Pandas’ built-in cut() function is a great way to transform numerical data into categorical data. In this article, you’ll learn how to use it to deal with the following common tasks.

Discretizing into equal-sized bins
Adding custom bins
Adding labels to bins
Configuring leftmost edge with right=False
Include the lowest value with include_lowest=True
Passing an IntervalIndex to bins
Returning bins with retbins=True
Creating unordered categories

Please check out Notebook for the source code.

1. Discretizing into equal-sized bins

The simplest usage of cut() must has a column and an integer as input. It is discretizing values into equal-sized bins.

df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})df['age_group'] = pd.cut(df['age'], 3)

Did you observe those intervals from the age_group column? Those interval values are having a round bracket at the start and a square bracket at the end, for example (1.903, 34.333]. It basically means any value on the side of the round bracket is not included in the interval and any value on the side of the square bracket is included (It is known as open and closed intervals in Math).

Now, let's take a look at the new column age_group.

df['age_group']

It shows dtype: category with 3 label values: (1.903, 34.333] , (34.333, 66.667] , and (66.667, 99.0]. Those label values are ordered as indicated with the symbol <. Behind the theme, an interval is calculated as follows in order to generate the equal-sized bins:

interval = (max_value — min_value) / num_of_bins
         = (99 - 2) / 3
         = 32.33333        (<--32.3333-->] < (<--32.3333-->] < (<--32.3333-->]        (1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]

2. Adding custom bins

Let’s divide the above age values into 4 custom groups i.e. 0–12, 12–19, 19–60, 61–100. To do that, we can simply pass those values in a list ([0, 12, 19, 61, 100]) to the argument bins.

df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 61, 100])

We added the group values for age in a new column age_group. By looking into the column

df['age_group']0      (0, 12]
1    (61, 100]
2     (19, 61]
3     (19, 61]
4      (0, 12]
5     (12, 19]
6    (61, 100]
7    (61, 100]
8     (19, 61]
9     (19, 61]
Name: age_group, dtype: category
Categories (4, interval[int64]): [(0, 12] < (12, 19] < (19, 61] < (61, 100]]

We can see dtype: category with 4 ordered label values: (0, 12] < (12, 19] < (19, 61] < (61, 100].

Let’s sort the DataFrame by the column age_group:

df.sort_values('age_group')

Let’s count that how many values fall into each bin.

df['age_group'].value_counts().sort_index()(0, 12]      2
(12, 19]     1
(19, 61]     4
(61, 100]    3
Name: age_group, dtype: int64

3. Adding labels to bins

It is more descriptive to label these age_group values as ‘<12’, ‘Teen’, ‘Adult’, ‘Older’. To do that, we can simply pass those values in a list to the argument labels

bins=[0, 12, 19, 61, 100]
labels=['<12', 'Teen', 'Adult', 'Older']df['age_group'] = pd.cut(df['age'], bins, labels=labels)

Now, when we looking into the column, it shows the label instead

df['age_group']0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age_group, dtype: category
Categories (4, object): ['<12' < 'Teen' < 'Adult' < 'Older']

Similarly, it is showing label when sorting and counting

df['age_group'].value_counts().sort_index()<12      2
Teen     1
Adult    4
Older    3
Name: age_group, dtype: int64

4. Configuring leftmost edge with `right=False`

There is an argument right in Pandas cut() to configure whether bins include the rightmost edge or not. right defaults to True, which mean bins like[0, 12, 19, 61, 100] indicate (0,12], (12,19], (19,61],(61,100] . To include the leftmost edge, we can set right=False:

pd.cut(df['age'], bins=[0, 12, 19, 61, 100], right=False)0      [0, 12)
1    [61, 100)
2     [19, 61)
3     [19, 61)
4      [0, 12)
5     [12, 19)
6    [61, 100)
7    [61, 100)
8     [19, 61)
9     [19, 61)
Name: age, dtype: category
Categories (4, interval[int64]): [[0, 12) < [12, 19) < [19, 61) < [61, 100)]

5. Including the lowest value with `include_lowest=True`

Suppose you would like to divide the above age values into 2–12, 12–19, 19–60, 61–100 instead. You will get a result contains NaN when setting the bins to [2, 12, 19, 61, 100].

df['age_group'] = pd.cut(df['age'], bins=[2, 12, 19, 61, 100])

We get a NaN because the value 2 is the leftmost edge of the first bin (2.0, 19.0] and is not included. To include the lowest value, we can set include_lowest=True. Alternatively, you can set the right to False to include the leftmost edge.

df['age_group'] = pd.cut(
    df['age'], 
    bins=[2, 12, 19, 61, 100], 
    include_lowest=True
)

6. Passing an `IntervalIndex` to `bins`

So far we have been passing an array to bins. Instead of an array, we can also pass an IntervalIndex.

Let’s create an IntervalIndex with 3 bins (0, 12], (19, 61], (61, 100] :

bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])
IntervalIndex([(0, 12], (19, 61], (61, 100]],
              closed='right',
              dtype='interval[int64]')

Next, let’s pass it to the argument bins

df['age_group'] = pd.cut(df['age'], bins)

Notice that value not covered by the IntervalIndex is set to NaN. Basically, passing an IntervalIndex for bins results in those categories exactly.

7. Returning bins with `retbins=True`

There is an argument called retbin to return the bins. If it is set to True, the result will return the bins and it is useful when bins is passed as a single number value

result, bins = pd.cut(
    df['age'], 
    bins=4,            # A single number value
    retbins=True
)
# Print out bins value
bins
array([ 1.903, 26.25 , 50.5  , 74.75 , 99.   ])

8. Creating unordered categories

ordered=False will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:

pd.cut(
    df['age'], 
    bins=[0, 12, 19, 61, 100], 
    labels=['<12', 'Teen', 'Adult', 'Older'], 
    ordered=False,
)0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age, dtype: category
Categories (4, object): ['<12', 'Teen', 'Adult', 'Older']

Conclusion

Pandas cut() function is a quick and convenient way for transforming numerical data into categorical data.

I hope this article will help you to save time in learning Pandas. I recommend you to check out the documentation for the cut() API and to know about other things you can do.

Thanks for reading. Please check out the notebook for the source code and stay tuned if you are interested in the practical aspect of machine learning.

You may be interested in some of my other Pandas articles:

More tutorials can be found on my Github

Quantity : Add to Cart

Pages

Saintis Data

Wednesday, December 8, 2021