Method chaining reference

This is the reference component of the "Method chaining" section of the Advanced Pandas tutorial. For the workbook component, click here.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 5)
wine = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
ramen = pd.read_csv("../input/ramen-ratings/ramen-ratings.csv", index_col=0)

Why method chaining?

Method chaining is the last topic we will cover in this first track of the Advanced Pandas tutorial. It is also the only section of this tutorial which is a technique or a pattern, not a function or variable.

Method chaining is a methodology for performing operations on a DataFrame or Series that emphasizes continuity. To demonstrate what I mean, here's a data cleaning and dropping operation (which you should be familiar with from the last section) done two different ways:

In [2]:
stars = ramen['Stars']
na_stars = stars.replace('Unrated', float('nan')).dropna()
float_stars = na_stars.astype('float64')
float_stars.head()
Out[2]:
Review #
2580    3.75
2579    1.00
2578    2.25
2577    2.75
2576    3.75
Name: Stars, dtype: float64
In [3]:
(ramen['Stars']
     .replace('Unrated', float('nan'))
     .dropna()
     .astype('float64')
     .head())
Out[3]:
Review #
2580    3.75
2579    1.00
2578    2.25
2577    2.75
2576    3.75
Name: Stars, dtype: float64

In the first version we assign data to temporary variables, creating new ones as we go along. In the second version, written in the method chaining style, we instead "chain" our operations one after the other, each one acting on the result of the one before it, starting from the original ramen['Stars'] Series.

Most pandas operations can be written in a method chaining style, and over the last couple of years pandas has added more and more tools for making these sorts of statements easier to write. This paradigm comes to us from the R programming language, specifically the dplyr package, part of the "Tidyverse".

Method chaining is advantageous for several reasons. One is that it lessens the need for creating and mentally tracking temporary variables. Another is that it emphasizes a well-structured iterative approach to working with data, where each operation is a "next step" after the last. Debugging is easy: just comment out operations that don't work until you get to one that does, and then start stepping forward again. And it looks kind of cool. =)
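To make the debugging point concrete, here is a minimal sketch on a small made-up Series (the name toy_stars and its values are illustrative, not taken from the ramen dataset). It uses float('nan') as the missing-value marker, an assumption made here so the sketch behaves the same across pandas versions. Each step sits on its own line, so commenting one out shows the intermediate result.

```python
import pandas as pd

# Hypothetical stand-in for ramen['Stars']; the values are made up.
toy_stars = pd.Series(['3.75', 'Unrated', '1.0', '2.25'], name='Stars')

# Full chain: mark 'Unrated' as missing, drop it, convert to float.
cleaned = (toy_stars
    .replace('Unrated', float('nan'))
    .dropna()
    .astype('float64'))

# Same chain with the last step commented out, to inspect the
# intermediate result before the type conversion.
partial = (toy_stars
    .replace('Unrated', float('nan'))
    .dropna()
    # .astype('float64')
)

print(cleaned.tolist())
print(partial.tolist())
```

Here partial still holds strings, which confirms that the conversion to numbers happens only in the astype step.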

For a deeper exploration of why method chaining is useful, read the Method Chaining section of the Modern Pandas tutorial (written by a pandas core dev).

Assign and pipe

Now that we've learned all these ways of manipulating data with pandas, we're ready to take advantage of method chaining to write clear, clean data manipulation code. Now I'll introduce two additional methods useful for coding in this style.

In [4]:
wine.head()
Out[4]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

The first of these is assign. The assign method lets you create new columns or overwrite existing ones inline, returning a new DataFrame rather than changing the original. For example, to fill the region_1 field with the province field wherever region_1 is null (useful if we're mixing in our own categories), we would do:

In [7]:
wine.assign(
    region_1=wine.apply(lambda srs: srs.region_1 if pd.notnull(srs.region_1) else srs.province, 
                        axis='columns')
)
Out[7]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
review_id
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro Douro NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

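This has the same effect as the following, except that assign returns a new DataFrame and leaves wine untouched, while the version below modifies wine in place:

wine['region_1'] = wine.apply(
    lambda srs: srs.region_1 if pd.notnull(srs.region_1) else srs.province,
    axis='columns'
)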

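You can modify as many old columns and create as many new ones as you'd like with assign. It does have one limitation: because columns are passed as keyword arguments, each column name must be a valid Python identifier, so it cannot contain characters like periods (.) or spaces ( ). Names that are not valid identifiers can still be passed by unpacking a dictionary with **.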
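As a sketch of how assign fits into a chain, the example below uses a tiny made-up frame (toy_wine and its values are hypothetical, as is the has_price column). It passes callables to assign so that each column is computed from the DataFrame as it exists at that point in the chain, uses fillna as a vectorized alternative to the row-wise apply shown above, and unpacks a dictionary to create a column whose name contains a space.

```python
import pandas as pd

# Hypothetical miniature of the wine data; the values are made up.
toy_wine = pd.DataFrame({
    'province': ['Douro', 'Oregon', 'Sicily & Sardinia'],
    'region_1': [None, 'Willamette Valley', 'Etna'],
    'price': [15.0, 14.0, None],
})

result = (toy_wine
    .assign(
        # A callable receives the DataFrame being assigned to.
        region_1=lambda df: df['region_1'].fillna(df['province']),
        has_price=lambda df: df['price'].notnull(),
    )
    # Dictionary unpacking allows names that are not valid identifiers.
    .assign(**{'price rounded': lambda df: df['price'].round()})
)

print(result['region_1'].tolist())
print(list(result.columns))
```

Because assign returns a new DataFrame, toy_wine itself still has its missing region_1 value after this runs.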

The next method to know is pipe. pipe is a little mind-bending: it passes the entire DataFrame to a function of your choosing, and the output of that function replaces the current DataFrame in the chain.

For example, one way to give the DataFrame index a new name would be to do:

In [8]:
def name_index(df):
    df.index.name = 'review_id'
    return df

wine.pipe(name_index)
Out[8]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
review_id
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

pipe is a power tool: it comes in handy when you're performing very intricate operations on your DataFrame. You won't need it often, but it'll be super useful when you do.
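pipe also forwards any extra positional or keyword arguments to the function, which lets small helpers slot into a longer chain. The sketch below assumes a made-up frame and two hypothetical helpers (name_index here takes the new name as a parameter and returns a renamed copy via rename_axis instead of modifying its input).

```python
import pandas as pd

def name_index(df, name):
    """Return a copy of df whose index is called `name` (hypothetical helper)."""
    return df.rename_axis(name)

def keep_at_least(df, column, minimum):
    """Keep rows where df[column] >= minimum (hypothetical helper)."""
    return df[df[column] >= minimum]

# Hypothetical miniature of the wine data; the values are made up.
toy_wine = pd.DataFrame({'points': [87, 90, 85], 'price': [15.0, 32.0, 9.0]})

result = (toy_wine
    .pipe(name_index, 'review_id')                      # extra positional argument
    .pipe(keep_at_least, column='points', minimum=87))  # extra keyword arguments

print(result.index.name)
print(result['points'].tolist())
print(toy_wine.index.name)
```

Returning a new object from each helper keeps the original DataFrame unchanged, so the chain can be re-run safely.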

That concludes this tutorial! Bravo!