import pandas as pd
pd.set_option('display.max_rows', 5)
wine = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
ramen = pd.read_csv("../input/ramen-ratings/ramen-ratings.csv", index_col=0)
Method chaining is the last topic we will cover in this first track of the Advanced Pandas tutorial. It is also the only section of this tutorial which is a technique or a pattern, not a function or variable.
Method chaining is a methodology for performing operations on a DataFrame or Series that emphasizes continuity. To demonstrate what I mean, here's a data cleaning and dropping operation (which you should be familiar with from the last section) done two different ways:
stars = ramen['Stars']
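# Replace with NaN rather than None: some pandas versions treat replace(..., None) as a forward fill
na_stars = stars.replace('Unrated', float('nan')).dropna()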
float_stars = na_stars.astype('float64')
float_stars.head()
(ramen['Stars']
.replace('Unrated', float('nan'))  # NaN marks the value as missing
.dropna()
.astype('float64')
.head())
In the first version we assign data to temporary variables, creating new ones as we go further and further along. In the second version, written in the method chaining style, we instead "chain" our operations, one after the other, all starting from the same original DataFrame.
Most pandas operations can be written in a method chaining style, and over the last couple of years pandas has added more and more tools for making these sorts of statements easier to write. This paradigm comes to us from the R programming language, specifically the dplyr package, part of the "Tidyverse".
Method chaining is advantageous for several reasons. One is that it lessens the need for creating and mentally tracking temporary variables. Another is that it emphasizes a structured, iterative approach to working with data, where each operation is a "next step" after the last. Debugging is easy: just comment out operations that don't work until you get to one that does, and then start stepping forward again. And it looks kind of cool. =)
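The debugging workflow described above can be sketched on a small example. The `toy` DataFrame here is a hypothetical stand-in for the ramen data, not the real file; the last step of the chain is commented out so that the intermediate result can be inspected before the cast is applied.

```python
import pandas as pd

# Hypothetical stand-in for the ramen ratings column (assumption, not the real data).
toy = pd.DataFrame({"Stars": ["3.75", "Unrated", "5", "4.25"]})

result = (
    toy["Stars"]
    .replace("Unrated", float("nan"))  # mark unrated rows as missing
    .dropna()                          # drop the missing rows
    # .astype("float64")               # temporarily disabled while debugging
)

# The values are still strings, because the cast above is commented out.
print(result.tolist())
```

Once the intermediate result looks right, uncomment the last step and run the chain again.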
For a deeper exploration of why method chaining is useful, read the Method Chaining section of the Modern Pandas tutorial (written by a pandas core dev).
Now that we've learned all these ways of manipulating data with pandas, we're ready to take advantage of method chaining to write clear, clean data manipulation code. Now I'll introduce two additional methods useful for coding in this style.
wine.head()
The first of these is assign. The assign method lets you create new columns or modify old ones inside of a DataFrame inline. For example, to fill the region_1 field with the province field wherever region_1 is null (useful if we're mixing in our own categories), we would do:
wine.assign(
region_1=wine.apply(lambda srs: srs.region_1 if pd.notnull(srs.region_1) else srs.province,
axis='columns')
)
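This produces the same column as the following, except that assign returns a new DataFrame instead of modifying wine in place: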
wine['region_1'] = wine.apply(
lambda srs: srs.region_1 if pd.notnull(srs.region_1) else srs.province,
axis='columns'
)
You can modify as many old columns and create as many new ones as you'd like with assign. It does have one limitation: because each column is passed as a keyword argument, its name must be a valid Python identifier, so it cannot contain characters like periods (.) or spaces.
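As a minimal sketch of assign inside a chain, the example below uses a hypothetical three-row `toy` table rather than the real wine data. It makes two assumptions worth flagging: it swaps the row-wise apply for `Series.fillna`, which fills missing values from another column in one vectorized step, and it unpacks a dictionary with `**` as one possible way to use a column name that contains spaces.

```python
import pandas as pd

# Hypothetical miniature of the wine table (assumption, not the real data).
toy = pd.DataFrame({
    "province": ["Oregon", "Sicily", "Alsace"],
    "region_1": ["Willamette Valley", None, "Alsace"],
    "points": [87, 90, 92],
})

result = toy.assign(
    # A callable receives the DataFrame as it exists at this point in the chain.
    region_1=lambda df: df["region_1"].fillna(df["province"]),
    # Unpacking a dict allows names that are not valid Python identifiers.
    **{"points above 80": lambda df: df["points"] - 80},
)

print(result)
```

Passing callables instead of precomputed Series keeps each step dependent only on the DataFrame flowing through the chain, so the same assign call still works after earlier steps are added or removed.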
The next method to know is pipe. pipe is a little mind-bending: it passes the entire DataFrame to a function you supply, and replaces the current DataFrame with the output of that function.
For example, one way to give the DataFrame index a new name would be to do:
def name_index(df):
    df.index.name = 'review_id'
    return df
wine.pipe(name_index)
pipe is a power tool: it comes in handy when you're performing very intricate operations on your DataFrame. You won't need it often, but it'll be super useful when you do.
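Here is a small sketch of pipe with extra arguments, again on a hypothetical `toy` table. The helper names `name_index` and `keep_top` are illustrative, and this version of `name_index` differs from the one above in that it returns a renamed copy via `rename_axis` rather than modifying its input. Any arguments given to pipe after the function are forwarded to that function.

```python
import pandas as pd

def name_index(df, name):
    """Return a copy of df whose index carries the given name."""
    return df.rename_axis(name)

def keep_top(df, column, n):
    """Return the n rows with the largest values in column."""
    return df.nlargest(n, column)

# Hypothetical miniature of the wine table (assumption, not the real data).
toy = pd.DataFrame({"points": [87, 95, 90], "price": [15.0, 60.0, 30.0]})

result = (
    toy
    .pipe(name_index, "review_id")   # positional arguments are forwarded
    .pipe(keep_top, "points", n=2)   # keyword arguments are forwarded too
)

print(result)
```

Because neither helper mutates its input, `toy` is left untouched and the chain can be rerun safely.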
That concludes this tutorial! Bravo!