This is the workbook component of the "Method chaining" section of the tutorial. For the reference component, see the accompanying reference section.
This is the last workbook in the tutorial. Congratulations! In this section we will put together everything we have learned to do some truly interesting things with a few datasets. The exercises in this section are therefore also quite difficult! Try using method chaining syntax as you work through the examples that follow, and make studious use of the hints that we provide.
import pandas as pd
import numpy as np  # used below for np.nan
pd.set_option('display.max_rows', 5)  # 'max_rows' alone is ambiguous in recent pandas versions
from learntools.advanced_pandas.method_chaining import *
chess_games = pd.read_csv("../input/chess/games.csv")
Check your answers in each exercise using the check_qN function (replacing N with the number of the exercise). For example, here's how you would check an incorrect answer to exercise 1:
check_q1(pd.DataFrame())
If you get stuck, use the answer_qN function to see the code with the correct answer.
For the first set of questions, running check_qN on the correct answer returns True. For the second set of questions, using this function to check a correct answer will present an informative graph.
View your data by running the cell below:
chess_games.head()
Exercise 1: It's well-known that in the game of chess, white has a slight first-mover advantage against black. Can you measure this effect in this dataset? Use the winner column to create a pandas Series showing how often white wins, how often black wins, and how often the result is a tie, as a ratio of total games played. In other words, a Series that looks something like this:
white 0.48
black 0.44
draw 0.08
Name: winner, dtype: float64
Hint: use len to get the length of the initial DataFrame, i.e. the count of all games played.
chess_games['winner'].value_counts() / len(chess_games)
print(check_q1(chess_games['winner'].value_counts() / len(chess_games)))
print(answer_q1())
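As an aside, value_counts can do the normalization itself via its normalize parameter; the grader expects the len-based version above, but this one-liner produces the same ratios:
chess_games['winner'].value_counts(normalize=True)  # each count divided by the total number of games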
Exercise 2: The opening_name field of the chess_games dataset provides interesting data on what the most commonly used chess openings are. However, it gives a bit too much detail, including information on the variation used for the most common opening types. For example, rather than giving Queen's Pawn Game, the dataset often includes Queen's Pawn Game: Zukertort Variation.
This makes it a bit difficult to use for categorical purposes. Here's a function that can be used to separate out the "opening archetype":
lambda n: n.split(":")[0].split("|")[0].split("#")[0].strip()
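For example, applied to the variation mentioned above, the function keeps only the text before the first ":", "|", or "#" separator (parse_opening is just an illustrative name for the same lambda):
parse_opening = lambda n: n.split(":")[0].split("|")[0].split("#")[0].strip()
parse_opening("Queen's Pawn Game: Zukertort Variation")  # "Queen's Pawn Game"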
Use this function to parse the opening_name field and generate a pandas Series counting how many times each of the "opening archetypes" gets used. Hint: use a map.
(chess_games
.opening_name
.map(lambda n: n.split(":")[0].split("|")[0].split("#")[0].strip())
.value_counts()
)
print(check_q2((chess_games
.opening_name
.map(lambda n: n.split(":")[0].split("|")[0].split("#")[0].strip())
.value_counts()
)))
print(answer_q2())
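The same archetypes can also be extracted with pandas string methods rather than a Python-level map; a sketch, assuming the checker accepts the equivalent result:
(chess_games
.opening_name
.str.split(":").str[0]  # keep only the text before each separator
.str.split("|").str[0]
.str.split("#").str[0]
.str.strip()
.value_counts()
)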
Exercise 3: In this dataset, different players play varying numbers of games. Group the games by {white_id, victory_status} and count how many times each white player ended the game in mate, draw, resign, etcetera. The name of the column counting how many times each outcome occurred should be n (hint: rename or assign may help).
(chess_games
.assign(n=0)
.groupby(['white_id', 'victory_status'])
.n
.apply(len)
.reset_index()
)
print(check_q3((chess_games
.assign(n=0)
.groupby(['white_id', 'victory_status'])
.n
.apply(len)
.reset_index()
)))
print(answer_q3())
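The assign/apply(len) idiom above is one way to count rows per group; groupby followed by size is the more direct spelling, sketched here on the assumption that the checker accepts an equivalent result:
(chess_games
.groupby(['white_id', 'victory_status'])
.size()  # number of rows in each group
.reset_index(name='n')  # turn the counts into a column named n
)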
Exercise 4: There are a lot of players in the dataset who have only played one or a small handful of games. Create a DataFrame like the one in the previous exercise, but only include users who are in the top 20 users by number of games played. See if you can do this using method chaining alone! Hint: reuse the code from the previous example. Then, use pipe.
(chess_games
.assign(n=0)
.groupby(['white_id', 'victory_status'])
.n
.apply(len)
.reset_index()
.pipe(lambda df: df.loc[df.white_id.isin(chess_games.white_id.value_counts().head(20).index)])
)
print(check_q4((chess_games
.assign(n=0)
.groupby(['white_id', 'victory_status'])
.n
.apply(len)
.reset_index()
.pipe(lambda df: df.loc[df.white_id.isin(chess_games.white_id.value_counts().head(20).index)])
)))
print(answer_q4())
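To see what pipe buys us here, compare the same logic written with intermediate variables; pipe simply lets the final filtering step live inside the chain:
top20 = chess_games.white_id.value_counts().head(20).index  # ids of the 20 most frequent white players
counts = (chess_games
.assign(n=0)
.groupby(['white_id', 'victory_status'])
.n
.apply(len)
.reset_index())
counts[counts.white_id.isin(top20)]  # keep only rows for top-20 players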
Next, let's do some visual exercises.
The next exercise uses the following dataset:
kepler = pd.read_csv("../input/kepler-exoplanet-search-results/cumulative.csv")
kepler
Exercise 5: The Kepler space observatory is in the business of finding potential exoplanets (planets orbiting stars other than our sun) and, after collecting the evidence, deciding whether to confirm, decline to confirm, or deny that a given space body is, in fact, an exoplanet. In the dataset above, the "before" status of the body is koi_pdisposition, and the "after" status is koi_disposition.
Using the dataset above, generate a Series counting all of the possible transitions between pre-disposition and post-disposition. In other words, generate a Series whose index is a MultiIndex based on the {koi_pdisposition, koi_disposition} fields, and whose values are a count of how many times each possible combination occurred.
kepler.assign(n=0).groupby(['koi_pdisposition', 'koi_disposition']).n.count()
print(check_q5(kepler.assign(n=0).groupby(['koi_pdisposition', 'koi_disposition']).n.count()))
print(answer_q5())
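As an aside, pd.crosstab presents the same transitions as a two-way table, and stacking it recovers the MultiIndex Series the exercise asks for:
pd.crosstab(kepler.koi_pdisposition, kepler.koi_disposition).stack()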
The next few exercises use the following datasets:
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
ramen_reviews = pd.read_csv("../input/ramen-ratings/ramen-ratings.csv", index_col=0)
print(wine_reviews.head())
print(ramen_reviews.head())
Exercise 6: As we demonstrated in previous workbooks, the points column in the wine_reviews dataset is measured on a 20-point scale between 80 and 100. Create a Series which remaps the ratings onto a five-star scale by spreading that 20-point range evenly over it (so a score of 80 maps to 0 stars, while a score of 100 maps to 5 stars). Set the Series name to "Wine Ratings", and sort by index value (ascending).
(((wine_reviews['points'].dropna() - 80) / 4)
.value_counts()
.sort_index()
.rename_axis("Wine Ratings")
)
print(check_q6((((wine_reviews['points'].dropna() - 80) / 4)
.value_counts()
.sort_index()
.rename_axis("Wine Ratings")
)))
print(answer_q6())
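A quick sanity check of the remapping on a few hand-picked scores:
(pd.Series([80, 90, 100]) - 80) / 4  # -> 0.0, 2.5, 5.0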
Exercise 7: The Stars column in the ramen_reviews dataset is the ramen equivalent of the points data in wine_reviews. Luckily it is already on a 0-to-5 scale, but it has some problems of its own. Create a Series counting how many ramens earned each of the possible scores in the dataset. Convert the Series to the float64 dtype and drop ramens whose rating is "Unrated". Set the name of the Series to "Ramen Ratings". Sort by index value (ascending).
(ramen_reviews
.Stars
.replace('Unrated', np.nan)  # np.nan rather than None: value=None makes some pandas versions pad-fill instead of inserting missing values
.dropna()
.astype('float64')
.value_counts()
.rename_axis("Ramen Ratings")
.sort_index())
print(check_q7((ramen_reviews
.Stars
.replace('Unrated', np.nan)
.dropna()
.astype('float64')
.value_counts()
.rename_axis("Ramen Ratings")
.sort_index())))
print(answer_q7())
Exercise 8: We can see from the result of the previous exercise that whilst the wine reviewers stick to a strict 20-point scale, ramen reviews occasionally deviate into fractional numbers. Modify your answer to the previous exercise by rounding review scores to the nearest half-point (so 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5).
(ramen_reviews
.Stars
.replace('Unrated', np.nan)
.dropna()
.astype('float64')
.map(lambda v: round(v * 2) / 2)
.value_counts()
.rename_axis("Ramen Ratings")
.sort_index()
)
print(check_q8((ramen_reviews
.Stars
.replace('Unrated', np.nan)
.dropna()
.astype('float64')
.map(lambda v: round(v * 2) / 2)
.value_counts()
.rename_axis("Ramen Ratings")
.sort_index()
)))
print(answer_q8())
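The rounding trick works by doubling, rounding to the nearest integer, and halving; round_half below is just an illustrative name for the same lambda:
round_half = lambda v: round(v * 2) / 2
[round_half(v) for v in (3.1, 3.6, 3.8)]  # -> [3.0, 3.5, 4.0]
round_half(3.25)  # -> 3.0, since Python's round ties to the nearest even integer (6.5 rounds to 6)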
You've finished the Pandas track. Many data scientists feel that fluency with Pandas is the most useful and practical skill they have, because it allows you to progress quickly in any project you take on.
You can put your Pandas skills to use by entering a Kaggle Competition or by answering a question you find interesting using Kaggle Datasets.