This is the workbook component of the "Indexing, selecting, assigning" section. For the reference component, click here.
Selecting specific values of a pandas
DataFrame
or Series
to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.
In this set of exercises we will work on exploring the Wine Reviews dataset.
Fork this notebook using the button towards the top of the screen.
Run the following cell to load your data and some utility functions
import pandas as pd
import seaborn as sns
import sys
sys.path.append('../input/advanced-pandas-exercises/')
from indexing_selecting_assigning import *
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)
Look at an overview of your data by running the following line
reviews.head()
You can check your answers in each of the exercises that follow using the check_qN
function provided in the code cell above (replacing N
with the number of the exercise). For example here's how you would check an incorrect answer to exercise 1:
check_q1(pd.DataFrame())
For the first set of questions, if you use check_qN
on your answer, and your answer is right, a simple True
value will be returned.
For the second set of questions, using this function to check a correct answer will present you will an informative graph!
If you get stuck, use answer_qN
function to print the answer outright.
Exercise 1: Select the description
column from reviews
.
reviews.description
print(check_q1(reviews.description))
print('-------')
print(answer_q1())
Exercise 2: Select the first value from the description column of reviews
.
reviews.description[0]
print(check_q2(reviews.description[0]))
print('-------')
print(answer_q2())
Exercise 3: Select the first row of data (the first record) from reviews
. Hint: from this exercise onwards I strongly recommend using loc
or iloc
.
reviews.iloc[0]
print(check_q3(reviews.iloc[0]))
print('-------')
print(answer_q3())
Exercise 4: Select the first 10 values from the description
column in reviews
. Hint: format your output as a pandas
Series
.
reviews.iloc[0:10, 1]
print(check_q4(reviews.iloc[0:10, 1]))
print('-------')
print(answer_q4())
Exercise 5: Select the records with the 1
, 2
, 3
, 5
, and 8
row index positions. In other words, generate the followingDataFrame
:
reviews.iloc[[1,2,3,5,8]]
print(check_q5(reviews.iloc[[1,2,3,5,8]]))
print('-------')
print(answer_q5())
Exercise 6: Select the country
, province
, region_1
, and region_2
columns of the records with the 0
, 1
, 10
, and 100
index positions. In other words, generate the following DataFrame
:
reviews.loc[[0, 1, 10, 100], ['country', 'province', 'region_1', 'region_2']]
print(check_q6(reviews.loc[[0, 1, 10, 100], ['country', 'province', 'region_1', 'region_2']]))
print('-------')
print(answer_q6())
Exercise 7: Select the country
and variety
columns of the first 100 records.
Hint: you may use loc
or iloc
. When working on the answer this question and the several of the ones that follow, keep the following "gotcha" described in the reference for this tutorial section:
iloc
uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So0:10
will select entries0,...,9
.loc
, meanwhile, indexes inclusively. So0:10
will select entries0,...,10
.[...]
...[consider] when the DataFrame index is a simple numerical list, e.g.
0,...,1000
. In this casedf.iloc[0:1000]
will return 1000 entries, whiledf.loc[0:1000]
return 1001 of them! To get 1000 elements usingiloc
, you will need to go one higher and ask fordf.iloc[0:1001]
.
reviews.loc[0:100, ['country', 'variety']]
print(check_q7(reviews.loc[0:100, ['country', 'variety']]))
print('-------')
print(answer_q7())
Exercise 8: Select wines made in Italy
. Hint: reviews.country
equals what?
reviews.loc[reviews.country == 'Italy']
print(check_q8(reviews.loc[reviews.country == 'Italy']))
print('-------')
print(answer_q8())
Exercise 9: Select wines whose region_2
is not NaN
.
reviews.loc[reviews.region_2.notnull()]
print(check_q9(reviews.loc[reviews.region_2.notnull()]))
print('-------')
print(answer_q9())
The remaining exercises are visual.
Exercise 10: Select the points
column.
reviews.points
print(check_q10(reviews.points))
print(answer_q10())
Exercise 11: Select the points
column for the first 1000 wines.
reviews.loc[:1000, 'points']
print(check_q11(reviews.loc[:1000, 'points']))
print(answer_q11())
Exercise 12: Select the points
column for the last 1000 wines.
reviews.iloc[-1000:, 3]
print(check_q12(reviews.iloc[-1000:, 3]))
print(answer_q12())
Exercise 13: Select the points
column, but only for wines made in Italy.
reviews[reviews.country == "Italy"].points
check_q13(reviews[reviews.country == "Italy"].points)
print(answer_q13())
Exercise 14: Who produces more above-averagely good wines, France or Italy? Select the country
column, but only when said country
is one of those two options, and the points
column is greater than or equal to 90.
Your output should look roughly like this:
119 France
120 Italy
...
129969 France
129970 France
Name: country, Length: 15840, dtype: object
reviews[reviews.country.isin(["Italy", "France"]) & (reviews.points >= 90)].country
check_q14(reviews[reviews.country.isin(["Italy", "France"]) & (reviews.points >= 90)].country)
print(answer_q14())
Move on to the Summary functions and maps workbook.