Introduction

This is the workbook component of the "Indexing, selecting, assigning" section. For the reference component, click here.

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

In this set of exercises we will work on exploring the Wine Reviews dataset.

Relevant Resources

Set Up

Fork this notebook using the button towards the top of the screen.

Run the following cell to load your data and some utility functions

In [1]:
import pandas as pd
import seaborn as sns

import sys
sys.path.append('../input/advanced-pandas-exercises/')
from indexing_selecting_assigning import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

Look at an overview of your data by running the following line

In [2]:
reviews.head()
Out[2]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Checking Answers

You can check your answers in each of the exercises that follow using the check_qN function provided in the code cell above (replacing N with the number of the exercise). For example here's how you would check an incorrect answer to exercise 1:

In [3]:
check_q1(pd.DataFrame())
Out[3]:
False

For the first set of questions, if you use check_qN on your answer, and your answer is right, a simple True value will be returned.

For the second set of questions, using this function to check a correct answer will present you will an informative graph!

If you get stuck, use answer_qN function to print the answer outright.

Exercises

Exercise 1: Select the description column from reviews.

In [4]:
reviews.description
Out[4]:
0         Aromas include tropical fruit, broom, brimston...
1         This is ripe and fruity, a wine that is smooth...
                                ...                        
129969    A dry style of Pinot Gris, this is crisp with ...
129970    Big, rich and off-dry, this is powered by inte...
Name: description, Length: 129971, dtype: object
In [5]:
print(check_q1(reviews.description))
print('-------')
print(answer_q1())
True
-------
reviews.description
None

Exercise 2: Select the first value from the description column of reviews.

In [6]:
reviews.description[0]
Out[6]:
"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."
In [7]:
print(check_q2(reviews.description[0]))
print('-------')
print(answer_q2())
True
-------
reviews.description[0]
None

Exercise 3: Select the first row of data (the first record) from reviews. Hint: from this exercise onwards I strongly recommend using loc or iloc.

In [8]:
reviews.iloc[0]
Out[8]:
country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object
In [9]:
print(check_q3(reviews.iloc[0]))
print('-------')
print(answer_q3())
True
-------
reviews.iloc[0]
None

Exercise 4: Select the first 10 values from the description column in reviews. Hint: format your output as a pandas Series.

In [10]:
reviews.iloc[0:10, 1]
Out[10]:
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
                           ...                        
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, Length: 10, dtype: object
In [11]:
print(check_q4(reviews.iloc[0:10, 1]))
print('-------')
print(answer_q4())
True
-------
reviews.iloc[0:10, 1]
None

Exercise 5: Select the records with the 1, 2, 3, 5, and 8 row index positions. In other words, generate the followingDataFrame:

In [12]:
reviews.iloc[[1,2,3,5,8]]
Out[12]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
5 Spain Blackberry and raspberry aromas show a typical... Ars In Vitro 87 15.0 Northern Spain Navarra NaN Michael Schachner @wineschach Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... Tempranillo-Merlot Tandem
8 Germany Savory dried thyme notes accent sunnier flavor... Shine 87 12.0 Rheinhessen NaN NaN Anna Lee C. Iijima NaN Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... Gewürztraminer Heinz Eifel
In [13]:
print(check_q5(reviews.iloc[[1,2,3,5,8]]))
print('-------')
print(answer_q5())
True
-------
reviews.iloc[[1, 2, 3, 5, 8]]
None

Exercise 6: Select the country, province, region_1, and region_2 columns of the records with the 0, 1, 10, and 100 index positions. In other words, generate the following DataFrame:

In [14]:
reviews.loc[[0, 1, 10, 100], ['country', 'province', 'region_1', 'region_2']]
Out[14]:
country province region_1 region_2
0 Italy Sicily & Sardinia Etna NaN
1 Portugal Douro NaN NaN
10 US California Napa Valley Napa
100 US New York Finger Lakes Finger Lakes
In [15]:
print(check_q6(reviews.loc[[0, 1, 10, 100], ['country', 'province', 'region_1', 'region_2']]))
print('-------')
print(answer_q6())
True
-------
reviews.loc[[0, 1, 10, 100], ['country', 'province', 'region_1', 'region_2']]
None

Exercise 7: Select the country and variety columns of the first 100 records.

Hint: you may use loc or iloc. When working on the answer this question and the several of the ones that follow, keep the following "gotcha" described in the reference for this tutorial section:

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

[...]

...[consider] when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] return 1001 of them! To get 1000 elements using iloc, you will need to go one higher and ask for df.iloc[0:1001].

In [16]:
reviews.loc[0:100, ['country', 'variety']]
Out[16]:
country variety
0 Italy White Blend
1 Portugal Portuguese Red
... ... ...
99 US Bordeaux-style Red Blend
100 US Pinot Gris

101 rows × 2 columns

In [17]:
print(check_q7(reviews.loc[0:100, ['country', 'variety']]))
print('-------')
print(answer_q7())
True
-------
reviews.loc[0:100, ['country', 'variety']]
None

Exercise 8: Select wines made in Italy. Hint: reviews.country equals what?

In [18]:
reviews.loc[reviews.country == 'Italy']
Out[18]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129961 Italy Intense aromas of wild cherry, baking spice, t... NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola Cusumano

19540 rows × 13 columns

In [19]:
print(check_q8(reviews.loc[reviews.country == 'Italy']))
print('-------')
print(answer_q8())
True
-------
reviews.loc[reviews.country == 'Italy']
None

Exercise 9: Select wines whose region_2 is not NaN.

In [20]:
reviews.loc[reviews.region_2.notnull()]
Out[20]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129952 US This Zinfandel from the eastern section of Nap... NaN 90 22.0 California Chiles Valley Napa Virginie Boone @vboone Houdini 2011 Zinfandel (Chiles Valley) Zinfandel Houdini
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation

50511 rows × 13 columns

In [21]:
print(check_q9(reviews.loc[reviews.region_2.notnull()]))
print('-------')
print(answer_q9())
True
-------
reviews.loc[reviews.region_2.notnull()]
None

The remaining exercises are visual.

Exercise 10: Select the points column.

In [22]:
reviews.points
Out[22]:
0         87
1         87
          ..
129969    90
129970    90
Name: points, Length: 129971, dtype: int64
In [23]:
print(check_q10(reviews.points))
AxesSubplot(0.125,0.125;0.775x0.755)
In [24]:
print(answer_q10())
reviews.points
None

Exercise 11: Select the points column for the first 1000 wines.

In [25]:
reviews.loc[:1000, 'points']
Out[25]:
0       87
1       87
        ..
999     88
1000    88
Name: points, Length: 1001, dtype: int64
In [26]:
print(check_q11(reviews.loc[:1000, 'points']))
AxesSubplot(0.125,0.125;0.775x0.755)
In [27]:
print(answer_q11())
reviews.loc[:1000, 'points']
None

Exercise 12: Select the points column for the last 1000 wines.

In [28]:
reviews.iloc[-1000:, 3]
Out[28]:
128971    91
128972    91
          ..
129969    90
129970    90
Name: points, Length: 1000, dtype: int64
In [29]:
print(check_q12(reviews.iloc[-1000:, 3]))
AxesSubplot(0.125,0.125;0.775x0.755)
In [30]:
print(answer_q12())
reviews.iloc[-1000:, 3]
None

Exercise 13: Select the points column, but only for wines made in Italy.

In [31]:
reviews[reviews.country == "Italy"].points
Out[31]:
0         87
6         87
          ..
129961    90
129962    90
Name: points, Length: 19540, dtype: int64
In [32]:
check_q13(reviews[reviews.country == "Italy"].points)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff76d523400>
In [33]:
print(answer_q13())
reviews[reviews.country == "Italy"].points
None

Exercise 14: Who produces more above-averagely good wines, France or Italy? Select the country column, but only when said country is one of those two options, and the points column is greater than or equal to 90.

Your output should look roughly like this:

119       France
120        Italy
           ...  
129969    France
129970    France
Name: country, Length: 15840, dtype: object
In [34]:
reviews[reviews.country.isin(["Italy", "France"]) & (reviews.points >= 90)].country
Out[34]:
119       France
120        Italy
           ...  
129969    France
129970    France
Name: country, Length: 15840, dtype: object
In [35]:
check_q14(reviews[reviews.country.isin(["Italy", "France"]) & (reviews.points >= 90)].country)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff769e27668>
In [36]:
print(answer_q14())
reviews[reviews.country.isin(["Italy", "France"]) & (reviews.points >= 90)].country
None

Keep going

Move on to the Summary functions and maps workbook.