Introduction

You have learned how to select relevant data from DataFrame and Series objects. Plucking the right data out of our data representation is critical to getting work done.

However, the data does not always come in the format we want. Sometimes we have to do some more work ourselves to reformat it for our desired task.

The remainder of this tutorial will cover different operations we can apply to our data to get the input "just right". We'll start off in this section by looking at the most commonly looked built-in reshaping operations. Along the way we'll cover data dtypes, a concept essential to working with pandas effectively.

Relevant Resources

Set Up

Run the code cell below to load your data and the necessary utility functions.

In [1]:
import pandas as pd
pd.set_option('max_rows', 5)
import numpy as np
from learntools.advanced_pandas.summary_functions_maps import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

Look at an overview of your data by running the line below:

Checking Answers

Check your answers in each exercise using the check_qN function (replacing N with the number of the exercise). For example here's how you would check an incorrect answer to exercise 1:

In [2]:
check_q1(pd.DataFrame())
Out[2]:
False

If you get stuck, use the answer_qN function to see the code with the correct answer.

For the first set of questions, running the check_qN on the correct answer returns True.

For the second set of questions, using this function to check a correct answer will present an informative graph!

Exercises

Look at your data by running the cell below:

In [3]:
reviews.head()
Out[3]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Exercise 1: What is the median of the points column?

In [4]:
reviews.points.median()
Out[4]:
88.0
In [5]:
print(check_q1(reviews.points.median()))
print('-------')
print(answer_q1())
True
-------
reviews.points.median()
None

Exercise 2: What countries are represented in the dataset?

In [6]:
reviews.country.unique()
Out[6]:
array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)
In [7]:
print(check_q2(reviews.country.unique()))
print('-------')
print(answer_q2())
True
-------
reviews.country.unique()
None

Exercise 3: What countries appear in the dataset most often?

In [8]:
reviews.country.value_counts()
Out[8]:
US          54504
France      22093
            ...  
Slovakia        1
Egypt           1
Name: country, Length: 43, dtype: int64
In [9]:
print(check_q3(reviews.country.value_counts()))
print('-------')
print(answer_q3())
True
-------
reviews.country.value_counts()
None

Exercise 4: Remap the price column by subtracting the median price. Use the Series.map method.

In [10]:
median_price = reviews.price.median()
reviews.price.map(lambda v: v - median_price)
Out[10]:
0          NaN
1        -10.0
          ... 
129969     7.0
129970    -4.0
Name: price, Length: 129971, dtype: float64
In [11]:
print(check_q4(reviews.price.map(lambda v: v - median_price)))
print('-------')
print(answer_q4())
True
-------
median_price = reviews.price.median()
reviews.price.map(lambda v: v - median_price)
None

Exercise 6: I"m an economical wine buyer. Which wine in is the "best bargain", e.g., which wine has the highest points-to-price ratio in the dataset?

Hint: use a map and the argmax function.

In [12]:
reviews.loc[(reviews.points / reviews.price).idxmax()].title
Out[12]:
'Bandit NV Merlot (California)'
In [13]:
print(check_q6(reviews.loc[(reviews.points / reviews.price).idxmax()].title))
print('-------')
print(answer_q6())
True
-------
reviews.loc[(reviews.points / reviews.price).idxmax()].title
None

Now it's time for some visual exercises. In the questions that follow, generate the data that we will need to have in order to produce the plots that follow. These exercises will use skills from this workbook as well as from previous ones. They look a lot like questions you will actually be asking when working with your own data!

Exercise 7: Is a wine more likely to be "tropical" or "fruity"? Create a Series counting how many times each of these two words appears in the description column in the dataset.

Hint: use a map to check each description for the string tropical, then count up the number of times this is True. Repeat this for fruity. Create a Series combining the two values at the end.

In [14]:
tropical_wine = reviews.description.map(lambda r: "tropical" in r).value_counts()
fruity_wine = reviews.description.map(lambda r: "fruity" in r).value_counts()
pd.Series([tropical_wine[True], fruity_wine[True]], index=['tropical', 'fruity'])
Out[14]:
tropical    3607
fruity      9090
dtype: int64
In [15]:
print(check_q7(pd.Series([tropical_wine[True], fruity_wine[True]], index=['tropical', 'fruity'])))
AxesSubplot(0.125,0.125;0.775x0.755)
In [16]:
print(answer_q7())
tropical_wine = reviews.description.map(lambda r: "tropical" in r).value_counts()
fruity_wine = reviews.description.map(lambda r: "fruity" in r).value_counts()
pd.Series([tropical_wine[True], fruity_wine[True]], index=['tropical', 'fruity'])

None

Exercise 8: What combination of countries and varieties are most common?

Create a Series whose index consists of strings of the form "<Country> - <Wine Variety>". For example, a pinot noir produced in the US should map to "US - Pinot Noir". The values should be counts of how many times the given wine appears in the dataset. Drop any reviews with incomplete country or variety data.

Hint: you can do this in three steps. First, generate a DataFrame whose country and variety columns are non-null. Then use a map to create a series whose entries are a str concatenation of those two columns. Finally, generate a Series counting how many times each label appears in the dataset.

In [17]:
ans = reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
ans = ans.apply(lambda srs: srs.country + " - " + srs.variety, axis='columns')
ans.value_counts()
Out[17]:
US - Pinot Noir            9885
US - Cabernet Sauvignon    7315
                           ... 
Portugal - Alvarelhão         1
Bulgaria - Gamza              1
Length: 1612, dtype: int64
In [18]:
print(check_q8(ans.value_counts()))
AxesSubplot(0.125,0.125;0.775x0.755)
In [19]:
print(answer_q8())
ans = reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
ans = ans.apply(lambda srs: srs.country + " - " + srs.variety, axis='columns')
ans.value_counts()
None