You have learned how to select relevant data from DataFrame
and Series
objects. Plucking the right data out of our data representation is critical to getting work done.
However, the data does not always come in the format we want. Sometimes we have to do some more work ourselves to reformat it for our desired task.
The remainder of this tutorial will cover different operations we can apply to our data to get the input "just right". We'll start off in this section by looking at the most commonly looked built-in reshaping operations. Along the way we'll cover data dtypes
, a concept essential to working with pandas
effectively.
Run the code cell below to load your data and the necessary utility functions.
import pandas as pd
pd.set_option('max_rows', 5)
import numpy as np
from learntools.advanced_pandas.summary_functions_maps import *
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
Look at an overview of your data by running the line below:
Check your answers in each exercise using the check_qN
function (replacing N
with the number of the exercise). For example here's how you would check an incorrect answer to exercise 1:
check_q1(pd.DataFrame())
If you get stuck, use the answer_qN
function to see the code with the correct answer.
For the first set of questions, running the check_qN
on the correct answer returns True
.
For the second set of questions, using this function to check a correct answer will present an informative graph!
Look at your data by running the cell below:
reviews.head()
Exercise 1: What is the median of the points
column?
reviews.points.median()
print(check_q1(reviews.points.median()))
print('-------')
print(answer_q1())
Exercise 2: What countries are represented in the dataset?
reviews.country.unique()
print(check_q2(reviews.country.unique()))
print('-------')
print(answer_q2())
Exercise 3: What countries appear in the dataset most often?
reviews.country.value_counts()
print(check_q3(reviews.country.value_counts()))
print('-------')
print(answer_q3())
Exercise 4: Remap the price
column by subtracting the median price. Use the Series.map
method.
median_price = reviews.price.median()
reviews.price.map(lambda v: v - median_price)
print(check_q4(reviews.price.map(lambda v: v - median_price)))
print('-------')
print(answer_q4())
Exercise 6: I"m an economical wine buyer. Which wine in is the "best bargain", e.g., which wine has the highest points-to-price ratio in the dataset?
Hint: use a map and the argmax
function.
reviews.loc[(reviews.points / reviews.price).idxmax()].title
print(check_q6(reviews.loc[(reviews.points / reviews.price).idxmax()].title))
print('-------')
print(answer_q6())
Now it's time for some visual exercises. In the questions that follow, generate the data that we will need to have in order to produce the plots that follow. These exercises will use skills from this workbook as well as from previous ones. They look a lot like questions you will actually be asking when working with your own data!
Exercise 7: Is a wine more likely to be "tropical" or "fruity"? Create a Series
counting how many times each of these two words appears in the description
column in the dataset.
Hint: use a map to check each description for the string tropical
, then count up the number of times this is True
. Repeat this for fruity
. Create a Series
combining the two values at the end.
tropical_wine = reviews.description.map(lambda r: "tropical" in r).value_counts()
fruity_wine = reviews.description.map(lambda r: "fruity" in r).value_counts()
pd.Series([tropical_wine[True], fruity_wine[True]], index=['tropical', 'fruity'])
print(check_q7(pd.Series([tropical_wine[True], fruity_wine[True]], index=['tropical', 'fruity'])))
print(answer_q7())
Exercise 8: What combination of countries and varieties are most common?
Create a Series
whose index consists of strings of the form "<Country> - <Wine Variety>"
. For example, a pinot noir produced in the US should map to "US - Pinot Noir"
. The values should be counts of how many times the given wine appears in the dataset. Drop any reviews with incomplete country
or variety
data.
Hint: you can do this in three steps. First, generate a DataFrame
whose country
and variety
columns are non-null. Then use a map to create a series whose entries are a str
concatenation of those two columns. Finally, generate a Series
counting how many times each label appears in the dataset.
ans = reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
ans = ans.apply(lambda srs: srs.country + " - " + srs.variety, axis='columns')
ans.value_counts()
print(check_q8(ans.value_counts()))
print(answer_q8())