Data types and missing data workbook

Introduction

This is the workbook component of the "Data types and missing data" section of the tutorial.

Relevant Resources

Set Up

Run the following cell to load your data and some utility functions

In [19]:
import pandas as pd
import seaborn as sns
from learntools.advanced_pandas.data_types_missing_data import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('max_rows', 5)

Checking Answers

Check your answers in each exercise using the check_qN function (replacing N with the number of the exercise). For example here's how you would check an incorrect answer to exercise 1:

In [20]:
check_q1(pd.DataFrame())
Out[20]:
False

If you get stuck, use the answer_qN function to see the code with the correct answer.

For the first set of questions, running the check_qN on the correct answer returns True.

For the second set of questions, using this function to check a correct answer will present an informative graph!

In [21]:
reviews.head()
Out[21]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Exercises

Exercise 1: What is the data type of the points column in the dataset?

In [22]:
reviews.points.dtype
Out[22]:
dtype('int64')
In [23]:
print(check_q1(reviews.points.dtype))
print('-------')
print(answer_q1())
True
-------
reviews.points.dtype
None

Exercise 2: Create a Series from entries in the price column, but convert the entries to strings. Hint: strings are str in native Python.

In [24]:
reviews.price.astype('str')
Out[24]:
0          nan
1         15.0
          ... 
129969    32.0
129970    21.0
Name: price, Length: 129971, dtype: object
In [25]:
print(check_q2(reviews.price.astype('str')))
print('-------')
print(answer_q2())
True
-------
reviews.price.astype(str)
None

Here are a few visual exercises on missing data.

Exercise 3: Some wines do not list a price. How often does this occur? Generate a Seriesthat, for each review in the dataset, states whether the wine reviewed has a null price.

In [26]:
reviews.price.isnull()
Out[26]:
0          True
1         False
          ...  
129969    False
129970    False
Name: price, Length: 129971, dtype: bool
In [27]:
print(check_q3(reviews.price.isnull()))
print('-------')
print(answer_q3())
AxesSubplot(0.125,0.125;0.775x0.755)
-------
reviews.price.isnull()
None

Exercise 4: What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
In [28]:
reviews.region_1.fillna("Unknown").value_counts()
Out[28]:
Unknown                  21247
Napa Valley               4480
                         ...  
Australia-New Zealand        1
Texoma                       1
Name: region_1, Length: 1230, dtype: int64
In [29]:
print(check_q4(reviews.region_1.fillna("Unknown").value_counts()))
print('-------')
print(answer_q4())
AxesSubplot(0.125,0.125;0.775x0.755)
-------
reviews.region_1.fillna("Unknown").value_counts()
None

Exercise 5: A neat property of boolean data types, like the ones created by the isnull() method, is that False gets treated as 0 and True as 1 when performing math on the values. Thus, the sum() of a list of boolean values will return how many times True appears in that list. Create a pandas Series showing how many times each of the columns in the dataset contains null values. Your result should look something like this:

country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64

Hint: write a map that will extract the vintage of each wine in the dataset. The vintages reviewed range from 2000 to 2017, no earlier or later. Use fillna to impute the missing values.

In [30]:
reviews.isnull().sum()
Out[30]:
country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64
In [31]:
print(check_q5(reviews.isnull().sum()))
print('-------')
print(answer_q5())
AxesSubplot(0.125,0.125;0.775x0.755)
-------
reviews.isnull().sum()
None

Keep going

Move on to the Renaming and combining workbook