Data types and missing data workbook¶

Introduction¶

This is the workbook component of the "Data types and missing data" section of the tutorial.

Relevant Resources¶

Set Up¶

Run the following cell to load your data and some utility functions

import pandas as pd
import seaborn as sns
from learntools.advanced_pandas.data_types_missing_data import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('max_rows', 5)

Checking Answers¶

Check your answers in each exercise using the check_qN function (replacing N with the number of the exercise). For example here's how you would check an incorrect answer to exercise 1:

check_q1(pd.DataFrame())

False

If you get stuck, use the answer_qN function to see the code with the correct answer.

For the first set of questions, running the check_qN on the correct answer returns True.

For the second set of questions, using this function to check a correct answer will present an informative graph!

reviews.head()

Exercises¶

Exercise 1: What is the data type of the points column in the dataset?

reviews.points.dtype

dtype('int64')

print(check_q1(reviews.points.dtype))
print('-------')
print(answer_q1())

True
-------
reviews.points.dtype
None

Exercise 2: Create a Series from entries in the price column, but convert the entries to strings. Hint: strings are str in native Python.

reviews.price.astype('str')

0          nan
1         15.0
          ... 
129969    32.0
129970    21.0
Name: price, Length: 129971, dtype: object

print(check_q2(reviews.price.astype('str')))
print('-------')
print(answer_q2())

True
-------
reviews.price.astype(str)
None

Here are a few visual exercises on missing data.

Exercise 3: Some wines do not list a price. How often does this occur? Generate a Seriesthat, for each review in the dataset, states whether the wine reviewed has a null price.

reviews.price.isnull()

0          True
1         False
          ...  
129969    False
129970    False
Name: price, Length: 129971, dtype: bool

print(check_q3(reviews.price.isnull()))
print('-------')
print(answer_q3())

AxesSubplot(0.125,0.125;0.775x0.755)
-------
reviews.price.isnull()
None

Exercise 4: What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64

reviews.region_1.fillna("Unknown").value_counts()

Unknown                  21247
Napa Valley               4480
                         ...  
Australia-New Zealand        1
Texoma                       1
Name: region_1, Length: 1230, dtype: int64

print(check_q4(reviews.region_1.fillna("Unknown").value_counts()))
print('-------')
print(answer_q4())

AxesSubplot(0.125,0.125;0.775x0.755)
-------
reviews.region_1.fillna("Unknown").value_counts()
None

Exercise 5: A neat property of boolean data types, like the ones created by the isnull() method, is that False gets treated as 0 and True as 1 when performing math on the values. Thus, the sum() of a list of boolean values will return how many times True appears in that list. Create a pandas Series showing how many times each of the columns in the dataset contains null values. Your result should look something like this:

country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64

Hint: write a map that will extract the vintage of each wine in the dataset. The vintages reviewed range from 2000 to 2017, no earlier or later. Use fillna to impute the missing values.

reviews.isnull().sum()

country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64

print(check_q5(reviews.isnull().sum()))
print('-------')
print(answer_q5())

AxesSubplot(0.125,0.125;0.775x0.755)
-------
reviews.isnull().sum()
None

Keep going¶

Move on to the Renaming and combining workbook

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks