This is the workbook component of the "Data types and missing data" section of the tutorial.
Run the following cell to load your data and some utility functions:
import pandas as pd
import seaborn as sns
from learntools.advanced_pandas.data_types_missing_data import *
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)  # full option name; the bare 'max_rows' is ambiguous in recent pandas versions
Check your answers in each exercise using the check_qN function (replacing N with the number of the exercise). For example, here's how you would check an incorrect answer to exercise 1:
check_q1(pd.DataFrame())
If you get stuck, use the answer_qN function to see the code with the correct answer.
For the first set of questions, running check_qN on a correct answer returns True.
For the second set of questions, using this function to check a correct answer will present an informative graph!
reviews.head()
Exercise 1: What is the data type of the points column in the dataset?
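To illustrate what the dtype attribute reports, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are hypothetical and are not taken from the wine reviews data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({"points": [87, 90, 92], "price": [15.0, 22.5, 40.0]})

# .dtype on a single column (a Series) reports that column's data type.
points_dtype = toy.points.dtype

# .dtypes on the DataFrame reports the data type of every column at once.
all_dtypes = toy.dtypes

print(points_dtype)
print(all_dtypes)
```

Whole numbers are stored with an integer dtype and decimal numbers with a floating-point dtype.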
reviews.points.dtype
print(check_q1(reviews.points.dtype))
print('-------')
print(answer_q1())
Exercise 2: Create a Series from entries in the price column, but convert the entries to strings. Hint: strings are str in native Python.
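As a rough sketch of how type conversion works, the snippet below calls astype on a small made-up Series. The name `toy_prices` and its values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy prices, invented for illustration only.
toy_prices = pd.Series([15.0, 22.5, 40.0])

# astype returns a new Series; the original is left unchanged.
as_strings = toy_prices.astype(str)

print(toy_prices.dtype)
print(as_strings.tolist())
```

Because astype returns a copy, you need to assign the result if you want to keep the converted values.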
reviews.price.astype('str')
print(check_q2(reviews.price.astype('str')))
print('-------')
print(answer_q2())
Here are a few visual exercises on missing data.
Exercise 3: Some wines do not list a price. How often does this occur? Generate a Series that, for each review in the dataset, states whether the wine reviewed has a null price.
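A minimal sketch of null detection on a made-up Series may help here. The name `toy_prices` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with one missing entry (NaN).
toy_prices = pd.Series([15.0, np.nan, 40.0])

# isnull returns a boolean Series of the same length:
# True where the value is missing, False otherwise.
has_null_price = toy_prices.isnull()

print(has_null_price.tolist())
```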
reviews.price.isnull()
print(check_q3(reviews.price.isnull()))
print('-------')
print(answer_q3())
Exercise 4: What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:
Unknown 21247
Napa Valley 4480
...
Bardolino Superiore 1
Primitivo del Tarantino 1
Name: region_1, Length: 1230, dtype: int64
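The same pattern can be sketched on a handful of made-up values. The name `toy_regions` and its contents are hypothetical; only the fillna and value_counts calls mirror the exercise.

```python
import pandas as pd

# Hypothetical toy regions; None marks a missing value.
toy_regions = pd.Series(["Napa Valley", None, "Napa Valley", "Etna", None, None])

# Replace missing values first, then count each distinct value.
# value_counts sorts by count in descending order by default.
region_counts = toy_regions.fillna("Unknown").value_counts()

print(region_counts)
```

The fillna step comes first because value_counts skips missing values by default, so they would otherwise not be counted at all.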
reviews.region_1.fillna("Unknown").value_counts()
print(check_q4(reviews.region_1.fillna("Unknown").value_counts()))
print('-------')
print(answer_q4())
Exercise 5: A neat property of boolean data types, like the ones created by the isnull() method, is that False gets treated as 0 and True as 1 when performing math on the values. Thus, the sum() of a list of boolean values will return how many times True appears in that list.
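The property described above can be checked with a short sketch on made-up boolean values. The names `flags_list` and `flags_series` are hypothetical.

```python
import pandas as pd

# Hypothetical boolean values, invented for illustration only.
flags_list = [True, False, True, True]

# Plain Python: True counts as 1 and False as 0.
list_total = sum(flags_list)

# The same holds for a boolean pandas Series.
flags_series = pd.Series(flags_list)
series_total = flags_series.sum()

print(list_total, series_total)
```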
Create a pandas Series showing how many times each of the columns in the dataset contains null values. Your result should look something like this:
country 63
description 0
..
variety 1
winery 0
Length: 13, dtype: int64
Hint: isnull() also works on a whole DataFrame, and calling sum() on the result adds up each column separately.
reviews.isnull().sum()
print(check_q5(reviews.isnull().sum()))
print('-------')
print(answer_q5())
Move on to the Renaming and combining workbook.