Data types and missing data workbook

Introduction

This is the workbook component of the "Data types and missing data" section of the tutorial.

Relevant Resources

Set Up

Run the following cell to load your data and some utility functions.

In [1]:
import pandas as pd
import seaborn as sns
from learntools.advanced_pandas.data_types_missing_data import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
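# Use the full option name; the bare 'max_rows' pattern can be ambiguous in newer pandas versions.
pd.set_option('display.max_rows', 5)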

Checking Answers

Check your answers in each exercise using the check_qN function (replacing N with the number of the exercise). For example, here's how you would check an incorrect answer to exercise 1:

In [2]:
check_q1(pd.DataFrame())
Out[2]:
False

If you get stuck, use the answer_qN function to see the code with the correct answer.

For the first set of questions, running check_qN on the correct answer returns True.

For the second set of questions, using this function to check a correct answer will present an informative graph!

Exercises

Exercise 1: What is the data type of the points column in the dataset?

In [3]:
# Your code here

Exercise 2: Create a Series from entries in the price column, but convert the entries to strings. Hint: strings are str in native Python.
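
As a warm-up for Exercises 1 and 2, here is a minimal sketch of inspecting and converting a data type. It uses a small made-up Series (the name scores and its values are assumptions for illustration, not part of the wine dataset), so it shows the pattern without giving away the answers.

```python
import pandas as pd

# Toy data, not the wine dataset: a small Series of integers.
scores = pd.Series([87, 90, 92], name="score")

# .dtype reports the type pandas uses to store the values.
print(scores.dtype)

# astype(str) returns a new Series whose entries are Python strings;
# the original Series is left unchanged.
scores_as_text = scores.astype(str)
print(scores_as_text.tolist())
```

The same two calls work on a column selected from a DataFrame, since a single column is itself a Series.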

In [4]:
# Your code here

Here are a few visual exercises on missing data.

Exercise 3: Some wines do not list a price. How often does this occur? Generate a Series that, for each review in the dataset, states whether the wine reviewed has a null price.
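
To see what a "null mask" looks like before trying it on the real data, here is a short sketch on a made-up Series (costs and its values are hypothetical). It assumes missing entries are stored as NaN, which is how pandas usually represents missing numbers.

```python
import numpy as np
import pandas as pd

# Toy data, not the wine dataset: one entry is missing.
costs = pd.Series([15.0, np.nan, 30.0], name="cost")

# isnull() returns a boolean Series of the same length:
# True where the entry is missing, False otherwise.
missing = costs.isnull()
print(missing.tolist())
```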

In [5]:
# Your code here

Exercise 4: What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64

In [6]:
# Your code here
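
If Exercise 4 feels like a lot of steps at once, the fill-then-count pattern can be sketched on toy data first. The Series places below is invented for illustration. The sketch relies on value_counts() sorting its result in descending order by default, so no separate sort call is shown.

```python
import numpy as np
import pandas as pd

# Toy data, not the wine dataset: three of the six entries are missing.
places = pd.Series(
    ["Douro", np.nan, "Douro", "Etna", np.nan, np.nan],
    name="place",
)

# Replace missing entries with a label first, then count each distinct value.
# value_counts() would otherwise skip missing entries entirely.
counts = places.fillna("Unknown").value_counts()
print(counts)
```

Filling before counting matters: without the fillna step, the missing entries would not show up in the result at all.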

Exercise 5: A neat property of boolean data types, like the ones created by the isnull() method, is that False gets treated as 0 and True as 1 when performing math on the values. Thus, the sum() of a list of boolean values will return how many times True appears in that list. Create a pandas Series showing how many times each of the columns in the dataset contains null values. Your result should look something like this:

country        63
description     0
               ..
variety         1
winery          0
Length: 13, dtype: int64

Hint: call isnull() on the whole DataFrame to get a table of boolean values, then call sum() on the result to count the True values in each column.
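
The boolean arithmetic described in Exercise 5 can be checked on a tiny example. The DataFrame toy and its column names are made up for this sketch. It assumes the default behaviour of DataFrame.sum(), which adds up each column separately.

```python
import numpy as np
import pandas as pd

# Plain Python already treats True as 1 and False as 0.
print(sum([True, False, True]))

# Toy data, not the wine dataset.
toy = pd.DataFrame({
    "name": ["a", "b", None],
    "cost": [np.nan, 12.0, np.nan],
    "score": [88, 91, 85],
})

# isnull() gives a DataFrame of booleans; sum() then counts the
# True values column by column, returning one number per column.
null_counts = toy.isnull().sum()
print(null_counts)
```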

In [7]:
# Your code here

Keep going

Move on to the Renaming and combining workbook.