Recap

So far, you have loaded your data and reviewed it with the following code. Run this cell to set up your coding environment where the previous step left off.

In [1]:
# Code you have previously used to load data
import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex3 import *

print("Setup Complete")
Setup Complete

Exercises

Step 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called y. You'll need to print a list of the columns to find the name of the column you need.

In [2]:
# print the list of columns in the dataset to find the name of the prediction target
home_data.columns
Out[2]:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
In [3]:
y = home_data.SalePrice
step_1.check()

Correct

In [4]:
# The lines below will show you a hint or the solution.
step_1.hint() 
step_1.solution()

Hint: Use print(home_data.columns). The column you want is at the end of the list. Use the dot notation to pull out this column from the DataFrame

Solution:

y = home_data.SalePrice

Step 2: Create X

Now you will create a DataFrame called X holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in X.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

* LotArea
* YearBuilt
* 1stFlrSF
* 2ndFlrSF
* FullBath
* BedroomAbvGr
* TotRmsAbvGrd

After you've created that list of features, use it to create the DataFrame that you'll use to fit the model.

In [5]:
# Create the list of features below
feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# select data corresponding to features in feature_names
X = home_data[feature_names]

step_2.check()

Correct

In [6]:
step_2.hint()
step_2.solution()

Hint: Capitalization and spelling are important when specifying variable names. Use the brackets notation when specifying data for X.

Solution:

feature_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                      "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]

X=home_data[feature_names]

Review Data

Before building a model, take a quick look at X to verify it looks sensible

In [7]:
# Review data
# print description or statistics from X
# print(X)
X.describe()

# print the top few lines
# print(X.head())
# X.head()
Out[7]:
LotArea YearBuilt 1stFlrSF 2ndFlrSF FullBath BedroomAbvGr TotRmsAbvGrd
count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 10516.828082 1971.267808 1162.626712 346.992466 1.565068 2.866438 6.517808
std 9981.264932 30.202904 386.587738 436.528436 0.550916 0.815778 1.625393
min 1300.000000 1872.000000 334.000000 0.000000 0.000000 0.000000 2.000000
25% 7553.500000 1954.000000 882.000000 0.000000 1.000000 2.000000 5.000000
50% 9478.500000 1973.000000 1087.000000 0.000000 2.000000 3.000000 6.000000
75% 11601.500000 2000.000000 1391.250000 728.000000 2.000000 3.000000 7.000000
max 215245.000000 2010.000000 4692.000000 2065.000000 3.000000 8.000000 14.000000

Step 3: Specify and Fit Model

Create a DecisionTreeRegressor and save it iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

In [8]:
from sklearn.tree import DecisionTreeRegressor
#specify the model. 
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit the model
iowa_model.fit(X, y)

step_3.check()

Correct

In [9]:
step_3.hint()
step_3.solution()

Hint: Include random_state when specifying model. Data is specified when fitting it.

Solution:

from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(X, y)

Step 4: Make Predictions

Make predictions with the model's predict command using X as the data. Save the results to a variable called predictions.

In [10]:
print(X.head())
predictions = iowa_model.predict(X)
print(predictions)
step_4.check()
   LotArea  YearBuilt      ...       BedroomAbvGr  TotRmsAbvGrd
0     8450       2003      ...                  3             8
1     9600       1976      ...                  3             6
2    11250       2001      ...                  3             6
3     9550       1915      ...                  3             7
4    14260       2000      ...                  4             9

[5 rows x 7 columns]
[208500. 181500. 223500. ... 266500. 142125. 147500.]

Correct

In [11]:
step_4.hint()
step_4.solution()

Hint: Use iowa_model.predict with an argument holding the data to predict with.

Solution:

iowa_model.predict(X)

Think About Your Results

Use the head method to compare the top few predictions to the actual home values (in y) for those same homes. Anything surprising?

You'll understand why this happened if you keep going.

Keep Going

You've built a decision tree model. It's natural to ask how accurate the model's predictions will be and how you can improve that. Learn how to do that with Model Validation.


Course Home Page