Robert McCulloch
1/23/2022
In this homework we will explore some of the basic tools in applied statistics. However, we will take a Machine Learning viewpoint in that our approach to model selection will use a train/test split.
We will:
- read in the used cars data and randomly split it into train and test subsets
- regress price on mileage, and then add the categorical feature color using dummy variables (one hot encoding)
- try logging price and adding a quadratic term in mileage
- compare the models using their out of sample predictive performance
First our imports:
# numpy and pandas
import numpy as np
import pandas as pd
import math
#graphics with matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline
# model, train/test split, dummies (one-hot-encoding), rmse metric from scikit learn.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import mean_squared_error
Let’s read in the used cars data. We will just use the features mileage and color.
cd = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
cd = cd[['price','mileage','color']]
cd['price'] = cd['price']/1000
cd['mileage'] = cd['mileage']/1000
cd.head()
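Since we will use color as a feature later on, it can also be handy to see how many cars there are of each color. A small sketch (my addition, not part of the original analysis):
counts = cd['color'].value_counts()   # number of cars of each color
print(counts)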
In our brief introduction to statistics in R and python we learned how to run a multiple regression.
One of the issues that often comes up in modeling is which features (or x variables) are important.
Many people use the t statistics and associated p-values to guide them in their choice.
In Machine Learning, we keep our focus on prediction since that is often the goal in practice.
We want to choose the modeling approach that predicts the best.
But how do we assess this?
As we will see, the basic way is to simply randomly split the data into two subsets.
Estimate the model using one subset called the training data.
Then see how well you predict on the other subset, the test data.
Because you actually have the y values in the test data you can see how well your predictions worked.
We will randomly split the data into train and test subsets.
We will put 75% of the data in train and the remaining 25% in test.
Train/test splits are a very common thing in Machine Learning so there is a function to do this in sklearn.
cdtr, cdte = train_test_split(cd,random_state=99, test_size=.25)
print('train dimension',cdtr.shape)
print('test dimension',cdte.shape)
Let's plot x=mileage vs. y=price in the training data just to make sure it looks correct.
plt.scatter(cdtr['mileage'],cdtr['price'])
plt.xlabel('mileage'); plt.ylabel('price')
Looks ok!
Now we fit our model using the training data. We will start by only using the feature mileage.
lm1 = LinearRegression(fit_intercept=True)
lm1.fit(cdtr[['mileage']],cdtr['price'])
Now we predict on the test data.
yhtest = lm1.predict(cdte[['mileage']])
How did we do?
We plot our predictions versus the actual prices in the test data.
## plot y vs predicted on test
yy = np.concatenate([yhtest,cdte['price'].to_numpy()])  # combined range for the reference line
plt.plot(yy,yy,c='red',linewidth=.8)  # 45 degree line: prediction equals actual
plt.scatter(yhtest,cdte['price'],c='blue',s=2.5)
plt.xlabel('yhtest'); plt.ylabel('price')
Not so great.
In particular, you will lose a lot of credibility if you predict a negative price.
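A quick way to count how often that happens on the test set (a small sketch of my own, using the yhtest predictions computed above):
n_neg = (yhtest < 0).sum()   # number of cars with a negative predicted price
print(f'number of negative price predictions on test: {n_neg}')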
We can use a variety of numerical measures to summarize our predictive performance.
Let's use rmse from sklearn.
rmse = math.sqrt(mean_squared_error(yhtest,cdte['price']))
print(f'rmse using just mileage is {rmse:0.2f}')
Let's check the rmse.
def rmse(y, yh):
    return math.sqrt(np.mean((y - yh)**2))
print(rmse(cdte['price'],yhtest))
How about the out-of-sample R-squared?
np.corrcoef(cdte['price'],yhtest)**2
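np.corrcoef returns the full 2x2 correlation matrix, so the single R-squared number is the squared off-diagonal entry. A small sketch pulling it out (my addition):
r2_out = np.corrcoef(cdte['price'], yhtest)[0, 1]**2
print(f'out-of-sample R-squared: {r2_out:0.3f}')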
How about the categorical variable color?
Can that feature help us predict the price of a car?
How do we put color into a regression model?
We can’t just put color in as it is: a coefficient times “red” does not mean anything!
We will use dummy variables to put the categorical variable color into a linear regression model.
This is often called one hot encoding in machine learning.
Both pandas and sklearn have utilities to create the dummies. Let's use sklearn.
one_hot = LabelBinarizer()
cdumtr = one_hot.fit_transform(cdtr['color'])  # learn the encoding from the training data
cdumte = one_hot.transform(cdte['color'])      # apply the same encoding to the test data
print(cdtr['color'][:10])
cdumtr[:10]
For each possible color, we create a dummy variable which is 1 if the car is that color and 0 otherwise.
So, for example, the first column of cdumtr indicates which cars in the training data are black.
Since we have 4 possible colors, we get four columns of dummy variables.
The second dummy is for silver cars, the third dummy is for white cars, and the fourth dummy is for other colors.
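For comparison, pandas can build the same kind of dummies with pd.get_dummies. A minimal sketch (my addition; we stick with the sklearn version below):
pd.get_dummies(cdtr['color'])[:10]   # one 0/1 column per color; column names are the color labels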
The trick is to then put any three of the dummy columns into a multiple regression.
Let's drop the first dummy (the one for color black).
We will create train and test arrays whose first column is mileage and next three columns are the color dummies.
Xctr = np.hstack([cdtr.iloc[:,[1]].to_numpy(),cdumtr[:,1:4]])
Xcte = np.hstack([cdte.iloc[:,[1]].to_numpy(),cdumte[:,1:4]])
Xctr[:5]
We can see that the first car has mileage 47.9 and is black, since all three of its color dummies are zero.
Thus we can see that just using three of the four dummies is sufficient to identify the color of the car.
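The reason we drop one of the dummies is that for every car the four dummies sum to one, so the black dummy is completely determined by the other three: $$ D_{black} = 1 - D_{silver} - D_{white} - D_{other} $$ Including all four dummies together with an intercept would make the columns of the regression perfectly collinear.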
Now we can run the regression and get predictions.
lm2 = LinearRegression(fit_intercept=True)
lm2.fit(Xctr,cdtr['price'])
yhtest = lm2.predict(Xcte)
Our model is actually: $$ price = \beta_0 + \beta_1 \, mileage + \beta_2 \, D_{silver} + \beta_3 \, D_{white} + \beta_4 \, D_{other} + \epsilon $$
print('intercept:',lm2.intercept_)
print('slopes:',lm2.coef_)
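To keep track of which slope goes with which feature, we can use the color order stored in the fitted LabelBinarizer (its classes_ attribute); the first slope is for mileage and the remaining three are for the dummies we kept. A small sketch (my addition):
names = ['mileage'] + list(one_hot.classes_[1:])   # the first color (black) was dropped
for name, b in zip(names, lm2.coef_):
    print(f'{name}: {b:0.2f}')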
Let's plot the predictions versus mileage.
plt.scatter(cdte['mileage'],yhtest,s=2.0)
plt.xlabel('mileage'); plt.ylabel('predictions using mileage and color')
There are actually four parallel lines, one for each of the four colors.
The dummies just adjust the intercepts.
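For example, using the model written out above, the fitted line for a silver car is $$ \widehat{price} = (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 \, mileage $$ while for a black car (all dummies zero) it is $$ \widehat{price} = \hat{\beta}_0 + \hat{\beta}_1 \, mileage $$ so all four lines share the same slope on mileage and differ only in their intercepts.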
It does not look like color does a whole lot!
To see if color does anything, let's check the out-of-sample rmse.
rmse = math.sqrt(mean_squared_error(yhtest,cdte['price']))
print(f'rmse using mileage and color is {rmse:0.2f}')
The same as when we just used mileage.
The possibility of a negative prediction for price was particularly disturbing.
Let’s try logging y.
lprice = np.log(cdtr['price'])
plt.scatter(cdtr['mileage'],lprice)
Does this look better ?
To predict price with this model we predict log price and then exponentiate.
lm3 = LinearRegression(fit_intercept=True)
lm3.fit(cdtr[['mileage']],lprice)
lyhat = lm3.predict(cdte[['mileage']])
yhat = np.exp(lyhat)
plt.scatter(cdte['price'],yhat)
plt.plot(yhat,yhat,c='red')
What do you think ?
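One way to judge is the same out-of-sample rmse we used before, now applied to the exponentiated predictions. A small sketch along those lines (my addition):
rmse_log = math.sqrt(mean_squared_error(cdte['price'], yhat))
print(f'rmse using the log model is {rmse_log:0.2f}')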
Another thing we can try is adding the quadratic term mileage squared to the regression.
Xq = np.column_stack([cdtr['mileage'],cdtr['mileage']**2])
lm4 = LinearRegression(fit_intercept=True)
lm4.fit(Xq,cdtr['price'])
yhat = lm4.predict(Xq)
plt.scatter(cdtr['mileage'],cdtr['price'],marker='o',c='black',s=.9)
plt.scatter(cdtr['mileage'],yhat,marker=5,c='red',s=.9)
plt.xlabel('mileage'); plt.ylabel('price')
We are fitting the model:
$$ price = \beta_0 + \beta_1 \, mileage + \beta_2 \, mileage^2 + \epsilon $$
What do you think ?
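Note that lm4 was fit and its predictions were plotted on the training data only. To judge this model out of sample you would build the same quadratic features from the test set; a minimal sketch (Xqte and yhq are names I am introducing here):
# quadratic features built from the test data, matching the columns of Xq
Xqte = np.column_stack([cdte['mileage'],cdte['mileage']**2])
yhq = lm4.predict(Xqte)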
Homework:
Use out of sample performance (a train/test split) to decide which of these two models is best:
- the log model: regress log(price) on mileage and then exponentiate the predictions
- the quadratic model: regress price on mileage and mileage squared
Use out of sample rmse and graphics to compare the two models.