Train test/split, categorical variables, and transformations

In this homework we will explore some of the basic tools in applied statistics.
However, we will take a Machine Learning viewpoint in the our approach to model selection wil use a train/test split.

We will:

Get the Data

Let’s read in the used cars data.
We will just use the features mileage and color.

cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
names(cd)
## [1] "price"        "trim"         "isOneOwner"   "mileage"      "year"        
## [6] "color"        "displacement"
cd = cd[,c(1,4,6)]
names(cd)
## [1] "price"   "mileage" "color"
cd$color = as.factor(cd$color)
cd$mileage = cd$mileage/1000
cd$price = cd$price/1000
head(cd)
##    price mileage  color
## 1 43.995  36.858 Silver
## 2 44.995  46.883  Black
## 3 25.999 108.759  White
## 4 33.880  35.187  Black
## 5 34.895  48.153  Black
## 6  5.995 121.748  other

We now have the data.frame we want to work with.

train/test split

In our brief introduction to statistics in R and python we learned how run a multiple regression.

One of the issues that often comes up in modeling is which features (or x variables) are important.

Many people use the t statistics and associated p-values to guide them in their choice.

In Machine Learning, we keep our focus on prediction since that is often the goal in practice.

We want to choose the modeling approach that predicts the best.
But how do we assess this?

As we will see, the basic way is to simply randomly split the data into two subsets.
Estimate the model using one subset called the training data.
Then see how well you predict on the other subset, the test data.
Because you actually have the y values in the test data you can see how well your predictions worked.

We will randomly split the data into train and test subsets.
We will put 75% of the data in train and the remaining 25% in test.

n = nrow(cd)
set.seed(99)
pin = .75 #percent train (or percent in-sample)
ii = sample(1:n,floor(pin*n))
cdtr = cd[ii,]
cdte = cd[-ii,]
cat("dimension of train data:",dim(cdtr),"\n")
## dimension of train data: 750 3
cat("dimension of test data:",dim(cdte),"\n")
## dimension of test data: 250 3

Let’s plot mileage vs. size in the training data just to make sure it looks right.

plot(cdtr$mileage,cdtr$price)

Look ok.

Now we fit our model using the training data.
We will start by only using the feature mileage.

lmtr = lm(price~mileage,cdtr)

Now we predict on the test data:

yhtest = predict(lmtr,cdte)

How did we do?
We plot our predictions versus the actual prices in the test data.

library(ggplot2)
lms = range(c(cd$price,yhtest))
p = ggplot(data.frame(yhtest,price=cdte$price),aes(x=yhtest,y=price)) + geom_point() +
      xlim(lms) + ylim(lms) + geom_abline(intercept=0,slope=1,colour='red')
p

Not so great.
In particular, you will lose a lot of credibility if you predict a negative price.

We can use a variety of numerical measures to summarize our predictive performance.

rmse = function(y,yhat) {return(sqrt(mean((y-yhat)^2)))}
mabe = function(y,yhat) {return(mean(abs(y-yhat)))}

cat("out of sample rmse: ",rmse(cdte$price,yhtest),"\n") #root mean squared error
## out of sample rmse:  10.95046
cat("out of sample mad: ",mabe(cdte$price,yhtest),"\n") # mean absolute error
## out of sample mad:  8.571183
cat("out of sample R-squared: ",cor(cdte$price,yhtest)^2,"\n") # out of sample R-squared
## out of sample R-squared:  0.6355823

If you just look at the numbers, maybe it is not too bad.

Let’s see if we can improve our model by

Is color any good?

How about the categorical variable color?
Can that feature help us predict the price of a car?

How do we put color into a regression model?
We can’t just put color, a coefficient times “red” does not mean anything!

Or can we?

lmc = lm(price~mileage+color,cdtr)

What happened?

plot(cdtr$mileage,lmc$fitted.values,col=cdtr$color,cex=.2)

There are actually four parallel lines, one for each of the four colors.
The dummies just adjust the intercepts.
Does not look like color does a whole lot !!!!

How does this work?
Since color is a factor, R will automatically create a dummy variable for 3 of the 4 levels of color.
For example, the dummy for white is 1 when the car is white and 0 otherwise. This gives us the four different intercepts.

Note that in machine learning, the use of dummy varies (binary 0 or 1 variables)
is often called one hot encoding.

We are actually running a regression with these variables:

X = model.matrix(lmc)
head(X)
##     (Intercept) mileage colorother colorSilver colorWhite
## 432           1 199.000          0           0          1
## 289           1  70.000          0           1          0
## 812           1  81.508          0           0          0
## 534           1  43.267          0           0          0
## 246           1  87.651          0           0          0
## 885           1  31.533          0           1          0

For example, the first car is silver so its dummy is 1 while the others are 0.

Our model is actually: \[ price = \beta_0 + \beta_1 mileage + \beta_2 D_{other} + \beta_3 D_{silver} + \beta_4 D_{white} + \epsilon \] where, for example \(D_{white}\) is 1 if the car is white and 0 otherwise.

Let’s check that this is actually what R did.

lmc1 = lm(price~.,data.frame(price = cdtr$price,mileage=cdtr$mileage,X[,3:5]))
print(lmc1$coefficients)
## (Intercept)     mileage  colorother colorSilver  colorWhite 
##  57.0057676  -0.3384192  -3.4216249  -3.9697506   0.5278398
print(lmc$coefficients)
## (Intercept)     mileage  colorother colorSilver  colorWhite 
##  57.0057676  -0.3384192  -3.4216249  -3.9697506   0.5278398

In the plot, color certainly does not look like a killer variable, but use out of sample performance to see if we should add color to mileage and year.

yhtest = predict(lmc,cdte)
cat("out of sample rmse with mileage  and color is: ",rmse(cdte$price,yhtest),"\n")
## out of sample rmse with mileage  and color is:  10.62553

which is not a big improvement over what we had with just mileage.

log transform

The possibility of a negative prediction for price was particularly disturbing.

Let’s try logging y.

cdtr$lprice = log(cdtr$price)
plot(cdtr$mileage,cdtr$lprice)

Does this look any better ?

To predict price with this model we predict log price and then exponentiate.

llm = lm(lprice~mileage,cdtr)
lyhat = predict(llm,cdte)
yhat = exp(lyhat)
plot(cdte$price,yhat)
abline(0,1,col="red")

What do you think?

Quadratic term

cdtr$m2 = cdtr$mileage^2
cdte$m2 = cdte$mileage^2
lmq = lm(price~mileage+m2,cdtr)
plot(cdtr$mileage,cdtr$price)
points(cdtr$mileage,lmq$fitted,col="red")

What do you think?

We are fitting the model:

\[ price = \beta_0 + \beta_1 \, mileage + \beta_2 \, mileage^2 + \epsilon \]

summary(lmq)
## 
## Call:
## lm(formula = price ~ mileage + m2, data = cdtr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.328  -5.537   0.643   5.803  29.911 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.4574094  0.9911068   69.07   <2e-16 ***
## mileage     -0.7099567  0.0232254  -30.57   <2e-16 ***
## m2           0.0019642  0.0001189   16.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.053 on 747 degrees of freedom
## Multiple R-squared:  0.7613, Adjusted R-squared:  0.7606 
## F-statistic:  1191 on 2 and 747 DF,  p-value: < 2.2e-16

So our equation relating price to mileage is

\[ price = 68.457 - 0.71 \, mileage + .0019 * mileage^2 \]

Homework:

Use out of sample performance (a train/test split) to decide which of these two models is best:

Use out of sample rmse and graphics to compare the two models.