Transformation and Variable Selection with a train/test split

In our brief introduction to statistics in R and Python we learned how to run a multiple regression.

One of the issues that often comes up in modeling is deciding which features (or x variables) are important.

Many people use the t statistics and associated p-values to guide them in their choice.

In Machine Learning, we keep our focus on prediction since that is often the goal in practice.

We want to choose the modeling approach that predicts the best.
But how do we assess this?

As we will see, the basic approach is to simply randomly split the data into two subsets.
Estimate the model using one subset, called the training data.
Then see how well you predict on the other subset, the test data.
Because you actually have the y values in the test data, you can check how well your predictions worked.

Let’s try this on our used cars data.
First we read in the data and again just use the features year and mileage.

cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd = cd[,c("price","mileage","year")] # keep only price, mileage, and year
cd$price = cd$price/1000     # rescale price to thousands
cd$mileage = cd$mileage/1000 # rescale mileage to thousands
head(cd)
##    price mileage year
## 1 43.995  36.858 2008
## 2 44.995  46.883 2012
## 3 25.999 108.759 2007
## 4 33.880  35.187 2007
## 5 34.895  48.153 2007
## 6  5.995 121.748 2002

Now we will randomly split the data into train and test subsets.
We will put 75% of the data in train and the remaining 25% in test.

n = nrow(cd)
set.seed(99)                  # so the random split is reproducible
pin = .75                     # proportion of the data to put in train
ii = sample(1:n,floor(pin*n)) # indices of the training observations
cdtrain = cd[ii,]
cdtest = cd[-ii,]
cat("dimension of train data:",dim(cdtrain),"\n")
## dimension of train data: 750 3
cat("dimension of test data:",dim(cdtest),"\n")
## dimension of test data: 250 3

Now we fit our model using the training data:

lmtr = lm(price~.,cdtrain)

Now we predict on the test data:

yhtest = predict(lmtr,cdtest)

How did we do?

library(ggplot2)
lms = range(c(cd$price,yhtest))
p = ggplot(data.frame(yhtest,price=cdtest$price),aes(x=yhtest,y=price)) + geom_point() +
      xlim(lms) + ylim(lms) + geom_abline(intercept=0,slope=1,colour='red')
p

Not so great.
In particular, you will lose a lot of credibility if you predict a negative price.
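
For instance, we can count how many of the test-set predictions actually come out negative (a quick check using the objects defined above):

cat("number of negative predicted prices:",sum(yhtest < 0),"\n")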

We can use a variety of numerical measures to summarize our predictive performance.

rmse = function(y,yhat) {return(sqrt(mean((y-yhat)^2)))}
mabe = function(y,yhat) {return(mean(abs(y-yhat)))}

cat("out of sample rmse: ",rmse(cdtest$price,yhtest),"\n") #root mean squared error
## out of sample rmse:  7.994986
cat("out of sample mad: ",mabe(cdtest$price,yhtest),"\n") # mean absolute error
## out of sample mad:  5.839794
cat("out of sample R-squared: ",cor(cdtest$price,yhtest)^2,"\n") # out of sample R-squared
## out of sample R-squared:  0.8064184

If you just look at the numbers, maybe it is not too bad.
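
For comparison, here is a quick in-sample check (a sketch using the objects defined above). The error computed on the same training data used to fit the model is typically at least a little optimistic, which is exactly why we judge a model by its performance on the held-out test data.

cat("in sample rmse: ",rmse(cdtrain$price,lmtr$fitted.values),"\n") # error on the training data itself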

Let’s use out of sample performance to explore two basic issues.

A classic way to deal with the positivity issue is to use the log of the price in the regression and then exponentiate the predictions for log price to get predictions for price.

Let’s also try dropping year and using only the feature mileage.
So, we now have 4 models: with year/without year and with log/without log.

Assess these 4 models based on their out of sample predictive performance.
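
As a starting point, here is a minimal sketch of one of the four models (the log model using both mileage and year); the names lmlog and yhlog are just illustrative:

lmlog = lm(log(price)~mileage+year,cdtrain) # regress log price on mileage and year
yhlog = exp(predict(lmlog,cdtest))          # exponentiate to get predictions back on the price scale
cat("out of sample rmse, log model: ",rmse(cdtest$price,yhlog),"\n")

Since the exponentiated predictions are always positive, this model can never predict a negative price.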

Is color any good?

How about the categorical variable color?
Can that feature help us predict the price of a car?

cd1 = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd1 = cd1[,c(1,4,5,6)] # keep columns 1, 4, 5, and 6: price, mileage, year, and color
cd1$price = cd1$price/1000
cd1$mileage = cd1$mileage/1000
cd1$color = as.factor(cd1$color) # make color a factor (a categorical variable)
head(cd1)
##    price mileage year  color
## 1 43.995  36.858 2008 Silver
## 2 44.995  46.883 2012  Black
## 3 25.999 108.759 2007  White
## 4 33.880  35.187 2007  Black
## 5 34.895  48.153 2007  Black
## 6  5.995 121.748 2002  other

How do we put color into a regression model?
We can’t just put color in as it is: \(\beta \times \text{red}\) does not mean anything!

Or can we?

lmc = lm(price~mileage+color,cd1)

What happened?

plot(cd1$mileage,lmc$fitted.values,col=cd1$color,cex=.2) # fitted values vs mileage, colored by car color

We actually get 4 different lines. Each line is for one of the 4 colors.

How does this work?
Since color is a factor, R will automatically create a dummy variable for 3 of the 4 levels of color.
For example, the dummy for white is 1 when the car is white and 0 otherwise. The level without a dummy (Black) is the baseline, captured by the intercept. This gives us the four different intercepts.

Note that in machine learning, the use of dummy variables (binary 0/1 variables) is often called one-hot encoding.

We are actually running a regression with these variables:

X = model.matrix(lmc)
head(X)
##   (Intercept) mileage colorother colorSilver colorWhite
## 1           1  36.858          0           1          0
## 2           1  46.883          0           0          0
## 3           1 108.759          0           0          1
## 4           1  35.187          0           0          0
## 5           1  48.153          0           0          0
## 6           1 121.748          1           0          0

For example, the first car is Silver, so its colorSilver dummy is 1 while the other dummies are 0.

Our model is actually: \[ \text{price} = \beta_0 + \beta_1 \, \text{mileage} + \beta_2 D_{other} + \beta_3 D_{silver} + \beta_4 D_{white} + \epsilon \] where, for example, \(D_{white}\) is 1 if the car is white and 0 otherwise.

Let’s check that this is actually what R did.

lmc1 = lm(price~.,data.frame(price = cd1$price,mileage=cd1$mileage,X[,3:5]))
print(lmc1$coefficients)
## (Intercept)     mileage  colorother colorSilver  colorWhite 
##  57.4289621  -0.3417251  -3.8049833  -4.3433991   0.7731780
print(lmc$coefficients)
## (Intercept)     mileage  colorother colorSilver  colorWhite 
##  57.4289621  -0.3417251  -3.8049833  -4.3433991   0.7731780

In the plot, color certainly does not look like a killer variable, but use out of sample performance to see if we should add color to mileage and year.
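
Here is a minimal sketch of that comparison, reusing the split indices ii and the rmse function from above (the object names cd1train, lmnc, and lmwc are just illustrative):

cd1train = cd1[ii,]  # same random split as before
cd1test = cd1[-ii,]
lmnc = lm(price~mileage+year,cd1train)       # without color
lmwc = lm(price~mileage+year+color,cd1train) # with color
cat("out of sample rmse, without color: ",rmse(cd1test$price,predict(lmnc,cd1test)),"\n")
cat("out of sample rmse, with color:    ",rmse(cd1test$price,predict(lmwc,cd1test)),"\n")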