In this homework we will explore some of the basic tools in applied statistics.
However, we will take a Machine Learning viewpoint in the our approach to model selection wil use a train/test split.
We will:
Let’s read in the used cars data.
We will just use the features mileage and color.
cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
names(cd)
## [1] "price" "trim" "isOneOwner" "mileage" "year"
## [6] "color" "displacement"
cd = cd[,c(1,4,6)]
names(cd)
## [1] "price" "mileage" "color"
cd$color = as.factor(cd$color)
cd$mileage = cd$mileage/1000
cd$price = cd$price/1000
head(cd)
## price mileage color
## 1 43.995 36.858 Silver
## 2 44.995 46.883 Black
## 3 25.999 108.759 White
## 4 33.880 35.187 Black
## 5 34.895 48.153 Black
## 6 5.995 121.748 other
We now have the data.frame we want to work with.
In our brief introduction to statistics in R and python we learned how run a multiple regression.
One of the issues that often comes up in modeling is which features (or x variables) are important.
Many people use the t statistics and associated p-values to guide them in their choice.
In Machine Learning, we keep our focus on prediction since that is often the goal in practice.
We want to choose the modeling approach that predicts the best.
But how do we assess this?
As we will see, the basic way is to simply randomly split the data into two subsets.
Estimate the model using one subset called the training data.
Then see how well you predict on the other subset, the test data.
Because you actually have the y values in the test data you can see how well your predictions worked.
We will randomly split the data into train and test subsets.
We will put 75% of the data in train and the remaining 25% in test.
n = nrow(cd)
set.seed(99)
pin = .75 #percent train (or percent in-sample)
ii = sample(1:n,floor(pin*n))
cdtr = cd[ii,]
cdte = cd[-ii,]
cat("dimension of train data:",dim(cdtr),"\n")
## dimension of train data: 750 3
cat("dimension of test data:",dim(cdte),"\n")
## dimension of test data: 250 3
Let’s plot mileage vs. size in the training data just to make sure it looks right.
plot(cdtr$mileage,cdtr$price)
Look ok.
Now we fit our model using the training data.
We will start by only using the feature mileage.
lmtr = lm(price~mileage,cdtr)
Now we predict on the test data:
yhtest = predict(lmtr,cdte)
How did we do?
We plot our predictions versus the actual prices in the test data.
library(ggplot2)
lms = range(c(cd$price,yhtest))
p = ggplot(data.frame(yhtest,price=cdte$price),aes(x=yhtest,y=price)) + geom_point() +
xlim(lms) + ylim(lms) + geom_abline(intercept=0,slope=1,colour='red')
p
Not so great.
In particular, you will lose a lot of credibility if you predict a negative price.
We can use a variety of numerical measures to summarize our predictive performance.
rmse = function(y,yhat) {return(sqrt(mean((y-yhat)^2)))}
mabe = function(y,yhat) {return(mean(abs(y-yhat)))}
cat("out of sample rmse: ",rmse(cdte$price,yhtest),"\n") #root mean squared error
## out of sample rmse: 10.95046
cat("out of sample mad: ",mabe(cdte$price,yhtest),"\n") # mean absolute error
## out of sample mad: 8.571183
cat("out of sample R-squared: ",cor(cdte$price,yhtest)^2,"\n") # out of sample R-squared
## out of sample R-squared: 0.6355823
If you just look at the numbers, maybe it is not too bad.
Let’s see if we can improve our model by
How about the categorical variable color?
Can that feature help us predict the price of a car?
How do we put color into a regression model?
We can’t just put color, a coefficient times “red” does not mean anything!
Or can we?
lmc = lm(price~mileage+color,cdtr)
What happened?
plot(cdtr$mileage,lmc$fitted.values,col=cdtr$color,cex=.2)
There are actually four parallel lines, one for each of the four colors.
The dummies just adjust the intercepts.
Does not look like color does a whole lot !!!!
How does this work?
Since color is a factor, R will automatically create a dummy variable for 3 of the 4 levels of color.
For example, the dummy for white is 1 when the car is white and 0 otherwise. This gives us the four different intercepts.
Note that in machine learning, the use of dummy varies (binary 0 or 1 variables)
is often called one hot encoding.
We are actually running a regression with these variables:
X = model.matrix(lmc)
head(X)
## (Intercept) mileage colorother colorSilver colorWhite
## 432 1 199.000 0 0 1
## 289 1 70.000 0 1 0
## 812 1 81.508 0 0 0
## 534 1 43.267 0 0 0
## 246 1 87.651 0 0 0
## 885 1 31.533 0 1 0
For example, the first car is silver so its dummy is 1 while the others are 0.
Our model is actually: \[ price = \beta_0 + \beta_1 mileage + \beta_2 D_{other} + \beta_3 D_{silver} + \beta_4 D_{white} + \epsilon \] where, for example \(D_{white}\) is 1 if the car is white and 0 otherwise.
Let’s check that this is actually what R did.
lmc1 = lm(price~.,data.frame(price = cdtr$price,mileage=cdtr$mileage,X[,3:5]))
print(lmc1$coefficients)
## (Intercept) mileage colorother colorSilver colorWhite
## 57.0057676 -0.3384192 -3.4216249 -3.9697506 0.5278398
print(lmc$coefficients)
## (Intercept) mileage colorother colorSilver colorWhite
## 57.0057676 -0.3384192 -3.4216249 -3.9697506 0.5278398
In the plot, color certainly does not look like a killer variable, but use out of sample performance to see if we should add color to mileage and year.
yhtest = predict(lmc,cdte)
cat("out of sample rmse with mileage and color is: ",rmse(cdte$price,yhtest),"\n")
## out of sample rmse with mileage and color is: 10.62553
which is not a big improvement over what we had with just mileage.
The possibility of a negative prediction for price was particularly disturbing.
Let’s try logging y.
cdtr$lprice = log(cdtr$price)
plot(cdtr$mileage,cdtr$lprice)
Does this look any better ?
To predict price with this model we predict log price and then exponentiate.
llm = lm(lprice~mileage,cdtr)
lyhat = predict(llm,cdte)
yhat = exp(lyhat)
plot(cdte$price,yhat)
abline(0,1,col="red")
What do you think?
cdtr$m2 = cdtr$mileage^2
cdte$m2 = cdte$mileage^2
lmq = lm(price~mileage+m2,cdtr)
plot(cdtr$mileage,cdtr$price)
points(cdtr$mileage,lmq$fitted,col="red")
What do you think?
We are fitting the model:
\[ price = \beta_0 + \beta_1 \, mileage + \beta_2 \, mileage^2 + \epsilon \]
summary(lmq)
##
## Call:
## lm(formula = price ~ mileage + m2, data = cdtr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.328 -5.537 0.643 5.803 29.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.4574094 0.9911068 69.07 <2e-16 ***
## mileage -0.7099567 0.0232254 -30.57 <2e-16 ***
## m2 0.0019642 0.0001189 16.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.053 on 747 degrees of freedom
## Multiple R-squared: 0.7613, Adjusted R-squared: 0.7606
## F-statistic: 1191 on 2 and 747 DF, p-value: < 2.2e-16
So our equation relating price to mileage is
\[ price = 68.457 - 0.71 \, mileage + .0019 * mileage^2 \]
Homework:
Use out of sample performance (a train/test split) to decide which of these two models is best:
Use out of sample rmse and graphics to compare the two models.