1. Train/test split, categorical variables, and transformations

In this problem we will explore some of the basic tools of applied statistics. However, we will take a Machine Learning viewpoint in that our approach to model selection will use a train/test split.

We will:

  • do a train/test split
  • use the categorical variable trim
  • transform variables (log of y, a quadratic in mileage)

Get the Data

Let’s read in the used cars data. We will just use the features mileage and trim.

cd = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
names(cd)
## [1] "price"        "trim"         "isOneOwner"   "mileage"      "year"        
## [6] "color"        "displacement"
cd = cd[,c(1,4,2)]
names(cd)
## [1] "price"   "mileage" "trim"
cd$trim = as.factor(cd$trim)
cd$mileage = cd$mileage/1000
cd$price = cd$price/1000
head(cd)
##    price mileage trim
## 1 43.995  36.858  550
## 2 44.995  46.883  550
## 3 25.999 108.759  550
## 4 33.880  35.187  550
## 5 34.895  48.153  550
## 6  5.995 121.748  500

 

We now have the data.frame we want to work with.

train/test split

In our brief introduction to statistics in R and Python we learned how to run a multiple regression.

One of the issues that often comes up in modeling is which features (or x variables) are important.

Many people use the t statistics and associated p-values to guide them in their choice.

In Machine Learning, we keep our focus on prediction since that is often the goal in practice.

We want to choose the modeling approach that predicts the best.
But how do we assess this?

As we will see, the basic way is to simply randomly split the data into two subsets.
Estimate the model using one subset called the training data.
Then see how well you predict on the other subset, the test data.
Because you actually have the y values in the test data you can see how well your predictions worked.

We will randomly split the data into train and test subsets.
We will put 75% of the data in train and the remaining 25% in test.

n = nrow(cd)
set.seed(99)
pin = .75 #percent train (or percent in-sample)
ii = sample(1:n,floor(pin*n))
cdtr = cd[ii,]
cdte = cd[-ii,]
cat("dimension of train data:",dim(cdtr),"\n")
## dimension of train data: 750 3
cat("dimension of test data:",dim(cdte),"\n")
## dimension of test data: 250 3

Let’s plot price vs. mileage in the training data just to make sure it looks right.

plot(cdtr$mileage,cdtr$price)

Looks ok.

Linear regression of price on mileage: test RMSE

Now we fit our model using the training data. We will start by only using the feature mileage.

lmtr = lm(price~mileage,cdtr)

Now we predict on the test data:

ypred = predict(lmtr,cdte)

How did we do? We plot our predictions versus the actual prices in the test data.

library(ggplot2)
lms = range(c(cd$price,ypred))
p = ggplot(data.frame(ypred,price=cdte$price),aes(x=ypred,y=price)) + geom_point() +
      xlim(lms) + ylim(lms) + geom_abline(intercept=0,slope=1,colour='red')
p

Not so great. In particular, you will lose a lot of credibility if you predict a negative price.

We can use a variety of numerical measures to summarize our predictive performance.
The most popular measures are the mean squared error (MSE) and the root mean squared error (RMSE).
You will also see people look at the mean absolute error, or mean absolute deviation,
often called MAD.

myrmse = function(y,yhat) {return(sqrt(mean((y-yhat)^2)))}  #root mean squared error
mymad = function(y,yhat) {return(mean(abs(y-yhat)))} # mean absolute error (deviation)

cat("out of sample rmse: ",myrmse(cdte$price,ypred),"\n") #root mean squared error
## out of sample rmse:  10.95046
cat("out of sample mad: ",mymad(cdte$price,ypred),"\n") # mean absolute error
## out of sample mad:  8.571183

 

Let’s see if we can improve our model by

  • using the categorical variable trim
  • transforming y with the log
  • transforming mileage with a quadratic term.

Is trim any good?

How about the categorical variable trim? Can that feature help us predict the price of a car?

How do we put trim into a regression model? We can’t just put trim in directly, a coefficient times “other” does not mean anything!

Or can we?

lmt = lm(price~mileage+trim,cdtr)

What happened?

plot(cdtr$mileage,lmt$fitted.values,col=cdtr$trim,cex=.5)

There are actually four parallel lines, one for each of the four trims.
The dummies just adjust the intercepts.

How does this work?
Since trim is a factor, R will automatically create a dummy variable
for 3 of the 4 levels of trim.
For example, the dummy for trim500 is 1 when the car has trim 500 and 0 otherwise.
This gives us the four different intercepts.

Note that in machine learning, the use of dummy variables (binary 0/1 variables) is often called one hot encoding.

We are actually running a regression with these variables:

X = model.matrix(lmt)
head(X)
##     (Intercept) mileage trim500 trim550 trimother
## 432           1 199.000       0       0         1
## 289           1  70.000       1       0         0
## 812           1  81.508       0       0         1
## 534           1  43.267       1       0         0
## 246           1  87.651       0       1         0
## 885           1  31.533       0       0         1

For example, the first car is trim other so its dummy is 1 while the others are 0.

Our model is actually:

\[ price = \beta_0 + \beta_1 \ mileage + \beta_2 \, D_{500} + \beta_3 \, D_{550} + \beta_4 \, D_{o} + \epsilon \]

where, for example \(D_{o}\) is 1 if the car has trim other and 0 otherwise.
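We can check that the fitted regression really is just this linear model in the dummy columns: multiplying the model matrix by the estimated coefficients should reproduce lm’s fitted values. A quick sanity-check sketch, reusing the lmt fit from above:

```r
# sanity check: the regression with dummies is just X %*% beta-hat
X = model.matrix(lmt)                    # the design matrix shown above
bhat = coef(lmt)                         # estimated coefficients, one per column of X
max(abs(X %*% bhat - lmt$fitted.values)) # should be essentially 0
```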

How did we do out of sample on the test data?

ypredt = predict(lmt,cdte)  # predict on test using model with mileage and trim
plot(cdte$price,ypredt,xlab='test y=price',ylab='predict with mileage and trim',pch=16,cex=.8)
abline(0,1,col='red',lty=2)

Let’s compute the rmse and MAD.

cat("\nrmse with mileage and trim: ",myrmse(cdte$price,ypredt))
## 
## rmse with mileage and trim:  9.665141
cat("\nMAD with mileage and trim: ", mymad(cdte$price,ypredt))
## 
## MAD with mileage and trim:  7.588979

Better than with just mileage!!
But we still know we are not addressing the nonlinearity.
To fit a nonlinear model with linear multiple regression we will try
logging y and a quadratic in mileage.


log transform

The possibility of a negative prediction for price was particularly disturbing.

Let’s try logging y.

cdtr$lprice = log(cdtr$price)
plot(cdtr$mileage,cdtr$lprice)

Does this look any better?

To predict price with this model we predict log price and then take exp.

llm = lm(lprice~mileage,cdtr)  #fit on train
lypred = predict(llm,cdte) #prediction on test for log of y
ypredl = exp(lypred) #prediction on test for y using log model
plot(cdte$price,ypredl)
abline(0,1,col="red")

What do you think?
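One way to answer is to compute the same out-of-sample measures for the back-transformed predictions; a quick sketch reusing myrmse and mymad from above (your numbers will depend on the random split, so none are shown):

```r
# out-of-sample error for the log model, back on the price scale
cat("rmse with log(price) on mileage: ", myrmse(cdte$price, ypredl), "\n")
cat("mad  with log(price) on mileage: ", mymad(cdte$price, ypredl), "\n")
```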

Quadratic term

We simply add mileage squared as another “x”.

cdtr$m2 = cdtr$mileage^2
cdte$m2 = cdte$mileage^2
lmq = lm(price~mileage+m2,cdtr)
plot(cdtr$mileage,cdtr$price)
points(cdtr$mileage,lmq$fitted,col="red")

What do you think?

We are fitting the model:

\[ price = \beta_0 + \beta_1 \, mileage + \beta_2 \, mileage^2 + \epsilon \]

So our equation relating price to mileage is

\[ price = 68.457 - 0.71 \, mileage + 0.0019 \, mileage^2 \]
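Since we already built m2 in the test data, we can also score the quadratic model out of sample, just as we did for the earlier fits (a sketch; the rmse will depend on the random split, so no value is shown):

```r
# test-set predictions from the quadratic fit (uses cdte$m2 created above)
ypredq = predict(lmq, cdte)
cat("rmse with quadratic in mileage: ", myrmse(cdte$price, ypredq), "\n")
plot(cdte$price, ypredq, xlab = 'test y=price', ylab = 'quadratic predictions')
abline(0, 1, col = 'red')
```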

Homework:

Use out of sample performance (a train/test split) to decide which of these two models is best:

  • linear model of log(y) on mileage and trim
  • linear model of y on mileage, mileage squared, and trim

Use out of sample rmse and graphics to compare the two models.

2. Naive Bayes

We’ll use the same data we looked at in the notes.

Read in the training data:

#trainB = read.csv("http://www.rob-mcculloch.org/data/smsTrainBS.csv")
#trainyB = read.csv("http://www.rob-mcculloch.org/data/smsTrainyBS.csv")[,1]

trainB = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainBS.csv")
trainyB = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")[,1]

Let’s check that this matches what we had in the notes.

## train y proportions
table(trainyB)/length(trainyB)
## trainyB
##         0         1 
## 0.8647158 0.1352842

which is the same as the notes.

tbl = table(trainyB,trainB$age)
tbl[1,] = tbl[1,]/sum(tbl[1,])
tbl[2,] = tbl[2,]/sum(tbl[2,])
tbl
##        
## trainyB           0           1
##       0 0.998613037 0.001386963
##       1 0.978723404 0.021276596

which is also the same as the $age table in the notes from the naiveBayes function.
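Before turning to the homework, it may help to see the Bayes rule calculation done by hand for the word age, using only the proportions printed above (a sketch with the numbers hard-coded from the two tables):

```r
# Bayes rule with the proportions printed above
p_spam   = 0.1352842      # P(y = spam), from the table of y proportions
p_a_spam = 0.021276596    # P(age = 1 | y = spam), from the age table
p_a_ham  = 0.001386963    # P(age = 1 | y = ham)
num = p_a_spam * p_spam
p_spam_a = num / (num + p_a_ham * (1 - p_spam))
cat("P(y = spam | age = 1): ", p_spam_a, "\n")  # about 0.71
```

So seeing age in a message moves the spam probability from about 0.14 up to about 0.71.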

trainB also has indicators for four of the words:

names(trainB)
## [1] "age"   "adult" "call"  "know"

Homework

(a)

Using the same approach as in the notes, what is an estimate of P(y=spam | call = yes) ?
trainB$call is 1 if call=yes and 0 if call=no.

(b)

Using the same approach as in the notes, and a naive Bayes assumption that whether
or not call is in a message and whether or not know is in the message are independent
given y, what is an estimate of P(y=spam | call=yes, know=yes)?

(c)

What is P(y=spam | call=yes,know=yes) if you don’t assume call and know are
conditionally independent?

Hint: You would use the first equation on the third slide of 9. Naive Bayes Classification
instead of the second one.