1. Train/test split, categorical variables, and transformations¶

In this problem we will explore some of the basic tools in applied statistics.
However, we will take a Machine Learning viewpoint in that our approach to model selection will use a train/test split.

We will:

  • learn about the train/test split
  • see how to use a categorical regressor in a linear model
  • try the log transform on y
  • try a polynomial transformation of an x

Let's do the imports we will need.

In [134]:
# numpy and pandas
import numpy as np
import pandas as pd
import math

#graphics with matplotlib
import matplotlib.pyplot as plt

# model, train/test split, dummies (one-hot-encoding), rmse metric from scikit learn.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import mean_squared_error

Get the Data¶

Let’s read in the used cars data.
We will just use the features mileage and trim.

In [135]:
cd = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
print("***the column names are:")
print(cd.columns.values)
# we will just use price, mileage, and trim
cd = cd[['price','mileage','trim']]
cd['price'] = cd['price']/1000
cd['mileage'] = cd['mileage']/1000
cd.head()
***the column names are:
['price' 'trim' 'isOneOwner' 'mileage' 'year' 'color' 'displacement']
Out[135]:
price mileage trim
0 43.995 36.858 550
1 44.995 46.883 550
2 25.999 108.759 550
3 33.880 35.187 550
4 34.895 48.153 550

Train/Test Split¶

In our brief introduction to statistics in R and python we learned how to run a multiple regression.

One of the issues that often comes up in modeling is which features (or x variables) are important.

Many people use the t statistics and associated p-values to guide them in their choice.

In Machine Learning, we keep our focus on prediction since that is often the goal in practice.

We want to choose the modeling approach that predicts the best.
But how do we assess this?

As we will see, the basic way is to simply randomly split the data into two subsets.
Estimate the model using one subset called the training data.
Then see how well you predict on the other subset, the test data.
Because you actually have the y values in the test data you can see how well your predictions worked.

We will randomly split the data into train and test subsets.
We will put 75% of the data in train and the remaining 25% in test.

In [136]:
n = cd.shape[0]
ntr = np.floor(.75*n).astype(int)
nte = n-ntr
print(f'\n ***n, n train, n test: {n}, {ntr}, {nte}\n')
 ***n, n train, n test: 1000, 750, 250

We can now use scikit-learn to get the train test split.

In [137]:
cdtr, cdte = train_test_split(cd,random_state=99, test_size=.25)  #always set the seed, Gretzky
print(cdtr.shape)
print(cdte.shape)
(750, 3)
(250, 3)

Let's see how we could do a train/test split in numpy just for fun.

In [138]:
## random number generator
rng = np.random.default_rng(seed=14) #Dave Keon
## sample ntr from 0:(n-1) without replacement
ii = rng.choice(n,ntr,replace=False) #train indices
In [139]:
cdtr1 = cd.iloc[ii,:]  #train
# get the complement (test) indices
all_indices = np.arange(n)
comp_ii = np.setdiff1d(all_indices, ii) #test indices
cdte1 = cd.iloc[comp_ii,:] # test
print(cdtr1.shape)
print(cdte1.shape)
(750, 3)
(250, 3)
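An equivalent (and arguably simpler) numpy approach is to permute all the indices once and slice, which gives the complement for free. A minimal sketch using just the sizes (the seed here is illustrative):

```python
import numpy as np

n, ntr = 1000, 750
rng = np.random.default_rng(seed=14)

perm = rng.permutation(n)              # shuffle 0,...,n-1 once
ii_tr, ii_te = perm[:ntr], perm[ntr:]  # first 750 are train, last 250 are test

print(len(ii_tr), len(ii_te))
print(len(np.intersect1d(ii_tr, ii_te)))  # 0: train and test do not overlap
```

We could then take `cd.iloc[ii_tr,:]` and `cd.iloc[ii_te,:]` just as above.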

Let's plot mileage vs. price using the training data.

In [140]:
plt.scatter(cdtr['mileage'],cdtr['price'],s=.85)
plt.xlabel('mileage'); plt.ylabel('price')
Out[140]:
Text(0, 0.5, 'price')
[scatter plot of mileage vs. price, training data]

Another very common train/test split method you will often see works when
we already have numpy arrays for y and X.

In [141]:
y = cd["price"].to_numpy()
y.shape
Out[141]:
(1000,)
In [142]:
X = cd[["mileage"]].to_numpy()
X.shape
Out[142]:
(1000, 1)
In [143]:
Xtr, Xte, ytr, yte = train_test_split(X,y,random_state=99, test_size=.25)
print(Xtr.shape)
print(Xte.shape)
print(ytr.shape)
print(yte.shape)
(750, 1)
(250, 1)
(750,)
(250,)
In [144]:
plt.scatter(Xtr,ytr,s=.85,c='green')
plt.xlabel("train data mileage"); plt.ylabel("train data price")
Out[144]:
Text(0, 0.5, 'train data price')
[scatter plot of train data mileage vs. train data price]

Looks ok.

Linear regression of price on mileage with test rmse¶

Now we fit our model using the training data.
We will start by only using the feature mileage.

In [145]:
lm1 = LinearRegression(fit_intercept=True)
lm1.fit(Xtr,ytr)  # fit on train
ypred = lm1.predict(Xte) # predict on test
In [146]:
plt.scatter(ypred,yte)
plt.plot(ypred,ypred,c='red')
Out[146]:
[<matplotlib.lines.Line2D at 0x7cc050069a20>]
[test price vs. predictions from the mileage-only model, with reference line]

Not so great. In particular, you will lose a lot of credibility if you predict a negative price.

We can use a variety of numerical measures to summarize our predictive performance.

The most popular measure is root mean squared error.
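For $n$ test observations with predictions $\hat{y}_i$:

$$ rmse = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } $$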

In [147]:
def myrmse(y,yh):
    return(math.sqrt(np.mean((y-yh)**2)))
In [148]:
print(myrmse(yte,ypred))
print(math.sqrt(mean_squared_error(yte,ypred)))
11.647566500771871
11.647566500771871

The mean absolute error, or MAD, is also used.
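That is, the average of the absolute errors:

$$ mad = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i | $$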

In [149]:
def mymad(y,yh):
    return(np.mean(np.abs(y-yh)))
In [150]:
mymad(yte,ypred)
Out[150]:
9.212157680769527

Is trim any good ?¶

Let's try using the categorical variable trim in addition to mileage.

We have to one hot encode trim.

In [151]:
one_hot = LabelBinarizer()  #does the same thing as OneHotEncoder but a little simpler
tdumtr = one_hot.fit_transform(cdtr["trim"])
tdumte = one_hot.transform(cdte["trim"])
print(cdtr["trim"][:10])
print(tdumtr[:10,:])
644      550
389      550
999    other
125      550
792      550
464      550
676      550
606      550
319      500
466      550
Name: trim, dtype: object
[[0 0 1 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 1 0 0]
 [0 0 1 0]]

So the columns of tdumtr are dummies for trim equal to 430, 500, 550, and other, respectively.

Let's add the dummies to our mileage x corresponding to the model:

$$ price = \beta_0 + \beta_1 \ mileage + \beta_2 \, D_{500} + \beta_3 \, D_{550} + \beta_4 \, D_{other} + \epsilon $$

Where, for example, $D_{500}$ is 1 if the car has trim 500 and 0 otherwise.

In [152]:
Xttr = np.hstack([cdtr.iloc[:,[1]].to_numpy(),tdumtr[:,1:4]])
Xtte = np.hstack([cdte.iloc[:,[1]].to_numpy(),tdumte[:,1:4]])
print(Xttr[:5])
print(cdte['mileage'].iloc[:5])
print(cdte['trim'].iloc[:5])
[[ 47.894   0.      1.      0.   ]
 [ 27.295   0.      1.      0.   ]
 [131.121   0.      0.      1.   ]
 [ 12.097   0.      1.      0.   ]
 [109.609   0.      1.      0.   ]]
890     90.451
983    116.299
107     29.047
609     77.556
113     96.530
Name: mileage, dtype: float64
890      550
983      500
107    other
609      550
113      500
Name: trim, dtype: object
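As an aside, pandas can build the same kind of design matrix directly; a minimal sketch on a toy frame (get_dummies with drop_first=True plays the role of LabelBinarizer plus dropping the 430 baseline column):

```python
import pandas as pd

# toy frame standing in for the used cars data
toy = pd.DataFrame({'mileage': [36.9, 46.9, 108.8, 35.2, 48.2],
                    'trim':    ['550', '550', 'other', '500', '430']})

# one-hot encode trim; drop_first drops the (alphabetically) first level, 430
Xt = pd.get_dummies(toy, columns=['trim'], drop_first=True, dtype=int)
print(Xt.columns.tolist())
```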

Now let's run the regression of price on mileage and trim and then predict on test.

In [153]:
lm2 = LinearRegression(fit_intercept=True)
lm2.fit(Xttr,cdtr['price'])  # fit on train !
ypredt = lm2.predict(Xtte)   # predict on test !

## coefficients
lm2.coef_
Out[153]:
array([-0.2670384 , -0.93794863, 13.62817025,  8.50519223])

Let's plot the predictions vs mileage with trim color coded.

In [154]:
convd = {'430':'black','500':'red','550':'green','other':'blue'}
cvec = [convd[cval] for cval in cdte['trim']]
In [155]:
plt.scatter(cdte['mileage'],ypredt,s=2.0,c=cvec)
plt.xlabel('mileage'); plt.ylabel('predictions using mileage and trim')
Out[155]:
Text(0, 0.5, 'predictions using mileage and trim')
[predictions vs. mileage, color coded by trim]

How about the test rmse and MAD?

In [156]:
print(myrmse(cdte["price"],ypredt))
print(mymad(cdte["price"],ypredt))
10.263742974901055
7.89483891583861

A little better than with just mileage !!

Log transform¶

Let's try logging price.

In [157]:
lprice = np.log(cdtr['price'])
plt.scatter(cdtr['mileage'],lprice,s=.85)
plt.xlabel('mileage'); plt.ylabel('log of price')
Out[157]:
Text(0, 0.5, 'log of price')
[scatter plot of mileage vs. log of price, training data]

Does this look any better ?

To predict price with this model we predict log price and then take exp.
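That is, having fit $\log(price) = \beta_0 + \beta_1 \, mileage + \epsilon$ on the training data, we predict with

$$ \widehat{price} = \exp( \hat{\beta}_0 + \hat{\beta}_1 \, mileage ) $$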

In [158]:
lm3 = LinearRegression(fit_intercept=True)
lm3.fit(cdtr[['mileage']],lprice)

lypred = lm3.predict(cdte[['mileage']])
ypredl = np.exp(lypred)
plt.scatter(cdte['price'],ypredl)
plt.plot(ypredl,ypredl,c='red')
plt.xlabel('test price'); plt.ylabel('predictions')
Out[158]:
Text(0, 0.5, 'predictions')
[test price vs. predictions from the log model, with reference line]

What do you think ?

Quadratic term¶

We can add mileage squared as a predictor !!

In [159]:
Xq = np.column_stack([cdtr['mileage'],cdtr['mileage']**2])
lm4 = LinearRegression(fit_intercept=True)
lm4.fit(Xq,cdtr['price'])
yhatq = lm4.predict(Xq)

plt.scatter(cdtr['mileage'],cdtr['price'],marker='o',c='black',s=.9)
plt.scatter(cdtr['mileage'],yhatq,marker=5,c='red',s=.9)
plt.xlabel('mileage'); plt.ylabel('price')
plt.title('train fit, quadratic model')
Out[159]:
Text(0.5, 1.0, 'train fit, quadratic model')
[train data and quadratic fit: price vs. mileage]

What do you think ?

$$ price = \beta_0 + \beta_1 \, mileage + \beta_2 \, mileage^2 + \epsilon $$
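The plot above only shows the fit on train; the natural next step is to judge the quadratic model out of sample too. A self-contained sketch on simulated data (the cars train/test frames would be used the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# simulate a curved relationship standing in for the cars data
rng = np.random.default_rng(34)
xtr = rng.uniform(0, 150, 750); ytr = 60 - 0.5*xtr + 0.0015*xtr**2 + rng.normal(0, 5, 750)
xte = rng.uniform(0, 150, 250); yte = 60 - 0.5*xte + 0.0015*xte**2 + rng.normal(0, 5, 250)

# quadratic design matrices: columns [x, x^2] for train and test
Xqtr = np.column_stack([xtr, xtr**2])
Xqte = np.column_stack([xte, xte**2])

lmq = LinearRegression(fit_intercept=True)
lmq.fit(Xqtr, ytr)                        # fit on train !
yhat = lmq.predict(Xqte)                  # predict on test !
rmse = np.sqrt(np.mean((yte - yhat)**2))  # test rmse
print(round(rmse, 2))
```

With noise standard deviation 5 in the simulation, the test rmse should come out near 5.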

Homework:

Use out of sample performance (a train/test split) to decide which of these two models is best:

  • linear model of log(y) on mileage and trim
  • linear model of y on mileage, mileage squared, and trim

Use out of sample rmse and graphics to compare the two models.

2. Naive Bayes¶

We’ll use the same data we looked at in the notes.

Read in the training data:

In [160]:
trainB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainBS.csv")
trainyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")['smsTrainyB']

Let's check the y=ham/spam.

In [161]:
trainyB.value_counts()/len(trainyB)
Out[161]:
smsTrainyB
0    0.864716
1    0.135284
Name: count, dtype: float64

Same as in the notes.

Let's check the P(age|y).

In [162]:
pd.crosstab(trainyB,trainB['age'])
Out[162]:
age 0 1
smsTrainyB
0 3600 5
1 552 12
In [163]:
print(5/(3600+5))
print(12/(552+12))
0.0013869625520110957
0.02127659574468085

which is also the same as the $age table produced by the naiveBayes function in the notes.
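The two divisions above can be done in one step by normalizing the crosstab within each row; a sketch rebuilding the table from its counts:

```python
import pandas as pd

# rebuild the y/age table above from its counts: 3605 hams (5 with age=1),
# 564 spams (12 with age=1)
y   = pd.Series([0]*3605 + [1]*564, name='smsTrainyB')
age = pd.Series([0]*3600 + [1]*5 + [0]*552 + [1]*12, name='age')

# normalize='index' divides each row by its row total, giving P(age | y)
tab = pd.crosstab(y, age, normalize='index')
print(tab)
```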

trainB does not have all the columns (terms) we used in the notes.
It just has the columns for four of the terms.

In [164]:
[x for x in trainB.columns]
Out[164]:
['age', 'adult', 'call', 'know']

Homework

(a)

Using the same approach as in the notes, what is an estimate of P(y=spam | call = yes) ?

trainB['call'] is 1 if call=yes and 0 if call=no.

(b)

Using the same approach as in the notes, and the naive Bayes assumption that whether or not call is in the message and whether or not know is in the message are independent given y, what is an estimate of P(y=spam | call=yes, know=yes) ?
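Writing $D_c$ and $D_k$ for the call and know indicators (notation introduced here, not from the notes), the conditional independence assumption lets us write

$$ P(spam \mid D_c = 1, D_k = 1) = \frac{ P(D_c = 1 \mid spam) \, P(D_k = 1 \mid spam) \, P(spam) }{ \sum_{y} P(D_c = 1 \mid y) \, P(D_k = 1 \mid y) \, P(y) } $$

where the sum in the denominator is over $y \in \{spam, ham\}$.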

(c)

What is P(y=spam | call=yes,know=yes) if you don’t assume call and know are conditionally independent?

Hint: You would use the first equation on the third slide of 9. Naive Bayes Classification instead of the second one.