LASSO, Polynomials and the cars data

Let’s use the LASSO and polynomial terms to fit the cars data (susedcars.csv).
We’ll use just mileage and year.

I'll show some basic results I got, but see if you can get a nice model for
y = price given x = (mileage, year) using polynomials of the features and regularization.

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
In [27]:
## read in the data and pull off price, mileage, and year.
cd = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
cd = cd[['price','mileage','year']]
cd['price']  /= 1000   # price in thousands of dollars
cd['mileage']  /= 1000 # mileage in thousands of miles
print(cd.head())
    price  mileage  year
0  43.995   36.858  2008
1  44.995   46.883  2012
2  25.999  108.759  2007
3  33.880   35.187  2007
4  34.895   48.153  2007
In [28]:
## convert to numpy
X = cd[['mileage','year']].to_numpy()  #mileage and year columns as a numpy array
print("*** type of X is",type(X))
print(X.shape) #number of rows and columns
print(X[0:4,:]) #first 4 rows
y = cd['price'].values #price as a numpy vector
print(f'length of y is {len(y)}')
print(y[:4]) #implicit start at 0
*** type of X is <class 'numpy.ndarray'>
(1000, 2)
[[  36.858 2008.   ]
 [  46.883 2012.   ]
 [ 108.759 2007.   ]
 [  35.187 2007.   ]]
length of y is 1000
[43.995 44.995 25.999 33.88 ]
In [29]:
## run the linear model (which is not so great)

lmmod = LinearRegression(fit_intercept=True)
lmmod.fit(X,y) # (X,y) is the training data
print("Model Slopes:    ",lmmod.coef_)
print("Model Intercept:",lmmod.intercept_)
Model Slopes:     [-0.1537219   2.69434954]
Model Intercept: -5365.489872256993
In [30]:
yhat = lmmod.predict(X)

plt.scatter(y,yhat,s=.8)
plt.plot(y,y,c='red') #add the line
plt.xlabel("y"); plt.ylabel("yhat")
Out[30]:
Text(0, 0.5, 'yhat')
[Figure: scatter of yhat against y with the red y = y line]
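One way to quantify "not so great" is the in-sample RMSE; a quick check I'm adding here (the rmse name is my own helper):

## in-sample root mean squared error for the plain linear fit
rmse = np.sqrt(np.mean((y - yhat)**2))
print(f'in-sample rmse: {rmse:.2f}')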

Previously we saw that adding mileage squared helped the model.

But if you look at the predictions for large values of mileage,
the predicted price actually went up as the mileage went up. That does not sound right!!

Let’s try throwing in a lot of polynomial terms using both mileage and year.
We can easily create the polynomial terms by hand, but both R and sklearn have ways to create polynomials.

Let's look at how the sklearn PolynomialFeatures works with just degree 2.

In [31]:
from sklearn.preprocessing import PolynomialFeatures
p2 = PolynomialFeatures(2,include_bias=False)
X2 = p2.fit_transform(X)
print(X2.shape)
X2[:5,:]
(1000, 5)
Out[31]:
array([[3.68580000e+01, 2.00800000e+03, 1.35851216e+03, 7.40108640e+04,
        4.03206400e+06],
       [4.68830000e+01, 2.01200000e+03, 2.19801569e+03, 9.43285960e+04,
        4.04814400e+06],
       [1.08759000e+02, 2.00700000e+03, 1.18285201e+04, 2.18279313e+05,
        4.02804900e+06],
       [3.51870000e+01, 2.00700000e+03, 1.23812497e+03, 7.06203090e+04,
        4.02804900e+06],
       [4.81530000e+01, 2.00700000e+03, 2.31871141e+03, 9.66430710e+04,
        4.02804900e+06]])
In [32]:
plt.scatter(X2[:,0],X2[:,2]) # mileage against mileage squared: almost a deterministic curve
Out[32]:
<matplotlib.collections.PathCollection at 0x774477e55d20>
[Figure: scatter of mileage^2 (column 2) against mileage (column 0)]

The five columns are x1, x2, x1^2, x1*x2, and x2^2.
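You can confirm that ordering from the transformer itself; a quick check (assuming sklearn 1.0+, where get_feature_names_out is available):

## column order of the degree-2 expansion
print(p2.get_feature_names_out(['mileage', 'year']))
## ['mileage' 'year' 'mileage^2' 'mileage year' 'year^2']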

In [33]:
## pretty high correlations
pd.DataFrame(X2).corr()
Out[33]:
          0         1         2         3         4
0  1.000000 -0.744729  0.946540  0.999994 -0.744929
1 -0.744729  1.000000 -0.660500 -0.743045  0.999999
2  0.946540 -0.660500  1.000000  0.946143 -0.660511
3  0.999994 -0.743045  0.946143  1.000000 -0.743247
4 -0.744929  0.999999 -0.660511 -0.743247  1.000000
In [34]:
# maybe better with demeaned features: for a positive, skewed variable like mileage,
# x and x^2 are nearly perfectly correlated, and centering breaks that up
Xm = X.mean(0)
Xdm = X - Xm
Xdm2 = p2.fit_transform(Xdm)
pd.DataFrame(Xdm2).corr()
Out[34]:
          0         1         2         3         4
0  1.000000 -0.744729  0.500553 -0.365394  0.241734
1 -0.744729  1.000000 -0.253579  0.303090 -0.500058
2  0.500553 -0.253579  1.000000 -0.786721  0.373183
3 -0.365394  0.303090 -0.786721  1.000000 -0.716638
4  0.241734 -0.500058  0.373183 -0.716638  1.000000
In [35]:
plt.scatter(Xdm2[:,0],Xdm2[:,2]) # after demeaning the parabola is visible
Out[35]:
<matplotlib.collections.PathCollection at 0x7744764ec490>
[Figure: scatter of demeaned mileage squared against demeaned mileage]
In [36]:
# I'm going to try degree 5!!
poly = PolynomialFeatures(5,include_bias=False)  ## gives the C(5+2,2)-1 = 20 monomials up to degree 5 in 2 variables (bias excluded)
Xp = poly.fit_transform(Xdm)
print(Xp.shape)
(1000, 20)

If you want to create the polynomial terms by hand, you might try just
subtracting the mean from the x variables before you transform them.
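For example, a by-hand degree-3 version could look like the sketch below (the column names m, yr, and Xhand are my own; PolynomialFeatures(3, include_bias=False) on Xdm generates these same terms in this order):

## by-hand polynomial terms from the demeaned features (sketch)
m = Xdm[:,0]   # demeaned mileage
yr = Xdm[:,1]  # demeaned year
Xhand = np.column_stack([m, yr, m**2, m*yr, yr**2, m**3, m**2*yr, m*yr**2, yr**3])
print(Xhand.shape)  # (1000, 9): all terms up to degree 3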

Try putting in "a lot" of polynomial terms,
use the LASSO to regularize the model, and see how good an out-of-sample rmse you can get.
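Here is a minimal sketch of one way to set that up (the split fraction, degree, CV folds, and names like lasso_pipe are my choices, not the original run): polynomial terms on the demeaned features, standardized, with LassoCV choosing the penalty by cross-validation.

## sketch: out-of-sample rmse for a cross-validated LASSO on polynomial features
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

Xtr, Xte, ytr, yte = train_test_split(Xdm, y, test_size=0.25, random_state=34)
lasso_pipe = make_pipeline(
    PolynomialFeatures(5, include_bias=False),  # lots of polynomial terms
    StandardScaler(),                           # put the terms on one scale before shrinking
    LassoCV(cv=10, max_iter=100000),            # cross-validation picks the penalty
)
lasso_pipe.fit(Xtr, ytr)
rmse = np.sqrt(np.mean((yte - lasso_pipe.predict(Xte))**2))
print(f'out-of-sample rmse: {rmse:.2f}')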

Using demeaned features and degrees bigger than 2, I got the in-sample fit depicted below.
I used cross-validation to get good LASSO shrinkage and selection and then fit using all the data.

Maybe you should leave out some test data!! (optional, your call)

Below, yhatP is what I got from just a linear regression on all the polynomial terms and yhatL is from the LASSO.
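Continuing from the sketch above (so StandardScaler, LassoCV, and make_pipeline are already imported), here is one way the two in-sample fits might be produced; the names and plotting layout are mine:

## sketch: refit on all the data, least squares vs LASSO on the degree-5 terms
yhatP = LinearRegression().fit(Xp, y).predict(Xp)
yhatL = make_pipeline(StandardScaler(), LassoCV(cv=10, max_iter=100000)).fit(Xp, y).predict(Xp)

fig, ax = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for a, yh, ttl in zip(ax, [yhatP, yhatL], ['yhatP: linear regression', 'yhatL: LASSO']):
    a.scatter(y, yh, s=.8); a.plot(y, y, c='red'); a.set_xlabel('y'); a.set_title(ttl)
ax[0].set_ylabel('yhat')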

In [37]:
from IPython.display import Image
Image(filename='Figure_1.png')
Out[37]:
[Image: Figure_1.png, the in-sample fit described above]
In [38]:
Image(filename='./py_price-lasso.png')
Out[38]:
[Image: py_price-lasso.png]