Note that in addition to our usual imports we have Ridge, RidgeCV, Lasso, and LassoCV. You can tell by the names what they are for!!
### imports
import numpy as np
import pandas as pd
import math
import scipy as sp
import matplotlib.pyplot as plt
plt.style.use('seaborn')
#ipython terminal
#%matplotlib
#jupyter notebook
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
Let's use the Boston housing data, but we will use all the features!
We can just pull it out of sklearn.
## boston data
from sklearn.datasets import load_boston
boston = load_boston()
features = boston.data
y = boston.target
Let's scale our x variables.
We typically do this when we use L1 or L2 regularization since the penalty treats all
the coefficients for the different features (the X's) the same.
Note that I am using all the data to do the scaling.
In principle this is incorrect, since when I do CV (cross-validation) below it means
I have already used all the data to pick a part of my modeling process, so nothing
is truly out of sample.
I am assuming that the scaling I would get from 9/10 of the data is about the same as
the scaling I get from all the data, since I am going to do 10-fold CV.
##################################################
## scale
scl = StandardScaler()
X = scl.fit_transform(features)
print("means should be 0, sds should be 1")
print(X.mean(axis=0))
print(X.std(axis=0))
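If we wanted to avoid that leakage entirely, we could put the scaler inside a pipeline so it is re-fit on the training folds only. A minimal sketch, assuming sklearn's Pipeline and cross_val_score (the alpha value here is just an illustrative choice, not a tuned one):
## sketch: keep the scaling inside each CV fold (no leakage)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge(alpha=10.0))])  # alpha chosen only for illustration
cvmse = -cross_val_score(pipe, features, y, cv=10, scoring='neg_mean_squared_error')
print("10-fold CV mse with in-fold scaling:", cvmse.mean())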
##################################################
## simple regression
lmod = LinearRegression()
lmod.fit(X,y)
yhatl = lmod.predict(X)
### plot y vs yhat
plt.scatter(y,yhatl)
plt.xlabel('y'); plt.ylabel('yhat')
plt.plot(y,y,c='red',linestyle='dotted')
plt.title('y vs. yhat for linear regression, in-sample, all data')
Let's try ridge regression with 10-fold cross validation.
The "alphas" are the $\lambda$ values.
That is, they control the size of the L2 complexity penalty.
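For reference, in sklearn's parameterization ridge regression chooses the coefficients to minimize the penalized sum of squares, with alpha playing the role of $\lambda$:

$$\min_{\beta}\; \|y - X\beta\|_2^2 + \alpha \|\beta\|_2^2$$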
##################################################
## ridge
alphas = np.linspace(start=1,stop=200,num=100)
rcv = RidgeCV(alphas,cv=10)
rcv.fit(X,y)
print(rcv.alpha_)
print(rcv.coef_)
yhatr = rcv.predict(X)
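Before moving on, a quick sketch (not part of the fits above) comparing the ridge coefficients to the plain least squares coefficients shows the shrinkage from the L2 penalty:
## sketch: compare OLS and ridge coefficients to see the shrinkage
ii = np.arange(X.shape[1])
plt.scatter(ii, lmod.coef_, label='linear regression')
plt.scatter(ii, rcv.coef_, label='ridge')
plt.axhline(0.0, c='red', linestyle='dotted')
plt.xlabel('feature index'); plt.ylabel('coefficient')
plt.legend()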
Now let's try the LASSO.
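In sklearn's parameterization the LASSO minimizes (note the $1/(2n)$ scaling on the squared-error term, so the LASSO alphas are not on the same scale as the ridge ones):

$$\min_{\beta}\; \frac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1$$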
##################################################
## lasso
lcv = LassoCV(cv=10)
lcv.fit(X,y)
# look at alphas used
print("number of alphas used:",lcv.n_alphas)
pd.Series(lcv.alphas_).describe()
plt.plot(lcv.alphas_)
#best alpha and coefficients
print("best alpha: ",lcv.alpha_)
#coefficients
print("coefficients at best alpha: ",lcv.coef_)
print("number of 0 coefficients: ",np.sum(lcv.coef_ == 0))
#fitted values
yhatL = lcv.predict(X)
# cv mse: mse_path_ has one row per alpha and one column per fold
msep = lcv.mse_path_
mses = msep.sum(axis=1)  # total mse across the 10 folds for each alpha
plt.scatter(lcv.alphas_,mses)
plt.xlabel('alpha'); plt.ylabel('mse')
Let's compare the fits from the three different approaches: linear regression, ridge, and LASSO.
##################################################
## look
pyhat = pd.DataFrame({'y':y,'yhatl':yhatl,'yhatr':yhatr,'yhatLasso':yhatL})
pyhat.corr()
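Beyond the correlations, a quick sketch of the in-sample root mean squared error for each fit (still in-sample on all the data, so this measures fit, not prediction):
## sketch: in-sample rmse for each fit
for nm, yh in [('linear', yhatl), ('ridge', yhatr), ('lasso', yhatL)]:
    print(nm, "in-sample rmse:", math.sqrt(np.mean((y - yh)**2)))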
Let's look at the "path" of a particular coefficient as alpha (the amount of shrinkage) varies for the LASSO fit.
# Lasso path
Lpath = lcv.path(X,y)  # returns (alphas, coefficient paths, dual gaps)
for i in range(3):
    print(Lpath[i].shape)
kk = 0 ## choose which coefficient to plot
plt.scatter(Lpath[0],Lpath[1][kk,:])
plt.xlabel('alpha'); plt.ylabel('coefficient')
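To see all the coefficient paths at once, a sketch that simply plots every row of the coefficient matrix returned by path:
## sketch: plot every lasso coefficient path
for k in range(Lpath[1].shape[0]):
    plt.plot(Lpath[0], Lpath[1][k,:])
plt.xlabel('alpha'); plt.ylabel('coefficient')
plt.title('lasso coefficient paths')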