Let’s use the glmnet package to do L1-regularized (LASSO) regression in R.
To do L2 (ridge) regression you would just change the alpha parameter used below from 1 to 0.
The same package can also be used for regularized logistic, Poisson, and multinomial regression.
The package can do the elastic net with L1 (LASSO) and L2 (ridge) as special cases.
The package does a few important things automatically: by default it standardizes the x’s and it fits the model over a whole grid of λ values.
Note that when you elect to have glmnet standardize the x’s, the coefficients are still returned on the original scale.
Let’s use our beloved used cars data set as an example.
Note that the vignette for glmnet is pretty good.
Try browseVignettes() in R to find it.
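Before we get to the data, here is a minimal sketch (with made-up xtoy and ytoy, not the cars data) of how the alpha argument selects the penalty: alpha=1 is the LASSO, alpha=0 is ridge, and anything in between is an elastic net mix.
# minimal sketch with made-up data, just to show the alpha argument
library(glmnet)
set.seed(99)
xtoy = matrix(rnorm(100*5),100,5)          # hypothetical predictors
ytoy = xtoy[,1] - 2*xtoy[,2] + rnorm(100)  # hypothetical response
toy.lasso = cv.glmnet(xtoy, ytoy, alpha=1)   # L1 penalty (LASSO)
toy.ridge = cv.glmnet(xtoy, ytoy, alpha=0)   # L2 penalty (ridge)
toy.enet  = cv.glmnet(xtoy, ytoy, alpha=.5)  # elastic net mix of L1 and L2
# standardize=TRUE is the default; the coefficients still come back on the original scale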
## read in the data
cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
## a lot of the variables are categorical, but they were read in as character, not factor
sapply(cd,is.factor)
## price trim isOneOwner mileage year color
## FALSE FALSE FALSE FALSE FALSE FALSE
## displacement
## FALSE
sapply(cd,is.numeric)
## price trim isOneOwner mileage year color
## TRUE FALSE FALSE TRUE TRUE FALSE
## displacement
## FALSE
# for example, let's look at trim; trim is categorical with 4 categories or levels.
table(cd$trim)
##
## 430 500 550 other
## 143 127 591 139
#let's make the categorical variables factors in R
iifac = c(2,3,6,7) # columns trim, isOneOwner, color, displacement
for(i in iifac) cd[,i] = as.factor(cd[,i])
summary(cd)
## price trim isOneOwner mileage year
## Min. : 995 430 :143 f:841 Min. : 1997 Min. :1994
## 1st Qu.:12995 500 :127 t:159 1st Qu.: 40133 1st Qu.:2004
## Median :29800 550 :591 Median : 67920 Median :2007
## Mean :30583 other:139 Mean : 73652 Mean :2007
## 3rd Qu.:43992 3rd Qu.:100138 3rd Qu.:2010
## Max. :79995 Max. :255419 Max. :2013
## color displacement
## Black :415 4.6 :137
## other :227 5.5 :476
## Silver:213 other:387
## White :145
##
##
#Let’s try a simple regression:
# R will automatically dummy up the factor trim
sreg = lm(price~mileage+trim,cd)
summary(sreg)
##
## Call:
## lm(formula = price ~ mileage + trim, data = cd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24984 -6270 -1182 5068 36846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.062e+04 1.239e+03 32.773 <2e-16 ***
## mileage -2.669e-01 8.499e-03 -31.403 <2e-16 ***
## trim500 -7.689e+02 1.143e+03 -0.673 0.501
## trim550 1.411e+04 1.006e+03 14.019 <2e-16 ***
## trimother 9.966e+03 1.145e+03 8.706 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9372 on 995 degrees of freedom
## Multiple R-squared: 0.7419, Adjusted R-squared: 0.7409
## F-statistic: 715 on 4 and 995 DF, p-value: < 2.2e-16
# we can see the dummies just give a different intercept for the linear relationship
# between price and mileage for each level of trim
plot(cd$mileage,sreg$fitted.values,col=as.integer(cd$trim))
Looks like the trim variable could matter!!!
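To make the “different intercept for each trim level” point concrete, here is a small sketch (my own addition, not part of the original analysis) that builds the four intercepts from the fitted coefficients; the mileage slope is common to all four lines.
# sketch: one intercept per trim level, common mileage slope
b = coef(sreg)                        # (Intercept), mileage, trim500, trim550, trimother
ints = b[1] + c(0, b[3], b[4], b[5])  # 430 is the baseline level, so it gets no dummy
names(ints) = levels(cd$trim)
print(ints)                           # four parallel lines, each with slope b[2]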
Now let’s try the LASSO with all the variables.
First we need to get the \(X\) matrix since glmnet does not use formulas.
# Now let’s try the LASSO. We will use the package glmnet.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
#glmnet does not use the “formula approach”.
#You have to give it a matrix x and response vector y corresponding to the model y=Xβ+ϵ.
#For our simple example above, this means we need the matrix with all the dummies.
#R has a simple way of getting the x matrix given the model formula using the model.matrix function.
x = model.matrix(price~mileage+trim,cd)[,-1] # [,-1] drops the first column, which is the column of ones for the intercept
print(dim(x))
## [1] 1000 4
head(x)
## mileage trim500 trim550 trimother
## 1 36858 0 1 0
## 2 46883 0 1 0
## 3 108759 0 1 0
## 4 35187 0 1 0
## 5 48153 0 1 0
## 6 121748 1 0 0
#Now we are ready to try using glmnet to run the LASSO.
#Note that the vignette for glmnet is pretty good (see browseVignettes() in R).
#We will use all the x variables, so I will call model.matrix again, but this time with all the variables in the data frame cd.
#We first call cv.glmnet. This will do the cross validation needed to choose λ.
y = cd$price/1000 # work with price in thousands of dollars
x = model.matrix(price~.,cd)[,-1]
print(dim(x))
## [1] 1000 11
head(x)
## trim500 trim550 trimother isOneOwnert mileage year colorother colorSilver
## 1 0 1 0 0 36858 2008 0 1
## 2 0 1 0 0 46883 2012 0 0
## 3 0 1 0 0 108759 2007 0 0
## 4 0 1 0 0 35187 2007 0 0
## 5 0 1 0 0 48153 2007 0 0
## 6 1 0 0 0 121748 2002 1 0
## colorWhite displacement5.5 displacementother
## 1 0 1 0
## 2 0 0 0
## 3 1 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 0 1
So we have 11 variables after we dummy up the categorical variables (3 dummies for trim, 1 for isOneOwner, 3 for color, 2 for displacement, plus mileage and year).
Ok, let’s do the LASSO. We actually use the function cv.glmnet, since it will do the cross validation.
set.seed(14) # Dave Keon
cars.gcv = cv.glmnet(x, y, type.measure = "mse", nfolds = 10,alpha=1)
#alpha=1 is the lasso penalty, and alpha=0 the ridge penalty.
# nfolds=10 gives 10-fold cross-validation
plot(cars.gcv)
In this case the cross-validation minimum is obtained at a small lambda = little shrinkage = a more complicated model.
And the min lambda does not set any of the coefficients to 0.
The 1se lambda is perhaps more interesting since it suggests there might be a simpler model that does about as well.
Let’s pull off the min lambda value and the 1se lambda value.
## let's get the good lambda values
lmin = cars.gcv$lambda.min
l1se = cars.gcv$lambda.1se
cat("lambda min and lambda 1se are: ",lmin,l1se,"\n")
## lambda min and lambda 1se are: 0.02406165 0.2966433
Wow, they are pretty different!!!
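Where does lambda.1se come from? It is the largest lambda whose cross-validated error is within one standard error of the minimum. Here is a quick check of that (my own sketch, using the cvm and cvsd components returned by cv.glmnet).
# check the 1se rule: largest lambda with cv error within one se of the minimum
imin = which.min(cars.gcv$cvm)
ok = cars.gcv$cvm <= cars.gcv$cvm[imin] + cars.gcv$cvsd[imin]
max(cars.gcv$lambda[ok])   # should reproduce cars.gcv$lambda.1se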
Let’s plot the rmse since it is more interpretable than the mse.
#Let’s plot the RMSE.
crmse = sqrt(cars.gcv$cvm) # cvm is the cross-validated mse; since y is price in $1000s, the rmse is in $1000s too
plot(log(cars.gcv$lambda),crmse,xlab="log(lambda)",ylab="rmse")
cat("the min rmse is:",min(crmse),"\n")
## the min rmse is: 6.47855
Let’s look at the LASSO coefficient values.
We could use the glmnet function, but we can just pull the LASSO fit out of the data structure returned by cv.glmnet.
#Now lets get the actual LASSO fit.
#We can run the function glmnet or we can get the fit on all the data from the cv results.
#Let’s just pull it out of the cv results and have a look at the coefficients.
cars.las = cars.gcv$glmnet.fit
plot(cars.las)
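For completeness, here is the alternative mentioned above (a sketch, not something we need): calling glmnet directly fits the same LASSO path, and plotting it against log(lambda) with labels can be easier to read.
# the alternative route: fit the lasso path directly with glmnet
cars.las2 = glmnet(x, y, alpha=1)
plot(cars.las2, xvar="lambda", label=TRUE) # same coefficient paths, indexed by log(lambda)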
Now let’s get the coefficients at the min and 1se lambda.
## get the 1se coefficients
bhatL1 = coef(cars.las,s=l1se)[,1] #[,1] gets rid of sparse matrix format
names(bhatL1[abs(bhatL1)==0]) # which ones are 0?
## [1] "trim500" "colorother"
## min mse lambda
bhatLm = coef(cars.las,s=lmin)[,1] #[,1] gets rid of sparse matrix format
names(bhatLm[abs(bhatLm)==0]) # which ones are 0?
## character(0)
The min lambda has no zero coefficients, while the 1se lambda has two zero coefficients.
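A quick side-by-side look (my own addition) makes the shrinkage visible: the 1se coefficients are mostly pulled further toward 0 than the min-lambda coefficients.
# compare the two coefficient vectors
round(cbind(lambda.min = bhatLm, lambda.1se = bhatL1), 3)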
Now let’s get fitted values at a choice of lambda.
We’ll give it our training x, so these will be in-sample fits.
Of course, if we wanted out-of-sample predictions, we would supply a different x.
##Now let’s get some predictions.
##We’ll predict at our train x, but of course we would be using different x if we were predicting out of sample.
fmat = predict(cars.las,newx=x,s=c(lmin,l1se))
fmat = cbind(y,fmat,sreg$fitted.values/1000) # divide by 1000 so the linear fits are on the same $1000 scale as y
colnames(fmat) = c("y","yhatmin","yhat1se","linear")
pairs(fmat)
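As a quick follow-up (my own check, not in the original), we can compare the in-sample rmse of the three sets of fitted values; remember these are training-data fits, so this is not a substitute for the cross-validation above.
# in-sample rmse for each column of fitted values (units are $1000s)
rmse = function(y,yhat) sqrt(mean((y-yhat)^2))
apply(fmat[,-1], 2, function(f) rmse(fmat[,"y"], f))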