Problem 1

Let’s use the LASSO and polynomial terms to fit the cars data (susedcars.csv).
Let’s just use mileage and year.

### get data
cd = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")

scd = cd[,c(1,4,5)]
scd$price = scd$price/1000
scd$mileage = scd$mileage/1000
head(scd)
##    price mileage year
## 1 43.995  36.858 2008
## 2 44.995  46.883 2012
## 3 25.999 108.759 2007
## 4 33.880  35.187 2007
## 5 34.895  48.153 2007
## 6  5.995 121.748 2002
lms = lm(price~.,scd)
print(summary(lms))
## 
## Call:
## lm(formula = price ~ ., data = scd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.857  -4.855  -1.670   3.483  34.499 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.365e+03  1.716e+02  -31.27   <2e-16 ***
## mileage     -1.537e-01  8.339e-03  -18.43   <2e-16 ***
## year         2.694e+00  8.526e-02   31.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.544 on 997 degrees of freedom
## Multiple R-squared:  0.8325, Adjusted R-squared:  0.8321 
## F-statistic:  2477 on 2 and 997 DF,  p-value: < 2.2e-16

Previously we saw that adding mileage squared helped the model.

But if you look at the predictions for large values of mileage, the predicted price actually went up as the mileage went up. That does not sound right!!
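To see what goes wrong, here is a minimal sketch. I am assuming the earlier model was simply quadratic in mileage (the exact model from before is not shown here), so treat this as illustrative:

lmq = lm(price ~ mileage + I(mileage^2), scd)  # quadratic in mileage
mgrid = data.frame(mileage = seq(0, 250, length.out = 200))
plot(scd$mileage, scd$price, cex = .4)
lines(mgrid$mileage, predict(lmq, mgrid), col = "red", lwd = 2)  # fitted curve turns back up at high mileage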

Let’s try throwing in a lot of polynomial terms using both mileage and year.
We could easily create the polynomial terms by hand, but both R and sklearn have built-in ways
to create them.

x = as.matrix(scd[,c(2,3)])
xpoly = poly(x,degree=2,raw=TRUE) # a matrix
print(head(xpoly))
##          1.0       2.0  0.1       1.1     0.2
## [1,]  36.858  1358.512 2008  74010.86 4032064
## [2,]  46.883  2198.016 2012  94328.60 4048144
## [3,] 108.759 11828.520 2007 218279.31 4028049
## [4,]  35.187  1238.125 2007  70620.31 4028049
## [5,]  48.153  2318.711 2007  96643.07 4028049
## [6,] 121.748 14822.576 2002 243739.50 4008004

So, the 1.0 column is mileage, the 2.0 is mileage squared, the 0.1 is year,
the 1.1 is mileage times year, and the 0.2 column is year squared.

For example, 1358.512164 is the squared mileage for the first observation (36.858^2 = 1358.512164).
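You can check the column naming directly against the by-hand terms:

all.equal(xpoly[,"2.0"], scd$mileage^2)          # mileage squared
all.equal(xpoly[,"1.1"], scd$mileage * scd$year) # mileage times year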

You might want to consider using raw=FALSE, which will give you orthogonal polynomials.
Orthogonal polynomials carry the same polynomial information, but the terms are much less correlated with one another.
This avoids the dreaded multicollinearity.

For example,

print(cor(xpoly))
##            1.0        2.0        0.1        1.1        0.2
## 1.0  1.0000000  0.9465403 -0.7447292  0.9999942 -0.7449288
## 2.0  0.9465403  1.0000000 -0.6604999  0.9461433 -0.6605108
## 0.1 -0.7447292 -0.6604999  1.0000000 -0.7430447  0.9999991
## 1.1  0.9999942  0.9461433 -0.7430447  1.0000000 -0.7432471
## 0.2 -0.7449288 -0.6605108  0.9999991 -0.7432471  1.0000000
xpoly2 = poly(x,degree=2,raw=FALSE) # a matrix
print(cor(xpoly2))
##               1.0           2.0           0.1        1.1           0.2
## 1.0  1.000000e+00 -2.184040e-17 -7.447292e-01 -0.3653938 -1.508954e-01
## 2.0 -2.184040e-17  1.000000e+00  1.376887e-01 -0.6974913  4.158876e-01
## 0.1 -7.447292e-01  1.376887e-01  1.000000e+00  0.3030902  6.217831e-18
## 1.1 -3.653938e-01 -6.974913e-01  3.030902e-01  1.0000000 -6.525176e-01
## 0.2 -1.508954e-01  4.158876e-01  6.217831e-18 -0.6525176  1.000000e+00

If you want to create the polynomial terms by hand, you might try just
subtracting the mean from the x variables before you transform them.
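Here is a minimal by-hand sketch along those lines (center first, then build the degree-2 terms):

mc = scd$mileage - mean(scd$mileage)
yc = scd$year - mean(scd$year)
xhand = cbind(m = mc, m2 = mc^2, y = yc, my = mc*yc, y2 = yc^2)
print(round(cor(xhand), 3))  # centering cuts the correlation between a variable and its square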

Try putting in "a lot" of polynomial terms, use the LASSO to regularize the model, and see how good an out-of-sample rmse you can get.

Using orthogonal polynomials and degrees bigger than 2, I got the in-sample fit depicted below.
I used cross-validation in glmnet to pick good LASSO shrinkage and selection and then fit using all the data.

Maybe you should leave out some test data!! (optional, your call)
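For concreteness, here is a minimal glmnet sketch of one way to do this; the degree (4), the 75/25 split, and the use of lambda.min are my choices, not given in the problem:

library(glmnet)
set.seed(99)
xbig = as.matrix(poly(x, degree = 4, raw = FALSE))  # all terms up to total degree 4
n = nrow(xbig)
ii = sample(1:n, floor(.75 * n))                    # simple 75/25 train/test split
cvfit = cv.glmnet(xbig[ii,], scd$price[ii], alpha = 1)  # alpha = 1 is the LASSO
yhat = predict(cvfit, newx = xbig[-ii,], s = "lambda.min")
sqrt(mean((scd$price[-ii] - yhat)^2))               # out-of-sample rmse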

[Figure: polyfit1 (in-sample fit)]
[Figure: polyfit2 (in-sample fit)]

Problem 2

Turn the previous problem into a logit-type problem by binarizing y = price.

y = scd$price
qval = quantile(y,probs=c(.75))
ybin = as.integer(y > qval)  # car is "expensive" if its price is above the 75th percentile
#plot(y,ybin)
colv = ybin + 2  # plot colors: 2 = red (not expensive), 3 = green (expensive)
plot(scd$mileage,scd$year,col=colv,cex=.5,pch=16)

Now relate ybin to polynomials in (mileage, year).
Again use L1 or L2 (or elastic net) shrinkage to get good predictions.
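Here is a minimal sketch of the logit version, again assuming degree-4 orthogonal polynomial features (that choice is mine, not part of the problem):

library(glmnet)
xbig = as.matrix(poly(x, degree = 4, raw = FALSE))
cvfitb = cv.glmnet(xbig, ybin, family = "binomial", alpha = 1)  # LASSO-penalized logit
phat = predict(cvfitb, newx = xbig, s = "lambda.min", type = "response")  # P(expensive | x)
yhatbin = as.integer(phat > .5)
table(ybin, yhatbin)  # in-sample confusion matrix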

This is what I got when I used the LASSO (with logit) to get the model
and then got the in-sample “predictions” for whether a car is expensive or not.

[Figure: logitLasso (in-sample predictions)]