---
title: "Machine Learning Homework 4"
author: "Rob McCulloch"
date: "2/13/2018"
output:
  pdf_document:
    keep_tex: true
    toc: true
    number_sections: yes
fontsize: 10pt
linkcolor: blue
urlcolor: blue
header-includes:
- \usepackage[]{graphicx}
- \usepackage[]{color}
- \usepackage{amsmath}
- \usepackage{relsize}
- \usepackage{algorithm2e}
- \usepackage{animate}
- \newcommand{\sko}{\vspace{.1in}}
- \newcommand{\skoo}{\vspace{.2in}}
- \newcommand{\skooo}{\vspace{.3in}}
- \newcommand{\rd}[1]{\textcolor{red}{#1}}
- \newcommand{\bl}[1]{\textcolor{blue}{#1}}
- \newcommand{\tbf}[1]{\textbf{\texttt{#1}}}
- \newcommand{\ird}[1]{\textit{\textcolor{red}{#1}}}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(dev = 'pdf')
knitr::opts_chunk$set(collapse = TRUE)
```
# Out of Sample Predictive Performance of Variable Subsets
In class we looked at an example where we did a simple train/test loop
to investigate the out of sample predictive performance of different variable
subsets.
The pdf is on the webpage at:
http://www.rob-mcculloch.org/2018_ml/webpage/notes/properties.pdf
and the R markdown is at:
http://www.rob-mcculloch.org/2018_ml/webpage/notes/properties.Rmd
We compared the predictive performance of the subset (x1,x2) with that of the subset (x3).
The subset (x1,x2) consists of the starred variables in the usual R regression summary.
First, check the code!
Is (x3) really better than (x1,x2) in terms of out of sample prediction error?
Second, modify the code to compare the subsets (x1), (x2), (x3), (x4), (x5), and (x1,x2).
That is, try each single x variable and the subset (x1,x2).
How do they compare?
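If you want to check the shape of your loop, here is a minimal sketch of the comparison. Since the course data file is not reproduced here, it simulates a stand-in data frame with columns y and x1-x5; to do the actual exercise, swap in the data from properties.Rmd.

```{r subsetloop}
set.seed(14)
n = 200
x = matrix(rnorm(n*5), ncol=5); colnames(x) = paste0("x",1:5)
simdf = data.frame(y = 2*x[,1] - x[,2] + rnorm(n), x)  # simulated stand-in data

subsets = list("x1","x2","x3","x4","x5",c("x1","x2"))  # the subsets to compare
nrep = 100                                             # number of train/test splits
rmse = matrix(0, nrep, length(subsets))
for(r in 1:nrep) {
   tr = sample(1:n, n/2)                               # random half for training
   for(j in seq_along(subsets)) {
      fml = as.formula(paste("y ~", paste(subsets[[j]], collapse="+")))
      fit = lm(fml, simdf[tr,])
      yhat = predict(fit, simdf[-tr,])
      rmse[r,j] = sqrt(mean((simdf$y[-tr] - yhat)^2))  # out-of-sample RMSE
   }
}
colnames(rmse) = sapply(subsets, paste, collapse=",")
colMeans(rmse)                                         # average RMSE per subset
```

In this simulated example y is generated from x1 and x2, so (x1,x2) should win; what happens with the course data is the question.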
\vspace{.3in}
\newpage
# Properties of Least Squares
## $\hat{\beta}$ by Matrix Operations
Let's compute
$$
\hat{\beta} = (X'X)^{-1} X'y
$$
and check that we get the same thing as the R function lm.
Let's also check the FOCs (first order conditions).
We'll use the same data as in the previous problem.
I'll quickly show you the matrix operations in R and then you can try them on the data.
First, I'll make a toy $X$ and $y$ to use in my examples.
\
\
Toy data:
```{r toydat,include=TRUE,echo=TRUE}
n=4
X = cbind(rep(1,4),1:4) #cbind concatenates columns, rep repeats 1 four times.
y = c(1,2,4,3)
print(X); print(y)
is.matrix(X) #does cbind give me a matrix?
```
Matrix operations:
```{r matop,include=TRUE,echo=TRUE}
t(X) #transpose of X
xtx = t(X) %*% X # X'X, note the R form for matrix multiplication
xtxi = solve(xtx) # matrix inversion
xtx %*% xtxi # check we get the identity
```
Compute $\hat{\beta}$:
```{r bhat,include=TRUE,echo=TRUE}
bhat = solve(t(X)%*%X) %*% t(X) %*% y
bhat
lmf = lm(y~x, data.frame(x=X[,2], y=y)) # same fit via lm, for comparison
lmf$coef
```
FOC:
```{r foc,include=TRUE,echo=TRUE}
t(X) %*% (y- X %*% matrix(bhat,ncol=1))
```
Ok, now compute $\hat{\beta}$ and check the first order conditions for the data from problem 1.
Just repeat what I just did on the toy data with the real data.
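One convenient idiom for building the design matrix from a data frame is model.matrix, which adds the intercept column for you. A sketch on simulated stand-in data (for the exercise, read in the problem 1 data instead):

```{r bhatreal}
set.seed(99)
n = 50
d = data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n)) # stand-in data
X = model.matrix(y ~ ., d)   # design matrix; first column is the intercept
y = d$y
bhat = solve(t(X) %*% X) %*% t(X) %*% y
cbind(bhat, coef(lm(y ~ ., d)))     # the two columns should match
t(X) %*% (y - X %*% bhat)           # FOC: numerically zero
```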
## $\hat{\sigma}$ and Standard Errors
Get $\hat{\sigma}$ and $se(\hat{\beta}_i)$ for the problem 1 data
directly from the formulas, using R matrix operations and vector calculations.
That is, use:
$$
\hat{\sigma}^2 = \frac{1}{n-p} ||y-X \hat{\beta}||^2 = \frac{1}{n-p} \sum (y_i - x'_i \hat{\beta})^2
$$
and
$$
Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}.
$$
Compare your numbers to the regression output.
Note, for the toy data:
```{r shat}
shat = summary(lmf)$sigma
cat("sigma hat is: ",shat,"\n")
sqrt(diag(solve(t(X) %*% X))) * shat
summary(lmf)
```
## Correlations
Here is the regression of y on x1-x5 and
the regression of y on x1-x5 with the x's demeaned.
How do the regression outputs compare?
Regression (again) of y on x1-x5:
```{r demean1}
xyd = read.csv("sim-reg-data.csv")
lmd = lm(y~.,xyd)
summary(lmd)
```
Now demean the x's.
```{r demean2}
xydd = xyd
for(i in 2:6) xydd[[i]] = xydd[[i]] - mean(xydd[[i]])
lmdd = lm(y~.,xydd)
summary(lmdd)
mean(xyd$y)
```
Now let's look at all the correlations between y, x1-x5, yhat, and e.
```{r corfit}
fmat = cbind(xyd,lmd$fitted,lmd$residuals)
names(fmat)[c(7,8)] = c("yhat","e")
cor(fmat)
```
Note that the residuals (e) are uncorrelated with each $x$.
Why are the residuals uncorrelated with the fitted values (yhat)?
Square the correlation between y and yhat.
How does it compare with $R^2$ (see ``Multiple R-squared'') in the R
regression summary?
# Orthogonalized Regression
Now let's compare the regression of y on x1-x5 with the regression
of y on x1-x4 and e5, where e5 is the residual from regressing x5 on x1-x4.
```{r ry1t4}
lmfy = lm(y~x1+x2+x3+x4,xyd)
summary(lmfy)
```
Regress x5 on x1-x4 and then replace x5 with the residuals from this
regression.
```{r rx1t4}
lmf5 = lm(x5~x1+x2+x3+x4,xyd)
e5 = lmf5$residuals
xyde = cbind(xyd[,1:5],e5)
lmfe = lm(y~.,xyde)
summary(lmfe)
```
How do the coefficients from this last regression compare to the coefficients of x1-x4
in the regression of y on x1-x4?
How does the coefficient of e5 in this last regression compare to the coefficient of x5
in the regression of y on x1-x5?
\
\
What is this number?
```{r se5}
lmf = lm(y~.,xyd); shat = summary(lmf)$sigma
shat/sqrt(sum(e5^2))
```
Find this number in the output from the regression of y on x1-x5;
that is, check that it is indeed the standard error for the coefficient of x5.
What is the $R^2$ from the regression of x5 on x1-x4?
Run the regression of y on just x5.
How does the standard error of the coefficient of x5 in this regression compare
to the standard error of the coefficient of x5 in the regression of y on x1-x5?
Explain (in English) why they are so different.
# ** Predictive Variance
Suppose we have training data $(X,y)$ and $x_f$ at which we wish to predict the future $Y_f$.
Then our usual prediction is
$$
\hat{Y}_f = x_f' \, \hat{\beta}
$$
where $\hat{\beta}$ is the OLS (ordinary least squares) estimate of $\beta$ obtained from the
training data.
Obtain a nice matrix formula for
$$
Var(E_f) = Var(Y_f - \hat{Y}_f)
$$
Note in particular that $Var(E_f)$ incorporates both the variation in $\hat{\beta}$ coming from the training data $(X,y)$ and the variation in $Y_f$ itself.
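One way to sanity-check whatever formula you derive is by simulation: hold $X$ and $x_f$ fixed, repeatedly draw a training response and a future observation, and estimate the variance of the prediction error. A minimal Monte Carlo sketch, assuming the standard model $y = X\beta + \epsilon$ with normal errors (the specific $X$, $\beta$, and $x_f$ here are made up for illustration):

```{r predvarsim}
set.seed(37)
n = 20; sigma = 1
X = cbind(1, rnorm(n))                 # fixed training design
beta = c(1, 2)
xf = c(1, 0.5)                         # point at which we predict
nrep = 20000
ef = numeric(nrep)
for(r in 1:nrep) {
   y  = X %*% beta + sigma*rnorm(n)           # fresh training response
   bh = solve(t(X) %*% X, t(X) %*% y)         # OLS estimate from training data
   Yf = sum(xf*beta) + sigma*rnorm(1)         # future observation at xf
   ef[r] = Yf - sum(xf*bh)                    # prediction error E_f
}
var(ef)  # Monte Carlo estimate of Var(E_f); compare with your formula
```

Note that the estimate comes out larger than $\sigma^2 = 1$: the prediction error has more variance than the noise alone, which is exactly what your formula should explain.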
# ** R-squared
Show that in linear regression
$$
cor(\hat{y},y)^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}
$$
That is, show that the square of the correlation between y and the fitted values is indeed
the same as the usual formula for R-squared.
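The algebraic proof is the exercise, but a quick numerical check on simulated data can confirm what you are aiming for:

```{r r2check}
set.seed(7)
n = 60
d = data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n)) # simulated data
fit = lm(y ~ ., d)
yhat = fitted(fit); ybar = mean(d$y)
c(cor(yhat, d$y)^2,                                # squared correlation
  sum((yhat - ybar)^2) / sum((d$y - ybar)^2),     # the ratio in the display
  summary(fit)$r.squared)                          # R's reported R-squared
```

All three numbers agree (to machine precision), which is what the identity asserts.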