Check QR computation of least squares \(\beta\)

Get data.

cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
y = cd$price/1000
n = length(y)
X = cbind(rep(1,n),cd$mileage/1000,cd$year)
ddf = data.frame(X,y)

Get QR.

qrres = qr(X)
Q = qr.Q(qrres)
R = qr.R(qrres)
#check
temp = Q %*% R
summary(temp - X)
##        V1                   V2                   V3           
##  Min.   :-1.110e-16   Min.   :-2.842e-14   Min.   :3.001e-11  
##  1st Qu.:-1.110e-16   1st Qu.: 0.000e+00   1st Qu.:3.001e-11  
##  Median :-1.110e-16   Median : 7.105e-15   Median :3.001e-11  
##  Mean   :-1.080e-16   Mean   : 9.218e-15   Mean   :3.101e-11  
##  3rd Qu.:-1.110e-16   3rd Qu.: 1.421e-14   3rd Qu.:3.001e-11  
##  Max.   : 2.887e-15   Max.   : 6.821e-13   Max.   :9.836e-10
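As a further check, Q should have orthonormal columns, i.e. \(Q'Q = I\). A small sketch on a simulated design matrix (simulated so it runs standalone; the cars X works the same way):

```r
# Q from a QR decomposition has orthonormal columns: t(Q) %*% Q = I.
set.seed(1)
Xs = cbind(1, rnorm(5), rnorm(5))   # small made-up design matrix
Qs = qr.Q(qr(Xs))
round(t(Qs) %*% Qs, 10)             # should be the 3x3 identity
```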

Get \(\hat{\beta}\) using QR.

bhat1 = solve(R) %*% t(Q) %*% y
print(bhat1)
##               [,1]
## [1,] -5365.4898723
## [2,]    -0.1537219
## [3,]     2.6943495

Get \(\hat{\beta}\) from R's lm for comparison.

lmres = lm(y~.-1,ddf) # have 1 in X so don't fit an intercept
print(lmres$coef)
##            X1            X2            X3 
## -5365.4898723    -0.1537219     2.6943495

Above we did it the simple but inefficient way, by explicitly inverting R.
Often it is a good idea to do simple-but-stupid first, and then be smart!!
Since R is upper triangular, it is better to use backsolve than to invert R.

qy = t(Q) %*% y
bhat2 = backsolve(R,qy)
print(bhat2)
##               [,1]
## [1,] -5365.4898723
## [2,]    -0.1537219
## [3,]     2.6943495
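Base R also provides qr.coef, which gets the least squares coefficients directly from the qr object. A quick check that it agrees with the backsolve approach, on simulated data (the variable names below are made up for illustration):

```r
# qr.coef vs. the by-hand backsolve on a simulated regression.
set.seed(2)
ns = 50
Xs = cbind(1, rnorm(ns))                         # design with intercept column
ys = 2 + 3*Xs[,2] + rnorm(ns)
qrs = qr(Xs)
b1 = qr.coef(qrs, ys)                            # built-in
b2 = backsolve(qr.R(qrs), t(qr.Q(qrs)) %*% ys)   # by hand
cbind(b1, b2)                                    # same numbers
```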

Orthogonal Projections

Get cars data.

cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd = cd[,c(1,4,5)]
cd$price = cd$price/1000
cd$mileage = cd$mileage/1000
head(cd)
##    price mileage year
## 1 43.995  36.858 2008
## 2 44.995  46.883 2012
## 3 25.999 108.759 2007
## 4 33.880  35.187 2007
## 5 34.895  48.153 2007
## 6  5.995 121.748 2002

Regress price on mileage alone.

lm1 = lm(price~mileage,cd)
print(summary(lm1))
## 
## Call:
## lm(formula = price ~ mileage, data = cd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.670  -7.063   0.239   6.293  37.024 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.35978    0.67063   84.04   <2e-16 ***
## mileage     -0.34997    0.00787  -44.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.67 on 998 degrees of freedom
## Multiple R-squared:  0.6646, Adjusted R-squared:  0.6643 
## F-statistic:  1978 on 1 and 998 DF,  p-value: < 2.2e-16
coef1 = lm1$coef
print(coef1)
## (Intercept)     mileage 
##  56.3597845  -0.3499745

Regress price on (mileage, year).

lm2 = lm(price~.,cd)
print(summary(lm2))
## 
## Call:
## lm(formula = price ~ ., data = cd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.857  -4.855  -1.670   3.483  34.499 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.365e+03  1.716e+02  -31.27   <2e-16 ***
## mileage     -1.537e-01  8.339e-03  -18.43   <2e-16 ***
## year         2.694e+00  8.526e-02   31.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.544 on 997 degrees of freedom
## Multiple R-squared:  0.8325, Adjusted R-squared:  0.8321 
## F-statistic:  2477 on 2 and 997 DF,  p-value: < 2.2e-16
coef2 = lm2$coef
print(coef2)
##   (Intercept)       mileage          year 
## -5365.4898723    -0.1537219     2.6943495

Regress mileage on year and keep the residuals, M.Y.

lm3 = lm(mileage~year,cd)
M.Y = lm3$residuals

Regress price on M.Y.

tdf = data.frame(price = cd$price, M.Y)
lm4 = lm(price~M.Y,tdf)
print(summary(lm4))
## 
## Call:
## lm(formula = price ~ M.Y, data = tdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.394 -15.458  -0.519  12.592  48.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.58332    0.56562  54.070  < 2e-16 ***
## M.Y         -0.15372    0.01977  -7.775 1.88e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.89 on 998 degrees of freedom
## Multiple R-squared:  0.05711,    Adjusted R-squared:  0.05616 
## F-statistic: 60.45 on 1 and 998 DF,  p-value: 1.876e-14
coef4 = lm4$coef
print(coef4)
## (Intercept)         M.Y 
##  30.5833180  -0.1537219

How does the coefficient of mileage in R2 compare with the coefficient of M.Y in R4?

print(coef2)
##   (Intercept)       mileage          year 
## -5365.4898723    -0.1537219     2.6943495
print(coef4)
## (Intercept)         M.Y 
##  30.5833180  -0.1537219

They are exactly the same!!
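This is the Frisch–Waugh result: since M.Y is orthogonal to year and to the intercept, the slope is just the simple projection \(\sum_i r_i y_i / \sum_i r_i^2\). A sketch on simulated data (simulated, since the cars data requires a download; all names below are made up):

```r
# Frisch-Waugh: full-regression slope = projection onto the residualized regressor.
set.seed(3)
ns = 200
year2 = rnorm(ns)
mileage2 = 0.5*year2 + rnorm(ns)          # regressors are correlated
price2 = 1 - 0.15*mileage2 + 2*year2 + rnorm(ns)
MY2 = lm(mileage2 ~ year2)$residuals      # residualize mileage on year
b_full = coef(lm(price2 ~ mileage2 + year2))["mileage2"]
b_proj = sum(MY2*price2)/sum(MY2^2)       # simple projection coefficient
c(b_full, b_proj)                         # identical
```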

Let’s also check the standard errors.

sigmahat = summary(lm2)$sigma
check = sigmahat/sqrt(sum(M.Y^2))
cat('stderr of mileage in R2 should be: ',check,'\n')
## stderr of mileage in R2 should be:  0.008338762
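The same identity can be checked against the standard error reported by lm, again on simulated data (variable names here are hypothetical):

```r
# The std error of a coefficient is sigmahat / sqrt(sum of squared residuals
# from regressing that column on the other regressors).
set.seed(7)
ns = 300
x1 = rnorm(ns); x2 = 0.6*x1 + rnorm(ns)
yy = 1 + 2*x1 - 3*x2 + rnorm(ns)
fit = lm(yy ~ x1 + x2)
r1 = lm(x1 ~ x2)$residuals                 # residualize x1 on x2 (and intercept)
se_formula = summary(fit)$sigma / sqrt(sum(r1^2))
se_lm = summary(fit)$coefficients["x1", "Std. Error"]
c(se_formula, se_lm)                       # identical
```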

Basic Gaussian Process Regression

True function.

n = 100
x = seq(from=-2,to=2,length.out=n)
fx = x^3

\(\Sigma\) matrix.

sf = 4    # function scale (prior standard deviation)
l = .1    # length scale

# squared-exponential covariance: Sigma[i,j] = sf^2 exp(-(x_i - x_j)^2 / (2 l^2))
Sigma = matrix(0.0,n,n)
for(i in 1:n) {
   for(j in 1:n) {
      stand = (x[i]-x[j])/l
      Sigma[i,j] = (sf^2) * exp(-.5*stand^2)
   }
}
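The double loop can also be written in one line with outer(). A small self-contained check that the two constructions agree (on a hypothetical 10-point grid):

```r
# Loop vs. vectorized construction of the squared-exponential kernel.
xs = seq(from=-2, to=2, length.out=10)
sfs = 4; ls = .1
Sig_loop = matrix(0.0, 10, 10)
for(i in 1:10) for(j in 1:10)
   Sig_loop[i,j] = (sfs^2) * exp(-.5*((xs[i]-xs[j])/ls)^2)
# outer(xs, xs, "-") gives all pairwise differences at once
Sig_vec = (sfs^2) * exp(-.5*(outer(xs, xs, "-")/ls)^2)
max(abs(Sig_loop - Sig_vec))   # essentially zero
```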

Conditional distribution.
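Recall that for a zero-mean joint Gaussian with covariance blocks \(\Sigma_{yy}\), \(\Sigma_{yx}\), \(\Sigma_{xx}\), the conditional is again Gaussian:

\(y \mid x \sim N\!\left(\Sigma_{yx}\Sigma_{xx}^{-1}\,x,\;\; \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right)\)

so with \(B = \Sigma_{yx}\Sigma_{xx}^{-1}\), the conditional mean is \(B\,x\) and the conditional covariance is \(\Sigma_{yy} - B\,\Sigma_{xy}\).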

Iy = 76:85           # indices where f is treated as unknown
Ix = c(1:75,86:100)  # indices where f is observed

Syy = Sigma[Iy,Iy]
Sxx = Sigma[Ix,Ix]

Syx = Sigma[Iy,Ix]
Sxy = Sigma[Ix,Iy]

xp = fx[Ix]   # observed function values

B = Syx %*% solve(Sxx)    # regression matrix
mygx = B %*% xp           # conditional mean
Sygx = Syy - B %*% Sxy    # conditional covariance

Simulate from conditional.

set.seed(34)
nd = 1000                # number of draws
Lygx = t(chol(Sygx))     # lower-triangular Cholesky factor of Sygx
# each column of dygx is one draw from the conditional; 10 = length(Iy)
dygx = as.double(mygx) + Lygx %*% matrix(rnorm(10*nd),nrow=10)
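As a check that the Cholesky construction gives draws with the right covariance, the sample covariance of many such draws should approximate the target. A small self-contained example with a made-up 2x2 \(\Sigma\):

```r
# If S = L L' and z ~ N(0, I), then L z ~ N(0, S).
set.seed(99)
S = matrix(c(2, .8, .8, 1), 2, 2)
L = t(chol(S))                               # lower-triangular factor
Z = L %*% matrix(rnorm(2*100000), nrow=2)    # each column is one draw
round(cov(t(Z)), 2)                          # close to S
```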

Plot.

### plot GP inference
plot(x,fx,type="n")
qmat = apply(t(dygx),2,quantile,probs=c(.1,.5,.9)) # 10%, 50%, 90% at each held-out x
for(i in 1:nrow(dygx)) {
    ii = 75+i   # grid index of the i-th held-out point (Iy = 76:85)
    lines(x[rep(ii,2)],qmat[c(1,3),i],col='red')   # 80% interval
    points(x[ii],qmat[2,i],col="red",pch=16)       # posterior median
}
}
lines(x,fx,col='gray',lwd=3)