Get data.
cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
y = cd$price/1000
n = length(y)
X = cbind(rep(1,n),cd$mileage/1000,cd$year)
ddf = data.frame(X,y)
Get QR.
qrres = qr(X)
Q = qr.Q(qrres)
R = qr.R(qrres)
#check
temp = Q %*% R
summary(temp - X)
## V1 V2 V3
## Min. :-1.110e-16 Min. :-2.842e-14 Min. :3.001e-11
## 1st Qu.:-1.110e-16 1st Qu.: 0.000e+00 1st Qu.:3.001e-11
## Median :-1.110e-16 Median : 7.105e-15 Median :3.001e-11
## Mean :-1.080e-16 Mean : 9.218e-15 Mean :3.101e-11
## 3rd Qu.:-1.110e-16 3rd Qu.: 1.421e-14 3rd Qu.:3.001e-11
## Max. : 2.887e-15 Max. : 6.821e-13 Max. :9.836e-10
Get \(\hat{\beta}\) using QR.
bhat1 = solve(R) %*% t(Q) %*% y
print(bhat1)
## [,1]
## [1,] -5365.4898723
## [2,] -0.1537219
## [3,] 2.6943495
Get from R.
lmres = lm(y~.-1,ddf) # have 1 in X so don't fit an intercept
print(lmres$coef)
## X1 X2 X3
## -5365.4898723 -0.1537219 2.6943495
Did it the stupid way above by inverting R.
Often it is a good idea to do stupid but simple first, and then be
smart!!
Better to use backsolve than invert R.
qy = t(Q) %*% y
bhat2 = backsolve(R,qy)
print(bhat2)
## [,1]
## [1,] -5365.4898723
## [2,] -0.1537219
## [3,] 2.6943495
Get cars data.
cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd = cd[,c(1,4,5)]
cd$price = cd$price/1000
cd$mileage = cd$mileage/1000
head(cd)
## price mileage year
## 1 43.995 36.858 2008
## 2 44.995 46.883 2012
## 3 25.999 108.759 2007
## 4 33.880 35.187 2007
## 5 34.895 48.153 2007
## 6 5.995 121.748 2002
Regress price on mileage alone.
lm1 = lm(price~mileage,cd)
print(summary(lm1))
##
## Call:
## lm(formula = price ~ mileage, data = cd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.670 -7.063 0.239 6.293 37.024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.35978 0.67063 84.04 <2e-16 ***
## mileage -0.34997 0.00787 -44.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.67 on 998 degrees of freedom
## Multiple R-squared: 0.6646, Adjusted R-squared: 0.6643
## F-statistic: 1978 on 1 and 998 DF, p-value: < 2.2e-16
coef1 = lm1$coef
print(coef1)
## (Intercept) mileage
## 56.3597845 -0.3499745
price on (mileage,year).
lm2 = lm(price~.,cd)
print(summary(lm2))
##
## Call:
## lm(formula = price ~ ., data = cd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.857 -4.855 -1.670 3.483 34.499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.365e+03 1.716e+02 -31.27 <2e-16 ***
## mileage -1.537e-01 8.339e-03 -18.43 <2e-16 ***
## year 2.694e+00 8.526e-02 31.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.544 on 997 degrees of freedom
## Multiple R-squared: 0.8325, Adjusted R-squared: 0.8321
## F-statistic: 2477 on 2 and 997 DF, p-value: < 2.2e-16
coef2 = lm2$coef
print(coef2)
## (Intercept) mileage year
## -5365.4898723 -0.1537219 2.6943495
mileage on year.
lm3 = lm(mileage~year,cd)
M.Y = lm3$residuals
price on M.Y
tdf = data.frame(price = cd$price, M.Y)
lm4 = lm(price~M.Y,tdf)
print(summary(lm4))
##
## Call:
## lm(formula = price ~ M.Y, data = tdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.394 -15.458 -0.519 12.592 48.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.58332 0.56562 54.070 < 2e-16 ***
## M.Y -0.15372 0.01977 -7.775 1.88e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.89 on 998 degrees of freedom
## Multiple R-squared: 0.05711, Adjusted R-squared: 0.05616
## F-statistic: 60.45 on 1 and 998 DF, p-value: 1.876e-14
coef4 = lm4$coef
print(coef4)
## (Intercept) M.Y
## 30.5833180 -0.1537219
How does the coefficient of mileage in R2 compare with the coefficent of M.Y in R4?
print(coef2)
## (Intercept) mileage year
## -5365.4898723 -0.1537219 2.6943495
print(coef4)
## (Intercept) M.Y
## 30.5833180 -0.1537219
They are exactly the same !!!!!
Let’s also check the standard errors.
sigmahat = summary(lm2)$sigma
check = sigmahat/sqrt(sum(M.Y^2))
cat('stderr of mileage in R2 should be: ',check,'\n')
## stderr of mileage in R2 should be: 0.008338762
True function.
n=100
x = seq(from=-2,to=2,length.out=100)
fx = x^3
\(\Sigma\) matrix.
sf = 4
l = .1
Sigma = matrix(0.0,n,n)
for(i in 1:n) {
for(j in 1:n) {
stand = (x[i]-x[j])/l
Sigma[i,j] = (sf^2) * exp(-.5*stand^2)
}
}
Conditional distribution.
Iy = 76:85
Ix = c(1:75,86:100)
Syy = Sigma[Iy,Iy]
Sxx = Sigma[Ix,Ix]
Syx = Sigma[Iy,Ix]
Sxy = Sigma[Ix,Iy]
xp = fx[Ix]
B = Syx %*% solve(Sxx)
mygx = B %*% xp
Sygx = Syy - B %*% Sxy
Simulate from conditional.
set.seed(34)
nd = 1000
Lygx = t(chol(Sygx))
dygx = as.double(mygx) + Lygx %*% matrix(rnorm(10*nd),nrow=10)
Plot.
### plot GP inference
plot(x,fx,type="n")
qmat = apply(t(dygx),2,quantile,probs=c(.1,.5,.9))
for(i in 1:nrow(dygx)) {
ii = 75+i
lines(x[rep(ii,2)],qmat[c(1,3),i],col='red')
points(x[ii],qmat[2,i],col="red",pch=16)
}
lines(x,fx,col='gray',lwd=3)