---
title: "Machine Learning Final Project"
author: "Bradley Bush, Sandy Tanwisuth, Jesse St. Amand"
date: "May 4, 2017"
output:
  pdf_document: default
  html_document: null
header-includes:
  - \usepackage{float}
  - \usepackage{graphicx}
---
```{r echo=FALSE,warning=FALSE}
library(knitr)
opts_chunk$set(fig.path='figure/graphics-',
cache.path='cache/graphics-',
fig.align='center',
external=TRUE,
echo=TRUE,
warning=FALSE,
fig.pos='H'
)
a4width<- 8.3
a4height<- 11.7
```
# Introduction
Syngenta is one of the largest seed producers in the world. It currently has a multi-year process by which it breeds, tests, and selects varieties to sell commercially. Interested in improving this selection process, Syngenta partnered with Idea Connection to host a data mining competition. The purpose of the competition (see https://www.ideaconnection.com/Syngenta-AI-Challenge/) is to give teams a chance to use AI and data mining techniques (along with their creativity) to come up with ways to improve the selection process and allow Syngenta to 'grow more with less.'
We have decided to use the data from this competition for our project. We attempt to predict soybean yield from environmental factors, such as the amount of sun and precipitation the soybean plant receives during its growing season, and from soil properties, such as the amount of sand, silt, or organic matter present in the soil at the testing site. A full description of the variables used is given below.
# The Data
Here is a description of the variables:
Variable Name | Variable Description
--------------| ---------------------------------------------------------------------------
`YIELD` | Variety yield in bushels per acre (an average, because the same variety may be grown in multiple experiments at the same location, i.e., at different parts of the test site)
`AREA` | The number of acres growing soybeans in the segment of about 36 sq miles around the testing location. The information was obtained from http://nassgeodata.gmu.edu/CropScape (NASS USDA, 2014)
`IRRIGATION` | The number of acres with irrigation in the segment of about 36 sq miles around the testing location. The information was obtained from http://nassgeodata.gmu.edu/CropScape (NASS USDA, 2014)
`AWC_100CM` | The available soil water capacity (volumetric fraction) until wilting point. This information was summarized for the first 100 cm.
`TEMP` | Temperature accumulation during the season (in degrees Celsius): the sum of average daily temperatures at the test site from 1 April to 31 October.
`PREC` | Precipitation accumulation during the season (in millimeters): the sum of average daily precipitation at the test site from 1 April to 31 October.
`RAD` | Solar radiation accumulation during the season (in watts per square meter): the sum of average daily solar radiation at the test site from 1 April to 31 October.
`SAND_TOP` | Percentages of sand particles according to size (small, less than 0.002 mm; medium, between 0.002 and 0.05 mm; and large, more than 0.05 mm, respectively). These proportions define soil texture. This information was summarized for the first 30 cm.
`SILT_TOP` | Percentages of silt particles according to size (small, less than 0.002 mm; medium, between 0.002 and 0.05 mm; and large, more than 0.05 mm, respectively). These proportions define soil texture. This information was summarized for the first 30 cm.
`CLAY` | Percentages of clay particles according to size (small, less than 0.002 mm; medium, between 0.002 and 0.05 mm; and large, more than 0.05 mm, respectively). These proportions define soil texture. This information was summarized for the first 30 cm.
`PH` | The negative log of the H+ concentration in the soil. Acidic soils have low pH values and high H+ concentration; alkaline soils have high pH values. pH impacts soil chemical reactions and the ability of the soil to supply nutrients to plants. The optimum pH range is between 6.5 and 7. This information was summarized for the first 30 cm of the soil profile.
`ORGANIC.MATTER`| The percentage of the soil consisting of plant and animal residues at various stages of decomposition, soil organisms, and their byproducts. It impacts soil chemical reactions and soil structure, and is an important indicator of the ability of the soil to supply nutrients and water to crops. This information was summarized for the first 30 cm of the soil profile.
`CEC` | The Cation Exchange Capacity (cmol kg-1) quantifies the amount of negative charge in the soil. It impacts soil chemical reactions and the ability of the soil to supply nutrients to plants, and is often associated with clay and organic matter content. This information was summarized for the first 30 cm of the soil profile.
## Processing Data
Much of our time was spent on this part of the project. The data came in two files: the first contained the soil and environmental data, and the second contained information about the soybean varieties. We first combined the datasets by taking each observation (i.e., a given soybean variety in a given year) and appending the appropriate weather and soil data, so that all of the data lived in one data frame. Once combined, the data had 172,057 observations and 66 variables. Not all of these variables were useful (for example, years of weather data for which we had no soybean yields), so we created another dataset containing only the variables we wanted to use. When running our models we split the data into three sets: 50% training data, 25% validation data, and 25% test data.
```{r include=FALSE, eval=FALSE}
# read in the soy training data (minus geographic information)
sdat1 = read.csv("TRAINING_DATASET.csv", header = TRUE)
n = nrow(sdat1) # find number of rows in training data `sdat1`
# read in geographic data for all locations in training data `sdat1`
sdat2 = read.csv("Geographic_information_updated.csv", header = TRUE)
# merge the training data with the geographic information for each observation, matching on variable `LOCATION`
sdat.merged = merge(sdat1, sdat2, by="LOCATION")
# delete weather data for years 2001-2008 (as we don't have varieties planted in those years)
sdat.merged1 = sdat.merged[,c(1:15,24:29,38:43,52:64)]
wdat = sdat.merged1[,16:33] # save weather data before deleting it below
sdat = sdat.merged1[,c(1:15,34:40)] # take out weather data to be sorted and added later according to year
# create temp vector with three columns for temperature, precipitation, and solar radiation (`TEMP`, `PREC`, and `RAD` respectively)
tempvec = data.frame(matrix(nrow = n, ncol = 3)) # create temp vec
colnames(tempvec) = c("TEMP","PREC","RAD") # add column names
for(i in 1:n) {
yr = sprintf("%02d", sdat$YEAR[i] %% 2000) # two-digit year, e.g. 2009 -> "09", 2011 -> "11"
# find column indices in `wdat` for temperature, precipitation, and radiation for that year
index = which(colnames(wdat) %in% paste0(c("TEMP_", "PREC_", "RAD_"), yr))
tempvec[i,] = wdat[i,index] # copy that year's TEMP, PREC, and RAD values into row i of `tempvec`
}
sdat.all = cbind(sdat,tempvec) # add `tempvec` to the right side of `sdat`
write.csv(sdat.all, file = "SoyData.csv") # save `sdat.all` as SoyData.csv
sdat = sdat.all[,c("YIELD","AREA","IRRIGATION","CEC","PH","ORGANIC.MATTER","CLAY","SILT_TOP","SAND_TOP","AWC_100CM","TEMP","PREC","RAD")] # pull out only the variables that have information ie `repno` is the rep number and is useless as a variable
```
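The 50/25/25 split described above is repeated in several chunks of this report; it can be wrapped in a small helper. This is a sketch only: `split_data` is our own name, not from any package.

```{r eval=FALSE}
# `split_data` is a hypothetical helper of ours (not from any package)
# wrapping the 50/25/25 split used throughout this report
split_data <- function(df, seed = 14) {
  set.seed(seed)
  n  <- nrow(df)
  n1 <- floor(n / 2) # training: 50%
  n2 <- floor(n / 4) # validation: 25%
  ii <- sample(1:n, n) # random permutation of the row indices
  list(train = df[ii[1:n1], , drop = FALSE],
       val   = df[ii[(n1 + 1):(n1 + n2)], , drop = FALSE],
       test  = df[ii[(n1 + n2 + 1):n], , drop = FALSE]) # remaining ~25%
}

# toy illustration
toy <- data.frame(YIELD = rnorm(100), TEMP = rnorm(100))
parts <- split_data(toy)
sapply(parts, nrow) # train 50, val 25, test 25
```

Fixing the seed inside the helper keeps the three subsets identical across the modeling sections.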
# Modeling the Data:
We tried three methods from our Machine Learning class: Random Forests, Deep Neural Nets, and KNN. We also mention Linear Regression (as a reference baseline) and Boosting (as a method to explore in the future). We tuned Random Forests, Neural Nets, and KNN, and discuss each method with its corresponding results below.
## Baseline: Generalized Linear Models
To give us a baseline for out-of-sample performance, we chose a linear regression model using `glm`. We used stepwise elimination, via the `step` function, to find our model. This gave us our baseline RMSE of $13.0193$ bushels per acre. Our baseline model is given below.
$$Yield = 32.61 + 0.00056 \cdot area + 0.00074 \cdot irrigation + 2.330 \cdot organic.matter + 0.325 \cdot silt.top - 0.449 \cdot cec + \varepsilon$$
```{r warning=FALSE,include=FALSE, eval=FALSE}
##Read in the data file using a library called "readr".
#read in text file using library readr so that all columns are in the right format
library(readr)
#read in data to a variable called SoyData
SoyData <- read_csv("SoyData.csv")
```
```{r resampling, warning=FALSE,include=FALSE, eval=FALSE}
## In this section we resample the data down to 1000 observations and check the normality assumption.
#use random sampling to create a training set called Soy1000
Soy1000 <- SoyData[sample(nrow(SoyData), 1000), ]
#attach the data frame so columns can be referenced by name
attach(SoyData)
#check column names and class
colnamesSoy <- colnames(SoyData)
classSoy <- class(SoyData)
#check normality
shapiro.test(Soy1000$YIELD)
```
```{r glm, warning=FALSE,include=FALSE, eval=FALSE}
#create additive model
model <- glm( YIELD ~ AREA + IRRIGATION + CEC + PH + ORGANIC.MATTER + CLAY + SILT_TOP +
AWC_100CM + TEMP + SAND_TOP, data=Soy1000)
#plot the models
plot(model)
#summary model
summary.mod <- summary(model)
```
```{r step, warning=FALSE,include=FALSE, eval=FALSE}
## In this section we use the stepwise method to eliminate insignificant variables.
#stepwise eliminations to eliminate insignificant variables
step(model)
```
```{r afterstep, warning=FALSE,include=FALSE, eval=FALSE}
##We recreated another additive model after the stepwise method.
#create new model after stepwise
new.model <- glm( YIELD ~ AREA + IRRIGATION + ORGANIC.MATTER + SILT_TOP , data=Soy1000)
summarynew <- summary(new.model)
#re-plot the new model
plot(new.model)
#create a new data set using random sampling for testing
Soy1000Test <- SoyData[sample(nrow(SoyData), 1000), ]
#calculate rmse on the held-out sample (predict on the test sample, not the training sample)
score <- predict(new.model, Soy1000Test)
actual <- Soy1000Test$YIELD
rmse <- sqrt(mean((actual - score)^2))
rmse
##The RMSE looked promising but not as good as other methods.
```
## Random Forests
With Random Forests we tried to optimize using a spread of values for `mtry` and `ntree` as well as using an optimization function called `tuneRF` from the `randomForest` package, which optimizes over `mtry` in a more thorough fashion. Our results didn't vary much over `mtry` values and were fairly consistent over `ntree` values.
We have included a table of values along with a graph below. Our final Random Forests model choice had an OOB RMSE of $7.9042$ bushels per acre with `mtry` set to 6 and `ntree` set to 1500.
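The grid we searched can be sketched as a single loop over `mtry` and `ntree`. This is illustrative only: the `grid` object is our own bookkeeping, and the fitting loop assumes the `sdat.train` data frame built in the chunk below plus the `randomForest` package.

```{r eval=FALSE}
# illustrative grid search; `grid` is our own bookkeeping object, and the
# fitting loop assumes `sdat.train` from the chunk below plus the
# randomForest package
grid <- expand.grid(mtry = c(3, 4, 6, 9, 12), ntree = c(500, 1000, 1500, 2000))
grid$rmse <- NA_real_
if (requireNamespace("randomForest", quietly = TRUE) && exists("sdat.train")) {
  for (j in seq_len(nrow(grid))) {
    fit <- randomForest::randomForest(YIELD ~ ., data = sdat.train,
                                      mtry = grid$mtry[j], ntree = grid$ntree[j])
    grid$rmse[j] <- sqrt(tail(fit$mse, 1)) # OOB RMSE at the final tree
  }
}
grid
```

For regression forests, `fit$mse` holds the out-of-bag MSE after each tree, so its last element gives the OOB RMSE for the full ensemble.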
![ntree values of 500, 1000, 1500, and 2000 are used with mtry values of 3, 4, 6, 9 and 12. The results do not vary much for any of the settings.](C:\Users\bra2194372\Dropbox\Data\Rf.RMSEvsMTRY.PNG)
\pagebreak
mtry: | ntree=500 | ntree=1000 | ntree=1500 | ntree=2000
-----| ----------| ----------| ----------| ----------|
3 | 7.926718 | 7.926793 | 7.926908 | 7.925924
4 | 7.909572 | 7.909412 | 7.909228 | 7.909150
6 | 7.904852 | 7.904379 | 7.904189 | 7.904188
9 | 7.904154 | 7.904245 | 7.904059 | 7.904156
12| 7.904267 | 7.904182 | 7.904173 | 7.904311
Table 1: List of values for `ntree` and `mtry`
```{r include=FALSE, eval=FALSE}
# Random Forests!
#load libraries
library(randomForest)
library(gbm) #boosting
#shortcut to saved `SoyData.csv`
sdat = read.csv("SoyData.csv", header = TRUE)
# create training, validation, and test data
set.seed(14) # my birthday
n=nrow(sdat) # find number of obs in data
n1=floor(n/2) # find half of number of obs (rounded) in data (for training data)
n2=floor(n/4) # find fourth of number of obs (rounded) in data (for validation data)
n3=n-n1-n2 # find fourth (ie 1 - 1/2 - 1/4 rounded ) of number of obs (rounded) in data (for test data)
ii = sample(1:n,n) # randomly reorders n elements
sdat.train = sdat[ii[1:n1],]
sdat.val = sdat[ii[(n1+1):(n1+n2)],]
sdat.test = sdat[ii[(n1+n2+1):n],]
# we fit a model on the training data `sdat.train` using random forests, then predict on the validation data `sdat.val`
# Note: mtry is the number of randomly chosen variables considered at each split
# Note: we use mtry=floor(sqrt(ncol(sdat))) = 3 as our default value for mtry
rf.fit = randomForest(YIELD~.,data=sdat.train, mtry=3, ntree=500) # fit random forests model on training data
rfvalpred = predict(rf.fit, newdata=sdat.val) # predict using model `rf.fit` on validation data
# let's try a larger number of trees
rf.fit1 = randomForest(YIELD~.,data=sdat.train, mtry=9, ntree=2000) # fit random forests model on training data
rfvalpred = predict(rf.fit1, newdata=sdat.val) # predict using model `rf.fit1` on validation data
rf.fit2 = randomForest(YIELD~.,data=sdat.train, mtry=12, ntree=2000) # fit random forests model on training data
rfvalpred = predict(rf.fit2, newdata=sdat.val) # predict using model `rf.fit2` on validation data
# mean(rf.fit2$mse) # use this to get the OOB MSE for the mtry = 12, ntree = 2000 fit
# Note: the `bestmtry*` objects below come from the tuneRF() calls later in this chunk
bestmtry.df = as.data.frame(bestmtry) # make them all dataframes
bestmtry2.df = as.data.frame(bestmtry2)
bestmtry3.df = as.data.frame(bestmtry3)
bestmtry4.df = as.data.frame(bestmtry4)
bestmtry4.1.df = rbind(bestmtry4.df,c(12,mean(rf.fit2$mse))) # append the mtry = 12 result
rf.results= cbind(bestmtry.df[,1],sqrt(bestmtry.df[,2]),sqrt(bestmtry2.df[,2]),sqrt(bestmtry3.df[,2]),sqrt(bestmtry4.1.df[,2]))
row.names(rf.results)= c("3", "4","6","9", "12")
colnames(rf.results)= c("mtry", "500", "1000", "1500", "2000")
# plot results
plot(rf.results[,1],rf.results[,2],type="n",xlab="mtry",ylab="RMSE",cex.lab=1.5, main = "RMSE vs mtry")
for(i in 2:5) lines(rf.results[,1],rf.results[,i],pch=18,col=i,lty=2,lwd=3) #plot each ntree curve
leg.text = c("500", "1000","1500","2000")
legend("topright",leg.text, col=c("red","green","blue","cyan"),lty = 1, lwd=3)
# Cross validation with Random Forests using function `rfcv()`
rf.fit.cv = rfcv(sdat.train[,-1], sdat.train[,1], cv.fold=10)
# tune Random Forests with the tuneRF() function in package randomForest
# Algorithm Tune (tuneRF)
set.seed(14)
bestmtry <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=500)
print(bestmtry)
set.seed(14)
bestmtry2 <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=1000)
print(bestmtry2)
set.seed(14)
bestmtry3 <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=1500)
print(bestmtry3)
set.seed(14)
bestmtry4 <- tuneRF(sdat.train[,-1], sdat.train[,1], stepFactor=1.5, improve=1e-5, ntree=2000)
print(bestmtry4)
# plot graph of mtry options
bestmtry.df = as.data.frame(bestmtry) #make it a dataframe
windows()
plot(bestmtry.df$mtry,bestmtry.df$OOBError, xlab = "Mtry", ylab = "OOBError", type = "o", main = "Mtry vs MSE (ie OOBError), ntree = 500" )
```
## KNN
We used the R package `kknn` for our kNN models. Tuning these models was extremely time consuming (read: computationally expensive), but they ended up performing surprisingly well. Below is a table of the RMSE values for various values of k.
|k | RMSE |
|--|------|
| 5 |8.592175 |
| 8 |8.341477 |
| 11 |8.217457 |
| 14 |8.172575 |
| 17 |8.126609 |
| 20 |8.102771 |
| 23 |8.089667 |
| 26 |8.076027 |
| 29 |8.074170 |
| 32 |8.067071 |
| 35 |8.062278 |
| 38 |8.055297 |
| 41 |8.054643 |
| 44 |8.055504 |
| 47 |8.057311 |
| 50 |8.056076 |
| 53 |8.062764 |
| 56 |8.067178 |
| 59 |8.066341 |
| 62 |8.069128 |
| 65 |8.075273 |
| 68 |8.076966 |
Table 2: k values and the RMSE of each model as depicted in Figure 2. The best RMSE is ~8.055, found at k = 41.
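The cross-validation behind this table can be sketched in base R; the course function `docvknn` (used in the chunks below) does the same computation. The helper name `knn_cv_rmse` is our own, and this is a minimal illustration, not the exact code we ran.

```{r eval=FALSE}
# knn_cv_rmse() is our own illustrative helper, not the course's docvknn()
knn_cv_rmse <- function(x, y, kv, nfold = 10) {
  x <- as.matrix(x)
  n <- length(y)
  fold <- sample(rep(1:nfold, length.out = n)) # random fold assignment
  sse <- numeric(length(kv)) # accumulated squared error for each k
  for (f in 1:nfold) {
    tr <- fold != f # TRUE for training rows, FALSE for held-out rows
    ntest <- sum(!tr)
    # distances from each held-out point (rows) to each training point (columns)
    d <- as.matrix(dist(rbind(x[!tr, , drop = FALSE], x[tr, , drop = FALSE])))
    d <- d[seq_len(ntest), -seq_len(ntest), drop = FALSE]
    ord <- apply(d, 1, order) # each column ranks training points for one held-out point
    for (j in seq_along(kv)) {
      pred <- apply(ord[1:kv[j], , drop = FALSE], 2,
                    function(idx) mean(y[tr][idx])) # average of k nearest responses
      sse[j] <- sse[j] + sum((y[!tr] - pred)^2)
    }
  }
  sqrt(sse / n) # RMSE for each k, pooled over folds
}

# toy illustration on simulated data
set.seed(99)
x <- matrix(rnorm(200), ncol = 2)
y <- x[, 1] + rnorm(100, sd = 0.1)
knn_cv_rmse(x, y, kv = c(5, 10, 20))
```

As in our actual runs, the predictors should be min-max scaled before computing distances so that no single variable dominates.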
\pagebreak
![Preliminary plot of k versus cvmean, which compares the parameter k at values between 5 and 68 with the RMSE of a single model. This was used to find the approximate range of appropriate k values before moving on to the more computationally expensive model below.](C:\Users\bra2194372\Dropbox\Data\cvmean_kv.JPG)
\pagebreak
![Plot of k versus cvmean, which compares the parameter k at values between 5 and 68 with the RMSE of each model (calculated against the test data). Each colored line represents a different subset of the training data; averaging them to produce the black line reduces bias and variance in the estimate.](C:\Users\bra2194372\Dropbox\Data\cvmean_k.JPG)
\pagebreak
![Pairwise comparison of kNN at k = 41 against a linear model. The greater linearity in the y versus kNN41 plots, compared with the y versus linear plots, gives a visual indication that the kNN model is a better predictor for these data than the linear model.](C:\Users\bra2194372\Dropbox\Data\fmat.JPG)
```{r include=FALSE, eval=FALSE}
library(kknn)
sdat.all <- read.csv("SoyData2.csv", header = TRUE)
```
```{r include=FALSE, eval=FALSE}
# pull out only the variables that have information ie `repno` is the rep number and is useless as a variable
sdat=sdat.all[,c("YIELD","AREA","IRRIGATION","CEC","PH","ORGANIC.MATTER","CLAY","SILT_TOP","SAND_TOP","AWC_100CM","TEMP","PREC","RAD")]
# create training, validation, and test data
set.seed(14) # my birthday
n=nrow(sdat) # find number of obs in data
n1=floor(n/2) # find half of number of obs (rounded) in data (for training data)
n2=floor(n/4) # find fourth of number of obs (rounded) in data (for validation data)
n3=n-n1-n2 # find fourth (ie 1 - 1/2 - 1/4 rounded ) of number of obs (rounded) in data (for test data)
ii = sample(1:n,n) # randomly reorders n elements
trainDf = sdat[ii[1:n1],]
valDf = sdat[ii[(n1+1):(n1+n2)],]
train2Df = rbind(trainDf,valDf)
testDf = sdat[ii[(n1+n2+1):n],]
```
```{r include=FALSE, eval=FALSE}
y = trainDf[,1]
x = trainDf[,2:12]
print(dim(x)) # check dimensions of the predictor matrix
mmsc=function(x) {return((x-min(x))/(max(x)-min(x)))}
xs = apply(x,2,mmsc)
#plot y vs each x
par(mfrow=c(1,2)) #two plot frames
```
```{r include=FALSE, eval=FALSE}
par(mfrow=c(1,1))
set.seed(99)
kv = seq(5,70,3) #k values to try
n = length(y)
cvtemp = docvknn(xs,y,kv,nfold=10)
cvtemp = sqrt(cvtemp/n) #docvknn returns sum of squares
plot(kv,cvtemp)
```
```{r include=FALSE, eval=FALSE}
#run cross val several times
set.seed(99)
cvmean = rep(0,length(kv)) #will keep average rmse here
ndocv = 10 #number of CV splits to try
n=length(y)
cvmat = matrix(0,length(kv),ndocv) #keep results for each split
for(i in 1:ndocv) {
cvtemp = docvknn(xs,y,kv,nfold=10)
cvmean = cvmean + cvtemp
cvmat[,i] = sqrt(cvtemp/n)
}
```
```{r include=FALSE, eval=FALSE}
cvmean = cvmean/ndocv
cvmean = sqrt(cvmean/n)
plot(kv,cvmean,type="n",ylim=range(cvmat),xlab="k",cex.lab=1.5)
for(i in 1:ndocv) lines(kv,cvmat[,i],col=i,lty=3) #plot each result
lines(kv,cvmean,type="b",col="black",lwd=5) #plot average result
```
```{r include=FALSE, eval=FALSE}
#refit using all the data and k=41
ddf = data.frame(y,xs)
near41 = kknn(y~.,ddf,ddf,k=41,kernel = "rectangular")
lmf = lm(y~.,ddf)
fmat = cbind(y,near41$fitted,lmf$fitted)
colnames(fmat)=c("y","kNN41","linear")
pairs(fmat)
print(cor(fmat))
```
## Neural Net
We used the `h2o` package for running our neural nets. It was very different from the other packages we have worked with (as mentioned in class) and took some time to get used to.
The following table was created by running iterations of the neural network model with different parameters. In attempts 1-10, the RMSE (in bushels per acre) was computed on the validation set for a model fit to the training data. The final attempt took the best configuration (attempt 9), refit it on the second training set (the original training data plus the validation data), and evaluated it on the test data. The greatest differences in RMSE occurred with changes to the hidden layers, with the best values coming from the 400x400x400 configuration. Although attempt 10 used more layers, it appears to overfit the data and therefore produced a higher RMSE.
|Attempt Number | Hidden Layers | Epochs | Activation | L1 | RMSE |
|---------------|---------------|--------|------------|----|------|
| 1 |10 |1000 |RectifierWithDropout |1.00E-02 |9.609 |
| 2 |10x10 |500 |RectifierWithDropout |1.00E-03 |10.185 |
| 3 |150x150 |300 |RectifierWithDropout |1.00E-04 |8.74 |
| 4 |200x200 |500 |RectifierWithDropout |1.00E-04 |8.567 |
| 5 |100x100x100 |1000 |RectifierWithDropout |1.00E-04 |8.545 |
| 6 |250x250 |1000 |RectifierWithDropout |1.00E-05 |8.437 |
| 7 |400x400 |1000 |RectifierWithDropout |1.00E-05 |8.41 |
| 8 |50x50x50x50 |1000 |RectifierWithDropout |1.00E-04 |10.06 |
| 9 |400x400x400 |1000 |RectifierWithDropout |1.00E-05 |8.07 |
| 10 |500x500x500x500 |1000 |RectifierWithDropout |1.00E-05 |8.14 |
| final |400x400x400 |1000 |RectifierWithDropout |1.00E-05 |8.115 |
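The ten attempts above differ only in a few hyperparameters, so they could be driven by one loop instead of ten near-identical chunks. This is a sketch: `attempts`, `l1s`, and `epochs` are our own bookkeeping for the table, and the loop assumes the `h2o` frames `train_h2o`/`val_h2o` and the data frame `valDf` created in the chunks below.

```{r eval=FALSE}
# `attempts`, `l1s`, and `epochs` are our own bookkeeping for the table
# above; the loop assumes the h2o frames `train_h2o`/`val_h2o` and the
# data frame `valDf` created in the chunks below
attempts <- list(`1` = 10, `2` = c(10, 10), `3` = c(150, 150),
                 `4` = c(200, 200), `5` = c(100, 100, 100),
                 `6` = c(250, 250), `7` = c(400, 400),
                 `8` = c(50, 50, 50, 50), `9` = c(400, 400, 400),
                 `10` = c(500, 500, 500, 500))
l1s    <- c(1e-2, 1e-3, 1e-4, 1e-4, 1e-4, 1e-5, 1e-5, 1e-4, 1e-5, 1e-5)
epochs <- c(1000, 500, 300, 500, 1000, 1000, 1000, 1000, 1000, 1000)
rmse <- rep(NA_real_, length(attempts))
if (requireNamespace("h2o", quietly = TRUE) && exists("train_h2o")) {
  for (a in seq_along(attempts)) {
    fit <- h2o::h2o.deeplearning(x = 2:12, y = 1, training_frame = train_h2o,
                                 hidden = attempts[[a]], epochs = epochs[a],
                                 activation = "RectifierWithDropout", l1 = l1s[a])
    pred <- as.matrix(h2o::h2o.predict(fit, val_h2o)[, 1])
    rmse[a] <- sqrt(mean((valDf[, 1] - pred)^2)) # validation RMSE for attempt a
  }
}
```

In practice we kept the separate chunks below so each fitted model could be saved and reloaded with `h2o.saveModel`/`h2o.loadModel`.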
```{r include=FALSE, eval=FALSE}
library(h2o)
source("mlfuns.R")
sdat.all <- read.csv("SoyData2.csv", header = TRUE)
```
```{r include=FALSE, eval=FALSE}
# pull out only the variables that have information ie `repno` is the rep number and is useless as a variable
sdat=sdat.all[,c("YIELD","AREA","IRRIGATION","CEC","PH","ORGANIC.MATTER","CLAY","SILT_TOP","SAND_TOP","AWC_100CM","TEMP","PREC","RAD")]
# create training, validation, and test data
set.seed(14) # my birthday
n=nrow(sdat) # find number of obs in data
n1=floor(n/2) # find half of number of obs (rounded) in data (for training data)
n2=floor(n/4) # find fourth of number of obs (rounded) in data (for validation data)
n3=n-n1-n2 # find fourth (ie 1 - 1/2 - 1/4 rounded ) of number of obs (rounded) in data (for test data)
ii = sample(1:n,n) # randomly reorders n elements
# Divide data into training set, validation set, a combined training + validation set, and the final test set
trainDf = sdat[ii[1:n1],]
valDf = sdat[ii[(n1+1):(n1+n2)],]
train2Df = rbind(trainDf,valDf)
testDf = sdat[ii[(n1+n2+1):n],]
```
```{r include=FALSE, eval=FALSE}
h2oServer <- h2o.init(ip="localhost", port=54321,
max_mem_size="4g", nthreads=-1)
#create h2o dataframes
train_h2o = as.h2o(trainDf, destination_frame = "soy_train")
val_h2o = as.h2o(valDf, destination_frame = "soy_val")
train2_h2o = as.h2o(train2Df, destination_frame = "soy_train2")
test_h2o = as.h2o(testDf, destination_frame = "soy_test")
```
```{r include=FALSE, eval=FALSE}
if(file.exists(file.path("./", "modelSoy1"))) {
modelSoy1 = h2o.loadModel(path = file.path("./", "modelSoy1"))
} else {
modelSoy1 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=10,
epochs=1000,
export_weights_and_biases=T,
l1 = 1e-2,
model_id = "modelSoy1"
)
h2o.saveModel(modelSoy1, path="./")
}
phat = predict(modelSoy1, val_h2o)
phatL = list()
phatL$modelSoy1 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
### fit h2o deep
if (file.exists(file.path("./", "modelSoy2"))) {
modelSoy2 = h2o.loadModel(path = file.path("./", "modelSoy2"))
} else {
modelSoy2 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(10,10),
epochs=500,
activation="RectifierWithDropout",
l1=1e-3,
export_weights_and_biases=TRUE,
model_id = "modelSoy2"
)
h2o.saveModel(modelSoy2, path="./")
}
phat = predict(modelSoy2, val_h2o)
phatL$modelSoy2 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy3"))) {
modelSoy3 = h2o.loadModel(path = file.path("./", "modelSoy3"))
} else {
modelSoy3 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(150,150),
epochs=300,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy3"
)
h2o.saveModel(modelSoy3, path="./")
}
phat = predict(modelSoy3, val_h2o)
phatL$modelSoy3 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy4"))) {
modelSoy4 = h2o.loadModel(path = file.path("./", "modelSoy4"))
} else {
modelSoy4 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(200,200),
epochs=500,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy4"
)
h2o.saveModel(modelSoy4, path="./")
}
phat = predict(modelSoy4, val_h2o)
phatL$modelSoy4 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy5"))) {
modelSoy5 = h2o.loadModel(path = file.path("./", "modelSoy5"))
} else {
modelSoy5 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(100,100,100),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy5"
)
h2o.saveModel(modelSoy5, path="./")
}
phat = predict(modelSoy5, val_h2o)
phatL$modelSoy5 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy6"))) {
modelSoy6 = h2o.loadModel(path = file.path("./", "modelSoy6"))
} else {
modelSoy6 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(250,250),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy6"
)
h2o.saveModel(modelSoy6, path="./")
}
phat = predict(modelSoy6, val_h2o)
phatL$modelSoy6 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy7"))) {
modelSoy7 = h2o.loadModel(path = file.path("./", "modelSoy7"))
} else {
modelSoy7 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(400,400),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy7"
)
h2o.saveModel(modelSoy7, path="./")
}
phat = predict(modelSoy7, val_h2o)
phatL$modelSoy7 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy8"))) {
modelSoy8 = h2o.loadModel(path = file.path("./", "modelSoy8"))
} else {
modelSoy8 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(50,50,50,50),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-4,
export_weights_and_biases=TRUE,
model_id = "modelSoy8"
)
h2o.saveModel(modelSoy8, path="./")
}
phat = predict(modelSoy8, val_h2o)
phatL$modelSoy8 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy9"))) {
modelSoy9 = h2o.loadModel(path = file.path("./", "modelSoy9"))
} else {
modelSoy9 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(400,400,400),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy9"
)
h2o.saveModel(modelSoy9, path="./")
}
phat = predict(modelSoy9, val_h2o)
phatL$modelSoy9 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelSoy10"))) {
modelSoy10 = h2o.loadModel(path = file.path("./", "modelSoy10"))
} else {
modelSoy10 = h2o.deeplearning(
x=2:12, y=1,
training_frame=train_h2o,
hidden=c(500,500,500,500),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelSoy10"
)
h2o.saveModel(modelSoy10, path="./")
}
phat = predict(modelSoy10, val_h2o)
phatL$modelSoy10 = as.matrix( phat[,1] )
```
```{r include=FALSE, eval=FALSE}
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy1)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy2)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy3)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy4)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy5)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy6)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy7)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy8)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy9)^2))
print(rmse)
rmse = sqrt(mean((valDf[,1]-phatL$modelSoy10)^2))
print(rmse)
```
```{r include=FALSE, eval=FALSE}
if (file.exists(file.path("./", "modelFinal"))) {
modelFinal = h2o.loadModel(path = file.path("./", "modelFinal"))
} else {
modelFinal = h2o.deeplearning(
x=2:12, y=1,
training_frame=train2_h2o,
hidden=c(400,400,400),
epochs=1000,
activation="RectifierWithDropout",
l1=1e-5,
export_weights_and_biases=TRUE,
model_id = "modelFinal"
)
h2o.saveModel(modelFinal, path="./")
}
phat = predict(modelFinal, test_h2o)
phatL$modelFinal = as.matrix( phat[,1] )
rmse = sqrt(mean((testDf[,1]-phatL$modelFinal)^2))
print(rmse)
```
## Boosting: Future Work
We were curious how our three methods would compare to boosting with trees, so we used the `gbm` package with `distribution` set to gaussian, `interaction.depth` set to 4, `n.trees` set to 5000, and `shrinkage` set to 0.01. We were surprised to see that our boosting model (with almost no tuning) was very competitive, with an RMSE of $7.9159$ bushels per acre. This model shows promise, and we would like to tune it in future work.
```{r include=FALSE, eval=FALSE}
# fit a tree using boosting on training data
boost.fit = gbm(YIELD~.,data=sdat.train,distribution="gaussian",interaction.depth=4,n.trees=5000,shrinkage=0.01)
boostvalpred=predict(boost.fit,newdata=sdat.val,n.trees=5000) # predict using model `boost.fit` on validation data
rmse = sqrt(mean((sdat.val$YIELD-boostvalpred)^2)) # find RMSE
cat("rmse on validation for boosting: ",rmse,"\n")
```
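A natural starting point for that future tuning is a small grid over `shrinkage` and `interaction.depth`, letting `gbm` pick the tree count by cross-validation via `gbm.perf`. This is a sketch under stated assumptions: `params` is our own object, and the loop assumes `sdat.train` and `sdat.val` from the Random Forests section plus the `gbm` package.

```{r eval=FALSE}
# hypothetical tuning grid for future work; `params` is our own object,
# and the loop assumes `sdat.train`/`sdat.val` from the Random Forests
# section plus the gbm package
params <- expand.grid(shrinkage = c(0.005, 0.01, 0.05),
                      interaction.depth = c(2, 4, 6))
params$rmse <- NA_real_
if (requireNamespace("gbm", quietly = TRUE) && exists("sdat.train")) {
  for (j in seq_len(nrow(params))) {
    fit <- gbm::gbm(YIELD ~ ., data = sdat.train, distribution = "gaussian",
                    n.trees = 5000, shrinkage = params$shrinkage[j],
                    interaction.depth = params$interaction.depth[j], cv.folds = 5)
    nt <- gbm::gbm.perf(fit, method = "cv", plot.it = FALSE) # CV-optimal tree count
    pred <- predict(fit, newdata = sdat.val, n.trees = nt)
    params$rmse[j] <- sqrt(mean((sdat.val$YIELD - pred)^2))
  }
}
```

Using the cross-validated tree count rather than the full 5000 trees also guards against the overfitting that a long boosting run can produce.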
```{r include=FALSE, eval=FALSE}
#plot (out-of-sample) fits
pairs(cbind(sdat.val$YIELD,rfvalpred,boostvalpred))
print(cor(cbind(sdat.val$YIELD,rfvalpred,boostvalpred)))
#plot test y vs test predictions
plot(sdat.val$YIELD,boostvalpred)
abline(0,1,col="red",lwd=2)
#variable importance from boosting
summary(boost.fit)
rmse = sqrt(mean((sdat.val$YIELD-rfvalpred)^2))
cat("rmse on validation for random forests: ",rmse,"\n")
#--------------------------------------------------
#variable importance from Random Forests
varImpPlot(rf.fit)
importance(rf.fit)
```
# Conclusion
Neural Nets, KNN, and Random Forests were all fairly close in performance, and all better than Linear Regression. Our best OOB performance was from our Random Forests model with an OOB RMSE of $7.9042$ bushels per acre. Boosting with trees showed promise and should be considered in future work.