Machine Learning, Spring 2017

Machine Learning / Statistical Learning
class #: 29457
Tuesday and Thursday, 4:30 - 5:45.
Tempe, WXLR A306
1/09 - 4/28.


Thursday, April 6:
Special Presentation on Data Science in Industry

Syllabus


Notes

Some R functions:

mlfuns.R
docv.R

KNN and the Bias-Variance Tradeoff
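A minimal kNN regression sketch to go with these notes (my own, assuming the FNN package; the course's own helpers are in mlfuns.R and docv.R above). Small k gives a wiggly, high-variance fit; large k gives a smooth, high-bias one.

   library(FNN)   # for knn.reg(); an assumed package, not a course file

   set.seed(99)
   n <- 200
   x <- sort(runif(n, 0, 2 * pi))
   y <- sin(x) + rnorm(n, sd = 0.3)
   xg <- seq(0, 2 * pi, length.out = 300)   # grid for the fitted curves

   plot(x, y, col = "grey")
   ks <- c(2, 20, 100)
   for (i in seq_along(ks)) {
     fit <- knn.reg(train = matrix(x), test = matrix(xg), y = y, k = ks[i])
     lines(xg, fit$pred, lwd = 2, col = i + 1)   # one curve per k
   }
   legend("topright", legend = paste("k =", ks), col = 2:4, lwd = 2)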

Trees
   knn-bagging.R
   tree-bagging.R
   boost-demo.R
   sim-var-sel.R
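A minimal sketch in the spirit of tree-bagging.R (my own, assuming the rpart package and the susedcars.csv file used in the homework below): fit one regression tree, then bag by averaging trees over bootstrap samples.

   library(rpart)

   cars <- read.csv("susedcars.csv")   # assumes mileage and price columns
   fit <- rpart(price ~ mileage, data = cars)
   plot(fit); text(fit)                # draw the fitted tree

   # bagging: average predictions over trees fit to bootstrap samples
   B <- 100
   preds <- matrix(0, nrow(cars), B)
   for (b in 1:B) {
     ii <- sample(nrow(cars), replace = TRUE)
     preds[, b] <- predict(rpart(price ~ mileage, data = cars[ii, ]), cars)
   }
   bagged <- rowMeans(preds)           # the bagged fit at each data point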


Classification
   Forensic Glass R script: fglass.R
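A minimal classification sketch on the forensic glass data (MASS::fgl, the data behind fglass.R; the tree fit here is my own choice, not necessarily what the script does).

   library(MASS)    # fgl: 214 glass fragments, 6 glass types
   library(rpart)

   fit <- rpart(type ~ ., data = fgl, method = "class")
   yhat <- predict(fit, fgl, type = "class")
   table(predicted = yhat, actual = fgl$type)   # in-sample confusion matrix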


Logistic Regression
   Check that the deviance equals -2 logL in logistic regression (a quick check is sketched below)
   Run and plot a multinomial logit fit, compare with knn
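For the deviance check, here is a minimal version on simulated data (any ungrouped binary y would do): the saturated model has log-likelihood 0, so the glm deviance should equal -2 logL exactly.

   set.seed(1)
   x <- rnorm(100)
   y <- rbinom(100, 1, plogis(1 + 2 * x))   # true logit is 1 + 2x
   fit <- glm(y ~ x, family = binomial)
   c(deviance = deviance(fit),
     minus2logL = -2 * as.numeric(logLik(fit)))   # the two should match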


Linear Models and Regularization
   R script to show properties of linear regression
   R script for best subsets, Hitters data.
   R script for ridge and lasso using glmnet, Hitters data (a glmnet sketch follows this list).
   R script for seeing Ridge vs Lasso in a simple problem.
   R script for plotting Ridge and Lasso shrinkage.
   R script for reading in diabetes data and looking at y.
   R script for Lasso on Diabetes.
   R script to see Lasso coefs plotted against lambda.
   R script for comparing Lasso, Ridge, and Enet.
   R script to see Ridge coef plotted against lambda.
   R script for forward stepwise on Diabetes.

   do-stepcv.R: R functions for doing stepwise selection.
   R script to learn about formulas and model.matrix (see Chapter 11, Statistical models in R, in the R-introduction Manual)
   notes-funs.R: R utility function Rob uses to write stuff out.
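A minimal glmnet sketch along the lines of the Hitters scripts above (assuming the ISLR package for the Hitters data; alpha = 0 is ridge, alpha = 1 is the lasso).

   library(ISLR)     # Hitters data
   library(glmnet)

   hit <- na.omit(Hitters)                          # drop rows with missing Salary
   x <- model.matrix(Salary ~ ., data = hit)[, -1]  # drop the intercept column
   y <- hit$Salary

   cv.lasso <- cv.glmnet(x, y, alpha = 1)   # 10-fold CV over the lambda path
   cv.ridge <- cv.glmnet(x, y, alpha = 0)
   plot(cv.lasso)                           # CV error vs log(lambda)
   coef(cv.lasso, s = "lambda.min")         # lasso coefs at the best lambda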


Generalized Linear Models
   R script for a regularized logit fit to simulated data.
   R script for a Lasso fit to the w8there data.
   R script for a Ridge fit to the w8there data.
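A minimal regularized logit sketch on simulated data, in the spirit of the first script above (lasso penalty via glmnet's binomial family).

   library(glmnet)

   set.seed(2)
   n <- 500; p <- 20
   x <- matrix(rnorm(n * p), n, p)
   y <- rbinom(n, 1, plogis(1 + 2 * x[, 1] - 2 * x[, 2]))  # only x1, x2 matter

   cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
   coef(cvfit, s = "lambda.1se")   # most of the noise coefficients should be zero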


Single Layer Neural Nets
   Single Layer Neural Nets (R code)
   Single Layer Neural Nets XOR (R code)
   plot.nnet.R
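A minimal single-layer net on XOR with nnet (my own sketch; the course code is linked above). Two hidden units are enough in principle, but nnet's optimizer can get stuck in a local minimum, so try a few seeds.

   library(nnet)

   xordat <- data.frame(x1 = c(0, 0, 1, 1),
                        x2 = c(0, 1, 0, 1),
                        y  = c(0, 1, 1, 0))
   set.seed(7)
   fit <- nnet(y ~ x1 + x2, data = xordat, size = 2, maxit = 1000, trace = FALSE)
   round(predict(fit, xordat), 2)   # should be near 0, 1, 1, 0 if it converged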


Deep Neural Nets
   Do XOR with h2o and Deep Learning
   Do Tabloid with h2o and Deep Learning
   Yet another version of the lift code
   Deviance loss
   Visualize MNIST digits
   Fit MNIST digits
   h2o in R tutorial
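A minimal h2o deep-learning sketch (my own; the scripts above do the real thing on the Tabloid and MNIST data, and iris here is just a stand-in).

   library(h2o)
   h2o.init()                                 # start/attach a local h2o cluster

   hdat <- as.h2o(iris)
   parts <- h2o.splitFrame(hdat, ratios = 0.8, seed = 1)
   fit <- h2o.deeplearning(x = 1:4, y = "Species",
                           training_frame = parts[[1]],
                           hidden = c(20, 20), epochs = 50)
   h2o.performance(fit, newdata = parts[[2]])   # held-out performance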


Clustering
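To go with the notes, a minimal k-means sketch in base R (my own; scaling first so no variable dominates the distances).

   set.seed(3)
   km <- kmeans(scale(USArrests), centers = 4, nstart = 25)
   km$cluster        # cluster assignment for each state
   km$tot.withinss   # within-cluster SS; compare across k to pick k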


Dimension Reduction: Principal Components and the Autoencoder
   Script to do principal components on the Arrests data
   Script to do the movies data
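A minimal principal components sketch on the Arrests data (assuming the script above uses the USArrests data that ships with base R).

   pr <- prcomp(USArrests, scale. = TRUE)  # scale: the variables have different units
   summary(pr)                             # proportion of variance explained
   biplot(pr)                              # scores and loadings in one plot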


Latent Dirichlet Allocation for Unsupervised Text Analysis
   Example Script for LDA
Many thanks to Xin Lei for the R script and figures in the notes!!!
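A minimal LDA sketch (assuming the topicmodels package, which ships with the AssociatedPress document-term matrix; the course example script is linked above).

   library(topicmodels)

   data("AssociatedPress", package = "topicmodels")
   fit <- LDA(AssociatedPress[1:100, ], k = 5, control = list(seed = 4))
   terms(fit, 8)   # top 8 words for each of the 5 topics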


Graphical Models and Naive Bayes

Demo on using BART R package
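A minimal BART sketch (assuming the BART package and its wbart() function for continuous y; the demo above may use a different interface).

   library(BART)

   set.seed(5)
   n <- 200
   x <- matrix(runif(n * 2), n, 2)
   y <- sin(2 * pi * x[, 1]) + rnorm(n, sd = 0.2)
   fit <- wbart(x.train = x, y.train = y)       # MCMC draws of the fit
   plot(y, fit$yhat.train.mean); abline(0, 1)   # posterior mean vs y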


R

The official R introduction:
Introduction to R
Go to The R Project to get the latest version.

Rob's brief Introduction to R

There are a million books and other resources online.
A nice book is:
"The Art of R Programming", by Norman Matloff.

Note: You will probably want to use R-studio!!!!

To install R under Windows:
   Google "install R under Windows"
   Click on "Download R-x.x.x for Windows"
   Click on "Run"

Here are three "Rmarkdown quickies" I found on the web.
Let me know if you find a better one!!!
rmarkdown1.pdf
rmarkdown2.pdf
The R-studio Cheatsheet
and I like this one too:
rmarkdown-reference.pdf
And how about:
More Rmarkdown from RStudio


Homework

Homework (-1)

Get the .Rmd file first.Rmd to compile to pdf and/or html in R-studio.
Not to be handed in.

Homework 0

Write a nice solution to problem 1.1 in the bias variance notes in Rmarkdown.
Not to be handed in.

pdf of solution
Rmarkdown of solution

Homework 1

Problems 5.1 and 6.1 from the "Introduction to Predictive Models.." notes. Due Thursday January 26.

Hand in hard copy.
Each group just has to hand in one copy
but be sure to put all your names on the copy you hand in.

Homework 2

Due February 2.

(a)
Fit trees of various sizes to the simple x = mileage, y = price problem using the susedcars.csv data.

What looks like a reasonable tree size?

(b)
Still just using the x = mileage, y = price problem,
use cross-validation to choose the tree size.
How does the tree chosen with CV compare with the one you chose in (a)?

(c)
Use cross-validation to fit a tree using y = price and x = all the other variables.

How "good" is the fit?

Is the tree you fit interpretable?
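One way to set up (a) and (b) with rpart (a sketch, not the required approach; docv.R above gives the course's own cross-validation helpers): grow a big tree, then use rpart's built-in cross-validation to pick the size.

   library(rpart)

   cars <- read.csv("susedcars.csv")
   big <- rpart(price ~ mileage, data = cars, cp = 0.0005)  # grow a large tree
   plotcp(big)                                              # CV error vs tree size
   best <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
   fit <- prune(big, cp = best)                             # the CV-chosen tree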

Homework 3

Do problem 9.1 from the end of the Trees notes.
That is, try trees, random forests, and boosting on the usedcars.csv data.

Homework 4

Homework 4, Due February 21st

Homework 5

Homework 5, Due March 2


Log:

February 9:
We finished the Classification notes, except that the lift example did not work.
I think I have fixed this, so that if you get the file mlfuns.R off the webpage
the code should run.

February 2: Finished the Trees notes.

January 26:
Finished "5. Tree Models and the Bias Variance Trade Off" in the trees notes.

January 19th: finished first set of notes on Intro, KNN, Bias-Variance tradeoff.
Homework is due Thursday January 26.

Hand in hard copy.
Each group just has to hand in one copy
but be sure to put all your names on the copy you hand in.

January 17th: stopped at slide 72 of the "Introduction to Predictive Models.." notes.


Project:

The best thing is if you find a data set you are interested in and apply some of the
methods we have discussed.

But if you want something simple to state but still challenging to do:
Compare regularized linear methods to tree-based methods on
either the (full) cars data (n about 20,000) or the (full) hockey data (n about 60,000).

For example, with the hockey data after you dummy up and add in some interactions, p will be in the hundreds.

To compare, you will need to pick a set of transformations to add to the linear specification.
Of course, "compare" means on the basis of out-of-sample performance!!

But, we also want to interpret.
All the methods address the fundamental issue of variable selection
in one way or another.

What are the important variables?

Note: if the code runs too slowly on your laptop you can subsample,
but do so as little as possible.
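A minimal out-of-sample comparison harness (my own sketch; "dat" and the response "price" are placeholders for whichever data set you pick):

   library(glmnet)
   library(randomForest)

   set.seed(6)
   ii <- sample(nrow(dat), floor(0.75 * nrow(dat)))   # train/test split
   xtr <- model.matrix(price ~ ., dat[ii, ])[, -1]
   xte <- model.matrix(price ~ ., dat[-ii, ])[, -1]

   lin <- cv.glmnet(xtr, dat$price[ii], alpha = 1)    # regularized linear
   rf  <- randomForest(price ~ ., data = dat[ii, ])   # tree-based

   rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
   c(lasso = rmse(dat$price[-ii], predict(lin, xte, s = "lambda.min")),
     rf    = rmse(dat$price[-ii], predict(rf, dat[-ii, ])))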