Machine Learning / Statistical Learning

class #: 29457

Tuesday and Thursday, 4:30 - 5:45.

Tempe, WXLR A306

1/09 - 4/28.

Thursday, April 6:

Special Presentation on Data Science in Industry

Syllabus

mlfuns.R

docv.R

KNN and the Bias-Variance-Tradeoff
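To give a feel for what the KNN notes cover, here is a minimal kNN regression sketch in base R. The `knn_reg` helper and the simulated data are my own toy construction, not the course's docv.R code:

```r
# Minimal kNN regression by hand (toy sketch, not the course's docv.R code).
knn_reg <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nn <- order(abs(x_train - x0))[1:k]  # indices of the k nearest neighbors
    mean(y_train[nn])                    # average their y values
  })
}
set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
knn_reg(x, y, c(0.25, 0.75), k = 5)
```

Small k gives low bias but high variance; large k the reverse. That trade-off is exactly what the notes explore.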

Trees

knn-bagging.R, tree-bagging.R, boost-demo.R

sim-var-sel.R

Classification

Forensic Glass R script: fglass.R

Logistic Regression

Check that the deviance is -2 log L in Logistic Regression
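A quick way to do this check in R (simulated data of my own, not the course script): for 0/1 Bernoulli data the saturated log-likelihood is 0, so the residual deviance equals -2 log L exactly.

```r
# Sketch: check deviance == -2*logLik for a Bernoulli logistic fit.
# (With 0/1 data the saturated log-likelihood is 0, so they match exactly.)
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(1 + 2 * x))       # simulated binary response
fit <- glm(y ~ x, family = binomial)
c(deviance(fit), -2 * as.numeric(logLik(fit)))  # the two numbers agree
```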

Run and plot multinomial logit, compare with knn

Linear Models and Regularization

R script to show properties of linear regression

R script for best subsets, Hitters data.

R script for ridge and lasso using glmnet, Hitters Data.

R script for seeing Ridge vs Lasso in simple Problem.

R script for plotting Ridge and Lasso shrinkage.
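One way to see ridge shrinkage without glmnet is the closed form (X'X + lambda I)^{-1} X'y. A sketch with simulated data; the `ridge_coef` helper is mine, not from the course script:

```r
# Sketch: ridge shrinkage via the closed form (X'X + lambda*I)^{-1} X'y.
set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))   # standardized predictors
y <- X %*% c(3, 0, -2) + rnorm(n)        # true coefs: 3, 0, -2
ridge_coef <- function(lambda) {
  solve(crossprod(X) + lambda * diag(3), crossprod(X, y))
}
sapply(c(0, 1, 10, 100, 1000), ridge_coef)  # columns shrink toward zero
```

Ridge shrinks all coefficients smoothly toward zero as lambda grows; the lasso, by contrast, sets some exactly to zero, which is what the plotting scripts make visible.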

R script for reading in diabetes data and looking at y.

R script for Lasso on Diabetes.

R script to see Lasso coefs plotted against lambda.

R script for comparing Lasso, Ridge, and Elastic Net.

R script to see Ridge coef plotted against lambda.

R script for forwards stepwise on Diabetes.

do-stepcv.R: R functions for doing stepwise.

R script to learn about formulas and model.matrix (see Chapter 11, Statistical models in R, in the R-introduction Manual)
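A quick illustration of what that script covers, on a toy data frame of my own:

```r
# Sketch: how a formula becomes a design matrix via model.matrix (toy data).
df <- data.frame(y = c(1.2, 3.4, 2.1, 5.0),
                 x = c(1, 2, 3, 4),
                 g = factor(c("a", "b", "a", "b")))
X <- model.matrix(y ~ x + g, data = df)
colnames(X)  # "(Intercept)" "x" "gb" -- the factor g became a dummy column
```

This "dummying up" of factors is exactly what you will do by hand (or let R do) for the hockey and cars data later on.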

notes-funs.R: R utility function Rob uses to write stuff out.

Generalized Linear Models

R script for Regularized logit fit to simulated data.

R script Lasso fit to w8there data.

R script Ridge fit to w8there data.

Single Layer Neural Nets

Single Layer Neural Nets (R code)

Single Layer Neural Nets XOR (R code)
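Before fitting XOR with a package, it can help to see that one hidden layer suffices in principle. Here is a net with hand-picked weights (my own toy construction, not the course code):

```r
# Sketch: XOR computed by a fixed single-hidden-layer net (hand-picked weights).
sigmoid <- function(z) 1 / (1 + exp(-z))
xor_net <- function(x1, x2) {
  h1 <- sigmoid(20 * (x1 + x2 - 0.5))  # OR-like hidden unit
  h2 <- sigmoid(20 * (x1 + x2 - 1.5))  # AND-like hidden unit
  sigmoid(20 * (h1 - h2 - 0.5))        # OR and not AND = XOR
}
round(mapply(xor_net, c(0, 0, 1, 1), c(0, 1, 0, 1)))  # 0 1 1 0
```

XOR is not linearly separable, so no single-unit (logistic regression) model can get it; the hidden layer is what makes it possible.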

plot.nnet.R

Deep Neural Nets

Do XOR with h2o and Deep Learning

Do Tabloid with h2o and Deep Learning

yet another version of lift code

deviance loss
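The deviance loss here is the binomial one; a minimal version (my own `dev_loss` helper, not the course file):

```r
# Sketch: average binomial deviance loss for 0/1 outcomes and probabilities.
dev_loss <- function(y, phat, eps = 1e-12) {
  phat <- pmin(pmax(phat, eps), 1 - eps)  # clamp to avoid log(0)
  -2 * mean(y * log(phat) + (1 - y) * log(1 - phat))
}
dev_loss(c(1, 0, 1), c(0.9, 0.1, 0.8))  # small loss: confident and correct
dev_loss(c(1, 0, 1), c(0.1, 0.9, 0.2))  # large loss: confident and wrong
```

Unlike misclassification rate, this loss rewards well-calibrated probabilities, which is why it is the natural out-of-sample metric for the Tabloid examples.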

Visualize MNIST digits

Fit MNIST digits

Clustering

Dimension Reduction: Principal Components and the Autoencoder

Script to do principal components on the Arrests Data
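The core steps of that script look roughly like this, using the built-in USArrests data (a sketch of my own, not the course script itself):

```r
# Sketch: principal components on the built-in USArrests data, standardized.
pc <- prcomp(USArrests, scale. = TRUE)
summary(pc)        # proportion of variance explained by each component
head(pc$x[, 1:2])  # scores on the first two principal components
```

With `scale. = TRUE` each of the 4 variables contributes variance 1, so the component variances sum to 4 and the summary shows how much of that total the first couple of components capture.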

Script to do movies Data

Latent Dirichlet Allocation for Unsupervised Text Analysis

Example Script for LDA

Many thanks to Xin Lei for the R script and figures in the notes!!!

Graphical Models and Naive Bayes

Introduction to R

Go to The R Project to get the latest version.

Rob's brief Introduction to R

There are a million books and lots of material online.

A nice book is:

"The Art of R Programming", by Norman Matloff.

Note: You will probably want to use R-studio!!!!

To install R under Windows:

Google "install R under Windows"

click on "Download R-x.x.x for Windows"

Click on "Run"

Here are three "Rmarkdown quickies" I found on the web.

Let me know if you find a better one!!!

rmarkdown1.pdf

rmarkdown2.pdf

The R-studio Cheatsheet

and I like this one too:

rmarkdown-reference.pdf

And how about:

More Rmarkdown from RStudio

Not to be handed in.

Not to be handed in.

pdf of solution

Rmarkdown of solution

Hand in hard copy.

Each group just has to hand in one copy, but be sure to put all your names on the copy you hand in.

(a)

Fit trees of various sizes to the simple x = mileage, y = price problem using the susedcars.csv data.

What looks like a reasonable tree size?

(b)

Still just using the x = mileage, y = price problem,

use cross-validation to choose the tree size.

How does the tree chosen with CV compare with the one you chose in (a)?
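As a starting point for the CV step, rpart does 10-fold cross-validation automatically and stores the results in `cptable`. A sketch with toy data standing in for susedcars.csv (the variable names `mileage` and `price` follow the assignment):

```r
# Sketch: choose tree size by cross-validation with rpart (toy stand-in data).
library(rpart)
set.seed(1)
mileage <- runif(300, 0, 150)
price <- 60 * exp(-mileage / 70) + rnorm(300, sd = 3)
df <- data.frame(mileage, price)
big <- rpart(price ~ mileage, data = df, cp = 0.0005)  # grow a large tree
# rpart stores 10-fold CV error in cptable; pick the cp minimizing xerror
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
fit <- prune(big, cp = best_cp)
sum(fit$frame$var == "<leaf>")  # number of leaves chosen by CV
```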

(c)

Use cross-validation to fit a tree using y = price and x = all the other variables.

How ``good'' is the fit?

Is the tree you fit interpretable?

That is, try trees, random forests, and boosting on the usedcars.csv data.

We finished the Classification notes, except that the lift example did not work.

I think I have fixed this, so if you get the file mlfuns.R off the webpage, the code should run.

February 2: Finished the Trees notes.

January 26:

Finished ``5. Tree Models and the Bias Variance Trade Off'' in the trees notes.

January 19th: finished first set of notes on Intro, KNN, Bias-Variance tradeoff.

Homework is due Thursday January 26.

Hand in hard copy.

Each group just has to hand in one copy, but be sure to put all your names on the copy you hand in.

January 17th: stopped at slide 72 of the "Introduction to Predictive Models.." notes.

methods we have discussed.

But, if you want something simple but still hard to do:

Compare regularized linear methods to tree based methods on

either the (full) cars data (n about 20,000) or the (full) hockey data (n about 60,000).

For example, with the hockey data after you dummy up and add in some interactions, p will be in the hundreds.

To compare, you will need to pick a set of transformations to add to the linear specification.

Of course, compare means on the basis of out-of-sample performance!!
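"Out-of-sample" just means performance on a held-out test set. A minimal sketch of the comparison loop, with simulated data in place of the cars/hockey files:

```r
# Sketch: out-of-sample RMSE via a train/test split (simulated stand-in data).
set.seed(1)
n <- 1000
x <- rnorm(n)
df <- data.frame(x = x, y = 1 + 2 * x + rnorm(n))
train <- sample(n, 0.75 * n)               # 75% train, 25% test
fit <- lm(y ~ x, data = df[train, ])
pred <- predict(fit, newdata = df[-train, ])
sqrt(mean((df$y[-train] - pred)^2))        # test RMSE; compare across methods
```

Fit each method (regularized linear, trees, etc.) on the same training rows and compare their test RMSEs; never compare fits on the data they were trained on.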

But, we also want to interpret.

All the methods address the fundamental issue of variable selection

in one way or another.

What are the important variables?

Note: if the stuff runs too slowly on your laptop you can subsample,

but do it as little as possible.