Data Sets


This is the Hitters data used extensively in the wonderful book "Introduction to Statistical Learning, second edition".
Missing values have been dropped, the two binary categorical variables have been dummied, and each x has been
scaled to have mean 0 and variance 1.
This is exactly what you get if you do the feature processing at the beginning of the Chapter 10 lab of ISLR.
Each row corresponds to a baseball player.
There are 21 columns: the first 20 are features and the last column (creatively called y) is the single numeric target,
which is the player's salary. Gitters.csv
These train/test splits are the same as in the ISLR lab:
Gitters_train.csv
Gitters_test.csv
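
A minimal Python sketch of reading one of these files and checking the layout described above. It assumes the file has a header row, that every column is numeric, and that there is no extra index column; the function names load_xy and column_summary are made up for this illustration.

```python
import csv
import statistics


def load_xy(path):
    """Read a CSV with a header row and return (feature_names, X, y).

    Assumption: every column is numeric and the last column is the target.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [[float(v) for v in row] for row in reader if row]
    X = [row[:-1] for row in rows]   # all columns but the last
    y = [row[-1] for row in rows]    # last column is the target
    return header[:-1], X, y


def column_summary(X):
    """Return (mean, population variance) for each feature column."""
    columns = list(zip(*X))
    return [(statistics.fmean(c), statistics.pvariance(c)) for c in columns]


# Example use (assumes Gitters.csv has been downloaded to the working directory):
# names, X, y = load_xy("Gitters.csv")
# print(len(names))             # expected to be 20 if the layout is as described
# print(column_summary(X)[:3])  # means should be near 0, variances near 1
```

Whether the variances come out exactly 1 or only close to 1 depends on whether the scaling divided by n or by n-1; swap statistics.pvariance for statistics.variance to check the other convention.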


monthly returns

Business Statistics

audit100.csv
audit10.csv
Housedata.csv   Housedata.txt  

RunsPerGame.csv

Price, Sales

Salary Data

shock absorber data  

Stocks Data

Profits

Zagat  

Orion

Beauty

midcity.csv   midcity.txt  


Simplified version of w8there data from Matt Taddy's textir R package:
swe8there.csv

Galaxies data from MASS library in R: galaxies.csv

Gene expression data from Alex Janss:
GDS4296_table.csv
python script to read in the data


Naive Bayes Classification for count data: NBYX.csv


Cereal data: cereal.txt

Ham/Spam train-test:
test x
test y



csv file of Boston housing data from MASS (in R)



Data for table 9.1 in Efron and Hastie, grabbed from the book webpage


Million Song data (from UCI repository)
Each observation corresponds to a song.
The numeric variable to be predicted is the year of the song (first column).
The 90 x variables used to predict y are characteristics of the songs.
mstr.txt: training data
mste.txt: test data
R script to read data in and get results using linear regression
Background information on the data
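
As a rough point of comparison for the linear regression results, here is a Python sketch that pulls the year out of the first column and computes the test RMSE of always predicting the training-set mean year. It assumes the files are comma-separated with no header row, which should be checked against the actual files; the function names read_target_first and baseline_rmse are invented for this example.

```python
import csv
import math


def read_target_first(path, delimiter=","):
    """Return (y, X) from a headerless numeric file whose first column is the target.

    Assumption: the delimiter is a comma; pass a different one if the file differs.
    """
    y, X = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if not row:
                continue
            values = [float(v) for v in row]
            y.append(values[0])    # year of the song
            X.append(values[1:])   # the x variables
    return y, X


def baseline_rmse(y_train, y_test):
    """Test RMSE from always predicting the mean of the training targets."""
    mean_year = sum(y_train) / len(y_train)
    squared_errors = [(v - mean_year) ** 2 for v in y_test]
    return math.sqrt(sum(squared_errors) / len(y_test))


# Example use (assumes mstr.txt and mste.txt are in the working directory):
# y_train, X_train = read_target_first("mstr.txt")
# y_test, X_test = read_target_first("mste.txt")
# print(len(X_train[0]))                 # expected to be 90 if the layout is as described
# print(baseline_rmse(y_train, y_test))
```

Any regression on the 90 x variables should beat this number on the test data to be worth using.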





Price of rough cut diamonds, documentation


The kdd CRM data from this page. (click on "data" in the green ribbon near the top of the page to get to the data)
kdd-upsell.zip  
This is a zipped directory with x and y as separate .csv files, kdd-upselllabs-y.csv and kdd-upsell-x.csv.
This R script quickly looks at the data: look-kdd-upsell.R  


Drug Discovery data used in BART (Chipman, George, and McCulloch 2010), drug-discovery.csv
y has 1=somewhat active, 2=highly active.
In the paper, we combined "1" and "2" to form a single "active" category.
> temp = read.csv("drug-discovery.csv")
> table(temp$y)

    0     1     2 
28832   383   159
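
The recoding used in the paper can be sketched in Python as follows. The label vector is rebuilt from the counts in the table above rather than read from the file, and treating 0 as "not active" is an assumption that follows from the description of 1 and 2; the function name to_active is made up for this illustration.

```python
from collections import Counter


def to_active(y):
    """Map 1 (somewhat active) and 2 (highly active) to 1; everything else to 0."""
    return [1 if label in (1, 2) else 0 for label in y]


# Rebuild the label vector from the counts in the table above.
counts = {0: 28832, 1: 383, 2: 159}
y = [label for label, n in counts.items() for _ in range(n)]

active = to_active(y)
binary_counts = Counter(active)

print(binary_counts[0], binary_counts[1])         # 28832 542
print(round(binary_counts[1] / len(active), 4))   # 0.0185
```

So after combining, under 2 percent of the observations are in the active category.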



Equally weighted portfolio of 50 stocks from the S&P 500, weekly data, p500-50-ew-weekly.csv

Israel survey data:
Ayx.csv

Kaggle Delinquency data, train and test:
kaggle-del-train.csv
kaggle-del-test.csv

UCI Machine Learning Repository

KDnuggets Data Repository

evals.csv: Course Evaluation Data, does beauty affect your rating?

sms_spam.csv: Short Message Service, text messages, Ham or Spam?

OJ.csv

movie-tfidf.csv

Tabloid_test.csv

Tabloid_train.csv

mnist-test.csv

mnist-train.csv

diabetes.csv Data Description, from Hastie, Tibshirani, Wainwright website

sim-reg-data.csv

lmichyr.txt    lmichyr.desc, description file

satisfaction.csv

hockey penalty data: pens.csv

hockey penalty data: all the hockey data.
Hockey Penalty Paper


respond.csv

defect.csv

cereal.csv

tokyo_sub.csv

KaggleDelinquency.csv  

Fidelity Returns: fidrets.csv  

shock.csv  

conret.csv  

zagat.csv  

BRWdata.csv  

nbeerm1.csv  

DefaultB.csv  

response-phat.csv  

susedcars.csv  

usedcars.csv  

calhouse.csv 

td1.csv (target marketing train)    td2.csv (target marketing test) 

BeautyData.csv

midcity.csv   midcity.txt  

Housedata.csv   Housedata.txt  

mfunds.csv   mfunds.txt  

Price-level.csv