Data Sets

Default data from ISLR

Diagnose cancer, y is first column:
CASchools.csv

This is the Hitters data used extensively in the wonderful book "Introduction to Statistical Learning, second edition".
Missing values have been dropped, the two binary categorical variables have been dummied, and each x has been
scaled to have mean 0 and variance 1.
This is exactly what you get if you do the feature processing at the beginning of the Chapter 10 lab of ISLR.
Each row corresponds to a baseball player.
There are 21 columns: the first 20 are features and the last column (creatively called y) is the single numeric target,
the player's salary. Gitters.csv
These train/test splits are the same as in the ISLR lab:
Gitters_train.csv
Gitters_test.csv
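
The column layout described above (features first, target y last) can be sketched in Python with only the standard library. This is a minimal illustration, not the ISLR lab code: the toy CSV text, its column names, and the helper names are made up, and it assumes the real file has a header row and is comma separated.

```python
import csv
import io
import statistics


def split_features_target(rows):
    """Treat the last column of each row as y and the rest as features."""
    X = [[float(v) for v in row[:-1]] for row in rows]
    y = [float(row[-1]) for row in rows]
    return X, y


def column_means(X):
    """Mean of each feature column; these should be near 0 for scaled data."""
    return [statistics.fmean(col) for col in zip(*X)]


# Toy stand-in for the real file (hypothetical values, 2 features instead of 20).
toy_csv = "x1,x2,y\n-1.0,1.0,500\n0.0,-1.0,750\n1.0,0.0,1200\n"

reader = csv.reader(io.StringIO(toy_csv))
header = next(reader)  # assumes a header row is present
X, y = split_features_target(list(reader))
means = column_means(X)
```

To try it on the real data, replace `io.StringIO(toy_csv)` with `open("Gitters.csv", newline="")`, after checking that the header and delimiter assumptions hold.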

monthly returns

Business Statistics

audit100.csv
audit10.csv
Housedata.csv   Housedata.txt

RunsPerGame.csv

Price, Sales

Salary Data

shock absorber data

Stocks Data

Profits

Zagat

Orion

Beauty

midcity.csv   midcity.txt

Simplified version of w8there data from Matt Taddy's textir R package:
swe8there.csv

Galaxies data from MASS library in R: galaxies.csv

Gene expression data from Alex Janss:
GDS4296_table.csv
Python script to read in the data

Naive Bayes Classification for count data: NBYX.csv

Cereal data: cereal.txt

Ham/Spam train-test:
test x
test y

csv file of Boston housing data from MASS (in R)

Data for table 9.1 in Efron and Hastie, grabbed from the book webpage

Million Song data (from UCI repository)
Each observation corresponds to a song.
Numeric variable to be predicted is the year of song (first column).
90 x variables to predict y, characteristics of songs.
mstr.txt: training data
mste.txt: test data
R script to read data in and get results using linear regression
Background information on the data
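
As a rough Python sketch of the layout described above (year in the first column, song characteristics after it), the code below parses rows and scores a mean-only baseline on a test set. It is not the linear regression from the R script. The toy lines are invented, and the field separator of mstr.txt and mste.txt is an assumption (whitespace by default, pass `sep=","` for comma-separated files).

```python
import math
import statistics


def parse_rows(lines, sep=None):
    """Split each line into y (first field) and x (remaining fields)."""
    ys, xs = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.strip().split(sep)
        ys.append(float(fields[0]))
        xs.append([float(v) for v in fields[1:]])
    return ys, xs


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(statistics.fmean(squared))


# Toy stand-ins for the training and test files (hypothetical values, 2 x's instead of 90).
train_lines = ["2001 0.5 1.2", "1999 0.1 0.3", "2003 0.9 0.7"]
test_lines = ["2000 0.4 1.0", "2002 0.8 0.2"]

y_train, x_train = parse_rows(train_lines)
y_test, x_test = parse_rows(test_lines)

# Baseline: predict the mean training year for every test song.
mean_year = statistics.fmean(y_train)
baseline_rmse = rmse(y_test, [mean_year] * len(y_test))
```

Any regression fit to the 90 x variables should beat this baseline error on the test data; if it does not, something is likely wrong with how the columns were read.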

Price of rough cut diamonds, documentation

The kdd CRM data from this page. (click on "data" in the green ribbon near the top of the page to get to the data)
kdd-upsell.zip
This is a zipped directory with x and y as separate .csv files, kdd-upselllabs-y.csv and kdd-upsell-x.csv.
This R script quickly looks at the data: look-kdd-upsell.R
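
For Python users, the two files can be read straight from the archive without unzipping. The sketch below is only an illustration: it builds a tiny in-memory zip with made-up contents, and it assumes (without having checked the real archive) that the members sit under a directory and are UTF-8 comma-separated text with a header row. The helper name is hypothetical.

```python
import csv
import io
import zipfile


def read_csv_member(archive, suffix):
    """Return the rows of the first archive member whose name ends with suffix."""
    name = next(n for n in archive.namelist() if n.endswith(suffix))
    with archive.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.reader(text))


# Toy archive standing in for kdd-upsell.zip (hypothetical directory name and values).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("kdd-upsell/kdd-upsell-x.csv", "v1,v2\n1,2\n3,4\n")
    zf.writestr("kdd-upsell/kdd-upselllabs-y.csv", "y\n0\n1\n")
buffer.seek(0)

with zipfile.ZipFile(buffer) as zf:
    x_rows = read_csv_member(zf, "kdd-upsell-x.csv")
    y_rows = read_csv_member(zf, "kdd-upselllabs-y.csv")

# x and y are separate files, so check that they have the same number of rows.
same_length = len(x_rows) == len(y_rows)
```

With the downloaded file, replace `buffer` with the path `"kdd-upsell.zip"`. Matching members by name suffix avoids having to know the directory name inside the archive.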

Drug Discovery data used in BART (Chipman, George, and McCulloch 2010), drug-discovery.csv
y has 1=somewhat active, 2=highly active.
In the paper, we combined "1" and "2" to form a single "active" category.

> temp = read.csv("drug-discovery.csv")
> table(temp$y)

    0     1     2 
28832   383   159
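
The recoding mentioned above (merging "1" and "2" into one "active" class) can be sketched in Python as follows. This is an illustration rather than the code used for the paper: the function name is hypothetical, it assumes 0 is the remaining not-active class, and the label vector is rebuilt from the counts in the table instead of being read from drug-discovery.csv.

```python
from collections import Counter


def to_binary_activity(y):
    """Map labels 1 and 2 to 1 (active) and every other label to 0."""
    return [1 if int(v) in (1, 2) else 0 for v in y]


# Rebuild a label vector from the counts printed by table(temp$y) above.
counts = {0: 28832, 1: 383, 2: 159}
y = [label for label, n in counts.items() for _ in range(n)]

binary_counts = Counter(to_binary_activity(y))
print(binary_counts[0], binary_counts[1])  # 28832 542
```

The combined active class has 383 + 159 = 542 of the 29374 observations, so the binary problem is heavily imbalanced.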

Equally weighted portfolio of 50 stocks from the SP500, weekly data, p500-50-ew-weekly.csv

Israel survey data:
Ayx.csv

Kaggle Delinquency data, train and test:
kaggle-del-train.csv
kaggle-del-test.csv

UCI Machine Learning Repository

KDnuggets Data Repository

evals.csv: Course Evaluation Data, does beauty affect your rating?

sms_spam.csv: Short Message Service, text messages, Ham or Spam?

OJ.csv

movie-tfidf.csv

Tabloid_test.csv

Tabloid_train.csv

mnist-test.csv

mnist-train.csv

diabetes.csv Data Description, from Hastie, Tibshirani, Wainwright website

sim-reg-data.csv

lmichyr.txt    lmichyr.desc, description file

satisfaction.csv

hockey penalty data: pens.csv

hockey penalty data: all the hockey data.
Hockey Penalty Paper

respond.csv

defect.csv

cereal.csv

tokyo_sub.csv

KaggleDelinquency.csv

Fidelity Returns: fidrets.csv

shock.csv

conret.csv

zagat.csv

BRWdata.csv

nbeerm1.csv

DefaultB.csv

response-phat.csv

susedcars.csv

usedcars.csv

calhouse.csv

td1.csv (target marketing train)    td2.csv (target marketing test)

BeautyData.csv

midcity.csv   midcity.txt

Housedata.csv   Housedata.txt

mfunds.csv   mfunds.txt

Price-level.csv