Data Sets
This is the Hitters data used extensively in the wonderful book "An Introduction to Statistical Learning, second edition".
Missing values have been dropped, the two binary categorical variables have been dummied, and each x has been
scaled to have mean 0 and variance 1.
This is exactly what you get if you do the feature processing at the beginning of the Chapter 10 lab of ISLR.
Each row corresponds to a baseball player.
There are 21 columns: the first 20 are features and the last column (creatively called y) is the single numeric target,
the player's salary.
Gitters.csv
These train/test splits are the same as in the ISLR lab:
Gitters_train.csv
Gitters_test.csv
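A minimal Python sketch of how a file with this layout could be read, assuming the csv has a header row and that y is the last column as described above. The helper names and the tiny in-memory sample are hypothetical stand-ins, not the real data. The check uses the population variance; the real file may have been scaled with the n-1 version instead, so the variances may only be approximately 1.

```python
import csv
import io
import statistics


def split_features_target(rows):
    """Split a header row plus data rows into feature names, X, and y.

    Assumes the target is the last column, as in the description above.
    """
    header, *data = rows
    feature_names = header[:-1]
    X = [[float(v) for v in row[:-1]] for row in data]
    y = [float(row[-1]) for row in data]
    return feature_names, X, y


def column_means_and_variances(X):
    """Return the per-column mean and population variance of X."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    variances = [statistics.pvariance(c) for c in cols]
    return means, variances


# Hypothetical stand-in for the real file (two features, three rows).
# For the real data, replace `sample` with open("Gitters.csv", newline="").
sample = io.StringIO("x1,x2,y\n-1.0,1.0,500\n0.0,-1.0,750\n1.0,0.0,100\n")
feature_names, X, y = split_features_target(list(csv.reader(sample)))
means, variances = column_means_and_variances(X)
```

On the full file, `means` should be close to 0 for every feature, and `len(feature_names)` should be 20.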
monthly returns
Business Statistics
audit100.csv
audit10.csv
Housedata.csv Housedata.txt
RunsPerGame.csv
Price, Sales
Salary Data
shock absorber data
Stocks Data
Profits
Zagat
Orion
Beauty
midcity.csv midcity.txt
Simplified version of w8there data from Matt Taddy's textir R package:
swe8there.csv
Galaxies data from MASS library in R: galaxies.csv
Gene expression data from Alex Janss:
GDS4296_table.csv
python script to read in the data
Naive Bayes Classification for count data: NBYX.csv
Cereal data: cereal.txt
Ham/Spam train-test:
test x
test y
csv file of Boston housing data from MASS (in R)
Data for table 9.1 in Efron and Hastie, grabbed from the book webpage
Million Song data (from UCI repository)
Each observation corresponds to a song.
Numeric variable to be predicted is the year of song (first column).
90 x variables to predict y, characteristics of songs.
mstr.txt: training data
mste.txt: test data
R script to read data in and get results using linear regression
Background information on the data
Price of rough cut diamonds,
documentation
The kdd CRM data from this page.
(click on "data" in the green ribbon near the top of the page to get to the data)
kdd-upsell.zip
This is a zipped directory with x and y as separate .csv files, kdd-upselllabs-y.csv and kdd-upsell-x.csv.
This R script quickly looks at the data: look-kdd-upsell.R
Drug Discovery data used in BART (Chipman, George, and McCulloch 2010):
drug-discovery.csv
y has 1=somewhat active, 2=highly active.
In the paper, we combined "1" and "2" to form a single "active" category.
> temp = read.csv("drug-discovery.csv")
> table(temp$y)
    0     1     2
28832   383   159
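The recoding used in the paper can be sketched in Python as follows. This is only an illustration: it assumes that 0 marks compounds that are not active, and the label vector is rebuilt from the class counts in the table above rather than read from drug-discovery.csv. The function name is hypothetical.

```python
from collections import Counter


def to_binary_activity(y):
    """Map 1 (somewhat active) and 2 (highly active) to 1; everything else to 0.

    Assumption: label 0 means "not active".
    """
    return [1 if label in (1, 2) else 0 for label in y]


# Rebuild a label vector with the class counts shown in the table above.
y = [0] * 28832 + [1] * 383 + [2] * 159
counts = Counter(to_binary_activity(y))
print(counts[0], counts[1])  # 28832 542
```

After combining, the "active" class has 383 + 159 = 542 observations out of 29374, so the two classes are heavily unbalanced.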
Equally weighted portfolio of 50 stocks from the
S&P 500, weekly data:
p500-50-ew-weekly.csv
Israel survey data:
Ayx.csv
Kaggle Delinquency data, train and test:
kaggle-del-train.csv
kaggle-del-test.csv
UCI Machine Learning Repository
KDnuggets Data Repository
evals.csv: Course Evaluation Data, does beauty affect your rating?
sms_spam.csv: Short Message Service, text messages, Ham or Spam?
OJ.csv
movie-tfidf.csv
Tabloid_test.csv
Tabloid_train.csv
mnist-test.csv
mnist-train.csv
diabetes.csv
Data Description, from Hastie, Tibshirani, Wainwright website
sim-reg-data.csv
lmichyr.txt
lmichyr.desc, description file
satisfaction.csv
hockey penalty data: pens.csv
hockey penalty data: all the hockey data.
Hockey Penalty Paper
respond.csv
defect.csv
cereal.csv
tokyo_sub.csv
KaggleDelinquency.csv
Fidelity Returns: fidrets.csv
shock.csv
conret.csv
zagat.csv
BRWdata.csv
nbeerm1.csv
DefaultB.csv
response-phat.csv
susedcars.csv
usedcars.csv
calhouse.csv
td1.csv (target marketing train) td2.csv (target marketing test)
BeautyData.csv
midcity.csv midcity.txt
Housedata.csv Housedata.txt
mfunds.csv mfunds.txt
Price-level.csv