Using the training data, the tables below give the counts of how often the terms adult and age appear in the documents.
Read in the training data:
trainB = read.csv("http://www.rob-mcculloch.org/data/smsTrainB.csv")
trainyB = read.csv("http://www.rob-mcculloch.org/data/smsTrainyB.csv")[,1]
Split the age and adult indicators by class and check the proportions of ham and spam:
iispam = trainyB == 1
ageH = trainB[!iispam,'age']
ageS = trainB[iispam,'age']
adultH = trainB[!iispam,'adult']
adultS = trainB[iispam,'adult']
table(iispam)/length(trainyB)
## iispam
## FALSE TRUE
## 0.8647158 0.1352842
Get the joint frequencies of (age,adult) for the ham observations.
table(ageH,adultH)
##     adultH
## ageH    0    1
##    0 3598    2
##    1    5    0
library(descr)
crosstab(ageH,adultH,plot=FALSE)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## =========================
##          adultH
## ageH        0   1   Total
## -------------------------
## 0        3598   2    3600
## -------------------------
## 1           5   0       5
## -------------------------
## Total    3603   2    3605
## =========================
Get the joint frequencies of (age,adult) for the spam observations.
table(ageS,adultS)
##     adultS
## ageS   0   1
##    0 549   3
##    1  12   0
crosstab(ageS,adultS,plot=FALSE)
##    Cell Contents
## |-------------------------|
## |                   Count |
## |-------------------------|
##
## ========================
##          adultS
## ageS       0   1   Total
## ------------------------
## 0        549   3     552
## ------------------------
## 1         12   0      12
## ------------------------
## Total    561   3     564
## ========================
In each table, age is on the rows and adult is on the columns.
As in the notes, we will always use observed frequencies to estimate probabilities.
So, for example,
\(p(age=0, adult=1 \,|\, ham) = 2/3605 = 0.000554785\)
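The frequency estimate above can be reproduced directly from the ham counts. The following is a minimal sketch that types the counts in by hand from the table; the names `hamCounts` and `hamProbs` are introduced here for illustration only.

```r
# Ham counts copied from the table above:
# rows are age (0/1), columns are adult (0/1).
hamCounts = matrix(c(3598, 2,
                        5, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(age = c("0", "1"), adult = c("0", "1")))

# Dividing each cell by the number of ham documents gives the
# joint frequency estimates of (age, adult) given ham.
hamProbs = hamCounts / sum(hamCounts)

# p(age=0, adult=1 | ham) = 2/3605
print(hamProbs["0", "1"])
```

With the training data loaded, `prop.table(table(ageH,adultH))` should give the same matrix without retyping the counts.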
(a)
Using the tables, check that the simple frequency estimate of \(p(age=yes \,|\, ham)\) is 0.00138, as in the notes.
(b)
Use the table and the Naive Bayes assumption to estimate \(p(ham \,|\, adult = no, age=yes)\).
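For reference, under the Naive Bayes assumption the joint probability of the two terms given the class factors into the product of the two marginals, so Bayes' theorem takes the following form. This is the standard result written in the notation of this problem, not a worked answer:

\[
p(ham \,|\, adult=no, age=yes) =
\frac{p(ham)\, p(adult=no \,|\, ham)\, p(age=yes \,|\, ham)}
{p(ham)\, p(adult=no \,|\, ham)\, p(age=yes \,|\, ham) + p(spam)\, p(adult=no \,|\, spam)\, p(age=yes \,|\, spam)}
\]

Each factor is estimated by an observed frequency: \(p(ham)\) and \(p(spam)\) from the class proportions, and the conditional probabilities from the row and column totals of the ham and spam tables.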
(c)
Use the table to estimate \(p(ham \,|\, adult = no, age=yes)\) without assuming Age and Adult are independent given y=ham/spam.
(d)
What happens if we try to estimate \(p(ham \,|\, adult=yes,age=yes)\) without the Naive Bayes assumption?
Get the susedcars.csv data set from the webpage. Plot x=mileage versus y=price. (price is the price of a used car.)
Does the relationship between mileage and price make sense?
Add the fit from a linear regression to the plot. Add the fit from kNN for various values of k to the plot.
For what value of k does the plot look nice?
Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?
What is the prediction from a linear fit?
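One way to set up the plot and the two predictions is sketched below. It is only an illustration under stated assumptions: the data are simulated placeholders so that the code runs on its own (replace them with the real file, assumed here to be saved locally as susedcars.csv with columns mileage and price), `knnPredict` is a hypothetical helper written for this sketch, and the values of k are arbitrary. A package such as kknn could be used in place of the helper.

```r
# Placeholder data, simulated ONLY so the sketch is runnable; not the real data.
# With the real file, use instead: cars = read.csv("susedcars.csv")
set.seed(1)
n = 200
cars = data.frame(mileage = runif(n, 0, 200000))
cars$price = 60000 * exp(-cars$mileage / 80000) + rnorm(n, sd = 3000)

# Hypothetical helper: one-dimensional kNN regression.
# For each new x, average y over the k training points with the nearest x.
knnPredict = function(xtrain, ytrain, xnew, k) {
  sapply(xnew, function(x0) mean(ytrain[order(abs(xtrain - x0))[1:k]]))
}

# Linear fit and a grid of mileage values for drawing the kNN curves
linFit = lm(price ~ mileage, data = cars)
xgrid = seq(min(cars$mileage), max(cars$mileage), length.out = 200)

plot(cars$mileage, cars$price, xlab = "mileage", ylab = "price", col = "grey")
abline(linFit, col = "red", lwd = 2)
kvec = c(5, 40, 150)  # arbitrary values of k to compare by eye
for (i in seq_along(kvec)) {
  lines(xgrid, knnPredict(cars$mileage, cars$price, xgrid, kvec[i]),
        lty = i, lwd = 2)
}

# Predictions at 100,000 miles; kNice is an assumed choice, not a recommendation
kNice = 40
predKnn = knnPredict(cars$mileage, cars$price, 100000, kNice)
predLin = predict(linFit, newdata = data.frame(mileage = 100000))
```

Small k tends to give a wiggly curve and large k a flat one, so comparing a few values by eye is a reasonable way to pick a "nice" k.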
Which model is better for the cars data with x=mileage and y=price, kNN (with a nice k) or the linear model?
Use a simple train/test split to see which model looks best.
Use plots to illustrate the results.
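A simple train/test comparison might look like the sketch below. The same caveats apply: the data are simulated placeholders standing in for susedcars.csv (assumed to have columns mileage and price), `knnPredict` and `rmse` are hypothetical helpers, and the 75/25 split and `kNice = 40` are arbitrary choices.

```r
# Placeholder data, simulated ONLY so the sketch is runnable; not the real data.
# With the real file, use instead: cars = read.csv("susedcars.csv")
set.seed(1)
n = 200
cars = data.frame(mileage = runif(n, 0, 200000))
cars$price = 60000 * exp(-cars$mileage / 80000) + rnorm(n, sd = 3000)

# Hypothetical helpers: 1-D kNN regression and root mean squared error
knnPredict = function(xtrain, ytrain, xnew, k) {
  sapply(xnew, function(x0) mean(ytrain[order(abs(xtrain - x0))[1:k]]))
}
rmse = function(y, yhat) sqrt(mean((y - yhat)^2))

# Random 75/25 train/test split
set.seed(99)
iiTrain = sample(1:n, floor(0.75 * n))
train = cars[iiTrain, ]
test = cars[-iiTrain, ]

# Fit both models on the training data, predict on the test data
kNice = 40  # assumed "nice" k
linFit = lm(price ~ mileage, data = train)
yhatLin = predict(linFit, newdata = test)
yhatKnn = knnPredict(train$mileage, train$price, test$mileage, kNice)

# Out-of-sample error for each model
rmseLin = rmse(test$price, yhatLin)
rmseKnn = rmse(test$price, yhatKnn)
print(c(linear = rmseLin, kNN = rmseKnn))

# Test data with the out-of-sample predictions from each model
plot(test$mileage, test$price, xlab = "mileage", ylab = "price", col = "grey")
points(test$mileage, yhatLin, col = "red", pch = 16)
points(test$mileage, yhatKnn, col = "blue", pch = 16)
legend("topright", legend = c("linear", "kNN"), col = c("red", "blue"), pch = 16)
```

The model with the smaller test RMSE looks better on this particular split. Because a single split is noisy, repeating it with a few different seeds gives a better sense of how stable the comparison is.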