1. Naive Bayes

 

The tables below, computed from the training data, count how often the words adult and age appear in the ham and spam documents.

 

Read in the training data:

trainB = read.csv("http://www.rob-mcculloch.org/data/smsTrainB.csv")
trainyB = read.csv("http://www.rob-mcculloch.org/data/smsTrainyB.csv")[,1]

Check the proportions of ham and spam:

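iispam = trainyB == 1            # TRUE for spam rows, FALSE for ham rows
ageH = trainB[!iispam,'age']     # age indicator, ham documents
ageS = trainB[iispam,'age']      # age indicator, spam documents
adultH = trainB[!iispam,'adult'] # adult indicator, ham documents
adultS = trainB[iispam,'adult']  # adult indicator, spam documents
table(iispam)/length(trainyB)    # class proportions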
## iispam
##     FALSE      TRUE 
## 0.8647158 0.1352842

Get the joint frequencies of (age,adult) for the ham observations.

table(ageH,adultH)
##     adultH
## ageH    0    1
##    0 3598    2
##    1    5    0
library(descr)
crosstab(ageH,adultH,plot=FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## =========================
##          adultH
## ageH        0   1   Total
## -------------------------
## 0        3598   2    3600
## -------------------------
## 1           5   0       5
## -------------------------
## Total    3603   2    3605
## =========================

Get the joint frequencies of (age,adult) for the spam observations.

table(ageS,adultS)
##     adultS
## ageS   0   1
##    0 549   3
##    1  12   0
crosstab(ageS,adultS,plot=FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## ========================
##          adultS
## ageS       0   1   Total
## ------------------------
## 0        549   3     552
## ------------------------
## 1         12   0      12
## ------------------------
## Total    561   3     564
## ========================

In both tables, age is on the rows and adult is on the columns.

 

As in the notes, we will always use observed frequencies to estimate probabilities.

So, for example,

\(p(age=0, adult=1 \,|\, ham) = 2/3605 = 0.000554785\)
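
As a minimal sketch of this frequency estimate, the ham counts can be typed in by hand from the table above and divided by their total. The object names `ham_counts`, `ham_joint`, and `p_example` are illustrative and not part of the assignment; with the data loaded, `prop.table(table(ageH, adultH))` should give the same joint frequencies.

```r
# Counts copied from the ham table above (rows: age = 0/1, columns: adult = 0/1).
ham_counts <- matrix(c(3598, 2,
                       5,    0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(age = c("0", "1"), adult = c("0", "1")))

# Dividing every cell by the number of ham documents gives the
# observed joint frequencies of (age, adult) given ham.
ham_joint <- ham_counts / sum(ham_counts)

# Estimate of p(age = 0, adult = 1 | ham): 2 out of 3605 ham documents.
p_example <- ham_joint["0", "1"]
print(p_example)
```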


(a)

Using the tables, check that the simple frequency estimate of \(p(age=yes \,|\, ham)\) is 0.00138, as in the notes.

(b)

Use the table and the Naive Bayes assumption to estimate \(p(ham \,|\, adult = no, age=yes)\).

(c)

Use the table to estimate \(p(ham \,|\, adult = no, age=yes)\) without assuming Age and Adult are independent given y=ham/spam.

(d)

What happens if we try to estimate \(p(ham \,|\, adult=yes,age=yes)\) without the Naive Bayes assumption?
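
The two kinds of estimate asked for in this problem can be sketched in R as follows. This is only an illustration under stated assumptions: the counts are typed in from the tables above, the helper names `cell_lik` and `post_ham` are hypothetical, and the example is evaluated at the cell (age = 0, adult = 0), which none of the parts asks about.

```r
# Counts copied from the ham and spam tables (rows: age, columns: adult).
ham  <- matrix(c(3598, 2, 5, 0), nrow = 2, byrow = TRUE,
               dimnames = list(age = c("0", "1"), adult = c("0", "1")))
spam <- matrix(c(549, 3, 12, 0), nrow = 2, byrow = TRUE,
               dimnames = list(age = c("0", "1"), adult = c("0", "1")))

# Class priors estimated by observed frequencies.
n_ham  <- sum(ham)
n_spam <- sum(spam)
prior_ham  <- n_ham / (n_ham + n_spam)
prior_spam <- 1 - prior_ham

# Likelihood of one (age, adult) cell within a class.
# naive = TRUE multiplies the two marginal frequencies (Naive Bayes assumption);
# naive = FALSE uses the joint cell frequency directly.
cell_lik <- function(counts, age, adult, naive) {
  n <- sum(counts)
  if (naive) {
    (sum(counts[age, ]) / n) * (sum(counts[, adult]) / n)
  } else {
    counts[age, adult] / n
  }
}

# Bayes' theorem: p(ham | cell) = prior * likelihood / total over both classes.
post_ham <- function(age, adult, naive) {
  num <- prior_ham * cell_lik(ham, age, adult, naive)
  den <- num + prior_spam * cell_lik(spam, age, adult, naive)
  num / den
}

nb_00    <- post_ham("0", "0", naive = TRUE)   # with the Naive Bayes assumption
joint_00 <- post_ham("0", "0", naive = FALSE)  # without it
print(c(naive_bayes = nb_00, joint = joint_00))
```

Calling `post_ham` with other values of `age` and `adult` gives the corresponding estimates for the remaining cells.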


2. Fitting kNN to the Cars Data, just mileage

Get the susedcars.csv data set from the webpage. Plot x=mileage versus y=price. (price is the price of a used car.)

Does the relationship between mileage and price make sense?

Add the fit from a linear regression to the plot. Add the fit from kNN for various values of k to the plot.

For what value of k does the plot look nice?

Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?

What is the prediction from a linear fit?
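
The plotting and prediction steps in this problem might look roughly like the sketch below. It is written in base R and runs on simulated data, so that it is self-contained: the simulated relationship, the helper `knn_predict`, and the values of k shown are all assumptions for illustration, not the contents of susedcars.csv or the method used in the notes. A package implementation of kNN could be substituted for the helper.

```r
# Simulated stand-in for the cars data (assumption: one numeric predictor
# called mileage and a response called price).
set.seed(1)
n <- 200
cars <- data.frame(mileage = runif(n, 0, 200000))
cars$price <- 60000 * exp(-cars$mileage / 80000) + rnorm(n, sd = 3000)

# kNN regression with one predictor: for each new x, average the y values
# of the k training points whose x is closest.
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

# Linear regression fit.
lin_fit <- lm(price ~ mileage, data = cars)

# Scatter plot with the linear fit and kNN fits for several k.
grid <- seq(min(cars$mileage), max(cars$mileage), length.out = 200)
plot(cars$mileage, cars$price, xlab = "mileage", ylab = "price", col = "grey")
abline(lin_fit, col = "red", lwd = 2)
ks <- c(5, 40, 150)
for (i in seq_along(ks)) {
  lines(grid, knn_predict(cars$mileage, cars$price, grid, ks[i]),
        col = "blue", lty = i)
}
legend("topright", legend = c("linear", paste("kNN, k =", ks)),
       col = c("red", rep("blue", length(ks))), lty = c(1, seq_along(ks)))

# Predictions at 100,000 miles from each model (k = 40 is a placeholder).
knn_100k <- knn_predict(cars$mileage, cars$price, 100000, k = 40)
lin_100k <- predict(lin_fit, newdata = data.frame(mileage = 100000))
print(c(knn = knn_100k, linear = unname(lin_100k)))
```

Small k follows the noise in the data, while very large k flattens the fit toward the overall mean price, which is why the plot is worth inspecting at several values.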


3. Fitting kNN to the Cars Data, just mileage: kNN or linear?

Which model is better for the cars data with x=mileage and y=price: kNN (with a nice k) or the linear model?

Use a simple train/test split to see which model looks best.

Use plots to illustrate the results.
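
One way to set up the train/test comparison is sketched below, again in base R on simulated data. The 75/25 split, the out-of-sample RMSE as the measure of fit, the helper names `knn_predict` and `rmse`, and the value `k_nice = 40` are assumptions chosen for illustration; the assignment does not prescribe them.

```r
# Simulated stand-in for the cars data (assumption, as in any sketch
# that does not read susedcars.csv).
set.seed(99)
n <- 300
cars <- data.frame(mileage = runif(n, 0, 200000))
cars$price <- 60000 * exp(-cars$mileage / 80000) + rnorm(n, sd = 3000)

# Simple split: hold out 25% of the rows as the test set.
test_idx <- sample(n, size = round(0.25 * n))
train <- cars[-test_idx, ]
test  <- cars[test_idx, ]

# kNN regression with one predictor: average y over the k nearest x values.
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    mean(y_train[nearest])
  })
}

# Root mean squared error of predictions yhat against observed y.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Fit both models on the training rows only, then predict the test rows.
lin_fit  <- lm(price ~ mileage, data = train)
lin_pred <- predict(lin_fit, newdata = test)
k_nice   <- 40  # placeholder for the "nice" k chosen by eye
knn_pred <- knn_predict(train$mileage, train$price, test$mileage, k_nice)

rmse_lin <- rmse(test$price, lin_pred)
rmse_knn <- rmse(test$price, knn_pred)
print(c(linear = rmse_lin, knn = rmse_knn))

# Predicted versus observed test prices; points near the 45-degree line
# indicate good out-of-sample predictions.
plot(test$price, lin_pred, col = "red", xlab = "observed test price",
     ylab = "predicted price")
points(test$price, knn_pred, col = "blue", pch = 2)
abline(0, 1)
legend("topleft", legend = c("linear", "kNN"), col = c("red", "blue"),
       pch = c(1, 2))
```

Because a single random split can favor either model by chance, rerunning with a different seed is a cheap check on how stable the comparison is.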