Using the training data, the tables below give the counts for how often Adult and Age appear in the documents.
Read in the training data:
# alternative location for the same files:
# trainB = read.csv("http://www.rob-mcculloch.org/data/smsTrainB.csv")
# trainyB = read.csv("http://www.rob-mcculloch.org/data/smsTrainyB.csv")[,1]
trainB = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainB.csv")
trainyB = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")[,1]
Check the proportions of ham and spam:
iispam = trainyB == 1                # TRUE for spam, FALSE for ham
ageH = trainB[!iispam,'age']         # age indicator, ham documents
ageS = trainB[iispam,'age']          # age indicator, spam documents
adultH = trainB[!iispam,'adult']     # adult indicator, ham documents
adultS = trainB[iispam,'adult']      # adult indicator, spam documents
table(iispam)/length(trainyB)
## iispam
## FALSE TRUE
## 0.8647158 0.1352842
Get the joint frequencies of (age,adult) for the ham observations.
table(ageH,adultH)
## adultH
## ageH 0 1
## 0 3598 2
## 1 5 0
library(descr)
crosstab(ageH,adultH,plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## ============================
## adultH
## ageH 0 1 Total
## ----------------------------
## 0 3598 2 3600
## ----------------------------
## 1 5 0 5
## ----------------------------
## Total 3603 2 3605
## ============================
Get the joint frequencies of (age,adult) for the spam observations.
table(ageS,adultS)
## adultS
## ageS 0 1
## 0 549 3
## 1 12 0
crosstab(ageS,adultS,plot=FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## |-------------------------|
##
## ==========================
## adultS
## ageS 0 1 Total
## --------------------------
## 0 549 3 552
## --------------------------
## 1 12 0 12
## --------------------------
## Total 561 3 564
## ==========================
In each table, age is on the rows and adult is on the columns.
As in the notes, we will always use observed frequencies to estimate probabilities.
So, for example, \(p(age=0, adult=1 \,|\, ham) = 2/3605 = 0.000554785\).
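For instance, a minimal sketch of that computation in R, reusing the ageH and adultH vectors defined above:
hamTab = table(ageH, adultH)   # joint counts for the ham documents
hamTab/sum(hamTab)             # cell (age=0, adult=1) gives 2/3605 = 0.000554785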
(a)
Using the tables, check that the simple frequency estimate of \(p(age=yes \,|\, ham)\) is .00138, as in the notes.
(b)
Use the table and the Naive Bayes assumption to estimate \(p(ham \,|\, adult = no, age=yes)\).
(c)
Use the table to estimate \(p(ham \,|\, adult = no, age=yes)\) without assuming Age and Adult are independent given y=ham/spam.
(d)
What happens if we try to estimate \(p(ham \,|\, adult=yes,age=yes)\) without the Naive Bayes assumption?
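As a hint for (b), here is one way to organize the Naive Bayes computation in R, with the counts read straight off the two tables above:
# Naive Bayes: p(ham | adult=0, age=1) is proportional to
#   p(adult=0 | ham) * p(age=1 | ham) * p(ham)
pH = mean(!iispam); pS = mean(iispam)     # priors: 0.865 and 0.135
numH = (3603/3605) * (5/3605) * pH        # counts from the ham table
numS = (561/564) * (12/564) * pS          # counts from the spam table
numH/(numH + numS)                        # normalize over ham and spam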
Get the susedcars.csv data set from the webpage. Plot x=mileage versus y=price. (price is the price of a used car.)
Does the relationship between mileage and price make sense?
Add the fit from a linear regression to the plot. Add the fit from kNN for various values of k to the plot.
For what value of k does the plot look nice?
Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?
What is the prediction from a linear fit?
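One way this might look in R using the kknn package (a sketch: the data-set URL is assumed from the pattern above, and k=50 is just a placeholder for your "nice" k):
library(kknn)
cd = read.csv("http://www.rob-mcculloch.org/data/susedcars.csv")  # assumed location
cd = cd[order(cd$mileage),]          # sort on x so lines() draws left to right
plot(cd$mileage, cd$price, xlab="mileage", ylab="price", col="gray", pch=16, cex=0.5)
abline(lm(price ~ mileage, cd), col="red", lwd=2)   # linear fit
kvals = c(10, 50, 200); cols = c("blue", "green", "purple")
for(i in 1:3) {                      # kNN fits for a few values of k
   kf = kknn(price ~ mileage, cd, cd, k=kvals[i], kernel="rectangular")
   lines(cd$mileage, kf$fitted.values, col=cols[i], lwd=2)
}
# prediction at 100,000 miles, using k=50 as the "nice" k
kknn(price ~ mileage, cd, data.frame(mileage=100000), k=50, kernel="rectangular")$fitted.values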
Which model is better for the cars data with x=mileage and y=price: kNN (with a nice k) or the linear model?
Use a simple train/test split to see which model looks best.
Use plots to illustrate the results.
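A possible version of the train/test comparison, continuing with the cd data frame from the sketch above and scoring with out-of-sample RMSE (k=50 is still the placeholder choice):
set.seed(99)                         # for a reproducible split
n = nrow(cd)
ii = sample(1:n, floor(0.75*n))      # 75% train, 25% test
tr = cd[ii,]; te = cd[-ii,]
lmf = lm(price ~ mileage, tr)        # linear model fit on train
kf = kknn(price ~ mileage, tr, te, k=50, kernel="rectangular")
rmse = function(y, yhat) sqrt(mean((y - yhat)^2))
c(lm = rmse(te$price, predict(lmf, te)), knn = rmse(te$price, kf$fitted.values))
plot(te$price, kf$fitted.values, xlab="test price", ylab="kNN prediction")
abline(0, 1, col="red")              # points near the line are good predictions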
We are going to use the used cars data again.
Previously, we used the “eye-ball” method to choose k for a kNN fit with mileage predicting price.
Use 5-fold cross-validation to choose k. How does your fit compare with the eye-ball method?
Plot the data and then add the fits using the k you chose by cross-validation and the k you chose by eye-ball.
Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
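Here is one way to write the 5-fold cross-validation by hand (the kknn package also has helpers such as train.kknn, but the loop keeps the idea explicit); the grid of candidate k values is arbitrary:
set.seed(14)
fold = sample(rep(1:5, length.out = nrow(cd)))   # random fold labels 1..5
kvec = seq(5, 200, by=5)                         # candidate k values
cvrmse = rep(0, length(kvec))
for(j in seq_along(kvec)) {
   sse = 0
   for(f in 1:5) {                               # hold out fold f, train on the rest
      kf = kknn(price ~ mileage, cd[fold != f,], cd[fold == f,],
                k=kvec[j], kernel="rectangular")
      sse = sse + sum((cd$price[fold == f] - kf$fitted.values)^2)
   }
   cvrmse[j] = sqrt(sse/nrow(cd))
}
kbest = kvec[which.min(cvrmse)]                  # CV choice of k
plot(kvec, cvrmse, type="b"); kbest
# prediction at 100,000 miles, training on all the data with the CV k
kknn(price ~ mileage, cd, data.frame(mileage=100000), k=kbest, kernel="rectangular")$fitted.values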
Use kNN to get a prediction for a 2008 car with 75,000 miles on it!
Remember to check: is your predictive accuracy better using (mileage, year) than it was with just mileage?
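A sketch for the two-feature fit; note that kknn rescales the predictor columns by default (scale=TRUE), which matters because mileage and year live on very different scales. kbest carries over from the cross-validation sketch:
kf2 = kknn(price ~ mileage + year, cd,
           data.frame(mileage=75000, year=2008),
           k=kbest, kernel="rectangular")   # scale=TRUE is the default
kf2$fitted.values                           # predicted price for the 2008 car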
In our notes examples we used kernel="rectangular" when calling the R function kknn.
In R, have a look at the help for kknn (?kknn).
In Python, the help for KNeighborsRegressor in sklearn includes:
Parameters
----------
n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for :meth:`kneighbors` queries.
weights : str or callable
weight function used in prediction. Possible values:
- 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
Uniform weights are used by default.
So you can weight the y values at the neighbors equally, or weight the closer ones more heavily. The default is equal (uniform) weights.
Using the used cars data and the predictors (features!!) (mileage, year), try a weighting option other than uniform.
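For instance, in R a non-uniform weighting can be requested through kknn's kernel argument; kernel="inv" weights neighbors by inverse distance, roughly the analogue of weights='distance' above:
kfw = kknn(price ~ mileage + year, cd,
           data.frame(mileage=75000, year=2008),
           k=kbest, kernel="inv")   # inverse-distance weights
kfw$fitted.values                   # compare with the rectangular (uniform) fit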