1. Naive Bayes


Using the training data, the tables below give counts of how often the words adult and age appear in the documents, separately for the ham and spam messages.


Read in the training data:

#trainB = read.csv("http://www.rob-mcculloch.org/data/smsTrainB.csv")
#trainyB = read.csv("http://www.rob-mcculloch.org/data/smsTrainyB.csv")[,1]

trainB = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainB.csv")
trainyB = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")[,1]

Check the proportions of ham and spam:

iispam = trainyB == 1
ageH = trainB[!iispam,'age']
ageS = trainB[iispam,'age']
adultH = trainB[!iispam,'adult']
adultS = trainB[iispam,'adult']
table(iispam)/length(trainyB)
## iispam
##     FALSE      TRUE 
## 0.8647158 0.1352842

Get the joint frequencies of (age,adult) for the ham observations.

table(ageH,adultH)
##     adultH
## ageH    0    1
##    0 3598    2
##    1    5    0
library(descr)
crosstab(ageH,adultH,plot=FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## ============================
##          adultH
## ageH        0      1   Total
## ----------------------------
## 0        3598      2    3600
## ----------------------------
## 1           5      0       5
## ----------------------------
## Total    3603      2    3605
## ============================

Get the joint frequencies of (age,adult) for the spam observations.

table(ageS,adultS)
##     adultS
## ageS   0   1
##    0 549   3
##    1  12   0
crosstab(ageS,adultS,plot=FALSE)
##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |-------------------------|
## 
## ==========================
##          adultS
## ageS       0     1   Total
## --------------------------
## 0        549     3     552
## --------------------------
## 1         12     0      12
## --------------------------
## Total    561     3     564
## ==========================

age is on the rows and adult is on the columns.


As in the notes, we will always use observed frequencies to estimate probabilities.

So, for example,

\(p(age=0, adult=1 \,|\, ham) = 2/3605 = 0.000554785\)
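
This is just the (age=0, adult=1) cell of the ham table divided by the ham total. A quick check in R, using the vectors defined above:

sum(ageH == 0 & adultH == 1) / length(ageH)  # joint frequency estimate of p(age=0, adult=1 | ham)
## [1] 0.000554785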


(a)

Using the tables, check that the simple frequency estimate of \(p(age=yes \,|\, ham)\) is .00138, as in the notes.

(b)

Use the table and the Naive Bayes assumption to estimate \(p(ham \,|\, adult = no, age=yes)\).

(c)

Use the table to estimate \(p(ham \,|\, adult = no, age=yes)\) without assuming Age and Adult are independent given y=ham/spam.

(d)

What happens if we try to estimate \(p(ham \,|\, adult=yes,age=yes)\) without the Naive Bayes assumption?
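
For (b), recall the setup from the notes: by Bayes' theorem, with the Naive Bayes assumption that age and adult are independent given the class,

\[
p(ham \,|\, adult=no, age=yes) = \frac{p(adult=no \,|\, ham)\, p(age=yes \,|\, ham)\, p(ham)}{\sum_{c \in \{ham,\, spam\}} p(adult=no \,|\, c)\, p(age=yes \,|\, c)\, p(c)}
\]

For (c), replace each product of conditionals with the joint frequency \(p(adult=no, age=yes \,|\, c)\) read directly off the tables. For (d), look at the (age=1, adult=1) cells.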


2. Fitting kNN to the Cars Data, just mileage

Get the susedcars.csv data set from the webpage. Plot x=mileage versus y=price. (price is the price of a used car.)

Does the relationship between mileage and price make sense?

Add the fit from a linear regression to the plot. Add the fit from kNN for various values of k to the plot.

For what value of k does the plot look nice?

Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?

What is the prediction from a linear fit?
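
A starter sketch in R. The file path and the column names price and mileage are assumptions; point read.csv at wherever you saved susedcars.csv.

library(kknn)  # kNN fitting, as in the notes

cd = read.csv("susedcars.csv")[, c("price", "mileage")]  # path and column names assumed

plot(cd$mileage, cd$price, xlab="mileage", ylab="price")

# linear fit
lmf = lm(price ~ mileage, data=cd)
abline(lmf, col="red", lwd=2)

# kNN fit; re-run with several values of k to find a "nice" one
oo = order(cd$mileage)  # sort by mileage so the fitted curve plots cleanly
kf = kknn(price ~ mileage, train=cd, test=cd[oo, ], k=50, kernel="rectangular")
lines(cd$mileage[oo], kf$fitted.values, col="blue", lwd=2)

# predictions at 100,000 miles: kNN, then linear
kknn(price ~ mileage, train=cd, test=data.frame(mileage=100000),
     k=50, kernel="rectangular")$fitted.values
predict(lmf, data.frame(mileage=100000))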


3. Fitting kNN to the Cars Data, just mileage, kNN or linear?

Which model is better for the cars data with x=mileage and y=price: kNN (with a nice k) or the linear model?

Use a simple train/test split to see which model looks best.

Use plots to illustrate the results.
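
One possible sketch, reusing cd from the problem 2 sketch; the seed and 75/25 split are arbitrary choices.

set.seed(99)
n = nrow(cd)
ii = sample(1:n, floor(0.75 * n))  # indices of the training rows
cdtr = cd[ii, ]
cdte = cd[-ii, ]

# linear model fit on train, evaluated on test
lmf = lm(price ~ mileage, data=cdtr)
lmpred = predict(lmf, cdte)

# kNN fit with your "nice" k
library(kknn)
kf = kknn(price ~ mileage, train=cdtr, test=cdte, k=50, kernel="rectangular")

rmse = function(y, yhat) sqrt(mean((y - yhat)^2))
rmse(cdte$price, lmpred)            # out-of-sample rmse, linear
rmse(cdte$price, kf$fitted.values)  # out-of-sample rmse, kNN

# plot the test data with both fits
oo = order(cdte$mileage)
plot(cdte$mileage, cdte$price, xlab="mileage", ylab="price")
lines(cdte$mileage[oo], lmpred[oo], col="red", lwd=2)
lines(cdte$mileage[oo], kf$fitted.values[oo], col="blue", lwd=2)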


4. Using Cross Validation

We are going to use the used cars data again.

Previously, we used the “eye-ball” method to choose k for a kNN fit for mileage predicting price.

Use 5-fold cross-validation to choose k. How does your fit compare with the eyeball method?

Plot the data and then add the fits using the k you chose with cross-validation and the k you chose by eyeball.

Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
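
A hand-rolled 5-fold cross-validation sketch, again assuming cd from above; the grid of candidate k values is arbitrary.

library(kknn)
set.seed(34)
n = nrow(cd)
fold = sample(rep(1:5, length.out = n))  # random fold assignment
kvec = seq(10, 300, by = 10)             # candidate k values (arbitrary grid)
cvrmse = rep(0, length(kvec))

for(j in seq_along(kvec)) {
  sse = 0
  for(f in 1:5) {
    kf = kknn(price ~ mileage, train = cd[fold != f, ], test = cd[fold == f, ],
              k = kvec[j], kernel = "rectangular")
    sse = sse + sum((cd$price[fold == f] - kf$fitted.values)^2)
  }
  cvrmse[j] = sqrt(sse / n)
}

plot(kvec, cvrmse, type = "b", xlab = "k", ylab = "5-fold cv rmse")
kbest = kvec[which.min(cvrmse)]

# refit on all the data with the chosen k and predict at 100,000 miles
kknn(price ~ mileage, train = cd, test = data.frame(mileage = 100000),
     k = kbest, kernel = "rectangular")$fitted.values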


5. kNN, Cars Data with Mileage and Year

Use kNN to get a prediction for a 2008 car with 75,000 miles on it!

Remember: mileage and year are on very different scales, so rescale the x's before computing kNN distances (see the sketch below).

Is your predictive accuracy better using (mileage,year) than it was with just mileage?
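
A sketch, assuming susedcars.csv also has a year column. Here the x's are rescaled to [0,1] by hand; kknn also standardizes the x's by default (its scale argument), so that is turned off below to keep the hand scaling in effect.

library(kknn)
cd2 = read.csv("susedcars.csv")[, c("price", "mileage", "year")]  # path and columns assumed

# rescale each x to [0,1] so mileage doesn't dominate the distance
mmsc = function(x) (x - min(x)) / (max(x) - min(x))
tr = data.frame(price = cd2$price,
                mileage = mmsc(cd2$mileage),
                year = mmsc(cd2$year))

# scale the new point with the training min/max
xp = data.frame(mileage = (75000 - min(cd2$mileage)) / (max(cd2$mileage) - min(cd2$mileage)),
                year    = (2008 - min(cd2$year)) / (max(cd2$year) - min(cd2$year)))

kf = kknn(price ~ mileage + year, train = tr, test = xp,
          k = 40, kernel = "rectangular", scale = FALSE)
kf$fitted.values  # predicted price for a 2008 car with 75,000 miles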

6. Choice of Kernel

In our notes examples we used kernel=“rectangular” when calling the R function kknn.

In R, have a look at the help for kknn (?kknn).

In Python, the help for KNeighborsRegressor in sklearn includes:

Parameters
----------
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for :meth:`kneighbors` queries.

weights : str or callable
    weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

    Uniform weights are used by default.

So you can weight the y values at the neighbors equally, or weight the closer ones more heavily. The default is typically equal weights.

Using the used cars data and the predictors (features!!) (mileage, year), try a weighting option other than uniform.
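
In R, the analogous knob is the kernel argument to kknn: "rectangular" gives equal weights, while kernels such as "triangular", "inv", and "gaussian" weight closer neighbors more heavily (see ?kknn for the full list). A sketch, reusing tr and xp from the problem 5 sketch:

library(kknn)
# equal weights vs. distance-based weights at the same k
kfu = kknn(price ~ mileage + year, train = tr, test = xp,
           k = 40, kernel = "rectangular", scale = FALSE)
kfw = kknn(price ~ mileage + year, train = tr, test = xp,
           k = 40, kernel = "triangular", scale = FALSE)
c(rectangular = kfu$fitted.values, triangular = kfw$fitted.values)

To judge which weighting actually predicts better, compare them with a train/test split or cross-validation as in problems 3 and 4.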