1. Using Cross Validation

In Python, the simple code we looked at that uses sklearn to do cross-validation is at:
http://www.rob-mcculloch.org/2021_ml/webpage/python/doBost-knn.py.

In R, you can follow the code on slides 81-84 of the kNN notes if you wish.
Note that you need the code in http://www.rob-mcculloch.org/2021_ml/webpage/R/docv.R.

We are going to use the used cars data again.

Previously, we used the “eye-ball” method to choose k for a kNN fit using mileage to predict price.

Use 5-fold cross-validation to choose k. How does your fit compare with the one from the eye-ball method?
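For the Python route, here is a minimal sketch of choosing k by 5-fold cross-validation with sklearn. It assumes the used cars data is in a DataFrame cars with columns 'mileage' and 'price'; the file name 'susedcars.csv' and the grid of k values are just placeholders.

    import numpy as np
    import pandas as pd
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_score

    cars = pd.read_csv('susedcars.csv')   # placeholder file name
    X = cars[['mileage']].values
    y = cars['price'].values

    kvals = np.arange(2, 101)
    cv_rmse = []
    for k in kvals:
        knn = KNeighborsRegressor(n_neighbors=k)
        # 5-fold CV; sklearn reports negative MSE, so flip the sign before the square root
        mse = -cross_val_score(knn, X, y, cv=5, scoring='neg_mean_squared_error')
        cv_rmse.append(np.sqrt(mse.mean()))

    kbest = int(kvals[np.argmin(cv_rmse)])
    print('k chosen by 5-fold CV:', kbest)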

Plot the data and then add the fits using the k you chose by cross-validation and the k you chose by eye-ball.
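Continuing the sketch above, one way to draw the two fits (kbest comes from the CV loop; k_eye is whatever you picked by eye, the value 40 is just a stand-in):

    import matplotlib.pyplot as plt

    k_eye = 40                       # stand-in: replace with your eye-ball choice
    xs = np.sort(X[:, 0])            # sorted mileage values so the fitted curves plot cleanly
    Xgrid = xs.reshape(-1, 1)

    fit_cv = KNeighborsRegressor(n_neighbors=kbest).fit(X, y).predict(Xgrid)
    fit_eye = KNeighborsRegressor(n_neighbors=k_eye).fit(X, y).predict(Xgrid)

    plt.scatter(X[:, 0], y, s=10, c='lightgray', label='data')
    plt.plot(xs, fit_cv, c='blue', label=f'kNN, k={kbest} (cross-validation)')
    plt.plot(xs, fit_eye, c='red', label=f'kNN, k={k_eye} (eye-ball)')
    plt.xlabel('mileage')
    plt.ylabel('price')
    plt.legend()
    plt.show()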

Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
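Again continuing the same sketch, the prediction at 100,000 miles, refitting on all of the observations:

    knn_final = KNeighborsRegressor(n_neighbors=kbest)
    knn_final.fit(X, y)                                  # all observations used as training data
    print(knn_final.predict(np.array([[100_000]])))      # predicted price at 100,000 miles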


2. kNN, Cars Data with Mileage and Year

Use kNN to get a prediction for a 2008 car with 75,000 miles on it!

Remember: mileage and year are on very different scales, so you need to rescale the x's before computing distances.

Is your predictive accuracy better using (mileage,year) than it was with just mileage?
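A possible sketch for the two-feature fit, assuming the same cars DataFrame also has a 'year' column. The features are standardized before distances are computed, and k is again chosen by 5-fold cross-validation; the grid of k values is just an example.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV

    X2 = cars[['mileage', 'year']].values

    # rescale the features, then fit kNN; GridSearchCV picks k by 5-fold CV
    pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
    grid = GridSearchCV(pipe,
                        {'kneighborsregressor__n_neighbors': list(range(2, 101))},
                        cv=5, scoring='neg_mean_squared_error')
    grid.fit(X2, y)
    print('best k:', grid.best_params_)

    # CV rmse with (mileage, year); compare with the mileage-only rmse from exercise 1
    print('CV rmse:', np.sqrt(-grid.best_score_))

    # prediction for a 2008 car with 75,000 miles on it
    print(grid.predict(np.array([[75_000, 2008]])))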


3. Choice of Kernel

In the examples in our notes we used kernel = "rectangular" when calling the R function kknn.

In R, have a look at the help for kknn (?kknn).

In Python, the help for KNeighborsRegressor in sklearn has:

Parameters
----------
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for :meth:`kneighbors` queries.

weights : str or callable
    weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

    Uniform weights are used by default.

So, you can weight the y values at the neighbors equally, or weight the closer ones more heavily. The default is equal (uniform) weights.

Using the used cars data and the predictors (features!!) (mileage, year), try a weighting option other than uniform.
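For the Python route, a sketch of comparing weights='uniform' with weights='distance' on (mileage, year), with the features rescaled as before. Fixing k at 40 is just for illustration; use the k you chose by cross-validation.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    X2 = cars[['mileage', 'year']].values
    for w in ['uniform', 'distance']:
        pipe = make_pipeline(StandardScaler(),
                             KNeighborsRegressor(n_neighbors=40, weights=w))
        # 5-fold CV rmse for each weighting scheme
        mse = -cross_val_score(pipe, X2, y, cv=5, scoring='neg_mean_squared_error')
        print(w, 'CV rmse:', np.sqrt(mse.mean()))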