Get the susedcars.csv data set from the webpage.
Plot x=mileage versus y=price (price is the price of a used car).
Does the relationship between mileage and price make sense?
Add the fit from a linear regression to the plot.
Add the fit from kNN for various values of k to the plot.
For what value of k does the plot look nice?
Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?
What is the prediction from a linear fit?
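A minimal numpy sketch of these first steps, using hypothetical synthetic data in place of susedcars.csv (the real file is on the course webpage) and uniform-weight kNN (the kernel="rectangular" behavior); plotting calls are omitted:

```python
import numpy as np

# Hypothetical stand-in for susedcars.csv: price tends to fall with mileage.
# (Synthetic data -- the real data set comes from the course webpage.)
rng = np.random.default_rng(0)
n = 300
mileage = rng.uniform(5_000, 150_000, n)
price = 60_000 * np.exp(-mileage / 80_000) + rng.normal(0, 2_000, n)

def knn_predict(x_train, y_train, x_new, k):
    """kNN regression with uniform weights (kernel='rectangular' in kknn)."""
    preds = []
    for x0 in np.atleast_1d(x_new):
        nearest = np.argsort(np.abs(x_train - x0))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Linear fit price ~ mileage, then both predictions at 100,000 miles.
slope, intercept = np.polyfit(mileage, price, 1)
lin_pred = intercept + slope * 100_000
knn_pred = knn_predict(mileage, price, 100_000, k=20)[0]
print(lin_pred, knn_pred)
```

The k=20 here is an arbitrary placeholder for your "nice" k; with the real data, plot the data and overlay both fitted curves before choosing.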
Which model is better for the cars data with x=mileage and y=price: kNN (with a nice k) or the linear model?
Use a simple train/test split to see which model looks best.
Use plots to illustrate the results.
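The train/test comparison can be sketched as follows, again with hypothetical synthetic data standing in for susedcars.csv; out-of-sample RMSE is the comparison criterion:

```python
import numpy as np

# Hypothetical synthetic stand-in for the used-cars data.
rng = np.random.default_rng(1)
n = 400
mileage = rng.uniform(5_000, 150_000, n)
price = 60_000 * np.exp(-mileage / 80_000) + rng.normal(0, 2_000, n)

# Random 75/25 train/test split.
perm = rng.permutation(n)
train, test = perm[:300], perm[300:]

def knn_predict(x_train, y_train, x_new, k):
    # Uniform-weight kNN: average the y values of the k nearest neighbors.
    return np.array([y_train[np.argsort(np.abs(x_train - x0))[:k]].mean()
                     for x0 in np.atleast_1d(x_new)])

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

slope, intercept = np.polyfit(mileage[train], price[train], 1)
lin_rmse = rmse(price[test], intercept + slope * mileage[test])
knn_rmse = rmse(price[test], knn_predict(mileage[train], price[train],
                                         mileage[test], k=20))
print(lin_rmse, knn_rmse)  # compare out-of-sample errors
```

On the real data, also plot the test-set predictions from both models over a scatter of the test points to see where each model goes wrong.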
We are going to use the used cars data again.
Previously, we used the “eye-ball” method to choose k for a kNN fit for mileage predicting price.
Use 5-fold cross-validation to choose k. How does your fit compare with the eyeball method?
Plot the data and then add the fits using the k you chose using cross-validation and the k you chose by eye-ball.
Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
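A sketch of the 5-fold cross-validation step, again on synthetic stand-in data; the grid of k values is an arbitrary choice:

```python
import numpy as np

# Hypothetical synthetic stand-in for the used-cars data.
rng = np.random.default_rng(2)
n = 400
mileage = rng.uniform(5_000, 150_000, n)
price = 60_000 * np.exp(-mileage / 80_000) + rng.normal(0, 2_000, n)

def knn_predict(x_train, y_train, x_new, k):
    # Uniform-weight kNN regression.
    return np.array([y_train[np.argsort(np.abs(x_train - x0))[:k]].mean()
                     for x0 in np.atleast_1d(x_new)])

# 5-fold cross-validation over a grid of candidate k values.
folds = np.array_split(rng.permutation(n), 5)
ks = [2, 5, 10, 20, 40, 80, 160]
cv_rmse = []
for k in ks:
    errs = []
    for fold in folds:
        mask = np.ones(n, bool)
        mask[fold] = False  # train on everything except the held-out fold
        pred = knn_predict(mileage[mask], price[mask], mileage[fold], k)
        errs.append((price[fold] - pred) ** 2)
    cv_rmse.append(np.sqrt(np.mean(np.concatenate(errs))))

best_k = ks[int(np.argmin(cv_rmse))]
# Refit on all the observations and predict at 100,000 miles.
pred_100k = knn_predict(mileage, price, 100_000, best_k)[0]
print(best_k, pred_100k)
```

Plotting cv_rmse against ks (on a log scale for k) makes the comparison with your eye-ball choice easy.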
Use kNN to get a prediction for a 2008 car with 75,000 miles on it!
Remember:
In our notes examples we used kernel="rectangular" when calling the R function kknn.
In R, have a look at the help for kknn (?kknn).
In python, the help for KNeighborsRegressor in sklearn has
Parameters
----------
n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for :meth:`kneighbors` queries.
weights : str or callable
weight function used in prediction. Possible values:
- 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
Uniform weights are used by default.
So, you can weight the y values at the neighbors equally, or weight the closer ones more heavily; the default is typically equal (uniform) weights.
Using the used cars data and predictors (features!!) (mileage, year), try a weighting option other than uniform.
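A sketch of distance-weighted kNN with the two features, mimicking sklearn's weights='distance' option; the data are synthetic stand-ins, and the features are standardized first so mileage does not dominate the distances:

```python
import numpy as np

# Hypothetical synthetic stand-in with two features (mileage, year).
rng = np.random.default_rng(3)
n = 400
mileage = rng.uniform(5_000, 150_000, n)
year = rng.integers(2000, 2013, n)
price = (60_000 * np.exp(-mileage / 80_000)
         + 1_000 * (year - 2000) + rng.normal(0, 2_000, n))

# Standardize each feature so both contribute comparably to distances.
X = np.column_stack([mileage, year])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

def knn_weighted(Xs_train, y_train, xs_new, k):
    """kNN with inverse-distance weights (weights='distance' in sklearn)."""
    d = np.linalg.norm(Xs_train - xs_new, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[nearest], 1e-12)  # guard against zero distance
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Predict for a 2008 car with 75,000 miles (standardized the same way).
x_new = (np.array([75_000, 2008]) - X.mean(axis=0)) / X.std(axis=0)
pred = knn_weighted(Xs, price, x_new, k=20)
print(pred)
```

With the real data you would simply pass weights='distance' to KNeighborsRegressor (or a kernel other than "rectangular" to kknn) and compare the fit with the uniform-weight one.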
Suppose \(y_i\) is a count. That is, \(y_i \in \{0,1,2,\ldots\}\).
In this case, a very common model is to assume the Poisson
distribution:
\[
P(Y=y \;|\; \lambda) = \frac{e^{-\lambda} \, \lambda^y}{y!}, \; y =
0,1,2,\ldots
\]
Given \(Y_i \sim \text{Poisson}(\lambda)\) iid for \(i = 1,\ldots,n\), with observed values \(y_i\), what is the MLE of \(\lambda\)?
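For reference, the standard calculation: the log-likelihood of \(n\) iid observations is
\[
\ell(\lambda) = \sum_{i=1}^{n} \left( -\lambda + y_i \log \lambda - \log y_i! \right)
= -n\lambda + \log \lambda \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \log y_i!,
\]
so setting \(\ell'(\lambda) = -n + \frac{1}{\lambda}\sum_{i=1}^{n} y_i = 0\) gives \(\hat{\lambda} = \bar{y}\), the sample mean.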
Let
\[
f(x) = (x_1 - a_1)^2 + (x_2 - a_2)^2, \;\; g(x_1,x_2) = x_1^2 + x_2^2 -
1.
\] Minimize \(f(x)\) subject to
the constraint that \(g(x) \leq
0\).
First draw simple pictures to make the solution obvious.
Then check that the Lagrange multiplier first-order condition conforms with your intuition.
How does the norm of \((a_1,a_2)\) affect the solution?
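As a check on the pictures, writing \(x = (x_1,x_2)\) and \(a = (a_1,a_2)\), the first-order condition with multiplier \(\mu \geq 0\) is
\[
\nabla f(x) + \mu \nabla g(x) = 0
\;\Longrightarrow\;
2(x - a) + 2\mu x = 0
\;\Longrightarrow\;
x = \frac{a}{1+\mu}.
\]
If \(\|a\| \leq 1\), the constraint is inactive, \(\mu = 0\), and \(x^* = a\) with \(f(x^*) = 0\). If \(\|a\| > 1\), complementary slackness forces \(\|x^*\| = 1\), so \(\mu = \|a\| - 1 > 0\) and \(x^* = a/\|a\|\): the unconstrained minimizer is projected onto the unit circle, which is exactly what the picture suggests.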