Get the susedcars.csv data set from the webpage.
Plot x=mileage versus y=price (price is the price of a used car).
Does the relationship between mileage and price make sense?
Add the fit from a linear regression to the plot.
Add the fit from kNN for various values of k to the plot.
For what value of k does the plot look nice?
Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?
What is the prediction from a linear fit?
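A minimal numpy sketch of these first steps, using hypothetical synthetic data in place of susedcars.csv (the real file is on the course webpage) and uniform-weight kNN (the kernel="rectangular" behavior); plotting calls are omitted:

```python
import numpy as np

# Hypothetical stand-in for susedcars.csv: price tends to fall with mileage.
# (Synthetic data -- the real data set comes from the course webpage.)
rng = np.random.default_rng(0)
n = 300
mileage = rng.uniform(5_000, 150_000, n)
price = 60_000 * np.exp(-mileage / 80_000) + rng.normal(0, 2_000, n)

def knn_predict(x_train, y_train, x_new, k):
    """kNN regression with uniform weights (kernel='rectangular' in kknn)."""
    preds = []
    for x0 in np.atleast_1d(x_new):
        nearest = np.argsort(np.abs(x_train - x0))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Linear fit price ~ mileage, then both predictions at 100,000 miles.
slope, intercept = np.polyfit(mileage, price, 1)
lin_pred = intercept + slope * 100_000
knn_pred = knn_predict(mileage, price, 100_000, k=20)[0]
print(lin_pred, knn_pred)
```

The k=20 here is an arbitrary placeholder for your "nice" k; with the real data, plot the data and overlay both fitted curves before choosing.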
Which model is better for the cars data with x=mileage and y=price: kNN (with a nice k) or the linear model?
Use a simple train/test split to see which model looks best.
Use plots to illustrate the results.
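The train/test comparison can be sketched as follows, again with hypothetical synthetic data standing in for susedcars.csv; out-of-sample RMSE is the comparison criterion:

```python
import numpy as np

# Hypothetical synthetic stand-in for the used-cars data.
rng = np.random.default_rng(1)
n = 400
mileage = rng.uniform(5_000, 150_000, n)
price = 60_000 * np.exp(-mileage / 80_000) + rng.normal(0, 2_000, n)

# Random 75/25 train/test split.
perm = rng.permutation(n)
train, test = perm[:300], perm[300:]

def knn_predict(x_train, y_train, x_new, k):
    # Uniform-weight kNN: average the y values of the k nearest neighbors.
    return np.array([y_train[np.argsort(np.abs(x_train - x0))[:k]].mean()
                     for x0 in np.atleast_1d(x_new)])

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

slope, intercept = np.polyfit(mileage[train], price[train], 1)
lin_rmse = rmse(price[test], intercept + slope * mileage[test])
knn_rmse = rmse(price[test], knn_predict(mileage[train], price[train],
                                         mileage[test], k=20))
print(lin_rmse, knn_rmse)  # compare out-of-sample errors
```

On the real data, also plot the test-set predictions from both models over a scatter of the test points to see where each model goes wrong.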
We are going to use the used cars data again.
Previously, we used the “eye-ball” method to choose k for a kNN fit for mileage predicting price.
Use 5-fold cross-validation to choose k. How does your fit compare with the eyeball method?
Plot the data and then add the fits using the k you chose using cross-validation and the k you chose by eye-ball.
Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
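A sketch of the 5-fold cross-validation step, again on synthetic stand-in data; the grid of k values is an arbitrary choice:

```python
import numpy as np

# Hypothetical synthetic stand-in for the used-cars data.
rng = np.random.default_rng(2)
n = 400
mileage = rng.uniform(5_000, 150_000, n)
price = 60_000 * np.exp(-mileage / 80_000) + rng.normal(0, 2_000, n)

def knn_predict(x_train, y_train, x_new, k):
    # Uniform-weight kNN regression.
    return np.array([y_train[np.argsort(np.abs(x_train - x0))[:k]].mean()
                     for x0 in np.atleast_1d(x_new)])

# 5-fold cross-validation over a grid of candidate k values.
folds = np.array_split(rng.permutation(n), 5)
ks = [2, 5, 10, 20, 40, 80, 160]
cv_rmse = []
for k in ks:
    errs = []
    for fold in folds:
        mask = np.ones(n, bool)
        mask[fold] = False  # train on everything except the held-out fold
        pred = knn_predict(mileage[mask], price[mask], mileage[fold], k)
        errs.append((price[fold] - pred) ** 2)
    cv_rmse.append(np.sqrt(np.mean(np.concatenate(errs))))

best_k = ks[int(np.argmin(cv_rmse))]
# Refit on all the observations and predict at 100,000 miles.
pred_100k = knn_predict(mileage, price, 100_000, best_k)[0]
print(best_k, pred_100k)
```

Plotting cv_rmse against ks (on a log scale for k) makes the comparison with your eye-ball choice easy.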
Use kNN to get a prediction for a 2008 car with 75,000 miles on it!
Remember:
In our notes examples we used kernel="rectangular" when calling the R function kknn.
In R, have a look at the help for kknn (?kknn).
In python, the help for KNeighborsRegressor in sklearn has
Parameters
----------
n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for :meth:`kneighbors` queries.
weights : str or callable
weight function used in prediction. Possible values:
- 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
Uniform weights are used by default.
So, you can weight the y values at the neighbors equally, or weight the closer ones more heavily; the default is typically equal (uniform) weights.
Using the used cars data and predictors (features!!) (mileage, year), try a weighting option other than uniform.
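A sketch of distance-weighted kNN with the two features, mimicking sklearn's weights='distance' option; the data are synthetic stand-ins, and the features are standardized first so mileage does not dominate the distances:

```python
import numpy as np

# Hypothetical synthetic stand-in with two features (mileage, year).
rng = np.random.default_rng(3)
n = 400
mileage = rng.uniform(5_000, 150_000, n)
year = rng.integers(2000, 2013, n)
price = (60_000 * np.exp(-mileage / 80_000)
         + 1_000 * (year - 2000) + rng.normal(0, 2_000, n))

# Standardize each feature so both contribute comparably to distances.
X = np.column_stack([mileage, year])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

def knn_weighted(Xs_train, y_train, xs_new, k):
    """kNN with inverse-distance weights (weights='distance' in sklearn)."""
    d = np.linalg.norm(Xs_train - xs_new, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[nearest], 1e-12)  # guard against zero distance
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Predict for a 2008 car with 75,000 miles (standardized the same way).
x_new = (np.array([75_000, 2008]) - X.mean(axis=0)) / X.std(axis=0)
pred = knn_weighted(Xs, price, x_new, k=20)
print(pred)
```

With the real data you would simply pass weights='distance' to KNeighborsRegressor (or a kernel other than "rectangular" to kknn) and compare the fit with the uniform-weight one.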
Suppose \(y_i\) is a count. That is, \(y_i \in \{0,1,2,\ldots\}\).
In this case, a very common model is to assume the Poisson
distribution:
\[
P(Y=y \;|\; \lambda) = \frac{e^{-\lambda} \, \lambda^y}{y!}, \; y =
0,1,2,\ldots
\]
Given \(Y_i \sim \text{Poisson}(\lambda)\) iid for \(i = 1,\ldots,n\), with observed values \(y_i\), what is the MLE of \(\lambda\)?
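For reference, the standard calculation: the log-likelihood of \(n\) iid observations is
\[
\ell(\lambda) = \sum_{i=1}^{n} \left( -\lambda + y_i \log \lambda - \log y_i! \right)
= -n\lambda + \log \lambda \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \log y_i!,
\]
so setting \(\ell'(\lambda) = -n + \frac{1}{\lambda}\sum_{i=1}^{n} y_i = 0\) gives \(\hat{\lambda} = \bar{y}\), the sample mean.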
Let
\[
f(x) = (x_1 - a_1)^2 + (x_2 - a_2)^2, \;\; g(x_1,x_2) = x_1^2 + x_2^2 -
1.
\] Minimize \(f(x)\) subject to
the constraint that \(g(x) \leq
0\).
First draw simple pictures to make the solution obvious.
Then check that the Lagrange multiplier first-order condition conforms with your intuition.
How does the norm of \((a_1,a_2)\) affect the solution?
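As a check on the pictures, writing \(x = (x_1,x_2)\) and \(a = (a_1,a_2)\), the first-order condition with multiplier \(\mu \geq 0\) is
\[
\nabla f(x) + \mu \nabla g(x) = 0
\;\Longrightarrow\;
2(x - a) + 2\mu x = 0
\;\Longrightarrow\;
x = \frac{a}{1+\mu}.
\]
If \(\|a\| \leq 1\), the constraint is inactive, \(\mu = 0\), and \(x^* = a\) with \(f(x^*) = 0\). If \(\|a\| > 1\), complementary slackness forces \(\|x^*\| = 1\), so \(\mu = \|a\| - 1 > 0\) and \(x^* = a/\|a\|\): the unconstrained minimizer is projected onto the unit circle, which is exactly what the picture suggests.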