This is the Python code you need to understand to do hw1.
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import math
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from numpy.random import default_rng
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from scipy.stats import pearsonr as pcorr
# IPython magic function: displays plots inline in the notebook
%matplotlib inline
Read in the data and pull off price, mileage, and year.
cd = pd.read_csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd = cd[['price','mileage','year']]
cd['price'] = cd['price']/1000
cd['mileage'] = cd['mileage']/1000
print(cd.head()) # head just prints out the first few rows
Now let's make the random train/test split.
First we will do it using basic Python/numpy/pandas, then we will use a utility in sklearn.
With basic numpy/pandas we randomly choose the training row indices and then build the train and test data frames.
The numpy random number generation docs are here:
https://numpy.org/doc/stable/reference/random/generator.html
Then we will use a utility in sklearn:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
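To see what rng.choice with replace=False returns, here is a toy run first (just a sketch; the seed 0 is arbitrary):
rng_demo = np.random.default_rng(seed=0)                    # seeded generator, reproducible
print(rng_demo.choice(range(10), size=3, replace=False))    # 3 distinct values from 0..9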
n = cd.shape[0]
pin = .75 # fraction of the data to put in train
rng = np.random.default_rng(seed=42)
ii = rng.choice(range(n),size=int(pin*n),replace=False)   # random training row indices
indtr = np.zeros(n,dtype=bool)   # boolean mask: True marks a training row
indtr[ii] = True
cdtrain = cd[indtr]
cdtest = cd[~indtr]
print(cdtrain.shape)
print(cdtest.shape)
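As a quick sanity check (an optional sketch), the train and test frames should partition the rows:
# sketch: the split should be a partition of the n rows
assert cdtrain.shape[0] + cdtest.shape[0] == n
assert len(set(cdtrain.index) & set(cdtest.index)) == 0   # no overlap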
Now let's use sklearn. First we convert the data frame to numpy arrays.
# convert to numpy arrays
X = cd.iloc[:,[1,2]].to_numpy()   # columns 1 and 2 are mileage and year
y = cd['price'].to_numpy()
print(X.shape)
print(y.shape)
#use sklearn train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, random_state=99,test_size=.25)
print(Xtrain.shape,ytrain.shape)
print(Xtest.shape,ytest.shape)
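Since we fixed random_state=99, the split is reproducible; repeating the call gives the same arrays (a quick sketch):
# sketch: the same random_state gives the identical split
Xtr2, Xte2, ytr2, yte2 = train_test_split(X, y, random_state=99, test_size=.25)
print(np.array_equal(Xtrain, Xtr2))   # True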
lmmod = LinearRegression(fit_intercept=True)   # ordinary least squares with an intercept
lmmod.fit(Xtrain,ytrain)                       # fit on the training data
yhtest = lmmod.predict(Xtest)                  # predict on the held-out test data
plt.scatter(yhtest,ytest)
plt.xlabel('yhtest'); plt.ylabel('ytest')
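If the predictions were perfect, the points would fall on the 45-degree line, so it can help to draw it as a reference (a sketch):
# sketch: redraw the plot with the 45-degree reference line
plt.scatter(yhtest, ytest)
lims = [ytest.min(), ytest.max()]
plt.plot(lims, lims, 'r--')   # perfect-prediction line y = x
plt.xlabel('yhtest'); plt.ylabel('ytest')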
def rmse(y,yh):
    # root mean squared error
    return(math.sqrt(np.mean((y-yh)**2)))
def mabe(y,yh):
    # mean absolute error
    return(np.mean(np.abs(y-yh)))
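A tiny hand-checkable example (sketch): with errors of +1 and -1, both rmse and mabe come out to 1.
# sketch: errors are (1, -1), so rmse = sqrt((1+1)/2) = 1 and mabe = (1+1)/2 = 1
print(rmse(np.array([2.0, 0.0]), np.array([1.0, 1.0])))   # 1.0
print(mabe(np.array([2.0, 0.0]), np.array([1.0, 1.0])))   # 1.0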
print('rmse: ',rmse(ytest,yhtest))
print('mabe: ',mabe(ytest,yhtest))
print('R-squared: ',pcorr(ytest,yhtest)[0]**2)
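sklearn.metrics has these quantities too; note that out of sample, r2_score computes 1 - SSE/SST, which is not exactly the squared correlation, though the two are usually close (a sketch):
# sketch: the same checks via sklearn.metrics
from sklearn.metrics import mean_squared_error, r2_score
print('rmse: ', math.sqrt(mean_squared_error(ytest, yhtest)))
print('R-squared (1 - SSE/SST): ', r2_score(ytest, yhtest))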
Now we will use the categorical variable color in our model.
We will have to create dummies for each category.
cd1 = pd.read_csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd1.columns.values   # check the column names and order
cd1 = cd1.iloc[:,[3,4,5,0]]   # keep mileage, year, color, price
cd1['price'] = cd1['price']/1000
cd1['mileage'] = cd1['mileage']/1000
print(cd1.head())
one_hot = LabelBinarizer()                    # encoder for the color categories
cdums = one_hot.fit_transform(cd1['color'])   # one dummy column per color
print(type(cdums))
print(cdums.shape)
cdums[0:10,:]        # first 10 rows of the dummies
cd1['color'][0:10]   # the corresponding colors
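pandas can build the dummies directly as well; pd.get_dummies with drop_first=True would also drop the baseline category (an alternative sketch, not used below):
# sketch: pandas' built-in dummy encoding
cdums_pd = pd.get_dummies(cd1['color'], drop_first=True)
print(cdums_pd.head())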
X1 = np.hstack([cd1.iloc[:,0:1].to_numpy(),cdums[:,1:4]])   # mileage plus 3 of the 4 color dummies (drop one to avoid redundancy with the intercept)
X1[0:5,:]
lmmod1 = LinearRegression(fit_intercept=True)
lmmod1.fit(X1,y)   # y (price) was created above from the same data
print(lmmod1.intercept_,lmmod1.coef_)
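To see which coefficient goes with which variable, we can line names up with the columns of X1; one_hot.classes_ holds the color categories in the order LabelBinarizer used (a sketch, assuming the four categories shown by cdums.shape above):
# sketch: pair each coefficient with its variable name
names = ['mileage'] + list(one_hot.classes_[1:])   # the first dummy was dropped
for nm, co in zip(names, lmmod1.coef_):
    print(nm, ':', co)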