This is the Python code you need to understand to do hw1.
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import math
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from numpy.random import default_rng
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from scipy.stats import pearsonr as pcorr
# IPython magic function: displays plots inline in the notebook
%matplotlib inline
Read in the data and pull off price, mileage, and year.
cd = pd.read_csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd = cd[['price','mileage','year']]
cd['price'] = cd['price']/1000
cd['mileage'] = cd['mileage']/1000
print(cd.head()) # head just prints out the first few rows
Now let's make the random train/test split.
First we will do it using basic Python/numpy/pandas, then we will use a utility in sklearn.
With basic numpy/pandas we randomly choose the training row indices and then build the train and test data frames.
The numpy random number generation docs are here:
https://numpy.org/doc/stable/reference/random/generator.html
Then we will use a utility in sklearn:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
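To see what rng.choice with replace=False returns, here is a toy run first (just a sketch; the seed 0 is arbitrary):
rng_demo = np.random.default_rng(seed=0)                    # seeded generator, reproducible
print(rng_demo.choice(range(10), size=3, replace=False))    # 3 distinct values from 0..9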
n = cd.shape[0]
pin = .75 # fraction of the data to put in train
rng = np.random.default_rng(seed=42)
ii = rng.choice(range(n),size=int(pin*n),replace=False)   # random training row indices
indtr = np.zeros(n,dtype=bool)   # boolean mask: True marks a training row
indtr[ii] = True
cdtrain = cd[indtr]
cdtest = cd[~indtr]
print(cdtrain.shape)
print(cdtest.shape)
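As a quick sanity check (an optional sketch), the train and test frames should partition the rows:
# sketch: the split should be a partition of the n rows
assert cdtrain.shape[0] + cdtest.shape[0] == n
assert len(set(cdtrain.index) & set(cdtest.index)) == 0   # no overlap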
Now let's use sklearn. First we convert the data frame to numpy arrays.
# convert to numpy arrays
X = cd.iloc[:,[1,2]].to_numpy()   # columns 1 and 2 are mileage and year
y = cd['price'].to_numpy()
print(X.shape)
print(y.shape)
#use sklearn train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, random_state=99,test_size=.25)
print(Xtrain.shape,ytrain.shape)
print(Xtest.shape,ytest.shape)
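Since we fixed random_state=99, the split is reproducible; repeating the call gives the same arrays (a quick sketch):
# sketch: the same random_state gives the identical split
Xtr2, Xte2, ytr2, yte2 = train_test_split(X, y, random_state=99, test_size=.25)
print(np.array_equal(Xtrain, Xtr2))   # True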
lmmod = LinearRegression(fit_intercept=True)   # ordinary least squares with an intercept
lmmod.fit(Xtrain,ytrain)                       # fit on the training data
yhtest = lmmod.predict(Xtest)                  # predict on the held-out test data
plt.scatter(yhtest,ytest)
plt.xlabel('yhtest'); plt.ylabel('ytest')
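If the predictions were perfect, the points would fall on the 45-degree line, so it can help to draw it as a reference (a sketch):
# sketch: redraw the plot with the 45-degree reference line
plt.scatter(yhtest, ytest)
lims = [ytest.min(), ytest.max()]
plt.plot(lims, lims, 'r--')   # perfect-prediction line y = x
plt.xlabel('yhtest'); plt.ylabel('ytest')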
def rmse(y,yh):
    # root mean squared error
    return(math.sqrt(np.mean((y-yh)**2)))
def mabe(y,yh):
    # mean absolute error
    return(np.mean(np.abs(y-yh)))
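A tiny hand-checkable example (sketch): with errors of +1 and -1, both rmse and mabe come out to 1.
# sketch: errors are (1, -1), so rmse = sqrt((1+1)/2) = 1 and mabe = (1+1)/2 = 1
print(rmse(np.array([2.0, 0.0]), np.array([1.0, 1.0])))   # 1.0
print(mabe(np.array([2.0, 0.0]), np.array([1.0, 1.0])))   # 1.0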
print('rmse: ',rmse(ytest,yhtest))
print('mabe: ',mabe(ytest,yhtest))
print('R-squared: ',pcorr(ytest,yhtest)[0]**2)
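sklearn.metrics has these quantities too; note that out of sample, r2_score computes 1 - SSE/SST, which is not exactly the squared correlation, though the two are usually close (a sketch):
# sketch: the same checks via sklearn.metrics
from sklearn.metrics import mean_squared_error, r2_score
print('rmse: ', math.sqrt(mean_squared_error(ytest, yhtest)))
print('R-squared (1 - SSE/SST): ', r2_score(ytest, yhtest))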
Now we will use the categorical variable color in our model.
We will have to create dummies for each category.
cd1 = pd.read_csv("http://www.rob-mcculloch.org/data/susedcars.csv")
cd1.columns.values   # check the column names and order
cd1 = cd1.iloc[:,[3,4,5,0]]   # keep mileage, year, color, price
cd1['price'] = cd1['price']/1000
cd1['mileage'] = cd1['mileage']/1000
print(cd1.head())
one_hot = LabelBinarizer()                    # encoder for the color categories
cdums = one_hot.fit_transform(cd1['color'])   # one dummy column per color
print(type(cdums))
print(cdums.shape)
cdums[0:10,:]        # first 10 rows of the dummies
cd1['color'][0:10]   # the corresponding colors
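pandas can build the dummies directly as well; pd.get_dummies with drop_first=True would also drop the baseline category (an alternative sketch, not used below):
# sketch: pandas' built-in dummy encoding
cdums_pd = pd.get_dummies(cd1['color'], drop_first=True)
print(cdums_pd.head())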
X1 = np.hstack([cd1.iloc[:,0:1].to_numpy(),cdums[:,1:4]])   # mileage plus 3 of the 4 color dummies (drop one to avoid redundancy with the intercept)
X1[0:5,:]
lmmod1 = LinearRegression(fit_intercept=True)
lmmod1.fit(X1,y)   # y (price) was created above from the same data
print(lmmod1.intercept_,lmmod1.coef_)
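To see which coefficient goes with which variable, we can line names up with the columns of X1; one_hot.classes_ holds the color categories in the order LabelBinarizer used (a sketch, assuming the four categories shown by cdums.shape above):
# sketch: pair each coefficient with its variable name
names = ['mileage'] + list(one_hot.classes_[1:])   # the first dummy was dropped
for nm, co in zip(names, lmmod1.coef_):
    print(nm, ':', co)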