Simple Example of Multiple Regression in Python

Import Needed Modules

We need to import numpy and pandas (as np and pd), matplotlib.pyplot (as plt, for plotting), and LinearRegression from sklearn.linear_model.

In [28]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

To get the plots to show up in the notebook we also need:

In [29]:
%matplotlib inline 

# %matplotlib inline is a notebook magic; leave it out of a .py script

Read in the Data and Get the Variables We Want

We will

  • read in the data to a pandas data frame
  • pull off price, mileage, and year
  • divide price and mileage by 1,000
  • do some simple summaries
In [30]:
cd = pd.read_csv("susedcars.csv")
cd = cd[['price','mileage','year']]
cd['price'] = cd['price']/1000
cd['mileage'] = cd['mileage']/1000
print(cd.head())
print(cd.describe())
print(cd.corr())
print(cd.price.describe())
    price  mileage  year
0  43.995   36.858  2008
1  44.995   46.883  2012
2  25.999  108.759  2007
3  33.880   35.187  2007
4  34.895   48.153  2007
             price      mileage         year
count  1000.000000  1000.000000  1000.000000
mean     30.583318    73.652408  2006.939000
std      18.411018    42.887422     4.194624
min       0.995000     1.997000  1994.000000
25%      12.995000    40.132750  2004.000000
50%      29.800000    67.919500  2007.000000
75%      43.992000   100.138250  2010.000000
max      79.995000   255.419000  2013.000000
            price   mileage      year
price    1.000000 -0.815246  0.880537
mileage -0.815246  1.000000 -0.744729
year     0.880537 -0.744729  1.000000
count    1000.000000
mean       30.583318
std        18.411018
min         0.995000
25%        12.995000
50%        29.800000
75%        43.992000
max        79.995000
Name: price, dtype: float64

Get y=price and X=(mileage,year) as NumPy ndarrays

In [31]:
X = cd[['mileage','year']].to_numpy()  # as_matrix() was removed in pandas 1.0
print(X.shape)
print(X[0:4,:])
y = cd['price'].values
print(len(y))
print(y[0:4])
(1000, 2)
[[   36.858  2008.   ]
 [   46.883  2012.   ]
 [  108.759  2007.   ]
 [   35.187  2007.   ]]
1000
[ 43.995  44.995  25.999  33.88 ]
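
As a small check of what this conversion produces, the sketch below retypes the five rows printed by cd.head() above into a stand-in data frame and converts it the same way. The names toy, X_toy, and y_toy are only for illustration, and the code assumes a pandas version that has to_numpy() (0.24 or later).

```python
import numpy as np
import pandas as pd

# The five rows shown by cd.head() above, retyped by hand as a stand-in for the full file.
toy = pd.DataFrame({
    "price":   [43.995, 44.995, 25.999, 33.880, 34.895],
    "mileage": [36.858, 46.883, 108.759, 35.187, 48.153],
    "year":    [2008, 2012, 2007, 2007, 2007],
})

X_toy = toy[["mileage", "year"]].to_numpy()  # 2-D array, one row per car
y_toy = toy["price"].to_numpy()              # 1-D array, one entry per car

print(X_toy.shape, X_toy.dtype)
print(y_toy.shape, y_toy.dtype)
```

Because mileage is a float column and year is an integer column, the combined array is upcast to float64, which is why year prints as 2008. rather than 2008 in the output above.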

Plot y vs each x

Now let's plot year vs. price.

In [32]:
plt.scatter(X[:,1],y)
plt.xlabel("year")
plt.ylabel("price")
plt.title("year vs. price")
Out[32]:
Text(0.5,1,'year vs. price')

And mileage vs. price.

In [33]:
plt.scatter(X[:,0],y)
plt.xlabel("mileage")
plt.ylabel("price")
plt.title("mileage vs. price")
Out[33]:
Text(0.5,1,'mileage vs. price')
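
If it is easier to compare the two scatter plots side by side, one option is to loop over the columns of X and draw each one in its own panel. The sketch below uses only the five rows printed by cd.head() earlier as stand-in data; X_toy, y_toy, and names are illustrative names, not part of the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: the five rows printed by cd.head() (price and mileage in thousands).
X_toy = np.array([[36.858, 2008.0],
                  [46.883, 2012.0],
                  [108.759, 2007.0],
                  [35.187, 2007.0],
                  [48.153, 2007.0]])
y_toy = np.array([43.995, 44.995, 25.999, 33.880, 34.895])

names = ["mileage", "year"]  # column order of X
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for j, ax in enumerate(axes):
    ax.scatter(X_toy[:, j], y_toy)   # column j of X against price
    ax.set_xlabel(names[j])
    ax.set_title(names[j] + " vs. price")
axes[0].set_ylabel("price")          # shared y-axis, so label it once
fig.tight_layout()
```

Keeping the column names in a list next to X also makes it harder to mix up which column is which when indexing with X[:,0] and X[:,1].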

Run The Regression of y=price on X=(mileage,year)

Ok, now we can run the regression.

In [34]:
lmmod = LinearRegression(fit_intercept=True)
lmmod.fit(X,y)
print("Model Slopes:    ",lmmod.coef_)
print("Model Intercept:",lmmod.intercept_)
Model Slopes:     [-0.1537219   2.69434954]
Model Intercept: -5365.48987226

Note that scikit-learn does not seem to provide a simple regression summary (standard errors, t-statistics, and so on).
Maybe that is a good thing!
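
If a few summary numbers are wanted anyway, they can be computed by hand. The sketch below is one way to do it, not a scikit-learn feature: it gets R-squared from the score method and the classical least-squares standard errors from the formula sqrt(diag(s^2 (A'A)^-1)), where A is X with a leading column of ones. The data are synthetic (an assumption, made so the block runs on its own), and names such as X_demo, y_demo, and fit are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; not the used-cars data).
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

fit = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# R-squared on the data used for fitting.
r2 = fit.score(X_demo, y_demo)

# Classical OLS standard errors, intercept first.
A = np.column_stack([np.ones(n), X_demo])   # design matrix with intercept column
resid = y_demo - fit.predict(X_demo)
dof = n - A.shape[1]                        # observations minus estimated coefficients
sigma2 = resid @ resid / dof                # residual variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

print("R-squared:", r2)
print("intercept and slopes:", fit.intercept_, fit.coef_)
print("standard errors (intercept first):", se)
```

The same steps should work with the X and y from this notebook in place of X_demo and y_demo. Packages such as statsmodels print a full regression table if that is what is needed.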

So, the fitted relationship (with price and mileage in thousands) is
$$ \text{price} = -5365.49 - 0.154 \, \text{mileage} + 2.694 \, \text{year} $$
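
To see how the fitted equation is used, the sketch below plugs the first car in the data (mileage 36.858, year 2008) into it by hand, using the intercept and slopes exactly as printed above. The variable names are illustrative; the result should match lmmod.predict for that row up to the rounding of the printed coefficients.

```python
import numpy as np

# Coefficients as printed by the regression cell above.
intercept = -5365.48987226
slopes = np.array([-0.1537219, 2.69434954])   # order: mileage, year

# First row of X: mileage in thousands, then year.
x_first = np.array([36.858, 2008.0])

# Fitted value = intercept + slope_mileage * mileage + slope_year * year
pred = intercept + slopes @ x_first
print("predicted price (thousands):", round(pred, 2))
```

This gives a predicted price of roughly 39.1, compared with the observed 43.995 for that car. The very large negative intercept is the predicted price at mileage 0 and year 0, far outside the data (years run from 1994 to 2013), so it has no useful interpretation on its own.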

Get and Plot the Fits

In [35]:
yhat = lmmod.predict(X)
print("the length of yhat is",len(yhat))
print("the type of yhat is:")
type(yhat)
the length of yhat is 1000
the type of yhat is:
Out[35]:
numpy.ndarray
In [36]:
plt.scatter(y,yhat)
Out[36]:
<matplotlib.collections.PathCollection at 0x7f133004a0f0>

Clearly, the fit is really bad! With a good fit, the points would lie close to the line yhat = y.
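
A plot of y against the fitted values is easier to judge with axis labels and the line yhat = y drawn in for reference. The sketch below shows one way to do that. It uses synthetic data in which y curves in a single x (an assumption chosen to show what a poor linear fit can look like, not the used-cars data), and all names ending in _demo are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption): y decays exponentially in x, plus noise.
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, size=300)
y_demo = 50 * np.exp(-0.3 * x_demo) + rng.normal(scale=1.0, size=300)

# Fit a straight line to the curved data and get the fitted values.
fit = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
yhat_demo = fit.predict(x_demo.reshape(-1, 1))

fig, ax = plt.subplots()
ax.scatter(y_demo, yhat_demo, s=10)
lims = [min(y_demo.min(), yhat_demo.min()), max(y_demo.max(), yhat_demo.max())]
ax.plot(lims, lims, color="red")   # reference line yhat = y
ax.set_xlabel("y")
ax.set_ylabel("yhat")
ax.set_title("y vs. fitted values")
```

Points that bend away from the red line in a systematic way, rather than scattering evenly around it, suggest that a straight-line model is missing curvature. The same plotting code should work with y and yhat from this notebook.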