Naive Bayes in Python

Let's look at the ham/spam data in Python.

First off, here are our Python imports.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

The model class is MultinomialNB. This is the naive Bayes analogue of
from sklearn.linear_model import LinearRegression
which we used for linear regression.

The model will do the naive Bayes work for us.

We also import tools for measuring our success.
We will use the confusion matrix and the accuracy score.
For a numeric y we could use RMSE or MAD (mean absolute deviation).
For categorical outcomes there are many different ways people measure success; sklearn has a lot of metrics for assessing in-sample fit or out-of-sample accuracy.
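As an aside, if we were predicting a numeric y we could score with RMSE or MAD instead of accuracy. Here is a minimal sketch using sklearn's mean_squared_error and mean_absolute_error on some made-up numbers (purely illustrative, not from the sms data):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])              # made-up actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])              # made-up predictions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error
mad = mean_absolute_error(y_true, y_pred)            # mean absolute deviation
print(rmse, mad)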

Read in and quick look at data

In [2]:
trainB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainB.csv")
testB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestB.csv")
trainyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")['smsTrainyB']
testyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestyB.csv")['smsTestyB']
In [3]:
print(trainB.shape)
print(testB.shape)
print(trainyB.shape)
print(testyB.shape)
(4169, 1139)
(1390, 1139)
(4169,)
(1390,)

We have 4,169 train observations and 1,390 test observations.
trainB is our x matrix and trainyB is our outcomes ham/spam.
Our corresponding test data is testB and testyB.

Let's have a quick look at the data.

In [4]:
trainB.iloc[0:5,0:4]  # first few rows and columns of train x
Out[4]:
   £wk  ‘m  ‘s  abiola
0    0   0   0       0
1    0   0   0       0
2    0   0   0       0
3    0   0   0       0
4    0   0   0       0
In [5]:
trainyB.iloc[0:6]  # first few train y
Out[5]:
0    0
1    0
2    0
3    1
4    1
5    0
Name: smsTrainyB, dtype: int64
In [6]:
trainyB.value_counts()/trainyB.shape[0] #train ham/spam frequencies
Out[6]:
0    0.864716
1    0.135284
Name: smsTrainyB, dtype: float64

So all of the data has been represented as 0/1.
For y, 1 means spam and 0 means ham.
For x, the (i,j) element is 1 if the jth term is in the ith document and 0 otherwise.
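
For context, here is a minimal sketch of how such a binary document-term matrix could be built from raw text with sklearn's CountVectorizer (toy messages, purely illustrative; this is not the preprocessing that produced smsTrainB):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize call now", "are you coming home", "call home now"]   # toy documents
vec = CountVectorizer(binary=True)   # binary=True: entry is 1 if the term appears in the document, else 0
X = vec.fit_transform(docs)          # sparse document-term matrix
print(vec.get_feature_names_out())   # the terms (columns); older sklearn versions use get_feature_names()
print(X.toarray())                   # rows are documents, columns are terms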

Let's look at the term age and y.

In [7]:
pd.crosstab(trainyB,trainB['age'])
Out[7]:
age            0   1
smsTrainyB
0           3600   5
1            552  12
In [8]:
print(5/(5+3600))
print(12/(12+552))
0.0013869625520110957
0.02127659574468085

So, $P(x_j = 1 | y=1)$, where $x_j$ is the term age, is estimated to be .0213, and $P(x_j = 1 | y=0)$ is estimated to be .0014.

Note that these numbers match up with what we got using

library(e1071)
smsNB = naiveBayes(smsTrain, smsTrainy)

in R.
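
The same by-hand calculation can be wrapped in a small helper. This is just a hypothetical convenience function (not part of sklearn) that estimates $P(x_j = 1 | y)$ for any column of trainB:

def term_given_class(term):
    # hypothetical helper: estimated P(term present | y) for each class, from the training data
    tab = pd.crosstab(trainyB, trainB[term])   # rows: y (0 = ham, 1 = spam), columns: term absent/present
    return tab[1] / tab.sum(axis=1)            # fraction of each class's documents containing the term

print(term_given_class('age'))   # should reproduce the .0014 (ham) and .0213 (spam) above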

Fit Naive Bayes on Train, Predict on Test

OK, let's do naive Bayes using sklearn.
As usual we:

  • make a model
  • fit on train
  • predict on test
In [9]:
model = MultinomialNB()  #create model object
model.fit(trainB,trainyB) # fit on train
yhat = model.predict(testB) # predict on test
In [10]:
print(yhat.shape)
print(type(yhat))
print(yhat.dtype)
yhat[:5]
(1390,)
<class 'numpy.ndarray'>
int64
Out[10]:
array([0, 0, 0, 0, 1])
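
Besides the hard 0/1 predictions, MultinomialNB also gives estimated class probabilities through predict_proba. Here is a quick look (the exact numbers of course depend on the fit):

phat = model.predict_proba(testB)   # column 0: estimated P(ham | x), column 1: estimated P(spam | x)
print(phat[:5].round(3))            # probabilities for the first few test documents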

How did we do!?

What is our out of sample predictive performance on the test data?

The most basic is the confusion matrix, which is simply the cross-tab of predicted and actual.

In [11]:
confusion_matrix(testyB,yhat)
Out[11]:
array([[1190,   17],
       [  25,  158]])

So we got 25 + 17 = 42 wrong: 17 hams were predicted to be spam and 25 spams were predicted to be ham.

In [12]:
accuracy_score(testyB,yhat)
Out[12]:
0.9697841726618706
In [13]:
(158+1190)/1390
Out[13]:
0.9697841726618706

Let's do the simple cross-tab to check.

In [14]:
pd.crosstab(testyB,yhat)
Out[14]:
col_0        0    1
smsTestyB
0         1190   17
1           25  158

Let's have a quick look at the model object.

In [15]:
model
Out[15]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

?model

Parameters

alpha : float, optional (default=1.0) Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

fit_prior : boolean, optional (default=True) Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
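
If you are curious whether the smoothing parameter matters for this data, you could refit with a different alpha and compare the test accuracy. This is just a sketch of how one might check, not a claim about what the numbers will turn out to be:

model2 = MultinomialNB(alpha=0.1)      # less Laplace smoothing than the default alpha=1.0
model2.fit(trainB, trainyB)
yhat2 = model2.predict(testB)
print(accuracy_score(testyB, yhat2))   # compare with the alpha=1.0 accuracy above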

Let's compare to what we had in R.

In [16]:
inr = (1 - 0.02589928)
print(f'result in R was {inr}')
result in R was 0.97410072

which is virtually the same as the 0.9698 accuracy we got from sklearn.