Naive Bayes in Python

Let's look at the ham/spam data in Python.

First off, here are our Python imports.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

The model class is MultinomialNB. This is the naive Bayes analogue of
from sklearn.linear_model import LinearRegression
which we used for linear regression.

The model will do the naive Bayes work for us.

We also import tools for measuring our success.
We will use the confusion matrix and the accuracy score.
For a numeric y we could use RMSE or MAD (mean absolute deviation).
For categorical outcomes there are many different ways people measure success; sklearn has a lot of metrics for assessing in-sample fit or out-of-sample accuracy.
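As an aside, if we were predicting a numeric y we could score with RMSE or MAD instead of accuracy. Here is a minimal sketch using sklearn's mean_squared_error and mean_absolute_error on some made-up numbers (purely illustrative, not from the sms data):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])              # made-up actual values
y_pred = np.array([2.5, 5.0, 3.0, 8.0])              # made-up predictions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error
mad = mean_absolute_error(y_true, y_pred)            # mean absolute deviation
print(rmse, mad)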

Read in and quick look at data

In [2]:
trainB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainB.csv")
testB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestB.csv")
trainyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")['smsTrainyB']
testyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestyB.csv")['smsTestyB']
In [3]:
print(trainB.shape)
print(testB.shape)
print(trainyB.shape)
print(testyB.shape)
(4169, 1139)
(1390, 1139)
(4169,)
(1390,)

We have 4,169 train observations and 1,390 test observations.
trainB is our x matrix and trainyB is our outcomes ham/spam.
Our corresponding test data is testB and testyB.

Let's have a quick look at the data.

In [4]:
trainB.iloc[0:5,0:4]  # first few rows and columns of train x
Out[4]:
   £wk  ‘m  ‘s  abiola
0    0   0   0       0
1    0   0   0       0
2    0   0   0       0
3    0   0   0       0
4    0   0   0       0
In [5]:
trainyB.iloc[0:6]  # first few train y
Out[5]:
0    0
1    0
2    0
3    1
4    1
5    0
Name: smsTrainyB, dtype: int64
In [6]:
trainyB.value_counts()/trainyB.shape[0] #train ham/spam frequencies
Out[6]:
0    0.864716
1    0.135284
Name: smsTrainyB, dtype: float64

So all of the data has been represented as 0/1.
For y, 1 means spam and 0 means ham.
For x, the (i,j) element is 1 if the jth term is in the ith document and 0 otherwise.
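
For context, here is a minimal sketch of how such a binary document-term matrix could be built from raw text with sklearn's CountVectorizer (toy messages, purely illustrative; this is not the preprocessing that produced smsTrainB):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize call now", "are you coming home", "call home now"]   # toy documents
vec = CountVectorizer(binary=True)   # binary=True: entry is 1 if the term appears in the document, else 0
X = vec.fit_transform(docs)          # sparse document-term matrix
print(vec.get_feature_names_out())   # the terms (columns); older sklearn versions use get_feature_names()
print(X.toarray())                   # rows are documents, columns are terms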

Let's look at the term age and y.

In [7]:
pd.crosstab(trainyB,trainB['age'])
Out[7]:
age            0   1
smsTrainyB
0           3600   5
1            552  12
In [8]:
print(5/(5+3600))
print(12/(12+552))
0.0013869625520110957
0.02127659574468085

So, $P(x_j = 1 | y=1)$, where $x_j$ is the term age, is estimated to be .0213, and $P(x_j = 1 | y=0)$ is estimated to be .0014.

Note that these numbers match up with what we got using

library(e1071)
smsNB = naiveBayes(smsTrain, smsTrainy)

in R.
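
The same by-hand calculation can be wrapped in a small helper. This is just a hypothetical convenience function (not part of sklearn) that estimates $P(x_j = 1 | y)$ for any column of trainB:

def term_given_class(term):
    # hypothetical helper: estimated P(term present | y) for each class, from the training data
    tab = pd.crosstab(trainyB, trainB[term])   # rows: y (0 = ham, 1 = spam), columns: term absent/present
    return tab[1] / tab.sum(axis=1)            # fraction of each class's documents containing the term

print(term_given_class('age'))   # should reproduce the .0014 (ham) and .0213 (spam) above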

Fit Naive Bayes on Train, Predict on Test

OK, let's do naive Bayes using sklearn.
As usual we:

  • make a model
  • fit on train
  • predict on test
In [9]:
model = MultinomialNB()  #create model object
model.fit(trainB,trainyB) # fit on train
yhat = model.predict(testB) # predict on test
In [10]:
print(yhat.shape)
print(type(yhat))
print(yhat.dtype)
yhat[:5]
(1390,)
<class 'numpy.ndarray'>
int64
Out[10]:
array([0, 0, 0, 0, 1])
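
Besides the hard 0/1 predictions, MultinomialNB also gives estimated class probabilities through predict_proba. Here is a quick look (the exact numbers of course depend on the fit):

phat = model.predict_proba(testB)   # column 0: estimated P(ham | x), column 1: estimated P(spam | x)
print(phat[:5].round(3))            # probabilities for the first few test documents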

How did we do!?

What is our out of sample predictive performance on the test data?

The most basic is the confusion matrix, which is simply the cross-tab of predicted and actual.

In [11]:
confusion_matrix(testyB,yhat)
Out[11]:
array([[1190,   17],
       [  25,  158]])

So we got 25 + 17 = 42 wrong: 17 hams were predicted to be spam and 25 spams were predicted to be ham.

In [12]:
accuracy_score(testyB,yhat)
Out[12]:
0.9697841726618706
In [13]:
(158+1190)/1390
Out[13]:
0.9697841726618706

Let's do the simple cross-tab to check.

In [14]:
pd.crosstab(testyB,yhat)
Out[14]:
col_0        0    1
smsTestyB
0         1190   17
1           25  158

Let's have a quick look at the model object.

In [15]:
model
Out[15]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

?model

Parameters

alpha : float, optional (default=1.0) Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

fit_prior : boolean, optional (default=True) Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
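
If you are curious whether the smoothing parameter matters for this data, you could refit with a different alpha and compare the test accuracy. This is just a sketch of how one might check, not a claim about what the numbers will turn out to be:

model2 = MultinomialNB(alpha=0.1)      # less Laplace smoothing than the default alpha=1.0
model2.fit(trainB, trainyB)
yhat2 = model2.predict(testB)
print(accuracy_score(testyB, yhat2))   # compare with the alpha=1.0 accuracy above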

Let's compare to what we had in R.

In [16]:
inr = (1 - 0.02589928)
print(f'result in R was {inr}')
result in R was 0.97410072

which is virtually the same as the 0.9698 accuracy we got from sklearn.