Let's look at the ham/spam data in Python.
First off, here are our Python imports.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
The model class is MultinomialNB. This is the naive Bayes analogue to
from sklearn.linear_model import LinearRegression
which we used for linear regression.
The model will do the naive Bayes work for us.
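For the record, here is the rule naive Bayes uses: assuming the terms are independent given the class, the posterior class probabilities satisfy
$P(y | x_1, x_2, \ldots, x_p) \propto P(y) \, \prod_{j=1}^{p} P(x_j | y)$
and we classify a document to the class with the biggest value.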
We also import tools for measuring our success.
We will use the confusion matrix and the accuracy score.
For a numeric y we can use RMSE or MAD (mean absolute deviation).
For categorical outcomes there are actually a lot of different ways people measure
success. sklearn has a lot of metrics for measuring in-sample fit or out-of-sample
accuracy.
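As an aside, here is a minimal sketch of one of those other metrics, classification_report, which gives per-class precision and recall (the 0/1 vectors below are made up just to illustrate the call, they are not our sms data):
from sklearn.metrics import classification_report
ytrue = [0, 0, 1, 1, 1]  # made-up actual labels
ypred = [0, 1, 1, 1, 0]  # made-up predicted labels
print(classification_report(ytrue, ypred))  # per-class precision, recall, f1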
trainB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainB.csv")
testB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestB.csv")
trainyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTrainyB.csv")['smsTrainyB']
testyB = pd.read_csv("https://bitbucket.org/remcc/rob-data-sets/downloads/smsTestyB.csv")['smsTestyB']
print(trainB.shape)
print(testB.shape)
print(trainyB.shape)
print(testyB.shape)
We have 4,169 train observations and 1,390 test observations.
trainB is our x matrix and trainyB is our outcomes ham/spam.
Our corresponding test data is testB and testyB.
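The train/test split was done for us in these files, but if you start from a single data set you can make your own split with sklearn; here is a sketch on made-up data (the 25% test fraction and the seed are arbitrary choices, not anything used for the sms data):
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(20).reshape(10, 2)  # made-up x matrix
y = np.arange(10) % 2             # made-up 0/1 outcomes
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=34)
print(Xtrain.shape, Xtest.shape)  # train and test x dimensions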
Let's have a quick look at the data.
trainB.iloc[0:5,0:4] #first few rows and columns of train x
trainyB.iloc[0:6] # first few train y
trainyB.value_counts()/trainyB.shape[0] #train ham/spam frequencies
So all of the data has been represented as 0/1.
For y, 1 means spam and 0 means ham.
For x, the (i,j) element is 1 if the jth term is in the ith document and 0 otherwise.
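If you were starting from the raw text messages, one way to build this kind of 0/1 term matrix is sklearn's CountVectorizer with binary=True; here is a sketch on two made-up messages (this is just an illustration, not how the sms files were actually built):
from sklearn.feature_extraction.text import CountVectorizer
docs = ["free prize call now", "see you at home"]  # made-up documents
vec = CountVectorizer(binary=True)  # 1 if the term is in the document, 0 otherwise
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # the terms, one per column
print(X.toarray())                  # the 0/1 document-term matrix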
Let's look at the term age and y.
pd.crosstab(trainyB,trainB['age'])
print(5/(5+3600))
print(12/(12+552))
So, $P(X_j = 1 | y=1)$ for the term $j=$ age is estimated to be $12/564 \approx .0213$,
and $P(X_j = 1 | y=0)$ is estimated to be $5/3605 \approx .0014$.
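Equivalently, since the age column is 0/1, these conditional sample proportions are just group means, so we can get both numbers at once:
trainB.groupby(trainyB)['age'].mean()  # estimates of P(age = 1 | y) for y = 0, 1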
Note that these numbers match up with what we got using
library(e1071)
smsNB = naiveBayes(smsTrain, smsTrainy)
in R.
Ok, let's do naive Bayes using sklearn.
As usual we create the model object, fit on train, and predict on test:
model = MultinomialNB() #create model object
model.fit(trainB,trainyB) # fit on train
yhat = model.predict(testB) # predict on test
print(yhat.shape)
print(type(yhat))
print(yhat.dtype)
yhat[:5]
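The fitted object stores its estimated quantities in attributes (names as in the sklearn docs); for example we can pull out the class labels, the train class counts, and the estimated class priors:
print(model.classes_)                  # class labels, here 0 (ham) and 1 (spam)
print(model.class_count_)              # number of train observations in each class
print(np.exp(model.class_log_prior_))  # estimated P(y) for each class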
How did we do!!??
What is our out-of-sample predictive performance on the test data?
The most basic tool is the confusion matrix, which is simply the cross-tab of predicted and actual.
confusion_matrix(testyB,yhat)
So we got 25 + 17 = 42 wrong.
accuracy_score(testyB,yhat)
(158+1190)/1390
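The correct predictions sit on the diagonal of the confusion matrix, so we can also compute the accuracy directly from it:
cm = confusion_matrix(testyB, yhat)
print(np.diag(cm).sum() / cm.sum())  # fraction correct = accuracy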
Let's do the simple cross-tab to check.
pd.crosstab(testyB,yhat)
Let's have a quick look at the model object.
model
?model
alpha : float, optional (default=1.0) Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
fit_prior : boolean, optional (default=True) Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
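So, for example, we could try lighter smoothing and a uniform class prior; a sketch (the alpha value here is arbitrary, and we have not checked whether this helps on the sms data):
model2 = MultinomialNB(alpha=0.5, fit_prior=False)  # lighter smoothing, uniform prior
model2.fit(trainB, trainyB)
print(accuracy_score(testyB, model2.predict(testB)))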
Let's compare to what we had in R.
inr = (1 - 0.02589928)  # one minus the misclassification rate we got in R
print(f'result in R was {inr}')
which is virtually the same as the accuracy we got here.