## Description

Part I: Classification with Naïve Bayes

1. Create training and test set:

Split the data into a training and test set. Each of these should have about 2,300 instances, and each should have about 40% spam, 60% not-spam, to reflect the statistics of the full data set. Since you are assuming each feature is independent of all others, here it is not necessary to standardize the features.
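One way to get two halves with the same class ratio is to split each class separately and then combine the pieces. The sketch below assumes the data is already loaded as a list of feature rows and a parallel list of 0/1 labels; the function name `stratified_split` and the `seed` parameter are illustrative and not part of the assignment.

```python
import random


def stratified_split(rows, labels, seed=0):
    """Split the data into two halves with the same class ratio in each.

    rows   -- list of feature vectors (assumed already loaded)
    labels -- parallel list of class labels (1 = spam, 0 = not-spam)
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Collect the indices of this class and shuffle them.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    # Shuffle again so the classes are interleaved within each set.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```

Splitting per class keeps the spam fraction in each half within one instance of the fraction in the full data set, which a plain random split does not guarantee.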

2. Create probabilistic model. (Write your own code to do this.)

• Compute the prior probability for each class, 1 (spam) and 0 (not-spam), in the training data. As described in step 1, P(1) should be about 0.4.

• For each of the 57 features, compute the mean and standard deviation in the training set of the values given each class. If any feature has zero standard deviation, assign it a “minimal” standard deviation (e.g., 0.0001) to avoid a divide-by-zero error in Gaussian Naïve Bayes.
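The two bullets above could be sketched as follows. This is only one possible layout: the function name `fit_gaussian_nb`, the returned dictionary format, and the use of the population standard deviation (dividing by N rather than N - 1) are assumptions, since the assignment does not specify them.

```python
import statistics

MIN_STD = 1e-4  # "minimal" standard deviation suggested in the assignment


def fit_gaussian_nb(X, y, min_std=MIN_STD):
    """Estimate priors and per-feature Gaussian parameters for each class.

    Returns {class: (prior, means, stds)}; this layout is an assumption.
    """
    model = {}
    n_total = len(y)
    for cls in sorted(set(y)):
        # Training rows that belong to this class.
        rows = [x for x, label in zip(X, y) if label == cls]
        prior = len(rows) / n_total
        # Transpose so that each tuple holds all values of one feature.
        columns = list(zip(*rows))
        means = [statistics.fmean(col) for col in columns]
        stds = []
        for col in columns:
            s = statistics.pstdev(col)  # population std (assumption)
            # Replace a zero std so the Gaussian never divides by zero.
            stds.append(s if s > 0.0 else min_std)
        model[cls] = (prior, means, stds)
    return model
```

On the real data, each class would get 57 means and 57 standard deviations, and the prior for class 1 should come out near 0.4.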

3. Run Naïve Bayes on the test data. (Write your own code to do this.)

• Use the Gaussian Naïve Bayes algorithm to classify the instances in your test set, using

$$P(x_i \mid c_j) = N(x_i;\, \mu_{i,c_j}, \sigma_{i,c_j}), \quad \text{where} \quad N(x;\, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

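Because a product of 58 probabilities will be very small, we will instead use the log of the product. Recall that the classification method is:

$$\text{class}_{NB}(\mathbf{x}) = \operatorname*{argmax}_{\text{class}} \; P(\text{class}) \prod_i P(x_i \mid \text{class})$$

Since

$$\operatorname*{argmax}_{z} \; f(z) = \operatorname*{argmax}_{z} \; \log f(z)$$

we have:

$$\text{class}_{NB}(\mathbf{x}) = \operatorname*{argmax}_{\text{class}} \left[ \log P(\text{class}) + \sum_i \log P(x_i \mid \text{class}) \right]$$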
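A minimal sketch of classification in log space is given below. It assumes a fitted model stored as a dictionary `{class: (prior, means, stds)}`; that layout and the names `log_gaussian` and `predict` are hypothetical and not prescribed by the assignment. The log of the Gaussian density is written out directly, rather than taking the log of a computed density, so that a very small density does not underflow to zero first.

```python
import math


def log_gaussian(x, mean, std):
    """Return log N(x; mean, std) without evaluating exp()."""
    return (-math.log(math.sqrt(2.0 * math.pi) * std)
            - (x - mean) ** 2 / (2.0 * std ** 2))


def predict(model, x):
    """Return the class maximizing log P(class) + sum_i log P(x_i | class).

    model -- {class: (prior, means, stds)} (assumed layout)
    x     -- one feature vector
    """
    best_cls, best_score = None, -math.inf
    for cls, (prior, means, stds) in model.items():
        score = math.log(prior)
        for xi, m, s in zip(x, means, stds):
            score += log_gaussian(xi, m, s)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```

To see why the log is needed, note that `math.prod([1e-6] * 58)` underflows to `0.0` in double precision, while the corresponding sum of logs is an ordinary finite number.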

In your report, include a short description of what you did, and your results: the accuracy, precision, and recall on the test set, as well as a confusion matrix for the test set. Write a few sentences describing your results, and answer these questions: Do you think the attributes here are independent, as assumed by Naïve Bayes? Does Naïve Bayes do well on this problem in spite of the independence assumption? Speculate on other reasons Naïve Bayes might do well or poorly on this problem.
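The requested metrics can all be derived from the four confusion-matrix counts. The sketch below assumes spam (label 1) is the positive class and that true and predicted labels are available as parallel lists; the function names are illustrative, and returning 0.0 for an undefined precision or recall is a convention chosen here, not something the assignment specifies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Fall back to 0.0 when the denominator is zero (assumed convention).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Here precision is the fraction of predicted spam that really is spam, and recall is the fraction of actual spam that was caught.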

Here is what you need to turn in:

• Your report.

• Your well-commented code.

How to turn it in (read carefully!):

• Send these items in electronic format to ark2@pdx.edu (our grader) by 10pm on the due date. No hard copy please!

• The report should be in pdf format and the code should be in plain-text format.

• Put “MACHINE LEARNING PROGRAMMING #2” in the subject line.

If there are any questions, don’t hesitate to ask me or the grader.