CS 445/545: Machine Learning Programming Assignment #2



Part I: Classification with Naïve Bayes
1. Create training and test set:
Split the data into a training and test set. Each of these should have about 2,300 instances,
and each should have about 40% spam, 60% not-spam, to reflect the statistics of the full
data set. Since you are assuming each feature is independent of all others, here it is not
necessary to standardize the features.
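One way to obtain two halves with the same class proportions is to split each class separately and then recombine. The sketch below illustrates that idea on toy data; the function name `stratified_split`, the `seed` parameter, and the toy lists are assumptions made for illustration, not part of the assignment.

```python
import random


def stratified_split(rows, labels, seed=0):
    """Split the data into two halves while preserving the class ratio.

    Hypothetical helper: `rows` is a list of feature lists and `labels`
    is a parallel list of 0/1 class labels.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of this class only, then cut them in half.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    # Shuffle again so the classes are interleaved within each half.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy stand-in for the real data: 40% spam (1), 60% not-spam (0).
toy_labels = [1] * 40 + [0] * 60
toy_rows = [[float(i)] for i in range(100)]
(train_X, train_y), (test_X, test_y) = stratified_split(toy_rows, toy_labels)
```

Splitting per class keeps the spam fraction of each half equal to that of the full set, up to rounding, which a plain random split only achieves approximately.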
2. Create probabilistic model. (Write your own code to do this.)
• Compute the prior probability for each class, 1 (spam) and 0 (not-spam), in
the training data. As described in step 1, P(1) should be about 0.4.
• For each of the 57 features, compute the mean and standard deviation in the
training set of the values given each class. If any of the features has zero standard
deviation, assign it a “minimal” standard deviation (e.g., 0.0001) to avoid a
divide-by-zero error in Gaussian Naïve Bayes.
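The model-building step can be sketched as follows, assuming the training data is held as a list of feature lists `X` with a parallel label list `y`. The function name `fit_gaussian_nb`, the returned dictionary layout, and the toy data are illustrative assumptions. The sketch uses the population standard deviation (dividing by n); dividing by n - 1 is an equally reasonable choice that the assignment does not rule out.

```python
import math

MIN_STD = 0.0001  # the "minimal" standard deviation suggested above


def fit_gaussian_nb(X, y, min_std=MIN_STD):
    """Return {class: (prior, means, stds)} estimated from the training data."""
    model = {}
    n_total = len(y)
    n_features = len(X[0])
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        n = len(rows)
        means, stds = [], []
        for j in range(n_features):
            col = [r[j] for r in rows]
            mu = sum(col) / n
            # Population variance (divide by n); an assumption of this sketch.
            var = sum((v - mu) ** 2 for v in col) / n
            sd = math.sqrt(var)
            means.append(mu)
            # Replace a zero standard deviation with the minimal value.
            stds.append(sd if sd > 0 else min_std)
        model[cls] = (n / n_total, means, stds)
    return model


# Toy data: two features, two spam rows and three not-spam rows.
toy_X = [[1.0, 5.0], [3.0, 5.0], [10.0, 0.0], [10.0, 2.0], [10.0, 4.0]]
toy_y = [1, 1, 0, 0, 0]
model = fit_gaussian_nb(toy_X, toy_y)
```

In the toy data the second feature is constant within class 1 and the first feature is constant within class 0, so both of those standard deviations are replaced by the minimal value.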
3. Run Naïve Bayes on the test data. (Write your own code to do this.)
• Use the Gaussian Naïve Bayes algorithm to classify the instances in your test
set, using
$$P(x_i \mid c_j) = N(x_i;\, \mu_{i,c_j}, \sigma_{i,c_j}), \quad \text{where} \quad N(x;\, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Because a product of 58 probabilities will be very small, we will instead use the
log of the product. Recall that the classification method is:
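$$class_{NB}(\mathbf{x}) = \operatorname*{argmax}_{class} \left[ P(class) \prod_i P(x_i \mid class) \right]$$
Since
$$\operatorname*{argmax}_{z} f(z) = \operatorname*{argmax}_{z} \log f(z)$$
we have:
$$class_{NB}(\mathbf{x}) = \operatorname*{argmax}_{class} \left[ \log P(class) + \sum_i \log P(x_i \mid class) \right]$$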
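The log-space decision rule can be sketched as follows. The model layout `{class: (prior, means, stds)}`, the function names `log_gaussian` and `predict`, and the toy model are assumptions made for illustration. The sketch computes the logarithm of the Gaussian density directly instead of taking the log of a computed density, since the density itself can underflow to zero when a standard deviation is very small.

```python
import math


def log_gaussian(x, mu, sigma):
    """log N(x; mu, sigma), computed without calling exp()."""
    return (-math.log(math.sqrt(2 * math.pi) * sigma)
            - (x - mu) ** 2 / (2 * sigma ** 2))


def predict(model, x):
    """Return the class maximizing log P(class) + sum_i log P(x_i | class)."""
    best_cls, best_score = None, -math.inf
    for cls, (prior, means, stds) in model.items():
        score = math.log(prior)
        for xi, mu, sd in zip(x, means, stds):
            score += log_gaussian(xi, mu, sd)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Hypothetical two-feature model: {class: (prior, means, stds)}.
toy_model = {
    0: (0.6, [0.0, 0.0], [1.0, 1.0]),
    1: (0.4, [5.0, 5.0], [1.0, 1.0]),
}
near_class_0 = predict(toy_model, [0.2, -0.1])
near_class_1 = predict(toy_model, [4.8, 5.3])
```

Here `near_class_0` is 0 and `near_class_1` is 1, because each point lies close to the corresponding class means.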
In your report, include a short description of what you did, and your results: the
accuracy, precision, and recall on the test set, as well as a confusion matrix for the test
set. Write a few sentences describing your results, and answer these questions: Do
you think the attributes here are independent, as assumed by Naïve Bayes? Does
Naïve Bayes do well on this problem in spite of the independence assumption?
Speculate on other reasons Naïve Bayes might do well or poorly on this problem.
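The requested metrics can all be derived from the four confusion-matrix counts. The sketch below treats spam (1) as the positive class; the function names and the toy label lists are illustrative assumptions. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the assignment specifies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return accuracy, precision, recall, and the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Convention of this sketch: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Rows: actual 0, actual 1. Columns: predicted 0, predicted 1.
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels: 3 spam and 5 not-spam instances.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 1, 0]
report = summarize(toy_true, toy_pred)
```

On the toy labels there are 2 true positives, 1 false negative, 1 false positive, and 4 true negatives, giving an accuracy of 0.75 and a precision and recall of 2/3 each.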
Here is what you need to turn in:
• Your report.
• Your well-commented code.
How to turn it in (read carefully!):
• Send these items in electronic format to ark2@pdx.edu (our grader) by 10pm on the due
date. No hard copy please!
• The report should be in pdf format and the code should be in plain-text format.
• Put “MACHINE LEARNING PROGRAMMING #2” in the subject line.
If there are any questions, don’t hesitate to ask me or the grader.