## Description

Problem 1 (Naïve Bayes; 30 points). Download the "20 Newsgroups data set" news.mat from Courseworks. The training feature vectors/labels and test feature vectors/labels are stored as data/labels and testdata/testlabels. Each data point corresponds to a message posted to one of 20 different newsgroups (i.e., message boards). The representation of a message is a (sparse) binary vector in $\mathcal{X} := \{0,1\}^d$ (for $d := 61188$) that indicates the words that are present in the message. If the $j$-th entry in the vector is 1, it means the message contains the word that is given on the $j$-th line of the text file news.vocab. The class labels are $\mathcal{Y} := \{1, 2, \ldots, 20\}$, where the mapping from classes to newsgroups is in the file news.groups (which we won't actually need).

In this problem, you'll develop a classifier based on a Naïve Bayes generative model. Here, we use class conditional distributions of the form

$$P_\mu(x) = \prod_{j=1}^{d} \mu_j^{x_j} (1 - \mu_j)^{1 - x_j} \quad \text{for } x = (x_1, x_2, \ldots, x_d) \in \mathcal{X}.$$

Here, $\mu = (\mu_1, \mu_2, \ldots, \mu_d) \in [0,1]^d$ is the parameter vector from the parameter space $[0,1]^d$. Since there are 20 classes, the generative model is actually parameterized by 20 such vectors, $\mu_y = (\mu_{y,1}, \mu_{y,2}, \ldots, \mu_{y,d})$ for each $y \in \mathcal{Y}$, as well as the class prior parameters, $\pi_y$ for each $y \in \mathcal{Y}$. The class prior parameters, of course, must satisfy $\pi_y \in [0,1]$ for each $y \in \mathcal{Y}$ and $\sum_{y \in \mathcal{Y}} \pi_y = 1$.

(a) Give the formula for the MLE of the parameter $\mu_{y,j}$ based on training data $\{(x_i, y_i)\}_{i=1}^{n}$. (Remember, each unlabeled point is a vector: $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d}) \in \{0,1\}^d$.)

(b) MLE is not a good estimator for the class conditional parameters if the estimate turns out to be zero or one. An alternative is the following estimator, based on a technique called Laplace smoothing:

$$\hat{\mu}_{y,j} := \frac{1 + \sum_{i=1}^{n} \mathbf{1}\{y_i = y\}\, x_{i,j}}{2 + \sum_{i=1}^{n} \mathbf{1}\{y_i = y\}} \in (0, 1).$$

Write code for training and testing a classifier based on the Naïve Bayes generative model described above. Use Laplace smoothing to estimate the class conditional distribution parameters, and MLE for the class prior parameters. You should not use or look at any existing implementation (e.g., such as those that may be provided as library functions). Using your code, train and test a classifier with the data from news.mat. Your code should be easy to understand (e.g., by using sensible variable names and comments).

What to submit: (1) training and test error rates, (2) source code (in a separate file).
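As a rough sketch of what such an implementation might look like (not the required submission), the following assumes the .mat contents have been loaded into dense NumPy arrays; all function and variable names here are illustrative:

```python
import numpy as np

def train_nb(X, y, num_classes):
    """Fit the Bernoulli Naive Bayes model: Laplace-smoothed class
    conditionals mu[y, j] and MLE class priors prior[y].
    X: (n, d) binary array; y: (n,) labels in {1, ..., num_classes}."""
    n, d = X.shape
    mu = np.empty((num_classes, d))
    prior = np.empty(num_classes)
    for c in range(1, num_classes + 1):
        Xc = X[y == c]
        # Laplace smoothing: (1 + count of word j in class c) / (2 + size of class c)
        mu[c - 1] = (1.0 + Xc.sum(axis=0)) / (2.0 + Xc.shape[0])
        prior[c - 1] = Xc.shape[0] / n  # MLE of pi_c
    return mu, prior

def predict_nb(X, mu, prior):
    """Predict argmax_y of log pi_y + sum_j [x_j log mu_{y,j} + (1-x_j) log(1-mu_{y,j})]."""
    log_mu, log_not = np.log(mu), np.log(1.0 - mu)
    scores = np.log(prior) + X @ log_mu.T + (1.0 - X) @ log_not.T
    return scores.argmax(axis=1) + 1  # back to 1-based class labels

# error rate, e.g.: np.mean(predict_nb(testdata, mu, prior) != testlabels)
```

With the actual sparse 61188-dimensional data, the dense `(1.0 - X)` term should be avoided; an equivalent scoring is `scores = np.log(prior) + log_not.sum(axis=1) + X @ (log_mu - log_not).T`.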

(c) Consider the binary classification problem, where newsgroups {1, 16, 20} comprise the "negative class" (class 0), and newsgroups {17, 18, 19} comprise the "positive class" (class 1). Newsgroups {1, 16, 20} are "religious" topics, and newsgroups {17, 18, 19} are "political" topics. Modify the data in news.mat to create the training and test data sets for this problem. Using these data and your code from part (b), train and test a Naïve Bayes classifier.

What to submit: training and test error rates. Save the learned classifier for part (d)!
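One plausible way to build the binary data sets is to filter down to the six listed newsgroups and relabel, sketched below (function and variable names are hypothetical):

```python
import numpy as np

def make_binary(X, y, neg=(1, 16, 20), pos=(17, 18, 19)):
    """Keep only messages whose newsgroup is in neg or pos, and relabel
    neg -> 0 ("religious") and pos -> 1 ("political").
    X: (n, d) feature matrix; y: (n,) labels in {1, ..., 20}."""
    keep = np.isin(y, neg + pos)
    return X[keep], np.isin(y[keep], pos).astype(int)
```

The same filtering applies to both data/labels and testdata/testlabels; if your part (b) trainer expects labels in $\{1, \ldots, K\}$, shift the 0/1 labels up by one before reusing it.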

(d) The classifier you learn is ultimately a linear classifier, which means it has the following form:

$$x \mapsto \begin{cases} 0 & \text{if } \alpha_0 + \sum_{j=1}^{d} \alpha_j x_j \le 0 \\ 1 & \text{if } \alpha_0 + \sum_{j=1}^{d} \alpha_j x_j > 0 \end{cases}$$

for some real numbers $\alpha_0, \alpha_1, \ldots, \alpha_d$. Determine the values of these $\alpha_j$'s for your learned classifier from part (c). Then, report the vocabulary words whose indices $j \in \{1, 2, \ldots, d\}$ correspond to the 20 largest (i.e., most positive) $\alpha_j$ values, and also the vocabulary words whose indices $j \in \{1, 2, \ldots, d\}$ correspond to the 20 smallest (i.e., most negative) $\alpha_j$ values. Don't report the indices $j$, but rather the actual vocabulary words (from news.vocab).

What to submit: two ordered lists (appropriately labeled) of 20 words each.
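Once the $\alpha_j$'s have been computed, mapping the extreme coefficients back to words is a small sort; a sketch (assuming `alpha` holds the length-$d$ weight vector without the bias $\alpha_0$, and `vocab` holds the lines of news.vocab — both names are illustrative):

```python
import numpy as np

def top_and_bottom_words(alpha, vocab, k=20):
    """Return (largest, smallest): the words with the k most positive and
    the k most negative coefficients, each ordered from most extreme inward.
    alpha: (d,) weights alpha_1..alpha_d; vocab: list with vocab[j] taken
    from line j+1 of news.vocab."""
    order = np.argsort(alpha)                      # ascending by coefficient
    smallest = [vocab[j] for j in order[:k]]       # most negative first
    largest = [vocab[j] for j in order[::-1][:k]]  # most positive first
    return largest, smallest

# vocab could be read as, e.g.:
# vocab = [line.strip() for line in open("news.vocab")]
```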


Problem 2 (Cost-sensitive classification; 10 points). Suppose you face a binary classification problem with input space $\mathcal{X} = \mathbb{R}$ and output space $\mathcal{Y} = \{0, 1\}$, where it is $c$ times as bad to commit a "false positive" as it is to commit a "false negative" (for some real number $c \ge 1$). To make this concrete, let's say that if your classifier predicts 1 but the correct label is 0, you incur a penalty of $\$c$; if your classifier predicts 0 but the correct label is 1, you incur a penalty of $\$1$. (And you incur no penalty if your classifier predicts the correct label.)

Assume the distribution you care about has a class prior with $\pi_0 = 2/3$ and $\pi_1 = 1/3$, and the class conditional densities are $N(0, 1)$ for class 0, and $N(2, 1/4)$ for class 1. Let $f^\star : \mathbb{R} \to \{0, 1\}$ be the classifier with the smallest expected penalty.

(a) Assume $1 \le c \le 14$. Specify precisely (and with a simple expression involving $c$) the region in which the classifier $f^\star$ predicts 1.

(b) Now instead assume $c \ge 15$. Specify precisely the region in which the classifier $f^\star$ predicts 1.
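A numerical sanity check for both parts: conditioned on $X = x$, predicting 1 incurs expected penalty proportional to $c\,\pi_0 p_0(x)$ and predicting 0 incurs expected penalty proportional to $\pi_1 p_1(x)$ (the common evidence factor cancels), so the optimal classifier compares these two quantities. A small sketch that traces out the predict-1 region on a grid (the grid range and step are arbitrary choices):

```python
import math

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def f_star(x, c):
    """Predict 1 iff the (unnormalized) penalty of predicting 0 exceeds
    that of predicting 1: pi_1 p_1(x) > c * pi_0 p_0(x)."""
    p0 = (2.0 / 3.0) * gauss_pdf(x, 0.0, 1.0)   # class 0: N(0, 1)
    p1 = (1.0 / 3.0) * gauss_pdf(x, 2.0, 0.25)  # class 1: N(2, 1/4)
    return 1 if p1 > c * p0 else 0

# e.g. scan [-5, 5] for c = 2 to see roughly where f* predicts 1:
ones = [x / 100.0 for x in range(-500, 501) if f_star(x / 100.0, 2.0) == 1]
print(f"predict-1 region for c=2: roughly [{min(ones):.2f}, {max(ones):.2f}]")
```

This only approximates the region's endpoints on a grid; the problem asks for the exact expression, which the scan can then confirm.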

Problem 3 (Covariance matrices; 10 points). Let $X$ be a mean-zero random vector in $\mathbb{R}^d$ (so $\mathbb{E}(X) = 0$). Let $\Sigma := \mathbb{E}(XX^\top)$ be the covariance matrix of $X$, and suppose its eigenvalues are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Let $\sigma > 0$ be a positive number.

(a) What are the eigenvalues of $\Sigma + \sigma^2 I$?

(b) What are the eigenvalues of $(\Sigma + \sigma^2 I)^2$?

In both cases, give your answers in terms of $\sigma$ and the eigenvalues of $\Sigma$.
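Not a substitute for the derivation, but whatever answer you obtain can be checked numerically; in the sketch below, a random PSD matrix stands in for $\Sigma$, and `eigvalsh` returns eigenvalues in ascending order:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T          # symmetric PSD, standing in for E[X X^T]
sigma = 0.5              # an arbitrary choice of the positive number sigma

lam = np.linalg.eigvalsh(Sigma)      # eigenvalues of Sigma, ascending
M = Sigma + sigma**2 * np.eye(4)

# (a) compare eigvalsh(M) against your claimed answer (as a function of lam, sigma)
# (b) compare eigvalsh(M @ M) the same way; both use eigvalsh's ascending order
print(np.linalg.eigvalsh(M))
print(np.linalg.eigvalsh(M @ M))
```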
