Homework 2, COMS 4771 solved




Problem 1 (Na¨ıve Bayes; 30 points). Download the “20 Newsgroups data set” news.mat from
Courseworks. The training feature vectors/labels and test feature vectors/labels are stored as
data/labels and testdata/testlabels. Each data point corresponds to a message posted to one
of 20 di↵erent newsgroups (i.e., message boards). The representation of a message is a (sparse)
binary vector in X := {0, 1}d (for d := 61188) that indicates the words that are present in the
message. If the j-th entry in the vector is 1, it means the message contains the word that is given
on the j-th line of the text file news.vocab. The class labels are Y := {1, 2,…, 20}, where the
mapping from classes to newsgroups is in the file news.groups (which we won’t actually need).
In this problem, you’ll develop a classifier based on a Na¨ıve Bayes generative model. Here, we use
class conditional distributions of the form Pµ(x) = Qd
j=1 µxj
j (1 µj )1xj for x = (x1, x2,…,xd) 2
X . Here, µ = (µ1, µ2,…,µd) 2 [0, 1]d is the parameter vector from the parameter space [0, 1]d
Since there are 20 classes, the generative model is actually parameterized by 20 such vectors,
µy = (µy,1, µy,2,…,µy,d) for each y 2 Y, as well as the class prior parameters, ⇡y for each y 2 Y.
The class prior parameters, of course, must satisfy ⇡y 2 [0, 1] for each y 2 Y and P
y2Y ⇡y = 1.
(a) Give the formula for the MLE of the parameter µy,j based on training data {(xi, yi)}n
(Remember, each unlabeled point is a vector: xi = (xi,1, xi,2,…,xi,d) 2 {0, 1}d.)
(b) MLE is not a good estimator for the class conditional parameters if the estimate turns out to
be zero or one. An alternative is the following estimator based on a technique called Laplace
smoothing: ˆµy,j := (1 + Pn
i=1 1{yi = y}xi,j )/(2 + Pn
i=1 1{yi = y}) 2 (0, 1).
Write codes for training and testing a classifier based on the Na¨ıve Bayes generative model described above. Use Laplace smoothing to estimate class conditional distribution parameters,
and MLE for class prior parameters. You should not use or look at any existing implementation (e.g., such as those that may be provided as library functions). Using your codes,
train and test a classifier with the data from news.mat. Your codes should be easy to
understand (e.g., by using sensible variable names and comments).
What to submit: (1) training and test error rates, (2) source code (in a separate file).
(c) Consider the binary classification problem, where newsgroups {1, 16, 20} comprise the “negative class” (class 0), and newsgroups {17, 18, 19} comprise the “positive class” (class 1).
Newsgroups {1, 16, 20} are “religious” topics, and newsgroups {17, 18, 19} are “political” topics. Modify the data in news.mat to create the training and test data sets for this problem.
Using these data and your codes from part (b), train and test a Na¨ıve Bayes classifier.
What to submit: training and test error rates. Save the learned classifier for part (d)!
(d) The classifier you learn is ultimately a linear classifier, which means it has the following form:
x 7!
0 if ↵0 + Pd
j=1 ↵jxj  0
1 if ↵0 + Pd
j=1 ↵jxj > 0
for some real numbers ↵0, ↵1,…, ↵d. Determine the values of these ↵j ’s for your learned
classifier from part (c). Then, report the vocabulary words whose indices j 2 {1, 2,…,d}
correspond to the 20 largest (i.e., most positive) ↵j value, and also the vocabulary words
whose indices j 2 {1, 2,…,d} correspond to the 20 smallest (i.e., most negative) ↵j value.
Don’t report the indices j’s, but rather the actual vocabulary words (from news.vocab).
What to submit: two ordered list (appropriately labeled) of 20 words each.
Problem 2 (Cost-sensitive classification; 10 points). Suppose you face a binary classification problem with input space X = R and output space Y = {0, 1}, where it is c times as bad to commit a
“false positive” as it is to commit a “false negative” (for some real number c 1). To make this
concrete, let’s say that if your classifier predicts 1 but the correct label is 0, you incur a penalty of
$c; if your classifier predicts 0 but the correct label is 1, you incur a penalty of $1. (And you incur
no penalty if your classifier predicts the correct label.)
Assume the distribution you care about has a class prior with ⇡0 = 2/3 and ⇡1 = 1/3, and the
class conditional densities are N(0, 1) for class 0, and N(2, 1/4) for class 1. Let f ? : R ! {0, 1} be
the classifier with the smallest expected penalty.
(a) Assume 1  c  14. Specify precisely (and with a simple expression involving c) the region
in which the classifier f ? predicts 1.
(b) Now instead assume c 15. Specify precisely the region in which the classifier f ? predicts 1.
Problem 3 (Covariance matrices; 10 points). Let X be a mean-zero random vector in Rd (so
E(X) = 0). Let ⌃ := E(XX>) be the covariance matrix of X, and suppose its eigenvalues are
1 2 ··· d. Let > 0 be a positive number.
(a) What are the eigenvalues of ⌃ + 2I?
(b) What are the eigenvalues of (⌃ + 2I)2?
In both cases, give your answers in terms of and the eigenvalues of ⌃.