## Description

Problem 1 (30 points)

In this problem, you will implement and evaluate the nearest neighbor rule.

MNIST data set

Download the MNIST data set of handwritten digit images ocr.mat from the course website, and load it into Python:

```python
from scipy.io import loadmat
ocr = loadmat('ocr.mat')
```

The 60000 unlabeled training data (i.e., feature vectors) are contained in a matrix called data (one data point per row), and the corresponding labels are in a vector called labels. The 10000 test feature vectors and labels are in, respectively, testdata and testlabels.

In Python, you can view an image (say, the first one) with the following commands:

```python
import matplotlib.pyplot as plt
from matplotlib import cm
plt.imshow(ocr['data'][0].reshape((28,28)), cmap=cm.gray_r)
plt.show()
```

Nearest neighbor classifier

Let the training examples be denoted by $(x_1, y_1), \dots, (x_n, y_n)$ (for $n = 60000$), where each $x_i \in \mathbb{R}^d$ and each $y_i \in \{0, \dots, 9\}$.

The nearest neighbor classifier $\hat{f} \colon \mathbb{R}^d \to \{0, \dots, 9\}$ is a function defined in terms of the 60000 training examples as follows. On input $x \in \mathbb{R}^d$, let $i(x) := \arg\min_{i \in \{1, \dots, n\}} \|x - x_i\|_2$, breaking ties arbitrarily (i.e., $i(x)$ is any $i \in \{1, \dots, n\}$ for which $\|x - x_{i(x)}\|_2 \le \min_{i \in \{1, \dots, n\}} \|x - x_i\|_2$); and return $y_{i(x)}$.²

Write a program in Python that implements the nearest neighbor classifier. Your program should take as input a matrix of training feature vectors X and a vector of corresponding labels Y, as well as a matrix of test feature vectors test. The output of the program should be a vector of the predicted labels preds for all test points, as provided by the nearest neighbor classifier $\hat{f}$ based on the examples in X and Y.

You should not use (or look at the source code for) any library functions for computing all pairwise distances between entire sets of vectors, or for computing nearest neighbor queries; that would defeat the purpose of this assignment. However, you can use functions for computing inner products between vectors and norms of vectors, as well as other basic matrix and vector operations. If in doubt about what is okay to use, just ask.

In order to make your code efficient and not take a long time to run, you should use matrix/vector operations (rather than, say, a bunch of explicit for-loops). Think of the $n$ training feature vectors as being stacked as the rows of a matrix $X \in \mathbb{R}^{n \times d}$, and the $m$ test feature vectors as the rows of another matrix $T \in \mathbb{R}^{m \times d}$. If the $i$-th row of $T$ is $t_i^\top$ and the $j$-th row of $X$ is $x_j^\top$, then the squared Euclidean distance between $t_i$ and $x_j$ is

$$\|t_i - x_j\|_2^2 = (t_i - x_j)^\top (t_i - x_j) = t_i^\top t_i - 2\, t_i^\top x_j + x_j^\top x_j.$$

Can you leverage this identity to speed up computation of the nearest neighbor classifier?³
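
For illustration, here is a minimal sketch of one way to exploit the identity with numpy (the function name nn_classify is an assumption of this sketch, not part of the assignment; note that the $t_i^\top t_i$ term is constant across training points, so it can be dropped from the arg min):

```python
import numpy as np

def nn_classify(X, Y, test):
    """Nearest neighbor predictions for the rows of `test`."""
    # ||t_i - x_j||^2 = t_i.t_i - 2 t_i.x_j + x_j.x_j; the t_i.t_i term is
    # the same for every candidate x_j, so omitting it leaves the argmin
    # unchanged.
    cross = test @ X.T                 # (m, n) inner products t_i . x_j
    sq_norms = np.sum(X * X, axis=1)   # (n,) squared norms x_j . x_j
    nearest = np.argmin(sq_norms[None, :] - 2 * cross, axis=1)
    return Y[nearest]
```

For the full 10000-by-60000 problem the intermediate matrix is large, so you may want to process the test points in batches.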

²The expression "$\arg\min_{a \in A} f(a)$" denotes the set of minimizers of the function $f$ among all elements in the set $A$. On occasion, as we do here, we'll be a little sloppy and write expressions of the form "$\hat{a} := \arg\min_{a \in A} f(a)$". This will always mean that $\hat{a}$ is some minimizer of $f$ within $A$, rather than the set of minimizers. For this to really make sense, though, it must be made clear (either explicitly or implicitly) what we mean in the case that there are multiple minimizers (or if there are none!).

³In order to take advantage of matrix/vector operations, your Python installation needs to be using optimized linear algebra libraries, such as ATLAS, MKL, or the Accelerate/vecLib framework on OS X. If you installed Python using Anaconda, then you should already have MKL.


Evaluating the nearest neighbor classifier

Instead of using your nearest neighbor program with data and labels as the training data, do the following. For each value $n \in \{1000, 2000, 4000, 8000\}$:

1. Draw $n$ points from data, together with their corresponding labels, uniformly at random. Use sel = random.sample(range(60000), n) (after import random), ocr['data'][sel].astype('float'), and ocr['labels'][sel] to select the examples.⁴

2. Use these $n$ points as the training data and testdata as the test points; compute the fraction of test examples on which the nearest neighbor classifier predicts the label incorrectly (i.e., the test error rate).

A plot of the test error rate (on the y-axis) as a function of $n$ (on the x-axis) is called a learning curve.

Carry out the (random) process described above ten times, independently. Produce a learning curve plot using the average of these test error rates (that is, averaging over the ten repetitions). Add error bars to your plot that extend to one standard deviation above and below the means. Ensure the plot axes are properly labeled.
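
A rough sketch of the full evaluation loop follows, assuming ocr has been loaded as above and nn_classify is the function from the earlier sketch (both names are assumptions of these sketches):

```python
import random
import numpy as np
import matplotlib.pyplot as plt

ns = [1000, 2000, 4000, 8000]
errs = np.zeros((10, len(ns)))                 # one row per repetition
for trial in range(10):
    for k, n in enumerate(ns):
        sel = random.sample(range(60000), n)
        X = ocr['data'][sel].astype('float')   # cast to avoid integer overflow
        Y = ocr['labels'][sel]
        preds = nn_classify(X, Y, ocr['testdata'].astype('float'))
        errs[trial, k] = np.mean(preds.ravel() != ocr['testlabels'].ravel())

plt.errorbar(ns, errs.mean(axis=0), yerr=errs.std(axis=0), capsize=3)
plt.xlabel('number of training points n')
plt.ylabel('test error rate')
plt.show()
```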

What to submit in your write-up:

(a) A brief description of how you implemented the nearest neighbor classifier (especially the "arg min"). Note that it is fine to use numpy.argmin, but you should explain precisely how you use it in your code to efficiently implement the nearest neighbor classification of new (test) points.

(b) Learning curve plot. (Please include the plot directly in the PDF write-up.)

(c) What would the learning curve look like if, instead of test error rate, you computed training error rate?

Please submit your source code on Courseworks.

⁴The data are stored as unsigned integers, so you need to convert them to floating point or else you may get integer overflows.


Problem 2 (35 points)

In this problem, you will practice deriving formulae for maximum likelihood estimators.

Consider a statistical model for iid random variables $X_1, \dots, X_n$, parameterized by $\theta \in \Theta$. The distribution $P_\theta$ of $X_1$ is specified in each part below.

In each of the first two cases, derive a simple formula for the MLE for $\theta$ (given data $(X_1, \dots, X_n) = (x_1, \dots, x_n)$), and briefly justify each step of the derivation.

(a) $X_1 \sim P_\theta$ has probability density function $f_\theta$ satisfying

$$f_\theta(x) \propto x^2 e^{-x/\theta} \quad \text{for all } x > 0,$$

and $f_\theta(x) = 0$ for all $x \le 0$.⁵ The parameter space is the positive reals $\Theta = \{\theta \in \mathbb{R} : \theta > 0\}$.
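
As a side illustration of footnote 5 (not itself part of the requested derivation), the proportionality constant in part (a) can be made explicit, since the density must integrate to one:

$$\int_0^\infty x^2 e^{-x/\theta}\,dx = 2\theta^3 \quad\Longrightarrow\quad f_\theta(x) = \frac{x^2 e^{-x/\theta}}{2\theta^3} \quad \text{for all } x > 0.$$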

(b) $X_1 \sim P_\theta$ has probability density function $f_\theta$ satisfying

$$f_\theta(x) \propto 1 \quad \text{for all } 0 \le x \le \theta,$$

and $f_\theta(x) = 0$ for all $x < 0$ and all $x > \theta$. The parameter space is the positive reals $\Theta = \{\theta \in \mathbb{R} : \theta > 0\}$.

In each of the next two cases, specific values of the data are given as follows:

$$(X_1, \dots, X_n) = (0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 0, 2, 2, 0, 3, 0, 0, 0, 2).$$

Derive the actual MLE value of $\theta$ given this data (as opposed to an abstract formula), and briefly justify each step of the derivation.

(c) $X_1 \sim P_\theta$ has probability density function $f_\theta$ satisfying

$$f_\theta(x) \propto e^{-(x-\theta)^2/(2\theta^2)} \quad \text{for all } x \in \mathbb{R}.$$

The parameter space is $\Theta = \{\theta \in \mathbb{R} : \theta \neq 0\}$ (i.e., all real numbers $\theta$ such that $\theta \neq 0$).

(d) $X_1 \sim P_\theta$ has probability mass function $p_\theta$ satisfying

$$p_\theta(0) = \tfrac{1}{2}, \quad p_\theta(1) = e^\theta, \quad p_\theta(2) = 2e^\theta, \quad p_\theta(3) = \tfrac{1}{2} - 3e^\theta.$$

The parameter space is $\Theta = \{\theta \in \mathbb{R} : \theta < -\ln(6)\}$.
⁵The symbol "$\propto$" means "proportional to". Remember, probability density functions should integrate to one.

Problem 3 (35 points)

In this problem, you will implement a learning algorithm based on generative models for classification, apply the learning algorithm to a simple data set, and quantitatively and qualitatively study the learned classifiers.

Download the "20 Newsgroups data set" news.mat from Courseworks. The training feature vectors/labels and test feature vectors/labels are stored as data/labels and testdata/testlabels. Each data point corresponds to a message posted to one of 20 different newsgroups (i.e., message boards). The representation of a message is a (sparse) binary vector in $\mathcal{X} := \{0, 1\}^d$ (for $d := 61188$) that indicates the words that are present in the message. If the $j$-th entry in the vector is 1, it means the message contains the word that is given on the $j$-th line of the text file news.vocab. The class labels are $\mathcal{Y} := \{1, 2, \dots, 20\}$, where the mapping from classes to newsgroups is in the file news.groups (which we won't actually need).
In this problem, you'll develop a classifier based on a Naive Bayes generative model. Here, we use class conditional distributions with probability mass functions of the form

$$p_\mu(x) = \prod_{j=1}^d \mu_j^{x_j} (1 - \mu_j)^{1 - x_j} \quad \text{for } x = (x_1, x_2, \dots, x_d) \in \mathcal{X}.$$

The parameter vector $\mu = (\mu_1, \mu_2, \dots, \mu_d)$ must come from the parameter space $[0, 1]^d$. Since there are 20 classes, the generative model is actually parameterized by 20 such vectors, $\mu_y = (\mu_{y,1}, \mu_{y,2}, \dots, \mu_{y,d})$ for each $y \in \mathcal{Y}$, as well as the class prior parameters $\pi_y$ for each $y \in \mathcal{Y}$. The class prior parameters, of course, must satisfy $\pi_y \in [0, 1]$ for each $y \in \mathcal{Y}$ and $\sum_{y \in \mathcal{Y}} \pi_y = 1$.
(a) Give the formula for the MLE of the parameter $\mu_{y,j}$ based on training data $\{(x_i, y_i)\}_{i=1}^n$. (Remember, each unlabeled point is a vector: $x_i = (x_{i,1}, x_{i,2}, \dots, x_{i,d}) \in \{0, 1\}^d$.)
(b) MLE is not a good estimator for the class conditional parameters if the estimate turns out to be zero or one. An alternative is the following estimator, based on a technique called Laplace smoothing:

$$\hat{\mu}_{y,j} := \frac{1 + \sum_{i=1}^n \mathbf{1}\{y_i = y\}\, x_{i,j}}{2 + \sum_{i=1}^n \mathbf{1}\{y_i = y\}} \in (0, 1).$$

Write code for training and testing a classifier based on the Naive Bayes generative model described above. Use Laplace smoothing to estimate the class conditional distribution parameters, and MLE for the class prior parameters. You should not use or look at any existing implementation (e.g., those that may be provided as library functions). Using your code, train and test a classifier with the data from news.mat.

What to submit: training and test error rates.
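
For orientation, a minimal sketch of one way to implement the training and prediction steps (the names train_nb and predict_nb are illustrative assumptions; the sketch also assumes dense 0/1 numpy arrays, so if news.mat stores the data as scipy sparse matrices, convert them first or adapt the sums):

```python
import numpy as np

def train_nb(X, Y, num_classes=20):
    """Laplace-smoothed class-conditional parameters and MLE class priors."""
    n, d = X.shape
    mu = np.zeros((num_classes, d))
    pi = np.zeros(num_classes)
    for y in range(1, num_classes + 1):
        mask = (Y.ravel() == y)
        n_y = mask.sum()
        mu[y - 1] = (1 + X[mask].sum(axis=0)) / (2 + n_y)  # Laplace smoothing
        pi[y - 1] = n_y / n                                # MLE of class prior
    return mu, pi

def predict_nb(mu, pi, T):
    """argmax_y of log pi_y + sum_j [x_j log mu_yj + (1 - x_j) log(1 - mu_yj)]."""
    scores = np.log(pi) + T @ np.log(mu).T + (1 - T) @ np.log(1 - mu).T
    return np.argmax(scores, axis=1) + 1   # class labels are 1-based
```
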
(c) Consider the binary classification problem where newsgroups {1, 16, 20} comprise the "negative class" (class 0), and newsgroups {17, 18, 19} comprise the "positive class" (class 1). Newsgroups {1, 16, 20} are "religious" topics, and newsgroups {17, 18, 19} are "political" topics. Modify the data in news.mat to create the training and test data sets for this problem. Using these data and your code from part (b), train and test a Naive Bayes classifier. Report the training and test error rates. Save the learned classifier for part (d).

What to submit: training and test error rates.
(d) The classifier you learn is ultimately an affine classifier, which means it has the following form:

$$x \mapsto \begin{cases} 0 & \text{if } \alpha_0 + \sum_{j=1}^d \alpha_j x_j \le 0, \\ 1 & \text{if } \alpha_0 + \sum_{j=1}^d \alpha_j x_j > 0, \end{cases}$$

for some real numbers $\alpha_0, \alpha_1, \dots, \alpha_d$. Determine the values of these $\alpha_j$'s for your learned classifier from part (c). Then, report the vocabulary words whose indices $j \in \{1, 2, \dots, d\}$ correspond to the 20 largest (i.e., most positive) $\alpha_j$ values, and also the vocabulary words whose indices $j \in \{1, 2, \dots, d\}$ correspond to the 20 smallest (i.e., most negative) $\alpha_j$ values. Don't report the indices themselves, but rather the actual vocabulary words (from news.vocab).

What to submit: two ordered lists (clearly labeled) of 20 words each.
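
To see where the affine form comes from, a sketch of the standard log-odds calculation for this Bernoulli Naive Bayes model (offered as a reasoning aid, not as the required derivation):

$$\log \frac{\pi_1\, p_{\mu_1}(x)}{\pi_0\, p_{\mu_0}(x)} = \underbrace{\log \frac{\pi_1}{\pi_0} + \sum_{j=1}^d \log \frac{1 - \mu_{1,j}}{1 - \mu_{0,j}}}_{\alpha_0} + \sum_{j=1}^d \underbrace{\left( \log \frac{\mu_{1,j}}{1 - \mu_{1,j}} - \log \frac{\mu_{0,j}}{1 - \mu_{0,j}} \right)}_{\alpha_j} x_j,$$

so the classifier predicts class 1 exactly when this log-odds is positive.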

Please also submit your source code on Courseworks.
