## Description

Problem 1 (Nearest neighbors; 20 points). Download the OCR image data set `ocr.mat` from Courseworks, and load it into MATLAB:

    load('ocr.mat')

Or Python:

    from scipy.io import loadmat
    ocr = loadmat('ocr.mat')

The unlabeled training data (i.e., feature vectors) are contained in a matrix called `data` (one point per row), and the corresponding labels are in a vector called `labels`. The test feature vectors and labels are in, respectively, `testdata` and `testlabels`. In MATLAB, you can view an image (say, the first one) in the training data with the following commands:

    imagesc(reshape(data(1,:),28,28)');

If the colors are too jarring for you, try the following:

    colormap(1 - gray);

In Python, to view the first image, try the following (ideally, from IPython or Jupyter Notebook):

    import matplotlib.pyplot as plt
    from matplotlib import cm
    plt.imshow(ocr['data'][0].reshape((28,28)), cmap=cm.gray_r)
    plt.show()

Write a function that implements the 1-nearest neighbor classifier with Euclidean distance. Your function should take as input a matrix of training feature vectors `X` and a vector of the corresponding labels `Y`, as well as a matrix of test feature vectors `test`. The output should be a vector of predicted labels `preds` for all the test points. Naturally, you should not use (or look at the source code for) any library functions for computing Euclidean distances, nearest neighbor queries, and so on. If in doubt about what is okay to use, just ask. Note that for efficiency, you should use vector operations (rather than, say, a bunch of for-loops).[^1]
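A minimal sketch of the kind of vectorized implementation intended, in NumPy (the function name and interface here are illustrative, not prescribed):

```python
import numpy as np

def nn_classify(X, Y, test):
    """Predict a label for each row of `test` using 1-NN with Euclidean distance.

    All pairwise squared distances are computed at once via the expansion
    ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, avoiding per-point loops.
    """
    X = np.asarray(X, dtype=float)
    test = np.asarray(test, dtype=float)
    # Shape (num_test, num_train) matrix of squared distances.
    sq_dists = (
        (test ** 2).sum(axis=1)[:, None]
        - 2.0 * (test @ X.T)
        + (X ** 2).sum(axis=1)[None, :]
    )
    nearest = sq_dists.argmin(axis=1)  # index of the closest training point
    return np.asarray(Y).ravel()[nearest]
```

Since squared distance is monotone in distance, no square root is needed for the argmin.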

Instead of using your 1-NN code directly with `data` and `labels` as the training data, do the following. For each value n ∈ {1000, 2000, 4000, 8000}:

• Draw n random points from `data`, together with their corresponding labels. In MATLAB, use `sel = randsample(60000,n)` to pick the n random indices, and `data(sel,:)` and `labels(sel)` to select the examples; in Python 3, use `sel = random.sample(range(60000), n)` (after `import random`), `ocr['data'][sel]`, and `ocr['labels'][sel]`.

• Use these n points as the training data and `testdata` as the test points, and compute the test error rate of the 1-NN classifier.

A plot of the error rate (on the y-axis) as a function of n (on the x-axis) is called a learning curve. We get an estimate of this curve by using the test error rate in place of the (true) error rate.

Repeat the (random) process described above ten times, independently. Produce an estimate of the learning curve plot using the average of these test error rates (that is, averaging over ten repetitions). Add error bars to your plot that extend to one standard deviation above and below the means. Ensure the plot axes are properly labeled.
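One way to organize the repeated draws and the averaging (a sketch; `predict` stands for whatever 1-NN function you wrote, and all names here are illustrative):

```python
import numpy as np

def learning_curve(data, labels, testdata, testlabels, predict,
                   sizes=(1000, 2000, 4000, 8000), trials=10, seed=0):
    """Return (means, stds) of the test error rate over `trials` independent
    random draws for each training-set size in `sizes`."""
    rng = np.random.default_rng(seed)
    labels = np.ravel(labels)          # tolerate column-vector labels
    testlabels = np.ravel(testlabels)
    means, stds = [], []
    for n in sizes:
        errors = []
        for _ in range(trials):
            # Draw n distinct training indices uniformly at random.
            sel = rng.choice(len(data), size=n, replace=False)
            preds = np.ravel(predict(data[sel], labels[sel], testdata))
            errors.append(np.mean(preds != testlabels))
        means.append(np.mean(errors))
        stds.append(np.std(errors))
    return np.array(means), np.array(stds)
```

The plot itself can then be produced with `plt.errorbar(sizes, means, yerr=stds)`, followed by `plt.xlabel` and `plt.ylabel` to label the axes.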

What to submit: (1) learning curve plot, (2) source code (in a separate file).

[^1]: http://www.mathworks.com/help/matlab/matlab_prog/vectorization.html


Problem 2 (Prototype selection; 20 points). Prototype selection is a method for speeding up nearest neighbor search that replaces the training data with a smaller subset of prototypes (which could be data points themselves). For simplicity, assume that 1-NN is used with Euclidean distance. So a prototype selection method simply takes as input:

• the training data (x₁, y₁), …, (xₙ, yₙ) from ℝᵈ × {0, 1, …, 9} (say), and

• a positive integer m;

it should return m labeled pairs (x̃₁, ỹ₁), …, (x̃ₘ, ỹₘ), each from ℝᵈ × {0, 1, …, 9}.

Design a method for choosing prototypes, where the goal is for the 1-NN classifier based on the prototypes to have good test accuracy. Implement your algorithm; use it to select prototypes for the OCR data set, and evaluate the test error rate of the 1-NN classifier based on the selected prototypes. You should use the whole training data set as input (i.e., all n = 60000 data points), but vary the number of selected prototypes m in the set {1000, 2000, 4000, 8000}. If your procedure is randomized, repeat it at least ten times (for each m) to properly assess its performance.
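As a reference point only, the simplest conceivable selector, uniform random subsampling, already fits the required interface; any method you design should aim to beat it (the function name here is illustrative):

```python
import numpy as np

def select_prototypes(X, Y, m, seed=None):
    """Baseline prototype selector: return m training pairs chosen
    uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    sel = rng.choice(len(X), size=m, replace=False)
    return X[sel], np.ravel(Y)[sel]
```

A designed method might instead, for example, cluster each class and keep representative points, or greedily keep points that fix classification mistakes; this baseline merely pins down the input/output contract.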

What to submit:

1. A brief description of your method (in words).

2. Concise and unambiguous pseudocode for your algorithm.

3. A table of the test error rates for the different values of m you try. (Report averages and standard deviations if your procedure is randomized.)

4. Source code (in a separate file).

Problem 3 (Probability; 10 points). Suppose you have an urn containing 100 colored balls. Each ball is painted with one of five possible colors from the color set C := {red, orange, yellow, green, blue}. For each c ∈ C, let n_c denote the number of balls in the urn with color c.

(a) Suppose you pick two balls uniformly at random with replacement from the urn. What is the probability that they have different colors? Briefly explain your answer.
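One way to sanity-check whatever answer you derive for (a) is a Monte Carlo simulation of the two with-replacement draws (the counts below are made up for illustration):

```python
import random

def estimate_prob_different(counts, trials=200_000, seed=0):
    """Estimate the probability that two balls drawn uniformly at random,
    with replacement, have different colors."""
    # Expand counts into an explicit list of ball colors, e.g. [0, 0, ..., 1, ...].
    balls = [color for color, n in enumerate(counts) for _ in range(n)]
    rng = random.Random(seed)
    hits = sum(rng.choice(balls) != rng.choice(balls) for _ in range(trials))
    return hits / trials
```

With counts summing to 100 this runs in well under a second and should agree with the exact answer to about two decimal places.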

(b) If you could paint each ball in the urn (with any color from C), what would you do to maximize the probability from part (a)? In other words, for each color in C, how many balls in the urn would you paint with that color? Briefly explain your answer.
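Intuition for (b) can be checked by brute force on a scaled-down urn (10 balls and 5 colors here, to keep the search tiny); `p_diff` below is the exact with-replacement probability whose derivation part (a) asks for:

```python
from itertools import product

def p_diff(counts):
    """Exact probability that two with-replacement draws differ in color."""
    total = sum(counts)
    return 1.0 - sum(n * n for n in counts) / (total * total)

def best_coloring(total=10, colors=5):
    """Enumerate every way to split `total` balls among `colors` colors and
    return the split maximizing p_diff, together with that probability."""
    best = max(
        (c for c in product(range(total + 1), repeat=colors) if sum(c) == total),
        key=p_diff,
    )
    return best, p_diff(best)
```

This exhaustive search is only feasible for a small urn; for the actual 100-ball urn you should argue the maximizer directly.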
