## Description

## 1 Class-Conditional Gaussians (4%)

In this question, you will derive the maximum likelihood estimates for class-conditional Gaussians with independent features (diagonal covariance matrices), i.e. Gaussian Naive Bayes, with shared variances. Start with the following generative model for a discrete class label $y \in \{1, 2, \dots, K\}$ and a real-valued vector of $d$ features $\mathbf{x} = (x_1, x_2, \dots, x_d)$:

$$p(y = k) = \alpha_k \tag{1}$$

$$p(\mathbf{x} \mid y = k, \boldsymbol{\mu}, \boldsymbol{\sigma}) = \prod_{i=1}^{d} \left(2\pi\sigma_i^2\right)^{-1/2} \exp\left\{-\sum_{i=1}^{d} \frac{1}{2\sigma_i^2}\,(x_i - \mu_{ki})^2\right\} \tag{2}$$

where $\alpha_k$ is the prior on class $k$, $\sigma_i^2$ are the shared variances for each feature (the same in all classes), and $\mu_{ki}$ is the mean of feature $i$ conditioned on class $k$. We write $\alpha$ for the vector with elements $\alpha_k$, and similarly $\sigma$ for the vector of variances. The matrix of class means is written $\mu$, where the $k$th row of $\mu$ is the mean for class $k$.
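As a concrete anchor for this model, here is a minimal sketch (assuming the hypothetical names `log_density`, `mu_k`, and `sigma2`, which are not part of the assignment's starter code) that evaluates the log of the class-conditional density in Equation (2) for one example:

```python
import numpy as np

def log_density(x, mu_k, sigma2):
    """Log of Eq. (2): diagonal Gaussian with variances sigma2 shared across classes."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu_k) ** 2 / (2 * sigma2))

# Tiny illustrative example with two features.
x = np.array([0.5, 1.0])
mu_k = np.array([0.0, 1.0])
sigma2 = np.array([1.0, 2.0])
print(log_density(x, mu_k, sigma2))
```

Working in log space like this is also what you will want later in the assignment, since products of many per-feature densities underflow quickly.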

1. Use Bayes' rule to derive an expression for $p(y = k \mid \mathbf{x}, \mu, \sigma)$. [Hint: Use the law of total probability to derive an expression for $p(\mathbf{x} \mid \mu, \sigma)$.]

2. Write down an expression for the negative log-likelihood (NLL)

$$\ell(\theta; D) = -\log p(y^{(1)}, \mathbf{x}^{(1)}, y^{(2)}, \mathbf{x}^{(2)}, \cdots, y^{(N)}, \mathbf{x}^{(N)} \mid \theta) \tag{3}$$

of a particular dataset $D = \{(y^{(1)}, \mathbf{x}^{(1)}), (y^{(2)}, \mathbf{x}^{(2)}), \cdots, (y^{(N)}, \mathbf{x}^{(N)})\}$ with parameters $\theta = \{\alpha, \mu, \sigma\}$. (Assume that the data are i.i.d.)

3. Take partial derivatives of the likelihood with respect to each of the parameters $\mu_{ki}$ and with respect to the shared variances $\sigma_i^2$.

4. Find the maximum likelihood estimates for $\mu$ and $\sigma$.

5*. Extra work for interested students: If you are familiar with Lagrange multipliers, show that the MLE for $\alpha_k$ is indeed given by Equation (4):

$$\alpha_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y^{(i)} = k] \tag{4}$$
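The estimators from parts 4 and 5* can be sanity-checked numerically. Below is a hedged sketch on a tiny made-up one-feature dataset (all variable names are mine, not the assignment's): `alpha` follows Equation (4), `mu` is the per-class sample mean, and `sigma2` pools squared residuals across all classes, matching the shared-variance structure of the model. Verify these forms against your own derivation before relying on them.

```python
import numpy as np

# Toy dataset: 5 points, 1 feature, 2 classes.
y = np.array([0, 0, 1, 1, 1])
X = np.array([[1.0], [3.0], [4.0], [6.0], [5.0]])
K = 2

alpha = np.array([np.mean(y == k) for k in range(K)])       # Eq. (4): class frequencies
mu = np.array([X[y == k].mean(axis=0) for k in range(K)])   # per-class feature means
sigma2 = np.mean((X - mu[y]) ** 2, axis=0)                  # residuals pooled over classes

print(alpha)   # class priors
print(mu)      # class means
print(sigma2)  # shared variances
```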

## 2 Handwritten Digit Classification (11%)

For this question you will build classifiers to label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale and the pixel values are between 0 and 1. The labels $y$ are $0, 1, 2, \dots, 9$, corresponding to which digit was written in the image. There are 700 training cases and 400 test cases for each digit; they can be found in a2digits.zip.

Starter code written in Python is provided to help you load the data. A skeleton is also provided for each question that you should use to format your solution. You are welcome to use any other programming language, as long as your code is functional and modular.

Please note that if you are asked to report/compute quantities, these should be clearly displayed in your written report. It is not sufficient to simply print them as an output of your code. The same applies to plots and figures.

0. Load the data and plot the means for each of the digit classes in the training data (include these in your report). Given that each image is a vector of size 64, the mean will be a vector of size 64, which needs to be reshaped as an 8 × 8 2D array to be rendered as an image. Plot all 10 means side by side using the same scale.
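The reshape-and-tile step above can be sketched as follows. The `train_X`/`train_y` names are placeholders for whatever the provided starter code loads; the random data here is a stand-in purely so the snippet runs on its own:

```python
import numpy as np

# Stand-in for the real data loaded by the starter code.
rng = np.random.default_rng(0)
train_X = rng.random((100, 64))
train_y = rng.integers(0, 10, size=100)

# One 64-dim mean per class, reshaped to 8x8 and tiled horizontally so all
# ten means share one image and one scale, e.g. via plt.imshow(panel, cmap="gray").
means = np.array([train_X[train_y == k].mean(axis=0) for k in range(10)])
panel = np.concatenate(means.reshape(10, 8, 8), axis=1)  # shape (8, 80)
```

Tiling into a single array is one simple way to guarantee every class is rendered on the same grayscale range; per-class subplots with a fixed `vmin`/`vmax` would work equally well.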

### 2.1 K-NN Classifier

1. Build a simple K nearest neighbor classifier using Euclidean distance on the raw pixel data.

(a) For K = 1, report the train and test classification accuracy.

(b) For K = 15, report the train and test classification accuracy.

2. For K > 1, K-NN might encounter ties that need to be broken in order to make a decision. Choose any (reasonable) method you prefer and explain it briefly in your report.

3. Use 10-fold cross validation to find the optimal K in the 1–15 range. You may use the KFold implementation in sklearn or your existing code from Assignment 1. Report this value of K along with the train classification accuracy, the average accuracy across folds, and the test accuracy.
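A minimal K-NN sketch with one reasonable tie-breaking rule (on a tied vote, fall back to the tied class containing the closest neighbour). The names `knn_predict`, `train_X`, and `train_y` are illustrative, not the skeleton's, and this is only one of several acceptable tie-breaking schemes:

```python
import numpy as np

def knn_predict(train_X, train_y, x, k):
    dist = np.sqrt(((train_X - x) ** 2).sum(axis=1))   # Euclidean distance
    order = np.argsort(dist, kind="stable")[:k]        # k nearest, closest first
    labels = train_y[order]
    counts = np.bincount(labels)
    winners = np.flatnonzero(counts == counts.max())   # classes tied for max votes
    if len(winners) == 1:
        return winners[0]
    for lab in labels:                                 # tie: closest tied class wins
        if lab in winners:
            return lab

# Toy 1-D example: two points of class 0, two of class 1.
train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([0, 0, 1, 1])
print(knn_predict(train_X, train_y, np.array([0.1]), 3))  # -> 0 (votes 2 vs 1)
```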

### 2.2 Conditional Gaussian Classifier Training

Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by

$$p(\mathbf{x} \mid y = k, \mu, \Sigma_k) = (2\pi)^{-d/2}\,|\Sigma_k|^{-1/2} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu_k)^{T}\Sigma_k^{-1}(\mathbf{x} - \mu_k)\right) \tag{5}$$

You should take $p(y = k) = \frac{1}{10}$. You will compute parameters $\mu_{kj}$ and $\Sigma_k$ for $k \in (0, \dots, 9)$, $j \in (1, \dots, 64)$. You should implement the covariance computation yourself (i.e. without the aid of `np.cov`). [Hint: To ensure numerical stability you may have to add a small positive value to the diagonal of each covariance matrix. For this assignment you can add $0.01I$ to each matrix.]
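One way to structure the fit without `np.cov`, including the $0.01I$ stabiliser from the hint, is sketched below; the function and variable names are mine, and the random data is a stand-in for the real training set:

```python
import numpy as np

def fit_gaussians(train_X, train_y, n_classes=10):
    """Per-class mean and (stabilised) maximum-likelihood covariance."""
    d = train_X.shape[1]
    mus, covs = [], []
    for k in range(n_classes):
        Xk = train_X[train_y == k]
        mu_k = Xk.mean(axis=0)
        centered = Xk - mu_k
        # MLE covariance (divide by N_k), plus 0.01*I for numerical stability.
        cov_k = centered.T @ centered / Xk.shape[0] + 0.01 * np.eye(d)
        mus.append(mu_k)
        covs.append(cov_k)
    return np.array(mus), np.array(covs)

# Stand-in data: 50 points, 4 features, 2 classes.
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = rng.integers(0, 2, size=50)
mu, Sigma = fit_gaussians(X, y, n_classes=2)
```

Note the division by the per-class count $N_k$ (the MLE), not $N_k - 1$; with the diagonal stabiliser added, each $\Sigma_k$ is symmetric positive definite and safe to invert or Cholesky-factor.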

1. Plot an 8 by 8 image of the log of the diagonal elements of each covariance matrix $\Sigma_k$. Plot all ten classes side by side using the same grayscale.

2. Using the parameters you fit on the training set and Bayes' rule, compute the average conditional log-likelihood, i.e. $\frac{1}{N}\sum_{i=1}^{N} \log p(y^{(i)} \mid \mathbf{x}^{(i)}, \theta)$, on both the train and test set and report it.

3. Select the most likely posterior class for each training and test data point as your prediction, and report your accuracy on the train and test set.

4*. Extra work for interested students: Compute the leading eigenvectors (largest eigenvalue) for each class covariance matrix (you can use np.linalg.eig) and plot them side by side as 8 by 8 images.
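The average conditional log-likelihood in question 2 (and again in the Naive Bayes part) is best computed in log space with a log-sum-exp over classes, since the per-class joints are tiny probabilities. A hedged sketch, assuming a hypothetical N × K matrix `log_joint` of values $\log p(\mathbf{x}^{(i)}, y = k)$ that your own classifier would produce:

```python
import numpy as np

def avg_cond_loglik(log_joint, y):
    """Mean of log p(y^(i) | x^(i)) given an N x K matrix of log joints."""
    # Stable log of the evidence p(x) = sum_k p(x, y=k): subtract the row max
    # before exponentiating, then add it back.
    m = log_joint.max(axis=1, keepdims=True)
    log_evidence = m.ravel() + np.log(np.exp(log_joint - m).sum(axis=1))
    log_post = log_joint[np.arange(len(y)), y] - log_evidence
    return log_post.mean()

# Made-up example where each row of joints already sums to 1, so the
# posteriors equal the joints: mean of log 0.6 and log 0.8.
log_joint = np.log(np.array([[0.6, 0.4], [0.2, 0.8]]))
y = np.array([0, 1])
print(avg_cond_loglik(log_joint, y))
```

Predictions for question 3 then fall out of the same matrix as `log_joint.argmax(axis=1)`, since the evidence term is constant across classes.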

### 2.3 Naive Bayes Classifier Training

1. Convert the real-valued features $\mathbf{x}$ into binary features $\mathbf{b}$ using 0.5 as a threshold: $b_j = 1$ if $x_j > 0.5$, otherwise $b_j = 0$.

2. Using these new binary features $\mathbf{b}$ and the class labels, train a Bernoulli Naive Bayes classifier using MAP estimation with prior $\mathrm{Beta}(\alpha, \beta)$ with $\alpha = \beta = 2$. In particular, fit the model below on the training set.

$$p(y = k) = \frac{1}{10} \tag{6}$$

$$p(b_j = 1 \mid y = k) = \eta_{kj} \tag{7}$$

$$p(\mathbf{b} \mid y = k, \eta) = \prod_{j=1}^{d} (\eta_{kj})^{b_j} (1 - \eta_{kj})^{(1 - b_j)} \tag{8}$$

$$p(\eta_{kj}) = \mathrm{Beta}(2, 2) \tag{9}$$

You should compute parameters $\eta_{kj}$ for $k \in (0, \dots, 9)$, $j \in (1, \dots, 64)$.

Regularization: Instead of the prior, you could add two training cases to your data set for each class: one which has every pixel off and one which has every pixel on. Make sure you understand why this is equivalent to using a prior. You may use either scheme in your own code.

3. Plot each of your $\eta_k$ vectors as an 8 by 8 grayscale image. These should be presented side by side and with the same scale.

4. Given your parameters, sample one new data point for each of the 10 digit classes. Plot these new data points as 8 by 8 grayscale images side by side.

5. Using the parameters you fit on the training set and Bayes' rule, compute the average conditional log-likelihood, i.e. $\frac{1}{N}\sum_{i=1}^{N} \log p(y^{(i)} \mid \mathbf{x}^{(i)}, \theta)$, on both the train and test set and report it.

6. Select the most likely posterior class for each training and test data point, and report your accuracy on the train and test set.
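The MAP fit in question 2 and the sampling in question 4 can be sketched together. Under a $\mathrm{Beta}(2,2)$ prior the posterior mode is $\eta_{kj} = (c_{kj} + 1)/(N_k + 2)$, where $c_{kj}$ counts how often $b_j = 1$ in class $k$; this is exactly the "add one all-off and one all-on case" scheme described above. The names `fit_eta`, `B`, and `y` are illustrative, not the skeleton's, and the tiny data is made up:

```python
import numpy as np

def fit_eta(B, y, n_classes=10):
    """MAP estimate of eta under a Beta(2,2) prior: (count + 1) / (N_k + 2)."""
    etas = []
    for k in range(n_classes):
        Bk = B[y == k]
        etas.append((Bk.sum(axis=0) + 1.0) / (Bk.shape[0] + 2.0))
    return np.array(etas)

# Toy binary data: 3 points, 2 features, 2 classes.
B = np.array([[1, 0], [1, 1], [0, 0]])
y = np.array([0, 0, 1])
eta = fit_eta(B, y, n_classes=2)

# Question 4: each pixel of a new class-k image is an independent
# Bernoulli(eta_kj) draw; one row of samples per class.
rng = np.random.default_rng(0)
samples = (rng.random(eta.shape) < eta).astype(int)
```

Check the $(c_{kj}+1)/(N_k+2)$ form against your own working with the Beta posterior before using it; unlike the MLE, it never produces $\eta_{kj}$ of exactly 0 or 1, which keeps every log term in Equation (8) finite.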

### 2.4 Model Comparison

Briefly (in a few sentences) summarize the performance of each model. Which performed best? Which performed worst? Did this match your expectations?

## Submission

Assignments must be submitted by 10:00 pm on November 12th; a late penalty of 10% per day will be assessed thereafter (up to 3 days, after which submission is blocked). We highly recommend using Python 3.6 as included with Anaconda 3.4 to solve the programming questions. However, you are welcome to use any programming language you like, provided that your code solves the problem and is functional and modular.

### What to submit

- All work is to be submitted via Markus.

- All mathematical work must be clearly marked with the question it belongs to; we highly recommend using a math editor such as MathType, Word Equations, or LaTeX. However, if you prefer to use paper and pen(cil), you are welcome to do so as long as your handwriting is readable. Unreadable work will result in losing all marks for the question.

- Your written report should be submitted in PDF format. Note that Markus has file-size limits and it is your responsibility to ensure that your work is uploaded on time.

- Your code must be clearly structured with an obvious entry point.

- All graphs and plots must be clearly marked with the question they belong to. Use readable grids whenever applicable.