## Description

1. [3pts] Gaussian Discriminant Analysis. For this question you will build classifiers to label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale and the pixel values are between 0 and 1. The labels $y$ are $0, 1, 2, \ldots, 9$, corresponding to which character was written in the image. There are 700 training cases and 400 test cases for each digit; they can be found in a2digits.zip.

   Starter code is provided to help you load the data (data.py). A skeleton (q1.py) is also provided, which you should use to structure your code.

   Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by

   $$p(x \mid y = k, \mu, \Sigma) = (2\pi)^{-d/2} \, |\Sigma_k|^{-1/2} \exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) \tag{1}$$

   You should take $p(y = k) = \frac{1}{10}$. You will compute parameters $\mu_{kj}$ and $\Sigma_k$ for $k \in \{0, \ldots, 9\}$, $j \in \{1, \ldots, 64\}$. You should implement the covariance computation yourself (i.e. without the aid of `np.cov`). To ensure numerical stability you will have to add a small multiple of the identity to each covariance matrix. For this assignment you should add $0.01 I$ to each $\Sigma_k$ after computing its maximum likelihood estimate.
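   A minimal sketch of this fitting step (the function name and array layout below are illustrative assumptions, not part of the provided starter code):

   ```python
   import numpy as np

   def fit_gaussians(X, y, n_classes=10, d=64, reg=0.01):
       """Fit per-class means and full covariances by maximum likelihood.

       X: (N, d) array of images; y: (N,) integer labels in {0, ..., 9}.
       Covariances are computed by hand (no np.cov) and regularized with reg * I.
       """
       means = np.zeros((n_classes, d))
       covs = np.zeros((n_classes, d, d))
       for k in range(n_classes):
           Xk = X[y == k]                       # all training images of class k
           means[k] = Xk.mean(axis=0)           # MLE of the class mean
           centered = Xk - means[k]
           covs[k] = centered.T @ centered / Xk.shape[0]  # MLE covariance
           covs[k] += reg * np.eye(d)           # add 0.01 I for numerical stability
       return means, covs
   ```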

   (a) [1pt] Using the parameters you fit on the training set and Bayes' rule, compute the average conditional log-likelihood, i.e. $\frac{1}{N} \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}, \theta)$, on both the train and test set and report it.
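   One hedged way to sketch this computation (the function name and signatures are illustrative; working in log space with a log-sum-exp avoids underflow in $p(x)$):

   ```python
   import numpy as np

   def avg_cond_log_likelihood(X, y, means, covs, prior=0.1):
       """Average log p(y | x) over a dataset, via Bayes' rule.

       means: (K, d), covs: (K, d, d) as fit on the training set.
       All class posteriors are computed in log space for stability.
       """
       N, d = X.shape
       K = means.shape[0]
       log_joint = np.zeros((N, K))
       for k in range(K):
           diff = X - means[k]
           _, logdet = np.linalg.slogdet(covs[k])
           sol = np.linalg.solve(covs[k], diff.T).T     # Sigma_k^{-1} (x - mu_k)
           quad = np.sum(diff * sol, axis=1)
           log_px = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
           log_joint[:, k] = np.log(prior) + log_px     # log p(x, y = k)
       # log p(y | x) = log p(x, y) - log p(x), with log p(x) via log-sum-exp
       m = log_joint.max(axis=1, keepdims=True)
       log_px_total = m.squeeze(1) + np.log(np.exp(log_joint - m).sum(axis=1))
       log_cond = log_joint[np.arange(N), y] - log_px_total
       return log_cond.mean()
   ```

   The same `log_joint` array also gives part (b)'s predictions via an argmax over classes.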

   (b) [1pt] Select the most likely posterior class for each training and test data point as your prediction, and report your accuracy on the train and test set.

[1] https://markus.teach.cs.toronto.edu/csc411-2019-01
[2] http://www.cs.toronto.edu/~mren/teach/csc411_19s/syllabus.pdf

CSC411 Homework 5

   (c) [1pt] Compute the leading eigenvectors (largest eigenvalue) for each class covariance matrix (you can use np.linalg.eig) and plot them side by side as 8 by 8 images.
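   A sketch of extracting the leading eigenvectors (the handout permits `np.linalg.eig`; `np.linalg.eigh` below is the variant for symmetric matrices such as covariances, and either works here):

   ```python
   import numpy as np

   def leading_eigenvectors(covs):
       """Leading eigenvector (largest eigenvalue) of each class covariance.

       covs: (K, d, d). Returns (K, d); each row can be reshaped to
       8x8 for plotting, e.g. with matplotlib's imshow.
       """
       leads = []
       for cov in covs:
           vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
           leads.append(vecs[:, -1])          # column for the largest eigenvalue
       return np.array(leads)
   ```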

   Report your answers to the above questions, and submit your completed Python code for q1.py.

2. [2pts] Categorical Distribution. Let's consider fitting the categorical distribution, which is a discrete distribution over $K$ outcomes, which we'll number 1 through $K$. The probability of each category is explicitly represented with parameter $\theta_k$. For it to be a valid probability distribution, we clearly need $\theta_k \geq 0$ and $\sum_k \theta_k = 1$. We'll represent each observation $x$ as a 1-of-K encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this model, the probability of an observation can be written in the following form:

   $$p(x; \theta) = \prod_{k=1}^{K} \theta_k^{x_k}.$$

   Suppose we observe a dataset $\mathcal{D} = \{x^{(n)}\}_{n=1}^{N}$. Denote the count for outcome $k$ as $N_k$, so $N_k = \sum_{n=1}^{N} x_k^{(n)}$. In the previous assignment, you showed that the maximum likelihood estimate in terms of the counts is:

   $$\hat{\theta}_k = \frac{N_k}{N}.$$
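   The count-based estimate is mechanical to implement; a brief sketch (function name is illustrative):

   ```python
   import numpy as np

   def categorical_mle(X):
       """MLE for a categorical distribution from 1-of-K encoded observations.

       X: (N, K) array of one-hot rows. Returns theta_hat with
       theta_hat_k = N_k / N, where N_k sums column k.
       """
       counts = X.sum(axis=0)       # N_k for each outcome
       return counts / X.shape[0]   # theta_hat, sums to 1 by construction
   ```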

Now let’s derive the Bayesian parameter estimate.

   (a) [1pt] For the prior, we'll use the Dirichlet distribution, which is defined over the set of probability vectors (i.e. vectors that are nonnegative and whose entries sum to 1). Its PDF is as follows:

   $$p(\theta) \propto \theta_1^{a_1 - 1} \cdots \theta_K^{a_K - 1}.$$

   A useful fact is that if $\theta \sim \mathrm{Dirichlet}(a_1, \ldots, a_K)$, then

   $$\mathbb{E}[\theta_k] = \frac{a_k}{\sum_{k'} a_{k'}}.$$
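   This stated fact can be sanity-checked numerically with numpy's Dirichlet sampler (the concentration values and sample count below are arbitrary choices for illustration):

   ```python
   import numpy as np

   # Monte Carlo check of E[theta_k] = a_k / sum_{k'} a_{k'}
   a = np.array([1.0, 2.0, 3.0])
   rng = np.random.RandomState(0)
   samples = rng.dirichlet(a, size=200_000)   # (200000, 3) probability vectors
   empirical_mean = samples.mean(axis=0)
   analytic_mean = a / a.sum()                # [1/6, 2/6, 3/6]
   ```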

Determine the posterior distribution p(θ | D). From that, determine the posterior predictive probability that the next outcome will be k.

   (b) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the parameter vector $\theta$. For this question, you may assume each $a_k > 1$.

3. [4pts] Factor Analysis.

   In lecture, we covered the EM algorithm applied to mixture of Gaussians models. In this question, we'll look at another interesting example of EM, namely factor analysis. This is a model very similar in spirit to PCA: we have data in a high-dimensional space, and we'd like to summarize it with a lower-dimensional representation. Unlike PCA, we formulate the problem in terms of a probabilistic model. We assume the latent code vector $z$ is drawn from a standard Gaussian distribution $\mathcal{N}(0, I)$, and that the observations are drawn from a diagonal covariance Gaussian whose mean is a linear function of $z$. We'll consider the slightly simplified case of scalar-valued $z$. The probabilistic model is given by:

   $$z \sim \mathcal{N}(0, 1), \qquad x \mid z \sim \mathcal{N}(z u, \Sigma),$$


   where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_D^2)$. Note that the observation model can be written in terms of coordinates:

   $$x_d \mid z \sim \mathcal{N}(z u_d, \sigma_d^2).$$
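   To make the generative process concrete, here is a hedged sketch that draws samples from this model (the function name and array shapes are assumptions, not part of the assignment):

   ```python
   import numpy as np

   def sample_factor_analysis(u, sigma2, n, rng):
       """Draw n samples from the scalar-z factor analysis model.

       z ~ N(0, 1), x | z ~ N(z u, diag(sigma2)); u is (D,), sigma2 is (D,).
       Returns the latent codes z (n,) and observations x (n, D).
       """
       z = rng.randn(n)                              # latent codes
       eps = rng.randn(n, u.shape[0]) * np.sqrt(sigma2)
       x = z[:, None] * u[None, :] + eps             # mean z*u plus diagonal noise
       return z, x
   ```

   A standard consequence of this model is that the marginal covariance of $x$ is $u u^T + \Sigma$, which the sketch can be used to verify empirically.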

   We have a set of observations $\{x^{(n)}\}_{n=1}^{N}$, and $z$ is a latent variable, analogous to the mixture component in a mixture-of-Gaussians model. The parameters of the model are $\theta = \{u, \Sigma\}$. In this question, we'll derive both the E-step and the M-step for the EM algorithm. If you don't feel like you understand the EM algorithm yet, don't worry; we'll walk you through it, and the question will be mostly mechanical.

   (a) E-step (2pts). In this step, our job is to determine the posterior distribution $q(z) = p(z \mid x; \theta)$. After finding this distribution, you should compute formulas for the (univariate) statistics which we'll use in the M-step:

   $$m = \mathbb{E}[z \mid x; \theta] = $$
   $$s = \mathbb{E}[z^2 \mid x; \theta] = $$

   Tips:

   • Compare the model here with the linear Gaussian model of the Appendix. Note that $z$ here is a scalar, while the Appendix gives the more general formulation where $x$ and $z$ are both vectors.

   • To help you check your work: $p(z \mid x; \theta)$ is a univariate Gaussian distribution whose mean is a linear function of $x$, and whose variance does not depend on $x$.

   • Once you have figured out the mean and variance, that will give you the conditional expectations.

   (b) M-step (2pts). In this step, we need to re-estimate the parameters of the model. Suppose the current parameters (used in the E-step) are $\theta_{\mathrm{old}} = \{u_{\mathrm{old}}, \Sigma_{\mathrm{old}}\}$. For this part, your job is to derive a formula for $u_{\mathrm{new}}$ that maximizes the expected log-likelihood, i.e.,

   $$u_{\mathrm{new}} \leftarrow \arg\max_{u} \; \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{q_n(z^{(n)})}\!\left[ \log p(z^{(n)}, x^{(n)}; \theta) \right].$$

   Recall that $q_n(z^{(n)}) = p(z^{(n)} \mid x^{(n)}; \theta_{\mathrm{old}})$ is the distribution computed in part (a). Your answer should be given in terms of $m^{(n)} = \mathbb{E}[z^{(n)} \mid x^{(n)}; \theta_{\mathrm{old}}]$ and $s^{(n)} = \mathbb{E}[(z^{(n)})^2 \mid x^{(n)}; \theta_{\mathrm{old}}]$ from the previous part. (I.e., you don't need to expand out the formulas for $m^{(n)}$ and $s^{(n)}$ in this step, because if you were implementing this algorithm, you'd use the values $m^{(n)}$ and $s^{(n)}$ that you previously computed.)


   Tips:

   • Expand $\log p(z^{(n)}, x^{(n)}; \theta)$ to $\log p(z^{(n)}; \theta) + \log p(x^{(n)} \mid z^{(n)}; \theta)$ (log is the natural logarithm).

   • Expand out the PDF of the Gaussian distribution.

   • Apply linearity of expectation. You should wind up with terms proportional to $\mathbb{E}_{q_n(z^{(n)})}[z^{(n)}]$ and $\mathbb{E}_{q_n(z^{(n)})}[(z^{(n)})^2]$. Replace these expectations with $m^{(n)}$ and $s^{(n)}$. You should get an equation that does not mention $z^{(n)}$.

   • In order to find the maximum likelihood parameter $u_{\mathrm{new}}$, you need to take the derivative with respect to $u_d$, set it to zero, and solve for $u_{\mathrm{new}}$.

   (c) M-step, cont'd (optional) Find the M-step update for the observation variances $\{\sigma_d^2\}_{d=1}^{D}$. This can be done in a similar way to part (b).


## Appendix: Some Properties of Conditional Gaussians

Consider a multivariate Gaussian random variable $z$ with mean $\mu$ and covariance matrix $\Lambda^{-1}$ ($\Lambda$ is the inverse of the covariance matrix and is called the precision matrix). We denote this by

$$p(z) = \mathcal{N}(z \mid \mu, \Lambda^{-1}).$$

Now consider another Gaussian random variable $x$, whose mean is an affine function of $z$ (in the form to be made clear soon), and whose covariance $L^{-1}$ is independent of $z$. The conditional distribution of $x$ given $z$ is

$$p(x \mid z) = \mathcal{N}(x \mid Az + b, L^{-1}).$$

Here the matrix $A$ and the vector $b$ are of appropriate dimensions.

In some problems, we are interested in knowing the distribution of $z$ given $x$, or the marginal distribution of $x$. One can apply Bayes' rule to find the conditional distribution $p(z \mid x)$. After some calculations, we can obtain the following useful formulae:

$$p(x) = \mathcal{N}\!\left(x \mid A\mu + b, \; L^{-1} + A \Lambda^{-1} A^T\right)$$
$$p(z \mid x) = \mathcal{N}\!\left(z \mid C(A^T L (x - b) + \Lambda \mu), \; C\right)$$

with

$$C = (\Lambda + A^T L A)^{-1}.$$
