## Description

## 1 AdaBoost [5 pts]

In the lecture on ensemble methods, we said that in iteration $t$, AdaBoost picks the pair $(h_t, \beta_t)$ that minimizes the objective:

$$
(h_t^*(x), \beta_t^*) = \arg\min_{(h_t(x),\, \beta_t)} \sum_n w_t(n)\, e^{-y_n \beta_t h_t(x_n)} = \arg\min_{(h_t(x),\, \beta_t)} \left(e^{\beta_t} - e^{-\beta_t}\right) \sum_n w_t(n)\, \mathbb{I}[y_n \neq h_t(x_n)] + e^{-\beta_t} \sum_n w_t(n)
$$

We define the weighted misclassification error at time $t$, $\epsilon_t$, to be $\epsilon_t = \sum_n w_t(n)\, \mathbb{I}[y_n \neq h_t(x_n)]$. Also, the weights are normalized so that $\sum_n w_t(n) = 1$.
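The decomposition above can be checked numerically. The sketch below (with arbitrary random labels, predictions, and weights, purely for illustration) verifies that the weighted exponential loss equals the error-based decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
y = rng.choice([-1, 1], size=n)        # true labels in {-1, +1}
h = rng.choice([-1, 1], size=n)        # base-classifier predictions in {-1, +1}
w = rng.random(n)
w /= w.sum()                           # normalized weights, sum to 1
beta = 0.7

# Left-hand side: weighted exponential loss.
lhs = np.sum(w * np.exp(-y * beta * h))

# Right-hand side: decomposition in terms of the weighted error epsilon_t.
err = np.sum(w * (y != h))
rhs = (np.exp(beta) - np.exp(-beta)) * err + np.exp(-beta) * np.sum(w)

assert np.isclose(lhs, rhs)
```

The identity holds because $e^{-y_n \beta_t h_t(x_n)}$ equals $e^{-\beta_t}$ on correctly classified points and $e^{\beta_t}$ on misclassified ones.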

(a) Take the derivative of the above objective function with respect to $\beta_t$, set it to zero, and solve for $\beta_t$ to obtain the update for $\beta_t$.

(b) Suppose the training set is linearly separable, and we use a hard-margin linear support vector machine (no slack) as the base classifier. In the first boosting iteration, what would the resulting $\beta_1$ be?

## 2 K-means for single dimensional data [5 pts]

In this problem, we will work through K-means for single dimensional data.

(a) Consider the case where $K = 3$ and we have 4 data points $x_1 = 1$, $x_2 = 2$, $x_3 = 5$, $x_4 = 7$. What is the optimal clustering for this data? What is the corresponding value of the objective?
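For concreteness, the K-means objective (sum of squared distances from each point to its cluster mean) for any candidate hard assignment can be evaluated with a short helper. This is a sketch; the partition shown is an arbitrary illustration, not necessarily the optimal one:

```python
def kmeans_objective(clusters):
    """Sum of squared distances from each point to its cluster's mean."""
    total = 0.0
    for pts in clusters:
        mu = sum(pts) / len(pts)           # cluster centroid
        total += sum((x - mu) ** 2 for x in pts)
    return total

# One possible (not necessarily optimal) partition of {1, 2, 5, 7} into K = 3 clusters:
print(kmeans_objective([[1], [2], [5, 7]]))  # 2.0
```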

Parts of this assignment are adapted from course material by Jenna Wiens (UMich) and Tommi Jaakola (MIT).


(b) One might be tempted to think that Lloyd's algorithm is guaranteed to converge to the global minimum when $d = 1$. Show that there exists a suboptimal cluster assignment (i.e., initialization) for the data in the above part that Lloyd's algorithm will not be able to improve (to get full credit, you need to show the assignment, show why it is suboptimal, and explain why it will not be improved).
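To experiment with different initializations, Lloyd's algorithm in one dimension can be sketched in a few lines (the function name and interface here are illustrative, not from the lecture):

```python
def lloyds_1d(xs, centers, iters=100):
    """Lloyd's algorithm in 1-D: alternate nearest-center assignment and mean updates."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            clusters[j].append(x)
        # Update step: each center moves to the mean of its cluster.
        new = [sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)]
        if new == centers:       # a fixed point of Lloyd's algorithm
            break
        centers = new
    return centers, clusters
```

Running this from several different initial centers on the data above shows that the fixed point reached depends on the initialization.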


## 3 Gaussian Mixture Models [8 pts]

We would like to cluster data $\{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^d$, using a Gaussian Mixture Model (GMM) with $K$ mixture components. To do this, we need to estimate the parameters $\theta$ of the GMM, i.e., we need to set the values $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^K$, where $\omega_k$ is the mixture weight associated with mixture component $k$, and $\mu_k$ and $\Sigma_k$ denote the mean and the covariance matrix of the Gaussian distribution associated with mixture component $k$.

If we knew which cluster each sample $x_n$ belongs to (i.e., we had complete data), we showed in the lecture on Clustering that the log-likelihood $\ell$ is as given below, and we can compute the maximum likelihood estimate (MLE) of all the parameters.

$$
\ell(\theta) = \sum_n \log p(x_n, z_n) = \sum_k \sum_n \gamma_{nk} \log \omega_k + \sum_k \left( \sum_n \gamma_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) \tag{1}
$$

Since we do not have complete data, we use the EM algorithm. The EM algorithm works by iterating between setting each $\gamma_{nk}$ to the posterior probability $p(z_n = k \mid x_n)$ (step 1 on slide 26 of the lecture on Clustering) and then using $\gamma_{nk}$ to find the value of $\theta$ that maximizes $\ell$ (step 2 on slide 26). We will now derive the update for one of the parameters, $\mu_j$ (the mean parameter associated with mixture component $j$).
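One full iteration of these two steps can be sketched for the one-dimensional, unit-variance case (matching the setup of part (c) below). This is an illustrative sketch assuming the standard E- and M-step formulas, with all standard deviations fixed at 1; the function name and data are not from the lecture:

```python
import math

def em_step(xs, omegas, mus):
    """One EM iteration for a 1-D GMM whose standard deviations are all fixed at 1."""
    K, N = len(omegas), len(xs)
    # Step 1 (E-step): gamma_nk = posterior p(z_n = k | x_n).
    # The Gaussian normalizing constant 1/sqrt(2*pi) cancels in the ratio.
    gam = []
    for x in xs:
        p = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(omegas, mus)]
        s = sum(p)
        gam.append([pk / s for pk in p])
    # Step 2 (M-step): update mixing weights and means from the soft counts.
    nk = [sum(gam[n][k] for n in range(N)) for k in range(K)]
    omegas = [nk[k] / N for k in range(K)]
    mus = [sum(gam[n][k] * xs[n] for n in range(N)) / nk[k] for k in range(K)]
    return omegas, mus, gam
```

Each row of `gam` sums to 1 (it is a posterior over components), and the updated mixing weights also sum to 1.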

(a) To maximize $\ell$, compute $\nabla_{\mu_j} \ell(\theta)$: the gradient of $\ell(\theta)$ with respect to $\mu_j$.

(b) Set the gradient to zero and solve for $\mu_j$ to show that

$$
\mu_j = \frac{1}{\sum_n \gamma_{nj}} \sum_n \gamma_{nj} x_n.
$$

(c) Suppose that we are fitting a GMM to data using $K = 2$ components. We have $N = 5$ samples in our training data, with $x_n$, $n \in \{1, \ldots, N\}$, equal to $\{5, 15, 25, 30, 40\}$.

We use the EM algorithm to find the maximum likelihood estimates for the model parameters, which are the mixing weights for the two components, $\omega_1$ and $\omega_2$, and the means for the two components, $\mu_1$ and $\mu_2$. The standard deviations for both components are fixed at 1.

Suppose that at the end of step 1 of iteration 5 of the EM algorithm, the soft assignments $\gamma_{nk}$ for the five data items are as shown in Table 1.

| $n$ | $\gamma_{n1}$ | $\gamma_{n2}$ |
|---|---|---|
| 1 | 0.2 | 0.8 |
| 2 | 0.2 | 0.8 |
| 3 | 0.8 | 0.2 |
| 4 | 0.9 | 0.1 |
| 5 | 0.9 | 0.1 |

Table 1: The entry in row $n$ and column $k$ corresponds to $\gamma_{nk}$.

What are the updated values for the parameters $\omega_1$, $\omega_2$, $\mu_1$, and $\mu_2$ at the end of step 2 of the EM algorithm?
