Description
High Level Description
0.1 Tasks In this assignment you are asked to implement Kmeans clustering to identify main clusters in
the data, use the discovered centroid of cluster for classification, and implement Gaussian Mixture Models
to learn a generative model of the data. Specifically, you will
• Implement Kmeans clustering algorithm to identify clusters in a twodimensional toydataset.
• Implement image compression using Kmeans clustering algorithm.
• Implement classification using the centroids identified by clustering.
• Implement Gaussian Mixture Models to learn a generative model and generate samples from a mixture distribution.
0.2 Running the code We have provided two scripts to run the code. Run kmeans.sh after you finish implementation of kmeans clustering, classification and compression. Run gmm.sh after you finish
implementing Gaussian Mixture Models.
0.3 Dataset Through out the assignment we will use two datasets (See fig. 1) — Toy Dataset and Digits
Dataset (you do not need to download).
Toy Dataset is a twodimensional dataset generated from 4 Gaussian distributions. We will use this
dataset to visualize the results of our algorithm in two dimensions.
We will use digits dataset from sklearn [1] to test Kmeans based classifier and generate digits using
Gaussian Mixture model. Each data point is a 8 × 8 image of a digit. This is similar to MNIST but less
complex.
1
(a) Digits (b) 2D Toy Dataset
Figure 1: Datasets
0.4 Cautions Please DO NOT import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not
modify the code unless we instruct you to do so. A homework solution that does not match the provided
setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure
that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git
add, commit, and push all the required files, including your code and generated output files.
0.5 Final submission After you have solved problem 1 and 2, execute bash kmeans.sh command and
bash gmm.sh command. Git add, commit and push plots and results folder and all the *.py files.
Problem 1 Kmeans Clustering
Recall that for a dataset x1, . . . , xN ∈ RD, the Kmeans distortion objective is
J({µk}, {rik}) = 1
N
N
∑
i=1
K
∑
k=1
rikkµk − xik
2
2
(1)
where µ1, . . . , µK are centroids of the K clusters and rik ∈ {0, 1} represents whether example i belongs to
cluster k.
Clearly, fixing the centroids and minimizing J over the assignment give
rˆik =
(
1 k = arg mink
0 kµk
0 − xik
2
2
0 Otherwise.
(2)
On the other hand, fixing the assignment and minimizing J over the centroids give
µˆ k =
∑
N
i=1
rikxi
∑
N
i=1
rik
(3)
What the Kmeans algorithm does is simply to alternate between these two steps. See Algorithm 1 for
the pseudocode.
2
Algorithm 1 Kmeans clustering algorithm
1: Inputs:
An array of size N × D denoting the training set, x
Maximum number of iterations, max iter
Number of clusters, K
Error tolerance, e
2: Outputs:
Array of size K × D of means, {µk}
K
k=1
Membership vector R of size N, where R[i] ∈ [K] is the index of the cluster that
example i belongs to.
3: Initialize:
Set means {µk}
K
k=1
to be K points selected from x uniformly at random (with
replacement), and J to be a large number (e.g. 1010)
4: repeat
5: Compute membership rik using eq. 2
6: Compute distortion measure Jnew using eq. 1
7: if J − Jnew ≤ e then
STOP
8: end if
9: Set J := Jnew
10: Compute means µk using eq. 3
11: until maximum iteration is reached
1.1 Implementing Kmeans clustering algorithm Implement Algorithm 1 by filling out the TODO parts
in class KMeans of file kmeans.py. Note the following:
• Use numpy.random.choice for the initialization step.
• If at some iteration, there exists a cluster k with no points assigned to it, then do not update the
centroid of this cluster for this round.
• While assigning a sample to a cluster, if there’s a tie (i.e. the sample is equidistant from two centroids),
you should choose the one with smaller index (like what numpy.argmin does).
After you complete the implementation, execute bash kmeans.sh command to run kmeans on toy
dataset. You should be able to see three images generated in plots folder. In particular, you can see
toy dataset predicted labels.png and toy dataset real labels.png and compare the clusters identified by the algorithm against the real clusters. Your implementation should be able to recover the
correct clusters sufficiently well. Representative images are shown in fig. 2. Red dots are cluster centroids.
Note that color coding of recovered clusters may not match that of correct clusters. This is due to mismatch
in ordering of retrieved clusters and correct clusters (which is fine).
3
(a) Predicted Clusters (b) Real Clusters
Figure 2: Clustering on toy dataset
1.2 Image compression with Kmeans In the next part, we will look at lossy image compression as an
application of clustering. The idea is simply to treat each pixel of an image as a point xi
, then perform
Kmeans algorithm to cluster these points, and finally replace each pixel with its centroid.
What you need to implement is to compress an image with K centroids given. Specifically, complete the
function transform image in the file kmeansTest.py.
After your implementation, execute bash kmeans.sh again and you should be able to see an image
baboon compressed.png in the plots folder. You can see that this image is distorted as compared to
the original baboon.tiff.
1.3 Classification with kmeans Another application of clustering is to obtain a faster version of the nearest neighbor algorithm. Recall that nearest neighbor evaluates the distance of a test sample from every
training point to predict its class, which can be very slow. Instead, we can compress the entire training
set to just the K centroids, where each centroid is now labeled as the majority class of the corresponding
cluster. After this compression the prediction time of nearest neighbor is reduced from O(N) to just O(K)
(see Algorithm 2 for the pseudocode).
Algorithm 2 Classification with Kmeans clustering
1: Inputs:
Training Data : {X,Y}
Parameters for running Kmeans clustering
2: Training:
Run Kmeans clustering to find centroids and membership (reuse your code from Problem 1.1)
Label each centroid with majority voting from its members. i.e. arg maxc ∑i
rikI{yi = c}
3: Prediction:
Predict the same label as the nearest centroid (that is, 1NN on centroids).
Note: 1) break ties in the same way as in previous problems; 2) if some centroid doesn’t contain any
point, set the label of this centroid as 0.
Complete the fit and predict function in KMeansClassifier in file kmeans.py. Once completed,
run kmeans.sh to evaluate the classifier on a test set. For comparison, the script will also print accuracy of
a logistic classifier and a nearest neighbor classifier. (Note: a naive Kmeans classifier may not do well but
it can be an effective unsupervised method in a classification pipeline [2].)
4
Problem 2 Gaussian Mixture Model
Next you will implement Gaussian Mixture Model (GMM) for clustering and also generate data after learning the model. Recall the key steps of EM algorithm for learning GMMs on Slide 52 of Lec 8 (we change the
notation ωk
to πk
):
γik =
πkN (xi
; µk
, Σk
)
∑k πkN (xi
; µk
, Σk
)
(4)
Nk =
N
∑
i=1
γik (5)
µk =
∑
N
i=1 γikxi
Nk
(6)
Σk =
∑
N
i=1 γik(xi − µk
)(xi − µk
)
T
Nk
(7)
πk =
Nk
N
(8)
Algorithm 3 provides a more detailed pseudocode. Also recall the incomplete loglikelihood is
N
∑
n=1
ln p(xn) =
N
∑
n=1
ln
K
∑
k=1
πkN (xi
; µk
, Σk
) =
N
∑
n=1
ln
K
∑
k=1
p
πk
(2π)DΣk

exp
−
1
2
(xi − µk
)
TΣ
−1
k
(xi − µk
)
(9)
2.1 Implementing EM Implement EM algorithm (class Gaussian pdf, function fit and function com
pute log likelihood) in file gmm.py to estimate mixture model parameters. Please note the following:
• For Kmeans initialization, the inputs of Kmeans are the same as those of EM’s.
• When computing the density of a Gaussian with covariance matrix Σ, use Σ
0 = Σ + 10−3
I when Σ is
not invertible (in case it’s still not invertible, keep adding 10−3
I until it is invertible).
After implementation execute bash gmm.sh command to estimate mixture parameters for toy dataset.
You should see a Gaussian fitted to each cluster in the data. A representative image is shown in fig. 3.
We evaluate both initialization methods and you should observe that initialization with Kmeans usually
converges faster.
Figure 3: Gaussian Mixture model on toy dataset
5
Algorithm 3 EM algorithm for estimating GMM parameters
1: Inputs:
An array of size N × D denoting the training set, x
Maximum number of iterations, max iter
Number of clusters, K
Error tolerance, e
Init method — Kmeans or random
2: Outputs:
Array of size K × D of means, {µk}
K
k=1
Variance matrix Σk of size K × D × D
A vector of size K denoting the mixture weights, pi k
3: Initialize:
• For “random” init method: initialize means uniformly at random from [0,1)
for each dimension (use numpy.random.rand), initialize variance to be
identity matrix for each component, initialize mixture weight to be uniform.
• For “Kmeans” init method: run Kmeans, initialize means as the centroids
found by Kmeans, and initialize variance and mixture weight according to
Eq. (7) and Eq. (8) where γik is the binary membership found by Kmeans.
4: Compute the loglikelihood l using Eq. (9)
5: repeat
6: E Step: Compute responsibilities using Eq. (4)
7: M Step:
Estimate means using Eq. (6)
Estimate variance using Eq. (7)
Estimate mixture weight using Eq. (8)
8: Compute new loglikelihood lnew
9: if l − lnew ≤ e then
STOP
10: end if
11: Set l := lnew
12: until maximum iteration is reached
2.2 Implementing sampling We also fit a GMM with K = 30 using the digits dataset. An advantage of
GMM compared to Kmeans is that we can sample from the learned distribution to generate new synthetic
examples which look similar to the actual data.
To do this, implement sample function in gmm.py which uses self.means, self.variances and
self.pi k to generate digits. Recall that sampling from a GMM is a two step process: 1) first sample a
component k according to the mixture weight; 2) then sample from a Gaussian distribution with mean µk
and variance Σk
. Use numpy.random for these sampling steps.
After implementation, execute bash gmm.sh again. This should produce visualization of means µk
and some generated samples for the learned GMM. Representative images are shown in fig. 4.
6
(a) Means of GMM learnt on digits (b) Random digits sample generated from GMM
Figure 4: Results on digits dataset
References
[1] sklearn.datasets.digits http://scikitlearn.org/stable/modules/generated/sklearn.
datasets.load_digits.html#sklearn.datasets.load_digits
[2] Coates, A., & Ng, A. Y. (2012). Learning feature representations with kmeans. In Neural networks:
Tricks of the trade (pp. 561580). Springer, Berlin, Heidelberg.
7