## Description

## Introduction

Machine learning techniques have been applied to a variety of image interpretation problems. In this project, you will investigate facial recognition, which can be treated as a clustering problem (“separate these pictures of Joe and Mary”).

For this project, we will use a small part of a huge database of faces of famous people: the Labeled Faces in the Wild (LFW) people dataset¹. The images have already been cropped out of the original image, and scaled and rotated so that the eyes and mouth are roughly in alignment; additionally, we will use a version that is scaled down to a manageable size of 50 by 37 pixels (for a total of 1850 “raw” features). Our dataset has a total of 1867 images of 19 different people. You will apply dimensionality reduction using principal component analysis (PCA) and clustering methods such as k-means and k-medoids to the problem of facial recognition on this dataset.

Download the starter files from the course website. They contain the following source files:

• util.py – Utility methods for manipulating data, including through PCA.

• cluster.py – Code for the Point, Cluster, and ClusterSet classes, on which you will build the clustering algorithms.

• faces.py – Main code for the project.

Please note that you do not necessarily have to follow the skeleton code perfectly. We encourage you to include your own additional methods and functions. However, you are not allowed to use any scikit-learn classes or functions other than those already imported in the skeleton code.

## 1 PCA and Image Reconstruction [4 pts]

Before attempting automated facial recognition, you will investigate a general problem with images. That is, images are typically represented as thousands (in this project) to millions (more generally) of pixel values, and a high-dimensional vector of pixels must be reduced to a reasonably low-dimensional vector of features.

(a) As always, the first thing to do with any new dataset is to look at it. Use get_lfw_data(…) to get the LFW dataset with labels, and plot a couple of the input images using show_image(…). Then compute the mean of all the images, and plot it. (Remember to include all requested images in your writeup.) Comment briefly on this “average” face.
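As a sketch of what this step involves (the random matrix below is only a stand-in for the real pixel data, assuming get_lfw_data returns an n × 1850 matrix of flattened 50 × 37 images, as the project describes):

```python
import numpy as np

# Stand-in for the matrix returned by get_lfw_data: 1867 images, each a
# flattened 50 x 37 = 1850-pixel row (random values here, not real faces).
rng = np.random.default_rng(0)
X = rng.random((1867, 1850))

# The "average" face is simply the per-pixel mean over all images.
mean_face = X.mean(axis=0)

# Reshape to the 50 x 37 grid before displaying with show_image(...).
mean_image = mean_face.reshape(50, 37)
print(mean_image.shape)  # (50, 37)
```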

(b) Perform PCA on the data using util.PCA(…). This function returns a matrix U whose columns are the principal components, and a vector mu which is the mean of the data. If you want to look at a principal component (referred to in this setting as an eigenface), run show_image(vec_to_image(v)), where v is a column of the principal component matrix. (This function will scale vector v appropriately for image display.) Show the top twelve eigenfaces:

```python
plot_gallery([vec_to_image(U[:, i]) for i in range(12)])
```

Comment briefly on your observations. Why do you think these are selected as the top eigenfaces?
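The internals of util.PCA(…) are not shown here, but a standard way to obtain (U, mu) is an SVD of the mean-centered data; this minimal sketch (not the course's implementation) illustrates the idea:

```python
import numpy as np

def pca(X):
    """Return (U, mu): columns of U are principal components, ordered by
    decreasing variance; mu is the mean of the rows of X."""
    mu = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return Vt.T, mu

rng = np.random.default_rng(0)
X = rng.random((100, 20))
U, mu = pca(X)
print(U.shape, mu.shape)  # (20, 20) (20,)
```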

¹ http://vis-www.cs.umass.edu/lfw/


(c) Explore the effect of using more or fewer dimensions to represent images. Do this by:

• Finding the principal components of the data

• Selecting a number l of components to use

• Reconstructing the images using only the first l principal components

• Visually comparing the images to the originals

To perform PCA, use apply_PCA_from_Eig(…) to project the original data into the lower-dimensional space, and then use reconstruct_from_PCA(…) to reconstruct high-dimensional images from lower-dimensional ones. Then, using plot_gallery(…), submit a gallery of the first 12 images in the dataset, reconstructed with l components, for l = 1, 10, 50, 100, 500, 1288. Comment briefly on the effectiveness of differing values of l with respect to facial recognition.

We will revisit PCA in the last section of this project.
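A minimal sketch of the project-then-reconstruct step, assuming U has the principal components as columns and mu is the data mean (as util.PCA returns); apply_PCA_from_Eig and reconstruct_from_PCA presumably wrap the two matrix products below, and synthetic data is used so the snippet runs standalone:

```python
import numpy as np

# Synthetic stand-in data: n points in d dimensions; keep only l components.
rng = np.random.default_rng(0)
n, d, l = 50, 30, 5
X = rng.random((n, d))
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
U = Vt.T                          # columns = principal components

Z = (X - mu) @ U[:, :l]           # project (what apply_PCA_from_Eig would do)
X_rec = Z @ U[:, :l].T + mu       # reconstruct (reconstruct_from_PCA)
print(X_rec.shape == X.shape)     # True
```

Using all d components reproduces the data exactly; small l trades fidelity for dimensionality.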

## 2 K-Means and K-Medoids [16 pts]

Next, we will explore clustering algorithms in detail by applying them to a toy dataset. In particular, we will investigate k-means and k-medoids (a slight variation on k-means).

(a) In k-means, we attempt to find $k$ cluster centers $\mu_j \in \mathbb{R}^d$, $j \in \{1, \ldots, k\}$, and $n$ cluster assignments $c^{(i)} \in \{1, \ldots, k\}$, $i \in \{1, \ldots, n\}$, such that the total distance between each data point and the nearest cluster center is minimized. In other words, we attempt to find $\mu_1, \ldots, \mu_k$ and $c^{(1)}, \ldots, c^{(n)}$ that minimize

$$J(c, \mu) = \sum_{i=1}^{n} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2 .$$

To do so, we iterate between assigning each $x^{(i)}$ to the nearest cluster center $c^{(i)}$ and updating each cluster center $\mu_j$ to the average of all points assigned to the $j$th cluster.
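The two alternating steps can be sketched in a few lines of NumPy (a bare-bones illustration on raw arrays, not the kMeans(…) you will implement on the Point/Cluster classes):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Bare-bones k-means: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            if (c == j).any():          # keep old center if cluster is empty
                centers[j] = X[c == j].mean(axis=0)
    return c, centers

# Two well-separated blobs are recovered exactly.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
c, centers = kmeans(X, k=2)
print(sorted(centers.tolist()))  # [[0.0, 0.0], [10.0, 10.0]]
```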

Instead of holding the number of clusters k fixed, one can think of minimizing the objective function over µ, c, and k. Show that this is a bad idea. Specifically, what is the minimum possible value of J(c, µ, k)? What values of c, µ, and k result in this value?

(b) To implement our clustering algorithms, we will use Python classes to help us define three abstract data types: Point, Cluster, and ClusterSet (available in cluster.py). Read through the documentation for these classes. (You will be using these classes later, so make sure you know what functionality each class provides!) Some of the class methods are already implemented, and other methods are described in comments. Implement all of the methods marked TODO in the Cluster and ClusterSet classes.
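As a point of reference for two of the TODO methods: the centroid of a cluster is the coordinate-wise mean of its points, while the medoid is the cluster member with the smallest total distance to the other members. A sketch on raw NumPy arrays (the real methods operate on Point objects):

```python
import numpy as np

def centroid(points):
    """Coordinate-wise mean of the points (need not be a member)."""
    return np.mean(points, axis=0)

def medoid(points):
    """The member minimizing total squared distance to the other
    members (always an actual data point)."""
    pts = np.asarray(points, dtype=float)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    return pts[d2.sum(axis=1).argmin()]

print(centroid([[0, 0], [2, 0]]))          # [1. 0.]
print(medoid([[0, 0], [1, 0], [10, 0]]))   # [1. 0.]
```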

(c) Next, implement random_init(…) and kMeans(…) based on the provided specifications.


(d) Now test the performance of k-means on a toy dataset.

Use generate_points_2d(…) to generate three clusters, each containing 20 points. (You can modify generate_points_2d(…) to test different inputs while debugging your code, but be sure to return to the initial implementation before creating any plots for submission.) You can plot the clusters at each iteration using the plot_clusters(…) function.

In your writeup, include plots of the k-means cluster assignments and corresponding cluster “centers” for each iteration when using random initialization.

(e) Implement kMedoids(…) based on the provided specification.

Hint: Since k-means and k-medoids are so similar, you may find it useful to refactor your code to use a helper function kAverages(points, k, average, init='random', plot=True), where average is a method that determines how to calculate the average of points in a cluster (so it can take on the values ClusterSet.centroids or ClusterSet.medoids).²

As before, include plots for k-medoids clustering for each iteration when using random initialization.

(f) Finally, we will explore the effect of initialization. Implement cheat_init(…).

Now compare clustering by initializing using cheat_init(…). Include plots for k-means and k-medoids for each iteration.
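The intended behavior of cheat_init(…) is presumably to group points by their true labels, giving both algorithms an identical, near-perfect starting point. A schematic version (names and signature hypothetical; the real function builds Cluster objects from labeled Point objects):

```python
def cheat_init(points, labels):
    """Group points by their true labels (schematic sketch)."""
    clusters = {}
    for p, y in zip(points, labels):
        clusters.setdefault(y, []).append(p)
    return list(clusters.values())

print(cheat_init(['a', 'b', 'c'], [0, 1, 0]))  # [['a', 'c'], ['b']]
```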

## 3 Clustering Faces [12 pts]

Finally (!), we will apply clustering algorithms to the image data. To keep things simple, we will only consider data from four individuals. Make a new image dataset by selecting 40 images each from classes 4, 6, 13, and 16, then translate these images to (labeled) points:³

```python
X1, y1 = util.limit_pics(X, y, [4, 6, 13, 16], 40)
points = build_face_image_points(X1, y1)
```

(a) Apply k-means and k-medoids to this new dataset with k = 4, initializing the centroids randomly. Evaluate the performance of each clustering algorithm by computing the average cluster purity with ClusterSet.score(…). As the performance of the algorithms can vary widely depending upon the initialization, run both clustering methods 10 times and report the average, minimum, and maximum performance.

|           | average | min | max |
|-----------|---------|-----|-----|
| k-means   |         |     |     |
| k-medoids |         |     |     |

How do the clustering methods compare in terms of clustering performance and runtime?
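For reference, cluster purity (which ClusterSet.score(…) presumably computes under the same definition) is the fraction of points carrying their cluster's majority label; a self-contained sketch:

```python
import numpy as np

def purity(assignments, labels):
    """Average cluster purity: fraction of points whose label matches
    their cluster's majority label."""
    assignments = np.asarray(assignments)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(assignments):
        members = labels[assignments == c]
        correct += np.bincount(members).max()   # size of the majority label
    return correct / len(labels)

print(purity([0, 0, 1, 1], [4, 4, 13, 13]))  # 1.0
print(purity([0, 0, 0, 0], [4, 4, 13, 13]))  # 0.5
```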

² In Python, if you have a function stored in the variable func, you can apply it to parameters arg by calling func(arg). This works even if func is a class method and arg is an object that is an instance of the class.

³ There is a bug in fetch_lfw in scikit-learn version 0.18.1 where the loaded images are not always returned in the same order. This is not a problem for the previous parts, but it can affect the subset selected in this part, so you may see varying results. Results that show the correct qualitative behavior will receive full credit.


Now construct another dataset by selecting 40 images each from two individuals, classes 4 and 13.

(b) Explore the effect of lower-dimensional representations on clustering performance. To do this, compute the principal components for the entire image dataset, then project the newly generated dataset into a lower dimension (varying the number of principal components), and compute the scores of each clustering algorithm.

So that we are only changing one thing at a time, use init='cheat' to generate the same initial set of clusters for k-means and k-medoids. For each value of l, the number of principal components, you will have to generate a new list of points using build_face_image_points(…).

Let l = 1, 3, 5, …, 41, and fix the number of clusters at k = 2. Then, on a single plot, plot the clustering score versus the number of components for each clustering algorithm (be sure to label the algorithms). Discuss the results in a few sentences.
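The shape of the experiment loop might look like the following (synthetic data stands in for the two-person image matrix so the snippet runs on its own; the commented lines mark where the real clustering and scoring calls would go):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((80, 60))            # stand-in for the two-person image matrix
mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
U = Vt.T                            # columns = principal components

dims = []
for l in range(1, 42, 2):           # l = 1, 3, 5, ..., 41
    Z = (X - mu) @ U[:, :l]         # l-dimensional representation
    dims.append(Z.shape[1])
    # ...rebuild points with build_face_image_points, run k-means and
    # k-medoids with init='cheat', and record each score for the plot...
print(dims[0], dims[-1], len(dims))  # 1 41 21
```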

Some pairs of people are more similar to one another than others.

(c) Experiment with the data to find a pair that clustering can discriminate very well and another pair that it finds very difficult (assume you have 40 images for each individual). Describe your methodology (you may choose any of the clustering algorithms you implemented). Report the two pairs in your writeup (display the pairs of images using plot_representative_images), and comment briefly on the results.
