Homework 2 Statistics S4240: Data Mining

Problem 1. (35 Points) In this problem we will manually go through all of the steps for
PCA. Basic computations like finding the eigenvalues for a matrix may be done using R.
a. (2 Points) Load hw02_q1_p1.csv. Find the column means and the row means for the data.
What do these values tell us about this data set?
b. (3 Points) Center the data and find the empirical covariance matrix, $\hat{\Sigma}$. This should be a
5-by-5 matrix. What do the diagonal values of the covariance matrix tell us about this data set?
What do the off-diagonal elements tell us about this data set?
c. (5 Points) Give the eigenvalues and associated eigenvectors of $\hat{\Sigma}$. Why does this matrix have
the same left eigenvectors as right eigenvectors?
$$x_{\text{left}}^T \hat{\Sigma} = \lambda x_{\text{left}}^T, \qquad \hat{\Sigma} x_{\text{right}} = \lambda x_{\text{right}}$$
d. (5 Points) Give all of the loadings and all of the scores for the data.
e. (5 Points) Plot the proportion of variance captured against the number of components included.
How many components should we include and why?
f. (5 Points) Load hw02_q1_p2.csv. This has 5 new observations in the original coordinates. Using
the loadings obtained in (d), give the scores of these 5 new observations. [Hint: center these
new observations with respect to the data set you loaded in (a).]
g. (5 Points) Now, from the scores obtained in (f), use only the first two scores to represent the 5 new
observations. What are the coordinates of the projections in the original space (call it $x'$)?
What is their Euclidean distance from the original data points?
h. (5 Points) Define the error of a point as
$$d(x', x) = x' - x,$$
which is a 5-dimensional vector. In what direction is $d(x', x)$ for the 5 new points? Why do you
think this is?
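The steps in Problem 1 can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not a solution: the matrices `X` and `X_new` are random stand-ins for the two CSV files, every variable name is hypothetical, and the covariance is assumed to use the n - 1 denominator (the convention of R's cov()).

```r
set.seed(42)
# Synthetic stand-ins for the two CSV files (assumption: 5 numeric columns).
X     <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5)
X_new <- matrix(rnorm(5 * 5),  nrow = 5,  ncol = 5)

# Center with the column means of the original data.
mu <- colMeans(X)
Xc <- sweep(X, 2, mu)                 # subtract mu from every row

# Empirical covariance (assumption: n - 1 denominator, as in cov()).
Sigma_hat <- crossprod(Xc) / (nrow(X) - 1)

# Eigendecomposition: the columns of V are the loadings.
eig    <- eigen(Sigma_hat, symmetric = TRUE)
V      <- eig$vectors
lambda <- eig$values

# Scores of the original data and cumulative proportion of variance.
scores  <- Xc %*% V
cum_var <- cumsum(lambda) / sum(lambda)

# Scores for the new observations: center with the ORIGINAL means.
scores_new <- sweep(X_new, 2, mu) %*% V

# Keep q components and map back to the original coordinates.
q      <- 2
X_proj <- sweep(scores_new[, 1:q] %*% t(V[, 1:q]), 2, mu, "+")

# Error vectors d(x', x) = x' - x and their Euclidean lengths.
D            <- X_proj - X_new
dist_to_orig <- sqrt(rowSums(D^2))
```

A quick sanity check on such a sketch is to compare `Sigma_hat` with `cov(X)` and to inspect `D %*% V[, 1:q]`, which shows how the error vectors relate to the retained loadings.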
Problem 2. (65 Points) We will continue working with the Yale Faces B data set from
the last homework with the goal of representing the images using PCA. We will use four lighting
conditions, P00A+000E+00, P00A+005E+10, P00A+005E-10, and P00A+010E+00, which are closest to
straight-on lighting. We will use the pixmap library to manipulate the data. Load this library and
make sure that the folder CroppedYale is in your working directory.
a. (10 Points) Load the views P00A+000E+00, P00A+005E+10, P00A+005E-10, and P00A+010E+00
for all subjects. Convert each photo to a matrix (using getChannels) and then to a vector; store
the collection as a matrix where each row is a photo. What is the size of this matrix?
b. (10 Points) Compute a “mean face”, which is the average for each pixel across all of the faces.
Display the mean face as a photo in the original size and save a copy as .png. Include this in
your write up.
c. (10 Points) Subtract the “mean face” from each of the faces. Then, use prcomp() to find the principal
components of your image matrix. Plot the number of components on the x-axis against the
proportion of the variance explained on the y-axis.
d. (10 Points) Each principal component is itself a picture; these pictures are called “eigenfaces.” Display the first
9 eigenfaces in a 3-by-3 grid. What image components does each describe? (Note: pixmapGrey()
is fairly flexible and will automatically rescale data to have min 0 and max 1. You can do this
manually or allow pixmapGrey() to do it.)
e. (15 Points) Use the eigenfaces to reconstruct yaleB05_P00A+010E+00.pgm. Starting with the
mean face, add in one eigenface at a time until you reach 24 eigenfaces. Save the results in a
5-by-5 grid. Again, starting with the mean face, add in five eigenfaces at a time until you reach
120 eigenfaces. Save the results in a 5-by-5 grid. Include both of these in your write up. How
many eigenfaces do you feel you need before you can recognize the person?
f. (10 Points) Remove the pictures of subject 01 from your image matrix (there should be four
pictures of him) and recenter the data. Rerun prcomp() to get new principal components. Use
these to reconstruct yaleB01_P00A+010E+00.pgm. Do this by subtracting off the mean face and
projecting the remaining image onto the principal components. Print the reconstructed image.
Does it look like the original image? Why or why not?
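The reconstruction procedure in parts (e) and (f) can be sketched on synthetic data as follows. This is only an illustration under assumptions: `faces` is a random stand-in for the real image matrix, the 8-by-6 "image" size is invented, and `reconstruct()` is a hypothetical helper. Loading and displaying real photos with pixmap is deliberately left out.

```r
set.seed(1)
# Synthetic stand-in for the face matrix: 20 "photos" of 8 x 6 pixels,
# one photo per row (assumption; the real images are much larger).
n_img <- 20; h <- 8; w <- 6
faces <- matrix(runif(n_img * h * w), nrow = n_img)

# Mean face and centered data.
mean_face <- colMeans(faces)
centered  <- sweep(faces, 2, mean_face)

# PCA; the data are already centered, so prcomp's own centering is disabled.
pca      <- prcomp(centered, center = FALSE)
prop_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Reconstruct one photo from the mean face plus its first m eigenfaces.
reconstruct <- function(photo, m) {
  load_m  <- pca$rotation[, seq_len(m), drop = FALSE]
  score_m <- (photo - mean_face) %*% load_m        # 1 x m scores
  as.vector(mean_face + score_m %*% t(load_m))
}

# Reconstruction error of one training photo as eigenfaces are added.
target <- faces[5, ]
err <- sapply(seq_len(ncol(pca$rotation)), function(m) {
  sqrt(sum((reconstruct(target, m) - target)^2))
})

# Reshape a pixel vector back to image dimensions for display.
recon_img <- matrix(reconstruct(target, 10), nrow = h, ncol = w)
```

For a photo that was part of the PCA input, `err` shrinks toward zero as components are added. The same helper can be applied to a held-out photo, as in part (f), to see how the behavior differs. Whether the reshape should fill by column or by row depends on how the photos were vectorized in the first place.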
Problem 3. (20 Points) James 3.7.3
Problem 4. (20 Points) James 3.7.4
Problem 5. (20 Points) Load the data set hw02_q5.csv.
a. (5 Points) Use the function dist() to produce a matrix of distances between all pairs of points.
Distances should be computed for the two-dimensional input points x = [x1, x2] (y is the output
variable). Print the results.
b. (5 Points) Use the first data point as the testing set and the rest of the data as a training
set. Implement kNN regression using the distance matrix from (a) for k = 1, 2,…, 10. This
algorithm should predict the y value of the first data point (with some error). Compute the
mean squared error for the testing set and the mean squared error for the training set for each
value of k; denote these values as $\mathrm{MSE}^{k,1}_{\text{test}}$ and $\mathrm{MSE}^{k,1}_{\text{train}}$.
c. (5 Points) Rerun part (b) for each remaining data point: for observations i = 2, 3,…, n, use the ith
data point as the testing set and the remaining data as the training set, and run kNN for k = 1, 2,…, 10.
For each value of k compute a mean squared error as follows:
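$$\mathrm{MSE}^{k}_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}^{k,i}_{\text{train}}, \qquad
\mathrm{MSE}^{k}_{\text{test}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MSE}^{k,i}_{\text{test}}.$$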
d. (5 Points) The results from part (c) are called the leave-one-out cross-validation error. They are
commonly used for estimating prediction error and selecting model parameters. Use these results
to pick the optimal value for k. Should you make your choice based on $\mathrm{MSE}^{k}_{\text{train}}$ or $\mathrm{MSE}^{k}_{\text{test}}$,
and why? What is the optimal choice of k, and why?
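The leave-one-out procedure in Problem 5 can be sketched on synthetic data as follows. This is a minimal sketch under assumptions, not a solution: `dat` is a random stand-in for the CSV file, the column names x1, x2, and y are assumed, and `knn_predict()` and `fold_mse()` are hypothetical helpers. It also assumes that, when computing training error, a training point may count as its own nearest neighbor; the assignment does not specify this.

```r
set.seed(7)
# Synthetic stand-in for the data set (assumption: columns x1, x2, y).
n   <- 30
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + dat$x2 + rnorm(n, sd = 0.1)

# Pairwise Euclidean distances between the two-dimensional inputs only.
D <- as.matrix(dist(dat[, c("x1", "x2")]))

# Predict y for point `target` as the mean y of its k nearest points in `train`.
knn_predict <- function(target, train, k) {
  nn <- train[order(D[target, train])[seq_len(k)]]
  mean(dat$y[nn])
}

# Training and testing MSE when observation i is held out.
fold_mse <- function(i, k) {
  train     <- setdiff(seq_len(n), i)
  test_err  <- (dat$y[i] - knn_predict(i, train, k))^2
  # Assumption: each training point is allowed to be its own neighbor.
  train_err <- mean(sapply(train, function(j) {
    (dat$y[j] - knn_predict(j, train, k))^2
  }))
  c(train = train_err, test = test_err)
}

# Average over all n folds for each k; res has rows "train" and "test".
ks  <- 1:10
res <- sapply(ks, function(k) rowMeans(sapply(seq_len(n), fold_mse, k = k)))

# The k with the smallest held-out error.
best_k <- ks[which.min(res["test", ])]
```

Under the own-neighbor assumption, the k = 1 training error is zero by construction, which is worth keeping in mind when comparing the two rows of `res`.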
Problem 6. (35 Points) In this problem, we will use 1NN classification and PCA to do facial
recognition.
a. (5 Points) Load the views P00A+000E+00, P00A+005E+10, P00A+005E-10, and P00A+010E+00 for
all subjects in the CroppedYale directory. Convert each photo to a vector; store the collection
as a matrix where each row is a photo. Give this matrix the name face_matrix_6a. Record the
subject number and view of each row of face_matrix_6a in a data frame. The subject numbers
will be used as our data labels.
Use the following commands to divide the data into training and testing sets:
fm_6a_size = dim(face_matrix_6a)
# Use 4/5 of the data for training, 1/5 for testing
ntrain_6a = floor(fm_6a_size[1]*4/5)
ntest_6a = fm_6a_size[1]-ntrain_6a
set.seed(1)
ind_train_6a = sample(1:fm_6a_size[1],ntrain_6a)
ind_test_6a = c(1:fm_6a_size[1])[-ind_train_6a]
Here ind_train_6a is the set of indices for the training data and ind_test_6a is the set of indices
for the testing data. What are the first 5 files (rows) in the training set? What are the first 5
files in the testing set? Specify their subject and view indices.
b. (5 Points) Do PCA on your training set and use the first 25 scores to represent your data.
Specifically, create the mean face from the training set, subtract off the mean face, and run
prcomp() on the resulting image matrix. Project your testing data onto the first 25 loadings so
that it is also represented by the first 25 scores. Do not rescale the scores. Use 1NN classification
in the space of the first 25 scores to identify the subject for each testing observation. In class
we discussed doing kNN classification by majority vote of the neighbors; in the 1NN case, there
is simply one vote. How many subjects are identified correctly? How many incorrectly? Plot
any subject photos that are misidentified next to the 1NN photo prediction.
c. (10 Points) Rerun parts (a) and (b) using the views P00A-035E+15, P00A-050E+00, P00A+035E+15,
and P00A+050E+00 for all subjects in the CroppedYale directory. Give this matrix the name
face_matrix_6c. For each image, record the subject number and view in a data frame. Use the
following commands to divide the data into training and testing sets:
fm_6c_size = dim(face_matrix_6c)
# Use 4/5 of the data for training, 1/5 for testing
ntrain_6c = floor(fm_6c_size[1]*4/5)
ntest_6c = fm_6c_size[1]-ntrain_6c
set.seed(2)
ind_train_6c = sample(1:fm_6c_size[1],ntrain_6c)
ind_test_6c = c(1:fm_6c_size[1])[-ind_train_6c]
Do PCA on your training set and use the first 25 scores to represent your data. Project your
testing data onto the first 25 loadings so that it is also represented by the first 25 scores. Use
1NN in the space of the first 25 scores to identify the subject for each testing observation. Do
not rescale the scores. How many subjects are identified correctly? How many incorrectly? Plot
any subject photos that are misidentified next to the 1NN photo prediction.
d. (5 Points) Rerun part (c) with 10 different training and testing splits. Display the number of
faces correctly identified and the number incorrectly identified for each. What do these numbers
tell us?
e. (10 Points) Compare the results for parts (b) and (c). Are the testing error rates different?
Observe that the views in (a) are closer to each other than those in (c), which cover a
much wider range of lighting. What does this tell you about PCA? In general, when does PCA
work better?
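The PCA-plus-1NN pipeline in Problem 6 can be sketched on synthetic data as follows. This is an illustration under assumptions rather than a solution: `face_matrix` and `subject` are random stand-ins for the real image matrix and labels, the sizes (10 subjects, 4 views, 50 pixels) are invented, and `predict_1nn()` is a hypothetical helper. Only the train/test split mirrors the commands given in the assignment.

```r
set.seed(3)
# Synthetic stand-in for the face matrix: 10 "subjects" x 4 "views",
# 50 "pixels" per image (assumption; real images are far larger).
n_subj <- 10; n_view <- 4; p <- 50
centers     <- matrix(rnorm(n_subj * p, sd = 3), nrow = n_subj)
subject     <- rep(seq_len(n_subj), each = n_view)
face_matrix <- centers[subject, ] + matrix(rnorm(n_subj * n_view * p), ncol = p)

# Same style of split as in the assignment: 4/5 training, 1/5 testing.
n         <- nrow(face_matrix)
ntrain    <- floor(n * 4 / 5)
ind_train <- sample(1:n, ntrain)
ind_test  <- c(1:n)[-ind_train]

# PCA on the training set only, after subtracting the training mean face.
mean_face <- colMeans(face_matrix[ind_train, ])
train_c   <- sweep(face_matrix[ind_train, ], 2, mean_face)
pca       <- prcomp(train_c, center = FALSE)

# First n_pc scores for training and testing data (not rescaled).
n_pc         <- 25
load_q       <- pca$rotation[, 1:n_pc]
train_scores <- train_c %*% load_q
test_scores  <- sweep(face_matrix[ind_test, ], 2, mean_face) %*% load_q

# 1NN: each test row takes the label of its closest training row.
predict_1nn <- function(z) {
  d2 <- rowSums(sweep(train_scores, 2, z)^2)   # squared Euclidean distances
  subject[ind_train][which.min(d2)]
}
pred      <- apply(test_scores, 1, predict_1nn)
n_correct <- sum(pred == subject[ind_test])
n_wrong   <- length(ind_test) - n_correct
```

The testing data are centered with the training mean face and projected onto the training loadings, so no information from the testing set enters the PCA. Repeating the sketch with different seeds gives a feel for how much the counts vary from split to split, as part (d) asks.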