CSC 411 Machine Learning & Data Mining: Assignment #2

1 Class-Conditional Gaussians (4%)
In this question, you will derive the maximum likelihood estimates for class-conditional Gaussians
with independent features (diagonal covariance matrices), i.e. Gaussian Naive Bayes, with shared
variances. Start with the following generative model for a discrete class label y ∈ {1, 2, …, K} and a
real-valued vector of d features x = (x_1, x_2, …, x_d):
p(y = k) = \alpha_k   (1)

p(x \mid y = k, \mu, \sigma) = \prod_{i=1}^{d} \left(2\pi\sigma_i^2\right)^{-1/2} \exp\left\{ -\sum_{i=1}^{d} \frac{1}{2\sigma_i^2} \left(x_i - \mu_{ki}\right)^2 \right\}   (2)
where α_k is the prior on class k, σ_i^2 is the shared variance for feature i (common to all classes), and µ_ki
is the mean of feature i conditioned on class k. We write α for the vector with elements α_k and,
similarly, σ for the vector of variances. The matrix of class means is written µ, where the k-th row
of µ is the mean for class k.
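For concreteness, here is a minimal NumPy sketch of evaluating this generative model; the array names (alpha, mu, sigma_sq) and their shapes are illustrative assumptions, not part of any provided code.

```python
import numpy as np

def class_conditional_log_density(x, mu, sigma_sq):
    """Log of equation (2): log p(x | y = k) for every class k.

    x        : (d,)   feature vector
    mu       : (K, d) per-class means mu_{ki}
    sigma_sq : (d,)   shared per-feature variances sigma_i^2
    Returns a length-K vector of log densities.
    """
    # -0.5 * sum_i log(2 pi sigma_i^2): same normalizer for every class
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * sigma_sq))
    # -sum_i (x_i - mu_{ki})^2 / (2 sigma_i^2), one value per class
    quad = -0.5 * np.sum((x - mu) ** 2 / sigma_sq, axis=1)
    return log_norm + quad

def joint_log_density(x, alpha, mu, sigma_sq):
    """log p(x, y = k) = log alpha_k + log p(x | y = k) for every class k."""
    return np.log(alpha) + class_conditional_log_density(x, mu, sigma_sq)
```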
1. Use Bayes’ rule to derive an expression for p(y = k|x, µ,σ). [Hint: Use the law of total probability to derive an expression for p(x|µ,σ).]
2. Write down an expression for the negative log-likelihood function (NLL)

\ell(\theta; D) = -\log p\left(y^{(1)}, x^{(1)}, y^{(2)}, x^{(2)}, \dots, y^{(N)}, x^{(N)} \mid \theta\right)   (3)

of a particular dataset D = {(y^(1), x^(1)), (y^(2), x^(2)), …, (y^(N), x^(N))} with parameters θ = {α, µ, σ}. (Assume that the data are i.i.d.)
3. Take partial derivatives of the likelihood with respect to each of the parameters µ_ki and with respect to the shared variances σ_i^2.
4. Find the maximum likelihood estimates for µ and σ.
5*. Extra work for interested students: If you are familiar with Lagrange multipliers, show that the MLE for α_k is indeed given by equation 4:

\alpha_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y^{(i)} = k]   (4)
2 Handwritten Digit Classification (11%)
For this question you will build classifiers to label images of handwritten digits. Each image is 8
by 8 pixels and is represented as a vector of dimension 64 by listing all the pixel values in raster
scan order. The images are grayscale and the pixel values are between 0 and 1. The labels y are
0, 1, 2, · · · , 9 corresponding to which character was written in the image. There are 700 training
cases and 400 test cases for each digit; they can be found in a2digits.zip.
Starter code written in Python is provided to help you load the data. A skeleton is also provided
for each question that you should use to format your solution. You are welcome to use any other
programming language, as long as your code is functional and modular.
Please note that if you are asked to report or compute quantities, these should be clearly displayed
in your written report. It is not sufficient to simply print them as an output of your code. The
same applies to plots and figures.
0. Load the data and plot the means for each of the digit classes in the training data (include
these in your report). Given that each image is a vector of size 64, the mean will be a vector of
size 64 which needs to be reshaped as an 8 × 8 2D array to be rendered as an image. Plot all
10 means side by side using the same scale.
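A minimal sketch of step 0, assuming the data have already been loaded into an array train_data of shape (N, 64) and a label vector train_labels (the provided starter code may use different names):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_class_means(train_data, train_labels):
    """Plot the mean image of each digit class side by side on a shared grayscale."""
    means = np.stack([train_data[train_labels == k].mean(axis=0) for k in range(10)])
    vmin, vmax = means.min(), means.max()          # shared grayscale limits for all panels
    fig, axes = plt.subplots(1, 10, figsize=(15, 2))
    for k, ax in enumerate(axes):
        ax.imshow(means[k].reshape(8, 8), cmap='gray', vmin=vmin, vmax=vmax)
        ax.set_title(str(k))
        ax.axis('off')
    plt.show()
```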
2.1 K-NN Classifier
1. Build a simple K nearest neighbor classifier using Euclidean distance on the raw pixel data.
(a) For K = 1 report the train and test classification accuracy.
(b) For K = 15 report the train and test classification accuracy.
2. For K > 1, K-NN might encounter ties that need to be broken in order to make a decision.
Choose any (reasonable) method you prefer and explain it briefly in your report.
3. Use 10-fold cross-validation to find the optimal K in the 1–15 range. You may use the KFold
implementation in sklearn or your existing code from Assignment 1. Report this value of
K along with the train classification accuracy, the average accuracy across folds, and the test
accuracy.
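One possible way to run the cross-validation in part 3 is sketched below; it uses sklearn's KNeighborsClassifier and KFold for brevity, though you may equally reuse your own K-NN implementation from part 1:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def select_k_by_cv(train_data, train_labels, k_range=range(1, 16), n_folds=10):
    """Return the K with the highest average validation accuracy across folds."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    avg_acc = {}
    for k in k_range:
        fold_acc = []
        for tr_idx, va_idx in kf.split(train_data):
            clf = KNeighborsClassifier(n_neighbors=k)
            clf.fit(train_data[tr_idx], train_labels[tr_idx])
            fold_acc.append(clf.score(train_data[va_idx], train_labels[va_idx]))
        avg_acc[k] = np.mean(fold_acc)
    best_k = max(avg_acc, key=avg_acc.get)
    return best_k, avg_acc
```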
2.2 Conditional Gaussian Classifier Training
Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density
is given by,
p(x \mid y = k, \mu, \Sigma_k) = (2\pi)^{-d/2} \, |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)   (5)
You should take p(y = k) = 1/10. You will compute parameters µ_kj and Σ_k for k ∈ {0, …, 9}, j ∈ {1, …, 64}.
You should implement the covariance computation yourself (i.e. without the aid of np.cov). [Hint:
To ensure numerical stability you may have to add a small positive value to the diagonal of each
covariance matrix. For this assignment you can add 0.01I to each matrix.]
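A minimal sketch of this fitting step, assuming train_data of shape (N, 64) and integer labels train_labels; the covariance is computed directly from the centered data (no np.cov), and the 0.01I stabilizer from the hint is added to each class covariance:

```python
import numpy as np

def fit_class_gaussians(train_data, train_labels, num_classes=10, eps=0.01):
    """MLE means and full covariances, one per class, without np.cov."""
    d = train_data.shape[1]
    means = np.zeros((num_classes, d))
    covariances = np.zeros((num_classes, d, d))
    for k in range(num_classes):
        X_k = train_data[train_labels == k]                     # (N_k, d) examples of class k
        means[k] = X_k.mean(axis=0)
        centered = X_k - means[k]
        covariances[k] = centered.T @ centered / X_k.shape[0]   # MLE covariance
        covariances[k] += eps * np.eye(d)                       # 0.01 I for numerical stability
    return means, covariances
```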
1. Plot an 8 by 8 image of the log of the diagonal elements of each covariance matrix Σk. Plot all
ten classes side by side using the same grayscale.
2. Using the parameters you fit on the training set and Bayes' rule, compute the average conditional log-likelihood, i.e. (1/N) Σ_{i=1}^{N} log p(y^(i) | x^(i), θ), on both the train and test set and report it (a minimal sketch of this computation follows this list).
3. Select the most likely posterior class for each training and test data point as your prediction,
and report your accuracy on the train and test set.
4*. Extra work for interested students: Compute the leading eigenvectors (largest eigenvalue)
for each class covariance matrix (can use np.linalg.eig) and plot them side by side as 8 by 8
images.
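One numerically stable way to carry out parts 2 and 3 is to work in log space. The sketch below is illustrative only; it assumes the means and covariances arrays produced by a fitting routine like the one sketched earlier, and it uses scipy.stats.multivariate_normal and scipy.special.logsumexp, neither of which you are required to use.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def posterior_log_probs(X, means, covariances, num_classes=10):
    """Return an (N, 10) array of log p(y = k | x^(i), theta)."""
    # log p(x^(i) | y = k) for every example and class
    log_cond = np.column_stack([
        multivariate_normal.logpdf(X, mean=means[k], cov=covariances[k])
        for k in range(num_classes)
    ])
    log_joint = log_cond + np.log(1.0 / num_classes)            # add log p(y = k) = log(1/10)
    log_evidence = logsumexp(log_joint, axis=1, keepdims=True)  # log p(x^(i))
    return log_joint - log_evidence

def avg_conditional_log_likelihood(X, y, means, covariances):
    """Average of log p(y^(i) | x^(i), theta) over the dataset."""
    y = np.asarray(y, dtype=int)
    log_post = posterior_log_probs(X, means, covariances)
    return np.mean(log_post[np.arange(len(y)), y])

def predict(X, means, covariances):
    """Most likely posterior class for each example."""
    return np.argmax(posterior_log_probs(X, means, covariances), axis=1)
```

Working with log p(x, y = k) and subtracting logsumexp over classes avoids underflow from the very small densities of 64-dimensional Gaussians.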
2.3 Naive Bayes Classifier Training
1. Convert the real-valued features x into binary features b using 0.5 as a threshold: b_j = 1 if
x_j > 0.5, otherwise b_j = 0.
2. Using these new binary features b and the class labels, train a Bernoulli Naive Bayes classifier
using MAP estimation with prior Beta(α, β) with α = β = 2. In particular, fit the model
below on the training set (a minimal fitting and sampling sketch follows this list).

p(y = k) = \frac{1}{10}   (6)

p(b_j = 1 \mid y = k) = \eta_{kj}   (7)

p(b \mid y = k, \eta) = \prod_{j=1}^{d} \eta_{kj}^{\,b_j} \, (1 - \eta_{kj})^{(1 - b_j)}   (8)

p(\eta_{kj}) = \text{Beta}(2, 2)   (9)

You should compute parameters η_kj for k ∈ {0, …, 9}, j ∈ {1, …, 64}.
Regularization: Instead of the prior, you could add two training cases to your data set for
each class, one which has every pixel off and one which has every pixel on. Make sure you
understand why this is equivalent to using a prior. You may use either scheme in your own
code.
3. Plot each of your ηk vectors as an 8 by 8 grayscale image. These should be presented side by
side and with the same scale.
4. Given your parameters, sample one new data point for each of the 10 digit classes. Plot these
new data points as 8 by 8 grayscale images side by side.
5. Using the parameters you fit on the training set and Bayes' rule, compute the average conditional log-likelihood, i.e. (1/N) Σ_{i=1}^{N} log p(y^(i) | x^(i), θ), on both the train and test set and report it.
6. Select the most likely posterior class for each training and test data point, and report your
accuracy on the train and test set.
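A minimal sketch of the MAP fit (part 2) and the sampling step (part 4), assuming binarized training features b_train of shape (N, 64) and integer labels train_labels; with a Beta(2, 2) prior the MAP estimate of η_kj reduces to adding one "on" and one "off" pseudo-count per pixel, which is exactly the two-extra-training-cases scheme described above.

```python
import numpy as np

def fit_naive_bayes_map(b_train, train_labels, num_classes=10):
    """MAP estimate of eta_{kj} under a Beta(2, 2) prior."""
    d = b_train.shape[1]
    eta = np.zeros((num_classes, d))
    for k in range(num_classes):
        B_k = b_train[train_labels == k]                # (N_k, d) binary features of class k
        # Beta(2, 2) MAP: (count of ones + 1) / (N_k + 2), i.e. one extra "on"
        # and one extra "off" pseudo-example per pixel
        eta[k] = (B_k.sum(axis=0) + 1.0) / (B_k.shape[0] + 2.0)
    return eta

def sample_from_model(eta, rng=None):
    """Draw one binary image per class: pixel j is on with probability eta_{kj}."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(eta.shape) < eta).astype(float)  # shape (10, 64)
```

Plotting each sampled row reshaped to 8 × 8 gives the images requested in part 4.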
2.4 Model Comparison
Briefly (in a few sentences) summarize the performance of each model. Which performed best?
Which performed worst? Did this match your expectations?
Submission
Assignments must be submitted by 10:00 pm on November 12th; a late penalty of 10% per day
will be assessed thereafter (up to 3 days, after which submission is blocked). We highly recommend using Python 3.6 embedded in Anaconda 3.4 to solve the programming questions. However, you are
welcome to use any programming language you like, provided that your code solves the problem and
is functional and modular.
What to submit
• All work to be submitted via Markus.
• All mathematical work has to be clearly marked with the question it belongs to; we highly
recommend using a math editor such as MathType, Word Equations, or LaTeX. However, if
you prefer to use paper and pen(cil), you are welcome to do so as long as your handwriting
is readable. Unreadable work will result in losing all marks for that question.
• Your written report should be submitted in PDF format. Note that Markus has filesize limits
and it is your responsibility to ensure that your work is uploaded on time.
• Your code must be clearly structured with an obvious entry point.
• All graphs and plots have to be clearly marked by their question. Use readable grids whenever
applicable.