# Homework 2 – V2, CSC311


1. Logistic Regression – 30 pts. In this question, we will look at logistic regression from
a probabilistic perspective.
1.1. Bayes' Rule – 10 pts. Suppose you have a D-dimensional data vector $x = (x_1, \ldots, x_D)^\top$ and an associated class variable $t \in \{0, 1\}$, which is a Bernoulli random variable with parameter $\alpha$ (i.e., $P(t = 1) = \alpha$ and $P(t = 0) = 1 - \alpha$). Assume that the dimensions of $x$ are conditionally independent given $t$, and that the conditional distribution of each $x_i$ is Gaussian with $\mu_{i0}$ and $\mu_{i1}$ as the means of the two classes and $\sigma_i$ as their shared standard deviation, i.e., $x_i \mid t \sim \mathcal{N}(\mu_{it}, \sigma_i^2)$.
Use Bayes' rule to show that $p(t = 1 \mid x)$ takes the form of a logistic function:

$$p(t = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{D} w_i x_i - b\right)}$$

Derive expressions for the weights $w = (w_1, \ldots, w_D)^\top$ and the bias $b$ in terms of the parameters of the class likelihoods and priors (i.e., $\mu_{i0}$, $\mu_{i1}$, $\sigma_i$, and $\alpha$).
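The claim can be sanity-checked numerically. In the sketch below, `w` and `b` are one parameterization consistent with the log-odds (your derivation should produce equivalent expressions); the dimensions, means, and variances are arbitrary stand-ins:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Univariate Gaussian density N(x | mu, sigma^2), applied elementwise
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
D = 4
alpha = 0.3
mu0, mu1 = rng.normal(size=D), rng.normal(size=D)
sigma = rng.uniform(0.5, 2.0, size=D)
x = rng.normal(size=D)

# Posterior computed directly from Bayes' rule under the conditionally
# independent Gaussian model
p1 = alpha * np.prod(gauss(x, mu1, sigma))
p0 = (1 - alpha) * np.prod(gauss(x, mu0, sigma))
posterior = p1 / (p0 + p1)

# The same posterior via the logistic form sigma(w^T x + b), with w and b
# read off from the log-odds (one way the derivation can be grouped)
w = (mu1 - mu0) / sigma ** 2
b = np.log(alpha / (1 - alpha)) + np.sum((mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2))
logistic = 1.0 / (1.0 + np.exp(-(w @ x + b)))

assert np.isclose(posterior, logistic)
```

If the two numbers agree for several random draws, the algebra is almost certainly right.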
1.2. Maximum Likelihood Estimation – 10 pts. Now suppose you are given a training set $\mathcal{D} = \{(x^{(1)}, t^{(1)}), \ldots, (x^{(N)}, t^{(N)})\}$. Consider a binary logistic regression classifier of the same form as before:

$$p(t^{(n)} = 1 \mid x^{(n)}, w, b) = \sigma(w^\top x^{(n)} + b) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{D} w_i x_i^{(n)} - b\right)}$$

Derive an expression for $\mathcal{L}(w, b)$, the negative log-likelihood of $t^{(1)}, \ldots, t^{(N)}$ given $x^{(1)}, \ldots, x^{(N)}$ and the model parameters, under the i.i.d. assumption. Then derive expressions for the derivatives of $\mathcal{L}$ with respect to each of the model parameters.
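Whatever derivatives you derive can be validated with a finite-difference check. The sketch below uses the standard logistic-regression gradients on synthetic data; all names and values here are illustrative, not part of the assignment's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, X, t):
    # Negative log-likelihood of Bernoulli targets under the logistic model
    y = sigmoid(X @ w + b)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grads(w, b, X, t):
    # The standard results: dL/dw = X^T (y - t), dL/db = sum(y - t)
    y = sigmoid(X @ w + b)
    return X.T @ (y - t), np.sum(y - t)

rng = np.random.default_rng(1)
N, D = 20, 3
X = rng.normal(size=(N, D))
t = rng.integers(0, 2, size=N).astype(float)
w, b = rng.normal(size=D), 0.5

gw, gb = grads(w, b, X, t)

# Finite-difference check of the analytic gradient
eps = 1e-6
for j in range(D):
    e = np.zeros(D); e[j] = eps
    num = (nll(w + e, b, X, t) - nll(w - e, b, X, t)) / (2 * eps)
    assert np.isclose(gw[j], num, atol=1e-4)
num_b = (nll(w, b + eps, X, t) - nll(w, b - eps, X, t)) / (2 * eps)
assert np.isclose(gb, num_b, atol=1e-4)
```

A check of this sort is also useful later when you complete the provided `logistic` function.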
1.3. L2 Regularization – 10 pts. Now we treat the $x^{(i)}$'s as deterministic and assume that a Gaussian prior is placed on each element of $w$ such that $p(w_i) = \mathcal{N}(w_i \mid 0, 1/\lambda)$, and an "improper" flat prior on $b$ such that $p(b) = 1$.¹ Derive an expression that is proportional to $p(w, b \mid \mathcal{D})$, the posterior distribution of $w$ and $b$, based on this prior and the likelihood defined above. The expression you derive must contain all terms that depend on $w$ and $b$.
The posterior distribution for $w$ and $b$ is proportional to the product of this prior and the likelihood of $t^{(1)}, \ldots, t^{(N)}$:

$$p(w, b \mid t^{(1)}, \ldots, t^{(N)}) \propto p(w)\, p(b)\, p(t^{(1)}, \ldots, t^{(N)} \mid w, b)$$

Define $\mathcal{L}_{\text{post}}(w, b)$ to be the negative logarithm of this posterior. Show that $\mathcal{L}_{\text{post}}(w, b)$ takes the following form:

$$\mathcal{L}_{\text{post}}(w, b) = \mathcal{L}(w, b) + \frac{\lambda}{2} \sum_{i=1}^{D} w_i^2 + C$$

where $C$ is a term that depends on $\lambda$ but not on either $w$ or $b$. What are the derivatives of $\mathcal{L}_{\text{post}}$ with respect to each of the model parameters?
2. Logistic Regression vs. KNN – 30 pts. In this section you will compare the performance and characteristics of different classifiers, namely Logistic Regression and k-Nearest
Neighbors. You will extend the provided code and experiment with these extensions. Note that
you should understand the code first instead of using it as a black box.
Python: you should use Python with both the NumPy and Matplotlib packages installed.
The data you will be working with are hand-written digits, 4s and 9s, represented as 28×28 pixel arrays. There are two training sets: `mnist_train`, which contains 80 examples of each class, and `mnist_train_small`, which contains 5 examples of each class. There is also a validation set `mnist_valid` that you should use for model selection, and a test set `mnist_test`.
Code for visualizing the datasets has been included in `plot_digits`.
2.1. k-Nearest Neighbors – 5 pts. Use the supplied kNN implementation to predict labels for `mnist_valid`, using `mnist_train` as the training set.
Write a script that runs kNN for different values of k ∈ {1, 3, 5, 7, 9} and plots the classification rate on the validation set (number of correctly predicted cases, divided by total number of data points) as a function of k.
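As a sketch of such a script, the loop below runs a minimal NumPy kNN over k ∈ {1, 3, 5, 7, 9} on synthetic two-class data; `run_knn` and the Gaussian-blob data are stand-ins for the supplied implementation and the MNIST subsets:

```python
import numpy as np

def run_knn(k, train_X, train_t, test_X):
    # Predict by majority vote among the k nearest training points
    # (squared Euclidean distance; ties cannot occur for odd k)
    d = np.sum((test_X[:, None, :] - train_X[None, :, :]) ** 2, axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_t[nearest]
    return (np.mean(votes, axis=1) >= 0.5).astype(int)

rng = np.random.default_rng(0)
n = 40  # examples per class (stand-in for the MNIST subsets)
train_X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(3, 1, (n, 2))])
train_t = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
valid_X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(3, 1, (n, 2))])
valid_t = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

rates = {}
for k in [1, 3, 5, 7, 9]:
    pred = run_knn(k, train_X, train_t, valid_X)
    rates[k] = np.mean(pred == valid_t)  # classification rate on validation
```

Plotting the rate against k is then a single Matplotlib call, e.g. `plt.plot(list(rates), list(rates.values()))`.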
Comment on the performance of the classifier and argue which value of k you would choose. What is the classification rate for $k^*$, your chosen value of $k$? Also compute the rate for $k^* + 2$ and $k^* - 2$. Does the test performance for these values of $k$ correspond to the validation performance?² Why or why not?
¹ This is like a uniform distribution on the entire real line, but it is improper and doesn't really qualify as a density.
² In general you shouldn't peek at the test set multiple times, but for the purposes of this question it can be an illustrative exercise.
2.2. Logistic regression – 10 pts. Look through the code in `logistic_regression_template` and `logistic`. Complete the implementation of logistic regression by providing the missing part of `logistic`.
Run the code on both `mnist_train` and `mnist_train_small`. You will need to experiment with the hyperparameters for the learning rate, the number of iterations (with a smaller learning rate, your model will take longer to converge), and the way in which you initialize the weights. If you get NaN/Inf errors, try reducing your learning rate or initializing with smaller weights.
Report which hyperparameter settings you found worked the best and the final cross entropy
and classification error on the training, validation and test sets. Note that you should only compute
the test error once you have selected your best hyperparameter settings using the validation set.
Next, look at how the cross entropy changes as training progresses. Submit 2 plots, one for each of `mnist_train` and `mnist_train_small`. In each plot show two curves: one for the training set and one for the validation set. Run your code several times and observe whether the results change. If they do, how would you choose the best parameter settings?
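A minimal training loop of the sort this question asks for might look like the following; everything here (the synthetic data, the 0.1 learning rate, 200 iterations, the small-weight initialization) is an illustrative assumption, not the template's actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, t):
    # Clipping guards against log(0) -> Inf, one source of NaN/Inf errors
    y = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
t = (sigmoid(X @ true_w) > rng.uniform(size=N)).astype(float)

lr, iters = 0.1, 200           # hyperparameters to tune
w = 0.01 * rng.normal(size=D)  # small initial weights also help avoid NaN/Inf
b = 0.0
history = []                   # training cross entropy per iteration
for _ in range(iters):
    y = sigmoid(X @ w + b)
    w -= lr * X.T @ (y - t) / N
    b -= lr * np.mean(y - t)
    history.append(cross_entropy(w, b, X, t))

assert history[-1] < history[0]  # the loss should decrease
```

For the plots, record a second history on the validation set inside the same loop and plot both curves against the iteration count.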
2.3. Penalized logistic regression – 15 pts. Implement the penalized logistic regression model you derived in Question 1.3 by modifying `logistic` to include a regularizer. Call the new function `logistic_pen`. You should only
penalize the weights and not the bias term, as it only controls the height of the function but
not its complexity. Note that you can omit the C(λ) term in your error computation, since its
derivative is 0 w.r.t. the weights and bias.
Run the code on both `mnist_train` and `mnist_train_small`. As in Question 2.2, you will need to experiment with the learning rate, the number of iterations, and the weight initialization; if you get NaN/Inf errors, try reducing the learning rate or initializing with smaller weights.
Choose a hyperparameter setting which seems to work well (for learning rate, number of iterations, and weight initialization). With these hyperparameters, do the following for each value of
the penalty parameter λ ∈ {0, 0.001, 0.01, 0.1, 1.0}:
• Train on both `mnist_train` and `mnist_train_small`, and report cross entropy and classification error on the training and validation sets.
• Look at how the cross entropy changes as training progresses. Submit 2 plots, one for each of `mnist_train` and `mnist_train_small`. In each plot show two curves: one for the training set and one for the validation set.
To do the comparison systematically, you should write a script that includes a loop to evaluate different values of λ automatically. You should also re-run logistic regression at least 5 times for each value of λ when reporting the classification error and cross entropy.
You will therefore need two nested loops: the outer loop is over values of λ; the inner loop is over multiple re-runs. Average the evaluation metrics (cross entropy and classification error) over the different re-runs. In the end, plot the average cross entropy and classification error against λ; you only need to show plots for one run at each λ value. So for each of `mnist_train` and `mnist_train_small` you will have 2 plots: one plot for cross entropy and another plot for classification error. Each plot will have two curves, one for training and one for validation.
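The nested-loop script could be sketched like this; `train_once` and the synthetic data are stand-ins for the provided `logistic_pen` training code, and the re-runs differ only through the random weight initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_once(X, t, lam, rng, lr=0.1, iters=200):
    # One run of penalized logistic regression;
    # returns (cross entropy, classification error) on the training data
    N, D = X.shape
    w, b = 0.01 * rng.normal(size=D), 0.0
    for _ in range(iters):
        y = sigmoid(X @ w + b)
        w -= lr * (X.T @ (y - t) / N + lam * w)  # penalty adds lam*w; bias unpenalized
        b -= lr * np.mean(y - t)
    y = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    ce = -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))
    err = np.mean((y >= 0.5) != t.astype(bool))
    return ce, err

rng = np.random.default_rng(0)
N, D = 60, 4
X = rng.normal(size=(N, D))
t = (X[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(float)

results = {}
for lam in [0.0, 0.001, 0.01, 0.1, 1.0]:                   # outer loop: penalty values
    runs = [train_once(X, t, lam, rng) for _ in range(5)]  # inner loop: re-runs
    results[lam] = tuple(np.mean(runs, axis=0))            # averaged (ce, error)
```

Plotting `results` against λ (typically on a log scale for λ) gives the curves the question asks for.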
How do the cross entropy and classification error change when you increase λ? Do they go up,
down, first up and then down, or down and then up? Explain why you think they behave this
way. Which is the best value of λ, based on your experiments? Report the test error for the best
value of λ.
Compare the results with and without penalty. Which one performed better for which data
set? Why do you think this is the case?
3. Neural Networks (40 points). Here you will experiment on a subset of the Toronto
Faces Dataset (TFD). Some code that partially implements a regular neural network is included
with this assignment (in Python).
We subsample 3374, 419 and 385 grayscale images from TFD as the training, validation and
testing set respectively. Each image is of size 48 × 48 and contains a face that has been extracted
from a variety of sources. The faces have been rotated, scaled and aligned to make the task
easier. The faces have been labeled by experts and research assistants based on their expression.
These expressions fall into one of seven categories: 1-Anger, 2-Disgust, 3-Fear, 4-Happy, 5-Sad,
6-Surprise, 7-Neutral. We show one example face per class in Figure 1.
Fig 1: Example faces. From left to right, the corresponding class is from 1 to 7.
Code for training a (fully connected) neural network is partially provided in the following file.
• `nn.py`: Trains a fully connected neural network with two hidden layers.
You need to fill in some portions of the code:
• Performing the backward pass of the network.
• Performing the weight update with momentum (covered in tutorial).
First, follow the instructions in the files to complete the code.
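For the momentum update, a generic heavy-ball step looks like the following; this is a sketch on a toy quadratic, not the actual `nn.py` code, and the `eps` and `momentum` names simply mirror the hyperparameters discussed below:

```python
import numpy as np

def momentum_step(w, v, grad, eps=0.01, momentum=0.9):
    # v accumulates a decaying running sum of past gradients;
    # the weights then move along v rather than the raw gradient
    v = momentum * v - eps * grad
    w = w + v
    return w, v

# Minimal usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)

assert np.linalg.norm(w) < 0.1  # converges toward the minimum at 0
```

With momentum 0.0 this reduces to plain gradient descent, which is a useful sanity check when you implement the update in `nn.py`.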
3.1. Basic generalization [10 points]. Train a regular NN with the default set of hyperparameters. Examine the statistics and plots of training error and validation error (generalization). How does the network's performance differ on the training set versus the validation set during learning? Show a plot of the error curves (training and validation).
3.2. Optimization [10 points]. Try different values of the learning rate ε ("eps"). Try 5 different settings of ε from 0.001 to 1.0. What happens to the convergence properties of the algorithm (looking at both cross-entropy and percent-correct)? Try 3 values of momentum from 0.0 to 0.9. How does momentum affect the convergence rate? Try 5 different mini-batch sizes, from 1 to 1000. How does mini-batch size affect convergence? How would you choose the best value of these parameters? In each case, hold the other parameters constant while you vary the one you are studying.
3.3. Model architecture [10 points]. Fix momentum to be 0.9. Try 3 different values of the
number of hidden units for each layer of the fully connected network (range from 2 to 100). You
might need to adjust the learning rate and the number of epochs. Comment on the effect of this
modification on the convergence properties, and the generalization of the network.
3.4. Network Uncertainty [10 points]. Plot some examples where the neural network is not confident of the classification output (the top score is below some threshold), and comment on them. Will the classifier be correct if it outputs the top-scoring class anyway?