# Homework 5 – Machine Learning CS4342 solved


## Description


1. 3-layer neural network [70 points]: In this problem you will implement and train a 3-layer neural
network to classify images of hand-written digits from the MNIST dataset. Similarly to Homework 3,
the input to the network will be a 28 × 28-pixel image (converted into a 784-dimensional vector); the
output will be a vector of 10 probabilities (one for each digit). Specifically, the network you create
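should implement a function $g : \mathbb{R}^{784} \to \mathbb{R}^{10}$, where:

$$
\begin{aligned}
z^{(1)} &= W^{(1)} x + b^{(1)} \\
h^{(1)} &= \mathrm{relu}(z^{(1)}) \\
z^{(2)} &= W^{(2)} h^{(1)} + b^{(2)} \\
\hat{y} &= g(x) = \mathrm{softmax}(z^{(2)})
\end{aligned}
$$
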
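Computing each of the intermediate outputs $z^{(1)}$, $h^{(1)}$, $z^{(2)}$, and $\hat{y}$ is known as forwards propagation
since it follows the direction of the edges in the directed graph shown below:

[Figure: directed graph of the network across Layer 1, Layer 2, and Layer 3, with nodes $x$, $z^{(1)}$, $h^{(1)}$, $z^{(2)}$, and $\hat{y}$, and edges labeled $W^{(1)}$, $b^{(1)}$ and $W^{(2)}$, $b^{(2)}$.]
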
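The forward pass defined above can be sketched in NumPy as follows. This is a minimal illustration, not the starter code: the column-per-example layout, the function names, the hidden size of 40, and the initialization scales are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the per-column maximum before exponentiating for numerical stability.
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def forward(x, W1, b1, W2, b2):
    """x has shape (784, n), one example per column (assumed layout)."""
    z1 = W1 @ x + b1[:, None]      # (num_hidden, n)
    h1 = relu(z1)                  # (num_hidden, n)
    z2 = W2 @ h1 + b2[:, None]     # (10, n)
    yhat = softmax(z2)             # (10, n), each column sums to 1
    return z1, h1, z2, yhat

# Toy usage with random inputs standing in for MNIST images.
rng = np.random.default_rng(0)
num_hidden = 40                                   # illustrative choice
W1 = rng.normal(scale=0.05, size=(num_hidden, 784))
b1 = 0.01 * np.ones(num_hidden)                   # small positive biases
W2 = rng.normal(scale=0.05, size=(10, num_hidden))
b2 = 0.01 * np.ones(10)
x = rng.random((784, 5))
z1, h1, z2, yhat = forward(x, W1, b1, W2, b2)
print(yhat.shape, yhat.sum(axis=0))
```

If the data arrays store one example per row instead, transpose them before calling `forward` or rewrite the products accordingly.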
Loss function: For the MNIST dataset you should use the cross-entropy loss function:

$$
f_{\mathrm{CE}}(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{10} y_k^{(i)} \log \hat{y}_k^{(i)}
$$

where n is the number of examples.
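
As a rough illustration of the cross-entropy formula, the sketch below evaluates it for one-hot labels stored one example per column. The layout, the function name, and the small `eps` guard against `log(0)` are assumptions of this example rather than part of the assignment.

```python
import numpy as np

def cross_entropy(yhat, y, eps=1e-12):
    """Mean cross-entropy; yhat and y have shape (10, n), y is one-hot."""
    n = y.shape[1]
    # eps is an assumed safeguard so that log never receives exactly 0.
    return -np.sum(y * np.log(yhat + eps)) / n

# Two examples with true digits 3 and 7, and a uniform prediction of 0.1 per class.
y = np.eye(10)[:, [3, 7]]
yhat_uniform = np.full((10, 2), 0.1)
loss = cross_entropy(yhat_uniform, y)
print(loss)  # close to log(10), the loss of a uniform guess over 10 classes
```

A freshly initialized network usually starts near this uniform-guess value, which makes it a handy sanity check before training.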
Gradient descent: To train the neural network, you should use stochastic gradient descent (SGD).
The hard part is computing the individual gradient terms. This can be done efficiently using backwards
propagation (“backprop”), so named because it proceeds opposite to the direction of the edges
in the network graph above. You should start by initializing the weights randomly, and initializing the
bias terms to small positive numbers. The reason is that – due to the relu activation function which
has a gradient of 0 whenever its argument is less than 0 – we want to give enough bias to “encourage”
the argument of relu to be positive. This is already performed for you in the starter code. Then,
update the weights according to SGD using the gradient expressions shown below. (Note that these
expressions are obtained by deriving and multiplying the Jacobian matrices as described in class, and
then simplifying the result analytically.)
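
$$
\begin{aligned}
\nabla_{W^{(2)}} f_{\mathrm{CE}} &= (\hat{y} - y)\, h^{(1)\top} \\
\nabla_{b^{(2)}} f_{\mathrm{CE}} &= (\hat{y} - y) \\
\nabla_{W^{(1)}} f_{\mathrm{CE}} &= g\, x^{\top} \\
\nabla_{b^{(1)}} f_{\mathrm{CE}} &= g
\end{aligned}
$$

where column-vector $g$ is defined so that

$$
g^{\top} = \left( (\hat{y} - y)^{\top} W^{(2)} \right) \odot \mathrm{relu}'\!\left(z^{(1)\top}\right)
$$

In the equation above, $\mathrm{relu}'$ is the derivative of relu and $\odot$ denotes the element-wise product. Also, make sure that you follow the transposes exactly!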
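
The gradient expressions can be sketched for a minibatch as below. The sketch assumes one example per column, averages the per-example gradients over the batch, and takes relu′ to be 0 at exactly 0. All function names, the toy sizes, and the learning rate are illustrative assumptions, not the starter code's interface.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1[:, None]
    h1 = np.maximum(z1, 0.0)
    z2 = W2 @ h1 + b2[:, None]
    z2 = z2 - z2.max(axis=0, keepdims=True)   # stabilize softmax
    e = np.exp(z2)
    return z1, h1, e / e.sum(axis=0, keepdims=True)

def loss(x, y, W1, b1, W2, b2):
    _, _, yhat = forward(x, W1, b1, W2, b2)
    return -np.sum(y * np.log(yhat)) / x.shape[1]

def gradients(x, y, W1, b1, W2, b2):
    n = x.shape[1]
    z1, h1, yhat = forward(x, W1, b1, W2, b2)
    delta = yhat - y                    # (10, n): yhat - y for each example
    gW2 = delta @ h1.T / n              # batch average of (yhat - y) h1^T
    gb2 = delta.mean(axis=1)
    # Column form of g^T = ((yhat - y)^T W2) elementwise-times relu'(z1^T).
    g = (W2.T @ delta) * (z1 > 0)
    gW1 = g @ x.T / n                   # batch average of g x^T
    gb1 = g.mean(axis=1)
    return gW1, gb1, gW2, gb2

def sgd_step(x, y, params, lr):
    """One SGD update on a minibatch; params = (W1, b1, W2, b2)."""
    grads = gradients(x, y, *params)
    return tuple(p - lr * g for p, g in zip(params, grads))

# Toy minibatch with random data standing in for MNIST.
rng = np.random.default_rng(0)
num_hidden, n = 30, 16
x = rng.random((784, n))
y = np.eye(10)[:, rng.integers(0, 10, size=n)]
params = (rng.normal(scale=0.01, size=(num_hidden, 784)), 0.01 * np.ones(num_hidden),
          rng.normal(scale=0.01, size=(10, num_hidden)), 0.01 * np.ones(10))
gW1, gb1, gW2, gb2 = gradients(x, y, *params)
loss_before = loss(x, y, *params)
loss_after = loss(x, y, *sgd_step(x, y, params, lr=0.01))
print(loss_before, loss_after)
```

Regularization is left out here; if an L2 penalty on the weights is used, its gradient would be added to `gW1` and `gW2`.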
Hyperparameter tuning: In this problem, there are several different hyperparameters that will
impact the network’s performance:
• Number of units in the hidden layer (suggestions: {30, 40, 50})
• Learning rate (suggestions: {0.001, 0.005, 0.01, 0.05, 0.1, 0.5})
• Minibatch size (suggestions: {16, 32, 64, 128, 256})
• Number of epochs
• Regularization strength
In order not to “cheat” – and thus overestimate the performance of the network – it is crucial to
optimize the hyperparameters only on the validation set; do not use the test set. (The training set
would be ok but typically leads to worse performance.)
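
One possible way to organize the search is sketched below: enumerate the suggested values, sample a fixed number of settings, and keep the one with the lowest validation loss. The function name `findBestHyperparameters` comes from the assignment, but its signature here, the `train_and_validate` callable, and the stand-in scorer are hypothetical. A real scorer would train on the training set and return the cross-entropy on the validation set.

```python
import itertools
import random

def findBestHyperparameters(train_and_validate, grid, num_settings=10, seed=0):
    """Try num_settings randomly chosen settings; return the best one and its loss.

    train_and_validate(setting) is a hypothetical callable that trains a network
    with the given setting and returns its validation cross-entropy.
    """
    keys = sorted(grid)
    all_settings = [dict(zip(keys, values))
                    for values in itertools.product(*(grid[k] for k in keys))]
    random.Random(seed).shuffle(all_settings)
    best_setting, best_loss = None, float("inf")
    for setting in all_settings[:num_settings]:
        val_loss = train_and_validate(setting)
        print(setting, "validation CE:", val_loss)
        if val_loss < best_loss:
            best_setting, best_loss = setting, val_loss
    return best_setting, best_loss

grid = {
    "num_hidden": [30, 40, 50],
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
    "batch_size": [16, 32, 64, 128, 256],
}

# Stand-in scorer so the sketch runs without data; NOT a real training run.
def fake_validation_loss(setting):
    return abs(setting["learning_rate"] - 0.05) + 1.0 / setting["num_hidden"]

best, best_loss = findBestHyperparameters(fake_validation_loss, grid)
print("selected:", best, best_loss)
```

Number of epochs and regularization strength can be added as further keys in `grid`. The test set never enters this loop.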
Your task: train the network with respect to the parameters $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, and $b^{(2)}$. Specifically:
(a) Implement stochastic gradient descent for the network shown above. [40 points]
(b) Implement the pack and unpack functions shown in the starter code. Use these to verify that
your implemented cost and gradient functions are correct (the discrepancy should be less than
0.01) using a numerical derivative approximation – see the call to check_grad in the starter code.
[10 points]
(c) Optimize the hyperparameters by training on the training set and selecting the parameter settings
that optimize performance on the validation set. You should systematically (i.e., in code)
try at least 10 (in total, not for each hyperparameter) different hyperparameter
settings; accordingly, make sure there is a method called findBestHyperparameters (and please
name it as such to help us during grading) [15 points]. Include a screenshot showing the progress
and final output (selected hyperparameter values) of your hyperparameter optimization.
(d) After you have optimized your hyperparameters, then run your trained network on the test
set and report (1) the cross-entropy and (2) the accuracy (percent correctly classified images).
Include a screenshot showing both these values during the last 20 epochs of SGD. The (unregularized) cross-entropy cost on the test set should be less than 0.16, and the accuracy
(percentage correctly classified test images) should be at least 95%. [5 points]
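
For part (b), gradient checkers such as `scipy.optimize.check_grad` operate on a single flat parameter vector, which is why packing and unpacking are needed. The sketch below is one plausible version. The actual signatures in the starter code may differ, and the ordering of the parameters inside the vector and the `num_hidden` argument are assumptions of this example.

```python
import numpy as np

NUM_INPUT, NUM_OUTPUT = 784, 10

def pack(W1, b1, W2, b2):
    """Flatten all parameters into one 1-D vector (assumed order: W1, b1, W2, b2)."""
    return np.concatenate([W1.ravel(), b1.ravel(), W2.ravel(), b2.ravel()])

def unpack(w, num_hidden):
    """Inverse of pack: recover W1, b1, W2, b2 from the flat vector w."""
    sizes = [num_hidden * NUM_INPUT, num_hidden, NUM_OUTPUT * num_hidden, NUM_OUTPUT]
    if w.size != sum(sizes):
        raise ValueError("flat vector has the wrong length for this hidden size")
    W1, b1, W2, b2 = np.split(w, np.cumsum(sizes)[:-1])
    return (W1.reshape(num_hidden, NUM_INPUT), b1,
            W2.reshape(NUM_OUTPUT, num_hidden), b2)

# Round-trip check on random parameters.
rng = np.random.default_rng(0)
num_hidden = 40
W1 = rng.normal(size=(num_hidden, NUM_INPUT))
b1 = rng.normal(size=num_hidden)
W2 = rng.normal(size=(NUM_OUTPUT, num_hidden))
b2 = rng.normal(size=NUM_OUTPUT)
w = pack(W1, b1, W2, b2)
W1r, b1r, W2r, b2r = unpack(w, num_hidden)
print(w.shape, np.array_equal(W1, W1r), np.array_equal(W2, W2r))
```

The cost and gradient functions passed to the checker would then each begin by calling `unpack` on their flat argument, and the gradient function would end by calling `pack` on the four gradient arrays.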
Datasets: You should use the following datasets, which are a superset of what I gave you for previous
assignments:
• https://s3.amazonaws.com/jrwprojects/mnist_train_images.npy
• https://s3.amazonaws.com/jrwprojects/mnist_train_labels.npy
• https://s3.amazonaws.com/jrwprojects/mnist_test_images.npy
• https://s3.amazonaws.com/jrwprojects/mnist_test_labels.npy
• https://s3.amazonaws.com/jrwprojects/mnist_validation_images.npy
• https://s3.amazonaws.com/jrwprojects/mnist_validation_labels.npy
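
Once the files have been downloaded, they could be loaded with a small helper like the one below. The file-name pattern follows the URLs above. The helper name `load_split`, the local directory argument, and the assumption that the first axis of each array indexes examples are illustrative and should be checked against the actual arrays.

```python
from pathlib import Path

import numpy as np

def load_split(data_dir, split):
    """Load images and labels for split in {'train', 'validation', 'test'}.

    data_dir is a local folder assumed to contain the downloaded .npy files.
    """
    if split not in ("train", "validation", "test"):
        raise ValueError(f"unknown split: {split}")
    data_dir = Path(data_dir)
    images = np.load(data_dir / f"mnist_{split}_images.npy")
    labels = np.load(data_dir / f"mnist_{split}_labels.npy")
    # Assumption: the first axis indexes examples in both arrays.
    if images.shape[0] != labels.shape[0]:
        raise ValueError("images and labels disagree on the number of examples")
    return images, labels

# Example usage (uncomment once the files are in ./data):
# X_val, Y_val = load_split("data", "validation")
# print(X_val.shape, Y_val.shape)
```

Printing the shapes after loading is a quick way to confirm whether examples are stored as rows and whether the labels are one-hot vectors.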