EE599 Deep Learning – Homework 2

1 Character recognition using a trained MLP
The purpose of this problem is to have you program, using numpy, the feedforward processing for a multilayer
perceptron (MLP) deep neural network. This processing is defined on slide 33 of the “introduction” lecture
slides. The grading of this homework will give you feedback on your feedforward (FF) program, which is valuable
since in a future homework you will build on it to program the back-propagation as well.
The MNIST dataset of handwritten digits is widely used as a beginner dataset for benchmarking machine
learning classifiers. It has 784 input features (pixel values in each image) and 10 output classes representing
the digits 0–9. We have trained an MLP on MNIST, with a 784-neuron input layer, 2 hidden layers of 200 and
100 neurons, and a 10-neuron output layer. The activation functions used are ReLU for the hidden layers
and softmax for the output layer.
(a) Extract the weights and biases of the trained network from mnist_network_params.hdf5. The file
has 6 keys corresponding to numpy arrays W1, b1, W2, b2, W3, b3. You may want to check their
dimensions by using the ‘shape’ attribute of a numpy array.
(b) The file mnist_testdata.hdf5 contains 10,000 images in the key ‘xdata’ and their corresponding labels
in the key ‘ydata’. Extract these. Note that each image is 784-dimensional and each label is one-hot
10-dimensional. So if the label for an image is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], it means the image depicts a 3.
(c) Write functions for calculating ReLU and softmax. These are given as:
• ReLU(x) = max(0, x)
• Softmax(x) = [ e^{x_1} / Σ_{i=1}^{n} e^{x_i}, e^{x_2} / Σ_{i=1}^{n} e^{x_i}, ..., e^{x_n} / Σ_{i=1}^{n} e^{x_i} ]
Note that the softmax function operates on a vector of size n and returns another vector of size n which is a probability distribution.
For example, Softmax([0, 1, 2]) = [0.09, 0.24, 0.67]. This indicates that the 3rd element is the most
likely outcome. For the MNIST case, n = 10.
(d) Using numpy, program an MLP that takes a 784-dimensional input image and calculates its 10-dimensional
output. Use ReLU activation for the 2 hidden layers and softmax for the output layer.
(e) Compare the output with the true label from ‘ydata’. The input is correctly classified if the position
of the maximum element in the MLP output matches the position of the 1 in ydata. Find the
total number of correctly classified images in the whole set of 10,000. You should get 9790 correct.
(A minimal numpy sketch covering parts (a)–(e) is given at the end of this problem.)
(f) Sample some cases with correct classification and some with incorrect classification. Visually inspect
them. For those that are incorrect, is the correct class obvious to you? You can use matplotlib for
visualization:
import matplotlib.pyplot as plt
plt.imshow(xdata[i].reshape(28,28))
plt.show()
Here, i is the index of the image you want to visualize, i.e. xdata[i] is a 784-dimensional numpy array.

Figure 1: (a) a model for MMSE estimation. (b) a system implementing the LMMSE estimator. (c) the LMS algorithm with noisy targets, z_n. (This figure is referenced in Problem 3 below; it also reproduces the lecture derivation of the Wiener-Hopf equation, ∇_w E = 0 ⟺ R_x w = r, and the LMS update ŵ_{n+1} = ŵ_n + η(y_n − ŵ_n^t x_n) x_n.)
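For parts (a)–(e) above, here is a minimal numpy/h5py sketch. It assumes the weight matrices are stored so that W1 @ x gives the first-layer pre-activation (i.e., W1 has shape (200, 784)); if the shapes turn out to be transposed (check with the ‘shape’ attribute as suggested in part (a)), transpose them on load. The file names and keys are those given above.

import h5py
import numpy as np

# (a) load the trained weights and biases
with h5py.File('mnist_network_params.hdf5', 'r') as f:
    W1, W2, W3 = f['W1'][:], f['W2'][:], f['W3'][:]
    # ravel() guards against biases stored as (n, 1) column vectors
    b1, b2, b3 = (f[k][:].ravel() for k in ('b1', 'b2', 'b3'))

# (b) load the test images and one-hot labels
with h5py.File('mnist_testdata.hdf5', 'r') as f:
    xdata = f['xdata'][:]   # expected shape (10000, 784)
    ydata = f['ydata'][:]   # expected shape (10000, 10)

# (c) activation functions
def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / np.sum(e)

# (d) feedforward pass for one 784-dimensional image
def mlp_forward(x):
    a1 = relu(W1 @ x + b1)        # hidden layer 1 (200 units)
    a2 = relu(W2 @ a1 + b2)       # hidden layer 2 (100 units)
    return softmax(W3 @ a2 + b3)  # output layer (10-class distribution)

# (e) count correct classifications over all 10,000 test images
correct = sum(
    int(np.argmax(mlp_forward(x)) == np.argmax(y))
    for x, y in zip(xdata, ydata)
)
print('correctly classified:', correct)   # expect 9790 if the shape assumptions hold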
2 Logistic Regression SGD
From lecture, we have that the output of a logistic regressor is
p = σ(w^t x + b)
Derive the equations for using SGD to update the trainable parameters w and b when the label is y ∈ {0, 1}.
Recall that p is the output of the regressor and is interpreted as the probability that y = 1. Also recall that
the binary cross-entropy loss is used in this case:
C = −y log p − (1 − y) log(1 − p)
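For reference, a sketch of one way to organize the chain-rule steps (writing s = w^t x + b so that p = σ(s)):

∂C/∂p = −y/p + (1 − y)/(1 − p),    dp/ds = σ(s)(1 − σ(s)) = p(1 − p)
∂C/∂s = (∂C/∂p)(dp/ds) = p − y
∇_w C = (p − y) x,    ∂C/∂b = p − y

so a single SGD step with learning rate η is w ← w − η (p − y) x and b ← b − η (p − y).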
3 LMS algorithm for channel modeling/tracking
In this problem, you will consider the system of Fig. 1. This can be viewed as estimating a channel or
system so that future outputs can be predicted. Fig. 1(a) shows a model where x_n goes through a linear
system and is observed in additive Gaussian noise. We will consider the case where h_n is a 3-tap filter. In
this case, a Wiener filter with length L = 3 is “matched” to this signal generation model. After computing
the ideal Wiener filter of length 3, w_lmmse, this can be used as a model to predict future outputs – this is shown
in Fig. 1(b). The LMS algorithm may be used to approximate this as shown in Fig. 1(c).
In addition to the Wiener filter and LMS algorithm for this matched model, we will consider a time-varying model to show that the LMS algorithm can track the taps of h_n, and also a case where the data is
generated using a more complex (unknown to you) model in place of Fig. 1(a).
This problem is based on the data file lms_fun.hdf5. This file has several data arrays with keys described
below.

1. In this part, you will find the 3-tap Wiener filter for the “matched” condition. Here the model is

z_n(u) = y_n(u) + q_n(u)    (1)
z_n(u) = h_0 x_n(u) + h_1 x_{n−1}(u) + h_2 x_{n−2}(u) + q_n(u)    (2)

where x_n(u) is a sequence of iid, standard Gaussians, h_0 = 1, h_1 = 0.5, and h_2 = 0.25. The added noise
q_n(u) is also iid, Gaussian with zero mean and variance σ_q^2. The SNR is defined as¹

SNR = σ_x^2 / σ_q^2 = 1 / σ_q^2    (3)

¹It may make more sense to use SNR = σ_x^2 (h_0^2 + h_1^2 + h_2^2) / σ_q^2, but I used the definition specified here for my simulations.
Consider the Wiener filter LMMSE estimator that estimates z_n(u) from
v_n(u) = [x_n(u), x_{n−1}(u), x_{n−2}(u)]^t:

ẑ_n = w_{lmmse}^t v_n,    w_{lmmse} = R_v^{−1} r_{vz}    (4)

where we have assumed that the correlation matrices are not a function of the time n (you will verify).
This problem will walk you through the derivation of the Wiener filter.
(a) Show that E{v_n(u) z_n(u)} = E{v_n(u) y_n(u)}, so that the Wiener filter for estimating y_n(u) and
z_n(u) is the same and ẑ_n = ŷ_n.
(b) Show that R_{xz}[m] = R_{xy}[m] = E{x_n(u) z_{n+m}(u)} = h_m.
(c) Find the (3 × 3) matrix R_{v_n} and show that it is not a function of n.
(d) Use the above results to find r_n = E{v_n(u) y_n(u)} = E{v_n(u) z_n(u)} and show it is not a function
of n.
(e) Find the Wiener (LMMSE) filter taps w_{lmmse}. Compute this numerically for SNRs of 3 and
10 dB.
(f) What is the LMMSE – i.e., the residual mean squared error for the Wiener filter? Compute this
numerically for SNRs of 10 and 3 dB. Do this for both the case where the Wiener filter is used to
estimate z_n(u) and y_n(u) – what is the difference?
2. There is data corresponding to the 10 dB and 3 dB SNR cases shown in lecture. Specifically, there are
600 sequences of length 501 samples each. There is an x and a z array for each SNR – i.e., matched_10_x,
matched_10_z and matched_3_x, matched_3_z for 10 dB and 3 dB SNR, respectively. Use these to
develop your code and produce curves similar to those shown in the slides. To aid in this process, there is
also data for y_n and v_n. Note that v_n can be reproduced from x_n, but we have done this for you to
help out.
(a) Program the LMS algorithm with input (regressor) v_n and “noisy target” z_n. This corresponds
to the example given in lecture (a minimal sketch of the recursion is given at the end of this problem).
(b) Plot learning curves for η = 0.05 and η = 0.15 for each SNR. Specifically, plot the MSE – i.e.,
(z_n − ẑ_n)^2 averaged over the 600 sequences.
(c) How does the MSE for these learning curves compare to the LMMSE found in the analytical part
above? Note that you can also try using y_n in place of z_n if you would like to explore (optional).
(d) What is the largest value of η that you can use without having divergent MSE?
3. In this part, there is a single realization of the x, v, y, and z sequences, each of length 501 – e.g., timevarying_v,
timevarying_z. In this case the data were generated using a time-varying linear filter – the coefficients
in (2) actually vary with n. There is a dataset called timevarying_coefficents that has the 3 coefficients as they change with time. Plot these coefficients vs. time (n). Run the LMS algorithm using
the x and z datasets and vary the learning rate to find a good value of η for which the LMS algorithm
tracks the coefficient variations well. Plot the estimated coefficients along with the true coefficients for
this case.
4. In this part, use the datasets with keys mismatched_x and mismatched_y. This is also a set of 600
sequences of length 501 samples each. This data was generated with a model that is not linear and is
unknown to you.
(a) Run the LMS algorithm over all 600 sequences and plot the average learning curve for a good
choice of η. Note: in this case, you only have access to y_n, which may contain noise.
(b) Using the entire data set, compute R̂_{v_n} and r̂_n and the corresponding LLSE. Is this lower than
the LMS learning curve after convergence?
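A minimal sketch of the analytical Wiener filter and the LMS recursion from parts 1 and 2 above, assuming the matched arrays are stored as (600, 501) matrices with one sequence per row (check the shapes on load), and using the fact, to be verified in part 1, that the iid standard-Gaussian input makes R_v the identity and r_vz = [h_0, h_1, h_2]^t:

import h5py
import numpy as np
import matplotlib.pyplot as plt

h = np.array([1.0, 0.5, 0.25])   # matched-model taps from part 1
L = 3

# analytic Wiener filter: R_v = I (white, unit-variance input), r_vz = h
R_v = np.eye(L)
r_vz = h
w_lmmse = np.linalg.solve(R_v, r_vz)
print('w_lmmse =', w_lmmse)

def lms_learning_curve(x_seqs, z_seqs, eta):
    """Run LMS on each row; return the squared error averaged over rows."""
    n_seq, n_samp = z_seqs.shape
    sq_err = np.zeros((n_seq, n_samp))
    for i in range(n_seq):
        w = np.zeros(L)
        v = np.zeros(L)                       # regressor [x_n, x_{n-1}, x_{n-2}]
        for n in range(n_samp):
            v = np.concatenate(([x_seqs[i, n]], v[:-1]))
            e = z_seqs[i, n] - w @ v          # error against the noisy target
            sq_err[i, n] = e ** 2
            w = w + eta * e * v               # LMS update
    return sq_err.mean(axis=0)

with h5py.File('lms_fun.hdf5', 'r') as f:
    x10 = f['matched_10_x'][:]
    z10 = f['matched_10_z'][:]

for eta in (0.05, 0.15):
    plt.plot(lms_learning_curve(x10, z10, eta), label=f'eta = {eta}')
plt.xlabel('n')
plt.ylabel('average squared error')
plt.legend()
plt.show()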
4 Human vs Computer random sequences
Note: Portions of this problem will be worked in class.
1. Use the file binary_random_sp2030.hdf5. This file has all of the data generated by students in HW1
and also has an equal number of sequences generated using numpy. Using this data, construct the
(20 × 20) matrix R̂ – i.e., the sample correlation matrix for the data. Note that the data has been
converted from 0 and 1 to ±1. (A minimal numpy sketch for parts 1–4 is given after part 5.)
2. Find the eigen-vectors and eigen-values for R̂. What is the variance of the most significant two
components of the data? What percentage of the total variance is captured by these two components?
Can you see any significance in the eigen-vectors e0 and e1 that would suggest why they capture much
of the variation? Plot λk vs. k on a stem plot.
3. Create label data to indicate human or computer – i.e., computer: y = +1, human: −1 and merge
the two data sets. Compute the linear classifier weight vector w that maps the (20 × 1) vector to an
estimate of the label. What is the error rate when you threshold ŷ to a hard decision?
4. Visualize the data and classifier in two dimensions. Take a reasonable number of samples from each of
the computer and human data – e.g., 100 each. Project these onto e0 and e1 and plot a scatter plot
of the data. Add to this plot the decision boundary projection in this 2D space.
5. Compute the logistic regression weight vector w that maps the (20 × 1) vector to an estimate of the
probability of the label. What is the error rate when you threshold ŷ to a hard decision?
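A minimal numpy sketch for parts 1–4 (referenced in part 1). The key names in binary_random_sp2030.hdf5 are not stated above, so 'human' and 'computer' below are hypothetical placeholders – inspect the real keys with list(f.keys()) first. Part 3 is sketched as a least-squares fit of the labels; part 5 (logistic regression) is not shown.

import h5py
import numpy as np
import matplotlib.pyplot as plt

with h5py.File('binary_random_sp2030.hdf5', 'r') as f:
    print(list(f.keys()))            # check the real key names first
    human = f['human'][:]            # hypothetical key: (N, 20) array of +/-1
    computer = f['computer'][:]      # hypothetical key: (N, 20) array of +/-1

X = np.vstack([computer, human])     # merged data, one 20-vector per row
y = np.concatenate([np.ones(len(computer)), -np.ones(len(human))])

# part 1: (20 x 20) sample correlation matrix
R_hat = X.T @ X / X.shape[0]

# part 2: eigen-decomposition (eigh returns ascending eigenvalues)
lam, E = np.linalg.eigh(R_hat)
lam, E = lam[::-1], E[:, ::-1]       # sort descending
print('top-2 fraction of total variance:', lam[:2].sum() / lam.sum())
plt.stem(lam)
plt.xlabel('k'); plt.ylabel('lambda_k')
plt.show()

# part 3: least-squares linear classifier y_hat = X w, thresholded to +/-1
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print('linear classifier error rate:', np.mean(np.sign(X @ w) != y))

# part 4: scatter of 100 samples per class projected onto e0 and e1
P = X @ E[:, :2]
for label, name in ((+1, 'computer'), (-1, 'human')):
    idx = np.where(y == label)[0][:100]
    plt.scatter(P[idx, 0], P[idx, 1], label=name)
plt.legend(); plt.show()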
5 Backprop Initialization for Multiclass Classification
Recall that the softmax function h(·) takes an M-dimensional input vector s and outputs an M-dimensional
output vector a given as

a = h(s) = (1 / Σ_{m=0}^{M−1} e^{s_m}) [ e^{s_0}, e^{s_1}, ..., e^{s_{M−1}} ]^t    (5)
Recall that the multiclass cross-entropy cost is given as

C = − Σ_{i=1}^{n} y_i ln a_i    (6)

where y is a given vector of ground-truth labels. Finally, recall that the error vector of the output layer is
given as:

δ = ∇_s C = Ȧ ∇_a C    (7)

where Ȧ is the matrix of derivatives of the softmax, given as²

Ȧ = dh(s)/ds = [ ∂p_0/∂s_0 ... ∂p_{M−1}/∂s_0 ; ⋮ ⋱ ⋮ ; ∂p_0/∂s_{M−1} ... ∂p_{M−1}/∂s_{M−1} ]    (8)

²This is the denominator convention with the left-handed chain rule.
Assuming y is one-hot, show that δ = a − y.
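One way to organize the argument, writing the softmax output as a (the p in (8) is the same output) and assuming the standard softmax Jacobian entries Ȧ_{ij} = ∂a_j/∂s_i = a_j(1[i = j] − a_i) and (∇_a C)_j = −y_j/a_j:

δ_i = Σ_j Ȧ_{ij} (∇_a C)_j = Σ_j a_j(1[i = j] − a_i)(−y_j/a_j) = −y_i + a_i Σ_j y_j = a_i − y_i

since Σ_j y_j = 1 for a one-hot label; hence δ = a − y.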
6 Backprop by Hand
An MLP has three input nodes, two hidden layers, and three outputs. The activation for the hidden layers
is ReLU. The output layer is softmax. The weights and biases for this MLP are:
W(1) = [ 1 −2 1 ; 3 4 −2 ],   b(1) = [ 1 ; −2 ]
W(2) = [ 1 −2 ; 3 4 ],        b(2) = [ 1 ; 0 ]
W(3) = [ 2 2 ; 3 −3 ; 2 1 ],  b(3) = [ 0 ; −4 ; −2 ]

(matrix rows are separated by semicolons)
1. Feedforward Computation: In this part you will perform the feedforward computation for the input
vector x = [ +1 −1 +1 ]^t. The standard notation used in the slides and handouts is used. Specifically,
s(l) is the linear activation – i.e., a(l) = h(s(l)) – and ȧ(l) = ḣ(s(l)). Fill in the following table:

l :      1   2   3
s(l) :
a(l) :
ȧ(l) :   (not needed)
2. Backpropagation Computation: Run the standard SGD backpropagation for this input assuming
a multi-category cross-entropy loss function and the one-hot labeled target y = [ 0 0 1 ]^t. Again,
our standard notation is used so that δ(l) = ∇_{s(l)} C. Enter these delta values in the table below and
also provide the updated weights and biases assuming a learning rate of η = 0.5 (a numpy sketch that can
be used to check these numbers is given after the table).

l :      1   2   3
δ(l) :
W(l) :
b(l) :
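A short numpy sketch that can be used to check the hand computation. It assumes the standard layer updates W(l) ← W(l) − η δ(l) (a(l−1))^t and b(l) ← b(l) − η δ(l) with a(0) = x, and uses the problem 5 result δ(3) = a(3) − y:

import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# given weights, biases, input, one-hot target, and learning rate
W1 = np.array([[1., -2., 1.], [3., 4., -2.]]);  b1 = np.array([1., -2.])
W2 = np.array([[1., -2.], [3., 4.]]);           b2 = np.array([1., 0.])
W3 = np.array([[2., 2.], [3., -3.], [2., 1.]]); b3 = np.array([0., -4., -2.])
x = np.array([1., -1., 1.]); y = np.array([0., 0., 1.]); eta = 0.5

# part 1: feedforward
s1 = W1 @ x + b1;  a1 = relu(s1)
s2 = W2 @ a1 + b2; a2 = relu(s2)
s3 = W3 @ a2 + b3; a3 = softmax(s3)
print('s:', s1, s2, s3)
print('a:', a1, a2, a3)

# part 2: backpropagation (the ReLU derivative is the indicator s > 0)
d3 = a3 - y
d2 = (W3.T @ d3) * (s2 > 0)
d1 = (W2.T @ d2) * (s1 > 0)
print('delta:', d1, d2, d3)

# SGD updates with learning rate eta
for name, W, b, d, a_prev in (('1', W1, b1, d1, x), ('2', W2, b2, d2, a1), ('3', W3, b3, d3, a2)):
    print('W(%s) new:' % name, W - eta * np.outer(d, a_prev))
    print('b(%s) new:' % name, b - eta * d)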
7 Avoiding Softmax Computations in Multiclass Classification
Computing the softmax activation for a large number of classes can be computationally intensive. For
example, we will see later that the word2vec word embedding method uses a trick to avoid computing the
softmax for this reason. In this problem, we explore a different method for avoiding softmax computation
in multiclass classification problems. In the following, we consider an output layer with M neurons for an
M-ary classification problem. The activation for this final layer is a sigmoid, so that the final activation is

a_i = σ(s_i) = 1 / (1 + e^{−s_i})    (9)

where s_i is the linear or pre-activation for the final layer. In other words, the softmax computes an
M-ary probability mass function (pmf), while this approach computes M 2-ary pmfs – i.e., one 2-ary pmf
for each output class. The labels y are one-hot encoded for the M-ary classification problem. Specifically,
only one component of y is 1 and the rest are 0.
1. Explain why using the activations in (9) with the standard multi-class cross-entropy function does not
make sense. Hint: recall that the multi-class cross-entropy function may be viewed as a constant offset
from the KL divergence between two pmfs.
2. Recall that for a binary classification problem, the binary cross-entropy loss is used:

BCE(a) = −y log a − (1 − y) log(1 − a)    (10)

Consider the multi-class loss function that is the sum of the BCE for each class,

C_{M−BCE} = − Σ_{m=0}^{M−1} [ y_m log a_m + (1 − y_m) log(1 − a_m) ]    (11)

where here the label is still one-hot. Find the back-propagation gradient initialization for this case –
i.e., find ∇_s C_{M−BCE} (a sketch of this computation is given after part 3).
3. How does your result for part 2 of this problem compare to the standard multi-class gradient initialization from problem 5? In what ways is it similar and in what ways does it differ? Can C_{M−BCE}
be viewed as resulting from the KL divergence?
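For reference, a sketch of the computation asked for in part 2: since each a_m = σ(s_m) depends only on s_m,

∂C_{M−BCE}/∂s_m = ( −y_m/a_m + (1 − y_m)/(1 − a_m) ) · a_m(1 − a_m) = a_m − y_m,    so ∇_s C_{M−BCE} = a − y.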