## Description

1 Character recognition using a trained MLP

The purpose of this problem is to have you program, using numpy, the feedforward processing for a multilayer

perceptron (MLP) deep neural network. This processing is defined on slide 33 of the “introduction” lecture

slides. The grading of this homework will give you feedback on your FF program which is valuable since in

a future homework, you will be building on this to program the back-propagation as well.

The MNIST dataset of handwritten digits is widely used as a beginner dataset for benchmarking machine

learning classifiers. It has 784 input features (pixel values in each image) and 10 output classes representing

numbers 0–9. We have trained a MLP on MNIST, with a 784-neuron input layer, 2 hidden layers of 200 and

100 neurons, and a 10-neuron output layer. The activation functions used are ReLU for the hidden layers

and softmax for the output layer.

(a) Extract the weights and biases of the trained network from mnist_network_params.hdf5. The file

has 6 keys corresponding to numpy arrays W1, b1, W2, b2, W3, b3. You may want to check their

dimensions by using the ‘shape’ attribute of a numpy array.

(b) The file mnist_testdata.hdf5 contains 10,000 images in the key ‘xdata’ and their corresponding labels

in the key ‘ydata’. Extract these. Note that each image is 784-dimensional and each label is one-hot

10-dimensional. So if the label for an image is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0], it means the image depicts a 3.

(c) Write functions for calculating ReLU and softmax. These are given as:

• ReLU(x) = max (0, x)

• Softmax(x) =

e

x1

Pn

i=1 e

xi

,

e

x2

Pn

i=1 e

xi

, · · · ,

e

xn

Pn

i=1 e

xn

. Note that the softmax function operates on a vector of size n and returns another vector of size n which is a probability distribution.

For example, Softmax ([0, 1, 2]) = [0.09, 0.24, 0.67]. This indicates that the 3rd element is the most

likely outcome. For the MNIST case, n = 10.

(d) Using numpy, program a MLP that takes a 784-dimensional input image and calculates its 10-dimensional

output. Use ReLU activation for the 2 hidden layers and softmax for the output layer.

(e) Compare the output with the true label from ‘ydata’. The input is correctly classified if the position

of the maximum element in the MLP output matches with the position of the 1 in ydata. Find the

total number of correctly classified images in the whole set of 10,000. You should get 9790 correct.

(f) Sample some cases with correct classification and some with incorrect classification. Visually inspect

them. For those that are incorrect, is the correct class obvious to you? You can use matplotlib for

visualization:

1

c K.M. Chugg, B. Franzke – February 8, 2020– EE599 Deep Learning 2

hn

xn yn

qn

zn

xn

wˆ n

yˆn

–

rwE = 2

E

x(u)xt

(u)w y(u)x(u)

= 2(Rxw r)

rwE = 0 () Rxw = r (Wiener-Hopf equation)

wˆ n+1 = wˆ n (⌘/2)rwE

= wˆ n + ⌘(r Rxwˆ n)

1

2

rwE = r Rxw

= E

y(u)x(u) x(u)xt

(u)w

⇡ ynxn xnxt

nw

= (yn xt

nwˆ n)xn

wˆ n+1 = wˆ n + ⌘(yn wˆ t

nxn)xn

yˆn =

L

X1

l=0

wlxnl

= wt

vn

w =

2

6

6

6

6

6

6

4

w0

w1

.

.

.

wL1

3

7

7

7

7

7

7

5

vn = xn

n(L1) =

2

6

6

6

6

6

6

4

xn

xn1

.

.

.

xnL+1

3

7

7

7

7

7

7

5

1

2

rwE = r Rvnw

= E

y(i)vn(u) vn(u)vt

n(u)w

= (yn vt

nvˆn)xn

en

⌘ zn

noisy target

xn yˆn

wlmmse

(a) (b) (c)

Figure 1: (a) a model for MMSE estimation. (b) a system implementing the LMMSE estimator. (c) the

LMS algorithm with noisy targets, zn.

import matplotlib.pyplot as plt

plt.imshow(xdata[i].reshape(28,28))

plt.show()

Here, i is the index of the image you want to visualize, i.e. xdata[i] is a 784-dimensional numpy array.

2 Logistical Regression SGD

From lecture, we have that the output of a logistical regressor is

p = σ(wtx + b)

Derive the equations for using SGD to update the trainable parameters w and b when the label is y ∈ {0, 1}.

Recall that p is the output of the regressor and is interpreted as the probability that y = 1. Also recall that

the binary cross-entropy loss is used in this case

C = −y log p − (1 − y) log(1 − p)

3 LMS algorithm for channel modeling/tracking

In this problem, you will consider the the system of Fig. 1. This can be viewed as estimating a channel or

system so that future outputs can be predicted. Fig. 1(a), shows a model where xn goes through a linear

system and is observed in additive Gaussian noise. We will consider the case where hn is a 3-tap filter. In

this case, a Wiener filter with length L = 3 is “matched” to this signal generation model. After computing

the ideal Wiener filter of length 3, wllms, this can be used as model to predict future outputs – this is shown

in Fig. 1(b). The LMS algorithm may be used to approximate this as shown in Fig. 1(c).

In addition to the Wiener Filter and LMS algorithm for this matched model, we will consider a timevarying model to show that the LMS algorithm can track the taps of hn and also a case where the data is

generated using a more complex (unknown to you) model in place of Fig. 1(a).

This problem is based on the data file lms_fun.hdf5. This file has several data arrays with keys described

below.

c K.M. Chugg, B. Franzke – February 8, 2020– EE599 Deep Learning 3

1. In this part, you will find the 3-tap Wiener filter for the “matched” condition. Here the model is

zn(u) = yn(u) + qn(u) (1)

zn(u) = h0xn(u) + h1xn−1(u) + h2xn−2(u) + qn(u) (2)

where xn(u) is a sequence of iid, standard Gaussians, h0 = 1, h1 = 0.5, and h2 = 0.25. The noise

added, qn(u) is also iid, Gaussian with zero mean and variance σ

2

q

. The SNR is defined as1

SNR = σ

2

x

σ

2

q

=

1

σ

2

q

(3)

Consider the Wiener filter LMMSE estimator that estimates zn(u) from

vn(u) = [xn(u), xn−1(u), xn−2(u)]t

:

zˆn = wt

lmmsevn wlmmse = R−1

v

rvz (4)

where we have assumed that the correlation matrices are not a function of the time n (you will verify).

This problem will walk you through the derivation of the Wiener filter.

(a) Show that E {vn(u)zn(u)} = E {vn(u)yn(u)} so that the Wiener filter for estimating yn(u) and

zn(u) is the same and ˆzn = ˆyn.

(b) Show that Rxz[m] = Rxy[m] = E {xn(u)zn+m(u)} = hm.

(c) Fnd the (3 × 3) matrix Rvn

and show that is is not a function of n.

(d) Use the above results to find rn = E {vn(u)yn(u)} = E {vn(u)zn(u)} and show it is not a function

of n.

(e) Find the Wiener filter (LMMSE) filter taps wlmmse. Compute this numerically for SNRs of 3 and

10 dB.

(f) What is the LMMSE – i.e., the residual mean squared error for the Wiener filter? Compute this

numerically for SNRs of 10 and 3 dB. Do this for both the case where the Wiener filter is used to

estimate zn(u) and yn(u) – what is the difference?

2. There is data corresponding to 10 dB and 3 dB of SNR cases shown in lecture. Specifically, there are

600 sequences of length 501 samples each. There is an x and z array for each: – i.e., matched_10_x,

matched_10_z and matched_3_x, matched_3_z for 10 dB and 3 dB SNR, respectively. Use these to

develop your code and run curves similar to those shown in the slides. To aid in this process, there is

also data for yn and vn. Note that vn can be reproduced from xn, but we have done this for you to

help out.

(a) Program the LMS algorithm with input (regressor) vn and “noisy target” zn. This corresponds

to the example given in lecture.

(b) Plot learning curves for η = 0.05 and η = 0.15 for each SNR. Specifically, plot the MSE (average

of (zn − zˆn)

2

), averaged over the 600 sequences.

(c) How does the MSE for these learning curves compare to the LMMSE found in the analytical part

above? Note that you can also try using yn in place of zn if you would like to explore (optional).

(d) What is the largest value of η that you can use without having divergent MSE?

3. In this part, there is a single realization an x (v and y sequence, each of length 501 – e.g., timevarying_v,

timevarying_z. In this case the data were generated using a time-varying linear filter – coefficients

1

It may make more sense to use SNR = σ

2

x

(h

2

0+h

2

1+h

2

2

)

σ2

q

, but I used the definition specified here for my simulations.

c K.M. Chugg, B. Franzke – February 8, 2020– EE599 Deep Learning 4

in (2) actually vary with n. There is a dataset called timevarying_coefficents that has the 3 coefficients as they change with time. Plot these coefficients vs. time (n). Run the LMS algorithm using

the x and z datasets and and vary the learning rate to find a good value of η where the LMS algorithm

tracks the coefficient variations well. Plot the estimated coefficients along with the true coefficients for

this case.

4. In this part, use the dataset with key mismatched_x and mismatched_y. This is also a set of 600

sequences of length 501 samples each. This data was generated with a model that is not linear and is

unknown to you.

(a) Run the LMS algorithm over all 600 sequences and plot the average learning curve for a goods

choice of η. Note: in this case, you only have access to yn, which may contain noise.

(b) Using the entire data set, compute Rˆ

vn

and ˆrn and the corresponding LLSE. Is this lower than

the LMS learning curve after convergence?

4 Human vs Computer random sequences

Note: Portions of this problem will be worked in class.

1. Use the file binary_random_sp2030.hdf5. This file has all of the data generated by students in HW1

and also has an equal number of sequences generated using numpy. Using this data, construct the

(20 × 20) matrix Rˆ – i.e., the sample correlation matrix for the data. Note that the data has been

converted from 0 and 1 to ±1.

2. Find the eigen-vectors and eigen-values for Rˆ . What is the variance of the most significant two

components of the data? What percentage of the total variance is captured by these two components?

Can you see any significance in the eigen-vectors e0 and e1 that would suggest why they capture much

of the variation? Plot λk vs. k on a stem plot.

3. Create label data to indicate human or computer – i.e., computer: y = +1, human: −1 and merge

the two data sets. Compute the linear classifier weight vector w that maps the (20 × 1) vector to an

estimate of the label. What is the error rate when you threshold ˆy to a hard decision?

4. Visualize the data and classifier in two dimensions. Take a reasonable number of samples from each

the computer and human data – e.g., 100 each. Project these onto e0 and e1 and plot a scatter plot

of the data. Add to this plot the decision boundary projection in this 2D space.

5. Compute the logistical regression weight vector w that maps the (20 × 1) vector to an estimate of the

probability of the label. What is the error rate when you threshold ˆy to a hard decision?

5 Backprop Initialization for Multiclass Classification

Recall that the softmax function h(·) takes an M-dimensional input vector s and outputs an M-dimensional

output vector a given as

a = h(s) = 1

PM−1

m=0 e

sm

e

s0

e

s1

.

.

.

e

sM−1

(5)

Recall that multiclass cross-entropy cost is given as

C = −

Xn

i=1

yi

ln ai (6)

c K.M. Chugg, B. Franzke – February 8, 2020– EE599 Deep Learning 5

where y is a given vector of ground truth labels. Finally, recall that the error vector of the output layer is

given as:

δ = ∇sC = A˙ ∇aC (7)

where A˙

is the matrix of derivatives of softmax, given as2

A˙ =

dh(s)

ds

=

∂p0

∂s0

· · ·

∂pM−1

∂s0

.

.

.

.

.

.

.

.

.

∂p0

∂sM−1

· · ·

∂pm−1

∂sM−1

(8)

Assuming y is one-hot, show that δ = a − y

6 Backprop by Hand

An MLP has three input nodes, two hidden layers, and three outputs. The activation for the hidden layers

is ReLu. The output layer is softmax. The weights and biases for this MLP are:

W(1) =

1 −2 1

3 4 −2

, b

(1) =

1

−2

W(2) =

1 −2

3 4

, b

(2) =

1

0

W(3) =

2 2

3 −3

2 1

, b

(3) =

0

−4

−2

1. Feedforward Computation: In this part you will perform the feedforward computation for the input

vector x = [ +1 −1 +1 ]t

. The standard notation used in the slides and handouts is used. Specifically,

s

(l)

is the linear activation – i.e., a

(l) = h(s

(l)

) – and a˙

(l) = h˙(s

(l)

). Fill in the following table:

l : 1 2 3

s

(l)

:

a

(l)

:

a˙

(l)

: (not needed)

2. Backpropagation Computation: Run the standard SGD backpropagation for this input assuming

a multi-category cross-entropy loss function and the one-hot labeled target: y = [ 0 0 1 ]t

. Again,

our standard notation is used so that δ

(l) = ∇s

(l)C. Enter these delta values in the table below and

also provide the updated weights and biases assuming a learning rate of η = 0.5.

l : 1 2 3

2This is the denominator convention with the left-handed chain rule.

c K.M. Chugg, B. Franzke – February 8, 2020– EE599 Deep Learning 6

δ

(l)

:

W(l)

:

b

(l)

:

7 Avoiding Softmax Computations in Multiclass Classification

Computing the softmax activation for a large number of classes can be computationally intensive. For

example, we will see later that the word2vec word embedding method uses a trick to avoid computing the

softmax for this reason. In this problem, we explore a different method for avoiding softmax computation

in multiclass classification problems. In the following, we consider an output layer with M neurons for an

M-ary classification problem. The activation for this final layer is a sigmoid so that the final activation is

ai = σ(si) = 1

1 + e−si

(9)

where si

is the linear or pre-activation for the final layer. In other words, the softmax, which computes an

M-ary probability mass function (pmf) while this approach computes M 2-ary pmfs – i.e., one 2-ary pmf

for each output class. The labels y are one-hot encoded for the M-ary classification problem. Specifically,

only one component of y is 1 and the rest are 0.

1. Explain why using the activations in (9) with the standard multi-class cross-entropy function does not

make sense. Hint: recall that the milti-class cross-entropy function may be viewed as a constant offset

from the KL Divergence between two pmfs.

2. Recall that for a binary classification problem, the binary cross-entropy loss is used

BCE(a) = −y log a − (1 − y) log(1 − y) (10)

Consider the multi-class loss function that is the sum of the BCE for each class

CM−BCE = −

“M

X−1

m=0

ym log am + (1 − ym) log(1 − ym)

#

(11)

where here the label is still one-hot. Find the back-propogation gradient initialization for this case –

i.e., find ∇sCM−BCE.

3. How does your result for part 2 of this problem compare to the standard multi-class gradient initialization from problem 5? In what ways is it similar and in what ways does it differ. Can the CM−BCE

be viewed as resulting from the KL divergence?