## Description

1. Hard-Coding a Network. [2pts] In this problem, you need to find a set of weights and

biases for a multilayer perceptron which determines if a list of length 4 is in sorted order.

More specifically, you receive four inputs x1, . . . , x4, where xi ∈ R, and the network must

output 1 if x1 < x2 < x3 < x4, and 0 otherwise. You will use the following architecture:
All of the hidden units and the output unit use a hard threshold activation function:
φ(z) =
1 if z ≥ 0
0 if z < 0
Please give a set of weights and biases for the network which correctly implements this function
(including cases where some of the inputs are equal). Your answer should include:
• A 3 × 4 weight matrix W(1) for the hidden layer
• A 3-dimensional vector of biases b
(1) for the hidden layer
• A 3-dimensional weight vector w(2) for the output layer
• A scalar bias b
(2) for the output layer
You do not need to show your work.
1
https://markus.teach.cs.toronto.edu/csc321-2018-01
2
http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/syllabus.pdf
1
CSC321Homework 3
2. Backprop. Consider a neural network with N input units, N output units, and K hidden
units. The activations are computed as follows:
z = W(1)x + b
(1)
h = σ(z)
y = x + W(2)h + b
(2)
,
where σ denotes the logistic function, applied elementwise. The cost will involve both h and
y:
E = R + S
R = r
>h

S =

1

2

ky − sk

2

for given vectors r and s.

• [1pt] Draw the computation graph relating x, z, h, y, R, S, and E.

• [3pts] Derive the backprop equations for computing x = ∂E/∂x. You may use σ

0

to

denote the derivative of the logistic function (so you don’t need to write it out explicitly).

3. Sparsifying Activation Function. [4pts] One of the interesting features of the ReLU

activation function is that it sparsifies the activations and the derivatives, i.e. sets a large

fraction of the values to zero for any given input vector. Consider the following network:

Note that each wi refers to the weight on a single connection, not the whole layer. Suppose

we are trying to minimize a loss function L which depends only on the activation of the

output unit y. (For instance, L could be the squared error loss 1

2

(y − t)

2

.) Suppose the unit

h1 receives an input of -1 on a particular training case, so the ReLU evaluates to 0. Based

only on this information, which of the weight derivatives

∂L

∂w1

,

∂L

∂w2

,

∂L

∂w3

are guaranteed to be 0 for this training case? Write YES or NO for each. Justify your

answers.

2