CSC321 Homework 5

1. Binary Addition [5pts] In this problem, you will design a recurrent neural network which implements binary addition. The inputs are given as binary sequences, starting with the least significant binary digit. (It is easier to start from the least significant bit, just like how you did addition in grade school.) The sequences will be padded with at least one zero at the end. For instance, the problem
100111 + 110010 = 1011001
would be represented as:
• Input 1: 1, 1, 1, 0, 0, 1, 0
• Input 2: 0, 1, 0, 0, 1, 1, 0
• Correct output: 1, 0, 0, 1, 1, 0, 1
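As a quick arithmetic check of this encoding (bits reversed to LSB-first and padded with a trailing zero), here is a short Python verification; the variable names are only for illustration.

```python
# 100111 = 39 and 110010 = 50, and 39 + 50 = 89 = 1011001 (all MSB-first).
assert int("100111", 2) + int("110010", 2) == int("1011001", 2)

# The same numbers as LSB-first bit lists, exactly as shown above.
x1 = [1, 1, 1, 0, 0, 1, 0]   # 100111 reversed, plus a trailing 0
x2 = [0, 1, 0, 0, 1, 1, 0]   # 110010 reversed, plus a trailing 0
y  = [1, 0, 0, 1, 1, 0, 1]   # 1011001 reversed
```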
There are two input units corresponding to the two inputs, and one output unit, so the network sees the pattern of inputs and outputs listed above, one digit per time step.
Design the weights and biases for an RNN which has two input units, three hidden units, and one output unit, and which implements binary addition. All of the units use the hard threshold activation function. In particular, specify weight matrices U, V, and W, bias vector b_h, and scalar bias b_y for this architecture.
Hint: In the grade school algorithm, you add up the values in each column, including the
carry. Have one of your hidden units activate if the sum is at least 1, the second one if it is
at least 2, and the third one if it is 3.
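To make the hint concrete, here is a minimal NumPy sketch of one hard-threshold RNN built along those lines. The particular weights U, W, V, the biases b_h and b_y, and the threshold-at-zero convention are illustrative assumptions, not necessarily the intended answer: the hidden units test whether the column sum x1 + x2 + carry is at least 1, 2, or 3, and the carry is read off the second hidden unit from the previous time step.

```python
import numpy as np

def step(z):
    # Hard threshold: 1 if the pre-activation is positive, else 0.
    return (z > 0).astype(float)

# Illustrative parameters (an assumption, not the official solution).
# Hidden unit k fires iff x1 + x2 + carry >= k, where the carry is
# h2 from the previous time step (a column sum >= 2 means "carry a 1").
U   = np.array([[1., 1.],      # every hidden unit sums the two input bits
                [1., 1.],
                [1., 1.]])
W   = np.array([[0., 1., 0.],  # every hidden unit also adds the previous h2 (the carry)
                [0., 1., 0.],
                [0., 1., 0.]])
b_h = np.array([-0.5, -1.5, -2.5])   # thresholds for sum >= 1, 2, 3
V   = np.array([1., -1., 1.])        # output = 1 iff the column sum is odd (1 or 3)
b_y = -0.5

def add_binary(x1, x2):
    """Run the RNN on two LSB-first bit sequences and return the output bits."""
    h = np.zeros(3)
    ys = []
    for a, b in zip(x1, x2):
        x = np.array([a, b], dtype=float)
        h = step(U @ x + W @ h + b_h)
        ys.append(int(step(V @ h + b_y)))
    return ys

# The example from the handout: 100111 + 110010 = 1011001 (shown LSB-first).
print(add_binary([1, 1, 1, 0, 0, 1, 0],
                 [0, 1, 0, 0, 1, 1, 0]))   # expected: [1, 0, 0, 1, 1, 0, 1]
```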
2. LSTM Gradient [5pts] Here, you'll derive the Backprop Through Time equations for the univariate version of the Long Short-Term Memory (LSTM) architecture.
Note: This question is important context for understanding LSTMs, but it is just ordinary
Backprop Through Time, so you know enough to do parts (a) and (b) after Lecture 15
(RNNs). Part (c) is optional because it depends on Lecture 16, but you should try to do it
yourself. You may find it helpful to read Section 3.4 of the Lecture 16 notes.
For reference, here are the computations it performs:
i^{(t)} = \sigma(w_{ix} x^{(t)} + w_{ih} h^{(t-1)})
f^{(t)} = \sigma(w_{fx} x^{(t)} + w_{fh} h^{(t-1)})
o^{(t)} = \sigma(w_{ox} x^{(t)} + w_{oh} h^{(t-1)})
g^{(t)} = \tanh(w_{gx} x^{(t)} + w_{gh} h^{(t-1)})
c^{(t)} = f^{(t)} c^{(t-1)} + i^{(t)} g^{(t)}
h^{(t)} = o^{(t)} \tanh(c^{(t)})
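To make these computations concrete, here is a minimal NumPy sketch of the univariate forward pass. The scalar weight names (w_ix, w_ih, and so on) simply mirror the subscripts above, and the example parameter values and input sequence are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, params, h0=0.0, c0=0.0):
    """Univariate LSTM forward pass, following the equations above.

    params is a dict of scalar weights: w_ix, w_ih, w_fx, w_fh,
    w_ox, w_oh, w_gx, w_gh (names are ours, mirroring the subscripts).
    Returns the lists of hidden states and cell states.
    """
    p = params
    h, c = h0, c0
    hs, cs = [], []
    for x in x_seq:
        i = sigmoid(p['w_ix'] * x + p['w_ih'] * h)   # input gate
        f = sigmoid(p['w_fx'] * x + p['w_fh'] * h)   # forget gate
        o = sigmoid(p['w_ox'] * x + p['w_oh'] * h)   # output gate
        g = np.tanh(p['w_gx'] * x + p['w_gh'] * h)   # candidate cell value
        c = f * c + i * g                            # new cell state
        h = o * np.tanh(c)                           # new hidden state
        hs.append(h)
        cs.append(c)
    return hs, cs

# Example with arbitrary placeholder weights and a short input sequence.
params = {'w_ix': 0.5, 'w_ih': 0.1, 'w_fx': -0.3, 'w_fh': 0.8,
          'w_ox': 0.2, 'w_oh': -0.4, 'w_gx': 1.0, 'w_gh': 0.6}
hs, cs = lstm_forward([1.0, -0.5, 0.25], params)
```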
(a) [4pts] Derive the Backprop Through Time equations for the activations and the gates:
\bar{h}^{(t)} =
\bar{c}^{(t)} =
\bar{g}^{(t)} =
\bar{o}^{(t)} =
\bar{f}^{(t)} =
\bar{i}^{(t)} =
You don’t need to vectorize anything or factor out any repeated subexpressions.
(b) [1pt] Derive the BPTT equation for the weight w_{ix}:
\bar{w}_{ix} =
(The other weights are basically the same, so we won't make you write those out.)
(c) [optional, no points] Based on your answers above, explain why the gradient doesn't explode if the values of the forget gates are very close to 1 and the values of the input and output gates are very close to 0. (Your answer should involve both \bar{h}^{(t)} and \bar{c}^{(t)}.)
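One way to sanity-check whatever BPTT expressions you derive is to compare them against finite differences. Below is a hedged sketch of such a check for w_ix, reusing the lstm_forward function and params dictionary from the sketch above; the loss (the sum of the hidden states) and the step size are arbitrary illustrative choices, since the handout does not fix a loss function.

```python
def numerical_grad_wix(x_seq, params, eps=1e-6):
    """Central-difference estimate of dL/dw_ix for L = sum_t h^(t).

    Assumes lstm_forward and params from the earlier sketch are in scope.
    Compare the result against the value given by your derived BPTT
    expression for w_ix; the two should agree to several decimal places.
    """
    def loss(p):
        hs, _ = lstm_forward(x_seq, p)
        return sum(hs)

    p_plus  = dict(params, w_ix=params['w_ix'] + eps)
    p_minus = dict(params, w_ix=params['w_ix'] - eps)
    return (loss(p_plus) - loss(p_minus)) / (2 * eps)

print(numerical_grad_wix([1.0, -0.5, 0.25], params))
```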