## Description

Question 1. [30 points]

An engineer is trying to design a linear neural network for a regression task. The design is

based on the following cost function:

$$C_1 = \frac{1}{2} \sum_n \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2 \tag{1}$$

where $y_n$ is the scalar output and $\mathbf{x}_n$ is the vector input for the $n$-th training/testing sample.

Answer the questions below.

a) Prove that minimizing C1 is equivalent to minimizing another cost of the following form:

$$C_2 = \frac{1}{2} \mathbf{w}^T A \mathbf{w} - \mathbf{b}^T \mathbf{w} \tag{2}$$

Find the expressions for A and b in terms of the input and output of the network. Find

the update rule for ∆w according to gradient descent optimization.
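Once you have derived expressions for A and b, a quick numerical sanity check can confirm that C1 and C2 differ only by a constant independent of w. The identifications used below (A as a sum of input outer products, b as an output-weighted sum of inputs) are the candidates to be verified by your derivation, and the data here is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # rows are the input vectors x_n
y = rng.standard_normal(100)        # scalar targets y_n
w = rng.standard_normal(3)          # an arbitrary weight vector

# Candidate identifications (to be derived in part a):
A = X.T @ X                         # sum_n x_n x_n^T
b = X.T @ y                         # sum_n y_n x_n

C1 = 0.5 * np.sum((y - X @ w) ** 2)
C2 = 0.5 * w @ A @ w - b @ w
const = 0.5 * y @ y                 # the w-independent term

print(np.isclose(C1, C2 + const))
```

If the prints out True for arbitrary w, the two costs share the same minimizer, since constants do not affect the gradient.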

b) Identify an appropriate change of variables to prove that the update rule obtained in

part a is equivalent to the update rule for a different cost function:

$$C_3 = \frac{1}{2} \tilde{\mathbf{w}}^T A \tilde{\mathbf{w}} \tag{3}$$

Express $\tilde{\mathbf{w}}$ in terms of $\mathbf{w}$, $A$, and $\mathbf{b}$.

c) Assume that A is a symmetric and positive-definite matrix. Find the maximum learning

rate η for which the weight updates lead to a stable solution.
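A numerical probe (not a substitute for the proof) can make the stability boundary visible. The sketch below runs gradient descent on C2 for learning rates just inside and just outside a candidate threshold tied to the largest eigenvalue of A; the factor used here is an assumption that your derivation should justify:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)            # symmetric positive-definite by construction
b = rng.standard_normal(4)
w_star = np.linalg.solve(A, b)     # the minimizer of C2

lam_max = np.linalg.eigvalsh(A).max()

def run(eta, steps=500):
    w = np.zeros(4)
    for _ in range(steps):
        w = w - eta * (A @ w - b)  # gradient-descent step on C2
    return np.linalg.norm(w - w_star)

print(run(1.9 / lam_max) < 1e-6)   # just below the candidate bound
print(run(2.1 / lam_max) > 1e6)    # just above it
```

Diagonalizing A shows why: each error component is scaled by a fixed factor per step, and stability requires every such factor to have magnitude below one.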

Question 2. [30 points]

In this question we will revisit the cat versus car detection problem from the previous assignment. The file assign2_data1.mat contains the variables trainims (training images)

and testims (testing images) along with the ground truth labels trainlbls and testlbls.

You will implement stochastic gradient descent on mini-batches.

Here, you will measure the error term as a function of epoch number, where an epoch is

a single pass through all images in the training set. Two different error metrics will be

calculated: the mean squared error and the mean classification error (percentage of incorrectly classified images). Record the error metrics for each epoch separately for the training

samples and the testing samples. Answer the questions below.
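The per-epoch bookkeeping asked for above can be sketched as follows. This uses synthetic stand-in data and a plain linear model purely to illustrate the loop structure (the real images and labels live in assign2_data1.mat, and the assignment's networks are nonlinear):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for trainims/trainlbls
X = rng.standard_normal((600, 20))
w_true = rng.standard_normal(20)
y = np.sign(X @ w_true)                      # labels in {-1, +1}

w = np.zeros(20)
eta, batch = 0.01, 50
mse_hist, cls_hist = [], []

for epoch in range(20):
    order = rng.permutation(len(X))          # one epoch = one shuffled pass
    for s in range(0, len(X), batch):
        idx = order[s:s + batch]
        err = X[idx] @ w - y[idx]
        w -= eta * X[idx].T @ err / len(idx) # mini-batch gradient step
    pred = X @ w
    mse_hist.append(np.mean((pred - y) ** 2))        # mean squared error
    cls_hist.append(np.mean(np.sign(pred) != y))     # fraction misclassified

print(mse_hist[-1] < mse_hist[0])
```

In the actual assignment the same two lists would be kept for the test set as well, evaluated once per epoch.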

a) Using the backpropagation algorithm, design a multi-layer neural network with a single

hidden layer. Assume a hyperbolic tangent activation function for all neurons. Experiment

with different numbers of neurons N in the hidden layer, initializations of the weight and bias terms, and mini-batch sample sizes. Assuming a learning rate of η ∈ [0.1, 0.5], select a

particular set of parameters that work well. Using the selected parameters, run backpropagation until convergence. Plot the learning curves as a function of epoch number for training

squared error, testing squared error, training classification error, and testing classification

error.
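A minimal single-hidden-layer tanh network with hand-written backpropagation might look like the sketch below. The data, hidden width N, learning rate, and batch size are all illustrative assumptions, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in data; labels in {-1, +1} to match the tanh output range
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 0] * X[:, 1])               # a rule a purely linear net cannot fit

N, eta, B = 16, 0.2, 20                      # hidden width, learning rate, batch size
W1 = rng.normal(0, 0.1, (5, N)); b1 = np.zeros(N)
W2 = rng.normal(0, 0.1, N);      b2 = 0.0

def predict(X):
    return np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2)

mse_before = np.mean((predict(X) - y) ** 2)
for epoch in range(200):
    for s in range(0, len(X), B):
        xb, yb = X[s:s + B], y[s:s + B]
        h = np.tanh(xb @ W1 + b1)            # forward: hidden activations
        o = np.tanh(h @ W2 + b2)             # forward: scalar output
        do = (o - yb) * (1 - o ** 2)         # backprop through the output tanh
        dh = np.outer(do, W2) * (1 - h ** 2) # backprop into the hidden layer
        W2 -= eta * h.T @ do / B;  b2 -= eta * do.mean()
        W1 -= eta * xb.T @ dh / B; b1 -= eta * dh.mean(axis=0)
mse_after = np.mean((predict(X) - y) ** 2)
print(mse_after < mse_before)
```

The `(1 - o**2)` and `(1 - h**2)` factors are the tanh derivatives; replacing the activation would change only those two lines.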

b) Describe how the squared-error and classification-error metrics evolve over epochs for the training versus the testing sets. Is squared error an adequate predictor of classification

error?

c) Train separate neural networks using substantially smaller and larger numbers of hidden-layer neurons (N_low and N_high). Plot the learning curves for all error metrics, overlaying the results for N_low, N_high, and the N* prescribed in part a.

d) Design and train a separate network with two hidden layers. Assuming a learning rate of

η ∈ [0.1, 0.5], select a particular set of parameters that work well. Plot the learning curves

for all error metrics, and comparatively discuss the convergence behavior and classification

performance of the two-hidden-layer network with respect to the network in part a.

e) Assuming a momentum coefficient of α ∈ [0.1, 0.5], retrain the neural network described

in part d. Select a particular set of parameters that work well. Plot the learning curves for

all error metrics, and comparatively discuss the convergence behavior with respect to part

d.
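The momentum update keeps a velocity term that accumulates past gradients. The sketch below contrasts plain gradient descent (α = 0) with momentum on an ill-conditioned quadratic; the matrix, step count, and coefficients are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
A = M @ M.T + 0.1 * np.eye(6)          # an ill-conditioned quadratic bowl
b = rng.standard_normal(6)
w_star = np.linalg.solve(A, b)
eta = 0.01                             # learning rate (assumed)

def train(alpha, steps=300):
    w = np.zeros(6)
    v = np.zeros(6)                    # velocity accumulator
    for _ in range(steps):
        v = alpha * v - eta * (A @ w - b)  # momentum-smoothed step
        w = w + v
    return np.linalg.norm(w - w_star)

print(train(0.5) < train(0.0))         # momentum reaches the minimum faster here
```

Momentum damps oscillation across steep directions while accelerating progress along shallow ones, which is why it typically speeds convergence on poorly conditioned error surfaces.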

Question 3. [40 points]

Neural network architectures can produce powerful computational models for natural language processing. Here, you will consider one particular model for examining sequences

of words. The task is to predict the fourth word in a sequence given the preceding trigram, e.g., trigram: ‘Neural nets are’, fourth word: ‘awesome’. A database of articles was parsed to store sample four-grams restricted to a vocabulary size of 250 words. The file

assign2_data2.mat contains training samples for input and output (trainx, traind), for

validation (valx, vald), and for testing (testx, testd). Using these samples, the following

network should be trained via backpropagation:

[Figure: network architecture — trigram input (‘Neural nets are’) → embedding → logistic hidden units → softmax output (‘awesome’)]

The input layer has 3 neurons corresponding to the trigram entries. An embedding matrix R (250×D) is used to linearly map each single word onto a vector representation of length D. The same embedding matrix is used for each input word in the trigram, without considering the sequence order. The hidden layer uses a sigmoidal activation function on each of P hidden-layer neurons. The output layer predicts a separate response $z_i$ for each of the 250 vocabulary words, and the probability of each word is estimated via a softmax operation:

$$o_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
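The forward pass of this architecture can be sketched as below. The (D, P) choice, weight scales, and word indices are illustrative assumptions; the max-subtraction before exponentiation is a standard numerical-stability trick that leaves the softmax output unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, P = 250, 16, 128                  # vocabulary size, embedding length, hidden width

R  = rng.normal(0, 0.01, (V, D))        # shared embedding matrix
W1 = rng.normal(0, 0.01, (3 * D, P)); b1 = np.zeros(P)
W2 = rng.normal(0, 0.01, (P, V));     b2 = np.zeros(V)

def forward(trigram):
    """trigram: three word indices in [0, V)."""
    e = R[trigram].reshape(-1)                  # concatenate the three embeddings
    h = 1 / (1 + np.exp(-(e @ W1 + b1)))        # logistic hidden units
    z = h @ W2 + b2                             # one response z_i per vocabulary word
    z = z - z.max()                             # stabilize the exponentials
    o = np.exp(z) / np.exp(z).sum()             # o_i = e^{z_i} / sum_j e^{z_j}
    return o

o = forward(np.array([3, 17, 42]))
print(o.shape, np.isclose(o.sum(), 1.0))
```

Because the same R row is read for every occurrence of a word, gradients from all three input positions accumulate into that one row during backpropagation.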

a) Assume the following parameters: a stochastic gradient descent algorithm, a mini-batch

size of 200 samples, a learning rate of η = 0.15, a momentum rate of α = 0.85, a maximum

of 50 epochs, and weights and biases initialized as random Gaussian variables of std 0.01. If

necessary, adjust these parameters to improve network performance. The algorithm should

be stopped based on the cross-entropy error on the validation data. Experiment with different (D, P) values, namely (32, 256), (16, 128), and (8, 64), and discuss your results.
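Stopping on validation cross-entropy is typically implemented with a patience counter. The sketch below uses a hand-written list of per-epoch validation losses as a stand-in for values that would really come from evaluating the network on (valx, vald):

```python
import numpy as np

def cross_entropy(o, targets):
    """Mean cross-entropy of softmax outputs o (rows) against integer word indices."""
    return -np.mean(np.log(o[np.arange(len(targets)), targets] + 1e-12))

# Sanity check: uniform outputs over 250 words give cross-entropy ~ ln 250
uniform = np.full((4, 250), 1 / 250)
print(round(cross_entropy(uniform, np.arange(4)), 3))   # ≈ ln 250 ≈ 5.521

# Stand-in validation losses per epoch, mimicking a net that overfits after epoch 6
val_ce = [2.0, 1.5, 1.2, 1.0, 0.9, 0.85, 0.84, 0.86, 0.9, 0.95]

patience, best, best_epoch, wait = 2, float("inf"), -1, 0
for epoch, ce in enumerate(val_ce):
    if ce < best:
        best, best_epoch, wait = ce, epoch, 0   # snapshot the weights here
    else:
        wait += 1
        if wait > patience:                     # no improvement: stop training
            break
print(best_epoch)
```

The weights reported at the end should be those saved at `best_epoch`, not the weights at the epoch where training halted.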

b) Pick some sample trigrams from the test data, and generate predictions for the fourth

word via the trained neural network. Store the predicted probability for each of the 250

words. For each of 5 sample trigrams, list the top 10 candidates for the fourth word. Are

the network predictions sensible?
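Listing the top candidates reduces to sorting the 250 softmax probabilities. In this sketch both the probability vector and the vocabulary strings are random placeholders; in the assignment they would come from the trained network and the data file:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for one test trigram's 250 softmax outputs
o = rng.random(250)
o /= o.sum()
vocab = [f"word{i}" for i in range(250)]     # placeholder vocabulary strings

top10 = np.argsort(o)[::-1][:10]             # indices of the 10 most probable words
for i in top10:
    print(f"{vocab[i]}: {o[i]:.4f}")
```

Repeating this for each of the 5 chosen trigrams produces the requested candidate tables.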