CS 224d: Assignment #2 solved

$35.00

Category: You will receive a download link of the .ZIP file upon Payment

Description

5/5 - (1 vote)

1 Tensorflow Softmax (20 points) In this question, we will implement a linear classifier with loss function Jsof tmax−CE(W) = CE(y,softmax(xW)) (Here the rows of x are feature vectors). We will use TensorFlow’s automatic differentiation capability to fit this model to provided data. (a) (4 points) Implement the softmax function using TensorFlow in q1 softmax.py. Remember that sof tmax(x)i = e xi P j e xj (1) Note that you may not use tf.nn.softmax or related built-in functions. You can run basic (nonexhaustive tests) by running python q1 softmax.py. (b) (4 points) Implement the cross-entropy loss using TensorFlow in q1 softmax.py. Remember that CE(y, yˆ) = − X Nc i=1 yi log(ˆyi) (2) where y ∈ R 5 is a one-hot label vector and Nc is the number of classes. Note that you may not use TensorFlow’s built-in cross-entropy functions for this question. You can run basic (non-exhaustive tests) by running python q1 softmax.py. (c) (4 points) Carefully study the Model class in model.py. Briefly explain the purpose of placeholder variables and feed dictionaries in tensorflow computations. Fill in the implementations for the add placeholders, create feed dict in q1 classifier.py. Hint: Note that configuration variables are stored in the Config class. You will need to use these configuration variables in the code. 1 CS 224d: Assignment #2 (d) (4 points) Implement the transformation for a softmax classifier in function add model in q1 classifier.py. Add cross-entropy loss in function add loss op in the same file. Use the implementations from the earlier parts of the problem, not TensorFlow built-ins. (e) (4 points) Fill in the implementation for add training op in q1 classifier.py. Explain how TensorFlow’s automatic differentiation removes the need for us to define gradients explicitly. Verify that your model is able to fit to synthetic data by running python q1 classifier.py and making sure that the tests pass. Hint: Make sure to use the learning rate specified in Config. 2 Deep Networks for Named Entity Recognition (35 points) In this section, we’ll get to practice backpropagation and training deep networks to attack the task of Named Entity Recognition: predicting whether a given word, in context, represents one of four categories: • Person (PER) • Organization (ORG) • Location (LOC) • Miscellaneous (MISC) We formulate this as a 5-class classification problem, using the four above classes and a null-class (O) for words that do not represent a named entity (most words fall into this category). The model is a 1-hidden-layer neural network, with an additional representation layer similar to what you saw with word2vec. Rather than averaging or sampling, here we explicitly represent context as a “window” consisting of a word concatenated with its immediate neighbors: x (t) = [xt−1L, xtL, xt+1L] ∈ R 3d (3) where the input xt−1, xt, xt+1 are one-hot row vectors into an embedding matrix L ∈ R |V |×d , with each row Li as the vector for a particular word i = xt. We then compute our prediction as: h = tanh(x (t)W + b1) (4) yˆ = softmax(hU + b2) (5) And evaluate by cross-entropy loss J(θ) = CE(y, yˆ) = − X 5 i=1 yi log ˆyi (6) To compute the loss for the training set, we sum (or average) this J(θ) as computed with respect to each training example. For this problem, we let d = 50 be the length of our word vectors, which are concatenated into a window of width 3×50 = 150. The hidden layer has a dimension of 100, and the output layer ˆy has a dimension of 5. (a) (5 points) Compute the gradients of J(θ) with respect to all the model parameters: ∂J ∂U ∂J ∂b2 ∂J ∂W ∂J ∂b1 ∂J ∂Li Page 2 of 7 CS 224d: Assignment #2 where U ∈ R 100×5 b2 ∈ R 5 W ∈ R 150×100 b1 ∈ R 100 Li ∈ R 50 In the spirit of backpropagation, you should express the derivative of activation functions (tanh, softmax) in terms of their function values (as with sigmoid in Assignment 1). This identity may be helpful: tanh(z) = 2 sigmoid(2z) − 1 Furthermore, you should express the gradients by using an “error vector” propagated back to each layer; this just amounts to putting parentheses around factors in the chain rule, and will greatly simplify your analysis. All resulting gradients should have simple, closed-form expressions in terms of matrix operations. (Hint: you’ve already done most of the work here as part of Assignment 1.) (b) (5 points) To avoid parameters from exploding or becoming highly correlated, it is helpful to augment our cost function with a Gaussian prior: this tends to push parameter weights closer to zero, without constraining their direction, and often leads to classifiers with better generalization ability. If we maximize log-likelihood (as with the cross-entropy loss, above), then the Gaussian prior becomes a quadratic term 1 (L2 regularization): Jreg(θ) = λ 2   X i,j W2 ij + X i 0j 0 U 2 i 0j 0   (7) and we optimize the combined loss function Jfull(θ) = J(θ) + Jreg(θ) (8) Update your gradients from part (a) to include the additional term in this loss function (i.e. compute dJfull dW , etc.). (c) (5 points) In order to avoid neurons becoming too correlated and ending up in poor local minimina, it is often helpful to randomly initialize parameters. One of the most frequent initializations used is called Xavier initialization2 . Given a matrix A of dimension m × n, select values Aij uniformly from [−, ], where  = √ 6 √ m + n (9) Implement the initialization for use in xavier weight init in q2 initialization.py and use it for the weights W and U. (d) (20 points) In q2 NER.py implement the NER window model by filling in the appropriate sections. The gradients you derived in (a) and (b) will be computed for you automatically, showing the benefits that automatic differentiation can provide for rapid prototyping. Run python q2 NER.py to evaluate your model’s performance on the dev set, and compute predictions on the test data (make sure to turn off debug settings when doing final evaluation). Note that the test set has only dummy labels; we’ll compare your predictions against the ground truth after you submit. Deliverables: 1Optional (not graded): The interested reader should prove that this is indeed the maximum-likelihood objective when we let Wij ∼ N(0, 1/λ) for all i, j. 2This is also referred to as Glorot initialization and was initially described in http://jmlr.org/proceedings/papers/ v9/glorot10a/glorot10a.pdf Page 3 of 7 CS 224d: Assignment #2 • Working implementation of the NER window model in q2 NER.py. (We’ll look at, and possibly run this code for grading.) • In your writeup (i.e. where you’re writing the answers to the written problems), briefly state the optimal hyperparameters you found for your model: regularization, dimensions, learning rate (including time-varying, such as annealing), SGD batch size, etc. Report the performance of your model on the validation set. You should be able to get validation loss below 0.2. • List of predicted labels for the test set, one per line, in the file q2 test.predicted. • Hint: When debugging, set max epochs = 1. Pass the keyword argument debug=True to the call to load data in the init method. • Hint: This code should run within 15 minutes on a GPU and 1 hour on a CPU. 3 Recurrent Neural Networks: Language Modeling (45 points) In this section, you’ll implement your first recurrent neural network (RNN) for building a language model. Language modeling is a central task in NLP, and language models can be found at the heart of speech recognition, machine translation, and many other systems. Given words x1, . . . , xt, a language model predicts the following word xt+1 by modeling: P(xt+1 = vj | xt, . . . , x1) where vj is a word in the vocabulary. Your job is to implement a recurrent neural network language model, which uses feedback information in the hidden layer to model the “history” xt, xt−1, . . . , x1. Formally, the model3 is, for t = 1, . . . , n − 1: e (t) = x (t)L (10) h (t) = sigmoid  h (t−1)H + e (t) I + b1  (11) yˆ (t) = softmax  h (t)U + b2  (12) P¯(xt+1 = vj | xt, . . . , x1) = ˆy (t) j (13) where h (0) = h0 ∈ R Dh is some initialization vector for the hidden layer and x (t)L is the product of L with the one-hot row-vector x (t) representing index of the current word. The parameters are: L ∈ R |V |×d H ∈ R Dh×Dh I ∈ R d×Dh b1 ∈ R Dh U ∈ R Dh×|V | b2 ∈ R |V | (14) where L is the embedding matrix, I the input word representation matrix, H the hidden transformation matrix, and U is the output word representation matrix. b1 and b2 are biases. d is the embedding dimension, |V | is the vocabulary size, and Dh is the hidden layer dimension. The output vector yˆ (t) ∈ R |V | is a probability distribution over the vocabulary, and we optimize the (unregularized) cross-entropy loss: J (t) (θ) = CE(y (t) , yˆ (t) ) = − X |V | i=1 y (t) i log ˆy (t) i (15) 3This model is adapted from a paper by Toma Mikolov, et al. from 2010: http://www.fit.vutbr.cz/research/groups/ speech/publi/2010/mikolov_interspeech2010_IS100722.pdf Page 4 of 7 CS 224d: Assignment #2 where y (t) is the one-hot vector corresponding to the target word (which here is equal to xt+1). As in Q 2, this is a point-wise loss, and we sum (or average) the cross-entropy loss across all examples in a sequence, across all sequences4 in the dataset in order to evaluate model performance. (a) (5 points) Conventionally, when reporting performance of a language model, we evaluate on perplexity, which is defined as: PP(t)  y (t) , yˆ (t)  = 1 P¯(x pred t+1 = xt+1 | xt, . . . , x1) = 1 P|V | j=1 y (t) j · yˆ (t) j (16) i.e. the inverse probability of the correct word, according to the model distribution P¯. Show how you can derive perplexity from the cross-entropy loss (Hint: remember that y (t) is one-hot!), and thus argue that minimizing the (arithmetic) mean cross-entropy loss will also minimize the (geometric) mean perplexity across the training set. This should be a very short problem – not too perplexing! For a vocabulary of |V | words, what would you expect perplexity to be if your model predictions were completely random? Compute the corresponding cross-entropy loss for |V | = 2000 and |V | = 10000, and keep this in mind as a baseline. (b) (5 points) As you did in Q 2, compute the gradients with for all the model parameters at a single point in time t: ∂J(t) ∂U ∂J(t) ∂b2 ∂J(t) ∂Lx(t) ∂J(t) ∂I (t) ∂J(t) ∂H (t) ∂J(t) ∂b1 (t) where Lx(t) is the column of L corresponding to the current word x (t) , and (t) denotes the gradient for the appearance of that parameter at time t. (Equivalently, h (t−1) is taken to be fixed, and you need not backpropagate to earlier timesteps just yet – you’ll do that in part (c)). Additionally, compute the derivative with respect to the previous hidden layer value: ∂J(t) ∂h(t−1) (c) (5 points) Below is a sketch of the network at a single timestep: x (t) h (t) h (t−1) yˆ (t) … Draw the “unrolled” network for 3 timesteps, and compute the backpropagation-through-time gradients: ∂J(t) ∂Lx(t−1) ∂J(t) ∂H (t−1) ∂J(t) ∂I (t−1) ∂J(t) ∂b1 (t−1) 4We use the tensorflow function sequence loss to do this. Page 5 of 7 CS 224d: Assignment #2 where (t−1) denotes the gradient for the appearance of that parameter at time (t − 1). Because parameters are used multiple times in feed-forward computation, we need to compute the gradient for each time they appear. You should use the backpropagation rules from Lecture 5 5 to express these derivatives in terms of error term δ (t−1) = ∂J(t) ∂h(t−1) computed in the previous part. (Doing so will allow for re-use of expressions for t − 2, t − 3, and so on). Note that the true gradient with respect to a training example requires us to run backpropagation all the way back to t = 0. In practice, however, we generally truncate this and only backpropagate for a fixed number τ ≈ 3 − 5 timesteps. (d) (3 points) Given h (t−1), how many operations are required to perform one step of forward propagation to compute J (t) (θ)? How about backpropagation for a single step in time? For τ steps in time? Express your answer in big-O notation in terms of the dimensions d, Dh and |V |(Equation 14). What is the slow step? (e) (20 points) Implement the above model in q3 RNNLM.py. Data loaders and other starter code are provided. Follow the directions in the code to understand which parts need to be filled in. Running python q3 RNNLM.py will run the model. Note that you may not use built-in tensorflow functions such as those in the rnn cell module. Train a model on the ptb-train data, consisting of the first 20 sections of the WSJ corpus of the Penn Treebank. As in Q 2, you should tune your model to maximize generalization performance (minimize cross-entropy loss) on the dev set. We’ll evaluate your model on an unseen, but similar set of sentences. Deliverables: • In your writeup, include the best hyperparameters you used (training schedule, number of iterations, learning rate, backprop timesteps), and your perplexity score on the ptb-dev set. You should be able to get validation perplexity below 175. • Include your saved model parameters for your best model; we’ll use these to test your model. • Hint: When debugging, set max epochs = 1. Pass the keyword argument debug=True to the call to load data in the init method. • Hint: On a GPU, this code should run quickly (below 30 minutes). On a CPU, the code may take up to 4 hrs to run. (f) (7 points) The networks that you’ve seen in Assignment 1 and in q2 of this assignment are discriminative models: they take data, and make a prediction. The RNNLM model you’ve just implemented is a generative model, in that it actually models the distribution of the data sequence x1, . . . , xn. This means that not only can we use it to evaluate the likelihood of a sentence, but we can actually use it to generate one! After training, in q3 RNNLM.py, implement the generate text() function. This should run the RNN forward in time, beginning with the index for the start token , and sampling a new word xt+1 from the distribution yˆ (t) at each timestep. Then feed this word in as input at the next step, and repeat until the model emits an end token (index of ). 5http://cs224d.stanford.edu/lectures/CS224d-Lecture5.pdf Page 6 of 7 CS 224d: Assignment #2 Deliverables: • Include 2-3 generated sentences in your writeup. See if you can generate something humorous! Completely optional, not graded: If you want to experiment further with language models, you’re welcome to load up your own texts and train on them – sometimes the results can be quite entertaining! (See http://kingjamesprogramming.tumblr.com/ for a great one6 trained on a mix of the King James Bible and the Structure and Interpretation of Computer Programs.) 6This one just uses a simple n-gram Markov model, but there’s no reason an RNNLM can’t compete! Page 7 of 7