Assignment 4: HMM Part-of-Speech Tagging solved




Overview Natural Language Processing (NLP) is a subset of AI that focuses on the understanding and generation of written and spoken language. This involves a series of tasks from low-level speech recognition on audio signals up to high-level semantic understanding and inferencing on the parsed sentences. One task within this spectrum is Part-Of-Speech (POS) tagging. Every word and punctuation symbol is understood to have a syntactic role in its sentence, such as nouns (denoting people, places or things), verbs (denoting actions), adjectives (which describe nouns) and adverbs (which describe verbs), to name a few. Each word in a piece of text is therefore associated with a part-of-speech tag (usually assigned by hand), where the total number of tags can depend on the organization tagging the text. A list of all the part-of-speech tags can be found here. While this task falls under the domain of NLP, having prior language experience doesn’t offer any particular advantage. In the end, the main task is to create an HMM model that can figure out a sequence of underlying states given a sequence of observations. What You Need To Do: Your task for this assignment is to create a Hidden Markov Model (HMM) for POS tagging, including: 1. Training probability tables (i.e., initial, transition and emission) for HMM from training files containing text-tag pairs 2. Performing inference with your trained HMM to predict appropriate POS tags for untagged text. Your solution will be graded based on the learned probability tables and the accuracy on our test files, as well as the efficiency of your algorithm. See Mark Breakdown for more details. Starter Code & Validation Program The starter code contains one Python starter file, a validation program and several training and test files. You can download the code and supporting files as a zip file In that archive, you will find the following files: Project File (the file you will edit and submit on Markus): The file where you will implement your POS tagger; this is the only file tand graded. Training Files (don’t modify): data/training1.txt – training5.txt Training files (in text format) containing large texts with POS tags on eacTesting Files (don’t modify): data/test1.txt – test5.txt Test files (in text format), identical to the training files but without the POValidation Files: validation/ The public validation script for testing your solution with a set of providefiles. Running the Code You can run the POS tagger by typing the following at a command line: $ python3 -d -t -o where the parameters consist of: • one or more training file names, separated by a spaces • a single test file name • a single output file name The script will read in for training, and for testing. The test output (POS predictions) will be written to . Here is an example: $ python3 -d data/training1.txt data/training2.txt -t data/test1.txt -o dat a/output1.txt Making the HMM Tagger In the hidden Markov model, the hidden variables represent the POS tags, and the evidence variables represent the words in the sentences. During the training phase, you will need to learn three probability tables described below: • The initial probabilities over the POS tags (how likely each POS tag appears at the beginning of a sentence) • The transition probabilities from one POS tag to another. • The emission probabilities from each POS tag to each observed word. Learning these probabilities will require you to perform counting. For instance, to determine the initial probability for the POS tag NP0, you need to divide the number of times NP0 begins a sentence by the total number of sentences. How you store these transitions and their associated probabilities is up to you. Be aware of the storage and performance tradeoffs associated with each design choice. During the test phase, you will use these probability tables to calculate the most likely tag for each word in a sentence using inference algorithms like the Viterbi algorithm (or your own variation). You are required to split the training and test procedures in You can assume independence between sentences for the purpose of this assignment. As a result, you can split the text files into individual sentences during training and testing without losing information in the training and prediction process. Mark Breakdown The majority of your mark for this assignment will be calculated based on the accuracy of your Part-of-Speech tagger: Marking Scheme Component Weight Correctness (% words tagged correctly) 90% Heuristic file 10% As shown in this table, 10% of your mark for this assignment is earned by sharing the interesting heuristics or techniques that you used to enhance the performance of your HMM. These are meant to be part of your solution and described in a file called heuristics.pdf, which you will submit to Markus along with your file. This mark is assigned separately from your correctness mark (i.e. you can get full marks for your heuristics even if you don’t get full correctness marks). To get full marks on our correctness tests, we want to see your code predict correct tags at least 90% of the time. If your accuracy is less than 90%, your correctness mark will be calculated as (your accuracy / 90). For example, if you achieve 72% accuracy on a test worth 10 points, you will get (72/90)*10 = 8 points on that test. Our runtime expectations will be ~5 minutes for a single training & test file but will be relaxed for cases involving multiple training files. That being said, be mindful of the space and runtime performance of your data structures. Validating Your Solution As always, we have provided a validation script that you can use to check your solution before submitting it. Remember that this code is only meant to confirm that it runs as specified and gives you a basic idea of how your HMM performs. Getting 100% on the validation test only means that your HMM is not broken. It does not mean you are guaranteed 100% as your final mark. You will want to create tests of your own that vary the combination of training and testing files to see how your code generalizes. While we will use some of the same training and test files as provided to you here, there will also be unseen training and test files used to grade your solution. Help! My HMM Isn’t Working! If you’ve been reviewing the slides on HMMs and the Viterbi algorithm and can’t seem to get things to work, don’t panic. In the end, your mark for this assignment is based on the accuracy of your part-of-speech tagger, not whether you complete the Viterbi algorithm exactly right. As with past assignments, some of the intermediate stages to Viterbi can also perform the tagging task in a less accurate but more straightforward way. For instance, you could do the following: • Use the emission probability table to label each word with the most likely POS tag, ignoring the context of the words around it. • Use the transition probability table to calculate the most likely POS tag for the current word, given the POS tag for the previous word. Viterbi is just the more advanced form of this approach, calculating the likelihood of each POS label for the current word, given the previous sequence of words in the sentence. But if you’re getting stuck and running out of time, the alternatives above will serve in a pinch. What to Submit You will be using MarkUs to submit your assignment. You will submit only one file: your modified Make sure to document the code to help our TAs understand the approach you used!