CSE 4334/5334 Programming Assignment 1 (P1)




1. Description of Task
Your code should accomplish the following tasks:
(1) Read the text file debate.txt. This is the transcript of the latest Texas Senate race debate between Ted Cruz
and Beto O’Rourke. The following code does it.
In [7]: import os
filename = './debate.txt'
file = open(filename, "r", encoding='UTF-8')
doc = file.read()
file.close()
(2) Tokenize the content of the file. For this, you need a tokenizer. For example, the following piece of code uses
a regular expression tokenizer to return all course numbers in a string. Play with it and edit it. You can change
the regular expression and the string to observe different output results.
For tokenizing the Texas Senate debate transcript, let’s all use RegexpTokenizer(r'[a-zA-Z]+'). What tokens will it
produce? What limitations does it have?
In [ ]: from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[A-Z]{2,3}[1-9][0-9]{3,3}')
tokens = tokenizer.tokenize("CSE4334 and CSE5334 are taught together. IE3013 is an undergraduate course.")
print(tokens)  # ['CSE4334', 'CSE5334', 'IE3013']
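To get a feel for what the pattern [a-zA-Z]+ produces, the sketch below applies it to a made-up sentence. It uses Python's built-in re.findall as a stand-in, on the assumption that RegexpTokenizer in its default mode returns the same matches. The sample sentence, the helper name tokenize, and the lowercasing step are illustrative assumptions, not part of the assignment.

```python
import re

# Stand-in for RegexpTokenizer(r'[a-zA-Z]+').tokenize(text): assumed to
# return every maximal run of ASCII letters, left to right.
def tokenize(text):
    return re.findall(r'[a-zA-Z]+', text)

# Made-up sentence (not from debate.txt), lowercased before tokenizing.
sample = "We can't spend $5 billion on a U.S.-Mexico wall in 2018."
tokens = tokenize(sample.lower())
print(tokens)
```

The apostrophe splits can't into can and t, the digits in $5 and 2018 disappear, and U.S. falls apart into single letters. These are the kinds of limitations the question above asks about.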
(3) Perform stopword removal on the obtained tokens. NLTK already comes with a stopword list, as a corpus in
the “NLTK Data” (http://www.nltk.org/nltk_data/). You need to install this corpus.
Follow the instructions at http://www.nltk.org/data.html. You can also find the
instructions in this book: http://www.nltk.org/book/ch01.html (Section 1.2
Getting Started with NLTK). Basically, use the following statements in the Python interpreter. A pop-up window will
appear. Click “Corpora” and choose “stopwords” from the list.
In [ ]: import nltk
nltk.download()
After the stopword list is downloaded, you will find a file “english” in folder nltk_data/corpora/stopwords, where
folder nltk_data is the download directory in the step above. The file contains 179 stopwords.
nltk.corpus.stopwords will give you this list of stopwords. Try the following piece of code.
In [ ]: from nltk.corpus import stopwords
print(stopwords.words('english'))
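Once the list is available, stopword removal amounts to a membership test per token. The sketch below uses a tiny hand-written set named STOPWORDS as a stand-in so that it runs without the downloaded corpus; in the assignment this set would presumably be built from stopwords.words('english'). The token list and the helper name remove_stopwords are made up.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real list
# is much longer.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "we"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

# Made-up, already lowercased tokens.
tokens = ["we", "need", "to", "secure", "the", "border"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Holding the stopwords in a set rather than a list keeps each lookup cheap, which helps when every paragraph of the transcript has to be filtered.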
(4) Also perform stemming on the obtained tokens. NLTK comes with a Porter stemmer. Try the following code
and learn how to use the stemmer.
In [ ]: from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
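A short usage sketch follows; stem() is the method that maps a single token to its stem. The sample words are chosen for illustration and are not taken from the assignment's notebook.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stem a few sample words, one token at a time.
words = ["running", "debates", "hispanic"]
stems = [stemmer.stem(w) for w in words]
print(stems)
```

Stems are not always dictionary words, so a query term has to go through the same stemmer before it can be matched against the stemmed transcript.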
(5) Using the tokens, compute the TF-IDF vector for each paragraph. In this assignment, for calculating
inverse document frequency, treat debate.txt as the whole corpus and the paragraphs as documents.
That is also why we ask you to compute the TF-IDF vectors separately for all the paragraphs, one vector per
paragraph. Use the term-weighting equation that we learned in the lectures to calculate the term weights, in which $t$ is a token
and $d$ is a document (i.e., paragraph).
Note that the TF-IDF vectors should be normalized (i.e., their lengths should be 1).
Represent a TF-IDF vector by a dictionary. The following is a sample TF-IDF vector.
In [ ]: {'sanction': 0.014972337775895645, 'lack': 0.008576372825970286, 'regret': 0.009491784747267843, 'winter': 0.030424375278541155}
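The whole of step (5) can be sketched on a toy corpus as below. The sketch assumes the lecture weighting is w = (1 + log10 tf) × log10(N / df), with N the number of paragraphs and df the number of paragraphs containing the token; check this against your lecture notes. The three token lists and the helper name tfidf_vectors are made up, and the tokens are assumed to be already tokenized, stopword-filtered, and stemmed.

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """Return one normalized TF-IDF dict per paragraph (list of tokens)."""
    n = len(paragraphs)
    # Document frequency: number of paragraphs that contain each token.
    df = Counter(tok for para in paragraphs for tok in set(para))
    vectors = []
    for para in paragraphs:
        tf = Counter(para)
        # Assumed weighting: (1 + log10 tf) * log10(N / df).
        vec = {t: (1 + math.log10(c)) * math.log10(n / df[t])
               for t, c in tf.items()}
        # Scale to unit length; an all-zero vector is left as it is.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

# Toy corpus: three already-processed "paragraphs".
paragraphs = [
    ["tax", "cut", "tax"],
    ["border", "wall", "tax"],
    ["border", "secur"],
]
vectors = tfidf_vectors(paragraphs)
print(vectors[0])
```

In the first toy paragraph, cut ends up with a larger weight than tax even though tax occurs twice, because cut appears in only one paragraph and so has the higher inverse document frequency.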
(6) Given a query string, calculate the query vector. Compute the cosine similarity between the query vector
and the paragraphs in the transcript. Return the paragraph that attains the highest cosine similarity score. In
calculating the query vector, the vector is also to be normalized.
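Because both the query vector and the paragraph vectors have length 1, their cosine similarity reduces to a dot product over the tokens they share. A minimal sketch with dictionary vectors follows; the function name cosine and the numbers are illustrative, and how the query weights themselves are computed is left to the lecture definition.

```python
def cosine(u, v):
    """Dot product of two sparse vectors stored as dicts.

    Equals the cosine similarity when both vectors have unit length.
    """
    # Iterate over the shorter dict to limit the number of lookups.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

# Made-up unit-length vectors.
qvec = {"tax": 0.6, "cut": 0.8}
doc_vecs = [
    {"tax": 1.0},
    {"cut": 0.6, "wall": 0.8},
    {"border": 1.0},
]
scores = [cosine(qvec, d) for d in doc_vecs]
# Index of the paragraph with the highest score.
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])
```

Only tokens present in both dictionaries contribute, so a paragraph that shares no token with the query scores exactly zero.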
2. What to Submit
Submit through Blackboard your source code in a single .py file. You can define as many functions as you want,
but the file must define at least the following functions:
getidf(token): return the inverse document frequency of a token. If the token doesn’t exist in the
corpus, return -1. The parameter ‘token’ is already stemmed. (It means you should not perform
stemming inside this function.) Note the differences between getidf("hispan") and getidf("hispanic").
getqvec(qstring): return the query vector for a query string.
query(qstring): return the paragraph in the transcript that has the highest cosine similarity score with
respect to ‘qstring’. Return its score too. The output format should be as follows.
the paragraph
the score
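Term-weight equation for task (5) in Section 1, where $N$ is the number of paragraphs and $df_t$ is the number of paragraphs containing $t$:
$$w_{t,d} = (1 + \log_{10} tf_{t,d}) \times \log_{10}\left(\frac{N}{df_t}\right)$$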
If all paragraphs have zero scores, return the following message.
No Match
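The return conventions above can be sketched as follows. Everything outside the three required function names is hypothetical: the module-level tables PARAGRAPHS, DOC_VECS, and IDF stand for whatever your preprocessing builds, and the tiny values in them are made up. getqvec is reduced to a placeholder that skips stopword removal and stemming and weights tokens by log-scaled counts only. Whether query should return a tuple or print its two values is also an assumption; follow sampleresults.txt.

```python
import math
import re
from collections import Counter

# Hypothetical precomputed tables; real ones come from debate.txt.
PARAGRAPHS = ["we will cut taxes", "secure the border"]
DOC_VECS = [{"cut": 0.6, "tax": 0.8}, {"secur": 0.6, "border": 0.8}]
IDF = {"cut": 0.301, "tax": 0.301, "secur": 0.301, "border": 0.301}

def getidf(token):
    """Return the stored IDF of an already-stemmed token, or -1 if unseen."""
    return IDF.get(token, -1)

def getqvec(qstring):
    """Placeholder query vector: log-scaled counts, unit length.

    Stopword removal, stemming, and any IDF factor are omitted here.
    """
    tf = Counter(re.findall(r'[a-zA-Z]+', qstring.lower()))
    vec = {t: 1 + math.log10(c) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def query(qstring):
    """Return (paragraph, score) for the best match, or ("No Match", 0.0)."""
    qvec = getqvec(qstring)
    scores = [sum(w * d.get(t, 0.0) for t, w in qvec.items())
              for d in DOC_VECS]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] <= 0:
        return "No Match", 0.0
    return PARAGRAPHS[best], scores[best]

print(getidf("tax"), getidf("taxes"))
print(query("border"))
print(query("winter"))
```

Note that getidf looks the token up exactly as given, so the unstemmed taxes is reported as missing even though its stem tax is in the table.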
3. Sample Results
The file “sampleresults.txt” provides multiple sample results of calling the aforementioned three functions.
We use a script to grade automatically. Make sure your code produces identical results in identical format;
otherwise you will not receive points. And, of course, make sure your program runs.
4. Grading Rubrics
Your program will be evaluated on correctness, efficiency, and code quality.
Make sure to thoroughly understand the grading rubrics in file “rubrics.txt”.