## Description

Problem 1. Naive Bayes Text Classification (25 Points)

Working with data can often be messy and frustrating, especially when you are learning a language. The following is designed to be an introduction to working with real-world data. You will implement an authorship attribution algorithm for the Federalist dataset using a Naive Bayes classifier. This dataset contains historically important papers written by two founding American politicians (Hamilton and Madison), and your task will be to classify the author from the text. The authorship of these papers has been the subject of interesting historical debate, so this use of Naive Bayes is meaningful. Please note that the vast majority of the code has been supplied for you (inline and in hw03.R); your code should only involve a few commands per part, many of which will call the functions we supply.

a. Step 1 (5 points)

Download the Federalist Paper documents from the course website and place them in your working directory for R. First we must preprocess the data to 1) remove non-letter characters, 2) remove stopwords, and 3) stem words. Stopwords are short function words, such as "in", "is", and "the". Stemming trims inflected words down to their stem, such as reducing "running" to "run".

##########################################
# This code uses tm to preprocess the papers into a format useful for NB
preprocess.directory = function(dirname){
  # the directory must have all the relevant text files
  ds = DirSource(dirname)
  # Corpus will make a tm document corpus from this directory
  fp = Corpus( ds )
  # inspect to verify
  # inspect(fp[1])
  # another useful command
  # identical(fp[[1]], fp[["Federalist01.txt"]])
  # now let us iterate through and clean this up using tm functionality
  for (i in 1:length(fp)){
    # make all words lower case; content_transformer() is needed to
    # wrap base R functions like tolower in recent versions of tm
    fp[i] = tm_map( fp[i], content_transformer(tolower) );
    # remove all punctuation
    fp[i] = tm_map( fp[i], removePunctuation);
    # remove stopwords like the, a, and so on.
    fp[i] = tm_map( fp[i], removeWords, stopwords("english"));
    # stem the words, trimming suffixes
    fp[i] = tm_map( fp[i], stemDocument)
    # remove extra whitespace
    fp[i] = tm_map( fp[i], stripWhitespace)
  }
  # now write the corpus out to the files for our future use.
  # MAKE SURE THE _CLEAN DIRECTORY EXISTS
  writeCorpus( fp, sprintf("%s_clean", dirname) )
}
##########################################

The above code creates a reusable function in R, which takes the input argument dirname. Use functions whenever you need to run the same bit of code repeatedly; in this case, we will be reading in data from four different directories. Your task is to run this code on each of the four directories: fp_hamilton_train, fp_hamilton_test, fp_madison_train, fp_madison_test. Look at the files before (namely, in the original directory) and after (namely, in the corresponding '_clean' directory) you have used the above function.

Your submitted code for problem 1 should, for part a, load this function with source('hw03.R'), after which you should call the above function on each of the four directories. The code should be very easy once you understand the above steps.

Please note that the package tm used above already provides much of the functionality we will build below. However, for instructive purposes, we will implement the remaining functions manually. Be sure you install tm correctly.

b. Step 2 (5 points)

We are next going to use a function to load each of the Federalist Papers from its corresponding directory into your workspace:

##########################################
# To read in data from the directories:
# Partially based on code from C. Shalizi
read.directory <- function(dirname) {
  # Store the infiles in a list
  infiles = list();
  # Get a list of filenames in the directory
  filenames = dir(dirname, full.names=TRUE);
  for (i in 1:length(filenames)){
    infiles[[i]] = scan(filenames[i], what="", quiet=TRUE);
  }
  return(infiles)
}
##########################################
This code creates a list of input files (here, the individual Federalist Papers), which we will refer to as infiles. Lists store objects with an ordering; in this case, the order is dictated by the position in the directory. The code is structured as follows:
• infiles = list() instantiates the variable infiles as a list, albeit with no entries currently.
• filenames = dir(dirname,full.names=TRUE) creates a vector of strings populated by the names of the files in directory dirname. The option full.names=TRUE means that the directory is prepended onto each file name to give a path; if this is set to FALSE, only the file names are given.
• for (i in 1:length(filenames)){ loops through all of the file names.
• infiles[[i]] = scan(filenames[i],what="",quiet=TRUE) fills list element i with a vector of input values. Here what="" means that all fields will be read as strings. The default separator is whitespace; this can be changed through the sep argument. The argument quiet=TRUE keeps scan from displaying statistics about the imported document, which would fill up the console when hundreds of documents are read. The net result of this command is that element i of infiles is filled by a vector of strings, with each string representing a word.
• return(infiles) returns the object infiles when read.directory(dirname) is called.
Your code should now read each of the four cleaned directories (the ones that you created in the previous part) into the following variable names:
hamilton.train
hamilton.test
madison.train
madison.test
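One way to produce these four lists, assuming the cleaned directories follow the <dirname>_clean naming produced by writeCorpus() in Step 1:

```r
# Read each cleaned directory into a list of word vectors
hamilton.train <- read.directory("fp_hamilton_train_clean")
hamilton.test  <- read.directory("fp_hamilton_test_clean")
madison.train  <- read.directory("fp_madison_train_clean")
madison.test   <- read.directory("fp_madison_test_clean")
```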
c. Step 3 (5 points)
We are next going to use all of the files (training and testing for both authors) to make
a dictionary, or a list of all the words used. We have again supplied the key function:
##########################################
# Make dictionary sorted by number of times a word appears in corpus
# (useful for using commonly appearing words as factors)
# NOTE: Use the *entire* corpus: training and testing, for both authors
make.sorted.dictionary.df <- function(infiles){
  # This returns a dataframe that is sorted by the number of times
  # a word appears
  # List of vectors to one big vector
  dictionary.full <- unlist(infiles)
  # Tabulate the full dictionary
  tabulate.dic <- tabulate(factor(dictionary.full))
  # Find unique values
  dictionary <- unique(dictionary.full)
  # Sort them alphabetically
  dictionary <- sort(dictionary)
  dictionary.df <- data.frame(word = dictionary, count = tabulate.dic)
  sort.dictionary.df <- dictionary.df[order(dictionary.df$count, decreasing=TRUE),];
  return(sort.dictionary.df)
}
##########################################
The function make.sorted.dictionary.df(infiles) takes the list infiles generated by read.directory() and creates a data frame that holds 1) the list of unique words and 2) the number of times each appears in the corpus, sorted in descending order of frequency. We want the sorted list so that we can easily access the most-used words. The code is structured as follows:
• dictionary.full <- unlist(infiles) takes the list infiles and converts it into one very long vector, dictionary.full.
• tabulate.dic <- tabulate(factor(dictionary.full)) makes a vector of counts for each word that appears in dictionary.full. The argument factor(dictionary.full) tells tabulate to bin the data according to the enumerated categories (levels) of dictionary.full.
• dictionary <- unique(dictionary.full) returns a vector of the unique words in dictionary.full.
• dictionary <- sort(dictionary) sorts the vector dictionary from a to z; this is necessary because the levels of factor(dictionary.full) are sorted alphabetically, so the counts in tabulate.dic line up with the alphabetized words.
• dictionary.df <- data.frame(word = dictionary, count = tabulate.dic) creates a new data frame, dictionary.df, with two columns, word and count. The word column is populated by the alphabetized dictionary of words; the count column is populated by the counts for the alphabetized words.
• sort.dictionary.df <- dictionary.df[order(dictionary.df$count,decreasing=TRUE),] reorders the rows of dictionary.df in descending order of count. order(dictionary.df$count,decreasing=TRUE) returns the index ordering obtained when the count values are sorted in a descending manner; this is used to rearrange the rows, but not the columns, of the data frame.
• return(sort.dictionary.df) returns the sorted values as a data frame.
Your task is now to create a dictionary for the full dataset (all test and training papers across both authors) by concatenating the individual lists into a single large one, and then using that large list to create the dictionary.
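The concatenation and dictionary construction can be sketched as below; the variable names all.infiles and dictionary are illustrative choices, not mandated by the assignment.

```r
# Concatenate all four lists of documents into one list, then build
# the sorted dictionary over the entire corpus
all.infiles <- c(hamilton.train, hamilton.test,
                 madison.train, madison.test)
dictionary <- make.sorted.dictionary.df(all.infiles)
```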
d. Step 4 (5 points)
We are now going to create counts for each dictionary word in all documents and place
them in a document (rows) by word (columns) matrix. The following code will be used:
##########################################
# Make a document-term matrix, which counts the number of times each
# dictionary element is used in a document
make.document.term.matrix <- function(infiles, dictionary){
  # This takes the text and dictionary objects from above and outputs a
  # document-term matrix
  num.infiles <- length(infiles);
  num.words <- length(dictionary);
  # Instantiate a matrix where rows are documents and columns are words
  dtm <- mat.or.vec(num.infiles, num.words); # A matrix filled with zeros
  for (i in 1:num.infiles){
    num.words.infile <- length(infiles[[i]]);
    infile.temp <- infiles[[i]];
    for (j in 1:num.words.infile){
      ind <- which(dictionary == infile.temp[j]);
      dtm[i,ind] <- dtm[i,ind] + 1;
    }
  }
  return(dtm);
}
##########################################
In many cases, producing a full document-term matrix is a poor idea: it requires a large amount of storage space for relatively little information. In settings with large corpora or large dictionaries, it is often best to store the data as tokens that simply record the document number, the word's index in the dictionary, and the count for that word-document combination. Only counts of at least 1 are stored, yielding something like this:

document  word  count
       3     1      2
       3     5      1
       3    11      2
       3    12      2
     ...   ...    ...

However, the matrix representation greatly simplifies the code required to implement a naive Bayes classifier. The function is structured as follows:
• num.infiles <- length(infiles) calculates the number of infiles by computing
the length of the list.
• num.words <- length(dictionary) calculates the length of the dictionary.
• dtm <- mat.or.vec(num.infiles,num.words) instantiates a matrix with num.infiles
rows and num.words columns that is filled with zeros. Note that mat.or.vec(num.infiles)
would instantiate a vector with num.infiles entries that is filled with zeros.
• for (i in 1:num.infiles){ loops through each of the infiles.
• num.words.infile <- length(infiles[[i]]) calculates the number of words in
infile i.
• infile.temp <- infiles[[i]] temporarily stores the vector infile i in infile.temp.
• for (j in 1:num.words.infile){ loops through each of the words in infile i.
• ind <- which(dictionary == infile.temp[j]) stores the dictionary index equal
to the jth word of infile i.
• dtm[i,ind] <- dtm[i,ind] + 1 adds a count of 1 to the document-term matrix for word ind in infile i.
• return(dtm) returns the document-term matrix.
Your task is to create a document-term matrix for each group of papers. Since we are looping over a large space in R, this may take a few minutes to run. Make the following:
dtm.hamilton.train
dtm.hamilton.test
dtm.madison.train
dtm.madison.test
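All four matrices should share the same dictionary so that their columns line up. A sketch, assuming the full dictionary from Step 3 is stored in a data frame called dictionary (the word column holds the vector of words that make.document.term.matrix() expects):

```r
# Build one document-term matrix per group, all over the same word list
dtm.hamilton.train <- make.document.term.matrix(hamilton.train, dictionary$word)
dtm.hamilton.test  <- make.document.term.matrix(hamilton.test,  dictionary$word)
dtm.madison.train  <- make.document.term.matrix(madison.train,  dictionary$word)
dtm.madison.test   <- make.document.term.matrix(madison.test,   dictionary$word)
```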
e. Step 5 (5 points)
This exercise culminates in building a Naive Bayes classifier. All of your data is now in matrix form, and we are now going to use it to compute naive Bayes probabilities. Computationally, the normalizer $p(x^{\text{test}})$ may be ignored, since we only need to determine which class has the higher likelihood. A test document is declared to be authored by Hamilton ($y = 1$) if

$$ p(y = 1) \prod_{j=1}^{n_{\text{test}}} p(X_j = x_j^{\text{test}} \mid y = 1) \;\ge\; p(y = 0) \prod_{j=1}^{n_{\text{test}}} p(X_j = x_j^{\text{test}} \mid y = 0), $$

and by Madison ($y = 0$) otherwise. Therefore, we only need to estimate the probabilities $p(y = 1)$, $p(y = 0)$, $p(x_j = k \mid y = 1)$ for $k = 1, \ldots, |D|$, and $p(x_j = k \mid y = 0)$ for $k = 1, \ldots, |D|$.
When computing probabilities defined by products, it is more convenient to work in log
probabilities. Consider a dictionary with two words, each with equal probability given
that the document is by Hamilton, and a test document with 100 words. Then,
$$ p(y = 1 \mid x^{\text{test}}) = \frac{1}{2} \times \prod_{j=1}^{100} \frac{1}{2} = \left(\frac{1}{2}\right)^{101} \approx 3.9 \times 10^{-31}. $$
While this is a best case for a small document, the value is already minuscule, and with realistic dictionaries and longer documents such products quickly underflow machine precision. If we instead compute the log probability,

$$ \log p(y = 1 \mid x^{\text{test}}) = \log \left(\frac{1}{2}\right)^{101} = -101 \times \log(2) \approx -70.01. $$

This can easily be stored within machine precision. Even when computing a vector of probabilities, it is often a good idea to work in log probabilities. Note that programs like R or Matlab use base e.
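The underflow issue is easy to reproduce at the R console; with enough factors, the product rounds to exactly zero while the log-probability sum stays comfortably representable:

```r
p <- rep(1/2, 1100)   # 1100 factors of one half
prod(p)               # underflows to 0 in double precision
sum(log(p))           # -1100 * log(2), about -762.5, stored without trouble
```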
To compute a vector of log probabilities from a document term matrix with µ phantom
examples, use the following function:
##########################################
make.log.pvec <- function(dtm, mu){
  # Sum up the number of instances per word
  pvec.no.mu <- colSums(dtm)
  # Sum up number of words
  n.words <- sum(pvec.no.mu)
  # Get dictionary size
  dic.len <- length(pvec.no.mu)
  # Incorporate mu and normalize
  log.pvec <- log(pvec.no.mu + mu) - log(mu*dic.len + n.words)
  return(log.pvec)
}
##########################################
The following commands were introduced in this code:
• sum(pvec.no.mu) sums over all values in the vector pvec.no.mu.
• colSums(dtm) sums each column of the matrix dtm (summing over its rows), producing a vector with as many elements as dtm has columns.
• pvec.no.mu + mu adds the scalar value mu to every element of the vector pvec.no.mu.
• (pvec.no.mu + mu)/(mu*dic.len + n.words) divides every element of the vector pvec.no.mu + mu by the scalar value mu*dic.len + n.words; in the code this ratio is computed on the log scale as a difference of logs.
Your task is to calculate the log probability vectors for all document-term matrices. Make your code compute the following with µ = 1/|D|, where |D| is the size of the dictionary (number of words in the dictionary):
logp.hamilton.train
logp.hamilton.test
logp.madison.train
logp.madison.test
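With the document-term matrices in place, these vectors follow directly. The sketch below assumes the full dictionary from Step 3 is stored in a data frame called dictionary:

```r
D  <- nrow(dictionary)   # dictionary size |D|
mu <- 1 / D
logp.hamilton.train <- make.log.pvec(dtm.hamilton.train, mu)
logp.hamilton.test  <- make.log.pvec(dtm.hamilton.test,  mu)
logp.madison.train  <- make.log.pvec(dtm.madison.train,  mu)
logp.madison.test   <- make.log.pvec(dtm.madison.test,   mu)
```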
Question 2. (25 points) Write a function to assign authorship to a test paper (you don't have to explicitly compute $\log \hat{p}(y = \tilde{y} \mid x^{\text{test}})$):

naive.bayes = function(logp.hamilton.train, logp.madison.train,
                       log.prior.hamilton, log.prior.madison, dtm.test)

This function takes the log probabilities for the dictionary estimated from Hamilton's training documents (logp.hamilton.train) and from Madison's (logp.madison.train), the log priors (log.prior.hamilton, log.prior.madison), and a document-term matrix for the testing corpus. Here you must write the Naive Bayes classifier code as a function.
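One possible shape for this function is sketched below; it is a sketch of the standard log-space naive Bayes decision rule, not the required solution. For each row of dtm.test it compares the two scores log p(y) + Σ_j x_j log p(x_j | y) from the inequality in Step 5:

```r
naive.bayes = function(logp.hamilton.train, logp.madison.train,
                       log.prior.hamilton, log.prior.madison, dtm.test) {
  # Log score of each test document under each class: the dot product of
  # word counts with per-word log probabilities, plus the class log prior
  score.hamilton <- log.prior.hamilton + dtm.test %*% logp.hamilton.train
  score.madison  <- log.prior.madison  + dtm.test %*% logp.madison.train
  # Predict Hamilton (1) when his score is at least as large, else Madison (0)
  ifelse(score.hamilton >= score.madison, 1, 0)
}
```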
Question 3. (20 points) Set µ = 1/|D| as before, use the training sets for training, and use the testing sets for testing. What percentage of test papers are classified correctly? Additionally, report the proportion of your: true positives (Hamilton classified as Hamilton, divided by the total number of testing Hamilton papers), true negatives (Madison classified as Madison, divided by the total number of testing Madison papers), false positives (Madison classified as Hamilton, divided by the total number of testing Madison papers), and false negatives (Hamilton classified as Madison, divided by the total number of testing Hamilton papers).
Question 4. (30 points) Naive Bayes has a tunable parameter, µ. We will use 5-fold cross-validation over the training set to search the parameter set µ ∈ { 1/(10|D|), 1/|D|, 10/|D|, 100/|D|, 1000/|D| }. Create three matrices to store the proportion correctly classified, the proportion of false negatives, and the proportion of false positives. These should be 5 × 5 matrices (folds by µ values).
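One simple way to form the folds is sketched below; the variable folds is an illustrative name, and the sketch assumes the training documents can be split by row index:

```r
# Randomly partition the training document indices into 5 folds
n <- nrow(dtm.hamilton.train)
folds <- split(sample(1:n), rep(1:5, length.out = n))
# folds[[k]] gives the row indices held out in fold k;
# train on the remaining rows and validate on folds[[k]]
```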
a. (10 points) For each value of µ, estimate the correct classification rate, the false negative
rate and the false positive rate using 5-fold cross-validation. (Assume that you don’t have
access to the testing sets). Summarize your results in three graphs.
b. (5 points) What seems to be the best value for µ? Why?
c. (10 points) For each value of µ, train on the full training set and test on the full testing
set. Summarize the correct classification rate, the false negative rate and the false positive
rate in three graphs. Does your answer from (b) still seem the best value? Why or why
not?
d. (5 points) How close are the rates estimated from cross-validation to the true rates on the testing set (the rates from (c))? Give the percentage error. What could account for the differences? Give one way the differences between the cross-validation rate estimates and the rates on the testing sets could be minimized.