# Homework 3 Statistics S4240: Data Mining
Problem 1. Naive Bayes Text Classification (25 Points)

Working with data can often be messy and frustrating, especially when you are learning a language. The following is designed to be an introduction to working with real-world data. You will implement an authorship attribution algorithm for the Federalist dataset using a Naive Bayes classifier. This dataset contains historically important papers whose authorship is disputed between two American founders, Alexander Hamilton and James Madison. The authorship of these papers has been the subject of interesting historical debate, so this use of Naive Bayes is meaningful. Please note that the vast majority of the code has been supplied for you (inline and in hw03.R); your code should only involve a few commands per part, where a number of those commands will call the functions we supply.
a. Step 1 (5 points)
Download the Federalist Paper documents from the course website. Place them in
your working directory for R. First we must preprocess the data to 1) remove non-letter
characters, 2) remove stopwords, and 3) stem words. Stopwords are short function words,
such as in, is, and the. Stemming involves trimming inflected words to their stem, such as
reducing running to run.
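As a quick illustration of these two operations (a sketch assuming the tm and SnowballC packages are installed; the example words are ours, not part of the assignment):

##########################################
library(tm)
# Stopword removal: "the", "is", and "in" are on tm's English stopword list,
# so removeWords blanks them out (leaving extra whitespace behind)
removeWords("the senate is in session", stopwords("english"))
# Stemming: inflected forms are trimmed toward a common stem
stemDocument(c("running", "runs"))
##########################################

The leftover whitespace from removeWords is exactly what the stripWhitespace step in the function below cleans up.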
##########################################
# This code uses tm to preprocess the papers into a format useful for NB
# (requires the tm package: library(tm))
preprocess.directory = function(dirname){
  # the directory must have all the relevant text files
  ds = DirSource(dirname)
  # Corpus will make a tm document corpus from this directory
  fp = Corpus(ds)
  # inspect to verify
  # inspect(fp[1])
  # another useful command
  # identical(fp[[1]], fp[["Federalist01.txt"]])
  # now let us iterate through and clean this up using tm functionality
  for (i in 1:length(fp)){
    # make all words lower case
    # (note: recent tm versions require wrapping base functions,
    #  e.g. tm_map(fp[i], content_transformer(tolower)))
    fp[i] = tm_map(fp[i], tolower);
    # remove all punctuation
    fp[i] = tm_map(fp[i], removePunctuation);
    # remove stopwords like the, a, and so on.
    fp[i] = tm_map(fp[i], removeWords, stopwords("english"));
    # stem words, i.e., strip inflectional suffixes
    fp[i] = tm_map(fp[i], stemDocument)
    # remove extra whitespace
    fp[i] = tm_map(fp[i], stripWhitespace)
  }
  # now write the corpus out to the files for our future use.
  # MAKE SURE THE _CLEAN DIRECTORY EXISTS
  writeCorpus(fp, sprintf('%s_clean', dirname))
}
##########################################
The above code creates a reusable function in R, which takes the input argument dirname.
Use functions when you need to repeatedly run a bit of code. In this case, we will be
applying the function to each of the four directories: fp_hamilton_train, fp_hamilton_test,
fp_madison_train, fp_madison_test. Look at the files before (namely, in the original
directory) and after (namely, in the corresponding `_clean` directory) you have used the
above function.

Your submitted code for problem 1 should, for part a, load this function with source('hw03.R'),
after which you should call the above function on each of the four directories. The code
should be very easy once you understand the above steps.
Please note that the above function uses the package tm, which provides much of the
functionality needed below. However, for instructive purposes, we will implement the
remaining functions manually. Be sure you install tm correctly.
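Part a then reduces to a few lines. A sketch (assuming hw03.R and the four paper directories sit in your R working directory; the dir.create calls are optional insurance that the required `_clean` output directories exist):

##########################################
library(tm)
source('hw03.R')   # defines preprocess.directory

# run the preprocessing on each of the four directories
dirs = c('fp_hamilton_train', 'fp_hamilton_test',
         'fp_madison_train', 'fp_madison_test')
for (d in dirs){
  # writeCorpus needs the _clean directory to already exist
  dir.create(sprintf('%s_clean', d), showWarnings = FALSE)
  preprocess.directory(d)
}
##########################################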
b. Step 2 (5 points)
We are next going to use a function to load each of the Federalist Papers from their