## Description

Problem 1. (10 Points) James 6.8.1

Problem 2. (10 Points) James 6.8.3

Problem 3. (10 Points) James 6.8.5

Problem 4. (10 Points) James 8.4.5

Questions 5 to 7 use classification trees and logistic regression to classify the Federalist Papers.

Question 5. (20 Points) Use your code from Homework 3 to read in the Federalist Papers and create document term matrices dtm.hamilton.train, dtm.hamilton.test, dtm.madison.train, and dtm.madison.test. For each term matrix, create a vector of 0’s and 1’s to indicate the author of each document, with Madison documents given the value 0 and Hamilton documents given the value 1. (This will be the response variable for each document.) Combine the document term matrices and the vectors of 0’s and 1’s to create two data frames: one that includes all training data and one that includes all testing data. In each data frame, the number of rows will be the number of training (or testing) documents, and the number of columns will be the total number of words in the dictionary plus 1 (for the response variable). Also name the columns of the data frames: the column labels for the covariates are the dictionary words (hint: use as.vector(dictionary$words) to get a vector of words), and the column label for the response is y.
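The data frame construction described above could be sketched as follows. This is only an illustration on toy numbers: the three-word dictionary and the Poisson counts are made up, and the sketch assumes each document term matrix has one row per document and one column per dictionary word (transpose first if your Homework 3 matrices are oriented the other way).

```r
# Toy stand-ins for the Homework 3 objects (hypothetical values).
# Assumption: rows are documents, columns are dictionary words.
dictionary <- data.frame(words = c("upon", "whilst", "there"))
set.seed(1)
dtm.hamilton.train <- matrix(rpois(4 * 3, 2), nrow = 4)
dtm.madison.train  <- matrix(rpois(3 * 3, 2), nrow = 3)

# Stack the two authors and build the 0/1 response (Hamilton = 1, Madison = 0).
x.train <- rbind(dtm.hamilton.train, dtm.madison.train)
y.train <- c(rep(1, nrow(dtm.hamilton.train)),
             rep(0, nrow(dtm.madison.train)))

# One row per document, one column per word, plus the response y.
train <- data.frame(x.train, y = y.train)
colnames(train) <- c(as.vector(dictionary$words), "y")
```

The testing data frame would be built the same way from dtm.hamilton.test and dtm.madison.test.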

a. (10 Points) Use tree classification to predict the author using the training data. Apply the model to the testing data. Specifically, in R use rpart classification with Gini impurity splits. Then compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives. Plot the tree with labeled splits.

b. (10 Points) Now use tree classification again, but this time with information gain splits, to predict the author. Apply the model to the testing data. Then compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives. Plot the tree with labeled splits. Are there any differences between the two plots? If so, what are they and why do you think they arose?
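As a rough sketch of the mechanics in parts a and b, rpart selects the splitting criterion through parms = list(split = ...). The toy data, the fake labeling rule, and the helper name score below are assumptions for illustration only. The three proportions are taken over all testing documents, which is one possible convention.

```r
library(rpart)

# Toy data standing in for the training/testing data frames (hypothetical).
set.seed(1)
make.toy <- function(n) {
  upon   <- rpois(n, 3)
  whilst <- rpois(n, 1)
  y      <- as.integer(upon > 2)  # fake label rule, for illustration only
  data.frame(upon = upon, whilst = whilst, y = y)
}
train <- make.toy(60)
test  <- make.toy(30)

# Classification trees with the two splitting criteria.
fit.gini <- rpart(y ~ ., data = train, method = "class",
                  parms = list(split = "gini"))
fit.info <- rpart(y ~ ., data = train, method = "class",
                  parms = list(split = "information"))

# Proportion correct, false negatives (Hamilton predicted as Madison),
# and false positives (Madison predicted as Hamilton).
score <- function(fit, newdata) {
  pred  <- as.integer(as.character(
    predict(fit, newdata = newdata, type = "class")))
  truth <- newdata$y
  c(correct   = mean(pred == truth),
    false.neg = mean(pred == 0 & truth == 1),
    false.pos = mean(pred == 1 & truth == 0))
}
res.gini <- score(fit.gini, test)
res.info <- score(fit.info, test)

# Plot one tree with labeled splits.
plot(fit.gini, margin = 0.1)
text(fit.gini, use.n = TRUE, cex = 0.8)
```

Repeating the last two calls with fit.info gives the second plot for comparison.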


Question 6. (20 Points) Create centered and scaled versions of your document term matrices. (Do not center and scale the labels.) We will use these for regularized logistic regression with glmnet.

a. (2 Points) Could we use an unregularized logistic regression model with this data set? Why or why not?

b. (8 Points) Use glmnet to fit a ridge logistic regression model on the training data. Apply the model to the testing data. Then compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives. Find the 10 most important words according to the model along with their coefficients.

c. (10 Points) Use glmnet to fit a lasso logistic regression model on the training data. Apply the model to the testing data. Then compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives. Find the 10 most important words according to the model along with their coefficients. Compare the “important” words selected by ridge and lasso. Are the words different? What about their relative weights?
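A minimal sketch of parts b and c on toy data follows. Several choices here are assumptions rather than requirements of the assignment: the penalty strength is chosen by cross-validation with cv.glmnet, the testing matrix is scaled with the training means and standard deviations, "importance" is read as the absolute size of the coefficient, and the helper name fit.and.score is hypothetical.

```r
library(glmnet)

set.seed(1)
p     <- 20
words <- paste0("word", 1:p)  # hypothetical dictionary
x.raw.train <- matrix(rpois(80 * p, 3), 80, p, dimnames = list(NULL, words))
x.raw.test  <- matrix(rpois(40 * p, 3), 40, p, dimnames = list(NULL, words))

# Center and scale. The test set reuses the training means and sds here.
# Note: a column with zero variance would produce NaN and must be handled.
x.train <- scale(x.raw.train)
x.test  <- scale(x.raw.test,
                 center = attr(x.train, "scaled:center"),
                 scale  = attr(x.train, "scaled:scale"))

# Fake labels that depend on the first two words only.
y.train <- rbinom(80, 1, plogis(2 * x.train[, 1] - 2 * x.train[, 2]))
y.test  <- rbinom(40, 1, plogis(2 * x.test[, 1]  - 2 * x.test[, 2]))

fit.and.score <- function(alpha) {
  # alpha = 0 gives the ridge penalty, alpha = 1 the lasso penalty.
  fit  <- cv.glmnet(x.train, y.train, family = "binomial", alpha = alpha)
  pred <- as.integer(predict(fit, newx = x.test, s = "lambda.min",
                             type = "class"))
  b    <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop the intercept
  list(rates = c(correct   = mean(pred == y.test),
                 false.neg = mean(pred == 0 & y.test == 1),
                 false.pos = mean(pred == 1 & y.test == 0)),
       top10 = b[order(abs(b), decreasing = TRUE)][1:10])
}

ridge <- fit.and.score(0)
lasso <- fit.and.score(1)
```

With the lasso penalty, fewer than 10 coefficients may be nonzero, in which case top10 is padded with zero-weight words.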

Question 7. (20 Points) We can use feature selection to remove features that are irrelevant to classification. Instead of calculating the probability over the entire dictionary, we will simply count the number of times each of the n most relevant features appear and treat the set of features themselves as a dictionary.

a. (10 Points) A common way to select features is by using the mutual information between feature $x_k$ and class label $y$. Mutual information is expressed in terms of entropies,

$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).$$

Show that

$$I(Y, x_k) = \sum_{\tilde{y}=0}^{1} \left[ p(x_{\text{test}} = k \mid y = \tilde{y})\, p(y = \tilde{y}) \log \frac{p(x_{\text{test}} = k \mid y = \tilde{y})}{p(x_{\text{test}} = k)} + \bigl(1 - p(x_{\text{test}} = k \mid y = \tilde{y})\bigr)\, p(y = \tilde{y}) \log \frac{1 - p(x_{\text{test}} = k \mid y = \tilde{y})}{1 - p(x_{\text{test}} = k)} \right].$$

Hint: First check the formula for mutual information online and see how it can be derived from entropies. Also note that

$$H(x_k) = -\bigl( p(x_{\text{test}} = k) \log p(x_{\text{test}} = k) + p(x_{\text{test}} \neq k) \log p(x_{\text{test}} \neq k) \bigr).$$
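One possible route through the hint, given only as a sketch: write the mutual information as a difference of entropies and rewrite the weights in front of the logarithms using the law of total probability. The shorthand $q_{\tilde{y}} = p(x_{\text{test}} = k \mid y = \tilde{y})$, $\pi_{\tilde{y}} = p(y = \tilde{y})$, and $p_k = p(x_{\text{test}} = k)$ is introduced here for brevity and is not part of the assignment.

$$
\begin{aligned}
I(Y, x_k) &= H(x_k) - H(x_k \mid Y), \\
H(x_k \mid Y) &= -\sum_{\tilde{y}=0}^{1} \pi_{\tilde{y}} \bigl[ q_{\tilde{y}} \log q_{\tilde{y}} + (1 - q_{\tilde{y}}) \log (1 - q_{\tilde{y}}) \bigr], \\
H(x_k) &= -\sum_{\tilde{y}=0}^{1} \pi_{\tilde{y}} \bigl[ q_{\tilde{y}} \log p_k + (1 - q_{\tilde{y}}) \log (1 - p_k) \bigr].
\end{aligned}
$$

The last line uses $p_k = \sum_{\tilde{y}} q_{\tilde{y}} \pi_{\tilde{y}}$ and $1 - p_k = \sum_{\tilde{y}} (1 - q_{\tilde{y}}) \pi_{\tilde{y}}$ for the weights only. Subtracting the second line from the third and combining the logarithms term by term gives the two log-ratio terms of the target expression.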

b. (10 Points) Compute the mutual information for all features; use this to select the top n features as a dictionary. Use the document term matrices from the resulting dictionary for all four of the methods in Questions 5 and 6: tree classification with Gini splits, tree classification with information splits, ridge logistic regression, and lasso logistic regression. (Hint: subset your previously computed matrices/data frames.) For each method, use the testing set to compute the proportion classified correctly, the proportion of false negatives, and the proportion of false positives for n ∈ {200, 500, 1000, 2500}. Display the results in three graphs, one per metric (each graph will now have four lines, one per method). What happens? Why do you think this is?
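The mutual information from part a can be computed for every word at once. The sketch below is an illustration under stated assumptions: the function name mutual.info, the four-word dictionary, the counts, and the class priors are all made up, and the class-conditional word probabilities use add-one smoothing so that no logarithm sees a zero.

```r
# p.k.given.y: matrix with one row per word; column 1 holds p(x = k | y = 0)
# and column 2 holds p(x = k | y = 1). p.y: priors c(p(y = 0), p(y = 1)).
mutual.info <- function(p.k.given.y, p.y) {
  p.k <- as.vector(p.k.given.y %*% p.y)  # marginal p(x = k)
  mi  <- numeric(nrow(p.k.given.y))
  for (j in 1:2) {
    pk.j <- p.k.given.y[, j]
    mi <- mi + p.y[j] * (pk.j * log(pk.j / p.k) +
                         (1 - pk.j) * log((1 - pk.j) / (1 - p.k)))
  }
  mi
}

# Toy word counts per class with add-one smoothing (hypothetical numbers).
counts <- cbind(madison  = c(10, 5, 1, 20),
                hamilton = c(10, 5, 30, 2))
rownames(counts) <- c("the", "of", "upon", "whilst")
p.k.given.y <- sweep(counts + 1, 2, colSums(counts + 1), "/")
p.y <- c(0.4, 0.6)

mi <- mutual.info(p.k.given.y, p.y)
names(mi) <- rownames(counts)

# Keep the n words with the largest mutual information.
n   <- 2
top <- names(sort(mi, decreasing = TRUE))[1:n]
```

With a data frame named as in Question 5, the reduced data would then be something like train[, c(top, "y")], repeated for each value of n.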

Question 8. (25 points) James 8.4.10

Note: To begin this problem, you should execute library(ISLR) and data("Hitters") to load the data set.

Question 9. (20 points) James 10.7.1
