## Description

Question1 (2 points)

Suppose we collect data for a group of students in a class with variables

X1 hours studied per week

X2 GPA

Y receive an A

We fit a logistic regression and produce estimated coefficients βˆ

0 = −6, βˆ

1 = 0.05, βˆ

2 = 1.

(a) (1 point) Estimate the probability that a student who studies for 40 h and has a GPA

of 3.5 gets an A in the class.

(b) (1 point) How many hours would the student in part (a) need to study to have 50%

change of getting an A in the class?

Question2 (5 points)

PricewaterhouseCoopers (PwC) surveyed 1000 online shoppers in the U.S.A and China.

One question asked is whether the online shopper followed brands they purchased through

social media. The result is given below

Social Media

Country No Yes

U.S.A 487 513

China 72 928

(a) (1 point) What is the odds of online shoppers who follow brands through social media

in each country?

(b) (1 point) What is the odds ratio for comparing U.S.A. online shoppers with Chinese

online shoppers?

(c) (1 point) Write the logistic regression model for this problem using the log odds of

following brands through social media as the response variable and country as an

indicator predictor (U.S.A = 1).

(d) (1 point) Numerical optimisation gives the estimated slope of -2.5043 and standard

error of 0.1377. Transform this result to the odds scale and compare it with your

answer in part (b).

(e) (1 point) Construct a 95% confidence interval for the odds ratio and state the conclusion based on this interval.

Question3 (3 points)

When the number of predictors/features p is large, there tends to be deterioration in the

performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This

phenomenon is known as the curse of dimensionality, and it is tires into the fact that

non-parametric approaches often perform poorly when p is large.

(a) (1 point) Suppose that we have a set of observations, each with measurements on only

one feature X, that is, p = 1. We assume that X is uniformly distributed on [0, 1].

Associated with each observation is a response value. Suppose that we wish to predict

a test observation’s response using only observations that are within 10% of the range

of X closest to the that test observation. For instance, in order to predict the response

for a test observation with X = 0.6, we will use observations in the range [0.55, 0.65].

On average, what fraction of available observations will we use to make the prediction.

(b) (1 point) Now suppose that we have a set of observations, each with measurements

on two features X1 and X2, that is, p = 2. We assume that (X1, X2) are uniformly

distributed on [0, 1] × [0, 1]. We wish to predict a test observation’s response using

only observations that are within 10% of the range X1 and within 10% of the range of

X2 closest to the test observation. For instance, in order to predict the response for a

test observation with X1 = 0.6 and X2 = 0.35, we will use observations in the range

[0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the

available observations will we use to make the prediction?

(c) (1 point) Now suppose that we wish to make a prediction for a test observation by

creating a p-dimensional hypercube centred around the test observation that contains,

on average, 10% of the training observations. For p = 100, what is the length of each

side of the hypercube?

Question4 (4 points)

Consider the K-nearest neighbour (KNN) classifier using Euclidean distance.

0 2 4 6 8 10

0

2

4

6

8 10

X1

X2

Y = 0

Y = 1

The dataset is described above. Note that a point can be its own neighbour.

(a) (3 points) Sketch the 1-nearest neighbour (1-NN) decision boundary for this dataset.

(b) (1 point) How would the point (8, 1) be classified using (1-NN)?

Page 2

Question5 (5 points)

Consider manually assigning items in a dataset described by into 2 clusters using K-means.

X1

X2

−5 −4 −3 −2 −1 0 1 2 3 4 5

−1.0 −0.5 0.0 0.5 1.0

Use squared Euclidean distance as the measure of variation/dissimilarity.

(a) (2 points) Suppose we have the following initialisation.

X1

X2

−5 −4 −3 −2 −1 0 1 2 3 4 5

−1.0 −0.5 0.0 0.5 1.0

Compute K-means clustering until convergence, show all steps for all iterations.

(b) (3 points) Suppose we have the following initialisation.

X1

X2

−5 −4 −3 −2 −1 0 1 2 3 4 5

−1.0 −0.5 0.0 0.5 1.0

Compute again until convergence. Compare with part (a), what do you notice?

Page 3

Ve572

Assignment 3

Question6 (6 points)

Suppose that we have 4 observations, for which the following matrix gives the dissimilarity.

0.3 0.4 0.7

0.3 0.5 0.8

0.4 0.5 0.45

0.7 0.8 0.45

For instance, the dissimilarity measure between the first and second observations is 0.3, and

the dissimilarity measure between the second and the fourth observations is 0.8.

(a) (2 points) On the basis of this dissimilarity matrix, sketch the dendrogram that results

from hierarchically clustering these four observations using complete linkage. Be sure to

indicate on the plot the height at which each fusion occurs, as well as the observations

corresponding to each leaf in the dendrogram.

(b) (2 points) Repeat part (a), this time using single linkage clustering.

(c) (1 point) Suppose that we cut the dendogram obtained in part (a) such that two

clusters result. Which observations are in each cluster?

(d) (1 point) Suppose that we cut the dendogram obtained in part (b) such that two

clusters result. Which observations are in each cluster?

Page 4