## Description

1 Learning basics of regression in Python (3%)

This question will take you step-by-step through performing basic linear regression on the Boston

Houses dataset. You need to submit modular code for this question. Your code needs to be functional once downloaded. Non-functional code will result in losing all marks for this question. If

your code is non-modular or otherwise difficult to understand, you risk losing a significant portion

of the marks, even if the code is functional.

Environment setup: For this question you are strongly encouraged to use the following python

packages:

• sklearn

• matplotlib

• numpy

It is strongly recommended that you download and install Anaconda 3.4 to manage the above installations. This is a Data Science package that can be downloaded from https://www.anaconda.

com/download/.

You will submit a complete regression analysis for the Boston Housing data. To do that, here are

the necessary steps:

• Load the Boston housing data from the sklearn datasets module

• Describe and summarize the data in terms of number of data points, dimensions, target, etc

• Visualization: present a single grid containing plots for each feature against the target. Choose

the appropriate axis for dependent vs. independent variables. Hint: use pyplot.tight layout

function to make your grid readable

• Divide your data into training and test sets, where the training set consists of 80% of the data

points (chosen at random). Hint: You may find numpy.random.choice useful

• Write code to perform linear regression to predict the targets using the training data. Remember to add a bias term to your model.

• Tabulate each feature along with its associated weight and present them in a table. Explain

what the sign of the weight means in the third column (’INDUS’) of this table. Does the sign

match what you expected? Why?

• Test the fitted model on your test set and calculate the Mean Square Error of the result.

• Suggest and calculate two more error measurement metrics; justify your choice.

• Feature Selection: Based on your results, what are the most significant features that best predict the price? Justify your answer.

2 Locally reweighted regression (6%)

1. Given {(x

(1), y(1)), ..,(x

(N)

, y(N)

)} and positive weights a

(1), …, a(N)

show that the solution to

the weighted least square problem

w∗ = arg min

1

2

X

N

i=1

a

(i)

(y

(i) − wT x

(i)

)

2 +

λ

2

||w||2

(1)

is given by the formula

w∗ =

XT AX + λI

−1 XT Ay (2)

where X is the design matrix (defined in class) and A is a diagonal matrix where Aii = a

(i)

2. Locally reweighted least squares combines ideas from k-NN and linear regression. For each

new test example x we compute distance-based weights for each training example a

(i) =

exp(−||x−x

(i)

||2/2τ

2

P

)

j

exp(−||x−x(j)

||2/2τ

2)

, computes w∗ = arg min 1

2

PN

i=1 a

(i)

(y

(i)−wT x

(i)

)

2+

λ

2

||w||2 and predicts

yˆ = x

T w∗

. Complete the implementation of locally reweighted least squares by providing the

missing parts for q2.py.

Important things to notice while implementing: First, do not invert any matrix, use a linear

solver (numpy.linalg.solve is one example). Second, notice that P

exp(Ai)

j

exp(Aj ) = P

exp(Ai−B)

j

exp(Aj−B)

but

if we use B = maxj Aj it is much more numerically stable as P

exp(Ai)

j

exp(Aj )

overflows/underflows

easily. This is handled automatically in the scipy package with the scipy.misc.logsumexp function.

3. Use k-fold cross-validation to compute the average loss for different values of τ in the range

[10,1000] when performing regression on the Boston Houses dataset. Plot these loss values for

each choice of τ .

4. How does this algorithm behave when τ → ∞? When τ → 0

3 Mini-batch SGD Gradient Estimator (6%)

Consider a dataset D of size n consisting of (x, y) pairs. Consider also a model M with parameters

θ to be optimized with respect to a loss function L(x, y, θ) = 1

n

Pn

i=1 `(x

(i)

, y(i)

, θ).

We will aim to optimize L using mini-batches drawn randomly from D of size m. The indices

of these points are contained in the set I = {i1, . . . , im}, where each index is distinct and drawn

uniformly without replacement from {1, . . . , n}. We define the loss function for a single mini-batch

as,

LI(x, y, θ) = 1

m

X

i∈I

`(x

(i)

, y(i)

, θ) (3)

1. Given a set {a1, . . . , an} and random mini-batches I of size m, show that

EI

”

1

m

X

i∈I

ai

#

=

1

n

Xn

i=1

ai

2. Show that EI [∇LI(x, y, θ)] = ∇L(x, y, θ)

3. Write, in a sentence, the importance of this result.

4. (a) Write down the gradient, ∇L above, for a linear regression model with cost function

`(x, y, θ) = (y − w

T x)

2

.

(b) Write code to compute this gradient.

5. Using your code from the previous section, for m = 50 and K = 500 compute

1

K

X

K

k=1

∇LIk

(x, y, θ)

, where Ik is the mini-batch sampled for the kth time.

Randomly initialize the weight parameters for your model from a N (0, I) distribution. Compare the value you have computed to the true gradient, ∇L, using both the squared distance

metric and cosine similarity. Which is a more meaningful measure in this case and why?

[Note: Cosine similarity between two vectors a and b is given by cos(θ) = a · b

||a||2||b||2

.]

6. For a single parameter, wj , compare the sample variance, σ˜j , of the mini-batch gradient estimate for values of m in the range [0,400] (using K = 500 again). Plot log ˜σj against log m.

Submission

Assignments must be submitted by 10:00 pm on the day it is due; a late penalty of 10% per day

will be assessed thereafter (up to 3 days, then submission is blocked). We highly recommend using Python 3.6 embedded in Anaconda 3.4 to solve the programming questions. However, you are

welcome to use any programming language you like, given that your code solves the problem, and

is functional and modular.

What to submit

• All work to be submitted via Markus (details to follow soon).

• All mathematical work has to be clearly marked by the question it belongs to, we highly

recommend using a math editor such as MathType, Word Equations or LaTex. However, if

you prefer to use paper and pen(cil), you are welcome to do so as long as your hand writing

is readable. Unreadable work will result in losing the full mark of the question.

• Provide a separate file containing your code for each question. Each file name has to be indicative to the question it belongs to.

• Your code must be clearly structured with an obvious entry point.

• All graphs and plots have to be clearly marked by its question. Use readable grids whenever

applicable.