# Machine Learning (CS 6140) Homework 2 solved


## Description


1. Logistic Regression. We consider the following models of logistic regression for binary classification, with the sigmoid function σ(z) = 1/(1 + e⁻ᶻ):

Model 1: P(Y = 1 | X, w1, w2) = σ(w1X1 + w2X2)
Model 2: P(Y = 1 | X, w0, w1, w2) = σ(w0 + w1X1 + w2X2)
We have three training examples:

(x¹, y¹) = ([1, 1]ᵀ, 1), (x², y²) = ([1, 0]ᵀ, −1), (x³, y³) = ([0, 0]ᵀ, 1)
A. How does the learned value of w = (w1, w2) change if we change the label of the third example to −1? How about in Model 2? Explain. (Hint: think of the decision boundary in the 2D plane.)
B. Now, suppose we train the logistic regression model (Model 2) on N training examples x¹, …, xᴺ with labels y¹, …, yᴺ by maximizing the penalized log-likelihood of the labels:

Σᵢ log P(yⁱ | xⁱ, w) − (λ/2)‖w‖²₂
For large λ (strong regularization), the log-likelihood terms behave as linear functions of w:

log σ(yⁱ wᵀxⁱ) ≈ (1/2) yⁱ wᵀxⁱ
Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE w in terms of λ and the training data {(xⁱ, yⁱ)}, i = 1, …, N. Based on this, explain how w behaves as λ increases.
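As a quick numerical check (not part of the assignment) of where the linear approximation comes from: near z = 0, a first-order Taylor expansion gives log σ(z) ≈ log(1/2) + z/2, and dropping the constant log(1/2) yields the (1/2) yⁱ wᵀxⁱ term above.

```python
import math

def log_sigmoid(z):
    # log σ(z) = -log(1 + e^{-z}), computed directly
    return -math.log(1.0 + math.exp(-z))

# Near z = 0: log σ(z) ≈ log(1/2) + z/2, with O(z^2) error.
for z in [-0.1, -0.01, 0.01, 0.1]:
    approx = math.log(0.5) + z / 2.0
    assert abs(log_sigmoid(z) - approx) < z * z
```

Under strong regularization the optimal w is pushed toward 0, so yⁱ wᵀxⁱ stays small and this expansion is accurate.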
2. Support Vector Machine. Consider a binary classification problem in one-dimensional space where the sample contains four data points S = {(1, −1), (−1, −1), (2, 1), (−2, 1)}, as shown in Fig. 1.
Figure 1: Red points represent instances from class +1 and blue points represent instances from class -1.
A. Define Ht = [t, ∞). Consider the class of linear separators H = {Ht : t ∈ ℝ}, i.e., for every Ht ∈ H, Ht(x) = 1 if x ≥ t and −1 otherwise. Is there any linear separator Ht ∈ H that achieves 0 classification error on this sample? If yes, show one such separator. If not, briefly explain why no such linear separator can exist.
B. Now consider the feature map φ : ℝ → ℝ² with φ(x) = (x, x²). Apply the feature map to all instances in S to generate the transformed sample S′ = {(φ(x), y) : (x, y) ∈ S}. Let H′ = {ax1 + bx2 + c ≥ 0 : a² + b² ≠ 0} be a collection of half-spaces in ℝ²; more specifically, Ha,b,c((x1, x2)) = 1 if ax1 + bx2 + c ≥ 0 and −1 otherwise. Is there any half-space in H′ that achieves 0 classification error on the transformed sample S′? If yes, give the equation of the max-margin linear separator and compute the corresponding margin.
C. What is the kernel corresponding to the feature map φ(·) in the last question? That is, give the kernel function K(x, z) : ℝ × ℝ → ℝ.
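For parts B and C, one can sanity-check a candidate kernel numerically: for this φ, the feature-space inner product expands to φ(x)·φ(z) = xz + x²z². The short check below verifies that identity on the sample points of Fig. 1 (an illustrative check, not a proof).

```python
def phi(x):
    # Feature map from part B: R -> R^2
    return (x, x * x)

def K(x, z):
    # Candidate kernel: the feature-space inner product, expanded
    return x * z + (x * z) ** 2

# Check <phi(x), phi(z)> == K(x, z) on the sample points from Fig. 1
for x in [1, -1, 2, -2]:
    for z in [1, -1, 2, -2]:
        inner = phi(x)[0] * phi(z)[0] + phi(x)[1] * phi(z)[1]
        assert inner == K(x, z)
```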
3. Constructing Kernels. In this question you will construct new kernels from existing kernels. Suppose K1(x, z) : ℝᵈ × ℝᵈ → ℝ and K2(x, z) : ℝᵈ × ℝᵈ → ℝ are both kernels. Show that the following functions are also kernels:
A. K(x, z) = c1K1(x, z) + c2K2(x, z) with c1, c2 ≥ 0.
B. K(x, z) = K1(x, z) · K2(x, z).
C. Let q(t) = Σᵢ₌₀ᵖ cᵢtⁱ be a polynomial function with nonnegative coefficients, i.e., cᵢ ≥ 0 for all i. Show that K(x, z) = q(K1(x, z)) is a kernel.
D. K(x, z) = exp(K1(x, z)). (Hint: you can use the previous results to prove this.)
E. Let A be a positive semidefinite matrix and define K(x, z) = xᵀAz.
F. K(x, z) = exp(−‖x − z‖²₂).
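A useful empirical check (not a substitute for the proofs the question asks for): a valid kernel must yield a symmetric positive semidefinite Gram matrix on any finite point set. The sketch below builds Gram matrices for the constructions in parts A, B, D, and F on random points and checks that no eigenvalue is significantly negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random points in R^3

def gram(k):
    # Gram matrix G[i, j] = k(x_i, x_j) over the point set X
    return np.array([[k(x, z) for z in X] for x in X])

k1 = lambda x, z: float(x @ z)                        # base linear kernel (known PSD)
k_sum  = lambda x, z: 2 * k1(x, z) + 3 * k1(x, z)     # part A: c1 K1 + c2 K2
k_prod = lambda x, z: k1(x, z) * k1(x, z)             # part B: K1 * K2
k_exp  = lambda x, z: np.exp(k1(x, z))                # part D: exp(K1)
k_rbf  = lambda x, z: np.exp(-np.sum((x - z) ** 2))   # part F: Gaussian

for k in (k_sum, k_prod, k_exp, k_rbf):
    eigvals = np.linalg.eigvalsh(gram(k))
    assert eigvals.min() > -1e-8  # PSD up to numerical tolerance
```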
4. Support Vectors. In question 2, we explicitly constructed the feature map and found the corresponding kernel to help classify the instances with a linear separator in the feature space. However, in most cases it is hard to manually construct the desired feature map, and the dimensionality of the feature space can be very high, even infinite, which makes explicit computation in the feature space infeasible in practice. In this question we will develop the dual of the primal optimization problem to avoid working in the feature space explicitly. Suppose we have a sample set S = {(x1, y1), …, (xn, yn)} of labeled examples in ℝᵈ with label set {+1, −1}. Let φ : ℝᵈ → ℝᴰ be a feature map that transforms each input example into a feature vector in ℝᴰ. Recall from the lecture notes that the primal optimization of SVM is given by
notes that the primal optimization of SVM is given by
minimize
w,ξi
1
2
kwk
2
2 + C
Xn
i=1
ξi
subject to yi(w
T φ(xi)) ≥ 1 − ξi ∀i = 1, · · · , n
ξi ≥ 0 ∀i = 1, ·, n
2
which is equivalent to the following dual optimization:

maximize over αᵢ:   Σᵢ₌₁ⁿ αᵢ − (1/2) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼyᵢyⱼφ(xᵢ)ᵀφ(xⱼ)
subject to   0 ≤ αᵢ ≤ C, ∀i = 1, …, n
             Σᵢ₌₁ⁿ αᵢyᵢ = 0
Recall from the lecture notes that ξ1, …, ξn are called slack variables. The optimal slack variables have an intuitive geometric interpretation, as shown in Fig. 3. Basically, when ξᵢ = 0, the corresponding feature vector φ(xᵢ) is correctly classified and lies either on the margin of the separator or on the correct side of the margin. A feature vector with 0 < ξᵢ ≤ 1 lies within the margin but is still correctly classified. When ξᵢ > 1, the corresponding feature vector is misclassified. Support vectors correspond to instances with ξᵢ > 0 or instances that lie on the margin. The optimal vector w can be represented in terms of the αᵢ, i = 1, …, n, as w = Σᵢ₌₁ⁿ αᵢyᵢφ(xᵢ).
A. Suppose the optimal ξ1, …, ξn have been computed. Use the ξᵢ to obtain an upper bound on the number of misclassified instances.
B. In the primal optimization of SVM, what is the role of the coefficient C? Briefly explain your answer by considering the two extreme cases, i.e., C → 0 and C → ∞.
C. Explain how to use the kernel trick to avoid explicit computation of the feature vectors φ(xᵢ). Also, given a new instance x, how can one make a prediction on the instance without explicitly computing the feature vector φ(x)?
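Combining the representation w = Σᵢ αᵢyᵢφ(xᵢ) with a kernel gives the standard kernelized decision function f(x) = Σᵢ αᵢyᵢK(xᵢ, x) + b, which needs only kernel evaluations. A minimal sketch (the numbers below are hypothetical, for illustration only; the primal shown above has no bias term, so pass b = 0 to match it):

```python
import numpy as np

def decision_function(alphas, ys, Xs, b, kernel, x_new):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) + b -- only kernel calls,
    # never an explicit feature vector phi(x).
    return sum(a * y * kernel(xi, x_new)
               for a, y, xi in zip(alphas, ys, Xs)) + b

linear = lambda u, v: float(np.dot(u, v))

# Toy dual variables and support vectors (hypothetical values):
alphas = [0.5, 0.5]
ys = [1, -1]
Xs = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
print(np.sign(decision_function(alphas, ys, Xs, 0.0, linear, np.array([2.0, 0.0]))))  # → 1.0
```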
5. Generalized Lagrangian Function. Consider the optimization problem

min_w f(w)   s.t.   gⱼ(w) ≤ 0, ∀j = 1, …, m;   hⱼ(w) = 0, ∀j = 1, …, p   (1)

Show that for the generalized Lagrangian function, defined by

L(w, α, β) ≜ f(w) + Σⱼ₌₁ᵐ αⱼgⱼ(w) + Σⱼ₌₁ᵖ βⱼhⱼ(w)
the following always holds:

max_{α≥0, β} min_w L(w, α, β) ≤ min_w max_{α≥0, β} L(w, α, β)
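This is the max–min (weak duality) inequality. A finite analogue can be checked directly: for any matrix M, max over rows of the row-minimum never exceeds min over columns of the column-maximum (an illustration of the pattern, not the proof the question asks for).

```python
# Finite analogue of the max-min inequality:
# max_i min_j M[i][j] <= min_j max_i M[i][j]
M = [[3, 1, 4],
     [1, 5, 9],
     [2, 6, 5]]

maxmin = max(min(row) for row in M)                              # best "worst case" per row
minmax = min(max(M[i][j] for i in range(3)) for j in range(3))   # worst "best case" per column
assert maxmin <= minmax
print(maxmin, minmax)  # → 2 3
```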
6. Dual Optimization. Consider the optimization program

min_x (1/2)xᵀP₀x + q₀ᵀx + r₀   s.t.   (1/2)xᵀPᵢx + qᵢᵀx + rᵢ ≤ 0, i = 1, …, m

where P₀ and all Pᵢ are assumed to be positive semidefinite matrices.

A) Form the generalized Lagrangian function.
B) Compute the Lagrange dual function.
C) Derive the dual maximization problem.
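To build intuition for parts B and C, here is a hypothetical 1D instance: minimize (1/2)x² + x subject to −x + 1 ≤ 0 (i.e., P₀ = [1], q₀ = [1], r₀ = 0, P₁ = [0], q₁ = [−1], r₁ = 1). The sketch checks numerically that every dual value lower-bounds the primal optimum, and that this convex instance attains equality.

```python
# Toy 1D instance (hypothetical numbers): primal optimum is x = 1, value 1.5.
def primal(x):
    return 0.5 * x * x + x

def dual(lam):
    # g(lam) = min_x L(x, lam) with L = (1/2)x^2 + x + lam*(1 - x);
    # setting dL/dx = 0 gives the inner minimizer x = lam - 1.
    x = lam - 1.0
    return 0.5 * x * x + x + lam * (1.0 - x)

p_star = primal(1.0)  # 1.5
for lam in [0.0, 0.5, 1.0, 2.0, 3.0]:
    assert dual(lam) <= p_star + 1e-12   # weak duality
assert abs(dual(2.0) - p_star) < 1e-12   # strong duality at lam = 2
```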
7. Logistic Regression Implementation.
A) Write code in Python whose input is a training dataset {(x¹, y¹), …, (xᴺ, yᴺ)} and whose output is the weight vector w in the logistic regression model y = σ(wᵀx).
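One way to structure part A is plain gradient ascent on the log-likelihood. The sketch below is a minimal starting point, not the required submission: it assumes labels in {0, 1} (relabel −1 → 0 if needed), expects a 1-column prepended to X for the bias term, and the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, iters=2000):
    """Fit w for y = sigma(w^T x) by gradient ascent on the log-likelihood.
    X: (N, d) array; y: (N,) array of labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities sigma(Xw)
        w += lr * X.T @ (y - p) / len(y)    # gradient of the mean log-likelihood
    return w

# Tiny usage example: separable 1D data with a bias column prepended
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = train_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(float)
assert (preds == y).all()
```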
B) Download the dataset from the course webpage. Use ‘dataset1’. Run the code on the training dataset to compute w and evaluate on the test dataset. Report w, the classification error on the training set, and the classification error on the test set. Plot the data (use different colors for data in different classes) and plot the decision boundary found by the logistic regression.
C) Repeat part B using ‘dataset2’. Explain the differences in results between parts B and C.
8. SVM Implementation.
Implement SVM with the SMO algorithm and train it on the provided dataset. For your implementation, you only need to use the linear kernel. In addition, run SVM using the LIBSVM package and compare the results. You may implement the simplified SMO described at http://cs229.stanford.edu/materials/smo.pdf
A) Apply the SVM to ‘dataset1’ and report the classification error (on both training and test sets) as a function of the regularization parameter C.
B) Repeat part A using ‘dataset2’. Explain the differences in results between parts A and B.
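The simplified SMO from the handout linked above can be sketched as follows for the linear kernel. This is a sketch for intuition, not a tuned submission: the hard iteration cap, tolerance, and the tiny toy dataset at the bottom are illustrative choices of ours.

```python
import random
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, seed=0):
    """Simplified SMO with a linear kernel: pick a KKT-violating alpha_i,
    pair it with a random alpha_j, and solve the 2-variable subproblem."""
    random.seed(seed)
    n = len(y)
    K = X @ X.T                      # linear-kernel Gram matrix
    alpha = np.zeros(n)
    b = 0.0
    passes, it = 0, 0
    while passes < max_passes and it < 500:   # hard cap as a safety net
        it += 1
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]          # f(x_i) - y_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = random.choice([k for k in range(n) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:               # box constraints for alpha_j
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                            - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                            - y[j] * (alpha[j] - aj_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X              # recover w explicitly (linear kernel only)
    return w, b

# Tiny separable 2D example
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = simplified_smo(X, y)
assert (np.sign(X @ w + b) == y).all()
```

For part A, rerun this (or LIBSVM) over a grid of C values and record training/test error for each.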