STA 561: Homework 4


1 Constructing Kernels (15 pts)
In class, we saw that by choosing a kernel
$$k(x_i, x_k) = \langle \Phi(x_i), \Phi(x_k) \rangle_{\mathcal{H}_k},$$
we can implicitly map data to a high-dimensional space and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping $\phi$ to a higher-dimensional space and then work out the corresponding $K$. In this question, however, we are interested in the direct construction of kernels. Let $K_1$ be a kernel over $\mathbb{R}^n \times \mathbb{R}^n$, and let $a \in \mathbb{R}$; $\langle a, b \rangle$ denotes the dot product $a^T b$.
For each function $K$ below, state whether it is necessarily a kernel. If you think it is, prove it; otherwise, please explain why not.
(a) $K(x, z) = a K_1(x, z)$
(b) $K(x, z) = \langle x, z \rangle^3 + (\langle x, z \rangle - 1)^2$
(c) $K(x, z) = \langle x, z \rangle^2 + \exp(-\|x\|^2)\exp(-\|z\|^2)$
2 Reproducing kernel Hilbert spaces (20 pts)
Let $\mathcal{F}$ be the set of all functions $f : [0, 1] \to \mathbb{R}$ such that $f(x) = ax$ for some real number $a$. Show that this is an RKHS with kernel $K(x, y) = xy$.
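One possible route, sketched here under assumptions that still need to be verified (this is not the full argument): equip $\mathcal{F}$ with the candidate inner product $\langle f_a, f_b \rangle := ab$, where $f_a(x) = ax$, observe that $K(\cdot, y) = f_y \in \mathcal{F}$, and check the reproducing property:

```latex
% Reproducing property under the candidate inner product <f_a, f_b> := ab:
\langle f_a, K(\cdot, y) \rangle
  = \langle f_a, f_y \rangle   % K(x, y) = xy, so K(., y) is the map x -> yx
  = a\,y
  = f_a(y).
```

It remains to check that this is a valid inner product and that $\mathcal{F}$ is complete under the induced norm.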
3 Convexity and KKT conditions (40 pts)
In class, we have seen the hinge loss that is used for “maximum-margin” classification, most notably for support vector machines. In this problem, you will see a new loss function, defined as follows:
$$L(x, y, f) = \max(0, \; |y - f(x)| - \epsilon).$$
This loss is called the “epsilon-insensitive loss,” and we hope you can see why it has this name! It is a very popular loss function. The cost function for the SVM will thus be:
$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} L(x_i, y_i, f),$$
where $x$ is the input, $y$ is the output, and $f(x) = w^T x$ is used for predicting the label. Both $C$ and $\epsilon > 0$ are parameters.
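To see where the name comes from, here is a minimal sketch of the loss in Python (NumPy assumed; names are illustrative): residuals smaller than $\epsilon$ incur zero loss, and larger residuals are penalized only for the amount by which they exceed $\epsilon$.

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps):
    """max(0, |y - f(x)| - eps): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

# residual 0.3 with eps = 0.5 costs nothing; residual 1.0 costs 0.5
print(eps_insensitive_loss(np.array([1.0, 2.0]), np.array([1.3, 1.0]), 0.5))
```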
Hint: The primal form for this problem is given as:
$$\min_{w, \eta, \eta^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\eta_i + \eta_i^*)$$
subject to
$$y_i - \langle w, x_i \rangle - \epsilon \le \eta_i,$$
$$\langle w, x_i \rangle - y_i - \epsilon \le \eta_i^*,$$
$$\eta_i, \eta_i^* \ge 0, \quad i = 1, \ldots, n,$$
where $\eta, \eta^*$ are two slack variables.
(a) Please write down the Lagrangian function for the above primal form, and use the KKT conditions to derive the dual form. (One way to set up the Lagrangian is sketched after this list.) You can look up the answer on the internet if you get stuck, but try it first so you get used to doing this type of calculation.
(b) How would you define support vectors in this problem?
(c) How does $\epsilon$ influence the complexity of the model in practice? In other words, does increasing $\epsilon$ make the model more or less likely to overfit in general?
(d) How does $C$ influence the complexity of the model in practice? In other words, does increasing $C$ make the model more or less likely to overfit in general?
(e) Now suppose you are given a new (unseen) sample $x$; please write down the equation for evaluating $f(x)$.
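As a hedged starting point for (a), and not the full derivation: one standard setup introduces multipliers $\alpha_i, \alpha_i^* \ge 0$ for the two $\epsilon$-tube constraints and $\mu_i, \mu_i^* \ge 0$ for the nonnegativity constraints (the multiplier names are our own choice).

```latex
% One way to write the Lagrangian for the primal above:
\mathcal{L} = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\eta_i + \eta_i^*)
  - \sum_{i=1}^{n} \alpha_i \bigl(\eta_i + \epsilon - y_i + \langle w, x_i\rangle\bigr)
  - \sum_{i=1}^{n} \alpha_i^* \bigl(\eta_i^* + \epsilon + y_i - \langle w, x_i\rangle\bigr)
  - \sum_{i=1}^{n} \bigl(\mu_i \eta_i + \mu_i^* \eta_i^*\bigr).
```

Setting $\partial \mathcal{L} / \partial w = 0$ gives $w = \sum_i (\alpha_i - \alpha_i^*)\, x_i$, which is the usual entry point to the dual.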
4 SVM Implementation (25 pts)
In this problem, you will experiment with the support vector machine (SVM) and various kernel functions.
(a) Implement a “hard” maximum-margin SVM classifier in MATLAB, R, or Python. Here, “hard” means that if the data are separable, the SVM will return a maximum-margin solution, but if the data are not separable, the code will fail. Your implementation should optimize the dual problem using a quadratic program solver or a specialized solver, and include a train function and a predict function. (Note that you do not need to write the solver yourself!)
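As a hedged illustration of one way to approach (a) in Python (a sketch, not the required implementation; it assumes the cvxopt QP solver, labels $y_i \in \{-1, +1\}$ stored as a NumPy array, and function names of our own choosing):

```python
import numpy as np
from cvxopt import matrix, solvers  # any QP solver works; cvxopt shown here

def svm_train(X, y, kernel):
    """Hard-margin SVM dual: min (1/2) a'Qa - 1'a  s.t.  a >= 0,  y'a = 0,
    where Q_ij = y_i y_j k(x_i, x_j)."""
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                    # encodes alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))  # encodes sum_i alpha_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    sv = alpha > 1e-6                         # support vectors have alpha_i > 0
    s = np.argmax(sv)                         # any support vector fixes the bias:
    bias = y[s] - np.sum(alpha[sv] * y[sv] * K[sv, s])
    return alpha, sv, bias

def svm_predict(X_tr, y_tr, alpha, sv, bias, kernel, x_new):
    score = sum(a * yi * kernel(xi, x_new)
                for a, yi, xi in zip(alpha[sv], y_tr[sv], X_tr[sv]))
    return int(np.sign(score + bias))
```

When the data are not separable, the dual is unbounded and the solver does not return an optimal status, which matches the intended “fail” behavior of a hard SVM. Note that prediction only ever touches the support vectors, so only those training points need to be retained.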
(b) Train an SVM classifier with the kernel function $k(x, z) = x^\top z$ on 9/10ths of the credit card data set. (Please set the seed to 2018 to choose which tenth to leave out for testing.) What is the accuracy of this classifier on the test data set? Show the ROC curves, and also report the AUC.
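For the evaluation, here is a minimal sketch of the split-and-score workflow (assuming scikit-learn and matplotlib; the synthetic placeholder data and all variable names are illustrative and should be replaced with the actual credit card data and your SVM's decision values):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder data; substitute the actual credit card features/labels here.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = (X[:, 0] + 0.5 * rng.standard_normal(500) > 0).astype(int)

# Hold out one tenth for testing with seed 2018, as the problem asks.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2018)

# 'scores' should be your SVM's real-valued decision values on X_te
# (from part (a)); a linear stand-in keeps this snippet runnable end to end.
scores = X_te @ np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```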
(c) Train an SVM classifier with the radial basis kernel
$$k(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$$
on the credit card data training set, for $\sigma^2 = 5$ and $\sigma^2 = 25$. What is the accuracy of these classifiers on the test data set? Show the ROC curves, and also report the AUC.
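For (c), a one-function sketch of the radial basis kernel, written to plug into a trainer like the one sketched in part (a) (NumPy assumed; names are our own):

```python
import numpy as np

def rbf_kernel(x, z, sigma2):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

# The two settings the problem asks for:
k5 = lambda x, z: rbf_kernel(x, z, sigma2=5.0)
k25 = lambda x, z: rbf_kernel(x, z, sigma2=25.0)
```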