CS5011 : Introduction to Machine Learning Programming Assignment #1 solved


Category: You will receive a download link of the .ZIP file upon Payment


5/5 - (1 vote)

Linear Classifier
1. You will use a synthetic data set for the classification task. Generate two classes with
10 features each. Each class is given by a multivariate Gaussian distribution, with
both classes sharing the same covariance matrix. Ensure that the covarianve matrix is
not spherical, i.e., that it is not a diagonal matrix, with all the diagonal entries being
the same. Generate 1000 examples for each class. Choose the centroids for the classes
close enough so that there is some overlap in the classes. Specify clearly the details of
the parameters used for the data generation. Randomly pick 40% of each class (i.e.,
400 data points per class) as a test set, and train the classiers on the remaining 60%
data. When you report performance results, it should be on the left out 40%. Call this
dataset as DS1.
2. For DS1, learn a linear classifer by using regression on indicator variable. Report the
best fit achieved by the classifer, along with the coeffcients learnt.
3. For DS1, use k-NN to learn a classifer. Repeat the experiment for different values of
k and report the performance for each value. Technically this is not a linear classifer,
but I want you to appreciate how powerful linear classifers can be. Do you do better
than regression on indicator variables or worse? Are there particular values of k which
perform better?
Linear Regression
1. For the regression tasks, you will use the Communities and Crime Data Set from the
UCI repository (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime).
This is a real-life data set and as such would not have the nice properties that we expect. Your first job is to make this dataset usable, by filling in all the missing values.
Use the sample mean of the missing attribute. Is this is a good choice? What else
might you use? If you have a better method, describe it, and you may use it for filling
in the missing data. Turn in the complete data set.
2. Fit the above data using linear regression. Report the residual error of the best fit
achieved, along with the coeffcients learnt.
3. Use Ridge-regression on the above data. Repeat the experiment for different values of
λ . Report the residual error for each value, on test data, averaged over 5 different
80 − 20 splits, along with the coefficients learnt. Which value of λ gives the best fit?
Is it possible to use the information you obtained during this experiment for feature
selection? If so, what is the best fit you achieve with a reduced set of features?
Instructions on how to use 80-20 splits
• Make 5 different 80-20 splits in the data and name them as CandC-train < num >
.csv and CandC-test< num >.csv.
• For all 5 datasets that you have generated, learn a regression model using 80%
and test it using 20%.
• Report the average RSS over these 5 different runs.
Feature Extraction
1. You have been provided with a 3-dimensional dataset (DS3) which contains 2 classes.
Perform PCA on the dataset and extract 1 feature and use the data in this projected
space to train linear regression with indicator random variables. Use the learnt model
to classify the test instances. Report per-class precision, recall and f-measure. Also you
have to report the 3-D plot of the dataset and the plot of the dataset in the projected
space along with the classifier boundary.
2. Now use the same dataset and perform LDA on it and project the dataset to the
derived feature space. Report per-class precision, recall and f-measure. Also you have
to report the 3-D plot of the dataset and the plot of the dataset in the projected space
along with the classifier boundary. What do you infer from these two experiments?
Which feature extraction technique performs better for this scenario? Why?
Page 2
Support Vector Machines
1. You have been provided with training instances for an image classification problem
(DS4). You have to train an SVM to classify the test images into either of the following
four categories: coast, forest, inside-city, mountain. Follow the instructions below for
extracting the features from the images.
Instructions for Feature Extraction
You are given a set of scene images. In this step, the requirement is to extract some
features from the images that can be used as input to our SVM. There are many
feature extraction techniques. For this assignment, we will follow a color histogram
based approach. This is not the best technique for feature extraction, but most likely,
the easiest.
• Read image.
• Extract red, green and blue channels from the variable you read into in 1. The
sequence is r-g-b.
• For every channel divide it into 32 bins and find frequency.
• Concatenate these 32 dimensional feature vectors for every channel to find a 96D
vector for the whole image. (sequence r-g-b)
• Normalize the features before using them.
Use the training data to build classification models using the following kernels.
• Linear kernel
• Polynomial kernel
• Gaussian kernel
• Sigmoid kernel
Come up with the kernel parameters for the various models. You can use a fraction of
data supplied to do a n-fold cross validation to find the best model parameters.
• You are expected to use the libsvm library available at https://www.csie.ntu.
edu.tw/~cjlin/libsvm/. LibSVM has interfaces for matlab and python. It can
also be used a standalone tool on the command line.
• Once you learn the svm, you are expected to turn in the model learnt as a matfile
or dumped using pickle(if done in python)
Page 3
Bayesian Parameter Estimation
1. Design and implement a Bayesian Spam Filter that classifies email messages as either
spam (unwanted) or ham (useful), i.e yi ∈ {spam, ham} using a Naive Bayes Classifier(explained later) for the following four scenarios.
(a) Maximum Likelihood Estimation assuming likelihood L ∼ Multinomial(n1, n2, . . . , nk, N),
where k is the size of the vocabulary, nw is the number of times word w appears
in the document d and N =
i ni
(b) Maximum Likelihood Estimation assuming likelihood L ∼ Bernoulli(i, p), where
p is the parameter of the Bernoulli distribution and i ∈ {0, 1}. In our case, we
would have k Bernoulli distributions.
(c) Bayesian Parameter Estimation assuming that prior p ∼ Dir(~α), where α =
(α1, α2, . . . , αk), the elements of the vector are the parameters of the Dirichlet
(d) Bayesian Parameter Estimation, assuming that prior p ∼ Beta(α, β), where α and
β are the parameters of the Beta distribution.
Naive Bayesian Classifier
Naive Bayesian Classifier takes a collection of words as input and predicts the category
of the given text. Naive Bayes algorithm for text classification involves two stages :
training and classification. In the training stage, various probabilities are estimated
based on the training examples. In the classification stage, the estimated probabilities
are used to evaluate the likelihood of each class for the given document. The document
is then assigned a label with the highest likelihood score. Let C = {C1, C2, . . . , Cm}
be the set of labels and V = {w1, w2, . . . , wn} be all the words in the vocabulary. You
would need to estimate the following propbabilities in the training stage.
• Class priors: p(Ci) for i = 1, 2, . . . , m. Note that Pk
i=1 p(Ci) = 1.
• Within class word probabilities: p(wj
|Ci) for i = 1, 2, . . . , m and j = 1, 2, . . . , n.
Note that Pn
j=1 p(wj
|Ci) = 1 for i = 1, 2, 3, . . . , k.
Note- For some words, the conditional probability can end up becoming 0, to prevent this, we perform a ”add one smoothing” ( https://en.wikipedia.org/wiki/
Description of the dataset:
The data set contains a collection of spam and legitimate emails. Each token (word,
number, punctuations, etc.) is replaced by a unique number througout the dataset. In
Page 4
order to get the emails to an usable by Naive Bayes, some preprocessing is required.
Each document should be represented by a term frequency vector of size k, where the
ith element is the frequency of the ith term in the vocabulary (set of all distinct words
in any text document from the training set). Do not consider the token ”Subject:”
as part of the vocabulary. Files whose names have the form spmsg*.txt are spam
messages. Files whose names have the form *legit*.txt are legitimate examples.
• You are required to implement Naive Bayesian classifier for 4 different scenarios
as described previously. In the last two cases you are expected to try out different
parameters for the prior distribution. Report the results after performing 5-fold
cross validation (80-20 split). You have 10 folders in the dataset. Use 1-8 for
training, 9-10 for testing in one fold. In the next fold, use 3-10 for training, 1-2
for tesing and so on.
• Comment on the impact of choice of parameters on the performance. Also comment on the performance of the classifier under different scenarios. Plot a PRcurve (refer Page 145-146 Manning, Raghavan and Schutze) for each scenario and
for different parameter setting in the last 2 scenarios.
• Refer to chapter 13 of Manning, Raghavan and Schutze for further reference on
implementation of Naive Bayes for text classification.
• Your code should take trainM atrixp×k, trainlabelp×1 , testM atrixr×k and test −
labelr×k in the first two scenarios. p and r are number of documents in training
and test set respectively and k is the size of the vocabulary. In the last 2 scenarios
it should also take the parameters for the prior distribution as input. The code
should output precision, recall, f1-measure for spam class and plot PR-Curve for
the best model obtained in terms of performance.
Submission Instructions
Submit a single tarball file containing the following files in the specified directory structure.
Use the following naming convention: ’cs5011 pa1 rollno.tar.gz’
DS1 train.csv
DS1 test.csv
Source Code
Page 5