## Description

Background

In this miniproject you will implement two ML models—Linear regression, Logistic regression (with gradient descent)—

and provide analysis on these two models on two distinct datasets. The goal is to get started with programming for

Machine Learning and learn how these two commonly used models work.

Task 1: Acquire, preprocess, and analyze the data

Your first task is to acquire the data, analyze it, and clean it (if necessary). We will use two fixed datasets in this

project, outlined below.

• Dataset 1: ENB2012 data.xlsx (Energy efficiency dataset):

https://archive.ics.uci.edu/ml/datasets/Energy+efficiency

1

• Dataset 2: Qualitative Bankruptcy (250 instances).rar (Qualitative Bankruptcy dataset):

https://archive.ics.uci.edu/ml/datasets/Qualitative_Bankruptcy

The essential subtasks for this part of the project are:

1. Load the datasets into NumPy or Pandas objects in Python.

2. Clean the data. Are there any missing or malformed features? Are there are other data oddities that need to be

dealt with? You should remove any examples with missing or malformed features and note this in your

report.

If you choose to play with Pandas dataframes, a handy line of code that might be helpful is df[ df.eq(’?’).any(1)],

where df is the dataframe, and ’?’ represents a missing value in the datasets. This is a straightforward way to

handle this issue by simply eliminating rows with missing values. You are welcome to explore other possible

ways.

3. Compute basic statistics on the data to understand it better. E.g., what are the distributions of the positive vs.

negative classes, what are the distributions of some of the numerical features?

Task 2: Implement the models

You are free to implement these models as you see fit, but you should follow the equations that are presented in

the lecture slides, and you must implement the models from scratch (i.e., you CANNOT use SciKit Learn or any

other pre-existing implementations of these methods). However, you are free to use relevant code given in the course

website.

In particular, your two main tasks in the part are to:

1. Implement analytical linear regression solution for Dataset 1.

2. Implement logistic regression with gradient descent for Dataset 2.

3. Implement mini-batch stochastic gradient descent for both linear and logistic regression.

You are free to implement these models in any way you want, but you must use Python and you must implement

the models from scratch (i.e., you cannot use SciKit Learn or similar libraries). Using the NumPy or Pandas package, however, is allowed and encouraged. Regarding the implementation, we recommend the following

approach (but again, you are free to do what you want):

• Implement both models as Python classes. You should use the constructor for the class to initialize the model

parameters as attributes, as well as to define other important properties of the model.

• Each of your models classes should have (at least) two functions:

– Define a fit function, which takes the training data (i.e., X and Y)—as well as other hyperparameters (e.g.,

learning rate and batch size)—as input. This function should train your model by modifying the model parameters.

– Define a predict function, which takes a set of input points (i.e., X) as input and outputs predictions (i.e.,

yˆ) for these points.

Task 3: Run experiments

The goal of this project is to have you be familiar with how to train models.

Split each dataset into training, and test sets. Use test set to estimate performance in all of the experiments after training the model with training set. Evaluate the performance using the corresponding cost function for the classification

2

and regression tasks. You are welcome to perform any experiments and analyses you see fit, but at a minimum you

must complete the following experiments in the order stated below:

1. Report the performance of linear regression and fully batched logistic regression. For both datasets use a 80/20

train/test split and report the performance on both training set and test set.

2. Report the weights of each of features in your trained models and discuss how each feature could affect the

performance of the models.

3. Sample growing subsets of the training data (20%,30%,…80%). Observe and explain how does size of training

data affects the performance for both models. Plot two curves as a function of training size, one for performance

in train and one for test.

4. For both linear and logistic regression, try out growing minibatch sizes, e.g., 8, 16, 32, 64, and 128. Compare the convergence speed and final performance of different batch sizes to the fully batched baseline. Which

configuration works the best among the ones you tried?

5. Present the performance of both linear and logistic regression with at least three different learning rates (your

own choice).

6. Compare analytical linear regression solution with mini-batch stochastic gradient descent based linear regression solution. What do you find?

Note: The above experiments are the minimum requirements that you must complete; however, this project is

open-ended. For example, what happens when you add momentum to the gradient descent implementation? what if

you transform your data with non-linear bases discussed in the class? What happens if you add regularization? How

about different evaluation metrics for classification and regression problem? Which kind of feature preprocessing

improves performance? You do not need to do all of these things, but you should demonstrate creativity, rigour, and

an understanding of the course material in how you run your chosen experiments and how you report on them in your

write-up.

Deliverables

You must submit two separate files to MyCourses (using the exact filenames and file types outlined below):

1. code.zip: Your data processing, classification and evaluation code (as some combination of .py and .ipynb files).

2. writeup.pdf: Your (max 5-page) project write-up as a pdf (details below).

Project write-up

Your team must submit a project write-up that is a maximum of five pages (single-spaced, 11pt font or larger; minimum

0.5 inch margins, an extra page for references/bibliographical content can be used). We highly recommend that

students use LaTeX to complete their write-ups. This first mini-project report has relatively strict requirements, but

as the course progresses your project write-ups will become more and more open-ended. You have some flexibility

in how you report your results, but you must adhere to the following structure and minimum requirements:

Abstract (100-250 words) Summarize the project task and your most important findings. For example, include

sentences like “In this project we investigated the performance of two machine learning models on two benchmark

datasets”.

Introduction (5+ sentences) Summarize the project task, the two datasets, and your most important findings. This

should be similar to the abstract but more detailed. You should include background information and citations to

relevant work (e.g., other papers analyzing these datasets).

3

Datasets (5+ sentences) Very briefly describe the datasets and how you processed them. Present the exploratory

analysis you have done to understand the data, e.g. class distribution. Highlight any possible ethical concerns that

might arise when working these kinds of datasets.

Results (7+ sentences, possibly with figures or tables) Describe the results of all the experiments mentioned in

Task 3 (at a minimum) as well as any other interesting results you find (Note: demonstrating figures or tables would

be an ideal way to report these results).

Discussion and Conclusion (5+ sentences) Summarize the key takeaways from the project and possibly directions

for future investigation.

Statement of Contributions (1-3 sentences) State the breakdown of the workload across the team members.

Evaluation

The mini-project is out of 100 points, and the evaluation breakdown is as follows:

• Completeness (20 points)

– Did you submit all the materials?

– Did you run all the required experiments?

– Did you follow the guidelines for the project write-up?

• Correctness (40 points)

– Are your models implemented correctly?

– Are your reported performance close to our solution?

– Do you observe the correct trends in the experiments (e.g., how performance changes as the minibatch or

learning rates changes?)?

– Do you find notable features of the decision boundaries?

• Writing quality (25 points)

– Is your report clear and free of grammatical errors and typos?

– Did you go beyond the bare minimum requirements for the write-up (e.g., by including a discussion of

related work in the introduction)?

– Do you effectively present numerical results (e.g., via tables or figures)?

• Originality / creativity (15 points)

– Did you go beyond the bare minimum requirements for the experiments?

– Note: Simply adding in a random new experiment will not guarantee a high grade on this section! You

should be thoughtful and organized in your report. That is, the distinctive ideas that you came up with

should blend in your whole story. For instance, explaining the triggers behind them would be a great

starting point.

4

Final remarks

You are expected to display initiative, creativity, scientific rigour, critical thinking, and good communication skills.

You don’t need to restrict yourself to the requirements listed above – feel free to go beyond, and explore further.

You can discuss methods and technical issues with members of other teams, but you cannot share any code or data

with other teams.

Congratulations on completing your first course project! You are now familiar with the two most commonly used

ML models. As the class goes on, we will see more machine learning models and their interesting applications in

real life.

5