## Description

Introduction

This assignment asks you to implement an algorithm for learning parameters for a

simple Bayesian network from missing data.

The algorithm

Missing data are prevalent in real-world problems and appear as either hidden

variables that are never observed or missing values for some features. Learning from such

datasets can be solved using an algorithm called Expectation Maximization (EM). The

idea is to start with a model completed with random parameters and to repeat the

following two steps until convergence:

• Estimate the missing data using the current complete model (E-step),

• Learn a new set of parameters using the data set “completed” with the missing

data just estimated (M-step).

You should set a small threshold (say 0.001) for the change of likelihoods

between two iterations to detect the convergence of your algorithm. Since EM is sensitive

to the starting point, you should try multiple starting points in order to find a good

solution.

EM is also sensitive to the amount of missing data. Five data sets with missing

rates being 10%, 30%, 50%, 70%, and 100% respectively are provided. Learn a model

for each data set.

The model

The Bayesian network has the following variables: Gender, Weight and Height,

whose relations are shown in the following graph.

The datasets (download from course website) have 20 data points each with occasional

missing values for Gender, denoted as “-”. All the variables are binary: Gender (M/0,

F/1), Weight (greater_than_130/0, less_than_130/1), and Height (greater_than_55/0,

less_than_55/1). The parameters of this model are to be estimated from the datasets.

Gender

Weight Height

Implementation

Implement the EM algorithm for learning the parameters for the above model in

your preferred programming language. Hint: E-step of the EM algorithm is essentially

estimating the probabilities of different values of Gender given that we know a person’s

Weight and Height, i.e., P(Gender | Weight, Height), and use these estimations as if they

are our (expected) counts.

Evaluation and analysis

Test your EM algorithm on the five datasets. For each dataset, try several different

starting points. In your report you should at least include the following results:

• The starting points of the learning

• The final conditional probability tables for each learning

• Plots of the likelihood vs number of iterations to demonstrate the convergence of

your algorithm.

• Try the following starting parameters and report your results.

P(gender=M)=0.7;

P(weight=greater_than_130|gender=M)=0.8;

P(weight=greater_than_130|gender=F)=0.4;

P(height= greater_than_55|gender=M)=0.7;

P(height= greater_than_55|gender=F)=0.3;

And provide analysis on the following issues at least:

• Do multiple starting points help in finding better solutions?

• Do some of the different solutions have the same likelihood scores?

• How does the data missing rate affect your algorithm and the results?

Deliverables

Submit a single .zip file named LastName.FirstInitial.HW# containing the

following via email to me at changhe.yuan@qc.cuny.edu. In addition, submit a hardcopy

of your report and code in class on due date.

1) Well commented code;

2) A readme file explaining how to compile and run your program;

3) A written report explaining your experimental results.