CSCI3220 Algorithms for Bioinformatics Assignment 5 solved

$35.00 $21.00



5/5 - (1 vote)

In this assignment, you will work out some details of using the quad tree structure in hierarchical
clustering. You are encouraged to use some ways (which could be, but not necessarily, writing a small
program) to do some tedious calculations. You will then implement the min-heap data structure.
Finally, you will answer some questions related to the k-means clustering method.
1. This question is about hierarchical clustering. You are given the following gene expression data
matrix X of five genes g1-g5 measured in six samples s1-s6.
X s1 s2 s3 s4 s5 s6
g1 4 6 5 1 2 8
g2 1 4 6 8 1 3
g3 2 9 3 5 1 6
g4 8 5 2 1 3 4
g5 7 1 1 3 2 9
(a) Produce the pair-wise distance matrix among the five genes, where the distance between any
two genes gi and gj is defined as their squared Euclidean distance, �”�$, �&’ = ∑ “*+,-*.,’ 0 /
3 ,
in which xik is the expression level of gene gi in sample sk. Give distance values up to 1
decimal point. (10%)
(b) Suppose you want to perform agglomerative hierarchical clustering of the five genes based
on the distance matrix in Part a, using a quad tree to index the distance values. Show the
initial quad tree, with the different levels of the tree clearly indicated. (10%)
(c) For each of the following link types, show the updated quad tree after the first merge of two
clusters: i) single-link; ii) average-link; iii) complete link. For the two merging clusters, the
original entries for the one with the smaller index will be used to store the distances of the
new cluster. (15%)
2. Write a computer program called MinHeap in C, C++, Java or Python that builds a min-heap for
a list of integers and remove the smallest integer one by one until the heap is empty. Specifically,
the input includes the followings:
Ÿ The number of integers on the list
Ÿ The integers, one per line, which can be assumed to be all unique
The program should treat the input integers as an array that follows their input order and turn it
into a heap. The program should print to standard output (stdout) this initial heap and the updated
heap after removing each smallest value. Each heap should be printed as a single commadelimited list on a separate line.
Copyright © 2018 Kevin Yip Page 2 of 3
For example, if your implementation is Java, below is a typical run of the program (based on an
example from the lecture notes):
>java MinHeap
The main part of your program is expected to contain no more than 100 lines of code.
Your program will be graded based on i) whether it can be compiled/run successfully, ii) whether
it follows the input/output formats as specified above, iii) its accuracy on a number of test cases
and iv) whether the program is well-documented with appropriate comments added to explain
the meaning of the code. (50%)
3. This question is about k-means.
(a) Suppose the data set contains an outlier point that is very different from all the other points.
Describe how it would affect the clusters produced by k-means. (5%)
(b) Suppose the data set contains a column with values much larger in magnitude than the other
columns. Describe how it would affect the clusters produced by k-means. (5%)
(c) There is another clustering method that is similar to k-means, but instead of assigning every
point to the cluster with the closest representative, it assigns every point to each cluster in a
fuzzy manner according to its distance to the cluster representative. For example, suppose
k=2 and a point has a distance of 10 from the representative of cluster 1 and a distance of 20
from the representative of cluster 2, it is assigned a membership value of
(1/10)/[(1/10)+(1/20)] = 2/3 to cluster 1 and a membership value of (1/20)/[(1/10)+(1/20)] =
1/3 to cluster 2. Describe one advantage and one disadvantage of this method as compared to
k-means. (5%)
(d) [Optional] In practice, the number of clusters is often not known. Propose a way to determine
an appropriate value of k based on the input data set. (bonus 5%)
Copyright © 2018 Kevin Yip Page 3 of 3
For the programming question, you should first submit your program to our online judge system.
Please see tutorial notes set 1 for explanations.
For the non-programming question, give all your answers in a single file named _asmt5.,
where is your student ID and is either doc, docx or pdf. We prefer pdf files because it
has better portability. Then put your files for both the programming and non-programming questions
in a zip file named and submit it to Blackboard.
Both your written and source code files should contain the following header. Contact Kevin before
submitting the assignment if you have anything unclear about the guidelines on academic honesty.
CSCI3220 2018-19 First Term Assignment 5
I declare that the assignment here submitted is original except for source
material explicitly acknowledged, and that the same or closely related material
has not been previously submitted for another course. I also acknowledge that I
am aware of University policy and regulations on honesty in academic work, and
of the disciplinary guidelines and procedures applicable to breaches of such
policy and regulations, as contained in the following websites.
University Guideline on Academic Honesty:
Student Name:
Student ID :
Marking Scheme and Notes:
1. Remember to submit your assignment by 23:59pm of the due date. We may not accept late
2. For the written part, if you submit multiple times, ONLY the content and time-stamp of the latest
one before the submission deadline will be considered. For the program, the version submitted
to the online judge system with the highest score will be graded.
University Guideline for Plagiarism
Please pay attention to the university policy and regulations on honesty in academic work, and
the disciplinary guidelines and procedures applicable to breaches of such policy and regulations.
Details can be found at With each assignment,
students will be required to submit a statement that they are aware of these policies, regulations,
guidelines and procedures.