CSCI3220 Algorithms for Bioinformatics Assignment 3 solved

\$35.00

Description

5/5 - (1 vote)

In this assignment, you will use different ways to compute sequence similarity based on g-gapped kmers, to experience their relative. You will then perform some probability derivations using the
three basic rules, and solve some generic problems. Finally, you will implement the forward
algorithm for computing data likelihoods according to a first-order hidden Markov model, in a way
that your program can handle any type of sequences.
Questions:
1. This question is about g-gapped k-mers. Suppose we want to compute a similarity score
between DNA sequences s1=GCGTCCGAC and s2=CGACGCGAC based on 1-gapped 3-mers (i.e.,
g=1, k=3).
(a) How many different 1-gapped 3-mer patterns are there for DNA sequences? Show the steps
(b) List all the 1-gapped 3-mers actually supported by s1 and s2 and their occurrence counts in
the two sequences by filling in the following tables (add more rows if needed). The tables
should be sorted in ascending order of the 1-gapped 3-mers, where the wildcard character *
is ordered before the four nucleotides. (12%)
Table for s1
1-gapped 3-mer No. of occurrence
Table for s2
1-gapped 3-mer No. of occurrence
(c) Based on your tables in Part b, compute the similarity between the two sequences, defined
as the inner product of the two 1-gap 3-mer occurrence vectors. Show clearly the common
1-gap 3-mers of the two sequences in your calculations. (3%)
(d) Now complete the following table listing the number of mismatches among the 4-mers in
the two sequences. The row and column headers should be the 4-mers present in s1 and s2,
respectively, both sorted ascendingly. If a 4-mer appears multiple times in a sequence,
repeat it that number of times in the row/column headers. (8%)
s2
s1
(e) Based on your result in Part d, compute the similarity score between s1 and s2 again. Show
the steps in your calculation. Do not forget to compare with your result in Part c. (4%)
(f) In practice, it is not easy to decide which values of g and k to use. One possible approach is
to try multiple combinations of these values, and use some independent knowledge to
evaluate which combination leads to the best similarity scores. Propose two ideas for
reducing the total computational time as compared to repeating the above calculations for
each combination of g and k values. (6%)
2. This question is about statistical modeling. Suppose Y is a binary variable, and we want to
construct a model of Y based on binary features X1 and X2, which may not be conditionally
independent.
(a) Derive a formula for computing Pr(Y=0|X1=1,X2=1) in terms of Pr(X1=1), Pr(X1=1,X2=0)
and Pr(X1=1,X2=1,Y=1). Show your derivations step by step. (6%)
(b) [Optional] In general, with three variables X1, X2 and Y, there are various possible
probabilities of the form Pr(VL=vl|VR=vr), where VL and VR are two disjoint ordered sets of
variables (and VR can be empty), and vl and vr are their corresponding values. For example,
in the case of Pr(X1=1), VL=(X1), vl=(1), and VR= Æ. In the case of Pr(Y=0|X1=1,X2=1),
VL=(Y), vl=(0), VR=(X1,X2), and vr=(1,1).
Propose an algorithm that can take one of these probabilities as target (such as
Pr(Y=0|X1=1,X2=1)) and some of these probabilities as input (such as Pr(X1=1),
Pr(X1=1,X2=0) and Pr(X1=1,X2=1,Y=1)), and determine whether the input is sufficient for
inferring the value of the target. (bonus 10%)
(c) Now assume X1 and X2 are independent when conditioned on Y. Give a formal definition of
this assumption using mathematical symbols. (3%)
(d) Depending on the values of X1, X2 and Y, there are 8 probabilities of the form Pr(X1,X2|Y).
Given an example set of values for these 8 probabilities such that X1 and X2 are not
independent when conditioned on Y. Prove that the two variables are not conditionally
independent. (5%)
(e) If two variables X1 and X2 are independent when conditioned on Y, is it guaranteed that they
are also unconditionally independent? Explain why or why not. (5%)
(f) If two variables X1 and X2 are unconditionally independent, is it guaranteed that they are
also independent when conditioned on Y? Explain why or why not. (5%)
3. Write a computer program called Forward in C, C++, Java or Python that implements the
forward algorithm to compute the data likelihood of a given sequence based on a first-order
hidden Markov model. This program should be able to handle any number of states and any
number of emission symbols. Your program should take the following inputs in the specified
order, each on a different line:
• The number of states (an integer)
• The initial probabilities of the states (floating-point numbers, one per line), in the order of
p1, p2, …
• The transition probabilities among the states (floating-point numbers, one per line), in the
order of p11, p12, …, p21, p22, …
• The number of emission symbols (an integer)
• The emission symbols (characters, one per line), in the order of b1, b2, …
• The emission probabilities (floating-point numbers, one per line), in the order of e1(b1),
e1(b2), …, e2(b1), e2(b2), …
• The sequence the likelihood of which is to be computed (a string)
You can assume the input data are properly formatted and do not need to check for errors.
The output should be a single floating-point number of the likelihood value. Due to the inexact
nature of floating-point numbers, when we check your answers a small amount of leeway may
be considered.
The non-comment portion of your program is expected to contain no more than 200 lines of
code.
Here is an expected screen shot when a Java program is run on an example from the lecture
notes:
>java Forward
2
0.5
0.5
0.9
0.1
0.8
0.2
4
A
C
G
T
0.5
0.5
0
0
0.25
0.75
0
0
CAC
0.15156250000000002
Your program will be graded based on i) whether it can be compiled/run successfully, ii)
whether it follows the input/output formats as specified above, iii) its accuracy on a number of
test cases and iv) whether the program is well-documented with appropriate comments added to
explain the meaning of the code. (40%)
Submission:
For the programming question, you should first submit your program to our online judge system.
Please see tutorial notes set 1 for explanations.
For the non-programming question, give all your answers in a single file named _asmt3.,
where is your student ID and is either doc, docx or pdf. We prefer pdf files because it
has better portability. Then put your files for both the programming and non-programming
questions in a zip file named _asmt3.zip and submit it to Blackboard.
Both your written and source code files should contain the following header. Contact Kevin before
submitting the assignment if you have anything unclear about the guidelines on academic honesty.
CSCI3220 2018-19 First Term Assignment 3
I declare that the assignment here submitted is original except for source
material explicitly acknowledged, and that the same or closely related material
has not been previously submitted for another course. I also acknowledge that I
am aware of University policy and regulations on honesty in academic work, and
of the disciplinary guidelines and procedures applicable to breaches of such
policy and regulations, as contained in the following websites.
Student Name:
Student ID :
Marking Scheme and Notes:
1. Remember to submit your assignment by 23:59pm of the due date. We may not accept late
submissions.
2. For the written part, if you submit multiple times, ONLY the content and time-stamp of the
latest one before the submission deadline will be considered. For the program, the version
submitted to the online judge system with the highest score will be graded.
University Guideline for Plagiarism
Please pay attention to the university policy and regulations on honesty in academic work, and
the disciplinary guidelines and procedures applicable to breaches of such policy and regulations.
Details can be found at http://www.cuhk.edu.hk/policy/academichonesty/. With each
assignment, students will be required to submit a statement that they are aware of these policies,
regulations, guidelines and procedures.