## Description

In this assignment, you will use different ways to compute sequence similarity based on g-gapped kmers, to experience their relative. You will then perform some probability derivations using the

three basic rules, and solve some generic problems. Finally, you will implement the forward

algorithm for computing data likelihoods according to a first-order hidden Markov model, in a way

that your program can handle any type of sequences.

Questions:

1. This question is about g-gapped k-mers. Suppose we want to compute a similarity score

between DNA sequences s1=GCGTCCGAC and s2=CGACGCGAC based on 1-gapped 3-mers (i.e.,

g=1, k=3).

(a) How many different 1-gapped 3-mer patterns are there for DNA sequences? Show the steps

in your calculation. (3%)

(b) List all the 1-gapped 3-mers actually supported by s1 and s2 and their occurrence counts in

the two sequences by filling in the following tables (add more rows if needed). The tables

should be sorted in ascending order of the 1-gapped 3-mers, where the wildcard character *

is ordered before the four nucleotides. (12%)

Table for s1

1-gapped 3-mer No. of occurrence

Table for s2

1-gapped 3-mer No. of occurrence

(c) Based on your tables in Part b, compute the similarity between the two sequences, defined

as the inner product of the two 1-gap 3-mer occurrence vectors. Show clearly the common

1-gap 3-mers of the two sequences in your calculations. (3%)

(d) Now complete the following table listing the number of mismatches among the 4-mers in

the two sequences. The row and column headers should be the 4-mers present in s1 and s2,

respectively, both sorted ascendingly. If a 4-mer appears multiple times in a sequence,

repeat it that number of times in the row/column headers. (8%)

s2

s1

Copyright © 2018 Kevin Yip Page 2 of 4

(e) Based on your result in Part d, compute the similarity score between s1 and s2 again. Show

the steps in your calculation. Do not forget to compare with your result in Part c. (4%)

(f) In practice, it is not easy to decide which values of g and k to use. One possible approach is

to try multiple combinations of these values, and use some independent knowledge to

evaluate which combination leads to the best similarity scores. Propose two ideas for

reducing the total computational time as compared to repeating the above calculations for

each combination of g and k values. (6%)

2. This question is about statistical modeling. Suppose Y is a binary variable, and we want to

construct a model of Y based on binary features X1 and X2, which may not be conditionally

independent.

(a) Derive a formula for computing Pr(Y=0|X1=1,X2=1) in terms of Pr(X1=1), Pr(X1=1,X2=0)

and Pr(X1=1,X2=1,Y=1). Show your derivations step by step. (6%)

(b) [Optional] In general, with three variables X1, X2 and Y, there are various possible

probabilities of the form Pr(VL=vl|VR=vr), where VL and VR are two disjoint ordered sets of

variables (and VR can be empty), and vl and vr are their corresponding values. For example,

in the case of Pr(X1=1), VL=(X1), vl=(1), and VR= Æ. In the case of Pr(Y=0|X1=1,X2=1),

VL=(Y), vl=(0), VR=(X1,X2), and vr=(1,1).

Propose an algorithm that can take one of these probabilities as target (such as

Pr(Y=0|X1=1,X2=1)) and some of these probabilities as input (such as Pr(X1=1),

Pr(X1=1,X2=0) and Pr(X1=1,X2=1,Y=1)), and determine whether the input is sufficient for

inferring the value of the target. (bonus 10%)

(c) Now assume X1 and X2 are independent when conditioned on Y. Give a formal definition of

this assumption using mathematical symbols. (3%)

(d) Depending on the values of X1, X2 and Y, there are 8 probabilities of the form Pr(X1,X2|Y).

Given an example set of values for these 8 probabilities such that X1 and X2 are not

independent when conditioned on Y. Prove that the two variables are not conditionally

independent. (5%)

(e) If two variables X1 and X2 are independent when conditioned on Y, is it guaranteed that they

are also unconditionally independent? Explain why or why not. (5%)

(f) If two variables X1 and X2 are unconditionally independent, is it guaranteed that they are

also independent when conditioned on Y? Explain why or why not. (5%)

3. Write a computer program called Forward in C, C++, Java or Python that implements the

forward algorithm to compute the data likelihood of a given sequence based on a first-order

hidden Markov model. This program should be able to handle any number of states and any

number of emission symbols. Your program should take the following inputs in the specified

order, each on a different line:

• The number of states (an integer)

• The initial probabilities of the states (floating-point numbers, one per line), in the order of

p1, p2, …

Copyright © 2018 Kevin Yip Page 3 of 4

• The transition probabilities among the states (floating-point numbers, one per line), in the

order of p11, p12, …, p21, p22, …

• The number of emission symbols (an integer)

• The emission symbols (characters, one per line), in the order of b1, b2, …

• The emission probabilities (floating-point numbers, one per line), in the order of e1(b1),

e1(b2), …, e2(b1), e2(b2), …

• The sequence the likelihood of which is to be computed (a string)

You can assume the input data are properly formatted and do not need to check for errors.

The output should be a single floating-point number of the likelihood value. Due to the inexact

nature of floating-point numbers, when we check your answers a small amount of leeway may

be considered.

The non-comment portion of your program is expected to contain no more than 200 lines of

code.

Here is an expected screen shot when a Java program is run on an example from the lecture

notes:

>java Forward

2

0.5

0.5

0.9

0.1

0.8

0.2

4

A

C

G

T

0.5

0.5

0

0

0.25

0.75

0

0

CAC

0.15156250000000002

Your program will be graded based on i) whether it can be compiled/run successfully, ii)

whether it follows the input/output formats as specified above, iii) its accuracy on a number of

test cases and iv) whether the program is well-documented with appropriate comments added to

explain the meaning of the code. (40%)

Submission:

For the programming question, you should first submit your program to our online judge system.

Please see tutorial notes set 1 for explanations.

Copyright © 2018 Kevin Yip Page 4 of 4

For the non-programming question, give all your answers in a single file named

where

has better portability. Then put your files for both the programming and non-programming

questions in a zip file named

Both your written and source code files should contain the following header. Contact Kevin before

submitting the assignment if you have anything unclear about the guidelines on academic honesty.

CSCI3220 2018-19 First Term Assignment 3

I declare that the assignment here submitted is original except for source

material explicitly acknowledged, and that the same or closely related material

has not been previously submitted for another course. I also acknowledge that I

am aware of University policy and regulations on honesty in academic work, and

of the disciplinary guidelines and procedures applicable to breaches of such

policy and regulations, as contained in the following websites.

University Guideline on Academic Honesty:

http://www.cuhk.edu.hk/policy/academichonesty/

Student Name:

Student ID :

Marking Scheme and Notes:

1. Remember to submit your assignment by 23:59pm of the due date. We may not accept late

submissions.

2. For the written part, if you submit multiple times, ONLY the content and time-stamp of the

latest one before the submission deadline will be considered. For the program, the version

submitted to the online judge system with the highest score will be graded.

University Guideline for Plagiarism

Please pay attention to the university policy and regulations on honesty in academic work, and

the disciplinary guidelines and procedures applicable to breaches of such policy and regulations.

Details can be found at http://www.cuhk.edu.hk/policy/academichonesty/. With each

assignment, students will be required to submit a statement that they are aware of these policies,

regulations, guidelines and procedures.