Chromosome 2 is one of the largest human chromosomes, spanning over 242 million
nucleotides. In order to keep file sizes manageable, we have slimmed down chromosome 2 for
90 individuals and stored these sequences in slim_chr2_seq.fasta (the entire fasta file for Chr2
is over 200Mb!). Note that although a normal human genome has two possible SNPs at each
genome location (one for each homologous chromosome), we have only provided you with a
single SNP at each locus for each individual for simplicity. Additionally, we have included a file
(slim_chr2_SNPS.vcf) that contains a table of previously characterized SNPs and some
information about them (genome location, reference SNP, variant SNP, disease information).
We would like to read these large data files into Python data structures that we can easily
manipulate and handle.
Part I: Analysis of single-nucleotide variations in a population
Large DNA sequences are often stored in a text file in FASTA format, a standardized format for
storing sequences in a text file (you saw this already in Lab 2). FASTA files follow a basic
format: Lines starting with “>” are sequence headers, which identify all following lines of
sequence until the next “>” header line; FASTA files can contain one or many sequences, each
identified by its own “>” header line.
Write a script that reads the provided FASTA file (slim_chr2_seq.fasta) as well as the provided
SNP table (slim_chr2_SNPS.vcf) into lists that we can analyze. For each variant found in the
SNP table, calculate the frequency at which the variant (SNP) occurs in the set of 90 individuals
we have provided to you. Additionally, we want to find the variants with the maximum and
minimum variant frequency in this population.
Your script should do the following:
1. Open and read in chromosomal sequences in slim_chr2_seq.fasta for each of the 90
individuals. Store identifier and sequence information in separate lists, one element in
each list per individual.
2. Open and read in SNP information stored in slim_chr2_SNPS.vcf (which can be opened
as a text file). Store all information related to chromosome 2 using seven different lists
(open the file in a text editor before writing your code to understand its format).
Remember to close any files you open!
Note: Use the “slim_pos” column to access the base corresponding to the SNP in the fasta
sequences we provided!
PLEASE NOTE: While Python performs indexing from 0, biologists sometimes prefer to index
from 1. We see this in the slim_chr2_SNPS.vcf file. For example, if the value in the
“slim_pos” column is 2000, it refers to the 2000th base in the sequence. If you have one
of your sequences in a variable named “seq,” which element do you access with the
code: `seq`? Hint: it’s not the 2000th base!
3. For each provided variant, calculate the frequency of that SNP in the population of 90
individuals. Store the variant frequencies in a separate list.
4. Find and print the ids of the SNPs in the population that occur with the highest and
5. Open a new output file called slim_chr2_SNPs_withfrequencies.vcf and print the
following information for each SNP:
a. Chromosome Number
b. SNP Name
c. SNP “Slim” Position
d. SNP Reference SNP
e. SNP Alternative SNP
f. SNP Frequency
g. Gene Disease Information
Your file should be tab delimited, with the information for one SNP on each line. Please
include column titles, as you did in Lab 3! You are provided with a sample output file
called ‘slim_chr2_SNPs_withfrequencies_small.vcf’ to show the expected format.
HINT: Separate lists of the same length are often good ways of storing related information! The
enumerate() or range() functions easily allow you to access the same indices of separate lists!
OPTIONAL for all students: Write your own function that calculates the mean of a list (input: a
list, output: that list’s mean). Calculate the mean across all SNP frequencies in the population.
Part II: Characterizing SNPs
1. OPTIONAL for Csci 3003, but REQUIRED for Csci 5465 students. We’ve included
information about the gene associated with each of these SNPs, as well as any disease
associations with each gene. In this part of the lab, we are interested in genetic variants
that might be associated with the disease “Non-Hodgkin lymphoma”. Specifically, we
would like to collect a list of individuals that have variants that occur in genes associated
with this disease (i.e. search for the string in the disease association field). Add to your
script from Part I new code that that will open a new file called lymphoma_variants.txt
and print a list of individuals that have variants in any gene associated with NonHodgkin lymphoma. Print the results in the following, tab-delimited format (refer to
lymphoma_variants_small.txt for the required formatting):
IndividualID variantID GeneDiseaseInfo
2. For Csci 5465 students: Cross-reference the file you created that contains SNP
frequencies (slim_chr2_SNPs_withfrequencies.vcf) and compare frequencies of
synonymous versus nonsynonymous SNPs.
3. For Csci 3003 students: Select any disease you might be interested in that is listed in
the slim_chr2_SNPS.vcf file, and find a SNP that is in a gene associated with that
disease. What kind of SNP is it (i.e. how does it affect the protein)? What’s known about
the function of the gene it is in and its association with disease? What is the SNP
frequency? NOTE: You can answer all of these questions without completing step (1).
Part III: Discovering new variants
While it is important to characterize and catalog known SNPs, one of the important contributions
of large scale sequencing projects like the 1000 Genomes Project is the discovery of new
variants. Often SNPs can be specific to certain populations and are only found when new
individuals are sequenced. The SNPs we have seen so far were previously known before the
sequencing of these 90 individuals and were assigned reference SNP ids, which can be found
in dbSNP at http://www.ncbi.nlm.nih.gov/SNP/. In this part of the lab, we will scan the
sequenced chromosomes of the population and compare them against the reference sequence
in order to discover new SNPs.
Write a script that scans over each individual’s chromosome sequence. Compare each
nucleotide against the reference sequence and keep track of newly discovered variants and the
alternative base call in an appropriate data structure. Compose a SNP table listing the newly
HINT: Dictionaries will be a very useful data structure to use for this part.
Your script should do the following:
1. For each chromosome in the population, scan the nucleotides, comparing each base
against the reference sequence.
2. If you find a SNP (index where the individual’s sequence differs from the reference),
store its position in the sequence as well as the variant base.
3. Iterate over your newly discovered SNPs and print out the following information to a file
named novel_variants.txt (refer to novel_variants_small.txt for the required formatting):
a. The SNP’s position in the sequence
b. The reference SNP
c. The alternative SNP
d. The frequency of the alternative SNP in this population
Submit to Canvas:
When you’re finished with the lab, make a report of any questions you answered plus any
requested output, and gather the scripts that you modified. Submit your homework files using
the online Canvas submit page. Your files should include:
(1) A .py file with your python script
(2) Your slim_chr2_SNPs_withfrequencies.vcf output file
(3) lymphoma_variants.txt file (optional for CSCI 3003 students, required for 5465 students)
(4) novel_variants.txt file
(5) A txt, PDF, or doc file with the answers to any questions (or embed these within your .py