BIOL4803 Assignment #3 solved


The goal of this project is to create a script that analyzes various features of a bacterial genome. You are given the following types of files to analyze:

Sequence file: a FASTA file containing the DNA sequence of a bacterial species. The DNA is organized into 1 or more chromosomes.
Annotation file: a text file containing tab-delimited data for each gene. There should be a header line containing the following five columns: GeneName, Chromosome, Strand, Start, Stop. Each subsequent line will contain information for a single gene. Assume the coordinate system is 1 based.

Your script should take the following arguments:

Positional arguments
Sequence file: required, must be a string
Annotation file: required, must be a string

Optional arguments
Codon analysis flag: optional, should not take a value
Gene sequence flag: optional, should take 1 or more gene names whose sequences are returned

Your script should do the following things:

A. Using argparse, take in all the above arguments and store them appropriately in a single object.

B. Read in and perform error checking on the sequence and annotation files.

For the sequence file, use the pyfaidx module to read in the data. Verify that:
1. The file exists
2. It is in proper FASTA format
3. All nucleotides are A, C, G, or T (uppercase and lowercase are both allowed)

For the annotation file, use pandas to read in the data. Verify that:
1. The file exists
2. It contains five columns
3. The headers of the columns are named: GeneName, Chromosome, Strand, Start, Stop
4. None of the genes have the same name
5. Strand equals ‘+’ or ‘-’
6. Start is less than Stop
7. The length of the gene is divisible by 3

If any of these conditions are violated, the program should print an informative statement listing all of the violations and quit.

C. If no optional arguments are given, your script should report the name, length, number of genes, and GC content for each of the chromosomes.

D. If the codon analysis option is used, your script should report the amino acid and codon usage for the entire genome (i.e. how often each amino acid is used within all of the proteins and how often each codon is used for a given amino acid):
A 5.5% – GCA: 23%; GCC – 37%; GCG – 21%; GCT – 19%

E. If the gene sequence option is used, your script should print the protein sequence for each of the requested genes in FASTA format.

We’ve provided a template script for you to use. Some example outputs for this script are given below:

OUTPUTS

General Usage
Help documentation:
$ python3 Assignment3_Solution.py -h
Base case:
$ python3 Assignment3_Solution.py Seq.fa Annotation.txt
Codon flag:
$ python3 Assignment3_Solution.py Seq.fa Annotation.txt -c
Gene flag:
$ python3 Assignment3_Solution.py Seq.fa Annotation.txt -g fadA fadB X recF VV_RS00470

Error Handling
Missing command inputs:
$ python3 Assignment3_Solution.py
Missing files:
$ python3 Assignment3_Solution.py Seq2.fa Annotation.txt
$ python3 Assignment3_Solution.py Seq2.fa Annotation2.txt
Bad input files:
$ python3 Assignment3_Solution.py SeqError1.fa Annotation.txt
$ python3 Assignment3_Solution.py SeqError2.fa Annotation.txt
$ python3 Assignment3_Solution.py Seq.fa AnnotationError1.txt
Bad input files (duplicate gene):
$ python3 Assignment3_Solution.py Seq.fa AnnotationError2.txt
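
Step A could be sketched roughly as follows. This is a minimal illustration, not the provided template: the short flags `-c` and `-g` appear in the example commands, but the long option names (`--codon`, `--genes`), the attribute names (`sequence_file`, `annotation_file`), and the helper `build_parser` are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser for the two positional and two optional arguments."""
    parser = argparse.ArgumentParser(
        description="Analyze features of a bacterial genome."
    )
    # Positional arguments: both required, both strings.
    parser.add_argument("sequence_file", type=str,
                        help="FASTA file with the chromosome sequences")
    parser.add_argument("annotation_file", type=str,
                        help="tab-delimited gene annotation file")
    # Codon analysis flag: takes no value, defaults to False.
    parser.add_argument("-c", "--codon", action="store_true",
                        help="report amino acid and codon usage")
    # Gene sequence flag: nargs='+' requires one or more gene names.
    parser.add_argument("-g", "--genes", nargs="+", type=str, default=None,
                        help="gene names whose protein sequence is printed")
    return parser


# All arguments end up on a single Namespace object.
# An explicit list is parsed here so the example does not depend on sys.argv.
args = build_parser().parse_args(["Seq.fa", "Annotation.txt", "-g", "fadA", "fadB"])
print(args.sequence_file, args.annotation_file, args.codon, args.genes)
```

In a real script the call would be `build_parser().parse_args()` with no list, so that argparse reads the command line and prints its own usage message when the positional arguments are missing.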
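
The annotation checks in step B ask for every violation to be reported, not just the first, which suggests collecting messages in a list before quitting. The sketch below shows one way to do that with pandas. The function name `find_annotation_violations`, the message wording, and the inclusive length formula `Stop - Start + 1` are assumptions; the assignment only states that coordinates are 1 based. It also assumes Start and Stop were read as numbers.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["GeneName", "Chromosome", "Strand", "Start", "Stop"]


def find_annotation_violations(df):
    """Return a list of human-readable violations (empty if the table is valid)."""
    violations = []
    if len(df.columns) != 5 or set(df.columns) != set(EXPECTED_COLUMNS):
        violations.append(f"Expected columns {EXPECTED_COLUMNS}, found {list(df.columns)}")
        return violations  # the remaining checks need the expected columns

    # Gene names must be unique.
    for name in df.loc[df["GeneName"].duplicated(), "GeneName"].unique():
        violations.append(f"Duplicate gene name: {name}")

    # Strand must be '+' or '-'.
    for name in df.loc[~df["Strand"].isin(["+", "-"]), "GeneName"]:
        violations.append(f"Invalid strand for gene {name}")

    # Start must be strictly less than Stop.
    for name in df.loc[df["Start"] >= df["Stop"], "GeneName"]:
        violations.append(f"Start is not less than Stop for gene {name}")

    # Gene length must be a multiple of 3 (assumes inclusive 1-based coordinates).
    length = df["Stop"] - df["Start"] + 1
    for name in df.loc[length % 3 != 0, "GeneName"]:
        violations.append(f"Length not divisible by 3 for gene {name}")

    return violations


# Small in-memory table standing in for an annotation file (made-up data).
example = (
    "GeneName\tChromosome\tStrand\tStart\tStop\n"
    "geneA\tchr1\t+\t1\t9\n"
    "geneA\tchr1\t-\t10\t18\n"
    "geneB\tchr1\t*\t30\t20\n"
)
table = pd.read_csv(io.StringIO(example), sep="\t")
for message in find_annotation_violations(table):
    print(message)
```

A full script would check that the file exists before calling `pd.read_csv`, print the collected messages, and exit if the list is not empty.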
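
The GC content in step C and the usage report in step D reduce to counting over plain strings. The following sketch assumes the coding sequences have already been extracted and validated, uses the standard genetic code, and leaves stop codons out of the amino acid totals. All three are assumptions, as are the function names and the exact output format.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def codon_usage(coding_seqs):
    """Map amino acid -> (fraction of all residues, {codon: fraction within that amino acid})."""
    codon_counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE[codon] != "*":  # assumption: stop codons are not counted
                codon_counts[codon] += 1

    aa_counts = Counter()
    for codon, n in codon_counts.items():
        aa_counts[CODON_TABLE[codon]] += n

    total = sum(aa_counts.values())
    usage = {}
    for aa in sorted(aa_counts):
        per_codon = {c: n / aa_counts[aa]
                     for c, n in sorted(codon_counts.items())
                     if CODON_TABLE[c] == aa}
        usage[aa] = (aa_counts[aa] / total, per_codon)
    return usage


def format_usage(usage):
    """One report line per amino acid, loosely following the example in step D."""
    lines = []
    for aa, (aa_frac, per_codon) in usage.items():
        parts = "; ".join(f"{c}: {frac:.0%}" for c, frac in per_codon.items())
        lines.append(f"{aa} {aa_frac:.1%} - {parts}")
    return lines


for line in format_usage(codon_usage(["ATGGCAGCCGCCTAA"])):
    print(line)
```

Genes on the '-' strand would need to be reverse complemented before they are passed to `codon_usage`.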
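
For step E, the 1-based coordinates have to be converted to Python's 0-based slices, and genes on the '-' strand reverse complemented before translation. The sketch below illustrates this on plain strings. It assumes Start and Stop are both inclusive, uses the standard genetic code, and stops at the first stop codon; the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_gene(chrom_seq, start, stop, strand):
    """Return the coding strand of a gene given 1-based inclusive coordinates."""
    seq = chrom_seq[start - 1:stop].upper()  # 1-based inclusive -> 0-based half-open
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon.

    Simplification: alternative start codons (e.g. GTG) are not recoded to M.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(name, protein, width=60):
    """Format one record with a '>' header line and wrapped sequence lines."""
    lines = [f">{name}"]
    lines += [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join(lines)


# Made-up chromosome with a '+' strand gene at positions 3..11.
chromosome = "CCATGGCATAACC"
gene = extract_gene(chromosome, 3, 11, "+")
print(to_fasta("exampleGene", translate(gene)))
```

A requested gene name that is missing from the annotation file would need its own message; the lookup is left out here.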