CSC 226

Software Design and Implementation


A12: It's in your Genes

A DNA researcher would like some code to help with a research project in bioinformatics, an interesting blend of math, computers, and biology.


Objectives

  • More practice breaking a larger problem down into smaller pieces using functions
  • Gain practice manipulating strings
  • Introduce concepts about DNA
This assignment can be completed using pair programming or individually.

Background on Deoxyribonucleic Acid (DNA)

http://www.calabriadna.com/wp-content/uploads/2013/06/dna.jpg
  • DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other organisms. In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. Nearly every cell in a person's body has the same DNA, and an organism's complete set of DNA is called its genome. Gene sequencing tries to determine the exact sequence of nucleotide bases in a strand of DNA to better understand the behavior of a gene.
  • DNA is a double-stranded entity consisting of a sequence of paired nucleotides (also referred to as bases). The picture above shows an example of a sequence.
  • There are four possible nucleotides in DNA: A, C, G, and T.
  • Each strand contains the complementary sequence of the other, where:
    • T pairs with A.
    • A pairs with T.
    • C pairs with G.
    • G pairs with C.
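
The pairing rules above can be written down as a small lookup table. The sketch below is only an illustration: the names PAIRS and are_complementary() are hypothetical and are not part of the assignment's starter file. It checks whether two strands of equal length pair up base by base.

```python
# Pairing rules from the list above, stored as a lookup table.
# PAIRS and are_complementary() are hypothetical names used for illustration.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def are_complementary(strand_one, strand_two):
    """Return True if every base in strand_one pairs with the base
    at the same position in strand_two."""
    if len(strand_one) != len(strand_two):
        return False
    # PAIRS.get() returns None for anything that is not A, C, G, or T,
    # so invalid characters never count as a match.
    return all(PAIRS.get(a) == b for a, b in zip(strand_one, strand_two))


print(are_complementary("TACG", "ATGC"))  # True
print(are_complementary("TACG", "ATGG"))  # False
```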

In this assignment, we will take a string and do the following:

  • is_nucleotide() will test a string to see if it is a valid DNA strand; a valid DNA strand consists solely of the four nucleotides 'A', 'C', 'G', and 'T'.
  • num_times() will determine how often a particular nucleotide occurs in a particular DNA sequence.
  • complement_strand() will determine the complementary sequence of a valid DNA strand given that:
    • T pairs with A.
    • A pairs with T.
    • C pairs with G.
    • G pairs with C.
    For example, the strand "TACG" would produce a complementary sequence of "ATGC".

  • Though this is a bit of a simplification, we will assume that the mRNA of a DNA sequence is formed from the complement strand of that sequence by replacing each occurrence of the nucleotide T with the nucleotide U. The function mRNA() will therefore compute this mRNA given the complement of any valid DNA sequence. For example, given the complement strand "TACG" as input, mRNA() would produce an mRNA strand of "UACG".
  • Each mRNA will then be translated to a sequence of amino acids by "breaking" the mRNA sequence into groups of three nucleotides. The function chunk_amino_acid() should accomplish this. You may not assume that the length of the mRNA input is evenly divisible by three; any "extra" nucleotides that extend beyond the last full group of three should simply be discarded.
  • sequence_gene() will take a sequence of nucleotides such as "CGTAGGCAT" and will utilize the above functions to return the corresponding amino acid sequence such as "ASV".
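
One way the steps above could fit together is sketched below. This is a rough outline under stated assumptions, not the expected solution: the parameter names are guesses, amino_acid_stub() is a hypothetical stand-in for the provided amino_acid_chunks() that knows only the three codons needed for the example, and raising ValueError for an invalid strand is an assumption about error handling.

```python
VALID_BASES = "ACGT"

# Tiny, illustrative subset of the codon table (assumption: the provided
# amino_acid_chunks() covers every codon; this stub covers only three).
CODON_SUBSET = {"GCA": "A", "UCC": "S", "GUA": "V"}


def amino_acid_stub(codon):
    """Hypothetical stand-in for the provided amino_acid_chunks()."""
    return CODON_SUBSET[codon]  # KeyError for codons outside the subset


def is_nucleotide(sequence):
    """True if every character is A, C, G, or T.
    Note: an empty string passes this check as written."""
    return all(base in VALID_BASES for base in sequence)


def complement_strand(sequence):
    """Swap A<->T and C<->G at every position."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))


def mRNA(complement):
    """Replace each T in the complement strand with U."""
    return complement.replace("T", "U")


def chunk_amino_acid(mrna):
    """Split into groups of three, discarding any leftover bases."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def sequence_gene(sequence):
    """Chain the steps: validate, complement, mRNA, chunk, translate."""
    if not is_nucleotide(sequence):
        raise ValueError("not a valid DNA strand")  # assumed behavior
    chunks = chunk_amino_acid(mRNA(complement_strand(sequence)))
    return "".join(amino_acid_stub(chunk) for chunk in chunks)


print(complement_strand("TACG"))     # ATGC
print(mRNA("TACG"))                  # UACG
print(chunk_amino_acid("GCAUCCGU"))  # ['GCA', 'UCC']
print(sequence_gene("CGTAGGCAT"))    # ASV
```

Tracing the last call by hand: "CGTAGGCAT" has the complement "GCATCCGTA", which becomes the mRNA "GCAUCCGUA", which splits into GCA, UCC, and GUA.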

You may work alone or with a pair partner for this. If you work with a partner, be sure to follow good "pair-programming" practices.

Suggestions and Requirements:

  • Begin by downloading yourusername-A12-dna.py
  • Complete each of the functions:
    • is_nucleotide() checks that the string sequence provided is a valid string consisting only of the 4 nucleotides A, C, G, and T.
    • num_times() returns a count of how many times a given nucleotide is found in the input sequence.
    • complement_strand() returns the string that forms the second strand of the DNA sequence, given that Ts complement As and Cs complement Gs.
    • mRNA() returns its input with each nucleotide T replaced by the nucleotide U.
    • chunk_amino_acid() divides the input into substrings of length three, ignoring any "extra" nucleotides at the far end, and returns the resulting substrings in a list.
    • amino_acid_chunks(), which is provided for you and already complete, expects a three-character string as a parameter and returns the corresponding single-character amino acid.
    • sequence_gene() utilizes all of the above by taking a sequence of nucleotides, A, C, G, and T, and returning the corresponding amino acid sequence.
  • Write additional unit tests as needed to fully test each of the above functions.
  • Be sure to include good comments throughout in addition to an appropriate docstring for each function.
  • Be sure to modify the standard header at the top of your program with name, username, assignment number, purpose and acknowledgements.
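
As an illustration of the docstring and unit-test suggestions above, here is one possible shape for num_times(). The parameter order, the docstring layout, and the use of plain assert statements are all assumptions; the starter file may supply its own testing helper, in which case use that instead.

```python
def num_times(sequence, nucleotide):
    """Count how often one nucleotide occurs in a DNA sequence.

    :param sequence: string of nucleotides, e.g. "ACGTA"
    :param nucleotide: single-character string, e.g. "A"
    :return: the number of occurrences, as an int
    """
    count = 0
    for base in sequence:
        if base == nucleotide:
            count += 1
    return count


def test_num_times():
    """Hypothetical test function covering typical and edge cases."""
    assert num_times("ACGTA", "A") == 2  # nucleotide appears twice
    assert num_times("CCCC", "G") == 0   # nucleotide is absent
    assert num_times("", "T") == 0       # empty sequence


test_num_times()
print("all num_times tests passed")
```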
When you are finished writing and testing your program, submit your source code, yourusername(s)-A12-dna.py, into Moodle under assignment A12.

Copyright © 2016 | http://cs.berea.edu/courses/CSC226/ | Licensed under a Creative Commons Attribution-Share Alike 3.0 United States License