Data Science Bootcamp Week 1 - mini project
Updated: Nov 19, 2019
* NB: as this was my first week, the code will be far from polished.
I started the Flatiron School's data science program on Aug 5th. So far we have covered:
Git and GitHub
Importing and statistical analysis of data
Using the Python libraries NumPy and pandas
I'm pretty happy with that progress for only one week. To demonstrate what I learned, I completed a mini project of my own.
DNA to Amino Acid (mini project)
DNA is quite literally genetic code. Without getting mired in the details of transcription and replication, the process can be summarized like this:
DNA code is made of a sequence of four nucleotide bases: Guanine (G), Cytosine (C), Adenine (A), and Thymine (T). We will use their one-letter abbreviations from here on. DNA is double-stranded, with bases forming pairs (A – T; C – G).
RNA is pretty similar, though it is often single-stranded. One major difference to note is that instead of Thymine (T) it contains Uracil (U), which pairs with Adenine just as Thymine does.
A sequence of DNA gets transcribed to a complementary sequence of RNA. This RNA then gets "read" three bases at a time: each group of three adjacent, non-overlapping bases (a codon) codes for a specific amino acid. These amino acids are bonded together to form polypeptide chains (essentially proteins).
The polypeptide chains fold into a specific shape based on the chemistry of the amino acids. One important property of each amino acid is its chemical character, for example whether it is hydrophobic or acidic. The folding of the protein is what lets it function properly, so it can be useful to investigate the chemical properties of a chain of amino acids.
I will walk through a Python file called genetics_lib.py.
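To make the "three bases at a time" reading concrete, here is a minimal sketch of splitting an RNA string into codons. The helper name split_codons is my own placeholder and is not necessarily what genetics_lib.py uses; it also assumes that leftover bases at the end (fewer than three) are simply dropped.

```python
def split_codons(rna):
    """Split an RNA string into non-overlapping three-base codons.

    Trailing bases that do not fill a complete codon are dropped
    (an assumption made for this sketch).
    """
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


print(split_codons("AUGGUGCACUGA"))  # ['AUG', 'GUG', 'CAC', 'UGA']
```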
I started by creating a dictionary of the amino acids found in human proteins and their properties. Each entry in this dictionary has two parts. The key is the full name of the amino acid. The value is a list containing three items: first, a list of all codons that code for that particular amino acid; second, the abbreviation of the name; third, the chemical property of the amino acid.
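As a rough illustration of that layout, a few entries might look like the sketch below. This is only a subset, and several details are my assumptions rather than the exact contents of genetics_lib.py: the variable name amino_acids, the three-letter abbreviations, the wording of the property labels, and the separate "Stop" entry.

```python
# Illustrative subset only. Codons are written as RNA triplets.
# Layout per entry: full name -> [list of codons, abbreviation, chemical property]
amino_acids = {
    "Methionine": [["AUG"], "Met", "hydrophobic"],
    "Valine": [["GUU", "GUC", "GUA", "GUG"], "Val", "hydrophobic"],
    "Histidine": [["CAU", "CAC"], "His", "basic"],
    "Glutamic acid": [["GAA", "GAG"], "Glu", "acidic"],
    # Assumed extra entry so stop codons can be looked up as well.
    "Stop": [["UAA", "UAG", "UGA"], "Stop", "stop signal"],
}

# Unpack the three items stored for one amino acid.
codons, abbreviation, chem_property = amino_acids["Valine"]
print(abbreviation, chem_property, codons)
```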
The next thing I did was to create two functions. The first function, to_rna, takes a sequence of DNA stored as a string and replaces all instances of 'T' with 'U'. I'm sure there is a more efficient way of writing this code, but it is a basic for loop that builds a new list, copying each original base and substituting 'U' wherever the original was 'T'. I then turned the list into a NumPy array for no particular reason; I was just playing around with data types at that point.
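A minimal sketch of that loop is below, next to the shorter built-in alternative. Note the differences from what I described: this version returns a plain string instead of a NumPy array, it upper-cases the input first, and to_rna_short is just a name I made up for the one-line variant.

```python
def to_rna(dna):
    """Return the RNA version of a DNA string by swapping every 'T' for 'U'."""
    bases = []
    for base in dna.upper():
        # Copy each base across, substituting U wherever the DNA has T.
        bases.append("U" if base == "T" else base)
    return "".join(bases)


def to_rna_short(dna):
    """Same result using the built-in str.replace (hypothetical variant)."""
    return dna.upper().replace("T", "U")


print(to_rna("ATGGTGCACTGA"))  # AUGGUGCACUGA
```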
Next I created two new functions that output our end goal. The function get_amino_acids goes through each codon in a list of codons and, for each key and value pair in the amino acids dictionary we created at the beginning, checks whether that codon appears in the value (v, because the codons are stored in the nested list inside the dictionary values). When it finds a match, it adds the amino acid's name to a list. It also builds a list of abbreviated amino acids in case we want to see that instead. I think I will go back and use map and a lambda function if I can, but for now this was the most readable and logical way to code this function. The second function does the exact same thing, but it returns a list of the chemical properties stored in v instead.
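The nested loop described above could be sketched roughly like this. The dictionary here is a three-entry stand-in, get_properties is a placeholder name for the second function, and returning the names and abbreviations together as a pair is my assumption about how the two lists are handed back.

```python
# Small stand-in for the full dictionary:
# full name -> [list of codons, abbreviation, chemical property]
amino_acids = {
    "Methionine": [["AUG"], "Met", "hydrophobic"],
    "Valine": [["GUU", "GUC", "GUA", "GUG"], "Val", "hydrophobic"],
    "Glutamic acid": [["GAA", "GAG"], "Glu", "acidic"],
}


def get_amino_acids(codons):
    """Return (full names, abbreviations) for the codons found in the dictionary."""
    names, abbreviations = [], []
    for codon in codons:
        for k, v in amino_acids.items():
            if codon in v[0]:  # v[0] holds the list of codons
                names.append(k)
                abbreviations.append(v[1])
    return names, abbreviations


def get_properties(codons):
    """Return the chemical property for each codon found in the dictionary."""
    properties = []
    for codon in codons:
        for k, v in amino_acids.items():
            if codon in v[0]:
                properties.append(v[2])  # v[2] holds the chemical property
    return properties


names, abbreviations = get_amino_acids(["AUG", "GUG", "GAG"])
print(names)
print(abbreviations)
print(get_properties(["AUG", "GUG", "GAG"]))
```

Codons that are not in the dictionary are silently skipped in this sketch, so the output lists can be shorter than the input list.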
Next, I created a function, translate, that takes a DNA string and calls the other functions to go from DNA --> RNA --> amino acid chain. I also added a small warning for two situations: if the sequence does not begin with a start codon, the function prints an alert, and it does the same if the sequence does not end with a stop codon. This is just in case the user is not aware that the DNA is not a complete, codeable exon.
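Here is a compact, self-contained sketch of that pipeline with the two warnings. It inlines the earlier steps rather than calling the real library functions, uses a three-entry stand-in dictionary, and assumes AUG as the start codon and UAA, UAG and UGA as the stop codons (the standard genetic code). The warning text is invented for the example.

```python
START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Small stand-in for the full dictionary.
amino_acids = {
    "Methionine": [["AUG"], "Met", "hydrophobic"],
    "Valine": [["GUU", "GUC", "GUA", "GUG"], "Val", "hydrophobic"],
    "Glutamic acid": [["GAA", "GAG"], "Glu", "acidic"],
}


def translate(dna):
    """Go from a DNA string to a list of amino acid names, warning on odd ends."""
    rna = dna.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]

    if not codons or codons[0] != START_CODON:
        print("Warning: sequence does not begin with a start codon (AUG).")
    if not codons or codons[-1] not in STOP_CODONS:
        print("Warning: sequence does not end with a stop codon.")

    chain = []
    for codon in codons:
        for name, value in amino_acids.items():
            if codon in value[0]:
                chain.append(name)
    return chain


print(translate("ATGGTGGAGTGA"))  # no warnings for this input
print(translate("GTGGAG"))        # prints both warnings first
```

Because the stand-in dictionary has no entry for stop codons, the final TGA is checked for the warning but does not show up in the returned chain.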
The next function is the same as the translate function, but it returns the chemical properties (polarity) instead by calling the earlier functions we built.
Finally, I created a function called ask_input that asks the user to input a string of DNA bases and whether they want a list of amino acids, just their chemical properties, or both.
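A rough sketch of how such a prompt could work is below. The prompt wording, the a/p/b choices, the lookup helper, and the input_fn parameter (added only so the function can be driven without a keyboard) are all assumptions for illustration and not the actual genetics_lib.py code.

```python
# Small stand-in for the full dictionary.
amino_acids = {
    "Methionine": [["AUG"], "Met", "hydrophobic"],
    "Valine": [["GUU", "GUC", "GUA", "GUG"], "Val", "hydrophobic"],
    "Glutamic acid": [["GAA", "GAG"], "Glu", "acidic"],
}


def lookup(dna, field):
    """Hypothetical helper: return names (field='name') or properties for a DNA string."""
    rna = dna.upper().replace("T", "U")
    results = []
    for i in range(0, len(rna) - len(rna) % 3, 3):
        codon = rna[i:i + 3]
        for name, value in amino_acids.items():
            if codon in value[0]:
                results.append(name if field == "name" else value[2])
    return results


def ask_input(input_fn=input):
    """Ask for a DNA string and what to show, then print and return the results."""
    dna = input_fn("Enter a DNA sequence: ").strip()
    choice = input_fn("Show (a)mino acids, (p)roperties, or (b)oth? ").strip().lower()

    output = {}
    if choice in ("a", "b"):
        output["amino_acids"] = lookup(dna, "name")
    if choice in ("p", "b"):
        output["properties"] = lookup(dna, "property")

    for label, values in output.items():
        print(label + ":", values)
    return output
```

Calling ask_input() with no arguments reads from the keyboard. Passing a different input_fn makes it possible to feed in canned answers when experimenting.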
And that's it! A basic first project, but one that I hope to build on in the coming weeks. Hopefully I will be able to create some interesting visualizations and potential analyses.
Below is a video of me using this library, with the input being the beginning of the sequence that codes for one of the hemoglobin proteins in humans. If this sequence has a mutation, it can result in sickle cell anemia! More on that later.