LSTMs and DNA
2018-01-10
Biology is super complicated. If you don't believe me, check out the Roche Biochemical Pathways chart [1]. I wonder if you can buy a giant, laminated poster of this chart...

One of the more complicated biochemical processes is the one by which cells convert DNA into functional proteins. As a first step, DNA is converted to mRNA in a process called transcription. Transcription is controlled by transcription factors (TFs), which are proteins that bind to specific parts of the DNA. Understanding which specific parts of DNA TFs bind to might lead to new insights into how transcription occurs and why certain genes are or aren't expressed.
If you formulate this problem statistically, it is essentially sequence classification. That is, taking an n-length sequence of discrete symbols (e.g. the four DNA bases T, A, C, and G) and learning to categorize it (say, as bindable to a protein or not).
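To make that concrete, here's a tiny sketch of how a DNA string can be turned into the kind of integer sequence a sequence classifier expects. The base-to-index mapping and the `encode` function are just illustrative, not part of the original analysis:

```python
# Hypothetical encoding step: map each base to an integer index so that a
# 60-base DNA string becomes a length-60 sequence of discrete symbols.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(sequence):
    """Turn a DNA string like 'TACGGA' into a list of integer indices."""
    return [BASE_TO_INDEX[base] for base in sequence.upper()]

print(encode("TACGGA"))  # -> [3, 0, 1, 2, 2, 0]
```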
There are a number of statistical models one can use for sequence classification. Long Short-Term Memory (LSTM) neural networks are an exciting approach [2] that could be useful for this problem. These types of neural networks have been increasingly used by Google, Apple, and others. Unlike standard feedforward networks, LSTMs have loops that allow them to excel at retaining information and learning patterns within sequences. These networks are composed of LSTM blocks, which have "gates" that determine the flow of information (see image on the right).
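If you want the math, that gating boils down to a handful of equations at each step t. These are the standard LSTM equations (nothing specific to my model); σ is the sigmoid function and ⊙ is element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

The forget, input, and output gates decide what to drop from, add to, and read out of the cell state, which is how the network can hold onto information across a long sequence.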
Using data from the publicly available UniProbe dataset (more details here), I implemented an LSTM using Python's Keras library. The dataset consisted of about 20,000 DNA sequences that bind to a protein of interest and about 20,000 sequences that don't, each 60 bases long. The LSTM model I created included an embedding layer (to transform the discrete symbols into a continuous vector space), followed by two hidden LSTM layers with 20% dropout to prevent overfitting. I trained on batches of 64 observations for 10 epochs, using the Adam optimizer and a tanh activation function. It should be said that this model was trained on a tiny dataset, but to satisfy my curiosity I overlooked this. Bottom line: I was able to obtain almost 80% accuracy on a test set.
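For anyone curious what that looks like in code, here's a minimal Keras sketch of the setup. The layer sizes, embedding dimension, and the random stand-in data are my own assumptions rather than the original script; only the pieces called out above (embedding layer, two LSTM layers with 20% dropout, tanh activations, Adam, batch size 64, 10 epochs) come from the description:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 4   # the four DNA bases
SEQ_LEN = 60     # each sequence is 60 bases long

model = Sequential()
# Embedding layer: maps each discrete base index into a continuous vector space
model.add(Embedding(input_dim=VOCAB_SIZE, output_dim=8, input_length=SEQ_LEN))
# Two stacked LSTM layers with 20% dropout to help prevent overfitting
model.add(LSTM(32, activation="tanh", dropout=0.2, return_sequences=True))
model.add(LSTM(32, activation="tanh", dropout=0.2))
# Binary output: does the sequence bind the protein of interest or not?
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random toy data so the snippet runs end to end; the real inputs were ~40,000
# integer-encoded UniProbe sequences with binary bind / no-bind labels.
X = np.random.randint(0, VOCAB_SIZE, size=(1000, SEQ_LEN))
y = np.random.randint(0, 2, size=(1000,))
model.fit(X, y, batch_size=64, epochs=10, validation_split=0.1)
```

Calling model.evaluate on a held-out test set then reports loss and accuracy, which is where a number like the ~80% above would come from.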
1. If you're interested, this diagram shows all of the known cellular biochemical processes in various species. I really don't get people who don't find science interesting...
2. 2022 update: Go easy on me, I wrote this way back when LSTMs were big and the original paper on the Transformer model was 6 months old 😉!