# Improving the Quantification of DNA Sequences Using Evolutionary Information Based on Deep Learning


## Abstract


## 1. Introduction

## 2. Materials and the Proposed Models

#### 2.1. Materials

#### 2.2. The Proposed Model

Given an input sequence $\{\mathbf{x}_{t}\}_{t=1}^{T}$, the LSTM has cell states $\{\mathbf{C}_{t}\}_{t=1}^{T}$ and hidden states $\{\mathbf{h}_{t}\}_{t=1}^{T}$, and outputs a sequence $\{\mathbf{o}_{t}\}_{t=1}^{T}$. This can be expressed mathematically by Equation (3), where ${\mathbf{W}}_{i}$, ${\mathbf{W}}_{o}$, ${\mathbf{W}}_{f}$, ${\mathbf{U}}_{i}$, ${\mathbf{U}}_{o}$, and ${\mathbf{U}}_{f}$ are the weight matrices and ${\mathbf{b}}_{o}$, ${\mathbf{b}}_{c}$, ${\mathbf{b}}_{i}$, and ${\mathbf{b}}_{f}$ are the biases. Sigmoid and tanh are the activation functions.
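
Equation (3) itself is not reproduced in this text. As a hedged reconstruction, the standard LSTM update written with the symbols defined above would read as follows. The candidate-state weights $\mathbf{W}_{c}$ and $\mathbf{U}_{c}$ are assumptions (only the bias $\mathbf{b}_{c}$ is listed in the text), $\mathbf{o}_{t}$ is taken to be the output-gate activation, $\sigma$ denotes the sigmoid, and $\odot$ denotes element-wise multiplication; the authors' exact form may differ.

```latex
\begin{aligned}
\mathbf{i}_{t} &= \sigma\left(\mathbf{W}_{i}\mathbf{x}_{t} + \mathbf{U}_{i}\mathbf{h}_{t-1} + \mathbf{b}_{i}\right) \\
\mathbf{f}_{t} &= \sigma\left(\mathbf{W}_{f}\mathbf{x}_{t} + \mathbf{U}_{f}\mathbf{h}_{t-1} + \mathbf{b}_{f}\right) \\
\mathbf{o}_{t} &= \sigma\left(\mathbf{W}_{o}\mathbf{x}_{t} + \mathbf{U}_{o}\mathbf{h}_{t-1} + \mathbf{b}_{o}\right) \\
\tilde{\mathbf{C}}_{t} &= \tanh\left(\mathbf{W}_{c}\mathbf{x}_{t} + \mathbf{U}_{c}\mathbf{h}_{t-1} + \mathbf{b}_{c}\right) \\
\mathbf{C}_{t} &= \mathbf{f}_{t} \odot \mathbf{C}_{t-1} + \mathbf{i}_{t} \odot \tilde{\mathbf{C}}_{t} \\
\mathbf{h}_{t} &= \mathbf{o}_{t} \odot \tanh\left(\mathbf{C}_{t}\right)
\end{aligned}
```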

#### 2.3. Functional SNP Prioritization

## 3. Results and Discussion

#### 3.1. The Performance of the DQDNN Model

#### 3.2. The Performance of the Functional SNP Prioritization Model

## 4. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest


**Figure 2.** The detailed architecture of the forward/reverse complement network (**a**). The configurations of the Conv block (**b**).

**Figure 6.** Scatter plot comparing the ROC-AUC scores of the proposed model DQDNN-DNA and (**a**) DanQ and (**b**) DeepSEA models.

**Figure 7.** Scatter plot comparing the ROC-AUC scores of the proposed model DQDNN-CONS and (**a**) DanQ and (**b**) DeepSEA models.

**Figure 9.** Scatter plot comparing the PR-AUC scores of the proposed model DQDNN-DNA and (**a**) DanQ and (**b**) DeepSEA.

**Figure 10.** Scatter plot comparing the PR-AUC scores of the proposed model DQDNN-CONS and (**a**) DanQ and (**b**) DeepSEA.

**Figure 11.** Examples of the PR-AUC comparison of the proposed model DQDNN with DanQ and DeepSEA for (**a**) H1-hESC SIX5 and (**b**) GM12878 EBF1.

**Figure 12.** Comparison of the proposed model and the DanQ and DeepSEA models for prioritizing functionally annotated Genome-Wide Repository of Associations between SNPs and Phenotypes (GRASP) expression quantitative trait loci (eQTL) SNPs against 1000 Genomes Project noncoding SNPs across several negative SNP groups of varying distances to the positive SNPs.

Layer | Output Shape |
---|---|
Input | (1000,5) |
Conv1D(256,7,1) | (1000,256) |
Conv1D(256,13,1) | (1000,256) |
Conv1D(256,26,1) | (1000,256) |
Concatenate | (1000,768) |
Max_pooling_1D(7,7) | (142,768) |
Dropout(0.4) | (142,768) |
BiLSTM(256) | (142,512) |
BiLSTM(256) | (142,512) |
Max_pooling_1D(13,13) | (10,512) |
Dropout(0.5) | (10,512) |
Flatten() | 5120 |
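
The output shapes in the table above can be checked with simple arithmetic. The sketch below is only a shape trace, not the authors' implementation. It assumes that the three Conv1D layers run in parallel on the same input with stride 1 and length-preserving ("same") padding, which is what the concatenated width of 768 suggests, and that max pooling uses no padding. The helper names `conv1d_same` and `max_pool_valid` are hypothetical.

```python
def conv1d_same(length, filters):
    """Stride-1 convolution with 'same' padding keeps the sequence length."""
    return (length, filters)


def max_pool_valid(length, pool, stride):
    """Unpadded max pooling: floor((length - pool) / stride) + 1 positions."""
    return (length - pool) // stride + 1


seq_len = 1000  # input length from the table; 5 channels per position

# Three parallel convolution branches with kernel sizes 7, 13, and 26.
branches = [conv1d_same(seq_len, 256) for _kernel in (7, 13, 26)]
concat = (seq_len, sum(filters for _, filters in branches))

# Max_pooling_1D(7,7); dropout does not change the shape.
pooled1 = (max_pool_valid(concat[0], 7, 7), concat[1])

# A bidirectional LSTM with 256 units per direction yields 2 * 256 features.
bilstm = (pooled1[0], 2 * 256)

# Max_pooling_1D(13,13), then flatten.
pooled2 = (max_pool_valid(bilstm[0], 13, 13), bilstm[1])
flat = pooled2[0] * pooled2[1]

print(concat, pooled1, bilstm, pooled2, flat)
```

Under these assumptions the trace reproduces every shape listed in the table, including the flattened size of 5120.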

Layer | Output Shape |
---|---|
Input | 5120 |
Dense(512) | 512 |
ReLU | 512 |
Dense(919) | 919 |
Sigmoid | 919 |

Layer | Output Shape |
---|---|
Input | 3676 |
Dropout(0.3) | 3676 |
Dense(256) | 256 |
ReLU | 256 |
Dropout(0.5) | 256 |
Dense(1) | 1 |
Sigmoid | 1 |

**Table 4.** Performance comparison between using raw DNA sequences only and integrating conservation scores (CONS) with the raw DNA sequences. PR, precision-recall.

Feature group | ROC-AUC (DQDNN-DNA) | ROC-AUC (DQDNN-CONS) | PR-AUC (DQDNN-DNA) | PR-AUC (DQDNN-CONS) |
---|---|---|---|---|
DNase I | 0.9190 | 0.9223 | 0.4779 | 0.4986 |
TF | 0.9580 | 0.9612 | 0.3740 | 0.3905 |
Histone marks | 0.8619 | 0.8827 | 0.3896 | 0.4297 |
ALL | 0.9428 | 0.9480 | 0.3905 | 0.4102 |
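
To make the effect of the conservation scores in Table 4 easier to read, the absolute gain of DQDNN-CONS over DQDNN-DNA can be computed per feature group. The following is a small illustrative sketch; the names `table4` and `absolute_gains` are hypothetical, and the numbers are copied from the table.

```python
# (DQDNN-DNA, DQDNN-CONS) score pairs copied from Table 4.
table4 = {
    "DNase I": {"roc_auc": (0.9190, 0.9223), "pr_auc": (0.4779, 0.4986)},
    "TF": {"roc_auc": (0.9580, 0.9612), "pr_auc": (0.3740, 0.3905)},
    "Histone marks": {"roc_auc": (0.8619, 0.8827), "pr_auc": (0.3896, 0.4297)},
    "ALL": {"roc_auc": (0.9428, 0.9480), "pr_auc": (0.3905, 0.4102)},
}


def absolute_gains(table):
    """Return CONS minus DNA for each feature group and metric."""
    return {
        group: {
            metric: round(cons - dna, 4)
            for metric, (dna, cons) in metrics.items()
        }
        for group, metrics in table.items()
    }


gains = absolute_gains(table4)
for group, gain in gains.items():
    print(f"{group}: ROC-AUC +{gain['roc_auc']:.4f}, PR-AUC +{gain['pr_auc']:.4f}")
```

By this measure the histone marks show the largest gain on both metrics (0.0208 in ROC-AUC and 0.0401 in PR-AUC).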

**Table 5.** Performance comparison in terms of the average ROC-AUC between the proposed model and the DanQ and DeepSEA models.

Feature group | DeepSEA | DanQ | DQDNN-DNA | DQDNN-CONS |
---|---|---|---|---|
DNase I | 0.9082 | 0.9173 | 0.9190 | 0.9223 |
TF | 0.9478 | 0.9568 | 0.9580 | 0.9612 |
Histone marks | 0.8522 | 0.8621 | 0.8619 | 0.8827 |
ALL | 0.9325 | 0.9417 | 0.9428 | 0.9480 |

**Table 6.** Performance comparison in terms of the average PR-AUC between the proposed model and the DanQ and DeepSEA models.

Feature group | DeepSEA | DanQ | DQDNN-DNA | DQDNN-CONS |
---|---|---|---|---|
DNase I | 0.4407 | 0.4714 | 0.4779 | 0.4986 |
TF | 0.3203 | 0.3606 | 0.3740 | 0.3905 |
Histone marks | 0.3676 | 0.3882 | 0.3896 | 0.4297 |
ALL | 0.3425 | 0.3794 | 0.3905 | 0.4102 |

Fold / Negative SNP group | 31,000 bp | 6300 bp | 710 bp | 360 bp |
---|---|---|---|---|
Fold 0 | 0.7048 | 0.7154 | 0.6981 | 0.6752 |
Fold 1 | 0.6763 | 0.6799 | 0.6877 | 0.6605 |
Fold 2 | 0.6948 | 0.7072 | 0.7002 | 0.6580 |
Fold 3 | 0.7032 | 0.7198 | 0.7083 | 0.6737 |
Fold 4 | 0.7105 | 0.7049 | 0.6900 | 0.6625 |
Fold 5 | 0.7221 | 0.7111 | 0.6985 | 0.6756 |
Fold 6 | 0.6772 | 0.6922 | 0.6623 | 0.6490 |
Fold 7 | 0.6611 | 0.6745 | 0.6657 | 0.6308 |
Fold 8 | 0.6888 | 0.6927 | 0.6727 | 0.6457 |
Fold 9 | 0.6840 | 0.6933 | 0.6778 | 0.6642 |
Average | 0.6923 | 0.6991 | 0.6861 | 0.6595 |
STD Error | 0.0184 | 0.0150 | 0.0158 | 0.0144 |
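
The Average and STD Error rows of the fold table above can be recomputed from the ten per-fold values. The sketch below uses Python's `statistics` module. The dictionary name `fold_scores` is illustrative, and the use of the sample standard deviation (n − 1 denominator) is an assumption; it happens to reproduce the tabulated STD Error values to four decimals.

```python
from statistics import mean, stdev

# Per-fold scores copied from the table, keyed by negative SNP group (bp).
fold_scores = {
    "31,000": [0.7048, 0.6763, 0.6948, 0.7032, 0.7105,
               0.7221, 0.6772, 0.6611, 0.6888, 0.6840],
    "6300": [0.7154, 0.6799, 0.7072, 0.7198, 0.7049,
             0.7111, 0.6922, 0.6745, 0.6927, 0.6933],
    "710": [0.6981, 0.6877, 0.7002, 0.7083, 0.6900,
            0.6985, 0.6623, 0.6657, 0.6727, 0.6778],
    "360": [0.6752, 0.6605, 0.6580, 0.6737, 0.6625,
            0.6756, 0.6490, 0.6308, 0.6457, 0.6642],
}

# Mean and sample standard deviation (n - 1 denominator) per group.
summary = {
    group: (mean(scores), stdev(scores))
    for group, scores in fold_scores.items()
}

for group, (avg, sd) in summary.items():
    print(f"{group} bp: average = {avg:.4f}, sample std = {sd:.4f}")
```

Because the values match the sample standard deviation rather than that quantity divided by the square root of the number of folds, the row labelled STD Error is best read as a standard deviation across folds.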

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tayara, H.; Chong, K.T.
Improving the Quantification of DNA Sequences Using Evolutionary Information Based on Deep Learning. *Cells* **2019**, *8*, 1635.
https://doi.org/10.3390/cells8121635
