SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model
Abstract
:1. Introduction
- A DNA language model is utilized to learn the deep semantics of DNA sequences and introduces the semantic features in the chromatin accessibility prediction process; therefore, we are able to obtain additional complex environmental information.
- Both the DNA language model and the chromatin accessibility model use character-based inputs instead of k-mer which stands for segments of length . The strategy prevents the information of original sequences from being destroyed.
- The attention mechanism is widely used in our models in place of CNNs and RNNs, making the model more powerful and stable in handling long sequences.
2. Theories
3. Models
3.1. DNA Language Model
3.1.1. Process of Data
3.1.2. Model Structure
3.2. Chromatin Accessibility Model
3.2.1. Process of Data
3.2.2. Model Structure
4. Results and Discussions
4.1. Semantic DNA Evaluation
4.2. Analysis of Models
4.2.1. Effectiveness of Our DNA Language Model
4.2.2. Effectiveness of Our Chromatin Accessibility Model
4.3. Analysis of [CLS]
4.4. Analysis of SFA and SFC
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Aleksić, M.; Kapetanović, V. An Overview of the Optical and Electrochemical Methods for Detection of DNA-Drug Interactions. Acta Chim. Slov. 2014, 61, 555–573. [Google Scholar] [PubMed]
- Wang, Y.; Jiang, R.; Wong, W.H. Modeling the Causal Regulatory Network by Integrating Chromatin Accessibility and Transcriptome Data. Natl. Sci. Rev. 2016, 3, 240–251. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Gallon, J.; Loomis, E.; Curry, E.; Martin, N.; Brody, L.; Garner, I.; Brown, R.; Flanagan, J.M. Chromatin Accessibility Changes at Intergenic Regions Are Associated with Ovarian Cancer Drug Resistance. Clin. Epigenet. 2021, 13, 122. [Google Scholar] [CrossRef] [PubMed]
- Janssen, S.; Cuvier, O.; Müller, M.; Laemmli, U.K. Specific Gain-and Loss-of-Function Phenotypes Induced by Satellite-Specific DNA-Binding Drugs Fed to Drosophila Melanogaster. Mol. Cell 2000, 6, 1013–1024. [Google Scholar] [CrossRef]
- Song, L.; Crawford, G.E. DNase-Seq: A High-Resolution Technique for Mapping Active Gene Regulatory Elements Across the Genome from Mammalian Cells. Cold Spring Harb. Protoc. 2010, 2010, pdb-prot5384. [Google Scholar] [CrossRef] [Green Version]
- Simon, J.M.; Giresi, P.G.; Davis, I.J.; Lieb, J.D. Using Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) to Isolate Active Regulatory DNA. Nat. Protoc. 2012, 7, 256–267. [Google Scholar] [CrossRef] [Green Version]
- Buenrostro, J.D.; Wu, B.; Chang, H.Y.; Greenleaf, W.J. ATAC-Seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr. Protoc. Mol. Biol. 2015, 109, 21–29. [Google Scholar] [CrossRef]
- Lee, D.; Karchin, R.; Beer, M.A. Discriminative Prediction of Mammalian Enhancers from DNA Sequence. Genome Res. 2011, 21, 2167–2180. [Google Scholar] [CrossRef] [Green Version]
- Ghandi, M.; Lee, D.; Mohammad-Noori, M.; Beer, M.A. Enhanced Regulatory Sequence Prediction Using Gapped k-Mer Features. PLoS Comput. Biol. 2014, 10, e1003711. [Google Scholar] [CrossRef] [Green Version]
- Beer, M.A. Predicting Enhancer Activity and Variant Impact Using Gkm-SVM. Hum. Mutat. 2017, 38, 1251–1258. [Google Scholar] [CrossRef] [Green Version]
- Xu, Y.; Strick, A.J. Integration of Unpaired Single-Cell Chromatin Accessibility and Gene Expression Data via Adversarial Learning. arXiv 2021, arXiv:2104.12320. [Google Scholar]
- Kumar, S.; Bucher, P. Predicting Transcription Factor Site Occupancy Using DNA Sequence Intrinsic and Cell-Type Specific Chromatin Features. BMC Bioinform. 2016, 17, S4. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the Sequence Specificities of DNA-and RNA-Binding Proteins by Deep Learning. Nat. Biotechnol. 2015, 33, 831–838. [Google Scholar] [CrossRef]
- Zhou, J.; Troyanskaya, O.G. Predicting Effects of Noncoding Variants with Deep Learning–Based Sequence Model. Nat. Methods 2015, 12, 931–934. [Google Scholar] [CrossRef] [Green Version]
- Min, X.; Zeng, W.; Chen, N.; Chen, T.; Jiang, R. Chromatin Accessibility Prediction via Convolutional Long Short-Term Memory Networks with k-Mer Embedding. Bioinformatics 2017, 33, i92–i101. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Liu, Q.; Xia, F.; Yin, Q.; Jiang, R. Chromatin Accessibility Prediction via a Hybrid Deep Convolutional Neural Network. Bioinformatics 2018, 34, 732–738. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Guo, Y.; Zhou, D.; Nie, R.; Ruan, X.; Li, W. DeepANF: A Deep Attentive Neural Framework with Distributed Representation for Chromatin Accessibility Prediction. Neurocomputing 2020, 379, 305–318. [Google Scholar] [CrossRef]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A Convolutional Neural Network for Modelling Sentences. arXiv 2014, arXiv:1404.2188. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing—EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Sun, C.; Yang, Z.; Luo, L.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. A Deep Learning Approach with Deep Contextualized Word Representations for Chemical–Protein Interaction Extraction from Biomedical Literature. IEEE Access 2019, 7, 151034–151046. [Google Scholar] [CrossRef]
- Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Giné, E. The lévy-Lindeberg Central Limit Theorem. Proc. Am. Math. Soc. 1983, 88, 147–153. [Google Scholar] [CrossRef] [Green Version]
- Horn, R.A. The Hadamard Product. In Proceedings of the Symposia in Applied Mathematics, Phoenix, AZ, USA, 10–11 January 1989; Volume 40, pp. 87–169. [Google Scholar]
- Liu, F.; Perez, J. Gated End-to-End Memory Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Long Papers. Volume 1, pp. 1–10. [Google Scholar]
- Baldi, P.; Sadowski, P.J. Understanding Dropout. Adv. Neural Inf. Process. Syst. 2013, 26, 2814–2822. [Google Scholar]
- Pan, B.; Kusko, R.; Xiao, W.; Zheng, Y.; Liu, Z.; Xiao, C.; Sakkiah, S.; Guo, W.; Gong, P.; Zhang, C.; et al. Similarities and Differences Between Variants Called with Human Reference Genome Hg19 or Hg38. BMC Bioinform. 2019, 20, 17–29. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- De Ruijter, A.; Guldenmund, F. The bowtie method: A review. Saf. Sci. 2016, 88, 211–218. [Google Scholar] [CrossRef]
- John, S.; Sabo, P.J.; Thurman, R.E.; Sung, M.-H.; Biddie, S.C.; Johnson, T.A.; Hager, G.L.; Stamatoyannopoulos, J.A. Chromatin Accessibility Pre-Determines Glucocorticoid Receptor Binding Patterns. Nat. Genet. 2011, 43, 264–268. [Google Scholar] [CrossRef]
- Klenova, E.M.; Nicolas, R.H.; Paterson, H.F.; Carne, A.F.; Heath, C.M.; Goodwin, G.H.; Neiman, P.E.; Lobanenkov, V.V. CTCF, a conserved nuclear factor required for optimal transcriptional activity of the chicken c-myc gene, is an 11-Zn-finger protein differentially expressed in multiple forms. Mol. Cell. Biol. 1993, 13, 7612–7624. [Google Scholar]
- Colclough, K.; Bellanne-Chantelot, C.; Saint-Martin, C.; Flanagan, S.E.; Ellard, S. Mutations in the genes encoding the transcription factors hepatocyte nuclear factor 1 alpha and 4 alpha in maturity-onset diabetes of the young and hyperinsulinemic hypoglycemia. Hum. Mutat. 2013, 34, 669–685. [Google Scholar] [CrossRef]
- Dietterich, T.G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 2002; Volume 2, pp. 110–125. [Google Scholar]
- Chawla, N.V.; Sylvester, J. Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets. In Proceedings of the International Workshop on Multiple Classifier Systems, Prague, Czech Republic, 23–25 May 2007; Springer: Berlin, Germany, 2007; pp. 397–406. [Google Scholar]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
Cell Type | Code | Size | l_min | l_max | l_mean | l_med | l_std |
---|---|---|---|---|---|---|---|
GM12878 | ENCSR0000EMT | 244,692 | 36 | 11,481 | 610 | 381 | 614 |
K562 | ENCSR0000EPC | 418,624 | 36 | 13,307 | 675 | 423 | 671 |
MCF-7 | ENCSR0000EPH | 503,816 | 36 | 12,041 | 471 | 361 | 391 |
HeLa-S3 | ENCSR0000ENO | 264,264 | 36 | 11,557 | 615 | 420 | 524 |
H1-hESC | ENCSR0000EMU | 266,868 | 36 | 7795 | 430 | 320 | 347 |
HepG2 | ENCSR0000ENP | 283,148 | 36 | 14,425 | 652 | 406 | 626 |
System | MT | PC | PH | NO | MU | NP | Average |
---|---|---|---|---|---|---|---|
(a) auROC | |||||||
Gkm-SVM | 0.8528 | 0.8203 | 0.8967 | 0.8648 | 0.8983 | 0.8359 | 0.8697 |
DeepSEA | 0.8788 | 0.8629 | 0.9200 | 0.8903 | 0.8827 | 0.8609 | 0.8782 |
k-mer | 0.8830 | 0.8809 | 0.9212 | 0.9016 | 0.9097 | 0.8722 | 0.8975 |
no feature 1 | 0.8727 | 0.8664 | 0.9058 | 0.8840 | 0.8849 | 0.8699 | 0.8806 |
SemanticCAP | 0.8907 | 0.8883 | 0.9241 | 0.9001 | 0.8982 | 0.8847 | 0.8977 |
(b) auPRC | |||||||
Gkm-SVM | 0.8442 | 0.8081 | 0.8860 | 0.8627 | 0.8823 | 0.8123 | 0.8504 |
DeepSEA | 0.8758 | 0.8551 | 0.9146 | 0.8888 | 0.8705 | 0.8508 | 0.8801 |
k-mer | 0.8774 | 0.8732 | 0.9156 | 0.8992 | 0.8968 | 0.8630 | 0.8973 |
no feature 1 | 0.8745 | 0.8663 | 0.9053 | 0.8852 | 0.8878 | 0.8730 | 0.8820 |
SemanticCAP | 0.8914 | 0.8896 | 0.9218 | 0.9004 | 0.8993 | 0.8871 | 0.8983 |
Model | Loss (No Mask) | Accuracy (No Mask) | Accuracy (Mask 30%) |
---|---|---|---|
convs (max) + lstms | 1.152 | 0.4538 | 0.3265 |
convs (max) + attention (ReLU) | 1.113 | 0.4814 | 0.3687 |
convs (avg) + trans 1 | 1.097 | 0.4926 | 0.3599 |
mconv 2 (PC 3) + trans | 0.968 | 0.5114 | 0.4572 |
mconv (PA 3) + trans | 0.931 | 0.5187 | 0.4784 |
mconv (SFA 3) + trans | 0.921 | 0.5202 | 0.4793 |
Model | Parameters (M) | Total (h) | auROC | auPRC | F1 |
---|---|---|---|---|---|
PC + trans 1 + lstm | 4.16 | 4.6 | 0.8595 | 0.8625 | 0.7880 |
PC + trans + conv + lstm | 4.95 | 2.9 | 0.8741 | 0.8765 | 0.8036 |
PC + trans + flatten | 16.4 | 2.0 | 0.8822 | 0.8839 | 0.8124 |
PC + trans + conv + flatten | 6.13 | 2.8 | 0.8817 | 0.8834 | 0.8119 |
PC + trans + linear | 3.84 | 1.5 | 0.8839 | 0.8854 | 0.8144 |
mconv + PC + trans + linear | 5.61 | 2.5 | 0.8881 | 0.8902 | 0.8590 |
mconv + SFC 2 + trans + linear | 5.61 | 2.5 | 0.8907 | 0.8914 | 0.8606 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, Y.; Chu, X.; Jiang, Y.; Wu, H.; Quan, L. SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model. Genes 2022, 13, 568. https://doi.org/10.3390/genes13040568
Zhang Y, Chu X, Jiang Y, Wu H, Quan L. SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model. Genes. 2022; 13(4):568. https://doi.org/10.3390/genes13040568
Chicago/Turabian StyleZhang, Yikang, Xiaomin Chu, Yelu Jiang, Hongjie Wu, and Lijun Quan. 2022. "SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model" Genes 13, no. 4: 568. https://doi.org/10.3390/genes13040568
APA StyleZhang, Y., Chu, X., Jiang, Y., Wu, H., & Quan, L. (2022). SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model. Genes, 13(4), 568. https://doi.org/10.3390/genes13040568