Prot-GO: A Parallel Transformer Encoder-Based Fusion Model for Accurately Predicting Gene Ontology (GO) Terms from Full-Scale Protein Sequences
Abstract
1. Introduction
2. Background
2.1. Related Research
2.2. Contribution and Novelty
- Achieves state-of-the-art accuracy in the GO annotation of full-length sequences.
- Uses a single, lightweight model rather than ensembles, reducing computational demands.
- Requires considerably less training time, GPU memory, and other computational resources than comparable methods.
- Employs automated hyperparameter optimization without a development set.
- Demonstrates a superior grasp of protein sequence structure, evidenced by the minimal accuracy gap between the random and clustered dataset splits.
- Shows negligible dependence on input sequence length, making it robust and deployable in applications with highly variable sequence lengths.
- Can a lightweight, single-model transformer outperform ensemble-based approaches in GO annotation?
- How robust is ProtGO to distributional shifts, such as clustered vs. random splits?
- Does ProtGO sustain accuracy across varying protein sequence lengths?
3. Methodology
3.1. GO Terms
- Molecular Function: Activities performed at the molecular level by gene products, such as “catalysis” or “transport”. Molecular function terms denote activities rather than the entities (molecules or complexes) that carry them out, and they do not specify the location, timing, or context of the action. Most molecular functions are performed by individual gene products (proteins or RNA), though some are carried out by molecular complexes composed of multiple gene products. Broad functional terms include catalytic activity and transporter activity; narrower terms include adenylate cyclase activity and Toll-like receptor binding.
- Cellular Component: The location a gene product or macromolecular complex occupies relative to cellular compartments and structures. The Gene Ontology specifies locations in two ways: by the cellular anatomical entity in which a gene product performs a molecular function, such as the plasma membrane, the cytoskeleton, or membrane-bound compartments like the mitochondrion; or by the stable macromolecular complex to which it belongs, for instance, the clathrin complex.
- Biological Process: The larger processes, or ‘biological programs’, accomplished through multiple molecular activities. Broad biological process terms include DNA repair and signal transduction; narrower terms include pyrimidine nucleobase biosynthetic process and glucose transmembrane transport.
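The three aspects above can be sketched as a simple lookup structure. The snippet below is purely illustrative (the names and grouping are ours, not the paper's code); it collects the example terms mentioned in the text under their respective GO aspects.

```python
# Illustrative only: the three GO aspects with the example terms
# mentioned above (not the paper's actual label space).
GO_ASPECTS = {
    "Molecular Function": [
        "catalytic activity", "transporter activity",
        "adenylate cyclase activity", "Toll-like receptor binding",
    ],
    "Cellular Component": [
        "plasma membrane", "cytoskeleton",
        "mitochondrion", "clathrin complex",
    ],
    "Biological Process": [
        "DNA repair", "signal transduction",
        "pyrimidine nucleobase biosynthetic process",
        "glucose transmembrane transport",
    ],
}

def aspect_of(term):
    """Return the GO aspect that lists the given example term, or None."""
    for aspect, terms in GO_ASPECTS.items():
        if term in terms:
            return aspect
    return None
```

For example, `aspect_of("DNA repair")` returns `"Biological Process"`.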
3.2. Dataset
3.3. Model Architecture
3.4. Training and Evaluation
3.5. Performance Metrics
- Accuracy = Correct predictions / Total samples
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 score = 2 × (Precision × Recall) / (Precision + Recall)
- AUC (Area Under the Curve): The area under the ROC curve, measuring overall separability between classes.
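These metrics follow directly from the confusion-matrix counts. The short sketch below uses our own helper functions (not the paper's evaluation code) and checks that the precision and recall reported for ProtGO on Biological Processes reproduce the reported F1 score.

```python
def precision(tp, fp):
    # Fraction of predicted positives that are true positives.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were recovered.
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Consistency check against the results table: ProtGO on Biological
# Processes reports precision 0.9725 and recall 0.8821, which should
# yield the reported F1 score of 0.9251.
print(round(f1_score(0.9725, 0.8821), 4))  # 0.9251
```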
4. Results
4.1. ROC Curves
4.2. Sequence Length Analysis
5. Conclusions
5.1. Implications
5.2. Limitations
5.3. Future Directions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Split | Proteins | Biological Processes | Molecular Functions | Cellular Components
---|---|---|---|---
Train | 438,522 | 346,677 | 369,909 | 321,980
Development | 55,453 | 43,734 | 46,785 | 40,765
Test | 54,289 | 42,830 | 45,662 | 39,929

Split | Proteins | Biological Processes | Molecular Functions | Cellular Components
---|---|---|---|---
Train | 182,965 | 144,292 | 154,150 | 134,200
Development | 180,309 | 143,130 | 152,156 | 131,778
Test | 183,475 | 144,605 | 155,593 | 135,345
Aspect | Model | Accuracy | F1 Score | Precision | Recall
---|---|---|---|---|---
Biological Processes | ProtGO | 86.06% | 0.9251 | 0.9725 | 0.8821
Biological Processes | ProteInfer | 80.21% | 0.8902 | 0.8447 | 0.9409
Biological Processes | ProteInfer_EN | 83.29% | 0.9088 | 0.8652 | 0.9570
Molecular Function | ProtGO | 94.60% | 0.9722 | 0.9882 | 0.9568
Molecular Function | ProteInfer | 88.94% | 0.9415 | 0.9166 | 0.9677
Molecular Function | ProteInfer_EN | 91.67% | 0.9565 | 0.9344 | 0.9797
Cellular Component | ProtGO | 78.30% | 0.8783 | 0.9469 | 0.8189
Cellular Component | ProteInfer | 69.36% | 0.8191 | 0.7386 | 0.9191
Cellular Component | ProteInfer_EN | 75.40% | 0.8597 | 0.7928 | 0.9390
Aspect | Model | Accuracy | F1 Score | Precision | Recall
---|---|---|---|---|---
Biological Processes | ProtGO | 82.16% | 0.9021 | 0.9532 | 0.8561
Biological Processes | ProteInfer | 72.48% | 0.8424 | 0.8940 | 0.7965
Biological Processes | ProteInfer_EN | 75.56% | 0.8650 | 0.8618 | 0.8683
Molecular Function | ProtGO | 91.51% | 0.9556 | 0.9785 | 0.9338
Molecular Function | ProteInfer | 81.51% | 0.8992 | 0.9241 | 0.8756
Molecular Function | ProteInfer_EN | 83.88% | 0.9159 | 0.9012 | 0.9312
Cellular Component | ProtGO | 73.28% | 0.8458 | 0.9215 | 0.7817
Cellular Component | ProteInfer | 63.06% | 0.7782 | 0.8014 | 0.7563
Cellular Component | ProteInfer_EN | 66.22% | 0.8047 | 0.7832 | 0.8277
Share and Cite
Tamir, A.; Yuan, J.-S. Prot-GO: A Parallel Transformer Encoder-Based Fusion Model for Accurately Predicting Gene Ontology (GO) Terms from Full-Scale Protein Sequences. Electronics 2025, 14, 3944. https://doi.org/10.3390/electronics14193944