POSA-GO: Fusion of Hierarchical Gene Ontology and Protein Language Models for Protein Function Prediction
Abstract
1. Introduction
2. Results
2.1. Experimental Setup
2.2. POSA-GO Outperforms Competing State-of-the-Art Methods
2.3. The Influence of the Number of Attention Heads on Prediction Performance
2.4. Model Ablation Study
3. Discussion
4. Materials and Methods
4.1. Overview
4.2. Dataset
4.3. The Architecture of POSA-GO
4.3.1. Pretrained Protein Language Model
4.3.2. Multilayer Perceptron
4.3.3. Multi-Head Attention Layer
4.3.4. GO Term Embedding with PO2Vec
4.3.5. Protein-Term Link Prediction
4.4. Evaluation Metrics
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References

Number of Attention Heads | Fmax (MF) | Fmax (BP) | Fmax (CC) | Smin (MF) | Smin (BP) | Smin (CC) | AUPR (MF) | AUPR (BP) | AUPR (CC) |
---|---|---|---|---|---|---|---|---|---|
h = 1 | 0.582 | 0.48 | 0.645 | 8.138 | 25.695 | 10.063 | 0.596 | 0.444 | 0.673 |
h = 2 | 0.583 | 0.483 | 0.645 | 8.172 | 25.812 | 10.099 | 0.598 | 0.443 | 0.675 |
h = 4 | 0.581 | 0.482 | 0.645 | 8.254 | 25.858 | 10.096 | 0.594 | 0.442 | 0.672 |
h = 8 | 0.580 | 0.481 | 0.644 | 8.26 | 25.818 | 10.08 | 0.595 | 0.445 | 0.673 |

Method | Fmax (MF) | Fmax (BP) | Fmax (CC) | Smin (MF) | Smin (BP) | Smin (CC) | AUPR (MF) | AUPR (BP) | AUPR (CC) |
---|---|---|---|---|---|---|---|---|---|
POSA-GO w/o attention | 0.577 | 0.478 | 0.644 | 8.324 | 26.135 | 10.116 | 0.592 | 0.435 | 0.675 |
POSA-GO w/o PO2Vec | 0.571 | 0.469 | 0.641 | 8.381 | 26.142 | 10.155 | 0.59 | 0.432 | 0.678 |
POSA-GO | 0.589 | 0.481 | 0.65 | 8.129 | 26.312 | 10.029 | 0.611 | 0.442 | 0.683 |
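
The ablation rows are easier to compare as signed differences against the full model. The sketch below is illustrative only: the numbers are copied from the ablation table, while the names `ABLATION`, `LOWER_IS_BETTER`, and `gain_over` are hypothetical helpers introduced here and are not part of the POSA-GO code. It assumes the usual conventions that higher Fmax and AUPR are better and lower Smin is better, so a positive value means the full model wins on that entry.

```python
# Values copied from the ablation table; tuples are ordered (MF, BP, CC).
ABLATION = {
    "POSA-GO w/o attention": {
        "Fmax": (0.577, 0.478, 0.644),
        "Smin": (8.324, 26.135, 10.116),
        "AUPR": (0.592, 0.435, 0.675),
    },
    "POSA-GO w/o PO2Vec": {
        "Fmax": (0.571, 0.469, 0.641),
        "Smin": (8.381, 26.142, 10.155),
        "AUPR": (0.59, 0.432, 0.678),
    },
    "POSA-GO": {
        "Fmax": (0.589, 0.481, 0.65),
        "Smin": (8.129, 26.312, 10.029),
        "AUPR": (0.611, 0.442, 0.683),
    },
}
ONTOLOGIES = ("MF", "BP", "CC")
LOWER_IS_BETTER = {"Smin"}  # assumption: Smin is a distance, so smaller is better


def gain_over(variant, full="POSA-GO"):
    """Return {metric: {ontology: gain}}; positive means the full model is better."""
    gains = {}
    for metric, full_vals in ABLATION[full].items():
        # Flip the sign for metrics where a smaller value is the better one.
        sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
        gains[metric] = {
            ont: round(sign * (f - v), 3)
            for ont, f, v in zip(ONTOLOGIES, full_vals, ABLATION[variant][metric])
        }
    return gains


if __name__ == "__main__":
    for variant in ("POSA-GO w/o attention", "POSA-GO w/o PO2Vec"):
        print(variant, gain_over(variant))
```

Run on the tabulated values, every entry comes out positive except Smin on BP, where the full model's 26.312 is higher than both ablated variants (26.135 and 26.142).
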

Dataset | Statistics | BP | MF | CC |
---|---|---|---|---|
CAFA3 | Training Set | 50,813 | 35,086 | 49,328 |
CAFA3 | Testing Set | 2133 | 1088 | 1094 |
CAFA3 | Number of annotations | 19,901 | 6367 | 2470 |
SwissProt | Training Set | 49,003 | 36,403 | 47,177 |
SwissProt | Testing Set | 5402 | 4038 | 5165 |
SwissProt | Number of annotations | 19,832 | 6785 | 2760 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liu, Y.; Wang, B.; Yan, B.; Jiang, H.; Dai, Y. POSA-GO: Fusion of Hierarchical Gene Ontology and Protein Language Models for Protein Function Prediction. Int. J. Mol. Sci. 2025, 26, 6362. https://doi.org/10.3390/ijms26136362