Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features
Abstract
1. Introduction
2. Materials and Methods
2.1. The Benchmark Dataset
2.2. Feature Extraction
2.2.1. BERT
2.2.2. UniRep
2.2.3. SSA
2.2.4. W2V
2.3. The Synthetic Minority Over-Sampling Technique (SMOTE)
2.4. Feature Selection Method
2.5. Classifiers
2.5.1. Logistic Regression (LR)
2.5.2. Naïve Bayes (NB)
2.5.3. Linear Discriminant Analysis (LDA)
2.5.4. K-Nearest Neighbor (KNN)
2.5.5. Random Forest (RF)
2.5.6. Support Vector Machine (SVM)
2.5.7. The Light Gradient Boosting Machine (LGBM)
2.6. Evaluation
3. Results
3.1. Performance of Models Using Various BERT Embedding Features
3.2. Performance of Models Using BERT-Bfd and Other Deep Representation Learning Features
3.3. Performance of Models Using BERT-Bfd with Seven Different Classifiers
3.4. Performance after Feature Selection
3.5. Feature Analysis Using Dimension Reduction
3.6. Comparison with Previous Models
3.7. Comparison with a Different Benchmark Dataset Used by DeepTP
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ahmed, Z.; Zulfiqar, H.; Tang, L.; Lin, H. A Statistical Analysis of the Sequence and Structure of Thermophilic and Non-Thermophilic Proteins. Int. J. Mol. Sci. 2022, 23, 10116.
2. Guo, Z.; Wang, P.; Liu, Z.; Zhao, Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Front. Bioeng. Biotechnol. 2020, 8, 584807.
3. Bhasin, M.; Raghava, G.P.S. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem. 2004, 279, 23262–23266.
4. Gromiha, M.M.; Suresh, M.X. Discrimination of mesophilic and thermophilic proteins using machine learning algorithms. Proteins 2008, 70, 1274–1279.
5. Lin, H.; Chen, W. Prediction of thermophilic proteins using feature selection technique. J. Microbiol. Methods 2011, 84, 67–70.
6. Nakariyakul, S.; Liu, Z.-P.; Chen, L. Detecting thermophilic proteins through selecting amino acid and dipeptide composition features. Amino Acids 2012, 42, 1947–1953.
7. Wang, D.; Yang, L.; Fu, Z.; Xia, J. Prediction of thermophilic protein with pseudo amino acid composition: An approach from combined feature selection and reduction. Protein Pept. Lett. 2011, 18, 684–689.
8. Fan, G.-L.; Liu, Y.-L.; Wang, H. Identification of thermophilic proteins by incorporating evolutionary and acid dissociation information into Chou’s general pseudo amino acid composition. J. Theor. Biol. 2016, 407, 138–142.
9. Feng, C.; Ma, Z.; Yang, D.; Li, X.; Zhang, J.; Li, Y. A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features. Front. Bioeng. Biotechnol. 2020, 8, 285.
10. Ahmed, Z.; Zulfiqar, H.; Khan, A.A.; Gul, I.; Dao, F.-Y.; Zhang, Z.-Y.; Yu, X.L.; Tang, L. iThermo: A Sequence-Based Model for Identifying Thermophilic Proteins Using a Multi-Feature Fusion Strategy. Front. Microbiol. 2022, 13, 790063.
11. Charoenkwan, P.; Schaduangrat, N.; Moni, M.A.; Lio, P.; Manavalan, B.; Shoombuatong, W. SAPPHIRE: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins. Comput. Biol. Med. 2022, 146, 105704.
12. Zhao, J.; Yan, W.; Yang, Y. DeepTP: A Deep Learning Model for Thermophilic Protein Prediction. Int. J. Mol. Sci. 2023, 24, 2217.
13. Dubchak, I.; Muchnik, I.; Mayor, C.; Dralyuk, I.; Kim, S.H. Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins 1999, 35, 401–407.
14. Saravanan, V.; Gautham, N. Harnessing Computational Biology for Exact Linear B-Cell Epitope Prediction: A Novel Amino Acid Composition-Based Feature Descriptor. Omics 2015, 19, 648–658.
15. Li, J.; Zhu, P.; Zou, Q. Prediction of Thermophilic Proteins Using Voting Algorithm. In Proceedings of the International Work-Conference on Bioinformatics and Biomedical Engineering, Granada, Spain, 8–10 May 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 195–203.
16. Zhao, W.; Xu, G.; Yu, Z.; Li, J.; Liu, J. Identification of nut protein-derived peptides against SARS-CoV-2 spike protein and main protease. Comput. Biol. Med. 2021, 138, 104937.
17. Zhou, W.; Xu, C.; Luo, M.; Wang, P.; Xu, Z.; Xue, G.; Jin, X.; Huang, Y.; Li, Y.; Nie, H.; et al. MutCov: A pipeline for evaluating the effect of mutations in spike protein on infectivity and antigenicity of SARS-CoV-2. Comput. Biol. Med. 2022, 145, 105509.
18. Cao, C.; Kossinna, P.; Kwok, D.; Li, Q.; He, J.; Su, L.; Guo, X.; Zhang, Q.; Long, Q. Disentangling genetic feature selection and aggregation in transcriptome-wide association studies. Genetics 2022, 220, 34849857.
19. Cao, C.; Kwok, D.; Edie, S.; Li, Q.; Ding, B.; Kossinna, P.; Campbell, S.; Wu, J.; Greenberg, M.; Long, Q. kTWAS: Integrating kernel machine with transcriptome-wide association studies improves statistical power and reveals novel genes. Brief. Bioinform. 2021, 22, bbaa270.
20. Cao, C.; Wang, J.; Kwok, D.; Cui, F.; Zhang, Z.; Zhao, D.; Li, M.J.; Zou, Q. webTWAS: A resource for disease candidate susceptibility genes identified by transcriptome-wide association study. Nucleic Acids Res. 2022, 50, D1123–D1130.
21. Canzhuang, S.; Yonge, F. Identification of Disordered Regions of Intrinsically Disordered Proteins by Multi-features Fusion. Curr. Bioinform. 2021, 16, 1126–1132.
22. Iraji, M.S.; Tanha, J.; Habibinejad, M. Druggable protein prediction using a multi-canal deep convolutional neural network based on autocovariance method. Comput. Biol. Med. 2022, 151 Pt A, 106276.
23. Jian, G.L.; Chen, H.B. A Path-based Method for Identification of Protein Phenotypic Annotations. Curr. Bioinform. 2021, 16, 1214–1222.
24. Zheng, L.; Huang, S.; Mu, N.; Zhang, H.; Zhang, J.; Chang, Y.; Yang, L.; Zuo, Y. RAACBook: A web server of reduced amino acid alphabet for sequence-dependent inference by using Chou’s five-step rule. Database 2019, 2019, baz131.
25. Qu, K.; Wei, L.; Yu, J.; Wang, C. Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods. Front. Plant Sci. 2018, 9, 1961.
26. Cai, C.Z.; Han, L.Y.; Ji, Z.L.; Chen, X.; Chen, Y.Z. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003, 31, 3692–3697.
27. Liu, B.; Wang, S.; Dong, Q.; Li, S.; Liu, X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans. Nanobiosci. 2016, 15, 328–334.
28. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
29. Xia, W.; Zheng, L.; Fang, J.; Li, F.; Zhou, Y.; Zeng, Z.; Zhang, B.; Li, Z.; Li, H.; Zhu, F. PFmulDL: A novel strategy enabling multi-class and multi-label protein function annotation by integrating diverse deep learning methods. Comput. Biol. Med. 2022, 145, 105465.
30. Long, H.; Sun, Z.; Li, M.; Fu, H.; Lin, M. Predicting Protein Phosphorylation Sites Based on Deep Learning. Curr. Bioinform. 2020, 15, 300–308.
31. Ao, C.; Jiao, S.; Wang, Y.; Yu, L.; Zou, Q. Biological Sequence Classification: A Review on Data and General Methods. Research 2022, 2022, 11.
32. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7112–7127.
33. Detlefsen, N.S.; Hauberg, S.; Boomsma, W. Learning meaningful representations of protein sequences. Nat. Commun. 2022, 13, 1914.
34. Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, X.; Canny, J.; Abbeel, P.; Song, Y. Evaluating Protein Transfer Learning with TAPE. arXiv 2019, arXiv:1906.08230.
35. Alley, E.C.; Khimulya, G.; Biswas, S.; AlQuraishi, M.; Church, G.M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 2019, 16, 1315–1322.
36. Yang, K.K.; Wu, Z.; Bedbrook, C.N.; Arnold, F.H. Learned protein embeddings for machine learning. Bioinformatics 2018, 34, 2642–2648.
37. Bepler, T.; Berger, B. Learning protein sequence embeddings using information from structure. arXiv 2019, arXiv:1902.08661.
38. Hosseini, S.; Ilie, L. PITHIA: Protein Interaction Site Prediction Using Multiple Sequence Alignments and Attention. Int. J. Mol. Sci. 2022, 23, 12814.
39. Jiang, J.; Lin, X.; Jiang, Y.; Jiang, L.; Lv, Z. Identify Bitter Peptides by Using Deep Representation Learning Features. Int. J. Mol. Sci. 2022, 23, 7877.
40. Jiang, L.; Jiang, J.; Wang, X.; Zhang, Y.; Zheng, B.; Liu, S.; Zhang, Y.; Liu, C.; Wan, Y.; Xiang, D.; et al. IUP-BERT: Identification of Umami Peptides Based on BERT Features. Foods 2022, 11, 3742.
41. Wu, X.; Yu, L. EPSOL: Sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics 2021, 37, btab463.
42. Wei, Y.; Zou, Q.; Tang, F.; Yu, L. WMSA: A novel method for multiple sequence alignment of DNA sequences. Bioinformatics 2022, 38, 5019–5025.
43. Wang, H.F. Predicting Thermophilic Proteins by Machine Learning. Curr. Bioinform. 2020, 15, 493–502.
44. Asgari, E.; McHardy, A.C.; Mofrad, M.R.K. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci. Rep. 2019, 9, 3577.
45. Coin, L.; Bateman, A.; Durbin, R. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc. Natl. Acad. Sci. USA 2003, 100, 4516–4520.
46. Suzek, B.E.; Wang, Y.; Huang, H.; McGarvey, P.B.; Wu, C.H. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015, 31, 926–932.
47. Steinegger, M.; Mirdita, M.; Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 2019, 16, 603–606.
48. El-Gebali, S.; Mistry, J.; Bateman, A.; Eddy, S.R.; Luciani, A.; Potter, S.C.; Qureshi, M.; Richardson, L.J.; Salazar, G.A.; Smart, A.; et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019, 47, D427–D432.
49. UniProt Consortium, T. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 2018, 46, 2699.
50. Lv, Z.; Wang, D.; Ding, H.; Zhong, B.; Xu, L. Escherichia coli DNA N-4-Methycytosine Site Prediction Accuracy Improved by Light Gradient Boosting Machine Feature Selection Technology. IEEE Access 2020, 8, 14851–14859.
51. Tang, Y.-J.; Pang, Y.-H.; Liu, B. IDP-Seq2Seq: Identification of intrinsically disordered regions based on sequence to sequence learning. Bioinformatics 2021, 36, 5177–5186.
52. Stoltzfus, J.C. Logistic regression: A brief primer. Acad. Emerg. Med. 2011, 18, 1099–1104.
53. Yu, J.; Xuan, Z.; Feng, X.; Zou, Q.; Wang, L. A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier. BMC Bioinform. 2019, 20, 396.
54. Du, L.; Meng, Q.; Chen, Y.; Wu, P. Subcellular location prediction of apoptosis proteins using two novel feature extraction methods based on evolutionary information and LDA. BMC Bioinform. 2020, 21, 212.
55. Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Wang, R. Efficient kNN Classification with Different Numbers of Nearest Neighbors. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1774–1785.
56. Lv, Z.; Jin, S.; Ding, H.; Zou, Q. A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features. Front. Bioeng. Biotechnol. 2019, 7, 215.
57. Liu, B.; Li, K. iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features. Mol. Ther. Nucleic Acids 2019, 18, 80–87.
58. Huo, Y.; Xin, L.; Kang, C.; Wang, M.; Ma, Q.; Yu, B. SGL-SVM: A novel method for tumor classification via support vector machine with sparse group Lasso. J. Theor. Biol. 2020, 486, 110098.
59. Tan, J.X.; Li, S.H.; Zhang, Z.M.; Chen, C.X.; Chen, W.; Tang, H.; Lin, H. Identification of hormone binding proteins based on machine learning methods. Math. Biosci. Eng. 2019, 16, 2466–2480.
60. Zhang, Y.; Yu, S.; Xie, R.; Li, J.; Leier, A.; Marquez-Lago, T.T.; Akutsu, T.; Smith, A.I.; Ge, Z.; Wang, J.; et al. PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins. Bioinformatics 2020, 36, 704–712.
61. Yu, L.; Wang, M.; Yang, Y.; Xu, F.; Zhang, X.; Xie, F.; Gao, L.; Li, X. Predicting therapeutic drugs for hepatocellular carcinoma based on tissue-specific pathways. PLoS Comput. Biol. 2021, 17, e1008696.
62. Meng, C.; Ju, Y.; Shi, H. TMPpred: A support vector machine-based thermophilic protein identifier. Anal. Biochem. 2022, 645, 114625.
63. Charoenkwan, P.; Chotpatiwetchkul, W.; Lee, V.S.; Nantasenamat, C.; Shoombuatong, W. A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides. Sci. Rep. 2021, 11, 23782.
Classifier | SMOTE | 5-Fold CV ACC | 5-Fold CV MCC | 5-Fold CV Sn | 5-Fold CV Sp | 5-Fold CV AUC | Indep. Test ACC | Indep. Test MCC | Indep. Test Sn | Indep. Test Sp | Indep. Test AUC | Average ACC | Average MCC | Average Sn | Average Sp | Average AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LGBM | Yes | 0.9662 | 0.9325 | 0.9671 | 0.9653 | 0.9947 | 0.9662 | 0.9323 | 0.9670 | 0.9654 | 0.9935 | 0.9662 | 0.9324 | 0.9670 | 0.9654 | 0.9941 |
SVM | Yes | 0.9692 | 0.9387 | 0.9671 | 0.9714 | 0.9951 | 0.9715 | 0.9433 | 0.9817 | 0.9619 | 0.9950 | 0.9704 | 0.9410 | 0.9744 | 0.9667 | 0.9951 |
RF | Yes | 0.9554 | 0.9111 | 0.9445 | 0.9662 | 0.9912 | 0.9573 | 0.9145 | 0.9524 | 0.9619 | 0.9913 | 0.9563 | 0.9128 | 0.9485 | 0.9641 | 0.9913 |
KNN | Yes | 0.9558 | 0.9118 | 0.9601 | 0.9515 | 0.9893 | 0.9537 | 0.9077 | 0.9634 | 0.9446 | 0.9888 | 0.9548 | 0.9098 | 0.9618 | 0.9480 | 0.9890 |
LDA | Yes | 0.9255 | 0.8513 | 0.9367 | 0.9142 | 0.9750 | 0.9395 | 0.8790 | 0.9451 | 0.9343 | 0.9800 | 0.9325 | 0.8651 | 0.9409 | 0.9243 | 0.9775 |
NB | Yes | 0.9350 | 0.8713 | 0.9090 | 0.9610 | 0.9710 | 0.9413 | 0.8825 | 0.9304 | 0.9516 | 0.9703 | 0.9382 | 0.8769 | 0.9197 | 0.9563 | 0.9707 |
LR | Yes | 0.9697 | 0.9397 | 0.9653 | 0.9740 | 0.9947 | 0.9662 | 0.9327 | 0.9780 | 0.9550 | 0.9935 | 0.9679 | 0.9362 | 0.9717 | 0.9645 | 0.9941 |
AVG | Yes | 0.9518 | 0.9040 | 0.9471 | 0.9564 | 0.9861 | 0.9549 | 0.9100 | 0.9585 | 0.9516 | 0.9865 | 0.9533 | 0.9070 | 0.9528 | 0.9540 | 0.9863 |
LGBM | No | 0.9608 | 0.9217 | 0.9579 | 0.9636 | 0.9933 | 0.9591 | 0.9184 | 0.9707 | 0.9481 | 0.9917 | 0.9600 | 0.9201 | 0.9643 | 0.9559 | 0.9925 |
SVM | No | 0.9684 | 0.9368 | 0.9661 | 0.9705 | 0.9945 | 0.9715 | 0.9433 | 0.9817 | 0.9619 | 0.9950 | 0.9700 | 0.9400 | 0.9739 | 0.9662 | 0.9948 |
RF | No | 0.9537 | 0.9076 | 0.9405 | 0.9662 | 0.9906 | 0.9644 | 0.9288 | 0.9634 | 0.9654 | 0.9914 | 0.9591 | 0.9182 | 0.9520 | 0.9658 | 0.9910 |
KNN | No | 0.9519 | 0.9039 | 0.9451 | 0.9584 | 0.9885 | 0.9484 | 0.8967 | 0.9487 | 0.9481 | 0.9882 | 0.9502 | 0.9003 | 0.9469 | 0.9532 | 0.9883 |
LDA | No | 0.9226 | 0.8456 | 0.9341 | 0.9116 | 0.9724 | 0.9413 | 0.8826 | 0.9451 | 0.9377 | 0.9798 | 0.9319 | 0.8641 | 0.9396 | 0.9247 | 0.9761 |
NB | No | 0.9341 | 0.8692 | 0.9076 | 0.9593 | 0.9708 | 0.9413 | 0.8825 | 0.9304 | 0.9516 | 0.9690 | 0.9377 | 0.8759 | 0.9190 | 0.9554 | 0.9699 |
LR | No | 0.9675 | 0.9351 | 0.9661 | 0.9688 | 0.9955 | 0.9751 | 0.9504 | 0.9853 | 0.9654 | 0.9936 | 0.9713 | 0.9428 | 0.9757 | 0.9671 | 0.9946 |
AVG | No | 0.9513 | 0.9029 | 0.9454 | 0.9569 | 0.9865 | 0.9573 | 0.9147 | 0.9608 | 0.9540 | 0.9870 | 0.9543 | 0.9088 | 0.9531 | 0.9555 | 0.9867 |
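
The table above does not say how its Average columns are derived. The reported values are consistent with the arithmetic mean of the 5-fold cross-validation figure and the independent-test figure, up to rounding. The following is a minimal check of that reading, not a statement of the authors' procedure; the numbers are copied from the LGBM and SVM rows with SMOTE, and the helper name `mean_gap` is illustrative.

```python
# (CV value, independent-test value, reported average), copied from the
# LGBM and SVM rows with SMOTE in the table above.
rows = {
    "LGBM ACC": (0.9662, 0.9662, 0.9662),
    "LGBM MCC": (0.9325, 0.9323, 0.9324),
    "LGBM AUC": (0.9947, 0.9935, 0.9941),
    "SVM ACC": (0.9692, 0.9715, 0.9704),
    "SVM MCC": (0.9387, 0.9433, 0.9410),
    "SVM Sn": (0.9671, 0.9817, 0.9744),
}


def mean_gap(cv, test, reported):
    """Absolute gap between the mean of (cv, test) and the reported average."""
    return abs((cv + test) / 2 - reported)


gaps = {name: mean_gap(*vals) for name, vals in rows.items()}
max_gap = max(gaps.values())

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.5f}")
```

Every gap stays within the rounding error of four-decimal inputs, which supports the mean-of-two reading for these rows.
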
Top Features | SMOTE | 5-Fold CV ACC | 5-Fold CV MCC | 5-Fold CV Sn | 5-Fold CV Sp | 5-Fold CV AUC | Indep. Test ACC | Indep. Test MCC | Indep. Test Sn | Indep. Test Sp | Indep. Test AUC | Average ACC | Average MCC | Average Sn | Average Sp | Average AUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | Yes | 0.9697 | 0.9394 | 0.9697 | 0.9697 | 0.9957 | 0.9751 | 0.9504 | 0.9853 | 0.9654 | 0.9935 | 0.9724 | 0.9449 | 0.9775 | 0.9675 | 0.9946 |
1024 | Yes | 0.9697 | 0.9397 | 0.9653 | 0.9740 | 0.9947 | 0.9662 | 0.9327 | 0.9780 | 0.9550 | 0.9935 | 0.9679 | 0.9362 | 0.9717 | 0.9645 | 0.9941 |
90 | No | 0.9675 | 0.9350 | 0.9662 | 0.9688 | 0.9954 | 0.9751 | 0.9504 | 0.9853 | 0.9654 | 0.9935 | 0.9713 | 0.9427 | 0.9757 | 0.9671 | 0.9945 |
1024 | No | 0.9675 | 0.9351 | 0.9661 | 0.9688 | 0.9955 | 0.9751 | 0.9504 | 0.9853 | 0.9654 | 0.9936 | 0.9713 | 0.9428 | 0.9757 | 0.9671 | 0.9946 |
Method | Year | Feature a | Evaluation b | ML c | FS d | Dimension | ACC | Sn | Sp | MCC | AUC |
---|---|---|---|---|---|---|---|---|---|---|---|
Gromiha’s method [4] | 2008 | AAC | 5CV | NN | None | 20 | 0.8940 | 0.8240 | 0.9300 | ----- | ----- |
Hao’s method [5] | 2011 | AAC, DC | JCV | SVM | ANOVA | 30 | 0.9327 | 0.9377 | 0.9269 | ----- | ----- |
De’s method [7] | 2011 | PC, CTD, AAC | JCV | SVM | Genetic | 30 | 0.9593 | 0.9617 | 0.9569 | 0.9187 | ----- |
Songyot’s method [6] | 2011 | AAC, DC | JCV | SVM | IFFS | 28 | 0.9390 | 0.9380 | 0.9410 | ----- | ----- |
Fan’s method [8] | 2016 | AAC, EI, pKa | JCV | SVM | None | 460 | 0.9353 | 0.8950 | 0.9564 | 0.8600 | ----- |
Li’s method [15] | 2019 | AAC, DDE, CKSAAGP | 10CV | VA | MRMD | ----- | 0.9303 | ----- | ----- | ----- | ----- |
Guo’s method [2] | 2020 | PC, ACC, RD | 10CV | SVM | MRMD | 119 | 0.9602 | 0.9585 | ----- | ----- | ----- |
SAPPHIRE [11] | 2022 | AAC, AAI, APAAC, DC, CTD, PAAC, PSSM_COM, RPM_PSSM, S_FPSSM | 10CV | EL | GA-SAR | 12 | 0.9350 | 0.9280 | 0.9430 | 0.8710 | 0.9790 |
DeepTP [12] | 2023 | AAC, DC, CTD, QSO, PAAC, APAAC | 10CV | MLP | LGBM, RFECV | 205 | 0.8710 | 0.8730 | 0.8690 | 0.7420 | 0.9340 |
BertThermo (this study) | 2023 | BERT-Bfd | 5CV | LR | LGBM | 90 | 0.9697 | 0.9697 | 0.9697 | 0.9394 | 0.9957 |
Method | Year | Feature a | ML b | FS | Dimension | ACC | Sn | Sp | MCC | AUC |
---|---|---|---|---|---|---|---|---|---|---|
iThermo [10] | 2022 | ACC, TPAAC, APAAC, DC, DDE, CKSAAP, CTD | MLP | ANOVA | ----- | 0.9626 | 0.9634 | 0.9619 | 0.9269 | 0.9864 |
DeepTP [12] | 2023 | AAC, DC, CTD, QSO, PAAC, APAAC | MLP | LGBM, RFECV | 205 | 0.9342 | 0.9634 | 0.9066 | 0.8700 | 0.9826 |
BertThermo | / | BERT-Bfd | LR | LGBM | 90 | 0.9751 | 0.9853 | 0.9654 | 0.9504 | 0.9935 |
Method | ML | ACC | MCC | Sn | Sp | AUC |
---|---|---|---|---|---|---|
DeepTP [12] | MLP | 0.871 | 0.742 | 0.873 | 0.869 | 0.943 |
BertThermo | LR | 0.913 | 0.826 | 0.916 | 0.910 | 0.972 |
Method | Balanced ACC | Balanced MCC | Balanced Sn | Balanced Sp | Balanced AUC | Balanced AP | Imbalanced ACC | Imbalanced MCC | Imbalanced Sn | Imbalanced Sp | Imbalanced AUC | Imbalanced AP | Homology ACC | Homology MCC | Homology Sn | Homology Sp | Homology AUC | Homology AP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a TMPpred [62] | 0.708 | 0.418 | 0.659 | 0.758 | / | / | 0.725 | 0.129 | 0.733 | 0.724 | / | / | 0.663 | 0.327 | 0.690 | 0.636 | / | / |
a SCMTPP [63] | 0.761 | 0.545 | 0.621 | 0.902 | 0.846 | 0.857 | 0.882 | 0.237 | 0.733 | 0.884 | / | 0.248 | 0.720 | 0.440 | 0.730 | 0.710 | 0.808 | 0.792 |
a iThermo [10] | 0.791 | 0.583 | 0.749 | 0.832 | 0.868 | 0.867 | 0.814 | 0.196 | 0.800 | 0.814 | / | 0.297 | 0.740 | 0.482 | 0.780 | 0.700 | 0.834 | 0.836 |
a SAPPHIRE [11] | 0.821 | 0.657 | 0.711 | 0.930 | 0.904 | 0.916 | 0.930 | 0.316 | 0.733 | 0.933 | / | 0.527 | 0.785 | 0.570 | 0.790 | 0.790 | 0.871 | 0.862 |
a DeepTP [12] | 0.873 | 0.746 | 0.854 | 0.891 | 0.944 | 0.946 | 0.886 | 0.277 | 0.833 | 0.887 | / | 0.536 | 0.830 | 0.671 | 0.920 | 0.740 | 0.909 | 0.906 |
BertThermo | 0.906 | 0.813 | 0.919 | 0.894 | 0.968 | 0.966 | 0.903 | 0.326 | 0.900 | 0.903 | 0.974 | 0.711 | 0.898 | 0.793 | 0.930 | 0.857 | 0.957 | 0.965 |
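
The tables report ACC, MCC, Sn and Sp without restating their definitions. The sketch below uses the standard confusion-matrix definitions of accuracy, Matthews correlation coefficient, sensitivity and specificity, on the assumption that the paper follows them. The function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp}


# Hypothetical counts for illustration only.
m = binary_metrics(tp=95, fn=5, tn=90, fp=10)
for name, value in m.items():
    print(f"{name} = {value:.4f}")
```

AUC and AP are threshold-free and depend on the ranking of predicted scores, so they cannot be computed from a single confusion matrix and are left out of this sketch.
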
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pei, H.; Li, J.; Ma, S.; Jiang, J.; Li, M.; Zou, Q.; Lv, Z. Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features. Appl. Sci. 2023, 13, 2858. https://doi.org/10.3390/app13052858