Scalable and Efficient Protein Secondary Structure Prediction Using Autoencoder-Reduced ProtBERT Embeddings
Abstract
1. Introduction
- To develop a deep learning framework for protein secondary structure prediction that balances computational efficiency with predictive accuracy;
- To compress ProtBERT-derived high-dimensional embeddings using stacked autoencoders, enabling efficient representation without significant information loss;
- To implement a fixed-length subsequence strategy for handling variable-length protein sequences, thereby improving memory usage and model consistency;
- To empirically identify an optimal configuration that offers a practical trade-off between biological fidelity and computational cost;
- To evaluate the model’s performance across both Q3 and Q8 classification schemes using a comprehensive, high-quality dataset.
- A hybrid deep learning pipeline integrating ProtBERT embeddings, autoencoder-based dimensionality reduction, and Bi-LSTM sequence modeling tailored for protein secondary structure prediction (a minimal end-to-end sketch follows this list);
- A novel subsequencing approach that standardizes protein input representations, leading to enhanced training stability and efficient GPU utilization;
- A comprehensive experimental evaluation on a curated PISCES-derived dataset, exploring multiple feature dimensionalities and subsequence lengths under Q3 and Q8 schemes;
- An empirical demonstration that reducing the embedding dimensions to 256 preserves over 99% of predictive performance while decreasing GPU memory usage by 67% and training time by 43%;
- Identification of the 256D–50L configuration as an optimal balance point, enabling scalable deployment on resource-constrained platforms without compromising accuracy.
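To make the data flow concrete, here is a minimal end-to-end sketch of the proposed pipeline. The helper names below (extract_protbert_embeddings, autoencoder_encode, make_subsequences, bilstm_predict) are hypothetical stand-ins for the stages detailed in Section 3.5, not the authors' published code:

```python
# Illustrative composition of the pipeline stages; every helper here is a
# hypothetical stand-in (each stage is sketched in Section 3.5), not real
# published code from the paper.

def predict_secondary_structure(sequence: str) -> list[str]:
    embeddings = extract_protbert_embeddings(sequence)  # (L, 1024) per-residue vectors
    reduced = autoencoder_encode(embeddings)            # (L, 256) compressed features
    windows = make_subsequences(reduced, length=50)     # fixed-length 50-residue chunks
    return bilstm_predict(windows)                      # one Q3/Q8 label per residue
```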
2. Related Work
3. Methodology
3.1. Dataset
3.2. Protein Representation Using BERT Models
3.3. Dimensionality Reduction with Autoencoders
3.4. Secondary Structure Prediction Models
3.5. Proposed Method
3.5.1. Raw Protein Sequences
3.5.2. Feature Extraction Using Pre-Trained BERT
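A minimal sketch of per-residue feature extraction, assuming the public Rostlab/prot_bert checkpoint on Hugging Face; the preprocessing shown (uppercasing, mapping rare residues U/Z/O/B to X, inserting spaces between residues) follows standard ProtBERT usage and may differ in detail from the authors' exact setup:

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# ProtBERT expects space-separated residues and rare amino acids mapped to X.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

def embed_sequence(seq: str) -> torch.Tensor:
    seq = re.sub(r"[UZOB]", "X", seq.upper())
    spaced = " ".join(seq)
    inputs = tokenizer(spaced, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Drop the [CLS] and [SEP] tokens; keep one 1024-dim vector per residue.
    return out.last_hidden_state[0, 1:-1]   # shape: (len(seq), 1024)
```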
3.5.3. Dimensionality Reduction via Autoencoders
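A minimal sketch of the compression stage written as a single deep autoencoder trained with a reconstruction loss; the intermediate width (512) and ReLU activations are illustrative assumptions, and a stacked autoencoder may instead be pretrained layer-wise before fine-tuning:

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Compress 1024-dim ProtBERT vectors to a smaller bottleneck.

    The paper evaluates bottlenecks of 512/256/128/64/32; layer widths
    here are assumptions for illustration."""
    def __init__(self, bottleneck: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 512), nn.ReLU(),
            nn.Linear(512, 1024),
        )

    def forward(self, x):
        z = self.encoder(x)           # compressed representation
        return self.decoder(z), z     # reconstruction and bottleneck features

# Train with a reconstruction loss, then reuse encoder(x) as input features.
model = StackedAutoencoder(bottleneck=256)
loss_fn = nn.MSELoss()
```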
3.5.4. Subsequence Generation
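A sketch of the fixed-length windowing step; zero-padding the final short window and masking its labels with -1 are assumptions for illustration, as the paper's exact padding policy is not restated here:

```python
import numpy as np

def make_subsequences(features: np.ndarray, labels: np.ndarray, length: int = 50):
    """Split one protein's (L, D) feature matrix and per-residue labels
    into fixed-length windows; the final short window is zero-padded."""
    n, d = features.shape
    chunks, label_chunks = [], []
    for start in range(0, n, length):
        f = features[start:start + length]
        y = labels[start:start + length]
        if len(f) < length:  # pad the tail window
            pad = length - len(f)
            f = np.vstack([f, np.zeros((pad, d), dtype=f.dtype)])
            y = np.concatenate([y, np.full(pad, -1)])  # -1 = ignore in loss
        chunks.append(f)
        label_chunks.append(y)
    return np.stack(chunks), np.stack(label_chunks)

# Example: a 137-residue protein yields three 50-residue windows.
feats = np.random.rand(137, 256).astype(np.float32)
labs = np.random.randint(0, 3, 137)
X, Y = make_subsequences(feats, labs, length=50)   # X: (3, 50, 256)
```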
3.5.5. PSSP Model Training
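A minimal Bi-LSTM tagger over the fixed-length windows; hidden size, depth, and the padding-aware loss are illustrative assumptions rather than the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Per-residue secondary-structure tagger over fixed-length windows."""
    def __init__(self, in_dim: int = 256, hidden: int = 128, n_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)          # (batch, window, 2 * hidden)
        return self.head(h)          # (batch, window, n_classes) logits

model = BiLSTMTagger(in_dim=256, n_classes=8)     # Q8 variant; use n_classes=3 for Q3
criterion = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padded residues

# One training step on dummy data: 4 windows of 50 residues each.
x = torch.randn(4, 50, 256)
y = torch.randint(0, 8, (4, 50))
loss = criterion(model(x).reshape(-1, 8), y.reshape(-1))
```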
4. Experiments and Results
4.1. Evaluation Metrics
- TP (True Positive): Correctly predicted positive samples.
- TN (True Negative): Correctly predicted negative samples.
- FP (False Positive): Incorrectly predicted positive samples.
- FN (False Negative): Incorrectly predicted negative samples.
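From these four counts, the metrics used in the paper follow in the standard way (computed per class and macro-averaged over the Q3 or Q8 labels):

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```
```latex
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```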
4.2. Results
- The highest Q3 and Q8 F1 scores are achieved using 1024D features.
- Reducing the feature size to 256D preserves 99% of Q3 and 98.5% of Q8 performance.
- The 256D–50L configuration offers an optimal trade-off between efficiency and accuracy.
- Longer subsequences (75–100 residues) provide more stable training but increase training time.
- Q8 classification requires more training epochs to reach convergence due to higher class complexity.
5. Discussion
5.1. Effects of Dimensionality Reduction and Subsequence Selection on Performance
- Memory Optimization: Reducing from 1024D to 256D decreased VRAM usage by 67.4% (31.1 GB to 10.13 GB).
- Speed–Accuracy Trade-off: The 50-residue subsequences retained 99% of baseline accuracy while reducing training time by 43%.
- Hardware-Aware Design: GPU acceleration is viable only for models below 512D, emphasizing the value of dimensionality control.
- Model Complexity: Q8 classification required 2.7× more epochs than Q3 to reach equivalent convergence.
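As a consistency check, the memory figure follows directly from the reported VRAM measurements:

```latex
\text{VRAM reduction} = \frac{31.1\,\mathrm{GB} - 10.13\,\mathrm{GB}}{31.1\,\mathrm{GB}} \approx 0.674 \;(67.4\%)
```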
5.2. Biological Insight and Interpretability
6. Conclusions
7. Future Work
- Integration with Pretrained 3D Models: Future work can explore the integration of the proposed framework with state-of-the-art 3D structure predictors such as AlphaFold2 or ESMFold. Leveraging tertiary structure outputs as additional input features may enhance secondary structure prediction accuracy and offer multi-level structural consistency.
- Attention-Based Interpretability: Incorporating attention mechanisms into the Bi-LSTM, or replacing it with Transformer-based architectures, could provide interpretability, enabling the identification of sequence regions critical for structural transitions.
- Multi-task Learning: Extending the model to jointly predict related structural features—such as solvent accessibility, contact maps, or intrinsically disordered regions—could improve generalization and model robustness across protein families.
- Cross-platform and Edge Deployment: Given the model’s compact design with dimensionality-reduced embeddings, deploying optimized variants on mobile or embedded platforms could open avenues for real-time protein structure analysis in portable lab devices.
- Data Augmentation and Self-Supervised Pretraining: Utilizing contrastive learning or masked amino acid prediction on large unlabeled protein datasets can improve feature richness, especially in low-resource scenarios where labeled data are limited.
- Dataset Diversity: The curated PISCES dataset, while high-quality, may suffer from structural bias—particularly the under-representation of membrane proteins. This may limit the model’s generalizability to under-represented structural classes. Future efforts should consider incorporating more diverse protein families, including membrane-associated proteins (e.g., GPCRs) and intrinsically disordered proteins (IDPs), to enhance structural diversity and real-world applicability.
- Dynamic Subsequence Lengths: In this study, we adopted fixed-length subsequences (25, 50, 75, and 100 residues) for computational consistency. However, proteins naturally exhibit domain structures of varying lengths. Future research could investigate dynamic or adaptive subsequence segmentation strategies—possibly guided by domain prediction tools or sequence entropy—to better capture structural boundaries and improve biological relevance.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Pauling, L.; Corey, R.B.; Branson, H.R. The Structure of Proteins: Two Hydrogen-Bonded Helical Configurations of the Polypeptide Chain. Proc. Natl. Acad. Sci. USA 1951, 37, 205–211.
2. Ismi, D.P.; Pulungan, R.; Afiahayati, N. Deep Learning for Protein Secondary Structure Prediction: Pre and Post-AlphaFold. Comput. Struct. Biotechnol. J. 2022, 20, 6271–6286.
3. Yuan, L.; Hu, X.; Ma, Y.; Liu, Y. DLBLS_SS: Protein Secondary Structure Prediction Using Deep Learning and Broad Learning System. RSC Adv. 2022, 12, 33479–33487.
4. Patel, M.S.; Mazumdar, H.S. Knowledge base and neural network approach for protein secondary structure prediction. J. Theor. Biol. 2014, 361, 182–189.
5. Kösesoy, İ.; Gök, M.; Öz, C. PROSES: A web server for sequence-based protein encoding. J. Comput. Biol. 2018, 25, 1120–1122.
6. Li, Z.; Wang, J.; Zhang, S.; Zhang, Z.; Wu, W. A new hybrid coding for protein secondary structure prediction based on primary structure similarity. Gene 2017, 618, 8–13.
7. Kosesoy, I.; Gok, M.; Oz, C. A new sequence based encoding for prediction of host–pathogen protein interactions. Comput. Biol. Chem. 2019, 78, 170–177.
8. Kösesoy, İ.; Gök, M.; Kahveci, T. Prediction of host-pathogen protein interactions by extended network model. Turk. J. Biol. 2021, 45, 138–148.
9. Geethu, S.; Vimina, E.R. Protein Secondary Structure Prediction Using Cascaded Feature Learning Model. Appl. Soft Comput. 2023, 140, 110242.
10. Yang, W.; Hu, Z.; Zhou, L.; Jin, Y. Protein Secondary Structure Prediction Using a Lightweight Convolutional Network and Label Distribution Aware Margin Loss. Knowl.-Based Syst. 2021, 237, 107771.
11. Ema, R.R.; Khatun, M.A.; Adnan, M.N.; Kabir, S.S.; Galib, S.M.; Hossain, M.A. Protein Secondary Structure Prediction Based on CNN and Machine Learning Algorithms. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 115–126.
12. Zhou, J.; Wang, H.; Zhao, Z.; Xu, R.; Lu, Q. CNNH-PSS: Protein 8-Class Secondary Structure Prediction by Convolutional Neural Network with Highway. BMC Bioinform. 2018, 19, 60.
13. Lyu, Z.; Wang, Z.; Shuai, J.; Huang, Y. Protein Secondary Structure Prediction with a Reductive Deep Learning Method. Front. Bioeng. Biotechnol. 2021, 9, 687426.
14. Ghazikhani, H.; Butler, G. Enhanced Identification of Membrane Transport Proteins: A Hybrid Approach Combining ProtBERT-BFD and Convolutional Neural Networks. J. Integr. Bioinform. 2023, 20, 20220055.
15. Yamada, K.; Hamada, M. Prediction of RNA-Protein Interactions Using a Nucleotide Language Model. Bioinform. Adv. 2022, 2, vbac023.
16. Chowdhury, R.; Bouatta, N.; Biswas, S.; Rochereau, C.; Church, G.M.; Sorger, P.K.; AlQuraishi, M. Single-Sequence Protein Structure Prediction Using Language Models from Deep Learning. bioRxiv 2021.
17. Chou, P.Y.; Fasman, G.D. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol. 1978, 47, 45–148.
18. Qian, N.; Sejnowski, T.J. Predicting the Secondary Structure of Globular Proteins Using Neural Network Models. J. Mol. Biol. 1988, 202, 865–884.
19. Heffernan, R.; Yang, Y.; Paliwal, K.; Zhou, Y. Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure. Bioinformatics 2017, 33, 2842–2849.
20. Hanson, J.; Paliwal, K.; Litfin, T.; Yang, Y.; Zhou, Y. Accurate prediction of protein secondary structure using an ensemble of deep learning methods. Bioinformatics 2017, 33, 868–878.
21. Wardah, W.; Khan, M.G.M.; Sharma, A.; Rashid, M.A. Protein Secondary Structure Prediction Using Neural Networks and Deep Learning: A Review. Comput. Biol. Chem. 2019, 81, 1–8.
22. Cheng, J.; Liu, Y.; Ma, Y. Protein Secondary Structure Prediction Based on Integration of CNN and LSTM Model. J. Vis. Commun. Image Represent. 2020, 71, 102844.
23. Sonsare, P.M.; Gunavathi, C. Cascading 1D-Convnet Bidirectional Long Short Term Memory Network with Modified COCOB Optimizer: A Novel Approach for Protein Secondary Structure Prediction. Chaos Solitons Fractals 2021, 153, 111446.
24. Zubair, M.; Hanif, M.K.; Alabdulkreem, E.; Ghadi, Y.Y.; Khan, M.I.; Sarwar, M.U.; Hanif, A. A Deep Learning Approach for Prediction of Protein Secondary Structure. Comput. Mater. Contin. 2022, 72, 3705–3718.
25. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
26. Wang, Y.; Mao, H.; Yi, Z. Protein Secondary Structure Prediction by Using Deep Learning Method. Knowl.-Based Syst. 2017, 118, 115–123.
27. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
28. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7112–7127.
29. Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; Linial, M. ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function. Bioinformatics 2022, 38, 2102–2110.
30. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A.; Bottou, L. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408.
31. Jha, K.; Saha, S.; Tanveer, M. Prediction of protein-protein interactions using stacked auto-encoder. Trans. Emerg. Telecommun. Technol. 2022, 33, e4256.
32. Manzoor, U.; Halim, Z. Protein encoder: An autoencoder-based ensemble feature selection scheme to predict protein secondary structure. Expert Syst. Appl. 2023, 213, 119081.
33. Wang, L.; You, Z.H.; Chen, X.; Xia, S.X.; Liu, F.; Yan, X.; Zhou, Y.; Song, K.J. A computational-based method for predicting drug–target interactions by using stacked autoencoder deep neural network. J. Comput. Biol. 2018, 25, 361–373.
34. Patre, S.; Kanani, R.; Alam, F.F. SuperFoldAE: Enhancing Protein Fold Classification with Autoencoders. In Proceedings of the Computational Structural Bioinformatics Workshop; Springer: Cham, Switzerland, 2024; pp. 1–15.
35. Sevgen, E.; Moller, J.; Lange, A.; Parker, J.; Quigley, S.; Mayer, J.; Srivastava, P.; Gayatri, S.; Hosfield, D.; Korshunova, M.; et al. ProT-VAE: Protein transformer variational autoencoder for functional protein design. bioRxiv 2023.
36. Sharma, A.K.; Srivastava, R. Protein secondary structure prediction using character bi-gram embedding and Bi-LSTM. Curr. Bioinform. 2021, 16, 333–338.
37. Roslidar, R.; Brilianty, N.; Alhamdi, M.J.; Nurbadriani, C.N.; Harnelly, E.; Zulkarnain, Z. Improving Bi-LSTM for High Accuracy Protein Sequence Family Classifier. Indones. J. Electr. Eng. Inform. 2024, 12, 40–52.
38. Tran, T.X.; Le, N.Q.K.; Nguyen, V.N. Integrating CNN and Bi-LSTM for protein succinylation sites prediction based on Natural Language Processing technique. Comput. Biol. Med. 2025, 186, 109664.
39. Hannigan, G.D.; Prihoda, D.; Palicka, A.; Soukup, J.; Klempir, O.; Rampula, L.; Durcak, J.; Wurst, M.; Kotowski, J.; Chang, D.; et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 2019, 47, e110.
40. Lilhore, U.K.; Simaiya, S.; Dalal, S.; Faujdar, N.; Sharma, Y.K.; Rao, K.B.; Maheswara Rao, V.; Tomar, S.; Ghith, E.; Tlija, M. ProtienCNN-BLSTM: An efficient deep neural network with amino acid embedding-based model of protein sequence classification and biological analysis. Comput. Intell. 2024, 40, e12696.
41. Wang, D.; Zou, C.; Wei, Z.; Zhong, Z. Disease Phenotype Classification Model Based on Multi-channel Deep Supervised Bi-LSTM. In Proceedings of the 2024 5th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 12–14 April 2024; pp. 760–766.
42. Gunduz, H. Comparative analysis of BERT and FastText representations on crowdfunding campaign success prediction. PeerJ Comput. Sci. 2024, 10, e2316.
43. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
44. Sun, P.D.; Foster, C.E.; Boyington, J.C. Overview of protein structural and functional folds. Curr. Protoc. Protein Sci. 2004, 35, Unit 17.1.
45. Liu, J.; Rost, B. Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol. 2003, 7, 5–11.
46. Miao, Z.; Wang, Q.; Xiao, X.; Kamal, G.M.; Song, L.; Zhang, X.; Li, C.; Zhou, X.; Jiang, B.; Liu, M. CSI-LSTM: A web server to predict protein secondary structure using bidirectional long short term memory and NMR chemical shifts. J. Biomol. NMR 2021, 75, 393–400.
47. Guo, Y.; Li, W.; Wang, B.; Liu, H.; Zhou, D. DeepACLSTM: Deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction. BMC Bioinform. 2019, 20, 341.
48. Feng, R.; Wang, X.; Xia, Z.; Han, T.; Wang, H.; Yu, W. MHTAPred-SS: A Highly Targeted Autoencoder-Driven Deep Multi-Task Learning Framework for Accurate Protein Secondary Structure Prediction. Int. J. Mol. Sci. 2024, 25, 13444.
| Q3 Class | Q8 Classes |
|---|---|
| E (Beta) | E (β-strand), B (β-bridge) |
| H (Helical) | H (α-helix), G (3₁₀-helix), I (π-helix) |
| C (Coil) | C (coil), T (turn), S (bend) |
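The table above encodes the standard DSSP grouping of eight states into three; a small helper (hypothetical, not from the paper) makes the collapse explicit:

```python
# Q8-to-Q3 collapse implied by the table above (standard DSSP grouping).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "C": "C", "T": "C", "S": "C",   # coil, turn, bend
}

def to_q3(q8_labels: str) -> str:
    return "".join(Q8_TO_Q3[c] for c in q8_labels)

print(to_q3("HGIEEBCTS"))  # -> "HHHEEECCC"
```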
| Feature | Value |
|---|---|
| Update Date | 17 December 2022 |
| Number of Proteins | 10,931 (non-redundant) |
| Resolution | ≤2.5 Å |
| Sequence Identity | ≤25% |
| Minimum Residue Count | 40 |
| Maximum Residue Count | 1500 |
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |
Q3 F1 scores by feature dimension and subsequence length (residues):

| Feature Dimension | 25 | 50 | 75 | 100 |
|---|---|---|---|---|
| 1024 | 0.8040 | 0.8049 | 0.8045 | 0.8016 |
| 512 | 0.7976 | 0.8031 | 0.8031 | 0.7975 |
| 256 | 0.7991 | 0.8023 | 0.8002 | 0.7981 |
| 128 | 0.7990 | 0.8005 | 0.7969 | 0.7964 |
| 64 | 0.7989 | 0.7976 | 0.7946 | 0.7976 |
| 32 | 0.7760 | 0.7699 | 0.7559 | 0.7822 |
Training epochs to convergence by feature dimension and subsequence length (residues):

| Feature Dimension | 25 | 50 | 75 | 100 |
|---|---|---|---|---|
| 1024 | 14 | 16 | 14 | 23 |
| 512 | 24 | 18 | 28 | 27 |
| 256 | 24 | 24 | 34 | 29 |
| 128 | 32 | 32 | 36 | 30 |
| 64 | 29 | 32 | 30 | 52 |
| 32 | 24 | 28 | 15 | 41 |
Q8 F1 scores by feature dimension and subsequence length (residues):

| Feature Dimension | 25 | 50 | 75 | 100 |
|---|---|---|---|---|
| 1024 | 0.6553 | 0.6501 | 0.6504 | 0.6498 |
| 512 | 0.6484 | 0.6447 | 0.6457 | 0.6317 |
| 256 | 0.6472 | 0.6440 | 0.6416 | 0.6344 |
| 128 | 0.6463 | 0.6418 | 0.6402 | 0.6441 |
| 64 | 0.6414 | 0.6363 | 0.6332 | 0.6010 |
| 32 | 0.6173 | 0.6176 | 0.6267 | 0.6278 |
| Study | Embedding Used | Autoencoder Used | Handles Variable-Length Sequences |
|---|---|---|---|
| Our Model | ProtBERT | ✓ (for embedding compression) | ✓ |
| CSI-LSTM [46] | NMR Shifts | ✗ | ✗ |
| DeepACLSTM [47] | AAindex + HHBlits | ✓ (attention + autoencoder) | ✓ |
| MHTAPred-SS [48] | One-hot + Evolutionary Info | ✓ (hierarchical transformer autoencoder) | ✓ |
| Study | Embedding Compression Efficiency | Computational Efficiency |
|---|---|---|
| Our Model | ✓ (90% reduction) | High (fast inference, low memory) |
| CSI-LSTM [46] | ✗ | Low (NMR preprocessing overhead) |
| DeepACLSTM [47] | ✓ | Moderate (deep + attention layers) |
| MHTAPred-SS [48] | ✓ | Moderate (transformer cost) |
Training time in seconds by feature dimension and subsequence length (Q3 models); the compute platform was uniform across subsequence lengths for each dimension:

| Dimension | 25 | 50 | 75 | 100 | Platform |
|---|---|---|---|---|---|
| 1024 | 13217 | 16868 | 16844 | 21187 | CPU |
| 512 | 25262 | 25893 | 40629 | 41112 | CPU |
| 256 | 1057 | 1145 | 1373 | 1150 | GPU |
| 128 | 1035 | 1242 | 1370 | 1653 | GPU |
| 64 | 1051 | 1321 | 1576 | 582 | GPU |
| 32 | 1109 | 1296 | 1720 | 1733 | GPU |
Training time in seconds by feature dimension and subsequence length (Q8 models); the compute platform was uniform across subsequence lengths for each dimension:

| Dimension | 25 | 50 | 75 | 100 | Platform |
|---|---|---|---|---|---|
| 1024 | 16804 | 19258 | 17987 | 30307 | CPU |
| 512 | 25533 | 19439 | 31777 | 31714 | CPU |
| 256 | 937 | 869 | 1240 | 1078 | GPU |
| 128 | 1240 | 1108 | 1254 | 1104 | GPU |
| 64 | 1090 | 1377 | 1366 | 1840 | GPU |
| 32 | 905 | 954 | 524 | 1443 | GPU |