MDPI - Publisher of Open Access Journals

19 pages, 1730 KB

Open Access Article

PPI-Diff: De Novo Generation of Peptide Binders via Resolution-Aware Geometric Diffusion

by Benzhi Dong, Sijia Li, Chang Hou and Dali Xu

Biomolecules 2026, 16(4), 528; https://doi.org/10.3390/biom16040528 - 1 Apr 2026

Viewed by 635

Peptide binders, serving as a critical drug modality bridging small-molecule compounds and protein macromolecules, can effectively mimic the secondary structural elements of natural proteins. Peptides exhibit unique physicochemical advantages when targeting protein–protein interaction (PPI) interfaces, which are typically characterized by flat surfaces and extensive contact areas. Recently, diffusion models represented by RFdiffusion have established a new computational paradigm for protein backbone generation by defining a denoising process over the rigid-body transformation group. However, in the de novo design of binders targeting “undruggable” PPI targets, this general paradigm encounters significant adaptability bottlenecks. First, its underlying rigid-body assumption struggles to accurately describe the dynamic induced-fit process of peptides at the binding interface. Second, it lacks sufficient robustness to the experimental resolution heterogeneity inherent in training data. Third, the decoupled two-stage generation of sequence and structure severs the synergy of physicochemical properties, leading to backbones with idealized, singular secondary structures that lack authentic spatial binding capacity and reasonable side-chain physicochemical features. To address these challenges, this study proposes PPI-Diff, a novel generative framework. While preserving the generative capability of diffusion models, PPI-Diff introduces three core mechanisms: (1) a resolution-aware constraint mechanism that maps the measurement precision of experimental data into explicit contextual constraints to dynamically suppress geometric noise from low-resolution samples; (2) an internal-coordinate-driven manifold diffusion model that performs conformational evolution on a Riemannian manifold constructed by dihedral angles, balancing local stereochemical validity with the precise capture of flexible peptide conformations; and (3) a geometry-semantic synergistic modeling mechanism that leverages the evolutionary embeddings of a pre-trained protein language model (ESM-2) as latent variables to align structure generation with biophysical functions. Systematic benchmarking demonstrates that, on a strictly non-homologous test set, the binders generated by PPI-Diff significantly outperform existing baseline models in terms of interface contact density, stereochemical validity, and sequence novelty.

(This article belongs to the Section Biomacromolecules: Proteins, Nucleic Acids and Carbohydrates)

20 pages, 921 KB

Open Access Article

ThermoFormer: Predicting Protein Melting Temperature Through Large-Scale Pretraining

by Jingchuan Li and Mingchen Li

Catalysts 2026, 16(4), 288; https://doi.org/10.3390/catal16040288 - 24 Mar 2026

Viewed by 828

Abstract

Temperature plays a dominant environmental role in determining the efficiency of protein function. Accurately predicting protein thermal stability is crucial for fundamental biology, drug discovery, and protein engineering. Here, we introduce ThermoFormer, a transformer-based protein language model that learns both temperature-aware representations and sequence patterns. Specifically, we first built a large-scale dataset comprising more than 96 million protein sequences annotated with their optimal growth temperature (OGT). ThermoFormer is pre-trained on this dataset with a supervised OGT prediction task and an unsupervised masked language modeling (MLM) task. We evaluated ThermoFormer’s pre-training performance and its transferability to other temperature-prediction datasets, including two melting temperature (TM) datasets, an optimal catalytic temperature (OCT) dataset, and a thermophilic protein classification task. The results show that ThermoFormer achieves state-of-the-art performance across all evaluated tasks, outperforming prior unsupervised pre-trained models. We also show that ThermoFormer enables zero-shot temperature prediction, i.e., even without further fine-tuning, ThermoFormer can still achieve comparable performance. Our model can serve as a foundation for encoding protein sequences with temperature-aware representations, improving transferability to temperature-related downstream tasks.

(This article belongs to the Special Issue Biocatalysis-Driven Catalytic Routes for Green and Alternative Chemical Production)

18 pages, 1642 KB

Open Access Article

Foundation Protein Language Models for Influenza A Virus T-Cell Epitope Prediction: A Transformer-Based Viroinformatics Framework

by Syed Nisar Hussain Bukhari and Kingsley A. Ogudo

Viruses 2026, 18(3), 380; https://doi.org/10.3390/v18030380 - 18 Mar 2026

Viewed by 826

Abstract

Influenza A virus remains a major cause of respiratory disease worldwide and poses a persistent challenge to vaccine development due to its rapid genetic evolution and antigenic variability. T-cell-based immunity has therefore gained increasing importance, as it can provide broader and more durable protection by targeting conserved viral regions. Accurate identification of T-cell epitopes (TCEs) is a fundamental requirement for epitope-based vaccine design and immunological research. Although numerous computational methods have been proposed, many existing approaches rely on handcrafted physicochemical features, which offer limited ability to capture contextual sequence dependencies. In this study, a transformer-based viroinformatics framework is proposed for the binary prediction of TCEs from Influenza A virus peptide sequences. The framework employs a pretrained Evolutionary Scale Modeling-2 (ESM-2) protein language model (PLM) to generate rich, contextualized embeddings directly from raw amino acid sequences, eliminating the need for manual feature engineering. These embeddings are processed using a lightweight attention-based transformer classifier to learn epitope-specific sequence patterns. The model achieves strong and stable predictive performance, attaining an accuracy of approximately 97% and an AUC close to 0.99 under stratified cross-validation. Ablation analysis further confirms that protein language model representations and self-attention contribute substantially to performance gains over classical machine learning baselines. To enhance practical reliability, Monte Carlo dropout is incorporated during inference to provide uncertainty-aware predictions, enabling differentiation between high-confidence and ambiguous peptide candidates. In addition, attention-based interpretability is used to identify residue-level contributions to model decisions, offering biologically meaningful insights into epitope recognition. Overall, this study demonstrates that PLMs combined with Transformer architectures provide an effective, interpretable, and promising computational framework for Influenza A TCE discovery and vaccine research.

(This article belongs to the Special Issue Viroinformatics and Viral Diseases)

16 pages, 1192 KB

Open Access Article

Multi-Scale Feature Mixing of Language Model Embeddings for Enhanced Prediction of Submitochondrial Protein Localization

by Rong Wang, Menghua Wang, Yibo Wu, Lixiang Yang and Xiao Wang

Algorithms 2026, 19(3), 212; https://doi.org/10.3390/a19030212 - 11 Mar 2026

Viewed by 327

Abstract

Accurate prediction of submitochondrial localization is fundamental to understanding mitochondrial biogenesis and cellular metabolic pathways. While deep representations from pre-trained protein language models (pLMs) have significantly advanced the field, traditional global average pooling methods often fail to capture critical, localized N-terminal targeting signals, particularly in long sequences where these motifs are mathematically diluted. To resolve this “signal dilution” bottleneck, we developed a multi-scale architecture that explicitly integrates high-resolution N-terminal features with global evolutionary context derived from ESM-2 embeddings. The proposed framework utilizes an orthogonal mixing strategy consisting of Token-mixing and Channel-mixing. Token-mixing is specifically designed to detect spatial rhythmic patterns across residue positions, while Channel-mixing refines the biochemical signatures within the latent feature space. Extensive benchmarking across diverse datasets demonstrates that our approach effectively maintains signal integrity. Compared to existing state-of-the-art methods, the model achieves a superior overall Generalized Correlation Coefficient (GCC) of 0.7443 on the SM424-18 dataset and 0.7878 on the SubMitoPred dataset, outperforming the latest benchmarks by 9.4% and 16.1%, respectively. Furthermore, on the independent M983 test set, our method maintained a high GCC of 0.6945, demonstrating a 9.9% improvement relative to the state-of-the-art methods. This robust and efficient framework provides a high-precision tool for large-scale mitochondrial proteomics.

(This article belongs to the Special Issue Advanced Research on Machine Learning Algorithms in Bioinformatics (2nd Edition))
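
The "signal dilution" argument in the abstract above can be illustrated with a toy calculation. The sketch below is not the authors' architecture: the function names, the 20-residue motif length, and the single-channel feature are all assumptions made for illustration. It only shows why a global average shrinks a fixed-length N-terminal signal as the sequence grows, while a pooled N-terminal window does not.

```python
def mean_pool(values):
    """Global average pooling over residue positions (single channel)."""
    return sum(values) / len(values)


def n_terminal_pool(values, window=20):
    """Average only the first `window` residues (hypothetical N-terminal branch)."""
    head = values[:window]
    return sum(head) / len(head)


def toy_sequence(length, motif_len=20, signal=1.0):
    """Toy per-residue feature: `signal` on the N-terminal motif, 0.0 elsewhere."""
    return [signal if i < motif_len else 0.0 for i in range(length)]


short_seq = toy_sequence(100)    # 20-residue motif in a 100-residue protein
long_seq = toy_sequence(1000)    # same motif in a 1000-residue protein

dilution = {
    "short_global": mean_pool(short_seq),      # motif is 20% of the sequence
    "long_global": mean_pool(long_seq),        # motif is 2% of the sequence
    "short_nterm": n_terminal_pool(short_seq), # window covers the motif exactly
    "long_nterm": n_terminal_pool(long_seq),   # unchanged by total length
}
```

In this toy setting the globally pooled value drops tenfold when the sequence is ten times longer, whereas the N-terminal window keeps the same value, which is the motivation the abstract gives for combining both views.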

14 pages, 3673 KB

Open Access Article

IMAGO: An Improved Model Based on Attention Mechanism for Enhanced Protein Function Prediction

by Meiling Liu, Longchang Liang, Qiutong Wang, Yunmeng Zhang, Lin Shi, Tianjiao Zhang and Zhenxing Wang

Biomolecules 2025, 15(12), 1667; https://doi.org/10.3390/biom15121667 - 29 Nov 2025

Viewed by 757

Abstract

Protein function prediction plays an important role in biology. With the wide application of deep learning in bioinformatics, more and more natural language processing (NLP) technologies are being applied to downstream bioinformatics tasks, and they have shown excellent performance in protein function prediction. Protein–protein interaction (PPI) networks and other biological attributes contain rich information critical for annotating protein functions. However, existing deep learning networks still suffer from overfitting and noise issues, resulting in low accuracy in protein function prediction. Consequently, developing efficient models for protein function prediction remains a popular and challenging topic in the application of NLP in bioinformatics. In this study, we propose a novel protein function prediction model based on attention mechanisms, termed IMAGO. This model employs the Transformer pre-training process, integrating multi-head attention mechanisms and regularization techniques, and optimizes the loss function to effectively reduce overfitting and noise issues during training. It generates more robust embeddings, ultimately improving the accuracy of protein function prediction. Experimental results on human and mouse datasets indicate that our model surpasses other protein function prediction models across multiple metrics. Thus, this efficient, stable, and accurate deep learning model holds significant promise for protein function prediction.

(This article belongs to the Section Bioinformatics and Systems Biology)

20 pages, 7231 KB

Open Access Article

Systematic Exploration of Small-Molecule Binding via a Large Language Model Trained on Textualized Protein–Ligand Interactions

by Taeseob Lee, Heehoon Jung, Ahnjae Jung, JaeWoong Min, Jong Hui Hong, Bin Claire Zhang and Jongsun Jung

Molecules 2025, 30(23), 4516; https://doi.org/10.3390/molecules30234516 - 22 Nov 2025

Viewed by 1453

Abstract

Emergent Large Language Models (LLMs) show impressive capabilities in performing a wide range of tasks. These models can be harnessed for biophysical use as well. The main challenge in this endeavor lies in transforming 3D chemical data into 1D language-like data. We developed a method to transform molecular data into language-like data and tokenize it for LLM use in a biophysical context. We then trained a model and validated it with a known protein–ligand complex. Using the pre-trained result, the model can assess the chemical properties of targets, detect shared binding properties and structures, and reveal related drugs. The model and the synthetic language to describe binding interactions uncovered novel protein–protein networks influenced by ligands, indicating functionally related yet previously unreported interactions.

(This article belongs to the Special Issue 30th Anniversary of Molecules—Recent Advances in Computational and Theoretical Chemistry)

25 pages, 1136 KB

Open Access Article

TruMPET: A New Method for Protein Secondary Structure Prediction Using Neural Networks Trained on Multiple Pre-Selected Physicochemical and Structural Features

by Yury V. Milchevskiy, Galina I. Kravatskaya and Yury V. Kravatsky

Int. J. Mol. Sci. 2025, 26(23), 11284; https://doi.org/10.3390/ijms262311284 - 21 Nov 2025

Viewed by 1301

Abstract

Protein structure prediction continues to pose multiple challenges, despite the progress made by ML. While recent deep learning models have achieved strong performance using embeddings from protein language models, they often ignore non-canonical amino acids and rely heavily on sequence alignments or evolutionary profiles. Here, we present an improvement to this approach for predicting protein secondary structure in DSSP classes solely from amino acid sequences. We suggest that ML feature sets should be generated from statistically significant, mutually uncorrelated descriptors. The selection of statistically assessed descriptors, including the prediction of physicochemical parameters for non-canonical amino acids, is a key component of the proposed method. The statistical significance of each suggested descriptor and its impact on model accuracy were assessed using a two-step Linear Discriminant Analysis. We applied the set of the 109 most influential statistically significant descriptors as a learning model for the two-layer Bi-LSTM network combined with ESMFold2 embeddings. Our method, TruMPET (Training upon Multiple Pre-selected Elements Technique), outperformed all other methods reported in the literature for the non-redundant datasets (CB513: DSSP Q3 = 91.36% and Q8 = 85.41%, TEST2018: DSSP Q3 = 90.64% and Q8 = 84.17%).

(This article belongs to the Special Issue Recent Research of Protein Structure Prediction and Design)
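
For readers unfamiliar with the Q3 and Q8 figures quoted in the abstract above, the following minimal sketch shows how such per-residue accuracies are typically computed. The 8-to-3 state reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and is an assumption; the abstract does not state which reduction the authors applied. The example strings are invented.

```python
# One common reduction of DSSP's eight states to three (assumed, not from the source).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",            # helices
    "E": "E", "B": "E",                      # strands and bridges
    "T": "C", "S": "C", "C": "C", "-": "C",  # turns, bends, coil
}


def q_accuracy(true_states, pred_states):
    """Percentage of residues whose predicted state equals the true state."""
    if len(true_states) != len(pred_states) or not true_states:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_states, pred_states))
    return 100.0 * matches / len(true_states)


def to_three_state(states):
    """Map an eight-state DSSP string to the three-state alphabet."""
    return "".join(DSSP8_TO_3[s] for s in states)


true8 = "HHHGTEEBSC"   # invented reference assignment
pred8 = "HHHHTEEESC"   # invented prediction: G->H and B->E differ in 8 states

q8 = q_accuracy(true8, pred8)
q3 = q_accuracy(to_three_state(true8), to_three_state(pred8))
```

Because the two mismatches in this toy pair fall within the same three-state class, Q3 is higher than Q8 here, which mirrors the gap between the Q3 and Q8 values reported in the abstract.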

16 pages, 579 KB

Open Access Article

IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation

by Donglin Zhang, Weixiang Shi, Boyuan Ma, Weiqing Min and Xiao-Jun Wu

Foods 2025, 14(21), 3697; https://doi.org/10.3390/foods14213697 - 30 Oct 2025

Viewed by 1377

Abstract

In recent years, food nutrition estimation has received growing attention due to its critical role in dietary analysis and public health. Traditional nutrition assessment methods often rely on manual measurements and expert knowledge, which are time-consuming and not easily scalable. With the advancement of computer vision, RGB-based methods have been proposed, and more recently, RGB-D-based approaches have further improved performance by incorporating depth information to capture spatial cues. While these methods have shown promising results, they still face challenges in complex food scenes, such as limited ability to distinguish visually similar items with different ingredients and insufficient modeling of spatial or semantic relationships. To solve these issues, we propose an Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation. The method introduces an ingredient-guided module that encodes ingredient information using a pre-trained language model and aligns it with visual features via cross-modal attention. At the same time, an internal semantic modeling component is designed to enhance structural understanding through dynamic positional encoding and localized attention, allowing for fine-grained relational reasoning. On the Nutrition5k dataset, our method achieves PMAE values of 12.2% for Calories, 9.4% for Mass, 19.1% for Fat, 18.3% for Carb, and 16.0% for Protein. These results demonstrate that our IGSMNet consistently outperforms existing baselines, validating its effectiveness.

(This article belongs to the Section Food Nutrition)
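
The PMAE values in the abstract above are percentage errors. A minimal sketch of the metric follows, assuming the definition commonly used with Nutrition5k: the mean absolute error divided by the mean ground-truth value, expressed as a percentage. The abstract itself does not spell out the formula, and the calorie numbers below are invented.

```python
def pmae(y_true, y_pred):
    """Percentage MAE: mean absolute error as a percent of the mean true value."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    return 100.0 * mae / mean_true


# Invented per-dish calorie values, for illustration only.
calories_true = [200.0, 400.0, 600.0]
calories_pred = [220.0, 380.0, 540.0]

calories_pmae = pmae(calories_true, calories_pred)
```

Normalizing by the mean ground-truth value makes errors comparable across quantities with very different scales, such as calories and grams of fat, which is why the abstract can list one percentage per nutrient.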

17 pages, 1782 KB

Open Access Article

Protein Language Models Expose Viral Immune Mimicry

by Dan Ofer and Michal Linial

Viruses 2025, 17(9), 1199; https://doi.org/10.3390/v17091199 - 31 Aug 2025

Cited by 1 | Viewed by 2595

Abstract

Viruses have evolved sophisticated solutions to evade host immunity. One of the most pervasive strategies is molecular mimicry, whereby viruses imitate the molecular and biophysical features of their hosts. This mimicry poses significant challenges for immune recognition, therapeutic targeting, and vaccine development. In this study, we leverage pretrained protein language models (PLMs) to distinguish between viral and human proteins. Our model enables the identification and interpretation of viral proteins that most frequently elude classification. We characterize these by integrating PLMs with explainable models. Our approach achieves state-of-the-art performance with ROC-AUC of 99.7%. The 3.9% of sequences that are misclassified are characterized by viral proteins with low immunogenicity. These errors disproportionately involve human-specific viral families associated with chronic infections and immune evasion, suggesting that both the immune system and machine learning models are confounded by overlapping biophysical signals. By coupling PLMs with explainable AI techniques, our work advances computational virology and offers mechanistic insights into viral immune escape. These findings carry implications for the rational design of vaccines and improved strategies to counteract viral persistence and pathogenicity.

(This article belongs to the Special Issue Herpesviruses and Associated Diseases)

20 pages, 5107 KB

Open Access Article

Enhancing Ferroptosis-Related Protein Prediction Through Multimodal Feature Integration and Pre-Trained Language Model Embeddings

by Jie Zhou and Chunhua Wang

Algorithms 2025, 18(8), 465; https://doi.org/10.3390/a18080465 - 25 Jul 2025

Viewed by 863

Abstract

Ferroptosis, an iron-dependent form of regulated cell death, plays a critical role in various diseases. Accurate identification of ferroptosis-related proteins (FRPs) is essential for understanding their underlying mechanisms and developing targeted therapeutic strategies. Existing computational methods for FRP prediction often exhibit limited accuracy and suboptimal performance. In this study, we harnessed the power of pre-trained protein language models (PLMs) to develop a novel machine learning framework, termed PLM-FRP, which utilizes deep learning-derived features for FRP identification. By integrating ESM2 embeddings with traditional sequence-based features, PLM-FRP effectively captures complex evolutionary relationships and structural patterns within protein sequences, achieving a remarkable accuracy of 96.09% on the benchmark dataset and significantly outperforming previous state-of-the-art methods. We anticipate that PLM-FRP will serve as a powerful computational tool for FRP annotation and facilitate deeper insights into ferroptosis mechanisms, ultimately advancing the development of ferroptosis-targeted therapeutics.

(This article belongs to the Special Issue Advanced Research on Machine Learning Algorithms in Bioinformatics)

15 pages, 2136 KB

Open Access Article

POSA-GO: Fusion of Hierarchical Gene Ontology and Protein Language Models for Protein Function Prediction

by Yubao Liu, Benrui Wang, Bocheng Yan, Haiyue Jiang and Yinfei Dai

Int. J. Mol. Sci. 2025, 26(13), 6362; https://doi.org/10.3390/ijms26136362 - 1 Jul 2025

Viewed by 1600

Abstract

Protein function prediction plays a crucial role in uncovering the molecular mechanisms underlying life processes in the post-genomic era. However, with the widespread adoption of high-throughput sequencing technologies, the pace of protein function annotation significantly lags behind that of sequence discovery, highlighting the urgent need for more efficient and reliable predictive methods. Existing methods ignore the hierarchical structure of Gene Ontology (GO) terms and struggle to dynamically associate protein features with functional contexts. To address these problems, we propose a novel protein function prediction framework, termed Partial Order-Based Self-Attention for Gene Ontology (POSA-GO). This cross-modal collaborative modelling approach fuses GO terms with protein sequences. The model leverages the pre-trained language model ESM-2 to extract deep semantic features from protein sequences. Meanwhile, it transforms the partial order relationships among GO terms into topological embeddings to capture their biological hierarchical dependencies. Furthermore, a multi-head self-attention mechanism is employed to dynamically model the association weights between proteins and GO terms, thereby enabling context-aware functional annotation. Comparative experiments on the CAFA3 and SwissProt datasets demonstrate that POSA-GO outperforms existing state-of-the-art methods in terms of Fmax and AUPR metrics, offering a promising solution for protein functional studies.

(This article belongs to the Special Issue New Computational Methodologies for Biomolecule Sequence, Structure and Function Discovery)
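
The Fmax metric named in the abstract above can be sketched as follows. This is a minimal illustration of the protein-centric Fmax used in CAFA-style evaluations, not the authors' evaluation code: precision is averaged over proteins with at least one prediction at a given threshold, recall is averaged over all proteins, and the best F-measure over thresholds is reported. The GO identifiers and scores are invented placeholders, and the sketch assumes every protein has at least one true term.

```python
def fmax(true_terms, scores, thresholds=None):
    """Protein-centric Fmax.

    true_terms: dict mapping protein id -> non-empty set of true GO terms
    scores:     dict mapping protein id -> dict of GO term -> score in [0, 1]
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, truth in true_terms.items():
            predicted = {term for term, s in scores.get(protein, {}).items() if s >= t}
            tp = len(predicted & truth)
            if predicted:  # precision counts only proteins with a prediction
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(truth))  # recall counts every protein
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Invented placeholder annotations and scores, for illustration only.
truth = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
preds = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:D": 0.6},
    "P2": {"GO:C": 0.8, "GO:A": 0.3},
}

toy_fmax = fmax(truth, preds)
```

In this toy case the best trade-off occurs at thresholds just above 0.3, where the low-scoring false positive for P2 is dropped while both true terms for P1 are still retained.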

26 pages, 5053 KB

Open Access Article

MTPrompt-PTM: A Multi-Task Method for Post-Translational Modification Prediction Using Prompt Tuning on a Structure-Aware Protein Language Model

by Ye Han, Fei He, Qing Shao, Duolin Wang and Dong Xu

Biomolecules 2025, 15(6), 843; https://doi.org/10.3390/biom15060843 - 9 Jun 2025

Cited by 2 | Viewed by 3948

Abstract

Post-translational modifications (PTMs) regulate protein function, stability, and interactions, playing essential roles in cellular signaling, localization, and disease mechanisms. Computational approaches enable scalable PTM site prediction; however, traditional models focus only on local sequence features from fragments around potential modification sites, limiting the scope of their predictions. Recently, pre-trained protein language models (PLMs) have improved PTM prediction by leveraging biological knowledge derived from extensive protein databases. However, most PLMs used for PTM site prediction are pre-trained solely on amino acid sequences, limiting their ability to capture the structural context necessary for accurate PTM site prediction. Moreover, these methods typically train separate single-task models for each PTM type, which hinders the sharing of common features and limits potential knowledge transfer across tasks. To overcome these limitations, we introduce MTPrompt-PTM, a multi-task PTM prediction framework developed by applying prompt tuning to a structure-aware protein language model (S-PLM). Instead of training several single-task models, MTPrompt-PTM trains one multi-task model to predict multiple types of PTM sites using shared feature extraction layers and task-specific classification heads. Additionally, we incorporate a knowledge distillation strategy to enhance the efficiency and generalizability of multi-task training. Experimental results demonstrate that MTPrompt-PTM outperforms state-of-the-art PTM prediction tools on 13 types of PTM sites, highlighting the advantages of multi-task learning and structural integration.

(This article belongs to the Special Issue Innovative Biomolecular Structure Analysis Techniques)

17 pages, 3121 KB

Open Access Feature Paper Article

Bio-Inspired Mamba for Antibody–Antigen Interaction Prediction

by Xuan Liu, Haitao Fu, Yuqing Yang and Jian Zhang

Biomolecules 2025, 15(6), 764; https://doi.org/10.3390/biom15060764 - 26 May 2025

Cited by 2 | Viewed by 2815

Abstract

Antibody lead discovery, crucial for immunotherapy development, requires identifying candidates with potent binding affinities to target antigens. Recent advances in protein language models have opened promising avenues to tackle this challenge by predicting antibody–antigen interactions (AAIs). Despite their appeal, precisely detecting binding sites (i.e., paratopes and epitopes) within the complex landscape of long-sequence biomolecules remains challenging. Herein, we propose MambaAAI, a bio-inspired model built upon the Mamba architecture, designed to predict AAIs and identify binding sites through selective attention mechanisms. Technically, we employ ESM-2, a pre-trained protein language model, to extract evolutionarily enriched representations from input antigen and antibody sequences, which are modeled as residue-level interaction matrices. Subsequently, a dual-view Mamba encoder is devised to capture important binding patterns by dynamically learning embeddings of interaction matrices from both antibody and antigen perspectives. Finally, the learned embeddings are decoded using a multilayer perceptron to output interaction probabilities. MambaAAI provides a unique advantage, relative to prior techniques, in dynamically selecting bio-enhancing residue sites that contribute to AAI prediction. We evaluate MambaAAI on two large-scale antibody–antigen neutralization datasets, and in silico results demonstrate that our method marginally outperforms the state-of-the-art baselines in terms of prediction accuracy, while maintaining robust generalization to unseen antibodies and antigens. In further analysis of the selective attention mechanism, we found that MambaAAI successfully uncovers critical epitope and paratope regions in the SARS-CoV-2 antibody examples. We believe that MambaAAI holds great potential to discover lead candidates targeting specific antigens at a lower burden.

(This article belongs to the Special Issue Computational Intelligence in Structure and Function Prediction and Modeling of Proteins—2nd Edition)

17 pages, 3804 KB

Open Access Article

LPBERT: A Protein–Protein Interaction Prediction Method Based on a Pre-Trained Language Model

by An Hu, Linai Kuang and Dinghai Yang

Appl. Sci. 2025, 15(6), 3283; https://doi.org/10.3390/app15063283 - 17 Mar 2025

Cited by 3 | Viewed by 3547

Abstract

The prediction of protein–protein interactions is a key task in proteomics. Since protein sequences are easily available and understandable, they have become the primary data source for predicting protein–protein interactions. With the development of natural language processing technology, language models have become a research hotspot in recent years, and protein language models have also been developed accordingly. Compared with single-encoding methods, such as Word2Vec and one-hot, language models specifically designed for proteins are expected to extract more comprehensive information from sequences, thereby enhancing the performance of protein–protein interaction prediction methods. Inspired by the protein language model ProteinBERT, this study designed the LPBERT deep learning framework, which is a novel end-to-end deep learning architecture. LPBERT, which is based on ProteinBERT, combines Convolutional Neural Networks, Transformer encoders, and Bidirectional Long Short-Term Memory networks to achieve efficient prediction. Upon evaluation using the BioGRID H. sapiens and S. cerevisiae datasets, LPBERT outperformed other comparison methods, achieving accuracies of 98.93% and 97.94%, respectively. Moreover, it also demonstrated good performance on multiple other datasets. These experimental results indicate that LPBERT performed excellently in protein–protein interaction prediction tasks, thereby substantiating the effectiveness of introducing protein language models in this field.

18 pages, 646 KB

Open Access Article

GraphPhos: Predict Protein-Phosphorylation Sites Based on Graph Neural Networks

by Zeyu Wang, Xiaoli Yang, Songye Gao, Yanchun Liang and Xiaohu Shi

Int. J. Mol. Sci. 2025, 26(3), 941; https://doi.org/10.3390/ijms26030941 - 23 Jan 2025

Cited by 3 | Viewed by 3160

Abstract

Phosphorylation is one of the most common protein post-translational modifications. The identification of phosphorylation sites serves as the cornerstone for protein-phosphorylation-related research. This paper proposes a protein-phosphorylation site-prediction model based on graph neural networks named GraphPhos, which combines sequence features with structure features. Sequence features are derived from manual extraction and from pre-trained protein language models, and the structure feature is the secondary structure contact map calculated from the protein tertiary structure. These features are then innovatively applied to graph neural networks. By inputting the features of the entire protein sequence and its contact graph, GraphPhos achieves the goal of predicting phosphorylation sites along the entire protein. Experimental results indicate that GraphPhos improves the accuracy of serine, threonine, and tyrosine site prediction by at least 8%, 15%, and 12%, respectively, exhibiting an average 7% improvement in accuracy compared to individual amino acid category prediction models.

(This article belongs to the Special Issue New Advances in Protein Structure, Function and Design)

Search Results (27)