MDPI - Publisher of Open Access Journals

16 pages, 2889 KB

Open Access Article

Enhanced Viral Genome Classification Using Large Language Models

by Hemalatha Gunasekaran, Nesaian Reginal Wilfred Blessing, Umar Sathic and Mohammad Shahid Husain

Algorithms 2025, 18(6), 302; https://doi.org/10.3390/a18060302 - 22 May 2025

Cited by 1 | Viewed by 1081

The classification of genomic sequences is a crucial area of research in virology, driven by the increasing number of outbreaks in recent times. We have a vast repository of genomic sequences from various species, including humans, animals, plants, bacteria, and viruses, which tend to mutate and form new variants or strains. In machine learning, several models are employed for genome sequence classification. Among these are traditional algorithms such as Random Forest (RF), K-nearest neighbors (KNN), Decision Tree (DT), and Naive Bayes (NB), each offering unique advantages in handling genetic data. Additionally, deep learning models like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Bi-Directional LSTM networks are utilized for their robust capabilities in capturing complex patterns and dependencies within genomic sequences. In this study, we explored the application of Natural Language Processing (NLP) techniques to classify genomic sequences. Our research focuses on advanced large language models (LLMs) such as DNABERT, DNAGPT, and GENA LM, which are fine-tuned explicitly on the language of DNA. After a detailed analysis, we found that DNAGPT achieved an accuracy of 96%, which exceeds the performance of state-of-the-art machine learning and deep learning models.

(This article belongs to the Special Issue Advanced Research on Machine Learning Algorithms in Bioinformatics)

17 pages, 3197 KB

Open Access Article

Prediction of circRNA–miRNA Interaction Using Graph Attention Network Based on Molecular Attributes and Biological Networks

by Abdullah Almotilag, Murtada K. Elbashir, Mahmood A. Mahmood and Mohanad Mohammed

Processes 2025, 13(5), 1318; https://doi.org/10.3390/pr13051318 - 25 Apr 2025

Cited by 1 | Viewed by 770

Abstract

(1) Background: Circular RNAs (circRNAs) are covalently closed single-stranded molecules that play crucial roles in gene regulation, while microRNAs (miRNAs), specifically mature microRNAs, are naturally occurring small non-coding RNA molecules of 17 to 25 nucleotides in length. Understanding circRNA–miRNA interactions (CMIs) can reveal new approaches for diagnosing and treating complex human diseases. (2) Methods: In this paper, we propose a novel approach for predicting CMIs based on a graph attention network (GAT). We utilized DNABERT to extract molecular features of the circRNA and miRNA sequences and role-based graph embeddings generated by Role2Vec to extract the CMI features. The GAT’s ability to learn complex node dependencies in biological networks provided enhanced performance over the existing methods and the traditional deep neural network models. (3) Results: Our simulation studies showed that our GAT model achieved accuracies of 0.8762 and 0.8837 on the CMI-9905 and CMI-9589 datasets, respectively. These accuracies were the highest among existing CMI prediction methods. Our GAT method also achieved the highest performance as measured by the precision, recall, F1-score, area under the receiver operating characteristic (AUROC) curve, and area under the precision–recall curve (AUPR). (4) Conclusions: These results reflect the GAT’s ability to capture the intricate relationships between circRNAs and miRNAs, thus offering an efficient computational approach for prioritizing potential interactions for experimental validation.

(This article belongs to the Special Issue Computational Biology Approaches to Genome and Protein Analyzes)

18 pages, 2944 KB

Open Access Article

Risk Prediction of RNA Off-Targets of CRISPR Base Editors in Tissue-Specific Transcriptomes Using Language Models

by Kazuki Nakamae, Takayuki Suzuki, Sora Yonezawa, Kentaro Yamamoto, Taro Kakuzaki, Hiromasa Ono, Yuki Naito and Hidemasa Bono

Int. J. Mol. Sci. 2025, 26(4), 1723; https://doi.org/10.3390/ijms26041723 - 18 Feb 2025

Cited by 1 | Viewed by 1572

Abstract

Base-editing technologies, particularly cytosine base editors (CBEs), allow precise gene modification without introducing double-strand breaks; however, unintended RNA off-target effects remain a critical concern and are understudied. To address this gap, we developed the Pipeline for CRISPR-induced Transcriptome-wide Unintended RNA Editing (PiCTURE), a standardized computational pipeline for detecting and quantifying transcriptome-wide CBE-induced RNA off-target events. PiCTURE identifies both canonical ACW (W = A or T/U) motif-dependent and non-canonical RNA off-targets, revealing a broader WCW motif that underlies many unanticipated edits. Additionally, we developed two machine learning models based on the DNABERT-2 language model, termed STL and SNL, which outperformed motif-only approaches in terms of accuracy, precision, recall, and F1 score. To demonstrate the practical application of our predictive model for CBE-induced RNA off-target risk, we integrated PiCTURE outputs with the Predicting RNA Off-target compared with Tissue-specific Expression for Caring for Tissue and Organ (PROTECTiO) pipeline and estimated RNA off-target risk for each transcript showing tissue-specific expression. The analysis revealed differences among tissues: while the brain and ovaries exhibited relatively low off-target burden, the colon and lungs displayed relatively high risks. Our study provides a comprehensive framework for RNA off-target profiling, emphasizing the importance of advanced machine learning-based classifiers in CBE safety evaluations and offering valuable insights to inform the development of safer genome-editing therapies.
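
The ACW and WCW motifs mentioned in the abstract can be illustrated with a short sketch. This is not PiCTURE's implementation; the function name `cytosine_contexts` and the labels it returns are hypothetical, and the only assumption taken from the abstract is that W stands for A or T/U.

```python
# Hypothetical helper, not part of PiCTURE: label the trinucleotide context
# of every cytosine that has a neighbour on both sides.
W = "ATU"  # W = A or T/U, as defined in the abstract


def cytosine_contexts(seq):
    """Return (position, trinucleotide, label) for each internal C in seq."""
    seq = seq.upper()
    hits = []
    for i in range(1, len(seq) - 1):
        if seq[i] != "C":
            continue
        left, right = seq[i - 1], seq[i + 1]
        if right not in W:
            label = "other"
        elif left == "A":
            label = "ACW"       # canonical motif
        elif left in W:
            label = "WCW-only"  # matches WCW but not ACW (T/U on the left)
        else:
            label = "other"
        hits.append((i, seq[i - 1:i + 2], label))
    return hits


if __name__ == "__main__":
    for hit in cytosine_contexts("GACATCTGCG"):
        print(hit)
```

Because A is itself a W base, every ACW site also matches WCW; the sketch reports the more specific label so the two groups stay distinguishable.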

(This article belongs to the Special Issue Research Advances in the Bioinformatics of Genome Editing and Gene Function Analysis)

14 pages, 998 KB

Open Access Article

TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences

by Guohao Dong, Yuqian Wu, Lan Huang, Fei Li and Fengfeng Zhou

Genes 2024, 15(12), 1593; https://doi.org/10.3390/genes15121593 - 12 Dec 2024

Cited by 1 | Viewed by 2131

Abstract

Background/Objectives: Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences. Methods: We introduce TExCNN, a novel framework that integrates the pre-trained models DNABERT and DNABERT-2 to generate word embeddings for DNA sequences. We partitioned the DNA sequences into manageable segments and computed their respective embeddings using the pre-trained models. These embeddings were then utilized as inputs to our deep learning framework, which was based on a convolutional neural network. Results: TExCNN outperformed current state-of-the-art models, achieving an average R² score of 0.622, compared to the 0.596 score achieved by the DeepLncLoc model, which is based on the Word2Vec model and a text convolutional neural network. Furthermore, when the sequence length was extended from 10,500 bp to 50,000 bp, TExCNN achieved an even higher average R² score of 0.639. The prediction accuracy improved further when additional biological features were incorporated. Conclusions: Our experimental results demonstrate that the use of pre-trained models for word embedding generation significantly improves the accuracy of predicting gene expression. The proposed TExCNN pipeline performs optimally with longer DNA sequences and is adaptable for both cell-type-independent and cell-type-dependent predictions.
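
The segment-then-embed step described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the TExCNN code: the segment length is not given in the abstract, so `segment_len` is a free parameter, and `toy_embedding` is a hypothetical stand-in (base frequencies) for the DNABERT or DNABERT-2 embedding call.

```python
def split_into_segments(seq, segment_len):
    """Split seq into consecutive, non-overlapping segments.

    The last segment may be shorter than segment_len.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [seq[i:i + segment_len] for i in range(0, len(seq), segment_len)]


def toy_embedding(segment):
    """Stand-in for a pre-trained model: frequencies of A, C, G, T."""
    n = len(segment) or 1
    return [segment.count(base) / n for base in "ACGT"]


def embed_sequence(seq, segment_len, embed=toy_embedding):
    """Return one embedding vector per segment, in sequence order."""
    return [embed(s) for s in split_into_segments(seq.upper(), segment_len)]


if __name__ == "__main__":
    print(split_into_segments("ACGTACGTAC", 4))
    print(embed_sequence("AACC", 2))
```

Passing the embedder as an argument keeps the chunking logic independent of whichever pre-trained model produces the vectors; the resulting list of per-segment vectors is the kind of input a downstream convolutional network would consume.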

(This article belongs to the Section Bioinformatics)

20 pages, 6077 KB

Open Access Article

DeepDualEnhancer: A Dual-Feature Input DNABert Based Deep Learning Method for Enhancer Recognition

by Tao Song, Haonan Song, Zhiyi Pan, Yuan Gao, Huanhuan Dai and Xun Wang

Int. J. Mol. Sci. 2024, 25(21), 11744; https://doi.org/10.3390/ijms252111744 - 1 Nov 2024

Cited by 2 | Viewed by 2355

Abstract

Enhancers are cis-regulatory DNA sequences that are widely distributed throughout the genome. They can precisely regulate the expression of target genes. Since the features of enhancer segments are difficult to detect, we propose DeepDualEnhancer, a DNABert-based method that uses a multi-scale convolutional neural network and a BiLSTM for enhancer identification. We first designed the DeepDualEnhancer method based only on the DNA sequence input. It mainly consists of a multi-scale convolutional neural network and a BiLSTM, which extract features from the DNABert output and the embedding, respectively. Meanwhile, we collected new datasets from the enhancer–promoter interaction field and designed the method DeepDualEnhancer-genomic, which takes DNA sequences and genomic signals as input and incorporates transformer sequence attention. Extensive comparisons of our method with 20 other excellent methods through 5-fold cross validation, ablation experiments, and an independent test demonstrated that DeepDualEnhancer achieves the best performance. We also found that including genomic signals improves performance on the enhancer recognition task.

(This article belongs to the Special Issue New Computational Methodologies for Biomolecule Sequence, Structure and Function Discovery)

19 pages, 3099 KB

Open Access Article

DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks

by Xueyan Liu, Hongyan Zhang, Ying Zeng, Xinghui Zhu, Lei Zhu and Jiahui Fu

Genes 2024, 15(4), 404; https://doi.org/10.3390/genes15040404 - 26 Mar 2024

Cited by 2 | Viewed by 2565

Abstract

The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving average accuracies of 96.57% and 95.82%, respectively, across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer’s superior predictive performance, resulting in relative reductions in average error rate of at least 4.2% (donor) and 11.6% (acceptor). We utilized the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies of 94.89% and 94.25% for donor and acceptor sites, respectively. These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer’s excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
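
The "relative reduction in average error rate" quoted in the abstract can be made concrete with a small calculation. The sketch below assumes the usual definition, error rate = 100% minus accuracy and relative reduction = (baseline error minus model error) / baseline error; the abstract does not spell out its formula, and the accuracies in the usage example are hypothetical, not figures from the paper.

```python
def relative_error_reduction(acc_model, acc_baseline):
    """Relative reduction in error rate, in percent.

    Both arguments are accuracies in percent (0 to 100).
    """
    err_model = 100.0 - acc_model
    err_baseline = 100.0 - acc_baseline
    if err_baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (err_baseline - err_model) / err_baseline


if __name__ == "__main__":
    # Hypothetical numbers: a baseline at 90% accuracy and a model at 92%.
    # Error drops from 10% to 8%, i.e. a 20% relative reduction.
    print(relative_error_reduction(92.0, 90.0))
```

The example shows why a relative error reduction can look large even when the absolute accuracy gain is two percentage points: the reduction is measured against the baseline's remaining error, not against 100%.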

(This article belongs to the Special Issue Advances and Applications of Machine Learning in Biomedical Genomics)

15 pages, 5336 KB

Open Access Article

M6A-BERT-Stacking: A Tissue-Specific Predictor for Identifying RNA N6-Methyladenosine Sites Based on BERT and Stacking Strategy

by Qianyue Li, Xin Cheng, Chen Song and Taigang Liu

Symmetry 2023, 15(3), 731; https://doi.org/10.3390/sym15030731 - 15 Mar 2023

Cited by 15 | Viewed by 3626

Abstract

As the most abundant RNA methylation modification, N6-methyladenosine (m6A) could regulate asymmetric and symmetric division of hematopoietic stem cells and play an important role in various diseases. Therefore, the precise identification of m6A sites around the genomes of different species is a critical step to further revealing their biological functions and influence on these diseases. However, the traditional wet-lab experimental methods for identifying m6A sites are often laborious and expensive. In this study, we proposed an ensemble deep learning model called m6A-BERT-Stacking, a powerful predictor for the detection of m6A sites in various tissues of three species. First, we utilized two encoding methods, i.e., di ribonucleotide index of RNA (DiNUCindex_RNA) and k-mer word segmentation, to extract RNA sequence features. Second, the two encoding matrices and the original sequences were input in parallel into three different deep learning models to train three sub-models, namely residual networks with convolutional block attention module (Resnet-CBAM), bidirectional long short-term memory with attention (BiLSTM-Attention), and the pre-trained bidirectional encoder representations from transformers model for DNA language (DNABERT). Finally, the outputs of all sub-models were ensembled based on the stacking strategy to obtain the final prediction of m6A sites through the fully connected layer. The experimental results demonstrated that m6A-BERT-Stacking outperformed most of the existing methods based on the same independent datasets.
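
The k-mer word segmentation named in the abstract can be sketched in a few lines. This is an illustrative version, not the authors' code: the abstract does not state the value of k or the stride, so `k=3` and `stride=1` (overlapping k-mers) are assumptions, and `kmer_tokens` is a hypothetical name.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split a nucleotide sequence into k-mer 'words'.

    With stride=1 the k-mers overlap by k-1 bases; sequences shorter
    than k yield an empty list.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    print(kmer_tokens("AUGGC", k=3))
    print(kmer_tokens("AUGGCA", k=3, stride=3))
```

Treating each k-mer as a word is what lets a language model trained on text-like token streams be applied to nucleotide sequences.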

(This article belongs to the Special Issue Symmetry/Asymmetry in Bioinformatics: Image Understanding and Language Modeling)

Search Results (7)
