Search Results (11)

Search Parameters:
Keywords = DNABERT

17 pages, 2696 KB  
Article
BF-m7GPred: A Dual-Branch Feature Fusion Deep Learning Architecture for Identifying RNA N7-Methylguanosine Modification Sites
by Jiyu Chen, Xingyang Fan, Qiu Jie and Shutan Xu
Appl. Sci. 2026, 16(5), 2577; https://doi.org/10.3390/app16052577 - 7 Mar 2026
Viewed by 206
Abstract
RNA N7-methylguanosine (m7G) is an important post-transcriptional epigenetic modification that participates in key biological processes, including RNA processing, stability maintenance, and translational regulation. Medical research has shown that m7G modification and its related regulatory factors are closely related to many neurological diseases and tumors. The accurate prediction of m7G sites is thus critical for understanding their biological functions in diseases. In this work, we propose BF-m7GPred, a dual-branch deep learning framework that integrates single-nucleotide-level embeddings and motif-level embeddings for m7G modification site prediction. Our proposed context-aware module tokenizes RNA sequences using byte-pair encoding and encodes sequences with the pretrained foundation biological model DNABERT2. In parallel, the proposed feature fusion module transforms sequences into multiple feature matrices using multiple traditional encoders. We introduce a feature selection strategy tailored to the encoding characteristics of the two branches. On a benchmark dataset collected from m7G-Hub v2.0, BF-m7GPred achieves superior performance on the independent test set against existing methods. Furthermore, its generalization capability is validated through comparative experiments on 10 diverse RNA modification datasets. Full article
(This article belongs to the Special Issue Advances and Applications of Machine Learning for Bioinformatics)

22 pages, 2153 KB  
Article
Benchmark of Genomic Language Models on Human and Rice Genomic Tasks
by Xiaosheng Gao, Shunyao Wu and Weihua Pan
Appl. Sci. 2026, 16(4), 1745; https://doi.org/10.3390/app16041745 - 10 Feb 2026
Viewed by 473
Abstract
Genomic Language Models (GLMs), leveraging their vast parameter scales and the similarities between DNA sequences and natural languages, demonstrate immense potential in processing large-scale genomic data and elucidating gene regulation and evolutionary relationships. However, the cross-species generalization capability of large genomic models has not yet been systematically evaluated. This study addresses this critical gap by benchmarking five GLMs (DNABERT-2, GROVER, HyenaDNA, NT-V2, and AgroNT) and a CNN baseline model using human (Homo sapiens) and rice (Oryza sativa) genomes across four downstream tasks: promoter detection, transcription start site (TSS) scanning, species classification, and gene region identification, through both zero-shot testing and fine-tuning. During testing, factors such as hyperparameters, early stopping protocols, and computational resources were fixed to ensure fairness, enabling us to systematically evaluate their performance and cross-species generalization capabilities. The results were further analyzed from multiple mathematical and representational perspectives to provide a more rigorous and objective assessment of each model’s performance. The results show that AgroNT consistently leads on rice tasks, while NT-V2 and DNABERT-2 achieved the best overall performance in fine-tuning and zero-shot experiments, respectively. Although their pretraining data did not include plants, they demonstrate excellent performance on rice-related tasks thanks to cross-species pretraining that enhances their generalization ability across human–rice domains. This benchmark study offers guidance on selecting appropriate genomic language models based on task characteristics and provides insights for future development in this field. Full article

18 pages, 684 KB  
Article
DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition
by Hua-Lin Xu, Xiu-Jun Gong, Hua Yu and Ying-Kai Wang
Genes 2026, 17(1), 27; https://doi.org/10.3390/genes17010027 - 28 Dec 2025
Viewed by 478
Abstract
Background: Accurate recognition of promoter sequences in Escherichia coli is fundamental for understanding gene regulation and engineering synthetic biological systems. However, existing computational methods struggle to simultaneously model long-range genomic dependencies and fine-grained local motifs, particularly the degenerate −10 and −35 elements of σ70 promoters. To address this gap, we propose DNABERT2-CAMP, a novel hybrid deep learning framework designed to integrate global contextual understanding with high-resolution local motif detection for robust promoter identification. Methods: We constructed a balanced dataset of 8720 experimentally validated and negative 81-bp sequences from RegulonDB, literature, and the E. coli K-12 genome. Our model combines a pre-trained DNABERT-2 Transformer for global sequence encoding with a custom CAMP module (CNN-Attention-Mean Pooling) for local feature refinement. We evaluated performance using 5-fold cross-validation and an independent external test set, reporting standard metrics including accuracy, ROC AUC, and Matthews correlation coefficient (MCC). Results: DNABERT2-CAMP achieved 93.10% accuracy and 97.28% ROC AUC in cross-validation, outperforming existing methods including DNABERT. On an independent test set, it maintained strong generalization (89.83% accuracy, 92.79% ROC AUC). Interpretability analyses confirmed biologically plausible attention over canonical promoter regions and CNN-identified AT-rich/-35-like motifs. Conclusions: DNABERT2-CAMP demonstrates that synergistically combining pre-trained Transformers with convolutional motif detection significantly improves promoter recognition accuracy and interpretability. This framework offers a powerful, generalizable tool for genomic annotation and synthetic biology applications. Full article
(This article belongs to the Section Bioinformatics)

12 pages, 699 KB  
Article
Reaping the Fruits of LLM Pruning: Towards Small Language Models for Efficient Non-Coding Variant Effect Prediction
by Megha Hegde, Jean-Christophe Nebel and Farzana Rahman
Genes 2025, 16(11), 1358; https://doi.org/10.3390/genes16111358 - 10 Nov 2025
Viewed by 1174
Abstract
Background: Interpreting variant effects is essential for precision medicine. Large Transformer-based genomic language models (DNABERT-2, Nucleotide Transformer) capture patterns in coding DNA but scale poorly for non-coding variant prediction because attention complexity grows quadratically with sequence length. Evidence from natural language processing shows that pruning less informative layers can reduce model size and computational load without sacrificing accuracy. Methods: We systematically ablated each Transformer layer in DNABERT-2 and the Nucleotide Transformer to assess its contribution to variant prediction. By observing changes in performance, we built layer importance profiles and created pruned models by removing redundant layers. Pruned and full models were fine-tuned with identical hyperparameters using the Enformer eQTL causal variant dataset, a curated benchmark for non-coding variant effect prediction. Results: Layer ablation revealed that the importance of individual layers varies widely across models; some layers can be removed with little loss in performance while others are critical. After fine-tuning, pruned models achieved accuracy and area under the ROC curve comparable to full models. Additionally, pruned versions required substantially less training time and memory. Conclusions: Layer-wise pruning provides a principled strategy for developing compact genomic LLMs. By identifying and removing less critical layers, we produced leaner models that preserve predictive power while lowering computational demands. These efficient models demonstrate how insights from general LLM research can advance genomic variant interpretation and make large-scale non-coding analysis more accessible in research and clinical settings. This approach complements ongoing efforts to optimise Transformer architectures for genomic data. Full article
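The layer-ablation idea in this abstract can be illustrated with a toy sketch. This is not the authors' code: the "model" below is just a list of plain Python functions standing in for Transformer layers, and the names `run`, `layer_importance`, `prune`, and `toy_score` are hypothetical, introduced only for illustration. Each layer is removed in turn, the drop in a score is recorded as its importance, and layers whose removal costs nothing are pruned.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def run(layers: List[Layer], x: float) -> float:
    """Apply each layer in order to the input."""
    for layer in layers:
        x = layer(x)
    return x


def layer_importance(
    layers: List[Layer], score: Callable[[List[Layer]], float]
) -> List[float]:
    """Score drop observed when each layer is removed on its own."""
    baseline = score(layers)
    return [
        baseline - score(layers[:i] + layers[i + 1:])
        for i in range(len(layers))
    ]


def prune(
    layers: List[Layer], importance: List[float], tolerance: float
) -> List[Layer]:
    """Keep only layers whose removal costs more than `tolerance`."""
    return [layer for layer, drop in zip(layers, importance) if drop > tolerance]


# Toy stand-in for a layer stack (assumption): the middle layer is redundant.
toy_layers: List[Layer] = [lambda x: x + 1.0, lambda x: x * 1.0, lambda x: x * 2.0]
TARGET = run(toy_layers, 1.0)


def toy_score(layers: List[Layer]) -> float:
    """Hypothetical metric: higher is better, zero means the target is matched."""
    return -abs(run(layers, 1.0) - TARGET)


importance = layer_importance(toy_layers, toy_score)
pruned = prune(toy_layers, importance, tolerance=0.0)
print(importance, len(pruned))
```

In the study itself the score would be a validation metric such as accuracy or AUROC, and the pruned model is fine-tuned afterwards; the sketch only shows the ablate, rank, and remove steps.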
(This article belongs to the Section Technologies and Resources for Genetics)

16 pages, 2889 KB  
Article
Enhanced Viral Genome Classification Using Large Language Models
by Hemalatha Gunasekaran, Nesaian Reginal Wilfred Blessing, Umar Sathic and Mohammad Shahid Husain
Algorithms 2025, 18(6), 302; https://doi.org/10.3390/a18060302 - 22 May 2025
Cited by 2 | Viewed by 2510
Abstract
The classification of genomic sequences is a crucial area of research in the field of virology. This is due to the increasing number of outbreaks we have faced in recent times. We have a vast repository of genomic sequences from various species, including humans, animals, plants, bacteria, and viruses, which tend to mutate and form new variants or strains. In the realm of machine learning, several models are employed for genome sequence classification. Among these are traditional algorithms such as Random Forest (RF), K-nearest neighbors (KNNs), Decision Tree (DT), and Naive Bayes (NB), each offering unique advantages in handling genetic data. Additionally, deep learning models like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Bi-Directional LSTM networks are utilized for their robust capabilities in capturing complex patterns and dependencies within genomic sequences. In this study, we explored the application of Natural Language Processing (NLP) techniques to classify the genomic sequences. The focus of our research involves utilizing advanced large language models (LLMs) such as DNABERT, DNAGPT, and GENA LM, which are fine-tuned explicitly on the language of DNA. In this research, after a detailed analysis, we found that DNAGPT achieved an accuracy of 96%, which exceeds the performance of state-of-the-art machine learning and deep learning models. Full article
(This article belongs to the Special Issue Advanced Research on Machine Learning Algorithms in Bioinformatics)

17 pages, 3197 KB  
Article
Prediction of circRNA–miRNA Interaction Using Graph Attention Network Based on Molecular Attributes and Biological Networks
by Abdullah Almotilag, Murtada K. Elbashir, Mahmood A. Mahmood and Mohanad Mohammed
Processes 2025, 13(5), 1318; https://doi.org/10.3390/pr13051318 - 25 Apr 2025
Cited by 2 | Viewed by 1457
Abstract
(1) Background: Circular RNAs (circRNAs) are covalently closed single-stranded molecules that play crucial roles in gene regulation, while microRNAs (miRNAs), specifically mature microRNAs, are naturally occurring small molecules of non-coding RNA with 17-25-nucleotide sizes. Understanding circRNA–miRNA interactions (CMIs) can reveal new approaches for diagnosing and treating complex human diseases. (2) Methods: In this paper, we propose a novel approach for predicting CMIs based on a graph attention network (GAT). We utilized DNABERT to extract molecular features of the circRNA and miRNA sequences and role-based graph embeddings generated by Role2Vec to extract the CMI features. The GAT’s ability to learn complex node dependencies in biological networks provided enhanced performance over the existing methods and the traditional deep neural network models. (3) Results: Our simulation studies showed that our GAT model achieved accuracies of 0.8762 and 0.8837 on the CMI-9905 and CMI-9589, respectively. These accuracies were the highest among the other existing CMI prediction methods. Our GAT method also achieved the highest performance as measured by the precision, recall, F1-score, area under the receiver operating characteristic (AUROC) curve, and area under the precision–recall curve (AUPR). (4) Conclusions: These results reflect the GAT’s ability to capture the intricate relationships between circRNAs and miRNAs, thus offering an efficient computational approach for prioritizing potential interactions for experimental validation. Full article
(This article belongs to the Special Issue Computational Biology Approaches to Genome and Protein Analyses)

18 pages, 2944 KB  
Article
Risk Prediction of RNA Off-Targets of CRISPR Base Editors in Tissue-Specific Transcriptomes Using Language Models
by Kazuki Nakamae, Takayuki Suzuki, Sora Yonezawa, Kentaro Yamamoto, Taro Kakuzaki, Hiromasa Ono, Yuki Naito and Hidemasa Bono
Int. J. Mol. Sci. 2025, 26(4), 1723; https://doi.org/10.3390/ijms26041723 - 18 Feb 2025
Cited by 1 | Viewed by 2772
Abstract
Base-editing technologies, particularly cytosine base editors (CBEs), allow precise gene modification without introducing double-strand breaks; however, unintended RNA off-target effects remain a critical concern and are understudied. To address this gap, we developed the Pipeline for CRISPR-induced Transcriptome-wide Unintended RNA Editing (PiCTURE), a standardized computational pipeline for detecting and quantifying transcriptome-wide CBE-induced RNA off-target events. PiCTURE identifies both canonical ACW (W = A or T/U) motif-dependent and non-canonical RNA off-targets, revealing a broader WCW motif that underlies many unanticipated edits. Additionally, we developed two machine learning models based on the DNABERT-2 language model, termed STL and SNL, which outperformed motif-only approaches in terms of accuracy, precision, recall, and F1 score. To demonstrate the practical application of our predictive model for CBE-induced RNA off-target risk, we integrated PiCTURE outputs with the Predicting RNA Off-target compared with Tissue-specific Expression for Caring for Tissue and Organ (PROTECTiO) pipeline and estimated RNA off-target risk for each transcript showing tissue-specific expression. The analysis revealed differences among tissues: while the brain and ovaries exhibited relatively low off-target burden, the colon and lungs displayed relatively high risks. Our study provides a comprehensive framework for RNA off-target profiling, emphasizing the importance of advanced machine learning-based classifiers in CBE safety evaluations and offering valuable insights to inform the development of safer genome-editing therapies. Full article

14 pages, 998 KB  
Article
TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences
by Guohao Dong, Yuqian Wu, Lan Huang, Fei Li and Fengfeng Zhou
Genes 2024, 15(12), 1593; https://doi.org/10.3390/genes15121593 - 12 Dec 2024
Cited by 2 | Viewed by 3231
Abstract
Background/Objectives: Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences. Methods: We introduce TExCNN, a novel framework that integrates the pre-trained models DNABERT and DNABERT-2 to generate word embeddings for DNA sequences. We partitioned the DNA sequences into manageable segments and computed their respective embeddings using the pre-trained models. These embeddings were then utilized as inputs to our deep learning framework, which was based on a convolutional neural network. Results: TExCNN outperformed current state-of-the-art models, achieving an average R² score of 0.622, compared to the 0.596 score achieved by the DeepLncLoc model, which is based on the Word2Vec model and a text convolutional neural network. Furthermore, when the sequence length was extended from 10,500 bp to 50,000 bp, TExCNN achieved an even higher average R² score of 0.639. The prediction accuracy improved further when additional biological features were incorporated. Conclusions: Our experimental results demonstrate that the use of pre-trained models for word embedding generation significantly improves the accuracy of predicting gene expression. The proposed TExCNN pipeline performs optimally with longer DNA sequences and is adaptable for both cell-type-independent and cell-type-dependent predictions. Full article
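The segmentation step mentioned in this abstract (splitting a long sequence into pieces a pre-trained model can embed) might look roughly like the sketch below. The abstract does not state the segment length or whether segments overlap, so the non-overlapping split and the 500 bp length used here are assumptions, and `partition_sequence` is a hypothetical helper name.

```python
from typing import List


def partition_sequence(sequence: str, segment_length: int) -> List[str]:
    """Split a sequence into consecutive, non-overlapping segments.

    The final segment may be shorter than `segment_length`.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        sequence[start:start + segment_length]
        for start in range(0, len(sequence), segment_length)
    ]


# Placeholder 10,500 bp sequence; the 500 bp segment length is an assumption.
example_sequence = "ACGT" * 2625
segments = partition_sequence(example_sequence, segment_length=500)
print(len(segments))
```

Each segment would then be passed to the embedding model separately, and the per-segment embeddings would be stacked as input to the downstream network.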
(This article belongs to the Section Bioinformatics)

20 pages, 6077 KB  
Article
DeepDualEnhancer: A Dual-Feature Input DNABert Based Deep Learning Method for Enhancer Recognition
by Tao Song, Haonan Song, Zhiyi Pan, Yuan Gao, Huanhuan Dai and Xun Wang
Int. J. Mol. Sci. 2024, 25(21), 11744; https://doi.org/10.3390/ijms252111744 - 1 Nov 2024
Cited by 6 | Viewed by 3253
Abstract
Enhancers are cis-regulatory DNA sequences that are widely distributed throughout the genome. They can precisely regulate the expression of target genes. Since the features of enhancer segments are difficult to detect, we propose DeepDualEnhancer, a DNABert-based method that uses a multi-scale convolutional neural network and BiLSTM for enhancer identification. We first designed the DeepDualEnhancer method based only on the DNA sequence input. It mainly consists of a multi-scale convolutional neural network and a BiLSTM, which extract features from the DNABert output and the embedding, respectively. Meanwhile, we collected new datasets from the enhancer–promoter interaction field and designed the method DeepDualEnhancer-genomic, which takes DNA sequences and genomic signals as input and uses transformer sequence attention. Extensive comparisons of our method with 20 other excellent methods through 5-fold cross validation, ablation experiments, and an independent test demonstrated that DeepDualEnhancer achieves the best performance. We also found that including genomic signals improves enhancer recognition. Full article

19 pages, 3099 KB  
Article
DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks
by Xueyan Liu, Hongyan Zhang, Ying Zeng, Xinghui Zhu, Lei Zhu and Jiahui Fu
Genes 2024, 15(4), 404; https://doi.org/10.3390/genes15040404 - 26 Mar 2024
Cited by 4 | Viewed by 3425
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving an average accuracy of (96.57%, 95.82%) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer’s superior predictive performance, resulting in at least a (4.2%, 11.6%) relative reduction in average error rate. We utilized the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies for donor and acceptor sites of (94.89%, 94.25%). These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer’s excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified. Full article
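The "relative reduction in average error rate" quoted in this abstract is easy to misread as a difference in accuracy. A minimal sketch of the usual calculation follows, assuming error rate means one minus accuracy. The baseline accuracy of 96.00% is a made-up value for illustration, not a figure from the paper, and `relative_error_reduction` is a hypothetical helper name.

```python
def relative_error_reduction(model_accuracy: float, baseline_accuracy: float) -> float:
    """Fraction by which the model lowers the baseline's error rate.

    Error rate is taken as 1 - accuracy (an assumption about the metric).
    """
    model_error = 1.0 - model_accuracy
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline must have a non-zero error rate")
    return (baseline_error - model_error) / baseline_error


# 0.9657 is the donor-site accuracy reported in the abstract;
# 0.9600 is a hypothetical baseline accuracy chosen only for illustration.
reduction = relative_error_reduction(model_accuracy=0.9657, baseline_accuracy=0.9600)
print(f"{reduction:.1%}")
```

With these illustrative numbers, an accuracy gap of about half a percentage point corresponds to a relative error reduction of roughly 14%, which is why the two quantities should not be compared directly.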
(This article belongs to the Special Issue Advances and Applications of Machine Learning in Biomedical Genomics)

15 pages, 5336 KB  
Article
M6A-BERT-Stacking: A Tissue-Specific Predictor for Identifying RNA N6-Methyladenosine Sites Based on BERT and Stacking Strategy
by Qianyue Li, Xin Cheng, Chen Song and Taigang Liu
Symmetry 2023, 15(3), 731; https://doi.org/10.3390/sym15030731 - 15 Mar 2023
Cited by 18 | Viewed by 4255
Abstract
As the most abundant RNA methylation modification, N6-methyladenosine (m6A) could regulate asymmetric and symmetric division of hematopoietic stem cells and play an important role in various diseases. Therefore, the precise identification of m6A sites around the genomes of different species is a critical step to further revealing their biological functions and influence on these diseases. However, the traditional wet-lab experimental methods for identifying m6A sites are often laborious and expensive. In this study, we proposed an ensemble deep learning model called m6A-BERT-Stacking, a powerful predictor for the detection of m6A sites in various tissues of three species. First, we utilized two encoding methods, i.e., diribonucleotide index of RNA (DiNUCindex_RNA) and k-mer word segmentation, to extract RNA sequence features. Second, two encoding matrices together with the original sequences were respectively input into three different deep learning models in parallel to train three sub-models, namely residual networks with convolutional block attention module (Resnet-CBAM), bidirectional long short-term memory with attention (BiLSTM-Attention), and pre-trained bidirectional encoder representations from transformers model for DNA-language (DNABERT). Finally, the outputs of all sub-models were ensembled based on the stacking strategy to obtain the final prediction of m6A sites through the fully connected layer. The experimental results demonstrated that m6A-BERT-Stacking outperformed most of the existing methods based on the same independent datasets. Full article
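The k-mer word segmentation named in this abstract can be sketched in a few lines. The original DNABERT tokenizes sequences into overlapping k-mers, but the abstract does not give the value of k or the stride used in this work, so k = 3 and a stride of 1 are assumptions here, and `kmer_tokens` is a hypothetical helper name.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokens("AUGGCA", k=3)
print(" ".join(tokens))  # prints "AUG UGG GGC GCA"
```

The resulting space-separated "words" are what a BERT-style model would receive in place of natural-language tokens.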