Genes
  • This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

28 December 2025

DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition

1 Department of Intelligent Technology, Tianjin Polytechnic University, Tianjin 300340, China
2 School of Artificial Intelligence, Tianjin University, No. 135 Yaguan Road, Haihe Education Park, Tianjin 300354, China
3 Key Laboratory of Systems Bioengineering (Ministry of Education), No. 135 Yaguan Road, Haihe Education Park, Tianjin 300354, China
4 School of Computer Science and Technology, Tianjin University, No. 135 Yaguan Road, Haihe Education Park, Tianjin 300354, China
Genes 2026, 17(1), 27; https://doi.org/10.3390/genes17010027 (registering DOI)
This article belongs to the Section Bioinformatics

Abstract

Background: Accurate recognition of promoter sequences in Escherichia coli is fundamental to understanding gene regulation and engineering synthetic biological systems. However, existing computational methods struggle to model long-range genomic dependencies and fine-grained local motifs simultaneously, particularly the degenerate −10 and −35 elements of σ70 promoters. To address this gap, we propose DNABERT2-CAMP, a hybrid deep learning framework that integrates global contextual understanding with high-resolution local motif detection for robust promoter identification. Methods: We constructed a balanced dataset of 8720 81-bp sequences, comprising experimentally validated promoters and negative sequences, drawn from RegulonDB, the literature, and the E. coli K-12 genome. Our model combines a pre-trained DNABERT-2 Transformer for global sequence encoding with a custom CAMP module (CNN-Attention-Mean Pooling) for local feature refinement. We evaluated performance using 5-fold cross-validation and an independent external test set, reporting standard metrics including accuracy, ROC AUC, and Matthews correlation coefficient (MCC). Results: DNABERT2-CAMP achieved 93.10% accuracy and 97.28% ROC AUC in cross-validation, outperforming existing methods, including DNABERT. On the independent test set, it generalized well (89.83% accuracy, 92.79% ROC AUC). Interpretability analyses confirmed biologically plausible attention over canonical promoter regions and CNN-identified AT-rich and −35-like motifs. Conclusions: DNABERT2-CAMP demonstrates that combining pre-trained Transformers with convolutional motif detection significantly improves promoter recognition accuracy and interpretability. This framework offers a powerful, generalizable tool for genomic annotation and synthetic biology applications.
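
For readers unfamiliar with the three metrics named in the abstract, the following is a minimal sketch of how accuracy, MCC, and ROC AUC can be computed for a binary promoter/non-promoter classifier. It is not the authors' evaluation code; the function names and the toy labels and scores are assumptions made purely for illustration, and ROC AUC is computed here by the pairwise-ranking definition rather than by any particular library.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 (promoter) and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of sequences classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not data from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # threshold at 0.5

print(accuracy(toy_labels, toy_preds))
print(mcc(toy_labels, toy_preds))
print(roc_auc(toy_labels, toy_scores))
```

Accuracy and MCC depend on the chosen decision threshold, whereas ROC AUC is computed from the raw scores, which is one reason the three are commonly reported together.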
