Abstract
Background: Accurate recognition of promoter sequences in Escherichia coli is fundamental to understanding gene regulation and engineering synthetic biological systems. However, existing computational methods struggle to model long-range sequence dependencies and fine-grained local motifs at the same time, particularly the degenerate −10 and −35 promoter elements. To address this gap, we propose DNABERT2-CAMP, a hybrid deep learning framework that integrates global contextual understanding with high-resolution local motif detection for robust promoter identification.
Methods: We constructed a balanced dataset of 8720 81-bp sequences, comprising experimentally validated promoters and negative (non-promoter) sequences, drawn from RegulonDB, the literature, and the E. coli K-12 genome. Our model combines a pre-trained DNABERT-2 Transformer for global sequence encoding with a custom CAMP module (CNN-Attention-Mean Pooling) for local feature refinement. We evaluated performance using 5-fold cross-validation and an independent external test set, reporting standard metrics including accuracy, ROC AUC, and the Matthews correlation coefficient (MCC).
Results: DNABERT2-CAMP achieved 93.10% accuracy and 97.28% ROC AUC in cross-validation, outperforming existing methods, including DNABERT. On the independent test set, it generalized well (89.83% accuracy, 92.79% ROC AUC). Interpretability analyses showed biologically plausible attention over canonical promoter regions and CNN-identified AT-rich and −35-like motifs.
Conclusions: DNABERT2-CAMP demonstrates that combining a pre-trained Transformer with convolutional motif detection significantly improves both the accuracy and the interpretability of promoter recognition. The framework offers a generalizable tool for genomic annotation and synthetic biology applications.
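For readers less familiar with the reported metrics, the following is a minimal, illustrative sketch of how accuracy, MCC, and ROC AUC can be computed for a binary promoter/non-promoter classifier. It is not the evaluation code used in this study; the function names and the toy labels and scores are assumptions chosen only for illustration, and ROC AUC is computed here through its pairwise-ranking interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative).

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = promoter."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def roc_auc(y_true, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half. Quadratic in sample count, so suited to small examples only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration (not results from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]       # hypothetical predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

counts = confusion_counts(y_true, y_pred)
print(round(accuracy(*counts), 4))       # 0.6667
print(round(mcc(*counts), 4))            # 0.3333
print(round(roc_auc(y_true, scores), 4)) # 0.8889
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, and ROC AUC is independent of the decision threshold, which is why the three metrics are usually reported together.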