3.1.1. Pretraining Data and Generalization
The pre-training dataset determines the coverage of a model’s initial embedding space, which in turn influences its generalization performance on the distribution of downstream tasks. In this benchmark, the involved models can be classified into three categories based on their pre-training datasets: GROVER and HyenaDNA, which were pre-trained on a single species; AgroNT, which was pre-trained on multiple species within a single domain (Edible Plants-related); NT-V2 and DNABERT-2, which were pre-trained across multiple domains and multiple species.
Table 4 and Table 5 report the MCC ranking of each model on each task under the zero-shot and fine-tuning settings, respectively.
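For reference, the binary form of the Matthews correlation coefficient (MCC) underlying these rankings can be sketched as follows. This is a minimal illustration rather than the benchmark's evaluation code: the function name `binary_mcc` and the toy labels are assumptions, and the multi-class tasks would require the generalized multi-class form of the metric.

```python
import math

def binary_mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the confusion
    # matrix is empty and the coefficient is otherwise undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy labels (hypothetical, not benchmark data).
print(binary_mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement
print(binary_mcc([1, 1, 0, 0], [1, 0, 1, 0]))  # chance-level agreement
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than accuracy.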
The results from the two groups of experimental tasks demonstrate that multi-domain and multi-species pre-trained models (DNABERT-2 and NT-V2) achieved the highest MCC in three tasks under both zero-shot and fine-tuning settings. Compared to single-species pre-trained models, they exhibited superior cross-species generalization after fine-tuning (DNABERT-2 and NT-V2 ranked in the top three in MCC across all tasks; although DNABERT-2’s average MCC was slightly lower than that of AgroNT, its overall performance was more stable). Notably, despite the complete absence of plant genomes in the pre-training data of these two models, they still outperformed single-species human-pre-trained models on rice (Oryza sativa) related tasks. In particular, NT-V2 even surpassed AgroNT in the rice TSS scanning task. This superior performance likely stems from the diverse genomic corpora in their multi-species pre-training datasets, which collectively construct a broader and more representative embedding space encompassing sequence evolutionary patterns from prokaryotes to eukaryotes and from lower to higher organisms, thereby significantly reducing the risk of overfitting to a single species.
To further illustrate the influence of pre-training data on the models' embedding spaces, this benchmark extracted embeddings from five models using 50,000 randomly sampled non-overlapping 1024 bp fragments each from the human and rice genomes. These embeddings were subsequently reduced to two-dimensional space via UMAP for visualization, as shown in Figure 4. The multi-species pre-trained models NT-V2 and DNABERT-2 were able to almost completely separate the two clusters of points, indicating that their embedding spaces successfully captured distinguishable deep cross-species patterns. AgroNT, as a single-domain multi-species pre-trained model, could also distinguish human from rice DNA reasonably well, but still exhibited substantial overlapping regions. In contrast, GROVER and HyenaDNA, which were pre-trained solely on the human genome, showed highly overlapping point clouds for the two species, highlighting the cross-species generalization limitations inherent to single-species pre-training.
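The fragment sampling step described above can be sketched as follows. This is one possible scheme under stated assumptions, not necessarily the procedure used in the benchmark: the function name `sample_nonoverlapping_fragments` is hypothetical, the genome is treated as a single contiguous coordinate range, and non-overlap is guaranteed by drawing windows from a fixed grid of 1024 bp bins. Embedding extraction and the UMAP projection are not shown.

```python
import random

def sample_nonoverlapping_fragments(genome_length, fragment_length=1024,
                                    n_fragments=50_000, seed=0):
    """Return sorted start positions of randomly chosen, non-overlapping windows.

    Windows are drawn without replacement from a grid of fragment_length-sized
    bins, so two selected windows can never overlap.
    """
    n_bins = genome_length // fragment_length
    if n_fragments > n_bins:
        raise ValueError("genome too short for the requested number of fragments")
    rng = random.Random(seed)
    chosen_bins = rng.sample(range(n_bins), n_fragments)
    return sorted(b * fragment_length for b in chosen_bins)

# Example with an illustrative genome length (assumed, not taken from the paper).
starts = sample_nonoverlapping_fragments(genome_length=370_000_000)
print(len(starts))
```

A real pipeline would also need to handle multiple chromosomes and skip windows dominated by unresolved bases (N), which this sketch ignores.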
3.1.2. Model Architecture and Scale
Model architecture, parameter scale, and related attributes are also key factors influencing final performance.
Table 6 presents the attributes of each model, including pretraining data, model architecture, tokenization strategies, etc.
In this benchmark study, DNABERT-2, GROVER, NT-V2, and AgroNT are based on the Transformer architecture. Its core multi-head self-attention mechanism captures contextual dependencies across the entire sequence, which is particularly important for identifying dispersed regulatory elements. However, the absolute positional encodings of the original Transformer (learned position embeddings or fixed sinusoidal encodings) are tied to the positions seen during training: learned embeddings have a fixed maximum position index, and sinusoidal encodings, although defined for any index, are never trained at positions beyond the pretraining length. Once the input sequence exceeds that length, the model encounters positional signals it was not trained on and its attention patterns tend to degrade, making stable length extrapolation difficult. Consequently, these models often struggle to reliably handle genomic data significantly longer than the sequences seen during pretraining, a common scenario in genomic tasks.
To address this issue, DNABERT-2 and NT-V2 modify the standard Transformer architecture in different ways. NT-V2 incorporates Rotary Positional Embeddings (RoPE) and Gated Linear Units (GLU) with Swish activation. RoPE rotates the query and key vectors by position-dependent angles (equivalently, it multiplies them by complex exponentials at a set of fixed frequencies), so that relative positional information is injected multiplicatively into the attention scores and the positional contribution to each score depends only on the relative distance between the two positions. This property supports the processing of sequences longer than those seen during pretraining. In contrast, DNABERT-2 adopts Attention with Linear Biases (ALiBi), which is mathematically equivalent to adding to the attention logits a fixed, head-specific bias that decreases linearly with the distance between query and key. This approach extrapolates to longer sequences without any learnable or fixed positional embeddings and is reported to be computationally cheaper than RoPE. The excellent performance of these two models in this benchmark may be closely related to these architectural improvements.
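The two positional schemes can be illustrated with a short NumPy sketch. It is a simplified, single-head illustration and not the implementation used in NT-V2 or DNABERT-2: the function names, the frequency base of 10000, the slope value, and the symmetric (bidirectional) form of the ALiBi bias are all assumptions made for this example.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (seq_len, dim) by position-dependent angles."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per feature pair
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def alibi_bias(seq_len, slope):
    """Bias added to attention logits for one head: -slope * |i - j|."""
    idx = np.arange(seq_len)
    return -slope * np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def rope_score(m, n):
    """Attention logit between q placed at position m and k placed at position n."""
    return float(rope(q, np.array([m]))[0] @ rope(k, np.array([n]))[0])

# Same offset (4) at very different absolute positions gives the same logit.
print(np.isclose(rope_score(3, 7), rope_score(103, 107)))
print(alibi_bias(4, slope=0.5))
```

The first printed check reflects the relative-position property discussed above: shifting both positions by the same amount leaves the rotated dot product unchanged. The ALiBi matrix simply penalizes distant query-key pairs, with no position embedding added to the inputs.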
A noteworthy phenomenon is the significant performance divergence between DNABERT-2 and NT-V2, both multi-species pretrained models. In zero-shot testing, DNABERT-2 achieved the highest average MCC, whereas NT-V2 showed only moderate to below-average performance. After fine-tuning, however, NT-V2 exhibited the largest gains and surpassed all other models to become the top performer in average MCC.
This divergence may be attributed to differences in pretraining design. DNABERT-2 has a smaller parameter count and shorter pretraining sequence lengths, potentially focusing more on local motifs, but with relatively limited room for improvement during fine-tuning. NT-V2, with a larger parameter count and a context length four times that of DNABERT-2, learns richer and more distributed representations during pretraining. These representations are difficult to fully utilize through simple average pooling in zero-shot testing and require task-specific fine-tuning to activate. Overall, DNABERT-2 excels in out-of-the-box usability, while NT-V2’s design is more conducive to achieving greater improvements through fine-tuning, reflecting the trade-off between immediate generalization performance and adaptation potential in multi-species pretraining.
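The average pooling mentioned above, which collapses per-token hidden states into one fixed-size sequence embedding, can be sketched as follows. This is a minimal NumPy illustration; the function name `mean_pool` and the use of an attention mask to exclude padding tokens are assumptions, since the exact pooling details are not specified here.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings over non-padding positions.

    hidden_states: (batch, seq_len, dim) array of per-token embeddings.
    attention_mask: (batch, seq_len) array with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Toy input: two real tokens followed by one padding token.
h = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
m = np.array([[1, 1, 0]])
print(mean_pool(h, m))  # [[2. 3.]]
```

Because every position contributes equally, information that a model spreads across many tokens or dimensions can be diluted by this operation, which is consistent with the interpretation given above.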
AgroNT has the largest parameter count in this benchmark (985 M) and consistently maintains an advantage on plant tasks in both zero-shot testing and fine-tuning; its performance on human tasks also improves to some extent after fine-tuning.
Unlike the models above, HyenaDNA replaces explicit dot-product attention with the Hyena operator, which is based on implicit long convolutions. The operator performs global convolutions with implicitly parameterized long filters (the filter values are generated by a small neural network rather than stored explicitly) and evaluates them with FFT-based convolution, reducing the O(n²) cost of the attention matrix to a quasi-linear O(n log n). This provides HyenaDNA with a significant computational advantage when processing ultra-long sequences (up to 1 M tokens), at which point training times for Transformer-based architectures typically become prohibitive. However, with only 1 M parameters, its representational capacity is severely limited, which likely explains its suboptimal performance on the relatively shorter sequence tasks examined in this study.
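The quasi-linear cost of a global convolution can be illustrated with a small NumPy sketch that compares an FFT-based causal convolution against the direct quadratic sum. This is not HyenaDNA code: the filter here is an explicit random array, whereas in the Hyena operator it is produced implicitly, and the function names are assumptions.

```python
import numpy as np

def fft_causal_conv(u, h):
    """Causal convolution y[t] = sum_{s <= t} h[s] * u[t - s], computed in O(n log n)."""
    n = len(u)
    size = 2 * n  # zero-padding makes the circular convolution equal the linear one
    spectrum = np.fft.rfft(u, n=size) * np.fft.rfft(h, n=size)
    return np.fft.irfft(spectrum, n=size)[:n]

def direct_causal_conv(u, h):
    """Same convolution evaluated term by term in O(n^2), for comparison."""
    n = len(u)
    return np.array([sum(h[s] * u[t - s] for s in range(t + 1)) for t in range(n)])

rng = np.random.default_rng(0)
u = rng.normal(size=64)   # input signal (one channel)
h = rng.normal(size=64)   # filter as long as the input
print(np.allclose(fft_causal_conv(u, h), direct_causal_conv(u, h)))
```

Both functions return the same values; only the FFT version scales to filters that span sequences of hundreds of thousands of positions.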
Finally, as a control, the non-pretrained CNN baseline model is trained from scratch using only task-specific data. Due to its limited receptive field, it struggles to capture the long-range dependencies commonly present in genomic sequences, resulting in significantly inferior performance compared to all pretrained models across all six tasks.
However, this baseline model still achieves relatively good performance on promoter detection tasks in both plants and humans. To investigate the reasons behind this phenomenon and to gain deeper insight into the biological signals captured by the CNN, we extracted the top 5 highest-activating 7-mer motifs from all filters in the first convolutional layer. The results are shown in Table 7 and Table 8.
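One way to obtain highest-activating 7-mers from first-layer filters is to score every possible 7-mer against each kernel, as sketched below. This is an illustration under several assumptions rather than the procedure used for Table 7 and Table 8: it assumes a kernel width of 7 over one-hot A/C/G/T input, ranks k-mers separately for each filter by pre-activation score (the bias term does not change the ranking), and uses hypothetical toy weights. The function name `top_kmers_per_filter` is also an assumption.

```python
import itertools
import numpy as np

BASES = "ACGT"

def top_kmers_per_filter(weights, top_n=5):
    """Rank all 4**k k-mers by their response to each first-layer filter.

    weights: (n_filters, 4, k) convolution kernels over one-hot A/C/G/T channels.
    Returns, per filter, a list of (kmer, score) sorted by decreasing score.
    """
    weights = np.asarray(weights, dtype=float)
    n_filters, n_bases, k = weights.shape
    assert n_bases == len(BASES)
    kmers = ["".join(p) for p in itertools.product(BASES, repeat=k)]
    base_idx = np.array([[BASES.index(b) for b in kmer] for kmer in kmers])
    # scores[f, m] = sum over positions p of weights[f, base of kmer m at p, p]
    scores = weights[:, base_idx, np.arange(k)].sum(axis=-1)
    result = []
    for f in range(n_filters):
        order = np.argsort(scores[f])[::-1][:top_n]
        result.append([(kmers[i], float(scores[f, i])) for i in order])
    return result

# Hypothetical toy filter that responds to a GC-box-like pattern.
toy = np.zeros((1, 4, 7))
for pos, base in enumerate("GGGCGGG"):
    toy[0, BASES.index(base), pos] = 1.0
print(top_kmers_per_filter(toy)[0][0])  # ('GGGCGGG', 7.0)
```

An alternative is to scan real sequences and record the windows that produce the largest activations, which restricts the ranking to 7-mers that actually occur in the data.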
In human promoters, the dominant motifs are GC-rich sequences, which are highly consistent with the structure of CpG islands and the widespread presence of Sp1 transcription factor binding sites (GC-box) in TATA-less promoters [29]. In contrast, plant promoters are predominantly characterized by CTG-rich sequences or pyrimidine-rich motifs [30]. These motifs mainly originate from filter 31 and closely match the plant-specific Y-patch (pyrimidine patch) element. The Y-patch frequently substitutes for or complements the TATA box in plant core promoters. These findings indicate that the CNN successfully learns species-specific core promoter features.
The strong performance of CNNs on promoter detection tasks can be attributed to three core mechanisms: local receptive fields, parameter sharing, and multi-scale feature extraction. These mechanisms align naturally with a key biological characteristic of DNA sequences, namely the presence of short yet biologically meaningful local patterns. This makes CNNs particularly effective at capturing common local regulatory elements in promoter regions (such as the GC-box and Y-patch).
Additionally, model architecture and tokenization strategies may influence task performance through information compression and representational capabilities. For example, according to Dotan et al. (2024), subword units generated by the BPE algorithm are typically associated with biologically conserved sequence patterns, making BPE highly suitable for tasks requiring precise pattern recognition [26]. However, due to the task design in this benchmark, no obvious performance differences attributable to tokenization strategy were observed. The influence of tokenization may primarily manifest in the final token count resulting from information compression. The vocabulary generated by the BPE algorithm achieves high compression ratios, effectively reducing computational and storage requirements. Non-overlapping k-mer tokenization divides a sequence of n nucleotides into roughly n/k fixed-length tokens, making it well suited to large-scale data processing [27]. Single-nucleotide tokenization provides the finest granularity and produces the longest token sequences for a given input length; it is therefore best paired with models of sub-quadratic computational complexity, such as HyenaDNA. This approach handles scenarios such as single-base mutations effectively and enables fine-grained annotation at single-base resolution, which is difficult to achieve with the other two tokenization methods.
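The effect of tokenization on sequence length can be made concrete with a short sketch comparing single-nucleotide and k-mer tokenization on a 1024 bp input. The function names and the choice k = 6 are assumptions for illustration; BPE is omitted because its output depends on a learned vocabulary.

```python
def single_nucleotide_tokens(seq):
    """One token per base: the longest possible token sequence."""
    return list(seq)

def kmer_tokens(seq, k=6, overlapping=False):
    """Fixed-length k-mers; non-overlapping by default (stride k), else stride 1.

    Trailing bases that do not fill a complete k-mer are dropped in this sketch.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

seq = "ACGT" * 256  # 1024 bp toy sequence
print(len(single_nucleotide_tokens(seq)))          # 1024
print(len(kmer_tokens(seq, k=6)))                  # 170
print(len(kmer_tokens(seq, k=6, overlapping=True)))  # 1019
```

Only the non-overlapping variant compresses the input by a factor of about k; overlapping k-mers keep nearly one token per base while enlarging the vocabulary.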
3.1.3. Genomic Structure and Task Characteristics
In this benchmark study, zero-shot and fine-tuning tasks are designed with identical standards across human and plant genomes (including sequence length, sampling strategy, positive/negative sample definitions, and label space), thereby eliminating design bias. Consequently, observed performance differences primarily stem from the inherent characteristics of human and rice genomes as well as biological differences in their genomic structural distributions. Overall, plant genomes exhibit greater dynamism and variability in sequence space due to frequent polyploidy and rapid structural variation, whereas animal genomes tend toward conservation and structural stability [31,32,33].
Table 9 and Table 10 list the genome size, gene density, and average gene length for each genome used in the experiments.
In both zero-shot and fine-tuning settings, human tasks consistently achieved higher overall MCC in promoter identification and transcription start site (TSS) scanning. This advantage is likely attributable to more mature annotation resources and highly conserved regulatory motifs in the human genome, which result in lower variance among positive samples in the local pattern space. By contrast, rice tasks face greater challenges due to substantially higher genomic diversity. For example, rice has 42,189 genes with a gene density of 112.66 genes/Mb, whereas humans have only 21,570 genes and a gene density of just 6.96 genes/Mb. This implies that models must search for functional signals across a much broader sequence space.
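As a quick consistency check on these figures, the genome size implied by each gene count and gene density can be computed directly. The sketch below uses only the numbers quoted above; the derived values are arithmetic consequences of those numbers, not figures taken from the tables, and the function name is an assumption.

```python
def implied_genome_size_mb(n_genes, genes_per_mb):
    """Genome size (Mb) implied by a gene count and a gene density."""
    return n_genes / genes_per_mb

rice_mb = implied_genome_size_mb(42_189, 112.66)
human_mb = implied_genome_size_mb(21_570, 6.96)
density_ratio = 112.66 / 6.96

print(round(rice_mb, 1))        # 374.5
print(round(human_mb, 1))       # 3099.1
print(round(density_ratio, 1))  # 16.2
```

In other words, rice packs roughly twice as many annotated genes into a genome about one-eighth the size, giving a gene density about 16 times higher than in human.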
In species classification tasks, performance between the two groups was roughly comparable overall, with Mammals classification showing slightly higher average performance than Poaceae. Note, however, that this comparison already reflects a broadening of the mammalian group from family-level to class-level taxonomy. The comparable result may arise from the relatively large evolutionary distances and more pronounced sequence pattern differences among the five plant species considered.
In the gene region identification task, rice tasks significantly outperformed human tasks under both zero-shot and fine-tuning conditions. This performance reversal stems from the greater discriminative power provided by the non-conserved nature of the rice genome. Rice genes are markedly shorter and more compact, with an average gene length of 2871 bp and an average intron length of 405 bp; in comparison, human genes average 64,434 bp in length with introns averaging 8213 bp. The high variation in intron length and sharper regional boundary signals in rice enable models to more accurately distinguish exons, introns, and intergenic regions. Conversely, the human genome contains a large number of ultra-long introns, introducing substantial amounts of non-functional sequence that dilutes regional boundary signals, effectively equivalent to injecting high-entropy noise into the sequence, thereby significantly increasing modeling difficulty.
Additionally, the confusion matrix heatmaps for the gene region identification task (Supplementary Figures S7, S8, S15 and S16) show that, across nearly all experimental groups, introns and intergenic regions are separated less clearly than exons, with confusion rates between these two categories significantly higher than those between exons and either of the other categories. A plausible explanation is the compositional similarity between introns and intergenic regions: both are non-coding, are often enriched in repetitive sequences, and are characterized by nucleotide composition biases and low-complexity patterns, while lacking the codon usage preferences and conserved coding signals specific to exons. Consequently, models have greater difficulty capturing discriminative features to separate these regions. In contrast, exons, due to their protein-coding function, typically exhibit more conserved motifs and periodic patterns (such as the triplet periodicity characteristic of coding sequences), making them more readily identifiable by the model.
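The pairwise confusion rates referred to above can be read off a confusion matrix as sketched below. The definition used here (mutual misclassifications divided by the total number of samples in the two classes) is one reasonable choice and an assumption of this example, as are the function name and the toy matrix, which is hypothetical and does not reproduce any result from the supplementary figures.

```python
import numpy as np

def pairwise_confusion_rate(cm, a, b):
    """Share of true-a and true-b samples that are predicted as the other class.

    cm[i, j] is the number of samples with true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    return (cm[a, b] + cm[b, a]) / (cm[a].sum() + cm[b].sum())

# Hypothetical counts; rows and columns are exon, intron, intergenic.
EXON, INTRON, INTERGENIC = 0, 1, 2
toy_cm = [[90, 6, 4],
          [5, 70, 25],
          [5, 30, 65]]

print(pairwise_confusion_rate(toy_cm, INTRON, INTERGENIC))  # 0.275
print(pairwise_confusion_rate(toy_cm, EXON, INTRON))        # 0.055
print(pairwise_confusion_rate(toy_cm, EXON, INTERGENIC))    # 0.045
```

In a matrix with this shape, the intron and intergenic classes absorb most of each other's errors while exons remain well separated, which is the pattern described above.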