Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA

Abstract: Acquiring meaningful representations of gene expression is essential for the accurate prediction of downstream regulatory tasks, such as identifying promoters and transcription factor binding sites. However, the current dependency on supervised learning, constrained by the limited availability of labeled genomic data, impedes the ability to develop robust predictive models with broad generalization capabilities. In response, recent advancements have pivoted towards the application of self-supervised training for DNA sequence modeling, enabling the adaptation of pre-trained genomic representations to a variety of downstream tasks. Departing from the straightforward application of masked language learning techniques to DNA sequences, approaches such as MoDNA enrich genome language modeling with prior biological knowledge. In this study, we advance DNA language models by utilizing the Motif-oriented DNA (MoDNA) pre-training framework, which is established for self-supervised learning at the pre-training stage and is flexible enough for application across different downstream tasks. MoDNA distinguishes itself by efficiently learning semantic-level genomic representations from an extensive corpus of unlabeled genome data, offering a significant improvement in computational efficiency over previous approaches. The framework is pre-trained on a comprehensive human genome dataset and fine-tuned for targeted downstream tasks. Our enhanced analysis and evaluation in promoter prediction and transcription factor binding site prediction have further validated MoDNA's exceptional capabilities, emphasizing its contribution to advancements in genomic predictive modeling.


Introduction
The exploration of non-coding regulatory genome regions, aiming to uncover the functionality of regulatory elements in gene expression coordination, has emerged as a focal point in recent research endeavors [1][2][3]. Remarkably, in humans, a mere fraction of DNA is dedicated to coding proteins, leaving the vast majority, approximately 98%, as non-coding DNA [4]. This non-coding segment harbors sequences pivotal to regulatory elements, which orchestrate the activation and suppression of genes across various contexts. These elements serve as docking sites for specialized proteins, thereby modulating gene expression [5]. The advent of deep learning computational methodologies has notably enhanced the predictive analysis of non-coding regulatory elements [6,7]. Intriguingly, researchers have identified linguistic-like statistical characteristics within non-coding DNA sequences, drawing parallels to natural languages [8]. This similarity between non-coding DNA sequences and human language has paved the way for the application of language models in decoding the cryptic language of non-coding DNA [9,10].
Despite these advancements, significant challenges remain in effectively modeling and learning from DNA. First, the inherent limitation of CNNs in capturing long-range genomic dependencies poses a challenge. Second, the reliance on supervised learning constrains the models' generalization capabilities, tethering them to specific tasks defined by labeled data. This results in most sequence models being trained on a narrow scope of genome scenarios, leaving them susceptible to overfitting due to the scarcity of labeled datasets. Addressing these challenges necessitates an innovative approach to learning representations from unlabeled DNA sequences, capturing the intricate contextual information, and applying this knowledge across a spectrum of downstream tasks. This calls for a paradigm shift towards more adaptable and generalizable models, capable of transcending the limitations of current methodologies and unlocking new frontiers in genomic research.
Recent advancements by Ji et al. [23] have led to the adaptation of the Bidirectional Encoder Representations from Transformers (BERT) model [24,25] to genomic DNA settings, resulting in the development of DNABERT. This approach utilizes unlabeled human genome data to forge a versatile model capable of spanning a wide array of sequence-related tasks in a unified manner. While DNABERT harnesses the expansive genome data to capture a broad representation of DNA language, it may not fully integrate domain-specific knowledge, which is pivotal for a nuanced understanding of genomic functions. Directly applying BERT's methodology to DNA sequences, treating them as analogous to human language, may overlook the unique structure and function inherent in genomic data. GeneBERT [26], another significant contribution to the field, introduces a multi-modal model pre-trained on millions of DNA sequences from the ATAC-seq dataset [27]. By treating different cell-type motif position weight matrices (PWMs) akin to image regions, GeneBERT attempts to blend sequence and structural data in its training process. However, this multi-modal approach may encounter limitations when applied to downstream tasks that solely involve sequence data, potentially restricting the model's applicability to purely sequence-based learning scenarios.
Our work builds on our previous MoDNA framework [28], presenting a more comprehensive evaluation. We integrate common DNA functional motifs as essential domain knowledge. The eukaryotic genome is replete with structured patterns such as repetitive elements, protein-binding sites, and splice sites [29], many of which manifest as motifs: short, recurring sequences indicative of specific protein-binding functions [30]. MoDNA leverages this concept by not only focusing on predicting masked k-mers but also incorporating a motif prediction task to enrich DNA representation learning. Through this dual-prediction approach in its self-supervised pre-training phase, MoDNA utilizes functional motifs to refine DNA representation embeddings, enhancing the model's biological relevance and predictive accuracy. In this study, we conduct a comprehensive analysis to demonstrate MoDNA's superior performance across various datasets, solidifying its effectiveness in genomic predictive modeling.
Our contributions with MoDNA are threefold: (1) By embedding domain knowledge into the learning process, MoDNA not only captures semantic representations of DNA but also accurately predicts motif occurrences, enriching the model with deep semantic insights from genomic sequences. (2) MoDNA marks the inaugural application of the ELECTRA [31] framework to DNA sequence representation, showcasing superior computational efficiency and performance metrics compared with BERT-based models. (3) Leveraging extensive unlabeled genome data, MoDNA demonstrates notable improvements in promoter prediction and transcription factor binding site identification tasks, setting new benchmarks against existing state-of-the-art models.

Advancements in Self-Supervised Pre-Training for Natural Language Processing
The domain of Natural Language Processing (NLP) has been profoundly transformed by the advent of pre-trained language models, which adhere to a paradigm encompassing both pre-training and subsequent fine-tuning. Notable models such as GPT, BERT [24,32], RoBERTa [33], and ELECTRA [31] have demonstrated exemplary performance across a multitude of NLP tasks. These models employ self-supervised learning techniques on extensive corpora of unlabeled text, effectively circumventing the limitations inherent in supervised learning methodologies that necessitate laboriously annotated datasets. This innovative approach has been extended to the realm of biomedical language models, thereby facilitating the application of these advanced computational techniques to the analysis of genomic sequences [34,35].

Comparative Analysis of BERT and ELECTRA
The BERT model, in particular, has gained recognition for its capacity to leverage unlabeled data through the implementation of masked language learning alongside next-sentence prediction tasks, thereby enhancing the model's performance [36]. Nonetheless, the BERT framework is characterized by certain limitations, including a disproportionate focus on masked tokens to the detriment of non-masked tokens, a considerable computational demand owing to the model's reliance on a subset of input tokens for learning, and a notable discrepancy between the conditions prevalent in the pre-training and fine-tuning stages.
Conversely, the ELECTRA model proposes a novel solution to these challenges by adopting a generator-discriminator architecture, which optimizes the pre-training strategy. Within this framework, the generator fulfills a role akin to that of BERT by predicting masked tokens, whereas the discriminator evaluates the veracity of each token within the sequence, thereby enhancing the model's efficiency and reducing computational overhead. Furthermore, ELECTRA addresses the pre-training and fine-tuning discrepancy by ensuring consistency in the input tokens throughout both stages, thereby facilitating the model's adaptation to a diverse array of downstream tasks.
In the context of the present study, the ELECTRA model has been selected as the foundational framework, predicated on the hypothesis that DNA sequences can be effectively conceptualized and represented as sequential entities akin to linguistic texts. To the best of the authors' knowledge, this constitutes the inaugural application of the ELECTRA model to the domain of DNA sequence representation, thereby underscoring the model's versatility and potential applicability in the field of genomic analysis.

Tokenization of DNA Sequences
Despite the apparent parallels between DNA sequences and human language, the representation of DNA necessitates a bespoke approach. Traditional methods such as one-hot encoding, while straightforward, fail to encapsulate the complex semantic relationships inherent within DNA sequences. In contrast, the k-mer-based tokenization approach offers a more sophisticated alternative by segmenting sequences into overlapping substrings of length k. This methodology not only accommodates the canonical nucleotides (cytosine [C], guanine [G], adenine [A], and thymine [T]) but also incorporates special tokens ([CLS], [SEP], [PAD], [MASK], [UNK]) to fulfill various structural and functional roles within the modeling framework, thereby enriching the representational capacity of the model with respect to DNA sequences.
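As a concrete illustration, overlapping k-mer tokenization with the special tokens listed above can be sketched as follows (a minimal sketch; the exact vocabulary ordering and sequence framing used by any particular model are assumptions):

```python
from itertools import product

def build_vocab(k):
    """All 4^k possible k-mers plus the five special tokens."""
    special = ["[CLS]", "[SEP]", "[PAD]", "[MASK]", "[UNK]"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return special + kmers

def tokenize(seq, k):
    """Split a DNA sequence into overlapping k-mers, framed by [CLS]/[SEP]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return ["[CLS]"] + kmers + ["[SEP]"]

vocab = build_vocab(6)
tokens = tokenize("TTGGAAAT", 6)
# "TTGGAAAT" yields the overlapping six-mers TTGGAA, TGGAAA, GGAAAT
```

Note that adjacent k-mers overlap in k − 1 nucleotides, which is what lets the tokenization preserve local sequence context.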

DNA Motifs
The intricate process of gene expression, particularly the transcription phase, is facilitated by the interaction of transcription factors with specific DNA sequences, known as enhancers and promoters. Within this biological context, DNA motifs, short, conserved sequence patterns endowed with significant biological functions, play a pivotal role [37,38]. These motifs, which may signify protein-binding sites or other functional elements, are typically represented through Position Weight Matrices (PWMs) [39], providing a lucid and interpretable framework for their analysis. The MEME Suite [40], a comprehensive toolkit for motif-based sequence analysis, exemplifies the application of maximum likelihood algorithms for motif discovery in both DNA and protein sequences, thereby highlighting the synergistic relationship between computational modeling and biological research.
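To make the PWM representation concrete, the sketch below scores a sequence window against a toy PWM via log-odds against a uniform background (the matrix values here are invented for illustration; real PWMs are estimated from aligned motif occurrences by tools such as the MEME Suite):

```python
import numpy as np

# A toy PWM for a length-4 motif: rows are positions, columns are
# probabilities for A, C, G, T (each row sums to 1).
pwm = np.array([
    [0.80, 0.10, 0.05, 0.05],  # position 1 strongly prefers A
    [0.10, 0.70, 0.10, 0.10],  # position 2 prefers C
    [0.05, 0.05, 0.85, 0.05],  # position 3 prefers G
    [0.25, 0.25, 0.25, 0.25],  # position 4 is uninformative
])
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_log_odds(seq, pwm, background=0.25):
    """Log-odds score of a sequence window against the PWM."""
    return sum(np.log2(pwm[i, BASE[b]] / background) for i, b in enumerate(seq))

score_match = pwm_log_odds("ACGT", pwm)
score_mismatch = pwm_log_odds("TTTT", pwm)
# a window matching the preferred bases scores higher than a mismatching one
```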

Methods
In this study, we introduce the MoDNA framework, which comprises two principal components: a generator and a discriminator. Both components are constructed using Transformer-based neural networks. This section elaborates on the architecture employed in our pre-training phase, the self-supervised tasks designed for model learning, and the subsequent fine-tuning process for genomic applications.

Model Architecture
The architecture of MoDNA is anchored by two networks: the generator and the discriminator. Each is structured with multiple Bidirectional Transformer Encoder blocks, as illustrated in Figure 1. We denote the number of Transformer blocks as L and the size of the hidden layers as H. A shared embedding layer forms the initial stage of both networks, incorporating token embeddings, token-type embeddings, and positional embeddings. This layer is pivotal in enabling the model to interpret the position and type of each token within a sequence. The Transformer blocks utilize a scaled dot-product attention mechanism, which is instrumental in directing the model's focus throughout the input data. The attention process is executed by transforming the input into query (Q), key (K), and value (V) matrices and computing the attention scores as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where d_k represents the dimensionality of the keys and queries. This mechanism allows the model to prioritize relevant parts of the input. The generator is tasked with masked genome modeling (MGM), operating in conjunction with the discriminator to refine its predictive capabilities.
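The attention computation described above can be sketched directly in a few lines (a standalone NumPy illustration of single-head attention, not the model's actual implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# each query's attention weights form a proper distribution over the keys
```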

Pre-Training Strategy
The pre-training of MoDNA involves a meticulously crafted two-step process, depicted in Figure 2. Initially, DNA sequence k-mers are randomly masked with a [MASK] token. The generator undertakes the challenge of predicting these masked segments. Concurrently, it generates new samples to substitute the masked tokens. The discriminator is then trained to discern whether each token in the sequence is original or has been replaced. This discrimination task is crucial for the model's learning, as it enhances the understanding of DNA sequence structures. The process begins with the random masking of input DNA sequence k-mers, with x_2 denoting a masked token. DNA k-mer tokens, along with special tokens, are constructed into a sequence of DNA tokens. These tokens are input into the generator, which aims at two main objectives: predicting the masked genomic sequences and identifying motif patterns. The generator also produces a sample x̂_2 to substitute the masked token [MASK]. This modified sequence, combined with the unaltered tokens, is then processed by the discriminator, which is trained to detect replaced tokens and, given the motif occurrence labels, to predict the presence of motifs. In the fine-tuning pipeline of MoDNA, the pre-trained discriminator's weights serve as the starting point, and an additional multilayer perceptron is integrated to specialize the model for various downstream tasks.

Discriminator Classifier
A significant aspect of our pre-training is the development of self-supervision tasks. Given the frequent occurrence of motifs, short, biologically significant patterns, in DNA sequences, we designed tasks that enable the model to recognize these motifs. We approach motif identification as a multi-label classification challenge, allowing the model to predict multiple motifs within a sequence simultaneously.

Genome Token Prediction
During the pre-training phase, our approach adopts the k-mer representation for unlabeled pre-training data, akin to the strategy employed by DNABERT. To illustrate, consider a DNA sequence TTGGAAAT; its six-mer representations would include sequences like TTGGAA, TGGAAA, and GGAAAT. The k-mer vocabulary encompasses all possible k-mer permutations, augmented by five special tokens. We transform the k-mer sequence x = [x_1, ..., x_n] into a tokenization embedding E, aligned with a comprehensive 4^k + 5 token vocabulary.
In alignment with the ELECTRA framework, our methodology entails the concurrent training of a generator and a discriminator. Prior to the integration of the k-mer sequence x into the tokenization embedding, we introduce randomness by masking six consecutive positions within the six-mer sequence x, serving as the input for the generator. We denote these random mask positions as c = [c_1, ..., c_t], and accordingly, the tokens at these positions are substituted with [MASK] tokens, as demonstrated in the following equation:

x^masked_i = [MASK] if i ∈ c, and x^masked_i = x_i otherwise.

The generator, upon receiving the masked input x^masked, encodes it into a contextual embedding and undertakes masked genome modeling (MGM) to infer the original identities of the masked tokens. The MGM process is guided by the following loss function:

L_MGM = E[ −Σ_{i∈c} log p_G(x_i | x^masked) ].

Subsequent to MGM, the replacements are sampled from the prediction output p_G to establish a sample distribution, as follows:

x̂_i ∼ p_G(x_i | x^masked), for i ∈ c.

The discriminator's inputs are then formulated by substituting the masked tokens with the generated samples x̂, as follows:

x^R_i = x̂_i if i ∈ c, and x^R_i = x_i otherwise.

The discriminator D plays a pivotal role in discerning the origin of each token x^R_t within the sequence x^R, determining whether it is derived from the original sequence or is synthetically generated by the generator. This discrimination task is quantified through the following loss function:

L_Disc = E[ −Σ_t ( 1(x^R_t = x_t) log D(x^R, t) + 1(x^R_t ≠ x_t) log(1 − D(x^R, t)) ) ].

This structured approach to genome token prediction underpins our pre-training strategy, facilitating a robust foundation for the subsequent fine-tuning on genomic tasks. Through the meticulous design of our generator and discriminator, we ensure that our model not only learns to accurately predict masked tokens but also gains a nuanced understanding of the genomic sequences, paving the way for significant advancements in genomic research.
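At the data level, the mask-sample-discriminate flow can be sketched as follows (a simplified illustration with a stand-in generator; the function names are hypothetical, and the real model masks spans of six consecutive positions and samples replacements from the generator's output distribution):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate, rng):
    """Replace a fraction of positions with [MASK]; return masked seq + positions."""
    positions = [i for i in range(len(tokens)) if rng.random() < rate]
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions

def build_discriminator_input(tokens, positions, generator_sample):
    """Fill masked positions with generator samples; label 1 marks a replaced token."""
    filled = list(tokens)
    labels = [0] * len(tokens)
    for i in positions:
        filled[i] = generator_sample(i)
        labels[i] = int(filled[i] != tokens[i])
    return filled, labels

rng = random.Random(42)
tokens = ["TTGGAA", "TGGAAA", "GGAAAT", "GAAATC"]
masked, pos = mask_tokens(tokens, 0.5, rng)
# a stand-in generator that always proposes the same six-mer
filled, labels = build_discriminator_input(tokens, pos, lambda i: "AAAAAA")
```

The discriminator then sees the fully filled sequence and predicts, token by token, whether each position was replaced; unlike BERT, the loss covers every position rather than only the masked ones.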

Motif Prediction
Motifs, as recurrent nucleotide sequence patterns, hold significant biological implications, particularly in the context of gene regulation. The identification and interpretation of motifs can shed light on the biological functions associated with specific sequences. To extract motifs from DNA sequences [41], we utilize the MEME Suite [40], representing the discovered motifs as Position Weight Matrices (PWMs), denoted by m = [m_1, ..., m_n]. These PWMs, corresponding to the unmasked portions of the input x, serve as critical indicators of the underlying biological functions.
Anticipating that our generator G can capture the distribution of these motifs, we incorporate PWMs m as labels for motif learning.This approach enables the generator to align its predictions with the biologically meaningful patterns encoded within the PWMs.
The motif prediction task hinges on the generator's ability to match the PWMs with the actual motif distribution within the sequences. This matching process is quantified using the Kullback-Leibler Divergence (KL) loss [42], which assesses the similarity between two probability distributions. The motif learning loss is formulated as follows:

L_KL = KL( m ‖ p_motif(h_G(x)) ),

where the generator's last hidden layer representation h_G(x) is used to predict the motif patterns. Furthermore, the discriminator component, denoted as D_motif, is tasked with identifying motif occurrences within the sequences x^R, which incorporate the generated samples. The objective is for the discriminator to pinpoint the locations of motifs within the genome tokens, guided by the binary motif occurrence labels m' ∈ {0, 1}. The loss function for this task is as follows:

L_motif = E[ −( m' log D_motif(x^R) + (1 − m') log(1 − D_motif(x^R)) ) ].
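As a small numerical illustration of these two losses (a sketch under the assumption that motif distributions are compared against predicted base distributions, and occurrence labels against predicted probabilities; not the paper's exact formulation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def motif_occurrence_bce(labels, probs, eps=1e-9):
    """Binary cross-entropy for per-motif occurrence labels in {0, 1}."""
    labels = np.asarray(labels, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))

# one PWM column (a distribution over A, C, G, T) vs. two predicted distributions:
# the KL loss penalizes the prediction that is far from the target more heavily
target = [0.7, 0.1, 0.1, 0.1]
close = [0.6, 0.15, 0.15, 0.1]
far = [0.1, 0.1, 0.1, 0.7]
```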

Pre-Training Objectives
Our pre-training strategy encompasses the simultaneous training of the generator and discriminator, leveraging the motif prediction tasks as key self-supervised learning objectives. The composite loss function, integrating the losses from both components, is expressed as follows:

L = L_MGM + L_Disc + α L_KL + β L_motif,

where α and β are hyperparameters that calibrate the contributions of the respective loss components.
It is crucial to recognize the nuanced distinction in the operation of our discriminator D compared with traditional BERT models. Unlike BERT, which focuses solely on predicting masked tokens, our discriminator evaluates all tokens within the input sequences x^R, enriched by the generator's samples. This comprehensive analysis, combined with the integration of motif patterns in both the generator and the discriminator, equips MoDNA with the capability to derive semantic and functional insights from the sequence data, thereby enhancing the overall quality of sequence representation learning.

Fine-Tuning
Following the pre-training phase, we enter the supervised fine-tuning stage, utilizing the discriminator network that has been pre-trained with the MoDNA framework. This step involves adapting our discriminator model, which has been primed on a vast array of genomic sequences, to specific downstream tasks that require precise and task-specific predictions.
A notable challenge encountered in traditional pre-training models like BERT is the discrepancy between pre-training and fine-tuning stages, primarily due to the use of masked inputs during pre-training and the transition to unmasked inputs for fine-tuning. MoDNA addresses this issue innovatively by employing generated samples to replace masked tokens during pre-training, ensuring that the discriminator's inputs during fine-tuning encompass the complete set of input tokens without relying solely on masked portions. This approach allows for a more comprehensive training of the discriminator across all tokens, enhancing its predictive capabilities.
The utility of the MoDNA model extends to a variety of downstream tasks [43]. In this study, we focus on two critical applications: promoter prediction and transcription factor binding site (TFBS) prediction. For each task, we augment the pre-trained discriminator with a task-specific linear classifier, enabling the model to leverage the rich representations learned during pre-training for targeted predictions. During the fine-tuning process, we adjust all parameters of the pre-trained model in accordance with the specific requirements of each task, guided by task-relevant datasets and loss functions. This fine-tuning phase, spanning several epochs, culminates in the MoDNA model demonstrating enhanced performance and predictive accuracy on these downstream tasks, showcasing its effectiveness in genomic sequence analysis.
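The task-specific classifier can be pictured as a single linear layer on top of the pre-trained discriminator's pooled output (a minimal NumPy sketch; the class name, pooling choice, and dimensions here are illustrative assumptions, not MoDNA's exact head):

```python
import numpy as np

class PromoterHead:
    """Hypothetical task head: pooled hidden state -> per-class logits."""
    def __init__(self, hidden_size, n_classes, rng):
        self.W = rng.normal(scale=0.02, size=(hidden_size, n_classes))
        self.b = np.zeros(n_classes)

    def forward(self, hidden_states):
        # use the [CLS] position (index 0) as the sequence summary
        pooled = hidden_states[:, 0, :]
        return pooled @ self.W + self.b

rng = np.random.default_rng(0)
head = PromoterHead(hidden_size=768, n_classes=2, rng=rng)
# a batch of 4 sequences, 128 tokens each, as produced by the discriminator
hidden = rng.normal(size=(4, 128, 768))
logits = head.forward(hidden)
```

During fine-tuning both the head and all pre-trained parameters would be updated against the task's loss, as described above.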

Experimental Results
This section outlines the experimental framework of our study, detailing the procedures for both pre-training and fine-tuning phases, and introduces the datasets employed. Subsequently, we compare our results with leading methods in the field for various downstream tasks.

Pre-Training and Fine-Tuning Experimental Pipelines
For our pre-training data, we turned to the human genome DNA sequences from the GRCh38 assembly. To add a layer of biological context, motif scanning was performed on these sequences, resulting in the generation of Position Weight Matrix (PWM) representations for the motifs discovered. In a manner similar to DNABERT, our MoDNA model was pre-trained with a 15% masking rate for the six-mers within the sequences. This involved breaking down DNA sequences (with a maximum length of 512 nucleotides) into six-mer permutations. These six-mers were then randomly masked and supplied to the generator for processing. In tandem, PWMs and motif occurrence labels were integrated into the training regime of both the generator and the discriminator, enriching the learning process with biologically relevant information.

Implementation Details
For pre-training, we use the Adam optimizer with a batch size of 64. Like DNABERT, we warm up the model during the first 10k steps; after that, the learning rate starts from 2 × 10^-4 and decays linearly. The detailed parameters of our MoDNA are listed in Table 1. For our pre-training setup, the computations were carried out on four NVIDIA (NVIDIA Corp., 2001 Walsh Ave, Santa Clara, CA, USA) Quadro RTX 8000 GPUs, each with 48 GB of memory. The pre-training phase was completed in approximately 10 days given the substantial model size of 110 million parameters. Our pre-training exploits the extensive human genome sequences from the GRCh38 assembly, sourced from the National Library of Medicine. Sequences containing ambiguous "N" nucleotides were excluded to ensure data quality, and the remaining sequences were adjusted to lengths within the 5-510 base pair range. This dataset includes a total of 4,963,161 sequences, providing a broad base for our model's pre-training. For motif prediction, a set of 769 motifs, each 7-25 nucleotides long, was curated from HOCOMOCO v11 [44], a resource for human transcription factor binding motifs. These motifs were then used to annotate the pre-training sequences, with the corresponding PWMs serving as labels.
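The warm-up-then-linear-decay schedule can be sketched as follows (total_steps is an assumed illustration parameter; the text specifies only the 10k warm-up steps and the 2 × 10^-4 peak rate):

```python
def learning_rate(step, peak_lr=2e-4, warmup_steps=10_000, total_steps=100_000):
    """Linear warm-up to peak_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, frac)

# the rate climbs during warm-up, peaks at 2e-4, then decays to zero
rates = [learning_rate(s) for s in (0, 5_000, 10_000, 55_000, 100_000)]
```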

Promoter Prediction Dataset
Our exploration into promoter prediction was conducted on a dataset derived from the Eukaryotic Promoter Database (EPDnew) [45], encompassing a total of 59,198 samples. Each sample, centered around a Transcription Start Site (TSS), consists of a 70-nucleotide sequence. For the fine-tuning phase, we allocated 10% of these samples for evaluation purposes, with the remainder utilized for training.

The 690 ChIP-seq Datasets
Our exploration further extended to the analysis of 690 ChIP-seq datasets, encompassing 161 transcription factors (TFs) across 91 human cell types [46,47]. The transcription factor binding site (TFBS) prediction task was conducted across all 690 datasets, employing 101 bp sequences centered on peak regions as positive samples. The generation of negative samples adhered to the methodology established by DESSO, selecting sequences with matching GC content from GENCODE [48], thereby ensuring a balanced representation of genomic contexts.
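The GC-matched negative sampling step can be sketched like this (a simplified greedy matcher over toy sequences; DESSO's actual procedure draws candidates from GENCODE regions and may differ in detail):

```python
def gc_content(seq):
    """Fraction of G/C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def match_negatives(positives, candidates, tol=0.05):
    """For each positive, greedily pick an unused candidate with similar GC content."""
    used, pairs = set(), []
    for pos in positives:
        target = gc_content(pos)
        for i, cand in enumerate(candidates):
            if i not in used and abs(gc_content(cand) - target) <= tol:
                used.add(i)
                pairs.append((pos, cand))
                break
    return pairs

positives = ["GCGCATAT", "ATATATGC"]          # GC fractions 0.50 and 0.25
candidates = ["GGCCAATT", "AATTAATT", "ATGCAAAA"]
pairs = match_negatives(positives, candidates)
```

Matching GC content keeps the classifier from exploiting base-composition differences instead of genuine binding-site signal.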

Promoter Prediction
Promoter regions, situated proximal to Transcription Start Sites (TSSs), play a pivotal role in initiating the transcription process of DNA [6]. These regions often encompass critical short DNA elements and motifs, typically ranging from 5 to 15 bases in length, acting as binding sites for proteins that orchestrate the transcription initiation and regulation of downstream genes. For our promoter prediction task, we utilized the same core DNA promoter dataset as DNABERT, which consists of sequences that are 70 bp in length and centered around the TSS. In our comparative analysis, we benchmarked against the DNABERT pre-trained model, adhering to the fine-tuning strategy outlined in their study. Additionally, we drew comparisons with GeneBERT, a recent multi-modal model pre-trained on 17 million genomic sequences from the ATAC-seq dataset. Our experiments employed identical settings and datasets for fine-tuning MoDNA. Table 2 presents the performance metrics of GeneBERT, DNABERT, MoDNA without motif incorporation, and the complete MoDNA framework. We also compared our model with CNN, CNN+GRU, and CNN+LSTM; the results are presented in Figure 3. It is evident from the results that MoDNA achieves superior performance in promoter prediction tasks. Notably, despite GeneBERT's extensive training on large-scale genomic data and inclusion of cell-type-specific motifs, MoDNA surpasses its performance across all evaluation metrics, including Accuracy, AUC, F1, MCC, Precision, and Recall. Furthermore, MoDNA demonstrates a noteworthy efficiency advantage over DNABERT, achieving a 1.5% relative improvement across all metrics. This enhancement underscores the effectiveness of MoDNA's pre-training strategy and the incorporation of self-defined motif prediction tasks. MoDNA's compact parameter set, in contrast with GeneBERT and DNABERT, contributes to its efficiency. Unlike traditional BERT pre-training approaches, MoDNA predicts across all tokens, thereby optimizing training efficiency and mitigating the data mismatch issue commonly encountered between pre-training and fine-tuning phases in previous models. Although our generator's pre-training includes masked genome prediction, we introduce a novel approach by generating mask tokens through sampling and incorporating them into the discriminator's input. This ensures consistency in input tokens throughout both pre-training and fine-tuning phases. Such an implementation not only showcases the efficacy of the ELECTRA framework but also validates our meticulously crafted pre-training tasks, which effectively harness implicit domain knowledge to bolster downstream task performance.

Transcription Factor Binding Site (TFBS) Prediction
Transcription involves copying a DNA segment into RNA, with transcription factor proteins playing a crucial role by binding to specific regulatory DNA regions. These proteins can attach to various regulatory elements such as promoters and enhancers, thus controlling gene expression. Given the pivotal role of these interactions in gene regulation, accurately predicting transcription factor binding sites is crucial for understanding DNA sequence functionality. To evaluate the efficacy of MoDNA in this domain, we fine-tuned our model on 690 ENCODE ChIP-seq datasets [47], as shown in Figure 4, comparing its performance against well-established methods in the field. Among these, DeepBind [1], a CNN-based model, stands out by learning sequence motifs as convolutional layer kernels, setting a high benchmark in TFBS prediction. Additionally, DeepSite [7] combines bidirectional long short-term memory with CNNs to capture the long-range dependencies between DNA sequence motifs. Another noteworthy method, DESSO [49], incorporates motif shape into DNA sequence representation learning with CNNs, demonstrating performance on par with DeepBind. Our comparative analysis spans all 690 datasets, with the average results encapsulated in Table 3.
MoDNA distinguishes itself by outperforming the baseline methods across these datasets, achieving the highest average AUC and setting a new standard in TFBS prediction. In addition to the comprehensive 690-dataset comparison, we conducted focused comparisons with DeepBind and GeneBERT for a more nuanced evaluation. DeepBind's performance was assessed on a subset of 506 ENCODE ChIP-seq datasets, where it demonstrated robustness by mitigating ChIP-seq data biases. GeneBERT, on the other hand, fine-tuned its model on nine CTCF site ChIP-seq profile datasets. Figures 5 and 6 present the comparative results, highlighting MoDNA's superior performance. On the 506 ENCODE ChIP-seq datasets, MoDNA's mean AUC reached 0.94, significantly outstripping DeepBind's 0.914. In the CTCF site experiment, MoDNA's mean AUC soared to 0.996, surpassing GeneBERT's 0.983 and underscoring the robustness of our pre-training and fine-tuning approach.

Ablation Studies and Discussion
To underscore the significance of our self-supervised pre-training approach, we conducted a comparative analysis of MoDNA's performance with and without the pre-training phase on the promoter prediction task. This comparison, detailed in Table 4, was conducted under identical experimental conditions. The results show a marked decrease in performance for MoDNA without pre-training, underscoring the model's ability to internalize meaningful genomic representations during the pre-training phase. This phase enables MoDNA to assimilate essential biological insights, thereby enhancing its efficacy in downstream tasks. To further explore MoDNA's efficiency, we pre-trained the model on a subset of the pre-training dataset consisting of only 3000 DNA sequences, excluding the motif prediction tasks. Remarkably, even with this limited dataset, MoDNA achieved an AUC of 0.926, surpassing GeneBERT's performance by 3.2%. A similar exercise with DNABERT yielded an AUC of 0.912, reinforcing MoDNA's superior model efficiency and its ability to converge rapidly with limited data. To evaluate the impact of our motif prediction tasks, we also examined MoDNA's performance without motif-oriented learning. These experiments, which used the complete genome dataset, revealed that while MoDNA without motif prediction outperformed GeneBERT and DNABERT, it fell short of the full MoDNA framework, as shown in Table 2. The enhanced performance of MoDNA, attributable to its well-structured self-supervised tasks, validates our hypothesis that embedding motif prior knowledge into pre-training can significantly bolster the model's ability to learn functional biological features. Moreover, MoDNA's strategy of learning from all tokens, rather than only the masked ones as in BERT, proves advantageous. This approach, rooted in the ELECTRA model, not only improves training efficiency but also enriches the model with a deeper understanding of biological contexts through the motif prediction tasks. This methodological shift is pivotal for adapting self-supervised learning to the unique challenges posed by biological data.
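The contrast between masked-only and all-token supervision can be made concrete with a minimal sketch of ELECTRA-style replaced-token detection, the objective MoDNA's discriminator builds on. The function name and toy values below are illustrative, not MoDNA's implementation: the key point is that the binary cross-entropy is averaged over every token position, not only the ~15% that were masked.

```python
import numpy as np

def replaced_token_detection_loss(logits, is_replaced):
    """ELECTRA-style discriminator objective: binary cross-entropy over
    ALL token positions, predicting for each token whether it was
    replaced by the generator's sample (1) or is the original (0)."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid over per-token logits
    eps = 1e-12                            # numerical safety for log
    losses = -(is_replaced * np.log(probs + eps)
               + (1.0 - is_replaced) * np.log(1.0 - probs + eps))
    return losses.mean()                   # averaged over every token

# Toy example: 8 tokens, of which positions 2 and 5 were replaced.
rng = np.random.default_rng(0)
logits = rng.normal(size=8)
is_replaced = np.array([0, 0, 1, 0, 0, 1, 0, 0], dtype=float)
print(replaced_token_detection_loss(logits, is_replaced))
```

Under a BERT-style masked objective, only the masked positions would contribute to this average; here all eight positions carry gradient signal, which is the efficiency gain the text refers to.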
The prevailing paradigm in self-supervised learning, particularly within the NLP domain, hinges on reconstructing meaningful word pieces. However, applying such NLP strategies to the "DNA language" is not straightforward, because DNA sequences are ambiguous when treated as word pieces. This discrepancy raises a critical question: how can we leverage domain-specific prior knowledge to guide pre-training in fields distinct from NLP? MoDNA represents an initial foray into addressing this challenge, aiming to adapt self-supervised pre-training to a broader range of domain-specific problems. Looking to the future, the potential for incorporating additional functional feature tasks into MoDNA is vast. Such an expansion could further refine the model's capabilities, paving the way for its application to a wider array of downstream tasks. The exploration of self-supervised pre-training, enriched with domain-specific knowledge, holds promise for advancing our understanding and application of machine learning in genomics and beyond.
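To make the word-piece ambiguity concrete: DNA has no natural word boundaries, so sequences are typically split into overlapping k-mers (k = 6, as in Figure 2a). The minimal tokenizer below is a DNABERT-style sliding-window sketch for illustration, not MoDNA's actual preprocessing code:

```python
def kmer_tokenize(sequence, k=6):
    """Produce overlapping k-mer tokens (stride 1) from a DNA sequence."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

tokens = kmer_tokenize("ATGCGTACGT")
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```

Because adjacent tokens overlap in five of their six bases, any segmentation is inherently ambiguous, which is precisely why purely reconstruction-based NLP objectives transfer imperfectly to genomic data.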

Conclusions
In this work, we thoroughly evaluate the performance of MoDNA on diverse genomic tasks, demonstrating the framework's superiority over existing DNA sequence language models, which have been limited by their reliance on task-specific labeled data and constrained generalization capabilities. The quest for a versatile DNA language representation model remains a critical research endeavor. Our approach, MoDNA, extends beyond the straightforward application of NLP paradigms to DNA sequences by incorporating motif patterns during the pre-training stage. Unlike in NLP, where labeled data are more readily available, biological annotations often entail prohibitively expensive experimental procedures. MoDNA improves computational efficiency for a given dataset size and resolves the data mismatch issue associated with BERT by leveraging replacement tokens generated during pre-training. Furthermore, MoDNA's discriminator processes and differentiates among all input tokens, thereby optimizing learning efficiency.
In genomics, motifs are known to indicate protein binding sites within sequences. By embedding self-supervised motif prediction tasks that use motif patterns as domain knowledge, MoDNA enriches its pre-training phase: the generator focuses on motif prediction, while the discriminator assesses motif occurrences. This incorporation of biological insight enables MoDNA to capture a nuanced semantic representation of DNA. Upon fine-tuning, MoDNA demonstrates superior performance in promoter prediction and transcription factor binding site prediction, validating its effectiveness. These promising results invite the exploration of additional tasks, such as splice site prediction, to further challenge and refine MoDNA. Our findings underscore the value of integrating domain-specific knowledge through motif patterns into self-supervised learning tasks.
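As an illustration of how motif patterns can be turned into per-position supervision of the kind the discriminator is trained against, the sketch below derives binary occurrence labels by exact string matching. The helper name and the TATA-box example are hypothetical, and real motif scanning typically uses position weight matrices from motif databases rather than exact matches:

```python
def motif_occurrence_labels(sequence, motifs):
    """Mark every base covered by an occurrence of a known motif,
    yielding per-position binary occurrence labels (exact-match
    stand-in for PWM-based motif scanning)."""
    labels = [0] * len(sequence)
    for motif in motifs:
        m = len(motif)
        for i in range(len(sequence) - m + 1):
            if sequence[i:i + m] == motif:
                for j in range(i, i + m):  # label all bases under the hit
                    labels[j] = 1
    return labels

# TATA-box-like motif at positions 3..8 of a toy sequence:
print(motif_occurrence_labels("ATGTATAAAGC", ["TATAAA"]))
# [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
```

Labels of this form can be pooled over k-mer tokens to supervise a motif prediction head alongside the token-level objectives.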
Although the MoDNA framework has demonstrated generalizability across different genomic datasets, its pre-training was conducted exclusively on the human reference genome. This limits the framework's ability to account for sequence conservation and diversity across species, potentially affecting its applicability to non-human genomic data. For future work, we aim to develop a more robust foundational model by extending our training datasets to a diverse range of organisms and incorporating algorithms that better capture genetic variability and conservation across species. Such enhancements should increase the predictive power and utility of the MoDNA framework in a broader array of genomic applications. Future research should also consider expanding the scope of self-supervised tasks to encompass other biologically relevant features, fostering a more comprehensive understanding of genomic sequences.

Figure 1. The structure of the generator and discriminator.

Figure 2. Overview of the MoDNA framework. (a) DNA Sequence Representation: illustration of DNA sequence k-mers (k = 6), the basic units for analysis. (b) Pre-training Pipeline of MoDNA: the process begins with the random masking of input DNA sequence k-mers, with x_2 denoting a masked token. DNA k-mer tokens, along with special tokens, are assembled into a sequence of DNA tokens. These tokens are input into the generator, which pursues two main objectives: predicting the masked genomic sequences and identifying motif patterns. The generator also produces a sampled token x̂_2 to substitute for the masked token [MASK]. This modified sequence, combined with the unaltered tokens, is then processed by the discriminator, which is trained to detect replaced tokens and, given the motif occurrence labels, to predict the presence of motifs. (c) Fine-Tuning Pipeline of MoDNA: the pre-trained discriminator's weights are used as the starting point, and an additional multilayer perceptron is integrated to fine-tune the model for various downstream tasks.

Figure 3. Comparison results on the promoter core datasets.

Figure 4. Performance of MoDNA on transcription factor binding site prediction across the 690 ChIP-seq datasets.

Figure 5. Comparison of AUC results with DeepBind for transcription factor binding site prediction on the 506 TF binding profile datasets.

Figure 6. Comparison of AUC results for transcription factor binding site classification on CTCF binding sites.

Table 2. Comparison results on promoter prediction classification.

Table 3. Comparison results on transcription factor binding site classification.

Table 4. Comparison between MoDNA with and without pre-training.