6mAPred-MSFF: A Deep Learning Model for Predicting DNA N6-Methyladenine Sites across Species Based on a Multi-Scale Feature Fusion Mechanism

Abstract: DNA methylation is one of the most extensive epigenetic modifications. DNA N6-methyladenine (6mA) plays a key role in many biological regulation processes. Accurate and reliable genome-wide identification of 6mA sites is crucial for systematically understanding its biological functions. Some machine learning tools can identify 6mA sites, but their limited prediction accuracy and lack of robustness limit their usability in epigenetic studies, which implies a great need for new computational methods for this problem. In this paper, we developed a novel computational predictor, namely 6mAPred-MSFF, a deep learning framework based on a multi-scale feature fusion mechanism to identify 6mA sites across different species. In the predictor, we integrate the inverted residual block and a multi-scale attention mechanism to build lightweight and deep neural networks. Compared to existing predictors based on traditional machine learning, our deep learning framework needs no prior knowledge of 6mA or manually crafted sequence features and sufficiently captures the characteristics of 6mA sites. By benchmarking comparison, our deep learning method outperforms the state-of-the-art methods under the 5-fold cross-validation test on the seven datasets of six species, demonstrating that the proposed 6mAPred-MSFF is more effective and generic. Specifically, our proposed 6mAPred-MSFF achieves a 5-fold cross-validation sensitivity and specificity on the 6mA-rice-Lv dataset of 97.88% and 94.64%, respectively. Our model trained with the rice data also predicts well the 6mA sites of the other five species: Arabidopsis thaliana, Fragaria vesca, Rosa chinensis, Homo sapiens, and Drosophila melanogaster, with prediction accuracies of 98.51%, 93.02%, and 91.53%, respectively. Moreover, via experimental comparison, we explored the performance impact of training and testing our proposed model under different encoding schemes and feature descriptors.


Introduction
Epigenetics refers to reversible and heritable changes in gene function that occur without any change in the nuclear DNA sequence [1]. DNA methylation modifications play important roles in the epigenetic regulation of gene expression without altering the sequence, and they are widely distributed in the genomes of different species [2]. DNA methylation can be divided into three categories according to the position of the methylation modification: N6-methyladenine (6mA), 5-methylcytosine (5mC), and N4-methylcytosine (4mC) [3,4]. Methylation at the 5th position of the pyrimidine ring of cytosine (5mC) and at the 6th position of the purine ring of adenine (6mA) are the most common DNA modifications in eukaryotes and prokaryotes, respectively [5]. Previous studies have shown that DNA N6-methyladenine (6mA) is associated with germ cell differentiation, stress response, embryonic development, the nervous system, and other processes [6][7][8][9]. N6-methyladenine (6mA) DNA methylation has recently been implicated as a potential new epigenetic marker in eukaryotes, including Arabidopsis thaliana, rice, and Drosophila melanogaster [5,8,10]. Zhang et al. revealed that 6mA is a conserved DNA modification that is positively associated with gene expression and contributes to key agronomic traits in plants [11]. Some studies have found that N6-methyladenine DNA modification is also widely distributed in the human genome and plays important biological roles. Xiao et al. demonstrated that 6mA DNA modification is extensively present in human cells and that a decrease of genomic 6mA promotes human tumorigenesis [12]. Zhou et al. proposed 6mA DNA modification as a new mechanism for the epigenetic regulation of stem cell differentiation [13]. Xie et al.
reported that N6-methyladenine DNA modifications are enriched in human glioblastoma and that targeting regulators of this modification can inhibit cancer growth by altering heterochromatin landscapes and downregulating oncogenic programs [14]. Therefore, rapid and accurate detection of DNA 6mA modification sites is an important research topic in these epigenetic studies.
Due to the rapid development of high-throughput sequencing technology, various experimental techniques have been proposed to detect DNA 6mA modifications and study protein function. Pomraning et al. developed a protocol using bisulfite sequencing and a methyl-DNA immunoprecipitation technique to analyze genome-wide DNA methylation in eukaryotes [15]. Krais et al. reported a fast and sensitive method for the quantification of global adenine methylation in DNA using laser-induced fluorescence and capillary electrophoresis [16]. Flusberg et al. proposed single-molecule real-time (SMRT) sequencing to detect 4mC and 6mA sites across the whole genome [17]. Greer et al. used ultra-high-performance liquid chromatography coupled with mass spectrometry to assess DNA 6mA levels in Caenorhabditis elegans [18]. By performing mass spectrometry analysis and 6mA immunoprecipitation followed by sequencing (IP-seq), Zhou and his colleagues obtained the 6mA profile of the rice genome [10].
Although experimental methods have indeed yielded encouraging results, they are difficult to apply genome-wide and their cost is high. Therefore, it is necessary to develop computational models that can efficiently and accurately predict 6mA sites. Recent studies focus more on the recognition of 6mA sites using machine learning [19], which can predict 6mA sites from genome sequences alone, without any prior experimental knowledge. Chen et al. developed the first ML-based method, named i6mA-Pred, for identifying DNA 6mA sites and provided a benchmark 6mA dataset containing 880 6mA sites and 880 non-6mA sites in the rice genome. Their method used a Support Vector Machine (SVM) classifier based on chemical features of nucleotides and position-specific nucleotide frequencies [20]. i6mA-Pred shows good classification performance on rice 6mA data; however, it ignores the association information among nucleotides near 6mA sites. Pian et al. proposed a new classification method called MM-6mAPred based on a Markov model, which makes use of the transition probabilities between adjacent nucleotides to identify 6mA sites [21]. Pian et al. built and evaluated MM-6mAPred on the 6mA-rice-Chen benchmark dataset; their results show that MM-6mAPred outperformed i6mA-Pred in predicting 6mA sites. Basith et al. developed a novel computational predictor, called SDM6A, which explores various features and five encoding methods to identify DNA 6mA sites [22]. Basith et al. also trained and evaluated SDM6A on the 6mA-rice-Chen benchmark dataset and found that it outperformed i6mA-Pred. The above three prediction models are all trained on the 6mA-rice-Chen benchmark dataset, which includes only 880 rice 6mA sites and 880 non-6mA sites.
Even though the above methods have improved the performance of identifying 6mA sites, the datasets adopted are too small to fully reflect the whole genome and to build robust models. Lv et al. developed a machine learning method for predicting 6mA sites named iDNA6mA-rice, which was trained and evaluated on the 6mA-rice-Lv dataset containing 154,000 6mA sites and 154,000 non-6mA sites in the rice genome [23]. iDNA6mA-rice uses a Random Forest to perform the classification after formulating positive and negative samples with mono-nucleotide binary encoding.
In recent years, deep learning has not only developed into a new research direction in machine learning but has also achieved much in data mining, machine translation, natural language processing, and other related fields. Deep learning has been widely applied in computational biology [24][25][26][27][28][29][30][31], especially in solving genome sequence-based problems with convolutional neural networks (CNNs). Tahir et al. proposed an intelligent computational model called iDNA6mA (5-step rule), which extracts the key features from DNA input sequences via a CNN to identify 6mA sites in the rice genome [32]. Yu et al. developed a simple and lightweight deep learning model named SNNRice6mA to identify DNA 6mA sites in the rice genome and showed its advantages over other methods [33]. Li et al. developed a deep learning framework named Deep6mA, composed of a CNN and a bidirectional LSTM (BLSTM) module, which is shown to outperform the methods above on 6mA prediction [34]. Although these methods have made great progress, their performance is still not satisfactory. Furthermore, most of them are designed for one specific species, such as rice. Although some methods provide cross-species validation tests, the results are not as good as those of the original species-specific models.
In this paper, we propose a novel deep learning method called 6mAPred-MSFF to identify DNA 6mA sites. In this predictor, we establish the model by integrating the inverted residual block, the multi-scale channel attention module (MS-CAM), a Bi-directional Long Short-Term Memory network (Bi-LSTM), and attentional feature fusion (AFF). To capture more feature information, we use the inverted residual block to expand the input features to a high dimension and filter them with a lightweight depthwise convolution. After that, we project them back to a low-dimensional representation as the input of the MS-CAM. The Bi-LSTM is designed to capture long-range dependencies in sequences. The features generated from the Bi-LSTM are fed into the AFF to be fused. In the experimental part, we evaluate our predictor and other 6mA predictors with 5-fold cross-validation on seven benchmark datasets. We also evaluate these 6mA predictors, trained on the 6mA-rice-Lv dataset, in predicting the 6mA sites of other species. Finally, we explore the performance impact of different encoding methods and feature descriptors.

Dataset Collection
Previous studies have demonstrated that a stringent dataset is essential for building a robust predictive model. We collected seven 6mA site datasets of six species: Rice, Arabidopsis thaliana (A. thaliana), Rosa chinensis (R. chinensis), Fragaria vesca (F. vesca), Homo sapiens (H. sapiens), and Drosophila melanogaster (D. melanogaster). There are two benchmark datasets for Rice, 6mA-rice-Lv and 6mA-rice-Chen. The 6mA-rice-Lv dataset includes 15,400 positive samples and 15,400 negative samples, which were obtained from the NCBI Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/genome/10 (accessed on 20 August 2021)) following the steps proposed by Lv et al. [23]. The 6mA-rice-Chen benchmark dataset includes 880 rice 6mA sites and 880 non-6mA sites, which were obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/ (accessed on 20 August 2021)) under the accession number GSE103145 following the steps proposed by Chen et al. [20].
To demonstrate that our model has the ability to predict the 6mA sites of other species and that 6mA shares similar patterns across different species, we collected DNA 6mA sequences of Arabidopsis thaliana and Drosophila melanogaster from the MethSMRT database (http://sysbio.gzzoc.com/methsmrt/ (accessed on 20 August 2021)) [35] and obtained the DNA 6mA sequences of Fragaria vesca and Rosa chinensis from the MDR database [36]. Preliminary trials indicated that the highest predictive results are obtained when the length of the segments is 41 bp with the 6mA site in the center. Thus, the sequences of all positive samples are 41 bp. To construct a high-quality benchmark dataset, the following two steps were performed. First, according to the Methylome Analysis Technical Note [35], only sequences with a modQV of no less than 30 were retained for the subsequent analysis. It should be noted that, in order to obtain statistically significant results, this step was skipped when the raw data were too small, so as to keep more samples. Second, to avoid redundancy and reduce homology bias, sequences with more than 80% sequence similarity were removed using the CD-HIT program [37]. After the above two steps, objective and strict positive datasets for the above species were obtained. Negative samples for Arabidopsis thaliana, Drosophila melanogaster, Fragaria vesca, and Rosa chinensis were also collected. These samples are 41 nt long sequences with an adenine in the center that were proven by experiments to be unmethylated. For convenience, we used the Homo sapiens dataset (http://lin-group.cn/server/iDNA-MS/download.html (accessed on 20 August 2021)) [38] directly, including 18,335 positive samples and 18,335 negative samples. The details of the datasets are presented in Table 1.

Architecture of 6mAPred-MSFF
In this work, we propose a new predictive model based on a multi-scale feature fusion mechanism. An overview of our model architecture is illustrated in Figure 1. There are four modules in our model: the sequence embedding module, the feature extraction module, the feature fusion module, and the prediction module. First, in the sequence embedding module, we use four different encoding schemes (1-gram, NAC, 2-grams, DNC) as the input of the embedding layer. Each nucleotide is represented as a dense vector of floating-point values. As a result, the genomic sequences can be represented by four feature matrices. We concatenate the feature matrices of the 1-gram and NAC encodings and the feature matrices of the 2-grams and DNC encodings, respectively. Second, the resulting two feature matrices are fed into the feature extraction module, which consists of the inverted residual block and the multi-scale channel attention module (MS-CAM). To extract more information from the global features and filter the features as the source of the MS-CAM, we use the inverted residual block, which includes an expansion pointwise layer, a convolution layer, and a projection pointwise layer. The MS-CAM extracts and combines the global and local features of the two feature matrices, respectively. Before fusing the two feature matrices, we use a bidirectional LSTM layer to learn long-distance dependencies. Afterward, in the feature fusion module, we combine the features extracted from the bidirectional LSTM layer via element-wise addition and feed them into the MS-CAM module to calculate the fusion weights of the two feature matrices. We multiply the feature matrices by the corresponding fusion weights and combine the results via element-wise addition. In this way, we obtain features that aggregate the local and global contexts of the four different encoding schemes.
Third, in the prediction module, the output features are passed through dropout layers to prevent overfitting during training. After the dropout layer, a flatten function is used to squash the feature vectors from the previous layers. Right after the flatten layer, a fully connected (FC) dense layer, Dense(n), is applied with the number of neurons n set to 32. At last, an FC layer with a sigmoid function is applied for the binary classification. The sigmoid squeezes values into the range between 0 and 1 to represent the probability of 6mA versus non-6mA sites.
The 6mAPred-MSFF is trained with the Adam optimizer and the binary cross-entropy loss function. By giving a prediction score, 6mAPred-MSFF determines whether 6mA sites are detectable in DNA sequences: a site is predicted as 6mA if the prediction score is >0.5, and as non-6mA otherwise. The model is implemented using Keras 2.4.3. Below, each of our modules is described in detail.

Figure 1. The flowchart of 6mAPred-MSFF. The sequence embedding module uses four different encoding schemes (1-gram, NAC, 2-grams, DNC) for the input of the embedding layer. Next, the feature matrices are fed into the feature extraction module to capture and combine the global and local features, respectively. Afterwards, the local and global contexts of the four different encoding schemes are aggregated in the feature fusion module. Finally, the output of the feature fusion module is fed into the prediction module to predict the 6mA sites of a certain species.


Sequence Embedding Module
DNA sequences consist of four nucleotides: "A" (adenine), "G" (guanine), "C" (cytosine), and "T" (thymine). Undetermined bases are annotated as "N". A DNA sequence can be expressed as S = s 1 , s 2 , . . . , s L , where s i is the i-th word and L is the length of the sequence. We use four encoding methods to define 'words' in DNA sequences: 1-gram, NAC, 2-grams, and DNC.
1-gram encoding method. The n-grams are the set of all possible subsequences of nucleobases [39]. The 1-gram encoding sets the n-gram number n to 1; as a result, the 1-gram nucleobases are 'A', 'T', 'C', 'G'. We define a dictionary that maps the nucleotides to numbers: A (adenine) to 1, T (thymine) to 2, C (cytosine) to 3, and G (guanine) to 4. In this way, we map a DNA sequence into a real number vector of length 41, denoted as V i,1−gram.

NAC encoding method. The Nucleic Acid Composition (NAC) encoding calculates the frequency of each nucleic acid type in a nucleotide sequence. For a DNA sequence S of length L, the frequency of each of the four natural nucleic acids (i.e., "ATCG") can be calculated as f(t) = N(t)/L, t ∈ {A, T, C, G}, where N(t) is the number of occurrences of nucleotide t. The sequence can thus be converted into a vector of length 4, denoted as V i,NAC.

2-grams encoding method. The 2-grams encoding sets the n-gram number n to 2. We split the DNA sequence into overlapping 2-gram nucleobases. The total number of possible 2-grams is 16, since there are four types of nucleobases. For example, we split a DNA sequence into overlapping 2-gram nucleobases as follows: GTTGT . . . CTT → 'GT', 'TT', 'TG', 'GT', . . . , 'CT', 'TT'. We define a dictionary that maps the 2-gram nucleobases to the numbers 1-16. In this way, we map a DNA sequence into a real number vector of length 40, denoted as V i,2−grams.

DNC encoding method. The Di-Nucleotide Composition (DNC) encoding calculates the frequency of every pair of nucleic acid types in a nucleotide sequence, which gives 16 descriptors. It is defined as D(r, s) = N rs /(L − 1), r, s ∈ {A, T, C, G}, where N rs is the number of di-nucleotides formed by nucleic acid types r and s. For a DNA sequence S of length L, it can be converted into a vector of length 16, denoted as V i,DNC.

Next, the vectors V i,1−gram , V i,NAC , V i,2−grams , and V i,DNC are fed to an embedding layer, transforming them into learnable embedding vectors.
To establish the relationship between V i,1−gram and V i,NAC , we concatenate their two embedding vectors into a new feature vector denoted as X i :

X i = Concat(E(V i,1−gram ), E(V i,NAC ))

where E(·) denotes the embedding layer. We apply the same operation to the embedding vectors of V i,2−grams and V i,DNC , generating a new feature vector denoted as Y i :

Y i = Concat(E(V i,2−grams ), E(V i,DNC ))

To extract more feature information, the vectors X i and Y i are fed into the feature extraction module.
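As an illustration, the four encodings above can be sketched in plain Python for a 41 bp sequence; the dictionaries and function names are illustrative rather than taken from the published implementation:

```python
# Sketch of the four sequence encodings (1-gram, NAC, 2-grams, DNC).
# The integer dictionaries mirror the mappings described in the text.

NUC2IDX = {"A": 1, "T": 2, "C": 3, "G": 4}

def one_gram(seq):
    """Map each nucleotide to an integer; a length-L vector (L = 41 here)."""
    return [NUC2IDX[n] for n in seq]

def nac(seq):
    """Nucleotide frequencies f(t) = N(t) / L; a length-4 vector."""
    L = len(seq)
    return [seq.count(t) / L for t in "ATCG"]

# All 16 possible di-nucleotides, mapped to the numbers 1..16.
DINUC2IDX = {a + b: i + 1 for i, (a, b) in
             enumerate((a, b) for a in "ATCG" for b in "ATCG")}

def two_grams(seq):
    """Overlapping di-nucleotides mapped to integers; length L - 1 (= 40)."""
    return [DINUC2IDX[seq[i:i + 2]] for i in range(len(seq) - 1)]

def dnc(seq):
    """Di-nucleotide frequencies D(r, s) = N_rs / (L - 1); a length-16 vector."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(p) / (len(seq) - 1) for p in DINUC2IDX]

seq = "A" * 20 + "A" + "G" * 20  # toy 41 bp sequence with adenine centered
v1, vn, v2, vd = one_gram(seq), nac(seq), two_grams(seq), dnc(seq)
assert (len(v1), len(vn), len(v2), len(vd)) == (41, 4, 40, 16)
```

In the model, these integer and frequency vectors are only the raw input; the embedding layer then turns each index into a learnable dense vector.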

Feature Extraction Module
The inverted residual block. To reduce the computation of a network, we usually design it in a low-dimensional space. However, a network cannot capture enough information in a low-dimensional space. To solve this problem, the inverted residual with linear bottleneck was proposed by Sandler et al. [40] in MobileNetV2. This module takes a low-dimensional compressed representation as input, which is first expanded to high dimensions and filtered with a lightweight depthwise convolution.
Features are subsequently projected back to a low-dimensional representation with a linear convolution, which prevents the destruction of information in the low-dimensional space. Our proposed method follows this idea to build an inverted residual block that consists of an expansion layer, a depthwise layer, and a projection layer. The expansion layer and the projection layer are 1 × 1 convolution layers, called pointwise convolutions, with different expansion ratios. The depthwise layer is a depthwise separable convolution layer, a key building block for many efficient neural network architectures [40][41][42][43][44]. The basic idea is to replace a full convolutional operator with a factorized version that splits convolution into two separate layers. The first layer is called a depthwise convolution; it performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a pointwise convolution, which is responsible for building new features through computing linear combinations of the input channels. The inverted residual block operator can be expressed as a composition of three operators:

F IRB (X) = (B ∘ N ∘ A)(X)

where A is the expansion layer, a linear transformation A : R^(s×s×k) → R^(s×s×n); N is the depthwise layer, which contains several non-linear per-channel transformations N : R^(s×s×n) → . . . → R^(s′×s′×n); and B is the projection layer, again a linear transformation to the output domain B : R^(s′×s′×n) → R^(s′×s′×k). For our inverted residual block F IRB (X), A and B are linear transformations without any non-linear activations. We use the ELU activation function in the operator N = ELU ∘ dwise ∘ ELU, which accelerates the learning of the network. The ELU function is:

ELU(x) = x, if x > 0; α(e^x − 1), if x ≤ 0

Thus, we obtain k feature maps containing the 6mA site features captured by the inverted residual block.
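The ELU activation inside the operator N can be sketched in pure Python, assuming the common default α = 1.0 (as in Keras):

```python
import math

def elu(x, alpha=1.0):
    """ELU(x) = x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

assert elu(2.0) == 2.0            # identity on the positive side
assert elu(0.0) == 0.0
assert elu(-50.0) >= -1.0         # saturates toward -alpha for large negatives
```

Unlike ReLU, the negative side has a non-zero gradient, which is one reason it can speed up learning in deep stacks.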
Multi-scale attention mechanism block. The attention mechanism in deep learning, which mimics the human visual attention mechanism [45,46], was originally developed on a global scale. For example, the matrix multiplication in self-attention draws global dependencies between each word in a sentence [47] or each pixel in an image [48][49][50][51]. The Squeeze-and-Excitation Networks (SENet) squeeze global spatial information into a channel descriptor to capture channel-wise dependencies [52]. However, merely aggregating contextual information on a global scale is too biased and weakens the features of small objects. To solve this problem, the multi-scale channel attention module (MS-CAM) proposed by Yimian Dai et al. [53] aggregates global and local channel contexts given an intermediate feature X ∈ R^(H×W×C) with C channels and feature maps of size H × W. The attention channel weights M(X) ∈ R^C can be computed as:

M(X) = σ(L(X) ⊕ G(X))

where L(X) denotes the local channel context, G(X) denotes the global channel context, and σ is the sigmoid function. The stacked architecture of MS-CAM is shown in Figure 1. L(X) ∈ R^(C×H×W) is computed as follows:

L(X) = β(PWConv 2 (β(PWConv 1 (X)))) (12)

where β denotes Batch Normalization (BN) [54]. The pointwise convolution (PWConv) is used as the local channel context aggregator, which only exploits pointwise channel interactions at each spatial position. The expansion ratios of PWConv 1 and PWConv 2 are r and 1/r, respectively. Different from the original MS-CAM, we remove the non-linear activation function between the convolution layers to prevent the destruction of feature information. G(X) ∈ R^C is computed as follows:

G(X) = β(PWConv 2 (β(PWConv 1 (g(X))))) (13)

where g(X) is global average pooling (GAP), computed as:

g(X) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} X[i, j, :]

The expansion ratios of PWConv 1 and PWConv 2 are r and 1/r, respectively, and β denotes Batch Normalization (BN). The refined feature X′ ∈ R^(H×W×C) is then obtained as:

X′ = X ⊗ M(X)

where σ is the sigmoid function.
⊕ denotes the broadcasting addition and ⊗ denotes the element-wise multiplication.

Bi-directional Long Short-Term Memory network. The Recurrent Neural Network (RNN) is a powerful neural network for processing sequential data [55]. As a special type of RNN, the Long Short-Term Memory network (LSTM) is designed not only to capture long-range dependencies in sequences but also to overcome the training difficulties of RNNs caused by gradient explosion and vanishing [56]; thus, it is the most widely used RNN in real applications. In an LSTM, a storage mechanism replaces the hidden function of a traditional RNN, with the purpose of enhancing the ability of the LSTM to learn long-distance dependencies. The bi-directional LSTM (BLSTM), compared with the unidirectional LSTM, better captures the information of the sequence context. The LSTM components are updated by the following formulations:

i t = σ(W i · [h t−1 , x t ] + b i )
f t = σ(W f · [h t−1 , x t ] + b f )
C̃ t = tanh(W C · [h t−1 , x t ] + b C )
C t = f t ⊗ C t−1 ⊕ i t ⊗ C̃ t
o t = σ(W o · [h t−1 , x t ] + b o )
h t = o t ⊗ tanh(C t )

where i t , f t , and o t represent the input, forget, and output gates, respectively; C̃ t is a temporary value for calculating the cell state C t ; t denotes the recurrent time step; W i , W f , W C , W o and the corresponding b terms are the weights and biases of each equation; and h t is the output of the LSTM at time step t. The bi-directional LSTM consists of two LSTM networks running in opposite directions; therefore, h t is computed by concatenating the forward and backward hidden states:

h t = [→h t , ←h t ]
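A single LSTM cell update following these formulations can be sketched with toy scalar weights; real implementations apply learned weight matrices to the concatenated vector [h t−1 , x t ], and the names below are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update with scalar gates.

    W and b hold the (input, forget, cell, output) weights and biases
    applied to a toy scalar stand-in for [h_{t-1}, x_t]."""
    (Wi, Wf, Wc, Wo), (bi, bf, bc, bo) = W, b
    concat = h_prev + x_t                  # stand-in for [h_{t-1}, x_t]
    i_t = sigmoid(Wi * concat + bi)        # input gate
    f_t = sigmoid(Wf * concat + bf)        # forget gate
    c_tilde = math.tanh(Wc * concat + bc)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state
    o_t = sigmoid(Wo * concat + bo)        # output gate
    h_t = o_t * math.tanh(c_t)             # new hidden state
    return h_t, c_t

# With all-zero weights every gate is 0.5 and the candidate is 0,
# so the hidden and cell states stay at zero.
h, c = lstm_step(1.0, 0.0, 0.0, (0, 0, 0, 0), (0, 0, 0, 0))
assert (h, c) == (0.0, 0.0)
```

A bidirectional layer simply runs one such recurrence left-to-right and another right-to-left, then concatenates the two hidden states per time step.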

Feature Fusion Module
Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations such as summation or concatenation, but these might not be the best choice. We use a uniform and general scheme, namely attentional feature fusion (AFF), proposed by Yimian Dai et al. [53]. The architecture of AFF is shown in Figure 1. Based on the multi-scale channel attention module M(X), the AFF can be expressed as follows:

Z = M(X ⊕ Y) ⊗ X ⊕ (1 − M(X ⊕ Y)) ⊗ Y

where Z ∈ R^(H×W×C) is the fused feature, and the element-wise summation X ⊕ Y is used as the initial integration for the MS-CAM. M(X ⊕ Y) and 1 − M(X ⊕ Y) are the fusion weights for X and Y, respectively. Please note that the fusion weights consist of real numbers between 0 and 1, which enables the network to conduct a soft selection or weighted averaging between X and Y.
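The soft selection performed by the fusion weights can be sketched per channel in plain Python; the `attention` callable below is a toy stand-in for the MS-CAM branch, not the actual module:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def aff_fuse(x, y, attention):
    """Z = M(X + Y) * X + (1 - M(X + Y)) * Y, element-wise over channels.

    `attention` maps the initial integration x + y to pre-sigmoid
    scores, standing in for the MS-CAM local/global branches."""
    m = [sigmoid(s) for s in attention([a + b for a, b in zip(x, y)])]
    return [w * a + (1.0 - w) * b for w, a, b in zip(m, x, y)]

# A strongly positive score selects x, a strongly negative one selects y,
# and a zero score averages the two branches.
z = aff_fuse([1.0, 1.0, 1.0], [3.0, 3.0, 3.0], lambda v: [10.0, -10.0, 0.0])
assert abs(z[2] - 2.0) < 1e-9
```

Because the weights lie in (0, 1), the module performs a weighted average rather than a hard choice between the two branches.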

Prediction Module
The fused feature Z generated from the AFF module is fed into the dropout layer which prevents overfitting and can improve the reliability and robustness of the model by discarding some intermediate features. We use a flatten function to integrate the feature vectors that are generated from the dropout layer. Afterwards, the flattened feature vectors are fed to the fully connected (FC) layer that has 32 hidden units. The activation function used in the FC layer is the ELU activation function. Finally, a sigmoid activation function is used to combine the outputs from the FC layer to make the final decision.
6mAPred-MSFF is trained with the Adam optimizer [57], which is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters. For binary classification tasks, the loss function for training 6mAPred-MSFF is set as the binary cross-entropy, measuring the difference between the target and the predicted output:

Loss = −(1/N) Σ_{i=1}^{N} [y i log(ŷ i ) + (1 − y i ) log(1 − ŷ i )] + α‖w‖²

where y i is the true label, ŷ i is the corresponding predicted value from 6mAPred-MSFF, and α‖w‖² is a regularization term to avoid overfitting.
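A plain-Python sketch of this loss, with the L2 term included, might look as follows; the function name and the default α are illustrative:

```python
import math

def bce_with_l2(y_true, y_pred, weights, alpha=0.0):
    """Binary cross-entropy averaged over samples, plus alpha * ||w||^2."""
    n = len(y_true)
    bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(y_true, y_pred)) / n
    return bce + alpha * sum(w * w for w in weights)

# A confident correct prediction incurs a small loss; a confident
# wrong prediction incurs a large one.
assert bce_with_l2([1], [0.99], []) < bce_with_l2([1], [0.01], [])
```

In practice the framework computes this over mini-batches, and the regularization strength α is a hyperparameter.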

Evaluation Metrics
In our experiments, we used the following four indicators to evaluate the predictive performance of our proposed model: Accuracy (ACC), Sensitivity (SN), Specificity (SP), and Matthews Correlation Coefficient (MCC). These are four commonly used indicators for classifier performance evaluation in bioinformatics. They are formulated as follows:

SN = TP / (TP + FN)
SP = TN / (TN + FP)
ACC = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. MCC and ACC are two metrics used to evaluate the overall prediction ability of a predictive model. The receiver operating characteristic (ROC) curve, the area under the curve (AUC), and precision-recall curves (PRC) are used to show the detailed performance of different methods [88]. The AUC ranges from 0.5 to 1; the higher the AUC score, the better the performance of the model [89][90][91].
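The four metrics follow directly from the confusion matrix counts, as in this small sketch:

```python
import math

def metrics(tp, tn, fp, fn):
    """ACC, SN, SP and MCC from a binary confusion matrix."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

acc, sn, sp, mcc = metrics(50, 40, 10, 0)
assert (acc, sn, sp) == (0.9, 1.0, 0.8)
```

Unlike ACC, MCC stays informative on imbalanced datasets, which is why both are reported.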

Performance Comparison on Seven Benchmark Datasets
To demonstrate the effectiveness of the proposed method, we compared its performance with three existing state-of-the-art methods: MM-6mAPred [21], SNNRice6mA [33], and Deep6mA [34]. MM-6mAPred is a traditional machine learning method that trains a model on hand-crafted features extracted from the original DNA sequences.
SNNRice6mA and Deep6mA are deep learning methods. All the predictors were evaluated on the seven benchmark datasets using 5-fold cross-validation. Table 2 lists the performances of the proposed method, 6mAPred-MSFF, and the existing predictors on the seven datasets. For the 6mA-rice-Lv dataset, compared to the second-best method Deep6mA, our proposed method achieves an SN of 97.88%, an SP of 96.64%, an ACC of 96.26%, and an MCC of 0.926, yielding relative improvements over Deep6mA of 3.15%, 0.92%, 2.03%, and 4.63%, respectively. Compared with the machine learning method MM-6mAPred, our proposed method improves the SN, SP, ACC, and MCC by 4.61%, 5.8%, 5.12%, and 12.65%, respectively. For the 6mA-rice-Chen dataset, compared with the runner-up predictor Deep6mA, with the only exception that Deep6mA achieves a better SP, our proposed method improves the SN, ACC, and MCC by 9.89%, 2.79%, and 6.81%, respectively. For the A. thaliana and R. chinensis datasets, with the exception that Deep6mA achieves the best SP, our proposed method improves the SN, ACC, and MCC by 0.28-5.2%. Moreover, the performance of 6mAPred-MSFF is also better than that of all other methods on the F. vesca, H. sapiens, and D. melanogaster datasets. For a more intuitive comparison, we further compared the ROC (receiver operating characteristic) curves and PR (precision-recall) curves of the four predictors, illustrated in Figures 2 and 3, respectively. We observe that our method achieves the best values, an auROC of 0.989 and an auPRC of 0.981, on the 6mA-rice-Lv dataset. We also achieve the best values, an auROC of 0.976 and an auPRC of 0.973, on the 6mA-rice-Chen dataset. For the other species, our proposed method also achieves the best auROC and auPRC values.
Therefore, we can conclude that our proposed method can achieve the best predictive performance for detecting 6mA sites in rice species.

Validation on Other Species
To validate whether the model trained on rice data is applicable to predicting the 6mA sites of other species, we take the A. thaliana, R. chinensis, F. vesca, H. sapiens, and D. melanogaster datasets as independent test sets. We evaluate the performance of 6mAPred-MSFF trained on the rice 6mA-rice-Lv dataset and perform the same evaluation for the other methods, including Deep6mA, SNNRice6mA-large, and MM-6mAPred.
The prediction results are listed in Table 3; among the compared methods, Deep6mA achieves the best value of SP. Surprisingly, all models trained on the 6mA-rice-Lv dataset fail in predicting the 6mA sites of A. thaliana: the best value of SN is only 58.66%, and although the SP values are better, the performance of predicting the 6mA sites of A. thaliana cannot meet practical requirements. Therefore, we conclude that models trained with the 6mA-rice-Lv dataset can be used to predict the 6mA sites of the other species and that the features of 6mA sites share certain similarities among different species.

Performance Impact by Different Encoding Schemes and Feature Descriptors
Our proposed model uses four different encoding schemes (1-gram, NAC, 2-grams, DNC) as the input of the embedding layer. The initial features have an important impact on the training results of deep learning models. To explore the impact of different encoding methods, we evaluate our proposed model with N-grams encoding schemes and different feature descriptors as the initial input.
N-grams encoding. To validate the impact of the N-grams encoding, we compare our proposed model with the encoding schemes (1-gram, 2-grams, 3-grams, 1-gram and 2-grams, 1-gram and 3-grams, 2-grams and 3-grams) on the 6mA-rice-Lv dataset. The prediction results are listed in Table 4. From Table 4, we can see that the 1-gram encoding scheme achieves an SN of 97.15%, second only to our proposed method. The 1-gram and 2-grams encoding scheme achieves an SN of 94.47%, an SP of 94.67%, an ACC of 95.86%, and an MCC of 0.917. Therefore, our proposed method uses the 1-gram and 2-grams encoding scheme, integrating NAC and DNC, to encode the input sequences. For a more intuitive comparison, we further compared the ROC (receiver operating characteristic) curves and PR (precision-recall) curves of the four feature descriptors, illustrated in Figures 4 and 5, respectively. We observe that our proposed method achieves the best values, an auROC of 0.986 and an auPRC of 0.984, under the NCP descriptor. Our proposed method also achieves the best auROC and auPRC values under the other feature descriptors. Therefore, we conclude that our proposed method has a better ability to capture feature information.


Conclusions
In this study, we have proposed a novel predictor called 6mAPred-MSFF to predict DNA 6mA sites. 6mAPred-MSFF is, to our knowledge, the first deep learning predictor to integrate the global and local context via the inverted residual block and the multi-scale channel attention module (MS-CAM). To fuse different feature vectors by fusion weights, 6mAPred-MSFF uses an attentional feature fusion (AFF) module based on the MS-CAM. Compared to existing predictors, our proposed method can automatically learn the global and local features and capture the characteristic specificity of 6mA sites. The experimental results demonstrate that our proposed method effectively increases the accuracy and the generalization ability of DNA 6mA site prediction, showing better performance on the rice benchmarks and on the independent tests of the other species.