SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model

A large number of inorganic and organic compounds are able to bind DNA and form complexes, among which drug-related molecules are important. Chromatin accessibility changes not only directly affect drug–DNA interactions but can also promote or inhibit the expression of critical genes associated with drug resistance by affecting the DNA-binding capacity of TFs and transcriptional regulators. However, the biological experimental techniques for measuring chromatin accessibility are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome, but existing computational models mostly ignore the contextual information provided by the bases in gene sequences. To address these issues, we propose a new solution called SemanticCAP. It introduces a gene language model that models the context of gene sequences and is thus able to provide an effective representation of a given site in a gene sequence. Basically, we merge the features provided by the gene language model into our chromatin accessibility model and, in the process, design two methods, called SFA and SFC, to make feature fusion smoother. Compared with DeepSEA, gkm-SVM, and k-mer-based methods on public benchmarks, our model proved to have better performance, showing a 1.25% maximum improvement in auROC and a 2.41% maximum improvement in auPRC.


Introduction
In human cells, genetic and regulatory information is stored in chromatin, which is deoxyribonucleic acid (DNA) wrapped around histones. The chromatin structure is closely related to gene transcription, protein synthesis, biochemical processes, and other complex biological activities. Among them, the binding of small organic and inorganic molecules to DNA can influence the numerous biological processes in which DNA participates. In particular, many anticancer, antibiotic, and antiviral drugs exert their primary biological effects by reversibly interacting with nucleic acids. Therefore, the study of chromatin structure can help us design drugs to control gene expression and to cure diseases [1]. Some regions of the chromatin are open to transcription factors (TFs), RNA polymerases (RNAPs), drug molecules, and other cellular materials, while others are tightly entangled together and do not play a role in most cellular processes. These two kinds of regions are called open regions and closed regions, which are also known as accessible and inaccessible regions [2]. Measuring the accessibility of chromatin regions can generate clues to gene function that can help us identify appropriate targets for therapeutic intervention. Meanwhile, monitoring changes in chromatin accessibility can help us track and understand drug effects. A study [3] found that chromatin accessibility changes at intergenic regions are associated ...

Theorem 1. For two distributions standardized using layer normalization (LN), denoted $X_1$ and $X_2$, their concatenation $X \equiv [X_1, X_2]$ is still a standardized distribution.
Proof. Suppose that $X_1$ has $n$ elements and $X_2$ has $m$ elements. LN [24] transforms a distribution $X$ as

$$X' = \frac{X - \mu}{\sigma} \qquad (1)$$

where $\mu$ and $\sigma$ are the expectation and standard deviation of $X$, respectively. Obviously, for the normalized distributions $X_1$ and $X_2$, we have

$$E(X_1) = E(X_2) = 0 \qquad (2)$$
$$D(X_1) = D(X_2) = 1 \qquad (3)$$

where $E$ stands for the expectation function and $D$ stands for the variance function. The new distribution $X$ is derived by concatenating $X_1$ and $X_2$, and thus has $n + m$ elements. Inferring from Equation (2), we have

$$E(X) = \frac{nE(X_1) + mE(X_2)}{n + m} = 0 \qquad (4)$$

For the variance of $X$, we have

$$D(X) = E(X^2) - E(X)^2 \qquad (5)$$

For $X_1$ and $X_2$, we also know that

$$E(X^2) = \frac{nE(X_1^2) + mE(X_2^2)}{n + m} \qquad (6)$$
$$E(X_i^2) = D(X_i) + E(X_i)^2, \quad i = 1, 2 \qquad (7)$$

Substituting Equations (2) and (3) into Equations (6) and (7), and finally into Equation (5), we have

$$D(X) = \frac{n + m}{n + m} - 0 = 1 \qquad (8)$$

Equations (4) and (8) demonstrate the standardization of $X$.
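As a quick numerical check of Theorem 1 (a NumPy sketch; the sizes n = 100 and m = 40 and the distribution parameters are arbitrary choices of ours), layer-normalizing two samples and concatenating them leaves the empirical mean at 0 and the standard deviation at 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x):
    # Standardize with population statistics, as LN does (Equation (1)).
    return (x - x.mean()) / x.std()

x1 = layer_norm(rng.normal(5.0, 3.0, size=100))   # n = 100 elements
x2 = layer_norm(rng.normal(-2.0, 0.5, size=40))   # m = 40 elements
x = np.concatenate([x1, x2])                      # X = [X1, X2]

# Mean and variance of the concatenation stay at 0 and 1 (up to float error).
print(round(abs(x.mean()), 6), round(x.std(), 6))
```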
Theorem 2. For any two distributions $X_1$, $X_2$, there always exist two coefficients $\lambda_1$, $\lambda_2$ such that the concatenation of the scaled distributions, $X \equiv [\lambda_1 X_1, \lambda_2 X_2]$, is a standardized distribution.
Proof. Suppose that $X_1$ has $n$ elements and $X_2$ has $m$ elements. We denote the expectations of the two distributions as $\mu_1, \mu_2$ and the variances as $\sigma_1^2, \sigma_2^2$. Notice that $\lambda_1$ and $\lambda_2$ are both scalars. Now, pay attention to $X$. To prove this theorem, we want $X$ to be a standardized distribution, which requires the expectation of $X$ to be 0 and the variance to be 1. Therefore, we can list the following equation set:

$$\begin{cases} E(X) = \dfrac{n\lambda_1\mu_1 + m\lambda_2\mu_2}{n + m} = 0 \\[2mm] D(X) = \dfrac{n\lambda_1^2(\sigma_1^2 + \mu_1^2) + m\lambda_2^2(\sigma_2^2 + \mu_2^2)}{n + m} = 1 \end{cases} \qquad (9)$$

At the same time, we have equations similar to Equations (6) and (7), namely

$$E(\lambda_i X_i) = \lambda_i \mu_i \qquad (10)$$
$$D(\lambda_i X_i) = \lambda_i^2 \sigma_i^2 \qquad (11)$$

which are easy to obtain from the nature of expectation and variance. Notice that Equation (9) has two variables and two independent equations, meaning it is solvable. Solving Equation (9) (assuming $\mu_1 \neq 0$; the symmetric case is analogous), we can determine the solution for $\lambda_1$ and $\lambda_2$ as follows:

$$\lambda_2 = \pm\sqrt{\frac{n + m}{m(\sigma_2^2 + \mu_2^2) + \dfrac{m^2\mu_2^2}{n\mu_1^2}(\sigma_1^2 + \mu_1^2)}}, \qquad \lambda_1 = -\frac{m\mu_2}{n\mu_1}\lambda_2 \qquad (12)$$

The existence of Equation (12) completes our proof. Actually, we obtain two sets of solutions here because $\lambda_1$ can be either positive or negative, and so can $\lambda_2$. The signs of $\lambda_1$ and $\lambda_2$ depend on the signs of $\mu_1$ and $\mu_2$, which can be easily inferred from the first equation in Equation (9).

Corollary 1. For any distributions $X_1, X_2, \ldots, X_n$, there always exist $n$ coefficients $\lambda_1, \lambda_2, \ldots, \lambda_n$ such that the concatenation of the scaled distributions, $X \equiv [\lambda_1 X_1, \lambda_2 X_2, \ldots, \lambda_n X_n]$, is a standardized distribution.
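Theorem 2 can likewise be verified numerically. The NumPy sketch below (the concrete sizes and distribution parameters are our choices) solves the mean and variance conditions for $\lambda_1$ and $\lambda_2$ and checks that the scaled concatenation is standardized:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(3.0, 2.0, size=200)   # n = 200, nonzero mean
x2 = rng.normal(-1.5, 0.7, size=50)   # m = 50, nonzero mean
n, m = len(x1), len(x2)

mu1, mu2 = x1.mean(), x2.mean()
v1, v2 = x1.var(), x2.var()           # population variance (ddof=0)

# Mean condition:     n*l1*mu1 + m*l2*mu2                       = 0
# Variance condition: n*l1^2*(v1+mu1^2) + m*l2^2*(v2+mu2^2)     = n + m
a, b = n * mu1, m * mu2
c, d = n * (v1 + mu1**2), m * (v2 + mu2**2)
l2 = np.sqrt((n + m) / (d + c * (b / a) ** 2))
l1 = -b * l2 / a                      # sign chosen to cancel the means

x = np.concatenate([l1 * x1, l2 * x2])
print(x.mean(), x.std())              # ≈ 0 and ≈ 1
```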
Proof. The overall method is similar to that used for Theorem 2. Note that, in this case, we have $n$ variables but only two independent equations, resulting in infinitely many solutions by the same reasoning as Equation (9). More precisely, the degree of freedom of the solution set is $n - 2$.
Theorem 3. In neural networks, for any two tensors $X, Y$ satisfying $E(X) = E(Y) = 0$, the probability of feature disappearance of $X$ after concatenating and normalizing them is $\Omega\!\left(\frac{SD(Y)}{SD(X)}\right)$, where $SD$ represents the standard deviation.
Proof. Feature disappearance is defined as a situation where the features are too small. Concretely, for a tensor $X$ and a threshold $t_{FD}$, if the result of a subsequent operation on $X$ is smaller than $t_{FD}$, then feature disappearance of $X$ occurs. Here, $t_{FD}$ can be an arbitrarily small value, such as $10^{-5}$.
Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. We denote the expectations of the two distributions as $\mu_1, \mu_2$ and the variances as $\sigma_1^2, \sigma_2^2$. As stated in the precondition, we already know that

$$\mu_1 = \mu_2 = 0 \qquad (13)$$

Consider the concatenation $Z \equiv [X, Y]$. With the help of Equation (9), we have

$$E(Z) = 0, \qquad D(Z) = \frac{n\sigma_1^2 + m\sigma_2^2}{n + m} \qquad (14)$$

We denote $E(Z)$ as $\mu$ and $D(Z)$ as $\sigma^2$. According to Equation (1), normalization maps $X$ to $X' = (X - \mu)/\sigma$, for which we know that

$$E(X') = \frac{\mu_1 - \mu}{\sigma}, \qquad D(X') = \frac{\sigma_1^2}{\sigma^2} \qquad (15)$$

We denote $E(X')$ as $\mu_1'$ and $D(X')$ as $\sigma_1'^2$. Now, we consider the result of a subsequent operation on $X'$, namely $\sum_{i=1}^{n} \lambda_i X_i'$. This is very common in convolution, linear, or attention layers. For this result, an observation is

$$\left|\sum_{i=1}^{n} \lambda_i X_i'\right| \le \lambda_m \sum_{i=1}^{n} |X_i'| \qquad (16)$$

where $\lambda_m = \max_{1\le i\le n} |\lambda_i|$. For the convenience of analysis, all $\lambda$ are set to 1. This does not lose generality because the value scaling from $\lambda_m$ to 1 has no effect on the subsequent derivation. We denote $\sum X'$ as $S_X$. According to the central limit theorem (Lindeberg–Lévy form) [25], $S_X$ approximately obeys a normal distribution, that is,

$$S_X \sim N\!\left(n\mu_1', \; n\sigma_1'^2\right) \qquad (17)$$

For a feature disappearance threshold $t_{FD}$, we want to determine the probability of $|S_X| < t_{FD}$. Denoting this event as $FD$, we obtain

$$\Pr(FD) = \Phi\!\left(\frac{t_{FD} - n\mu_1'}{\sqrt{n}\,\sigma_1'}\right) - \Phi\!\left(\frac{-t_{FD} - n\mu_1'}{\sqrt{n}\,\sigma_1'}\right) \qquad (18)$$

where $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution.
Since $\Phi$ is an integral without a closed-form solution, we cannot analyze it directly. According to Equations (13), (14), and (15), we know that $\mu_1' = 0$. At the same time, $t_{FD}$ is a small number, leading to $\frac{t_{FD}}{\sqrt{n}\,\sigma_1'} \to 0$. Therefore, we can expand $\Phi$ around 0:

$$\Phi(x) = \Phi(0) + \phi(0)\,x + R_1(x) \qquad (20)$$
$$R_1(x) = o(x) \qquad (21)$$

The formula is a Taylor expansion where $\phi$ is the probability density function (pdf) of the standard normal distribution, $R_1(x)$ is the Lagrange remainder, and $o(x)$ is the Peano remainder, standing for a higher-order infinitesimal of $x$. Combining Equations (15), (17), (20), and (21), we achieve

$$\Pr(FD) = \frac{2\phi(0)\,t_{FD}}{\sqrt{n}\,\sigma_1'} = \frac{2\phi(0)\,t_{FD}}{\sqrt{n}}\sqrt{\frac{1 + k_s k_d^2}{1 + k_s}} \qquad (22)$$

where $k_s = \frac{m}{n}$ and $k_d = \frac{\sigma_2}{\sigma_1}$. Since the right-hand side grows linearly in $k_d$ for large $k_d$, the above equation can also be written as $\Pr(FD) = \Omega\!\left(\frac{SD(Y)}{SD(X)}\right)$.

Corollary 2. In neural networks, feature disappearance can lead to gradient disappearance.
Proof. According to Theorem 3, feature disappearance happens when there exists a tensor $T$ such that $|T| < t_{FD}$. Similar to the definition of feature disappearance, gradient disappearance is defined as a situation where the gradients are too small. Concretely, for a parameter $C$ with a gradient $grad_C$ and a threshold $t_{GD}$, if $|grad_C|$ is smaller than $t_{GD}$, the gradient disappearance of $C$ happens. Here, $t_{GD}$ can be an arbitrarily small value. Consider a subsequent operation on $T$, namely $T' = CT^n$, where $n$ stands for the number of layers involved in the calculation. Since $grad_C = \partial T'/\partial C = T^n$, gradient disappearance happens if

$$|T|^n < t_{GD} \qquad (23)$$

At the same time, we already have $|T|^n < t_{FD}^n$, which means that we simply need to meet the requirement

$$t_{FD}^n < t_{GD} \qquad (24)$$

Note that $t_{FD}$ is a small number, which means that $t_{FD} < 1$. Finally, we can derive a condition on $n$:

$$n > \log_{t_{FD}} t_{GD} \qquad (25)$$

Thereby, we obtain a sufficient condition on $n$, and we can come to a conclusion: gradient disappearance occurs in sufficiently deep layers after feature disappearance.
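To make the bound concrete, a tiny sketch (with illustrative thresholds of our own choosing) computes the smallest depth n at which $t_{FD}^n < t_{GD}$:

```python
import math

t_fd, t_gd = 1e-5, 1e-33   # illustrative thresholds (our choice)

# t_fd**n < t_gd  <=>  n > log(t_gd) / log(t_fd)   (both logs are negative)
n_min = math.ceil(math.log(t_gd) / math.log(t_fd))
print(n_min, t_fd ** n_min < t_gd)  # 7 True
```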
The above corollary is consistent with intuition: the disappearance of gradients is always accompanied by the disappearance of features, and it is a persistent problem in deep neural networks.

Theorem 4. In neural networks, for any two tensors $X_1, X_2$ of the same dimension, there always exist two matrices $M_1, M_2$ such that the operation of concatenating the tensors and the operation of adding them after each has been multiplied, in the Hadamard sense, by $M_1$ and $M_2$, respectively, are equivalent in effect.

Proof. First of all, we recall the definition of the Hadamard product [26]. The Hadamard product (also known as the element-wise product) is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimensions. Concretely,

$$(A \circ B)_{ij} = A_{ij}B_{ij} \qquad (26)$$

The symbol '$\circ$' is used to distinguish it from the more common matrix product, which is denoted as '$\cdot$' and is usually omitted. The definition implies that the dimension of $X_1$ should be the same as that of $M_1$, and likewise for $X_2$ and $M_2$. At the same time, $X_1$ and $X_2$ are assumed to have the same dimensions in the precondition of our proposition, so we may set all of them to $\mathbb{R}^{R\times N}$; we write $\lambda_{i1}^T$ and $\lambda_{i2}^T$ for the $i$-th rows of $M_1$ and $M_2$, respectively. Our goal is to weigh the effect of the two operations. For the convenience of comparison, we let the results of the two operations each be multiplied by a matrix, thus converting the dimension to $\mathbb{R}^{R\times M}$. Adding a linear layer is very common in neural networks, and it hardly affects the network's expressive ability.
Considering the first scheme, the concatenation of $X_1$ and $X_2$, we have

$$U = [X_1, X_2]\,A \qquad (28)$$

where $A \in \mathbb{R}^{2N\times M}$ and $U \in \mathbb{R}^{R\times M}$. Observing the $i$-th row and $j$-th column of $U$, we find that

$$U_{ij} = \sum_{p=1}^{N} (X_1)_{ip} A_{pj} + \sum_{p=1}^{N} (X_2)_{ip} A_{(N+p)j} \qquad (29)$$

Considering the second scheme, with the addition of $X_1$ and $X_2$ at its core, we have

$$V = (M_1 \circ X_1 + M_2 \circ X_2)\,B \qquad (30)$$

where $B \in \mathbb{R}^{N\times M}$ and $V \in \mathbb{R}^{R\times M}$. Again paying attention to the $i$-th row and $j$-th column of $V$, we find that

$$V_{ij} = \sum_{p=1}^{N} \big( (\lambda_{i1})_p (X_1)_{ip} + (\lambda_{i2})_p (X_2)_{ip} \big) B_{pj} \qquad (31)$$

Comparing Equations (29) and (31), we find that, when we let $(\lambda_{i1})_p B_{pj}$ equal $A_{pj}$ and $(\lambda_{i2})_p B_{pj}$ equal $A_{(N+p)j}$, the values of $U$ and $V$ are equal, which establishes the effect equivalence.
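The equivalence can be checked on one concrete instance (a NumPy sketch; taking M1 and M2 with constant columns c1, c2 is our simplifying assumption, which makes A easy to build from B):

```python
import numpy as np

rng = np.random.default_rng(3)
R, N, M = 4, 5, 3
x1 = rng.normal(size=(R, N))
x2 = rng.normal(size=(R, N))

# Hadamard scheme: per-column coefficients broadcast into M1 and M2.
c1, c2 = rng.normal(size=N), rng.normal(size=N)
B = rng.normal(size=(N, M))
v = (c1 * x1 + c2 * x2) @ B                        # (M1∘X1 + M2∘X2)·B

# Concat scheme: stack A so the two operations coincide entry by entry.
A = np.vstack([c1[:, None] * B, c2[:, None] * B])  # A has shape (2N, M)
u = np.concatenate([x1, x2], axis=1) @ A           # [X1, X2]·A

print(np.allclose(u, v))  # True
```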
As the equivalence has been proven, similar to the plain concat, no information is lost in the above method. We point out that the Hadamard product is an alternative version of the gate mechanism [27]. We use coefficients to adjust the original distribution to screen out effective features. For the speed and stability of training, setting the initial value of M to 1 is recommended.
Further, we can observe the gradient of the parameters $\lambda$ in Equation (30):

$$\frac{\partial V_{ij}}{\partial (\lambda_{i1})_p} = (X_1)_{ip} B_{pj} \qquad (32)$$

Compared to the gate mechanism, our method is simpler, saves space, and is more direct in gradient propagation.
Of course, Theorem 4 can be generalized to cases with an arbitrary number of tensors. We describe it in the following corollary.

Corollary 3. In neural networks, for any tensors $X_1, X_2, \ldots, X_n$ of the same dimension, there always exist $n$ matrices $M_1, M_2, \ldots, M_n$ such that the operation of concatenating the tensors and the operation of adding them after each has been multiplied, in the Hadamard sense, by its matrix, respectively, are equivalent in effect.
Proof. This proof is similar to the proof for Theorem 4.
Theorem 5. In neural networks, for a layer composed of $n$ neurons, the effective training times of the neurons in this layer reach the maximum when the dropout rate is set to 0 or $1 - \frac{1}{n}$.
Proof. The number of neurons in this layer is n, so we shall mark them as N 1 , N 2 , . . ., N n . Suppose that the dropout rate [28] is p, and the total number of training times is t. We denote 1 − p as q.
Consider the $t_1$-th training step. The network randomly selects $nq$ neurons to update due to the existence of the dropout mechanism. Denote these neurons as $N_1, N_2, \ldots, N_{nq}$.
Without loss of generality, we consider the next time $N_1$ is selected, which is the $t_2$-th training step. We denote the number of neurons selected for update among $N_2, \ldots, N_{nq}$ as $S$, and the number selected among $N_{nq+1}, \ldots, N_n$ as $T$. The selection of each neuron counted by $S$ is an independent event, so we have

$$E(S) = (nq - 1)q \qquad (33)$$

At the same time, the relationship between $S$ and $T$ is

$$S + T = nq - 1 \qquad (34)$$

Inferring from Equations (33) and (34), we achieve

$$E(T) = (nq - 1)(1 - q) \qquad (35)$$

The neurons counted by $S$ are updated jointly at times $t_2$ and $t_1$, thus belonging to the same subnetwork; we assume that they share one training gain with $N_1$. At the same time, the neurons counted by $T$ were not updated at time $t_1$, so each of them has one unique training gain. Therefore, at update time $t_2$, the expected gain of $N_1$ is $1 \times \frac{1}{E(T)+1} \times \frac{1}{E(S)+1}$, which is derived from the above proportion analysis. Paying attention to $t_1$ and $t_2$, we find that $t_2 - t_1$ obeys a geometric distribution, because the selection of $N_1$ is a Bernoulli experiment with probability $q$. That is, $t_2 - t_1 \sim Ge(q)$, meaning that

$$E(t_2 - t_1) = \frac{1}{q} \qquad (36)$$

Therefore, the expected number of training times for $N_1$ is $tq$. The total training gain $G$ is the product of the number of training times and the gain of a single training:

$$G(q) = \frac{tq}{\big((nq-1)q + 1\big)\big((nq-1)(1-q) + 1\big)} \qquad (37)$$

Denote $f(q)$ as the denominator of $G(q)$ and differentiate it to locate the extrema (Equation (38)). With the help of Equation (38), it is easy to draw the curve of $G(q)$, shown in Figure 1, where we set $t$ to 1. The observation is that when $q = \frac{1}{n}$ or $q = 1$, that is, when $p$ is $1 - \frac{1}{n}$ or 0, $G$ reaches the maximum value $\frac{t}{n}$, demonstrating that the effective training times of $N_1$ are the largest. The conclusion generalizes to every neuron in the layer.

Corollary 4.
In neural networks, if the amount of training data is sufficient, the optimal value of the dropout rate is 0.5; if the amount of training data is insufficient, then a number that is close to 1 is a better choice.
Proof. Theorem 5 focuses on the effective neuron training times in the network, while this corollary focuses on representation ability. It can be seen from Equation (37) that the effective training times of a certain layer are directly proportional to the total training times $t$. When the number of training times reaches a certain threshold, the network reaches a balance point, and further training will not bring any performance improvements. If the training data are sufficient, meaning that $t$ and $G$ are large enough, then the network is guaranteed to be fully trained. Therefore, we do not need to worry about whether the neurons in the network are trained enough. However, we still need to consider the representation ability of the network, which has a close relationship with the number of subnetworks $SN$. It can be calculated as

$$SN = \binom{n}{n(1-p)} \qquad (39)$$

which is a combination number. Obviously, when $p$ is 0.5, the number of subnetworks is the largest, and the network's representation ability is relatively strong. However, when there are not enough training data, we cannot guarantee the sufficiency of training. On the one hand, we need to set the dropout rate to a value close to 0 or $1 - \frac{1}{n}$ to guarantee the number of training times indicated by the theorem. On the other hand, in order to ensure the network's representation ability, we want the dropout rate to be close to 0.5. Here, a balanced approach is to choose the turning point shown in Figure 1, which considers both training times and representation ability. Because this point is difficult to analyze, we provide a fitting function, also shown in Figure 1, the error of which is bounded by $2\times10^{-2}$ for $n$ smaller than 512.
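The subnetwork count is easy to evaluate directly; the sketch below (n = 16 is an arbitrary choice of ours) confirms that it peaks at a dropout rate of 0.5:

```python
from math import comb

n = 16
# Subnetwork count SN = C(n, n*(1-p)) for a layer of n neurons.
rates = (0.1, 0.25, 0.5, 0.75, 0.9)
counts = {p: comb(n, round(n * (1 - p))) for p in rates}
print(counts[0.5], max(counts, key=counts.get))  # 12870 0.5
```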
The above corollary is intuitive because the complexity of the network should be proportional to the amount of data. A small amount of data requires a simple model, calling for a higher dropout rate. Notice that a large dropout rate not only enables the model to be fully trained, but it also helps to accelerate the process.
In modern neural network frameworks, the discarded neurons do not participate in gradient propagation during that iteration, which largely reduces the number of parameters that need to be adjusted in the network.

Models
To address the task of chromatin accessibility prediction, we designed SemanticCAP, which includes a DNA language model, described in Section 3.1, and a chromatin accessibility model, described in Section 3.2. Briefly, we augment the chromatin accessibility model with the features provided by the DNA language model, thereby improving the chromatin accessibility prediction performance. The detailed methodology is described as follows.

Process of Data
We used the human reference genome GRCh37 (hg19) as the original data for our DNA language model. The human reference genome is a digital nucleic acid sequence database that can be used as a representative example of the gene set of an idealized individual of a species [29]. Therefore, a model based on the database could be applied to various genetic-sequence-related issues.
The task we designed for our DNA language model uses context to predict the intermediate base. However, there are at least three challenges. The first is that there are two inputs, the upstream and the downstream, defined as the sequences above and below a certain base. Since we predict the middle base from the information on both sides, the definitions of upstream and downstream are interchangeable, which means that the two contexts should be treated in the same way. Second, the input sequence is quite long, far from the output size of 4, which corresponds to the four base classes. The large gap between the input and output lengths means that the network must be designed in a subtle way; otherwise, redundant calculations or poor results may occur. Third, we do not always have such long context data in real situations. For example, the lengths of the upstream sequences in the DNA datasets in Table 1 mostly vary from 0 bp to 600 bp, resulting in insufficient information in some cases.
To solve the above problems, we designed a simple but effective input format and training method. First of all, we randomly selected a certain position, taking the upstream and downstream sequences with lengths of 512 bp as the input; the output is the base connecting the upstream and downstream, i.e., A, T, C, or G. For the first challenge, we combined the upstream and downstream into one sequence, separated by a special prediction token [LOST], and provided different segment embeddings for the two parts. Additionally, a special classification token [CLS] was added to the beginning of the sequence so that the model could learn an overall representation of it. For the second challenge, the final hidden state corresponding to the token [LOST] was used as the aggregate sequence representation for classification, by which the output dimension was reduced quickly without complex network structures. This technique was first used by Bert [30]. For the third challenge, some data augmentation tricks were applied to enhance the capabilities of the model. First, we constructed symmetric sequences based on the principle of base complementation and the non-directionality of DNA sequences, including axial symmetry and mirror symmetry. This helps the model learn these two properties of DNA sequences. Second, we did not feed all of the input to the model, which helps to enhance the model's prediction ability under conditions of insufficient information. Basically, we mask some percentage of the input tokens at random; concretely, there are two strategies. For a certain sequence, either upstream or downstream, we mask (replace with [MASK]) 20% of random tokens in 10% of cases or 40% of consecutive tokens in 15% of cases. Figure 2 shows an example of our mask operation, in which two random tokens of the upstream and three consecutive tokens of the downstream are masked.
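The two masking strategies can be sketched as follows (a hypothetical helper `mask_sequence`; the sampling details beyond the stated percentages are our assumptions):

```python
import random

rng = random.Random(0)

def mask_sequence(tokens):
    """Sketch of the two masking strategies described above; any sampling
    detail beyond the stated percentages is our assumption."""
    tokens = list(tokens)
    r = rng.random()
    if r < 0.10:                      # 10% of cases: mask 20% of random tokens
        k = max(1, round(0.20 * len(tokens)))
        for i in rng.sample(range(len(tokens)), k):
            tokens[i] = "[MASK]"
    elif r < 0.25:                    # next 15% of cases: mask 40% consecutive
        k = max(1, round(0.40 * len(tokens)))
        start = rng.randrange(len(tokens) - k + 1)
        tokens[start:start + k] = ["[MASK]"] * k
    return tokens

samples = [mask_sequence("ATCGATCGAT") for _ in range(1000)]
masked = sum("[MASK]" in s for s in samples)
print(masked / 1000)                  # ≈ 0.25 of sequences get masked
```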
Finally, there is no need to worry about overfitting.
First, we have $10^9$ bases in the DNA dataset, meaning that we will not over-learn specific data. Second, we apply mask and dropout operations during training, both of which are good ways to avoid over-training.

Model Structure
The input and output are constructed as described in Section 3.1.1, and we denote them as $T_{in}$ and $T_{out}$. Basically, the model can be described as

$$T_{out} = mlp(transformer(multi\text{-}conv(embed(T_{in}))))$$

where embed is the input embedding layer, multi-conv stands for our multi-kernel CNN, transformer represents the transformer blocks, and mlp contains a linear layer and a softmax function. Figure 3 shows the full picture of the model.

Function embed() is the encoding layer transforming the input sequence into a matrix. The dimension conversion is $\mathbb{N}^L \to \mathbb{R}^{L\times E}$, where $L$ is the length of the sequence and $E$ is the encoding length. Specifically, we encode the input as

$$embed(T_{in}) = word\text{-}embed(T_{in}) + position\text{-}embed(T_{in}) + segment\text{-}embed(T_{in})$$

where word-embed() carries the meaning of the word itself, position-embed() provides the representation of different positions, and segment-embed() distinguishes the upstream from the downstream. An intuitive approach is to concatenate the three encodings without losing semantics, but this requires triple the space. Instead, we directly add these three encodings. This works because the three parameters are all leaf nodes of the training graph and can automatically adapt to each other's distributions. In this way, we reduce the dimension of the encoded matrix, thus reducing the parameter space and data space.

Function multi-conv() is the multi-kernel convolution layer learning short-range relationships of the sequence. The dimension conversion is $\mathbb{R}^{L\times E} \to \mathbb{R}^{L\times H}$, where $H$ is the hidden dimension. Here, we use convolution kernels of different lengths to learn local relationships at different distances, and we propose a smooth feature addition (SFA) method to fuse these features. Specifically, we carry out

$$T_{cnns} = \sum_{i=1}^{k} \lambda_i \circ conv_i(T_{emb})$$

where $conv_i()$ is a normal one-dimensional convolution layer with a kernel length of $l_i$ whose output dimension is $\mathbb{R}^{L\times H}$, $\lambda_i$ is a network parameter with a dimension of $\mathbb{R}^{L\times H}$, and $k$ is the number of kernels of different lengths. The convolution kernels are small rather than large; their advantages have been verified in DenseNet [31].
On the one hand, small convolution kernels use less space than large convolution kernels.
On the other hand, we need small convolution kernels to learn the local information of the sequence, while the long-range dependence of the sequence is to be explored by the subsequent transformer module. Now, we will explain how we designed the smooth feature addition (SFA) algorithm. Before that, we must provide insight into what happens in the plain concat of features.
In a sequence problem, we often directly concat two features in the last dimension. Specifically, if we have two features with dimensions $\mathbb{R}^{L\times M}$ and $\mathbb{R}^{L\times N}$, the dimension of the features after concat is $\mathbb{R}^{L\times(M+N)}$. One might think this approach loses no information, but, in fact, there is a danger of feature disappearance. For two features with different distributions learned by different modules, a plain concat creates an unbalanced distribution in which some values are extremely small. To make matters worse, layer normalization is usually used to adjust the distribution after a concat operation, causing the values to be concentrated near 0. A quantitative analysis can be seen in Theorem 3. Finally, as the network goes deeper, the gradient disappears, leading to difficulty in learning. This is proven in Corollary 2.
A naive thought is to normalize the two distributions before concatenating them, which is proven to be correct in Theorem 1. However, it is not effective, for it converts the dimension from $\mathbb{R}^{L\times H}$ to $\mathbb{R}^{L\times kH}$, posing a challenge for the subsequent module design. Considering that the dimensions of the convolution features are the same, this inspired us to find a way to smoothly add them using tuning parameters. This is how we designed SFA. Corollary 3 proves the equivalence of SFA and plain concat, and it illustrates the working mechanism of SFA and its advantages in space occupation, feature selection, and gradient propagation.
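A minimal sketch of the SFA fusion described above (NumPy; the conv_i outputs are random stand-ins for real convolution features, and λ is initialized to 1 as recommended in the discussion of Theorem 4):

```python
import numpy as np

rng = np.random.default_rng(4)
L, H, k = 16, 12, 3                  # sequence length, hidden dim, kernel count

# Stand-ins for the outputs of conv_1..conv_k, each of shape (L, H).
conv_outs = [rng.normal(size=(L, H)) for _ in range(k)]

# SFA: one learnable lambda_i per kernel (same shape as its feature map),
# fused by Hadamard product and summation; lambda is initialized to 1.
lambdas = [np.ones((L, H)) for _ in range(k)]
t_cnns = sum(lam * out for lam, out in zip(lambdas, conv_outs))

print(t_cnns.shape)                  # (16, 12): dimension preserved, unlike concat
```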
Function transformer() is the stack of transformer blocks learning long-range relationships of the sequence. The dimension conversion is $\mathbb{R}^{L\times H} \to \mathbb{R}^{L\times H}$. Simply, it can be described as

$$transformer(T_{cnns}) = sub(ff, sub(attention, T_{cnns}))$$

where $sub(f, x) = LN(x + f(x))$, ff represents the feed-forward function, and attention is short for multi-head attention. The module was proposed by Vaswani et al. in 2017 [23]. Function mlp() is the output layer, responsible for converting the hidden state to the output. The dimension conversion is $\mathbb{R}^{L\times H} \to \mathbb{N}$. We extract the tensor corresponding to the token [LOST], convert it into output probabilities through a linear layer, and generate the prediction via a softmax function:

$$T_{out} = softmax(linear(T_{trans}^{[LOST]}))$$
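The residual sub-layer wrapper $sub(f, x) = LN(x + f(x))$ can be sketched as follows (NumPy; the toy sub-layer stands in for attention or the feed-forward block):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (hidden) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def sub(f, x):
    # sub(f, x) = LN(x + f(x)): residual connection, then layer normalization.
    return layer_norm(x + f(x))

x = np.arange(12, dtype=float).reshape(3, 4)
y = sub(lambda t: 2.0 * t, x)        # toy sub-layer in place of attention/ff
print(y.shape)                       # (3, 4): shape preserved, rows re-centered
```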

Process of Data
We selected DNase-seq experiment data from six typical cell lines, including GM12878, K562, MCF-7, HeLa-S3, H1-hESC, and HepG2, as the original data for our chromatin accessibility model. GM12878 is a type of lymphoblast produced by EBV transformation from the blood of a female donor of Northern European and Western European descent.
K562 is an immortalized cell derived from a female patient with chronic myeloid leukemia (CML). MCF-7 is a breast cancer cell sampled from a white female. HeLa-S3 is an immortal cell derived from a cervical cancer patient. H1-hESC is a human embryonic stem cell. HepG2 comes from a male liver cancer patient.
For each cell type, we downloaded the original sequence data from the ENCODE website, used the short-read aligner Bowtie [32] to map the DNA sequences to the human reference genome (hg19), and used HOTSPOT [33] to identify chromatin accessibility regions (peaks), i.e., genome-wide open chromatin regions that can yield information about possible protein binding regions on a genome-wide scale. We treated these variable-length sequences as positive samples. At the same time, we sampled an equal number of equally sized sequences from the whole genome as negative samples. An overview of the data is shown in Table 1, which lists the number of sequences and the minimum, median, maximum, and standard deviation of their lengths. Additionally, the length distributions of the different datasets are shown in Figure 4. For the fairness of comparison, we removed sequences with lengths of less than 36 bp. We truncated or expanded each sequence symmetrically to a length of 768 bp, and we took a context of length 512 bp for each site in it. Therefore, the actual input length of our model is 768 + 512 × 2 = 1792 bp. From Figure 4, we can observe that most of the lengths cluster between 36 bp and 1792 bp, which shows that our cut-off has little impact and is reasonable. Similar to our DNA language model, a special classification token [CLS] was added to the beginning of the sequence to predict accessibility. Compared to the input length of 800 bp in [15], our prediction length increased by 124%, and the number of DNA sequences that did not need to be truncated in the original dataset increased by 17.4%. Moreover, we did not pay a great price for such a long input because the context is handled by a pre-trained model for the predictions. The output is the accessibility of the input sequence, i.e., either 0 for inaccessible or 1 for accessible.
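The symmetric truncation or expansion to 768 bp can be sketched as follows (a hypothetical helper; the padding character is our assumption):

```python
def fit_to_length(seq, target=768, pad="N"):
    """Symmetrically truncate or pad a sequence to `target` bp; a sketch of
    the preprocessing described above (the pad character is our assumption)."""
    if len(seq) >= target:
        left = (len(seq) - target) // 2      # drop evenly from both ends
        return seq[left:left + target]
    extra = target - len(seq)                # pad evenly on both sides
    left = extra // 2
    return pad * left + seq + pad * (extra - left)

print(len(fit_to_length("ACGT" * 300)))      # 1200 bp truncated to 768
print(len(fit_to_length("ACGT" * 10)))       # 40 bp padded to 768
```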
Finally, the ratio of our training set, validation set, and test set is 0.85: 0.05: 0.10. The training set was used to train the model, the validation set was used to adjust the hyperparameters to prevent overfitting, and the test set was used to test the performance of the final model.
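A minimal sketch of the 0.85 : 0.05 : 0.10 split (the helper name `split_dataset` and the fixed seed are our assumptions; the paper does not describe the shuffling procedure):

```python
import random

def split_dataset(samples, ratios=(0.85, 0.05, 0.10), seed=42):
    """Shuffle and split samples into train/validation/test by the given ratios."""
    rng = random.Random(seed)
    samples = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(samples)
    n = len(samples)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train, val, test = split_dataset(list(range(1000)))
assert (len(train), len(val), len(test)) == (850, 50, 100)
```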


Model Structure
The input and output were constructed as described in Section 3.2.1 and are denoted as T_in and T_out. Basically, the model can be described as

T_out = mlp(transformer(sconcat(multi-conv(embed(T_in))))),

where embed is the input embedding layer, multi-conv stands for our multi-kernel CNN, sconcat is short for our SConcat module, transformer represents the transformer blocks, and mlp contains a linear layer and a sigmoid function. Figure 5 shows a full picture of the model. One may find that the accessibility model is very similar to our DNA language model. Indeed, we only modified some of the model structures and changed the hyperparameters, but these are all very critical adjustments that make the model suitable for the task.
Function embed() is the encoding layer transforming the input sequence into a feature matrix. The dimension conversion is N^L → R^{L×E}, where L is the length of the sequence and E is the encoding length. Specifically, we encode the input as

embed(T_in) = token-embed(T_in) + position-embed(T_in).

Note that there is no segment-embed() in this task because there is no need to distinguish between different segments.
Function multi-conv() has been explained in Section 3.1.2. The dimension conversion is R^{L×E} → R^{L×G}, where G is the dimension of the features learned from this layer.
Function sconcat() is the concat layer that fuses the features of the language model with the features learned from multi-conv. The dimension conversion is R^{L×G} → R^{L×(G+H)}, where H is the dimension of the features generated from the DNA language model. Basically, the language model was used to construct features for the different sites in the sequence, and a smooth feature concat (SFC) method was proposed to fuse them with the previous features:

sconcat(T_cnns) = LN([λ1 ⊙ T_cnns, λ2 ⊙ LM(↔T_in)]),

where ↔T_in stands for the contexts of the sites in T_in; λ1 and λ2 are two network parameters with a dimension of R^L; ⊙ denotes element-wise multiplication; and LM refers to our DNA language model. The language model receives a DNA sequence, constructs the context for each site in the sequence, and produces an output of length H for each site. Specifically, if the length of the sequence is L, it constructs L pairs of contexts as the input and outputs an R^{L×H} matrix. Now, we explain how we designed the smooth feature concat (SFC) algorithm. First, we should mention that the output dimension of the language model is R^{L×H}, while the dimension of T_cnns is R^{L×G}, which means we cannot directly apply SFA in this scenario.
Fortunately, the analysis in Section 3.1.2 has already provided a solution to this problem: we can normalize the two distributions separately before concating them. However, this method uses LN twice and consumes additional parameter space and data space. One question is whether it is possible to use LN only once, and it turns out that it is. Theorem 2 states that, for any two distributions, there always exist two coefficients such that the concat of the distributions, after they are multiplied by these coefficients, is a standardized distribution. That is how our SFC works: we multiply the two tensors by two coefficients and then carry out layer normalization after their concatenation. As such, we fuse the two features smoothly with only one use of the LN operation. Interestingly, this method is a weakened version of Theorem 4.
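The one-LN fusion described above can be sketched as follows; `layer_norm`, `sfc`, and the per-position coefficient vectors are illustrative stand-ins for the paper's LN, SConcat module, and λ1, λ2:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standardize each row (last axis) to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sfc(t_cnns, t_lm, lam1, lam2):
    """Scale each feature stream per position, concat, then apply LN once."""
    scaled = np.concatenate([lam1[:, None] * t_cnns,
                             lam2[:, None] * t_lm], axis=-1)
    return layer_norm(scaled)

L, G, H = 4, 6, 8
rng = np.random.default_rng(0)
fused = sfc(rng.normal(size=(L, G)), rng.normal(size=(L, H)),
            np.ones(L), np.ones(L))
assert fused.shape == (L, G + H)
# after the single LN, each position is approximately standardized
assert np.allclose(fused.mean(axis=-1), 0.0, atol=1e-6)
```

Note the difference from the two-LN alternative: here the learned coefficients do the per-stream balancing, and normalization is paid for only once after concatenation.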
Function transformer() is the same as that described in Section 3.
Function mlp() is the output layer and is responsible for transforming the hidden state into the output. The dimension conversion is R^{L×F} → N. We extracted the tensor corresponding to the token [CLS], converted it into an output probability through a linear layer, and generated the prediction value via a sigmoid function. The output process is

mlp(T_trans) = sigmoid(linear(T_trans[[CLS]])). (48)
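A sketch of Equation (48), assuming the [CLS] token occupies the first row of the transformer output (the weights `w`, `b` here are random placeholders, not trained parameters):

```python
import numpy as np

def mlp_head(t_trans, w, b):
    """Equation (48): take the [CLS] row, apply a linear layer, then sigmoid."""
    cls_vec = t_trans[0]              # [CLS] was prepended, so it is row 0
    logit = cls_vec @ w + b           # linear layer: F -> 1
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(1)
t_trans = rng.normal(size=(10, 16))   # L x F hidden states
p = mlp_head(t_trans, rng.normal(size=16), 0.0)
assert 0.0 < p < 1.0                   # a valid accessibility probability
```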

Semantic DNA Evaluation
We compared the performance of our proposed method with several baseline methods, including the gapped k-mer SVM (gkm-SVM) [9], DeepSEA [14], and k-mer [15]. For the sake of fairness, all of the parameters were set to their defaults. Moreover, to prove the effectiveness of the DNA language model, we also tested our accessibility model after excluding the DNA language model. For evaluation purposes, we computed two often-used measures, the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPRC), which are good indicators of the robustness of a prediction model. The classification results for the six datasets are shown in Table 2. Compared to the best baseline, k-mer, our system shows a maximum improvement of 1.25% in auROC and a maximum improvement of 2.41% in auPRC. Although our model does not lead on every dataset, it outperforms k-mer on average, with a 0.02% higher auROC score and a 0.1% higher auPRC score. Compared to gkm-SVM and DeepSEA, SemanticCAP shows an average improvement of about 2-3%. Finally, the introduction of our DNA language model resulted in a performance improvement of about 2%. We also tested the accessibility prediction accuracy for loci shared between different cell lines. For example, GM12878 and HeLa-S3 have 20 loci in common, and the prediction accuracy for these 20 loci in the two cell lines is 85% and 90%, respectively. Similarly, K562 and MCF-7 have 21 loci in common, with prediction accuracies of 80.9% and 90.5%, respectively. This shows the applicability of our system to the loci shared between different cell lines.
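For reference, auROC can be computed with the rank statistic below (a simplified version without tie handling; the helper name `auroc` is ours):

```python
def auroc(labels, scores):
    """Rank-based auROC: the probability that a random positive sample
    is scored above a random negative sample."""
    pairs = sorted(zip(scores, labels))        # ascending by score
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # sum of ranks (1-based) of the positive samples
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y == 1)
    # Mann-Whitney U statistic, normalized by the number of pos/neg pairs
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

assert auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) == 0.75
```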

Effectiveness of Our DNA Language Model
We performed experiments on several different DNA language model structures, which can be divided roughly into two categories. The first category comprises methods based on normal CNNs, and the second uses our multi-conv architecture with data augmentation. Six structures were tested. At the same time, in order to test the prediction ability of the different models in the case of insufficient information, we randomly masked some tokens and tested the results. The complete results are shown in Table 3. Through the comparison of LSTM and Attention, we found that the attention mechanism can greatly improve the prediction ability of the DNA language model. When using the MaxPooling and ReLU functions, we observed that the output of the last hidden layer was mostly 0, with only about 3 of 192 neurons being effective (non-zero). This happens because the ReLU function shields neurons whose values are less than 0, and MaxPooling selectively updates specific neurons. Therefore, we replaced MaxPooling with AveragePooling, and the Attention layer that uses the ReLU function was replaced with a transformer; this is the third method listed in Table 3. The second category uses multi-conv to extract the local features of the sequence. The introduction of the multi-conv mechanism with data augmentation strategies brought increases in accuracy, especially when some tokens were masked. There are three kinds of feature fusion strategies: plain concat (PC), plain add (PA), and our smooth feature add (SFA). The third, fourth, and fifth items in the table indicate that SFA outperforms the other two fusion methods. The last item in Table 3, mconv(SFA)+trans, is the model that we finally chose as our DNA language model.
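The random masking used in the insufficient-information test can be sketched as follows (the masking rate and the `[MASK]` token here are illustrative; the paper does not specify them for this test):

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[MASK]", seed=0):
    """Randomly replace a fraction `rate` of tokens with a mask symbol."""
    rng = random.Random(seed)
    return [mask if rng.random() < rate else t for t in tokens]

toks = list("ACGTACGTAC")
masked = mask_tokens(toks, rate=0.5)
assert len(masked) == len(toks)                 # length is preserved
assert any(t == "[MASK]" for t in masked)       # some tokens were hidden
```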

Effectiveness of Our Chromatin Accessibility Model
We experimented with several chromatin accessibility model structures, all of which were based on the transformer. The main differences are the use of multi-conv and the modules after the transformer. A complete comparison of the results is shown in Table 4. First, we focused on the modules before the transformer. We noticed that the introduction of multi-conv also resulted in performance improvements, especially in F1. In our chromatin accessibility model, we concatenated the features provided by the DNA language model, for which we could either directly concat them (PC) or use our SFC method. The evaluation values of the last two items show the superiority of SFC. We now turn to the comparison of the modules after the transformer. The transformation from the features of the transformer to the result is a challenge; in this part, five methods were tested. It should be mentioned that mconv + SFC + trans + linear is the final model. In terms of training time, our model can be fully parallelized, making it more advantageous than LSTM, which is based on recurrent networks. At the same time, our model has fewer parameters and a simpler structure than Flatten after CNNs and can thus converge quickly. In terms of evaluation, the LSTM-based methods performed poorly. The main reason is that it is difficult for an LSTM to learn the long-range dependencies of a sequence. The convolution layer improves the performance of the LSTM to some extent by shortening the sequence length. In the methods based on Flatten, introducing convolution layers actually reduces the accuracy; this could be caused by the convolution layers destroying the sequence features learned by the transformer. Among the multiple chromatin accessibility models, the method using multi-conv and our smoother concat (SFC) method obtained the best results with a relatively small number of parameters.

Analysis of [CLS]
We were able to observe the effectiveness of introducing the [CLS] symbol into our accessibility model. A direct indicator is the feature corresponding to [CLS] after the transformer layer, i.e., the value of T_trans[[CLS]] in Equation (48). We randomly selected a certain number of positive and negative samples and used our chromatin accessibility model to predict them. For each sample, we output the 256-dimensional tensor corresponding to [CLS] after the transformer layer, and we reduced it to 2-dimensional space with t-SNE, which is shown in Figure 6. According to the figure, the feature has the ability to distinguish positive and negative examples, which is strong evidence of its effectiveness.


Analysis of SFA and SFC
In this section, we conducted two comparison experiments involving PA, PC, SFA, and SFC. When testing the various DNA language models, we made a comparison between SFA, PA, and PC, which correspond to the last three items in Table 3. We used 5 × 10^6 samples to train the three models and drew a training loss map of them, which is shown in Figure 7a. PA reduces the loss at the fastest speed at the beginning because all of the features in multi-conv are trained to the same degree at the same time. However, in the later stage, some features become overtrained while others do not, leading to oscillation of the loss.
In the experiment on the various chromatin accessibility models, we made a comparison between SFC and PC, corresponding to the last two items in Table 4. The first 5 × 10^3 samples were used to measure the training state, which is shown in Figure 7b. As we can see, PC has a lower training speed because it suffers from gradient disappearance. In comparison, the gradient propagation of SFC is selective and more stable over the whole term.
We can observe the effectiveness of SFA from another angle. Consider the parameters C_SFA of SFA in multi-conv, whose dimension is R^{K×L×H}, where K is the number of kernels, L is the sequence length, and H is the hidden dimension. We normalized them and converted them to C′_SFA, whose dimension is R^{K×L}. This was carried out for both the language model and the chromatin accessibility model, and the results can be observed in Figure 8. Note that the sum along the vertical axis in Figure 8a,b is always 1 due to the normalization. Obviously, different sequence positions and different convolution kernels have different weights, which proves SFA's ability to regulate features.
In general, SFA and SFC make training smoother, faster, and better than ordinary concatenation and addition. They are smoother because we used parameters to regulate the features. They are faster because they speed up the training of the model by avoiding the gradient problem. They are simple but effective. In fact, since they share the same essence (the Hadamard product), they share the same advantages.
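A possible sketch of the normalization of C_SFA into C′_SFA; the paper does not state how the hidden axis H is collapsed, so taking the mean absolute value is our assumption:

```python
import numpy as np

def kernel_weights(c_sfa):
    """Collapse the hidden axis of C_SFA (K x L x H), then normalize across
    kernels so the weights at each sequence position sum to 1 (as in Figure 8)."""
    # assumption: collapse H by the mean absolute value
    w = np.abs(c_sfa).mean(axis=2)           # K x L
    return w / w.sum(axis=0, keepdims=True)  # each column (position) sums to 1

rng = np.random.default_rng(2)
w = kernel_weights(rng.normal(size=(3, 5, 7)))   # K=3 kernels, L=5, H=7
assert w.shape == (3, 5)
assert np.allclose(w.sum(axis=0), 1.0)           # vertical axis sums to 1
```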

Conclusions
In this article, we propose a chromatin accessibility prediction model called SemanticCAP. Our model is able to predict open DNA regions and can thus play a guiding role in disease detection, drug design, etc. For example, when a gene called CYMC from the cell line H1-hESC was mutated in its middle region over a length of 5 bp, its accessibility as predicted by our model decreased from 0.98 to 0.14, which is consistent with the experimental finding that the mutation reduces transcription [34]. Another example is a mutation in a gene called HNF4A from the cell line K562, which leads to a reduction in gene expression [35]. Our model predicted that its accessibility decreased from 0.66 to 0.2, which provides a reasonable explanation for the experimentally observed reduction in gene expression caused by the HNF4A mutation. Similarly, we can monitor the accessibility changes of DNA targeted by drugs (especially anticancer drugs), and these changes can provide guidance on drug action. Our main innovations are as follows. First, we introduced the concept of language models from natural language processing to model DNA sequences. This method not only provides the word vector representation of the base itself but also provides sufficient information about the context of a site in a DNA sequence. Second, we used a small number of parameters to solve the feature fusion problem between different distributions. Specifically, we solve the problem of the smooth addition of distributions with the same dimensions using SFA and the problem of the smooth concatenation of distributions with different dimensions using SFC.
Third, we use an end-to-end model design, in which we fully utilize the learning ability and characteristics of the convolution and attention mechanism, thus achieving a better result with fewer parameters and a shorter training time.
Of course, there is still room for improvement in our method. In terms of sample construction, we randomly selected the same number of DNA sequences with the same lengths as negative samples. This approach could be modified; for example, we could deliberately use an unbalanced dataset, since there is so much DNA data available, and then use strategies such as ensemble learning [36] to eliminate the negative effects of data imbalance [37]. In terms of data input, sequence truncation and sequence completion operations exist in our model, which may cause information loss or redundant calculation. Additionally, the task we designed for the DNA language model could be enhanced: multiple positions could be predicted simultaneously, similar to the cloze task in BERT. There are also some limitations in the current study. The first is that the attention mechanism consumes too much memory; it could be replaced by short-range attention or mixed-length attention [38]. Additionally, our smooth feature fusion methods, SFA and SFC, could also be used in multi-head attention to save space and accelerate training. Moreover, the dropout mechanism makes all neurons effective in the prediction phase, but there may exist a more reasonable way of fusing subnetworks. These issues need to be further explored.

Conflicts of Interest:
The authors declare no conflict of interest.