1. Introduction
In human cells, genetic and regulatory information is stored in chromatin, which is deoxyribonucleic acid (DNA) wrapped around histones. Chromatin structure is closely related to gene transcription, protein synthesis, biochemical processes, and other complex biological activities. For example, the binding of small organic and inorganic molecules to DNA can influence numerous biological processes in which DNA participates. In particular, many anticancer, antibiotic, and antiviral drugs exert their primary biological effects by reversibly interacting with nucleic acids. Therefore, studying chromatin structure can help us design drugs to control gene expression and to cure diseases [1]. Some regions of the chromatin are open to transcription factors (TFs), RNA polymerases (RNAPs), drug molecules, and other cellular materials, while others are tightly packed and do not play a role in most cellular processes. These two kinds of regions are called open regions and closed regions, also known as accessible and inaccessible regions [2]. Measuring the accessibility of chromatin regions can generate clues to gene function and thus help us identify appropriate targets for therapeutic intervention. Meanwhile, monitoring changes in chromatin accessibility can help us track and understand drug effects. One study [3] found that chromatin accessibility changes at intergenic regions are associated with ovarian cancer drug resistance. Another example is that chromatin opening (increased accessibility) of targeted DNA satellites explains how DNA-binding pyrrole–imidazole compounds that target different Drosophila melanogaster satellites lead to gain- or loss-of-function phenotypes [4]. In recent years, many high-throughput sequencing technologies have been used to detect open regions, such as DNase-seq [5], FAIRE-seq [6], and ATAC-seq [7]. However, these experimental methods are costly and time-consuming and thus cannot be applied to large-scale examinations. These restrictions have promoted the development of computational methods.
Alongside the progress in computer science, several kinds of sequence-based computational methods have been proposed to identify functional regions. Broadly, we can divide them into traditional machine learning methods [8,9,10,11,12] and neural network methods [13,14,15,16,17]. The machine learning methods are mainly based on support vector machines (SVMs), which perform supervised learning for the classification or regression of data. An SVM method [8] based on k-mer features, which are defined as the full set of segments of varying lengths (3–10 bp) in a long sequence, was designed in 2011 to recognize enhancers in mammalian cells. Subsequently, the gkm-SVM (gapped k-mer SVM) proposed in 2014 [9] exploited a feature set of gapped k-mers to improve the accuracy and stability of recognition; rather than using the segments in full, this method uses only some of the positions within each segment. In recent years, with the rapid development of neural networks and the emergence of various deep learning models, a growing number of deep network models have come to be used to solve such problems, among which convolutional neural networks (CNNs) [18] and recurrent neural networks (RNNs) [19] are dominant. A neural network is a computational learning system that uses a network of functions to translate a data input of one form into a desired output, and deep learning is a type of artificial neural network in which multiple layers of processing are used to extract progressively higher-level features from data. CNNs use the principle of convolution to encode the local information of the data, while RNNs model a sequence using the memory function of their neurons. CNNs are used in DeepBind [13] and DeepSEA [14] to model the sequence specificity of protein binding, and both demonstrated significant performance improvements over traditional SVM-based methods. Min et al. utilized long short-term memory (LSTM) [15] to predict chromatin accessibility and achieved state-of-the-art results at the time, proving the effectiveness of RNNs for DNA sequence problems.
However, we point out that the previous methods have the following shortcomings. First, most of them are based on k-mers, that is, segments of length $k$ taken from the original sequence at fixed intervals. This artificial division of the original sequence may destroy its internal semantic information, making it difficult for subsequent models to learn. Second, with the progress being made in language models, we now have the ability to learn the interior semantic information of sequences through pre-training. There has been related work along these lines, such as using GloVe [20] to train k-mer word vectors. However, these pre-training models are mostly traditional word vector methods. On the one hand, they can only learn the characteristics of the word itself and have no knowledge of the context of DNA sequences [21]. On the other hand, they are limited to a specific dataset and thus cannot be widely applied to other scenarios. Third, traditional CNNs and RNNs have been shown to be unsuitable for long-sequence problems [22]. CNNs, restricted by the size of their convolution kernels, fail to learn global information effectively, while RNNs suffer from vanishing gradients and train slowly on long inputs due to their lack of parallelizability. In contrast, the attention mechanism [23] can effectively learn the long-range dependencies of sequences and has been widely used in the field of natural language processing.
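To make this contrast concrete, the following minimal sketch (ours, not code from the paper; the function names are illustrative) shows the k-mer tokenization used by the methods above next to the character-based input that our model adopts.

    # Minimal sketch (not from the paper): k-mer tokenization, as used by
    # SVM-based methods, versus the character-level input our model uses.

    def kmer_tokens(seq: str, k: int = 3, stride: int = 1):
        """Split a DNA sequence into overlapping segments of length k."""
        return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

    def char_tokens(seq: str):
        """Character-based input: each base is its own token; nothing is merged."""
        return list(seq)

    print(kmer_tokens("ACGTTGCA", k=3))  # ['ACG', 'CGT', 'GTT', 'TTG', 'TGC', 'GCA']
    print(char_tokens("ACGTTGCA"))       # ['A', 'C', 'G', 'T', 'T', 'G', 'C', 'A']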
In response to the above disadvantages, we constructed a chromatin accessibility prediction model called SemanticCAP, which is based on features learned by a language model. The data and code for our system are available at github.com/ykzhang0126/semanticCAP (accessed on 16 February 2022). The SemanticCAP model, trained on DNase-seq datasets, is able to predict the accessibility of DNA sequences from different cell lines and thus can be used as an effective alternative to biological sequencing methods such as DNase-seq. At a minimum, our model makes the following three improvements:
A DNA language model is utilized to learn the deep semantics of DNA sequences, and its semantic features are introduced into the chromatin accessibility prediction process; therefore, we are able to obtain additional information about the complex sequence environment of each site.
Both the DNA language model and the chromatin accessibility model use character-based inputs instead of k-mers, which stand for segments of length $k$. This strategy prevents the information in the original sequences from being destroyed.
The attention mechanism is widely used in our models in place of CNNs and RNNs, making the model more powerful and stable in handling long sequences.
Before formally introducing our method, we first present some preliminary knowledge, including background information, theorems, and corollaries.
2. Theories
Theorem 1. For two distributions standardized by layer normalization (LN), denoted as $X$ and $Y$, the concat of them, that is, $Z = \operatorname{concat}(X, Y)$, is still a standardized distribution.

Proof. Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. As is well known, LN [24] transforms a distribution $X$ as

$\operatorname{LN}(x_i) = (x_i - \mu)/\sigma$, (1)

where $\mu$ and $\sigma$ are the expectation and standard deviation of $X$, respectively. Obviously, for the normalized distributions $X$ and $Y$, we have

$E(X) = 0$, $D(X) = 1$, (2)

$E(Y) = 0$, $D(Y) = 1$, (3)

where $E(\cdot)$ stands for the expectation function and $D(\cdot)$ stands for the variance function. The new distribution $Z$ is derived by concating $X$ and $Y$, and thus has $n + m$ elements. Inferring from Equations (2) and (3), we have

$E(Z) = \dfrac{nE(X) + mE(Y)}{n + m} = 0$, (4)

$D(Z) = E(Z^2) - E(Z)^2 = \dfrac{nE(X^2) + mE(Y^2)}{n + m} - E(Z)^2$. (5)

For $E(X^2)$ and $E(Y^2)$, we also know that

$E(X^2) = D(X) + E(X)^2$, (6)

$E(Y^2) = D(Y) + E(Y)^2$. (7)

Substituting Equations (2) and (3) into Equations (6) and (7), and finally into Equation (5), we have

$D(Z) = \dfrac{n + m}{n + m} - 0 = 1$. (8)

Equations (4) and (8) demonstrate the standardization of $Z$. □
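Theorem 1 is easy to check numerically. The following sketch (ours, not code from the paper) layer-normalizes two random vectors with NumPy, concatenates them, and confirms that the result still has zero mean and unit variance.

    # Numerical check of Theorem 1: the concat of two layer-normalized
    # vectors is still standardized (mean ~ 0, variance ~ 1).
    import numpy as np

    def layer_norm(x: np.ndarray) -> np.ndarray:
        """Standardize x to zero mean and unit variance, as in Equation (1)."""
        return (x - x.mean()) / x.std()

    rng = np.random.default_rng(0)
    x = layer_norm(rng.normal(5.0, 3.0, size=128))   # n = 128 elements
    y = layer_norm(rng.normal(-2.0, 0.5, size=64))   # m = 64 elements
    z = np.concatenate([x, y])

    print(z.mean(), z.var())   # both close to 0 and 1, up to floating-point error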
Theorem 2. For any two distributions $X$ and $Y$, there always exist two coefficients $\lambda_1$ and $\lambda_2$ such that the concat of them, after being multiplied by the two coefficients, respectively, that is, $Z = \operatorname{concat}(\lambda_1 X, \lambda_2 Y)$, is a standardized distribution.
Proof. Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. We denote the expectations of the two distributions as $\mu_1$, $\mu_2$, and the variances as $\sigma_1^2$, $\sigma_2^2$. Notice that $\mu_1$, $\mu_2$, $\sigma_1^2$, and $\sigma_2^2$ are all scalars. Now, pay attention to $Z = \operatorname{concat}(\lambda_1 X, \lambda_2 Y)$. To prove this theorem, we want $Z$ to be a standardized distribution, which requires the expectation of $Z$ to be 0 and the variance to be 1. Therefore, we can list the following equation set:

$\dfrac{n\lambda_1\mu_1 + m\lambda_2\mu_2}{n + m} = 0$, $\dfrac{n\lambda_1^2(\sigma_1^2 + \mu_1^2) + m\lambda_2^2(\sigma_2^2 + \mu_2^2)}{n + m} = 1$. (9)

At the same time, we have equations similar to Equations (6) and (7), those being

$E((\lambda_1 X)^2) = \lambda_1^2(\sigma_1^2 + \mu_1^2)$, (10)

$E((\lambda_2 Y)^2) = \lambda_2^2(\sigma_2^2 + \mu_2^2)$, (11)

which are easy to calculate according to the nature of expectation and variance. Notice that Equation (9) has two variables and two independent equations, meaning it should be solvable. By solving Equation (9), we can determine the numeric solutions of $\lambda_1$ and $\lambda_2$, given in Equation (12). The existence of Equation (12) ends our proof. Actually, we are able to obtain two sets of solutions here because $\lambda_1$ can be either positive or negative, and so can $\lambda_2$. The signs of $\lambda_1$ and $\lambda_2$ depend on the signs of $\mu_1$ and $\mu_2$, which can easily be inferred from the first equation in Equation (9). □
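As a sanity check on Theorem 2, the short sketch below (ours; it assumes NumPy and SciPy are available and simply solves the two moment conditions of Equation (9) numerically) finds such a pair of coefficients for two arbitrary distributions.

    # Sketch: numerically find lam1, lam2 so that concat(lam1 * X, lam2 * Y)
    # has zero mean and unit variance, as Theorem 2 guarantees.
    import numpy as np
    from scipy.optimize import fsolve

    rng = np.random.default_rng(1)
    x = rng.normal(3.0, 2.0, size=200)   # an arbitrary distribution X (n = 200)
    y = rng.normal(-1.0, 0.7, size=50)   # an arbitrary distribution Y (m = 50)

    def conditions(lams):
        lam1, lam2 = lams
        z = np.concatenate([lam1 * x, lam2 * y])
        return [z.mean(), z.var() - 1.0]  # the two conditions of Equation (9)

    lam1, lam2 = fsolve(conditions, x0=[1.0, 1.0])
    z = np.concatenate([lam1 * x, lam2 * y])
    print(lam1, lam2, z.mean(), z.var())  # mean ~ 0, variance ~ 1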
Corollary 1. For any $k$ distributions $X_1, X_2, \ldots, X_k$, there always exist $k$ coefficients $\lambda_1, \lambda_2, \ldots, \lambda_k$ such that the concat of them, after being multiplied by the $k$ coefficients, respectively, that is, $Z = \operatorname{concat}(\lambda_1 X_1, \lambda_2 X_2, \ldots, \lambda_k X_k)$, is a standardized distribution.

Proof. The overall method of proof is similar to that used in Theorem 2. Note that, in this case, we have $k$ variables but only two independent equations, resulting in infinite solutions according to Equation (9). To be more precise, the degree of freedom of our solutions is $k - 2$. □
Theorem 3. In neural networks, for any two tensors $X$ and $Y$ that satisfy $\sigma(X) \gg \sigma(Y)$, the probability of feature disappearance of $Y$ after concating and normalizing them is given by Equation (22), where $\sigma(\cdot)$ represents the standard deviation.
Proof. Feature disappearance is defined as a situation where the features are too small. Concretely, for a tensor $T$ and a threshold $\epsilon$, if the result of a subsequent operation on $T$ is smaller than $\epsilon$, then the feature disappearance of $T$ occurs. Here, $\epsilon$ can be an arbitrarily small value.
Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. We denote the expectations of the two distributions as $\mu_1$, $\mu_2$, and the variances as $\sigma_1^2$, $\sigma_2^2$. As stated in the precondition, we already know that $\sigma_1 \gg \sigma_2$, which is Equation (13). Let $\lambda_1$ and $\lambda_2$ be the coefficients used to normalize the concat. With the help of Equation (9), we obtain their values, given in Equation (14). We denote $\lambda_1 X$ as $X'$ and $\lambda_2 Y$ as $Y'$. According to Equation (1), for the normalized concat $Z = \operatorname{concat}(X', Y')$, we obtain Equation (15). We denote the elements of $Y'$ as $y'_1, y'_2, \ldots, y'_m$. Now, we consider the result of a subsequent operation on $Y'$, which is the weighted sum $\sum_{j=1}^{m} w_j y'_j$. This is very common in convolution, linear, or attention layers. For the result, an observation is given in Equations (16) and (17), where $w_j$ are the weights of the operation. For the convenience of analysis, all $w_j$ are set to 1. This will not result in a loss of generality because the value scaling from $w_j$ to 1 has no effect on the subsequent derivation. Here, we denote $\sum_{j=1}^{m} y'_j$ as $S$. According to the central limit theorem (Lindeberg–Lévy form) [25], we find that $S$ obeys a normal distribution, that is, $S \sim N(a, b^2)$ (Equation (18)). For a feature disappearance threshold $\epsilon$, we want to figure out the probability of $|S| < \epsilon$. Denote this event as $A$, and we can obtain

$P(A) = \Phi\!\left(\dfrac{\epsilon - a}{b}\right) - \Phi\!\left(\dfrac{-\epsilon - a}{b}\right)$, (19)

where $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution. Since it is an integral that does not have a closed-form solution, we cannot directly analyze it. According to Equations (13), (14) and (16), we can bound $a$ and $b$. At the same time, we know that $\epsilon$ is a small number, leading to $(\pm\epsilon - a)/b \approx -a/b$. Therefore, we can Taylor-expand the two terms of Equation (19) around $-a/b$, as given in Equations (20) and (21). In these formulas, $\phi$ is the probability density function (pdf) of the standard normal distribution, the Lagrange remainder bounds the expansion error, and the Peano remainder $o(\epsilon)$ stands for a high-order infinitesimal of $\epsilon$. Combining Equations (15), (17), (20) and (21), we achieve the probability of feature disappearance in Equation (22), where $a$ and $b$ are expressed in terms of $n$, $m$, $\sigma_1$, and $\sigma_2$. The above equation can also be written in terms of the ratio of the two standard deviations. □
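Independently of the exact expression in Equation (22), the effect itself is easy to observe numerically. The sketch below (ours, not the paper's code) concatenates a large-variance tensor with a small-variance one, normalizes the concat, and shows that the small-variance part ends up with far smaller features, which is exactly the disappearance a subsequent layer would suffer from.

    # Illustration of feature disappearance: after concatenating X (large
    # standard deviation) with Y (small standard deviation) and normalizing
    # the concat, the Y features become much smaller than the X features.
    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 4096, 64
    x = rng.normal(0.0, 10.0, size=n)     # sigma(X) >> sigma(Y)
    y = rng.normal(0.0, 0.1, size=m)

    z = np.concatenate([x, y])
    z = (z - z.mean()) / z.std()          # normalize the concat as in LN

    print("mean |feature| in X part:", np.abs(z[:n]).mean())   # roughly 0.8
    print("mean |feature| in Y part:", np.abs(z[n:]).mean())   # far smaller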
Corollary 2. In neural networks, feature disappearance can lead to gradient disappearance.
Proof. According to Theorem 3, feature disappearance happens if there exists a tensor whose subsequent output is smaller than the threshold $\epsilon$. Similar to the definition of feature disappearance, gradient disappearance is defined as a situation where the gradients are too small. Concretely, for a parameter $w$ with a gradient of $g$ and a threshold $\delta$, if $|g|$ is smaller than $\delta$, the gradient disappearance of $w$ happens. Here, $\delta$ can be an arbitrarily small value.

Consider a subsequent operation on such a tensor that passes through $L$ layers, where $L$ stands for the number of layers involved in the calculation. Gradient disappearance happens if the condition in Equation (23) holds. At the same time, we already know that the output of the tensor is smaller than $\epsilon$, which means that we simply need to meet the requirement in Equation (24). Note that $\epsilon$ is a small number, which means that the corresponding factor rapidly shrinks the gradient as it propagates. Finally, we can derive a formula for $L$, given in Equation (25). Thereby, we obtain a sufficient condition for $L$, and we can come to a conclusion: gradient disappearance occurs in layers deep enough after feature disappearance. □
The above corollary is consistent with intuition: the disappearance of gradients is always accompanied by the disappearance of features, and it has long been a problem in deep neural networks.
Theorem 4. In neural networks, for any two tensors $X$ and $Y$ of the same dimension, there always exist two matrices $P$ and $Q$ such that the operation of concating the tensors and the operation of adding them after they have been multiplied in the Hadamard format by the two matrices, respectively, are equivalent in effect.
Proof. First of all, we illustrate the definition of the Hadamard product [26]. The Hadamard product (also known as the element-wise product) is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension. Concretely, we can define it as

$(A \circ B)_{ij} = A_{ij} B_{ij}$. (26)

The symbol '$\circ$' is used to distinguish it from the more common matrix product, which is denoted as '$\cdot$' and is usually omitted. The definition implies that the dimension of $P$ should be the same as that of $X$, as well as $Q$ and $Y$. At the same time, $X$ and $Y$ are assumed to have the same dimension in the precondition of our proposition. As such, we might as well set them all to $s \times d$. The representations of $X$ and $Y$ are presented below:

$X = (x_{ij})_{s \times d}$, $Y = (y_{ij})_{s \times d}$. (27)

Our goal is to weigh the effects of the two operations. For the convenience of comparison, we let the results of the two operations be multiplied by a matrix, thus converting the dimension to $s \times d'$. Adding a linear layer is very common in neural networks, and it hardly affects the network's expression ability.
Considering the first scheme, the concat of $X$ and $Y$, we have

$Z^{(1)} = \operatorname{concat}(X, Y)\, W^{(1)}$, (28)

where $W^{(1)} \in \mathbb{R}^{2d \times d'}$ and $Z^{(1)} \in \mathbb{R}^{s \times d'}$. Observing the $i$-th row and $j$-th column of $Z^{(1)}$, we find that

$z^{(1)}_{ij} = \sum_{k=1}^{d} x_{ik} w^{(1)}_{kj} + \sum_{k=1}^{d} y_{ik} w^{(1)}_{(k+d)j}$. (29)

Considering the second scheme, with the addition of $P \circ X$ and $Q \circ Y$ as the core, we have

$Z^{(2)} = (P \circ X + Q \circ Y)\, W^{(2)}$, (30)

where $W^{(2)} \in \mathbb{R}^{d \times d'}$ and $Z^{(2)} \in \mathbb{R}^{s \times d'}$. Still, we pay attention to the $i$-th row and $j$-th column of $Z^{(2)}$ and find that

$z^{(2)}_{ij} = \sum_{k=1}^{d} p_{ik} x_{ik} w^{(2)}_{kj} + \sum_{k=1}^{d} q_{ik} y_{ik} w^{(2)}_{kj}$. (31)

Comparing Equations (29) and (31), we find that, when we let $p_{ik} w^{(2)}_{kj}$ equal $w^{(1)}_{kj}$ and $q_{ik} w^{(2)}_{kj}$ equal $w^{(1)}_{(k+d)j}$, the values of $z^{(1)}_{ij}$ and $z^{(2)}_{ij}$ are equal, which is strong evidence of effect equivalence. □
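The correspondence can also be checked numerically. The sketch below (ours, not the paper's code) makes the simplifying assumption that the gate matrices broadcast per-feature vectors p and q over all positions; under that assumption, gating-and-adding followed by a linear map reproduces exactly what a plain concat followed by a suitably stacked linear map computes.

    # Sketch: Hadamard-gated addition followed by a linear map W equals a
    # concat followed by the stacked linear map [diag(p) W; diag(q) W],
    # assuming the gates are per-feature vectors broadcast over positions.
    import numpy as np

    rng = np.random.default_rng(3)
    s, d, d_out = 5, 8, 6                 # sequence length, feature dim, output dim
    X, Y = rng.normal(size=(s, d)), rng.normal(size=(s, d))
    p, q = rng.normal(size=d), rng.normal(size=d)   # per-feature gates
    W = rng.normal(size=(d, d_out))                 # shared linear map

    # Scheme 2 of the proof: gate with the Hadamard product, add, then project.
    Z_add = (X * p + Y * q) @ W

    # Scheme 1 of the proof: concat, then project with a weight built from p, q, W.
    W_cat = np.vstack([np.diag(p) @ W, np.diag(q) @ W])   # shape (2d, d_out)
    Z_cat = np.concatenate([X, Y], axis=1) @ W_cat

    print(np.allclose(Z_add, Z_cat))      # True: the two schemes coincide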
As the equivalence has been shown, similar to the plain concat, no information is lost in the above method. We point out that the Hadamard product is an alternative version of the gate mechanism [27]: we use coefficients to adjust the original distributions in order to screen out effective features. For the speed and stability of training, setting the initial values of $P$ and $Q$ to all-ones matrices is recommended.

Further, we can observe the gradient of the parameters $P$ in Equation (30), where we have

$\dfrac{\partial z^{(2)}_{ij}}{\partial p_{ik}} = x_{ik} w^{(2)}_{kj}$. (32)

Compared to the gate mechanism, our method is simpler, saves space, and is more direct in gradient propagation.
Of course, Theorem 4 could be generalized to cases with an arbitrary number of tensors. We describe it in the following corollary:
Corollary 3. In neural networks, for any $k$ tensors $X_1, X_2, \ldots, X_k$ of the same dimension, there always exist $k$ matrices $P_1, P_2, \ldots, P_k$ such that the operation of concating the tensors and the operation of adding them after they have been multiplied in the Hadamard format by the $k$ matrices, respectively, are equivalent in effect.
Proof. This proof is similar to the proof for Theorem 4. □
Theorem 5. In neural networks, for a layer composed of $n$ neurons, the effective training times of the neurons in this layer reach the maximum when the dropout rate is set to 0 or to a value close to 1.
Proof. The number of neurons in this layer is $n$, so we shall mark them as $a_1, a_2, \ldots, a_n$. Suppose that the dropout rate [28] is $p$ and the total number of training times is $T$. We denote $1 - p$ as $q$.
Consider the $t$-th training. The network randomly selects $nq$ neurons to update due to the existence of the dropout mechanism. Denote these neurons as $a_{i_1}, a_{i_2}, \ldots, a_{i_{nq}}$.

Without loss of generality, we consider the next time $a_{i_1}$ is selected, which is the $(t + \Delta t)$-th training. We denote the number of neurons selected for update among $a_{i_2}, \ldots, a_{i_{nq}}$ as $c_1$, and the number of neurons selected among the remaining $n - nq$ neurons as $c_2$. We know that the selection of each neuron is an independent event, so we have

$E(c_1) = (nq - 1)q$, $E(c_2) = (n - nq)q$. (33)

At the same time, the relationship between $c_1$ and $c_2$ is

$c_1 + c_2 = nq - 1$. (34)

Inferring from Equations (33) and (34), we achieve the proportion given in Equation (35).
The neurons represented by $c_1$ are the neurons that are updated jointly at time $t$ and time $t + \Delta t$, thus belonging to the same subnetwork. We assume that they share one training gain with $a_{i_1}$. At the same time, the neurons represented by $c_2$ have not been updated at time $t$; thus, each of them has one unique training gain. Therefore, at the update time $t + \Delta t$, the expected gain of $a_{i_1}$ is the quantity obtained from Equation (35), which is derived from the above proportion analysis. Paying attention to $\Delta t$, we find that $\Delta t$ obeys a geometric distribution, because the selection of $a_{i_1}$ is a Bernoulli experiment with probability $q$. That is, $\Delta t \sim G(q)$, meaning that

$E(\Delta t) = 1/q$. (36)

Therefore, the expected number of training times for $a_{i_1}$ is $Tq$. The total training gain is the product of the number of training times and the gain of a single training, which we denote as $f$; the resulting formula is given in Equation (37). Denote the denominator of $f$ as $h$ and differentiate it to obtain Equation (38). With the help of Equation (38), it is easy to draw an image of $f$, as shown in Figure 1, where we fix $T$ and $n$. The observation is that when $q$ equals 1 or approaches 0, that is, when $p$ is 0 or close to 1, $f$ reaches its maximum value, demonstrating that the effective training times of $a_{i_1}$ are the largest. The conclusion can be generalized to every neuron in the layer. □
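One step of the argument, namely that the gap between consecutive updates of a fixed neuron is geometric with mean 1/q (Equation (36)), is easy to verify by simulation; the sketch below is ours, not the paper's code.

    # Simulate dropout selections of a single neuron and check E(delta_t) = 1/q.
    import numpy as np

    rng = np.random.default_rng(4)
    p = 0.8                      # dropout rate
    q = 1.0 - p                  # probability that the neuron is kept and updated
    T = 200_000                  # training steps

    kept = rng.random(T) < q                 # True when the neuron is updated
    steps = np.flatnonzero(kept)             # indices of its updates
    gaps = np.diff(steps)                    # gaps between consecutive updates
    print(gaps.mean(), 1.0 / q)              # both close to 5.0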
Corollary 4. In neural networks, if the amount of training data is sufficient, the optimal value of the dropout rate is 0.5; if the amount of training data is insufficient, then a number that is close to 1 is a better choice.
Proof. Theorem 5 focuses on the effective training times of neurons in the network, while this corollary focuses on the representation ability. It can be seen from Equation (37) that the effective training times of a certain layer are directly proportional to the total number of training times $T$. When the number of training times reaches a certain threshold, the network reaches a balance point, and further training will not bring any performance improvements.

If the training data are sufficient, meaning that the amount of data and $T$ are large enough, then the network is guaranteed to be fully trained. Therefore, we do not need to worry about whether the training times of the neurons in the network are enough. However, we still need to consider the representation ability of the network, which has a close relationship with the number of subnetworks $N$. It can be calculated as

$N = \binom{n}{nq}$,

which is a combination number. Obviously, when the dropout rate is 0.5, the number of subnetworks is the largest, and the network's representation ability is relatively strong.

However, when there are not enough training data, we cannot guarantee the sufficiency of training. On the one hand, we need to set the dropout rate to a value close to 0 or to 1 to guarantee the number of trainings indicated by the theorem. On the other hand, in order to ensure the network's representation ability, we want the dropout rate to be close to 0.5. Here, a balanced approach is to choose the turning point shown in Figure 1, which considers both training times and representation ability. Because this point is difficult to analyze, we provide a fitting function, also shown in Figure 1, the error of which is bounded for the range of parameters we consider. □
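The representation-ability side of this trade-off is easy to tabulate. The sketch below (ours) evaluates the combination number C(n, n(1 − p)) used above for a few dropout rates and shows that it peaks at p = 0.5.

    # Number of dropout subnetworks C(n, round(n * (1 - p))) for several
    # dropout rates; the count is largest at p = 0.5.
    from math import comb

    n = 20                                   # neurons in the layer
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        k = round(n * (1 - p))               # neurons kept per training step
        print(f"p = {p:.1f}: C({n}, {k}) = {comb(n, k)}")
    # C(20, 10) = 184756 is the largest, so representation ability peaks at p = 0.5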
The above corollary is intuitive because the complexity of the network should be proportional to the amount of data. A small amount of data requires a simple model, calling for a higher dropout rate. Notice that a large dropout rate not only enables the model to be fully trained, but it also helps to accelerate the process.
In a modern neural network framework, the discarded neurons do not participate in gradient propagation in that iteration, which greatly reduces the number of parameters that need to be adjusted.
5. Conclusions
In this article, we propose a chromatin accessibility prediction model called SemanticCAP. Our model is able to predict open DNA regions and thus can play a guiding role in disease detection, drug design, and related applications. For example, when a gene called CYMC from the H1-hESC cell line was mutated in the middle over a length of 5 bp, its accessibility, as predicted by our model, decreased from 0.98 to 0.14, which is consistent with the experimental finding that the mutation reduces transcription [34]. Another example is a mutation in a gene called HNF4A from the K562 cell line, which leads to a reduction in gene expression [35]. Our model predicted that its accessibility decreased from 0.66 to 0.2, providing a reasonable explanation for the experimentally observed reduction in gene expression caused by the HNF4A mutation. Similarly, we can monitor the accessibility changes of DNA regions targeted by drugs (especially anticancer drugs), and such changes can provide guidance on drug action. Our main innovations are as follows. First, we introduced the concept of language models from natural language processing to model DNA sequences. This method not only provides the word vector representation of each base itself, but it also provides sufficient information about the context of a site in a DNA sequence. Second, we used a small number of parameters to solve the feature fusion problem between different distributions. Specifically, we solved the problem of the smooth addition of distributions with the same dimension using SFA and the problem of the smooth concatenation of distributions with different dimensions using SFC.
Third, we used an end-to-end model design in which we fully utilize the learning abilities and characteristics of convolution and the attention mechanism, thus achieving better results with fewer parameters and a shorter training time.
Of course, there is still room for improvement in our method. In terms of sample construction, we randomly selected the same number of DNA sequences of the same length as negative samples. This approach could be modified; for example, we could deliberately use an unbalanced dataset, given how much DNA data are available, and then apply strategies such as ensemble learning [36] to mitigate the negative effects of data imbalance [37]. In terms of data input, sequence truncation and sequence completion operations exist in our model, which may cause information loss or redundant computation. Additionally, the task we designed for the DNA language model could be enhanced: multiple positions could be predicted simultaneously, similar to the cloze task in BERT. There are also some limitations in the current study. The first is that the attention mechanism consumes a large amount of memory; it could be replaced by short-range attention or mixed-length attention [38]. Additionally, our smooth feature fusion methods, SFA and SFC, could also be used in multi-head attention to save space and accelerate training. Moreover, the dropout mechanism makes all neurons effective in the prediction phase, but there may exist a more reasonable way of fusing subnetworks. These issues need to be further explored.