Abstract
A large number of inorganic and organic compounds are able to bind DNA and form complexes, among which drug-related molecules are important. Chromatin accessibility changes not only directly affect drug–DNA interactions, but they can also promote or inhibit the expression of critical genes associated with drug resistance by affecting the DNA binding capacity of TFs and transcriptional regulators. However, the biological experimental techniques for measuring chromatin accessibility are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome, but existing computational models mostly ignore the contextual information provided by the bases in gene sequences. To address these issues, we propose a new solution called SemanticCAP. It introduces a gene language model that models the context of gene sequences and is thus able to provide an effective representation of a given site in a gene sequence. We merge the features provided by the gene language model into our chromatin accessibility model and, during this process, design two methods, called SFA and SFC, to make feature fusion smoother. Compared to DeepSEA, gkm-SVM, and k-mer on public benchmarks, our model proved to have better performance, showing a 1.25% maximum improvement in auROC and a 2.41% maximum improvement in auPRC.
1. Introduction
In human cells, genetic and regulatory information is stored in chromatin, which is deoxyribonucleic acid (DNA) wrapped around histones. The chromatin structure has a lot to do with gene transcription, protein synthesis, biochemical processes, and other complex biological expressions. Among them, the binding of small organic and inorganic molecules to DNA can influence numerous biological processes in which DNA participates. In particular, many anticancer, antibiotic, and antiviral drugs exert their primary biological effects by reversibly interacting with nucleic acids. Therefore, the study of chromatin structure can help us design drugs to control gene expression and to cure diseases []. Some regions of the chromatin are open to transcription factors (TFs), RNA polymerases (RNAPs), drug molecules, and other cellular materials, while others are tightly entangled together and do not play a role in most cellular processes. These two regions of the chromatin are called open regions and closed regions, which are also known as accessible and inaccessible regions []. Measuring the accessibility of chromatin regions can generate clues to gene function that help us to identify appropriate targets for therapeutic intervention. Meanwhile, monitoring changes in chromatin accessibility can help us to track and understand drug effects. A study [] found that chromatin accessibility changes at intergenic regions are associated with ovarian cancer drug resistance. Another example is that chromatin opening (increased accessibility) of targeted DNA satellites can explain how DNA-binding pyrrole–imidazole compounds that target different Drosophila melanogaster satellites lead to gain- or loss-of-function phenotypes []. In recent years, many high-throughput sequencing technologies have been used for the detection of open regions, such as DNase-seq [], FAIRE-seq [], and ATAC-seq []. However, biological experimental methods are costly and time-consuming and thus cannot be applied to large-scale examinations. These restrictions have promoted the development of computational methods.
Alongside the progress in computer science, several kinds of sequence-based computational methods have been proposed to identify functional regions. Simply put, we can divide them into traditional machine learning methods [,,,,] and neural network methods [,,,,]. Machine learning methods are mainly based on support vector machines (SVMs), which perform supervised learning for the classification or regression of data groups. An SVM method [] based on k-mer features, which are defined as the full set of segments of varying lengths (3–10 bp) in a long sequence, was designed in 2011 to recognize enhancers in mammalian cells. Subsequently, the gkm-SVM (gapped k-mer SVM) proposed in 2014 [] exploited a feature set called gapped k-mer features to improve the accuracy and stability of recognition; this method only uses part of the positions within each segment instead of all of them. In recent years, with the rapid development of neural networks and the emergence of various deep learning models, a growing number of deep network models have come to be used to solve such problems, where convolutional neural networks (CNNs) [] and recurrent neural networks (RNNs) [] are dominant. A neural network is a computational learning system that uses a network of functions to understand and translate data input of one form into a desired output, and deep learning is a type of artificial neural network in which multiple layers of processing are used to extract progressively higher-level features from data. CNNs use the principle of convolution to encode the local information of the data, while RNNs model the sequence with reference to the memory function of neurons. CNNs are used in DeepBind [] and DeepSEA [] to model the sequence specificity of protein binding, and both demonstrated significant performance improvements compared to traditional SVM-based methods. Min et al. utilized long short-term memory (LSTM) [] to predict chromatin accessibility and achieved state-of-the-art results at the time, proving the effectiveness of RNNs for DNA sequence problems.
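To make the k-mer representation concrete, the following minimal sketch (our own illustration, not the authors' code) counts all segments of 3–10 bp in a sequence, which is the kind of feature vector the k-mer SVM described above consumes; the function name and defaults are assumptions.

```python
from collections import Counter

def kmer_features(seq: str, k_min: int = 3, k_max: int = 10) -> Counter:
    """Count every substring of length k_min..k_max in a DNA sequence.

    This mirrors the k-mer feature set described above (all segments of
    3-10 bp); the function name and parameter defaults are illustrative.
    """
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

print(kmer_features("ACGTACGT", k_min=3, k_max=4).most_common(3))
```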
However, we point out that the previous methods have the following shortcomings. First, most of the previous methods are based on k-mers, that is, segments of length k taken at intervals along the sequence. This artificial division of the original sequence may destroy its internal semantic information, causing difficulties for subsequent model learning. Second, with the progress being made in language models, we now have the ability to learn the interior semantic information of sequences through pre-training. Some existing methods have explored this direction, such as using GloVe [] to train k-mer word vectors. However, these pre-training models are mostly traditional word vector methods. On the one hand, they can only learn the characteristics of the word itself and have no knowledge of the context of DNA sequences []. On the other hand, they are limited to a specific dataset and thus cannot be widely applied to other scenarios. Third, traditional CNNs and RNNs have been proven to be unsuitable for long-sequence problems []. CNNs, restricted by the size of their convolution kernels, fail to learn global information effectively, while RNNs tend to suffer from gradient disappearance and slow training due to the lack of parallelizability when receiving a long input. In contrast, the attention mechanism (Attention) [] can effectively learn the long-range dependence of sequences and has been widely used in the field of natural language processing.
In response to the above disadvantages, we constructed a chromatin accessibility prediction model called SemanticCAP, which is based on features learned from a language model. The data and code for our system are available at github.com/ykzhang0126/semanticCAP (accessed on 16 February 2022). The SemanticCAP model, trained on DNase-seq datasets, is able to predict the accessibility of DNA sequences from different cell lines and thus can be used as an effective alternative to biological sequencing methods such as DNase-seq. At a minimum, our model makes the following three improvements:
- A DNA language model is utilized to learn the deep semantics of DNA sequences, and its semantic features are introduced into the chromatin accessibility prediction process; therefore, we are able to obtain additional complex contextual information.
- Both the DNA language model and the chromatin accessibility model use character-based inputs instead of k-mers, which stand for segments of length k. This strategy prevents the information in the original sequences from being destroyed.
- The attention mechanism is widely used in our models in place of CNNs and RNNs, making the model more powerful and stable in handling long sequences.
Before formally introducing our method, we will first present some preliminary knowledge, including some common-sense information, theorems, and corollaries.
2. Theories
Theorem 1.
For two distributions standardized using layer normalization (LN), denoted as $\hat{X}$ and $\hat{Y}$, the concat of them, that is, $Z = \mathrm{concat}(\hat{X}, \hat{Y})$, is still a standardized distribution.
Proof.
Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. As we all know, LN [] transforms the distribution as
$$\hat{X}_i = \frac{X_i - \mu_X}{\sigma_X}, \quad (1)$$
where $\mu_X$ and $\sigma_X$ are the expectation and standard deviation of $X$, respectively. Obviously, for the normalized distributions $\hat{X}$ and $\hat{Y}$, we have
$$E(\hat{X}) = 0, \quad D(\hat{X}) = 1, \quad (2)$$
$$E(\hat{Y}) = 0, \quad D(\hat{Y}) = 1, \quad (3)$$
where $E(\cdot)$ stands for the expectation function and $D(\cdot)$ stands for the deviation (variance) function. The new distribution $Z$ is derived by concating $\hat{X}$ and $\hat{Y}$, and thus has $n+m$ elements. Inferring from Equations (2) and (3), we have
$$E(Z) = \frac{nE(\hat{X}) + mE(\hat{Y})}{n+m} = 0, \quad (4)$$
$$D(Z) = E(Z^2) - E(Z)^2 = \frac{nE(\hat{X}^2) + mE(\hat{Y}^2)}{n+m} - E(Z)^2. \quad (5)$$
For $E(\hat{X}^2)$ and $E(\hat{Y}^2)$, we also know that
$$E(\hat{X}^2) = D(\hat{X}) + E(\hat{X})^2, \quad (6)$$
$$E(\hat{Y}^2) = D(\hat{Y}) + E(\hat{Y})^2. \quad (7)$$
Substituting Equations (2) and (3) into Equations (6) and (7), and finally into Equation (5), we have
$$D(Z) = \frac{n+m}{n+m} - 0 = 1. \quad (8)$$
Equations (4) and (8) demonstrate the standardization of $Z$. □
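As a quick numerical sanity check of Theorem 1 (a sketch with arbitrary toy vectors, not part of the original paper), one can layer-normalize two vectors of different sizes and verify that their concatenation still has zero mean and unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(v: np.ndarray) -> np.ndarray:
    # Standardize to zero mean and unit (population) variance, as LN does.
    return (v - v.mean()) / v.std()

x_hat = layer_norm(rng.normal(5.0, 3.0, size=128))   # n elements
y_hat = layer_norm(rng.normal(-2.0, 0.5, size=512))  # m elements
z = np.concatenate([x_hat, y_hat])

print(z.mean(), z.var())  # ~0 and ~1, as Theorem 1 predicts
```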
Theorem 2.
For any two distributions $X$ and $Y$, there always exist two coefficients $\lambda$ and $\mu$ such that the concat of them, after being multiplied by the two coefficients, respectively, that is, $Z = \mathrm{concat}(\lambda X, \mu Y)$, is a standardized distribution.
Proof.
Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. We denote the expectations of the two distributions as $\mu_X$, $\mu_Y$, and the variances as $\sigma_X^2$, $\sigma_Y^2$. Notice that $\mu_X$, $\mu_Y$, $\sigma_X^2$, and $\sigma_Y^2$ are all scalars. Now, pay attention to $Z = \mathrm{concat}(\lambda X, \mu Y)$. To prove this theorem, we want $Z$ to be a standardized distribution, which requires the expectation of $Z$ to be 0 and the variance to be 1. Therefore, we can list the following equation set:
$$E(Z) = 0, \quad D(Z) = 1. \quad (9)$$
At the same time, we have equations similar to Equations (6) and (7), those being
$$E(Z) = \frac{n\lambda\mu_X + m\mu\mu_Y}{n+m}, \quad (10)$$
$$D(Z) = \frac{n\lambda^2(\sigma_X^2 + \mu_X^2) + m\mu^2(\sigma_Y^2 + \mu_Y^2)}{n+m} - E(Z)^2, \quad (11)$$
which are easy to calculate according to the nature of expectation and variance. Notice that Equation (9) has two variables, $\lambda$ and $\mu$, and two independent equations, meaning it should be solvable. By calculating Equation (9), we can determine the numeric solution of $\lambda$ and $\mu$ as follows:
$$\lambda = \pm m\mu_Y\sqrt{K}, \quad \mu = \mp n\mu_X\sqrt{K}, \quad K = \frac{n+m}{nm\left(m\mu_Y^2(\sigma_X^2+\mu_X^2) + n\mu_X^2(\sigma_Y^2+\mu_Y^2)\right)}. \quad (12)$$
The existence of Equation (12) ends our proof. Actually, we are able to obtain two sets of solutions here because $\lambda$ can either be positive or negative, and so can $\mu$. The signs of $\lambda$ and $\mu$ depend on the signs of $\mu_X$ and $\mu_Y$, which can be easily inferred from the first equation in Equation (9). □
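The coefficients in Theorem 2 can also be found numerically. The sketch below is our own illustration: it uses the sample statistics of two toy vectors, assumes the two means are not both zero, and then verifies that the scaled concatenation is standardized.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=100)   # n elements
y = rng.normal(-1.0, 5.0, size=300)  # m elements
n, m = len(x), len(y)

mx, my = x.mean(), y.mean()
ex2, ey2 = (x**2).mean(), (y**2).mean()   # second moments

# Parameterize lam = t*m*my, mu = -t*n*mx so the mean of the concat is 0,
# then pick t so the variance is 1 (assumes mx and my are not both zero).
t = np.sqrt((n + m) / (n * m * (m * my**2 * ex2 + n * mx**2 * ey2)))
lam, mu = t * m * my, -t * n * mx

z = np.concatenate([lam * x, mu * y])
print(z.mean(), z.var())  # ~0 and ~1
```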
Corollary 1.
For any distributions $X_1, X_2, \ldots, X_k$, there always exist coefficients $\lambda_1, \lambda_2, \ldots, \lambda_k$ such that the concat of them, after being multiplied by the coefficients, respectively, that is, $Z = \mathrm{concat}(\lambda_1 X_1, \lambda_2 X_2, \ldots, \lambda_k X_k)$, is a standardized distribution.
Proof.
The overall method of proof is similar to that used in Theorem 2. Note that, in this case, we have $k$ variables but only two independent equations, resulting in infinitely many solutions according to Equation (9). To be more precise, the degree of freedom of our solutions is $k-2$. □
Theorem 3.
In neural networks, for any two tensors $X$ and $Y$ that satisfy $\sigma(X) \ll \sigma(Y)$, the probability of feature disappearance of $X$ after concating and normalizing them is substantial, with a quantitative estimate derived in the proof, where $\sigma(\cdot)$ represents the standard deviation.
Proof.
Feature disappearance is defined as a situation where the features are too small. Concretely, for a tensor $X$ and a threshold $\epsilon$, if the result of a subsequent operation on $X$ is smaller than $\epsilon$, then the feature disappearance of $X$ occurs. Here, $\epsilon$ can be an arbitrarily small value.
Suppose that has elements and has elements. We denote the expectation of the two distributions as , , and the variance as , . As stated in the precondition, we already know that
Let and . With the help of Equation (9), we have
We denote as and as . According to Equation (1), for , we know that
We denote as and as . Now, we consider the results of a subsequent operation of , which is . This is very common in convolution, linear, or attention layers. For the result, an observation is
where . For the convenience of analysis, all are set to 1. This will not result in a loss of generality because the value scaling from to 1 has no effect on the subsequent derivation. Here, we denote as . According to the central limit theorem (Lindeberg–Lévy form) [], we find that obeys a normal distribution, that is
For a feature disappearance threshold , we want to figure out the probability of . Denote this event as , and we can obtain
where is the cumulative distribution function (cdf) of the standard normal distribution. Since it is an integral that does not have a closed form solution, we cannot directly analyze it. According to Equations (13), (14) and (16), we know that . At the same time, we know that is a small number, leading to . Therefore, we have the equation as follows:
The formula is a Taylor expansion where is the probability density function (pdf) of the standard normal distribution, is the Lagrange remainder, and is the Peano remainder, standing for a high-order infinitesimal of . Combining Equations (15), (17), (20) and (21), we achieve
where and . The above equation can also be written as . □
Corollary 2.
In neural networks, feature disappearance can lead to gradient disappearance.
Proof.
According to Theorem 3, feature disappearance happens if there exists a tensor whose standard deviation is much smaller than that of the tensor it is concatenated with. Similar to the definition of feature disappearance, gradient disappearance is defined as a situation where the gradients are too small. Concretely, for a parameter $w$ with a gradient of $g$ and a threshold $\delta$, if $|g|$ is smaller than $\delta$, the gradient disappearance of $w$ happens. Here, $\delta$ can be an arbitrarily small value.
Consider a subsequent operation of , which is , where stands for the number of layers involved in the calculation. The gradient disappearance happens if
At the same time, we already have , which means that we simply need to meet the requirements for
Note that is a small number, which means that . Finally, we can derive a formula for :
Thereby, we obtain a sufficient condition for gradient disappearance, and we can come to a conclusion: gradient disappearance occurs in sufficiently deep layers after feature disappearance. □
The above corollary is consistent with intuition. The disappearance of gradients is always accompanied by the disappearance of features, and it is always a problem in deep neural networks.
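The effect described in Theorem 3 and Corollary 2 is easy to reproduce: concatenate a small-variance feature with a large-variance one, apply layer normalization, and the first block collapses toward zero. The snippet below is a toy demonstration with made-up scales, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
small = rng.normal(0.0, 0.01, size=64)    # feature block with tiny variance
large = rng.normal(0.0, 10.0, size=64)    # feature block with large variance

z = np.concatenate([small, large])
z = (z - z.mean()) / z.std()              # layer normalization over the concat

# After LN, the "small" half is squashed near zero: any weighted sum over it
# (a convolution or linear layer) produces near-zero outputs, i.e., the
# feature disappearance discussed in Theorem 3.
print(np.abs(z[:64]).mean(), np.abs(z[64:]).mean())
```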
Theorem 4.
In neural networks, for any two tensors $X$ and $Y$ of the same dimension, there always exist two matrices $A$ and $B$ such that the operation of concating them and the operation of adding them after they have been multiplied in the Hadamard format by the two matrices, respectively, are equivalent in effect.
Proof.
First of all, we illustrate the definition of the Hadamard product []. The Hadamard product (also known as the element-wise product) is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension. Concretely, we can define it as $(A \circ X)_{ij} = A_{ij}X_{ij}$. The symbol ‘$\circ$’ is used to distinguish it from the more common matrix product, which is denoted as ‘$\cdot$’ and is usually omitted. The definition implies that the dimension of $A$ should be the same as that of $X$, as well as $B$ and $Y$. At the same time, $X$ and $Y$ are assumed to have the same dimensions in the precondition of our proposition. As such, we might as well set them to $n \times d$, writing $X = (X_{ij})_{n \times d}$ and $Y = (Y_{ij})_{n \times d}$.
Our goal is to weigh the effect of the two operations. For the convenience of comparison, we let the results of the two operations be multiplied by a matrix, thus converting the dimension to $n \times e$. Adding a linear layer is very common in neural networks, and it hardly affects the network's expression ability.
Considering the first scheme, the concat of $X$ and $Y$, we have
$$Z_1 = \mathrm{concat}(X, Y)\,W_1, \quad (28)$$
where $W_1 \in \mathbb{R}^{2d \times e}$ and $Z_1 \in \mathbb{R}^{n \times e}$. Observing the $i$-th row and $j$-th column of $Z_1$, we find that
$$(Z_1)_{ij} = \sum_{k=1}^{d} X_{ik}(W_1)_{kj} + \sum_{k=1}^{d} Y_{ik}(W_1)_{(d+k)j}. \quad (29)$$
Considering the second scheme, with the addition of $A \circ X$ and $B \circ Y$ as the core, we have
$$Z_2 = (A \circ X + B \circ Y)\,W_2, \quad (30)$$
where $A, B \in \mathbb{R}^{n \times d}$, $W_2 \in \mathbb{R}^{d \times e}$, and $Z_2 \in \mathbb{R}^{n \times e}$. Still, we pay attention to the $i$-th row and $j$-th column of $Z_2$ and find that
$$(Z_2)_{ij} = \sum_{k=1}^{d} \left(A_{ik}X_{ik} + B_{ik}Y_{ik}\right)(W_2)_{kj}. \quad (31)$$
Comparing Equations (29) and (31), we find that, when we let $A_{ik}(W_2)_{kj}$ equal $(W_1)_{kj}$ and $B_{ik}(W_2)_{kj}$ equal $(W_1)_{(d+k)j}$, the values of $(Z_1)_{ij}$ and $(Z_2)_{ij}$ are equal, which is strong evidence of effect equivalence. □
As the equivalence has been proven, similar to the plain concat, no information is lost in the above method. We point out that the Hadamard product is an alternative version of the gate mechanism []. We use coefficients to adjust the original distributions in order to screen out effective features. For the speed and stability of training, setting the initial values of $A$ and $B$ to all ones is recommended.
Further, we can observe the gradients of the parameters in Equation (30), where we have
$$\frac{\partial (Z_2)_{ij}}{\partial A_{ik}} = X_{ik}(W_2)_{kj}, \quad \frac{\partial (Z_2)_{ij}}{\partial B_{ik}} = Y_{ik}(W_2)_{kj}. \quad (32)$$
Compared to the gate mechanism, our method is simpler, saves space, and is more direct in gradient propagation.
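One direction of the equivalence in Theorem 4 can be checked directly: a Hadamard-gated sum followed by a linear layer can be rewritten as a plain concat followed by a suitably structured linear layer. The check below is our own illustration with toy shapes, using per-feature gate vectors (a special case of the matrices A and B in which every row is identical).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, e = 8, 16, 4
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, d))
a, b = rng.normal(size=d), rng.normal(size=d)   # per-feature gates
V = rng.normal(size=(d, e))                     # shared projection

# Scheme 2: gated (Hadamard) sum, then a linear layer.
z_add = (X * a + Y * b) @ V

# Scheme 1: concat, then a linear layer whose blocks absorb the gates.
W = np.vstack([np.diag(a) @ V, np.diag(b) @ V])  # shape (2d, e)
z_cat = np.concatenate([X, Y], axis=1) @ W

print(np.allclose(z_add, z_cat))  # True: the two schemes compute the same map
```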
Of course, Theorem 4 could be generalized to cases with an arbitrary number of tensors. We describe it in the following corollary:
Corollary 3.
In neural networks, for any $k$ tensors $X_1, X_2, \ldots, X_k$ of the same dimension, there always exist $k$ matrices $A_1, A_2, \ldots, A_k$ such that the operation of concating them and the operation of adding them after they have been multiplied in the Hadamard format by the $k$ matrices, respectively, are equivalent in effect.
Proof.
This proof is similar to the proof for Theorem 4. □
Theorem 5.
In neural networks, for a layer composed of $n$ neurons, the effective training times of the neurons in this layer reach the maximum when the dropout rate approaches 0 or 1.
Proof.
The number of neurons in this layer is $n$, so we shall mark them as $a_1, a_2, \ldots, a_n$. Suppose that the dropout rate [] is $p$ and the total number of training times is $T$. We denote $1-p$ as $q$.
Consider the -th training. The network randomly selects neurons to update due to the existence of the dropout mechanism. Denote these neurons as , , , .
Without the loss of generality, we consider the next time is selected, which is the -th training time. We denote the number of neurons selected for update in , , as , and the number of neurons selected in , , as . We know that the selection of neurons in is an independent event, so we have
At the same time, the relationship between and is
Inferring from Equations (33) and (34), we achieve
The neurons represented by are the neurons that are updated jointly at time and time , thus belonging to the same subnetwork. We assume that they share one training gain with . At the same time, the neurons represented by have not been updated at time ; thus, each of them has one unique training gain. Therefore, at the update time , the expected gain of is , which is derived from the above proportion analysis. Paying attention to and , we find that obeys geometric distribution because the selection of is a Bernoulli experiment with probability . That is, , meaning that
Therefore, the expected number of training times for is . The total training gain is the product of the number of training times and the gain of a single time training, which we denote as . Now, the formula emerges:
Denote as the denominator of and differentiate that to obtain
With the help of Equation (38), it is easy to draw an image of , shown in Figure 1, where we set to . The observation is that when or , that is, is 0 or , reaches the maximum value , demonstrating that the effective training times of are the largest. The conclusion can be generalized to every neuron in the layer. □
Figure 1.
The graph of the function G with respect to the dropout rate p, derived from Equation (37) in Theorem 5. It shows the gain of a single training time as n differs. G varies with p, and the extreme points (squares) and turning points (triangles) vary with n. The turning point is approximately 8.72 × 10^{-9}n^3 − 9.35 × 10^{-6}n^2 + 2.44 × 10^{-3}n + 0.78, which is a good choice for the dropout rate because it balances the gain of a single training time and the representation ability of a neural network.
Corollary 4.
In neural networks, if the amount of training data is sufficient, the optimal value of the dropout rate is 0.5; if the amount of training data is insufficient, then a number that is close to 1 is a better choice.
Proof.
Theorem 5 focuses on the effective training times of the neurons in the network, and this corollary focuses on the representation ability. It can be seen from Equation (37) that the effective training times of a certain layer are directly proportional to the total number of training times $T$. When the number of training times reaches a certain threshold, the network reaches a balance point, and further training will not bring any performance improvements.
If the training data are sufficient, meaning that the dataset and the total number of training times $T$ are large enough, then the network is guaranteed to be fully trained. Therefore, we do not need to worry about whether the training times of the neurons in the network are enough. However, we still need to consider the representation ability of the network, which has a close relationship with the number of subnetworks $N_{sub}$. It can be calculated as
$$N_{sub} = \binom{n}{n(1-p)}, \quad (39)$$
which is a combination number. Obviously, when $p$ is $0.5$, the number of subnetworks is the largest, and the network's representation ability is relatively strong.
However, when there are not enough training data, we cannot guarantee the sufficiency of training. On the one hand, we need to set the dropout rate to a value close to 0 or 1 to guarantee the number of training times indicated by the theorem. On the other hand, in order to ensure the network's representation ability, we want the dropout rate to be close to 0.5. Here, a balanced approach is to choose the turning point shown in Figure 1, which considers both the training times and the representation ability. Because this point is difficult to analyze, we provide the fitting function shown in Figure 1, whose error remains small over the range of $n$ considered. □
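The representation-ability part of Corollary 4 (the number of distinct subnetworks is a binomial coefficient maximized at a dropout rate of 0.5) can be seen with a few lines; the layer size below is an arbitrary example.

```python
from math import comb

n = 64  # neurons in the layer (arbitrary example size)
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    kept = round(n * (1 - p))          # neurons retained per forward pass
    print(p, comb(n, kept))            # number of possible subnetworks

# The count peaks at p = 0.5, matching the representation-ability argument;
# Theorem 5's training-gain argument instead favors rates near 0 or 1,
# hence the compromise "turning point" discussed above.
```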
The above corollary is intuitive because the complexity of the network should be proportional to the amount of data. A small amount of data requires a simple model, calling for a higher dropout rate. Notice that a large dropout rate not only enables the model to be fully trained, but it also helps to accelerate the process.
In a modern neural network framework, the discarded neurons will not participate in gradient propagation this time, which largely reduces the number of parameters that need to be adjusted in the network.
3. Models
To address the task of chromatin accessibility prediction, we designed SemanticCAP, which includes a DNA language model that is shown in Section 3.1 and a chromatin accessibility model that is shown in Section 3.2. Briefly, we augment the chromatin accessibility model with the features provided by the DNA language model, thereby improving the chromatin accessibility prediction performance. A detailed methodology is described as follows.
3.1. DNA Language Model
3.1.1. Process of Data
We used the human reference genome GRCh37 (hg19) as the original data for our DNA language model. The human reference genome is a digital nucleic acid sequence database that can be used as a representative example of the gene set of an idealized individual of a species []. Therefore, a model based on the database could be applied to various genetic-sequence-related issues.
The task we designed for our DNA language model uses context to predict the intermediate base. However, there are at least three challenges. The first is that there are two inputs, the upstream and the downstream, which are defined as the upper and lower sequences of a certain base. Since we predict the middle base from the information on both sides, the definitions of the upstream and downstream are interchangeable, which means that the two contexts should be treated in the same way. Second, the length of the input sequence is quite long, far from the output length of 4, which stands for the classification over the four bases. The large gap between the input and output lengths means that the neural network must be designed in a subtle way; otherwise, redundant calculations or poor results may occur. Third, we do not always have such long context data in real situations. For example, the lengths of the upstream contexts in the DNA datasets in Table 1 are limited, resulting in insufficient information in some cases.
Table 1.
An overall view of the accessible DNA segments of each cell line. l_mean and l_med show the approximate length of the data, and l_std describes how dispersed the data are. Lengths are given in bp.
To solve the above problems, we designed a simple but effective input format and training method. First of all, we randomly selected a certain position, taking the upstream and downstream sequences of fixed length as the input; the output is the base connecting the upstream and downstream, i.e., A, C, G, or T.
For the first challenge, we combined the upstream and downstream into one sequence, separated by a special prediction token, and provided different segment embeddings for the two parts. Additionally, a special classification token [CLS] was added to the beginning of the sequence so that the model could learn an overall representation of it. For the second challenge, the final hidden state corresponding to the [CLS] token was used as the aggregate sequence representation for classification, by which the output dimension was reduced quickly without complex network structures. This technique was first used by BERT []. For the third challenge, some data augmentation tricks were applied to enhance the capabilities of the model. First, we constructed symmetric sequences based on the principle of base complementation and the non-directionality of DNA sequences, including axial symmetry and mirror symmetry, which helps the model learn these two properties of DNA sequences. Second, we did not feed all of the input tokens to the model, which helps to enhance the model's prediction ability under conditions of insufficient information. Basically, we mask some percentage of the input tokens at random, and concretely there are two strategies: for a certain sequence, either upstream or downstream, we mask (replace with the mask token) a proportion of random tokens in some cases or a run of consecutive tokens in the remaining cases. Figure 2 is an example of our mask operation. In this case, two random tokens of the upstream and three consecutive tokens of the downstream are masked.
Figure 2.
An example of mask operation on a random DNA sequence. Here, C and T of the upstream and C, G, and A of the downstream are masked. The intermediate base is T, which is the target that needs to be predicted.
Finally, there is no need to worry about overfitting. First, we have on the order of 10^9 bases in the DNA dataset, meaning that we will not over-learn some specific data. Second, we have mask and dropout operations in our training, both of which are great ways to avoid over-training.
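The masking augmentation described above can be sketched as follows. The special token names, the mask ratio, and the span length are placeholders of our own choosing; the paper's exact values are not reproduced here.

```python
import random

# Placeholder special tokens (names are assumptions, not the paper's exact vocabulary).
MASK, PRED, CLS = "[MASK]", "[P]", "[CLS]"

def mask_side(tokens, ratio=0.15, span=3):
    """Mask either scattered random tokens or one consecutive span."""
    tokens = list(tokens)
    if random.random() < 0.5:                       # strategy 1: random tokens
        k = max(1, int(ratio * len(tokens)))
        for i in random.sample(range(len(tokens)), k):
            tokens[i] = MASK
    else:                                           # strategy 2: a consecutive span
        start = random.randrange(len(tokens) - span)
        tokens[start:start + span] = [MASK] * span
    return tokens

def build_input(upstream, downstream):
    # [CLS] upstream [P] downstream, with each side masked independently.
    return [CLS] + mask_side(upstream) + [PRED] + mask_side(downstream)

print(build_input(list("ACGTACGTAC"), list("TTGACGTGCA")))
```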
3.1.2. Model Structure
The input and output are constructed as described in Section 3.1.1, and we denote them as $x$ and $y$. Basically, the model can be described as
$$y = f_o(f_t(f_c(f_e(x)))),$$
where $f_e$ is the input embedding layer, $f_c$ stands for our multi-kernel CNN, $f_t$ represents the transformer blocks, and $f_o$ contains a linear layer and a softmax function. Figure 3 shows the full picture of the model.
Figure 3.
DNA language model. The box on the right displays our smooth feature addition (SFA).
Function $f_e$ is the encoding layer transforming the input sequence into a matrix. The dimension conversion is $L \to L \times d_e$, where $L$ is the length of the sequence and $d_e$ is the encoding length. Specifically, we encode the input as
$$f_e(x) = E_{tok}(x) + E_{pos}(x) + E_{seg}(x),$$
where $E_{tok}$ captures the meaning of the word itself, $E_{pos}$ provides the representation of different positions, and $E_{seg}$ distinguishes the upstream from the downstream. An intuitive approach is to concatenate the three encodings without losing semantics, but this requires triple the space. Instead, we directly add these three encodings. This works because the three parameters are all leaf nodes of the training graph and can automatically adapt to each other's distributions. In this way, we reduce the dimension of the coded matrix, thus reducing the parameter space and data space.
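A minimal PyTorch-style sketch of this additive encoding follows; the vocabulary size, maximum length, and encoding dimension are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DNAEmbedding(nn.Module):
    """Sum of token, position, and segment embeddings, as described above."""
    def __init__(self, vocab_size=8, max_len=512, d_e=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_e)   # the word itself
        self.pos = nn.Embedding(max_len, d_e)      # position in the sequence
        self.seg = nn.Embedding(2, d_e)            # upstream vs. downstream

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)

emb = DNAEmbedding()
x = torch.randint(0, 8, (2, 20))               # batch of 2 sequences, length 20
seg = torch.zeros(2, 20, dtype=torch.long)
print(emb(x, seg).shape)                       # torch.Size([2, 20, 64])
```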
Function $f_c$ is the multi-kernel convolution layer learning short-range relationships in the sequence. The dimension conversion is $L \times d_e \to L \times d_h$, where $d_h$ is the hidden dimension. Here, we use convolution kernels of different lengths to learn local relationships at different distances, and we propose a smooth feature addition (SFA) method to fuse these features. Specifically, we carry out
$$f_c(x) = \sum_{i=1}^{K} \Lambda_i \circ \mathrm{conv}_i(x),$$
where $\mathrm{conv}_i$ is a normal one-dimensional convolution layer with the $i$-th kernel length, the output dimension of which is $L \times d_h$; $\Lambda_i$ is a network parameter with a dimension of $L \times d_h$; and $K$ is the number of kernels of different lengths. The sizes of the convolution kernels are small rather than large, and their advantages have been verified in DenseNet []. On the one hand, small convolution kernels use less space than large convolution kernels. On the other hand, we need small convolution kernels to learn the local information of the sequence, while the long-range dependence of the sequence is to be explored by the subsequent module.
Now, we will explain how we designed the smooth feature addition (SFA) algorithm. Before that, we must provide insight into what happens in the plain concat of features.
In a sequence problem, we often directly concat two features in the last dimension. Specifically, if we have two features with dimensions $d_1$ and $d_2$, the dimension of the feature after concat is $d_1 + d_2$. One might think that this approach would not lose information, but, in fact, there is a danger of feature disappearance. For two features with different distributions learned by different modules, a plain concat will create an unbalanced distribution, where some values are extremely small. To make matters worse, layer normalization is usually used to adjust the distribution after a concat operation, causing these values to be concentrated near 0. A quantitative analysis can be seen in Theorem 3. Finally, as the network goes deeper, the gradient disappears, leading to difficulty in learning. This is proven in Corollary 2.
A naive thought is to normalize the two distributions before concating them, which is proven to be correct in Theorem 1. However, it is not efficient, for it converts the feature dimension from $d_h$ to $Kd_h$, posing a challenge for the subsequent module design. Considering that the dimensions of the convolution features are all the same, this inspired us to find a way to smoothly add them using some tuning parameters. This is how we designed SFA. Corollary 3 proves the equivalence of SFA and plain concat, and it illustrates the working mechanism of SFA and its advantages in space occupation, feature selection, and gradient propagation.
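Putting the pieces together, a compact sketch of SFA as we read it is given below: multi-kernel 1-D convolutions fused by learnable Hadamard weights, with the weights initialized to 1 as recommended after Theorem 4. Kernel sizes and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiConvSFA(nn.Module):
    """Multi-kernel Conv1d branches fused by smooth feature addition (SFA)."""
    def __init__(self, d_e=64, d_h=128, seq_len=20, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_e, d_h, k, padding=k // 2) for k in kernel_sizes
        )
        # One Hadamard weight per branch, initialized to 1.
        self.lam = nn.Parameter(torch.ones(len(kernel_sizes), seq_len, d_h))

    def forward(self, x):                       # x: (batch, seq_len, d_e)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, d_e, L)
        feats = [c(x).transpose(1, 2) for c in self.convs]  # each (B, L, d_h)
        return sum(w * f for w, f in zip(self.lam, feats))  # weighted sum

sfa = MultiConvSFA()
print(sfa(torch.randn(2, 20, 64)).shape)        # torch.Size([2, 20, 128])
```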
Function $f_t$ is a stack of transformer blocks learning long-range relationships in the sequence. The dimension conversion is $L \times d_h \to L \times d_h$. Simply, it can be described as
$$h_l = \mathrm{FF}(\mathrm{MHA}(h_{l-1})),$$
where $l = 1, 2, \ldots, N$ indexes the blocks and $h_0 = f_c(f_e(x))$. $\mathrm{FF}$ represents the feed-forward function, and $\mathrm{MHA}$ is short for multi-head attention. The module was proposed by Vaswani et al. in 2017 [].
Function $f_o$ is the output layer and is responsible for converting the hidden state to the output. The dimension conversion is $L \times d_h \to 4$. We extract the tensor corresponding to the token [CLS], convert it into output scores through a linear layer, and generate the prediction via a softmax function. The output process is
$$y = \mathrm{softmax}(W h_{[CLS]} + b).$$
3.2. Chromatin Accessibility Model
3.2.1. Process of Data
We selected DNase-seq experiment data from six typical cell lines, including GM12878, K562, MCF-7, HeLa-S3, H1-hESC, and HepG2, as the original data for our chromatin accessibility model. GM12878 is a type of lymphoblast produced by EBV transformation from the blood of a female donor of Northern European and Western European descent.
K562 is an immortalized cell derived from a female patient with chronic myeloid leukemia (CML). MCF-7 is a breast cancer cell sampled from a white female. HeLa-S3 is an immortal cell derived from a cervical cancer patient. H1-hESC is a human embryonic stem cell. HepG2 comes from a male liver cancer patient.
For each cell type, we downloaded the original sequence data from the ENCODE website, used the short-read aligner Bowtie [] to map the DNA sequences to the human reference genome (hg19), and used HOTSPOT [] to identify chromatin accessibility regions (peaks), i.e., genome-wide open chromatin regions that can yield information about possible protein binding regions on a genome-wide scale. We treated these variable-length sequences as positive samples. At the same time, we sampled the same number of sequences with the same sizes from the whole genome as negative samples. An overview of the data is shown in Table 1, which lists the number of sequences, the minimum, median, and maximum values, and the standard deviation of the lengths. Additionally, the length distributions of the different datasets are shown in Figure 4. For the fairness of comparison, we removed sequences below a minimum length. We truncated or expanded each sequence symmetrically to a fixed length, and we took a fixed-length context for each site in it, so the actual input length of our model is correspondingly longer. From Figure 4, we can observe that most of the lengths are clustered between 36 and 1792. This shows that our cut-off has little impact and is reasonable. Similar to our DNA language model, a special classification token [CLS] was added to the beginning of the sequence to predict the accessibility. Compared to the input length used in [], our prediction length increased by 124%, and the quantity of DNA sequences that did not need to be truncated in the original dataset increased by 17.4%. Moreover, we did not pay a great price for such a long input because the context was transferred to a pre-trained model for the predictions. The output is the accessibility of the input sequence, i.e., either 0 for inaccessibility or 1 for accessibility.
Figure 4.
The length distribution of each cell line, which is a more intuitive display of the content of the data. It can be seen that most of the lengths are concentrated between 10^2 and 10^3 bp.
Finally, the ratio of our training set, validation set, and test set is . The training set was used to train the model, the validation set was used to adjust the hyperparameters to prevent overfitting, and the test set was used to test the performance of the final model.
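A sketch of the symmetric truncate-or-pad step described above follows; the target length and padding character are placeholders, since the paper's exact cut-off is not reproduced here.

```python
def symmetric_fit(seq: str, target: int = 1000, pad: str = "N") -> str:
    """Truncate or expand a sequence symmetrically to a fixed length."""
    if len(seq) > target:                       # cut the same amount off both ends
        extra = len(seq) - target
        left = extra // 2
        return seq[left:left + target]
    extra = target - len(seq)                   # pad both ends evenly
    left = extra // 2
    return pad * left + seq + pad * (extra - left)

print(len(symmetric_fit("ACGT" * 300)), len(symmetric_fit("ACGT" * 10)))  # 1000 1000
```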
3.2.2. Model Structure
The input and output were constructed as described in Section 3.2.1 and are denoted as $x$ and $y$. Basically, the model can be described as
$$y = f_o(f_t(f_s(f_c(f_e(x))))),$$
where $f_e$ is the input embedding layer, $f_c$ stands for our multi-kernel CNN, $f_s$ is short for our SConcat module, $f_t$ represents the transformer blocks, and $f_o$ contains a linear layer and a sigmoid function. Figure 5 shows a full picture of the model. One may find that the accessibility model is very similar to our DNA language model. Indeed, we only modified some of the model structures and changed the hyperparameters, but these are all very critical adjustments that make the model suitable for the task.
Figure 5.
Chromatin accessibility model. The box on the right displays our smooth feature concat (SFC).
Function $f_e$ is the encoding layer transforming the input sequence into a feature matrix. The dimension conversion is $L \to L \times d_e$, where $L$ is the length of the sequence and $d_e$ is the encoding length. Specifically, we encode the input as
$$f_e(x) = E_{tok}(x) + E_{pos}(x).$$
Note that there is no segment embedding $E_{seg}$ in this task because there is no need to distinguish between different segments.
Function $f_c$ has been explained in Section 3.1.2. The dimension conversion is $L \times d_e \to L \times d_c$, where $d_c$ is the dimension of the features learned by this layer.
Function $f_s$ is the concat layer that fuses the features of the language model with the features learned by $f_c$. The dimension conversion is $L \times d_c \to L \times (d_c + d_m)$, where $d_m$ is the dimension of the features generated by the DNA language model. Basically, the language model is used to construct features for the different sites in the sequence, and a smooth feature concat (SFC) method is proposed to fuse them with the previous features:
$$f_s(C) = \mathrm{LN}\big(\mathrm{concat}(\alpha \circ C,\; \beta \circ \mathrm{LM}(\mathrm{ctx}(x)))\big),$$
where $C$ is the output of $f_c$; $\mathrm{ctx}(x)$ stands for the contexts of the sites in $x$; $\alpha$ and $\beta$ are two network parameters with the same dimensions as the two feature maps; and $\mathrm{LM}$ refers to our DNA language model. Here, it receives a DNA sequence, constructs the context for each site in the sequence, and produces an output of width $d_m$ per site. Specifically, if the length of the sequence is $L$, it will construct $L$ pairs of contexts as the input and output an $L \times d_m$ matrix.
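The per-site construction of language-model features could be assembled as in the sketch below. The window size and the lm_encode interface are placeholders standing in for the pretrained DNA language model, which is not reproduced here.

```python
import numpy as np

def site_features(seq, lm_encode, window=100, d_lm=64):
    """Build (upstream, downstream) contexts for every site and stack the
    language-model features into an (L, d_lm) matrix.

    lm_encode(upstream, downstream) -> np.ndarray of shape (d_lm,) is assumed
    to wrap the pretrained DNA language model; it is a placeholder here.
    """
    feats = np.zeros((len(seq), d_lm))
    for i in range(len(seq)):
        upstream = seq[max(0, i - window):i]
        downstream = seq[i + 1:i + 1 + window]
        feats[i] = lm_encode(upstream, downstream)
    return feats

# Toy stand-in for the language model, just to make the sketch runnable.
dummy_lm = lambda up, down: np.full(64, (len(up) + len(down)) / 200.0)
print(site_features("ACGT" * 10, dummy_lm).shape)   # (40, 64)
```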
Now, we explain how we designed the smooth feature concat (SFC) algorithm. First, we should mention that the output dimension of the language model is $d_m$, while the dimension of the convolution features is $d_c$, which means we cannot directly apply SFA in this scenario.
Fortunately, the analysis in Section 3.1.2 has already provided a solution to this problem. We can normalize the two distributions separately before concating them. However, this method uses layer normalization twice and consumes additional parameter space and data space. One question is whether it is possible to use layer normalization only once. It turns out that this is the case. Theorem 2 states that, for any two distributions, there always exist two coefficients such that the concat after they are multiplied by these two coefficients is a standardized distribution. That is how our SFC works. We multiply the two tensors by two coefficients, and we then carry out layer normalization after their concatenation. As such, we fuse the two features smoothly with only one use of the normalization operation. Interestingly, this method is a weakened version of Theorem 4.
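A compact sketch of SFC as described above follows: two learnable Hadamard weights, one concat, and a single layer normalization. The shapes are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class SFC(nn.Module):
    """Smooth feature concat: scale both feature maps, concat, then one LN."""
    def __init__(self, seq_len=20, d_c=128, d_lm=64):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(seq_len, d_c))   # conv features
        self.beta = nn.Parameter(torch.ones(seq_len, d_lm))   # LM features
        self.norm = nn.LayerNorm(d_c + d_lm)

    def forward(self, conv_feat, lm_feat):
        fused = torch.cat([self.alpha * conv_feat, self.beta * lm_feat], dim=-1)
        return self.norm(fused)                # single LN over the fused features

sfc = SFC()
out = sfc(torch.randn(2, 20, 128), torch.randn(2, 20, 64))
print(out.shape)                               # torch.Size([2, 20, 192])
```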
Function $f_t$ is the same as that described in Section 3.1.2. The dimension conversion is $L \times (d_c + d_m) \to L \times d_h$, where $d_h = d_c + d_m$.
Function $f_o$ is the output layer and is responsible for transforming the hidden state into the output. The dimension conversion is $L \times d_h \to 1$. We extracted the tensor corresponding to the token [CLS], converted it into an output probability through a linear layer, and generated the prediction value via a sigmoid function. The output process is
$$y = \mathrm{sigmoid}(w^{\top} h_{[CLS]} + b).$$
4. Results and Discussions
4.1. Semantic DNA Evaluation
We compared the performance of our proposed method with several baseline methods, including the gapped k-mer SVM (gkm-SVM) [], DeepSEA [], and k-mer [] methods. For the sake of fairness, all of the parameters were set to their defaults. Moreover, to prove the effectiveness of the DNA language model, we also tested our accessibility model after excluding the DNA language model. For evaluation purposes, we computed two commonly used measures, the area under the receiver operating characteristic curve (auROC) and the area under the precision–recall curve (auPRC), which are good indicators of the robustness of a prediction model. The classification results for the six datasets are shown in Table 2. Compared to the best baseline, k-mer, our system shows a maximum improvement of 1.25% in auROC and a maximum improvement of 2.41% in auPRC. Although our model does not lead on every dataset, it outperforms k-mer on average, with a 0.02% higher auROC score and a 0.1% higher auPRC score. Compared to gkm-SVM and DeepSEA, SemanticCAP shows an average improvement of about 2–3%. Finally, the introduction of our DNA language model resulted in performance improvements of 2%.
Table 2.
The results of the comparative experiments testing the chromatin accessibility prediction systems. Refer to Table 1 for the codenames of these datasets.
We also tested the accessibility prediction accuracy of the loci shared in different cell lines. For example, GM12878 and HeLa-S3 have 20 common loci, and the prediction accuracy of these 20 loci in both cell lines is 85% and 90%, respectively. Another example is that K562 and MCF-7 have 21 common loci, and the prediction accuracy is 80.9% and 90.5%, respectively. This shows the applicability of our system on the common loci between different cell lines.
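For reference, the two headline metrics reported above can be computed with scikit-learn; average_precision_score is the usual estimate of auPRC. The labels and scores below are toy values, not the paper's predictions.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]              # 1 = accessible, 0 = inaccessible
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]

print("auROC:", roc_auc_score(y_true, y_score))
print("auPRC:", average_precision_score(y_true, y_score))
```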
4.2. Analysis of Models
4.2.1. Effectiveness of Our DNA Language Model
We performed experiments on several different DNA language model structures, which can be divided roughly into two categories. The first category comprises methods based on normal CNNs, and the second category uses our multi-conv architecture with data augmentation. Six structures were tested. At the same time, in order to test the prediction ability of the different models in the case of insufficient information, we randomly masked some tokens and tested the results. The complete results are shown in Table 3. Through the comparison of LSTM and Attention, we found that the attention mechanism can greatly improve the prediction ability of the DNA language model. When using the MaxPooling and ReLU functions, we observed that the output of the last hidden layer was mostly 0, with the number of effective (non-zero) neurons being about 3/192. This happens because the ReLU function shields neurons whose values are less than 0, and MaxPooling selectively updates specific neurons. Therefore, we replaced MaxPooling with AveragePooling, and the Attention layer that uses the ReLU function was replaced with a transformer; that is the third method listed in Table 3. The second category uses multi-conv to extract the local features of the sequence. The introduction of the multi-conv mechanism with data augmentation strategies brought increases in accuracy, especially when some tokens were masked. There are three kinds of feature fusion strategies: plain concat (PC), plain add (PA), and our smooth feature addition (SFA). The third, fourth, and fifth items in the table indicate that SFA outperforms the other two fusion methods. The last item in Table 3, mconv(SFA)+trans, is the model that we finally chose as our DNA language model.
Table 3.
The results of the experiment comparing DNA language models.
4.2.2. Effectiveness of Our Chromatin Accessibility Model
We experimented with several chromatin accessibility model structures, all of which were based on the transformer. The main difference is the use of multi-conv and the modules after the transformer. A complete comparison of the results is shown in Table 4.
Table 4.
The results of the experiment comparing the chromatin accessibility models.
First, we focused on the module before the transformer. We noticed that the introduction of multi-conv also resulted in performance improvements, especially in F1. In our chromatin accessibility model, we concatenated the features provided by the DNA language model, where we could either directly concat (PC) them or use our SFC method. The evaluation values of the last two items show the superiority of SFC.
Now, we turn to the comparison of the modules after the transformer. The transformation from the features of the transformer to the result is a challenge. In this part, five methods were tested. It should be mentioned that mconv + SFC + trans + linear is the final model. In terms of training time, our model can be fully parallelized, making it more advantageous than the LSTM-based methods, which rely on recurrent networks. At the same time, our model has fewer parameters and a simpler structure than Flatten after CNNs and can thus converge quickly. In terms of evaluation, the LSTM-based methods performed poorly. The main reason for this is that it is difficult for LSTM to learn the long-range dependence of a sequence. The convolution layer improves the performance of the LSTM to some extent by shortening the sequence length. In the methods that are based on Flatten, introducing convolution layers actually reduces the accuracy. This could be caused by the convolution layers destroying the sequence features learned by the transformer. Among the multiple chromatin accessibility models, the method using multi-conv and our smoother concat (SFC) obtained the best results with a relatively small number of parameters.
4.3. Analysis of [CLS]
We were able to observe the effectiveness of introducing the [CLS] symbol into our accessibility model. A direct indicator is the feature corresponding to [CLS] after the transformer layer, i.e., the value of $h_{[CLS]}$ in Equation (48). We randomly selected a certain number of positive and negative samples and used our chromatin accessibility model to predict them. For each sample, we output the $d_h$-dimensional tensor corresponding to [CLS] after the transformer layer, and we reduced it to 2-dimensional space with t-SNE, as shown in Figure 6. According to the figure, the [CLS] feature has the ability to distinguish positive and negative examples, which is strong evidence of its effectiveness.
Figure 6.
Features corresponding to the token [CLS] after the transformer for different samples. The $d_h$-dimensional tensor corresponding to [CLS] is reduced to 2 dimensions with t-SNE, and each axis represents one of the two resulting components. Accessible points and inaccessible points can be roughly distinguished.
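The visualization in Figure 6 can be reproduced in outline with scikit-learn's t-SNE; random features stand in for the real [CLS] tensors in the sketch below.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Stand-ins for the [CLS] features of positive and negative samples.
pos = rng.normal(1.0, 1.0, size=(200, 128))
neg = rng.normal(-1.0, 1.0, size=(200, 128))
features = np.vstack([pos, neg])

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(coords.shape)   # (400, 2): one 2-D point per sample, as plotted in Figure 6
```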
4.4. Analysis of SFA and SFC
In this section, we conducted two comparison experiments of PA, PC, SFA, and SFC.
When testing the various DNA language models, we made a comparison between SFA, PA, and PC, which correspond to the last three items in Table 3. We used the same set of samples to train the three models and plotted their training loss curves, shown in Figure 7a. PA reduces the loss at the fastest speed at the beginning because all of the features in multi-conv are trained to the same degree at the same time. However, in the later stage, some features become overtrained while others do not, leading to oscillation of the loss.
Figure 7.
Loss in training time: (a) The loss curve for SFA, PA, and PC in the training of the DNA language model; and (b) the loss curve for SFC and PC in the training of the chromatin accessibility model.
In the experiment on the various chromatin accessibility models, we made a comparison between SFC and PC, corresponding to the last two items in Table 4. The first batches of samples were used to measure the training state, which is shown in Figure 7b. As we can see, PC has a lower training speed because it suffers from gradient disappearance. Compared to it, the gradient propagation of SFC is selective and more stable throughout training.
We can observe the effectiveness of SFA from another angle. Paying attention to the parameters of SFA in multi-conv, whose dimension is $K \times L \times d_h$, where $K$ is the number of kernels, $L$ is the sequence length, and $d_h$ is the hidden dimension, we normalized them and reduced them to weights of dimension $K \times L$. This was carried out for both the language model and the chromatin accessibility model, and the results can be observed in Figure 8. Note that the sum along the vertical axis in Figure 8a,b is always 1 due to the normalization. Obviously, different sequence positions and different convolution kernels have different weights, which proves SFA's ability to regulate features.
Figure 8.
SFA and SFC parameters: (a) Parameters of SFA in the DNA language model; and (b) Parameters of SFC in the chromatin accessibility model. We found that different sequence positions and different convolution kernels have different weights, which proves SFA’s ability to regulate features.
In general, SFA and SFC make training smoother, faster, and better than ordinary concatenation and addition. They are smoother because we used parameters to regulate features. They are faster because they speed up the training of the model by avoiding the gradient problem. They are simple but effective. In fact, since they share the same essence (the Hadamard product), they share the same advantages.
5. Conclusions
In this article, we propose a chromatin accessibility prediction model called SemanticCAP. Our model is able to predict open DNA regions, thus playing a guiding role in disease detection, drug design, etc. For example, when a gene called c-myc from the cell line H1-hESC was mutated in its middle region over a length of 5 bp, its predicted accessibility decreased from 0.98 to 0.14, which is consistent with the experimental data showing that this mutation reduces transcription []. Another example is a mutation in a gene called HNF4A from the cell line K562, which leads to a reduction in gene expression []. Our model predicted that its accessibility decreased from 0.66 to 0.2, which provides a reasonable explanation for the experimentally observed reduction in gene expression caused by the HNF4A mutation. Similarly, we can monitor the accessibility changes of DNA targeted by drugs (especially anticancer drugs), and the change in accessibility can provide guidance on drug action. Our main innovations are as follows. First, we introduced the concept of language models from natural language processing to model DNA sequences. This method not only provides the word vector representation of the base itself, but it also provides sufficient information about the context of a site in a DNA sequence. Second, we used a small number of parameters to solve the feature fusion problem between different distributions. Specifically, we solve the problem of the smooth addition of distributions with the same dimensions using SFA and the problem of the smooth concatenation of distributions with different dimensions using SFC.
Third, we use an end-to-end model design, in which we fully utilize the learning ability and characteristics of the convolution and attention mechanism, thus achieving a better result with fewer parameters and a shorter training time.
Of course, there is still room for improvement in our method. In terms of sample construction, we randomly selected the same number of DNA sequences with the same lengths as negative samples. This approach could be modified; for example, we could deliberately use an unbalanced dataset, because DNA data are abundant, and then use strategies such as ensemble learning [] to eliminate the negative effects of data imbalance []. In terms of data input, sequence truncation and sequence completion operations exist in our model, which may cause information loss or redundant calculations. Additionally, the task we designed for the DNA language model could also be enhanced: multiple positions could be predicted simultaneously, similar to the cloze task in BERT. There are also some limitations in the current study. The first limitation is that the attention mechanism consumes too much memory, and it could be replaced by short-range attention or mixed-length attention []. Additionally, our smooth feature fusion methods, SFA and SFC, could also be used in multi-head attention to save space and accelerate training. Moreover, the dropout mechanism makes all neurons effective in the prediction phase, but there may exist a more reasonable way of fusing subnetworks. These issues need to be further explored.
Author Contributions
Conceptualization: Y.Z. and L.Q.; Data curation: L.Q.; Formal analysis: Y.Z. and Y.J.; Investigation: Y.Z.; Supervision: X.C., H.W. and L.Q.; Validation: X.C., H.W. and L.Q.; Writing—original draft: Y.Z.; Writing—review and editing: L.Q. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the following: the National Natural Science Foundation of China (31801108, 62002251); the Natural Science Foundation of Jiangsu Province Youth Fund (BK20200856); and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD). This work was partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization. The authors would also like to acknowledge the support of Jiangsu Province Key Lab for providing information processing technologies.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data and code are available at github.com/ykzhang0126/semanticCAP (accessed on 16 February 2022).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Aleksić, M.; Kapetanović, V. An Overview of the Optical and Electrochemical Methods for Detection of DNA-Drug Interactions. Acta Chim. Slov. 2014, 61, 555–573. [Google Scholar] [PubMed]
- Wang, Y.; Jiang, R.; Wong, W.H. Modeling the Causal Regulatory Network by Integrating Chromatin Accessibility and Transcriptome Data. Natl. Sci. Rev. 2016, 3, 240–251. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Gallon, J.; Loomis, E.; Curry, E.; Martin, N.; Brody, L.; Garner, I.; Brown, R.; Flanagan, J.M. Chromatin Accessibility Changes at Intergenic Regions Are Associated with Ovarian Cancer Drug Resistance. Clin. Epigenet. 2021, 13, 122. [Google Scholar] [CrossRef] [PubMed]
- Janssen, S.; Cuvier, O.; Müller, M.; Laemmli, U.K. Specific Gain-and Loss-of-Function Phenotypes Induced by Satellite-Specific DNA-Binding Drugs Fed to Drosophila Melanogaster. Mol. Cell 2000, 6, 1013–1024. [Google Scholar] [CrossRef]
- Song, L.; Crawford, G.E. DNase-Seq: A High-Resolution Technique for Mapping Active Gene Regulatory Elements Across the Genome from Mammalian Cells. Cold Spring Harb. Protoc. 2010, 2010, pdb-prot5384. [Google Scholar] [CrossRef] [Green Version]
- Simon, J.M.; Giresi, P.G.; Davis, I.J.; Lieb, J.D. Using Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) to Isolate Active Regulatory DNA. Nat. Protoc. 2012, 7, 256–267. [Google Scholar] [CrossRef] [Green Version]
- Buenrostro, J.D.; Wu, B.; Chang, H.Y.; Greenleaf, W.J. ATAC-Seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr. Protoc. Mol. Biol. 2015, 109, 21–29. [Google Scholar] [CrossRef]
- Lee, D.; Karchin, R.; Beer, M.A. Discriminative Prediction of Mammalian Enhancers from DNA Sequence. Genome Res. 2011, 21, 2167–2180. [Google Scholar] [CrossRef] [Green Version]
- Ghandi, M.; Lee, D.; Mohammad-Noori, M.; Beer, M.A. Enhanced Regulatory Sequence Prediction Using Gapped k-Mer Features. PLoS Comput. Biol. 2014, 10, e1003711. [Google Scholar] [CrossRef] [Green Version]
- Beer, M.A. Predicting Enhancer Activity and Variant Impact Using Gkm-SVM. Hum. Mutat. 2017, 38, 1251–1258. [Google Scholar] [CrossRef] [Green Version]
- Xu, Y.; Strick, A.J. Integration of Unpaired Single-Cell Chromatin Accessibility and Gene Expression Data via Adversarial Learning. arXiv 2021, arXiv:2104.12320. [Google Scholar]
- Kumar, S.; Bucher, P. Predicting Transcription Factor Site Occupancy Using DNA Sequence Intrinsic and Cell-Type Specific Chromatin Features. BMC Bioinform. 2016, 17, S4. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the Sequence Specificities of DNA-and RNA-Binding Proteins by Deep Learning. Nat. Biotechnol. 2015, 33, 831–838. [Google Scholar] [CrossRef]
- Zhou, J.; Troyanskaya, O.G. Predicting Effects of Noncoding Variants with Deep Learning–Based Sequence Model. Nat. Methods 2015, 12, 931–934. [Google Scholar] [CrossRef] [Green Version]
- Min, X.; Zeng, W.; Chen, N.; Chen, T.; Jiang, R. Chromatin Accessibility Prediction via Convolutional Long Short-Term Memory Networks with k-Mer Embedding. Bioinformatics 2017, 33, i92–i101. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Liu, Q.; Xia, F.; Yin, Q.; Jiang, R. Chromatin Accessibility Prediction via a Hybrid Deep Convolutional Neural Network. Bioinformatics 2018, 34, 732–738. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Guo, Y.; Zhou, D.; Nie, R.; Ruan, X.; Li, W. DeepANF: A Deep Attentive Neural Framework with Distributed Representation for Chromatin Accessibility Prediction. Neurocomputing 2020, 379, 305–318. [Google Scholar] [CrossRef]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A Convolutional Neural Network for Modelling Sentences. arXiv 2014, arXiv:1404.2188. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing—EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Sun, C.; Yang, Z.; Luo, L.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. A Deep Learning Approach with Deep Contextualized Word Representations for Chemical–Protein Interaction Extraction from Biomedical Literature. IEEE Access 2019, 7, 151034–151046. [Google Scholar] [CrossRef]
- Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Giné, E. The lévy-Lindeberg Central Limit Theorem. Proc. Am. Math. Soc. 1983, 88, 147–153. [Google Scholar] [CrossRef] [Green Version]
- Horn, R.A. The Hadamard Product. In Proceedings of the Symposia in Applied Mathematics, Phoenix, AZ, USA, 10–11 January 1989; Volume 40, pp. 87–169. [Google Scholar]
- Liu, F.; Perez, J. Gated End-to-End Memory Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Long Papers. Volume 1, pp. 1–10. [Google Scholar]
- Baldi, P.; Sadowski, P.J. Understanding Dropout. Adv. Neural Inf. Process. Syst. 2013, 26, 2814–2822. [Google Scholar]
- Pan, B.; Kusko, R.; Xiao, W.; Zheng, Y.; Liu, Z.; Xiao, C.; Sakkiah, S.; Guo, W.; Gong, P.; Zhang, C.; et al. Similarities and Differences Between Variants Called with Human Reference Genome Hg19 or Hg38. BMC Bioinform. 2019, 20, 17–29. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- De Ruijter, A.; Guldenmund, F. The bowtie method: A review. Saf. Sci. 2016, 88, 211–218. [Google Scholar] [CrossRef]
- John, S.; Sabo, P.J.; Thurman, R.E.; Sung, M.-H.; Biddie, S.C.; Johnson, T.A.; Hager, G.L.; Stamatoyannopoulos, J.A. Chromatin Accessibility Pre-Determines Glucocorticoid Receptor Binding Patterns. Nat. Genet. 2011, 43, 264–268. [Google Scholar] [CrossRef]
- Klenova, E.M.; Nicolas, R.H.; Paterson, H.F.; Carne, A.F.; Heath, C.M.; Goodwin, G.H.; Neiman, P.E.; Lobanenkov, V.V. CTCF, a conserved nuclear factor required for optimal transcriptional activity of the chicken c-myc gene, is an 11-Zn-finger protein differentially expressed in multiple forms. Mol. Cell. Biol. 1993, 13, 7612–7624. [Google Scholar]
- Colclough, K.; Bellanne-Chantelot, C.; Saint-Martin, C.; Flanagan, S.E.; Ellard, S. Mutations in the genes encoding the transcription factors hepatocyte nuclear factor 1 alpha and 4 alpha in maturity-onset diabetes of the young and hyperinsulinemic hypoglycemia. Hum. Mutat. 2013, 34, 669–685. [Google Scholar] [CrossRef]
- Dietterich, T.G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 2002; Volume 2, pp. 110–125. [Google Scholar]
- Chawla, N.V.; Sylvester, J. Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets. In Proceedings of the International Workshop on Multiple Classifier Systems, Prague, Czech Republic, 23–25 May 2007; Springer: Berlin, Germany, 2007; pp. 397–406. [Google Scholar]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).