Abstract
A large number of inorganic and organic compounds are able to bind DNA and form complexes, among which drug-related molecules are important. Chromatin accessibility changes not only directly affect drug–DNA interactions, but they can also promote or inhibit the expression of critical genes associated with drug resistance by affecting the DNA binding capacity of TFs and transcriptional regulators. However, the biological experimental techniques for measuring chromatin accessibility are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome, but existing computational models mostly ignore the contextual information provided by the bases in gene sequences. To address these issues, we propose a new solution called SemanticCAP. It introduces a gene language model that models the context of gene sequences and is thus able to provide an effective representation of a given site in a gene sequence. We merge the features provided by the gene language model into our chromatin accessibility model and, during this process, design two methods, called SFA and SFC, to make feature fusion smoother. Compared to DeepSEA, gkm-SVM, and k-mer on public benchmarks, our model proved to have better performance, showing a 1.25% maximum improvement in auROC and a 2.41% maximum improvement in auPRC.
1. Introduction
In human cells, genetic and regulatory information is stored in chromatin, which is deoxyribonucleic acid (DNA) wrapped around histones. The chromatin structure has a lot to do with gene transcription, protein synthesis, biochemical processes, and other complex biological expressions. Among them, the binding of small organic and inorganic molecules to DNA can influence numerous biological processes in which DNA participates. In particular, many anticancer, antibiotic, and antiviral drugs exert their primary biological effects by reversibly interacting with nucleic acids. Therefore, the study of chromatin structure can help us design drugs to control gene expression and to cure diseases []. Some regions of the chromatin are open to transcription factors (TFs), RNA polymerases (RNAPs), drug molecules, and other cellular materials, while others are tightly entangled together and do not play a role in most cellular processes. These two regions of the chromatin are called open regions and closed regions, which are also known as accessible and inaccessible regions []. Measuring the accessibility of chromatin regions can generate clues to gene function that help us to identify appropriate targets for therapeutic intervention. Meanwhile, monitoring changes in chromatin accessibility can help us to track and understand drug effects. A study [] found that chromatin accessibility changes at intergenic regions are associated with ovarian cancer drug resistance. Another example is that chromatin opening (increased accessibility) of targeted DNA satellites can explain how DNA-binding pyrrole–imidazole compounds that target different Drosophila melanogaster satellites lead to gain- or loss-of-function phenotypes []. In recent years, many high-throughput sequencing technologies have been used for the detection of open regions, such as DNase-seq [], FAIRE-seq [], and ATAC-seq []. However, biological experimental methods are costly and time-consuming and thus cannot be applied to large-scale examinations. These restrictions have promoted the development of computational methods.
Alongside the progress in computer science, several kinds of sequence-based computational methods have been proposed to identify functional regions. Simply put, we can divide them into traditional machine learning methods [,,,,] and neural network methods [,,,,]. Machine learning methods are mainly based on support vector machines (SVMs), which perform supervised learning for the classification or regression of data groups. An SVM method [] based on k-mer features, which are defined as the full set of segments of varying lengths (3–10 bp) in a long sequence, was designed in 2011 to recognize enhancers in mammalian cells. Subsequently, the gkm-SVM (gapped k-mer SVM) proposed in 2014 [] exploited a feature set called gapped k-mer features to improve the accuracy and stability of recognition; this method only uses part of the positions within each segment instead of all of them. In recent years, with the rapid development of neural networks and the emergence of various deep learning models, a growing number of deep network models have come to be used to solve such problems, where convolutional neural networks (CNNs) [] and recurrent neural networks (RNNs) [] are dominant. A neural network is a computational learning system that uses a network of functions to understand and translate data input of one form into a desired output, and deep learning is a type of artificial neural network in which multiple layers of processing are used to extract progressively higher-level features from data. CNNs use the principle of convolution to encode the local information of the data, while RNNs model the sequence with reference to the memory function of neurons. CNNs are used in DeepBind [] and DeepSEA [] to model the sequence specificity of protein binding, and both demonstrated significant performance improvements compared to traditional SVM-based methods. Min et al. utilized long short-term memory (LSTM) [] to predict chromatin accessibility and achieved state-of-the-art results at the time, proving the effectiveness of RNNs for DNA sequence problems.
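To make the k-mer representation concrete, the following minimal sketch (our own illustration, not the authors' code) counts all segments of 3–10 bp in a sequence, which is the kind of feature vector the k-mer SVM described above consumes; the function name and defaults are assumptions.

```python
from collections import Counter

def kmer_features(seq: str, k_min: int = 3, k_max: int = 10) -> Counter:
    """Count every substring of length k_min..k_max in a DNA sequence.

    This mirrors the k-mer feature set described above (all segments of
    3-10 bp); the function name and parameter defaults are illustrative.
    """
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

print(kmer_features("ACGTACGT", k_min=3, k_max=4).most_common(3))
```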
However, we point out that the previous methods have the following shortcomings. First, most of the previous methods are based on k-mers, that is, segments of length k taken at intervals along the sequence. This artificial division of the original sequence may destroy its internal semantic information, causing difficulties for subsequent model learning. Second, with the progress being made in language models, we now have the ability to learn the interior semantic information of sequences through pre-training. Some existing methods have explored this direction, such as using GloVe [] to train k-mer word vectors. However, these pre-training models are mostly traditional word vector methods. On the one hand, they can only learn the characteristics of the word itself and have no knowledge of the context of DNA sequences []. On the other hand, they are limited to a specific dataset and thus cannot be widely applied to other scenarios. Third, traditional CNNs and RNNs have been proven to be unsuitable for long-sequence problems []. CNNs, restricted by the size of their convolution kernels, fail to learn global information effectively, while RNNs tend to suffer from gradient disappearance and slow training due to the lack of parallelizability when receiving a long input. In contrast, the attention mechanism (Attention) [] can effectively learn the long-range dependence of sequences and has been widely used in the field of natural language processing.
In response to the above disadvantages, we constructed a chromatin accessibility prediction model called SemanticCAP, which is based on features learned from a language model. The data and code for our system are available at github.com/ykzhang0126/semanticCAP (accessed on 16 February 2022). The SemanticCAP model, trained on DNase-seq datasets, is able to predict the accessibility of DNA sequences from different cell lines and thus can be used as an effective alternative to biological sequencing methods such as DNase-seq. At a minimum, our model makes the following three improvements:
- A DNA language model is utilized to learn the deep semantics of DNA sequences, and its semantic features are introduced into the chromatin accessibility prediction process; therefore, we are able to obtain additional complex contextual information.
- Both the DNA language model and the chromatin accessibility model use character-based inputs instead of k-mers, which stand for segments of length k. This strategy prevents the information in the original sequences from being destroyed.
- The attention mechanism is widely used in our models in place of CNNs and RNNs, making the model more powerful and stable in handling long sequences.
Before formally introducing our method, we will first present some preliminary knowledge, including some common-sense information, theorems, and corollaries.
2. Theories
Theorem 1.
For two distributions standardized using layer normalization (LN), denoted as $\hat{X}$ and $\hat{Y}$, the concat of them, that is, $Z = \mathrm{concat}(\hat{X}, \hat{Y})$, is still a standardized distribution.
Proof.
Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. As we all know, LN [] transforms the distribution as
$$\hat{X}_i = \frac{X_i - \mu_X}{\sigma_X}, \quad (1)$$
where $\mu_X$ and $\sigma_X$ are the expectation and standard deviation of $X$, respectively. Obviously, for the normalized distributions $\hat{X}$ and $\hat{Y}$, we have
$$E(\hat{X}) = 0, \quad D(\hat{X}) = 1, \quad (2)$$
$$E(\hat{Y}) = 0, \quad D(\hat{Y}) = 1, \quad (3)$$
where $E(\cdot)$ stands for the expectation function and $D(\cdot)$ stands for the deviation (variance) function. The new distribution $Z$ is derived by concating $\hat{X}$ and $\hat{Y}$, and thus has $n+m$ elements. Inferring from Equations (2) and (3), we have
$$E(Z) = \frac{nE(\hat{X}) + mE(\hat{Y})}{n+m} = 0, \quad (4)$$
$$D(Z) = E(Z^2) - E(Z)^2 = \frac{nE(\hat{X}^2) + mE(\hat{Y}^2)}{n+m} - E(Z)^2. \quad (5)$$
For $E(\hat{X}^2)$ and $E(\hat{Y}^2)$, we also know that
$$E(\hat{X}^2) = D(\hat{X}) + E(\hat{X})^2, \quad (6)$$
$$E(\hat{Y}^2) = D(\hat{Y}) + E(\hat{Y})^2. \quad (7)$$
Substituting Equations (2) and (3) into Equations (6) and (7), and finally into Equation (5), we have
$$D(Z) = \frac{n+m}{n+m} - 0 = 1. \quad (8)$$
Equations (4) and (8) demonstrate the standardization of $Z$. □
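As a quick numerical sanity check of Theorem 1 (a sketch with arbitrary toy vectors, not part of the original paper), one can layer-normalize two vectors of different sizes and verify that their concatenation still has zero mean and unit variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(v: np.ndarray) -> np.ndarray:
    # Standardize to zero mean and unit (population) variance, as LN does.
    return (v - v.mean()) / v.std()

x_hat = layer_norm(rng.normal(5.0, 3.0, size=128))   # n elements
y_hat = layer_norm(rng.normal(-2.0, 0.5, size=512))  # m elements
z = np.concatenate([x_hat, y_hat])

print(z.mean(), z.var())  # ~0 and ~1, as Theorem 1 predicts
```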
Theorem 2.
For any two distributions $X$ and $Y$, there always exist two coefficients $\lambda$ and $\mu$ such that the concat of them, after being multiplied by the two coefficients, respectively, that is, $Z = \mathrm{concat}(\lambda X, \mu Y)$, is a standardized distribution.
Proof.
Suppose that $X$ has $n$ elements and $Y$ has $m$ elements. We denote the expectations of the two distributions as $\mu_X$, $\mu_Y$, and the variances as $\sigma_X^2$, $\sigma_Y^2$. Notice that $\mu_X$, $\mu_Y$, $\sigma_X^2$, and $\sigma_Y^2$ are all scalars. Now, pay attention to $Z = \mathrm{concat}(\lambda X, \mu Y)$. To prove this theorem, we want $Z$ to be a standardized distribution, which requires the expectation of $Z$ to be 0 and the variance to be 1. Therefore, we can list the following equation set:
$$E(Z) = 0, \quad D(Z) = 1. \quad (9)$$
At the same time, we have equations similar to Equations (6) and (7), those being
$$E(Z) = \frac{n\lambda\mu_X + m\mu\mu_Y}{n+m}, \quad (10)$$
$$D(Z) = \frac{n\lambda^2(\sigma_X^2 + \mu_X^2) + m\mu^2(\sigma_Y^2 + \mu_Y^2)}{n+m} - E(Z)^2, \quad (11)$$
which are easy to calculate according to the nature of expectation and variance. Notice that Equation (9) has two variables, $\lambda$ and $\mu$, and two independent equations, meaning it should be solvable. By calculating Equation (9), we can determine the numeric solution of $\lambda$ and $\mu$ as follows:
$$\lambda = \pm m\mu_Y\sqrt{K}, \quad \mu = \mp n\mu_X\sqrt{K}, \quad K = \frac{n+m}{nm\left(m\mu_Y^2(\sigma_X^2+\mu_X^2) + n\mu_X^2(\sigma_Y^2+\mu_Y^2)\right)}. \quad (12)$$
The existence of Equation (12) ends our proof. Actually, we are able to obtain two sets of solutions here because $\lambda$ can either be positive or negative, and so can $\mu$. The signs of $\lambda$ and $\mu$ depend on the signs of $\mu_X$ and $\mu_Y$, which can be easily inferred from the first equation in Equation (9). □
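The coefficients in Theorem 2 can also be found numerically. The sketch below is our own illustration: it uses the sample statistics of two toy vectors, assumes the two means are not both zero, and then verifies that the scaled concatenation is standardized.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=100)   # n elements
y = rng.normal(-1.0, 5.0, size=300)  # m elements
n, m = len(x), len(y)

mx, my = x.mean(), y.mean()
ex2, ey2 = (x**2).mean(), (y**2).mean()   # second moments

# Parameterize lam = t*m*my, mu = -t*n*mx so the mean of the concat is 0,
# then pick t so the variance is 1 (assumes mx and my are not both zero).
t = np.sqrt((n + m) / (n * m * (m * my**2 * ex2 + n * mx**2 * ey2)))
lam, mu = t * m * my, -t * n * mx

z = np.concatenate([lam * x, mu * y])
print(z.mean(), z.var())  # ~0 and ~1
```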
Corollary 1.
For any distributions $X_1, X_2, \ldots, X_k$, there always exist coefficients $\lambda_1, \lambda_2, \ldots, \lambda_k$ such that the concat of them, after being multiplied by the coefficients, respectively, that is, $Z = \mathrm{concat}(\lambda_1 X_1, \lambda_2 X_2, \ldots, \lambda_k X_k)$, is a standardized distribution.
Proof.
The overall method of proof is similar to that used in Theorem 2. Note that, in this case, we have $k$ variables but only two independent equations, resulting in infinitely many solutions according to Equation (9). To be more precise, the degree of freedom of our solutions is $k-2$. □
Theorem 3.
In neural networks, for any two tensors $X$ and $Y$ that satisfy $\sigma(X) \ll \sigma(Y)$, the probability of feature disappearance of $X$ after concating and normalizing them is substantial, with a quantitative estimate derived in the proof, where $\sigma(\cdot)$ represents the standard deviation.
Proof.
Feature disappearance is defined as a situation where the features are too small. Concretely, for a tensor $X$ and a threshold $\epsilon$, if the result of a subsequent operation on $X$ is smaller than $\epsilon$, then the feature disappearance of $X$ occurs. Here, $\epsilon$ can be an arbitrarily small value.
Suppose that has elements and has elements. We denote the expectation of the two distributions as , , and the variance as , . As stated in the precondition, we already know that
Let and . With the help of Equation (9), we have
We denote as and as . According to Equation (1), for , we know that
We denote as and as . Now, we consider the results of a subsequent operation of , which is . This is very common in convolution, linear, or attention layers. For the result, an observation is
where . For the convenience of analysis, all are set to 1. This will not result in a loss of generality because the value scaling from to 1 has no effect on the subsequent derivation. Here, we denote as . According to the central limit theorem (Lindeberg–Lévy form) [], we find that obeys a normal distribution, that is
For a feature disappearance threshold , we want to figure out the probability of . Denote this event as , and we can obtain
where is the cumulative distribution function (cdf) of the standard normal distribution. Since it is an integral that does not have a closed form solution, we cannot directly analyze it. According to Equations (13), (14) and (16), we know that . At the same time, we know that is a small number, leading to . Therefore, we have the equation as follows:
The formula is a Taylor expansion where is the probability density function (pdf) of the standard normal distribution, is the Lagrange remainder, and is the Peano remainder, standing for a high-order infinitesimal of . Combining Equations (15), (17), (20) and (21), we achieve
where and . The above equation can also be written as . □
Corollary 2.
In neural networks, feature disappearance can lead to gradient disappearance.
Proof.
According to Theorem 3, feature disappearance happens if there exists a tensor whose standard deviation is much smaller than that of the tensor it is concatenated with. Similar to the definition of feature disappearance, gradient disappearance is defined as a situation where the gradients are too small. Concretely, for a parameter $w$ with a gradient of $g$ and a threshold $\delta$, if $|g|$ is smaller than $\delta$, the gradient disappearance of $w$ happens. Here, $\delta$ can be an arbitrarily small value.
Consider a subsequent operation of , which is , where stands for the number of layers involved in the calculation. The gradient disappearance happens if
At the same time, we already have , which means that we simply need to meet the requirements for
Note that is a small number, which means that . Finally, we can derive a formula for :
Thereby, we obtain a sufficient condition for gradient disappearance, and we can come to a conclusion: gradient disappearance occurs in sufficiently deep layers after feature disappearance. □
The above corollary is consistent with intuition. The disappearance of gradients is always accompanied by the disappearance of features, and it is always a problem in deep neural networks.
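The effect described in Theorem 3 and Corollary 2 is easy to reproduce: concatenate a small-variance feature with a large-variance one, apply layer normalization, and the first block collapses toward zero. The snippet below is a toy demonstration with made-up scales, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
small = rng.normal(0.0, 0.01, size=64)    # feature block with tiny variance
large = rng.normal(0.0, 10.0, size=64)    # feature block with large variance

z = np.concatenate([small, large])
z = (z - z.mean()) / z.std()              # layer normalization over the concat

# After LN, the "small" half is squashed near zero: any weighted sum over it
# (a convolution or linear layer) produces near-zero outputs, i.e., the
# feature disappearance discussed in Theorem 3.
print(np.abs(z[:64]).mean(), np.abs(z[64:]).mean())
```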
Theorem 4.
In neural networks, for any two tensors $X$ and $Y$ of the same dimension, there always exist two matrices $A$ and $B$ such that the operation of concating them and the operation of adding them after they have been multiplied in the Hadamard format by the two matrices, respectively, are equivalent in effect.
Proof.
First of all, we illustrate the definition of the Hadamard product []. The Hadamard product (also known as the element-wise product) is a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension. Concretely, we can define it as $(A \circ X)_{ij} = A_{ij}X_{ij}$. The symbol ‘$\circ$’ is used to distinguish it from the more common matrix product, which is denoted as ‘$\cdot$’ and is usually omitted. The definition implies that the dimension of $A$ should be the same as that of $X$, as well as $B$ and $Y$. At the same time, $X$ and $Y$ are assumed to have the same dimensions in the precondition of our proposition. As such, we might as well set them to $n \times d$, writing $X = (X_{ij})_{n \times d}$ and $Y = (Y_{ij})_{n \times d}$.
Our goal is to weigh the effect of the two operations. For the convenience of comparison, we let the results of the two operations be multiplied by a matrix, thus converting the dimension to $n \times e$. Adding a linear layer is very common in neural networks, and it hardly affects the network's expression ability.
Considering the first scheme, the concat of $X$ and $Y$, we have
$$Z_1 = \mathrm{concat}(X, Y)\,W_1, \quad (28)$$
where $W_1 \in \mathbb{R}^{2d \times e}$ and $Z_1 \in \mathbb{R}^{n \times e}$. Observing the $i$-th row and $j$-th column of $Z_1$, we find that
$$(Z_1)_{ij} = \sum_{k=1}^{d} X_{ik}(W_1)_{kj} + \sum_{k=1}^{d} Y_{ik}(W_1)_{(d+k)j}. \quad (29)$$
Considering the second scheme, with the addition of $A \circ X$ and $B \circ Y$ as the core, we have
$$Z_2 = (A \circ X + B \circ Y)\,W_2, \quad (30)$$
where $A, B \in \mathbb{R}^{n \times d}$, $W_2 \in \mathbb{R}^{d \times e}$, and $Z_2 \in \mathbb{R}^{n \times e}$. Still, we pay attention to the $i$-th row and $j$-th column of $Z_2$ and find that
$$(Z_2)_{ij} = \sum_{k=1}^{d} \left(A_{ik}X_{ik} + B_{ik}Y_{ik}\right)(W_2)_{kj}. \quad (31)$$
Comparing Equations (29) and (31), we find that, when we let $A_{ik}(W_2)_{kj}$ equal $(W_1)_{kj}$ and $B_{ik}(W_2)_{kj}$ equal $(W_1)_{(d+k)j}$, the values of $(Z_1)_{ij}$ and $(Z_2)_{ij}$ are equal, which is strong evidence of effect equivalence. □
As the equivalence has been proven, similar to the plain concat, no information is lost in the above method. We point out that the Hadamard product is an alternative version of the gate mechanism []. We use coefficients to adjust the original distributions in order to screen out effective features. For the speed and stability of training, setting the initial values of $A$ and $B$ to all ones is recommended.
Further, we can observe the gradients of the parameters in Equation (30), where we have
$$\frac{\partial (Z_2)_{ij}}{\partial A_{ik}} = X_{ik}(W_2)_{kj}, \quad \frac{\partial (Z_2)_{ij}}{\partial B_{ik}} = Y_{ik}(W_2)_{kj}. \quad (32)$$
Compared to the gate mechanism, our method is simpler, saves space, and is more direct in gradient propagation.
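One direction of the equivalence in Theorem 4 can be checked directly: a Hadamard-gated sum followed by a linear layer can be rewritten as a plain concat followed by a suitably structured linear layer. The check below is our own illustration with toy shapes, using per-feature gate vectors (a special case of the matrices A and B in which every row is identical).

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, e = 8, 16, 4
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, d))
a, b = rng.normal(size=d), rng.normal(size=d)   # per-feature gates
V = rng.normal(size=(d, e))                     # shared projection

# Scheme 2: gated (Hadamard) sum, then a linear layer.
z_add = (X * a + Y * b) @ V

# Scheme 1: concat, then a linear layer whose blocks absorb the gates.
W = np.vstack([np.diag(a) @ V, np.diag(b) @ V])  # shape (2d, e)
z_cat = np.concatenate([X, Y], axis=1) @ W

print(np.allclose(z_add, z_cat))  # True: the two schemes compute the same map
```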
Of course, Theorem 4 could be generalized to cases with an arbitrary number of tensors. We describe it in the following corollary:
Corollary 3.
In neural networks, for any $k$ tensors $X_1, X_2, \ldots, X_k$ of the same dimension, there always exist $k$ matrices $A_1, A_2, \ldots, A_k$ such that the operation of concating them and the operation of adding them after they have been multiplied in the Hadamard format by the $k$ matrices, respectively, are equivalent in effect.
Proof.
This proof is similar to the proof for Theorem 4. □
Theorem 5.
In neural networks, for a layer composed of $n$ neurons, the effective training times of the neurons in this layer reach the maximum when the dropout rate approaches 0 or 1.
Proof.
The number of neurons in this layer is $n$, so we shall mark them as $a_1, a_2, \ldots, a_n$. Suppose that the dropout rate [] is $p$ and the total number of training times is $T$. We denote $1-p$ as $q$.
Consider the -th training. The network randomly selects neurons to update due to the existence of the dropout mechanism. Denote these neurons as , , , .
Without the loss of generality, we consider the next time is selected, which is the -th training time. We denote the number of neurons selected for update in , , as , and the number of neurons selected in , , as . We know that the selection of neurons in is an independent event, so we have
At the same time, the relationship between and is
Inferring from Equations (33) and (34), we achieve
The neurons represented by are the neurons that are updated jointly at time and time , thus belonging to the same subnetwork. We assume that they share one training gain with . At the same time, the neurons represented by have not been updated at time ; thus, each of them has one unique training gain. Therefore, at the update time , the expected gain of is , which is derived from the above proportion analysis. Paying attention to and , we find that obeys geometric distribution because the selection of is a Bernoulli experiment with probability . That is, , meaning that
Therefore, the expected number of training times for is . The total training gain is the product of the number of training times and the gain of a single time training, which we denote as . Now, the formula emerges:
Denote as the denominator of and differentiate that to obtain
With the help of Equation (38), it is easy to draw an image of , shown in Figure 1, where we set to . The observation is that when or , that is, is 0 or , reaches the maximum value , demonstrating that the effective training times of are the largest. The conclusion can be generalized to every neuron in the layer. □
Figure 1.
The graph of the function G with respect to the dropout rate p, derived from Equation (37) in Theorem 5. It shows the gain of a single training time as n differs. G varies with p, and the extreme points (squares) and turning points (triangles) vary with n. The turning point is approximately 8.72 × 10^{-9}n^3 − 9.35 × 10^{-6}n^2 + 2.44 × 10^{-3}n + 0.78, which is a good choice for the dropout rate because it balances the gain of a single training time and the representation ability of a neural network.
Corollary 4.
In neural networks, if the amount of training data is sufficient, the optimal value of the dropout rate is 0.5; if the amount of training data is insufficient, then a number that is close to 1 is a better choice.
Proof.
Theorem 5 focuses on the effective training times of the neurons in the network, and this corollary focuses on the representation ability. It can be seen from Equation (37) that the effective training times of a certain layer are directly proportional to the total number of training times $T$. When the number of training times reaches a certain threshold, the network reaches a balance point, and further training will not bring any performance improvements.
If the training data are sufficient, meaning that the dataset and the total number of training times $T$ are large enough, then the network is guaranteed to be fully trained. Therefore, we do not need to worry about whether the training times of the neurons in the network are enough. However, we still need to consider the representation ability of the network, which has a close relationship with the number of subnetworks $N_{sub}$. It can be calculated as
$$N_{sub} = \binom{n}{n(1-p)}, \quad (39)$$
which is a combination number. Obviously, when $p$ is $0.5$, the number of subnetworks is the largest, and the network's representation ability is relatively strong.
However, when there are not enough training data, we cannot guarantee the sufficiency of training. On the one hand, we need to set the dropout rate to a value close to 0 or 1 to guarantee the number of training times indicated by the theorem. On the other hand, in order to ensure the network's representation ability, we want the dropout rate to be close to 0.5. Here, a balanced approach is to choose the turning point shown in Figure 1, which considers both the training times and the representation ability. Because this point is difficult to analyze, we provide the fitting function shown in Figure 1, whose error remains small over the range of $n$ considered. □
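The representation-ability part of Corollary 4 (the number of distinct subnetworks is a binomial coefficient maximized at a dropout rate of 0.5) can be seen with a few lines; the layer size below is an arbitrary example.

```python
from math import comb

n = 64  # neurons in the layer (arbitrary example size)
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    kept = round(n * (1 - p))          # neurons retained per forward pass
    print(p, comb(n, kept))            # number of possible subnetworks

# The count peaks at p = 0.5, matching the representation-ability argument;
# Theorem 5's training-gain argument instead favors rates near 0 or 1,
# hence the compromise "turning point" discussed above.
```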
The above corollary is intuitive because the complexity of the network should be proportional to the amount of data. A small amount of data requires a simple model, calling for a higher dropout rate. Notice that a large dropout rate not only enables the model to be fully trained, but it also helps to accelerate the process.
In a modern neural network framework, the discarded neurons will not participate in gradient propagation this time, which largely reduces the number of parameters that need to be adjusted in the network.
3. Models
To address the task of chromatin accessibility prediction, we designed SemanticCAP, which includes a DNA language model that is shown in Section 3.1 and a chromatin accessibility model that is shown in Section 3.2. Briefly, we augment the chromatin accessibility model with the features provided by the DNA language model, thereby improving the chromatin accessibility prediction performance. A detailed methodology is described as follows.
3.1. DNA Language Model
3.1.1. Process of Data
We used the human reference genome GRCh37 (hg19) as the original data for our DNA language model. The human reference genome is a digital nucleic acid sequence database that can be used as a representative example of the gene set of an idealized individual of a species []. Therefore, a model based on the database could be applied to various genetic-sequence-related issues.
The task we designed for our DNA language model uses context to predict the intermediate base. However, there are at least three challenges. The first is that there are two inputs, the upstream and the downstream, which are defined as the upper and lower sequences of a certain base. Since we predict the middle base from the information on both sides, the definitions of the upstream and downstream are interchangeable, which means that the two contexts should be treated in the same way. Second, the length of the input sequence is quite long, far from the output length of 4, which stands for the classification over the four bases. The large gap between the input and output lengths means that the neural network must be designed in a subtle way; otherwise, redundant calculations or poor results may occur. Third, we do not always have such long context data in real situations. For example, the lengths of the upstream contexts in the DNA datasets in Table 1 are limited, resulting in insufficient information in some cases.
Table 1.
An overall view of the accessible DNA segments of each cell line. l_mean and l_med show the approximate length of the data, and l_std describes how dispersed the data are. Lengths are given in bp.
To solve the above problems, we designed a simple but effective input format and training method. First of all, we randomly selected a certain position, taking the upstream and downstream sequences of fixed length as the input; the output is the base connecting the upstream and downstream, i.e., A, C, G, or T.
For the first challenge, we combined the upstream and downstream into one sequence, separated by a special prediction token, and provided different segment embeddings for the two parts. Additionally, a special classification token [CLS] was added to the beginning of the sequence so that the model could learn an overall representation of it. For the second challenge, the final hidden state corresponding to the [CLS] token was used as the aggregate sequence representation for classification, by which the output dimension was reduced quickly without complex network structures. This technique was first used by BERT []. For the third challenge, some data augmentation tricks were applied to enhance the capabilities of the model. First, we constructed symmetric sequences based on the principle of base complementation and the non-directionality of DNA sequences, including axial symmetry and mirror symmetry, which helps the model learn these two properties of DNA sequences. Second, we did not feed all of the input tokens to the model, which helps to enhance the model's prediction ability under conditions of insufficient information. Basically, we mask some percentage of the input tokens at random, and concretely there are two strategies: for a certain sequence, either upstream or downstream, we mask (replace with the mask token) a proportion of random tokens in some cases or a run of consecutive tokens in the remaining cases. Figure 2 is an example of our mask operation. In this case, two random tokens of the upstream and three consecutive tokens of the downstream are masked.
Figure 2.
An example of mask operation on a random DNA sequence. Here, C and T of the upstream and C, G, and A of the downstream are masked. The intermediate base is T, which is the target that needs to be predicted.
Finally, there is no need to worry about overfitting. First, we have on the order of 10^9 bases in the DNA dataset, meaning that we will not over-learn some specific data. Second, we have mask and dropout operations in our training, both of which are great ways to avoid over-training.
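The masking augmentation described above can be sketched as follows. The special token names, the mask ratio, and the span length are placeholders of our own choosing; the paper's exact values are not reproduced here.

```python
import random

# Placeholder special tokens (names are assumptions, not the paper's exact vocabulary).
MASK, PRED, CLS = "[MASK]", "[P]", "[CLS]"

def mask_side(tokens, ratio=0.15, span=3):
    """Mask either scattered random tokens or one consecutive span."""
    tokens = list(tokens)
    if random.random() < 0.5:                       # strategy 1: random tokens
        k = max(1, int(ratio * len(tokens)))
        for i in random.sample(range(len(tokens)), k):
            tokens[i] = MASK
    else:                                           # strategy 2: a consecutive span
        start = random.randrange(len(tokens) - span)
        tokens[start:start + span] = [MASK] * span
    return tokens

def build_input(upstream, downstream):
    # [CLS] upstream [P] downstream, with each side masked independently.
    return [CLS] + mask_side(upstream) + [PRED] + mask_side(downstream)

print(build_input(list("ACGTACGTAC"), list("TTGACGTGCA")))
```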
3.1.2. Model Structure
The input and output are constructed as described in Section 3.1.1, and we denote them as $x$ and $y$. Basically, the model can be described as
$$y = f_o(f_t(f_c(f_e(x)))),$$
where $f_e$ is the input embedding layer, $f_c$ stands for our multi-kernel CNN, $f_t$ represents the transformer blocks, and $f_o$ contains a linear layer and a softmax function. Figure 3 shows the full picture of the model.
Figure 3.
DNA language model. The box on the right displays our smooth feature addition (SFA).
Function $f_e$ is the encoding layer transforming the input sequence into a matrix. The dimension conversion is $L \to L \times d_e$, where $L$ is the length of the sequence and $d_e$ is the encoding length. Specifically, we encode the input as
$$f_e(x) = E_{tok}(x) + E_{pos}(x) + E_{seg}(x),$$
where $E_{tok}$ captures the meaning of the word itself, $E_{pos}$ provides the representation of different positions, and $E_{seg}$ distinguishes the upstream from the downstream. An intuitive approach is to concatenate the three encodings without losing semantics, but this requires triple the space. Instead, we directly add these three encodings. This works because the three parameters are all leaf nodes of the training graph and can automatically adapt to each other's distributions. In this way, we reduce the dimension of the coded matrix, thus reducing the parameter space and data space.
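A minimal PyTorch-style sketch of this additive encoding follows; the vocabulary size, maximum length, and encoding dimension are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DNAEmbedding(nn.Module):
    """Sum of token, position, and segment embeddings, as described above."""
    def __init__(self, vocab_size=8, max_len=512, d_e=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_e)   # the word itself
        self.pos = nn.Embedding(max_len, d_e)      # position in the sequence
        self.seg = nn.Embedding(2, d_e)            # upstream vs. downstream

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)

emb = DNAEmbedding()
x = torch.randint(0, 8, (2, 20))               # batch of 2 sequences, length 20
seg = torch.zeros(2, 20, dtype=torch.long)
print(emb(x, seg).shape)                       # torch.Size([2, 20, 64])
```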
Function $f_c$ is the multi-kernel convolution layer learning short-range relationships in the sequence. The dimension conversion is $L \times d_e \to L \times d_h$, where $d_h$ is the hidden dimension. Here, we use convolution kernels of different lengths to learn local relationships at different distances, and we propose a smooth feature addition (SFA) method to fuse these features. Specifically, we carry out
$$f_c(x) = \sum_{i=1}^{K} \Lambda_i \circ \mathrm{conv}_i(x),$$
where $\mathrm{conv}_i$ is a normal one-dimensional convolution layer with the $i$-th kernel length, the output dimension of which is $L \times d_h$; $\Lambda_i$ is a network parameter with a dimension of $L \times d_h$; and $K$ is the number of kernels of different lengths. The sizes of the convolution kernels are small rather than large, and their advantages have been verified in DenseNet []. On the one hand, small convolution kernels use less space than large convolution kernels. On the other hand, we need small convolution kernels to learn the local information of the sequence, while the long-range dependence of the sequence is to be explored by the subsequent module.
Now, we will explain how we designed the smooth feature addition (SFA) algorithm. Before that, we must provide insight into what happens in the plain concat of features.
In a sequence problem, we often directly concat two features in the last dimension. Specifically, if we have two features with dimensions $d_1$ and $d_2$, the dimension of the feature after concat is $d_1 + d_2$. One might think that this approach would not lose information, but, in fact, there is a danger of feature disappearance. For two features with different distributions learned by different modules, a plain concat will create an unbalanced distribution, where some values are extremely small. To make matters worse, layer normalization is usually used to adjust the distribution after a concat operation, causing these values to be concentrated near 0. A quantitative analysis can be seen in Theorem 3. Finally, as the network goes deeper, the gradient disappears, leading to difficulty in learning. This is proven in Corollary 2.
A naive thought is to normalize the two distributions before concating them, which is proven to be correct in Theorem 1. However, it is not efficient, for it converts the feature dimension from $d_h$ to $Kd_h$, posing a challenge for the subsequent module design. Considering that the dimensions of the convolution features are all the same, this inspired us to find a way to smoothly add them using some tuning parameters. This is how we designed SFA. Corollary 3 proves the equivalence of SFA and plain concat, and it illustrates the working mechanism of SFA and its advantages in space occupation, feature selection, and gradient propagation.
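Putting the pieces together, a compact sketch of SFA as we read it is given below: multi-kernel 1-D convolutions fused by learnable Hadamard weights, with the weights initialized to 1 as recommended after Theorem 4. Kernel sizes and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiConvSFA(nn.Module):
    """Multi-kernel Conv1d branches fused by smooth feature addition (SFA)."""
    def __init__(self, d_e=64, d_h=128, seq_len=20, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_e, d_h, k, padding=k // 2) for k in kernel_sizes
        )
        # One Hadamard weight per branch, initialized to 1.
        self.lam = nn.Parameter(torch.ones(len(kernel_sizes), seq_len, d_h))

    def forward(self, x):                       # x: (batch, seq_len, d_e)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, d_e, L)
        feats = [c(x).transpose(1, 2) for c in self.convs]  # each (B, L, d_h)
        return sum(w * f for w, f in zip(self.lam, feats))  # weighted sum

sfa = MultiConvSFA()
print(sfa(torch.randn(2, 20, 64)).shape)        # torch.Size([2, 20, 128])
```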
Function $f_t$ is a stack of transformer blocks learning long-range relationships in the sequence. The dimension conversion is $L \times d_h \to L \times d_h$. Simply, it can be described as
$$h_l = \mathrm{FF}(\mathrm{MHA}(h_{l-1})),$$
where $l = 1, 2, \ldots, N$ indexes the blocks and $h_0 = f_c(f_e(x))$. $\mathrm{FF}$ represents the feed-forward function, and $\mathrm{MHA}$ is short for multi-head attention. The module was proposed by Vaswani et al. in 2017 [].
Function $f_o$ is the output layer and is responsible for converting the hidden state to the output. The dimension conversion is $L \times d_h \to 4$. We extract the tensor corresponding to the token [CLS], convert it into output scores through a linear layer, and generate the prediction via a softmax function. The output process is
$$y = \mathrm{softmax}(W h_{[CLS]} + b).$$
3.2. Chromatin Accessibility Model
3.2.1. Process of Data
We selected DNase-seq experiment data from six typical cell lines, including GM12878, K562, MCF-7, HeLa-S3, H1-hESC, and HepG2, as the original data for our chromatin accessibility model. GM12878 is a type of lymphoblast produced by EBV transformation from the blood of a female donor of Northern European and Western European descent.
K562 is an immortalized cell derived from a female patient with chronic myeloid leukemia (CML). MCF-7 is a breast cancer cell sampled from a white female. HeLa-S3 is an immortal cell derived from a cervical cancer patient. H1-hESC is a human embryonic stem cell. HepG2 comes from a male liver cancer patient.
For each cell type, we downloaded the original sequence data from the ENCODE website, used the short-read aligner Bowtie [] to map the DNA sequences to the human reference genome (hg19), and used HOTSPOT [] to identify chromatin accessibility regions (peaks), i.e., genome-wide open chromatin regions that can yield information about possible protein binding regions on a genome-wide scale. We treated these variable-length sequences as positive samples. At the same time, we sampled the same number of sequences with the same sizes from the whole genome as negative samples. An overview of the data is shown in Table 1, which lists the number of sequences, the minimum, median, and maximum values, and the standard deviation of the lengths. Additionally, the length distributions of the different datasets are shown in Figure 4. For the fairness of comparison, we removed sequences below a minimum length. We truncated or expanded each sequence symmetrically to a fixed length, and we took a fixed-length context for each site in it, so the actual input length of our model is correspondingly longer. From Figure 4, we can observe that most of the lengths are clustered between 36 and 1792. This shows that our cut-off has little impact and is reasonable. Similar to our DNA language model, a special classification token [CLS] was added to the beginning of the sequence to predict the accessibility. Compared to the input length used in [], our prediction length increased by 124%, and the quantity of DNA sequences that did not need to be truncated in the original dataset increased by 17.4%. Moreover, we did not pay a great price for such a long input because the context was transferred to a pre-trained model for the predictions. The output is the accessibility of the input sequence, i.e., either 0 for inaccessibility or 1 for accessibility.
Figure 4.
The length distribution of each cell line, which is a more intuitive display of the content of the data. It can be seen that most of the lengths are concentrated between 10^2 and 10^3 bp.
Finally, the ratio of our training set, validation set, and test set is . The training set was used to train the model, the validation set was used to adjust the hyperparameters to prevent overfitting, and the test set was used to test the performance of the final model.
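A sketch of the symmetric truncate-or-pad step described above follows; the target length and padding character are placeholders, since the paper's exact cut-off is not reproduced here.

```python
def symmetric_fit(seq: str, target: int = 1000, pad: str = "N") -> str:
    """Truncate or expand a sequence symmetrically to a fixed length."""
    if len(seq) > target:                       # cut the same amount off both ends
        extra = len(seq) - target
        left = extra // 2
        return seq[left:left + target]
    extra = target - len(seq)                   # pad both ends evenly
    left = extra // 2
    return pad * left + seq + pad * (extra - left)

print(len(symmetric_fit("ACGT" * 300)), len(symmetric_fit("ACGT" * 10)))  # 1000 1000
```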
3.2.2. Model Structure
The input and output were constructed as described in Section 3.2.1 and are denoted as $x$ and $y$. Basically, the model can be described as
$$y = f_o(f_t(f_s(f_c(f_e(x))))),$$
where $f_e$ is the input embedding layer, $f_c$ stands for our multi-kernel CNN, $f_s$ is short for our SConcat module, $f_t$ represents the transformer blocks, and $f_o$ contains a linear layer and a sigmoid function. Figure 5 shows a full picture of the model. One may find that the accessibility model is very similar to our DNA language model. Indeed, we only modified some of the model structures and changed the hyperparameters, but these are all very critical adjustments that make the model suitable for the task.
Figure 5.
Chromatin accessibility model. The box on the right displays our smooth feature concat (SFC).
Function $f_e$ is the encoding layer transforming the input sequence into a feature matrix. The dimension conversion is $L \to L \times d_e$, where $L$ is the length of the sequence and $d_e$ is the encoding length. Specifically, we encode the input as
$$f_e(x) = E_{tok}(x) + E_{pos}(x).$$
Note that there is no segment embedding $E_{seg}$ in this task because there is no need to distinguish between different segments.
Function $f_c$ has been explained in Section 3.1.2. The dimension conversion is $L \times d_e \to L \times d_c$, where $d_c$ is the dimension of the features learned by this layer.
Function $f_s$ is the concat layer that fuses the features of the language model with the features learned by $f_c$. The dimension conversion is $L \times d_c \to L \times (d_c + d_m)$, where $d_m$ is the dimension of the features generated by the DNA language model. Basically, the language model is used to construct features for the different sites in the sequence, and a smooth feature concat (SFC) method is proposed to fuse them with the previous features:
$$f_s(C) = \mathrm{LN}\big(\mathrm{concat}(\alpha \circ C,\; \beta \circ \mathrm{LM}(\mathrm{ctx}(x)))\big),$$
where $C$ is the output of $f_c$; $\mathrm{ctx}(x)$ stands for the contexts of the sites in $x$; $\alpha$ and $\beta$ are two network parameters with the same dimensions as the two feature maps; and $\mathrm{LM}$ refers to our DNA language model. Here, it receives a DNA sequence, constructs the context for each site in the sequence, and produces an output of width $d_m$ per site. Specifically, if the length of the sequence is $L$, it will construct $L$ pairs of contexts as the input and output an $L \times d_m$ matrix.
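The per-site construction of language-model features could be assembled as in the sketch below. The window size and the lm_encode interface are placeholders standing in for the pretrained DNA language model, which is not reproduced here.

```python
import numpy as np

def site_features(seq, lm_encode, window=100, d_lm=64):
    """Build (upstream, downstream) contexts for every site and stack the
    language-model features into an (L, d_lm) matrix.

    lm_encode(upstream, downstream) -> np.ndarray of shape (d_lm,) is assumed
    to wrap the pretrained DNA language model; it is a placeholder here.
    """
    feats = np.zeros((len(seq), d_lm))
    for i in range(len(seq)):
        upstream = seq[max(0, i - window):i]
        downstream = seq[i + 1:i + 1 + window]
        feats[i] = lm_encode(upstream, downstream)
    return feats

# Toy stand-in for the language model, just to make the sketch runnable.
dummy_lm = lambda up, down: np.full(64, (len(up) + len(down)) / 200.0)
print(site_features("ACGT" * 10, dummy_lm).shape)   # (40, 64)
```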
Now, we explain how we designed the smooth feature concat (SFC) algorithm. First, we should mention that the output dimension of the language model is $d_m$, while the dimension of the convolution features is $d_c$, which means we cannot directly apply SFA in this scenario.
Fortunately, the analysis in Section 3.1.2 has already provided a solution to this problem. We can normalize the two distributions separately before concating them. However, this method uses layer normalization twice and consumes additional parameter space and data space. One question is whether it is possible to use layer normalization only once. It turns out that this is the case. Theorem 2 states that, for any two distributions, there always exist two coefficients such that the concat after they are multiplied by these two coefficients is a standardized distribution. That is how our SFC works. We multiply the two tensors by two coefficients, and we then carry out layer normalization after their concatenation. As such, we fuse the two features smoothly with only one use of the normalization operation. Interestingly, this method is a weakened version of Theorem 4.
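A compact sketch of SFC as described above follows: two learnable Hadamard weights, one concat, and a single layer normalization. The shapes are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class SFC(nn.Module):
    """Smooth feature concat: scale both feature maps, concat, then one LN."""
    def __init__(self, seq_len=20, d_c=128, d_lm=64):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(seq_len, d_c))   # conv features
        self.beta = nn.Parameter(torch.ones(seq_len, d_lm))   # LM features
        self.norm = nn.LayerNorm(d_c + d_lm)

    def forward(self, conv_feat, lm_feat):
        fused = torch.cat([self.alpha * conv_feat, self.beta * lm_feat], dim=-1)
        return self.norm(fused)                # single LN over the fused features

sfc = SFC()
out = sfc(torch.randn(2, 20, 128), torch.randn(2, 20, 64))
print(out.shape)                               # torch.Size([2, 20, 192])
```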
Function $f_t$ is the same as that described in Section 3.1.2. The dimension conversion is $L \times (d_c + d_m) \to L \times d_h$, where $d_h = d_c + d_m$.
Function $f_o$ is the output layer and is responsible for transforming the hidden state into the output. The dimension conversion is $L \times d_h \to 1$. We extracted the tensor corresponding to the token [CLS], converted it into an output probability through a linear layer, and generated the prediction value via a sigmoid function. The output process is
$$y = \mathrm{sigmoid}(w^{\top} h_{[CLS]} + b).$$
4. Results and Discussions
4.1. Semantic DNA Evaluation
We compared the performance of our proposed method with several baseline methods, including the gapped k-mer SVM (gkm-SVM) [], DeepSEA [], and k-mer [] methods. For the sake of fairness, all of the parameters were set to their defaults. Moreover, to prove the effectiveness of the DNA language model, we also tested our accessibility model after excluding the DNA language model. For evaluation purposes, we computed two commonly used measures, the area under the receiver operating characteristic curve (auROC) and the area under the precision–recall curve (auPRC), which are good indicators of the robustness of a prediction model. The classification results for the six datasets are shown in Table 2. Compared to the best baseline, k-mer, our system shows a maximum improvement of 1.25% in auROC and a maximum improvement of 2.41% in auPRC. Although our model does not lead on every dataset, it outperforms k-mer on average, with a 0.02% higher auROC score and a 0.1% higher auPRC score. Compared to gkm-SVM and DeepSEA, SemanticCAP shows an average improvement of about 2–3%. Finally, the introduction of our DNA language model resulted in performance improvements of 2%.
Table 2.
The results of the comparative experiments testing the chromatin accessibility prediction systems. Refer to Table 1 for the codenames of these datasets.
We also tested the accessibility prediction accuracy of the loci shared in different cell lines. For example, GM12878 and HeLa-S3 have 20 common loci, and the prediction accuracy of these 20 loci in both cell lines is 85% and 90%, respectively. Another example is that K562 and MCF-7 have 21 common loci, and the prediction accuracy is 80.9% and 90.5%, respectively. This shows the applicability of our system on the common loci between different cell lines.
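For reference, the two headline metrics reported above can be computed with scikit-learn; average_precision_score is the usual estimate of auPRC. The labels and scores below are toy values, not the paper's predictions.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]              # 1 = accessible, 0 = inaccessible
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]

print("auROC:", roc_auc_score(y_true, y_score))
print("auPRC:", average_precision_score(y_true, y_score))
```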
4.2. Analysis of Models
4.2.1. Effectiveness of Our DNA Language Model
We performed experiments on several different DNA language model structures, which can be divided roughly into two categories. The first category comprises methods based on normal CNNs, and the second category uses our multi-conv architecture with data augmentation. Six structures were tested. At the same time, in order to test the prediction ability of the different models in the case of insufficient information, we randomly masked some tokens and tested the results. The complete results are shown in Table 3. Through the comparison of LSTM and Attention, we found that the attention mechanism can greatly improve the prediction ability of the DNA language model. When using the MaxPooling and ReLU functions, we observed that the output of the last hidden layer was mostly 0, with the number of effective (non-zero) neurons being about 3/192. This happens because the ReLU function shields neurons whose values are less than 0, and MaxPooling selectively updates specific neurons. Therefore, we replaced MaxPooling with AveragePooling, and the Attention layer that uses the ReLU function was replaced with a transformer; that is the third method listed in Table 3. The second category uses multi-conv to extract the local features of the sequence. The introduction of the multi-conv mechanism with data augmentation strategies brought increases in accuracy, especially when some tokens were masked. There are three kinds of feature fusion strategies: plain concat (PC), plain add (PA), and our smooth feature addition (SFA). The third, fourth, and fifth items in the table indicate that SFA outperforms the other two fusion methods. The last item in Table 3, mconv(SFA)+trans, is the model that we finally chose as our DNA language model.
Table 3.
The results of the experiment comparing DNA language models.
4.2.2. Effectiveness of Our Chromatin Accessibility Model
We experimented with several chromatin accessibility model structures, all of which were based on the transformer. The main difference is the use of multi-conv and the modules after the transformer. A complete comparison of the results is shown in Table 4.
Table 4.
The results of the experiment comparing the chromatin accessibility models.
First, we focused on the module before the transformer. We noticed that the introduction of multi-conv also resulted in performance improvements, especially in F1. In our chromatin accessibility model, we concatenated the features provided by the DNA language model, where we could either directly concat (PC) them or use our SFC method. The evaluation values of the last two items show the superiority of SFC.
Now, we turn to the comparison of the modules after the transformer. The transformation from the features of the transformer to the result is a challenge. In this part, five methods were tested. It should be mentioned that mconv + SFC + trans + linear is the final model. In terms of training time, our model can be fully parallelized, making it more advantageous than the LSTM-based methods, which rely on recurrent networks. At the same time, our model has fewer parameters and a simpler structure than Flatten after CNNs and can thus converge quickly. In terms of evaluation, the LSTM-based methods performed poorly. The main reason for this is that it is difficult for LSTM to learn the long-range dependence of a sequence. The convolution layer improves the performance of the LSTM to some extent by shortening the sequence length. In the methods that are based on Flatten, introducing convolution layers actually reduces the accuracy. This could be caused by the convolution layers destroying the sequence features learned by the transformer. Among the multiple chromatin accessibility models, the method using multi-conv and our smoother concat (SFC) obtained the best results with a relatively small number of parameters.
4.3. Analysis of [CLS]
We were able to observe the effectiveness of introducing the [CLS] symbol into our accessibility model. A direct indicator is the feature corresponding to [CLS] after the transformer layer, i.e., the value of $h_{[CLS]}$ in Equation (48). We randomly selected a certain number of positive and negative samples and used our chromatin accessibility model to predict them. For each sample, we output the $d_h$-dimensional tensor corresponding to [CLS] after the transformer layer, and we reduced it to 2-dimensional space with t-SNE, as shown in Figure 6. According to the figure, the [CLS] feature has the ability to distinguish positive and negative examples, which is strong evidence of its effectiveness.
Figure 6.
Features corresponding to the token [CLS] after the transformer for different samples. The $d_h$-dimensional tensor corresponding to [CLS] is reduced to 2 dimensions with t-SNE, and each axis represents one of the two resulting components. Accessible points and inaccessible points can be roughly distinguished.
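The visualization in Figure 6 can be reproduced in outline with scikit-learn's t-SNE; random features stand in for the real [CLS] tensors in the sketch below.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Stand-ins for the [CLS] features of positive and negative samples.
pos = rng.normal(1.0, 1.0, size=(200, 128))
neg = rng.normal(-1.0, 1.0, size=(200, 128))
features = np.vstack([pos, neg])

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(coords.shape)   # (400, 2): one 2-D point per sample, as plotted in Figure 6
```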
4.4. Analysis of SFA and SFC
In this section, we conducted two comparison experiments of PA, PC, SFA, and SFC.
When testing the various DNA language models, we made a comparison between SFA, PA, and PC, which correspond to the last three items in Table 3. We used the same set of samples to train the three models and plotted their training loss curves, shown in Figure 7a. PA reduces the loss at the fastest speed at the beginning because all of the features in multi-conv are trained to the same degree at the same time. However, in the later stage, some features become overtrained while others do not, leading to oscillation of the loss.
Figure 7.
Loss in training time: (a) The loss curve for SFA, PA, and PC in the training of the DNA language model; and (b) the loss curve for SFC and PC in the training of the chromatin accessibility model.
In the experiment on the various chromatin accessibility models, we made a comparison between SFC and PC, corresponding to the last two items in Table 4. The first batches of samples were used to measure the training state, which is shown in Figure 7b. As we can see, PC has a lower training speed because it suffers from gradient disappearance. Compared to it, the gradient propagation of SFC is selective and more stable throughout training.
We can observe the effectiveness of SFA from another angle. Paying attention to the parameters of SFA in multi-conv, whose dimension is $K \times L \times d_h$, where $K$ is the number of kernels, $L$ is the sequence length, and $d_h$ is the hidden dimension, we normalized them and reduced them to weights of dimension $K \times L$. This was carried out for both the language model and the chromatin accessibility model, and the results can be observed in Figure 8. Note that the sum along the vertical axis in Figure 8a,b is always 1 due to the normalization. Obviously, different sequence positions and different convolution kernels have different weights, which proves SFA's ability to regulate features.
Figure 8.
SFA and SFC parameters: (a) Parameters of SFA in the DNA language model; and (b) Parameters of SFC in the chromatin accessibility model. We found that different sequence positions and different convolution kernels have different weights, which proves SFA’s ability to regulate features.
In general, SFA and SFC make training smoother, faster, and better than ordinary concatenation and addition. They are smoother because we used parameters to regulate features. They are faster because they speed up the training of the model by avoiding the gradient problem. They are simple but effective. In fact, since they share the same essence (the Hadamard product), they share the same advantages.
5. Conclusions
In this article, we propose a chromatin accessibility prediction model called SemanticCAP. Our model is able to predict open DNA regions, thus playing a guiding role in disease detection, drug design, etc. For example, when a gene called c-myc from the cell line H1-hESC was mutated in its middle region over a length of 5 bp, its predicted accessibility decreased from 0.98 to 0.14, which is consistent with the experimental data showing that this mutation reduces transcription []. Another example is a mutation in a gene called HNF4A from the cell line K562, which leads to a reduction in gene expression []. Our model predicted that its accessibility decreased from 0.66 to 0.2, which provides a reasonable explanation for the experimentally observed reduction in gene expression caused by the HNF4A mutation. Similarly, we can monitor the accessibility changes of DNA targeted by drugs (especially anticancer drugs), and the change in accessibility can provide guidance on drug action. Our main innovations are as follows. First, we introduced the concept of language models from natural language processing to model DNA sequences. This method not only provides the word vector representation of the base itself, but it also provides sufficient information about the context of a site in a DNA sequence. Second, we used a small number of parameters to solve the feature fusion problem between different distributions. Specifically, we solve the problem of the smooth addition of distributions with the same dimensions using SFA and the problem of the smooth concatenation of distributions with different dimensions using SFC.
Third, we use an end-to-end model design, in which we fully utilize the learning ability and characteristics of the convolution and attention mechanism, thus achieving a better result with fewer parameters and a shorter training time.
Of course, there is still room for improvement in our method. In terms of sample construction, we randomly selected the same number of DNA sequences with the same lengths as negative samples. This approach could be modified; for example, we could deliberately use an unbalanced dataset, because DNA data are abundant, and then use strategies such as ensemble learning [] to eliminate the negative effects of data imbalance []. In terms of data input, sequence truncation and sequence completion operations exist in our model, which may cause information loss or redundant calculations. Additionally, the task we designed for the DNA language model could also be enhanced: multiple positions could be predicted simultaneously, similar to the cloze task in BERT. There are also some limitations in the current study. The first limitation is that the attention mechanism consumes too much memory, and it could be replaced by short-range attention or mixed-length attention []. Additionally, our smooth feature fusion methods, SFA and SFC, could also be used in multi-head attention to save space and accelerate training. Moreover, the dropout mechanism makes all neurons effective in the prediction phase, but there may exist a more reasonable way of fusing subnetworks. These issues need to be further explored.
Author Contributions
Conceptualization: Y.Z. and L.Q.; Data curation: L.Q.; Formal analysis: Y.Z. and Y.J.; Investigation: Y.Z.; Supervision: X.C., H.W. and L.Q.; Validation: X.C., H.W. and L.Q.; Writing—original draft: Y.Z.; Writing—review and editing: L.Q. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the following: the National Natural Science Foundation of China (31801108, 62002251); the Natural Science Foundation of Jiangsu Province Youth Fund (BK20200856); and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD). This work was partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization. The authors would also like to acknowledge the support of Jiangsu Province Key Lab for providing information processing technologies.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data and code are available at github.com/ykzhang0126/semanticCAP (accessed on 16 February 2022).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Aleksić, M.; Kapetanović, V. An Overview of the Optical and Electrochemical Methods for Detection of DNA-Drug Interactions. Acta Chim. Slov. 2014, 61, 555–573. [Google Scholar] [PubMed]
- Wang, Y.; Jiang, R.; Wong, W.H. Modeling the Causal Regulatory Network by Integrating Chromatin Accessibility and Transcriptome Data. Natl. Sci. Rev. 2016, 3, 240–251. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Gallon, J.; Loomis, E.; Curry, E.; Martin, N.; Brody, L.; Garner, I.; Brown, R.; Flanagan, J.M. Chromatin Accessibility Changes at Intergenic Regions Are Associated with Ovarian Cancer Drug Resistance. Clin. Epigenet. 2021, 13, 122. [Google Scholar] [CrossRef] [PubMed]
- Janssen, S.; Cuvier, O.; Müller, M.; Laemmli, U.K. Specific Gain-and Loss-of-Function Phenotypes Induced by Satellite-Specific DNA-Binding Drugs Fed to Drosophila Melanogaster. Mol. Cell 2000, 6, 1013–1024. [Google Scholar] [CrossRef]
- Song, L.; Crawford, G.E. DNase-Seq: A High-Resolution Technique for Mapping Active Gene Regulatory Elements Across the Genome from Mammalian Cells. Cold Spring Harb. Protoc. 2010, 2010, pdb-prot5384. [Google Scholar] [CrossRef] [Green Version]
- Simon, J.M.; Giresi, P.G.; Davis, I.J.; Lieb, J.D. Using Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) to Isolate Active Regulatory DNA. Nat. Protoc. 2012, 7, 256–267. [Google Scholar] [CrossRef] [Green Version]
- Buenrostro, J.D.; Wu, B.; Chang, H.Y.; Greenleaf, W.J. ATAC-Seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr. Protoc. Mol. Biol. 2015, 109, 21–29. [Google Scholar] [CrossRef]
- Lee, D.; Karchin, R.; Beer, M.A. Discriminative Prediction of Mammalian Enhancers from DNA Sequence. Genome Res. 2011, 21, 2167–2180. [Google Scholar] [CrossRef] [Green Version]
- Ghandi, M.; Lee, D.; Mohammad-Noori, M.; Beer, M.A. Enhanced Regulatory Sequence Prediction Using Gapped k-Mer Features. PLoS Comput. Biol. 2014, 10, e1003711. [Google Scholar] [CrossRef] [Green Version]
- Beer, M.A. Predicting Enhancer Activity and Variant Impact Using Gkm-SVM. Hum. Mutat. 2017, 38, 1251–1258. [Google Scholar] [CrossRef] [Green Version]
- Xu, Y.; Strick, A.J. Integration of Unpaired Single-Cell Chromatin Accessibility and Gene Expression Data via Adversarial Learning. arXiv 2021, arXiv:2104.12320. [Google Scholar]
- Kumar, S.; Bucher, P. Predicting Transcription Factor Site Occupancy Using DNA Sequence Intrinsic and Cell-Type Specific Chromatin Features. BMC Bioinform. 2016, 17, S4. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the Sequence Specificities of DNA-and RNA-Binding Proteins by Deep Learning. Nat. Biotechnol. 2015, 33, 831–838. [Google Scholar] [CrossRef]
- Zhou, J.; Troyanskaya, O.G. Predicting Effects of Noncoding Variants with Deep Learning–Based Sequence Model. Nat. Methods 2015, 12, 931–934. [Google Scholar] [CrossRef] [Green Version]
- Min, X.; Zeng, W.; Chen, N.; Chen, T.; Jiang, R. Chromatin Accessibility Prediction via Convolutional Long Short-Term Memory Networks with k-Mer Embedding. Bioinformatics 2017, 33, i92–i101. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Liu, Q.; Xia, F.; Yin, Q.; Jiang, R. Chromatin Accessibility Prediction via a Hybrid Deep Convolutional Neural Network. Bioinformatics 2018, 34, 732–738. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Guo, Y.; Zhou, D.; Nie, R.; Ruan, X.; Li, W. DeepANF: A Deep Attentive Neural Framework with Distributed Representation for Chromatin Accessibility Prediction. Neurocomputing 2020, 379, 305–318. [Google Scholar] [CrossRef]
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A Convolutional Neural Network for Modelling Sentences. arXiv 2014, arXiv:1404.2188. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing—EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Sun, C.; Yang, Z.; Luo, L.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. A Deep Learning Approach with Deep Contextualized Word Representations for Chemical–Protein Interaction Extraction from Biomedical Literature. IEEE Access 2019, 7, 151034–151046. [Google Scholar] [CrossRef]
- Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Giné, E. The lévy-Lindeberg Central Limit Theorem. Proc. Am. Math. Soc. 1983, 88, 147–153. [Google Scholar] [CrossRef] [Green Version]
- Horn, R.A. The Hadamard Product. In Proceedings of the Symposia in Applied Mathematics, Phoenix, AZ, USA, 10–11 January 1989; Volume 40, pp. 87–169. [Google Scholar]
- Liu, F.; Perez, J. Gated End-to-End Memory Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Long Papers. Volume 1, pp. 1–10. [Google Scholar]
- Baldi, P.; Sadowski, P.J. Understanding Dropout. Adv. Neural Inf. Process. Syst. 2013, 26, 2814–2822. [Google Scholar]
- Pan, B.; Kusko, R.; Xiao, W.; Zheng, Y.; Liu, Z.; Xiao, C.; Sakkiah, S.; Guo, W.; Gong, P.; Zhang, C.; et al. Similarities and Differences Between Variants Called with Human Reference Genome Hg19 or Hg38. BMC Bioinform. 2019, 20, 17–29. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- De Ruijter, A.; Guldenmund, F. The bowtie method: A review. Saf. Sci. 2016, 88, 211–218. [Google Scholar] [CrossRef]
- John, S.; Sabo, P.J.; Thurman, R.E.; Sung, M.-H.; Biddie, S.C.; Johnson, T.A.; Hager, G.L.; Stamatoyannopoulos, J.A. Chromatin Accessibility Pre-Determines Glucocorticoid Receptor Binding Patterns. Nat. Genet. 2011, 43, 264–268. [Google Scholar] [CrossRef]
- Klenova, E.M.; Nicolas, R.H.; Paterson, H.F.; Carne, A.F.; Heath, C.M.; Goodwin, G.H.; Neiman, P.E.; Lobanenkov, V.V. CTCF, a conserved nuclear factor required for optimal transcriptional activity of the chicken c-myc gene, is an 11-Zn-finger protein differentially expressed in multiple forms. Mol. Cell. Biol. 1993, 13, 7612–7624. [Google Scholar]
- Colclough, K.; Bellanne-Chantelot, C.; Saint-Martin, C.; Flanagan, S.E.; Ellard, S. Mutations in the genes encoding the transcription factors hepatocyte nuclear factor 1 alpha and 4 alpha in maturity-onset diabetes of the young and hyperinsulinemic hypoglycemia. Hum. Mutat. 2013, 34, 669–685. [Google Scholar] [CrossRef]
- Dietterich, T.G. Ensemble Learning. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 2002; Volume 2, pp. 110–125. [Google Scholar]
- Chawla, N.V.; Sylvester, J. Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets. In Proceedings of the International Workshop on Multiple Classifier Systems, Prague, Czech Republic, 23–25 May 2007; Springer: Berlin, Germany, 2007; pp. 397–406. [Google Scholar]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).