A Clustering Approach for Motif Discovery in ChIP-Seq Dataset

Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences generated by a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm is able to make an almost accurate prediction of TFBSs in a reasonable time. Also, the validity of the AP-ChIP algorithm is tested on a real ChIP-Seq data set.


Introduction
Transcription factor binding sites (TFBSs) [1] are short, degenerated nucleotide fragments (usually ≤30 bps) located in specific DNA regions. They play an important role in regulating gene expression. The Planted (l, d) Motif Search (PMS) problem [2] is a popular motif model for the identification of TFBSs (i.e., motif discovery) in bioinformatics, and is formally defined as follows: Definition 1 (PMS). Given a set of DNA sequences X = {X 1 , X 2 , . . . , X t } with |X i | = n and three non-negative integers d, q, l, with 0 ≤ d < l < n, 0 < q ≤ t, the PMS problem is to find an l-mer M (a string of length l), such that each selected sequence X i has an l-mer M i with Hamming distance d H (M, M i ) ≤ d, for i = 1, 2, . . . , q. The l-mer M is called an (l, d) motif and the l-mer M i is called a motif instance.
According to the different values of q representing the distribution of motif instances, there are three different motif discovery sequence models [3]: (i) Exactly one motif occurrence per sequence (the OOPS model), (ii) zero or one motif occurrences per sequence (the ZOOPS model), or (iii) zero or more motif occurrences per sequence (the TCM model). For the OOPS model, q = t, for the ZOOPS model and TCM model, 0 < q < t.
Generally, there are two kinds of algorithms for solving the PMS problem: exact algorithms and approximate algorithms. Exact algorithms [4][5][6][7][8][9][10] always use consensus sequences [11] to represent motifs and can find all (l, d) motifs. Most exact algorithms are pattern-driven algorithms, which attempt to enumerate all possible 4 l l-mers (substring patterns of length l) to find the l-mer with the maximum number of approximate occurrences. Approximate algorithms [12][13][14][15][16][17][18] usually use a position weight matrix (PWM) [19] to describe the most likely occurring motifs and can report results in a short time, but do not always identify all (l, d) motifs. Most approximate algorithms use probabilistic analysis to maximize the score function which describes how likely it is for an l-mer pattern to be a motif instance.
Recently, chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has produced extremely valuable information for the genome-wide identification of transcription factor binding sites (TFBSs) and in the field of epigenetics, which mainly focus on DNA methylation, nucleosome localization, and histone modification. For transcription factors, ChIP-Seq is widely used to study the binding of transcription factors for the analysis of gene expression regulation on a genome-wide scale. For histones, ChIP-Seq performs high-throughput histone modification sequencing in the whole genome with sufficient sequencing depth and range, which not only improves the sensitivity and specificity of sequencing, but can also transform qualitative sequencing methods into quantitative detection.
In this paper, we focus on an algorithm to discover transcription factor binding sites in a ChIP-Seq data set. A ChIP-Seq data set is a set of peak regions containing TFBSs obtained through ChIP-Seq experiments, read mapping, and peak calling. It contains hundreds or more DNA sequences, which increases the difficulty of accurate and efficient identification of TFBSs.
Some algorithms have recently been proposed to discover TFBSs in ChIP-Seq data sets [20][21][22][23][24][25][26][27][28][29]. However, none of them has proven to be absolutely superior, compared to the rest. Some of these are tailored versions of previous motif discovery algorithms, specifically tailored towards ChIP-Seq data sets, such as MEME-ChIP [20] and HMS [21]. MEME-ChIP [20], which incorporates two complementary motif discovery algorithms, known as MEME and DREME [22], can identify motifs without restriction on the size or number of sequences, allowing very large ChIP-Seq data sets to be analyzed. HMS [21], which is an improved version of Gibbs Sampler, combines stochastic sampling and a deterministic greedy search step, which improves computation efficiency. DREME [22] is specifically designed to find short, core DNA-binding motifs of eukaryotic TFs, and is optimized to analyze very large ChIP-Seq data sets in just minutes. One may speed up the existing motif discovery algorithms by integrating some information, such as in the cases of STEME [23] and ChIP-Munk [24]. STEME [23] accelerates MEME by indexing sequences with suffix trees. ChIP-Munk [24] combines a greedy approach with an expectation-maximization (EM) algorithm to achieve a high efficiency. There are also exhaustive methods for determining exact motifs in ChIP-Seq data sets, such as FMotif [25] and Weeder [26]. FMotif [25] first constructs a mismatched suffix tree to scan and count all possible motif instances, and then implements a depth-first search to enumerate all possible motifs. However, the run time of FMotif becomes unrealistic with increasing values of l and d. Others use word enumeration methods to process full-size ChIP-Seq data sets, such as CisFinder [28] and MCES [29]. CisFinder [28] employs a word clustering method to group short l-mers (l = 7, 8, or 9), but struggles to find exact (l, d) motifs with larger values of l and d in ChIP-Seq data sets. MCES [29], a new planted (l, d) motif discovery algorithm, mines and combines substrings to rapidly identify exact motifs in full-size ChIP-Seq data sets.
In this paper, we propose a new motif discovery algorithm, named AP-ChIP, which is specially designed for better discovering TFBSs in ChIP-Seq data sets. The algorithm first constructs and then further filters cluster subsets using probabilistic analysis. Then, Affinity Propagation (AP) clustering [30] is applied to the candidate cluster subsets in order to discover optimal motifs. Experimental results show that the AP-ChIP Algorithm 1 can find TFBSs in a ChIP-Seq data set very efficiently and effectively.

Method
A ChIP-Seq data set has the following fundamental features: (i) Some of the sequences may contain no motifs at all; and (ii) thousands of sequences lead to huge amount of background l-mers. To cater to ChIP-Seq data sets, we design the AP-ChIP Algorithm 1 under the ZOOPS model and set some proper thresholds to filter redundant background l-mers. More specifically, the AP-ChIP Algorithm 1 consists of the following three steps:

Construct Cluster Subsets
We introduce the observation that any two motif instances x 1 and x 2 , each of which differs from the same motif x up to d positions, must have Hamming distance of no more than 2d, denoted as d H (x 1 , x 2 ) ≤ 2d. Consequently, if an l-mer in one sequence is a motif instance, all other motif instances in the remaining sequences will be gathered in the corresponding cluster subset. Under the ZOOPS model, we choose h (h = t − q + 1) sequences as the reference sequences to ensure that at least one sequence among the h sequences contains a motif instance [31]. In general, we use the first h sequences {X 1 , X 2 , . . . , X h } as the reference sequences. As the l-mer which is the motif instance is not known in advance, we consider all l-mers x i,j (i = 1, 2, . . . , h; j = 1, 2, . . . , n − l + 1) in the first h sequences as the reference subsequences.
The ideal cluster subsets are expected to contain as few background l-mers as possible and, also, to include sufficient motif instances. Therefore, we set a threshold k (d < k ≤ 2d), so that, for each reference subsequence x i,j , all l-mers x i ,j (i = 1, 2, . . . , t, i = i; j = 1, 2, . . . , n − l + 1) in the whole sequences, except the ith sequence X i such that d H (x i,j , x i ,j ) ≤ k, are selected to construct a cluster subset; that is where represents the set of the selected l-mers in the i th sequence X i and x i ,j ∈ l X i if and only if x i ,j is an l−mer of the sequence X i . To set a proper threshold k, two probabilistic expressions are employed. The first is the probability of the Hamming distance between two random l-mers x 1 and x 2 being no more than k [4]: ( The other is the probability of the Hamming distance between two selected motif instances m 1 and m 2 being no more than k: [18].
Now, we describe the method for calculating p kmoti f . Given two motif instances, m 1 and m 2 , of the same motif m 0 with up to d mutations, the distances between m 1 , m 2 , and m 0 satisfy d H (m 0 , Thus, the probability p kmoti f can be calculated as and p(α, β) represents the probability of d H (m 0 , m 1 ) = α and d H (m 0 , m 2 ) = β; that is As d H (m 0 , m 1 ) = α and d H (m 0 , m 2 ) = β are independent, we have Combining Equations (5)-(9), we have Having calculated the two probabilities p k and p kmoti f , we now describe how to set the proper threshold k, in order to construct the cluster subsets which contain as few background l-mers as possible while including sufficient motif instances. A larger value of p kmoti f indicates more motif instances belong to the cluster subset; however, a smaller value of p k suggests that fewer background l-mers appear in the same cluster subset. Therefore, the threshold k should be set in a way that ensures that the value of p kmoti f is large enough, compared to the value of p k .
To demonstrate this issue, let us consider the (18, 5) problem instance as an example. The values of p k and p kmoti f are shown in Table 1. When k = 7, the value of p kmoti f is 0.7414, which allows us to obtain sufficient motif instances, whereas the value of p k is 0.0012 which, in turn, allows us to reduce the scale of background l-mers in the same cluster subset. Therefore, the optimal value of k is 7.

Filter Cluster Subsets
As is known, the true motif instances must exist in one of these h × (n − l + 1) cluster subsets C(x i,j , X), which are constructed with the reference subsequences x i,j (i = 1, 2, . . . , h; j = 1, 2, . . . , n − l + 1) from the first h sequences. However, with a great number of total cluster subsets, the identification of the cluster subsets that contain the true motif instances is highly time-consuming as most of these cluster subsets have redundant background l-mers.
To filter the interference cluster subsets, a threshold p f occ (i.e., an occurrence frequency) [29] is employed with the purpose of analyzing the probability of a random motif instance x occurring in a given sequence.
where p mut is the probability of a character mutation in one position. For each sequence then the sequence X i may contain a motif instance and is stored in a set N(i, j). For each N(i, j), if the number of the selected sequences in N(i, j), denoted as |N(i, j)|, is not less than q, then there are at least q possible motif instances in the corresponding cluster subset C(x i,j , X), the reference subsequence x i,j is considered as a potential motif instance, and the corresponding cluster subset C(x i,j , X) is a candidate cluster subset C candidate (x i,j , X).

Refine Cluster Subsets
Due to the occurrence frequency p f occ and the relatively small Hamming distance k between the reference subsequence and the selected l-mers, only a limited number of cluster subsets, which contains a small number of the selected l-mers, are retained as candidate cluster subsets for further Affinity Propagation (AP) clustering. In order to quickly produce highly conserved cluster subsets, we apply AP clustering to each candidate cluster subset. Compared to other clustering approaches, AP clustering can cluster large-scale data sets efficiently by exchanging messages between data points.

Affinity Propagation (AP) Clustering
As demonstrated in our recent work [32], it is possible to speed up AP clustering and improve its accuracy with the adapted similarity s(i, k), which is based on pair-wise constraints and a variable-similarity measure [33]. The adapted similarity, s(i, k), between two l-mers x i,j and x k,j is defined as where Note that R 1 ∈ (1, +∞) and R 2 ∈ (0, 1]. For each candidate cluster subset, based on the adapted similarity s(i, k), AP clustering recursively calculates two types of messages. The first type is the responsibility r(i, k), which reflects the suitability of point x k,j as the exemplar for point x i,j . The other type is the availability a(i, k), which indicates how suitable it would be for a point x i,j to choose the point x k,j as its exemplar: When the AP clustering converges, a set of l-mers in the produced cluster subset are selected as exemplars e(i) associated to the point x i,j :

Cluster Subset Refinement
To select an adequate number of desired cluster subsets with more motif instances but less background l-mers, we set an interval [min size, max size] = [t − q, t] to further refine the cluster subsets C AP (x i,j , X), which is produced, by AP clustering, using a reference l-mer x i,j . Regarding the number of the l-mers in each cluster subset C AP (x i,j , X), we conclude that there are three cases: (i) In the case where the number is less than t − q (for example, , such a small number of l-mers is not enough to create a cluster subset C AP (x i,j , X) which represents a true motif, so we consider it as an invalid cluster subset.
(ii) In the case where the number is between t − q and t (for example t − q < |C AP (x i,j , X)| ≤ t), we consider it as a valid cluster subset.
(iii) In the case where the number is more than t (for example |C AP (x i,j , X)| > t), the size of the cluster subset is so large that it may include too many background l-mers and, so, we use a greedy strategy to select t l-mers from C AP (x i,j , X) to form C valid (x i,j , X). First, C valid (x i,j , X) is initialized with the AP clustering exemplar e(i). Then, an l-mer x r,j from C AP (x i,j , X) − C valid (x i,j , X) is repeatedly chosen, following Equations (21) and (22), and added to C valid (x i,j , X) until |C valid (x i,j , X)| = t): x r,j = arg max where len(x i,j , x k,j ) is the length of the maximum intersection of x i,j and x k,j . To appropriately sort the valid cluster subsets C valid (x i,j , X), we use the information content (IC) [34], and the cluster subset with the maximum value of IC is considered as the true motif model: where p w,m is the probability of the character w ∈ A, T, C, G at the position m of the l-mer x i,j , and p w,0 is the corresponding background probability. Based on the above described three steps, the whole AP-ChIP Algorithm 1 is described as follows: Algorithm 1: AP-ChIP algorithm if |N(i, j)| ≥ q then 19 C candidate (x i,j , X) ← C(x i,j , X) 20 end 21 end 22 for each C candidate (x i,j , X) do 23 Use AP clustering and a greedy strategy to generate valid cluster subsets C valid (x i,j , X); 24 end 25 IC max ← 0; 26 for each C valid (x i,j , X) do 27 if IC(C valid (x i,j , X)) > IC max then 28 IC max ← IC(C valid (x i,j , X)); 29 end 30 end 31 get X moti f from IC max Steps 2-17 describe the process of constructing the cluster subsets. Steps 18-21 describe the filtration of the cluster subsets. Steps 22-31 describe the refinement of the cluster subsets and the verification of the motif with the maximum IC score.

Results on Simulated Data
Simulated data provide quantitative measures to test the performance of the AP-ChIP Algorithm 1. As in [29], we generate the simulated data as follows: First, we generated t independent and identically distributed (i.i.d) sequences of length n and a motif m of length l. Second, we randomly generated q (0 < q ≤ t) motif instances, each of which differed from the motif m at up to d positions. Third, the q motif instances were placed in a random position in a random selection of q sequences selected out of the t sequences. We, then, implemented the AP-ChIP Algorithm 1, using Matlab on a computer with a 2.6 GHZ processor and 4 Gbyte memory. The final experimental results consisted of the averages of five trials of simulated data experiments.
To evaluate the motif prediction accuracy, the nucleotide level performance coefficient (nPC), as defined by Pevezner and Sze [2], was used: where K is the set of nucleotide positions in the true motif and P is the corresponding set of nucleotide positions in the predicted motif. The value of nPC is between 0 and 1; the larger the value of nPC, the higher the accuracy of the predicted motif. We used the 2d neighborhood probability p 2d [8] to select a group of (l, d) motif instances. The larger the value of p 2d , the weaker the corresponding (l, d) problem instance becomes: In what follows, according to different values of α = q t (i.e., the ratio of the sequences containing motif instances to all the sequences), we designed two groups of experiments, both of which consisted of two sub-experiments, to test the performance of the AP-ChIP Algorithm 1 on simulated data sets.
In the first group of experiments, we set α = 100% for both sub-experiments. We compared the performance of the AP-ChIP Algorithm 1 with that of the widely used motif finding algorithms MEME [13], VINE [14], and Projection [16].
In the first sub-experiment, we set the number of sequences as t = 20 with sequence length n = 600. The running time and the predicted accuracy of these algorithms are shown in Table 2. For the instances (12, 2) and (15, 3) with p 2d < 0.05, the AP-ChIP Algorithm 1 achieved near-optimal predicted accuracy within a relatively short time. For the instances (15,4), (14,4), (25,8), and (21, 7) with p 2d ≥ 0.05, the predicted accuracy of the AP-ChIP Algorithm 1 was over 90% and the running time of the AP-ChIP Algorithm 1 remained competitive, compared to the other algorithms. In general, it is easy to find the true motif by increasing the sequence number and decreasing its length. Therefore, in the second sub-experiment, we set the sequence number as t = 1000 with sequence length n = 200. Table 3 shows that, for all (l, d) problem instances with p 2d ≥ 0.05, the predicted accuracy of the AP-ChIP Algorithm 1 is over 90% with the computational costs being satisfactory.
Next, in the second group of experiments, to simulate a real ChIP-Seq data set, we set α = 90% for both sub-experiments. This is because in real ChIP-Seq data set, most but not all of the sequences contain motif instances. In the first sub-experiment, we test the validity of the AP-ChIP Algorithm 1 on the simulated ChIP-Seq data set for the identification of (l, d) motifs with t = 1000 and n = 200. We choose p 2d = 0.05 to select a group of (l, d) problem instances. The reason for this choice is that p 2d = 0.05 is approximately the same as the p 2d value of the (15,4) problem instance, which is one of the most popular benchmarks for (l, d) problem instance. The running time and the predicted motif by the AP-ChIP Algorithm 1 are shown in Table 4. For each (l, d) problem instance, the AP-ChIP Algorithm 1 finds almost the same motif as the published one and also runs quite efficiently. In the second sub-experiment, in order to further demonstrate the performance of the AP-ChIP Algorithm 1 on the simulated ChIP-Seq data set, we compared the AP-ChIP Algorithm 1 against some established motif-finding algorithms in the following two aspects: (i) Different values of p 2d , ranging from 0.05 to 0.5, with fixed α = 90%, and (ii) different values of α floating from 0.7 to 1 with fixed (l, d) = (9, 2). Although a genome-wide ChIP-Seq data set typically has thousands to tens of thousands of sequences, using 20% to 50% of the ChIP-Seq data set is usually adequate for obtaining a good estimate of the true motifs. MEME-ChIP, a well-known algorithm for discovering motifs in ChIP-Seq data sets, is able to well identify (l, d) motifs with only 600 sequences. Thus, it was reasonable to set the sequence number as t = 600 and sequence length as n = 200 for motif discovery in our experiments.
First, we compared the running time and prediction accuracy of the AP-ChIP Algorithm 1 with those of the three compared algorithms, MEME-ChIP [20], ChIP-Munk [24], and FMotif [25], on different values of p 2d and with fixed α = 90%. As shown in Table 5, for each (l, d) problem instance, the AP-ChIP Algorithm 1 could solve it in a relatively short time, and its prediction accuracy was better than those of the three compared algorithms. Specifically, with increasing values of l and d, FMotif found it difficult to find exact motifs. Next, we compared the prediction accuracy of AP-ChIP Algorithm 1 with those of the algorithms MEME [13], MEME-ChIP [20], DREME [22], and FMotif [25], on different values of α floating from 0.7 to 1 and fixed (l, d) = (9, 2). As the ratio of a motif instance α = q t has a strong effect on the prediction accuracy, it was necessary for us to test the prediction accuracy of AP-ChIP Algorithm 1 on different values of α. It is rather cumbersome to identify the true motif when the value of α is small. FMotif is a powerful, exhaustive algorithm for finding exact short (l, d) motifs (l ≤ 10, d ≤ 2) contained in ChIP-Seq data sets. As shown in Figure 1, the prediction accuracy of AP-ChIP Algorithm 1 was nearly the same as that of FMotif, and was higher than that of MEME-ChIP, MEME, and DREME.

Results on Real Data
First, we tested the validity of the AP-ChIP Algorithm 1 for the identification of real motifs using a diverse set of real ChIP-Seq data sets; specifically, on 12 differently sized mESC data sets (Nanog, Oct4, Sox2, Esrrb, Zfx, Klf4, c-Myc, n-Myc, Tcfcp21l, Smad1, STAT3, and CTCF) [35]. We compared the motifs detected by the AP-ChIP Algorithm 1 with the motifs published by Chen et al. [35], and presented them in the form of sequence logos [36], which graphically represent the degree of motif conservation, as measured by relative entropy. Table 6 shows the running times and the predicted motifs of the AP-ChIP Algorithm 1. For each data set, the AP-ChIP Algorithm 1 was capable of finding motifs highly similar to the published ones within a reasonable time.
Moreover, to better show the results, we compared AP-ChIP 1 with MEME-ChIP on nine differently sized ENCODE data sets (Nfyb, Hnf4, Elf1, Ets, Egr1, Yy1, Six5, Srf, and Tal1) [37], where the TFBSs were referenced in the JASPAR database [38]. Table 7 shows the published motifs and the motifs predicted by the two algorithms. The motifs are also shown in the form of sequence logos. AP-ChIP 1 could successfully find a motif similar to the published motif for each data set, while, for some data sets MEME-ChIP failed to accurately predict the motif (e.g., in the Elf1 data set), or lost information on individual bases (e.g., in the Tal dataset). Table 6. Results on the mESC data set.  Table 7. Results on the ENCODE dataset.