Sublinear Time Motif Discovery from Multiple Sequences

Bin Fu; Yunhui Fu; Yuan Xue

doi:10.3390/a6040636


Department of Computer Science, University of Texas-Pan American, 1201 W University Dr., Edinburg, TX 78539, USA

* Author to whom correspondence should be addressed.

Algorithms 2013, 6(4), 636-677; https://doi.org/10.3390/a6040636

This article belongs to the Special Issue Algorithms for Sequence Analysis and Storage


Abstract

In this paper, we study a natural probabilistic model for motif discovery that has been used to experimentally test the quality of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet, Σ. A motif $G = g_{1} g_{2} \dots g_{m}$ is a string of m characters. A probabilistically-generated approximate copy of G is implanted in each background sequence. For a probabilistically-generated approximate copy $b_{1} b_{2} \dots b_{m}$ of G, every character, $b_{i}$, is probabilistically generated, such that the probability for $b_{i} \neq g_{i}$ is at most α. We develop two new randomized algorithms and one new deterministic algorithm. They make advancements in the following aspects: (1) The algorithms are much faster than those before. Our algorithms can even run in sublinear time. (2) They can handle any motif pattern. (3) The only restriction on the alphabet size is that it be at least four. This gives them potential applications in practical problems, since gene sequences have an alphabet size of four. (4) All algorithms have rigorous proofs about their performances. The methods developed in this paper have been used in a software implementation. We observed some encouraging results that show improved performance for motif detection compared with other software.

Keywords:

motif discovery; sublinear time; randomized algorithm; deterministic algorithm

1. Introduction

Motif discovery is an important problem in computational biology and computer science. For instance, it has applications in coding theory [1,2], locating binding sites and conserved regions in unaligned sequences [3,4,5,6], genetic drug target identification [7], designing genetic probes [7] and universal PCR primer design [7,8,9,10].

This paper focuses on the application of motif discovery to find conserved regions in a set of given DNA, RNA or protein sequences. Such conserved regions may represent common biological functions or structures. Many performance measures have been proposed for motif discovery. Let C be a subset of 0–1 sequences of length n. The covering radius of C is the smallest integer, r, such that each vector in $\{0, 1\}^{n}$ is at a distance at most r from a string in C. The decision problem associated with the covering radius for a set of binary sequences is NP-complete [1]. The similar closest string and substring problems were proven to be NP-hard [1,7]. Some approximation algorithms have been proposed. Li et al. [11] gave an approximation scheme for the closest string and substring problems. The related consensus patterns problem is as follows: given n sequences $s_{1}, \dots, s_{n}$, find a region of length L in each $s_{i}$ and a string, s, of length L, so that the total Hamming distance from s to these regions is minimized. Approximation algorithms for the consensus patterns problem were reported in [12]. Furthermore, a number of heuristics and programs have been developed [13,14,15,16,17].

In many applications, motifs are faint and may not be apparent when only two sequences are compared, but may become clearer when more sequences are compared together [18]. For this reason, it has been conjectured that comparing more sequences together can help with identifying faint motifs. This paper takes a theoretical approach with a rigorous probabilistic analysis.

We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet, Σ. A motif $G = g_{1} g_{2} \dots g_{m}$ is a string of m characters. A probabilistically-generated approximate copy of G is implanted in each background sequence. For a probabilistically-generated approximate copy $b_{1} b_{2} \dots b_{m}$ of G, every character, $b_{i}$, is probabilistically generated, such that the probability for $b_{i} \neq g_{i}$, which is called a mutation, is at most α. This model was first proposed in [13] and has been widely used in experimentally testing motif discovery programs [14,15,16,17]. We note that a mutation in our model converts a character, $g_{i}$, in the motif into a different character, $b_{i}$, without probability restriction. This means that a character, $g_{i}$, in the motif need not change into each character $b_{i}$ in $Σ - \{g_{i}\}$ with equal probability.

We develop three algorithms that, under the probabilistic model, find the implanted motif with high probability via a tradeoff between computational time and the probability of mutation. Each algorithm has a preprocessing phase and a voting phase. We use a pair of functions, $(t_{1} (n, k), t_{2} (n, k))$, to describe the computational complexity of the motif detection algorithm, where n is the largest length of an input sequence and k is the number of sequences. Function $t_{1} (n, k)$ is the time complexity of the preprocessing phase, and $t_{2} (n, k)$ is the time complexity for recovering one character of the motif after preprocessing. The total time is $O (t_{1} (n, k) + t_{2} (n, k) | G |)$.

(1) There exist a randomized algorithm and positive constants, $c_{0}$ and $c_{1}$, such that, if the alphabet size is at least four, the number of sequences is at least $c_{1} log n$, the motif length is at least $c_{0} log n$ and each character in the motif region has a probability of at most $\frac{1}{{(log n)}^{2 + μ}}$ of mutation for some fixed $μ > 0$, then the motif can be recovered with a probability of at least $\frac{3}{4}$ in $(O (\frac{n}{\sqrt{h}} {(log n)}^{\frac{7}{2}} + h^{2} {log}^{2} n), O (log n))$ time, where n is the longest length of any input sequence and $h = min (| G |, n^{\frac{2}{5}})$. The algorithm's total time is sublinear if the motif length, $| G |$, is in the range $[{(log n)}^{7 + μ}, \frac{n}{{(log n)}^{1 + μ}}]$. This is the first sublinear time algorithm with rigorous analysis in this model.

(2) There exist a randomized algorithm and positive constants, $c_{0}, c_{1}$ and α, such that if the alphabet size is at least four, the number of sequences is at least $c_{1} log n$, the motif length is at least $c_{0} log n$ and each character in the motif region has a probability of at most α of mutation, then the motif can be recovered with a probability of at least $\frac{3}{4}$ in $(O (\frac{n^{2}}{| G |} {(log n)}^{O (1)}), O (log n))$ time.

(3) There exist a deterministic algorithm and positive constants, $c_{0}, c_{1}$ and α, such that if the alphabet size is at least four, the number of sequences is at least $c_{1} log n$, the motif length is at least $c_{0} log n$ and each character in the motif region has a probability of at most α of mutation, then the motif can be recovered with a probability of at least $\frac{3}{4}$ in $(O (n^{2} {(log n)}^{O (1)}), O (log n))$ time.
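A short calculation, given here as an informal check rather than as part of the paper's proofs, indicates where the range for $| G |$ in result (1) comes from. Recall that the total time is $O (t_{1} (n, k) + t_{2} (n, k) | G |)$ with $h = min (| G |, n^{\frac{2}{5}})$, and suppose ${(log n)}^{7 + μ} \leq | G | \leq \frac{n}{{(log n)}^{1 + μ}}$. If $h = | G |$, the first preprocessing term satisfies:

$\begin{matrix} \frac{n}{\sqrt{h}} {(log n)}^{\frac{7}{2}} \leq \frac{n {(log n)}^{\frac{7}{2}}}{{(log n)}^{\frac{7 + μ}{2}}} = \frac{n}{{(log n)}^{\frac{μ}{2}}} = o (n) \end{matrix}$

If instead $h = n^{\frac{2}{5}}$, the same term equals $n^{\frac{4}{5}} {(log n)}^{\frac{7}{2}} = o (n)$. In both cases, $h^{2} {log}^{2} n \leq n^{\frac{4}{5}} {log}^{2} n = o (n)$. Finally, the voting phase costs:

$\begin{matrix} O (log n) \cdot | G | \leq O (\frac{n}{{(log n)}^{μ}}) = o (n) \end{matrix}$

so every term of the total time is $o (n)$ when $| G |$ lies in the stated range.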

Research on this model has been reported in [19,20,21]. In [19], Fu et al. developed an algorithm that needs the alphabet size to be a constant that is much larger than four. In [20], our algorithm cannot handle all possible motif patterns. In [21], Liu et al. designed an algorithm that runs in $O (n^{3})$ time and lacks rigorous analysis of its performance. Motif recovery in this natural and simple model has not been fully understood and seems to be a complicated problem.

This paper presents two new randomized algorithms and one new deterministic algorithm. They make advancements in the following aspects: (1) The algorithms are much faster than those before. Our algorithms can even run in sublinear time. (2) They can handle any motif pattern. (3) The restriction for the alphabet size is as small as four, giving them potential applications in practical problems, since gene sequences have an alphabet size of four. (4) All algorithms have rigorous proofs about their performances.

The algorithm for motif detection is named Recover-Motif(.). The entire algorithm is described in Section 4.2 and analyzed in Section 6.

2. Notations and the Model of Sequence Generation

For a set, A, $\| A \|$ denotes the number of elements in A. Σ is an alphabet with $\| Σ \| = t \geq 2$. For an integer, $n \geq 0$, $Σ^{n}$ is the set of sequences of length n with characters from Σ. For a sequence $S = a_{1} a_{2} \dots a_{n}$, $S [i]$ denotes the character, $a_{i}$, and $S [i, j]$ denotes the substring, $a_{i} \dots a_{j}$, for $1 \leq i \leq j \leq n$. $| S |$ denotes the length of the sequence, S. We use ∅ to represent the empty sequence, which has a length of zero.

Let $G = g_{1} g_{2} \dots g_{m}$ be a fixed sequence of m characters. G is the motif to be discovered by our algorithm. A $Θ (n, G, α)$-sequence has the form $S = a_{1} \dots a_{n_{1}} b_{1} \dots b_{m} a_{n_{1} + 1} \dots a_{n_{2}}$, where $n_{2} + m \leq n$, each $a_{i}$ has a probability of $\frac{1}{t}$ to be equal to π for each $π \in Σ$ and each $b_{i}$ differs from $g_{i}$ with a probability of at most α for $1 \leq i \leq m$, where $m = | G |$. $ℵ (S)$ denotes the motif region $b_{1} \dots b_{m}$ of S. A mutation converts a character, $g_{i}$, in the motif into an arbitrary, different character, $b_{i}$, without probability restriction: $g_{i}$ may change into the characters in $Σ - \{g_{i}\}$ with different probabilities, subject only to the restriction that $g_{i}$ mutates with a probability of at most α. The motif region of S may start at an arbitrary position in S.
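The generation model above can be sketched in a few lines of Python. This is only an illustration under assumptions that the model itself does not fix: the function name `theta_sequence` is hypothetical, the sequence has length exactly n, the motif start is chosen uniformly, each motif character mutates with probability exactly α (the model only requires at most α), and a mutated character is drawn uniformly from the remaining characters (the model allows any distribution).

```python
import random


def theta_sequence(n, motif, alpha, alphabet="ACGT", rng=None):
    """Generate one sequence in the style of a Theta(n, G, alpha)-sequence.

    Returns the sequence and the 0-based start of the planted motif copy.
    The uniform start and uniform mutation target are assumptions of this
    sketch, not requirements of the model.
    """
    rng = rng or random.Random()
    m = len(motif)
    if m > n:
        raise ValueError("motif is longer than the sequence")
    start = rng.randrange(n - m + 1)
    # Background: every character is uniform over the alphabet.
    seq = [rng.choice(alphabet) for _ in range(n)]
    # Planted copy: each position mutates independently with probability alpha.
    for i, g in enumerate(motif):
        if rng.random() < alpha:
            others = [ch for ch in alphabet if ch != g]
            seq[start + i] = rng.choice(others)
        else:
            seq[start + i] = g
    return "".join(seq), start
```

Calling `theta_sequence(50, "ACGTACGT", 0.1)` k times yields a toy input set for experimenting with the later steps.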

For two sequences $S_{1} = a_{1} \dots a_{m}$ and $S_{2} = b_{1} \dots b_{m}$ of the same length, let the relative Hamming distance be $diff (S_{1}, S_{2}) = \frac{| \{i : a_{i} \neq b_{i}, 1 \leq i \leq m\} |}{m}$.

Definition 1.

For two intervals, $[i_{1}, j_{1}]$ and $[i_{2}, j_{2}]$, define $shift ([i_{1}, j_{1}], [i_{2}, j_{2}]) = min (| i_{1} - i_{2} |, | j_{1} - j_{2} |)$.
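The two quantities just defined, the relative Hamming distance and the shift between intervals, translate directly into code. The sketch below is a plain transcription; rejecting empty or unequal-length inputs is an assumption added here, since the definition only covers sequences of the same positive length.

```python
def diff(s1, s2):
    """Relative Hamming distance: fraction of positions where s1 and s2 differ."""
    if len(s1) != len(s2) or len(s1) == 0:
        raise ValueError("sequences must have the same positive length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def shift(interval1, interval2):
    """shift([i1, j1], [i2, j2]) = min(|i1 - i2|, |j1 - j2|)."""
    (i1, j1), (i2, j2) = interval1, interval2
    return min(abs(i1 - i2), abs(j1 - j2))
```

For instance, a 16-character motif copy with a single mutation has `diff` equal to 1/16 from the motif.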

3. Brief Introduction to the Algorithm

Every detection algorithm in this paper has two phases. The first phase is preprocessing, so that the motif regions from multiple sequences can be aligned in the same column region. The second phase recovers the motif via voting. We use a pair of functions, $(t_{1} (n, k), t_{2} (n, k))$, to describe the computational complexity of the motif detection algorithm. Function $t_{1} (n, k)$ is the time complexity of the preprocessing phase, and $t_{2} (n, k)$ is the time complexity for outputting one character of the motif in the voting phase.

The motif, G, is a pattern unknown to algorithm Recover-Motif, and algorithm Recover-Motif will attempt to recover G from a series of $Θ (n, G, α)$-sequences generated by the probabilistic model.

3.1. Algorithm

The algorithm first detects a position that is close to the left motif boundary in a sequence. It finds such a position via sampling and collision between two sequences. After the rough left boundary of a sequence is found, it is used to find the rough left boundaries of the rest of the sequences. Similarly, we find the rough right boundaries of the motif among the input sequences. The exact left boundary of each motif region will be detected in the next phase via voting. Each character of the motif is recovered by voting among all the characters at the same positions in the motif regions of the input sequences. For a sequence, S, a sample point is a random position, i, in S. For two sequences, S and $S^{'}$, with two sample points, i and j, respectively, a rough motif boundary is detected by the similarity of $S [i, i + l]$ and $S^{'} [j, j + l]$ for some reasonably large parameter, l.
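The collision idea can be illustrated with a small sketch: compare the windows that start at the sample points of two sequences and report the pairs whose windows are similar. The function names, the 0-based windows of exactly l characters and the `threshold` parameter are assumptions of this sketch; the paper's actual matching conditions are given later in Definition 9.

```python
def relative_hamming(a, b):
    """Fraction of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def find_collisions(s1, s2, samples1, samples2, l, threshold):
    """Return pairs (i, j) of sample points whose length-l windows are similar.

    A pair collides when the relative Hamming distance between
    s1[i:i+l] and s2[j:j+l] is at most `threshold`.
    """
    hits = []
    for i in samples1:
        w1 = s1[i:i + l]
        if len(w1) < l:          # window runs past the end of s1
            continue
        for j in samples2:
            w2 = s2[j:j + l]
            if len(w2) < l:      # window runs past the end of s2
                continue
            if relative_hamming(w1, w2) <= threshold:
                hits.append((i, j))
    return hits
```

Because two random background windows over an alphabet of size t agree in only about a 1/t fraction of positions, a collision with a small threshold suggests that both windows lie inside motif regions.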

Description of the Algorithm

Input: $Z = Z_{1} \cup Z_{2}$, where $Z_{1} = \{S_{1}^{'}, \dots, S_{2 k_{1}}^{'}\}$ and $Z_{2} = \{S_{1}^{″}, \dots, S_{k_{2}}^{″}\}$ are two sets of input sequences.

Output: the planted motif in each sequence and the consensus string.

Start:

Randomly select sample points from each sequence in both $Z_{1}$ and $Z_{2}$.

For each pair of sequences selected from $Z_{1}$ and $Z_{2}$, find the rough left and rough right boundaries via the matching at sample points.

Improve the rough boundaries.

If the motif boundaries of each sequence in $Z_{2}$ are not empty, use the Voting algorithm to get the planted motifs.

End of Algorithm.

3.2. An Example

We illustrate the idea of our algorithm with the following example. We assume that the original motif is TTTTTAACGATTAGCS. The motif part is displayed in bold font, and each mutated character in the motif region is marked with a subscript *.

3.2.1. Input Sequences

The input contains two groups, $Z_{1} = \{S_{1}^{'}, S_{2}^{'}\}$ and $Z_{2} = \{S_{1}^{″}, S_{2}^{″}, S_{3}^{″}, S_{4}^{″}, S_{5}^{″}\}$.

\begin{matrix} Z_{1} : \\ S_{1}^{'} & = & G T A C C A T G G A TT A_{*} TTAACGATTAGCS T A G A G G A C C T A . \\ S_{2}^{'} & = & A A T C C T T A C_{*} TTTTAACGATTAGCS G T C . \end{matrix}

The above two strings are used to detect the initial motif region, which is then used to locate the motif in the sequences of the second group below.

\begin{matrix} Z_{2} : \\ S_{1}^{″} & = & A T T C G A T C C A G TTTTTAACG G_{*} TTAGCS C A A T T A C T T A G . \\ S_{2}^{″} & = & G C A T T G C A T TTTTTAACGATTA C_{*} CS G T A C T T A G C T A G A T C . \\ S_{3}^{″} & = & T C A G G G C A T C G A G A C TTTTTA G_{*} CGATTAGCS C T A G A A T C A G A C C T . \\ S_{4}^{″} & = & G T A C C T G G C A T T G A A C G TTTTTAACGATTAGC A_{*} T G C A G A T G G A C C T T T A . \\ S_{5}^{″} & = & A A T G G A T C A G A TTTTTAACGATT C_{*} GCS C T A G A T T C A G . \end{matrix}

3.2.2. Select Sample Points

Some sample points of the two sequences in $Z_{1}$ are selected randomly and marked with dots on top.

\begin{matrix} S_{1}^{'} & = & G T \dot{A} C C \dot{A} T G \dot{G} A T \dot{T} A_{*} T TA \dot{A} CGA \dot{T} T \dot{A} GC \dot{S} T A \dot{G} A G \dot{G} A C C \dot{T} A . \\ S_{2}^{'} & = & \dot{A} A T \dot{C} C T T \dot{A} C_{*} \dot{T} TTT \dot{A} AC \dot{G} A \dot{T} TA \dot{G} CS \dot{G} T C . \end{matrix}

3.2.3. Collision Detection

In this step, the left and right rough boundaries of the two sequences are marked. The following shows the left collision, which happens near the left motif boundary; the colliding subsequences, $\bar{T A T T}$ and $\bar{T T T T}$, are marked with overlines.

\begin{matrix} S_{1}^{'} & = & G T \dot{A} C C \dot{A} T G \dot{G} A T \bar{\dot{T} A_{*} TT} A \dot{A} CGA \dot{T} T \dot{A} GC \dot{S} T A \dot{G} A G \dot{G} A C C \dot{T} A . \\ S_{2}^{'} & = & \dot{A} A T \dot{C} C_{*} T T \dot{A} C_{*} \bar{\dot{T} TTT} \dot{A} AC \dot{G} A \dot{T} TA \dot{G} CS \dot{G} T C . \end{matrix}

The following shows the right collision, which happens near the right motif boundary; the two colliding $\bar{T T A G}$ subsequences are marked with overlines.

\begin{matrix} S_{1}^{'} & = & G T \dot{A} C C \dot{A_{*}} T G \dot{G} A T \dot{T} A_{*} TT A \dot{A} CGA \bar{\dot{T} T \dot{A} G} C \dot{S} T A \dot{G} A G \dot{G} A C C \dot{T} A . \\ S_{2}^{'} & = & \dot{A} A T \dot{C} C_{*} T T \dot{A} C_{*} \dot{T} TTT \dot{A} AC \dot{G} A \bar{\dot{T} TA \dot{G}} CS \dot{G} T C . \end{matrix}

3.2.4. Improving the Boundaries

In the early phase of the algorithm, we first detect a small piece of the motif in $S_{1}^{'}$ by comparing $S_{1}^{'}$ and $S_{2}^{'}$. Assume that “$T A_{*} TT$” and “$TTAG$” are found in the left and right motif regions of $S_{1}^{'}$, respectively. The rough motif length is calculated as the difference between the location of the first character, ‘T’, of the first subsequence and the location of the last character, ‘G’, of the second subsequence. In the display below, the underlined “A” marks the rough left boundary of the motif in $S_{1}^{'}$, and the underlined “T” marks the rough right boundary.

\begin{matrix} S_{1}^{'} & = & G T A C C A T G G \underset{̲}{A} {TTA}_{*} TTAACGATTAGCS \underset{̲}{T} A G A G G A C C T A . \\ S_{2}^{'} & = & A A T C C T T \underset{̲}{A} C_{*} TTTTAACGATTAGCS \underset{̲}{G} T C . \end{matrix}

3.2.5. Select Sample Points for the Sequences in $Z_{2}$

Some sample points near the motif boundaries of $S_{1}^{'}$ are selected.

S_{1}^{'} = G T A C C A T G \dot{G} A T \dot{T} A_{*} \dot{T} T AACGATT \dot{A} G \dot{C} S T \dot{A} G A G G A C C T A

.

Sample points are also selected in each sequence in $Z_{2}$.

\begin{matrix} S_{1}^{″} & = & A \dot{T} T C \dot{G} A T C C \dot{A} G T \dot{T} T \dot{T} {TAACGG}_{*} TTAG \dot{C} S C \dot{A} A T \dot{T} A C T T \dot{A} G . \\ S_{2}^{″} & = & G \dot{C} A T T \dot{G} C A T T \dot{T} {TTTAACGATTAC}_{*} \dot{C} S G T \dot{A} C T T \dot{A} G C T \dot{A} G A \dot{T} C . \\ S_{3}^{″} & = & \dot{T} C A \dot{G} G G C A \dot{T} C G A \dot{G} A C TTT \dot{T} {TAG}_{*} CGATTAG \dot{C} S C T A \dot{G} A A T C \dot{A} G A C \dot{C} T . \\ S_{4}^{″} & = & G T \dot{A} C C T \dot{G} G C A T \dot{T} G A A C G T \dot{T} TTTAACGATT \dot{A} {GCA}_{*} T G C \dot{A} G A T \dot{G} G A C C T \dot{T} T A . \\ S_{5}^{″} & = & A A \dot{T} G G A \dot{T} C A G A T \dot{T} {TTTAACGATTC}_{*} G \dot{C} S C T A \dot{G} A T T \dot{C} A G . \end{matrix}

3.2.6. Collision Detection Between $S_{1}^{'}$ and the Sequences in $Z_{2}$

The rough motif boundaries of the sequences in $Z_{2}$ are detected via the collisions with the subsequences near the motif area of $S_{1}^{'}$.

S_{1}^{'} = G T A C C A T G \dot{G} A T \dot{T} A_{*} \dot{T} T AACGATT \dot{A} G \dot{C} S T \dot{A} G A G G A C C T A

.

The rough motif boundaries are marked by the lines over the matched subsequences.

\begin{matrix} S_{1}^{″} & = & A \dot{T} T C \dot{G} A T C C \dot{A} G T \bar{\dot{T} T \dot{T} T} {AACGG}_{*} T \bar{TAG \dot{C}} S C \dot{A} A T \dot{T} A C T T \dot{A} G . \\ S_{2}^{″} & = & G \dot{C} A T T \dot{G} C A T T \bar{\dot{T} TTT} AACGAT \bar{{TAC}_{*} \dot{C}} S G T \dot{A} C T T \dot{A} G C T \dot{A} G A \dot{T} C . \\ S_{3}^{″} & = & \dot{T} C A \dot{G} G G C A \dot{T} C G A \dot{G} A C TTT \bar{\dot{T} {TAG}_{*}} CGAT \bar{TAG \dot{C}} S C T A \dot{G} A A T C \dot{A} G A C \dot{C} T . \\ S_{4}^{″} & = & G T \dot{A} C C T \dot{G} G C A T \dot{T} G A A C G T \bar{\dot{T} TTT} AACG \bar{ATT \dot{A}} {GCA}_{*} T G C \dot{A} G A T \dot{G} G A C C T \dot{T} T A . \\ S_{5}^{″} & = & A A \dot{T} G G A \dot{T} C A G A T \bar{\dot{T} TTT} AACGAT \bar{{TC}_{*} G \dot{C}} S C T A \dot{G} A T T \dot{C} A G . \end{matrix}

3.2.7. Improving the Motif Boundaries for the Sequences in $Z_{2}$

After the collision with the sequences in $Z_{2}$, we obtain the rough locations of the motifs in the sequences in $Z_{2}$. These rough motif boundaries are then improved to be closer to the exact boundaries.

S_{1}^{'} = G T A C C A T G G A \bar{{TTA}_{*} T} T AACGATT \bar{AGCS} T A G A G G A C C T A

.

The improved motif boundaries of the sequences in $Z_{2}$ are marked below. This phase does not cost much time, as only the positions near the rough motif boundaries are tested.

\begin{matrix} S_{1}^{″} & = & A T T C G A T C C A \underset{̲}{G} TTTTTAACG G_{*} TTAGCS \underset{̲}{C} A A T T A C T T A G . \\ S_{2}^{″} & = & G C A T T G C \underset{̲}{A} T {TTTTTAACGATTAC}_{*} CS \underset{̲}{G} T A C T T A G C T A G A T C . \\ S_{3}^{″} & = & T C A G G G C A T C G A G A \underset{̲}{C} {TTTTTAG}_{*} CGATTAGCS \underset{̲}{C} T A G A A T C A G A C C T . \\ S_{4}^{″} & = & G T A C C T G G C A T T G A A C \underset{̲}{G} {TTTTTAACGATTAGCA}_{*} \underset{̲}{T} G C A G A T G G A C C T T T A . \\ S_{5}^{″} & = & A A T G G A T C A G \underset{̲}{A} {TTTTTAACGATTC}_{*} GCS C \underset{̲}{T} A G A T T C A G . \end{matrix}

3.2.8. Motif Boundaries for the Sequences in $Z_{2}$

S_{1}^{'} = G T A C C A T G G A \bar{{TTA}_{*} T} T AACGATT \bar{AGCS} T A G A G G A C C T A

.

Use the pair $(G_{L}, G_{R})$ with $G_{L} = \bar{TTAT}$ and $G_{R} = \bar{AGCS}$ to find the motif boundaries in the sequences of $Z_{2}$. The rough boundaries of the second group are marked below with underlines. In this phase, the exact motif boundaries for most sequences of $Z_{2}$ can be found.

\begin{matrix} S_{1}^{″} & = & A T T C G A T C C A \underset{̲}{G} TTTTTAACG G_{*} TTAGCS \underset{̲}{C} A A T T A C T T A G . \\ S_{2}^{″} & = & G C A T T G C A \underset{̲}{T} {TTTTTAACGATTAC}_{*} CS \underset{̲}{G} T A C T T A G C T A G A T C . \\ S_{3}^{″} & = & T C A G G G C A T C G A G A \underset{̲}{C} {TTTTTAG}_{*} CGATTAGCS \underset{̲}{C} T A G A A T C A G A C C T . \\ S_{4}^{″} & = & G T A C C T G G C A T T G A A C \underset{̲}{G} {TTTTTAACGATTAGCA}_{*} \underset{̲}{T} G C A G A T G G A C C T T T A . \\ S_{5}^{″} & = & A A T G G A T C A G \underset{̲}{A} {TTTTTAACGATTC}_{*} GCS \underset{̲}{C} T A G A T T C A G . \end{matrix}

3.2.9. Extracting the Motif Regions

The motif regions of the second group will be extracted. The original motif is recovered via voting at each column.

\begin{matrix} G_{1}^{″} & = & TTTTTAACG G_{*} TTAGCS \\ G_{2}^{″} & = & TTTTTAACGATTA C_{*} CS \\ G_{3}^{″} & = & TTTTTA G_{*} CGATTAGCS \\ G_{4}^{″} & = & TTTTTAACGATTAGC A_{*} \\ G_{5}^{″} & = & TTTTTAACGATT C_{*} GCS \end{matrix}

3.2.10. Recovering Motif via Voting

The original motif, TTTTTAACGATTAGCS, is recovered via voting at all columns. For example, the last S in the motif is recovered via voting among the characters, S, S, S, A, S, in the last column.
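The voting step of this example can be reproduced with a short sketch that takes a plurality vote in each column of the five extracted motif regions from Section 3.2.9. The function name `vote` is hypothetical, and ties (which do not occur here) are broken by first occurrence; this illustrates column-wise voting and is not the paper's full Voting algorithm.

```python
from collections import Counter


def vote(regions):
    """Recover a consensus string by plurality vote in each column."""
    if len({len(r) for r in regions}) != 1:
        raise ValueError("all motif regions must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*regions)
    )


# The five extracted motif regions of the example, mutations included.
regions = [
    "TTTTTAACGGTTAGCS",
    "TTTTTAACGATTACCS",
    "TTTTTAGCGATTAGCS",
    "TTTTTAACGATTAGCA",
    "TTTTTAACGATTCGCS",
]
consensus = vote(regions)  # "TTTTTAACGATTAGCS"
```

Each column contains at most one mutated character among the five copies, so the majority in every column is the original motif character.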

3.3. Our Results

We give an algorithm for the case with a mutation rate of at most $\frac{1}{{(log n)}^{2 + μ}}$. The performance of the algorithm is stated in Theorem 2. Theorem 2 implies Corollary 3 by selecting $k = c_{1} log n$ with some constant $c_{1}$ that is large enough.

Theorem 2.

Assume that μ is a fixed number in $(0, 1)$ and the alphabet size, t, is at least four. There exist a randomized algorithm and a constant $c_{0}$, such that if the length of the motif, G, is at least $c_{0} log n$, then, given k independent $Θ (n, G, \frac{1}{{(log n)}^{2 + μ}})$-sequences, the algorithm outputs $G^{'}$, such that:

(1) with a probability of at most $e^{- Ω (k)}$, $| G^{'} | \neq | G |$;

(2) for each $1 \leq i \leq | G |$, with a probability of at most $e^{- Ω (k)}$, $G^{'} [i] \neq G [i]$; and

(3) with a probability of at most $\frac{k}{n^{3}}$, the algorithm does not stop in $(O (k (\frac{n}{\sqrt{h}} {(log n)}^{\frac{5}{2}} + h^{2} log n)), O (k))$ time, where n is the longest length of any input sequence, and $h = min (| G |, n^{\frac{2}{5}})$.

Corollary 3.

There exist a randomized algorithm and positive constants, $c_{0}, c_{1}$ and μ, such that if the alphabet size is at least four, the number of sequences is at least $c_{1} log n$, the motif length is at least $c_{0} log n$ and each character in the motif region has a probability of at most $\frac{1}{{(log n)}^{2 + μ}}$ of mutation, then the motif can be recovered with a probability of at least $\frac{3}{4}$ in $(O (\frac{n}{\sqrt{h}} {(log n)}^{\frac{7}{2}} + h^{2} {log}^{2} n), O (log n))$ time, where n is the longest length of any input sequence, and $h = min (| G |, n^{\frac{2}{5}})$.

We give a randomized algorithm for the case with an $Ω (1)$ mutation rate. The performance of the algorithm is stated in Theorem 4. Theorem 4 implies Corollary 5 by selecting $k = c_{1} log n$ with some constant $c_{1}$ that is large enough.

Theorem 4.

Assume that the alphabet size, t, is at least four. There exist a randomized algorithm and a constant $c_{0}$, such that if the length of the motif G is at least $c_{0} log n$, then, given k independent $Θ (n, G, α)$-sequences, the algorithm outputs $G^{'}$, such that:

(1) with a probability of at most $e^{- Ω (k)}$, $| G^{'} | \neq | G |$;

(2) for each $1 \leq i \leq | G |$, with a probability of at most $e^{- Ω (k)}$, $G^{'} [i] \neq G [i]$; and

(3) with a probability of at most $\frac{k}{n^{3}}$, the algorithm does not stop in $(O (k (\frac{n^{2}}{| G |} {(log n)}^{O (1)} + h^{2})), O (k))$ time, where n is the longest length of any input sequence, and $h = min (| G |, n^{\frac{2}{5}})$.

Corollary 5.

There exist a randomized algorithm and positive constants, $c_{0}, c_{1}$ and α, such that if the alphabet size is at least four, the number of sequences is at least $c_{1} log n$, the motif length is at least $c_{0} log n$ and each character in the motif region has a probability of at most α of mutation, then the motif can be recovered with a probability of at least $\frac{3}{4}$ in $(O (\frac{n^{2}}{| G |} {(log n)}^{O (1)}), O (log n))$ time.

We give a deterministic algorithm for the case with an $Ω (1)$ mutation rate. The performance of the algorithm is stated in Theorem 6. Theorem 6 implies Corollary 7 by selecting $k = c_{1} log n$ with some constant $c_{1}$ that is large enough.

Theorem 6.

Assume that the alphabet size, t, is at least four. There exist a deterministic algorithm and a constant, $c_{0}$, such that if the length of the motif G is at least $c_{0} log n$, then, given k independent $Θ (n, G, α)$-sequences, the algorithm runs in $(O (n^{2} {(log n)}^{O (1)} + h^{2} k), O (k))$ time and outputs $G^{'}$, such that:

(1) with a probability of at most $e^{- Ω (k)}$, $| G^{'} | \neq | G |$;

(2) for each $1 \leq i \leq | G |$, with a probability of at most $e^{- Ω (k)}$, $G^{'} [i] \neq G [i]$; and

(3) with a probability of at most $\frac{k}{n^{3}}$, the algorithm does not stop in $(O (k (n^{2} {(log n)}^{O (1)} + h^{2})), O (k))$ time, where n is the longest length of any input sequence, and $h = min (| G |, n^{\frac{2}{5}})$.

Corollary 7.

There exist a deterministic algorithm and positive constants, $c_{0}, c_{1}$ and α, such that if the alphabet size is at least four, the number of sequences is at least $c_{1} log n$, the motif length is at least $c_{0} log n$ and each character in the motif region has a probability of at most α of mutation, then the motif can be recovered with a probability of at least $\frac{3}{4}$ in $(O (n^{2} {(log n)}^{O (1)}), O (log n))$ time.

4. Algorithm Recover-Motif

In this section, we give a unified approach to describe the three algorithms. The performance of the algorithms is stated in Theorems 2, 4 and 6. The description of Algorithm Recover-Motif is given in Section 4.2. The analysis of the algorithm is given in Section 6.

4.1. Some Parameters

Definition 8.

i.: Parameter x is selected to be 10. This parameter controls the failure probability of our algorithms to be at most $\frac{1}{2^{x}}$ .
ii.: The size of the alphabet is $t \geq 4$ .
iii.: Select a constant $ρ_{0} \in (0, 1)$ to have inequality (1):

$\begin{matrix} ρ_{0} < \frac{t - 1}{2 t} \end{matrix}$

(1)
iv.: The constant $ϵ \in (0, 1)$ is selected to satisfy:

$\begin{matrix} ϵ < min ((\frac{t - 1}{t} - (2 ρ_{0} + 2 ϵ)), \frac{1}{5} (1 - \frac{2}{t - 1} - \frac{4}{2^{x}}), \frac{1}{3}) \end{matrix}$

(2)

The existence of ϵ follows from inequality (1). The constant ϵ is used to control the mutation in the motif area. It is a part of parameter β defined in item (xiv) of this definition.
v.: Let $c = e^{- \frac{ϵ^{2}}{3}}$ . The constant, c, is used to simplify probabilistic bounds, which are derived from the applications of Chernoff bounds (see Corollary 18).
vi.: Define $r (y) = (\frac{1}{t - 1} + \frac{c^{y}}{1 - c})$ .
vii.: Define $u_{1}$ to be a large constant that, for all $v \geq 0$ :

$\begin{matrix} \frac{2 (v + u_{1}) c^{v + u_{1}}}{{(1 - c)}^{2}} \leq \frac{1}{5 \cdot 2^{x}} \end{matrix}$

(3)
viii.: Select constant $ρ_{1} \in (0, 1)$ , such that:

$\begin{matrix} \frac{2}{t - 1} + \frac{4}{2^{x}} + 5 ϵ + ρ_{1} < 1 \end{matrix}$

(4)

The existence of $ρ_{1}$ follows from $ϵ < \frac{1}{5} (1 - \frac{2}{t - 1} - \frac{4}{2^{x}})$ , which is implied by inequality (2).
ix.: Select constant $ρ_{2} \in (0, 1)$ and constant positive integer v that are large enough, such that:

$\begin{matrix} \frac{6 (v + u_{1}) c^{v}}{1 - c} + ρ_{2} < ρ_{1}, and \end{matrix}$

(5)

$\begin{matrix} (\frac{1}{2^{x}} + (v + u_{1}) \frac{c^{v}}{1 - c} + \frac{c^{v}}{1 - c} + \frac{1}{5 \cdot 2^{x}}) \leq 1 / 2 \end{matrix}$

(6)
x.: Define $ς_{0} = \frac{1}{2^{x}}$ , and: $φ (v) = (v + u_{1}) \frac{c^{v}}{1 - c} + \frac{c^{v}}{1 - c}$
xi.: Select constant $α_{0}$ , such that:

$\begin{matrix} 4 (v - 1) α_{0} + α_{0} & < & ρ_{2}, and \end{matrix}$

(7)

$\begin{matrix} α_{0} & < & ρ_{0} \end{matrix}$

(8)

Adding inequalities (4), (5) and (7), we have inequality (9):

$\begin{matrix} (\frac{2}{t - 1} + \frac{4}{2^{x}} + 5 ϵ) + \frac{6 (v + u_{1}) c^{v}}{1 - c} + (4 (v - 1) α_{0} + α_{0}) < 1 \end{matrix}$

(9)

By arranging the terms in inequality (9) and the definitions of $r (v)$ and $φ (v)$ , we have inequality (10):

$\begin{matrix} 2 ((2 (v - 1) α_{0} + \frac{c^{v}}{1 - c}) + r (v) + 2 (ς_{0} + φ (v)) + 2 ϵ) + (α_{0} + ϵ) < 1 \end{matrix}$

(10)
xii.: The maximal mutation rate, α, for the second algorithm (Theorem 4) and the third algorithm (Theorem 6) is selected as $α_{0}$ . Since the mutation rate of our sublinear time algorithm is bounded by $\frac{1}{{(log n)}^{2 + μ}}$ , the maximal mutation rate α for the first algorithm (Theorem 2) is less than $α_{0}$ when n is large enough. We always assume that all mutation rates α in our three algorithms are in the range $(0, α_{0}]$ .
xiii.: Define $q (y) = 2 (v - 1) α + \frac{2 c^{y}}{1 - c}$ . By inequality (10), the definition of $q (y)$ and the fact that $α \in (0, α_{0}]$ , we have:

$\begin{matrix} 2 (q (v) + r (v) + 2 (ς_{0} + φ (v)) + 2 ϵ) + (α_{0} + ϵ) & < & 1 \end{matrix}$

(11)

Inequality (11) implies $q (v) \leq \frac{1}{2}$ . By inequality (6), we have that:

$\begin{matrix} (\frac{1}{2^{x}} + (v + u_{1}) \frac{c^{v}}{1 - c} + \frac{c^{v}}{1 - c} + \frac{1}{5 \cdot 2^{x}}) + q (v) \leq 3 / 4 \end{matrix}$

(12)
xiv.: Let $β = 2 α + 2 ϵ$ . The parameter, β, controls the similarity of $ℵ (S)$ and the original motif, G (see Lemma 27).
xv.: Define $R = r (v)$ .
xvi.: We define the following $Q_{0}$ .

$\begin{matrix} Q_{0} = q (v) \end{matrix}$

(13)

The parameter, $Q_{0}$ , used in Lemma 27 gives an upper bound of the probability that a $Θ (n, G, α)$ -sequence, S, whose $ℵ (S)$ will not be similar enough to the original motif, G, according to the conditions in Lemma 27.
xvii.: Select constant $d_{0}$ , such that:

$\begin{matrix} n^{3} c^{d_{0} log n} < \frac{1}{5 \cdot 2^{x}} for all large n \end{matrix}$

(14)
xviii.: Select constant $d_{1}$ , such that $(v + u_{1}) c^{d_{1} log n} < \frac{1}{5 \cdot 2^{x}}$ .
xix.: Select number $u_{2}$ , such that:

$\begin{matrix} (d_{1} log n) (v + u_{1}) \frac{c^{v + u_{2}}}{1 - c} & \leq & \frac{1}{5 \cdot 2^{x}}, and \end{matrix}$

(15)

$\begin{matrix} (v + u_{1}) \frac{c^{v + u_{2}}}{1 - c} & < & \frac{1}{5 \cdot 2^{x}} \end{matrix}$

(16)

Since only n is variable, we can make $u_{2} = O (log log n)$ .
xx.: For a fixed $c_{0} \in (0, 1)$ , define $δ_{c_{0}} = \frac{ln \frac{1}{c_{0}}}{2}$ .
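To get a feel for the magnitudes in Definition 8, the sketch below plugs in one concrete choice of the first few parameters and checks inequalities (1) and (2). The values $ρ_{0} = 0.1$ and $ϵ = 0.05$ and the function name are illustrative assumptions made here, not values taken from the paper; only $t \geq 4$, $x = 10$ and $c_{0} = \frac{1}{4}$ (used in Definition 10) come from the text.

```python
import math


def basic_parameters(t=4, x=10, rho0=0.1, eps=0.05, c0=0.25):
    """Check inequalities (1) and (2) and compute c, r(y) and delta_{c0}."""
    ineq1 = rho0 < (t - 1) / (2 * t)                      # inequality (1)
    bound2 = min(
        (t - 1) / t - (2 * rho0 + 2 * eps),
        (1 - 2 / (t - 1) - 4 / 2 ** x) / 5,
        1 / 3,
    )
    ineq2 = eps < bound2                                  # inequality (2)
    c = math.exp(-eps ** 2 / 3)                           # item (v)

    def r(y):                                             # item (vi)
        return 1 / (t - 1) + c ** y / (1 - c)

    delta = math.log(1 / c0) / 2                          # item (xx)
    return {"ineq1": ineq1, "ineq2": ineq2, "c": c, "r": r, "delta": delta}


params = basic_parameters()
```

With these illustrative values, c is very close to one, so $\frac{c^{y}}{1 - c}$ only becomes small for large y; this is why the constants v and $u_{1}$ must be chosen large.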

4.2. Description of Algorithm Recover-Motif

The algorithms are described in this section. The description combines three algorithms together. The simplest deterministic algorithm is also given in Section 5. Before presenting the algorithm, we define some notions.

Definition 9.

Two sequences, $X_{1}$ and $X_{2}$ , are weakly left matched if: (1) both $| X_{1} |$ and $| X_{2} |$ are at least $d_{0} log n$ ; and (2) $diff (X_{1} [1, i], X_{2} [1, i]) \leq β$ for all integers i, $v \leq i \leq d_{0} log n$ , where v is defined in item (ix) in Definition 8.
Two sequences, $X_{1}$ and $X_{2}$ , are left matched if: (1) $d_{0} log n \leq | X_{1} |, | X_{2} |$ ; (2) $X_{1} [i] = X_{2} [i]$ for $i = 1, \dots, v - 1$ ; and (3) $diff (X_{1} [1, i], X_{2} [1, i]) \leq β$ for all integers i, $v \leq i \leq d_{0} log n$ .
Two sequences, $X_{1}$ and $X_{2}$ , are weakly right matched if $X_{1}^{R}$ and $X_{2}^{R}$ are weakly left matched, where $X^{R} = a_{n} \dots a_{1}$ is the inverse sequence of $X = a_{1} \dots a_{n}$ .
Two sequences, $X_{1}$ and $X_{2}$ , are right matched if $X_{1}^{R}$ and $X_{2}^{R}$ are left matched, where $X^{R} = a_{n} \dots a_{1}$ is the inverse sequence of $X = a_{1} \dots a_{n}$ .
Two sequences, $X_{1}$ and $X_{2}$ , are matched if $X_{1}$ and $X_{2}$ are both left and right matched.
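
The matching notions of Definition 9 can be illustrated with a minimal Python sketch. It assumes that diff is the fraction of positions at which two equal-length strings differ (this reading is consistent with the proof of Lemma 19, but diff is defined elsewhere in the paper), and it uses a hypothetical integer parameter d in place of $d_{0} log n$. The function names are illustrative only.

```python
def diff(x: str, y: str) -> float:
    """Fraction of differing positions (assumed meaning of diff)."""
    if len(x) != len(y):
        raise ValueError("diff is used here only for equal-length strings")
    if not x:
        return 0.0
    return sum(a != b for a, b in zip(x, y)) / len(x)


def weakly_left_matched(x1: str, x2: str, v: int, d: int, beta: float) -> bool:
    """Every prefix of length i, v <= i <= d, differs in at most a beta fraction."""
    if len(x1) < d or len(x2) < d:
        return False
    return all(diff(x1[:i], x2[:i]) <= beta for i in range(v, d + 1))


def left_matched(x1: str, x2: str, v: int, d: int, beta: float) -> bool:
    """Weakly left matched, and the first v - 1 characters agree exactly."""
    if len(x1) < d or len(x2) < d:
        return False
    return x1[: v - 1] == x2[: v - 1] and weakly_left_matched(x1, x2, v, d, beta)


def right_matched(x1: str, x2: str, v: int, d: int, beta: float) -> bool:
    """Left matched after reversing both sequences."""
    return left_matched(x1[::-1], x2[::-1], v, d, beta)


def matched(x1: str, x2: str, v: int, d: int, beta: float) -> bool:
    """Both left and right matched."""
    return left_matched(x1, x2, v, d, beta) and right_matched(x1, x2, v, d, beta)
```

For instance, two length-10 strings that differ only in position 8 are left matched for v = 3 and β = 0.2, but not right matched, because the mismatch falls in the third position of the reversed strings and the prefix of length 3 then has diff 1/3.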

Variable L will be controlled in the range

L \in [{(log n)}^{3 + ϵ_{1}}, n^{\frac{2}{5} - ϵ_{2}}]

in our algorithm with a high probability. We define the following functions that depend on L.

Definition 10.

Define

M (L) = \frac{\sqrt{3 log n + x}}{\sqrt{1 - γ}} \sqrt{L} log n

. Define

M_{1} (L) = \frac{δ_{c_{0}} M (L)}{log n}

(see (xx) of Definition 8 for

δ_{c_{0}}

), where

c_{0} = \frac{1}{4}

.

We would like to minimize the function

(\frac{n}{L} M + L^{2}) log n

. This selection can make the total time complexity sublinear.
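
As a rough consistency check (a sketch that is not part of the original analysis), fix n and write $A = \frac{\sqrt{3 log n + x}}{\sqrt{1 - γ}} log n$ , so that $M (L) = A \sqrt{L}$ . The symbols A and h are introduced here only for illustration. Then:

$\begin{matrix} h (L) & = & (\frac{n}{L} M (L) + L^{2}) log n = (A n L^{- 1 / 2} + L^{2}) log n \\ h^{'} (L) & = & (- \frac{1}{2} A n L^{- 3 / 2} + 2 L) log n = 0 \Longleftrightarrow L = {(\frac{A n}{4})}^{2 / 5} \end{matrix}$

At this value of L, $A n L^{- 1 / 2} = 4 {(\frac{A n}{4})}^{4 / 5}$ and $L^{2} = {(\frac{A n}{4})}^{4 / 5}$ , so $h (L) = 5 {(\frac{A n}{4})}^{4 / 5} log n$ . Since A is polylogarithmic in n, this is $n^{4 / 5}$ up to polylogarithmic factors, which is sublinear in n and agrees with the initial choice $L = n^{2 / 5}$ in Initial-Boundaries.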

Definition 11.

For a

Θ (n, G, α)

sequence, S, define LB(S) to be the left boundary, l, of the motif region

ℵ (S)

in S and RB(S) to be the right boundary, r, of the motif region

ℵ (S)

in S, such that

ℵ (S) = S [l, r]

.

4.2.1. Boundary-Phase of Algorithm Recover-Motif

The first phase of Algorithm Recover-Motif finds the rough motif boundaries of all input sequences. It first detects the rough motif boundaries of one sequence via comparing two input sequences. Then, the rough boundaries of the first sequence are used to find the rough motif boundaries of other input sequences.

The three algorithms share most of their functions, so we describe them in a unified way. A special variable, “algorithm-type”, selects one of the three algorithms.

Definition 12.

Let algorithm-type represent one of the three algorithm types, “RANDOMIZED-SUBLINEAR”, “RANDOMIZED-SUBQUADRATIC” and “DETERMINISTIC-SUPERQUADRATIC”.

Definition 13.

Assume that

A_{1}

is a set of positions in a

Θ (n, G, α)

sequence,

S_{1}

, and

A_{2}

is a set of positions in a

Θ (n, G, α)

sequence,

S_{2}

. If there are positions

a_{1} \in A_{1}

and

a_{2} \in A_{2}

, such that for some position, j, with

1 \leq j \leq | G |

,

a_{1}

is the position of

ℵ (S_{1}) [j]

in

S_{1}

and

a_{2}

is the position of

ℵ (S_{2}) [j]

in

S_{2}

, then

A_{1}

and

A_{2}

have a collision at

(a_{1}, a_{2})

.

In the following function, Collision-Detection, the parameter,

ω \leq β

, is defined as follows for the three algorithms.

ω_{algorithm - type} = \begin{cases} 0 & \text{if algorithm-type} = \text{RANDOMIZED-SUBLINEAR}; \\ β & \text{if algorithm-type} = \text{RANDOMIZED-SUBQUADRATIC}; \\ β & \text{if algorithm-type} = \text{DETERMINISTIC-SUPERQUADRATIC}. \end{cases}

(17)

Function Collision-Detection $(S_{1}, U_{1}, S_{2}, U_{2})$ is used to detect a point,

a_{1} \in U_{1}

, in the motif area in

S_{1}

and another point,

a_{1}^{'} \in U_{1}

, in the motif area of

S_{1}

. The two points,

a_{1}

and

a_{1}^{'}

, are close to the left and right motif boundaries of

S_{1}

, respectively. A similar pair of points,

e_{1}

and

e_{1}^{'}

, in

U_{2}

is also derived for

S_{2}

. See the examples in Section 3.2.3.

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

Input: a pair of

Θ (n, G, α)

-sequences,

S_{1}

and

S_{2}

;

U_{i}

is a set of locations in

S_{i}

for

i = 1, 2

.

Output: the left and right rough boundaries of two sequences.

Let

D_{1}

be all subsequences

S_{1} [a, a + d_{0} log n - 1]

of

S_{1}

of a length of

d_{0} log n

with

a \in U_{1}

.

Let

D_{2}

be all subsequences

S_{2} [b, b + d_{0} log n - 1]

of

S_{2}

of a length of

d_{0} log n

with

b \in U_{2}

.

Find two subsequences,

X_{1} = S_{1} [a_{1}, a_{1} + d_{0} log n - 1] \in D_{1}

and

X_{2} = S_{2} [b_{1}, b_{1} + d_{0} log n - 1] \in D_{2}

, such that

a_{1}

is the least and

diff (X_{1}, X_{2}) \leq ω_{algorithm - type}

.

Find two subsequences,

X_{1}^{'} = S_{1} [a_{1}^{'}, a_{1}^{'} + d_{0} log n - 1] \in D_{1}

and

X_{2}^{'} = S_{2} [b_{1}^{'}, b_{1}^{'} + d_{0} log n - 1] \in D_{2}

, such that

a_{1}^{'}

is the largest and

diff (X_{1}^{'}, X_{2}^{'}) \leq ω_{algorithm - type}

.

Find two subsequences,

Y_{1} = S_{1} [f_{1}, f_{1} + d_{0} log n - 1] \in D_{1}

and

Y_{2} = S_{2} [e_{1}, e_{1} + d_{0} log n - 1] \in D_{2}

, such that

e_{1}

is the least and

diff (Y_{1}, Y_{2}) \leq ω_{algorithm - type}

.

Find two subsequences,

Y_{1}^{'} = S_{1} [f_{1}^{'}, f_{1}^{'} + d_{0} log n - 1] \in D_{1}

and

Y_{2}^{'} = S_{2} [e_{1}^{'}, e_{1}^{'} + d_{0} log n - 1] \in D_{2}

, such that

e_{1}^{'}

is the largest and

diff (Y_{1}^{'}, Y_{2}^{'}) \leq ω_{algorithm - type}

.

Return

(a_{1}, a_{1}^{'}, e_{1}, e_{1}^{'})

.

End of Collision-Detection.
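
The steps of Collision-Detection can be sketched in Python as a brute-force comparison of all candidate windows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: diff is taken to be the fraction of differing positions, positions are 0-based, the window length d stands in for $d_{0} log n$ , and the function returns None where the pseudocode leaves the boundaries empty.

```python
from typing import Iterable, Optional, Tuple


def diff(x: str, y: str) -> float:
    """Fraction of differing positions (assumed meaning of diff)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def collision_detection(
    s1: str, u1: Iterable[int], s2: str, u2: Iterable[int], d: int, omega: float
) -> Optional[Tuple[int, int, int, int]]:
    """Return (a1, a1', e1, e1'): least and largest colliding start in s1,
    then least and largest colliding start in s2; None if nothing collides."""
    starts1 = [a for a in sorted(set(u1)) if 0 <= a and a + d <= len(s1)]
    starts2 = [b for b in sorted(set(u2)) if 0 <= b and b + d <= len(s2)]
    hits = [
        (a, b)
        for a in starts1
        for b in starts2
        if diff(s1[a:a + d], s2[b:b + d]) <= omega
    ]
    if not hits:
        return None
    in_s1 = [a for a, _ in hits]
    in_s2 = [b for _, b in hits]
    return min(in_s1), max(in_s1), min(in_s2), max(in_s2)
```

With s1 = "GGGGACGTACGTGGGG", s2 = "TTACGTACGTTTTTTT", d = 4 and ω = 0, the sketch reports (4, 8, 1, 6). The value 1 for s2 lies one position left of the implanted copy because a background character happens to extend a match, which illustrates why the returned boundaries are only rough.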

Definition 14.

Let

[a, b]

be an interval with two integers boundaries, a and b, and l be a positive integer parameter. Define an l-partition of

[a, b]

to be l-

P ([a, b])

, which contains the intervals

[a_{1}, b_{1}], [a_{2}, b_{2}], \dots, [a_{r}, b_{r}]

, such that

a_{1} = a, b_{r} = b, a_{i + 1} = b_{i} + 1, b_{i} = a_{i} + l - 1

for

i = 1, 2, \dots, r - 1

and

a_{r} \leq b_{r} \leq a_{r} + l - 1

.

For example, the three-partition of the interval

[1, 10]

is 3-

P ([1, 10]) = \{[1, 3], [4, 6], [7, 9], [10, 10]\}

. Function Point-Selection $(S, L, I)$ will be defined differently in three different algorithms, where I is an interval of positions in sequence S and L is a positive integer parameter. For randomized algorithms, some random points are selected in L-

P (I)

. For a deterministic algorithm, all points in I are selected. See the examples in Section 3.2.2.

Point-Selection $(S, L, I)$

Input: a

Θ (n, G, α)

-sequence, S, a partition size parameter, L, and a set of intervals, I, of positions in S.

Output: a set, U, of positions from S.

Steps:

Let

U = \emptyset

.

If algorithm-type=RANDOMIZED-SUBLINEAR or RANDOMIZED-SUBQUADRATIC and

if

(L \geq \frac{{(log n)}^{3 + τ}}{100})

:

for each interval,

I^{'}

, in I, obtain its L-partition in L-

P (I^{'})

and for each interval, J, in L-

P (I^{'})

,

sample

M (L)

(see Definition 10) random positions in J and put them into U.

Else,

put every position of I into

U

.

If algorithm-type=DETERMINISTIC-SUPERQUADRATIC,

put every position of I into U.

Return U.

End of Point-Selection.
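
The l-partition of Definition 14 and the sampling step of Point-Selection can be sketched as follows. This is a simplified illustration: the flag randomized is a hypothetical stand-in for both the algorithm-type test and the threshold test on L, and samples_per_block plays the role of $M (L)$ . Sampling is done with replacement, which the pseudocode does not specify.

```python
import random
from typing import Iterable, List, Optional, Set, Tuple

Interval = Tuple[int, int]  # inclusive integer interval [a, b]


def l_partition(a: int, b: int, l: int) -> List[Interval]:
    """Split [a, b] into consecutive blocks of length l; the last may be shorter."""
    if l <= 0 or a > b:
        raise ValueError("need l > 0 and a <= b")
    return [(start, min(start + l - 1, b)) for start in range(a, b + 1, l)]


def point_selection(
    intervals: Iterable[Interval],
    l: int,
    samples_per_block: int,
    randomized: bool,
    rng: Optional[random.Random] = None,
) -> Set[int]:
    """Randomized case: sample positions in every block of every l-partition.
    Otherwise: select every position of every interval."""
    if rng is None:
        rng = random.Random()
    selected: Set[int] = set()
    for a, b in intervals:
        if not randomized:
            selected.update(range(a, b + 1))
            continue
        for lo, hi in l_partition(a, b, l):
            for _ in range(samples_per_block):
                selected.add(rng.randint(lo, hi))  # both endpoints included
    return selected
```

Calling l_partition(1, 10, 3) reproduces the three-partition example given above.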

The function, Improve-Boundaries

(S_{1}, a_{l}, a_{r}, S_{2}, f_{l}, f_{r}, L)

, is used to improve the existing rough left and right boundaries,

a_{l}

and

a_{r}

, of

S_{1}

, respectively, and to improve the existing rough left and right boundaries,

f_{l}

and

f_{r}

, of

S_{2}

, respectively. We assume

a_{l} \in [LB (S_{1}) - L, LB (S_{1}) + L]

,

a_{r} \in [RB (S_{1}) - L, RB (S_{1}) + L]

,

f_{l} \in [LB (S_{2}) - L, LB (S_{2}) + L]

and

f_{r} \in [RB (S_{2}) - L, RB (S_{2}) + L]

. After calling this function, more accurate approximate boundaries will be derived. From the probabilistic analysis, we have a good chance to get the exact motif boundaries for both

S_{1}

and

S_{2}

. See the examples in Section 3.2.7.

Improve-Boundaries

(S_{1}, a_{l}, a_{r}, S_{2}, f_{l}, f_{r}, L)

Input: a

Θ (n, G, α)

-sequence,

S_{1}

, with rough left and right boundaries,

a_{l}

and

a_{r}

, a

Θ (n, G, α)

-sequence,

S_{2}

with rough left and right boundaries,

f_{l}

and

f_{r}

, and an approximate distance, L, to the nearest motif boundary from those rough boundaries (the parameter, L, usually has the properties of

LB (S_{1}) \in [a_{l} - L, a_{l}]

,

RB (S_{1}) \in [a_{r}, a_{r} + L]

,

LB (S_{2}) \in [f_{l} - L, f_{l}]

and

RB (S_{2}) \in [f_{r}, f_{r} + L]

).

Output: improved rough left and right boundaries for both

S_{1}

and

S_{2}

.

Steps:

Find two subsequences,

X_{1} = S_{1} [a_{1}, a_{1} + d_{0} log n - 1]

and

X_{2} = S_{2} [b_{2}, b_{2} + d_{0} log n - 1]

,

with

a_{1} \in [a_{l} - L, a_{l} + L]

and

b_{2} \in [f_{l} - L, f_{l} + L]

, such that

diff (X_{1}, X_{2}) \leq β

and

a_{1}

is

the least.

Find two subsequences,

X_{1}^{'} = S_{1} [a_{1}^{'}, a_{1}^{'} + d_{0} log n - 1]

and

X_{2}^{'} = S_{2} [b_{2}^{'}, b_{2}^{'} + d_{0} log n - 1]

,

with

a_{1}^{'} \in [a_{r} - L, a_{r} + L]

and

b_{2}^{'} \in [f_{r} - L, f_{r} + L]

, such that

diff (X_{1}^{'}, X_{2}^{'}) \leq β

and

a_{1}^{'}

is

the largest.

Find two subsequences,

Y_{1} = S_{1} [e_{1}, e_{1} + d_{0} log n - 1]

and

Y_{2} = S_{2} [f_{2}, f_{2} + d_{0} log n - 1]

,

with

e_{1} \in [a_{l} - L, a_{l} + L]

and

f_{2} \in [f_{l} - L, f_{l} + L]

, such that

diff (Y_{1}, Y_{2}) \leq β

and

f_{2}

is

the least.

Find two subsequences,

Y_{1}^{'} = S_{1} [e_{1}^{'}, e_{1}^{'} + d_{0} log n - 1]

and

Y_{2}^{'} = S_{2} [f_{2}^{'}, f_{2}^{'} + d_{0} log n - 1]

,

with

e_{1}^{'} \in [a_{r} - L, a_{r} + L]

and

f_{2}^{'} \in [f_{r} - L, f_{r} + L]

, such that

diff (Y_{1}^{'}, Y_{2}^{'}) \leq β

and

f_{2}^{'}

is

the largest.

Return

(a_{1}, a_{1}^{'}, f_{2}, f_{2}^{'})

.

End of Improve-Boundaries.

The function, Initial-Boundaries

(S_{1}, S_{2})

, detects the motif boundaries for two sequences,

S_{1}

and

S_{2}

. It first detects rough motif boundaries that are controlled by parameter L. The rough boundaries will be improved to exact motif boundaries via calling Improve-Boundaries

(.)

. See the examples in Section 3.2.2, Section 3.2.3, Section 3.2.4, Section 3.2.5, Section 3.2.6, Section 3.2.7 and Section 3.2.8.

Initial-Boundaries

(S_{1}, S_{2})

Input: a pair of

Θ (n, G, α)

-sequences,

S_{1}

and

S_{2}

.

Output: rough left boundary

{roughLeft}_{S_{1}}

of

S_{1}

, right boundary

{roughRight}_{S_{1}}

of

S_{1}

, rough left boundary

{roughLeft}_{S_{2}}

of

S_{2}

and right boundary

{roughRight}_{S_{2}}

of

S_{2}

.

Steps:

Let

U_{1} = U_{2} = \emptyset

.

Let

L = n^{2 / 5}

.

Repeat.

Let

U_{1} =

Point-Selection

(S_{1}, L, [1, | S_{1} |])

.

Let

U_{2} =

Point-Selection

(S_{2}, L, [1, | S_{2} |])

.

Let

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

.

If (

L_{S_{1}} \neq \emptyset

and

R_{S_{1}} \neq \emptyset

),

then go to H.

Else,

L = L / 2

,

until

(L < \frac{1}{2} \frac{{(log n)}^{3 + τ}}{100})

.

H: Return Improve-Boundaries

(S_{1}, L_{S_{1}}, R_{S_{1}}, S_{2}, L_{S_{2}}, R_{S_{2}}, 2 L)

.

End of Initial-Boundaries.

If

L_{S}

and

R_{S}

are the left and right motif boundaries of a sequence, S, then the motif length is

R_{S} - L_{S} + 1

. When we have the exact motif boundaries,

L_{S_{i}}^{'}

and

R_{S_{i}}^{'}

, for most sequences,

S_{i}

, with high probability, their motif length can be derived via the median in

\cup_{i} {R_{S_{i}}^{'} - L_{S_{i}}^{'} + 1}

. Therefore, we have the function, Motif-Length-And-Boundaries( $Z_{1}$ ), to compute the length of the motif region.

Motif-Length-And-Boundaries(

Z_{1}

)

Input:

Z_{1} = {S_{1}^{'}, \dots, S_{2 k_{1}}^{'}}

is a set of independent

Θ (n, G, α)

sequences.

Steps:

For

i = 1

to

k_{1}

,

let

({roughLeft}_{S_{2 i - 1}^{'}}, {roughRight}_{S_{2 i - 1}^{'}}, {roughLeft}_{S_{2 i}^{'}}, {roughRight}_{S_{2 i}^{'}})

=Initial-Boundaries

(S_{2 i - 1}^{'}, S_{2 i}^{'})

.

Let

L_{1}

be the median of

\cup_{i = 1}^{k_{1}} {({roughRight}_{S_{2 i - 1}^{'}} - {roughLeft}_{S_{2 i - 1}^{'}} + 1)}

.

Return

L_{1}

.

End of Motif-Length-And-Boundaries.
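
The median step can be sketched in a few lines of Python. The sketch assumes that the boundaries have already been computed and are passed in as (left, right) pairs; the choice of the lower median for an even number of values is an assumption, since the text does not specify a tie-breaking rule.

```python
from statistics import median_low
from typing import Iterable, Tuple


def motif_length(boundaries: Iterable[Tuple[int, int]]) -> int:
    """Median of right - left + 1 over the given (left, right) boundary pairs."""
    lengths = [right - left + 1 for left, right in boundaries]
    if not lengths:
        raise ValueError("at least one boundary pair is required")
    return median_low(lengths)  # always one of the observed lengths
```

Taking the median rather than the mean means that a minority of sequences with badly wrong boundaries, such as the pair (7, 40) among pairs of length about 20, does not shift the estimate.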

4.2.2. Extract-Phase of Algorithm Recover-Motif

After a set of motif candidates, W, is produced from Boundary-Phase of algorithm Recover-Motif, we use this set to match with another set of input sequences to recover the hidden motif by voting.

Match

(G_{l}, G_{r}, S_{i}^{″})

Input: a motif left part,

G_{l}

(which can be derived from the rough left boundary of an input sequence, S), a motif right part,

G_{r}

, and a sequence,

S_{i}^{″}

, from the group,

Z_{2}

, with known rough left and right boundaries.

Output: either a rough motif region of

S_{i}^{″}

or an empty sequence, which means the failure in extracting the motif region,

ℵ (S_{i}^{″})

, of

S_{i}^{″}

.

Steps:

Find a position, a, in

S_{i}^{″}

with

{roughLeft}_{S_{i}^{″}} \leq a \leq {roughLeft}_{S_{i}^{″}} + (v + u_{2})

,

such that

G_{l}

and

S_{i}^{″} [a, a + | G_{l} | - 1]

are left matched (see Definition 9).

Find a position, b, in

S_{i}^{″}

with

{roughRight}_{S_{i}^{″}} - (v + u_{2}) \leq b \leq {roughRight}_{S_{i}^{″}}

,

such that

G_{r}

and

S_{i}^{″} [b - | G_{r} | + 1, b]

are right matched (see Definition 9).

If both a and b are found,

then output

S_{i}^{″} [a, b]

.

Else, output ∅ (empty string).

End of Match.

If the left,

G_{l}

, and right,

G_{r}

, motif parts are known, we extract all the motif regions for all sequences in the set,

Z_{2}

, by the function, Extract $(G_{l}, G_{r}, Z_{2}$ ).

Extract

(G_{l}, G_{r}, Z_{2}

)

Input:

Z_{2} = {S_{1}^{″}, S_{2}^{″}, \dots, S_{k_{2}}^{″}}

and left and right motif parts,

G_{l}

and

G_{r}

(see function Match

(G_{l}, G_{r}, S_{i})

).

Steps:

For each

S_{i}^{″}

with

i = 1, 2, \dots, k_{2}

,

let

G_{i}^{″} = Match (G_{l}, G_{r}, S_{i}^{″})

.

Return

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

.

End of Extract.

The following is Extract-Phase of algorithm Recover-Motif. It extracts the motif regions of another set,

Z_{2}

, of input sequences. The function is based on the condition that exact motif boundaries can be derived for most sequences. See the examples in Section 3.2.9.

Extract-Phase(

S^{'}, Z_{2}

):

Input:

S^{'}

is an input sequence with known

{roughLeft}_{S^{'}}

and

{roughRight}_{S^{'}}

for its rough left and right boundaries, respectively, and

Z_{2} = {S_{1}^{″}, \dots, S_{k_{2}}^{″}}

is a set of input sequences.

Steps:

For each subsequence

G_{l} = S^{'} [a, a + d_{0} log n - 1]

with

a \in [{roughLeft}_{S^{'}}, {roughLeft}_{S^{'}} + (v + u_{1})]

and

G_{r} = S^{'} [b - d_{0} log n + 1, b]

, with

b \in [{roughRight}_{S^{'}} - (v + u_{1}), {roughRight}_{S^{'}}]

,

let

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

be the output from Extract

(G_{l}, G_{r}, Z_{2}

).

If the number of empty sequences in

G_{1}^{″}, \dots, G_{k_{2}}^{″}

is at most

(Q_{0} + (R + 2 ϵ)) k_{2}

,

then return

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

.

Return ∅ (empty set).

End of Extract-Phase.

4.2.3. Voting-Phase

The function, Voting-Phase

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

, is to generate another sequence,

G^{'}

, by voting, where

G^{'} [i]

is the most frequent character among

G_{1}^{″} [i], G_{2}^{″} [i], \dots, G_{k_{2}}^{″} [i]

. See the examples in Section 3.2.10.

Voting-Phase

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

Input:

Θ (n, G, α)

sequences,

G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″}

, of the same length, m.

Output: a sequence,

G^{'}

, which is derived by voting on every position of the input sequences.

Steps:

For each

j = 1, \dots, m

,

let

a_{j}

be the most frequent character among

G_{1}^{″} [j], \dots, G_{k_{2}}^{″} [j]

.

Return

G^{'} = a_{1} \dots a_{m}

.

End of Voting-Phase.
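
The column-wise vote can be sketched in Python as follows. The function name vote is illustrative, and the rule for breaking ties between equally frequent characters (the character seen first in the column wins) is an assumption, since the text does not state one.

```python
from collections import Counter
from typing import Sequence


def vote(regions: Sequence[str]) -> str:
    """Return the string whose j-th character is the most frequent
    character in column j of the equal-length input strings."""
    if not regions:
        raise ValueError("at least one sequence is required")
    m = len(regions[0])
    if any(len(r) != m for r in regions):
        raise ValueError("all sequences must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*regions)
    )
```

For the three inputs "ACGT", "ACGA" and "TCGT", each column has a strict majority, and the vote returns "ACGT".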

4.2.4. Entire Algorithm Recover-Motif

The entire algorithm is described below. The input has two sets of sequences,

Z_{1}

and

Z_{2}

. It detects the motif boundaries for the sequences in

Z_{1}

via pairwise comparisons and, also, the motif length. The motif regions of the sequences in

Z_{2}

are detected in the next phase and will be extracted. The original motif is recovered via voting for each column of characters among the extracted motif regions.

We maintain the sizes of

Z_{1}

and

Z_{2}

to be roughly equal, which implies:

\begin{matrix} | Z_{1} | = Θ (| Z_{2} |) \end{matrix}

(18)

Algorithm Recover-Motif (Z)

Input:

Z = Z_{1} \cup Z_{2}

, where

Z_{1} = {S_{1}^{'}, \dots, S_{2 k_{1}}^{'}}

and

Z_{2} = {S_{1}^{″}, \dots, S_{k_{2}}^{″}}

are two sets of input sequences.

Steps:

Preprocessing part:

For each

S \in Z_{1} \cup Z_{2}

, let

{roughLeft}_{S} = {roughRight}_{S} = 0

(the two boundaries are unknown).

l_{motif} =

Motif-Length-And-Boundaries(

Z_{1}

).

Let

L = l_{motif} / 4

.

For

i = 1

to

k_{1}

,

let

U_{S_{2 i - 1}^{'}} =

Point-Selection

(S_{2 i - 1}^{'}, L, [{roughLeft}_{S_{2 i - 1}^{'}} - 2 L, {roughLeft}_{S_{2 i - 1}^{'}} + 2 L]) \cup

Point-Selection

(S_{2 i - 1}^{'}, L, [{roughRight}_{S_{2 i - 1}^{'}} - 2 L, {roughRight}_{S_{2 i - 1}^{'}} + 2 L])

.

For

j = 1

to

k_{2}

,

let

U_{S_{j}^{″}} =

Point-Selection

(S_{j}^{″}, L, [1, | S_{j}^{″} |])

.

For

i = 1

to

k_{1}

,

for each

S_{j}^{″} \in Z_{2}

,

Let

(L_{S_{2 i - 1}^{'}}, R_{S_{2 i - 1}^{'}}, L_{S_{j}^{″}}, R_{S_{j}^{″}}) =

Collision-Detection

(S_{2 i - 1}^{'}, U_{S_{2 i - 1}^{'}}, S_{j}^{″}, U_{S_{j}^{″}})

.

Let

(L_{S_{2 i - 1}^{'}}, R_{S_{2 i - 1}^{'}}, {roughLeft}_{S_{j}^{″}}, {roughRight}_{S_{j}^{″}})

=

Improve-Boundaries

(S_{2 i - 1}^{'}, L_{S_{2 i - 1}^{'}}, R_{S_{2 i - 1}^{'}}, S_{j}^{″}, L_{S_{j}^{″}}, R_{S_{j}^{″}}, 2 L)

.

Let

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

be the output from Extract-Phase(

S_{2 i - 1}^{'}, Z_{2}

).

If

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

is not empty,

then go to the Voting part.

Voting part:

Return Voting-Phase(

G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″}

).

End of Algorithm Recover-Motif.

5. Deterministic Algorithm

In this section, we give a deterministic algorithm, which is a simplified version of the unified algorithm described before in Section 4.2. It is simpler than the randomized versions. The first phase of Algorithm Recover-Motif (.) finds the rough motif boundaries of all input sequences. It first detects the rough motif boundaries of one sequence via comparing two input sequences. Then, the rough boundaries of the first sequence are used to find the rough motif boundaries of other input sequences. We still let:

\begin{matrix} ω_{DETERMINISTIC - SUPERQUADRATIC} = β . \end{matrix}

(19)

Collision-Detection $(S_{1}, S_{2})$

Input: a pair of

Θ (n, G, α)

-sequences,

S_{1}

and

S_{2}

.

Output: the left and right rough boundaries of two sequences.

Let

D_{1}

be all subsequences,

S_{1} [a, a + d_{0} log n - 1]

, of

S_{1}

of a length of

d_{0} log n

with

a \in [1, | S_{1} |]

.

Let

D_{2}

be all subsequences,

S_{2} [b, b + d_{0} log n - 1]

, of

S_{2}

of a length of

d_{0} log n

with

b \in [1, | S_{2} |]

.

Find two subsequences,

X_{1} = S_{1} [a_{1}, a_{1} + d_{0} log n - 1] \in D_{1}

and

X_{2} = S_{2} [b_{1}, b_{1} + d_{0} log n - 1] \in D_{2}

, such that

a_{1}

is the least and

diff (X_{1}, X_{2}) \leq ω_{DETERMINISTIC - SUPERQUADRATIC}

.

Find two subsequences,

X_{1}^{'} = S_{1} [a_{1}^{'}, a_{1}^{'} + d_{0} log n - 1] \in D_{1}

and

X_{2}^{'} = S_{2} [b_{1}^{'}, b_{1}^{'} + d_{0} log n - 1] \in D_{2}

, such that

a_{1}^{'}

is the largest and

diff (X_{1}^{'}, X_{2}^{'}) \leq ω_{DETERMINISTIC - SUPERQUADRATIC}

.

Find two subsequences,

Y_{1} = S_{1} [f_{1}, f_{1} + d_{0} log n - 1] \in D_{1}

and

Y_{2} = S_{2} [e_{1}, e_{1} + d_{0} log n - 1] \in D_{2}

, such that

e_{1}

is the least and

diff (Y_{1}, Y_{2}) \leq ω_{DETERMINISTIC - SUPERQUADRATIC}

.

Find two subsequences,

Y_{1}^{'} = S_{1} [f_{1}^{'}, f_{1}^{'} + d_{0} log n - 1] \in D_{1}

and

Y_{2}^{'} = S_{2} [e_{1}^{'}, e_{1}^{'} + d_{0} log n - 1] \in D_{2}

, such that

e_{1}^{'}

is the largest and

diff (Y_{1}^{'}, Y_{2}^{'}) \leq ω_{DETERMINISTIC - SUPERQUADRATIC}

.

Return

(a_{1}, a_{1}^{'}, e_{1}, e_{1}^{'})

.

End of Collision-Detection.

Function Point-Selection

(S, L, I)

is not used in the deterministic algorithm.

Improve-Boundaries

(S_{1}, a_{l}, a_{r}, S_{2}, f_{l}, f_{r}, L)

is the same as that in the randomized algorithms.

Initial-Boundaries

(S_{1}, S_{2})

Input: a pair of

Θ (n, G, α)

-sequences,

S_{1}

and

S_{2}

.

Output: rough left boundary

{roughLeft}_{S_{1}}

of

S_{1}

, right boundary

{roughRight}_{S_{1}}

of

S_{1}

, rough left boundary

{roughLeft}_{S_{2}}

of

S_{2}

and right boundary

{roughRight}_{S_{2}}

of

S_{2}

.

Steps:

Let

U_{1} = U_{2} = \emptyset

.

Let

L = n^{2 / 5}

.

Repeat.

Let

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, S_{2})

.

If (

L_{S_{1}} \neq \emptyset

and

R_{S_{1}} \neq \emptyset

),

then go to H.

Else,

L = L / 2

,

until

(L < \frac{1}{2} \frac{{(log n)}^{3 + τ}}{100})

.

H: Return Improve-Boundaries

(S_{1}, L_{S_{1}}, R_{S_{1}}, S_{2}, L_{S_{2}}, R_{S_{2}}, 2 L)

.

End of Initial-Boundaries.

Motif-Length-And-Boundaries(

Z_{1}

) is the same as that before.

Match

(G_{l}, G_{r}, S_{i})

is the same as that for the randomized algorithm.

Extract

(G_{l}, G_{r}, Z_{2}

) is the same as that for the randomized algorithm.

The following is Extract-Phase of algorithm Recover-Motif. It extracts the motif regions of another set,

Z_{2}

, of input sequences.

Extract-Phase(

S^{'}, Z_{2}

) is the same as that for the randomized algorithm.

Voting-Phase

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

is the same as that for the randomized algorithm.

The entire deterministic algorithm is described below. We maintain the sizes of

Z_{1}

and

Z_{2}

to be roughly equal.

Algorithm Recover-Motif (Z)

Input:

Z = Z_{1} \cup Z_{2}

, where

Z_{1} = {S_{1}^{'}, \dots, S_{2 k_{1}}^{'}}

and

Z_{2} = {S_{1}^{″}, \dots, S_{k_{2}}^{″}}

are two sets of input sequences.

Steps:

Preprocessing part:

For each

S \in Z_{1} \cup Z_{2}

, let

{roughLeft}_{S} = {roughRight}_{S} = 0

(the two boundaries are unknown).

l_{motif} =

Motif-Length-And-Boundaries(

Z_{1}

).

Let

L = l_{motif} / 4

.

For

i = 1

to

k_{1}

,

for each

S_{j}^{″} \in Z_{2}

,

let

(L_{S_{2 i - 1}^{'}}, R_{S_{2 i - 1}^{'}}, L_{S_{j}^{″}}, R_{S_{j}^{″}}) =

Collision-Detection

(S_{2 i - 1}^{'}, S_{j}^{″})

.

Let

(L_{S_{2 i - 1}^{'}}, R_{S_{2 i - 1}^{'}}, {roughLeft}_{S_{j}^{″}}, {roughRight}_{S_{j}^{″}})

=

Improve-Boundaries

(S_{2 i - 1}^{'}, L_{S_{2 i - 1}^{'}}, R_{S_{2 i - 1}^{'}}, S_{j}^{″}, L_{S_{j}^{″}}, R_{S_{j}^{″}}, 2 L)

.

Let

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

be the output from Extract-Phase(

S_{2 i - 1}^{'}, Z_{2}

).

If

(G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″})

is not empty,

then go to the Voting part.

Voting part:

Return Voting-Phase(

G_{1}^{″}, G_{2}^{″}, \dots, G_{k_{2}}^{″}

).

End of Algorithm Recover-Motif.

6. Analysis of the Algorithm

The correctness of the algorithm will be proven via a series of lemmas in Section 6.2 and Section 6.3. Section 6.2 is for Boundary-Phase and Section 6.3 is for Extract-Phase. Furthermore, Section 6.3 gives some lemmas for the two randomized algorithms, and Section 6.4 gives the proof for the deterministic algorithm.

6.1. Review of Some Classical Results in Probability

Some well-known results in classical probability theory are listed here. Readers who are familiar with them can skip this section. The inclusion of these results makes the paper self-contained.

For a list of events, $A_{1}, \dots, A_{m}$ , $\Pr [A_{1} \cup A_{2} \cup \dots \cup A_{m}] \leq \Pr [A_{1}] + \Pr [A_{2}] + \dots + \Pr [A_{m}]$ .
For two independent events, A and B, $\Pr [A \cap B] = \Pr [A] \Pr [B]$ .
For a nonnegative random variable, Y, $\Pr [Y \geq t] \leq \frac{E [Y]}{t}$ for all positive real numbers, t. This is Markov's inequality.

The analysis of our algorithm employs the Chernoff bound [22] and Corollary 18 below, which can be derived from it (see [11]).

Theorem 15

([22]). Let

X_{1}, \dots, X_{n}

be n independent random 0–1 variables, where

X_{i}

takes one with a probability of

p_{i}

. Let

X = \sum_{i = 1}^{n} X_{i}

, and

μ = E [X]

. Then, for any

δ > 0

:

i.: $Pr (X < (1 - δ) μ) < e^{- \frac{1}{2} μ δ^{2}}$ and
ii.: $Pr (X > (1 + δ) μ) < {[\frac{e^{δ}}{{(1 + δ)}^{(1 + δ)}}]}^{μ}$ .

We follow the proof of Theorem 15 to make the following version of the Chernoff bound, so that it can be used in our algorithm analysis.

Theorem 16.

Let

X_{1}, \dots, X_{n}

be n independent random 0–1 variables, where

X_{i}

takes one with a probability of at most p. Let

X = \sum_{i = 1}^{n} X_{i}

. Then, for any

δ > 0

,

Pr (X > (1 + δ) p n) < {[\frac{e^{δ}}{{(1 + δ)}^{(1 + δ)}}]}^{p n}

.

Define

g (δ) = \frac{e^{δ}}{{(1 + δ)}^{(1 + δ)}}

. We note that

g (δ)

is always strictly less than one for all

δ > 0

and

g (δ)

is fixed if δ is a constant. This can be verified by checking that the function

f (x) = ln \frac{e^{x}}{{(1 + x)}^{(1 + x)}} = x - (1 + x) ln (1 + x)

is decreasing and

f (0) = 0

. This is because

f^{'} (x) = - ln (1 + x)

, which is less than zero for all

x > 0

.
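
The claim about $g (δ)$ can also be checked numerically. The following Python sketch evaluates g through its logarithm, using the identity $ln g (δ) = δ - (1 + δ) ln (1 + δ)$ stated above; the function names are illustrative.

```python
import math


def log_g(delta: float) -> float:
    """ln g(delta) = delta - (1 + delta) * ln(1 + delta)."""
    return delta - (1.0 + delta) * math.log1p(delta)


def g(delta: float) -> float:
    """g(delta) = e^delta / (1 + delta)^(1 + delta)."""
    return math.exp(log_g(delta))
```

For example, $g (1) = e / 4$ , which is about 0.68, and the values of g keep decreasing as δ grows, as the sign of the derivative predicts.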

Theorem 17.

Let

X_{1}, \dots, X_{n}

be n independent random 0–1 variables, where

X_{i}

takes one with a probability of at least p. Let

X = \sum_{i = 1}^{n} X_{i}

. Then, for any

δ > 0

,

Pr (X < (1 - δ) p n) < e^{- \frac{1}{2} p n δ^{2}}

.

Corollary 18

([11]). Let

X_{1}, \dots, X_{n}

be n independent random 0–1 variables, and

X = \sum_{i = 1}^{n} X_{i}

.

i) If

X_{i}

takes one with a probability of at most p, then for any

\frac{1}{3} > ϵ > 0

,

Pr (X > p n + ϵ n) < e^{- \frac{1}{3} n ϵ^{2}}

.

ii) If

X_{i}

takes one with a probability of at least p, then for any

ϵ > 0

,

Pr (X < p n - ϵ n) < e^{- \frac{1}{2} n ϵ^{2}}

.
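
To get a feeling for how loose the bounds of Corollary 18 are, one can compare them with exact binomial tails in the special case where every $X_{i}$ takes one with probability exactly p. The following Python sketch does this; the helper names are illustrative, and the identically distributed case is only one instance of the hypotheses of the corollary.

```python
import math


def binom_pmf(n: int, p: float, k: int) -> float:
    """Pr[X = k] for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def upper_tail(n: int, p: float, threshold: float) -> float:
    """Pr[X > threshold] for X ~ Binomial(n, p)."""
    first = max(math.floor(threshold) + 1, 0)
    return sum(binom_pmf(n, p, k) for k in range(first, n + 1))


def lower_tail(n: int, p: float, threshold: float) -> float:
    """Pr[X < threshold] for X ~ Binomial(n, p)."""
    last = min(math.ceil(threshold) - 1, n)
    return sum(binom_pmf(n, p, k) for k in range(0, last + 1))


def corollary_bounds(n: int, eps: float) -> tuple:
    """Right-hand sides of Corollary 18 (i) and (ii)."""
    return math.exp(-n * eps**2 / 3.0), math.exp(-n * eps**2 / 2.0)
```

With n = 200, p = 0.25 and ϵ = 0.1, both exact tails are far below the corresponding bounds, which is expected because the bounds do not depend on p.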

6.2. Analysis of Boundary-Phase of Algorithm Recover-Motif

Lemma 19 shows that a sequence matches a random sequence with only a small probability. It will be used to prove that when two substrings in two different

Θ (n, G, α)

-sequences are similar, they are likely to coincide with the motif regions in the two

Θ (n, G, α)

-sequences, respectively.

Lemma 19.

Assume that

X_{1}

and

X_{2}

are two independent sequences of the same length and that every character of

X_{2}

is a random character from Σ. Then:

i.: if $1 \leq | X_{1} | = | X_{2} | < v$ , then the probability that $X_{1}$ and $X_{2}$ are matched is $\leq \frac{1}{t^{| X_{1} |}}$ ( $t = | | Σ | |$ ); and
ii.: the probability for $diff (X_{1}, X_{2}) \leq β$ is at most $e^{- \frac{ϵ^{2} | X_{1} |}{3}}$ .

Proof:

The two statements are proven as follows.

Statement i: For every character,

X_{2} [j]

, with

1 \leq j < v

, the probability is

\frac{1}{t}

that

X_{2} [j] = X_{1} [j]

.

Statement ii: For every character,

X_{2} [j]

, with

1 \leq j \leq | X_{2} |

, the probability is

\frac{1}{t}

for

X_{2} [j]

to equal

X_{1} [j]

. If

diff (X_{1}, X_{2}) \leq β

, the two sequences,

X_{1}

and

X_{2}

, are identical in at least

(1 - β) | X_{1} |

positions, but the expected number of positions where the two sequences are identical is

\frac{1}{t} | X_{1} |

. The probability for

diff (X_{1}, X_{2}) \leq β

is at most

e^{- \frac{{(1 - β - \frac{1}{t})}^{2}}{3} | X_{1} |} \leq e^{- \frac{ϵ^{2}}{3} | X_{1} |}

by Corollary 18, Definition 9, and items (xiv) and (xii), equation (8) and inequality (2) in Definition 8. ▮

Lemma 20 shows that with a small probability, an input

Θ (n, G, α)

sequence contains a motif region that has many mutations.

Lemma 20.

i.: With a probability of at most $c^{t}$ , a $Θ (n, G, α)$ sequence, S, changes more than $\frac{β}{2} t$ characters in its motif region, $ℵ (S) [i, i + t - 1]$ , with $1 \leq i \leq i + t - 1 \leq | G |$ , where c is defined in item (v) in Definition 8.
ii.: With a probability of at most $\frac{c^{y}}{1 - c}$ , a $Θ (n, G, α)$ sequence, S, changes more than $\frac{β}{2} t$ characters in its left motif region, $ℵ (S) [1, t]$ , for some t with $y \leq t \leq | G |$ , where c is defined in item (v) in Definition 8.

Proof:

Statement i: Every character in the

ℵ (S)

region has a probability of at most α to mutate. We know that

| ℵ (S) | = | G | \geq d

. By Corollary 18, with a probability of at most

e^{- \frac{ϵ^{2}}{3} t}

,

ℵ (S) [i, i + t - 1]

has more than

(α + ϵ) t

mutations.

Statement ii: We know that

| ℵ (S) | = | G | \geq d

. By Corollary 18, with a probability of at most

e^{- \frac{ϵ^{2}}{3} t}

, a sequence, S, in

Z_{1}

has more than

(α + ϵ) t

mutations (recall the setting for β in Definition 8) among the first t characters from the left. The total probability is at most

\sum_{t = y}^{\infty} e^{- \frac{ϵ^{2}}{3} t} = \frac{c^{y}}{1 - c}

. ▮

Lemma 21 shows that Improve-Boundaries() has a good chance to improve the accuracy of rough motif boundaries. Note that

LB (S)

and

RB (S)

are the left and right motif boundaries of S, respectively (see Definition 11).

Lemma 21.

Assume that

Θ (n, G, α)

sequence

S_{i}

has

L_{S_{i}} \in [LB (S_{i}) - L, LB (S_{i}) + L]

and

R_{S_{i}} \in [RB (S_{i}) - L, RB (S_{i}) + L]

for

i = 1, 2

. Then, for

({roughLeft}_{S_{1}}, {roughRight}_{S_{1}}, {roughLeft}_{S_{2}},

{roughRight}_{S_{2}})

=Improve-Boundaries

(S_{1}, L_{S_{1}}, R_{S_{1}}, S_{2}, L_{S_{2}}, R_{S_{2}}, L)

, we have the following three facts:

i.: with a probability of at most $\frac{2 c^{v}}{1 - c} + \frac{2 (v + u) c^{v + u}}{{(1 - c)}^{2}} + \frac{1}{5 \cdot 2^{x} n}$ , ${roughLeft}_{S_{i}}$ is not in $[LB (S_{i}) - (v + u), LB (S_{i})]$ for $i = 1, 2$ .
ii.: with a probability of at most $\frac{2 c^{v}}{1 - c} + \frac{2 (v + u) c^{v + u}}{{(1 - c)}^{2}} + \frac{1}{5 \cdot 2^{x} n}$ , ${roughRight}_{S_{i}}$ is not in $[RB (S_{i}), RB (S_{i}) + (v + u)]$ for $i = 1, 2$ .
iii.: Improve-Boundaries $(S_{1}, L_{S_{1}}, R_{S_{1}}, S_{2}, L_{S_{2}}, R_{S_{2}}, L)$ runs in $O (L^{2} log n)$ time.

Proof:

We need the following inequality:

\begin{matrix} \sum_{i = j}^{\infty} i a^{i} < \frac{j a^{j}}{{(1 - a)}^{2}} \end{matrix}

(20)

Let

f (x) = \sum_{i = j}^{\infty} e^{θ i x}

. Compute the derivative

f^{'} (x) = θ \sum_{i = j}^{\infty} i e^{θ i x}

. We also have the closed form for the function

f (x) = \frac{e^{θ j x}}{1 - e^{θ x}}

, which implies:

\begin{matrix} f^{'} (x) & = & \frac{θ j e^{θ j x} (1 - e^{θ x}) - e^{θ j x} (- θ e^{θ x})}{{(1 - e^{θ x})}^{2}} \end{matrix}

(21)

\begin{matrix} = & \frac{θ j e^{θ j x} - θ (j - 1) e^{θ (j + 1) x}}{{(1 - e^{θ x})}^{2}} \end{matrix}

(22)

Let

θ = ln a

and

x = 1

. We have

\sum_{i = j}^{\infty} i a^{i} = \frac{j a^{j} - (j - 1) a^{j + 1}}{{(1 - a)}^{2}} < \frac{j a^{j}}{{(1 - a)}^{2}}

.
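
The closed form and the bound in inequality (20) are easy to check numerically. The following Python sketch compares a long partial sum with the closed form derived above; the function names and the truncation length are illustrative choices.

```python
def tail_sum(a: float, j: int, terms: int = 10_000) -> float:
    """Partial sum of i * a**i for i = j, ..., j + terms - 1 (0 < a < 1)."""
    return sum(i * a**i for i in range(j, j + terms))


def closed_form(a: float, j: int) -> float:
    """(j a^j - (j - 1) a^(j + 1)) / (1 - a)^2, the exact value of the series."""
    return (j * a**j - (j - 1) * a ** (j + 1)) / (1.0 - a) ** 2


def upper_bound(a: float, j: int) -> float:
    """Right-hand side of inequality (20)."""
    return j * a**j / (1.0 - a) ** 2
```

For a = 1/2 and j = 3, the series equals 1 while the bound is 1.5, so the bound overshoots by the dropped term $(j - 1) a^{j + 1} / {(1 - a)}^{2}$ .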

Statement i. By Lemma 20, with a probability of at most

\frac{2 c^{v}}{1 - c}

, for at least one of the two sequences,

S_{i}

, the first t characters of its motif region contain more than

\frac{β}{2} t

changed characters for some $t \geq v$ . Therefore, with a probability of at most

P_{1} = \frac{2 c^{v}}{1 - c}

,

{roughLeft}_{S_{i}} > LB (S_{i})

.

For a pair of positions, p, in

S_{1}

, and q, in

S_{2}

, without loss of generality, assume that p has a larger distance to the left boundary

LB (S_{1})

of

S_{1}

than q to the left boundary

LB (S_{2})

of

S_{2}

. Let

v + y

be the distance from p to the left boundary

LB (S_{1})

of

S_{1}

.

By Lemma 19, the probability is at most

c^{v + y}

that there will be a match. There are at most

(v + y)

cases for q. With a probability of at most

P_{2} = 2 \sum_{y = u}^{\infty} (v + y) c^{v + y} < \frac{2 (v + u) c^{v + u}}{{(1 - c)}^{2}}

, by inequality (20),

{roughLeft}_{S_{1}} < LB (S_{1}) - (v + u)

.

For the cases that one position is in a random region and has a distance more than

d_{0} log n

from the left boundary, the probability is at most

P_{3} = n^{2} c^{d_{0} log n} < \frac{1}{5 \cdot 2^{x} n}

by inequality (14).

Therefore, we have a total probability of at most

P_{1} + P_{2} + P_{3}

that

{roughLeft}_{S_{1}}

is not in

[LB (S_{1}) - (v + u), LB (S_{1})]

.

Statement ii. The proof is symmetric to that of Statement i.

Statement iii. The computation time easily follows from the implementation of Improve-Boundaries

(S_{1}, L_{S_{1}}, R_{S_{1}}, S_{2}, L_{S_{2}}, R_{S_{2}})

. ▮

Lemma 22.

Assume that for each L with

0 < L \leq \frac{| G |}{2}

, with a probability of at most

ς (n)

,

L_{S_{i}} \notin [{LB}_{S_{i}} - L, {LB}_{S_{i}} + L]

for

i = 1, 2

, where

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection(

S_{1}, U_{1}, S_{2}, U_{2})

,

U_{1} =

Point-Selection

(S_{1}, L)

and

U_{2} =

Point-Selection

(S_{2}, L)

. Then, with a probability of at most

ς (n) + \frac{2 (v + u_{1}) c^{v + u_{1}}}{{(1 - c)}^{2}} + \frac{c^{v}}{1 - c} + \frac{1}{5 \cdot 2^{x} n}

, Initial-Boundary

(S_{1}, S_{2})

returns

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}})

with

L_{S_{i}} \notin [LB (S_{i}) - (v + u_{1}), LB (S_{i})]

or

R_{S_{i}} \notin [RB (S_{i}), RB (S_{i}) + (v + u_{1})]

for

i = 1, 2

.

Proof:

It follows from Lemma 21. ▮

Lemma 23.

Assume that with a probability of

p < 0.5

, each

S_{2 i - 1}^{'}

has its rough boundaries,

{roughLeft}_{S_{2 i - 1}^{'}} \notin [LB (S_{2 i - 1}^{'}) - u, LB (S_{2 i - 1}^{'})]

or

{roughRight}_{S_{2 i - 1}^{'}} \notin [RB (S_{2 i - 1}^{'}), RB (S_{2 i - 1}^{'}) + u]

; then, with a probability of at most

e^{- {(0.5 - p - ϵ)}^{2} k_{1} / 3}

,

l_{m o t i f}

is not in

[| G | - 2 u, | G | + 2 u]

, where

l_{m o t i f}

is selected as the median of

\cup_{i = 1}^{k_{1}} {({roughRight}_{S_{2 i - 1}^{'}} - {roughLeft}_{S_{2 i - 1}^{'}})}

.

Proof:

If both

{roughLeft}_{S_{2 i - 1}^{'}} \in [LB (S_{2 i - 1}^{'}) - u, LB (S_{2 i - 1}^{'})]

and

{roughRight}_{S_{2 i - 1}^{'}} \in [RB (S_{2 i - 1}^{'}), RB (S_{2 i - 1}^{'}) + u]

, then

({roughRight}_{S_{2 i - 1}^{'}} - {roughLeft}_{S_{2 i - 1}^{'}})

is in

[| G | - 2 u, | G | + 2 u]

.

If the median of

\cup_{i = 1}^{k_{1}} {({roughRight}_{S_{2 i - 1}^{'}} - {roughLeft}_{S_{2 i - 1}^{'}})}

is not in

[| G | - 2 u, | G | + 2 u]

, then there are at least

\frac{k_{1}}{2}

indices, i, with

{roughLeft}_{S_{2 i - 1}^{'}} \notin [LB (S_{2 i - 1}^{'}) - u, LB (S_{2 i - 1}^{'})]

or

{roughRight}_{S_{2 i - 1}^{'}} \notin [RB (S_{2 i - 1}^{'}), RB (S_{2 i - 1}^{'}) + u]

.

On the other hand, with a probability of at most p,

{roughLeft}_{S_{2 i - 1}^{'}} \notin [LB (S_{2 i - 1}^{'}) - u, LB (S_{2 i - 1}^{'})]

or

{roughRight}_{S_{2 i - 1}^{'}} \notin [RB (S_{2 i - 1}^{'}), RB (S_{2 i - 1}^{'}) + u]

. Therefore, this lemma follows from Corollary 18. ▮
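The median estimate analyzed in Lemma 23 can be sketched as follows. This is a minimal illustration, assuming that the rough boundaries are already available as (roughLeft, roughRight) pairs; the function name and the input format are hypothetical.

```python
from statistics import median_low


def estimate_motif_length(rough_boundaries):
    """Median of roughRight - roughLeft over the sampled sequences.

    rough_boundaries is a list of (rough_left, rough_right) pairs,
    one pair per sequence (hypothetical input format).
    """
    lengths = [right - left for left, right in rough_boundaries]
    # median_low always returns one of the observed differences.
    return median_low(lengths)
```

As long as fewer than half of the pairs have a bad boundary, the outliers cannot move the median outside the range spanned by the good pairs, which is the property the lemma uses.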

For a

Θ (n, G, α)

-sequence S, we often obtain its left rough boundary with

{roughLeft}_{S} \leq LB (S)

. Sometimes, its exact left boundary may be missed in the algorithm.

Definition 24.

A $Θ (n, G, α)$ -sequence, S, misses its left boundary if ${roughLeft}_{S} > LB (S)$ .
A $Θ (n, G, α)$ -sequence, S, misses its right boundary if ${roughRight}_{S} < RB (S)$ .

Definition 25.

i.: A $Θ (n, G, α)$ -sequence, S, contains a left half stable motif region, $ℵ (S)$ , if $diff (G^{'} [1, h], G [1, h]) \leq \frac{β}{2}$ for all $h = v, v + 1, \dots, m$ , where $G^{'} = ℵ (S)$ , and $m = | G |$ , as defined in Definition 8 and Section 2, respectively.
ii.: A $Θ (n, G, α)$ -sequence, S, contains a right half stable motif region, $ℵ (S)$ , if $diff (G^{'} [m - h, m], G [m - h, m]) \leq \frac{β}{2}$ for all $h = v - 1, v, \dots, m - 1$ , where $G^{'} = ℵ (S)$ and $m = | G |$ .
iii.: A $Θ (n, G, α)$ -sequence, S, contains a stable motif region, $ℵ (S)$ , if the following conditions are satisfied: (1) $G^{'} [i] = G [i]$ for $i = 1, \dots, v - 1$ ; (2) $G^{'} [m - i + 1] = G [m - i + 1]$ for $i = 1, \dots, v - 1$ ; and (3) the motif region of S is both left and right half stable, where $G^{'} = ℵ (S)$ and $m = | G |$ .

Lemma 26.

Assume that:

$l_{m o t i f} \in [| G | - 2 (v + u_{1}), | G | + 2 (v + u_{1})]$ ;
S contains both a left half and a right half stable motif region and ${roughLeft}_{S} \in [LB (S) - (v + u_{1}), LB (S)]$ and ${roughRight}_{S} \in [RB (S), RB (S) + (v + u_{1})]$ (see Definition 8 for $u_{1}$ and v); and
for each L with $(v + u_{1}) < L \leq \frac{| G |}{2}$ , if $S_{1}$ has ${roughLeft}_{S_{1}} \in [{LB}_{S_{1}} - L, {LB}_{S_{1}} + L]$ and ${roughRight}_{S_{1}} \in [{RB}_{S_{1}} - L, {RB}_{S_{1}} + L]$ , then with a probability of at most $ς (n)$ , $L_{S_{i}^{″}} \notin [{LB}_{S_{i}^{″}} - 2 L, {LB}_{S_{i}^{″}} + 2 L]$ for $i = 1, 2$ , where $(L_{S_{1}}, R_{S_{1}}, L_{S_{i}^{″}}, R_{S_{i}^{″}}) =$ Collision-Detection( $S_{1}, U_{1}, S_{i}^{″}, U_{2})$ , $U_{1} =$ Point-Selection $(S_{1}, L, [{roughLeft}_{S_{1}} - 2 L, {roughLeft}_{S_{1}} + 2 L]) \cup$ Point-Selection $(S_{1}, L, [{roughRight}_{S_{1}} - 2 L, {roughRight}_{S_{1}} + 2 L])$ and $U_{2} =$ Point-Selection $(S_{i}^{″},$ $L, [1, | S_{i}^{″} |])$ .
The rough boundaries for all sequences, $S_{i}^{″} \in Z_{2}$ , are computed via $(L_{S}, R_{S}, L_{S_{i}^{″}}, R_{S_{i}^{″}}) =$ Collision-Detection $(S, U_{S}, S_{i}^{″}, U_{S_{i}^{″}})$ and $(L_{S}, R_{S}, {roughLeft}_{S_{i}^{″}}, {roughRight}_{S_{i}^{″}})$ = Improve-Boundaries $(S,$ $L_{S}, R_{S}, S_{i}^{″}, L_{S_{i}^{″}}, R_{S_{i}^{″}}, 2 L)$ .

Then, with a probability of at most

e^{- \frac{ϵ^{2} k_{2}}{3}}

, there are more than

(2 (ς (n) + (v + u_{1}) \frac{c^{v + u}}{1 - c} + \frac{c^{v}}{1 - c}) + ϵ) k_{2}

sequences

S_{i}^{″}

in

{S_{1}^{″}, \dots, S_{k_{2}}^{″}}

with

roughLeft (S_{i}^{″}) \notin [LB (S_{i}^{″}) - (v + u), LB (S_{i}^{″})]

or

roughRight (S_{i}^{″}) \notin [RB (S_{i}^{″}), RB (S_{i}^{″}) + (v + u)]

.

Proof:

According to the condition of this lemma, with a probability of at most

P_{1} = ς (n)

,

L_{S_{i}^{″}} \notin [{LB}_{S_{i}^{″}} - 2 L, {LB}_{S_{i}^{″}} + 2 L]

, where

(L_{S}, R_{S}, L_{S_{i}^{″}}, R_{S_{i}^{″}}) =

Collision-Detection(

S, U_{1}, S_{i}^{″}, U_{2})

and

(U_{1}, U_{2}) =

Point-Selection

(S, S_{i}^{″}, L)

.

For a fixed pattern from S, by Lemma 19, with a probability of at most

\sum_{y = v + u}^{\infty} c^{y} = \frac{c^{v + u}}{1 - c}

, it has a distance of more than

v + u

to the true left boundary. As we need to deal with

v + u_{1}

possible patterns from S, with a probability of at most

P_{2, l} = (v + u_{1}) \cdot \frac{c^{v + u}}{1 - c}

,

{roughLeft}_{S_{i}^{″}} < LB (S_{i}^{″}) - (v + u)

.

Similarly, with a probability of at most

P_{2, r} = (v + u_{1}) \frac{c^{v + u}}{1 - c}

,

{roughRight}_{S_{i}^{″}} > RB (S_{i}^{″}) + (v + u)

. Let

P_{2} = P_{2, l} + P_{2, r}

.

With a probability of at most

P_{3, l} = \frac{c^{v}}{1 - c}

,

S_{i}^{″}

does not contain a left half stable motif region by Lemma 20. Similarly, with a probability of at most

P_{3, r} = \frac{c^{v}}{1 - c}

,

S_{i}^{″}

does not contain a right half stable motif region. Let

P_{3} = P_{3, l} + P_{3, r}

.

Although S is involved in searching for the left boundary of every other sequence, the non-missing condition only requires that each sequence not change too many characters in its motif region. Therefore, this is an independent event for each sequence, and it is safe to apply the Chernoff bound.

With a probability of at most

P = e^{- \frac{ϵ^{2} k_{2}}{3}}

, there are more than

(P_{1} + P_{2} + P_{3} + ϵ) k_{2}

sequences,

S_{i}^{″}

, in

{S_{1}^{″}, \dots, S_{k_{2}}^{″}}

with

roughLeft (S_{i}^{″}) \notin [LB (S_{i}^{″}) - (v + u), LB (S_{i}^{″})]

or

roughRight (S_{i}^{″}) \notin [RB (S_{i}^{″}), RB (S_{i}^{″}) + (v + u)]

. ▮

6.3. Analysis of Extract-Phase and Voting-Phase of Algorithm Recover-Motif

Lemma 27 shows that with a high probability, the left and right parts of the motif region in a

Θ (n, G, α)

-sequence do not change much.

Lemma 27.

With a probability of at most

Q_{0}

defined in Equation (13), a

Θ (n, G, α)

-sequence, S, does not contain a stable motif region.

Proof:

The probability is at most

V_{1} = 2 (v - 1) α

not to satisfy conditions (1) and (2) of item (iii) in Definition 25. Consider condition (3) of item (iii) in Definition 25. Since every character of

ℵ (S) [1, m]

(notice that

m = | G |

) has a probability of at most α to mutate, by Corollary 18, the probability is at most

e^{- \frac{1}{3} ϵ^{2} h}

that

diff (G [1, h], G^{'} [1, h]) > \frac{β}{2} = α + ϵ

. Let

V_{3} = \sum_{r = v}^{\infty} e^{- \frac{1}{3} ϵ^{2} r} = \frac{c^{v}}{1 - c}

, where

c = e^{- \frac{1}{3} ϵ^{2}}

, as defined in Definition 8. Therefore, the probability is at most

V_{3}

that

diff (G [1, h], G^{'} [1, h]) > \frac{β}{2} = α + ϵ

for some

h \in {v, v + 1, \dots, m}

. Similarly, we define

V_{4} = \sum_{r = v}^{\infty} e^{- \frac{1}{3} ϵ^{2} r} \leq \frac{c^{v}}{1 - c}

for the probability on the right-hand side. The probability is at most

V_{4}

that

diff (G [m - h, m], G^{'} [m - h, m]) > \frac{β}{2} = α + ϵ

for some

h \in {v, v + 1, \dots, m}

. The probability that S does not contain a stable motif region is at most

V_{1} + V_{3} + V_{4} = Q_{0}

. ▮
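To get a feeling for the size of this bound, the sum of the three terms from the proof can be evaluated numerically. The sketch below only uses the quantities that appear in the proof (α, ϵ, v and c); the function name is a hypothetical choice, and the values passed in are illustrative rather than the parameter settings of Definition 8.

```python
import math


def unstable_region_bound(alpha, eps, v):
    """Evaluate V_1 + V_3 + V_4 from the proof of Lemma 27."""
    c = math.exp(-eps**2 / 3)       # c = e^{-eps^2 / 3}
    v1 = 2 * (v - 1) * alpha        # conditions (1) and (2) of item (iii)
    v3 = c**v / (1 - c)             # geometric tail for the left half
    v4 = v3                         # same bound for the right half
    return v1 + v3 + v4
```

The two geometric terms decay exponentially in v, while the first term grows linearly in v, so v trades off the two contributions.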

Definition 28.

Assume that

Z_{1} = {S_{1}^{'}, \dots, S_{2 k_{1}}^{'}}

contains

S_{2 i - 1}^{'}

, which contains a stable motif region. We fix such a

S_{2 i - 1}^{'}

.

Define $G_{L} = ℵ (S_{2 i - 1}^{'}) [1, d_{0} log n - 1]$ to be the left part of the motif region, $ℵ (S_{2 i - 1}^{'})$ .
Define $G_{R} = ℵ (S_{2 i - 1}^{'}) [| G | - (d_{0} log n) + 1, | G |]$ to be the right part of the motif region, $ℵ (S_{2 i - 1}^{'})$ .

Lemma 29 shows that with a high probability, Extract-Phase of algorithm Recover-Motif extracts the correct motif regions from the sequences in

Z_{1}

. It uses

G^{″}

to match

ℵ (S)

in another sequence, S. The parameter, R, bounds the probability that the matched region between

G^{″}

and S is not in

ℵ (S)

.

Lemma 29.

i.: Assume that $G_{l}$ and $G_{r}$ are fixed sequences of length $d_{0} log n$ . Let S be a $Θ (n, G, α)$ -sequence with $M = Match (G_{l}, G_{r}, S)$ , and let $w_{0}$ be the number of characters of M that are not in the region of $ℵ (S)$ . Then, the probability is at most R that $w_{0} \geq 1$ , where R is defined in (xv) of Definition 8.
ii.: The probability is at most $Q_{0}$ that given a $Θ (n, G, α)$ -sequence S, $Match (G_{L}, G_{R}, S) = \emptyset$ .

Proof:

Assume that

w_{0} \geq 1

. Let w be the number of characters outside of

ℵ (S)

on the left of M, and let

w^{'}

be the number of characters outside of

ℵ (S)

on the right of M. Clearly,

w_{0} = w + w^{'}

. Since

w_{0} \geq 1

, either

w \geq 1

or

w^{'} \geq 1

. See Figure 1. Without loss of generality, we assume

w \geq 1

.

Figure 1.

G^{″}

and M.

Statement i: There are two cases.

Case (a):

1 \leq w < v

. By Lemma 19, the probability for this case is at most

\frac{1}{t^{w}}

for a fixed w. The total probability for this case for

1 \leq w < v

is at most

\sum_{i = 1}^{v - 1} \frac{1}{t^{i}} \leq \sum_{i = 1}^{\infty} \frac{1}{t^{i}} = \frac{1}{t - 1}

.

Case (b):

v \leq w

. By Lemma 19, the probability is at most

e^{- \frac{ϵ^{2}}{3} w}

for a fixed w. The total probability for

v \leq w

is at most

\sum_{w = v}^{\infty} e^{- \frac{ϵ^{2}}{3} w} = \frac{c^{v}}{1 - c}

.

The probability analysis is similar when

w^{'} \geq 1

. Therefore, the probability for this case is at most

R = (\frac{1}{t - 1} + \frac{c^{v}}{1 - c})

for

w_{0} \geq 1

.

Statement ii: By Lemma 27, with a probability of at most

Q_{0}

, S does not contain a stable motif region. Therefore, we have a probability of at most

Q_{0}

that given a random

Θ (n, G, α)

-sequence, S,

Match (G_{L}, G_{R}, S) = \emptyset

. ▮

Lemma 30 shows that we can use

G_{l}

and

G_{r}

to extract most of the motif regions for the sequences in

Z_{2}

if

G^{'} = G_{L}

(recall that

G_{L}

is defined right after Lemma 27).

Lemma 30.

Assume that

G_{l}

and

G_{r}

are two sequences of a length of

d_{0} log n

, and

G_{i} = Match (G_{l}, G_{r}, S_{i}^{″})

for

S_{i}^{″} \in Z_{2} = {S_{1}^{″}, \dots, S_{k_{2}}^{″}}

and

i = 1, \dots, k_{2}

(recall that each sequence,

G_{i}

, is either an empty sequence or a sequence of the length of

| G_{l} |

).

i.: If $G_{l} = G_{L}$ , $G_{r} = G_{R}$ and there are no more than $y k_{2}$ ( $y \in [0, 1]$ ) sequences, $S_{i}^{″}$ , with ${roughLeft}_{S_{i}^{″}} \notin [LB (S_{i}^{″}) - (v + u_{2}), LB (S_{i}^{″})]$ or ${roughRight}_{S_{i}^{″}} \notin [RB (S_{i}^{″}), RB (S_{i}^{″}) + (v + u_{2})]$ , then the probability is at most $e^{- \frac{ϵ^{2} k_{2}}{3}}$ that there are more than $(Q_{0} + y + ϵ) k_{2}$ sequences, $G_{i}$ , with $G_{i} = \emptyset$ .
ii.: For arbitrary $G_{l}$ and $G_{r}$ , with a probability of at most $e^{- \frac{ϵ^{2} k_{2}}{3}}$ , $| {i | G_{i} \neq \emptyset and G_{i} \neq ℵ (S_{i}^{″}), i = 1, \dots, k_{2}} | > (R + ϵ) k_{2}$ , where R is defined in Definition 8.

Proof:

Recall that sequence

G_{L}

is selected right after Lemma 27.

Statement i: By Lemma 29, for every

S_{i}^{″} \in Z_{2}

, the probability is at most

Q_{0}

that

S_{i}^{″}

does not contain a stable motif region,

ℵ (S_{i}^{″})

. By Corollary 18, we have a probability of at most

e^{- \frac{ϵ^{2} k_{2}}{3}}

that there are more than

(Q_{0} + y + ϵ) k_{2}

sequences,

G_{i}

, with

G_{i} = \emptyset

.

Statement ii: By Lemma 29, the probability is at most R that

G_{i} \neq ℵ (S_{i}^{″})

. By Corollary 18, with a probability of at most

e^{- \frac{ϵ^{2} k_{2}}{3}}

,

| {i | G_{i} \neq ℵ (S_{i}^{″}), i = 1, \dots, k_{2}} | > (R + ϵ) k_{2}

. ▮

Definition 31.

Given two sequences, $G_{l}$ and $G_{r}$ , define:
$M (G_{l}, G_{r}) = {G_{i}^{″} : G_{i}^{″} =$ Match $(G_{l}, G_{r}, {roughLeft}_{S_{i}^{″}}, {roughRight}_{S_{i}^{″}}, S_{i}^{″}),$ $i = 1, \dots, k_{2}}$ .
For a $Θ (n, G, α)$ sequence, S, define $G_{S, L}$ to be the $ℵ (S) [1, d_{0} log n]$ , which is the leftmost subsequence of a length of $d_{0} log n$ in the motif region of S.
For a $Θ (n, G, α)$ sequence, S, define $G_{S, R}$ to be the $ℵ (S) [m - d_{0} log n + 1, m]$ , which is the rightmost subsequence of a length of $d_{0} log n$ in the motif region of S, where $m = | G | = | ℵ (S) |$ ;

the condition iv of Lemma 32.

Lemma 32.

Assume that we have the following conditions:

i.: For each L with $0 < L \leq \frac{| G |}{2}$ , with a probability of at most $ς_{1} (n)$ , $L_{S_{i}} \notin [{LB}_{S_{i}} - 2 L, {LB}_{S_{i}} + 2 L]$ or $R_{S_{i}} \notin [{RB}_{S_{i}} - 2 L, {RB}_{S_{i}} + 2 L]$ for $i = 1, 2$ , where $(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =$ Collision-Detection( $S_{1}, U_{1}, S_{2}, U_{2})$ , $U_{1} =$ Point-Selection $(S_{1}, L, [1, | S_{1} |])$ and $U_{2} =$ Point-Selection $(S_{2}, L, [1, | S_{2} |])$ .
ii.: For each L with $0 < L \leq \frac{| G |}{2}$ , if $S_{1}$ has ${roughLeft}_{S_{1}} \in [{LB}_{S_{1}} - L, {LB}_{S_{1}} + L]$ and ${roughRight}_{S_{1}} \in [{RB}_{S_{1}} - L, {RB}_{S_{1}} + L]$ , then with a probability of at most $ς_{2} (n)$ , $L_{S_{i}^{″}} \notin [{LB}_{S_{i}^{″}} - 2 L, {LB}_{S_{i}^{″}} + 2 L]$ for $i = 1$ or $i = 2$ , where $(L_{S_{1}}, R_{S_{1}}, L_{S_{i}^{″}}, R_{S_{i}^{″}}) =$ Collision-Detection( $S_{1}, U_{1}, S_{i}^{″}, U_{2})$ , $U_{1} =$ Point-Selection $(S_{1}, L, [{roughLeft}_{S_{1}} - 2 L, {roughLeft}_{S_{1}} + 2 L]) \cup$ Point-Selection $(S_{1}, L, [{roughRight}_{S_{1}} - 2 L, {roughRight}_{S_{1}} + 2 L])$ and $U_{2} =$ Point-Selection $(S_{i}^{″}, L, [1, | S_{i}^{″} |])$ .
iii.: The inequality $(P_{0} + Q_{0}) < c_{0}$ holds for some constant $c_{0} < 1$ , where $Q_{0}$ is defined in Equation (13) and $P_{0} = ς_{1} (n) + \frac{2 (v + u_{1}) c^{v + u_{1}}}{{(1 - c)}^{2}} + \frac{c^{v}}{1 - c} + \frac{1}{5 \cdot 2^{x} n}$ .
iv.: The inequality $1 - 2 (Q_{0} + V_{0} + (R + 2 ϵ)) - (α + ϵ) > 0$ holds, where $V_{0} = (2 (ς_{2} (n) + (v + u_{1}) \frac{c^{v + u_{2}}}{1 - c} + \frac{c^{v}}{1 - c}) + ϵ)$ .

Then, the algorithm generates a set of at most

k_{2}

subsequences for voting and votes for a sequence,

G^{'}

, such that:

(1) with a probability of at most

e^{- Ω (k_{1})} + e^{- Ω (k_{2})}

,

| G^{'} | \neq | G |

; and

(2) for each

1 \leq i \leq | G |

, with a probability of at most

e^{- Ω (k_{1})} + e^{- Ω (k_{2})}

,

G^{'} [i] \neq G [i]

.

Before proving Lemma 32, we note that both

ς_{1} (n)

and

ς_{2} (n)

are at most

\frac{1}{2^{x} n^{3}}

for all three algorithms. These bounds are proven by Lemma 47 and Lemma 46 for the case algorithm-type=RANDOMIZED-SUBLINEAR, Lemma 41 and Lemma 40 for the case algorithm-type=RANDOMIZED-SUBQUADRATIC, and Lemma 35 for the case algorithm-type=DETERMINISTIC-SUPERQUADRATIC.

Proof:

By Lemma 22, with a probability of at most

P_{0} = ς_{1} (n) + \frac{2 (v + u_{1}) c^{v + u_{1}}}{{(1 - c)}^{2}} + \frac{c^{v}}{1 - c} + \frac{1}{5 \cdot 2^{x} n}

,

{roughLeft}_{S_{2 i - 1}^{'}} \notin [LB (S_{2 i - 1}^{'}) - (v + u_{1}), LB (S_{2 i - 1}^{'})]

or

{roughRight}_{S_{2 i - 1}^{'}} \notin [RB (S_{2 i - 1}^{'}), RB (S_{2 i - 1}^{'}) + (v + u_{1})]

.

By Lemma 23, with a probability of at most

P_{a} = e^{- {(0.5 - P_{0} - ϵ)}^{2} k_{1} / 3} = e^{- Ω (k_{1})}

, the approximate motif length,

l_{m o t i f}

, is not in the range

[| G | - 2 (v + u_{1}), | G | + 2 (v + u_{1})]

. Assume

l_{m o t i f} \in [| G | - 2 (v + u_{1}), | G | + 2 (v + u_{1})]

in the rest of the proof of this lemma.

By Lemma 27, with a probability of at most

Q_{0}

, a

Θ (n, G, α)

sequence does not contain a stable motif region. Therefore, with a probability of at most

P_{1} = {(P_{0} + Q_{0})}^{k_{1}}

, the following statement is false:

(i) One of

S_{2 i - 1}^{'}

for

i = 1, \dots, k_{1}

has

{roughLeft}_{S_{2 i - 1}^{'}} \in [LB (S_{2 i - 1}^{'}) - (v + u_{1}), LB (S_{2 i - 1}^{'})]

,

{roughRight}_{S_{2 i - 1}^{'}} \in [RB (S_{2 i - 1}^{'}), RB (S_{2 i - 1}^{'}) + (v + u_{1})]

and has a stable motif region.

By Lemma 26, with a probability of at most

P_{2} = e^{- \frac{ϵ^{2} k_{2}}{3}}

, there are more than

(2 (ς_{2} (n) + (v + u_{1}) \frac{c^{v + u_{2}}}{1 - c} + \frac{c^{v}}{1 - c}) + ϵ) k_{2}

sequences

S_{i}^{″}

with

{roughLeft}_{S_{i}^{″}} \notin [LB (S_{i}^{″}) - (v + u_{2}), LB (S_{i}^{″})]

or

{roughRight}_{S_{i}^{″}} \notin [RB (S_{i}^{″}), RB (S_{i}^{″}) + (v + u_{2})]

. In other words, with a probability of at most

P_{2}

, the following statement is false:

(ii) There are no more than

V_{0} k_{2}

sequences

S_{i}^{″}

with

{roughLeft}_{S_{i}^{″}} \notin [LB (S_{i}^{″}) - (v + u_{2}), LB (S_{i}^{″})]

or

{roughRight}_{S_{i}^{″}} \notin [RB (S_{i}^{″}), RB (S_{i}^{″}) + (v + u_{2})]

, where

V_{0} = (2 (ς_{2} (n) + (v + u_{1}) \frac{c^{v + u_{2}}}{1 - c} + \frac{c^{v}}{1 - c}) + ϵ)

.

Assume that Statement (ii) is true. By Lemma 30, with a probability of at most

P_{3} = c^{k_{2}}

, the following statement is false:

(iii)

M (G_{L}, G_{R})

contains at most

(Q_{0} + V_{0} + ϵ) k_{2}

empty sequences.

We start from the rough left boundary,

{roughLeft}_{1}

, of

S_{1}

to match the other left boundaries of

S_{i}^{″}

for

i = 1, \dots, k_{2}

. There are in total at most

2 (v + u_{1})

candidates to consider.

By Lemma 30, if

M (G_{l}, G_{r})

, which consists of

k_{2}

matched regions, has at most

(Q_{0} + V_{0} + ϵ) k_{2}

empty sequences, then it has more than

(R + ϵ) k_{2}

from non-motif regions with a probability of at most

P_{4} = 2 (v + u_{1}) e^{- \frac{ϵ^{2} k_{2}}{3}}

. After the pattern is fixed, the events in the matching are independent of each other, which is why the Chernoff bound applies. Therefore, with a probability of at most

P_{4}

, the following statement is false:

(iv) If

M (G_{l}, G_{r})

contains at most

(Q_{0} + V_{0} + ϵ) k_{2}

empty sequences, then

M (G_{l}, G_{r})

contains at most

(Q_{0} + V_{0} + ϵ + (R + ϵ)) k_{2} = (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements not from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

.

Therefore, with a probability of at most

P_{a} + P_{1} + P_{2} + P_{3} + P_{4} = e^{- Ω (k_{1})} + e^{- Ω (k_{2})}

, the sequences are not ready for voting in the next phase, where being ready means that the following two conditions are satisfied:

(a) There exist

G_{l}

and

G_{r}

generated by the algorithm, such that

M (G_{l}, G_{r})

contains at most

(Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements not from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

.

(b) For every

G_{l}

and

G_{r}

generated by the algorithm such that

M (G_{l}, G_{r})

contains at most

(Q_{0} + V_{0} + ϵ) k_{2}

empty sequences,

M (G_{l}, G_{r})

contains at most

(Q_{0} + V_{0} + ϵ + (R + ϵ)) k_{2} = (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements not from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

.

Statement (1): For an

M (G_{l}, G_{r})

with at most

(Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements not from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

, we still have

k_{2} - (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements in

M (G_{l}, G_{r})

from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

. By the condition (iv) in this lemma, we have

k_{2} - (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2} > (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

. Therefore,

| G^{'} |

is selected to be the length of G in the Voting-Phase().

Statement (2): For an

M (G_{l}, G_{r}) = {G_{1}^{″}, \dots, G_{k_{2}}^{″}}

with at most

(Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements not from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

, we still have

k_{2} - (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

elements in

M (G_{l}, G_{r})

from motif regions

{ℵ (S_{i}^{″}) : 1 \leq i \leq k_{2}}

. By Corollary 18, with a probability of at most

e^{- \frac{ϵ^{2} k_{2}}{3}}

, there are more than

(α + ϵ) k_{2}

characters that are mutated in the same position among all

k_{2}

motif regions for the sequences in

Z_{2}

. We have that

k_{2} - (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2} - (α + ϵ) k_{2} > (Q_{0} + V_{0} + (R + 2 ϵ)) k_{2}

by the condition (iv) in this lemma. We let

G^{'} [j]

be the most frequent character among

G_{1}^{″} [j], \dots, G_{k_{2}}^{″} [j]

in Voting-Phase. Therefore, with a probability of at most

e^{- Ω (k_{1})} + e^{- Ω (k_{2})}

,

G^{'} [j] \neq G [j]

. ▮
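The voting step used in Statements (1) and (2) can be illustrated with a short sketch. The exact specification of Voting-Phase lies outside this section, so the code below rests on assumptions: the extracted regions are given as strings, empty strings stand for failed matches, and the length is chosen by plurality before the characters are voted position by position. The function name is hypothetical.

```python
from collections import Counter


def vote(regions):
    """Plurality vote over extracted regions; empty strings are skipped."""
    candidates = [g for g in regions if g]
    if not candidates:
        return ""
    # Statement (1): take the most common length among the candidates.
    length = Counter(len(g) for g in candidates).most_common(1)[0][0]
    same_length = [g for g in candidates if len(g) == length]
    # Statement (2): take the most frequent character at each position.
    return "".join(
        Counter(g[j] for g in same_length).most_common(1)[0][0]
        for j in range(length)
    )
```

For example, with the regions "ACGT", "ACGA", "TCGT", "ACGT", an empty match and a spurious region "GG", the vote recovers "ACGT", because the correct character holds a strict majority in every column.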

We use functions of several variables to characterize the computational time of the three algorithms. To unify their complexity analysis, we introduce the following notation.

Definition 33.

A function,

T (x, y) : N \times N \to N

, is nondecreasing if it is nondecreasing on both variables. If for arbitrary positive constants,

c_{1}

and

c_{2}

,

T (c_{1} x, c_{2} y) \leq c T (x, y)

for some positive constant, c, then

T (x, y)

is slow.

Lemma 34.

Assume that

t (x, y)

,

s (n, L)

and

g (n, l)

are non-decreasing slow functions. Assume that Collision-Detection(

S_{1}, U_{1}, S_{2}, U_{2}

) returns the result in

t (n, | | U_{1} | | + | | U_{2} | |)

time and the Point-Selection(

S_{1}, S_{2}, L

) selects

s (n, L)

positions in

g (n, L)

time. Assume that with a probability of at most

φ (n)

, the function Initial-Boundaries() does not stop when

L \leq | G | / 4

, and

| | U_{S_{2 i - 1}^{'}} | | + | | U_{S_{j}^{″}} | |

in the algorithm Recover-Motif is no more than

f (n, | G |)

.

Then, with a probability of at most

k_{1} φ (n)

, the entire algorithm Recover-Motif does not stop in the time complexity

(O (k_{1} (\sum_{i = 1}^{i_{0}} (t (n, s (n, \frac{n}{2^{i} n^{2 / 5}})) + g (n, \frac{n}{2^{i} n^{2 / 5}}))) + k_{1} h^{2} log n + k_{1} k_{2} (t (n, f (n, | G |)) + h^{2} log n) + k_{1} k_{2} (log n) (log log n)), O (k_{2}))

, where

i_{0}

is the largest j, such that

\frac{n}{2^{j} n^{2 / 5}} \leq min (n^{2 / 5}, | G |)

and

h = min (n^{2 / 5}, | G |)

.

Proof:

The function Initial-Boundaries() is executed

k_{1}

times. According to the condition that with a probability of at most

φ (n)

, the function Initial-Boundaries(.) does not stop when

L \leq | G | / 4

, we have the fact that with a probability of at most

k_{1} φ (n)

, one of those executions of Initial-Boundaries(.) does not stop when

L \leq | G | / 4

.

In the rest of the proof, we assume that all executions of Initial-Boundaries(.) stop when

L \leq | G | / 4

.

When

L = O (h)

, we detect rough left and right motif boundaries and run Improve-Boundaries(), which takes

O (h^{2} log n)

time by Lemma 21. It takes

O (\sum_{i = 1}^{i_{0}} (t (n, s (n, \frac{n}{2^{i} n^{2 / 5}})) + g (n, \frac{n}{2^{i} n^{2 / 5}})) + h^{2} log n)

time to run Initial-Boundaries(

S_{2 i - 1}^{'}, S_{2 i}^{'}

) one time for one pair (

S_{2 i - 1}^{'}, S_{2 i}^{'}

) in

Z_{1}

. It takes

O (k_{1} (\sum_{i = 1}^{i_{0}} (t (n, s (n, \frac{n}{2^{i} n^{2 / 5}})) + g (n, \frac{n}{2^{i} n^{2 / 5}}))) + k_{1} h^{2} log n)

time to run Initial-Boundaries(

S_{2 i - 1}^{'}, S_{2 i}^{'}

) one time for all pairs (

S_{2 i - 1}^{'}, S_{2 i}^{'}

) in

Z_{1}

.

It takes

k_{2} (t (n, f (n, | G |)) + h^{2} log n)

time to find the rough boundaries for all sequences in

Z_{2}

with a fixed sequence, S, from

Z_{1}

by executing the loop “For each

S_{j}^{″} \in Z_{2}

” in the algorithm Recover-Motif. It takes

k_{1} k_{2} (t (n, f (n, | G |)) + h^{2} log n)

time to find the rough boundaries for all sequences in

Z_{2}

via all sequences,

S_{2 i - 1}^{'}

, from

Z_{1}

through the loop “For each

S_{j}^{″} \in Z_{2}

” in the algorithm Recover-Motif.

Recall that parameters v and

u_{1}

are constants, and

u_{2}

is

O (log log n)

. Calling Match(

G_{l}, G_{r}, S_{i}^{″}

) takes

O ((v + u_{2}) log n)

time for each

S_{i}^{″} \in Z_{2}

. The total time for calling Match(

G_{l}, G_{r}, S_{i}^{″}

) is

O (k_{1} k_{2} (v + u_{1}) (v + u_{2}) log n) = O (k_{1} k_{2} (log n) (log log n))

.

The voting part takes

O (k_{2})

time to recover each character of the motif. ▮

6.4. Deterministic Algorithm for an $Ω (1)$ Mutation Rate

In this section, we give a deterministic algorithm for the case with an

Ω (1)

mutation rate. The performance of the algorithm is stated in Theorem 6.

Lemma 35.

Let algorithm-type=DETERMINISTIC-SUPERQUADRATIC. Assume that

d_{0} log n \leq L \leq | G | / 2

and

c_{0} log n \leq | G |

. Let

I_{1}

be a set of intervals of the positions of

S_{1}

that satisfies

[LB (S_{1}) - L, LB (S_{1}) + L] \cup [RB (S_{1}) - L, RB (S_{1}) + L] \subseteq \cup_{I^{'} \in I_{1}} I^{'}

. Let

I_{2}

be a set of intervals of the positions of

S_{2}

that satisfies

[LB (S_{2}) - L, LB (S_{2}) + L] \cup [RB (S_{2}) - L, RB (S_{2}) + L] \subseteq \cup_{I^{'} \in I_{2}} I^{'}

. Let

U_{1} =

Point-Selection

(S_{1}, L, I_{1})

,

U_{2} =

Point-Selection

(S_{2}, L, I_{2})

and

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

. Then:

i.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the left rough boundary, $L_{S_{1}}$ , has a distance of more than $d_{0} log n$ from $LB (S_{1})$ or the left rough boundary, $L_{S_{2}}$ , has a distance of more than $d_{0} log n$ from $LB (S_{2})$ .
ii.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the right rough boundary, $R_{S_{1}}$ , has a distance of more than $d_{0} log n$ from $RB (S_{1})$ or the right rough boundary, $R_{S_{2}}$ , has a distance of more than $d_{0} log n$ from $RB (S_{2})$ .

Proof:

For two sequences,

S_{1}

and

S_{2}

, let

ℵ (S_{a})

be the subsequence,

S_{a} [i_{a}, j_{a}]

, for

a = 1, 2

. By Corollary 18, with a probability of at most

P_{l} = 2 c^{d_{0} log n} \leq \frac{2}{5 \cdot 2^{x} n^{3}}

(see inequality (8) in Definition 14), there are more than

(α + ϵ) d_{0} log n

mutations in

S_{a} [i_{a}, i_{a} + d_{0} log n - 1]

for

a = 1, 2

.

With a probability of at most

P_{l}

, the left boundary position is missed during the matching. Similarly, with a probability of at most

P_{r} = P_{l}

, the right boundary position is missed.

Assume that

p_{1}

and

p_{2}

are two positions of

S_{1}

and

S_{2}

, respectively. If one of the two positions is outside the motif region and has a distance of more than

d_{0} log n

to the motif boundary, then with a probability of at most

c^{d_{0} log n} \leq \frac{1}{5 \cdot 2^{x} n^{3}}

(see inequality (8) in Definition 14), the two positions match, which requires

diff (Y_{1}, Y_{2}) \leq β

, by Lemma 19, where

Y_{a}

is a subsequence

S_{a} [p_{a}, p_{a} + d_{0} log n - 1]

for

a = 1, 2

. ▮

Lemma 36.

For the case algorithm-type=DETERMINISTIC-SUPERQUADRATIC, we have:

i.: Collision-Detection( $S_{1}, U_{1}, S_{2}, U_{2}$ ) takes $t (n, | | U_{1} | | + | | U_{2} | |) = O ({(| | U_{1} | | + | | U_{2} | |)}^{2} log n)$ time.
ii.: Point-Selection( $S_{1}, L, [1, | S_{1} |]$ ) selects $s (n, L) = O (n)$ positions in $g (n, L) = O (n)$ time.
iii.: $| | U_{S_{2 i - 1}^{'}} | | + | | U_{S_{j}^{″}} | |$ in the algorithm Recover-Motif is no more than $f (n, | G |) = O (| G | + n)$ .
iv.: With a probability of at most $\frac{k}{2^{x} n^{3}}$ , the algorithm Recover-Motif does not run in computational complexity $(O (k (n^{2} {(log n)}^{O (1)} + h^{2} log n)), O (k))$ .

Proof:

Statement i. The parameter,

ω_{DETERMINISTIC - SUPERQUADRATIC}

, is set to be β in Collision-Detection. It follows from the time complexity of the brute force method.

Statement ii. It follows from the implementation of Point-Selection().

Statement iii. It follows from the choice of Point-Selection(.) for the sublinear time algorithm at Recover-Motif(.).

Statement iv. It follows from Lemma 34 and Statements i, ii and iii. ▮

We give the proof for Theorem 6.

Proof:

[Theorem 6] The computational time part of this theorem follows from Lemma 36.

By Lemma 35, we let

ς_{1} (n) = \frac{1}{2^{x} n^{3}} \leq ς_{0}

for the probability bound,

ς_{1} (n)

, in the condition (i) of Lemma 32.

By Lemma 35, we can let

ς_{2} (n) = \frac{1}{2^{x} n^{3}} \leq ς_{0}

for the probability bound,

ς_{2} (n)

, in the condition (ii) of Lemma 32.

By inequality (12), the condition (iii) of Lemma 32 is satisfied.

By inequality (11), we know that the condition (iv) of Lemma 32 can be satisfied.

The failure probability part of this theorem follows from Lemma 21 and Lemma 32 by using the fact that

k_{1}, k_{2}

and k are of the same order (see equation (18)). ▮

6.5. Randomized Algorithms for Motif Detection

In this section, we present two randomized algorithms for motif detection. The first one is a sublinear time algorithm that can handle a mutation rate of

\frac{1}{{(log n)}^{2 + μ}}

, and the second one is a super-linear time algorithm that can handle an

Ω (1)

mutation rate. They also share some common functions.

Lemma 37.

Let

c_{1}

be a constant in

(0, 1)

. Assume m and n are two non-negative integers with

m \leq n

. Then, for every integer,

m_{1}

, with

0 \leq m_{1} \leq \frac{δ_{c_{1}} m}{ln n}

,

(\binom{n}{m_{1}}) c_{1}^{m} \leq e^{(m ln c_{1}) / 2}

, where constant

δ_{c_{1}} = \frac{- ln c_{1}}{2}

as defined in Definition 8.

Proof:

We have the inequalities:

\begin{matrix} (\binom{n}{m_{1}}) c_{1}^{m} & \leq & n^{m_{1}} c_{1}^{m} \end{matrix}

(23)

\begin{matrix} = & e^{m_{1} ln n} c_{1}^{m} \end{matrix}

(24)

\begin{matrix} \leq & e^{\frac{δ_{c_{1}} m}{ln n} ln n} c_{1}^{m} \end{matrix}

(25)

\begin{matrix} = & e^{δ_{c_{1}} m} e^{m ln c_{1}} \end{matrix}

(26)

\begin{matrix} = & e^{(m ln c_{1}) / 2} \end{matrix}

(27)

▮
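The inequality of Lemma 37 can be checked numerically for concrete values. The following sketch evaluates both sides for every admissible m_1; the function name is hypothetical, and the check assumes n ≥ 2 so that ln n is positive.

```python
import math


def lemma37_holds(n, m, c1):
    """Check binom(n, m1) * c1^m <= e^{(m ln c1) / 2} for all admissible m1."""
    delta = -math.log(c1) / 2                   # delta_{c_1} = -ln(c_1) / 2
    limit = int(delta * m / math.log(n))        # largest admissible integer m1
    rhs = math.exp(m * math.log(c1) / 2)
    return all(math.comb(n, m1) * c1**m <= rhs for m1 in range(limit + 1))
```

For instance, with n = 1000, m = 500 and c_1 = 0.5, the admissible range is 0 ≤ m_1 ≤ 25, and the inequality holds with a wide margin for each of these values.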

Lemma 38.

Let

S = U \cup V

be a set of n elements with

U \cap V = \emptyset

. Assume that

x_{1}, \dots, x_{m}

are m random elements in S. Then, with a probability of at most

(\binom{| | U | |}{m_{1}}) {(\frac{| | V | | + m_{1}}{n})}^{m}

, the list,

x_{1}, \dots, x_{m}

, contains at most

m_{1}

different elements from U (in other words,

| | {x_{1}, \dots, x_{m}} \cap U | | \leq m_{1}

).

Proof:

For a subset,

S^{'} \subseteq S

, with

| | S^{'} | | = m_{0}

for some integer,

m_{0} \geq 0

, the probability is at most

{(\frac{m_{0}}{n})}^{m}

that all elements,

x_{1}, \dots, x_{m}

, are in

S^{'}

. For every subset,

X \subseteq S

, with

| | X | | \leq m_{1}

, there exists another subset,

S^{'} \subseteq S

, such that

| | S^{'} | | = m_{1}

and

X \subseteq S^{'}

. We have that

\Pr [| | {x_{1}, \dots, x_{m}} \cap U | | \leq m_{1}] \leq \Pr [{x_{1}, \dots, x_{m}} \cap U \subseteq U^{'} for some U^{'} \subseteq U with | | U^{'} | | = m_{1}]

. There are

(\binom{| | U | |}{m_{1}})

subsets of U with a size of

m_{1}

. We have the probability of at most

(\binom{| | U | |}{m_{1}}) {(\frac{| | V | | + m_{1}}{n})}^{m}

that

x_{1}, \dots, x_{m}

contains at most

m_{1}

different elements in U. ▮
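For small parameters, the bound of Lemma 38 can be compared with the exact probability. The sketch below is an illustration only: the function names are hypothetical, the draws are assumed to be independent and uniform over $S$, and $m_{1} \leq | | U | |$ is assumed. The exact probability is obtained by tracking how many distinct elements of $U$ have been seen after each draw.

```python
from math import comb


def prob_at_most_distinct(u: int, v: int, m: int, m1: int) -> float:
    """Exact Pr[at most m1 distinct elements of U appear among m uniform draws from S]."""
    n = u + v
    # dist[j] = probability that exactly j distinct elements of U have been seen so far.
    dist = [1.0] + [0.0] * u
    for _ in range(m):
        nxt = [0.0] * (u + 1)
        for j, p in enumerate(dist):
            if p == 0.0:
                continue
            nxt[j] += p * (v + j) / n          # draw lands in V or on an already seen element of U
            if j < u:
                nxt[j + 1] += p * (u - j) / n  # draw lands on a new element of U
        dist = nxt
    return sum(dist[: m1 + 1])


def lemma38_bound(u: int, v: int, m: int, m1: int) -> float:
    """Upper bound of Lemma 38: binom(|U|, m1) * ((|V| + m1) / n)**m."""
    n = u + v
    return comb(u, m1) * ((v + m1) / n) ** m


if __name__ == "__main__":
    print(prob_at_most_distinct(10, 10, 30, 2), lemma38_bound(10, 10, 30, 2))
```

For $m_{1} = 0$, the two quantities coincide, since both equal ${(| | V | | / n)}^{m}$.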

Lemma 39.

Let β be a constant in

(0, 1)

and

c_{1} = 1 - \frac{β}{2}

. Let

m_{1} \leq \frac{δ_{c_{1}} m}{ln β n}

and

m \leq n^{1 - ϵ}

for some fixed

ϵ > 0

. Let

S_{1}

and

S_{2}

be two sets of n elements with

| | S_{1} \cap S_{2} | | \geq β n

and C be a set of the size

| | C | | \leq γ m_{1}

for some constant

γ \in (0, 1)

. Then, for all large n, with a probability of at most

2 e^{- \frac{(1 - γ) m_{1} m}{n}}

, we have

(A - C) \cap (B - C) = \emptyset

, where

A = {x_{1}, \dots, x_{m}}

and

B = {y_{1}, \dots, y_{m}}

are two sets, which may have multiplicities, of m random elements from

S_{1}

and

S_{2}

, respectively.

Proof:

In the entire proof of this lemma, we always assume that n is sufficiently large. We are going to give an upper bound about the probability that B does not contain any element in

A - C

. For each element,

y_{i} \in B

, with a probability of at most

1 - \frac{| | A | | - | | C | |}{n}

,

y_{i}

is not in

A - C

. Therefore, the probability is at most

{(1 - \frac{| | A | | - | | C | |}{n})}^{m}

that B does not contain any element in

A - C

.

By Lemma 38, the probability is at most

(\binom{β n}{m_{1}}) {(\frac{(1 - β) n + m_{1}}{n})}^{m}

that

| | A \cap (S_{1} \cap S_{2}) | | \leq m_{1}

. We have the inequalities:

\begin{matrix} \Pr [(A - C) \cap (B - C) = \emptyset] \end{matrix}

(28)

\begin{matrix} = & \Pr [(A - C) \cap (B - C) = \emptyset | | | A \cap (S_{1} \cap S_{2}) | | \geq m_{1}] \cdot \Pr [| | A \cap (S_{1} \cap S_{2}) | | \geq m_{1}] + \end{matrix}

(29)

\begin{matrix} \Pr [(A - C) \cap (B - C) = \emptyset | | | A \cap (S_{1} \cap S_{2}) | | < m_{1}] \cdot \Pr [| | A \cap (S_{1} \cap S_{2}) | | < m_{1}] \end{matrix}

(30)

\begin{matrix} \leq & \Pr [(A - C) \cap (B - C) = \emptyset | | | A \cap (S_{1} \cap S_{2}) | | \geq m_{1}] + \Pr [| | A \cap (S_{1} \cap S_{2}) | | < m_{1}] \end{matrix}

(31)

\begin{matrix} \leq & {(1 - \frac{| | A | | - | | C | |}{n})}^{m} + (\binom{β n}{m_{1}}) {(\frac{(1 - β) n + m_{1}}{n})}^{m} \end{matrix}

(32)

\begin{matrix} \leq & {(1 - \frac{| | (A \cap S_{1} \cap S_{2}) | | - | | C | |}{n})}^{m} + (\binom{β n}{m_{1}}) {(\frac{(1 - β) n + m_{1}}{n})}^{m} \end{matrix}

(33)

\begin{matrix} \leq & {(1 - \frac{(1 - γ) m_{1}}{n})}^{m} + (\binom{β n}{m_{1}}) {(\frac{(1 - β) n + m_{1}}{n})}^{m} \end{matrix}

(34)

\begin{matrix} \leq & e^{- \frac{(1 - γ) m_{1} m}{n}} + (\binom{β n}{m_{1}}) {(\frac{(1 - β) n + m_{1}}{n})}^{m} \end{matrix}

(35)

\begin{matrix} \leq & e^{- \frac{(1 - γ) m_{1} m}{n}} + (\binom{β n}{m_{1}}) {(1 - \frac{β}{2})}^{m} \end{matrix}

(36)

\begin{matrix} \leq & e^{- \frac{(1 - γ) m_{1} m}{n}} + e^{(m ln c_{1}) / 2} \end{matrix}

(37)

\begin{matrix} \leq & 2 e^{- \frac{(1 - γ) m_{1} m}{n}} \end{matrix}

(38)

The inequality,

{(1 - \frac{(1 - γ) m_{1}}{n})}^{m} \leq e^{- \frac{(1 - γ) m_{1} m}{n}}

, which is used from Equation (34) to Equation (35), follows from the fact that

1 - x \leq e^{- x}

. The transition from (35) to (36) follows from the fact that

\frac{m_{1}}{n} \leq \frac{β}{2}

, since

m_{1} = o (n)

, according to the conditions of the lemma.

It is easy to see that

\frac{2 (1 - γ) m_{1} m}{- m ln c_{1}} = \frac{2 (1 - γ) m_{1}}{- ln c_{1}} \leq n

for all large n. Thus,

- \frac{(1 - γ) m_{1} m}{n} \geq (m ln c_{1}) / 2

(note that

ln c_{1} < 0

as

c_{1} \in (0, 1)

). Thus, by Lemma 37,

(\binom{β n}{m_{1}}) {(1 - \frac{β}{2})}^{m} \leq e^{m ln c_{1} / 2} \leq e^{- \frac{(1 - γ) m_{1} m}{n}}

. This is why we have the transition from Equation (37) to Equation (38). Therefore,

\Pr [(A - C) \cap (B - C) = \emptyset] \leq 2 e^{- \frac{(1 - γ) m_{1} m}{n}}

. ▮
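The step from Equation (37) to Equation (38) can be spelled out with the signs made explicit. The chain below is only a restatement of the argument in the proof, assuming, as there, that n is large enough for the first inequality to hold; the first implication multiplies both sides by the positive quantity $\frac{(- ln c_{1}) m}{2 n}$.

```latex
\begin{align*}
\frac{2(1-\gamma)\, m_1}{-\ln c_1} \le n
  &\;\Longrightarrow\; \frac{(1-\gamma)\, m_1 m}{n} \le -\frac{m \ln c_1}{2} \\
  &\;\Longrightarrow\; \frac{m \ln c_1}{2} \le -\frac{(1-\gamma)\, m_1 m}{n} \\
  &\;\Longrightarrow\; e^{(m \ln c_1)/2} \le e^{-\frac{(1-\gamma)\, m_1 m}{n}} .
\end{align*}
```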

6.5.1. Randomized Algorithm for an $Ω (1)$ Mutation Rate

In this section, we give an algorithm for the case with an

Ω (1)

mutation rate. The performance of the algorithm is stated in Theorem 4.

Lemma 40.

Let algorithm-type=RANDOMIZED-SUBQUADRATIC. Assume that

d_{0} log n \leq L \leq | G | / 2

and

c_{0} log n \leq | G | < \frac{{(log n)}^{3 + τ}}{100}

. Let

I_{1}

be a set of intervals of the positions of

S_{1}

that satisfy

[LB (S_{1}) - L, LB (S_{1}) + L] \cup [RB (S_{1}) - L, RB (S_{1}) + L] \subseteq \cup_{I^{'} \in I_{1}} I^{'}

. Let

I_{2}

be a set of intervals of the positions of

S_{2}

that satisfy

[LB (S_{2}) - L, LB (S_{2}) + L] \cup [RB (S_{2}) - L, RB (S_{2}) + L] \subseteq \cup_{I^{'} \in I_{2}} I^{'}

. Let

U_{1} =

Point-Selection

(S_{1}, L, I_{1})

,

U_{2} =

Point-Selection

(S_{2}, L, I_{2})

and

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

. Then:

i.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the left rough boundary, $L_{S_{1}}$ , has at most $d_{0} log n$ distance from $LB (S_{1})$ and the left rough boundary, $L_{S_{2}}$ , has at most $d_{0} log n$ distance from $LB (S_{2})$ ;
ii.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the right rough boundary, $R_{S_{1}}$ , has at most $d_{0} log n$ distance from $RB (S_{1})$ and the right boundary of $R_{S_{2}}$ has at most $d_{0} log n$ distance from $RB (S_{2})$ .

Proof:

The proof is the same as Lemma 35 for the algorithm with algorithm-type=DETERMINISTIC-SUPERQUADRATIC. For two sequences,

S_{1}

and

S_{2}

, let

ℵ (S_{a})

be the subsequence

S_{a} [i_{a}, j_{a}]

for

a = 1, 2

. By Corollary 18, with a probability of at most

P_{l} = 2 c^{d_{0} log n} \leq \frac{2}{5 \cdot 2^{x} n^{3}}

(see inequality (8) in Definition 14), there are more than

(α + ϵ) d_{0} log n

mutations in

S_{a} [i_{a}, i_{a} + d_{0} log n - 1]

for

a = 1, 2

.

With a probability of at most

P_{l}

, the left boundary position is missed during the matching. Similarly, the right boundary is missed with a probability of at most

P_{r} = P_{l}

.

Assume that

p_{1}

and

p_{2}

are two positions of

S_{1}

and

S_{2}

, respectively. If one of the two positions is outside the motif region and is more than

d_{0} log n

away from the motif boundary, then the two positions match with a probability of at most

c^{d_{0} log n} \leq \frac{1}{5 \cdot 2^{x} n^{3}}

(see inequality (8) in Definition 14), since a match requires

diff (Y_{1}, Y_{2}) \leq β

by Lemma 19, where

Y_{a}

is a subsequence

S_{a} [p_{a}, p_{a} + d_{0} log n - 1]

for

a = 1, 2

. ▮

Lemma 41.

Let algorithm-type=RANDOMIZED-SUBQUADRATIC. Assume that

d_{0} log n \leq L \leq | G | / 2

and

| G | \geq \frac{{(log n)}^{3 + τ}}{100}

. Let

I_{1}

be a set of intervals of the positions of

S_{1}

that satisfy

[LB (S_{1}) - L, LB (S_{1}) + L] \cup [RB (S_{1}) - L, RB (S_{1}) + L] \subseteq \cup_{I^{'} \in I_{1}} I^{'}

. Let

I_{2}

be a set of intervals of the positions of

S_{2}

that satisfy

[LB (S_{2}) - L, LB (S_{2}) + L] \cup [RB (S_{2}) - L, RB (S_{2}) + L] \subseteq \cup_{I^{'} \in I_{2}} I^{'}

. Let

U_{1} =

Point-Selection

(S_{1}, L, I_{1})

,

U_{2} =

Point-Selection

(S_{2}, L, I_{2})

and

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

. Then:

i.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the left rough boundary, $L_{S_{1}}$ , has at most a $2 L$ distance from $LB (S_{1})$ and the left rough boundary $L_{S_{2}}$ has at most a $2 L$ distance from $LB (S_{2})$ ;
ii.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the right rough boundary, $R_{S_{1}}$ , has at most a $2 L$ distance from $RB (S_{1})$ and the right boundary of $R_{S_{2}}$ has at most a $2 L$ distance from $RB (S_{2})$ .

Proof:

We prove the following two statements which imply the lemma.

i.: With a probability of at most $\frac{1}{2^{x} n^{3}}$ , there are no intervals, $A_{i}$ from $S_{1}$ and $B_{j}$ from $S_{2}$ , such that: (1) $| | A_{i} (S_{1}, ℵ (S_{1})) \cap B_{j} (S_{2}, ℵ (S_{2})) | |$ is at least $\frac{L}{2}$ ; (2) the left boundary of $S_{1}$ has at most a $2 L$ distance from $A_{i}$ ; (3) the left boundary of $S_{2}$ has at most a $2 L$ distance from $B_{j}$ ; and (4) there is a collision between the sampled positions in $A_{i}$ and $B_{j}$ .
ii.: With a probability of at most $\frac{1}{2^{x} n^{3}}$ , there are no intervals, $A_{i}$ from $S_{1}$ and $B_{j}$ from $S_{2}$ , such that: (1) $| | A_{i} (S_{1}, ℵ (S_{1})) \cap B_{j} (S_{2}, ℵ (S_{2})) | |$ is at least $\frac{L}{2}$ ; (2) the right boundary of $S_{1}$ has at most a $2 L$ distance from $A_{i}$ ; (3) the right boundary of $S_{2}$ has at most a $2 L$ distance from $B_{j}$ ; and (4) there is a collision between the sampled positions in $A_{i}$ and $B_{j}$ .

We only prove statement i. The proof for statement ii is similar.

Select

A_{i}

from

S_{1}

and

B_{j}

from

S_{2}

to be the first pair of intervals with

| | A_{i} (S_{1}, ℵ (S_{1})) \cap B_{j} (S_{2}, ℵ (S_{2})) | | \geq \frac{L}{2}

. It is easy to see that such a pair exists and that both intervals are within a distance of at most

2 L

of the left boundary. This is because when a leftmost interval of a length of L is fully inside the motif region of the first sequence, we can always find a second interval from the second sequence with an intersection of a length of at least

\frac{L}{2}

.

Replace m by

M (L)

,

m_{1}

by

M_{1} (L)

(see Definition 10) and n by L to apply Lemma 39. We do not consider any damaged position in this algorithm; therefore, let C be empty.

With a probability of at most

o (\frac{1}{2^{x} n^{3}})

, there is no point in

(A_{i} (S_{1}, ℵ (S_{1})) - C) \cap (B_{j} (S_{2}, ℵ (S_{2})) - C)

. The subsequences of a length of

d_{0} log n

starting at the point from

S_{1}

and

S_{2}

fail to have the difference bounded by β with a probability of at most

o (\frac{1}{2^{x} n^{3}})

by Lemma 20. With a probability of at most

o (\frac{1}{2^{x} n^{3}})

, the rough boundaries are not detected within a distance of at most

2 L

of the exact motif boundaries. ▮

Lemma 42.

For the case algorithm-type=RANDOMIZED-SUBQUADRATIC, we have:

i.: Collision-Detection( $S_{1}, U_{1}, S_{2}, U_{2}$ ) takes $t (n, | | U_{1} | | + | | U_{2} | |) = O ({(| | U_{1} | | + | | U_{2} | |)}^{2} log n)$ time.
ii.: Point-Selection( $S_{1}, L, [1, | S_{1} |]$ ) selects $s (n, L) = O ((\frac{n}{L}) M (L))$ positions in $g (n, L) = O (s (n, L))$ time if $L \geq \frac{{(log n)}^{3 + τ}}{100}$ .
iii.: Point-Selection( $S_{1}, L, [1, | S_{1} |]$ ) selects $s (n, L) = O (n)$ positions in $g (n, L) = O (n)$ time if $L < \frac{{(log n)}^{3 + τ}}{100}$ .
iv.: $| | U_{S_{2 i - 1}^{'}} | | + | | U_{S_{j}^{″}} | |$ in the algorithm Recover-Motif is no more than $f (n, | G |) = O (M (| G |) + \frac{n}{| G |} M (| G |))$ .
v.: With a probability of at most $\frac{k}{2^{x} n^{3}}$ , the algorithm Recover-Motif does not stop in $(O (k (\frac{n^{2}}{| G |} {(log n)}^{O (1)} + h^{2} log n)), O (k))$ time.

Proof:

Statement i. The parameter,

ω_{RANDOMIZED - SUBQUADRATIC}

, is set to be β in Collision-Detection. It follows from the time complexity of the brute force method.

Statements ii and iii. They follow from the implementation of Point-Selection().

Statement iv. It follows from the choice of Point-Selection(.) for the sublinear time algorithm at Recover-Motif(.).

Statement v. It follows from Lemma 40, Lemma 41, Lemma 34 and Statements i, ii, iii and iv. ▮
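The quadratic cost in Statement i comes from comparing every sampled position of one sequence with every sampled position of the other. A minimal sketch of such a brute-force comparison is given below. It is an illustration under stated assumptions, not the Collision-Detection procedure of this paper: the names `hamming_fraction` and `brute_force_collisions` are hypothetical, positions are 0-based, the window length `w` stands for $d_{0} log n$, and the sketch only lists colliding pairs without deriving rough boundaries from them.

```python
from typing import List, Sequence, Tuple


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def brute_force_collisions(
    s1: str, u1: Sequence[int], s2: str, u2: Sequence[int], w: int, beta: float
) -> List[Tuple[int, int]]:
    """Return all pairs (p, q) whose length-w windows differ in at most a beta fraction.

    Every sampled position of s1 is compared with every sampled position of s2,
    so the number of window comparisons is |u1| * |u2|, each costing O(w).
    """
    pairs = []
    for p in u1:
        if p + w > len(s1):
            continue  # window would run past the end of s1
        for q in u2:
            if q + w > len(s2):
                continue  # window would run past the end of s2
            if hamming_fraction(s1[p:p + w], s2[q:q + w]) <= beta:
                pairs.append((p, q))
    return pairs


if __name__ == "__main__":
    s1 = "GGGGACGTACGTGGGG"   # exact copy of ACGTACGT starting at index 4
    s2 = "TTACGTACCTTTTT"     # copy with one mutation starting at index 2
    print(brute_force_collisions(s1, [0, 4, 8], s2, [2, 6], w=8, beta=0.25))
```

With a window length of $O (log n)$, the total work is quadratic in the number of sampled positions times a logarithmic factor, which matches the form of the bound in Statement i.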

We give the proof for Theorem 4.

Proof:

[Theorem 4] The computational time part of this theorem follows from Lemma 42.

By Lemma 40, we can let

ς_{1} (n) = \frac{1}{2^{x} n^{3}} \leq ς_{0}

for the probability bound,

ς_{1} (n)

, in the condition (i) of Lemma 32.

By Lemma 41, we can let

ς_{2} (n) = \frac{1}{2^{x} n^{3}} \leq ς_{0}

for the probability bound,

ς_{2} (n)

, in the condition (ii) of Lemma 32.

By inequality (12), the condition (iii) of Lemma 32 is satisfied.

By inequality (11), we know that the condition (iv) of Lemma 32 can be satisfied.

The failure probability part of this theorem follows from Lemma 21 and Lemma 32 by using the fact that

k_{1}, k_{2}

and k are of the same order (see Equation (18)). ▮

6.5.2. Sublinear Time Algorithm for a $\frac{1}{{(log n)}^{2 + μ}}$ Mutation Rate

In this section, we give an algorithm for the case with at most

α = \frac{1}{{(log n)}^{2 + μ}}

mutation rate. The performance of the algorithm is stated in Theorem 2.

Definition 43.

A position, p, in the motif region,

ℵ (S)

, of an input sequence, S, is damaged if there exists at least one mutation in

S [p, p + d_{0} log n - 1]

, where

d_{0}

is defined in item (xvii) in Definition 8.
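Definition 43 can be made concrete with a few lines of code. The sketch below is illustrative only: the function name `damaged_positions` is hypothetical, positions are 1-based and inclusive as in the definition, and the window length `w` stands for $d_{0} log n$.

```python
from typing import Iterable, List


def damaged_positions(motif_start: int, motif_end: int,
                      mutated: Iterable[int], w: int) -> List[int]:
    """Positions p in [motif_start, motif_end] with a mutation in S[p, p + w - 1]."""
    mutated_set = set(mutated)
    return [
        p
        for p in range(motif_start, motif_end + 1)
        if any(q in mutated_set for q in range(p, p + w))
    ]


if __name__ == "__main__":
    # One mutation at position 15 and a window of length 4.
    print(damaged_positions(10, 30, [15], 4))
```

In this sketch a single mutation damages at most `w` positions, namely the positions whose window covers it.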

Lemma 44.

Assume that

α L = {(log n)}^{1 + Ω (1)}

. With a probability of at most

e^{- {(log n)}^{1 + Ω (1)}}

, there are more than

\frac{M_{1} (L)}{{(log n)}^{Ω (1)}}

positions that are from the

M (L)

(see Definition 10 for

M (.)

and

M_{1} (.)

) sampled positions that are damaged in an interval of a length of L.

Proof:

By Theorem 16, with a probability of at most

P_{1} = 2^{- α L}

(let

δ = 2

), there are more than

3 α L

mutations in an interval of a length of L. Therefore, with a probability of at most

2^{- α L} = e^{- {(log n)}^{1 + Ω (1)}}

, there are more than

3 α L log n

positions that are damaged. Therefore, each random position in an interval of a length of L has at most a probability of

\frac{3 α L log n}{L} = 3 α log n

to be damaged.

Since

α = (\frac{1}{{(log n)}^{2 + Ω (1)}})

and

M (L)

positions are sampled, by Theorem 16, with a probability of at most

P_{2} = 2^{- (3 α log n) M (L)} = e^{- {(log n)}^{1 + Ω (1)}}

(let

δ = 2

), the number of damaged positions sampled in an interval of a length of L is more than

((1 + δ) 3 α log n) M (L) = (9 α log n) M (L) = \frac{M_{1} (L)}{{(log n)}^{Ω (1)}}

. Thus, with a total probability of at most

P_{1} + P_{2} = e^{- {(log n)}^{1 + Ω (1)}}

, there are more than

\frac{M_{1} (L)}{{(log n)}^{Ω (1)}}

damaged positions that are from the

M (L)

sampled positions in an interval of a length of L. ▮

Definition 45.

Let A be a set of positions in an input sequence, S, with

ℵ (S) = S [i, j]

. Let

A (S, ℵ (S)) = {x - i + 1 | x \in A \cap [i, j]}

. In other words,

A (S, ℵ (S))

contains all the positions of A in

ℵ (S)

.
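The re-indexing in Definition 45 can be illustrated as follows. This is a small sketch with a hypothetical function name; it assumes 1-based positions and a motif region $S [i, j]$ given by its two end points.

```python
from typing import Iterable, Set


def relative_positions(a: Iterable[int], i: int, j: int) -> Set[int]:
    """Return {x - i + 1 : x in A and i <= x <= j}, the positions of A inside S[i, j]."""
    return {x - i + 1 for x in a if i <= x <= j}


if __name__ == "__main__":
    # Positions 10 and 12 lie inside the region [10, 30]; 3 and 40 do not.
    print(relative_positions({3, 10, 12, 40}, 10, 30))
```

Two position sets taken from different sequences become comparable after this shift, because both are expressed relative to the start of their own motif region.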

Lemma 46.

Let algorithm-type=RANDOMIZED-SUBLINEAR. Assume that

| G | < \frac{{(log n)}^{3 + τ}}{100}

and L is an integer with

d_{0} log n \leq L \leq | G | / 2

. Let

I_{1}

be a set of intervals of the positions of

S_{1}

that satisfy

[LB (S_{1}) - L, LB (S_{1}) + L] \cup [RB (S_{1}) - L, RB (S_{1}) + L] \subseteq \cup_{I^{'} \in I_{1}} I^{'}

. Let

I_{2}

be a set of intervals of the positions of

S_{2}

that satisfy

[LB (S_{2}) - L, LB (S_{2}) + L] \cup [RB (S_{2}) - L, RB (S_{2}) + L] \subseteq \cup_{I^{'} \in I_{2}} I^{'}

. Let

U_{1} =

Point-Selection

(S_{1}, L, I_{1})

,

U_{2} =

Point-Selection

(S_{2}, L, I_{2})

and

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

. Then:

i.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the left rough boundary, $L_{S_{1}}$ , has at most a $| G | / 4$ distance from $LB (S_{1})$ and the left rough boundary, $L_{S_{2}}$ , has at most a $| G | / 4$ distance from $LB (S_{2})$ .
ii.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the right rough boundary, $R_{S_{1}}$ , has at most a $| G | / 4$ distance from $RB (S_{1})$ and the right boundary of $R_{S_{2}}$ has at most a $| G | / 4$ distance from $RB (S_{2})$ .

Proof:

For two sequences,

S_{1}

and

S_{2}

, it is easy to see that there is a common position in both motif regions of the two sequences, such that there is no mutation in the next

d_{0} log n

characters with a high probability. This is because the mutation probability is small.

By Theorem 16, with a probability of at most

P_{l} = 2^{- α | G | / 4}

(let

δ = 2

), there are more than

3 α \frac{| G |}{4}

mutated characters in the interval

ℵ (S_{i}) [1, \frac{| G |}{4}]

for

i = 1, 2

. Since the mutation probability is

α = (\frac{1}{{(log n)}^{2 + Ω (1)}})

, with a probability of at most

2^{- α | G | / 4} = e^{- {(log n)}^{1 + Ω (1)}}

, there are more than

3 α \cdot \frac{| G |}{4} d_{0} log n

positions to be damaged in

ℵ (S_{i}) [1, \frac{| G |}{4}]

.

Similarly, we have a probability of at most

P_{r} = e^{- {(log n)}^{1 + Ω (1)}}

that there are more than

((3 α d_{0} log n) \cdot \frac{| G |}{4}) = \frac{| G |}{{(log n)}^{Ω (1)}}

damaged positions in

ℵ (S_{i}) [\frac{3 | G |}{4} - 1, | G |]

.

Now, we assume that the left side has at most

((3 α d_{0} log n) \cdot \frac{| G |}{4}) = \frac{| G |}{{(log n)}^{Ω (1)}}

damaged positions in

ℵ (S_{i}) [1, \frac{| G |}{4}]

and that the right side has at most

((3 α d_{0} log n) \cdot \frac{| G |}{4}) = \frac{| G |}{{(log n)}^{Ω (1)}}

damaged positions in

ℵ (S_{i}) [\frac{3 | G |}{4} - 1, | G |]

. Since each position in each interval of a length of L is selected in Point-Selection

(S_{1}, S_{2}, L)

, it is easy to verify the conclusions of this lemma. ▮

Lemma 47.

Let algorithm-type=RANDOMIZED-SUBLINEAR. Assume that

| G | \geq \frac{{(log n)}^{3 + τ}}{100}

and

d_{0} log n \leq L \leq | G | / 2

. Let

I_{1}

be a set of intervals of the positions of

S_{1}

that satisfy

[LB (S_{1}) - L, LB (S_{1}) + L] \cup [RB (S_{1}) - L, RB (S_{1}) + L] \subseteq \cup_{I^{'} \in I_{1}} I^{'}

. Let

I_{2}

be a set of intervals of the positions of

S_{2}

that satisfy

[LB (S_{2}) - L, LB (S_{2}) + L] \cup [RB (S_{2}) - L, RB (S_{2}) + L] \subseteq \cup_{I^{'} \in I_{2}} I^{'}

. Let

U_{1} =

Point-Selection

(S_{1}, L, I_{1})

,

U_{2} =

Point-Selection

(S_{2}, L, I_{2})

and

(L_{S_{1}}, R_{S_{1}}, L_{S_{2}}, R_{S_{2}}) =

Collision-Detection

(S_{1}, U_{1}, S_{2}, U_{2})

. Then:

i.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , the left rough boundary, $L_{S_{1}}$ , has at most a $2 L$ distance from $LB (S_{1})$ and the left rough boundary, $L_{S_{2}}$ , has at most a $2 L$ distance from $LB (S_{2})$ .
ii.: With a probability of at most $\frac{1}{2^{x} n^{3}}$ , the right rough boundary, $R_{S_{1}}$ , has at most a $2 L$ distance from $RB (S_{1})$ and the right boundary of $R_{S_{2}}$ has at most a $2 L$ distance from $RB (S_{2})$ .

Proof:

We prove the following two statements, which imply the lemma.

i.: With a probability of at most $\frac{1}{2^{x} n^{3}}$ , there are no intervals, $A_{i}$ from $S_{1}$ and $B_{j}$ from $S_{2}$ , such that: (1) $| | A_{i} (S_{1}, ℵ (S_{1})) \cap B_{j} (S_{2}, ℵ (S_{2})) | |$ is at least $\frac{L}{2}$ ; (2) the left boundary of $S_{1}$ has at most a $2 L$ distance from $A_{i}$ ; (3) the left boundary of $S_{2}$ has at most a $2 L$ distance from $B_{j}$ ; and (4) there is a collision between the sampled positions in $A_{i}$ and $B_{j}$ .
ii.: with a probability of at most $\frac{1}{2^{x} n^{3}}$ , there are no intervals, $A_{i}$ from $S_{1}$ and $B_{j}$ from $S_{2}$ , such that: (1) $| | A_{i} (S_{1}, ℵ (S_{1})) \cap B_{j} (S_{2}, ℵ (S_{2})) | |$ is at least $\frac{L}{2}$ ; (2) the right boundary of $S_{1}$ has at most a $2 L$ distance from $A_{i}$ ; (3) the right boundary of $S_{2}$ has at most a $2 L$ distance from $B_{j}$ ; and (4) there is a collision between the sampled positions in $A_{i}$ and $B_{j}$ .

We only prove statement i. The proof for statement ii is similar to that for statement i. Assume that L satisfies the condition of this lemma. Select

A_{i}

from

S_{1}

and

B_{j}

from

S_{2}

to be the first pair of intervals of a size of L with

| | A_{i} (S_{1}, ℵ (S_{1})) \cap B_{j} (S_{2}, ℵ (S_{2})) | | \geq \frac{L}{2}

. This is because when a leftmost interval of a length of L is fully inside the motif region of the first sequence, we can always find the second interval from the second sequence with an intersection of a length of at least

\frac{L}{2}

.

Replace m by

M (L)

,

m_{1}

by

M_{1} (L)

(see Definition 10) and n by L to apply Lemma 39. We also let C be the set of damaged positions in

ℵ (S_{1})

and

ℵ (S_{2})

caused by the mutated positions. Except with a probability of at most

o (\frac{1}{2^{x} n^{3}})

, C has a size of

o (M_{1} (L))

by Lemma 44. With a probability of at most

o (\frac{1}{2^{x} n^{3}})

, there is no point in

(A_{i} (S_{1}, ℵ (S_{1})) - C) \cap (B_{j} (S_{2}, ℵ (S_{2})) - C)

. The existence of such a point makes

L_{S_{1}}

and

L_{S_{2}}

have a distance of at most

2 L

to

LB (S_{1})

and

LB (S_{2})

, respectively. ▮

Lemma 48.

For the case algorithm-type=RANDOMIZED-SUBLINEAR, we have:

i.: Collision-Detection( $S_{1}, U_{1}, S_{2}, U_{2}$ ) takes $t (n, | | U_{1} | | + | | U_{2} | |) = O ((| | U_{1} | | + | | U_{2} | |) log n)$ time.
ii.: Point-Selection( $S_{1}, L, [1, | S_{1} |]$ ) selects $s (n, L) = O ((\frac{n}{L}) M (L))$ positions in $g (n, L) = O (s (n, L))$ time if $L \geq \frac{{(log n)}^{3 + τ}}{100}$ .
iii.: Point-Selection( $S_{1}, L, [1, | S_{1} |]$ ) selects $s (n, L) = O (n)$ positions in $g (n, L) = O (n)$ time if $L < \frac{{(log n)}^{3 + τ}}{100}$ .
iv.: $| | U_{S_{2 i - 1}^{'}} | | + | | U_{S_{j}^{″}} | |$ in the algorithm Recover-Motif is no more than $f (n, | G |) = O (M (| G |) + \frac{n}{| G |} M (| G |))$ .
v.: With a probability of at most $\frac{k}{2^{x} n^{3}}$ , the algorithm Recover-Motif does not stop in $(O (k (\frac{n}{\sqrt{h}} {(log n)}^{\frac{5}{2}} + h^{2} log n)), O (k))$ time.

Proof:

Statement i. The parameter,

ω_{RANDOMIZED - SUBLINEAR}

, is set to be zero in Collision-Detection. It follows from the time complexity of bucket sorting, which is described in standard algorithm textbooks.

Statements ii and iii. They follow from the implementation of Point-Selection().

Statement iv. It follows from the choice of Point-Selection(.) for the sublinear time algorithm at Recover-Motif(.).

Statement v. It follows from Lemma 46, Lemma 47, Lemma 34 and Statements i, ii, iii and iv. ▮
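When the allowed difference is zero, as in Statement i, collisions are exact matches of windows, so sampled positions can be grouped by window content instead of being compared pairwise. The sketch below illustrates this idea with a hash table, which is a swapped-in substitute for the bucket sorting mentioned in the proof. The name `exact_collisions` is hypothetical, positions are 0-based, and `w` stands for $d_{0} log n$.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def exact_collisions(
    s1: str, u1: Sequence[int], s2: str, u2: Sequence[int], w: int
) -> List[Tuple[int, int]]:
    """Return all pairs (p, q) with s1[p:p+w] == s2[q:q+w] for sampled p and q."""
    # Group the sampled positions of s1 by the content of their window.
    buckets: Dict[str, List[int]] = defaultdict(list)
    for p in u1:
        if p + w <= len(s1):
            buckets[s1[p:p + w]].append(p)
    # Look up the window of each sampled position of s2 in the buckets.
    pairs = []
    for q in u2:
        if q + w <= len(s2):
            for p in buckets.get(s2[q:q + w], []):
                pairs.append((p, q))
    return pairs


if __name__ == "__main__":
    s1 = "GGGGACGTACGTGGGG"
    s2 = "TTACGTACGTTT"
    print(exact_collisions(s1, [0, 4, 8], s2, [2, 3], w=8))
```

Each sampled position contributes one window of length `w`, so the grouping work is linear in the number of sampled positions times the window length, apart from the cost of listing the matching pairs.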

We give the proof for Theorem 2.

Proof:

[Theorem 2] The computational time part of this theorem follows from Lemma 48.

By Lemma 46, we can let

ς_{1} (n) = \frac{1}{2^{x} n^{3}} \leq ς_{0}

for the probability bound,

ς_{1} (n)

, in the condition (i) of Lemma 32.

By Lemma 47, we can let

ς_{2} (n) = \frac{1}{2^{x} n^{3}} \leq ς_{0}

for the probability bound,

ς_{2} (n)

, in the condition (ii) of Lemma 32.

By inequality (12), the condition (iii) of Lemma 32 is satisfied.

By inequality (11), we know that the condition (iv) of Lemma 32 can be satisfied.

The failure probability part of this theorem follows from Lemma 21 and Lemma 32 by using the fact that

k_{1}, k_{2}

and k are of the same order (see Equation (18)). ▮

6.6. Experiments on Simulated Datasets

To evaluate our approach to the motif discovery problem, we implemented our algorithm in Java. All tests were run on a laptop with an Intel dual-core 1.5 GHz CPU and 3.0 GB of memory. In the first experiment, we tested our algorithm on several simulated datasets, all generated from our probabilistic model with a small mutation rate. Each input set contains 15 or 20 sequences with a length of 500 or 600 base pairs. Each base pair of the simulated gene sequences was generated independently with the same occurrence probability. A motif of a fixed length was planted at a random position in each input sequence. The minimum Hamming distances between the results and the consensus were recorded.
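A dataset of this kind can be generated along the following lines. This is a minimal sketch under stated assumptions, not the generator used in our experiments: the function name `plant_motif` and its parameters are hypothetical, the alphabet is taken to be {A, C, G, T}, and each motif character is mutated with probability exactly `alpha` to a different character chosen uniformly at random.

```python
import random
from typing import List, Tuple


def plant_motif(k: int, n: int, motif: str, alpha: float,
                alphabet: str = "ACGT", seed: int = 0) -> Tuple[List[str], List[int]]:
    """Generate k random sequences of length n, each with a mutated copy of motif.

    Returns the sequences and the 0-based start position of the planted copy in each.
    """
    rng = random.Random(seed)
    sequences, starts = [], []
    for _ in range(k):
        # Background: independent uniform characters.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Approximate copy: each character mutates independently with probability alpha.
        copy = []
        for g in motif:
            if rng.random() < alpha:
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)
        # Plant the copy at a uniformly random position.
        start = rng.randrange(n - len(motif) + 1)
        background[start:start + len(motif)] = copy
        sequences.append("".join(background))
        starts.append(start)
    return sequences, starts


if __name__ == "__main__":
    seqs, starts = plant_motif(k=20, n=600, motif="ACGTACGTACGTACG", alpha=0.05)
    print(len(seqs), len(seqs[0]), starts[:3])
```

The call in the usage example mirrors the shape of Set 1 in Table 1 (20 sequences of length 600 with a motif of length 15); the motif string and the value of `alpha` are placeholders.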

6.7. Experiments

There are many other tools for detecting and analyzing motifs, such as the EM method [23], MEME [24], Gibbs [25], Compo [26], MochiView [27], PhyME [28], HeliCis [29] and WebMotif [30], among others. Each of them has its advantages and disadvantages.

Table 1 shows the results on simulated datasets. The results of our algorithm on the simulated datasets are satisfactory: it found the motif in every sequence and recovered the consensus with an accuracy rate of 100%. If a dataset has a high mutation rate, the number of repetitions can be increased to make the result more accurate. We also recorded the total time cost of each test, which mainly depends on the number of repetitions. In the experimental tables, N is the number of sequences, M is the maximum length of the sequences, L is the length of the motifs and R is the number of iterations. GCR1 is a well-known DNA binding protein, whose ability to bind DNA depends on the CTTCC sequence motif [31]. Several other popular data sets are also used in the experiments on our motif detection method and in its comparison with other methods.

Table 1. Results on simulated data.

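| | N | M | L | R | Accuracy rate (%) | Time cost (s) |
|---|---|---|---|---|---|---|
| Set 1 | 20 | 600 | 15 | 60 | 100 | 98 |
| Set 2 | 15 | 600 | 15 | 10 | 100 | 18 |
| Set 3 | 20 | 600 | 12 | 15 | 100 | 22 |
| Set 4 | 20 | 500 | 15 | 40 | 100 | 79 |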

In the second experiment, we tested our algorithm on a real sequence set, which was obtained from SCPD. SCPD contains a large number of gene data and transcription factors of yeast. Sequences in the same set are all regulated by a common motif. We chose 1,000 bp as the length of the input sequences. In order to show the advantages of our algorithm, we also compared the result of our algorithm with the results of several other existing motif finding methods on the same dataset, such as Gibbs, MEME, Info-Gibbs and Consensus. Table 2 shows the details of the data set we used in the experiment.

Table 2. Number of Sequences and Motif Length.

| | Bas1 | GCN4 | GCR1 | Rap1Ebf1 | HSE-HTSF |
|---|---|---|---|---|---|
| N | 6 | 9 | 6 | 15 | 5 |
| L | 10 | 10 | 10 | 15 | 10 |

Table 3 shows the results of the five algorithms on biological data sets. Table 4 and Figure 2 give the average mismatch numbers of each algorithm. We chose four well-known motif-detection programs for the comparison. From Table 3 and Table 4, we see that the average mismatch numbers of our algorithm on the data sets GCN4 and GCR1 are lower than those of the other four methods. On the data sets Bas1, Rap1Ebf1 and HSE-HTSF, our algorithm also shows satisfactory performance compared with the other methods.

In addition, our algorithm runs fast compared with the other four motif finding methods. Because the starting pattern is represented by a string, our algorithm avoids some time-consuming computations required by Gibbs sampling and EM methods, such as the computation of likelihoods. Using this feature, the program takes the consensus string produced by the voting operation in the last iteration as the new starting pattern and repeats the voting operation until there is no further improvement. Experimental results show that if the number of iterations is set to a large integer, the program gives more accurate results within a reasonable time.
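The repeated voting described above can be sketched as follows. This is an illustration under stated assumptions rather than our Java implementation: all function names are hypothetical, the best occurrence of a pattern in a sequence is taken to be the window with the minimum Hamming distance (the first such window on ties), the vote is a column-wise majority, and improvement is measured by the total number of mismatches.

```python
from collections import Counter
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_window(seq: str, pattern: str) -> str:
    """Window of seq with the minimum Hamming distance to pattern."""
    w = len(pattern)
    windows = (seq[i:i + w] for i in range(len(seq) - w + 1))
    return min(windows, key=lambda win: hamming(win, pattern))


def vote(windows: List[str]) -> str:
    """Column-wise majority character over equal-length windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))


def total_mismatch(seqs: List[str], pattern: str) -> int:
    """Sum over all sequences of the distance from pattern to its best window."""
    return sum(hamming(best_window(s, pattern), pattern) for s in seqs)


def refine(seqs: List[str], start: str, max_iter: int = 50) -> Tuple[str, int]:
    """Repeat the voting step until the total mismatch count stops improving."""
    pattern, score = start, total_mismatch(seqs, start)
    for _ in range(max_iter):
        candidate = vote([best_window(s, pattern) for s in seqs])
        cand_score = total_mismatch(seqs, candidate)
        if cand_score >= score:
            break  # no further improvement
        pattern, score = candidate, cand_score
    return pattern, score


if __name__ == "__main__":
    seqs = ["TTTTACGTACGTTTTT", "GGACGTACGTGGGGGG", "CCCCCCACGTACGTCC"]
    print(refine(seqs, "ACGTACGA"))
```

Only string comparisons and character counts are involved in each round, which is the sense in which no likelihood computation is needed.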

Table 3. Total number of mismatch positions.

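| | Bas1 | GCN4 | GCR1 | Rap1Ebf1 | HSE-HTSF |
|---|---|---|---|---|---|
| Our Algorithm | 10 | 8 | 4 | 45 | 5 |
| Gibbs | 8 | 51 | 5 | 202 | 7 |
| MEME | 8 | 15 | 10 | 32 | 3 |
| InfoGibbs | 9 | 21 | 5 | 46 | 9 |
| Consensus | 8 | 9 | 5 | 42 | 7 |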

Table 4. Average mismatch numbers per sequence.

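| | Bas1 | GCN4 | GCR1 | Rap1Ebf1 | HSE-HTSF |
|---|---|---|---|---|---|
| Our Algorithm | 1.67 | 0.89 | 0.67 | 3 | 1 |
| Gibbs | 1.33 | 5.6 | 0.83 | 13.46 | 1.4 |
| MEME | 1.33 | 1.67 | 1.67 | 2.13 | 0.6 |
| InfoGibbs | 1.5 | 2.33 | 0.83 | 3.06 | 1.8 |
| Consensus | 1.33 | 1 | 0.83 | 2.8 | 1.4 |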
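The entries of Table 4 are the totals of Table 3 divided by the number of sequences in Table 2. The short sketch below recomputes them; the variable and function names are hypothetical, and the numbers are copied from Tables 2 and 3.

```python
from typing import Dict, List

# Number of sequences per data set (Table 2).
N_SEQUENCES: Dict[str, int] = {
    "Bas1": 6, "GCN4": 9, "GCR1": 6, "Rap1Ebf1": 15, "HSE-HTSF": 5,
}

# Total number of mismatch positions per method (Table 3), in the same column order.
TOTAL_MISMATCHES: Dict[str, List[int]] = {
    "Our Algorithm": [10, 8, 4, 45, 5],
    "Gibbs": [8, 51, 5, 202, 7],
    "MEME": [8, 15, 10, 32, 3],
    "InfoGibbs": [9, 21, 5, 46, 9],
    "Consensus": [8, 9, 5, 42, 7],
}


def average_mismatches(totals: Dict[str, List[int]],
                       counts: Dict[str, int]) -> Dict[str, List[float]]:
    """Divide each total by the number of sequences of its data set, rounded to 2 digits."""
    names = list(counts)
    return {
        method: [round(t / counts[name], 2) for t, name in zip(row, names)]
        for method, row in totals.items()
    }


if __name__ == "__main__":
    for method, row in average_mismatches(TOTAL_MISMATCHES, N_SEQUENCES).items():
        print(method, row)
```

The recomputed values agree with Table 4 up to the last digit; for instance, 51/9 rounds to 5.67, where the table lists 5.6.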

Figure 2. Average mismatch numbers per sequence.

6.8. Conclusions

We develop algorithms under the probabilistic model. One of them finds the implanted motif with a high probability if the alphabet size is at least four, the motif length is in

[{(log n)}^{7 + μ}, \frac{n}{{(log n)}^{1 + μ}}]

and each character in the motif region has a probability of at most

\frac{1}{{(log n)}^{2 + μ}}

of mutation. The motif region can be detected, and each motif character can be recovered in sublinear time. A sub-quadratic randomized algorithm is developed to recover the motif with an

Ω (1)

mutation rate. A quadratic deterministic algorithm is developed to recover the motif with an

Ω (1)

mutation rate. It is an interesting open problem whether there is an algorithm that handles an alphabet of size three. A more interesting problem is to extend the algorithm to handle a larger mutation probability.

6.9. Future Works

Compared with other motif finding methods, our algorithm shows clear advantages on the data sets tested. However, there are still some improvements that could be made. For example, although a sequence set has a consensus, the motif in each sequence may have a high mutation rate; in addition, the motif lengths may differ between sequences. These two factors increase the difficulty of finding motifs, and currently, there is still no effective algorithm that solves these problems. In the future, we plan to improve our algorithm by combining it with other motif finding methods, such as MEME; such a combination may give better performance in finding motifs in highly mutated sequences.

Acknowledgments

We thank Ming-Yang Kao for introducing us to this topic. We also thank Lusheng Wang and Xiaowen Liu for some discussions. We would like to thank Eugenio De Hayos for his helpful comments. We would like to thank the reviewers, whose comments greatly improved the presentation of this paper.

This research is supported in part by the NSF Early Career Award 0845376, and NSF HRD-1137764.

Conflicts of Interest

The authors declare no conflict of interest.

References

Frances, M.; Litman, A. On covering problems of codes. Theor. Comput. Sci. 1997, 30, 113–119. [Google Scholar] [CrossRef]
Ga̧sieniec, L.; Jansson, J.; Lingas, A. Efficient Approximation Algorithms for the Hamming Center Problem. In Proceedings of the 10th Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD, USA, 17–19 January 1999; pp. S905–S906.
Stormo, G.; Hartzell, G., III. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 1991, 88, 5699–5703. [Google Scholar] [CrossRef] [PubMed]
Lawrence, C.; Reilly, A. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7, 41–51. [Google Scholar] [CrossRef] [PubMed]
Hertz, G.; Stormo, G. Identification of Consensus Patterns in Unaligned DNA and Protein Sequences: A Large-Deviation Statistical Basis for Penalizing Gaps. In Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, Tallahassee, USA, 1–4 June 1994; pp. 201–216.
Stormo, G. Consensus patterns in DNA. In Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences; Doolitle, R.F., Ed.. Methods Enzymol. 1990, 183, 211–221. [Google Scholar]
Lanctot, J.K.; Li, M.; Ma, B.; Wang, L.; Zhang, L. Distinguishing string selection problems. Inf. Comput. 2003, 185, 41–55. [Google Scholar] [CrossRef]
Lucas, K.; Busch, M.; Mossinger, S.; Thompson, J. An improved microcomputer program for finding gene- or gene family-specific oligonucleotides suitable as primers for polymerase chain reactions or as probes. Comput. Appl. Biosci. 1991, 7, 525–529. [Google Scholar] [CrossRef] [PubMed]
Dopazo, J.; Rodríguez, A.; Sáiz, J.C.; Sobrino, F. Design of primers for PCR amplification of highly variable genomes. Comput. Appl. Biosci. 1993, 9, 123–125. [Google Scholar] [PubMed]
Proutski, V.; Holme, E.C. Primer master: A new program for the design and analysis of PCR primers. Comput. Appl. Biosci. 1996, 12, 253–255. [Google Scholar] [CrossRef] [PubMed]
Li, M.; Ma, B.; Wang, L. On the closest string and substring problems. J. ACM 2002, 49, 157–171.
Li, M.; Ma, B.; Wang, L. Finding Similar Regions in Many Strings. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, Atlanta, GA, USA, 1–4 May 1999; pp. 473–482.
Pevzner, P.; Sze, S. Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, Toronto, ON, Canada, 19–23 July 2000; pp. 269–278.
Keich, U.; Pevzner, P. Finding motifs in the twilight zone. Bioinformatics 2002, 18, 1374–1381.
Keich, U.; Pevzner, P. Subtle motifs: Defining the limits of motif finding algorithms. Bioinformatics 2002, 18, 1382–1390.
Wang, L.; Dong, L. Randomized algorithms for motif detection. J. Bioinform. Comput. Biol. 2005, 3, 1039–1052.
Chin, F.; Leung, H. Voting Algorithms for Discovering Long Motifs. In Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, Singapore, 17–21 January 2005; pp. 261–272.
Gusfield, D. Algorithms on Strings, Trees, and Sequences; Cambridge University Press: Cambridge, UK, 1997.
Fu, B.; Kao, M.Y.; Wang, L. Probabilistic analysis of a motif discovery algorithm for multiple sequences. SIAM J. Discret. Math. 2009, 23, 1715–173.
Fu, B.; Kao, M.Y.; Wang, L. Discovering almost any hidden motif from multiple sequences. ACM Trans. Algorithms 2011, 7, 26.
Liu, X.; Ma, B.; Wang, L. Voting Algorithms for the Motif Problem. In Proceedings of the Computational Systems Bioinformatics Conference (CSB’08), Stanford, CA, USA, 26–29 August 2008; pp. 37–47.
Motwani, R.; Raghavan, P. Randomized Algorithms; Cambridge University Press: Cambridge, UK, 2000.
Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 1977, 39, 1–38.
D’haeseleer, P. How does DNA sequence motif discovery work? Nat. Biotechnol. 2006, 24, 959–961.
Lawrence, C.; Altschul, S.; Boguski, M.; Liu, J.; Neuwald, A.; Wootton, J. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 1993, 262, 208–214.
Sandve, G.K.K.; Abul, O.; Drabløs, F. Compo: Composite motif discovery using discrete models. BMC Bioinform. 2008, 9.
Homann, O.; Johnson, A. MochiView: Versatile software for genome browsing and DNA motif analysis. BMC Biol. 2010, 8.
Sinha, S.; Blanchette, M.; Tompa, M. PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinform. 2004, 5.
Larsson, E.; Lindahl, P.; Mostad, P. HeliCis: A DNA motif discovery tool for colocalized motif pairs with periodic spacing. BMC Bioinform. 2007, 8.
Romer, K.; Kayombya, G.R.; Fraenkel, E. WebMOTIFS: Automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches. Nucleic Acids Res. 2007, 35, W217–W220.
Baker, H. GCR1 of Saccharomyces cerevisiae encodes a DNA binding protein whose binding is abolished by mutations in the CTTCC sequence motif. Proc. Natl. Acad. Sci. USA 1991, 88, 9443–9447.

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

