# Filtering Degenerate Patterns with Application to Protein Sequence Analysis

## Abstract

## 1. Introduction

#### 1.1. Ranking and Clustering Degenerate Patterns

**Table 1.** Output of Varun for the G-protein coupled receptors family 3 (id PS00980), consisting of 25 sequences of about 25,000 amino acids each. One of the relevant signatures is shown in bold [12].

Rank | Z-Score | Pattern |
---|---|---|
1 | 2.84E+09 | Y...L...C..[FYW]A..[STAH]R..P..FNE[STAH]K.I.F[STAH]M |
2 | 8.28E+07 | V-(1,3,4)G...S..[STAH]....N...L....Q-(4)[STAH]....L.[DN]...[FYW]..F....P....Q..A...I |
3 | 5.55E+07 | L-(2,3)F...Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I |
4 | 4.27E+07 | L-(2,3)F...Q.[STAH]..[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
5 | 4.23E+07 | L....I...[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
6 | 3.99E+07 | LF-(3)Q....[STAH][STAH]....S[DN]...[FYW]..F.R..P.D..Q..A...I |
7 | 3.38E+07 | LF-(3)Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I |
8 | 3.38E+07 | LF...Q....[STAH]-(4)L.[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
9 | 3.29E+07 | I-(1)Q.[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
10 | 3.29E+07 | I.Q-(4)[STAH]....LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
11 | 3.29E+07 | I.Q.[STAH]..[STAH]-(4)LS[DN]...[FYW]..F.R..P.D..Q..A...I |
12 | 3.10E+07 | L....Q-(1,4)[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
13 | 2.77E+07 | L[FYW]-(3)Q.[STAH]..[STAH]....LS....[FYW]..F.R..P.D..Q..A...I |
14 | 2.58E+07 | L-(4)Q.[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
15 | 2.30E+07 | S.[STAH]S-(2,4)LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
16 | 2.15E+07 | L-(1,3,4)C..[FYW]A..[STAH]R..P..F.E.K.I.F.M |
17 | 1.40E+07 | F-(1)I.Q...[STAH][STAH]-(4)L[STAH]....[FYW]..F.R..P.D..Q..A...I |
18 | 1.37E+07 | L-(2,4)I...[STAH].[STAH].[STAH]-(3)LS....[FYW]..F.R..P.D..Q..A...I |
19 | 1.02E+07 | L..I-(1)Q....[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
20 | 8.65E+06 | I-(1)Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I |
21 | 8.19E+06 | S[STAH]-(1,2,3,4)LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
22 | 7.98E+06 | Q-(3)[STAH][STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
23 | 6.82E+06 | F-(3)Q....[STAH][STAH]...L[STAH]....[FYW]..F.R..P.D..Q..A...I |
24 | 5.66E+06 | A[STAH][STAH]-(2,3)LS[DN]...[FYW]..F.R..P.D..Q..A...I |
25 | 5.57E+06 | F.I-(3)[STAH]..[STAH]....L[STAH]....[FYW]..F.R..P.D..Q..A...I |
26 | 5.18E+06 | L.L-(4)Q....[STAH]....L-(1)[DN]...[FYW]..F.R..P.D..Q..A...I |
27 | 3.61E+06 | L.L-(2)I...[STAH]...[STAH]....[STAH]....[FYW]..F.R..P.D..Q..A...I |
28 | 3.48E+06 | [STAH].[STAH]-(1,2,3)LS[DN]...[FYW]..F.R..P.D..Q..A...I |
29 | 3.17E+06 | [STAH]...[STAH]...LS[DN]...[FYW]..F.R..P.D..Q..A...I |
30 | 2.47E+06 | L....Q-(4)[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
31 | 2.43E+06 | V-(1,3)N.L....I-(3)[STAH]...[STAH]....[STAH]....[FYW]..F....P.D..Q..A...I |
32 | 2.22E+06 | [STAH][STAH][STAH]-(1,2,3)LS....[FYW]..F.R..P.D..Q..A...I |
33 | 2.06E+06 | [STAH].[STAH][STAH]....LS....[FYW]..F.R..P.D..Q..A...I |
34 | 2.03E+06 | Y...L...C...A...R..P..F.E.K.I-(1,4)[FYW][STAH] |
35 | 1.99E+06 | I.Q...[STAH]-(1)[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I |
36 | 1.99E+06 | I.Q-(1)[STAH]...[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I |
38 | 1.97E+06 | F.I...[STAH]-(3)[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I |
40 | 1.97E+06 | F.I-(3)[STAH]..[STAH]....L.[DN]...[FYW]..F....P.D..Q..A...I |
41 | 1.91E+06 | [STAH]..[STAH].K-(1,4)P..FNE[STAH]K.I.F[STAH]M |
42 | 1.72E+06 | CC[FYW].C..C....[FYW]-(2,4)[DN]..[STAH]C..C |
43 | 1.57E+06 | [STAH]-(1,3,4)[FYW]A..[STAH]R..P..F.E.K.I.F.M |
44 | 1.49E+06 | A-(1,3)[STAH]...L[STAH][DN]...[FYW]..F.R..P.D..Q..A...I |
45 | 1.36E+06 | Q...[STAH].[STAH]-(3)L[STAH]....[FYW]..F.R..P.D..Q..A...I |
46 | 1.32E+06 | I-(3)[STAH]..[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
47 | 1.31E+06 | [STAH][STAH]-(1,2,3,4)L.[DN]...[FYW]..F.R..P.D..Q..A...I |
48 | 1.24E+06 | [STAH]..[STAH][STAH]-(1,3)LS....[FYW]..F.R..P.D..Q..A...I |
49 | 1.19E+06 | [FYW]-(1,3,4)[STAH]...P..FNE[STAH]K.I.F[STAH]M |
50 | 1.12E+06 | I...[STAH]-(3)[STAH]...L[STAH]....[FYW]..F.R..P.D..Q..A...I |
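To search a protein string for one of the patterns listed in Table 1, the pattern can be translated into a regular expression. The sketch below is only an illustration and rests on an assumed reading of the notation, which the table itself does not spell out: a capital letter is a solid character, `[STAH]` is a character class, `.` is a single wildcard position, and `-(1,3,4)` is a gap of 1, 3 or 4 wildcard positions. The function name `varun_to_regex` is hypothetical and is not part of Varun.

```python
import re


def varun_to_regex(pattern: str) -> str:
    """Translate a Table 1 style pattern into a Python regular expression.

    Assumed notation (an interpretation, not taken from the Varun tool):
      X         solid character
      [STAH]    character class
      .         one wildcard position
      -(1,3,4)  gap of 1, 3 or 4 wildcard positions
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Character class: copy everything through the closing bracket.
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "-":
            # Variable gap: parse the comma-separated lengths inside -( ... ).
            j = pattern.index(")", i)
            lengths = [int(x) for x in pattern[i + 2:j].split(",")]
            out.append("(?:" + "|".join(".{%d}" % n for n in lengths) + ")")
            i = j + 1
        else:
            # Solid character or single wildcard; both are valid regex as-is.
            out.append(ch)
            i += 1
    return "".join(out)


regex = varun_to_regex("L-(2,3)F...Q")
print(regex)
print(bool(re.fullmatch(regex, "LAAFAAAQ")))
```

Under this reading, a gap such as `-(4)` is equivalent to four consecutive dots, so the translation treats fixed and variable gaps uniformly.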

#### 1.2. Problem Formulation

## 2. Preliminary Definitions

**Definition 1** (Sequence) A sequence with character classes, or simply a sequence, is a string of consecutive symbols defined on $\left\{{2}^{{C}_{j}}\right\}$.

**Definition 2** (Sequence occurrence) A sequence, $p={p}_{1}{p}_{2}\dots {p}_{k}$, of length, k, is said to occur at a location, l, of a string, s, with $1\le l\le n$, if ${s}_{l+j-1}\in {p}_{j}$ for each $1\le j\le k$.

**Definition 3** (Pattern, location list) We say that $m=(p,{\mathcal{L}}_{m})$ is a pattern with sequence, $p={p}_{1}{p}_{2}\dots {p}_{k}$, and location list, ${\mathcal{L}}_{m}=({l}_{1},{l}_{2},\dots ,{l}_{\nu})$, if and only if all of the following hold: (i) p has at least two symbols, $\left|p\right|\ge 2$; (ii) the location list has at least q occurrences, $|{\mathcal{L}}_{m}|\ge q$; and (iii) there does not exist a location, ${l}^{\prime}\notin {\mathcal{L}}_{m}$, such that p occurs at ${l}^{\prime}$ in s (that is, ${\mathcal{L}}_{m}$ is complete).
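Definitions 2 and 3 can be made concrete with a short sketch. It assumes a sequence is represented as a Python list of sets of characters and that locations are 1-based, as in the definitions; the function names and the example string are illustrative choices, not taken from the paper.

```python
def occurs_at(p, s, l):
    """Definition 2: does sequence p occur at 1-based location l of string s?"""
    k = len(p)
    if l < 1 or l + k - 1 > len(s):
        return False
    # s_{l+j-1} in p_j for each 1 <= j <= k (shifted to 0-based indexing).
    return all(s[l + j - 2] in p[j - 1] for j in range(1, k + 1))


def location_list(p, s):
    """Complete list of locations at which p occurs in s (Definition 3, iii)."""
    return [l for l in range(1, len(s) - len(p) + 2) if occurs_at(p, s, l)]


def make_pattern(p, s, q):
    """Return the pattern (p, L_m) if conditions (i) and (ii) hold, else None."""
    if len(p) < 2:
        return None
    locations = location_list(p, s)
    if len(locations) < q:
        return None
    return (p, locations)


# Hypothetical example: p = [acd][acd][abe] on a made-up string.
p = [set("acd"), set("acd"), set("abe")]
s = "aabzacezade"
print(location_list(p, s))  # -> [1, 5, 9]
```

Completeness of the location list follows from scanning every admissible location, so condition (iii) holds by construction.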

## 3. Minimal Patterns and Pattern Priority

**Definition 4** (Minimal representation $\mu (\cdot)$) Given a sequence, p, of length, k, with location list, ${\mathcal{L}}_{m}$, the minimal representation of p is a sequence, $\mu \left(p\right)$, of length, k, with symbols, $\mu {\left(p\right)}_{j}={\bigcup}_{l\in {\mathcal{L}}_{m}}{s}_{l+j-1}$, for $1\le j\le k$.

**Remark 1** The minimal representation of a sequence is unique.

**Remark 2** Since $\mu \left(p\right)$ is more specific than p, that is, $\mu \left(p\right)$ cannot have more occurrences in s than p, the list of occurrences of $\mu \left(p\right)$ must be the same as that of p.

**Definition 5** (Minimal pattern) The minimal representation of m, given by $\mu \left(m\right)=(\mu \left(p\right),{\mathcal{L}}_{m})$, is called a minimal pattern.

j | 1 | 2 | 3 |
---|---|---|---|
${p}_{j}$ | $[a,c,d]$ | $[a,c,d]$ | $[a,b,e]$ |
$\mu {\left(p\right)}_{j}$ | a | $[a,c,d]$ | $[b,e]$ |
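The minimal representation in the table above can be reproduced with a few lines of code. The underlying string is not given here, so the sketch uses a made-up string, `aabzacezade`, on which the sequence $[a,c,d][a,c,d][a,b,e]$ occurs at locations 1, 5 and 9; both the string and the function name are assumptions for illustration.

```python
def minimal_representation(s, locations, k):
    """Definition 4: position j of mu(p) is the union of s_{l+j-1} over l in L_m.

    s          the input string
    locations  1-based location list L_m of the pattern
    k          length of the sequence p
    """
    return [{s[l + j - 2] for l in locations} for j in range(1, k + 1)]


# Hypothetical string chosen so that the result matches the table above.
s = "aabzacezade"
mu = minimal_representation(s, [1, 5, 9], 3)
print([sorted(c) for c in mu])  # -> [['a'], ['a', 'c', 'd'], ['b', 'e']]
```

Only characters that are actually observed at the occurrences survive, which is why the first class shrinks to a single character and the third loses `a`.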

**Remark 3** Two patterns, m and ${m}^{\prime}$, in $\mathcal{M}$ may have the same location list; thus, they will be mapped into the same minimal pattern, $\mu \left(m\right)=\mu \left({m}^{\prime}\right)$. On the other hand, two minimal patterns with the same location list must have different lengths.

**Definition 6** (Degeneracy of a sequence $c(\cdot)$) The degeneracy of a sequence, p, of length, k, is defined as $c\left(p\right)={\sum}_{j=1}^{k}\left|{p}_{j}\right|$. The degeneracy of a pattern, $m=(p,{\mathcal{L}}_{m})$, denoted by $c\left(m\right)$, is defined as the degeneracy of its sequence, $c\left(p\right)$.

**Definition 7** (Pattern priority ‘→’) A pattern, m, of length, k, has priority over another pattern, ${m}^{\prime}$, of length, ${k}^{\prime}$, denoted, $m\to {m}^{\prime}$, if (1) $k>{k}^{\prime}$, or (2) $k={k}^{\prime}$ and $c\left(m\right)<c\left({m}^{\prime}\right)$, or (3) $k={k}^{\prime}$, $c\left(m\right)=c\left({m}^{\prime}\right)$, and $\min\{{\mathcal{L}}_{m}\setminus {\mathcal{L}}_{{m}^{\prime}}\}<\min\{{\mathcal{L}}_{{m}^{\prime}}\setminus {\mathcal{L}}_{m}\}$, when both minima exist.
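Definitions 6 and 7 translate almost directly into code. In the sketch below a pattern is assumed to be a pair `(p, L)`, where `p` is a list of character sets and `L` is a list of 1-based locations; this representation and the function names are illustrative choices.

```python
def degeneracy(p):
    """Definition 6: sum of the sizes of the character classes of p."""
    return sum(len(c) for c in p)


def has_priority(m, m2):
    """Definition 7: return True if m -> m2."""
    (p, locs), (p2, locs2) = m, m2
    k, k2 = len(p), len(p2)
    if k != k2:                      # rule (1): the longer pattern wins
        return k > k2
    c, c2 = degeneracy(p), degeneracy(p2)
    if c != c2:                      # rule (2): the less degenerate pattern wins
        return c < c2
    only_m = set(locs) - set(locs2)
    only_m2 = set(locs2) - set(locs)
    if only_m and only_m2:           # rule (3): applies only if both minima exist
        return min(only_m) < min(only_m2)
    return False                     # one location list contains the other


m_a = ([{"a"}, {"b"}], [1, 5])
m_b = ([{"a"}, {"b", "c"}], [1, 5, 9])
print(has_priority(m_a, m_b))  # -> True (same length, lower degeneracy)
```

When one location list contains the other, neither minimum comparison is defined and the function returns `False` in both directions; this is the case that Lemma 3 treats separately.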

**Definition 8** (Sub-ordered, totally ordered) Given a set, $\mathcal{M}$, if a binary relation, R, over this set is (i) irreflexive, that is, $m$ R $m$ never holds; (ii) antisymmetric, that is, $m\ne {m}^{\prime}$ ⇒ not $(m$ R ${m}^{\prime}$ and ${m}^{\prime}$ R $m)$; and (iii) acyclic, then $\mathcal{M}$ is said to be sub-ordered. If R is also total, that is, $(m$ R ${m}^{\prime}$ or ${m}^{\prime}$ R $m)$, then $\mathcal{M}$ is said to be totally sub-ordered. Furthermore, $\mathcal{M}$ is said to be totally ordered if R is irreflexive, antisymmetric, transitive, that is, $(m$ R ${m}^{\prime}$ and ${m}^{\prime}$ R ${m}^{\prime \prime})$ ⇒ m R ${m}^{\prime \prime}$, and total.

**Lemma 1** A set, $\mathcal{M}$, is totally ordered under a binary relation, R, if and only if it is totally sub-ordered.

**Proof** It is straightforward that transitivity implies acyclicity, that is, if $\mathcal{M}$ is totally ordered, then $\mathcal{M}$ is also totally sub-ordered. It is also easy to see that acyclicity and totality, together, imply transitivity. Consider a chain of patterns in $\mathcal{M}$: ${m}_{1}$ R ${m}_{2}$, ${m}_{2}$ R ${m}_{3},\dots $, ${m}_{t-1}$ R ${m}_{t}$. Since, by acyclicity, ${m}_{t}$ R ${m}_{1}$ does not hold, then, by totality, ${m}_{1}$ R ${m}_{t}$. Hence, transitivity holds and, therefore, the definitions totally sub-ordered and totally ordered coincide. ☐
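Lemma 1 can be sanity-checked by brute force on small sets. The sketch below is not part of the proof: it enumerates every irreflexive, antisymmetric and total relation on n elements (each unordered pair is oriented in exactly one direction) and confirms that such a relation is acyclic exactly when it is transitive. The function name is hypothetical.

```python
from itertools import permutations, product


def lemma1_check(n):
    """Return (mismatches, acyclic_count) over all total relations on n elements.

    mismatches     relations for which 'acyclic' and 'transitive' disagree
    acyclic_count  relations that are acyclic
    """
    elems = list(range(n))
    pairs = [(a, b) for a in elems for b in elems if a < b]
    mismatches = 0
    acyclic_count = 0
    for orient in product([False, True], repeat=len(pairs)):
        # Orient each pair one way: irreflexive, antisymmetric and total.
        rel = {(b, a) if flip else (a, b) for (a, b), flip in zip(pairs, orient)}
        # A finite relation is acyclic iff some linear order is consistent with it.
        acyclic = any(
            all(order.index(a) < order.index(b) for a, b in rel)
            for order in permutations(elems)
        )
        transitive = all(
            (a, c) in rel for a, b in rel for b2, c in rel if b == b2
        )
        acyclic_count += acyclic
        mismatches += acyclic != transitive
    return mismatches, acyclic_count


print(lemma1_check(3))  # -> (0, 6)
```

For n = 3 there are eight such relations; the two cyclic ones are exactly the two that fail transitivity, and the remaining six correspond to the six linear orders.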

**Fact 1** The binary relation of pattern priority is irreflexive and antisymmetric (since properties (1), (2) and (3) of Definition 7 are strictly defined).

**Lemma 2** Let m and ${m}^{\prime}$ be two patterns with $\min\{{\mathcal{L}}_{m}\setminus {\mathcal{L}}_{{m}^{\prime}}\}<\min\{{\mathcal{L}}_{{m}^{\prime}}\setminus {\mathcal{L}}_{m}\}$, such that the two minima both exist, and define j to be $\min\{{\mathcal{L}}_{m}\setminus {\mathcal{L}}_{{m}^{\prime}}\}$. Then, the occurrences of m and ${m}^{\prime}$ at positions less than j must be the same.

**Proof** We have to prove that, in case (3) of Definition 7, the occurrences of m and ${m}^{\prime}$ less than j are identical. If m has an occurrence less than j that is not in ${\mathcal{L}}_{{m}^{\prime}}$, then $\min\{{\mathcal{L}}_{m}\setminus {\mathcal{L}}_{{m}^{\prime}}\}<j$, which is impossible by hypothesis. Conversely, if ${m}^{\prime}$ has occurrences less than j that are not in ${\mathcal{L}}_{m}$, then $\min\{{\mathcal{L}}_{{m}^{\prime}}\setminus {\mathcal{L}}_{m}\}<j=\min\{{\mathcal{L}}_{m}\setminus {\mathcal{L}}_{{m}^{\prime}}\}$, which again contradicts our assumptions. Thus, the occurrences of the two patterns less than j must be the same, and we call them paired. Finally, by assumption, it trivially holds that $j\notin {\mathcal{L}}_{{m}^{\prime}}$. ☐

**Lemma 3** Let ${m}_{1}\to {m}_{2}$, ${m}_{2}\to {m}_{3},\dots $, ${m}_{t-1}\to {m}_{t}$ be a chain of patterns with the same length and degeneracy. Then, either ${m}_{1}\to {m}_{t}$ or ${\mathcal{L}}_{{m}_{t}}\subset {\mathcal{L}}_{{m}_{1}}$ holds.

**Proof** We will prove the statement by induction on t. Let the basis be $t=2$. In this case, the chain, ${m}_{1}\to {m}_{2}$, coincides with the result. We will now show that, if either ${m}_{1}\to {m}_{t}$ or ${\mathcal{L}}_{{m}_{t}}\subset {\mathcal{L}}_{{m}_{1}}$ holds, then either ${m}_{1}\to {m}_{t+1}$ or ${\mathcal{L}}_{{m}_{t+1}}\subset {\mathcal{L}}_{{m}_{1}}$ holds.

**Theorem 1** Any set of patterns, $\mathcal{M}$, is sub-ordered with respect to the binary relation of pattern priority.

**Proof** We have to prove that the relation of pattern priority is irreflexive, antisymmetric and acyclic. The first two properties are stated in Fact 1. Now, following the work in Lemma 3, we can prove that acyclicity holds too. First, observe that length and degeneracy are intrinsic properties of the single pattern. If all patterns in $\mathcal{M}$ have different lengths and degeneracies, then by definition of pattern priority, it is always true that either $m\to {m}^{\prime}$ or ${m}^{\prime}\to m$, and a cycle can never exist, because of different lengths or degeneracies. Alternatively, consider a chain of patterns, ${m}_{1}\to {m}_{2},\dots $, ${m}_{t-1}\to {m}_{t}$, with the same length and degeneracy. In this case, we must use property (3) of Definition 7 to compare the patterns. From Lemma 3, it follows that a cycle of pattern priority between any chain of patterns is, again, impossible, and hence, acyclicity holds. ☐

**Theorem 2** Given any set of patterns, $\mathcal{M}$, its minimal set, $\mu \left(\mathcal{M}\right)$, is totally ordered under the binary relation of pattern priority.

**Proof** By Theorem 1, any set of patterns, $\mathcal{M}$, is sub-ordered under pattern priority; thus, the set of minimal patterns, $\mu \left(\mathcal{M}\right)$, is also sub-ordered. Following Lemma 1, we have to prove that totality holds on this new set, $\mu \left(\mathcal{M}\right)$, that is, every pair of minimal patterns must be comparable under pattern priority. In other words, if $m\ne {m}^{\prime}$, either $m\to {m}^{\prime}$ or ${m}^{\prime}\to m$ must hold.

## 4. Pattern Filtering

**Definition 9** (Underlying pattern) The set of patterns, $\mathcal{U}\subseteq \mu \left(\mathcal{M}\right)$, is said to be underlying if and only if:

- (i) Every pattern, m, in $\mathcal{U}$, called an underlying pattern, has at least q occurrences that are untied from all the untied occurrences of other patterns in $\mathcal{U}\setminus m$; and
- (ii) There does not exist a pattern, $m\in \mu \left(\mathcal{M}\right)\setminus \mathcal{U}$, such that m has at least q untied occurrences from all the untied occurrences of patterns in $\mathcal{U}$.

**Underlying Pattern Filtering** (Input: $\mathcal{M}$, q; Output: $\mathcal{U}$)

- Compute the minimal set, $\mu \left(\mathcal{M}\right)$.
- Rank all minimal patterns in $\mu \left(\mathcal{M}\right)$ using the pattern priority rule.
- At each step, select the top pattern, m, from $\mu \left(\mathcal{M}\right)$:
  - If all of its occurrences are tied/covered by some other patterns already in $\mathcal{U}$, discard m;
  - Otherwise, if m has at least q untied occurrences, add m to $\mathcal{U}$ and update the locations of vector, Γ, in which m appears.
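The filtering steps above can be sketched as follows. Several details are assumptions, because they are not spelled out in this section: Γ is taken to be a Boolean vector over the n positions of s; an occurrence is treated as tied when it overlaps a position already marked in Γ; only the untied occurrences of a selected pattern are marked; and the ranking uses the sort key (longer first, lower degeneracy first, then the sorted location list) as a stand-in for the pattern priority rule. The input is assumed to be the minimal set already, given as pairs `(p, L)`.

```python
def underlying_filter(minimal_patterns, n, q):
    """Greedy selection of underlying patterns (a sketch under stated assumptions).

    minimal_patterns  list of (p, L): p is a list of character sets,
                      L a list of 1-based locations
    n                 length of the input string
    q                 quorum, the minimum number of untied occurrences
    """
    # Stand-in for pattern priority: length, then degeneracy, then locations.
    ranked = sorted(
        minimal_patterns,
        key=lambda m: (-len(m[0]), sum(len(c) for c in m[0]), tuple(sorted(m[1]))),
    )
    gamma = [False] * (n + 1)        # gamma[i]: position i (1-based) is covered
    selected = []
    for p, locations in ranked:
        k = len(p)
        # An occurrence is untied if none of its k positions is covered yet.
        untied = [l for l in locations if not any(gamma[l:l + k])]
        if len(untied) >= q:
            selected.append((p, untied))
            for l in untied:
                for i in range(l, l + k):
                    gamma[i] = True
        # Otherwise the pattern is discarded.
    return selected


# Hypothetical input: four minimal patterns on a string of length 16.
patterns = [
    ([{"a"}, {"a", "c", "d"}], [1, 5, 9]),
    ([{"a"}, {"a", "c", "d"}, {"b", "e"}], [1, 5, 9]),
    ([{"z"}, {"a"}], [4, 8]),
    ([{"x"}, {"y"}], [12, 15]),
]
for p, untied in underlying_filter(patterns, 16, 2):
    print(len(p), untied)
```

In this example the length-3 pattern is selected first and covers positions 1 to 3, 5 to 7 and 9 to 11, so the two shorter patterns that overlap it are discarded, while the pattern at locations 12 and 15 survives.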

**Corollary 1** The number of patterns in $\mathcal{U}$ is $\le \lceil n/2\rceil $, independently of the size of $\mathcal{M}$.

## 5. Experimental Results

- Nickel-Dependent hydrogenases (id PS00508; in short, Ni). These are enzymes that catalyze the reversible activation of hydrogen and are further involved in the binding of nickel. The family is composed of 22 sequences of about 12,300 amino acids in total. This family contains two representative signatures, ${Ni}_{1}=$ RG[FILMV]E...............[EMPQS][KR].C[GR][ILMV]C and ${Ni}_{2}=$ [FY]D[IP][CU][AILMV][AGS]C.
- Coagulation factors 5/8 type C domain (FA58C) (id PS01286; in short, Fa). This family is composed of 40 sequences of about 46,500 amino acids in total. They share two signatures: ${Fa}_{1}=$ [FWY][ILV].[AFILV][DEGNST]......[FILV]..[IV].[ILTV][KMQT]G and ${Fa}_{2}=$ [LM]R.[EG][ILPV].GC.
- Formate and nitrite transporters (id PS01005; in short, Form). The signature [LIVMA][LIVMY].G[GSTA][DES]L[FI][TN][GS] is present in 17 sequences of a total length of 5300 amino acids.
- Ubiquitin-Activating enzyme (id PS00865; in short, Ubi). The active site P[LIVMG]CT[LIVM][KRHA].[FTNM]P appears in 36 proteins of about 25,200 amino acids in total.
- RNA polymerases M/15 Kd subunits (id PS01030; in short, Poly). The representative signature [FY]C.[DEKSTG]C[GNK][DNSA][LIVMHG][LIVM] occurs in 29 sequences of about 4000 amino acids.
- Dbl homology domain (id PS00741; in short, Dbl). The signature [LM]..[LIVMFYWGS][LI]..[PEQ][LIVMRF]..[LIVM].[KRS].[LT].[LIVM].[DEQN][LIVM]...[STM] appears in 65 sequences of a total length of 18,750 amino acids.

**Figure 1.** Total number, sum of lengths and mean z-score of the patterns extracted using Varun and their corresponding underlying patterns, for the two protein families, $Ni$ and $Fa$. The dashed line in the total length diagrams indicates the total size of each family. Note that in (**a**) Mean Z-Score and (**b**) All diagrams, the ordinate is plotted on a logarithmic scale. (**a**) Nickel-Dependent hydrogenases ($Ni$); (**b**) Coagulation factors 5/8 type C domain ($Fa$).

Binary Relation | ${Ni}_{1,2}$ Similarity | ${Ni}_{1,2}$ Rank | ${Fa}_{1,2}$ Similarity | ${Fa}_{1,2}$ Rank |
---|---|---|---|---|
Pattern priority | 151/157 | 2.78 | 247/264 | 5.34 |
z-Score | 127/157 | 5.00 | 223/264 | 9.96 |
Probability | 127/157 | 5.00 | 223/264 | 9.96 |
Equal Probability | 127/157 | 5.00 | 223/264 | 9.96 |
Frequency | 93/157 | 22.78 | 168/264 | 9.42 |
Inverted frequency | 118/157 | 6.14 | 212/264 | 5.69 |
Lexicographic order | 93/157 | 5.50 | 142/264 | 11.77 |

Binary Relation | Form Similarity | Form Rank | Ubi Similarity | Ubi Rank | Poly Similarity | Poly Rank | Dbl Similarity | Dbl Rank |
---|---|---|---|---|---|---|---|---|
Pattern priority | 186/205 | 4.72 | 190/198 | 3.40 | 215/234 | 4.25 | 498/522 | 5.20 |
z-Score | 167/205 | 6.00 | 178/198 | 4.74 | 212/234 | 5.86 | 455/522 | 7.10 |
Probability | 167/205 | 6.00 | 178/198 | 4.74 | 210/234 | 5.92 | 455/522 | 7.10 |
Equal Probability | 165/205 | 6.20 | 178/198 | 4.74 | 210/234 | 5.92 | 452/522 | 7.10 |
Frequency | 102/205 | 26.62 | 112/198 | 9.75 | 135/234 | 13.69 | 321/522 | 21.35 |
Inverted frequency | 154/205 | 7.92 | 159/198 | 6.00 | 188/234 | 10.10 | 436/522 | 13.74 |
Lexicographic order | 105/205 | 16.44 | 112/198 | 12.39 | 126/234 | 13.93 | 308/522 | 22.00 |

**Table 5.** Normalized maximum similarity with the reference patterns of the family nickel-dependent hydrogenases, for different quorums.

Quorum | Max Similarity ${Ni}_{1}$ (underlying/original) | Max Similarity ${Ni}_{2}$ (underlying/original) |
---|---|---|
5 | 26/26 | 9/12 |
10 | 18/18 | 12/12 |
15 | 11/11 | 9/12 |
20 | 9/9 | 12/12 |
22 | 9/9 | 12/12 |
25 | 6/6 | 6/6 |
30 | 6/6 | 6/6 |

**Table 6.** Normalized maximum similarity with the reference patterns of the family coagulation factors 5/8 type C domain, for different quorums.

Quorum | Max Similarity ${Fa}_{1}$ (underlying/original) | Max Similarity ${Fa}_{2}$ (underlying/original) |
---|---|---|
15 | 11/12 | 11/12 |
20 | 11/12 | 12/12 |
25 | 12/12 | 8/10 |
30 | 10/12 | 8/10 |
35 | 10/12 | 9/10 |
40 | 10/12 | 8/8 |
45 | 12/12 | 8/8 |
50 | 12/12 | 8/8 |
60 | 10/10 | 8/8 |
70 | 10/10 | 8/8 |
80 | 9/10 | 8/8 |
90 | 9/10 | 8/8 |
100 | 9/10 | 8/8 |

## 6. Conclusions

## Acknowledgments

## References

1. Hulo, N.; Bairoch, A.; Bulliard, V.; Cerutti, L.; Cuche, B.; de Castro, E.; Lachaize, C.; Langendijk-Genevaux, P.; Sigrist, C. The 20 years of PROSITE. Nucleic Acids Res. **2008**, 36, D245–D249.
2. Parida, L. Pattern Discovery in Bioinformatics: Theory and Algorithms; Mathematical and Computational Biology, Chapman and Hall/CRC: Boca Raton, FL, USA, 2007.
3. Jensen, K.L.; Styczynski, M.P.; Rigoutsos, I.; Stephanopoulos, G.N. A generic motif discovery algorithm for sequential data. Bioinformatics **2006**, 22, 21–28.
4. Abrahamson, K. Generalized string matching. SIAM J. Comput. **1987**, 16, 1039–1051.
5. Navarro, G.; Raffinot, M. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. **2003**, 10, 903–923.
6. Fredriksson, K.; Grabowski, S. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr. **2008**, 11, 335–357.
7. Wu, S.; Manber, U. Fast text searching: Allowing errors. Commun. ACM **1992**, 35, 83–91.
8. Soldano, H.; Viari, A.; Champesme, M. Searching for flexible repeated patterns using a non-transitive similarity relation. Pattern Recognit. Lett. **1995**, 16, 233–246.
9. Pisanti, N.; Soldano, H.; Carpentier, M. Incremental inference of relational motifs with a degenerate alphabet. Lect. Notes Comput. Sci. **2005**, 3537, 229–240.
10. Frith, M.C.; Saunders, N.F.W.; Kobe, B.; Bailey, T.L. Discovering sequence motifs with arbitrary insertions and deletions. PLoS Comput. Biol. **2008**, 4.
11. Sinha, S.; Tompa, M. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. **2002**, 30, 5549–5560.
12. Apostolico, A.; Comin, M.; Parida, L. VARUN: Discovering extensible motifs under saturation constraints. IEEE/ACM Trans. Comput. Biol. Bioinforma. **2010**, 7, 752–762.
13. Pisanti, N.; Crochemore, M.; Grossi, R.; Sagot, M.F. Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Trans. Comput. Biol. Bioinforma. **2005**, 2, 40–50.
14. Apostolico, A.; Comin, M.; Parida, L. Bridging lossy and lossless compression by motif pattern discovery. Lect. Notes Comput. Sci. **2006**, 4123, 793–813.
15. Apostolico, A.; Comin, M.; Parida, L. Motifs in Ziv-Lempel-Welch Clef. In Proceedings of IEEE DCC Data Compression Conference, Snowbird, UT, USA, 23–25 March 2004; pp. 72–81.
16. Apostolico, A.; Comin, M.; Parida, L. Mining, compressing and classifying with extensible motifs. Algorithms Mol. Biol. **2006**, 1.
17. Comin, M.; Verzotto, D. The Irredundant Class method for remote homology detection of protein sequences. J. Comput. Biol. **2011**, 18, 1819–1829.
18. Comin, M.; Verzotto, D. Classification of protein sequences by means of irredundant patterns. BMC Bioinforma. **2010**, 11.
19. Comin, M.; Verzotto, D. Alignment-Free phylogeny of whole genomes using underlying subwords. BMC Algorithms Mol. Biol. **2012**, 7.
20. Comin, M.; Parida, L. Detection of subtle variations as consensus motifs. Theoret. Comput. Sci. **2008**, 395, 158–170.
21. Comin, M.; Parida, L. Subtle Motif Discovery for Detection of DNA Regulatory Sites. In Proceedings of the 5th Asia-Pacific Bioinformatics Conference, APBC, Hong Kong, 14–17 Jan, 2007; Volume 5, pp. 27–36.
22. Jensen, K.L.; Styczynski, M.P.; Rigoutsos, I.; Stephanopoulos, G.N. A generic motif discovery algorithm for sequential data. Bioinformatics **2006**, 22, 21–28.
23. Leslie, C.S.; Eskin, E.; Cohen, A.; Weston, J.; Noble, W.S. Mismatch string kernels for discriminative protein classification. Bioinformatics **2004**, 20, 467–476.
24. Dipartimento Di Ingegneria Dell’Informazione. Available online: http://www.dei.unipd.it/~ciompin/main/filtering.html (accessed on 21 May 2013).
25. Apostolico, A.; Comin, M.; Parida, L. Conservative extraction of over-represented extensible motifs. Bioinformatics **2005**, 21, 9–18.
26. Mendes, N.D.; Casimiro, A.C.; Santos, P.M.; Sá-Correia, I.; Oliveira, A.L.; Freitas, A.T. MUSA: A parameter free algorithm for the identification of biologically significant motifs. Bioinformatics **2006**, 22, 2996–3002.
27. Peng, C.H.; Hsu, J.T.; Chung, Y.S.; Lin, Y.J.; Chow, W.Y.; Hsu, D.F.; Tang, C.Y. Identification of degenerate motifs using position restricted selection and hybrid ranking combination. Nucleic Acids Res. **2006**, 34, 6379–6391.
28. Vishnevsky, O.V.; Kolchanov, N.A. ARGO: A web system for the detection of degenerate motifs and large-scale recognition of eukaryotic promoters. Nucleic Acids Res. **2005**, 33, W417–W422.
29. Chakravarty, A.; Carlson, J.M.; Khetani, R.S.; DeZiel, C.E.; Gross, R.H. SPACER: Identification of cis-regulatory elements with non-contiguous critical residues. Bioinformatics **2007**, 23, 1029–1031.
30. Wu, R.; Chaivorapol, C.; Zheng, J.; Li, H.; Liang, S. fREDUCE: Detection of degenerate regulatory elements using correlation with expression. BMC Bioinforma. **2007**, 8.
31. Wang, G.; Yu, T.; Zhang, W. WordSpy: Identifying transcription factor binding motifs by building a dictionary and learning a grammar. Nucleic Acids Res. **2005**, 33, W412–W416.
32. Ukkonen, E. Maximal and minimal representations of gapped and non-gapped motifs of a string. Theoret. Comput. Sci. **2009**, 410, 4341–4349.
33. Romer, K.; Kayombya, G.R.; Fraenkel, E. WebMOTIFS: Automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches. Nucleic Acids Res. **2007**, 35, W217–W220.
34. Zhang, S.; Su, W.; Yang, J. ARCS-Motif: Discovering correlated motifs from unaligned biological sequences. Bioinformatics **2009**, 25, 183–189.
35. Coatney, M.; Parthasarathy, S. MotifMiner: A General Toolkit for Efficiently Identifying Common Substructures in Molecules. In Proceedings of the 3rd IEEE BIBE, Maryland, MD, USA, 10–12 March 2003; pp. 336–340.
36. Wijaya, E.; Yiu, S.M.; Son, N.T.; Kanagasabai, R.; Sung, W.K. MotifVoter: A novel ensemble method for fine-grained integration of generic motif finders. Bioinformatics **2008**, 24, 2288–2295.
37. Tompa, M.; Li, N.; Bailey, T.L.; Church, G.M.; Moor, B.D.; Eskin, E.; Favorov, A.V.; Frith, M.C.; Fu, Y.; Kent, W.J.; et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. **2005**, 23, 137–144.
38. Edwards, R.J.; Davey, N.E.; Shields, D.C. CompariMotif: Quick and easy comparisons of sequence motifs. Bioinformatics **2008**, 24, 1307–1309.
39. Jiang, H.; Zhao, Y.; Chen, W.; Zheng, W. Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree. In Advances in Computational Biology; Arabnia, H.R., Ed.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 680, pp. 19–26.
40. Edelman, G.M.; Gally, J.A. Degeneracy and complexity in biological systems. Proc. Natl. Acad. Sci. USA **2001**, 98, 13763–13768.
41. Shinozaki, D.; Akutsu, T.; Maruyama, O. Finding optimal degenerate patterns in DNA sequences. Bioinformatics **2003**, 19, 206–214.
42. Bailey, T.L.; Williams, N.; Misleh, C.; Li, W.W. MEME: Discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. **2006**, 34, 369–373.

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Comin, M.; Verzotto, D. Filtering Degenerate Patterns with Application to Protein Sequence Analysis. *Algorithms* **2013**, *6*, 352-370. https://doi.org/10.3390/a6020352