Filtering Degenerate Patterns with Application to Protein Sequence Analysis
Abstract
1. Introduction
1.1. Ranking and Clustering Degenerate Patterns
Rank | Z-Score | Pattern |
---|---|---|
1 | 2.84E+09 | Y...L...C..[FYW]A..[STAH]R..P..FNE[STAH]K.I.F[STAH]M |
2 | 8.28E+07 | V-(1,3,4)G...S..[STAH]....N...L....Q-(4)[STAH]....L.[DN]...[FYW]..F....P....Q..A...I |
3 | 5.55E+07 | L-(2,3)F...Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I |
4 | 4.27E+07 | L-(2,3)F...Q.[STAH]..[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
5 | 4.23E+07 | L....I...[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
6 | 3.99E+07 | LF-(3)Q....[STAH][STAH]....S[DN]...[FYW]..F.R..P.D..Q..A...I |
7 | 3.38E+07 | LF-(3)Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I |
8 | 3.38E+07 | LF...Q....[STAH]-(4)L.[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
9 | 3.29E+07 | I-(1)Q.[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
10 | 3.29E+07 | I.Q-(4)[STAH]....LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
11 | 3.29E+07 | I.Q.[STAH]..[STAH]-(4)LS[DN]...[FYW]..F.R..P.D..Q..A...I |
12 | 3.10E+07 | L....Q-(1,4)[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
13 | 2.77E+07 | L[FYW]-(3)Q.[STAH]..[STAH]....LS....[FYW]..F.R..P.D..Q..A...I |
14 | 2.58E+07 | L-(4)Q.[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
15 | 2.30E+07 | S.[STAH]S-(2,4)LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
16 | 2.15E+07 | L-(1,3,4)C..[FYW]A..[STAH]R..P..F.E.K.I.F.M |
17 | 1.40E+07 | F-(1)I.Q...[STAH][STAH]-(4)L[STAH]....[FYW]..F.R..P.D..Q..A...I |
18 | 1.37E+07 | L-(2,4)I...[STAH].[STAH].[STAH]-(3)LS....[FYW]..F.R..P.D..Q..A...I |
19 | 1.02E+07 | L..I-(1)Q....[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
20 | 8.65E+06 | I-(1)Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I |
21 | 8.19E+06 | S[STAH]-(1,2,3,4)LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I |
22 | 7.98E+06 | Q-(3)[STAH][STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I |
23 | 6.82E+06 | F-(3)Q....[STAH][STAH]...L[STAH]....[FYW]..F.R..P.D..Q..A...I |
24 | 5.66E+06 | A[STAH][STAH]-(2,3)LS[DN]...[FYW]..F.R..P.D..Q..A...I |
25 | 5.57E+06 | F.I-(3)[STAH]..[STAH]....L[STAH]....[FYW]..F.R..P.D..Q..A...I |
26 | 5.18E+06 | L.L-(4)Q....[STAH]....L-(1)[DN]...[FYW]..F.R..P.D..Q..A...I |
27 | 3.61E+06 | L.L-(2)I...[STAH]...[STAH]....[STAH]....[FYW]..F.R..P.D..Q..A...I |
28 | 3.48E+06 | [STAH].[STAH]-(1,2,3)LS[DN]...[FYW]..F.R..P.D..Q..A...I |
29 | 3.17E+06 | [STAH]...[STAH]...LS[DN]...[FYW]..F.R..P.D..Q..A...I |
30 | 2.47E+06 | L....Q-(4)[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
31 | 2.43E+06 | V-(1,3)N.L....I-(3)[STAH]...[STAH]....[STAH]....[FYW]..F....P.D..Q..A...I |
32 | 2.22E+06 | [STAH][STAH][STAH]-(1,2,3)LS....[FYW]..F.R..P.D..Q..A...I |
33 | 2.06E+06 | [STAH].[STAH][STAH]....LS....[FYW]..F.R..P.D..Q..A...I |
34 | 2.03E+06 | Y...L...C...A...R..P..F.E.K.I-(1,4)[FYW][STAH] |
35 | 1.99E+06 | I.Q...[STAH]-(1)[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I |
36 | 1.99E+06 | I.Q-(1)[STAH]...[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I |
38 | 1.97E+06 | F.I...[STAH]-(3)[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I |
40 | 1.97E+06 | F.I-(3)[STAH]..[STAH]....L.[DN]...[FYW]..F....P.D..Q..A...I |
41 | 1.91E+06 | [STAH]..[STAH].K-(1,4)P..FNE[STAH]K.I.F[STAH]M |
42 | 1.72E+06 | CC[FYW].C..C....[FYW]-(2,4)[DN]..[STAH]C..C |
43 | 1.57E+06 | [STAH]-(1,3,4)[FYW]A..[STAH]R..P..F.E.K.I.F.M |
44 | 1.49E+06 | A-(1,3)[STAH]...L[STAH][DN]...[FYW]..F.R..P.D..Q..A...I |
45 | 1.36E+06 | Q...[STAH].[STAH]-(3)L[STAH]....[FYW]..F.R..P.D..Q..A...I |
46 | 1.32E+06 | I-(3)[STAH]..[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I |
47 | 1.31E+06 | [STAH][STAH]-(1,2,3,4)L.[DN]...[FYW]..F.R..P.D..Q..A...I |
48 | 1.24E+06 | [STAH]..[STAH][STAH]-(1,3)LS....[FYW]..F.R..P.D..Q..A...I |
49 | 1.19E+06 | [FYW]-(1,3,4)[STAH]...P..FNE[STAH]K.I.F[STAH]M |
50 | 1.12E+06 | I...[STAH]-(3)[STAH]...L[STAH]....[FYW]..F.R..P.D..Q..A...I |
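To make the notation in the table concrete, here is a minimal sketch that translates one pattern into a Python regular expression. It rests on an assumed reading of the notation that the table itself does not spell out: `.` is a single wildcard residue, `[STAH]` is a residue class, and `-(a,b,...)` is a gap that may span a, b, ... wildcard positions. The helper name `pattern_to_regex` is hypothetical.

```python
import re

# Tokens of the assumed notation: residue class, extensible gap, wildcard, solid residue.
TOKEN = re.compile(r"\[[A-Z]+\]|-\(\d+(?:,\d+)*\)|\.|[A-Z]")


def pattern_to_regex(pattern: str) -> str:
    """Translate a degenerate pattern into a regex string (assumed semantics)."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(pattern):
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos} in {pattern!r}")
        token = match.group()
        if token.startswith("-("):
            # "-(2,3)" becomes an alternation of fixed-length wildcard runs.
            lengths = token[2:-1].split(",")
            parts.append("(?:" + "|".join(f".{{{n}}}" for n in lengths) + ")")
        else:
            # Solid residues, "." and "[...]" are already valid regex syntax.
            parts.append(token)
        pos = match.end()
    if pos != len(pattern):
        raise ValueError(f"unexpected text at offset {pos} in {pattern!r}")
    return "".join(parts)


regex = pattern_to_regex("L-(2,3)F...Q")
print(regex)                                  # L(?:.{2}|.{3})F...Q
print(bool(re.search(regex, "LAAFXXXQ")))     # gap of length 2
print(bool(re.search(regex, "LAAAFXXXQ")))    # gap of length 3
```

Under this reading, a pattern such as rank 42 becomes an ordinary regex with one alternation for its `-(2,4)` gap, so standard regex engines can be used to locate occurrences.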
1.2. Problem Formulation
2. Preliminary Definitions
3. Minimal Patterns and Pattern Priority
j | 1 | 2 | 3 |
---|---|---|---|
a |
4. Pattern Filtering
- (i) Every pattern, m, in $\mathcal{U}$, called an underlying pattern, has at least q occurrences that are untied from all the untied occurrences of the other patterns in $\mathcal{U} \setminus \{m\}$; and
- (ii) There does not exist a pattern, $m \in \mathcal{M} \setminus \mathcal{U}$, such that m has at least q untied occurrences from all the untied occurrences of patterns in $\mathcal{U}$.
- Compute the minimal set, $\mathcal{M}$.
- Rank all minimal patterns in $\mathcal{M}$ using the pattern priority rule.
- At each step, select the top pattern, m, from $\mathcal{M}$:
  - If all of its occurrences are tied/covered by some other patterns already in $\mathcal{U}$, discard m;
  - Otherwise, if m has at least q untied occurrences, add m to $\mathcal{U}$ and update the locations of vector, Γ, in which m appears.
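The selection steps above can be sketched as a greedy pass over the ranked patterns. This is an illustrative sketch, not the authors' implementation, and it makes two assumptions: an occurrence counts as tied when it overlaps any location already marked in Γ, and once a pattern is selected all of its occurrences are marked. The names `filter_underlying` and `is_untied`, and the `(sequence, start, end)` occurrence encoding, are hypothetical.

```python
from typing import List, Sequence, Tuple

# An occurrence is (sequence index, start, end), with end exclusive.
Occurrence = Tuple[int, int, int]


def is_untied(occ: Occurrence, gamma: List[List[bool]]) -> bool:
    """Assumption: an occurrence is untied if it overlaps no marked location."""
    seq, start, end = occ
    return not any(gamma[seq][start:end])


def filter_underlying(
    ranked_patterns: Sequence[Tuple[str, List[Occurrence]]],
    seq_lengths: Sequence[int],
    q: int,
) -> List[str]:
    """Greedy selection over patterns already sorted by priority (best first)."""
    # gamma marks, per sequence, the locations covered by selected patterns.
    gamma = [[False] * n for n in seq_lengths]
    underlying = []
    for name, occurrences in ranked_patterns:
        untied = [occ for occ in occurrences if is_untied(occ, gamma)]
        if len(untied) < q:
            continue  # fully tied, or too few untied occurrences: discard
        underlying.append(name)
        for seq, start, end in occurrences:
            for i in range(start, end):
                gamma[seq][i] = True
    return underlying


ranked = [
    ("A", [(0, 0, 5), (1, 0, 5)]),
    ("B", [(0, 2, 7), (1, 2, 7)]),    # both occurrences overlap A
    ("C", [(0, 10, 14), (1, 3, 8)]),  # one occurrence overlaps A
]
print(filter_underlying(ranked, [20, 20], q=2))  # ['A']
print(filter_underlying(ranked, [20, 20], q=1))  # ['A', 'C']
```

Because patterns are visited in priority order, a lower-ranked pattern survives only where it covers locations that no higher-ranked pattern already explains, which is what makes the quorum q act as the filtering threshold.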
5. Experimental Results
- Nickel-dependent hydrogenases (id PS00508; in short, Ni). These enzymes catalyze the reversible activation of hydrogen and are also involved in the binding of nickel. The family is composed of 22 sequences totaling about 12,300 amino acids. It contains two representative signatures, RG[FILMV]E...............[EMPQS][KR].C[GR][ILMV]C and [FY]D[IP][CU][AILMV][AGS]C.
- Coagulation factors 5/8 type C domain (FA58C) (id PS01286; in short, Fa). This family is composed of 40 sequences totaling about 46,500 amino acids. The sequences share two signatures: [FWY][ILV].[AFILV][DEGNST]......[FILV]..[IV].[ILTV][KMQT]G and [LM]R.[EG][ILPV].GC.
- Formate and nitrite transporters (id PS01005; in short, Form). The signature [LIVMA][LIVMY].G[GSTA][DES]L[FI][TN][GS] is present in 17 sequences with a total length of about 5,300 amino acids.
- Ubiquitin-activating enzyme (id PS00865; in short, Ubi). The active site P[LIVMG]CT[LIVM][KRHA].[FTNM]P appears in 36 proteins totaling about 25,200 amino acids.
- RNA polymerases M/15 Kd subunits (id PS01030; in short, Poly). The representative signature [FY]C.[DEKSTG]C[GNK][DNSA][LIVMHG][LIVM] occurs in 29 sequences totaling about 4,000 amino acids.
- Dbl homology domain (id PS00741; in short, Dbl). The signature [LM]..[LIVMFYWGS][LI]..[PEQ][LIVMRF]..[LIVM].[KRS].[LT].[LIVM].[DEQN][LIVM]...[STM] appears in 65 sequences with a total length of about 18,750 amino acids.
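Since the signatures listed above use only solid residues, `.` wildcards and bracketed residue classes, they can be fed directly to a regular expression engine. The following sketch counts where a signature occurs in a set of sequences. The sequences are made-up toy strings, not PROSITE data, and the helper name `signature_support` is hypothetical.

```python
import re
from typing import List, Tuple


def signature_support(signature: str, sequences: List[str]) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (number of sequences hit, list of (sequence index, start) hits)."""
    # A zero-width lookahead reports every start position, including
    # occurrences that overlap one another.
    regex = re.compile(f"(?=(?:{signature}))")
    hits = [
        (i, match.start())
        for i, seq in enumerate(sequences)
        for match in regex.finditer(seq)
    ]
    return len({i for i, _ in hits}), hits


# Second Fa signature from the list above, searched in toy sequences.
signature = "[LM]R.[EG][ILPV].GC"
toy_sequences = ["AALRKEIAGCTT", "MRSGVLGCMRAEPQGC", "GGGGGG"]

n_sequences, hits = signature_support(signature, toy_sequences)
print(n_sequences)  # 2
print(hits)         # [(0, 2), (1, 0), (1, 8)]
```

The total number of hits is the quantity one would compare against a quorum q, while the number of distinct sequences hit indicates how widely the signature is shared across the family.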
Binary Relation | Similarity | Rank | Similarity | Rank |
---|---|---|---|---|
Pattern priority | 151/157 | 2.78 | 247/264 | 5.34 |
z-Score | 127/157 | 5.00 | 223/264 | 9.96 |
Probability | 127/157 | 5.00 | 223/264 | 9.96 |
Equal Probability | 127/157 | 5.00 | 223/264 | 9.96 |
Frequency | 93/157 | 22.78 | 168/264 | 9.42 |
Inverted frequency | 118/157 | 6.14 | 212/264 | 5.69 |
Lexicographic order | 93/157 | 5.50 | 142/264 | 11.77 |
Binary Relation | Form: Similarity | Form: Rank | Ubi: Similarity | Ubi: Rank | Poly: Similarity | Poly: Rank | Dbl: Similarity | Dbl: Rank |
---|---|---|---|---|---|---|---|---|
Pattern priority | 186/205 | 4.72 | 190/198 | 3.40 | 215/234 | 4.25 | 498/522 | 5.20 |
z-Score | 167/205 | 6.00 | 178/198 | 4.74 | 212/234 | 5.86 | 455/522 | 7.10 |
Probability | 167/205 | 6.00 | 178/198 | 4.74 | 210/234 | 5.92 | 455/522 | 7.10 |
Equal Probability | 165/205 | 6.20 | 178/198 | 4.74 | 210/234 | 5.92 | 452/522 | 7.10 |
Frequency | 102/205 | 26.62 | 112/198 | 9.75 | 135/234 | 13.69 | 321/522 | 21.35 |
Inverted frequency | 154/205 | 7.92 | 159/198 | 6.00 | 188/234 | 10.10 | 436/522 | 13.74 |
Lexicographic order | 105/205 | 16.44 | 112/198 | 12.39 | 126/234 | 13.93 | 308/522 | 22.00 |
Quorum | Max Similarity (underlying/original) | Max Similarity (underlying/original) |
---|---|---|
5 | 26/26 | 9/12 |
10 | 18/18 | 12/12 |
15 | 11/11 | 9/12 |
20 | 9/9 | 12/12 |
22 | 9/9 | 12/12 |
25 | 6/6 | 6/6 |
30 | 6/6 | 6/6 |
Quorum | Max Similarity (underlying/original) | Max Similarity (underlying/original) |
---|---|---|
15 | 11/12 | 11/12 |
20 | 11/12 | 12/12 |
25 | 12/12 | 8/10 |
30 | 10/12 | 8/10 |
35 | 10/12 | 9/10 |
40 | 10/12 | 8/8 |
45 | 12/12 | 8/8 |
50 | 12/12 | 8/8 |
60 | 10/10 | 8/8 |
70 | 10/10 | 8/8 |
80 | 9/10 | 8/8 |
90 | 9/10 | 8/8 |
100 | 9/10 | 8/8 |
6. Conclusions
Acknowledgments
© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Comin, M.; Verzotto, D. Filtering Degenerate Patterns with Application to Protein Sequence Analysis. Algorithms 2013, 6, 352-370. https://doi.org/10.3390/a6020352