Filtering Degenerate Patterns with Application to Protein Sequence Analysis

In biology, the notion of degenerate pattern plays a central role in describing various phenomena. For example, protein active site patterns, like those contained in the PROSITE database, e.g., [FY]DPC[LIM][ASG]C[ASG], are, in general, represented by degenerate patterns with character classes. Researchers have developed several approaches over the years to discover degenerate patterns. Although these methods have been exhaustively and successfully tested on genomes and proteins, their outcomes often far exceed the size of the original input, making the output hard to manage and to interpret in refined analyses requiring manual inspection. In this paper, we discuss a characterization of degenerate patterns with character classes, without gaps, and we introduce the concept of pattern priority for comparing and ranking different patterns. We define the class of underlying patterns for filtering any set of degenerate patterns into a new set that is linear in the size of the input sequence. We present some preliminary results on the detection of subtle signals in protein families. Results show that our approach drastically reduces the number of patterns in output for a tool for protein analysis, while retaining the representative patterns.


Introduction
In biology, the notion of degenerate pattern, or indeterminate pattern, plays a central role in describing various phenomena. For example, protein functional patterns, like those contained in the PROSITE database [1], e.g., [FY]DPC[LIM][ASG]C[ASG], are, in general, represented by degenerate patterns with character classes and denote conserved sites in protein families.
The de novo discovery of degenerate patterns in protein and genome sequences is very important [2-9]. Such patterns usually correspond to residues conserved by evolution, due to some significant structural or functional role. Moreover, the large availability of biological sequences, recently achieved with high-throughput sequencing technologies and a multitude of new protein discoveries, has increased the number and complexity of patterns required by scientists in order to perform a complete analysis of biological processes.
In order to fill this gap, researchers have developed several approaches over the years, following the framework of de novo degenerate pattern discovery [3,10-13]. Pattern discovery techniques have been used for a wide range of applications, from data compression [14,15] to data classification [16]; from protein family detection [17,18] to whole genome comparison [19], as well as the discovery of transcription factor binding sites (TFBS) [20,21]. Although de novo degenerate pattern discovery methods have been exhaustively and successfully tested on genomes and sets of proteins [12,18,22,23], their outcome, a huge set of patterns, often far exceeds the size of the original input, making it impractical to manage and to interpret with further analysis requiring manual inspection.
In this paper, we address the problem of degenerate pattern comparison and filtering, in order to shrink the output of any pattern discovery tool for the de novo identification of subtle biological signals.
We first give a characterization of degenerate patterns, in particular of patterns with character classes without wildcards/gaps, and define the notions of minimality and pattern priority for comparing and ranking different patterns. Then, we introduce the concept of underlying pattern for filtering any set of degenerate patterns into a new set that is linear in the size of the reference sequence, s. Each underlying pattern will have the highest priority in some regions of s. We finally discuss some experimental results on the identification of subtle signals in protein families by means of underlying patterns. Preliminary results show that our approach drastically reduces the number of patterns in output from a de novo pattern discovery tool [12] for protein analysis, retaining the actual representative patterns [24].

Ranking and Clustering Degenerate Patterns
In this paper, we are interested in patterns with character classes, also called degenerate patterns (with no wildcards/gaps), which represent highly conserved parts in more complex signatures of protein families sharing remote homologies. The classification of their symbols often does not completely partition the entire alphabet of amino acids, and thus, in practical cases, patterns show classes with non-empty intersections. Let us consider, for example, the nickel-dependent hydrogenase large subunit signature RG [ In case the considered classes of characters form a partition of the original alphabet, Σ, we can map the input sequences into the new alphabet of symbols represented by the partition, thus considerably reducing the number of candidate patterns to be analyzed to find the real signature. Unfortunately, if we consider the (already discovered) signatures reported in PROSITE, most of the time, this is not true. In most cases, this will result in the combinatorial explosion of candidate patterns to be considered [12,25], which consequently affects the efficiency of matching and discovering new patterns.
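The remapping idea for the partition case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `is_partition` and `remap`, the toy alphabet and the toy classes are hypothetical and are not taken from PROSITE or from any tool discussed here.

```python
from itertools import combinations

def is_partition(classes, alphabet):
    """True if the classes are pairwise disjoint and together cover the alphabet."""
    sets = [set(c) for c in classes]
    disjoint = all(a.isdisjoint(b) for a, b in combinations(sets, 2))
    covers = set().union(*sets) == set(alphabet)
    return disjoint and covers

def remap(sequence, classes):
    """Rewrite a sequence over the reduced alphabet: one integer symbol per class."""
    symbol_of = {ch: i for i, c in enumerate(classes) for ch in c}
    return [symbol_of[ch] for ch in sequence]

# Toy alphabet and classes (assumptions for illustration only).
toy_alphabet = "AGSTC"
print(is_partition(["AG", "ST", "C"], toy_alphabet))        # True
print(is_partition(["AG", "ASG", "C", "T"], toy_alphabet))  # False: A and G are shared
print(remap("CAST", ["AG", "ST", "C"]))                     # [2, 0, 1, 1]
```

When the check fails, as it typically does for classes with non-empty intersections, no such remapping is available and each position may belong to several classes at once.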
Moreover, all pattern discovery tools must ultimately rank the output, according to some measure of importance. However, even if very sophisticated probabilistic scoring mechanisms are in place in a tool for protein analysis, such as Varun [12,25], there are cases where the direct interpretation of results is far from trivial. Let us consider, for instance, the worst case of Varun presented in Table 1. The relevant signatures, shown in bold, appear at position 42, whereas almost all top signatures are very similar to each other. In order to filter this output and enhance its readability, there are two main issues. On the one hand, most signatures are very similar in the contained degenerate patterns, and therefore, they must be clustered together. On the other hand, if we are interested in a specific region of the sequences under examination, we would like to select the most important signature that appears in that region according to some rule. These two issues are, in practice, tightly related and need to be addressed as one. In this article, we focus our attention on degenerate patterns that are the fundamental units of representative signatures, e.g., short linear motifs (SLiMs), in order to identify those representing the same loci, and thus, reduce the number of candidate solutions to be tested. In particular, we will follow and adapt the idea of pattern minimality introduced in [32] for exact patterns to the case of degenerate patterns. This latter notion considers exact patterns representing particular equivalence classes, or sets of locations, with the smallest possible content. In the same way, we will identify the unique degenerate pattern characterizing a set of locations with the minimal degeneracy among the patterns given in the input.
Recently, a number of ensemble methods addressing some of these clustering issues have been proposed [33][34][35][36].The general idea is that one can integrate the outcome of several pattern discovery algorithms based on sophisticated heuristics, in order to improve the ability of finding subtle biological signals in specific contexts.
The ensemble methods actually rank all the predicted patterns according to some scoring function and then report the top patterns. WebMotifs [33], ARCS-Pattern [34], MotifMiner [35] and MotifVoter [36] assume that the consensus of several state-of-the-art tools is likely to produce the actual functional pattern. They ultimately cluster all patterns and report only those from the best clusters. However, if none of the patterns from the individual finders can accurately capture the transcription factor binding sites, the performance of the ensemble methods will suffer. Although these methods help to improve the performance of pattern finding, the improvement is usually not significant. For example, as reported in [36], the application of most ensemble tools in Tompa's benchmark [37] and Escherichia coli datasets improved the average sensitivity by 62%, but the average precision is reduced by 15%; only MotifVoter promises to improve these numbers.
Table 1. Output of Varun for the G-protein coupled receptors family 3 (id PS00980), consisting of 25 sequences of about 25,000 amino acids each. One of the relevant signatures is shown in bold [12].

Problem Formulation
More formally, the problem we address is the following: Given a reference sequence, s, which is the concatenation of multiple protein sequences and a set of patterns with character classes, M, which is the outcome of one or more pattern finders, reduce the set, M, to a small number of ordered representative patterns, say U, such that every position of s is covered by at most one pattern of U.
In this paper, we will discuss the basic problems behind degenerate pattern comparison and ranking, in a simple and conservative fashion; in particular, we will focus our attention on protein sequences with remote evolutionary relationships. In principle, there are several ways to compare degenerate patterns, and they use the number of occurrences or the pattern locations; we will consider only the pattern locations. We propose a simple, yet effective, way that will be compared with other traditional binary relations in Section 5. There are mainly three features that characterize a degenerate pattern: length, degeneracy and its location list. A trade-off between these three features is required in order to compare and rank all patterns [38]. We chose the combination, called pattern priority, which tries to represent most of the representative patterns present in PROSITE. An intuitive way to rank all patterns is to favor degenerate patterns representing large matching regions between sequences, which are more likely to be "functional" [39], and thereafter, those patterns with the least degeneracy, which represents the degree of divergence during evolution of sequences and is "a prerequisite for and an inescapable product of the process of natural selection itself" [40].
Finally, here, our main intent is to use and experimentally validate our heuristic pattern priority rule. This can be viewed as the first step towards an efficient de novo pattern discovery algorithm able to directly discover conserved patterns, without computing a (possibly) exponential number of degenerate patterns. In fact, it is well known that the problem of finding degenerate patterns consistent with a set of heterogeneous sequences is, in general, NP-complete [41], and thus, fast heuristics are required.

Preliminary Definitions
A string is a sequence of symbols from an alphabet, Σ. The set of all strings over Σ is denoted by $\Sigma^*$. The length of a string, s, is denoted by |s| and the i-th symbol of s is $s_i$, where $1 \le i \le |s|$. Let, now, $s = s_1 s_2 \ldots s_n$ be a string of length |s| = n on Σ. This is the reference sequence and, in practice, will be the concatenation of several sequences. For ease of explanation, here, we consider only one sequence, and in Section 5, we will clarify how to deal with multiple sequences. A symbol from Σ, say σ, is called a solid character. A character class, C, is a subset of Σ of cardinality of at least two. Let us assume that we have several character classes, $C_j$. For the rest of the paper, we assume that we are given a string, s, on Σ of length, n, and a positive integer, q, $2 \le q \le |s|$, called quorum.
Definition 1 (Sequence) A sequence with character classes, or simply a sequence, is a string of consecutive symbols defined on {2 C j }.
In the literature, a sequence with character classes is also called a degenerate pattern or an indeterminate string [4-9]. Let p be a sequence with character classes; then, we write its symbols by means of square brackets. For example, if $p = p_1 p_2 \ldots p_k$, where |p| = k, and each symbol, $p_j$, with $1 \le j \le k$, is either a solid character or a character class, C, and is written as   The main issue with this example is that the character classes might not be a partition of Σ. This is very common for PROSITE functional patterns, e.g., A pattern, m, is then defined as a pair, $(p, L_m)$, where p is a sequence that occurs at the locations given by $L_m$ and $L_m$ is a set of at least q locations of s. For instance, in Table 2,  ) is a pattern with sequence, $p = p_1 p_2 \ldots p_k$, and location list, $L_m = (l_1, l_2, \ldots, l_\nu)$, if and only if all of the following hold: (i) p has at least two symbols, $|p| \ge 2$; (ii) the location list has at least q occurrences, $|L_m| \ge q$; and (iii) there does not exist a location, $l \notin L_m$, such that p occurs at l in s (that is, $L_m$ is complete).
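Conditions (i)-(iii) can be made concrete with a short Python sketch. It is a minimal illustration, not the implementation of any tool mentioned in this paper: the helper names `parse`, `locations` and `make_pattern`, the bracket notation accepted by `parse`, and the toy string are all assumptions. Locations are 1-based, as in the text.

```python
def parse(text):
    """Parse 'A[BC]D' into a list of symbols, each a frozenset of characters."""
    symbols, i = [], 0
    while i < len(text):
        if text[i] == "[":
            j = text.index("]", i)              # matching closing bracket
            symbols.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            symbols.append(frozenset(text[i]))  # solid character
            i += 1
    return symbols

def locations(p, s):
    """All 1-based positions of s where p occurs; complete by construction (iii)."""
    k = len(p)
    return [l + 1 for l in range(len(s) - k + 1)
            if all(s[l + j] in p[j] for j in range(k))]

def make_pattern(p, s, q):
    """Return (p, L_m) if conditions (i) and (ii) hold for quorum q, else None."""
    if len(p) < 2:                              # condition (i)
        return None
    loc = locations(p, s)
    return (p, loc) if len(loc) >= q else None  # condition (ii)

s = "ABDACDABD"                                 # toy reference string
print(locations(parse("A[BC]D"), s))            # [1, 4, 7]
print(make_pattern(parse("ABD"), s, 3))         # None: only two occurrences
```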

Minimal Patterns and Pattern Priority
In this section, we introduce the notions of minimality and pattern priority. The former will be used to avoid useless characters in the definition of a pattern, the latter as a means for comparison.
Definition 4 (Minimal representation µ(·)) Given a sequence, p, of length, k, the minimal representation of p is a sequence, µ(p), of length, k, with symbols $\mu(p)_j = \bigcup_{l \in L_m} s_{l+j-1}$, for $1 \le j \le k$.

Remark 1
The minimal representation of a sequence is unique.
Remark 2 Since µ(p) is more specific than p, that is, µ(p) cannot have more occurrences in s than p, then the list of occurrences of µ(p) must be the same as p.
Let $\mu(m) = (\mu(p), L_m)$ be the minimal representation of a pattern, m, with sequence, p; then, Remark 2 suggests that µ(m) agrees with the definition of pattern (that is, $L_{\mu(m)}$ is complete):
Definition 5 (Minimal pattern) The minimal representation of m, given by $\mu(m) = (\mu(p), L_m)$, is called a minimal pattern.
Computing the minimal representation of a pattern is useful when a pattern is composed of character classes. To have a more concrete idea about this concept, Table 3 shows
Table 3. Example of minimal representation of a pattern, m, with sequence, p.
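The minimal representation of Definition 4 can also be illustrated with a small Python sketch. The function names `minimal_representation` and `show`, as well as the toy string and location list, are hypothetical; locations are 1-based, as in the text.

```python
def minimal_representation(k, loc, s):
    """mu(p)_j = union over l in L_m of s_{l+j-1}, for j = 1..k (1-based l)."""
    return [frozenset(s[l - 1 + j] for l in loc) for j in range(k)]

def show(symbols):
    """Render a list of symbols with square brackets around character classes."""
    return "".join(next(iter(c)) if len(c) == 1
                   else "[" + "".join(sorted(c)) + "]"
                   for c in symbols)

# Toy example: the sequence A[BCE]D occurs in s at locations 1, 4 and 7,
# but the character E never appears at those locations, so it is dropped.
s = "ABDACDABD"
print(show(minimal_representation(3, [1, 4, 7], s)))   # A[BC]D
```

In this toy case, A[BCE]D and A[BC]D have the same location list, and the minimal representation keeps only the characters that are actually observed in s.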
Let M be a set of patterns with character classes lying on the string, s. From Remark 1 we have that each pattern, m ∈ M, has a unique minimal representation, µ(m). Thus, one can easily check that $\mu(m) = \mu(m')$ is an equivalence relation. Let us map all the patterns in M into the set of their minimal version, µ(M), where each pattern, m ∈ µ(M), is the minimal, $m = \mu(m')$, of some pattern, $m' \in M$. Then, the set of patterns, M, is partitioned into equivalence classes by the binary relation of equality between minimal patterns.
Remark 3 Two patterns, m and m', in M may have the same location list; thus, they will be mapped into the same minimal pattern, $\mu(m) = \mu(m')$. On the other hand, two minimal patterns with the same location list must have different lengths.
We call µ(M) the minimal set of M. Since mapping M in µ(M) could mean a drastic reduction in the number of patterns, this is, in practice, a first step in filtering. Now, we define a simple property of patterns with character classes, the degeneracy, that is, the number of characters in a pattern. With this notion, we can define the priority between patterns, as a means for comparing different patterns. Note that several notions of priority can be established at this stage. We choose a very intuitive combination of pattern length, pattern degeneracy and location list. Given our binary relation, called pattern priority, we want to prove that some basic properties hold: M is said to be totally sub-ordered. Furthermore, M is said to be totally ordered if R is irreflexive, antisymmetric, transitive, that is, $(m\,R\,m' \text{ and } m'\,R\,m'') \Rightarrow m\,R\,m''$, and total.
Lemma 1 A set, M, is totally ordered under a binary relation, R, if and only if it is totally sub-ordered.
Proof It is straightforward that transitivity implies acyclicity, that is, if M is totally ordered, then M is also totally sub-ordered. It is also easy to see that acyclicity and totality, together, imply transitivity. Consider a chain of patterns in M: $m_1\,R\,m_2, \ldots, m_{t-1}\,R\,m_t$. Since, by acyclicity, $m_t\,R\,m_1$ does not hold, then, by totality, $m_1\,R\,m_t$. Hence, transitivity holds and, therefore, the definitions totally sub-ordered and totally ordered coincide. □
Now, before showing that some of these properties hold, we need to prove two preliminary results for the pattern priority rule. At first, we observe the following.

Fact 1
The binary relation of pattern priority is irreflexive and antisymmetric (since properties (1), (2) and (3) of Definition 7 are strictly defined).
Lemma 2 Let m and m' be two patterns with $\min\{L_m \setminus L_{m'}\} < \min\{L_{m'} \setminus L_m\}$, such that the two minima both exist, and define j to be $\min\{L_m \setminus L_{m'}\}$. Then, the occurrences of m and m' at positions less than j must be the same.
Proof We have to prove that, in case (3) of Definition 7, the occurrences of m and m' less than j are identical. If m has an occurrence less than j that is not in $L_{m'}$, then $\min\{L_m \setminus L_{m'}\} < j$, which is impossible by hypothesis. Conversely, if m' has occurrences less than j that are not in $L_m$, then $\min\{L_{m'} \setminus L_m\} < j = \min\{L_m \setminus L_{m'}\}$ contradicts again our assumptions. Thus, the occurrences of the two patterns less than j must be the same, and we call them paired. Finally, by assumptions, it trivially holds that $j \notin L_{m'}$. □
Lemma 3 Let $m_1 \to m_2, \ldots, m_{t-1} \to m_t$ be a chain of patterns with the same length and degeneracy. Then, either $m_1 \to m_t$ or $L_{m_t} \subset L_{m_1}$ holds.
Proof We will prove the statement by induction on t. Let the basis be t = 2. In this case, the chain, $m_1 \to m_2$, coincides with the result. We will show now that, if either $m_1 \to m_t$ or $L_{m_t} \subset L_{m_1}$ holds, then either $m_1 \to m_{t+1}$ or $L_{m_{t+1}} \subset L_{m_1}$ holds as well. Assume $L_{m_t} \subset L_{m_1}$. Define j to be $\min\{L_{m_t} \setminus L_{m_{t+1}}\}$; this minimum exists, since $m_{t-1} \to m_t$ (by property (3) of Definition 7). It follows that $j \in L_{m_1}$ and, from Lemma 2, that the occurrences of $m_t$ and $m_{t+1}$ that are less than j are paired. Hence, these occurrences are shared also with $m_1$. Since $j \notin L_{m_{t+1}}$, then either $m_1 \to m_{t+1}$ holds (because j makes the difference between $L_{m_1}$ and $L_{m_{t+1}}$) or $m_1$ and $m_{t+1}$ are incomparable. In the latter case, $L_{m_t} \subset L_{m_1}$ and $L_{m_1} \subseteq L_{m_{t+1}}$ imply $L_{m_t} \subset L_{m_{t+1}}$, which is impossible, since $m_t$ and $m_{t+1}$ are comparable by hypothesis. Thus, in this latter case, we have that $L_{m_{t+1}} \subset L_{m_1}$.
Conversely, assume $m_1 \to m_t$. Define j as before and $j' = \min\{L_{m_1} \setminus L_{m_t}\}$. It holds that $j \notin L_{m_{t+1}}$ (as already observed), and that $j' \neq j$, because $j' \notin L_{m_t}$. Then, by Lemma 2, the occurrences of $m_t$ and $m_{t+1}$ that are less than j are paired, and the same is true for the occurrences of $m_1$ and $m_t$ that are less than j'. It follows that, if $m_1$ and $m_{t+1}$ are comparable, there exists an occurrence in $L_{m_1}$ that is more on the left with respect to those of $L_{m_{t+1}}$: if $j < j'$, the occurrences of $m_1$, $m_t$ and $m_{t+1}$ less than j are paired together, and thus, j makes the difference; otherwise ($j > j'$), the occurrences of $m_1$, $m_t$ and $m_{t+1}$ less than j' are paired together, and hence, $j' \in L_{m_1}$ makes the difference. Alternatively, $L_{m_1} \subseteq L_{m_{t+1}}$ implies that the occurrences of $m_1$ equal to or less than j' are shared with $m_{t+1}$, which leads to $m_{t+1} \to m_t$, which is impossible. □
Theorem 1 Any set of patterns, M, is sub-ordered with respect to the binary relation of pattern priority.
Proof We have to prove that the relation of pattern priority is irreflexive, antisymmetric and acyclic. The first two properties are stated in Fact 1. Now, following the work in Lemma 3, we can prove that the acyclicity holds too. First, observe that length and degeneracy are intrinsic properties of the single pattern. If all patterns in M have different lengths and degeneracies, then by definition of pattern priority, it is always true that either $m \to m'$ or $m' \to m$, and a cycle can never exist, because of different lengths or degeneracies. Alternatively, consider a chain of patterns, $m_1 \to m_2, \ldots, m_{t-1} \to m_t$, with the same length and degeneracy. In this case, we must use property (3) of Definition 7 to compare the patterns together. From Lemma 3, it follows that a cycle of pattern priority between any chain of patterns is, again, impossible, and hence, the acyclicity holds. □
Note that the non-decision on some binary comparisons finally discards the relation of totality and, as seen above, of transitivity.
The issue is that these patterns are not minimal, like most of the patterns in PROSITE. In the following, we set the basis to solve this problem.
Theorem 2 Given any set of patterns, M, its minimal set, µ(M), is totally ordered under the binary relation of pattern priority.
Proof Along with Theorem 1, we have that any set of patterns, M, is sub-ordered under pattern priority; thus, also the set of minimal patterns, µ(M), is sub-ordered. Following Lemma 1, we have to prove that the totality holds on this new set, µ(M), that is, every pair of minimal patterns must be comparable under pattern priority. In other words, if $m \neq m'$, it must hold either $m \to m'$ or $m' \to m$.
From Remark 3, we have that $L_m \neq L_{m'}$ for two minimal patterns, m and m', with the same length, and thus, we have, without loss of generality, two cases to consider: $L_m \not\subset L_{m'}$ or $L_m \subset L_{m'}$. From now, m and m' will be two minimal patterns with the same length and, for the former case, also with the same degeneracy c(m). Thus, in the former case, if we consider $\min\{L_m \setminus L_{m'}\}$ and $\min\{L_{m'} \setminus L_m\}$, the two minima exist and are different from each other; hence, it holds that either $m \to m'$ or $m' \to m$, respectively, if the minimum of the two sets is either in $L_m$ or in $L_{m'}$. In the latter case, $L_m \subset L_{m'} \Rightarrow c(m) < c(m')$, since m and m' are minimal patterns, and the respective location lists are complete (that is, the respective patterns of m and m' must be different); therefore, it holds that $m \to m'$. Thus, for any set of minimal patterns under pattern priority, the totality holds. From Lemma 1, we can conclude that any set of minimal patterns is totally ordered under the pattern priority rule. □
As a consequence of Theorem 2, all minimal patterns can be compared and ranked. We can further observe that every minimal pattern has priority over the patterns within its equivalence class, due to property (2) of pattern priority. Now, it is clear that any set of patterns can be mapped into its minimal representative set and that we can build a measure of total order over this set.
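As a rough illustration of how such a total order can be used for ranking, the following Python sketch compares two patterns. It rests on explicit assumptions: that properties (1)-(3) of Definition 7 prefer, in this order, greater length, smaller degeneracy, and the pattern owning the smaller of the two minima $\min\{L_m \setminus L_{m'}\}$ and $\min\{L_{m'} \setminus L_m\}$, as suggested by the discussion in the Problem Formulation and by the proofs above; and that degeneracy counts one per solid character plus the cardinality of each class. The names `degeneracy`, `has_priority` and `rank` are hypothetical.

```python
from functools import cmp_to_key

def degeneracy(p):
    """c(m): total number of characters over all symbols (assumed definition)."""
    return sum(len(c) for c in p)

def has_priority(m1, m2):
    """True if m1 -> m2 under the assumed ordering of properties (1)-(3)."""
    (p1, loc1), (p2, loc2) = m1, m2
    if len(p1) != len(p2):
        return len(p1) > len(p2)            # (1) longer patterns first
    c1, c2 = degeneracy(p1), degeneracy(p2)
    if c1 != c2:
        return c1 < c2                      # (2) less degenerate patterns first
    only1, only2 = set(loc1) - set(loc2), set(loc2) - set(loc1)
    if only1 and only2:
        return min(only1) < min(only2)      # (3) leftmost unshared occurrence
    return False                            # incomparable or identical

def rank(patterns):
    """Sort by decreasing priority; meaningful when the order is total."""
    def cmp(a, b):
        return -1 if has_priority(a, b) else (1 if has_priority(b, a) else 0)
    return sorted(patterns, key=cmp_to_key(cmp))

fs = frozenset
abd = ([fs("A"), fs("B"), fs("D")], [1, 7])
abcd = ([fs("A"), fs("BC"), fs("D")], [1, 4, 7])
ab = ([fs("A"), fs("B")], [1, 7])
print(rank([ab, abcd, abd]) == [abd, abcd, ab])   # True
```

The `False` branch at the end of `has_priority` corresponds to the incomparable pairs that can arise among non-minimal patterns; on a minimal set, by Theorem 2, it is reached only when a pattern is compared with itself.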

Pattern Filtering
Here, we describe an application of pattern priority; the objective is to select the most important patterns in µ(M) for each location of s, according to our pattern priority rule. If a pattern, m, is selected, we filter out all patterns with less priority that lie on the same locations as m. If these locations are, for example, transcription factor binding sites or coding sequences of a genome, we will select only one pattern that best represents these locations. We say that an occurrence, l, of m is tied to an occurrence, l', of m', if the two occurrences overlap, i.e., $[l, l + |m| - 1] \cap [l', l' + |m'| - 1] \neq \emptyset$. Otherwise, we say that l is untied from l'.
Definition 9 (Underlying pattern) The set of patterns, U ⊆ µ(M), is said to be underlying if and only if: (i) every pattern, m, in U, called an underlying pattern, has at least q occurrences that are untied from all the untied occurrences of other patterns in U \ m; and (ii) there does not exist a pattern, m ∈ µ(M) \ U, such that m has at least q untied occurrences from all the untied occurrences of patterns in U.
This subset of µ(M) is composed only of those patterns that rank highest under our priority rule for some location of s. The following algorithm filters any set of patterns, M, into the reduced set of underlying patterns, U.
2. Rank all minimal patterns in µ(M) using the pattern priority rule.
3. At each step, select the top pattern, m, from µ(M):
   • If all of its occurrences are tied/covered by some other patterns already in U, discard m;
   • Otherwise, if m has at least q untied occurrences, add m to U and update the locations of vector, Γ, in which m appears.
The correctness of the algorithm follows from Theorem 2. Let n be the size of the string, s. In the third step, for each pattern, m, that has been selected by our algorithm, we store the occurrences of m in a vector of Booleans, Γ, that represents the locations of s. This means that, for all $l \in L_m$, we store the value TRUE in the locations Γ[l + i], for $0 \le i < |m|$. This vector is then used to check if an occurrence is untied in constant time. Thus, for each pattern in µ(M), checking for untied occurrences and updating the vector, Γ, takes O(n) time. In total, step 3 requires O(n|µ(M)|) time.
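The selection step and the role of the vector Γ can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the function name `underlying_filter` is hypothetical, the input is assumed to be already minimal and sorted by decreasing priority, locations are 1-based, and an occurrence is tested by scanning all of its positions rather than by a constant-time check.

```python
def underlying_filter(ranked, n, q):
    """Greedy selection of underlying patterns.

    ranked: minimal patterns (p, locations) sorted by decreasing priority;
            only the length of p is used here.
    n:      length of the reference string s.
    q:      quorum on the number of untied occurrences.
    """
    gamma = [False] * (n + 1)          # gamma[x]: position x is covered (index 0 unused)
    underlying = []
    for p, loc in ranked:
        k = len(p)
        # An occurrence is untied if none of its k positions is already covered.
        untied = [l for l in loc
                  if not any(gamma[l + i] for i in range(k))]
        if len(untied) >= q:
            underlying.append((p, loc))
            for l in loc:              # mark all occurrences of m, as described above
                for i in range(k):
                    gamma[l + i] = True
    return underlying

# Toy input over s = "ABDACDABDACD" (n = 12), already ranked.
ranked = [
    (("A", "B", "D"), [1, 7]),
    (("A", "C", "D"), [4, 10]),
    (("A", "BC", "D"), [1, 4, 7, 10]),
    (("A", "BC"), [1, 4, 7, 10]),
]
print([p for p, _ in underlying_filter(ranked, 12, 2)])
# [('A', 'B', 'D'), ('A', 'C', 'D')]
```

In this toy case, the two more degenerate patterns are discarded because all of their occurrences are tied to occurrences of patterns already selected.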
In short, the total complexity of the algorithm is $O(n^2 |M|)$. Note that this complexity does not depend on the notion of pattern priority used; thus, also, other binary relations can be applied with the same complexity. If the structure of the pattern priority is considered, it is possible to reduce the complexity using some results developed for patterns without character classes [19]. However, this is out of scope for this paper, since the Underlying Pattern Filtering is very fast in practice.
Finally, we observe that the untied occurrences of all patterns, m, in U are non-overlapping. From this consideration, we have the following result:
Corollary 1 The number of patterns in U is ≤ n/2, independently of the size of M.
As a consequence, the set, U, has linear dimension with respect to the size of s.

Experimental Results
In this section, we discuss the ability of underlying patterns to efficiently capture meaningful biological information.A general problem in genome and proteome research is the identification of signals represented by means of degenerate patterns.In this sense, some modern degenerate pattern discovery tools proved to be useful in biological sequence analysis, as we have discussed at the beginning of the paper.Here, we first collect the degenerate patterns in the output from one of these tools, say a set of patterns, M, and then, present some preliminary experimental results in order to support the theoretical properties shown in the above sections.
More generally, there are two types of scenarios where the notion of underlying patterns could be useful.The first case is when a region of interest has already been identified, so that it is possible to analyze and select only those patterns that are underlying with respect to that particular region, without considering the whole set of patterns.Another possible application is the case where we just want to filter all patterns in M, looking at the whole sequence.
In this context, we present some results for the latter scenario. We take as input, M, the set of patterns extracted by Varun [12,25], a tool for de novo pattern discovery. The dataset consists of six protein families for which Varun successfully extracts the representative patterns contained in the PROSITE signatures (release 20.85) [12]. For each signature, we select all sequences in the Swiss-Prot database that share that signature. In summary, our dataset is the following. The signature
To prove that the information captured by the underlying patterns is relevant with respect to the protein families under examination, we devised two kinds of tests. In the first test, we compare the original set of patterns in the input, M, with the set of underlying patterns in the output, U, using global measures. In the second, we investigate the ability to filter a list of meaningful patterns and retain the representative ones, perhaps with a higher rank.
In the first set of experiments, we use Varun to extract patterns from the above families of protein sequences for different quorums, q. For each family, we concatenate the sequences and extract patterns from the concatenation; thus, in all the experiments, the quorum refers to the number of occurrences in the concatenation of all sequences of a certain family. This notion of quorum can generate several spurious patterns and, thus, will make the filtering problem more challenging. We also investigate a different setup in which the quorum value reflects the number of sequences where the pattern is contained. This latter scenario produces similar results; however, the range of values for this quorum is bounded by the number of sequences, and this can limit the possibility to study the efficiency of the filtering process, while varying the quorum values. For these reasons, we choose to use the first definition of quorum. Then, we use the extracted patterns as input to our algorithm, presented in Section 4, in order to compute the underlying patterns, U. In our experiments, each pattern, m ∈ M, extracted by Varun has character classes and no gaps, and we set the quorum, q, for the underlying patterns, to be the same as the quorum for the original patterns in the input, as by the definition seen above. The length, n, of the string in the input is equal to the length of the concatenations reported above. The size of M varies with respect to the quorum, q, and for some configurations, can be larger than the sequence length (see Figure 1). All experiments were conducted on a common PC, and the filtering process requires on average less than 10 s. For both sets of patterns, M and U, we compute some global statistics. Figure 1 shows the number of patterns, the sum of lengths of all patterns and their mean z-score for the first two protein families (Ni, Fa). The other families share the same behavior (figure not shown). The z-score is computed employing the same formula reported in [12].
As expected, the number of underlying patterns is always much smaller than the number of the original patterns. A similar conclusion can be drawn for the sum of lengths. More importantly, for small quorums, the total length of original patterns exceeds the length of sequences in both families, indicating that the original patterns are even larger than the set of sequences under examination. Moreover, as seen above, the sum of lengths of underlying patterns is always bounded by the length of sequences. These first two measures indicate that not only the number, but also the total length of underlying patterns is much smaller than that of the original patterns. Therefore, the filtering process is space-efficient, as expected.
Another important measure is the mean z-score of M and U. The z-score of a pattern, p, is a statistical measurement of the degree of over-representation of p with respect to the expected number of its occurrences. The mean z-score is thus a global measure able to capture the average quality of patterns in a set; however, it is much more computationally expensive than our pattern priority, and we just use it to validate our approach. In Figure 1, we can see that, for all quorums, the average z-score of underlying patterns is always greater than that of the original patterns, and in most cases, the difference is one or two orders of magnitude. To summarize, this first test confirms that the number and span of underlying patterns are much more manageable than those of the original set and, also, that their average quality is improved.
Once we have verified that the notion of underlying is a suitable filter, in a second series of experiments, we test the ability to retain meaningful patterns. To this end, we employ the candidate patterns obtained from the previous experiments to test the presence of PROSITE signatures. We consider in detail again the first two families, Fa and Ni, and the corresponding signatures. The other four families show similar results and are summarized in Table 4.
We consider $Ni_2$ and $Fa_2$ directly as degenerate patterns, due to their low number of gaps, while we split the signatures $Ni_1$ and $Fa_1$ into two different patterns each and compute the statistics of these patterns accordingly. For both sets of patterns, M and U, we compute the maximum similarity between each pattern in the set and the two signatures. The similarity between two patterns, $m$ and $m'$, is the number of shared characters, including character classes, in the best alignment of $m$ versus $m'$, without considering indels. Tables 5 and 6 summarize the maximum similarity, for different quorums, of M and U, with each signature of the two protein families, Ni and Fa. The second and third columns report the maximum similarity for the set of underlying patterns divided by the similarity of the original set. For example, in the first row of Table 5, the quorum is five. In this case, the maximum similarity of M with the representative pattern, $Ni_1$, is 26. The same value is obtained also for the corresponding set of underlying patterns, U, thus indicating that the pattern, $Ni_1$, is retained with the same degree of accuracy.
Table 4. Comparison of performance between different binary relations applied to the underlying patterns: pattern priority, z-score, probability with distribution based on the amino acid frequencies in s, probability with no background (i.e., each amino acid scores 1/20), frequency of patterns in s, inverse frequency and the lexicographic order of occurrences. For each family, we summed up the maximum similarity with the two representative patterns for all quorums. Similarly, in the column rank, we show the average rank of the closest candidates to these patterns.
In principle, the binary relation of pattern priority can be replaced by any other traditional means of comparison. To this end, we compared the pattern priority rule with other standard ranking methods that were applied to the underlying filtering step. Table 4 reports the average scores for each measure for all six protein families, where a large maximum similarity with the two PROSITE signatures and a higher rank are preferable. In this context, we consider different ranking methods: the lexicographic order of patterns (Lexicographic), the number of occurrences (Frequency) and its inverse (Inverted Frequency). Other statistical ranking methods are also considered here: pattern probability assuming either an i.i.d. distribution of symbols based on amino acid frequencies (Probability) or an equal distribution of symbols (Equal Probability); and z-score, computed using the probability value as in [12].
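The similarity measure used in these comparisons (the number of shared characters, including character classes, in the best alignment without indels) can be illustrated with a minimal sketch. It assumes that a column counts as shared only when the two positions hold exactly the same character or the same class; the text does not state this detail, and counting intersecting classes instead would be an equally plausible reading. The function name `similarity` is hypothetical.

```python
def similarity(p, q):
    """Maximum number of identical columns over all ungapped alignments of p and q.

    p, q: lists of strings, one per position; a string with several letters
    denotes a character class (e.g. "acd" for [a,c,d]).
    """
    a = [frozenset(x) for x in p]
    b = [frozenset(x) for x in q]
    best = 0
    # Slide q along p; position j of q is aligned with position j + shift of p.
    for shift in range(-(len(b) - 1), len(a)):
        matches = 0
        for j, cls in enumerate(b):
            i = j + shift
            if 0 <= i < len(a) and a[i] == cls:
                matches += 1
        best = max(best, matches)
    return best


if __name__ == "__main__":
    p = ["a", "acd", "be"]   # a[a,c,d][b,e]
    q = ["be", "a", "acd"]   # [b,e]a[a,c,d]
    print(similarity(p, q))
```

Because only shifts are tried, the cost is proportional to the product of the two pattern lengths, which is negligible for patterns the size of PROSITE signatures.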
We can easily see that our pattern priority achieves the best scores among all methods for the detection of representative patterns. In addition, our heuristic ranks the reference patterns of Ni, on average, in the top three out of 2239 candidate patterns in the input and those of Fa in the top five out of 10,842 patterns, while for the other families, the reference patterns range from the top three to the top five selected patterns, on average. The definition of pattern priority was conceived especially for degenerate patterns, like those presented in this section. However, this framework can be used in conjunction with other comparison functions designed specifically for patterns with profiles or variable gaps, e.g., [38].
These preliminary experiments support the validity of the theoretical results presented in the previous sections and also indicate their effectiveness for protein analysis. However, a more comprehensive experimental setting is desirable, in order to strengthen the evidence on the performance and to compare this method with other existing tools, like MEME [42].

Conclusions
In this paper, we have studied patterns with character classes, introducing basic properties for the comparison and ranking of patterns. We have proven theoretical results that support the validity of these properties. Most importantly, the pattern priority rule, together with the notion of underlying patterns, has proven to be valuable for the analysis of biological sequences, bounding the total length of degenerate patterns in the output from any modern pattern discovery tool. Preliminary experiments on protein families have shown the good performance of our approach as a filter to reduce the number of patterns in the output, while keeping the representative ones. More experiments should be conducted in order to compare these results with other tools, like MEME [42]. We are currently working to extend the notion of underlying patterns to whole genome comparison.
and the two classes are $C_1 = \{a, c, d\}$ and $C_2 = \{a, b, e\}$, then $p = a[a, c, d][b, e]$ is a sequence with character classes of length $k = 3$.

$p_j$: $a$, $[a, c, d]$, $[b, e]$
$m = (p, L_m)$ is a pattern with sequence $p = a[a, c, d][b, e]$ of length $k = 3$ and location list $L_m = \{1, 5, 8\}$. More formally:
Definition 3 (Pattern, location list) We say that $m = (p, L_m$
an example of minimal representation of the pattern $m = (p = [a, c, d][a, c, d][a, b, e], \{1, 5, 8\})$, where the reference string is $s = aabeadbace$ and the classes are $C_1 = \{a, c, d\}$ and $C_2 = \{a, b, e\}$. In this case, we say that $\mu(m) = (\mu(p) = a[a, c, d][b, e], \{1, 5, 8\})$ is a minimal pattern. In practice, most of the functional patterns reported by PROSITE are not minimal.
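The running example can be checked with a short sketch that computes a location list and a minimal representation. The formal definition of minimality is not reproduced in this section, so the code assumes, from the example alone, that the minimal representation keeps at each position only the characters that actually appear there across all occurrences. The function names are hypothetical, and positions are 1-based as in the text.

```python
def location_list(classes, s):
    """Return the 1-based start positions where the pattern matches s.

    classes: list of strings, one per pattern position ("acd" means [a,c,d]).
    """
    k = len(classes)
    return [
        i + 1
        for i in range(len(s) - k + 1)
        if all(s[i + j] in classes[j] for j in range(k))
    ]


def minimal_representation(classes, s):
    """Shrink each class to the characters observed at that offset (assumed rule)."""
    locs = location_list(classes, s)
    minimal = [
        "".join(sorted({s[i - 1 + j] for i in locs}))
        for j in range(len(classes))
    ]
    return minimal, locs


if __name__ == "__main__":
    s = "aabeadbace"
    p = ["acd", "acd", "abe"]  # [a,c,d][a,c,d][a,b,e]
    print(location_list(p, s))
    print(minimal_representation(p, s))
```

On the string and pattern of the example, this yields the location list {1, 5, 8} and the sequence a[a, c, d][b, e], matching the minimal pattern given in the text.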

Definition 6 (Degeneracy of a sequence, $c(\cdot)$) The degeneracy of a sequence, $p$, of length, $k$, is defined as $c(p) = \sum_{j=1}^{k} |p_j|$. The degeneracy of a pattern, $m = (p, L_m)$, denoted by $c(m)$, is defined as the degeneracy of its sequence, $c(p)$.
For instance, consider $s = aabeadbace$, $m = (a[a, c, d][b, e], \{1, 5, 8\})$ and $m' = ([a, d]b[a, b, e], \{2, 6\})$. Then, $m$ has priority over $m'$, written $m \rightarrow m'$, because they have the same length and degeneracy, but $1 = \min\{L_m \setminus L_{m'}\} < \min\{L_{m'} \setminus L_m\} = 2$. If two patterns, $m$ and $m'$, have the same length and degeneracy, but either $L_m \subseteq L_{m'}$ or $L_{m'} \subset L_m$, then we consider them incomparable; otherwise, they are comparable. Nevertheless, we will solve the non-comparability of two patterns with Theorem 2.
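The tie-breaking step illustrated above can be sketched as follows. The code covers only the case shown in this example, where two patterns have the same length and degeneracy and are compared through their location lists; how the full priority rule orders patterns of different length or degeneracy is not restated here, so the sketch refuses those inputs. The names `degeneracy` and `compare_same_shape` are hypothetical.

```python
def degeneracy(classes):
    """c(p): sum of the class sizes over all positions of the sequence."""
    return sum(len(set(cls)) for cls in classes)


def compare_same_shape(m1, m2):
    """Compare two patterns (classes, locations) of equal length and degeneracy.

    Returns 1 if m1 -> m2, -1 if m2 -> m1, and None if they are incomparable,
    i.e. when one location list contains the other.
    """
    (p1, l1), (p2, l2) = m1, m2
    if len(p1) != len(p2) or degeneracy(p1) != degeneracy(p2):
        raise ValueError("sketch only covers equal length and equal degeneracy")
    only1 = set(l1) - set(l2)
    only2 = set(l2) - set(l1)
    if not only1 or not only2:
        return None
    # The pattern whose first private occurrence comes earlier has priority.
    return 1 if min(only1) < min(only2) else -1


if __name__ == "__main__":
    m = (["a", "acd", "be"], [1, 5, 8])        # a[a,c,d][b,e]
    m_prime = (["ad", "b", "abe"], [2, 6])     # [a,d]b[a,b,e]
    print(degeneracy(m[0]), degeneracy(m_prime[0]))
    print(compare_same_shape(m, m_prime))
```

Both sequences have degeneracy 1 + 3 + 2 = 2 + 1 + 3 = 6, and the comparison reports that the first pattern has priority, in agreement with the example.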
For instance, consider $s = aabeadbace$ and the patterns $m = ([a,d][a,b][a,e], \{2, 6\})$ and $m' = ([a,d][b,e][a,e], \{2, 6\})$ with the same list of occurrences. In short, $|m| = |m'| = 3$, $c(m) = c(m') = 6$ and $L_m = L_{m'}$. This means that we are not able to compare $m$ and $m'$ using the pattern priority rule. Another example is given by $m_1 = (a[a,c,d][b,e], \{1, 5, 8\})$, $m_2 = ([b,e]a[a,c,d], \{4, 7\})$ and $m_3 = ([a,b][c,d][b,e], \{5, 8\})$. In this case, $m_1 \rightarrow m_2$ and $m$
During the first step, in the worst case, a pattern, $m$, can have a location list whose size, $|L_m|$, is $O(n)$, and the comparison of two ordered lists of occurrences costs $O(n)$. Thus, this step of the algorithm costs $O(n^2 |M|)$. The second phase orders the set $\mu(M)$, where each comparison between two patterns can take $O(n)$ time; thus, this step costs $O(n |\mu(M)| \log |\mu(M)|)$.

Figure 1. Total number, sum of lengths and mean z-score of the patterns extracted using Varun and their corresponding underlying patterns, for the two protein families, Ni and Fa. The dashed line in the total length diagrams indicates the total size of each family. Note that in (a) Mean Z-Score and (b) All diagrams, the ordinate is plotted on a logarithmic scale. (a) Nickel-dependent hydrogenases (Ni); (b) Coagulation factors 5/8 type C domain (Fa).

Table 2. Example of pattern. The occurrences of p in s are drawn in black.

Table 5. Normalized maximum similarity with the reference patterns of the family nickel-dependent hydrogenases, for different quorums.

Table 6. Normalized maximum similarity with the reference patterns of the family coagulation factors 5/8 type C domain, for different quorums.