Lyndon Factorization Algorithms for Small Alphabets and Run-Length Encoded Strings

: We present two modiﬁcations of Duval’s algorithm for computing the Lyndon factorization of a string. One of the algorithms has been designed for strings containing runs of the smallest character. It works best for small alphabets and it is able to skip a signiﬁcant number of characters of the string. Moreover, it can be engineered to have linear time complexity in the worst case. When there is a run-length encoded string R of length ρ , the other algorithm computes the Lyndon factorization of R in O ( ρ ) time and in constant space. It is shown by experimental results that the new variations are faster than Duval’s original algorithm in many scenarios.


Introduction
A string w is a rotation of another string w if w = uv and w = vu, for some strings u and v.A string is a Lyndon word if it is lexicographically smaller than all its proper rotations.Chen, Fox and Lyndon [1] introduced the unique factorization of a string in Lyndon words such that the sequence of factors is nonincreasing according to the lexicographical order.The Lyndon factorization is a key structure in a method for sorting the suffixes of a text [2], which is applied in the construction of the Burrows-Wheeler transform and the suffix array, as well as in the bijective variant of the Burrows-Wheeler transform [3,4].The Burrows-Wheeler transform is an invertible transformation of a string, based on sorting of its rotations, while the suffix array is a lexicographically sorted array of the suffixes of a string.They are the groundwork for many indexing and data compression methods.
Duval's algorithm [5] computes the Lyndon factorization in linear time and in constant space.Various other solutions for computing the Lyndon factorization have been proposed in the past.A parallel algorithm [6] was presented by Apostolico and Crochemore, while Roh et al. described an external memory algorithm [7].Recently, I et al. and Furuya et al. introduced algorithms to compute the Lyndon factorization of a string given in the grammar-compressed form and in the LZ78 encoding [8,9].
In this paper, we present two new variations of Duval's algorithm.The paper is an extended version of the conference paper [10].The first algorithm has been designed for strings containing runs of the smallest character.It works best for small alphabets like the DNA alphabet {a, c, g, t} and it is able to skip a significant portion of the string.The second variation works for strings compressed with run-length encoding.In run-length encoding, maximal sequences in which the same data value occurs in many consecutive data elements (called runs) are stored as a pair of a single data value and a count.When there is a run-length encoded string R of length ρ, our algorithm computes the Lyndon factorization of R in O(ρ) time and in constant space.This variation is thus preferable to Duval's algorithm when the strings are stored or maintained with run-length encoding.In our experiments, the new algorithms are considerably faster than the original one in the case of small alphabets, for both real and simulated data.
The rest of the paper is organized as follows.Section 2 defines background concepts Section 3 presents Duval's algorithm, Sections 4 and 5 introduce our variations of Duval's algorithm, Section 6 shows the results of our practical experiments, and the discussion of Section 7 concludes the article.

Basic Definitions
Let Σ be a finite ordered alphabet of σ symbols and let Σ * be the set of words (strings) over Σ ordered by lexicographic order.In this paper, we use the terms string, sequence, and word interchangeably.The empty word ε is a word of length 0. Let Σ + be equal to Σ * \ {ε}.Given a word w, we denote with |w| the length of w and with w[i] the i-th symbol of w, for 0 ≤ i < |w|.The concatenation of two words u and v is denoted by uv.Given two words u and v, v is a substring of u if there are indices 0 and for i > j u[i..j] = ε.We denote by u k the concatenation of k u's, for u ∈ Σ + and k ≥ 1.The longest border of a word w, denoted with β(w), is the longest proper prefix of w which is also a suffix of w.Let lcp(w, w ) denote the length of the longest common prefix of words w and w .We write is a rotation of w.A Lyndon word is a word w such that w < ROT(w, i), for 1 ≤ i < |w|.Given a Lyndon word w, the following properties hold: Both properties imply that no word a k , for a ∈ Σ, k ≥ 2, is a Lyndon word.The following result is due to Chen, Fox and Lyndon [11]: Theorem 1.Any word w admits a unique factorization CFL(w) = w 1 , w 2 , . . ., w m , such that w i is a Lyndon word, for 1 ≤ i ≤ m, and w 1 ≥ w 2 ≥ . . .≥ w m .
The interval of positions in w of the factor We assume the following property: Property 1.The output of an algorithm that, given a word w, computes the factorization CFL(w) is the sequence of intervals of positions of the factors in CFL(w).
The run-length encoding (RLE) of a word w, denoted by RLE(w), is a sequence of pairs (runs)

Duval's Algorithm
In this section we briefly describe Duval's algorithm for the computation of the Lyndon factorization of a word.Let L be the set of Lyndon words and let P = {w | w ∈ Σ + and wΣ * ∩ L = ∅} , be the set of nonempty prefixes of Lyndon words.Let also P = P ∪ {c k | k ≥ 2}, where c is the maximum symbol in Σ. Duval's algorithm is based on the following Lemmas, proved in [5]: Lemma 1.Let w ∈ Σ + and w 1 be the longest prefix of w = w 1 w which is in L. We have CFL(w) = w 1 CFL(w ).
Lemma 1 states that the computation of the Lyndon factorization of a word w can be carried out by computing the longest prefix w 1 of w = w 1 w which is a Lyndon word and then recursively restarting the process from w .Lemma 2 states that the nonempty prefixes of Lyndon words are all of the form (uv) k u, where u ∈ Σ * , v ∈ Σ + , k ≥ 1 and uv ∈ L. By the first property of Lyndon words, the longest prefix of (uv) k u which is in L is uv.Hence, if we know that w = (uv) k uav , (uv) k u ∈ P but (uv) k ua / ∈ P , then by Lemma 1 and by induction we have CFL(w) = w 1 w 2 . . .w k CFL(uav ), where w 1 = w 2 = . . .= w k = uv.For example, if w = abbabbaba, we have CFL(w) = abb abb CFL(aba), since abbabbab ∈ P while abbabbaba / ∈ P .Suppose that we have a procedure LF-NEXT(w, k) which computes, given a word w and an integer k, the pair (s, q) where s is the largest integer such that w[k..k + s − 1] ∈ L and q is the largest integer such that w The factorization of w can then be computed by iteratively calling LF-NEXT starting from position 0. When a given call to LF-NEXT returns, the factorization algorithm outputs the intervals [k + is, k + (i + 1)s − 1], for i = 0, . . ., q − 1, and restarts the factorization at position k + qs.Duval's algorithm implements LF-NEXT using Lemma 3, which explains how to compute, given a word w ∈ P and a symbol a ∈ Σ, whether wa ∈ P , and thus makes it possible to compute the factorization using a left to right parsing.Note that, given a word w (s, q) ← LF-NEXT(w, k) 4.
for i ← 1 to q do 5. < w} and w For example, if w = abbabbaba, we have min{j | w[j..8] < w} = 3, lcp(w, w[3..8]) = 5, and q = 2. Based on these Lemmas, the procedure LF-NEXT can be implemented by initializing j ← k + 1 and executing the following steps: and repeat step 1; otherwise return the pair (j, 1 + h/j ).It is not hard to verify that, if the lcp values are computed using symbol comparisons, then this procedure corresponds to the one used by Duval's original algorithm.

Improved Algorithm for Small Alphabets
Let w be a word over an alphabet Σ with CFL(w) = w 1 , w 2 , . . ., w m and let c be the smallest symbol in Σ. Suppose that there exists k ≥ 2, i ≥ 1 such that ck is a prefix of w i .If the last symbol of w is not c, then by Theorem 1 and by the properties of Lyndon words, ck is a prefix of each of w i+1 , w i+2 , . . ., w m .This property can be exploited to devise an algorithm for Lyndon factorization that can potentially skip symbols.Note that we assume Property 1, i.e., the output of the algorithm is the sequence of intervals of the factors in CFL(w), as otherwise we have to read all the symbols of w to output CFL(w).Our algorithm is based on the alternative formulation of Duval's algorithm by I et al. [8].Given a set of strings P, let Occ P (w) be the set of all (starting) positions in w corresponding to occurrences of the strings in P. We start with the following Lemmas: Lemma 6.Let w be a word and let s Proof.Let w i be the factor in CFL(w) such that s belongs to [a i , b i ], the interval of w i .To prove the claim we have to show that a i = s.Suppose by contradiction that s > a i , which implies which contradicts the second property of Lyndon words.Otherwise, since w i is a Lyndon word it must hold that w i < ROT(w i , s − a i ).This implies at least that w i [0] = w i [1] = c, which contradicts the hypothesis that s is the smallest element in Occ { c c} (w).
For example, if w = abaabaabbaab, we have CFL(w) = CFL(ab) CFL(aabaabbaab).For example, if w = aabaabbaab, we have P = {aaa, aab}, Occ P (w) = {0, 3, 7} and b 1 = 6.Based on these Lemmas, we can devise a faster factorization algorithm for words containing runs of c.The key idea is that, using Lemma 8, it is possible to skip symbols in the computation of b 1 , if a suitable string matching algorithm is used to compute Occ P (w).W.l.o.g.we assume that the last symbol of w is different from c.In the general case, by Lemma 6, we can reduce the factorization of w to the one of its longest prefix with last symbol different from c, as the remaining suffix is a concatenation of c symbols, whose factorization is a sequence of factors equal to c. Suppose that c c occurs in w.By Lemma 7 we can split the factorization of w in CFL(u) and CFL(v) where uv = w and |u| = min Occ { c c} (w).The factorization of CFL(u) can be computed using Duval's original algorithm. Concerning = c, and we can apply Lemma 8 on v to find the ending position s of the first factor in CFL(v), i.e., min{i To this end, we iteratively compute Occ P (v) until either a position i is found that satisfies v[i..|v| − 1] < v or we reach the end of the string.
, then, by Lemma 4, we do not need to verify the positions i ∈ Occ P (v) such that i ≤ i + h.The computation of Occ P (v) can be performed by using either an algorithm for multiple string matching for the set of patterns P or an algorithm for single string matching for the pattern cr , since Occ P (v) ⊆ Occ cr (v).Note that the same algorithm can also be used to compute min Occ c c(w) in the first phase.
Given that all the patterns in P differ in the last symbol only, we can express P more succinctly using a character class for the last symbol and match this pattern using a string matching algorithm that supports character classes, such as the algorithms based on bit-parallelism.In this respect, SBNDM2 [12], a variation of the BNDM algorithm [13] is an ideal choice, as it is sublinear on average.However, this method is preferable only if r + 1 is less than or equal to the machine word size in bits.
for i ← 1 to q do 24.
s ← s × q 27. for i ← 1 to l do 28.
output(e + i, e + i) If the total time spent for the iteration over the sets Occ P (v) is O(|w|), the worst case time complexity of LF-SKIP is linear.To see why, it is enough to observe that the positions i for which LF-SKIP verifies if v[i..|v| − 1] < v are a subset of the positions verified by the original algorithm.Indeed, given a string w satisfying the conditions of Lemma 8, for any position i ∈ Occ P there is no i ∈ {0, 1, . . ., |w| − 1} \ Occ P such that i + 1 ≤ i ≤ i + lcp(w, w[i ..|w| − 1]).Hence, the only way Duval's algorithm can skip a position i ∈ Occ P using Lemma 4 is by means of a smaller position i belonging to Occ P , which implies that the algorithms skip or verify the same positions in Occ P .

Computing the Lyndon Factorization of a Run-Length Encoded String
In this section we present an algorithm to compute the Lyndon factorization of a string given in RLE form.The algorithm is based on Duval's original algorithm and on a combinatorial property between the Lyndon factorization of a string and its RLE, and has O(ρ)-time and O(1)-space complexity, where ρ is the length of the RLE.We start with the following Lemma: Lemma 9. Let w be a word over Σ and let w 1 , w 2 , . . . ,w m be its Lyndon factorization.For any Proof.Suppose by contradiction that j < k and either |w j | > 1 or |w k | > 1.By definition of j, k, we have w j ≥ w k .Moreover, since both [a j , b j ] and [a k , b k ] overlap with [a rle i , b rle i ], we also have we have that w j is a prefix of w k .Hence, in both cases we obtain w j < w k , which is a contradiction.
The consequence of this Lemma is that a run of length l in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to l unit-length factors.Formally: Corollary 1.Let w be a word over Σ and let w 1 , w 2 , . . ., w m be its Lyndon factorization.Then, for any 1 ≤ i ≤ |RLE(w)|, either there exists w j such that [a rle i , b rle i ] is contained in [a j , b j ] or there exist l i factors w j , w j+1 , . . . ,w j+l i −1 such that |w j+k | = 1 and a j+k ∈ [a rle i , b rle i ], for 0 ≤ k < l i .
This property can be exploited to obtain an algorithm for the Lyndon factorization that runs in O(ρ) time.First, we introduce the following definition: Definition 1.A word w is a LR word if it is either a Lyndon word or it is equal to a k , for some a ∈ Σ, k ≥ 2. The LR factorization of a word w is the factorization in LR words obtained from the Lyndon factorization of w by merging in a single factor the maximal sequences of unit-length factors with the same symbol.
For example, the LR factorization of cctgccaa is cctg, cc, aa .Observe that this factorization is a (reversible) encoding of the Lyndon factorization.Moreover, in this encoding it holds that each run in the RLE is contained in one factor and thus the size of the LR factorization is O(ρ).Let L be the set of LR words.Suppose that we have a procedure LF-RLE-NEXT(R, k) which computes, given an RLE sequence R and an integer k, the pair (s, q) where s is the largest integer such that c ρ .This implies that the pair (s, q) returned by LF-RLE-NEXT(R, k) satisfies LF-NEXT(c Based on Lemma 1, the factorization of R can then be computed by iteratively calling LF-RLE-NEXT starting from position 0. When a given call to LF-RLE-NEXT returns, the factorization algorithm outputs the intervals [k + is, k + (i + 1)s − 1] in R, for i = 0, . . ., q − 1, and restarts the factorization at position k + qs.
We now present the LF-RLE-NEXT algorithm.Analogously to Duval's algorithm, it reads the RLE sequence from left to right maintaining two integers, j and , which satisfy the following invariant: The integer j, initialized to k + 1, is the index of the next run to read and is incremented at each j−1 spans at least two runs, and equal to 0 otherwise.For example, in the case of the word ab 2 ab 2 ab we have β(ab 2 ab 2 ab) = ab 2 ab and = 4. Let i = k + .In general, if > 0, we have Note that the longest border may not fully cover the last (first) run of the corresponding prefix (suffix).Such the case is for example for the word ab 2 a 2 b.However, since c l k k . . .c l j−1 j−1 ∈ P it must hold that l j− = l k , i.e., the first run of the suffix is fully covered.Let Informally, the integer z is equal to 1 if the longest border of c l k k . . .c l j−1 j−1 does not fully cover the run (c i−1 , l i−1 ).By 1 we have that c l k k . . .c l j−1 j−1 can be written as (uv) q u, where For example, in the case of the word ab 2 ab 2 ab, for k = 0, we have j = 6, i = 4, q = 2, r = 2.The algorithm is based on the following Lemma: Lemma 10.Let j, be such that invariant 1 holds and let s = i − z.Then, we have the following cases: Proof.The idea is the following: we apply Lemma 3 with the word (uv) q u as defined above and symbol c j .Observe that c j is compared with symbol v[0], which is equal to c k+r−1 = c i−1 if z = 1 and to c k+r = c i otherwise.
First note that, if z = 1, c j = c i−1 , since otherwise we would have c j−1 = c i−1 = c j .In the first three cases, we obtain the first, second and third proposition of Lemma 3, respectively, for the word c l k k . . .c l j−1 j−1 c j .Independently of the derived proposition, it is easy to verify that the same proposition also holds for c We prove by induction that invariant 1 is maintained.At the beginning, the variables j and are initialized to k + 1 and 0, respectively, so the base case trivially holds.Suppose that the invariant holds for j, .Then, by Lemma 10, either c l k k . . .c l j j / ∈ P or it follows that the invariant also holds for j + 1, , where is equal to + 1, if z = 0, c j = c i and l j ≤ l i , and to 0 otherwise.When c l k k . . .c l j j / ∈ P the algorithm returns the pair (j − i, q), i.e., the length of uv and the corresponding exponent.
The code of the algorithm is shown in Figure 3.We now prove that the algorithm runs in O(ρ) time.First, observe that, by definition of LR factorization, the for loop at line 4 is executed O(ρ) times.Suppose that the number of iterations of the while loop at line 2 is n and let k 1 , k 2 , . . ., k n+1 be the corresponding values of k, with k 1 = 0 and k n+1 = |R|.We now show that the s-th call to LF-RLE-NEXT performs less than 2(k s+1 − k s ) iterations, which will yield O(ρ) number of iterations in total.This analysis is analogous to the one used by Duval.Suppose that i , j and z are the values of i, j and z at the end of the s-th call to LF-RLE-NEXT.The number of iterations performed during this call is equal to j − k s .We have k s+1 = k s + q(j − i ), where q = j −k s −z j−i , which implies j − k s < 2(k s+1 − k s ) + 1, since, for any positive integers x, y, x < 2 x/y y holds.
for i ← 1 to q do 5.
if j = |R| or c j < c s or 9.
(c j = c s and l j > l s and c j < c s+1 ) then 10.
return (j − i, (j if c j > c s or l j > l s then 13.

Experimental Results
We tested extensively the algorithms LF-DUVAL, LF-SKIP, and LF-RLE.In addition, we also tested variations of LF-DUVAL and LF-SKIP, denoted as LF-DUVAL2 and LF-SKIP2.LF-DUVAL2 performs an if-test which is always true in line 9 of LF-NEXT.This form, which is advantageous for compiler optimization, can be justified by the formulation of the original algorithm [5] where there is a three branch test of w[j − 1] and w[i − 1].LF-SKIP2, after finding the first cr , searches for cr until cr+1 is found, whereas LF-SKIP searches for cr x where x is a character class.The experiments were run on Intel Core i7-4578U with 3 GHz clock speed and 16 GB RAM.The algorithms were written in the C programming language and compiled with gcc 5.4.0 using the O3 optimization level.
Testing LF-SKIP.At first we tested the variations of LF-SKIP against the variations of LF-DUVAL.The texts were random sequences of 5 MB symbols.For each alphabet size σ = 2, 4, . . ., 256 we generated 100 sequences with a uniform distribution, and each run with each sequence was repeated 500 times.The average run times are given in Table 1 which is shown in a graphical form in Figure 4.  LF-SKIP was faster than the best variation of LF-DUVAL for all tested values of σ.The speed-up was significant for small alphabets.LF-SKIP2 was faster than LF-SKIP for σ ≤ 16 and slower for σ > 16.
The speed of LF-DUVAL did not depend on σ.LF-DUVAL2 became faster when the size of the alphabet grew.For large alphabets LF-DUVAL2 was faster than LF-DUVAL and for small alphabets the other way round.In additional refined experiments, σ = 5 was the threshold value.When we compiled LF-DUVAL and LF-DUVAL2 without optimization, both of the variations behaved in a similar way.So the better performance of LF-DUVAL2 for large alphabets is due to compiler optimization, possibly by cause of branch prediction.
We tested the variations of LF-SKIP also with longer random sequences of four characters up to 500 MB.The average speed did not essentially change when the sequence became longer.
In addition, we tested LF-SKIP and LF-SKIP2 with real texts.At first we did experiments with texts of natural language.Because runs are very short in a natural language and newline or some other control character is the smallest character, the benefit of LF-SKIP or LF-SKIP2 was marginal.If it were acceptable to relax the lexicographic order of the characters, some gain could be obtained.For example, LF-SKIP achieved the speed-up of 2 over LF-DUVAL2 in the case of the KJV Bible when 'l' is the smallest character.
For the DNA sequence of fruitfly (15 MB), LF-SKIP2 was 20.3 times faster than LF-DUVAL.For the protein sequence of the saccharomyces cerevisiae (2.9 MB), LF-SKIP2 was 8.7 times faster than LF-DUVAL2.The run times on these biological sequences are shown in Table 2. Testing LF-RLE.To assess the performance of the LF-RLE algorithm, we tested it together with LF-DUVAL, LF-DUVAL2 and LF-SKIP2 for random binary sequences of 5 MB with different probability distributions, so as to vary the number of runs in the sequence.The running time of LF-RLE does not include the time needed to compute the RLE of the sequence, i.e., we assumed that the sequence is given in the RLE form, since otherwise other algorithms are preferable.For each test we generated 100 sequences, and each run with each sequence was repeated 500 times.The average run times are given in Table 3 which is shown in a graphical form in Figure 5.  Table 3 shows that LF-RLE was the fastest for distributions P(0) = 0.05, 0.9, and 0.95.Table 3 also reveals that LF-RLE and LF-DUVAL2 worked symmetrically for distributions of zero and one, but LF-SKIP2 worked unsymmetrically which is due to the fact that LF-SKIP2 searches for the runs of the smallest character which was zero in this case.
In our tests the run time of LF-DUVAL was about 14.7 ms for all sequences of 5 MB.Thus LF-DUVAL is a better choice than LF-DUVAL2 for cases P(0) = 0.3 and 0.7.

Conclusions
We presented new variations of Duval's algorithm for computing the Lyndon factorization of a string.The first algorithm LF-SKIP was designed for strings containing runs of the smallest character in the alphabet and it is able to skip a significant portion of the characters of the string.The second algorithm LF-RLE is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time and constant space.Our experimental results show that these algorithms can offer a significant speed-up over Duval's original algorithm.Especially LF-SKIP is efficient in the case of biological sequences.

Figure 1 .Lemma 4 .Lemma 5 .
Figure 1.Duval's algorithm to compute the Lyndon factorization of a string.The following is an alternative formulation of Duval's algorithm by I et al.[8]:

Figure 2 .
Figure 2. The algorithm to compute the Lyndon factorization that can potentially skip symbols.

1 ,ρ
for i = 1, . . ., q − 1. Observe that, by Lemma 9, which is in L , since otherwise the run (c k+s , l k+s ) would span two factors in the LR factorization of c l k k . . .c l ρ

iteration until either j = |R| or c l k k . . . c l j− 1 j− 1 /
∈ P .The integer , initialized to 0, is the length in runs of the longest border of c

1 .
If c j < c s then c and 1 holds for j + 1, = 0; Moreover, if z = 0, we also have: 3.If c j = c i and l j ≤ l i , then c l k k . . .c l j j ∈ P and 1 holds for j + 1, = + 1; 4. If c j = c i and l j > l i , either c j < c i+1 and c l k k . . .c l j j / ∈ P or c j > c i+1 , c l k k . . .c l j j ∈ L and 1 holds for j + 1, = 0.

j
j , m ≤ l j .Consider now the fourth case.By a similar reasoning, we have that the third proposition of Lemma 3 holds for c and c j , c j is compared to c i+1 and we must have c j = c i+1 as otherwise c i = c j = c i+1 .Hence, either the first (ifc j < c i+1 ) or the second (if c j > c i+1 )proposition of Lemma 3 must hold for the word c l k k . . .c l i +1 j .

Figure 3 .
Figure 3.The algorithm to compute the Lyndon factorization of a run-length encoded string.

Figure 4 .
Figure 4. Comparison of the algorithms on random sequences (5 MB) with a uniform distribution of a varying alphabet size.

Figure 5 .
Figure 5.Comparison of the algorithms on random binary sequences (5 MB) with a skew distribution.

Table 1 .
Run times in milliseconds on random sequences (5 MB) with a uniform distribution of a varying alphabet size.

Table 2 .
Run times in milliseconds on two biological sequences.

Table 3 .
Run times in milliseconds on random binary sequences (5 MB) with a skew distribution.