Article

Lyndon Factorization Algorithms for Small Alphabets and Run-Length Encoded Strings †

by Sukhpal Singh Ghuman 1, Emanuele Giaquinta 2 and Jorma Tarhio 3,*
1 Faculty of Applied Science & Technology, Sheridan College, 7899 McLaughlin Road, Brampton, ON L6Y 5H9, Canada
2 F-Secure Corporation, P.O.B. 24, FI-00181 Helsinki, Finland
3 Department of Computer Science, Aalto University, P.O.B. 15400, FI-00076 Aalto, Finland
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper: Ghuman, S.S.; Giaquinta, E.; Tarhio, J. Alternative algorithms for Lyndon factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, 1–3 September 2014; pp. 169–178.
Algorithms 2019, 12(6), 124; https://doi.org/10.3390/a12060124
Submission received: 24 May 2019 / Accepted: 17 June 2019 / Published: 21 June 2019
(This article belongs to the Special Issue String Matching and Its Applications)

Abstract: We present two modifications of Duval's algorithm for computing the Lyndon factorization of a string. The first algorithm is designed for strings containing runs of the smallest character. It works best for small alphabets and is able to skip a significant number of characters of the string. Moreover, it can be engineered to have linear time complexity in the worst case. Given a run-length encoded string R of length ρ, the second algorithm computes the Lyndon factorization of R in O(ρ) time and constant space. Experimental results show that the new variations are faster than Duval's original algorithm in many scenarios.

1. Introduction

A string w′ is a rotation of another string w if w = uv and w′ = vu, for some strings u and v. A string is a Lyndon word if it is lexicographically smaller than all its proper rotations. Chen, Fox and Lyndon [1] introduced the unique factorization of a string into Lyndon words such that the sequence of factors is nonincreasing according to the lexicographical order. The Lyndon factorization is a key structure in a method for sorting the suffixes of a text [2], which is applied in the construction of the Burrows-Wheeler transform and the suffix array, as well as in the bijective variant of the Burrows-Wheeler transform [3,4]. The Burrows-Wheeler transform is an invertible transformation of a string, based on sorting of its rotations, while the suffix array is a lexicographically sorted array of the suffixes of a string. They are the groundwork for many indexing and data compression methods.
Duval’s algorithm [5] computes the Lyndon factorization in linear time and in constant space. Various other solutions for computing the Lyndon factorization have been proposed in the past. A parallel algorithm [6] was presented by Apostolico and Crochemore, while Roh et al. described an external memory algorithm [7]. Recently, I et al. and Furuya et al. introduced algorithms to compute the Lyndon factorization of a string given in the grammar-compressed form and in the LZ78 encoding [8,9].
In this paper, we present two new variations of Duval's algorithm. The paper is an extended version of the conference paper [10]. The first algorithm is designed for strings containing runs of the smallest character. It works best for small alphabets, like the DNA alphabet {a, c, g, t}, and it is able to skip a significant portion of the string. The second variation works for strings compressed with run-length encoding. In run-length encoding, maximal sequences in which the same data value occurs in many consecutive data elements (called runs) are stored as a pair of a single data value and a count. Given a run-length encoded string R of length ρ, our algorithm computes the Lyndon factorization of R in O(ρ) time and constant space. This variation is thus preferable to Duval's algorithm when the strings are stored or maintained with run-length encoding. In our experiments, the new algorithms are considerably faster than the original one in the case of small alphabets, for both real and simulated data.
The rest of the paper is organized as follows. Section 2 defines background concepts, Section 3 presents Duval's algorithm, Section 4 and Section 5 introduce our variations of Duval's algorithm, Section 6 shows the results of our practical experiments, and the discussion of Section 7 concludes the article.

2. Basic Definitions

Let Σ be a finite ordered alphabet of σ symbols and let Σ* be the set of words (strings) over Σ ordered by lexicographic order. In this paper, we use the terms string, sequence, and word interchangeably. The empty word ε is a word of length 0. Let Σ+ be equal to Σ* ∖ {ε}. Given a word w, we denote with |w| the length of w and with w[i] the i-th symbol of w, for 0 ≤ i < |w|. The concatenation of two words u and v is denoted by uv. Given two words u and v, v is a substring of u if there are indices 0 ≤ i ≤ j < |u| such that v = u[i] ⋯ u[j]. If i = 0 (j = |u| − 1) then v is a prefix (suffix) of u. The substring u[i] ⋯ u[j] of u is denoted by u[i..j]; for i > j, u[i..j] = ε. We denote by u^k the concatenation of k u's, for u ∈ Σ+ and k ≥ 1. The longest border of a word w, denoted with β(w), is the longest proper prefix of w which is also a suffix of w. Let lcp(w, w′) denote the length of the longest common prefix of words w and w′. We write w < w′ if either lcp(w, w′) = |w| < |w′|, i.e., if w is a proper prefix of w′, or if w[lcp(w, w′)] < w′[lcp(w, w′)]. For any 0 ≤ i < |w|, ROT(w, i) = w[i..|w| − 1] w[0..i − 1] is a rotation of w. A Lyndon word is a word w such that w < ROT(w, i), for 1 ≤ i < |w|. Given a Lyndon word w, the following properties hold:
  • |β(w)| = 0;
  • either |w| = 1 or w[0] < w[|w| − 1].
Both properties imply that no word a^k, for a ∈ Σ and k ≥ 2, is a Lyndon word. The following result is due to Chen, Fox and Lyndon [11]:
Theorem 1.
Any word w admits a unique factorization CFL(w) = w_1, w_2, …, w_m, such that w_i is a Lyndon word, for 1 ≤ i ≤ m, and w_1 ≥ w_2 ≥ ⋯ ≥ w_m.
The interval of positions in w of the factor w_i in CFL(w) = w_1, w_2, …, w_m is [a_i, b_i], where a_i = ∑_{j=1}^{i−1} |w_j| and b_i = ∑_{j=1}^{i} |w_j| − 1, for i = 1, …, m. We assume the following property:
Property 1.
The output of an algorithm that, given a word w, computes the factorization CFL(w) is the sequence of intervals of positions of the factors in CFL(w).
The run-length encoding (RLE) of a word w, denoted by RLE(w), is a sequence of pairs (runs) (c_1, l_1), (c_2, l_2), …, (c_ρ, l_ρ) such that c_i ∈ Σ, l_i ≥ 1, c_i ≠ c_{i+1} for 1 ≤ i < ρ, and w = c_1^{l_1} c_2^{l_2} ⋯ c_ρ^{l_ρ}. The interval of positions in w of the run (c_i, l_i) is [a_i^rle, b_i^rle], where a_i^rle = ∑_{j=1}^{i−1} l_j and b_i^rle = ∑_{j=1}^{i} l_j − 1.
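As a concrete illustration of this definition, the following C sketch computes RLE(w) as an array of (symbol, length) pairs. The Run type and the function name are our own illustrative choices, not taken from the paper.

```c
#include <stddef.h>

/* One run of RLE(w): a symbol c_i and its multiplicity l_i. */
typedef struct { unsigned char c; size_t len; } Run;

/* Computes RLE(w) for a word w of length n into runs[] (which must have
 * room for at least n entries) and returns its length rho. */
static size_t rle_encode(const unsigned char *w, size_t n, Run *runs)
{
    size_t rho = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && w[j] == w[i]) j++;   /* maximal run of the symbol w[i] */
        runs[rho].c = w[i];
        runs[rho].len = j - i;               /* l_i >= 1 and adjacent runs differ */
        rho++;
        i = j;                               /* i is now a_i^rle of the next run */
    }
    return rho;
}
```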

3. Duval’s Algorithm

In this section we briefly describe Duval's algorithm for the computation of the Lyndon factorization of a word. Let L be the set of Lyndon words and let
P = {w | w ∈ Σ+ and wΣ* ∩ L ≠ ∅}
be the set of nonempty prefixes of Lyndon words. Let also P′ = P ∪ {c^k | k ≥ 2}, where c is the maximum symbol in Σ. Duval's algorithm is based on the following Lemmas, proved in [5]:
Lemma 1.
Let w ∈ Σ+ and let w_1 be the longest prefix of w = w_1 w′ which is in L. We have CFL(w) = w_1 CFL(w′).
Lemma 2.
P′ = {(uv)^k u | u ∈ Σ*, v ∈ Σ+, k ≥ 1 and uv ∈ L}.
Lemma 3.
Let w = (uav)^k u, with u, v ∈ Σ*, a ∈ Σ, k ≥ 1 and uav ∈ L. The following propositions hold:
1. For a′ ∈ Σ and a > a′, wa′ ∉ P′;
2. For a′ ∈ Σ and a < a′, wa′ ∈ L;
3. For a′ = a, wa′ ∈ P′ ∖ L.
Lemma 1 states that the computation of the Lyndon factorization of a word w can be carried out by computing the longest prefix w_1 of w = w_1 w′ which is a Lyndon word and then recursively restarting the process from w′. Lemma 2 states that the nonempty prefixes of Lyndon words are all of the form (uv)^k u, where u ∈ Σ*, v ∈ Σ+, k ≥ 1 and uv ∈ L. By the first property of Lyndon words, the longest prefix of (uv)^k u which is in L is uv. Hence, if we know that w = (uv)^k u a′ v′, (uv)^k u ∈ P′ but (uv)^k u a′ ∉ P′, then by Lemma 1 and by induction we have CFL(w) = w_1 w_2 ⋯ w_k CFL(u a′ v′), where w_1 = w_2 = ⋯ = w_k = uv. For example, if w = abbabbaba, we have CFL(w) = abb abb CFL(aba), since abbabbab ∈ P′ while abbabbaba ∉ P′.
Suppose that we have a procedure LF-next(w, k) which computes, given a word w and an integer k, the pair (s, q) where s is the largest integer such that w[k..k + s − 1] ∈ L and q is the largest integer such that w[k + is..k + (i + 1)s − 1] = w[k..k + s − 1], for i = 1, …, q − 1. The factorization of w can then be computed by iteratively calling LF-next starting from position 0. When a given call to LF-next returns, the factorization algorithm outputs the intervals [k + is, k + (i + 1)s − 1], for i = 0, …, q − 1, and restarts the factorization at position k + qs. Duval's algorithm implements LF-next using Lemma 3, which explains how to compute, given a word w ∈ P′ and a symbol a ∈ Σ, whether wa ∈ P′, and thus makes it possible to compute the factorization using a left to right parsing. Note that, given a word w ∈ P′ with |β(w)| = i, we have w[0..|w| − i − 1] ∈ L and w = (w[0..|w| − i − 1])^q w[0..r − 1] with q = ⌊|w| / (|w| − i)⌋ and r = |w| mod (|w| − i). For example, if w = abbabbab, we have |w| = 8, |β(w)| = 5, q = 2, r = 2 and w = (abb)^2 ab. The code of Duval's algorithm is shown in Figure 1. The algorithm has O(|w|)-time and O(1)-space complexity.
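For reference, the following C sketch follows the classic structure of Duval's algorithm that Figure 1 refers to. It is our own illustrative transcription (function name and output format are ours), not the authors' code.

```c
#include <stdio.h>
#include <string.h>

/* Minimal sketch of Duval's algorithm: prints the interval [start, end]
 * of every factor of the Lyndon factorization of w. */
static void lyndon_factorization(const unsigned char *w, size_t n)
{
    size_t k = 0;
    while (k < n) {
        size_t i = k, j = k + 1;           /* invariant: w[k..j-1] is in P' */
        while (j < n && w[i] <= w[j]) {
            if (w[i] < w[j])
                i = k;                     /* w[k..j] is a Lyndon word      */
            else
                i++;                       /* w[k..j] keeps period j - i    */
            j++;
        }
        /* w[k..j-1] = (uv)^q u with |uv| = j - i; emit the q copies of uv  */
        while (k <= i) {
            printf("[%zu, %zu]\n", k, k + (j - i) - 1);
            k += j - i;
        }
    }
}

int main(void)
{
    const char *w = "abbabbaba";
    lyndon_factorization((const unsigned char *)w, strlen(w));
    /* expected output: [0,2] [3,5] [6,7] [8,8], i.e. abb, abb, ab, a */
    return 0;
}
```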
The following is an alternative formulation of Duval’s algorithm by I et al. [8]:
Lemma 4.
Let j > 0 be any position of a string w such that w < w[i..|w| − 1] for any 0 < i ≤ j and lcp(w, w[j..|w| − 1]) ≥ 1. Then, w < w[k..|w| − 1] also holds for any j < k ≤ j + lcp(w, w[j..|w| − 1]).
Lemma 5.
Let w be a string with CFL(w) = w_1, w_2, …, w_m. It holds that |w_1| = min{j | w[j..|w| − 1] < w} and w_1 = w_2 = ⋯ = w_q = w[0..|w_1| − 1], where q = 1 + ⌊lcp(w, w[|w_1|..|w| − 1]) / |w_1|⌋.
For example, if w = abbabbaba, we have min{j | w[j..8] < w} = 3, lcp(w, w[3..8]) = 5, and q = 2. Based on these Lemmas, the procedure LF-next can be implemented by initializing j ← k + 1 and executing the following steps: (1) compute h ← lcp(w[k..|w| − 1], w[j..|w| − 1]); (2) if j + h < |w| and w[k + h] < w[j + h], set j ← j + h + 1 and repeat step 1; otherwise return the pair (j, 1 + ⌊h/j⌋). It is not hard to verify that, if the lcp values are computed using symbol comparisons, then this procedure corresponds to the one used by Duval's original algorithm.
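The following C sketch (ours, for illustration only) implements this lcp-based formulation for the case k = 0, matching Lemma 5; the skip in step 2 is exactly the one licensed by Lemma 4.

```c
#include <stddef.h>

/* lcp-based LF-next for the whole word w (k = 0): on return, *s is the
 * length |w_1| of the first Lyndon factor and *q its number of repetitions. */
static void lf_next_lcp(const unsigned char *w, size_t n, size_t *s, size_t *q)
{
    size_t j = 1;
    for (;;) {
        size_t h = 0;
        while (j + h < n && w[h] == w[j + h]) h++;    /* h = lcp(w, w[j..]) */
        if (j + h < n && w[h] < w[j + h]) {
            j += h + 1;                                /* Lemma 4: skip ahead */
        } else {
            *s = j;                                    /* |w_1| = j (Lemma 5) */
            *q = 1 + h / j;                            /* repetitions of w_1  */
            return;
        }
    }
}
```

On w = abbabbaba this returns s = 3 and q = 2, matching the example above.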

4. Improved Algorithm for Small Alphabets

Let w be a word over an alphabet Σ with CFL(w) = w_1, w_2, …, w_m and let c̄ be the smallest symbol in Σ. Suppose that there exist k ≥ 2 and i ≥ 1 such that c̄^k is a prefix of w_i. If the last symbol of w is not c̄, then by Theorem 1 and by the properties of Lyndon words, c̄^k is a prefix of each of w_{i+1}, w_{i+2}, …, w_m. This property can be exploited to devise an algorithm for Lyndon factorization that can potentially skip symbols. Note that we assume Property 1, i.e., the output of the algorithm is the sequence of intervals of the factors in CFL(w), as otherwise we have to read all the symbols of w to output CFL(w). Our algorithm is based on the alternative formulation of Duval's algorithm by I et al. [8]. Given a set of strings P, let Occ_P(w) be the set of all (starting) positions in w corresponding to occurrences of the strings in P. We start with the following Lemmas:
Lemma 6.
Let w be a word and let s = max({i | w[i] > c̄} ∪ {−1}). Then, we have CFL(w) = CFL(w[0..s]) CFL(c̄^{|w| − 1 − s}).
Proof. 
If s = −1 or s = |w| − 1 the Lemma plainly holds. Otherwise, let w_i be the factor in CFL(w) such that s belongs to [a_i, b_i], the interval of w_i. To prove the claim we have to show that b_i = s. Suppose by contradiction that s < b_i, which implies |w_i| ≥ 2. Then, w_i[|w_i| − 1] = c̄, which contradicts the second property of Lyndon words. □
For example, if w = abaabaabbaabaa, we have CFL(w) = CFL(abaabaabbaab) CFL(aa).
Lemma 7.
Let w be a word such that c̄c̄ occurs in it and let s = min Occ_{c̄c̄}(w). Then, we have CFL(w) = CFL(w[0..s − 1]) CFL(w[s..|w| − 1]).
Proof. 
Let w_i be the factor in CFL(w) such that s belongs to [a_i, b_i], the interval of w_i. To prove the claim we have to show that a_i = s. Suppose by contradiction that s > a_i, which implies |w_i| ≥ 2. If s = b_i then w_i[|w_i| − 1] = c̄, which contradicts the second property of Lyndon words. Otherwise, since w_i is a Lyndon word it must hold that w_i < ROT(w_i, s − a_i). This implies at least that w_i[0] = w_i[1] = c̄, which contradicts the hypothesis that s is the smallest element in Occ_{c̄c̄}(w). □
For example, if w = abaabaabbaab, we have CFL(w) = CFL(ab) CFL(aabaabbaab).
Lemma 8.
Let w be a word such that w[0] = w[1] = c̄ and w[|w| − 1] ≠ c̄. Let r be the smallest position in w such that w[r] ≠ c̄. Let also P = {c̄^r c | c ≤ w[r]}. Then we have
b_1 = min({s ∈ Occ_P(w) | w[s..|w| − 1] < w} ∪ {|w|}) − 1,
where b_1 is the ending position of factor w_1.
Proof. 
By Lemma 5 we have that b_1 = min{s | w[s..|w| − 1] < w} − 1. Since w[0..r − 1] = c̄^r and |w| ≥ r + 1, for any string v such that v < w we must have that either v[0..r] ∈ P, if |v| ≥ r + 1, or v = c̄^{|v|} otherwise. Since w[|w| − 1] ≠ c̄, the only position s that satisfies w[s..|w| − 1] = c̄^{|w| − s} is |w|, corresponding to the empty word. Hence,
{s | w[s..|w| − 1] < w} = {s ∈ Occ_P(w) | w[s..|w| − 1] < w} ∪ {|w|}.
 □
For example, if w = aabaabbaab, we have P = {aaa, aab}, Occ_P(w) = {0, 3, 7} and b_1 = 6. Based on these Lemmas, we can devise a faster factorization algorithm for words containing runs of c̄. The key idea is that, using Lemma 8, it is possible to skip symbols in the computation of b_1, if a suitable string matching algorithm is used to compute Occ_P(w). W.l.o.g. we assume that the last symbol of w is different from c̄. In the general case, by Lemma 6, we can reduce the factorization of w to that of its longest prefix whose last symbol is different from c̄, as the remaining suffix is a concatenation of c̄ symbols, whose factorization is a sequence of factors equal to c̄. Suppose that c̄c̄ occurs in w. By Lemma 7 we can split the factorization of w into CFL(u) and CFL(v), where uv = w and |u| = min Occ_{c̄c̄}(w). The factorization CFL(u) can be computed using Duval's original algorithm.
Concerning v, let r = min{i | v[i] ≠ c̄}. By definition v[0] = v[1] = c̄ and v[|v| − 1] ≠ c̄, and we can apply Lemma 8 on v to find the length s of the first factor in CFL(v) (one past its ending position), i.e., s = min({i ∈ Occ_P(v) | v[i..|v| − 1] < v} ∪ {|v|}), where P = {c̄^r c | c ≤ v[r]}. To this end, we iteratively compute Occ_P(v) until either a position i is found that satisfies v[i..|v| − 1] < v or we reach the end of the string. Let h = lcp(v, v[i..|v| − 1]), for a given i ∈ Occ_P(v). Observe that h ≥ r and, if v < v[i..|v| − 1], then, by Lemma 4, we do not need to verify the positions i′ ∈ Occ_P(v) such that i < i′ ≤ i + h. The computation of Occ_P(v) can be performed by using either an algorithm for multiple string matching for the set of patterns P or an algorithm for single string matching for the pattern c̄^r, since Occ_P(v) ⊆ Occ_{c̄^r}(v). Note that the same algorithm can also be used to compute min Occ_{c̄c̄}(w) in the first phase.
Given that all the patterns in P differ in the last symbol only, we can express P more succinctly using a character class for the last symbol and match this pattern using a string matching algorithm that supports character classes, such as the algorithms based on bit-parallelism. In this respect, SBNDM2 [12], a variation of the BNDM algorithm [13], is an ideal choice, as it is sublinear on average. However, this method is preferable only if r + 1 is less than or equal to the machine word size in bits.
Let h = lcp(v, v[s..|v| − 1]) and q = 1 + ⌊h/s⌋. Based on Lemma 5, the algorithm then outputs the intervals of the factors v[(i − 1)s..is − 1], for i = 1, …, q, and iteratively applies the above method on v′ = v[sq..|v| − 1]. It is not hard to verify that, if v′ ≠ ε, then |v′| ≥ r + 1, v′[0..r − 1] = c̄^r and v′[|v′| − 1] ≠ c̄, and so Lemma 8 can be used on v′. The code of the algorithm, named LF-skip, is shown in Figure 2. The computation of the value r = min{i | v′[i] ≠ c̄} for v′ takes advantage of the fact that v′[0..r − 1] = c̄^r, so as to avoid useless comparisons.
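To make the structure of LF-skip concrete, here is a simplified C sketch of the same idea (ours, not the Figure 2 code): it assumes the input v starts with c̄ and ends with a symbol different from c̄, and it locates candidate positions by direct scanning rather than with SBNDM2.

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified sketch of the LF-skip idea for a word v of length n whose first
 * symbol is the smallest symbol cbar and whose last symbol differs from cbar.
 * Prints the interval of every factor, shifted by 'base'. */
static void lf_skip(const unsigned char *v, size_t n, unsigned char cbar, size_t base)
{
    size_t k = 0;                              /* start of the unfactorized suffix */
    while (k < n) {
        size_t s = n - k, q = 1;               /* default: the whole suffix is one factor */
        size_t r = 0;
        while (v[k + r] == cbar) r++;          /* leading run of cbar (r >= 1) */
        size_t i = r + 1;
        while (k + i < n) {
            if (v[k + i] != cbar) { i++; continue; }   /* a smaller suffix must start with cbar */
            size_t h = 0;                      /* h = lcp(v[k..], v[k+i..]) */
            while (k + i + h < n && v[k + h] == v[k + i + h]) h++;
            if (k + i + h == n || v[k + i + h] < v[k + h]) {
                s = i;                         /* |w_1| = i by Lemma 5 */
                q = 1 + h / s;                 /* repetitions of w_1   */
                break;
            }
            i += h + 1;                        /* Lemma 4: skip the next h positions */
        }
        for (size_t t = 0; t < q; t++, k += s)
            printf("[%zu, %zu]\n", base + k, base + k + s - 1);
    }
}
```

In the full algorithm the prefix of w before the first occurrence of c̄c̄ is factorized with Duval's algorithm (Lemma 7), and the candidate scan above is replaced by an SBNDM2 search for c̄^r followed by a character class.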
If the total time spent for the iteration over the sets Occ_P(v′) is O(|w|), the worst case time complexity of LF-skip is linear. To see why, it is enough to observe that the positions i for which LF-skip verifies if v[i..|v| − 1] < v are a subset of the positions verified by the original algorithm. Indeed, given a string w satisfying the conditions of Lemma 8, for any position i ∈ Occ_P there is no i′ ∈ {0, 1, …, |w| − 1} ∖ Occ_P such that i′ + 1 ≤ i ≤ i′ + lcp(w, w[i′..|w| − 1]). Hence, the only way Duval's algorithm can skip a position i ∈ Occ_P using Lemma 4 is by means of a smaller position i′ belonging to Occ_P, which implies that the algorithms skip or verify the same positions in Occ_P.

5. Computing the Lyndon Factorization of a Run-Length Encoded String

In this section we present an algorithm to compute the Lyndon factorization of a string given in RLE form. The algorithm is based on Duval's original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE, and has O(ρ)-time and O(1)-space complexity, where ρ is the length of the RLE. We start with the following Lemma:
Lemma 9.
Let w be a word over Σ and let w_1, w_2, …, w_m be its Lyndon factorization. For any 1 ≤ i ≤ |RLE(w)|, let 1 ≤ j, k ≤ m, with j ≤ k, be such that a_i^rle ∈ [a_j, b_j] and b_i^rle ∈ [a_k, b_k]. Then, either j = k or |w_j| = |w_k| = 1.
Proof. 
Suppose by contradiction that j < k and either |w_j| > 1 or |w_k| > 1. By definition of j, k, we have w_j ≥ w_k. Moreover, since both [a_j, b_j] and [a_k, b_k] overlap with [a_i^rle, b_i^rle], we also have w_j[|w_j| − 1] = w_k[0]. If |w_j| > 1, then, by definition of w_j, we have w_j[0] < w_j[|w_j| − 1] = w_k[0]. Instead, if |w_k| > 1 and |w_j| = 1, we have that w_j is a prefix of w_k. Hence, in both cases we obtain w_j < w_k, which is a contradiction. □
The consequence of this Lemma is that a run of length l in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to l unit-length factors. Formally:
Corollary 1.
Let w be a word over Σ and let w_1, w_2, …, w_m be its Lyndon factorization. Then, for any 1 ≤ i ≤ |RLE(w)|, either there exists w_j such that [a_i^rle, b_i^rle] is contained in [a_j, b_j], or there exist l_i factors w_j, w_{j+1}, …, w_{j+l_i−1} such that |w_{j+k}| = 1 and a_{j+k} ∈ [a_i^rle, b_i^rle], for 0 ≤ k < l_i.
This property can be exploited to obtain an algorithm for the Lyndon factorization that runs in O ( ρ ) time. First, we introduce the following definition:
Definition 1.
A word w is an LR word if it is either a Lyndon word or it is equal to a^k, for some a ∈ Σ, k ≥ 2. The LR factorization of a word w is the factorization into LR words obtained from the Lyndon factorization of w by merging into a single factor the maximal sequences of unit-length factors with the same symbol.
For example, the LR factorization of cctgccaa is cctg, cc, aa. Observe that this factorization is a (reversible) encoding of the Lyndon factorization. Moreover, in this encoding it holds that each run in the RLE is contained in one factor and thus the size of the LR factorization is O(ρ). Let L′ be the set of LR words. Suppose that we have a procedure LF-rle-next(R, k) which computes, given an RLE sequence R and an integer k, the pair (s, q) where s is the largest integer such that c_k^{l_k} ⋯ c_{k+s−1}^{l_{k+s−1}} ∈ L′ and q is the largest integer such that c_{k+is}^{l_{k+is}} ⋯ c_{k+(i+1)s−1}^{l_{k+(i+1)s−1}} = c_k^{l_k} ⋯ c_{k+s−1}^{l_{k+s−1}}, for i = 1, …, q − 1. Observe that, by Lemma 9, c_k^{l_k} ⋯ c_{k+s−1}^{l_{k+s−1}} is the longest prefix of c_k^{l_k} ⋯ c_ρ^{l_ρ} which is in L′, since otherwise the run (c_{k+s}, l_{k+s}) would span two factors in the LR factorization of c_k^{l_k} ⋯ c_ρ^{l_ρ}. This implies that the pair (s, q) returned by LF-rle-next(R, k) satisfies
LF-next(c_k^{l_k} ⋯ c_ρ^{l_ρ}, 0) = (∑_{i=0}^{s−1} l_{k+i}, q) if s > 1, and (1, l_k) otherwise.
Based on Lemma 1, the factorization of R can then be computed by iteratively calling LF-rle-next starting from position 0. When a given call to LF-rle-next returns, the factorization algorithm outputs the intervals [k + is, k + (i + 1)s − 1] in R, for i = 0, …, q − 1, and restarts the factorization at position k + qs.
We now present the LF-rle-next algorithm. Analogously to Duval's algorithm, it reads the RLE sequence from left to right maintaining two integers, j and ℓ, which satisfy the following invariant:
c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} ∈ P′;  ℓ = |RLE(β(c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}}))| if j − k > 1, and ℓ = 0 otherwise.   (1)
The integer j, initialized to k + 1, is the index of the next run to read and is incremented at each iteration until either j = |R| or c_k^{l_k} ⋯ c_j^{l_j} ∉ P′. The integer ℓ, initialized to 0, is the length in runs of the longest border of c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}}, if c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} spans at least two runs, and is equal to 0 otherwise. For example, in the case of the word ab²ab²ab we have β(ab²ab²ab) = ab²ab and ℓ = 4. Let i = k + ℓ. In general, if ℓ > 0, we have
l_{j−1} ≤ l_{i−1},  l_k ≤ l_{j−ℓ},  β(c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}}) = c_k^{l_k} c_{k+1}^{l_{k+1}} ⋯ c_{i−2}^{l_{i−2}} c_{i−1}^{l_{j−1}} = c_{j−ℓ}^{l_k} c_{j−ℓ+1}^{l_{j−ℓ+1}} ⋯ c_{j−2}^{l_{j−2}} c_{j−1}^{l_{j−1}}.
Note that the longest border may not fully cover the last (first) run of the corresponding prefix (suffix). This is the case, for example, for the word ab²a²b. However, since c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} ∈ P′, it must hold that l_{j−ℓ} = l_k, i.e., the first run of the suffix is fully covered. Let
z = 1 if ℓ > 0 and l_{j−1} < l_{i−1}, and z = 0 otherwise.
Informally, the integer z is equal to 1 if the longest border of c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} does not fully cover the run (c_{i−1}, l_{i−1}). By invariant (1) we have that c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} can be written as (uv)^q u, where
q = ⌊(j − k − z) / (j − i)⌋,  r = z + (j − k − z) mod (j − i),  u = c_{j−r}^{l_{j−r}} ⋯ c_{j−1}^{l_{j−1}},  (uv)^q u = c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}},  uv = c_{i−r}^{l_{i−r}} ⋯ c_{j−r−1}^{l_{j−r−1}} ∈ L.
For example, in the case of the word ab²ab²ab, for k = 0, we have j = 6, i = 4, q = 2, r = 2. The algorithm is based on the following Lemma:
Lemma 10.
Let j, ℓ be such that invariant (1) holds and let s = i − z. Then, we have the following cases:
1. If c_j < c_s then c_k^{l_k} ⋯ c_j^{l_j} ∉ P′;
2. If c_j > c_s then c_k^{l_k} ⋯ c_j^{l_j} ∈ L and (1) holds for j + 1 and ℓ = 0;
Moreover, if z = 0 , we also have:
3. If c_j = c_i and l_j ≤ l_i, then c_k^{l_k} ⋯ c_j^{l_j} ∈ P′ and (1) holds for j + 1 and ℓ + 1;
4. If c_j = c_i and l_j > l_i, then either c_j < c_{i+1} and c_k^{l_k} ⋯ c_j^{l_j} ∉ P′, or c_j > c_{i+1}, c_k^{l_k} ⋯ c_j^{l_j} ∈ L and (1) holds for j + 1 and ℓ = 0.
Proof. 
The idea is the following: we apply Lemma 3 with the word (uv)^q u as defined above and symbol c_j. Observe that c_j is compared with symbol v[0], which is equal to c_{k+r−1} = c_{i−1} if z = 1 and to c_{k+r} = c_i otherwise.
First note that, if z = 1, c_j ≠ c_{i−1}, since otherwise we would have c_{j−1} = c_{i−1} = c_j. In the first three cases, we obtain the first, second and third proposition of Lemma 3, respectively, for the word c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} c_j. Independently of the derived proposition, it is easy to verify that the same proposition also holds for c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} c_j^m, for m ≤ l_j. Consider now the fourth case. By a similar reasoning, we have that the third proposition of Lemma 3 holds for c_k^{l_k} ⋯ c_j^{l_i}. If we then apply Lemma 3 to c_k^{l_k} ⋯ c_j^{l_i} and c_j, c_j is compared to c_{i+1} and we must have c_j ≠ c_{i+1}, as otherwise c_i = c_j = c_{i+1}. Hence, either the first (if c_j < c_{i+1}) or the second (if c_j > c_{i+1}) proposition of Lemma 3 must hold for the word c_k^{l_k} ⋯ c_j^{l_i + 1}. □
We prove by induction that invariant (1) is maintained. At the beginning, the variables j and ℓ are initialized to k + 1 and 0, respectively, so the base case trivially holds. Suppose that the invariant holds for j, ℓ. Then, by Lemma 10, either c_k^{l_k} ⋯ c_j^{l_j} ∉ P′ or the invariant also holds for j + 1 and ℓ′, where ℓ′ is equal to ℓ + 1, if z = 0, c_j = c_i and l_j ≤ l_i, and to 0 otherwise. When c_k^{l_k} ⋯ c_j^{l_j} ∉ P′, the algorithm returns the pair (j − i, q), i.e., the length of uv and the corresponding exponent.
The code of the algorithm is shown in Figure 3. We now prove that the algorithm runs in O(ρ) time. First, observe that, by definition of the LR factorization, the for loop at line 4 is executed O(ρ) times. Suppose that the number of iterations of the while loop at line 2 is n and let k_1, k_2, …, k_{n+1} be the corresponding values of k, with k_1 = 0 and k_{n+1} = |R|. We now show that the s-th call to LF-rle-next performs fewer than 2(k_{s+1} − k_s) iterations, which yields O(ρ) iterations in total. This analysis is analogous to the one used by Duval. Suppose that i′, j′ and z′ are the values of i, j and z at the end of the s-th call to LF-rle-next. The number of iterations performed during this call is equal to j′ − k_s. We have k_{s+1} = k_s + q(j′ − i′), where q = ⌊(j′ − k_s − z′) / (j′ − i′)⌋, which implies j′ − k_s < 2(k_{s+1} − k_s) + 1, since, for any positive integers x ≥ y, x < 2⌊x/y⌋y holds.
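As an illustration of Lemma 10 and of the driver described above, here is a compact C sketch (ours; the run layout, names, and output format are assumptions, and Figure 3 contains the authors' actual code).

```c
#include <stdio.h>
#include <stddef.h>

typedef struct { unsigned char c; size_t len; } Run;   /* one run (c_i, l_i) */

/* Sketch of LF-rle-next(R, k) following Lemma 10: returns in *s the number of
 * runs of the longest prefix of R[k..] that is an LR word and in *q how many
 * times that block of runs repeats. */
static void lf_rle_next(const Run *R, size_t rho, size_t k, size_t *s, size_t *q)
{
    size_t j = k + 1, ell = 0;                 /* invariant (1): ell = border length in runs */
    while (j < rho) {
        size_t i = k + ell;
        int z = (ell > 0 && R[j-1].len < R[i-1].len);
        size_t t = i - z;                      /* index of the compared symbol c_s */
        if (R[j].c < R[t].c) break;            /* case 1: prefix leaves P'          */
        else if (R[j].c > R[t].c) { ell = 0; j++; }       /* case 2: prefix is a Lyndon word */
        else if (R[j].len <= R[i].len) { ell++; j++; }    /* case 3 (z = 0): border grows    */
        else if (R[j].c < R[i+1].c) break;     /* case 4, first alternative: leaves P' */
        else { ell = 0; j++; }                 /* case 4, second alternative: Lyndon   */
    }
    size_t i = k + ell;
    int z = (ell > 0 && R[j-1].len < R[i-1].len);
    *s = j - i;                                /* runs in one period uv */
    *q = (j - k - z) / (j - i);                /* exponent q            */
}

/* Driver: prints the intervals of runs of the LR factorization of R. */
static void lf_rle(const Run *R, size_t rho)
{
    for (size_t k = 0; k < rho; ) {
        size_t s, q;
        lf_rle_next(R, rho, k, &s, &q);
        for (size_t t = 0; t < q; t++, k += s)
            printf("[%zu, %zu]\n", k, k + s - 1);
    }
}
```

On the example ab²ab²ab, i.e., the runs (a,1)(b,2)(a,1)(b,2)(a,1)(b,1), the first call returns s = 2 and q = 2, so the driver reports the factors ab², ab² and then ab.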

6. Experimental Results

We tested extensively the algorithms LF-Duval, LF-skip, and LF-rle. In addition, we also tested variations of LF-Duval and LF-skip, denoted as LF-Duval2 and LF-skip2. LF-Duval2 performs, in line 9 of LF-next, the if-test
if w[j − 1] = w[i − 1] then
which is always true at that point. This form, which is advantageous for compiler optimization, can be justified by the formulation of the original algorithm [5], where there is a three-branch test on w[j − 1] and w[i − 1]. LF-skip2, after finding the first c̄^r, searches for c̄^r until c̄^{r+1} is found, whereas LF-skip searches for c̄^r x, where x is a character class.
The experiments were run on Intel Core i7-4578U with 3 GHz clock speed and 16 GB RAM. The algorithms were written in the C programming language and compiled with gcc 5.4.0 using the O3 optimization level.
Testing LF-skip. At first we tested the variations of LF-skip against the variations of LF-Duval. The texts were random sequences of 5 MB. For each alphabet size σ = 2, 4, …, 256 we generated 100 sequences with a uniform distribution, and each run with each sequence was repeated 500 times. The average run times are given in Table 1, which is shown in graphical form in Figure 4.
LF-skip was faster than the best variation of LF-Duval for all tested values of σ. The speed-up was significant for small alphabets. LF-skip2 was faster than LF-skip for σ ≤ 16 and slower for σ > 16.
The speed of LF-Duval did not depend on σ. LF-Duval2 became faster as the size of the alphabet grew. For large alphabets LF-Duval2 was faster than LF-Duval, and for small alphabets the other way round. In additional refined experiments, σ = 5 was the threshold value. When we compiled LF-Duval and LF-Duval2 without optimization, both variations behaved in a similar way. So the better performance of LF-Duval2 for large alphabets is due to compiler optimization, possibly because of branch prediction.
We tested the variations of LF-skip also with longer random sequences of four characters up to 500 MB. The average speed did not essentially change when the sequence became longer.
In addition, we tested LF-skip and LF-skip2 with real texts. At first we did experiments with texts of natural language. Because runs are very short in natural language and a newline or some other control character is the smallest character, the benefit of LF-skip or LF-skip2 was marginal. If it were acceptable to relax the lexicographic order of the characters, some gain could be obtained. For example, LF-skip achieved a speed-up of 2 over LF-Duval2 in the case of the KJV Bible when 'l' is the smallest character.
For the DNA sequence of the fruit fly (15 MB), LF-skip2 was 20.3 times faster than LF-Duval. For the protein sequence of Saccharomyces cerevisiae (2.9 MB), LF-skip2 was 8.7 times faster than LF-Duval2. The run times on these biological sequences are shown in Table 2.
Testing LF-rle. To assess the performance of the LF-rle algorithm, we tested it together with LF-Duval, LF-Duval2 and LF-skip2 for random binary sequences of 5 MB with different probability distributions, so as to vary the number of runs in the sequence. The running time of LF-rle does not include the time needed to compute the RLE of the sequence, i.e., we assumed that the sequence is given in the RLE form, since otherwise other algorithms are preferable. For each test we generated 100 sequences, and each run with each sequence was repeated 500 times. The average run times are given in Table 3 which is shown in a graphical form in Figure 5.
Table 3 shows that LF-rle was the fastest for the distributions P(0) = 0.05, 0.9, and 0.95. Table 3 also reveals that LF-rle and LF-Duval2 behaved symmetrically with respect to the distribution of zeros and ones, while LF-skip2 behaved asymmetrically, which is due to the fact that LF-skip2 searches for the runs of the smallest character, which was zero in this case.
In our tests the run time of LF-Duval was about 14.7 ms for all sequences of 5 MB. Thus LF-Duval is a better choice than LF-Duval2 for cases P ( 0 ) = 0.3 and 0.7.

7. Conclusions

We presented new variations of Duval's algorithm for computing the Lyndon factorization of a string. The first algorithm, LF-skip, was designed for strings containing runs of the smallest character of the alphabet and is able to skip a significant portion of the characters of the string. The second algorithm, LF-rle, is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time and constant space. Our experimental results show that these algorithms can offer a significant speed-up over Duval's original algorithm. LF-skip is especially efficient on biological sequences.

Author Contributions

Formal analysis, E.G.; Investigation, S.S.G.; Methodology, J.T.; Software, E.G. and J.T.; Supervision, J.T.; Writing—original draft, S.S.G. and E.G.; Writing—review & editing, J.T.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, K.T.; Fox, R.H.; Lyndon, R.C. Free differential calculus. IV. The quotient groups of the lower central series. Ann. Math. 1958, 68, 81–95. [Google Scholar] [CrossRef]
  2. Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M. Sorting suffixes of a text via its Lyndon factorization. In Proceedings of the Prague Stringology Conference 2013, Prague, Czech Republic, 2–4 September 2013; pp. 119–127. [Google Scholar]
  3. Gil, J.Y.; Scott, D.A. A bijective string sorting transform. arXiv 2012, arXiv:1201.3077. [Google Scholar]
  4. Kufleitner, M. On bijective variants of the Burrows-Wheeler transform. In Proceedings of the Prague Stringology Conference 2009, Prague, Czech Republic, 31 August–2 September 2009; pp. 65–79. [Google Scholar]
  5. Duval, J.P. Factorizing words over an ordered alphabet. J. Algorithms 1983, 4, 363–381. [Google Scholar] [CrossRef]
  6. Apostolico, A.; Crochemore, M. Fast parallel Lyndon factorization with applications. Math. Syst. Theory 1995, 28, 89–108. [Google Scholar] [CrossRef] [Green Version]
  7. Roh, K.; Crochemore, M.; Iliopoulos, C.S.; Park, K. External memory algorithms for string problems. Fundam. Inform. 2008, 84, 17–32. [Google Scholar]
  8. Tomohiro, I.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Faster Lyndon factorization algorithms for SLP and LZ78 compressed text. Theor. Comput. Sci. 2016, 656, 215–224. [Google Scholar] [CrossRef]
  9. Furuya, I.; Nakashima, Y.; Tomohiro, I.; Inenaga, S.; Bannai, H.; Takeda, M. Lyndon Factorization of Grammar Compressed Texts Revisited. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM 2018), Qingdao, China, 2–4 July 2018. [Google Scholar] [CrossRef]
  10. Ghuman, S.S.; Giaquinta, E.; Tarhio, J. Alternative algorithms for Lyndon factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, 1–3 September 2014; pp. 169–178. [Google Scholar]
  11. Lothaire, M. Combinatorics on Words; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
  12. Durian, B.; Holub, J.; Peltola, H.; Tarhio, J. Improving practical exact string matching. Inf. Process. Lett. 2010, 110, 148–152. [Google Scholar] [CrossRef]
  13. Navarro, G.; Raffinot, M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algorithm 2000, 5, 4. [Google Scholar] [CrossRef]
Figure 1. Duval’s algorithm to compute the Lyndon factorization of a string.
Figure 2. The algorithm to compute the Lyndon factorization that can potentially skip symbols.
Figure 3. The algorithm to compute the Lyndon factorization of a run-length encoded string.
Figure 4. Comparison of the algorithms on random sequences (5 MB) with a uniform distribution of a varying alphabet size.
Figure 5. Comparison of the algorithms on random binary sequences (5 MB) with a skew distribution.
Table 1. Run times in milliseconds on random sequences (5 MB) with a uniform distribution of a varying alphabet size.
  σ      LF-Duval   LF-Duval2   LF-skip   LF-skip2
  2      14.6       21.9        2.5       1.5
  4      14.6       14.9        1.6       1.1
  8      14.7        9.1        1.3       1.1
  16     14.7        6.4        1.3       1.2
  32     14.7        5.0        1.4       1.6
  64     14.7        4.3        1.7       2.3
  128    14.7        4.0        2.0       3.2
  192    14.6        3.8        1.7       3.7
  256    14.6        3.8        2.6       4.1
Table 2. Run times in milliseconds on two biological sequences.
                     LF-Duval   LF-Duval2   LF-skip   LF-skip2
  DNA (15 MB)        44.7       52.2        3.0       2.2
  Protein (2.9 MB)    8.5        3.4        0.50      0.39
Table 3. Run times in milliseconds on random binary sequences (5 MB) with a skew distribution.
  P(zero)   LF-Duval   LF-Duval2   LF-skip2   LF-rle
  0.05      14.6        5.7        1.4        0.70
  0.10      14.7        7.8        1.1        1.3
  0.20      14.7       12.4        1.0        2.4
  0.30      14.8       17.4        1.2        3.2
  0.70      14.7       16.9        1.7        3.2
  0.80      14.6       12.7        2.0        2.4
  0.90      14.6        8.4        2.8        1.3
  0.95      14.7        6.3        4.7        0.70
