Article

Computing Maximal Lyndon Substrings of a String

by Frantisek Franek 1,† and Michael Liut 2,*,†
1 Department of Computing and Software, McMaster University, Hamilton, ON L8S 4K1, Canada
2 Department of Mathematical and Computational Sciences, University of Toronto Mississauga, Mississauga, ON L5L 1C6, Canada
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Algorithms 2020, 13(11), 294; https://doi.org/10.3390/a13110294
Submission received: 23 September 2020 / Revised: 4 November 2020 / Accepted: 10 November 2020 / Published: 12 November 2020
(This article belongs to the Special Issue Combinatorial Methods for String Processing)

Abstract: There are two reasons to have an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that relies on knowing all right-maximal Lyndon substrings of the input string, and secondly, Franek et al. showed in 2017 a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings of a string, inspired by a novel suffix sorting algorithm of Baier. In 2016, Franek et al. presented a brief overview of algorithms for computing the Lyndon array that encodes the knowledge of right-maximal Lyndon substrings of the input string. Among those presented were two well-known algorithms for computing the Lyndon array: a quadratic in-place algorithm based on the iterated Duval algorithm for Lyndon factorization and a linear algorithmic scheme based on linear suffix sorting, computing the inverse suffix array, and applying to it the next smaller value algorithm. Duval's algorithm works for strings over any ordered alphabet, while for linear suffix sorting, a constant or an integer alphabet is required. The authors at that time were not aware of Baier's algorithm. In 2017, our research group proposed a novel algorithm for the Lyndon array. Though the proposed algorithm is linear in the average case and has O(n log(n)) worst-case complexity, it is interesting as it emulates the fast Fourier algorithm's recursive approach and introduces τ-reduction, which might be of independent interest. In 2018, we presented a linear algorithm to compute the Lyndon array of a string inspired by Phase I of Baier's algorithm for suffix sorting. This paper presents the theoretical analysis of these two algorithms and provides empirical comparisons of both of their C++ implementations with respect to the iterated Duval algorithm.


1. Introduction

In combinatorics on words, Lyndon words play a very important role. Lyndon words, a special case of Hall words, were named after Roger Lyndon, who was looking for a suitable description of the generators of free Lie algebras [1]. Despite their humble beginnings, to date, Lyndon words have facilitated many applications in mathematics and computer science, some of which are: constructing de Bruijn sequences, constructing bases in free Lie algebras, finding the lexicographically smallest or largest substring in a string, and succinct prefix matching of highly periodic strings; see Marcus et al. [2], and the informative paper by Berstel and Perrin [3] and the references therein.
The pioneering work on Lyndon decomposition was introduced by Chen, Fox, and Lyndon in [4]. The Lyndon decomposition theorem is not explicitly stated there; nevertheless, it follows from the work presented there.
Theorem 1
(Lyndon decomposition theorem, Chen–Fox–Lyndon, 1958). For any word x, there are unique Lyndon words u_1, …, u_k so that u_{i+1} ⪯ u_i for any 1 ≤ i < k and x = u_1 u_2 ⋯ u_k, where ≺ denotes the lexicographic ordering.
As there exists a bijection between Lyndon words over an alphabet of cardinality k and irreducible polynomials over F_k [5], many results are known about this factorization: the average number of factors, the average length of the longest factor [6] and of the shortest [7]. Several algorithms deal with Lyndon factorization. Duval gave in [8] an elegant algorithm that computes, in linear time and in-place, the factorization of a word into Lyndon words. More about its implementation can be found in [9]. In [10], Fredricksen and Maiorana presented an algorithm generating all Lyndon words up to a given length in lexicographical order. This algorithm runs in constant average time per generated word.
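For concreteness, the following C++ sketch shows Duval's factorization in its commonly presented textbook form; it is an illustration of the classical algorithm, not the code of [9] or of the implementations in [24]:

    #include <string>
    #include <vector>

    // Duval's linear, in-place Lyndon factorization (a minimal sketch).
    std::vector<std::string> duvalFactorization(const std::string& s) {
        std::vector<std::string> factors;
        std::size_t i = 0, n = s.size();
        while (i < n) {
            std::size_t j = i + 1, k = i;      // s[i..j-1] is a prefix of a power of a Lyndon word
            while (j < n && s[k] <= s[j]) {
                k = (s[k] < s[j]) ? i : k + 1; // strict increase restarts the period check
                ++j;
            }
            std::size_t p = j - k;             // length of each Lyndon factor found
            while (i <= k) { factors.push_back(s.substr(i, p)); i += p; }
        }
        return factors;
    }
    // e.g., duvalFactorization("banana") yields b, an, an, a.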
Two of the latest applications of Lyndon words are due to Bannai et al. In [11], they employed Lyndon roots of runs to prove the runs conjecture that the number of runs in a string is bounded by the length of the string. In the same paper, they presented an algorithm to compute all the runs in a string in linear time that requires the knowledge of all right-maximal Lyndon substrings of the input string with respect to an order of the alphabet and its inverse. The latter result was the major reason for our interest in computing the right-maximal Lyndon substrings of a given string. Though the terms word and string are interchangeable, and so are the terms subword and substring, in the following, we prefer to use exclusively the terms string and substring to avoid confusing the reader.
There are at least two reasons for having an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. published in 2017 [11] a linear algorithm to compute all runs in a string that depends on knowing all right-maximal Lyndon substrings of the input string, and secondly, in 2017, Franek et al. in [12] showed a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings, inspired by Phase II of a suffix sorting algorithm introduced by Baier in 2015 (Master’s thesis [13]) and published in 2016 [14].
The most significant feature of the runs algorithm presented in [11] is that it relies on knowing the right-maximal Lyndon substrings of the input string for some order of the alphabet and for the inverse of that order, while all other linear algorithms for runs rely on Lempel–Ziv factorization of the input string. It also raised the issue about which approach may be more efficient: to compute the Lempel–Ziv factorization or to compute all right-maximal Lyndon substrings. There are several efficient linear algorithms for Lempel–Ziv factorization (e.g., see [15,16] and the references therein).
Interestingly, Kosolobov [17] showed that for a general alphabet, in the decision tree model, the runs problem is easier than the Lempel–Ziv decomposition. His result supports the conjecture that there must be a linear random access memory model algorithm finding all runs.
Baier introduced in [13], and published in [14], a new algorithm for suffix sorting. Though Lyndon strings were never mentioned in [13,14], it was noticed by Christoph Diegelmann in a personal communication [18] that Phase I of Baier's suffix sort identifies and sorts all right-maximal Lyndon substrings.
The right-maximal Lyndon substrings of a string x = x[1..n] can be best encoded in the so-called Lyndon array, introduced in [19] and closely related to the Lyndon tree of [20]: an integer array L[1..n] so that for any i ∈ 1..n, L[i] = the length of the right-maximal Lyndon substring starting at the position i.
In an overview [19], Franek et al. discussed an algorithm based on an iterative application of Duval's Lyndon factorization algorithm [8], which we refer to here as IDLA, and an algorithmic scheme based on Hohlweg and Reutenauer's work [20], which we refer to as SSLA. The authors were not aware of Baier's algorithm at that time. Two additional algorithms were presented there, a quadratic recursive application of Duval's algorithm and an algorithm NSV* with possibly O(n log(n)) worst-case complexity based on ranges that can be compared in constant time for constant alphabets. The correctness of NSV* and its complexity were discussed there just informally.
The algorithm IDLA (see Figure 1) is simple and in-place, so no additional space is required except for the storage for the string and the Lyndon array. It is completely independent of the alphabet of the string and does not require the alphabet to be sorted; all it requires is that the alphabet be ordered, i.e., only pairwise comparisons of the alphabet symbols are needed. Its weakness is its quadratic worst-case complexity, which becomes a problem for longer strings with long right-maximal Lyndon substrings, as one of our experiments showed (see Figure 11 in Section 7).
In our empirical work, we used IDLA as a control for comparison and as a verifier of the results. Note that the reason the procedure MaxLyn of Figure 1 really computes the longest Lyndon prefix is not obvious and is based on the properties of periods of prefixes; see [8] or Observation 6 and Lemma 11 in [19].
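To illustrate the idea behind IDLA, the following sketch computes the Lyndon array by running a Duval-style longest-Lyndon-prefix scan from every position. This is a compact variant for illustration only; it is not the MaxLyn procedure of Figure 1 (the actual code is in lynarr.hpp [24]):

    #include <string>
    #include <vector>

    // A sketch of the iterated-Duval idea: for each i, find the length of the
    // longest Lyndon prefix of x[i..n-1] (the first factor of its Lyndon
    // factorization), which takes O(n) per position, O(n^2) overall.
    std::vector<int> idla(const std::string& x) {
        int n = static_cast<int>(x.size());
        std::vector<int> L(n);
        for (int i = 0; i < n; ++i) {
            int j = i + 1, k = i;              // invariant: x[i..j-1] has period j-k
            while (j < n && x[k] <= x[j]) {
                k = (x[k] < x[j]) ? i : k + 1;
                ++j;
            }
            L[i] = j - k;                      // length of the longest Lyndon prefix
        }
        return L;
    }

Like IDLA, this sketch needs only pairwise comparisons of alphabet symbols and no extra space beyond the string and the output array.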
Lemma 1, below, characterizes right-maximal Lyndon substrings in terms of the relationships of the suffixes and follows from the work of Hohlweg and Reutenauer [20]. Though the definition of a proto-Lyndon substring is formally given in Section 2 below, it suffices to say that it is a prefix of a Lyndon substring of the string. The definition of the lexicographic ordering ≺ is also given in Section 2. The proof of this lemma is delayed to the end of Section 2, where all the technical terms needed are defined.
Lemma 1.
Consider a string x[1..n] over an alphabet ordered by ≺.
A substring x[i..j] is proto-Lyndon if and only if:
(a)
x[i..n] ≺ x[k..n] for any i < k ≤ j.
A substring x[i..j] is right-maximal Lyndon if and only if:
(b)
x[i..j] is proto-Lyndon and
(c)
either j = n or x[j+1..n] ≺ x[i..n].
Thus, the Lyndon array is an NSV (Next Smaller Value) array of the inverse suffix array. Consequently, the Lyndon array can be computed by sorting the suffixes, i.e., computing the suffix array, then computing the inverse suffix array, and then applying NSV to it; see [19]. Computing the inverse suffix array and applying NSV are “naturally” linear, and computing the suffix array can be implemented to be linear; see [19,21] and the references therein. The execution and space characteristics are dominated by those of the first step, i.e., computation of the suffix array. We refer to this scheme as SSLA.
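A minimal sketch of the SSLA scheme follows. For brevity, the suffix array is built here by a comparison sort; a real implementation would plug in a linear suffix sorting algorithm, as the text explains:

    #include <algorithm>
    #include <numeric>
    #include <string>
    #include <vector>

    // SSLA sketch: suffix array -> inverse suffix array -> NSV.
    std::vector<int> ssla(const std::string& x) {
        int n = static_cast<int>(x.size());
        std::vector<int> sa(n), isa(n), L(n);
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {   // stand-in for a linear suffix sort
            return x.compare(a, std::string::npos, x, b, std::string::npos) < 0;
        });
        for (int r = 0; r < n; ++r) isa[sa[r]] = r;           // inverse suffix array
        std::vector<int> stack;                               // NSV scan, right to left
        for (int i = n - 1; i >= 0; --i) {
            while (!stack.empty() && isa[stack.back()] > isa[i]) stack.pop_back();
            L[i] = (stack.empty() ? n : stack.back()) - i;    // distance to the next smaller value
            stack.push_back(i);
        }
        return L;
    }

The NSV pass directly encodes Lemma 1: the right-maximal Lyndon substring at i ends just before the nearest position to the right whose suffix is lexicographically smaller.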
In 2018, a linear algorithm to compute the Lyndon array from a given Burrows–Wheeler transform was presented [22]. Since the Burrows–Wheeler transform is computed in linear time from the suffix array, it is yet another scheme of how to obtain the Lyndon array via suffix sorting: compute the suffix array; from the suffix array, compute the Burrows–Wheeler transform, then compute the Lyndon array during the inversion of the Burrows–Wheeler transform. We refer to this scheme as BWLA.
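The first step of the BWLA scheme, deriving the Burrows–Wheeler transform from the suffix array, is a single linear pass, sketched below for a $-terminated string. The interesting part of BWLA, recovering the Lyndon array during the inversion of the transform, is the subject of [22] and is not shown:

    #include <string>
    #include <vector>

    // BWT from a (0-based) suffix array of a $-terminated string x:
    // row r of the sorted rotations ends with the symbol preceding suffix sa[r].
    std::string bwtFromSA(const std::string& x, const std::vector<int>& sa) {
        std::string bwt(x.size(), ' ');
        for (std::size_t r = 0; r < sa.size(); ++r)
            bwt[r] = sa[r] == 0 ? x.back() : x[sa[r] - 1];
        return bwt;
    }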
The introduction of Baier's suffix sort in 2015 and the subsequent realization of its connection to right-maximal Lyndon substrings showed that there was an elementary algorithm (one not relying on a pre-processed global data structure such as a suffix array or a Burrows–Wheeler transform) to compute the Lyndon array and that, despite its original clumsiness, it could eventually be refined to outperform any SSLA or BWLA implementation: any implementation of a suffix sorting-based scheme requires a full suffix sort and then some additional processing, while Baier's approach is "just" a partial suffix sort; see [23].
In this work, we present two additional algorithms for the Lyndon array not discussed in [19]. The C++ source code of the three implementations IDLA, TRLA, and BSLA is available; see [24]. Note that the procedure IDLA is in the lynarr.hpp file.
The first algorithm presented here is TRLA. TRLA is a τ-reduction-based Lyndon array algorithm that follows Farach's approach used in his remarkable linear algorithm for suffix tree construction [25] and reproduced very successfully in all linear algorithms for suffix sorting (e.g., see [21,26] and the references therein). Farach's approach follows the Cooley–Tukey algorithm for the fast Fourier transform, relying on recursion to lower the quadratic complexity to O(n log(n)) complexity; see [27]. TRLA was first introduced by the authors in 2019 (see [28,29]) and presented as a part of Liut's Ph.D. thesis [30].
The second algorithm, BSLA, is a Baier’s sort-based Lyndon array algorithm. BSLA is based on the idea of Phase I of Baier’s suffix sort, though our implementation necessarily differs from Baier’s. BSLA was first introduced at the Prague Stringology Conference 2018 [23] and also presented as a part of Liut’s Ph.D. thesis [30] in 2019; here, we present a complete and refined theoretical analysis of the algorithm and a more efficient implementation than that initially introduced.
The paper is structured as follows: In Section 2, the basic notions and terminology are presented. In Section 3, the TRLA algorithm is presented and analysed. In Section 4, the BSLA algorithm is presented and analysed. In Section 5, the datasets with random strings of various lengths and over various alphabets and other datasets used in the empirical tests are described. In Section 6, the conclusion of the research and the future work are presented. The results of the empirical measurements of the performance of IDLA, TRLA, and BSLA on those datasets are presented in Section 7 in both tabular and graphical forms.

2. Basic Notation and Terminology

Most of the fundamental notions, definitions, facts, and string algorithms can be found in [31,32,33,34]. For the ease of access, this section includes those that are directly related to the work herein.
The set of integers is denoted by Z. For two integers i ≤ j, the range i..j = {k ∈ Z | i ≤ k ≤ j}. An alphabet is a finite or infinite set of symbols, equivalently called letters. We always assume that the sentinel symbol $ is not in the alphabet and that it is lexicographically the smallest. A string over an alphabet A is a finite sequence of symbols from A. A $-terminated string over A is a string over A terminated by $. We use the array notation indexing from 1 for strings; thus, x[1..n] indicates a string of length n; the first symbol is the symbol with index 1, i.e., x[1]; the second symbol is the symbol with index 2, i.e., x[2], etc. Thus, x[1..n] = x[1] x[2] ⋯ x[n]. For a $-terminated string x of length n, x[n+1] = $. The alphabet of string x, denoted as A_x, is the set of all distinct alphabet symbols occurring in x.
We use the term strings over a constant alphabet if the alphabet is a fixed finite alphabet. The integer alphabet is the infinite alphabet A = {0, 1, 2, …}. We use the term strings over the integer alphabet for the strings over the alphabet {0, 1, 2, …} with an additional constraint that all letters occurring in the string are smaller than the length of the string, i.e., in this paper, x[1..n] is a string over the integer alphabet if it is a string over the alphabet {0, 1, …, n−1}. Many authors use a more general definition; for instance, Burkhardt and Kärkkäinen [35] defined it as any set of integers of size n^{O(1)}; however, our results can easily be adapted to such more general definitions without changing their essence.
We use a bold font to denote strings; thus, x denotes a string, while x denotes some other mathematical entity such as an integer. The empty string is denoted by ε and has length zero. The length or size of string x = x[1..n] is n. The length of a string x is denoted by |x|. For two strings x = x[1..n] and y = y[1..m], the concatenation xy is a string u where u[i] = x[i] for 1 ≤ i ≤ n and u[i] = y[i−n] for n < i ≤ n+m.
If x = uvw, then u is a prefix, v a substring, and w a suffix of x. If u (respectively, v, w) is empty, then it is called an empty prefix (respectively, empty substring, empty suffix); if |u| < |x| (respectively, |v| < |x|, |w| < |x|), then it is called a proper prefix (respectively, proper substring, proper suffix). If x = uv, then vu is called a rotation or a conjugate of x; if either u = ε or v = ε, then the rotation is called trivial. A non-empty string x is primitive if there is no string y and no integer k ≥ 2 so that x = y^k = y y ⋯ y (k times).
A non-empty string x has a non-trivial border u if u is both a non-empty proper prefix and a non-empty proper suffix of x. Thus, both ε and x are trivial borders of x. A string without a non-trivial border is called unbordered.
Let ≺ be a total order of an alphabet A. The order is extended to all finite strings over the alphabet A: for x = x[1..n] and y = y[1..m], x ≺ y if either x is a proper prefix of y or there is a j ≤ min{n, m} so that x[1] = y[1], …, x[j−1] = y[j−1] and x[j] ≺ y[j]. This total order induced by the order of the alphabet is called a lexicographic order of all non-empty strings over A. We write x ⪯ y if either x ≺ y or x = y. A string x over A is Lyndon for a given order ≺ of A if x is strictly lexicographically smaller than any non-trivial rotation of x. In particular:
x is Lyndon ⇒ x is unbordered ⇒ x is primitive
Note that the reverse implications do not hold: aba is primitive but neither unbordered nor Lyndon, while acaab is unbordered but not Lyndon. A substring x[i..j] of x[1..n], 1 ≤ i ≤ j ≤ n, is a right-maximal Lyndon substring of x if it is Lyndon and either j = n or for any k > j, x[i..k] is not Lyndon.
A substring x[i..j] of a string x[1..n] is proto-Lyndon if there is a j ≤ k ≤ n so that x[i..k] is Lyndon. The Lyndon array of a string x = x[1..n] is an integer array L[1..n] so that L[i] = j where j ≤ n−i+1 is the maximal integer such that x[i..i+j−1] is Lyndon. Alternatively, we can define it as an integer array L′[1..n] so that L′[i] = j where j is the last position of the right-maximal Lyndon substring starting at the position i. The relationship between those two definitions is straightforward: L′[i] = L[i] + i − 1 and L[i] = L′[i] − i + 1.
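The conversion between the two forms is a single linear pass; a small sketch (0-based storage for 1-based positions):

    #include <vector>

    // L[i] holds lengths; returns L' with end positions: L'[i] = L[i] + i - 1
    // for 1-based i, hence the "+ i" with 0-based storage.
    std::vector<int> toEndPositions(const std::vector<int>& L) {
        std::vector<int> Lp(L.size());
        for (std::size_t i = 0; i < L.size(); ++i)
            Lp[i] = L[i] + static_cast<int>(i);
        return Lp;
    }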
Proof of Lemma 1.
We first prove the following claim:
Claim. Substring x[i..j] is right-maximal Lyndon if and only if:
(b′)
x[i..n] ≺ x[k..n] for any i < k ≤ j and
(c′)
either j = n or x[j+1..n] ≺ x[i..n].
Let x[i..j] be right-maximal Lyndon. We need to show that (b′) and (c′) hold.
Let i < k ≤ j. Since x[i..j] is Lyndon, x[i..j] ≺ x[k..j]. Thus, there is 0 ≤ r so that i+r ≤ j and k+r ≤ j and x[i+ℓ] = x[k+ℓ] for any 0 ≤ ℓ < r and x[i+r] ≺ x[k+r]. It follows that x[i..n] ≺ x[k..n], and (b′) is satisfied.
If j = n, then (c′) holds. Therefore, let us assume that j < n.
(1)
If x[i] ≺ x[j+1], then x[i..j+1] ≺ x[j+1..j+1]; together with x[i..j] ≺ x[k..j] for any i < k ≤ j, it shows that x[i..j+1] is Lyndon, contradicting the right-maximality of x[i..j].
(2)
If x[i] ≻ x[j+1], then x[j+1..n] ≺ x[i..n], and (c′) holds.
(3)
If x[i] = x[j+1], then there are a prefix u, strings v and w, and an integer r ≥ 1 so that uv = x[i..j], x[i..n] = u v u^r w, and x[j+1..n] = u^r w. Let us take a maximal such u and a maximal such r.
(3a)
Let r ≥ 2. Then, since uv = x[i..j] is Lyndon, u ≺ v, and so, uu ≺ uv; hence, u^r w ≺ u v u^r w, and (c′) holds.
(3b)
Let r = 1, i.e., x[i..n] = u v u w and x[j+1..n] = u w. There is no common non-empty prefix of v and w, as it would contradict the maximality of u. There are thus two mutually exclusive cases: either v[1] ≺ w[1] or w[1] ≺ v[1]. Let v[1] = c_1 and w[1] = c_2. If c_1 ≺ c_2, then u c_1 ≺ u c_2, and so, u v ≺ u c_2. For any u = u_1 u_2, u v ≺ u_2 v as u v is Lyndon, so u v ≺ u_2 c_2, giving u v c_2 to be Lyndon, a contradiction with the right-maximality of u v. Therefore, c_2 ≺ c_1; thus, u w ≺ u v u w, and (c′) holds.
Now, we go in the opposite direction. Assuming (b′) and (c′), we need to show that x[i..j] is right-maximal Lyndon.
Consider the right-maximal Lyndon substring of x starting at the position i; it is x[i..ℓ] for some i ≤ ℓ ≤ n.
(1)
If ℓ < j, then by the first part of this proof, x[i..n] ≻ x[ℓ+1..n], contradicting the assumption (b′) as ℓ+1 ≤ j.
(2)
If j < ℓ, then j+1 ≤ ℓ, and by the first part of this proof, x[i..n] ≺ x[j+1..n], contradicting the assumption (c′).
(3)
If ℓ = j, then x[i..j] is right-maximal Lyndon.
Now, we can prove (a). Let x[i..j] be a proto-Lyndon substring of x. By definition, it is a prefix of a Lyndon substring of x and, hence, a prefix of a right-maximal Lyndon substring of x, say x[i..ℓ] for some j ≤ ℓ ≤ n. It follows from the claim that x[i..n] ≺ x[k..n] for any i < k ≤ ℓ. Since j ≤ ℓ, (a) holds. For the opposite direction, if (a) holds, then there are two possibilities: either there is j < ℓ ≤ n so that x[ℓ..n] ≺ x[i..n] (take the smallest such ℓ), or x[i..n] ≺ x[k..n] for any i < k ≤ n. By the claim, in the former case, x[i..ℓ−1] is a right-maximal Lyndon substring of x, while in the latter case, x[i..n] is a right-maximal Lyndon substring of x. Thus, in both cases, x[i..j] is a prefix of a Lyndon substring of x.
With (a) proven, we can now replace (b′) in the claim with (b), completing the proof. □

3. τ -Reduction Algorithm (TRLA)

The purpose of this section is to introduce a recursive algorithm TRLA for computing the Lyndon array of a string. As will be shown below, the most significant aspect is the so-called τ-reduction of a string and how the Lyndon array of the τ-reduced string can be expanded to a partially filled Lyndon array of the whole string, as well as how to compute the missing values. This section thus provides the mathematical justification for the algorithm and, in so doing, proves the correctness of the algorithm. The mathematical understanding of the algorithm provides the basis for bounding its worst-case complexity by O(n log(n)) and for determining the linearity of the average-case complexity.
The first idea of the algorithm was proposed in Paracha’s 2017 Ph.D. thesis [36]. It follows Farach’s approach [25]:
(1)
reduce the input string x to y ;
(2)
by recursion, compute the Lyndon array of y ; and
(3)
from the Lyndon array of y , compute the Lyndon array of x .
The input strings for the algorithm are $-terminated strings over an integer alphabet. The reduction computed in (1) is important. All linear algorithms for suffix array computations use the proximity property of suffixes: comparing x[i..n] and x[j..n] can be done by comparing x[i] and x[j] and, if they are the same, comparing x[i+1..n] with x[j+1..n]. For instance, in the first linear algorithm for the suffix array by Kärkkäinen and Sanders [37], obtaining the sorted suffixes for positions i ≡ 0 (mod 3) and i ≡ 1 (mod 3) via the recursive call is sufficient to determine the order of suffixes for the i ≡ 2 (mod 3) positions, then merging both lists together. However, there is no such proximity property for right-maximal Lyndon substrings, so the reduction itself must have a property that helps determine some of the values of the Lyndon array of x from the Lyndon array of y while computing the rest.
In our algorithm, we use a special reduction, which we call τ-reduction, defined in Section 3.2, that reduces the original string to at least 1/2 and at most 2/3 of its length. The algorithm computes y as a τ-reduction of the input string x in Step (1) in linear time. In Step (3), it expands the Lyndon array of the reduced string computed by Step (2) to an incomplete Lyndon array of the original string, also in linear time. The incomplete Lyndon array computed in (3) is about 1/2 to 2/3 full, and for every position i with an unknown value, the values at positions i−1 and i+1 are known. In particular, the values at position 1 and position n are both known. Therefore, much information is provided by the recursive Step (2). For instance, for 00011001, the recursive call identifies the right-maximal Lyndon substrings starting at the black positions 1, 3, 5, and 7, namely 00011001, 011, 1, and 01, and the missing right-maximal Lyndon substrings starting at the remaining positions, namely 0011 at position 2, 1 at position 4, and 001 at position 6, still need to be computed.
However, computing the missing values of the incomplete Lyndon array takes at most O(n log(n)) steps, as we will show, resulting in the overall worst-case complexity of O(n log(n)). When the input string is such that the missing values of the incomplete Lyndon array of the input string can be computed in linear time, the overall execution of the algorithm is linear as well, and thus, the average-case complexity will be shown to be linear in the length of the input string.
In the following subsections, we describe the τ-reduction in several steps: first, the τ-pairing, then choosing the τ-alphabet, and finally, the computation of τ(x). The τ-reduction may be of some general interest as it preserves (see Lemma 6) some right-maximal Lyndon substrings of the original string.

3.1. τ -Pairing

Consider a $-terminated string x = x[1..n] whose alphabet A_x is ordered by ≺, where x[n+1] = $ and $ ≺ a for any a ∈ A_x. A τ-pair consists of a pair of adjacent positions from the range 1..n+1. The τ-pairs are computed by induction:
1. the initial τ-pair is (1, 2);
2. if (i−1, i) is the last τ-pair computed, then:
      if i = n−1 then
         the next τ-pair is set to (n, n+1)
         stop
      elseif i ≥ n then
         stop
      elseif x[i−1] ≻ x[i] and x[i] ≺ x[i+1] then
         the next τ-pair is set to (i, i+1); repeat 2.
      else
         the next τ-pair is set to (i+1, i+2); repeat 2.
Every position of the input string that occurs in some τ-pair as the first element is labelled black; all others are labelled white. Note that Position 1 is always black, while the last position n can be either black or white; however, the positions n−1 and n cannot be simultaneously both black. Note also that most of the τ-pairs do not overlap; if two τ-pairs overlap, they overlap in a position i such that 1 < i < n and x[i−1] ≻ x[i] and x[i] ≺ x[i+1]. The first position and the last position never figure in an overlap of τ-pairs. Moreover, a τ-pair can be involved in at most one overlap; for an illustration, see Figure 2; for the formal proof, see Lemma 2.
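The scan is easily implemented in linear time. The following sketch (with 1-based positions, as in the text) returns the black positions, i.e., the first elements of the τ-pairs; it is an illustration of the rules above, not the code of [24]:

    #include <string>
    #include <vector>

    // Returns the first (black) position of every tau-pair of x.
    std::vector<int> tauPairStarts(const std::string& x) {
        int n = static_cast<int>(x.size());
        std::vector<int> black{1};          // the initial tau-pair is (1, 2)
        int i = 2;                          // second element of the last pair computed
        while (true) {
            if (i == n - 1) { black.push_back(n); break; }   // final pair (n, n+1)
            if (i >= n) break;
            if (x[i - 2] > x[i - 1] && x[i - 1] < x[i]) {    // x[i-1] > x[i] < x[i+1]: overlap
                black.push_back(i); i += 1;                  // next pair (i, i+1)
            } else {
                black.push_back(i + 1); i += 2;              // next pair (i+1, i+2)
            }
        }
        return black;
    }
    // For x = "011023122" this returns 1, 3, 4, 6, 7, 9, matching Figure 2.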
Lemma 2.
Let (i_1, i_1+1), …, (i_k, i_k+1) be the τ-pairs of a string x = x[1..n], each regarded as a two-element set of positions. Then, for any j, ℓ ∈ 1..k with j ≠ ℓ:
(1)
if |(i_j, i_j+1) ∩ (i_ℓ, i_ℓ+1)| = 1, then for any m ≠ j, ℓ, |(i_j, i_j+1) ∩ (i_m, i_m+1)| = 0;
(2)
|(i_j, i_j+1) ∩ (i_ℓ, i_ℓ+1)| ≤ 1.
Proof. 
This is by induction; trivially true for |x| = 1 as (1, 2) is the only τ-pair. Assume it is true for |x| ≤ n−1.
  • Case (i_k, i_k+1) = (n, n+1):
    Then, (i_{k−1}, i_{k−1}+1) = (n−2, n−1), and so, (i_1, i_1+1), …, (i_{k−1}, i_{k−1}+1) are τ-pairs of x[1..n−1]; thus, they satisfy (1) and (2) by the induction hypothesis. However, (n, n+1) ∩ (i_ℓ, i_ℓ+1) = ∅ for 1 ≤ ℓ < k, so (1) and (2) hold for (i_1, i_1+1), …, (i_k, i_k+1).
  • Case (i_k, i_k+1) = (n−1, n) and (i_{k−1}, i_{k−1}+1) = (n−2, n−1):
    Then, (i_1, i_1+1), …, (i_{k−1}, i_{k−1}+1) are τ-pairs of x[1..n−1], and thus, they satisfy (1) and (2) by the induction hypothesis. However, (i_k, i_k+1) ∩ (i_ℓ, i_ℓ+1) = ∅ for 1 ≤ ℓ < k−1, and (i_k, i_k+1) ∩ (i_{k−1}, i_{k−1}+1) = {i_{k−1}+1} = {n−1}; so, |(i_k, i_k+1) ∩ (i_{k−1}, i_{k−1}+1)| ≤ 1, and so, (1) and (2) hold for (i_1, i_1+1), …, (i_k, i_k+1).
  • Case (i_k, i_k+1) = (n−1, n) and (i_{k−1}, i_{k−1}+1) = (n−3, n−2):
    Then, (i_1, i_1+1), …, (i_{k−1}, i_{k−1}+1) are τ-pairs of x[1..n−2], so they satisfy (1) and (2) by the induction hypothesis. However, (i_k, i_k+1) ∩ (i_ℓ, i_ℓ+1) = ∅ for 1 ≤ ℓ < k, so (1) and (2) hold for (i_1, i_1+1), …, (i_k, i_k+1). □

3.2. τ -Reduction

For each τ-pair (i, i+1), we consider the pair of alphabet symbols (x[i], x[i+1]). We call them symbol τ-pairs. They are in a total order ⊲ induced by ≺: (x[i_j], x[i_j+1]) ⊲ (x[i_ℓ], x[i_ℓ+1]) if either x[i_j] ≺ x[i_ℓ], or x[i_j] = x[i_ℓ] and x[i_j+1] ≺ x[i_ℓ+1]. They are sorted using the radix sort with a key of size two and assigned letters from a chosen τ-alphabet that is a subset of {0, 1, …, |τ(x)|} so that the assignment preserves the order. Since the input string is over an integer alphabet, the radix sort is linear.
In the example (Figure 2), the τ-pairs are (1,2), (3,4), (4,5), (6,7), (7,8), (9,10), and so, the symbol τ-pairs are (0,1), (1,0), (0,2), (3,1), (1,2), (2,$). The sorted symbol τ-pairs are (0,1), (0,2), (1,0), (1,2), (2,$), (3,1). Thus, we chose as our τ-alphabet {0, 1, 2, 3, 4, 5}, and so, the symbol τ-pairs are assigned these letters: (0,1) → 0, (0,2) → 1, (1,0) → 2, (1,2) → 3, (2,$) → 4, and (3,1) → 5. Note that the assignments respect the order ⊲ of the symbol τ-pairs and the natural order < of {0, 1, 2, 3, 4, 5}.
The τ-letters are substituted for the symbol τ-pairs, and the resulting string is terminated with $. This string is called the τ-reduction of x and denoted τ(x), and it is a $-terminated string over an integer alphabet. For our running example from Figure 2, τ(x) = 021534. The next lemma justifies calling the above transformation a reduction.
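The construction of τ(x) can be sketched as follows, assuming the tauPairStarts helper from the previous listing; std::sort stands in for the linear radix sort, and the sentinel $ is modelled by −1 so that it sorts below every alphabet symbol:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Builds tau(x) by ranking the symbol tau-pairs in the order ⊲.
    std::vector<int> tauReduce(const std::string& x, const std::vector<int>& black) {
        int n = static_cast<int>(x.size());
        auto sym = [&](int pos) { return pos == n + 1 ? -1 : static_cast<int>(x[pos - 1]); };
        std::vector<std::pair<int, int>> pairs;          // symbol tau-pairs in text order
        for (int b : black) pairs.push_back({sym(b), sym(b + 1)});
        std::vector<std::pair<int, int>> sorted(pairs);  // rank the distinct pairs
        std::sort(sorted.begin(), sorted.end());         // stand-in for the radix sort
        sorted.erase(std::unique(sorted.begin(), sorted.end()), sorted.end());
        std::vector<int> tau;
        for (const auto& p : pairs)
            tau.push_back(static_cast<int>(
                std::lower_bound(sorted.begin(), sorted.end(), p) - sorted.begin()));
        return tau;
    }
    // For x = "011023122" (digits as letters) this yields 0 2 1 5 3 4 = tau(x).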
Lemma 3.
For any string x, (1/2)|x| ≤ |τ(x)| ≤ (2/3)|x|.
Proof. 
There are two extreme cases: the first is when the τ-pairs do not overlap at all; then, |τ(x)| = (1/2)|x|. The second is when all the τ-pairs overlap; then, |τ(x)| = (2/3)|x|. Any other case must be in between. □
Let B(x) denote the set of all black positions of x. For any i ∈ 1..|τ(x)|, b(i) = j where j is the black position in x of the τ-pair corresponding to the new symbol in τ(x) at position i, while t(j) assigns to each black position of x the position in τ(x) where the corresponding new symbol is, i.e., b(t(j)) = j and t(b(i)) = i. Thus, t : B(x) → 1..|τ(x)| and b : 1..|τ(x)| → B(x) are mutually inverse bijections.
In addition, we define p as the mapping of the τ -pairs to the τ -alphabet.
In our running example from Figure 2, t(1) = 1, t(3) = 2, t(4) = 3, t(6) = 4, t(7) = 5, and t(9) = 6, while b(1) = 1, b(2) = 3, b(3) = 4, b(4) = 6, b(5) = 7, and b(6) = 9. For the letter mapping, we get p(1, 2) = 0, p(3, 4) = 2, p(4, 5) = 1, p(6, 7) = 5, p(7, 8) = 3, and p(9, 10) = 4.

3.3. Properties Preserved by τ -Reduction

The most important property of τ-reduction is the preservation of right-maximal Lyndon substrings of x that start at black positions. This means there is a closed formula that gives, for every right-maximal Lyndon substring of τ(x), a corresponding right-maximal Lyndon substring of x. Moreover, the formula for any black position can be computed in constant time. It is simpler to present the following results using L′, the alternative form of the Lyndon array, the one where the end positions of right-maximal Lyndon substrings are stored rather than their lengths. More formally:
Theorem 2.
Let x = x[1..n]; let L′_τ(x)[1..m] be the Lyndon array of τ(x); and let L′_x[1..n] be the Lyndon array of x.
Then, for any black i ∈ 1..n, L′_x[i] = b(r) if b(r) = n or x[b(r)+1] ⪯ x[i], and L′_x[i] = b(r)+1 otherwise, where r = L′_τ(x)[t(i)].
The proof of the theorem requires a series of lemmas that are presented below. First, we show that τ -reduction preserves the relationships of certain suffixes of x .
Lemma 4.
Let x = x[1..n], and let τ(x) = τ(x)[1..m]. Let i ≠ j and 1 ≤ i, j ≤ n. If i and j are both black positions, then x[i..n] ≺ x[j..n] implies τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
Proof. 
Since i and j are both black positions, both t(i) and t(j) are defined, and t(i) ≠ t(j). Let us assume that x[i..n] ≺ x[j..n]. The proof is argued in several cases determined by the nature of the relationship x[i..n] ≺ x[j..n].
(1)
Case: x[i..n] is a proper prefix of x[j..n].
Then, |x[i..n]| = n−i+1 < |x[j..n]| = n−j+1, and so, j < i. It follows that x[j..j+n−i] = x[i..n], and thus, x[i..n] is a border of x[j..n].
(1a)
Case: j+n−i is black.
Since n may be either black or white, we need to discuss two cases.
(1aα)
Case: n is white.
Since n is white, the last τ-pair of x must be (n−1, n). The τ-pairs of x[j..j+n−i] must be the same as the τ-pairs of x[i..n]; the last τ-pair of x[j..j+n−i] must be (j+n−i−1, j+n−i). Since j+n−i is black by our assumption (1a), the next τ-pair of x must be (j+n−i, j+n−i+1), as indicated in the following diagram:
[diagram not reproduced]
Thus, τ(x)[t(j)..t(j+n−i−1)] = τ(x)[t(i)..t(n−1)]. Since t(n−1) = m, we have τ(x)[t(j)..t(j+n−i−1)] = τ(x)[t(i)..m], and so, τ(x)[t(i)..m] is a proper prefix of τ(x)[t(j)..m], giving τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(1aβ)
Case: n is black.
Then, the last τ-pair of x must be (n, n+1), and hence, the corresponding τ-pair of the occurrence x[j..j+n−i] must be the last one there, so the next τ-pair is (j+n−i, j+n−i+1); since n−1 cannot be black when n is, the situation is as indicated in the following diagram:
[diagram not reproduced]
Thus, τ(x)[t(i)..t(n−2)] = τ(x)[t(j)..t(j+n−i−2)]. Since x[j+n−i] = x[n] and (x[n], x[n+1]) = (x[n], $), we have (x[j+n−i], x[j+n−i+1]) ⊳ (x[n], x[n+1]), and so, τ(x)[t(j+n−i)] > τ(x)[t(n)], giving τ(x)[t(j)..t(n)] ≻ τ(x)[t(i)..t(n)]. Since t(n) = m, we have τ(x)[t(j)..m] ≻ τ(x)[t(i)..m].
(1b)
Case: j+n−i is white.
Then, j+n−i−1 is black; hence, n−1 is black; so, n must also be white, and thus, τ(x)[t(j)..t(j+n−i−1)] = τ(x)[t(i)..t(n−1)], as indicated by the following diagram:
[diagram not reproduced]
Since t(n−1) = m, we have τ(x)[t(j)..t(j+n−i−1)] = τ(x)[t(i)..m], and so, τ(x)[t(i)..m] is a proper prefix of τ(x)[t(j)..m], giving τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(2)
Case: x[i] ≺ x[j] or (x[i] = x[j] and x[i+1] ≺ x[j+1]).
Then, (x[i], x[i+1]) ⊲ (x[j], x[j+1]), and so, τ(x)[t(i)] < τ(x)[t(j)], and thus, τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(3)
Case: for some ℓ ≥ 3, x[i..i+ℓ−1] = x[j..j+ℓ−1], while x[i+ℓ] ≺ x[j+ℓ].
First note that i+ℓ−2 and j+ℓ−2 are either both black, or both are white:
  • If i+ℓ−2 is white, then the τ-pairs (i, i+1), …, (i+ℓ−3, i+ℓ−2) of x[i..n] correspond one-to-one to the τ-pairs (j, j+1), …, (j+ℓ−3, j+ℓ−2) of x[j..n]. To determine what follows (i+ℓ−3, i+ℓ−2), we need to know the relationship between the values x[i+ℓ−3], x[i+ℓ−2], and x[i+ℓ−1]. Since x[i+ℓ−3] = x[j+ℓ−3], x[i+ℓ−2] = x[j+ℓ−2], and x[i+ℓ−1] = x[j+ℓ−1], the values x[j+ℓ−3], x[j+ℓ−2], and x[j+ℓ−1] have the same relationship, and thus, the τ-pair following (j+ℓ−3, j+ℓ−2) will be the "same" as the τ-pair following (i+ℓ−3, i+ℓ−2). Since i+ℓ−2 is white, the τ-pair following (i+ℓ−3, i+ℓ−2) is (i+ℓ−1, i+ℓ), and so, the τ-pair following (j+ℓ−3, j+ℓ−2) is (j+ℓ−1, j+ℓ), making j+ℓ−2 white as well.
  • If i+ℓ−2 is black, then the τ-pairs (i, i+1), …, (i+ℓ−2, i+ℓ−1) of x[i..n] correspond one-to-one to the τ-pairs (j, j+1), …, (j+ℓ−2, j+ℓ−1) of x[j..n]. It follows that j+ℓ−2 is black as well.
We proceed by discussing these two cases for the colours of i+ℓ−2 and j+ℓ−2.
(3a)
Case when i+ℓ−2 and j+ℓ−2 are both white.
Therefore, we have the τ-pairs (i, i+1), …, (i+ℓ−3, i+ℓ−2), (i+ℓ−1, i+ℓ) for x[i..n] that correspond one-to-one to the τ-pairs (j, j+1), …, (j+ℓ−3, j+ℓ−2), (j+ℓ−1, j+ℓ) for x[j..n]. It follows that τ(x)[t(i)..t(i+ℓ−3)] = τ(x)[t(j)..t(j+ℓ−3)], τ(x)[t(i+ℓ−1)] = p(i+ℓ−1, i+ℓ), and τ(x)[t(j+ℓ−1)] = p(j+ℓ−1, j+ℓ). Since x[i+ℓ−1] = x[j+ℓ−1] and, by our assumption (3), x[i+ℓ] ≺ x[j+ℓ], it follows that (x[i+ℓ−1], x[i+ℓ]) ⊲ (x[j+ℓ−1], x[j+ℓ]), giving p(i+ℓ−1, i+ℓ) < p(j+ℓ−1, j+ℓ), and so, τ(x)[t(i+ℓ−1)] < τ(x)[t(j+ℓ−1)]. Since t(i+ℓ−3)+1 = t(i+ℓ−1) and t(j+ℓ−3)+1 = t(j+ℓ−1), we have τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(3b)
Case when i+ℓ−2 and j+ℓ−2 are both black.
Therefore, we have the τ-pairs (i, i+1), …, (i+ℓ−2, i+ℓ−1) for x[i..n] that correspond one-to-one to the τ-pairs (j, j+1), …, (j+ℓ−2, j+ℓ−1) for x[j..n]. It follows that τ(x)[t(i)..t(i+ℓ−2)] = τ(x)[t(j)..t(j+ℓ−2)].
We need to discuss the four cases based on the colours of i+ℓ−1 and j+ℓ−1.
(3bα)
Both i+ℓ−1 and j+ℓ−1 are black.
It follows that the next τ-pair for x[i..n] is (i+ℓ−1, i+ℓ), and the next τ-pair for x[j..n] is (j+ℓ−1, j+ℓ). It follows that t(i+ℓ−2)+1 = t(i+ℓ−1) and t(j+ℓ−2)+1 = t(j+ℓ−1). Hence, τ(x)[t(i+ℓ−2)+1] = p(i+ℓ−1, i+ℓ) and τ(x)[t(j+ℓ−2)+1] = p(j+ℓ−1, j+ℓ). Since x[i+ℓ−1] = x[j+ℓ−1] and, by Assumption (3), x[i+ℓ] ≺ x[j+ℓ], we have (x[i+ℓ−1], x[i+ℓ]) ⊲ (x[j+ℓ−1], x[j+ℓ]), and so, p(i+ℓ−1, i+ℓ) < p(j+ℓ−1, j+ℓ), giving us τ(x)[t(i+ℓ−2)+1] < τ(x)[t(j+ℓ−2)+1]. It follows that τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(3bβ)
i+ℓ−1 is white, and j+ℓ−1 is black.
It follows that the next τ-pair for x[i..n] is (i+ℓ, i+ℓ+1), and the next τ-pair for x[j..n] is (j+ℓ−1, j+ℓ). It follows that t(i+ℓ−2)+1 = t(i+ℓ), while t(j+ℓ−2)+1 = t(j+ℓ−1). Thus, τ(x)[t(i+ℓ−2)+1] = p(i+ℓ, i+ℓ+1) and τ(x)[t(j+ℓ−2)+1] = p(j+ℓ−1, j+ℓ). Since j+ℓ−1 is black, we know that x[j+ℓ−2] ≻ x[j+ℓ−1] and x[j+ℓ−1] ≺ x[j+ℓ]. Since x[i+ℓ−2] = x[j+ℓ−2] and x[i+ℓ−1] = x[j+ℓ−1], we have x[i+ℓ−2] ≻ x[i+ℓ−1], and so, x[i+ℓ−1] ⪰ x[i+ℓ], as otherwise, i+ℓ−1 would be black. This gives us x[j+ℓ−1] ⪰ x[i+ℓ]. Thus, (x[i+ℓ], x[i+ℓ+1]) ⊲ (x[j+ℓ−1], x[j+ℓ]), giving p(i+ℓ, i+ℓ+1) < p(j+ℓ−1, j+ℓ) and, ultimately, τ(x)[t(i+ℓ)] < τ(x)[t(j+ℓ−1)]. The last step is to realize that t(i+ℓ−2)+1 = t(i+ℓ) and t(j+ℓ−2)+1 = t(j+ℓ−1), which gives us τ(x)[t(i+ℓ−2)+1] < τ(x)[t(j+ℓ−2)+1]. It follows that τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(3bγ)
i+ℓ−1 is black, and j+ℓ−1 is white.
It follows that the next τ-pair for x[i..n] is (i+ℓ−1, i+ℓ), and the next τ-pair for x[j..n] is (j+ℓ, j+ℓ+1). It follows that t(i+ℓ−2)+1 = t(i+ℓ−1), while t(j+ℓ−2)+1 = t(j+ℓ). Thus, τ(x)[t(i+ℓ−2)+1] = p(i+ℓ−1, i+ℓ) and τ(x)[t(j+ℓ−2)+1] = p(j+ℓ, j+ℓ+1). Since i+ℓ−1 is black, we know that x[i+ℓ−2] ≻ x[i+ℓ−1] and x[i+ℓ−1] ≺ x[i+ℓ] ≺ x[j+ℓ], where the last inequality is our Assumption (3). Therefore, x[j+ℓ−1] = x[i+ℓ−1] ≺ x[j+ℓ]. Thus, (x[i+ℓ−1], x[i+ℓ]) ⊲ (x[j+ℓ], x[j+ℓ+1]), giving p(i+ℓ−1, i+ℓ) < p(j+ℓ, j+ℓ+1), τ(x)[t(i+ℓ−1)] < τ(x)[t(j+ℓ)], and ultimately, τ(x)[t(i+ℓ−2)+1] = τ(x)[t(i+ℓ−1)] < τ(x)[t(j+ℓ)] = τ(x)[t(j+ℓ−2)+1]. It follows that τ(x)[t(i)..m] ≺ τ(x)[t(j)..m].
(3bδ)
Both i+ℓ−1 and j+ℓ−1 are white.
Then, the next τ-pair for x[i..n] is (i+ℓ, i+ℓ+1), and the next τ-pair for x[j..n] is (j+ℓ, j+ℓ+1). It follows that t(i+ℓ−2)+1 = t(i+ℓ), while t(j+ℓ−2)+1 = t(j+ℓ). Thus, τ(x)[t(i+ℓ−2)+1] = p(i+ℓ, i+ℓ+1) and τ(x)[t(j+ℓ−2)+1] = p(j+ℓ, j+ℓ+1). Since x[i+ℓ−1] = x[j+ℓ−1] and, by our Assumption (3), x[i+ℓ] ≺ x[j+ℓ], we get (x[i+ℓ], x[i+ℓ+1]) ⊲ (x[j+ℓ], x[j+ℓ+1]), giving p(i+ℓ, i+ℓ+1) < p(j+ℓ, j+ℓ+1), τ(x)[t(i+ℓ)] < τ(x)[t(j+ℓ)], and ultimately, τ(x)[t(i+ℓ−2)+1] = τ(x)[t(i+ℓ)] < τ(x)[t(j+ℓ)] = τ(x)[t(j+ℓ−2)+1]. It follows that τ(x)[t(i)..m] ≺ τ(x)[t(j)..m]. □
Lemma 5 shows that τ -reduction preserves the proto-Lyndon property of certain proto-Lyndon substrings of x .
Lemma 5.
Let x = x[1..n], and let τ(x) = τ(x)[1..m]. Let 1 ≤ i < j ≤ n. Let x[i..j] be a proto-Lyndon substring of x, and let i be a black position.
Then, τ(x)[t(i)..t(j)] is proto-Lyndon if j is black, and τ(x)[t(i)..t(j−1)] is proto-Lyndon if j is white.
Proof. 
Let us first assume that j is black.
Since both i and j are black, t(i) and t(j) are defined. Let i_1 = t(i), j_1 = t(j), and consider k_1 so that i_1 < k_1 ≤ j_1. Let k = b(k_1). Then, t(k) = k_1 and i < k ≤ j, and so, x[i..n] ≺ x[k..n] by Lemma 1 as x[i..j] is proto-Lyndon. It follows that τ(x)[t(i)..m] ≺ τ(x)[t(k)..m] by Lemma 4. Thus, τ(x)[i_1..m] ≺ τ(x)[k_1..m] for any i_1 < k_1 ≤ j_1, and so, τ(x)[i_1..j_1] is proto-Lyndon by Lemma 1.
Now, let us assume that j is white.
Then, j−1 is black, and x[i..j−1] is proto-Lyndon, so as in the previous case, τ(x)[t(i)..t(j−1)] is proto-Lyndon. □
Now, we can show that τ -reduction preserves some right-maximal Lyndon substrings.
Lemma 6.
Let x = x[1..n], and let τ(x) = τ(x)[1..m]. Let 1 ≤ i < j ≤ n. Let x[i..j] be a right-maximal Lyndon substring, and let i be a black position.
Then, τ(x)[t(i)..t(j)] is a right-maximal Lyndon substring if j is black, and τ(x)[t(i)..t(j−1)] is a right-maximal Lyndon substring if j is white.
Proof. 
Since x[i..j] is Lyndon and hence proto-Lyndon, by Lemma 5, we know that τ(x)[t(i)..t(j)] is proto-Lyndon for j black, while for white j, τ(x)[t(i)..t(j−1)] is proto-Lyndon. Thus, in order to conclude that the respective strings are right-maximal Lyndon substrings, we only need to prove that the property (c) of Lemma 1 holds in both cases.
Since x[i..j] is right-maximal Lyndon, either j = n or x[j+1..n] ≺ x[i..n] by Lemma 1, giving j = n or x[j+1] ⪯ x[i]. Since x[i..j] is Lyndon and hence unbordered, x[i] ≺ x[j]. Thus, either j = n or x[j+1] ⪯ x[i] ≺ x[j].
If j = n, then there are two simple cases. If n is white, then n−1 is black and m = t(n−1), so t(j−1) = m, giving us (c) of Lemma 1 for τ(x)[t(i)..t(j−1)]. On the other hand, if n is black, then m = t(n), and so, m = t(j), giving us (c) of Lemma 1 for τ(x)[t(i)..t(j)].
Thus, in the following, we can assume that j < n and that x[j+1] ⪯ x[i] ≺ x[j]. We will proceed by discussing two possible cases, one where j is black and the other where j is white.
(1)
Case: j is black.
We need to show that either t(j) = m or τ(x)[t(i)..m] ≻ τ(x)[t(j)+1..m].
If j = n, then t(j) = m, and we are done. Thus, we can assume that j < n. We must show that τ(x)[t(j)+1..m] ≺ τ(x)[t(i)..m].
(1a)
Case: x[j+1] ≺ x[j+2].
Then, x[j] ≻ x[j+1] and x[j+1] ≺ x[j+2], and so, j+1 is black. It follows that t(j)+1 = t(j+1). By Lemma 4, τ(x)[t(j+1)..m] ≺ τ(x)[t(i)..m] because x[j+1..n] ≺ x[i..n]; thus, τ(x)[t(j)+1..m] ≺ τ(x)[t(i)..m].
(1b)
Case: x[j+1] ⪰ x[j+2].
Then, x[j] ≻ x[i] ⪰ x[j+1] ⪰ x[j+2]. It follows that the τ-pair (j, j+1) is followed by the τ-pair (j+2, j+3), and thus, t(j)+1 = t(j+2). Thus,
(x[j+2], x[j+3]) ⊲ (x[i], x[i+1]) ⊲ (x[j], x[j+1]); hence,
p(j+2, j+3) < p(i, i+1) < p(j, j+1). Since τ(x)[t(j)+1] = p(j+2, j+3), τ(x)[t(i)] = p(i, i+1), and τ(x)[t(j)] = p(j, j+1), it follows that τ(x)[t(j)+1] < τ(x)[t(i)], and so, τ(x)[t(j)+1..m] ≺ τ(x)[t(i)..m].
(2)
Case: j is white.
We need to prove that τ(x)[t(j−1)+1..m] ≺ τ(x)[t(i)..m]. Since j is white, necessarily both j−1 and j+1 are black and t(j−1)+1 = t(j+1). By Lemma 5, τ(x)[t(i)..t(j−1)] is proto-Lyndon as both i and j−1 are black and x[i..j−1] is proto-Lyndon. Since x[j+1..n] ≺ x[i..n] and both i and j+1 are black, by Lemma 4, we get τ(x)[t(i)..m] ≻ τ(x)[t(j+1)..m] = τ(x)[t(j−1)+1..m]. □
Now, we are ready to tackle the proof of Theorem 2.
Proof of Theorem 2.
Let L′_x[i] = j where i is black. Then, t(i) is defined, and x[i..j] is a right-maximal Lyndon substring of x. We proceed by analysis of the two possible cases of the label for the position j. Let (*) denote the condition from the theorem, i.e.,
(*)
b(L′_τ(x)[t(i)]) = n or x[b(L′_τ(x)[t(i)])+1] ⪯ x[i]
(1)
Case: j is black.
Then, by Lemma 6, τ(x)[t(i)..t(j)] is a right-maximal Lyndon substring of τ(x); hence, L′_τ(x)[t(i)] = t(j). Therefore, b(L′_τ(x)[t(i)]) = b(t(j)) = j = L′_x[i]. We have to also prove that the condition (*) holds.
If j = n, then the condition (*) holds. Therefore, assume that j < n. Since x[i..j] is right-maximal, by Lemma 1, x[j+1..n] ≺ x[i..n], and so, x[j+1] ⪯ x[i]. Then, x[b(L′_τ(x)[t(i)])+1] = x[b(t(j))+1] = x[j+1] ⪯ x[i].
(2)
Case: j is white.
Then, j−1 is black, and τ(x)[t(j−1)] = p(j−1, j). By Lemma 6, τ(x)[t(i)..t(j−1)] is a right-maximal Lyndon substring of τ(x); hence, L′_τ(x)[t(i)] = t(j−1), so b(L′_τ(x)[t(i)]) = b(t(j−1)) = j−1, giving b(L′_τ(x)[t(i)])+1 = j.
We want to show that the condition (*) does not hold.
If b(L′_τ(x)[t(i)]) = n, then j−1 = n, which is impossible as j ≤ n. Since x[i..j] is Lyndon, x[i] ≺ x[j] = x[b(L′_τ(x)[t(i)])+1]. Thus, Condition (*) does not hold. □

3.4. Computing L′_x from L′_τ(x)

Theorem 2 indicates how to compute the partial L′_x from L′_τ(x). The procedure is given in Figure 3.
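The expansion itself is a single pass over the black positions. A sketch under the stated conventions follows; the arrays black and Lt are assumed to come from the τ-pairing and the recursive call, respectively, with 1-based values, and 0 plays the role of nil (this illustrates the formula of Theorem 2, not the code of Figure 3):

    #include <string>
    #include <vector>

    // Partial L'x from L'_tau(x): for each black i, r = L'_tau(x)[t(i)],
    // L'x[i] = b(r) if b(r) = n or x[b(r)+1] <= x[i], and b(r)+1 otherwise.
    std::vector<int> expandToPartial(const std::string& x, const std::vector<int>& black,
                                     const std::vector<int>& Lt) {
        int n = static_cast<int>(x.size());
        std::vector<int> Lx(n + 1, 0);                    // entries 1..n; 0 = unknown
        for (std::size_t q = 0; q < black.size(); ++q) {  // q+1 = t(i) for i = black[q]
            int i = black[q];
            int br = black[Lt[q] - 1];                    // b(r)
            Lx[i] = (br == n || x[br] <= x[i - 1]) ? br : br + 1;
        }
        return Lx;
    }
    // For the running example x = "011023122" with Lt = 6,2,6,4,6,6 this fills
    // positions 1,3,4,6,7,9 with 9,3,9,6,9,9, matching the text below.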
To compute the missing values, the partial array is processed from right to left. When a missing value at position i is encountered (note that it is recognized by L′_x[i] = nil), the Lyndon array L′_x[i+1..n] is completely filled, and also, L′_x[i−1] is known. Recall that L′_x[i+1] is the ending position of the right-maximal Lyndon substring starting at the position i+1. In several cases, we can determine the value of L′_x[i] in constant time:
(1)
if i = n, then L′_x[i] = i.
(2)
if x[i] ≻ x[i+1], then L′_x[i] = i.
(3)
if x[i] = x[i+1] and L′_x[i+1] = i+1 and either i+1 = n or i+1 = L′_x[i−1], then L′_x[i] = i.
(4)
if x[i] ≺ x[i+1] and L′_x[i+1] = i+1 and either i+1 = n or i+1 = L′_x[i−1], then L′_x[i] = i+1.
(5)
if x[i] ≺ x[i+1] and L′_x[i+1] > i+1 and either L′_x[i+1] = n or L′_x[i+1] = L′_x[i−1], then L′_x[i] = L′_x[i+1].
We call such points easy. All others will be referred to as hard. For a hard point i, it means that x[i] is followed by at least two consecutive right-maximal Lyndon substrings before reaching either L′_x[i−1] or n, and we might need to traverse them all.
The while loop, seen in Figure 4's procedure, is the likely cause of the O(n log(n)) complexity. At first glance, it may seem that the complexity might be O(n^2); however, the doubling of the length of the string when a hard point is introduced actually trims it down to an O(n log(n)) worst-case complexity. See Section 3.5 for more details and Section 7 for the measurements and graphs.
Consider our running example from Figure 2. Since τ(x) = 021534, we have L′_τ(x)[1..6] = 6, 2, 6, 4, 6, 6, giving the partial L′_x[1..9] = 9, −, 3, 9, −, 6, 9, −, 9. Computing L′_x[8] is easy as x[8] = x[9], and so, L′_x[8] = 8. L′_x[5] is more complicated and an example of a hard point: we can extend the right-maximal Lyndon substring from L′_x[6] to the left to 23, but no more, so L′_x[5] = 6. Computing L′_x[2] is again easy as x[2] = x[3], and so, L′_x[2] = 2. Thus, L′_x[1..9] = 9, 2, 3, 9, 6, 6, 9, 8, 9.

3.5. The Complexity of TRLA

To determine the complexity of the algorithm, we attach to each position i a counter red[i] initialized to zero. Imagine a hard point j indicated by the following diagram:
[diagram not reproduced]
A_1 represents the right-maximal Lyndon substring starting at the position j+1; A_2 represents the right-maximal Lyndon substring immediately following A_1, and so forth. To make j a hard point, r ≥ 2 and x[j] ⪯ x[j+1]. The value of stop is determined by:
stop = L′_x[j−1] if L′_x[j−1] > j−1, and stop = n otherwise.
To determine the right-maximal Lyndon substring starting at the hard position j, we first need to check whether A_1 can be left-extended by x[j] to make jA_1 Lyndon; we are using the abbreviated notation jA_1 for the substring x[j..k] where A_1 = x[j+1..k]; in simple words, jA_1 represents the left-extension of A_1 by one position. If jA_1 is proto-Lyndon, we have to check whether A_2 can be left-extended by jA_1 to a Lyndon substring. If jA_1A_2 is Lyndon, we must continue until we check whether jA_1A_2⋯A_{r−1} is Lyndon. If so, we must check whether jA_1⋯A_r is Lyndon. We need not go beyond stop.
How do we check whether jA_1⋯A_k can left-extend A_{k+1} to a Lyndon substring? If jA_1⋯A_k ≻ A_{k+1}, we can stop, and jA_1⋯A_k is the right-maximal Lyndon substring starting at position j. If jA_1⋯A_k ≺ A_{k+1}, we need to continue. Since stop is the last position of the right-maximal Lyndon substring at the position j−1 or n, we are assured to stop there. When comparing the substring jA_1⋯A_k with A_{k+1}, we increment the counter red[i] at every position i of A_{k+1} used in the comparison. When done with the whole array, the value of red[i] represents how many times i was used in various comparisons, for any position i.
Consider a position i that was used k times for k ≥ 4, i.e., red[i] = k. In the next four diagrams and related text, the upper indices of A and C do not represent powers; they are just indices. The next diagram indicates the configuration when the counter red[i] was incremented for the first time in the comparison of j_1 A^1_1 ⋯ A^1_{r_1−1} and A^1_{r_1} during the computation of the missing value L′_x[j_1] where:
stop_1 = L′_x[j_1−1] if L′_x[j_1−1] > j_1−1, and stop_1 = n otherwise.
[diagram not reproduced]
The next diagram indicates the configuration when the counter red[i] was incremented for the second time in the comparison of j_2 A^2_1 ⋯ A^2_{r_2−1} and A^2_{r_2} during the computation of the missing value L′_x[j_2] where:
stop_2 = L′_x[j_2−1] if L′_x[j_2−1] > j_2−1, and stop_2 = n otherwise.
[diagram not reproduced]
The next diagram indicates the configuration when the counter red[i] was incremented for the third time in the comparison of j_3 A^3_1 ⋯ A^3_{r_3−1} and A^3_{r_3} during the computation of the missing value L′_x[j_3] where:
stop_3 = L′_x[j_3−1] if L′_x[j_3−1] > j_3−1, and stop_3 = n otherwise.
[diagram not reproduced]
The next diagram indicates the configuration when the counter red[i] was incremented for the fourth time in the comparison of j_4 A^4_1 ⋯ A^4_{r_4−1} and A^4_{r_4} during the computation of the missing value L′_x[j_4] where:
stop_4 = L′_x[j_4−1] if L′_x[j_4−1] > j_4−1, and stop_4 = n otherwise.
[diagram not reproduced]
And so forth until the k-th increment of red[i]. Thus, if red[i] = k, then n ≥ 2^{k−1}(n_1+1) ≥ 2^k as n_1+1 ≥ 2. Thus, n ≥ 2^k, and so, k ≤ log(n). Thus, either k < 4 or k ≤ log(n). Therefore, the overall complexity is O(n log(n)).
To show that the average-case complexity is linear, we first recall that the overall complexity of TRLA is determined by the procedure filling the missing values. We showed above that there are at most log(n) missing values (hard positions) that cannot be determined in constant time. We overestimate the number of strings of length n over an alphabet of size Σ, 2 ≤ Σ ≤ n, which will force a non-linear computation, by assuming that every possible subset of log(n) indices with any possible letter assignment forces the worst performance. Thus, there are Σ^n − C(n, log(n)) Σ^{log(n)} strings that are processed in linear time, say with a constant K_1, and there are C(n, log(n)) Σ^{log(n)} strings that are processed in the worst time, with a constant K_2, where C(n, log(n)) denotes the binomial coefficient. Let K = max(K_1, K_2). Then, the average time is bounded by:
( (Σ^n − C(n, log(n)) Σ^{log(n)}) K n + C(n, log(n)) Σ^{log(n)} K n log(n) ) / Σ^n
= K n + K n (C(n, log(n)) Σ^{log(n)} / Σ^n) (log(n) − 1)
≤ K n + K n (C(n, log(n)) Σ^{log(n)} / Σ^n) log(n)
≤ K n + K n (n^{log(n)} Σ^{log(n)} log(n)) / (log(n)! Σ^n)
≤ K n + K n (n^{2 log(n)} / 2^n)
≤ K n + K n = 2 K n
for n ≥ 2^7. The last step follows from the fact that n^{2 log(n)} ≤ 2^n for any n ≥ 2^7.
The combinatorics of the processing is too complicated to ascertain whether the worst-case complexity is linear or not. We tried to generate strings that might give the worst performance. We used three different formulas to generate the strings, nesting the white indices that might require non-constant computation: the dataset extreme_trla of binary strings is created using the recursive formula u_{k+1} = 00 u_k 0 u_k, using the first 100 shortest binary Lyndon strings as the start u_0. The moment the size of u_k exceeds the required length of the string, the recursion stops, and the string is trimmed to the required length. For the extreme_trla1 dataset, we used the same approach with the formula u_{k+1} = 000 u_k 00 u_k, and for the extreme_trla2 dataset, we used the formula u_{k+1} = 0000 u_k 00 u_k.
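A sketch of the generator for extreme_trla follows (the other two datasets differ only in the padding of zeros):

    #include <string>

    // Nest the seed u_0 via u_{k+1} = 00 u_k 0 u_k until the requested length
    // is exceeded, then trim.
    std::string extremeTrla(std::string u, std::size_t len) {
        while (u.size() < len) u = "00" + u + "0" + u;
        return u.substr(0, len);
    }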
The space complexity of our C++ implementation is bounded by 9n integers. This upper bound is derived from the fact that a Tau object (see Tau.hpp [24]) requires 3n integers of space for a string of length n. Therefore, the first call to TRLA requires 3n, the next recursive call at most 3(2/3)n, the next recursive call at most 3(2/3)^2 n, and so on; thus, 3n + 3(2/3)n + 3(2/3)^2 n + 3(2/3)^3 n + ⋯ = 3n(1 + 2/3 + (2/3)^2 + (2/3)^3 + ⋯) = 3n · 1/(1 − 2/3) = 9n. However, it should be possible to bring it down to 6n integers.

4. The Algorithm BSLA

The purpose of this section is to present a linear algorithm BSLA for computing the Lyndon array of a string over an integer alphabet. The algorithm is based on a series of refinements of a list of groups of indices of the input string. The refinement is driven by a group that is already complete, and the refinement process makes the immediately preceding group also complete. In turn, this newly completed group is used as the driver of the next round of the refinement. In this fashion, the refinement proceeds from right to left until all the groups in the list are complete. The initial list of groups consists of the groups of indices with the same alphabet symbol. The section contains proper definitions of all these terms—group, complete group, and refinement. In the process of refinement, each newly created group is assigned a specific substring of the input string referred to as the context of the group. Throughout the process, the list of the groups is maintained in an increasing lexicographic order by their contexts. Moreover, at every stage, the contexts of all the groups are Lyndon substrings of x with an additional property that the contexts of the complete groups are right-maximal Lyndon substrings. Hence, when the refinement is completed, the contexts of all the groups in the list represent all the right-maximal Lyndon substrings of x . The mathematics of the process of refinement is necessary in order to ascertain its correctness and completeness and to determine the worst-case complexity of the algorithm.
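As a concrete starting point, the initial group configuration can be sketched as follows; this illustrates only the initialization described above (one group per distinct letter, the contexts being the single letters), not the refinement itself, whose actual implementation is available in [24]:

    #include <map>
    #include <string>
    #include <vector>

    // Groups of 1-based positions sharing the same letter, returned in
    // increasing order of their single-letter contexts (G_k first, G_1 last).
    std::vector<std::vector<int>> initialGroups(const std::string& x) {
        std::map<char, std::vector<int>> byLetter;   // std::map keeps letters ordered
        for (int i = 1; i <= static_cast<int>(x.size()); ++i)
            byLetter[x[i - 1]].push_back(i);
        std::vector<std::vector<int>> groups;
        for (const auto& entry : byLetter) groups.push_back(entry.second);
        return groups;
    }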

4.1. Notation and Basic Notions of BSLA

For the sake of simplicity, we fix a string x = x[1..n] for the whole Section 4.1; all the definitions and the observations apply and refer to this x.
A group G is a non-empty set of indices of x. The group G is assigned a context, i.e., a substring con(G) of x with the property that for any i ∈ G, x[i..i+|con(G)|−1] = con(G). If i ∈ G, then C(i) denotes the occurrence of the context of G at the position i, i.e., the substring C(i) = x[i..i+|con(G)|−1]. We say that a group G′ is smaller than or precedes a group G″ if con(G′) ≺ con(G″).
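For concreteness, a group and its context can be represented, for instance, as follows (a simplified, hypothetical layout; the actual objects in [24] differ):

```cpp
#include <vector>

// A group: a non-empty set of indices of x sharing a common context.
// The context con(G) is stored implicitly as (start, len), i.e., the
// witness occurrence x[start..start+len-1]; every i in indices then
// satisfies x[i..i+len-1] == x[start..start+len-1].
struct Group {
    std::vector<int> indices;  // the indices i of x belonging to G
    int start = 0;             // start of a witness occurrence of con(G)
    int len = 0;               // |con(G)|
};
```

Storing the context as a position-length pair keeps the per-group overhead constant, which matters for the overall linear space bound.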
Definition 1.
An ordered list of groups G_k, G_{k−1}, …, G_2, G_1 is a group configuration if:
(C1)
G_k ∪ G_{k−1} ∪ ⋯ ∪ G_2 ∪ G_1 = 1..n;
(C2)
G_j ∩ G_ℓ = ∅ for any 1 ≤ ℓ < j ≤ k;
(C3)
con(G_k) ≺ con(G_{k−1}) ≺ ⋯ ≺ con(G_2) ≺ con(G_1);
(C4)
for any j ∈ 1..k, con(G_j) is a Lyndon substring of x.
Note that (C1) and (C2) guarantee that G_k, G_{k−1}, …, G_2, G_1 is a disjoint partitioning of 1..n. For i ∈ {1, …, n}, gr(i) denotes the unique group to which i belongs, i.e., if i ∈ G_t, then gr(i) = G_t. Note that using this notation, C(i) = x[i..i+|con(gr(i))|−1].
The mapping prev is defined by prev(i) = max { j < i | con(gr(j)) ≺ con(gr(i)) } if such a j exists; otherwise, prev(i) = nil.
For a group G from a group configuration, we define an equivalence ∼ on G as follows: i ∼ j iff gr(prev(i)) = gr(prev(j)) or prev(i) = prev(j) = nil. The symbol [i]∼ denotes the class of the equivalence ∼ that contains i, i.e., [i]∼ = { j ∈ G | j ∼ i }. If prev(i) = nil, then the class [i]∼ is called trivial. An interesting observation states that if G is viewed as an ordered set of indices, then a non-trivial [i]∼ is an interval:
Observation 7.
Let G be a group from a group configuration for x. Consider an i ∈ G such that prev(i) ≠ nil. Let j_1 = min [i]∼ and j_2 = max [i]∼. Then, [i]∼ = { j ∈ G | j_1 ≤ j ≤ j_2 }.
Proof. 
Consider j ∈ G with j_1 ≤ j ≤ j_2. Since prev(j_1) is a candidate to be prev(j), prev(j) ≠ nil and prev(j_1) ≤ prev(j) ≤ prev(j_2) = prev(j_1), so prev(j) = prev(j_1) = prev(j_2). □
On each non-trivial class of ∼, we define a relation ≈ as follows: i ≈ j iff |j − i| = |con(G)|; in simple terms, it means that the occurrence C(i) of con(G) is immediately followed by the occurrence C(j) of con(G), or vice versa. The transitive closure of ≈ is a relation of equivalence, which we also denote by ≈. The symbol [i]≈ denotes the class of the equivalence ≈ containing i, i.e., [i]≈ = { j ∈ [i]∼ | j ≈ i }.
For each j from a non-trivial [i]∼, we define the valence by val(j) = |[j]≈|. In simple terms, val(j) is the number of elements of [j]∼ that are ≈ j. Thus, 1 ≤ val(j) ≤ |G|. For example, for x = 0111 and G the group of the indices of the letter 1 (with con(G) = 1), prev(2) = prev(3) = prev(4) = 1, so 2 ∼ 3 ∼ 4, 2 ≈ 3 ≈ 4, and val(2) = val(3) = val(4) = 3.
Interestingly, if G is viewed as an ordered set of indices, then [i]≈ is a subinterval of the interval [i]∼:
Observation 8.
Let G be a group from a group configuration for x. Consider an i ∈ G such that prev(i) ≠ nil. Let j_1 = min [i]≈ and j_2 = max [i]≈. Then, [i]≈ = { j ∈ [i]∼ | j_1 ≤ j ≤ j_2 }.
Proof. 
We argue by contradiction. Assume that there is a j ∈ [i]∼ so that j_1 < j < j_2 and j ∉ [i]≈. Take the minimal such j. Consider j′ = j − |con(G)|. Then, j′ ∈ [i]∼, and since j′ < j, j′ ∈ [i]≈ due to the minimality of j. Therefore, i ≈ j′ ≈ j, and so, j ∈ [i]≈, a contradiction. □
Definition 2.
A group G is complete if for any i ∈ G, the occurrence C(i) of con(G) is a right-maximal Lyndon substring of x.
A group configuration G_k, G_{k−1}, …, G_2, G_1 is t-complete, 1 ≤ t ≤ k, if:
(C5)
the groups G_t, …, G_1 are complete;
(C6)
the mapping prev is proper on G_t:
for any i ∈ G_t, if prev(i) ≠ nil and v = val(i), then there are i_1, …, i_v ∈ G_t with i ∈ {i_1, …, i_v} and j = prev(i) = prev(i_1) = ⋯ = prev(i_v), so that C(j) C(i_1) ⋯ C(i_v) is a prefix of x[j..n];
(C7)
the family { C(i) | i ∈ 1..n } is proper:
(a) if C(j) is a proper substring of C(i), i.e., C(j) ⊊ C(i), then con(G_t) ≺ con(gr(j)),
(b) if C(i) is followed immediately by C(j), i.e., when i + |con(gr(i))| = j, and C(i) ≺ C(j), then con(gr(j)) ⪯ con(G_t);
(C8)
the family { C(i) | i ∈ 1..n } has the Monge property, i.e., if C(i) ∩ C(j) ≠ ∅, then C(i) ⊆ C(j) or C(j) ⊆ C(i).
The condition (C6) is all-important for carrying out the refinement process (see (R3) below). The conditions (C7) and (C8) are necessary for asserting that the condition (C6) is preserved during the refinement process.

4.2. The Refinement

For the sake of simplicity, we fix a string x = x[1..n] for the whole Section 4.2; all the definitions, lemmas, and theorems apply and refer to this x.
Lemma 9.
Let A_x = {a_1, …, a_k} with a_1 ≺ a_2 ≺ ⋯ ≺ a_k. For 1 ≤ ℓ ≤ k, define G_ℓ = { i ∈ 1..n | x[i] = a_{k−ℓ+1} } with context a_{k−ℓ+1}. Then, G_k, …, G_1 is a one-complete group configuration.
Proof. 
(C1), (C2), (C3), and (C4) are straightforward to verify. To verify (C5), we need to show that G_1 is complete. Any occurrence of a_k in x is a right-maximal Lyndon substring, so G_1 is complete.
To verify (C6), consider j = prev(i) and val(i) = v for i ∈ G_1. Consider any r such that j < r < i. If x[r] ≺ a_k, then prev(i) ≥ r > j, which contradicts the definition of prev. Hence, x[r] = a_k, and so, x[j+1] = ⋯ = x[j+v] = a_k with i ∈ {j+1, …, j+v}, while x[j] = a_q for some q < k. It follows that x[j..n] has a_q (a_k)^v as a prefix.
The condition (C7(a)) is trivially satisfied, as no C(i) can have a proper substring. If C(i) is immediately followed by C(j) and C(i) ≺ C(j), then C(i) = x[i], j = i + 1, C(j) = x[i+1], and x[i] ≺ x[i+1]. Then, con(gr(j)) = x[i+1] ⪯ a_k = con(G_1), so (C7(b)) is also satisfied.
To verify (C8), consider C(i) ∩ C(j) ≠ ∅. Then, C(i) = x[i] = x[j] = C(j). □
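A minimal sketch of building this initial configuration (std::map keeps the letters sorted for simplicity; the implementation in [24] buckets over an integer alphabet instead):

```cpp
#include <map>
#include <string>
#include <vector>

// One group per distinct letter, each holding all the positions of that
// letter; its one-letter context is the letter itself. The groups come
// out ordered by increasing letter, i.e., G_k (smallest) ... G_1 (largest).
std::vector<std::vector<int>> initial_groups(const std::string& x) {
    std::map<char, std::vector<int>> byLetter;
    for (int i = 0; i < (int)x.size(); ++i)
        byLetter[x[i]].push_back(i);
    std::vector<std::vector<int>> groups;
    for (auto& entry : byLetter)
        groups.push_back(std::move(entry.second));
    return groups;  // groups.front() = G_k, groups.back() = G_1
}
```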
Let G_k, …, G_t, …, G_1 be a t-complete group configuration. The refinement is driven by the group G_t, and it might only partition the groups that precede it, i.e., the groups G_k, …, G_{t+1}, while the groups G_t, …, G_1 remain unchanged.
(R1)
Partition G_t into the classes of the equivalence ∼:
G_t = [i_1]∼ ∪ [i_2]∼ ∪ ⋯ ∪ [i_p]∼ ∪ X, where X = { i ∈ G_t | prev(i) = nil } may possibly be empty and i_1 < i_2 < ⋯ < i_p.
(R2)
Partition every class [i_ℓ]∼, 1 ≤ ℓ ≤ p, into the classes of the equivalence ≈:
[i_ℓ]∼ = [j_{ℓ,1}]≈ ∪ [j_{ℓ,2}]≈ ∪ ⋯ ∪ [j_{ℓ,m_ℓ}]≈, where val(j_{ℓ,1}) < val(j_{ℓ,2}) < ⋯ < val(j_{ℓ,m_ℓ}).
(R3)
Therefore, we have a list of classes in this order: [j_{1,1}]≈, [j_{1,2}]≈, …, [j_{1,m_1}]≈, [j_{2,1}]≈, [j_{2,2}]≈, …, [j_{2,m_2}]≈, …, [j_{p,1}]≈, [j_{p,2}]≈, …, [j_{p,m_p}]≈. This list is processed from left to right. Note that for each i ∈ [j_{ℓ,k}]≈, gr(prev(i)) = gr(prev(j_{ℓ,k})) and val(i) = val(j_{ℓ,k}).
For each j_{ℓ,k}, move all the elements { prev(i) | i ∈ [j_{ℓ,k}]≈ } from the group gr(prev(j_{ℓ,k})) into a new group H, place H in the list of groups right after the group gr(prev(j_{ℓ,k})), and set its context to con(gr(prev(j_{ℓ,k}))) con(gr(j_{ℓ,k}))^{val(j_{ℓ,k})}. (Note that this "doubling of the contexts" is possible due to (C6); a small sketch of the concatenation follows the list below.) Then, update prev:
  • All the values of prev are correct except possibly the values of prev for the indices from H. It may be the case that for i ∈ H, there is i′ ∈ gr(prev(j_{ℓ,k})) with prev(i) < i′ < i, so prev(i) must be reset to the maximal such i′. (Note that before the removal of H from gr(prev(j_{ℓ,k})), the index i′ was not eligible to be considered for prev(i), as i and i′ were both from the same group.)
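A minimal sketch of the context doubling of step (R3), treating contexts as explicit strings for clarity (the implementation in [24] works with positions and lengths instead):

```cpp
#include <string>

// con(H) = con(gr(prev(j))) followed by val(j) copies of con(gr(j)).
// By (C6), this concatenation actually occurs in x at every index moved
// into H, and by Claim 1 below, it is again a Lyndon string.
std::string doubled_context(const std::string& prev_con,
                            const std::string& driver_con, int val) {
    std::string h = prev_con;
    for (int r = 0; r < val; ++r)
        h += driver_con;  // A -> A B^val
    return h;
}
```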
Theorem 3 shows that if a t-complete group configuration G_k, …, G_{t+1}, G_t, …, G_1 is refined by the group G_t, then the resulting system of groups is a (t+1)-complete group configuration. This allows carrying out the refinement in an iterative fashion.
Theorem 3.
Let Conf = G_k, …, G_{t+1}, G_t, …, G_1 be a t-complete group configuration, 1 ≤ t ≤ k. After performing the refinement of Conf by the group G_t, the resulting system of groups, denoted as Conf′, is a (t+1)-complete group configuration.
Proof. 
We carry out the proof in a series of claims. The symbols gr(), con(), C(), prev(), and val() denote the functions for Conf, while gr′(), con′(), C′(), prev′(), and val′() denote the functions for Conf′.
When a group G_{t+1} is partitioned, a part of it is moved out as the next group in the list, and we call it H_{t+1}; thus, in the resulting list, H_{t+1} sits between G_{t+1} and G_t, with con(G_{t+1}) ≺ con(H_{t+1}) ≺ con(G_t). For details, please see (R3) above.
Claim 1.
Conf′ is a group configuration, i.e., (C1), (C2), (C3), and (C4) for Conf′ hold.
Proof of Claim 1.
(C1) and (C2) follow from the fact that the process is a refinement, i.e., a group is either preserved as is or is partitioned into two or more groups. The doubling of the contexts in Step (R3) guarantees that the increasing order of the contexts is preserved, i.e., (C3) holds. For any j ∈ G_t with prev(j) ≠ nil, con(gr(prev(j))) is Lyndon, and con(gr(j)) is also Lyndon, while con(gr(prev(j))) ≺ con(gr(j)), so con(gr(prev(j))) con(gr(j))^{val(j)} is Lyndon as well; thus, (C4) holds.
To illustrate the concatenation: let us denote con(gr(prev(j))) by A and con(gr(j)) by B, and let val(j) = m; then, we know that A is Lyndon, B is Lyndon, and A ≺ B, so A B^m is clearly Lyndon, as if A and B were letters (see the sketch after this proof).
This concludes the proof of Claim 1.
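The Lyndon concatenation fact used in Claim 1 (A, B Lyndon with A ≺ B imply that AB, and inductively A B^m, is Lyndon) is easy to sanity-check with a brute-force test; a small sketch, quadratic and meant for experimentation only, using the characterization that a non-empty string is Lyndon iff it is strictly smaller than all of its proper suffixes:

```cpp
#include <cassert>
#include <string>

// Brute-force Lyndon test: s is Lyndon iff s < every proper suffix of s.
bool is_lyndon(const std::string& s) {
    for (std::size_t i = 1; i < s.size(); ++i)
        if (s.substr(i) <= s) return false;
    return !s.empty();
}

int main() {
    const std::string A = "0", B = "011";  // both Lyndon, A < B
    assert(is_lyndon(A) && is_lyndon(B) && A < B);
    assert(is_lyndon(A + B + B));          // A B^2 = "0011011" is Lyndon
}
```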
Claim 2.
{ C′(i) | i ∈ 1..n } is proper and has the Monge property, i.e., (C7) and (C8) for Conf′ hold.
Proof of Claim 2.
Consider C′(i) for some i ∈ 1..n. There are two possibilities:
  • C′(i) = C(i), or
  • C′(i) = C(i) C(i_1) ⋯ C(i_v) for some i_1, i_2, …, i_v ∈ G_t, so that i = prev(i_ℓ), C(i_ℓ) = con(G_t), and v = val(i_ℓ) for any 1 ≤ ℓ ≤ v, and i_{ℓ+1} = i_ℓ + |con(G_t)| for any 1 ≤ ℓ < v. Note that con(gr(i)) ≺ con(G_t).
Consider C′(i) and C′(j) for some 1 ≤ i < j ≤ n.
  • Case C′(i) = C(i) and C′(j) = C(j).
    (a)
    Show that (C7(a)) holds.
    If C′(j) ⊊ C′(i), then C(j) ⊊ C(i), and so, by (C7(a)) for Conf, con(G_t) ≺ con(gr(j)), and thus, con(H_{t+1}) ≺ con(G_t) ≺ con(gr(j)) = con(gr′(j)). Therefore, (C7(a)) for Conf′ holds.
    (b)
    Show that (C8) holds. If C′(i) ∩ C′(j) ≠ ∅, then C(i) ∩ C(j) ≠ ∅, so C(j) ⊆ C(i), and so, C′(j) ⊆ C′(i); so, (C8) for Conf′ holds.
  • Case C′(i) = C(i) and C′(j) = C(j) C(j_1) ⋯ C(j_w),
    where w = val(j_1), C(j_1) = ⋯ = C(j_w) = con(G_t), and j_1 ≈ ⋯ ≈ j_w.
    (a)
    Show that (C7(a)) holds.
    If C′(j) ⊊ C′(i), then C(j) C(j_1) ⋯ C(j_w) ⊊ C(i); hence, C(j) ⊊ C(i), and so, by (C7(a)) for Conf, con(G_t) ≺ con(gr(j)). By the t-completeness of Conf, C(j) is then a right-maximal Lyndon substring, a contradiction with C(j) C(j_1) ⋯ C(j_w) being Lyndon. This is an impossible case.
    (b)
    Show that (C8) holds.
    If C′(i) ∩ C′(j) ≠ ∅, then C(j) ⊆ C(i) by (C8) for Conf. By (C7(a)) for Conf, C(j) cannot be a suffix of C(i), as con(gr(j)) ≺ con(G_t). Hence, C(i) ∩ C(j_1) ≠ ∅, and so, C(j) C(j_1) ⊆ C(i); and since C(j_1) cannot be a suffix of C(i), as gr(j_1) = G_t, it follows that C(i) ∩ C(j_2) ≠ ∅, …, ultimately giving C(j) C(j_1) ⋯ C(j_w) ⊆ C(i). Therefore, (C8) for Conf′ holds.
  • Case C′(i) = C(i) C(i_1) ⋯ C(i_v) and C′(j) = C(j),
    where v = val(i_1), C(i_1) = ⋯ = C(i_v) = con(G_t), and i_1 ≈ ⋯ ≈ i_v.
    (a)
    Show that (C7(a)) holds.
    If C′(j) ⊊ C′(i), then either C(j) ⊊ C(i), which implies by (C7(a)) for Conf that con(G_t) ≺ con(gr(j)), giving con(H_{t+1}) ≺ con(G_t) ≺ con(gr(j)) = con(gr′(j)), or C(j) ⊆ C(i_ℓ) for some 1 ≤ ℓ ≤ v. If C(j) = C(i_ℓ), then gr(j) = gr(i_ℓ) = G_t, giving con(H_{t+1}) ≺ con(G_t) = con(gr′(j)). Therefore, (C7(a)) for Conf′ holds.
    (b)
    Show that (C8) holds.
    Let C′(i) ∩ C′(j) ≠ ∅. Consider D = { i_ℓ | 1 ≤ ℓ ≤ v and C(j) ∩ C(i_ℓ) ≠ ∅ }.
    Assume that D ≠ ∅:
    • By (C8) for Conf, either C(j) ⊆ ⋃_{i_ℓ ∈ D} C(i_ℓ) ⊆ C′(i), and we are done, or ⋃_{i_ℓ ∈ D} C(i_ℓ) ⊆ C(j). Let i_k be the smallest element of D. Since C(i_k) cannot be a prefix of C(j), it means that i_k = i_1. Since C(i_1) cannot be a prefix of C(j), it means that C(i) ∩ C(j) ≠ ∅, and so, C(j) ⊆ C(i), which contradicts the fact that ⋃_{i_ℓ ∈ D} C(i_ℓ) ⊆ C(j).
    Assume that D = ∅:
    • Then, C(i) ∩ C(j) ≠ ∅, and so, by (C8) for Conf, C(j) ⊆ C(i) ⊆ C′(i), as i < j.
  • Case C′(i) = C(i) C(i_1) ⋯ C(i_v) and C′(j) = C(j) C(j_1) ⋯ C(j_w),
    where v = val(i_1), C(i_1) = ⋯ = C(i_v) = con(G_t), and i_1 ≈ ⋯ ≈ i_v, and where w = val(j_1), C(j_1) = ⋯ = C(j_w) = con(G_t), and j_1 ≈ ⋯ ≈ j_w.
    (a)
    Show that (C7(a)) holds.
    Let C′(j) ⊊ C′(i). Then, either C(j) ⊊ C(i), and so, con(G_t) ≺ con(gr(j)), implying that C(j) is right-maximal, contradicting C(j) C(j_1) ⋯ C(j_w) being Lyndon; thus, C(j) ⊆ C(i_ℓ) for some 1 ≤ ℓ ≤ v. However, then, con(G_t) ⪯ con(gr(j)), implying that C(j) is right-maximal, again a contradiction. This is an impossible case.
    (b)
    Show that (C8) holds.
    Let C′(i) ∩ C′(j) ≠ ∅. Let us first assume that C(i) ∩ C(j) ≠ ∅. Then, C(j) ⊆ C(i). Since C(j) cannot be a suffix of C(i), it follows that C(i) ∩ C(j_1) ≠ ∅. Therefore, C(j) C(j_1) ⊆ C(i). Repeating this argument leads to C(j) C(j_1) ⋯ C(j_w) ⊆ C(i), and we are done.
    Therefore, assume that C(i) ∩ C(j) = ∅. Let 1 ≤ ℓ ≤ v be the smallest such that C(i_ℓ) ∩ C(j) ≠ ∅. Such an ℓ must exist. Then, i_ℓ ≤ j. If i_ℓ = j, then either C(i_ℓ) is a prefix of C(j) or vice versa, both impossibilities; hence, i_ℓ < j. Repeating the same arguments as for i, we get that C(j) C(j_1) ⋯ C(j_w) ⊆ C(i_ℓ), and so, we are done.
It remains to show that (C7(b)) for Conf′ holds.
Consider C′(i) immediately followed by C′(j) with C′(i) ≺ C′(j).
  • Assume that gr′(j) ∈ { G_{t−1}, …, G_1 }.
    Then, con′(G_t) = con(G_t), gr′(j) = gr(j), and con(gr′(j)) = con(gr(j)). If C′(i) = C(i), then C(i) ≺ C(j), and C(i) is immediately followed by C(j), so by (C7(b)) for Conf, we have a contradiction. Thus, C′(i) = C(i) C(i_1) ⋯ C(i_v) for v = val(i_1), and con(gr(i_v)) = con(G_t) ≺ con(gr(j)), and C(i_v) is immediately followed by C(j), a contradiction by (C7(b)) for Conf.
  • Assume that gr′(j) = G_t.
    Then, the group gr(i) is partitioned when refining by G_t, and so, C′(i) = con(gr′(i)) = con(gr(i)) C′(j)^v for the corresponding valence v. Since C′(i) is immediately followed by C′(j) = con(G_t), we have again a contradiction, as it implies that the valence is v + 1.
This concludes the proof of Claim 2.
Claim 3.
The function prev′ is proper on H_{t+1}, i.e., (C6) for Conf′ holds.
Proof of Claim 3.
Let j = prev′(i) for i ∈ H_{t+1} with val′(i) = v. Then, |[i]≈| = v, and so, [i]≈ = { i_1, …, i_v }, where i_1 < i_2 < ⋯ < i_v. Hence, i_1, …, i_v ∈ H_{t+1}, C′(i_1) = ⋯ = C′(i_v) = con(H_{t+1}), and j = prev′(i) = prev′(i_1) = ⋯ = prev′(i_v), and so, j < i_1. It remains to show that C′(j) C′(i_1) ⋯ C′(i_v) is a prefix of x[j..n]. It suffices to show that C′(j) is immediately followed by C′(i_1).
If C′(j) ∩ C′(i_1) ≠ ∅, then by the Monge property (C8), C′(i_1) ⊆ C′(j), as j < i_1, and so, by (C7(a)), con(H_{t+1}) ≺ con(gr′(i_1)) = con(H_{t+1}), a contradiction.
Thus, C′(j) ∩ C′(i_1) = ∅. Set j_1 = j + |con(gr′(j))|. It follows that j_1 ≤ i_1. Assume that j_1 < i_1. Since j = prev′(i_1) and j < j_1 < i_1, con(gr′(j_1)) ⪰ con(gr′(i_1)) = con(H_{t+1}). Since j_1 ∉ H_{t+1}, con(gr′(j_1)) ≻ con(H_{t+1}). Consider C′(j_1). If C′(j_1) ∩ C′(i_1) ≠ ∅, then by (C8), C′(i_1) ⊆ C′(j_1), and so, by (C7(a)), con(H_{t+1}) ≺ con(gr′(i_1)) = con(H_{t+1}), a contradiction. Thus, C′(j_1) ∩ C′(i_1) = ∅. Since C′(j_1) immediately follows C′(j) and C′(j) ≺ C′(j_1), by (C7(b)), con(gr′(j_1)) ⪯ con(H_{t+1}), a contradiction. Therefore, j_1 = i_1, and so, prev′ is proper on H_{t+1}. □
This concludes the proof of Claim 3.
Claim 4.
H_{t+1} is a complete group, i.e., (C5) for Conf′ holds.
Proof of Claim 4.
Assume that there is i ∈ H_{t+1} so that C′(i) is not right-maximal, i.e., for some k ≥ i + |con(H_{t+1})|, x[i..k] is a right-maximal Lyndon substring of x.
Either k = n, and so, con(gr′(k)) = x[k], and so, C′(k) is a suffix of x[i..k]; or k < n, and then, x[k+1] ⪯ x[k], since x[k] ≺ x[k+1] implies that x[i..k+1] is Lyndon, a contradiction with the right-maximality of x[i..k]. Consider C′(k); then, C′(k) ⊆ x[i..k], and so, C′(k) = x[k].
Therefore, there is j_1 so that i + |con(H_{t+1})| ≤ j_1 ≤ k and C′(j_1) is a suffix of x[i..k]. Take the smallest such j_1. If j_1 = i + |con(H_{t+1})|, then C′(i) ≺ C′(j_1), as x[i..k] = C′(i) C′(j_1) is Lyndon. By (C7(b)), C′(j_1) ⪯ con(H_{t+1}), so we have con(H_{t+1}) = C′(i) ≺ C′(j_1) ⪯ con(H_{t+1}), a contradiction.
Therefore, j_1 > i + |con(H_{t+1})|. Consider x[j_1 − 1]. If x[j_1 − 1] ≺ x[j_1], then x[j_1 − 1..k] is Lyndon, and since x[j_1..k] = C′(j_1), x[j_1 − 1..k] would be a context of gr′(j_1 − 1); this contradicts the fact that j_1 was chosen to be the smallest such one. Therefore, x[j_1 − 1] ⪰ x[j_1], and so, con(gr′(j_1 − 1)) = x[j_1 − 1]. Thus, there is j_2 with i + |con(H_{t+1})| ≤ j_2 < j_1 ≤ k such that C′(j_2) is a suffix of x[i..j_1 − 1]. Take the smallest such j_2. If C′(j_2) ≺ C′(j_1), then by (C7(b)), C′(j_1) ⪯ con(H_{t+1}), a contradiction. Hence, C′(j_2) ⪰ C′(j_1). If j_2 = i + |con(H_{t+1})|, then x[i..k] = C′(i) C′(j_2) ⋯ C′(j_1), and so, by (C7(b)), C′(j_2) ⪯ con(H_{t+1}), a contradiction. Hence, i + |con(H_{t+1})| < j_2.
The same argument done for j_2 can now be done for j_3. We end up with i + |con(H_{t+1})| ≤ j_3 < j_2 < j_1 ≤ k and with C′(j_3) ⪰ C′(j_2) ⪰ C′(j_1) ≻ con(H_{t+1}). If i + |con(H_{t+1})| = j_3, then we have a contradiction, so i + |con(H_{t+1})| < j_3. These arguments can be repeated only finitely many times, and we obtain i + |con(H_{t+1})| = j_ℓ < j_{ℓ−1} < ⋯ < j_2 < j_1 ≤ k so that x[i..k] = C′(i) C′(j_ℓ) C′(j_{ℓ−1}) ⋯ C′(j_2) C′(j_1), which is a contradiction.
Therefore, our initial assumption that C′(i) is not right-maximal always leads to a contradiction. □
This concludes the proof of Claim 4.
The four claims show that all the conditions (C1)–(C8) are satisfied for Conf′, and that proves Theorem 3. □
As the last step, we show that when the process of refinement is completed, all right-maximal Lyndon substrings of x are identified and sorted via the contexts of the groups of the final configuration.
Theorem 4.
Let Conf_1 = G^1_{k_1}, G^1_{k_1−1}, …, G^1_2, G^1_1, with gr_1(), con_1(), C_1(), prev_1(), and val_1(), be the initial 1-complete group configuration from Lemma 9.
Let Conf_2 = G^2_{k_2}, G^2_{k_2−1}, …, G^2_2, G^2_1, with gr_2(), con_2(), C_2(), prev_2(), and val_2(), be the 2-complete group configuration obtained from Conf_1 through the refinement by the group G^1_1.
Let Conf_3 = G^3_{k_3}, G^3_{k_3−1}, …, G^3_2, G^3_1, with gr_3(), con_3(), C_3(), prev_3(), and val_3(), be the 3-complete group configuration obtained from Conf_2 through the refinement by the group G^2_2.
…
Let Conf_r = G^r_{k_r}, G^r_{k_r−1}, …, G^r_2, G^r_1, with gr_r(), con_r(), C_r(), prev_r(), and val_r(), be the r-complete group configuration obtained from Conf_{r−1} through the refinement by the group G^{r−1}_{r−1}, and let Conf_r be the final configuration, i.e., the configuration obtained when no further refinement is possible.
Then, x[i..k] is a right-maximal Lyndon substring of x iff x[i..k] = C_r(i) = con_r(gr_r(i)).
Proof. 
That all the groups of Conf_r are complete follows from Theorem 3, and hence, every C_r(i) is a right-maximal Lyndon substring. Let x[i..k] be a right-maximal Lyndon substring of x. Consider C_r(i); since it is right-maximal and there is only one right-maximal Lyndon substring starting at i, it must be equal to x[i..k]. □

4.3. Motivation for the Refinement

The process of refinement is in fact a process of gradually revealing the right-maximal Lyndon substrings, which we call the water draining method:
(a)
lower the water level by one;
(b)
extend the existing Lyndon substrings; the revealed letters are used to extend the existing Lyndon substrings where possible, or they become Lyndon substrings of length one otherwise;
(c)
consolidate the new Lyndon substrings; processing from the right, if several adjacent Lyndon substrings can be joined into a longer Lyndon substring, they are joined (a small sketch of the joining rule follows this list).
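The joining rule rests on the same fact as in Claim 1: for Lyndon strings u ≺ v, the concatenation uv is again Lyndon. A minimal sketch of the right-to-left consolidation over explicit string blocks (an illustration only, not the actual BSLA data flow):

```cpp
#include <string>
#include <vector>

// Right-to-left consolidation: adjacent Lyndon blocks u, v with u < v
// are repeatedly joined into the Lyndon block uv.
void consolidate(std::vector<std::string>& blocks) {
    for (int i = (int)blocks.size() - 2; i >= 0; --i)
        while (i + 1 < (int)blocks.size() && blocks[i] < blocks[i + 1]) {
            blocks[i] += blocks[i + 1];
            blocks.erase(blocks.begin() + i + 1);
        }
}
```

In stage (5) of the example below, applied to the blocks {"011", "023", "122"}, it first joins "023" and "122" into "023122" and then "011" and "023122" into "011023122", exactly as described.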
The diagram in Figure 5 and the description that follows it illustrate the method for a string 011023122. The input string is visualized as a curve, and the height at each point is the value of the letter at that position.
In Figure 5, we illustrate the process:
(1)
We start with the string 011023122 and a full tank of water.
(2)
We drain one level; only 3 is revealed; there is nothing to extend, nothing to consolidate.
(3)
We drain one more level, and three 2’s are revealed; the first 2 extends 3 to 23, and the remaining two 2’s form Lyndon substrings 2 of length one; there is nothing to consolidate.
(4)
We drain one more level, and three 1’s are revealed; the first two 1’s form Lyndon substrings 1 of length one; the third 1 extends 22 to 122; there is nothing to consolidate.
(5)
We drain one more level, and two 0’s are revealed; the first 0 extends 11 to 011; the second 0 extends 23 to 023; in the consolidation phase, 023 is joined with 122 to form a Lyndon substring 023122, and then, 011 is joined with 023122 to form a Lyndon substring 011023122.
Therefore, during the process, the following right-maximal Lyndon substrings were identified: 3 at Position 6, 23 at Position 5, 2 at Positions 8 and 9, 1 at Positions 2 and 3, 122 at Position 7, 023122 at Position 4, and finally, 011023122 at Position 1. Note that all positions are accounted for; we really have all the right-maximal Lyndon substrings of the string 011023122.
In Figure 6, we present an illustrative example for the string 011023122, where the arrows represent the p r e v mapping shown only on the group used for the refinement. The groups used for the refinement are indicated by the bold font.

4.4. The Complexity of BSLA

The computation of the initial configuration can be done in linear time; to compute the initial value of prev in linear time, a stack-based approach similar to the NSV (nearest smaller value) algorithm is used. Since all the groups are non-empty, there can never be more groups than n. Theorem 3 is at the heart of the algorithm: the refinement by the last completed group is linear in the size of that group, including the update of prev. Therefore, the overall worst-case complexity of BSLA is linear in the length of the input string.
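In the initial configuration, the contexts are single letters, so prev(i) is simply the nearest position to the left of i holding a strictly smaller letter. A minimal sketch of the standard stack-based scan (assuming 0-based indexing and −1 in the role of nil):

```cpp
#include <string>
#include <vector>

// prev[i] = largest j < i with x[j] < x[i], or -1 if no such j exists.
// The stack keeps positions whose letters strictly increase, so each
// position is pushed and popped at most once: O(n) total.
std::vector<int> initial_prev(const std::string& x) {
    const int n = (int)x.size();
    std::vector<int> prev(n, -1), stack;
    for (int i = 0; i < n; ++i) {
        while (!stack.empty() && x[stack.back()] >= x[i])
            stack.pop_back();
        if (!stack.empty()) prev[i] = stack.back();
        stack.push_back(i);
    }
    return prev;
}
```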

5. Data and Measurements

Initially, computations were performed on the Department of Computing and Software's moore server; memory: 32 GB (DDR4 @ 2400 MHz), CPU: 8 × Intel Xeon E5-2687W v4 @ 3.00 GHz, OS: Linux Version 2.6.18-419.el5 (gcc Version 4.1.2 and Red Hat Version 4.1.2-55). To verify correctness, new randomized data were produced and computed independently on the University of Toronto Mississauga's octolab cluster; memory: 8 × 32 GB (DDR4 @ 3200 MHz), CPU: 8 × AMD Ryzen Threadripper 1920X (12-Core) @ 4.00 GHz, OS: Ubuntu 16.04.6 LTS (gcc Version 5.4.0). The results of both were extremely similar, and those reported herein are the ones generated on the moore server. All the programs were compiled without any additional level of optimization (i.e., neither the -O1, nor -O2, nor -O3 flags were specified for the compilation). The CPU time was measured in clock ticks, with 1,000,000 clock ticks per second. Since the execution time was negligible for short strings, the processing of the same string was repeated several times (the repeat factor varied from 10^6 for strings of length 10 down to one for strings of length 5 × 10^6), resulting in a higher precision (a small sketch of this timing scheme is given after the list below). Thus, for graphing, the logarithmic scale was used for both axes: the x-axis represents the length of the strings and the y-axis the time. We used four categories of randomly generated datasets:
(1)
bin
random strings over an integer alphabet with exactly two distinct letters (in essence, binary strings).
(2)
dna
random strings over an integer alphabet with exactly four distinct letters (in essence, random DNA strings).
(3)
eng
random strings over an integer alphabet with exactly 26 distinct letters (in essence, random English-like strings).
(4)
int
random strings over an integer alphabet (i.e., over the alphabet {0, …, n − 1}).
Each dataset contains 100 randomly generated strings of the same length. For each category, there were datasets for lengths 10, 50, 10^2, 5 × 10^2, …, 10^5, 5 × 10^5, 10^6, and 5 × 10^6. The minimum, average, and maximum times for each dataset were computed. Since the variance within each dataset was minimal, the results for the minimum times and for the maximum times completely mimicked the results for the average times, so we only present the averages here.
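A minimal sketch of the timing scheme described above (clock ticks via std::clock; run_algorithm stands in for any of the three implementations, and the repeat factor reps is chosen by the string length):

```cpp
#include <ctime>

// Average CPU ticks per run: repeat the same computation `reps` times
// so that even very short strings yield measurable times.
template <typename F>
double average_ticks(F run_algorithm, long reps) {
    std::clock_t start = std::clock();
    for (long r = 0; r < reps; ++r)
        run_algorithm();
    return double(std::clock() - start) / double(reps);
}
```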
Tables 1–4 and the graphs in Figures 7–10 in Section 7 clearly indicate that the performance of the three algorithms is linear and virtually indistinguishable. We expected IDLA and TRLA to exhibit linear behaviour on random strings, as such strings tend to have many, but short, right-maximal Lyndon substrings. However, we did not expect the results to be so close.
Despite the fact that IDLA performed in linear time on the random strings, it is relatively easy to force it into its worst, quadratic performance. The dataset extreme_idla contains, for each required length n, the single string 0 1 2 ⋯ (n − 1); on such a strictly increasing string, every suffix is Lyndon, so the Duval scan runs to the end of the string from every position. Table 5 and the graph in Figure 11 in Section 7 show this clearly.
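For reference, a minimal sketch of IDLA (the iterated Duval approach): for each position, Duval's scan computes the length of the longest Lyndon prefix of the corresponding suffix. This is a straightforward rendering of the idea, not the tuned implementation from [24]:

```cpp
#include <string>
#include <vector>

// lambda[i] = length of the right-maximal Lyndon substring starting at
// i (0-based). Each iteration is Duval's scan on the suffix x[i..n-1]:
// the invariant is x[i..j-1] = w^m w' with w Lyndon of length j - k.
std::vector<int> idla(const std::string& x) {
    const int n = (int)x.size();
    std::vector<int> lambda(n);
    for (int i = 0; i < n; ++i) {
        int j = i + 1, k = i;
        while (j < n && x[k] <= x[j]) {
            k = (x[k] < x[j]) ? i : k + 1;
            ++j;
        }
        lambda[i] = j - k;  // length of the first Lyndon factor
    }
    return lambda;
}
```

On the example string 011023122 of Section 4.3, idla returns {9, 1, 1, 6, 2, 1, 3, 1, 1}, matching the right-maximal Lyndon substrings listed there; on 0 1 2 ⋯ (n − 1), every scan runs to the end of the string, which is the quadratic worst case.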
In Section 3.5, we described how the three datasets, extreme_trla, extreme_trla1, and extreme_trla2, are generated and why. The results of experimenting with these datasets do not suggest that the O(n log(n)) worst case of TRLA is actually attained on them. Yet again, the performances of the three algorithms are linear and virtually indistinguishable; see Tables 6–8 and the graphs in Figures 12–14 in Section 7.

6. Conclusions and Future Work

We present two novel algorithms for computing right-maximal Lyndon substrings. The first one, TRLA, has a simple implementation with a complicated theory behind it. Its average-case time complexity is linear in the length of the input string, and its worst-case complexity is no worse than O(n log(n)). The τ-reduction used in the algorithm is an interesting reduction preserving right-maximal Lyndon substrings, a fact used significantly in the design of the algorithm. Interestingly, it seems to slightly outperform BSLA, at least on the datasets used for our experimentation. BSLA, the second algorithm, is linear and elementary in the sense that it does not require a pre-processed global data structure. Being linear and elementary, BSLA is more interesting, and it is possible that its performance could be further streamlined. However, both the theory and the implementation of BSLA are rather complex.
On random strings, neither of the two algorithms was significantly better than the simple IDLA, whose implementation is just a few lines. However, IDLA's quadratic worst-case complexity is an obstacle, as our experiments indicated.
Additional effort needs to go into settling TRLA's worst-case complexity: the experiments performed gave no indication that it is not linear even in the worst case. Both algorithms also need to be compared to efficient implementations of SSLA and BWLA.

7. Results

This section contains the measurements of the average times for the datasets discussed in the previous section. For a better understanding of the data, we present them in Tables 1–8 and Figures 7–14. All the graphs include the line y = x for reference.

Author Contributions

Conceptualization, F.F. and M.L.; methodology, F.F. and M.L.; software, F.F. and M.L.; validation, F.F. and M.L.; formal analysis, F.F. and M.L.; investigation, F.F. and M.L.; resources, F.F. and M.L.; data curation, F.F. and M.L.; writing, original draft preparation, F.F. and M.L.; writing, review and editing, F.F. and M.L.; visualization, F.F. and M.L.; supervision, F.F. and M.L.; project administration, F.F. and M.L.; funding acquisition, F.F. Both authors read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), Grant RGPIN/5504-2018.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BSLABaier’s Sort Lyndon Array
IDLAIterative Duval Lyndon Array
TRLATau Reduction Lyndon Array

References

  1. Lyndon, R.C. On Burnside’s Problem. II. Trans. Am. Math. Soc. 1955, 78, 329–332. [Google Scholar]
  2. Marcus, S.; Sokol, D. 2D Lyndon words and applications. Algorithmica 2017, 77, 116–133. [Google Scholar] [CrossRef]
  3. Berstel, J.; Perrin, D. The origins of combinatorics on words. Eur. J. Comb. 2007, 28, 996–1022. [Google Scholar] [CrossRef] [Green Version]
  4. Chen, K.; Fox, R.; Lyndon, R. Free differential calculus IV. The quotient groups of the lower central series. Ann. Math. 2nd Ser. 1958, 68, 81–95. [Google Scholar] [CrossRef]
  5. Golomb, S. Irreducible polynomials, synchronizing codes, primitive necklaces and cyclotomic algebra. Comb. Math. Appl. 1967, 4, 358–370. [Google Scholar]
  6. Flajolet, P.; Gourdon, X.; Panario, D. The complete analysis of a polynomial factorization algorithm over finite fields. J. Algorithms 2001, 40, 37–81. [Google Scholar] [CrossRef] [Green Version]
  7. Panario, D.; Richmond, B. Smallest components in decomposable structures:exp-log class. Algorithmica 2001, 29, 205–226. [Google Scholar] [CrossRef]
  8. Duval, J.P. Factorizing words over an ordered alphabet. J. Algorithms 1983, 4, 363–381. [Google Scholar] [CrossRef]
  9. Berstel, J.; Pocchiola, M. Average cost of Duval’s algorithm for generating Lyndon words. Theor. Comput. Sci. 1994, 132, 415–425. [Google Scholar] [CrossRef]
  10. Fredricksen, H.; Maiorana, J. Necklaces of beads in k colors and k-ary de Bruijn sequences. Discret. Math. 1983, 23, 207–210. [Google Scholar] [CrossRef] [Green Version]
  11. Bannai, H.; Tomohiro, I.; Inenaga, S.; Nakashima, Y.; Takeda, M.; Tsuruta, K. The “Runs” Theorem. SIAM J. Comput. 2017, 46, 1501–1514. [Google Scholar] [CrossRef]
  12. Franek, F.; Paracha, A.; Smyth, W. The linear equivalence of the suffix array and the partially sorted Lyndon array. In Proceedings of the Prague Stringology Conference, Prague, Czech Republic, 28–30 August 2017; pp. 77–84. [Google Scholar]
  13. Baier, U. Linear-Time Suffix Sorting—A New Approach for Suffix Array Construction. Master’s Thesis, University of Ulm, Ulm, Germany, 2015. [Google Scholar]
  14. Baier, U. Linear-Time Suffix Sorting—A New Approach for Suffix Array Construction. In Proceedings of the 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016), Tel Aviv, Israel, 27–29 June 2016; Grossi, R., Lewenstein, M., Eds.; Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik: Dagstuhl, Germany, 2016; Volume 54, pp. 1–12. [Google Scholar]
  15. Chen, G.; Puglisi, S.; Smyth, W. Lempel-Ziv factorization using less time & space. Math. Comput. Sci. 2013, 1, 605–623. [Google Scholar]
  16. Crochemore, M.; Ilie, L.; Smyth, W. A simple algorithm for computing the Lempel-Ziv factorization. In Proceedings of the 18th Data Compression Conference, Snowbird, UT, USA, 25–27 March 2008; pp. 482–488. [Google Scholar]
  17. Kosolobov, D. Lempel-Ziv factorization may be harder than computing all runs. In Proceedings of the 32 International Symposium on Theoretical Aspects of Computer Science—STACS 2015, Garching, Germany, 4–7 March 2015; pp. 582–593. [Google Scholar]
  18. Digelmann, C.; (Frankfurt, Germany). Personal communication, 2016.
  19. Franek, F.; Sohidull Islam, A.; Sohel Rahman, M.; Smyth, W. Algorithms to compute the Lyndon array. In Proceedings of the Prague Stringology Conference 2016, Prague, Czech Republic, 29–31 August 2016; pp. 172–184. [Google Scholar]
  20. Hohlweg, C.; Reutenauer, C. Lyndon words, permutations and trees. Theor. Comput. Sci. 2003, 307, 173–178. [Google Scholar] [CrossRef] [Green Version]
  21. Nong, G.; Zhang, S.; Chan, W.H. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the 2009 Data Compression Conference, Snowbird, UT, USA, 16–18 March 2009; pp. 193–202. [Google Scholar]
  22. Louza, F.; Smyth, W.; Manzini, G.; Telles, G. Lyndon array construction during Burrows–Wheeler inversion. J. Discret. Algorithms 2018, 50, 2–9. [Google Scholar] [CrossRef]
  23. Franek, F.; Liut, M.; Smyth, W. On Baier’s sort of maximal Lyndon substrings. In Proceedings of the Prague Stringology Conference 2018, Prague, Czech Republic, 27–28 August 2018; pp. 63–78. [Google Scholar]
  24. C++ Code for IDLA, TRLA and BSLA Algorithms. Available online: https://github.com/MichaelLiut/Computing-LyndonArray (accessed on 3 November 2020).
  25. Farach, M. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th IEEE Symp. Foundations of Computer Science, Miami Beach, FL, USA, 20–22 October 1997; pp. 137–143. [Google Scholar]
  26. Nong, G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 2013, 31, 1–15. [Google Scholar] [CrossRef]
  27. Cooley, J.; Tukey, J. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
  28. Franek, F.; Liut, M. Computing Maximal Lyndon Substrings of a String, AdvOL Report 2019/2, McMaster University. Available online: http://optlab.mcmaster.ca//component/option,com_docman/task,cat_view/gid,77/Itemid,92 (accessed on 1 March 2019).
  29. Franek, F.; Liut, M. Algorithms to compute the Lyndon array revisited. In Proceedings of the Prague Stringology Conference 2019, Prague, Czech Republic, 26–28 August 2019; pp. 16–28. [Google Scholar]
  30. Liut, M. Computing Lyndon Arrays. Ph.D. Thesis, McMaster University, Hamilton, ON, Canada, 2019. [Google Scholar]
  31. Lothaire, M. Combinatorics on Words; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  32. Lothaire, M. Applied Combinatorics on Words; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  33. Smyth, B. Computing Patterns in Strings; Pearson Addison-Wesley: Boston, MA, USA, 2003. [Google Scholar]
  34. Louza, F.; Gog, S.; Telles, G. Construction of Fundamental Data Structures for Strings; Springer: Cham, Switzerland, 2020. [Google Scholar]
  35. Burkhardt, S.; Kärkkäinen, J. Fast Lightweight Suffix Array Construction and Checking. In Proceedings of the 14th Annual Conference on Combinatorial Pattern Matching, Michoacan, Mexico, 25–27 June 2003; Springer: Berlin, Heidelberg, 2003; pp. 55–69. [Google Scholar]
  36. Paracha, A. Lyndon Factors and Periodicities in Strings. Ph.D. Thesis, McMaster University, Hamilton, ON, Canada, 2017. [Google Scholar]
  37. Kärkkäinen, J.; Sanders, P. Simple linear work suffix array construction. In Proceedings of the 30th International Conference on Automata, Languages and Programming, Eindhoven, The Netherlands, 30 June–4 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 943–955. [Google Scholar]
Figure 1. Algorithm IDLA.
Figure 2. Illustration of the τ-reduction of a string 011023122. The rounded rectangles indicate symbol τ-pairs; the ovals indicate the τ-pairs. Below are the colour labels of the positions; at the bottom is the τ-reduction.
Figure 3. Computing the partial Lyndon array of the input string.
Figure 4. Computing missing values of the Lyndon array of the input string.
Figure 5. The water draining method for 011023122. Stages (1)–(6) explained in the text.
Figure 6. Group refinement for 011023122.
Figure 7. Average times for dataset bin (10^6 clock ticks per second).
Figure 8. Average times for dataset dna (10^6 clock ticks per second).
Figure 9. Average times for dataset eng (10^6 clock ticks per second).
Figure 10. Average times for dataset int (10^6 clock ticks per second).
Figure 11. Average times for dataset extreme_idla (10^6 clock ticks per second).
Figure 12. Average times for dataset extreme_trla (10^6 clock ticks per second).
Figure 13. Average times for dataset extreme_trla1 (10^6 clock ticks per second).
Figure 14. Average times for dataset extreme_trla2 (10^6 clock ticks per second).
Table 1. Average times for dataset bin (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              3.651 × 10^−1   1.582           1.054
50              4.082           1.050 × 10      6.372
100             1.101 × 10      2.140 × 10      1.277 × 10
500             8.655 × 10      1.127 × 10^2    6.786 × 10
1000            1.975 × 10^2    2.335 × 10^2    1.484 × 10^2
5000            1.278 × 10^3    1.218 × 10^3    8.595 × 10^2
10,000          2.765 × 10^3    2.423 × 10^3    1.820 × 10^3
50,000          1.665 × 10^4    1.272 × 10^4    1.018 × 10^4
100,000         3.606 × 10^4    2.523 × 10^4    2.113 × 10^4
500,000         2.071 × 10^5    1.338 × 10^5    1.493 × 10^5
1,000,000       4.387 × 10^5    2.717 × 10^5    4.080 × 10^5
5,000,000       2.483 × 10^6    1.561 × 10^6    3.098 × 10^6
Table 2. Average times for dataset dna (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              3.699 × 10^−1   1.579           1.080
50              3.509           1.037 × 10      6.627
100             8.898           2.109 × 10      1.403 × 10
500             6.403 × 10      1.123 × 10^2    7.228 × 10
1000            1.431 × 10^2    2.332 × 10^2    1.544 × 10^2
5000            8.749 × 10^2    1.207 × 10^3    9.039 × 10^2
10,000          1.912 × 10^3    2.460 × 10^3    1.935 × 10^3
50,000          1.134 × 10^4    1.280 × 10^4    1.110 × 10^4
100,000         2.431 × 10^4    2.588 × 10^4    2.316 × 10^4
500,000         1.383 × 10^5    1.390 × 10^5    1.781 × 10^5
1,000,000       2.916 × 10^5    2.865 × 10^5    4.994 × 10^5
5,000,000       1.643 × 10^6    1.968 × 10^6    3.752 × 10^6
Table 3. Average times for dataset eng (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              3.526 × 10^−1   1.584           9.865 × 10^−1
50              3.162           1.006 × 10      5.960
100             7.315           2.057 × 10      1.317 × 10
500             4.996 × 10      1.117 × 10^2    7.245 × 10
1000            1.112 × 10^2    2.354 × 10^2    1.542 × 10^2
5000            6.722 × 10^2    1.210 × 10^3    9.087 × 10^2
10,000          1.452 × 10^3    2.427 × 10^3    2.042 × 10^3
50,000          8.505 × 10^3    1.306 × 10^4    1.301 × 10^4
100,000         1.802 × 10^4    2.688 × 10^4    2.768 × 10^4
500,000         1.025 × 10^5    1.428 × 10^5    2.381 × 10^5
1,000,000       2.171 × 10^5    3.253 × 10^5    7.236 × 10^5
5,000,000       1.206 × 10^6    2.599 × 10^6    6.092 × 10^6
Table 4. Average times for dataset int (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              3.547 × 10^−1   1.645           9.794 × 10^−1
50              3.032           9.992           5.609
100             7.279           2.032 × 10      1.153 × 10
500             4.845 × 10      1.136 × 10^2    6.184 × 10
1000            1.057 × 10^2    2.376 × 10^2    1.294 × 10^2
5000            6.428 × 10^2    1.218 × 10^3    7.753 × 10^2
10,000          1.388 × 10^3    2.544 × 10^3    1.796 × 10^3
50,000          8.055 × 10^3    1.448 × 10^4    1.088 × 10^4
100,000         1.710 × 10^4    2.943 × 10^4    2.379 × 10^4
500,000         9.829 × 10^4    1.825 × 10^5    2.740 × 10^5
1,000,000       2.071 × 10^5    4.827 × 10^5    7.989 × 10^5
5,000,000       1.162 × 10^6    5.143 × 10^6    6.635 × 10^6
Table 5. Average times for dataset extreme_idla (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              7.900 × 10^−1   1.440           7.000 × 10^−1
50              1.830 × 10      8.200           3.600
100             7.190 × 10      1.590 × 10      6.800
500             1.778 × 10^3    7.300 × 10      3.550 × 10
1000            7.105 × 10^3    1.430 × 10^2    6.900 × 10
5000            1.776 × 10^5    7.100 × 10^2    3.400 × 10^2
10,000          7.111 × 10^5    1.550 × 10^3    6.800 × 10^2
50,000          1.784 × 10^7    8.050 × 10^3    3.400 × 10^3
100,000         7.130 × 10^7    1.600 × 10^4    6.800 × 10^3
500,000         1.783 × 10^9    8.200 × 10^4    3.700 × 10^4
1,000,000       7.137 × 10^9    1.660 × 10^5    7.800 × 10^4
5,000,000       1.813 × 10^11   8.800 × 10^5    4.950 × 10^5
Table 6. Average times for dataset extreme_trla (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              4.588 × 10^−1   1.628           1.126
50              4.987           1.039 × 10      7.112
100             1.275 × 10      2.179 × 10      1.439 × 10
500             9.033 × 10      1.101 × 10^2    6.914 × 10
1000            2.060 × 10^2    2.222 × 10^2    1.392 × 10^2
5000            1.319 × 10^3    1.171 × 10^3    7.699 × 10^2
10,000          2.896 × 10^3    2.394 × 10^3    1.652 × 10^3
50,000          2.209 × 10^4    1.263 × 10^4    8.992 × 10^3
100,000         3.965 × 10^4    2.567 × 10^4    1.862 × 10^4
500,000         2.233 × 10^5    1.349 × 10^5    1.091 × 10^5
1,000,000       4.734 × 10^5    2.759 × 10^5    3.104 × 10^5
5,000,000       2.632 × 10^6    1.458 × 10^6    2.298 × 10^6
Table 7. Average times for dataset extreme_trla1 (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              5.040 × 10^−1   1.600           1.117
50              5.910           1.042 × 10      7.290
100             1.460 × 10      2.145 × 10      1.446 × 10
500             1.146 × 10^2    1.126 × 10^2    6.979 × 10
1000            2.662 × 10^2    2.284 × 10^2    1.379 × 10^2
5000            1.694 × 10^3    1.205 × 10^3    7.853 × 10^2
10,000          3.734 × 10^3    2.477 × 10^3    1.739 × 10^3
50,000          2.276 × 10^4    1.310 × 10^4    9.683 × 10^3
100,000         4.901 × 10^4    2.796 × 10^4    2.009 × 10^4
500,000         2.928 × 10^5    1.465 × 10^5    1.238 × 10^5
1,000,000       6.199 × 10^5    3.000 × 10^5    3.622 × 10^5
5,000,000       3.432 × 10^6    1.642 × 10^6    2.778 × 10^6
Table 8. Average times for dataset extreme_trla2 (10^6 clock ticks per second).

String Length   IDLA (ticks)    TRLA (ticks)    BSLA (ticks)
10              5.041 × 10^−1   1.683           1.121
50              6.160           1.020 × 10      7.257
100             1.526 × 10      2.090 × 10      1.441 × 10
500             1.367 × 10^2    1.074 × 10^2    7.117 × 10
1000            3.202 × 10^2    2.135 × 10^2    1.390 × 10^2
5000            2.024 × 10^3    1.145 × 10^3    7.966 × 10^2
10,000          4.500 × 10^3    2.257 × 10^3    1.762 × 10^3
50,000          2.728 × 10^4    1.172 × 10^4    1.012 × 10^4
100,000         5.941 × 10^4    2.362 × 10^4    2.115 × 10^4
500,000         3.639 × 10^5    1.262 × 10^5    1.351 × 10^5
1,000,000       7.719 × 10^5    2.571 × 10^5    3.915 × 10^5
5,000,000       4.263 × 10^6    1.323 × 10^6    3.118 × 10^6
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Franek, F.; Liut, M. Computing Maximal Lyndon Substrings of a String. Algorithms 2020, 13, 294. https://doi.org/10.3390/a13110294

AMA Style

Franek F, Liut M. Computing Maximal Lyndon Substrings of a String. Algorithms. 2020; 13(11):294. https://doi.org/10.3390/a13110294

Chicago/Turabian Style

Franek, Frantisek, and Michael Liut. 2020. "Computing Maximal Lyndon Substrings of a String" Algorithms 13, no. 11: 294. https://doi.org/10.3390/a13110294

APA Style

Franek, F., & Liut, M. (2020). Computing Maximal Lyndon Substrings of a String. Algorithms, 13(11), 294. https://doi.org/10.3390/a13110294

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop