Computing Maximal Lyndon Substrings of a String

Franek, Frantisek; Liut, Michael

doi:10.3390/a13110294


by Frantisek Franek ¹,† and Michael Liut ²,*,†

¹ Department of Computing and Software, McMaster University, Hamilton, ON L8S 4K1, Canada

² Department of Mathematical and Computational Sciences, University of Toronto Mississauga, Mississauga, ON L5L 1C6, Canada

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Algorithms 2020, 13(11), 294; https://doi.org/10.3390/a13110294

Submission received: 23 September 2020 / Revised: 4 November 2020 / Accepted: 10 November 2020 / Published: 12 November 2020

(This article belongs to the Special Issue Combinatorial Methods for String Processing)


Abstract: There are two reasons to have an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. introduced in 2015 a linear algorithm to compute all runs of a string that relies on knowing all right-maximal Lyndon substrings of the input string, and secondly, Franek et al. showed in 2017 a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings of a string, inspired by a novel suffix sorting algorithm of Baier. In 2016, Franek et al. presented a brief overview of algorithms for computing the Lyndon array that encodes the knowledge of right-maximal Lyndon substrings of the input string. Among those presented were two well-known algorithms for computing the Lyndon array: a quadratic in-place algorithm based on the iterated Duval algorithm for Lyndon factorization and a linear algorithmic scheme based on linear suffix sorting, computing the inverse suffix array, and applying to it the next smaller value algorithm. Duval’s algorithm works for strings over any ordered alphabet, while for linear suffix sorting, a constant or an integer alphabet is required. The authors at that time were not aware of Baier’s algorithm. In 2017, our research group proposed a novel algorithm for the Lyndon array. Though the proposed algorithm is linear in the average case and has $O(n \log n)$ worst-case complexity, it is interesting as it emulates the fast Fourier algorithm’s recursive approach and introduces $τ$-reduction, which might be of independent interest. In 2018, we presented a linear algorithm to compute the Lyndon array of a string inspired by Phase I of Baier’s algorithm for suffix sorting. This paper presents the theoretical analysis of these two algorithms and provides empirical comparisons of both of their C++ implementations with respect to the iterated Duval algorithm.

Keywords:

combinatorics on words; string algorithms; regularities in strings; suffix sorting; Lyndon substrings; Lyndon arrays; right-maximal Lyndon substrings; tau-reduction algorithm; Baier’s sort algorithm; iterative Duval algorithm


1. Introduction

In combinatorics on words, Lyndon words play a very important role. Lyndon words, a special case of Hall words, were named after Roger Lyndon, who was looking for a suitable description of the generators of free Lie algebras [1]. Despite their humble beginnings, to date, Lyndon words have facilitated many applications in mathematics and computer science, some of which are: constructing de Bruijn sequences, constructing bases in free Lie algebras, finding the lexicographically smallest or largest substring in a string, and succinct prefix matching of highly periodic strings; see Marcus et al. [2], the informative paper by Berstel and Perrin [3], and the references therein.

The pioneering work on Lyndon decomposition was already introduced by Chen, Fox, and Lyndon in [4]. The Lyndon decomposition theorem is not explicitly stated there; nevertheless, it follows from the work presented there.

Theorem 1 (Lyndon decomposition theorem, Chen+Fox+Lyndon, 1958). For any word $x$, there are unique Lyndon words $u_{1}, \dots, u_{k}$ so that $u_{i + 1} ⪯ u_{i}$ for any $1 \leq i < k$, and $x = u_{1} u_{2} \dots u_{k}$, where ⪯ denotes the lexicographic ordering.

As there exists a bijection between Lyndon words over an alphabet of cardinality k and irreducible polynomials over

F_{k}

[5], many results are known about this factorization: the average number of factors, the average length of the longest factor [6] and of the shortest [7]. Several algorithms deal with Lyndon factorization. Duval gave in [8] an elegant algorithm that computes, in linear time and in-place, the factorization of a word into Lyndon words. More about its implementation can be found in [9]. In [10], Fredricksen and Maiorana presented an algorithm generating all Lyndon words up to a given length in lexicographical order. This algorithm runs in a constant average time.
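As an illustration of Theorem 1 and of Duval's approach, the following Python sketch computes the Lyndon factorization of a string. It is a textbook rendering of Duval's algorithm written for this illustration only; the function name `duval_factorization` is ours and is not taken from [8] or [9], indexing is 0-based, and Python's built-in character comparison is assumed to be the order of the alphabet.

```python
def duval_factorization(s):
    """Return the Lyndon factors of s, in order (illustrative sketch)."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        j, k = i + 1, i
        # Scan while s[i:j+1] is still a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            # A strictly larger symbol restarts the period; an equal one extends it.
            k = i if s[k] < s[j] else k + 1
            j += 1
        # The scanned part is made of repetitions of a Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(duval_factorization("banana"))
```

The factors come out in lexicographically non-increasing order, and their concatenation is the input string.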

Two of the latest applications of Lyndon words are due to Bannai et al. In [11], they employed Lyndon roots of runs to prove the runs conjecture that the number of runs in a string is bounded by the length of the string. In the same paper, they presented an algorithm to compute all the runs in a string in linear time that requires the knowledge of all right-maximal Lyndon substrings of the input string with respect to an order of the alphabet and its inverse. The latter result was the major reason for our interest in computing the right-maximal Lyndon substrings of a given string. Though the terms word and string are interchangeable, and so are the terms subword and substring, in the following, we prefer to use exclusively the terms string and substring to avoid confusing the reader.

There are at least two reasons for having an efficient algorithm for identifying all right-maximal Lyndon substrings of a string: firstly, Bannai et al. published in 2017 [11] a linear algorithm to compute all runs in a string that depends on knowing all right-maximal Lyndon substrings of the input string, and secondly, in 2017, Franek et al. in [12] showed a linear equivalence of sorting suffixes and sorting right-maximal Lyndon substrings, inspired by Phase II of a suffix sorting algorithm introduced by Baier in 2015 (Master’s thesis [13]) and published in 2016 [14].

The most significant feature of the runs algorithm presented in [11] is that it relies on knowing the right-maximal Lyndon substrings of the input string for some order of the alphabet and for the inverse of that order, while all other linear algorithms for runs rely on Lempel–Ziv factorization of the input string. It also raised the issue about which approach may be more efficient: to compute the Lempel–Ziv factorization or to compute all right-maximal Lyndon substrings. There are several efficient linear algorithms for Lempel–Ziv factorization (e.g., see [15,16] and the references therein).

Interestingly, Kosolobov [17] showed that for a general alphabet, in the decision tree model, the runs problem is easier than the Lempel–Ziv decomposition. His result supports the conjecture that there must be a linear random access memory model algorithm finding all runs.

Baier introduced in [13], and published in [14], a new algorithm for suffix sorting. Though Lyndon strings were never mentioned in [13,14], it was noticed by Cristoph Diegelmann in a personal communication [18] that Phase I of Baier’s suffix sort identifies and sorts all right-maximal Lyndon substrings.

The right-maximal Lyndon substrings of a string

x = x [1 . . n]

can be best encoded in the so-called Lyndon array, introduced in [19] and closely related to the Lyndon tree of [20]: an integer array

L [1 . . n]

so that for any

i \in 1 . . n

,

L [i] =

the length of the right-maximal Lyndon substring starting at the position i.

In an overview [19], Franek et al. discussed an algorithm based on an iterative application of Duval’s Lyndon factorization algorithm [8], which we refer to here as IDLA, and an algorithmic scheme based on Hohlweg and Reutenauer’s work [20], which we refer to as SSLA. The authors were not aware of Baier’s algorithm at that time. Two additional algorithms were presented there, a quadratic recursive application of Duval’s algorithm and an algorithm NSV* with possibly $O(n \log n)$ worst-case complexity based on ranges that can be compared in constant time for constant alphabets. The correctness of NSV* and its complexity were discussed there just informally.

The algorithm IDLA (see Figure 1) is simple and in-place, so no additional space is required except for the storage for the string and the Lyndon array. It is completely independent of the alphabet of the string and does not require the alphabet to be sorted; all it requires is that the alphabet be ordered, i.e., only pairwise comparisons of the alphabet symbols are needed. Its weakness is its quadratic worst-case complexity, which becomes a problem for longer strings with long right-maximal Lyndon substrings, as one of our experiments showed (see Figure 11 in Section 7).

In our empirical work, we used IDLA as a control for comparison and as a verifier of the results. Note that the reason the procedure MaxLyn of Figure 1 really computes the longest Lyndon prefix is not obvious and is based on the properties of periods of prefixes; see [8] or Observation 6 and Lemma 11 in [19].
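The idea behind IDLA can be sketched in Python as follows. This is neither the procedure MaxLyn of Figure 1 nor the C++ code of [24]; it is a minimal illustration under our own assumptions: the names `max_lyndon_prefix_len` and `lyndon_array_idla` are hypothetical, indexing is 0-based, and the alphabet order is Python's built-in comparison. The first function runs only the first phase of Duval's scan to obtain the length of the longest Lyndon prefix of a suffix; the second applies it at every position, which gives the quadratic worst case.

```python
def max_lyndon_prefix_len(s, start):
    """Length of the longest Lyndon prefix of s[start:] (first phase of Duval's scan)."""
    n = len(s)
    j, k = start + 1, start
    while j < n and s[k] <= s[j]:
        k = start if s[k] < s[j] else k + 1
        j += 1
    # j - k is the period of the scanned prefix, i.e., the length of its first Lyndon factor.
    return j - k


def lyndon_array_idla(s):
    """Lyndon array (lengths) obtained by one scan per starting position."""
    return [max_lyndon_prefix_len(s, i) for i in range(len(s))]


print(lyndon_array_idla("00011001"))
```

Only pairwise comparisons of symbols are used, which is why this scheme works for any ordered alphabet.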

Lemma 1, below, characterizes right-maximal Lyndon substrings in terms of the relationships of the suffixes and follows from the work of Hohlweg and Reutenauer [20]. Though the definition of the proto-Lyndon substring is formally given in Section 2 below, it suffices to say that it means a prefix of a Lyndon substring of the string. The definition of the lexicographic ordering ≺ is also given in Section 2. The proof of this lemma is delayed to the end of Section 2, where all the technical terms needed are defined.

Lemma 1.

Consider a string $x [1 . . n]$ over an alphabet ordered by ≺.

A substring $x [i . . j]$ is proto-Lyndon if and only if:

(a): $x [i . . n] ≺ x [k . . n]$ for any $i < k \leq j$ .

A substring $x [i . . j]$ is right-maximal Lyndon if and only if:

(b): $x [i . . j]$ is proto-Lyndon and
(c): either $j = n$ or $x [j + 1 . . n] ≺ x [i . . n]$ .

Thus, the Lyndon array is an NSV (Next Smaller Value) array of the inverse suffix array. Consequently, the Lyndon array can be computed by sorting the suffixes, i.e., computing the suffix array, then computing the inverse suffix array, and then applying NSV to it; see [19]. Computing the inverse suffix array and applying NSV are “naturally” linear, and computing the suffix array can be implemented to be linear; see [19,21] and the references therein. The execution and space characteristics are dominated by those of the first step, i.e., computation of the suffix array. We refer to this scheme as SSLA.
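The SSLA scheme can be sketched as follows. The sketch is illustrative only: the function name `lyndon_array_via_nsv` is hypothetical, and the suffix array is obtained by naively sorting the suffixes with Python's `sorted`, which is not linear and merely stands in for a linear suffix sorter. Python's string comparison treats a proper prefix as smaller, which plays the role of the sentinel $.

```python
def lyndon_array_via_nsv(s):
    """Lyndon array (lengths) as the next-smaller-value array of the inverse suffix array."""
    n = len(s)
    # Step 1: suffix array (naive stand-in for a linear suffix sort).
    sa = sorted(range(n), key=lambda i: s[i:])
    # Step 2: inverse suffix array, isa[pos] = rank of the suffix starting at pos.
    isa = [0] * n
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    # Step 3: next smaller value, scanning right to left with a stack of positions.
    lyn = [0] * n
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and isa[stack[-1]] > isa[i]:
            stack.pop()
        nxt = stack[-1] if stack else n  # first position to the right with a smaller suffix
        lyn[i] = nxt - i
        stack.append(i)
    return lyn


print(lyndon_array_via_nsv("00011001"))
```

Steps 2 and 3 are linear as written; the overall cost of this sketch is dominated by the naive sort in Step 1.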

In 2018, a linear algorithm to compute the Lyndon array from a given Burrows–Wheeler transform was presented [22]. Since the Burrows–Wheeler transform is computed in linear time from the suffix array, it is yet another scheme of how to obtain the Lyndon array via suffix sorting: compute the suffix array; from the suffix array, compute the Burrows–Wheeler transform, then compute the Lyndon array during the inversion of the Burrows–Wheeler transform. We refer to this scheme as BWLA.

The introduction of Baier’s suffix sort in 2015 and the subsequent discovery of its connection to right-maximal Lyndon substrings showed that there was an elementary algorithm to compute the Lyndon array (elementary in the sense of not relying on a pre-processed global data structure such as a suffix array or a Burrows–Wheeler transform), and that this algorithm, despite its original clumsiness, could eventually be refined to outperform any SSLA or BWLA implementation: any implementation of a suffix sorting-based scheme requires a full suffix sort and then some additional processing, while Baier’s approach is “just” a partial suffix sort; see [23].

In this work, we present two additional algorithms for the Lyndon array not discussed in [19]. The C++ source code of the three implementations IDLA, TRLA, and BSLA is available; see [24]. Note that the procedure IDLA is in the lynarr.hpp file.

The first algorithm presented here is TRLA. TRLA is a $τ$-reduction based Lyndon array algorithm that follows Farach’s approach used in his remarkable linear algorithm for suffix tree construction [25] and reproduced very successfully in all linear algorithms for suffix sorting (e.g., see [21,26] and the references therein). Farach’s approach follows the Cooley–Tukey algorithm for the fast Fourier transform relying on recursion to lower the quadratic complexity to $O(n \log n)$ complexity; see [27]. TRLA was first introduced by the authors in 2019 (see [28,29]) and presented as a part of Liut’s Ph.D. thesis [30].

The second algorithm, BSLA, is a Baier’s sort-based Lyndon array algorithm. BSLA is based on the idea of Phase I of Baier’s suffix sort, though our implementation necessarily differs from Baier’s. BSLA was first introduced at the Prague Stringology Conference 2018 [23] and also presented as a part of Liut’s Ph.D. thesis [30] in 2019; here, we present a complete and refined theoretical analysis of the algorithm and a more efficient implementation than that initially introduced.

The paper is structured as follows: In Section 2, the basic notions and terminology are presented. In Section 3, the TRLA algorithm is presented and analysed. In Section 4, the BSLA algorithm is presented and analysed. In Section 5, the datasets with random strings of various lengths and over various alphabets and other datasets used in the empirical tests are described. In Section 6, the conclusion of the research and the future work are presented. The results of the empirical measurements of the performance of IDLA, TRLA, and BSLA on those datasets are presented in Section 7 in both tabular and graphical forms.

2. Basic Notation and Terminology

Most of the fundamental notions, definitions, facts, and string algorithms can be found in [31,32,33,34]. For the ease of access, this section includes those that are directly related to the work herein.

The set of integers is denoted by

Z

. For two integers

i \leq j

, the range

i . . j = {k \in Z | i \leq k \leq j}

. An alphabet is a finite or infinite set of symbols, also called letters. We always assume that the sentinel symbol $ is not in the alphabet and that it is lexicographically smaller than every symbol of the alphabet. A string over an alphabet

A

is a finite sequence of symbols from

A

. A $-terminated string over

A

is a string over

A

terminated by $. We use the array notation indexing from 1 for strings; thus,

x [1 . . n]

indicates a string of length n; the first symbol is the symbol with index 1, i.e.,

x [1]

; the second symbol is the symbol with index 2, i.e.,

x [2]

, etc. Thus,

x [1 . . n] = x [1] x [2] \dots x [n]

. For a $-terminated string

x

of length n,

x [n + 1] = $

. The alphabet of string

x

, denoted as

A_{x}

, is the set of all distinct alphabet symbols occurring in

x

.

We use the term strings over a constant alphabet if the alphabet is a fixed finite alphabet. The integer alphabet is the infinite alphabet

A = {0, 1, 2, \dots}

. We use the term strings over integer alphabet for the strings over the alphabet

{0, 1, 2, \dots}

with an additional constraint that all letters occurring in the string are all smaller than the length of the string, i.e., in this paper,

x [1 . . n]

is a string over the integer alphabet if it is a string over the alphabet

{0, 1, \dots n - 1}

. Many authors use a more general definition; for instance, Burkhardt and Kärkkäinen [35] defined it as any set of integers of size

n^{O (1)}

; however, our results can easily be adapted to such more general definitions without changing their essence.

We use a bold font to denote strings; thus, $\mathbf{x}$ denotes a string, while $x$ denotes some other mathematical entity such as an integer. The empty string is denoted by

ε

and has length zero. The length or size of string

x = x [1 . . n]

is n. The length of a string

x

is denoted by

| x |

. For two strings

x = x [1 . . n]

and

y = y [1 . . m]

, the concatenation

x y

is a string

u

where

u [i] = \begin{cases} x [i] & \text{for } i \leq n, \\ y [i - n] & \text{for } n < i \leq n + m . \end{cases}

If

x = u v w

, then

u

is a prefix,

v

a substring, and

w

a suffix of

x

. If

u

(respectively,

v

,

w

) is empty, then it is called an empty prefix (respectively, empty substring, empty suffix); if

| u | < | x |

(respectively,

| v | < | x |

,

| w | < | x |

), then it is called a proper prefix (respectively, proper substring, proper suffix). If

x = u v

, then

vu

is called a rotation or a conjugate of

x

; if either

u = ε

or

v = ε

, then the rotation is called trivial. A non-empty string

x

is primitive if there is no string

y

and no integer

k \geq 2

so that

x = y^{k} = \underbrace{y y \dots y}_{k \text{ times}}

.

A non-empty string

x

has a non-trivial border

u

if

u

is both a non-empty proper prefix and a non-empty proper suffix of

x

. Thus, both

ε

and

x

are trivial borders of

x

. A string without a non-trivial border is called unbordered.

Let ≺ be a total order of an alphabet

A

. The order is extended to all finite strings over the alphabet

A

: for

x = x [1 . . n]

and

y = y [1 . . m]

,

x ≺ y

if either

x

is a proper prefix of

y

or there is a

j \leq min {n, m}

so that

x [1] = y [1]

, ...,

x [j - 1] = y [j - 1]

and

x [j] ≺ y [j]

. This total order induced by the order of the alphabet is called a lexicographic order of all non-empty strings over

A

. We denote by

x ⪯ y

if either

x ≺ y

or

x = y

. A string

x

over

A

is Lyndon for a given order ≺ of

A

if

x

is strictly lexicographically smaller than any non-trivial rotation of

x

. In particular:

x is Lyndon ⇒ x is unbordered ⇒ x is primitive

Note that the reverse implications do not hold:

a b a

is primitive but neither unbordered, nor Lyndon, while

a c a a b

is unbordered, but not Lyndon. A substring

x [i . . j]

of

x [1 . . n]

,

1 \leq i \leq j \leq n

is a right-maximal Lyndon substring of

x

if it is Lyndon and either

j = n

or for any

k > j

,

x [i . . k]

is not Lyndon.
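The notions of primitive, unbordered, and Lyndon strings can be checked directly from their definitions. The brute-force Python sketch below is meant only to make the definitions and the two counterexamples concrete; the function names are ours, and Python's built-in string comparison is assumed to be the lexicographic order induced by the alphabet order.

```python
def is_primitive(x):
    """True if the non-empty string x is not a proper power y^k with k >= 2."""
    n = len(x)
    return n > 0 and all(x != x[:d] * (n // d) for d in range(1, n) if n % d == 0)


def is_unbordered(x):
    """True if no non-empty proper prefix of x is also a suffix of x."""
    return all(x[:k] != x[-k:] for k in range(1, len(x)))


def is_lyndon(x):
    """True if x is strictly smaller than each of its non-trivial rotations."""
    return len(x) > 0 and all(x < x[k:] + x[:k] for k in range(1, len(x)))


for word in ("aba", "acaab", "aab"):
    print(word, is_primitive(word), is_unbordered(word), is_lyndon(word))
```

These checks are quadratic or worse and are intended for small examples only.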

A substring

x [i . . j]

of a string

x [1 . . n]

is proto-Lyndon if there is a

j \leq k \leq n

so that

x [i . . k]

is Lyndon. The Lyndon array of a string

x = x [1 . . n]

is an integer array

L [1 . . n]

so that

L [i] = j

where

j \leq n - i + 1

is a maximal integer such that

x [i . . i + j - 1]

is Lyndon. Alternatively, we can define it as an integer array

L^{'} [1 . . n]

so that

L^{'} [i] = j

where j is the last position of the right-maximal Lyndon substring starting at the position i. The relationship between those two definitions is straightforward:

L^{'} [i] = L [i] + i - 1

or

L [i] = L^{'} [i] - i + 1

.
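Both forms of the Lyndon array, and the conversion between them, can be illustrated by a definition-based Python sketch. It is a brute-force reference written for small examples, not one of the algorithms discussed in this paper; the function names are ours. The code stores the arrays in 0-based Python lists, while the values of $L^{'}$ are 1-based end positions as in the text.

```python
def is_lyndon(x):
    """True if x is strictly smaller than each of its non-trivial rotations."""
    return len(x) > 0 and all(x < x[k:] + x[:k] for k in range(1, len(x)))


def lyndon_array_by_definition(x):
    """L: for each start, the largest length j such that x[start:start+j] is Lyndon."""
    n = len(x)
    return [max(j for j in range(1, n - i + 1) if is_lyndon(x[i:i + j]))
            for i in range(n)]


def to_end_positions(L):
    """L'[i] = L[i] + i - 1 for 1-based positions i."""
    return [L[i - 1] + i - 1 for i in range(1, len(L) + 1)]


def to_lengths(L_end):
    """L[i] = L'[i] - i + 1 for 1-based positions i."""
    return [L_end[i - 1] - i + 1 for i in range(1, len(L_end) + 1)]


L = lyndon_array_by_definition("00011001")
print(L, to_end_positions(L))
```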

Proof of Lemma 1.

We first prove the following claim:

Claim. Substring

x [i . . j]

is right-maximal Lyndon if and only if:

$(b^{'})$: $x [i . . n] ≺ x [k . . n]$ for any $i < k \leq j$ and
$(c)$: either $j = n$ or $x [j + 1 . . n] ≺ x [i . . n]$ .

Let

x [i . . j]

be right-maximal Lyndon. We need to show that

(b^{'})

and

(c)

hold.

Let $i < k \leq j$ . Since $x [i . . j]$ is Lyndon, $x [i . . j] ≺ x [k . . j]$ . Thus, there is $r \geq 0$ so that $i + r \leq j$ and $k + r \leq j$ and $x [i + ℓ] = x [k + ℓ]$ for any $0 \leq ℓ < r$ and $x [i + r] ≺ x [k + r]$ . It follows that $x [i . . n] ≺ x [k . . n]$ , and $(b^{'})$ is satisfied.

If

j = n

, then

(c)

holds. Therefore, let us assume that

j < n

.

(1)

If

x [i] ≺ x [j + 1]

, then

x [i . . j + 1] ≺ x [j + 1 . . j + 1]

; together with

x [i . . j] ≺ x [k . . j]

for any

i < k \leq j

, it shows that

x [i . . j + 1]

is Lyndon, contradicting the right-maximality of

x [i . . j]

.

(2)

If

x [i] ≻ x [j + 1]

, then

x [i . . n] ≻ x [j + 1 . . n]

, and

(c)

holds.

(3)

If

x [i] = x [j + 1]

, then there are a prefix

u

, strings

v

and

w

, and an integer

r \geq 1

so that

u v = x [i . . j]

,

x [i . . n] = u v u^{r} w

, and

x [j + 1 . . n] = u^{r} w

. Let us take a maximal such

u

and a maximal such r.

(3a): Let $r \geq 2$ . Then, since $u v = x [i . . j]$ is Lyndon, $u ≺ v$ , and so, $u u ≺ u v$ ; hence, $u^{r} w ≺ u v u^{r} w$ , and $(c)$ holds.
(3b): Let $r = 1$ , i.e., $x [i . . n] = u v u w$ and $x [j + 1 . . n] = u w$ . There is no common prefix of $v$ and $w$ as it would contradict the maximality of $u$ . There are thus two mutually exclusive cases, either $v [1] ≺ w [1]$ or $v [1] ≻ w [1]$ . Let $v [1] = c_{1}$ and $w [1] = c_{2}$ . If $c_{1} ≺ c_{2}$ , then $u c_{1} ≺ u c_{2}$ , and so, $u v ≺ u c_{2}$ . For any $u = u_{1} u_{2}$ , $u v ≺ u_{2} v$ as $uv$ is Lyndon, so $u v ≺ u_{2} c_{2}$ , giving $u v c_{2}$ to be Lyndon, a contradiction with the right-maximality of $u v$ . Therefore, $c_{1} ≻ c_{2}$ ; thus, $u w ≺ u v u w$ , and $(c)$ holds.

Now, we go in the opposite direction. Assuming

(b^{'})

and

(c)

, we need to show that

x [i . . j]

is right-maximal Lyndon.

Consider the right-maximal Lyndon substring of

x

starting at the position i; it is

x [i . . ℓ]

for some

i \leq ℓ \leq n

.

(1): If $ℓ < j$ , then by the first part of this proof, $x [i . . n] ≻ x [ℓ + 1 . . n]$ , contradicting the assumption $(b^{'})$ as $ℓ + 1 \leq j$ .
(2): If $j < ℓ$ , then $j + 1 \leq ℓ$ , and by the first part of this proof, $x [i . . n] ≺ x [j + 1 . . n]$ , contradicting the assumption $(c)$ .
(3): If $ℓ = j$ , then $x [i . . j]$ is right-maximal Lyndon.

Now, we can prove

(a)

. Let

x [i . . j]

be a proto-Lyndon substring of

x

. By definition, it is a prefix of a Lyndon substring of

x

and, hence, a prefix of a right-maximal Lyndon substring of

x

, say

x [i . . ℓ]

for some

j \leq ℓ \leq n

. It follows from the claim that

x [i . . n] ≺ x [k . . n]

for any

i < k \leq ℓ

. Since

j \leq ℓ

,

(a)

holds. For the opposite direction, if

(a)

holds, then there are two possibilities: either there is a smallest

j < ℓ \leq n

so that

x [i . . n] ≻ x [ℓ . . n]

or

x [i . . n] ≺ x [k . . n]

for any

i < k \leq n

. By the claim, in the former case,

x [i . . ℓ - 1]

is a right-maximal Lyndon substring of

x

, while in the latter case,

x [i . . n]

is a right-maximal Lyndon substring of

x

. Thus, in both cases,

x [i . . j]

is a prefix of a Lyndon substring of

x

.

With

(a)

proven, we can now replace

(b^{'})

in the claim with

(b)

, completing the proof. □

3. $τ$ -Reduction Algorithm (TRLA)

The purpose of this section is to introduce a recursive algorithm TRLA for computing the Lyndon array of a string. As will be shown below, the most significant aspect is the so-called $τ$-reduction of a string and how the Lyndon array of the $τ$-reduced string can be expanded to a partially filled Lyndon array for the whole string, as well as how to compute the missing values. This section thus provides the mathematical justification for the algorithm and, in so doing, proves the correctness of the algorithm. The mathematical understanding of the algorithm provides the basis for bounding its worst-case complexity by $O(n \log n)$ and for determining the linearity of the average-case complexity.

The first idea of the algorithm was proposed in Paracha’s 2017 Ph.D. thesis [36]. It follows Farach’s approach [25]:

(1): reduce the input string $x$ to $y$ ;
(2): by recursion, compute the Lyndon array of $y$ ; and
(3): from the Lyndon array of $y$ , compute the Lyndon array of $x$ .

The input strings for the algorithm are $-terminated strings over an integer alphabet. The reduction computed in (1) is important. All linear algorithms for suffix array computations use the proximity property of suffixes: comparing

x [i . . n]

and

x [j . . n]

can be done by comparing

x [i]

and

x [j]

and, if they are the same, comparing

x [i + 1 . . n]

with

x [j + 1 . . n]

. For instance, in the first linear algorithm for the suffix array by Kärkkäinen and Sanders [37], obtaining the sorted suffixes for positions $i \equiv 0 \pmod{3}$ and $i \equiv 1 \pmod{3}$ via the recursive call is sufficient to determine the order of suffixes for the $i \equiv 2 \pmod{3}$ positions, then merging both lists together. However, there is no such proximity property for right-maximal Lyndon substrings, so the reduction itself must have a property that helps determine some of the values of the Lyndon array of

x

from the Lyndon array of

y

and computing the rest.

In our algorithm, we use a special reduction, which we call $τ$-reduction, defined in Section 3.2, that reduces the original string to at least $\frac{1}{2}$ and at most $\frac{2}{3}$ of its length. The algorithm computes

y

as a

τ

-reduction of the input string

x

in Step (1) in linear time. In Step (3), it expands the Lyndon array of the reduced string computed by Step (2) to an incomplete Lyndon array of the original string also in linear time. The incomplete Lyndon array computed in (3) is about

\frac{1}{2}

to

\frac{2}{3}

full, and for every position i with an unknown value, the values at positions

i - 1

and

i + 1

are known. In particular, the values at position 1 and position n are both known. Therefore, much information is provided by the recursive Step (2). For instance, for 00011001, via the recursive call, we would identify the right-maximal Lyndon substrings that are underlined in

\underline{00 \underline{01 \underline{1}} \underline{00 \underline{1}}}

and would need to compute the missing right-maximal Lyndon substrings that are underlined in

0 \underline{00 \underline{1} 1} 0 \underline{01}

.

However, computing the missing values of the incomplete Lyndon array takes at most

O (n log (n))

steps, as we will show, resulting in the overall worst-case complexity of

O (n log (n))

. When the input string is such that the missing values of the incomplete Lyndon array of the input string can be computed in linear time, the overall execution of the algorithm is linear as well, and thus, the average case complexity will be shown to be linear in the length of the input string.

In the following subsections, we describe the

τ

-reduction in several steps: first, the

τ

-pairing, then choosing the

τ

-alphabet, and finally, the computation of the

τ (x)

. The

τ

-reduction may be of some general interest as it preserves (see Lemma 6) some right-maximal Lyndon substrings of the original string.

3.1. $τ$ -Pairing

Consider a $-terminated string

x = x [1 . . n]

whose alphabet

A_{x}

is ordered by ≺ where

x [n + 1] = $

and

$ ≺ a

for any

a \in A_{x}

. A $τ$ -pair consists of a pair of adjacent positions from the range

1 . . n + 1

. The

τ

-pairs are computed by induction:

(1): the initial $τ$ -pair is $(1, 2)$ ;
(2): if $(i - 1, i)$ is the last $τ$ -pair computed, then:
    if $i = n - 1$ then
        the next $τ$ -pair is set to $(n, n + 1)$
        stop
    elseif $i \geq n$ then
        stop
    elseif $x [i - 1] ≻ x [i]$ and $x [i] ⪯ x [i + 1]$ then
        the next $τ$ -pair is set to $(i, i + 1)$ ; repeat (2).
    else
        the next $τ$ -pair is set to $(i + 1, i + 2)$ ; repeat (2).
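The inductive rule above can be sketched in Python as follows. This is an illustration under our own assumptions, not the code of [24]: the function name `tau_pairs` is hypothetical, the input is a non-empty sequence without the sentinel, and the returned positions are 1-based as in the text. The example string 011023122 is the one determined by the symbol $τ$-pairs quoted for the running example of Figure 2.

```python
def tau_pairs(x):
    """Return the tau-pairs of x as a list of 1-based position pairs (i, i + 1)."""
    n = len(x)
    pairs = [(1, 2)]
    while True:
        i = pairs[-1][1]          # second position of the last tau-pair
        if i == n - 1:
            pairs.append((n, n + 1))
            break
        if i >= n:
            break
        # Here i <= n - 2, so x[i-1], x[i], x[i+1] (1-based) are all real symbols.
        if x[i - 2] > x[i - 1] and x[i - 1] <= x[i]:
            pairs.append((i, i + 1))      # overlapping pair at a local minimum
        else:
            pairs.append((i + 1, i + 2))  # next disjoint pair
    return pairs


print(tau_pairs([0, 1, 1, 0, 2, 3, 1, 2, 2]))
```

The black positions are the first components of the returned pairs; every other position is white.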

Every position of the input string that occurs in some

τ

-pair as the first element is labelled black; all others are labelled white. Note that Position 1 is always black, while the last position n can be either black or white; however, the positions

n - 1

and n cannot be simultaneously both black. Note also that most of the

τ

-pairs do not overlap; if two

τ

-pairs overlap, they overlap in a position i such that

1 < i < n

and

x [i - 1] ≻ x [i]

and

x [i] ⪯ x [i + 1]

. The first position and the last position never figure in an overlap of

τ

-pairs. Moreover, a

τ

-pair can be involved in at most one overlap; for an illustration, see Figure 2; for the formal proof see Lemma 2.

Lemma 2.

Let

(i_{1}, i_{1} + 1) \dots (i_{k}, i_{k} + 1)

be the $τ$-pairs of a string

x = x [1 . . n]

. Then, for any

j, ℓ \in 1 . . k

:

(1): if $| (i_{j}, i_{j} + 1) \cap (i_{ℓ}, i_{ℓ} + 1) | = 1$ , then for any $m \neq j, ℓ$ , $| (i_{j}, i_{j} + 1) \cap (i_{m}, i_{m} + 1) | = 0$ ,
(2): $| (i_{j}, i_{j} + 1) \cap (i_{ℓ}, i_{ℓ} + 1) | \leq 1$ .

Proof.

This is by induction; trivially true for

| x | = 1

as

(1, 2)

is the only

τ

-pair. Assume it is true for

| x | \leq n - 1

.

Case $(i_{k}, i_{k} + 1) = (n, n + 1)$ :
Then, $(i_{k - 1}, i_{k - 1} + 1) = (n - 2, n - 1)$ , and so, $(i_{1}, i_{1} + 1) \dots (i_{k - 1}, i_{k - 1} + 1)$ are $τ$ -pairs of $x [1 . . n - 1]$ ; thus, they satisfy (1) and (2) by the induction hypothesis. However, $(n, n + 1) \cap (i_{ℓ}, i_{ℓ} + 1) = \emptyset$ for $1 \leq ℓ < k$ , so (1) and (2) hold for $(i_{1}, i_{1} + 1) \dots (i_{k}, i_{k} + 1)$ .
Case $(i_{k}, i_{k} + 1) = (n - 1, n)$ and $(i_{k - 1}, i_{k - 1} + 1) = (n - 2, n - 1)$ :
Then, $(i_{1}, i_{1} + 1) \dots (i_{k - 1}, i_{k - 1} + 1)$ are $τ$ -pairs of $x [1 . . n - 1]$ , and thus, they satisfy (1) and (2) by the induction hypothesis. However, $(i_{k}, i_{k} + 1) \cap (i_{ℓ}, i_{ℓ} + 1) = \emptyset$ for $1 \leq ℓ < k - 1$ , and $(i_{k}, i_{k} + 1) \cap (i_{k - 1}, i_{k - 1} + 1) = \{i_{k}\} = \{n - 1\}$ ; so, $| (i_{k}, i_{k} + 1) \cap (i_{k - 1}, i_{k - 1} + 1) | \leq 1$ , and so, (1) and (2) hold for $(i_{1}, i_{1} + 1) \dots (i_{k}, i_{k} + 1)$ .
Case $(i_{k}, i_{k} + 1) = (n - 1, n)$ and $(i_{k - 1}, i_{k - 1} + 1) = (n - 3, n - 2)$ :
Then, $(i_{1}, i_{1} + 1) \dots (i_{k - 1}, i_{k - 1} + 1)$ are $τ$ -pairs of $x [1 . . n - 2]$ , so they satisfy (1) and (2) by the induction hypothesis. However, $(i_{k}, i_{k} + 1) \cap (i_{ℓ}, i_{ℓ} + 1) = \emptyset$ for $1 \leq ℓ < k$ , so (1) and (2) hold for $(i_{1}, i_{1} + 1) \dots (i_{k}, i_{k} + 1)$ .

□

3.2. $τ$ -Reduction

For each

τ

-pair

(i, i + 1)

, we consider the pair of alphabet symbols

(x [i], x [i + 1])

. We call them symbol $τ$ -pairs. They are in a total order ⊲ induced by ≺ :

(x [i_{j}], x [i_{j} + 1]) ⊲ (x [i_{ℓ}], x [i_{ℓ} + 1])

if either

x [i_{j}] ≺ x [i_{ℓ}]

, or

x [i_{j}] = x [i_{ℓ}]

and

x [i_{j} + 1] ≺ x [i_{ℓ} + 1]

. They are sorted using the radix sort with a key of size two and assigned letters from a chosen

τ

-alphabet that is a subset of

{0, 1, \dots, | τ (x) |}

so that the assignment preserves the order. Since the input string is over an integer alphabet, the radix sort is linear.

In the example (Figure 2), the

τ

-pairs are

(1, 2) (3, 4) (4, 5) (6, 7) (7, 8) (9, 10)

, and so, the symbol

τ

-pairs are

(0, 1) (1, 0) (0, 2) (3, 1) (1, 2) (2, $)

. The sorted symbol

τ

-pairs are

(0, 1) (0, 2) (1, 0)

(1, 2) (2, $) (3, 1)

. Thus, we chose as our

τ

-alphabet

{0, 1, 2, 3, 4, 5}

, and so, the symbol

τ

-pairs are assigned these letters:

(0, 1) \to 0

,

(0, 2) \to 1

,

(1, 0) \to 2

,

(1, 2) \to 3

,

(2, $) \to 4

, and

(3, 1) \to 5

. Note that the assignments respect the order ⊲ of the symbol

τ

-pairs and the natural order < of

{0, 1, 2, 3, 4, 5}

.

The

τ

-letters are substituted for the symbol

τ

-pairs, and the resulting string is terminated with $. This string is called the $τ$ -reduction of

x

and denoted

τ (x)

, and it is a $-terminated string over an integer alphabet. For our running example from Figure 2,

τ (x) = 021534

. The next lemma justifies calling the above transformation a reduction.

Lemma 3.

For any string

x

,

\frac{1}{2} | x | \leq | τ (x) | \leq \frac{2}{3} | x |

.

Proof.

There are two extreme cases: the first is when no two $τ$-pairs overlap, in which case $| τ (x) | = \frac{1}{2} | x |$ ; the second is when all the $τ$-pairs overlap, in which case $| τ (x) | = \frac{2}{3} | x |$ . Any other case must be in between. □
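The computation of $τ (x)$ can be sketched in Python as follows. The sketch makes several assumptions that are ours and not the paper's implementation: the names `tau_pairs` and `tau_reduce` are hypothetical, the symbols are non-negative integers so that -1 can stand for the sentinel $, the $τ$-alphabet is chosen as the ranks 0, 1, ... of the distinct symbol $τ$-pairs, and Python's `sorted` stands in for the linear radix sort with a key of size two.

```python
def tau_pairs(x):
    """tau-pairs of x as 1-based position pairs, following the rule of Section 3.1."""
    n, pairs = len(x), [(1, 2)]
    while True:
        i = pairs[-1][1]
        if i == n - 1:
            pairs.append((n, n + 1))
            break
        if i >= n:
            break
        if x[i - 2] > x[i - 1] and x[i - 1] <= x[i]:
            pairs.append((i, i + 1))
        else:
            pairs.append((i + 1, i + 2))
    return pairs


def tau_reduce(x):
    """Return tau(x) as a list of tau-letters (without the terminating sentinel)."""
    ext = list(x) + [-1]                      # -1 plays the role of the sentinel $
    symbol_pairs = [(ext[i - 1], ext[j - 1]) for i, j in tau_pairs(x)]
    # Order-preserving assignment of letters to the distinct symbol tau-pairs.
    letter = {q: r for r, q in enumerate(sorted(set(symbol_pairs)))}
    return [letter[q] for q in symbol_pairs]


x = [0, 1, 1, 0, 2, 3, 1, 2, 2]               # running example 011023122
print(tau_reduce(x), len(tau_reduce(x)), len(x))
```

For the running example, the reduced string has 6 letters for an input of length 9, i.e., the upper extreme of two thirds.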

Let

B (x)

denote the set of all black positions of

x

. For any

i \in 1 . . | τ (x) |

,

b (i) = j

where j is a black position in

x

of the

τ

-pair corresponding to the new symbol in

τ (x)

at position i, while

t (j)

assigns each black position of

x

the position in

τ (x)

where the corresponding new symbol is, i.e.,

b (t (j)) = j

and

t (b (i)) = i

. Thus,

1 . . | τ (x) | ⇄_{t}^{b} B (x)

In addition, we define p as the mapping of the

τ

-pairs to the

τ

-alphabet.

In our running example from Figure 2,

t (1) = 1

,

t (3) = 2

,

t (4) = 3

,

t (6) = 4

,

t (7) = 5

, and

t (9) = 6

, while

b (1) = 1

,

b (2) = 3

,

b (3) = 4

,

b (4) = 6

,

b (5) = 7

, and

b (6) = 9

. For the letter mapping, we get

p (1, 2) = 0

,

p (3, 4) = 2

,

p (4, 5) = 1

,

p (6, 7) = 5

,

p (7, 8) = 3

, and

p (9, 10) = 4

.
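The running example can be reproduced with a short Python sketch. Several things here are assumptions rather than the authors' code: the string $x = 011023122$ is reconstructed from the symbol $τ$-pairs listed above; the rule for choosing the next black position is inferred from the case analyses in this section (after a black i, position i + 1 is black exactly when $x [i] ≻ x [i + 1] ⪯ x [i + 2]$, otherwise i + 2 is); $ is modelled by -1; and Python's `sorted` stands in for the radix sort. The names `tau_reduce` and `SENTINEL` are hypothetical.

```python
SENTINEL = -1  # stands in for $, assumed smaller than every letter


def tau_reduce(x):
    """Return (tau_x, b, t) for a string x given as a list of ints >= 0.

    Positions are 1-based as in the text. b maps a position of tau(x) to
    the black position of x it came from; t is the inverse map.
    """
    n = len(x)
    s = [None] + list(x) + [SENTINEL]  # s[1..n] = x, s[n+1] = $

    # Black positions: the starting positions of the tau-pairs (i, i+1).
    blacks = []
    i = 1
    while i <= n:
        blacks.append(i)
        # Assumed rule: after a black i, position i+1 is black exactly
        # when x[i] > x[i+1] <= x[i+2]; otherwise the next black is i+2.
        if i + 1 <= n and s[i] > s[i + 1] <= s[i + 2]:
            i += 1
        else:
            i += 2

    # Rank the distinct symbol pairs; sorted() replaces the radix sort.
    symbol_pairs = [(s[k], s[k + 1]) for k in blacks]
    letter = {pair: rank for rank, pair in enumerate(sorted(set(symbol_pairs)))}

    tau_x = [letter[pair] for pair in symbol_pairs]
    b = {pos: k for pos, k in enumerate(blacks, start=1)}
    t = {k: pos for pos, k in b.items()}
    return tau_x, b, t


tau_x, b, t = tau_reduce([0, 1, 1, 0, 2, 3, 1, 2, 2])
```

Under these assumptions, `tau_x` spells 021534 and `b`, `t` agree with the values of b and t listed for the running example.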

3.3. Properties Preserved by $τ$ -Reduction

The most important property of

τ

-reduction is a preservation of right-maximal Lyndon substrings of

x

that start at black positions. This means there is a closed formula that gives, for every right-maximal Lyndon substring of

τ (x)

, a corresponding right-maximal Lyndon substring of

x

. Moreover, the formula for any black position can be computed in constant time. It is simpler to present the following results using

L^{'}

, the alternative form of the Lyndon array, the one where the end positions of right-maximal Lyndon substrings are stored rather than their lengths. More formally:

Theorem 2.

Let

x = x [1 . . n]

; let

L_{τ (x)}^{'} [1 . . m]

be the Lyndon array of

τ (x)

; and let

L_{x}^{'} [1 . . n]

be the Lyndon array of

x

.

Then, for any black

i \in 1 . . n

,

L_{x}^{'} [i] = \begin{cases} b (r) & \text{if } b (r) = n \text{ or } x [b (r) + 1] ⪯ x [i] \\ b (r) + 1 & \text{otherwise,} \end{cases}

where

r = L_{τ (x)}^{'} [t (i)]

.

The proof of the theorem requires a series of lemmas that are presented below. First, we show that

τ

-reduction preserves the relationships of certain suffixes of

x

.

Lemma 4.

Let

x = x [1 . . n]

, and let

τ (x) = τ (x) [1 . . m]

. Let

i \neq j

and

1 \leq i, j \leq n

. If i and j are both black positions, then

x [i . . n] ≺ x [j . . n]

implies

τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]

.

Proof.

Since i and j are both black positions, both

t (i)

and

t (j)

are defined, and

t (i) \neq t (j)

. Let us assume that

x [i . . n] ≺ x [j . . n]

. The proof is argued in several cases determined by the nature of the relationship

x [i . . n] ≺ x [j . . n]

.

(1)

Case:

x [i . . n]

is a proper prefix of

x [j . . n]

.

Then,

| x [i . . n] | = n - i + 1 < | x [j . . n] | = n - j + 1

, and so,

j < i

. It follows that

x [j . . j + n - i] = x [i . . n]

, and thus,

x [i . . n]

is a border of

x [j . . n]

.

(1a)

Case:

j + n - i

is black.

Since n may be either black or white, we need to discuss two cases.

(1a $α$ ): Case: n is white.
Since n is white, the last $τ$ -pair of $x$ must be $(n - 1, n)$ . The $τ$ -pairs of $x [j . . j + n - i]$ must be the same as the $τ$ -pairs of $x [i . . n]$ ; the last $τ$ -pair of $x [j . . j + n - i]$ must be $(j + n - i - 1, j + n - i)$ . Since $j + n - i$ is black by our assumption (1a), the next $τ$ -pair of $x$ must be $(j + n - i, j + n - i + 1)$ , as indicated in the following diagram:

Thus, $τ (x) [t (j) . . t (j + n - i - 1)] = τ (x) [t (i) . . t (n - 1)]$ . Since $t (n - 1) = m$ , we have $τ (x) [t (j) . . t (j + n - i - 1)] = τ (x) [t (i) . . m]$ , and so, $τ (x) [t (i) . . m]$ is a proper prefix of $τ (x) [t (j) . . m]$ giving $τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]$ .
(1a $β$ ): Case: n is black.
Then, the last $τ$-pair of $x$ must be $(n, n + 1)$, and since $j + n - i$ is black by our assumption (1a), the $τ$-pair of $x$ starting at the last position of $x [j . . j + n - i]$ is $(j + n - i, j + n - i + 1)$; since $n - 1$ cannot be black when n is, the situation is as indicated in the following diagram:

Thus, $τ (x) [t (i) . . t (n - 2)] = τ (x) [t (j) . . t (j + n - i - 2)]$. Since $x [j + n - i] = x [n]$ and $(x [n], x [n + 1]) = (x [n], \$)$, where $ is smaller than any letter, we have $(x [n], x [n + 1]) ⊲ (x [j + n - i], x [j + n - i + 1])$, and so, $τ (x) [t (n)] ≺ τ (x) [t (j + n - i)]$, giving $τ (x) [t (i) . . t (n)] ≺ τ (x) [t (j) . . t (j + n - i)]$. Since $t (n) = m$, we have $τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]$.

(1b)

Case:

j + n - i

is white.

Then,

j + n - i - 1

is black; hence,

n - 1

is black; so, n must also be white, and thus,

τ (x) [t (j) . . t (j + n - i - 1)] = τ (x) [t (i) . . t (n - 1)]

, as indicated by the following diagram:

Since

t (n - 1) = m

, we have

τ (x) [t (j) . . t (j + n - i - 1)] = τ (x) [t (i) . . m]

, and so,

τ (x) [t (i) . . m]

is a proper prefix of

τ (x) [t (j) . . m]

, giving

τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]

.

(2)

Case:

x [i] ≺ x [j]

or (

x [i] = x [j]

and

x [i + 1] ≺ x [j + 1]

).

Then,

(x [i], x [i + 1]) ⊲ (x [j], x [j + 1])

, and so,

τ (x) [t (i)] ≺ τ (x) [t (j)]

, and thus,

τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]

.

(3)

Case: for some

ℓ \geq 2

,

x [i . . i + ℓ - 1] = x [j . . j + ℓ - 1]

, while

x [i + ℓ] ≺ x [j + ℓ]

.

First note that

i + ℓ - 2

and

j + ℓ - 2

are either both black, or both are white:

If $i + ℓ - 2$ is white, then the $τ$ -pairs $((i, i + 1), \dots, (i + ℓ - 3, i + ℓ - 2))$ of $x [i . . n]$ correspond one-to-one to the $τ$ -pairs $((j, j + 1), \dots, (j + ℓ - 3, j + ℓ - 2))$ of $x [j . . n]$ . To determine what follows $(i + ℓ - 3, i + ℓ - 2)$ , we need to know the relationship between the values $x [i + ℓ - 3]$ , $x [i + ℓ - 2]$ , and $x [i + ℓ - 1]$ . Since $x [i + ℓ - 3] = x [j + ℓ - 3]$ , $x [i + ℓ - 2] = x [j + ℓ - 2]$ , and $x [i + ℓ - 1] = x [j + ℓ - 1]$ , the values $x [j + ℓ - 3]$ , $x [j + ℓ - 2]$ , and $x [j + ℓ - 1]$ have the same relationship, and thus, the $τ$ -pair following $(j + ℓ - 3, j + ℓ - 2)$ will be the “same” as the $τ$ -pair following $(i + ℓ - 3, i + ℓ - 2)$ . Since $i + ℓ - 2$ is white, the $τ$ -pair following $(i + ℓ - 3, i + ℓ - 2)$ is $(i + ℓ - 1, i + ℓ)$ , and so, the $τ$ -pair following $(j + ℓ - 3, j + ℓ - 2)$ is $(j + ℓ - 1, j + ℓ)$ , making $j + ℓ - 2$ white as well.
If $i + ℓ - 2$ is black, then the $τ$ -pairs $((i, i + 1), \dots, (i + ℓ - 2, i + ℓ - 1))$ of $x [i . . n]$ correspond one-to-one to the $τ$ -pairs $((j, j + 1), \dots, (j + ℓ - 2, j + ℓ - 1))$ of $x [j . . n]$ . It follows that $j + ℓ - 2$ is black as well.

We proceed by discussing these two cases for the colours of

i + ℓ - 2

and

j + ℓ - 2

.

(3a)

Case when

i + ℓ - 2

and

j + ℓ - 2

are both white.

Therefore, we have the

τ

-pairs

((i, i + 1), \dots, (i + ℓ - 3, i + ℓ - 2), (i + ℓ - 1, i + ℓ))

for

x [i . . n]

that correspond one-to-one to the

τ

-pairs

((j, j + 1), \dots, (j + ℓ - 3, j + ℓ - 2), (j + ℓ - 1, j + ℓ))

for

x [j . . n]

. It follows that

τ (x) [t (i) . . t (i + ℓ - 3)] = τ (x) [t (j) . . t (j + ℓ - 3)]

.

τ (x) [t (i + ℓ - 1)] = p (i + ℓ - 1, i + ℓ)

and

τ (x) [t (j + ℓ - 1)] = p (j + ℓ - 1, j + ℓ)

. Since

x [i + ℓ - 1] = x [j + ℓ - 1]

and, by our assumption (3),

x [i + ℓ] ≺ x [j + ℓ]

, it follows that

(i + ℓ - 1, i + ℓ) ⊲ (j + ℓ - 1, j + ℓ)

, giving

p (i + ℓ - 1, i + ℓ) ≺ p (j + ℓ - 1, j + ℓ)

, and so,

τ (x) [t (i + ℓ - 1)] ≺ τ (x) [t (j + ℓ - 1)]

. Since

t (i + ℓ - 3) + 1 = t (i + ℓ - 1)

and

t (j + ℓ - 3) + 1 = t (j + ℓ - 1)

, we have

τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]

.

(3b)

Case when

i + ℓ - 2

and

j + ℓ - 2

are both black.

Therefore, we have the

τ

-pairs

((i, i + 1), \dots, (i + ℓ - 2, i + ℓ - 1))

for

x [i . . n]

that correspond one-to-one to the

τ

-pairs

((j, j + 1), \dots, (j + ℓ - 2, j + ℓ - 1))

for

x [j . . n]

. It follows that

τ (x) [t (i) . . t (i + ℓ - 2)] = τ (x) [t (j) . . t (j + ℓ - 2)]

.

We need to discuss the four cases based on the colours of

i + ℓ - 1

and

j + ℓ - 1

.

(3b $α$ ): Both $i + ℓ - 1$ and $j + ℓ - 1$ are black.
It follows that the next $τ$-pair for $x [i . . n]$ is $(i + ℓ - 1, i + ℓ)$, and the next $τ$-pair for $x [j . . n]$ is $(j + ℓ - 1, j + ℓ)$. It follows that $t (i + ℓ - 2) + 1 = t (i + ℓ - 1)$ and $t (j + ℓ - 2) + 1 = t (j + ℓ - 1)$. Hence, $τ (x) [t (i + ℓ - 2) + 1] = p (i + ℓ - 1, i + ℓ)$ and $τ (x) [t (j + ℓ - 2) + 1] = p (j + ℓ - 1, j + ℓ)$. Since $x [i + ℓ - 1] = x [j + ℓ - 1]$ and, by Assumption (3), $x [i + ℓ] ≺ x [j + ℓ]$, we have $(x [i + ℓ - 1], x [i + ℓ]) ⊲ (x [j + ℓ - 1], x [j + ℓ])$, and so, $p (i + ℓ - 1, i + ℓ) ≺ p (j + ℓ - 1, j + ℓ)$, giving us $τ (x) [t (i + ℓ - 2) + 1] ≺ τ (x) [t (j + ℓ - 2) + 1]$. It follows that $τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]$.
(3b $β$ ): $i + ℓ - 1$ is white, and $j + ℓ - 1$ is black.
It follows that the next $τ$-pair for $x [i . . n]$ is $(i + ℓ, i + ℓ + 1)$, and the next $τ$-pair for $x [j . . n]$ is $(j + ℓ - 1, j + ℓ)$. It follows that $t (i + ℓ - 2) + 1 = t (i + ℓ)$, while $t (j + ℓ - 2) + 1 = t (j + ℓ - 1)$. Thus, $τ (x) [t (i + ℓ - 2) + 1] = p (i + ℓ, i + ℓ + 1)$ and $τ (x) [t (j + ℓ - 2) + 1] = p (j + ℓ - 1, j + ℓ)$. Since $j + ℓ - 1$ is black, we know that $x [j + ℓ - 2] ≻ x [j + ℓ - 1] ⪯ x [j + ℓ]$. Since $x [i + ℓ - 2] = x [j + ℓ - 2]$ and $x [i + ℓ - 1] = x [j + ℓ - 1]$, we have $x [i + ℓ - 2] ≻ x [i + ℓ - 1]$, and so, $x [i + ℓ - 1] ≻ x [i + ℓ]$, as otherwise, $i + ℓ - 1$ would be black. This gives us $x [j + ℓ - 1] ≻ x [i + ℓ]$. Thus, $(x [i + ℓ], x [i + ℓ + 1]) ⊲ (x [j + ℓ - 1], x [j + ℓ])$, giving $p (i + ℓ, i + ℓ + 1) ≺ p (j + ℓ - 1, j + ℓ)$ and, ultimately, $τ (x) [t (i + ℓ)] ≺ τ (x) [t (j + ℓ - 1)]$. The last step is to realize that $t (i + ℓ - 2) + 1 = t (i + ℓ)$ and $t (j + ℓ - 2) + 1 = t (j + ℓ - 1)$, which gives us $τ (x) [t (i + ℓ - 2) + 1] ≺ τ (x) [t (j + ℓ - 2) + 1]$. It follows that $τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]$.
(3b $γ$ ): $i + ℓ - 1$ is black, and $j + ℓ - 1$ is white.
It follows that the next $τ$ -pair for $x [i . . n]$ is $(i + ℓ - 1, i + ℓ)$ , and the next $τ$ -pair for $x [j . . n]$ is $(j + ℓ, j + ℓ + 1)$ . It follows that $t (i + ℓ - 2) + 1 = t (i + ℓ - 1)$ , while $t (j + ℓ - 2) + 1 = t (j + ℓ)$ . Thus, $τ (x) [t (i + ℓ - 2) + 1] = p (i + ℓ - 1, i + ℓ)$ and $τ (x) [t (j + ℓ - 2) + 1] = p (j + ℓ, j + ℓ + 1)$ . Since $i + ℓ - 1$ is black, we know that $x [i + ℓ - 2] ≻ x [i + ℓ - 1] ⪯ x [i + ℓ] ≺ x [j + ℓ]$ , where the last inequality is our Assumption (3). Therefore, $x [j + ℓ - 1] = x [i + ℓ - 1] ≺ x [j + ℓ]$ . Thus, $(x [i + ℓ - 1], x [i + ℓ]) ⊲ (x [j + ℓ], x [j + ℓ + 1])$ , giving $p (i + ℓ - 1, i + ℓ) ≺ p (j + ℓ, j + ℓ + 1)$ , $τ (x) [t (i + ℓ - 1)] ≺ τ (x) [t (j + ℓ)]$ , and ultimately, $τ (x) [t (i + ℓ - 2) + 1] = τ (x) [t (i + ℓ - 1)] ≺ τ (x) [t (j + ℓ)] = τ (x) [t (j + ℓ - 2) + 1]$ . It follows that $τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]$ .
(3b $δ$ ): Both $i + ℓ - 1$ and $j + ℓ - 1$ are white.
Then, the next $τ$-pair for $x [i . . n]$ is $(i + ℓ, i + ℓ + 1)$, and the next $τ$-pair for $x [j . . n]$ is $(j + ℓ, j + ℓ + 1)$. It follows that $t (i + ℓ - 2) + 1 = t (i + ℓ)$, while $t (j + ℓ - 2) + 1 = t (j + ℓ)$. Thus, $τ (x) [t (i + ℓ - 2) + 1] = p (i + ℓ, i + ℓ + 1)$ and $τ (x) [t (j + ℓ - 2) + 1] = p (j + ℓ, j + ℓ + 1)$. Since, by our Assumption (3), $x [i + ℓ] ≺ x [j + ℓ]$, we have $(x [i + ℓ], x [i + ℓ + 1]) ⊲ (x [j + ℓ], x [j + ℓ + 1])$, giving $p (i + ℓ, i + ℓ + 1) ≺ p (j + ℓ, j + ℓ + 1)$, $τ (x) [t (i + ℓ)] ≺ τ (x) [t (j + ℓ)]$, and ultimately, $τ (x) [t (i + ℓ - 2) + 1] = τ (x) [t (i + ℓ)] ≺ τ (x) [t (j + ℓ)] = τ (x) [t (j + ℓ - 2) + 1]$. It follows that $τ (x) [t (i) . . m] ≺ τ (x) [t (j) . . m]$.

□

Lemma 5 shows that

τ

-reduction preserves the proto-Lyndon property of certain proto-Lyndon substrings of

x

.

Lemma 5.

Let

x = x [1 . . n]

, and let

τ (x) = τ (x) [1 . . m]

. Let

1 \leq i < j \leq n

. Let

x [i . . j]

be a proto-Lyndon substring of

x

, and let i be a black position.

Then,

\begin{cases} τ (x) [t (i) . . t (j)] \text{ is proto-Lyndon} & \text{if } j \text{ is black} \\ τ (x) [t (i) . . t (j - 1)] \text{ is proto-Lyndon} & \text{if } j \text{ is white.} \end{cases}

Proof.

Let us first assume that j is black.

Since both i and j are black,

t (i)

and

t (j)

are defined. Let

i_{1} = t (i)

,

j_{1} = t (j)

, and consider

k_{1}

, so that

i_{1} < k_{1} \leq j_{1}

. Let

k = b (k_{1})

. Then,

t (k) = k_{1}

and

i < k \leq j

, and so,

x [i . . n] ≺ x [k . . n]

by Lemma 1 as

x [i . . j]

is proto-Lyndon. It follows that

τ (x) [t (i) . . m] ≺ τ (x) [t (k) . . m]

by Lemma 4. Thus,

τ (x) [i_{1} . . m] ≺ τ (x) [k_{1} . . m]

for any

i_{1} < k_{1} \leq j_{1}

, and so,

τ (x) [i_{1} . . j_{1}]

is proto-Lyndon by Lemma 1.

Now, let us assume that j is white.

Then,

j - 1

is black, and

x [i . . j - 1]

is proto-Lyndon, so as in the previous case,

τ (x) [t (i) . . t (j - 1)]

is proto-Lyndon. □

Now, we can show that

τ

-reduction preserves some right-maximal Lyndon substrings.

Lemma 6.

Let

x = x [1 . . n]

, and let

τ (x) = τ (x) [1 . . m]

. Let

1 \leq i < j \leq n

. Let

x [i . . j]

be a right-maximal Lyndon substring, and let i be a black position.

Then,

\begin{cases} τ (x) [t (i) . . t (j)] \text{ is a right-maximal Lyndon substring} & \text{if } j \text{ is black} \\ τ (x) [t (i) . . t (j - 1)] \text{ is a right-maximal Lyndon substring} & \text{if } j \text{ is white.} \end{cases}

Proof.

Since

x [i . . j]

is Lyndon and hence proto-Lyndon, by Lemma 5, we know that

τ (x) [t (i) . . t (j)]

is proto-Lyndon for j black, while for white j,

τ (x) [t (i) . . t (j - 1)]

is proto-Lyndon. Thus, in order to conclude that the respective strings are right-maximal Lyndon substrings, we only need to prove that the property

(c)

of Lemma 1 holds in both cases.

Since

x [i . . j]

is right-maximal Lyndon, either

j = n

or

x [j + 1 . . n] ≺ x [i . . n]

by Lemma 1, giving

j = n

or

x [j + 1] ⪯ x [i]

. Since

x [i . . j]

is Lyndon and hence unbordered,

x [i] ≺ x [j]

. Thus, either

j = n

or

x [j + 1] ⪯ x [i] ≺ x [j]

.

If

j = n

, then there are two simple cases. If n is white,

n - 1

is black and

m = t (n - 1)

, so

t (j - 1) = m

, giving us (c) of Lemma 1 for

τ (x) [t (i) . . t (j - 1)]

. On the other hand, if n is black, then

m = t (n)

, and so,

m = t (j)

giving us (c) of Lemma 1 for

τ (x) [t (i) . . t (j)]

.

Thus, in the following, we can assume that

j < n

and that

x [j + 1] ⪯ x [i] ≺ x [j]

. We will proceed by discussing two possible cases, one where j is black and the other where j is white.

(1)

Case: j is black.

We need to show that either

t (j) = m

or

τ (x) [t (i) . . m] ≻ τ (x) [t (j) + 1 . . m]

.

If

j = n

, then

t (j) = m

, and we are done. Thus, we can assume that

j < n

. We must show that

τ (x) [t (j) + 1 . . m] ≺ τ (x) [t (i) . . m]

.

(1a): Case: $x [j + 1] ⪯ x [j + 2]$ .
Then, $x [j] ≻ x [j + 1]$ and $x [j + 1] ⪯ x [j + 2]$ , and so, $j + 1$ is black. It follows that $t (j) + 1 = t (j + 1)$ . By Lemma 4, $τ (x) [t (j + 1) . . m] ≺ τ (x) [t (i) . . m]$ because $x [j + 1 . . n] ≺ x [i . . n]$ , thus $τ (x) [t (j) + 1 . . m] ≺ τ (x) [t (i) . . m]$ .
(1b): Case: $x [j + 1] ≻ x [j + 2]$ .
Then, $x [j] ≻ x [i] ⪰ x [j + 1] ≻ x [j + 2]$ . It follows that the $τ$ -pair $(j, j + 1)$ is followed by a $τ$ -pair $(j + 2, j + 3)$ , and thus, $t (j) + 1 = t (j + 2)$ . Thus,
$(x [j + 2], x [j + 3]) ⊲ (x [i], x [i + 1]) ⊲ (x [j], x [j + 1])$ ; hence,
$p (j + 2, j + 3) ≺ p (i, i + 1) ≺ p (j, j + 1)$. Since $τ (x) [t (j) + 1] = p (j + 2, j + 3)$, $τ (x) [t (i)] = p (i, i + 1)$, and $τ (x) [t (j)] = p (j, j + 1)$, it follows that $τ (x) [t (j) + 1] ≺ τ (x) [t (i)]$, and so, $τ (x) [t (j) + 1 . . m] ≺ τ (x) [t (i) . . m]$.

(2)

Case: j is white.

We need to prove that

τ (x) [t (j - 1) + 1 . . m] ≺ τ (x) [t (i) . . m]

. Since j is white, necessarily both

j - 1

and

j + 1

are black and

t (j - 1) + 1 = t (j + 1)

. By Lemma 5,

τ (x) [t (i) . . t (j - 1)]

is proto-Lyndon as both i and

j - 1

are black and

x [i . . j - 1]

is proto-Lyndon. Since

x [i . . n] ≻ x [j + 1 . . n]

and both i and

j + 1

are black, by Lemma 4, we get

τ (x) [t (i) . . m] ≻ τ (x) [t (j + 1) . . m] = τ (x) [t (j - 1) + 1 . . m]

.

□

Now, we are ready to tackle the proof of Theorem 2.

Proof of Theorem 2.

Let

L_{x}^{'} [i] = j

where i is black. Then,

t (i)

is defined, and

x [i . . j]

is a right-maximal Lyndon substring of

x

. We proceed by analysis of the two possible cases of the label for the position j. Let (*) denote the condition from the theorem, i.e.,

(*): $b (L_{τ (x)}^{'} [t (i)]) = n$ or $x [b (L_{τ (x)}^{'} [t (i)]) + 1] ⪯ x [i]$
(1): Case: j is black.
Then, by Lemma 6, $τ (x) [t (i) . . t (j)]$ is a right-maximal Lyndon substring of $τ (x)$ ; hence, $L_{τ (x)}^{'} [t (i)] = t (j)$ . Therefore, $b (L_{τ (x)}^{'} [t (i)]) = b (t (j)) = j = L_{x}^{'} [i]$ . We have to also prove that the condition (*) holds.
If $j = n$ , then the condition (*) holds. Therefore, assume that $j < n$ . Since $x [i . . j]$ is right-maximal, by Lemma 1, $x [j + 1 . . n] ≺ x [i . . n]$ , and so, $x [j + 1] ⪯ x [i]$ . Then, $x [b (L_{τ (x)}^{'} [t (i)]) + 1] = x [b (t (j)) + 1] = x [j + 1] ⪯ x [i]$ .
(2): Case: j is white.
Then, $j - 1$ is black, and $τ (x) [t (j - 1)] = p (j - 1, j)$. By Lemma 6, $τ (x) [t (i) . . t (j - 1)]$ is a right-maximal Lyndon substring of $τ (x)$; hence, $L_{τ (x)}^{'} [t (i)] = t (j - 1)$, so $b (L_{τ (x)}^{'} [t (i)]) = b (t (j - 1)) = j - 1$, giving $b (L_{τ (x)}^{'} [t (i)]) + 1 = j$.
We want to show that the condition (*) does not hold.
If $b (L_{τ (x)}^{'} [t (i)]) = n$, then $j - 1 = n$, which is impossible as $j \leq n$. Since $x [i . . j]$ is Lyndon, $x [i] ≺ x [j]$, and so, $x [i] ≺ x [b (L_{τ (x)}^{'} [t (i)]) + 1]$. Thus, Condition (*) does not hold.

□

3.4. Computing $L_{x}^{'}$ from $L_{τ (x)}^{'}$

Theorem 2 indicates how to compute the partial

L_{x}^{'}

from

L_{τ (x)}^{'}

. The procedure is given in Figure 3.
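As a rough illustration of how Theorem 2 yields the partial array (this is a sketch, not the procedure of Figure 3), the formula can be applied to every black position as follows. The function name `partial_lyndon_array`, the 0-based list layout, and the use of `None` for nil are assumptions of this sketch; the input values are those of the running example.

```python
def partial_lyndon_array(x, b, t, L_tau):
    """Fill L'_x at black positions from L'_tau(x) via Theorem 2.

    x      : list of letters, x[0] is the letter at position 1
    b, t   : dicts with 1-based keys and values, as defined in the text
    L_tau  : list; L_tau[k - 1] is L'_tau(x)[k] (a 1-based end position)
    Returns a list whose entry i - 1 is L'_x[i], or None at white positions.
    """
    n = len(x)
    L = [None] * n
    for i, ti in t.items():            # i ranges over the black positions
        r = L_tau[ti - 1]
        end = b[r]
        # x[end] is the letter at 1-based position b(r) + 1.
        if end == n or x[end] <= x[i - 1]:
            L[i - 1] = end
        else:
            L[i - 1] = end + 1
    return L


x = [0, 1, 1, 0, 2, 3, 1, 2, 2]
b = {1: 1, 2: 3, 3: 4, 4: 6, 5: 7, 6: 9}
t = {black: k for k, black in b.items()}
partial = partial_lyndon_array(x, b, t, [6, 2, 6, 4, 6, 6])
```

The entries left as `None` are exactly the white positions, whose values still have to be computed.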

To compute the missing values, the partial array is processed from right to left. When a missing value at position i is encountered (note that it is recognized by $L_{x}^{'} [i] = nil$), the Lyndon array $L_{x}^{'} [i + 1 . . n]$ is completely filled, and also, $L_{x}^{'} [i - 1]$ is known, because a missing position i is white and so $i - 1$ is black. Recall that $L_{x}^{'} [i + 1]$ is the ending position of the right-maximal Lyndon substring starting at the position $i + 1$. In several cases, we can determine the value of $L_{x}^{'} [i]$ in constant time:

(1): if $i = n$ , then $L_{x}^{'} [i] = i$ .
(2): if $x [i] ≻ x [i + 1]$ , then $L_{x}^{'} [i] = i$ .
(3): if $x [i] = x [i + 1]$ and $L_{x}^{'} [i + 1] = i + 1$ and either $i + 1 = n$ or $i + 1 = L_{x}^{'} [i - 1]$ , then $L_{x}^{'} [i] = i$ .
(4): if $x [i] ≺ x [i + 1]$ and $L_{x}^{'} [i + 1] = i + 1$ and either $i + 1 = n$ or $i + 1 = L_{x}^{'} [i - 1]$ , then $L_{x}^{'} [i] = i + 1$ .
(5): if $x [i] ⪯ x [i + 1]$ and $L_{x}^{'} [i + 1] > i + 1$ and either $L_{x}^{'} [i + 1] = n$ or $L_{x}^{'} [i + 1] = L_{x}^{'} [i - 1]$ , then $L_{x}^{'} [i] = L_{x}^{'} [i + 1]$ .

We call such points easy. All others will be referred to as hard. A hard point i is one where $x [i]$ is followed by at least two consecutive right-maximal Lyndon substrings before reaching either $L_{x}^{'} [i - 1]$ or n, and we might need to traverse them all.

The while loop, seen in Figure 4’s procedure, is the likely cause of the

O (n log (n))

complexity. At first glance, it may seem that the complexity might be

O (n^{2})

; however, the doubling of the length of the string when a hard point is introduced actually trims it down to an

O (n log (n))

worst-case complexity. See Section 3.5 for more details and Section 7 for the measurements and graphs.

Consider our running example from Figure 2. Since

τ (x) = 021534

, we have

L_{τ (x)}^{'} [1 . . 6] = 6, 2, 6, 4, 6, 6

giving

L_{x}^{'} [1 . . 9] = 9, •, 3, 9, •, 6, 9, •, 9

. Computing

L_{x}^{'} [8]

is easy as

x [8] = x [9]

, and so,

L_{x}^{'} [8] = 8

.

L_{x}^{'} [5]

is more complicated and an example of a hard point: we can extend the right-maximal Lyndon substring from

L_{x}^{'} [6]

to the left to 23, but no more, so

L_{x}^{'} [5] = 6

. Computing

L_{x}^{'} [2]

is again easy as

x [2] = x [3]

, and so,

L_{x}^{'} [2] = 2

. Thus,

L_{x}^{'} [1 . . 9] = 9, 2, 3, 9, 6, 6, 9, 8, 9

.
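The completion of the partial array in this example can be imitated with a simple right-to-left scan. The sketch below is not the procedure of Figure 4: it ignores the easy/hard distinction and the stop bound, and simply keeps appending the next right-maximal Lyndon substring while the current substring is lexicographically smaller than it. The name `fill_missing` and the list layout are assumptions of this sketch.

```python
def fill_missing(x, partial):
    """Complete a partial L'_x (None marks a missing value), right to left.

    x is a list of letters with x[0] at position 1; partial[i - 1] is
    L'_x[i] or None. Substrings are compared by slicing, so this sketch
    favours clarity over the time bounds discussed in the text.
    """
    n = len(x)
    L = list(partial)
    for i in range(n, 0, -1):          # 1-based positions n, n-1, ..., 1
        if L[i - 1] is not None:
            continue
        end = i                         # x[i..end] is Lyndon throughout
        while end < n:
            nxt_end = L[end]            # L'_x[end + 1], already known
            current = x[i - 1:end]      # x[i..end]
            following = x[end:nxt_end]  # x[end+1..L'_x[end+1]]
            if current < following:     # lexicographic list comparison
                end = nxt_end
            else:
                break
        L[i - 1] = end
    return L


x = [0, 1, 1, 0, 2, 3, 1, 2, 2]
full = fill_missing(x, [9, None, 3, 9, None, 6, 9, None, 9])
```

Because the scan only relies on values to the right of i, the same function also computes the whole array when every entry of `partial` is `None`.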

3.5. The Complexity of TRLA

To determine the complexity of the algorithm, we attach to each position i a counter $red [i]$ initialized to zero. Imagine a hard point j indicated by the following diagram:

$A_{1}$ represents the right-maximal Lyndon substring starting at the position $j + 1$; $A_{2}$ represents the right-maximal Lyndon substring immediately following $A_{1}$, and so forth. To make j a hard point, $r \geq 2$ and $x [j] ⪯ x [j + 1]$. The value of $stop$ is determined by:

stop = \begin{cases} L_{x}^{'} [j - 1] & \text{if } L_{x}^{'} [j - 1] > j - 1 \\ n & \text{otherwise.} \end{cases}

To determine the right-maximal Lyndon substring starting at the hard position j, we need first to check if $A_{1}$ can be left-extended by $x [j]$ to make $j A_{1}$ Lyndon; we are using the abbreviated notation $j A_{1}$ for the substring $x [j . . k]$ where $A_{1} = x [j + 1 . . k]$; in simple words, $j A_{1}$ represents the left-extension of $A_{1}$ by one position. If $j A_{1}$ is proto-Lyndon, we have to check whether $A_{2}$ can be left-extended by $j A_{1}$ to a Lyndon substring. If $j A_{1} A_{2}$ is Lyndon, we must continue until we check whether $j A_{1} A_{2} \dots A_{r - 1}$ is Lyndon. If so, we must check whether $j A_{1} \dots A_{r}$ is Lyndon. We need not go beyond $stop$.

How do we check if $j A_{1} \dots A_{k}$ can left-extend $A_{k + 1}$ to a Lyndon substring? If $j A_{1} \dots A_{k} ⪰ A_{k + 1}$, we can stop, and $j A_{1} \dots A_{k}$ is the right-maximal Lyndon substring starting at position j. If $j A_{1} \dots A_{k} ≺ A_{k + 1}$, we need to continue. Since $stop$ is the last position of the right-maximal Lyndon substring at the position $j - 1$ or n, we are assured to stop there. When comparing the substring $j A_{1} \dots A_{k}$ with $A_{k + 1}$, we increment the counter $red [i]$ at every position i of $A_{k + 1}$ used in the comparison. When done with the whole array, the value of $red [i]$ represents how many times i was used in various comparisons, for any position i.

Consider a position i that was used k times for $k \geq 4$, i.e., $red [i] = k$. In the next four diagrams and related text, the upper indices of A and C do not represent powers; they are just indices. The next diagram indicates the configuration when the counter $red [i]$ was incremented for the first time in the comparison of $j_{1} A_{1}^{1} \dots A_{r_{1} - 1}^{1}$ and $A_{r_{1}}^{1}$ during the computation of the missing value $L_{x}^{'} [j_{1}]$ where:

stop_{1} = \begin{cases} L_{x}^{'} [j_{1} - 1] & \text{if } L_{x}^{'} [j_{1} - 1] > j_{1} - 1 \\ n & \text{otherwise} \end{cases}

The next diagram indicates the configuration when the counter $red [i]$ was incremented for the second time in the comparison of $j_{2} A_{1}^{2} \dots A_{r_{2} - 1}^{2}$ and $A_{r_{2}}^{2}$ during the computation of the missing value $L_{x}^{'} [j_{2}]$ where:

stop_{2} = \begin{cases} L_{x}^{'} [j_{2} - 1] & \text{if } L_{x}^{'} [j_{2} - 1] > j_{2} - 1 \\ n & \text{otherwise} \end{cases}

The next diagram indicates the configuration when the counter $red [i]$ was incremented for the third time in the comparison of $j_{3} A_{1}^{3} \dots A_{r_{3} - 1}^{3}$ and $A_{r_{3}}^{3}$ during the computation of the missing value $L_{x}^{'} [j_{3}]$ where:

stop_{3} = \begin{cases} L_{x}^{'} [j_{3} - 1] & \text{if } L_{x}^{'} [j_{3} - 1] > j_{3} - 1 \\ n & \text{otherwise} \end{cases}

The next diagram indicates the configuration when the counter $red [i]$ was incremented for the fourth time in the comparison of $j_{4} A_{1}^{4} \dots A_{r_{4} - 1}^{4}$ and $A_{r_{4}}^{4}$ during the computation of the missing value $L_{x}^{'} [j_{4}]$ where:

stop_{4} = \begin{cases} L_{x}^{'} [j_{4} - 1] & \text{if } L_{x}^{'} [j_{4} - 1] > j_{4} - 1 \\ n & \text{otherwise} \end{cases}

and so forth until the k-th increment of $red [i]$. Thus, if $red [i] = k$, then $n \geq 2^{k - 1} (n_{1} + 1) \geq 2^{k}$ as $n_{1} + 1 \geq 2$. Thus, $n \geq 2^{k}$, and so, $k \leq log (n)$. Thus, either $k < 4$ or $k \leq log (n)$. Therefore, the overall complexity is $O (n log (n))$.

To show that the average case complexity is linear, we first recall that the overall complexity of TRLA is determined by the procedure filling the missing values. We showed above that there are at most

log (n)

missing values (hard positions) that cannot be determined in constant time. We overestimate the number of strings of length n over an alphabet of size

Σ

,

2 \leq Σ \leq n

, which will force a non-linear computation, by assuming that every possible

log (n)

subset of indices with any possible letter assignment forces the worst performance. Thus, there are $Σ^{n} - \binom{n}{log (n)} Σ^{log (n)}$ strings that are processed in linear time, say with a constant $K_{1}$, and there are $\binom{n}{log (n)} Σ^{log (n)}$ strings that are processed in the worst time, with a constant $K_{2}$. Let $K = max (K_{1}, K_{2})$. Then, the average time is bounded by:

\begin{aligned}
& \frac{(Σ^{n} - \binom{n}{log (n)} Σ^{log (n)}) K n + \binom{n}{log (n)} Σ^{log (n)} K n log (n)}{Σ^{n}} \\
& = K n + K n \frac{\binom{n}{log (n)} Σ^{log (n)}}{Σ^{n}} (log (n) - 1) \\
& \leq K n + K n \frac{\binom{n}{log (n)} Σ^{log (n)}}{Σ^{n}} log (n) \\
& \leq K n + K n \frac{n^{log (n)} Σ^{log (n)}}{log (n)! Σ^{n}} log (n) \\
& \leq K n + K n \frac{n^{2 log (n)}}{2^{n}} \\
& \leq K n + K n = 2 K n
\end{aligned}

for

n \geq 2^{7}

. The last step follows from the fact that

n^{2 log (n)} \leq 2^{n}

for any

n \geq 2^{7}

.
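The final inequality can be checked numerically. The sketch below assumes that log denotes the base-2 logarithm (suggested by the bound $n \geq 2^{k}$ used earlier, but not stated explicitly here); under that assumption $n^{2 log (n)} \leq 2^{n}$ is equivalent to $2 (log (n))^{2} \leq n$, which is what the hypothetical helper `bound_holds` compares.

```python
import math


def bound_holds(n):
    """Check n^(2*log2(n)) <= 2^n by comparing base-2 exponents."""
    return 2 * math.log2(n) ** 2 <= n


# The bound holds on a sample range starting at 2^7 ...
holds_from_128 = all(bound_holds(n) for n in range(2**7, 2**14))
# ... and fails at the previous power of two, 2^6.
fails_at_64 = not bound_holds(2**6)
```

At $n = 2^{7}$ the comparison is $2 \cdot 7^{2} = 98 \leq 128$, while at $n = 2^{6}$ it is $2 \cdot 6^{2} = 72 > 64$, so among powers of two the threshold $2^{7}$ cannot be lowered.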

The combinatorics of the processing is too complicated to ascertain whether the worst-case complexity is linear or not. We tried to generate strings that might give the worst performance. We used three different formulas to generate the strings, nesting the white indices that might require non-constant computation: the dataset extreme_trla of binary strings is created using the recursive formula $u_{k + 1} = 00 u_{k} 0 u_{k}$, using the first 100 shortest binary Lyndon strings as the start $u_{0}$. The moment the size of $u_{k}$ exceeds the required length of the string, the recursion stops, and the string is trimmed to the required length. For the extreme_trla1 dataset, we used the same approach with the formula $u_{k + 1} = 000 u_{k} 00 u_{k}$, and for the extreme_trla2 dataset, we used the formula $u_{k + 1} = 0000 u_{k} 00 u_{k}$.
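A minimal sketch of this generation scheme is given below. It is not the authors' generator: the function name `extreme_string` and its parameters are hypothetical, the enumeration of the 100 starting Lyndon strings is omitted, and the loop stops as soon as the string is at least as long as required.

```python
def extreme_string(u0, length, prefix="00", middle="0"):
    """Iterate u_{k+1} = prefix + u_k + middle + u_k, then trim.

    The defaults correspond to the extreme_trla formula; prefix="000",
    middle="00" gives extreme_trla1 and prefix="0000", middle="00"
    gives extreme_trla2.
    """
    u = u0
    while len(u) < length:
        u = prefix + u + middle + u
    return u[:length]


sample = extreme_string("01", 10)
sample1 = extreme_string("01", 10, prefix="000", middle="00")
```

Starting from the Lyndon string 01, two iterations of the first formula already exceed length 10, and the result is trimmed to its first ten letters.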

The space complexity of our C++ implementation is bounded by

9 n

integers. This upper bound is derived from the fact that a Tau object (see Tau.hpp [24]) requires

3 n

integers of space for a string of length n. Therefore, the first call to TRLA requires

3 n

, the next recursive call at most

3 \frac{2}{3} n

, the next recursive call at most

3 {(\frac{2}{3})}^{2} n

, ...; thus,

3 n + 3 \frac{2}{3} n + 3 {(\frac{2}{3})}^{2} n + 3 {(\frac{2}{3})}^{3} n + \dots = 3 n (1 + \frac{2}{3} + {(\frac{2}{3})}^{2} + {(\frac{2}{3})}^{3} + {(\frac{2}{3})}^{4} + \dots) = 3 n \frac{1}{1 - \frac{2}{3}} = 9 n

. However, it should be possible to bring it down to

6 n

integers.

4. The Algorithm BSLA

The purpose of this section is to present a linear algorithm BSLA for computing the Lyndon array of a string over an integer alphabet. The algorithm is based on a series of refinements of a list of groups of indices of the input string. The refinement is driven by a group that is already complete, and the refinement process makes the immediately preceding group also complete. In turn, this newly completed group is used as the driver of the next round of the refinement. In this fashion, the refinement proceeds from right to left until all the groups in the list are complete. The initial list of groups consists of the groups of indices with the same alphabet symbol. The section contains proper definitions of all these terms—group, complete group, and refinement. In the process of refinement, each newly created group is assigned a specific substring of the input string referred to as the context of the group. Throughout the process, the list of the groups is maintained in an increasing lexicographic order by their contexts. Moreover, at every stage, the contexts of all the groups are Lyndon substrings of

x

with an additional property that the contexts of the complete groups are right-maximal Lyndon substrings. Hence, when the refinement is completed, the contexts of all the groups in the list represent all the right-maximal Lyndon substrings of

x

. The mathematics of the process of refinement is necessary in order to ascertain its correctness and completeness and to determine the worst-case complexity of the algorithm.

4.1. Notation and Basic Notions of BSLA

For the sake of simplicity, we fix a string

x = x [1 . . n]

for the whole Section 4.1; all the definitions and the observations apply and refer to this

x

.

A group $G$ is a non-empty set of indices of $x$. The group G is assigned a context, i.e., a substring $con (G)$ of $x$ with the property that for any $i \in G$, $x [i . . i + | con (G) | - 1] = con (G)$. If $i \in G$, then $C (i)$ denotes the occurrence of the context of G at the position i, i.e., the substring $C (i) = x [i . . i + | con (G) | - 1]$. We say that a group $G^{'}$ is smaller than or precedes a group $G^{″}$ if $con (G^{'}) ≺ con (G^{″})$.

Definition 1.

An ordered list of groups $〈 G_{k}, G_{k - 1}, \dots, G_{2}, G_{1} 〉$ is a group configuration if:

$(C_{1})$: $G_{k} \cup G_{k - 1} \cup \dots \cup G_{2} \cup G_{1} = 1 . . n$;
$(C_{2})$: $G_{j} \cap G_{ℓ} = \emptyset$ for any $1 \leq ℓ < j \leq k$;
$(C_{3})$: $con (G_{k}) ≺ con (G_{k - 1}) ≺ \dots ≺ con (G_{2}) ≺ con (G_{1})$;
$(C_{4})$: For any $j \in 1 . . k$, $con (G_{j})$ is a Lyndon substring of $x$.

Note that ($C_{1}$) and ($C_{2}$) guarantee that $〈 G_{k}, G_{k - 1}, \dots, G_{2}, G_{1} 〉$ is a disjoint partitioning of $1 . . n$. For $i \in 1 . . n$, $gr (i)$ denotes the unique group to which i belongs, i.e., if $i \in G_{t}$, then $gr (i) = G_{t}$. Note that using this notation, $C (i) = x [i . . i + | con (gr (i)) | - 1]$.

The mapping $prev$ is defined by $prev (i) = max \{j < i | con (gr (j)) ≺ con (gr (i))\}$ if such j exists, otherwise $prev (i) = nil$.
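For the initial list of groups, in which indices are grouped by their letter, these notions can be illustrated with a short Python sketch. It assumes that the context of each initial group is the single letter shared by its indices, so that $prev (i)$ reduces to the nearest position to the left holding a strictly smaller letter; the function names are hypothetical, and the stack scan is a standard technique rather than the authors' code.

```python
def initial_configuration(x):
    """Group the positions 1..n of x by letter; contexts are single letters.

    Returns a list of (context, indices) ordered by increasing context.
    """
    groups = {}
    for i, letter in enumerate(x, start=1):
        groups.setdefault(letter, []).append(i)
    return [((letter,), groups[letter]) for letter in sorted(groups)]


def prev_for_single_letter_contexts(x):
    """prev(i) = max{ j < i : x[j] < x[i] }, or None, via a stack scan."""
    prev = {}
    stack = []                       # candidate positions, letters increasing
    for i, letter in enumerate(x, start=1):
        # Positions holding a letter >= x[i] can never be prev of i or of
        # any later position, because i is closer and not larger.
        while stack and x[stack[-1] - 1] >= letter:
            stack.pop()
        prev[i] = stack[-1] if stack else None
        stack.append(i)
    return prev


x = [0, 1, 1, 0, 2, 3, 1, 2, 2]
configuration = initial_configuration(x)
prev = prev_for_single_letter_contexts(x)
```

For this string the groups are $\{1, 4\}$, $\{2, 3, 7\}$, $\{5, 8, 9\}$, and $\{6\}$, listed here from the smallest context to the largest.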

For a group G from a group configuration, we define an equivalence ∼ on G as follows: $i \sim j$ iff $gr (prev (i)) = gr (prev (j))$ or $prev (i) = prev (j) = nil$. The symbol ${[i]}_{\sim}$ denotes the class of equivalence ∼ that contains i, i.e., ${[i]}_{\sim} = \{j \in G | j \sim i\}$. If $prev (i) = nil$, then the class ${[i]}_{\sim}$ is called trivial. An interesting observation states that if G is viewed as an ordered set of indices, then a non-trivial ${[i]}_{\sim}$ is an interval:

Observation 7.

Let G be a group from a group configuration for $x$. Consider an $i \in G$ such that $prev (i) \neq nil$. Let $j_{1} = min {[i]}_{\sim}$ and $j_{2} = max {[i]}_{\sim}$. Then, ${[i]}_{\sim} = \{j \in G | j_{1} \leq j \leq j_{2}\}$.

Proof.

Let $j \in G$ with $j_{1} \leq j \leq j_{2}$. Since $prev (j_{1})$ is a candidate to be $prev (j)$, $prev (j) \neq nil$ and $prev (j_{1}) \leq prev (j) \leq prev (j_{2}) = prev (j_{1})$, so $prev (j) = prev (j_{1}) = prev (j_{2})$. □

On each non-trivial class of ∼, we define a relation ≈ as follows:

i \approx j

iff

| j - i | = | c o n (G) |

; in simple terms, it means that the occurrence

C (i)

of

c o n (G)

is immediately followed by the occurrence

C (j)

of

c o n (G)

. The transitive closure of ≈ is a relation of equivalence, which we also denote by ≈. The symbol

{[i]}_{_{\approx}}

denotes the class of equivalence ≈ containing i, i.e.,

{[i]}_{_{\approx}} = {j \in {[i]}_{_{\sim}} | j \approx i}

.

For each j from a non-trivial

{[i]}_{_{\sim}}

, we define the valence by

{v a l (j) = | [i]}_{_{\approx}} |

. In simple terms,

v a l (i)

is the number of elements from

{[i]}_{_{\sim}}

that are

\approx i

. Thus,

1 \leq v a l (i) \leq | G |

.
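The definitions of prev, ∼, ≈ and the valence can be made concrete with a small sketch. It is only an illustration under stated assumptions: indices are 0-based (the text uses 1-based positions), a group configuration is represented by a hypothetical list `con` holding the context of the group of each index, and groups are identified by their contexts. The helper names `prev_index` and `valences` are not from the paper.

```python
def prev_index(con, i):
    """Largest j < i whose context is lexicographically smaller than con[i], or None (nil)."""
    for j in range(i - 1, -1, -1):
        if con[j] < con[i]:
            return j
    return None


def valences(con, group):
    """Map every index i of `group` with prev(i) != nil to val(i)."""
    step = len(con[group[0]])            # |con(G)|
    members = set(group)
    prev_con = {}                        # context of gr(prev(i)); it identifies the ~ class of i
    for i in group:
        p = prev_index(con, i)
        if p is not None:
            prev_con[i] = con[p]

    def linked(a, b):
        # a and b lie in the same non-trivial class of ~
        return a in prev_con and b in prev_con and prev_con[a] == prev_con[b]

    val = {}
    for i in prev_con:
        first = i                        # walk to the leftmost element of the class of i under the closure of ≈
        while first - step in members and linked(first - step, i):
            first -= step
        size, j = 0, first               # count the chain first, first + step, ...
        while j in members and linked(j, i):
            size += 1
            j += step
        val[i] = size
    return val


# Initial configuration of 011023122: every index has its own letter as context.
con = list("011023122")
print(valences(con, [4, 7, 8]))          # {4: 1, 7: 2, 8: 2}
```

In 1-based terms, the group of the letter 2 is {5, 8, 9}; index 5 forms a class of its own, while 8 and 9 are adjacent occurrences sharing the same prev, so their valence is 2.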

Interestingly, if G is viewed as an ordered set of indices, then

{[i]}_{_{\approx}}

is a subinterval of the interval

{[i]}_{_{\sim}}

:

Observation 8.

Let G be a group from a group configuration for

x

. Consider an

i \in G

such that

p r e v (i) \neq n i l

. Let

j_{1} = min {[i]}_{_{\approx}}

and

j_{2} = max {[i]}_{_{\approx}}

. Then,

{[i]}_{_{\approx}} = {j \in {[i]}_{_{\sim}} | j_{1} \leq j \leq j_{2}}

.

Proof.

We argue by contradiction. Assume that there is an

j \in {[i]}_{_{\sim}}

so that

j_{1} < j < j_{2}

and

j \notin {[i]}_{_{\approx}}

. Take the minimal such j. Consider

j^{'} = j - | c o n (G) |

. Then,

j^{'} \in {[i]}_{_{\sim}}

, and since,

j^{'} < j

,

j^{'} \in {[i]}_{_{\approx}}

due to the minimality of j. Therefore,

i \approx j^{'} \approx j

, and so,

j \approx i

, a contradiction. □

Definition 2.

A group G is complete if for any

i \in G

, the occurrence

C (i)

of

c o n (G)

is a right-maximal Lyndon substring of

x

.

A group configuration

〈 G_{k}, G_{k - 1}, \dots, G_{2}, G_{1} 〉

is t-complete,

1 \leq t \leq k

, if

$(C_{5})$: the groups $G_{t}, \dots, G_{1}$ are complete;
$(C_{6})$: the mapping $p r e v$ is proper on $G_{t}$ :
for any $i \in G_{t}$ , if $p r e v (i) \neq n i l$ and $v = v a l (i)$ , then there are $i_{1}, \dots, i_{v} \in G_{t}$ , $i \in {i_{1}, \dots, i_{v}}$ , $p r e v (i) = p r e v (i_{1}) = \dots = p r e v (i_{v})$ , and so that $C (p r e v (i)) C (i_{1}) \dots C (i_{v})$ is a prefix of $x [p r e v (i) . . n]$ ;
$(C_{7})$: the family ${C (i) | i \in 1 . . n}$ is proper:
$(a)$ if $C (j)$ is a proper substring of $C (i)$ , i.e., $C (j) ⊊ C (i)$ , then $c o n (G_{t}) ≺ c o n (g r (j))$ ,
$(b)$ if $C (i)$ is followed immediately by $C (j)$ , i.e., when $i + | c o n (g r (i)) | = j$ , and $C (i) ≺ C (j)$ , then $c o n (g r (j)) ⪯ c o n (G_{t})$ ;
$(C_{8})$: the family ${C (i) | i \in 1 . . n}$ has the Monge property, i.e., if $C (i) \cap C (j) \neq \emptyset$ , then $C (i) \subseteq C (j)$ or $C (j) \subseteq C (i)$ .

The condition

(C_{6})

is all-important for carrying out the refinement process (see

(R_{3})

below). The conditions

(C_{7})

and

(C_{8})

are necessary for asserting that the condition

(C_{6})

is preserved during the refinement process.

4.2. The Refinement

For the sake of simplicity, we fix a string

x = x [1 . . n]

for the whole Section 4.2; all the definitions, lemmas, and theorems apply and refer to this

x

.

Lemma 9.

Let

A_{x} = {a_{1}, \dots, a_{k}}

and

a_{1} ≺ a_{2} ≺ \dots ≺ a_{k}

. For

1 \leq ℓ \leq k

, define

G_{ℓ} = {i \in 1 . . n | x [i] = a_{k + 1 - ℓ}}

with context

a_{k + 1 - ℓ}

. Then,

〈 G_{k}, \dots, G_{1} 〉

is a one-complete group configuration.

Proof.

(

C_{1}

), (

C_{2}

), (

C_{3}

), and (

C_{4}

) are straightforward to verify. To verify (

C_{5}

), we need to show that

G_{1}

is complete. Any occurrence of

a_{k}

in

x

is a right-maximal Lyndon substring, so

G_{1}

is complete.

To verify (

C_{6}

), consider

j = p r e v (i)

and

v a l (i) = v

for

i \in G_{1}

. Consider any r such that

j < r < i

. If

x [r] \neq a_{k}

, then

p r e v (i) < r

, which contradicts the definition of

p r e v

. Hence,

x [r] = a_{k}

, and so,

x [j + 1] = \dots = x [i] = \dots = x [j + v] = a_{k}

, while

x [j] = a_{q}

for some

q < k

. It follows that

x [j . . n]

has

a_{q} {(a_{k})}^{v}

as a prefix.

The condition

(C_{7} (a))

is trivially satisfied as no

C (i)

can have a proper substring. If

C (i)

is immediately followed by

C (j)

and

C (i) ≺ C (j)

, then

C (i) = x [i]

,

j = i + 1

,

C (j) = x [i + 1]

, and

x [i] ≺ x [i + 1]

. Then,

c o n (g r (j)) = x [i + 1] ⪯ a_{k} = c o n (G_{1})

, so

(C_{7} (b))

is also satisfied.

To verify (

C_{8}

), consider

C (i) \cap C (j) \neq \emptyset

. Then,

C (i) = x [i] = x [j] = C (j)

. □
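As an illustration of Lemma 9, the initial configuration can be built by bucketing positions by letter. The sketch below assumes 0-based indices and returns the groups with the smallest context first, i.e., in the order G_k, ..., G_1 of the text; the function name `initial_configuration` is hypothetical.

```python
def initial_configuration(x):
    """Groups of Lemma 9 as (context, indices) pairs, smallest context first."""
    groups = {}
    for i, letter in enumerate(x):
        groups.setdefault(letter, []).append(i)   # indices are appended in increasing order
    return [(letter, groups[letter]) for letter in sorted(groups)]


for context, indices in initial_configuration("011023122"):
    print(context, indices)
# 0 [0, 3]
# 1 [1, 2, 6]
# 2 [4, 7, 8]
# 3 [5]
```

The last group printed corresponds to G_1, the group of the largest letter, which is the first group used for the refinement.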

Let

〈 G_{k}, \dots, G_{t}, \dots, G_{1} 〉

be a t-complete group configuration. The refinement is driven by the group

G_{t}

, and it might only partition the groups that precede it, i.e., the groups

G_{k}, \dots, G_{t + 1}

, while the groups

G_{t}, \dots, G_{1}

remain unchanged.

$(R_{1})$

Partition

G_{t}

into classes of the equivalence ∼.

G_{t} = {[i_{1}]}_{_{\sim}} \cup {[i_{2}]}_{_{\sim}} \cup \dots \cup {[i_{p}]}_{_{\sim}} \cup X

where

X = {i \in G_{t} | p r e v (i) = n i l}

may be possibly empty and

i_{1} < i_{2} < \dots < i_{p}

.

$(R_{2})$

Partition every class

{[i_{ℓ}]}_{_{\sim}}

,

1 \leq ℓ \leq p

, into classes of the equivalence ≈.

{[i_{ℓ}]}_{_{\sim}} = {[j_{ℓ, 1}]}_{_{\approx}} \cup {[j_{ℓ, 2}]}_{_{\approx}} \cup \dots \cup {[j_{ℓ, m_{ℓ}}]}_{_{\approx}}

where

v a l (j_{ℓ, 1}) < v a l (j_{ℓ, 2}) < \dots < v a l (j_{ℓ, m_{ℓ}})

.

$(R_{3})$

Therefore, we have a list of classes in this order:

{[j_{1, 1}]}_{_{\approx}}

,

{[j_{1, 2}]}_{_{\approx}}

, ...

{[j_{1, m_{1}}]}_{_{\approx}}

,

{[j_{2, 1}]}_{_{\approx}}

,

{[j_{2, 2}]}_{_{\approx}}

, ...

{[j_{2, m_{2}}]}_{_{\approx}}

, ...,

{[j_{p, 1}]}_{_{\approx}}

,

{[j_{p, 2}]}_{_{\approx}}

, ...

{[j_{p, m_{p}}]}_{_{\approx}}

. This list is processed from left to right. Note that for each

i \in {[j_{ℓ, 𝓀}]}_{_{\approx}}

,

p r e v (i) \in g r (j_{ℓ, 𝓀})

, and

v a l (i) = v a l (j_{ℓ, 𝓀})

.

For each

j_{ℓ, 𝓀}

, move all elements

{p r e v (i) | i \in {[j_{ℓ, 𝓀}]}_{_{\approx}}}

from the group

g r (p r e v (j_{ℓ, 𝓀}))

into a new group H, place H in the list of groups right after the group

g r (p r e v (j_{ℓ, 𝓀}))

, and set its context to

c o n (g r (p r e v (j_{ℓ, 𝓀}))) c o n {(g r (j_{ℓ, 𝓀}))}^{v a l (j_{ℓ, 𝓀})}

. (Note that this “doubling of the contexts” is possible due to

(C_{6})

.) Then, update

p r e v

:

All values of $p r e v$ are correct except possibly the values of $p r e v$ for indices from H. It may be the case that for $i \in H$ , there is $i^{'} \in g r (p r e v (j_{ℓ, 𝓀}))$ with $p r e v (i) < i^{'} < i$ , in which case $p r e v (i)$ must be reset to the maximal such $i^{'}$ . (Note that before the removal of H from $g r (p r e v (j_{ℓ, 𝓀}))$ , the index $i^{'}$ was not eligible to be considered for $p r e v (i)$ as i and $i^{'}$ were both from the same group.)
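The effect of the steps (R_1) to (R_3) can be simulated naively. The following is a quadratic-time sketch, not the linear BSLA implementation: it assumes 0-based indices, stores the context of the group of each index in a hypothetical list `con`, identifies groups by their contexts, recomputes prev directly from its definition instead of maintaining it, and relies on the properness conditions to guarantee that all members of one class of ≈ share the same prev. The name `refine_all` is hypothetical.

```python
def refine_all(x):
    """Naive simulation of the iterated refinement; returns the length of the final context of every index."""
    n = len(x)
    con = list(x)                       # initial configuration of Lemma 9

    def prev(i):
        # Largest j < i with a lexicographically smaller context, or None (nil).
        for j in range(i - 1, -1, -1):
            if con[j] < con[i]:
                return j
        return None

    current = max(con) if n else None   # context of the group driving the refinement
    while current is not None:
        group = [i for i in range(n) if con[i] == current]
        members = set(group)
        step = len(current)
        updates = {}                    # new contexts, applied only after the whole group is scanned
        for i in group:
            p = prev(i)
            if p is None:
                continue                # trivial class, nothing to move
            first = i                   # leftmost element of the chain i, i - step, ...
            while first - step in members:
                first -= step
            val, j = 0, first           # valence = number of adjacent occurrences in the chain
            while j in members:
                val += 1
                j += step
            updates[p] = con[p] + current * val   # "doubling of the contexts"
        for p, new_context in updates.items():
            con[p] = new_context
        smaller = [c for c in con if c < current]
        current = max(smaller) if smaller else None   # next group to the left
    return [len(c) for c in con]


print(refine_all("011023122"))          # [9, 1, 1, 6, 2, 1, 3, 1, 1]
```

For the string 011023122, the driving contexts are, in this order, 3, 23, 2, 122, 1, 023122 and 011023122, and the returned lengths form the Lyndon array of the string.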

Theorem 3 shows that if a t-complete group configuration

〈 G_{k}, \dots, G_{t + 1}, G_{t}, \dots, G_{1} 〉

is refined by

G_{t}

, then the resulting system of groups is a

(t + 1)

-complete group configuration. This allows carrying out the refinement in an iterative fashion.

Theorem 3.

Let

Conf = 〈 G_{k}, \dots, G_{t + 1}, G_{t}, \dots, G_{1} 〉

be a t-complete group configuration,

1 \leq t

. After performing the refinement of Conf by group

G_{t}

, the resulting system of groups denoted as

Conf^{'}

is a

(t + 1)

-complete group configuration.

Proof.

We carry out the proof in a series of claims. The symbols

g r ()

,

c o n ()

,

C ()

,

p r e v ()

, and

v a l ()

denote the functions for Conf, while

g r^{'} ()

,

c o n^{'} ()

,

C^{'} ()

,

p r e v^{'} ()

, and

v a l^{'} ()

denote the functions for

Conf^{'}

.

When a group

G_{t + 1}

is partitioned, a part of it is moved as the next group in the list, and we call it

H_{t + 1}

; thus,

G_{t + 1} ≺ H_{t + 1} ≺ G_{t}

. For details, please see (

R_{3}

) above.

Claim 1.

$Conf^{'}$ is a group configuration, i.e.,

(C_{1})

,

(C_{2})

,

(C_{3})

, and

(C_{4})

for

Conf^{'}

hold.

Proof of Claim 1.

(

C_{1}

) and (

C_{2}

) follow from the fact that the process is a refinement, i.e., a group is either preserved as is or is partitioned into two or more groups. The doubling of the contexts in Step

(R_{3})

guarantees that the increasing order of the contexts is preserved, i.e., (

C_{3}

) holds. For any

j \in G_{t}

so that

j = p r e v (i) \neq n i l

,

c o n (g r (p r e v (j)))

is Lyndon, and

c o n (g r (j))

is also Lyndon, while

c o n (g r (p r e v (j))) ≺ c o n (g r (j))

, so

c o n (g r (p r e v (j))) c o n {(g r (j))}^{v a l (j)}

is Lyndon as well; thus,

(C_{4})

holds.

To illustrate the concatenation: let us call

c o n (g r (p r e v (j)))

as A and

c o n (g r (j))

as B, and let

v a l (j) = m

, then we know that A is Lyndon and B is Lyndon and

A ≺ B

; so,

A B^{m}

is clearly Lyndon as if A and B were letters. □

This concludes the proof of Claim 1.
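The concatenation argument can be checked on small instances. The sketch below tests the Lyndon property directly from the definition (a non-empty string strictly smaller than all of its proper suffixes); the helper names `is_lyndon` and `doubled_context` are hypothetical, and the strings are contexts that occur when refining 011023122.

```python
def is_lyndon(w):
    """True iff w is non-empty and strictly smaller than each of its proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def doubled_context(a, b, m):
    """The context A B^m created in step (R_3)."""
    return a + b * m


# A and B are Lyndon and A precedes B, so A B^m is expected to be Lyndon.
for a, b, m in [("1", "2", 2), ("0", "1", 2), ("023", "122", 1), ("011", "023122", 1)]:
    assert is_lyndon(a) and is_lyndon(b) and a < b
    print(doubled_context(a, b, m), is_lyndon(doubled_context(a, b, m)))
# 122 True
# 011 True
# 023122 True
# 011023122 True
```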

Claim 2.

{C^{'} (i) | i \in 1 . . n}

is proper and has the Monge property, i.e.,

(C_{7})

and

(C_{8})

for

Conf^{'}

hold.

Proof of Claim 2.

Consider

C^{'} (i)

for some

i \in 1 . . n

. There are two possibilities:

$C^{'} (i) = C (i)$ or
$C^{'} (i) = C (i) C (i_{1}) \dots C (i_{v})$ , for some $i_{1}, i_{2}, \dots, i_{v} \in G_{t}$ , so that for any $1 \leq ℓ \leq v$ , $i = p r e v (i_{ℓ})$ , $C (i_{ℓ}) = c o n (G_{t})$ , $v = v a l (i_{ℓ})$ , and for any $1 \leq ℓ < v$ , $i_{ℓ + 1} = i_{ℓ} + | c o n (G_{t}) |$ . Note that $c o n (g r (i)) ≺ c o n (G_{t})$ .

Consider

C^{'} (i)

and

C^{'} (j)

for some

1 \leq i < j \leq n

.

Case $C^{'} (i) = C (i)$ and $C^{'} (j) = C (j)$ .
(a)
Show that $(C_{7} (a))$ holds.
If $C^{'} (j) ⊊ C^{'} (i)$ , then $C (j) ⊊ C (i)$ , and so, by $(C_{7} (a))$ for Conf, $c o n (G_{t}) ≺ c o n (g r (j))$ , and thus, $c o n^{'} (H_{t + 1}) ≺ c o n (G_{t}) ≺ c o n (g r (j)) = c o n^{'} (g r^{'} (j))$ . Therefore, $(C_{7} (a))$ for $Conf^{'}$ holds.
(b)
Show that $(C_{8})$ holds. If $C^{'} (i) \cap C^{'} (j) \neq \emptyset$ , then $C (i) \cap C (j) \neq \emptyset$ , so $C (j) \subseteq C (i)$ , and so, $C^{'} (j) \subseteq C^{'} (i)$ ; so, $(C_{8})$ for $Conf^{'}$ holds.
Case $C^{'} (i) = C (i)$ and $C^{'} (j) = C (j) C (j_{1}) \dots C (j_{w})$ ,
where $w = v a l (j_{1})$ , $C (j_{1}) = \dots = C (j_{w}) = c o n (G_{t})$ , and $j_{1} \approx \dots \approx j_{w}$ .
(a)
Show that $(C_{7} (a))$ holds.
If $C^{'} (j) ⊊ C^{'} (i)$ , then $C (j) C (j_{1}) \dots C (j_{w}) ⊊ C (i)$ ; hence, $C (j) ⊊ C (i)$ , and so, by $(C_{7} (a))$ for Conf, $c o n (G_{t}) ≺ c o n (g r (j))$ . Thus, $g r (j)$ is one of the complete groups $G_{t - 1}, \dots, G_{1}$ , and by the t-completeness of Conf, $C (j)$ is a right-maximal Lyndon substring, a contradiction with $C (j) C (j_{1}) \dots C (j_{w})$ being Lyndon. This is an impossible case.
(b)
Show that $(C_{8})$ holds.
If $C^{'} (i) \cap C^{'} (j) \neq \emptyset$ , then $C (j) \subseteq C (i)$ by $(C_{8})$ for Conf. By $(C_{7} (a))$ for Conf, $C (j)$ cannot be a suffix of $C (i)$ as $c o n (g r (j)) ≺ c o n (G_{t})$ . Hence, $C (i) \cap C (j_{1}) \neq \emptyset$ , and so, $C (j) C (j_{1}) \subseteq C (i)$ ; and since $C (j_{1})$ cannot be a suffix of $C (i)$ as $g r (j_{1}) = G_{t}$ , it follows that $C (i) \cap C (j_{2}) \neq \emptyset$ , ..., ultimately giving $C (j) C (j_{1}) \dots C (j_{w}) \subseteq C (i)$ . Therefore, $(C_{8})$ for $Conf^{'}$ holds.
Case $C^{'} (i) = C (i) C (i_{1}) \dots C (i_{v})$ and $C^{'} (j) = C (j)$ ,
where $v = v a l (i_{1})$ , $C (i_{1}) = \dots = C (i_{v}) = c o n (G_{t})$ , and $i_{1} \approx \dots \approx i_{v}$ .
(a)
Show that $(C_{7} (a))$ holds.
If $C^{'} (j) ⊊ C^{'} (i)$ , then either $C (j) ⊊ C (i)$ , which implies by $(C_{7} (a))$ for Conf that $c o n (G_{t}) ≺ c o n (g r (j))$ , giving $c o n^{'} (H_{t + 1}) ≺ c o n^{'} (G_{t}) = c o n (G_{t}) ≺ c o n (g r (j)) = c o n^{'} (g r^{'} (j))$ , or $C (j) \subseteq C (i_{ℓ})$ for some $1 \leq ℓ \leq v$ . If $C (j) = C (i_{ℓ})$ , then $g r (j) = g r (i_{ℓ}) = G_{t}$ , giving $c o n^{'} (H_{t + 1}) ≺ c o n (G_{t}) = c o n (g r (j))$ . Therefore $(C_{7} (a))$ for $Conf^{'}$ holds.
(b)
Show that $(C_{8})$ holds.
Let $C^{'} (i) \cap C^{'} (j) \neq \emptyset$ . Consider $D = {i_{ℓ} | 1 \leq ℓ \leq v \text{ and } C (j) \cap C (i_{ℓ}) \neq \emptyset}$ .
Assume that $D \neq \emptyset$ :
By $(C_{8})$ for Conf, either $C (j) \subseteq ⋃_{i_{ℓ} \in D} C (i_{ℓ}) \subseteq C^{'} (i)$ , and we are done, or $⋃_{i_{ℓ} \in D} C (i_{ℓ}) \subseteq C (j)$ . Let $i_{𝓀}$ be the smallest element of $D$ . Since $C (i_{𝓀})$ cannot be the prefix of $C (j)$ , it means that $i_{𝓀} = i_{1}$ . Since $C (i_{1})$ cannot be a prefix of $C (j)$ , it means that $C (i) \cap C (j) \neq \emptyset$ , and so, $C (j) \subseteq C (i)$ , which contradicts the fact that $C (j) \subseteq ⋃_{i_{ℓ} \in D} C (i_{ℓ}) \subseteq C^{'} (i)$ .
Assume that $D = \emptyset$ :
Then, $C (i) \cap C (j) \neq \emptyset$ , and so, by $(C_{8})$ for Conf, $C (j) \subseteq C (i) \subseteq C^{'} (i)$ as $i < j$ .
Case $C^{'} (i) = C (i) C (i_{1}) \dots C (i_{v})$ and $C^{'} (j) = C (j) C (j_{1}) \dots C (j_{w})$ ,
where $v = v a l (i_{1})$ , $C (i_{1}) = \dots = C (i_{v}) = c o n (G_{t})$ , and $i_{1} \approx \dots \approx i_{v}$ and where $w = v a l (j_{1})$ , $C (j_{1}) = \dots = C (j_{w}) = c o n (G_{t})$ , and $j_{1} \approx \dots \approx j_{w}$ .
(a)
Show that $(C_{7} (a))$ holds.
Let $C^{'} (j) ⊊ C^{'} (i)$ . If $C (j) \subseteq C (i)$ , then $c o n (G_{t}) ≺ c o n (g r (j))$ , implying that $C (j)$ is maximal, which contradicts $C (j) C (j_{1}) \dots C (j_{w})$ being Lyndon. Thus, $C (j) ⊊ C (i_{ℓ})$ for some $1 \leq ℓ \leq v$ . However, then, $c o n (G_{t}) ≺ c o n (g r (j))$ , implying that $C (j)$ is maximal, again a contradiction. This is an impossible case.
(b)
Show that $(C_{8})$ holds.
Let $C^{'} (i) \cap C^{'} (j) \neq \emptyset$ . Let us first assume that $C (i) \cap C (j) \neq \emptyset$ . Then, $C (j) \subseteq C (i)$ . Since $C (j)$ cannot be a suffix of $C (i)$ , it follows that $C (i) \cap C (j_{1}) \neq \emptyset$ . Therefore, $C (j) C (j_{1}) \subseteq C (i)$ . Repeating this argument leads to $C (j) C (j_{1}) \dots C (j_{w}) \subseteq C (i)$ , and we are done.
Therefore, assume that $C (i) \cap C (j) = \emptyset$ . Let $1 \leq ℓ \leq v$ be the smallest such that $C (i_{ℓ}) \cap C (j) \neq \emptyset$ . Such an ℓ must exist. Then, $i_{ℓ} \leq j$ . If $i_{ℓ} = j$ , then either $C (i_{ℓ})$ is a prefix of $C (j)$ or vice versa, both impossibilities; hence, $i_{ℓ} < j$ . Repeating the same arguments as for i, we get that $C (j) C (j_{1}) \dots C (j_{w}) \subseteq C (i_{ℓ})$ , and so, we are done.

It remains to show that

(C_{7} (b))

for

Conf^{'}

holds.

Consider

C^{'} (i)

immediately followed by

C^{'} (j)

with

C^{'} (i) ≺ C^{'} (j)

.

Assume that $g r^{'} (j) \in {G_{t - 1}, \dots, G_{1}}$ .
Then, $c o n (G_{t}) = c o n^{'} (G_{t})$ , $g r (j) = g r^{'} (j)$ , and $c o n (g r (j)) = c o n^{'} (g r^{'} (j))$ . If $C^{'} (i) = C (i)$ , then $C (i) ≺ C (j)$ , and $C (i)$ is immediately followed by $C (j)$ , so by $(C_{7} (b))$ for Conf, we have a contradiction. Thus, $C^{'} (i) = C (i) C (i_{1}) \dots C (i_{v})$ for $v = v a l (i)$ and $c o n (g r (i_{v})) = c o n (G_{t}) ≺ c o n (g r (j))$ , and $C (i_{v})$ is immediately followed by $C (j)$ , a contradiction by $(C_{7} (b))$ for Conf.
Assume that $g r^{'} (j) = G_{t}$ .
Then, the group $g r (i)$ is partitioned when refining by $G_{t}$ , and so, $C^{'} (i) = c o n^{'} (g r^{'} (i)) = c o n (g r (i)) C {(j)}^{v}$ for $v = v a l (j)$ . Since $C^{'} (i)$ is immediately followed by $C^{'} (j) = c o n (G_{t})$ , we have again a contradiction, as it implies that $v a l (j) = v + 1$ .

□

This concludes the proof of Claim 2.

Claim 3.

The function

{prev}^{'}

is proper on

H_{t + 1}

, i.e.,

(C_{6})

for

Conf^{'}

holds.

Proof of Claim 3.

Let

j = p r e v^{'} (i)

and

i \in H_{t + 1}

with

v a l^{'} (i) = v

. Then,

| {[i]}_{_{\approx}} | = v

, and so,

{[i]}_{_{\approx}} = {i_{1}, \dots, i_{v}}

, where

i_{1} < i_{2} < \dots < i_{v}

. Hence,

i_{1}, \dots, i_{v} \in H_{t + 1}

,

C^{'} (i_{1}) = \dots = C^{'} (i_{v}) = c o n^{'} (H_{t + 1})

, and

j = p r e v^{'} (i) = p r e v^{'} (i_{1}) = \dots = p r e v^{'} (i_{v})

, and so,

j < i_{1}

. It remains to show that

C^{'} (j) C^{'} (i_{1}) \dots C^{'} (i_{v})

is a prefix of

x [j . . n]

. It suffices to show that

C^{'} (j)

is immediately followed by

C^{'} (i_{1})

.

If

C^{'} (j) \cap C^{'} (i_{1}) \neq \emptyset

, then by the Monge property

(C_{8})

,

C^{'} (i_{1}) \subseteq C^{'} (j)

as

j < i_{1}

, and so, by

(C_{7} (a))

,

c o n^{'} (H_{t + 1}) ≺ c o n^{'} (g r^{'} (i_{1})) = c o n^{'} (H_{t + 1})

, a contradiction.

Thus,

C^{'} (j) \cap C^{'} (i_{1}) = \emptyset

. Set

j_{1} = j + | c o n^{'} (g r^{'} (j)) |

. It follows that

j_{1} \leq i_{1}

. Assume that

j_{1} < i_{1}

. Since

j = p r e v^{'} (i_{1})

and

j < i_{1}

,

c o n^{'} (g r^{'} (j_{1})) ⪰ c o n^{'} (g r^{'} (i_{1})) = c o n^{'} (H_{t + 1})

. Since

j_{1} \notin H_{t + 1}

,

c o n^{'} (g r^{'} (j_{1})) ≻ c o n^{'} (H_{t + 1})

. Consider

C^{'} (j_{1})

. If

C^{'} (j_{1}) \cap C^{'} (i_{1}) \neq \emptyset

, then by

(C_{8})

,

C^{'} (i_{1}) \subseteq C^{'} (j_{1})

, and so, by

(C_{7} (a))

,

c o n^{'} (H_{t + 1}) ≺ c o n^{'} (g r^{'} (i_{1})) = c o n^{'} (H_{t + 1})

, a contradiction. Thus,

C^{'} (j_{1}) \cap C^{'} (i_{1}) = \emptyset

. Since

C^{'} (j_{1})

immediately follows

C^{'} (j)

, by

(C_{7} (b))

,

c o n^{'} (g r^{'} (j_{1})) ⪯ c o n^{'} (H_{t + 1})

, a contradiction. Therefore,

j_{1} = i_{1}

, and so,

p r e v^{'}

is proper on

H_{t + 1}

. □

This concludes the proof of Claim 3.

Claim 4.

H_{t + 1}

is a complete group, i.e.,

(C_{5})

for

Conf^{'}

holds.

Proof of Claim 4.

Assume that there is

i \in H_{t + 1}

so that

C^{'} (i)

is not maximal, i.e., for some

k \geq i + | c o n^{'} (H_{t + 1}) |

,

x [i . . k]

is a right-maximal Lyndon substring of

x

.

Either

k = n

and so

c o n^{'} (g r^{'} (k)) = x [k]

, and so,

C^{'} (k)

is a suffix of

x [i . . k]

, or

k < n

, and then,

x [k + 1] ≺ x [k]

, since

x [k + 1] ⪰ x [k]

implies that

x [i . . k + 1]

is Lyndon, a contradiction with the right-maximality of

x [i . . k]

. Consider

C^{'} (k)

, then

C^{'} (k) \subseteq x [i . . k]

, and so,

C^{'} (k) = x [k]

.

Therefore, there is

j_{1}

so that

i + | c o n^{'} (H_{t + 1}) | \leq j_{1} \leq k

, and

C^{'} (j_{1})

is a suffix of

x [i . . k]

. Take the smallest such

j_{1}

. If

j_{1} = i + | c o n^{'} (H_{t + 1}) |

, then

C^{'} (i) ≺ C^{'} (j_{1})

as

x [i . . k] = C^{'} (i) C^{'} (j_{1})

is Lyndon. By

(C_{7} (b))

,

C^{'} (j_{1}) ⪯ c o n^{'} (H_{t + 1})

, so we have

c o n^{'} (H_{t + 1}) = C^{'} (i) ≺ C^{'} (j_{1}) ⪯ c o n^{'} (H_{t + 1})

, a contradiction.

Therefore,

j_{1} > i + | c o n^{'} (H_{t + 1}) |

. Consider

x [j_{1} - 1]

. If

x [j_{1} - 1] ⪯ x [j_{1}]

,

x [j_{1} - 1 . . k]

is Lyndon, and since

x [j_{1} . . k] = C^{'} (j_{1})

,

x [j_{1} - 1 . . k]

would be a context of

g r^{'} (j_{1} - 1)

, this contradicts the fact

j_{1}

was chosen to be the smallest such one. Therefore,

x [j_{1} - 1] ≻ x [j_{1}]

, and so,

c o n^{'} (g r^{'} (j_{1} - 1)) = x [j_{1} - 1]

. Thus, there is

j_{2}

,

i + | c o n^{'} (H_{t + 1}) | \leq j_{2} < j_{1} \leq k

, and

C^{'} (j_{2})

is a suffix of

x [i . . j_{1} - 1]

. Take the smallest such

j_{2}

. If

C^{'} (j_{2}) ≺ C^{'} (j_{1})

, then by

(C_{7} (b))

,

C^{'} (j_{1}) ⪯ c o n^{'} (H_{t + 1})

, a contradiction. Hence,

C^{'} (j_{2}) ⪰ C^{'} (j_{1})

. If

j_{2} = i + | c o n^{'} (H_{t + 1}) |

, then

x [i . . k] = C^{'} (i) C^{'} (j_{2}) C^{'} (j_{1})

, and so, by

(C_{7} (b))

,

C^{'} (j_{2}) ⪯ c o n^{'} (H_{t + 1})

, a contradiction. Hence,

i + | c o n^{'} (H_{t + 1}) | < j_{2}

.

The same argument done for

j_{2}

can now be done for

j_{3}

. We end up with

i + | c o n^{'} (H_{t + 1}) | \leq j_{3} < j_{2} < j_{1} \leq k

and with

C^{'} (j_{3}) ⪰ C^{'} (j_{2}) ⪰ C^{'} (j_{1}) ≻ c o n^{'} (H_{t + 1})

. If

i + | c o n^{'} (H_{t + 1}) | = j_{3}

, then we have a contradiction, so

i + | c o n^{'} (H_{t + 1}) | < j_{3}

. These arguments can be repeated only finitely many times, and we obtain

i + | c o n^{'} (H_{t + 1}) | = j_{ℓ} < j_{ℓ - 1} < \dots < j_{2} < j_{1} \leq k

so that

x [i . . k] = C^{'} (i) C^{'} (j_{ℓ}) C^{'} (j_{ℓ - 1}) \dots C^{'} (j_{2}) C^{'} (j_{1})

, which is a contradiction.

Therefore, our initial assumption that

C^{'} (i)

is not maximal always leads to a contradiction. □

This concludes the proof of Claim 4.

The four claims show that all the conditions

(C_{1})

...

(C_{8})

are satisfied for

Conf^{'}

, and that proves Theorem 3.

As the last step, we show that when the process of refinement is completed, all right-maximal Lyndon substrings of

x

are identified and sorted via the contexts of the groups of the final configuration.

Theorem 4.

Let

{Conf}_{1} = 〈 G_{k_{1}}^{1}, G_{k_{1} - 1}^{1}, \dots, G_{2}^{1}, G_{1}^{1} 〉

with

g r_{1} ()

,

c o n_{1} ()

,

C_{1} ()

,

p r e v_{1} ()

, and

v a l_{1} ()

be the initial 1-complete group configuration from Lemma 9.

Let

{Conf}_{2} = 〈 G_{k_{2}}^{2}, G_{k_{2} - 1}^{2}, \dots, G_{2}^{2}, G_{1}^{2} 〉

with

g r_{2} ()

,

c o n_{2} ()

,

C_{2} ()

,

p r e v_{2} ()

, and

v a l_{2} ()

be the 2-complete group configuration obtained from

{Conf}_{1}

through the refinement by the group

G_{1}^{1}

.

Let

{Conf}_{3} = 〈 G_{k_{3}}^{3}, G_{k_{3} - 1}^{3}, \dots, G_{2}^{3}, G_{1}^{3} 〉

with

g r_{3} ()

,

c o n_{3} ()

,

C_{3} ()

,

p r e v_{3} ()

, and

v a l_{3} ()

be the 3-complete group configuration obtained from

{Conf}_{2}

through the refinement by the group

G_{2}^{2}

.

...

Let

{Conf}_{r} = 〈 G_{k_{r}}^{r}, G_{k_{r} - 1}^{r}, \dots, G_{2}^{r}, G_{1}^{r} 〉

with

g r_{r} ()

,

c o n_{r} ()

,

C_{r} ()

,

p r e v_{r} ()

, and

v a l_{r} ()

be the r-complete group configuration obtained from

{Conf}_{r - 1}

through the refinement by the group

G_{r - 1}^{r - 1}

. Let

{Conf}_{r}

be the final configuration after the refinement runs out.

Then,

x [i . . k]

is a right-maximal Lyndon substring of

x

iff

x [i . . k] = C_{r} (i) = c o n_{r} (g r_{r} (i))

.

Proof.

That all the groups of

{Conf}_{r}

are complete follows from Theorem 3, and hence, every

C_{r} (i)

is a right-maximal Lyndon string. Let

x [i . . k]

be a right-maximal Lyndon substring of

x

. Consider

C_{r} (i)

; since it is maximal, it must be equal to

x [i . . k]

. □
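The statement of Theorem 4 can be cross-checked on small inputs against a brute-force computation straight from the definition. The sketch below is such a reference computation only (cubic time in the worst case), with 0-based indices and hypothetical helper names; it reports, for every position, the length of the longest Lyndon substring starting there.

```python
def is_lyndon(w):
    """True iff w is non-empty and strictly smaller than each of its proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def lyndon_array_bruteforce(x):
    """For each i, the length of the right-maximal Lyndon substring of x starting at i."""
    n = len(x)
    result = []
    for i in range(n):
        best = 1                                  # a single letter is always Lyndon
        for length in range(2, n - i + 1):
            if is_lyndon(x[i:i + length]):
                best = length
        result.append(best)
    return result


print(lyndon_array_bruteforce("011023122"))       # [9, 1, 1, 6, 2, 1, 3, 1, 1]
```

The lengths of the contexts of the final configuration are expected to agree with this array, position by position.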

4.3. Motivation for the Refinement

The process of refinement is in fact a process of the gradual revealing of the Lyndon substrings, which we call the water draining method:

(a): lower the water level by one;
(b): extend the existing Lyndon substrings; the revealed letters are used to extend the existing Lyndon substrings where possible, or become Lyndon substrings of length one otherwise;
(c): consolidate the new Lyndon substrings; processed from the right, if several Lyndon substrings are adjacent and can be joined to a longer Lyndon substring, they are joined.

The diagram in Figure 5 and the description that follows it illustrate the method for a string 011023122. The input string is visualized as a curve, and the height at each point is the value of the letter at that position.

In Figure 5, we illustrate the process:

(1): We start with the string 011023122 and a full tank of water.
(2): We drain one level; only 3 is revealed; there is nothing to extend, nothing to consolidate.
(3): We drain one more level, and three 2’s are revealed; the first 2 extends 3 to 23, and the remaining two 2’s form Lyndon substrings 2 of length one; there is nothing to consolidate.
(4): We drain one more level, and three 1’s are revealed; the first two 1’s form Lyndon substrings 1 of length one; the third 1 extends 22 to 122; there is nothing to consolidate.
(5): We drain one more level, and two 0’s are revealed; the first 0 extends 11 to 011; the second 0 extends 23 to 023; in the consolidation phase, 023 is joined with 122 to form a Lyndon substring 023122, and then, 011 is joined with 023122 to form a Lyndon substring 011023122.

Therefore, during the process, the following right-maximal Lyndon substrings were identified: 3 at Position 6, 23 at Position 5, 2 at Positions 8 and 9, 1 at Positions 2 and 3, 122 at Position 7, 023122 at Position 4 (obtained by joining 023 with 122), and finally, 011023122 at Position 1. Note that all positions are accounted for; we really have all right-maximal Lyndon substrings of the string 011023122.

In Figure 6, we present an illustrative example for the string 011023122, where the arrows represent the

p r e v

mapping shown only on the group used for the refinement. The groups used for the refinement are indicated by the bold font.

4.4. The Complexity of BSLA

The computation of the initial configuration can be done in linear time. To compute the initial value of

prev

in linear time, a stack-based approach similar to the NSV algorithm is used. Since all groups are non-empty, there can never be more groups than n. Theorem 3 is at the heart of the algorithm. The refinement by the last completed group is linear in the size of the group, including the update of

prev

. Therefore, the overall worst-case complexity of BSLA is linear in the length of the input string.
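The stack-based computation of the initial prev can be sketched as follows. This is a minimal illustration of the idea, assuming 0-based indices and the initial configuration in which the context of every index is its own letter; the function name `initial_prev` is hypothetical, and `None` plays the role of nil.

```python
def initial_prev(x):
    """prev for the initial configuration: largest j < i with x[j] < x[i], or None."""
    prev = [None] * len(x)
    stack = []                           # indices whose letters are strictly increasing
    for i, letter in enumerate(x):
        while stack and x[stack[-1]] >= letter:
            stack.pop()                  # these indices can never be prev of i or of a later index
        if stack:
            prev[i] = stack[-1]
        stack.append(i)
    return prev


print(initial_prev("011023122"))
# [None, 0, 0, None, 3, 4, 3, 6, 6]
```

Every index is pushed once and popped at most once, so the scan takes linear time.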

5. Data and Measurements

Initially, computations were performed on the Department of Computing and Software’s moore server; memory: 32 GB (DDR4 @ 2400 MHz), CPU: 8 × Intel Xeon E5-2687W v4 @ 3.00 GHz, OS: Linux Version 2.6.18-419.el5 (gcc Version 4.1.2 and Red Hat Version 4.1.2-55). To verify correctness, new randomized data were produced and computed independently on the University of Toronto Mississauga’s octolab cluster; memory: 8 × 32 GB (DDR4 @ 3200 MHz), CPU: 8 × AMD Ryzen Threadripper 1920X (12-Core) @ 4.00 GHz, OS: Ubuntu 16.04.6 LTS (gcc Version 5.4.0). The results of both were extremely similar, and those reported herein are those generated using the moore server. All the programs were compiled without any additional level of optimization (i.e., neither -O1, nor -O2, nor -O3 flags were specified for the compilation). The CPU time was measured in clock ticks with 1,000,000 clock ticks per second. Since the execution time was negligible for short strings, the processing of the same string was repeated several times (the repeat factor varied from

10^{6}

, for strings of length 10, to one, for strings of length

5 \times 10^{6}

), resulting in a higher precision. For graphing, a logarithmic scale was used for both axes, with the x-axis representing the length of the strings and the y-axis representing the time. We used four categories of randomly generated datasets:

(1): bin
random strings over an integer alphabet with exactly two distinct letters (kind of binary strings).
(2): dna
random strings over an integer alphabet with exactly four distinct letters (kind of random DNA strings).
(3): eng
random strings over an integer alphabet with exactly 26 distinct letters (kind of random English).
(4): int
random strings over an integer alphabet (i.e., over the alphabet ${0, \dots, n - 1}$ ).

Each dataset contains 100 randomly generated strings of the same length. For each category, there were datasets for lengths 10, 50,

10^{2}

,

5 \times 10^{2}

, ...,

10^{5}

,

5 \times 10^{5}

,

10^{6}

, and

5 \times 10^{6}

. The minimum, average, and maximum times for each dataset were computed. Since the variance for each dataset was minimal, the results for minimum times and the results for maximum times completely mimicked the results for the average times, so we only present the averages here.
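A dataset of the kind described above could be generated along the following lines. This is a sketch under stated assumptions, not the generator used for the reported measurements: the names `ALPHABET_SIZES`, `random_string` and `make_dataset` are hypothetical, letters are drawn uniformly with Python's `random` module, and the sketch does not enforce that every letter of the alphabet actually occurs in a short string.

```python
import random

# Alphabet sizes for the fixed-alphabet categories; "int" uses the alphabet {0, ..., n - 1}.
ALPHABET_SIZES = {"bin": 2, "dna": 4, "eng": 26}


def random_string(category, n, rng):
    """A random string of length n, represented as a list of integer letters."""
    sigma = n if category == "int" else ALPHABET_SIZES[category]
    return [rng.randrange(sigma) for _ in range(n)]


def make_dataset(category, n, count=100, seed=0):
    """`count` random strings of the same length n (100 per dataset in the text)."""
    rng = random.Random(seed)            # fixed seed so that a dataset can be regenerated
    return [random_string(category, n, rng) for _ in range(count)]


dataset = make_dataset("dna", 50)
print(len(dataset), len(dataset[0]))     # 100 50
```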

Tables 1–4 and the graphs in Figures 7–10 from Section 7 clearly indicate that the performance of the three algorithms is linear and virtually indistinguishable. We expected IDLA and TRLA to exhibit linear behaviour on random strings as such strings tend to have many, but short right-maximal Lyndon substrings. However, we did not expect the results to be so close.

Despite the fact that IDLA performed in linear time on the random strings, it is relatively easy to force it into its worst quadratic performance. The dataset extreme_idla contains individual strings

0123 \dots n - 1

of the required lengths. Table 5 and the graph in Figure 11 from Section 7 show this clearly.
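The quadratic behaviour on a strictly increasing string can be observed with a small sketch of an iterated Duval-style computation. It is an illustration of the idea behind IDLA rather than the implementation that was measured: for every start position it runs the scanning loop of Duval's factorization algorithm once to obtain the longest Lyndon prefix of the suffix, and it counts the loop iterations in a hypothetical counter `steps`.

```python
def longest_lyndon_prefix(x, i):
    """Length of the longest Lyndon prefix of x[i:], plus the number of loop iterations used."""
    n = len(x)
    j, k, steps = i + 1, i, 0
    while j < n and x[k] <= x[j]:
        k = i if x[k] < x[j] else k + 1   # restart the comparison or continue matching a repetition
        j += 1
        steps += 1
    return j - k, steps


def iterated_duval(x):
    """Lyndon array by running the scan from every position; returns (array, total steps)."""
    lengths, total = [], 0
    for i in range(len(x)):
        length, steps = longest_lyndon_prefix(x, i)
        lengths.append(length)
        total += steps
    return lengths, total


print(iterated_duval("011023122")[0])     # [9, 1, 1, 6, 2, 1, 3, 1, 1]
print(iterated_duval(list(range(100)))[1])  # 4950
```

For the string 0123...n-1, every scan runs to the end of the string, so the total number of iterations is n(n-1)/2, which is 4950 for n = 100.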

In Section 3.5, we describe how the three datasets, extreme_trla, extreme_trla1, and extreme_trla2, are generated and why. The results of experimenting with these datasets do not suggest that the worst-case complexity for TRLA is

O (n log (n))

. Yet again, the performances of the three algorithms are linear and virtually indistinguishable; see Tables 6–8 and the graphs in Figures 12–14 in Section 7.

6. Conclusions and Future Work

We present two novel algorithms for computing right-maximal Lyndon substrings. The first one, TRLA, has a simple implementation with a complicated theory behind it. Its average time complexity is linear in the length of the input string, and its worst-case complexity is no worse than

O (n log (n))

. The

τ

-reduction used in the algorithm is an interesting reduction preserving right-maximal Lyndon substrings, a fact used significantly in the design of the algorithm. Interestingly, it seems to slightly outperform BSLA, at least on the datasets used for our experiments. BSLA, the second algorithm, is linear and elementary in the sense that it does not require a pre-processed global data structure. Being linear and elementary, BSLA is more interesting, and it is possible that its performance could be more streamlined. However, both the theory and implementation of BSLA are rather complex.

On random strings, neither of the two algorithms was significantly better than the simple IDLA, whose implementation is just a few lines. However, the quadratic worst-case complexity of IDLA is an obstacle, as our experiments indicated.

Additional effort needs to go into proving TRLA’s worst-case complexity. The experiments performed gave no indication that it is worse than linear, even in the worst case. Both algorithms need to be compared to some efficient implementation of SSLA and BWLA.

7. Results

This section contains the measurements of the average times for the datasets discussed in the previous section. For better understanding of the data, we present them in Tables 1–8 and Figures 7–14. All the graphs include the curve

x = y

for reference.

Author Contributions

Conceptualization, F.F. and M.L.; methodology, F.F. and M.L.; software, F.F. and M.L.; validation, F.F. and M.L.; formal analysis, F.F. and M.L.; investigation, F.F. and M.L.; resources, F.F. and M.L.; data curation, F.F. and M.L.; writing, original draft preparation, F.F. and M.L.; writing, review and editing, F.F. and M.L.; visualization, F.F. and M.L.; supervision, F.F. and M.L.; project administration, F.F. and M.L.; funding acquisition, F.F. Both authors read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) Grant RGPIN/5504-2018.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BSLA	Baier’s Sort Lyndon Array
IDLA	Iterative Duval Lyndon Array
TRLA	Tau Reduction Lyndon Array

References

Lyndon, R.C. On Burnside’s Problem. II. Trans. Am. Math. Soc. 1955, 78, 329–332.
Marcus, S.; Sokol, D. 2D Lyndon words and applications. Algorithmica 2017, 77, 116–133.
Berstel, J.; Perrin, D. The origins of combinatorics on words. Eur. J. Comb. 2007, 28, 996–1022.
Chen, K.; Fox, R.; Lyndon, R. Free differential calculus IV. The quotient groups of the lower central series. Ann. Math. 2nd Ser. 1958, 68, 81–95.
Golomb, S. Irreducible polynomials, synchronizing codes, primitive necklaces and cyclotomic algebra. Comb. Math. Appl. 1967, 4, 358–370.
Flajolet, P.; Gourdon, X.; Panario, D. The complete analysis of a polynomial factorization algorithm over finite fields. J. Algorithms 2001, 40, 37–81.
Panario, D.; Richmond, B. Smallest components in decomposable structures: exp-log class. Algorithmica 2001, 29, 205–226.
Duval, J.P. Factorizing words over an ordered alphabet. J. Algorithms 1983, 4, 363–381.
Berstel, J.; Pocchiola, M. Average cost of Duval’s algorithm for generating Lyndon words. Theor. Comput. Sci. 1994, 132, 415–425.
Fredricksen, H.; Maiorana, J. Necklaces of beads in k colors and k-ary de Bruijn sequences. Discret. Math. 1983, 23, 207–210.
Bannai, H.; Tomohiro, I.; Inenaga, S.; Nakashima, Y.; Takeda, M.; Tsuruta, K. The “Runs” Theorem. SIAM J. Comput. 2017, 46, 1501–1514.
Franek, F.; Paracha, A.; Smyth, W. The linear equivalence of the suffix array and the partially sorted Lyndon array. In Proceedings of the Prague Stringology Conference, Prague, Czech Republic, 28–30 August 2017; pp. 77–84.
Baier, U. Linear-Time Suffix Sorting—A New Approach for Suffix Array Construction. Master’s Thesis, University of Ulm, Ulm, Germany, 2015.
Baier, U. Linear-Time Suffix Sorting—A New Approach for Suffix Array Construction. In Proceedings of the 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016), Tel Aviv, Israel, 27–29 June 2016; Grossi, R., Lewenstein, M., Eds.; Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik: Dagstuhl, Germany, 2016; Volume 54, pp. 1–12.
Chen, G.; Puglisi, S.; Smyth, W. Lempel-Ziv factorization using less time & space. Math. Comput. Sci. 2013, 1, 605–623.
Crochemore, M.; Ilie, L.; Smyth, W. A simple algorithm for computing the Lempel-Ziv factorization. In Proceedings of the 18th Data Compression Conference, Snowbird, UT, USA, 25–27 March 2008; pp. 482–488.
Kosolobov, D. Lempel-Ziv factorization may be harder than computing all runs. In Proceedings of the 32nd International Symposium on Theoretical Aspects of Computer Science—STACS 2015, Garching, Germany, 4–7 March 2015; pp. 582–593.
Digelmann, C. (Frankfurt, Germany). Personal communication, 2016.
Franek, F.; Sohidull Islam, A.; Sohel Rahman, M.; Smyth, W. Algorithms to compute the Lyndon array. In Proceedings of the Prague Stringology Conference 2016, Prague, Czech Republic, 29–31 August 2016; pp. 172–184.
Hohlweg, C.; Reutenauer, C. Lyndon words, permutations and trees. Theor. Comput. Sci. 2003, 307, 173–178.
Nong, G.; Zhang, S.; Chan, W.H. Linear suffix array construction by almost pure induced-sorting. In Proceedings of the 2009 Data Compression Conference, Snowbird, UT, USA, 16–18 March 2009; pp. 193–202.
Louza, F.; Smyth, W.; Manzini, G.; Telles, G. Lyndon array construction during Burrows–Wheeler inversion. J. Discret. Algorithms 2018, 50, 2–9.
Franek, F.; Liut, M.; Smyth, W. On Baier’s sort of maximal Lyndon substrings. In Proceedings of the Prague Stringology Conference 2018, Prague, Czech Republic, 27–28 August 2018; pp. 63–78.
C++ Code for IDLA, TRLA and BSLA Algorithms. Available online: https://github.com/MichaelLiut/Computing-LyndonArray (accessed on 3 November 2020).
Farach, M. Optimal suffix tree construction with large alphabets. In Proceedings of the 38th IEEE Symp. Foundations of Computer Science, Miami Beach, FL, USA, 20–22 October 1997; pp. 137–143.
Nong, G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 2013, 31, 1–15.
Cooley, J.; Tukey, J. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301.
Franek, F.; Liut, M. Computing Maximal Lyndon Substrings of a String, AdvOL Report 2019/2, McMaster University. Available online: http://optlab.mcmaster.ca//component/option,com_docman/task,cat_view/gid,77/Itemid,92 (accessed on 1 March 2019).
Franek, F.; Liut, M. Algorithms to compute the Lyndon array revisited. In Proceedings of the Prague Stringology Conference 2019, Prague, Czech Republic, 26–28 August 2019; pp. 16–28.
Liut, M. Computing Lyndon Arrays. Ph.D. Thesis, McMaster University, Hamilton, ON, Canada, 2019.
Lothaire, M. Combinatorics on Words; Cambridge University Press: Cambridge, UK, 2003.
Lothaire, M. Applied Combinatorics on Words; Cambridge University Press: Cambridge, UK, 2005.
Smyth, B. Computing Patterns in Strings; Pearson Addison-Wesley: Boston, MA, USA, 2003.
Louza, F.; Gog, S.; Telles, G. Construction of Fundamental Data Structures for Strings; Springer: Cham, Switzerland, 2020.
Burkhardt, S.; Kärkkäinen, J. Fast Lightweight Suffix Array Construction and Checking. In Proceedings of the 14th Annual Conference on Combinatorial Pattern Matching, Michoacan, Mexico, 25–27 June 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 55–69.
Paracha, A. Lyndon Factors and Periodicities in Strings. Ph.D. Thesis, McMaster University, Hamilton, ON, Canada, 2017.
Kärkkäinen, J.; Sanders, P. Simple linear work suffix array construction. In Proceedings of the 30th International Conference on Automata, Languages and Programming, Eindhoven, The Netherlands, 30 June–4 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 943–955.

Figure 1. Algorithm IDLA.

Figure 2. Illustration of the $τ$-reduction of the string 011023122. The rounded rectangles indicate symbol $τ$-pairs; the ovals indicate the $τ$-pairs. Below are the colour labels of the positions; at the bottom is the $τ$-reduction.

Figure 3. Computing the partial Lyndon array of the input string.

Figure 4. Computing missing values of the Lyndon array of the input string.

Figure 5. The water draining method for 011023122. Stages (1)–(6) explained in the text.

Figure 6. Group refinement for 011023122.

Figure 7. Average times for dataset bin ($10^{6}$ clock ticks per second).

Figure 8. Average times for dataset dna ($10^{6}$ clock ticks per second).

Figure 9. Average times for dataset eng ($10^{6}$ clock ticks per second).

Figure 10. Average times for dataset int ($10^{6}$ clock ticks per second).

Figure 11. Average times for dataset extreme_idla ($10^{6}$ clock ticks per second).

Figure 12. Average times for dataset extreme_trla ($10^{6}$ clock ticks per second).

Figure 13. Average times for dataset extreme_trla1 ($10^{6}$ clock ticks per second).

Figure 14. Average times for dataset extreme_trla2 ($10^{6}$ clock ticks per second).

Table 1. Average times for dataset bin ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$3.651 \times 10^{-1}$	$1.582$	$1.054$
50	$4.082$	$1.050 \times 10$	$6.372$
100	$1.101 \times 10$	$2.140 \times 10$	$1.277 \times 10$
500	$8.655 \times 10$	$1.127 \times 10^{2}$	$6.786 \times 10$
1000	$1.975 \times 10^{2}$	$2.335 \times 10^{2}$	$1.484 \times 10^{2}$
5000	$1.278 \times 10^{3}$	$1.218 \times 10^{3}$	$8.595 \times 10^{2}$
10,000	$2.765 \times 10^{3}$	$2.423 \times 10^{3}$	$1.820 \times 10^{3}$
50,000	$1.665 \times 10^{4}$	$1.272 \times 10^{4}$	$1.018 \times 10^{4}$
100,000	$3.606 \times 10^{4}$	$2.523 \times 10^{4}$	$2.113 \times 10^{4}$
500,000	$2.071 \times 10^{5}$	$1.338 \times 10^{5}$	$1.493 \times 10^{5}$
1,000,000	$4.387 \times 10^{5}$	$2.717 \times 10^{5}$	$4.080 \times 10^{5}$
5,000,000	$2.483 \times 10^{6}$	$1.561 \times 10^{6}$	$3.098 \times 10^{6}$

Table 2. Average times for dataset dna ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$3.699 \times 10^{-1}$	$1.579$	$1.080$
50	$3.509$	$1.037 \times 10$	$6.627$
100	$8.898$	$2.109 \times 10$	$1.403 \times 10$
500	$6.403 \times 10$	$1.123 \times 10^{2}$	$7.228 \times 10$
1000	$1.431 \times 10^{2}$	$2.332 \times 10^{2}$	$1.544 \times 10^{2}$
5000	$8.749 \times 10^{2}$	$1.207 \times 10^{3}$	$9.039 \times 10^{2}$
10,000	$1.912 \times 10^{3}$	$2.460 \times 10^{3}$	$1.935 \times 10^{3}$
50,000	$1.134 \times 10^{4}$	$1.280 \times 10^{4}$	$1.110 \times 10^{4}$
100,000	$2.431 \times 10^{4}$	$2.588 \times 10^{4}$	$2.316 \times 10^{4}$
500,000	$1.383 \times 10^{5}$	$1.390 \times 10^{5}$	$1.781 \times 10^{5}$
1,000,000	$2.916 \times 10^{5}$	$2.865 \times 10^{5}$	$4.994 \times 10^{5}$
5,000,000	$1.643 \times 10^{6}$	$1.968 \times 10^{6}$	$3.752 \times 10^{6}$

Table 3. Average times for dataset eng ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$3.526 \times 10^{-1}$	$1.584$	$9.865 \times 10^{-1}$
50	$3.162$	$1.006 \times 10$	$5.960$
100	$7.315$	$2.057 \times 10$	$1.317 \times 10$
500	$4.996 \times 10$	$1.117 \times 10^{2}$	$7.245 \times 10$
1000	$1.112 \times 10^{2}$	$2.354 \times 10^{2}$	$1.542 \times 10^{2}$
5000	$6.722 \times 10^{2}$	$1.210 \times 10^{3}$	$9.087 \times 10^{2}$
10,000	$1.452 \times 10^{3}$	$2.427 \times 10^{3}$	$2.042 \times 10^{3}$
50,000	$8.505 \times 10^{3}$	$1.306 \times 10^{4}$	$1.301 \times 10^{4}$
100,000	$1.802 \times 10^{4}$	$2.688 \times 10^{4}$	$2.768 \times 10^{4}$
500,000	$1.025 \times 10^{5}$	$1.428 \times 10^{5}$	$2.381 \times 10^{5}$
1,000,000	$2.171 \times 10^{5}$	$3.253 \times 10^{5}$	$7.236 \times 10^{5}$
5,000,000	$1.206 \times 10^{6}$	$2.599 \times 10^{6}$	$6.092 \times 10^{6}$

Table 4. Average times for dataset int ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$3.547 \times 10^{-1}$	$1.645$	$9.794 \times 10^{-1}$
50	$3.032$	$9.992$	$5.609$
100	$7.279$	$2.032 \times 10$	$1.153 \times 10$
500	$4.845 \times 10$	$1.136 \times 10^{2}$	$6.184 \times 10$
1000	$1.057 \times 10^{2}$	$2.376 \times 10^{2}$	$1.294 \times 10^{2}$
5000	$6.428 \times 10^{2}$	$1.218 \times 10^{3}$	$7.753 \times 10^{2}$
10,000	$1.388 \times 10^{3}$	$2.544 \times 10^{3}$	$1.796 \times 10^{3}$
50,000	$8.055 \times 10^{3}$	$1.448 \times 10^{4}$	$1.088 \times 10^{4}$
100,000	$1.710 \times 10^{4}$	$2.943 \times 10^{4}$	$2.379 \times 10^{4}$
500,000	$9.829 \times 10^{4}$	$1.825 \times 10^{5}$	$2.740 \times 10^{5}$
1,000,000	$2.071 \times 10^{5}$	$4.827 \times 10^{5}$	$7.989 \times 10^{5}$
5,000,000	$1.162 \times 10^{6}$	$5.143 \times 10^{6}$	$6.635 \times 10^{6}$

Table 5. Average times for dataset extreme_idla ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$7.900 \times 10^{-1}$	$1.440$	$7.000 \times 10^{-1}$
50	$1.830 \times 10$	$8.200$	$3.600$
100	$7.190 \times 10$	$1.590 \times 10$	$6.800$
500	$1.778 \times 10^{3}$	$7.300 \times 10$	$3.550 \times 10$
1000	$7.105 \times 10^{3}$	$1.430 \times 10^{2}$	$6.900 \times 10$
5000	$1.776 \times 10^{5}$	$7.100 \times 10^{2}$	$3.400 \times 10^{2}$
10,000	$7.111 \times 10^{5}$	$1.550 \times 10^{3}$	$6.800 \times 10^{2}$
50,000	$1.784 \times 10^{7}$	$8.050 \times 10^{3}$	$3.400 \times 10^{3}$
100,000	$7.130 \times 10^{7}$	$1.600 \times 10^{4}$	$6.800 \times 10^{3}$
500,000	$1.783 \times 10^{9}$	$8.200 \times 10^{4}$	$3.700 \times 10^{4}$
1,000,000	$7.137 \times 10^{9}$	$1.660 \times 10^{5}$	$7.800 \times 10^{4}$
5,000,000	$1.813 \times 10^{11}$	$8.800 \times 10^{5}$	$4.950 \times 10^{5}$

Table 6. Average times for dataset extreme_trla ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$4.588 \times 10^{-1}$	$1.628$	$1.126$
50	$4.987$	$1.039 \times 10$	$7.112$
100	$1.275 \times 10$	$2.179 \times 10$	$1.439 \times 10$
500	$9.033 \times 10$	$1.101 \times 10^{2}$	$6.914 \times 10$
1000	$2.060 \times 10^{2}$	$2.222 \times 10^{2}$	$1.392 \times 10^{2}$
5000	$1.319 \times 10^{3}$	$1.171 \times 10^{3}$	$7.699 \times 10^{2}$
10,000	$2.896 \times 10^{3}$	$2.394 \times 10^{3}$	$1.652 \times 10^{3}$
50,000	$2.209 \times 10^{4}$	$1.263 \times 10^{4}$	$8.992 \times 10^{3}$
100,000	$3.965 \times 10^{4}$	$2.567 \times 10^{4}$	$1.862 \times 10^{4}$
500,000	$2.233 \times 10^{5}$	$1.349 \times 10^{5}$	$1.091 \times 10^{5}$
1,000,000	$4.734 \times 10^{5}$	$2.759 \times 10^{5}$	$3.104 \times 10^{5}$
5,000,000	$2.632 \times 10^{6}$	$1.458 \times 10^{6}$	$2.298 \times 10^{6}$

Table 7. Average times for dataset extreme_trla1 ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$5.040 \times 10^{-1}$	$1.600$	$1.117$
50	$5.910$	$1.042 \times 10$	$7.290$
100	$1.460 \times 10$	$2.145 \times 10$	$1.446 \times 10$
500	$1.146 \times 10^{2}$	$1.126 \times 10^{2}$	$6.979 \times 10$
1000	$2.662 \times 10^{2}$	$2.284 \times 10^{2}$	$1.379 \times 10^{2}$
5000	$1.694 \times 10^{3}$	$1.205 \times 10^{3}$	$7.853 \times 10^{2}$
10,000	$3.734 \times 10^{3}$	$2.477 \times 10^{3}$	$1.739 \times 10^{3}$
50,000	$2.276 \times 10^{4}$	$1.310 \times 10^{4}$	$9.683 \times 10^{3}$
100,000	$4.901 \times 10^{4}$	$2.796 \times 10^{4}$	$2.009 \times 10^{4}$
500,000	$2.928 \times 10^{5}$	$1.465 \times 10^{5}$	$1.238 \times 10^{5}$
1,000,000	$6.199 \times 10^{5}$	$3.000 \times 10^{5}$	$3.622 \times 10^{5}$
5,000,000	$3.432 \times 10^{6}$	$1.642 \times 10^{6}$	$2.778 \times 10^{6}$

Table 8. Average times for dataset extreme_trla2 ($10^{6}$ clock ticks per second).

String Length	IDLA (Time in Ticks)	TRLA (Time in Ticks)	BSLA (Time in Ticks)
10	$5.041 \times 10^{-1}$	$1.683$	$1.121$
50	$6.160$	$1.020 \times 10$	$7.257$
100	$1.526 \times 10$	$2.090 \times 10$	$1.441 \times 10$
500	$1.367 \times 10^{2}$	$1.074 \times 10^{2}$	$7.117 \times 10$
1000	$3.202 \times 10^{2}$	$2.135 \times 10^{2}$	$1.390 \times 10^{2}$
5000	$2.024 \times 10^{3}$	$1.145 \times 10^{3}$	$7.966 \times 10^{2}$
10,000	$4.500 \times 10^{3}$	$2.257 \times 10^{3}$	$1.762 \times 10^{3}$
50,000	$2.728 \times 10^{4}$	$1.172 \times 10^{4}$	$1.012 \times 10^{4}$
100,000	$5.941 \times 10^{4}$	$2.362 \times 10^{4}$	$2.115 \times 10^{4}$
500,000	$3.639 \times 10^{5}$	$1.262 \times 10^{5}$	$1.351 \times 10^{5}$
1,000,000	$7.719 \times 10^{5}$	$2.571 \times 10^{5}$	$3.915 \times 10^{5}$
5,000,000	$4.263 \times 10^{6}$	$1.323 \times 10^{6}$	$3.118 \times 10^{6}$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
