Article

Lyndon Factorization Algorithms for Small Alphabets and Run-Length Encoded Strings †

by Sukhpal Singh Ghuman 1, Emanuele Giaquinta 2 and Jorma Tarhio 3,*
1 Faculty of Applied Science & Technology, Sheridan College, 7899 McLaughlin Road, Brampton, ON L6Y 5H9, Canada
2 F-Secure Corporation, P.O.B. 24, FI-00181 Helsinki, Finland
3 Department of Computer Science, Aalto University, P.O.B. 15400, FI-00076 Aalto, Finland
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper: Ghuman, S.S.; Giaquinta, E.; Tarhio, J. Alternative algorithms for Lyndon factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, 1–3 September 2014; pp. 169–178.
Algorithms 2019, 12(6), 124; https://doi.org/10.3390/a12060124
Submission received: 24 May 2019 / Accepted: 17 June 2019 / Published: 21 June 2019
(This article belongs to the Special Issue String Matching and Its Applications)

Abstract: We present two modifications of Duval's algorithm for computing the Lyndon factorization of a string. The first algorithm is designed for strings containing runs of the smallest character. It works best for small alphabets and is able to skip a significant number of characters of the string. Moreover, it can be engineered to have linear time complexity in the worst case. Given a run-length encoded string R of length ρ, the second algorithm computes the Lyndon factorization of R in O(ρ) time and constant space. Experimental results show that the new variations are faster than Duval's original algorithm in many scenarios.

1. Introduction

A string w′ is a rotation of another string w if w = uv and w′ = vu, for some strings u and v. A string is a Lyndon word if it is lexicographically smaller than all its proper rotations. Chen, Fox and Lyndon [1] introduced the unique factorization of a string into Lyndon words such that the sequence of factors is nonincreasing according to the lexicographical order. The Lyndon factorization is a key structure in a method for sorting the suffixes of a text [2], which is applied in the construction of the Burrows-Wheeler transform and the suffix array, as well as in the bijective variant of the Burrows-Wheeler transform [3,4]. The Burrows-Wheeler transform is an invertible transformation of a string, based on sorting of its rotations, while the suffix array is a lexicographically sorted array of the suffixes of a string. They are the groundwork for many indexing and data compression methods.
Duval’s algorithm [5] computes the Lyndon factorization in linear time and in constant space. Various other solutions for computing the Lyndon factorization have been proposed in the past. A parallel algorithm [6] was presented by Apostolico and Crochemore, while Roh et al. described an external memory algorithm [7]. Recently, I et al. and Furuya et al. introduced algorithms to compute the Lyndon factorization of a string given in the grammar-compressed form and in the LZ78 encoding [8,9].
In this paper, we present two new variations of Duval's algorithm. The paper is an extended version of the conference paper [10]. The first algorithm is designed for strings containing runs of the smallest character. It works best for small alphabets, like the DNA alphabet {a, c, g, t}, and it is able to skip a significant portion of the string. The second variation works for strings compressed with run-length encoding. In run-length encoding, maximal sequences in which the same data value occurs in many consecutive data elements (called runs) are stored as a pair of a single data value and a count. Given a run-length encoded string R of length ρ, our algorithm computes the Lyndon factorization of R in O(ρ) time and constant space. This variation is thus preferable to Duval's algorithm when the strings are stored or maintained with run-length encoding. In our experiments, the new algorithms are considerably faster than the original one in the case of small alphabets, for both real and simulated data.
The rest of the paper is organized as follows. Section 2 defines background concepts, Section 3 presents Duval's algorithm, Section 4 and Section 5 introduce our variations of Duval's algorithm, Section 6 shows the results of our practical experiments, and the discussion of Section 7 concludes the article.

2. Basic Definitions

Let Σ be a finite ordered alphabet of σ symbols and let Σ* be the set of words (strings) over Σ ordered by lexicographic order. In this paper, we use the terms string, sequence, and word interchangeably. The empty word ε is a word of length 0. Let Σ+ be equal to Σ* ∖ {ε}. Given a word w, we denote with |w| the length of w and with w[i] the i-th symbol of w, for 0 ≤ i < |w|. The concatenation of two words u and v is denoted by uv. Given two words u and v, v is a substring of u if there are indices 0 ≤ i ≤ j < |u| such that v = u[i] ⋯ u[j]. If i = 0 (j = |u| − 1) then v is a prefix (suffix) of u. The substring u[i] ⋯ u[j] of u is denoted by u[i..j]; for i > j, u[i..j] = ε. We denote by u^k the concatenation of k u's, for u ∈ Σ+ and k ≥ 1. The longest border of a word w, denoted with β(w), is the longest proper prefix of w which is also a suffix of w. Let lcp(w, w′) denote the length of the longest common prefix of words w and w′. We write w < w′ if either lcp(w, w′) = |w| < |w′|, i.e., if w is a proper prefix of w′, or if w[lcp(w, w′)] < w′[lcp(w, w′)]. For any 0 ≤ i < |w|, ROT(w, i) = w[i..|w| − 1] w[0..i − 1] is a rotation of w. A Lyndon word is a word w such that w < ROT(w, i), for 1 ≤ i < |w|. Given a Lyndon word w, the following properties hold:
  • |β(w)| = 0;
  • either |w| = 1 or w[0] < w[|w| − 1].
Both properties imply that no word a^k, for a ∈ Σ and k ≥ 2, is a Lyndon word. The following result is due to Chen, Fox and Lyndon [11]:
Theorem 1.
Any word w admits a unique factorization CFL(w) = w_1, w_2, …, w_m, such that w_i is a Lyndon word, for 1 ≤ i ≤ m, and w_1 ≥ w_2 ≥ ⋯ ≥ w_m.
The interval of positions in w of the factor w_i in CFL(w) = w_1, w_2, …, w_m is [a_i, b_i], where a_i = ∑_{j=1}^{i−1} |w_j| and b_i = ∑_{j=1}^{i} |w_j| − 1, for i = 1, …, m. We assume the following property:
Property 1.
The output of an algorithm that, given a word w, computes the factorization CFL(w) is the sequence of intervals of positions of the factors in CFL(w).
The run-length encoding (RLE) of a word w, denoted by RLE(w), is a sequence of pairs (runs) (c_1, l_1), (c_2, l_2), …, (c_ρ, l_ρ) such that c_i ∈ Σ, l_i ≥ 1, c_i ≠ c_{i+1} for 1 ≤ i < ρ, and w = c_1^{l_1} c_2^{l_2} ⋯ c_ρ^{l_ρ}. The interval of positions in w of the run (c_i, l_i) is [a_i^rle, b_i^rle], where a_i^rle = ∑_{j=1}^{i−1} l_j and b_i^rle = ∑_{j=1}^{i} l_j − 1.
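As a concrete illustration of this definition, the following C sketch computes RLE(w) as an array of (symbol, length) pairs. The Run type and the function name are our own illustrative choices, not taken from the paper.

```c
#include <stddef.h>

/* One run of RLE(w): a symbol c_i and its multiplicity l_i. */
typedef struct { unsigned char c; size_t len; } Run;

/* Computes RLE(w) for a word w of length n into runs[] (which must have
 * room for at least n entries) and returns its length rho. */
static size_t rle_encode(const unsigned char *w, size_t n, Run *runs)
{
    size_t rho = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && w[j] == w[i]) j++;   /* maximal run of the symbol w[i] */
        runs[rho].c = w[i];
        runs[rho].len = j - i;               /* l_i >= 1 and adjacent runs differ */
        rho++;
        i = j;                               /* i is now a_i^rle of the next run */
    }
    return rho;
}
```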

3. Duval’s Algorithm

In this section we briefly describe Duval's algorithm for the computation of the Lyndon factorization of a word. Let L be the set of Lyndon words and let
P = {w | w ∈ Σ+ and wΣ* ∩ L ≠ ∅}
be the set of nonempty prefixes of Lyndon words. Let also P′ = P ∪ {c^k | k ≥ 2}, where c is the maximum symbol in Σ. Duval's algorithm is based on the following Lemmas, proved in [5]:
Lemma 1.
Let w ∈ Σ+ and let w_1 be the longest prefix of w = w_1 w′ which is in L. We have CFL(w) = w_1 CFL(w′).
Lemma 2.
P′ = {(uv)^k u | u ∈ Σ*, v ∈ Σ+, k ≥ 1 and uv ∈ L}.
Lemma 3.
Let w = (uav)^k u, with u, v ∈ Σ*, a ∈ Σ, k ≥ 1 and uav ∈ L. The following propositions hold:
1. For a′ ∈ Σ and a > a′, wa′ ∉ P′;
2. For a′ ∈ Σ and a < a′, wa′ ∈ L;
3. For a′ = a, wa′ ∈ P′ ∖ L.
Lemma 1 states that the computation of the Lyndon factorization of a word w can be carried out by computing the longest prefix w_1 of w = w_1 w′ which is a Lyndon word and then recursively restarting the process from w′. Lemma 2 states that the nonempty prefixes of Lyndon words are all of the form (uv)^k u, where u ∈ Σ*, v ∈ Σ+, k ≥ 1 and uv ∈ L. By the first property of Lyndon words, the longest prefix of (uv)^k u which is in L is uv. Hence, if we know that w = (uv)^k u a′ v′, (uv)^k u ∈ P′ but (uv)^k u a′ ∉ P′, then by Lemma 1 and by induction we have CFL(w) = w_1 w_2 ⋯ w_k CFL(u a′ v′), where w_1 = w_2 = ⋯ = w_k = uv. For example, if w = abbabbaba, we have CFL(w) = abb abb CFL(aba), since abbabbab ∈ P′ while abbabbaba ∉ P′.
Suppose that we have a procedure LF-next(w, k) which computes, given a word w and an integer k, the pair (s, q) where s is the largest integer such that w[k..k + s − 1] ∈ L and q is the largest integer such that w[k + is..k + (i + 1)s − 1] = w[k..k + s − 1], for i = 1, …, q − 1. The factorization of w can then be computed by iteratively calling LF-next starting from position 0. When a given call to LF-next returns, the factorization algorithm outputs the intervals [k + is, k + (i + 1)s − 1], for i = 0, …, q − 1, and restarts the factorization at position k + qs. Duval's algorithm implements LF-next using Lemma 3, which explains how to compute, given a word w ∈ P′ and a symbol a ∈ Σ, whether wa ∈ P′, and thus makes it possible to compute the factorization using a left to right parsing. Note that, given a word w ∈ P′ with |β(w)| = i, we have w[0..|w| − i − 1] ∈ L and w = (w[0..|w| − i − 1])^q w[0..r − 1] with q = ⌊|w| / (|w| − i)⌋ and r = |w| mod (|w| − i). For example, if w = abbabbab, we have |w| = 8, |β(w)| = 5, q = 2, r = 2 and w = (abb)^2 ab. The code of Duval's algorithm is shown in Figure 1. The algorithm has O(|w|)-time and O(1)-space complexity.
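For reference, the following C sketch follows the classic structure of Duval's algorithm that Figure 1 refers to. It is our own illustrative transcription (function name and output format are ours), not the authors' code.

```c
#include <stdio.h>
#include <string.h>

/* Minimal sketch of Duval's algorithm: prints the interval [start, end]
 * of every factor of the Lyndon factorization of w. */
static void lyndon_factorization(const unsigned char *w, size_t n)
{
    size_t k = 0;
    while (k < n) {
        size_t i = k, j = k + 1;           /* invariant: w[k..j-1] is in P' */
        while (j < n && w[i] <= w[j]) {
            if (w[i] < w[j])
                i = k;                     /* w[k..j] is a Lyndon word      */
            else
                i++;                       /* w[k..j] keeps period j - i    */
            j++;
        }
        /* w[k..j-1] = (uv)^q u with |uv| = j - i; emit the q copies of uv  */
        while (k <= i) {
            printf("[%zu, %zu]\n", k, k + (j - i) - 1);
            k += j - i;
        }
    }
}

int main(void)
{
    const char *w = "abbabbaba";
    lyndon_factorization((const unsigned char *)w, strlen(w));
    /* expected output: [0,2] [3,5] [6,7] [8,8], i.e. abb, abb, ab, a */
    return 0;
}
```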
The following is an alternative formulation of Duval’s algorithm by I et al. [8]:
Lemma 4.
Let j > 0 be any position of a string w such that w < w[i..|w| − 1] for any 0 < i ≤ j and lcp(w, w[j..|w| − 1]) ≥ 1. Then, w < w[k..|w| − 1] also holds for any j < k ≤ j + lcp(w, w[j..|w| − 1]).
Lemma 5.
Let w be a string with CFL(w) = w_1, w_2, …, w_m. It holds that |w_1| = min{j | w[j..|w| − 1] < w} and w_1 = w_2 = ⋯ = w_q = w[0..|w_1| − 1], where q = 1 + ⌊lcp(w, w[|w_1|..|w| − 1]) / |w_1|⌋.
For example, if w = abbabbaba, we have min{j | w[j..8] < w} = 3, lcp(w, w[3..8]) = 5, and q = 2. Based on these Lemmas, the procedure LF-next can be implemented by initializing j ← k + 1 and executing the following steps: (1) compute h ← lcp(w[k..|w| − 1], w[j..|w| − 1]); (2) if j + h < |w| and w[k + h] < w[j + h], set j ← j + h + 1 and repeat step 1; otherwise return the pair (j, 1 + ⌊h/j⌋). It is not hard to verify that, if the lcp values are computed using symbol comparisons, then this procedure corresponds to the one used by Duval's original algorithm.
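The following C sketch (ours, for illustration only) implements this lcp-based formulation for the case k = 0, matching Lemma 5; the skip in step 2 is exactly the one licensed by Lemma 4.

```c
#include <stddef.h>

/* lcp-based LF-next for the whole word w (k = 0): on return, *s is the
 * length |w_1| of the first Lyndon factor and *q its number of repetitions. */
static void lf_next_lcp(const unsigned char *w, size_t n, size_t *s, size_t *q)
{
    size_t j = 1;
    for (;;) {
        size_t h = 0;
        while (j + h < n && w[h] == w[j + h]) h++;    /* h = lcp(w, w[j..]) */
        if (j + h < n && w[h] < w[j + h]) {
            j += h + 1;                                /* Lemma 4: skip ahead */
        } else {
            *s = j;                                    /* |w_1| = j (Lemma 5) */
            *q = 1 + h / j;                            /* repetitions of w_1  */
            return;
        }
    }
}
```

On w = abbabbaba this returns s = 3 and q = 2, matching the example above.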

4. Improved Algorithm for Small Alphabets

Let w be a word over an alphabet Σ with CFL(w) = w_1, w_2, …, w_m and let c̄ be the smallest symbol in Σ. Suppose that there exist k ≥ 2 and i ≥ 1 such that c̄^k is a prefix of w_i. If the last symbol of w is not c̄, then by Theorem 1 and by the properties of Lyndon words, c̄^k is a prefix of each of w_{i+1}, w_{i+2}, …, w_m. This property can be exploited to devise an algorithm for Lyndon factorization that can potentially skip symbols. Note that we assume Property 1, i.e., the output of the algorithm is the sequence of intervals of the factors in CFL(w), as otherwise we have to read all the symbols of w to output CFL(w). Our algorithm is based on the alternative formulation of Duval's algorithm by I et al. [8]. Given a set of strings P, let Occ_P(w) be the set of all (starting) positions in w corresponding to occurrences of the strings in P. We start with the following Lemmas:
Lemma 6.
Let w be a word and let s = max({i | w[i] > c̄} ∪ {−1}). Then, we have CFL(w) = CFL(w[0..s]) CFL(c̄^{|w| − 1 − s}).
Proof. 
If s = −1 or s = |w| − 1 the Lemma plainly holds. Otherwise, let w_i be the factor in CFL(w) such that s belongs to [a_i, b_i], the interval of w_i. To prove the claim we have to show that b_i = s. Suppose by contradiction that s < b_i, which implies |w_i| ≥ 2. Then, w_i[|w_i| − 1] = c̄, which contradicts the second property of Lyndon words. □
For example, if w = abaabaabbaabaa, we have CFL(w) = CFL(abaabaabbaab) CFL(aa).
Lemma 7.
Let w be a word such that c̄c̄ occurs in it and let s = min Occ_{c̄c̄}(w). Then, we have CFL(w) = CFL(w[0..s − 1]) CFL(w[s..|w| − 1]).
Proof. 
Let w_i be the factor in CFL(w) such that s belongs to [a_i, b_i], the interval of w_i. To prove the claim we have to show that a_i = s. Suppose by contradiction that s > a_i, which implies |w_i| ≥ 2. If s = b_i then w_i[|w_i| − 1] = c̄, which contradicts the second property of Lyndon words. Otherwise, since w_i is a Lyndon word it must hold that w_i < ROT(w_i, s − a_i). This implies at least that w_i[0] = w_i[1] = c̄, which contradicts the hypothesis that s is the smallest element in Occ_{c̄c̄}(w). □
For example, if w = abaabaabbaab, we have CFL(w) = CFL(ab) CFL(aabaabbaab).
Lemma 8.
Let w be a word such that w[0] = w[1] = c̄ and w[|w| − 1] ≠ c̄. Let r be the smallest position in w such that w[r] ≠ c̄. Let also P = {c̄^r c | c ≤ w[r]}. Then we have
b_1 = min({s ∈ Occ_P(w) | w[s..|w| − 1] < w} ∪ {|w|}) − 1,
where b_1 is the ending position of factor w_1.
Proof. 
By Lemma 5 we have that b_1 = min{s | w[s..|w| − 1] < w} − 1. Since w[0..r − 1] = c̄^r and |w| ≥ r + 1, for any string v such that v < w we must have that either v[0..r] ∈ P, if |v| ≥ r + 1, or v = c̄^{|v|} otherwise. Since w[|w| − 1] ≠ c̄, the only position s that satisfies w[s..|w| − 1] = c̄^{|w| − s} is |w|, corresponding to the empty word. Hence,
{s | w[s..|w| − 1] < w} = {s ∈ Occ_P(w) | w[s..|w| − 1] < w} ∪ {|w|}.
 □
For example, if w = aabaabbaab, we have P = {aaa, aab}, Occ_P(w) = {0, 3, 7} and b_1 = 6. Based on these Lemmas, we can devise a faster factorization algorithm for words containing runs of c̄. The key idea is that, using Lemma 8, it is possible to skip symbols in the computation of b_1, if a suitable string matching algorithm is used to compute Occ_P(w). W.l.o.g. we assume that the last symbol of w is different from c̄. In the general case, by Lemma 6, we can reduce the factorization of w to that of its longest prefix whose last symbol is different from c̄, as the remaining suffix is a concatenation of c̄ symbols, whose factorization is a sequence of factors equal to c̄. Suppose that c̄c̄ occurs in w. By Lemma 7 we can split the factorization of w into CFL(u) and CFL(v), where uv = w and |u| = min Occ_{c̄c̄}(w). The factorization CFL(u) can be computed using Duval's original algorithm.
Concerning v, let r = min{i | v[i] ≠ c̄}. By definition v[0] = v[1] = c̄ and v[|v| − 1] ≠ c̄, and we can apply Lemma 8 on v to find the length s of the first factor in CFL(v) (one past its ending position), i.e., s = min({i ∈ Occ_P(v) | v[i..|v| − 1] < v} ∪ {|v|}), where P = {c̄^r c | c ≤ v[r]}. To this end, we iteratively compute Occ_P(v) until either a position i is found that satisfies v[i..|v| − 1] < v or we reach the end of the string. Let h = lcp(v, v[i..|v| − 1]), for a given i ∈ Occ_P(v). Observe that h ≥ r and, if v < v[i..|v| − 1], then, by Lemma 4, we do not need to verify the positions i′ ∈ Occ_P(v) such that i < i′ ≤ i + h. The computation of Occ_P(v) can be performed by using either an algorithm for multiple string matching for the set of patterns P or an algorithm for single string matching for the pattern c̄^r, since Occ_P(v) ⊆ Occ_{c̄^r}(v). Note that the same algorithm can also be used to compute min Occ_{c̄c̄}(w) in the first phase.
Given that all the patterns in P differ in the last symbol only, we can express P more succinctly using a character class for the last symbol and match this pattern using a string matching algorithm that supports character classes, such as the algorithms based on bit-parallelism. In this respect, SBNDM2 [12], a variation of the BNDM algorithm [13], is an ideal choice, as it is sublinear on average. However, this method is preferable only if r + 1 is less than or equal to the machine word size in bits.
Let h = lcp(v, v[s..|v| − 1]) and q = 1 + ⌊h/s⌋. Based on Lemma 5, the algorithm then outputs the intervals of the factors v[(i − 1)s..is − 1], for i = 1, …, q, and iteratively applies the above method on v′ = v[sq..|v| − 1]. It is not hard to verify that, if v′ ≠ ε, then |v′| ≥ r + 1, v′[0..r − 1] = c̄^r and v′[|v′| − 1] ≠ c̄, and so Lemma 8 can be used on v′. The code of the algorithm, named LF-skip, is shown in Figure 2. The computation of the value r = min{i | v′[i] ≠ c̄} for v′ takes advantage of the fact that v′[0..r − 1] = c̄^r, so as to avoid useless comparisons.
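To make the structure of LF-skip concrete, here is a simplified C sketch of the same idea (ours, not the Figure 2 code): it assumes the input v starts with c̄ and ends with a symbol different from c̄, and it locates candidate positions by direct scanning rather than with SBNDM2.

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified sketch of the LF-skip idea for a word v of length n whose first
 * symbol is the smallest symbol cbar and whose last symbol differs from cbar.
 * Prints the interval of every factor, shifted by 'base'. */
static void lf_skip(const unsigned char *v, size_t n, unsigned char cbar, size_t base)
{
    size_t k = 0;                              /* start of the unfactorized suffix */
    while (k < n) {
        size_t s = n - k, q = 1;               /* default: the whole suffix is one factor */
        size_t r = 0;
        while (v[k + r] == cbar) r++;          /* leading run of cbar (r >= 1) */
        size_t i = r + 1;
        while (k + i < n) {
            if (v[k + i] != cbar) { i++; continue; }   /* a smaller suffix must start with cbar */
            size_t h = 0;                      /* h = lcp(v[k..], v[k+i..]) */
            while (k + i + h < n && v[k + h] == v[k + i + h]) h++;
            if (k + i + h == n || v[k + i + h] < v[k + h]) {
                s = i;                         /* |w_1| = i by Lemma 5 */
                q = 1 + h / s;                 /* repetitions of w_1   */
                break;
            }
            i += h + 1;                        /* Lemma 4: skip the next h positions */
        }
        for (size_t t = 0; t < q; t++, k += s)
            printf("[%zu, %zu]\n", base + k, base + k + s - 1);
    }
}
```

In the full algorithm the prefix of w before the first occurrence of c̄c̄ is factorized with Duval's algorithm (Lemma 7), and the candidate scan above is replaced by an SBNDM2 search for c̄^r followed by a character class.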
If the total time spent for the iteration over the sets Occ_P(v′) is O(|w|), the worst case time complexity of LF-skip is linear. To see why, it is enough to observe that the positions i for which LF-skip verifies if v[i..|v| − 1] < v are a subset of the positions verified by the original algorithm. Indeed, given a string w satisfying the conditions of Lemma 8, for any position i ∈ Occ_P there is no i′ ∈ {0, 1, …, |w| − 1} ∖ Occ_P such that i′ + 1 ≤ i ≤ i′ + lcp(w, w[i′..|w| − 1]). Hence, the only way Duval's algorithm can skip a position i ∈ Occ_P using Lemma 4 is by means of a smaller position i′ belonging to Occ_P, which implies that the algorithms skip or verify the same positions in Occ_P.

5. Computing the Lyndon Factorization of a Run-Length Encoded String

In this section we present an algorithm to compute the Lyndon factorization of a string given in RLE form. The algorithm is based on Duval's original algorithm and on a combinatorial property relating the Lyndon factorization of a string to its RLE, and has O(ρ)-time and O(1)-space complexity, where ρ is the length of the RLE. We start with the following Lemma:
Lemma 9.
Let w be a word over Σ and let w_1, w_2, …, w_m be its Lyndon factorization. For any 1 ≤ i ≤ |RLE(w)|, let 1 ≤ j, k ≤ m, with j ≤ k, be such that a_i^rle ∈ [a_j, b_j] and b_i^rle ∈ [a_k, b_k]. Then, either j = k or |w_j| = |w_k| = 1.
Proof. 
Suppose by contradiction that j < k and either |w_j| > 1 or |w_k| > 1. By definition of j, k, we have w_j ≥ w_k. Moreover, since both [a_j, b_j] and [a_k, b_k] overlap with [a_i^rle, b_i^rle], we also have w_j[|w_j| − 1] = w_k[0]. If |w_j| > 1, then, by definition of w_j, we have w_j[0] < w_j[|w_j| − 1] = w_k[0]. Instead, if |w_k| > 1 and |w_j| = 1, we have that w_j is a prefix of w_k. Hence, in both cases we obtain w_j < w_k, which is a contradiction. □
The consequence of this Lemma is that a run of length l in the RLE is either contained in one factor of the Lyndon factorization, or it corresponds to l unit-length factors. Formally:
Corollary 1.
Let w be a word over Σ and let w_1, w_2, …, w_m be its Lyndon factorization. Then, for any 1 ≤ i ≤ |RLE(w)|, either there exists w_j such that [a_i^rle, b_i^rle] is contained in [a_j, b_j], or there exist l_i factors w_j, w_{j+1}, …, w_{j+l_i−1} such that |w_{j+k}| = 1 and a_{j+k} ∈ [a_i^rle, b_i^rle], for 0 ≤ k < l_i.
This property can be exploited to obtain an algorithm for the Lyndon factorization that runs in O ( ρ ) time. First, we introduce the following definition:
Definition 1.
A word w is an LR word if it is either a Lyndon word or it is equal to a^k, for some a ∈ Σ, k ≥ 2. The LR factorization of a word w is the factorization into LR words obtained from the Lyndon factorization of w by merging into a single factor the maximal sequences of unit-length factors with the same symbol.
For example, the LR factorization of cctgccaa is cctg, cc, aa. Observe that this factorization is a (reversible) encoding of the Lyndon factorization. Moreover, in this encoding it holds that each run in the RLE is contained in one factor and thus the size of the LR factorization is O(ρ). Let L′ be the set of LR words. Suppose that we have a procedure LF-rle-next(R, k) which computes, given an RLE sequence R and an integer k, the pair (s, q) where s is the largest integer such that c_k^{l_k} ⋯ c_{k+s−1}^{l_{k+s−1}} ∈ L′ and q is the largest integer such that c_{k+is}^{l_{k+is}} ⋯ c_{k+(i+1)s−1}^{l_{k+(i+1)s−1}} = c_k^{l_k} ⋯ c_{k+s−1}^{l_{k+s−1}}, for i = 1, …, q − 1. Observe that, by Lemma 9, c_k^{l_k} ⋯ c_{k+s−1}^{l_{k+s−1}} is the longest prefix of c_k^{l_k} ⋯ c_ρ^{l_ρ} which is in L′, since otherwise the run (c_{k+s}, l_{k+s}) would span two factors in the LR factorization of c_k^{l_k} ⋯ c_ρ^{l_ρ}. This implies that the pair (s, q) returned by LF-rle-next(R, k) satisfies
LF-next(c_k^{l_k} ⋯ c_ρ^{l_ρ}, 0) = (∑_{i=0}^{s−1} l_{k+i}, q) if s > 1, and (1, l_k) otherwise.
Based on Lemma 1, the factorization of R can then be computed by iteratively calling LF-rle-next starting from position 0. When a given call to LF-rle-next returns, the factorization algorithm outputs the intervals [k + is, k + (i + 1)s − 1] in R, for i = 0, …, q − 1, and restarts the factorization at position k + qs.
We now present the LF-rle-next algorithm. Analogously to Duval's algorithm, it reads the RLE sequence from left to right maintaining two integers, j and ℓ, which satisfy the following invariant:
c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} ∈ P′;  ℓ = |RLE(β(c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}}))| if j − k > 1, and ℓ = 0 otherwise.   (1)
The integer j, initialized to k + 1, is the index of the next run to read and is incremented at each iteration until either j = |R| or c_k^{l_k} ⋯ c_j^{l_j} ∉ P′. The integer ℓ, initialized to 0, is the length in runs of the longest border of c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}}, if c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} spans at least two runs, and is equal to 0 otherwise. For example, in the case of the word ab²ab²ab we have β(ab²ab²ab) = ab²ab and ℓ = 4. Let i = k + ℓ. In general, if ℓ > 0, we have
l_{j−1} ≤ l_{i−1},  l_k ≤ l_{j−ℓ},  β(c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}}) = c_k^{l_k} c_{k+1}^{l_{k+1}} ⋯ c_{i−2}^{l_{i−2}} c_{i−1}^{l_{j−1}} = c_{j−ℓ}^{l_k} c_{j−ℓ+1}^{l_{j−ℓ+1}} ⋯ c_{j−2}^{l_{j−2}} c_{j−1}^{l_{j−1}}.
Note that the longest border may not fully cover the last (first) run of the corresponding prefix (suffix). This is the case, for example, for the word ab²a²b. However, since c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} ∈ P′, it must hold that l_{j−ℓ} = l_k, i.e., the first run of the suffix is fully covered. Let
z = 1 if ℓ > 0 and l_{j−1} < l_{i−1}, and z = 0 otherwise.
Informally, the integer z is equal to 1 if the longest border of c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} does not fully cover the run (c_{i−1}, l_{i−1}). By invariant (1) we have that c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} can be written as (uv)^q u, where
q = ⌊(j − k − z) / (j − i)⌋,  r = z + (j − k − z) mod (j − i),  u = c_{j−r}^{l_{j−r}} ⋯ c_{j−1}^{l_{j−1}},  (uv)^q u = c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}},  uv = c_{i−r}^{l_{i−r}} ⋯ c_{j−r−1}^{l_{j−r−1}} ∈ L.
For example, in the case of the word ab²ab²ab, for k = 0, we have j = 6, i = 4, q = 2, r = 2. The algorithm is based on the following Lemma:
Lemma 10.
Let j, ℓ be such that invariant (1) holds and let s = i − z. Then, we have the following cases:
1. If c_j < c_s then c_k^{l_k} ⋯ c_j^{l_j} ∉ P′;
2. If c_j > c_s then c_k^{l_k} ⋯ c_j^{l_j} ∈ L and (1) holds for j + 1 and ℓ = 0;
Moreover, if z = 0 , we also have:
3. If c_j = c_i and l_j ≤ l_i, then c_k^{l_k} ⋯ c_j^{l_j} ∈ P′ and (1) holds for j + 1 and ℓ + 1;
4. If c_j = c_i and l_j > l_i, then either c_j < c_{i+1} and c_k^{l_k} ⋯ c_j^{l_j} ∉ P′, or c_j > c_{i+1}, c_k^{l_k} ⋯ c_j^{l_j} ∈ L and (1) holds for j + 1 and ℓ = 0.
Proof. 
The idea is the following: we apply Lemma 3 with the word (uv)^q u as defined above and symbol c_j. Observe that c_j is compared with symbol v[0], which is equal to c_{k+r−1} = c_{i−1} if z = 1 and to c_{k+r} = c_i otherwise.
First note that, if z = 1, c_j ≠ c_{i−1}, since otherwise we would have c_{j−1} = c_{i−1} = c_j. In the first three cases, we obtain the first, second and third proposition of Lemma 3, respectively, for the word c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} c_j. Independently of the derived proposition, it is easy to verify that the same proposition also holds for c_k^{l_k} ⋯ c_{j−1}^{l_{j−1}} c_j^m, for m ≤ l_j. Consider now the fourth case. By a similar reasoning, we have that the third proposition of Lemma 3 holds for c_k^{l_k} ⋯ c_j^{l_i}. If we then apply Lemma 3 to c_k^{l_k} ⋯ c_j^{l_i} and c_j, c_j is compared to c_{i+1} and we must have c_j ≠ c_{i+1}, as otherwise c_i = c_j = c_{i+1}. Hence, either the first (if c_j < c_{i+1}) or the second (if c_j > c_{i+1}) proposition of Lemma 3 must hold for the word c_k^{l_k} ⋯ c_j^{l_i + 1}. □
We prove by induction that invariant (1) is maintained. At the beginning, the variables j and ℓ are initialized to k + 1 and 0, respectively, so the base case trivially holds. Suppose that the invariant holds for j, ℓ. Then, by Lemma 10, either c_k^{l_k} ⋯ c_j^{l_j} ∉ P′ or the invariant also holds for j + 1 and ℓ′, where ℓ′ is equal to ℓ + 1, if z = 0, c_j = c_i and l_j ≤ l_i, and to 0 otherwise. When c_k^{l_k} ⋯ c_j^{l_j} ∉ P′, the algorithm returns the pair (j − i, q), i.e., the length of uv and the corresponding exponent.
The code of the algorithm is shown in Figure 3. We now prove that the algorithm runs in O(ρ) time. First, observe that, by definition of the LR factorization, the for loop at line 4 is executed O(ρ) times. Suppose that the number of iterations of the while loop at line 2 is n and let k_1, k_2, …, k_{n+1} be the corresponding values of k, with k_1 = 0 and k_{n+1} = |R|. We now show that the s-th call to LF-rle-next performs fewer than 2(k_{s+1} − k_s) iterations, which yields O(ρ) iterations in total. This analysis is analogous to the one used by Duval. Suppose that i′, j′ and z′ are the values of i, j and z at the end of the s-th call to LF-rle-next. The number of iterations performed during this call is equal to j′ − k_s. We have k_{s+1} = k_s + q(j′ − i′), where q = ⌊(j′ − k_s − z′) / (j′ − i′)⌋, which implies j′ − k_s < 2(k_{s+1} − k_s) + 1, since, for any positive integers x ≥ y, x < 2⌊x/y⌋y holds.
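As an illustration of Lemma 10 and of the driver described above, here is a compact C sketch (ours; the run layout, names, and output format are assumptions, and Figure 3 contains the authors' actual code).

```c
#include <stdio.h>
#include <stddef.h>

typedef struct { unsigned char c; size_t len; } Run;   /* one run (c_i, l_i) */

/* Sketch of LF-rle-next(R, k) following Lemma 10: returns in *s the number of
 * runs of the longest prefix of R[k..] that is an LR word and in *q how many
 * times that block of runs repeats. */
static void lf_rle_next(const Run *R, size_t rho, size_t k, size_t *s, size_t *q)
{
    size_t j = k + 1, ell = 0;                 /* invariant (1): ell = border length in runs */
    while (j < rho) {
        size_t i = k + ell;
        int z = (ell > 0 && R[j-1].len < R[i-1].len);
        size_t t = i - z;                      /* index of the compared symbol c_s */
        if (R[j].c < R[t].c) break;            /* case 1: prefix leaves P'          */
        else if (R[j].c > R[t].c) { ell = 0; j++; }       /* case 2: prefix is a Lyndon word */
        else if (R[j].len <= R[i].len) { ell++; j++; }    /* case 3 (z = 0): border grows    */
        else if (R[j].c < R[i+1].c) break;     /* case 4, first alternative: leaves P' */
        else { ell = 0; j++; }                 /* case 4, second alternative: Lyndon   */
    }
    size_t i = k + ell;
    int z = (ell > 0 && R[j-1].len < R[i-1].len);
    *s = j - i;                                /* runs in one period uv */
    *q = (j - k - z) / (j - i);                /* exponent q            */
}

/* Driver: prints the intervals of runs of the LR factorization of R. */
static void lf_rle(const Run *R, size_t rho)
{
    for (size_t k = 0; k < rho; ) {
        size_t s, q;
        lf_rle_next(R, rho, k, &s, &q);
        for (size_t t = 0; t < q; t++, k += s)
            printf("[%zu, %zu]\n", k, k + s - 1);
    }
}
```

On the example ab²ab²ab, i.e., the runs (a,1)(b,2)(a,1)(b,2)(a,1)(b,1), the first call returns s = 2 and q = 2, so the driver reports the factors ab², ab² and then ab.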

6. Experimental Results

We tested extensively the algorithms LF-Duval, LF-skip, and LF-rle. In addition, we also tested variations of LF-Duval and LF-skip, denoted as LF-Duval2 and LF-skip2. LF-Duval2 performs, in line 9 of LF-next, the if-test
if w[j − 1] = w[i − 1] then
which is always true at that point. This form, which is advantageous for compiler optimization, can be justified by the formulation of the original algorithm [5], where there is a three-branch test on w[j − 1] and w[i − 1]. LF-skip2, after finding the first c̄^r, searches for c̄^r until c̄^{r+1} is found, whereas LF-skip searches for c̄^r x, where x is a character class.
The experiments were run on Intel Core i7-4578U with 3 GHz clock speed and 16 GB RAM. The algorithms were written in the C programming language and compiled with gcc 5.4.0 using the O3 optimization level.
Testing LF-skip. At first we tested the variations of LF-skip against the variations of LF-Duval. The texts were random sequences of 5 MB. For each alphabet size σ = 2, 4, …, 256 we generated 100 sequences with a uniform distribution, and each run with each sequence was repeated 500 times. The average run times are given in Table 1, which is shown in graphical form in Figure 4.
LF-skip was faster than the best variation of LF-Duval for all tested values of σ. The speed-up was significant for small alphabets. LF-skip2 was faster than LF-skip for σ ≤ 16 and slower for σ > 16.
The speed of LF-Duval did not depend on σ. LF-Duval2 became faster as the size of the alphabet grew. For large alphabets LF-Duval2 was faster than LF-Duval, and for small alphabets the other way round. In additional refined experiments, σ = 5 was the threshold value. When we compiled LF-Duval and LF-Duval2 without optimization, both variations behaved in a similar way. So the better performance of LF-Duval2 for large alphabets is due to compiler optimization, possibly because of branch prediction.
We tested the variations of LF-skip also with longer random sequences of four characters up to 500 MB. The average speed did not essentially change when the sequence became longer.
In addition, we tested LF-skip and LF-skip2 with real texts. At first we did experiments with texts of natural language. Because runs are very short in natural language and a newline or some other control character is the smallest character, the benefit of LF-skip or LF-skip2 was marginal. If it were acceptable to relax the lexicographic order of the characters, some gain could be obtained. For example, LF-skip achieved a speed-up of 2 over LF-Duval2 in the case of the KJV Bible when 'l' is the smallest character.
For the DNA sequence of the fruit fly (15 MB), LF-skip2 was 20.3 times faster than LF-Duval. For the protein sequence of Saccharomyces cerevisiae (2.9 MB), LF-skip2 was 8.7 times faster than LF-Duval2. The run times on these biological sequences are shown in Table 2.
Testing LF-rle. To assess the performance of the LF-rle algorithm, we tested it together with LF-Duval, LF-Duval2 and LF-skip2 for random binary sequences of 5 MB with different probability distributions, so as to vary the number of runs in the sequence. The running time of LF-rle does not include the time needed to compute the RLE of the sequence, i.e., we assumed that the sequence is given in the RLE form, since otherwise other algorithms are preferable. For each test we generated 100 sequences, and each run with each sequence was repeated 500 times. The average run times are given in Table 3 which is shown in a graphical form in Figure 5.
Table 3 shows that LF-rle was the fastest for the distributions P(0) = 0.05, 0.9, and 0.95. Table 3 also reveals that LF-rle and LF-Duval2 behaved symmetrically with respect to the distribution of zeros and ones, while LF-skip2 behaved asymmetrically, which is due to the fact that LF-skip2 searches for the runs of the smallest character, which was zero in this case.
In our tests the run time of LF-Duval was about 14.7 ms for all sequences of 5 MB. Thus LF-Duval is a better choice than LF-Duval2 for cases P ( 0 ) = 0.3 and 0.7.

7. Conclusions

We presented new variations of Duval's algorithm for computing the Lyndon factorization of a string. The first algorithm, LF-skip, was designed for strings containing runs of the smallest character of the alphabet and is able to skip a significant portion of the characters of the string. The second algorithm, LF-rle, is for strings compressed with run-length encoding and computes the Lyndon factorization of a run-length encoded string of length ρ in O(ρ) time and constant space. Our experimental results show that these algorithms can offer a significant speed-up over Duval's original algorithm. LF-skip is especially efficient on biological sequences.

Author Contributions

Formal analysis, E.G.; Investigation, S.S.G.; Methodology, J.T.; Software, E.G. and J.T.; Supervision, J.T.; Writing—original draft, S.S.G. and E.G.; Writing—review & editing, J.T.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, K.T.; Fox, R.H.; Lyndon, R.C. Free differential calculus. IV. The quotient groups of the lower central series. Ann. Math. 1958, 68, 81–95. [Google Scholar] [CrossRef]
  2. Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M. Sorting suffixes of a text via its Lyndon factorization. In Proceedings of the Prague Stringology Conference 2013, Prague, Czech Republic, 2–4 September 2013; pp. 119–127. [Google Scholar]
  3. Gil, J.Y.; Scott, D.A. A bijective string sorting transform. arXiv 2012, arXiv:1201.3077. [Google Scholar]
  4. Kufleitner, M. On bijective variants of the Burrows-Wheeler transform. In Proceedings of the Prague Stringology Conference 2009, Prague, Czech Republic, 31 August–2 September 2009; pp. 65–79. [Google Scholar]
  5. Duval, J.P. Factorizing words over an ordered alphabet. J. Algorithms 1983, 4, 363–381. [Google Scholar] [CrossRef]
  6. Apostolico, A.; Crochemore, M. Fast parallel Lyndon factorization with applications. Math. Syst. Theory 1995, 28, 89–108. [Google Scholar] [CrossRef] [Green Version]
  7. Roh, K.; Crochemore, M.; Iliopoulos, C.S.; Park, K. External memory algorithms for string problems. Fundam. Inform. 2008, 84, 17–32. [Google Scholar]
  8. Tomohiro, I.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Faster Lyndon factorization algorithms for SLP and LZ78 compressed text. Theor. Comput. Sci. 2016, 656, 215–224. [Google Scholar] [CrossRef]
  9. Furuya, I.; Nakashima, Y.; Tomohiro, I.; Inenaga, S.; Bannai, H.; Takeda, M. Lyndon Factorization of Grammar Compressed Texts Revisited. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM 2018), Qingdao, China, 2–4 July 2018. [Google Scholar] [CrossRef]
  10. Ghuman, S.S.; Giaquinta, E.; Tarhio, J. Alternative algorithms for Lyndon factorization. In Proceedings of the Prague Stringology Conference 2014, Prague, Czech Republic, 1–3 September 2014; pp. 169–178. [Google Scholar]
  11. Lothaire, M. Combinatorics on Words; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
  12. Durian, B.; Holub, J.; Peltola, H.; Tarhio, J. Improving practical exact string matching. Inf. Process. Lett. 2010, 110, 148–152. [Google Scholar] [CrossRef]
  13. Navarro, G.; Raffinot, M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM J. Exp. Algorithm 2000, 5, 4. [Google Scholar] [CrossRef]
Figure 1. Duval’s algorithm to compute the Lyndon factorization of a string.
Figure 2. The algorithm to compute the Lyndon factorization that can potentially skip symbols.
Figure 3. The algorithm to compute the Lyndon factorization of a run-length encoded string.
Figure 4. Comparison of the algorithms on random sequences (5 MB) with a uniform distribution of a varying alphabet size.
Figure 5. Comparison of the algorithms on random binary sequences (5 MB) with a skew distribution.
Table 1. Run times in milliseconds on random sequences (5 MB) with a uniform distribution of a varying alphabet size.
  σ      LF-Duval   LF-Duval2   LF-skip   LF-skip2
  2      14.6       21.9        2.5       1.5
  4      14.6       14.9        1.6       1.1
  8      14.7        9.1        1.3       1.1
  16     14.7        6.4        1.3       1.2
  32     14.7        5.0        1.4       1.6
  64     14.7        4.3        1.7       2.3
  128    14.7        4.0        2.0       3.2
  192    14.6        3.8        1.7       3.7
  256    14.6        3.8        2.6       4.1
Table 2. Run times in milliseconds on two biological sequences.
                     LF-Duval   LF-Duval2   LF-skip   LF-skip2
  DNA (15 MB)        44.7       52.2        3.0       2.2
  Protein (2.9 MB)    8.5        3.4        0.50      0.39
Table 3. Run times in milliseconds on random binary sequences (5 MB) with a skew distribution.
  P(zero)   LF-Duval   LF-Duval2   LF-skip2   LF-rle
  0.05      14.6        5.7        1.4        0.70
  0.10      14.7        7.8        1.1        1.3
  0.20      14.7       12.4        1.0        2.4
  0.30      14.8       17.4        1.2        3.2
  0.70      14.7       16.9        1.7        3.2
  0.80      14.6       12.7        2.0        2.4
  0.90      14.6        8.4        2.8        1.3
  0.95      14.7        6.3        4.7        0.70
