Due to the property of the longest first approach, we have the following observation.
In what follows, we present our algorithm, which outputs a context-free grammar that generates a given string. The algorithm relies heavily on the SLSTree structure.
3.1. How to Find Using
In this section, we show how to find an LRF of from .
The next lemmas characterize an LRF of that is not represented by a node of .
Lemma 1 If an LRF x of is not represented by a node of , then .
Proof. Let and . Since x is a repeating factor of , , which means that . If , then it contradicts the precondition that x is not represented by a node of . Hence we have . Moreover, since x is an LRF of , we have . However, if we assume , this contradicts the precondition that x is an LRF of , since and we obtain a longer LRF . Hence we have . □
The above lemma implies that an LRF x is not represented by a node of only if the first and the last occurrences of x form a square in . For example, see Figure 1 that illustrates for . One can see that is an LRF of but it is not represented by a node of .
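The notions used above can be checked on small inputs with a brute-force sketch. It assumes that a factor is repeating when it has at least two non-overlapping occurrences and that an LRF must have length at least 2; both are assumptions of this sketch, exposed through the hypothetical parameter `min_len`. All function names are illustrative, and the code does not build or consult the tree.

```python
def occurrences(w, x):
    """All starting positions of x in w (overlapping occurrences included)."""
    return [i for i in range(len(w) - len(x) + 1) if w[i:i + len(x)] == x]


def is_repeating(w, x):
    """True if x has two non-overlapping occurrences in w.

    Two non-overlapping occurrences exist exactly when the first and the
    last occurrence are at least len(x) positions apart.
    """
    occ = occurrences(w, x)
    return bool(occ) and occ[-1] - occ[0] >= len(x)


def longest_repeating_factors(w, min_len=2):
    """All repeating factors of maximum length (assumed length >= min_len)."""
    for length in range(len(w) // 2, min_len - 1, -1):
        found = sorted({w[i:i + length]
                        for i in range(len(w) - length + 1)
                        if is_repeating(w, w[i:i + length])})
        if found:
            return found
    return []


def first_and_last_form_square(w, x):
    """True if the first and last occurrences of x are adjacent (spell xx)."""
    occ = occurrences(w, x)
    return occ[-1] - occ[0] == len(x)
```

For instance, `longest_repeating_factors("abab")` yields `["ab"]`, whose first and last occurrences form the square `abab`, whereas in `"abxab"` the two occurrences of `ab` are separated by `x` and do not form a square.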
However, the following lemma guarantees that it is indeed sufficient to consider the strings represented by nodes of as candidates for .
Lemma 2 Let x be an LRF of that is not represented by a node of . Then, there exists another LRF y of that is represented by a node of such that . Moreover, x is no longer present in after a substitution for y (see also Figure 2).
Proof. Let and . It follows from Lemma 1 that . Suppose that x is represented on an edge from some node s to some node t of . Let . Then we have . Let y be the suffix of u of length . It is clear that . Since , . Thus y is an LRF of . Since u is represented by node t and and , we know that . Hence y is represented by a node of . Since x occurs only within the region , x does not occur in after a substitution for y. □
In the running example of Figure 1, is an LRF of that is represented by a node of . After its two occurrences are replaced by a non-terminal symbol , then , which is an LRF of not represented by a node of , is no longer present in .
Figure 2.
Illustration for proof of Lemma 2. Since u is represented by a node of , we know that .
After constructing , we create a bin-sorted list of the internal nodes of in the decreasing order of their lengths. This can be done in linear time by a standard traversal on . We remark that a new internal node v may appear in for some , which did not exist in . However, we have that . Thus, we can maintain the bin-sorted list by inserting node v in constant time.
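The bin-sorted list described above can be sketched as an array of buckets indexed by node length. This is a minimal illustration, not the authors' data structure: the node identifiers, the dictionary `node_lengths`, and the function names are all hypothetical, and node lengths are assumed to be integers bounded by `max_len`.

```python
def build_bins(node_lengths, max_len):
    """Distribute node ids into buckets; bins[l] holds the nodes of length l.

    Runs in time linear in the number of nodes plus max_len.
    """
    bins = [[] for _ in range(max_len + 1)]
    for node_id, length in node_lengths.items():
        bins[length].append(node_id)
    return bins


def insert_node(bins, node_id, length):
    """Insert a newly created node in constant time."""
    bins[length].append(node_id)


def longest_first(bins):
    """Yield (length, node_id) in decreasing order of length.

    Bucket 0 (assumed to hold only the root) is skipped.
    """
    for length in range(len(bins) - 1, 0, -1):
        for node_id in bins[length]:
            yield length, node_id
```

Because a bucket is addressed directly by length, inserting a node never requires a search, which matches the constant-time maintenance claimed in the text.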
Given a node s in the bin-sorted list, we can determine whether is repeating or not by using , as follows.
Lemma 3 Let s be any node of with and let be the children of s. Then is a disjoint union of .
Proof. Clear from the definition of . □
Lemma 4 For any node s of such that , it takes amortized constant time to check whether or not is an LRF of .
Proof. Let be the children of s. Then, is repeating if and only if
Note that the values of and are stored in node and can be referred to in constant time. Since the above inequality is checked at most once for each node s, it takes amortized constant time. □
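A check of this kind can be sketched as follows. The sketch rests on two assumptions that are not taken from the text: each child stores the minimum and maximum occurrence position of its subtree, and a factor is repeating when it has two non-overlapping occurrences. Under those assumptions, a factor is repeating exactly when its largest and smallest occurrence positions are at least its length apart. The function name and the `(min_pos, max_pos)` pair layout are hypothetical.

```python
def is_repeating_node(length, children):
    """Decide whether the string of a node is repeating.

    length   -- length of the string represented by the node
    children -- non-empty list of (min_pos, max_pos) pairs, one per child
                (assumed layout; positions are occurrence start positions)
    """
    smallest = min(min_pos for min_pos, _ in children)
    largest = max(max_pos for _, max_pos in children)
    # Two non-overlapping occurrences exist iff the extreme ones do not overlap.
    return largest - smallest >= length
```

The loop touches each child once, so if every node is examined at most once, the total work is proportional to the number of edges of the tree.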
Suppose we have found an LRF of as mentioned above. In the sequel, we show our greedy strategy to select occurrences of the LRF in to be replaced with a new non-terminal symbol.
The next lemma is essentially the same as Lemma 2 of Kida et al. [1].
Lemma 5 For any non-repeating factor x of , forms a single arithmetic progression.
Therefore, for any non-repeating factor x of , can be expressed by an ordered triple consisting of minimum element , maximum element , and cardinality , which takes constant space.
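The constant-space triple can be illustrated with a small sketch. The class name `Progression` and the helper `from_positions` are hypothetical; the common difference is not stored but recovered from the three fields.

```python
from typing import NamedTuple


class Progression(NamedTuple):
    """An arithmetic progression stored as (minimum, maximum, cardinality)."""
    minimum: int
    maximum: int
    cardinality: int

    @property
    def step(self):
        """Common difference; 0 when there is a single element."""
        if self.cardinality < 2:
            return 0
        return (self.maximum - self.minimum) // (self.cardinality - 1)

    def positions(self):
        """Expand the triple back into the explicit list of positions."""
        return [self.minimum + k * self.step for k in range(self.cardinality)]


def from_positions(positions):
    """Build the triple, rejecting inputs that are not a single progression."""
    ordered = sorted(positions)
    triple = Progression(ordered[0], ordered[-1], len(ordered))
    if triple.positions() != ordered:
        raise ValueError("positions do not form a single arithmetic progression")
    return triple
```

For example, the occurrence positions 2, 5, 8 are stored as `Progression(2, 8, 3)`, from which the step 3 and the full list can be recovered on demand.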
Lemma 6 Let s be any node of such that is an LRF of , and be any child of s. Then, contains at most two positions corresponding to non-overlapping occurrences of in .
Proof. Assume to the contrary that contains three non-overlapping occurrences of , and let them be in the increasing order. Then we have
which implies that and are non-overlapping. Moreover, since , we have . However, this contradicts the precondition that is an LRF of . □
From Lemma 6, each child of node s such that is an LRF, corresponds to at most two non-overlapping occurrences of . Due to Lemma 3, we can greedily select occurrences of to be replaced by a new non-terminal symbol, by checking all children of node s. According to Lemma 5, it takes amortized constant time to select such occurrences for each node s.
Note that we have to select occurrences of so that no occurrences of remain in the text string, and at least two occurrences of are selected. We remark that we can greedily choose at least occurrences.
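The effect of a greedy selection can be illustrated on an explicit list of occurrence positions. This is a simplified sketch under stated assumptions: it scans a plain sorted list from left to right rather than visiting the children of node s, and the function name is hypothetical.

```python
def select_non_overlapping(positions, length):
    """Greedily pick non-overlapping occurrences, scanning left to right.

    positions -- start positions of all occurrences of the factor
    length    -- length of the factor
    """
    selected = []
    next_free = 0  # first text position not covered by a selected occurrence
    for p in sorted(positions):
        if p >= next_free:
            selected.append(p)
            next_free = p + length
    return selected
```

Every skipped occurrence overlaps a selected one, so no occurrence of the factor survives once the selected occurrences are replaced. For equal-length occurrences, the left-to-right scan also selects as many non-overlapping occurrences as possible.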
3.2. How to Update to
Let L be the set of the greedily selected occurrences of in . For any , let denote the string obtained after replacing the first i occurrences of with non-terminal symbol . Namely, and .
In this section we show how to update to . Let p be the beginning position of the i-th occurrence in L. Assume that we have , and that we have replaced with non-terminal symbol such that . We now have , and we have to update to .
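At the level of the string alone, replacing the first i selected occurrences can be sketched as below. The sketch rebuilds a Python list of symbols and makes no attempt to model how positions or tree nodes are maintained; the function name and the representation of the non-terminal as an arbitrary Python value are assumptions made for illustration.

```python
def replace_first_i(text, starts, length, nonterminal, i):
    """Return the symbol list obtained by replacing the first i occurrences.

    text        -- list of symbols
    starts      -- sorted, pairwise non-overlapping start positions
    length      -- length of the factor being replaced
    nonterminal -- symbol that stands for the factor
    i           -- number of occurrences to replace (0 <= i <= len(starts))
    """
    out = []
    cursor = 0  # next position of the original text still to be copied
    for p in starts[:i]:
        out.extend(text[cursor:p])   # copy the untouched gap
        out.append(nonterminal)      # one symbol replaces the whole factor
        cursor = p + length
    out.extend(text[cursor:])
    return out
```

With `i = 0` the text is returned unchanged, and with `i = len(starts)` all selected occurrences have been replaced.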
A naive way to obtain is to remove all the suffixes of from and insert all the suffixes of into it. However, since only the nodes not longer than are important for our longest-first strategy, only the suffixes such that and for any have to be removed from , and only the suffixes have to be inserted into the tree (see the light-shaded suffixes of Figure 3).
Lemma 7 For any t, let r be the shortest node of such that is a prefix of .
Assume .
If , then there exists an edge in from the root node to labeled with .
If , then there exists a node s in such that and s has an edge labeled with and leading to .
Proof. Consider Case 1 (see also Figure 4). Since , . Hence is a non-repeating factor of . By Lemma 5, forms a single arithmetic progression. Also, since , . Therefore, if , then . Hence there exists an edge from the root node to labeled with in .
Figure 3.
at position p of is replaced by non-terminal symbol in . Every is removed from the tree and every is inserted into the tree (the light-shaded suffixes in the right figure). In addition, every for is removed from the tree (the dark-shaded suffixes in the right figure).
Figure 4.
Illustration of Case 1 of Lemma 7.
Consider Case 2 (see also Figure 5). Let . Then . Since , and since r is not longer than the reference node in the path spelling out from the root node of , there exists at least one integer m such that and . Hence there exists a node s in such that and has an out-going edge labeled with and leading to . □
It is not difficult to see that the edge in each case of Lemma 7 does not exist in . Hence we create the edge when we update to .
The next lemma states how to locate node s of Case 2 of Lemma 7.
Lemma 8 For each t, we can locate node s such that in amortized constant time.
Proof. Let be the longest node in the tree such that is a prefix of .
Figure 5.
Illustration of Case 2 of Lemma 7.
Consider the largest possible t and denote it by . Since , the node can be found in time by going down the path that spells out from the root node (recall that Σ is fixed). Let be the string such that . If , then we create a new child node of such that . Otherwise, we set .
Now assume that we have located nodes and . We can then locate as follows. Consider node . Remark that is a prefix of , and thus we can detect in time by using the suffix link. After finding , we can locate or create in constant time.
The total time cost for detecting for all is linear in . Hence we can locate each in amortized constant time. □
Let v be the reference node in the path from the root to some . Assume that is removed from the subtree of v, and redirected to node s in the same path, such that . In order to update to , we have to maintain triple for node v. One may be concerned that if is neither nor and in , the occurrences of in do not form a single arithmetic progression any more. However, we have the following lemma. For any factor y of , let , namely, denotes the occurrences of y in that overlap with the i-th greedily selected occurrence of in .
Figure 6.
Illustration of proof for Lemma 9.
Lemma 9 Let v be any reference node of such that . For any integer , if , then there is no integer r such that and . (See Figure 6.)
Proof. Assume to the contrary that there exists an integer r such that and . Since , there exist integers such that , and . For any integer j such that and , we have . Since , . As is non-repeating, . Since , is a factor of . Therefore, there exist two integers such that . Since , is repeating and . This contradicts the assumption that is an LRF of . □
Recall that p is the beginning position of the i-th largest greedily selected occurrence of in . Also, for any such that for every , we have removed from the subtree rooted at the reference node v and have reconnected it to node s such that . According to the above lemma, if , for every is removed from the subtree of v. After processing , then is updated to where is the step of the progression, and is updated to .
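The update of the triple can be sketched for the case where the removed occurrences form a run at one end of the progression, which is the situation this sketch assumes the preceding lemma guarantees. The triple is taken to be a plain `(minimum, maximum, cardinality)` tuple and both function names are hypothetical.

```python
def _step(triple):
    """Common difference of the progression; 0 for a single element."""
    lo, hi, card = triple
    return (hi - lo) // (card - 1) if card > 1 else 0


def drop_smallest(triple, k):
    """Remove the k smallest positions; return None if nothing is left."""
    lo, hi, card = triple
    if not 0 <= k <= card:
        raise ValueError("k out of range")
    if k == card:
        return None
    # The minimum advances by k steps; the maximum is unchanged.
    return (lo + k * _step(triple), hi, card - k)


def drop_largest(triple, k):
    """Remove the k largest positions; return None if nothing is left."""
    lo, hi, card = triple
    if not 0 <= k <= card:
        raise ValueError("k out of range")
    if k == card:
        return None
    # The maximum retreats by k steps; the minimum is unchanged.
    return (lo, hi - k * _step(triple), card - k)
```

Each update touches only the three stored values, so the remaining positions still form a single progression and the node keeps constant-size bookkeeping.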
Notice that for every has to be removed from the tree, since and therefore this leaf node should not exist in (see the dark-shaded suffixes of Figure 3). Removing each leaf can be done in constant time. Maintaining the information about the triple for the arithmetic progression of the reference nodes can be done in the same way as mentioned above.
The following lemma states how to locate each reference node.
Lemma 10 Let p be the i-th greedily selected occurrence of in . For any integer such that , let denote the reference node of in the path from the root spelling out suffix . For each j such that , we can locate the reference node in amortized constant time.
Figure 7.
The left figure illustrates how to find from . The right one illustrates a special case where . Once , it stands that for any .
Proof. Let . We find by spelling out from the root in time, since there can be at most nodes in the path from the root to .
Suppose we have found . We find as follows. Let be the parent node of . We have and . We go to . Since , we have . Thus, we can find by going down the path starting from and spelling out . (See also the left illustration of Figure 7).
A special case happens when there exists a node s in the path from the root to , such that and the edge from s in the path starts with some non-terminal symbol with . Namely, . Due to the property of the longest first approach, we have . Thus . Moreover, for any , . (See also the right illustration of Figure 7). It is thus clear that each can be found in constant time. Since , the leaves corresponding to with do not exist in . □
From the above discussions, we conclude that:
Theorem 1 For any string , the proposed algorithm for text compression by longest first substitution runs in time using space.
Pseudocode for our algorithms is shown in Algorithms 1, 2, and 3.