2.2. General Approach
We will successfully solve the problem if we count, for each , the number of occurrences of p in the string , which arises from inserting t into s at position k. These occurrences can be divided into those that
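- (I) lie entirely within the part of s before the insertion point or entirely within the part of s after it;
- (II) lie entirely within t;
- (III) start within the part of s before the insertion point and end within t;
- (IV) start within t and end within the part of s after the insertion point; and
- (V) start within the part of s before the insertion point and end within the part of s after it, and in between extend over the entire t.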
Each of these five types of occurrences will be considered separately and the results added up at the end. Of course, type (II) comes into play only if , and type (V) only if .
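The five-way split can be checked on small inputs with a brute-force reference. The following is only a sketch for testing purposes, not part of the efficient solution; the function name `classify_occurrences` and the half-open interval convention (t occupies positions k to k + |t| - 1 of the combined string) are assumptions made for this illustration, and p and t are assumed to be non-empty.

```python
def classify_occurrences(s: str, t: str, p: str, k: int) -> dict:
    """Count occurrences of p in s[:k] + t + s[k:], split into types (I)-(V)."""
    w = s[:k] + t + s[k:]
    t_start, t_end = k, k + len(t)      # t occupies w[t_start:t_end]
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0, "V": 0}
    for a in range(len(w) - len(p) + 1):
        if w[a:a + len(p)] != p:
            continue
        b = a + len(p)                  # the occurrence occupies w[a:b]
        if b <= t_start or a >= t_end:
            counts["I"] += 1            # entirely left or entirely right of t
        elif a >= t_start and b <= t_end:
            counts["II"] += 1           # entirely inside t
        elif a < t_start and b <= t_end:
            counts["III"] += 1          # starts left of t, ends inside t
        elif a >= t_start and b > t_end:
            counts["IV"] += 1           # starts inside t, ends right of t
        else:
            counts["V"] += 1            # covers all of t and sticks out on both sides
    return counts


if __name__ == "__main__":
    print(classify_occurrences("abab", "ab", "ab", 2))
```

Summing the five counters gives the total number of occurrences of p in the combined string, which is the quantity the efficient algorithm has to report for each k.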
(I) Suppose that p appears at position i in s, that is . This occurrence lies entirely within if , therefore ; there are a total of such occurrences.
However, if we want the occurrence to lie entirely within , that means that ; such occurrences can be counted by taking the number of all occurrences of p in s and subtracting those that start at positions from 0 to . Thus, we get .
(II) The second type consists of all occurrences of p in t, so there are of these. We use this result at every k.
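The counting for types (I) and (II) can be sketched with a cumulative table over the occurrences of p in s. The names `find_occurrences`, `type_one_and_two_counts` and `cum` are hypothetical, and the occurrences are found naively here for brevity; a linear-time matcher would be used in practice.

```python
def find_occurrences(p: str, s: str) -> list:
    """Naive search: all start positions of p in s (stand-in for a linear-time matcher)."""
    return [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


def type_one_and_two_counts(s: str, t: str, p: str):
    """Return (list of type (I) counts for k = 0..|s|, the type (II) count)."""
    n, m = len(s), len(p)
    starts = set(find_occurrences(p, s))
    # cum[i] = number of occurrences of p in s that start at a position < i
    cum = [0] * (n + 2)
    for i in range(n + 1):
        cum[i + 1] = cum[i] + (1 if i in starts else 0)
    total = cum[n + 1]
    type_two = len(find_occurrences(p, t))       # the same for every k
    type_one = []
    for k in range(n + 1):
        left = cum[k - m + 1] if k >= m else 0   # occurrences starting at i <= k - |p|
        right = total - cum[k]                   # occurrences starting at i >= k
        type_one.append(left + right)
    return type_one, type_two


if __name__ == "__main__":
    print(type_one_and_two_counts("abab", "xab", "ab"))
```

After the linear preprocessing, each k is answered with two table lookups.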
(III) The third type consists of occurrences of p that begin within and end within t. Suppose that the left j characters of p lie in and the rest in t; thus ends with , and t starts with . We already know that the longest prefix of p appearing at the end of is characters long; for the purposes of our discussion, we can therefore replace with .
Let be the number of those occurrences of p in that start within the first part, i.e., in . One possibility is that such an occurrence starts at the beginning of the string (in this case t must start with ); the other option is that it starts a bit later, so that only the first characters of our occurrence lie in . Therefore ends with ; and we already know that the next position j for which this condition is satisfied is . The table D is prepared in advance with Algorithm 1 (with a linear time complexity), and then at each k, we know that occurrences of p start within and end within t.
In the above procedure, we need to be able to quickly check whether
t starts with
for a given
i. We will use tables
and
that were prepared in advance. Now
tells us that
is the longest suffix of
p that occurs at the beginning of
t; the second longest such suffix is
, the third
, and so on. We therefore prepare the table
E (with Algorithm 2), which tells us, for each possible index
j, whether
occurs at the beginning of
t. Hence, the condition “if
t starts with
” in our procedure for computing table
D can now be checked by looking at the value of
.
Algorithm 1 Computation of the table —occurrences of p in that start within the first part |
; for to : (*occurrences that have fewer than i characters in *) ; if t starts with then (*the occurrence of p at the start of *) ; |
Algorithm 2 Computation of the table —does occur at the start of t? |
for to do ; ; while : ; ; |
(IV) The fourth type consists of the occurrences of p that start within t and end within . These can be counted using the same procedure as for (III), except that we reverse all three strings.
(V) We are left with those occurrences of p that start within , end within , and contain the entire t in between. This means that must end with some prefix of p and we already know that the longest such prefix is characters long; similarly, must start with some suffix of p, and the longest such suffix is characters long, where . Instead of strings and we can focus on and from here on. (These i and j can of course be different for different k, and where relevant we will refer to them as and .) We will present several ways of counting the occurrences of p in strings of the form , from simpler and less efficient to more efficient, but more complicated. Everything we have presented so far regarding types (I) to (IV) had linear time and space complexity, , so the computational complexity of the solution as a whole depends mainly on how we solve the task for type (V).
2.3. Quadratic Solution
Let be the number of those occurrences of p in the string that start within and end within . This function is interesting for and (so that the first and third part, and , are shorter than p). Edge cases are , because then the first or third part is empty and therefore p cannot overlap with it.
Let us now consider the general case. One occurrence of
p in
may already appear at the beginning of this string; and the rest must begin later, such that within the first part of our string,
, they only have some shorter prefix, say
, which is therefore also a suffix of the string
. The longest such prefix is of length
; all these later occurrences therefore lie not only within
, but also within
, so there are
of them. We have thus obtained a recurrence that is the core of Algorithm 3 for calculating all possible values of the function
f using dynamic programming.
Algorithm 3 A quadratic dynamic programming solution for counting occurrences of p in |
1 for to do for to : 2 ; (*Does p occur at the start of ?*) 3 if t occurs in p at position i (i.e., if ) then 4 if and starts with then 5 ; (*Add later occurrences.*) 6 ; 7 ; |
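The recurrence can be illustrated with a small sketch and compared against a brute-force count. Several things here are assumptions for the sake of the example: the combined string is taken to be `p[:i] + t + p[j:]`, the table `fail` holds the length of the longest proper border of each prefix of p, and the check whether p occurs at the start is done naively with `startswith`, which costs linear time per cell. The precomputed tables and the ancestor test discussed in the text would be needed to make that check constant-time.

```python
def prefix_function(p: str) -> list:
    """fail[i] = length of the longest proper prefix of p[:i] that is also its suffix."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def f_table(p: str, t: str) -> list:
    """f[i][j] = occurrences of p in p[:i] + t + p[j:] that start in the first
    part and end in the third part (0 <= i < |p|, 1 <= j <= |p|)."""
    m = len(p)
    fail = prefix_function(p)
    f = [[0] * (m + 1) for _ in range(m)]        # row i = 0 stays all zero
    for i in range(1, m):
        for j in range(1, m + 1):
            w = p[:i] + t + p[j:]
            # An occurrence at the very start must reach past t into p[j:].
            at_start = i + len(t) < m and w.startswith(p)
            # Later occurrences lie within p[:fail[i]] + t + p[j:].
            f[i][j] = int(at_start) + f[fail[i]][j]
    return f


def f_brute(p: str, t: str, i: int, j: int) -> int:
    """Direct count, used only to validate the recurrence."""
    w = p[:i] + t + p[j:]
    m = len(p)
    return sum(1 for a in range(min(i, len(w) - m + 1))
               if w.startswith(p, a) and a + m > i + len(t))


if __name__ == "__main__":
    print(f_table("aaa", "a")[2][2], f_brute("aaa", "a", 2, 2))
```

Filling the rows in increasing order of i works because `fail[i]` is always smaller than i.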
Before the procedure for calculating the function
f is actually useful, we still need to consider some important details. We check the condition in line 3 by checking if
. In line 4 the condition
checks whether an occurrence of
p that started at the beginning of the string
would even reach
at the end, because we are only interested in such occurrences—those that lie entirely within
, have already been considered under type (III) in
Section 2.2. Next, we have to check, in line 4, whether some suffix of
p begins with some shorter suffix of
p. Recall that the longest
that occurs at the beginning of the string
is the one with
; the next is
for
, and so on. In line 4 we are essentially checking whether this sequence of shorter suffixes eventually reaches
. For this purpose we can use a tree structure, which we will also need later in more efficient solutions. Let us construct a tree
in which there is a node for each
u from 1 to
; the root of the tree is the node
(recall that the number in the suffix notation, such as
, represents the position or index at which this suffix begins; therefore, shorter suffixes have larger indices, and the largest among them is
, which represents the empty suffix), and for every
, let node
u be a child of node
. Line 4 now actually asks whether node
is an ancestor of node
j in the tree
.
This can be checked efficiently if we do a preorder traversal of all nodes in the tree—that is, the order in which we list the root first, then recursively add each of its subtrees (Algorithm 4). For each node
u, we remember its position in that order (say
) and the position of the last of its descendants (say
). Then the descendants of
u are exactly those nodes that are located in the order at indices from
to
. To check whether some
is a prefix of
, we only have to check if
. (Another way to check whether
is a prefix of
is with the
Z-algorithm ([
7], pp. 7–10). For any string
w, we can prepare a table
in
time, where the element
(for
) tells us the length of the longest common prefix of the strings
w and
. The question we are interested in—i.e., whether
is a prefix of
—is equivalent to asking whether
, which is equivalent to
, and hence to the question whether
and
match in the first
characters, i.e., whether
. However, for our purposes the tree-based solution is more useful than the one using the
Z-algorithm, as we will later use the tree in solutions with a lower time complexity than the quadratic one that we are describing now.)
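For reference, the table used by the Z-algorithm can be computed as in the following sketch. It follows the definition given above (entry i is the length of the longest common prefix of w and its suffix starting at i); the function name `z_function` and the convention that entry 0 equals the length of w are choices made for this illustration.

```python
def z_function(w: str) -> list:
    """z[i] = length of the longest common prefix of w and w[i:]."""
    n = len(w)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0                    # w[left:right] is the rightmost match found so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the current match window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and w[z[i]] == w[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


if __name__ == "__main__":
    print(z_function("abacaba"))
```

Each character comparison that succeeds moves the right end of the window forward, which is what gives the linear running time.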
Our procedure so far (Algorithm 3) has computed
by checking whether
p occurs at the beginning of
and handling subsequent occurrences via
for
. Of course, we could also go the other way: we would first check whether
p occurs at the end (instead of the beginning) of
and then add earlier occurrences via
for
. To check if
p occurs at the end of
, we should first check whether
t occurs in
p at indices from
to
j (with the same table as before in line 3), and then we would be interested in whether
ends with
. Here, it would be helpful to construct a tree
, which would have one node for each
u from 0 to
; node 0 would be the root, and for every
, the node
u would be a child of node
. Using the procedure
Preorder on this tree would give us tables
and
. We should then check whether
is an ancestor of node
i in the tree.
Algorithm 4 Tree traversal for determining the ancestor–descendant relationships |
procedure Traverse (tree T, node u, tables and , index i): ; ; for every child v of node u in tree T: Traverse(T, v, σ, τ, i); ; return i; |
procedure Preorder(input: tree T; output: tables and ): let and be tables with indices from 1 to ; Traverse(T, root of T, σ, τ, 0); return ; main call: Preorder(TS); |
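A possible realization of this traversal is sketched below. It is written iteratively to avoid deep recursion on path-like trees. The representation of the tree as a dictionary `children`, and the names `preorder_tables` and `is_ancestor`, are assumptions made for the example. Note that in this sketch a node counts as its own ancestor.

```python
def preorder_tables(children: dict, root):
    """Return (sigma, tau): sigma[u] is the preorder position of u (starting at 1),
    tau[u] is the preorder position of the last descendant of u."""
    sigma, tau = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        u, finished = stack.pop()
        if finished:
            tau[u] = counter            # every descendant of u has been numbered
            continue
        counter += 1
        sigma[u] = counter
        stack.append((u, True))         # revisit u after its whole subtree
        for v in reversed(children.get(u, [])):
            stack.append((v, False))
    return sigma, tau


def is_ancestor(a, d, sigma: dict, tau: dict) -> bool:
    """True if a is d itself or an ancestor of d."""
    return sigma[a] <= sigma[d] <= tau[a]


if __name__ == "__main__":
    sigma, tau = preorder_tables({0: [1, 2], 1: [3]}, 0)
    print(is_ancestor(1, 3, sigma, tau), is_ancestor(2, 3, sigma, tau))
```

Once the two tables exist, each ancestor query is two integer comparisons.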
Thus, we see that we can compute quickly, in time, either from or from ; both ways will come in handy later.
Our current procedure for computing the function f would take time to prepare trees and and tables , , and , followed by time to calculate the value of for all possible pairs . This solution has a time complexity of .
2.4. Distinguished Nodes
We do not really need the values of , which we calculated in the quadratic solution, for all pairs , but only for one such pair for each k (i.e., each possible position where t can be inserted into s), which is only values and not all . We will show how to compute those values of the function f that we really need, without computing all the others.
Let us traverse the tree from the bottom up (from leaves towards the root) and compute the sizes of the subtrees; whenever we encounter a node u with its subtree (let us call it ; it consists of u and all its descendants) of size at least nodes, we mark u as a distinguished node and cut off the subtree . Cutting it off means we will not count the nodes in it when we consider subtrees of u’s ancestors higher up in the tree. When we finally get to the root, we mark it as well, regardless of how many nodes are still left in the tree at that time. Algorithm 5 calculates, for each node u, its nearest marked ancestor ; if u itself is marked, then .
Now we have at most
marked nodes (because every time we marked a node, we also cut off at least
nodes from the tree, and the number of all nodes is
); and each node is less than
steps away from its nearest marked ancestor (if it were
or more steps away, some lower-lying ancestor of this node would have a subtree with at least
nodes, and would therefore get marked and its subtree cut off).
Algorithm 5 Marking distinguished nodes |
procedure Mark(u): (*Variable N counts the number of nodes in reachable through unmarked descendants.*) ; for every child v of node u: Mark(v); if or then M[u]: = u ; (*Mark u.*) else M[u]: = −1; return N; (*Unmarked u.*) |
main call: Mark(0); (*Start at the root—node 0.*) (*Pass the information about the nearest marked ancestor down the tree to all unmarked nodes. (from parents to children)*) for u: = 1 to |p| − 1 do if M[u] < 0 then M[u]: = M[Pp[u]]; |
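The marking procedure can be sketched without recursion if the tree is given as a parent table in which every node has a smaller index than its children (as in the second loop of Algorithm 5, which relies on parents being processed before children). The function name `mark_distinguished` is hypothetical, and the size threshold is left as a parameter `threshold` rather than fixed to a particular value.

```python
def mark_distinguished(parent: list, threshold: int) -> list:
    """parent[u] < u for every u >= 1; node 0 is the root (parent[0] is ignored).
    Returns M, where M[u] is the nearest marked ancestor of u (u itself if marked)."""
    n = len(parent)
    size = [1] * n                      # size of the not-yet-cut-off subtree of each node
    M = [-1] * n
    # Children have larger indices than parents, so a descending loop is bottom-up.
    for u in range(n - 1, 0, -1):
        if size[u] >= threshold:
            M[u] = u                    # mark u and cut its subtree off
        else:
            size[parent[u]] += size[u]  # still attached: counts towards the parent
    M[0] = 0                            # the root is always marked
    # Pass the nearest marked ancestor down from parents to children.
    for u in range(1, n):
        if M[u] < 0:
            M[u] = M[parent[u]]
    return M


if __name__ == "__main__":
    path = [0] + list(range(8))         # a path 0 - 1 - ... - 8
    print(mark_distinguished(path, 3))
```

On the path of nine nodes with threshold 3, nodes 0, 3 and 6 end up marked, and every node is fewer than three steps below its nearest marked ancestor.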
We know that
for every
i. For each marked
i, we can compute all values of
for
from
; since there are at most
marked nodes, this takes
time. Let us describe how we can answer each of the
queries
that we are interested in. The nearest marked ancestor of node
is
; since it is marked, we already know
for all
j, i.e., also for
, and since
lies in the tree
at most
steps below node
, we can calculate
from
in at most
steps. Since we have to do this for each query, it takes overall
time. To save space, we can solve queries in groups according to
; after we process all queries that share the same node
, we can forget the results
. Algorithm 6 presents the pseudocode of this approach to counting the occurrences of type (V).
Algorithm 6 Rearranging queries |
(*Group queries based on the nearest marked ancestor.*) for to do if then empty list; for to do add k to ; (*Process marked nodes u.*) for to do if : (*Compute for all j as .*) ; for downto 1 do in store that is computed from , which is stored in ; (*Answer queries , where u is the nearest marked ancestor of .*) for each k in : (*Prepare the path in the tree from to its ancestor u.*) empty stack; ; while : push i to S and assign ; (*Compute for all nodes i on the path from u to .*) ; (*i.e., *) while S not empty: node popped from S; from , currently stored in r, compute and store it in r; (*r is now equal to , i.e., the answer to query k, which is the number of occurrences of p in that start within and end within .*) |
Everything we did to count the occurrences of types (I) to (IV) had a linear time complexity in terms of the length of the input strings, so the overall time complexity of this solution is with an space complexity. (Note that this time complexity can be worse than the quadratic solution in cases where .)
2.5. Geometric Interpretation
We want to count those occurrences of
p in
that start within
and end within
(
Figure 1). Such an occurrence is therefore of the form
, where
,
, and the following three conditions apply: (
a)
must be a suffix of
, (
b)
must be a prefix of
, and (
c)
t must appear in
p as a substring starting at index
ℓ, i.e.,
.
To check (c), we prepare the table in advance and check whether .
Let us now consider the condition (a), i.e., that must be a suffix of . We know that the longest prefix of p that is also a suffix of is characters long; the second longest is characters long, and so on. The condition that must be a suffix of essentially asks whether ℓ occurs somewhere in the sequence . This is equivalent to asking whether ℓ is an ancestor of i in the tree , which, as we have already seen, can be checked with the condition .
Similar reasoning holds for condition (b), i.e., that must be a prefix of , except that now instead of the table and the tree we use the table and the tree . The condition (b) asks whether is an ancestor of j in the tree , which can be checked with .
The two inequalities we have thus obtained,
have a geometric interpretation which will prove useful: if we think of
and
as the
x- and
y-coordinates, respectively, of a point on the two-dimensional plane, then the two inequalities require the
x-coordinate to lie on a certain range and the
y-coordinate to lie within a certain other range; in other words, the point must lie within a certain rectangle.
Recall that i and j depend on the position k where the string t has been inserted into s; therefore, to avoid confusion, we will now denote them by and . We will refer to the point from the previous paragraph as , and to the rectangle as . Thus, we now see that p occurs in such that the occurrence of t at position ℓ of p aligns with the occurrence of t at position k of if and only if lies in . Actually we are not interested in whether a match occurs for a particular k and ℓ, but in how many occurrences there are for a particular k; in other words, for each k (from 1 to ) we want to know how many rectangles contain the point . Here ℓ ranges over all positions where t appears as a substring in p (i.e., where or, equivalently, ); by limiting ourselves to such ℓ, we also take care of condition (c). Thus, we have points and rectangles, and for each point we are interested in how many of the rectangles contain it.
We have thus reduced our string-searching problem to a geometrical problem, which turns out to be useful because the latter problem can be solved efficiently using a well-known computational geometry approach called a
plane sweep [
8]. Imagine moving a vertical line across the plane from left to right (
x-coordinates—i.e., the values from the
table—go from 1 to
) and let us maintain some data structure with information about which
y-coordinate intervals are covered by those rectangles that are present at the current
x-coordinate. When we reach the left edge of a rectangle during the plane sweep, we must insert it into this data structure, and when we reach its right edge, we remove it from the data structure. Because the
y-coordinates (i.e., the values from the
table) in our case also take values from 1 to
, a suitable data structure is, for example, the Fenwick tree [
9]. Imagine a table
h in which the element
stores the difference between the number of rectangles (from those present at the current
x-coordinate) that have their lower edge at
y, and those that have their upper edge at
. Then the sum
is equal to the number of rectangles that (at the current
x) cover the point
. The Fenwick tree allows us to compute such sums in
time and also make updates of the tree with the same time complexity, when any of the values
changes. Algorithm 7 describes this plane sweep with pseudocode.
Algorithm 7 Plane sweep |
let F be a Fenwick tree with all values initialized to 0; for to : for every rectangle with the left edge at : increase and decrease in F by 1; for every point with : compute the sum in F, i.e., the number of occurrences of p in that start within and end within ; for every rectangle with the right edge at : decrease and increase in F by 1; |
We have to prepare in advance, for every coordinate
x, a list of rectangles that have their left edge at
x, a list of rectangles that have their right edge at
x, and a list of points
with this
x-coordinate. Preparing these lists takes
time; each operation on
F takes
time, and these operations are
insertions of rectangles,
deletions and
sum queries. If we add everything we have seen in dealing with types (I) to (IV) in
Section 2.2, the total time complexity of our solution is
. Instead of using the Fenwick tree, we could use the square root decomposition on the table
h (where we divide the table
h into approximately
blocks with
elements each and we maintain the sum of each block), but the time complexity would increase to
.
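The plane sweep can be sketched as follows for a generic set of points and axis-parallel rectangles with integer coordinates from 1 to `size`. The class `Fenwick`, the function `count_covering_rectangles`, and the tuple layouts for points and rectangles are assumptions made for this illustration; rectangle bounds are taken to be inclusive.

```python
class Fenwick:
    """Fenwick tree over positions 1..n with point updates and prefix sums."""

    def __init__(self, n: int):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i: int, delta: int) -> None:
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i: int) -> int:
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total


def count_covering_rectangles(points, rects, size: int) -> list:
    """points: list of (x, y); rects: list of (x1, x2, y1, y2), bounds inclusive.
    Returns, for each point, the number of rectangles that contain it."""
    opens = [[] for _ in range(size + 2)]
    closes = [[] for _ in range(size + 2)]
    at_x = [[] for _ in range(size + 2)]
    for x1, x2, y1, y2 in rects:
        opens[x1].append((y1, y2))
        closes[x2].append((y1, y2))
    for idx, (x, y) in enumerate(points):
        at_x[x].append((idx, y))
    fw = Fenwick(size + 1)              # one extra slot for the update at y2 + 1
    answer = [0] * len(points)
    for x in range(1, size + 1):
        for y1, y2 in opens[x]:         # rectangle becomes active
            fw.add(y1, 1)
            fw.add(y2 + 1, -1)
        for idx, y in at_x[x]:          # prefix sum = number of active covers of y
            answer[idx] = fw.prefix_sum(y)
        for y1, y2 in closes[x]:        # rectangle stops being active
            fw.add(y1, -1)
            fw.add(y2 + 1, 1)
    return answer


if __name__ == "__main__":
    rects = [(1, 3, 1, 3), (2, 4, 2, 5)]
    pts = [(2, 2), (4, 1), (3, 4), (1, 1), (5, 5)]
    print(count_covering_rectangles(pts, rects, 5))
```

Processing insertions before queries and deletions after them at the same x-coordinate makes both the left and the right edge of each rectangle inclusive.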
2.6. Linear-Time Solution
We will first review a recently published algorithm by Ganardi and Gawrychowski [
10] (hereinafter: GG), which we can adapt for solving our problem with a linear time complexity. The algorithm we need is in ([
10], Theorem 3.2); we will only need to adjust it a little so that it counts the occurrences of
p instead of just checking whether any occurrence exists at all. Intuitively, this algorithm is based on the following observations: if a string of the form
(where
is a prefix of
p and
is a suffix of
p) contains an occurrence of
p which begins within
u and ends within
v, this occurrence of
p must have a sufficiently long overlap with at least one of the strings
u,
t and
v. All these three strings are substrings of
p, and if
p overlaps heavily with one of its substrings, it necessarily follows that the overlapping part of the string must be periodic, and any additional occurrences of
p in
can only be found by shifting
p by an integer number of these periods. The number of occurrences can thus be calculated without having to deal with each occurrence individually. The details of this, however, are somewhat more complicated and we will present them in the rest of this section.
In the following, we will come across some basic concepts related to the periodicity of strings ([
11], Chapter 8). We will say that the string
x is
periodic with a period of length
d if
and if
for each
i in the range
; in other words, if
, i.e., if some string of length
is both a prefix and a suffix of
x. The prefix
is called a
period of
x; if there is no risk of confusion, we will also use the term
period for its length
d.
A string x can have periods of different lengths; we will refer to the shortest period of x as its base period. If some period of x is at most characters long, then its length is a multiple of the base period. (This can be proven by contradiction. Suppose this were not always true; consider any such x that has a base period y and also some longer period z, where and for which is not a multiple of . Then is of the form for some and some r from the range . Define and ; thus and . Let us look at the first characters of the string x; since y is a period of x, this prefix has the form ; and since z is a period of x, this prefix has the form ; but it is the same prefix both times, so the last characters of this prefix must be the same: . This means that if we concatenate several copies of the strings and together, it does not matter in what order we do it; so . Now consider ; since x has a period y and , x is a prefix of , and since , x is also a prefix of ; therefore, x has a period , which is in contradiction with the initial assumption that y is the base (i.e., shortest) period of x.)
It follows from the definition of periodicity that if y is both a prefix and a suffix of x (and if ), then x is periodic with a period of length . (If, in addition, also holds, so that occurrences of the string y at the beginning and end of the string x overlap, y is also periodic with a period of this length.) The longer y is, the shorter this period, so we obtain the base period from the longest y that is both a prefix and a suffix of the string x. If instead of x we consider only its prefix , we know that the longest y that is both a prefix and a suffix of is characters long; therefore, has a base period (if , this formula gives the base period i, which of course means that is not periodic at all).
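The relation between the longest border and the base period can be sketched for all prefixes of p at once. The names `prefix_function`, `fail` and `base_periods` are assumptions of this example; `fail[i]` is taken to be the length of the longest string that is both a proper prefix and a suffix of the first i characters of p.

```python
def prefix_function(p: str) -> list:
    """fail[i] = length of the longest proper prefix of p[:i] that is also its suffix."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def base_periods(p: str) -> list:
    """Entry i is the length of the shortest period of the prefix p[:i]."""
    fail = prefix_function(p)
    return [i - fail[i] for i in range(len(p) + 1)]


if __name__ == "__main__":
    print(base_periods("abaabaab"))
```

For "abaabaab" the prefixes of length 5 and more all have base period 3, matching the repeated block "aba"; a prefix whose entry equals its own length has no border and is not periodic in any useful sense.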
Let
lcp be the length of the longest common prefix of
and
, i.e., of two suffixes of
p. We can compute this in
time if we have preprocessed the string
p to build its suffix tree and some auxiliary tables (in
time). Each suffix of
p corresponds to a node in its suffix tree; the longest common prefix of two suffixes is obtained by finding the lowest common ancestor of the corresponding nodes, which is a classic problem and can be answered in constant time with linear preprocessing time [
12,
13]. To build a suffix tree in linear time, one can use, for example, Ukkonen’s algorithm [
14], or first construct a suffix array (e.g., using the DC3 algorithm of Kärkkäinen et al. [
15,
16]) and its corresponding longest-common-prefix array [
17], and then use these two arrays to construct the suffix tree (this latter approach is what we used in our implementation). We can similarly compute the longest common suffix of two prefixes of
p, which will be useful later.
Suppose that x and y are two substrings of p, e.g., and , and that we would like to check whether x appears as a substring in y starting at position k; then (provided that ) we only need to check whether lcp; if that is true, x really appears there (), otherwise the value of the function lcp tells us after how many characters the first mismatch occurs. We can further generalize this: if are substrings of p and we are interested in whether x appears as a substring in starting at position k, we need to perform at most three calls to the lcp function: first, for the part where x overlaps with u; if no mismatch is observed there, we check the area where x overlaps with t; if everything matches there too, we check the area where x overlaps with v.
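The substring test described above can be sketched with a naive stand-in for the constant-time lcp query. The function `lcp` below compares characters one by one and is therefore linear in the worst case; it only illustrates how the query is used, not how the suffix-tree version answers it. The name `occurs_at` and its parameter layout (substrings given by start index and length within p) are assumptions of this example.

```python
def lcp(p: str, a: int, b: int) -> int:
    """Length of the longest common prefix of the suffixes p[a:] and p[b:]
    (naive stand-in for a constant-time suffix-tree query)."""
    n = 0
    while a + n < len(p) and b + n < len(p) and p[a + n] == p[b + n]:
        n += 1
    return n


def occurs_at(p: str, x_start: int, x_len: int,
              y_start: int, y_len: int, k: int) -> bool:
    """Does x = p[x_start:x_start+x_len] occur in y = p[y_start:y_start+y_len]
    starting at offset k of y?"""
    if k < 0 or k + x_len > y_len:
        return False                    # x would not fit inside y at this offset
    return lcp(p, x_start, y_start + k) >= x_len


if __name__ == "__main__":
    p = "abracadabra"
    # x = "abra" (p[0:4]), y = "adabra" (p[5:11]); x occurs in y at offset 2.
    print(occurs_at(p, 0, 4, 5, 6, 2))
```

When the test fails because of a mismatch, the value returned by `lcp` is the number of characters that matched before it.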
Let us begin with a rough outline of the GG algorithm (Algorithm 8). Recall that we would like to solve
problems of the form “how many occurrences of
p are there in the string
that start within
u and end in
v?”, where
u is some prefix of
p, and
v is some suffix of
p. (To be more precise: for each position
k from the range
we are interested in occurrences of
p in the string
, where, as we have seen, we can limit ourselves to the string
with
and
.) Assume, of course, that
and that
t appears at least once as a substring in
p (which can be checked with the table
), because otherwise the occurrences we are looking for cannot exist at all. Since the strings
p and
t are the same in all our queries, we will list only
u and
v as the arguments of the GG function. Regarding the loop 1–3, we note that it will have to be executed at most twice, since
u starts as a prefix of
p, i.e., it is shorter than
; it is at least halved in each iteration of the loop, therefore it will be shorter than
after at most two halvings. In step 2, we rely on the fact that the occurrences of
p we are looking for overlap significantly with
u (by at least one half of the latter); we will look at the details in
Section 2.6.1. In step 3, we want to discover for some substring of
p (namely the right half of the current
u, which was a prefix of
p) how long is the longest suffix of this substring that is also a prefix of
p; we do not know how to do this in
time, but we can solve
such problems as a batch in
time, which is good enough for our purposes, since we know that we will have to execute the GG algorithm for
pairs
. (For details of this step, see
Section 2.6.3.) Steps 4–6 are just a mirror image of steps 1–3 and the same considerations apply to them; that loop also makes at most two iterations. (Furthermore, note that steps 3 and 6 ensure that no occurrence of
p is counted more than once: for example, those counted in step 2 all start within the left half of
u, and this is immediately cut off in step 3, so these occurrences do not appear again, e.g., in step 5.) In step 7, we then know that
u and
v are shorter than
; if
is longer than
p, then
t must be at least
characters long and we can make use of this when counting occurrences of
p in
; we will look at the details in
Section 2.6.2. As we will see, steps 2 and 7 each take
time; the GG algorithm can therefore be executed
times in
time, and
time is spent on preprocessing the data, so we have a solution with a linear time complexity.
Algorithm 8 Adjustment of the GG algorithm |
algorithm GG(u, v): input: two strings, u (a prefix of p) and v (a suffix of p); output: the number of occurrences of p in that start within u and end within v; 1 while : 2 count occurrences of p in that start within the left half of u (at positions ) and end within v; 3 the longest suffix of the current u that is shorter than and is also a prefix of p; 4 while : 5 count occurrences of p in that start within u and end within the right half of v (at most characters before the end of v); 6 the longest prefix of the current v that is shorter than and is also a suffix of p; 7 count all occurrences of p in the current string ; |
2.6.1. Counting Early and Late Occurrences
Let us now consider step 2 of Algorithm 8 in more detail. The main idea will be as follows: we will find the base period of the string u; calculate how far this period extends from the beginning of strings and p; and discuss various cases of alignment with respect to the lengths of these periodic parts.
We are, then, interested in occurrences of
p in
that begin at some position
. Since
u is also a prefix of
p, the occurrence of
u at the beginning of
p overlaps by at least one half with that at the beginning of
(
Figure 2). So
u’s suffix of length
is also its prefix; therefore,
u has a period of length
; and since the length of this period is at most
, it must be a multiple of the base period. Let us denote the base period by
d (represented by the arcs under the strings in
Figure 2); recall that it can be computed as
. We need to consider only positions of the form
for integer
; the lower bound for
is obtained from the conditions
and
(so that the occurrence of
p ends within
v and not sooner), and the upper bound from the conditions
(so that the occurrence of
p begins within the first half of
u) and
(so that the occurrence of
p does not extend beyond the end of
). Let us call these bounds
and
. If
, we can immediately conclude that no appropriate occurrence of
p exists.
The string
u is a prefix of
p and has a base period
d; perhaps there exists a longer prefix of
p with this period; let
be the longest such prefix. We obtain it by finding the longest common prefix of the strings
p and
, i.e.,
lcp(0,d). Let
be the length of the longest common prefix of the strings
and
(in other words, we are interested in whether
appears in
starting at
; we already know that there is certainly no mismatch with
u, so at most two calls of
lcp are enough to check whether everything matches with
t and
v or find where the first mismatch occurs). Then we know that the string
is periodic with the same period of length
d as
and
u. An example is shown in
Figure 3.
(1) If , we have found a suitable occurrence of p, and since has a period of d until the end of this occurrence, this means that if we were now to shift p in steps of d characters to the left and then compare it with the same part of , we would see exactly the same substring in after each such shift as before, i.e., we would also find occurrences of p in all these places. In this case, all from to are suitable, so we have found occurrences.
(2) If , we have found an occurrence of in starting at . This occurrence of may or may not continue into an occurrence of the entire p; this can be checked with at most two more calls of lcp. As for possible occurrences further to the left, i.e., starting at for some : in such an occurrence, should match with the character , which is still in the periodic part of (), i.e., it is equal to the character d places further to the left, which in turn must be equal to the character . So and should be the same, but they certainly are not, because then the periodic prefix of p (with period d) would be at least characters long, not just characters. Thus, we see that for there is definitely a mismatch and p does not occur there.
(3) If , we have a mismatch between and . We did not find an occurrence of p at , but what about for smaller values of ? If we now move p left in steps of d, the same character of the string will have to match , , and so on. As long as these indices are smaller than , all these characters are equal to the character , since this part of p is periodic with period d; therefore, a mismatch will occur for them as well. In general, if p starts at , the character will have to match ; a necessary condition to avoid the aforementioned mismatch is therefore or . Thus, we have obtained a new, tighter upper bound for , which we will call . For , the string lies entirely within the periodic part of the string , i.e., within .
(3.1) If , we know that there is no possible and we will not find an occurrence of p here.
(3.2) Otherwise, if , we see that for , the string p (if it starts at position in ) lies entirely within the periodic part of ; since means that p is itself entirely periodic with the same period of length d, we can conclude that, for each (from to ), p matches completely with the corresponding part of ; thus there are occurrences.
(3.3) However, if
, we can reason very similarly as in case (2). At
we have an occurrence of
, which may or may not be extended to an occurrence of the whole
p; we check this with at most two calls of
lcp. At
the string
lies entirely within the periodic part of the string
; if we then decrease
and thus shift
p to the left by
d or a multiple of
d characters, then
will certainly also fall within the periodic part of the string
. In order for the entire
p to occur at such a position, among other things,
and
would have to match with the corresponding characters of the string
, but since these two characters are both in the periodic part of the string
, they are equal, while the characters
and
are not equal (because
is the longest prefix of
p with period
d); therefore, a mismatch will definitely occur for every
and
p does not appear there. Step 2 of the GG algorithm can therefore be summarized by Algorithm 9.
Algorithm 9 Occurrences of p in that start within the left half of u and end within v. |
; ; if then return 0; length of the longest prefix of p with a period d; length of the longest common prefix of and ; if : if then return else if p is a prefix of then return 1 else return 0; else: ; if then return 0 else if then return else if p is a prefix of then return 1 else return 0; |
2.6.2. Counting the Remaining Occurrences
Now let us consider step 7 of Algorithm 8. At that point in the GG algorithm, we have already truncated u and v so that they are shorter than . If or or , we can immediately conclude that there are no relevant occurrences of p. We can assume that , as otherwise the string would certainly be shorter than .
We are interested in such occurrences of p in that start within u and end within v; such an occurrence also covers the entire t in the middle. Candidates for a suitable occurrence of p in are therefore present only where t appears in p (more precisely: the fact that t occurs in p starting at ℓ is a necessary condition for p to appear in starting at ). We compute in advance (using the table ) the index of the first and (if it exists) second occurrence of t as a substring in p; let us call them and . If there was just one occurrence of t in p, then there is also just one candidate for the occurrence of p in and with two more calls of lcp we can check whether p really occurs there (i.e., whether is a suffix of u and is a prefix of v).
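As a small illustration of this precomputation, the following Python sketch finds the first two positions at which t occurs in p. It relies on the built-in `str.find` purely for brevity, whereas the approach described above derives these positions from a precomputed table; the function name `first_two_occurrences` is hypothetical, and t is assumed to be nonempty.

```python
def first_two_occurrences(p: str, t: str):
    """Return (first, second) start positions of t in p.

    A position is None if the corresponding occurrence does not exist.
    Assumes t is nonempty. Illustrative only: this is a plain library
    search, not the table-based precomputation used in the paper.
    """
    first = p.find(t)
    if first < 0:
        return None, None
    # Restart the search one position to the right of the first hit,
    # so that occurrences overlapping the first one are found as well.
    second = p.find(t, first + 1)
    return first, (second if second >= 0 else None)


print(first_two_occurrences("xabababay", "ababa"))  # (1, 3)
print(first_two_occurrences("xabay", "aba"))        # (1, None)
```

If the second position is None, there is a single candidate alignment of p, which can then be verified directly.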
Now suppose that there are at least two occurrences of t in p. Let us consider some occurrence of p in (of course one that starts within u and ends within v); the string t in matches one of the occurrences of t in p; let ℓ be the position where this occurrence of t in p begins (Figure 4). There are at most characters to the left of this occurrence of t in p (otherwise p would already extend beyond the left edge of ), and there are at most characters to the right of it (otherwise p would extend beyond the right edge of ). Therefore, no other occurrence of t in p can begin more than characters to the left of ℓ or more than characters to the right of ℓ. If another occurrence of t in p begins at position , then .
Since no occurrence of t in p is very far from ℓ, it follows that the first two occurrences of t in p (i.e., those at and ) cannot be very far from each other. We can verify this as follows: (1) if the occurrence of t in p at ℓ was exactly the first occurrence of t in p (i.e., if ), it follows that is greater than by at most , so ; (2) if the occurrence of t in p at ℓ was not the first occurrence of t in p, but the second or some later one (that is, if ), it follows that and are both less than or equal to ℓ, but by at most , so again .
We see that if p occurs in (starting within u and ending within v), the first two occurrences of t in p are less than apart. So, we can start by computing and if , we can immediately conclude that we will not find, in , any occurrences of p that start within u and end within v.
Otherwise, we know henceforth that , so the first two occurrences of t in p overlap by more than half; the overlapping part is, therefore, both a prefix and a suffix of t, say for and ; so t is periodic with a period of length . Now suppose that t has some shorter period . If , the characters and both lie in the first occurrence of t in p, so they are equal since t has period ; if , then both of these characters lie in the second occurrence of t in p and are therefore equal. Since this is true for all , we have found another intermediate occurrence of t in p starting at , which contradicts the assumption that the second occurrence is the one at . Thus, d is the shortest period of t.
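The relationship between overlapping occurrences and periods can be checked on a small example. The sketch below computes the shortest period of t from the standard KMP failure table and compares it with the distance between the first two occurrences of t in p. All helper names (`failure_table`, `shortest_period`, `has_period`) are hypothetical and chosen only for this illustration.

```python
def failure_table(s: str) -> list:
    """fail[i] = length of the longest proper border of s[:i] (KMP)."""
    fail = [0] * (len(s) + 1)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k]
        if s[i] == s[k]:
            k += 1
        fail[i + 1] = k
    return fail


def shortest_period(s: str) -> int:
    """Shortest period of s = length minus its longest proper border."""
    return len(s) - failure_table(s)[len(s)]


def has_period(s: str, d: int) -> bool:
    """True if s[i] == s[i + d] wherever both indices are valid."""
    return all(s[i] == s[i + d] for i in range(len(s) - d))


p, t = "xabababay", "ababa"
first = p.find(t)               # 1
second = p.find(t, first + 1)   # 3
d = second - first              # 2, less than half the length of t

# Two occurrences overlapping by more than half force period d on t,
# and since they are the first two, d is also the shortest period.
assert has_period(t, d)
assert shortest_period(t) == d
```

Here the two occurrences of t in p start 2 positions apart, and 2 is indeed the shortest period of "ababa".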
Consider the first occurrence of t in p, the one starting at , and see how far left and right of it we could extend t’s period of length d; that is, we need to compute the longest common suffix of the strings and , and the longest common prefix of the strings and . Now we can imagine p partitioned into , where is the maximal substring that contains that occurrence of t and has a period d.
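To make the partition concrete, here is a minimal sketch that extends the period of length d to the left and right of a given occurrence by direct character comparison. This naive scan takes time proportional to the length of the extension; the text instead obtains the same boundaries from longest-common-suffix and longest-common-prefix computations. The function name `maximal_periodic_run` is hypothetical, and the sketch assumes that the occurrence itself already has period d with 1 <= d <= length.

```python
def maximal_periodic_run(p: str, start: int, length: int, d: int):
    """Return the half-open range [lo, hi) of the longest substring of p
    that contains p[start:start + length] and has period d.

    Assumes p[start:start + length] has period d and 1 <= d <= length.
    """
    lo, hi = start, start + length
    # Extend leftwards while the new character equals the one d places
    # to its right (which already lies inside the periodic run).
    while lo > 0 and p[lo - 1] == p[lo - 1 + d]:
        lo -= 1
    # Extend rightwards while the new character equals the one d places
    # to its left.
    while hi < len(p) and p[hi] == p[hi - d]:
        hi += 1
    return lo, hi


p = "xabababay"
lo, hi = maximal_periodic_run(p, start=1, length=5, d=2)
print(p[:lo], p[lo:hi], p[hi:])  # x abababa y
```

The three printed pieces correspond to the non-periodic left part, the maximal periodic middle part, and the non-periodic right part of p.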
Similarly, in , let us see how far into u and v the middle part of t could be extended while maintaining periodicity with period d. This requires some caution: t starts with the beginning of the period, so when we extend the middle part from the right end of u leftwards, we must expect the end of the period at the right end of u. So we must compute the longest common suffix of u and something that ends with the end of t’s period, which is not necessarily t itself because its length is not necessarily a multiple of the period length. A similar consideration applies to extending the middle part from the left end of v rightwards. Let be the length of what remains of t if we restrict ourselves only to periods that occur in their entirety. Since , it follows that , so is longer than and . If we now find the longest common suffix of the strings u and and the longest common prefix of the strings v and , we obtain what we are looking for, and we do not have to worry about running out of t before u or v. From now on, let us imagine in the form , where is periodic with a period of length d.
Figure 5 shows the resulting partition of the strings and p.
We know that if p appears in , then the t in matches some occurrence of t in p, say the one at ℓ (see Figure 5); if this is not precisely the first occurrence (the one at ), then the first one can start at most characters further to the left (otherwise p would extend over the left edge of there); both occurrences of t in p (at and ℓ) overlap by more than half, so t is periodic with a period of length ; this is less than , so this period is a multiple of the base period, i.e., d. Thus, we see that only those occurrences of t in p that begin at positions of the form for integer are relevant.
Furthermore, since the occurrences of t in p at ℓ and at overlap by more than half, this overlapping part is longer than d, i.e., the base period of t. In p, the content of this overlapping part, since it lies within the first occurrence of t in p, continues with the period d to the left and right throughout , but since this overlapping part also lies within the occurrence of t at ℓ, which matches with t in , the content of this overlapping part also continues in (with the period d) to the left and right throughout . If we look at what happens in where we have the string in p, we see that if extended to the left of , we would have a contradiction, because we chose in such a way that a period of length d cannot extend further than the left edge of (if we start from t in ). For a similar reason, cannot extend to the right of ; it must therefore remain within .
We said that t starts at in p and matches with t in ; so p starts at in , and starts at in . From the condition that must not extend to the left of , we get , and since it must not extend to the right of , we get . In addition, we still expect that p starts within u, so , and that it ends within v, so . From these inequalities, we can now determine the smallest and largest acceptable value of ( and ).
If , we can immediately conclude that there are no relevant occurrences of p. Otherwise, if and are empty strings, then and for every (from to ) lies within , so there are no mismatches; then we have occurrences.
If is not empty, we can check the case separately (i.e., we check whether p appears in starting at ; we need at most two calls of lcp). What about larger ? If we go from to some larger , the string p moves by d or a multiple of d characters to the left relative to ; this moves (i.e., the first character of the string ) so far to the left that its corresponding character of (with which the first character of the string will have to match if p is to occur at the current position in ) belongs to the periodic part (i.e., ). In addition, recall that (if the new is still valid at all, i.e., ) the string also overlaps with by more than one whole period, so the character is still in such a position that its corresponding character in is in the periodic part of ; so here we have two characters in that are exactly d places apart, hence they are equal. The first character of cannot be equal to the character d places to the left of it, because otherwise the periodic part of p, i.e., , could be expanded even further to the right and this character would not belong to but instead to . We see that it is impossible that, in the new position of p, the character and the one d places to the left of it would both match the corresponding characters of ; there will be a mismatch in at least one place, so p cannot occur there. Thus, we do not need to consider the case at all.
The remaining possibility is that is empty, but is not empty. An analogous analysis to the previous paragraph tells us that it is sufficient to check whether p occurs in at , and we do not need to consider smaller values of . Thus, step 7 of the GG algorithm can be described with the pseudocode of Algorithm 10.
Algorithm 10 Occurrences of p in the shortened string .
positions of the first two occurrences of t in p;
if there is only one occurrence then return 1 or 0, depending on whether p occurs in at position or not;
; if then return 0;
partition p into , where is the maximal substring that contains the first occurrence of t in p and has period d;
partition into , where is the maximal substring containing t between u and v and has period d;
compute and from the aforementioned inequalities;
if then return 0
else if is nonempty then return 1 or 0, depending on whether p appears in at position or not
else if is nonempty then return 1 or 0, depending on whether p appears in at position or not
else return ;
All of these things can be prepared in advance (, using the table ) or computed in time (with a small constant number of computations of the longest common prefix or suffix).
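When reimplementing this step, a brute-force reference may be useful for testing. The sketch below simply counts, by direct comparison, the occurrences of p in the concatenation of u, t and v that start within u and end within v. It is not the constant-time procedure of Algorithm 10, only a slow oracle against which such a procedure could be validated; the function name `count_spanning_occurrences` is hypothetical.

```python
def count_spanning_occurrences(p: str, u: str, t: str, v: str) -> int:
    """Count occurrences of p in u + t + v that start within u and end
    within v (and therefore cover all of t). Brute force, for testing."""
    w = u + t + v
    count = 0
    for start in range(len(u)):          # first character lies in u
        end = start + len(p)             # one past the last character
        ends_in_v = len(u) + len(t) < end <= len(w)
        if ends_in_v and w.startswith(p, start):
            count += 1
    return count


print(count_spanning_occurrences("babab", "ab", "ab", "ab"))    # 1
print(count_spanning_occurrences("aaaaaa", "aaa", "aa", "aaa"))  # 3
```

The second call shows the fully periodic situation, in which every admissible alignment of p is an occurrence.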
2.6.3. Finding the Next Candidate Prefix (or Suffix)
In this subsection we describe step 3 of Algorithm 8 in more detail. The purpose of that step is to answer, in time, up to queries (one for each position k where the string t may be inserted into s) of the form “given a prefix u of p, find the longest suffix of u that is shorter than and is also a prefix of p”. We may describe such a query with a pair , where is the length of the original prefix and is the maximum length of the new shorter prefix that we are looking for. Ganardi and Gawrychowski showed that this problem can be reduced to a weighted-ancestor problem in the suffix tree of ([10], Lemma 2.2), for which they then provided a linear-time solution ([10], Section 4). However, this solution is quite complex and in the present subsection we will present a simpler alternative which avoids the need for weighted-ancestor queries.
The longest suffix of that is also a prefix of p has the length , the second longest has the length , and so on. We could examine this sequence until we reach an element that is ; that would be the answer to the query . We can also imagine this procedure as climbing the tree (first introduced in Section 2.3), starting from the node and proceeding upwards towards the root until we reach a node with the value . However, we can save time by observing that several of these paths up the tree, for different queries, sometimes meet in the same node and thenceforth always move together.
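The straightforward climbing procedure can be sketched as follows. We assume here that a query is a pair (i, bound) and that its answer is the first value not exceeding bound on the chain i, fail[i], fail[fail[i]], ..., where fail is the usual KMP failure table; the names `failure_table` and `answer_query_naive` are hypothetical. A single query may take time linear in the length of p, which is exactly the cost that the batched approach avoids.

```python
def failure_table(p: str) -> list:
    """fail[i] = length of the longest proper border of p[:i] (KMP)."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def answer_query_naive(fail: list, i: int, bound: int) -> int:
    """Climb from node i towards the root (node 0) of the tree whose
    parent links are given by fail, stopping at the first node <= bound.
    Assumes 0 <= i < len(fail) and bound >= 0."""
    v = i
    while v > bound:
        v = fail[v]      # strictly decreases, so the loop terminates
    return v


fail = failure_table("abacaba")           # [0, 0, 0, 1, 0, 1, 2, 3]
print(answer_query_naive(fail, 7, 2))     # 1  (path 7 -> 3 -> 1)
print(answer_query_naive(fail, 7, 3))     # 3  (path 7 -> 3)
```

Both example queries start at node 7 and share the edge from 7 to 3, which illustrates how paths for different queries merge.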
If several paths reach some node v and do not end there (because the bounds of those queries are ), all these paths will continue into v’s parent and will never separate again, regardless of where in v’s subtree they started. Thus, in a certain sense, we no longer need to distinguish between the nodes w, v, and v’s descendants, as the result of any query with is the same regardless of which of these nodes its path begins in. We can imagine these nodes as having merged into one, namely into v’s parent w.
To follow the paths up the tree, we will visit the nodes of the tree in decreasing order, from to 0. Upon reaching node v, we can answer all queries whose bound is exactly , and then we can merge v with its parent, thereby taking into account the fact that all paths that reach v and do not end there will proceed to v’s parent. For every node w that has not yet been merged with its parent, we maintain a set of all nodes that have already been merged into w; at the start of the procedure, each node forms a singleton set by itself. The pseudocode of this approach is shown in Algorithm 11.
Algorithm 11 Answering a batch of queries in the KMP-tree .
1 ;
2 for downto 0:
3     for each query having :
4         the answer to this query is the smallest element of that member of F which contains ;
5     merge, in F, the set containing v and the set containing ;
The following invariant holds at the start of each iteration of the main loop: F contains one set for each node , and this set contains w as well as those children of w (in the tree ) that are greater than v, and all the descendants of those children. In each node has a smaller value than its children, therefore w is the smallest member of . If some query has and , this means that is a descendant of w and that all the nodes on the path from to w, except w itself, are greater than v; therefore the path for this query will rise from to w and then stop; thus step 4 of our algorithm is correct in reporting w as the answer to this query. Step 5 ensures that the loop invariant is maintained into the next iteration. Step 3 requires us to sort the queries by , which can be achieved in linear time using counting sort.
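A minimal Python sketch of this batched procedure is given below. It assumes that queries are pairs (i, bound) with 0 <= i, bound <= len(p), that the parent of node v > 0 in the tree is fail[v], and that the root (node 0) is never merged. For simplicity it uses an ordinary disjoint-set forest with path compression only, which is not the linear-time structure discussed in the text; the function names are hypothetical.

```python
def failure_table(p: str) -> list:
    """fail[i] = length of the longest proper border of p[:i] (KMP)."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def answer_queries(p: str, queries: list) -> list:
    """Answer all (i, bound) queries in one sweep over the nodes."""
    m = len(p)
    fail = failure_table(p)
    parent = list(range(m + 1))     # union-find forest, singletons
    smallest = list(range(m + 1))   # smallest member, valid at set roots

    def find(x: int) -> int:
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:    # path compression
            parent[x], x = root, parent[x]
        return root

    # Bucket the queries by bound (a counting sort).
    by_bound = [[] for _ in range(m + 1)]
    for idx, (i, bound) in enumerate(queries):
        by_bound[bound].append((idx, i))

    answers = [None] * len(queries)
    for v in range(m, -1, -1):
        for idx, i in by_bound[v]:
            answers[idx] = smallest[find(i)]
        if v > 0:                   # the root has no parent to merge into
            a, b = find(v), find(fail[v])
            parent[a] = b
            smallest[b] = min(smallest[a], smallest[b])
    return answers


print(answer_queries("abacaba", [(7, 2), (7, 3), (6, 5), (7, 0)]))
# [1, 3, 2, 0]
```

Answers are returned in the order in which the queries were supplied, independently of the order in which the sweep resolves them.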
The time complexity of this algorithm depends on how we implement the union-find data structure F. The traditional implementation using a disjoint-set forest results in a time complexity of for Algorithm 11 [18], and hence for the solution of our problem as a whole. Here is the inverse Ackermann function, which grows very slowly, making this solution almost linear. However, to obtain a truly linear solution, we can use the slightly more complex static tree set union data structure due to Gabow and Tarjan [19,20], which assumes that the elements are arranged in a tree and that the union operation is always performed between a set containing some node of the tree and the set containing its parent; hence this structure is perfectly suited to our needs. (A minor note to anyone intending to reimplement their approach: ([20], p. 212) uses the macrofind operation in step 7 of find, whereas ([19], p. 248) uses microfind; the latter is correct while the former can lead to an infinite loop. Moreover it may be useful to point out that must return, as the “name” of the macroset containing x, not an arbitrary member of that set, but the member which is closest to the root of the tree.) Algorithm 11 then runs in time and our solution as a whole in time.