2.2. General Approach
We will solve the problem if we count, for each k, the number of occurrences of p in the string $s_{0..k-1}\,t\,s_{k..}$, which arises from inserting t into s at position k. (Here $x_{a..b}$ denotes the substring of a string x from index a through index b, and $x_{a..}$ the suffix of x starting at index a; strings are indexed from 0.) These occurrences can be divided into those that
- (I) lie entirely within $s_{0..k-1}$ or entirely within $s_{k..}$;
- (II) lie entirely within t;
- (III) start within $s_{0..k-1}$ and end within t;
- (IV) start within t and end within $s_{k..}$; and
- (V) start within $s_{0..k-1}$ and end within $s_{k..}$, and in between extend over the entire t.

Each of these five types of occurrences will be considered separately and the results added up at the end. Of course, type (II) comes into play only if $|t| \ge |p|$, and type (V) only if $|t| \le |p| - 2$.
(I) Suppose that p appears at position i in s, that is, $s_{i..i+|p|-1} = p$. This occurrence lies entirely within $s_{0..k-1}$ if $i + |p| \le k$, therefore $i \le k - |p|$; there are as many such occurrences as there are occurrences of p in s starting at positions from 0 to $k - |p|$, which we can read off a precomputed table of prefix counts.
However, if we want the occurrence to lie entirely within $s_{k..}$, that means that $i \ge k$; such occurrences can be counted by taking the number of all occurrences of p in s and subtracting those that start at positions from 0 to $k - 1$. Thus, we get the count of type (I) occurrences for every k from a single linear-time pass over s.
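To make this concrete, here is a small Python sketch of the type (I) counting (the function and table names are ours, not the paper's): occurrences of p in s are found with the Knuth–Morris–Pratt algorithm, accumulated into a prefix-count table, and both counts are then read off for every insertion position k.

```python
def kmp_failure(p):
    """fail[i] = length of the longest proper prefix of p[:i] that is
    also a suffix of p[:i] (the table P_p of the text)."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail

def count_type_I(s, p):
    """For every insertion position k, the number of occurrences of p
    lying entirely within s[:k] or entirely within s[k:]."""
    fail, starts, m = kmp_failure(p), [], 0
    for i, c in enumerate(s):                 # KMP search for p in s
        while m > 0 and c != p[m]:
            m = fail[m]
        if c == p[m]:
            m += 1
        if m == len(p):
            starts.append(i - len(p) + 1)
            m = fail[m]
    # prefix_cnt[x] = number of occurrences starting at positions < x
    prefix_cnt = [0] * (len(s) + 2)
    for pos in starts:
        prefix_cnt[pos + 1] += 1
    for x in range(1, len(prefix_cnt)):
        prefix_cnt[x] += prefix_cnt[x - 1]
    res = []
    for k in range(len(s) + 1):
        left = prefix_cnt[max(0, k - len(p) + 1)]   # starts at i <= k - |p|
        right = len(starts) - prefix_cnt[k]         # starts at i >= k
        res.append(left + right)
    return res
```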
(II) The second type consists of all occurrences of p in t; their number is the same for every k, so we count them once in advance and reuse the result at every k.
(III) The third type consists of occurrences of p that begin within $s_{0..k-1}$ and end within t. Suppose that the left j characters of p lie in $s_{0..k-1}$ and the rest in t; thus $s_{0..k-1}$ ends with $p_{0..j-1}$, and t starts with $p_{j..}$. We already know that the longest prefix of p appearing at the end of $s_{0..k-1}$ is some number of characters long, say i; for the purposes of our discussion, we can therefore replace $s_{0..k-1}$ with $p_{0..i-1}$.
Let $D[i]$ be the number of those occurrences of p in $p_{0..i-1}\,t$ that start within the first part, i.e., in $p_{0..i-1}$. One possibility is that such an occurrence starts at the beginning of the string (in this case t must start with $p_{i..}$); the other option is that it starts a bit later, so that only the first $j < i$ characters of our occurrence lie in $p_{0..i-1}$. Therefore $p_{0..i-1}$ ends with $p_{0..j-1}$; and we already know that the next position $j < i$ for which this condition is satisfied is $P_p[i]$, the length of the longest proper prefix of $p_{0..i-1}$ that is also its suffix, as computed by the classic Knuth–Morris–Pratt preprocessing. The table D is prepared in advance with Algorithm 1 (with a linear time complexity), and then at each k we know that $D[i]$ occurrences of p start within $s_{0..k-1}$ and end within t.
In the above procedure, we need to be able to quickly check whether t starts with $p_{i..}$ for a given i. We will use tables that were prepared in advance. One of them tells us the index u for which $p_{u..}$ is the longest suffix of p that occurs at the beginning of t; the second longest such suffix, the third one, and so on, are obtained by following ever larger indices u with this property. We therefore prepare the table E (with Algorithm 2), which tells us, for each possible index j, whether $p_{j..}$ occurs at the beginning of t. Hence, the condition "if t starts with $p_{i..}$" in our procedure for computing table D can now be checked by looking at the value of $E[i]$.
        
Algorithm 1 Computation of the table D—occurrences of p in $p_{0..i-1}\,t$ that start within the first part

    D[0] := 0;
    for i := 1 to |p|:
        D[i] := D[$P_p[i]$];    (* occurrences that have fewer than i characters in $p_{0..i-1}$ *)
        if t starts with $p_{i..}$ (i.e., if E[i]) then
            D[i] := D[i] + 1;    (* the occurrence of p at the start of $p_{0..i-1}\,t$ *)

Algorithm 2 Computation of the table E—does $p_{j..}$ occur at the start of t?

    for j := 0 to |p| do E[j] := false;
    j := the smallest index such that $p_{j..}$ occurs at the beginning of t;
    while j < |p|:
        E[j] := true; j := the next larger index with this property;
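A Python sketch corresponding to Algorithms 1 and 2 as reconstructed above (reusing kmp_failure from the earlier sketch; to keep the sketch short, E is filled with a naive quadratic prefix test instead of the linear-time chain of suffixes):

```python
def tables_E_and_D(p, t):
    """E[j]: does the nonempty suffix p[j:] occur at the beginning of t?
    D[i]: occurrences of p in p[:i] + t that start within p[:i]."""
    E = [False] * (len(p) + 1)       # E[|p|] stays False: an occurrence with
    for j in range(len(p)):          # an empty tail would not end inside t
        E[j] = t.startswith(p[j:])
    fail = kmp_failure(p)            # the table P_p
    D = [0] * (len(p) + 1)
    for i in range(1, len(p) + 1):
        # D[P_p[i]] counts the occurrences with fewer than i characters
        # in p[:i]; E[i] adds the occurrence of p at the very start.
        D[i] = D[fail[i]] + (1 if E[i] else 0)
    return E, D
```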
(IV) The fourth type consists of the occurrences of p that start within t and end within $s_{k..}$. These can be counted using the same procedure as for (III), except that we reverse all three strings.
(V) We are left with those occurrences of p that start within $s_{0..k-1}$, end within $s_{k..}$, and contain the entire t in between. This means that $s_{0..k-1}$ must end with some prefix of p, and we already know that the longest such prefix is i characters long (with i as in our discussion of type (III)); similarly, $s_{k..}$ must start with some suffix of p, and the longest such suffix is $|p| - j$ characters long, where j is the index at which it starts in p. Instead of the strings $s_{0..k-1}$ and $s_{k..}$ we can focus on $p_{0..i-1}$ and $p_{j..}$ from here on. (These i and j can of course be different for different k, and where relevant we will refer to them as $i_k$ and $j_k$.) We will present several ways of counting the occurrences of p in strings of the form $p_{0..i-1}\,t\,p_{j..}$, from simpler and less efficient to more efficient, but more complicated. Everything we have presented so far regarding types (I) to (IV) had linear time and space complexity, $O(|s| + |t| + |p|)$, so the computational complexity of the solution as a whole depends mainly on how we solve the task for type (V).
  2.3. Quadratic Solution
Let $f(i, j)$ be the number of those occurrences of p in the string $p_{0..i-1}\,t\,p_{j..}$ that start within $p_{0..i-1}$ and end within $p_{j..}$. This function is interesting for $i < |p|$ and $j > 0$ (so that the first and third part, $p_{0..i-1}$ and $p_{j..}$, are shorter than p). Edge cases are $f(0, j) = f(i, |p|) = 0$, because then the first or third part is empty and therefore p cannot overlap with it.
Let us now consider the general case. One occurrence of p in $p_{0..i-1}\,t\,p_{j..}$ may already appear at the beginning of this string; and the rest must begin later, such that within the first part of our string, $p_{0..i-1}$, they only have some shorter prefix, say $p_{0..i'-1}$ with $i' < i$, which is therefore also a suffix of the string $p_{0..i-1}$. The longest such prefix is of length $P_p[i]$; all these later occurrences therefore lie not only within $p_{0..i-1}\,t\,p_{j..}$, but also within $p_{0..P_p[i]-1}\,t\,p_{j..}$, so there are $f(P_p[i], j)$ of them. We have thus obtained a recurrence that is the core of Algorithm 3 for calculating all possible values of the function f using dynamic programming.
        
Algorithm 3 A quadratic dynamic programming solution for counting occurrences of p in $p_{0..i-1}\,t\,p_{j..}$

    1    for i := 1 to |p| − 1 do for j := 1 to |p|:
    2        f(i, j) := 0;    (* Does p occur at the start of $p_{0..i-1}\,t\,p_{j..}$? *)
    3        if t occurs in p at position i (i.e., if $p_{i..i+|t|-1} = t$) then
    4            if $i + |t| < |p|$ and $p_{j..}$ starts with $p_{i+|t|..}$ then
    5                f(i, j) := 1;
    6        (* Add later occurrences. *)
    7        f(i, j) := f(i, j) + f($P_p[i]$, j);
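The recurrence can be rendered directly in Python as a memoised function (a sketch: the start-of-string test is done by plain slicing instead of the table and tree machinery developed below, and kmp_failure is the sketch from earlier):

```python
from functools import lru_cache

def make_f(p, t):
    """f(i, j) = occurrences of p in p[:i] + t + p[j:] that start within
    p[:i] and end within p[j:] (the quadratic recurrence of Algorithm 3)."""
    fail = kmp_failure(p)

    @lru_cache(maxsize=None)
    def f(i, j):
        if i == 0 or j == len(p):
            return 0                        # empty first or third part
        occ_at_start = 0
        a = i + len(t)                      # end of t inside p, if aligned
        if p[i:a] == t and a < len(p) and p[j:].startswith(p[a:]):
            occ_at_start = 1                # p occupies p[:i], t, p[a:]
        return occ_at_start + f(fail[i], j) # plus the later occurrences
    return f
```

For example, make_f(p, t)(i, j) evaluates single values on demand; filling the whole table reproduces the quadratic behaviour analysed below.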
Before the procedure for calculating the function f is actually useful, we still need to consider some important details. We check the condition in line 3 by looking up, in a table prepared in advance, whether t occurs in p at position i. In line 4, the condition $i + |t| < |p|$ checks whether an occurrence of p that started at the beginning of the string $p_{0..i-1}\,t\,p_{j..}$ would even reach $p_{j..}$ at the end, because we are only interested in such occurrences; those that lie entirely within $p_{0..i-1}\,t$ have already been considered under type (III) in Section 2.2. Next, we have to check, in line 4, whether some suffix of p begins with some shorter suffix of p. Recall that the longest $p_{u..}$ that occurs at the beginning of the string $p_{j..}$ is $p_{j..}$ itself, with $u = j$; the next ones are the ever shorter suffixes of p that are also prefixes of $p_{j..}$, and so on. In line 4 we are essentially checking whether this sequence of shorter suffixes eventually reaches $p_{i+|t|..}$. For this purpose we can use a tree structure, which we will also need later in more efficient solutions. Let us construct a tree $T_S$ in which there is a node for each u from 1 to $|p|$; the root of the tree is the node $|p|$ (recall that the number in the suffix notation, such as $p_{u..}$, represents the position or index at which this suffix begins; therefore, shorter suffixes have larger indices, and the largest among them is $u = |p|$, which represents the empty suffix), and for every $u < |p|$, let node u be a child of the node u′ for which $p_{u'..}$ is the longest suffix of p that is also a proper prefix of $p_{u..}$. Line 4 now actually asks whether node $i + |t|$ is an ancestor of node j in the tree $T_S$.
This can be checked efficiently if we do a preorder traversal of all nodes in the tree—that is, the order in which we list the root first, then recursively add each of its subtrees (Algorithm 4). For each node u, we remember its position in that order (say $\sigma[u]$) and the position of the last of its descendants (say $\tau[u]$). Then the descendants of u are exactly those nodes that are located in the order at indices from $\sigma[u]$ to $\tau[u]$. To check whether some $p_{u..}$ is a prefix of $p_{j..}$, we only have to check if $\sigma[u] \le \sigma[j] \le \tau[u]$.
(Another way to check whether $p_{u..}$ is a prefix of $p_{j..}$ is with the Z-algorithm ([7], pp. 7–10). For any string w, we can prepare a table $Z_w$ in $O(|w|)$ time, where the element $Z_w[i]$ (for $i \ge 1$) tells us the length of the longest common prefix of the strings w and $w_{i..}$. The question we are interested in—i.e., whether $p_{u..}$ is a prefix of $p_{j..}$—is equivalent, after reversing all the strings involved, to asking whether a certain prefix of the reversed pattern $\bar{p}$ reappears at offset $u - j$, and hence to the question whether $\bar{p}$ and $\bar{p}_{u-j..}$ match in the first $|p| - u$ characters, i.e., whether $Z_{\bar{p}}[u-j] \ge |p| - u$. However, for our purposes the tree-based solution is more useful than the one using the Z-algorithm, as we will later use the tree in solutions with a lower time complexity than the quadratic one that we are describing now.)
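As an illustration, here is a Python sketch of the Z-array together with the prefix-of-suffix test; phrasing the test through the reversed pattern is our own formulation of the idea sketched above:

```python
def z_array(w):
    """z[i] = length of the longest common prefix of w and w[i:]."""
    n = len(w)
    if n == 0:
        return []
    z = [0] * n
    z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and w[z[i]] == w[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def suffix_prefix_checker(p):
    """Returns a test telling whether p[u:] is a prefix of p[j:] for
    u >= j, via the Z-array of the reversed pattern."""
    zr, n = z_array(p[::-1]), len(p)
    def is_prefix(u, j):
        if u == n or u == j:
            return True              # the empty suffix, or p[j:] itself
        # Reversed, the question becomes whether the reversed pattern and
        # its suffix at offset u - j share a prefix of length >= n - u.
        return zr[u - j] >= n - u
    return is_prefix
```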
Our procedure so far (Algorithm 3) has computed $f(i, j)$ by checking whether p occurs at the beginning of $p_{0..i-1}\,t\,p_{j..}$ and handling subsequent occurrences via $f(P_p[i], j)$. Of course, we could also go the other way: we would first check whether p occurs at the end (instead of the beginning) of $p_{0..i-1}\,t\,p_{j..}$ and then add earlier occurrences via $f(i, j')$, where $p_{j'..}$ is the longest suffix of p that is also a proper prefix of $p_{j..}$ (that is, j′ is the parent of j in the tree $T_S$). To check if p occurs at the end of $p_{0..i-1}\,t\,p_{j..}$, we should first check whether t occurs in p at index $j - |t|$, i.e., spanning indices from $j - |t|$ to $j - 1$ (with the same table as before in line 3), and then we would be interested in whether $p_{0..i-1}$ ends with $p_{0..j-|t|-1}$. Here, it would be helpful to construct a second tree, which would have one node for each u from 0 to $|p| - 1$; node 0 would be the root, and for every $u > 0$, the node u would be a child of node $P_p[u]$. Using the procedure Preorder on this tree would give us tables $\sigma'$ and $\tau'$. We should then check whether node $j - |t|$ is an ancestor of node i in the tree.
        
| Algorithm 4 Tree traversal for determining the ancestor–descendant relationships | 
| procedure Traverse (tree T, node u,       tables  and , index i):    ; ;    for every child v of node u in tree T:        Traverse(T, v, σ, τ, i);    ; return i; | 
| procedure Preorder(input: tree T;       output: tables  and ):    let  and  be tables with indices from 1 to ;    Traverse(T, root of T, σ, τ, 0);    return ; main call:  Preorder(TS); | 
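A Python sketch of Algorithm 4; we represent a tree by a parent table (an assumption of this sketch) and traverse iteratively to avoid recursion-depth limits. For the tree of prefixes of p one would call preorder_tables(P_p, 0); for $T_S$ the node indices 1..|p| would first be shifted to start at 0.

```python
def preorder_tables(parent, root):
    """sigma[u] = position of u in a preorder of the tree; tau[u] =
    position of the last descendant of u.  Then u is a descendant of w
    (or w itself) exactly when sigma[w] <= sigma[u] <= tau[w]."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for u in range(n):
        if u != root:
            children[parent[u]].append(u)
    sigma, tau, counter = [0] * n, [0] * n, 0
    stack = [(root, False)]
    while stack:
        u, done = stack.pop()
        if done:
            tau[u] = counter       # last preorder number used in u's subtree
            continue
        counter += 1
        sigma[u] = counter
        stack.append((u, True))
        for v in reversed(children[u]):
            stack.append((v, False))
    return sigma, tau
```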
Thus, we see that we can compute $f(i, j)$ quickly, in $O(1)$ time, either from $f(P_p[i], j)$ or from $f(i, j')$; both ways will come in handy later.
Our current procedure for computing the function f would take $O(|p| + |t|)$ time to prepare the two trees and the tables σ, τ, σ′ and τ′, followed by $O(1)$ time to calculate the value of $f(i, j)$ for each of the $O(|p|^2)$ possible pairs $(i, j)$. This solution has a time complexity of $O(|s| + |t| + |p|^2)$.
  2.4. Distinguished Nodes
We do not really need the values of $f(i, j)$, which we calculated in the quadratic solution, for all pairs $(i, j)$, but only for one such pair $(i_k, j_k)$ for each k (i.e., each possible position where t can be inserted into s), which is only $O(|s|)$ values and not all $O(|p|^2)$. We will show how to compute those values of the function f that we really need, without computing all the others.
Let us traverse the tree of prefixes of p (the second tree from Section 2.3, with parent links $P_p$) from the bottom up (from leaves towards the root) and compute the sizes of the subtrees; whenever we encounter a node u with its subtree (let us call it $T_u$; it consists of u and all its descendants) of size at least $\sqrt{|p|}$ nodes, we mark u as a distinguished node and cut off the subtree $T_u$. Cutting it off means we will not count the nodes in it when we consider subtrees of u's ancestors higher up in the tree. When we finally reach the root, we mark it as well, regardless of how many nodes are still left in the tree at that time. Algorithm 5 calculates, for each node u, its nearest marked ancestor $M[u]$; if u itself is marked, then $M[u] = u$.
Now we have at most $\sqrt{|p|}$ marked nodes (because every time we marked a node, we also cut off at least $\sqrt{|p|}$ nodes from the tree, and the number of all nodes is $|p|$); and each node is less than $\sqrt{|p|}$ steps away from its nearest marked ancestor (if it were $\sqrt{|p|}$ or more steps away, some lower-lying ancestor of this node would have a subtree with at least $\sqrt{|p|}$ nodes, and would therefore get marked and its subtree cut off).
        
Algorithm 5 Marking distinguished nodes

    procedure Mark(u):
        (* Variable N counts the number of nodes in $T_u$ reachable through unmarked descendants. *)
        N := 1;
        for every child v of node u:
            N := N + Mark(v);
        if $N \ge \sqrt{|p|}$ or $u = 0$ then
            M[u] := u; return 0;    (* Mark u. *)
        else
            M[u] := −1; return N;    (* Unmarked u. *)

    main call:
        Mark(0);    (* Start at the root—node 0. *)
        (* Pass the information about the nearest marked ancestor down the tree
           (from parents to children) to all unmarked nodes. *)
        for u := 1 to |p| − 1 do
            if M[u] < 0 then M[u] := M[$P_p[u]$];
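A Python sketch of Algorithm 5, again over a parent-table tree and iterative rather than recursive; a caller would pass threshold of about $\sqrt{|p|}$, e.g. mark_distinguished(P_p, 0, max(1, int(len(p) ** 0.5))).

```python
def mark_distinguished(parent, root, threshold):
    """M[u] = nearest marked ancestor of u (M[u] = u for marked nodes)."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for u in range(n):
        if u != root:
            children[parent[u]].append(u)
    order, stack = [], [root]
    while stack:                       # any parents-before-children order
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])
    M = [-1] * n
    size = [1] * n    # nodes of T_u reachable through unmarked descendants
    for u in reversed(order):          # children before parents
        for v in children[u]:
            if M[v] < 0:               # marked children were cut off
                size[u] += size[v]
        if size[u] >= threshold or u == root:
            M[u] = u                   # mark u
    for u in order:                    # pass the answer down to unmarked nodes
        if M[u] < 0:
            M[u] = M[parent[u]]
    return M
```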
We know that $f(i, |p|) = 0$ for every i. For each marked i, we can compute all values of $f(i, j)$ for $j = |p|, |p| - 1, \ldots, 1$, each in $O(1)$ time from an already-computed value $f(i, j')$ in the same row; since there are at most $\sqrt{|p|}$ marked nodes, this takes $O(|p|\sqrt{|p|})$ time. Let us describe how we can answer each of the $O(|s|)$ queries $f(i_k, j_k)$ that we are interested in. The nearest marked ancestor of node $i_k$ is $M[i_k]$; since it is marked, we already know $f(M[i_k], j)$ for all j, i.e., also for $j = j_k$, and since $i_k$ lies in the tree of prefixes at most $\sqrt{|p|}$ steps below node $M[i_k]$, we can calculate $f(i_k, j_k)$ from $f(M[i_k], j_k)$ in at most $\sqrt{|p|}$ steps. Since we have to do this for each query, it takes overall $O(|s|\sqrt{|p|})$ time. To save space, we can solve queries in groups according to $M[i_k]$; after we process all queries that share the same marked node u, we can forget the results $f(u, \cdot)$. Algorithm 6 presents the pseudocode of this approach to counting the occurrences of type (V).
        
Algorithm 6 Rearranging queries

    (* Group queries $f(i_k, j_k)$ based on the nearest marked ancestor. *)
    for u := 0 to |p| − 1 do
        if M[u] = u then Q[u] := empty list;
    for k := 1 to |s| − 1 do add k to Q[$M[i_k]$];
    (* Process marked nodes u. *)
    for u := 0 to |p| − 1 do if M[u] = u:
        (* Compute f(u, j) for all j and store it as R[j]. *)
        R[|p|] := 0;
        for j := |p| − 1 downto 1 do
            in R[j] store f(u, j), which is computed from f(u, j′), which is stored in R[j′]
            (where j′ is the parent of j in the tree $T_S$);
        (* Answer queries $f(i_k, j_k)$, where u is the nearest marked ancestor of $i_k$. *)
        for each k in Q[u]:
            (* Prepare the path in the tree from $i_k$ to its ancestor u. *)
            S := empty stack; i := $i_k$;
            while i ≠ u: push i to S and assign i := $P_p[i]$;
            (* Compute $f(i, j_k)$ for all nodes i on the path from u to $i_k$. *)
            r := R[$j_k$];    (* i.e., $f(u, j_k)$ *)
            while S not empty:
                i := node popped from S;
                from $f(P_p[i], j_k)$, currently stored in r, compute $f(i, j_k)$ and store it in r;
            (* r is now equal to $f(i_k, j_k)$, i.e., the answer to query k, which is the number
               of occurrences of p in $p_{0..i_k-1}\,t\,p_{j_k..}$ that start within $p_{0..i_k-1}$
               and end within $p_{j_k..}$. *)
Everything we did to count the occurrences of types (I) to (IV) had a linear time complexity in terms of the length of the input strings, so the overall time complexity of this solution is $O(|t| + (|s| + |p|)\sqrt{|p|})$ with an $O(|s| + |t| + |p|)$ space complexity. (Note that this time complexity can be worse than the quadratic solution in cases where $|s| \gg |p|\sqrt{|p|}$.)
  2.5. Geometric Interpretation
We want to count those occurrences of p in $p_{0..i-1}\,t\,p_{j..}$ that start within $p_{0..i-1}$ and end within $p_{j..}$ (Figure 1). Such an occurrence aligns the inserted t with the characters $p_{\ell..\ell+|t|-1}$ of p for some ℓ, where $0 < \ell$ and $\ell + |t| < |p|$, and the following three conditions apply: (a) $p_{0..\ell-1}$ must be a suffix of $p_{0..i-1}$, (b) $p_{\ell+|t|..}$ must be a prefix of $p_{j..}$, and (c) t must appear in p as a substring starting at index ℓ, i.e., $p_{\ell..\ell+|t|-1} = t$.
To check (c), we prepare in advance a table of the positions at which t occurs in p and simply look ℓ up in it.
Let us now consider the condition (a), i.e., that $p_{0..\ell-1}$ must be a suffix of $p_{0..i-1}$. We know that the longest prefix of p that is also a suffix of $p_{0..i-1}$ is $P_p[i]$ characters long; the second longest is $P_p[P_p[i]]$ characters long, and so on. The condition that $p_{0..\ell-1}$ must be a suffix of $p_{0..i-1}$ essentially asks whether ℓ occurs somewhere in the sequence $i, P_p[i], P_p[P_p[i]], \ldots$. This is equivalent to asking whether ℓ is an ancestor of i in the tree of prefixes, which, as we have already seen, can be checked with the condition $\sigma'[\ell] \le \sigma'[i] \le \tau'[\ell]$.
Similar reasoning holds for condition (b), i.e., that $p_{\ell+|t|..}$ must be a prefix of $p_{j..}$, except that now instead of the table $P_p$ and the tree of prefixes we use the tree $T_S$ of suffixes with its preorder tables σ and τ. The condition (b) asks whether node $\ell + |t|$ is an ancestor of j in the tree $T_S$, which can be checked with $\sigma[\ell+|t|] \le \sigma[j] \le \tau[\ell+|t|]$.
The two inequalities we have thus obtained,
$$\sigma'[\ell] \le \sigma'[i] \le \tau'[\ell] \quad \text{and} \quad \sigma[\ell+|t|] \le \sigma[j] \le \tau[\ell+|t|],$$
have a geometric interpretation which will prove useful: if we think of $\sigma'[i]$ and $\sigma[j]$ as the x- and y-coordinates, respectively, of a point on the two-dimensional plane, then the two inequalities require the x-coordinate to lie within a certain range and the y-coordinate to lie within a certain other range; in other words, the point must lie within a certain rectangle.
Recall that i and j depend on the position k where the string t has been inserted into s; therefore, to avoid confusion, we will now denote them by $i_k$ and $j_k$. We will refer to the point from the previous paragraph as $v_k = (\sigma'[i_k], \sigma[j_k])$, and to the rectangle as $R_\ell = [\sigma'[\ell], \tau'[\ell]] \times [\sigma[\ell+|t|], \tau[\ell+|t|]]$. Thus, we now see that p occurs in $p_{0..i_k-1}\,t\,p_{j_k..}$ such that the occurrence of t at position ℓ of p aligns with the occurrence of t at position k of $s_{0..k-1}\,t\,s_{k..}$ if and only if $v_k$ lies in $R_\ell$. Actually we are not interested in whether a match occurs for a particular k and ℓ, but in how many occurrences there are for a particular k; in other words, for each k (from 1 to $|s| - 1$) we want to know how many rectangles $R_\ell$ contain the point $v_k$. Here ℓ ranges over all positions where t appears as a substring in p (i.e., where $p_{\ell..\ell+|t|-1} = t$); by limiting ourselves to such ℓ, we also take care of condition (c). Thus, we have $O(|s|)$ points and $O(|p|)$ rectangles, and for each point we are interested in how many of the rectangles contain it.
We have thus reduced our string-searching problem to a geometrical problem, which turns out to be useful because the latter problem can be solved efficiently using a well-known computational geometry approach called a plane sweep [8]. Imagine moving a vertical line across the plane from left to right (x-coordinates—i.e., the values from the σ′ table—go from 1 to $|p|$) and let us maintain some data structure with information about which y-coordinate intervals are covered by those rectangles that are present at the current x-coordinate. When we reach the left edge of a rectangle during the plane sweep, we must insert it into this data structure, and when we reach its right edge, we remove it from the data structure. Because the y-coordinates (i.e., the values from the σ table) in our case also take values from 1 to $|p|$, a suitable data structure is, for example, the Fenwick tree [9]. Imagine a table h in which the element $h[y]$ stores the difference between the number of rectangles (from those present at the current x-coordinate) that have their lower edge at y, and the number of those that have their upper edge at $y - 1$. Then the sum $h[1] + h[2] + \cdots + h[y]$ is equal to the number of rectangles that (at the current x) cover the point $(x, y)$. The Fenwick tree allows us to compute such sums in $O(\log |p|)$ time and also make updates of the tree with the same time complexity, when any of the values $h[y]$ changes. Algorithm 7 describes this plane sweep with pseudocode.
        
Algorithm 7 Plane sweep

    let F be a Fenwick tree with all values $h[1..|p|+1]$ initialized to 0;
    for x := 1 to |p|:
        for every rectangle $[x_1, x_2] \times [y_1, y_2]$ with the left edge at $x_1 = x$:
            increase $h[y_1]$ and decrease $h[y_2 + 1]$ in F by 1;
        for every point $v_k = (x_k, y_k)$ with $x_k = x$:
            compute the sum $h[1] + \cdots + h[y_k]$ in F, i.e., the number of occurrences of
            p in $p_{0..i_k-1}\,t\,p_{j_k..}$ that start within $p_{0..i_k-1}$ and end within $p_{j_k..}$;
        for every rectangle $[x_1, x_2] \times [y_1, y_2]$ with the right edge at $x_2 = x$:
            decrease $h[y_1]$ and increase $h[y_2 + 1]$ in F by 1;
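A Python sketch of Algorithm 7 (the input format, rectangles as (x1, x2, y1, y2) and points as (x, y, k), is our assumption):

```python
class Fenwick:
    """Fenwick tree over indices 1..n: point update, prefix sum."""
    def __init__(self, n):
        self.n, self.a = n, [0] * (n + 1)
    def add(self, i, delta):
        while i <= self.n:
            self.a[i] += delta
            i += i & (-i)
    def prefix_sum(self, i):
        s = 0
        while i > 0:
            s += self.a[i]
            i -= i & (-i)
        return s

def plane_sweep(points, rects, n):
    """ans[k] = number of rectangles containing point k; coordinates are
    preorder numbers in 1..n."""
    opens = [[] for _ in range(n + 1)]
    closes = [[] for _ in range(n + 1)]
    here = [[] for _ in range(n + 1)]
    for rect in rects:
        opens[rect[0]].append(rect)
        closes[rect[1]].append(rect)
    for (x, y, k) in points:
        here[x].append((y, k))
    F, ans = Fenwick(n + 1), {}
    for x in range(1, n + 1):
        for (x1, x2, y1, y2) in opens[x]:    # rectangle starts covering [y1, y2]
            F.add(y1, 1); F.add(y2 + 1, -1)
        for (y, k) in here[x]:
            ans[k] = F.prefix_sum(y)         # rectangles covering (x, y)
        for (x1, x2, y1, y2) in closes[x]:   # rectangle stops covering
            F.add(y1, -1); F.add(y2 + 1, 1)
    return ans
```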
We have to prepare in advance, for every coordinate x, a list of rectangles that have their left edge at x, a list of rectangles that have their right edge at x, and a list of points $v_k$ with this x-coordinate. Preparing these lists takes $O(|s| + |p|)$ time; each operation on F takes $O(\log |p|)$ time, and these operations are $O(|p|)$ insertions of rectangles, $O(|p|)$ deletions and $O(|s|)$ sum queries. If we add everything we have seen in dealing with types (I) to (IV) in Section 2.2, the total time complexity of our solution is $O(|t| + (|s| + |p|) \log |p|)$. Instead of using the Fenwick tree, we could use the square root decomposition on the table h (where we divide the table h into approximately $\sqrt{|p|}$ blocks with $\sqrt{|p|}$ elements each and we maintain the sum of each block), but the time complexity would increase to $O(|t| + (|s| + |p|)\sqrt{|p|})$.
  2.6. Linear-Time Solution
We will first review a recently published algorithm by Ganardi and Gawrychowski [
10] (hereinafter: GG), which we can adapt for solving our problem with a linear time complexity. The algorithm we need is in ([
10], Theorem 3.2); we will only need to adjust it a little so that it counts the occurrences of 
p instead of just checking whether any occurrence exists at all. Intuitively, this algorithm is based on the following observations: if a string of the form $u\,t\,v$ (where u is a prefix of p and v is a suffix of p) contains an occurrence of
p which begins within 
u and ends within 
v, this occurrence of 
p must have a sufficiently long overlap with at least one of the strings 
u, 
t and 
v. All these three strings are substrings of 
p, and if 
p overlaps heavily with one of its substrings, it necessarily follows that the overlapping part of the string must be periodic, and any additional occurrences of 
p in $u\,t\,v$ can only be found by shifting
p by an integer number of these periods. The number of occurrences can thus be calculated without having to deal with each occurrence individually. The details of this, however, are somewhat more complicated and we will present them in the rest of this section.
In the following, we will come across some basic concepts related to the periodicity of strings ([11], Chapter 8). We will say that the string x is periodic with a period of length d if $d < |x|$ and if $x_i = x_{i+d}$ for each i in the range $0 \le i < |x| - d$; in other words, if $x_{0..|x|-d-1} = x_{d..|x|-1}$, i.e., if some string of length $|x| - d$ is both a prefix and a suffix of x. The prefix $x_{0..d-1}$ is called a period of x; if there is no risk of confusion, we will also use the term period for its length d.
A string x can have periods of different lengths; we will refer to the shortest period of x as its base period. If some period of x is at most $|x|/2$ characters long, then its length is a multiple of the base period. (This can be proven by contradiction. Suppose this were not always true; consider any such x that has a base period y and also some longer period z, where $|z| \le |x|/2$ and for which $|z|$ is not a multiple of $|y|$. Then $|z|$ is of the form $q|y| + r$ for some $q \ge 1$ and some r from the range $0 < r < |y|$. Define $y' = y_{0..r-1}$ and $y'' = y_{r..|y|-1}$; thus $y = y'y''$ and $|y'| = r$. Let us look at the first $|z| + |y|$ characters of the string x; since y is a period of x, this prefix has the form $y^{q+1}y'$; and since z is a period of x, this prefix has the form $zy = y^q y' y$; but it is the same prefix both times, so the last $|y|$ characters of this prefix must be the same: $y''y' = y'y''$. This means that if we concatenate several copies of the strings y′ and y″ together, it does not matter in what order we do it; so $y'(y'y'')^N = (y'y'')^N y'$ for every N. Now consider $W = (y'y'')^N$ for a large enough N; since x has a period y and $y = y'y''$, x is a prefix of W, and since $y'W = Wy'$ starts with W, x is also a prefix of $y'W$; being a prefix of both W and $y'W$ means that $x_i = x_{i-r}$ for all $i \ge r$, so x has a period y′, which is in contradiction with the initial assumption that y is the base (i.e., shortest) period of x.)
It follows from the definition of periodicity that if y is both a prefix and a suffix of x (and if $|y| < |x|$), then x is periodic with a period of length $|x| - |y|$. (If, in addition, $|y| > |x| - |y|$ also holds, so that the occurrences of the string y at the beginning and end of the string x overlap, y is also periodic with a period of this length.) The longer y is, the shorter this period $|x| - |y|$ is, so we obtain the base period at the longest y that is both a prefix and a suffix of the string x. If instead of x we consider the prefix $p_{0..i-1}$ of our pattern, we know that the longest y that is both a prefix and a suffix of $p_{0..i-1}$ is $P_p[i]$ characters long; therefore, $p_{0..i-1}$ has a base period $i - P_p[i]$ (if $P_p[i] = 0$, this formula gives the base period i, which of course means that $p_{0..i-1}$ is not periodic at all).
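In code, this gives a one-line way to obtain base periods from the failure table of the earlier sketch; for example, the prefix "abaabaab" of length 8 has the border "abaab" and hence the base period 3:

```python
def base_period_of_prefix(p, i):
    """Base (shortest) period of p[:i], via d = i - P_p[i]; if P_p[i] == 0
    the result is i itself, i.e. p[:i] is not periodic at all."""
    return i - kmp_failure(p)[i]

# base_period_of_prefix("abaabaab", 8) == 3
```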
Let $\mathrm{lcp}(i, j)$ be the length of the longest common prefix of $p_{i..}$ and $p_{j..}$, i.e., of two suffixes of p. We can compute this in $O(1)$ time if we have preprocessed the string p to build its suffix tree and some auxiliary tables (in $O(|p|)$ time). Each suffix of
p corresponds to a node in its suffix tree; the longest common prefix of two suffixes is obtained by finding the lowest common ancestor of the corresponding nodes, which is a classic problem and can be answered in constant time with linear preprocessing time [
12,
13]. To build a suffix tree in linear time, one can use, for example, Ukkonen’s algorithm [
14], or first construct a suffix array (e.g., using the DC3 algorithm of Kärkkäinen et al. [
15,
16]) and its corresponding longest-common-prefix array [
17], and then use these two arrays to construct the suffix tree (this latter approach is what we used in our implementation). We can similarly compute the longest common suffix of two prefixes of 
p, which will be useful later.
Suppose that x and y are two substrings of p, say $x = p_{a..a+|x|-1}$ and $y = p_{b..b+|y|-1}$, and that we would like to check whether x appears as a substring in y starting at position k; then (provided that $k + |x| \le |y|$) we only need to check whether $\mathrm{lcp}(a, b+k) \ge |x|$; if that is true, x really appears there, otherwise the value of the function lcp tells us after how many characters the first mismatch occurs. We can further generalize this: if u, t and v are substrings of p and we are interested in whether x appears as a substring in $u\,t\,v$ starting at position k, we need to perform at most three calls to the lcp function: first, for the part where x overlaps with u; if no mismatch is observed there, we check the area where x overlaps with t; if everything matches there too, we check the area where x overlaps with v.
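Here is a self-contained Python sketch of such an lcp oracle. To stay short it builds the suffix array by prefix doubling and answers queries with a sparse table over Kasai's LCP array, a standard substitute for the linear-time suffix-tree route described above:

```python
def suffix_and_lcp_arrays(s):
    """Suffix array by prefix doubling, then Kasai's LCP array."""
    n = len(s)
    sa, rank = list(range(n)), [ord(c) for c in s]
    k = 1
    while n > 1:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new = [0] * n
        for a in range(1, n):
            new[sa[a]] = new[sa[a - 1]] + (key(sa[a]) != key(sa[a - 1]))
        rank = new
        if rank[sa[-1]] == n - 1:     # all ranks distinct: done
            break
        k *= 2
    if n == 1:
        rank = [0]
    lcp, h = [0] * n, 0               # lcp[a] = LCP(sa[a-1], sa[a])
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
    return sa, rank, lcp

def make_lcp(s):
    """Returns lcp(i, j) = |longest common prefix of s[i:] and s[j:]|,
    answered in O(1) after preprocessing."""
    n = len(s)
    sa, rank, lcp = suffix_and_lcp_arrays(s)
    table = [lcp[:]]                  # sparse table of range minima
    j = 1
    while (1 << j) <= n:
        prev, w = table[-1], 1 << (j - 1)
        table.append([min(prev[a], prev[a + w])
                      for a in range(n - (1 << j) + 1)])
        j += 1
    def query(i, j_):
        if i == j_:
            return n - i
        if i == n or j_ == n:
            return 0                  # an empty suffix shares nothing
        a, b = sorted((rank[i], rank[j_]))
        a += 1                        # minimum of lcp[a+1 .. b]
        k = (b - a + 1).bit_length() - 1
        return min(table[k][a], table[k][b - (1 << k) + 1])
    return query
```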
Let us begin with a rough outline of the GG algorithm (Algorithm 8). Recall that we would like to solve $O(|s|)$ problems of the form “how many occurrences of p are there in the string $u\,t\,v$ that start within u and end in v?”, where u is some prefix of p, and v is some suffix of p. (To be more precise: for each position k from the range $0 \le k \le |s|$ we are interested in occurrences of p in the string $s_{0..k-1}\,t\,s_{k..}$, where, as we have seen, we can limit ourselves to the string $u\,t\,v$ with $u = p_{0..i_k-1}$ and $v = p_{j_k..}$.) Assume, of course, that $|u| + |t| + |v| \ge |p|$ and that t appears at least once as a substring in p (which can be checked with the table of occurrences of t in p), because otherwise the occurrences we are looking for cannot exist at all. Since the strings p and t are the same in all our queries, we will list only u and v as the arguments of the GG function. Regarding the loop 1–3, we note that it will have to be executed at most twice, since u starts as a prefix of p, i.e., it is shorter than $|p|$; it is at least halved in each iteration of the loop, therefore it will be shorter than $|p|/4$ after at most two halvings. In step 2, we rely on the fact that the occurrences of p we are looking for overlap significantly with u (by at least one half of the latter); we will look at the details in Section 2.6.1. In step 3, we want to discover for some substring of p (namely the right half of the current u, which was a prefix of p) how long is the longest suffix of this substring that is also a prefix of p; we do not know how to do this in $O(1)$ time, but we can solve $O(|s|)$ such problems as a batch in $O(|s| + |p|)$ time, which is good enough for our purposes, since we know that we will have to execute the GG algorithm for $O(|s|)$ pairs $(u, v)$. (For details of this step, see Section 2.6.3.) Steps 4–6 are just a mirror image of steps 1–3 and the same considerations apply to them; that loop also makes at most two iterations. (Furthermore, note that steps 3 and 6 ensure that no occurrence of p is counted more than once: for example, those counted in step 2 all start within the left half of u, and this is immediately cut off in step 3, so these occurrences do not appear again, e.g., in step 5.) In step 7, we then know that u and v are shorter than $|p|/4$; if $u\,t\,v$ is longer than p, then t must be at least $|p|/2$ characters long and we can make use of this when counting occurrences of p in $u\,t\,v$; we will look at the details in Section 2.6.2. As we will see, steps 2 and 7 each take $O(1)$ time; the GG algorithm can therefore be executed $O(|s|)$ times in $O(|s|)$ time, and $O(|s| + |t| + |p|)$ time is spent on preprocessing the data, so we have a solution with a linear time complexity.
        
| Algorithm 8 Adjustment of the GG algorithm | 
| algorithm GG(u, v): input: two strings, u (a prefix of p) and v (a suffix of p); output: the number of occurrences of p in  that start               within u and end within v; 1    while : 2       count occurrences of p in  that start within the left half of u         (at positions ) and end within v; 3        the longest suffix of the current u that is               shorter than  and is also a prefix of p; 4    while : 5       count occurrences of p in  that start within u and end within the right half of v         (at most  characters before the end of v); 6        the longest prefix of the current v that is               shorter than  and is also a suffix of p; 7    count all occurrences of p in the current string ; | 
  2.6.1. Counting Early and Late Occurrences
Let us now consider step 2 of Algorithm 8 in more detail. The main idea will be as follows: we will find the base period of the string u; calculate how far this period extends from the beginning of the strings $w = u\,t\,v$ and p; and discuss various cases of alignment with respect to the lengths of these periodic parts.
We are, then, interested in occurrences of p in w that begin at some position $q < |u|/2$. Since u is also a prefix of p, the occurrence of u at the beginning of p overlaps by at least one half with that at the beginning of w (Figure 2). So u’s suffix of length $|u| - q$ is also its prefix; therefore, u has a period of length q; and since the length of this period is at most $|u|/2$, it must be a multiple of the base period. Let us denote the base period by d (represented by the arcs under the strings in Figure 2); recall that it can be computed as $d = |u| - P_p[|u|]$. We need to consider only positions of the form $q = \alpha d$ for integer α; the lower bound for α is obtained from the conditions $\alpha \ge 0$ and $\alpha d + |p| > |u| + |t|$ (so that the occurrence of p ends within v and not sooner), and the upper bound from the conditions $\alpha d < |u|/2$ (so that the occurrence of p begins within the first half of u) and $\alpha d + |p| \le |w|$ (so that the occurrence of p does not extend beyond the end of w). Let us call these bounds $\alpha_{\min}$ and $\alpha_{\max}$. If $\alpha_{\min} > \alpha_{\max}$, we can immediately conclude that no appropriate occurrence of p exists.
The string u is a prefix of p and has a base period d; perhaps there exists a longer prefix of p with this period; let $p_{0..e-1}$ be the longest such prefix. We obtain it by finding the longest common prefix of the strings p and $p_{d..}$, i.e., $e = d + \mathrm{lcp}(0, d)$. Let m be the length of the longest prefix of w that is periodic with the period d; we obtain it as d plus the length of the longest common prefix of the strings w and $w_{d..}$ (in other words, we are interested in how far w matches a copy of itself shifted by d characters; we already know that there is certainly no mismatch within u, so at most two calls of lcp are enough to check whether everything matches with t and v or find where the first mismatch occurs). Then we know that the string $w_{0..m-1}$ is periodic with the same period of length d as $p_{0..e-1}$ and u. An example is shown in Figure 3.
(1) If $m \ge \alpha_{\max} d + e$ and $e = |p|$, we have found a suitable occurrence of p at $\alpha_{\max} d$, and since w has a period of d until the end of this occurrence, this means that if we were now to shift p in steps of d characters to the left and then compare it with the same part of w, we would see exactly the same substring in w after each such shift as before, i.e., we would also find occurrences of p in all these places. In this case, all α from $\alpha_{\min}$ to $\alpha_{\max}$ are suitable, so we have found $\alpha_{\max} - \alpha_{\min} + 1$ occurrences.
(2) If $m \ge \alpha_{\max} d + e$ but $e < |p|$, we have found an occurrence of $p_{0..e-1}$ in w starting at $\alpha_{\max} d$. This occurrence of $p_{0..e-1}$ may or may not continue into an occurrence of the entire p; this can be checked with at most two more calls of lcp. As for possible occurrences further to the left, i.e., starting at $\alpha d$ for some $\alpha < \alpha_{\max}$: in such an occurrence, $p_e$ should match with the character $w_{\alpha d + e}$, which is still in the periodic part of w ($\alpha d + e < m$), i.e., it is equal to the character d places further to the left, which in turn must be equal to the character $p_{e-d}$. So $p_e$ and $p_{e-d}$ should be the same, but they certainly are not, because then the periodic prefix of p (with period d) would be at least $e + 1$ characters long, not just e characters. Thus, we see that for $\alpha < \alpha_{\max}$ there is definitely a mismatch and p does not occur there.
(3) If $m < \alpha_{\max} d + e$, we have a mismatch between $w_m$ and $p_{m - \alpha_{\max} d}$. We did not find an occurrence of p at $\alpha_{\max} d$, but what about $\alpha d$ for smaller values of α? If we now move p left in steps of d, the same character $w_m$ of the string w will have to match $p_{m - \alpha_{\max} d}$, $p_{m - \alpha_{\max} d + d}$, and so on. As long as these indices are smaller than e, all these characters are equal to the character $p_{m - \alpha_{\max} d}$, since this part of p is periodic with period d; therefore, a mismatch will occur for them as well. In general, if p starts at $\alpha d$, the character $w_m$ will have to match $p_{m - \alpha d}$; a necessary condition to avoid the aforementioned mismatch is therefore $m - \alpha d \ge e$, or $\alpha d \le m - e$. Thus, we have obtained a new, tighter upper bound for α, which we will call $\alpha'_{\max} = \min(\alpha_{\max}, \lfloor (m - e)/d \rfloor)$. For $\alpha \le \alpha'_{\max}$, the string $p_{0..e-1}$ lies entirely within the periodic part of the string w, i.e., within $w_{0..m-1}$.
(3.1) If $\alpha'_{\max} < \alpha_{\min}$, we know that there is no possible α and we will not find an occurrence of p here.
(3.2) Otherwise, if $e = |p|$, we see that for $\alpha \le \alpha'_{\max}$, the string p (if it starts at position $\alpha d$ in w) lies entirely within the periodic part of w; since $e = |p|$ means that p is itself entirely periodic with the same period of length d, we can conclude that, for each α (from $\alpha_{\min}$ to $\alpha'_{\max}$), p matches completely with the corresponding part of w; thus there are $\alpha'_{\max} - \alpha_{\min} + 1$ occurrences.
(3.3) However, if $e < |p|$, we can reason very similarly as in case (2). At $\alpha'_{\max} d$ we have an occurrence of $p_{0..e-1}$, which may or may not be extended to an occurrence of the whole p; we check this with at most two calls of lcp. At $\alpha = \alpha'_{\max}$ the string $p_{0..e-1}$ lies entirely within the periodic part of the string w; if we then decrease α and thus shift p to the left by d or a multiple of d characters, then $p_{0..e-1}$ will certainly also fall within the periodic part of the string w. In order for the entire p to occur at such a position, among other things, $p_{e-d}$ and $p_e$ would have to match with the corresponding characters of the string w, but since these two characters of w are both in the periodic part of the string w, they are equal, while the characters $p_{e-d}$ and $p_e$ are not equal (because $p_{0..e-1}$ is the longest prefix of p with period d); therefore, a mismatch will definitely occur for every $\alpha < \alpha'_{\max}$ and p does not appear there. Step 2 of the GG algorithm can therefore be summarized by Algorithm 9.
          
Algorithm 9 Occurrences of p in $u\,t\,v$ that start within the left half of u and end within v.

    $d := |u| - P_p[|u|]$;    (* the base period of u *)
    compute $\alpha_{\min}$ and $\alpha_{\max}$ from the inequalities above;
    if $\alpha_{\min} > \alpha_{\max}$ then return 0;
    e := length of the longest prefix of p with a period d;
    m := length of the longest prefix of $w = u\,t\,v$ with a period d;
    if $m \ge \alpha_{\max} d + e$:
        if $e = |p|$ then return $\alpha_{\max} - \alpha_{\min} + 1$
        else if p is a prefix of $w_{\alpha_{\max} d..}$ then return 1
        else return 0;
    else:
        $\alpha'_{\max} := \min(\alpha_{\max}, \lfloor (m - e)/d \rfloor)$;
        if $\alpha'_{\max} < \alpha_{\min}$ then return 0
        else if $e = |p|$ then return $\alpha'_{\max} - \alpha_{\min} + 1$
        else if p is a prefix of $w_{\alpha'_{\max} d..}$ then return 1
        else return 0;
  2.6.2. Counting the Remaining Occurrences
Now let us consider step 7 of Algorithm 8. At that point in the GG algorithm, we have already truncated u and v so that they are shorter than $|p|/4$. If $|u| = 0$ or $|v| = 0$ or $|u| + |t| + |v| < |p|$, we can immediately conclude that there are no relevant occurrences of p. We can assume that $|t| > |p|/2$, as otherwise the string $w = u\,t\,v$ would certainly be shorter than $|p|$.
We are interested in such occurrences of p in w that start within u and end within v; such an occurrence also covers the entire t in the middle. Candidates for a suitable occurrence of p in w are therefore present only where t appears in p (more precisely: the fact that t occurs in p starting at ℓ is a necessary condition for p to appear in w starting at $|u| - \ell$). We compute in advance (using the table of occurrences of t in p) the index of the first and (if it exists) second occurrence of t as a substring in p; let us call them $\ell_1$ and $\ell_2$. If there was just one occurrence of t in p, then there is also just one candidate for the occurrence of p in w and with two more calls of lcp we can check whether p really occurs there (i.e., whether $p_{0..\ell_1-1}$ is a suffix of u and $p_{\ell_1+|t|..}$ is a prefix of v).
Now suppose that there are at least two occurrences of t in p. Let us consider some occurrence of p in w (of course one that starts within u and ends within v); the string t in w matches one of the occurrences of t in p; let ℓ be the position where this occurrence of t in p begins (Figure 4). There are at most $|u|$ characters to the left of this occurrence of t in p (otherwise p would already extend beyond the left edge of w), and there are at most $|v|$ characters to the right of it (otherwise p would extend beyond the right edge of w). Therefore, no other occurrence of t in p can begin more than $|u|$ characters to the left of ℓ or more than $|v|$ characters to the right of ℓ. If another occurrence of t in p begins at position ℓ′, then $|\ell' - \ell| \le \max(|u|, |v|) < |p|/4$.
Since no occurrence of t in p is very far from ℓ, it follows that the first two occurrences of t in p (i.e., those at $\ell_1$ and $\ell_2$) cannot be very far from each other. We can verify this as follows: (1) if the occurrence of t in p at ℓ was exactly the first occurrence of t in p (i.e., if $\ell = \ell_1$), it follows that $\ell_2$ is greater than $\ell_1$ by at most $|v|$, so $\ell_2 - \ell_1 \le |v| < |p|/4$; (2) if the occurrence of t in p at ℓ was not the first occurrence of t in p, but the second or some later one (that is, if $\ell \ge \ell_2$), it follows that $\ell_1$ and $\ell_2$ are both less than or equal to ℓ, but by at most $|u|$, so again $\ell_2 - \ell_1 \le |u| < |p|/4$.
We see that if p occurs in w (starting within u and ending within v), the first two occurrences of t in p are less than $|p|/4$ apart. So, we can start by computing $\ell_2 - \ell_1$ and if $\ell_2 - \ell_1 \ge |p|/4$, we can immediately conclude that we will not find, in w, any occurrences of p that start within u and end within v.
Otherwise, we know henceforth that $d := \ell_2 - \ell_1 < |p|/4 < |t|/2$, so the first two occurrences of t in p overlap by more than half; the overlapping part, of length $|t| - d$, is therefore both a prefix and a suffix of t; so t is periodic with a period of length d. Now suppose that t has some shorter period $d' < d$, and consider the position $\ell_1 + d'$. For $x < |t| - d'$, the characters $p_{\ell_1 + d' + x}$ and $t_x$ are equal, because the former lies in the first occurrence of t in p and t has period d′; for $x \ge |t| - d'$, the character $p_{\ell_1 + d' + x}$ lies in the second occurrence of t in p, and (using first the period d and then the period d′ of t) it is again equal to $t_x$. Since this is true for all x from 0 to $|t| - 1$, we have found another, intermediate occurrence of t in p starting at $\ell_1 + d'$, which contradicts the assumption that the second occurrence is the one at $\ell_2$. Thus, d is the shortest period of t.
Consider the first occurrence of t in p, the one starting at $\ell_1$, and see how far left and right of it we could extend t’s period of length d; that is, we need to compute the longest common suffix of the strings $p_{0..\ell_1-1}$ and $p_{0..\ell_1+d-1}$, and the longest common prefix of the strings $p_{\ell_1..}$ and $p_{\ell_1+d..}$. Now we can imagine p partitioned into $p = p_L\,p_M\,p_R$, where $p_M$ is the maximal substring that contains that occurrence of t and has a period d.
Similarly, in w, let us see how far into u and v the middle part of t could be extended while maintaining periodicity with period d. This requires some caution: t starts with the beginning of the period, so when we extend the middle part from the right end of u leftwards, we must expect the end of the period at the right end of u. So we must compute the longest common suffix of u and something that ends with the end of t’s period, which is not necessarily t itself because its length is not necessarily a multiple of the period length. A similar consideration applies to extending the middle part from the left end of v rightwards. Let $t'$ be the prefix of t obtained by restricting ourselves only to periods that occur in their entirety, i.e., of length $|t'| = d \lfloor |t|/d \rfloor$. Since $d < |p|/4$ and $|t| > |p|/2$, it follows that $|t'| > |t| - d > |p|/4$, so $t'$ is longer than u and v. If we now find the longest common suffix of the strings u and $t'$ and the longest common prefix of the strings v and $t_{|t|-|t'|..}$ (the suffix of t of the same length $|t'|$), we will receive what we are looking for, and we do not have to worry about running out of t before u or v. From now on, let us imagine w in the form $w = w_L\,w_M\,w_R$, where $w_M$ is the maximal substring that contains the t between u and v and is periodic with a period of length d. Figure 5 shows the resulting partition of the strings w and p.
We know that if p appears in w, then the t in w matches some occurrence of t in p, say the one at ℓ (see Figure 5); if this is not precisely the first occurrence (the one at $\ell_1$), then the first one can start at most $|u|$ characters further to the left (otherwise p would extend over the left edge of w there); both occurrences of t in p (at $\ell_1$ and ℓ) overlap by more than half, so t is periodic with a period of length $\ell - \ell_1$; this is less than $|t|/2$, so this period is a multiple of the base period, i.e., d. Thus, we see that only those occurrences of t in p that begin at positions of the form $\ell = \ell_1 + \beta d$ for integer $\beta \ge 0$ are relevant.
Furthermore, since the occurrences of t in p at ℓ and at $\ell_1$ overlap by more than half, this overlapping part is longer than d, i.e., the base period of t. In p, the content of this overlapping part, since it lies within the first occurrence of t in p, continues with the period d to the left and right throughout $p_M$; but since this overlapping part also lies within the occurrence of t at ℓ, which matches with t in w, the content of this overlapping part also continues in w (with the period d) to the left and right throughout $w_M$. If we look at what happens in w where we have the string $p_M$ in p, we see that if $p_M$ extended to the left of $w_M$, we would have a contradiction, because we chose $w_M$ in such a way that a period of length d cannot extend further than the left edge of $w_M$ (if we start from t in w). For a similar reason, $p_M$ cannot extend to the right of $w_M$; it must therefore remain within $w_M$.
We said that t starts at ℓ in p and matches with t in w; so p starts at $|u| - \ell$ in w, and $p_M$ starts at $|u| - \ell + |p_L|$ in w. From the condition that $p_M$ must not extend to the left of $w_M$, we get $|u| - \ell + |p_L| \ge |w_L|$, and since it must not extend to the right of $w_M$, we get $|u| - \ell + |p_L| + |p_M| \le |w_L| + |w_M|$. In addition, we still expect that p starts within u, so $0 \le |u| - \ell$, and that it ends within v, so $|u| - \ell + |p| > |u| + |t|$. From these inequalities (with $\ell = \ell_1 + \beta d$), we can now determine the smallest and largest acceptable value of β ($\beta_{\min}$ and $\beta_{\max}$).
If $\beta_{\min} > \beta_{\max}$, we can immediately conclude that there are no relevant occurrences of p. Otherwise, if $p_L$ and $p_R$ are empty strings, then $p = p_M$ and for every β (from $\beta_{\min}$ to $\beta_{\max}$) p lies within $w_M$, so there are no mismatches; then we have $\beta_{\max} - \beta_{\min} + 1$ occurrences.
If $p_R$ is not empty, we can check the case $\beta = \beta_{\min}$ separately (i.e., we check whether p appears in w starting at $|u| - \ell_1 - \beta_{\min} d$; we need at most two calls of lcp). What about larger β? If we go from $\beta_{\min}$ to some larger β, the string p moves by d or a multiple of d characters to the left relative to w; this moves the first character of the string $p_R$ so far to the left that its corresponding character of w (with which the first character of the string $p_R$ will have to match if p is to occur at the current position in w) belongs to the periodic part, $w_M$; in addition, recall that (if the new β is still valid at all, i.e., $\beta \le \beta_{\max}$) the string $p_M$ also overlaps with $w_M$ by more than one whole period, so the character of p that is d places to the left of the first character of $p_R$ is still in such a position that its corresponding character in w is in the periodic part of w; so here we have two characters of w that are exactly d places apart, hence they are equal. The first character of $p_R$ cannot be equal to the character d places to the left of it, because otherwise the periodic part of p, i.e., $p_M$, could be expanded even further to the right and this character would not belong to $p_R$ but instead to $p_M$. We see that it is impossible that, in the new position of p, the first character of $p_R$ and the one d places to the left of it would both match with the corresponding characters of w; there will be a mismatch in at least one place, so p cannot occur there. Thus, we do not need to consider the cases $\beta > \beta_{\min}$ at all.
The remaining possibility is that $p_R$ is empty, but $p_L$ is not empty. An analogous analysis to the previous paragraph tells us that it is sufficient to check whether p occurs in w at $\beta = \beta_{\max}$, and we do not need to consider smaller values of β. Thus, step 7 of the GG algorithm can be described with the pseudocode of Algorithm 10.
          
Algorithm 10 Occurrences of p in the shortened string $w = u\,t\,v$.

    $\ell_1, \ell_2$ := positions of the first two occurrences of t in p;
    if there is only one occurrence then return 1 or 0,
        depending on whether p occurs in w at position $|u| - \ell_1$ or not;
    $d := \ell_2 - \ell_1$; if $d \ge |p|/4$ then return 0;
    partition p into $p_L\,p_M\,p_R$, where $p_M$ is the maximal substring that
        contains the first occurrence of t in p and has period d;
    partition w into $w_L\,w_M\,w_R$, where $w_M$ is the maximal substring
        containing the t between u and v and has period d;
    compute $\beta_{\min}$ and $\beta_{\max}$ from the aforementioned inequalities;
    if $\beta_{\min} > \beta_{\max}$ then return 0
    else if $p_R$ is nonempty then return 1 or 0,
        depending on whether p appears in w at position $|u| - \ell_1 - \beta_{\min} d$ or not
    else if $p_L$ is nonempty then return 1 or 0,
        depending on whether p appears in w at position $|u| - \ell_1 - \beta_{\max} d$ or not
    else return $\beta_{\max} - \beta_{\min} + 1$;
All of these things can be prepared in advance ($\ell_1$ and $\ell_2$ using the table of occurrences of t in p) or computed in $O(1)$ time (with a small constant number of computations of the longest common prefix or suffix).
  2.6.3. Finding the Next Candidate Prefix (or Suffix)
In this subsection we describe step 3 of Algorithm 8 in more detail. The purpose of that step is to answer, in $O(|s| + |p|)$ time, up to $|s| + 1$ queries (one for each position k where the string t may be inserted into s) of the form “given a prefix u of p, find the longest suffix of u that is shorter than $|u|/2$ and is also a prefix of p”. We may describe such a query with a pair (a, b), where $a = |u|$ is the length of the original prefix and b is the maximum length of the new shorter prefix that we are looking for. Ganardi and Gawrychowski showed that this problem can be reduced to a weighted-ancestor problem in the suffix tree of the pattern ([10], Lemma 2.2), for which they then provided a linear-time solution ([10], Section 4). However, this solution is quite complex and in the present subsection we will present a simpler alternative which avoids the need for weighted-ancestor queries.
The longest suffix of $p_{0..a-1}$ that is also a prefix of p has the length $P_p[a]$, the second longest has the length $P_p[P_p[a]]$, and so on. We could examine this sequence until we reach an element that is $\le b$; that would be the answer to the query (a, b). We can also imagine this procedure as climbing the tree of prefixes of p (first introduced in Section 2.3), starting from the node a and proceeding upwards towards the root until we reach a node with the value $\le b$. However, we can save time by observing that several of these paths up the tree, for different queries, sometimes meet in the same node and thenceforth always move together.
If several paths reach some node v and do not end there (because the bounds b of those queries are $< v$), all these paths will continue into v’s parent w and will never separate again, regardless of where in v’s subtree they started. Thus, in a certain sense, we no longer need to distinguish between the nodes w, v, and v’s descendants, as the result of any query with $b < v$ is the same regardless of which of these nodes its path begins in. We can imagine these nodes as having merged into one, namely into v’s parent w.
To follow the paths up the tree, we will visit the nodes of the tree in decreasing order, from $|p| - 1$ down to 0. Upon reaching node v, we can answer all queries whose bound is exactly $b = v$, and then we can merge v with its parent, thereby taking into account the fact that all paths that reach v and do not end there will proceed to v’s parent. For every node w that has not yet been merged with its parent, we maintain a set of all nodes that have already been merged into w; at the start of the procedure, each node forms a singleton set by itself. The pseudocode of this approach is shown in Algorithm 11.
          
| Algorithm 11 Answering a batch of queries  in the KMP-tree . | 
| 1    ; 2    for  downto 0: 3       for each query  having : 4          the answer to this query is the smallest element of             that member of F which contains ; 5       merge, in F, the set containing v and the set containing ; | 
The following invariant holds at the start of each iteration of the main loop: F contains one set $S_w$ for each node w that has not yet been merged into its parent, and this set $S_w$ contains w as well as those children of w (in the tree) that are greater than v, and all the descendants of those children. In this tree each node has a smaller value than its children, therefore w is the smallest member of $S_w$. If some query (a, b) has $b = v$ and $a \in S_w$, this means that a is a descendant of w (or w itself) and that all the nodes on the path from a to w, except w itself, are greater than v (note that $w \le v$, for otherwise w would already have been merged into its parent in an earlier iteration); therefore the path for this query will rise from a to w and then stop; thus step 4 of our algorithm is correct in reporting w as the answer to this query. Step 5 ensures that the loop invariant is maintained into the next iteration. Step 3 requires us to sort the queries by b, which can be achieved in linear time using counting sort.
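A Python sketch of Algorithm 11 with an ordinary path-halving union-find in place of the Gabow–Tarjan structure (so it carries the inverse-Ackermann factor discussed below); parent[u] must be smaller than u for every non-root node, as in the KMP-tree:

```python
def batch_queries(parent, root, queries):
    """For each query (a, b): the deepest ancestor w of a with w <= b,
    assuming parent[u] < u for every non-root node u and 0 <= b < n."""
    n = len(parent)
    uf = list(range(n))                    # union-find with path halving
    def find(x):
        while uf[x] != x:
            uf[x] = uf[uf[x]]
            x = uf[x]
        return x
    smallest = list(range(n))              # smallest node in each set
    by_bound = [[] for _ in range(n)]      # counting sort of queries by b
    for idx, (a, b) in enumerate(queries):
        by_bound[b].append((a, idx))
    ans = [None] * len(queries)
    for v in range(n - 1, -1, -1):
        for (a, idx) in by_bound[v]:
            ans[idx] = smallest[find(a)]   # step 4 of Algorithm 11
        if v != root:                      # step 5: merge v into its parent
            rv, rp = find(v), find(parent[v])
            uf[rv] = rp
            smallest[rp] = min(smallest[rp], smallest[rv])
    return ans
```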
The time complexity of this algorithm depends on how we implement the union-find data structure F. The traditional implementation using a disjoint-set forest results in a time complexity of $O((|s| + |p|) \cdot \alpha(|s| + |p|))$ for Algorithm 11 [18], and hence $O((|s| + |t| + |p|) \cdot \alpha(|s| + |p|))$ for the solution of our problem as a whole. Here α is the inverse Ackermann function, which grows very slowly, making this solution almost linear. However, to obtain a truly linear solution, we can use the slightly more complex static tree set union data structure due to Gabow and Tarjan [19,20], which assumes that the elements are arranged in a tree and that the union operation is always performed between a set containing some node of the tree and the set containing its parent; hence this structure is perfectly suited to our needs. (A minor note to anyone intending to reimplement their approach: ([20], p. 212) uses the macrofind operation in step 7 of find, whereas ([19], p. 248) uses microfind; the latter is correct while the former can lead to an infinite loop. Moreover, it may be useful to point out that macrofind must return, as the “name” of the macroset containing x, not an arbitrary member of that set, but the member which is closest to the root of the tree.) Algorithm 11 then runs in $O(|s| + |p|)$ time and our solution as a whole in $O(|s| + |t| + |p|)$ time.