Sliding Suffix Tree using LCA

We consider a sliding window $W$ over a stream of characters from an alphabet of constant size. The user wants to perform deterministic substring matching on the current sliding window content and obtain the positions of the matches. We present an indexed version of the sliding window based on a suffix tree, a suffix links tree, and lowest common ancestor queries. The data structure of size $\Theta(|W|)$ has optimal query time $\Theta(m+occ)$ and amortized constant time updates, where $m$ is the length of the query string and $occ$ is the number of its occurrences.


Introduction and Related Work
Text indexing, and big data in general, is a well studied computer science and engineering field. An especially intriguing area is (infinite) streams of data which are too big to fit onto a disk and, consequently, cannot be indexed in the traditional way (e.g. by using the FM-index [5]). In practice, data streams are processed on-the-fly by efficient, carefully engineered filters. An excerpt of the data, called text features, is stored for later use, while the original stream data is discarded. In our research, we consider an infinite stream of characters where the main memory holds the most recent characters of the stream in the form of a sliding window. At any moment, a user wants to find all occurrences of a given substring in the current window. In general, to answer the query, we could construct an automaton from the query using KMP [9] or Boyer-Moore [1], and then feed the stream to the constructed automaton. This, however, requires that all queries are known in advance. On the other hand, if the query arrives on-the-fly, the automaton needs to be constructed from scratch. In both cases we need to scan the whole window, which requires linear time in the size of the window. A better possibility would be to run Ukkonen's online suffix tree construction algorithm [11] and construct the suffix tree. When the query arrives, we inject a delimiter character to finalize the suffix tree construction and perform the query on the constructed tree. However, finalizing might take, in the worst case, linear time in the size of the window.
In this paper we show how to construct and maintain an indexed version of the sliding window allowing a user to find occurrences of a substring in optimal time and space. This is the first data structure for on-the-fly text indexing which requires amortized constant time for updates and worst case optimal time for queries.
In the following section we define the notation and preliminary data structures and algorithms. In Section 3 we formally present the sliding suffix tree, and we conclude in Section 4 with a discussion and open problems. We denote by W the sliding window over an infinite input stream of characters drawn from an alphabet of constant size. By n, we denote the number of all characters read so far. To store a suffix starting at the current position, we store the current n. At any later time n′, we can retrieve the content of this suffix as W[n − (n′ − |W|):], where n′ − n < |W| must hold for the suffix to still be present in W.
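The index arithmetic above can be sketched as follows; `suffix_in_window` is a hypothetical helper name, assuming 0-based stream positions:

```python
def suffix_in_window(W, n_now, n_stored):
    """Retrieve the suffix stored at absolute stream position n_stored,
    given the current window content W and the current character count n_now.
    Valid only while n_now - n_stored < |W| (the suffix is still inside W)."""
    assert n_now - n_stored < len(W), "suffix has slid out of the window"
    return W[n_stored - (n_now - len(W)):]
```

For example, after reading the stream "abcdefgh" with |W| = 4 we have W = "efgh" and n = 8; the suffix stored at position 6 is retrieved as W[6 − (8 − 4):] = "gh".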

Suffix Tree, Suffix Links Tree, and Lowest Common Ancestor
A suffix tree is a dictionary containing each suffix of the text as a key with its position in the text as its value. The data structure is implemented as a PATRICIA tree, where each internal node stores a skip value and the first (discriminative) character of the incoming edge, whereas each leaf stores the position of the suffix in the text. We denote by |α| the string depth of a node α in the suffix tree, defined as the sum of all skip values from the root to α. Each edge implicitly carries a label, which is the substring of a suffix starting at the string depth of the originating node and ending at the string depth of the terminating node. We say a node α spells out string A, where A is the concatenation of all labels from the root to α. More formally, take a leaf in the subtree of α storing position i of a suffix; then A = W[i − (n − |W|) : i + |α| − (n − |W|)]. Next, let A and A′ be strings spelled out by nodes α and α′ respectively. We define a suffix link as an edge from node α′ to α if A = A′[2:], and denote this by α = suffix_link(α′). If we follow suffix links from α′ i times, we write this as suffix_link^i(α′).
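As a concrete illustration of the node layout described above (a minimal sketch, not the authors' implementation), a PATRICIA node and the string depth computation might look as follows:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # skip value and discriminative first character of the incoming edge
    skip: int = 0
    first_char: str = ""
    children: dict = field(default_factory=dict)  # first character -> child Node
    suffix_pos: int = -1  # leaves only: position of the suffix in the stream

def string_depth(path_from_root):
    """|alpha|: the sum of all skip values on the path from the root to alpha."""
    return sum(node.skip for node in path_from_root)
```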
We define the suffix links tree as follows. For each internal node in the suffix tree let there be a node in the suffix links tree. For each suffix link from α′ to α in the suffix tree, α is the parent of α′ in the suffix links tree. Consequently, following a suffix link from α′ i times is the same as finding the i-th ancestor of α′ in the suffix links tree.
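The correspondence between following suffix links and taking ancestors can be expressed directly; in this sketch, `parent` is a hypothetical map from a node to its parent in the suffix links tree:

```python
def suffix_link_power(node, i, parent):
    """suffix_link^i(node): the i-th ancestor of node in the suffix links tree."""
    for _ in range(i):
        node = parent[node]
    return node
```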
The lowest common ancestor (LCA) of nodes α and α′ in a tree is the deepest node which is an ancestor of both α and α′. The first constant time, linear space LCA algorithm was presented in [7] and later simplified in [10]. A dynamic version of the data structure, still running in constant time and linear space, was introduced in [3]. We will use this result to perform constant time LCA lookups and maintain the structure in amortized constant time.

Ukkonen's online suffix tree construction algorithm
In [11] Ukkonen presented a suffix tree construction algorithm which builds the data structure in a single pass. During the construction, the algorithm maintains the following invariants in amortized constant time: the implicit buffer B, which corresponds to the longest repeated suffix of the text processed so far, and the active node β, which represents the node where we end up by navigating B in the suffix tree constructed so far, i.e. B is a prefix of the string spelled out by β.
The execution of the algorithm can be viewed as an automaton with two states. In the buffering state, the automaton reads a character c from the input stream and implicitly extends all labels of leaves by c. Then it checks whether c matches the next character of the prefix of length |B| + 1 spelled out by β. If it does, c is appended to B and the automaton remains in the buffering state, reading the next character. When |B| = |β|, the child in the direction of c becomes the new active node.
On the other hand, if character c does not match, the automaton switches to the expanding state. First it inserts a new branch in the direction of c with a single leaf storing the suffix position n − |B|. If |B| = |β|, the new branch is added as a child of β. Otherwise, if |B| < |β|, the incoming edge of β is split such that the string depth of the newly inserted internal node is |B|, and the new branch is added to this node. Once the branch is inserted, the first character is removed from B, obtaining the new buffer B′ = B[2:]. The new active node corresponding to B′ is found in the following way. Let α denote the parent of the original active node β. Then the new active node β′ is the node obtained by navigating the suffix B′[|α|:] from the node suffix_link(α). When β′ is obtained, c is reconsidered. If a branch in the direction of c exists, the automaton switches to the buffering state. Otherwise, it remains in the expanding state and repeats the new branch insertion. Each time the expanding state is re-entered, B is shortened by one character. In the worst case, if c does not occur in the text yet, the suffix links are followed all the way up to the root node, and c is added as a new child of the root node. In this case the implicit buffer B becomes the empty string.
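The invariant maintained by the two-state automaton can be checked against a brute-force definition: after reading a prefix of the stream, B is the longest suffix of the text that also occurs starting at an earlier position. The following sketch computes B naively (quadratic time, for illustration only; Ukkonen's algorithm maintains it in amortized constant time per character):

```python
def implicit_buffer(text):
    """Longest suffix of text that also occurs starting at an earlier position."""
    for length in range(len(text) - 1, 0, -1):
        suffix = text[-length:]
        if text.find(suffix) < len(text) - length:
            return suffix
    return ""
```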
We say the currently constructed suffix tree is unfinalized until B is completely emptied. Moreover, there are exactly |B| leaves missing in the unfinalized tree, and these correspond to the suffixes of B. For finite texts we finalize the suffix tree at the end by appending a unique character $ which forces the algorithm to empty B and finalize the tree. For infinite streams, however, there is no final character. Consequently, we need to support:
1. Queries: When performing queries, we need to report the occurrences both in the partially constructed suffix tree and in B.
2. Maintenance: The original Ukkonen's algorithm supports adding a new character to the indexed text. When the window is shifted, we also remove the oldest (longest) suffix from the text.

Sliding Suffix Tree
The sliding suffix tree is an indexed version of the current sliding window content W. Formally, we define two operations:
find(W, Q) – returns all positions of the query string Q in W.
shift(W, c) – appends a character c to W and removes the oldest character from W.
Initially, W is empty, and until the length of W reaches the desired size, the shift operation only appends new characters.
The sliding suffix tree is built on top of Ukkonen's online suffix tree construction algorithm. We maintain a possibly unfinalized suffix tree T, including the implicit buffer B and the active node β (Fig. 1, left). In the next two subsections we show how to perform the find operation in worst-case time Θ(|Q| + occ) and the shift operation in amortized constant time. As a model of computation, we use the standard RAM model.

Queries
To find all occurrences of query Q in W, we first navigate Q in T. Let T_Q denote the subtree rooted at the node at which the navigation finished. Leaves of T_Q make up the first part of the resulting set. In Figure 1, occ_1 corresponds to such an occurrence. The position of occ_2 in the same figure will also be contained in one of the leaves of T_Q, since T contains all suffixes starting from the beginning of W up to the beginning of B.
The second part of the resulting set consists of the missing leaves of T_Q due to the unfinalized state of T. Intuitively, these leaves correspond to suffixes of B which start with Q; occ_3 in Figure 1 illustrates one such position. Obviously, if |B| < |Q|, there are no matches of Q in B and we return solely the leaves of T_Q. If |B| = |Q|, we test whether the active node β is the same node as the root of T_Q. If it is, we add one additional occurrence at position n − |B| to the resulting set.
The case |B| > |Q| requires special attention. One solution would be to scan B for Q using KMP or similar approaches. But since |B| = O(|W|) in the worst case, we cannot afford the scan. In the remainder of this subsection we show how to determine the missing leaves in time O(|Q| + occ). First, we claim that the navigated subtree T_Q always exists if there are any occurrences of Q in B.

Lemma 1. If Q exists in buffer B, then the subtree T_Q obtained by navigating the query Q in T exists.
Proof. If Q exists somewhere in B, then Q is a substring of the string spelled out by β. By the property of the suffix tree, following the suffix links from β we will reach a node which spells out a string with Q at the beginning. This node is the root of T_Q.
To consider occurrences of Q in B where |B| > |Q|, we determine the relation of each node in T_Q to β. Since |T_Q| = O(occ), we can afford this operation if we spend at most constant time per node. We proceed depending on whether β is an internal node of T or not.

Lemma 2. Let β be the active node of T, and let β be an internal node. String Q is located in B at position i iff suffix_link^i(β) is a node in T_Q.
Proof. (⇒) We need to prove that a node corresponding to a suffix of B which starts with Q exists in T_Q even though T is not finalized. Recall the expanding state of Ukkonen's algorithm. At each call, the operation adds a leaf and possibly an internal node, whereas the existing internal nodes are left untouched. Since β is an internal node, no changes will be made either to it or to the nodes visited when recursively following the suffix links from β, since they are also internal nodes. Therefore, a node corresponding to a suffix of B which begins with Q exists in T_Q, if such a suffix exists in B.
(⇐) By definition of β, B is a prefix of the string which β spells out. β is also an internal node, so it always has an outgoing suffix link (in case |β| = 1, let the suffix link point to the root node). When following the suffix link of β, each time we implicitly remove one character from the beginning of B. Suppose we follow the suffix link i times and reach a node which is a member of T_Q. By definition of the suffix tree, each node in T_Q spells out a string which starts with Q. Therefore, the reached node corresponds to the suffix of B at position i, and this suffix starts with Q.
By using the lowest common ancestor (LCA) operation we can check in constant time whether a node is reachable from another node by following the suffix links in T. If α is an ancestor of β in L (i.e. the LCA of α and β in L is α), then α is reachable by following the suffix links from β in T. To determine all occurrences of Q in B in time O(|T_Q|), for each candidate node α in T_Q we find its LCA with β in L. If the LCA is α and α is the i-th ancestor of β in L, then by Lemma 2, Q is located in B at position i.
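The LCA-based test can be emulated with a brute-force walk for illustration (the real structure answers this in constant time [3]); here `parent` is a hypothetical parent map of the suffix links tree L:

```python
def buffer_match_position(alpha, beta, parent):
    """If alpha is an ancestor of beta in L at distance i, then by Lemma 2
    Q occurs in B at position i; otherwise return None."""
    i, node = 0, beta
    while node is not None:
        if node == alpha:
            return i
        node = parent.get(node)  # walk one suffix link up; None at the root
        i += 1
    return None
```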
If β is a leaf of T, we cannot use the approach described above, because leaves do not have usable suffix links. We find occurrences of Q in B by exposing a repetitive pattern P inside B. The expansion of the tree will not occur, because Bc was present in the text before, and consequently a corresponding edge in the partially constructed suffix tree already exists.

Corollary 4. Let there be a single leaf in the subtree obtained when navigating B in T, and let x be the position stored in this leaf. The repetitive pattern P is then the substring of length (n − |B|) − x starting at position x.
Proof. Since the leaf storing x is the only leaf in the obtained subtree, there are exactly two occurrences of B in the text: the first at position x and the second at position n − |B|. If |P| > |B|, then B is a prefix of P, because the leaf spelling out P was obtained by navigating B. If |P| < |B|, then B = P^k P′ due to the buffer pumping lemma.
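The decomposition B = P^k P′ used in the proof, for the case |P| < |B|, can be sketched as follows, with p = |P| being the distance between the two occurrences of B:

```python
def decompose_buffer(B, p):
    """Decompose a buffer B with period p as (P, k, P_rest), so B == P*k + P_rest,
    where P_rest is a (possibly empty) proper prefix of P."""
    P = B[:p]
    k = len(B) // p
    P_rest = B[p * k:]
    assert B == P * k + P_rest
    return P, k, P_rest
```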
With the help of the lemma and the corollary above, we can efficiently determine the positions of Q in B by exploiting the repetitive pattern P inside B. Depending on the length |P|, two cases are possible, as illustrated in Figure 2. If |P| ≤ |Q| (Fig. 2.a), we scan for Q in the prefix of B of length 2|Q| − 1, and for each occurrence of Q at some position y we add occurrences y, y + |P|, y + 2|P|, ... to the resulting set until we reach n − |Q|. This requires O(|Q| + occ) time in the worst case. If |P| > |Q| (Fig. 2.b), we visit the leaves of T_Q and consider the suffixes starting inside the interval [x, n − |B| − 1] of the stream. For each such occurrence z, we add z + |P|, z + 2|P|, ... to the resulting set until we reach n − |Q|. This requires O(occ) time in the worst case.
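For the case |P| ≤ |Q|, the enumeration above can be sketched as follows (a simplified illustration with positions relative to the start of B; seed matches need character comparisons only within the first 2|Q| − 1 characters, and every further match is a seed shifted by a multiple of the period):

```python
def occurrences_in_periodic_buffer(B, Q, p):
    """All start positions of Q in B, assuming B has period p and p <= |Q|.
    Only seed positions y < p are compared explicitly; each seed match is
    then extended with step p for as long as it fits inside B."""
    assert p <= len(Q)
    occs = []
    for y in range(min(p, len(B) - len(Q) + 1)):
        if B[y:y + len(Q)] == Q:  # touches at most the first 2|Q|-1 characters
            z = y
            while z + len(Q) <= len(B):
                occs.append(z)
                z += p
    return sorted(occs)
```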
The data structure we use consists of (T, L, A), where T requires O(|W|) space in the worst case (i.e. when |B| = 0), assuming an alphabet of constant size. Next, L contains the same number of nodes as T and is oblivious to the alphabet size, so its space complexity has the same upper bound. Finally, A, used for constant time LCA queries on L, requires linear space in the number of nodes of L. This brings us to the following theorem.

Theorem 5. The sliding suffix tree occupies Θ(|W|) space, and the find operation reports all occ occurrences of a query Q in worst-case time Θ(|Q| + occ).

Maintenance
To shift window W , we read a character c and add it to our data structure and at the same time remove the oldest (longest) stored suffix. During the maintenance no queries can be performed.
To add a character, we first execute the original Ukkonen algorithm as described in subsection 2.2. During the expanding state we add to T either one node (a new leaf added to the active node) or two nodes (the incoming edge of the active node is split and a new leaf is added). Since L contains only the internal nodes of T, it remains unchanged in the first case; in the second case, a node is added to L as follows.
When the expanding state is visited for the first time, a new internal node γ_T is added to T. We also add a new node γ_L to L. At this point no suffix link originating in γ_T has been set, so γ_L does not have a parent in L yet. In the next step either the expanding state is re-entered or the buffering state is entered. If the expanding state is re-entered, we repeat the procedure, obtaining new nodes γ′_T and γ′_L. Now a suffix link is created from γ_T to γ′_T, and consequently the parent of γ_L becomes γ′_L. If the buffering state is entered, either the root node or a node containing the matched character is reached. Instead of creating new nodes in T and L as we did in the expanding state, we create a suffix link to an existing node in T and set the parent of the corresponding node in L accordingly. Adding a suffix to T requires amortized constant time [11]. During the re-entrances to the expanding state, a chain of nodes was formed in L, which was finally attached to an existing node in constant time when the buffering state was entered. For updating A, attaching a chain of nodes to a tree requires linear time in the length of the chain [3]. By amortizing over all expanding calls, adding a new character takes amortized constant time.
To remove the oldest stored suffix from T, we first find the corresponding leaf (e.g. by following a linked list of all leaves). If the leaf's parent has three or more children, the parent remains unchanged and we just remove the leaf from T. Since the leaves of T are not present in L, both L and, consequently, A remain unchanged.
On the other hand, if the leaf's parent has exactly two children, we remove the leaf from T, and also remove its parent γ_T from T and the corresponding γ_L from L. To remove γ_T we merge its incoming and remaining outgoing edges. Due to the following lemma, we can safely remove γ_L, since it is always a leaf in L.

Lemma 6. Let γ_T be a node with two children in T, where one child is a leaf storing the position n − |W| of the longest suffix. Then γ_T is not the terminating node of any suffix link.

Proof (by contradiction). Assume there is a node γ′_T in T with a suffix link pointing to γ_T. Since γ_T has two children, γ′_T has at most two children, because the subtree of γ′_T contains a subset of the suffixes in the subtree of γ_T. Observe the child of γ_T storing the position n − |W|, i.e. the one spelling out W. One child of γ′_T should then spell out W prepended by some character. But W is already the longest suffix which exists in the window, so a longer suffix and its corresponding leaf do not exist. Then only one child of γ′_T remains, and due to path compression γ′_T does not exist in T, which contradicts the initial assumption.
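The edge merge performed when removing γ_T can be sketched on a minimal, hypothetical node representation (maps from node id to skip value, first character, and children); the skip values of the two merged edges add up, and the merged edge keeps γ_T's discriminative character:

```python
def remove_internal_node(parent, gamma, remaining_child, skip, first_char, children):
    """Splice out internal node gamma whose only remaining child is
    remaining_child, merging gamma's incoming and outgoing edges."""
    skip[remaining_child] += skip[gamma]        # merged edge sums both skips
    first_char[remaining_child] = first_char[gamma]
    children[parent][first_char[gamma]] = remaining_child
    del skip[gamma], first_char[gamma], children[gamma]
```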
At the moment of removal, the removed leaf or its parent can be the active node β. If this is the case, then B was a prefix of the removed suffix. Recall that at any time B corresponds to the longest repeated suffix of the window. Since the oldest suffix is removed by shifting the window, the new longest repeated suffix is shortened by one character, and we update B to B[2:]. To find the new β and the edge corresponding to the updated B, we simply follow the suffix link of β's parent and navigate the remainder of B from the obtained node. The navigation time is amortized over all expanding calls, so finding the new β requires amortized constant time.
To remove a leaf from L and A we require constant time in the worst case [3]. During the shift operation, no additional data structures are used. Consequently, the space complexity of the sliding suffix tree remains asymptotically unchanged. We conclude with the following theorem.

Theorem 7. The shift operation of the sliding suffix tree requires amortized constant time and preserves the Θ(|W|) space bound.

Conclusions and Open Problems
In this paper we presented the sliding suffix tree for performing online substring queries on a stream. By extending Ukkonen's online suffix tree construction algorithm, the presented data structure answers queries in worst-case optimal time and supports updates in amortized constant time. An open question remains whether the data structure can be updated in worst-case constant time. There is a well-known linear time suffix sorting lower bound [4], but to our knowledge, no per-character lower bound has been explored. Ukkonen's algorithm requires, by design, amortized constant time for updates due to the implicit buffer of unfinalized nodes. To the best of our knowledge, no other online suffix tree construction algorithm has been developed without the implicit buffer.
In this paper, we assumed an alphabet Σ of constant size in the asymptotic times for queries and updates. For an arbitrary size of Σ, the current implementation of the T data structure requires an additional factor of lg |Σ| time to determine a child at each step while maintaining the same space complexity, whereas the L and A data structures are oblivious to |Σ|. An interesting question is whether the same asymptotic times can be achieved for integer alphabets, as was done in [4] for texts of fixed length. In our case |Σ| = O(|W|), but the alphabet can change over time.
Streaming algorithms are common in heavy-throughput environments, so it seems natural to involve parallelism. Recently, two methods were introduced for performing fine-grained parallel queries on suffix trees [2, 8]. Both methods perform queries on static data structures only, and supporting the shift operation used by the sliding suffix tree might be feasible. From a more coarse-grained parallelism point of view, the current query and update operations must be executed atomically. An interesting design question is whether the data structure could be designed in a mutable way, so that a query and an update can be performed simultaneously if different parts of the data structure are involved.
Finally, the presented data structure, while theoretically feasible, should also be made competitive in practice. From our point of view, the main issue with the tree-based data structures used in the sliding suffix tree is space consumption. The majority of the space is accounted for by the auxiliary data structure used for constant time lowest common ancestor queries. Some work on practical lowest common ancestor data structures has already been done in [6]. We believe that once the data structure is succinctly implemented, it should present a viable alternative to existing solutions.