The major problem with Algorithm I is the time taken for conflict resolution. Since the worst case number of conflicts is in , an algorithm that performs a sequential resolution of each conflict can do no better than time in the worst case. We improve the algorithm by modifying both the recursion step and the conflict resolution strategy. Specifically, we still use binary partitioning, but with a non-symmetric treatment of the two branches at each recursion step: only one branch is sorted directly, and the second branch is then sorted based on the sorted results from the first. This is motivated by the observation at the end of the last section. We also use a preprocessing stage, inspired by methods in information theory, to facilitate fast conflict resolution.
4.4. Improved Sorting - Using Shannon-Fano-Elias Codes
The key to improved sorting is to reduce the number of bucket sorts in the above procedure. We do this by precomputing some information beforehand, so that the sorting can be performed on a small block of symbols at a time, rather than one symbol at a time. Let m be the block size. With the precomputed information, we can perform a comparison involving an m-block symbol in time. This will reduce the number of bucket sorts required at each level h from to , each involving symbols. By an appropriate choice of m, we can reduce the complexity of the overall sorting procedure. For instance, with , this leads to an overall worst case complexity in time for determining the suffix array of from that of . With , this gives time. We use in subsequent discussions.
The question that remains then is how to perform the required computations, such that all the needed block values can be obtained in linear time. Essentially, we need a pair-wise global partial ordering of the suffixes involved in each recursive step. First, we observe that we only need to consider the ordering between pairs of suffixes at the same level of recursion. The relative order between suffixes at different levels is not needed. For instance, using the example sequence of Figure 3, the sets of suffixes for which we need the pair-wise orderings will be those at positions: . Each subset corresponds to a level of recursion, from to . Notice that we do not necessarily need those at level , as we can directly induce the sorted order for these suffixes from , after sorting .
We need a procedure that returns the relative order between each pair of suffix positions in constant time. Given that we already have an ordering from the right tree in , we only need to consider the prefixes of the suffixes in the left tree up to the corresponding positions in , such that we can use entries in to break the tie, after a possible bucket sorts. Let be the m-length prefix of the suffix : . We can use a simple hash function to compute a representative of , for instance the polynomial hash function: where , if x is the k-th symbol in , and is the nearest prime number . The problem is that the ordering information is lost in the modulus operation. Although order-preserving hash functions exist (see [31]), these run in time on average, with few guarantees on their worst case. Also, with the m-length blocks, this may require time on average.
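As an illustration, such a polynomial hash can be sketched as follows. The names (poly_hash, PRIME), the example alphabet, and the choice of modulus are assumptions for illustration, not the paper's notation:

```python
# Hedged sketch of a polynomial hash over m-length blocks.
SIGMA = "abcd"                        # example ordered alphabet (assumption)
RANK = {c: i for i, c in enumerate(SIGMA)}
PRIME = 10**9 + 7                     # a large prime modulus (assumption)

def poly_hash(block):
    """Treat block as a base-|Sigma| number, reduced modulo PRIME."""
    h = 0
    for c in block:
        h = (h * len(SIGMA) + RANK[c]) % PRIME
    return h
```

Before the reduction, the value is strictly increasing in the lexicographic order of equal-length blocks; it is the reduction modulo PRIME that destroys this order, as noted above.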
We use an information theoretic approach to determine the ordering for the pairs. We consider the required representation for each m-length block as a codeword that can be used to represent the block. The codewords are constrained to be order preserving: that is, iff and iff , where is the codeword for sequence x. Unlike traditional source coding, where we are given one long sequence and must produce its compact representation, here we have a set of short sequences, we need to produce their respective compact representations, and these representations must be order preserving.
Let be the probability of , the m-length block starting at position i in T. Let be the probability of symbol . If necessary, we can pad T with a maximum of '$' symbols, to form a valid m-block at the end of the sequence. We compute the quantity: . Recall that , and . For a given sequence T, we should have: . However, since T may not contain all the possible length blocks in , we need to normalize the product of probabilities to form a probability space:
To determine the code for , we then use the cumulative distribution function (cdf) for the 's, and determine the corresponding position for each in this cdf. Essentially, this is equivalent to dividing a number line in the range [0, 1], such that each is assigned a range proportional to its probability, . See Figure 4. The total number of divisions will be equal to the number of unique m-length blocks in T. The problem then is to determine the specific interval on this number line that corresponds to , and to choose a tag to represent .
Figure 4.
Code assignment by successive partitioning of a number line.
We use the following assignment procedure to compute the tag, . First we determine the interval for the tag, from which we then compute the tag itself. Define the cumulative distribution function for the symbols in : . The symbol probabilities, 's, are simply obtained based on the 's. For each symbol in Σ, we have an open interval in the cdf: . Now, given the sequence , , the procedure steps through the sequence. At each step along the sequence, we compute and , the respective current upper and lower ranges for the tag, using the following relations:

The procedure stops at , and the values of and at this final step give the range of the tag, . We can choose the tag as any number in the range: . Thus we choose as the midpoint of the range at the final step: .
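A minimal sketch of this interval-narrowing tag assignment, using exact rationals to avoid floating-point issues; the function and variable names are illustrative assumptions:

```python
from fractions import Fraction

def sfe_tag(block, p):
    """Shannon-Fano-Elias-style tag for an m-length block.
    p maps each symbol to its probability; symbols are taken in
    lexicographic order to build the cdf. Returns (low, high, tag)."""
    # cdf lower bound F(s) = sum of probabilities of symbols before s
    F, acc = {}, Fraction(0)
    for s in sorted(p):
        F[s] = acc
        acc += p[s]
    low, high = Fraction(0), Fraction(1)
    for c in block:
        width = high - low
        low = low + width * F[c]      # new lower bound for the tag
        high = low + width * p[c]     # new upper bound for the tag
    return low, high, (low + high) / 2   # tag = midpoint of final interval
```

With equal symbol probabilities on a four-letter alphabet, the tags come out in the same order as the blocks themselves, as Lemmas 2 and 3 below require.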
Figure 5(a) shows an example run of this procedure for a short sequence: with a simple alphabet, , where each symbol has an equal probability . This gives .
Figure 5.
Code assignment procedure, using an example sequence. The vertical line represents the current state of the number line. The current interval at each step in the procedure is shown with a darker shade. The symbol considered at each step is listed under their respective number lines (a) Using the sequence . (b) Evolution of code assignment procedure, after removing the first symbol (in previous sequence acabd), and bringing in a new symbol a to form a new sequence: .
Lemma 2: The tag assignment procedure results in a tag that is unique to .
Proof. The procedure described can be seen as an extension of the Shannon-Fano-Elias coding procedure used in source coding and information theory [32]. Each tag is analogous to an arithmetic coding sequence of the given m-length block, . The open interval defined by for each m-length sequence is unique to the sequence.
To see this uniqueness, we notice that the final number line at step m represents the cdf for all m-length blocks that appeared in the original sequence T. Denote this cdf for the m-blocks as . Given , the i-th m-block in T, the size of its interval is given by . Since all probabilities are positive, we see that whenever . Therefore, determines uniquely. Thus, serves as a unique code for . Choosing any number within the upper and lower bounds for each define a unique tag for . Thus the chosen tag defined by the midpoint of this interval is unique to . □
Lemma 3: The tags generated by the assignment procedure are order preserving.
Proof. Consider the ordering of the tags for different m-length blocks. Each step in the assignment procedure uses the same fixed order of the symbols on a number line, based on their order in Σ. Thus, the position of the upper and lower bounds at each step depends on the previous symbols considered, and the position of the current symbol in the ordered list of symbols in Σ. Therefore the ’s are ordered with respect to the lexicographic ordering of the ’s: , and . □
Lemma 4: Given , all the required tags , can be computed in time.
Proof. Suppose we have already determined and for the m-block as described above. For efficient processing, we can compute and the tag in the number line, using the previous values for and . This is based on the fact that and are consecutive positions in T. (In practice, we need only a fraction of the positions in T, which means less time and space are required; here we describe the procedure for the entire T, since the complexity remains the same.) In particular, given and , we compute as:

Then, we compute using Equation (1). Thus, all the required 's can be computed in time.
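The constant-time update behind this argument can be sketched as a right-to-left sliding-window product. The function name and the use of exact rationals are illustrative assumptions:

```python
from fractions import Fraction

def block_probs(T, m, p):
    """Probability of every m-block of T, with one multiply and one
    divide per position: P(B_i) = P(B_{i+1}) * p(T[i]) / p(T[i+m])."""
    n = len(T)
    P = [Fraction(0)] * (n - m + 1)
    prob = Fraction(1)
    for c in T[n - m:]:                  # last block, computed directly
        prob *= p[c]
    P[n - m] = prob
    for i in range(n - m - 1, -1, -1):   # slide the window one step left
        prob = prob * p[T[i]] / p[T[i + m]]
        P[i] = prob
    return P
```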
Similarly, given the tag for , and its upper and lower bounds and , we can compute the new tag for the incoming m-block, , based on the structure of the assignment procedure used to compute (see Figure 5(b)). We compute the new tag by first computing its upper and lower bounds. Denote the respective upper and lower bounds for as: . Similarly, we use for the respective bounds for . Let be the first symbol in . Its probability is given by . Also, let be the new symbol that is shifted in. Its probability is given by , and we also know its position in the cdf. We first compute the intermediate bounds at step when using , namely:
Multiplying by changes the probability space from the previous range of to . After these computations, we can then perform the last step in the assignment procedure to determine the final range for the new tag:
The tag is then computed as the average of the two bounds as before. The worst case time complexity of this procedure is in . The component comes from the time needed to sort the unique symbols in T before computing the cdf. This can be performed in linear time using counting sort. Since , this gives a worst case time bound of to compute the required codes for all the m-length blocks. □
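The O(1) shift in this proof can be sketched as follows: undo the contribution of the outgoing first symbol x, then apply the incoming symbol y as one final narrowing step. The function name and the (p, F) representation of the cdf are illustrative assumptions:

```python
from fractions import Fraction

def shift_tag(low, high, x, y, p, F):
    """Given the final interval [low, high) for block x.w, return the
    interval and tag of the shifted block w.y in constant time.
    p: symbol -> probability; F: symbol -> cdf lower bound."""
    # Undo the first narrowing step (symbol x): rescale the interval
    # from inside [F(x), F(x) + p(x)) back to [0, 1).
    low_w = (low - F[x]) / p[x]
    high_w = (high - F[x]) / p[x]
    # Apply the last narrowing step for the incoming symbol y.
    width = high_w - low_w
    new_low = low_w + width * F[y]
    new_high = new_low + width * p[y]
    return new_low, new_high, (new_low + new_high) / 2
```

On the running example, shifting from acabd to cabda this way yields exactly the interval the full five-step procedure would produce.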
Figure 5(b) shows a continuation of the previous example, with the old m-block: , and a new m-block . That is, the new symbol a has been shifted in, while the first symbol in the old block has been shifted out. We observe that the general structure in Figure 5(a) is not changed by the incoming symbol, except at the first step and the last step. For the running example, the new value will be .
Table 2 shows the evolution of the upper and lower bounds for the two adjacent m-blocks. The bounds are obtained from the figures.
Table 2.
Upper and lower bounds on the current interval on the number line.
| a | c | a | b | d | c | a | b | d | a |
| 1 | | | | | 1 | | | | |
| 0 | 0 | | | | 0 | | | | |
| 1 | | | | | 1 | | | | |
Having determined , which is fractional, we can then assign the final code for by mapping the tags to an integer in the range . This can be done using a simple formula: where , and . Notice that the 's computed here will not necessarily be consecutive, but they will be ordered. Also, the number of distinct 's is at most n. The difference between and will depend on and . The floor function, however, could break the lexicographic ordering. A better approach is to simply record the position where each fell on the number line. We then read off these positions from 0 to 1, and use the count at which each is encountered as its code. This is easily done using the cumulative count of occurrence of each distinct . Since the 's are implicitly sorted, so are the 's. We have thus obtained an ordering of all the m-length substrings in T. This is still essentially a partial ordering of all the suffixes, based on their first m symbols, but a total order on the distinct m-length prefixes of the suffixes.
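The rank assignment just described can be sketched as follows. For brevity this sketch sorts the distinct tags, whereas the text reads the recorded positions off the number line with cumulative counts to stay within the linear-time bound; the function name is an illustrative assumption:

```python
def integer_codes(tags):
    """Map fractional tags to dense, order-preserving integer codes.
    Identical tags (identical m-blocks) share a code; codes start at 1."""
    rank = {t: r + 1 for r, t in enumerate(sorted(set(tags)))}
    return [rank[t] for t in tags]
```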
We now present our main results in the following theorems:
Theorem 2: Given the sequence , from a fixed alphabet Σ, , all the m-length prefixes of the suffixes of T can be ordered in linear time and linear space in the worst case.
Proof. The theorem follows from Lemmas 2 to 4. The correctness of the ordering follows from Lemma 2 and Lemma 3. The time complexity follows from Lemma 4. What remains is to prove the space complexity. We only need to maintain two extra arrays, one for the number line at each step, and the other to keep the cumulative distribution function. Thus, the space needed is also linear in . □
Theorem 3: Given the sequence , with symbols from a fixed alphabet Σ, , Algorithm II computes the suffix array of T in worst case time and space.
Proof. At each iteration, the recursive call applies only to the suffixes in . Thus, the running time for the algorithm is given by the solution to the recurrence : . This gives . Combined with Theorem 2, this establishes the linear time bound for the overall algorithm.
We improve the space requirement using an alternative merging procedure. Since we now have and , we can modify the merging step by exploiting the fact that any conflict that can arise during the merging can be resolved using only . To resolve a conflict between suffix in and suffix in , we need to consider two cases:
Case 1: If , we compare versus , since the relative order of both and are available from .
Case 2: If , we compare versus . Again, for this case, the tie is broken using the triplet, since the relative order of both and are also available from .
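As an illustrative sketch (not the exact procedure above, which only has ranks for one side's suffixes and hence needs the two cases), the following merges two sorted suffix lists, resolving each conflict with one symbol comparison plus one rank lookup; all names are assumptions:

```python
def merge_suffix_lists(SA_L, SA_R, T, ISA):
    """Merge two sorted lists of suffix start positions. A conflict
    between suffix i and suffix j is resolved by comparing the pair
    (T[i], rank of suffix i+1) with (T[j], rank of suffix j+1).
    Assumes ISA ranks every suffix; the sentinel -1 handles the end."""
    def key(i):
        return (T[i], ISA[i + 1] if i + 1 < len(T) else -1)
    out, a, b = [], 0, 0
    while a < len(SA_L) and b < len(SA_R):
        if key(SA_L[a]) <= key(SA_R[b]):
            out.append(SA_L[a]); a += 1
        else:
            out.append(SA_R[b]); b += 1
    return out + SA_L[a:] + SA_R[b:]
```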
Consider the step just before we obtain from , as needed to obtain the final . We needed the codes for the m-blocks in sorting to obtain . Given the 1:2 non-symmetric partitioning used, at this point the number of such m-blocks needed by the algorithm will be . These require integers to store. We need integers to store . At this point, we also still have the inverse SA used to merge the left and right suffix arrays to form . This requires integers for storage. Thus, the overall space needed at this point will be integers, in addition to the space for T. However, after getting , we no longer need the integer codes for the m-length blocks. Also, the merging does not involve , so this need not be computed. Thus, we compute , and re-use the space for the m-block codes. requires integers. Further, since we are merging and from the same direction, we can construct the final SA in-place, by re-using part of the space used for the already merged sections of and (see for example [33,34]). Thus, the overall space requirement in bits will be , where we need bits to store T, bits for the output suffix array, and bits for . □
The above translates to a total space requirement of bytes, using standard assumptions of 4 bytes per integer, and 1 byte per symbol.
Though the space above is , the bits used to store could be reduced further. We do this by making some observations on the two cases encountered during merging. We notice that after obtaining and , we do not really need to store the text T anymore. The key observation is that the merging of and proceeds in the same direction for each array, for instance, from the least to the largest suffix. Thus, at the k-th step, the symbol at position (that is, ) can easily be obtained using , and two arrays, namely : which stores the symbols in Σ in lexicographic order, and : which stores the cumulative count for each symbol in .
For , we compute and use the value as an index into . We then use the position in the array to determine the symbol value from . Similarly, we obtain the symbol , using a second set of arrays. For symbol we do not have . However, we can observe that symbol will be some symbol in . Hence, we can use and to determine the symbol, as described above.
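A sketch of this symbol recovery from cumulative counts; the array names SYM and CNT are illustrative assumptions, standing in for the two arrays just described:

```python
import bisect

def symbol_at_rank(r, SYM, CNT):
    """Recover the symbol owning lexicographic rank r (0-based) among
    all suffix positions, given SYM (symbols of Sigma in lexicographic
    order) and CNT (cumulative occurrence counts of SYM[0..i] in T).
    Ranks 0..CNT[0]-1 belong to SYM[0], CNT[0]..CNT[1]-1 to SYM[1], etc."""
    return SYM[bisect.bisect_right(CNT, r)]
```

For example, with T = banana$, SYM = ['$', 'a', 'b', 'n'] and CNT = [1, 4, 5, 7]: rank 0 maps to '$', ranks 1-3 to 'a', rank 4 to 'b', and ranks 5-6 to 'n'.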
Thus, we can now release the space currently used to store T and use part of it to store , and then merge and using and the two sets of arrays. The space saving gained in doing this will be bits. Using this in the previous space analysis leads to a final space requirement of bits. This gives bytes, assuming 4 bytes per integer.
Finally, since we no longer need , we can release the space it occupies. We compute a new set of and arrays (in place) for the newly computed SA. The second set of arrays is no longer needed. Using SA and the new and arrays, we can now recover the original sequence T, at no extra space cost.