
Suffix-Sorting via Shannon-Fano-Elias Codes

Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506-6109, USA
* Author to whom correspondence should be addressed.
Algorithms 2010, 3(2), 145-167; https://doi.org/10.3390/a3020145
Submission received: 23 December 2009 / Revised: 18 March 2010 / Accepted: 18 March 2010 / Published: 1 April 2010
(This article belongs to the Special Issue Data Compression)

Abstract
Given a sequence T = t_0 t_1 … t_{n−1} of size n = |T|, with symbols from a fixed alphabet Σ (|Σ| ≤ n), the suffix array provides a listing of all the suffixes of T in lexicographic order. Given T, the suffix sorting problem is to construct its suffix array. The direct suffix sorting problem is to construct the suffix array of T directly, without using the suffix tree data structure. While algorithms for linear-time, linear-space direct suffix sorting have been proposed, the actual constant in the linear space is still a major concern, given that the applications of suffix trees and suffix arrays (such as whole-genome analysis) often involve huge data sets. In this work, we reduce the gap between current results and the minimal space requirement. We introduce an algorithm for the direct suffix sorting problem with worst case time complexity in O(n), requiring only (1⅔ n log n − n log|Σ| + O(1)) bits of memory space. This implies a total space requirement of 5⅔n + O(1) bytes (including space for both the output suffix array and the input sequence T), assuming n ≤ 2^32, |Σ| ≤ 256, and 4 bytes per integer. The basis of our algorithm is an extension of the Shannon-Fano-Elias codes used in source coding and information theory. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem.

1. Introduction

The suffix array provides a compact representation of all the n suffixes of T in lexicographic order. The sequence of positions in T of the first symbols of the sorted suffixes, taken in this lexicographic order, is the suffix array of the sequence. The suffix sorting problem is to construct the suffix array of T. Manber and Myers [1] were the first to propose an O(n log n) algorithm to construct the suffix array, using three to five times less space than the traditional suffix tree. Other methods for fast suffix sorting in O(n log n) time have been reported in [2], while memory-efficient constructions were considered in [3]. Puglisi et al. [4] provide a detailed comparison of different recently proposed linear-time algorithms for suffix sorting.
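For concreteness, the definition can be checked with a naive construction that simply sorts the suffix start positions by their suffixes (our own illustration, not one of the algorithms discussed here; it takes O(n² log n) time in the worst case):

```python
def naive_suffix_array(T: str) -> list[int]:
    # Sort the start positions 0..n-1 by the lexicographic order of the
    # suffixes that begin there. Each comparison may inspect O(n) symbols,
    # hence O(n^2 log n) worst case; linear-time algorithms avoid this.
    return sorted(range(len(T)), key=lambda i: T[i:])

# The running example used later in the paper:
print(naive_suffix_array("aaaabaaaabxaaaab"))
```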
Suffix trees and suffix arrays have drawn significant attention in recent years due to their theoretical linear time and linear space constructions, and their logarithmic search performance. Gusfield [5] provides a detailed study of the use of suffix trees in the analysis of biological sequences. Suffix sorting is also an important problem in data compression, especially for compression schemes that are based on the Burrows-Wheeler Transform [6,7]. In fact, it is known that the suffix sorting stage is a major bottleneck in BWT-based compression schemes [6,7,8]. The suffix array is usually favored over the suffix tree due to its smaller memory footprint. This is important given that the applications of suffix trees and suffix arrays (such as in whole-genome sequence analysis) often involve huge data sets. An important problem therefore is to construct suffix arrays in linear time, while requiring only the space for the input sequence in addition to the n log n bits that are the minimal space needed to hold the suffix array. This translates to 5n bytes of total space, assuming n ≤ 2^32, |Σ| = 256, and 4 bytes per integer. Existing algorithms, such as the KS algorithm [9], require 13n bytes, while the KA algorithm [10] requires 10n bytes.
This paper presents a space-efficient linear time algorithm for solving the direct suffix sorting problem, using only 5⅔n bytes. The basis of the algorithm is an extension of the Shannon-Fano-Elias codes popularly used in arithmetic coding. In the next section, we provide background on the problem and briefly describe related work. Section 3 presents our basic algorithm for suffix sorting. Section 4 improves the complexity of the basic algorithm using methods from source coding and information theory.

2. Background

The BWT [6,7] performs a permutation of the characters of the input sequence, such that characters in lexically similar contexts are placed near each other. Thus, the BWT is very closely related to the suffix tree and the suffix array, two important data structures used in pattern matching, analysis of sequence redundancy, and data compression. The major link is the fact that the BWT provides a lexicographic sorting of the contexts as part of the permutation of the input sequence. The recent book [7] provides a detailed treatment of the link between the BWT and suffix arrays and suffix trees. Apart from the BWT, other popular compression schemes have also been linked with the suffix tree data structure. For instance, the PPM* compression scheme was implemented using suffix trees and suffix tries in [11]. The sequence decomposition step required for constructing the dictionary in LZ-based schemes can be performed using tree structures, such as binary search trees or suffix tries [12]. A detailed analysis of the use of suffix trees in data compression is provided by Szpankowski [13]. When decorated with the array of longest common prefixes, the suffix array can be used in place of the suffix tree in almost any situation where a suffix tree is used [14]. This capability is very important for applications that require huge storage, given the smaller memory requirement of the suffix array. An important issue is how to devise efficient methods for constructing the suffix array without the suffix tree.
Some methods for constructing suffix arrays first build the suffix tree, and then construct the suffix array by performing an inorder traversal of the suffix tree. Farach et al. [15] proposed a divide and conquer method to construct the suffix tree for a given sequence in linear time. The basic idea is to divide the sequence into odd and even sequences, based on the positions of the symbols. The suffix tree is then constructed recursively for the odd sequence. Using the suffix tree for the odd sequence, they construct the suffix tree for the even sequence. The final step merges the suffix trees of the odd and even sequences into one suffix tree using a coupled depth-first search. The result is a linear time algorithm for suffix tree construction. This divide and conquer approach is the precursor of many recent algorithms for direct suffix sorting.
Given the memory requirement and implementation difficulty of the suffix tree, it is desirable to construct the suffix array directly, without using the suffix tree. Also, for certain applications such as in data compression where only the suffix array is needed, avoiding the construction of the suffix tree will have some advantages, especially with respect to memory requirements. More importantly, direct suffix sorting without the suffix tree raises some interesting algorithmic challenges. Thus, more recently, various methods have been proposed to construct the suffix array directly from the sequence [9,10,16,17,18], without the need for a suffix tree.
Kim et al. [16] followed an approach similar to Farach's above, but for the purpose of constructing the suffix array directly. They introduce the notion of equivalence classes between strings, which they use to perform coupled depth-first searches at the merging stage. In [9], a divide and conquer approach similar to Farach's method was used, but for direct construction of the suffix array. Here, rather than dividing the sequence into two symmetric parts, the sequence was divided into two unequal parts, by considering suffixes that begin at positions (i mod 3 ≠ 0) in the sequence. These suffixes are recursively sorted, and the remaining suffixes are then sorted based on information in the first part. The two sorted sets of suffixes are then combined using a merging step to produce the final suffix array. Thus, a major difference is in the way they divided the sequences into two parts, and in the merging step.
Ko and Aluru [10] also used recursive partitioning, but followed a fundamentally different approach to construct the suffix array in linear space and linear time. They use a binary marking strategy whereby each suffix in T is classified as either an S-suffix or an L-suffix, depending on its relative order with respect to its right neighbor. Let T_i = t_i t_{i+1} t_{i+2} … t_{n−1} denote the suffix of sequence T starting at position i. An S-suffix is a suffix that is lexicographically smaller than its right neighbor in T, while an L-suffix is one that is lexicographically larger than its right neighbor. That is, T_i is an S-suffix if T_i ≺ T_{i+1}; otherwise T_i is an L-suffix. This classification is motivated by the observation that an S-suffix is always lexicographically greater than any L-suffix that starts with the same first character. The two types of suffixes are then treated differently: the S-suffixes are sorted recursively by performing some special distance computations, and the L-suffixes are then sorted using the sorted order of the S-suffixes. The classification scheme is very similar to the approach earlier used by Itoh and Tanaka [19], but the algorithm in [19] runs in O(n log n) time on average.
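A minimal sketch of the S/L classification (our rendering of the idea, not the KA algorithm itself): scanning right to left resolves the case of equal adjacent symbols in O(1) amortized time per suffix, since comparing T_i with T_{i+1} when t_i = t_{i+1} reduces to the type of T_{i+1}.

```python
def classify_suffixes(T: str) -> list[str]:
    # 'S' if T_i < T_{i+1}, 'L' if T_i > T_{i+1}. The last suffix is taken
    # as S-type by convention (it is followed by the sentinel $).
    n = len(T)
    types = ["S"] * n
    for i in range(n - 2, -1, -1):
        if T[i] < T[i + 1]:
            types[i] = "S"
        elif T[i] > T[i + 1]:
            types[i] = "L"
        else:
            types[i] = types[i + 1]  # equal symbols: inherit the neighbor's type
    return types

print(classify_suffixes("aaaabaaaabxaaaab"))
```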
Lightweight suffix sorting algorithms, which pay particular attention to the extra working space required, have also been proposed [3,17,20,21]. Working space excludes the space for the output suffix array. In general, the suffix array requires n log n bits of space, while the input text requires another n log|Σ| bits, or in the worst case n log n bits. In [22], an algorithm was proposed that runs in O(n log n) worst case time, requiring O(n log n + (n/log n)) working space, while Hon et al. [20] constructed suffix arrays for integer alphabets in O(n log n) time, using O(n log n)-bit space. Nong and Zhang [17] combined ideas from the KS algorithm [9] and the KA algorithm [10] to develop a method that works in O(n log|Σ|) time, using (n log|Σ| + |Σ| log n)-bit working space (without the output suffix array). In [23,24] they extended the method to use 2n + O(1) bytes of working space (or a total space of 7n + O(1) bytes, including space for the suffix array and the original sequence), by exploiting special properties of the L and S suffixes used in the KA algorithm, and of the leftmost S-type substrings. In-place suffix sorting was considered in [25], where suffix sorting was performed for strings with symbols from a general alphabet using O(1) working space in O(n log n) time. In other related work [2,26], computing the BWT was viewed as a suffix sorting problem. Okanohara and Sadakane [26] modified a suffix sorting algorithm to compute the BWT using a working space of O(n log|Σ| log log_{|Σ|} n) bits. There have also been various efforts on compressed suffix arrays and compressed suffix trees as a means to tackle the space problem of suffix trees and suffix arrays (see [27,28,29] for examples). We do not consider compressed data structures in this work. A detailed survey of suffix array construction algorithms is provided in [4].

2.1. Main Contribution

We propose a divide-and-conquer sort-and-merge algorithm for direct suffix sorting on a given input string. Given a string of length n, our algorithm runs in O(n) worst case time and space. The algorithm recursively divides the input sequence into two parts, performs suffix sorting on the first part, then sorts the second part based on the sorted suffixes from the first. It then merges the two smaller sorted arrays to produce the final sorted suffix array. The method is unique in its use of Shannon-Fano-Elias codes for the efficient construction of a global partial order on the suffixes. To our knowledge, this is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem.
Our algorithm also differs from previous approaches in its use of a simple partitioning step, and in how it exploits this simple partitioning scheme for conflict resolution. The total space requirement for the proposed algorithm is (1⅔ n log n − n log|Σ|) + O(1) bits, including the space for the output suffix array and for the original sequence. Using the standard assumption of 4 bytes per integer, with |Σ| ≤ 256 and n ≤ 2^32, we get a total space requirement of 5⅔n bytes, including the n bytes for the original string and the 4n bytes for the output suffix array. This is a significant improvement when compared with other algorithms, such as the 10n bytes required by the KA algorithm [10], or the 13n bytes required by the KS algorithm [9].

3. Algorithm I: Basic Algorithm

3.1. Notation

Let T = t_0 t_1 t_2 … t_{n−1} be the input sequence of length n, with symbol alphabet Σ = {σ_0, σ_1, …, σ_{|Σ|−1}}. Let T_i = t_i t_{i+1} t_{i+2} … t_{n−1} denote the suffix of T starting at position i (i = 0, 1, 2, …, n−1). Let T[i] = t_i denote the i-th symbol in T. For any two strings α and β, we use α ≺ β to denote that α precedes β in lexicographic order. (Clearly, α and β could be individual symbols from the same alphabet, i.e., |α| = |β| = 1.) We use $ as the end-of-sequence symbol, where $ ∉ Σ and $ < σ, ∀σ ∈ Σ. Further, we use SA to denote the suffix array of T, and S to denote the sorted list of first characters of the suffixes. Essentially, S[i] = T[SA[i]]. Given SA, we use SA′ to denote its inverse, defined as follows: SA′[i] = k if SA[k] = i; i, k = 0, 1, …, n−1. That is, S[k] = S[SA′[i]] = T[i]. We use p_i to denote the probability of symbol T[i], and P_i the probability of the substring T[i … i+m−1], the m-length substring starting at position i.
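The relationships between SA, its inverse SA′, and S can be checked directly; a small sketch (using the naive construction for SA):

```python
def inverse_sa(SA: list[int]) -> list[int]:
    # SA'[i] = k iff SA[k] = i: the rank of the suffix starting at i.
    inv = [0] * len(SA)
    for k, i in enumerate(SA):
        inv[i] = k
    return inv

T = "aaaabaaaabxaaaab"
SA = sorted(range(len(T)), key=lambda i: T[i:])
SAinv = inverse_sa(SA)
S = [T[SA[i]] for i in range(len(T))]          # sorted first characters
assert all(S[SAinv[i]] == T[i] for i in range(len(T)))
```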

3.2. Overview

We take the general divide and conquer approach:
  • Divide the sequence into two groups;
  • Construct the suffix array for the first group;
  • Construct the suffix array for the second group;
  • Merge the suffix arrays from the two groups to form the suffix array for the parent sequence;
  • Perform the above steps recursively to construct the complete suffix array for the entire sequence.
Figure 1 shows a schematic diagram of the working of the basic algorithm using an example sequence, T = aaaabaaaabxaaaab. The basic idea is to recursively partition the input sequence in a top-down manner into two equal-length subsequences, according to the odd and even positions in the sequence. After reaching subsequences of length ≤ 2, the algorithm then recursively merges and sorts the subsequences in a bottom-up manner, based on the partial suffix arrays from the lower levels of the recursion. Thus, the algorithm does not start the merging procedure until it reaches the last partition on a given branch.
Each block in the figure contains two rows. The first row indicates the positions of the current block of symbols in the original sequence T. The second row indicates the current symbols. The current symbols are unsorted in the downstream dividing blocks and sorted in the upstream merging blocks. To see how the algorithm works, start from the top left block and follow the solid division arrows, the horizontal trivial-sort arrow (dotted arrow), and then the dashed merge arrows. The procedure ends at the top right block. We briefly describe each algorithmic step in the following. Later, we modify this basic algorithm for improved complexity results.
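The control flow of the recursion can be sketched as follows (our illustration; ties between equal symbols are broken here by direct suffix comparison, which the conflict-resolution machinery described below replaces):

```python
def suffix_sort(T: str, positions=None) -> list[int]:
    # Recursive odd/even divide-and-merge skeleton of Algorithm I.
    # `positions` are global positions in T; the split is by parity
    # within the current subsequence.
    if positions is None:
        positions = list(range(len(T)))
    if len(positions) <= 2:
        return sorted(positions, key=lambda i: T[i:])   # base case
    sa1 = suffix_sort(T, positions[0::2])   # T1: even positions
    sa2 = suffix_sort(T, positions[1::2])   # T2: odd positions
    return merge(T, sa1, sa2)

def merge(T: str, sa1: list[int], sa2: list[int]) -> list[int]:
    # Merge two sorted position lists; a tie on the first symbols is a
    # "conflict", resolved naively here by comparing whole suffixes.
    out, k, l = [], 0, 0
    while k < len(sa1) and l < len(sa2):
        if T[sa1[k]:] < T[sa2[l]:]:
            out.append(sa1[k]); k += 1
        else:
            out.append(sa2[l]); l += 1
    return out + sa1[k:] + sa2[l:]

print(suffix_sort("aaaabaaaabxaaaab"))
```

Note that, as in the paper, the order produced at every level is global with respect to T, since the base case and the tie-breaks compare full suffixes of T.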
Figure 1. Basic working of the proposed algorithm using an example. The original sequence is indicated at the top left. The sorted sequence and the suffix array are indicated in the top right box.

3.3. Algorithm I

3.3.1. Divide T

If |T| > 2, divide T into two subsequences, T1 and T2. T1 contains all the symbols at the even positions of T; T2 contains all the symbols at the odd positions of T. That is, T1 = {T[j], j ∈ [0, |T|) | j mod 2 = 0}, and T2 = {T[j], j ∈ [0, |T|) | j mod 2 = 1}.

3.3.2. Merge SA of T 1 and SA of T 2

Let SA1 and SA2 be the suffix arrays of T1 and T2, respectively, and let SA be the suffix array of T. If T1 and T2 have been sorted to obtain their respective suffix arrays SA1 and SA2, then we can merge SA1 and SA2 in linear time on average to form the suffix array SA. Without loss of generality, we assume SA1 = a_0 a_1 a_2 … a_u, SA2 = b_0 b_1 b_2 … b_v, and SA = c_0 c_1 c_2 … c_{u+v} are the sorted indices of T1, T2, and T, respectively. Given a_k, we use â_k to denote its corresponding position in T; that is, T1[SA1[k]] = T1[a_k] = T[â_k], and similarly for b̂_l. â_k and b̂_l are easily obtained from a_k and b_l, based on the level of recursion and on whether we are on a left branch or a right branch. Starting from k = 0 (0 ≤ k ≤ u), l = 0 (0 ≤ l ≤ v), and g = 0 (0 ≤ g ≤ u+v), we compare the partially ordered subsequences using a_k and b_l, viz.
If T[â_k] ≺ T[b̂_l]:  SA[c_g] ← â_k;  S[c_g] ← T[â_k];  k++, g++
If T[b̂_l] ≺ T[â_k]:  SA[c_g] ← b̂_l;  S[c_g] ← T[b̂_l];  l++, g++
If T[â_k] = T[b̂_l]:  ĉ_x ← ResolveConflict(â_k, b̂_l);  SA[c_g] ← ĉ_x;  S[c_g] ← T[ĉ_x];  g++
Whenever we compare two symbols, T[â_k] from T1 and T[b̂_l] from T2, we might get into the ambiguous situation whereby the two symbols are the same (i.e., T[â_k] = T[b̂_l]). Thus, we cannot easily decide which suffix precedes the other (i.e., whether T_{â_k} ≺ T_{b̂_l} or T_{b̂_l} ≺ T_{â_k}) based on the individual symbols. We call this situation a conflict. The key to the approach is how efficiently we can resolve potential conflicts as the algorithm progresses. We exploit the nature of the division procedure, and the fact that we are dealing with substrings (suffixes) of the same string, for efficient conflict resolution. Thus, the result of the merging step will be a partially sorted subsequence, based on the sorted order of the smaller child subsequences. An important difference here is that, unlike in other related approaches [9,10,16], our sort order at each recursion level is global with respect to T, rather than being local to the subsequence at the current recursion step. This is important, as it significantly simplifies the subsequent merging step.
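A conflict and its conflict length (defined formally in Section 3.4) can be computed explicitly; a small sketch:

```python
def conflict_length(T: str, i: int, j: int) -> int:
    # Length of the common prefix of suffixes T_i and T_j; the conflict
    # CP(i, j, l_k) is resolved at the first mismatching offset l_k.
    l = 0
    while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
        l += 1
    return l

T = "aaaabaaaabxaaaab"
print(conflict_length(T, 0, 12))   # 3, i.e., CP(0, 12, 3) of Table 1
```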

3.3.3. Recursive Call

Using the above procedure, we recursively construct the suffix array of T1 from its two children, T11 and T12. Similarly, we obtain the suffix array for T2 from its children, T21 and T22. We follow this recursive procedure until the base case is reached (when the length of the subsequence is ≤ 2).

3.4. Conflict Resolution

We use the notions of conflict sequences and conflict pairs. Two suffixes T_i and T_j form a conflict sequence in T if T_i[0…k] = T_j[0…k] (that is, T[i…i+k] = T[j…j+k]) for some k ≥ 1. Thus, the two suffixes cannot be assigned a total order after considering their first k symbols. We say that the conflict between T_i and T_j is resolved whenever T[i…i+l_k−1] = T[j…j+l_k−1] and T[i+l_k] ≠ T[j+l_k]. Here, l_k is called the conflict length. We call the triple (T_i, T_j, l_k) a conflict pair, or conflict sequence. We use the notation CP(i, j, l_k) to denote a conflict pair T_i and T_j with a conflict length of l_k. We also use CP(i, j) to denote a conflict pair whose conflict length is yet to be determined, or is not important in the given context.
The challenge, therefore, is how to resolve conflicts efficiently within the proposed recursive partitioning framework. Obviously, l_k ≤ n, the sequence length. Thus, the minimum distance from the start of each conflicting suffix to the end of the sequence (min{n − i, n − j}), or the distance to an already sorted symbol, determines how long it will take to resolve a conflict. Here, we consider how previously resolved conflicts can be exploited for the quick resolution of other conflicts. We maintain a traceback record of such previously resolved conflicts in a conflict tree, or a conflict table.
Figure 2 shows the conflict tree for the example sequence T = aaaabaaaabxaaaab. Without loss of generality, and for easier exposition of the basic methodology, we assume that n = 2^x for some positive integer x. Our conflict resolution strategy is based on the following properties of conflict sequences and conflict trees.
  • Given the even-odd recursive partitioning scheme, at any given recursion level, conflicts can only occur between specific sets of positions in the original sequence T. For instance, at the lowest level, conflicts can only occur between T_0 and T_{n/2}, or more generally, between T_i and T_{i+n/2}. See Figure 2.
  • Given the conflict sequence T_i and T_j, the corresponding conflict pair CP(i, j, l_k) is unique in T. That is, only one conflict pair can have the pair of start positions (i, j). Thus the conflict pairs CP(i, j, l_k) and CP(j, i, l_k) are equivalent; hence, we represent both as CP(i, j, l_k), where i < j.
  • Consider CP(i, j) at level h, 0 ≤ h ≤ log n, in the tree. We define its neighbors as the conflict pairs CP(i′, j′) with (i′ = i + 2^q, j′ = j + 2^q) or (i′ = i − 2^q, j′ = j − 2^q), where q ∈ {h, h−1, h−2}, q ≥ 0, and i′, j′ ≤ n. Essentially, neighboring conflicts are found on the same level of the conflict tree, from the leftmost node to the current node. For example, for R6 = CP(2, 14) at h = 2, the neighbor will be {R2 = CP(0, 12)}. Notice that, by our definition, not all conflicts in the same node are neighbors.
Figure 2. Conflict tree for the example sequence T = aaaabaaaabxaaaab. The original sequence is indicated at the root node. The notation R_i → R_j indicates that conflict pair R_i is resolved by R_j after a certain number of steps, given that R_j has been resolved. Conflict pairs are labeled following a depth-first traversal of the tree.
Given the conflict tree, we can determine whether a given conflict pair CP(i, j) can be resolved based on previous traceback information (i.e., previously resolved conflicts). From the conflict tree, we see that this determination can be made by checking only a fixed number of neighboring conflict pairs. In fact, from the definition above, the size of the neighborhood is at most 6. For CP(i, j), the neighbors will be the conflict pairs CP(i′, j′) with (i′ = i − 2^h, j′ = j − 2^h); (i′ = i + 2^h, j′ = j + 2^h); (i′ = i − 2^{h−1}, j′ = j − 2^{h−1}); (i′ = i + 2^{h−1}, j′ = j + 2^{h−1}); (i′ = i − 2^{h−2}, j′ = j − 2^{h−2}); and (i′ = i + 2^{h−2}, j′ = j + 2^{h−2}). Some of these pairs of indices may not be valid indices for a conflict pair. For example, using the previous example, R6 = CP(2, 14) has only one valid neighbor, {R2 = CP(0, 12)}. Also, given that we resolve the conflicts in depth-first traversal order, neighbors of CP(i, j) that are located on a node to the right of CP(i, j) in the conflict tree cannot be used to resolve CP(i, j). Therefore, in some cases, fewer than six probes may be needed.
Let CP(i, j) be the current conflict, and let CP(i′, j′, l_k′) be a previously resolved neighboring conflict. We determine whether CP(i, j) can be resolved based on CP(i′, j′, l_k′) using a simple check. Let Δ = (i′ − i) = (j′ − j). If Δ is negative, then CP(i, j) can be resolved with CP(i′, j′, l_k′) iff |Δ| ≤ l_k′. Conversely, if Δ is positive, then CP(i, j) can be resolved with CP(i′, j′, l_k′) iff l_k′ ≥ 0, after at most Δ extra comparison steps. If either of the two conditions holds, we say that CP(i, j) is resolved by CP(i′, j′, l_k′) after Δ steps. Essentially, this means that after Δ comparison steps, CP(i, j) becomes equivalent to CP(i′, j′, l_k′). We denote this equivalence using the notation CP(i, j) ≡ CP(i′, j′, l_k′). The following algorithm shows how we resolve a given conflict pair, CP(i, j).
ResolveConflict(i, j)
  Generate the neighbors of CP(i, j)
  Remove invalid neighbors
  Probe the valid neighbors for previously resolved conflict pairs
  for each valid neighbor, compute the quantity d = (l_k′ − |Δ|); end for
  Select the neighbor CP(i′, j′, l_k′) with the maximum value of d
  if (Δ < 0 and |Δ| ≤ l_k′) then   /* no extra work is needed */
    Compute l_k = l_k′ + Δ
  else if (Δ > 0 and l_k′ ≥ 0) then
    Perform at most (Δ − 1) extra comparison steps
    if (conflict is not resolved after the Δ − 1 comparisons) then
      /* Resolve conflict using CP(i′, j′, l_k′) */
      Compute l_k = l_k′ + Δ
    end if
  else   /* conflict cannot be resolved using earlier conflicts */
    Resolve the conflict using direct symbol-wise comparisons
  end if
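A sketch of this logic in Python (our rendering; `resolved` maps position pairs (i, j) to conflict lengths, and for brevity the first usable neighbor is taken instead of the one maximizing l_k′ − |Δ|):

```python
def resolve_conflict(T: str, i: int, j: int, h: int, resolved: dict) -> int:
    """Return the conflict length l_k of CP(i, j), reusing previously
    resolved neighboring conflicts where possible."""
    n = len(T)
    # Candidate neighbors: CP(i + d, j + d) for d = ±2^h, ±2^(h-1), ±2^(h-2).
    for q in (h, h - 1, h - 2):
        if q < 0:
            continue
        for d in (-(2 ** q), 2 ** q):
            key = (i + d, j + d)
            if key not in resolved:
                continue
            lk_prev, delta = resolved[key], d
            if delta < 0 and -delta <= lk_prev:
                # CP(i, j) starts inside the matched region of the
                # neighbor: no symbol comparisons are needed.
                resolved[(i, j)] = lk_prev + delta
                return resolved[(i, j)]
            if (delta > 0 and i + delta <= n and j + delta <= n
                    and all(T[i + s] == T[j + s] for s in range(delta))):
                # The first delta symbols match, so the neighbor supplies
                # the remainder of the conflict length.
                resolved[(i, j)] = lk_prev + delta
                return resolved[(i, j)]
    # Fall back to direct symbol-wise comparison.
    lk = 0
    while i + lk < n and j + lk < n and T[i + lk] == T[j + lk]:
        lk += 1
    resolved[(i, j)] = lk
    return lk
```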
The complexity of conflict resolution (when a conflict can be resolved using previous conflicts) thus depends on the parameter Δ. Table 1 shows the conflict pairs for the example sequence used to generate the conflict tree of Figure 2, including the corresponding Δ value where a conflict can be resolved based on a previous conflict.
The final consideration is how to store the conflict tree so as to ensure constant-time access to the resolved conflict pairs. Since the (i, j) pair of a given CP(i, j) is unique, we can use it for direct access into the conflict tree. We can store the conflict tree as a simple linear array, where the positions in the array are determined by the (i, j) values, and hence by the height in the tree. To avoid the sorting that might be needed for fast access via the (i, j) indices, we use a simple hash function, where each hash value can be computed in constant time. The size of this array will be O(n log n) in the worst case, since there are at most n log n conflicts (see the analysis below). The result is O(1)-time access to any given conflict pair, given its (i, j) index in T.
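Since CP(i, j) ≡ CP(j, i), normalizing the key ordering gives direct access; a minimal sketch (Python's dict stands in for the paper's hashed linear array):

```python
class ConflictTable:
    # Direct-access storage of resolved conflicts, keyed by the unique
    # (i, j) pair. A dict gives expected O(1) access; the paper instead
    # hashes (i, j) into a linear array of worst-case size O(n log n).
    def __init__(self):
        self.table = {}

    def put(self, i: int, j: int, lk: int) -> None:
        self.table[(min(i, j), max(i, j))] = lk   # CP(i,j) == CP(j,i)

    def get(self, i: int, j: int):
        return self.table.get((min(i, j), max(i, j)))
```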
Table 1. Conflict pairs for T = aaaabaaaabxaaaab. R_q ≡ R_p indicates that conflict pair R_q is resolved by a previously resolved conflict pair R_p, after a fixed number of steps Δ; that is, after the fixed number of steps Δ, conflict pair R_q becomes equivalent to conflict pair R_p.

R     CP(i,j,l_k)   ≡, Δ      |  R     CP(i,j,l_k)   ≡, Δ      |  R     CP(i,j,l_k)   ≡, Δ      |  R     CP(i,j,l_k)   ≡, Δ
R1    (0,8,0)       —         |  R9    (6,8,1)       —         |  R17   (3,7,1)       R5, −1    |  R25   (1,12,4)      R22, −1
R2    (0,12,3)      —         |  R10   (2,8,1)       R7, −2    |  R18   (5,11,5)      R8, +1    |  R26   (1,6,4)       R23, −1
R3    (8,12,1)      —         |  R11   (8,14,2)      R8, −2    |  R19   (5,7,2)       R9, +1    |  R27   (6,13,2)      R24, −1
R4    (6,14,1)      —         |  R12   (5,13,2)      R4, +1    |  R20   (1,7,2)       R10, +1   |  R28   (2,13,3)      R25, −1
R5    (2,6,2)       R3, +6    |  R13   (1,5,3)       —         |  R21   (7,13,3)      R11, +1   |  R29   (2,7,3)       R26, −1
R6    (2,14,1)      R2, −2    |  R14   (1,13,2)      R5, +1    |  R22   (0,11,5)      —         |  R30   (7,14,1)      R27, −1
R7    (0,6,3)       —         |  R15   (3,11,1)      R6, +1    |  R23   (0,5,5)       —         |  R31   (3,14,2)      R28, −1
R8    (6,12,4)      —         |  R16   (7,11,2)      R3, +1    |  R24   (5,12,3)      —         |  R32   (3,8,2)       R29, −1

3.5. Complexity Analysis

Clearly, the complexity of the algorithm depends on the overall number of conflicts, the number of conflicts that require explicit symbol-wise comparisons, the number that can be resolved using earlier conflicts, and how many extra comparisons are required for those. We can notice that symbol-wise comparisons are needed mainly for the conflicts at the leftmost nodes of the conflict tree, and for the first few conflicts in each node. For the sequence used in Figure 2 and Table 1, with n = 16, we have the following: 32 total conflict pairs; 20 can be resolved with only lookups based on the neighbors (with no extra comparisons); 1 requires extra comparisons after lookup (a total of 6 extra comparisons); and 11 cannot be resolved using lookups and thus require direct symbol-wise comparisons (a total of 20 such comparisons in all). This gives 44 total comparisons and 21 total lookups.
This can be compared with the results for the worst case sequence of the same length, T = a^16 = aaaaaaaaaaaaaaaa. Here we obtain: 49 total conflict pairs; 37 can be resolved with only lookups based on the neighbors (with no extra comparisons); 8 require extra comparisons after lookup (a total of 12 extra comparisons); and 4 cannot be resolved using lookups and thus require direct symbol-wise comparisons (a total of 19 such comparisons in all). This results in 41 total comparisons and 45 total lookups. While the total number of comparisons is similar in both cases, the number of lookups is significantly higher for T = a^n, the worst case sequence.
The following lemma establishes the total number of symbol comparisons and the number of conflicts that can be encountered using the algorithm.
Lemma 1: Given an input sequence T = t_0 t_1 … t_{n−1} of length n, the maximum number of symbol comparisons and the maximum number of conflict pairs are each in O(n log n).
Proof. Deciding whether we can resolve a given conflict based on its neighbors requires only constant time, using Algorithm ResolveConflict(). Assume we are sorting the worst case sequence, T = a^n, and consider level h in the conflict tree. First, we resolve the leftmost conflict at this level, say conflict R_L. This requires at most n/2^{h_max−h} = 2^h symbol comparisons. For example, at h = h_max − 1, the level just before the lowest level, the leftmost conflict will be R_1 = CP(0, n/2), requiring n/2 comparisons. However, each conflict on the same level as R_L can be resolved with at most n/2^{h_max−h+1} = 2^{h−1} comparisons, plus a constant-time lookup using the earlier resolved neighbors. A similar analysis can be made for conflict pairs CP(i, j) where both i and j are odd. There are at most (n/2^h − 1) conflicts in each node at level h of the tree. Thus, the total number of comparisons required to resolve all conflicts will be Σ_{i=0}^{log n} 2^i (n/2^i − 1) ≤ n log n. In the worst case, each node of the tree will have the maximum number, (n/2^h − 1), of conflicts. Thus, similar to the overall number of comparisons, the worst case total number of conflicts will be in O(n log n). □
We state our complexity results in the following theorem:
Theorem 1: Given an input sequence T = t_0 t_1 … t_{n−1}, with symbols from an alphabet Σ, Algorithm I solves the suffix sorting problem in O(n log n) time and O(n) space, for both the average case and the worst case.
Proof. The worst case result on the required time follows from Lemma 1 above. Now consider a random string with the symbols of Σ uniformly distributed. Here, the probability of encountering a given symbol is 1/|Σ|. Thus, the probability of matching two substrings of T decreases rapidly as the length of the substrings increases. In fact, the probability of a conflict of length l_k will be Pr{CP(i, j, l_k)} = 1/|Σ|^{2 l_k}. Following [30], the maximum value of l_k, the length of the longest common prefix of a random string, is given by l_max = O(log_{|Σ|} n). Thus, the average number of matching symbol comparisons for a given conflict will be:
η_compare = Σ_{l_k=1}^{l_max} l_k · Pr{CP(i, j, l_k)} = 1/|Σ|^{2·1} + 2/|Σ|^{2·2} + 3/|Σ|^{2·3} + … + l_max/|Σ|^{2·l_max}
This gives:
η_compare ≤ (1 − 1/|Σ|^{2·l_max}) / (1 − 1/|Σ|²) = (1 − 1/n²) / (1 − 1/|Σ|²) = ((n² − 1)/n²) · (|Σ|²/(|Σ|² − 1)) ≤ 2
For each conflict pair, there will be exactly one mismatch, requiring one comparison. Thus, on average, we require at most 3 comparisons to resolve each conflict pair.
The probability of a conflict is just the probability that any two randomly chosen symbols match; this is simply 1/|Σ|². However, we still need to make at least one comparison for each potential conflict pair, and the number of these potential conflict pairs is exactly the worst case number of conflicts. Thus, although the average number of comparisons per conflict is constant, the total time required to resolve all conflicts is still in O(n log n).
We can reduce the space required for conflict resolution as follows. A level-h conflict pair can be resolved based on only previous conflicts (its neighbors), all at the same level, h. Thus, rather than the current depth-first traversal, we change the traversal order on the tree, and use a bottom-up breadth-first traversal. Then, starting with the lowest level, h = log n , we resolve all conflicts at a given level (starting from the leftmost), before moving up to the next level. Then, we re-use the space used for the lower levels in the tree. This implies a maximum of n 1 entries in the hash table at any given time, or O ( n ) space.
We make a final observation about the nature of the algorithm above. It may be seen that, indeed, we do not need to touch T2, the odd tree, at all until the very last stage of final merging. Since we can sort T1 to obtain SA1 without reference to T2, we can then use SA1 to sort T2 using radix sort, since positions in T1 and T2 are adjacent in T. This eliminates consideration of the (n − 1) worst case conflicts at level h = 0, and all the conflicts in T2. It will not, however, change the complexity results, since the main culprit is the O(n log n) total number of conflicts in the worst case. We make use of this observation in the next section to develop an O(n) time and space algorithm for suffix sorting.

4. Algorithm II: Improved Algorithm

The major problem with Algorithm I is the time taken for conflict resolution. Since the worst case number of conflicts is in O ( n log n ) , an algorithm that performs a sequential resolution of each conflict can do no better than O ( n log n ) time in the worst case. We improve the algorithm by modifying the recursion step, and the conflict resolution strategy. Specifically, we still use binary partitioning, but we use a non-symmetric treatment of the two branches at each recursion step. That is, only one branch will be sorted, and the second branch will be sorted based on the sorted results from the first branch. This is motivated by the observation at the end of the last section. We also use a preprocessing stage inspired by methods in information theory to facilitate fast conflict resolution.

4.1. Overview of Algorithm II

In Algorithm I, we perform symmetric division of T into two subsequences, T1 and T2, and then merge their respective suffix arrays SA1 and SA2 to form SA, the suffix array of T. SA1 and SA2 in turn are obtained by recursive division of T1 and T2 and subsequent merging of the suffix arrays of their respective children. The improvement in Algorithm II is that when we divide T into T1 and T2, the division is no longer symmetric. Similar to the KS algorithm [9], here we form T1 and T2 as follows: T1 = {T[j], j ∈ [0, |T|) | j mod 3 = 0}, T2 = {T[j], j ∈ [0, |T|) | j mod 3 ≠ 0}. This is a 1:2 asymmetric partitioning. These division schemes are special cases of a general 1:η partitioning, where η = 2 in the above, while η = 1 in the symmetric partitioning scheme of Algorithm I. An important parameter is α, defined as α = (1 + η)/η. In the current case, α = 1.5.
Further, the recursive call is now made on only one branch of T, namely T2. After we obtain the suffix array SA2 for T2, we radix sort T1 based on the values in SA2 to construct SA1. This radix sorting step only takes linear time. The merging step remains similar to Algorithm I, though with some important changes to accommodate the above asymmetric partitioning of T. We also now use an initial ordering stage for faster conflict resolution.
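The 1:2 split itself is straightforward; a small sketch (positions are taken relative to the current subsequence, so the same function applies at every level):

```python
def split_1_2(positions: list[int]) -> tuple[list[int], list[int]]:
    # T1 keeps every position whose index k in the current subsequence
    # satisfies k mod 3 == 0; T2 keeps the rest. Only T2 is recursed on.
    T1 = [p for k, p in enumerate(positions) if k % 3 == 0]
    T2 = [p for k, p in enumerate(positions) if k % 3 != 0]
    return T1, T2

print(split_1_2(list(range(16))))   # top level: j mod 3 == 0 vs j mod 3 != 0
```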

4.2. Sort T 2 to Form S A 2

After dividing T into T1 and T2, we need to sort each child sequence to form its own smaller suffix array. Consider T2. We form SA2 by performing the required sorting recursively, using a non-symmetric treatment of each branch. T2 is thus sorted by recursive subdivision and local sorting at each step. The suffix arrays of the two children are then merged to form the suffix array of their parent. Figure 3 shows this procedure for the example sequence T = aaaabaaaabxaaaab. Merging can be performed as before (with changes to account for the asymmetric nature of T1 and T2). The key to the algorithm is how to obtain the sorted array of the left child from that of the right child at each recursion step.
Figure 3. Asymmetric recursive partitioning for the improved algorithm, using T = aaaabaaaabxaaaab. Recursive partitioning is performed on the right branch.

4.3. Sort T 1 by Inducing S A 1 from S A 2

Without loss of generality, we assume T1 is unsorted and T2 has been sorted to form SA2. Given the partitioning scheme, we have that for any t_k ∈ T1, t_{k+1} ∈ T2. There must exist an ordering of t_{k+1} in SA2, since the indices in SA2 are unique. For each k, we construct the set of pairs P = {(t_k, SA2[k+1]) | t_k ∈ T1}. Each pair in P is unique. Then, we radix sort P to generate the suffix array SA1 of T1. This step can thus be accomplished in linear time. This only works at the highest level (i.e., obtaining SA1 from SA2). However, we can use a similar procedure, with some added work, at the other levels of recursion.
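A sketch of this induction step (our illustration; Python's sort stands in for the linear-time radix sort, and `rank2` is assumed to be the inverse of SA2, mapping a position to the rank of the suffix that starts there):

```python
def induce_sa1(T: str, pos1: list[int], rank2: dict) -> list[int]:
    # Each T1 suffix starting at k is t_k followed by the T2 suffix at
    # k + 1, so the pair (T[k], rank of suffix k+1 in SA_2) sorts it.
    # A missing entry (k + 1 == n) means the empty suffix $, which
    # ranks lowest (-1).
    pairs = sorted((T[k], rank2.get(k + 1, -1), k) for k in pos1)
    return [k for _, _, k in pairs]
```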
Consider a lower level of recursion, say level h (see Figure 3). We can append values from the SA of the right tree so that we can perform successive bucket sorts; thus, we use the SA of the right tree as the tie breaker after a certain number of steps. In general, this does not change the number of comparisons at the lowest level of the tree. However, for the general case, it ensures that at most α^{h−1} bucket sorts are performed, each involving α^{h_max−h} symbols, where h_max = log_α n is the lowest level of recursion. For instance, using the example in Figure 3, at h = 3 we will have T1 = a²a⁸ and T2 = b⁴a⁷a¹¹a¹³ (for convenience, we have used superscripts to denote the respective positions in the original sequence T). Assume T2 has been sorted to obtain SA2 = [3 2 1 0]. Using SA2, we now form the sequences:

T[2] ∘ T[4] ∘ 3
T[8] ∘ T[11] ∘ 1

where the symbol '∘' denotes concatenation. The last symbol in each sequence is obtained from SA2, the suffix array of the right tree. These sequences are then bucket-sorted to obtain SA1. Since bucket sorting is linear in the number of input symbols, in the worst case this results in a total of O(n) time at each level. Thus, the overall time for this procedure is still in O(n log n) in the worst case.
We can also observe that with the asymmetric partitioning, and by making the recursive call on only one partition at each level, we reduce both the overall number of conflicts and the number of comparisons to O(n). But the number of lookups could still be in O(n log n) in the worst case.

4.4. Improved Sorting - Using Shannon-Fano-Elias Codes

The key to improved sorting is to reduce the number of bucket sorts in the above procedure. We do this by pre-computing some information beforehand, so that the sorting can be performed on small blocks of symbols, rather than one symbol at a time. Let m be the block size. With the precomputed information, we can perform a comparison involving an m-block of symbols in O(1) time. This reduces the number of bucket sorts required at each level h from α^{h−1} to α^{h−1}/m, each involving α^{h_max−h} symbols. By an appropriate choice of m, we can reduce the complexity of the overall sorting procedure. For instance, with m = log log n, this leads to an overall worst case complexity of O(n log n / log log n) time for determining the suffix array of T1 from that of T2. With m = log n, this gives O(n) time. We use m = log n in subsequent discussions.
The question that remains, then, is how to perform the required computations such that all the needed block values can be obtained in linear time. Essentially, we need a pair-wise global partial ordering of the suffixes involved in each recursive step. First, we observe that we only need to consider the ordering between pairs of suffixes at the same level of recursion; the relative order between suffixes at different levels is not needed. For instance, using the example sequence of Figure 3, the sets of suffixes for which we need the pair-wise orderings will be those at positions {{1, 5, 10, 14}; {2, 8}; {4, 13}}. Each subset corresponds to a level of recursion, from h = 2 to h = h_max − 1. Notice that we do not necessarily need those at level h = 1, as we can directly induce the sorted order of these suffixes from SA2, after sorting T2.
We need a procedure that returns the relative order between each pair of suffix positions in constant time. Given that we already have an ordering from the right tree in SA2, we only need to consider the prefixes of the suffixes in the left tree up to the corresponding positions in T2, such that we can use entries in SA2 to break the tie, after a possible α^{h−1}/m bucket sorts. Let Q_i be the m-length prefix of the suffix T_i: Q_i = T[i … i+m−1]. We could use a simple hash function to compute a representative of Q_i, for instance the polynomial hash function:

h(Q_i) = (Σ_{j=0}^{m−1} |Σ|^{m−1−j} · f(Q_i[j])) mod n′

where f(x) = k if x is the k-th symbol in Σ (k = 0, 1, …, |Σ|−1), and n′ is the nearest prime number to n. The problem is that the ordering information is lost in the modulus operation. Although order-preserving hash functions exist (see [31]), these run in O(n) time on average, without much guarantee on their worst case behavior. Also, with the m-length blocks, this may require O(mn) = O(n log n) time on average.
We use an information-theoretic approach to determine the ordering of the pairs. We consider the required representation for each m-length block as a codeword that can be used to represent the block. The codewords are constrained to be order preserving: that is, C(Q_i) < C(Q_j) iff Q_i ≺ Q_j, and C(Q_i) = C(Q_j) iff Q_i = Q_j, where C(x) is the codeword for sequence x. Unlike in traditional source coding, where we are given one long sequence and must produce its compact representation, here we have a set of short sequences, and we need to produce their respective compact representations, and these representations must be order preserving.
Let P_i be the probability of Q_i, the m-length block starting at position i in T, and let p_i be the probability of symbol t_i = T[i]. If necessary, we pad T with at most (m − 1) '$' symbols, to form a valid m-block at the end of the sequence. We compute the quantity P_i = Π_{k=i}^{i+m−1} p_k. Recall that t_i = T[i] ∈ Σ, and Σ_{j=0}^{|Σ|−1} Pr{σ_j} = 1. For a given sequence T, we should have Σ_{i=0}^{n−1} P_i = 1. However, since T may not contain all the possible m-length blocks in Σ^m, we need to normalize the products of probabilities to form a probability space:

P_i ← P_i / Σ_{j=0}^{n−1} P_j    (1)
To determine the code for Q_i, we then use the cumulative distribution function (cdf) of the P_i's, and determine the corresponding position of each P_i in this cdf. Essentially, this is equivalent to dividing a number line in the range [0, 1] such that each Q_i is assigned a range proportional to its probability P_i. See Figure 4. The total number of divisions will be equal to the number of unique m-length blocks in T. The problem then is to determine the specific interval on this number line that corresponds to Q_i, and to choose a tag q_i to represent Q_i.
Figure 4. Code assignment by successive partitioning of a number line.
We use the following assignment procedure to compute the tag q_i. First we determine the interval for the tag, from which we then compute the tag. Define the cumulative distribution function for the symbols in Σ = {σ_0, σ_1, …, σ_{|Σ|−1}}: F_x(σ_j) = Σ_{v=0}^{j} Pr{σ_v}. The symbol probabilities Pr{σ_v} are simply obtained from the p_i's. For each symbol σ_k in Σ, we have a half-open interval in the cdf: [F_x(σ_{k−1}), F_x(σ_k)). Now, given the sequence Q_i = s_1 s_2 … s_k … s_m, s_i ∈ Σ, the procedure steps through the sequence. At each step k, k = 1, 2, …, m, along the sequence, we compute U(k) and L(k), the respective current upper and lower bounds for the tag, using the following relations:

L(0) = 0;  U(0) = 1
for k = 1 to m:
  U(k) = L(k−1) + [U(k−1) − L(k−1)] · F_x(s_k)
  L(k) = L(k−1) + [U(k−1) − L(k−1)] · F_x(s_k − 1)

where F_x(s_k − 1) denotes the cdf value of the symbol immediately preceding s_k in Σ.
The procedure stops at k = m, and the values of U(m) and L(m) at this final step give the range of the tag q_i. We can choose the tag q_i as any number in the range L(m) ≤ q_i < U(m). Thus we choose q_i as the midpoint of the range at the final step: q_i = (U(m) + L(m))/2. Figure 5(a) shows an example run of this procedure for a short sequence, Q_i = acabd, over a simple alphabet Σ = {a, b, c, d} where each symbol has equal probability p_i = 1/4. This gives q_i = 271/2048.
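A direct implementation of the tag computation with exact rationals reproduces this example (a sketch; `make_cdf` assumes the alphabet is given in sorted order):

```python
from fractions import Fraction

def make_cdf(alphabet: str, prob: dict):
    # F[s]: cdf up to and including s; low[s]: cdf strictly below s,
    # i.e., F_x(s - 1) in the notation above.
    F, low, acc = {}, {}, Fraction(0)
    for s in alphabet:
        low[s] = acc
        acc += prob[s]
        F[s] = acc
    return F, low

def sfe_interval(Q: str, F: dict, low: dict):
    L, U = Fraction(0), Fraction(1)
    for s in Q:
        span = U - L
        L, U = L + span * low[s], L + span * F[s]
    return L, U

p = {c: Fraction(1, 4) for c in "abcd"}
F, low = make_cdf("abcd", p)
L, U = sfe_interval("acabd", F, low)
print((L + U) / 2)   # 271/2048, the tag q_i of the example
```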
Figure 5. Code assignment procedure, using an example sequence. The vertical line represents the current state of the number line. The current interval at each step of the procedure is shown with a darker shade. The symbol considered at each step is listed under its respective number line. (a) Using the sequence Q_i = acabd. (b) Evolution of the code assignment procedure after removing the first symbol (of the previous sequence acabd) and bringing in a new symbol a, to form the new sequence Q_{i+1} = cabda.
Lemma 2: The tag assignment procedure results in a tag q i that is unique to Q i .
Proof. The procedure described can be seen as an extension of the Shannon-Fano-Elias coding procedure used in source coding and information theory [32]. Each tag q_i is analogous to an arithmetic coding of the given m-length block Q_i. The half-open interval [L(m), U(m)) for each m-length sequence is unique to the sequence.
To see this uniqueness, notice that the final number line at step m represents the cdf of all m-length blocks that appear in the original sequence T. Denote this cdf for the m-blocks as F_x^m. Given Q_i, the i-th m-block in T, the size of its interval is (U(m) − L(m)) = P_i. Since all probabilities are positive, we see that F_x^m(Q_i) ≠ F_x^m(Q_j) whenever Q_i ≠ Q_j. Therefore, F_x^m(Q_i) determines Q_i uniquely, and thus serves as a unique code for Q_i. Choosing any number q_i within the upper and lower bounds for each Q_i defines a unique tag for Q_i; in particular, the tag defined by the midpoint of this interval is unique to Q_i. □
Lemma 3: The tags generated by the assignment procedure are order preserving.
Proof. Consider the ordering of the tags for different m-length blocks. Each step of the assignment procedure uses the same fixed order of the symbols on the number line, based on their order in Σ. Thus, the positions of the upper and lower bounds at each step depend on the previous symbols considered, and on the position of the current symbol in the ordered list of symbols in Σ. Therefore the q_i's are ordered with respect to the lexicographic ordering of the Q_i's: q_i < q_j ⟺ Q_i ≺ Q_j, and q_i = q_j ⟺ Q_i = Q_j. □
Lemma 4: Given T = t_0 t_1 … t_{n−1}, all the required tags q_i, ∀i, can be computed in O(n) time.
Proof. Suppose we have already determined P_i and q_i for the m-block Q_i, as described above. For efficient processing, we can compute P_{i+1} and the tag q_{i+1} on the [0, 1] number line using the previous values P_i and q_i. This is based on the fact that Q_i and Q_{i+1} start at consecutive positions in T. (In practice, we need only a fraction of the positions in T, which means less time and space are required; we describe the procedure for the entire T since the complexity remains the same.) In particular, given Q_i = T[i … i+m−1] and Q_{i+1} = T[i+1 … i+m], we compute P_{i+1} as:

P_{i+1} = P_i · p_{i+m} / p_i

Then, we normalize P_{i+1} using Equation (1). Thus, all the required P_i's can be computed in O(n + m) = O(n + log n) = O(n) time.
Similarly, given the tag q_i for Q_i, and its upper and lower bounds U(m) and L(m), we can compute the new tag q_{i+1} for the incoming m-block Q_{i+1}, based on the structure of the assignment procedure used to compute q_i (see Figure 5(b)). We compute the new tag q_{i+1} by first computing its upper and lower bounds. Denote the respective upper and lower bounds for q_i as U_i(m) and L_i(m); similarly, we use U_{i+1}(m) and L_{i+1}(m) for the respective bounds of q_{i+1}. Let s = T[i] = Q_i[0] be the first symbol in Q_i; its probability is p_i. Also, let s_new = T[i+m] be the new symbol that is shifted in; its probability is p_{i+m}, and we also know its position in the cdf. We first compute the intermediate bounds at step k = m − 1 for Q_{i+1}, namely:

U_{i+1}(m−1) = [U_i(m) − F_x(s − 1)] · (1/p_i)
L_{i+1}(m−1) = [L_i(m) − F_x(s − 1)] · (1/p_i)

Multiplying by (1/p_i) changes the probability space from the previous range [F_x(s − 1), F_x(s)) back to [0, 1]. After these computations, we can then perform the last step of the assignment procedure to determine the final range for the new tag:

U_{i+1}(m) = L_{i+1}(m−1) + [U_{i+1}(m−1) − L_{i+1}(m−1)] · F_x(s_new)
L_{i+1}(m) = L_{i+1}(m−1) + [U_{i+1}(m−1) − L_{i+1}(m−1)] · F_x(s_new − 1)

The tag q_{i+1} is then computed as the average of the two bounds, as before. The worst case time complexity of this procedure is in O(n + m + |Σ|) = O(n + log n + |Σ|). The |Σ| component comes from the time needed to sort the unique symbols in T before computing the cdf; this can be performed in linear time using counting sort. Since |Σ| ≤ n, this gives a worst case time bound of O(n) to compute the required codes for all the O(n) m-length blocks. □
Figure 5(b) shows a continuation of the previous example, with the old m-block Q_i = acabd and the new m-block Q_{i+1} = cabda. That is, the new symbol a has been shifted in, while the first symbol of the old block has been shifted out. We observe that the general structure in Figure 5(a) is not changed by the incoming symbol, except at the first and last steps. For the running example, the new value will be q_{i+1} = 1051/2048. Table 2 shows the evolution of the upper and lower bounds for the two adjacent m-blocks; the bounds are obtained from the figures.
Table 2. Upper and lower bounds on the current interval of the number line, at the start of each step, for Q_i = acabd (left) and Q_{i+1} = cabda (right).

Symbol         a      c      a      b      d      |  c      a      b      d      a
U(k)           1      1/4    3/16   9/64   17/128 |  1      3/4    9/16   17/32  17/32
L(k)           0      0      1/8    1/8    33/256 |  0      1/2    1/2    33/64  135/256
U(k) − L(k)    1      1/4    1/16   1/64   1/256  |  1      1/4    1/16   1/64   1/256
Having determined q_i, which is fractional, we can then assign the final code for Q_i by mapping the tags to integers in the range [0, n−1]. This can be done using a simple formula:

c_i = C(Q_i) = ⌊(n−1) · (q_i − q_min) / (q_max − q_min)⌋

where q_min = min_i{q_i} and q_max = max_i{q_i}. Notice that the c_i's computed this way will not necessarily be consecutive, but they will be ordered. Also, the number of distinct q_i's is at most n. The difference between c_i and c_{i+1} will depend on P_i and P_{i+1}. The floor function, however, could break the lexicographic ordering. A better approach is to simply record the position where each Q_i falls on the number line. We then read off these positions from 0 to 1, and use the count at which each Q_i is encountered as its code. This is easily done using the cumulative count of occurrence of each distinct Q_i. Since the q_i's are implicitly sorted, so are the c_i's. We have thus obtained an ordering of all the m-length substrings in T. This is still essentially a partial ordering of all the suffixes, based on their first m symbols, but a total order on the distinct m-length prefixes of the suffixes.
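A sketch of the order-preserving integer code assignment (sorting is used here for clarity; the paper obtains the same ranks in linear time from cumulative counts on the number line):

```python
def assign_codes(tags: list) -> list[int]:
    # Map each tag to its rank among the distinct tags: equal tags
    # (identical m-blocks) share a code, and c_i < c_j exactly when
    # q_i < q_j, so lexicographic order of the blocks is preserved.
    rank = {q: r for r, q in enumerate(sorted(set(tags)))}
    return [rank[q] for q in tags]

print(assign_codes([0.3, 0.7, 0.3]))   # [0, 1, 0]
```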
We now present our main results in the following theorems:
Theorem 2: Given the sequence T = t_0 t_1 … t_{n−1}, with symbols from a fixed alphabet Σ, |Σ| ≤ n, all the m-length prefixes of the suffixes of T can be ordered in linear time and linear space in the worst case.
Proof. The theorem follows from Lemmas 2 to 4. The correctness of the ordering follows from Lemma 2 and Lemma 3. The time complexity follows from Lemma 4. What remains to prove is the space complexity. We only need to maintain two extra O(|Σ|) arrays: one for the number line at each step, and the other to keep the cumulative distribution function. Thus, the space needed is also linear: O(n + m + |Σ|) = O(n). □
Theorem 3: Given the sequence T = t_0 t_1 … t_{n−1}, with symbols from a fixed alphabet Σ, |Σ| ≤ n, Algorithm II computes the suffix array of T in O(n) worst case time and space.
Proof. At each iteration, the recursive call applies only to the (2/3)n suffixes in T2. Thus, the running time of the algorithm is given by the solution to the recurrence φ(n) = φ(2n/3) + O(n), which gives φ(n) = O(n). Combined with Theorem 2, this establishes the linear time bound for the overall algorithm.
We improve the space requirement using an alternative merging procedure. Since we now have SA1 and SA2, we can modify the merging step by exploiting the fact that any conflict that arises during the merging can be resolved using only SA2′. To resolve a conflict between suffix T1_i in T1 and suffix T2_j in T2, we need to consider two cases:
  • Case 1: If j mod 3 = 1, we compare (T1[i], SA2′[i+1]) versus (T2[j], SA2′[j+1]), since the relative orders of both T1_{i+1} and T2_{j+1} are available from SA2′.
  • Case 2: If j mod 3 = 2, we compare (T1[i], T1[i+1], SA2′[i+2]) versus (T2[j], T2[j+1], SA2′[j+2]). Again, for this case the tie is broken using the triplets, since the relative orders of both T1_{i+2} and T2_{j+2} are also available from SA2′.
Consider the step just before we obtain SA1 from SA2, as needed to obtain the final SA. We needed the codes for the m-blocks when sorting to obtain SA2. Given the 1:2 asymmetric partitioning used, at this point the number of such m-blocks needed by the algorithm will be (2/3)·(2/3)n = (4/9)n. These require (4/9)n integers to store. We need (2/3)n integers to store SA2. At this point, we also still have the inverse SA used to merge the left and right suffix arrays to form SA2; this requires (4/9)n integers of storage. Thus, the overall space needed at this point will be 1 5/9 n integers, in addition to the space for T. However, after obtaining SA2, we no longer need the integer codes for the m-length blocks. Also, the merging does not involve SA1′, so it need not be computed. Thus, we compute SA2′, and re-use the space for the m-block codes. SA2′ requires (2/3)n integers. Further, since we are merging SA1 and SA2 from the same direction, we can construct the final SA in place, by re-using part of the space used for the already merged sections of SA1 and SA2 (see for example [33,34]). Thus, the overall space requirement in bits will be (n log|Σ| + n log n + (2/3)n log n) = (n log|Σ| + 1⅔ n log n), where we need n log|Σ| bits to store T, n log n bits for the output suffix array, and (2/3)n log n bits for SA2′. □
This bit count translates to a total space requirement of $7\frac{2}{3}n$ bytes, using standard assumptions of 4 bytes per integer and 1 byte per symbol.
Though the space above is $O(n)$, the $\frac{2}{3}n \log n$ bits used to store $SA_2^{-1}$ can be further reduced. We do this by making some observations on the two cases encountered during merging. Notice that after obtaining $SA_2$ and $SA_1$, we do not really need to store the text $T$ anymore. The key observation is that the merging of $SA_2$ and $SA_1$ proceeds in the same direction for each array, for instance, from the smallest to the largest suffix. Thus, at the $k$-th step, the symbol at position $i = SA_2[k]$ (that is, $T_2[i]$) can easily be obtained using $SA_2$ and two $O(|\Sigma|)$ arrays, namely $B_1$, which stores the symbols in $\Sigma$ in lexicographic order, and $B_2$, which stores the cumulative count of each symbol in $T_2$.
For $T_2[i+1]$, we compute the rank $SA_2^{-1}[i+1]$ (that is, $SA_2^{-1}[SA_2[k]+1]$) and use this value as an index into $B_2$. We then use the position in the $B_2$ array to determine the symbol value from $B_1$. Similarly, we obtain the symbol $T_1[i] = T_1[SA_1[k]]$ using a second set of $O(|\Sigma|)$ arrays. For the symbol $T_1[i+1]$ we do not have $SA_1^{-1}$. However, we can observe that the symbol $T_1[i+1]$ will be some symbol $T_2[j]$ in $T_2$. Hence, we can use $SA_2$ and $SA_2^{-1}$ to determine the symbol, as described above. A sketch of this symbol recovery appears below.
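Below is a minimal sketch of this symbol recovery under our own naming (`ISA2` for $SA_2^{-1}$). A binary search is used here for brevity; since the merge visits ranks in sorted order, each lookup can instead be a pointer sweep over the buckets, effectively constant time per step.

```python
import bisect

def symbol_at_rank(r, B1, B2):
    """First symbol of the suffix of rank r in SA2.
    B1: distinct symbols in lexicographic order.
    B2[s]: number of T2-suffixes whose first symbol precedes B1[s]
    (cumulative counts); rank r falls in exactly one such bucket."""
    s = bisect.bisect_right(B2, r) - 1
    return B1[s]

def next_symbol(k, SA2, ISA2, B1, B2):
    """Recover T2[i+1] for i = SA2[k] without storing T2 itself."""
    r = ISA2[SA2[k] + 1]    # rank of the suffix starting at i + 1
    return symbol_at_rank(r, B1, B2)
```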
Thus, we can now release the space currently used to store $T$ and use this, in part, to store $SA_2^{-1}$, and then merge $SA_1$ and $SA_2$ using $SA_2^{-1}$ and the two sets of $O(|\Sigma|)$ arrays $B_1$, $B_2$. The space saving gained in doing this will be $(n \log |\Sigma| - 2(|\Sigma| \log |\Sigma| + |\Sigma| \log n)) \approx n \log |\Sigma|$ bits. Using this in the previous space analysis leads to a final space requirement of $(n \log n + \frac{2}{3} n \log n + 2|\Sigma|(\log |\Sigma| + \log n) - n \log |\Sigma|) \approx (1\frac{2}{3} n \log n - n \log |\Sigma|)$ bits. This gives $5\frac{2}{3}n$ bytes, assuming $n \le 2^{32}$ and $|\Sigma| \le 256$, at 4 bytes per integer.
Finally, since we do not need $SA_2^{-1}$ anymore, we can release the space it occupies. We compute a new set of $B_1$ and $B_2$ arrays (in place) for the newly computed SA; the second set of $O(|\Sigma|)$ arrays is no longer needed. Using SA and the new $B_1$ and $B_2$ arrays, we can then recover the original sequence $T$ at no extra space cost.

5. Conclusion and Discussion

We have proposed two algorithms for solving the suffix sorting problem. The first algorithm runs in $O(n \log n)$ time and $O(n)$ space, in both the average case and the worst case. Using ideas from Shannon-Fano-Elias codes used in information theory, the second algorithm improves the first to $O(n)$ worst case time and space complexity. The proposed algorithms perform direct suffix sorting on the input sequence, circumventing the need to first construct the suffix tree.
We mention that the proposed algorithms are generally independent of the type of alphabet, $\Sigma$. The only requirement is that $\Sigma$ be fixed during the run of the algorithm. Any given fixed alphabet can be mapped to a corresponding integer alphabet. Also, since in practice the number of unique symbols in $T$ cannot be more than $n$, the size of $T$, it is safe to say that $n \ge |\Sigma|$.
For a practical implementation, one will need to consider the problem of practical code assignment, since the procedure described may involve dealing with very small fractions, depending on $m$, the block length, and $|\Sigma|$. This is a standard problem in practical data compression. With $n \le 2^{32}$ and $|\Sigma| = 256$, we have $m = 32$, and thus we may need to store values as small as $\frac{1}{|\Sigma|^{32}} = \frac{1}{256^{32}}$ while computing the tags. This translates to $\frac{1}{2^{256}}$, where $2^{256} \approx 1.158 \times 10^{77}$. In most implementations, a variable of type double can store values as small as about $1.7 \times 10^{-308}$, so such tags remain representable. For the case of the 1:2 asymmetric partitioning used above, we need only $\frac{4n}{9}$ of the $m$-blocks, and hence the overall space needed to store them will still be at most $n$ integers, since the size of a double is typically twice that of an integer. To use the approach for much larger sequences, periodic re-scaling and re-normalization schemes can be used to address the problem. Another approach is to consider occurrence counts, rather than occurrence probabilities, for the $m$-blocks. Here, the final number line will be a cumulative count, rather than a cdf, with the total count being $n$. Then, the integer code for each $m$-block can easily be read off this number line, based on its overall frequency of occurrence. Therefore, we will need space for at most $\frac{4n}{9} + |\Sigma|$ integers in computing the integer codes for the positions needed. Moffat et al. [35] provide some ideas on how to address practical problems in arithmetic coding. A small numeric check of the precision concern appears below.
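The following illustrative check (ours, not from the paper) confirms the range arithmetic above: a single $m$-block tag of $2^{-256}$ fits comfortably in an IEEE double, but products over longer blocks or larger alphabets approach the subnormal range, which motivates the re-scaling or count-based alternatives.

```python
# Smallest m-block probability for |Sigma| = 256, m = 32:
p = (1.0 / 256.0) ** 32
print(p)              # 2**-256, about 8.64e-78: representable in a double

# Four such factors multiplied (e.g., an effective block of length 128):
print(p * p * p * p)  # 2**-1024, about 5.6e-309: already subnormal
```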
Overall, by a careful re-use of previously allocated space, the algorithm requires $(1\frac{2}{3} n \log n - n \log |\Sigma|)$ bits, including the $n$ bytes needed to store the original string. This translates to $5\frac{2}{3}n$ bytes, using standard assumptions of 4 bytes per integer and 1 byte per symbol. This is a significant improvement over the $10n$ bytes required by the KA algorithm [10], or the $13n$ bytes required by the KS algorithm [9]. Our algorithm is also unique in its use of Shannon-Fano-Elias codes, traditionally used in source coding, for efficient suffix sorting. This is the first time information-theoretic methods have been used as the basis for solving the suffix sorting problem. We believe that this new idea of using an information-theoretic approach to suffix sorting could shed new light on the problems of suffix array construction, their analysis, and applications.

Acknowledgements

This work was partially supported by a DOE CAREER grant: No: DE-FG02-02ER25541, and an NSF ITR grant: IIS-0228370. A short version of this paper was presented at the 2008 IEEE Data Compression Conference, Snowbird, Utah.

References

  1. Manber, U.; Myers, G. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 1993, 22, 935–948. [Google Scholar] [CrossRef]
  2. Larsson, N.J.; Sadakane, K. Faster suffix sorting. Theoret. Comput. Sci. 2007, 387, 258–272. [Google Scholar] [CrossRef]
  3. Manzini, G.; Ferragina, P. Engineering a lightweight suffix array construction algorithm. Algorithmica 2004, 40, 33–50. [Google Scholar] [CrossRef]
  4. Puglisi, S.J.; Smyth, W.F.; Turpin, A. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 2007, 39, 1–31. [Google Scholar] [CrossRef]
  5. Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology; Cambridge University Press: Cambridge, UK, 1997. [Google Scholar]
  6. Burrows, M.; Wheeler, D.J. A Block-Sorting Lossless Data Compression Algorithm; Research Report 124; Digital Equipment Corporation: Palo Alto, CA, USA, 1994. [Google Scholar]
  7. Adjeroh, D.; Bell, T.; Mukherjee, A. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching; Springer-Verlag: New York, NY, USA, 2008. [Google Scholar]
  8. Seward, J. On the performance of BWT sorting algorithms. In Proceedings of IEEE Data Compression Conference, Snowbird, UT, USA, March 28–30, 2000; Volume 17, pp. 173–182.
  9. Kärkkäinen, J.; Sanders, P.; Burkhardt, S. Linear work suffix array construction. J. ACM 2006, 53, 918–936. [Google Scholar] [CrossRef]
  10. Ko, P.; Aluru, S. Space-efficient linear time construction of suffix arrays. J. Discrete Algorithms 2005, 3, 143–156. [Google Scholar] [CrossRef]
  11. Cleary, J.G.; Teahan, W.J. Unbounded length contexts for PPM. Comput. J. 1997, 40, 67–75. [Google Scholar] [CrossRef]
  12. Bell, T.; Cleary, J.; Witten, I. Text Compression; Prentice-Hall: Englewood Cliffs, NJ, USA, 1990. [Google Scholar]
  13. Szpankowski, W. Asymptotic properties of data compression and suffix trees. IEEE Trans. Inf. Theory 1993, 39, 1647–1659. [Google Scholar] [CrossRef]
  14. Abouelhoda, M.I.; Kurtz, S.; Ohlebusch, E. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2004, 2, 53–86. [Google Scholar] [CrossRef]
  15. Farach-Colton, M.; Ferragina, P.; Muthukrishnan, S. On the sorting-complexity of suffix tree construction. J. ACM 2000, 47, 987–1011. [Google Scholar] [CrossRef]
  16. Kim, D.K.; Sim, J.S.; Park, H.; Park, K. Constructing suffix arrays in linear time. J. Discrete Algorithms 2005, 3, 126–142. [Google Scholar] [CrossRef]
  17. Nong, G.; Zhang, S. Optimal lightweight construction of suffix arrays for constant alphabets. In Proceedings of Workshop on Algorithms and Data Structures, Halifax, Canada, August 15–17, 2007; Volume 4619, pp. 613–624.
  18. Maniscalco, M.A.; Puglisi, S.J. Engineering a lightweight suffix array construction algorithm. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, June 10–12, 2008.
  19. Itoh, H.; Tanaka, H. An efficient method for in memory construction of suffix arrays. In Proceedings of String Processing and Information Retrieval Symposium and International Workshop on Groupware, Cancun, Mexico, September 22–24, 1999; pp. 81–88.
  20. Hon, W.; Sadakane, K.; Sung, W. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings of IEEE Symposium on Foundations of Computer Science, Cambridge, MA, USA, October 11–14, 2003.
  21. Na, J.C. Linear-time construction of compressed suffix arrays using O(n log n)-bit working space for large alphabets. In Proceedings of 16th Annual Symposium on Combinatorial Pattern Matching 2005, LNCS, Jeju Island, Korea, June 19–22, 2005; Volume 3537, pp. 57–67.
  22. Burkhardt, S.; Kärkkäinen, J. Fast lightweight suffix array construction and checking. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD, USA, January 12–14, 2003.
  23. Nong, G.; Zhang, S.; Chan, W.H. Linear time suffix array construction using D-critical substrings. In CPM; Kucherov, G., Ukkonen, E., Eds.; Springer: New York, NY, USA, 2009; Volume 5577, pp. 54–67. [Google Scholar]
  24. Nong, G.; Zhang, S.; Chan, W.H. Linear suffix array construction by almost pure induced-sorting. In DCC; Storer, J.A., Marcellin, M.W., Eds.; IEEE Computer Society: Hoboken, NJ, USA, 2009; pp. 193–202. [Google Scholar]
  25. Franceschini, G.; Muthukrishnan, S. In-place suffix sorting. In ICALP; Arge, L., Cachin, C., Jurdzinski, T., Tarlecki, A., Eds.; Springer: New York, NY, USA, 2007; Volume 4596, pp. 533–545. [Google Scholar]
  26. Okanohara, D.; Sadakane, K. A linear-time Burrows-Wheeler Transform using induced sorting. In SPIRE; Karlgren, J., Tarhio, J., Hyyrö, H., Eds.; Springer: New York, NY, USA, 2009; Volume 5721, pp. 90–101. [Google Scholar]
  27. Ferragina, P.; Manzini, G. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA, November 12–14, 2000; pp. 390–398.
  28. Grossi, R.; Vitter, J.S. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, Portland, OR, USA, May 2000.
  29. Sirén, J. Compressed suffix arrays for massive data. In SPIRE; Karlgren, J., Tarhio, J., Hyyrö, H., Eds.; Springer: New York, NY, USA, 2009; Volume 5721, pp. 63–74. [Google Scholar]
  30. Karlin, S.; Ghandour, G.; Ost, F.; Tavare, S.; Korn, L. New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 1983, 80, 5660–5664. [Google Scholar] [CrossRef] [PubMed]
  31. Fox, E.A.; Chen, Q.F.; Daoud, A.M.; Heath, L.S. Order-preserving minimal perfect hash functions and information retrieval. ACM Trans. Inf. Syst. 1991, 9, 281–308. [Google Scholar] [CrossRef]
  32. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Interscience: Malden, MA, USA, 1991. [Google Scholar]
  33. Symvonis, A. Optimal stable merging. Comput. J. 1995, 38, 681–690. [Google Scholar] [CrossRef]
  34. Huang, B.; Langston, M. Fast stable sorting in constant extra space. Comput. J. 1992, 35, 643–649. [Google Scholar] [CrossRef]
  35. Moffat, A.; Neal, R.M.; Witten, I.H. Arithmetic coding revisited. ACM Trans. Inf. Syst. 1995, 16, 256–294. [Google Scholar] [CrossRef]
