The solution that we describe is a combination of two results: some results about deferred data structures for multisets, which support queries in a “lazy” way; and some results about optimal prefix free codes themselves, about the relation between the computational cost of partially sorting a set of positive integers and the computational cost of computing a binary optimal prefix free code for the corresponding frequency distribution. We describe the general intuition of our solution in Section 2.1, the deferred data structure in Section 2.2, and the algorithm in Section 2.3.
2.1. General Intuition
Observing that the algorithm suggested by Huffman [1] in 1952 always creates the internal nodes in increasing order of weight, van Leeuwen [14] described in 1976 an algorithm to compute optimal prefix free codes in linear time when the input (i.e., the weights of the external nodes) is given in sorted order. A close look at the execution of van Leeuwen’s algorithm [14] reveals a sequence of sequential searches for the insertion rank r of the weight of some internal node in the list of weights of external nodes. Such a sequential search could be replaced by a more efficient search algorithm in order to reduce the number of comparisons performed (e.g., a doubling search [22] would find such a rank r in O(lg r) comparisons instead of the r comparisons spent by a sequential search). Of course, this would reduce the number of comparisons performed, but it would not reduce the number of algebraic operations (in this case, sums) performed, and hence neither would it significantly reduce the total running time of the algorithm.
Example 1. Consider the following instance for the computation of an optimal prefix free code, formed by n sorted positive weights W[1..n] such that the first internal node created has larger weight than the largest weight in the original array (i.e., W[1] + W[2] > W[n]). On such an instance, van Leeuwen’s algorithm starts by performing n − 2 comparisons in the equivalent of a sequential search in W for the insertion rank of W[1] + W[2]: a binary search would perform only about lg n comparisons instead, and a doubling search no more than about 2 lg n comparisons.
As mentioned above, any algorithm must access (and sum) each weight at least once in order to compute an optimal prefix free code, so reducing the number of comparisons does not reduce the running time of van Leeuwen’s algorithm on a sorted input; but the observation illustrates how instances with clustered external and internal nodes are “easier” than instances in which they are interleaved.
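As an illustration of the search strategy mentioned above, the following Python sketch performs a doubling (also called galloping) search for the insertion rank of a value x in a sorted array, using a number of comparisons logarithmic in the answer rather than in the size of the array; the function name and interface are ours.

from bisect import bisect_left

def doubling_search(a, lo, x):
    # Insertion rank of x in the sorted array a[lo:], i.e. the number of elements
    # of a[lo:] strictly smaller than x, using O(lg r) comparisons where r is the answer.
    step = 1
    while lo + step - 1 < len(a) and a[lo + step - 1] < x:
        step *= 2                              # exponential phase: double the probed interval
    left = lo + step // 2                      # every element before `left` is already known to be < x
    right = min(lo + step, len(a))             # the insertion point cannot be beyond `right`
    return bisect_left(a, x, left, right) - lo  # binary phase within the last doubled interval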
The algorithm suggested by Huffman [1] starts with a heap of external nodes, selects the two nodes of minimal weight, pairs them into a new node which it adds to the heap, and iterates until only one node is left. Whereas the type of the nodes selected, external or internal, does not matter in the analysis of the complexity of Huffman’s algorithm, we claim that the computational cost of optimal prefix free codes can be greatly reduced on instances where many external nodes are selected consecutively. We define the “EI signature” of an instance as a first step toward the characterization of such instances:
Given an instance of the optimal prefix free code problem formed by n positive weights W[1..n], its EI signature is a string of length 2n − 1 over the alphabet {E, I} (where E stands for “external” and I for “internal”) marking, at each step of the algorithm suggested by Huffman, whether an external or an internal node is chosen as the minimum (including the last node returned by the algorithm, for simplicity).
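The following Python sketch makes this definition operational by simulating the two-queue pairing process and recording, for each node selected, whether it is external (E) or internal (I); the function name and the tie-breaking rule (external nodes first, which the definition above leaves unspecified) are choices of this illustration.

from collections import deque

def ei_signature(weights):
    # EI signature of a multiset of positive weights: one letter per node chosen
    # as a minimum, plus one final letter for the last node returned.
    external = deque(sorted(weights))
    internal = deque()
    letters = []

    def pop_min():
        if internal and (not external or internal[0] < external[0]):
            letters.append("I")
            return internal.popleft()
        letters.append("E")              # ties are broken here in favour of external nodes
        return external.popleft()

    while len(external) + len(internal) > 1:
        a, b = pop_min(), pop_min()
        internal.append(a + b)           # the new internal node weighs the sum of its children
    pop_min()                            # the last node returned by the algorithm
    return "".join(letters)

For instance, ei_signature([4, 4, 4, 4]) returns “EEEEIII” and ei_signature([1, 2, 4, 8]) returns “EEIEIEI”, consistent with Examples 3 and 4 below.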
The analysis described in Section 3 is based on the number of maximal blocks of consecutive positions formed only of Es in the EI signature of the instance. We can already show some basic properties of this measure:
Given the EI signature of n unsorted positive weights W[1..n]:
The number of occurrences of E in the signature is n;
The number of occurrences of I in the signature is n − 1;
The length of the signature is their sum, 2n − 1;
The signature starts with two Es;
The signature finishes with one I;
The number of occurrences of the substring EI in the signature is one more than the number of occurrences of the substring IE in it;
The number of occurrences of the substring EI in the signature (i.e., the number of maximal blocks of consecutive Es) is at least 1 and at most n − 1.
The first three properties are simple consequences of basic properties of binary trees. The signature starts with two Es as the first two nodes paired are always external. It finishes with one I as the last node returned is always (for n ≥ 2) an internal node. The two last properties are simple consequences of the fact that the signature is a binary string starting with an E and finishing with an I. □
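As a small sanity check, these properties can be verified mechanically on any signature produced by the ei_signature sketch above; the following Python assertions simply restate them.

def check_signature_properties(sig, n):
    # Mechanical restatement of the properties above, for a signature of n >= 2 weights.
    assert sig.count("E") == n                        # one E per external node
    assert sig.count("I") == n - 1                    # one I per internal node
    assert len(sig) == 2 * n - 1                      # their sum
    assert sig.startswith("EE") and sig.endswith("I")
    assert sig.count("EI") == sig.count("IE") + 1     # one more EI than IE
    assert 1 <= sig.count("EI") <= n - 1              # between 1 and n - 1 maximal blocks of Es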
Example 2. For example, consider the text “ABBCCCDDDDEEEEEFFFFFGGGGGGHHHHHHH” formed by the concatenation of one occurrence of “A”, two occurrences of “B”, three occurrences of “C”, four occurrences of “D”, five occurrences of “E”, five (again) occurrences of “F”, six occurrences of “G” and seven occurrences of “H”, so that the corresponding frequencies are W = (1, 2, 3, 4, 5, 5, 6, 7). It corresponds to an instance of size n = 8, with an EI signature of length 15 which starts with EE, finishes with I, and contains only 3 occurrences of the substring EI, corresponding to a decomposition into 3 maximal blocks of consecutive Es, out of a maximal potential number of 7 for the alphabet size n = 8.
Instances such as that presented in Example 2, with very few blocks of Es (more than in Example 3, but far fewer than in the worst case), are easier to solve than instances with many such blocks. For example, an instance W of length n such that its EI signature is composed of a single run of n Es followed by a single run of n − 1 Is (such as the one described in Figure 1) can be solved in linear time, and in particular without sorting the weights: it is enough to assign the codelength ⌊lg n⌋ to the 2^⌈lg n⌉ − n largest weights and the codelength ⌈lg n⌉ to the 2n − 2^⌈lg n⌉ smallest weights. Separating those weights is a simple select operation, supported in amortized linear time by the data structures described in the following section. We describe two other extreme examples, starting with one where all the weights are equal (as a particular case of the weights all being within a factor of two of each other).
Example 3. Consider the text “ba_bb_caca_ba_cc” from Figure 1. Each of the four messages (input symbols) of its alphabet occurs exactly 4 times, so that an optimal prefix free code assigns a uniform codelength of 2 bits to all messages (see Figure 1). There is no need to sort the messages by frequency (and the prefix free code does not yield any information about the order in which the messages would be sorted by frequency), and accordingly the EI signature of this text, “EEEEIII”, has a single block of Es, indicating a very easy instance. The same holds if the text is such that the frequencies of the messages are all within a factor of two of each other.
On the other hand, some instances present the opposite extreme, where no weight is within a factor of two of any other, which “forces” any algorithm computing optimal prefix free codes to sort the weights.
Example 4. Consider the text “aaaaaaaabcc____” from Figure 2 (composed of one occurrence of “b”, two occurrences of “c”, four occurrences of “_”, and eight occurrences of “a”), such that the frequencies of its messages follow an exponential distribution, so that an optimal prefix free code assigns different codelengths to almost all messages (see the third column of the array in Figure 2). Any optimal prefix free code for this instance yields all the information required to sort the messages by frequency. Accordingly, the EI signature “EEIEIEI” of this instance has three blocks of Es (out of three possible ones for this value n = 4 of the alphabet size), indicating a more difficult instance. The same holds for more general distributions, as long as no two message frequencies are within a factor of two of each other.
Those various examples should give an intuition of the features of the instances that our techniques aim to take advantage of. We describe those techniques more formally in the two following sections, starting with the deferred data structure used to partially sort the weights in Section 2.2, and following with the algorithm itself in Section 2.3.
2.2. Partial Sum Deferred Data Structure
Given a multiset W of size n over some ordered alphabet, Karp et al. [23] defined the first deferred data structure supporting, for any value x and any rank r, queries such as rank(x), the number of elements which are strictly smaller than x; and select(r), the value of the r-th smallest value (counted with multiplicity) in W. Their data structure supports q queries in time within O(n(1 + lg q) + q(1 + lg n)), all in the comparison model (Karp et al.’s result [23] is actually better than this formula when the number of queries q is larger than the size n of the multiset, but such a configuration does not occur in the computation of an optimal prefix free code, where the number q of queries is always smaller than or equal to n: for simplicity, we summarize their result by the formula above). To achieve this result, the data structure partially sorts its data in order to minimize the computational cost of future queries, but avoids sorting all of the data if the set of queries does not require it: the queries have then become operators in some sense (they modify the data representation). Note that whereas the running time of each individual query depends on the state of the data representation, the answer to each query is itself independent of that state.
Karp et al.’s data structure [23] supports only rank and select queries in the comparison model, whereas the computation of optimal prefix free codes requires summing pairs of weights from the input, and the algorithm that we propose in Section 2.3 requires summing weights from a range of the input. Such a requirement can be reduced to partialSum queries. Whereas partialSum queries have been defined in the literature based on the positions in the input array, we define such queries here in a way that depends only on the content of the multiset (as opposed to a definition depending on the order in which the multiset is given in the input), so that they can be generalized to deferred data structures.
(Partial sum data structure).
Given n unsorted positive weights W[1..n], a partial sum data structure supports the following queries:
rank(x), the number of elements which are strictly smaller than x in W;
select(r), the value of the r-th smallest value (counted with multiplicity) in W;
partialSum(r), the sum of the r smallest elements (counted with multiplicity) in W.
Example 5. Given the array W = (1, 2, 3, 4, 5, 5, 6, 7),
the number of elements strictly smaller than 5 is rank(5) = 4,
the sixth smallest value is select(6) = 5 (counting with redundancies), and
the sum of the two smallest elements is partialSum(2) = 3.
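For reference, here is an eager (i.e., non-deferred) Python reading of the three queries of the definition above, obtained by simply sorting a copy of the multiset; it fixes the semantics that the deferred data structure must reproduce, at a much higher cost per query.

def rank(W, x):
    # Number of elements of W strictly smaller than x.
    return sum(1 for w in W if w < x)

def select(W, r):
    # Value of the r-th smallest element of W (1-based, counted with multiplicity).
    return sorted(W)[r - 1]

def partialSum(W, r):
    # Sum of the r smallest elements of W (counted with multiplicity).
    return sum(sorted(W)[:r])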
We describe below how to extend Karp et al.’s deferred data structure [23], which already supports rank and select queries on multisets, in order to add support for partialSum queries, with an amortized running time within a constant factor of the asymptotic time of the original solution. Note that the operations performed by the data structure are no longer within the comparison model, but rather in the algebraic decision tree computational model, as they introduce algebraic operations (additions) on the elements of the multiset. The result is a direct extension of Karp et al.’s [23], adding a sub-task taking linear time (updating partial sums in an interval of positions) to a sub-task which already took linear time (partitioning this same interval by a pivot):
Given n unsorted positive weights W[1..n], there is a partial sum deferred data structure which supports q operations of type rank, select, and partialSum in time within O(n(1 + lg q) + q(1 + lg n)), all within the algebraic decision tree computational model.
Karp et al. [23] described a deferred data structure which supports rank and select queries (but not partialSum queries). It is based on median computations and balanced search trees, and performs q queries on n values in time within O(n(1 + lg q) + q(1 + lg n)), all within the comparison model (and hence also in the less restricted algebraic decision tree computational model). We describe below how to modify their data structure in a simple way, so as to support partialSum queries with asymptotically negligible additional cost.
At the initialization of the data structure, compute the n partial sums corresponding to the n positions of the unsorted array. After each median computation and partitioning in a rank or select query, recompute the partial sums on the range of values newly partitioned, which increases the cost of the query only by a constant factor. When answering a partialSum query, perform a select query, and then return the value of the partial sum corresponding to the position returned by the select query: the asymptotic complexity is within a constant factor of the one described by Karp et al. [23].
Barbay et al. [24] further improved Karp et al.’s result [23] with a simpler data structure (a single binary array) and a finer analysis taking into account the gaps between the positions hit by the queries. Barbay et al.’s results [24] can similarly be augmented in order to support partialSum queries while increasing the computational complexity by only a constant factor. This finer result is not relevant to the analysis described in Section 3, given the lack of specific features in the distribution of the gaps between the positions hit by the queries generated by the GDM algorithm described in Section 2.3.
Such a deferred data structure is sufficient to simply execute van Leeuwen’s algorithm [14] on an unsorted array of positive integers, but it would not result in an improvement of the computational complexity: such a simple variant of van Leeuwen’s algorithm [14] merely performs n select operations on the input, effectively sorting the unsorted array. We describe in the next section an algorithm which uses the deferred data structure described above to batch the operations on the external nodes, and to defer the computation of the weights of some internal nodes, so that on many instances the input is not completely sorted at the end of the execution, which indeed reduces the total cost of executing the algorithm.
2.3. Algorithm “Group–Dock–Mix” (GDM) for the Binary Case
There are five main phases in the GDM algorithm: the initialization, three phases (grouping, docking, and mixing, giving the name “GDM” to the algorithm) inside a loop running until only internal nodes are left to process, and the conclusion:
In the initialization phase, initialize the partial sum deferred data structure with the input, and create the first internal node by pairing the two smallest weights of the input.
In the grouping phase, detect and group the weights smaller than the smallest internal node: this corresponds to a run of consecutive Es in the EI signature of the instance.
In the docking phase, pair the consecutive positions of those weights (as opposed to the weights themselves, which can be reordered by future operations) into internal nodes, and pair those internal nodes until the weight of at least one such internal node becomes equal to or larger than the smallest remaining weight: this corresponds to a run of consecutive Is in the EI signature of the instance.
In the mixing phase, rank the smallest unpaired weight among the weights of the available internal nodes, and pair the internal nodes of smaller weight two by two, leaving the largest one unpaired: this corresponds to a transition from I back to E in the EI signature of the instance. This is the most complicated (and most costly) phase of the algorithm.
In the conclusion phase, with i internal nodes left to process (and no external node left), assign codelength ⌊lg i⌋ to the largest ones and codelength ⌈lg i⌉ to the smallest ones: this corresponds to the last run of consecutive Is in the EI signature of the instance.
The algorithm and its complexity analysis distinguish two types of internal nodes: pure nodes, whose descendants were all paired during the same grouping phase; and mixed nodes, each of which either is the ancestor of a mixed node, or pairs a pure internal node with an external node, or pairs two pure internal nodes produced at distinct phases of the GDM algorithm. The distinction is important as the algorithm computes the weight of any mixed node at its creation (potentially generating several data structure operations), whereas it defers the computation of the weight of some pure nodes to later, and never computes the weight of some other pure nodes. We will discuss this further in Section 5, about the lack of instance optimality of the solution presented.
Before describing each phase in more detail, it is important to observe the following invariant of the algorithm:
Given an instance of the optimal binary prefix free code problem formed by n positive weights W[1..n], between each phase of the algorithm, all unpaired internal nodes have weights within a factor of two of each other (i.e., the maximal weight of an unpaired internal node is strictly smaller than twice the minimal weight of an unpaired internal node).
The property is proven by checking that each phase establishes or preserves it:
Initialization: there is only one internal node at the end of this phase, hence the conditions for the property to hold are established.
Grouping: no internal node is created, hence the property is preserved.
Docking: pairing internal nodes until at least one of them has weight equal to or larger than the smallest remaining weight (future external node) ensures the property.
Mixing: as this phase pairs all internal nodes except possibly the one of largest weight, the property is preserved.
Conclusion: A single node is left at the end of the phase, hence the property.
As the initialization phase establishes the property and each other phase preserves it, the property holds throughout the execution of the algorithm. □
We now proceed to describe each phase in more detail:
Initialization: Initialize the partial sum deferred data structure with the input; compute the weight currentMinInternal of the first internal node through the operation partialSum(2) (the sum of the two smallest weights); create this internal node, of weight currentMinInternal and children 1 and 2 (the positions of the first and second weights, in any order); compute the weight currentMinExternal of the first unpaired weight (i.e., the first available external node) by the operation select(3); set the variables nbInternals to 1 and nbExternalProcessed to 2.
Grouping: Compute the position r of the first unpaired weight larger than the smallest unpaired internal node, through the operation rank(currentMinInternal); pair two by two the consecutive indices of the unpaired weights at positions up to r (all but the last one if their number is odd), to form pure internal nodes; compute the parity of the number of unpaired weights smaller than the first unpaired internal node; if it is odd, select the r-th weight through the operation select(r), compute the weight of the first unpaired internal node, compare it with the next unpaired weight, and form one mixed node by combining the smaller of the two with the extraneous weight.
Docking: Pair all internal nodes by batches (by Property 1, their weights are all within a factor of two of each other, so all internal nodes of a generation are processed before any internal node of the next generation); after each batch, compare the weight of the largest such internal node (computed through a partialSum operation on its range if it is a pure node; otherwise it is already known) with the first unpaired weight: if it is smaller, pair another batch; if it is larger, the phase is finished.
Mixing: Rank the smallest unpaired weight among the weights of the available internal nodes, by a doubling search starting from the beginning of the list of internal nodes. For each comparison, if the internal node’s weight is not already known, compute it through a partialSum operation on the corresponding range (if it is a mixed node, it is already known). If the number r of internal nodes of weight smaller than the unpaired weight is odd, pair all but one of them, compute the weight of the last one, and pair it with the unpaired weight. If r is even, pair all of the r internal nodes of weight smaller than the unpaired weight, compare the weight of the next unpaired internal node with the weight of the next unpaired external node, and pair the minimum of the two with the first unpaired weight. If there are some unpaired weights left, go back to the grouping phase; otherwise continue to the conclusion phase.
Conclusion: There are only internal nodes left, and their weights are all within a factor of two of each other. Pair the nodes two by two in batches as in the docking phase, computing the weight of an internal node only when the number of internal nodes in a batch is odd.
The combination of those phases forms the GDM algorithm, which computes an optimal prefix free code given an unsorted set of positive integers.
The tree returned by the GDM algorithm describes an optimal binary prefix free code for its input.
In the next section, we analyze the number q of rank, select, and partialSum queries performed by the GDM algorithm, and deduce from it the complexity of the algorithm in terms of algebraic operations.