1. Introduction
We describe a data structure for storing and updating a set S of integers whose elements are values taken in the range E = {0, 1, …, N − 1}, with N a given integer. Typically, the "superset", or "universe", E corresponds to the indices of the N elements of a problem instance, and we are interested in storing, using and updating subsets of the universe.
A set of integers is a very basic mathematical object that finds use in countless computer applications. Performing the basic set operations (i.e., insertion, deletion, membership test and element enumeration) as efficiently as possible would benefit all algorithms in which sets of integers play a relevant role (to mention just one example, consider algorithms operating on graphs, for which the universe is the set of nodes/edges of a graph and a procedure needs to store and update a subset of nodes/edges). The goal of this paper is to provide optimal set primitives such as those mentioned above.
The literature on algorithms and data structures is very rich, and various possible representations and implementations of such sets are discussed in many textbooks, among which there are some classics, such as [1,2,3]. The implementations of sets fall mainly into two categories: (i) those for which operations such as insertion, deletion and membership test are fast, but other operations, such as listing all the elements of the set or performing union and intersection, are slow; (ii) those for which insertion, deletion and membership test are slow, but listing all the elements and performing union and intersection are fast. In the first category, we recall the bitmap representation of the set (which uses an array v of N booleans, where v[i] = true iff i ∈ S) and some forms of hash tables using either optimal hashing or buckets [4]. In the second category, we can represent a set by an array of size N in which only the first |S| entries are filled and contain the elements of the set, a structure called a partially filled array (PFA). Usually the PFA is unsorted (maintaining the elements sorted can speed up operations such as the membership test, but it slows down insertion and deletion). Another option is to store the elements in a linked list, again unsorted. Finally, one can resort to tree-like structures, such as AVL trees, a self-balancing type of binary search tree [5]. For these structures, insertion, deletion and membership tests are reasonably fast (but not optimal), and the trade-off is that some other operations are slowed down by the overhead paid in order to keep the structure sorted.
Main Results and Paper Organization
In this paper, we propose a new data structure, called FASTSET, that has optimal time performance for all the main set operations. In particular, operations such as insertion and deletion of an element, membership test, and access to an element via an index in {1, …, |S|} are all O(1). From these primitive operations, we derive more complex operations, such as listing all elements at cost O(|S|), computing the intersection of two sets (i.e., C := A ∩ B) at cost O(min{|A|, |B|}) and the union of two sets (i.e., C := A ∪ B) at cost O(|A| + |B|).
In Table 1 we report a list of operations and their cost, both for the various aforementioned types of data structures for set representation and for FASTSET. As far as the memory requirement is concerned, for some of them it is O(N) (namely Bitmap, PFAs, and FASTSET), for some it is O(|S|) (namely Linked List and AVL Tree), while it is O(|S| + b) for the Bucket Hashtable with b buckets.
The remainder of the paper is organized as follows. In Section 2 we describe the implementation of the various set operations for a FASTSET. In Section 3 we give two simple examples of algorithms using set data structures: in particular, we describe two popular greedy algorithms, one for Vertex Cover and the other for Max-Cut. Section 4 is devoted to computational experiments, in which FASTSETs are compared to various set implementations from the standard library of the Java distribution [6]. We have chosen Java since it is one of the most popular languages and offers state-of-the-art implementations of all the data structures we want to compare against. We remark, however, that the main contribution of this paper is of a theoretical nature, and thus the results are valid for all implementations (i.e., in whichever language) of the data structures discussed. Section 5 discusses the space requirements and limitations of our data structure, and some conclusions are drawn in Section 6.
2. Implementation
A FASTSET is implemented by two integer arrays, of size N + 1 and N, which we call elem[ ] and pos[ ], respectively. The array elem[ ] contains the elements of S, in no particular order, consecutively between the positions 1 and |S|, while elem[0] stores the value of |S|. The array pos[ ] has the function of specifying, for each i ∈ E, whether i ∈ S or i ∉ S, and, in the former case, it tells the position occupied by i within elem[ ]. More specifically,

pos[i] = k if i ∈ S and elem[k] = i, while pos[i] = 0 if i ∉ S.
The main idea used to achieve optimal time performance is remarkably simple. Our goal is to combine the benefits of a PFA, in which listing all elements is optimal (i.e., it is O(|S|)) but accessing individual elements, for removal and membership tests, is slow (i.e., it is O(|S|)), with those of a bitmap implementation, where accessing individual elements is optimal (i.e., it has cost O(1)) but listing all elements is slow (i.e., it is O(N)). To this end, in our implementation we use the array elem[ ] as a PFA, and the array pos[ ] as a bitmap. Moreover, pos[ ] is not only a bitmap: it also provides a way to update the partially filled array elem[ ] after each deletion in time O(1) rather than O(|S|).
We will now describe how the set operations can be implemented with the complexity stated in Table 1. The implementation is quite straightforward. We will use pseudocode similar to C. In particular, our functions will take pointers to FASTSETs as parameters, in order to avoid passing the entire data structures.
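For concreteness, the two arrays can be packaged in a record as follows (a minimal sketch: the paper's pseudocode only accesses the fields elem and pos, so the exact declaration below is our assumption):

struct FASTSET {
    int* elem   // elem[0] holds |S|; elem[1..|S|] hold the elements of S, unsorted
    int* pos    // pos[i] = position of i within elem[ ] if i ∈ S, and 0 otherwise
}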
2.1. Membership
To check for membership of an element v, we just need to look at pos[v] and see if it is non-zero, at cost O(1).
Boolean Belongs( FASTSET* s, int v ) {
    return ( s->pos[v] > 0 )   // pos[v] is non-zero iff v ∈ S
}
2.2. Cardinality
The cardinality |S| is readily available in elem[0] at cost O(1).
int Cardinality( FASTSET* s ) {
    return s->elem[0]   // elem[0] always holds |S|
}
2.3. Insertion
Each insertion happens at the end of the region of consecutive elements stored in elem[ ]. Since we have direct access to the last element through elem[0], the cost is O(1).
void Insert( FASTSET* s, int newel ) {
    if ( Belongs( s, newel ) ) return   // newel is already present in s
    s->elem[0] := s->elem[0] + 1        // increase |S|
    s->elem[s->elem[0]] := newel        // append newel at the end
    s->pos[newel] := s->elem[0]         // record its position
}
Please note that there is no need to test for a full-set condition, since there is enough space for the largest possible subset (namely the whole E), and no element can be repeated in this data structure. See Figure 1a–e for examples of insertions in a FASTSET and the corresponding updates of the data structure.
2.4. Deletion
Assume we want to delete an element v (which may or may not be in S), so that S := S \ {v}. When v ∈ S, this is obtained by copying the last element of elem, let it be w, onto v (by using pos, we know where v is) and decreasing |S| (i.e., elem[0]). In doing so, we update pos[w], assigning pos[v] to it, and then set pos[v] := 0. The final cost is O(1). See Figure 1d for an example of deletion in a FASTSET and the corresponding update of the data structure.
void Delete( FASTSET* s, int v ) {
    if ( NOT Belongs( s, v ) ) return   // v was not in S
    int w := s->elem[s->elem[0]]        // w is the last element
    s->elem[s->pos[v]] := w             // w overwrites v
    s->pos[w] := s->pos[v]
    s->pos[v] := 0
    s->elem[0] := s->elem[0] - 1        // decrease |S|
}
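As a small worked example (with hypothetical values): suppose S = {2, 5, 8} is stored as elem = [3, 2, 5, 8], with pos[2] = 1, pos[5] = 2, pos[8] = 3. Then a single call evolves the structure as follows:

Delete( s, 2 )   // w := 8 overwrites position 1: elem becomes [2, 8, 5],
                 // and pos[8] := 1, pos[2] := 0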
2.5. Accessing an Element and Listing All Elements
We can have direct access to each element of S, via an index in {1, …, |S|}, at cost O(1). From this it follows that we can list all elements of S at cost O(|S|). The corresponding procedures are the following:
int GetElement( FASTSET* s, int k ) {
    return s->elem[k]   // k is assumed to be in {1, ..., |S|}
}
int* GetAll( FASTSET* s ) {
    int* list := malloc( s->elem[0] * sizeof(int) )
    for ( int k := 1; k <= s->elem[0]; k++ )
        list[k-1] := s->elem[k]   // copy elem[1..|S|] into a 0-based array
    return list
}
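Note that GetAll allocates a fresh array of |S| integers; in a C-like setting the caller is responsible for releasing it, e.g.:

int* a := GetAll( s )
// ... use a[0], ..., a[Cardinality(s)-1] ...
free( a )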
2.6. Intersection
Assume A and B are sets. We want to compute their intersection C := A ∩ B and store it in C, initially empty. We go through all the elements of the smaller set, and, if they are also in the other set, we put them in C. The final cost is O(min{|A|, |B|}).
void Intersection( FASTSET* A, FASTSET* B, FASTSET* C ) {
    FASTSET* smaller
    FASTSET* other
    if ( Cardinality(A) < Cardinality(B) ) {
        smaller := A
        other := B
    } else {
        smaller := B
        other := A
    }
    for ( int k := 1; k <= Cardinality(smaller); k++ )
        if ( Belongs( other, GetElement( smaller, k ) ) )
            Insert( C, GetElement( smaller, k ) )
}
2.7. Union
Assume A and B are sets. We want to compute their union C := A ∪ B and store it in C, initially empty. We go through all the elements of each of the two sets and put them in C; the membership test inside Insert guarantees that elements belonging to both A and B are inserted only once. The final cost is O(|A| + |B|).
void Union( FASTSET* A, FASTSET* B, FASTSET* C ) {
    for ( int k := 1; k <= Cardinality(A); k++ )
        Insert( C, GetElement( A, k ) )
    for ( int k := 1; k <= Cardinality(B); k++ )
        Insert( C, GetElement( B, k ) )   // duplicates are skipped by Insert
}
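As an illustration of how the routines compose (a sketch with hypothetical element values, in the same pseudocode as above):

FASTSET* A := Create( 10 )   // universe {0, ..., 9}
FASTSET* B := Create( 10 )
FASTSET* C := Create( 10 )
FASTSET* D := Create( 10 )
Insert( A, 1 )  Insert( A, 3 )  Insert( A, 5 )   // A = {1, 3, 5}
Insert( B, 3 )  Insert( B, 5 )  Insert( B, 7 )   // B = {3, 5, 7}
Intersection( A, B, C )   // C = {3, 5}, at cost O(min{|A|, |B|})
Union( A, B, D )          // D = {1, 3, 5, 7}, at cost O(|A| + |B|)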
2.8. Initialization
A FASTSET is initialized by specifying the range for the element values, and then simply allocating memory for the two arrays:
FASTSET* Create( int N ) {
    FASTSET* p := calloc( 1, sizeof(FASTSET) )
    p->pos := calloc( N, sizeof(int) )         // zero-initialized
    p->elem := malloc( (N+1) * sizeof(int) )
    p->elem[0] := 0                            // empty set
    return p
}
We assume that calloc allocates a block of memory initialized to 0s, in which case there is nothing else to be done. If, on the other hand, the allocator returns a block of memory not initialized to 0, we must perform a for loop to initialize pos[ ] to 0s (note that there is no need to initialize the other entries of elem[ ], since they will be written before being read).
Finally, sometimes the following operation is useful for re-initializing a FASTSET, since, for large N, it can be faster (namely O(|S|)) than creating an empty FASTSET from scratch (the latter is O(N), because of calloc zeroing):
void Clean( FASTSET* s ) {
    for ( int i := 1; i <= s->elem[0]; i++ )
        s->pos[s->elem[i]] := 0   // reset only the positions of current elements
    s->elem[0] := 0
}
It is clear that both the creating and the cleaning algorithms for a FASTSET, as they are outlined, require more than constant time (O(N) and O(|S|), respectively), because of the zeroing of entries of the pos[ ] array (by calloc or by a loop, respectively). It is possible, however, to use a trick originally outlined in [7] (Exercise 2.12) to avoid the initialization altogether while leaving "garbage" in pos[ ]:
FASTSET* Create( int N ) {                     // now O(1)
    FASTSET* p := malloc( sizeof(FASTSET) )
    p->pos := malloc( N * sizeof(int) )        // unknown data in pos
    p->elem := malloc( (N+1) * sizeof(int) )
    p->elem[0] := 0                            // empty set
    return p
}

void Clean( FASTSET* s ) {                     // now O(1)
    s->elem[0] := 0                            // no loop needed: garbage left in pos
}
We only need a slightly more complex membership check, which must now handle the garbage possibly present in pos[ ]:
Boolean Belongs( FASTSET* s, int v ) {   // still O(1)
    int k := s->pos[v]                   // possibly garbage
    return ( k > 0 AND k <= s->elem[0] AND s->elem[k] == v )
}
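To see why the stricter check remains safe, consider the following sketch (hypothetical values; recall that pos[ ] may now contain arbitrary leftover data):

FASTSET* s := Create( 100 )   // O(1) version: pos[ ] holds garbage
Insert( s, 7 )                // elem = [1, 7], pos[7] := 1
Belongs( s, 7 )               // k = 1, k <= elem[0], elem[1] == 7 -> true
Belongs( s, 13 )              // whatever garbage k = pos[13] holds, either k is
                              // outside 1..elem[0] or elem[k] != 13 -> false
Clean( s )                    // elem[0] := 0; pos[7] still holds the stale value 1
Belongs( s, 7 )               // k = 1 > elem[0] = 0 -> false, as desired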
4. Computational Experiments
In this section, we report on some computational experiments (performed on Intel
® Core™ i3 CPU M 350 @ 2.27 GHz with 2.8 GB of RAM) in which we have compared the performance of
FASTSETs to that of other set data structures included in the standard library of the Java distribution. In particular, Java provides three classes implementing sets via different data structures, namely (i)
BitSet [
8] implementing the Bitmap data structure; (ii)
TreeSet [
9] implementing a set as a self-balancing tree with logarithmic cost for all main operations; (iii)
HashSet [
10] implementing a set by a hash table. We have coded the class
FASTSET in Java and have run a first set of experiments in which we have performed a large number of random operations (such as insertions and removals of random elements) for various values of
N. The results are listed in
Table 2. From the table it appears that FASTSET and BitSet perform very similarly with respect to single-element operations, and they are both better than the other two data structures. It should be remarked that in actual implementations such as this one, Bitmaps are very effective at these types of operations, since they can exploit the speed of low-level instructions (such as logical operators) for accessing the individual bits of a word. When we turn to listing all elements, however, FASTSETs outperform the other implementations. In particular, a GetAll after 50,000 random insertions on a FASTSET is from 10 up to 30 times faster than for the other data structures.
In a second run of experiments, we have considered the two simple combinatorial algorithms described in
Section 3, which make heavy use of sets during their execution. The pseudocode listed in Algorithms 1 and 2 refers to an implementation using
FASTSETs. The algorithms were translated into equivalent algorithms using BitSet, TreeSet and HashSet in place of FASTSET. The translation was made so as to be as fair and accurate as possible. This was easy for most operations, since they translate directly into equivalent operations on the other data structures. Other operations, such as using GetElement() in a loop to access all the elements of a FASTSET, were realized by calling the iterator methods that the Java classes provide, which are optimized for sequential access to all the elements of each structure.
For both Vertex Cover and Max-Cut, the tests were run on random graphs of n vertices in which each edge has probability p of being in the graph (so that the expected number of edges is p n(n − 1)/2). We have used several values of n and p; the largest graphs have 5000 nodes. For each pair (n, p) we have generated 5 instances and computed the average running time of each algorithm. The results are reported in Table 3 for Vertex Cover and in Table 4 for Max-Cut. For both problems, the implementation using FASTSETs was the fastest. In particular, on the Vertex Cover problem, the use of FASTSETs yields running times between one quarter and one half of those obtained with the other data structures. The results for Max-Cut are even better, with the other data structures being, on average, from 3 to 30 times slower than FASTSETs on this procedure.
5. Space Complexity and Limitations
The data structure we have described may not be appropriate when the universe of all possible values is very large, especially if the amount of available memory is an issue. Indeed, the memory requirement is O(N), and this can be prohibitive (e.g., when N is a large power of 2). In this case, even if the memory is available, it is hard to assume that the allocation of such a chunk of memory is an O(1) operation.
The same problem, however, would be true of other data structures requiring O(N) memory, such as bitmaps and optimal hashtables. Clearly, when the size of the set tends to become as large as N, all the possible data structures would incur the same memory problem (indeed, in this case our structure would be optimal as far as memory consumption is concerned, since it would be linear in the size of the set). When, on the other hand, |S| is much smaller than N, data structures based on dynamic allocation of memory (such as linked lists and AVL trees) are better than FASTSETs as far as memory consumption is concerned. We remark, however, that the case of an exponentially large N is a rare situation, since most of the time the integers that we deal with in our sets are indices identifying the elements of a size-N array representing a problem instance.