To evaluate RQ-TSC, we integrate our design into an existing benchmark [6] and compare it against several state-of-the-art approaches: Bundle [10]; EBR-RQ [6], a lock-free variant of Arbel-Raviv and Brown’s epoch-based reclamation technique; RLU [9], which provides RCU-like synchronization for concurrent writers; and vCAS [11], a lock-based linearizable data structure implementation that uses vCAS objects in place of pointers and metadata (using source code from [10]). Note that RLU is omitted from the skip list results because no implementation is available. All competitors generate timestamps from a single atomic counter with a backoff strategy [36] to reduce contention. Furthermore, to explore the performance impact of contention on a single atomic variable, we also evaluate Bundle-no-backoff, a variant of Bundle that increments the single atomic counter without applying the backoff strategy.
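For concreteness, the following minimal sketch shows the counter-with-backoff pattern in C++11; the names and the exact backoff policy are illustrative assumptions, not any competitor’s actual code.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> global_ts{0};  // single shared timestamp counter

// Hypothetical timestamp acquisition: CAS on the shared counter with
// capped exponential backoff to throttle retries under contention.
uint64_t acquire_timestamp() {
    uint64_t delay = 1;
    for (;;) {
        uint64_t seen = global_ts.load(std::memory_order_acquire);
        if (global_ts.compare_exchange_weak(seen, seen + 1,
                                            std::memory_order_acq_rel))
            return seen + 1;
        // Back off before retrying; real implementations tune these bounds.
        for (uint64_t i = 0; i < delay; ++i)
            std::this_thread::yield();
        if (delay < 1024) delay <<= 1;
    }
}
```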
All code is written in C++11, compiled with g++ 11.4 using -std=c++11 -O3 -mcx16, and linked with the jemalloc [37] multi-threaded memory allocator. Experiments are conducted on a machine equipped with two Xeon Platinum 8336C processors, featuring 128 hyperthreaded cores in total and a 108 MB L3 cache, running Ubuntu 22.04. Memory management for all methods is handled via epoch-based memory reclamation. For each experiment below, threads run a predefined mix of update, contains, and range query operations on uniformly random keys. The data structure is preloaded with half of the keys in the designated key range, and updates are split equally between insertions and deletions. Workloads are reported as U-C-R, where U is the percentage of updates, C is the percentage of contains, and R is the percentage of range queries. All reported results are an average of three runs of three seconds each. The key ranges per data structure are 10,000 for the lazy list and 1,000,000 for both the skip list and BST. Range queries span 50 keys by default, unless stated otherwise.
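The per-thread benchmark loop described above can be summarized as follows; the Set interface, the stop flag, and the parameter names are illustrative assumptions rather than the benchmark’s actual code.

```cpp
#include <atomic>
#include <random>

struct Set {                         // placeholder data structure interface
    virtual bool insert(long k) = 0;
    virtual bool remove(long k) = 0;
    virtual bool contains(long k) = 0;
    virtual int  range_query(long lo, long hi) = 0;
    virtual ~Set() {}
};

std::atomic<bool> stop{false};       // raised by the driver after 3 s

// U = % updates (split between inserts and deletes), C = % contains;
// the remaining (100 - U - C)% are range queries of length range_len.
void worker(Set& set, int U, int C, long key_max, int range_len) {
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long> key(1, key_max);
    std::uniform_int_distribution<int>  op(0, 99);
    while (!stop.load(std::memory_order_relaxed)) {
        int  r = op(rng);
        long k = key(rng);
        if (r < U)          (r & 1) ? set.insert(k) : set.remove(k);
        else if (r < U + C) set.contains(k);
        else                set.range_query(k, k + range_len - 1);
    }
}
```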
Data Structure Performance
Figure 3 shows the total throughput of operations on a lazy list as the number of worker threads varies, at both 10% updates and 90% updates with 10% range queries.
At 10% updates, RLU incurs update-side overhead because writers must wait for in-progress reads, which becomes apparent as more threads are added. vCAS adds an extra dereference per node, since every access must go through the node’s version list (sketched below), which hurts read-intensive workloads. At 90% updates, RLU and vCAS both perform poorly; RLU in particular suffers because its heavy update synchronization degrades sharply under update-intensive workloads. In both cases, RQ-TSC, Bundle, Bundle-no-backoff, and EBR-RQ perform similarly. This experiment also suggests that contention on a single atomic counter has no significant impact on small data structures such as this 10,000-key lazy list.
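The sketch below illustrates the per-node indirection just described; the Version layout and the read_at helper are hypothetical, not vCAS’s actual implementation.

```cpp
#include <cstdint>

// Each mutable field holds a timestamped version list instead of a raw value.
struct Version {
    uint64_t ts;       // timestamp at which this version was installed
    void*    value;    // payload, e.g., a child pointer
    Version* next;     // older versions
};

// Where an unversioned structure does one load, a versioned read must
// dereference the list head and possibly chase older versions, adding
// cache misses on every node visited during traversal.
void* read_at(Version* head, uint64_t snapshot_ts) {
    Version* v = head;
    while (v != nullptr && v->ts > snapshot_ts)
        v = v->next;   // skip versions newer than the reader's snapshot
    return v ? v->value : nullptr;
}
```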
Varying Workload Mix.
Figure 4 and Figure 5 report the throughput (Mops/s) of different workloads for the skip list and Citrus tree.
Experimental results show that data structures enhanced with our method outperform RLU, EBR-RQ, and Bundle under mixed workloads. With the exception of corner cases involving 0% or 100% updates, RQ-TSC matches or exceeds the performance of the next best approach in the majority of scenarios. Under such mixed conditions, RQ-TSC delivers speedups of up to 1.3× over EBR-RQ in the skip list (Figure 4d) and 2.3× in the Citrus tree (Figure 5a). Against RLU and Bundle, RQ-TSC reaches up to 1.9× (Figure 5f) and 1.2× (Figure 4c) better performance, respectively.
For the skip list, EBR-RQ performs comparatively well thanks to its lock-free helping mechanism: when a range query encounters an unfinished update operation, it helps complete that update before proceeding with its own operation. This helping benefits update performance but may hurt range query performance. RQ-TSC performs as well as or better than EBR-RQ once threads grow beyond a single NUMA node at 50% and 90% updates. For read-intensive workloads (Figure 4a,b,e), RQ-TSC performs similarly to Bundle and Bundle-no-backoff: in these schemes the atomic counter is incremented only by update operations, so optimizing the counter does not benefit read-intensive workloads. The same holds for the Citrus tree discussed below. For write-intensive workloads (Figure 4c,d,f), Bundle-no-backoff performs significantly worse than RQ-TSC beyond a NUMA node due to cross-node coherence traffic caused by contention on the single atomic counter; for instance, Bundle-no-backoff is 29% worse than RQ-TSC at 90% updates. This implies that using a single atomic counter to generate timestamps can significantly hurt performance. Although a backoff strategy can reduce contention on the atomic counter, it still suffers from global spinning and makes the backoff timeouts difficult to tune. RQ-TSC sidesteps the shared counter entirely by reading a hardware timestamp, as sketched below.
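A hardware timestamp read, by contrast, touches no shared cache line. Below is a minimal sketch of such a read on x86-64; the exact fencing and any cross-socket handling RQ-TSC performs are not shown here and should be treated as assumptions.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Read the processor's invariant time-stamp counter. RDTSCP waits for
// prior instructions to retire before sampling, giving a partially
// ordered read without any cross-core communication.
inline uint64_t hw_timestamp() {
    unsigned aux;                  // receives IA32_TSC_AUX (core/socket id)
    return __rdtscp(&aux);
}
```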
For the Citrus tree, RQ-TSC significantly outperforms EBR-RQ across mixed workloads, except at 90% updates once the number of threads exceeds 48. This is because EBR-RQ has to visit additional nodes in the limbo lists and announced deletions to collect the nodes belonging to its snapshot. RLU’s performance degrades significantly as updates become a larger share of the workload (Figure 5c,d). At high-update workloads, the Citrus tree does not scale well beyond one NUMA node, regardless of the presence of range queries. This happens because the update operation’s traversal section is protected by RCU’s read lock, which is acquired atomically; moreover, before a delete operation physically removes a node, it calls a synchronization function that reads the atomic lock variables of all threads, which can cause significant cross-node coherence traffic, as the simplified sketch below illustrates.
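The following simplified user-level RCU sketch (our own illustration, not the Citrus tree’s code) shows where the traffic comes from: the deleter must load a per-thread slot for every thread, pulling remote cache lines across the interconnect.

```cpp
#include <atomic>
#include <cstdint>

constexpr int kMaxThreads = 128;

struct alignas(64) ReaderSlot {              // one cache line per thread
    std::atomic<uint64_t> epoch{0};          // 0 = quiescent
};
ReaderSlot reader[kMaxThreads];
std::atomic<uint64_t> global_epoch{1};

void read_lock(int tid) {                    // guards an update's traversal
    reader[tid].epoch.store(global_epoch.load(std::memory_order_acquire),
                            std::memory_order_release);
}
void read_unlock(int tid) {
    reader[tid].epoch.store(0, std::memory_order_release);
}

// Called before physically removing a node: scan every thread's lock
// variable until it is quiescent or has advanced past the old epoch.
// Each load can miss to a remote NUMA node, causing the coherence
// traffic described above.
void wait_for_readers(int nthreads) {
    uint64_t cur = global_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
    for (int t = 0; t < nthreads; ++t) {
        uint64_t e;
        do {
            e = reader[t].epoch.load(std::memory_order_acquire);
        } while (e != 0 && e < cur);
    }
}
```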
Relative to vCAS, RQ-TSC performs 1.2×–1.6× better in low-update workloads (namely 0%, 2%, and 10% updates) across all thread configurations, thanks to its lightweight traversal. In workloads with intensive updates (50% and 90% updates), RQ-TSC outperforms vCAS and Bundle on the skip list, because it relies on hardware timestamps instead of an atomic counter to produce version timestamps, substantially reducing contention between concurrent threads. For the Citrus tree, however, vCAS performs better at 50% and 90% updates when there are large numbers of threads.
This is because timestamp assignment in vCAS is wait-free, that is, any concurrent operation can help complete it, whereas RQ-TSC and Bundle must wait when an operation encounters a pending version; the sketch below contrasts the two policies. Even so, RQ-TSC still outperforms Bundle due to its efficient versioning mechanism and its scalable timestamp generation.
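The two policies can be contrasted with the following hedged sketch; the PENDING encoding, field names, and placeholder clock are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr uint64_t PENDING = 0;      // timestamp not yet assigned

struct Version { std::atomic<uint64_t> ts{PENDING}; };

// Placeholder clock; the real source differs per scheme (shared counter
// for vCAS/Bundle, hardware TSC for RQ-TSC).
inline uint64_t current_time() {
    static std::atomic<uint64_t> clk{1};
    return clk.fetch_add(1, std::memory_order_relaxed);
}

// vCAS-style wait-free assignment: any thread that observes PENDING helps
// install a timestamp with a CAS, so no one blocks on the updater.
uint64_t read_ts_helping(Version& v) {
    uint64_t t = v.ts.load(std::memory_order_acquire);
    if (t == PENDING) {
        uint64_t now = current_time();
        v.ts.compare_exchange_strong(t, now, std::memory_order_acq_rel);
        t = v.ts.load(std::memory_order_acquire);  // ours or the helper's
    }
    return t;
}

// Bundle/RQ-TSC-style (as described above): wait until the updater itself
// publishes the timestamp.
uint64_t read_ts_waiting(Version& v) {
    uint64_t t;
    while ((t = v.ts.load(std::memory_order_acquire)) == PENDING)
        std::this_thread::yield();
    return t;
}
```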
In the read-only workload (Figure 5e), RLU shows good performance because it incurs no synchronization costs, and it attains the best performance in the Citrus tree. However, the results show that even a low percentage of updates can substantially degrade RLU performance (Figure 5b) due to the synchronization caused by writers. Consequently, under update-intensive workloads (Figure 5c,d,f), RLU exhibits the lowest performance. In contrast, EBR-RQ requires each range query to increment a global timestamp counter, resulting in poorer efficiency under read-heavy workloads than under update-intensive scenarios. While EBR-RQ performs efficiently in update-only conditions (Figure 5f), its performance degrades significantly in read-only workloads (Figure 5e), especially beyond a single NUMA node. Bundle moves the responsibility of incrementing the global timestamp to update operations, which hurts performance in write-dominated workloads; this is apparent in the skip list.
Table 1 and Table 2 show execution-time statistics (in cycles) for the 50% update and 90% update workloads on the skip list, presented as five key percentiles (5%, 25%, 50%, 75%, and 95%). The tested operation types are contains, insertions, deletions, and range queries. A value a at percentile p indicates that p% of executions of that operation type completed in at most a cycles. Since the 90% update workload (90-0-10) contains no contains operations, the corresponding entries in Table 2 are left blank. We found that for both workloads, RQ-TSC is significantly faster than Bundle and vCAS for update operations (i.e., inserts and deletes) because their updates require atomically incrementing the global counter to generate timestamps, whereas RQ-TSC relies on hardware timestamps. For range queries, RQ-TSC is significantly faster than EBR-RQ because EBR-RQ requires traversing extra limbo lists and announced deletions.
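For reference, percentiles like those in the tables can be computed from recorded cycle samples with the nearest-rank method, as in the sketch below; the exact methodology behind the tables may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: smallest sample such that at least p% of all
// samples are at or below it. Assumes cycles is non-empty and p in (0,100].
uint64_t percentile(std::vector<uint64_t>& cycles, double p) {
    std::sort(cycles.begin(), cycles.end());
    size_t rank = static_cast<size_t>(
        std::ceil(p / 100.0 * static_cast<double>(cycles.size())));
    return cycles[rank - 1];
}
```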
Varying Range Query Size.
Figure 6 presents the throughput (Mops/s) of update and range query operations on a Citrus tree. The setup partitions a single NUMA node so that 32 hardware threads (half the total) run 100% updates and the remaining 32 run 100% range queries, while the range query length is varied.
We note that the update performance of RQ-TSC, Bundle, EBR-RQ, and vCAS is not adversely affected by range query length, since updates are not blocked by range queries. RLU, however, experiences a notable throughput reduction once the range query size surpasses 64. This stems from RLU’s requirement to synchronize with in-progress read operations, so longer range queries substantially harm update performance. For range queries, EBR-RQ’s throughput is mostly stable across range sizes: because every range query has to check the limbo lists and announced-deletion lists and pay for expensive DCSS operations, even short range queries perform poorly. RQ-TSC, Bundle, vCAS, and RLU, on the other hand, perform far better on short range queries.