To evaluate RQ-TSC, we integrate our design into an existing benchmark [6] and compare it against several state-of-the-art approaches: Bundle [10]; EBR-RQ [6], a lock-free variant of Arbel-Raviv and Brown’s epoch-based reclamation technique; RLU [9], which provides RCU-like synchronization for concurrent writers; and vCAS [11], a lock-based linearizable data structure implementation that uses vCAS objects in place of pointers and metadata (using source code from [10]). Note that RLU is omitted from the skip list results because no implementation is available. All competitors generate timestamps from a single atomic counter with a backoff strategy [36] to reduce contention. Furthermore, to explore the performance impact of contention on a single atomic variable, we also evaluate Bundle-no-backoff, a variant of Bundle that increments the single atomic counter without applying the backoff strategy.
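For concreteness, the following minimal sketch shows the counter-with-backoff pattern in C++11; the names and the exact backoff policy are illustrative assumptions, not any competitor’s actual code.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> global_ts{0};  // single shared timestamp counter

// Hypothetical timestamp acquisition: CAS on the shared counter with
// capped exponential backoff to throttle retries under contention.
uint64_t acquire_timestamp() {
    uint64_t delay = 1;
    for (;;) {
        uint64_t seen = global_ts.load(std::memory_order_acquire);
        if (global_ts.compare_exchange_weak(seen, seen + 1,
                                            std::memory_order_acq_rel))
            return seen + 1;
        // Back off before retrying; real implementations tune these bounds.
        for (uint64_t i = 0; i < delay; ++i)
            std::this_thread::yield();
        if (delay < 1024) delay <<= 1;
    }
}
```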
All code is written in C++11, compiled with g++ 11.4 using -std=c++11 -O3 -mcx16, and linked with the jemalloc [37] multi-threaded memory allocator. Experiments are conducted on a machine equipped with two Xeon Platinum 8336C processors, featuring 128 hyperthreaded cores in total and a 108 MB L3 cache, running Ubuntu 22.04. Memory management for all methods is handled via epoch-based memory reclamation. For each experiment below, threads run a predefined mix of update, contains, and range query operations on uniformly random keys. The data structure is preloaded with half of the keys in the designated key range, and updates are split equally between insertions and deletions. Workloads are reported as U-C-R, where U is the percentage of updates, C is the percentage of contains, and R is the percentage of range queries. All reported results are an average of three runs of three seconds each. The key ranges per data structure are 10,000 for the lazy list and 1,000,000 for both the skip list and BST. Range queries span 50 keys by default, unless stated otherwise.
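The per-thread benchmark loop described above can be summarized as follows; the Set interface, the stop flag, and the parameter names are illustrative assumptions rather than the benchmark’s actual code.

```cpp
#include <atomic>
#include <random>

struct Set {                         // placeholder data structure interface
    virtual bool insert(long k) = 0;
    virtual bool remove(long k) = 0;
    virtual bool contains(long k) = 0;
    virtual int  range_query(long lo, long hi) = 0;
    virtual ~Set() {}
};

std::atomic<bool> stop{false};       // raised by the driver after 3 s

// U = % updates (split between inserts and deletes), C = % contains;
// the remaining (100 - U - C)% are range queries of length range_len.
void worker(Set& set, int U, int C, long key_max, int range_len) {
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long> key(1, key_max);
    std::uniform_int_distribution<int>  op(0, 99);
    while (!stop.load(std::memory_order_relaxed)) {
        int  r = op(rng);
        long k = key(rng);
        if (r < U)          (r & 1) ? set.insert(k) : set.remove(k);
        else if (r < U + C) set.contains(k);
        else                set.range_query(k, k + range_len - 1);
    }
}
```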
Data Structure Performance
Figure 3 shows the total throughput of operations on a lazy list as the number of worker threads varies, at both 10% updates and 90% updates with 10% range queries.
At 10% updates, RLU incurs update-side overhead because writers must wait for in-progress reads, which becomes apparent as more threads are added. vCAS adds an extra dereference per node, since every access must go through the node’s version list (sketched below), which hurts read-intensive workloads. At 90% updates, RLU and vCAS both perform poorly; RLU in particular suffers because its heavy update synchronization degrades sharply under update-intensive workloads. In both cases, RQ-TSC, Bundle, Bundle-no-backoff, and EBR-RQ perform similarly. This experiment also suggests that contention on a single atomic counter has no significant impact on small data structures such as this 10,000-key lazy list.
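The sketch below illustrates the per-node indirection just described; the Version layout and the read_at helper are hypothetical, not vCAS’s actual implementation.

```cpp
#include <cstdint>

// Each mutable field holds a timestamped version list instead of a raw value.
struct Version {
    uint64_t ts;       // timestamp at which this version was installed
    void*    value;    // payload, e.g., a child pointer
    Version* next;     // older versions
};

// Where an unversioned structure does one load, a versioned read must
// dereference the list head and possibly chase older versions, adding
// cache misses on every node visited during traversal.
void* read_at(Version* head, uint64_t snapshot_ts) {
    Version* v = head;
    while (v != nullptr && v->ts > snapshot_ts)
        v = v->next;   // skip versions newer than the reader's snapshot
    return v ? v->value : nullptr;
}
```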
Varying Workload Mix.
Figure 4 and Figure 5 report the throughput (Mops/s) of different workloads for the skip list and Citrus tree.
Experimental results show that data structures enhanced with our method outperform RLU, EBR-RQ, and Bundle under mixed workloads. With the exception of corner cases involving 0% or 100% updates, RQ-TSC matches or exceeds the performance of the next best approach in the majority of scenarios. Under such mixed conditions, RQ-TSC delivers speedups of up to 1.3× over EBR-RQ in the skip list (Figure 4d) and 2.3× in the Citrus tree (Figure 5a). Against RLU and Bundle, RQ-TSC reaches up to 1.9× (Figure 5f) and 1.2× (Figure 4c) better performance, respectively.
For the skip list, EBR-RQ performs comparatively well thanks to its lock-free helping mechanism: when a range query encounters an unfinished update operation, it helps complete that update before proceeding with its own operation. This helping benefits update performance but may hurt range query performance. RQ-TSC performs as well as or better than EBR-RQ once threads grow beyond a single NUMA node at 50% and 90% updates. For read-intensive workloads (Figure 4a,b,e), RQ-TSC performs similarly to Bundle and Bundle-no-backoff: in these schemes the atomic counter is incremented only by update operations, so optimizing the counter does not benefit read-intensive workloads. The same holds for the Citrus tree discussed below. For write-intensive workloads (Figure 4c,d,f), Bundle-no-backoff performs significantly worse than RQ-TSC beyond a NUMA node due to cross-node coherence traffic caused by contention on the single atomic counter; for instance, Bundle-no-backoff is 29% worse than RQ-TSC at 90% updates. This implies that using a single atomic counter to generate timestamps can significantly hurt performance. Although a backoff strategy can reduce contention on the atomic counter, it still suffers from global spinning and makes the backoff timeouts difficult to tune. RQ-TSC sidesteps the shared counter entirely by reading a hardware timestamp, as sketched below.
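A hardware timestamp read, by contrast, touches no shared cache line. Below is a minimal sketch of such a read on x86-64; the exact fencing and any cross-socket handling RQ-TSC performs are not shown here and should be treated as assumptions.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Read the processor's invariant time-stamp counter. RDTSCP waits for
// prior instructions to retire before sampling, giving a partially
// ordered read without any cross-core communication.
inline uint64_t hw_timestamp() {
    unsigned aux;                  // receives IA32_TSC_AUX (core/socket id)
    return __rdtscp(&aux);
}
```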
For the Citrus tree, RQ-TSC significantly outperforms EBR-RQ across mixed workloads, except at 90% updates once the number of threads exceeds 48. This is because EBR-RQ has to visit additional nodes in the limbo lists and announced deletions to collect the nodes belonging to its snapshot. RLU’s performance degrades significantly as updates become a larger share of the workload (Figure 5c,d). At high-update workloads, the Citrus tree does not scale well beyond one NUMA node, regardless of the presence of range queries. This happens because the update operation’s traversal section is protected by RCU’s read lock, which is acquired atomically; moreover, before a delete operation physically removes a node, it calls a synchronization function that reads the atomic lock variables of all threads, which can cause significant cross-node coherence traffic, as the simplified sketch below illustrates.
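The following simplified user-level RCU sketch (our own illustration, not the Citrus tree’s code) shows where the traffic comes from: the deleter must load a per-thread slot for every thread, pulling remote cache lines across the interconnect.

```cpp
#include <atomic>
#include <cstdint>

constexpr int kMaxThreads = 128;

struct alignas(64) ReaderSlot {              // one cache line per thread
    std::atomic<uint64_t> epoch{0};          // 0 = quiescent
};
ReaderSlot reader[kMaxThreads];
std::atomic<uint64_t> global_epoch{1};

void read_lock(int tid) {                    // guards an update's traversal
    reader[tid].epoch.store(global_epoch.load(std::memory_order_acquire),
                            std::memory_order_release);
}
void read_unlock(int tid) {
    reader[tid].epoch.store(0, std::memory_order_release);
}

// Called before physically removing a node: scan every thread's lock
// variable until it is quiescent or has advanced past the old epoch.
// Each load can miss to a remote NUMA node, causing the coherence
// traffic described above.
void wait_for_readers(int nthreads) {
    uint64_t cur = global_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
    for (int t = 0; t < nthreads; ++t) {
        uint64_t e;
        do {
            e = reader[t].epoch.load(std::memory_order_acquire);
        } while (e != 0 && e < cur);
    }
}
```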
Relative to vCAS, RQ-TSC performs 1.2×–1.6× better in low-update workloads (namely 0%, 2%, and 10% updates) across all thread configurations, thanks to its lightweight traversal. In workloads with intensive updates (50% and 90% updates), RQ-TSC outperforms vCAS and Bundle on the skip list, because it relies on hardware timestamps instead of an atomic counter to produce version timestamps, substantially reducing contention between concurrent threads. For the Citrus tree, however, vCAS performs better at 50% and 90% updates when there are large numbers of threads.
This is because timestamp assignment in vCAS is wait-free, that is, any concurrent operation can help complete it, whereas RQ-TSC and Bundle must wait when an operation encounters a pending version; the sketch below contrasts the two policies. Even so, RQ-TSC still outperforms Bundle due to its efficient versioning mechanism and its scalable timestamp generation.
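The two policies can be contrasted with the following hedged sketch; the PENDING encoding, field names, and placeholder clock are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr uint64_t PENDING = 0;      // timestamp not yet assigned

struct Version { std::atomic<uint64_t> ts{PENDING}; };

// Placeholder clock; the real source differs per scheme (shared counter
// for vCAS/Bundle, hardware TSC for RQ-TSC).
inline uint64_t current_time() {
    static std::atomic<uint64_t> clk{1};
    return clk.fetch_add(1, std::memory_order_relaxed);
}

// vCAS-style wait-free assignment: any thread that observes PENDING helps
// install a timestamp with a CAS, so no one blocks on the updater.
uint64_t read_ts_helping(Version& v) {
    uint64_t t = v.ts.load(std::memory_order_acquire);
    if (t == PENDING) {
        uint64_t now = current_time();
        v.ts.compare_exchange_strong(t, now, std::memory_order_acq_rel);
        t = v.ts.load(std::memory_order_acquire);  // ours or the helper's
    }
    return t;
}

// Bundle/RQ-TSC-style (as described above): wait until the updater itself
// publishes the timestamp.
uint64_t read_ts_waiting(Version& v) {
    uint64_t t;
    while ((t = v.ts.load(std::memory_order_acquire)) == PENDING)
        std::this_thread::yield();
    return t;
}
```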
In the read-only workload (Figure 5e), RLU shows good performance because it incurs no synchronization costs, and it attains the best performance in the Citrus tree. However, the results show that even a low percentage of updates can substantially degrade RLU performance (Figure 5b) due to the synchronization caused by writers. Consequently, under update-intensive workloads (Figure 5c,d,f), RLU exhibits the lowest performance. In contrast, EBR-RQ requires each range query to increment a global timestamp counter, resulting in poorer efficiency under read-heavy workloads than under update-intensive scenarios. While EBR-RQ performs efficiently in update-only conditions (Figure 5f), its performance degrades significantly in read-only workloads (Figure 5e), especially beyond a single NUMA node. Bundle moves the responsibility of incrementing the global timestamp to update operations, which hurts performance in write-dominated workloads; this is apparent in the skip list.
Table 1 and Table 2 show execution-time statistics (in cycles) for the 50% update and 90% update workloads on the skip list, presented as five key percentiles (5%, 25%, 50%, 75%, and 95%). The tested operation types are contains, insertions, deletions, and range queries. A value a at percentile p indicates that p% of executions of that operation type completed in at most a cycles. Since the 90% update workload (90-0-10) contains no contains operations, the corresponding entries in Table 2 are left blank. We found that for both workloads, RQ-TSC is significantly faster than Bundle and vCAS for update operations (i.e., inserts and deletes) because their updates require atomically incrementing the global counter to generate timestamps, whereas RQ-TSC relies on hardware timestamps. For range queries, RQ-TSC is significantly faster than EBR-RQ because EBR-RQ requires traversing extra limbo lists and announced deletions.
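For reference, percentiles like those in the tables can be computed from recorded cycle samples with the nearest-rank method, as in the sketch below; the exact methodology behind the tables may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: smallest sample such that at least p% of all
// samples are at or below it. Assumes cycles is non-empty and p in (0,100].
uint64_t percentile(std::vector<uint64_t>& cycles, double p) {
    std::sort(cycles.begin(), cycles.end());
    size_t rank = static_cast<size_t>(
        std::ceil(p / 100.0 * static_cast<double>(cycles.size())));
    return cycles[rank - 1];
}
```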
Varying Range Query Size.
Figure 6 presents the throughput (Mops/s) of update and range query operations on a Citrus tree. The setup partitions a single NUMA node so that 32 hardware threads (half the total) run 100% updates and the remaining 32 run 100% range queries, while the range query length is varied.
We note that the update performance of RQ-TSC, Bundle, EBR-RQ, and vCAS is not adversely affected by range query length, since updates are not blocked by range queries. RLU, however, experiences a notable throughput reduction once the range query size surpasses 64. This stems from RLU’s requirement to synchronize with in-progress read operations, so longer range queries substantially harm update performance. For range queries, EBR-RQ’s throughput is mostly stable across range sizes: because every range query has to check the limbo lists and announced-deletion lists and pay for expensive DCSS operations, even short range queries perform poorly. RQ-TSC, Bundle, vCAS, and RLU, on the other hand, perform far better on short range queries.