FPCache: A Fingerprint-Rectified Learned Index Cache for Disaggregated Memory

Jia, Chenyang; Cai, Miao

doi:10.3390/electronics15102210


by Chenyang Jia ¹ and Miao Cai ²,*

¹ School of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China

² School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2210; https://doi.org/10.3390/electronics15102210

Submission received: 20 April 2026 / Revised: 19 May 2026 / Accepted: 20 May 2026 / Published: 21 May 2026

(This article belongs to the Special Issue New Challenges in High-Performance Computing and Computer Architecture)


Abstract

The rapid growth of data-intensive applications has increased the demand for efficient storage in large-scale key-value (KV) stores. Disaggregated memory architectures provide a scalable solution by separating compute and memory resources via RDMA. However, existing indexing schemes in these environments suffer from poor read efficiency, significantly degrading overall system throughput and scalability. Specifically, learned indexes often encounter substantial read amplification during remote data retrieval due to prediction errors. In addition, caching full keys incurs a high cache footprint, limiting the effective cache capacity on compute nodes and leading to additional remote memory accesses. This paper presents FPCache, a fingerprint-rectified learned index cache for disaggregated memory. We propose a fingerprint-assisted two-stage read approach to mitigate read amplification. FPCache first retrieves a compact fingerprint array for local matching. It then converts range reads into precise point accesses and directly reads the corresponding data item, thereby avoiding reading the entire range and reducing extra data transfers. Next, we design a fingerprint-offset compression strategy to maximize cache density. Leveraging fixed-length fingerprints and position offsets enables compute nodes to retain significantly more hotspot data within limited memory resources. Experimental evaluations using various YCSB workloads demonstrate that FPCache consistently outperforms state-of-the-art methods. Compared to systems like CHIME and ROLEX, FPCache improves system throughput by up to 62% and effectively maintains stable access efficiency under diverse data distributions.

Keywords:

disaggregated memory; learned index; key-value store

1. Introduction

The rapid development of cloud computing, big data analytics, artificial intelligence, and cloud storage has significantly increased data volume and processing demands [1]. This growth has driven the need for efficient storage and retrieval systems. Key-value (KV) stores have become a fundamental component of modern data infrastructures due to their efficient key-based lookup capabilities. To overcome the performance limitations of disk-based systems, many deployments adopt in-memory KV stores such as Redis [2] and Memcached [3]. These systems store data in main memory, enabling low-latency access for data-intensive workloads.

In-memory KV stores are widely used in data-intensive applications such as distributed databases and real-time recommendation systems [4,5]. These applications impose strict requirements on throughput and latency [6,7]. However, traditional monolithic server architectures co-locate computation and storage on the same physical node, leading to inefficient resource utilization and limited scalability. To address these limitations, system architectures are increasingly shifting toward memory disaggregation. This paradigm separates CPUs and DRAM into independent compute and memory nodes connected via high-speed networks, enabling elastic and on-demand resource provisioning. The emergence of Remote Direct Memory Access (RDMA) further allows compute nodes to access remote memory directly. By bypassing the remote CPU and OS kernel, RDMA provides low-latency, high-bandwidth communication for disaggregated memory systems [8].

In KV stores built on disaggregated memory, the index structure plays a critical role in data lookup and query performance [9]. Most existing studies rely on B+ tree indexes [10,11,12,13,14]. However, storing full keys and pointers incurs significant memory overhead, making it difficult for compute nodes to cache the entire index locally. As a result, index traversal often requires cross-node pointer chasing, where each tree level may trigger a remote memory access, leading to multiple network round trips and increased latency [15,16,17,18]. In contrast, learned indexes approximate the cumulative distribution function (CDF) of keys to predict data positions, enabling a compact index structure that can be cached locally. By replacing multi-level tree traversal with model inference, learned indexes reduce remote memory accesses and network round-trip overheads in disaggregated memory [19,20,21,22].

Many studies have explored index optimizations for disaggregated memory. XStore [20] caches remote B+ tree nodes using learned models. However, dynamic write operations still require RPC-based updates at memory nodes. Sherman [10] redesigns the B+ tree to improve write efficiency in disaggregated memory, but its query process still suffers from considerable read amplification. SMART [23] reduces read amplification using a radix-tree structure; however, its distributed index layout increases memory footprint at compute nodes and reduces cache efficiency. ROLEX [21] introduces learned indexes into disaggregated architectures and decouples model retraining from query execution. Nevertheless, prediction errors still require fetching data within an error range, and model training relies on compute resources at memory nodes. CHIME [11] combines B+ trees with Hopscotch hashing [24] to balance query performance and caching efficiency, but introduces complex concurrency control and additional metadata access overhead.

Experimental results in prior work [11] show that existing indexing schemes in disaggregated memory suffer from significant read amplification. For example, in bandwidth-constrained single-memory-node environments, the high read amplification of range-based indexes (e.g., ROLEX and Sherman) can reduce throughput by up to $4.9\times$ relative to the optimal baseline (SMART). In addition, radix-tree-based structures such as SMART require storing per-item address metadata, resulting in approximately 503.2 MB of cache footprint for a dataset of 60 million 8-byte key-value pairs. This footprint is over $21.3\times$ that of B+ tree-based counterparts like Sherman (23.6 MB). When the local cache is limited to 100 MB, the performance of such structures can degrade by up to $5.9\times$ due to frequent remote index accesses caused by cache misses.

These observations indicate that indexing schemes in disaggregated memory still face a fundamental bottleneck characterized as read inefficiency, which manifests in two primary aspects. First, most existing methods suffer from severe read amplification. Due to prediction errors in learned models, compute nodes often retrieve large regions from remote memory to ensure query correctness, resulting in the transfer of substantial data and wasting limited network bandwidth. In learned indexes, search operations may require accessing multiple leaf nodes to compensate for prediction errors, further increasing read amplification. Second, traditional caching mechanisms incur significant memory overhead. Since compute nodes are typically constrained by DRAM capacity, caching full keys or extensive pointer metadata consumes substantial memory, limiting the number of entries that can be stored locally and becoming a scalability bottleneck.

To address the aforementioned read inefficiency, we propose FPCache, a fingerprint-rectified learned index cache for disaggregated memory. FPCache introduces a hierarchical access design to alleviate read-path bottlenecks. Specifically, FPCache employs a fingerprint-assisted two-stage read approach to reduce read amplification. In addition, we design a fingerprint-offset compression strategy to reduce cache footprint and improve local cache density. Overall, FPCache reduces unnecessary remote memory accesses and improves system throughput. The main contributions of this work are summarized as follows:

Fingerprint-assisted Two-stage Read Approach: To mitigate read amplification caused by prediction errors in learned indexes, we propose a two-stage read approach. Instead of directly retrieving the full range of candidate records determined by the model error bound, the compute node first fetches a compact fingerprint array from remote memory corresponding to the predicted region. These fingerprints are then matched locally to identify candidate positions for the query key. The system subsequently retrieves the corresponding key-value records from remote memory using fine-grained point accesses. This design transforms range reads into point accesses, reducing unnecessary data transfers and improving system throughput.
Fingerprint-Offset Compression Strategy: To address the high cache footprint on compute nodes, we design a fingerprint-offset compression strategy. In FPCache, compute nodes maintain the learned index, which consists of a set of piecewise linear segments. FPCache caches data entries belonging to hot segments of the learned index. Instead of storing original keys and full physical addresses, the cached entries are represented using fixed-length fingerprints and position offsets ($δ$), where the offset records the deviation between the model-predicted position and the actual data location. By compressing both keys and address pointers, this strategy significantly reduces the space required for each cached entry. This approach reduces cache footprint on compute nodes, allowing them to accommodate a larger number of entries within limited memory resources.

Our evaluation demonstrates that FPCache meets its design goals by alleviating read inefficiency in disaggregated memory. The proposed two-stage read approach and compression strategy reduce both unnecessary remote memory accesses and cache footprint, improving overall performance across diverse workloads. Under representative workloads, FPCache achieves up to 43% and 62% higher throughput than state-of-the-art systems (i.e., CHIME and ROLEX), respectively.

2. Background and Motivation

2.1. Disaggregated Memory Architecture

Disaggregated memory architectures are typically organized into compute and memory pools [25]. Compute nodes are responsible for request processing, index lookups, cache management, and concurrency control, while memory nodes primarily store data and handle remote requests. In such architectures, data access requires network communication between compute and memory nodes, introducing additional latency and communication overhead. As a result, system performance depends not only on local processing efficiency but also on network communication efficiency.

As shown in Figure 1, disaggregated memory architectures exhibit an asymmetry in resource distribution between compute and memory nodes. Compute nodes provide strong compute capability but limited local memory, making them suitable for computation-intensive tasks. In contrast, memory nodes prioritize large memory capacity while offering only limited compute capability and primarily serve simple data access requests. Consequently, efficient system designs typically place complex logic on compute nodes to minimize processing overhead on memory nodes. Performing heavy computations on memory nodes can easily create processing bottlenecks under concurrent client accesses, degrading overall system throughput and increasing access latency.

In disaggregated memory, compute nodes access remote memory through the network, which requires high-performance network interconnects. Representative remote access technologies include RDMA, Compute Express Link (CXL), and NVMe over Fabrics (NVMe-oF) [26]. CXL enables high-speed interconnects between processors and memory expansion devices, while NVMe-oF extends the NVMe protocol across network fabrics to support high-performance remote storage access. Among these technologies, RDMA has been widely adopted in disaggregated memory because it provides low-latency and high-bandwidth remote memory access.

Compared with kernel-based networking, RDMA supports zero-copy data transfer, kernel bypass, and NIC offloading, thereby reducing software overhead during remote data communication [27]. RDMA operations are typically categorized into one-sided and two-sided verbs [28]. One-sided verbs, such as RDMA Read, RDMA Write, and RDMA Atomic, allow a node to directly access the registered remote memory without involving the remote CPU. In contrast, two-sided verbs, including RDMA Send and RDMA Recv, require participation from both endpoints and are commonly used for metadata exchange or control coordination [29].

2.2. Learned Indexes for Disaggregated Memory

The basic idea of learned indexes is to model the mapping between keys and data positions as a function approximation problem. By learning the underlying data distribution, the model predicts the approximate position of a query key in the sorted dataset.

In practice, a learned index trains a model to predict the position of a key and outputs a predicted location together with a bounded error range. The system then performs a local search within this range to identify the exact position. To further improve prediction accuracy, the Recursive Model Index (RMI) adopts a hierarchical architecture composed of multiple models, where upper-layer models partition the key space and lower-layer models provide finer-grained predictions [19].

In disaggregated memory, compute nodes and memory nodes are physically separated and connected through high-speed RDMA networks. To reduce remote access overhead, learned indexes are typically deployed on compute nodes, while the actual key-value records are stored on memory nodes.

As illustrated in Figure 2, the compute node first uses the learned index to predict the approximate position of the queried key in the sorted dataset. The predicted position is then used to issue an RDMA Read request to the memory node to retrieve entries within the model’s error-bounded range. After the data is fetched into the local cache, the compute node performs a binary search within this range to locate the exact key.

However, this design still suffers from read amplification in practice. Due to prediction errors, the system must retrieve a range of entries rather than a single KV pair. Larger prediction errors lead to wider retrieval ranges, resulting in additional RDMA data transfers. This overhead becomes more pronounced under highly skewed key distributions.
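The lookup flow and its read amplification can be illustrated with a minimal sketch. It assumes a single least-squares linear model over a small synthetic key set; the function names (`fit_linear`, `error_bound`, `lookup`) and the data are hypothetical and are not taken from ROLEX or FPCache. A list slice stands in for the RDMA Read of the error-bounded range.

```python
from bisect import bisect_left

def fit_linear(keys):
    """Least-squares line mapping key -> position over a sorted key list."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k

def predict(model, key):
    slope, intercept = model
    return round(slope * key + intercept)

def error_bound(model, keys):
    """Largest distance between predicted and true position (the epsilon)."""
    return max(abs(predict(model, k) - i) for i, k in enumerate(keys))

def lookup(keys, model, eps, key):
    """Return (position or None, number of entries fetched)."""
    n = len(keys)
    pos = predict(model, key)
    lo = min(max(0, pos - eps), n)
    hi = min(max(lo, pos + eps + 1), n)
    window = keys[lo:hi]              # stands in for one RDMA Read of the range
    i = bisect_left(window, key)      # local binary search on the compute node
    if i < len(window) and window[i] == key:
        return lo + i, len(window)
    return None, len(window)

# Synthetic, mildly non-linear key set (illustrative only).
keys = sorted({7 * i + (i * i) % 1009 for i in range(200)})
model = fit_linear(keys)
eps = error_bound(model, keys)
pos, fetched = lookup(keys, model, eps, keys[42])
print(pos, fetched, eps)
```

Every lookup fetches up to `2 * eps + 1` entries to return a single one, so the fetched count grows directly with the model's error bound.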

2.3. Motivation

Existing studies have explored various techniques to reduce read-path overhead in disaggregated memory, including optimized index structures, caching mechanisms, and learned models for predicting data locations. However, current indexing schemes still face two key challenges: reducing read amplification and minimizing the cache footprint on compute nodes.

Learned indexes can reduce index traversal cost, but prediction errors often expand the remote retrieval range, leading to excessive data transfers. Meanwhile, conventional caching mechanisms store full keys and physical pointers, which consume substantial cache space and limit the number of entries that can be stored in the DRAM of compute nodes. As a result, existing approaches struggle to address these two issues simultaneously, resulting in poor read efficiency in disaggregated memory. This problem mainly manifests in the following two aspects:

Read Amplification: Compute nodes rely on learned indexes to predict key locations on memory nodes. Due to prediction errors, the actual position of a target key may deviate from the predicted position within a bounded error range. To locate the correct record, the system retrieves all records within this range from remote memory and performs a binary search locally on the compute node. This process introduces additional data transfers for each lookup, resulting in significant read amplification and wasted network bandwidth.

To quantitatively evaluate this impact, we conduct a preliminary experiment using ROLEX, a representative learned index system [21]. In this evaluation, each entry consists of an 8 B key and an 8 B value, resulting in a fixed entry width of 16 B. Under this setup, a prediction error bound of $\epsilon$ necessitates a remote range read of $2\epsilon$ entries to ensure data retrieval. As shown in Table 1, the data read volume per query increases linearly with the error bound. Specifically, when $\epsilon$ increases from 8 to 256, the corresponding data read volume ($2\epsilon \times 16$ B) rises from 0.25 KB to 8 KB, while system throughput drops from 1112 Kops/s to 239.4 Kops/s. These results demonstrate that prediction errors in learned indexes lead to substantial additional data transfers in disaggregated memory, which increases network traffic and significantly degrades overall system performance.
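The read volumes quoted above follow directly from the $2\epsilon \times 16$ B relation. The short sketch below reproduces the two endpoints; the intermediate error bounds in the loop are assumed powers of two chosen for illustration and are not necessarily the rows of Table 1.

```python
ENTRY_BYTES = 16  # 8 B key + 8 B value, as in the experiment described above

def range_read_bytes(eps, entry_bytes=ENTRY_BYTES):
    """Bytes transferred by one range read covering 2 * eps entries."""
    return 2 * eps * entry_bytes

# Intermediate eps values are illustrative assumptions.
for eps in (8, 16, 32, 64, 128, 256):
    print(f"eps={eps:3d}  read volume={range_read_bytes(eps) / 1024:.2f} KB")
```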

High Cache Footprint: Existing approaches struggle to achieve high read efficiency under limited cache capacity, mainly because keys and pointers are not stored in a space-efficient manner. In traditional caching schemes, each cache entry typically stores the full key together with a physical pointer to the remote value. As a result, under the same cache capacity, the number of cached entries decreases as the key size increases.

To quantify this effect, we conduct a simple analysis under a read cache capacity of 100 MB. In this experiment, each cache entry stores the original key together with an 8 B pointer to the remote value. As the key size increases from 8 B to 128 B, the number of cached entries decreases sharply. As shown in Table 2, the cache can store about 6.55 million keys when the key size is 8 B, but only 0.77 million keys when the key size reaches 128 B.

This observation indicates that existing caching designs fail to store keys and pointers efficiently, which limits the number of entries that can be cached under limited memory. Consequently, the cache hit rate decreases and more queries need to access remote memory nodes through RDMA, increasing remote access overhead and degrading read performance. Therefore, how to store keys and pointers more compactly to accommodate more entries within limited cache capacity becomes a critical problem.
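The capacity figures in this analysis can be reproduced with simple arithmetic. The sketch assumes 1 MB = 2^20 bytes and no per-entry overhead beyond the key and the 8 B pointer, which is the reading that matches the reported 6.55 million and 0.77 million entries; the function name is hypothetical.

```python
CACHE_BYTES = 100 * 1024 * 1024   # 100 MB read cache (assuming 1 MB = 2**20 bytes)
POINTER_BYTES = 8                 # pointer to the remote value

def cached_entries(key_bytes, cache_bytes=CACHE_BYTES, pointer_bytes=POINTER_BYTES):
    """Number of (full key, pointer) entries that fit in the cache."""
    return cache_bytes // (key_bytes + pointer_bytes)

for key_bytes in (8, 128):
    print(f"{key_bytes:3d} B keys -> {cached_entries(key_bytes) / 1e6:.2f} million entries")
```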

3. Design

3.1. Overview

To address the read inefficiency bottleneck in disaggregated memory, we propose FPCache, a fingerprint-rectified learned index cache, as depicted in Figure 3. FPCache employs a fingerprint-assisted two-stage read approach to mitigate read amplification caused by prediction errors in learned indexes. In addition, FPCache adopts a fingerprint-offset compression strategy to reduce cache overhead on compute nodes. By representing keys as fixed-length fingerprints and recording position offsets, this design significantly reduces per-entry storage cost and increases cache density. Overall, FPCache reduces unnecessary remote memory accesses and improves system throughput under the memory constraints of disaggregated memory.

3.2. Fingerprint-Assisted Two-Stage Read Approach

To mitigate read amplification caused by learned index prediction errors, FPCache introduces a fingerprint array on the memory-node side. Each fingerprint is generated by hashing the original key and truncating the result to a fixed-length 64-bit value. Based on this design, FPCache implements a fingerprint-assisted two-stage read approach. Whereas the conventional access flow fetches data first and verifies keys afterwards, this approach filters by fingerprint before retrieving the target data, which reduces network bandwidth consumption while preserving query correctness. As shown in Figure 4, the query process is compute-node-driven and comprises three steps: ① remote fingerprint retrieval, ② local fingerprint matching, and ③ precise data retrieval or query termination.

Regarding memory-node data organization, the system adopts a decoupled layout for fingerprint metadata and data payloads. Specifically, fingerprint arrays are stored sequentially at the head of the KV data region or within a dedicated metadata region, with each element mapping one-to-one to a KV record. Each element in the array is a fixed-length hash value, and its index offset is strictly aligned with the physical position of the corresponding KV record. This design enables direct derivation of the physical address of a target record from its fingerprint index through linear mapping, without requiring additional lookup tables or multi-level pointer traversal. During updates or reorganization, this alignment is maintained by local memory-side management without introducing additional remote coordination. Consequently, the query execution flow can be divided into two stages.

Stage-1 (Fingerprint Prefetching and Local Matching): Upon receiving a query at the compute node, the system first uses the local learned index model to estimate the position of the target key and determines a retrieval range based on the maximum prediction error bound. Based on this interval, the compute node issues an RDMA read operation (①) to fetch the fingerprint array within the error range into local DRAM, forming a temporary fingerprint cache for subsequent verification. Meanwhile, the system computes the fingerprint of the query key using the same hash function and performs an element-wise comparison against the retrieved fingerprint array to identify potential matches. This process produces a set of candidate offsets relative to the predicted position. Due to hash collisions introduced by fingerprint compression, this candidate set may contain multiple possible positions (②). At this stage, only compact fingerprint metadata is transferred from remote memory, while KV payloads remain untouched, significantly reducing network data transfer.
Stage-2 (Precise Data Retrieval or Query Termination): When the candidate set is empty, the system directly concludes that the target key does not exist and terminates the query. Otherwise, for each candidate offset, the system computes the corresponding physical address by combining the segment base address with the fixed entry size. It then issues batch RDMA read operations to retrieve the corresponding KV records from remote memory. A full key verification is performed on these retrieved records at the compute node. If a matching record is found, its value is returned and the query completes. Otherwise, the result is identified as a false positive caused by fingerprint collisions, and a miss is returned (③).
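The two stages can be sketched with an in-process stand-in for the memory node. This is an illustration under stated assumptions rather than the FPCache implementation: `ToyMemoryNode` and `two_stage_get` are hypothetical names, list indices stand in for physical addresses derived from the segment base and entry size, BLAKE2b from the standard library replaces CityHash, and candidates are fetched one at a time instead of in a batch.

```python
import hashlib

def fingerprint(key: bytes, bits: int = 64) -> int:
    """Stand-in fingerprint: BLAKE2b truncated to `bits` (the paper uses CityHash)."""
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") >> (64 - bits)

class ToyMemoryNode:
    """Sorted records plus a fingerprint array aligned index-for-index with them."""
    def __init__(self, items, bits=64):
        self.bits = bits
        self.records = sorted(items)
        self.fps = [fingerprint(k, bits) for k, _ in self.records]
        self.records_fetched = 0

    def read_fingerprints(self, lo, hi):   # models one RDMA Read of metadata only
        return self.fps[lo:hi]

    def read_record(self, idx):            # models one point RDMA Read
        self.records_fetched += 1
        return self.records[idx]

def two_stage_get(node, predicted, eps, key):
    n = len(node.records)
    lo = min(max(0, predicted - eps), n)
    hi = min(max(lo, predicted + eps + 1), n)
    fps = node.read_fingerprints(lo, hi)                               # step 1
    target = fingerprint(key, node.bits)
    candidates = [lo + i for i, fp in enumerate(fps) if fp == target]  # step 2
    for idx in candidates:                                             # step 3
        k, v = node.read_record(idx)
        if k == key:          # full-key verification on the compute node
            return v
    return None               # empty candidate set, or only false positives

items = [(f"key{i:05d}".encode(), f"val{i}".encode()) for i in range(1000)]
node = ToyMemoryNode(items)
print(two_stage_get(node, predicted=505, eps=8, key=b"key00500"))
```

Passing a small `bits` value (for example 4) makes collisions frequent; the full-key check in step 3 still returns the correct value, at the cost of extra point reads.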

3.3. Fingerprint-Offset Compression Strategy

The core of FPCache lies in the compact design of its cache entries. Unlike traditional learned indexes that store full key-to-position mappings on compute nodes, we adopt a compressed structure based on fingerprints and position offsets. This design encodes both key identifiers and address information in a compact form, reducing per-entry storage overhead and increasing cache capacity.

Dual Compression Strategy: To reduce space overhead caused by large keys, the system introduces a fingerprint-based compression mechanism. Original keys are mapped to fixed-length 64-bit fingerprints using CityHash, which provides efficient hash computation and good distribution quality for large-scale in-memory key processing. The rationale for adopting 64-bit fingerprints is discussed in Section 4.4.2. These fingerprints replace the original keys in cache entries. Since the fingerprint length is fixed and significantly shorter than long string keys, the storage cost of key identifiers is substantially reduced. For address information, the system stores the offset between the actual address and the model-predicted address instead of the full 64-bit physical address. Since prediction errors after model training are typically small and concentrated within a narrow range, these offsets can be encoded as signed integers with fewer than 64 bits. During query processing, the original physical address is reconstructed as shown in Figure 5. Through this dual compression design, cache entries are transformed from original keys and full pointers into fixed-length fingerprints and compact offsets, significantly reducing cache space overhead on compute nodes.
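A possible entry encoding is sketched below. The 16-bit offset width, the sign convention ($δ$ = actual position minus predicted position), and the helper names are assumptions made for illustration; the text only states that offsets use fewer than 64 bits and that the address is reconstructed as in Figure 5.

```python
import struct

# 8-byte fingerprint + signed 16-bit offset, little-endian, no padding (assumed layout).
ENTRY = struct.Struct("<Qh")

def encode_entry(fp, actual_pos, predicted_pos):
    """Pack a cache entry as (fingerprint, delta) with delta = actual - predicted."""
    delta = actual_pos - predicted_pos
    if not -32768 <= delta <= 32767:
        raise ValueError("offset does not fit in the assumed 16-bit field")
    return ENTRY.pack(fp, delta)

def decode_position(entry, predicted_pos):
    """Recover the fingerprint and the actual position from a packed entry."""
    fp, delta = ENTRY.unpack(entry)
    return fp, predicted_pos + delta

def physical_address(base, pos, entry_bytes):
    """Linear mapping from a record position to its address in the KV region."""
    return base + pos * entry_bytes

entry = encode_entry(0xDEADBEEF, actual_pos=1000, predicted_pos=1003)
fp, pos = decode_position(entry, predicted_pos=1003)
print(len(entry), hex(fp), pos, hex(physical_address(0x10000, pos, 16)))
```

Under these assumptions each entry occupies 10 bytes, compared with 136 bytes for a 128 B key plus an 8 B pointer.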

Hotspot Identification and Cache Construction: FPCache does not cache all data; instead, it selectively caches frequently accessed entries based on dynamic popularity evaluation. The system maintains access counters to record the access frequency of different data segments, serving as the basis for hotspot identification. Considering that learned index prediction accuracy varies across data regions, the popularity evaluation further incorporates the model error range. Regions with larger prediction errors may incur high read amplification even under moderate access frequencies. For these regions, the system assigns higher cache priority to reduce subsequent remote access overhead and network bandwidth consumption. During cache construction, when a query request is identified as accessing a hotspot region, the system generates a cache entry as follows. First, the fingerprint of the target key is computed. Then, the actual data position is obtained, and the position offset is derived by combining it with the model-predicted position. Finally, the constructed entry is written into the compute-node cache. Through this mechanism, the cache retains only compact metadata that effectively accelerates data access, thereby reducing cache space overhead under the limited DRAM capacity of compute nodes.
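One way to combine access frequency with the model error range is sketched below. The scoring formula (accesses multiplied by the bytes a range read would transfer) is purely an assumption for illustration; the text does not specify how the two signals are weighted, and `HotspotTracker` is a hypothetical name.

```python
from collections import Counter

class HotspotTracker:
    """Per-segment access counters combined with each segment's error bound."""
    def __init__(self, segment_eps, entry_bytes=16):
        self.segment_eps = dict(segment_eps)   # segment id -> model error bound
        self.entry_bytes = entry_bytes
        self.hits = Counter()

    def record_access(self, segment):
        self.hits[segment] += 1

    def score(self, segment):
        # Hypothetical priority: accesses x bytes one range read would move.
        return self.hits[segment] * 2 * self.segment_eps[segment] * self.entry_bytes

    def hottest(self, k):
        return sorted(self.segment_eps, key=self.score, reverse=True)[:k]

tracker = HotspotTracker({"a": 8, "b": 256})
for _ in range(10):
    tracker.record_access("a")   # frequently accessed, accurate segment
for _ in range(2):
    tracker.record_access("b")   # moderately accessed, inaccurate segment
print(tracker.hottest(1))
```

With this weighting, the segment with the larger error bound is prioritized even though it is accessed less often, which mirrors the behavior described above.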

Conflict Handling Mechanism: Since fingerprints are generated using hash functions, collisions may occur when different keys produce the same fingerprint. To ensure correctness, FPCache integrates a collision detection and fallback mechanism into the cache lookup path. During query processing, the system first computes the fingerprint of the requested key and probes the local hash table. A matched entry is treated as a candidate rather than a direct cache hit. The system then derives a candidate physical address by combining the stored position offset with the model-predicted position and issues a small RDMA read to fetch the corresponding key for verification. If the retrieved key matches the query key, the cache hit is confirmed and the associated value is returned. Otherwise, the result is identified as a fingerprint collision, and the candidate entry is discarded. In this case, the system falls back to the standard learned index lookup path, performing a range read based on the model error bound followed by local search to locate the correct key.
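The lookup path with collision fallback can be sketched as follows. Everything here is a toy under stated assumptions: the data, the `predict` model, the deliberately tiny 4-bit fingerprint, and the choice to ignore (rather than evict) a colliding entry are all hypothetical, and list indexing stands in for RDMA reads.

```python
records = [(k, k * 10) for k in range(0, 200, 2)]  # toy sorted (key, value) store
EPS = 4                                            # assumed model error bound

def predict(key):
    """Hypothetical model whose error varies with the key (true position is key // 2)."""
    return key // 2 + key % 3

def fp_of(key, bits=4):
    """Deliberately tiny fingerprint so that collisions are easy to trigger."""
    return key % (1 << bits)

def range_lookup(key):
    """Standard learned-index path: fetch the error-bounded range, search locally."""
    pos = predict(key)
    lo, hi = max(0, pos - EPS), min(len(records), pos + EPS + 1)
    for k, v in records[lo:hi]:
        if k == key:
            return v
    return None

def cached_get(cache, key):
    delta = cache.get(fp_of(key))
    if delta is not None:                     # a candidate, not yet a confirmed hit
        pos = predict(key) + delta
        if 0 <= pos < len(records):
            k, v = records[pos]               # stands in for a small RDMA read
            if k == key:
                return v, "hit"
        # Key mismatch: fingerprint collision, so the candidate is ignored.
    return range_lookup(key), "fallback"

cache = {fp_of(20): 10 - predict(20)}         # entry built for key 20 at position 10
print(cached_get(cache, 20), cached_get(cache, 36))
```

Keys 20 and 36 share a fingerprint in this toy setup, so the second lookup exercises the collision branch and is served by the range-read fallback.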

4. Evaluation

4.1. Experimental Setup

Testbed. We conduct all experiments on CloudLab r650 servers to ensure a controlled and reproducible environment [30]. Each server is equipped with two 36-core Intel Xeon Platinum 8360Y CPUs (2.40 GHz) and 256 GB ECC memory (16 × 16 GB, 3200 MT/s). The storage system includes a 480 GB SATA SSD and a 1.6 TB NVMe SSD (PCIe 4.0). For networking, each server is configured with dual-port Mellanox ConnectX-5 25 GbE and dual-port Mellanox ConnectX-6 100 GbE NICs.

Workloads. We evaluate the performance of FPCache using the Yahoo! Cloud Serving Benchmark (YCSB). Four representative workloads are selected to reflect common access patterns, and their configurations regarding data distribution and KV sizes are summarized in Table 3. In addition, we incorporate two real-world datasets [31], fb_200M_uint64 and wiki_200M_uint64, where keys are 64-bit unsigned integers (8 B). For these datasets, we use 8-bit fingerprints, consistent with prior work [32].

To investigate system behavior under different conditions, we design several groups of controlled experiments within a unified evaluation framework. For YCSB benchmark tests, we preload 60 million KV items and execute 60 million operations, as done in prior work [11]. For Workload-C, we focus on sensitivity analyses by evaluating the system across a wide range of fixed KV sizes and diverse data distributions. Specifically, we conduct multiple independent experimental runs where the lengths of keys and values are varied (e.g., keys from 8 to 256 B and values from 8 to 1024 B) to assess the impact of different scales. For Workloads B, D, and E, experiments are conducted under a fixed Zipfian distribution and constant KV lengths, with the evaluation focusing on overall system performance.

Comparison methods. To comprehensively evaluate the performance of FPCache, we compare it with five representative index structures for disaggregated memory: DMTree [33], CHIME [11], SMART [23], ROLEX [21], and Sherman [10]. DMTree is a B+ tree variant that employs leaf-level fingerprinting to optimize remote access. It utilizes these fingerprints to filter out mismatches, thereby reducing redundant RDMA reads and network latency during tree traversals. CHIME adopts a hybrid architecture that combines tree-based routing with hashed leaf nodes. It leverages hash-neighborhood reads to reduce conflict handling overhead and mitigate read amplification in point queries. SMART is a radix-tree-based structure designed for disaggregated memory. It enables large-key support through prefix sharing and achieves low read amplification for point lookups, but relies heavily on pointer cache capacity at compute nodes. ROLEX is a learned index that maintains lightweight regression models on compute nodes to predict data locations, with final lookups performed on memory nodes within the prediction error bound. Sherman is a B+ tree variant optimized for disaggregated memory. It accelerates tree traversal using one-sided RDMA operations and employs fine-grained locking to support concurrent updates.

4.2. Overall Performance

4.2.1. Performance on YCSB Benchmark

We evaluate the overall throughput and tail-latency performance of different indexing schemes under representative workloads. The experiments employ four YCSB workloads, namely B, C, D, and E, with a fixed Zipfian data distribution. To ensure a fair comparison, parameters including KV sizes, cache capacity, and model error bounds are kept constant.

We present the throughput results under YCSB B, C, D, and E in Figure 6. Overall, we observe that FPCache maintains high throughput in point-query dominant workloads (YCSB B, C, and D). Compared to DMTree, FPCache achieves comparable or slightly superior performance, with an average lead of approximately $1.1\times$ in these scenarios. This demonstrates its stability in mixed scenarios involving random reads and light writes. Compared to CHIME, FPCache maintains a lead across YCSB B, C, and D with an average improvement of approximately 13.8%. Compared to ROLEX and Sherman, FPCache achieves on average $1.62\times$ and $3.01\times$ their throughput, respectively. These results indicate that under point-query dominant workloads, FPCache effectively reduces the overhead associated with unnecessary data reads and maintains high processing efficiency. However, we also observe that FPCache is outperformed by SMART in these scenarios. This suggests that, under the current experimental configuration, SMART has a higher throughput ceiling for point queries, although reaching it depends on more extensive local cache resources. In contrast, FPCache exhibits more stable throughput across different workloads and achieves better efficiency in mitigating read amplification overhead.

Under the YCSB E range scan workload, we observe that throughput for all methods decreases significantly, indicating that this scenario is heavily impacted by the overhead of scan-oriented accesses. In this workload, ROLEX achieves the highest throughput, with FPCache reaching 0.86× of its performance. Nonetheless, FPCache still outperforms CHIME, SMART, and Sherman by 1.43×, 1.38×, and 1.14×, respectively.

To further evaluate latency stability, we additionally measure the cumulative latency distribution under different workloads, including P50, P95, P99, and P99.9 latency percentiles. Figure 7 presents the corresponding results.

Overall, FPCache and DMTree consistently achieve lower tail latency than ROLEX and Sherman across most point-query dominant workloads. In particular, under YCSB B, C, and D, both methods maintain relatively stable latency growth from median latency to tail latency, indicating improved robustness under skewed accesses and bursty requests. In contrast, ROLEX and Sherman exhibit substantially higher tail latency at the P99 and P99.9 levels, suggesting increased overhead caused by unnecessary remote accesses and cache miss amplification.

In YCSB E, FPCache and DMTree consistently outperform other candidates in both median and tail latency. In particular, FPCache achieves substantially lower P50 latency than CHIME and SMART while maintaining competitive P99 and P99.9 latency performance. Specifically, FPCache reduces the P50 latency by approximately 1.7× compared to CHIME and SMART. Although DMTree exhibits a slight advantage at the extreme tail latency level, FPCache still preserves relatively high throughput and stable latency behavior overall. These results demonstrate that FPCache effectively balances throughput efficiency and latency stability under diverse workloads.

4.2.2. Performance on Real-World Datasets

We further evaluate the proposed design on real-world datasets from the SOSD benchmark, including Facebook (fb) and Wiki. All experiments use 8-byte keys and 8-byte values under a Zipfian access distribution.

Since both datasets operate on fixed 8-byte keys, we accordingly adopt an 8-bit fingerprint configuration to maintain consistency with the key representation. The use of 8-bit fingerprints in the SOSD evaluation is primarily intended to study the compression-efficiency trade-off under short fixed-length keys. Because all SOSD keys are only 8 bytes, the overhead of collision verification remains relatively small even when fingerprint collisions occur: full-key comparisons can be completed with minimal memory access cost. In contrast, the YCSB experiments in this paper use 128-byte keys and values, where fingerprint collisions may trigger substantially more expensive verification operations due to the larger key comparison cost and additional memory accesses. Therefore, although small fingerprints remain feasible for short-key workloads such as SOSD, FPCache adopts 64-bit fingerprints in the YCSB evaluation and default system configuration to minimize collision-induced verification overhead and stabilize tail latency.

We compare FPCache against five representative baselines. For each dataset, we evaluate both the overall throughput performance and the multi-node scalability behavior. In the scalability experiments, six compute nodes are deployed to emulate distributed clients, where each node issues requests using 64 worker threads.

Figure 8 presents the throughput and scalability results on the fb dataset. Overall, FPCache consistently achieves high throughput and remains competitive with SMART, while outperforming CHIME, ROLEX, Sherman, and DMTree. Specifically, FPCache improves throughput by approximately 1.14×, 1.61×, and 2.98× compared with CHIME, ROLEX, and Sherman, respectively, while maintaining slightly higher throughput than DMTree. ROLEX and Sherman gradually saturate after 64 threads, with ROLEX remaining slightly higher than Sherman, suggesting that both methods are constrained by RDMA operation overhead and NIC IOPS limitations under intensive concurrency. CHIME exhibits near-linear scaling before 128 threads, while its growth rate slows after 192 threads, although throughput still increases moderately. In contrast, SMART, FPCache, and DMTree all demonstrate near-linear scalability before 128 threads, followed by slower growth between 128 and 192 threads, and eventually converge toward a relatively stable region after 256 threads. At the highest concurrency level, FPCache achieves approximately 1.06× higher throughput than DMTree, indicating that the learned-index-assisted design provides better scalability than traditional B+ tree-based indexing structures under highly concurrent workloads.

Figure 9 shows similar observations on the wiki dataset. FPCache again achieves consistently high throughput while maintaining better scalability than CHIME, ROLEX, Sherman, and DMTree. Specifically, FPCache improves throughput by approximately 1.14×, 1.61×, and 2.98× compared with CHIME, ROLEX, and Sherman, respectively. Compared with DMTree, FPCache maintains a modest but stable throughput advantage across different concurrency levels, reaching approximately 1.04× higher throughput at the maximum thread count. Similar to the fb dataset, ROLEX and Sherman saturate much earlier under high concurrency, whereas SMART, FPCache, and DMTree continue scaling efficiently before gradually stabilizing at higher thread counts. Overall, the results on both real-world datasets demonstrate that FPCache not only delivers high throughput but also maintains strong scalability under distributed multi-node workloads.

4.2.3. Performance Under Dynamic Hotspot Changes

We evaluate the adaptability of FPCache under a dynamically changing Zipfian distribution using the YCSB C workload, in which the hotspot is shifted every 30 s. Experiments are conducted over an 80-s interval after the system reaches a stable operating state. The results, as shown in Figure 10, indicate that each hotspot migration induces a transient throughput drop of approximately 17%. Notably, the system rapidly recovers within roughly 6 s, returning to the pre-migration steady-state throughput. This behavior is consistently observed across multiple hotspot transitions, indicating that FPCache can effectively adjust to dynamic workload changes while maintaining high and stable throughput, with only brief and bounded fluctuations during hotspot shifts.

4.2.4. Network Traffic and RDMA IO Analysis

We evaluate the trade-off introduced by the two-phase read design on the fb dataset. In this evaluation, both keys and values are fixed to 8 bytes to ensure a consistent and controlled comparison across different methods. We compare FPCache with ROLEX, as both systems are representative learned-index-based designs in disaggregated-memory settings. The goal is to measure the impact of fingerprint-based pre-filtering on network traffic and RDMA I/O operations under identical workload conditions.

As shown in Figure 11, although FPCache introduces an additional fingerprint lookup stage that increases the number of RDMA operations per query, it significantly reduces network traffic compared to ROLEX. Specifically, FPCache achieves a substantial reduction in network bandwidth consumption while issuing more RDMA operations due to the two-phase design. This is because fingerprint-based pre-filtering effectively eliminates unnecessary full KV retrievals, thereby mitigating read amplification at the cost of slightly increased RDMA I/O operations. Overall, FPCache shifts overhead from network bandwidth to lightweight RDMA I/O operations, but achieves better efficiency in disaggregated memory systems by reducing redundant data transfers.

4.3. Ablation Study

4.3.1. Performance Evaluation of the Two-Stage Read Approach

We evaluate the performance of the fingerprint-assisted two-stage read approach under varying value sizes. We utilize a YCSB C read-only workload with a uniform distribution to highlight the overhead differences between the full-read scheme and the proposed two-stage read approach under low cache hit conditions. We examine two scenarios with value sizes of 64 B and 128 B, comparing the throughput and P99 tail latency of the full-read scheme against the two-stage read approach.

As shown in Figure 12, the two-stage read approach outperforms the full-read scheme in both scenarios. In the 64 B configuration, the two-stage approach improves throughput by 1.21× and reduces P99 latency by 36.6% compared to the full-read approach. As the value size increases to 128 B, the performance of the full-read scheme deteriorates further, whereas the two-stage approach maintains high throughput, reaching 1.58× that of the full-read scheme, and reduces P99 latency by 53.0%. These trends suggest that the two-stage read approach is more robust to payload growth, maintaining stable access efficiency even as data volume increases.

The performance gap primarily stems from the disparate remote data retrieval mechanisms of the two schemes. In the full-read scheme, locating a target key requires fetching all records within the predicted range to the compute node for local filtering. When value sizes are large, this approach transfers a substantial amount of non-target data, which consumes excessive network bandwidth and increases NIC queuing delays, ultimately leading to long-tail latency. In contrast, we decompose the query into two steps: first, the system retrieves the segment-aligned fingerprint array for local matching; then, it performs RDMA reads only for candidate positions. Since the first stage involves only compact fingerprint metadata, and actual data retrieval is restricted to potential hits, extra data transfers are significantly minimized. Although this approach introduces an additional network round trip, the associated overhead is negligible compared to the massive network costs incurred by large-scale extra data transfers. This advantage becomes particularly pronounced in the 128 B scenario, where we observe enhanced throughput and improved tail latency.
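The difference between the two retrieval paths can be illustrated with a small simulation. The following is a minimal sketch rather than the FPCache implementation: `RemoteMemory` is a hypothetical stand-in that counts bytes and reads instead of issuing RDMA operations, the fingerprint function (a truncated BLAKE2b digest) and the record sizes are assumptions chosen for illustration, and the search window is assumed to span `2 * eps + 1` entries around the predicted position.

```python
import hashlib

KEY_SIZE, VALUE_SIZE, FP_SIZE = 8, 128, 8  # bytes; illustrative sizes only


def fingerprint(key: bytes) -> bytes:
    """Hypothetical fingerprint: an FP_SIZE-byte BLAKE2b digest of the key."""
    return hashlib.blake2b(key, digest_size=FP_SIZE).digest()


class RemoteMemory:
    """Stand-in for a memory node; counts bytes and reads instead of using RDMA."""

    def __init__(self, keys):
        self.records = [(k, b"v" * VALUE_SIZE) for k in sorted(keys)]
        self.fps = [fingerprint(k) for k, _ in self.records]
        self.bytes_read = 0
        self.reads = 0

    def read_records(self, lo, hi):
        self.reads += 1
        self.bytes_read += (hi - lo) * (KEY_SIZE + VALUE_SIZE)
        return self.records[lo:hi]

    def read_fingerprints(self, lo, hi):
        self.reads += 1
        self.bytes_read += (hi - lo) * FP_SIZE
        return self.fps[lo:hi]


def window(mem, predicted, eps):
    """Clamp the assumed search window [predicted - eps, predicted + eps]."""
    return max(0, predicted - eps), min(len(mem.records), predicted + eps + 1)


def full_read(mem, key, predicted, eps):
    """Fetch every record in the window, then filter locally."""
    lo, hi = window(mem, predicted, eps)
    for k, v in mem.read_records(lo, hi):
        if k == key:
            return v
    return None


def two_stage_read(mem, key, predicted, eps):
    """Fetch fingerprints first, then read only the matching positions."""
    lo, hi = window(mem, predicted, eps)
    target = fingerprint(key)
    for i, fp in enumerate(mem.read_fingerprints(lo, hi)):
        if fp == target:
            k, v = mem.read_records(lo + i, lo + i + 1)[0]
            if k == key:  # full-key check guards against fingerprint collisions
                return v
    return None


keys = [i.to_bytes(KEY_SIZE, "big") for i in range(10_000)]
key, predicted, eps = keys[5000], 5010, 16

full_mem, staged_mem = RemoteMemory(keys), RemoteMemory(keys)
assert full_read(full_mem, key, predicted, eps) == two_stage_read(staged_mem, key, predicted, eps)
print("full read: ", full_mem.reads, "read(s),", full_mem.bytes_read, "bytes")
print("two-stage: ", staged_mem.reads, "read(s),", staged_mem.bytes_read, "bytes")
```

Under these assumed sizes, the full read moves 33 whole records in one request, while the two-stage path issues one extra request but moves only 33 fingerprints plus a single record. This is the trade of one additional round trip for a much smaller transfer volume described above.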

4.3.2. Performance Evaluation of the Dual Compression Strategy

We independently validate the effectiveness of the dual compression strategy by comparing fingerprints and position offsets. We evaluate four configurations under identical experimental settings: Full-Cache, FPCache-F (fingerprint compression only), FPCache-D (position offset only), and FPCache-FD (both mechanisms enabled). We conduct the evaluation using the YCSB C workload with a Zipfian distribution (α = 0.9), fixed 128 B KV sizes, and a 200 MB compute-node cache capacity. Throughput and cache hit rate are employed as the primary metrics.

As shown in Figure 13, FPCache-FD achieves the highest performance across both throughput and cache hit rate. Compared to Full-Cache, we observe that FPCache-FD improves throughput by 48.2% and increases the hit rate by 3.2×, which indicates that dual compression significantly enhances effective cache density and hotspot residency. Our component-wise analysis further clarifies the contribution of each mechanism. We find that enabling fingerprint compression alone markedly improves performance, identifying key identifier compression as the primary driver of overall gains. In contrast, the improvement from position offset in isolation is relatively limited due to the overhead constraints of the pointer fields within the entry structure.

Furthermore, we observe that FPCache-FD still yields a 14.7% throughput increase and a 1.34× higher hit rate over FPCache-F. Compared to FPCache-D, it increases throughput by 44.9% and hit rate by 3.11×. These results demonstrate a robust synergy between fingerprints and position offsets; while the former substantially compresses key metadata, the latter further optimizes entry space under high-density cache conditions. Together, they mitigate the overhead of cache misses and subsequent remote memory fetches, leading to simultaneous gains in throughput and hit rate.
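The relative contribution of the two mechanisms can be made concrete with a back-of-envelope capacity estimate. The sketch below is illustrative only: the per-entry layouts (a 128 B key, an 8 B full pointer, an 8 B fingerprint, and a 2 B relative offset) are assumptions, not the entry format used in the evaluation, and per-entry bookkeeping such as model or segment metadata is ignored.

```python
CACHE_BYTES = 200 * 1024 * 1024  # 200 MB budget, interpreted here as 200 MiB

# Hypothetical field sizes in bytes; the actual entry format may differ.
KEY, FULL_POINTER, FINGERPRINT, OFFSET = 128, 8, 8, 2

layouts = {
    "Full-Cache": KEY + FULL_POINTER,         # raw key + full pointer
    "FPCache-F": FINGERPRINT + FULL_POINTER,  # fingerprint replaces the key
    "FPCache-D": KEY + OFFSET,                # offset replaces the pointer
    "FPCache-FD": FINGERPRINT + OFFSET,       # both substitutions
}

capacity = {name: CACHE_BYTES // size for name, size in layouts.items()}
for name, entries in capacity.items():
    ratio = entries / capacity["Full-Cache"]
    print(f"{name:<11} {layouts[name]:>4} B/entry {entries:>11,} entries {ratio:>5.1f}x")
```

With these assumed sizes, replacing the key dominates the density gain, whereas shrinking only the pointer barely changes the entry size because the 128 B key still accounts for most of it. This matches the qualitative ordering reported above, although hit rate does not grow linearly with the number of cached entries under a skewed workload.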

4.4. Parameter Sensitivity

4.4.1. Sensitivity to Key-Value Length

We focus on throughput variations under different data specifications to assess the stability of various indexing schemes. We present the impact of value size on throughput in Figure 14. When the value size is small, the performance advantage of FPCache is relatively limited because the additional fingerprint retrieval stage introduces extra RDMA operations during query processing. However, as the value size increases, the cost of unnecessary remote KV retrievals becomes significantly higher, allowing FPCache to gradually benefit from its fingerprint-assisted filtering mechanism. With the key length fixed at 8 B, we observe that the throughput of FPCache exhibits a gradual decline as the value size increases, demonstrating robust overall stability. Compared to Sherman, ROLEX, and CHIME, the performance advantages of FPCache continue to widen in scenarios involving medium-to-large values, reaching 27.5×, 23.2×, and 3.8× their respective throughput at 1024 B. These results indicate that as the value payload grows, our access flow that prioritizes metadata filtering before precise retrieval effectively suppresses extra data transfers. This prevents read overhead from scaling linearly with the prediction interval, thereby maintaining high throughput even when the value size becomes large.

We observe that FPCache does not surpass SMART in this specific experiment. This is primarily because SMART can directly access target records under the current settings, which allows it to reach a higher throughput ceiling when the thread count and data scale remain stable. However, this advantage comes at the cost of high compute node cache consumption, as SMART requires significantly more memory to maintain its pointer-based structural metadata compared to FPCache. In contrast, FPCache regulates cache occupancy through compact metadata organization, thereby providing more consistent performance within a limited cache budget. Consequently, the performance difference reflects a fundamental trade-off between throughput upper bounds and memory costs. The design of FPCache prioritizes scalable throughput stability and superior space efficiency under resource-constrained conditions over pursuing absolute peak throughput in all small-value scenarios.

Regarding the baseline methods, Sherman and ROLEX show the highest sensitivity to variations in value size, with their throughput declining rapidly as the payload increases. This behavior is primarily attributed to their reliance on range-based data retrieval, which increases the amount of data transmitted over the network and significantly raises communication overhead. Similarly, CHIME exhibits noticeable performance degradation as the value size grows, suggesting that neighborhood reads in its conflict-handling phase enlarge the per-request data processing cost. Although SMART demonstrates a relatively slower decline in throughput, its substantial structural metadata and high cache occupancy limit its scalability under constrained compute node memory resources.

We further examine the impact of key length on throughput in Figure 15. With the value size fixed at 8 B, we find that the throughput of FPCache remains relatively stable for key lengths ranging from 8 B to 256 B. At the 256 B mark, FPCache achieves throughput that is 7.8× that of Sherman, 6.5× that of ROLEX, 1.4× that of CHIME, and 1.5× that of SMART. These results suggest that increasing key lengths does not significantly inflate the access costs for FPCache. This stability is primarily due to our use of fixed-length fingerprints for matching; consequently, the index metadata footprint and the first stage transfer overhead do not scale linearly with the raw key length, allowing FPCache to maintain consistent throughput even with long keys.

In contrast, we observe that the performance of other methods is adversely affected by increasing key lengths to varying degrees. Sherman and ROLEX store full keys within index nodes; thus, longer keys reduce node fan-out and decrease the number of valid entries per node, which necessitates additional remote accesses. CHIME organizes data via hash buckets and requires requests to carry full key information, leading to increased network request payloads as keys lengthen. Although SMART reduces some storage overhead through prefix sharing, extremely long keys can still lead to increased index depth and higher cross-node access overhead. Collectively, these findings demonstrate that the advantages of FPCache lie not only in its peak throughput gains but also in its remarkable throughput stability when faced with the dual challenges of varying value sizes and key lengths.

4.4.2. Sensitivity to Fingerprint Design

To investigate the impact of fingerprint width on system performance, we further evaluate FPCache with different fingerprint sizes under the YCSB C workload with a Zipfian distribution. The experiments are conducted using 128 B keys and 128 B values while fixing the prediction error bound to ϵ = 256, which represents the largest search window and therefore the highest collision risk. We vary the fingerprint width from 64 bits to 8 bits and measure the corresponding system throughput.

The experimental results are shown in Figure 16. We observe that reducing the fingerprint width decreases the throughput of FPCache. Compared with 64-bit fingerprints, the throughput degradation is relatively small with 32-bit fingerprints, but becomes significant for 16-bit and 8-bit fingerprints, with more than 50% throughput loss at 8 bits.

The main reason is that smaller fingerprints significantly increase the probability of fingerprint collisions. Although narrow fingerprints can improve metadata compression efficiency, collisions introduce additional remote verification operations during query processing. These unnecessary RDMA accesses consume extra NIC IOPS and network bandwidth, which gradually reduces the effective throughput of the system. This overhead becomes more severe when the search window is large, since more candidate entries need to be checked within each lookup.
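The collision effect can be estimated with a simple probabilistic model. The sketch below assumes uniformly distributed fingerprints and a search window of 2ϵ + 1 entries, so each of the 2ϵ non-target entries matches the target fingerprint with probability 2^(-b) for a b-bit fingerprint. Both assumptions and the helper name `expected_false_candidates` are introduced here for illustration.

```python
def expected_false_candidates(fp_bits: int, eps: int) -> float:
    """Expected number of non-target window entries whose fingerprint matches by chance.

    Assumes uniformly distributed fingerprints and a window of 2 * eps + 1 entries.
    """
    others = 2 * eps  # window entries other than the target itself
    return others / 2 ** fp_bits


for bits in (64, 32, 16, 8):
    extra = expected_false_candidates(bits, eps=256)
    print(f"{bits:>2}-bit fingerprint: {extra:.3g} extra verification reads per lookup")
```

Under this model, an 8-bit fingerprint with ϵ = 256 yields about two spurious candidates per lookup, each requiring its own remote verification, whereas 32-bit and 64-bit fingerprints make such reads vanishingly rare. This is consistent with the sharp degradation observed at 8 bits, though the measured loss also depends on factors the model omits.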

We further evaluate the memory overhead of fingerprint storage at the memory node by comparing FPCache with representative Bloom and Cuckoo filter-based baselines. The experiment preloads 60 million keys under a YCSB C workload. Figure 17 reports the overall memory consumption of each scheme.

Although FPCache introduces a higher per-key memory overhead (8 bytes per key) compared to Bloom filters (approximately 1.2 bytes per key) and Cuckoo filters (approximately 1.5 bytes per key), it integrates both fingerprinting and pointer metadata within a unified structure. This design enables not only membership filtering but also direct key-value localization. In contrast, Bloom and Cuckoo filters can only provide probabilistic membership results; upon a positive match, an additional RDMA fetch of a full node page is required, followed by a local key-value search. This introduces extra network and memory access overhead. By directly mapping fingerprints to key-value pointers, FPCache enables one-shot RDMA retrieval of the exact key-value entry, reducing effective bandwidth consumption by approximately an order of magnitude compared to filter-based approaches. As a result, FPCache achieves higher throughput, outperforming Bloom and Cuckoo baselines by approximately 20–23% under the same workload configuration.
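The per-key figures above translate into total footprints as follows. This is a back-of-envelope calculation from the stated per-key costs and the 60 million preloaded keys; measured totals may differ because they can include structures that this estimate ignores.

```python
KEYS = 60_000_000  # keys preloaded in the experiment

# Per-key metadata cost in bytes, as stated in the text.
bytes_per_key = {"FPCache": 8.0, "Bloom filter": 1.2, "Cuckoo filter": 1.5}

footprint_mb = {name: KEYS * cost / 1e6 for name, cost in bytes_per_key.items()}
for name, mb in footprint_mb.items():
    print(f"{name:<14} {mb:>6.0f} MB")
```

The estimate gives roughly 480 MB for FPCache against 72 MB and 90 MB for the Bloom and Cuckoo filters, which quantifies the space cost paid for direct key-value localization.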

4.4.3. Sensitivity to Prediction Error Bounds

We evaluate the impact of prediction error bounds on the throughput of disaggregated memory indexes. Keeping the KV specifications and cache capacity fixed, we vary the error bound from 8 to 256 and compare the throughput of FPCache and ROLEX under both Uniform and Zipfian distributions. This single-variable approach allows us to directly observe the influence of error bounds on system performance.

We report the system throughput under different prediction error bounds in Figure 18. The experimental results show that the throughput of ROLEX steadily decreases as the error bound grows under both data distributions. Under the Uniform distribution, ROLEX experiences a throughput reduction of approximately 64.4%, while the decline reaches 47.4% under the Zipfian distribution. As the prediction error range expands, the conventional range-based retrieval approach must fetch a larger amount of candidate data, which increases both network transmission volume and local verification overhead. Consequently, the growing proportion of unnecessary reads during query processing gradually degrades overall system throughput.

In contrast, we observe that FPCache demonstrates significantly higher stability under both distributions. Under the Uniform distribution, the throughput of FPCache exhibits minimal fluctuations and only a slight decrease as the error bound increases; under the Zipfian distribution, the performance remains essentially stable. At a large error bound of 256, the throughput of FPCache is 8.40× and 2.48× that of ROLEX under the Uniform and Zipfian distributions, respectively. These findings demonstrate that the access flow of FPCache effectively suppresses the growth of extra data transfers by prioritizing metadata filtering before precise data retrieval. Consequently, FPCache maintains stable throughput even under high error conditions.
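A simple transfer-volume model shows why the two designs react so differently to the error bound. The sketch assumes a window of 2ϵ + 1 entries, a hypothetical record of 8 B key plus 128 B value, 64-bit fingerprints, and no fingerprint collisions. These sizes are illustrative and are not the configuration used in Figure 18.

```python
RECORD = 8 + 128  # hypothetical record size in bytes: 8 B key + 128 B value
FP = 8            # 64-bit fingerprint, in bytes


def range_read_bytes(eps: int) -> int:
    """Bytes moved when every record in the window is fetched."""
    return (2 * eps + 1) * RECORD


def two_stage_bytes(eps: int) -> int:
    """Bytes moved when fingerprints are fetched first, then one record."""
    return (2 * eps + 1) * FP + RECORD


for eps in (8, 16, 32, 64, 128, 256):
    print(f"eps={eps:>3}  range read: {range_read_bytes(eps):>6} B  "
          f"two-stage: {two_stage_bytes(eps):>5} B")
```

In this model both volumes grow with ϵ, but the two-stage path grows by one fingerprint per additional window entry rather than one full record, so its slope is smaller by the ratio of record size to fingerprint size.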

4.4.4. Sensitivity to Cache Capacity

We evaluate the throughput variations of different indexing structures under varying compute node cache capacities. We set the cache capacity from 50 MB to 500 MB while keeping parameters such as data distribution, KV specifications, and model error bounds constant, making cache capacity the primary independent variable.

We present the throughput results under different cache capacities in Figure 19. At a 50 MB cache capacity, the throughput of FPCache is comparable to CHIME and significantly higher than ROLEX, Sherman, SMART, and DMTree. Compared with ROLEX, Sherman, SMART, and DMTree, FPCache achieves approximately 1.39×, 2.12×, 2.47×, and 1.19× higher throughput, respectively, demonstrating its strong space efficiency under limited cache resources. DMTree is a tree-based baseline optimized for hierarchical indexing and memory management. However, its effectiveness is limited under small cache configurations due to frequent tree traversal and pointer chasing overhead. In contrast, CHIME benefits from speculative read operations, while FPCache leverages a learned index structure that caches both fingerprints and offsets, enabling direct and accurate key-value localization without additional traversal.

As the cache capacity increases from 50 MB to 100 MB, the throughput of FPCache rises noticeably. When the cache grows further from 100 MB to 500 MB, its performance remains largely stable with only minor fluctuations. These results indicate that even with relatively small and moderate cache capacities, FPCache can already cover most hotspot data. Beyond this point, system performance becomes primarily constrained by network conditions and remote access paths rather than cache capacity.
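The early plateau is what a skewed access distribution would predict. As a rough illustration, the sketch below computes the share of requests that fall on the most popular keys under a Zipfian distribution. The skew α = 0.9 is borrowed from the ablation setup, and the population of one million keys is a hypothetical choice; neither is stated for this experiment.

```python
from itertools import accumulate


def zipf_coverage(num_keys: int, alpha: float = 0.9) -> list[float]:
    """Cumulative request share of the top-k keys, for k = 1..num_keys.

    Key popularity is assumed proportional to rank ** -alpha.
    """
    cumulative = list(accumulate(rank ** -alpha for rank in range(1, num_keys + 1)))
    total = cumulative[-1]
    return [c / total for c in cumulative]


coverage = zipf_coverage(1_000_000)
for cached in (10_000, 100_000, 500_000):
    print(f"top {cached:>7,} keys -> {coverage[cached - 1]:.1%} of requests")
```

Each tenfold increase in cached keys adds a progressively smaller share of requests, so once the hottest keys are resident, further capacity yields diminishing returns and other bottlenecks dominate.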

The baseline methods exhibit varying degrees of dependence on cache capacity. We find that CHIME shows a noticeable performance improvement between 50 MB and 150 MB, but its gains plateau once a certain capacity is reached. ROLEX also shows some improvement at lower capacities but remains relatively consistent beyond 150 MB. The throughput of Sherman remains consistently low across the entire range and shows no significant response to cache variations, suggesting its performance is limited by other bottlenecks. In contrast, SMART is highly sensitive to cache capacity; its throughput continues to rise as capacity increases, eventually surpassing FPCache at 400 MB and 500 MB. This outcome indicates that SMART requires a larger local cache to fully realize its structural advantages, showing a strong dependency on cache resources.

Compared with DMTree, FPCache maintains a consistent advantage across all cache configurations. Specifically, FPCache achieves approximately 1.19×–1.35× higher throughput than DMTree under small cache settings, and still maintains a roughly 1.02× advantage at larger cache capacities. This demonstrates that the learned-index-based design of FPCache provides better cache efficiency and stability than tree-based structures, especially under constrained memory budgets.

Collectively, the advantages of FPCache are most prominent in small to medium cache scenarios due to its high space efficiency and stable throughput. Within the 100 MB to 300 MB range, FPCache maintains its lead with robust stability; even at 500 MB, it still achieves 1.12×, 1.56×, and 2.97× the throughput of CHIME, ROLEX, and Sherman, respectively. Although SMART can achieve higher peak throughput under ultra-large cache conditions, FPCache reaches high performance levels with a significantly smaller cache budget, demonstrating a lower cache space requirement.

5. Related Work

Fingerprinting and Pre-filtering Techniques. Fingerprinting has been widely adopted to accelerate search operations by filtering out mismatches before performing expensive data access. In the context of hybrid SCM-DRAM persistent memory, FPTree [34] utilizes fingerprints in leaf nodes to minimize costly persistent memory reads and key comparisons. By first checking the fingerprints in DRAM, the system only accesses the SCM for final verification when a match is found. For secondary indexes, LSI [32] integrates fingerprints with learned models to prune the candidate set. It addresses the uncertainty in model-predicted ranges by storing fingerprints for each entry, allowing the system to quickly skip irrelevant data within the predicted boundaries. More recently, in disaggregated memory architectures, DMTree [33] introduces the FP-B+ tree to optimize remote B+ tree traversals. It embeds fingerprints at the leaf level to reduce the number of remote RDMA reads, enabling the compute node to verify the existence of a key before fetching the full data payload from the memory node.

Index Optimization for Disaggregated Memory. Under disaggregated memory architectures, prior work has focused on optimizing the read path through index redesign and access optimization to reduce RDMA round-trip latency. Deft [14] introduces a segmented layout for fine-grained RDMA access and incorporates a read-validation mechanism to ensure consistency. SMART [23] mitigates read amplification by replacing B+ trees with the Adaptive Radix Tree (ART) [35] and employs read delegation to eliminate redundant requests. CHIME [11] combines B+ trees with Hopscotch hashing to improve lookup accuracy and reduce remote accesses through metadata aggregation. RACE [36] stabilizes remote access patterns via an RDMA-aware extendible subtable [37] and client-side caching. DEX [13] reduces traversal overhead by localizing hot paths through path-aware caching and optimizing offloading decisions using a cost–benefit model. Outback [38] introduces a decoupled indexing scheme based on dynamic minimal perfect hashing (DMPH), which offloads compute-heavy hashing seeds and locators to the compute node while maintaining lightweight buckets on the memory node to achieve one-hop RDMA access with minimal memory-side CPU overhead.

Learned Indexes for Disaggregated Memory. In the field of learned indexes, several studies leverage model-based prediction to reduce index lookup overhead. XStore [20] introduces a learned cache mechanism that maintains models and translation tables on the client side. The model provides approximate positions, which are mapped to leaf node locations via the translation table. A validation-and-fallback mechanism is invoked upon prediction errors to ensure correctness. ROLEX [21] further exploits learned models for data location prediction, and controls prediction errors through bounded bias and data movement constraints, enabling single-sided RDMA reads within a controllable error range. AStore [22] adopts a similar learned index design, where client and memory nodes share consistent model semantics. It organizes access and caching strategies based on prediction error bounds, allowing remote reads to be issued within well-defined intervals.

Despite these advances, achieving a synergy between minimal read amplification and high cache density remains a significant challenge in disaggregated memory. While recent indexing structures have made strides in reducing network latency through fingerprinting or decoupled designs, they often require storing original keys or substantial metadata, which limits the number of entries a compute node can cache. Meanwhile, learned-index-based approaches still incur read amplification due to model prediction errors, and their cache space overhead remains high when storing full physical pointers. To address these challenges, we propose FPCache, a fingerprint-rectified learned index cache. Unlike prior works that primarily focus on filtering accesses, FPCache uniquely incorporates a fingerprint-offset compression strategy. By substituting raw keys and full pointers with compact fixed-length fingerprints and relative offsets, FPCache not only rectifies prediction errors to minimize read amplification but also significantly increases effective cache density. This allows compute nodes to accommodate a much larger volume of hotspot data within limited DRAM resources compared to existing state-of-the-art designs.

6. Conclusions

In this paper, we propose FPCache, a fingerprint-rectified learned index cache designed for disaggregated memory. FPCache addresses the read inefficiency in disaggregated memory by improving both data access efficiency and cache utilization. The design incorporates two key components: a fingerprint-assisted two-stage read approach and a fingerprint-offset compression strategy. Together, they reduce read amplification while improving cache space efficiency, thereby lowering cross-node network bandwidth consumption and increasing effective cache density for hotspot data. We evaluate FPCache against several state-of-the-art indexing structures under diverse configurations, including different KV sizes, prediction error bounds, cache capacities, and workloads. Experimental results show that FPCache consistently outperforms existing approaches in throughput, with more significant gains observed under large records and large model prediction errors.

Author Contributions

Conceptualization, C.J.; Methodology, C.J.; Formal analysis, C.J.; Writing—review and editing, C.J.; Supervision, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2024 Yangtze River Delta Science and Technology Innovation Community Joint Research Project (Grant No. 2024CSJZN00400).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors affirm that they do not have any conflicts of interest.

Figure 1. Disaggregated memory architecture.

Figure 2. Learned index for disaggregated memory.

Figure 3. The design overview of FPCache.

Figure 4. Fingerprint-assisted two-stage read approach.

Figure 5. Fingerprint-Offset compression strategy.

Figure 6. Throughput comparison of different methods under various workloads.

Figure 7. Latency comparison of different methods under various workloads.

Figure 8. Throughput comparison of different methods on fb dataset under Zipfian distribution.

Figure 9. Throughput comparison of different methods on wiki dataset under Zipfian distribution.

Figure 10. Dynamic hotspot changes in throughput.

Figure 11. Network traffic and I/O comparison between FPCache and ROLEX per 10K queries.

Figure 12. Impact of two-stage read approach on throughput and P99 latency.

Figure 13. Impact of dual compression strategy on throughput and cache hit rate.

Figure 14. Throughput comparison of different methods under various value sizes.

Figure 15. Throughput comparison of different methods under various key sizes.

Figure 16. Throughput comparison under different fingerprint sizes.

Figure 17. Memory usage and throughput comparison for fingerprint, bloom, and cuckoo.

Figure 18. Throughput comparison of different methods under various error bounds.

Figure 19. Throughput comparison of different methods under various cache sizes.

Table 1. Data retrieved per query under various error bounds.

Error Bound ($\epsilon$)	8	32	64	128	256
Data retrieved per query (KB)	0.25	1	2	4	8
Throughput (Kops/s)	1112	928.3	729.4	447.8	239.4
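
The rows of Table 1 can be cross-checked with a few lines of arithmetic. The sketch below copies the table values and computes how many bytes are retrieved per unit of error bound, and how far throughput falls between the smallest and largest bound. It assumes 1 KB = 1024 B; the helper name is hypothetical, and the per-unit figure is only an observation derived from the table, not a layout parameter stated here.

```python
# Values copied from Table 1.
error_bounds = [8, 32, 64, 128, 256]
kb_per_query = [0.25, 1, 2, 4, 8]
throughput_kops = [1112, 928.3, 729.4, 447.8, 239.4]


def bytes_per_error_unit(eps, kb):
    """Bytes retrieved per unit of error bound, assuming 1 KB = 1024 B."""
    return kb * 1024 / eps


ratios = [bytes_per_error_unit(e, k) for e, k in zip(error_bounds, kb_per_query)]
data_growth = kb_per_query[-1] / kb_per_query[0]
slowdown = throughput_kops[0] / throughput_kops[-1]

print("bytes per unit of error bound:", ratios)
print(f"data retrieved grows {data_growth:.0f}x, throughput drops {slowdown:.2f}x")
```

The ratio is constant across all five columns, so the data retrieved per query scales linearly with the error bound, while throughput falls more slowly than the retrieved volume grows.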

Table 2. Impact of key size on cache capacity under a 100 MB read cache.

Key Size	8 B	16 B	32 B	64 B	128 B
Cached Entries	6.55 M	4.36 M	2.62 M	1.46 M	0.77 M
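
The entry counts in Table 2 are numerically consistent with a simple per-entry cost model. The sketch below is an inference from the table rather than a formula given in the text: it assumes that "100 MB" means 100 × 1024 × 1024 bytes and that each cached entry costs the key size plus a hypothetical 8 bytes of additional data. The constant `ASSUMED_OVERHEAD` and the function name are illustrative only.

```python
CACHE_BYTES = 100 * 1024 * 1024  # assumes "100 MB" means 100 MiB
ASSUMED_OVERHEAD = 8             # hypothetical per-entry bytes besides the key

# Key size in bytes -> cached entries in millions, copied from Table 2.
table2 = {8: 6.55, 16: 4.36, 32: 2.62, 64: 1.46, 128: 0.77}


def estimated_entries_millions(key_size, overhead=ASSUMED_OVERHEAD):
    """Entries (in millions) that fit if each costs key_size + overhead bytes."""
    return CACHE_BYTES / (key_size + overhead) / 1e6


for key_size, reported in table2.items():
    estimate = estimated_entries_millions(key_size)
    print(f"{key_size:>3} B key: reported {reported:.2f} M, estimated {estimate:.2f} M")
```

Under these assumptions every estimate lands within 0.01 M of the reported value, which illustrates why caching full keys shrinks the number of cacheable entries roughly in inverse proportion to key size.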

Table 3. Dataset Characteristics.

Workload Type	Distribution	Key Size	Value Size
YCSB-B (95% Read, 5% Update)	Zipfian	128 B	128 B
YCSB-C (100% Read)	Zipfian/Uniform	128 B *	128 B *
YCSB-D (95% Read, 5% Insert)	Zipfian	128 B	128 B
YCSB-E (95% Scan, 5% Update)	Zipfian	128 B	128 B
Facebook user_id	Zipfian	8 B	8 B
Wiki timestamps	Zipfian	8 B	8 B

* For sensitivity analysis in Section 4.4, we vary the fixed key size from 8 to 256 B and the fixed value size from 8 to 1024 B across different experimental runs.
