SwiftKV: A Metadata Indexing Scheme Integrating LSM-Tree and Learned Index for Distributed KV Stores

Wang, Zhenfei; Feng, Jianxun; Dun, Longxiang; Bao, Ziliang; Du, Chunfeng

doi:10.3390/fi17090398


by Zhenfei Wang, Jianxun Feng, Longxiang Dun, Ziliang Bao and Chunfeng Du *

School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

* Author to whom correspondence should be addressed.

Future Internet 2025, 17(9), 398; https://doi.org/10.3390/fi17090398

Submission received: 30 July 2025 / Revised: 25 August 2025 / Accepted: 27 August 2025 / Published: 30 August 2025


Abstract

Optimizing metadata indexing remains critical for enhancing distributed file system performance. The traditional Log-Structured Merge-Tree (LSM-Tree) architecture, while effective for write-intensive operations, exhibits significant limitations when handling massive metadata workloads, notably suboptimal read performance and substantial indexing overhead. Although existing learned indexes perform well on read-only workloads, they struggle to support modifications such as inserts and updates effectively. This paper proposes SwiftKV, a novel metadata indexing scheme that combines LSM-Tree and learned indexes to address these issues. Firstly, SwiftKV employs a dynamic partition strategy to narrow the metadata search range. Secondly, a two-level learned index block, consisting of Greedy Piecewise Linear Regression (Greedy-PLR) and Linear Regression (LR) models, replaces the typical Sorted String Table (SSTable) index block for faster location prediction than binary search. Thirdly, SwiftKV incorporates a load-aware construction mechanism and parallel optimization to minimize training overhead and enhance efficiency. This work bridges the gap between LSM-Trees’ write efficiency and learned indexes’ query performance, offering a scalable and high-performance solution for modern distributed file systems. A prototype of SwiftKV is implemented on top of RocksDB. The experimental results show that it reduces the memory usage of index blocks by 30.06% and reduces read latency by 1.19×~1.60× without affecting write performance. Furthermore, SwiftKV’s two-level learned index achieves a 15.13% reduction in query latency and a 44.03% reduction in memory overhead compared to a single-level model. For all YCSB workloads, SwiftKV outperforms other schemes.

Keywords:

metadata indexing; KV storage; LSM-Tree; dynamic partitioning; learned index

1. Introduction

Metadata, which describes file attributes and locations, is fundamental to distributed system performance, and its efficient indexing is crucial [1,2]. Driven by explosive growth in data volumes, modern distributed file systems commonly manage billions of files [2,3], pushing the metadata scale to the petabyte level. This necessitates storage solutions offering both larger capacity and higher performance to meet the demands of data-intensive workloads. Since metadata operations constitute approximately 50~70% of all data accesses in a distributed file system [4], designing an efficient metadata index is key to improving the overall performance of the system.

To manage the petabyte-scale metadata generated by billions of files in modern systems, Key-Value (KV) storage has emerged as the core infrastructure for metadata management, owing to its efficient access to internal file system information. Existing research has developed specialized KV-based metadata management frameworks specifically designed to index massive metadata volumes [5,6,7]. Within the domain of distributed KV storage, LSM-Tree-based storage engines have become the prevailing solution [8]. This architecture, exemplified by systems such as BigTable [9], LevelDB [10], RocksDB [11], HBase [12] and Cassandra [13], offers significant advantages for write-intensive workloads. Specifically, the LSM-Tree optimizes write performance by converting random writes into sequential writes, thereby accelerating data persistence and enhancing overall system throughput.

However, due to the multi-level storage structure of the LSM architecture, a single query may access multiple data files, increasing I/O overhead. Existing research has explored multiple perspectives, but there are still many deficiencies. For example, LevelDB boosts write throughput with asynchronous compaction but struggles to scale because it lacks intelligent concurrency control and a cache mechanism [10]. BigTable proposed a distributed LSM-Tree architecture [8], but it is bound to the GFS design and cannot adapt to new storage devices. RocksDB improves read/write balance via bloom filters, prefix compression and other innovations [11], but its binary-search based indexing still lags on ultra-fast storage. AC-Key [14] uses cache to speed up the search process. Although cache speeds up the retrieval of frequently accessed data, it introduces non-negligible storage overhead as the amount of data increases. REMIX [15] uses filters to reduce the number of probe keys. However, filters are susceptible to high false-positive rates. In conclusion, these limitations highlight the urgent need for more efficient LSM-Tree read strategies.

In recent years, learned indexes have attracted close attention in the industry. A learned index uses Machine Learning (ML) models to map keys to storage locations, thereby achieving more efficient query performance than traditional indexes [16]. However, applying it to distributed KV storage faces two challenges: limited scalability and high overhead in building indexes. Specifically, the frequent updates in metadata-intensive workloads quickly render learned models obsolete, necessitating costly retraining. Approaches like ALEX [17] and PGM-index [18] reserve empty slots in the training data queue for writes, but model retraining is still needed when the empty slots run out, blocking system resources and hurting concurrency. XIndex avoids model retraining through its incremental buffer [19] but reduces search performance by 3× because each index operation needs to check both the learned index and the buffer. Furthermore, existing integrations with LSM-Trees remain imperfect: Bourbon’s learned index is located at the data block level and still relies on in-block binary searches, limiting read performance [20]. Google’s approach introduces fragmentation through its deletion markers, leading to compaction overhead and unstable read performance [21]. These limitations highlight the need for new solutions.

Inspired by the immutable nature of LSM-Tree’s SSTable and current defects in the read performance of learned LSM-Trees, this paper proposes SwiftKV, a highly efficient KV indexing scheme for metadata management. Since SSTable files in LSM-Trees usually do not change after generation and are merged and sorted during compaction, this solution builds learned indexes in SSTables and uses the Entry, rather than the block, as the learning unit to avoid additional binary search overhead within the block. Specifically, this paper makes the following contributions:

(1): Proposal of a subtree-isolation partitioning mechanism to enhance LSM-Tree query performance by narrowing search ranges. It uses a shared global MemTable for all partitions, buffering metadata according to the partitioning rules when writing and separating them by subtree when flushing.
(2): Replacement of the traditional index block in the SSTable with a two-level learned index block to improve read performance. The first level uses Greedy-PLR to partition the SSTable key space coarsely through dynamic segmentation, and the second level uses a group of LR models for fine-grained offset prediction within segments.
(3): Proposal of a load-aware construction mechanism and a multithreading optimization of the index construction process, which increases the construction speed while avoiding invalid training overhead.

SwiftKV is implemented based on RocksDB and compared with other representative solutions to evaluate its effectiveness. The experimental results show that it improves read performance over RocksDB and performs better than other solutions in real-world read scenarios.

The rest of this paper is organized as follows. Section 2 introduces the background and motivations of our work. Section 3 presents the design of SwiftKV, including its partition strategy, learned index structure, and optimization mechanisms. Section 4 shows the experimental results. Section 5 reviews the related work. Finally, Section 6 concludes this paper and discusses future work.

2. Background and Motivation

2.1. Distributed File System and Metadata Indexing Technology

Distributed file systems (DFSs) have emerged as a viable solution, allowing multiple computers or nodes to access and share files stored on different machines as if using a single system [22]. Modern DFSs (such as CephFS [23] and HDFS [24]) usually decouple file metadata from file data, requiring high consistency, efficiency, fault tolerance, and scalability in metadata management. How to efficiently organize and index these metadata has become an important research direction in the field of DFSs.

Mainstream metadata indexing technologies fall into two categories. The first employs spatial trees (such as K-D trees [25], R trees [26]) to index metadata, as seen in systems like Spyglass [1], SmartStore [27]. While effective for specialized domains like geospatial systems, these methods face node imbalance with high-dimensional data, degrading query performance. The second approach employs external databases for indexing (such as GUFI [28], Robinhood [29]), reducing memory pressure but introducing consistency and synchronization challenges. Recent KV-based metadata management research uses LSM-Trees to manage and index metadata, but still largely depends on the inefficient “dentry-inode” indirect indexing scheme [30], highlighting the need for new approaches.

2.2. LSM-Tree Structures

The LSM-Tree is one of the most widely used data structures in distributed KV stores thanks to its efficient write performance [8], appearing in database engines such as LevelDB and RocksDB [10,11] and in distributed file systems such as CephFS and HDFS [23,24]. Its structure is shown in Figure 1. New writes are first appended to the log, which keeps a copy for recovering data that has not yet been persisted if a failure occurs; the actual write then takes place. The data is first written to the MemTable. When it reaches capacity, the MemTable becomes immutable and is then flushed to disk by background threads and converted into SSTable files.

SSTables are organized hierarchically, with newer files at higher levels. Once a level exceeds its capacity, it is merged to the next level through the background compaction process, deleting redundant or expired KV pairs and preserving only the latest versions. Each SSTable contains a data area and an index area. The index area stores key information used to describe the basic properties and content of the file, such as the file flag or file length, index of key and corresponding data location, and bloom filter.

For read operations, LSM-Tree first searches for the key in MemTable and Immutable MemTable. If missed, it traverses the L0~Ln levels to locate the key in SSTables until the key is found. The detailed process is divided into six steps:

Seek_Tables: Identify SSTables that may contain key (called Candidate SSTables).

Load_IB/FB: Load the index block and the filter block into memory.

Seek_FB: For each Candidate SSTable, check its bloom filter to determine whether key exists.

Seek_IB: If the filter indicates a positive result, use binary search in the index block to locate the data block where key may exist.

Load_DB: Load the target data block from the disk according to the offset provided by the index block.

Seek_DB: Locate the target KV pair via binary search within the data block.
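The six steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToySSTable` and `get` are hypothetical names, an exact key set stands in for the bloom filter (so it never reports false positives), and blocks are Python lists rather than on-disk structures.

```python
from bisect import bisect_left


class ToySSTable:
    """Hypothetical in-memory stand-in for an SSTable (not RocksDB's file format)."""

    def __init__(self, pairs, block_size=2):
        pairs = sorted(pairs)
        # Data blocks: fixed-size runs of sorted (key, value) pairs.
        self.blocks = [pairs[i:i + block_size] for i in range(0, len(pairs), block_size)]
        # Index block: the last key of every data block.
        self.index = [block[-1][0] for block in self.blocks]
        # Filter block: an exact set replaces the bloom filter in this sketch.
        self.filter = {k for k, _ in pairs}
        self.min_key, self.max_key = pairs[0][0], pairs[-1][0]

    def get(self, key):
        if key not in self.filter:                # Seek_FB
            return None
        block_no = bisect_left(self.index, key)   # Seek_IB: binary search in the index block
        block = self.blocks[block_no]             # Load_DB
        keys = [k for k, _ in block]
        pos = bisect_left(keys, key)              # Seek_DB: binary search in the data block
        if pos < len(keys) and keys[pos] == key:
            return block[pos][1]
        return None


def get(key, memtable, levels):
    """Check the MemTable first, then each level in order (L0 first)."""
    if key in memtable:
        return memtable[key]
    for level in levels:
        for table in level:
            # Seek_Tables: only tables whose key range covers the key are candidates.
            if table.min_key <= key <= table.max_key:
                value = table.get(key)
                if value is not None:
                    return value
    return None
```

Even in this toy version, a lookup that misses the MemTable performs two binary searches per candidate table, which is the cost the learned index in Section 3.3 targets.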

2.3. Learned Index

Kraska et al. first proposed the concept of the learned index in 2018 to optimize traditional index structures such as B+ trees [16,31], hash tables and bloom filters [32,33]. The idea is that if the data distribution can be fitted accurately, index query performance improves greatly without requiring additional storage space. Take the B+ tree index as an example: the B+ tree maps keys to locations in the sorted record array with minimum and maximum errors (the minimum error is 0 and the maximum error is the page size) and guarantees that the key can be found in that area (if it exists). Therefore, we can replace the B+ tree with other ML models as long as they provide similarly strong guarantees on the minimum and maximum errors.

Research shows that the relationship model between keys and positions is a monotonically increasing curve that approximates the Cumulative Distribution Function (CDF) of the dataset [16,19], and therefore the CDF of the dataset can be modeled to predict the location. Most existing learned indexes use simple LR models to fit CDF, avoiding the training overhead of complex models, so only the slope and intercept parameters of the model need to be stored. Benefiting from lightweight parameters, learned indexes consume less memory than traditional indexes and better fit the data distribution law, showing better query performance in read-intensive scenarios.
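As a minimal sketch of this idea, the following fits a single LR model to the key-to-position mapping of a sorted array and records the minimum and maximum prediction errors, so that a lookup only searches inside the guaranteed window. The function names `build_model` and `lookup` and the sample keys are illustrative assumptions, not taken from any cited system.

```python
import math
from bisect import bisect_left


def build_model(keys):
    """Fit position = a * key + b (least squares) and record the error bounds."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in keys)
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    a = 0.0 if var_x == 0 else cov / var_x
    b = mean_y - a * mean_x
    errors = [i - (a * x + b) for i, x in enumerate(keys)]
    # Only four numbers are stored, regardless of how many keys there are.
    return a, b, min(errors), max(errors)


def lookup(keys, model, key):
    """Predict the position, then binary-search only inside the error window."""
    a, b, err_min, err_max = model
    n = len(keys)
    pred = a * key + b
    lo = min(n, max(0, math.floor(pred + err_min)))
    hi = max(lo, min(n, math.floor(pred + err_max) + 2))
    pos = bisect_left(keys, key, lo, hi)
    return pos if pos < n and keys[pos] == key else None
```

For example, with `keys = [3, 8, 15, 16, 23, 42, 57, 91]` and `model = build_model(keys)`, `lookup(keys, model, 23)` returns position 4 and `lookup(keys, model, 20)` returns `None`.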

2.4. Motivation

Motivation 1: Read performance bottleneck. As mentioned in the introduction, most operations in distributed file systems are metadata operations, and get operations account for 94.5% of all metadata operations [34], making read performance optimization critical. However, our analysis reveals that RocksDB suffers from intrinsic read amplification issues, which become unbearable under modern metadata workloads. This amplification stems from three primary sources:

Inefficient partitioning: Its Column Family (CF) partitioning requires an independent MemTable per family, increasing memory contention and hurting read performance.

Tombstone overhead: Large deletions insert tombstones (deletion markers). Reads must scan through these invalid entries until a compaction process permanently removes them, which raises unnecessary I/O and CPU costs [11,35].

Inefficient index search: The LSM-Tree relies on binary search in index blocks and data blocks, with inferior query efficiency to learned indexes [16].

Consequently, an effective reduction in read amplification hinges on two critical enhancements: (1) a refined partition strategy that minimizes search scope and avoids resource contention, and (2) an augmented indexing mechanism that replaces binary search with faster lookup methods.

Motivation 2: The synergistic potential of learned indexes and LSM-Trees is promising. The learned index’s compatibility with LSM-Tree’s immutable SSTables is the key advantage. Once an SSTable is written, it remains unchanged until compaction, providing a stable data distribution on which a learned model can be trained without fear of immediate obsolescence.

By replacing binary search with ML-based positional prediction, the learned index can directly predict the position of KV pairs, reducing search iterations, minimizing access to irrelevant data, and thus lowering disk I/O. Additionally, since the data of other levels (except for Level 0) is stable within a certain period of time, constructing learned indexes for these stable levels can avoid the overhead of frequent model retraining. Our design seeks to fully exploit this synergy by integrating learned indexes at the SSTable level: a granularity that offers a favorable balance between model accuracy and update overhead.

3. Design of SwiftKV

3.1. Overall Architecture

To address the read performance bottlenecks identified in Section 2.4, we designed SwiftKV with three core principles in mind: (1) reduce search scope, (2) accelerate index lookups, and (3) minimize operational overhead. The overview architecture of SwiftKV is shown in Figure 2.

Partition strategy: SwiftKV leverages a partition-based strategy to distribute metadata across distinct subtrees, each structured as an independent LSM-Tree. During the flush operation from memory to device, KV pairs from the MemTable are first routed into a partitioned data buffer according to this strategy. Subsequently, each buffer segment is flushed in parallel to its corresponding subtree. The partition strategy incorporates dual write/flush buffers and constructs a lightweight L0 index to optimize the L0 level search.

Learned index block: We replace the traditional index blocks in SSTables with learned index blocks, leveraging an efficient model based on the two-level linear structure of the Recursive Model Index (RMI) [16]. When querying data, the model directly predicts target block locations, eliminating binary search and accelerating lookup operations. Because only model parameters are stored, the learned index blocks are also considerably more compact than their traditional counterparts, enabling a greater number of index blocks to reside in memory. These two advantages, faster prediction and more index blocks cached in memory, significantly improve read performance for metadata management.

Read process: SwiftKV employs a sequential, multi-tiered lookup process to probe the active MemTable, Immutable MemTable, and data buffer. If the key remains unresolved in memory, SwiftKV uses its mapping mechanism to identify and access the corresponding SSTable subtree on the device. During SSTable access, the system verifies the key’s potential existence via the bloom filter. Upon a positive indication, the learned index efficiently pinpoints the exact location of the relevant data block, and this block is loaded into memory, where the specific value associated with the key is retrieved through an intra-block search.

3.2. Partition Strategy

To fully leverage metadata and enhance the system’s scalability, DFSs commonly partition the namespace into smaller, independent index sets, known as metadata partitions [36]. However, as described in Section 2.4, the CF partition strategy that comes with RocksDB has poor read performance during large-scale deletions and high-concurrency reading [11]. Therefore, we designed a novel partition strategy that performs partitioning under a MemTable using two buffer arrays and a lightweight B+ tree index. Meanwhile, all partitions share a single MemTable instance, eliminating the need for redundant memory allocation. Furthermore, each partition is assigned a dedicated deletion key. During range deletion operations, this key enables instantaneous deletion of the entire targeted range, obviating the need for tombstone insertion. Consequently, this approach effectively avoids the performance overhead associated with subsequent compaction operations.

It is worth noting that SwiftKV employs an LSM-Tree for metadata storage. Hence, within the original SSTable structure, the data blocks store the metadata itself, while the index blocks store metadata about this metadata to effectively serve as the metadata index. The write buffer is located immediately adjacent to the Immutable MemTable’s exit. It temporarily distributes the KV data by PartitionID to ensure global order across the Immutable MemTable. The read buffer is located at the L0 entry of each sub-tree, accumulates 4 KB aligned data blocks, and appends them sequentially to L0 after sorting them by key. As shown in Figure 3, the partition strategy consists of three steps, corresponding to the numbers (①②③):

Key partition: When the Immutable MemTable reaches a certain size, KV pairs are flushed to the metadata buffer. The first 8 bytes of the key are the logical partition key (LPK). The partition to which a key belongs is determined by calculating $\mathit{PartitionID} = \mathit{LPK} \bmod P$, where P is the number of subtrees (default 64, configurable). Then, KV pairs belonging to the same partition are merged and packaged into blocks.
Block flushing: When the blocks of a partition reach a certain size, they are flushed to the L0 level of the subtree, and a B+ tree index is built in the background. This lightweight B+ tree is maintained in memory and records the mapping from partition index to subtree root directory, making it easier to locate the subtree directly in subsequent queries. In each subtree, a certain number of blocks in the L0 level are combined into a metadata table, excluding the index information component.
Table compaction: During the compaction process of the L0 level, the metadata table is merged into SSTables. In other words, the L1–Ln levels all maintain the form of SSTables, so their compaction and read processes are exactly the same as in native RocksDB.
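The key partition step can be sketched as follows. The byte order used to decode the LPK is not specified in the text, so big-endian unsigned decoding is an assumption here, and `partition_id` and `route` are hypothetical helper names.

```python
from collections import defaultdict

NUM_SUBTREES = 64  # P, the default number of subtrees stated in the text


def partition_id(key: bytes, num_subtrees: int = NUM_SUBTREES) -> int:
    """PartitionID = LPK mod P, where LPK is the first 8 bytes of the key."""
    lpk = int.from_bytes(key[:8], "big")  # assumption: big-endian unsigned integer
    return lpk % num_subtrees


def route(pairs, num_subtrees: int = NUM_SUBTREES):
    """Group KV pairs by partition and keep every partition sorted by key."""
    buffers = defaultdict(list)
    for key, value in pairs:
        buffers[partition_id(key, num_subtrees)].append((key, value))
    return {pid: sorted(items) for pid, items in buffers.items()}
```

A key whose first 8 bytes encode 130 is routed to partition 130 mod 64 = 2, so a later query for that key only needs to consult subtree 2.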

SwiftKV flushes the MemTable synchronously when executing the partition strategy. First, the inserted KV pairs are preprocessed by type to ensure that they are inserted into the data block in order. When the size of the inserted KV pairs reaches a configurable block size threshold (4 KB by default), the block is first constructed by the build function and then written to the file by the write function. Subsequently, the corresponding metadata index block is generated and written to the file using the same functions. Crucially, this metadata index block is regenerated from scratch based on the newly written metadata block, rather than merged from the older version, employing a Copy-on-Write (COW) mechanism for write consistency. The buffer consists of a write buffer and a flush buffer. To ensure that the metadata in each block is in order, each Immutable MemTable is first added to the write buffer and then merged into the flush buffer. During the merge process, each partition is kept sorted by key. When the flush buffer fills, its blocks are appended to Level 0 of their respective subtree. During the flush process, the metadata table of the L0 level consists of multiple metadata blocks; when these are merged, SSTable files are generated and stored in L1 and above. The flush buffer tracks the current flush size per partition. When it reaches the threshold, it triggers the construction of the metadata table and divides different metadata tables according to the recorded starting position offset.

However, block-based flushing introduces key-range overlaps between adjacent blocks within each metadata table. This fragmentation necessitates scanning multiple blocks during range queries, resulting in non-sequential device access patterns that degrade read efficiency. To enhance read performance, we construct a lightweight B+ tree index in the L0 level that quickly locates the metadata table containing the key and reduces the file traversal overhead of the L0 level. Because the size of the L0 level is fixed, the constructed B+ tree stays small and does not take up much memory. Crucially, SwiftKV employs a shared MemTable architecture across all partitions. When writing, the data is first buffered according to the partitioning rules and then separated by subtree when being flushed to disk. Compared to RocksDB’s CF partition strategy, which maintains a MemTable for each partition, SwiftKV significantly reduces memory overhead. Furthermore, our solution improves read efficiency in the LSM-Tree because the data is isolated by partition: key queries only search the subtree of the target partition, eliminating access to irrelevant SSTables. Moreover, each partition has fewer levels and a smaller amount of data, which greatly reduces the search scope.

3.3. SSTable Learned Index

As established in Section 1, learned indexes perform well on datasets that are not frequently modified, which matches the characteristics of SSTables. Therefore, we designed a novel learned index to replace the conventional SSTable index. Existing studies indicate that a two-level model based on RMI is sufficient to adapt to various workloads, so we also adopt this hierarchical architecture. Inspired by the structure of Bourbon [20], we choose Greedy-PLR as our baseline model. Other possible model architectures are beyond the scope of this paper.

As shown in Figure 4, SwiftKV’s hierarchical index comprises a first-level Greedy-PLR model and multiple second-level LR models. Greedy-PLR, as a lightweight enhancement of the PLR algorithm [37], does not need to store all the data, but processes point by point and determines the segmentation instantly. As shown in Algorithm 1, each segment only needs to store a small number of parameters (the minimum key value $s_k$, slope a, and intercept b), which significantly reduces model space overhead compared to traditional approaches. The specific implementation is as follows:

Algorithm 1: Greedy-PLR

Input: an ordered collection D of SST keys {key_1, key_2, …, key_k} and offsets {offset_1, offset_2, …, offset_k}, error threshold ε
Output: segments S = {s1, s2, …, sn}, each a tuple (s_k, a, b), where s_k is the minimum key of the segment; each segment corresponds to an LR model

Initialize: S ← Ø; current_start ← 1;
            current_points ← {(key_1, offset_1)};
            a_prev, b_prev ← NaN;
//Greedy strategy to add data points
for i = 2 to k:
    add (key_i, offset_i) to current_points;
    //Fit linear model y = a·x + b using least squares on current_points
    compute a, b ← argmin_{a,b} ∑_{(key_j, offset_j) ∈ current_points} (offset_j − (a·key_j + b))²;
    //Compute maximum error
    e_max ← max_{(key_j, offset_j) ∈ current_points} |offset_j − (a·key_j + b)|;
    if e_max > ε:
        add segment (key_{current_start}, a_prev, b_prev) to S;
        //Reset
        current_start ← i;
        current_points ← {(key_i, offset_i)};
        a_prev, b_prev ← NaN;
    else a_prev ← a, b_prev ← b;
//Finalize last segment
add segment (key_{current_start}, a_prev, b_prev) to S;
return S;

Greedy-PLR employs a stream-compatible greedy segmentation strategy. During data scanning, this strategy creates new segments only when the current key violates the error bound of the active segment. This approach requires only a single linear scan for model training, achieving O(n) time complexity (where n represents the total data volume). Conversely, traditional PLR algorithms require multiple scanning operations to generate data segments, typically with higher than O(nlog(n)) time complexity, and are inferior in memory usage efficiency.
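A direct Python transcription of Algorithm 1 may make the control flow easier to follow. It is a sketch rather than the SwiftKV implementation, with two labeled deviations: a segment holding a single point gets the constant model (a = 0, b = offset) instead of NaN, and the line is refitted from scratch for every new point for readability, so this version does not reach the single-pass cost described above. The names `fit_line` and `greedy_plr` are hypothetical.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return a, mean_y - a * mean_x


def greedy_plr(keys, offsets, eps):
    """Return segments as (min_key, a, b) tuples, following Algorithm 1."""
    segments = []
    current = [(keys[0], offsets[0])]
    # Deviation from Algorithm 1: constant model for a one-point segment, not NaN.
    a_prev, b_prev = 0.0, float(offsets[0])
    for point in zip(keys[1:], offsets[1:]):
        current.append(point)
        a, b = fit_line(current)
        e_max = max(abs(y - (a * x + b)) for x, y in current)
        if e_max > eps:
            # Close the segment with the last model that respected the bound.
            segments.append((current[0][0], a_prev, b_prev))
            current = [point]
            a_prev, b_prev = 0.0, float(point[1])
        else:
            a_prev, b_prev = a, b
    # Finalize the last segment.
    segments.append((current[0][0], a_prev, b_prev))
    return segments
```

With keys `[0, 1, 2, 3, 100, 110, 120, 130]`, offsets `0..7`, and `eps = 0.5`, the jump from key 3 to key 100 breaks the error bound, so the sketch produces two segments starting at keys 0 and 100.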

For learned index construction, SwiftKV utilizes the individual Entry as the fundamental unit instead of the data block. Each Entry constitutes a fixed-size (32-byte) serialized structure comprising a 16-byte key (including an 8-byte integer encoding and an 8-byte InternalKey encoding) and a 16-byte value (containing an 8-byte vlog file number and an 8-byte offset), thereby providing the granularity necessary for efficient learned index modeling. Compared with building learned indexes directly on blocks, building in Entry units can significantly reduce model computation and misprediction costs, because an Entry is smaller than a block and a mispredicted block always incurs device I/O. Our SSTable learned index employs a hierarchical prediction pipeline comprising three sequential stages: (1) the top-level Greedy-PLR model locates the target LR segment; (2) the associated LR model generates an approximate position (ApproxPos) within defined error bounds; (3) the entry-level index maps this ApproxPos to a definitive file byte offset (EntryOffset), enabling direct data access. The calculation is as follows:

$\mathit{EntryOffset} = \mathit{EntryIndex} \times \mathit{EntrySize}$ (1)

where EntryIndex is the sequence number of Entry in SSTable, and EntrySize is the size of an Entry.
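The three-stage pipeline can be sketched as follows, assuming segments in the (min_key, a, b) form produced by Algorithm 1, where each model predicts an entry index and the error bound ε is known at lookup time. The function name `locate` and the in-memory `keys` list, which stands in for the on-disk entries, are illustrative assumptions.

```python
import math
from bisect import bisect_right

ENTRY_SIZE = 32  # bytes per Entry, as stated in the text


def locate(segments, keys, key, eps):
    """Return the byte offset of `key` inside the SSTable, or None if absent.

    segments: list of (min_key, a, b) sorted by min_key.
    keys: sorted entry keys, a hypothetical stand-in for the stored entries.
    """
    # Stage 1: pick the segment with the largest min_key that is <= key.
    seg_no = bisect_right([s[0] for s in segments], key) - 1
    if seg_no < 0:
        return None
    _, a, b = segments[seg_no]
    # Stage 2: the segment's LR model yields an approximate position.
    approx_pos = a * key + b
    lo = max(0, math.floor(approx_pos - eps))
    hi = min(len(keys) - 1, math.ceil(approx_pos + eps))
    # Stage 3: check the few entries inside the error window, then apply Equation (1).
    for entry_index in range(lo, hi + 1):
        if keys[entry_index] == key:
            return entry_index * ENTRY_SIZE
    return None
```

For instance, with segments `[(0, 1.0, 0.0), (100, 0.1, -6.0)]`, keys `[0, 1, 2, 3, 100, 110, 120, 130]`, and `eps = 0.5`, looking up key 120 selects the second segment, predicts position 6, and returns byte offset 6 × 32 = 192.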

3.4. Load-Aware Index Adaptive Construction Mechanism

Given that most metadata resides in the upper LSM-Tree levels, where SSTables have short lifespans, avoiding learned index construction for data that is never queried eliminates unnecessary system overhead. Therefore, the system dynamically decides whether to learn an SSTable so as to obtain the maximum cost effectiveness. The construction cost of a learned index is directly reflected by its construction time, which is an abstract function of the data volume and the complexity of the data distribution. Since each SSTable accommodates a fixed number of entries, the data volume is a constant, and the construction cost depends only on the complexity of the data distribution.

To enable cost-effective learned indexing, we maintain per-level complexity counters within the LSM-Tree hierarchy. For SSTables in the L1–Ln levels, we use an adaptive algorithm to decide whether to build a learned index for an SSTable. Critically, our adaptive algorithm operates proactively during compaction operations, not reactively after SSTable materialization. It determines learned index construction eligibility for the next-level SSTables generated by compaction. This preemptive approach avoids the substantial I/O penalty of reading persisted SSTables back into memory for retrospective index building, which would incur prohibitive overhead. When multiple SSTable files are compacted for a key range, the statistics of these input SSTables determine whether to build a learned index for the resulting next-level SSTables according to the following formula:

$\Delta = (c / t - c_a / t_a) \times \mathit{save}_l - t_{li}$, (2)

Six parameters govern our model. Specifically, c is the average number of search counts for SSTable files in compaction, which is maintained via an in-memory concurrent hash table (with SSTable ID as Key and atomic counters as Value) and automatically increments during the index lookup phase of query processing. t is the average lifespan of SSTable files in compaction, which is calculated as the duration between SSTable creation and compaction participation, measured using a monotonic clock in memory to avoid filesystem operations. These metrics are collected uniformly by the LSM-Tree’s hierarchical counter, introducing negligible overhead. $c_a$ is the historical average number of queries for the LSM-Tree level, $t_a$ is the average lifespan of the files in this level, and they employ exponential smoothing ($\alpha = 0.25$) to effectively dampen transient fluctuations while remaining responsive to workload shifts. $\mathit{save}_l$ is the time saved by using the learned index for each search of the files in this level compared to the sparse index; $t_{li}$ is the construction time of the learned index. The learned index is constructed only when $\Delta > 0$. Upon completion of the compaction, the average number of queries for this level will be recalculated by incorporating the query count observed after the compaction. Higher levels of the LSM-Tree, which typically experience more frequent searches, exhibit a higher probability of having learned indexes built. The hierarchical counter records the data above and stores it in memory, and dynamically maintains it by level through the hash array.
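The decision rule of Equation (2) can be sketched as a small function. The exact form of the smoothing update is not given in the text, so the standard exponential smoothing formula is assumed here; `smooth` and `should_build` are hypothetical names, and the numbers in the usage note are made up purely for illustration.

```python
ALPHA = 0.25  # smoothing factor stated in the text


def smooth(previous, observed, alpha=ALPHA):
    """Exponential smoothing of a per-level average (assumed update form)."""
    return alpha * observed + (1 - alpha) * previous


def should_build(c, t, c_a, t_a, save_l, t_li):
    """Evaluate Equation (2); the learned index is built only when delta > 0.

    c, t: average search count and lifespan of the SSTables being compacted.
    c_a, t_a: smoothed historical query count and lifespan of the level.
    save_l: time saved per search by the learned index.
    t_li: construction time of the learned index.
    """
    delta = (c / t - c_a / t_a) * save_l - t_li
    return delta > 0, delta
```

With the illustrative values `c=400, t=10, c_a=100, t_a=10, save_l=2, t_li=50`, delta is (40 − 10) × 2 − 50 = 10, so the index would be built; lowering c to 120 gives delta = −46 and the build is skipped.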

3.5. Parallel Optimization

To better leverage the distributed system, we adopt a parallel optimization strategy during model construction. Specifically, all Entry items of each SSTable are divided into pieces of equal size, and the optimal PLR segments of each piece are fitted in parallel, with one thread per piece. We use a sharding algorithm based on the number of threads to guide this process. Let the number of threads be $T_n$, the number of KV pairs in the dataset be k, and the number of KV pairs in each piece be $P_n$. Then they satisfy the following formula:

$P_n = k / T_n$, (3)

Under this strategy, each data piece is fitted with the minimum number of PLR segments, which in theory speeds up the construction of learned index models. However, the experiments in Section 4.5 reveal that this construction strategy is not universally effective: it significantly improves build speed for data with relatively scattered distributions but yields minimal benefit for datasets governed by simple distribution rules.
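The sharding rule of Equation (3) can be illustrated with a short sketch. It is not the paper's Greedy-PLR code: the fitting routine is a simple shrinking-cone segmentation used as a stand-in, the names `greedy_plr` and `build_parallel` and the error bound `delta` are hypothetical, and the piece size is rounded up so that a remainder falls into a shorter last piece (an assumption). Keys are assumed to be sorted and distinct.

```python
from concurrent.futures import ThreadPoolExecutor


def _slope(lo, hi):
    # A one-point segment never narrows the cone; any slope works, so use 0.
    return 0.0 if hi == float("inf") else (lo + hi) / 2


def greedy_plr(keys, first_pos, delta):
    """Shrinking-cone segmentation: each key in a segment is predicted within
    +/- delta positions. Returns (start_key, start_pos, slope) tuples."""
    segments = []
    x0, y0 = keys[0], first_pos
    lo, hi = 0.0, float("inf")
    for offset in range(1, len(keys)):
        x, y = keys[offset], first_pos + offset
        new_lo = max(lo, (y - delta - y0) / (x - x0))
        new_hi = min(hi, (y + delta - y0) / (x - x0))
        if new_lo > new_hi:
            # The cone is empty: close this segment and start a new one here.
            segments.append((x0, y0, _slope(lo, hi)))
            x0, y0 = x, y
            lo, hi = 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    segments.append((x0, y0, _slope(lo, hi)))
    return segments


def build_parallel(keys, num_threads, delta=8):
    """Split k keys into pieces of size ceil(k / T_n) and fit each piece
    in its own worker thread."""
    k = len(keys)
    if k == 0:
        return []
    piece = -(-k // num_threads)  # ceil(k / T_n)
    starts = range(0, k, piece)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(lambda s: greedy_plr(keys[s:s + piece], s, delta), starts)
        return [seg for part in parts for seg in part]
```

Because every piece starts its own segment, the parallel build can produce a few more segments than a single pass over the whole file. In CPython the global interpreter lock also prevents a real speedup for this CPU-bound loop, so the sketch shows the structure of the sharding rather than its performance.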

4. Evaluation

To evaluate SwiftKV, we compare it with several state-of-the-art and widely adopted KV stores to demonstrate its effectiveness comprehensively. The chosen schemes include:

RocksDB: A high-performance LSM-tree-based embedded storage engine, serving as our primary baseline to represent traditional and highly optimized LSM-tree variants [11].

Bourbon: A pioneering machine learning-integrated LSM-tree storage engine that applies learned indexes at the data block level. It represents the state of the art in hybrid models that combine LSM-trees with learned indexes and has excellent search performance [20].

WiredTiger: A widely used B+ tree-based KV storage engine in modern database systems, providing a performance perspective from a traditional B+ tree structure, which is a dominant index structure in disk-based databases [38].

HashKV: A LevelDB-based KV storage engine that employs hash-based data grouping and separation, representing an optimization approach from the key-value separation domain [39].

We first build a benchmark experiment based on the YCSB workload to compare SwiftKV with RocksDB, demonstrating the effectiveness of our design in optimizing read performance and engineering applications. Then, we expand the comparison scope to various workloads and different datasets, further including Bourbon, WiredTiger, and HashKV in the comparison. Additionally, we performed an analysis on the selection of learned index models, comparing different model architectures in terms of training time, memory usage, and query latency. Finally, we conduct a multithreaded learned index average construction time experiment to verify the adaptability of our parallel optimization strategy.

4.1. Experimental Setup

Implementation: The prototype of SwiftKV is implemented on top of RocksDB (v9.7.4), and all of its code is written in C++. We improved the partition strategy of RocksDB and replaced the traditional index blocks in SSTables with learned index blocks; all other configurations are essentially the same as in RocksDB. Therefore, our design can be used as a KV storage engine just like RocksDB.

Experimental environment: Our experimental testbed was a cluster comprising three identical nodes. Each node was equipped with an Intel i5-14600K processor (4.9 GHz base frequency, 14 cores/20 threads), 32 GB of DDR5 6400 MHz RAM, and an NVIDIA GeForce RTX 4070 SUPER GPU. The software environment consisted of Ubuntu 24.04.4 LTS, the C++17 standard, Python 3.12.3, CMake 3.28.3, and g++ 13.3.0 for compilation.

Datasets: To ensure that the evaluation reflects real-world metadata access patterns, we selected datasets that capture the diverse key distributions and access characteristics observed in production systems. The test datasets include two string-key datasets and two integer-key datasets from real environments, as well as one synthetic integer-key dataset that we constructed. They are as follows: (1) AR represents Amazon customer review data, showing highly temporal and non-uniform distribution characteristics related to user behavior; (2) OSM represents Google uniformly sampled open street map data, featuring spatially correlated keys analogous to directory structures; (3) FB represents an upsampled version of the user ID dataset from Facebook, modeling high-entropy user-generated metadata; (4) WIKI represents Wikipedia article-editing timestamps, capturing bursty write patterns that simulate metadata updates in collaborative environments; and (5) SEQ represents a synthetic dataset with keys as increasing integers, serving as a baseline for worst-case scenarios with no locality. For the string-key datasets, we hash each key with SHA-256 and keep only the first eight bytes of the digest as an integer key, to maintain consistency with industrial key formats [40]. As a result, each dataset consists of 80 million 8-byte unsigned integer keys, and the size of each dataset is approximately 625.35 MB.
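The string-to-integer conversion can be sketched as follows. The function name `string_key_to_u64` is hypothetical, and two details are assumptions because the text does not state them: the strings are encoded as UTF-8, and the eight retained bytes are read as a big-endian integer.

```python
import hashlib


def string_key_to_u64(key: str) -> int:
    """Hash a string key with SHA-256 and keep the first eight bytes of the
    digest as an unsigned 64-bit integer (big-endian is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")
```

Hashing does not preserve the lexicographic order of the original strings, so the resulting integer keys are ordered by hash value rather than by the source text.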

Parameter settings: For the other indexes involved in the comparison experiments, we directly run the source code with the default configuration. For SwiftKV, we remove the prefix compression of the Entry in the SSTable learned index. Each Entry is fixed at 32 bytes, of which the 16-byte key consists of an 8-byte integer value and an 8-byte field encoding key_type and SequenceNumber. The default size of each SSTable is 64 MB, and the size of each block is 4 KB, which is in line with the theoretically optimal page size of the operating system. A total of 1 million KV pairs (16 B keys, 1 KB values) are inserted in all experiments, and the SwiftKV partition buffer size is set to 256 MB.
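As an illustration of the 16-byte key layout, the sketch below packs an 8-byte integer followed by an 8-byte trailer. The split of the trailer into a 56-bit sequence number and an 8-bit key_type follows the usual description of RocksDB-style internal keys; the byte order, the tag value `TYPE_VALUE`, and the function names are assumptions made for the example, not details taken from SwiftKV.

```python
import struct

TYPE_VALUE = 1  # hypothetical key_type tag for a regular put


def pack_key(user_key: int, sequence: int, key_type: int = TYPE_VALUE) -> bytes:
    """Return a 16-byte key: an 8-byte integer followed by an 8-byte trailer
    holding the sequence number (upper 56 bits) and key_type (low 8 bits)."""
    if not 0 <= sequence < (1 << 56):
        raise ValueError("sequence number must fit in 56 bits")
    trailer = (sequence << 8) | (key_type & 0xFF)
    return struct.pack(">QQ", user_key, trailer)


def unpack_key(raw: bytes):
    """Inverse of pack_key: returns (user_key, sequence, key_type)."""
    user_key, trailer = struct.unpack(">QQ", raw)
    return user_key, trailer >> 8, trailer & 0xFF
```

With fixed 32-byte entries, a 4 KB block holds 4096 / 32 = 128 entries, which makes the position of an entry inside a block a simple multiplication.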

4.2. Read Performance and Space Overhead Comparison Against the Baseline

Random read and sequential read: By setting different operation types and key selection distribution in YCSB, we compared the performance of random read and sequential read between SwiftKV and RocksDB, as shown in Figure 5. The test results show that, compared with RocksDB, SwiftKV can improve the random read performance by 1.5×~2.1×. This significant improvement validates the effectiveness of our two core design choices: the partition strategy and the learned index. The partition strategy narrows the search to a single subtree, reducing the number of SSTables that must be examined. More importantly, the two-stage learned index model reduces the complexity of the SSTable index search from O(log(n)) (binary search) to O(1) (model prediction), which is the primary contributor to the latency reduction.
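The difference between the two lookup styles can be sketched as follows. For brevity the example uses one least-squares line instead of the two-level Greedy-PLR + LR structure, and the names `fit_line` and `learned_lookup` are hypothetical. The model evaluation itself takes constant time; the final step is a search restricted to a small window whose width is the recorded maximum model error. The sketch assumes at least two distinct, sorted keys.

```python
import bisect


def fit_line(keys):
    """Least-squares line mapping key -> position, plus an integer error bound."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2
    sxx = sum((x - mean_x) ** 2 for x in keys)
    sxy = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    max_err = max(abs(slope * x + intercept - i) for i, x in enumerate(keys))
    return slope, intercept, int(max_err) + 1


def learned_lookup(keys, model, target):
    """Predict the position, then search only the window [guess - err, guess + err]."""
    slope, intercept, err = model
    guess = int(round(slope * target + intercept))
    lo = min(max(0, guess - err), len(keys))
    hi = max(lo, min(len(keys), guess + err + 1))
    i = bisect.bisect_left(keys, target, lo, hi)
    return i if i < len(keys) and keys[i] == target else -1
```

A plain `bisect.bisect_left(keys, target)` over the whole array is the binary-search baseline; the learned variant narrows the same call to a window that does not grow with the number of keys as long as the model error stays bounded.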

For the performance of sequential reads, SwiftKV delivers performance improvements scaling from 10% to 80% as the value size increases from 16 B to 4 KB, which is far less significant than the improvement in random read performance. This is because the prefetch mechanism used by RocksDB has certain optimizations for sequential reads. For example, when a sequential scan is initiated, the entire block containing the first key is loaded into memory, allowing subsequent contiguous keys to be read directly from memory without additional disk I/O. Furthermore, sequential reads typically require only one index lookup to traverse contiguous data blocks, rendering learned index acceleration less impactful in this scenario.

Read performance under different read-write ratios: We further study the read performance improvement of SwiftKV under realistic conditions. We simulated six experimental groups across four workload types: read-only, read-heavy (read:write = 9:1 and 7:3), read/write balanced (read:write = 1:1), and write-heavy (read:write = 3:7 and 1:9). The experimental data follows a Zipfian distribution (s = 0.99) to mimic real-world skewed data access patterns. The results are shown in Figure 6. SwiftKV performs better under every read-write ratio. Compared with RocksDB, it improves read throughput in all cases (by 1.19×~1.60×), and the improvement is greater when the workload is more read-heavy.
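A workload of this shape can be sketched with a small generator. This is not YCSB's own generator: `ZipfSampler` and `make_workload` are hypothetical names, the key space is a plain list of ranks, and the seeds are fixed only to make the example reproducible.

```python
import bisect
import itertools
import random


class ZipfSampler:
    """Draw ranks 0..n-1 with P(rank r) proportional to 1 / (r + 1) ** s."""

    def __init__(self, n: int, s: float = 0.99, seed: int = 0):
        weights = [1.0 / (r + 1) ** s for r in range(n)]
        self._cdf = list(itertools.accumulate(weights))
        self._rng = random.Random(seed)

    def sample(self) -> int:
        u = self._rng.random() * self._cdf[-1]
        # First rank whose cumulative weight exceeds u (clamped for safety).
        return min(bisect.bisect_right(self._cdf, u), len(self._cdf) - 1)


def make_workload(num_ops, read_fraction, sampler, seed=1):
    """Return a list of (operation, key_rank) pairs with the given read share."""
    rng = random.Random(seed)
    return [("read" if rng.random() < read_fraction else "write", sampler.sample())
            for _ in range(num_ops)]
```

For example, `make_workload(20000, 0.9, ZipfSampler(1000))` approximates the read:write = 9:1 group, with a small set of low-rank keys receiving most of the accesses.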

Learned index space usage: Following the previous experimental setup, we analyzed index sizes by replacing RocksDB’s original sparse index blocks with learned index blocks in 64 MB SSTable files across test datasets. After completing all operations, reclaimed space statistics were collected, with the results summarized in Table 1 (index sizes are averages).

The experimental results show that except for the FB dataset, the learned index of SwiftKV saves 30.06% of disk space on average compared with the sparse index. This space saving is a direct result of the learned index’s compact representation. Instead of storing a key–pointer pair for every data block, a few linear regression parameters are stored per segment. The FB dataset shows minimal savings because its highly random distribution likely requires more and shorter segments to maintain prediction accuracy, reducing the space advantage.

P99 Latency: We further evaluated the P99 latency of SwiftKV against the baseline scheme (Figure 7). Under a read-only workload, SwiftKV’s tail latency is approximately 2 μs, representing an average reduction of about 72.22% compared to RocksDB. Under read-heavy workloads, SwiftKV maintains substantially lower P99 latency, achieving reductions of approximately 48.91~50.58% relative to RocksDB, and it delivers superior performance under balanced workloads as well. Both the partition strategy and the learned index contribute to this result by reducing the number of I/O operations and CPU cache misses, which improves application responsiveness. However, in write-heavy scenarios, compaction and flush operations dominate, and the read optimization benefits of the learned index are “diluted” by the high-frequency writes, resulting in less-pronounced improvements. Nevertheless, our approach still achieves a 6.16% improvement.
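For reference, a tail-latency figure and a relative reduction of this kind can be computed as sketched below. The nearest-rank definition of the percentile is an assumption, since the text does not say which definition the measurement tool uses, and both function names are hypothetical.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]


def reduction_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline
```

Calling `percentile(latencies, 99)` on the per-request latencies of each system and passing the two results to `reduction_percent` yields a reduction figure in the same form as those quoted above.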

4.3. YCSB Tests

In order to further demonstrate the advantages of our solution, in addition to the RocksDB baseline, we also selected three other representative KV storage solutions (Bourbon, WiredTiger, and HashKV) for comparison. We evaluated each KV store using YCSB (Yahoo! Cloud Serving Benchmark) [41]. YCSB is an open-source performance testing tool developed by Yahoo, mainly used to evaluate the read and write performance of NoSQL databases and cloud storage systems. Its core goal is to provide a standardized benchmark framework to facilitate comparison of key indicators such as throughput and latency of different systems. The A–F workloads cover a variety of business workloads from pure read to mixed read and write. The default settings are shown in Table 2.

The experiment adopts the default configuration with 1 million operations per workload. The results are shown in Figure 8. SwiftKV outperformed RocksDB by 0.8×~1.6× across all workloads except Workload E. In particular, YCSB-B, YCSB-C, and YCSB-D are read-heavy workloads. Our solution is significantly better than other solutions on these three workloads. Compared with RocksDB, HashKV, Bourbon, and WiredTiger, SwiftKV improves throughput by 1.1×~1.6×, 0.4×~0.8×, 0.2×~0.3×, and 1.2×~3.2×, respectively. SwiftKV uses an efficient data-partitioning strategy and a precise learned index to reduce the number of data accesses during the search process, which improves throughput under read-heavy workloads.

For balanced workloads (YCSB-A and YCSB-F), SwiftKV also outperforms the other schemes. It shows average throughput gains of 125.61%, 10.14%, 30.52%, and 96.17% over RocksDB, HashKV, Bourbon, and WiredTiger, respectively. Because of the larger share of write operations, SwiftKV’s performance improvement on balanced workloads is smaller than on read-intensive workloads. However, its partitioning design still enhances performance by reducing hierarchy size and data volume.

For scan workloads (YCSB-E), SwiftKV performs poorly: the learned index significantly optimizes point queries, but range scans must fall back to the traditional traversal scheme, so scan performance is not improved. Bourbon, which also uses a learned LSM-Tree design, performs poorly as well. HashKV and WiredTiger perform better because the HashKV scan operation only needs to read the Value Log sequentially, avoiding the multi-level jumps of the traditional LSM-Tree, and the B+ tree structure of WiredTiger allows range scans to be completed efficiently through the leaf node linked list.

4.4. Model Testing

To demonstrate the rationale for the two-layer learned index model (Greedy-PLR + LR), we evaluated the performance of different learned index models on a 64 MB SSTable. First, we generate one million KV pairs (16 B key, 1 KB value) following a Zipfian distribution using the YCSB tool to simulate common hotspot access patterns in real-world systems. Each model was trained on 1000 randomly selected SSTables. Model training time is measured end-to-end from SSTable loading to model readiness. Memory usage includes model parameters and auxiliary structures. Query latency represents the 99th percentile of 10 M random lookups under YCSB workload C. All data was preloaded into memory to isolate indexing performance. The test results are shown in Table 3.

The experimental results show that our model demonstrates an optimal trade-off between computational efficiency and accuracy. While LR achieves 46.71% faster training, its 44.03% higher query latency is prohibitive for metadata workloads. Conversely, RMI’s marginally better accuracy comes at 1.59× higher training cost and 2× memory overhead. Additionally, the 15.13% latency improvement over single-level Greedy-PLR validates the second-stage LR’s efficacy in error correction, particularly for non-uniform key distributions. Therefore, SwiftKV’s model structure is best adapted to LSM trees.

4.5. Learned Index Model Build Time

In order to test the effect of parallel optimization when building the learned index, we conduct ablation experiments on the average time required to build an SSTable learned index with different numbers of threads. The SSTable file size is set to 2 MB and 64 MB. When the SSTable file size is 2 MB and each Entry is fixed at 32 bytes, the number of Entry items is 2 × 1024 × 1024 / 32 = 65,536, and each piece contains 65,536 / 8 = 8192 Entry items under 8 threads. Similarly, a 64 MB SSTable file holds 64 × 1024 × 1024 / 32 = 2,097,152 Entry items, so each piece contains 2,097,152 / 8 = 262,144 Entry items under 8 threads.

The small file size of 2 MB can compress the data piece granularity to the extreme. This extreme piecing condition can more sensitively expose the competition overhead and load balancing problems in multithreaded construction, thereby verifying the robustness of the algorithm under high-concurrency fine-grained tasks. If a larger file such as 64 MB is used, the sharp increase in the number of thread tasks may mask the advantages of parallel scheduling. From a practical application perspective, the default size of LevelDB’s SSTable file is 2 MB, which is to reduce the number of compactions and thus reduce write amplification. The default size of RocksDB’s SSTable file is 64 MB, which is more in line with the modern distributed production environment; so, we selected these two classic size configurations. The experimental results are shown in Figure 9a.

We test the average parallel construction time of the learned index. For the real datasets, the construction time of the SSTable learned index under the same thread count is at roughly the same level. For the sequential dataset, the construction time is significantly shorter than for the real datasets because its simple distribution pattern reduces the computational complexity of constructing the models. For a 2 MB SSTable, compared with a single thread, the construction efficiency gains are 41.25~41.49% with 2 threads, 64.10~64.81% with 4 threads, 72.26~72.65% with 8 threads, and 82.71~82.89% with 16 threads. For the sequential dataset, the average construction time with 16 threads is slightly longer than that with 4 threads because the data distribution is too simple: once the number of threads grows beyond a certain point, the cost of lock contention and thread scheduling exceeds the model construction cost saved by adding threads, resulting in more time overhead.

Building learned indexes for 64 MB SSTables takes longer than for 2 MB SSTables because of the larger data volume and the larger number of segments. Figure 9b shows that adding threads improves construction efficiency on real datasets. Compared with a single thread, the construction efficiency gains are 29.37~34.13% with 2 threads, 52.56~52.87% with 4 threads, 66.52~68.69% with 8 threads, and 74.78~75.29% with 16 threads. The benefit of adding threads is smaller for the 64 MB SSTable than for the 2 MB SSTable because, as the amount of data increases, frequent memory and cache accesses weaken the advantages of multithreading. The experimental results indicate that multithreaded construction is suitable for the irregular key distributions of real datasets, while simpler data distributions gain little from additional threads and may even incur extra overhead.

According to the experimental results, multithreaded construction efficiency depends critically on data distribution complexity. For sequential datasets, optimal performance is achieved with 2–4 threads, as excessive threads introduce scheduling overhead without accelerating the already-minimal model computation. Conversely, real-world datasets benefit from higher thread counts, where parallel processing effectively mitigates the computational cost of irregular key distributions. Based on these observations, we recommend: (1) for simple distribution, limiting the number of threads to no more than four, (2) scaling threads linearly with SSTable size (allocating one more thread for every additional 8 MB) for complex distributions, and (3) designing dynamic thread adaptation schemes in hybrid workloads. This strategy balances parallelism gains against contention overhead, achieving construction speedups for real-world data while avoiding performance degradation in simpler scenarios.
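Recommendations (1) and (2) can be written down as a small heuristic. The sketch is only one possible reading: the function name, the starting point `base_threads`, and the hardware cap `hw_threads` (set to the 20 hardware threads of the testbed) are assumptions, and the dynamic adaptation of recommendation (3) is not modelled.

```python
def suggest_threads(sstable_mb: float, simple_distribution: bool,
                    base_threads: int = 2, hw_threads: int = 20) -> int:
    """Hypothetical thread-count heuristic for building one learned index."""
    # Rule (2): one more thread for every additional 8 MB of SSTable data.
    scaled = base_threads + int(sstable_mb // 8)
    if simple_distribution:
        # Rule (1): simple distributions gain nothing beyond four threads.
        scaled = min(scaled, 4)
    # Never exceed the hardware threads available, never go below one.
    return max(1, min(scaled, hw_threads))
```

Under these assumptions, a 64 MB SSTable with an irregular key distribution is assigned 10 threads, while the same file with sequential keys is capped at 4.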

5. Related Work

5.1. The Read Optimization for LSM-Based KV Stores

Due to the multi-level storage structure of the LSM architecture, a single query may access multiple data files, increasing I/O overhead. Existing research has explored multiple perspectives, and they are mainly classified into three directions: filter, index structure, and cache optimization, as shown in Table 4.

Filter: Some works utilize filters to skip unnecessary reads because filters do not return false negatives [11,15], but filters may have high false-positive rates and excessive memory usage.

Index structure: LevelDB [10] designs mechanisms such as asynchronous reads to further optimize the LSM-Tree structure, Bourbon [20] analyzes the feasibility of applying learned indexes to LSM-Trees, and Google’s approach [21] deeply integrates learned indexes with a distributed, disk-based database system (Bigtable). Different from them, our solution focuses on read optimization.

Cache optimization: AC-Key [14] proposes an adaptive caching algorithm to adjust the cache size according to the workload. Although the cache can quickly find the data that has been frequently accessed recently, it will consume non-negligible storage space as the size of the buffered data increases.

These limitations highlight the urgent need for more efficient LSM-Tree read strategies. SwiftKV addresses these shortcomings by introducing a learned index structure that operates at the entry granularity, replacing the traditional binary search with a more efficient model-based prediction. This approach significantly reduces the indexing overhead and I/O operations without incurring the notable storage overhead of caching or the false positive issues of advanced filters.

5.2. Learned Indexes

The concept of learned indexes, introduced by Kraska et al. [16], has inspired numerous subsequent works. ALEX [17] and PGM-Index [18] focus on making learned indexes updatable for dynamic workloads. Others, such as LIPP [42] and XIndex [19], explore concurrency control. Some recent research is tightly coupled with specific hardware. For example, PLIN [43] is designed for NVM-only architectures; its learned index can reside directly on NVM without the risk of data loss during power outages. LeaFTL [44] uses the OOB (out-of-band) area in flash memory to store the reverse mapping entries from the logical address of the page and its adjacent pages to the physical address, improving the space efficiency of the learned index. These learned indexes differ from our focus. SwiftKV primarily designs an effective learned indexing scheme for LSM-Tree structures, leveraging machine learning models to read fewer index blocks and reduce disk access costs.

6. Conclusions

This paper proposes SwiftKV, a novel metadata indexing scheme that integrates LSM-Tree with learned indexes to address the read performance bottlenecks in modern DFSs. By introducing a dynamic partition strategy, SwiftKV effectively narrows the search scope and reduces I/O overhead. Furthermore, the two-level learned index block (Greedy-PLR + LR) replaces traditional binary search with efficient model-based prediction, significantly accelerating file read operations. Finally, to mitigate the training overhead, we also propose a load-aware construction mechanism and parallel optimization, which adaptively decide whether to build learned indexes and accelerate the construction process through multithreading. Experiments show SwiftKV achieves significant performance improvements in read operations, reduces index space by approximately 30%, and significantly lowers P99 latency (up to 50.58% in read-heavy scenarios). SwiftKV also maintains better read performance than other solutions under most YCSB workloads.

Several promising future research directions emerge from this work. First, we will explore more sophisticated yet lightweight learned index models to further improve prediction accuracy for highly complex key distributions, potentially reducing the error bounds and the need for subsequent fine-grained searches. Second, our current design assumes fixed-size keys; however, most real-world keys are variable-length strings, so adapting the learned index’s structure to handle variable-length keys is an important practical extension. Finally, we intend to deploy and evaluate SwiftKV in concrete application scenarios, such as using it as the metadata engine for a large-scale DFS like CephFS or for an object storage system, to validate its benefits under production-like conditions and diverse, evolving workloads.

Author Contributions

Conceptualization, J.F. and L.D.; methodology, Z.W., J.F., L.D. and Z.B.; software, Z.B.; validation, J.F., L.D. and Z.B.; resources, L.D.; data curation, Z.B.; writing—original draft preparation, J.F. and C.D.; writing—review and editing, Z.W., J.F., L.D., Z.B. and C.D.; visualization, Z.B.; supervision, Z.W. and C.D.; project administration, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Plan of China (2023YFB4502704).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Leung, A.W.; Shao, M.; Bisson, T.; Pasupathy, S.; Miller, E.L. Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST’09), San Francisco, CA, USA, 24–27 February 2009.
2. Dai, H.; Wang, Y.; Kent, K.B.; Zeng, L.; Xu, C. The State of the Art of Metadata Managements in Large-Scale Distributed File Systems—Scalability, Performance and Availability. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 3850–3869.
3. Yang, B.; Xue, W.; Zhang, T.; Liu, S.; Ma, X.; Wang, X.; Liu, W. End-to-end I/O Monitoring on Leading Supercomputers. ACM Trans. Storage 2023, 19, 1–35.
4. Huang, X.; Gao, Y.; Zhou, X.; Gao, X.; Chen, G. An Adaptive Metadata Management Scheme Based on Deep Reinforcement Learning for Large-Scale Distributed File Systems. IEEE/ACM Trans. Netw. 2023, 31, 2840–2853.
5. Jiao, Y.; Bertron, S.; Patel, S.; Zeller, L.; Bennett, R.; Mukherjee, N.; Bender, M.A.; Condict, M.; Conway, A.; Farach-Colton, M.; et al. BetrFS: A Compleat File System for Commodity SSDs. In Proceedings of the 7th European Conference on Computer Systems (EuroSys’22), Rennes, France, 5–8 April 2022.
6. Ren, K.; Zheng, Q.; Patil, S.; Gibson, G. IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion. In Proceedings of the 14th International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14), New Orleans, LA, USA, 16–21 November 2014.
7. Ren, K.; Gibson, G. TABLEFS: Embedding a NoSQL Database Inside the Local File System. In Proceedings of the 2012 Asia-Pacific Magnetic Recording Conference (APMRC’12), Singapore, 31 October–2 November 2012.
8. Mei, F.; Cao, Q.; Jiang, H.; Tintri, L.T. LSM-tree Managed Storage for Large-Scale Key-Value Store. In Proceedings of the 2017 Symposium on Cloud Computing (SOCC’17), Santa Clara, CA, USA, 25–27 September 2017.
9. Chang, F.; Dean, J.; Ghemawat, S.; Hsieh, W.C.; Wallach, D.A.; Burrows, M.; Chandra, T.; Fikes, A.; Gruber, R.E. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 2008, 26, 1–26.
10. Wang, L.; Ding, G.; Zhao, Y.; Wu, D.; He, C. Optimization of LevelDB by Separating Key and Value. In Proceedings of the 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT’17), Taipei, Taiwan, 18–20 December 2017.
11. Dong, S.; Kryczka, A.; Jin, Y.; Stumm, M. Rocksdb: Evolution of Development Priorities in a KV Store Serving Large-Scale Applications. ACM Trans. Storage 2021, 17, 1–32.
12. Vora, M.N. Hadoop-HBase for Large-Scale Data. In Proceedings of the 2011 International Conference on Computer Science and Network Technology (ICCSNT’11), Harbin, China, 24–26 December 2011.
13. Lakshman, A.; Malik, P. Cassandra: A Decentralized Structured Storage System. ACM SIGOPS Oper. Syst. Rev. 2010, 44, 35–40.
14. Wu, F.; Yang, M.H.; Zhang, B.; Du, D.H. AC-Key: Adaptive Caching for LSM-based Key-Value Stores. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC’20), Online, 15–17 July 2020.
15. Zhong, W.; Chen, C.; Wu, X.; Jiang, S. REMIX: Efficient Range Query for LSM-trees. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST’21), Online, 23–25 February 2021.
16. Kraska, T.; Beutel, A.; Chi, E.H.; Dean, J.; Polyzotis, N. The Case for Learned Index Structures. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD’18), Portland, OR, USA, 10–15 June 2018.
17. Ding, J.; Minhas, U.F.; Yu, J.; Wang, C.; Do, J.; Li, Y.; Zhang, H.; Chandramouli, B.; Gehrke, J.; Kossmann, D.; et al. ALEX: An Updatable Adaptive Learned Index. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD’20), Portland, OR, USA, 14–19 June 2020.
18. Ferragina, P.; Vinciguerra, G. The PGM-index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds. Proc. VLDB Endow. 2020, 13, 1162–1175.
19. Tang, C.; Wang, Y.; Dong, Z.; Hu, G.; Wang, Z.; Wang, M.; Chen, H. XIndex: A Scalable Learned Index for Multicore Data Storage. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’20), San Diego, CA, USA, 22–26 February 2020.
20. Dai, Y.; Xu, Y.; Ganesan, A.; Alagappan, R.; Kroth, B.; Arpaci-Dusseau, A.; Arpaci-Dusseau, R. From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20), Virtual Event, 4–6 November 2020.
21. Abu-Libdeh, H.; Altınbüken, D.; Beutel, A.; Chi, E.H.; Doshi, L.; Kraska, T.; Li, X.; Ly, A.; Olston, C. Learned Indexes for a Google-Scale Disk-Based Database. arXiv 2020, arXiv:2012.12501.
22. Macko, P.; Hennessey, J. Survey of Distributed File System Design Choices. ACM Trans. Storage 2022, 18, 1–34.
23. Borges, G.; Crosby, S.; Boland, L. CephFS: A New Generation Storage Platform for Australian High Energy Physics. J. Phys. Conf. Ser. 2017, 898, 062015.
24. Karun, A.K.; Chitharanjan, K. A Review on Hadoop—HDFS InfraStructure Extensions. In Proceedings of the 2013 IEEE Conference on Information & Communication Technologies (ICT’13), Thuckalay, India, 11–12 April 2013.
25. Yang, X.; Liu, Q.; Yin, B.; Zhang, Q.; Zhou, D.; Wei, X. kd Tree Construction Designed for Motion Blur. In Proceedings of the 28th Eurographics Symposium on Rendering: Experimental Ideas & Implementations (EGSR’17), Helsinki, Finland, 19–21 June 2017.
26. Hadjieleftheriou, M.; Manolopoulos, Y.; Theodoridis, Y.; Tsotras, V.J. R-trees: A Dynamic Index Structure for Spatial Searching. In Encyclopedia of GIS; Springer: Cham, Switzerland, 2017; pp. 1805–1817.
27. Hua, Y.; Jiang, H.; Zhu, Y.; Feng, D.; Tian, L. SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems. In Proceedings of the 2009 Conference on High Performance Computing Networking, Storage and Analysis (SC’09), Portland, OR, USA, 14–20 November 2009.
28. Manno, D.; Lee, J.; Challa, P.; Zheng, Q.; Bonnie, D.; Grider, G.; Settlemyer, B. Gufi: Fast, Secure File System Metadata Search for Both Privileged and Unprivileged Users. In Proceedings of the 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’22), Dallas, TX, USA, 13–18 November 2022.
29. Leibovici, T. Taking Back Control of HPC File Systems with Robinhood Policy Engine. arXiv 2015, arXiv:1505.01448.
30. Li, S.; Lu, Y.; Shu, J.; Hu, Y.; Li, T. Locofs: A Loosely-Coupled Metadata Service for Distributed File Systems. In Proceedings of the 2017 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17), Denver, CO, USA, 12–17 November 2017.
31. Roh, H.; Park, S.; Kim, S.; Shin, M.; Lee, S.W. B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives. arXiv 2011, arXiv:1201.0227.
32. Zuo, P.; Hua, Y.; Wu, J. Write-Optimized and High-Performance Hashing index scheme for persistent memory. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18), Carlsbad, CA, USA, 8–10 October 2018.
33. Chen, B.; Jin, Y.; Brown, P. An Enhanced Bloom Index for Quantifying Floral Phenology Using Multi-Scale Remote Sensing Observations. J. Photogramm. Remote Sens. 2019, 156, 108–120.
34. Zhang, Y.; Zhou, J.; Min, X.; Ge, S.; Wan, J.; Yao, T.; Wang, D. PetaKV: Building Efficient Key-Value Store for File System Metadata on Persistent Memory. IEEE Trans. Parallel Distrib. Syst. 2022, 34, 843–855.
35. Rehrmann, R.; Binnig, C.; Böhm, A.; Kim, K.; Lehner, W. Sharing Opportunities for OLTP Workloads in Different Isolation Levels. Proc. VLDB Endow. 2020, 13, 1696–1708.
36. Mitra, S.; Winslett, M.; Hsu, W. Query-Based Partitioning of Documents and Indexes for Information Lifecycle Management. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Vancouver, BC, Canada, 9–12 June 2008.
37. Keogh, E.; Chu, S.; Hart, D.; Pazzani, M. An Online Algorithm for Segmenting Time Series. In Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM’01), San Jose, CA, USA, 29 November–2 December 2001.
38. Fedorova, A.; Mustard, C.; Beschastnikh, I.; Rubin, J.; Wong, A.; Miucin, S.; Ye, L. Performance Comprehension at WiredTiger. In Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’18), Lake Buena Vista, FL, USA, 4–9 November 2018.
39. Chan, H.; Li, Y.; Lee, P.; Xu, Y. HashKV: Enabling Efficient Updates in KV Storage via Hashing. In Proceedings of the 2018 USENIX Annual Technical Conference (ATC’18), Boston, MA, USA, 11–13 July 2018.
40. Appel, A. Verification of a Cryptographic Primitive: SHA-256. ACM Trans. Program. Lang. Syst. 2015, 37, 1–31.
41. Cooper, B.F.; Silberstein, A.; Tam, E.; Ramakrishnan, R.; Sears, R. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC’10), Indianapolis, IN, USA, 10–11 June 2010.
42. Wu, J.; Zhang, Y.; Chen, S.; Wang, J.; Chen, Y.; Xing, C. Updatable Learned Index with Precise Positions. arXiv 2021, arXiv:2104.05520.
43. Zhang, Z.; Chu, Z.; Jin, P.; Luo, Y.; Xie, X.; Wan, S.; Luo, Y.; Wu, X.; Zou, P.; Zheng, C.; et al. PLIN: A persistent learned index for non-volatile memory with high performance and instant recovery. Proc. VLDB Endow. 2022, 16, 243–255.
44. Sun, J.; Li, S.; Sun, Y.; Sun, C.; Vucinic, D.; Huang, J. LeaFTL: A Learning-Based Flash Translation Layer for Solid-State Drives. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’23), Vancouver, BC, Canada, 25–29 March 2023.

Figure 1. LSM-Tree structure.

Figure 2. The overview architecture of SwiftKV.

Figure 3. Partition strategy.

Figure 4. Learned index structure.

Figure 5. Random read (a) and sequential read (b) with different value sizes.

Figure 6. Read performance of SwiftKV and RocksDB at different read/write ratios.

Figure 7. P99 latency of SwiftKV and RocksDB at different read/write ratios.

Figure 8. Throughput of SwiftKV and other KV storage engines under YCSB workloads.

Figure 9. Parallel average build time of 2 MB SSTable (a) and 64 MB SSTable (b).

Table 1. Learned index size comparison.

Real Dataset	Learned Index Blocks	Sparse Index Blocks	Space Savings
AR	0.45 MB	0.62 MB	27.42%
OSM	0.42 MB	0.61 MB	31.15%
FB	0.63 MB	0.64 MB	1.56%
WIKI	0.32 MB	0.45 MB	28.89%
SEQ	0.41 MB	0.61 MB	32.79%
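
The Space Savings column of Table 1 appears to be the relative reduction of the learned index block size with respect to the sparse index block size. The table itself does not state the formula, so the minimal Python sketch below assumes (sparse − learned) / sparse and recomputes the column from the sizes in Table 1; `SIZES_MB` and `space_savings` are illustrative names.

```python
# Index block sizes in MB, copied from Table 1: (learned, sparse).
SIZES_MB = {
    "AR": (0.45, 0.62),
    "OSM": (0.42, 0.61),
    "FB": (0.63, 0.64),
    "WIKI": (0.32, 0.45),
    "SEQ": (0.41, 0.61),
}


def space_savings(learned_mb: float, sparse_mb: float) -> float:
    """Relative reduction in percent (assumed formula, not stated in the table)."""
    return (sparse_mb - learned_mb) / sparse_mb * 100.0


# Round to two decimals, matching the precision used in Table 1.
savings = {
    name: round(space_savings(learned, sparse), 2)
    for name, (learned, sparse) in SIZES_MB.items()
}

for name, pct in savings.items():
    print(f"{name}: {pct:.2f}%")
```

Under this assumption the recomputed values agree with the Space Savings column for all five datasets.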

Table 2. YCSB workloads.

Workload	Read-Write Ratio	Typical Application Scenarios	Data Distribution
A	50% Read, 50% Update	Session Store	Zipfian (Skewed Distribution)
B	95% Read, 5% Update	Photo Tagging	Zipfian
C	100% Read	User Profile Cache	Zipfian
D	95% Read, 5% Insert	User Status Updates	Latest (Latest Records First)
E	95% Scan, 5% Insert	Threaded Conversations	Zipfian
F	50% Read, 50% Read–Modify–Write	User Database	Zipfian
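
The operation mixes in Table 2 can be written down as a small configuration and sampled. The following is a minimal Python sketch and not YCSB's own implementation: `WORKLOADS` and `sample_ops` are hypothetical names, only the operation ratios are taken from the table, and the key distributions (Zipfian, Latest) are not modeled.

```python
import random

# Operation ratios copied from Table 2; the dictionary layout is illustrative.
WORKLOADS = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
    "D": {"read": 0.95, "insert": 0.05},
    "E": {"scan": 0.95, "insert": 0.05},
    "F": {"read": 0.50, "read_modify_write": 0.50},
}


def sample_ops(workload: str, n: int, seed: int = 0) -> list[str]:
    """Draw n operation types according to the workload's ratios."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return rng.choices(list(mix), weights=list(mix.values()), k=n)


ops = sample_ops("B", 10_000)
print(ops.count("read") / len(ops))  # close to 0.95 for workload B
```

A benchmark driver would pair each sampled operation type with a key drawn from the distribution named in the last column of the table.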

Table 3. Performance comparison of index models.

Model	Training Time (ms)	Memory Usage (KB)	Query Latency (μs)
Two-level (SwiftKV’s)	15.2	420	2.18
Greedy-PLR	12.7	380	2.51
RMI [16]	39.4	760	2.55
LR	8.1	310	3.14

Table 4. Comparison of representative solutions.

Optimization Direction	Scheme	Core Idea	Limitations
Filter	RocksDB [11]	Bloom filters to skip unnecessary reads	High false-positive rate and memory overhead
Filter	REMIX [15]	Advanced filters to reduce probe count	High false-positive rate and memory overhead
Cache Optimization	AC-Key [14]	Adaptive caching based on workload	Significant storage overhead
Index Structure	LevelDB [10]	Asynchronous compaction for write throughput	Not optimized for read performance
Index Structure	Bourbon [20]	Learned indexes at the data block level
Index Structure	Google’s approach [21]	Distributed LSM-Tree architecture

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

