High-Throughput and Memory-Efficient Pipeline Key–Value Store Architecture on FPGA
Abstract
1. Introduction
- False misses (insert/query): When a key is removed from its original bucket but has not yet been placed into its alternate location, it becomes temporarily unavailable and cannot be reached from either of the buckets. If the insert request is not atomic (i.e., cannot be completed within a single clock cycle), query requests may potentially return a false miss result by concluding the query during a period where the key is temporarily unavailable.
- False inserts (insert/insert): When two consecutive insert requests with conflicting hash addresses arrive at the same empty bucket, the first key–value pair (KVP) will be successfully inserted. However, if the second request cannot promptly retrieve the occupancy status of the bucket due to the read and write-back delays of the BRAM (which require at least three clock cycles), it may overwrite the successful insertion of the first request, resulting in a false insert.
- Skewed Workloads: Cuckoo hashing [20] guarantees constant query and insert times, even in the worst-case scenario. In software design, serializing and locking insertion requests is a common strategy to prevent deadlocking [21]. However, in FPGA implementations, using serialized blocking can lead to considerable performance degradation, particularly under high load rates where hundreds of clock cycles may be required to complete a single insert request. Meanwhile, concurrent insertions carry the risk of deadlocks, thus posing a challenging trade-off in the implementation of cuckoo hashing on FPGA.
- To address the issues of false misses and false inserts, we propose a multi-pipeline architecture and integrate the conflict detection module after each level of the pipeline to ensure strict eventual consistency.
- Inspired by [14], we design a Content-Addressable Memory (CAM) block at the end of the pipeline to improve the robustness of our architecture. Moreover, this design enables effective parameterization and scalability while maintaining excellent timing closure.
- To ensure the high performance of our KVS architecture under all workloads, we chose the multi-level multi-hash approach and provide mathematically derived calculations of the hash collision probability for this method. Our decision is supported by rigorous software and hardware algorithmic simulations. Finally, our new architecture is successfully implemented on FPGA. Even at a load factor of 95%, the receivable throughput for all types of requests can still reach 400 million requests per second (MRPS), which is 2x faster than [15].
2. Related Work
2.1. KVS in FPGA
2.2. Cuckoo Hash
2.3. Parallel Hash
3. Request Abstraction
3.1. Probe Phase
3.2. Response Phase
3.3. Request and Response Bus
4. Proposed Architecture
4.1. KVS Unit
| Algorithm 1: Modify request processing flow in KVS Units. |
![]() |
4.2. KVS Column
4.3. KVS Matrix
5. Hash Table Guarantees
5.1. Open Addressing
5.2. Collision Probability
5.3. Capacity Requirement of CAM
6. Experimental Results
6.1. Feasibility Analysis
6.2. Implementation on FPGA
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Nadales, E.B. Impact of Traffic Growth on Networks and Investment Needs. 2023. Available online: https://www.telefonica.com/en/communication-room/blog/impact-of-traffic-growth-on-networks-and-investment-needs/ (accessed on 2 October 2025).
- Ezilarasan, M.R.; Britto; Pari, J.; Leung, M. High Performance FPGA Implementation of Single MAC Adaptive Filter for Independent Component Analysis. J. Circuits, Syst. Comput. 2023, 32, 2350294. [Google Scholar] [CrossRef]
- Chen, H.; Chen, Y.; Summerville, D.H. A Survey on the Application of FPGAs for Network Infrastructure Security. IEEE Commun. Surv. Tutor. 2011, 13, 541–561. [Google Scholar] [CrossRef]
- Rozhko, D.; Elliott, G.; Ly-Ma, D.; Chow, P.; Jacobsen, H.A. Packet Matching on FPGAs Using HMC Memory: Towards One Million Rules. In FPGA ’17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 201–206. [Google Scholar] [CrossRef]
- Qu, Y.R.; Zhou, S.; Prasanna, V.K. High-Performance Architecture for Dynamically Updatable Packet Classification on FPGA. In ANCS ’13: Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, CA, USA, 21–22 October 2013; IEEE Press: Piscataway, NJ, USA, 2013; pp. 125–136. [Google Scholar]
- Jiang, W.; Prasanna, V.K. Field-Split Parallel Architecture for High Performance Multi-Match Packet Classification Using FPGAs. In SPAA ’09: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, Calgary, AB, Canada, 11–13 August 2009; IEEE: New York, NY, USA, 2009; pp. 188–196. [Google Scholar] [CrossRef]
- Sheelavant, K.; KV, C.; Supriya, B.Y.; Assudani, P.J.; Mahato, C.B.; Suman, S.K. Ensemble Learning-Based Intrusion Detection and Classification for Securing IoT Networks: An Optimized Strategy for Threat Detection and Prevention. J. Intell. Syst. Internet Things 2025, 17. [Google Scholar]
- Ruiz, M.; Sutter, G.; López-Buedo, S.; Zazo, J.F.; López de Vergara, J.E. An FPGA-based approach for packet deduplication in 100 gigabit-per-second networks. In Proceedings of the 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 4–6 December 2017; pp. 1–6. [Google Scholar] [CrossRef]
- Ding, R.; Yang, S.; Chen, X.; Huang, Q. BitSense: Universal and Nearly Zero-Error Optimization for Sketch Counters with Compressive Sensing. In Proceedings of the ACM SIGCOMM 2023 Conference, New York, NY, USA, 10–14 September 2023; ACM SIGCOMM ’23. pp. 220–238. [Google Scholar] [CrossRef]
- Jorden, S.M.; Naveenkumar, G.; Rose, J.A. Cloud–based Adaptive Traffic Signal System using Amazon AWS. In Proceedings of the 2024 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), Chennai, India, 9–10 May 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
- Shrivastav, V. Programmable Multi-Dimensional Table Filters for Line Rate Network Functions. In Proceedings of the ACM SIGCOMM 2022 Conference, New York, NY, USA, 22–26 August 2022; SIGCOMM ’22. pp. 649–662. [Google Scholar] [CrossRef]
- István, Z.; Alonso, G.; Blott, M.; Vissers, K. A Hash Table for Line-Rate Data Processing. ACM Trans. Reconfigurable Technol. Syst. 2015, 8, 1–15. [Google Scholar] [CrossRef]
- Chen, X.; Zhu, L.; Zheng, L.; Cheng, L.; Zhang, J.; Yang, X.; Zhang, D.; Liu, X.; Lu, X.; Yi, X.; et al. TurboCache: Empowering Switch-Accelerated Key-Value Caches with Accurate and Fast Cache Updates. In Proceedings of the IEEE INFOCOM 2025-IEEE Conference on Computer Communications, London, UK, 19–22 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–10. [Google Scholar]
- Kirsch, A.; Mitzenmacher, M.; Wieder, U. More Robust Hashing: Cuckoo Hashing with a Stash. SIAM J. Comput. 2009, 39, 1543–1561. [Google Scholar] [CrossRef]
- Fukač, T.; Matoušek, J.; Kořenek, J.; Kekely, L. Increasing Memory Efficiency of Hash-Based Pattern Matching for High-Speed Networks. In Proceedings of the 2021 International Conference on Field-Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–9. [Google Scholar] [CrossRef]
- Pontarelli, S.; Reviriego, P.; Maestro, J.A. Parallel d-Pipeline: A Cuckoo Hashing Implementation for Increased Throughput. IEEE Trans. Comput. 2016, 65, 326–331. [Google Scholar] [CrossRef]
- Wu, W.Q.; Xue, M.T.; Zhu, T.Q.; Ma, Z.G.; Yu, F. High-Throughput Parallel SRAM-Based Hash Join Architecture on FPGA. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 2502–2506. [Google Scholar] [CrossRef]
- Liang, W.; Yin, W.; Kang, P.; Wang, L. Memory efficient and high performance key-value store on FPGA using Cuckoo hashing. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–4. [Google Scholar] [CrossRef]
- Sha, M.; Guo, Z.C.; Song, M.G.; Wang, K. A Review of FPGA’s Application in High-speed Network Processing. Netw. New Media Technol. 2021, 10, 1–11. [Google Scholar]
- Pagh, R.; Rodler, F.F. Cuckoo Hashing. J. Algorithms 2004, 51, 122–144. [Google Scholar] [CrossRef]
- Fan, B.; Andersen, D.G.; Kaminsky, M. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, Lombard, IL, USA, 2–5 April 2013; nsdi’13. pp. 371–384. [Google Scholar]
- Fukač, T.; Kořenek, J. Hash-based Pattern Matching for High Speed Networks. In Proceedings of the 2019 IEEE 22nd International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS), Cluj-Napoca, Romania, 24–26 April 2019; pp. 1–5. [Google Scholar] [CrossRef]
- Pan, T.; Guo, X.; Zhang, C.; Meng, W.; Liu, B. ALFE: A replacement policy to cache elephant flows in the presence of mice flooding. In Proceedings of the 2012 IEEE International Conference on Communications (ICC), Ottawa, ON, Canada, 10–15 June 2012; pp. 2961–2965. [Google Scholar] [CrossRef]
- Zhang, K.; Huang, W.; Kuang, N.; Zhong, S.; Liang, Y.; Luo, H. TCAM-Hash: A Combined Routing Lookup for Network Applications. In Proceedings of the 2023 5th International Conference on Electronic Engineering and Informatics (EEI), Wuhan, China, 23–25 June 2023; pp. 694–698. [Google Scholar] [CrossRef]
- Miao, R.; Zeng, H.; Kim, C.; Lee, J.; Yu, M. SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, Los Angeles, CA, USA, 21–25 August 2017; Association for Computing Machinery: New York, NY, USA, 2017. SIGCOMM ’17. pp. 15–28. [Google Scholar] [CrossRef]
- Pontarelli, S.; Bifulco, R.; Bonola, M.; Cascone, C.; Spaziani, M.; Bruschi, V.; Sanvito, D.; Siracusano, G.; Capone, A.; Honda, M.; et al. Flowblaze: Stateful packet processing in hardware. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, USA, 26–28 February 2019; USENIX ASSOC. pp. 531–547. [Google Scholar]
- Fiessler, A.; Lorenz, C.; Hager, S.; Scheuermann, B.; Moore, A.W. HyPaFilter+: Enhanced Hybrid Packet Filtering Using Hardware Assisted Classification and Header Space Analysis. IEEE/ACM Trans. Netw. 2017, 25, 3655–3669. [Google Scholar] [CrossRef]
- Yang, Y.; Kuppannagari, S.R.; Srivastava, A.; Kannan, R.; Prasanna, V.K. FASTHash: FPGA-Based High Throughput Parallel Hash Table. In Proceedings of the High Performance Computing: 35th International Conference, ISC High Performance 2020, Frankfurt/Main, Germany, 22–25 June 2020; Proceedings. Springer: Berlin/Heidelberg, Germany, 2020; pp. 3–22. [Google Scholar]
- Wu, H.; Wang, S.; Jin, Z.; Zhang, Y.; Ma, R.; Fan, S.; Chao, R. CostCounter: A Better Method for Collision Mitigation in Cuckoo Hashing. ACM Trans. Storage 2023, 19, 1–24. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, Y.; Richard Shi, C.J. High-Throughput Cuckoo Hashing Accelerator on FPGA Using One-Step BFS. In Proceedings of the 2021 IEEE 4th International Conference on Electronics Technology (ICET), Chengdu, China, 7–10 May 2021; pp. 313–317. [Google Scholar] [CrossRef]
- ARM. AMBA AXI-Stream Protocol Specification. Available online: https://documentation-service.arm.com/static/64819f1516f0f201aa6b963c (accessed on 6 October 2025).
- Ramakrishna, M.; Fu, E.; Bahcekapili, E. Efficient hardware hashing functions for high performance computers. IEEE Trans. Comput. 1997, 46, 1378–1381. [Google Scholar] [CrossRef]
- Krawczyk, H. New Hash Functions for Message Authentication. In Proceedings of the Advances in Cryptology—EUROCRYPT ’95: International Conference on the Theory and Application of Cryptographic Techniques, Saint-Malo, France, 21–25 May 1995; Guillou, L.C., Quisquater, J.J., Eds.; Springer: Berlin/Heidelberg, Germany, 1995; pp. 301–310. [Google Scholar]
- Krawczyk, H. LFSR-based hashing and authentication. In Proceedings of the 14th Annual International Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2001; pp. 129–139. [Google Scholar]
- Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms, 3rd ed.; The MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
- H3C. H3C S5560S-EI Series Enhanced Gigabit Switches. Available online: https://www.h3c.com/en/Products_and_Solutions/InterConnect/Switches/Products/Campus_Network/Aggregation/S5500/H3C_S5560S-EI (accessed on 6 October 2025).
- Agrawal, A.; Bhyravarapu, S.; Venkata Krishna Chaitanya, N. Matrix Hashing with Two Level of Collision Resolution. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science and Engineering (Confluence), Noida, India, 11–12 January 2018; pp. 14–15. [Google Scholar] [CrossRef]
- Bender, M.A.; Kuszmaul, B.C.; Kuszmaul, W. Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering. In Proceedings of the 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), Denver, CO, USA, 7–10 February 2022; pp. 1171–1182. [Google Scholar] [CrossRef]
- Alizadeh, M.; Edsall, T.; Dharmapurikar, S.; Vaidyanathan, R.; Chu, K.; Fingerhut, A.; Lam, V.T.; Matus, F.; Pan, R.; Yadav, N.; et al. CONGA: Distributed congestion-aware load balancing for datacenters. In Proceedings of the 2014 ACM Conference on SIGCOMM, Chicago, IL, USA, 17–22 August 2014; Association for Computing Machinery: New York, NY, USA, 2014; M ’14; pp. 503–514. [Google Scholar] [CrossRef]
- Wang, T.; Yang, X.; Antichi, G.; Sivaraman, A.; Panda, A. Isolation Mechanisms for High-Speed Packet-Processing Pipelines. In Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022, Renton, WA, USA, 4–6 April 2022; Phanishayee, A., Sekar, V., Eds.; USENIX Association: Berkeley, CA, USA, 2022; pp. 1289–1305. [Google Scholar]
- Zhang, H.; Zhao, B.; Li, W.J.; Ma, Z.G.; Yu, F. Resource-Efficient Parallel Tree-Based Join Architecture on FPGA. IEEE Trans. Circuits Syst. II Express Briefs 2019, 66, 111–115. [Google Scholar] [CrossRef]










| Function | Explanation |
|---|---|
| insert(key, value) → bool | Insert a (key, value) pair. Returns true if insertion succeeds and false if the key already exists or all candidate buckets across hash levels are occupied. |
| delete(key) → bool | Delete a (key, value) pair. Returns true if the key is found in any sub-table or the CAM; returns false otherwise. |
| modify(key, op_code, op_num) → value | Atomically update the value of key using op_code on scalar op_num, and return the original value. Returns the original value before modification. Guarantees per-key read–modify–write atomicity even under pipeline parallelism and bursty traffic. |
| query(key) → value | Get the value of key. If found, it returns the stored value; if not found, it returns a special null value. |
| Table Num | LUTs (1,182,240) | Registers (2,364,480) | BRAMs (2160) |
|---|---|---|---|
| 32 | 7518 (0.64%) | 28,030 (1.19%) | 192 (8.89%) |
| 64 | 14,256 (1.21%) | 14,256 (1.21%) | 192 (8.89%) |
| 128 | 27,665 (2.34%) | 101,745 (4.30%) | 192 (8.89%) |
| 256 | 53,139 (4.49%) | 175,088 (7.40%) | 384 (17.78%) |
| Method | Platform | Clock Frequency (MHz) | LUTs | Throughput (Million Requests/s) | Load Factor |
|---|---|---|---|---|---|
| ref. [17], parallel hash * | Xilinx Zynq | 301 | 1653 | 153.6 | 0.50 |
| ref. [18], cuckoo hash † | Xilinx Zynq | 200 | 4944 | 147 | 0.90 |
| ref. [41], T = 16 ‡ | Xilinx Zynq | 261 | 7820 | 62.4 | — |
| ref. [15], CH + B = 8 ⋆ | Xilinx UltraScale+ | 200 | 34,212 | 200 | 0.95 |
| Our work ‡‡ | Xilinx UltraScale+ | 400 | 27,665 | 400 | 0.95 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, X.; Liu, L.; Li, Y. High-Throughput and Memory-Efficient Pipeline Key–Value Store Architecture on FPGA. Micromachines 2025, 16, 1398. https://doi.org/10.3390/mi16121398
Wang X, Liu L, Li Y. High-Throughput and Memory-Efficient Pipeline Key–Value Store Architecture on FPGA. Micromachines. 2025; 16(12):1398. https://doi.org/10.3390/mi16121398
Chicago/Turabian StyleWang, Xinshuo, Lei Liu, and Yifei Li. 2025. "High-Throughput and Memory-Efficient Pipeline Key–Value Store Architecture on FPGA" Micromachines 16, no. 12: 1398. https://doi.org/10.3390/mi16121398
APA StyleWang, X., Liu, L., & Li, Y. (2025). High-Throughput and Memory-Efficient Pipeline Key–Value Store Architecture on FPGA. Micromachines, 16(12), 1398. https://doi.org/10.3390/mi16121398


