TP-Sketch: A Light-Weight Methodology for Persistent Item Lookup in Data Streams
Abstract
1. Introduction
1.1. Motivation
1.2. Our Solution: TP-Sketch
- We provide a light-weight and effective algorithm by controlling replacement probabilities. By minimizing the information stored in the sketch for each item, the sketch can contain more items within limited memory. This design reduces hash collisions for persistent items and improves the accuracy of the estimation.
- To improve throughput, our insertion procedure requires only a single hash operation to locate an item's candidate positions. This makes insertion in TP-Sketch faster than in existing methods.
- We provide a theoretical analysis and derive an error bound for the persistence estimates generated by our algorithm.
- We perform an extensive empirical evaluation using real network traces. Overall, TP-Sketch achieves the highest accuracy and throughput for persistent item lookup among the state-of-the-art algorithms compared. For example, on MAWI Dataset 1, TP-Sketch improves on P-Sketch [17] in both F1-score and average throughput.
2. Related Work
3. Our Proposed Method
3.1. Problem Statement
3.2. Principles
- Memory Efficiency: Following Insight 1, TP-Sketch keeps only a minimal amount of information for each item stored in the sketch, reducing memory usage.
- Computational Simplicity: According to Insight 2, each item insertion uses fewer hash operations, which lowers computational costs.
- Replacement Strategy: Based on Insight 3, a global-threshold replacement strategy is used. Items with persistence estimates above this threshold are seen as promising, whereas those below the threshold are replaced probabilistically, allowing new items to replace them with a specific probability.
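The replacement principle can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `decide_replacement`, the parameter `replace_prob`, and the treatment of a persistence exactly equal to the threshold as promising are all hypothetical, and the threshold is simply passed in rather than computed by the paper's formula.

```python
import random


def decide_replacement(persistence, threshold, replace_prob, rng=random):
    """Return True if an incoming item should evict the resident item.

    Assumptions (not taken from the paper): `threshold` is supplied by the
    caller, `replace_prob` is a probability in [0, 1], and a resident whose
    persistence equals the threshold is treated as promising.
    """
    # Promising residents are protected from replacement.
    if persistence >= threshold:
        return False
    # Non-promising residents are evicted only with probability replace_prob.
    return rng.random() < replace_prob
```

With `replace_prob` set to 1.0 every non-promising resident is evicted, and with 0.0 none is; intermediate values trade off how quickly new items can enter the sketch against how long low-persistence residents are retained.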
3.3. The TP-Sketch Algorithm
Algorithm 1. The TP-Sketch algorithm for finding persistent items in a data stream.
3.4. Running Examples
- First arrival (Figure 3(1)): The item is hashed to an array that provides two available buckets. Since one of these buckets is not occupied, the item is placed in it. Its state is initialized, and its flag is set to prevent duplicate counts for the item in the same window.
- Second arrival (Figure 3(2)): The hash function directs the item to its array, where it is found in one of the buckets. Its flag is not set, meaning the item has not yet been recorded in this window. Consequently, its persistence counter increases by one, and its flag is set.
- Third arrival (Figure 3(3)): The item is hashed to its array and is found there. A check reveals that its flag is already set, indicating that it was recorded in this window. Therefore, no modification is necessary.
- Fourth arrival (Figure 3(4)): The item is assigned to its array. The algorithm finds the bucket with the smallest counter in that array, which holds an item with a persistence of 5 (and a flag). The current persistence threshold is then calculated. Since the stored persistence passes the threshold check, the resident item is classified as a promising persistent item and is protected from replacement. Thus, the arriving item is discarded.
- Fifth arrival (Figure 3(5)): The item is hashed to its array, but both buckets in this array are occupied, so the replacement procedure is initiated. The bucket with the minimum counter in the array is selected; it contains an item with a counter value of 2 and a flag. As the counter value of 2 is below the threshold of 3, the resident item is classified as non-promising. The algorithm then replaces it with the arriving item probabilistically. If the replacement succeeds, the bucket is updated to hold the arriving item; otherwise, the original entry remains unchanged.
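The five cases above can be traced with a toy Python model. It is a sketch under explicit assumptions rather than the paper's Algorithm 1: the class name `ToyTPSketch`, the use of CRC32 as the hash, a caller-supplied fixed `threshold` and `replace_prob`, and the choice to start a newly placed or replacing item with a counter of 1 and its flag set are all hypothetical.

```python
import random
import zlib


class ToyTPSketch:
    """Toy model: arrays of buckets, each bucket is [key, persistence, flag]."""

    def __init__(self, num_arrays, buckets_per_array, threshold, replace_prob, seed=0):
        # key None marks an empty bucket, so None must not be used as an item key.
        self.arrays = [[[None, 0, False] for _ in range(buckets_per_array)]
                       for _ in range(num_arrays)]
        self.threshold = threshold        # assumption: supplied by the caller
        self.replace_prob = replace_prob  # assumption: fixed probability
        self.rng = random.Random(seed)

    def _array_for(self, key):
        # One hash selects the array; CRC32 is a stand-in hash function.
        return self.arrays[zlib.crc32(str(key).encode()) % len(self.arrays)]

    def insert(self, key):
        buckets = self._array_for(key)
        for b in buckets:
            if b[0] == key:
                if not b[2]:          # not yet counted in this window
                    b[1] += 1
                    b[2] = True
                return                # already counted: nothing to do
        for b in buckets:
            if b[0] is None:          # free bucket: place the new item
                b[:] = [key, 1, True]
                return
        victim = min(buckets, key=lambda b: b[1])
        if victim[1] >= self.threshold:
            return                    # resident is promising: discard the newcomer
        if self.rng.random() < self.replace_prob:
            victim[:] = [key, 1, True]

    def end_window(self):
        # Clear every flag so each item can be counted once in the next window.
        for buckets in self.arrays:
            for b in buckets:
                b[2] = False

    def query(self, key):
        for b in self._array_for(key):
            if b[0] == key:
                return b[1]
        return 0
```

Calling `insert` for every arriving item and `end_window` at each window boundary reproduces the behaviour described in the examples: repeated arrivals within a window are counted once, promising residents block newcomers, and non-promising residents are replaced only with the configured probability.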
4. Mathematical Analysis
5. Experiment Results
5.1. Experimental Setup
- MAWI Dataset 1 and MAWI Dataset 2: Traffic traces were collected by the MAWI Working Group [26]. The MAWI Dataset 1 has 193.3 million packets and 47.8 million different items. The MAWI Dataset 2 has 248.8 million packets and 49.08 million different items.
- Zipf Dataset [30]: The Zipf 1.5 and Zipf 2.0 datasets were generated, each containing 200 million data items. The Zipf 1.5 dataset has 483 thousand distinct items with a skewness parameter of 1.5, while the Zipf 2.0 dataset has 19.5 thousand distinct items with a skewness parameter of 2.0. Both datasets were produced with NumPy's Zipf distribution generator (numpy.random.zipf) under Python 3.9.13 [30].
- Precision Rate (PR): The ratio of the number of correctly reported items to the total number of reported items.
- Recall Rate (RR): The ratio of the number of correctly reported items to the number of true persistent items.
- F1-Score: $F_1 = \frac{2 \times PR \times RR}{PR + RR}$.
- Average Relative Error (ARE): $\frac{1}{|\Psi|} \sum_{e \in \Psi} \frac{|p_e - \hat{p}_e|}{p_e}$, where $p_e$ is the real persistence of the item $e$, $\hat{p}_e$ is the estimated persistence of the item, and $\Psi$ is the query set.
- Throughput: We use millions of operations (insertions) per second (Mops) to measure throughput.
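As a concrete reading of the accuracy metrics, the following Python helpers compute PR, RR, F1-score, and ARE from sets of reported and true persistent items. The function names and the dictionary-based inputs are illustrative assumptions, and the query set is passed in explicitly because its exact composition is not specified here.

```python
def precision_recall_f1(reported, truth):
    """PR, RR and F1 for a set of reported items against the true persistent items."""
    correct = len(reported & truth)
    pr = correct / len(reported) if reported else 0.0
    rr = correct / len(truth) if truth else 0.0
    # F1 is the harmonic mean of PR and RR (0 when both are 0).
    f1 = 2 * pr * rr / (pr + rr) if (pr + rr) > 0 else 0.0
    return pr, rr, f1


def average_relative_error(true_persistence, estimated_persistence, query_set):
    """Mean of |true - estimate| / true over the query set.

    Items missing from `estimated_persistence` are treated as estimate 0
    (an assumption); every queried item must have a non-zero true persistence.
    """
    if not query_set:
        return 0.0
    total = sum(
        abs(true_persistence[e] - estimated_persistence.get(e, 0)) / true_persistence[e]
        for e in query_set
    )
    return total / len(query_set)
```

For instance, reporting three items of which two are truly persistent, against a ground truth of three persistent items, yields PR, RR, and F1 all equal to 2/3.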
5.2. Parameter Settings
5.3. Performance Comparison
5.3.1. Accuracy Comparison
5.3.2. Speed Comparison
5.4. Effect of Parameters
5.4.1. The Effect of Window Size
5.4.2. The Effect of Thresholds
5.4.3. Effect of Parameter d
5.5. Ablation Study
5.6. Case Study
6. Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Kumar, A.; Xu, J.; Wang, J. Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement. IEEE J. Sel. Areas Commun. 2006, 24, 2327–2339.
2. Schweller, R.; Li, Z.; Chen, Y.; Gao, Y.; Gupta, A.; Zhang, Y.; Dinda, P.A.; Kao, M.-Y.; Memik, G. Reversible Sketches: Enabling Monitoring and Analysis Over High-Speed Data Streams. IEEE/ACM Trans. Netw. 2007, 15, 1059–1072.
3. Tanbeer, S.K.; Ahmed, C.F.; Jeong, B.S. Mining regular patterns in data streams. In Proceedings of the Database Systems for Advanced Applications: 15th International Conference, DASFAA 2010, Tsukuba, Japan, 1–4 April 2010; Proceedings, Part I 15; Springer: Berlin/Heidelberg, Germany, 2010; pp. 399–413.
4. Rahman, M.S.; Uddin, M.Y.S.; Hasan, T.; Rahman, M.S.; Kaykobad, M. Using Adaptive Heartbeat Rate on Long-Lived TCP Connections. IEEE/ACM Trans. Netw. 2018, 26, 203–216.
5. Miao, R.; Zhong, Z.; Guo, J.; Li, Z.; Yang, T.; Cui, B. BurstSketch: Finding Bursts in Data Streams. IEEE Trans. Knowl. Data Eng. 2022, 35, 11126–11140.
6. Metwally, A.; Agrawal, D.; El Abbadi, A. Efficient Computation of Frequent and Top-k Elements in Data Streams. In Proceedings of the Database Theory—ICDT 2005; Eiter, T., Libkin, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 398–412.
7. Yang, T.; Jiang, J.; Liu, P.; Huang, Q.; Gong, J.; Zhou, Y.; Miao, R.; Li, X.; Uhlig, S. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the SIGCOMM ’18: Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication; Association for Computing Machinery: New York, NY, USA, 2018; pp. 561–575.
8. Fan, Z.; Hu, Z.; Wu, Y.; Guo, J.; Liu, W.; Yang, T.; Wang, H.; Xu, Y.; Uhlig, S.; Tu, Y. PISketch: Finding persistent and infrequent flows. In Proceedings of the FFSPIN ’22: Proceedings of the ACM SIGCOMM Workshop on Formal Foundations and Security of Programmable Network Infrastructures; Association for Computing Machinery: New York, NY, USA, 2022; pp. 8–14.
9. Zhang, Y.; Li, J.; Lei, Y.; Yang, T.; Li, Z.; Zhang, G.; Cui, B. On-off sketch: A fast and accurate sketch on persistence. Proc. VLDB Endow. 2020, 14, 128–140.
10. Chen, X.; Landau-Feibish, S.; Braverman, M.; Rexford, J. Beaucoup: Answering many network traffic queries, one memory update at a time. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication; Association for Computing Machinery: New York, NY, USA, 2020; pp. 226–239.
11. Nagaraja, S.; Shah, R. Clicktok: Click fraud detection using traffic analysis. In Proceedings of the 12th Conference on Security and Privacy in Wireless and Mobile Networks; Association for Computing Machinery: New York, NY, USA, 2019; pp. 105–116.
12. Cole, E. Advanced Persistent Threat: Understanding the Danger and How to Protect Your Organization; Newnes: Oxford, UK; Waltham, MA, USA, 2012.
13. Huang, H.; Sun, Y.E.; Chen, S.; Tang, S.; Han, K.; Yuan, J.; Yang, W. You Can Drop but You Can’t Hide: K-persistent Spread Estimation in High-speed Networks. In Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications; IEEE: New York, NY, USA, 2018; pp. 1889–1897.
14. Chen, L.; Phan, R.C.W.; Chen, Z.; Huang, D. Persistent items tracking in large data streams based on adaptive sampling. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications; IEEE: New York, NY, USA, 2022; pp. 1948–1957.
15. Lahiri, B.; Chandrashekar, J.; Tirthapura, S. Space-efficient tracking of persistent items in a massive data stream. In Proceedings of the 5th ACM International Conference on Distributed Event-Based System; Association for Computing Machinery: New York, NY, USA, 2011; pp. 255–266.
16. Dai, H.; Shahzad, M.; Liu, A.X.; Zhong, Y. Finding persistent items in data streams. Proc. VLDB Endow. 2016, 10, 289–300.
17. Li, W.; Patras, P. P-Sketch: A Fast and Accurate Sketch for Persistent Item Lookup. IEEE/ACM Trans. Netw. 2023, 32, 987–1002.
18. Li, W.; Patras, P. Stable-sketch: A versatile sketch for accurate, fast, web-scale data stream processing. In Proceedings of the ACM Web Conference 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 4227–4238.
19. Li, W. Pandora: An Efficient and Rapid Solution for Persistence-Based Tasks in High-Speed Data Streams. Proc. ACM Manag. Data 2025, 3, 1–26.
20. Li, W.; Li, Z.; Bütün, B.; Diallo, A.F.; Fiore, M.; Patras, P. Pontus: A Memory-Efficient and High-Accuracy Approach for Persistence-Based Item Lookup in High-Velocity Data Streams. In Proceedings of the ACM on Web Conference 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 1783–1794.
21. Alkasassbeh, M.; Al-Haj Baddar, S. Intrusion detection systems: A state-of-the-art taxonomy and survey. Arab. J. Sci. Eng. 2023, 48, 10021–10064.
22. Cao, L.; Shi, Q.; Xiao, W.; Wang, N.; Li, W.; Li, Z.; Zhang, W.; Xu, M. Hypersistent Sketch: Enhanced Persistence Estimation via Fast Item Separation. In Proceedings of the 2025 IEEE 41st International Conference on Data Engineering (ICDE); IEEE: New York, NY, USA, 2025; pp. 3030–3042.
23. Cormode, G.; Muthukrishnan, S. An improved data stream summary: The count-min sketch and its applications. J. Algorithms 2005, 55, 58–75.
24. Estan, C.; Varghese, G. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. (TOCS) 2003, 21, 270–313.
25. Yang, T.; Zhang, H.; Li, J.; Gong, J.; Uhlig, S.; Chen, S.; Li, X. HeavyKeeper: An accurate algorithm for finding Top-k elephant flows. IEEE/ACM Trans. Netw. 2019, 27, 1845–1858.
26. MAWI Dataset. Available online: https://mawi.wide.ad.jp/mawi/ (accessed on 3 May 2025).
27. The Source Code of Bob Hash. Available online: http://burtleburtle.net/bob/hash/evahash.html (accessed on 20 April 2025).
28. Data Set for IMC 2010 Data Center Measurement. Available online: https://pages.cs.wisc.edu/~tbenson/IMC_DATA/ (accessed on 10 April 2025).
29. Benson, T.; Akella, A.; Maltz, D.A. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement; Association for Computing Machinery: New York, NY, USA, 2010; pp. 267–280.
30. Zipf Data Generator. Available online: https://numpy.net.cn/doc/stable/reference/random/generated/numpy.random.zipf.html (accessed on 10 October 2025).
| d | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|---|---|---|
| 2 | 0.929 | 0.938 | 0.944 | 0.939 | 0.946 | 0.944 | 0.945 | 0.936 |
| 3 | 0.934 | 0.947 | 0.951 | 0.958 | 0.963 | 0.952 | 0.957 | 0.953 |

| Method / Mem (KB) | 10 | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|---|
| TP | 0.514 | 0.716 | 0.805 | 0.856 | 0.897 | 0.921 |
| TP-WT | 0.476 | 0.691 | 0.776 | 0.845 | 0.881 | 0.903 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Yang, C.; Lu, Y.; Yang, G.; Xie, Y. TP-Sketch: A Light-Weight Methodology for Persistent Item Lookup in Data Streams. Appl. Sci. 2026, 16, 2018. https://doi.org/10.3390/app16042018
