TP-Sketch: A Light-Weight Methodology for Persistent Item Lookup in Data Streams

Yang, Chen; Lu, Yuliang; Yang, Guozheng; Xie, Yi

doi:10.3390/app16042018


¹ College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China

² Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 2018; https://doi.org/10.3390/app16042018

Submission received: 3 January 2026 / Revised: 7 February 2026 / Accepted: 15 February 2026 / Published: 18 February 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Detecting persistent items that recur across multiple time windows is essential for identifying anomalies in high-speed data streams. However, performing such detection on high-speed streams under tight memory constraints remains a challenge. Existing approaches store much redundant information in their sketches, which increases hash collisions among persistent items and degrades both accuracy and processing speed. In this paper, we propose TP-Sketch, a novel approximate data structure that efficiently addresses these issues. Instead of recording additional item statistics, TP-Sketch classifies items as promising or unpromising based on a dynamic global threshold; it then protects promising persistent items from eviction while probabilistically replacing unpromising ones. This strategy improves both accuracy and speed. We provide a formal error-bound analysis to establish the theoretical soundness of TP-Sketch. Extensive trace-driven experiments show that TP-Sketch consistently outperforms state-of-the-art methods in both accuracy and throughput across a variety of tests. For example, compared with P-Sketch, TP-Sketch improves the average F1-score by 16.27% and the average throughput by 113.21% on the MAWI 1 dataset. Overall, TP-Sketch achieves the best accuracy and throughput among state-of-the-art algorithms.

Keywords:

streams; persistence; sketch; accuracy

1. Introduction

Data stream mining is an essential research task to discover anomalous patterns in real time [1,2,3,4,5], including the detection of frequent items [3,6,7], persistent items [8,9], superspreaders [10], etc. Recently, persistence has received increasing attention. For a data stream divided into L consecutive time windows, the persistence of an item e is defined as the number of time windows in which it appears in the stream. It is crucial for many applications. These include anomaly traffic detection, user behavior analysis, and click fraud detection.

In network monitoring and anomaly detection, persistence is often used to detect potentially malicious activities, such as click fraud or advanced persistent threats (APTs). Prior detection methods use frequency-based approaches to identify anomalies (such as heavy hitters). However, as noted in [11,12,13], many adversarial strategies slow down and spread their actions over long periods to avoid detection. For example, an attacker may transmit only one packet per hour over 100 days. Such a flow sends only 24 packets per day, so it stays below daily frequency thresholds and goes undetected. In such cases, measuring persistence rather than packet frequency is more effective for detecting long-term stealthy behavior, and sustained network activity can be a strong indicator of an APT. This paper focuses on the persistent item lookup problem: our objective is to find items whose persistence in the data stream is greater than a given threshold. This approach helps to detect anomalies in data streams.
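The arithmetic behind this example can be checked with a few lines of Python. This is only an illustrative sketch: the one-hour window size and the variable names are assumptions made here, not values taken from the paper.

```python
# Illustrative numbers from the example: one packet per hour for 100 days.
HOURS_PER_DAY = 24
DAYS = 100

# Hypothetical stealthy flow: the hour indices in which it sends one packet.
active_hours = list(range(HOURS_PER_DAY * DAYS))

# Frequency view: count packets per day.
packets_per_day = [0] * DAYS
for hour in active_hours:
    packets_per_day[hour // HOURS_PER_DAY] += 1
max_daily_frequency = max(packets_per_day)

# Persistence view (assumed one-hour windows): number of windows
# in which the flow appears at least once.
total_windows = HOURS_PER_DAY * DAYS
persistence = len(set(active_hours))

print(max_daily_frequency)          # small daily count
print(persistence / total_windows)  # fraction of windows with activity
```

The daily packet count stays small, while the flow is present in every window, which is exactly the signal a persistence-based detector looks for.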

Existing solutions fall into three groups. Some are sampling-based approaches [14,15]. These methods use sampling to detect potential persistent items, but they often include too many non-persistent items, which makes the results inaccurate. The Small-Space method [15] samples items from the data stream and records the results in a hash table. Sampling is intended to reduce memory usage, but the method still requires a lot of memory to maintain the hash table. Lowering the sampling rate reduces memory usage but causes significant errors and results in low throughput. Other methods are based on coding approaches [16]. These methods encode each item in each time window. Encoding non-persistent items usually consumes a lot of memory, and memory costs increase with window size, which makes these methods impractical in memory-limited scenarios. The encoding operation also requires a lot of computational resources, so these methods are relatively slow. The most recent works are sketch-based algorithms. Sketches are probabilistic data structures that estimate persistence at a lower memory cost and with a higher speed. However, the eviction strategies in most existing sketch-based methods often misclassify non-persistent items, typically when hash collisions occur, rendering the result inaccurate. Recent works such as P-Sketch [17], Stable-Sketch [18], Pandora [19], and Pontus [20] store additional statistics in sketches, which incurs additional memory and computational costs. So, there is much room for improvement.

Machine learning-based algorithms for detecting anomalous traffic, such as advanced persistent threats (APTs) [21], have also been proposed. These methods typically require significant memory overhead and exhibit low processing throughput. Data throughput at core network nodes (such as routers) often exceeds 10 Gb/s [17], and the storage space available for detection algorithms is less than 1 MB. This makes it difficult to deploy machine learning algorithms directly on these devices and hinders the real-time detection of high-speed data streams. Therefore, this paper aims to design an accurate and light-weight algorithm that can operate efficiently on devices such as routers.

In this paper, we introduce the TP-Sketch algorithm. It uses thresholds to protect persistent items and improve the accuracy of detecting these items. The algorithm is straightforward and efficient for tackling the persistent item lookup problem.

1.1. Motivation

In practice, accurately and swiftly detecting persistence in high-speed data streams is a difficult and crucial task. The main difficulties are as follows. The first challenge is to use limited memory to handle large volumes of high-speed data. In the real world, memory remains the primary bottleneck for data stream processing. This is especially true in environments such as embedded systems, IoT devices, and edge computing, which have limited memory and computing resources and are used for network monitoring, traffic analysis, and real-time anomaly detection. In such scenarios, a probabilistic method that returns an estimated result can meet our needs, but it must still process large volumes of data in real time. The second challenge is to record highly persistent items while evicting less persistent ones. Each item is counted at most once per window, regardless of its frequency, so its persistence never exceeds its frequency. Most items have low persistence, and we need to identify and remove them. At the same time, persistent items may be evicted from buckets by non-persistent items, which leads to low detection accuracy; we must avoid this situation.

These challenges motivate us to design a simple and efficient algorithm for detecting persistent items with limited memory. Our algorithm is based on the sketch data structure. Our main idea is that items with different persistence should have different probabilities of being replaced. Previous works, including Pandora [19], P-Sketch [17], Stable-Sketch [18], and Pontus [20,22], keep additional statistics for each item to adjust its replacement probability. However, this requires extra memory to store these statistics and computing resources to update them. In our work, we do not record additional statistics for each item. We store only limited information about the entire stream and use it to distinguish promising persistent items from unpromising ones and to protect promising persistent items from eviction. Avoiding per-item statistics in the sketch reduces memory usage and saves computational resources.

1.2. Our Solution: TP-Sketch

We propose a new algorithm called the TP-Sketch to address these challenges. The key insight is that we can identify promising persistent items directly from the estimated persistence itself, without storing additional information. We give promising persistent items more chances to stay in the sketch. As we do not store additional statistics, this reduces the number of hash collisions and creates space to record more items. Our method achieves high detection accuracy with limited memory.

When a new item arrives, the TP-Sketch adjusts the replacement probability when hash collisions happen and uses a global threshold to distinguish between promising and unpromising persistent items. If the item is already in the sketch and its persistence is above the threshold, it will not be evicted; otherwise, it will be replaced with a certain probability. It keeps persistent items in the sketch while evicting unpromising ones. This approach ensures high memory efficiency and accurate detection of persistent items. The update procedure is efficient and straightforward. It does not require additional statistics per item and keeps only the counter for the current time window. This contributes to its high update speed.

The main contributions of this paper are as follows:

We provide a light-weight and effective algorithm by controlling replacement probabilities. By minimizing the information stored in the sketch for each item, the sketch can contain more items within limited memory. This design reduces hash collisions for persistent items and improves the accuracy of the estimation.
To improve throughput, our insertion procedure requires only a single hash operation to locate the candidate buckets of an item. This makes TP-Sketch’s insertion faster than that of existing methods.
We provide a theoretical analysis and derive an error bound for the persistence estimates generated by our algorithm.
We perform an extensive empirical evaluation using real network traces. Overall, TP-Sketch achieves the best results for persistent item lookups. For example, compared to P-Sketch [17], TP-Sketch improves the average F1-score by 16.27% and the average throughput by 113.21% on MAWI 1. TP-Sketch achieves the highest accuracy and throughput when compared with state-of-the-art algorithms.

2. Related Work

In this section, we introduce the research on persistent items and related studies on persistence. The persistent item lookups are based on this research.

The “Small Space” algorithm [15] uses a “sample and count” method for the estimation of persistence. If an incoming data item is in the hash table, we update its associated counter based on the relevant window field. Otherwise, the item undergoes sampling based on its unique identifier and current time window. This method reduces the need for computational and memory resources. The main limitation of the algorithm is its poor space efficiency. This restricts its use in environments with limited memory.

Many algorithms have been proposed for persistent item detection, including On-Off Sketch [9], P-Sketch [17], and PIE [8]. On-Off Sketch divides the time stream into time windows. It increments the counter for an item once per window when it appears, and each bucket only updates once in a time window. The algorithm uses a sketch-based data structure that has limited memory. Each time a hash collision occurs, the minimum item is evicted and replaced by a new item. On-Off Sketch leads to significant improvements in speed and accuracy compared to earlier approaches such as Small-Space and PIE. However, its replacement strategy is coarse. There remains room for improvement. Subsequent works, such as P-Sketch [17], Stable Sketch [18], Pandora [19], and Pontus [20], aim to improve accuracy. They improve the predictions of persistent items and keep them in the sketch with higher probability. P-Sketch and Stable Sketch count the number of consecutive windows in which an item appears; items that appear in many consecutive windows are less likely to be evicted because they are more likely to be a persistent item. This method increases the chance of retaining persistent items in the sketch. Pandora [19] records consecutive windows of absence to adjust replacement probabilities; items with many consecutive windows of absence are more likely to be evicted. Pontus uses the Flag bit to ensure that persistent items remain in the sketch. These strategies provide more accurate results than the On-Off Sketch [9]. The Hypersistent algorithm [22] is designed for skewed data streams; it uses multiple filters to eliminate low-persistent items. This ensures accurate estimation results for skewed data streams.

Each of these methods has its limitations. The P-Sketch method [17] suffers from underestimation and overestimation errors, which limit its accuracy in the real world. It stores additional “hotness” statistics for each item record in the sketch. This can degrade performance, as it consumes extra memory. Pandora, Pontus, and Stable-Sketch also need additional statistics for each recorded item. This leads to inefficient use of limited memory resources. Stable-Sketch [18] uses bucket status for eviction; its probability decay mechanism is slow. This leads to suboptimal memory use and reduced accuracy. Pandora [19] and Pontus [20] outperform other methods when memory is high. Their performance drops when memory is low, and hash collisions occur frequently. Hypersistent [22] uses multiple filtering stages to find persistent items. It is designed for highly skewed data streams; its effectiveness decreases when the stream distributions are more uniformly distributed.

3. Our Proposed Method

3.1. Problem Statement

Definition 1.

Data Stream: We consider a data stream $S = \{e_{1}, e_{2}, \dots, e_{n}\}$ composed of various items, where each item is represented as a key-value pair. The key serves as the item identifier, and the value is the value associated with the key. In network monitoring, the key often represents a flow identifier, such as a source–destination address pair and its ports, while the value could be information such as frequency or persistence.

Definition 2.

Sketch: Sketches are probabilistic summaries. They use limited memory to record information about large amounts of data. Classic examples of sketches include Count-Min Sketch [23] and CU Sketch [24]. A Count-Min sketch is represented by an array of buckets with $w$ columns and $d$ rows. Initially, each counter in the bucket array is set to zero. In addition, it requires $d$ hash functions $h_{1}, \dots, h_{d}$, one per row. During insertion, an item is counted in $d$ buckets, one in each row. The minimum count among these $d$ buckets is returned as the estimated result.
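As a concrete reference point, a Count-Min sketch along the lines of Definition 2 can be sketched as follows. This is a minimal illustration, not code from the cited works: the class name, the use of salted BLAKE2b as the per-row hash, and string keys are all assumptions made here.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: d rows of w counters, one hash per row."""

    def __init__(self, w, d):
        self.w = w
        self.d = d
        self.rows = [[0] * w for _ in range(d)]

    def _col(self, row, key):
        # Assumption: a salted BLAKE2b digest stands in for hash function h_row.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.w

    def insert(self, key, count=1):
        # The item is counted in one bucket of every row.
        for r in range(self.d):
            self.rows[r][self._col(r, key)] += count

    def estimate(self, key):
        # The minimum over the d mapped counters is the estimate.
        return min(self.rows[r][self._col(r, key)] for r in range(self.d))
```

Because every mapped counter includes the item's own count plus any colliding counts, the estimate of a Count-Min sketch can only be equal to or larger than the true count.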

Definition 3.

Time Window: The time range $[t_{1}, t_{N}]$ of the data stream $S$ is evenly divided into $L$ time windows, each of size $R = (t_{N} - t_{1}) / L$. These windows are represented as the set $\{(t_{1}, t_{1} + R), (t_{1} + R, t_{1} + 2R), \dots, (t_{1} + (L - 1) R, t_{N})\}$.
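In code, assigning a timestamp to its window is a single division. The helper below is a hypothetical sketch: the function name, the 0-based index, and the choice to treat windows as half-open intervals (with $t_{N}$ itself placed in the last window) are assumptions, since the definition does not fix the boundary convention.

```python
def window_index(t, t_start, t_end, num_windows):
    """Return the 0-based index of the window containing timestamp t.

    Assumes windows are [t_start + k*R, t_start + (k+1)*R), except that
    t == t_end is assigned to the last window.
    """
    if not t_start <= t <= t_end:
        raise ValueError("timestamp outside the stream's time range")
    window_size = (t_end - t_start) / num_windows  # R in Definition 3
    index = int((t - t_start) // window_size)
    return min(index, num_windows - 1)
```

For example, with a range of 0 to 100 split into 10 windows, a timestamp of 10 falls into the second window (index 1).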

Definition 4.

Persistence: Consider a data stream $S$ that is partitioned into $L$ equal time windows across a defined time interval. For any item $e$, $f^{T}(e)$ denotes its frequency within the time window $T$. The persistence $V(e)$ of the item is then defined as $V(e) = \sum_{T = 1}^{L} I(f^{T}(e))$, where $I(\cdot)$ represents the indicator function. We have

$$I(f^{T}(e)) = \begin{cases} 1, & f^{T}(e) > 0; \\ 0, & \text{otherwise}. \end{cases} \tag{1}$$

This equation means that for an item $e$, its persistence $V(e)$ is the number of windows, among all $L$ windows, in which $e$ appears.

Definition 5.

Persistent Item Lookup: Item $e$ is a $\theta$-persistent item if $V(e) \geq \theta L$, where $\theta \in (0, 1]$ is the threshold defined by the user. This means that item $e$ appears in at least $\theta L$ of the $L$ time windows.
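Definitions 4 and 5 describe the exact (ground-truth) answer that a sketch approximates. A minimal sketch of that exact computation is shown below; it assumes the stream is already given as `(item, window_index)` pairs and keeps one set per item, which is precisely the memory cost that sketch-based methods try to avoid. The function names are illustrative.

```python
from collections import defaultdict


def exact_persistence(stream):
    """Map each item to the number of distinct windows it appears in.

    stream: iterable of (item, window_index) pairs (an assumed input format).
    """
    windows_seen = defaultdict(set)
    for item, window in stream:
        windows_seen[item].add(window)  # repeated arrivals in a window count once
    return {item: len(windows) for item, windows in windows_seen.items()}


def persistent_items(stream, num_windows, theta):
    """Return the theta-persistent items, i.e., those with V(e) >= theta * L."""
    persistence = exact_persistence(stream)
    return {e for e, v in persistence.items() if v >= theta * num_windows}
```

Such an exact baseline is useful for computing the ground truth against which an approximate structure is evaluated.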

3.2. Principles

Recent sketch-based methods for persistent item detection include P-Sketch [17], Stable-Sketch [18], and Pandora [19]. As shown in Figure 1a, they store each item using four fields: $(FP, Flag, V, A)$. Here, $FP$ is the item fingerprint, $Flag$ prevents duplicate counting within a detection window, $V$ estimates persistence, and $A$ maintains auxiliary information. For example, in P-Sketch [17], $A$ represents an item’s “hotness”, which is its consecutive appearance count across windows and is used to predict its persistence. During a hash collision, P-Sketch uses the replacement probability $\frac{1}{\eta (V + A) + 1}$, where $\eta$ is a constant parameter. The statistic $A$ can improve detection accuracy by giving persistent items a greater chance of staying in the sketch. However, it introduces significant overhead. Storing $A$ consumes extra memory, and updating it in real time requires considerable computational resources. This limits the throughput of the algorithm. A key observation for optimization is that the additional field $A$ can be eliminated. A higher value of $A$ is strongly correlated with a higher value of $V$ (which indicates a higher persistence estimate). In the replacement probability $\frac{1}{\eta (V + A) + 1}$, this correlation means that the term $(V + A)$ depends mainly on the shared trend of both variables. The persistence signal that $A$ captures is already recorded in $V$, so we can eliminate $A$.

To address this inefficiency, we propose an approach that uses a global threshold to predict persistent items. This eliminates the field A as shown in Figure 1b, thus reducing storage and computational overhead. The freed memory can then be allocated to more buckets. This reduces hash collisions. It also improves the accuracy of the estimate and speeds up the insertion of items. From this analysis, we derive three key insights:

Key Insight 1: In memory-constrained environments, reducing the per-item memory cost enables the allocation of more buckets. This will reduce hash collisions and further improve accuracy.

Key Insight 2: Operational simplicity is important for high throughput. Previous works [17,19] introduce additional computational overhead by updating additional statistics. Eliminating extra-dimensional computations will speed up insertion and then enhance overall throughput.

Key Insight 3: A differentiated replacement strategy is essential. Promising items should have a low probability of being replaced, whereas unpromising items should be evicted with a high probability. This approach allows persistent items to be kept in the sketch with a higher probability.

Based on these insights, the TP-Sketch is designed as follows:

Memory Efficiency: Following Insight 1, TP-Sketch stores only $(FP, Flag, V)$ for each item stored in the sketch, reducing memory usage.
Computational Simplicity: According to Insight 2, each item insertion uses fewer hash operations, which lowers computational costs.
Replacement Strategy: Based on Insight 3, a global-threshold replacement strategy is used. Items whose persistence estimates reach this threshold are seen as promising and are protected, whereas items below the threshold can be replaced by new items with a specific probability.

Threshold for classifying promising/unpromising items and replacement probability: As we do not store additional information for each item to classify whether it is a promising persistent item, we use global information about the stream instead. We define the global threshold as $\theta l^{(t)}$, where $l^{(t)}$ is the window number at time $t$. As persistent items spread over a long time, their persistence grows linearly with the number of windows. We can use this to predict persistent items and protect them from being evicted, which improves accuracy. Let $V^{(t)}(e)$ be the estimated persistence of $e$ at time $t$. When $V^{(t)}(e) \geq \theta l^{(t)}$, we consider $e$ to be a promising persistent item. Promising persistent items are never evicted; unpromising items are replaced with a probability of $\frac{1}{\lambda (V^{(t)}(e) + 1)}$. We therefore set the replacement probability as follows:

$$P(e) = \begin{cases} 0, & V^{(t)}(e) \geq \theta l^{(t)}; \\ \frac{1}{\lambda (V^{(t)}(e) + 1)}, & \text{otherwise}, \end{cases} \tag{2}$$

where $\lambda$ is a constant obtained from the experiment. In other words, if the estimated persistence of a resident item $e'$ reaches $\theta l^{(t)}$, we will not replace it; otherwise, we will replace it with a probability of $\frac{1}{\lambda (V^{(t)}(e') + 1)}$.
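Equation (2) translates directly into a small function. The sketch below is illustrative only: the function and argument names are assumptions, and the sample values of $\theta$, $\lambda$, and the window number in the usage note are chosen for demonstration.

```python
def replacement_probability(v_est, window_no, theta, lam):
    """Replacement probability of a resident item, following Equation (2).

    v_est:     estimated persistence V^(t)(e) of the resident item
    window_no: current window number l^(t)
    theta:     persistence threshold in (0, 1]
    lam:       the constant lambda (chosen experimentally in the paper)
    """
    # Note: theta * window_no is a floating-point product, so exact ties can
    # be affected by rounding; exact rationals could be used if ties matter.
    if v_est >= theta * window_no:
        return 0.0  # promising item: protected from eviction
    return 1.0 / (lam * (v_est + 1))  # unpromising item
```

With `theta = 0.1`, `window_no = 30`, and `lam = 10`, a resident item with estimate 5 is protected (probability 0), while one with estimate 2 is replaced with probability 1/30. The function also shows the role of $\lambda$: doubling `lam` halves the replacement probability of every unpromising item.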

Finding an optimal $\lambda$ is the key problem in TP-Sketch’s design. If it is too small, the replacement probability will be large, and persistent items can easily be replaced. If it is too large, it is difficult for new persistent items to enter the sketch. In this work, we obtain an approximately optimal $\lambda$ through experimentation.

3.3. The TP-Sketch Algorithm

Data Structure: Figure 2 shows the data structure of TP-Sketch. Memory is divided into $M$ equal-sized blocks, and each block is used as a bucket. The $M$ buckets are organized into $w$ arrays denoted as $B[1, \dots, w]$. Each array has $d$ buckets, so $M = w \cdot d$. Each bucket includes three fields: $(FP, Flag, V)$. The field $FP$ is the fingerprint of item $e$. The field $Flag$ is used to avoid duplicate counting: $True$ means that the item has not yet arrived in the current time window, and $False$ means that it has already arrived within this time window. The field $V$ is the estimated persistence of item $e$. Each insertion uses a single hash operation to find the candidate buckets of the item.
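The layout described above can be written down as a small data structure. This is a sketch under assumptions: the class and function names are hypothetical, an empty bucket is modeled by a fingerprint of `None`, and Python objects are used for readability rather than the packed bit fields a real implementation would use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    fp: Optional[int] = None  # FP: fingerprint; None models an empty bucket
    flag: bool = True         # Flag: True = not yet seen in the current window
    v: int = 0                # V: estimated persistence


def make_sketch(w: int, d: int) -> List[List[Bucket]]:
    """Create w arrays of d buckets each, so M = w * d buckets in total."""
    return [[Bucket() for _ in range(d)] for _ in range(w)]
```

Each bucket is created independently, so updating one bucket never affects another.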

The steps of the TP-Sketch algorithm are shown in Algorithm 1. The variable l (line 1) tracks the index of the current time window. The next sections explain the details.

Algorithm 1 The TP-Sketch algorithm for finding persistent items in a data stream.

Require: The data stream $S = \{e_{1}, e_{2}, \dots, e_{n}\}$
1: $l \leftarrow 0$
2: while item $e_{i}$ arrives do
3:   if $e_{i} \in B[h(e_{i})]$ then
4:     $(r, j) \leftarrow$ index of $e_{i}$
5:     if $B[r][j].Flag == True$ then
6:       $B[r][j].V \leftarrow B[r][j].V + 1$
7:       $B[r][j].Flag \leftarrow False$
8:     end if
9:   else if $\exists$ empty bucket $(r, j)$ in $B[h(e_{i})]$ then
10:     $B[r][j].FP \leftarrow$ fingerprint of $e_{i}$
11:     $B[r][j].V \leftarrow 1$
12:     $B[r][j].Flag \leftarrow False$
13:   else
14:     $(r, j) \leftarrow$ index of the bucket with the minimum estimated persistence in $B[h(e_{i})]$
15:     if $B[r][j].V \geq \theta l$ then
16:       return
17:     end if
18:     $random \leftarrow$ random number in $U(0, 1)$
19:     if $random < \frac{1}{\lambda (B[r][j].V + 1)}$ and $B[r][j].Flag == True$ then
20:       $B[r][j].FP \leftarrow$ fingerprint of $e_{i}$
21:       $B[r][j].V \leftarrow 1$
22:       $B[r][j].Flag \leftarrow False$
23:     end if
24:   end if
25:   if End of a time window then
26:     $l \leftarrow l + 1$
27:     for $p = 1$ to $w$ do
28:       for $q = 1$ to $d$ do
29:         $B[p][q].Flag \leftarrow True$
30:       end for
31:     end for
32:   end if
33: end while
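For readers who prefer executable code, the following is one possible Python rendering of Algorithm 1. It is a sketch under stated assumptions, not the authors' implementation: the class and method names, the use of BLAKE2b as $h(\cdot)$, the use of the full 64-bit hash as the fingerprint, the explicit `end_window` call, and the seeded random generator are all choices made here for illustration.

```python
import hashlib
import random


class TPSketch:
    """Illustrative TP-Sketch: w arrays of d buckets holding [FP, Flag, V]."""

    def __init__(self, w, d, theta, lam, seed=0):
        self.w, self.d, self.theta, self.lam = w, d, theta, lam
        self.l = 0  # index of the current time window (line 1)
        self.B = [[[None, True, 0] for _ in range(d)] for _ in range(w)]
        self.rng = random.Random(seed)  # seeded only for reproducibility

    @staticmethod
    def _hash64(key):
        # Assumption: one 64-bit hash gives both the array index and the fingerprint.
        return int.from_bytes(
            hashlib.blake2b(key.encode(), digest_size=8).digest(), "little")

    def insert(self, key):
        fp = self._hash64(key)
        row = self.B[fp % self.w]
        for b in row:                      # Case 1 (lines 3-8)
            if b[0] == fp:
                if b[1]:                   # first arrival in this window
                    b[2] += 1
                    b[1] = False
                return
        for b in row:                      # Case 2 (lines 9-12)
            if b[0] is None:
                b[0], b[1], b[2] = fp, False, 1
                return
        victim = min(row, key=lambda b: b[2])  # Case 3; first minimum on ties
        # As written in Algorithm 1, l = 0 in the first window, so every
        # resident item is protected until the first window ends.
        if victim[2] >= self.theta * self.l:   # promising: protect (lines 15-17)
            return
        p = 1.0 / (self.lam * (victim[2] + 1))
        if victim[1] and self.rng.random() < p:  # line 19
            victim[0], victim[1], victim[2] = fp, False, 1

    def end_window(self):
        """Lines 25-32: advance the window counter and reset all flags."""
        self.l += 1
        for row in self.B:
            for b in row:
                b[1] = True

    def query(self, key):
        """Estimated persistence of key, or 0 if it is not recorded."""
        fp = self._hash64(key)
        for b in self.B[fp % self.w]:
            if b[0] == fp:
                return b[2]
        return 0
```

A caller would invoke `insert` for every arriving item and `end_window` at each window boundary. Note that a single hash value selects the array and serves as the fingerprint, which mirrors the one-hash-per-insertion design.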

Insertion Procedure: For the item $e_{i}$ arriving in the current window, the algorithm inserts it into the sketch. The hash function $h(e_{i})$ finds the corresponding array in the sketch. The item $e_{i}$ corresponds to $d$ candidate buckets $B[h(e_{i})][1], \dots, B[h(e_{i})][d]$, which provide $d$ possible locations to record it. The algorithm computes only one hash per item to locate these buckets, which improves processing efficiency.

As detailed in Algorithm 1, the insertion can occur in three cases:

Case 1: If $e_{i}$ is in $B[h(e_{i})]$ and holds the position $(r, j)$ (lines 3–8), the algorithm checks its flag first. When $B[r][j].Flag$ is $False$, the item has already been recorded in the current time window, so no update is performed. If $Flag$ is $True$, the counter $B[r][j].V$ increases by 1, and $Flag$ is then set to $False$.

Case 2: When $e_{i}$ is not found in $B[h(e_{i})]$ and at least one bucket in the array is empty (lines 9–12), the item is placed into an empty bucket. Its counter $B[r][j].V$ is set to 1, and $B[r][j].Flag$ is set to $False$.

Case 3: If $e_{i}$ is not in $B[h(e_{i})]$ and all $d$ buckets are occupied, a hash collision occurs. In this case, the algorithm uses the probabilistic replacement policy. First, it finds the bucket with the smallest estimated persistence among the $d$ candidates; if multiple buckets have the same minimum value, the first one encountered is selected and denoted as $B[r][j]$. The threshold $\theta l$ classifies the resident item as promising or not. For promising items (lines 15–17), no replacement occurs, and $e_{i}$ is simply discarded. For unpromising items, the original entry is replaced with a probability of $\frac{1}{\lambda (B[r][j].V + 1)}$. If the replacement fails, $e_{i}$ is discarded.

At the end of each time window, the window counter $l$ increases by 1, and all $Flag$ fields in the sketch are reset to $True$. This operation allows the algorithm to move to the next window.

Reporting Persistent Items: The algorithm scans all $M = w \cdot d$ buckets to find persistent items and reports entries whose estimated persistence is above the defined threshold. The small size of the sketch allows for a quick report.
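The reporting step is a single linear scan. The following sketch assumes the buckets are available as nested lists of `(fp, flag, v)` tuples with `None` marking an empty bucket, and it uses the condition $V \geq \theta L$ from Definition 5 as the reporting threshold; both the input format and the function name are assumptions for illustration.

```python
def report_persistent(buckets, theta, num_windows):
    """Scan all w * d buckets and return (fingerprint, estimate) pairs
    whose estimated persistence reaches theta * L."""
    threshold = theta * num_windows
    return [
        (fp, v)
        for row in buckets
        for (fp, _flag, v) in row
        if fp is not None and v >= threshold
    ]
```

Since only fingerprints are stored, the report identifies items by fingerprint; mapping fingerprints back to full keys is outside the scope of this sketch.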

3.4. Running Examples

Figure 3 presents an example that clarifies the algorithm's update procedure. The sketch in this example contains three arrays ($w = 3$), each with two buckets ($d = 2$). The depicted operations occur within the same time window. For items such as $e_{5}$, a $False$ flag shows that they have already been recorded in this window. The current window number is $l^{(t)} = 30$, the persistence threshold is $\theta = 0.1$, and the parameter $\lambda$ is set to 10 in this example.

The example is shown below:

Arrival of $e_{1}$ (Figure 3(1)): The item $e_{1}$ is hashed by $h(\cdot)$ to the array $B[1]$, which provides two candidate buckets: $B[1][1]$ and $B[1][2]$. Since $B[1][2]$ is not occupied, $e_{1}$ is placed in this bucket. Its state is updated to $(e_{1}, 1, F)$. Its $Flag$ is set to $False$ to prevent duplicate counts for $e_{1}$ in the same window.
Arrival of $e_{6}$ (Figure 3(2)): The hash function directs $e_{6}$ to the array $B[2]$, where it is found in the bucket $B[2][2]$. Its $Flag$ is $True$, meaning $e_{6}$ has not been recorded in this window. Consequently, its persistence counter $B[2][2].V$ increases by one, and its $Flag$ is set to $False$.
Arrival of $e_{5}$ (Figure 3(3)): The item $e_{5}$ is hashed to the array $B[3]$ and is found in the array. A check reveals that its $Flag$ is already $False$, indicating that it was recorded in this window. Therefore, no modification is necessary.
Arrival of $e_{2}$ (Figure 3(4)): This item is assigned to array $B[2]$. The algorithm finds the bucket with the smallest counter in $B[2]$, which is $B[2][1]$ with a persistence of 5 (and a $True$ flag). The current persistence threshold is calculated as $\theta l = 0.1 \times 30 = 3$. Since $5 > 3$, the item in $B[2][1]$ is classified as a promising persistent item and is protected from replacement. Thus, $e_{2}$ is discarded.
Arrival of $e_{10}$ (Figure 3(5)): Item $e_{10}$ is hashed to array $B[3]$, but both buckets in this array are occupied. The replacement procedure is initiated. The bucket with the minimum counter in $B[3]$ is selected; it contains $e_{8}$ with a counter value of 2 and a $True$ flag. As the counter value of 2 is below the threshold of 3, $e_{8}$ is classified as non-promising. The algorithm then decides to replace it with $e_{10}$ probabilistically, with a chance of $\frac{1}{10 \times (2 + 1)} = \frac{1}{30}$. If the replacement succeeds, the bucket is updated to $(e_{10}, 1, F)$; otherwise, the original entry $(e_{8}, 2, T)$ remains unchanged.

4. Mathematical Analysis

Theorem 1.

For a TP-Sketch, there are $w$ arrays, and each array has $d$ buckets. The total number of items is $F$, and the total number of time windows is $L$. The space complexity of the TP-Sketch is $O(wd \log L)$. The time complexity for inserting all items is $O(Fd + wdL)$. The time complexity for retrieving all persistent items is $O(wd)$.

Proof.

The TP-Sketch data structure comprises $wd$ buckets, with each bucket storing a 64-bit key, $\log L$ bits for persistence, and 1 bit for a Flag. Thus, its total space complexity is $O(wd(64 + 1 + \log L)) = O(wd \log L)$.

Each insert operation requires access to up to $d$ buckets, so processing all $F$ items takes $O(Fd)$ time. Furthermore, at the start of each new time window, all $wd$ Flags must be reset to $True$. Over $L$ windows, this resetting contributes $O(wdL)$ time, resulting in a total insertion time complexity of $O(Fd + wdL)$.

Retrieving persistent items is performed by scanning all $wd$ buckets and reporting those whose persistence exceeds the given threshold, which requires $O(wd)$ time.

□
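The per-bucket cost used in the proof also gives a quick way to estimate how many buckets fit in a given memory budget. The helper below is a back-of-the-envelope sketch: the function names are hypothetical, the 64-bit key size follows the proof, and the counter width is taken as $\lceil \log_2 (L + 1) \rceil$ bits so that values from 0 to $L$ fit, which is an assumption about how $\log L$ would be rounded in practice.

```python
import math


def bucket_bits(num_windows, key_bits=64):
    """Bits per bucket: key + 1 flag bit + a counter able to hold 0..L."""
    counter_bits = math.ceil(math.log2(num_windows + 1))
    return key_bits + 1 + counter_bits


def buckets_for_budget(memory_bytes, num_windows, key_bits=64):
    """Number of whole buckets (M = w * d) that fit in the memory budget."""
    return (memory_bytes * 8) // bucket_bits(num_windows, key_bits)
```

For instance, with $L = 1000$ windows each bucket needs 75 bits under these assumptions, so a 100 KB budget holds roughly eleven thousand buckets.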

Theorem 2.

Let $V(e)$ be the true persistence of $e$, and let $\hat{V}(e)$ be the estimation of $V(e)$. Then, $\hat{V}(e) \leq V(e)$.

Proof.

We prove this by induction over time, using a case analysis.

At the start of the detection task ($t = 0$), both $\hat{V}(e_{i})$ and $V(e_{i})$ are 0; thus, the theorem holds. Assume that $\hat{V}_{t-1}(e_{i}) \leq V_{t-1}(e_{i})$ holds at time $t - 1$.

If the item is recorded in the sketch at time $t$, two scenarios are possible. (1) The item $e_{i}$ arrives. If its $Flag$ is $True$, then $V_{t}(e_{i}) := V_{t-1}(e_{i}) + 1$ and $\hat{V}_{t}(e_{i}) := \hat{V}_{t-1}(e_{i}) + 1$; hence, $\hat{V}_{t}(e_{i}) \leq V_{t}(e_{i})$ still holds at time $t$. Otherwise, if its $Flag$ is $False$, the item has already arrived in this time window, so $V_{t}(e_{i}) := V_{t-1}(e_{i})$ and $\hat{V}_{t}(e_{i}) := \hat{V}_{t-1}(e_{i})$, and $\hat{V}_{t}(e_{i}) \leq V_{t}(e_{i})$ still holds at time $t$. (2) An item other than $e_{i}$ arrives. Then $V_{t}(e_{i}) := V_{t-1}(e_{i})$, and $\hat{V}_{t}(e_{i})$ either decreases to 0 (when $e_{i}$ is replaced by the other item) or remains the same, so $\hat{V}_{t}(e_{i}) \leq V_{t}(e_{i})$ still holds.

If the item is not stored in the sketch at time $t - 1$, then $\hat{V}_{t-1}(e_{i}) = 0$, and we have $\hat{V}_{t-1}(e_{i}) \leq V_{t-1}(e_{i})$. When $e_{i}$ arrives at time $t$, it attempts to replace an item in the sketch. If the replacement is successful, both $V(e_{i})$ and $\hat{V}(e_{i})$ increase by 1, resulting in $\hat{V}_{t}(e_{i}) \leq V_{t}(e_{i})$. Otherwise, if the replacement fails, $\hat{V}_{t}(e_{i}) = 0$, and we still have $\hat{V}_{t}(e_{i}) \leq V_{t}(e_{i})$. Since the claim holds for all scenarios, Theorem 2 is proven. □

Theorem 3.

Let $e$ denote the item with the $i$-th highest persistence among all persistent items. We focus on the persistent items. Given a small positive number $\epsilon$ and a persistent item $e$, and letting $V = \sum_{e \in U} V(e)$, the probability that the difference between $V(e)$ and $\hat{V}(e)$ exceeds $\epsilon V$ satisfies the following inequality:

$$Pr(V(e) - \hat{V}(e) \geq \epsilon V) \leq \frac{\theta L \, Pr_{d}}{\epsilon V \lambda} \tag{3}$$

where $Pr_{d} = \binom{i - 1}{d - 1} \left(\frac{1}{w}\right)^{d - 1} \left(1 - \frac{1}{w}\right)^{i - d}$.

Proof.

We define $Pr_{d}$ as the probability that item $e$ has the minimum persistence count in all $d$ buckets to which it is mapped. For $e$ to have the minimum persistence in these $d$ positions, exactly $d - 1$ of the top $i - 1$ persistent items are assigned to the $d$ positions. This can be expressed as $Pr_{d} = \binom{i-1}{d-1} \left(\frac{1}{w}\right)^{d-1} \left(1 - \frac{1}{w}\right)^{i-d}$. Here, $i$ is the persistence rank of item $e$, as defined in Theorem 3.
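As a sanity check on this expression, the special case $d = 1$ is worked out below. This is an illustrative instance rather than part of the original proof, and it assumes each item is hashed uniformly and independently into one of $w$ buckets.

```latex
\[
Pr_{1}
  = \binom{i-1}{0}\left(\frac{1}{w}\right)^{0}\left(1-\frac{1}{w}\right)^{i-1}
  = \left(1-\frac{1}{w}\right)^{i-1}
\]
```

Under that assumption, this is the probability that none of the $i - 1$ more persistent items lands in the single bucket of $e$, and it tends to 1 as $w$ grows.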

We calculate the average decrease in persistence due to replacement, which is the main reason the persistence estimate decreases [19]. We model the persistence of item $e$ as a series of discrete states $k$, where $k \in \{1, 2, \dots, V(e)\}$. We assume that once a persistent item $e$ is placed in a bucket, it can be replaced by other items at most once, and that when it arrives again, it re-enters the bucket immediately [19,25]. Under this assumption, we can calculate the expected number of times the persistence of an item is reduced due to replacement, denoted $E(X(e))$. If the estimated persistence of the item is $\hat{V}(e)$, then the probability of its replacement is $\frac{1}{λ \hat{V}(e) + 1}$; thus, the expected decrease in its persistence in this case is $\hat{V}(e) \frac{1}{λ \hat{V}(e) + 1}$. Therefore, it can be expressed as follows:

$$E(X(e)) = \sum_{k=1}^{E(\hat{V}(e))} V^{k}(e) \frac{1}{λ (V^{k}(e) + 1)} Pr_{d} < \sum_{k=1}^{θ L} \frac{Pr_{d}}{λ} = θ L \frac{Pr_{d}}{λ}$$

(4)
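The per-term step behind the strict inequality in (4) can be written out explicitly. The derivation below is a reading of the formula as printed, assuming $Pr_{d} > 0$ and $V^{k}(e) \geq 0$. It covers only the bound on each summand; the change of the upper summation limit to $θL$ is taken from the original as given.

```latex
\[
V^{k}(e)\,\frac{1}{\lambda\,(V^{k}(e)+1)}\,Pr_{d}
  = \frac{V^{k}(e)}{V^{k}(e)+1}\cdot\frac{Pr_{d}}{\lambda}
  < \frac{Pr_{d}}{\lambda},
\qquad\text{since } 0 \le \frac{V^{k}(e)}{V^{k}(e)+1} < 1 .
\]
```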

The expectation of persistence estimation is equal to the persistence minus the expectation of persistence decrease, so we have

$$E(\hat{V}(e)) = V(e) - E(X(e))$$

(5)

Furthermore, by applying the Markov inequality, we deduce the following:

$$Pr(V(e) - \hat{V}(e) \geq ϵV) \leq \frac{V(e) - E(\hat{V}(e))}{ϵV} = \frac{E(X(e))}{ϵV} < \frac{θ L \, Pr_{d}}{ϵ V λ}$$

(6)

□
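To make the bound of Theorem 3 concrete, the following minimal sketch evaluates the right-hand side of Inequality (3) numerically. The function names `pr_d` and `error_bound` are hypothetical, and the rank `i` and width `w` in the usage note are illustrative values that do not come from the paper.

```python
from math import comb


def pr_d(i, d, w):
    """Pr_d from Theorem 3: C(i-1, d-1) * (1/w)^(d-1) * (1 - 1/w)^(i-d)."""
    if not (1 <= d <= i) or w < 1:
        raise ValueError("require 1 <= d <= i and w >= 1")
    return comb(i - 1, d - 1) * (1.0 / w) ** (d - 1) * (1.0 - 1.0 / w) ** (i - d)


def error_bound(theta_l, eps_v, lam, i, d, w):
    """Right-hand side of Inequality (3), capped at 1 because it bounds a probability.

    theta_l is the product theta * L and eps_v is the product epsilon * V.
    """
    return min(1.0, theta_l * pr_d(i, d, w) / (eps_v * lam))


if __name__ == "__main__":
    # theta*L = 20,000, eps*V = 100, and d = 3 follow the validation setup in the text;
    # lam = 10 follows Section 5.2; i = 100 and w = 1000 are assumed for illustration.
    print(error_bound(theta_l=20_000, eps_v=100, lam=10, i=100, d=3, w=1000))
```

With these settings the factor $θL / (ϵVλ)$ equals 20, so the bound is informative only when $Pr_{d}$ is below 0.05, which is why the sketch caps the result at 1.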

To validate our theoretical error bound experimentally, we conducted tests on the MAWI 1 dataset [26]. The parameters were configured as follows: $ϵ$ was tuned to satisfy $ϵV = 100$, the window size was fixed at 1000, the detection threshold $θL$ was set to 20,000, and $d = 3$. Following the method in [19], the parameter $w$ was determined from the available memory and the dimension $d$. As shown in Figure 4, the theoretical error bound remains above the empirically measured error throughout the experiment, which is consistent with our derived bound.

5. Experiment Results

5.1. Experimental Setup

Platform and settings: We implement our algorithm in C++. The hash function is the 32-bit Bob Hash with different initial seeds [27]. We set the fingerprint length to 64 bits, the estimated persistence to 32 bits, and the Flag to 1 bit. Thus, each bucket in TP-Sketch occupies 97 bits. The source code can be downloaded at https://github.com/YangChendegit/TP-Sketch.git (accessed on 20 January 2026).

Computation Platform: We conducted all the experiments on a machine with two 20-core processors (2 threads, Intel(R) Xeon(R) Gold 5218R CPU @ 2.10 GHz) and 256 GB of DRAM memory. The processor has a 1.3 MB L1 cache, a 40 MB L2 cache, and a 55 MB L3 cache shared by all cores.

Datasets:

MAWI Dataset 1 and MAWI Dataset 2: Traffic traces were collected by the MAWI Working Group [26]. The MAWI Dataset 1 has 193.3 million packets and 47.8 million different items. The MAWI Dataset 2 has 248.8 million packets and 49.08 million different items.
IMC DC Trace: The IMC Data Center Trace [28] was collected from the data centers studied in [29]. It contains 192 thousand distinct items and 104 million items in total.
Zipf Dataset [30]: We generated the Zipf 1.5 and Zipf 2.0 datasets, each containing 200 million data items. The Zipf 1.5 dataset has 483 thousand distinct items with a skewness parameter of 1.5, while the Zipf 2.0 dataset has 19.5 thousand distinct items with a skewness parameter of 2.0. Both datasets were produced with the NumPy Zipf distribution generator [30] under Python 3.9.13.

Evaluation Metrics:

Precision Rate (PR): The ratio of the number of correctly reported instances to the number of reported instances.
Recall Rate (RR): The ratio of the number of correctly reported instances to the number of true instances.
F1-Score: $\frac{2 \cdot RR \cdot PR}{RR + PR}$.
Average Relative Error (ARE): $\frac{1}{|Ψ|} \sum_{e \in Ψ} \frac{|V(e) - \hat{V}(e)|}{V(e)}$, where $V(e)$ is the real persistence of item $e$, $\hat{V}(e)$ is its estimated persistence, and $Ψ$ is the query set.
Throughput: We use millions of operations (insertions) per second (Mops) to measure throughput.
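The accuracy metrics above can be computed as in the following minimal sketch. The function names and the dictionary-based inputs are assumptions made for illustration; they are not taken from the released code.

```python
def precision_recall_f1(reported, truth):
    """Return (PR, RR, F1) for a set of reported items against the true set."""
    reported, truth = set(reported), set(truth)
    correct = len(reported & truth)
    pr = correct / len(reported) if reported else 0.0
    rr = correct / len(truth) if truth else 0.0
    f1 = 2 * rr * pr / (rr + pr) if (rr + pr) > 0 else 0.0
    return pr, rr, f1


def average_relative_error(true_persistence, estimated_persistence, query_set):
    """ARE over the query set; every queried item must have true persistence > 0."""
    total = 0.0
    for e in query_set:
        v = true_persistence[e]
        v_hat = estimated_persistence.get(e, 0)  # items missing from the sketch estimate to 0
        total += abs(v - v_hat) / v
    return total / len(query_set)


if __name__ == "__main__":
    print(precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"}))
    print(average_relative_error({"a": 10, "b": 4}, {"a": 8}, ["a", "b"]))
```

Returning 0 for empty sets is a convention chosen here to avoid division by zero; the paper does not specify this edge case.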

Baselines: For persistent item lookup, we evaluate the performance of TP-Sketch against the following baselines: On-Off Sketch [9], P-Sketch [17], Stable-Sketch [18], Pandora [19], Pontus [20], and Hypersistent [22]. Detailed descriptions of these works are provided in Section 2 (Related Work).

5.2. Parameter Settings

The replacement behavior of TP-Sketch is controlled by the key parameter $λ$. Taking inspiration from previous sketch-based methods [17,19], we conducted an empirical evaluation on MAWI 1 with different values of $λ$ and $d$ to determine an appropriate configuration. The F1-scores are shown in Table 1. For both $d = 2$ and $d = 3$, the highest F1-score is achieved at $λ = 10$. Therefore, we chose $λ = 10$.

Analysis: The results show that a $λ$ value that is too large or too small degrades the F1-score. If $λ$ is too small, stored items are replaced too easily. In contrast, a large $λ$ can prevent new persistent items from entering the sketch. Based on these experiments, we set $λ = 10$ to strike a balance: it protects promising persistent items while still giving new items a chance to enter, which improves detection accuracy.
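The trade-off governed by $λ$ can be illustrated numerically. The sketch below assumes the replacement probability $1/(λ\hat{V} + 1)$ as written in the proof of Theorem 3; other passages of the paper group the terms slightly differently, so the exact form should be treated as an assumption. The function name `replacement_probability` is hypothetical.

```python
def replacement_probability(v_hat, lam):
    """Probability that a stored item with estimate v_hat is replaced on a collision.

    Assumed form: 1 / (lam * v_hat + 1).
    """
    if v_hat < 0 or lam <= 0:
        raise ValueError("require v_hat >= 0 and lam > 0")
    return 1.0 / (lam * v_hat + 1)


if __name__ == "__main__":
    # The lam values are taken from the range in Table 1; the estimates are illustrative.
    for lam in (2, 10, 16):
        probs = [replacement_probability(v, lam) for v in (1, 5, 20)]
        print(lam, [round(p, 4) for p in probs])
```

A smaller $λ$ gives every stored item a higher chance of eviction, while a larger $λ$ makes even low-persistence items hard to displace, which matches the two failure modes described in the analysis.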

5.3. Performance Comparison

For persistent item lookup, we evaluated TP-Sketch’s performance across memory sizes ranging from 100 KB to 700 KB. We compared our algorithm with On-Off Sketch [9], Pandora [19], P-Sketch [17], Stable-Sketch [18], Pontus [20], and Hypersistent [22]. Each baseline is configured as in its original code, except for the window counters: because the total number of windows exceeds USHRT_MAX ($65{,}535 = 2^{16} - 1$), we set the window counters to 32 bits. Each cell in On-Off, Pandora, P-Sketch, Stable-Sketch, and Pontus costs 97 bits, 136 bits, 136 bits, 136 bits, and 112 bits, respectively. TP-Sketch uses 97 bits for each cell, which allows it to hold more cells for recording persistent items than Pandora, P-Sketch, Stable-Sketch, and Pontus under the same memory budget. We set $d$ to 5 for TP-Sketch.
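The effect of cell size on capacity can be checked with simple arithmetic. The following minimal sketch counts how many whole cells fit in a given memory budget, using the per-cell sizes stated above. It assumes 1 KB = 1024 bytes and ignores any per-sketch overhead; the names `BITS_PER_CELL` and `cells_for_budget` are hypothetical.

```python
# Per-cell sizes in bits, as stated in the text.
BITS_PER_CELL = {
    "TP-Sketch": 97,
    "On-Off": 97,
    "Pandora": 136,
    "P-Sketch": 136,
    "Stable-Sketch": 136,
    "Pontus": 112,
}


def cells_for_budget(memory_kb, bits_per_cell, bytes_per_kb=1024):
    """Number of whole cells that fit in the budget (1 KB = 1024 bytes is an assumption)."""
    return (memory_kb * bytes_per_kb * 8) // bits_per_cell


if __name__ == "__main__":
    for name, bits in BITS_PER_CELL.items():
        print(name, cells_for_budget(100, bits))
```

Since $136 / 97 \approx 1.40$, a 97-bit cell yields roughly 40% more cells than a 136-bit cell for the same budget.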

5.3.1. Accuracy Comparison

The results are shown in Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9. TP-Sketch achieves higher accuracy than the other algorithms. For example, on MAWI 1 (Figure 5), TP-Sketch obtains the highest F1-score, which is 539.69%, 10.65%, 7.36%, 16.27%, 5.9%, and 16.01% higher than that of On-Off, Pandora, P-Sketch, Stable-Sketch, Pontus, and Hypersistent, respectively. Its precision is, on average, 1226% and 3.18% higher than that of On-Off and P-Sketch, respectively. TP-Sketch also has the highest recall, which is, on average, 20.33%, 19.13%, 12.26%, 31.61%, 11.35%, and 27.81% higher than that of On-Off, Pandora, P-Sketch, Stable-Sketch, Pontus, and Hypersistent, respectively. TP-Sketch reduces ARE by 98.8%, 84.55%, 83.02%, 77.56%, 73.38%, and 56.12% compared to On-Off, Pandora, P-Sketch, Stable-Sketch, Pontus, and Hypersistent, respectively.

We also evaluated the algorithms under extreme traffic distributions using the Zipf 1.5 and Zipf 2.0 datasets. The results are shown in Figure 8 and Figure 9. TP-Sketch achieves the highest F1-score in all tests. Its ARE is also low: for both Zipf 1.5 and Zipf 2.0, it stays below 0.01, which is better than most other algorithms. This indicates that TP-Sketch provides the best overall performance under extreme traffic distributions.

The results emphasize the significant performance improvements of TP-Sketch over existing methods. On-Off Sketch uses a simple swap strategy to replace items. In memory-constrained environments, hash collisions and frequency can lead to the mistaken eviction of persistent items. This reduces detection accuracy. Recall increases with additional memory, but precision remains unstable. This results in an inconsistent F1-score. It requires a substantial amount of memory to achieve accurate results. P-Sketch, Pontus, Pandora, and Stable-Sketch require additional memory to record supplementary statistics. Our algorithm does not need to keep extra statistical information about items. It can hold more buckets. By combining this strategy with the method to protect promising persistent items, our algorithm achieves the best results. Hypersistent is designed for skewed data streams, like the IMC dataset. It performs better than other algorithms, but not as well as our algorithm.

5.3.2. Speed Comparison

We evaluate the update throughput of TP-Sketch and the baselines across various traces and memory configurations. The results in Figure 10 show that TP-Sketch achieves the fastest update speed in all experiments. For MAWI 1, compared with On-Off Sketch [9], Pandora [19], P-Sketch [17], Stable-Sketch [18], Pontus [20], and Hypersistent [22], TP-Sketch improves throughput by 28.93%, 93.98%, 113.21%, 67.52%, 78.81%, and 134.27%, respectively.

Note that the throughput of all approaches decreases as memory increases, because a larger sketch no longer fits entirely in the CPU cache and memory access latency increases.

Analysis: These results show that TP-Sketch is significantly faster than the other algorithms. We briefly explain why. Unlike P-Sketch, Stable-Sketch, Pandora, and Pontus, TP-Sketch uses only one hash operation instead of $d$ operations to locate available buckets, which greatly improves throughput. In addition, at the end of each time window, algorithms such as P-Sketch, Pandora, and Stable-Sketch update both the Flag and the hot/inactive counts, which slows them down, whereas TP-Sketch only needs to update the Flag. Hypersistent’s low speed is due to its multiple filters: when the data stream is not skewed and grows large, the earlier filter layers fill up, causing an item to be hashed multiple times, which degrades performance.

5.4. Effect of Parameters

5.4.1. The Effect of Window Size

We vary the window size for each algorithm to study its effect on accuracy. The window size ranges from 1000 to 5000. The results are shown in Figure 11 and Figure 12. TP-Sketch achieves the highest F1-score in all tests. Compared with On-Off, Pandora, P-Sketch, Stable-Sketch, Pontus, and Hypersistent, TP-Sketch improves the average F1-score by 510.1%, 71.68%, 34.04%, 34.03%, 18.95%, and 56.45% on MAWI 1, respectively. This indicates that TP-Sketch is robust and delivers the best results across different window sizes.

5.4.2. The Effect of Thresholds

To study the effect of the threshold, we set the memory to 100 KB and gradually increased the threshold until TP-Sketch’s F1-score reached nearly 1. The results are shown in Figure 13 and Figure 14. The F1-score of TP-Sketch approaches 1 significantly faster than that of the other algorithms. Compared with On-Off, Pandora, P-Sketch, Stable-Sketch, Pontus, and Hypersistent, TP-Sketch improves the average F1-score by 673.37%, 20.29%, 8.93%, 15.1%, 6.93%, and 19.46% on MAWI 1, respectively. This shows that TP-Sketch is robust and obtains the best results under different threshold settings.

5.4.3. Effect of Parameter d

We keep the window size at 1000 and vary $d$ from 2 to 9, as shown in Figure 15. When $d \leq 5$, the accuracy increases rapidly with $d$. In general, a larger $d$ yields more accurate results for TP-Sketch. However, throughput decreases as $d$ increases, while memory has no significant effect on throughput in our experiment.

Parameter Choice: When $d \geq 3$, accuracy increases significantly, but throughput decreases markedly. Therefore, we choose $d = 2$ when higher throughput with acceptable accuracy is desired, and $d = 3$ when higher accuracy is preferred at the cost of lower throughput.

5.5. Ablation Study

To evaluate the contribution of the threshold strategy, we conducted an ablation study on TP-Sketch. We designed TP-WT Sketch, a variant of TP-Sketch without the threshold component. In TP-WT Sketch, if a hash collision occurs, the replacement probability is set to $\frac{1}{λ \hat{V}(e)}$ (we choose $λ = 10$ in the test) for all items stored in the sketch. The results are shown in Table 2. The F1-score of TP-Sketch is higher than that of TP-WT Sketch in every setting, which validates the effectiveness of the threshold strategy.
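The difference between the two variants can be sketched as a single decision function. This is an illustration under assumptions, not the authors' code: the rule that an item is "promising" when its estimate is at least the current threshold is assumed here, as is the use of the same probability $1/(λ\hat{V})$ for unpromising items in both variants. The function name `eviction_probability` is hypothetical.

```python
def eviction_probability(v_hat, lam, threshold=None):
    """Probability of replacing a stored item on a hash collision.

    threshold=None models TP-WT (no threshold component). Otherwise, items with
    v_hat >= threshold are treated as promising and protected (assumed rule).
    """
    if v_hat <= 0:
        return 1.0  # an empty bucket is always taken
    if threshold is not None and v_hat >= threshold:
        return 0.0  # promising item: never replaced
    return min(1.0, 1.0 / (lam * v_hat))


if __name__ == "__main__":
    for v_hat in (1, 2, 5, 50):
        print(v_hat,
              eviction_probability(v_hat, lam=10),
              eviction_probability(v_hat, lam=10, threshold=3))
```

Without the threshold, even a high-persistence item retains a small but nonzero chance of eviction on every collision, which is the effect the ablation isolates.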

5.6. Case Study

Experiments were conducted on MAWI 1 to verify the algorithm’s ability to detect suspected APT attacks. The time window size was set to 1000. Data items with persistence exceeding the set threshold were considered suspected APT attacks. The experimental results are shown in Figure 16. We set the threshold $θ$ to 0.1 and 0.2, respectively. The results indicate that when memory usage exceeds 30 KB, the F1-score of TP-Sketch approaches 1, and its detection performance (F1-score) exceeds that of the other algorithms.
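The ground truth for such a case study can be computed exactly when memory is not a concern. The following minimal sketch makes two assumptions that the text does not state explicitly: a window consists of a fixed number of consecutive items, and the cutoff is $θ$ times the number of windows. The function names are hypothetical.

```python
from collections import Counter


def exact_persistence(stream, window_size):
    """Count, for each item, the number of windows in which it appears at least once."""
    persistence = Counter()
    num_windows = 0
    for start in range(0, len(stream), window_size):
        window = stream[start:start + window_size]
        persistence.update(set(window))  # each distinct item counts once per window
        num_windows += 1
    return persistence, num_windows


def persistent_items(stream, window_size, theta):
    """Items whose persistence exceeds theta * (number of windows)."""
    persistence, num_windows = exact_persistence(stream, window_size)
    cutoff = theta * num_windows
    return {e for e, v in persistence.items() if v > cutoff}


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "d", "e", "f"]
    print(persistent_items(stream, window_size=2, theta=0.5))
```

Comparing the set reported by a sketch against this exact set yields the precision, recall, and F1-score used throughout the evaluation.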

6. Future Work

Our algorithm has improved in both accuracy and speed. It still has several limitations that need further improvement. The space efficiency is low because we allocate the same memory for each data entry. We can explore shorter fingerprints and allocate less space for low-persistence entries to improve space usage. Research on data streams with different distributions is lacking, especially uniform distributions. Speed has not been fully optimized. This is particularly relevant for skewed datasets that contain substantial duplicate data. We can improve speed by using filters. We have the following directions for future work:

Dynamic Parameter $λ$: A fixed $λ$ has proven effective in prior research [7,17,19]. During our experiments, we recognized the potential benefits of adapting $λ$ to traffic patterns and the current configuration. Many factors can affect the optimal value of $λ$, such as the distribution of the data stream, the available memory size, and the sketch configuration. Future work will explore the relationship between these factors and the optimal $λ$ in order to improve accuracy further.

Memory and Speed Optimization: Because persistence values for most items are small, future developments will use fewer bits, like 8 or 16 bits, to reduce storage overhead and improve accuracy. We will also investigate using SIMD instructions (Single Instruction, Multiple Data) [17] to speed up sketch operations in the future.

Deployment on Programmable Switches: We want to deploy our scheme on programmable switches. This will help to achieve low latency [17,20], high-precision, and efficient persistent flow detection in high-speed networks. This approach fixes the hardware limitations of the current methods. This future direction is suitable for real-time security threat detection (e.g., LDoS; APT) and network traffic monitoring.

7. Conclusions

Finding persistent items is essential for many real-world applications. This paper proposes a novel algorithm, TP-Sketch, to find these items in data streams. We design an algorithm based on the sketch and create a corresponding eviction and replacement algorithm to obtain accurate results. Compared to previous work, the TP-Sketch stores less information about each item in the sketch. This relieves the hash collision of persistent items. During hash collisions, TP-Sketch uses a dynamic threshold to classify promising and unpromising items. For promising items, TP-Sketch will not replace them; for unpromising items, TP-Sketch replaces them with a certain probability. Also, to improve throughput, TP-Sketch uses only one hash operation to locate items during insertion. This simple and effective design makes the TP-Sketch accurate and fast. Compared with state-of-the-art solutions, the TP-Sketch algorithm exhibits the best accuracy and speed.

Author Contributions

Conceptualization, C.Y. and G.Y.; methodology, C.Y. and Y.L.; writing—original draft preparation, C.Y.; writing—review and editing, Y.X., G.Y. and Y.L.; validation, Y.X. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by The National Natural Science Foundation of China (62502529, 62271496).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The code used to support the findings of this study has been deposited in https://github.com/YangChendegit/TP-Sketch.git, accessed on 20 January 2026.

Acknowledgments

During the preparation of this manuscript, the authors used Grammarly to find grammatical mistakes. The authors have reviewed this paper and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kumar, A.; Xu, J.; Wang, J. Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement. IEEE J. Sel. Areas Commun. 2006, 24, 2327–2339.
2. Schweller, R.; Li, Z.; Chen, Y.; Gao, Y.; Gupta, A.; Zhang, Y.; Dinda, P.A.; Kao, M.-Y.; Memik, G. Reversible Sketches: Enabling Monitoring and Analysis Over High-Speed Data Streams. IEEE/ACM Trans. Netw. 2007, 15, 1059–1072.
3. Tanbeer, S.K.; Ahmed, C.F.; Jeong, B.S. Mining regular patterns in data streams. In Proceedings of the Database Systems for Advanced Applications: 15th International Conference, DASFAA 2010, Tsukuba, Japan, 1–4 April 2010; Proceedings, Part I 15; Springer: Berlin/Heidelberg, Germany, 2010; pp. 399–413.
4. Rahman, M.S.; Uddin, M.Y.S.; Hasan, T.; Rahman, M.S.; Kaykobad, M. Using Adaptive Heartbeat Rate on Long-Lived TCP Connections. IEEE/ACM Trans. Netw. 2018, 26, 203–216.
5. Miao, R.; Zhong, Z.; Guo, J.; Li, Z.; Yang, T.; Cui, B. BurstSketch: Finding Bursts in Data Streams. IEEE Trans. Knowl. Data Eng. 2022, 35, 11126–11140.
6. Metwally, A.; Agrawal, D.; El Abbadi, A. Efficient Computation of Frequent and Top-k Elements in Data Streams. In Proceedings of the Database Theory – ICDT 2005; Eiter, T., Libkin, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 398–412.
7. Yang, T.; Jiang, J.; Liu, P.; Huang, Q.; Gong, J.; Zhou, Y.; Miao, R.; Li, X.; Uhlig, S. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the SIGCOMM ’18: Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication; Association for Computing Machinery: New York, NY, USA, 2018; pp. 561–575.
8. Fan, Z.; Hu, Z.; Wu, Y.; Guo, J.; Liu, W.; Yang, T.; Wang, H.; Xu, Y.; Uhlig, S.; Tu, Y. PISketch: Finding persistent and infrequent flows. In Proceedings of the FFSPIN ’22: Proceedings of the ACM SIGCOMM Workshop on Formal Foundations and Security of Programmable Network Infrastructures; Association for Computing Machinery: New York, NY, USA, 2022; pp. 8–14.
9. Zhang, Y.; Li, J.; Lei, Y.; Yang, T.; Li, Z.; Zhang, G.; Cui, B. On-off sketch: A fast and accurate sketch on persistence. Proc. VLDB Endow. 2020, 14, 128–140.
10. Chen, X.; Landau-Feibish, S.; Braverman, M.; Rexford, J. Beaucoup: Answering many network traffic queries, one memory update at a time. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication; Association for Computing Machinery: New York, NY, USA, 2020; pp. 226–239.
11. Nagaraja, S.; Shah, R. Clicktok: Click fraud detection using traffic analysis. In Proceedings of the 12th Conference on Security and Privacy in Wireless and Mobile Networks; Association for Computing Machinery: New York, NY, USA, 2019; pp. 105–116.
12. Cole, E. Advanced Persistent Threat: Understanding the Danger and How to Protect Your Organization; Newnes: Oxford, UK; Waltham, MA, USA, 2012.
13. Huang, H.; Sun, Y.E.; Chen, S.; Tang, S.; Han, K.; Yuan, J.; Yang, W. You Can Drop but You Can’t Hide: K-persistent Spread Estimation in High-speed Networks. In Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications; IEEE: New York, NY, USA, 2018; pp. 1889–1897.
14. Chen, L.; Phan, R.C.W.; Chen, Z.; Huang, D. Persistent items tracking in large data streams based on adaptive sampling. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications; IEEE: New York, NY, USA, 2022; pp. 1948–1957.
15. Lahiri, B.; Chandrashekar, J.; Tirthapura, S. Space-efficient tracking of persistent items in a massive data stream. In Proceedings of the 5th ACM International Conference on Distributed Event-Based System; Association for Computing Machinery: New York, NY, USA, 2011; pp. 255–266.
16. Dai, H.; Shahzad, M.; Liu, A.X.; Zhong, Y. Finding persistent items in data streams. Proc. VLDB Endow. 2016, 10, 289–300.
17. Li, W.; Patras, P. P-Sketch: A Fast and Accurate Sketch for Persistent Item Lookup. IEEE/ACM Trans. Netw. 2023, 32, 987–1002.
18. Li, W.; Patras, P. Stable-sketch: A versatile sketch for accurate, fast, web-scale data stream processing. In Proceedings of the ACM Web Conference 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 4227–4238.
19. Li, W. Pandora: An Efficient and Rapid Solution for Persistence-Based Tasks in High-Speed Data Streams. Proc. ACM Manag. Data 2025, 3, 1–26.
20. Li, W.; Li, Z.; Bütün, B.; Diallo, A.F.; Fiore, M.; Patras, P. Pontus: A Memory-Efficient and High-Accuracy Approach for Persistence-Based Item Lookup in High-Velocity Data Streams. In Proceedings of the ACM on Web Conference 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 1783–1794.
21. Alkasassbeh, M.; Al-Haj Baddar, S. Intrusion detection systems: A state-of-the-art taxonomy and survey. Arab. J. Sci. Eng. 2023, 48, 10021–10064.
22. Cao, L.; Shi, Q.; Xiao, W.; Wang, N.; Li, W.; Li, Z.; Zhang, W.; Xu, M. Hypersistent Sketch: Enhanced Persistence Estimation via Fast Item Separation. In Proceedings of the 2025 IEEE 41st International Conference on Data Engineering (ICDE); IEEE: New York, NY, USA, 2025; pp. 3030–3042.
23. Cormode, G.; Muthukrishnan, S. An improved data stream summary: The count-min sketch and its applications. J. Algorithms 2005, 55, 58–75.
24. Estan, C.; Varghese, G. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. (TOCS) 2003, 21, 270–313.
25. Yang, T.; Zhang, H.; Li, J.; Gong, J.; Uhlig, S.; Chen, S.; Li, X. HeavyKeeper: An accurate algorithm for finding Top-k elephant flows. IEEE/ACM Trans. Netw. 2019, 27, 1845–1858.
26. MAWI Dataset. Available online: https://mawi.wide.ad.jp/mawi/ (accessed on 3 May 2025).
27. The Source Code of Bob Hash. Available online: http://burtleburtle.net/bob/hash/evahash.html (accessed on 20 April 2025).
28. Data Set for IMC 2010 Data Center Measurement. Available online: https://pages.cs.wisc.edu/~tbenson/IMC_DATA/ (accessed on 10 April 2025).
29. Benson, T.; Akella, A.; Maltz, D.A. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement; Association for Computing Machinery: New York, NY, USA, 2010; pp. 267–280.
30. Zipf Data Generator. Available online: https://numpy.net.cn/doc/stable/reference/random/generated/numpy.random.zipf.html (accessed on 10 October 2025).

Figure 1. The fields of P-Sketch and TP-Sketch.

Figure 2. Data structure of TP-Sketch. The red buckets are the buckets to which item e is mapped.

Figure 3. An example of the insertion process within the algorithm. The green item is the arriving item, and a red bucket indicates a bucket that changes.

Figure 4. Theoretical value vs. empirical value.

Figure 5. The result of 7 algorithms on MAWI 1 with different memory constraints.

Figure 6. The result of 7 algorithms on MAWI 2 with different memory constraints.

Figure 7. The result of 7 algorithms on IMC with different memory constraints.

Figure 8. The result of 7 algorithms on Zipf 1.5 with different memory constraints.

Figure 9. The result of 7 algorithms on Zipf 2.0 with different memory constraints.

Figure 10. Speeds.

Figure 11. The result of 7 algorithms on MAWI 1 with different window sizes.

Figure 12. The result of 7 algorithms on MAWI 2 with different window sizes.

Figure 13. The result of 7 algorithms on MAWI 1 with different thresholds.

Figure 14. The result of 7 algorithms on MAWI 2 with different thresholds.

Figure 15. The F1-Score and throughput under different parameter d.

Figure 16. Case study of APT detection.

Table 1. F1-score in persistent item lookup with different values of the parameters $d$ (rows) and $λ$ (columns).

d \ λ	2	4	6	8	10	12	14	16
2	0.929	0.938	0.944	0.939	0.946	0.944	0.945	0.936
3	0.934	0.947	0.951	0.958	0.963	0.952	0.957	0.953

Table 2. The F1-score of the ablation study.

Method	10	20	30	40	50	60
TP	0.514	0.716	0.805	0.856	0.897	0.921
TP-WT	0.476	0.691	0.776	0.845	0.881	0.903

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, C.; Lu, Y.; Yang, G.; Xie, Y. TP-Sketch: A Light-Weight Methodology for Persistent Item Lookup in Data Streams. Appl. Sci. 2026, 16, 2018. https://doi.org/10.3390/app16042018

