Article

Approximating Frequent Items in Asynchronous Data Stream over a Sliding Window

1 Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong, China
2 MADALGO (Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation), Department of Computer Science, Aarhus University, Aarhus C DK-8000, Denmark
* Author to whom correspondence should be addressed.
Algorithms 2011, 4(3), 200-222; https://doi.org/10.3390/a4030200
Submission received: 23 June 2011 / Revised: 23 June 2011 / Accepted: 10 September 2011 / Published: 22 September 2011

Abstract

In an asynchronous data stream, the data items may be out of order with respect to their original timestamps. This paper studies the space complexity required by a data structure to maintain such a data stream so that it can approximate the set of frequent items over a sliding time window with sufficient accuracy. Prior to our work, the best solution was given by Cormode et al. [1], who gave an O((1/∊) log W log(∊B/log W) min{log W, 1/∊} log |U|)-space data structure that can approximate the frequent items within an error bound ∊, where W and B are parameters of the sliding window, and U is the set of all possible item names. We give a more space-efficient data structure that only requires O((1/∊) log W log(∊B/log W) log log W) space.

1. Introduction

Identifying frequent items in a massive data stream has many applications in data mining and network monitoring, and the problem has been studied extensively [2-5]. Recent interest has shifted from the statistics of the whole data stream to those of a sliding window of recent data [6-9]. In most applications, the amount of data in a window is gigantic compared with the amount of memory available in the processing units. It is impossible to store all the data and then find the exact frequent items. Existing research has therefore focused on designing space-efficient data structures that support finding approximate frequent items. The key concern is how to minimize the space needed to achieve a required level of accuracy.

1.1. Asynchronous Data Stream

Most of the previous work on data streams assumes that items in a data stream are synchronous, in the sense that the order of their arrivals is the same as the order of their creations. This synchronous model is, however, not suitable for applications that are distributed in nature. For example, in a sensor network, the sink collects data transmitted from sensors over a large area, and the data transmitted from different sensors suffer different delays. It is possible that an item created at time t at one sensor arrives at the sink later than an item created after t at another sensor. From the sink's viewpoint, items in the data stream are out of order with respect to their creation times. Yet the statistics to be computed are usually based on the creation times. More specifically, an asynchronous data stream (a.k.a. out-of-order data stream) [1,10,11] can be considered as a sequence (a1, t1), (a2, t2), (a3, t3), …, where ai is the name of a data item chosen from a fixed universe U, and ti is an integer timestamp recording the creation time of this item. Items arrive in arbitrary order with respect to their timestamps, and more than one data item may have the same timestamp.

1.2. Previous Work on Approximating Frequent Items

Consider a data stream and, in particular, those data items whose timestamps fall into the last W time units (W is the size of the sliding window). An item (or more precisely, an item name) is said to be a frequent item if its count (i.e., the number of occurrences) exceeds a certain required fraction of the total item count. Arasu and Manku [6] were the first to study approximating frequent items over a sliding window under the synchronous model, in which data items arrive in non-decreasing order of timestamps. The space complexity of their data structure is O((1/∊)(log(1/∊))² log(∊B)), where ∊ is a user-specified error bound and B is the maximum number of items with timestamps falling into the same sliding window. Their work was later improved by Lee and Ting [7] to O((1/∊) log(∊B)) space. Recently, Cormode et al. [1] initiated the study of frequent items under the asynchronous model, and gave a solution with space complexity O((1/∊) log W log(∊B/log W) min{log W, 1/∊} log |U|), where U is the set of possible item names. Later, Cormode et al. [12] gave a hashing-based randomized solution using O((1/∊²) log |U|) space. Its space complexity is quadratic in 1/∊, which is less desirable, but it is a general solution that can also solve other problems such as finding the sum and quantiles.

The earlier work on asynchronous data streams focused on a relatively simpler problem called ∊-approximate basic counting [10,11]. Cormode et al. [1] improved the space complexity of basic counting to O((1/∊) log W log(∊B/log W)). Notice that under the synchronous model, the best data structure requires O((1/∊) log(∊B)) space [9]. It is believed that there is roughly a gap of log W between the synchronous model and the asynchronous model. Yet, for frequent items, the asynchronous result of Cormode et al. [1] has space complexity far larger than that of the best synchronous result, which is O((1/∊) log(∊B)) [7]. This motivates us to study more space-efficient solutions for approximating frequent items in the asynchronous model.

1.3. Formal Definition of Approximate Frequent Item Set

For any time interval I and any data item a, let fa(I) denote the frequency of item a in interval I, i.e., the number of arrived items named a with timestamps falling into I. Define f*(I) = Σ_{a∈U} fa(I) to be the total number of all arrived items with timestamps within I.

Given a user-specified error bound ∊ and a window size W, we want to maintain a data structure to answer any ∊-approximate frequent item set query for any sub-window (specified at query time), which is in the form (ϕ, W′) where ϕ ∈ [∊, 1] is the required threshold and W′ ≤ W is the sub-window size. Suppose that τcur is the current time. The answer to such a query is a set S of item names satisfying the following two conditions:

  • (C1) S contains every item a whose frequency in interval I = [τcur − W′ + 1, τcur] is at least ϕf*(I), i.e., fa(I) ≥ ϕf*(I).

  • (C2) For any item a in S, its frequency in interval I is at least (ϕ − ∊)f*(I), i.e., fa(I) ≥ (ϕ − ∊)f*(I).

The set S is also called an ∊-approximate ϕ-frequent item set. For example, assume ∊ = 1%; then the query (10%, 10000) would return all items whose frequencies in the last 10000 time units are each at least 10% of the total item count, plus possibly some other items with frequency at least 9% of the total count.
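To make the two conditions concrete, here is a minimal Python sketch (ours, purely illustrative) that checks whether a candidate answer S satisfies (C1) and (C2) when the exact frequencies are available; in the streaming setting these exact counts are of course what we cannot afford to store.

from collections import Counter

def is_valid_answer(S, items, phi, eps):
    """Check (C1) and (C2) for a candidate answer S, given the exact
    multiset `items` of names whose timestamps fall in the window I."""
    freq = Counter(items)        # f_a(I) for every item a
    total = sum(freq.values())   # f*(I)
    # (C1): every item with f_a(I) >= phi * f*(I) must be in S.
    if any(f >= phi * total and a not in S for a, f in freq.items()):
        return False
    # (C2): every item in S must satisfy f_a(I) >= (phi - eps) * f*(I).
    return all(freq[a] >= (phi - eps) * total for a in S)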

1.4. Our Contribution

This paper gives a more space-efficient data structure for answering any ∊-approximate frequent item set query. Our data structure uses O((1/∊) log W log(∊B/log W) log log W) words, which is significantly smaller than the one given by Cormode et al. [1] (see Table 1). Furthermore, this space complexity is larger than that of the best synchronous solution by only a factor of O(log W log log W), which is close to the expected gap of O(log W). Similar to existing data structures for this problem, it takes time linear in the data structure's size to answer an ∊-approximate frequent item set query. Furthermore, it takes O(log(∊B/log W)(log(1/∊) + log log W)) time to modify the data structure for a new data item. Occasionally, we might need to clean up some old data items that are no longer significant to the approximation; in the worst case, this takes time linear in the size of the data structure, and thus is no greater than the query time. As a remark, the solution of Cormode et al. [1] requires O(log(∊B/log W) log W log log |U|) time for an update.

In the asynchronous model, if a data item has a delay of more than W time units, it can be discarded immediately when it arrives. In many applications, the delay is usually small. This motivates us to extend the asynchronous model to consider data items that have a bounded delay. We say that an asynchronous data stream has tardiness dmax if a data item created at time t must arrive at the stream no later than time t + dmax. If we set dmax = 0, the model becomes the synchronous model. If we allow dmax ≥ W, this is in essence the asynchronous model studied above. We adapt our data structure to take advantage of small tardiness, so that when dmax is small, it uses less space (see Table 1) and supports a faster update time, namely O(log(∊B/log dmax)(log(1/∊) + log log dmax)). In particular, when dmax = Θ(1), the size and update time of our data structure match those of the best data structure for synchronous data streams.

Remark

This paper is a corrected version of a paper with the same title in WAOA 2009 [13]; in particular, the error bound on the estimates was given incorrectly before and is fixed in this version.

1.5. Technical Digest

To solve the frequent item set problem, we need to estimate the frequency of any item with error at most ∊f*(I), where I = [τcur − W + 1, τcur] is the interval covered by the sliding window. To this end, we first propose a simple data structure for estimating the frequency of a fixed item over the sliding window. Then, we adapt a technique of Misra and Gries [14] to extend our data structure to handle any item. The result is an O(f*(I)/λ)-space data structure that allows us to obtain an estimate for any item with an error bound of about λ log W. Here λ is a design parameter. To ensure that λ log W is no greater than ∊f*(I), we should set λ ≈ ∊f*(I)/log W. Since f*(I) can be as small as Θ((1/∊) log W) (the case of smaller f*(I) can be handled by brute force), we need to be conservative and set λ to some constant. But then the size of the data structure can be Θ(B), because f*(I) can be as large as B. To reduce space, we introduce a multi-resolution approach. Instead of using one single data structure, we maintain a collection of O(log B) copies of our data structure, each using a distinct, carefully chosen parameter λ so that it can estimate the frequent item set with sufficient accuracy when f*(I) is in a particular range. The resulting data structure uses O((1/∊) log W log B) space.

Unfortunately, a careful analysis of our data structure reveals that in the worst case, it can only guarantee estimates with an error bound of ∊f*(H ∪ I), where H = [τcur − 2W + 1, τcur − W], not the required ∊f*(I). The reason is that the error of its estimates over I depends on the number of updates made during I, and unlike in a synchronous data stream, this number can be significantly larger than f*(I) in an asynchronous data stream. For example, at time τcur − W + 1, there may still be many new items (a, u) with timestamps u ∈ H, for which we must update our data structure to get good estimates when the sliding window is at earlier positions. Indeed, the number of updates during I can be as large as f*(H ∪ I), and this gives an error bound of ∊f*(H ∪ I).

To reduce the error bound to ∊f*(I), we introduce a novel algorithm that splits the data structure into independent smaller ones at appropriate times. For example, at time τcur − W + 1, we can split our data structure into two smaller ones, DH and DI, and we will only update DH for items (a, u) with u ∈ H and update DI for those with u ∈ I. Then, when we need to find an estimate on I at time τcur, we only need to consult DI, and the number of updates made to it is f*(I). In this paper, we develop sophisticated procedures to decide when and how to split the data structure so as to enable us to get good enough estimates as the sliding window moves continuously. The resulting data structure has size O((1/∊)(log W)² log(∊B/log W)). Then, we further make the data structure adaptive to the input size, allowing us to reduce the space to O((1/∊)(log log W) log W log(∊B/log W)).

2. Preliminaries

Our data structures for the frequent item set problem depend on data structures for the following two related data stream problems. Let 0 < ∊ < 1 be any real number, and τcur be the current time.

  • The ∊-approximate basic counting problem asks for a data structure that allows us to obtain, for any interval I = [τcur − W′ + 1, τcur] where W′ ≤ W, an estimate f̂*(I) of f*(I) such that |f̂*(I) − f*(I)| ≤ ∊f*(I).

  • The ∊-approximate counting problem asks for a data structure that allows us to obtain, for any item a and any interval I = [τcur − W′ + 1, τcur] where W′ ≤ W, an estimate f̂a(I) of fa(I) such that |f̂a(I) − fa(I)| ≤ ∊f*(I).

As mentioned in Section 1, Cormode et al. [1] gave an O((1/∊) log W log(∊B/log W))-space data structure B for solving the ∊-approximate basic counting problem. In this paper, we give an O((1/∊) log W log(∊B/log W) log log W)-space data structure A for solving the harder ∊-approximate counting problem. The theorem below shows how to use these two data structures to answer an ∊-approximate frequent item set query.

Theorem 1

Let ∊0 = ∊/4. Given B_{∊0} and A_{∊0}, we can answer any ∊-approximate frequent item set query. The total space required is O((1/∊) log W log(∊B/log W) log log W).

Proof

The space requirement is obvious. Consider any ∊-approximate frequent item set query (ϕ, W′) where ϕ ≤ 1 and W′ ≤ W. Let I = [τcur − W′ + 1, τcur]. Since ∊0 = ∊/4, the estimates given by B_{∊0} satisfy |f̂*(I) − f*(I)| ≤ (∊/4)f*(I), and for any item a, the estimates given by A_{∊0} satisfy |f̂a(I) − fa(I)| ≤ (∊/4)f*(I). To answer the query (ϕ, W′), we return the set

Sϕ = {a | f̂a(I) ≥ (ϕ − ∊/2)f̂*(I)}
which satisfies the required conditions (C1) and (C2) because
  • for any item a with fa(I) ≥ ϕf*(I), f̂a(I) ≥ fa(I) − (∊/4)f*(I) ≥ (ϕ − ∊/4)f*(I) ≥ (ϕ − ∊/4)(1/(1 + ∊/4))f̂*(I) ≥ (ϕ − ∊/4)(1 − ∊/4)f̂*(I) ≥ (ϕ − ∊/2)f̂*(I), and thus a ∈ Sϕ; hence (C1) is satisfied, and

  • for every a ∈ Sϕ, we have fa(I) ≥ f̂a(I) − (∊/4)f*(I) ≥ (ϕ − ∊/2)f̂*(I) − (∊/4)f*(I) ≥ (ϕ − ∊/2)(1 − ∊/4)f*(I) − (∊/4)f*(I) ≥ (ϕ − ∊)f*(I); thus (C2) is satisfied.
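The query step of this proof is a single thresholding pass over the estimates. A minimal sketch, assuming est_freq maps candidate items to the estimates f̂a(I) from A_{∊0} and est_total is the estimate f̂*(I) from B_{∊0}:

def frequent_item_set(est_freq, est_total, phi, eps):
    """Return S_phi = {a : est of f_a(I) >= (phi - eps/2) * est of f*(I)}."""
    threshold = (phi - eps / 2) * est_total
    return {a for a, f in est_freq.items() if f >= threshold}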

The building block of A is a data structure that counts items over some fixed interval (instead of the sliding window). For any interval I = [ℓI, rI] of size W, Theorem 4 in Section 4 gives a data structure A_{I,∊} that uses O((1/∊) log W log(∊B/log W) log log W) space, supports O(log(∊B/log W)(log(1/∊) + log log W)) update time, and enables us to obtain, for any item a and any time t ∈ I, an estimate f̂a([t, rI]) of fa([t, rI]) such that

|f̂a([t, rI]) − fa([t, rI])| ≤ ∊f*([t, rI])     (1)

Given A_{I1,∊}, A_{I2,∊}, …, where Ii = [(i − 1)W + 1, iW], we can obtain, for any W′ ≤ W, an estimate f̂a([s, τcur]) of fa([s, τcur]), where s = τcur − W′ + 1, as follows.

  • Let Ii and Ii+1 be the intervals such that [s, τcur] ⊆ Ii ∪ Ii+1.

  • Use A_{Ii,∊} to get an estimate f̂a([s, iW]) of fa([s, iW]), and A_{Ii+1,∊} to get an estimate f̂a([iW + 1, (i + 1)W]) of fa([iW + 1, (i + 1)W]).

  • Our estimate is f̂a([s, τcur]) = f̂a([s, iW]) + f̂a([iW + 1, (i + 1)W]).

By Equation (1), we have

|f̂a([s, iW]) − fa([s, iW])| ≤ ∊f*([s, iW])     (2)
and
|f̂a([iW + 1, (i + 1)W]) − fa([iW + 1, (i + 1)W])| ≤ ∊f*([iW + 1, (i + 1)W])     (3)

Observe that any item that arrives at or before the current time τcur must have timestamp no greater than τcur; hence fa([iW + 1, (i + 1)W]) = fa([iW + 1, τcur]) and f*([iW + 1, (i + 1)W]) = f*([iW + 1, τcur]), and Equation (3) is equivalent to

|f̂a([iW + 1, (i + 1)W]) − fa([iW + 1, τcur])| ≤ ∊f*([iW + 1, τcur])     (4)

Adding Equations (2) and (4), we conclude |f̂a([s, τcur]) − fa([s, τcur])| ≤ ∊f*([s, τcur]), as required.
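This two-interval combination is mechanical; the following sketch shows it, where est_i and est_ip1 are hypothetical query functions standing in for A_{Ii,∊} and A_{Ii+1,∊}:

def estimate_over_window(est_i, est_ip1, a, W, i, W_prime, tau_cur):
    """Combine two fixed-interval estimators covering I_i and I_{i+1}
    to estimate f_a([s, tau_cur]) with s = tau_cur - W_prime + 1.
    est_i(a, t) is assumed to return the estimate of f_a([t, iW]) and
    est_ip1(a, t) the estimate of f_a([t, (i+1)W])."""
    s = tau_cur - W_prime + 1
    assert (i - 1) * W + 1 <= s <= i * W   # s must fall inside I_i
    # The second term counts [iW+1, (i+1)W], which equals the count of
    # [iW+1, tau_cur] because no timestamp exceeds tau_cur.
    return est_i(a, s) + est_ip1(a, i * W + 1)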

Our data structure A_∊ is just the collection of A_{I1,∊}, A_{I2,∊}, …. Note that we only need to physically store in A_∊ the data structures A_{Ii,∊} and A_{Ii+1,∊} where [τcur − W + 1, τcur] ⊆ Ii ∪ Ii+1. The intervals of the earlier ones are no longer covered by the sliding window, and the corresponding A_{I,∊}'s can be thrown away. Together with Theorem 4, we have the following theorem.

Theorem 2

The data structure A_∊ solves the ∊-approximate counting problem. Its space usage is O((1/∊) log W log(∊B/log W) log log W) and it supports O(log(∊B/log W)(log(1/∊) + log log W)) update time.

3. A Simple Data Structure For Frequency Estimation

Let I = [ℓI, rI] be any interval of size W. To simplify notation, we assume that W is a power of 2, so that log W is an integer and we can avoid the floor and ceiling functions. In this section, we describe a simple data structure C_{I,λ,κ} that enables us to obtain, for any item a, a good estimate of a's frequency over I. The parameters λ and κ determine its accuracy and space usage. However, its accuracy is not enough for answering an ∊-approximate frequent item set query. We will explain how to improve the accuracy in the next section.

Roughly speaking, C_{I,λ,κ} is a set of queues Q^a_{I,λ}, i.e., C_{I,λ,κ} = {Q^a_{I,λ} | a ∈ U}. For an item a, the queue Q^a_{I,λ} keeps track of the occurrences of a in I. Each node N in Q^a_{I,λ} is associated with an interval i(N), a value v(N), and a debit d(N); v(N) counts the number of arrived items (a, u) with u ∈ i(N), and d(N) is for implementing a space reduction technique. Initially, Q^a_{I,λ} has only one node N with i(N) = I and v(N) = d(N) = 0. In general, Q^a_{I,λ} is a queue 〈N1, N2, …, Nk〉 of nodes whose intervals form a partition of I, i.e.,

〈i(N1), i(N2), …, i(Nk)〉 = 〈[p1, q1], [p2, q2], …, [pk, qk]〉

where qi−1 + 1 = pi ≤ qi and ∪_{1≤i≤k} [pi, qi] = I. When an item (a, u) with u ∈ I arrives, we update Q^a_{I,λ} as follows.


Q^a_{I,λ}.Update((a, u))

1:find the unique node N in Q^a_{I,λ} with u ∈ i(N) = J = [p, q];
2:increase the value of N by 1, i.e., v(N) = v(N) + 1;
3:if (|J| > 1 and λ units have been added to v(N) since J was assigned to i(N)) then
4: /* refine J */
5: create a new node N′ and insert it to the left of N;
6: let i(N′) = [p, m], i(N) = [m + 1, q] where m = ⌊(p + q)/2⌋;
7: let v(N′) = 0 and d(N′) = 0;
8: /* we make no change to v(N) and d(N) */
9:end if

Figure 1 gives an example of how Q^a_{I,λ} is updated using this procedure.
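The following runnable Python sketch mirrors the Update procedure; the class names and the `added` counter (which implements the test "λ units have been added since J was assigned") are ours, not the paper's notation.

class Node:
    def __init__(self, interval):
        self.interval = interval   # i(N), stored as a tuple (p, q)
        self.value = 0             # v(N)
        self.debit = 0             # d(N)
        self.added = 0             # units added since interval was assigned

class Queue:
    """Sketch of Q^a_{I,lam}: nodes whose intervals partition I,
    ordered from left to right."""
    def __init__(self, I, lam):
        self.lam = lam
        self.nodes = [Node(I)]     # initially one node monitoring all of I

    def update(self, u):
        # Line 1: find the unique node N with u in i(N) = [p, q].
        k = next(j for j, n in enumerate(self.nodes)
                 if n.interval[0] <= u <= n.interval[1])
        N = self.nodes[k]
        N.value += 1               # Line 2
        N.added += 1
        p, q = N.interval
        if q > p and N.added >= self.lam:
            # Lines 4-8: refine J; the new left node starts empty,
            # while N keeps its value and debit but monitors [m+1, q].
            m = (p + q) // 2
            self.nodes.insert(k, Node((p, m)))
            N.interval = (m + 1, q)
            N.added = 0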

Obviously, a direct implementation of C_{I,λ,κ} uses too much space. We now extend a technique of Misra and Gries [14] to reduce the space requirement. For any Q^a_{I,λ}, we say that Q^a_{I,λ} is trivial if the queue contains only a single node N with (i) i(N) = I, and (ii) v(N) = d(N) = 0. Every queue in C_{I,λ,κ} is trivial initially. The key to reducing the space complexity of C_{I,λ,κ} is to maintain the following invariant throughout the execution:

  • (*) There are at most κ non-trivial queues in C_{I,λ,κ}.

We call κ the capacity of C_{I,λ,κ}. The invariant helps us save space because we do not need to store trivial queues physically in memory. To maintain (*), each queue Q^a_{I,λ} supports the following procedure, which is called only when v(Q^a_{I,λ}), the total value of the nodes in Q^a_{I,λ}, is strictly greater than d(Q^a_{I,λ}), the total debit of the nodes in Q^a_{I,λ}.


Q^a_{I,λ}.Debit( )

1:if (v(Q^a_{I,λ}) ≤ d(Q^a_{I,λ})) then
2: return error;
3:else
4: find an arbitrary node N of Q^a_{I,λ} with v(N) > d(N);
5: /* such a node must exist because v(Q^a_{I,λ}) > d(Q^a_{I,λ}) */
6:d(N) = d(N) + 1;
7:end if

Note from the implementation of Debit( ) that v(Q^a_{I,λ}) is always no smaller than d(Q^a_{I,λ}), and for each node N of Q^a_{I,λ}, v(N) ≥ d(N). Furthermore, if v(Q^a_{I,λ}) = d(Q^a_{I,λ}), then v(N) = d(N) for every node N in Q^a_{I,λ}. To maintain (*), C_{I,λ,κ} processes a newly arrived item (a, u) with u ∈ I as follows.


C_{I,λ,κ}.Process((a, u))

1:update Q^a_{I,λ} by calling Q^a_{I,λ}.Update((a, u));
2:if (after the update the number of non-trivial queues becomes κ) then
3: for each non-trivial queue Q^x_{I,λ} do Q^x_{I,λ}.Debit( );
4: for each non-trivial queue Q^x_{I,λ} with v(Q^x_{I,λ}) = d(Q^x_{I,λ}) do
5:  delete all nodes of Q^x_{I,λ} and make it a trivial queue;
6: /* Note that each deleted node N satisfies v(N) = d(N). */
7:end if

It is easy to see that Invariant (*) always holds: Initially the number m of non-trivial queues is zero, and m increases only when Process((a, u)) is called on some trivial Q^a_{I,λ}; in such a case v(Q^a_{I,λ}) becomes 1 and d(Q^a_{I,λ}) remains 0. If m becomes κ after this increase, we will debit, among other queues, Q^a_{I,λ}, and its d(Q^a_{I,λ}) becomes 1 too. It follows that v(Q^a_{I,λ}) = d(Q^a_{I,λ}), and Lines 4–5 will make Q^a_{I,λ} trivial and m become less than κ again.

We are now ready to define C_{I,λ,κ}'s estimate f̂a([t, rI]) of fa([t, rI]) and analyze its accuracy. We need some definitions. For any interval J = [p, q] and any t ∈ I, we say that J covers t if t ∈ [p, q], is to the right of t if t < p, and is to the left of t otherwise. For any item a and any t ∈ I = [ℓI, rI], C_{I,λ,κ}'s estimate of fa([t, rI]) is

  • f̂a([t, rI]) = the value sum of the nodes N currently in Q^a_{I,λ} whose i(N) covers or is to the right of t.

For example, in Figure 1, after the update of the last item (a, 1), we can obtain the estimate f̂a([2, 8]) = 0 + 4 + 5 = 9.
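Building on the Queue sketch above, the next sketch assembles C_{I,λ,κ} with Process, the debit-and-trivialize step, and the estimate f̂a([t, rI]); as in the paper, trivial queues are simply not stored.

class CStructure:
    """Sketch of C_{I,lam,kap}; only non-trivial queues are stored,
    so Invariant (*) holds by construction."""
    def __init__(self, I, lam, kap):
        self.I, self.lam, self.kap = I, lam, kap
        self.queues = {}                      # item name -> Queue

    def process(self, a, u):
        queue = self.queues.setdefault(a, Queue(self.I, self.lam))
        queue.update(u)                       # Line 1 of Process
        if len(self.queues) == self.kap:      # Lines 2-3: debit every queue
            for q in self.queues.values():
                N = next(n for n in q.nodes if n.value > n.debit)
                N.debit += 1
            # Lines 4-5: queues whose total value equals total debit
            # become trivial again; trivial queues are simply dropped.
            self.queues = {x: q for x, q in self.queues.items()
                           if sum(n.value for n in q.nodes) >
                              sum(n.debit for n in q.nodes)}

    def estimate(self, a, t):
        """Value sum of nodes whose interval covers or is to the
        right of t (i.e., right endpoint >= t)."""
        if a not in self.queues:
            return 0
        return sum(n.value for n in self.queues[a].nodes
                   if n.interval[1] >= t)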

Given any node N of Q^a_{I,λ}, we say that N is monitoring a over J, or simply N is monitoring J, if i(N) = J. Note that a node may monitor different intervals during different periods of the execution, and the sizes of these intervals are monotonically decreasing. Observe that although there are about W²/2 possible sub-intervals of the size-W interval I, only about 2W of them would ever be monitored by some node: there is only one such interval of size W, namely I = [ℓI, rI], which gives birth to two such intervals of size W/2, namely [ℓI, m] and [m + 1, rI] where m = ⌊(ℓI + rI)/2⌋, and so on. We call these O(W) intervals interesting intervals. For any two interesting intervals J and H such that J ⊂ H, we say that J is a descendant of H, and H is an ancestor of J. Figure 2 shows all the interesting intervals for I = [1, 8], as well as their ancestor-descendant relationship. The following important fact is easy to verify by induction.

Fact 1

Any two interesting intervals J and H do not cross, although one can contain the other, i.e., either J ⊆ H, or H ⊆ J, or J ∩ H = ∅. Furthermore, any interesting interval has at most log W ancestors.

For any node N, let Int(N) be the set of intervals that have been monitored by N so far. The following fact can be verified from the update procedure.

Fact 2

Consider a node N in Q^a_{I,λ}, where i(N) = J.

  • If J covers or is to the right of t, then all intervals in Int(N) cover or are to the right of t.

  • If J is to the left of t, then all intervals in Int(N) are to the left of t.

We say that N covers or is to the right of t if the intervals in Int(N) cover or are to the right of t; otherwise, N is to the left of t. For any queue Q^a_{I,λ}, let alive(Q^a_{I,λ}) be the set of nodes currently in Q^a_{I,λ}, dead(Q^a_{I,λ}) be those nodes of Q^a_{I,λ} that have already been deleted (because of Line 5 of the procedure Process( )), and node(Q^a_{I,λ}) = alive(Q^a_{I,λ}) ∪ dead(Q^a_{I,λ}). Note that the estimate f̂a([t, rI]) is the value sum of the nodes in alive(Q^a_{I,λ}) that cover or are to the right of t. For simplicity, we express it more succinctly. Let

alive(C_{I,λ,κ}) = ∪ {alive(Q^a_{I,λ}) | Q^a_{I,λ} ∈ C_{I,λ,κ}}

be the set of nodes currently in C_{I,λ,κ}. Define dead(C_{I,λ,κ}) and node(C_{I,λ,κ}) similarly. For any item a and any subset X ⊆ node(C_{I,λ,κ}), let X^a be the set of nodes in X that are monitoring a (and thus are the nodes from Q^a_{I,λ}). For any t ∈ I, let X_{≥t} denote the set of nodes in X that cover or are to the right of t. Define v(X) = Σ_{N∈X} v(N) and d(X) = Σ_{N∈X} d(N). Then, f̂a([t, rI]) can be expressed as follows:

f̂a([t, rI]) = v(alive(Q^a_{I,λ})_{≥t}) = v(alive(C_{I,λ,κ})^a_{≥t})

The following lemma analyzes its accuracy, and also gives the size of C_{I,λ,κ}.

Lemma 3

For any t ∈ I, fa([t, rI]) − (1/κ)f*(I) ≤ f̂a([t, rI]) ≤ fa([t, rI]) + λ log W. Furthermore, C_{I,λ,κ} has size O(f*(I)/λ + κ) words.

Proof

Recall that f̂a([t, rI]) = v(alive(Q^a_{I,λ})_{≥t}). Consider any node N ∈ alive(Q^a_{I,λ})_{≥t}. Note that v(N) = Σ_{J∈Int(N)} vadd(N, J), where vadd(N, J) is the value added to v(N) during the period when i(N) = J. By Fact 2, we can divide it as v(N) = Σ {vadd(N, J) | J covers t} + Σ {vadd(N, J) | J is to the right of t}. It follows that

v(alive(Q^a_{I,λ})_{≥t}) = Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} v(N) = Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J covers t} + Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J is to the right of t}     (5)

Note that Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J is to the right of t} ≤ fa([t, rI]), because if an arrived item (a, u) causes an increase of vadd(N, J) for some J that is to the right of t, then u must be in [t, rI]. By Equation (5), to show the second inequality of the lemma, it suffices to show that So = Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J covers t} = vadd(N1, J1) + vadd(N2, J2) + ⋯ + vadd(Nk, Jk) is no greater than λ log W, as follows.

Without loss of generality, suppose |J1| ≥ |J2| ≥ ⋯ ≥ |Jk|. It can be verified that once an interval J is assigned to a node, it will not be assigned to other nodes; thus the Ji's are distinct. Furthermore, note that for 1 ≤ i < k, Jk ⊆ Ji because (i) t is in both Ji and Jk; (ii) Jk is the smallest interval; and (iii) interesting intervals do not cross; thus Jk is a descendant of Ji, and together with Fact 1, k ≤ log W. By Line 3 of the procedure Update( ), vadd(Ni, Ji) ≤ λ for 1 ≤ i ≤ k. It follows that So ≤ λ log W.

For the first inequality of the lemma, it is clearer to use f̂a([t, rI]) = v(alive(C_{I,λ,κ})^a_{≥t}). Note that every arrived item (a, u) with u ∈ [t, rI] increments the value of some node in node(C_{I,λ,κ})^a_{≥t}; thus fa([t, rI]) ≤ v(node(C_{I,λ,κ})^a_{≥t}) and

fa([t, rI]) − f̂a([t, rI]) ≤ v(node(C_{I,λ,κ})^a_{≥t}) − v(alive(C_{I,λ,κ})^a_{≥t}) = v(dead(C_{I,λ,κ})^a_{≥t})

From Lines 4–6 of the procedure Process( ), when we delete a node N, v(N) = d(N). Hence, v(dead(C_{I,λ,κ})^a_{≥t}) = d(dead(C_{I,λ,κ})^a_{≥t}), which is equal to the total number of debit operations made to these dead nodes. Since whenever we make a debit operation to Q^a_{I,λ}, we also make a debit operation to κ − 1 other queues,

κ d(dead(C_{I,λ,κ})^a_{≥t}) ≤ d(node(C_{I,λ,κ})) ≤ v(node(C_{I,λ,κ})) = f*(I)

In summary, we have fa([t, rI]) − f̂a([t, rI]) ≤ v(dead(C_{I,λ,κ})^a_{≥t}) = d(dead(C_{I,λ,κ})^a_{≥t}) ≤ f*(I)/κ, and the first inequality of the lemma follows.

For the space, we say that a node is born-rich if it is created because of Line 5 of the procedure Update( ) (and thus has λ items under its belt); otherwise it is born-poor. Obviously, there are at most f*(I)/λ born-rich nodes. For born-poor nodes, we need to store at most κ of them because every queue has one born-poor node (the rightmost one), and we only need to store at most κ non-trivial queues; the space bound follows.

If we set λ = λi = ∊2^i/log W and κ = 1/∊, then Lemma 3 asserts that C_{I,λi,1/∊} is an O((f*(I)/(∊2^i)) log W + 1/∊)-space data structure that enables us to obtain, for any item a ∈ U and any timestamp t ∈ I, an estimate f̂a([t, rI]) that satisfies

fa([t, rI]) − ∊f*(I) ≤ f̂a([t, rI]) ≤ fa([t, rI]) + ∊2^i     (6)

If f*(I) does not vary too much, we can determine the i such that f*(I) ≈ 2^i, and C_{I,λi,1/∊} is an O((1/∊) log W)-space data structure that guarantees an error bound of O(∊f*(I)). However, this approach has two obvious shortcomings:

  • f*(I) may vary from some small value to a value as large as B, the maximum number of items falling in a window of size W; hence, there may not be any fixed i that always satisfies f*(I) ≈ 2^i.

  • To estimate fa([t, rI]), we need an error bound of ∊f*([t, rI]), not ∊f*(I).

We will explain how to overcome these two shortcomings in the next section.

4. Our Data Structure for ∊-Approximate Counting

The first shortcoming of the approach given in Section 3 is easy to overcome: a natural idea is to maintain C_{I,λi,1/∊} for different λi to handle different possible values of f*(I). The second shortcoming is more fundamental. To overcome it, we need to modify C_{I,λ,κ} substantially. The result is a new and complicated data structure D_{I,Y}, where Y is an integer determining the accuracy. As asserted in Theorem 7 below, this data structure uses O((1/∊) log W log log W) space, supports O(log(1/∊) + log log W) update time, and for any t ∈ I, it offers the following special guarantee:

  • When f*([t, rI]) ≤ Y, D_{I,Y} can return, for any item a, an estimate f̂a([t, rI]) of fa([t, rI]) such that |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y.

  • When f*([t, rI]) > Y, D_{I,Y} does not have any error bound on its estimate f̂a([t, rI]).

Before giving the details of D_{I,Y}, let us explain how to use it to build the data structure A_{I,∊} mentioned in Section 2 for the ∊-approximate counting problem. To build A_{I,∊}, we need another O((1/∊) log W log(∊B/log W))-space data structure B_{I,∊}, which is a simple adaptation of the data structure B of Cormode et al. [1] for the ∊-approximate basic counting problem; B_{I,∊} enables us to find, for any t ∈ I, an estimate f̂*([t, rI]) of f*([t, rI]) such that

f*([t, rI]) ≤ f̂*([t, rI]) ≤ (1 + ∊)f*([t, rI])     (7)

B_{I,∊} is implemented as follows. During execution, we maintain the data structure B_{∊/4} of Cormode et al. to count the items in the sliding window. When τcur = rI, we duplicate B_{∊/4} and get B′. Then, B′ is updated as if τcur were fixed at rI. To get the estimate f̂*([t, rI]), we first obtain an estimate f′ of f*([t, rI]) from B′, which satisfies |f′ − f*([t, rI])| ≤ (∊/4)f*([t, rI]). Then, f̂*([t, rI]) = (1/(1 − ∊/4))f′. It can be verified that f̂*([t, rI]) satisfies Equation (7). Our data structure A_{I,∊} is composed of (i) B_{I,∊}, and (ii) the structure D^{∊/4}_{I,2^i} (i.e., D_{I,Y} instantiated with error parameter ∊/4 and Y = 2^i) for each integer i from log((1/∊) log W) + 1 to log B. It also maintains a brute-force O((1/∊) log W)-space data structure for remembering the (1/∊) log W items (a, u) with the largest u ∈ I; this brute-force data structure will be used for finding f̂a([t, rI]) only when f*([t, rI]) ≤ (1/∊) log W.

Theorem 4

  • (i) The data structure A_{I,∊} has size O((1/∊)(log log W)(log W) log(∊B/log W)) words, and supports O((log(1/∊) + log log W) log(∊B/log W)) update time.

  • (ii) Given A_{I,∊}, we can find, for any a ∈ U and t ∈ I, an estimate f̂a([t, rI]) of fa([t, rI]) such that |f̂a([t, rI]) − fa([t, rI])| ≤ ∊f*([t, rI]).

Proof

Statement (i) is straightforward because there are log B − log((1/∊) log W) = log(∊B/log W) different D^{∊/4}_{I,2^i}'s, each of which has size O((1/∊)(log log W) log W) and takes O(log(1/∊) + log log W) time for an update. For Statement (ii), we describe how to get the estimate and analyze its accuracy.

First, we use B_{I,∊} to get the estimate f̂*([t, rI]). If f̂*([t, rI]) ≤ (1/∊) log W, then f*([t, rI]) ≤ f̂*([t, rI]) ≤ (1/∊) log W and we can use the brute-force data structure to find fa([t, rI]) exactly. Otherwise, we determine the i with 2^{i−1} < f̂*([t, rI]) ≤ 2^i. Note that

  • i ≥ log((1/∊) log W) + 1 and we have the data structure D^{∊/4}_{I,2^i}, and

  • f*([t, rI]) ≤ f̂*([t, rI]) ≤ 2^i.

We use D^{∊/4}_{I,2^i} to obtain an estimate f̂a([t, rI]) with |f̂a([t, rI]) − fa([t, rI])| ≤ (∊/4)2^i. By Equation (7), 2^{i−1} < f̂*([t, rI]) ≤ (1 + ∊)f*([t, rI]). Combining the two inequalities, we have

|f̂a([t, rI]) − fa([t, rI])| ≤ 2(∊/4)(2^{i−1}) < 2(∊/4)(1 + ∊)f*([t, rI]) ≤ ∊f*([t, rI])
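The dispatch just described can be summarized in a few lines; the following sketch uses hypothetical interfaces standing in for B_{I,∊}, the brute-force store, and the D structures.

import math

def query(a, t, eps, W, est_total, brute_force, d_structures):
    """Sketch of the dispatch in Theorem 4's proof. est_total(t)
    returns the estimate of f*([t, r_I]); brute_force(a, t) returns the
    exact f_a([t, r_I]); d_structures[i] answers with additive error
    (eps/4) * 2^i."""
    f_star = est_total(t)
    if f_star <= (1 / eps) * math.log2(W):
        return brute_force(a, t)          # few items: count exactly
    i = math.ceil(math.log2(f_star))      # 2^(i-1) < estimate <= 2^i
    return d_structures[i].estimate(a, t)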

We now describe the construction of D_{I,Y}. First, we describe an O((1/∊)(log W)²)-space version of the data structure. Then, we show in the next section how to reduce the space to O((1/∊)(log log W) log W). In our discussion, we fix λ = ∊Y/log W and κ = (4/∊) log W.

Initially, D_{I,Y} is just the data structure C_{I,λ,κ}. By Lemma 3, we know that its size is O(f*(I)/λ + κ) = O((f*(I)/(∊Y)) log W + (1/∊) log W), which is O((1/∊) log W) when f*(I) ≤ Y. However, it is much larger than (1/∊) log W when f*(I) ≫ Y, and to maintain small space usage in such a case, we trim C_{I,λ,κ} by throwing away a significant number of nodes. This is acceptable because C_{I,λ,κ} only guarantees good estimates for those t ∈ I with f*([t, rI]) ≤ Y. The trimming process is rather tricky. The natural idea of throwing away all the nodes to the left of t when we find f*([t, rI]) > Y does not work, because the resulting data structure may return estimates with error larger than the required ∊Y bound. For example, let I = [1, W]. For each item ai ∈ {a1, a2, …, aκ−1}, m = ∊Y/κ copies of (ai, t + 1) arrive at time W + t for every t ∈ [0, W − 1]. Also, m copies of (a, W) arrive at time W + t for every t ∈ [0, W − 1]. Hence, at each time W + t, κm = ∊Y items with timestamps in [t, W] arrive, m items for each of the κ item names in {a, a1, …, aκ−1}. We are interested in the accuracy of the estimate f̂a([W, W]). It can be verified that at each time W + t, Lines 4–5 of the procedure Process( ) will eventually trivialize Q^a_{I,λ}, and thus f̂a([W, W]) = 0. Since fa([W, W]) = (t + 1)m, |f̂a([W, W]) − fa([W, W])| = (t + 1)m. When t = 2∊Y/m − 1, the absolute error is 2∊Y, which is larger than the required error bound ∊Y.

To describe the right trimming procedure, we need some basic operations. Consider any C_{J,λ,κ} where J = [p, q]. The following operation splits C_{J,λ,κ} into two smaller data structures C_{Jℓ,λ,κ} and C_{Jr,λ,κ}, where Jℓ = [p, m] and Jr = [m + 1, q] with m = ⌊(p + q)/2⌋.


D_{I,Y}.Split(C_{J,λ,κ})

1:for each non-trivial queue Q^a_{J,λ} ∈ C_{J,λ,κ} do
2: if (Q^a_{J,λ} has only one node N monitoring the whole interval J) then
3:  /* refine J */
4:  insert a new node N′ immediately to the left of N with v(N′) = d(N′) = 0;
5:  i(N′) = Jℓ, and i(N) = Jr;
6: end if
7: divide Q^a_{J,λ} into two sub-queues Q^a_{Jℓ,λ} and Q^a_{Jr,λ} where
8:   Q^a_{Jℓ,λ} contains the nodes monitoring some sub-intervals of Jℓ, and
9:   Q^a_{Jr,λ} contains those monitoring some sub-intervals of Jr;
10: put Q^a_{Jℓ,λ} in C_{Jℓ,λ,κ} and Q^a_{Jr,λ} in C_{Jr,λ,κ}.
11:end for
12:/* For a trivial Q^a_{J,λ}, its two children in C_{Jℓ,λ,κ} and C_{Jr,λ,κ} are also trivial. */

We say that C_{Jℓ,λ,κ} and C_{Jr,λ,κ} are the left and right child of C_{J,λ,κ}, respectively. Figure 3 gives an example of Split(C_{[1,8],λ,κ}), the split of C_{[1,8],λ,κ}, which has three non-trivial queues Q^a_{[1,8],λ}, Q^b_{[1,8],λ} and Q^c_{[1,8],λ}, into C_{[1,4],λ,κ} and C_{[5,8],λ,κ}. Note that the queues for b and c in C_{[1,4],λ,κ} are trivial and we do not store them.
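A sketch of Split in the same Python setting as before (reusing Node, Queue and CStructure); a child queue that receives no nodes is trivial and therefore not stored.

def split(C):
    """Sketch of Split: divide C_{[p,q]} into its left and right
    children."""
    p, q = C.I
    m = (p + q) // 2
    left = CStructure((p, m), C.lam, C.kap)
    right = CStructure((m + 1, q), C.lam, C.kap)
    for a, queue in C.queues.items():
        if len(queue.nodes) == 1:             # one node monitors all of J
            N = queue.nodes[0]                # refine J first (Lines 2-6)
            queue.nodes.insert(0, Node((p, m)))
            N.interval = (m + 1, q)
        lq = Queue((p, m), C.lam)
        rq = Queue((m + 1, q), C.lam)
        lq.nodes = [n for n in queue.nodes if n.interval[1] <= m]
        rq.nodes = [n for n in queue.nodes if n.interval[0] > m]
        if lq.nodes:                          # empty side = trivial queue,
            left.queues[a] = lq               # which is not stored
        if rq.nodes:
            right.queues[a] = rq
    return left, right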

Using Split( ), we can trim, for example, C_{[p,p+1],λ,κ} into C_{[p+1,p+1],λ,κ} as follows: split C_{[p,p+1],λ,κ} into C_{[p,p],λ,κ} and C_{[p+1,p+1],λ,κ}, and throw away C_{[p,p],λ,κ}. The following recursive procedure LeftRefine( ) generalizes this idea to larger J: Given C_{J,λ,κ} = C_{[p,q],λ,κ}, it returns a list 〈C_{J0,λ,κ}, C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 where the Ji's form a partition of [p, q], and J0 = [p, p]. Throwing away C_{J0,λ,κ}, the remaining C_{Ji,λ,κ}'s together monitor [p + 1, q].


D_{I,Y}.LeftRefine(C_{[p,q],λ,κ})

1:if (|[p, q]| = 1) then
2: return 〈C_{[p,p],λ,κ}〉;
3:else
4: split C_{[p,q],λ,κ} into its left child C_{[p,m],λ,κ} and right child C_{[m+1,q],λ,κ};
5: /* where m = ⌊(p + q)/2⌋ */
6: L = LeftRefine(C_{[p,m],λ,κ});
7: suppose L = 〈C_{J0,λ,κ}, C_{J1,λ,κ}, …, C_{Jk,λ,κ}〉;
8: return 〈C_{J0,λ,κ}, …, C_{Jk,λ,κ}, C_{[m+1,q],λ,κ}〉;
9:end if

For example, LeftRefine(C_{[1,8],λ,κ}) gives us the list 〈C_{[1,1],λ,κ}, C_{[2,2],λ,κ}, C_{[3,4],λ,κ}, C_{[5,8],λ,κ}〉. Note that J0 = [p, p] because the recursion stops only when |[p, q]| = 1. The list returned by LeftRefine(C_{[p,q],λ,κ}) has another useful property, which we describe below.
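LeftRefine then translates into a short recursion on top of the split sketch:

def left_refine(C):
    """Sketch of LeftRefine: return the list <C_{J0}, ..., C_{Jm}>
    with J0 = [p, p]."""
    p, q = C.I
    if p == q:
        return [C]
    left, right = split(C)
    return left_refine(left) + [right]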

Given L = 〈C_{Z1,λ,κ}, …, C_{Zk,λ,κ}〉, we say that L is an interesting-partition covering the interval J if (i) the Zi's are all interesting intervals and form a partition of J; and (ii) for 1 ≤ i < k, Zi is to the left of Zi+1, and |Zi| ≤ (1/2)|Zi+1|. The fact below can be verified by induction on the length of the list returned by LeftRefine( ).

Fact 3

Let J = [p, q] be an interesting interval, and L = 〈C_{J0,λ,κ}, …, C_{Jm,λ,κ}〉 be the list returned by LeftRefine(C_{J,λ,κ}). Then, the list 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 (i.e., the list obtained by throwing away the head C_{J0,λ,κ} of L) is an interesting-partition covering [p + 1, q].

For example, if [1, 8] is an interesting interval, then the list 〈C_{[2,2],λ,κ}, C_{[3,4],λ,κ}, C_{[5,8],λ,κ}〉 obtained by throwing away the first element C_{[1,1],λ,κ} from LeftRefine(C_{[1,8],λ,κ}) is an interesting-partition covering [2, 8].

We now give the details of D_{I,Y}. Initially, it is the interesting-partition 〈C_{I,λ,κ}〉 covering the whole interval I = [ℓI, rI]. Throughout the execution, we maintain the following invariant:

  • (**) D_{I,Y} is an interesting-partition covering some [p, rI] ⊆ I.

When D_{I,Y} = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 is covering [p, rI], it only guarantees good estimates of fa([t, rI]) for t ∈ [p, rI], and this estimate is obtained by

f̂a([t, rI]) = v(alive(C_{Jh,λ,κ})^a_{≥t}) + Σ_{h+1≤i≤m} v(alive(C_{Ji,λ,κ})^a)

(or equivalently, f̂a([t, rI]) = v(alive(Q^a_{Jh,λ})_{≥t}) + Σ_{h+1≤i≤m} v(alive(Q^a_{Ji,λ}))), where Jh is the interval in {J1, J2, …, Jm} that covers t. When an item (a, u) with u ∈ [p, rI] arrives, we find the unique C_{Ji,λ,κ} in D_{I,Y} with u ∈ Ji and update it by calling C_{Ji,λ,κ}.Process((a, u)). Note that this update has no effect on the other C_{J,λ,κ}'s in D_{I,Y}.

During execution, we also keep track of the largest timestamp pmax ∈ I such that the estimate f̂*([pmax, rI]) given by B_{I,∊} is greater than (1 + ∊)Y (which implies f*([pmax, rI]) > Y because of Equation (7)). As soon as pmax falls in the interval covered by D_{I,Y}, we use the following procedure to trim D_{I,Y} so that it covers the smaller interval [pmax + 1, rI].

Suppose that L = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 is an interesting-partition covering [p, rI], and t ∈ [p, rI]. Trim(L, t) constructs an interesting-partition covering [t + 1, rI] recursively as follows.


D_{I,Y}.Trim(L, t)

1:find the unique C_{Ji,λ,κ} in L such that t ∈ Ji;
2:L′ = LeftRefine(C_{Ji,λ,κ});
3:suppose L′ = 〈C_{K0,λ,κ}, C_{K1,λ,κ}, …, C_{Kℓ,λ,κ}〉;
4:if (K0 = [t, t]) then
5: return 〈C_{K1,λ,κ}, …, C_{Kℓ,λ,κ}, C_{Ji+1,λ,κ}, …, C_{Jm,λ,κ}〉;
6: /* i.e., throw away C_{J1,λ,κ}, …, C_{Ji−1,λ,κ} and C_{K0,λ,κ}, */
7: /* and return an interesting-partition covering [t + 1, rI]. */
8:else
9: return Trim(〈C_{K1,λ,κ}, …, C_{Kℓ,λ,κ}, C_{Ji+1,λ,κ}, …, C_{Jm,λ,κ}〉, t).
10: /* throw away C_{J1,λ,κ}, …, C_{Ji−1,λ,κ} and C_{K0,λ,κ} */
11:end if

For example, Figure 4 shows that when D_{I,Y} = 〈C_{[2,2],λ,κ}, C_{[3,4],λ,κ}, C_{[5,8],λ,κ}〉, Trim(D_{I,Y}, 3) returns 〈C_{[4,4],λ,κ}, C_{[5,8],λ,κ}〉. Based on Fact 3, it can be verified inductively that after D_{I,Y} ← Trim(D_{I,Y}, pmax), the new D_{I,Y} is an interesting-partition covering [pmax + 1, rI]; Invariant (**) is preserved. In the rest of this section, we analyze the size of D_{I,Y} and the accuracy of its estimates.
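And Trim, again as a sketch; here L is a Python list of CStructure objects ordered from left to right:

def trim(L, t):
    """Sketch of Trim: turn an interesting-partition L covering
    [p, r_I] into one covering [t+1, r_I]."""
    i = next(j for j, C in enumerate(L) if C.I[0] <= t <= C.I[1])
    refined = left_refine(L[i])
    head, rest = refined[0], refined[1:] + L[i + 1:]
    # head and everything to the left of L[i] are thrown away
    if head.I == (t, t):
        return rest
    return trim(rest, t)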

Let All be the set of all C_{J,λ,κ}'s that have ever existed, i.e., if C_{J,λ,κ} ∈ All, then either (i) it is currently in D_{I,Y}, or (ii) it was in D_{I,Y} some time earlier in the execution but was thrown away during some trimming of D_{I,Y}. For any p ∈ I, define

All_{≥p} = {C_{J,λ,κ} | C_{J,λ,κ} ∈ All, and J covers or is to the right of p}

Let vadd(C_{J,λ,κ}) be the total value added to the nodes of C_{J,λ,κ} during its lifespan. We now derive an upper bound on Σ_{C_{J,λ,κ} ∈ All_{≥p}} vadd(C_{J,λ,κ}), which is crucial for obtaining a tight error bound on the accuracy of D_{I,Y}'s estimates.

Recall that initially D_{I,Y} = 〈C_{I,λ,κ}〉, and thus C_{I,λ,κ} ∈ All. Any other C_{J,λ,κ} ∈ All must be a child of some C_{H,λ,κ} ∈ All (i.e., C_{J,λ,κ} is obtained from Split(C_{H,λ,κ})). Given C_{J,λ,κ} and C_{H,λ,κ}, we say that C_{J,λ,κ} is a descendant of C_{H,λ,κ}, and C_{H,λ,κ} is an ancestor of C_{J,λ,κ}, if either (i) C_{J,λ,κ} is a child of C_{H,λ,κ}, or (ii) it is a child of one of C_{H,λ,κ}'s descendants. Note that the original C_{I,λ,κ} is an ancestor of every other C_{J,λ,κ} ∈ All, and in general, any C_{H,λ,κ} ∈ All is an ancestor of every C_{J,λ,κ} ∈ All with J ⊂ H. We have the following lemma. (Note that we are abusing notation here and regard D_{I,Y} as a set.)

Lemma 5

Suppose that D_{I,Y} = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 is covering [p, rI]. Let anc(D_{I,Y}) = anc(〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉) be the set {C_{H,λ,κ} | C_{H,λ,κ} is an ancestor of some C_{Ji,λ,κ} ∈ D_{I,Y}}. Then,

  • (1) All_{≥p} ⊆ D_{I,Y} ∪ anc(D_{I,Y}),

  • (2) vadd(C_{J,λ,κ}) ≤ (1 + ∊)Y for any C_{J,λ,κ} ∈ All, and

  • (3) |D_{I,Y} ∪ anc(D_{I,Y})| ≤ 2 log W.

Therefore, Σ {vadd(C_{J,λ,κ}) | C_{J,λ,κ} ∈ All_{≥p}} ≤ 2(1 + ∊)Y log W.

Proof

For (1), it suffices to prove that for any C_{J,λ,κ} ∈ All_{≥p}, C_{J,λ,κ} ∈ D_{I,Y} ∪ anc(D_{I,Y}). By definition, J covers or is to the right of p; thus J ∩ (J1 ∪ ⋯ ∪ Jm) = J ∩ [p, rI] ≠ ∅. Since the intervals are interesting and do not cross, there is an i, 1 ≤ i ≤ m, such that either (i) J = Ji, and thus C_{J,λ,κ} ∈ D_{I,Y}, or (ii) Ji ⊂ J, which implies that C_{J,λ,κ} is an ancestor of C_{Ji,λ,κ}, i.e., C_{J,λ,κ} ∈ anc(D_{I,Y}). (It is not possible that J ⊂ Ji; otherwise C_{Ji,λ,κ} would have been split and would not be in the current D_{I,Y}.) Hence, C_{J,λ,κ} ∈ D_{I,Y} ∪ anc(D_{I,Y}).

To prove (2), suppose that J = [x, y] and vadd(C_{J,λ,κ}) has just reached (1 + ∊)Y. This implies f*([x, rI]) ≥ (1 + ∊)Y, and so does its estimate f̂*([x, rI]) given by B_{I,∊} (as f*([x, rI]) ≤ f̂*([x, rI]), by Equation (7)). Then, the procedure Trim( ) will be called, and C_{J,λ,κ} will be either thrown away or split, so no more value can be added to C_{J,λ,κ}. It follows that vadd(C_{J,λ,κ}) ≤ (1 + ∊)Y.

For (3), recall that D_{I,Y} = 〈C_{J1,λ,κ}, C_{J2,λ,κ}, …, C_{Jm,λ,κ}〉. Among the intervals J1, …, Jm, interval J1 is the leftmost, and its left boundary is ℓJ1 = p. We now prove that D_{I,Y} ∪ anc(D_{I,Y}) = D_{I,Y} ∪ anc(C_{J1,λ,κ}), where anc(C_{J1,λ,κ}) is the set of ancestors of C_{J1,λ,κ}. Then, together with the facts that |D_{I,Y}| ≤ log W (by Property (ii) of interesting-partitions) and |anc(C_{J1,λ,κ})| ≤ log W (as each Split operation reduces the size of the interval by half), we have

|D_{I,Y} ∪ anc(D_{I,Y})| = |D_{I,Y} ∪ anc(C_{J1,λ,κ})| ≤ |D_{I,Y}| + |anc(C_{J1,λ,κ})| ≤ 2 log W

To show D_{I,Y} ∪ anc(D_{I,Y}) = D_{I,Y} ∪ anc(C_{J1,λ,κ}), it suffices to show that for any C_{H,λ,κ} ∈ anc(D_{I,Y}), C_{H,λ,κ} ∈ anc(C_{J1,λ,κ}). Since C_{H,λ,κ} ∈ anc(D_{I,Y}), it is the ancestor of some C_{Ji,λ,κ} ∈ D_{I,Y}; thus Ji = [ℓJi, rJi] ⊂ H = [ℓH, rH]. Since C_{H,λ,κ} is already an ancestor, it no longer exists, and all the C_{J,λ,κ}'s to its left have been thrown away. Thus, D_{I,Y} has no C_{J,λ,κ} where J is to the left of H. This implies ℓH ≤ p = ℓJ1, and ℓH ≤ ℓJ1 ≤ rJ1 ≤ rJi ≤ rH. It follows that J1 ⊆ H and C_{H,λ,κ} is an ancestor of C_{J1,λ,κ}, i.e., C_{H,λ,κ} ∈ anc(C_{J1,λ,κ}).

We are now ready to analyze the accuracy of D_{I,Y}'s estimates.

Theorem 6

Suppose that D_{I,Y} is covering [p, rI]. For any item a and any t ∈ [p, rI], the estimate f̂a([t, rI]) of fa([t, rI]) obtained by D_{I,Y} satisfies |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y. Furthermore, D_{I,Y} uses O((1/∊)(log W)²) space.

Proof

Let alive(D_{I,Y}) be the set of nodes currently in D_{I,Y}, dead(D_{I,Y}) the set of those that were in D_{I,Y} earlier in the execution but have been deleted, and node(D_{I,Y}) = alive(D_{I,Y}) ∪ dead(D_{I,Y}). It can be verified that f̂a([t, rI]) = v(alive(D_{I,Y})^a_{≥t}). Below, we prove that

fa([t, rI]) − 2(1 + ∊)Y log W/κ ≤ v(alive(D_{I,Y})^a_{≥t}) ≤ fa([t, rI]) + λ log W     (8)

Recall that we fixed λ = ∊Y/log W and κ = (4/∊) log W; the ∊Y error bound follows.

The proof of the second inequality of Equation (8) is identical to that of Lemma 3, except that we replace all occurrences of C_{I,λ,κ} by D_{I,Y}. The proof of the first inequality is also similar. We still have

fa([t, rI]) − v(alive(D_{I,Y})^a_{≥t}) ≤ v(node(D_{I,Y})^a_{≥t}) − v(alive(D_{I,Y})^a_{≥t}) = v(dead(D_{I,Y})^a_{≥t})

which equals d(dead(D_{I,Y})^a_{≥t}). As in Lemma 3, we can derive the bound d(dead(D_{I,Y})^a_{≥t}) ≤ (1/κ)v(node(D_{I,Y})) = (1/κ)f*(I), but we can do better here.

Observe that any node N ∈ dead(D_{I,Y})^a_{≥t} can only be in those C_{J,λ,κ} ∈ All_{≥p} (because t ∈ [p, rI]), and when we debit N, if it is in C_{J,λ,κ}, then we debit κ − 1 other nodes in C_{J,λ,κ} monitoring κ − 1 items other than a. Thus, κ d(dead(D_{I,Y})^a_{≥t}) is no more than the total value available in the C_{J,λ,κ} ∈ All_{≥p}, which is Σ {vadd(C_{J,λ,κ}) | C_{J,λ,κ} ∈ All_{≥p}}. Together with Lemma 5, we conclude

κ d(dead(D_{I,Y})^a_{≥t}) ≤ Σ {vadd(C_{J,λ,κ}) | C_{J,λ,κ} ∈ All_{≥p}} ≤ 2(1 + ∊)Y log W

and the first inequality of Equation (8) follows.

For the size of D_{I,Y}, similarly to the proof of Lemma 3, we can argue that the number of born-rich nodes is only O(Y/λ) = O((1/∊) log W), but the number of born-poor nodes can be much larger. A born-poor node of a non-trivial queue is created either when we increase the value of a trivial queue, or when we execute Lines 2–6 of procedure Split( ). It can be verified that every queue Q^a_{J,λ} has at most one born-poor node, which is the rightmost node in Q^a_{J,λ}. Since there are O(log W) C_{J,λ,κ}'s in D_{I,Y} and each has at most κ non-trivial queues, the number of born-poor nodes, and hence the size of D_{I,Y}, is O(κ log W) = O((1/∊)(log W)²).

To reduce D_{I,Y}'s size from O((1/∊)(log W)²) to O((1/∊)(log log W) log W), we need to reduce the number of born-poor nodes, or equivalently, the number of non-trivial queues in D_{I,Y}. In the next section, we give a simple idea for reducing the number of non-trivial queues and hence the size of D_{I,Y} to O((1/∊)(log log W) log W). In Section 6, we show how to further reduce the size by taking advantage of the tardiness of the data stream.

5. Reducing the Size of D_{I,Y}

Our idea for reducing the size is simple: for every C_{J,λ,κ} ∈ D_{I,Y}, the capacity is no longer fixed at κ = (4/∊) log W; instead, we start with a much smaller capacity, namely (4/∊) log log W, which is allowed to increase gradually during execution. To determine C_{J,λ,κ}'s capacity, we use a variable to keep track of the number f̄*(J) of items (a, u) with u ∈ J that have arrived since C_{J,λ,κ}'s creation. Let vJ be the total value of the nodes in C_{J,λ,κ} when it is created (vJ may not be zero if C_{J,λ,κ} results from the splitting of its parent). The capacity of C_{J,λ,κ} is determined as follows.

  • When (c − 1)Y/log W ≤ vJ + f̄*(J) < cY/log W for some integer c ≥ 1, the capacity of C_{J,λ,κ} is κ(c) = (4c/∊) log log W, i.e., we set κ = κ(c) and allow up to κ(c) non-trivial queues in C_{J,λ,κ}.

Note that when we increase the capacity of C_{J,λ,κ} to κ(c), we do not need to do anything, except that we now allow more non-trivial queues (up to κ(c)) in the data structure. Also note that when C_{J,λ,κ} is created during the trimming process, its inherited capacity may be larger than the supposed capacity κ(c); in such a case, we simply debit every non-trivial queue until some queue Q^x_{J,λ} has v(Q^x_{J,λ}) = d(Q^x_{J,λ}), and then we execute Lines 4 and 5 of the procedure Process( ) to make this queue trivial. We repeat this until the number of non-trivial queues is at most κ(c). The following theorem asserts that D_{I,Y} maintains the accuracy of its estimates under this new implementation, and gives the revised size and update time.
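The capacity rule is easy to state in code; a sketch (the function name and argument list are ours):

import math

def capacity(v_J, f_bar, Y, W, eps):
    """Sketch of the capacity rule: find the bucket index c with
    (c-1)Y/log W <= v_J + f_bar < cY/log W, then return kappa(c)."""
    c = math.floor((v_J + f_bar) * math.log2(W) / Y) + 1
    return math.ceil((4 * c / eps) * math.log2(math.log2(W)))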

Theorem 7

  • (1) Suppose that D_{I,Y} is currently covering [p, rI]. For any item a ∈ U and any timestamp t ∈ [p, rI], the estimate f̂a([t, rI]) of fa([t, rI]) obtained by the new D_{I,Y} satisfies |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y.

  • (2) D_{I,Y} has size O((1/∊)(log log W) log W), and supports O(log(1/∊) + log log W) update time.

Proof

Suppose that D_{I,Y} = 〈C_{J1,λ,κ(c1)}, …, C_{Jm,λ,κ(cm)}〉. From the fact that we are using C_{Ji,λ,κ(ci)} to monitor Ji, we conclude (ci − 1)Y/log W ≤ vJi + f̄*(Ji). It follows that Σ_{1≤i≤m} ciY/log W ≤ Σ_{1≤i≤m} (vJi + f̄*(Ji)) + Σ_{1≤i≤m} Y/log W, which is O(Y) because (i) |D_{I,Y}| = m = O(log W) and (ii) Σ_{1≤i≤m} (vJi + f̄*(Ji)) = O(Y) (otherwise D_{I,Y} would have been trimmed). Thus,

Σ_{1≤i≤m} ci = O(log W)     (9)

For Statement (1), the analysis of the accuracy of f̂a([t, rI]) is very similar to that of Theorem 6, except for the following difference: In the proof of Theorem 6, we show that d(dead(D_{I,Y})^a_{≥t}) ≤ 2(1 + ∊)Y log W/κ, and since κ is fixed at (4/∊) log W, d(dead(D_{I,Y})^a_{≥t}) ≤ ∊Y. Here, we also prove that d(dead(D_{I,Y})^a_{≥t}) ≤ ∊Y, but we have to prove it differently because the capacities are no longer fixed.

As argued previously, any node in dead(D_{I,Y})^a_{≥t} is in some C_{J,λ,κ} ∈ All_{≥p}. Below, we show that for any C_{J,λ,κ} ∈ All_{≥p}, we can make at most ∊Y/(2 log W) debit operations to the queue Q^a_{J,λ} of C_{J,λ,κ} during its lifespan. Together with the fact that |All_{≥p}| ≤ 2 log W, we have d(dead(D_{I,Y})^a_{≥t}) ≤ ∊Y.

Consider any C_{J,λ,κ} ∈ All_{≥p}. Note that the smaller its capacity, the larger the number of debit operations that can be made to the queue Q^a_{J,λ} of C_{J,λ,κ}. To maximize the number of debit operations made to Q^a_{J,λ}, suppose that vJ = 0, so that C_{J,λ,κ} has the smallest capacity κ(1) when it is created. Before its capacity increases to κ(2), C_{J,λ,κ} can make at most (1/κ(1))(Y/log W) debit operations to Q^a_{J,λ}. Then, during the next Y/log W arrivals of items (a, u) with u ∈ J, i.e., while Y/log W ≤ vJ + f̄*(J) < 2Y/log W, the capacity is κ(2), and at most (1/κ(2))(Y/log W) debit operations can be made to Q^a_{J,λ}. In general, during the period when (c − 1)Y/log W ≤ vJ + f̄*(J) < cY/log W, at most (1/κ(c))(Y/log W) debit operations can be made to Q^a_{J,λ}. If the largest capacity is κ(cmax), the total number of debit operations made to Q^a_{J,λ} is at most

(Y/log W)(1/κ(1) + ⋯ + 1/κ(cmax)) = (∊Y/(4(log log W) log W))(1 + 1/2 + ⋯ + 1/cmax) ≤ ∊Y(ln(cmax) + 1)/(4(log log W) log W)

which is smaller than ∊Y/(2 log W) because, by Equation (9), cmax = O(log W), which implies ln(cmax) + 1 ≤ 2 log log W (assuming W is larger than some constant).

We now prove (2). Note that the total number of non-trivial queues in D_{I,Y}, and hence the number of born-poor nodes, is at most Σ_{1≤i≤m} κ(ci) = Σ_{1≤i≤m} (4ci/∊) log log W. By Equation (9), Σ_{1≤i≤m} ci = O(log W), and it follows that the size of D_{I,Y} is O((1/∊)(log log W) log W).

For the update time, suppose that an item (a, u) arrives. We can find the C_{Ji,λ,κ} in D_{I,Y} = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 with u ∈ Ji in O(log m) = O(log log W) time by querying a balanced search tree storing the Ji's. By hashing (e.g., Cuckoo hashing [15], which supports constant update and query time), we can locate the queue Q^a_{Ji,λ} ∈ C_{Ji,λ,κ} in constant time. Then, by consulting an auxiliary balanced search tree on the intervals monitored by the nodes of Q^a_{Ji,λ}, we can find and update the node N of Q^a_{Ji,λ} with u ∈ i(N) in O(log(Y/λ)) = O(log(1/∊) + log log W) time. At times we may also need to execute Lines 3 and 4 of the procedure Process( ), which debit all the non-trivial queues in C_{Ji,λ,κ}. Using the de-amortization technique given in [16], this step takes constant time.

Note that occasionally, we may also need to clean up D_{I,Y} by calling Trim( ); this step takes time linear in the size of D_{I,Y}, which is O((1/∊)(log log W) log W).

6. Further Reducing the Size of D_{I,Y} for Streams with Small Tardiness

Recall that in an out-of-order data stream with tardiness dmax ∈ [0, W], any item (a, u) arriving at time τcur satisfies u ≥ τcur − dmax; in other words, the delay of any item is guaranteed to be at most dmax. This section extends D_{I,Y} to a data structure D̂_{I,Y} that takes advantage of this maximum delay guarantee to reduce the space usage. The idea is as follows. Since no new item can have a timestamp smaller than τcur − dmax, we will not make any further change to the nodes to the left of τcur − dmax, and hence we can consolidate these nodes to reduce space substantially. To handle the nodes with timestamps in [τcur − dmax, τcur], we use the data structure given in Section 5; since it monitors an interval of size dmax instead of W, its size is O((1/∊)(log log dmax) log dmax) instead of O((1/∊)(log log W) log W).

To implement $\widehat{D}_{I,\epsilon Y}$, we need a new operation called consolidate. Consider any list of queues $\langle Q^a_{J_1,\lambda}, Q^a_{J_2,\lambda}, \ldots, Q^a_{J_m,\lambda}\rangle$, where $J_1, J_2, \ldots, J_m$ are ordered from left to right and form a partition of the interval $J_{1\ldots m} = J_1 \cup \cdots \cup J_m$. We consolidate them into a single queue $Q^a_{J_{1\ldots m},\lambda}$ as follows:

  • Concatenate the queues into a single queue in which the nodes preserve the left-to-right order.

  • Starting from the leftmost node, examine every node $N$ from left to right; if $N$ is not the rightmost node and $v(N) < \lambda$, merge it with the node $N'$ immediately to its right, i.e., delete $N$ and set $v(N') = v(N) + v(N')$, $d(N') = d(N) + d(N')$, and $i(N') = i(N) \cup i(N')$.

Note that after the consolidation, the resulting queue $Q^a_{J_{1\ldots m},\lambda}$ has at most one node (the rightmost one) with value smaller than $\lambda$; a sketch of this operation follows.
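For concreteness, the following Python sketch implements the consolidation of a single item's queues, reusing the illustrative Node class from the earlier sketch; representing a queue as a Python list of nodes is our assumption:

    def consolidate(queues, lam):
        # Step 1: concatenate, preserving the left-to-right order of nodes.
        nodes = [n for q in queues for n in q]
        out, pending = [], None            # `pending` is a light node awaiting its merge
        for n in nodes:
            if pending is not None:        # merge pending into its right neighbour N'
                n.v += pending.v           # v(N') = v(N) + v(N')
                n.d += pending.d           # d(N') = d(N) + d(N')
                n.lo = pending.lo          # i(N') = i(N) U i(N'); intervals are adjacent
                pending = None
            if n.v < lam:                  # a non-rightmost light node keeps merging right
                pending = n
            else:
                out.append(n)
        if pending is not None:            # only the rightmost node may stay below lam
            out.append(pending)
        return out

The scan is a single left-to-right pass, so consolidating queues with $s$ nodes in total takes $O(s)$ time.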

Given the list $\langle\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$, we consolidate them into $\mathcal{C}_{J_{1\ldots m},\lambda,1/\epsilon}$ by first consolidating, for each item $a$, the queues $Q^a_{J_1,\lambda}, \ldots, Q^a_{J_m,\lambda}$ in $\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}$ into the queue $Q^a_{J_{1\ldots m},\lambda}$ and putting it in $\mathcal{C}_{J_{1\ldots m},\lambda,1/\epsilon}$. Then, we apply Lines 3–5 of procedure Process( ) repeatedly to reduce the number of non-trivial queues in the data structure to $\frac{1}{\epsilon}$.

We are now ready to describe how to extend $D_{I,\epsilon Y}$ to $\widehat{D}_{I,\epsilon Y}$. In our discussion, we fix $\lambda = \epsilon Y/\log d_{\max}$, and without loss of generality, we assume that $I = [1, W]$. Recall that $p_{\max}$ denotes the largest timestamp in $I$ such that $\bar{f}([p_{\max}, r_I]) > (1+\epsilon)Y$ (which implies $f^*([p_{\max}, r_I]) > Y$). We partition $I$ into sub-windows $I_1, I_2, \ldots, I_m$, each of size $d_{\max}$ (i.e., $I_i = [(i-1)d_{\max}+1, i\,d_{\max}]$). We divide the execution into different periods according to $\tau_{\mathrm{cur}}$, the current time.

  • During the 1st period, when $\tau_{\mathrm{cur}} \in [1, d_{\max}] = I_1$, $\widehat{D}_{I,\epsilon Y}$ is simply $D_{I_1,\epsilon Y}$.

  • During the 2nd period, when $\tau_{\mathrm{cur}} \in I_2$, $\widehat{D}_{I,\epsilon Y}$ maintains $D_{I_2,\epsilon Y}$ in addition to $D_{I_1,\epsilon Y}$.

  • During the 3rd period, when $\tau_{\mathrm{cur}} \in I_3$, $\widehat{D}_{I,\epsilon Y}$ maintains $D_{I_3,\epsilon Y}$ in addition to $D_{I_2,\epsilon Y}$. Also, $D_{I_1,\epsilon Y} = \langle\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$ is consolidated into $\mathcal{C}_{I_1,\lambda,1/\epsilon}$.

  • In general, during the $i$th period, when $\tau_{\mathrm{cur}} \in [(i-1)d_{\max}+1, i\,d_{\max}] = I_i$, $\widehat{D}_{I,\epsilon Y}$ maintains $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$, and also $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$, where $I_{1\ldots i-2} = I_1 \cup I_2 \cup \cdots \cup I_{i-2}$. Observe that in this period, no item $(a, u)$ with $u \in I_{1\ldots i-2}$ arrives (because the tardiness is $d_{\max}$), and thus we do not need to update $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$. However, we keep throwing away any node $N$ in $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ as soon as we know $i(N)$ is to the left of $p_{\max}+1$.

  • When entering the $(i+1)$st period, we do the following: keep $D_{I_i,\epsilon Y}$, create $D_{I_{i+1},\epsilon Y}$, merge $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ with $D_{I_{i-1},\epsilon Y} = \langle\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$, and then obtain $\mathcal{C}_{I_{1\ldots i-1},\lambda,1/\epsilon}$ by consolidating $\langle\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}, \mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$. (A sketch of this step is given after the list.)
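The per-period bookkeeping can be summarized by the following schematic, again for the queues of one fixed item $a$; enter_next_period, new_structure and the state keys are hypothetical names, and consolidate is the sketch given earlier:

    def enter_next_period(state, lam, new_structure):
        # state['prefix']  : the consolidated queue of C_{I_{1..i-2},lambda,1/eps}
        # state['older']   : the queues of D_{I_{i-1},eps Y}, about to be frozen
        # state['current'] : the queues of D_{I_i,eps Y}
        frozen = state.pop('older', None)
        if frozen is not None:
            # No item with so old a timestamp can still arrive (tardiness <= d_max),
            # so fold D_{I_{i-1}} into the consolidated prefix C_{I_{1..i-1}}.
            state['prefix'] = consolidate([state.get('prefix', [])] + frozen, lam)
        state['older'] = state['current']    # keep D_{I_i}
        state['current'] = new_structure()   # create a fresh D_{I_{i+1}}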

Given any $t \in [p_{\max}+1, r_I]$, the estimate of $f_a([t, r_I])$ given by $\widehat{D}_{I,\epsilon Y}$ is

$$\hat{f}_a([t, r_I]) = v\left(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t\right)$$
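Assuming, as in the alive/dead terminology used throughout, that a node is alive at $t$ exactly when its monitored interval has not fallen entirely to the left of $t$, the estimate is a suffix sum over the nodes for $a$; a sketch in terms of the Node class above:

    def estimate(nodes, t):
        # v(alive(...)^a_t): total value of the nodes for a whose interval
        # i(N) is not entirely to the left of t.
        return sum(n.v for n in nodes if n.hi >= t)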

The following theorem gives the accuracy of $\hat{f}_a([t, r_I])$, as well as the size and update time of $\widehat{D}_{I,\epsilon Y}$.

Theorem 8

  • For any $t \in [p_{\max}+1, r_I]$, the estimate $\hat{f}_a([t, r_I])$ given by $\widehat{D}_{I,\epsilon Y}$ satisfies

$$f_a([t, r_I]) - 2\epsilon Y \leq \hat{f}_a([t, r_I]) \leq f_a([t, r_I]) + 2\epsilon Y$$

  • $\widehat{D}_{I,\epsilon Y}$ has size $O(\frac{1}{\epsilon}(\log\log d_{\max})\log d_{\max})$ and supports $O(\log\frac{1}{\epsilon} + \log\log d_{\max})$ update time.

Proof

Recall that $I$ is partitioned into sub-intervals $I_1, I_2, \ldots, I_m$. Suppose that $t \in I_\kappa$. Note that if we had not performed any consolidation,

$$v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) = v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t) + \sum_{\kappa+1\leq i\leq m} v(\mathrm{alive}(D_{I_i,\epsilon Y})^a)$$

Note that for $\kappa+1 \leq i \leq m$, $v(\mathrm{alive}(D_{I_i,\epsilon Y})^a) \leq f_a(I_i)$; as for $v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t)$, since $|I_\kappa| = d_{\max}$, the same argument used in the proof of Lemma 3 gives us $v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t) \leq f_a([t, r_{I_\kappa}]) + \lambda\log d_{\max}$. Hence

$$v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) = v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t) + \sum_{\kappa+1\leq i\leq m} v(\mathrm{alive}(D_{I_i,\epsilon Y})^a) \leq f_a([t, r_{I_\kappa}]) + \lambda\log d_{\max} + \sum_{\kappa+1\leq i\leq m} f_a(I_i) = f_a([t, r_I]) + \lambda\log d_{\max} \quad (10)$$

The consolidation step may add errors to $v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t)$. To bound them, let $N_1, N_2, \ldots$ be the nodes for $a$ in $\widehat{D}_{I,\epsilon Y}$, ordered from left to right. Suppose that $t \in i(N_h)$. Note that

  • the consolidation step adds at most $\lambda$ units to $v(N_h)$ before we move on to consider the node immediately to its right, and

  • for any node $N_i$ with $i \geq h+1$, any node $N$ that has been merged into $N_i$ must be to the right of $N_h$, and thus to the right of $t$; it follows that $N$ contributes $v(N)$ to $v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t)$ in Equation (10), and its merging does not make any change.

In conclusion, the consolidation steps introduce at most $\lambda$ extra error, and Equation (10) becomes $v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) \leq f_a([t, r_I]) + \lambda\log d_{\max} + \lambda \leq f_a([t, r_I]) + 2\epsilon Y$ (since $\lambda = \epsilon Y/\log d_{\max}$, we have $\lambda\log d_{\max} + \lambda \leq 2\epsilon Y$), which is the second inequality of the theorem.

To prove the first inequality, suppose that we ask for the estimate $\hat{f}_a([t, r_I])$ during the $i$th period, when we have $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$, $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$. Recall that $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ comes from consolidating $D_{I_1,\epsilon Y}, D_{I_2,\epsilon Y}, \ldots, D_{I_{i-2},\epsilon Y}$. As in all our previous analyses, we have

$$f_a([t, r_I]) - v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) \leq v(\mathrm{node}(\widehat{D}_{I,\epsilon Y})^a_t) - v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) = d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t)$$

(Note that the merging of nodes during consolidations does not take away any value.) To bound $d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t)$, suppose that $p_{\max} \in I_\kappa$. Then, all the nodes to the left of $I_\kappa$ have been thrown away. Among $D_{I_\kappa,\epsilon Y}, D_{I_{\kappa+1},\epsilon Y}, \ldots, D_{I_m,\epsilon Y}$, only $D_{I_\kappa,\epsilon Y}$ may have been trimmed. Note that

  • $d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t) \leq d(\mathrm{dead}(D_{I_\kappa,\epsilon Y})^a_{p_{\max}}) + \sum_{\kappa+1\leq\ell\leq m} d(\mathrm{dead}(D_{I_\ell,\epsilon Y})^a)$,

  • as in the proof of Theorem 7, we can argue that $d(\mathrm{dead}(D_{I_\kappa,\epsilon Y})^a_{p_{\max}}) \leq \epsilon Y$, and

  • for the other $D_{I_\ell,\epsilon Y}$, since their capacity is at least $1/\epsilon$,

$$\sum_{\kappa+1\leq\ell\leq m} d(\mathrm{dead}(D_{I_\ell,\epsilon Y})^a) \leq \sum_{\kappa+1\leq\ell\leq m} \bar{f}(I_\ell)/(1/\epsilon) \leq \epsilon\,\bar{f}([p_{\max}+1, r_I]) \leq \epsilon Y$$

Thus, $d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t) \leq 2\epsilon Y$, and the first inequality follows.

For Statement (2), note that both $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$ have size $O(\frac{1}{\epsilon}(\log\log d_{\max})\log d_{\max})$ (by Theorem 7, since $|I_{i-1}| = |I_i| = d_{\max}$), and $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ has size $O(Y/\lambda + \frac{1}{\epsilon}) = O(\frac{1}{\epsilon}\log d_{\max})$; thus the size of $\widehat{D}_{I,\epsilon Y}$ is $O(\frac{1}{\epsilon}(\log\log d_{\max})\log d_{\max})$. For the update time, it suffices to note that it is dominated by the update times of $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$.

Figure 1. Suppose that $\lambda = 4$. (i) shows the queue $Q^a_{I,\lambda}$ before the arrivals of items $(a, 1)$, $(a, 2)$, $(a, 3)$, $(a, 8)$; (ii) is the resulting queue after the updates for these items; (iii) shows that after the arrival of another item $(a, 1)$, the first node in (ii) is updated and refined.

Figure 2. Interesting intervals for $I = [1, 8]$.

Figure 3. Split of $\mathcal{C}_{[1,8],\lambda,\kappa}$.

Figure 4. Trim($\langle\mathcal{C}_{[2,2],\lambda,\kappa}, \mathcal{C}_{[3,4],\lambda,\kappa}, \mathcal{C}_{[5,8],\lambda,\kappa}\rangle$, 3).
Table 1. The space complexity for answering $\epsilon$-approximate frequent item set queries over a sliding time window. Results from this paper are marked with [†]. Note that we assume $B \geq \frac{1}{\epsilon}\log W$; otherwise, we can always store all items in the window for an exact answer, using $O(\frac{1}{\epsilon}\log W)$ words. Similarly, for the result with tardiness, we assume $B \geq \frac{1}{\epsilon}\log d_{\max}$.

Model: Space Complexity (words)
Synchronous [7]: $O(\frac{1}{\epsilon}\log(\epsilon B))$
Asynchronous [1]: $O(\frac{1}{\epsilon}\log W \log(\frac{\epsilon B}{\log W})\min\{\log W, \frac{1}{\epsilon}\}\log|U|)$
Asynchronous [†]: $O(\frac{1}{\epsilon}\log W \log(\frac{\epsilon B}{\log W})\log\log W)$
Asynchronous with tardiness [†]: $O(\frac{1}{\epsilon}\log d_{\max}\log(\frac{\epsilon B}{\log d_{\max}})\log\log d_{\max})$

Acknowledgments

H.F. Ting is partially supported by the GRF Grant HKU-716307E; T.W. Lam is partially supported by the GRF Grant HKU-713909E.

References

  1. Cormode, G.; Korn, F.; Tirthapura, S. Time-Decaying Aggregates in Out-of-Order Streams. Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'08, Vancouver, Canada, 9–11 June 2008; pp. 89–98.
  2. Karp, R.; Shenker, S.; Papadimitriou, C. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 2003, 28, 51–55.
  3. Demaine, E.; López-Ortiz, A.; Munro, J. Frequency Estimation of Internet Packet Streams with Limited Space. Proceedings of the 10th Annual European Symposium on Algorithms, ESA'02, Rome, Italy, 17–21 September 2002; pp. 348–360.
  4. Muthukrishnan, S. Data Streams: Algorithms and Applications; Now Publishers Inc.: Boston, MA, USA, 2005.
  5. Babcock, B.; Babu, S.; Datar, M.; Motwani, R.; Widom, J. Models and Issues in Data Stream Systems. Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS'02, Madison, WI, USA, 3–5 June 2002; pp. 1–16.
  6. Arasu, A.; Manku, G. Approximate Counts and Quantiles over Sliding Windows. Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'04, Paris, France, 14–16 June 2004; pp. 286–296.
  7. Lee, L.K.; Ting, H.F. A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows. Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'06, Chicago, IL, USA, 26–28 June 2006; pp. 290–297.
  8. Lee, L.K.; Ting, H.F. Maintaining Significant Stream Statistics over Sliding Windows. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'06, Miami, FL, USA, 22–26 January 2006; pp. 724–732.
  9. Datar, M.; Gionis, A.; Indyk, P.; Motwani, R. Maintaining stream statistics over sliding windows. SIAM J. Comput. 2002, 31, 1794–1813.
  10. Tirthapura, S.; Xu, B.; Busch, C. Sketching Asynchronous Streams over a Sliding Window. Proceedings of the 25th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC'06, Denver, CO, USA, 23–26 July 2006; pp. 82–91.
  11. Busch, C.; Tirthapura, S. A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window. Proceedings of the 24th Annual Symposium on Theoretical Aspects of Computer Science, STACS'07, Aachen, Germany, 22–24 February 2007; pp. 465–475.
  12. Cormode, G.; Tirthapura, S.; Xu, B. Time-decaying sketches for robust aggregation of sensor data. SIAM J. Comput. 2009, 39, 1309–1339.
  13. Chan, H.L.; Lam, T.W.; Lee, L.K.; Ting, H.F. Approximating Frequent Items in Asynchronous Data Stream over a Sliding Window. Proceedings of the 7th Workshop on Approximation and Online Algorithms, WAOA'09, Copenhagen, Denmark, 10–11 September 2009; pp. 49–61.
  14. Misra, J.; Gries, D. Finding repeated elements. Sci. Comput. Program. 1982, 2, 143–152.
  15. Arbitman, Y.; Naor, M.; Segev, G. De-amortized Cuckoo Hashing: Provable Worst-Case Performance and Experimental Results. Proceedings of the 36th International Colloquium on Automata, Languages and Programming, ICALP'09, Rhodes, Greece, 5–12 July 2009; pp. 107–118.
  16. Hung, R.S.; Lee, L.K.; Ting, H.F. Finding frequent items over sliding windows with constant update time. Inf. Process. Lett. 2010, 110, 257–260.
