This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
In an asynchronous data stream, the data items may be out of order with respect to their original timestamps. This paper studies the space complexity required by a data structure to maintain such a data stream so that it can approximate the set of frequent items over a sliding time window with sufficient accuracy. Prior to our work, the best solution was given by Cormode et al. [1], who gave an O((1/∊) logW log(∊B/logW) min{logW, 1/∊} log|U|)-space data structure that can approximate the frequent items within an ∊ error bound, where W and B are parameters of the sliding window, and U is the set of all possible item names. We give a more space-efficient data structure that only requires O((1/∊) logW log(∊B/logW) loglogW) space.
Keywords: asynchronous data streams; frequent items; sliding window; space complexity
Introduction
Identifying frequent items in a massive data stream has many applications in data mining and network monitoring, and the problem has been studied extensively [2-5]. Recent interest has been shifted from the statistics of the whole data stream to that of a sliding window of recent data [6-9]. In most applications, the amount of data in a window is gigantic when compared with the amount of memory available in the processing units. It is impossible to store all the data and then find the exact frequent items. Existing research has focused on designing space-efficient data structures to support finding the approximate frequent items. The key concern is how to minimize the space so as to achieve a required level of accuracy.
Asynchronous Data Stream
Most of the previous work on data streams assumes that items in a data stream are synchronous in the sense that the order of their arrivals is the same as the order of their creations. This synchronous model is, however, not suitable for applications that are distributed in nature. For example, in a sensor network, the sink collects data transmitted from sensors over a large area, and the data transmitted from different sensors would suffer different delays. It is possible that an item created at time t at a certain sensor may arrive at the sink later than an item created after t at another sensor. From the sink's viewpoint, items in the data stream are out of order with respect to their creation times. Yet the statistics to be computed are usually based on the creation times. More specifically, an asynchronous data stream (a.k.a. out-of-order data stream) [1,10,11] can be considered as a sequence (a_{1}, t_{1}), (a_{2}, t_{2}), (a_{3}, t_{3}), …, where a_{i} is the name of a data item chosen from a fixed universe U, and t_{i} is an integer timestamp recording the creation time of this item. Items arrive at the data stream in arbitrary order with respect to their timestamps, and it is possible that more than one data item has the same timestamp.
Previous Work on Approximating Frequent Items
Consider a data stream and, in particular, those data items whose timestamps fall into the last W time units (W is the size of the sliding window). An item (or precisely, an item name) is said to be a frequent item if its count (i.e., the number of occurrences) exceeds a certain required threshold of the total item count. Arasu and Manku [6] were the first to study approximating frequent items over a sliding window under the synchronous model, in which data items arrive in non-decreasing order of timestamps. The space complexity of their data structure is O((1/∊) (log(1/∊))^{2} log(∊B)), where ∊ is a user-specified error bound and B is the maximum number of items with timestamps falling into the same sliding window. Their work was later improved by Lee and Ting [7] to O((1/∊) log(∊B)) space. Recently, Cormode et al. [1] initiated the study of frequent items under the asynchronous model, and gave a solution with space complexity O((1/∊) logW log(∊B/logW) min{logW, 1/∊} log|U|), where U is the set of possible item names. Later, Cormode et al. [12] gave a hashing-based randomized solution using O((1/∊^{2}) log|U|) space. The space complexity is quadratic in 1/∊, which is less preferred, but it is a general solution that can solve other problems like finding the sum and quantiles.
The earlier work on asynchronous data streams focused on a relatively simpler problem called ∊-approximate basic counting [10,11]. Cormode et al. [1] improved the space complexity of basic counting to O((1/∊) logW log(∊B/logW)). Notice that under the synchronous model, the best data structure requires O((1/∊) log(∊B)) space [9]. It is believed that there is roughly a gap of logW between the synchronous model and the asynchronous model. Yet, for frequent items, the asynchronous result of Cormode et al. [1] has space complexity much larger than that of the best synchronous result, which is O((1/∊) log(∊B)) [7]. This motivates us to study more space-efficient solutions for approximating frequent items in the asynchronous model.
Formal Definition of Approximate Frequent Item Set
For any time interval I and any data item a, let f_{a}(I) denote the frequency of item a in interval I, i.e., the number of arrived items named a with timestamps falling into I. Define f_{*}(I) = Σ_{a∈U}f_{a}(I) to be the total number of all arrived items with timestamps within I.
Given a user-specified error bound ∊ and a window size W, we want to maintain a data structure to answer any ∊-approximate frequent item set query for any sub-window (specified at query time), which is in the form (ϕ, W′) where ϕ ∈ [∊, 1] is the required threshold and W′ ≤ W is the sub-window size. Suppose that τ_{cur} is the current time. The answer to such a query is a set S of item names satisfying the following two conditions:
(C1) S contains every item a whose frequency in interval I = [τ_{cur} − W′ + 1, τ_{cur}] is at least ϕf_{*}(I), i.e., f_{a}(I) ≥ ϕf_{*}(I).
(C2) For any item a in S, its frequency in interval I is at least (ϕ − ∊)f_{*}(I), i.e., f_{a}(I) ≥ (ϕ − ∊)f_{*}(I).
The set S is also called an ∊-approximate ϕ-frequent item set. For example, assume ∊ = 1%; then the query (10%, 10,000) would return all items whose frequencies in the last 10,000 time units are each at least 10% of the total item count, plus possibly some other items with frequency at least 9% of the total count.
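Conditions (C1) and (C2) can be checked directly on a small stream by brute force. The sketch below is only an illustration of the definition, not part of the data structure studied in this paper; the function names `window_counts` and `is_valid_answer` and the toy stream are assumptions made for the example.

```python
from collections import Counter

def window_counts(stream, t_cur, w_sub):
    """Exact f_a(I) for I = [t_cur - w_sub + 1, t_cur].

    stream is a list of (item, timestamp) pairs in arrival order,
    which need not be sorted by timestamp (asynchronous model).
    """
    lo = t_cur - w_sub + 1
    return Counter(a for a, t in stream if lo <= t <= t_cur)

def is_valid_answer(S, stream, t_cur, w_sub, phi, eps):
    """Check whether S satisfies (C1) and (C2) for the query (phi, w_sub)."""
    counts = window_counts(stream, t_cur, w_sub)
    total = sum(counts.values())                      # f_*(I)
    c1 = all(a in S for a, f in counts.items() if f >= phi * total)
    c2 = all(counts[a] >= (phi - eps) * total for a in S)
    return c1 and c2

# Out-of-order toy stream; the window I = [2, 5] holds x:3, y:2 (z is too old).
stream = [("x", 5), ("y", 3), ("x", 4), ("z", 1), ("x", 2), ("y", 5)]
print(is_valid_answer({"x"}, stream, t_cur=5, w_sub=4, phi=0.5, eps=0.2))
print(is_valid_answer({"x", "y"}, stream, t_cur=5, w_sub=4, phi=0.5, eps=0.2))
```

Both answers are acceptable here: y is not ϕ-frequent, but its frequency is above the relaxed threshold (ϕ − ∊)f_{*}(I), so (C2) allows it to appear in S.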
Our Contribution
This paper gives a more space-efficient data structure for answering any ∊-approximate frequent item set query. Our data structure uses O((1/∊) logW log(∊B/logW) loglogW) words, which is significantly smaller than the one given by Cormode et al. [1] (see Table 1). Furthermore, this space complexity is larger than the best synchronous solution by only a factor of O(logW log logW), which is close to the expected gap of O(logW). Similar to existing data structures for this problem, it takes time linear in the data structure's size to answer an ∊-approximate frequent item set query. Furthermore, it takes O(log(∊B/logW) (log(1/∊) + loglogW)) time to modify the data structure for a new data item. Occasionally, we might need to clean up some old data items that are no longer significant to the approximation; in the worst case, this takes time linear in the size of the data structure, and thus is no bigger than the query time. As a remark, the solution of Cormode et al. [1] requires O(log(∊B/logW) logW loglog|U|) time for an update.
In the asynchronous model, if a data item has a delay of more than W time units, it can be discarded immediately when it arrives. In many applications, the delay is usually small. This motivates us to extend the asynchronous model to consider data items that have a bounded delay. We say that an asynchronous data stream has tardiness d_{max} if a data item created at time t must arrive at the stream no later than time t + d_{max}. If we set d_{max} = 0, the model becomes the synchronous model. If we allow d_{max} ≥ W, this is in essence the asynchronous model studied above. We adapt our data structure to take advantage of small tardiness such that when d_{max} is small, it uses smaller space (see Table 1) and supports faster update time, which is O(log(∊B/logd_{max}) (log(1/∊) + loglogd_{max})). In particular, when d_{max} = Θ(1), the size and update time of our data structure match those of the best data structure for synchronous data streams.
Remark
This paper is a corrected version of a paper with the same title in WAOA 2009 [13]; in particular, the error bound on the estimates was given incorrectly before and is fixed in this version.
Technical Digest
To solve the frequent item set problem, we need to estimate the frequency of any item with relative error ∊f_{*}(I) where I = [τ_{cur} − W + 1, τ_{cur}] is the interval covered by the sliding window. To this end, we first propose a simple data structure for estimating the frequency of a fixed item over the sliding window. Then, we adapt a technique of Misra and Gries [14] to extend our data structure to handle any item. The result is an O(f_{*}(I)/λ)-space data structure that allows us to obtain an estimate for any item with an error bound of about λ logW. Here λ is a design parameter. To ensure that λ logW is no greater than ∊f_{*}(I), we should set λ ≤ ∊f_{*}(I)/logW. Since f_{*}(I) can be as small as Θ((1/∊) logW) (the case for smaller f_{*}(I) can be handled by brute-force), we need to be conservative and set λ to some constant. But then the size of the data structure can be Θ(B) because f_{*}(I) can be as large as B. To reduce space, we introduce a multi-resolution approach. Instead of using one single data structure, we maintain a collection of O(logB) copies of our data structure, each using a distinct, carefully chosen parameter λ so that it can estimate the frequent item set with sufficient accuracy when f_{*}(I) is in a particular range. The resulting data structure uses O((1/∊) logW logB) space.
Unfortunately, a careful analysis of our data structure reveals that in the worst case, it can only guarantee estimates with an error bound of ∊f_{*}(H ∪ I) where H = [τ_{cur} − 2W + 1, τ_{cur} − W], not the required ∊f_{*}(I). The reason is that the error of its estimates over I depends on the number of updates made during I, and unlike in a synchronous data stream, this number for an asynchronous data stream can be significantly larger than f_{*}(I). For example, at time τ_{cur} − W + 1, there may still be many new items (a, u) with timestamps u ∈ H, for which we must update our data structure to get good estimates when the sliding window is at earlier positions. Indeed, the number of updates during I can be as large as f_{*}(H ∪ I), and this gives an error bound of ∊f_{*}(H ∪ I).
To reduce the error bound to ∊f_{*}(I), we introduce a novel algorithm to split the data structure into independent smaller ones at appropriate times. For example, at time τ_{cur} − W + 1, we can split our data structure into two smaller ones D_{H} and D_{I}, and we will only update D_{H} for items (a, u) with u ∈ H and update D_{I} for those with u ∈ I. Then, when we need to find an estimate on I at time τ_{cur}, we only need to consult D_{I}, and the number of updates made to it is f_{*}(I). In this paper, we develop sophisticated procedures to decide when and how to split the data structure so as to enable us to get good enough estimates when the sliding window moves continuously. The resulting data structure has size O((1/∊) (logW)^{2} log(∊B/logW)). Then, we further make the data structure adaptive to the input size, allowing us to reduce the space to O((1/∊) (loglogW) logW log(∊B/logW)).
Preliminaries
Our data structures for the frequent item set problem depend on data structures for the following two related data stream problems. Let 0 < ∊ < 1 be any real number, and τ_{cur} be the current time.
The ∊-approximate basic counting problem asks for a data structure that allows us to obtain, for any interval I = [τ_{cur} − W′ + 1, τ_{cur}] where W′ ≤ W, an estimate f̂_{*}(I) of f_{*}(I) such that |f̂_{*}(I) − f_{*}(I)| ≤ ∊f_{*}(I).
The ∊-approximate counting problem asks for a data structure that allows us to obtain, for any item a and any interval I = [τ_{cur} − W′ + 1, τ_{cur}] where W′ ≤ W, an estimate f̂_{a}(I) of f_{a}(I) such that |f̂_{a}(I) − f_{a}(I)| ≤ ∊f_{*}(I).
As mentioned in Section 1, Cormode et al. [1] gave an O((1/∊) logW log(∊B/logW))-space data structure ℬ_{∊} for solving the ∊-approximate basic counting problem. In this paper, we give an O((1/∊) logW log(∊B/logW) loglogW)-space data structure 𝒟_{∊} for solving the harder ∊-approximate counting problem. The theorem below shows how to use these two data structures to answer any ∊-approximate frequent item set query.
Theorem 1
Let ∊_{0} = ∊/4. Given ℬ_{∊_{0}} and 𝒟_{∊_{0}}, we can answer any ∊-approximate frequent item set query. The total space required is O((1/∊) logW log(∊B/logW) loglogW).
Proof
The space requirement is obvious. Consider any ∊-approximate frequent item set query (ϕ, W′) where ∊ ≤ ϕ ≤ 1 and W′ ≤ W. Let I = [τ_{cur} − W′ + 1, τ_{cur}]. Since ∊_{0} = ∊/4, the estimates given by ℬ_{∊_{0}} satisfy |f̂_{*}(I) − f_{*}(I)| ≤ (∊/4)f_{*}(I), and for any item a, the estimates given by 𝒟_{∊_{0}} satisfy |f̂_{a}(I) − f_{a}(I)| ≤ (∊/4)f_{*}(I). To answer the query (ϕ, W′), we return the set
S_{ϕ} = {a | f̂_{a}(I) ≥ (ϕ − ∊/2)f̂_{*}(I)},
which satisfies the required conditions (C1) and (C2) because
for any item a with f_{a}(I) ≥ ϕf_{*}(I), f̂_{a}(I) ≥ f_{a}(I) − (∊/4)f_{*}(I) ≥ (ϕ − ∊/4)f_{*}(I) ≥ (ϕ − ∊/4)(1/(1 + ∊/4))f̂_{*}(I) ≥ (ϕ − ∊/4)(1 − ∊/4)f̂_{*}(I) ≥ (ϕ − ∊/2)f̂_{*}(I), and a ∈ S_{ϕ}; thus (C1) is satisfied, and
for every a ∈ S_{ϕ}, we have f_{a}(I) ≥ f̂_{a}(I) − (∊/4)f_{*}(I) ≥ (ϕ − ∊/2)f̂_{*}(I) − (∊/4)f_{*}(I) ≥ (ϕ − ∊/2)(1 − ∊/4)f_{*}(I) − (∊/4)f_{*}(I) ≥ (ϕ − ∊)f_{*}(I); thus (C2) is satisfied.
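The query step in the proof of Theorem 1 reduces to a single threshold test on the estimates. The following is a minimal sketch under the assumption that the estimates f̂_{a}(I) and f̂_{*}(I) have already been obtained from the two underlying data structures; the function name `frequent_items`, the dictionary interface, and the numbers are hypothetical and serve only to illustrate the test.

```python
def frequent_items(est_item, est_total, phi, eps):
    """Return S_phi = {a : f̂_a(I) >= (phi - eps/2) * f̂_*(I)}.

    est_item maps each candidate item a to its estimate f̂_a(I);
    est_total is the estimate f̂_*(I). Both are assumed to be
    accurate to within (eps/4) * f_*(I), as in Theorem 1.
    """
    threshold = (phi - eps / 2) * est_total
    return {a for a, f in est_item.items() if f >= threshold}

# Illustrative numbers: true counts a=120, b=95, c=80, d=50, total=1000.
# With eps = 0.04 every estimate below is within eps/4 * 1000 = 10 of the truth.
est_item = {"a": 112, "b": 100, "c": 88, "d": 58}
answer = frequent_items(est_item, est_total=1005, phi=0.1, eps=0.04)
print(sorted(answer))
```

In this example a is truly 10%-frequent and is reported, as (C1) requires, while b and c are reported although they are slightly below 10%; this is allowed by (C2) because both exceed (ϕ − ∊)f_{*}(I) = 60.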
The building block of 𝒟_{∊} is a data structure that counts items over some fixed interval (instead of the sliding window). For any interval I = [ℓ_{I}, r_{I}] of size W, Theorem 4 in Section 4 gives a data structure 𝒟_{I,∊} that uses O((1/∊) logW log(∊B/logW) loglogW) space, supports O(log(∊B/logW) ⋅ (log(1/∊) + loglogW)) update time, and enables us to obtain, for any item a and any time t ∈ I, an estimate f̂_{a}([t, r_{I}]) of f_{a}([t, r_{I}]) such that
|f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ ∊f_{*}([t, r_{I}]).    (1)
Given 𝒟_{I_{1},∊}, 𝒟_{I_{2},∊}, … where I_{i} = [(i − 1)W + 1, iW], we can obtain, for any W′ ≤ W, an estimate f̂_{a}([s, τ_{cur}]) of f_{a}([s, τ_{cur}]) where s = τ_{cur} − W′ + 1 as follows.
Let I_{i} and I_{i+1} be the intervals such that [s, τ_{cur}] ⊂ I_{i} ∪ I_{i+1}.
Use 𝒟_{I_{i},∊} to get an estimate f̂_{a}([s, iW]) of f_{a}([s, iW]), and 𝒟_{I_{i+1},∊} to get an estimate f̂_{a}([iW + 1, (i + 1)W]) of f_{a}([iW + 1, (i + 1)W]).
By Equation (1), we have
|f̂_{a}([s, iW]) − f_{a}([s, iW])| ≤ ∊f_{*}([s, iW])    (2)
and
|f̂_{a}([iW + 1, (i + 1)W]) − f_{a}([iW + 1, (i + 1)W])| ≤ ∊f_{*}([iW + 1, (i + 1)W]).    (3)
Observe that any item that arrives at or before the current time τ_{cur} must have timestamp no greater than τ_{cur}; hence f_{a}([iW + 1, (i + 1)W]) = f_{a}([iW + 1, τ_{cur}]) and f_{*}([iW + 1, (i + 1)W]) = f_{*}([iW + 1, τ_{cur}]), and Equation (3) is equivalent to
|f̂_{a}([iW + 1, (i + 1)W]) − f_{a}([iW + 1, τ_{cur}])| ≤ ∊f_{*}([iW + 1, τ_{cur}]).    (4)
Taking f̂_{a}([s, τ_{cur}]) to be the sum of the two estimates and adding Equations (2) and (4), we conclude |f̂_{a}([s, τ_{cur}]) − f_{a}([s, τ_{cur}])| ≤ ∊f_{*}([s, τ_{cur}]), as required.
Our data structure 𝒟_{∊} is just the collection of 𝒟_{I_{1},∊}, 𝒟_{I_{2},∊}, …. Note that we only need to physically store in 𝒟_{∊} the data structures 𝒟_{I_{i},∊} and 𝒟_{I_{i+1},∊} where [τ_{cur} − W + 1, τ_{cur}] ⊆ I_{i} ∪ I_{i+1}. The intervals of the earlier ones will no longer be covered by the sliding window and the corresponding 𝒟_{I,∊}'s can be thrown away. Together with Theorem 4, we have the following theorem.
Theorem 2
The data structure 𝒟_{∊} solves the ∊-approximate counting problem. The space usage is O((1/∊) logW log(∊B/logW) loglogW) and it supports O(log(∊B/logW) ⋅ (log(1/∊) + loglogW)) update time.
A Simple Data Structure For Frequency Estimation
Let I = [ℓ_{I}, r_{I}] be any interval of size W. To simplify notation, we assume that W is a power of 2, so that logW is an integer and we can avoid the floor or the ceiling functions. In this section, we describe a simple data structure 𝒞_{I,λ,κ} that enables us to obtain, for any item a, a good estimate of a's frequency over I. The parameters λ and κ determine its accuracy and space usage. However, its accuracy is not enough for answering any ∊-approximate frequent item set query. We will explain how to improve the accuracy in the next section.
Roughly speaking, 𝒞_{I,λ,κ} is a set of queues Q^{a}_{I,λ}, i.e., 𝒞_{I,λ,κ} = {Q^{a}_{I,λ} | a ∈ U}. For an item a, the queue Q^{a}_{I,λ} keeps track of the occurrences of a in I. Each node N in Q^{a}_{I,λ} is associated with an interval i(N), a value v(N), and a debit d(N); v(N) counts the number of arrived items (a, u) with u ∈ i(N), and d(N) is for implementing a space reduction technique. Initially, Q^{a}_{I,λ} has only one node N with i(N) = I, and v(N) = d(N) = 0. In general, Q^{a}_{I,λ} is a queue 〈N_{1}, N_{2}, …, N_{k}〉 of nodes whose intervals form a partition of I, i.e.,
〈i(N_{1}), i(N_{2}), …, i(N_{k})〉 = 〈[p_{1}, q_{1}], [p_{2}, q_{2}], …, [p_{k}, q_{k}]〉
where q_{i−1} + 1 = p_{i} ≤ q_{i} and ∪_{1≤i≤k}[p_{i}, q_{i}] = I. When an item (a, u) with u ∈ I arrives, we update Q^{a}_{I,λ} as follows.
Q^{a}_{I,λ}.Update((a, u))
1: find the unique node N in Q^{a}_{I,λ} with u ∈ i(N) = J = [p, q];
2: increase the value of N by 1, i.e., v(N) = v(N) + 1;
3: if (|J| > 1 and λ units have been added to v(N) since J is assigned to i(N)) then
4:     /* refine J */
5:     create a new node N′ and insert it to the left of N;
6:     let i(N′) = [p, m], i(N) = [m + 1, q] where m = ⌊(p + q)/2⌋;
7:     let v(N′) = 0 and d(N′) = 0;
8:     /* we make no change to v(N) and d(N) */
9: end if
Figure 1 gives an example of how Q^{a}_{I,λ} is updated using the procedure.
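The update procedure for a single queue can be sketched in a few lines. This is an illustrative sketch only: the class names `Node` and `ItemQueue`, the linear scan for the node, and the extra field `added` (which tracks how many units were added since the current interval was assigned) are assumptions of this example rather than details given in the paper. The `estimate` method sums the values of the nodes whose interval covers or lies to the right of t, which for an interval [p, q] is the condition q ≥ t.

```python
from dataclasses import dataclass

@dataclass
class Node:
    p: int          # left end of i(N)
    q: int          # right end of i(N)
    v: int = 0      # value v(N)
    d: int = 0      # debit d(N), unused in this sketch
    added: int = 0  # units added to v(N) since [p, q] was assigned

class ItemQueue:
    """Queue of nodes for one fixed item over the interval [lo, hi]."""

    def __init__(self, lo, hi, lam):
        self.lam = lam
        self.nodes = [Node(lo, hi)]

    def update(self, u):
        # Line 1: find the unique node whose interval contains u.
        idx = next(i for i, n in enumerate(self.nodes) if n.p <= u <= n.q)
        n = self.nodes[idx]
        # Line 2: count the item.
        n.v += 1
        n.added += 1
        # Lines 3-9: refine the interval once lam units have been added.
        if n.q > n.p and n.added >= self.lam:
            m = (n.p + n.q) // 2
            self.nodes.insert(idx, Node(n.p, m))   # new left node N'
            n.p, n.added = m + 1, 0                # N keeps v(N) and d(N)

    def estimate(self, t):
        # Sum of v(N) over nodes covering or to the right of t.
        return sum(n.v for n in self.nodes if n.q >= t)

q = ItemQueue(1, 8, lam=2)
stamps = [5, 5, 7, 7, 8, 1, 2]      # timestamps of arriving copies of item a
for u in stamps:
    q.update(u)
print([(n.p, n.q, n.v) for n in q.nodes], q.estimate(7))
```

Without debits the estimate never undercounts, and in this run it overcounts by at most λ logW = 6, consistent with the upper bound stated later in Lemma 3.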
Obviously, a direct implementation of 𝒞_{I,λ,κ} uses too much space. We now extend a technique of Misra and Gries [14] to reduce the space requirement. For any Q^{a}_{I,λ}, we say that Q^{a}_{I,λ} is trivial if the queue contains only a single node N with (i) i(N) = I, and (ii) v(N) = d(N) = 0. Every queue in 𝒞_{I,λ,κ} is trivial initially. The key for reducing the space complexity of 𝒞_{I,λ,κ} is to maintain the following invariant throughout the execution:
(*) There are at most κ non-trivial queues in 𝒞_{I,λ,κ}.
We call κ the capacity of 𝒞_{I,λ,κ}. The invariant helps us save space because we do not need to store trivial queues physically in memory. To maintain (*), each queue Q^{a}_{I,λ} supports the following procedure, which is called only when v(Q^{a}_{I,λ}), the total value of the nodes in Q^{a}_{I,λ}, is strictly greater than d(Q^{a}_{I,λ}), the total debit of the nodes in Q^{a}_{I,λ}.
Q^{a}_{I,λ}.Debit( )
1: if (v(Q^{a}_{I,λ}) ≤ d(Q^{a}_{I,λ})) then
2:     return error;
3: else
4:     find an arbitrary node N of Q^{a}_{I,λ} with v(N) > d(N);
5:     /* such a node must exist because v(Q^{a}_{I,λ}) > d(Q^{a}_{I,λ}) */
6:     d(N) = d(N) + 1;
7: end if
Note from the implementation of Debit( ) that v(Q^{a}_{I,λ}) is always no smaller than d(Q^{a}_{I,λ}), and for each node N of Q^{a}_{I,λ}, v(N) ≥ d(N). Furthermore, if v(Q^{a}_{I,λ}) = d(Q^{a}_{I,λ}), then v(N) = d(N) for every node N in Q^{a}_{I,λ}. To maintain (*), 𝒞_{I,λ,κ} processes a newly arrived item (a, u) with u ∈ I as follows.
𝒞_{I,λ,κ}.Process((a, u))
1: update Q^{a}_{I,λ} by calling Q^{a}_{I,λ}.Update((a, u));
2: if (after the update the number of non-trivial queues becomes κ) then
3:     for each Q^{x}_{I,λ} with v(Q^{x}_{I,λ}) > d(Q^{x}_{I,λ}) do Q^{x}_{I,λ}.Debit( );
4:     for each non-trivial queue Q^{x}_{I,λ} with v(Q^{x}_{I,λ}) = d(Q^{x}_{I,λ}) do
5:         delete all nodes of Q^{x}_{I,λ} and make it a trivial queue;
6:         /* Note that each deleted node N satisfies v(N) = d(N). */
7: end if
It is easy to see that Invariant (*) always holds: Initially the number m of non-trivial queues is zero, and m increases only when Process((a, u)) is called on some trivial Q^{a}_{I,λ}; in such a case v(Q^{a}_{I,λ}) becomes 1 and d(Q^{a}_{I,λ}) remains 0. If m becomes κ after this increase, we will debit, among other queues, Q^{a}_{I,λ} and its d(Q^{a}_{I,λ}) becomes 1 too. It follows that v(Q^{a}_{I,λ}) = d(Q^{a}_{I,λ}), and Lines 4–5 will make Q^{a}_{I,λ} trivial and m becomes less than κ again.
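The debit mechanism can be illustrated in isolation. The sketch below is a deliberate simplification and not the data structure of this section: each queue is collapsed to a single value/debit pair (v, d), so the interval refinement of Update( ) is ignored and only the Misra–Gries style bookkeeping of Process( ) remains. The class name `DebitCounters` and the toy stream are assumptions of this example.

```python
from collections import Counter

class DebitCounters:
    """Simplified Process(): one (v, d) pair per non-trivial queue."""

    def __init__(self, kappa):
        self.kappa = kappa
        self.v = {}   # v(Q^a) for each stored (non-trivial) queue
        self.d = {}   # d(Q^a) for each stored (non-trivial) queue

    def process(self, a):
        # Line 1: update the queue of item a.
        self.v[a] = self.v.get(a, 0) + 1
        self.d.setdefault(a, 0)
        # Line 2: too many non-trivial queues, so debit and clean up.
        if len(self.v) >= self.kappa:
            for x in self.v:                       # Line 3: debit every queue with v > d
                if self.v[x] > self.d[x]:
                    self.d[x] += 1
            dead = [x for x in self.v if self.v[x] == self.d[x]]
            for x in dead:                         # Lines 4-5: make the queue trivial
                del self.v[x], self.d[x]

    def estimate(self, a):
        # Value sum of the stored nodes of item a.
        return self.v.get(a, 0)

stream = list("aabacabdaa")
dc = DebitCounters(kappa=3)
for a in stream:
    dc.process(a)
true = Counter(stream)
print(dc.estimate("a"), dc.estimate("b"), len(dc.v))
```

In this run the estimate for a is exact, while the two occurrences of b are lost when its queue is made trivial; the loss stays within f_{*}(I)/κ = 10/3, which is the kind of underestimate that the accuracy analysis below accounts for.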
We are now ready to define 𝒞_{I,λ,κ}'s estimate f̂_{a}([t, r_{I}]) of f_{a}([t, r_{I}]) and analyze its accuracy. We need some definitions. For any interval J = [p, q] and any t ∈ I, we say that J covers t if t ∈ [p, q], is to the right of t if t < p, and is to the left of t otherwise. For any item a and any t ∈ I = [ℓ_{I}, r_{I}], 𝒞_{I,λ,κ}'s estimate of f_{a}([t, r_{I}]) is
f̂_{a}([t, r_{I}]) = the value sum of the nodes N currently in Q^{a}_{I,λ} whose i(N) covers or is to the right of t.
For example, in Figure 1, after the update of the last item (a, 1), we can obtain the estimate f̂_{a}([2, 8]) = 0 + 4 + 5 = 9.
Given any node N of Q^{a}_{I,λ}, we say that N is monitoring a over J, or simply N is monitoring J, if i(N) = J. Note that a node may monitor different intervals during different periods of execution, and the sizes of these intervals are monotonically decreasing. Observe that although there are about W^{2}/2 possible sub-intervals of the size-W interval I, there are only about 2W of them that would be monitored by some nodes: there is only one such interval of size W, namely I = [ℓ_{I}, r_{I}], which gives birth to two such intervals of size W/2, namely [ℓ_{I}, m] and [m + 1, r_{I}] where m = ⌊(ℓ_{I} + r_{I})/2⌋, and so on. We call these O(W) intervals interesting intervals. For any two interesting intervals J and H such that J ⊂ H, we say that J is a descendant of H, and H is an ancestor of J. Figure 2 shows all the interesting intervals for I = [1, 8], as well as their ancestor-descendant relationship. The following important fact is easy to verify by induction.
Fact 1
Any two interesting intervals J and H do not cross, although one can contain another, i.e., either J ⊂ H, or H ⊂ J, or J ∩ H = ∅. Furthermore, any interesting interval has at most logW ancestors.
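Fact 1 can be checked mechanically for a small interval. The following sketch enumerates the interesting intervals of I = [1, 8] by repeated halving and verifies the two claims; the helper names `interesting_intervals`, `ancestors` and `cross` are assumptions introduced for this illustration.

```python
def interesting_intervals(lo, hi):
    """All intervals obtained from [lo, hi] by repeated halving."""
    out = [(lo, hi)]
    if lo < hi:
        m = (lo + hi) // 2
        out += interesting_intervals(lo, m)
        out += interesting_intervals(m + 1, hi)
    return out

def ancestors(j, intervals):
    """Interesting intervals that strictly contain j."""
    p, q = j
    return [(a, b) for (a, b) in intervals if a <= p and q <= b and (a, b) != j]

def cross(j, h):
    """True if j and h overlap without one containing the other."""
    (p, q), (a, b) = j, h
    overlap = max(p, a) <= min(q, b)
    nested = (a <= p and q <= b) or (p <= a and b <= q)
    return overlap and not nested

ivs = interesting_intervals(1, 8)
print(len(ivs), max(len(ancestors(j, ivs)) for j in ivs))
```

For W = 8 this yields 2W − 1 = 15 interesting intervals, no two of which cross, and the deepest ones have logW = 3 ancestors.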
For any node N, let ℐ(N) be the set of intervals that have been monitored by N so far. The following fact can be verified from the update procedure.
Fact 2
Consider a node N in Q^{a}_{I,λ}, where i(N) = J.
If J covers or is to the right of t, then all intervals in ℐ(N) cover or are to the right of t.
If J is to the left of t, then all intervals in ℐ(N) are to the left of t.
We say that N covers or is to the right of t if the intervals in ℐ(N) cover or are to the right of t; otherwise, N is to the left of t. For any queue Q^{a}_{I,λ}, let alive(Q^{a}_{I,λ}) be the set of nodes currently in Q^{a}_{I,λ}, dead(Q^{a}_{I,λ}) be those nodes of Q^{a}_{I,λ} that have already been deleted (because of Line 5 of the procedure Process( )), and node(Q^{a}_{I,λ}) = alive(Q^{a}_{I,λ}) ∪ dead(Q^{a}_{I,λ}). Note that the estimate f̂_{a}([t, r_{I}]) is the value sum of the nodes in alive(Q^{a}_{I,λ}) that cover or are to the right of t. For simplicity, we need to express it more succinctly. Let
alive(𝒞_{I,λ,κ}) = ∪{alive(Q^{a}_{I,λ}) | Q^{a}_{I,λ} ∈ 𝒞_{I,λ,κ}}
be the set of nodes currently in 𝒞_{I,λ,κ}. Define dead(𝒞_{I,λ,κ}) and node(𝒞_{I,λ,κ}) similarly. For any item a and any subset X ⊆ node(𝒞_{I,λ,κ}), let X^{a} be the set of nodes in X that are monitoring a (and thus are the nodes from Q^{a}_{I,λ}). For any t ∈ I, let X_{≥t} denote the set of nodes in X that cover or are to the right of t. Define v(X) = Σ_{N∈X}v(N) and d(X) = Σ_{N∈X}d(N). Then, f̂_{a}([t, r_{I}]) can be expressed as follows:
f̂_{a}([t, r_{I}]) = v(alive(Q^{a}_{I,λ})_{≥t}) = v(alive(𝒞_{I,λ,κ})^{a}_{≥t}).
The following lemma analyzes its accuracy, as well as gives the size of 𝒞_{I,λ,κ}.
Lemma 3
For any t ∈ I, f_{a}([t, r_{I}]) − (1/κ)f_{*}(I) ≤ f̂_{a}([t, r_{I}]) ≤ f_{a}([t, r_{I}]) + λ logW. Furthermore, 𝒞_{I,λ,κ} has size O(f_{*}(I)/λ + κ) words.
Proof
Recall that f̂_{a}([t, r_{I}]) = v(alive(Q^{a}_{I,λ})_{≥t}). Consider any node N ∈ alive(Q^{a}_{I,λ})_{≥t}. Note that v(N) = Σ_{J∈ℐ(N)}v_{add}(N, J) where v_{add}(N, J) is the value added to v(N) during the period when i(N) = J. By Fact 2, we can divide it as v(N) = Σ{v_{add}(N, J) | J covers t} + Σ{v_{add}(N, J) | J is to the right of t}. It follows that
v(alive(Q^{a}_{I,λ})_{≥t}) = Σ_{N∈alive(Q^{a}_{I,λ})_{≥t}} v(N) = Σ_{N∈alive(Q^{a}_{I,λ})_{≥t}} Σ{v_{add}(N, J) | J covers t} + Σ_{N∈alive(Q^{a}_{I,λ})_{≥t}} Σ{v_{add}(N, J) | J is to the right of t}.    (5)
Note that Σ_{N∈alive(Q^{a}_{I,λ})_{≥t}} Σ{v_{add}(N, J) | J is to the right of t} ≤ f_{a}([t, r_{I}]), because if an arrived item (a, u) causes an increase of v_{add}(N, J) for some J that is to the right of t, then u must be in [t, r_{I}]. By Equation (5), to show the second inequality of the lemma, it suffices to show that
S_{o} = Σ_{N∈alive(Q^{a}_{I,λ})_{≥t}} Σ{v_{add}(N, J) | J covers t} = v_{add}(N_{1}, J_{1}) + v_{add}(N_{2}, J_{2}) + ⋯ + v_{add}(N_{k}, J_{k})
is no greater than λ logW, as follows.
Without loss of generality, suppose |J_{1}| ≥ |J_{2}| ≥ ⋯ ≥ |J_{k}|. It can be verified that once an interval J is assigned to a node, it will not be assigned to other nodes; thus the J_{i}'s are distinct. Furthermore, note that for 1 ≤ i < k, J_{k} ⊂ J_{i} because (i) t is in both J_{i} and J_{k}; (ii) J_{k} is the smallest interval; and (iii) interesting intervals do not cross; thus J_{k} is a descendant of J_{i}, and together with Fact 1, k ≤ logW. By Line 3 of the procedure Update( ), v_{add}(N_{i}, J_{i}) ≤ λ for 1 ≤ i ≤ k. It follows that S_{o} ≤ λ logW.
For the first inequality of the lemma, it is clearer to use f̂_{a}([t, r_{I}]) = v(alive(𝒞_{I,λ,κ})^{a}_{≥t}). Note that every arrived item (a, u) with u ∈ [t, r_{I}] increments the value of some node in node(𝒞_{I,λ,κ})^{a}_{≥t}; thus f_{a}([t, r_{I}]) ≤ v(node(𝒞_{I,λ,κ})^{a}_{≥t}) and
f_{a}([t, r_{I}]) − v(alive(𝒞_{I,λ,κ})^{a}_{≥t}) ≤ v(node(𝒞_{I,λ,κ})^{a}_{≥t}) − v(alive(𝒞_{I,λ,κ})^{a}_{≥t}) = v(dead(𝒞_{I,λ,κ})^{a}_{≥t}).
From Lines 4–6 of the procedure Process( ), when we delete a node N, v(N) = d(N). Hence, v(dead(𝒞_{I,λ,κ})^{a}_{≥t}) = d(dead(𝒞_{I,λ,κ})^{a}_{≥t}), which is equal to the total number of debit operations made to these dead nodes. Since whenever we make a debit operation to Q^{a}_{I,λ}, we will make a debit operation to κ − 1 other queues,
κ ⋅ d(dead(𝒞_{I,λ,κ})^{a}_{≥t}) ≤ d(node(𝒞_{I,λ,κ})) ≤ v(node(𝒞_{I,λ,κ})) = f_{*}(I).
In summary, we have
f_{a}([t, r_{I}]) − f̂_{a}([t, r_{I}]) = f_{a}([t, r_{I}]) − v(alive(𝒞_{I,λ,κ})^{a}_{≥t}) ≤ v(dead(𝒞_{I,λ,κ})^{a}_{≥t}) = d(dead(𝒞_{I,λ,κ})^{a}_{≥t}) ≤ f_{*}(I)/κ,
and the first inequality of the lemma follows.
For the space, we say that a node is born-rich if it is created because of Line 5 of the procedure Update( ) (and thus has λ items under its belt); otherwise it is born-poor. Obviously, there are at most f_{*}(I)/λ born-rich nodes. For born-poor nodes, we need to store at most κ of them because every queue has one born-poor node (the rightmost one), and we only need to store at most κ non-trivial queues; the space bound follows.
If we set λ = λ_{i} = ∊2^{i}/logW and κ = 1/∊, then Lemma 3 asserts that 𝒞_{I,λ,κ} = 𝒞_{I,λ_{i},1/∊} is an O((f_{*}(I)/(∊2^{i})) logW + 1/∊)-space data structure that enables us to obtain, for any item a ∈ U and any timestamp t ∈ I, an estimate f̂_{a}([t, r_{I}]) that satisfies
f_{a}([t, r_{I}]) − ∊f_{*}(I) ≤ f̂_{a}([t, r_{I}]) ≤ f_{a}([t, r_{I}]) + ∊2^{i}.    (6)
If f_{*}(I) does not vary too much, we can determine the i such that f_{*}(I) ≈ 2^{i}, and 𝒞_{I,λ_{i},1/∊} is an O((1/∊) logW)-space data structure that guarantees an error bound of O(∊f_{*}(I)). However, this approach has two obvious shortcomings:
f_{*}(I) may vary from some small value to a value as large as B, the maximum number of items falling in a window of size W; hence, there may not be any fixed i that always satisfies f_{*}(I) ≈ 2^{i}.
To estimate f_{a}([t, r_{I}]), we need an error bound of ∊f_{*}([t, r_{I}]), not ∊f_{*}(I).
We will explain how to overcome these two shortcomings in the next section.
Our Data Structure for ∊-approximate Counting
The first shortcoming of the approach given in Section 3 is easy to overcome: a natural idea is to maintain 𝒞_{I,λ_{i},1/∊} for different λ_{i} to handle different possible values of f_{*}(I). The second shortcoming is more fundamental. To overcome it, we need to modify 𝒞_{I,λ,κ} substantially. The result is a new and complicated data structure 𝒟^{Y}_{I,∊}, where Y is an integer determining the accuracy. As asserted in Theorem 7 below, this data structure uses O((1/∊) logW loglogW) space, supports O(log(1/∊) + loglogW) update time, and for any t ∈ I, it offers the following special guarantee:
When f_{*}([t, r_{I}]) ≤ Y, 𝒟^{Y}_{I,∊} can return, for any item a, an estimate f̂_{a}([t, r_{I}]) of f_{a}([t, r_{I}]) such that |f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ ∊Y.
When f_{*}([t, r_{I}]) > Y, 𝒟^{Y}_{I,∊} does not have any error bound on its estimate f̂_{a}([t, r_{I}]).
Before giving the details of 𝒟^{Y}_{I,∊}, let us explain how to use it to build the data structure 𝒟_{I,∊} mentioned in Section 2 for the ∊-approximate counting problem. To build 𝒟_{I,∊}, we need another O((1/∊) logW log(∊B/logW))-space data structure ℬ_{I,∊}, which is a simple adaptation of the data structure ℬ_{∊} of Cormode et al. [1] for the ∊-approximate basic counting problem; ℬ_{I,∊} enables us to find, for any t ∈ I, an estimate f̂_{*}([t, r_{I}]) of f_{*}([t, r_{I}]) such that
f_{*}([t, r_{I}]) ≤ f̂_{*}([t, r_{I}]) ≤ (1 + ∊)f_{*}([t, r_{I}]).    (7)
ℬ_{I,∊} is implemented as follows. During execution, we maintain the data structure ℬ_{∊/4} of Cormode et al. to count the items in the sliding window. When τ_{cur} = r_{I}, we duplicate ℬ_{∊/4} and get ℬ′. Then, ℬ′ is updated as if τ_{cur} was fixed at r_{I}. To get the estimate f̂_{*}([t, r_{I}]), we first obtain an estimate f′ of f_{*}([t, r_{I}]) from ℬ′, which satisfies |f′ − f_{*}([t, r_{I}])| ≤ (∊/4)f_{*}([t, r_{I}]). Then, f̂_{*}([t, r_{I}]) = f′/(1 − ∊/4). It can be verified that f̂_{*}([t, r_{I}]) satisfies Equation (7).
Our data structure 𝒟_{I,∊} is composed of (i) ℬ_{I,∊}, and (ii) 𝒟^{2^{i}}_{I,∊/4} for each integer i from log((1/∊) logW) + 1 to logB. It also maintains a brute-force O((1/∊) logW)-space data structure for remembering the (1/∊) logW items (a, u) with the largest u ∈ I; this brute-force data structure will be used for finding f̂_{a}([t, r_{I}]) only when f_{*}([t, r_{I}]) ≤ (1/∊) logW.
Theorem 4
(i) The data structure 𝒟_{I,∊} has size O((1/∊) (loglogW) (logW) log(∊B/logW)) words, and supports O((log(1/∊) + loglogW) log(∊B/logW)) update time.
(ii) Given 𝒟_{I,∊}, we can find, for any a ∈ U and t ∈ I, an estimate f̂_{a}([t, r_{I}]) of f_{a}([t, r_{I}]) such that |f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ ∊f_{*}([t, r_{I}]).
Proof
Statement (i) is straightforward because there are logB − log((1/∊) logW) different 𝒟^{Y}_{I,∊}, each has size O((1/∊) (loglogW) logW) and takes O(log(1/∊) + loglogW) time for an update. For Statement (ii), we describe how to get the estimate and analyze its accuracy.
First, we use ℬ_{I,∊} to get the estimate f̂_{*}([t, r_{I}]). If f̂_{*}([t, r_{I}]) ≤ (1/∊) logW, then f_{*}([t, r_{I}]) ≤ f̂_{*}([t, r_{I}]) ≤ (1/∊) logW and we can use the brute-force data structure to find f_{a}([t, r_{I}]) exactly. Otherwise, we determine the i with 2^{i−1} < f̂_{*}([t, r_{I}]) ≤ 2^{i}. Note that i ≥ log((1/∊) logW) + 1 and we have the data structure 𝒟^{2^{i}}_{I,∊/4}, and f_{*}([t, r_{I}]) ≤ f̂_{*}([t, r_{I}]) ≤ 2^{i}.
We use 𝒟^{2^{i}}_{I,∊/4} to obtain an estimate f̂_{a}([t, r_{I}]) with |f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ (∊/4)2^{i}. By Equation (7), 2^{i−1} < f̂_{*}([t, r_{I}]) ≤ (1 + ∊)f_{*}([t, r_{I}]). Combining the two inequalities we have
|f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ 2(∊/4)(2^{i−1}) < 2(∊/4)(1 + ∊)f_{*}([t, r_{I}]) ≤ ∊f_{*}([t, r_{I}]).
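The selection step in the proof of Theorem 4 can be sketched as follows. This is a hedged illustration of the dispatch logic only; the function name `pick_structure`, its return convention (None for the brute-force case, otherwise the index i), and the sample numbers are assumptions, and the underlying structures themselves are not implemented here.

```python
def pick_structure(f_star_hat, eps, W):
    """Choose which component answers a query, given f̂_*([t, r_I]).

    Returns None when the brute-force structure applies, otherwise the
    index i with 2^(i-1) < f̂_* <= 2^i. W is assumed to be a power of 2.
    """
    log_w = W.bit_length() - 1          # log2(W) for a power of 2
    if f_star_hat <= log_w / eps:
        return None                     # few items: count exactly by brute force
    i = 0
    while 2 ** i < f_star_hat:          # smallest i with f̂_* <= 2^i
        i += 1
    return i

eps, W = 0.1, 1024
i = pick_structure(1000, eps, W)
error_bound = (eps / 4) * 2 ** i        # guarantee of the chosen structure
lower_f_star = 1000 / (1 + eps)         # f_* >= f̂_* / (1 + eps) by Equation (7)
print(i, error_bound, eps * lower_f_star)
```

With these numbers the chosen index is i = 10 and the guaranteed error (∊/4)2^{i} = 25.6 is well below ∊f_{*}([t, r_{I}]), which is at least about 90.9, matching the chain of inequalities above.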
We now describe the construction of 𝒟^{Y}_{I,∊}. First, we describe an O((1/∊) (logW)^{2})-space version of the data structure. Then, we show in the next section how to reduce the space to O((1/∊) loglogW logW). In our discussion, we fix λ = ∊Y/logW and κ = (4/∊) logW.
Initially, 𝒟^{Y}_{I,∊} is just the data structure 𝒞_{I,λ,κ}. By Lemma 3, we know that its size is O(f_{*}(I)/λ + κ) = O((f_{*}(I)/(∊Y)) logW + (1/∊) logW), which is O((1/∊) logW) when f_{*}(I) ≤ Y. However, it is much larger than (1/∊) logW when f_{*}(I) ≫ Y, and to maintain small space usage in such a case, we trim 𝒞_{I,λ,κ} by throwing away a significant number of nodes. This is acceptable because 𝒞_{I,λ,κ} only guarantees good estimates for those t ∈ I with f_{*}([t, r_{I}]) ≤ Y. The trimming process is rather tricky. The natural idea of throwing away all the nodes to the left of t when we find f_{*}([t, r_{I}]) > Y does not work because the resulting data structure may return estimates with error larger than the required ∊Y bound. For example, let I = [1, W]. For each item a_{i} ∈ {a_{1}, a_{2}, …, a_{κ−1}}, m = Y/κ copies of (a_{i}, t + 1) arrive at time W + t for every t ∈ [0, W − 1]. Also, m copies of (a, W) arrive at time W + t for every t ∈ [0, W − 1]. Hence, at each time W + t, mκ = Y items with timestamps in [t, W] arrive, m items for each of the κ item names in {a, a_{1}, …, a_{κ−1}}. We are interested in the accuracy of the estimate f̂_{a}([W, W]). It can be verified that at each time W + t, Lines 4–5 of the procedure Process( ) will eventually trivialize Q^{a}_{I,λ} and thus f̂_{a}([W, W]) = 0. Since f_{a}([W, W]) = (t + 1)m, |f̂_{a}([W, W]) − f_{a}([W, W])| = (t + 1)m. When t = 2∊Y/m − 1, the absolute error is 2∊Y, which is larger than the required error bound ∊Y.
To describe the right trimming procedure, we need some basic operations. Consider any $\mathcal{C}_{J,\lambda,\kappa}$ where J = [p, q]. The following operation splits $\mathcal{C}_{J,\lambda,\kappa}$ into two smaller data structures $\mathcal{C}_{J_{\ell},\lambda,\kappa}$ and $\mathcal{C}_{J_{r},\lambda,\kappa}$, where J_{ℓ} = [p, m] and J_{r} = [m + 1, q] with m = ⌊(p + q)/2⌋.
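$\mathcal{D}^{Y}_{I,\epsilon}$.Split($\mathcal{C}_{J,\lambda,\kappa}$)
1: for each non-trivial queue $Q^{a}_{J,\lambda} \in \mathcal{C}_{J,\lambda,\kappa}$ do
2:     if ($Q^{a}_{J,\lambda}$ has only one node N monitoring the whole interval J) then
3:         /* refine J */
4:         insert a new node N′ immediately to the left of N with v(N′) = d(N′) = 0;
5:         i(N′) = J_{ℓ}, and i(N) = J_{r};
6:     end if
7:     divide $Q^{a}_{J,\lambda}$ into two sub-queues $Q^{a}_{J_{\ell},\lambda}$ and $Q^{a}_{J_{r},\lambda}$ where
8:         $Q^{a}_{J_{\ell},\lambda}$ contains the nodes monitoring some sub-intervals of J_{ℓ}, and
9:         $Q^{a}_{J_{r},\lambda}$ contains those monitoring some sub-intervals of J_{r};
10:     put $Q^{a}_{J_{\ell},\lambda}$ in $\mathcal{C}_{J_{\ell},\lambda,\kappa}$ and $Q^{a}_{J_{r},\lambda}$ in $\mathcal{C}_{J_{r},\lambda,\kappa}$.
11: end for
12: /* For a trivial $Q^{a}_{J,\lambda}$, its two children in $\mathcal{C}_{J_{\ell},\lambda,\kappa}$ and $\mathcal{C}_{J_{r},\lambda,\kappa}$ are also trivial. */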
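The Split procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Node` and `split`, and the representation of the structure as a dictionary from item names to left-to-right node lists, are hypothetical and not part of the paper. The sketch also assumes that, apart from a single node covering all of J, no node monitors an interval straddling the midpoint m, and it does not model how trivial queues are stored.

```python
from dataclasses import dataclass


@dataclass
class Node:
    lo: int      # left end of the monitored interval i(N)
    hi: int      # right end of the monitored interval i(N)
    v: int = 0   # value v(N)
    d: int = 0   # debit d(N)


def split(queues, p, q):
    """Split the non-trivial queues of a structure monitoring J = [p, q].

    `queues` maps an item name to its nodes, ordered left to right.
    Returns (left, right): the queues of the two children, monitoring
    [p, m] and [m + 1, q] with m = (p + q) // 2.
    """
    assert p < q, "a one-point interval cannot be split"
    m = (p + q) // 2
    left, right = {}, {}
    for item, nodes in queues.items():
        if len(nodes) == 1 and (nodes[0].lo, nodes[0].hi) == (p, q):
            # Lines 2-6: refine J with an empty node N' for the left half;
            # the original node N keeps its value and debit for the right half.
            n = nodes[0]
            nodes = [Node(p, m, 0, 0), Node(m + 1, q, n.v, n.d)]
        # Assumption: no remaining node straddles the midpoint m.
        assert all(n.hi <= m or n.lo > m for n in nodes)
        # Lines 7-10: distribute the nodes between the two children.
        left[item] = [n for n in nodes if n.hi <= m]
        right[item] = [n for n in nodes if n.lo > m]
    return left, right
```

For instance, a queue holding one node for [1, 8] with value 5 and debit 1 is split into an empty node for [1, 4] in the left child and a node for [5, 8] carrying the value and debit in the right child.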
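We say that $\mathcal{C}_{J_{\ell},\lambda,\kappa}$ and $\mathcal{C}_{J_{r},\lambda,\kappa}$ are the left and right child of $\mathcal{C}_{J,\lambda,\kappa}$, respectively. Figure 3 gives an example of Split($\mathcal{C}_{[1,8],\lambda,\kappa}$), the split of $\mathcal{C}_{[1,8],\lambda,\kappa}$, which has three non-trivial queues $Q^{a}_{I,\lambda}$, $Q^{b}_{I,\lambda}$ and $Q^{c}_{I,\lambda}$, into $\mathcal{C}_{[1,4],\lambda,\kappa}$ and $\mathcal{C}_{[5,8],\lambda,\kappa}$. Note that the queues for b and c in $\mathcal{C}_{[1,4],\lambda,\kappa}$ are trivial and we have not stored them.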
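Using Split( ), we can trim, for example, $\mathcal{C}_{[p,p+1],\lambda,\kappa}$ into $\mathcal{C}_{[p+1,p+1],\lambda,\kappa}$ as follows: split $\mathcal{C}_{[p,p+1],\lambda,\kappa}$ into $\mathcal{C}_{[p,p],\lambda,\kappa}$ and $\mathcal{C}_{[p+1,p+1],\lambda,\kappa}$, and throw away $\mathcal{C}_{[p,p],\lambda,\kappa}$. The following recursive procedure LeftRefine( ) generalizes this idea for larger J: given $\mathcal{C}_{J,\lambda,\kappa} = \mathcal{C}_{[p,q],\lambda,\kappa}$, it returns a list 〈$\mathcal{C}_{J_{0},\lambda,\kappa}$, $\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉 where the J_{i}'s form a partition of [p, q], and J_{0} = [p, p]. After throwing away $\mathcal{C}_{J_{0},\lambda,\kappa}$, the remaining $\mathcal{C}_{J_{i},\lambda,\kappa}$'s together monitor [p + 1, q].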
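$\mathcal{D}^{Y}_{I,\epsilon}$.LeftRefine($\mathcal{C}_{[p,q],\lambda,\kappa}$)
1: if (|[p, q]| = |[p, p]| = 1) then
2:     return 〈$\mathcal{C}_{[p,p],\lambda,\kappa}$〉;
3: else
4:     split $\mathcal{C}_{[p,q],\lambda,\kappa}$ into its left child $\mathcal{C}_{[p,m],\lambda,\kappa}$ and right child $\mathcal{C}_{[m+1,q],\lambda,\kappa}$
5:     /* where m = ⌊(p + q)/2⌋ */;
6:     L = LeftRefine($\mathcal{C}_{[p,m],\lambda,\kappa}$);
7:     suppose L = 〈$\mathcal{C}_{J_{0},\lambda,\kappa}$, $\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{k},\lambda,\kappa}$〉;
8:     return 〈$\mathcal{C}_{J_{0},\lambda,\kappa}$, …, $\mathcal{C}_{J_{k},\lambda,\kappa}$, $\mathcal{C}_{[m+1,q],\lambda,\kappa}$〉;
9: end if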
For example, LeftRefine($\mathcal{C}_{[1,8],\lambda,\kappa}$) gives us the list 〈$\mathcal{C}_{[1,1],\lambda,\kappa}$, $\mathcal{C}_{[2,2],\lambda,\kappa}$, $\mathcal{C}_{[3,4],\lambda,\kappa}$, $\mathcal{C}_{[5,8],\lambda,\kappa}$〉. Note that J_{0} = [p, p] because the recursion stops only when |[p, q]| = 1. The list returned by LeftRefine($\mathcal{C}_{[p,q],\lambda,\kappa}$) has another useful property, which we describe below.
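The interval bookkeeping of LeftRefine( ) can be illustrated with a short Python sketch. It tracks only the intervals, not the queues stored in each structure; the function name `left_refine` and the tuple representation of intervals are assumptions made for this illustration.

```python
def left_refine(p, q):
    """Return the intervals [J_0, J_1, ..., J_k] produced by LeftRefine on [p, q].

    Each interval is a (left, right) tuple. The first interval is always
    the single point (p, p).
    """
    if p == q:
        return [(p, p)]
    m = (p + q) // 2
    # Keep refining the left child; the right child is appended untouched.
    return left_refine(p, m) + [(m + 1, q)]


print(left_refine(1, 8))
```

Running it on [1, 8] reproduces the list of intervals in the example above.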
Given L = 〈$\mathcal{C}_{Z_{1},\lambda,\kappa}$, …, $\mathcal{C}_{Z_{k},\lambda,\kappa}$〉, we say that L is an interesting-partition covering the interval J if (i) the Z_{i}'s are all interesting intervals and form a partition of J; and (ii) for 1 ≤ i < k, Z_{i} is to the left of Z_{i+1}, and $|Z_{i}| \le \frac{1}{2}|Z_{i+1}|$. The fact below can be verified by induction on the length of the list returned by LeftRefine( ).
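Fact 3
Let J = [p, q] be an interesting interval, and L = 〈$\mathcal{C}_{J_{0},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉 be the list returned by LeftRefine($\mathcal{C}_{J,\lambda,\kappa}$). Then, the list 〈$\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉 (i.e., the list obtained by throwing away the head $\mathcal{C}_{J_{0},\lambda,\kappa}$ of L) is an interesting-partition covering [p + 1, q].
For example, if [1, 8] is an interesting interval, then the list 〈$\mathcal{C}_{[2,2],\lambda,\kappa}$, $\mathcal{C}_{[3,4],\lambda,\kappa}$, $\mathcal{C}_{[5,8],\lambda,\kappa}$〉 obtained by throwing away the first element $\mathcal{C}_{[1,1],\lambda,\kappa}$ from LeftRefine($\mathcal{C}_{[1,8],\lambda,\kappa}$) is an interesting-partition covering [2, 8].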
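The example can be checked mechanically. The sketch below tests only the two conditions that can be stated with the intervals alone: that they partition the target interval from left to right, and that each interval is at most half the size of its right neighbour. Whether each interval is "interesting" depends on a definition given earlier in the paper and is not checked; the function name `is_halving_partition` is hypothetical.

```python
def is_halving_partition(intervals, lo, hi):
    """Check that `intervals` partition [lo, hi] left to right and that
    each interval is at most half as long as its right neighbour."""
    if not intervals or intervals[0][0] != lo or intervals[-1][1] != hi:
        return False
    if any(a > b for a, b in intervals):
        return False
    for (a, b), (c, d) in zip(intervals, intervals[1:]):
        if c != b + 1:                      # intervals must be contiguous
            return False
        if 2 * (b - a + 1) > (d - c + 1):   # |Z_i| <= |Z_{i+1}| / 2
            return False
    return True


# The tail of the LeftRefine list satisfies the conditions ...
print(is_halving_partition([(2, 2), (3, 4), (5, 8)], 2, 8))
# ... but the full list does not, since |[1,1]| = |[2,2]|.
print(is_halving_partition([(1, 1), (2, 2), (3, 4), (5, 8)], 1, 8))
```

The second call shows why the head of the list has to be thrown away before the size condition holds.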
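We now give details of $\mathcal{D}^{Y}_{I,\epsilon}$. Initially, it is the interesting-partition 〈$\mathcal{C}_{I,\lambda,\kappa}$〉 covering the whole interval I = [ℓ_{I}, r_{I}]. Throughout the execution, we maintain the following invariant:
(**) $\mathcal{D}^{Y}_{I,\epsilon}$ is an interesting-partition covering some [p, r_{I}] ⊆ I.
When $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa} \rangle$ is covering [p, r_{I}], it only guarantees good estimates of f_{a}([t, r_{I}]) for t ∈ [p, r_{I}], and this estimate is obtained by
$$\hat{f}_{a}([t, r_{I}]) = v(\mathrm{alive}(\mathcal{C}_{J_{h},\lambda,\kappa})^{a}_{\ge t}) + \sum_{h+1 \le i \le m} v(\mathrm{alive}(\mathcal{C}_{J_{i},\lambda,\kappa})^{a})$$
(or equivalently, $\hat{f}_{a}([t, r_{I}]) = v(\mathrm{alive}(Q^{a}_{J_{h},\lambda})_{\ge t}) + \sum_{h+1 \le i \le m} v(\mathrm{alive}(Q^{a}_{J_{i},\lambda}))$), where J_{h} is the interval in {J_{1}, J_{2}, …, J_{m}} that covers t. When an item (a, u) with u ∈ [p, r_{I}] arrives, we find the unique $\mathcal{C}_{J_{i},\lambda,\kappa}$ in $\mathcal{D}^{Y}_{I,\epsilon}$ where u ∈ J_{i}, and update it by calling $\mathcal{C}_{J_{i},\lambda,\kappa}$.Process((a, u)). Note that this update has no effect on the other $\mathcal{C}_{J,\lambda,\kappa}$ in $\mathcal{D}^{Y}_{I,\epsilon}$.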
During execution, we also keep track of the largest timestamp p_{max} ∈ I such that the estimate f̂_{*}([p_{max}, r_{I}]) given by _{I,∊} is greater than (1 + ∊)Y (which implies f_{*}([p_{max}, r_{I}]) > Y because of Equation (7)). As soon as p_{max} falls in the interval covered by $\mathcal{D}^{Y}_{I,\epsilon}$, we use the following procedure to trim $\mathcal{D}^{Y}_{I,\epsilon}$ to cover the smaller interval [p_{max} + 1, r_{I}].
Suppose that L = 〈$\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉 is an interesting-partition covering [p, r_{I}], and t ∈ [p, r_{I}]. Trim(L, t) constructs an interesting-partition covering [t + 1, r_{I}] recursively as follows.
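$\mathcal{D}^{Y}_{I,\epsilon}$.Trim(L, t)
1: find the unique $\mathcal{C}_{J_{i},\lambda,\kappa}$ in L such that t ∈ J_{i};
2: L′ = LeftRefine($\mathcal{C}_{J_{i},\lambda,\kappa}$);
3: suppose L′ = 〈$\mathcal{C}_{K_{0},\lambda,\kappa}$, $\mathcal{C}_{K_{1},\lambda,\kappa}$, …, $\mathcal{C}_{K_{\ell},\lambda,\kappa}$〉;
4: if (K_{0} = [t, t]) then
5:     return 〈$\mathcal{C}_{K_{1},\lambda,\kappa}$, …, $\mathcal{C}_{K_{\ell},\lambda,\kappa}$, $\mathcal{C}_{J_{i+1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉;
6:     /* i.e., throw away $\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{i-1},\lambda,\kappa}$, and $\mathcal{C}_{K_{0},\lambda,\kappa}$, */
7:     /* and return an interesting-partition covering [t + 1, r_{I}]. */
8: else
9:     return Trim(〈$\mathcal{C}_{K_{1},\lambda,\kappa}$, …, $\mathcal{C}_{K_{\ell},\lambda,\kappa}$, $\mathcal{C}_{J_{i+1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉, t).
10:     /* throw away $\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{i-1},\lambda,\kappa}$ and $\mathcal{C}_{K_{0},\lambda,\kappa}$ */
11: end if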
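For example, Figure 4 shows that when $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{[2,2],\lambda,\kappa}, \mathcal{C}_{[3,4],\lambda,\kappa}, \mathcal{C}_{[5,8],\lambda,\kappa} \rangle$, Trim($\mathcal{D}^{Y}_{I,\epsilon}$, 3) returns 〈$\mathcal{C}_{[4,4],\lambda,\kappa}$, $\mathcal{C}_{[5,8],\lambda,\kappa}$〉. Based on Fact 3, it can be verified inductively that after $\mathcal{D}^{Y}_{I,\epsilon} \leftarrow$ Trim($\mathcal{D}^{Y}_{I,\epsilon}$, p_{max}), the new $\mathcal{D}^{Y}_{I,\epsilon}$ is an interesting-partition covering [p_{max} + 1, r_{I}]; that is, Invariant (**) is preserved. In the rest of this section, we analyze the size of $\mathcal{D}^{Y}_{I,\epsilon}$ and the accuracy of its estimates.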
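The effect of Trim( ) on the list of intervals can be sketched in Python. As before, only the intervals are tracked; the names `left_refine` and `trim` and the tuple representation are assumptions of this illustration, and the LeftRefine helper is restated so that the block runs on its own.

```python
def left_refine(p, q):
    """Intervals produced by LeftRefine on [p, q]; the head is (p, p)."""
    if p == q:
        return [(p, p)]
    m = (p + q) // 2
    return left_refine(p, m) + [(m + 1, q)]


def trim(partition, t):
    """Given intervals covering [p, r] (left to right) and t in [p, r],
    return the intervals covering [t + 1, r]."""
    # Line 1: locate the interval J_i containing t.
    i = next(k for k, (lo, hi) in enumerate(partition) if lo <= t <= hi)
    # Lines 2-3: refine J_i; its head K_0 is the leftmost point of J_i.
    refined = left_refine(*partition[i])
    # Everything left of J_i and the head K_0 are thrown away.
    rest = refined[1:] + partition[i + 1:]
    if refined[0] == (t, t):
        return rest
    return trim(rest, t)


print(trim([(2, 2), (3, 4), (5, 8)], 3))
```

The call reproduces the result of the Figure 4 example: the intervals [4, 4] and [5, 8] remain.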
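Let All be the set of all $\mathcal{C}_{J,\lambda,\kappa}$'s that ever exist, i.e., if $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All, then either (i) it is currently in $\mathcal{D}^{Y}_{I,\epsilon}$, or (ii) it has been in $\mathcal{D}^{Y}_{I,\epsilon}$ some time earlier in the execution, but was thrown away during some trimming of $\mathcal{D}^{Y}_{I,\epsilon}$. For any p ∈ I, define
$$\mathrm{All}_{\ge p} = \{\mathcal{C}_{J,\lambda,\kappa} \mid \mathcal{C}_{J,\lambda,\kappa} \in \mathrm{All}, \text{ and } J \text{ covers or is to the right of } p\}.$$
Let v_{add}($\mathcal{C}_{J,\lambda,\kappa}$) be the total value added to the nodes of $\mathcal{C}_{J,\lambda,\kappa}$ during its lifespan. We now derive an upper bound on $\sum_{\mathcal{C}_{J,\lambda,\kappa} \in \mathrm{All}_{\ge p}} v_{add}(\mathcal{C}_{J,\lambda,\kappa})$, which is crucial for getting a tight error bound on the accuracy of $\mathcal{D}^{Y}_{I,\epsilon}$'s estimates.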
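Recall that initially $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{I,\lambda,\kappa} \rangle$ and thus $\mathcal{C}_{I,\lambda,\kappa}$ ∈ All. For any other $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All, $\mathcal{C}_{J,\lambda,\kappa}$ must be a child of some $\mathcal{C}_{H,\lambda,\kappa}$ ∈ All (i.e., $\mathcal{C}_{J,\lambda,\kappa}$ is obtained from Split($\mathcal{C}_{H,\lambda,\kappa}$)). Given $\mathcal{C}_{J,\lambda,\kappa}$ and $\mathcal{C}_{H,\lambda,\kappa}$, we say that $\mathcal{C}_{J,\lambda,\kappa}$ is a descendant of $\mathcal{C}_{H,\lambda,\kappa}$, and $\mathcal{C}_{H,\lambda,\kappa}$ is an ancestor of $\mathcal{C}_{J,\lambda,\kappa}$, if either (i) $\mathcal{C}_{J,\lambda,\kappa}$ is a child of $\mathcal{C}_{H,\lambda,\kappa}$, or (ii) it is a child of one of $\mathcal{C}_{H,\lambda,\kappa}$'s descendants. Note that the original $\mathcal{C}_{I,\lambda,\kappa}$ is an ancestor of every other $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All, and in general, any $\mathcal{C}_{H,\lambda,\kappa}$ ∈ All is an ancestor of every $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All with J ⊂ H. We have the following lemma. (Note that we are abusing the notation here and regard $\mathcal{D}^{Y}_{I,\epsilon}$ as a set.)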
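Lemma 5
Suppose that $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa} \rangle$ is covering [p, r_{I}]. Let anc($\mathcal{D}^{Y}_{I,\epsilon}$) = anc(〈$\mathcal{C}_{J_{1},\lambda,\kappa}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa}$〉) be the set $\{\mathcal{C}_{H,\lambda,\kappa} \mid \mathcal{C}_{H,\lambda,\kappa} \text{ is an ancestor of some } \mathcal{C}_{J_{i},\lambda,\kappa} \in \mathcal{D}^{Y}_{I,\epsilon}\}$. Then,
(1) $\mathrm{All}_{\ge p} \subseteq \mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})$,
(2) v_{add}($\mathcal{C}_{J,\lambda,\kappa}$) ≤ (1 + ∊)Y for any $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All, and
(3) $|\mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})| \le 2\log W$.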
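For (1), it suffices to prove that for any $\mathcal{C}_{J,\lambda,\kappa} \in \mathrm{All}_{\ge p}$, we have $\mathcal{C}_{J,\lambda,\kappa} \in \mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})$. By definition, J covers or is to the right of p; thus J ∩ (J_{1} ∪ ⋯ ∪ J_{m}) = J ∩ [p, r_{I}] ≠ ∅. Since the intervals are interesting and do not cross, there is an i with 1 ≤ i ≤ m such that either (i) J = J_{i}, and thus $\mathcal{C}_{J,\lambda,\kappa} \in \mathcal{D}^{Y}_{I,\epsilon}$, or (ii) J_{i} ⊂ J, which implies that $\mathcal{C}_{J,\lambda,\kappa}$ is an ancestor of $\mathcal{C}_{J_{i},\lambda,\kappa}$, i.e., $\mathcal{C}_{J,\lambda,\kappa} \in \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})$. (It is not possible that J ⊂ J_{i}; otherwise $\mathcal{C}_{J_{i},\lambda,\kappa}$ would have been split and would not be in the current $\mathcal{D}^{Y}_{I,\epsilon}$.) Hence, $\mathcal{C}_{J,\lambda,\kappa} \in \mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})$.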
To prove (2), suppose that J = [x, y] and v_{add}($\mathcal{C}_{J,\lambda,\kappa}$) has just reached (1 + ∊)Y. This implies f_{*}([x, r_{I}]) ≥ (1 + ∊)Y, and so does its estimate f̂_{*}([x, r_{I}]) given by _{I,∊} (as f_{*}([x, r_{I}]) ≤ f̂_{*}([x, r_{I}]), by Equation (7)). Then, the procedure Trim( ) will be called and $\mathcal{C}_{J,\lambda,\kappa}$ will be either thrown away or split, and no more value can be added to $\mathcal{C}_{J,\lambda,\kappa}$. It follows that v_{add}($\mathcal{C}_{J,\lambda,\kappa}$) ≤ (1 + ∊)Y.
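For (3), recall that $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa}, \mathcal{C}_{J_{2},\lambda,\kappa}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa} \rangle$. Among the intervals J_{1}, …, J_{m}, interval J_{1} is the leftmost interval and its left boundary is ℓ_{J_{1}} = p. We now prove that $\mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon}) = \mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{C}_{J_{1},\lambda,\kappa})$, where anc($\mathcal{C}_{J_{1},\lambda,\kappa}$) is the set of ancestors of $\mathcal{C}_{J_{1},\lambda,\kappa}$. Then, together with the facts that $|\mathcal{D}^{Y}_{I,\epsilon}| \le \log W$ (by Property (ii) of interesting-partition) and |anc($\mathcal{C}_{J_{1},\lambda,\kappa}$)| ≤ log W (as each Split operation reduces the size of the interval by half), we have
$$|\mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})| = |\mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{C}_{J_{1},\lambda,\kappa})| \le |\mathcal{D}^{Y}_{I,\epsilon}| + |\mathrm{anc}(\mathcal{C}_{J_{1},\lambda,\kappa})| \le 2\log W.$$
To show $\mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon}) = \mathcal{D}^{Y}_{I,\epsilon} \cup \mathrm{anc}(\mathcal{C}_{J_{1},\lambda,\kappa})$, it suffices to show that for any $\mathcal{C}_{H,\lambda,\kappa} \in \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})$, we have $\mathcal{C}_{H,\lambda,\kappa}$ ∈ anc($\mathcal{C}_{J_{1},\lambda,\kappa}$). Since $\mathcal{C}_{H,\lambda,\kappa} \in \mathrm{anc}(\mathcal{D}^{Y}_{I,\epsilon})$, it is the ancestor of some $\mathcal{C}_{J_{i},\lambda,\kappa} \in \mathcal{D}^{Y}_{I,\epsilon}$. Thus J_{i} = [ℓ_{J_{i}}, r_{J_{i}}] ⊂ H = [ℓ_{H}, r_{H}]. Since $\mathcal{C}_{H,\lambda,\kappa}$ is already an ancestor, it no longer exists, and all the $\mathcal{C}_{J,\lambda,\kappa}$ to its left have been thrown away. Thus, $\mathcal{D}^{Y}_{I,\epsilon}$ has no $\mathcal{C}_{J,\lambda,\kappa}$ where J is to the left of ℓ_{H}. This implies ℓ_{H} ≤ p = ℓ_{J_{1}} and ℓ_{H} ≤ ℓ_{J_{1}} ≤ r_{J_{1}} ≤ r_{J_{i}} ≤ r_{H}. It follows that J_{1} ⊂ H and $\mathcal{C}_{H,\lambda,\kappa}$ is an ancestor of $\mathcal{C}_{J_{1},\lambda,\kappa}$, i.e., $\mathcal{C}_{H,\lambda,\kappa}$ ∈ anc($\mathcal{C}_{J_{1},\lambda,\kappa}$).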
We are now ready to analyze the accuracy of $\mathcal{D}^{Y}_{I,\epsilon}$'s estimates.
Theorem 6
Suppose that $\mathcal{D}^{Y}_{I,\epsilon}$ is covering [p, r_{I}]. For any item a and any t ∈ [p, r_{I}], the estimate f̂_{a}([t, r_{I}]) of f_{a}([t, r_{I}]) obtained by $\mathcal{D}^{Y}_{I,\epsilon}$ satisfies |f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ ∊Y. Furthermore, $\mathcal{D}^{Y}_{I,\epsilon}$ uses $O(\frac{1}{\epsilon}(\log W)^{2})$ space.
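Proof
Let alive($\mathcal{D}^{Y}_{I,\epsilon}$) be the set of nodes currently in $\mathcal{D}^{Y}_{I,\epsilon}$, dead($\mathcal{D}^{Y}_{I,\epsilon}$) the set of those that were in $\mathcal{D}^{Y}_{I,\epsilon}$ earlier in the execution but have been deleted, and node($\mathcal{D}^{Y}_{I,\epsilon}$) = alive($\mathcal{D}^{Y}_{I,\epsilon}$) ∪ dead($\mathcal{D}^{Y}_{I,\epsilon}$). It can be verified that $\hat{f}_{a}([t, r_{I}]) = v(\mathrm{alive}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t})$. Below, we prove that
$$f_{a}([t, r_{I}]) - \frac{2(1+\epsilon)Y}{\kappa}\log W \le v(\mathrm{alive}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}) \le f_{a}([t, r_{I}]) + \lambda\log W. \qquad (8)$$
Recall that we fix λ = ∊Y/log W and $\kappa = \frac{4}{\epsilon}\log W$; the ∊Y error bound follows.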
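For completeness, the substitution behind the ∊Y bound can be written out. The short derivation below plugs the fixed values of λ and κ into the two slack terms of Equation (8); it assumes 0 < ∊ ≤ 1, which is not restated in this section.

```latex
\lambda \log W = \frac{\epsilon Y}{\log W}\cdot \log W = \epsilon Y,
\qquad
\frac{2(1+\epsilon)Y}{\kappa}\log W
  = \frac{2(1+\epsilon)Y \log W}{\frac{4}{\epsilon}\log W}
  = \frac{\epsilon(1+\epsilon)}{2}\,Y
  \le \epsilon Y \quad (0 < \epsilon \le 1).
```

Both the overestimate and the underestimate are therefore at most ∊Y.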
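The proof of the second inequality of Equation (8) is identical to that of Lemma 3, except that we replace all occurrences of $\mathcal{C}_{J,\lambda,\kappa}$ by $\mathcal{D}^{Y}_{I,\epsilon}$. The proof of the first inequality is also similar. We still have
$$f_{a}([t, r_{I}]) - v(\mathrm{alive}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}) \le v(\mathrm{node}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}) - v(\mathrm{alive}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}) = v(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}),$$
which equals $d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t})$. As in Lemma 3, we can derive the bound $d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}) \le \frac{1}{\kappa}v(\mathrm{node}(\mathcal{D}^{Y}_{I,\epsilon})) = \frac{1}{\kappa}f_{*}(I)$, but we can do better here.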
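Observe that any node $N \in \mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t}$ can only be in those $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All_{≥p} (because t ∈ [p, r_{I}]), and when we debit N, if it is in $\mathcal{C}_{J,\lambda,\kappa}$, then we also debit κ − 1 other nodes in $\mathcal{C}_{J,\lambda,\kappa}$ monitoring κ − 1 items other than a. Thus, $\kappa \cdot d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge t})$ is no more than the total value available in the $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All_{≥p}, which is Σ{v_{add}($\mathcal{C}_{J,\lambda,\kappa}$) | $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All_{≥p}}. Together with Lemma 5 we conclude
$$\kappa \cdot d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge p}) \le \sum\{v_{add}(\mathcal{C}_{J,\lambda,\kappa}) \mid \mathcal{C}_{J,\lambda,\kappa} \in \mathrm{All}_{\ge p}\} \le 2(1+\epsilon)Y\log W,$$
and the first inequality of Equation (8) follows.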
For the size of $\mathcal{D}^{Y}_{I,\epsilon}$, similar to the proof of Lemma 3, we can argue that the number of born-rich nodes is only $O(Y/\lambda) = O(\frac{1}{\epsilon}\log W)$, but the number of born-poor nodes can be much larger. A born-poor node of a non-trivial queue is created either when we increase the value of a trivial queue, or when we execute Lines 2–6 of procedure Split. It can be verified that every queue $Q^{a}_{J,\lambda}$ has at most one born-poor node, which is the rightmost node in $Q^{a}_{J,\lambda}$. Since there are O(log W) $\mathcal{C}_{J,\lambda,\kappa}$'s in $\mathcal{D}^{Y}_{I,\epsilon}$ and each has at most κ non-trivial queues, the number of born-poor nodes, and hence the size of $\mathcal{D}^{Y}_{I,\epsilon}$, is $O(\kappa\log W) = O(\frac{1}{\epsilon}(\log W)^{2})$.
To reduce $\mathcal{D}^{Y}_{I,\epsilon}$'s size from $O(\frac{1}{\epsilon}(\log W)^{2})$ to $O(\frac{1}{\epsilon}\log\log W \log W)$, we need to reduce the number of born-poor nodes, or equivalently, the number of non-trivial queues in $\mathcal{D}^{Y}_{I,\epsilon}$. In the next section, we give a simple idea to reduce the number of non-trivial queues and hence the size of $\mathcal{D}^{Y}_{I,\epsilon}$ to $O(\frac{1}{\epsilon}\log\log W \log W)$. In Section 6, we show how to further reduce the size by taking advantage of the tardiness of the data stream.
Reducing the Size of $\mathcal{D}^{Y}_{I,\epsilon}$
Our idea for reducing the size is simple: for every $\mathcal{C}_{J,\lambda,\kappa} \in \mathcal{D}^{Y}_{I,\epsilon}$, its capacity is no longer fixed at $\kappa = \frac{4}{\epsilon}\log W$; instead, we start with a much smaller capacity, namely $\frac{4}{\epsilon}\log\log W$, which is allowed to increase gradually during execution. To determine $\mathcal{C}_{J,\lambda,\kappa}$'s capacity, we use a variable to keep track of the number f̄_{*}(J) of items (a, u) with u ∈ J that have arrived since $\mathcal{C}_{J,\lambda,\kappa}$'s creation. Let v_{J} be the total value of the nodes in $\mathcal{C}_{J,\lambda,\kappa}$ when it is created (v_{J} may not be zero if $\mathcal{C}_{J,\lambda,\kappa}$ results from the splitting of its parent). The capacity of $\mathcal{C}_{J,\lambda,\kappa}$ is determined as follows.
When $(c-1)\frac{Y}{\log W} \le v_{J} + \bar{f}_{*}(J) < c\frac{Y}{\log W}$ for some integer c ≥ 1, the capacity of $\mathcal{C}_{J,\lambda,\kappa}$ is $\kappa(c) = \frac{4c}{\epsilon}\log\log W$, i.e., set κ = κ(c) and allow κ(c) non-trivial queues in $\mathcal{C}_{J,\lambda,\kappa}$.
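The capacity rule can be illustrated with a small Python sketch. The function name `capacity` and its parameter names are hypothetical, logarithms are assumed to be base 2, and the result is rounded up to an integer number of queues; none of these choices is fixed by the text.

```python
import math


def capacity(v_j, arrivals, Y, W, eps):
    """Return (c, kappa_c) for a structure created with total value v_j
    that has since received `arrivals` items with timestamps in J.

    c is the integer with (c - 1) * Y / log W <= v_j + arrivals < c * Y / log W,
    and kappa_c = (4 * c / eps) * log log W.
    Assumption: logarithms are base 2 and kappa_c is rounded up.
    """
    log_w = math.log2(W)
    unit = Y / log_w                       # width of one capacity stage
    c = int((v_j + arrivals) // unit) + 1
    kappa_c = math.ceil(4 * c / eps * math.log2(log_w))
    return c, kappa_c


# With W = 2**16, Y = 1600 and eps = 0.5, each stage spans 100 items.
print(capacity(0, 0, 1600, 2**16, 0.5))     # stage c = 1
print(capacity(50, 200, 1600, 2**16, 0.5))  # stage c = 3
```

The capacity grows linearly in the stage number c, so a structure that has seen few items keeps only a small number of non-trivial queues.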
Note that when we increase the capacity of $\mathcal{C}_{J,\lambda,\kappa}$ to κ(c), we do not need to do anything, except that we allow more non-trivial queues (up to κ(c)) in the data structure. Also note that when $\mathcal{C}_{J,\lambda,\kappa}$ is created during the trimming process, its inherited capacity may be larger than the supposed capacity κ(c); in such a case, we simply debit every non-trivial queue until some queue $Q^{x}_{J,\lambda}$ has $v(Q^{x}_{J,\lambda}) = d(Q^{x}_{J,\lambda})$, and we execute Lines 4 and 5 of the procedure Process( ) to make this queue trivial. We repeat the process until the number of non-trivial queues is at most κ(c). The following theorem asserts that $\mathcal{D}^{Y}_{I,\epsilon}$ maintains the accuracy of its estimates under this new implementation. It also gives the revised size and the update time.
Theorem 7
Suppose that $\mathcal{D}^{Y}_{I,\epsilon}$ is currently covering [p, r_{I}].
(1) For any item a ∈ Σ and any timestamp t ∈ [p, r_{I}], the estimate f̂_{a}([t, r_{I}]) of f_{a}([t, r_{I}]) obtained by the new $\mathcal{D}^{Y}_{I,\epsilon}$ satisfies |f̂_{a}([t, r_{I}]) − f_{a}([t, r_{I}])| ≤ ∊Y.
(2) $\mathcal{D}^{Y}_{I,\epsilon}$ has size $O(\frac{1}{\epsilon}(\log\log W)\log W)$, and supports $O(\log\frac{1}{\epsilon} + \log\log W)$ update time.
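Proof
Suppose that $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa(c_{1})}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa(c_{m})} \rangle$. From the fact that we are using $\mathcal{C}_{J_{i},\lambda,\kappa(c_{i})}$ to monitor J_{i}, we conclude $(c_{i}-1)\frac{Y}{\log W} \le v_{J_{i}} + \bar{f}_{*}(J_{i})$. It follows that $\sum_{1 \le i \le m} c_{i}\frac{Y}{\log W} \le \sum_{1 \le i \le m}(v_{J_{i}} + \bar{f}_{*}(J_{i})) + \sum_{1 \le i \le m}\frac{Y}{\log W}$, which is O(Y) because (i) $|\mathcal{D}^{Y}_{I,\epsilon}| = m = O(\log W)$ and (ii) $\sum_{1 \le i \le m}(v_{J_{i}} + \bar{f}_{*}(J_{i})) = O(Y)$ (otherwise $\mathcal{D}^{Y}_{I,\epsilon}$ would have been trimmed). Thus,
$$\sum_{1 \le i \le m} c_{i} = O(\log W). \qquad (9)$$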
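For Statement (1), the analysis of the accuracy of f̂_{a}([t, r_{I}]) is very similar to that of Theorem 6, except for the following difference: in the proof of Theorem 6, we show that $d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge p}) \le \frac{2(1+\epsilon)Y}{\kappa}\log W$, and since κ is fixed at $\frac{4}{\epsilon}\log W$, $d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge p}) \le \epsilon Y$. Here, we also prove that $d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge p}) \le \epsilon Y$, but we have to prove it differently because the capacities are no longer fixed.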
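As argued previously, any node in $\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge p}$ is in some $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All_{≥p}. Below, we show that for any $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All_{≥p}, we can make at most $\frac{\epsilon Y}{2\log W}$ debit operations to the queue $Q^{a}_{J,\lambda}$ of $\mathcal{C}_{J,\lambda,\kappa}$ during its lifespan. Together with the fact that |All_{≥p}| ≤ 2 log W, we have $d(\mathrm{dead}(\mathcal{D}^{Y}_{I,\epsilon})^{a}_{\ge p}) \le \epsilon Y$.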
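Consider any $\mathcal{C}_{J,\lambda,\kappa}$ ∈ All_{≥p}. Note that the smaller its capacity, the larger the number of debit operations that can be made to the queue $Q^{a}_{J,\lambda}$ of $\mathcal{C}_{J,\lambda,\kappa}$. To maximize the number of debit operations made to $Q^{a}_{J,\lambda}$, suppose that v_{J} = 0 and thus $\mathcal{C}_{J,\lambda,\kappa}$ has the smallest capacity κ(1) when it is created. Before increasing its capacity to κ(2), $\mathcal{C}_{J,\lambda,\kappa}$ can make at most $\frac{1}{\kappa(1)} \cdot \frac{Y}{\log W}$ debit operations to $Q^{a}_{J,\lambda}$. Then, during the next $\frac{Y}{\log W}$ arrivals of items (a, u) with u ∈ J, we have $\frac{Y}{\log W} \le v_{J} + \bar{f}_{*}(J) < \frac{2Y}{\log W}$, the capacity is κ(2), and at most $\frac{1}{\kappa(2)} \cdot \frac{Y}{\log W}$ debit operations can be made to $Q^{a}_{J,\lambda}$. In general, during the period when $(c-1)\frac{Y}{\log W} \le v_{J} + \bar{f}_{*}(J) < c\frac{Y}{\log W}$, at most $\frac{1}{\kappa(c)} \cdot \frac{Y}{\log W}$ debit operations can be made to $Q^{a}_{J,\lambda}$. If the largest capacity is κ(c_{max}), the total number of debit operations made to $Q^{a}_{J,\lambda}$ is at most
$$\frac{Y}{\log W}\left(\frac{1}{\kappa(1)} + \cdots + \frac{1}{\kappa(c_{max})}\right) = \frac{\epsilon Y}{4(\log\log W)\log W}\left(1 + \frac{1}{2} + \cdots + \frac{1}{c_{max}}\right) \le \frac{\epsilon Y(\ln(c_{max}) + 1)}{4(\log\log W)\log W},$$
which is at most $\frac{\epsilon Y}{2\log W}$ because, by Equation (9), c_{max} = O(log W), which implies ln(c_{max}) + 1 ≤ 2 log log W (assuming that W is larger than some constant).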
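The harmonic-sum bound can be checked numerically. The sketch below sums the per-stage debit limits and compares the total with ∊Y/(2 log W). The function name `max_debits_to_queue` is hypothetical, logarithms are assumed to be base 2, and c_max is taken to be exactly log W purely for illustration; the paper only states c_max = O(log W).

```python
import math


def max_debits_to_queue(Y, W, eps, c_max):
    """Sum, over stages c = 1..c_max, of (1 / kappa(c)) * (Y / log W),
    where kappa(c) = (4 * c / eps) * log log W (base-2 logs assumed)."""
    log_w = math.log2(W)
    loglog_w = math.log2(log_w)
    per_stage = Y / log_w                  # arrivals handled in one stage
    return sum(per_stage / (4 * c / eps * loglog_w)
               for c in range(1, c_max + 1))


Y, W, eps = 1600, 2**16, 0.5
total = max_debits_to_queue(Y, W, eps, c_max=16)   # c_max = log W (assumed)
bound = eps * Y / (2 * math.log2(W))
print(total, bound, total <= bound)
```

For these parameters the total stays well below the target bound, in line with the inequality above.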
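We now prove (2). Note that the total number of non-trivial queues in $\mathcal{D}^{Y}_{I,\epsilon}$, and hence the number of born-poor nodes, is at most $\sum_{1 \le i \le m}\kappa(c_{i}) = \sum_{1 \le i \le m}\frac{4c_{i}}{\epsilon}\log\log W$. By Equation (9), $\sum_{1 \le i \le m} c_{i} = O(\log W)$, and it follows that the size of $\mathcal{D}^{Y}_{I,\epsilon}$ is $O(\frac{1}{\epsilon}\log\log W \log W)$.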
For the update time, suppose that an item (a, u) arrives. We can find the $\mathcal{C}_{J_{i},\lambda,\kappa}$ in $\mathcal{D}^{Y}_{I,\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa} \rangle$ with u ∈ J_{i} using O(log m) = O(log log W) time by querying a balanced search tree storing the J_{i}'s. By hashing (e.g., Cuckoo hashing [15], which supports constant update and query time) we can locate the queue $Q^{a}_{J_{i},\lambda} \in \mathcal{C}_{J_{i},\lambda,\kappa}$ in constant time. Then, by consulting an auxiliary balanced search tree on the intervals monitored by the nodes of $Q^{a}_{J_{i},\lambda}$, we can find and update the node N of $Q^{a}_{J_{i},\lambda}$ with u ∈ i(N) using $O(\log(Y/\lambda)) = O(\log\frac{1}{\epsilon} + \log\log W)$ time. At times we may also need to execute Lines 3 and 4 of the procedure Process( ), which debits all the non-trivial queues in $\mathcal{C}_{J_{i},\lambda,\kappa}$. Using the de-amortizing technique given in [16], this step takes constant time.
Note that occasionally, we may also need to clean up $\mathcal{D}^{Y}_{I,\epsilon}$ by calling Trim( ); this step takes time linear in the size of $\mathcal{D}^{Y}_{I,\epsilon}$, which is $O(\frac{1}{\epsilon}(\log\log W)\log W)$.
Further Reducing the Size of $\mathcal{D}^{Y}_{I,\epsilon}$ for Streams with Small Tardiness
Recall that in an out-of-order data stream with tardiness d_{max} ∈ [0, W], any item (a, u) arriving at time τ_{cur} satisfies u ≥ τ_{cur} − d_{max}; in other words, the delay of any item is guaranteed to be at most d_{max}. This section extends $\mathcal{D}^{Y}_{I,\epsilon}$ to a data structure $\mathcal{E}^{Y}_{I,\epsilon}$ that takes advantage of this maximum delay guarantee to reduce the space usage. The idea is as follows. Since no new item has a timestamp smaller than τ_{cur} − d_{max}, we will not make any further change to the nodes to the left of τ_{cur} − d_{max}, and hence we can consolidate these nodes to reduce space substantially. To handle the nodes with timestamps in [τ_{cur} − d_{max}, τ_{cur}], we use the data structure given in Section 5; since it is monitoring an interval of size d_{max} instead of W, its size is $O(\frac{1}{\epsilon}(\log\log d_{max})\log d_{max})$ instead of $O(\frac{1}{\epsilon}(\log\log W)\log W)$.
To implement $\mathcal{E}^{Y}_{I,\epsilon}$, we need a new operation called consolidate. Consider any list of queues $\langle Q^{a}_{J_{1},\lambda}, Q^{a}_{J_{2},\lambda}, \ldots, Q^{a}_{J_{m},\lambda} \rangle$, where J_{1}, J_{2}, …, J_{m} are ordered from left to right and form a partition of the interval J_{1‥m} = J_{1} ∪ ⋯ ∪ J_{m}. We consolidate them into a single queue $Q^{a}_{J_{1‥m},\lambda}$ as follows:
Concatenate the queues into a single queue, in which the nodes preserve the left-right order.
Starting from the leftmost node, check every node N in the queue from left to right; if N is not the rightmost node and v(N) < λ, merge it with the node N′ immediately to its right, i.e., delete N and set v(N′) = v(N) + v(N′), d(N′) = d(N) + d(N′) and i(N′) = i(N) ∪ i(N′).
Note that after the consolidation, the resulting queue $Q^{a}_{J_{1‥m},\lambda}$ has at most one node (the rightmost one) with value smaller than λ.
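The consolidation of the queues of a single item can be sketched in Python. The names `Node` and `consolidate`, and the representation of each queue as a left-to-right list of nodes, are assumptions of this illustration rather than the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    lo: int      # left end of the monitored interval i(N)
    hi: int      # right end of the monitored interval i(N)
    v: int = 0   # value v(N)
    d: int = 0   # debit d(N)


def consolidate(queues, lam):
    """Merge a left-to-right list of queues for one item into one queue.

    A node with value below `lam` that is not the rightmost node is merged
    into its right neighbour: values and debits add up and the intervals
    are joined. The merged node is then inspected in turn.
    """
    nodes = [n for q in queues for n in q]   # step 1: concatenate
    if not nodes:
        return []
    merged = []
    cur = Node(nodes[0].lo, nodes[0].hi, nodes[0].v, nodes[0].d)
    for nxt in nodes[1:]:
        if cur.v < lam:
            # step 2: absorb the light node into its right neighbour
            cur = Node(cur.lo, nxt.hi, cur.v + nxt.v, cur.d + nxt.d)
        else:
            merged.append(cur)
            cur = Node(nxt.lo, nxt.hi, nxt.v, nxt.d)
    merged.append(cur)                       # the rightmost node is kept as is
    return merged


q1 = [Node(1, 1, 1), Node(2, 2, 2), Node(3, 3, 3), Node(4, 4, 5)]
q2 = [Node(5, 6, 1), Node(7, 8, 1)]
for n in consolidate([q1, q2], lam=4):
    print(n)
```

In this run with λ = 4, the three light nodes on the left collapse into one node for [1, 3], and only the rightmost node of the result has value below λ.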
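Given the list 〈$\mathcal{C}_{J_{1},\lambda,\kappa(c_{1})}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa(c_{m})}$〉, we consolidate them into $\mathcal{C}_{J_{1‥m},\lambda,\frac{1}{\epsilon}}$ by first consolidating, for each item a, the queues $Q^{a}_{J_{1},\lambda}, \ldots, Q^{a}_{J_{m},\lambda}$ in $\mathcal{C}_{J_{1},\lambda,\kappa(c_{1})}$, …, $\mathcal{C}_{J_{m},\lambda,\kappa(c_{m})}$ into the queue $Q^{a}_{J_{1‥m},\lambda}$ and putting it in $\mathcal{C}_{J_{1‥m},\lambda,\frac{1}{\epsilon}}$. Then, we apply Lines 3–5 of procedure Process( ) repeatedly to reduce the number of non-trivial queues in the data structure to $\frac{1}{\epsilon}$.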
We are now ready to describe how to extend $\mathcal{D}^{Y}_{I,\epsilon}$ to $\mathcal{E}^{Y}_{I,\epsilon}$. In our discussion, we fix $\lambda = \frac{\epsilon Y}{\log d_{max}}$, and without loss of generality, we assume that I = [1, W]. Recall that p_{max} denotes the largest timestamp in I such that f̂_{*}([p_{max}, r_{I}]) > (1 + ∊)Y (which implies f_{*}([p_{max}, r_{I}]) > Y). We partition I into sub-windows I_{1}, I_{2}, …, I_{m}, each of size d_{max} (i.e., I_{i} = [(i − 1)d_{max} + 1, id_{max}]). We divide the execution into different periods according to τ_{cur}, the current time.
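During the 1st period, when τ_{cur} ∈ [1, d_{max}] = I_{1}, $\mathcal{E}^{Y}_{I,\epsilon}$ simply is $\mathcal{D}^{Y}_{I_{1},\epsilon}$.
During the 2nd period, when τ_{cur} ∈ I_{2}, $\mathcal{E}^{Y}_{I,\epsilon}$ maintains $\mathcal{D}^{Y}_{I_{2},\epsilon}$ in addition to $\mathcal{D}^{Y}_{I_{1},\epsilon}$.
During the 3rd period, when τ_{cur} ∈ I_{3}, $\mathcal{E}^{Y}_{I,\epsilon}$ maintains $\mathcal{D}^{Y}_{I_{3},\epsilon}$ in addition to $\mathcal{D}^{Y}_{I_{2},\epsilon}$. Also, $\mathcal{D}^{Y}_{I_{1},\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa(c_{1})}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa(c_{m})} \rangle$ is consolidated into $\mathcal{C}_{I_{1},\lambda,\frac{1}{\epsilon}}$.
In general, during the ith period, when τ_{cur} ∈ [(i − 1)d_{max} + 1, id_{max}] = I_{i}, $\mathcal{E}^{Y}_{I,\epsilon}$ maintains $\mathcal{D}^{Y}_{I_{i-1},\epsilon}$ and $\mathcal{D}^{Y}_{I_{i},\epsilon}$, and also $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$ where I_{1‥i−2} = I_{1} ∪ I_{2} ∪ ⋯ ∪ I_{i−2}. Observe that in this period, no item (a, u) with u ∈ I_{1‥i−2} arrives (because the tardiness is d_{max}), and thus we do not need to update $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$. However, we keep throwing away any node N in $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$ as soon as we know that i(N) is to the left of p_{max} + 1.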
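When entering the (i + 1)st period, we do the following: keep $\mathcal{D}^{Y}_{I_{i},\epsilon}$, create $\mathcal{D}^{Y}_{I_{i+1},\epsilon}$, merge $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$ with $\mathcal{D}^{Y}_{I_{i-1},\epsilon} = \langle \mathcal{C}_{J_{1},\lambda,\kappa(c_{1})}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa(c_{m})} \rangle$, and then get $\mathcal{C}_{I_{1‥i-1},\lambda,\frac{1}{\epsilon}}$ by consolidating $\langle \mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}, \mathcal{C}_{J_{1},\lambda,\kappa(c_{1})}, \ldots, \mathcal{C}_{J_{m},\lambda,\kappa(c_{m})} \rangle$.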
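The period scheme can be summarized by listing which sub-windows are kept in full detail and which range has been consolidated. The Python sketch below does only this bookkeeping; the function name `maintained_in_period` and the returned tuple layout are hypothetical, and sub-window i is taken to be [(i − 1)d_max + 1, i·d_max].

```python
def maintained_in_period(i, d_max):
    """Describe what is maintained during period i (i >= 1).

    Returns (consolidated, active):
      consolidated: the range I_1 .. I_{i-2} held in a single consolidated
                    structure, as a (left, right) tuple, or None if i < 3;
      active:       the sub-windows I_{i-1} and I_i that are still kept
                    in full detail, as (left, right) tuples.
    """
    def window(k):
        return ((k - 1) * d_max + 1, k * d_max)

    active = [window(k) for k in (i - 1, i) if k >= 1]
    consolidated = (1, (i - 2) * d_max) if i >= 3 else None
    return consolidated, active


for period in (1, 2, 3, 5):
    print(period, maintained_in_period(period, d_max=10))
```

At any time at most two detailed structures exist, each over an interval of size d_max, which is where the space saving comes from.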
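Given any t ∈ [p_{max} + 1, r_{I}], the estimate of f_{a}([t, r_{I}]) given by $\mathcal{E}^{Y}_{I,\epsilon}$ is
$$\hat{f}_{a}([t, r_{I}]) = v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}).$$
The following theorem gives the accuracy of $\hat{f}_{a}([t, r_{I}])$, as well as $\mathcal{E}^{Y}_{I,\epsilon}$'s size and its update time.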
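Theorem 8
(1) For any t ∈ [p_{max} + 1, r_{I}], the estimate f̂_{a}([t, r_{I}]) given by $\mathcal{E}^{Y}_{I,\epsilon}$ satisfies
$$f_{a}([t, r_{I}]) - 2\epsilon Y \le \hat{f}_{a}([t, r_{I}]) \le f_{a}([t, r_{I}]) + 2\epsilon Y.$$
(2) $\mathcal{E}^{Y}_{I,\epsilon}$ has size $O(\frac{1}{\epsilon}(\log\log d_{max})\log d_{max})$, and supports $O(\log\frac{1}{\epsilon} + \log\log d_{max})$ update time.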
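Proof
Recall that I is partitioned into sub-intervals I_{1}, I_{2}, …, I_{m}. Suppose that t ∈ I_{κ}. Note that if we had not performed any consolidation,
$$v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) = v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{\kappa},\epsilon})^{a}_{\ge t}) + \sum_{\kappa+1 \le i \le m} v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{i},\epsilon})^{a}).$$
Note that for κ + 1 ≤ i ≤ m, $v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{i},\epsilon})^{a}) \le f_{a}(I_{i})$, and for $v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{\kappa},\epsilon})^{a}_{\ge t})$, since |I_{κ}| = d_{max}, the same argument used in the proof of Lemma 3 gives us $v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{\kappa},\epsilon})^{a}_{\ge t}) \le f_{a}([t, r_{I_{\kappa}}]) + \lambda\log d_{max}$. Hence
$$v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) = v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{\kappa},\epsilon})^{a}_{\ge t}) + \sum_{\kappa+1 \le i \le m} v(\mathrm{alive}(\mathcal{D}^{Y}_{I_{i},\epsilon})^{a}) \le f_{a}([t, r_{I_{\kappa}}]) + \lambda\log d_{max} + \sum_{\kappa+1 \le i \le m} f_{a}(I_{i}) = f_{a}([t, r_{I}]) + \lambda\log d_{max}. \qquad (10)$$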
The consolidation step may add errors to $v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t})$. To get a bound on them, let N_{1}, N_{2}, … be the nodes for a in $\mathcal{E}^{Y}_{I,\epsilon}$, ordered from left to right. Suppose that t ∈ i(N_{h}). Note that
the consolidation step will add at most λ units to v(N_{h}) before we move on to consider the node immediately to its right, and
for a node N_{i} with i ≥ h + 1, any node N that has been merged into N_{i} must be to the right of N_{h}, and thus is to the right of t; it follows that N is already contributing v(N) to $v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t})$ in Equation (10), and its merging will not make any change.
In conclusion, the consolidation steps introduce at most λ extra errors, and Equation (10) becomes $v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) \le f_{a}([t, r_{I}]) + \lambda\log d_{max} + \lambda \le f_{a}([t, r_{I}]) + 2\epsilon Y$, which is the second inequality of the theorem.
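To prove the first inequality, suppose that we ask for the estimate f̂_{a}([t, r_{I}]) during the ith period, when we have $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$, $\mathcal{D}^{Y}_{I_{i-1},\epsilon}$ and $\mathcal{D}^{Y}_{I_{i},\epsilon}$. Recall that $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$ comes from consolidating $\mathcal{D}^{Y}_{I_{1},\epsilon}, \mathcal{D}^{Y}_{I_{2},\epsilon}, \ldots, \mathcal{D}^{Y}_{I_{i-2},\epsilon}$. As in all our previous analyses, we have
$$f_{a}([t, r_{I}]) - v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) \le v(\mathrm{node}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) - v(\mathrm{alive}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) = d(\mathrm{dead}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}).$$
(Note that the merging of nodes during consolidations would not take away any value.) To get a bound on $d(\mathrm{dead}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t})$, suppose that p_{max} ∈ I_{κ}. Then, all the nodes to the left of I_{κ} have been thrown away. Among $\mathcal{D}^{Y}_{I_{\kappa},\epsilon}, \mathcal{D}^{Y}_{I_{\kappa+1},\epsilon}, \ldots, \mathcal{D}^{Y}_{I_{m},\epsilon}$, only $\mathcal{D}^{Y}_{I_{\kappa},\epsilon}$ may have been trimmed. Note that
Thus, $d(\mathrm{dead}(\mathcal{E}^{Y}_{I,\epsilon})^{a}_{\ge t}) \le 2\epsilon Y$, and the first inequality follows.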
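For Statement (2), note that both $\mathcal{D}^{Y}_{I_{i-1},\epsilon}$ and $\mathcal{D}^{Y}_{I_{i},\epsilon}$ have size $O(\frac{1}{\epsilon}\log\log d_{max}\log d_{max})$ (by Theorem 7, and |I_{i−1}| = |I_{i}| = d_{max}), and $\mathcal{C}_{I_{1‥i-2},\lambda,\frac{1}{\epsilon}}$ has size $O(Y/\lambda + \frac{1}{\epsilon}) = O(\frac{1}{\epsilon}\log d_{max})$; thus the size of $\mathcal{E}^{Y}_{I,\epsilon}$ is $O(\frac{1}{\epsilon}\log\log d_{max}\log d_{max})$. For the update time, it suffices to note that it is dominated by the update times of $\mathcal{D}^{Y}_{I_{i-1},\epsilon}$ and $\mathcal{D}^{Y}_{I_{i},\epsilon}$.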
Figures and Table
Suppose that λ = 4. (i) shows the queue $Q^{a}_{I,\lambda}$ before the arrivals of items (a, 1), (a, 2), (a, 3), (a, 8); (ii) is the resulting queue after the updates for these items; (iii) shows that after the arrival of another item (a, 1), the first node in (ii) is updated and refined.
The space complexity for answering ∊-approximate frequent item set queries in a sliding time window. Results from this paper are marked with [†]. Note that we assume $B \ge \frac{1}{\epsilon}\log W$; otherwise, we can always store all items in the window for an exact answer, using $O(\frac{1}{\epsilon}\log W)$ words. Similarly, for the result with tardiness, we assume $B \ge \frac{1}{\epsilon}\log d_{max}$.
| Stream model | Space Complexity (words) |
| Synchronous [7] | $O(\frac{1}{\epsilon}\log(\epsilon B))$ |
| Asynchronous [1] | $O(\frac{1}{\epsilon}\log W \log(\frac{\epsilon B}{\log W})\min\{\log W, \frac{1}{\epsilon}\}\log|U|)$ |
| Asynchronous [†] | $O(\frac{1}{\epsilon}\log W \log(\frac{\epsilon B}{\log W})\log\log W)$ |
| Asynchronous with tardiness [†] | $O(\frac{1}{\epsilon}\log d_{max}\log(\frac{\epsilon B}{\log d_{max}})\log\log d_{max})$ |
H.F. Ting is partially supported by the GRF Grant HKU-716307E; T.W. Lam is partially supported by the GRF Grant HKU-713909E.
References
[1] Cormode, G.; Korn, F.; Tirthapura, S. Time-Decaying Aggregates in Out-of-Order Streams. Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'08, Vancouver, Canada, 9–11 June 2008; pp. 89–98.
[2] Karp, R.; Shenker, S.; Papadimitriou, C. A simple algorithm for finding frequent elements in streams and bags.
[3] Demaine, E.; Lopez-Ortiz, A.; Munro, J. Frequency Estimation of Internet Packet Streams with Limited Space. Proceedings of the 10th Annual European Symposium, ESA'07, Rome, Italy, 17–21 September 2002; pp. 348–360.
[4] Muthukrishnan, S.
[5] Babcock, B.; Babu, S.; Datar, M.; Motwani, R.; Widom, J. Models and Issues in Data Stream Systems. Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS'02, Madison, WI, USA, 3–5 June 2002; pp. 1–16.
[6] Arasu, A.; Manku, G. Approximate Counts and Quantiles over Sliding Windows. Proceedings of the 23th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'04, Paris, France, 14–16 June 2004; pp. 286–296.
[7] Lee, L.K.; Ting, H.F. A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows. Proceedings of the PODS, Chicago, Illinois, USA, June 26–28, 2006; pp. 290–297.
[8] Lee, L.K.; Ting, H.F. Maintaining Significant Stream Statistics over Sliding Windows. Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'06, Miami, FL, USA, 22–26 January 2006; pp. 724–732.
[9] Datar, M.; Gionis, A.; Indyk, P.; Motwani, R. Maintaining stream statistics over sliding windows.
[10] Tirthapura, S.; Xu, B.; Busch, C. Sketching Asynchronous Streams over a Sliding Window. Proceedings of the 25th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC'06, Denver, CO, USA, 23–26 July 2006; pp. 82–91.
[11] Busch, C.; Tirthapura, S. A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window. Proceedings of the 24th Annual Symposium on Theoretical Aspects of Computer Science, STACS'07, Aachen, Germany, 22–24 February 2007; pp. 465–475.
[12] Cormode, G.; Tirthapura, S.; Xu, B. Time-decaying sketches for robust aggregation of sensor data.
[13] Chan, H.L.; Lam, T.W.; Lee, L.K.; Ting, H.F. Approximating Frequent Items in Asynchronous Data Stream over a Sliding Window. Proceedings of the 7th Workshop on Approximation and Online Algorithms, WAOA'09, Copenhagen, Denmark, 10–11 September 2009; pp. 49–61.
[14] Misra, J.; Gries, D. Finding repeated elements.
[15] Arbitman, Y.; Naor, M.; Segev, G. De-amortized Cuckoo Hashing: Provable Worst-Case Performance and Experimental Results. Proceedings of the 36th International Colloquium, ICALP'09, Rhodes, Greece, 5–12 July 2009; pp. 107–118.
[16] Hung, R.S.; Lee, L.K.; Ting, H.F. Finding frequent items over sliding windows with constant update time.