Article

Approximating Frequent Items in Asynchronous Data Stream over a Sliding Window

1 Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong, China
2 MADALGO (Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation), Department of Computer Science, Aarhus University, Aarhus C DK-8000, Denmark
* Author to whom correspondence should be addressed.
Algorithms 2011, 4(3), 200-222; https://doi.org/10.3390/a4030200
Submission received: 23 June 2011 / Revised: 23 June 2011 / Accepted: 10 September 2011 / Published: 22 September 2011

Abstract

In an asynchronous data stream, the data items may be out of order with respect to their original timestamps. This paper studies the space complexity required by a data structure to maintain such a data stream so that it can approximate the set of frequent items over a sliding time window with sufficient accuracy. Prior to our work, the best solution was given by Cormode et al. [1], who gave an O((1/∊) log W log(∊B/log W) min{log W, 1/∊} log |U|)-space data structure that can approximate the frequent items within an error bound ∊, where W and B are parameters of the sliding window, and U is the set of all possible item names. We give a more space-efficient data structure that only requires O((1/∊) log W log(∊B/log W) log log W) space.

1. Introduction

Identifying frequent items in a massive data stream has many applications in data mining and network monitoring, and the problem has been studied extensively [2-5]. Recent interest has shifted from the statistics of the whole data stream to those of a sliding window of recent data [6-9]. In most applications, the amount of data in a window is gigantic compared with the amount of memory available in the processing units. It is impossible to store all the data and then find the exact frequent items. Existing research has therefore focused on designing space-efficient data structures that support finding approximate frequent items. The key concern is how to minimize the space needed to achieve a required level of accuracy.

1.1. Asynchronous Data Stream

Most of the previous work on data streams assumes that items in a data stream are synchronous, in the sense that the order of their arrivals is the same as the order of their creations. This synchronous model is, however, not suitable for applications that are distributed in nature. For example, in a sensor network, the sink collects data transmitted from sensors over a large area, and the data transmitted from different sensors suffer different delays. It is possible that an item created at time t at one sensor arrives at the sink later than an item created after t at another sensor. From the sink's viewpoint, items in the data stream are out of order with respect to their creation times. Yet the statistics to be computed are usually based on the creation times. More specifically, an asynchronous data stream (a.k.a. out-of-order data stream) [1,10,11] can be considered as a sequence (a1, t1), (a2, t2), (a3, t3), …, where ai is the name of a data item chosen from a fixed universe U, and ti is an integer timestamp recording the creation time of this item. Items arrive in arbitrary order with respect to their timestamps, and more than one data item may have the same timestamp.

1.2. Previous Work on Approximating Frequent Items

Consider a data stream and, in particular, those data items whose timestamps fall into the last W time units (W is the size of the sliding window). An item (or more precisely, an item name) is said to be a frequent item if its count (i.e., the number of occurrences) exceeds a certain required fraction of the total item count. Arasu and Manku [6] were the first to study approximating frequent items over a sliding window under the synchronous model, in which data items arrive in non-decreasing order of timestamps. The space complexity of their data structure is O((1/∊)(log(1/∊))² log(∊B)), where ∊ is a user-specified error bound and B is the maximum number of items with timestamps falling into the same sliding window. Their work was later improved by Lee and Ting [7] to O((1/∊) log(∊B)) space. Recently, Cormode et al. [1] initiated the study of frequent items under the asynchronous model, and gave a solution with space complexity O((1/∊) log W log(∊B/log W) min{log W, 1/∊} log |U|), where U is the set of possible item names. Later, Cormode et al. [12] gave a hashing-based randomized solution using O((1/∊²) log |U|) space. Its space complexity is quadratic in 1/∊, which is less desirable, but it is a general solution that can also solve other problems such as finding the sum and quantiles.

The earlier work on asynchronous data streams focused on a relatively simpler problem called ∊-approximate basic counting [10,11]. Cormode et al. [1] improved the space complexity of basic counting to O((1/∊) log W log(∊B/log W)). Notice that under the synchronous model, the best data structure requires O((1/∊) log(∊B)) space [9]. It is believed that there is roughly a gap of log W between the synchronous model and the asynchronous model. Yet, for frequent items, the asynchronous result of Cormode et al. [1] has space complexity far larger than that of the best synchronous result, which is O((1/∊) log(∊B)) [7]. This motivates us to study more space-efficient solutions for approximating frequent items in the asynchronous model.

1.3. Formal Definition of Approximate Frequent Item Set

For any time interval I and any data item a, let fa(I) denote the frequency of item a in interval I, i.e., the number of arrived items named a with timestamps falling into I. Define f*(I) = Σ_{a∈U} fa(I) to be the total number of all arrived items with timestamps within I.

Given a user-specified error bound ∊ and a window size W, we want to maintain a data structure to answer any ∊-approximate frequent item set query for any sub-window (specified at query time), which is in the form (ϕ, W′) where ϕ ∈ [∊, 1] is the required threshold and W′ ≤ W is the sub-window size. Suppose that τcur is the current time. The answer to such a query is a set S of item names satisfying the following two conditions:

  • (C1) S contains every item a whose frequency in interval I = [τcur − W′ + 1, τcur] is at least ϕf*(I), i.e., fa(I) ≥ ϕf*(I).

  • (C2) For any item a in S, its frequency in interval I is at least (ϕ − ∊)f*(I), i.e., fa(I) ≥ (ϕ − ∊)f*(I).

The set S is also called an ∊-approximate ϕ-frequent item set. For example, assume ∊ = 1%; then the query (10%, 10000) would return all items whose frequencies in the last 10000 time units are each at least 10% of the total item count, plus possibly some other items with frequency at least 9% of the total count.
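To make the two conditions concrete, here is a minimal Python sketch (ours, purely illustrative) that checks whether a candidate answer S satisfies (C1) and (C2) when the exact frequencies are available; in the streaming setting these exact counts are of course what we cannot afford to store.

from collections import Counter

def is_valid_answer(S, items, phi, eps):
    """Check (C1) and (C2) for a candidate answer S, given the exact
    multiset `items` of names whose timestamps fall in the window I."""
    freq = Counter(items)        # f_a(I) for every item a
    total = sum(freq.values())   # f*(I)
    # (C1): every item with f_a(I) >= phi * f*(I) must be in S.
    if any(f >= phi * total and a not in S for a, f in freq.items()):
        return False
    # (C2): every item in S must satisfy f_a(I) >= (phi - eps) * f*(I).
    return all(freq[a] >= (phi - eps) * total for a in S)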

1.4. Our Contribution

This paper gives a more space-efficient data structure for answering any ∊-approximate frequent item set query. Our data structure uses O((1/∊) log W log(∊B/log W) log log W) words, which is significantly smaller than the one given by Cormode et al. [1] (see Table 1). Furthermore, this space complexity is larger than that of the best synchronous solution by only a factor of O(log W log log W), which is close to the expected gap of O(log W). Similar to existing data structures for this problem, it takes time linear in the data structure's size to answer an ∊-approximate frequent item set query. Furthermore, it takes O(log(∊B/log W)(log(1/∊) + log log W)) time to modify the data structure for a new data item. Occasionally, we might need to clean up some old data items that are no longer significant to the approximation; in the worst case, this takes time linear in the size of the data structure, and thus is no greater than the query time. As a remark, the solution of Cormode et al. [1] requires O(log(∊B/log W) log W log log |U|) time for an update.

In the asynchronous model, if a data item has a delay of more than W time units, it can be discarded immediately when it arrives. In many applications, the delay is usually small. This motivates us to extend the asynchronous model to consider data items that have a bounded delay. We say that an asynchronous data stream has tardiness dmax if a data item created at time t must arrive at the stream no later than time t + dmax. If we set dmax = 0, the model becomes the synchronous model. If we allow dmax ≥ W, this is in essence the asynchronous model studied above. We adapt our data structure to take advantage of small tardiness, so that when dmax is small, it uses less space (see Table 1) and supports a faster update time, namely O(log(∊B/log dmax)(log(1/∊) + log log dmax)). In particular, when dmax = Θ(1), the size and update time of our data structure match those of the best data structure for synchronous data streams.

Remark

This paper is a corrected version of a paper with the same title in WAOA 2009 [13]; in particular, the error bound on the estimates was given incorrectly before and is fixed in this version.

1.5. Technical Digest

To solve the frequent item set problem, we need to estimate the frequency of any item with error at most ∊f*(I), where I = [τcur − W + 1, τcur] is the interval covered by the sliding window. To this end, we first propose a simple data structure for estimating the frequency of a fixed item over the sliding window. Then, we adapt a technique of Misra and Gries [14] to extend our data structure to handle any item. The result is an O(f*(I)/λ)-space data structure that allows us to obtain an estimate for any item with an error bound of about λ log W. Here λ is a design parameter. To ensure that λ log W is no greater than ∊f*(I), we should set λ ≈ ∊f*(I)/log W. Since f*(I) can be as small as Θ((1/∊) log W) (the case of smaller f*(I) can be handled by brute force), we need to be conservative and set λ to some constant. But then the size of the data structure can be Θ(B), because f*(I) can be as large as B. To reduce space, we introduce a multi-resolution approach. Instead of using one single data structure, we maintain a collection of O(log B) copies of our data structure, each using a distinct, carefully chosen parameter λ so that it can estimate the frequent item set with sufficient accuracy when f*(I) is in a particular range. The resulting data structure uses O((1/∊) log W log B) space.

Unfortunately, a careful analysis of our data structure reveals that in the worst case, it can only guarantee estimates with an error bound of ∊f*(H ∪ I), where H = [τcur − 2W + 1, τcur − W], not the required ∊f*(I). The reason is that the error of its estimates over I depends on the number of updates made during I, and unlike in a synchronous data stream, this number can be significantly larger than f*(I) in an asynchronous data stream. For example, at time τcur − W + 1, there may still be many new items (a, u) with timestamps u ∈ H, for which we must update our data structure to get good estimates when the sliding window is at earlier positions. Indeed, the number of updates during I can be as large as f*(H ∪ I), and this gives an error bound of ∊f*(H ∪ I).

To reduce the error bound to ∊f*(I), we introduce a novel algorithm that splits the data structure into independent smaller ones at appropriate times. For example, at time τcur − W + 1, we can split our data structure into two smaller ones, DH and DI, and we will only update DH for items (a, u) with u ∈ H and update DI for those with u ∈ I. Then, when we need to find an estimate on I at time τcur, we only need to consult DI, and the number of updates made to it is f*(I). In this paper, we develop sophisticated procedures to decide when and how to split the data structure so as to enable us to get good enough estimates as the sliding window moves continuously. The resulting data structure has size O((1/∊)(log W)² log(∊B/log W)). Then, we further make the data structure adaptive to the input size, allowing us to reduce the space to O((1/∊)(log log W) log W log(∊B/log W)).

2. Preliminaries

Our data structures for the frequent item set problem depend on data structures for the following two related data stream problems. Let 0 < ∊ < 1 be any real number, and τcur be the current time.

  • The ∊-approximate basic counting problem asks for a data structure that allows us to obtain, for any interval I = [τcur − W′ + 1, τcur] where W′ ≤ W, an estimate f̂*(I) of f*(I) such that |f̂*(I) − f*(I)| ≤ ∊f*(I).

  • The ∊-approximate counting problem asks for a data structure that allows us to obtain, for any item a and any interval I = [τcur − W′ + 1, τcur] where W′ ≤ W, an estimate f̂a(I) of fa(I) such that |f̂a(I) − fa(I)| ≤ ∊f*(I).

As mentioned in Section 1, Cormode et al. [1] gave an O((1/∊) log W log(∊B/log W))-space data structure B for solving the ∊-approximate basic counting problem. In this paper, we give an O((1/∊) log W log(∊B/log W) log log W)-space data structure A for solving the harder ∊-approximate counting problem. The theorem below shows how to use these two data structures to answer an ∊-approximate frequent item set query.

Theorem 1

Let ∊0 = ∊/4. Given B_{∊0} and A_{∊0}, we can answer any ∊-approximate frequent item set query. The total space required is O((1/∊) log W log(∊B/log W) log log W).

Proof

The space requirement is obvious. Consider any ∊-approximate frequent item set query (ϕ, W′) where ϕ ≤ 1 and W′ ≤ W. Let I = [τcur − W′ + 1, τcur]. Since ∊0 = ∊/4, the estimates given by B_{∊0} satisfy |f̂*(I) − f*(I)| ≤ (∊/4)f*(I), and for any item a, the estimates given by A_{∊0} satisfy |f̂a(I) − fa(I)| ≤ (∊/4)f*(I). To answer the query (ϕ, W′), we return the set

Sϕ = {a | f̂a(I) ≥ (ϕ − ∊/2)f̂*(I)}
which satisfies the required conditions (C1) and (C2) because
  • for any item a with fa(I) ≥ ϕf*(I), f̂a(I) ≥ fa(I) − (∊/4)f*(I) ≥ (ϕ − ∊/4)f*(I) ≥ (ϕ − ∊/4)(1/(1 + ∊/4))f̂*(I) ≥ (ϕ − ∊/4)(1 − ∊/4)f̂*(I) ≥ (ϕ − ∊/2)f̂*(I), and thus a ∈ Sϕ; hence (C1) is satisfied, and

  • for every a ∈ Sϕ, we have fa(I) ≥ f̂a(I) − (∊/4)f*(I) ≥ (ϕ − ∊/2)f̂*(I) − (∊/4)f*(I) ≥ (ϕ − ∊/2)(1 − ∊/4)f*(I) − (∊/4)f*(I) ≥ (ϕ − ∊)f*(I); thus (C2) is satisfied.
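The query step of this proof is a single thresholding pass over the estimates. A minimal sketch, assuming est_freq maps candidate items to the estimates f̂a(I) from A_{∊0} and est_total is the estimate f̂*(I) from B_{∊0}:

def frequent_item_set(est_freq, est_total, phi, eps):
    """Return S_phi = {a : est of f_a(I) >= (phi - eps/2) * est of f*(I)}."""
    threshold = (phi - eps / 2) * est_total
    return {a for a, f in est_freq.items() if f >= threshold}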

The building block of A is a data structure that counts items over some fixed interval (instead of the sliding window). For any interval I = [ℓI, rI] of size W, Theorem 4 in Section 4 gives a data structure A_{I,∊} that uses O((1/∊) log W log(∊B/log W) log log W) space, supports O(log(∊B/log W)(log(1/∊) + log log W)) update time, and enables us to obtain, for any item a and any time t ∈ I, an estimate f̂a([t, rI]) of fa([t, rI]) such that

|f̂a([t, rI]) − fa([t, rI])| ≤ ∊f*([t, rI])     (1)

Given A_{I1,∊}, A_{I2,∊}, …, where Ii = [(i − 1)W + 1, iW], we can obtain, for any W′ ≤ W, an estimate f̂a([s, τcur]) of fa([s, τcur]), where s = τcur − W′ + 1, as follows.

  • Let Ii and Ii+1 be the intervals such that [s, τcur] ⊆ Ii ∪ Ii+1.

  • Use A_{Ii,∊} to get an estimate f̂a([s, iW]) of fa([s, iW]), and A_{Ii+1,∊} to get an estimate f̂a([iW + 1, (i + 1)W]) of fa([iW + 1, (i + 1)W]).

  • Our estimate is f̂a([s, τcur]) = f̂a([s, iW]) + f̂a([iW + 1, (i + 1)W]).

By Equation (1), we have

|f̂a([s, iW]) − fa([s, iW])| ≤ ∊f*([s, iW])     (2)
and
|f̂a([iW + 1, (i + 1)W]) − fa([iW + 1, (i + 1)W])| ≤ ∊f*([iW + 1, (i + 1)W])     (3)

Observe that any item that arrives at or before the current time τcur must have timestamp no greater than τcur; hence fa([iW + 1, (i + 1)W]) = fa([iW + 1, τcur]) and f*([iW + 1, (i + 1)W]) = f*([iW + 1, τcur]), and Equation (3) is equivalent to

|f̂a([iW + 1, (i + 1)W]) − fa([iW + 1, τcur])| ≤ ∊f*([iW + 1, τcur])     (4)

Adding Equations (2) and (4), we conclude |f̂a([s, τcur]) − fa([s, τcur])| ≤ ∊f*([s, τcur]), as required.
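This two-interval combination is mechanical; the following sketch shows it, where est_i and est_ip1 are hypothetical query functions standing in for A_{Ii,∊} and A_{Ii+1,∊}:

def estimate_over_window(est_i, est_ip1, a, W, i, W_prime, tau_cur):
    """Combine two fixed-interval estimators covering I_i and I_{i+1}
    to estimate f_a([s, tau_cur]) with s = tau_cur - W_prime + 1.
    est_i(a, t) is assumed to return the estimate of f_a([t, iW]) and
    est_ip1(a, t) the estimate of f_a([t, (i+1)W])."""
    s = tau_cur - W_prime + 1
    assert (i - 1) * W + 1 <= s <= i * W   # s must fall inside I_i
    # The second term counts [iW+1, (i+1)W], which equals the count of
    # [iW+1, tau_cur] because no timestamp exceeds tau_cur.
    return est_i(a, s) + est_ip1(a, i * W + 1)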

Our data structure A_∊ is just the collection of A_{I1,∊}, A_{I2,∊}, …. Note that we only need to physically store in A_∊ the data structures A_{Ii,∊} and A_{Ii+1,∊} where [τcur − W + 1, τcur] ⊆ Ii ∪ Ii+1. The intervals of the earlier ones are no longer covered by the sliding window, and the corresponding A_{I,∊}'s can be thrown away. Together with Theorem 4, we have the following theorem.

Theorem 2

The data structure A_∊ solves the ∊-approximate counting problem. Its space usage is O((1/∊) log W log(∊B/log W) log log W) and it supports O(log(∊B/log W)(log(1/∊) + log log W)) update time.

3. A Simple Data Structure For Frequency Estimation

Let I = [ℓI, rI] be any interval of size W. To simplify notation, we assume that W is a power of 2, so that log W is an integer and we can avoid the floor and ceiling functions. In this section, we describe a simple data structure C_{I,λ,κ} that enables us to obtain, for any item a, a good estimate of a's frequency over I. The parameters λ and κ determine its accuracy and space usage. However, its accuracy is not enough for answering an ∊-approximate frequent item set query. We will explain how to improve the accuracy in the next section.

Roughly speaking, C_{I,λ,κ} is a set of queues Q^a_{I,λ}, i.e., C_{I,λ,κ} = {Q^a_{I,λ} | a ∈ U}. For an item a, the queue Q^a_{I,λ} keeps track of the occurrences of a in I. Each node N in Q^a_{I,λ} is associated with an interval i(N), a value v(N), and a debit d(N); v(N) counts the number of arrived items (a, u) with u ∈ i(N), and d(N) is for implementing a space reduction technique. Initially, Q^a_{I,λ} has only one node N with i(N) = I and v(N) = d(N) = 0. In general, Q^a_{I,λ} is a queue 〈N1, N2, …, Nk〉 of nodes whose intervals form a partition of I, i.e.,

〈i(N1), i(N2), …, i(Nk)〉 = 〈[p1, q1], [p2, q2], …, [pk, qk]〉

where qi−1 + 1 = pi ≤ qi and ∪_{1≤i≤k} [pi, qi] = I. When an item (a, u) with u ∈ I arrives, we update Q^a_{I,λ} as follows.


Q^a_{I,λ}.Update((a, u))

1:find the unique node N in Q^a_{I,λ} with u ∈ i(N) = J = [p, q];
2:increase the value of N by 1, i.e., v(N) = v(N) + 1;
3:if (|J| > 1 and λ units have been added to v(N) since J was assigned to i(N)) then
4: /* refine J */
5: create a new node N′ and insert it to the left of N;
6: let i(N′) = [p, m], i(N) = [m + 1, q] where m = ⌊(p + q)/2⌋;
7: let v(N′) = 0 and d(N′) = 0;
8: /* we make no change to v(N) and d(N) */
9:end if

Figure 1 gives an example of how Q^a_{I,λ} is updated using this procedure.
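The following runnable Python sketch mirrors the Update procedure; the class names and the `added` counter (which implements the test "λ units have been added since J was assigned") are ours, not the paper's notation.

class Node:
    def __init__(self, interval):
        self.interval = interval   # i(N), stored as a tuple (p, q)
        self.value = 0             # v(N)
        self.debit = 0             # d(N)
        self.added = 0             # units added since interval was assigned

class Queue:
    """Sketch of Q^a_{I,lam}: nodes whose intervals partition I,
    ordered from left to right."""
    def __init__(self, I, lam):
        self.lam = lam
        self.nodes = [Node(I)]     # initially one node monitoring all of I

    def update(self, u):
        # Line 1: find the unique node N with u in i(N) = [p, q].
        k = next(j for j, n in enumerate(self.nodes)
                 if n.interval[0] <= u <= n.interval[1])
        N = self.nodes[k]
        N.value += 1               # Line 2
        N.added += 1
        p, q = N.interval
        if q > p and N.added >= self.lam:
            # Lines 4-8: refine J; the new left node starts empty,
            # while N keeps its value and debit but monitors [m+1, q].
            m = (p + q) // 2
            self.nodes.insert(k, Node((p, m)))
            N.interval = (m + 1, q)
            N.added = 0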

Obviously, a direct implementation of C_{I,λ,κ} uses too much space. We now extend a technique of Misra and Gries [14] to reduce the space requirement. For any Q^a_{I,λ}, we say that Q^a_{I,λ} is trivial if the queue contains only a single node N with (i) i(N) = I, and (ii) v(N) = d(N) = 0. Every queue in C_{I,λ,κ} is trivial initially. The key to reducing the space complexity of C_{I,λ,κ} is to maintain the following invariant throughout the execution:

  • (*) There are at most κ non-trivial queues in C_{I,λ,κ}.

We call κ the capacity of C_{I,λ,κ}. The invariant helps us save space because we do not need to store trivial queues physically in memory. To maintain (*), each queue Q^a_{I,λ} supports the following procedure, which is called only when v(Q^a_{I,λ}), the total value of the nodes in Q^a_{I,λ}, is strictly greater than d(Q^a_{I,λ}), the total debit of the nodes in Q^a_{I,λ}.


Q^a_{I,λ}.Debit( )

1:if (v(Q^a_{I,λ}) ≤ d(Q^a_{I,λ})) then
2: return error;
3:else
4: find an arbitrary node N of Q^a_{I,λ} with v(N) > d(N);
5: /* such a node must exist because v(Q^a_{I,λ}) > d(Q^a_{I,λ}) */
6:d(N) = d(N) + 1;
7:end if

Note from the implementation of Debit( ) that v(Q^a_{I,λ}) is always no smaller than d(Q^a_{I,λ}), and for each node N of Q^a_{I,λ}, v(N) ≥ d(N). Furthermore, if v(Q^a_{I,λ}) = d(Q^a_{I,λ}), then v(N) = d(N) for every node N in Q^a_{I,λ}. To maintain (*), C_{I,λ,κ} processes a newly arrived item (a, u) with u ∈ I as follows.


C_{I,λ,κ}.Process((a, u))

1:update Q^a_{I,λ} by calling Q^a_{I,λ}.Update((a, u));
2:if (after the update the number of non-trivial queues becomes κ) then
3: for each non-trivial queue Q^x_{I,λ} do Q^x_{I,λ}.Debit( );
4: for each non-trivial queue Q^x_{I,λ} with v(Q^x_{I,λ}) = d(Q^x_{I,λ}) do
5:  delete all nodes of Q^x_{I,λ} and make it a trivial queue;
6: /* Note that each deleted node N satisfies v(N) = d(N). */
7:end if

It is easy to see that Invariant (*) always holds: Initially the number m of non-trivial queues is zero, and m increases only when Process((a, u)) is called on some trivial Q^a_{I,λ}; in such a case v(Q^a_{I,λ}) becomes 1 and d(Q^a_{I,λ}) remains 0. If m becomes κ after this increase, we will debit, among other queues, Q^a_{I,λ}, and its d(Q^a_{I,λ}) becomes 1 too. It follows that v(Q^a_{I,λ}) = d(Q^a_{I,λ}), and Lines 4–5 will make Q^a_{I,λ} trivial and m become less than κ again.

We are now ready to define C_{I,λ,κ}'s estimate f̂a([t, rI]) of fa([t, rI]) and analyze its accuracy. We need some definitions. For any interval J = [p, q] and any t ∈ I, we say that J covers t if t ∈ [p, q], is to the right of t if t < p, and is to the left of t otherwise. For any item a and any t ∈ I = [ℓI, rI], C_{I,λ,κ}'s estimate of fa([t, rI]) is

  • f̂a([t, rI]) = the value sum of the nodes N currently in Q^a_{I,λ} whose i(N) covers or is to the right of t.

For example, in Figure 1, after the update of the last item (a, 1), we can obtain the estimate f̂a([2, 8]) = 0 + 4 + 5 = 9.
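Building on the Queue sketch above, the next sketch assembles C_{I,λ,κ} with Process, the debit-and-trivialize step, and the estimate f̂a([t, rI]); as in the paper, trivial queues are simply not stored.

class CStructure:
    """Sketch of C_{I,lam,kap}; only non-trivial queues are stored,
    so Invariant (*) holds by construction."""
    def __init__(self, I, lam, kap):
        self.I, self.lam, self.kap = I, lam, kap
        self.queues = {}                      # item name -> Queue

    def process(self, a, u):
        queue = self.queues.setdefault(a, Queue(self.I, self.lam))
        queue.update(u)                       # Line 1 of Process
        if len(self.queues) == self.kap:      # Lines 2-3: debit every queue
            for q in self.queues.values():
                N = next(n for n in q.nodes if n.value > n.debit)
                N.debit += 1
            # Lines 4-5: queues whose total value equals total debit
            # become trivial again; trivial queues are simply dropped.
            self.queues = {x: q for x, q in self.queues.items()
                           if sum(n.value for n in q.nodes) >
                              sum(n.debit for n in q.nodes)}

    def estimate(self, a, t):
        """Value sum of nodes whose interval covers or is to the
        right of t (i.e., right endpoint >= t)."""
        if a not in self.queues:
            return 0
        return sum(n.value for n in self.queues[a].nodes
                   if n.interval[1] >= t)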

Given any node N of Q^a_{I,λ}, we say that N is monitoring a over J, or simply N is monitoring J, if i(N) = J. Note that a node may monitor different intervals during different periods of the execution, and the sizes of these intervals are monotonically decreasing. Observe that although there are about W²/2 possible sub-intervals of the size-W interval I, only about 2W of them would ever be monitored by some node: there is only one such interval of size W, namely I = [ℓI, rI], which gives birth to two such intervals of size W/2, namely [ℓI, m] and [m + 1, rI] where m = ⌊(ℓI + rI)/2⌋, and so on. We call these O(W) intervals interesting intervals. For any two interesting intervals J and H such that J ⊂ H, we say that J is a descendant of H, and H is an ancestor of J. Figure 2 shows all the interesting intervals for I = [1, 8], as well as their ancestor-descendant relationship. The following important fact is easy to verify by induction.

Fact 1

Any two interesting intervals J and H do not cross, although one can contain the other, i.e., either J ⊆ H, or H ⊆ J, or J ∩ H = ∅. Furthermore, any interesting interval has at most log W ancestors.

For any node N, let Int(N) be the set of intervals that have been monitored by N so far. The following fact can be verified from the update procedure.

Fact 2

Consider a node N in Q^a_{I,λ}, where i(N) = J.

  • If J covers or is to the right of t, then all intervals in Int(N) cover or are to the right of t.

  • If J is to the left of t, then all intervals in Int(N) are to the left of t.

We say that N covers or is to the right of t if the intervals in Int(N) cover or are to the right of t; otherwise, N is to the left of t. For any queue Q^a_{I,λ}, let alive(Q^a_{I,λ}) be the set of nodes currently in Q^a_{I,λ}, dead(Q^a_{I,λ}) be those nodes of Q^a_{I,λ} that have already been deleted (because of Line 5 of the procedure Process( )), and node(Q^a_{I,λ}) = alive(Q^a_{I,λ}) ∪ dead(Q^a_{I,λ}). Note that the estimate f̂a([t, rI]) is the value sum of the nodes in alive(Q^a_{I,λ}) that cover or are to the right of t. For simplicity, we express it more succinctly. Let

alive(C_{I,λ,κ}) = ∪ {alive(Q^a_{I,λ}) | Q^a_{I,λ} ∈ C_{I,λ,κ}}

be the set of nodes currently in C_{I,λ,κ}. Define dead(C_{I,λ,κ}) and node(C_{I,λ,κ}) similarly. For any item a and any subset X ⊆ node(C_{I,λ,κ}), let X^a be the set of nodes in X that are monitoring a (and thus are the nodes from Q^a_{I,λ}). For any t ∈ I, let X_{≥t} denote the set of nodes in X that cover or are to the right of t. Define v(X) = Σ_{N∈X} v(N) and d(X) = Σ_{N∈X} d(N). Then, f̂a([t, rI]) can be expressed as follows:

f̂a([t, rI]) = v(alive(Q^a_{I,λ})_{≥t}) = v(alive(C_{I,λ,κ})^a_{≥t})

The following lemma analyzes its accuracy, and also gives the size of C_{I,λ,κ}.

Lemma 3

For any t ∈ I, fa([t, rI]) − (1/κ)f*(I) ≤ f̂a([t, rI]) ≤ fa([t, rI]) + λ log W. Furthermore, C_{I,λ,κ} has size O(f*(I)/λ + κ) words.

Proof

Recall that f̂a([t, rI]) = v(alive(Q^a_{I,λ})_{≥t}). Consider any node N ∈ alive(Q^a_{I,λ})_{≥t}. Note that v(N) = Σ_{J∈Int(N)} vadd(N, J), where vadd(N, J) is the value added to v(N) during the period when i(N) = J. By Fact 2, we can divide it as v(N) = Σ {vadd(N, J) | J covers t} + Σ {vadd(N, J) | J is to the right of t}. It follows that

v(alive(Q^a_{I,λ})_{≥t}) = Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} v(N) = Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J covers t} + Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J is to the right of t}     (5)

Note that Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J is to the right of t} ≤ fa([t, rI]), because if an arrived item (a, u) causes an increase of vadd(N, J) for some J that is to the right of t, then u must be in [t, rI]. By Equation (5), to show the second inequality of the lemma, it suffices to show that So = Σ_{N ∈ alive(Q^a_{I,λ})_{≥t}} Σ {vadd(N, J) | J covers t} = vadd(N1, J1) + vadd(N2, J2) + ⋯ + vadd(Nk, Jk) is no greater than λ log W, as follows.

Without loss of generality, suppose |J1| ≥ |J2| ≥ ⋯ ≥ |Jk|. It can be verified that once an interval J is assigned to a node, it will not be assigned to other nodes; thus the Ji's are distinct. Furthermore, note that for 1 ≤ i < k, Jk ⊆ Ji because (i) t is in both Ji and Jk; (ii) Jk is the smallest interval; and (iii) interesting intervals do not cross; thus Jk is a descendant of Ji, and together with Fact 1, k ≤ log W. By Line 3 of the procedure Update( ), vadd(Ni, Ji) ≤ λ for 1 ≤ i ≤ k. It follows that So ≤ λ log W.

For the first inequality of the lemma, it is clearer to use f̂a([t, rI]) = v(alive(C_{I,λ,κ})^a_{≥t}). Note that every arrived item (a, u) with u ∈ [t, rI] increments the value of some node in node(C_{I,λ,κ})^a_{≥t}; thus fa([t, rI]) ≤ v(node(C_{I,λ,κ})^a_{≥t}) and

fa([t, rI]) − f̂a([t, rI]) ≤ v(node(C_{I,λ,κ})^a_{≥t}) − v(alive(C_{I,λ,κ})^a_{≥t}) = v(dead(C_{I,λ,κ})^a_{≥t})

From Lines 4–6 of the procedure Process( ), when we delete a node N, v(N) = d(N). Hence, v(dead(C_{I,λ,κ})^a_{≥t}) = d(dead(C_{I,λ,κ})^a_{≥t}), which is equal to the total number of debit operations made to these dead nodes. Since whenever we make a debit operation to Q^a_{I,λ}, we also make a debit operation to κ − 1 other queues,

κ d(dead(C_{I,λ,κ})^a_{≥t}) ≤ d(node(C_{I,λ,κ})) ≤ v(node(C_{I,λ,κ})) = f*(I)

In summary, we have fa([t, rI]) − f̂a([t, rI]) ≤ v(dead(C_{I,λ,κ})^a_{≥t}) = d(dead(C_{I,λ,κ})^a_{≥t}) ≤ f*(I)/κ, and the first inequality of the lemma follows.

For the space, we say that a node is born-rich if it is created because of Line 5 of the procedure Update( ) (and thus has λ items under its belt); otherwise it is born-poor. Obviously, there are at most f*(I)/λ born-rich nodes. For born-poor nodes, we need to store at most κ of them because every queue has one born-poor node (the rightmost one), and we only need to store at most κ non-trivial queues; the space bound follows.

If we set λ = λi = ∊2^i/log W and κ = 1/∊, then Lemma 3 asserts that C_{I,λi,1/∊} is an O((f*(I)/(∊2^i)) log W + 1/∊)-space data structure that enables us to obtain, for any item a ∈ U and any timestamp t ∈ I, an estimate f̂a([t, rI]) that satisfies

fa([t, rI]) − ∊f*(I) ≤ f̂a([t, rI]) ≤ fa([t, rI]) + ∊2^i     (6)

If f*(I) does not vary too much, we can determine the i such that f*(I) ≈ 2^i, and C_{I,λi,1/∊} is an O((1/∊) log W)-space data structure that guarantees an error bound of O(∊f*(I)). However, this approach has two obvious shortcomings:

  • f*(I) may vary from some small value to a value as large as B, the maximum number of items falling in a window of size W; hence, there may not be any fixed i that always satisfies f*(I) ≈ 2^i.

  • To estimate fa([t, rI]), we need an error bound of ∊f*([t, rI]), not ∊f*(I).

We will explain how to overcome these two shortcomings in the next section.

4. Our Data Structure for ∊-Approximate Counting

The first shortcoming of the approach given in Section 3 is easy to overcome: a natural idea is to maintain C_{I,λi,1/∊} for different λi to handle different possible values of f*(I). The second shortcoming is more fundamental. To overcome it, we need to modify C_{I,λ,κ} substantially. The result is a new and complicated data structure D_{I,Y}, where Y is an integer determining the accuracy. As asserted in Theorem 7 below, this data structure uses O((1/∊) log W log log W) space, supports O(log(1/∊) + log log W) update time, and for any t ∈ I, it offers the following special guarantee:

  • When f*([t, rI]) ≤ Y, D_{I,Y} can return, for any item a, an estimate f̂a([t, rI]) of fa([t, rI]) such that |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y.

  • When f*([t, rI]) > Y, D_{I,Y} does not have any error bound on its estimate f̂a([t, rI]).

Before giving the details of D_{I,Y}, let us explain how to use it to build the data structure A_{I,∊} mentioned in Section 2 for the ∊-approximate counting problem. To build A_{I,∊}, we need another O((1/∊) log W log(∊B/log W))-space data structure B_{I,∊}, which is a simple adaptation of the data structure B of Cormode et al. [1] for the ∊-approximate basic counting problem; B_{I,∊} enables us to find, for any t ∈ I, an estimate f̂*([t, rI]) of f*([t, rI]) such that

f*([t, rI]) ≤ f̂*([t, rI]) ≤ (1 + ∊)f*([t, rI])     (7)

B_{I,∊} is implemented as follows. During execution, we maintain the data structure B_{∊/4} of Cormode et al. to count the items in the sliding window. When τcur = rI, we duplicate B_{∊/4} and get B′. Then, B′ is updated as if τcur were fixed at rI. To get the estimate f̂*([t, rI]), we first obtain an estimate f′ of f*([t, rI]) from B′, which satisfies |f′ − f*([t, rI])| ≤ (∊/4)f*([t, rI]). Then, f̂*([t, rI]) = (1/(1 − ∊/4))f′. It can be verified that f̂*([t, rI]) satisfies Equation (7). Our data structure A_{I,∊} is composed of (i) B_{I,∊}, and (ii) the structure D^{∊/4}_{I,2^i} (i.e., D_{I,Y} instantiated with error parameter ∊/4 and Y = 2^i) for each integer i from log((1/∊) log W) + 1 to log B. It also maintains a brute-force O((1/∊) log W)-space data structure for remembering the (1/∊) log W items (a, u) with the largest u ∈ I; this brute-force data structure will be used for finding f̂a([t, rI]) only when f*([t, rI]) ≤ (1/∊) log W.

Theorem 4

  • (i) The data structure A_{I,∊} has size O((1/∊)(log log W)(log W) log(∊B/log W)) words, and supports O((log(1/∊) + log log W) log(∊B/log W)) update time.

  • (ii) Given A_{I,∊}, we can find, for any a ∈ U and t ∈ I, an estimate f̂a([t, rI]) of fa([t, rI]) such that |f̂a([t, rI]) − fa([t, rI])| ≤ ∊f*([t, rI]).

Proof

Statement (i) is straightforward because there are log B − log((1/∊) log W) = log(∊B/log W) different D^{∊/4}_{I,2^i}'s, each of which has size O((1/∊)(log log W) log W) and takes O(log(1/∊) + log log W) time for an update. For Statement (ii), we describe how to get the estimate and analyze its accuracy.

First, we use B_{I,∊} to get the estimate f̂*([t, rI]). If f̂*([t, rI]) ≤ (1/∊) log W, then f*([t, rI]) ≤ f̂*([t, rI]) ≤ (1/∊) log W and we can use the brute-force data structure to find fa([t, rI]) exactly. Otherwise, we determine the i with 2^{i−1} < f̂*([t, rI]) ≤ 2^i. Note that

  • i ≥ log((1/∊) log W) + 1 and we have the data structure D^{∊/4}_{I,2^i}, and

  • f*([t, rI]) ≤ f̂*([t, rI]) ≤ 2^i.

We use D^{∊/4}_{I,2^i} to obtain an estimate f̂a([t, rI]) with |f̂a([t, rI]) − fa([t, rI])| ≤ (∊/4)2^i. By Equation (7), 2^{i−1} < f̂*([t, rI]) ≤ (1 + ∊)f*([t, rI]). Combining the two inequalities, we have

|f̂a([t, rI]) − fa([t, rI])| ≤ 2(∊/4)(2^{i−1}) < 2(∊/4)(1 + ∊)f*([t, rI]) ≤ ∊f*([t, rI])
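The dispatch just described can be summarized in a few lines; the following sketch uses hypothetical interfaces standing in for B_{I,∊}, the brute-force store, and the D structures.

import math

def query(a, t, eps, W, est_total, brute_force, d_structures):
    """Sketch of the dispatch in Theorem 4's proof. est_total(t)
    returns the estimate of f*([t, r_I]); brute_force(a, t) returns the
    exact f_a([t, r_I]); d_structures[i] answers with additive error
    (eps/4) * 2^i."""
    f_star = est_total(t)
    if f_star <= (1 / eps) * math.log2(W):
        return brute_force(a, t)          # few items: count exactly
    i = math.ceil(math.log2(f_star))      # 2^(i-1) < estimate <= 2^i
    return d_structures[i].estimate(a, t)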

We now describe the construction of D_{I,Y}. First, we describe an O((1/∊)(log W)²)-space version of the data structure. Then, we show in the next section how to reduce the space to O((1/∊)(log log W) log W). In our discussion, we fix λ = ∊Y/log W and κ = (4/∊) log W.

Initially, D_{I,Y} is just the data structure C_{I,λ,κ}. By Lemma 3, we know that its size is O(f*(I)/λ + κ) = O((f*(I)/(∊Y)) log W + (1/∊) log W), which is O((1/∊) log W) when f*(I) ≤ Y. However, it is much larger than (1/∊) log W when f*(I) ≫ Y, and to maintain small space usage in such a case, we trim C_{I,λ,κ} by throwing away a significant number of nodes. This is acceptable because C_{I,λ,κ} only guarantees good estimates for those t ∈ I with f*([t, rI]) ≤ Y. The trimming process is rather tricky. The natural idea of throwing away all the nodes to the left of t when we find f*([t, rI]) > Y does not work, because the resulting data structure may return estimates with error larger than the required ∊Y bound. For example, let I = [1, W]. For each item ai ∈ {a1, a2, …, aκ−1}, m = ∊Y/κ copies of (ai, t + 1) arrive at time W + t for every t ∈ [0, W − 1]. Also, m copies of (a, W) arrive at time W + t for every t ∈ [0, W − 1]. Hence, at each time W + t, κm = ∊Y items with timestamps in [t, W] arrive, m items for each of the κ item names in {a, a1, …, aκ−1}. We are interested in the accuracy of the estimate f̂a([W, W]). It can be verified that at each time W + t, Lines 4–5 of the procedure Process( ) will eventually trivialize Q^a_{I,λ}, and thus f̂a([W, W]) = 0. Since fa([W, W]) = (t + 1)m, |f̂a([W, W]) − fa([W, W])| = (t + 1)m. When t = 2∊Y/m − 1, the absolute error is 2∊Y, which is larger than the required error bound ∊Y.

To describe the right trimming procedure, we need some basic operations. Consider any C_{J,λ,κ} where J = [p, q]. The following operation splits C_{J,λ,κ} into two smaller data structures C_{Jℓ,λ,κ} and C_{Jr,λ,κ}, where Jℓ = [p, m] and Jr = [m + 1, q] with m = ⌊(p + q)/2⌋.


D_{I,Y}.Split(C_{J,λ,κ})

1:for each non-trivial queue Q^a_{J,λ} ∈ C_{J,λ,κ} do
2: if (Q^a_{J,λ} has only one node N monitoring the whole interval J) then
3:  /* refine J */
4:  insert a new node N′ immediately to the left of N with v(N′) = d(N′) = 0;
5:  i(N′) = Jℓ, and i(N) = Jr;
6: end if
7: divide Q^a_{J,λ} into two sub-queues Q^a_{Jℓ,λ} and Q^a_{Jr,λ} where
8:   Q^a_{Jℓ,λ} contains the nodes monitoring some sub-intervals of Jℓ, and
9:   Q^a_{Jr,λ} contains those monitoring some sub-intervals of Jr;
10: put Q^a_{Jℓ,λ} in C_{Jℓ,λ,κ} and Q^a_{Jr,λ} in C_{Jr,λ,κ}.
11:end for
12:/* For a trivial Q^a_{J,λ}, its two children in C_{Jℓ,λ,κ} and C_{Jr,λ,κ} are also trivial. */

We say that C_{Jℓ,λ,κ} and C_{Jr,λ,κ} are the left and right child of C_{J,λ,κ}, respectively. Figure 3 gives an example of Split(C_{[1,8],λ,κ}), the split of C_{[1,8],λ,κ}, which has three non-trivial queues Q^a_{[1,8],λ}, Q^b_{[1,8],λ} and Q^c_{[1,8],λ}, into C_{[1,4],λ,κ} and C_{[5,8],λ,κ}. Note that the queues for b and c in C_{[1,4],λ,κ} are trivial and we do not store them.
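A sketch of Split in the same Python setting as before (reusing Node, Queue and CStructure); a child queue that receives no nodes is trivial and therefore not stored.

def split(C):
    """Sketch of Split: divide C_{[p,q]} into its left and right
    children."""
    p, q = C.I
    m = (p + q) // 2
    left = CStructure((p, m), C.lam, C.kap)
    right = CStructure((m + 1, q), C.lam, C.kap)
    for a, queue in C.queues.items():
        if len(queue.nodes) == 1:             # one node monitors all of J
            N = queue.nodes[0]                # refine J first (Lines 2-6)
            queue.nodes.insert(0, Node((p, m)))
            N.interval = (m + 1, q)
        lq = Queue((p, m), C.lam)
        rq = Queue((m + 1, q), C.lam)
        lq.nodes = [n for n in queue.nodes if n.interval[1] <= m]
        rq.nodes = [n for n in queue.nodes if n.interval[0] > m]
        if lq.nodes:                          # empty side = trivial queue,
            left.queues[a] = lq               # which is not stored
        if rq.nodes:
            right.queues[a] = rq
    return left, right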

Using Split( ), we can trim, for example, C_{[p,p+1],λ,κ} into C_{[p+1,p+1],λ,κ} as follows: split C_{[p,p+1],λ,κ} into C_{[p,p],λ,κ} and C_{[p+1,p+1],λ,κ}, and throw away C_{[p,p],λ,κ}. The following recursive procedure LeftRefine( ) generalizes this idea to larger J: Given C_{J,λ,κ} = C_{[p,q],λ,κ}, it returns a list 〈C_{J0,λ,κ}, C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 where the Ji's form a partition of [p, q], and J0 = [p, p]. Throwing away C_{J0,λ,κ}, the remaining C_{Ji,λ,κ}'s together monitor [p + 1, q].


D_{I,Y}.LeftRefine(C_{[p,q],λ,κ})

1:if (|[p, q]| = 1) then
2: return 〈C_{[p,p],λ,κ}〉;
3:else
4: split C_{[p,q],λ,κ} into its left child C_{[p,m],λ,κ} and right child C_{[m+1,q],λ,κ};
5: /* where m = ⌊(p + q)/2⌋ */
6: L = LeftRefine(C_{[p,m],λ,κ});
7: suppose L = 〈C_{J0,λ,κ}, C_{J1,λ,κ}, …, C_{Jk,λ,κ}〉;
8: return 〈C_{J0,λ,κ}, …, C_{Jk,λ,κ}, C_{[m+1,q],λ,κ}〉;
9:end if

For example, LeftRefine(C_{[1,8],λ,κ}) gives us the list 〈C_{[1,1],λ,κ}, C_{[2,2],λ,κ}, C_{[3,4],λ,κ}, C_{[5,8],λ,κ}〉. Note that J0 = [p, p] because the recursion stops only when |[p, q]| = 1. The list returned by LeftRefine(C_{[p,q],λ,κ}) has another useful property, which we describe below.
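LeftRefine then translates into a short recursion on top of the split sketch:

def left_refine(C):
    """Sketch of LeftRefine: return the list <C_{J0}, ..., C_{Jm}>
    with J0 = [p, p]."""
    p, q = C.I
    if p == q:
        return [C]
    left, right = split(C)
    return left_refine(left) + [right]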

Given L = 〈C_{Z1,λ,κ}, …, C_{Zk,λ,κ}〉, we say that L is an interesting-partition covering the interval J if (i) the Zi's are all interesting intervals and form a partition of J; and (ii) for 1 ≤ i < k, Zi is to the left of Zi+1, and |Zi| ≤ (1/2)|Zi+1|. The fact below can be verified by induction on the length of the list returned by LeftRefine( ).

Fact 3

Let J = [p, q] be an interesting interval, and L = 〈C_{J0,λ,κ}, …, C_{Jm,λ,κ}〉 be the list returned by LeftRefine(C_{J,λ,κ}). Then, the list 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 (i.e., the list obtained by throwing away the head C_{J0,λ,κ} of L) is an interesting-partition covering [p + 1, q].

For example, if [1, 8] is an interesting interval, then the list 〈C_{[2,2],λ,κ}, C_{[3,4],λ,κ}, C_{[5,8],λ,κ}〉 obtained by throwing away the first element C_{[1,1],λ,κ} from LeftRefine(C_{[1,8],λ,κ}) is an interesting-partition covering [2, 8].

We now give the details of D_{I,Y}. Initially, it is the interesting-partition 〈C_{I,λ,κ}〉 covering the whole interval I = [ℓI, rI]. Throughout the execution, we maintain the following invariant:

  • (**) D_{I,Y} is an interesting-partition covering some [p, rI] ⊆ I.

When D_{I,Y} = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 is covering [p, rI], it only guarantees good estimates of fa([t, rI]) for t ∈ [p, rI], and this estimate is obtained by

f̂a([t, rI]) = v(alive(C_{Jh,λ,κ})^a_{≥t}) + Σ_{h+1≤i≤m} v(alive(C_{Ji,λ,κ})^a)

(or equivalently, f̂a([t, rI]) = v(alive(Q^a_{Jh,λ})_{≥t}) + Σ_{h+1≤i≤m} v(alive(Q^a_{Ji,λ}))), where Jh is the interval in {J1, J2, …, Jm} that covers t. When an item (a, u) with u ∈ [p, rI] arrives, we find the unique C_{Ji,λ,κ} in D_{I,Y} with u ∈ Ji and update it by calling C_{Ji,λ,κ}.Process((a, u)). Note that this update has no effect on the other C_{J,λ,κ}'s in D_{I,Y}.

During execution, we also keep track of the largest timestamp pmax ∈ I such that the estimate f̂*([pmax, rI]) given by B_{I,∊} is greater than (1 + ∊)Y (which implies f*([pmax, rI]) > Y because of Equation (7)). As soon as pmax falls in the interval covered by D_{I,Y}, we use the following procedure to trim D_{I,Y} so that it covers the smaller interval [pmax + 1, rI].

Suppose that L = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 is an interesting-partition covering [p, rI], and t ∈ [p, rI]. Trim(L, t) constructs an interesting-partition covering [t + 1, rI] recursively as follows.


D_{I,Y}.Trim(L, t)

1:find the unique C_{Ji,λ,κ} in L such that t ∈ Ji;
2:L′ = LeftRefine(C_{Ji,λ,κ});
3:suppose L′ = 〈C_{K0,λ,κ}, C_{K1,λ,κ}, …, C_{Kℓ,λ,κ}〉;
4:if (K0 = [t, t]) then
5: return 〈C_{K1,λ,κ}, …, C_{Kℓ,λ,κ}, C_{Ji+1,λ,κ}, …, C_{Jm,λ,κ}〉;
6: /* i.e., throw away C_{J1,λ,κ}, …, C_{Ji−1,λ,κ} and C_{K0,λ,κ}, */
7: /* and return an interesting-partition covering [t + 1, rI]. */
8:else
9: return Trim(〈C_{K1,λ,κ}, …, C_{Kℓ,λ,κ}, C_{Ji+1,λ,κ}, …, C_{Jm,λ,κ}〉, t).
10: /* throw away C_{J1,λ,κ}, …, C_{Ji−1,λ,κ} and C_{K0,λ,κ} */
11:end if

For example, Figure 4 shows that when D_{I,Y} = 〈C_{[2,2],λ,κ}, C_{[3,4],λ,κ}, C_{[5,8],λ,κ}〉, Trim(D_{I,Y}, 3) returns 〈C_{[4,4],λ,κ}, C_{[5,8],λ,κ}〉. Based on Fact 3, it can be verified inductively that after D_{I,Y} ← Trim(D_{I,Y}, pmax), the new D_{I,Y} is an interesting-partition covering [pmax + 1, rI]; Invariant (**) is preserved. In the rest of this section, we analyze the size of D_{I,Y} and the accuracy of its estimates.
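And Trim, again as a sketch; here L is a Python list of CStructure objects ordered from left to right:

def trim(L, t):
    """Sketch of Trim: turn an interesting-partition L covering
    [p, r_I] into one covering [t+1, r_I]."""
    i = next(j for j, C in enumerate(L) if C.I[0] <= t <= C.I[1])
    refined = left_refine(L[i])
    head, rest = refined[0], refined[1:] + L[i + 1:]
    # head and everything to the left of L[i] are thrown away
    if head.I == (t, t):
        return rest
    return trim(rest, t)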

Let All be the set of all C_{J,λ,κ}'s that have ever existed, i.e., if C_{J,λ,κ} ∈ All, then either (i) it is currently in D_{I,Y}, or (ii) it was in D_{I,Y} some time earlier in the execution but was thrown away during some trimming of D_{I,Y}. For any p ∈ I, define

All_{≥p} = {C_{J,λ,κ} | C_{J,λ,κ} ∈ All, and J covers or is to the right of p}

Let vadd(C_{J,λ,κ}) be the total value added to the nodes of C_{J,λ,κ} during its lifespan. We now derive an upper bound on Σ_{C_{J,λ,κ} ∈ All_{≥p}} vadd(C_{J,λ,κ}), which is crucial for obtaining a tight error bound on the accuracy of D_{I,Y}'s estimates.

Recall that initially D_{I,Y} = 〈C_{I,λ,κ}〉, and thus C_{I,λ,κ} ∈ All. Any other C_{J,λ,κ} ∈ All must be a child of some C_{H,λ,κ} ∈ All (i.e., C_{J,λ,κ} is obtained from Split(C_{H,λ,κ})). Given C_{J,λ,κ} and C_{H,λ,κ}, we say that C_{J,λ,κ} is a descendant of C_{H,λ,κ}, and C_{H,λ,κ} is an ancestor of C_{J,λ,κ}, if either (i) C_{J,λ,κ} is a child of C_{H,λ,κ}, or (ii) it is a child of one of C_{H,λ,κ}'s descendants. Note that the original C_{I,λ,κ} is an ancestor of every other C_{J,λ,κ} ∈ All, and in general, any C_{H,λ,κ} ∈ All is an ancestor of every C_{J,λ,κ} ∈ All with J ⊂ H. We have the following lemma. (Note that we are abusing notation here and regard D_{I,Y} as a set.)

Lemma 5

Suppose that D_{I,Y} = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 is covering [p, rI]. Let anc(D_{I,Y}) = anc(〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉) be the set {C_{H,λ,κ} | C_{H,λ,κ} is an ancestor of some C_{Ji,λ,κ} ∈ D_{I,Y}}. Then,

  • (1) All_{≥p} ⊆ D_{I,Y} ∪ anc(D_{I,Y}),

  • (2) vadd(C_{J,λ,κ}) ≤ (1 + ∊)Y for any C_{J,λ,κ} ∈ All, and

  • (3) |D_{I,Y} ∪ anc(D_{I,Y})| ≤ 2 log W.

Therefore, Σ {vadd(C_{J,λ,κ}) | C_{J,λ,κ} ∈ All_{≥p}} ≤ 2(1 + ∊)Y log W.

Proof

For (1), it suffices to prove that for any C_{J,λ,κ} ∈ All_{≥p}, C_{J,λ,κ} ∈ D_{I,Y} ∪ anc(D_{I,Y}). By definition, J covers or is to the right of p; thus J ∩ (J1 ∪ ⋯ ∪ Jm) = J ∩ [p, rI] ≠ ∅. Since the intervals are interesting and do not cross, there is an i, 1 ≤ i ≤ m, such that either (i) J = Ji, and thus C_{J,λ,κ} ∈ D_{I,Y}, or (ii) Ji ⊂ J, which implies that C_{J,λ,κ} is an ancestor of C_{Ji,λ,κ}, i.e., C_{J,λ,κ} ∈ anc(D_{I,Y}). (It is not possible that J ⊂ Ji; otherwise C_{Ji,λ,κ} would have been split and would not be in the current D_{I,Y}.) Hence, C_{J,λ,κ} ∈ D_{I,Y} ∪ anc(D_{I,Y}).

To prove (2), suppose that J = [x, y] and vadd(C_{J,λ,κ}) has just reached (1 + ∊)Y. This implies f*([x, rI]) ≥ (1 + ∊)Y, and so does its estimate f̂*([x, rI]) given by B_{I,∊} (as f*([x, rI]) ≤ f̂*([x, rI]), by Equation (7)). Then, the procedure Trim( ) will be called, and C_{J,λ,κ} will be either thrown away or split, so no more value can be added to C_{J,λ,κ}. It follows that vadd(C_{J,λ,κ}) ≤ (1 + ∊)Y.

For (3), recall that D_{I,Y} = 〈C_{J1,λ,κ}, C_{J2,λ,κ}, …, C_{Jm,λ,κ}〉. Among the intervals J1, …, Jm, interval J1 is the leftmost, and its left boundary is ℓJ1 = p. We now prove that D_{I,Y} ∪ anc(D_{I,Y}) = D_{I,Y} ∪ anc(C_{J1,λ,κ}), where anc(C_{J1,λ,κ}) is the set of ancestors of C_{J1,λ,κ}. Then, together with the facts that |D_{I,Y}| ≤ log W (by Property (ii) of interesting-partitions) and |anc(C_{J1,λ,κ})| ≤ log W (as each Split operation reduces the size of the interval by half), we have

|D_{I,Y} ∪ anc(D_{I,Y})| = |D_{I,Y} ∪ anc(C_{J1,λ,κ})| ≤ |D_{I,Y}| + |anc(C_{J1,λ,κ})| ≤ 2 log W

To show D_{I,Y} ∪ anc(D_{I,Y}) = D_{I,Y} ∪ anc(C_{J1,λ,κ}), it suffices to show that for any C_{H,λ,κ} ∈ anc(D_{I,Y}), C_{H,λ,κ} ∈ anc(C_{J1,λ,κ}). Since C_{H,λ,κ} ∈ anc(D_{I,Y}), it is the ancestor of some C_{Ji,λ,κ} ∈ D_{I,Y}; thus Ji = [ℓJi, rJi] ⊂ H = [ℓH, rH]. Since C_{H,λ,κ} is already an ancestor, it no longer exists, and all the C_{J,λ,κ}'s to its left have been thrown away. Thus, D_{I,Y} has no C_{J,λ,κ} where J is to the left of H. This implies ℓH ≤ p = ℓJ1, and ℓH ≤ ℓJ1 ≤ rJ1 ≤ rJi ≤ rH. It follows that J1 ⊆ H and C_{H,λ,κ} is an ancestor of C_{J1,λ,κ}, i.e., C_{H,λ,κ} ∈ anc(C_{J1,λ,κ}).

We are now ready to analyze the accuracy of D_{I,Y}'s estimates.

Theorem 6

Suppose that D_{I,Y} is covering [p, rI]. For any item a and any t ∈ [p, rI], the estimate f̂a([t, rI]) of fa([t, rI]) obtained by D_{I,Y} satisfies |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y. Furthermore, D_{I,Y} uses O((1/∊)(log W)²) space.

Proof

Let alive(D_{I,Y}) be the set of nodes currently in D_{I,Y}, dead(D_{I,Y}) the set of those that were in D_{I,Y} earlier in the execution but have been deleted, and node(D_{I,Y}) = alive(D_{I,Y}) ∪ dead(D_{I,Y}). It can be verified that f̂a([t, rI]) = v(alive(D_{I,Y})^a_{≥t}). Below, we prove that

fa([t, rI]) − 2(1 + ∊)Y log W/κ ≤ v(alive(D_{I,Y})^a_{≥t}) ≤ fa([t, rI]) + λ log W     (8)

Recall that we fixed λ = ∊Y/log W and κ = (4/∊) log W; the ∊Y error bound follows.

The proof of the second inequality of Equation (8) is identical to that of Lemma 3, except that we replace all occurrences of C_{I,λ,κ} by D_{I,Y}. The proof of the first inequality is also similar. We still have

fa([t, rI]) − v(alive(D_{I,Y})^a_{≥t}) ≤ v(node(D_{I,Y})^a_{≥t}) − v(alive(D_{I,Y})^a_{≥t}) = v(dead(D_{I,Y})^a_{≥t})

which equals d(dead(D_{I,Y})^a_{≥t}). As in Lemma 3, we can derive the bound d(dead(D_{I,Y})^a_{≥t}) ≤ (1/κ)v(node(D_{I,Y})) = (1/κ)f*(I), but we can do better here.

Observe that any node N ∈ dead(D_{I,Y})^a_{≥t} can only be in those C_{J,λ,κ} ∈ All_{≥p} (because t ∈ [p, rI]), and when we debit N, if it is in C_{J,λ,κ}, then we debit κ − 1 other nodes in C_{J,λ,κ} monitoring κ − 1 items other than a. Thus, κ d(dead(D_{I,Y})^a_{≥t}) is no more than the total value available in the C_{J,λ,κ} ∈ All_{≥p}, which is Σ {vadd(C_{J,λ,κ}) | C_{J,λ,κ} ∈ All_{≥p}}. Together with Lemma 5, we conclude

κ d(dead(D_{I,Y})^a_{≥t}) ≤ Σ {vadd(C_{J,λ,κ}) | C_{J,λ,κ} ∈ All_{≥p}} ≤ 2(1 + ∊)Y log W

and the first inequality of Equation (8) follows.

For the size of D_{I,Y}, similarly to the proof of Lemma 3, we can argue that the number of born-rich nodes is only O(Y/λ) = O((1/∊) log W), but the number of born-poor nodes can be much larger. A born-poor node of a non-trivial queue is created either when we increase the value of a trivial queue, or when we execute Lines 2–6 of procedure Split( ). It can be verified that every queue Q^a_{J,λ} has at most one born-poor node, which is the rightmost node in Q^a_{J,λ}. Since there are O(log W) C_{J,λ,κ}'s in D_{I,Y} and each has at most κ non-trivial queues, the number of born-poor nodes, and hence the size of D_{I,Y}, is O(κ log W) = O((1/∊)(log W)²).

To reduce D_{I,Y}'s size from O((1/∊)(log W)²) to O((1/∊)(log log W) log W), we need to reduce the number of born-poor nodes, or equivalently, the number of non-trivial queues in D_{I,Y}. In the next section, we give a simple idea for reducing the number of non-trivial queues and hence the size of D_{I,Y} to O((1/∊)(log log W) log W). In Section 6, we show how to further reduce the size by taking advantage of the tardiness of the data stream.

5. Reducing the Size of D_{I,Y}

Our idea for reducing the size is simple: for every C_{J,λ,κ} ∈ D_{I,Y}, the capacity is no longer fixed at κ = (4/∊) log W; instead, we start with a much smaller capacity, namely (4/∊) log log W, which is allowed to increase gradually during execution. To determine C_{J,λ,κ}'s capacity, we use a variable to keep track of the number f̄*(J) of items (a, u) with u ∈ J that have arrived since C_{J,λ,κ}'s creation. Let vJ be the total value of the nodes in C_{J,λ,κ} when it is created (vJ may not be zero if C_{J,λ,κ} results from the splitting of its parent). The capacity of C_{J,λ,κ} is determined as follows.

  • When (c − 1)Y/log W ≤ vJ + f̄*(J) < cY/log W for some integer c ≥ 1, the capacity of C_{J,λ,κ} is κ(c) = (4c/∊) log log W, i.e., we set κ = κ(c) and allow up to κ(c) non-trivial queues in C_{J,λ,κ}.

Note that when we increase the capacity of C_{J,λ,κ} to κ(c), we do not need to do anything, except that we now allow more non-trivial queues (up to κ(c)) in the data structure. Also note that when C_{J,λ,κ} is created during the trimming process, its inherited capacity may be larger than the supposed capacity κ(c); in such a case, we simply debit every non-trivial queue until some queue Q^x_{J,λ} has v(Q^x_{J,λ}) = d(Q^x_{J,λ}), and then we execute Lines 4 and 5 of the procedure Process( ) to make this queue trivial. We repeat this until the number of non-trivial queues is at most κ(c). The following theorem asserts that D_{I,Y} maintains the accuracy of its estimates under this new implementation, and gives the revised size and update time.
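The capacity rule is easy to state in code; a sketch (the function name and argument list are ours):

import math

def capacity(v_J, f_bar, Y, W, eps):
    """Sketch of the capacity rule: find the bucket index c with
    (c-1)Y/log W <= v_J + f_bar < cY/log W, then return kappa(c)."""
    c = math.floor((v_J + f_bar) * math.log2(W) / Y) + 1
    return math.ceil((4 * c / eps) * math.log2(math.log2(W)))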

Theorem 7

  • (1) Suppose that D_{I,Y} is currently covering [p, rI]. For any item a ∈ U and any timestamp t ∈ [p, rI], the estimate f̂a([t, rI]) of fa([t, rI]) obtained by the new D_{I,Y} satisfies |f̂a([t, rI]) − fa([t, rI])| ≤ ∊Y.

  • (2) D_{I,Y} has size O((1/∊)(log log W) log W), and supports O(log(1/∊) + log log W) update time.

Proof

Suppose that D_{I,Y} = 〈C_{J1,λ,κ(c1)}, …, C_{Jm,λ,κ(cm)}〉. From the fact that we are using C_{Ji,λ,κ(ci)} to monitor Ji, we conclude (ci − 1)Y/log W ≤ vJi + f̄*(Ji). It follows that Σ_{1≤i≤m} ciY/log W ≤ Σ_{1≤i≤m} (vJi + f̄*(Ji)) + Σ_{1≤i≤m} Y/log W, which is O(Y) because (i) |D_{I,Y}| = m = O(log W) and (ii) Σ_{1≤i≤m} (vJi + f̄*(Ji)) = O(Y) (otherwise D_{I,Y} would have been trimmed). Thus,

Σ_{1≤i≤m} ci = O(log W)     (9)

For Statement (1), the analysis of the accuracy of f̂a([t, rI]) is very similar to that of Theorem 6, except for the following difference: In the proof of Theorem 6, we show that d(dead(D_{I,Y})^a_{≥t}) ≤ 2(1 + ∊)Y log W/κ, and since κ is fixed at (4/∊) log W, d(dead(D_{I,Y})^a_{≥t}) ≤ ∊Y. Here, we also prove that d(dead(D_{I,Y})^a_{≥t}) ≤ ∊Y, but we have to prove it differently because the capacities are no longer fixed.

As argued previously, any node in dead(D_{I,Y})^a_{≥t} is in some C_{J,λ,κ} ∈ All_{≥p}. Below, we show that for any C_{J,λ,κ} ∈ All_{≥p}, we can make at most ∊Y/(2 log W) debit operations to the queue Q^a_{J,λ} of C_{J,λ,κ} during its lifespan. Together with the fact that |All_{≥p}| ≤ 2 log W, we have d(dead(D_{I,Y})^a_{≥t}) ≤ ∊Y.

Consider any C_{J,λ,κ} ∈ All_{≥p}. Note that the smaller its capacity, the larger the number of debit operations that can be made to the queue Q^a_{J,λ} of C_{J,λ,κ}. To maximize the number of debit operations made to Q^a_{J,λ}, suppose that vJ = 0, so that C_{J,λ,κ} has the smallest capacity κ(1) when it is created. Before its capacity increases to κ(2), C_{J,λ,κ} can make at most (1/κ(1))(Y/log W) debit operations to Q^a_{J,λ}. Then, during the next Y/log W arrivals of items (a, u) with u ∈ J, i.e., while Y/log W ≤ vJ + f̄*(J) < 2Y/log W, the capacity is κ(2), and at most (1/κ(2))(Y/log W) debit operations can be made to Q^a_{J,λ}. In general, during the period when (c − 1)Y/log W ≤ vJ + f̄*(J) < cY/log W, at most (1/κ(c))(Y/log W) debit operations can be made to Q^a_{J,λ}. If the largest capacity is κ(cmax), the total number of debit operations made to Q^a_{J,λ} is at most

(Y/log W)(1/κ(1) + ⋯ + 1/κ(cmax)) = (∊Y/(4(log log W) log W))(1 + 1/2 + ⋯ + 1/cmax) ≤ ∊Y(ln(cmax) + 1)/(4(log log W) log W)

which is smaller than ∊Y/(2 log W) because, by Equation (9), cmax = O(log W), which implies ln(cmax) + 1 ≤ 2 log log W (assuming W is larger than some constant).

We now prove (2). Note that the total number of non-trivial queues in D_{I,Y}, and hence the number of born-poor nodes, is at most Σ_{1≤i≤m} κ(ci) = Σ_{1≤i≤m} (4ci/∊) log log W. By Equation (9), Σ_{1≤i≤m} ci = O(log W), and it follows that the size of D_{I,Y} is O((1/∊)(log log W) log W).

For the update time, suppose that an item (a, u) arrives. We can find the C_{Ji,λ,κ} in D_{I,Y} = 〈C_{J1,λ,κ}, …, C_{Jm,λ,κ}〉 with u ∈ Ji in O(log m) = O(log log W) time by querying a balanced search tree storing the Ji's. By hashing (e.g., Cuckoo hashing [15], which supports constant update and query time), we can locate the queue Q^a_{Ji,λ} ∈ C_{Ji,λ,κ} in constant time. Then, by consulting an auxiliary balanced search tree on the intervals monitored by the nodes of Q^a_{Ji,λ}, we can find and update the node N of Q^a_{Ji,λ} with u ∈ i(N) in O(log(Y/λ)) = O(log(1/∊) + log log W) time. At times we may also need to execute Lines 3 and 4 of the procedure Process( ), which debit all the non-trivial queues in C_{Ji,λ,κ}. Using the de-amortization technique given in [16], this step takes constant time.

Note that occasionally, we may also need to clean up D_{I,Y} by calling Trim( ); this step takes time linear in the size of D_{I,Y}, which is O((1/∊)(log log W) log W).

6. Further Reducing the Size of D_{I,Y} for Streams with Small Tardiness

Recall that in an out-of-order data stream with tardiness dmax ∈ [0, W], any item (a, u) arriving at time τcur satisfies u ≥ τcur − dmax; in other words, the delay of any item is guaranteed to be at most dmax. This section extends D_{I,Y} to a data structure D̂_{I,Y} that takes advantage of this maximum delay guarantee to reduce the space usage. The idea is as follows. Since no new item can have a timestamp smaller than τcur − dmax, we will not make any further change to the nodes to the left of τcur − dmax, and hence we can consolidate these nodes to reduce space substantially. To handle the nodes with timestamps in [τcur − dmax, τcur], we use the data structure given in Section 5; since it monitors an interval of size dmax instead of W, its size is O((1/∊)(log log dmax) log dmax) instead of O((1/∊)(log log W) log W).

To implement $\widehat{D}_{I,\epsilon Y}$, we need a new operation called consolidate. Consider any list of queues $\langle Q^a_{J_1,\lambda}, Q^a_{J_2,\lambda}, \ldots, Q^a_{J_m,\lambda}\rangle$, where $J_1, J_2, \ldots, J_m$ are ordered from left to right and form a partition of the interval $J_{1\ldots m} = J_1 \cup \cdots \cup J_m$. We consolidate them into a single queue $Q^a_{J_{1\ldots m},\lambda}$ as follows:

  • Concatenate the queues into a single queue in which the nodes preserve the left-to-right order.

  • Starting from the leftmost node, examine every node $N$ from left to right; if $N$ is not the rightmost node and $v(N) < \lambda$, merge it with the node $N'$ immediately to its right, i.e., delete $N$ and set $v(N') = v(N) + v(N')$, $d(N') = d(N) + d(N')$, and $i(N') = i(N) \cup i(N')$.

Note that after the consolidation, the resulting queue $Q^a_{J_{1\ldots m},\lambda}$ has at most one node (the rightmost one) with value smaller than $\lambda$; a sketch of this operation follows.
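For concreteness, the following Python sketch implements the consolidation of a single item's queues, reusing the illustrative Node class from the earlier sketch; representing a queue as a Python list of nodes is our assumption:

    def consolidate(queues, lam):
        # Step 1: concatenate, preserving the left-to-right order of nodes.
        nodes = [n for q in queues for n in q]
        out, pending = [], None            # `pending` is a light node awaiting its merge
        for n in nodes:
            if pending is not None:        # merge pending into its right neighbour N'
                n.v += pending.v           # v(N') = v(N) + v(N')
                n.d += pending.d           # d(N') = d(N) + d(N')
                n.lo = pending.lo          # i(N') = i(N) U i(N'); intervals are adjacent
                pending = None
            if n.v < lam:                  # a non-rightmost light node keeps merging right
                pending = n
            else:
                out.append(n)
        if pending is not None:            # only the rightmost node may stay below lam
            out.append(pending)
        return out

The scan is a single left-to-right pass, so consolidating queues with $s$ nodes in total takes $O(s)$ time.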

Given the list $\langle\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$, we consolidate them into $\mathcal{C}_{J_{1\ldots m},\lambda,1/\epsilon}$ by first consolidating, for each item $a$, the queues $Q^a_{J_1,\lambda}, \ldots, Q^a_{J_m,\lambda}$ in $\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}$ into the queue $Q^a_{J_{1\ldots m},\lambda}$ and putting it in $\mathcal{C}_{J_{1\ldots m},\lambda,1/\epsilon}$. Then, we apply Lines 3–5 of procedure Process( ) repeatedly to reduce the number of non-trivial queues in the data structure to $\frac{1}{\epsilon}$.

We are now ready to describe how to extend $D_{I,\epsilon Y}$ to $\widehat{D}_{I,\epsilon Y}$. In our discussion, we fix $\lambda = \epsilon Y/\log d_{\max}$, and without loss of generality, we assume that $I = [1, W]$. Recall that $p_{\max}$ denotes the largest timestamp in $I$ such that $\bar{f}([p_{\max}, r_I]) > (1+\epsilon)Y$ (which implies $f^*([p_{\max}, r_I]) > Y$). We partition $I$ into sub-windows $I_1, I_2, \ldots, I_m$, each of size $d_{\max}$ (i.e., $I_i = [(i-1)d_{\max}+1, i\,d_{\max}]$). We divide the execution into different periods according to $\tau_{\mathrm{cur}}$, the current time.

  • During the 1st period, when $\tau_{\mathrm{cur}} \in [1, d_{\max}] = I_1$, $\widehat{D}_{I,\epsilon Y}$ is simply $D_{I_1,\epsilon Y}$.

  • During the 2nd period, when $\tau_{\mathrm{cur}} \in I_2$, $\widehat{D}_{I,\epsilon Y}$ maintains $D_{I_2,\epsilon Y}$ in addition to $D_{I_1,\epsilon Y}$.

  • During the 3rd period, when $\tau_{\mathrm{cur}} \in I_3$, $\widehat{D}_{I,\epsilon Y}$ maintains $D_{I_3,\epsilon Y}$ in addition to $D_{I_2,\epsilon Y}$. Also, $D_{I_1,\epsilon Y} = \langle\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$ is consolidated into $\mathcal{C}_{I_1,\lambda,1/\epsilon}$.

  • In general, during the $i$th period, when $\tau_{\mathrm{cur}} \in [(i-1)d_{\max}+1, i\,d_{\max}] = I_i$, $\widehat{D}_{I,\epsilon Y}$ maintains $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$, and also $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$, where $I_{1\ldots i-2} = I_1 \cup I_2 \cup \cdots \cup I_{i-2}$. Observe that in this period, no item $(a, u)$ with $u \in I_{1\ldots i-2}$ arrives (because the tardiness is $d_{\max}$), and thus we do not need to update $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$. However, we keep throwing away any node $N$ in $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ as soon as we know $i(N)$ is to the left of $p_{\max}+1$.

  • When entering the $(i+1)$st period, we do the following: keep $D_{I_i,\epsilon Y}$, create $D_{I_{i+1},\epsilon Y}$, merge $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ with $D_{I_{i-1},\epsilon Y} = \langle\mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$, and then obtain $\mathcal{C}_{I_{1\ldots i-1},\lambda,1/\epsilon}$ by consolidating $\langle\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}, \mathcal{C}_{J_1,\lambda,\kappa(c_1)}, \ldots, \mathcal{C}_{J_m,\lambda,\kappa(c_m)}\rangle$. (A sketch of this step is given after the list.)
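The per-period bookkeeping can be summarized by the following schematic, again for the queues of one fixed item $a$; enter_next_period, new_structure and the state keys are hypothetical names, and consolidate is the sketch given earlier:

    def enter_next_period(state, lam, new_structure):
        # state['prefix']  : the consolidated queue of C_{I_{1..i-2},lambda,1/eps}
        # state['older']   : the queues of D_{I_{i-1},eps Y}, about to be frozen
        # state['current'] : the queues of D_{I_i,eps Y}
        frozen = state.pop('older', None)
        if frozen is not None:
            # No item with so old a timestamp can still arrive (tardiness <= d_max),
            # so fold D_{I_{i-1}} into the consolidated prefix C_{I_{1..i-1}}.
            state['prefix'] = consolidate([state.get('prefix', [])] + frozen, lam)
        state['older'] = state['current']    # keep D_{I_i}
        state['current'] = new_structure()   # create a fresh D_{I_{i+1}}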

Given any $t \in [p_{\max}+1, r_I]$, the estimate of $f_a([t, r_I])$ given by $\widehat{D}_{I,\epsilon Y}$ is

$$\hat{f}_a([t, r_I]) = v\left(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t\right)$$
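Assuming, as in the alive/dead terminology used throughout, that a node is alive at $t$ exactly when its monitored interval has not fallen entirely to the left of $t$, the estimate is a suffix sum over the nodes for $a$; a sketch in terms of the Node class above:

    def estimate(nodes, t):
        # v(alive(...)^a_t): total value of the nodes for a whose interval
        # i(N) is not entirely to the left of t.
        return sum(n.v for n in nodes if n.hi >= t)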

The following theorem gives the accuracy of $\hat{f}_a([t, r_I])$, as well as the size and update time of $\widehat{D}_{I,\epsilon Y}$.

Theorem 8

  • For any $t \in [p_{\max}+1, r_I]$, the estimate $\hat{f}_a([t, r_I])$ given by $\widehat{D}_{I,\epsilon Y}$ satisfies

$$f_a([t, r_I]) - 2\epsilon Y \leq \hat{f}_a([t, r_I]) \leq f_a([t, r_I]) + 2\epsilon Y$$

  • $\widehat{D}_{I,\epsilon Y}$ has size $O(\frac{1}{\epsilon}(\log\log d_{\max})\log d_{\max})$ and supports $O(\log\frac{1}{\epsilon} + \log\log d_{\max})$ update time.

Proof

Recall that $I$ is partitioned into sub-intervals $I_1, I_2, \ldots, I_m$. Suppose that $t \in I_\kappa$. Note that if we had not performed any consolidation,

$$v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) = v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t) + \sum_{\kappa+1\leq i\leq m} v(\mathrm{alive}(D_{I_i,\epsilon Y})^a)$$

Note that for $\kappa+1 \leq i \leq m$, $v(\mathrm{alive}(D_{I_i,\epsilon Y})^a) \leq f_a(I_i)$; as for $v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t)$, since $|I_\kappa| = d_{\max}$, the same argument used in the proof of Lemma 3 gives us $v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t) \leq f_a([t, r_{I_\kappa}]) + \lambda\log d_{\max}$. Hence

$$v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) = v(\mathrm{alive}(D_{I_\kappa,\epsilon Y})^a_t) + \sum_{\kappa+1\leq i\leq m} v(\mathrm{alive}(D_{I_i,\epsilon Y})^a) \leq f_a([t, r_{I_\kappa}]) + \lambda\log d_{\max} + \sum_{\kappa+1\leq i\leq m} f_a(I_i) = f_a([t, r_I]) + \lambda\log d_{\max} \quad (10)$$

The consolidation step may add errors to $v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t)$. To bound them, let $N_1, N_2, \ldots$ be the nodes for $a$ in $\widehat{D}_{I,\epsilon Y}$, ordered from left to right. Suppose that $t \in i(N_h)$. Note that

  • the consolidation step adds at most $\lambda$ units to $v(N_h)$ before we move on to consider the node immediately to its right, and

  • for any node $N_i$ with $i \geq h+1$, any node $N$ that has been merged into $N_i$ must be to the right of $N_h$, and thus to the right of $t$; it follows that $N$ contributes $v(N)$ to $v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t)$ in Equation (10), and its merging does not make any change.

In conclusion, the consolidation steps introduce at most $\lambda$ extra error, and Equation (10) becomes $v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) \leq f_a([t, r_I]) + \lambda\log d_{\max} + \lambda \leq f_a([t, r_I]) + 2\epsilon Y$ (since $\lambda = \epsilon Y/\log d_{\max}$, we have $\lambda\log d_{\max} + \lambda \leq 2\epsilon Y$), which is the second inequality of the theorem.

To prove the first inequality, suppose that we ask for the estimate $\hat{f}_a([t, r_I])$ during the $i$th period, when we have $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$, $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$. Recall that $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ comes from consolidating $D_{I_1,\epsilon Y}, D_{I_2,\epsilon Y}, \ldots, D_{I_{i-2},\epsilon Y}$. As in all our previous analyses, we have

$$f_a([t, r_I]) - v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) \leq v(\mathrm{node}(\widehat{D}_{I,\epsilon Y})^a_t) - v(\mathrm{alive}(\widehat{D}_{I,\epsilon Y})^a_t) = d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t)$$

(Note that the merging of nodes during consolidations does not take away any value.) To bound $d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t)$, suppose that $p_{\max} \in I_\kappa$. Then, all the nodes to the left of $I_\kappa$ have been thrown away. Among $D_{I_\kappa,\epsilon Y}, D_{I_{\kappa+1},\epsilon Y}, \ldots, D_{I_m,\epsilon Y}$, only $D_{I_\kappa,\epsilon Y}$ may have been trimmed. Note that

  • $d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t) \leq d(\mathrm{dead}(D_{I_\kappa,\epsilon Y})^a_{p_{\max}}) + \sum_{\kappa+1\leq\ell\leq m} d(\mathrm{dead}(D_{I_\ell,\epsilon Y})^a)$,

  • as in the proof of Theorem 7, we can argue that $d(\mathrm{dead}(D_{I_\kappa,\epsilon Y})^a_{p_{\max}}) \leq \epsilon Y$, and

  • for the other $D_{I_\ell,\epsilon Y}$, since their capacity is at least $1/\epsilon$,

$$\sum_{\kappa+1\leq\ell\leq m} d(\mathrm{dead}(D_{I_\ell,\epsilon Y})^a) \leq \sum_{\kappa+1\leq\ell\leq m} \bar{f}(I_\ell)/(1/\epsilon) \leq \epsilon\,\bar{f}([p_{\max}+1, r_I]) \leq \epsilon Y$$

Thus, $d(\mathrm{dead}(\widehat{D}_{I,\epsilon Y})^a_t) \leq 2\epsilon Y$, and the first inequality follows.

For Statement (2), note that both $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$ have size $O(\frac{1}{\epsilon}(\log\log d_{\max})\log d_{\max})$ (by Theorem 7, since $|I_{i-1}| = |I_i| = d_{\max}$), and $\mathcal{C}_{I_{1\ldots i-2},\lambda,1/\epsilon}$ has size $O(Y/\lambda + \frac{1}{\epsilon}) = O(\frac{1}{\epsilon}\log d_{\max})$; thus the size of $\widehat{D}_{I,\epsilon Y}$ is $O(\frac{1}{\epsilon}(\log\log d_{\max})\log d_{\max})$. For the update time, it suffices to note that it is dominated by the update times of $D_{I_{i-1},\epsilon Y}$ and $D_{I_i,\epsilon Y}$.

Figure 1. Suppose that $\lambda = 4$. (i) shows the queue $Q^a_{I,\lambda}$ before the arrivals of items $(a, 1)$, $(a, 2)$, $(a, 3)$, $(a, 8)$; (ii) is the resulting queue after the updates for these items; (iii) shows that after the arrival of another item $(a, 1)$, the first node in (ii) is updated and refined.

Figure 2. Interesting intervals for $I = [1, 8]$.

Figure 3. Split of $\mathcal{C}_{[1,8],\lambda,\kappa}$.

Figure 4. Trim($\langle\mathcal{C}_{[2,2],\lambda,\kappa}, \mathcal{C}_{[3,4],\lambda,\kappa}, \mathcal{C}_{[5,8],\lambda,\kappa}\rangle$, 3).
Table 1. The space complexity for answering $\epsilon$-approximate frequent item set queries over a sliding time window. Results from this paper are marked with [†]. Note that we assume $B \geq \frac{1}{\epsilon}\log W$; otherwise, we can always store all items in the window for an exact answer, using $O(\frac{1}{\epsilon}\log W)$ words. Similarly, for the result with tardiness, we assume $B \geq \frac{1}{\epsilon}\log d_{\max}$.

Model: Space Complexity (words)
Synchronous [7]: $O(\frac{1}{\epsilon}\log(\epsilon B))$
Asynchronous [1]: $O(\frac{1}{\epsilon}\log W \log(\frac{\epsilon B}{\log W})\min\{\log W, \frac{1}{\epsilon}\}\log|U|)$
Asynchronous [†]: $O(\frac{1}{\epsilon}\log W \log(\frac{\epsilon B}{\log W})\log\log W)$
Asynchronous with tardiness [†]: $O(\frac{1}{\epsilon}\log d_{\max}\log(\frac{\epsilon B}{\log d_{\max}})\log\log d_{\max})$

Acknowledgments

H.F. Ting is partially supported by the GRF Grant HKU-716307E; T.W. Lam is partially supported by the GRF Grant HKU-713909E.

References

  1. Cormode, G.; Korn, F.; Tirthapura, S. Time-Decaying Aggregates in Out-of-Order Streams. Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'08, Vancouver, Canada, 9–11 June 2008; pp. 89–98.
  2. Karp, R.; Shenker, S.; Papadimitriou, C. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 2003, 28, 51–55.
  3. Demaine, E.; López-Ortiz, A.; Munro, J. Frequency Estimation of Internet Packet Streams with Limited Space. Proceedings of the 10th Annual European Symposium on Algorithms, ESA'02, Rome, Italy, 17–21 September 2002; pp. 348–360.
  4. Muthukrishnan, S. Data Streams: Algorithms and Applications; Now Publishers Inc.: Boston, MA, USA, 2005.
  5. Babcock, B.; Babu, S.; Datar, M.; Motwani, R.; Widom, J. Models and Issues in Data Stream Systems. Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS'02, Madison, WI, USA, 3–5 June 2002; pp. 1–16.
  6. Arasu, A.; Manku, G. Approximate Counts and Quantiles over Sliding Windows. Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'04, Paris, France, 14–16 June 2004; pp. 286–296.
  7. Lee, L.K.; Ting, H.F. A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows. Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'06, Chicago, IL, USA, 26–28 June 2006; pp. 290–297.
  8. Lee, L.K.; Ting, H.F. Maintaining Significant Stream Statistics over Sliding Windows. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'06, Miami, FL, USA, 22–26 January 2006; pp. 724–732.
  9. Datar, M.; Gionis, A.; Indyk, P.; Motwani, R. Maintaining stream statistics over sliding windows. SIAM J. Comput. 2002, 31, 1794–1813.
  10. Tirthapura, S.; Xu, B.; Busch, C. Sketching Asynchronous Streams over a Sliding Window. Proceedings of the 25th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC'06, Denver, CO, USA, 23–26 July 2006; pp. 82–91.
  11. Busch, C.; Tirthapura, S. A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window. Proceedings of the 24th Annual Symposium on Theoretical Aspects of Computer Science, STACS'07, Aachen, Germany, 22–24 February 2007; pp. 465–475.
  12. Cormode, G.; Tirthapura, S.; Xu, B. Time-decaying sketches for robust aggregation of sensor data. SIAM J. Comput. 2009, 39, 1309–1339.
  13. Chan, H.L.; Lam, T.W.; Lee, L.K.; Ting, H.F. Approximating Frequent Items in Asynchronous Data Stream over a Sliding Window. Proceedings of the 7th Workshop on Approximation and Online Algorithms, WAOA'09, Copenhagen, Denmark, 10–11 September 2009; pp. 49–61.
  14. Misra, J.; Gries, D. Finding repeated elements. Sci. Comput. Program. 1982, 2, 143–152.
  15. Arbitman, Y.; Naor, M.; Segev, G. De-amortized Cuckoo Hashing: Provable Worst-Case Performance and Experimental Results. Proceedings of the 36th International Colloquium on Automata, Languages and Programming, ICALP'09, Rhodes, Greece, 5–12 July 2009; pp. 107–118.
  16. Hung, R.S.; Lee, L.K.; Ting, H.F. Finding frequent items over sliding windows with constant update time. Inf. Process. Lett. 2010, 110, 257–260.
