Monitoring Threshold Functions over Distributed Data Streams with Node Dependent Constraints

Monitoring data streams in a distributed system has attracted considerable interest in recent years. The task of feature selection (e.g., by monitoring the information gain of various features) requires a very high communication overhead when addressed using straightforward centralized algorithms. While most of the existing algorithms deal with monitoring simple aggregated values such as frequency of occurrence of stream items, motivated by recent contributions based on geometric ideas we present an alternative approach. The proposed approach enables monitoring values of an arbitrary threshold function over distributed data streams through stream dependent constraints applied separately on each stream. We report numerical experiments on real-world data that detect instances where communication between nodes is required, and compare the approach and the results to those recently reported in the literature.


Introduction
In many emerging applications one needs to process a continuous stream of data in real time. Sensor networks [1], network monitoring [2], and real-time analysis of financial data [3,4] are examples of such applications. Monitoring queries are a particular class of queries in the context of data streams. Previous work in this area deals with monitoring simple aggregates [2], or term frequency occurrence in a set of distributed streams [5].
A general framework for efficient local algorithms monitoring the $l_2$ norm of the data average of large networks of computers, wireless sensors, or mobile devices was introduced in [6], and further developed in [7]. The current contribution is motivated by results recently reported in [8,9] with focus on a special case of the general model considered in [7]. This special case can be briefly described as follows. Let $S = \{s_1, \ldots, s_n\}$ be a set of data streams collected at $n$ nodes. Let $v_1(t), \ldots, v_n(t)$ be $d$-dimensional real time-varying vectors derived from the streams. For a function $f : \mathbb{R}^d \to \mathbb{R}$ we would like to confirm the inequality

$$f\left(\frac{v_1(t) + \cdots + v_n(t)}{n}\right) > 0 \qquad (1)$$

while minimizing communication between the nodes. Monitoring inequality (1), or monitoring the geometric location of the mean, is a problem that can be addressed using a variety of different mathematical tools. A specific choice of a monitoring tool is up to the user. We note that the problem as stated above does not specify any particular tool, $l_2$ or any other norm, that is required to address it.

The problem was recently addressed in [10], where the approach proposed imposes equal constraints on each node. In addition to the previously used $l_2$ norm (see, e.g., [6][7][8][9]11]) the paper provides a theoretical framework for using a wide variety of convex functions, and, as an illustration, runs numerical experiments using $l_2$, $l_1$ and $l_\infty$ norms. In all numerical experiments reported in [10] an application of the same algorithm with the $l_1$ norm generates superior results. This paper extends results in [10] in a machine learning direction: the constraint imposed on each node depends on the stream history at the node.
As a simple illustration of the problem considered in the paper we focus on two scalar functions $v_1(t)$ and $v_2(t)$, and the identity function $f$ (i.e., $f(x) = x$). We would like to guarantee the inequality

$$v(t) = \frac{v_1(t) + v_2(t)}{2} > 0$$

while keeping the nodes silent as much as possible. A possible strategy is to verify the initial inequality $v(t_0) = \frac{v_1(t_0) + v_2(t_0)}{2} > 0$ and to keep both nodes silent while

$$|v_j(t) - v_j(t_0)| < \delta, \quad t \ge t_0, \quad j = 1, 2.$$

The first time $t_1$ when one of the functions, say $v_1(t)$, crosses the boundary of the local constraint, i.e., $|v_1(t_1) - v_1(t_0)| \ge \delta$, the nodes communicate, the mean $v(t_1)$ is computed, the local constraint $\delta$ is updated and made available to the nodes, and the nodes are kept silent as long as the inequalities

$$|v_j(t) - v_j(t_1)| < \delta, \quad t \ge t_1, \quad j = 1, 2$$

hold.
The main contributions of this paper are listed next. We demonstrate that:
1. This approach works for a non-linear monitoring function $f$.
2. The results depend on the choice of a norm, and the numerical results reported show that $l_2$ is probably not the best norm when one aims to minimize communication between nodes. In addition to the numerical results presented we also provide a simple illustrative example that highlights this point (see Remark 4.2).
3. Selection of node dependent local constraints may decrease communication between the nodes.
4. The approach suggested in [10] and adopted in this paper paves the way to achieve further communication savings by clustering nodes, and monitoring cluster coordinators. Although this research direction is beyond the scope of this paper we address it briefly in Section 6.
In the next section we provide a text mining related example that leads to a non-linear threshold function f .

Text Mining Application
Let $T$ be a finite text collection (for example a collection of mail or news items). We denote the size of the set $T$ by $|T|$. We will be concerned with two subsets of $T$:
1. $R$, the set of "relevant" texts (texts not labeled as spam),
2. $F$, the set of texts that contain a "feature" (a word or term, for example).
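We denote complements of the sets by $\bar{R}$, $\bar{F}$ respectively (i.e., $R \cup \bar{R} = F \cup \bar{F} = T$), and consider the relative sizes $x_{ij}(T)$, $i, j = 1, 2$, of the four sets $F \cap R$, $F \cap \bar{R}$, $\bar{F} \cap R$, and $\bar{F} \cap \bar{R}$, i.e., the cardinality of each set divided by $|T|$. Note that $0 \le x_{ij} \le 1$, and $x_{11} + x_{12} + x_{21} + x_{22} = 1$. The function $f$ is defined on the simplex (i.e., $x_{ij} \ge 0$, $\sum x_{ij} = 1$), and given by

$$f(x_{11}, x_{12}, x_{21}, x_{22}) = \sum_{i,j} x_{ij} \log \frac{x_{ij}}{(x_{i1} + x_{i2})(x_{1j} + x_{2j})}, \qquad (3)$$

where $\log x = \log_2 x$ throughout the paper. We next relate the empirical version of information gain, Equation (3), and the information gain (see e.g., [12]).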
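Let $Y$ and $X$ be random variables with known distributions. Entropy of $Y$ is defined by

$$H(Y) = -\sum_{y} P(Y = y) \log P(Y = y).$$

Entropy of $Y$ conditional on $X = x$, denoted by $H(Y|X = x)$, is defined by

$$H(Y|X = x) = -\sum_{y} P(Y = y | X = x) \log P(Y = y | X = x).$$

Conditional entropy $H(Y|X)$ and information gain $IG(Y|X)$ are given by

$$H(Y|X) = \sum_{x} P(X = x) H(Y|X = x), \qquad IG(Y|X) = H(Y) - H(Y|X).$$

Information gain is symmetric, indeed

$$IG(Y|X) = \sum_{x,y} P(X = x, Y = y) \log \frac{P(X = x, Y = y)}{P(X = x) P(Y = y)} = IG(X|Y).$$

Due to convexity of $g(x) = -\log x$, information gain is non-negative:

$$IG(Y|X) = \sum_{x,y} P(X = x, Y = y)\, g\left(\frac{P(X = x) P(Y = y)}{P(X = x, Y = y)}\right) \ge g\left(\sum_{x,y} P(X = x) P(Y = y)\right) \ge g(1) = 0,$$

where the sums run over the pairs $(x, y)$ with $P(X = x, Y = y) > 0$. It is easy to see that Equation (3) provides information gain for the "feature".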
As an example, we consider $n$ agents installed on $n$ different servers and a stream of texts arriving at the servers. Let $T_h = \{t_{h1}, \ldots, t_{hw}\}$ be the last $w$ texts received at the $h$th server, with $T = \bigcup_{h=1}^{n} T_h$. Note that

$$x_{ij}(T) = \frac{1}{n} \sum_{h=1}^{n} x_{ij}(T_h),$$

i.e., entries of the global contingency table $\{x_{ij}(T)\}$ are the averages of the entries of the local contingency tables $\{x_{ij}(T_h)\}$, $h = 1, \ldots, n$. For the given "feature" and a predefined positive threshold $r$ we would like to verify the inequality

$$f(x_{11}(T), x_{12}(T), x_{21}(T), x_{22}(T)) - r > 0$$

while minimizing communication between the servers. Note that Equation (3) is a nonlinear function.
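The monitored quantity can be illustrated with a short Python sketch. It assumes that $f$ is the usual mutual-information form of the empirical information gain of a 2x2 contingency table with base-2 logarithms; the function names and the two local tables are hypothetical and serve only as an illustration.

```python
import math

def information_gain(x11, x12, x21, x22):
    """Empirical information gain (base-2 logs) of a 2x2 contingency table
    whose entries are relative frequencies summing to 1 (assumed form of f)."""
    table = [[x11, x12], [x21, x22]]
    rows = [x11 + x12, x21 + x22]      # marginals over the first index
    cols = [x11 + x21, x12 + x22]      # marginals over the second index
    gain = 0.0
    for i in range(2):
        for j in range(2):
            x = table[i][j]
            if x > 0.0:                 # convention: 0 * log 0 = 0
                gain += x * math.log2(x / (rows[i] * cols[j]))
    return gain

def mean_table(local_tables):
    """Global contingency table as the entrywise average of local tables."""
    n = len(local_tables)
    return tuple(sum(t[k] for t in local_tables) / n for k in range(4))

# Two hypothetical servers holding windows of equal size (made-up numbers).
local = [(0.10, 0.20, 0.30, 0.40), (0.30, 0.20, 0.10, 0.40)]
global_table = mean_table(local)
r = 0.0025                              # illustrative threshold
print(information_gain(*global_table) - r > 0)
```

Evaluating `information_gain` on the averaged table requires the local tables to be collected at one place, which is exactly the communication the monitoring scheme tries to avoid.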
The case of a nonlinear monitoring function is different from that of a linear one (in fact [8] calls the nonlinear monitoring function case "fundamentally different"). In the next section we demonstrate the difference, and describe an efficient way to handle the nonlinear case.

Non-Linear Threshold Function: An Example
We start with a slight modification of a simple one dimensional example presented in [8].
Example 3.1 Let $f(x) = x^2 - 9$, and let $v_i$, $i = 1, 2$, be scalar values stored at two distinct nodes. Note that if $v_1 = -4$ and $v_2 = 4$, then

$$f(v_1) = f(v_2) = 7 > 0, \quad \text{while} \quad f\left(\frac{v_1 + v_2}{2}\right) = f(0) = -9 < 0.$$

Finally, when $v_1 = 2$ and $v_2 = 6$ one has

$$f(v_1) = -5 < 0, \quad f(v_2) = 27 > 0, \quad f\left(\frac{v_1 + v_2}{2}\right) = f(4) = 7 > 0. \qquad (7)$$

The simple illustrative example leads the authors of [8] to conclude that it is impossible to determine from the values of $f$ at the nodes whether its value at the average is above the threshold or not. The remedy proposed is to consider the vectors

$$u_j(t) = v(t_i) + [v_j(t) - v_j(t_i)], \quad j = 1, \ldots, n, \quad t \ge t_i,$$

and to monitor the values of $f$ on the convex hull $\mathrm{conv}\{u_1(t), \ldots, u_n(t)\}$ instead of the value of $f$ at the average, Equation (1). This strategy leads to sufficient conditions for Equation (1), and may be conservative.
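The monitoring techniques for values of $f$ on $\mathrm{conv}\{u_1(t), \ldots, u_n(t)\}$ without communication between the nodes are based on the following two observations:
1. Convexity property. The mean $v(t)$ is given by $v(t) = \frac{1}{n} \sum_{j=1}^{n} u_j(t)$, i.e., the mean belongs to $\mathrm{conv}\{u_1(t), \ldots, u_n(t)\}$.
2. Ball cover. If $B_2(x, y)$ denotes the $l_2$ ball of radius $\frac{1}{2}\|x - y\|_2$ centered at $\frac{x + y}{2}$, then

$$\mathrm{conv}\{v(t_i), u_1(t), \ldots, u_n(t)\} \subseteq \bigcup_{j=1}^{n} B_2(v(t_i), u_j(t)) \qquad (8)$$

(see Figure 1).

Since each ball $B_2(v(t_i), u_j(t))$ can be monitored by node $j$ with no communication with other nodes, Equation (8) allows one to split monitoring of $\mathrm{conv}\{v(t_i), u_1(t), \ldots, u_n(t)\}$, $t \ge t_i$, into $n$ independent tasks executed by the $n$ nodes separately and without communication.

In this paper we propose an alternative strategy that will be briefly explained next using Example 3.1, $f(x) = x^2 - 9$, and the assignment provided by Equation (7). Let $\delta$ be a positive number. Consider two intervals of radius $\delta$ centered at $v_1 = 2$ and $v_2 = 6$, i.e., we are interested in the intervals $[2 - \delta, 2 + \delta]$ and $[6 - \delta, 6 + \delta]$. If $v_1 \in [2 - \delta, 2 + \delta]$, $v_2 \in [6 - \delta, 6 + \delta]$, and $\delta$ is small, then the average $\frac{v_1 + v_2}{2}$ is not far from 4, and the value of $f$ at the average is not far from 7 (hence positive). In fact the sum of the intervals is $[8 - 2\delta, 8 + 2\delta]$, so that $\frac{v_1 + v_2}{2} \in [4 - \delta, 4 + \delta]$. The "zero" points $Z_f$ of $f$ are $-3$ and $3$, and as soon as $\delta$ is large enough so that the interval $[4 - \delta, 4 + \delta]$ "hits" a point where $f$ vanishes, communication between the nodes is required in order to verify Equation (1). In this particular example, as long as $\delta \le 1$ and the local constraints $|v_1 - 2| < \delta$, $|v_2 - 6| < \delta$ hold, one has $\frac{v_1 + v_2}{2} > 3$, hence $f\left(\frac{v_1 + v_2}{2}\right) > 0$, and, therefore, no communication is required between the nodes.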
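The interval argument for $f(x) = x^2 - 9$ with nodes centered at 2 and 6 can be checked numerically. The following Python sketch is only an illustration: the helper names are hypothetical, and the local constraints are assumed to be strict, so the set of possible means is an open interval.

```python
def f(x):
    return x * x - 9.0

def mean_interval(centers, delta):
    """Interval containing the mean when every v_j stays within delta of its center."""
    c = sum(centers) / len(centers)
    return c - delta, c + delta

def silent_is_safe(centers, delta, zeros=(-3.0, 3.0)):
    """True if the open interval of possible means contains no zero of f."""
    lo, hi = mean_interval(centers, delta)
    return all(not (lo < z < hi) for z in zeros)

print(silent_is_safe([2.0, 6.0], 1.0))   # possible means lie in (3, 5)
print(silent_is_safe([2.0, 6.0], 1.5))   # possible means lie in (2.5, 5.5), which contains 3
```

With $\delta = 1$ the nodes may stay silent, while with $\delta = 1.5$ the interval of possible means reaches the zero point 3 and the sign of $f$ at the mean is no longer guaranteed.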
The condition presented above is a sufficient condition that guarantees Equation (1). As any sufficient condition is, this condition can be conservative. In fact when the distance is provided by the $l_2$ norm, this sufficient condition is more conservative than the one provided by "ball monitoring" Equation (9) suggested in [8]. On the other hand, since only a scalar $\delta$ should be communicated to each node, the value of the updated mean $v(t_i)$ need not be transmitted (hence communication savings are possible), and there is no need to compute the distance from the center of each ball $B_2(v(t_i), u_j(t))$, $j = 1, \ldots, n$, $t > t_i$, to the zero set $Z_f$. For a detailed comparison of results we refer the reader to [10].
We conclude the section by remarking that when inequality Equation (1) is reversed the same technique can be used to monitor the reversed inequality while minimizing communication between the nodes. We provide additional details in Section 5. In the next section we extend the above "monitoring with no communication" argument to the general vector setting. The approach suggested in the next section is motivated by earlier research on robust stability of control systems (see e.g., [13]).

Convex Minimization Problem
In this section we state the monitoring problem as a convex minimization problem. For an appropriate analysis background we refer the interested reader to the classical monograph [14]. For the relevant convex analysis material see [15].
Consider the following optimization problem. Assume that inequality Equation (1) holds for the vector $w \in \mathbb{R}^{nd}$ formed by stacking the $n$ node vectors, i.e., $f(Bw) > 0$, where $Bw$ is the mean of the $n$ node vectors. We are looking for a vector $x$ "nearest" to $w$ so that $f(Bx) = 0$, i.e., $Bx = z$ for some $z \in Z_f$ (where $Z_f$ is the zero set of $f$, i.e., $Z_f = \{z : f(z) = 0\}$). We now fix $z \in Z_f$ and denote the distance from $w$ to the set $\{x : Bx = z\}$ by $r(z)$. Note that for each $y$ inside the ball of radius $r(z)$ centered at $w$, one has $By \ne z$. If $y$ belongs to a ball of radius $r = \inf_{z \in Z_f} r(z)$ centered at $w$, then the inequality $f(By) > 0$ holds true. Let $F(x)$ be a "norm" on $\mathbb{R}^{nd}$ (the specific functions $F$ we run the numerical experiments with will be described later). The nearest "bad" vector problem described above is the following: for $z \in Z_f$ identify

$$r(z) = \inf_{x} F(x - w) \quad \text{subject to} \quad Bx = z. \qquad (13)$$
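We note that Equation (13) is equivalent to

$$\inf_{x} \sup_{\lambda} \left\{ F(x - w) - \lambda^T (Bx - z) \right\}.$$

The function

$$K(\lambda, x) = F(x - w) - \lambda^T (Bx - z)$$

is concave (actually linear) in $\lambda$, and convex in $x$. Hence (see e.g., [15])

$$\inf_{x} \sup_{\lambda} K(\lambda, x) = \sup_{\lambda} \inf_{x} K(\lambda, x).$$

The right hand side of the above equality can be conveniently written as follows

$$\sup_{\lambda} \left\{ \lambda^T (z - Bw) - \sup_{x} \left[ (B^T \lambda)^T (x - w) - F(x - w) \right] \right\}.$$

The conjugate $g^*(y)$ of a function $g(x)$ is defined by $g^*(y) = \sup_{x} \{ y^T x - g(x) \}$ (see e.g., [15]). We note that $\sup_{x} \{ (B^T \lambda)^T (x - w) - F(x - w) \} = F^*(B^T \lambda)$, hence to compute $r(z)$ one has to deal with

$$\sup_{\lambda} \left\{ \lambda^T (z - Bw) - F^*(B^T \lambda) \right\}.$$

For many functions $g$ the conjugate $g^*$ can be easily computed. Next we list conjugate functions for the most popular norms: for $g(x) = \|x\|_p$ with $p = 1, 2, \infty$ the conjugate $g^*(y)$ equals $0$ if $\|y\|_q \le 1$ and $+\infty$ otherwise, where $q = \infty, 2, 1$ respectively. We note that some of the functions $F$ we consider in this paper are different from $l_p$ norms (see Table 1 for the list of the functions).

We first select $F(x) = \|x\|_\infty$, and show below that in this case $r(z) = \|z - Bw\|_\infty$. Indeed, since $Bw$ is the mean of the $n$ blocks of $w$, one has $\|B^T \lambda\|_1 = \|\lambda\|_1$, and

$$r(z) = \sup_{\|\lambda\|_1 \le 1} \lambda^T (z - Bw).$$

The solution to this maximization problem is $\|z - Bw\|_\infty$. Analogously, when $F(x) = \max_j \|x_j\|_2$ (where $x_j \in \mathbb{R}^d$ denotes the $j$th block of $x$), $F^*(B^T \lambda)$ equals $0$ if $\|\lambda\|_2 \le 1$ and $+\infty$ otherwise, so that $r(z) = \sup_{\|\lambda\|_2 \le 1} \lambda^T (z - Bw)$. Finally the value for $r(z)$ is given by $\|z - Bw\|_2$. When $F(x) = \max_j \|x_j\|_1$ one has $r(z) = \|z - Bw\|_1$. For clarity's sake we collect the above results in Table 1.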
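Table 1. Norm-ball radius correspondence for three different norms and fixed $w \in \mathbb{R}^{nd}$.

$F(x)$                          $r(z)$
$\max_j \|x_j\|_\infty$         $\|z - Bw\|_\infty$
$\max_j \|x_j\|_2$              $\|z - Bw\|_2$
$\max_j \|x_j\|_1$              $\|z - Bw\|_1$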
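The norm-ball radius correspondence can be sanity-checked numerically. The Python sketch below assumes that $F(x - w) = \max_j \|x_j - w_j\|$ (the largest per-node deviation) and that $Bw$ is the mean of the $n$ blocks of $w$; under these assumptions the triangle inequality gives $\|z - Bw\| \le F(x - w)$ for every $x$ with $Bx = z$, with equality when every block is shifted by $z - Bw$. All function names and data are hypothetical.

```python
import random

def norm(x, p):
    """l_1, l_2 or l_inf norm of a list of floats (p in {1, 2, 'inf'})."""
    if p == 1:
        return sum(abs(t) for t in x)
    if p == 2:
        return sum(t * t for t in x) ** 0.5
    return max(abs(t) for t in x)

def mean(vectors):
    n, d = len(vectors), len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(d)]

def radius(w, z, p):
    """Closed form r(z) = ||z - Bw||, where Bw is the mean of the blocks of w."""
    bw = mean(w)
    return norm([zk - bk for zk, bk in zip(z, bw)], p)

def F(x, w, p):
    """F(x - w) = max_j ||x_j - w_j||: the largest per-node deviation."""
    return max(norm([a - b for a, b in zip(xj, wj)], p) for xj, wj in zip(x, w))

def random_feasible(w, z, rng):
    """Random x with Bx = z: perturb every block, then shift all blocks equally."""
    x = [[wk + rng.uniform(-1, 1) for wk in wj] for wj in w]
    shift = [zk - mk for zk, mk in zip(z, mean(x))]
    return [[xk + sk for xk, sk in zip(xj, shift)] for xj in x]

rng = random.Random(0)
w = [[0.0, 1.0], [2.0, -1.0], [1.0, 3.0]]   # three nodes, d = 2 (made-up data)
z = [2.0, 2.0]
for p in (1, 2, "inf"):
    r = radius(w, z, p)
    best = min(F(random_feasible(w, z, rng), w, p) for _ in range(1000))
    print(p, r, best >= r - 1e-9)           # no feasible x is closer than r(z)
```

The random search only illustrates the lower bound; it is not a substitute for the duality argument.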
In the algorithm described below the norm is denoted just by $\|\cdot\|$ (numerical experiments presented in Section 5 are conducted with all three norms). The monitoring algorithm we propose is the following.

Algorithm 4.1 Threshold monitoring algorithm.
1. Set $i = 0$.
6. If $\|v_j - v_j(t_i)\| < \delta$ for each $j = 1, \ldots, n$ go to step 5 else go to step 3.

In what follows, we assume that transmission of a double precision real number amounts to broadcasting one message. The message computation is based on the assumption that all nodes are updated by a new text simultaneously. When a mean update is required, a coordinator (root) requests and receives messages from the nodes. We next count the number of messages that should be broadcast per iteration if the local constraint $\delta$ is violated at least at one node. We shall denote the set of all nodes by $N$, the set of nodes complying with the constraint by $N_C$, and the set of nodes violating the constraint by $N_V$ (so that $N = N_C \cup N_V$). The cardinalities of the sets are denoted by $|N|$, $N_C$, and $N_V$ respectively (with a slight abuse of notation), so that $|N| = N_C + N_V$.
Assuming $N_V > 0$ one has the following:
1. The $N_V$ violating nodes transmit their scalar ID and new coordinates to the root ($(d + 1) \times N_V$ messages).
2. The root sends scalar requests for new coordinates to the $N_C$ complying nodes ($N_C$ messages).
3. The $N_C$ complying nodes transmit new coordinates to the root ($d \times N_C$ messages).
4. The root updates itself, computes the new distance $\delta$ to the surface, and sends $\delta$ to each node ($|N|$ messages).

This leads to a total of

$$(d + 1) N_V + N_C + d N_C + |N| = (d + 2)|N| \qquad (14)$$

messages per mean update.

We conclude the section with three remarks. The first one compares conservatism of Algorithm 4.1 and the one suggested in [8]. The second one again compares the ball cover suggested in [8] and application of Algorithm 4.1 with the $l_1$ norm. The last one shows by an example that Equation (8) fails when $B_2$ is substituted by $B_1$. Significance of this negative result becomes clear in Section 5.

is contained in the $l_2$ ball of radius $\delta$ centered at $v$ (see Figure 2). Hence the sufficient condition offered by Algorithm 4.1 is more conservative than the one suggested in [8].
the distance is given by the $l_1$ norm, and the aim is to monitor the inequality $f$ We first consider the "ball cover" construction suggested in [8]. With this data $v(t_0) = 0$ with (see Figure 3). Hence the algorithm suggested in [8] requires nodes to communicate at time $t_1$.
On the other hand the $l_1$ distance from $v(t_0)$ to the set $\{x : \|x - e\|_1 = 1\}$ is 1, and since the local constraints are not violated, Algorithm 4.1 requires no communication between nodes at time $t_1$. In this particular case the sufficient condition offered by Algorithm 4.1 is less conservative than the one suggested in [8].
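The overall flow of threshold monitoring with a common local constraint can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: it assumes that after each violation the root recomputes the mean and sets $\delta$ to the distance from the mean to the zero set $Z_f$, and it relies on a hypothetical user-supplied helper `dist_to_zero_set`, since computing that distance depends on $f$ and on the chosen norm.

```python
def monitor(streams, dist_to_zero_set, norm):
    """Sketch of threshold monitoring with a common local constraint delta.

    streams: list of per-node lists of d-dimensional vectors (equal lengths).
    dist_to_zero_set: hypothetical helper returning the distance from a mean
        vector to the zero set Z_f in the chosen norm.
    Returns the time indices at which the mean had to be recomputed.
    """
    n, horizon = len(streams), len(streams[0])
    d = len(streams[0][0])

    def mean_at(t):
        return [sum(s[t][k] for s in streams) / n for k in range(d)]

    anchors = [s[0] for s in streams]          # v_j(t_i): values at the last update
    delta = dist_to_zero_set(mean_at(0))       # distance from the mean to Z_f
    updates = []
    for t in range(1, horizon):
        violated = any(
            norm([a - b for a, b in zip(s[t], anchor)]) >= delta
            for s, anchor in zip(streams, anchors)
        )
        if violated:                            # nodes communicate with the root
            anchors = [s[t] for s in streams]
            delta = dist_to_zero_set(mean_at(t))
            updates.append(t)
    return updates

# Scalar illustration with f(x) = x^2 - 9, whose zero set is {-3, 3}.
streams = [
    [[2.0], [2.3], [2.6], [3.2], [3.3]],
    [[6.0], [5.9], [6.2], [6.1], [6.0]],
]
updates = monitor(
    streams,
    dist_to_zero_set=lambda m: min(abs(m[0] - 3.0), abs(m[0] + 3.0)),
    norm=lambda x: max(abs(t) for t in x),
)
print(updates)
```

In this toy run the initial mean is 4, so $\delta = 1$; the first node leaves its interval at the fourth sample, which triggers the only mean update.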
In the next section we apply Algorithm 4.1 to real-life data and report the number of required mean computations.

Experimental Results
We apply Algorithm 4.1 to data streams generated from the Reuters Corpus RCV1-V2. The data is available from [16] and consists of 781,265 tokenized documents with DID (document ID) ranging from 2651 to 810596.
The methodology described below attempts to follow that presented in [8].We simulate n streams by arranging the feature vectors in ascending order with respect to DID, and selecting feature vectors for the stream in the round robin fashion.
In the Reuters Corpus RCV1-V2 each document is labeled as belonging to one or more categories.We label a vector as "relevant" if it belongs to the "CORPORATE/INDUSTRIAL" ("CCAT") category, and "spam" otherwise.Following [9] we focus on three features: "bosnia", "ipo", and "febru".Each experiment was performed with 10 nodes, where each node holds a sliding window containing the last 6700 documents it received.
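The stream simulation described above (round-robin assignment of ordered documents to nodes, and a sliding window at each node) can be sketched in Python. The helper names and the tiny stand-in data are hypothetical; the experiments themselves use 10 nodes and windows of 6700 documents.

```python
from collections import deque

def round_robin_streams(items, n):
    """Split an ordered sequence into n streams in round-robin fashion."""
    return [items[j::n] for j in range(n)]

def sliding_windows(stream, w):
    """Yield the contents of a window holding the last w items of a stream."""
    window = deque(maxlen=w)
    for item in stream:
        window.append(item)
        if len(window) == w:
            yield list(window)

doc_ids = list(range(1, 13))           # stand-in for documents sorted by DID
streams = round_robin_streams(doc_ids, 3)
print(streams[0])                      # documents 1, 4, 7, 10
print(list(sliding_windows(streams[0], 2)))
```

Each window would then be summarized by its local contingency table for the monitored feature.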
First we use 67,000 documents to generate initial sliding windows. The remaining 714,265 documents are used to generate data streams, hence the selected feature information gain is computed 714,265 times. Based on all the documents contained in the sliding window at each one of the 714,266 time instances, we compute and graph 714,266 information gain values for the feature "bosnia" (see Figure 5).
For the experiments described below the threshold value $r$ is predefined, and the goal is to monitor the inequality $f(v) - r > 0$ while minimizing communication between the nodes. From now on we shall assume simultaneous arrival of a new text at each node. As new texts arrive, the local constraint (i.e., the inequalities $\|v_j - v_j(t_i)\| < \delta$, $j = 1, \ldots, n$) at each node is verified. If at least one node violates the local constraint, the average $v(t_i)$ is updated. Our numerical experiment with the feature "bosnia", the $l_2$ norm, and the threshold $r = 0.0025$ (reported in [8] as the threshold for feature "bosnia" incurring the highest communication cost) shows 4006 computations of the mean vector overall. An application of Equation (14) yields 240,360 messages. We repeat this experiment with the $l_\infty$ and $l_1$ norms. The results obtained and collected in Table 2 show that the smallest number of mean updates is required for the $l_1$ norm.

Throughout the iterations the mean $v(t_i)$ goes through a sequence of updates, and the values $f(v(t_i))$ may be larger than, equal to, or less than the threshold $r$. We monitor the case $f(v) \le r$ the same way as that of $f(v) > r$. In addition to the number of mean computations, we collect statistics concerning "crossings" (or lack thereof), i.e., the number of instances when the locations of the mean $v$ and its update $v'$ relative to the surface are either identical or different. Specifically, over the monitoring period we denote by:
1. "LL" the number of instances when $f(v) < r$ and $f(v') < r$,
2. "LG" the number of instances when $f(v) < r$ and $f(v') > r$,
3. "GL" the number of instances when $f(v) > r$ and $f(v') < r$,
4. "GG" the number of instances when $f(v) > r$ and $f(v') > r$.
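The message count and the crossing statistics can be reproduced with a few lines of Python. The sketch assumes $d = 4$ (the four entries of the contingency table), which is the value consistent with the 240,360 messages quoted for 4006 mean updates and 10 nodes; the function names and the short sequence of $f$ values are hypothetical.

```python
def messages_per_update(d, n_nodes):
    """Equation (14): (d + 2) * |N| messages per mean update."""
    return (d + 2) * n_nodes

def total_messages(d, n_nodes, mean_updates):
    return messages_per_update(d, n_nodes) * mean_updates

def count_crossings(values, r):
    """Count LL, LG, GL, GG transitions between consecutive values of f at the mean."""
    counts = {"LL": 0, "LG": 0, "GL": 0, "GG": 0}
    for before, after in zip(values, values[1:]):
        if before == r or after == r:
            continue                     # ties are not covered by the four labels
        key = ("G" if before > r else "L") + ("G" if after > r else "L")
        counts[key] += 1
    return counts

print(total_messages(d=4, n_nodes=10, mean_updates=4006))   # 240360
print(count_crossings([0.001, 0.003, 0.004, 0.002], r=0.0025))
```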
The number of "crossings" is reported in the last four columns of Table 2.

Note that variation of the vectors $v_j(t)$ does not have to be uniform. Taking into account the distribution of signals at each node may lead to additional communication savings. We illustrate this statement by a simple example involving just two nodes. If, for example, there is a reason to believe that the signal at the second node varies faster than the signal at the first node (Equation (15)), then the number of node violations may be reduced by imposing node dependent constraints so that the faster varying signal at the second node enjoys larger "freedom" of change, while the monitored inequality remains guaranteed. Assignment of "weighted" local constraints requires the information provided by Equation (15). With no additional assumptions about signal distribution, this information is not available. Unlike [11] we refrain from making assumptions regarding possible underlying data distributions; instead we estimate the weights as follows:
1. Start with the initial set of weights.
2. As texts arrive at the next time instance $t_{i+1}$, each node $j$ computes $W_j(t_{i+1})$.

If at time $t_i$ a local constraint is violated, then, in addition to the $(d + 2)|N|$ messages (see Equation (14)), each node $j$ broadcasts $W_j(t_i)$ to the root, and the root computes and transmits the updated weights $w_j$. Broadcasts of weights increase the total number of messages per iteration to $(d + 4)|N|$ (one additional message per node in each direction). With the inequalities in Step 6 of Algorithm 4.1 substituted by $\|v_j - v_j(t_i)\| < \delta_j = w_j \delta$, the number of mean computations is reported in Table 3.
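One possible way to realize node dependent constraints $\delta_j = w_j \delta$ is sketched below in Python. The update rule is an assumption made for illustration only and is not taken from the text: each node accumulates its variation $W_j \leftarrow W_j + \|v_j(t_{i+1}) - v_j(t_i)\|$, and the root sets $w_j = n W_j / \sum_k W_k$, so that the weights average to 1. Under this normalization the triangle inequality still bounds the deviation of the mean by $\frac{1}{n} \sum_j \delta_j = \delta$.

```python
def update_totals(totals, previous, current, norm):
    """Accumulate per-node variation W_j += ||v_j(t_{i+1}) - v_j(t_i)|| (assumed rule)."""
    return [
        W + norm([c - p for c, p in zip(cur, prev)])
        for W, prev, cur in zip(totals, previous, current)
    ]

def weights_from_totals(totals):
    """Weights proportional to accumulated variation, normalised to average 1."""
    n, s = len(totals), sum(totals)
    if s == 0.0:
        return [1.0] * n                 # no history yet: equal constraints
    return [n * W / s for W in totals]

def local_constraints(totals, delta):
    """Node dependent radii delta_j = w_j * delta; their average equals delta."""
    return [w * delta for w in weights_from_totals(totals)]

l1 = lambda x: sum(abs(t) for t in x)
totals = [0.0, 0.0]
totals = update_totals(totals, [[0.0], [0.0]], [[0.1], [0.3]], l1)
print(local_constraints(totals, delta=1.0))
```

In this toy example the second node has varied three times as much as the first, so it receives a radius three times larger while the average radius stays equal to $\delta$.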
It is of interest to compare the results presented in Table 3 with those reported, for example, in [9]. The comparison, however, is not an easy task. While [9] reports the threshold $r = 0.0025$ as the threshold value that incurred the highest communication cost, the paper leaves the concept of "communication cost" undefined (we define transmission of a double precision real number as a single "message"). In addition [9] provides a graph of "Messages vs. Threshold" only. It appears that the maximal value of the "bosnia Messages vs. Threshold" graph is somewhere between 100,000 and 200,000. We repeat the experiments with "ipo" and "febru" and report the results in Tables 4 and 5 respectively. The results obtained with stream dependent local constraints are a significant improvement over those presented in [10]. Consistent with the results in [10], the $l_1$ norm comes up as the norm that requires the smallest number of mean updates in all reported experiments.

In what follows we briefly outline a number of immediate research directions we plan to pursue.
The local constraints introduced in this paper depend on the history of the data stream at each node, and the variations $\|v_j(t_{i+1}) - v_j(t_i)\|$ over time contribute uniformly to the local constraints. Attaching more weight to recent changes than to older ones may contribute to further improvement of the monitoring process.
Table 6 (borrowed from [10]) shows that in about 75% of instances (3034 out of 4006) the mean $v(t)$ is updated because of a single node violation. This observation naturally leads to the idea of clustering nodes, and independent monitoring of the node clusters, each equipped with a coordinator. The monitoring will become a two step procedure. At the first step node violations are checked in each node separately. If a node violates its local constraint, the corresponding cluster computes an updated cluster coordinator. At the second step, violations of local constraints by coordinators are checked, and if at least one violation is detected the root is updated. Table 6 indicates that in most of the instances only one coordinator will be affected, and, since communication within a cluster requires fewer messages, the two step procedure briefly described above has the potential to bring additional savings.

We note that a standard clustering problem is often described as "...finding and describing cohesive or homogeneous chunks in data, the clusters" (see e.g., [17]). The problem of monitoring data streams requires assigning nodes $N_i$ to the same cluster $i$ so that the total change within the cluster is minimized, i.e., nodes with different variations $v - v(t_j)$ that cancel out each other as much as possible should be assigned to the same cluster. Hence, unlike classical clustering procedures, one needs to combine "dissimilar" nodes together. This is a challenging new type of a difficult clustering problem.
Realistically, verification of inequality f (x) − r > 0 should be conducted with an error margin (i.e., the inequality f (x) − r − ϵ > 0 should be investigated, see [9]).A possible effect of an error margin on the required communication load is another direction of future research.

Conclusions
Monitoring streams over distributed systems is an important and challenging problem with a wide range of applications. In this paper we build on the approach for monitoring arbitrary threshold functions suggested in [10], and introduce stream dependent local constraints that serve as a feedback monitoring mechanism. The preliminary results obtained indicate substantial improvement over those reported in [10], and demonstrate that monitoring with the $l_1$ norm requires fewer updates than monitoring with the $l_\infty$ or $l_2$ norm.

3. The $N_C$ complying nodes transmit new coordinates to the root ($d \times N_C$ messages).
4. The root updates itself, computes the new distance $\delta$ to the surface, and sends $\delta$ to each node ($|N|$ messages).

Figure 5. Information gain values for the feature "bosnia".

Table 2. Number of mean computations, messages, and crossings per norm for feature "bosnia" with threshold $r = 0.0025$.

Table 3. Number of mean computations, messages, and crossings per norm for feature "bosnia" with threshold $r = 0.0025$, and stream dependent local constraint $\delta_j$.

Table 4. Number of mean computations, messages, and crossings per norm for feature "febru" with threshold $r = 0.0025$, and stream dependent local constraint $\delta_j$.

Table 5. Number of mean computations, messages, and crossings per norm for feature "ipo" with threshold $r = 0.0025$, and stream dependent local constraint $\delta_j$.

Table 6. Number of nodes simultaneously violating local constraints for feature "bosnia" with threshold $r = 0.0025$ and the $l_2$ norm.