# Monitoring Threshold Functions over Distributed Data Streams with Node Dependent Constraints


## Abstract


## 1. Introduction

$l_2$ norm of the data average of large networks of computers, wireless sensors, or mobile devices was introduced in [6], and further developed in [7]. The current contribution is motivated by results recently reported in [8,9] with a focus on a special case of the general model considered in [7]. This special case can be briefly described as follows:

Let $v_1(t), \ldots, v_n(t)$ be $d$-dimensional, real, time-varying vectors derived from the streams. For a function $f$ we would like to confirm the inequality

$l_2$, or any other norm that is required to address it.

$l_2$ norm (see, e.g., [6,7,8,9,11]) the paper provides a theoretical framework for using a wide variety of convex functions and, as an illustration, runs numerical experiments using the $l_2$, $l_1$ and $l_\infty$ norms. In all numerical experiments reported in [10], the same algorithm with the $l_1$ norm generates superior results. This paper extends the results in [10] in a machine learning direction: the constraint imposed on each node depends on the stream history at the node.

$v_1(t)$ and $v_2(t)$, and the identity function $f$ (i.e., $f(x) = x$). We would like to guarantee the inequality

$t_1$ when one of the functions, say $v_1(t)$, crosses the boundary of the local constraint, i.e., the nodes communicate, the mean $v(t_1)$ is computed, the local constraint δ is updated and made available to the nodes, and nodes are kept silent as long as the inequalities hold.

- 1. This approach works for a non-linear monitoring function f.
- 2. The results depend on the choice of a norm, and the numerical results reported show that $l_2$ is probably not the best norm when one aims to minimize communication between nodes. In addition to the numerical results presented we also provide a simple illustrative example that highlights this point (see Remark 4.2).
- 3. Selection of node dependent local constraints may decrease communication between the nodes.

## 2. Text Mining Application

Let **T** be a finite text collection (for example a collection of mail or news items). We denote the size of the set **T** by |**T**|. We will be concerned with two subsets of **T**:

- 1. **R**, the set of "relevant" texts (texts not labeled as spam),
- 2. **F**, the set of texts that contain a "feature" (a word or term, for example).
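The sizes of these sets, and of their intersection, are plain counts over the collection. The sketch below is only an illustration: the `Text` record, its `relevant` flag, the whitespace tokenization, and the sample texts are assumptions introduced here rather than part of the original setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """Hypothetical record for one item of the collection T."""
    body: str
    relevant: bool  # True when the text is not labeled as spam


def subset_counts(collection, feature):
    """Return |T|, |R|, |F| and |R ∩ F| for one feature (a word or term)."""
    has_feature = [feature in text.body.lower().split() for text in collection]
    return {
        "T": len(collection),
        "R": sum(text.relevant for text in collection),
        "F": sum(has_feature),
        "R_and_F": sum(t.relevant and f for t, f in zip(collection, has_feature)),
    }


# Made-up sample collection, used only to exercise the function.
texts = [
    Text("peace talks in bosnia resume", True),
    Text("win a free prize now", False),
    Text("bosnia aid convoy delayed", True),
    Text("cheap loans bosnia offer", False),
]
counts = subset_counts(texts, "bosnia")
print(counts)
```

When the collection is split across nodes, each node can compute such counts on its own share of the texts.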

## 3. Non-Linear Threshold Function: An Example

**Example 3.1** Let , and , are scalar values stored at two distinct nodes. Note that if , and , then

- 1. Convexity property. The mean v(t) is given by , i.e., the mean v(t) is in the convex hull of , and is available to node j without much communication with other nodes.
- 2. If is an $l_2$ ball of radius centered at , then

$l_2$ norm, this sufficient condition is more conservative than the one provided by "ball monitoring" Equation (9) suggested in [8]. On the other hand, since only a scalar δ has to be communicated to each node, the value of the updated mean need not be transmitted (hence communication savings are possible), and there is no need to compute the distance from the center of each ball , , to the zero set . For a detailed comparison of results we refer the reader to [10].
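The convexity property can be checked numerically. The sketch below assumes, as an illustration rather than a restatement of the paper's condition, that every node keeps its drift from a remembered vector below the same bound δ in a chosen norm; the triangle inequality then keeps the drift of the mean below δ as well. The helper names, the random data, and the values of `n`, `d` and `delta` are hypothetical.

```python
import random


def norm(x, p):
    """l_p norm of a vector for p in {1, 2, float('inf')}."""
    if p == float("inf"):
        return max(abs(c) for c in x)
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sub(x, y):
    """Coordinate-wise difference x - y."""
    return [a - b for a, b in zip(x, y)]


random.seed(0)
n, d, delta = 10, 4, 0.5  # hypothetical sizes and bound
start = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

results = {}
for p in (1, 2, float("inf")):
    now = []
    for v in start:
        step = [random.gauss(0, 1) for _ in range(d)]
        scale = delta * random.random() / norm(step, p)  # drift stays below delta
        now.append([a + scale * s for a, s in zip(v, step)])
    worst_node = max(norm(sub(a, b), p) for a, b in zip(now, start))
    mean_drift = norm(sub(mean(now), mean(start)), p)
    results[p] = (worst_node, mean_drift)
    print(p, round(worst_node, 3), round(mean_drift, 3))
```

For each norm the drift of the mean never exceeds the largest drift among the nodes, which is why a node can test its own constraint without knowing the other nodes' vectors.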

## 4. Convex Minimization Problem

**Problem 4.1** For a function concave with respect to the first d variables and convex with respect to the last nd variables , solve

**Problem 4.2** For identify

$l_p$ norms (see Table 1 for the list of the functions). We first select , and show below that in this case

F(x) | r(z) |
---|---|
$\lVert z - Bw \rVert_1$ | |
$\lVert z - Bw \rVert_2$ | |
$\lVert z - Bw \rVert_\infty$ | |

**Algorithm 4.1** Threshold monitoring algorithm.

- 1. Set .
- 2. Until end of stream.
- 3. Set , (i.e., remember "initial" values for the vectors).
- 4. Set (for the definition of w see Equation (12)).
- 5. Set .
- 6. If for each go to step 5, else go to step 3.
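The control flow of the algorithm can be sketched as a loop that recomputes the mean only when some node violates its local constraint. The sketch rests on assumptions that go beyond the steps as stated: the monitoring function is a hypothetical affine function f(x) = a·x − b, so that the $l_p$ distance from the mean to the zero set has the closed form |a·v − b| / ‖a‖_q with 1/p + 1/q = 1; the local constraint is taken to be ‖v_j(t) − v_j(t_0)‖_p < δ; and the streams are synthetic random walks. All names and parameter values are illustrative.

```python
import random

DUAL = {1: float("inf"), 2: 2, float("inf"): 1}  # conjugate exponents, 1/p + 1/q = 1


def norm(x, p):
    """l_p norm of a vector for p in {1, 2, float('inf')}."""
    if p == float("inf"):
        return max(abs(c) for c in x)
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def f(x, a, b):
    """Hypothetical affine monitoring function f(x) = a.x - b."""
    return sum(ai * xi for ai, xi in zip(a, x)) - b


def mean(vectors):
    """Coordinate-wise mean of the node vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def monitor(stream, a, b, p):
    """Return (mean computations, missed sign changes) for a stream.

    Assumed local constraint: ||v_j(t) - v_j(t0)||_p < delta, where delta is
    the l_p distance from the last computed mean to the zero set of f.
    """
    mean_computations = 0
    missed = 0  # silent steps where the sign of f(mean) changed; expected 0
    start = reference = delta = None
    for vectors in stream:
        violated = start is None or any(
            norm([x - y for x, y in zip(v, s)], p) >= delta
            for v, s in zip(vectors, start)
        )
        if violated:
            start = [list(v) for v in vectors]  # remember "initial" vectors
            reference = f(mean(start), a, b)
            delta = abs(reference) / norm(a, DUAL[p])
            mean_computations += 1
        elif f(mean(vectors), a, b) * reference <= 0:
            missed += 1
    return mean_computations, missed


def random_walk(n, d, steps, seed):
    """Synthetic stream: n nodes, d coordinates, small Gaussian steps."""
    rng = random.Random(seed)
    vectors = [[1.0] * d for _ in range(n)]
    for _ in range(steps):
        vectors = [[x + rng.gauss(0, 0.05) for x in v] for v in vectors]
        yield vectors


a, b = [1.0, 1.0, 1.0], 1.0  # hypothetical f(x) = x1 + x2 + x3 - 1
report = {p: monitor(random_walk(5, 3, 2000, seed=1), a, b, p)
          for p in (1, 2, float("inf"))}
for p, (updates, missed) in report.items():
    print(p, updates, missed)
```

The `missed` counter checks the guarantee the local constraints are meant to give: while every node stays silent, the sign of f at the true mean does not change. The update counts from this toy stream say nothing about which norm is preferable on real data.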

- 1. The violating nodes transmit their scalar ID and new coordinates to the root ( messages).
- 2. The root sends scalar requests for new coordinates to the complying nodes ( messages).
- 3. The complying nodes transmit their new coordinates to the root ( messages).
- 4. The root updates itself, computes the new distance δ to the surface, and sends δ to each node ( messages).
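The four steps can be tallied as follows. This is a sketch under an assumption that is not taken from Equation (14): every transmitted scalar counts as one message. The function name and the example values of `n`, `d` and `k` are hypothetical.

```python
def messages_per_mean_update(n, d, k):
    """Messages for one mean update with n nodes, d-dimensional vectors
    and k violating nodes, counting one message per transmitted scalar
    (an assumption made for this sketch)."""
    violators_to_root = k * (1 + d)   # step 1: ID plus d coordinates
    root_requests = n - k             # step 2: one scalar request each
    compliers_to_root = (n - k) * d   # step 3: d coordinates each
    root_broadcast = n                # step 4: delta to every node
    return violators_to_root + root_requests + compliers_to_root + root_broadcast


# Under this counting the total is n * (d + 2), whatever the value of k.
totals = {k: messages_per_mean_update(n=10, d=4, k=k) for k in range(1, 11)}
print(totals)
```

With the hypothetical values n = 10 and d = 4 the tally is 60 messages per update, which equals the ratio 240,360 / 4006 reported in Section 5. Whether Equation (14) is exactly this expression is not confirmed here.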

$l_1$ norm. The last one shows by an example that Equation (8) fails when is substituted by . The significance of this negative result becomes clear in Section 5.

**Remark 4.1** Let , and . If the Step 6 inequality holds for each node, then each point of the ball centered at with radius is contained in the $l_2$ ball of radius δ centered at v (see Figure 2). Hence the sufficient condition offered by Algorithm 4.1 is **more** conservative than the one suggested in [8].

$l_2$ might not be the best one when communication between the nodes should be minimized.

**Remark 4.2** Let ,

$l_1$ norm, and the aim is to monitor the inequality . Let

$l_2$ ball of radius centered at intersects the $l_1$ ball of radius 1 centered at (see Figure 3). Hence the algorithm suggested in [8] requires nodes to communicate at time $t_1$.

$l_1$ distance from to the set is 1, and since

_{1}. In this particular case the sufficient condition offered by Algorithm 4.1 is **less** conservative than the one suggested in [8].

**Remark 4.3** It is easy to see that inclusion Equation (8) fails when is an $l_1$ ball of radius centered at . Indeed, when, for example,

## 5. Experimental Results

$l_2$ norm, and the threshold (reported in [8] as the threshold for feature "bosnia" incurring the highest communication cost) shows overall 4006 computations of the mean vector. An application of Equation (14) yields 240,360 messages. We repeat this experiment with the $l_\infty$ and $l_1$ norms. The results collected in Table 2 show that the smallest number of mean updates is required for the $l_1$ norm.

**Table 2.** Number of mean computations, messages, and crossings per norm for feature "bosnia" with threshold .

Distance | Mean Comps | Messages | LL | LG | GL | GG |
---|---|---|---|---|---|---|
$l_2$ | 4006 | 240,360 | 959 | 2 | 2 | 3043 |
$l_\infty$ | 3801 | 228,060 | 913 | 2 | 2 | 2884 |
$l_1$ | 3053 | 183,180 | 805 | 2 | 2 | 2244 |

- 1. "LL": the number of instances when and ,
- 2. "LG": the number of instances when and ,
- 3. "GL": the number of instances when and ,
- 4. "GG": the number of instances when and .
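The figures in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the numbers from the table; the dictionary layout and variable names are illustrative.

```python
# norm -> (mean computations, messages, LL, LG, GL, GG), copied from Table 2
table2 = {
    "l2":   (4006, 240_360, 959, 2, 2, 3043),
    "linf": (3801, 228_060, 913, 2, 2, 2884),
    "l1":   (3053, 183_180, 805, 2, 2, 2244),
}

for name, (means, messages, ll, lg, gl, gg) in table2.items():
    per_update = messages / means          # messages per mean computation
    crossings_total = ll + lg + gl + gg    # the four crossing columns combined
    saving = 1 - means / table2["l2"][0]   # fewer updates relative to l2
    print(f"{name}: {per_update:.0f} msgs/update, "
          f"{crossings_total} crossings, {saving:.1%} fewer updates than l2")
```

Every row reports exactly 60 messages per mean computation, and the four crossing columns add up to the number of mean computations. Relative to the $l_2$ norm, the $l_\infty$ norm needs about 5.1% fewer mean updates and the $l_1$ norm about 23.8% fewer.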

- 1. Start with the initial set of weights
- 2. As texts arrive at the next time instance each node computes . If at time a local constraint is violated, then, in addition to messages (see Equation (14)), each node j broadcasts to the root, the root computes , and transmits the updated weights back to node j.

**Table 3.** Number of mean computations, messages, and crossings per norm for feature "bosnia" with threshold , and stream dependent local constraint .

Distance | Mean Comps | Messages | LL | LG | GL | GG |
---|---|---|---|---|---|---|
$l_2$ | 2388 | 191,040 | 726 | 2 | 2 | 1658 |
$l_\infty$ | 2217 | 177,360 | 658 | 2 | 2 | 1555 |
$l_1$ | 1846 | 147,680 | 611 | 2 | 2 | 1231 |
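Comparing Table 3 with Table 2 quantifies the effect of the stream dependent constraints for the feature "bosnia". The sketch below re-uses only the reported numbers; the variable names are illustrative.

```python
# norm -> (mean computations, messages) for feature "bosnia"
fixed = {"l2": (4006, 240_360), "linf": (3801, 228_060), "l1": (3053, 183_180)}     # Table 2
adaptive = {"l2": (2388, 191_040), "linf": (2217, 177_360), "l1": (1846, 147_680)}  # Table 3

reduction = {}
for name in fixed:
    means_before, msgs_before = fixed[name]
    means_after, msgs_after = adaptive[name]
    reduction[name] = 1 - msgs_after / msgs_before
    print(f"{name}: {msgs_after // means_after} msgs/update (was "
          f"{msgs_before // means_before}), {reduction[name]:.1%} fewer messages")
```

Each mean update costs 80 messages instead of 60 once the weights are exchanged, yet the total number of messages still falls by roughly a fifth for every norm because far fewer updates are triggered.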

$l_1$ norm comes up as the norm that requires the smallest number of mean updates in all reported experiments.

**Table 4.** Number of mean computations, messages, and crossings per norm for feature "febru" with threshold , and stream dependent local constraint .

Distance | Mean Comps | Messages |
---|---|---|
$l_2$ | 1491 | 119,280 |
$l_\infty$ | 1388 | 111,040 |
$l_1$ | 1304 | 104,320 |

**Table 5.** Number of mean computations, messages, and crossings per norm for feature "ipo" with threshold , and stream dependent local constraint .

Distance | Mean Comps | Messages |
---|---|---|
$l_2$ | 7656 | 612,480 |
$l_\infty$ | 7377 | 590,160 |
$l_1$ | 6309 | 504,720 |

## 6. Future Research Directions

**Table 6.** Number of nodes simultaneously violating local constraints for feature "bosnia" with threshold , and $l_2$ norm.

nodes | violations |
---|---|
1 | 3034 |
2 | 620 |
3 | 162 |
4 | 70 |
5 | 38 |
6 | 26 |
7 | 34 |
8 | 17 |
9 | 5 |
10 | 0 |
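A quick tally of Table 6 shows how often a violation is caused by a single node. The sketch below uses only the counts from the table; the variable names are illustrative.

```python
# number of simultaneously violating nodes -> number of occurrences (Table 6)
violations = {1: 3034, 2: 620, 3: 162, 4: 70, 5: 38, 6: 26, 7: 34, 8: 17, 9: 5, 10: 0}

total = sum(violations.values())
single_share = violations[1] / total
at_most_two = (violations[1] + violations[2]) / total
print(total, f"{single_share:.1%}", f"{at_most_two:.1%}")
```

The total of 4006 matches the number of mean computations reported for the $l_2$ norm in Table 2. About 75.7% of the updates are triggered by a single node and about 91.2% by at most two nodes.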

**different** variations that cancel each other out as much as possible should be assigned to the same cluster. Hence, unlike classical clustering procedures, one needs to combine "dissimilar" nodes together. This is a challenging new type of a difficult clustering problem.

## 7. Conclusions

$l_1$ norm requires fewer updates than that with the $l_\infty$ or $l_2$ norm.

## Acknowledgments

## References

- Madden, S.; Franklin, M.J. An Architecture for Queries Over Streaming Sensor Data. In Proceedings of the ICDE 02, San Jose, CA, 26 February–1 March 2002; pp. 555–556.
- Dilman, M.; Raz, D. Efficient Reactive Monitoring. In Proceedings of the Twentieth Annual Joint Conference of the IEEE Computer and Communication Societies, Anchorage, Alaska, 2001; pp. 1012–1019.
- Zhu, Y.; Shasha, D. Statestream: Statistical Monitoring of Thousands of Data Streams in Real Time. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, 2002; pp. 358–369.
- Yi, B.-K.; Sidiropoulos, N.; Johnson, T.; Jagadish, H.V.; Faloutsos, C.; Biliris, A. Online Datamining for Co–Evolving Time Sequences. In Proceedings of ICDE 00, IEEE Computer Society, San Diego, CA, 2000; pp. 13–22.
- Manjhi, A.; Shkapenyuk, V.; Dhamdhere, K.; Olston, C. Finding (Recently) Frequent Items in Distributed Data Streams. In Proceedings of the 21st International Conference on Data Engineering (ICDE 05), Tokyo, Japan, 2005; pp. 767–778.
- Wolff, R.; Bhaduri, K.; Kargupta, H. Local L2-Thresholding Based Data Mining in Peer-to-Peer Systems. In Proceedings of the SIAM International Conference on Data Mining (SDM 06), Bethesda, MD, USA, 2006; pp. 430–441.
- Wolff, R.; Bhaduri, K.; Kargupta, H. A generic local algorithm with applications for data mining in large distributed systems. IEEE Trans. Knowl. Data Eng. **2009**, 21, 465–478.
- Sharfman, I.; Schuster, A.; Keren, D. A geometric approach to monitoring threshold functions over distributed data streams. ACM Trans. Database Syst. **2007**, 23, 23–29.
- Sharfman, I.; Schuster, A.; Keren, D. A Geometric Approach to Monitoring Threshold Functions over Distributed Data Streams. In Ubiquitous Knowledge Discovery; May, M., Saitta, L., Eds.; Springer–Verlag: New York, NY, USA, 2010; pp. 163–186.
- Kogan, J. Feature Selection over Distributed Data Streams through Convex Optimization. In Proceedings of the Twelfth SIAM International Conference on Data Mining (SDM 2012), Anaheim, CA, USA, 2012; pp. 475–484.
- Keren, D.; Sharfman, I.; Schuster, A.; Livne, A. Shape sensitive geometric monitoring. IEEE Trans. Knowl. Data Eng. **2012**, 24, 1520–1535.
- Gray, R.M. Entropy and Information Theory; Springer–Verlag: New York, NY, USA, 1990; pp. 119–162.
- Hinrichsen, D.; Pritchard, A.J. Real and Complex Stability Radii: A Survey. In Control of Uncertain Systems; Hinrichsen, D., Pritchard, A.J., Eds.; Birkhauser: Boston, MA, USA, 1990; pp. 119–162.
- Rudin, W. Principles of Mathematical Analysis; McGraw-Hill: New York, NY, USA, 1976.
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970.
- Bottou, L. Home Page. Available online: leon.bottou.org/projects/sgd (accessed on 14 September 2012).
- Mirkin, B. Clustering for Data Mining: A Data Recovery Approach; Chapman & Hall/CRC: Boca Raton, FL, USA, 2005.

© 2012 by the authors; licensee MDPI, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Malinovsky, Y.; Kogan, J.
Monitoring Threshold Functions over Distributed Data Streams with Node Dependent Constraints. *Algorithms* **2012**, *5*, 379-397.
https://doi.org/10.3390/a5030379
