^{★}

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

We propose novel algorithms for the timing correlation of streaming sensor data. The sensor data are assumed to have interval timestamps so that they can represent temporal uncertainties. The proposed algorithms can support efficient timing correlation for various timing predicates such as deadline, delay, and within. In addition to the classical techniques, lazy evaluation and result cache are utilized to improve the algorithm performance. The proposed algorithms are implemented and compared under various workloads.

Wireless sensor networks are composed of sensors, embedded computers, and communication devices. They can harvest various interesting information such as light, motion, proximity, temperature, and chemical conditions. There are many emerging applications utilizing the information from sensors. The applications range from simple monitoring systems to sophisticated systems making critical decisions based on the automated analysis of the sensor data.

In this paper, we propose novel algorithms for the timing correlation of streaming sensor data. The sensor data are assumed to have interval timestamps so that they can represent temporal uncertainties. The proposed algorithms can support efficient timing correlations for various timing predicates such as deadlines and delays. The timing correlation enables the users to extract pairs of streaming data, of which sources are different, satisfying specific timing conditions.

In some cases, the timestamp of the data from a sensor cannot be modeled as a scalar value. There can be various reasons such as inactivity of a sensor due to its battery limitation, granularity difference between heterogeneous sensors, and inaccurate timing behavior of a sensor. In addition, the possibility of temporal uncertainty is high because the unexpected hardships can easily happen due to the harsh environments where sensors operate.

In order to capture the timing uncertainty of the timestamp, adopting a time interval as the timestamp is a common approach [

In our previous work, we designed an efficient algorithm for the timing correlation by analyzing the upper-bounds and lower-bounds of the satisfaction probability on time intervals.

We further extend the algorithm by adopting the approaches of the lazy evaluation and the result look-up in this study. The extended algorithms show better performance by exploiting unveiled properties of timing correlations presented in this paper. We implement various timing correlation algorithms and compare them under various workloads.

Main contributions made in this paper are shown in the following:

Extending the algorithm by adopting a lazy evaluation approach: We extend the previously designed algorithm by adopting a lazy approach and correlating the sensor data by the blocks.

Extending the algorithm by adding the look-up technique: In order to avoid expensive calculation for satisfaction probability in probe regions, we add an idea of look-up technique based on new observations of the timing correlation.

There have been many studies on the sensor data processing in recent years. One of the most active research areas related to the sensor data processing is stream data management systems (SDMSs). Babcock

Since we are interested in the timing correlation problem, we shall restrict our discussion to the problem of the correlation of stream data.

In [

Hammad

There are studies based on the cost analysis of the sliding window joins. Kang

In order to handle the streaming data arriving in out-of-order, Srivastava and Widom [

Recently Wu

None of the above work address the cases of the interval timestamp. As stated in Introduction, the timestamp of a streaming data from sensors may have temporal uncertainties. In order to handle this inherent uncertainties in timestamps, we adopt interval timestamps and assume that the probability distribution in a given interval is uniform.

Dyreson and Snodgrass [

Our earlier work [

In this section, we present the problem of the interval timing correlation and review the main findings discussed in the previous studies. Sensors measure and transmit data to harvesting facilities. The harvesting facilities can be another typical sensors or specialized devices. Finally, the data are sent to a system (or systems for distributed computing) taking the role of the data analysis.

In general the sensor data processors filter out unnecessary data and forward a subset of the data which may be useful to the next data processors for further analysis. One of the typical operators used during this phase is the timing correlation. The timing correlation operators allow us to collect pairs of data which are satisfying a predefined timing condition. For example, a user may want to extract pairs of data such that the time differences of the pairs are within 5 seconds.

Specifically, we are interested in the timing correlation operator which can handle the interval timestamps. In order to specify a timing condition over interval timestamps, we take a probabilistic approach. The users of the system present an interval timing correlation by defining a timing predicate on interval timestamps. The timing predicate can be a form of deadline or delay.

A deadline constraint requires that a corresponding event should happen provided that a triggering event happen before the timer accompanied with the constraint expires. Assume that there exists a deadline constraint with a specific time _{1} is a triggering event and _{2} is the corresponding event. If there is another deadline constraint with the same time _{2} is a triggering event and _{1} is the corresponding event. In that case, we state that a mutual deadline constraint is defined on the events _{1} and _{2}. The mutual deadline is the most popular timing predicate in interval timing correlations. Hence, most of our examples will be the mutual deadlines.

A delay constraint requires that the corresponding event should not happen provided that a triggering event happen before the specified time passes.

An interval timing correlation requires a confidence threshold, which determines the minimum satisfaction probability of the timing predicate. Typically the interval timing correlation operator produces streams of paired data which satisfy the given interval timing correlation condition.

In this paper, we adopt the event model proposed in [

In addition, the following notations are used in the remaining of this paper. For a timestamp

In the remainder of this paper, we shall use the symbol “@” to indicate the timestamp of a tuple. If the symbol is used in front of a stream name, then it means the timestamp of any tuple sent from the stream. For example, @_{1} – @_{2}| ≤ _{1}, _{2}) satisfying the mutual deadline _{1} and _{2} are tuples from stream _{1} and _{2} respectively.

We derived formula for calculating the satisfaction probabilities of the deadline and delay predicates in our previous work [

_{1} + _{2}

In this paper, we assume that any deadlines specified in timing predicates are larger than

_{1} − _{2}| ≤ _{1} + _{2}) _{1}) ≤ _{2})

The computation of the satisfaction probability of a deadline constraint can be simplified by categorizing the problems into six different cases based on the relations of _{1} + _{2}. Interested readers are invited to our previous study in [

Throughout this section, we assume that there is an interval timing correlation such that it has a mutual deadline for two events such as |@_{1} − @_{2}| ≤ _{1}, _{2}) satisfying the mutual deadline

_{1} = (_{1}, _{1}) has arrived from stream _{1}. The graph presents the upper-bounds (solid lines) and the lower-bounds (dotted lines) of satisfaction probabilities for each possible _{2}) where _{2} is the timestamp of a tuple in the targets stream (_{2} in this specific example).

A timing correlation process starts upon receiving a tuple from a stream. The tuple and the stream is referred to as

By using the information shown in the

Any target tuple with the timestamp _{L}_{L}

Any target tuple with the timestamp _{H}

Any target tuple with the timestamp _{H}

Any target tuple with the timestamp _{H}_{L}

Any target tuple with the timestamp _{L}_{H}

_{L}

_{L}

_{H}

_{H}

From the figure, the above observations are intuitively derived. For example, any target tuple with the max(timestamp) ∈ [_{L}_{L}

In this section, we review the algorithms for the interval timing correlation proposed in [

The

SimpleTimingCorrelation(_{new}

1: | |

2: | _{new} |

3: | Add (_{new} |

4: | |

5: | Mark |

6: | |

7: | Remove the marked obsolete tuples in the target buffer. |

8: | Insert _{new} |

The Simple-Sort (SSort in short) timing correlation slightly modifies the simple timing correlation such that it keeps the tuples in order with respect to the max timestamps. Hence, the algorithm expects longer blocks of obsolete tuples consecutively located than those in the simple timing correlation.

The _{H}_{L}_{L}_{H}_{L}_{L}_{H}_{L}_{L}_{H}_{inv}_{base}_{base}_{base}_{H}_{inv}_{H}_{inv}

EagerTimingCorrelation(_{new}

1: | Compute _{H}_{L}_{L}_{H}_{new} |

2: | _{L}_{L} |

3: | Add (_{new} |

4: | |

5: | _{H}_{L}_{L}_{H} |

6: | Probe(_{new} |

7: | |

8: | Invalidate obsolete tuples in the target buffer by _{H}_{inv} |

9: | Insert _{new} |

Probe(_{new}

1: | _{new} |

2: | Add (_{new} |

3: |

The

When to re-evaluate can be determined either by the number of unprocessed tuples or by a time frequency (or both). For example, a system can be designed to re-evaluate whenever there are more than 500 unprocessed tuples or every one second.

It is noted in [

However, the benefit of the lazy algorithm comes at the expense of the longer response time; until the re-evaluation condition is met, the already arrived but un-evaluated tuples should wait in the buffers. Therefore, the re-evaluation condition must be designed carefully not to violate the system performance requirements. The algorithm for the lazy timing correlation is presented in

LazyTimingCorrelation(_{new}

1: | Insert _{new} |

2: | |

3: | call BlockTimingCorrelation(BaseStream) |

4: |

BlockTimingCorrelation(BaseStream)

1: | Sort the target stream buffer |

2: | _{l} |

3: | _{r} |

4: | _{new}_{r}_{l} |

5: | Compute _{H}_{L}_{L}_{H}_{new} |

6: | _{L}_{L} |

7: | AddResult(_{new} |

8: | |

9: | _{H}_{L}_{L}_{H} |

10: | Probe(_{new} |

11: | |

12: | |

13: | Sort the base stream buffer |

14: | Invalidate obsolete tuples in the base buffer by _{H}_{inv} |

15: | Invalidate obsolete tuples in the target buffer by _{H}_{inv} |

Now we extend the lazy timing correlation to use look-up tables in order to perform the probing process more efficiently. The following corollary presents properties used in the algorithm.

_{1}_{i}, and I_{j}_{1}) ≤ _{i}_{j}_{1}) ≤ _{i}_{j}_{1} + _{j}_{1} + _{i}_{1} + _{i}_{1} + _{j}

_{2} + _{11}, _{2} + _{12}, _{2} + _{14}, _{2} + _{12}, _{2} + _{11},

The main idea of the extended algorithm is to reuse the satisfaction probabilities calculated in the probe regions. As illustrated in the previous example, while performing an interval timing correlation for two blocks of tuples, there can be cases where we can reuse the previous calculation results and avoid expensive probability computations. By comparing

LazyWithLookup-newblock(BaseStream)

1: | Sort the target stream buffer |

2: | _{l} |

3: | _{r} |

4: | _{new}_{r}_{l} |

5: | Compute _{L}_{L}_{new} |

6: | _{L}_{L} |

7: | Add (_{new} |

8: | |

9: | |

10: | _{H}_{l}_{l} |

11: | _{L}_{r}_{r} |

12: | Initialize look-up table ( |

13: | _{new}_{r}_{l} |

14: | Compute _{H}_{L}_{new} |

15: | _{H}_{L} |

16: | EfficientProbe(_{new} |

17: | |

18: | |

19: | Initialize look-up table ( |

20: | _{new}_{l}_{r} |

21: | Compute _{L}_{H}_{new} |

22: | _{L}_{H} |

23: | EfficientProbe(_{new} |

24: | |

25: | |

26: | Invalidate obsolete tuples in the base buffer by _{H}_{inv} |

27: | Invalidate obsolete tuples in the target buffer by _{H}_{inv} |

The algorithm traverses the base tuples in the unprocessed block in reverse chronological order. Once we computed prob(|@_{b}_{t}_{b}_{t}_{b}_{b}_{t}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{b}_{t}_{b}, e_{t}_{b}_{t}_{b}_{t}_{b}_{b}

Recall that the primary purpose of using the look-up table is to avoid the “relatively” expensive operation—the satisfaction probability calculation incurring floating-point operations. To minimize the overhead for accessing to the look-up tables, we used array data structure to implement the look-up table. Hence every access to the look-up table was done via the index to an element in the array.

EfficientProbe(_{new}

1: | |

2: | |

3: | |

4: | Probe(_{new} |

5: | SetLookup(prob(|@_{new} |

6: | |

7: | |

8: | Add (_{new} |

9: | |

10: | Probe(_{new} |

11: | |

12: | |

13: | |

14: | Probe(_{new} |

15: | SetLookup(prob(|@_{new} |

16: |

Now let us prove the correctness of the look-up technique in the algorithm.

_{1} _{2}. _{1}) ≤ _{2}) _{1}) ≥ _{2}), _{1} _{2}. _{1} − _{2}| ≤

Proof:

Since _{1}) + _{1}). By the assumption, _{1}) ≥ _{2}). Therefore, _{1}) + _{2}); hence _{1} + _{2}) = 1. By the assumption, _{2}) ≥ _{1}). Hence, _{2}) + _{1}) + _{1}). Therefore, _{2} + _{1}) = 1. Therefore, _{1} – _{2}| ≤

Proof:

Let us first prove that the code block handling the left probe region is correct. The main idea of the look-up technique is that we can reuse the result of the timing condition (|@_{lookup}_{new}_{new}_{lookup}_{lookup}_{new}_{new}_{lookup}

First, let us prove that it is always the case that _{new}_{lookup}_{L}_{new}_{new}_{new}_{H}_{new}_{L}_{new}_{H}_{new}_{L}_{new}_{new}_{lookup}_{new}_{lookup}

Case _{new}_{new}_{new}_{lookup}_{lookup}_{lookup}_{new}

Case _{new}_{new}_{new}_{lookup}

In both cases, it was shown that if (@_{lookup,}_{new,}

In this section, we present experiment results showing various aspects of the proposed algorithms presented in the previous section. The data show that the lazy-family algorithms (lazy and lazy with look-up tables) give higher throughput than the eager algorithm; however they suffer from slower response time than the eager algorithm. We implemented a simple stream simulation system. Stream providers in the simulation system read the predefined event tuples and transmit them to the correlation algorithms. The implementation was done in Java. An Intel Xeon 1.8Mhz system with 1GB main memory on Windows XP professional was used for the experiment.

We prepared the data files r12.dat, r24.dat, ..., and r1600.dat providing data streams of which arrival rates are from 12 tuples/second to 1,600 tuples/second respectively. We measured the execution times and the average response times of the correlation algorithms under these workloads. The execution time of an algorithm is measured by the total time spent by the algorithm. The response time is the summation of the response times for all tuples processed by the algorithm. The response time of a tuple (_{1}, _{2}) is computed by the correlation completion time minus MAX(max(@_{1}), max(@_{2})). The results are shown in

It is also observed that the average response time of the eager correlation is better than that of the lazy correlation family. In addition, if the stream arrival rate is not so fast (until 400 tuples/second in this particular setting), the simple sort correlation and the simple correlation are better than the lazy correlation family as far as the average response time is concerned. The lazy correlation family intentionally delays the processing of incoming tuples; even in the case where the tuples can be processed right away; the tuples are waiting in the stream buffer until there are “enough” number of tuples. In contrast, the other algorithms process the incoming tuples as soon as they arrive. When the processing speed cannot catch up with the stream arrival speed, the response time begins to increase sharply.

The performance gain in the lazy correlation and the lazy correlation with a look-up table comes at the expense of larger memory usage and longer response time. Let us show the stream buffer usage of each correlation algorithm.

Now let us show the effectiveness of increasing the block size determining the “enough” number of tuples to run the lazy correlation algorithms.

It turns out the higher confidence threshold requirement, the larger left probe region as shown in

In this study, we proposed novel algorithms for the interval timing correlation. They can be used for extracting temporally related pairs of streaming sensor. In order to handle the uncertainty in timestamps, we adopted interval timestamps and included the confidence thresholds into timing conditions.

We extended a previously studied algorithm by adopting the approaches of the lazy evaluation and the result look-up. The lazy timing correlation also utilizes the upper-bounds and the lower-bounds. It postpones the evaluation until its re-evaluation condition is met and performs the correlation of the tuple blocks. In order to reduce the computation overhead of the satisfaction probability in probe regions, we added an idea of look-up technique. We measured the effectiveness of the proposed algorithms over the previous algorithms by comparing their performance under various workloads and presented the analysis. It turns out that the lazy family algorithms provide better performance with the sacrifice of extra memory for larger buffer and larger response time under slow streaming environment.

For the future work, the generalization of the proposed techniques for various probability distributions seems interesting. For the lazy approach, we need to derive upper-bounds and lower-bounds of the new probability distributions. The invalidation operations shown at the end of

It would be also interesting to apply the proposed techniques to the practical and real situations. As the sensor network become wide spread, we are looking forward to testing our algorithms in real life scenarios.

This research was supported by the Chung-Ang University Research Scholarship Grants in 2009.

(a) The upper-bounds and the lower-bounds of satisfaction probabilities (b) Efficient filtering process using the bounds.

Efficient probing example.

The execution times under various arrival rates.

The average response times under various arrival rates.

Average lengths of stream buffers.

The execution times under different correlation block sizes.

The hit ratio of the look-up table under different correlation block sizes.

The execution times under various confidence thresholds.

The interval timing correlation with high and low confidence thresholds.