In-Network Processing of an Iceberg Join Query in Wireless Sensor Networks Based on 2-Way Fragment Semijoins

We investigate the in-network processing of an iceberg join query in wireless sensor networks (WSNs). An iceberg join is a special type of join where only those joined tuples whose cardinality exceeds a certain threshold (called iceberg threshold) are qualified for the result. Processing such a join involves the value matching for the join predicate as well as the checking of the cardinality constraint for the iceberg threshold. In the previous scheme, the value matching is carried out as the main task for filtering non-joinable tuples while the iceberg threshold is treated as an additional constraint. We take an alternative approach, meeting the cardinality constraint first and matching values next. In this approach, with a logical fragmentation of the join operand relations on the aggregate counts of the joining attribute values, the optimal sequence of 2-way fragment semijoins is generated, where each fragment semijoin employs a Bloom filter as a synopsis of the joining attribute values. This sequence filters non-joinable tuples in an energy-efficient way in WSNs. Through implementation and a set of detailed experiments, we show that our alternative approach considerably outperforms the previous one.


Introduction
In wireless sensor networks (WSNs), the values sampled by a sensor node can be modeled as a relational tuple that consists of the sensor readings as its main attributes and often of the node ID, the timestamp of the sampling, the location of the node, etc. as its auxiliary attributes [1]. Thus, for a region of WSNs, the sensor readings of the nodes deployed in the region can be modeled as a virtual relation physically distributed across the sensor nodes in the region. In WSN applications, which include vehicle surveillance, environment monitoring, animal habitat monitoring, and climate research to name just a few, a relational join query can be issued against two virtual relations. For example, based on the scenario in vehicle surveillance presented in [2], let us consider the identification of the moving objects that have passed two particular regions in WSNs. In each region, the ID of a passing object is sampled and stored with the time of passage. Then, the sets of sensor readings stored in the two regions are modeled as virtual relations denoted as and . The following join query retrieves the ID and time of the objects that have passed both of the two regions:

SELECT .ID, .timestamp, .timestamp
FROM ,
WHERE .ID = .ID
Since a join is an important type of query in WSNs to monitor the correlations among the sensor readings, processing of joins in WSNs has received much attention. A survey of the state-of-the-art techniques is presented in [3]. A naïve method to answer a join query, ⋈ , in WSNs is the external join, whereby all the tuples of and are sent to the base station where the result of the join is produced. In WSNs, a node consumes the most power when it transmits data [4]. Thus, the external join is not energy-efficient, and the state-of-the-art techniques conduct in-network processing of joins.
In this paper, we investigate the in-network processing of a special type of equijoin query called iceberg join in WSNs. It is to retrieve the frequent patterns of correlation among the sensor readings. For a joining attribute value v, it contributes to the join result only if the number of joined tuples for v exceeds some given threshold. This join frequency threshold is called iceberg threshold and denoted as α throughout this paper. Figure 1a shows an example of iceberg join of two relations and with α = 2, denoted as ⋈ . A query retrieving attribute A from this iceberg join can be expressed in SQL as in Figure 1b. In bird habitat monitoring, sensor nodes can be deployed to sample bird songs when birds are singing. From these audio samples, their fingerprints are generated, stored, and later used to recognize the bird species and to estimate their population size in certain regions [5]. For two regions of interest in WSNs, an iceberg join query can be issued to retrieve the fingerprints that are frequently sampled in both regions in studying regional correlations in bird population. Let and denote the virtual relations storing the fingerprints in the two regions. Then, this query can be expressed as:

HAVING COUNT(*) >= α
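The semantics of the iceberg join can be illustrated with a minimal Python sketch. It assumes that each relation is given simply as a list of joining attribute values and that the qualification test is the `>=` comparison of the HAVING clause; the function name `iceberg_join_values` is hypothetical and is not part of the scheme.

```python
from collections import Counter

def iceberg_join_values(r_values, s_values, alpha):
    """Return {value: number of joined tuples} for values that qualify.

    For an equijoin, a value v occurring c_r times in one relation and
    c_s times in the other yields c_r * c_s joined tuples.
    """
    r_counts, s_counts = Counter(r_values), Counter(s_values)
    result = {}
    for v, c_r in r_counts.items():
        joined = c_r * s_counts.get(v, 0)
        if joined >= alpha:          # cardinality constraint (iceberg threshold)
            result[v] = joined
    return result

# Value "a" joins 2 x 1 = 2 times and qualifies for alpha = 2;
# value "b" joins only 1 x 1 = 1 time and is dropped.
print(iceberg_join_values(["a", "a", "b"], ["a", "b", "c"], alpha=2))
```

The sketch makes explicit that the cardinality constraint depends only on the two aggregate counts of a value, which is the property the count-based fragmentation exploits.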
The iceberg join is an important type of query in WSNs because the frequent or prominent phenomena of interest in terms of correlations among sensor readings could be detected in a more energy-efficient way than with the conventional joins. Considering the resource constraints in WSNs and the cross-references required in join processing, the processing of the conventional types of join queries in WSNs could be too expensive [3]. The efficient processing of iceberg joins in WSNs deserves attention but so far little work has been reported. The most relevant ones include the schemes proposed in [6][7][8].
An iceberg join operation involves the checking of the join predicate and of the cardinality constraint between the tuples of the join operand relations. There could be two approaches depending on which condition of the two is the primary one to check. The primary condition is checked first, and for the tuples that satisfy it, the remaining condition is checked next. It would be reasonable to regard the join predicate as an intrinsic requirement of a join operation while treating the cardinality constraint as an additional one. In [6], such a view was taken, and a scheme called SRJA (Synopsis Refinement iceberg-Join Algorithm) was proposed, where a histogram-based synopsis of the joining attribute value ranges is transmitted for filtering non-joinable tuples. In [6], it was shown that SRJA significantly outperformed the baseline schemes. In this paper, we investigate an alternative approach where the cardinality constraint is checked first as the primary condition and then for those tuples that satisfy it, the join predicate is checked. We show that this approach is substantially superior to the other one. The contributions of this paper are as follows:
• We consider a logical fragmentation of join operand relations based on the aggregate counts of the joining attribute values, proposing a 2-way fragment semijoin operation using a Bloom filter as a synopsis of the joining attribute values. In the backward reduction of the 2-way fragment semijoin, the false positives inherent with the Bloom filter are efficiently handled.
• We take advantage of the Highest Count First strategy with which efficient reduction of the join operand relation (called Low Count Cut) occurs, developing a dynamic programming algorithm that generates the optimal sequence of 2-way fragment semijoins. The Highest Count First strategy is shown to be more effective in filtering non-joinable tuples than the transmissions of the value ranges widely used in WSNs.
• Through implementation and a set of detailed experiments, we show that our approach considerably outperforms the previous one.
The rest of this paper is organized as follows: in Section 2, the problem statement is given. In Section 3, the background for our scheme is presented. In Section 4, an overview of our approach is given. In Section 5, the optimization with a dynamic programming algorithm is described. In Section 6, the performance of our scheme is compared with that of SRJA. In Section 7, related work is presented.
Finally, in Section 8, the conclusions are drawn and the future work is given. Notations used in this paper are summarized in Table 1 of Section 4.

Problem Statement
An iceberg join query Q for two virtual relations and on attribute A in WSNs is assumed to be submitted to the base station as a continuous query modeled as a sliding window join [9]. Each evaluation of the query is conducted against a window of and that of , and the query result is returned to the base station. Initially, the base station forwards Q to three sensor nodes , , and .
is the coordinator node of the region for , i = 0, 1. is the node located at the midpoint between and , which is to take part in in-network query optimization and processing. In each region, a routing tree whose root is is constructed with the standard routing tree construction algorithm of [1]. disseminates Q to all the sensor nodes in the region. In each region, preprocessing is carried out to collect the aggregate count of each joining attribute value that is sampled in the window. Each sensor node in a region generates a node histogram, which is a set of (value, count) pairs in the node. Then, it sends the node histogram to its parent node in the routing tree. Eventually, obtains the region histogram. It is a binary relation with the joining attribute A and the count attribute. In [6], this relation is called the base histogram. Let us denote the base histogram of region and as and , respectively. Now the problem is to fully reduce and such that only those tuples of them that are qualified for the iceberg join remain. Let ′ and ′ respectively denote the full reduction of and . Then, the final result of Q can be obtained by two semijoins followed by a final join: ( ⋉ ′ ) ⋈ ( ⋉ ′ ). Once ′ and ′ are obtained, the operations to produce the final result of Q are straightforward. Thus, in this paper, we deal only with the problem of optimally obtaining ′ and ′ .
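The preprocessing step, in which node histograms are merged along the routing tree into the region histogram, can be sketched as follows. This is a minimal illustration under stated assumptions: the routing tree is a dictionary from a node to its children, each node holds a plain list of sampled joining attribute values, and the names `node_histogram` and `region_histogram` are hypothetical.

```python
from collections import Counter

def node_histogram(readings):
    """(value, count) pairs for the values sampled at one node."""
    return Counter(readings)

def region_histogram(tree, readings_by_node, root):
    """Merge the node histograms of the subtree rooted at `root`.

    Each node adds the histograms received from its children to its own,
    which is what it would forward to its parent in the routing tree.
    """
    hist = node_histogram(readings_by_node.get(root, []))
    for child in tree.get(root, []):
        hist.update(region_histogram(tree, readings_by_node, child))
    return hist

tree = {"coord": ["n1", "n2"], "n1": ["n3"]}
readings = {"coord": [5], "n1": [5, 7], "n2": [7], "n3": [9, 5]}
print(region_histogram(tree, readings, "coord"))
```

The result at the root corresponds to the binary relation with the joining attribute and the count attribute described above.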

Background
Our scheme employs the Bloom filter [10] and a variation of the 2-way semijoin [11,12]. In Section 3.1, semijoin and 2-way semijoin operations are briefly described. In Section 3.2, an overview of Bloom filter and its theory are given. In Section 3.3, the technique of SRJA [6] is described.

Semijoin and 2-Way Semijoin
The semijoin was proposed in [13] [14]. The 2-way semijoin was investigated in [11,12,15] as an extension of the semijoin. It includes the backward reduction phase in addition to the forward reduction phase of the original semijoin such that both of the two join operand relations are fully reduced. Suppose semijoin ⋊ reduces to ′ . In the backward reduction phase of 2-way semijoin ⋊⋉ , ′ [A] or its complement (i.e., ) or a bit vector indicating which value of [A] is joinable and which is not is sent back. In distributed query processing, this backward reduction is often effective and can be applied to a pipelined n-way join [12]. In our scheme proposed in this paper, the backward reduction is efficiently merged with the handling of the false positives inherent with the Bloom filter employed in implementing a semijoin.
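The two phases of a conventional 2-way semijoin can be sketched in a few lines of Python. The sketch assumes that tuples are Python tuples whose first component is the joining attribute and that the full value list is shipped in both directions (one of the three backward options mentioned above); the function name `two_way_semijoin` is hypothetical.

```python
def two_way_semijoin(r, s, key=lambda t: t[0]):
    """Reduce both operands: forward phase reduces s, backward phase reduces r."""
    # Forward reduction: the projection of r on the joining attribute is sent to s.
    r_values = {key(t) for t in r}
    s_reduced = [t for t in s if key(t) in r_values]
    # Backward reduction: the joining attribute values of the reduced s are sent back.
    s_values = {key(t) for t in s_reduced}
    r_reduced = [t for t in r if key(t) in s_values]
    return r_reduced, s_reduced

r = [(1, "r1"), (2, "r2"), (3, "r3")]
s = [(2, "s1"), (3, "s2"), (4, "s3")]
print(two_way_semijoin(r, s))
```

After both phases, only tuples with a join partner in the other relation remain on either side.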

Bloom Filter
A k-transform Bloom filter is a bit vector of length m that probabilistically represents a set S with k hash functions h1(),⋯, hk() for k ≥ 1 [10]. Initially, all the m bits are set to 0. For each value x in S, h1(x)-th, ⋯, hk(x)-th bits of the Bloom filter are set. For a given value y, y is not in S if any of the h1(y)-th, ⋯, hk(y)-th bits of the Bloom filter is not 1, whereas y is probably in S if all those k bits are 1. In the latter case, a false positive is possible due to the possibility of the collisions in hashing. However, it is guaranteed that false negatives are not possible.
For two relations and that reside at different locations, a Bloom filter can be employed in implementing semijoin ⋊ with possible errors. First, is scanned to construct a Bloom filter { , A} that represents [A]. { , A} is sent to . Then, for each value v of .A, the membership test of v ∈ { , A} is done to filter non-joinable tuples of . None of the joinable tuples of is filtered out but some of the non-joinable ones could survive because of the false positives (i.e., ⋊ ⊆ ⋊ ⊆ , where ⋊ denotes the semijoin implemented using a Bloom filter). Since { , A} is a bit vector, it is usually much smaller than [A]. In WSNs, it would be more energy-efficient to send { , A} than to send [A], provided that the false positives could be properly handled.
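A semijoin implemented with a Bloom filter can be sketched as below. This is an illustrative sketch only: the class `BloomFilter`, the function `bloom_semijoin`, and the use of salted SHA-256 digests as the k hash functions are assumptions made for the example and are not prescribed by the scheme.

```python
import hashlib

class BloomFilter:
    """k-transform Bloom filter over a bit vector of length m."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, value):
        # Derive k bit positions by salting one cryptographic hash (an assumption).
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p] = True

    def __contains__(self, value):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(value))

def bloom_semijoin(reducer_values, reduced_tuples, m, k):
    """Keep the tuples whose joining value (first field) passes the filter."""
    bf = BloomFilter(m, k)
    for v in reducer_values:
        bf.add(v)
    return [t for t in reduced_tuples if t[0] in bf]

survivors = bloom_semijoin(["a", "b", "c"],
                           [("a", 1), ("x", 2), ("c", 3), ("y", 4)], m=64, k=3)
print(survivors)  # always contains ("a", 1) and ("c", 3); may contain false positives
```

Because no joinable tuple is ever filtered out, the survivors form a superset of the exact semijoin result, which is why the false positives must be sorted out in a later step.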
When the length of a k-transform Bloom filter is m, and the number of values in S is n, the probability of a false positive is approximately $(1 - e^{-kn/m})^k$, and it is minimized to $(1/2)^k$ when $k = (m/n) \ln 2$ [16,17]. In this case, the number of bits used to represent a value in S is $m/n = k/\ln 2$.
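The relationship between m, n, and k can be checked numerically. The following sketch uses the standard approximation of the false positive probability, $(1 - e^{-kn/m})^k$, and verifies that sizing the filter at $m = nk/\ln 2$ bits gives a rate of $(1/2)^k$; the function names are hypothetical.

```python
import math

def false_positive_rate(m, n, k):
    """Standard approximation for a Bloom filter of m bits, n values, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_length(n, k):
    """Filter length (in bits, possibly fractional) for which k hashes are optimal."""
    return n * k / math.log(2)

n = 1000
for k in (1, 2, 4, 8):
    m = optimal_length(n, k)
    print(k, round(m / n, 2), false_positive_rate(m, n, k), 0.5 ** k)
```

Each additional hash function costs about 1.44 bits per value and halves the false positive probability.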

SRJA
Given an iceberg join of and on attribute A in WSNs, SRJA works as follows [6]: The preprocessing as described in Section 2 is carried out to obtain and . At and , the values in [A] and [A] are respectively divided into a sequence of value ranges with the information on the count attribute associated with each range. A range is defined as a 4-tuple (minval, maxval, mincount, maxcount). minval and maxval are respectively the minimum and maximum value of A in the range, and mincount and maxcount are respectively the minimum and maximum value of count among all the counts associated with the values of A in the range. The sequence of these ranges constitutes the synopsis of the values of A and count in (i = 0, 1). A value range represented as an interval would be much smaller than the list of all the values in the range. In WSNs, it is a common practice to send a value range (i.e., an interval [minval, maxval]) instead of the full list of values in the range to reduce data transmission cost though accuracy is compromised [18].
and send the synopses of and to the sensor node located at the midpoint between and . At , the value range matching is conducted first with the minval and maxval of the ranges in the synopses. In this matching process, the original synopses are modified. Some ranges of a synopsis are deleted because there is no matched counterpart in the other synopsis, or are further divided into subranges so that the two synopses have exactly the same set of ranges. Then, for each pair of matched ranges of and , the mincount's and maxcount's are checked, and the ranges are tagged as PRUNE, JOIN, or DIVIDE. The pair for which maxcount( ) × maxcount( ) < α is tagged as PRUNE. The pair for which mincount( ) × mincount( ) ≥ α is tagged as JOIN. The remaining pairs are tagged as DIVIDE. The tagged synopses are sent back to and . The tuples in a PRUNE range are deleted from (i = 0, 1). The tuples in a JOIN range not qualified for the query are filtered out by 2-way semijoins, and the qualified ones are finally joined for the result. A DIVIDE range is further divided into subranges. The synopsis of is reconstructed with them. This process is repeated until the final query result is obtained. The optimization techniques for SRJA are as follows [6]:
• Optimization 1: Before the above process begins, the tuples whose value of count exceeds are sent separately for checking their joinability because they are likely to be qualified for the query.
• Optimization 2: The tuples whose value of A belongs to a sparse range are also separately sent. Eliminating a sparse range would make the synopsis more selective.
• Optimization 3: For a range to be tagged as DIVIDE, the maxcount of the opposite range (opp_maxcount) is also sent when the tagged synopses are sent back. When the range is divided into subranges, the tuples which turn out not to be qualified for the query (count × opp_maxcount < α) are deleted.
The skeleton of the protocol executed by for SRJA is described in Figure 2. The one for is symmetrical. In [6], it was shown that SRJA significantly outperformed the following baseline schemes:
• NAÏVE: The external join where all the tuples of and are sent to the base station.
• SIJ: The synopsis join of [2] extended for iceberg joins where and are sent to and fully reduced there.
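The tagging rule that SRJA applies to a pair of matched ranges can be written down directly from the description above. The sketch below is only an illustration of that rule; the type `Range` and the function `tag_pair` are hypothetical names, and the range matching and subdivision steps are not modeled.

```python
from typing import NamedTuple

class Range(NamedTuple):
    minval: int
    maxval: int
    mincount: int
    maxcount: int

def tag_pair(r0: Range, r1: Range, alpha: int) -> str:
    """Tag a pair of matched ranges (same [minval, maxval]) from the two synopses."""
    if r0.maxcount * r1.maxcount < alpha:
        return "PRUNE"   # even the most frequent values cannot reach the threshold
    if r0.mincount * r1.mincount >= alpha:
        return "JOIN"    # any pair of matching values meets the threshold
    return "DIVIDE"      # undecided: the range must be refined further

alpha = 20
print(tag_pair(Range(0, 9, 1, 3), Range(0, 9, 2, 5), alpha))    # PRUNE (3 * 5 < 20)
print(tag_pair(Range(10, 19, 4, 6), Range(10, 19, 5, 9), alpha))  # JOIN (4 * 5 >= 20)
print(tag_pair(Range(20, 29, 2, 8), Range(20, 29, 3, 7), alpha))  # DIVIDE
```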

Overview of Our Approach
Given an iceberg join of and on attribute A in WSNs, suppose the preprocessing described in Section 2 has been carried out to obtain and . In this section, the main components of our scheme in fully reducing and are described. They include 2-way fragment semijoin, Low Count Cut, Highest Count First strategy, and the sequence of 2-way fragment semijoins. The issue of optimization is dealt with in Section 5. The notations used in this section and Section 5 are summarized in Table 1.
Table 1. Notations.
The coordinator sensor node in the region for (i = 0, 1)
The sensor node located at the midpoint between and .
The initial maximum value of the attribute count in (i = 0, 1)
The current highest value of the attribute count in (i = 0, 1)
The current lowest value of the attribute count in (i = 0, 1)
The current state of after reduced by a sequence of fragment semijoins (i = 0, 1)
Given ( , ) and ( , ), the optimal sequence of fragment semijoins that fully reduces and provided that the first semijoin is ⋉ ( ,
The cost of ⋉ * ( , )
The value of attribute count in with which (Equation (2)) is minimized (i = 0, 1). Let n = ⋉ * ( , ). Then, the following sequence of fragment semijoins is the prefix of ( , )
The Bloom filter sent for ⋉ ( , )
The optimal number of hash functions used for ( , )
The cost of handling false positives with ( , ) in executing ⋉ ( , )

2-Way Fragment Semijoin
As described in Section 2, and are the base histograms of and with attribute count. For relation (i = 0, 1), let denote the result of σ . is a horizontal fragment of on count, and thus, a horizontal subset of the base histogram of . For example, given an iceberg join between and in Figure 1 on attribute A, let us consider a logical fragmentation of and on count. Suppose the lowest and highest value of count in is 1 and 7, respectively. Then, is logically partitioned into 7 fragments: , , …, . Similarly, suppose is logically fragmented into , …, . Now let us consider an iceberg join with α = 30. Since and are fragmented, the semijoin where a fragment is the reducer relation can be used. For example, semijoin ⋊ is executed in the following way: [A] is sent to . Since ⌈α/7⌉ = 5, only those tuples of that belong to the fragments where 5 ≤ ≤ 8 could be joinable with the tuples in . Thus, only ⋃ could be considered as the reduced relation. The tuples in those fragments not joinable with [A] are deleted. Now let us define a new type of operation called fragment semijoin by modifying the semijoin. In the semijoin, the joinable tuples of the reduced relation remain whereas the non-joinable ones are deleted. In our fragment semijoin, the tuple filtering is done the other way around, with a side-effect. The joinable ones are deleted whereas the non-joinable ones remain. In other words, the result of a fragment semijoin is the complement of the conventional semijoin (in some commercial DBMSs, this variation of the semijoin is called an anti join; in fact, the term fragment anti join might be more exact than fragment semijoin; however, we keep the term semijoin with the modifier "fragment" because it is widely known). Let us denote a fragment semijoin operator as ⋊ . With α = 30, a fragment semijoin ⋊ is executed in the following way: [A] is sent to . The tuples in ⋃ not joinable with [A] are intact (i.e., not deleted).
Instead, the joinable ones are deleted and inserted to a separate relation ′ . The reason why the unmatched tuples remain is that they might be joined with the tuples in other fragments of . The insertion of matched tuples to ′ is a necessary side-effect of a fragment semijoin. Initially, ′ is empty. Every time a fragment semijoin to reduce is executed, the matched tuples, if any, are inserted to ′ . When all the tuples of that are joinable with have been inserted to ′ , is said to be fully reduced to ′ . Until then, is said to be reduced to ′′ , which keeps the tuples of yet to be checked for joinability with other fragments of .
, is the set of tuples of that have been finally confirmed not joinable with . How this difference is computed will be explained in the next two subsections. In a 2-way fragment semijoin, the backward reduction phase is added where the matched joining attribute values are sent back.
So far, we have assumed that the joining attribute values are sent in the forward reduction phase. In our scheme, we employ the Bloom filter as a synopsis of the joining attribute values in executing a fragment semijoin to reduce the amount of data transmission in WSNs. The fragment semijoin with a Bloom filter is the same as above except that (1) the Bloom filter constructed from the joining attribute values is sent; and (2) the false positives need to be handled. With α = 30, the fragment semijoin ⋊ using a Bloom filter is executed in the following way: The values in . A are represented in a Bloom filter, { , A}, which is sent to . For each value of A in ⋃ , the membership test is done with { , A}. The matched tuples (including those due to false positives) are moved to ′ . Their values of A are sent back to . The ones that turn out to have been sent due to false positives are sent back to , and their corresponding tuples are moved from ′ back to their original fragments. Note that the backward reduction phase is mandatory with the fragment semijoin using a Bloom filter in order to sort out false positives. In the rest of this paper, what we mean by a fragment semijoin denoted with ⋊ is a 2-way fragment semijoin using a Bloom filter unless stated otherwise.
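The steps of a 2-way fragment semijoin with a Bloom filter can be sketched as follows. This is a simplified single-process illustration under several assumptions: a histogram is a dictionary from value to count, the Bloom filter is a set of bit positions derived from salted SHA-256 digests, and the names `positions` and `fragment_semijoin` are hypothetical. Message passing between the two coordinator nodes is only indicated in comments.

```python
import hashlib

def positions(value, m, k):
    """k bit positions of `value` in a Bloom filter of length m (assumed hashing)."""
    return [int.from_bytes(hashlib.sha256(f"{s}:{value}".encode()).digest()[:8], "big") % m
            for s in range(k)]

def fragment_semijoin(reducer_fragment, reducer_count, reduced_hist, alpha, m=64, k=3):
    """Move the tuples of reduced_hist joinable with reducer_fragment out of it.

    reducer_fragment: set of values whose count equals reducer_count.
    reduced_hist: dict value -> count; matched entries are removed in place.
    Returns (confirmed, false_positives).
    """
    # Forward phase: build the filter and "send" it to the other region.
    bits = set()
    for v in reducer_fragment:
        bits.update(positions(v, m, k))
    min_count = -(-alpha // reducer_count)   # ceiling of alpha / reducer_count
    candidates = [v for v, c in reduced_hist.items()
                  if c >= min_count and all(p in bits for p in positions(v, m, k))]
    # Backward phase: candidate values are "sent back" and checked exactly.
    matched = {v for v in candidates if v in reducer_fragment}
    false_positives = set(candidates) - matched
    # Matched tuples leave the reduced relation; all others stay for later fragments.
    confirmed = {v: reduced_hist.pop(v) for v in matched}
    return confirmed, false_positives

hist = {"a": 5, "b": 2, "c": 6}
confirmed, fps = fragment_semijoin({"a", "b"}, 7, hist, alpha=30)
print(confirmed, hist)   # {'a': 5} {'b': 2, 'c': 6}
```

In the example, "b" is never tested because its count is below ⌈30/7⌉ = 5, and "c" stays in the reduced relation whether or not it happens to pass the filter as a false positive.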

Low Count Cut (LCC)
In our approach, the cardinality constraint is the primary condition to check. One of the advantages we gain by checking the cardinality constraint first is that we can delete those tuples of and with low counts without checking the join predicate. Let and respectively denote the maximum count in and . Then, the tuples in the fragments ( ≤ α ⁄ ) and ( ≤ α ⁄ ) cannot meet the cardinality constraint, and they need not be considered at all. For example, suppose is fragmented into ,…, whereas is fragmented into ,…, , and α = 20. is ignored for the join because = 10 (1 × 10 < α). Similarly, so is ⋃ because = 7 (7 × 1 < α and 7 × 2 < α). In general, the maximum count of one relation determines the minimum count of the candidate tuples for the join in the other relation. The tuples with a count less than this minimum can be deleted without checking the join predicate. Let us call this reduction effect Low Count Cut (LCC). The LCCs in the above example are called initial LCCs. The initial LCCs for would be possible after is notified of as a part of the optimization process, which will be described in Section 5.5. Other than the initial ones, an LCC could occur after a fragment semijoin is executed. It will be explained in the next subsection.
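The initial LCC can be expressed as a one-line filter on a base histogram. The sketch below assumes a histogram is a dictionary from value to count and uses the hypothetical name `initial_lcc`; it reproduces the example above with α = 20 and maximum counts 7 and 10.

```python
def initial_lcc(hist, opp_max_count, alpha):
    """Drop the entries whose count cannot reach alpha against opp_max_count."""
    min_count = -(-alpha // opp_max_count)   # ceiling of alpha / opp_max_count
    return {v: c for v, c in hist.items() if c >= min_count}

alpha = 20
hist_a = {f"a{c}": c for c in range(1, 8)}    # counts 1..7
hist_b = {f"b{c}": c for c in range(1, 11)}   # counts 1..10

# Against a maximum count of 10, only count 1 is cut (1 x 10 < 20).
print(sorted(initial_lcc(hist_a, 10, alpha).values()))  # [2, 3, 4, 5, 6, 7]
# Against a maximum count of 7, counts 1 and 2 are cut (7 x 2 < 20).
print(sorted(initial_lcc(hist_b, 7, alpha).values()))   # [3, 4, 5, 6, 7, 8, 9, 10]
```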

Highest Count First (HCF)
It would be efficient to take advantage of LCCs in reducing and with a sequence of fragment semijoins. When a fragment of one relation is to be selected as the reducer relation for a fragment semijoin, it is desirable to select the fragment with the highest count in that relation, for it would result in LCC in the reduced relation. For example, suppose is fragmented into , ⋯, whereas is fragmented into , ⋯, , and α = 25. Suppose the fragment semijoin ⋉ is executed, and is reduced to ′′ . In , , ⋯, remain. In ′′ , ′′ , ⋯, ′′ remain. Note that ′′ does not contain . It is deleted due to LCC. Note that the LCC due to a fragment semijoin occurs only when the fragment with the highest count is the reducer relation. If ⋉ is executed to reduce to ′′ with remaining, ′′ is still contained in ′′ because some of the tuples in ′′ might be joinable with those in .
Let us call the strategy of selecting the fragment with the highest count as the reducer relation for a fragment semijoin Highest Count First (HCF). Figure 3 shows how and are reduced after a fragment semijoin is executed with the HCF strategy. In Figure 3a, each box with a count value in and denotes a fragment. For example, the box at the top of with count = p denotes the fragment . Figure 3a shows how and are logically fragmented. As shown, = ⋃ (p ≤ q) and = ⋃ (r ≤ s). Figure 3b shows the fragment semijoin ⋊ . Figure 3c shows which fragments of and remain. In ′′ , ′′ is now the fragment with the highest count. In ′′ , the LCC has occurred, and = α ( − 1) ⁄ is now the lowest count in the remaining fragments.
Figure 4 (b-f): A sequence of fragment semijoins is executed, with the fragments of (or ) being the reducer relations and those of (or ) being the reduced relations. After each fragment semijoin, LCCs occur.
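The effect of one HCF step on the remaining count ranges can be sketched as below. The sketch tracks only the current highest counts of the two relations and assumes, for simplicity, that every count between the lowest and highest value has a fragment, so that the next highest count is one less than the reducer's count. The names `lowest_count` and `hcf_step` are hypothetical.

```python
def lowest_count(alpha, opp_highest):
    """Smallest count that can still reach alpha against opp_highest."""
    return -(-alpha // opp_highest)   # ceiling of alpha / opp_highest

def hcf_step(alpha, highest, reducer):
    """Apply one fragment semijoin whose reducer is the highest-count fragment.

    highest: (h0, h1), the current highest counts of the two relations.
    reducer: 0 or 1, the relation that supplies the reducer fragment.
    Returns (new_highest, lowest); lowest is (None, None) when no pair of
    remaining fragments can satisfy the threshold any more.
    """
    h = list(highest)
    h[reducer] -= 1                   # the reducer fragment is finished
    if h[0] * h[1] < alpha:           # also covers a relation running out of fragments
        return tuple(h), (None, None)
    return tuple(h), (lowest_count(alpha, h[1]), lowest_count(alpha, h[0]))

alpha = 25
print(lowest_count(alpha, 5))        # 5: lowest surviving count before the step
print(hcf_step(alpha, (5, 8), 0))    # ((4, 8), (4, 7)): counts 5 and 6 are cut
```

In the example, using the fragment with count 5 as the reducer raises the lowest surviving count in the other relation from 5 to 7, which is the LCC triggered by the HCF strategy.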

Fragment Semijoin Sequence
and could be fully reduced with a sequence of fragment semijoins which interleaves two types of fragment semijoins: one with a fragment of being the reducer relation, and the other with a fragment of being the reducer relation. Figure 4 shows an example where = ⋃ and = ⋃ are reduced by a sequence of fragment semijoins with α = 18. After the initial LCCs (Figure 4a), a sequence of fragment semijoins is executed, each followed by LCCs (Figure 4b-f). Throughout the sequence, either the fragment with the highest count in or that in is the reducer relation. In Figure 4b,d,e, more than one fragment is depicted as being the reducer relation for fragment semijoins. Since every fragment semijoin selects one fragment at a time as the reducer relation with the HCF strategy, the number of fragment semijoins executed in each of Figure 4b,d,e is equal to the number of fragments marked as selected. For example, in Figure 4b, two fragment semijoins are sequentially executed; ⋉ first, and then ′′ ⋉ where ′′ is the result of ⋉ .

Optimal Sequence of Fragment Semijoins
In our approach, a sequence of fragment semijoins are executed to fully reduce and , and the Bloom filter is employed as a synopsis of the joining attribute values. Data transmission for a fragment semijoin occurs to send the Bloom filter and to handle the false positives. According to the Bloom filter theory presented in Section 3.2, the length of a Bloom filter and the number of transformations used could be optimally set to minimize the probability of false positives. In this section, we present an algorithm that generates the optimal sequence of 2-way fragment semijoins with the HCF strategy whereby the total amount of data transmission in fully reducing and is minimized.

Formulation of Optimization Problem
We develop a dynamic programming algorithm to generate the optimal sequence of fragment semijoins that fully reduces and . In describing the algorithm, it is convenient to denote and as and (i = 0, 1). For example, if we need to mention both ⋊ and ⋊ to state something that is applied to both (Note that ⋊ ≠ ⋊ because the semijoin operation is not commutative.), ⋊ (i = 0, 1) will do. In the rest of this paper, the subscripts i and 1 − (e.g., , ) are used with (i = 0, 1) omitted if they are related to or and the context is clear. Other notations used in this section are summarized in Table 1.
The current state of after a certain sequence of fragment semijoins has been executed can be represented with the two current highest counts in the remaining fragments of and . Let denote the current highest count of as depicted in Figure 5. The following two statements hold:
• The current lowest count of is α ⁄ (Figure 5a).
• If the initial maximum count of is , each fragment in ⋃ has been selected as the reducer relation for the fragment semijoins executed thus far before or after reduced by some fragments of (Figure 5b).
The termination condition of this recurrence relation is α ⁄ < or α ⁄ < , which means either one of the two relations gets empty. In such a case, no further fragment semijoin is needed. Thus: Let ⋉ * ( , ) denote the value n for which ⋉ * ( , ) in (Equation (2)) is minimized.
Thus, the optimal sequence can be represented as i ( ∈ {0,1}) followed by a sequence of counts in and as follows:

⋯ ⋯
For example, if the sequence of fragment semijoins in Figure 4 is the optimal one, then it is represented as 1, 9, 7, 6, 5, 5.

Cost of 2-Way Fragment Semijoin
The cost of a fragment semijoin ̂⋉ ( , ) where i = 0, 1 is defined as the total amount of data transmission in bits, and given as follows: The first term denotes the size of the Bloom filter, and the second is for the cost of handling false positives. Since the total cost of sending all the joinable attribute values (excluding those from false positives) in the backward reduction phases is the same for all the possible sequences of fragment semijoins, it is omitted. The multiplication of the number of hops between the two regions of and is also omitted because it is the same for all the sequences.
and have been reduced by each other with a fragment semijoin sequence until ( , ) becomes (x, y). However, the above estimation is valid since the joining attribute values are unique in , that is, ∩ = ∅ (p ≠ q), and thus, the following holds: where p ≠ q. Now ‖ ( , )‖ can be estimated as follows according to the Bloom filter theory described in Section 3.2: where ( , ) denotes the optimal number of hash functions for ( , ), which will be explained shortly. In some cases, the number of tuples of ( , ) is so small that the size of the list of joining attribute values might be smaller than ‖ ( , )‖. In such a case, the list is sent instead of ( , ).
To cover such exceptions, we modify the above equation as follows: where ‖ ‖ denotes the number of bits to represent a value in the joining attribute A, and the one added is for a flag bit (for distinguishing a Bloom filter from a value list).

Cost of Handling False Positives
The estimation of E is according to the semijoin selectivity estimation as described in Section 3.
If the list of joining attribute values is sent instead of ( , ), ( , ) = 0.

Optimal Number of Hash Functions
( , ), the optimal number of hash functions for ( , ), is determined as follows: as summarized in Section 3.2, the probability of false positives is affected by the number of hash functions. As the number of hash functions increases, the length of the Bloom filter increases with the probability of false positives decreased. In (Equation (5)), the cost of a fragment semijoin, ̂⋉ ( , ), is given as a sum of two terms, ‖ ( , )‖ and ( , ). There exists a tradeoff between the two terms, and thus, ( , ) should be determined to be a positive integer such that ̂⋉ ( , ) is minimized. From Equations (8) and (10), the cost of a fragment semijoin ⋉ ( , ) with k hash functions is given as follows:
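The tradeoff between the filter size and the cost of false positives can be explored with a small search over k. The cost model below is an assumption made for illustration and is not the cost equation of this paper: the filter is sized at n·k/ln 2 bits, and every non-joinable candidate value is assumed to pass the filter with probability (1/2)^k and then to cost `value_bits` bits of extra transmission. All names are hypothetical.

```python
import math

def bloom_bits(n_values, k):
    """Filter length in bits when k hash functions are optimal for n_values."""
    return math.ceil(n_values * k / math.log(2))

def semijoin_cost(n_reducer, n_nonjoinable, value_bits, k):
    """Assumed cost model: filter size plus expected false-positive traffic."""
    return bloom_bits(n_reducer, k) + n_nonjoinable * (0.5 ** k) * value_bits

def optimal_k(n_reducer, n_nonjoinable, value_bits, k_max=32):
    """Positive integer k (up to k_max) that minimizes the assumed cost."""
    return min(range(1, k_max + 1),
               key=lambda k: semijoin_cost(n_reducer, n_nonjoinable, value_bits, k))

print(optimal_k(100, 100, 32))      # few non-joinable candidates: small k
print(optimal_k(100, 100000, 32))   # many non-joinable candidates: larger k
```

Under this model, a larger population of non-joinable candidates justifies more hash functions, because each extra hash function halves the false-positive traffic at a fixed price of about 1.44 bits per reducer value.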

Dynamic Programming Algorithm
⋉ * ( , ) and ⋉ * ( , ) can be obtained with a dynamic programming algorithm. From (Equation (2)) through (Equation (4)), ⋉ * ( , ) and ⋉ * ( , ) are initialized as follows: If there is no fragment remaining in either one of and , that is, if < α ⁄ or < α ⁄ , then ⋉ * ( , ) is set to 0, and ⋉ * ( , ) is undefined. If none of and is empty but there remains only one fragment in (i.e., = α ⁄ ), the only possible fragment semijoin is the one where that fragment is the reducer relation. Thus, ⋉ * ( , ) = ̂⋉ ( , ) and ⋉ * ( , ) = . Starting from these initializations, the optimal solution for the cases where the number of remaining fragments of and is greater than 1 can be obtained. For example, suppose there are two fragments remaining in and one in (Figure 6a). ⋉ * ( , ) can be obtained as follows: Because of the HCF strategy, the fragment with the highest count is the reducer relation in the first semijoin (Figure 6b). The remaining fragment of might be the reducer relation of the next semijoin (Figure 6c). That is, Figure 6b,c show all the possible cases. Figure 6d,e respectively show the subsequent fragment semijoin where the fragment of is the reducer relation after the semijoins in Figure 6b,c assuming that no LCC has occurred. The optimal solutions (i.e., ⋉ * ( , )) in Figure 6d,e are already known, because the number of remaining fragment in is 1 (Figure 6d) or 0 (Figure 6e), and that in is 1. The optimal solutions here are already given from the initialization. In this way, ⋉ * ( , ) can be obtained when there are 3, 4,…, − α ⁄ + 1 fragments remaining in (i = 0, 1). Figure 7 describes the algorithm that carries out this process.
In the algorithm, procedure opt(i, , , , ) is invoked. As shown in Figure 8a, the arguments and are respectively the current lowest and highest count in . and are those in . The variable num_frag denotes the number of remaining fragments in , and opp_num_frag denotes that in . In the two outermost loops, the values of these variables increase. In the two innermost loops, pairings of and are provided such that the difference between and is set by num_frag and opp_num_frag, i = 0, 1. The invocations of opt() in Lines 8 and 9 find the optimal solution for the case where the number of fragments in the reducer relation is greater than that in the reduced relation. Those in Lines 11 and 12 find the optimal solution for the inverse case.
For example, suppose four and three fragments remain in and , respectively (Figure 9a). Let us consider ⋉ * ( , ) for them.

Post Optimization
So far, we have assumed that the backward reduction phase is carried out for each fragment semijoin in the optimal sequence. As described in Section 4.1, the values for which false positives are confirmed need to be sent back again and their corresponding tuples need to be inserted back to the reduced relation. In the backward reduction, if the matched joining attribute values are sent back with their counts, re-insertion would not be necessary at all. In doing so, the concern is the overhead of sending the count for each individual matched value. An efficient way of handling this problem is to delay the backward reduction of each fragment semijoin until the execution of all the fragment semijoins in the optimal joining attribute values (possibly including those from false positives) whose count is equal to j, and = | | ( = , + 1, ⋯, ).

Query Optimization and Processing
In-network optimization of an iceberg join query Q is carried out in the sensor node located at the midpoint between and . Thus, it is required for and to send the information necessary for the optimization to . First, sends the count information of (i.e., , ⋯, ) to . ‖ ‖ and | | are assumed to have already been sent to when Q was initially forwarded to , , and . All the equations in Sections 5 and 3.2 can be evaluated if the aforementioned information is available. The node generates the optimal sequence of fragment semijoins and sends it to with the count information of (i.e., , ⋯, ). The count information of as well as the optimal sequence as described in Section 5.1 are represented as a short sequence of integers. Thus, the communication overhead for optimization could be very small. Now and are ready to execute the optimal sequence to fully reduce and . The skeleton of the protocol executed by for our scheme is described in Figure 10. The one for is symmetrical.
1: begin
2: send the count information of to ;
3: receive the optimal sequence of fragment semijoins OPT and the count information of from ;
4: process initial low count cut for with ;
5: if( is to start forward reduction in OPT ) then S-flag = TRUE;
6: else S-flag = FALSE;
7: endif
8: // forward reduction phase
9: if( S-flag ) make and send Bloom filters to according to OPT;
10: loop
11:   if( no fragment semijoin to execute is remaining in OPT ) break;
12:   receive Bloom filters from and process forward reductions;
13:   if( no fragment semijoin to execute is remaining in OPT ) break;
14:   make and send Bloom filters to according to OPT;

Performance Evaluation of Our Scheme and SRJA
In this section, we compare the performance of our scheme with that of SRJA. We implemented both our scheme and SRJA, and measured the total number of packets transmitted among the sensor nodes and the total number of transmissions among the sensor nodes while the two schemes were executed. These two performance metrics are employed to compare the energy-efficiency of the two schemes. The number of packets transmitted is measured assuming that the network is IEEE 802.15.4-compliant. We also measured the ratio of joinable values transmitted to all the values transmitted. This metric compares the effectiveness of data filtering as well as the energy-efficiency of the two schemes. Finally, we present an analytical comparison of the total number of transmissions in the two schemes. All the procedures for the experiments were implemented in C, and the experiments were conducted on a Windows 7 system with an AMD Phenom II X4 945 Processor (3.0 GHz) and 4 GB of memory.

Parameters
The network and query parameters in the experiments are summarized in Table 2. We have considered WSNs where sensor nodes are uniformly deployed. Each of the two regions for an inter-region iceberg join where a join operand relation resides is assumed to be a square consisting of n × n nodes. The distance between the coordinator nodes and of the two regions is set to 30 hops. The default epoch (i.e., sampling interval) of every sensor node is set to 30 s, and the size of the sliding window of the join query is set to 3 h as in the experiments with SRJA in [5]. After setting the join selectivity between and , the values of the joining attribute in the iceberg join operand relations are randomly generated as an integer in the range [1, 10,000] with a random distribution of their counts under the constraint that the highest count for a value could be 15. The iceberg thresholds considered are 50, 100, 150 and 200.
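The scale implied by these parameters can be checked with a short calculation. The sketch below is illustrative only: it assumes that every node contributes exactly one tuple per epoch, and it uses the product form of the cardinality constraint (count in one relation times count in the other is at least α) to find the smallest count a value needs in order to be joinable at all when the highest possible count is 15. The function names are hypothetical.

```python
def tuples_per_region(n, epoch_s, window_s):
    """Tuples held by an n x n region over one sliding window.

    Assumption (not from the paper): each node samples one tuple per epoch.
    """
    return n * n * (window_s // epoch_s)


def min_joinable_count(alpha, max_count):
    """Smallest integer count c such that c * max_count >= alpha.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return -(-alpha // max_count)


if __name__ == "__main__":
    window_s = 3 * 60 * 60  # 3 h sliding window
    for n in (5, 10, 15):
        print(f"{n} x {n} nodes:", tuples_per_region(n, 30, window_s), "tuples")
    for alpha in (50, 100, 150, 200):
        print(f"alpha = {alpha}: count must be at least",
              min_joinable_count(alpha, 15))
```

Under these assumptions, the highest product of counts is 15 × 15 = 225, so every threshold considered (up to 200) can still be met by some pair of values.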

Experimental Results
Each measurement reported in this subsection is the average over 50 runs against different data values. The experimental results reveal that our scheme considerably outperforms SRJA. Figure 11a,b respectively compare the total number of packets transmitted and the total number of transmissions of the two schemes as the size of each region varies from 5 × 5 nodes to 15 × 15 nodes while α is set to 150. Figure 11c,d compare the same while α is set to 200. As the size of each region gets bigger, more data is collected at the sensor nodes and more tuples need to be processed in both schemes. Thus, more packets are transmitted. The number of transmissions also increases in SRJA. In our scheme, it is not so sensitive to the data volume because, for a given α, the optimal sequence of fragment semijoins generated could include a similar number of semijoins. For the total number of packets transmitted, the average performance improvement with our scheme over SRJA is 63.58% (α = 150) and 71.97% (α = 200). For the total number of transmissions, the average performance improvement with our scheme over SRJA is 59.66% (α = 150) and 57.27% (α = 200). Figure 12a,b respectively compare the total number of packets transmitted and the total number of transmissions of the two schemes as α varies from 50 to 200 while the size of each region is set to 10 × 10 nodes. Figure 12c,d compare the same while the size of each region is set to 15 × 15 nodes. As α increases, the iceberg join query gets more selective. Thus, in both schemes, more tuples are filtered and the number of packets transmitted decreases. The number of transmissions turns out not to be very sensitive to the increase of α except for the case of α = 200. In SRJA, the effect of subrange pruning results in a decrease of transmissions for larger values of α. In our scheme, on the other hand, the number of transmissions slightly increases from α = 50 to 150, then decreases for α = 200.
These changes depend on the optimal sequence of fragment semijoins generated. The more semijoins are to be executed in the optimal sequence, the more transmissions occur. For the total number of packets transmitted, the average performance improvement with our scheme over SRJA is 59.71% (regions of 10 × 10 nodes) and 64.07% (regions of 15 × 15 nodes). For the total number of transmissions, the average performance improvement with our scheme over SRJA is 63.98% (regions of 10 × 10 nodes) and 66.13% (regions of 15 × 15 nodes). Figure 13a,b respectively compare the total number of packets transmitted and the total number of transmissions of the two schemes as the epoch (i.e., sampling interval) at each node varies from 10 s to 60 s while the size of each region is set to 10 × 10 nodes and α is set to 150. As the sampling rate gets higher, more data is collected at the sensor nodes and both schemes are expected to transmit more packets. The number of transmissions also increases. However, SRJA turns out to suffer much more with higher sampling rates. For the total number of packets transmitted, the average performance improvement with our scheme over SRJA is 62.97%. For the total number of transmissions, the average performance improvement with our scheme over SRJA is 58.66%. Figure 14a compares the ratio of joinable values transmitted in the two schemes as the size of each region varies from 5 × 5 nodes to 15 × 15 nodes while α is set to 150. This ratio is defined as:

$$\frac{\text{Total number of joinable attribute values}}{\text{Total number of join attribute values transmitted}} \qquad (12)$$

Figure 14b compares the same as α varies from 50 to 200 while the size of each region is set to 10 × 10 nodes. This ratio gets lower as the query gets more selective or as more non-joinable values are transmitted. As the size of each region gets bigger, the number of tuples collected at the sensor nodes increases and the size of the iceberg join result also increases with a given α.
Thus, this ratio increases in both schemes (Figure 14a). As α increases, on the other hand, this ratio decreases in both schemes because the iceberg join query gets more selective (Figure 14b). In Figure 14, this ratio in our scheme turns out to be significantly higher than that in SRJA. This means that both the effectiveness of filtering out non-joinable values and the energy-efficiency are much higher in our scheme than in SRJA. The major reasons for the improvements are two-fold:
(1) In SRJA, the histogram-based value ranges are sent as a synopsis of the joining attribute values. In our scheme, a Bloom filter constructed from a count-based fragment is sent as a synopsis of the joining attribute values. The Bloom filter is more compact than the value ranges. In addition, the false positives are efficiently handled in the backward reduction phase of the 2-way fragment semijoins in our scheme.
(2) SRJA is centered around checking the join predicate with the cardinality constraint as an additional condition. For each pair of matched ranges, if there exists at least one pair of tuples t0 ∈ and t1 ∈ such that t0.count × t1.count ≥ α, the non-joinable tuples in either range cannot be filtered out, and recursive divisions of each of the two ranges into subranges are required. In contrast, our scheme is centered around the cardinality constraint with the join predicate as the secondary condition. Only the fragment semijoins between the fragments satisfying the cardinality constraint are carried out to filter non-joinable tuples. In addition, the reductions from the LCCs that result from the HCF strategy are effective.
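The combination of the two ideas, checking the cardinality constraint first and then matching values through a Bloom filter, can be illustrated with a simplified sketch. It is not the implementation evaluated here: the Bloom filter parameters, the hash construction, and the names `BloomFilter` and `fragment_semijoin` are assumptions made for illustration, and the two sides are modeled as plain dictionaries mapping each joining attribute value to its count rather than as data distributed over sensor nodes.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def __contains__(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def fragment_semijoin(reducer_counts, reduced_counts, alpha, reducer_count):
    """One simplified 2-way fragment semijoin.

    reducer_counts, reduced_counts: dict mapping value -> count.
    reducer_count: the count that defines the reducer fragment.
    Returns (matched values, values that were Bloom filter false positives).
    """
    # The fragment holds the reducer's values with exactly this count.
    fragment = {v for v, c in reducer_counts.items() if c == reducer_count}
    synopsis = BloomFilter()
    for v in fragment:
        synopsis.add(v)

    # Forward reduction: the other side keeps only values that can meet
    # the cardinality constraint and that pass the Bloom filter.
    candidates = {
        v for v, c in reduced_counts.items()
        if c * reducer_count >= alpha and v in synopsis
    }

    # Backward reduction: candidates are checked against the real fragment,
    # which exposes any false positives of the Bloom filter.
    matched = candidates & fragment
    false_positives = candidates - fragment
    return matched, false_positives


if __name__ == "__main__":
    left = {"a": 10, "b": 10, "c": 3}
    right = {"a": 15, "c": 15, "d": 15}
    print(fragment_semijoin(left, right, alpha=150, reducer_count=10))
```

Because a Bloom filter never produces false negatives, every truly matching value survives the forward reduction; only the backward check is needed to discard values that passed the filter by chance.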

Analytical Comparison of the Total Number of Transmissions
In this subsection, we present an analytical comparison of the total number of transmissions among the sensor nodes throughout the execution of SRJA and our scheme. In Figures 2 and 10, the protocols executed by the coordinator node of a region for SRJA and our scheme are described. In each of the two schemes, let , be the number of messages that sends to plus the number of messages that receives from . Let be the number of messages that sends to plus the number of messages that receives from . Let = + and = + . Then, the total number of transmissions in each scheme is • + • where d is the distance between and in hops.

SRJA
Considering the protocol in Figure 2, we have: , denotes the number of messages sends to plus the number of messages receives from in processing the values with a high count (line 2 in Figure 2), , denotes that in processing sparse subranges in the synopsis (line 4 or line 16), and , denotes that in processing JOIN-tagged subranges (line 13).
is 1 if the tagged synopsis received from includes at least one JOIN-tagged subrange. It is 0 otherwise. is 1 if the tagged synopsis includes at least one DIVIDE-tagged subrange. It is 0 otherwise. The sparse subranges could be generated after the synopsis is initialized (line 3) or after the DIVIDE-tagged subranges are divided (line 15). Finally, denotes the number of rounds needed until SRJA is terminated. It means how many times the loop (line 5 through line 18) is repeated either fully or partially (up to line 9 for breaking the loop). Thus, ≥ 1. Now is: Normally, = 2 because a 2-way semijoin is executed. first sends the values with a high count with their count information to , and then receives from the list of joinable values as a result (line 2).
does the same symmetrically. If these two symmetrical processes are conducted as an asymmetrical one, one message can be saved, and thus, the number of transmissions could be significantly reduced when the distance between and is long. We can let start the process by sending its data to . When returns the result, it can send its data as well. Then, finally returns the result. In this way, + = 3. Similarly, + = 3. Meanwhile, + = 2 because one 2-way semijoin in either direction but not the symmetrical two is enough in processing the JOIN-tagged subranges. Thus: Meanwhile, = 2 because is to send the synopsis to (line 7) and receive the tagged synopsis from (line 8) in each round. Thus, = 4 . Let be the total number of transmissions in SRJA. Then:

Our Scheme
Considering the protocol in Figure 10, we have , = , + , , where , denotes the number of messages sends to in the forward reduction phase (line 9 through line 15 in Figure 10), while , denotes the number of messages sends to plus the number of messages receives from in the backward reduction phase (line 17 through line 20). Now is ( + ) + ( + ). In executing the optimal sequence of fragment semijoins, and are supposed to take turns playing the role of the reducer relation. Thus, the number of messages including the Bloom filters for a subsequence of fragment semijoins in the forward reduction phase is equal to the number of these turnovers, denoted as . For example, in the sequence of fragment semijoins in Figure 4, there are 5 turnovers, with being the reducer relation at first in Figure 4b. Thus, With the same argument for why + = 3 in SRJA, + = 3. Thus, = + 3. Meanwhile, = 2 because sends the count information of to once (line 2) and receives from the optimal sequence and the count information of once (line 3). Let be the total number of transmissions in our scheme. Then: From (Equations (16) and (17)), < if < ∑ 2 + 3 + 2 + 1. For the comparison, let us not consider the restricted case of = 1. This case happens in SRJA only when at least one of and has a very small number of tuples and their join attribute values are sparsely distributed. Such a case is not of much interest. Now since ≥ 2, we have ∑ 2 + 3 ≅ 5( − 1). The reason for this is as follows: in Figure 10, we note that = = 0 for the final round. In each of the interim rounds, we note that = 1 because without any DIVIDE-tagged subrange the next round except the final one would not be necessary. In each of the interim rounds, it is not necessarily the case that = 1.
Assuming that normally it holds, < iff < 7 − 4. The values of and depend on several parameters such as the number of tuples, distribution of joining attribute values, their count distribution, iceberg threshold, effectiveness of non-joinable value filtering, and so on. In the experiments in the previous subsection, it turns out that < 3 while ≥ 2 on average. For such values, < as shown by the measurements in the experiments.
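The break-even condition above can be checked numerically. The sketch below is a hedged reading of the inequality: per round, a JOIN-tagged subrange contributes 2 and a DIVIDE-tagged subrange contributes 3, each round adds 2 more, and a constant 1 is added. The names `rounds`, `turnovers`, `join_flag`, and `divide_flag` are illustrative and are not the notation used in the derivation.

```python
def srja_bound(round_flags):
    """Right-hand side of the break-even inequality.

    round_flags: list of (join_flag, divide_flag) pairs, one per SRJA round,
    where each flag is 0 or 1.
    """
    rounds = len(round_flags)
    return sum(2 * j + 3 * d for j, d in round_flags) + 2 * rounds + 1


def typical_flags(rounds):
    """Flags under the stated assumption: both flags are 1 in every
    interim round and both are 0 in the final round."""
    return [(1, 1)] * (rounds - 1) + [(0, 0)]


def ours_needs_fewer(turnovers, round_flags):
    """True if the number of turnovers is below the break-even bound."""
    return turnovers < srja_bound(round_flags)


if __name__ == "__main__":
    for rounds in range(2, 6):
        bound = srja_bound(typical_flags(rounds))
        # Under the typical assumption the bound reduces to 7 * rounds - 4.
        print(rounds, bound, 7 * rounds - 4)
```

With two rounds, the bound is 10, so a sequence with 5 turnovers such as the one in Figure 4 stays well below it.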

Related Work
In [21], a technique called Distribute-Broadcast Join, adapted from the nested-loop join, is proposed. The main contribution of this work is the cost-based selection of the optimal join region in WSNs. In [22,23], Mediated Join, also adapted from the nested-loop join, is proposed, where the cost-based selection of the inner and outer relations for the nested-loop join is investigated. In [20], distributed algorithms for indexed nested-loop join and hash join are proposed. In the former, a technique of dynamically creating and using a distributed B+ tree in WSNs is developed. In the latter, a technique of partitioning and joining tuples with geographic hashing is investigated. In [25,26], a pair-wise join between any two sensor nodes is investigated. Multiple routing trees are employed, and cost-based join initiation for long-running join queries is proposed. The issue of adaptive and cost-based re-optimization against the changes of sampled data is also dealt with. In [2], Synopsis Join, adapted from hash join as well as semijoin, is proposed. It employs geographic hashing for partitioning and filtering of the joining attribute values and optimally determines the nodes of the final joins for each matched value. In [29], Two-Phase Self Join, where one join operand relation is fully reduced with a semijoin, is proposed. To process a join query, it employs a query decomposition technique assuming that the selection predicate on one relation is highly selective. In [27], SENS-Join based on a 2-way semijoin is proposed. It handles a general type of join predicates on multiple attributes, using a quad-tree as a multi-dimensional join filter based on Z-ordering. In [19], PEJA, adapted from hash join as well as sort-merge join, is proposed. In WSNs, physical sorting of the tuples distributed over the sensor nodes is infeasible. Thus, it conducts a logical sort of tuples by dividing the joining attribute value range, and partitions and filters tuples with geographic hashing.
In [28,30], algorithms using the join filters are proposed for continuous queries on multiple attributes. At each sensor node involved in the join query processing, a join filter is installed for each joining attribute and only those tuples whose join attribute values pass the relevant filters are sent to the base station.
Other issues addressed by the state of the art include routing protocols, query dissemination, join initiation, involvement of the base station in in-network processing, and collection of metadata for continuous joins [3].
In [18,24,31], in-network join processing where the Bloom filter is employed in WSNs is addressed. In [18], for a join between an external relation and the virtual relation in WSNs, a technique is investigated where a Bloom filter constructed with the joining attribute values of the external relation is injected into the network. In [24], a technique is proposed where the Bloom filter is transmitted instead of the joining attribute values in in-network computation of the two semijoins ⋉ and ⋊ to answer a join ⋈ . In this technique, when the Bloom filter constructed for is disseminated through the routing tree of , it is selectively forwarded to the subtrees of the routing tree such that the cost of sending it is always compensated by the reduction. The optimal solution for such selective forwarding is proposed. In [31], an extension of the Bloom filter called Window Bloom filter (WBF) is devised to support a general join query with a time window. The proposed technique represents join attribute values in a compact way with the WBF, and saves energy by not sending the redundant join attribute values.
As described in Section 3.3, in SRJA [6], the value range instead of the joining attribute values is sent as a histogram-based synopsis. Transmitting the value ranges instead of the joining attribute values is considered for in-network join processing [18]. In [2], a technique using a histogram-based synopsis of the join operand relations coupled with geographic hashing is proposed. However, the considered join in these techniques is not an iceberg join.
The iceberg query was first introduced in [32]. It is defined as a query that retrieves aggregate values above some specified threshold in applications such as data warehousing, data mining, and information retrieval. Iceberg query processing over distributed data was also investigated [33–35]. However, the join query was not dealt with in these works. In [7], an iceberg distance join in spatial databases was investigated. In this work, the type of the join query is different from the one we consider in this paper in that the cardinality constraint is only on one join operand relation. That is, when the iceberg threshold is α, a tuple of one relation that is joined with more than α tuples of the other relation qualifies for the query result.

Conclusions
In this paper, we investigated an alternative approach to processing an iceberg join in WSNs, and described an optimized scheme. In the previous approach, the join predicate is checked first, and the cardinality constraint is checked next. In our approach, the order is reversed. Our scheme refers to the aggregate count of each of the joining attribute values, logically fragmenting the relations on the counts. Based on the fragmentation, it generates the optimal sequence of 2-way fragment semijoins using a Bloom filter as a synopsis of joining attribute values in filtering non-joinable tuples. A detailed set of experiments showed that our approach is substantially superior to the previous one.
As future work, we plan to extend our scheme to more complicated cases. First, the two regions producing the virtual relations to be joined could partially or fully overlap with each other. Second, more than two regions could be involved in multiple joins. Third, the correlations among the sensor readings along n neighboring regions could be monitored with an n-way join.

Conflicts of Interest
The author declares no conflict of interest.