Distributed Efficient Similarity Search Mechanism in Wireless Sensor Networks

The Wireless Sensor Network similarity search problem has received considerable research attention due to sensor hardware imprecision and environmental parameter variations. Most of the state-of-the-art distributed data centric storage (DCS) schemes lack optimization for similarity queries of events. In this paper, a DCS scheme with metric based similarity searching (DCSMSS) is proposed. DCSMSS takes motivation from vector distance index, called iDistance, in order to transform the issue of similarity searching into the problem of an interval search in one dimension. In addition, a sector based distance routing algorithm is used to efficiently route messages. Extensive simulation results reveal that DCSMSS is highly efficient and significantly outperforms previous approaches in processing similarity search queries.


Introduction
This paper considers a distributed information delivery and search service for one or more applications in a Wireless Sensor Network (WSN) that utilizes in-network storage, which is known as Data Centric Storage (DCS) [1]. The applications consist of a set of producer and consumer nodes that can exchange information by relaying packets through neighboring sectors. Nodes have no explicit knowledge of each other but are aware of the applications. The service implements an information delivery and search layer between applications and nodes that provides enhanced reliability and improved flexibility. This paper introduces Data Centric Storage with Metric based Similarity Searching (DCSMSS), a highly scalable distributed information service based on Disk Based Data Centric Storage (DBDCS) [2] that incorporates similarity searching. Similarity searching is a data query that searches for an exact match or for data within a specified similarity range. It is particularly useful where users seek data within a WSN that either matches or is close to a match.
The member nodes in a sector or zone report the sensed event to their associated Sector Head (SH), which aggregates the received events at the end of each epoch (length of a Time Division Multiple Access (TDMA) slot assigned to each sector). The aggregated event is hashed to produce a hash key, which is mapped from a one dimensional domain into a metric space utilizing a normalized and adapted variant of iDistance [3]. The distance between a data point and its closest reference point plus a scaling value is called the point's iDistance. In this paper distances between data points and reference points in the multi-dimensional space have been mapped to one-dimensional values.
The DCSMSS scheme presented is used to balance information transfer loads across the network, enhance reliability and provide efficient similarity searching within a distributed network for two types of queries: range queries and K-nearest neighbor (KNN) queries. DCSMSS uses a lightweight Sector Based Distance (SBD) routing algorithm, presented in [2,4], to route inter-sector storage, intra-sector storage and query traffic. The domain of the derived hash key of an aggregated sensed event, denoted by HD, is mapped into the metric space of the DBDCS architecture. In order to balance the load among the sectors, a pivot point generation procedure is used to divide HD into almost equally populated sub-intervals, denoted by hDi, where hDi ≠ hDj and 0 ≤ i ≤ j ≤ S; S refers to the total number of sectors. In order to store an event, the target sector is determined from the derived hash key and the pivot points. Furthermore, the target SH distributes the load among its member nodes based on the hash key value and the distance to the member nodes.
The remainder of this paper is structured as follows: Section 2 provides an overview of the related work in the literature. Network architecture, data processing and mapping, SBD routing, insertion and querying are illustrated in Section 3. Section 4 describes the SBD analytical model. This is followed by the simulation results and performance evaluation of DCSMSS and SBD presented in Section 5. The paper is concluded in Section 6.

Related Work
A detailed literature survey that discusses key research on DCS techniques is presented in [1,5]. This section reviews the work most closely related to the research reported in this paper.
In order to process similarity search queries efficiently, Chung, et al. [6] propose a novel framework over a data-centric storage structure, referred to as the Similarity Search Algorithm (SSA), based on the concept of a Hilbert Curve. The lack of global knowledge about the entire sensor database is identified as one of the major challenges in processing a sensor network similarity search query. In order to overcome this constraint, SSA presents a network layout based on a Hilbert curve.
Li, et al. [11] proposed a method called Distributed Index for Multi-dimensional Data (DIM), which supports both point and range queries in a multidimensional DCS model. In DIM, each sensor is linked as a node in a tree structure where each node represents a range of values. A root node represents the entire range of values and splits into two equal parts for the left and right child nodes. This process continues for each non-leaf node until the leaf nodes are reached. Table 1 summarizes related work with corresponding features.

Network Architecture
The surface (platter) of a magnetic disk storage device, which consists of tracks and sectors, provides an interesting model that may be applied to a large scale WSN. This observation led to the Disk Based Data Centric Storage (DBDCS) architecture shown in Figure 1a, which divides the rectangular field into a matrix of storage cells (referred to as sectors), where rows and columns represent tracks (Ti) and sectors (Sj), respectively. In DBDCS, the covered network is treated as a storage surface and the sector is the basic storage cell. However, unlike a magnetic storage disk, DBDCS does not keep the header file for data mapping in a single location; instead, a dynamic mapping algorithm based on hashing is used. Hence, each SH can calculate the target sector in which to read or write the corresponding data. The physical deployment is mapped to an m × n matrix, where m is the number of tracks and n is the number of sectors in each track. Hence, the nodes in the network are divided into S (m × n) sectors, each comprising a Sector Head (SH) and sector members that communicate via one hop with the SH (see Figure 1c), where SHi, i ∈ [1…S]. Each node is configured to be aware of the deployment layout as follows: (1) all SHs are assigned the sector number as a virtual address and node id, and (2) all member nodes know their own node id and the number of tracks (m) and sectors (n) of the network field. As shown in Figure 1b, intra-sector communication (i.e., communication from sector members to the SH or vice versa) is constrained to one hop, while inter-sector transmission is multi-hop. For simplicity, the sensor nodes inside each sector are not shown explicitly in Figure 1b. Instead, an aggregated link (see Figure 1c) is shown to represent the total traffic from the member nodes to the head node.

Metric-Based Searching
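A metric space M can be defined as a pair M = (D, d), where D is the domain of objects and d is the distance function d: D × D → R satisfying the following constraints for all objects a, b, c ∈ D:

$$d(a, b) \geq 0, \qquad d(a, b) = 0 \iff a = b, \qquad d(a, b) = d(b, a), \qquad d(a, c) \leq d(a, b) + d(b, c) \tag{1}$$

In this metric space, two types of similarity queries can be defined, the range query Range(q, r) and the K-nearest neighbor query KNN(q, k), each by its resultant set X, considering D_I ⊆ D to be a finite set of indexed objects:

$$\mathrm{Range}(q, r):\; X = \{x \in D_I \mid d(q, x) \leq r\} \tag{2}$$

$$\mathrm{KNN}(q, k):\; X \subseteq D_I,\; |X| = k,\; \forall x \in X,\; \forall y \in D_I \setminus X:\; d(q, x) \leq d(q, y) \tag{3}$$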
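The two query types can be illustrated with a brute-force sketch over a small finite set of indexed objects. This is only an illustration of what each query returns, not the distributed DCSMSS procedure; the function names, the Euclidean metric and the sample points are assumptions made for the example.

```python
from math import dist  # Euclidean distance (Python 3.8+); assumed metric d

def range_query(indexed, q, r, d=dist):
    """Return every indexed object whose distance to q is at most r."""
    return [x for x in indexed if d(q, x) <= r]

def knn_query(indexed, q, k, d=dist):
    """Return the k indexed objects closest to q (ties keep input order)."""
    return sorted(indexed, key=lambda x: d(q, x))[:k]

# Hypothetical normalized two-attribute events.
points = [(0.1, 0.2), (0.4, 0.4), (0.9, 0.8), (0.5, 0.5)]
in_range = range_query(points, (0.5, 0.5), 0.2)
nearest = knn_query(points, (0.5, 0.5), 2)
```

Both functions scan every object, which is exactly the cost that an index such as iDistance is meant to avoid.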
The data space can be divided into S segments (S is the total number of sectors) with a pivot point, denoted by Pi, for each sector Si. The iDistance key for an object x ∈ D can be defined as (Figure 2a):

$$iDist(x) = d(P_i, x) + i \cdot c \tag{4}$$

In Equation (4), c is the separating constant for individual sectors. Given q ∈ D, the range query for q with range r is defined by Equation (5) (Figure 2b), in which q denotes the query point and Pi denotes the pivot point for SHi, where Pi ≤ q ≤ Pi+1. Therefore, after locating the target sector (SHi), the conceptual range can be determined from Equation (5), as illustrated in Figure 2b. The axis shown in Figure 2 represents the one-dimensional data space, which has been divided into S segments, where each sector is mapped to a segment.
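As a rough illustration of the key construction, the sketch below computes a classic iDistance-style key for a one-dimensional value: the distance to the nearest pivot plus the pivot index times the separating constant c. It is a minimal sketch under stated assumptions, not the paper's adapted variant: the nearest-pivot rule, the absolute-difference metric, the pivot values and the helper names are all hypothetical, and the interval returned by `range_interval` only covers the partition of the query's own pivot.

```python
def idistance_key(x, pivots, c):
    """Key = i * c + |x - P_i|, where P_i is the pivot nearest to x (assumed rule)."""
    i = min(range(len(pivots)), key=lambda j: abs(x - pivots[j]))
    return i * c + abs(x - pivots[i])

def range_interval(q, r, pivots, c):
    """One-dimensional key interval searched for a range query (q, r)."""
    key = idistance_key(q, pivots, c)
    return (key - r, key + r)

pivots = [0.1, 0.5, 0.9]   # illustrative pivot points
c = 1.0                    # illustrative separating constant
key = idistance_key(0.6, pivots, c)          # nearest pivot has index 1
lo, hi = range_interval(0.6, 0.05, pivots, c)
```

Choosing c larger than any distance inside a partition keeps the key ranges of different sectors from overlapping, which is what allows a similarity search to be answered as a one-dimensional interval search.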

Data Processing and Mapping
A sensed event E can be defined by an l-dimensional tuple (A1, A2, A3, …, Al), where Ag (1 ≤ g ≤ l) denotes the gth attribute and DAg is the domain of attribute Ag. Each member node of a sector transmits the sensed event as an l-dimensional tuple (vi1, vi2, …, vil), where 1 ≤ i ≤ Mk, Mk is the total number of member nodes in the kth sector and vij denotes the value of the jth attribute received from the ith member node of the kth sector. The corresponding SH, after collecting tuples from all the member nodes, aggregates them at the end of each epoch before finding the target SH mapping. The aggregated event at epoch t is given by Equations (6) and (7). Here, it is assumed that the aggregated attribute values ψi have been normalized to the range between 0 and 1. As an example, consider the 6th sector (k = 6) in Figure 1a, where M6 = 3. If the total number of attributes is 3, then for any particular round (for example t = 2), Equation (6) can be illustrated as shown in Table 2.

Table 2. Example of Equations (6) and (7). Columns: Member Node, First Attribute, Second Attribute, Third Attribute; last row: after applying Equations (6) and (7).

As shown in Table 3, weights have been assigned to different attributes based on their importance in the event description. Hence, an attribute with a higher weight has greater influence on the similarity among events.

Table 3. Weight settings. Columns: Attribute, Weight.
The domain of the one-dimensional derived hash key HD of an aggregated l-dimensional sensed event can be defined by α (αmin, αmax), as illustrated in Figure 3. In Equations (8)-(11), Ai(min), Ai(max), Ai(avg) and Ai(θ) denote the minimum, maximum, average and threshold values of the ith attribute. The center of mass (COM), denoted by β, is derived in Equation (10) to find the normalized center point of the domain of the hash key HD, whereas δ is the separating factor between two pivot points. In order to balance the load among sectors, it is important to find the range where the concentration of data points is high. Hence, β and δ are used to find this COM range, denoted by β (βrange-min, βrange-max), as shown in Equation (12). A separating step between two pivot points in the COM range, denoted by η, is then defined, and the pivot points for the S sectors are derived at each sector head using Algorithm 1.

Mapping
Given l attributes in an attribute list, each associated with a weight wj (1 ≤ j ≤ l) in a WSN application, the source SHk generates the hash value h using Equation (15). Hence, after each epoch t, SHk forwards the aggregated event to the destination sector head, denoted by SHi, where Pi ≤ h ≤ Pi+1 and Pi and Pi+1 are the lower and upper limits of the ith sub-interval, respectively.
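The lookup of the destination sector from a hash value can be sketched as a binary search over the sorted pivot points. This is a minimal sketch, assuming that the pivots are held in a sorted list with one more entry than there are sectors; the function name `target_sector`, the clamping of out-of-range values and the pivot values are hypothetical choices made for the example.

```python
from bisect import bisect_right

def target_sector(h, pivots):
    """Return the index i with pivots[i] <= h < pivots[i + 1].

    Values below the first pivot or at/above the last pivot are clamped to
    the first or last sub-interval (an assumption, not taken from the paper).
    """
    i = bisect_right(pivots, h) - 1
    return max(0, min(i, len(pivots) - 2))

# Five pivots delimit four sub-intervals, i.e., four sectors (illustrative values).
pivots = [0.0, 0.2, 0.45, 0.7, 1.0]
sector = target_sector(0.5, pivots)
```

Because every SH holds the same pivot list, each one can compute the destination locally without consulting a central directory.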

SBD Routing
In order to relay aggregated packets from SHk to SHi, DCSMSS uses the Sector Based Distance (SBD) routing algorithm [4]. Each round of SBD consists of two phases: (a) the learning phase and (b) the relaying phase. The learning phase is further divided into three stages: (I) the sector head TDMA slot assignment stage, which uses the grid coloring algorithm (GCA); (II) the Member-SH association stage; and (III) the intra-sector TDMA slot assignment stage for member nodes, managed by the SH. In the first stage of the learning phase, each SH finds the non-overlapping operating slot for its sector using Algorithm 2. It is assumed that each SH is configured to be aware of the number of sectors in the deployment layout. Using Algorithm 2, all sectors of a grid of any size can be assigned a conflict-free TDMA slot by reusing only four time slots. For example, Algorithm 2 has been applied to a grid of 30 sectors (see Figure 4). Each sector of the grid is assigned a conflict-free time slot by reusing only four time slots (C0~C3). Sectors with the same time slot can operate concurrently without any interference. Hence, the frame length of a round, denoted by L, can be defined as:

$$L = 4 \times \Delta t \tag{16}$$

Here, ∆t is the length of the TDMA time slot assigned to each sector.

Algorithm 1. Pivot Point Generation Algorithm (implemented at each SH node).
Input: attrRangeTable (containing minimum, maximum, average and theta of each attribute), W (weights to different attributes based on their importance in the event description).
Output: P (derived pivot point for each sector)

Algorithm 2. Conflict-free TDMA frame slot assignment GCA (implemented at each SH node).
Input: HD = 2 (circular hop distance between two sectors), m, n (total number of tracks (or rows) and sectors (or columns) in the grid, respectively)
Output: Conflict-free time-slot (Ci) with frame length L = 4 × epoch (length of the slot assigned to a sector)
1: for each j from 1 to m do
…
16: end for
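One simple way to reuse four slots over a grid of any size is to derive the slot from the parity of the row and column indices. The sketch below uses that parity rule as an assumption; it is not necessarily the rule used inside Algorithm 2, whose steps are not reproduced here. The names `slot` and `is_conflict_free`, and the choice of treating the eight surrounding sectors as interfering neighbors, are hypothetical.

```python
def slot(row, col):
    """Assign one of four slots (0..3, standing for C0..C3) from index parity."""
    return 2 * (row % 2) + (col % 2)

def is_conflict_free(m, n):
    """Check that no sector shares a slot with any of its 8 surrounding sectors."""
    for r in range(m):
        for c in range(n):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < m and 0 <= cc < n:
                        if slot(r, c) == slot(rr, cc):
                            return False
    return True

# A 5 x 6 grid gives 30 sectors, the grid size used as an example in the text.
ok = is_conflict_free(5, 6)
FRAME_SLOTS = 4  # frame length L = 4 * delta_t, independent of grid size
```

With this rule, two sectors that share a slot are always at least two grid steps apart, and the frame length stays at four slots however large the grid becomes.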
In the Member-SH association stage, each SH broadcasts a beacon frame, and a member may receive beacon messages from more than one SH. Each member node then sorts the received beacon frames by Received Signal Strength Indicator (RSSI) into a vector ν(SHi, RSSIi), where RSSIi ≥ RSSIi+1. In the presence of channel noise, fading and attenuation, it is not always possible to identify the closest SH using RSSI alone. Hence, in order to find the closest SH accurately, the round trip time (RTT) method is used as well. According to this method, each Member Node (MN) sends a request packet to all candidate SHs in the list and waits for an immediate acknowledgment. After receiving the acknowledgment, the MN calculates the distance to the corresponding SH from the time of flight (TOF). It then calculates a ranking number for each candidate SH based on both RSSI and TOF and selects the SH with the highest ranking from the candidate list (see Algorithm 3).
According to this method, the time of flight, referred to as TTOF, is calculated by Equation (17) from TRTT, the round trip time of flight, and TTCP, the time to compute the packet. The distance between two nodes is then calculated by Equation (18), in which c is the speed of light. After accounting for ranging errors, Equation (18) can be rewritten as Equation (19) [13], where ε_RTT^LOS is the ranging error that occurs in a line of sight setting and ε_RTT^NLOS is the ranging error that occurs in a non-line of sight environment.
The negative impact of multipath effects, a major contributor to ε_RTT^LOS, can be minimized using an empirical approach [14]. Uncertainties and noise in the hardware, especially jitter effects, play a key role in ε_RTT^NLOS. Taking the jitter component into account, TTOF can be calculated as in Equations (20) and (21) [15], where TOFR is the TOF of the request packet, TOFA is the TOF of the acknowledgment packet, JtN is the jitter caused by the transceiver clock and JcN is the jitter caused by the microcontroller clock.
The timestamps that are used to calculate the time between sending a request packet and receiving an acknowledge packet contain the jitter values Jt0, Jc0, Jt3 and Jc3. Another two timestamps that are considered in calculating the computation time between receiving a packet and sending the first bit of the ACK packet contain the jitter values Jt1, Jc1, Jt2 and Jc2.
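The ranging step can be sketched numerically under the usual two-way ranging convention, in which the one-way time of flight is half of the round trip time after the responder's processing time has been subtracted. This convention is an assumption made for illustration and may differ in detail from the paper's own equations; error and jitter terms are ignored, and the function names and timing values are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def time_of_flight(t_rtt, t_tcp):
    """One-way TOF, assuming a symmetric link: (T_RTT - T_TCP) / 2."""
    return (t_rtt - t_tcp) / 2.0

def distance(t_rtt, t_tcp):
    """Distance estimate from the one-way time of flight, without error terms."""
    return SPEED_OF_LIGHT * time_of_flight(t_rtt, t_tcp)

# Hypothetical measurement: 1 microsecond of processing plus 200 ns in the air.
d = distance(t_rtt=1.2e-6, t_tcp=1.0e-6)  # roughly 30 m
```

At these time scales, a timestamp error of a single nanosecond shifts the estimate by about 0.15 m, which is why the jitter terms of the transceiver and microcontroller clocks matter.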
The MNs then calculate the rank matrix for each candidate SH using Equation (22), in which MN is the total number of member nodes in the Nth sector. Each MN then sends an association request to the SH that has the highest rank in its list. This ensures that a member node is associated with its closest head node (see Algorithm 3). The SHs create a child table listing all the member nodes from which they receive an association request.
In the third stage of the learning phase, SHs broadcast a packet containing Ck (0 ≤ k ≤ 3), ∆t and an array γ, where γ = {m1, m2, m3, …, mi} and |γ| = Mk. In γ, mi and i denote the member node ID and the index of this member node in the array, respectively. Each member node then calculates its intra-sector transmission slot from its position in the array γ using Equation (23), which depends on the length of the intra-sector TDMA time slot and on MS-ID, the node's own network address (self-ID). The number of member nodes in a sector varies due to the dynamic nature of the Member-SH association procedure. Hence, the length of an intra-sector TDMA time slot is defined by Equation (24).
In the relaying phase, all member nodes report their buffered or aggregated sensed data to their associated SH during their allocated intra-sector TDMA transmission slot. After each epoch, i.e., after collecting data from all member nodes, an SH forwards the mapped event data (according to Section 3.2) in a multi-hop fashion to the corresponding sector for storage. In this inter-sector communication, SHs keep forwarding their packets to their immediate neighbor SH on the same row of the virtual grid (Figure 1a) until the packet reaches the SH that is on the same column as the destination sector. The packet is then forwarded vertically up or down until it reaches the destination (Figure 1b). The same routing process is followed for query requests and responses.
Algorithm 4 describes the next hop selection process used for inter-sector communication during the relaying phase, following the row-first, column-second rule just described.
A SH calls Algorithm 4 while acting as either: (I) a relaying node (receives a packet from MAC layer) or (II) a source node (receives packet from application layer).
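The row-first, column-second forwarding rule can be sketched as follows. Algorithm 4 itself is not reproduced here, so this is only an illustration of the rule as described in the text; the (track, sector) tuple representation, the absence of wrap-around at the grid edges and the names `next_hop` and `route` are assumptions.

```python
def next_hop(current, dest):
    """Return the next sector on the path from current to dest.

    Sectors are (track, sector) = (row, column) tuples. The packet first moves
    along its own track until it reaches the destination column, then moves
    up or down that column.
    """
    row, col = current
    d_row, d_col = dest
    if col != d_col:
        return (row, col + (1 if d_col > col else -1))
    if row != d_row:
        return (row + (1 if d_row > row else -1), col)
    return current  # already at the destination

def route(src, dest):
    """Full list of sectors visited, including source and destination."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path

path = route((0, 0), (2, 3))
```

Each SH only needs its own grid position and that of the destination to pick the next hop, so no routing table has to be stored or exchanged.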

Insertion
Within a sector, data is further distributed among nodes according to their distance from the SH. To do this, a sector is divided into segments. Figures 6 and 7 and Table 4 illustrate the idea of sector segmentation. Given a kth sector containing Mk member nodes, SHk first sorts all member nodes by RSSI in ascending order. The member nodes are then divided into r segments. Each segment forms a ball, denoted by B(X,Y)(ri), centered at (X, Y) with radius ri, where (X, Y) are the geographic coordinates of SHk. The number of segments depends on the WSN application, the size of a sector and the number of member nodes in each sector. Thus, the set of sensors that are within a Euclidean distance ri from (X, Y) forms a segment.
The rest of the head nodes, SHj+1, SHj+2, …, SHk−1, pull data from all of their member nodes. Suppose, as in Figure 8, that hq is the hash value of the query (q, r). Hence, the range of the hash is [hq − r, hq + r], where hq − r belongs to the (i − 1)th sector and hq + r belongs to the (i + 1)th sector. Thus, the target head nodes are those of the (i − 1)th, ith and (i + 1)th sectors. Within the (i − 1)th and (i + 1)th sectors, data is fetched only from the member nodes of the segments that fall within the query range, whereas within the ith sector data is fetched from the whole sector.
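The selection of target sectors for a range query can be sketched by locating both ends of the hash interval [hq − r, hq + r] among the pivot points. This is a minimal sketch, assuming a sorted pivot list with one more entry than there are sectors; the function name, the clamping at the ends of the domain and the pivot values are hypothetical.

```python
from bisect import bisect_right

def sectors_for_range(hq, r, pivots):
    """Return the indices of all sectors whose sub-interval meets [hq - r, hq + r]."""
    last = len(pivots) - 2  # index of the last sub-interval

    def locate(h):
        # Sub-interval containing h, clamped to the valid range (an assumption).
        return max(0, min(bisect_right(pivots, h) - 1, last))

    return list(range(locate(hq - r), locate(hq + r) + 1))

pivots = [0.0, 0.2, 0.45, 0.7, 1.0]          # illustrative pivot points
narrow = sectors_for_range(0.3, 0.01, pivots)  # stays inside one sector
wide = sectors_for_range(0.5, 0.27, pivots)    # spans three neighboring sectors
```

The sectors at the two ends of the returned list are only partially covered by the query, which is where fetching from selected segments rather than the whole sector saves traffic.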

K-Nearest Neighbor Query
Like the range query, a query node first calculates the hash hq using Equation (28) for a K-nearest neighbor query, denoted by KNN(q, k). Here, q is defined by an l-dimensional tuple (q1, q2, q3, …, ql), where qg (1 ≤ g ≤ l) denotes the query value of the gth attribute, and k is the number of nearest neighbor nodes containing data similar to q. The KNN(q, k) query is first forwarded to the target sector head node, denoted by SHi, where Pi ≤ hq ≤ Pi+1.
The KNN retrieval protocol is iterative. The SH scans through its segmentation table and includes the closest segments one after another until the stopping condition is satisfied.
Range Query Example (member nodes m1, m3, m14, m17, m8, m10, m4, m9, m5).
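The iterative inclusion of segments can be sketched as a loop that adds segments, closest first, until enough member nodes have been collected. The stopping condition used here, that the number of collected nodes reaches k, is an assumption made for illustration, since the exact condition is not reproduced in the text; the function name and the grouping of node IDs into segments are also hypothetical.

```python
def segments_for_knn(segments, k):
    """Pick segments in order (closest first) until they hold at least k nodes.

    `segments` is a list of lists of member node IDs, already ordered from the
    closest segment to the farthest. Assumed stopping rule: node count >= k.
    """
    chosen, count = [], 0
    for segment in segments:
        if count >= k:
            break
        chosen.append(segment)
        count += len(segment)
    return chosen

# Hypothetical segmentation table of one sector.
table = [["m1", "m3"], ["m14", "m17", "m8"], ["m10", "m4"]]
picked = segments_for_knn(table, 4)
```

If the segments of the target sector hold fewer than k nodes in total, all of them are returned and the search would have to continue in neighboring sectors.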

SBD Analysis
This section analyzes the SBD performance in terms of routing message complexity (total number of message transfers in the network). The notations used in this section are summarized in Table 5. For simplicity it is assumed that the data transmission is error-free. Assume that the local sensor sampling and reporting rate to SH is α, the remote update rate is λ and the query rate is η. Let Clu, Cru and Cqr be the cost of local update, cost of remote update and cost of getting an answer to a query, respectively. Hence, based on this assumption the overall message routing complexity can be defined as shown in Equations (30) and (31).
Here, Ctx is the transmission cost by a wireless sensor node that covers a transmission range of r inside a sector.
For a single remote update issued by S0 (represented by track t0 and sector s0) to the SHs in S1 (t1, s1), S2 (t2, s2), ..., Sn (tn, sn), SBD sends out n updates to n different SHs. Let Cru,t0,s0,n and Ctx,SH be the cost of the remote update from S0 to S1, S2, ..., Sn and the transmission cost between two SHs, respectively. The cost of this remote update routing is given by Equation (32).
For simplicity, consider that the data first travels toward the corresponding track and then toward the sector; this fixes the longest distance that the data travels up or down. Before forwarding an update, packets that have the same destination as their next hop can be merged, which optimizes traffic and minimizes the horizontal and vertical routing cost. In the ideal situation, a lower bound on the achievable routing cost is obtained. Hence, Cru,t0,s0,n is bounded below and above by the respective bounds defined in Equation (37).
The producer SH node and the target storage node are considered to be randomly distributed, and all sectors are assumed to have the same probability of disseminating updates. The remote update cost is then defined in terms of S, the total number of sectors.
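To give a feel for the quantities involved, the sketch below counts inter-sector hops for grid routing that moves along one axis and then the other, and averages the count over uniformly random source and destination sectors. It is an illustrative calculation under the stated uniformity assumption, not a reproduction of the paper's cost equations; the function names are hypothetical, and the result would still need to be multiplied by a per-hop transmission cost such as Ctx,SH.

```python
from itertools import product

def hops(src, dst):
    """Inter-sector hops between (track, sector) positions: |dt| + |ds|."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def mean_hops(m, n):
    """Average hop count over all ordered (source, destination) sector pairs."""
    cells = list(product(range(m), range(n)))
    total = sum(hops(a, b) for a in cells for b in cells)
    return total / (len(cells) ** 2)

average_2x2 = mean_hops(2, 2)   # small grid that can be checked by hand
average_5x6 = mean_hops(5, 6)   # 30-sector grid
```

In the 2 × 2 case, each sector is 0 hops from itself, 1 hop from two neighbors and 2 hops from the diagonal sector, which gives an average of exactly 1 hop.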

Performance Evaluation
Simulations were conducted using Castalia v3.2 [16] running on top of OMNeT++ [17] to evaluate the performance of SBD and DCSMSS. The system parameters and their settings used in the experiments are summarized in Table 6. The network model (described in Section 3.1) was tested in four rectangular fields with different parameter settings. Each simulation was run 30 to 40 times with different seeds affecting the channel, and the results are reported as averages with 95% confidence intervals. In Section 5.1, the performance of SBD is tested in terms of Energy Consumption and Latency. For the experiments presented in Sections 5.1.1 and 5.1.2, the routing efficiency of SBD is evaluated against Low Energy Adaptive Clustering Hierarchy (LEACH) [18], Greedy Perimeter Stateless Routing (GPSR) [19], Directed Diffusion (DD) [20] and Car Pooling [7]. The querying performance of DCSMSS is evaluated in Section 5.2 in terms of Point Query, Range Query, KNN Query, Similarity Searching and Scalability. For the experiments presented in Sections 5.2.1-5.2.5, the querying performance of DCSMSS is evaluated against SDS, GHT [21] and DD. The weight matrix, and thus the level of significance of each attribute, is set in the configuration file used to initialize the network at deployment. In addition, an XML file can be loaded dynamically at any time from any SH, so that any change in the behavior of the environment or network can be disseminated throughout the network. The frequency of this dynamic dissemination is 1/round, where round = 1, 2, 3, ..., and is set based on how quickly the behavior of the monitored network changes over time. The aggregation schemes are loaded at network initialization and can be changed on demand at run time. However, an on-demand update at run time does not affect previously collected data.
Locality Sensitive Hashing (LSH) is a very powerful tool, but it is best suited to high-dimensional data. In a WSN, the number of dimensions is usually limited and fixed at deployment time, because it depends on the number of sensors attached to a node. Thus, similarity searching based on events that are categorized in terms of attributes is not scalable. In this paper, multi-dimensional data has been normalized into a one-dimensional domain. The domain is segmented into n intervals, where n is the total number of sectors, and each sector is responsible for storing the data that falls in its interval. Hence, this hash function is arguably more suitable than LSH for WSNs.

SBD Performance
The performance of SBD is evaluated in comparison with DD, GPSR, LEACH and Car Pooling routing. These candidate routing protocols were chosen from the literature as acceptable representatives of existing comparable techniques. DD, GPSR and Car Pooling have been used in different DCS schemes over the last decade, while SBD, LEACH and Car Pooling are cluster routing algorithms. DD, a data-centric routing technique, floods the query to a region of interest that contains the data sought. GPSR, one of the most widely used point-to-point routing algorithms, was used in earlier DCS schemes. GPSR implements two distinct routing algorithms: greedy forwarding and perimeter forwarding. Greedy forwarding moves packets progressively closer to the destination at each hop. At a void, where there is no greedy path, GPSR switches to perimeter forwarding mode, in which a packet travels progressively closer to the destination along a planar subgraph of the full radio network connectivity graph. This continues until the packet reaches a node closer to the destination, where greedy forwarding resumes. In LEACH and Car Pooling, sensor nodes are grouped into clusters, each with a Cluster Head (CH). A CH is responsible for data aggregation and for communicating with other CHs on behalf of the cluster nodes. However, unlike LEACH, Car Pooling routing selects as the next hop the neighbor head node that is closest to the destination head node. In addition, packets with a common next hop are aged and sent together in order to reduce overhead, even though they might have different destinations. The following subsections present the performance evaluation of SBD in terms of Energy Consumption, Reliability and Latency against Car Pooling, LEACH, GPSR and DD.

Energy Consumption
This experiment was conducted in a network of 180 nodes in a 90 m × 90 m (8100 m²) field with a simulation time of 60 s. The data production and consumption rate per sector was varied between 0.1 and 15 packets per second. Figure 10a,b show the average energy consumption (joules) per node and the total number of hops, respectively, as a function of the packet rate per sector per second. As shown in Figure 10a, SBD exhibits the lowest energy consumption in all cases (low to high traffic rates). In contrast, the energy consumption and total number of hops of DD are significantly higher than those of the other methods and grow sharply due to its broadcasting. Figure 10b shows an interesting contrast: the total number of hops for SBD, LEACH and Car Pooling is almost the same due to their similar clustering nature. However, despite having similar hop counts, SBD outperforms all other approaches in energy consumption because it employs GCA to allocate conflict-free schedules. This helps to avoid packet retransmission, as the chance of packet loss due to interference or collision is very low (see Section 5.1.2, Figure 11b).

Latency
The setting for this experiment was the same as for the reliability experiment, except that the total numbers of remote storage updates and queries were set to 100 each (generating 300 application packets: 100 storage updates, 100 query requests and 100 query responses). Figure 12a shows the latency of each method. Here, latency is defined as the time from the source sending a remote packet (storage update, query or response) to the destination receiving it. As expected, the latency of each method increases gradually with network size, with one exception. DD exhibits the highest latency of all the methods, especially when there are 80 nodes. This happens because DD broadcasts 100 queries among a small number of nodes, which makes congestion more likely. LEACH, SBD and Car Pooling show similarly low latency. Figure 12b helps to explain the result in Figure 12a. In Figure 12b, the total number of Request to Send (RTS) frames sent by SBD is almost equal to the number of remote packets (remote updates, queries and responses), while for DD it is almost twice that number and for the other algorithms it is about one and a half times. However, despite having lower packet loss and fewer retransmissions than LEACH and Car Pooling, SBD shows similar latency due to its store and forward technique.

Querying Performance
In this section, the performance of DCSMSS with the SBD and LEACH routing algorithms is evaluated in comparison with SDS, DD and GHT. As mentioned earlier, DD broadcasts a query to search for all of the desired data. GHT applies a hash function to the attribute name to find the location of the data and merges the located data into the query result. SDS uses the Locality Sensitive Hash (LSH) function; the number of hash values for a data item after the LSH operation was set to 5.

Point Query
This experiment was conducted to evaluate the performance of each approach for point queries, which return a single data item if an exact match is found. The experiment was conducted using a 90 m × 90 m rectangular field, in which 180 nodes were randomly and independently deployed. In total, 300 queries were generated uniformly from different parts of the network. Queries were generated in groups, referred to as batches, whose queries are sent out at the same time. The next batch was released once all the queries of the previous batch had been resolved or the maximum response waiting time had been exceeded. Figure 13a shows the success rate of the different methods. Success rate is defined as the ratio of the number of successfully resolved queries to the total number of queries generated. This metric reflects the effectiveness of a data storage method. Figure 13a shows that DD exhibits the worst performance and that its success rate falls sharply as the number of queries per batch increases. With an increased number of queries per batch, DD's broadcasting generates excessive messages, which leads to congestion and high packet loss. DCSMSS + SBD maintains a low packet loss due to its collision avoidance technique. The other three approaches (GHT, DCSMSS + LEACH and SDS) fall in the middle. Among these three, GHT's performance is slightly lower. GHT routing uses a node as a step unit rather than a zone or sector. As a result, it generates somewhat higher traffic, causing more congestion and packet loss than DCSMSS + LEACH and SDS. Figure 13b shows that DD's latency grows rapidly due to congestion as traffic increases. Since DD uses broadcasting for data querying, it produces excessive messages and traffic congestion when the number of queries per batch increases. DCSMSS + SBD, DCSMSS + LEACH and SDS have similar latency.
GHT takes the shortest path and thus outperforms the other approaches when traffic is low, but its latency suffers from the congestion caused by the additional traffic as the number of queries per batch increases. The DCSMSS and SDS schemes do not need to send as many queries as GHT, since they rely on a neighbor zone or SH to forward queries, thus reducing traffic and congestion. DCSMSS + SBD produces less traffic than DCSMSS + LEACH and SDS with Car Pooling because it uses the collision avoidance technique (GCA). Due to the collision-free time slot allocated to each SH in the routing layer through GCA, SBD in DCSMSS uses a store and forward technique. However, the overhead added by the store and forward technique is constant regardless of traffic volume. Hence, Figure 13b shows that the latency of DCSMSS + SBD becomes lower than that of DCSMSS + LEACH and SDS as the number of queries per batch increases.

Range Query
This experiment evaluated the performance of range queries in various scenarios. The network size and number of queries were the same as in the previous experiment. In Figure 14a-c, experiments were conducted for four variations of the range query. The range of the queries was varied so that, in cases one to four, the number of sectors holding the target data varies from one to four. For example, DCSMSS + Sector = 2 refers to the case where the target result of the query is to be fetched from two neighbor sectors. Figure 14a shows the average latency of each scenario as a function of the number of queries per batch. As expected, the latency increases when the number of target sectors increases. If the target range of a query includes more than one sector, all the corresponding SH fetch data from their respective segments and return the data to the source SH. The latency for all scenarios grows slightly when the number of queries per batch increases, except for the scenario DCSMSS + Sector = 4. In the case of DCSMSS + Sector = 4, latency begins to grow sharply when the number of queries per batch increases from four to eight. This happens because of the congestion created by the high number of reply packets flowing to the source query node from four neighbor sectors in response to a single query. Figure 14b shows the success rate of each scenario as a function of the number of queries per batch. As expected, the success rate decreases as the number of target sectors increases. The success rate of all scenarios falls slightly when the number of queries per batch increases from one to four, but it drops sharply as the number of queries per batch increases from four to sixteen. Figure 14c shows the number of discovered data items for each scenario when the number of queries per batch is four.

KNN Query
The setting of this experiment was similar to that of the previous experiment. As before, the value of k in KNN (q, k) was varied across four scenarios so that the number of target sectors varied from one to four. Figure 15a shows that the latency increases with the number of target sectors from which the query result is to be fetched. In addition, latency increases for each scenario as the number of queries per batch increases. Figure 15b shows the total number of events finally discovered in comparison to the total number of expected events when the number of queries per batch is four. The discovery rate is 100% when the number of target sectors is one, but it gradually falls as the number of target sectors increases. This happens because of packet loss while responses are being returned. When the number of target sectors increases with the value of k, the volume of reported events for a single query increases significantly. This large number of reported events creates a hotspot and congestion around the query node and the relay SH along its route.

Similarity Searching
The setting of this experiment was the same as that of the previous experiment except for the number of queries generated. The number of actual data items in the system and the number of discovered data items with no less than 50% similarity are shown in Figure 16a. This similarity is measured in terms of a range query. After calculating hq of a query, r is set to 0.25 hq. Thus, the range of the query is [hq − 0.25 hq, hq + 0.25 hq]. The target head nodes to which the query was forwarded were SHj, SHj+1, …, SHk, where Pj ≤ hq − r ≤ Pj+1, Pk ≤ hq + r ≤ Pk+1 and j ≤ k. Figure 16a shows that DCSMSS always discovers more than 85% of this type of data event. Figure 16b shows the discovery rate of DCSMSS, SDS, DD and GHT in terms of similarity between the discovered data and the query. Discovery rate is defined as the percentage of events having a certain similarity to a query that can be discovered. In the second experiment, 100 queries in total were generated with four queries per batch. Since GHT is not locality preserving in data storage, its exact-mapping querying cannot locate similar data, and thus only 100% similar data is considered for GHT. Unlike the other approaches, DD broadcasts queries to all SH and accordingly achieves a 100% discovery rate. SDS and DCSMSS discover 85% to 90% of similar data. However, DCSMSS provides an optimized trade-off between energy consumption, latency and discovery rate.
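
The mapping from a query range to the target head nodes can be sketched as follows. This is a minimal illustration rather than the DCSMSS implementation: it assumes the partition boundaries P are held in a sorted list, that sector i covers the interval [P[i], P[i+1]], and that hq is non-negative. The names target_sectors, boundaries and ratio are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def target_sectors(
    hq: float,
    boundaries: Sequence[float],
    ratio: float = 0.25,
) -> Tuple[int, int]:
    """Return (j, k), the first and last sector index overlapping [hq - r, hq + r].

    boundaries is an assumed sorted list P[0] <= P[1] <= ... where sector i
    covers [P[i], P[i + 1]]; r = ratio * hq follows the 0.25 hq rule above.
    """
    if len(boundaries) < 2:
        raise ValueError("need at least two boundaries to define a sector")

    r = ratio * hq
    low, high = hq - r, hq + r
    last = len(boundaries) - 2  # index of the final sector

    def sector_of(value: float) -> int:
        # Largest i with P[i] <= value, clamped to the valid sector range.
        i = bisect_right(boundaries, value) - 1
        return min(max(i, 0), last)

    return sector_of(low), sector_of(high)
```

With boundaries [0, 10, 20, 30, 40] and hq = 20, the range is [15, 25], so the query would be forwarded to the head nodes of sectors 1 and 2.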

Scalability
This experiment was conducted for four different network field sizes of 60 × 60 m², 90 × 90 m², 120 × 120 m² and 150 × 150 m² containing 80, 180, 320 and 500 nodes, respectively. In total, 200 queries were generated with eight queries per batch. Figure 17a shows the total number of hops. It demonstrates that DD's total hop count is much higher than that of the other approaches and grows sharply, which indicates the poor scalability of DD. The total hop count for DCSMSS + SBD, DCSMSS + LEACH, SDS and GHT grows relatively slowly, which demonstrates the high scalability of these approaches. Moreover, DCSMSS + SBD provides reasonably stable performance in terms of the total number of hops. This implies that this scheme has relatively stable routing performance for WSNs of different sizes. Figure 17b shows the latency of each approach for different network sizes. DD has higher latency than the other approaches, with a dramatically higher latency when the network is small (80 nodes). This happens because DD broadcasts the same number of queries in a small network, creating high traffic and subsequent congestion. In contrast, DCSMSS + SBD, DCSMSS + LEACH and SDS exhibit low latency across varying network sizes. This indicates the high scalability of these approaches. Figure 18 illustrates the experiments conducted in a 120 m × 120 m rectangular field in which 320 nodes were randomly and independently placed. These experiments were run for 50 s with the querying frequency varied from 0.1 to 100 queries/s. Figure 18a,b show the total hop count and latency as a function of the querying frequency. Figure 18a demonstrates that the total number of hops for all approaches increases linearly. However, DD performs worst because its broadcasting technique generates a very large amount of traffic. The total number of hops for the DCSMSS + SBD, DCSMSS + LEACH and SDS schemes is also less than that of GHT.
GHT always sends a query to ten different nodes for every attribute. SDS always sends queries to five sectors, and DCSMSS sends to i sectors depending on the range r. This is why the DCSMSS based schemes show lower hop counts while SDS is slightly higher. As shown in Figure 18b, the latency of each approach increases with the querying frequency. The latency of all approaches grows slightly as the querying frequency increases from 0.1 to ten queries/s and then grows sharply when the frequency increases to 100 queries/s. DD has the highest latency. Since DD broadcasts queries to all sectors, it generates congestion, packet loss and excessive retransmission. Despite having fewer collisions and less subsequent packet loss and retransmission, DCSMSS + SBD has higher latency than SDS and DCSMSS + LEACH for the reason explained in Section 5.1.2. However, it is interesting to note that the latency of GHT is lower than that of the other approaches when the querying frequency is 100 queries/s. This happens because the number of sectors is lower than the number of nodes, and under heavy traffic routing that relies on SH becomes more congested than routing that relies on nodes. Moreover, GPSR, the routing protocol used in GHT, uses a greedy forwarding technique that eventually selects the shortest path to route packets.
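
As a quick check on the experimental setup, the four field sizes and node counts of the first scalability experiment can be compared by average area per node. The short sketch below uses only the values stated above; the names CONFIGS and area_per_node are hypothetical.

```python
from typing import List, Tuple

# (field side length in metres, number of nodes), as stated in the text.
CONFIGS: List[Tuple[int, int]] = [(60, 80), (90, 180), (120, 320), (150, 500)]


def area_per_node(side_m: int, nodes: int) -> float:
    """Average field area in square metres available to each node."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return side_m * side_m / nodes


# One value per configuration, in the same order as CONFIGS.
areas = [area_per_node(side, nodes) for side, nodes in CONFIGS]
```

All four configurations give 45 m² per node, so node density stays constant while the network grows, and the trends in Figure 17 can be read as effects of network size rather than of density.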

Conclusions and Future Work
This paper presented DCSMSS, a highly scalable distributed information service that provides improved performance over comparable schemes. The scheme is an efficient similarity search mechanism for WSNs. DCSMSS was applied to a range of WSN scenarios using modeling, simulation and statistical analysis, and was found to provide lower latency and improved search accuracy when compared to relatively recent alternative approaches. The alternative approaches and the improvements obtained when DCSMSS is applied have been discussed. The research is continuing, with future work considering methods to reduce complexity and improve processing at the nodes and SH in order to reduce energy utilization. DCSMSS has been simulated with a static, non-mobile network. Problems are expected when applying virtual sector formation or synchronization to groups of mobile nodes. Virtual sector or cluster formation in a dynamic WSN is an interesting area for future research. Furthermore, in the current model the SH is the only gateway to the sector, and hence a hotspot could form around the SH. This issue can be resolved in future work by delegating some of the responsibility to MNs, which will act as Secondary SH (SSH). A prototype implementation of DCSMSS is under development using the Texas Instruments (TI) CC2530 Evaluation Module (CC2530EM) [17], which is a ZigBee/IEEE 802.15.4 compliant System-on-Chip with an optimized 8051 MCU core and a radio for the 2.4 GHz unlicensed ISM/SRD band.