This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This paper proposes a data compression algorithm for sensor networks based on suboptimal clustering and virtual landmark routing within clusters. First, the temporal redundancy in data obtained by the same node at successive instants is eliminated. Second, the sensor nodes are clustered, virtual node landmarks are established within the clusters based on the cluster heads, and routing within clusters is realized by combining a greedy algorithm with a flooding algorithm. Third, a global structure tree based on the cluster heads is established. As data are transmitted from nodes to cluster heads and from cluster heads to the sink, the spatial redundancy in the data is eliminated. Only part of the raw data needs to be transmitted from the nodes to the sink, and all raw data can be recovered at the sink from the compression code and that part of the raw data. Node energy is therefore saved, largely because redundant data are not transmitted, and the overall performance of the sensor network is improved.

By integrating different kinds of micro-sensors, sensor networks can monitor the environment or designated objects, send and receive information wirelessly, and deliver real-time information to end users, which realizes information interaction among the real world, computer networks and human society [

To reduce the temporal-spatial redundancy in data obtained by nodes, this paper proposes a data compression algorithm for wireless sensor networks based on suboptimal clustering and virtual landmark routing within clusters (abbr. SC-LVLR). Suboptimal clustering divides the whole network according to spatial correlativity, which is the basis for eliminating spatial redundancy. Optimal routes are established based on virtual landmarks within the clusters and on a structure tree between the clusters, respectively. As data are transmitted from nodes to cluster heads and from cluster heads to the sink, redundant data in adjacent nodes and in adjacent clusters are both eliminated, so the data transmitted from nodes to the sink are greatly reduced and the average energy cost of nodes is reduced as well. Meanwhile, the ability to locate an exceptional node in the system is improved and the longevity of the sensor network is extended.

The second part of this paper will illustrate suboptimal clustering theory and virtual landmark routing within the cluster model established by SC-LVLR. Because the MSTC algorithm also aims to eliminate data temporal-spatial redundancy and represents a class of temporal-spatial redundancy-eliminating algorithms, we compare SC-LVLR and MSTC, and the corresponding flow charts are given in the third part. In the fourth part, the average energy cost per node, signal to noise ratio and the number of expired nodes are taken as criteria to test the performance of the different algorithms and the algorithm performance is analyzed in detail through simulation examples. Finally, our conclusions and an outlook for the future are given.

Dividing the whole sensor network based on the spatial correlativity of data obtained by nodes is not only beneficial for eliminating spatial redundancy but also good for locating exceptions. However, given the diversity of practical applications, sensor networks’ size and number of monitored objects, it is difficult to propose an optimal algorithm that suits all kinds of situations based on the limited energy, memory and computing ability of nodes, so the suboptimal clustering algorithm from reference [

The suboptimal clustering algorithm supposes the information obtained by nodes can be expressed by _{i}

From _{1} = {_{1}}, where _{1} represents an arbitrary node. We can get _{1}) = _{1} when we set the combined united information entropy of node set _{i}_{i}_{i−1} can be updated by adding node _{i}_{i} ∈ V_{i}_{i}_{i−1} and the sum of Euclidean distance from _{i}_{i−1} is the least. Set _{i}_{i−1}_{i}_{i}_{i}_{i−1}. Namely, _{i}_{i}_{i−1}. Combined entropy can be shown as follows:

Then we can get _{i}_{n}

To divide sensor networks effectively, we can begin with a simple situation. Supposing

In _{in}_{extra}

The derivation of

For realizing optimal clustering, we can get _{s}

From _{optimal}

We need to minimize the difference of optimal energy cost _{Snop}

For getting a suboptimal solution, we consider two extreme cases. Namely,

According to

According to

Namely:

Then for simplifying computation, we set

The basic theory about virtual landmark routing within clusters will be demonstrated in the following section. The routing method is based on the connectivity of nodes in clusters and the virtual landmarks of cluster heads to realize routing in clusters, thus avoiding the dependence on practical landmarks of nodes. Meanwhile, a global structure tree rooted from the sink is built among cluster heads and a global routing table can be obtained. Supposing

This theorem can be proved by contradiction and the details can be found in reference [

In the case of continuation, the distance between nodes is expressed by the Euclidean distance. The aim of setting virtual landmarks within clusters is to build a coordinate function set, which is only dependent on the Euclidean distances of different cluster heads. Along the direction of gradient descent of the coordinate function set, the target node is always reachable. Supposing {_{i}

In _{i}_{i}

However, under the condition of

The virtual landmark vector of arbitrary node _{b}_{b}

Under the centering virtual coordinate system, the distance of

_{i},i_{i}_{i}

Under the centering virtual coordinate system, the distance of nodes

At this point the local virtual coordinate system has been built, and routing from a node to the sink can be divided into two parts: routing within clusters and routing outside clusters. Compared with the number of nodes, the number of cluster heads is relatively small, so a global network structure tree rooted at the sink can be built. The least-hops routing table is then broadcast to all cluster heads. Routing within clusters includes non-edge region routing and edge region routing. Non-edge region routing follows the direction of gradient descent of the virtual landmark coordinates within a cluster. In edge region routing, the neighboring cluster head that lies in the direction of gradient descent is set as the target node until the data cross the cluster border; after the data have crossed the border, non-edge region routing is executed.
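The routing described above can be illustrated with a minimal Python sketch. It assumes that hop counts stand in for Euclidean distances, that a node's virtual coordinate is its vector of squared hop counts to the landmarks with the mean subtracted (one plausible reading of the centered virtual coordinate system), and that a breadth-first search stands in for flooding. The names `virtual_coords`, `route` and `flood_path` and the adjacency-dictionary representation are hypothetical and are not taken from the paper.

```python
from collections import deque


def hop_counts(adj, source):
    """Breadth-first hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def virtual_coords(adj, landmarks):
    """Centered landmark coordinates (assumed form); the graph must be connected."""
    per_landmark = [hop_counts(adj, lm) for lm in landmarks]
    coords = {}
    for node in adj:
        squared = [d[node] ** 2 for d in per_landmark]
        mean = sum(squared) / len(squared)
        coords[node] = tuple(s - mean for s in squared)
    return coords


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flood_path(adj, start, target):
    """Shortest path by breadth-first search, used here as a stand-in for flooding."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def route(adj, coords, start, target):
    """Greedy descent on virtual distance, with a flooding fallback when stuck."""
    path = [start]
    current = start
    while current != target:
        if target in adj[current]:
            nxt = target
        else:
            here = sq_dist(coords[current], coords[target])
            best = min(adj[current],
                       key=lambda v: sq_dist(coords[v], coords[target]),
                       default=None)
            if best is not None and sq_dist(coords[best], coords[target]) < here:
                nxt = best  # strictly closer, so the loop always terminates
            else:
                tail = flood_path(adj, current, target)  # local minimum reached
                return None if tail is None else path + tail[1:]
        path.append(nxt)
        current = nxt
    return path
```

Because the greedy step only accepts a neighbor that is strictly closer to the target in virtual distance, the loop cannot cycle; whenever no such neighbor exists, the sketch hands over to the fallback search.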

Based on the theories of suboptimal clustering and virtual landmark routing within clusters described above, the details of the proposed algorithm are presented in the following. First, the difference between monitoring values obtained by the same node at successive moments is coded to eliminate time redundancy. Then spatial redundancy is eliminated in the same way as data are transmitted from nodes to cluster heads and from cluster heads to the sink. The Multi-resolution Spatial and Temporal Coding (MSTC) algorithm proposed by Wang and Hsieh realizes data compression by eliminating temporal-spatial redundancy and is representative of a class of data compression algorithms for wireless sensor networks, so it is described here for comparison with the proposed algorithm. Flow charts of SC-LVLR and MSTC are shown in

(1) The sensor network is divided into clusters based on suboptimal clustering theory, and cluster heads are initially appointed at random by the sink. A Voronoi network is established, and the edge region nodes are then formed.

(2) A structure tree rooted at the sink and including all cluster heads is established. Meanwhile, a virtual coordinate system is built within each cluster, with the cluster head appointed as the reference node. Cluster heads and nodes within clusters are assigned a global mark and a local mark, respectively.

(3) Time redundancy is eliminated by coding the difference between the present monitoring value and the values stored from previous instants (the amount of data stored depends on the actual application and the memory of the node). In this way a group of raw monitoring values can be replaced by reference values and compression code.

(4) Routing within clusters is realized through a greedy algorithm based on the virtual coordinate system. During routing within clusters, spatial redundancy is eliminated in the same way as time redundancy; the difference is that the next-hop node is set as the reference node. The local mark of the reference node is included, and only time redundancy needs to be compressed for the monitoring values obtained by the cluster head.

(5) Once a cluster head has gathered all compressed data within its cluster, the compressed data are transmitted from the cluster head to the sink along the route in the global table. During transmission, if data have reached edge region nodes, the edge region routing discussed above is executed; otherwise, non-edge region routing is executed. Spatial redundancy between adjacent clusters is eliminated when data cross the edge region. Clusters near the sink serve as references and do not execute spatial redundancy compression. As a supplement, flooding routing is executed if routing within a cluster becomes trapped in a local loop.

(6) From the reference values, compression code and global marks transmitted by the cluster heads, the nodes where the spatial compression code was produced can be identified. The sink first decodes the spatial compression code according to the mapping between code and difference, and then decodes the time compression code. In combination with the reference values, the raw data can be recovered at the sink.

(7) The sink checks the energy of the cluster heads periodically. If the remaining energy of a cluster head is lower than a threshold, the node with the most energy in the cluster is elected to replace the original cluster head, the global routing table is updated by the sink and broadcast to all cluster heads, and then step (2) or step (3) is executed.
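The difference coding used to remove time redundancy, and its decoding at the sink, can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary is taken to be a symmetric set of evenly spaced differences, each difference is mapped to the index of the nearest entry, and the encoder tracks the value the decoder will reconstruct so that rounding errors do not accumulate. The paper does not specify the dictionary layout, and the function names are hypothetical.

```python
def build_dictionary(step, levels):
    """Assumed dictionary: 2 * levels + 1 differences spaced `step` apart."""
    return [i * step for i in range(-levels, levels + 1)]


def encode(values, dictionary):
    """Return (reference value, list of dictionary indices) for a value sequence."""
    reference = values[0]
    reconstructed = reference  # what the decoder will hold at this point
    codes = []
    for value in values[1:]:
        diff = value - reconstructed
        # Index of the dictionary entry closest to the actual difference.
        code = min(range(len(dictionary)),
                   key=lambda i: abs(dictionary[i] - diff))
        codes.append(code)
        reconstructed += dictionary[code]
    return reference, codes


def decode(reference, codes, dictionary):
    """Rebuild the value sequence from the reference value and the codes."""
    values = [reference]
    for code in codes:
        values.append(values[-1] + dictionary[code])
    return values


if __name__ == "__main__":
    dictionary = build_dictionary(step=0.5, levels=4)
    readings = [20.0, 20.5, 21.0, 20.5, 20.5]
    reference, codes = encode(readings, dictionary)
    print(reference, codes)
    print(decode(reference, codes, dictionary))
```

The same pair of functions could serve for spatial redundancy by passing the next-hop node's value as the first element, so that it plays the role of the reference. A larger dictionary (smaller `step` or more `levels`) lowers the reconstruction error at the cost of longer codes.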

For comparison, the steps of MSTC are as follows:

(1) The sensor network is divided evenly into k*k parts (the value of k depends on the practical application) based on a virtual grid. Every part is recursively divided into k*k parts until the smallest virtual grid includes only one node. Since the deployment of nodes is random, a smallest virtual grid may include zero, one or more nodes. If more than one node is included in the same smallest virtual grid, their average value is taken as the monitoring value of that grid. If no node is included, a value interpolated from an adjacent grid is taken as the grid value.

(2) The smallest virtual grids form the lowest layer, and a cluster head is elected among each group of k*k adjacent smallest virtual grids. Cluster heads in the higher layers are elected recursively in the same way, so a layered structure tree is established over the whole network.

(3) If the interval is set as T and the value obtained at moment p needs to be transmitted to cluster heads in a higher layer, then the values obtained at moments p + n*T (n = 1, 2, 3, …) also need to be transmitted to cluster heads in the higher layer. Values obtained between moments p and p + T are compared with the value obtained at moment p. If the difference is smaller than a threshold, the values are considered the same as the historic values held in the higher layer and are not transmitted; otherwise, they are transmitted to the higher layer.

(4) Since the memory of nodes is limited, historic values are stored in a reverse-exponential manner: if the present moment is t, the values obtained at moments t-1, t-2, t-4, t-8 and t-16 are stored. That is to say, the nearer a value is to the present moment, the more likely it is to be stored in the node.

(5) To eliminate spatial redundancy, starting from the lowest layer, k*k adjacent virtual grids are mapped into a k*k matrix, which is processed with a discrete cosine transform.

(6) According to a preset compression ratio r, r*k*k values in the transformed matrix are sent to the higher layer and the others are replaced by zeros. The discrete cosine transform is executed recursively from the lowest layer to the highest layer. When all compression values have been obtained, the sink can recover the raw data through an inverse discrete cosine transform.
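The spatial compression step of MSTC can be illustrated with a small Python sketch of a two-dimensional discrete cosine transform followed by coefficient truncation. It assumes an orthonormal DCT-II and that the r*k*k coefficients retained are those of largest magnitude; the text does not state which coefficients are kept, so this selection rule, like the function names, is an assumption made for illustration.

```python
import math


def dct_matrix(k):
    """Orthonormal DCT-II basis C of size k x k, so that C times its transpose is I."""
    rows = []
    for u in range(k):
        scale = math.sqrt(1.0 / k) if u == 0 else math.sqrt(2.0 / k)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * u / (2 * k))
                     for x in range(k)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(block):
    """Forward 2-D transform: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2-D transform: C^T * coeffs * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def keep_largest(coeffs, ratio):
    """Keep round(ratio * k * k) largest-magnitude coefficients, zero the rest."""
    k = len(coeffs)
    keep = max(1, round(ratio * k * k))
    ranked = sorted(((abs(coeffs[i][j]), i, j)
                     for i in range(k) for j in range(k)), reverse=True)
    kept = {(i, j) for _, i, j in ranked[:keep]}
    return [[coeffs[i][j] if (i, j) in kept else 0.0 for j in range(k)]
            for i in range(k)]


if __name__ == "__main__":
    grid = [[21.0, 21.2, 21.1, 21.0],
            [21.1, 21.3, 21.2, 21.0],
            [21.0, 21.2, 21.4, 21.1],
            [20.9, 21.0, 21.2, 21.0]]
    sent = keep_largest(dct2(grid), ratio=0.25)  # 4 of 16 coefficients
    for row in idct2(sent):
        print([round(v, 2) for v in row])
```

Because the number of retained coefficients is fixed by `ratio` regardless of how much the readings vary, the reconstruction error of this scheme depends on how smooth the grid values happen to be.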

The basic theory and realization steps of the algorithm are given above. From the point of view of accuracy of recovered data, average energy cost of node and the number of expired nodes, in this section we make some comparisons and analysis through concrete simulation examples. Compared with the MSTC algorithm, the SC-LVLR algorithm has many merits, as follows:

Suboptimal clustering gathers a suitable number of spatially correlated nodes into a cluster, which is beneficial for eliminating spatial data redundancy, balancing the energy cost of the network and extending the longevity of the whole network. MSTC forms clusters evenly based on a virtual grid, so some clusters include more nodes than others, their cluster heads expire frequently, and local holes may even appear quickly.

The same compression dictionary is used to code the difference when eliminating both spatial redundancy and time redundancy. The accuracy of the recovered data can be regulated by changing the size of the coding dictionary. MSTC, in contrast, eliminates time redundancy by comparing differences with a threshold. If the threshold is too large, the accuracy of the recovered data is noticeably degraded; if it is too small, time redundancy cannot be eliminated effectively.

Spatial redundancy is eliminated in the course of routing based on a local virtual coordinate system, which is beneficial not only for reducing network congestion and improving the real-time ability of the network, but also for reducing the average energy cost of nodes, so the compression performance of the whole network is improved. In MSTC, data from adjacent nodes are mapped into a matrix, which is then processed by a discrete cosine transform. To eliminate spatial redundancy, a compression ratio is set and the transform coefficients are reduced correspondingly. However, reducing the transform coefficients directly according to a fixed compression ratio cannot reflect dynamic changes in the monitored data. If the ratio is too high, the accuracy of the recovered data is greatly reduced; otherwise, the spatial redundancy cannot be eliminated efficiently.

To fairly compare SC-LVLR with MSTC, some efficient evaluation criteria should be used. Reducing average energy cost of nodes is one of the most important aims of designing data compression algorithms for wireless sensor networks, so the average energy cost of nodes is considered here as a performance evaluation criterion. Meanwhile, for measuring the accuracy of recovered data, signal to noise ratio (abbr. SNR) is also taken as one of the evaluation criteria. As one of the most straightforward ways to measure the longevity of networks, the number of expired nodes is considered as the third evaluation criterion of algorithm performance in this paper. The formula of energy cost of nodes in this paper refers to reference [

In _{lp}_{lt}_{lr}_{rt}

In the equation above _{elec}_{amp} =^{2}_{rt}

The meaning of the parameters in

In

For an ordinary processor, if the distance from node to sink is vastly larger than the distance from node to node, the energy cost of executing instructions can be ignored, so the formula of energy cost can be given as follows:

Because cluster heads are responsible for communicating with all nodes within a cluster, the energy cost of cluster heads is larger than that of ordinary nodes. It is thus rational to take the average energy cost of all nodes in the sensor network as a performance evaluation criterion.
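As a rough illustration of how the average energy cost per node could be computed, the Python sketch below uses the first-order radio model that is common in the sensor network literature: transmitting b bits over distance d costs E_elec*b + eps_amp*b*d^2, and receiving b bits costs E_elec*b. Whether this matches the formula used in the paper exactly is an assumption, and the constants `E_ELEC` and `EPS_AMP` are typical illustrative values rather than values taken from the paper.

```python
E_ELEC = 50e-9     # J/bit spent in the radio electronics (assumed value)
EPS_AMP = 100e-12  # J/bit/m^2 spent in the transmit amplifier (assumed value)


def tx_energy(bits, distance):
    """Energy to transmit `bits` over `distance` metres."""
    return E_ELEC * bits + EPS_AMP * bits * distance ** 2


def rx_energy(bits):
    """Energy to receive `bits`."""
    return E_ELEC * bits


def average_node_energy(hops, node_count):
    """Average energy per node for a list of single-hop transmissions.

    Each hop is a tuple (bits, distance, receiver_is_node). Reception is
    counted only when the receiver is a sensor node, on the assumption that
    the sink is not energy constrained.
    """
    total = 0.0
    for bits, distance, receiver_is_node in hops:
        total += tx_energy(bits, distance)
        if receiver_is_node:
            total += rx_energy(bits)
    return total / node_count


if __name__ == "__main__":
    # Two members send 1000 bits to a cluster head 10 m away; the head then
    # forwards 1200 compressed bits to a sink 80 m away.
    hops = [(1000, 10.0, True), (1000, 10.0, True), (1200, 80.0, False)]
    print(average_node_energy(hops, node_count=3))
```

The quadratic distance term is what makes the long hop from a cluster head to the sink dominate, which is why reducing the number of bits a cluster head forwards has a large effect on the average.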

Peak signal to noise ratio:
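A minimal Python sketch of how this accuracy criterion could be evaluated is given below. It assumes the usual definitions, SNR = 10*log10(signal power / error power) and PSNR = 10*log10(peak^2 / mean squared error); the exact expression used in the paper may differ, and the function names are hypothetical.

```python
import math


def snr_db(original, recovered):
    """Signal to noise ratio in dB between raw and recovered values."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, recovered))
    if noise == 0:
        return math.inf  # perfect recovery
    return 10 * math.log10(signal / noise)


def psnr_db(original, recovered, peak=None):
    """Peak signal to noise ratio in dB; `peak` defaults to max |original|."""
    if peak is None:
        peak = max(abs(x) for x in original)
    mse = sum((x - y) ** 2 for x, y in zip(original, recovered)) / len(original)
    if mse == 0:
        return math.inf  # perfect recovery
    return 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    raw = [20.0, 20.5, 21.0, 20.5]
    rec = [20.0, 20.4, 21.1, 20.5]
    print(snr_db(raw, rec), psnr_db(raw, rec))
```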

Nodes are assumed to be deployed randomly in the surveillance area. The following figures compare the performance of SC-LVLR and MSTC. As a reference, the method of transmitting data directly from nodes to the sink, that is, with no network clustering and data compression (abbr. NCDC), is included in the comparisons. Both

Difference coding is used to eliminate time redundancy in SC-LVLR, and the same method is used to eliminate spatial redundancy during routing within and outside of clusters, so the actual data that must be transmitted are reduced considerably and the number of expired nodes is smaller in SC-LVLR than in MSTC. Compared with difference coding along the route in SC-LVLR, MSTC needs more energy because the discrete cosine transform is executed in the cluster heads to eliminate spatial redundancy. The simulation result is shown in

Considering the limited energy, computation ability and memory of nodes and the random deployment of nodes in sensor networks, this paper has proposed a data compression algorithm based on suboptimal clustering and local virtual coordinate routing. Compared with MSTC, SC-LVLR avoids the relatively complex discrete cosine transform. After suboptimal clustering, routing within clusters is obtained based on a local virtual coordinate system, and an optimal global routing table is obtained based on a global structure tree. By coding the difference of monitoring values during routing within and outside of clusters, spatial redundancy is eliminated near the place where it occurs. Therefore, the amount of communication in the network and the average energy cost of nodes are reduced, and the longevity of the whole sensor network is extended. The algorithm in this paper mainly deals with one-dimensional monitoring data. More redundancy may need to be eliminated when the monitored values are two-dimensional, and research on a corresponding algorithm is the focus of our planned future work.

The work described in this paper was supported by the National Natural Science Foundation of China (NSFC60974012), the Natural Science Foundation of ZheJiang Province (Y1100054), the Key Science and Technology Plan Program of the Science and Technology Department of ZheJiang Province (2008C23097), and the Science and Technology Plan Program of the Science and Technology Department of Hangzhou (20091133B03).

Flow chart of the SC-LVLR algorithm.

Flow chart of the MSTC algorithm.

Changes of SNR as network size changes.

Changes of SNR with changes of values’ fluctuation amplitude.

Changes of NAEC as network size changes.

Changes of NAEC with changes of values’ fluctuation amplitude.

Changes of the number of expired nodes as time goes by.

Changes of the number of expired nodes as network size changes.

Changes of the number of expired nodes with changes of values’ fluctuation amplitude.