A Smart Checkpointing Scheme for Improving the Reliability of Clustering Routing Protocols

In wireless sensor networks, system architectures and applications are designed to consider both resource constraints and scalability, because such networks are composed of numerous sensor nodes with various sensors and actuators, small memories, low-power microprocessors, radio modules, and batteries. Clustering routing protocols based on data aggregation schemes aimed at minimizing packet numbers have been proposed to meet these requirements. In clustering routing protocols, the cluster head plays an important role. The cluster head collects data from its member nodes and aggregates the collected data. To improve reliability and reduce recovery latency, we propose a checkpointing scheme for the cluster head. In the proposed scheme, backup nodes monitor and checkpoint the current state of the cluster head periodically. We also derive the checkpointing interval that maximizes reliability while using the same amount of energy consumed by clustering routing protocols that operate without checkpointing. Experimental comparisons with existing non-checkpointing schemes show that our scheme reduces both energy consumption and recovery latency.

evaluate our scheme using network protocol simulation software and implement it on sensor nodes running TinyOS [5].
The paper is organized as follows. In Section 2, we describe previous work on fault-tolerant schemes for wireless sensor networks. Section 3 explains the design of our checkpointing scheme, and Section 4 describes its implementation. In Section 5, we evaluate the impact and performance of our scheme on a resource-constrained sensor network in terms of both energy consumption and recovery latency. A conclusion is presented in Section 6.

Related Works
This section briefly introduces prior studies related to fault tolerant schemes. We describe the features of each scheme and explain their pros and cons.

Checkpointing the Sink Node
In [6], the authors proposed the concept of in-network fault tolerance for achieving enhanced network dependability and performance. In that scheme, the sink node periodically checkpoints its state and saves it in the memory of one or more sensor nodes, so-called checkpoint sensors. When a sink node ($S_1$) fails or its energy level falls below a threshold, another sensor node is selected to operate as the new sink node ($S_2$). After applying this approach $m$ times, the sink will be located in a sensor denoted by $S_m$. If the sink is located on $S_m$, then $S_{m-1}$ is the checkpoint sensor and the path between $S_1$ and $S_m$ is the checkpoint path. When a sink node ($S_m$) fails, $S_{m-1}$ detects the failure and becomes the sink instead; the scheme proceeds iteratively in this sequence along the checkpoint path. This scheme is simple to implement, but energy consumption and reliability vary according to the position of the sink node.

Checkpointing all Nodes
Each sensor node within a WSN is prone to failure because of software (S/W) or hardware (H/W) faults. To address this type of problem, different mechanisms have been designed for each sensor node. Some researchers have suggested a checkpointing scheme based on the density of the neighbors [7]. In such a scheme, each node broadcasts the checkpoint packet to its neighbor nodes, and the neighbor nodes decide whether or not to save the checkpoint packet according to the density of sensor nodes.
In [8], the authors proposed a flash file system that supports the flexible use of storage capacity for a variety of applications. To account for the memory and energy constraints of the sensor nodes, they use efficient compaction and storage organization techniques. To tolerate software faults in sensor applications, Capsule, an efficient log-structured file system for flash memory, provides the necessary checkpointing and rollback of object states. These schemes improve the reliability of the network, but the scalability issue must be considered when these schemes are used.

Macroprogramming
Macroprogramming means that a programmer describes a sensor network application as a centralized program and a compiler then generates the node-level program. Gummadi et al. designed a simple checkpoint application programming interface (API) for macroprograms and implemented Kairos, a framework that consists of a programming language based on Python, a code generator, and a compiler [9]. If macroprogramming is applied to a sensor application, then the synchronization problem is automatically solved via the Kairos runtime system. Although macroprogramming has many advantages, it is inflexible and too complex for some sensor applications, such as those related to forest fire detection and enemy tracing.

Checkpointing Scheme for Clustering Routing Protocols
In this section, we present the design of a checkpointing scheme for clustering routing protocols in detail. First, the essential concept of clustering routing protocols and their features are described. Then, the design of our scheme and the model for finding the optimal checkpointing interval are presented.

Clustering Routing Protocol
The main aim of clustering routing protocols (hierarchical protocols) is to manage the energy consumption of sensor nodes efficiently by involving them in multi-hop communication within a particular cluster and by performing data aggregation in order to decrease the number of messages transmitted to the base station (BS) [4]. Since the Low-Energy Adaptive Clustering Hierarchy (LEACH) [10] protocol was proposed, there have been many studies on clustering routing protocols such as PEGASIS [11], TEEN [12], ATEEN [13] and OEDSR [14]. These protocols form clusters of sensor nodes based on received signal strength, and they use cluster heads as routers to send the collected information to the BS. Figure 1 shows the concept of the clustering routing protocol. The depicted network is divided into four clusters, and cluster heads are elected based on the residual energy within each cluster. Normal nodes communicate only with their cluster head, which in turn aggregates the collected information and sends it to the BS. In this scheme, cluster head failures are more critical than those of normal nodes. When a cluster head fails, re-election of the cluster head is performed within the cluster. Such a recovery scheme is a time- and energy-consuming process. Therefore, to improve the quality and reliability of sensor networks, a fault-tolerant mechanism is needed for such cluster heads.

System Design
We propose a checkpointing scheme for the cluster head in clustering routing protocols that minimizes recovery cost and recovery latency. During the cluster head election step, our scheme elects additional backup nodes for checkpointing the cluster head information. All collected information sent by normal nodes to the cluster head is also saved in the backup nodes. The backup nodes periodically check the state of the cluster head, and if the cluster head has a transient problem, then one of the backup nodes replaces the failed cluster head and takes over the role of cluster head. Figure 2 presents an overview of our scheme applying the cluster head checkpointing mechanism. When the cluster head operates properly (see clusters a, c, and d in Figure 2), backup nodes save only the checkpoint information and monitor the state of the cluster head. In the case of cluster b, the cluster head cannot carry out its tasks because it has encountered an S/W or H/W problem. A backup node then operates as the cluster head based on the obtained checkpointing information. Through this checkpointing scheme, we can prevent information loss caused by a cluster head failure, and we can reduce the recovery latency associated with frequent re-election of a cluster head. In clustering routing protocols, the communication range of a cluster head is larger than that of its cluster. To prevent network partition and orphan node problems, cluster heads adjust their communication ranges properly. In our mechanism, backup nodes can also adjust their communication range to cover all member nodes of their cluster.
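The monitoring and takeover behavior described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the class names `Node` and `Cluster`, the `receive` and `monitor` methods, and the rule that the first live backup takes over are all assumptions made here for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    alive: bool = True
    # Copy of the readings that a backup node has checkpointed (hypothetical field).
    checkpoint: list = field(default_factory=list)


@dataclass
class Cluster:
    head: Node
    backups: list
    # Readings collected by whichever node is currently the cluster head.
    aggregated: list = field(default_factory=list)

    def receive(self, reading):
        """A member node reports a reading; live backups save a copy as well."""
        if self.head.alive:
            self.aggregated.append(reading)
        for backup in self.backups:
            if backup.alive:
                backup.checkpoint.append(reading)

    def monitor(self):
        """Run by the backups once per checkpointing interval.

        Returns the node acting as cluster head after the check.
        Assumption: the first live backup in the list takes over.
        """
        if self.head.alive:
            return self.head
        for i, backup in enumerate(self.backups):
            if backup.alive:
                self.head = backup
                self.backups = self.backups[:i] + self.backups[i + 1:]
                # The new head resumes from its checkpointed copy.
                self.aggregated = list(backup.checkpoint)
                return backup
        raise RuntimeError("no live backup node; re-election would be required")
```

In this sketch a cluster with three backups keeps collecting readings in the backups even while the head is down, so the node that takes over in `monitor` starts with every reading already reported and no re-election is triggered.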

System Modeling
We use a Markov model to find the minimum number of backup nodes that meets the reliability expected by users, and an energy analysis model to determine the optimal checkpointing interval. Table 1 shows the notations and functions used when modeling our system. In order to simplify our model, we make the following assumptions: (i) the reference network model is based on [15]; (ii) all nodes know their residual energy; (iii) there are no communication errors between two nodes; and (iv) node failures follow a Poisson distribution with failure rate λ.

The minimum number of backup nodes
In our scheme, there is a trade-off between reliability and energy consumption. As the number of backup nodes increases, reliability also increases. However, the energy consumption of the checkpointing process also increases, and, as a result, the lifetime of the network decreases. Therefore, we need to find the minimum number of backup nodes that satisfies the user's reliability expectation ($R_{user}$). Here, we apply the Markov model to determine the minimum number of backup nodes when the expected reliability is specified by a user or an application designer.
In [16], a special case of the birth-death process, which is a continuous-time Markov model, is described. Figure 3 shows the state diagram of our model, where the state indicates the number of failed nodes. If the failure rate of each node (including the cluster head) is λ and the repair rate is μ, the expressions for the steady-state probabilities are obtained via Equations (1) and (2). Each node has its own repair facility, such as a watchdog timer that periodically monitors the state of the sensor node. If a sensor node has problems and cannot operate properly, the watchdog timer restarts the system. When the watchdog timer frequency is taken as the repair rate (μ), the availability of an individual component ($A_{indiv}$) is obtained via Equation (3), and the steady-state availability ($A_{steady}$) is computed via Equation (4). When $A_{steady}$ is set equal to the reliability expected by the user ($R_{user}$), μ is equal to the frequency of the watchdog timer, and the failure rate of each node (λ) is given, we can determine the minimum number of backup nodes ($n-1$) through Equation (4).
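To illustrate how a target reliability translates into a node count, the following Python sketch assumes one common textbook model: $n$ redundant nodes that fail and are repaired independently, so that each node is available with probability μ/(λ+μ) and the group is available as long as at least one node is up. This closed form is an assumption made here for illustration only. It is not necessarily the form of Equations (3) and (4), so the values it produces need not match the figures reported in the evaluation. The function names are hypothetical.

```python
def individual_availability(lam, mu):
    """Steady-state availability of one node with failure rate lam and repair rate mu."""
    return mu / (lam + mu)


def steady_availability(n, lam, mu):
    """Availability of n independently repaired nodes (assumed model).

    The group counts as available while at least one node is up.
    """
    unavailable = 1.0 - individual_availability(lam, mu)
    return 1.0 - unavailable ** n


def min_redundant_nodes(r_user, lam, mu, n_max=100):
    """Smallest n with steady_availability(n) >= r_user, or None if none up to n_max.

    Following the text, n counts the cluster head too, so the number of
    backup nodes is n - 1.
    """
    for n in range(1, n_max + 1):
        if steady_availability(n, lam, mu) >= r_user:
            return n
    return None
```

Under this assumed model, the search simply increases $n$ until the availability target is met, which mirrors the trade-off in the text: every extra backup node raises availability but also adds checkpointing cost.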

Optimal checkpointing interval
In clustering routing protocols, the cluster head is in charge of data collection, and this step is modeled as in Figure 4. The cluster is composed of $N$ nodes (a cluster head and $N-1$ normal nodes), and each member node sends sensing data to its cluster head during time $T$. If the failure rate of each node is λ, then $e^{-\lambda T}$ is the probability that a node does not fail during the total time of data collection from all member nodes (i.e., time $T$). Under this condition, the probability that a failure occurs while the cluster head gathers data from the $k$th node depends on λ and $k$.
To compare the energy consumption of our checkpointing scheme with that of an existing non-checkpointing scheme, we define $E_{pre}$ and $E_{ckpt}$ as in Equation (5). The energy consumption of existing clustering routing protocols ($E_{pre}$) is divided into two parts. One is the sum of the energy consumed by each member node while the cluster head operates properly. The other is the energy consumed by the recovery process. In clustering routing protocols without a checkpointing mechanism, when a cluster head fails, member nodes re-elect a new cluster head. This recovery process involves several types of messages: recovery process start messages ($N-1$), broadcasts of the remaining-energy notification messages of normal nodes ($(N-1)^2$), and recovery process end messages from the new cluster head ($N-1$), which are used for finding member nodes and constructing a routing table [17]. The energy consumption of the cluster head re-election process is represented by $E_{elec}$.
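Adding up the three message counts listed above gives the total number of messages in one re-election. The symbol $M_{elec}$ is introduced here only for illustration and does not appear in the original notation:

```latex
\[
M_{elec}
  = \underbrace{(N-1)}_{\text{start}}
  + \underbrace{(N-1)^2}_{\text{energy notifications}}
  + \underbrace{(N-1)}_{\text{end}}
  = (N-1)(N+1)
  = N^2 - 1
\]
```

For example, a cluster of $N = 10$ nodes needs $99$ messages for one re-election, and a cluster of $N = 50$ nodes needs $2499$, so the message count grows quadratically with the cluster size.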
The energy consumption of a clustering routing protocol with checkpointing ($E_{ckpt}$) is similar to that of previously reported clustering routing protocols. However, the proposed checkpointing scheme excludes the re-election cost ($E_{elec}$) because our scheme does not need to re-elect a new cluster head, although it does include checkpointing costs during time $k$.
Algorithm 1 explains the checkpointing and recovery process of our scheme. As our scheme can omit cluster head election and state recovery, it reduces energy consumption and recovery latency. The optimal checkpointing interval is the time between two successive checkpoints that satisfies the condition $E_{pre} \geq E_{ckpt}$. This condition means that the energy spent on checkpointing must not exceed the energy spent on re-election. Therefore, the minimum value of $I_{ckpt}$ that satisfies this condition is the optimal checkpointing interval, which is derived through Equation (6). As recovery latency is directly proportional to the number of required messages, we compare the recovery latency of our checkpointing scheme with that of previous schemes through Equation (7). In clustering routing protocols without checkpointing, the recovery latency includes the cluster head re-election process and the scheduling latency of the ZigBee Medium Access Control (MAC) protocol [15]. In our proposed scheme, backup nodes wait one checkpointing interval ($I_{ckpt}$) to detect a cluster head failure, and a backup node then sends its identification (ID) code to member nodes to commit itself to the role of cluster head.
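The latency comparison can be illustrated with a rough Python sketch. It is not Equation (7): it assumes that every recovery message costs one scheduling delay of 17 ms (the scheduling latency used in the evaluation), that a re-election needs $(N-1) + (N-1)^2 + (N-1)$ messages as listed in the energy model, and that the backup node notifies each of the $N-1$ member nodes with one message. All function names are hypothetical.

```python
D_SCHD_MS = 17.0  # assumed per-message scheduling latency in milliseconds


def reelection_messages(n_nodes):
    """Messages for one re-election: start, energy notifications, and end."""
    m = n_nodes - 1
    return m + m * m + m


def latency_without_checkpointing(n_nodes, d_schd_ms=D_SCHD_MS):
    """Assumed model: each re-election message costs one scheduling delay."""
    return reelection_messages(n_nodes) * d_schd_ms


def latency_with_checkpointing(n_nodes, i_ckpt_ms, d_schd_ms=D_SCHD_MS):
    """Assumed model: one checkpointing interval to detect the failure,
    then one ID message per member node from the backup node."""
    return i_ckpt_ms + (n_nodes - 1) * d_schd_ms
```

Under these assumptions the latency without checkpointing grows quadratically in the number of nodes, while the latency with checkpointing grows linearly, which matches the qualitative argument in the text.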

Implementation
We have implemented our checkpointing scheme for clustering routing protocols to evaluate recovery latency in a real-world situation. Figure 5 shows an example of the target sensor node, called Ubi-coin, and Table 2 describes the H/W specifications of the sensor node. We implemented our scheme using the API of TinyOS, a well-known operating system for wireless sensor networks (available at http://www.tinyos.net/). The testbed is composed of 50 nodes: a cluster head, three backup nodes, and 46 member nodes. This testbed represents a single cluster of a sensor network in which there are several clusters. To simplify the testbed, all nodes were able to communicate with each other within a one-hop range, and we varied the number of nodes among 10, 20, and 50. Each node periodically collects temperature data through a temperature sensor and sends the obtained data to the cluster head in the order of its ID code.

Performance Evaluation
We evaluate our scheme in terms of energy efficiency and recovery latency. Table 3 describes the parameters used for the evaluation. The values of the parameters are based on [15] and [19], studies that investigated energy consumption and communication latency in WSNs.

Table 3. Parameters for simulation.
| Parameter | Value |
|---|---|
| Field size | 500 m × 500 m |
| $N$ | 10, 20, 50, 100 |
| $n$ | 3 |
| λ | $10^{-4}$ (0 < λ < 1.0) |
| μ | $2 \times 10^{-4}$, $2 \times 10^{-6}$ (0 < μ < 1.0) |
| λ/μ | 0.5 |
| $R_{user}$ | 0.8 (80%) |
| $MSG_s$ | 128 Bytes |
| $E_{init}$ | 0.5 J |
| $E_{rf}$ | 80 nJ |
| $I_{ckpt}$ | 0 ms ≤ $I_{ckpt}$ ≤ 17 ms |
| $D_{schd}$ | 17 ms |
| $T$ | $(N-1) \times D_{schd}$ |

To compare energy consumption between clustering routing protocols with and without checkpointing, the number of backup nodes needs to be determined. Figure 6 shows the steady-state availability ($A_{steady}$) of our scheme as a function of the number of backup nodes and the ratio λ/μ, obtained by plotting Equation (4). When the failure rate (λ) is higher than the repair rate (μ) of the watchdog timer (λ/μ > 1), system availability decreases dramatically because the value of Equation (4) varies exponentially with λ/μ. To improve availability, the watchdog timer interval must be appropriately decreased. If the watchdog timer rate is higher than the failure rate (λ/μ < 1), the reliability of the system is more than 80% when using three backup nodes. When the repair rate equals the failure rate (λ/μ = 1), our system still provides reasonable availability (more than 73%) when using just three backup nodes. We have assumed that λ/μ is smaller than 1 in order to satisfy the user's expected reliability ($R_{user}$). Under those conditions, three backup nodes are sufficient to satisfy the system availability requirements.
The energy consumption of clustering routing protocols without checkpointing ($E_{pre}$) and with checkpointing ($E_{ckpt}$) is compared via Equation (5), with the results shown in Figure 7. In this comparison, three backup nodes request the checkpoint packet from the cluster head whenever member nodes send sensing data to the cluster head, with $I_{ckpt}$ = 17 ms. The energy consumption of the non-checkpointing scheme is higher than that of our scheme, and the difference between the two schemes steadily increases as the number of nodes in a cluster increases. By using this extra energy, our scheme can reduce the checkpointing interval and increase the reliability of the sensor network. In this case, we derived optimal checkpointing intervals of between 2.019 ms and 2.002 ms when the number of sensor nodes ranged from 10 to 100 (Figure 8). The results show that as the number of sensor nodes increases, the amount of extra energy ($E_{pre} - E_{ckpt}$) increases, and the number of checkpointing messages also increases. In summary, the optimal checkpointing interval approaches 2 ms as the number of sensor nodes in a cluster increases.
We tested our checkpointing scheme on the aforementioned testbed to evaluate recovery latency. Figures 10 and 11 show the recovery latency of LEACH with our checkpointing scheme applied compared with that of the original LEACH, with the results obtained via GloMoSim and a real-world testbed, respectively. The simulation results show that the recovery latency of the original LEACH increases sharply, while that of LEACH with our checkpointing scheme applied increases more slowly and steadily (Figure 10). Recovery latency is affected by the number of messages sent during the recovery process. In the original LEACH, $O(n^2)$ messages are generated during the re-election process as the number of nodes in a cluster increases. However, LEACH with our checkpointing scheme applied generates only $O(n)$ messages via a backup node; thus, recovery latency with checkpointing increases linearly.
During implementation testing, we uniformly deployed sensor nodes in a 10 m × 10 m test field and created failure conditions by turning off the cluster head or by blocking wireless communication with obstacles. We then measured the completion time for data collection from all member nodes within a cluster and calculated the mean recovery latency after running the conditions 10 times. The implementation results (Figure 11) showed a trend similar to the simulation results in Figure 10. As in the simulation results, the implementation results showed that recovery latency using our checkpointing scheme increases steadily, while that of the original LEACH increases sharply. Therefore, our scheme is also more efficient than previous clustering routing protocols without checkpointing in terms of energy consumption and recovery latency.
Figure 11. Recovery latency comparison between checkpointing and non-checkpointing LEACH, obtained using a real-world testbed.

Conclusions
When designing an efficient sensor application, we must consider the resource constraints of sensor nodes and their scalability. WSN users are concerned about information quality, and user requirements for real-time features are also increasing. Moreover, sensor applications are expanding into harsher and more dangerous environments. Therefore, fault-tolerant schemes have emerged as an important issue in WSNs.
Clustering routing protocols such as LEACH, PEGASIS and TEEN were designed to improve both energy efficiency and scalability. These protocols form clusters and elect a cluster head in each cluster. The cluster heads aggregate data from their member nodes and reduce the number of messages sent by member nodes directly to the BS. In clustering routing protocols, cluster head management is needed because the role of the cluster head is more important than that of member nodes.
In this paper, we proposed a checkpointing scheme for clustering routing protocols. Our scheme can reduce energy consumption and recovery latency when a cluster head fails transiently. In addition, our checkpointing scheme is easy to implement. The simulation and real-world testbed results show improvements in both energy consumption and recovery latency when our checkpointing scheme is implemented.