Design and Analysis of an Efficient Energy Algorithm in Wireless Social Sensor Networks

Because mobile ad hoc networks lack central nodes, use multi-hop routing and have a changeable topology, the existing checkpoint technologies for ordinary mobile networks cannot be applied well to them. Considering the multi-frequency hierarchical structure of ad hoc networks, this paper proposes a hybrid checkpointing strategy which combines synchronous checkpointing with asynchronous checkpointing: the checkpoints of mobile terminals in the same cluster remain synchronous, and the checkpoints in different clusters remain asynchronous. This strategy not only avoids cascading rollback among the processes in the same cluster, but also avoids too many message transmissions among the processes in different clusters. In addition, it can reduce the communication delay. To assure the consistency of the global states, this paper discusses the correctness criteria of hybrid checkpointing, which include the criteria of checkpoint taking, rollback recovery and indelibility. Based on the designed Intra-Cluster Checkpoint Dependence Graph and Inter-Cluster Checkpoint Dependence Graph, the elimination rules for the different kinds of checkpoints are discussed, and the algorithms for same cluster checkpoints, different cluster checkpoints and rollback recovery are given. Experimental results show that the proposed hybrid checkpointing strategy is a good trade-off: it takes the various resource constraints of ad hoc networks into account, and it compares favorably with pure synchronous and pure asynchronous checkpointing in terms of dependence on cluster heads and recovery time.


Research Motive
As a new kind of network technology in the wireless communication field, the mobile ad hoc network does not depend on fixed infrastructure. It has been widely applied in settings such as offices, industry and the military, because it can not only construct a multi-hop network together with the existing network, but also support wireless transmission of data, voice and images under bad environmental conditions by temporarily constructing a smart network.

The main contributions of this paper are as follows:

1. It proposes a hybrid checkpointing model that combines coordinated and uncoordinated checkpointing, taking into account the communication features and the various resource constraints of mobile ad hoc networks.
2. It discusses the checkpoint correctness criteria for the hybrid checkpointing model, including the checkpoint taking criteria, rollback criteria and indelibility criteria.
3. It discusses the triggering occasions of checkpoints for processes in the same cluster and in different clusters, according to the characteristics and requirements of the different types of checkpoints.
4. It designs the intra-cluster and inter-cluster checkpoint dependence graphs according to the dependence relations among checkpoints, and discusses the elimination rules of checkpoints in the same cluster and in different clusters based on these graphs. It also provides the algorithms for same cluster checkpoints, different cluster checkpoints and rollback recovery, and proves the correctness of the algorithms theoretically.
5. By testing the performance of the proposed strategy, it verifies that the hybrid checkpointing strategy is a good trade-off, with advantages such as less dependence on cluster heads, low overhead and shorter recovery times.
The rest of the paper is organized as follows. Section 2 introduces the topological structure of ad hoc networks and the hybrid checkpointing model. Sections 3 and 4 discuss the correctness criteria for the hybrid checkpointing model and the triggering occasions of the different kinds of checkpoints. Sections 5 and 6 propose the checkpoint elimination strategy and the related algorithms. Section 7 discusses the processing method for handoffs. Section 8 presents the performance tests and analysis. The last section concludes the paper and proposes future work.

Mobile Ad Hoc Network Topological Structure
The mobile ad hoc network is a wireless self-organizing network of mobile peer nodes. Its topological structures fall into two main classes: flat structures and hierarchical structures [35][36][37][38]. In a flat structure, all network nodes are equal, whereas in a hierarchical structure the entire network consists of subnets which are divided into clusters. Each cluster includes one cluster head and many cluster members, and the cluster heads form a higher-level network. Because the scale of the hierarchical structure is not limited and its expandability is good, this structure has become the preferred direction for mobile ad hoc networks. Therefore, this paper discusses the checkpointing algorithm under the multi-frequency hierarchical structure. Figure 1 shows the multi-frequency hierarchical topological structure of an ad hoc network. This structure consists of multiple clusters. Each cluster includes one cluster head (CH) and a limited number of cluster members [39][40][41]. The CHs and cluster members are all regular mobile hosts (MHs). The hosts in one cluster share one frequency to communicate, while the CHs communicate on another frequency. The members in the same cluster can communicate with each other directly, while the members in different clusters have to communicate through retransmission by the CHs.
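The communication rule of the hierarchical structure can be illustrated with a minimal Python sketch. The function `route` and the dictionaries `cluster_of` and `head_of` are hypothetical names introduced here, and the sketch assumes that the two cluster heads can reach each other in a single hop, which the text does not guarantee.

```python
def route(src, dst, cluster_of, head_of):
    """Return the relay path from src to dst under the clustered topology.

    cluster_of maps a host to its cluster id; head_of maps a cluster id to
    its cluster head. Both are illustrative inputs, not part of the paper.
    """
    c_src, c_dst = cluster_of[src], cluster_of[dst]
    if c_src == c_dst:
        # Members of the same cluster communicate directly.
        return [src, dst]
    # Members of different clusters are relayed by their cluster heads.
    path = [src, head_of[c_src], head_of[c_dst], dst]
    # Drop repeats when src or dst is itself a cluster head.
    return [n for i, n in enumerate(path) if i == 0 or n != path[i - 1]]


cluster_of = {"a": "C1", "b": "C1", "h1": "C1", "c": "C2", "h2": "C2"}
head_of = {"C1": "h1", "C2": "h2"}
print(route("a", "b", cluster_of, head_of))
print(route("a", "c", cluster_of, head_of))
```

The longer inter-cluster path is what makes coordination messages between clusters expensive, which motivates treating the two cases differently.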

Hybrid Checkpointing Model
As we know, the resources of MHs in mobile ad hoc networks, such as storage space, processing capacity, battery energy and wireless bandwidth, are quite limited. If we fully adopt the synchronous checkpointing algorithm, a large number of checkpoint coordination messages must be transmitted. Because communication among MHs in different clusters usually needs multi-hop routing before reaching the destination, the number of checkpoint coordination messages is proportional to the number of hops, and the retransmissions occupy precious wireless bandwidth. If we instead use only the asynchronous checkpointing algorithm, most checkpoint coordination messages are avoided, but the cascading rollback caused by asynchronous checkpointing wastes precious resources such as the CPU time and power of MHs, and the storage space needed for checkpoint information is large.
Therefore, this paper adopts a hybrid checkpointing model which combines synchronous and asynchronous checkpointing after weighing the various resource limitations of MHs. In this model, the hosts in the same cluster use synchronous checkpointing (because these hosts can communicate with each other directly, without retransmission by cluster heads), while the hosts in different clusters adopt asynchronous checkpointing. This not only avoids the waste of resources that cascading rollback would cause for hosts in the same cluster, but also avoids too many message transmissions among hosts in different clusters. In addition, it reduces the communication delay of the network and the delay of checkpoint operations.

Definition 1. A same cluster checkpoint (SC) is a checkpoint which is used for ensuring the consistency of processes on hosts of the same cluster.
The SC is a synchronous checkpoint. Every time a process takes an SC, it needs to remain synchronous with the related processes in the same cluster. During the synchronization procedure, if some of the processes cannot finish their checkpoints for their own reasons, the coordinated checkpoints become useless checkpoints which need to be eliminated. Hence, to reduce the occupation of precious storage space in wireless terminals, SCs are further divided into two kinds: uncertain checkpoints and certain checkpoints.

Definition 2. An uncertain checkpoint is an SC which is stored in the cache of an MH while the coordination is not finished.

Definition 3. A certain checkpoint is an SC which is stored on the local disk of an MH when the coordination is finished.
Therefore, if and only if the coordination of checkpoints succeeds, the uncertain checkpoints are turned into certain checkpoints. Otherwise, the uncertain checkpoints become useless checkpoints which can be eliminated. Eliminating an uncertain checkpoint means releasing the information stored in the cache.

Definition 4. A different cluster checkpoint (DC) is a checkpoint which is used for assuring the consistency of processes on MHs of different clusters.
The DC is an asynchronous checkpoint. When a process receives messages from MHs of different clusters, it finishes the checkpoint operation independently without informing other MHs. The DC is triggered by the reception of messages from MHs of different clusters and its checkpoint information is stored directly on the local disk of the MH, so the DC is a certain checkpoint by nature.
Let $P_{i,j,k}$ denote a process, where i is the id of the cluster in which the mobile terminal stays, j is the id of the mobile terminal and k is the id of the process. Figure 2 is an example of the hybrid checkpointing model. $P_{i,k,s}$, $P_{i,a,v}$ and $P_{i,j,t}$ are processes executing on different terminals of cluster $C_i$, while $P_{m,s,c}$ is a process executing on terminal s of cluster $C_m$. $SC^{b}_{i,a,v}$ denotes an SC of $P_{i,a,v}$, where b is the sequence number of the checkpoint. $DC^{m}_{i,k,s}$ is a DC of $P_{i,k,s}$, where m is the id of the cluster in which the sender of the different-cluster message stays.

Combined with Figure 2, the principles of the hybrid checkpointing model can be explained as follows:

1. The initialized checkpoint. When each process initializes, it takes an initialization checkpoint (a certain checkpoint). For example, $SC^{b}_{i,a,v}$, $SC^{y}_{i,j,t}$, $SC^{d}_{i,k,s}$ and $SC^{d}_{m,s,c}$ in Figure 2 are initialized checkpoints.

2. The taking of SCs. When a process P receives a fixed number of messages from MHs in the same cluster, it initializes a checkpoint coordination procedure. P first takes an uncertain checkpoint, and then sends checkpoint requests to the processes which have a dependence relation with P. When these processes receive the requests, they first take uncertain checkpoints and then reply to P. Once P has received replies from all processes which take part in the coordination procedure, it turns its uncertain checkpoint into a certain checkpoint. Otherwise, P eliminates its own uncertain checkpoint. Finally, P notifies the related processes so that they perform the same conversion or elimination of their checkpoints.

3. The taking of DCs. When a process receives a message from an MH in a different cluster, the DC is triggered. To assure that the receiving event of this message can be cancelled when its sending event is cancelled, the process has to take the DC first and only then receive the message.

In Figure 2, $P_{i,j,t}$ is the process which initializes the checkpointing. $P_{i,a,v}$ and $P_{i,k,s}$ are the participant processes of the checkpointing because of messages $M_2$ and $M_3$. When the checkpointing starts, $P_{i,j,t}$ first takes an uncertain checkpoint $SC^{y+1}_{i,j,t}$ and then notifies $P_{i,a,v}$ and $P_{i,k,s}$ to take uncertain checkpoints $SC^{b+1}_{i,a,v}$ and $SC^{d+1}_{i,k,s}$ respectively. When $P_{i,a,v}$ and $P_{i,k,s}$ have finished their uncertain checkpoints, they send replies to $P_{i,j,t}$. After $P_{i,j,t}$ has received the replies from $P_{i,a,v}$ and $P_{i,k,s}$ (not marked in Figure 2), it turns $SC^{y+1}_{i,j,t}$ into a certain checkpoint and notifies $P_{i,a,v}$ and $P_{i,k,s}$ to turn $SC^{b+1}_{i,a,v}$ and $SC^{d+1}_{i,k,s}$ into certain checkpoints; their Checkpoint Sequence Numbers (CSNs) do not change. Otherwise, $P_{i,j,t}$ eliminates its own uncertain checkpoint and notifies $P_{i,a,v}$ and $P_{i,k,s}$ to do the same.
As an example of DC taking, when process $P_{i,k,s}$ in cluster $C_i$ receives a message $M_2$ from process $P_{m,s,c}$ in cluster $C_m$ as Figure 2 shows, $P_{i,k,s}$ first takes $DC^{m}_{i,k,s}$ and then receives $M_2$.
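The SC coordination procedure described above can be sketched in Python as a small in-memory simulation. This is only an illustration under stated assumptions: the class `Process`, the flag `can_checkpoint` (used to simulate a participant that cannot finish its checkpoint) and the function `coordinate` are hypothetical names, and message passing is replaced by direct method calls.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Process:
    name: str
    can_checkpoint: bool = True          # hypothetical: simulates a failing participant
    csn: int = 0                         # sequence number of the latest certain checkpoint
    certain: List[int] = field(default_factory=lambda: [0])  # initial checkpoint is certain
    pending: Optional[int] = None        # CSN of the uncertain checkpoint held in cache

    def take_uncertain(self) -> bool:
        """Take an uncertain checkpoint; return False if this process cannot."""
        if not self.can_checkpoint:
            return False
        self.pending = self.csn + 1
        return True

    def commit(self) -> None:
        """Turn the uncertain checkpoint into a certain one; its CSN is unchanged."""
        if self.pending is not None:
            self.certain.append(self.pending)
            self.csn = self.pending
            self.pending = None

    def discard(self) -> None:
        """Eliminate the uncertain checkpoint by releasing the cached state."""
        self.pending = None


def coordinate(initiator: Process, participants: List[Process]) -> bool:
    """Run one same-cluster checkpoint round; return True if it was committed."""
    if not initiator.take_uncertain():
        return False
    replies = [p.take_uncertain() for p in participants]
    ok = all(replies)
    # The initiator notifies every participant to convert or eliminate together.
    for p in [initiator, *participants]:
        if ok:
            p.commit()
        else:
            p.discard()
    return ok
```

A successful round leaves every process with one more certain checkpoint, while a round with any failing participant leaves the set of certain checkpoints untouched.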

Problem of Hybrid Checkpointing Model to Be Solved
The hybrid checkpointing model takes the communication features of hosts both in the same cluster and in different clusters into consideration, and thus greatly reduces the number of wireless communication messages and saves the resources of MHs. In addition, this model has less dependence on cluster heads (a cluster head just records the dependence among checkpoints of processes in different clusters, as will be discussed in Section 5). Therefore, this model has better flexibility.
However, in a hybrid checkpointing strategy, the checkpoints of a process cannot be handled according to a fully synchronous or fully asynchronous checkpointing algorithm because SCs and DCs coexist. The main problems of checkpoint processing are as follows:

1. What are the correctness criteria? In the hybrid checkpointing model there are mutual dependence relations among checkpoints in the same cluster and in different clusters, so a new global consistency criterion is needed to assure the global consistency of checkpoints.

2. When is a checkpoint triggered? To reduce checkpoint operations and storage overhead, different triggering occasions should be chosen for the different kinds of checkpoints according to the features of ad hoc networks and of the hybrid checkpointing model.

3. When is a checkpoint eliminated? The presence of DCs complicates the elimination of SCs and leaves many checkpoints in storage. To save space, useless SCs must be cleared away in time and the elimination of useless DCs must be actively coordinated.

4. What is the strategy for processing handoffs? The checkpointing strategy of the hybrid model is decided by whether MHs are in the same cluster or in different clusters, but these relations change when MHs hand off. Hence, the checkpointing strategy should handle the change of MHs correctly and automatically maintain the new cluster relations, to guarantee correct checkpoint operations for processes on MHs that hand off.

Correctness Criteria for a Hybrid Checkpointing Model
In the synchronous checkpointing strategy, the condition for checkpoints to ensure global correctness is that there are no orphan messages. An orphan message is a message whose receiving event is stored but whose sending event is not. Therefore, to prevent orphan messages, the sender of a message must also take a checkpoint when the receiver of the message takes one. In the asynchronous checkpointing strategy, by contrast, orphan messages are allowed to exist, and hence the cascading rollback phenomenon exists. In the hybrid model, the checkpoints in the same cluster and in different clusters adopt different strategies, so the checkpoint operations cannot follow the correctness criteria of a purely coordinated or purely uncoordinated strategy.
Let the same cluster message (SM) be a message transmitted from process $SP_i$ to process $SP_j$ on MHs in the same cluster; $SC_i$ and $SC_j$ are the checkpoints taken by $SP_i$ and $SP_j$ after sending and receiving SM, respectively. The different cluster message (DM) is a message transmitted from process $DP_i$ to process $DP_j$ on an MH in another cluster; $CKP_i$ and $DC_j$ are the checkpoints of $DP_i$ and $DP_j$ related to DM, respectively. Receive(M) denotes the operation of receiving message M and Send(M) the operation of sending message M. $A \gg B$ means that operation A occurs before operation B. CKP denotes a checkpoint of either kind (SC or DC).

Checkpoint Taking Criteria
To assure that an SM is not an orphan message, the sending process of the SM must take a checkpoint when the receiving process of the SM takes a checkpoint:
Criterion 1: $\forall SM$, if $Receive(SM) \gg SC_j$, then $Send(SM) \gg SC_i$.
The DC does not require the absence of orphan messages, but it requires that the receiving event of a DM can be cancelled when the sending event of the DM is cancelled. Hence, the receiving process of a DM must first take a DC and then receive the message:
Criterion 2: $\forall DM$, $DC_j \gg Receive(DM)$.
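The two taking rules can be checked mechanically on a recorded event trace. The Python sketch below is illustrative only: the trace format (a list of `(kind, value)` tuples per process) and the function names are assumptions introduced here, and the DC rule is checked in a strict form that requires the DC to be the event immediately before the receive.

```python
def sm_not_orphan(sender_log, receiver_log, msg_id):
    """SM rule: if the receiver checkpointed after Receive(SM),
    the sender must have checkpointed after Send(SM)."""
    s = sender_log.index(("send", msg_id))
    r = receiver_log.index(("recv", msg_id))
    sender_saved = any(kind == "ckpt" for kind, _ in sender_log[s + 1:])
    receiver_saved = any(kind == "ckpt" for kind, _ in receiver_log[r + 1:])
    return sender_saved or not receiver_saved


def dc_precedes_receive(receiver_log, msg_id):
    """DM rule (strict form, an assumption): a DC is taken
    immediately before Receive(DM)."""
    r = receiver_log.index(("recv", msg_id))
    return r > 0 and receiver_log[r - 1] == ("ckpt", "DC")


sender = [("send", "m1"), ("ckpt", "SC")]
receiver = [("recv", "m1"), ("ckpt", "SC")]
print(sm_not_orphan(sender, receiver, "m1"))
print(dc_precedes_receive([("ckpt", "DC"), ("recv", "m2")], "m2"))
```

A trace in which the receiver has saved the receipt of a message but the sender has not saved its sending fails the first check, which is exactly the orphan-message situation.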

Checkpoint Rollback Criteria
To guarantee that all processes can roll back to globally consistent checkpoint states when some processes fail, the receiving event of a message must be cancellable whenever the sending event of the message is cancelled. This requirement applies to both SCs and DCs:
Criterion 3: $\forall SM$, if $SP_i$ rolls back to $SC_i - 1$, then $SP_j$ must be able to roll back to $SC_j - 1$.
Criterion 4: $\forall DM$, if $DP_i$ rolls back to $CKP_i$, then $DP_j$ must be able to roll back to $DC_j$.
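The rollback requirement (a cancelled send forces the matching receive to be cancelled) can be illustrated with a small fixed-point computation in Python. This is a sketch under assumptions that are not in the paper: checkpoints of each process are numbered 0, 1, 2, ..., each message is recorded as a hypothetical tuple `(sender, send_interval, receiver, recv_interval)`, and an event in interval n happened after checkpoint n and before checkpoint n+1.

```python
def propagate_rollback(messages, failed, target):
    """Return {process: checkpoint index} that each affected process rolls back to.

    Rolling back to checkpoint c cancels every event in intervals >= c.
    """
    rollback = {failed: target}
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            if sender in rollback and s_int >= rollback[sender]:
                # The send is cancelled, so the receive must be cancelled too:
                # the receiver returns to the checkpoint that precedes the receive.
                if rollback.get(receiver, float("inf")) > r_int:
                    rollback[receiver] = r_int
                    changed = True
    return rollback


# A sends to B, and B later sends to C.
messages = [("A", 1, "B", 2), ("B", 2, "C", 1)]
print(propagate_rollback(messages, "A", 1))
```

The loop terminates because rollback targets only decrease. For a DM, the checkpoint that precedes the receive is the DC taken just before delivery, which is why that DC must still exist when the sender rolls back.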

Checkpoint Indelibility Criteria
In a synchronous checkpointing strategy, the processes which join a checkpointing round can clear away their prior checkpoints once the round is finished. In an asynchronous checkpointing strategy, the processes eliminate checkpoints together after some process has failed, the globally consistent checkpoint has been found and rollback recovery has finished. In a hybrid checkpointing environment, however, the checkpoints of one process include both the SCs related to MHs in the same cluster and the DCs related to MHs in different clusters. When a process finishes an SC, the SCs earlier than it cannot simply be eliminated; otherwise, global inconsistency may appear. Whether a checkpoint can be eliminated is judged by the relations among checkpoints:

Definition 5. If $CKP_i \gg Send(M)$ and $CKP_j \gg Receive(M)$, then $CKP_j$ directly depends on $CKP_i$ between processes, written $CKP_j \Rightarrow CKP_i$.

Definition 6. Let $CKP_i$ and $CKP_{i+1}$ be contiguous checkpoints of $P_i$. Then $CKP_{i+1}$ directly depends on $CKP_i$ within the same process, written $CKP_{i+1} \Rightarrow CKP_i$.

Both relations above are called the checkpoint direct dependence relation and are not distinguished in the later sections.

Definition 7. If $CKP_i \Rightarrow CKP_j$ and $CKP_j \Rightarrow CKP_k$, then $CKP_i$ depends on $CKP_k$ indirectly, written $CKP_i \Rightarrow\Rightarrow CKP_k$.

Sensors 2017, 17, 2166

Figure 3 shows examples of these dependence relations. Let $SC_i$ and $SC_k$ be SCs of process P, and let $DC_x$ be a DC of P or of a process in the same cluster as P. Then the indelibility criterion for SCs can be described as follows:

Criterion 5: If $\exists DC_x\,(SC_i \Rightarrow DC_x \text{ or } SC_i \Rightarrow\Rightarrow DC_x) \wedge \exists SC_k\,(k > i)$, then $SC_i$ cannot be eliminated.

Criterion 5 indicates that if an SC which is not the newest depends on any DC directly or indirectly, then it cannot be eliminated. Because the DC is an asynchronous checkpoint, it may cause a cascading rollback. As Figure 4 shows, the rollback of either process $P_{m,s,c}$ or process $P_{m,j,t}$ causes $P_{i,k,s}$ to roll back to $DC^{m}_{i,k,s}$ (rollbacks 1 and 2 both cause rollback 3). If $DC^{m}_{i,k,s}$ is eliminated earlier than $SC^{t}_{m,s,c}$ and $SC^{y}_{m,j,t}$, then $P_{i,k,s}$ will not be able to roll back to a state consistent with the processes in $C_m$ when $P_{m,s,c}$ rolls back to $SC^{t}_{m,s,c}$ or $P_{m,j,t}$ rolls back to $SC^{y}_{m,j,t}$. Therefore, the indelibility criteria of DCs can be defined as follows:

Criterion 6: If $CKP_i$ is not eliminated, then $DC_j$ must not be cleaned up.

Criterion 6 shows that if the checkpoints of MHs in different clusters on which a DC depends are not cleaned up, then this DC cannot be eliminated either. Otherwise, rollback would go wrong.

Criterion 7: If $CKP_i$ has been eliminated, then $DC_j$ cannot be cleared away when one of the following conditions is satisfied:
(a) $DC_j$ is the only checkpoint of $DP_j$.
(b) $DC_j$ does not depend on the checkpoints of $DP_j$, but it depends on the checkpoints of processes in the same cluster as $DP_j$.

Criterion 7 means that if a DC is the only checkpoint of a process, or if the DC is the earliest checkpoint and the SCs on which it depends are not eliminated, then the DC cannot be cleared away. The former condition requires that the new state of the process be stored in a newer checkpoint first, while the latter demands that the elimination of a DC does not cause a wrong rollback. Hence, there is one conclusion: if the hybrid checkpointing strategy satisfies Criteria 1 to 7, then this strategy assures the global consistency of checkpoints.
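The rule that an SC which is not the newest cannot be eliminated while it depends, directly or indirectly, on a DC amounts to a reachability test on the checkpoint dependence graph. The Python sketch below is only an illustration: the graph encoding (a dictionary from a checkpoint name to the checkpoints it directly depends on), the naming convention that DC names start with "DC", and both function names are assumptions made here.

```python
def depends_on_dc(graph, start, is_dc):
    """True if `start` reaches a DC through one or more dependence edges."""
    stack, seen = list(graph.get(start, ())), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if is_dc(node):
            return True
        stack.extend(graph.get(node, ()))
    return False


def blocked_by_criterion_5(graph, sc, has_newer_sc, is_dc):
    """An SC that is not the newest and depends on a DC must be kept."""
    return has_newer_sc and depends_on_dc(graph, sc, is_dc)


# SC2 => SC1 => DC1 => SC0 (each edge points to the checkpoint depended on).
graph = {"SC2": ["SC1"], "SC1": ["DC1"], "DC1": ["SC0"]}
is_dc = lambda name: name.startswith("DC")
print(blocked_by_criterion_5(graph, "SC1", True, is_dc))
```

Once the DC is eliminated and its node removed from the graph, the same test stops blocking the older SCs, which is why cleaning up DCs matters for storage.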

Checkpoint Triggering Occasions
In distributed systems, checkpoints are usually triggered either at fixed time intervals or after a fixed number of messages. In an ad hoc network, the interval for periodic triggering is hard to choose because the network is dynamic and uncertain; moreover, a timer-based trigger may wake up MHs that are in a dormant state and cause unnecessary battery consumption. Therefore, both SCs and DCs in the hybrid checkpointing model adopt the message-based active trigger mechanism. In addition, SCs use the reactive trigger mechanism to reduce the number of SCs which cannot be eliminated. There are two different SC triggers:

1. Active Trigger based on Message: A process actively initializes a checkpointing procedure when it has received K messages from MHs in the same cluster (K is the threshold, K ≥ 1). The value of K grows with the failure-free continuous communication time and the storage space of the MH, and shrinks as the disconnection and failure probability of the MH increases. The choice of K should also consider the various overheads of the ad hoc network. This will be further discussed and verified in the experiments.

2. Reactive Trigger: It can be seen from Criterion 5 that if an SC depends on any DC directly or indirectly then it cannot be cleared away, so there may be many SCs which cannot be eliminated.

As Figure 5 shows, all SCs of $P_{i,k,l}$ and $P_{i,j,e}$ depend on $DC^{m}_{i,j,e}$ directly or indirectly, so they cannot be cleared away before $DC^{m}_{i,j,e}$ is eliminated. If the checkpoint frequency of cluster $C_i$ is higher than that of $C_m$, there may be many SCs of $C_i$ which cannot be cleaned up.

Figure 5. An example of the coexistence of multiple same cluster checkpoints.

The key to solving this problem is to clean up the DCs on which the SCs depend. According to Criteria 6 and 7, a DC can be eliminated only under two premises. The first is that the process has finished a new checkpoint (namely, this DC is not the newest checkpoint of the process). The second is that the checkpoints of processes in other clusters on which the process depends have been eliminated.
Let $C_i$ and $C_j$ be two different clusters. If the processes in $C_i$ which hold all the DCs related to $C_j$ (created on receiving messages from MHs in $C_j$) have finished certain checkpoints (meeting the first premise), then $C_i$ notifies $C_j$ to initialize a new checkpointing round (only after the new checkpoint is done can the former SCs be cleaned up, which meets the second premise). If the checkpoints related to the DCs in $C_i$ can be cleared away after the related processes in $C_j$ have finished a new SC, the probability that these DCs can be eliminated increases greatly. The SCs in $C_i$ which depend on these DCs can then be cleaned up as well. The detailed elimination rules of checkpoints will be discussed in Section 5.
Unlike the active trigger, this checkpointing procedure is initialized by the CH, and the related processes in the cluster act only as participants, so it is called a reactive trigger.
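The message-count rule of the active trigger can be sketched as a simple counter in Python. The class name `ActiveTrigger` and its method are hypothetical, and resetting the counter after each triggered round is an assumption; the text only states that a round starts after K same-cluster messages.

```python
class ActiveTrigger:
    """Signal a new SC coordination round after every K same-cluster messages."""

    def __init__(self, k: int):
        if k < 1:
            raise ValueError("threshold K must be at least 1")
        self.k = k
        self.count = 0

    def on_same_cluster_message(self) -> bool:
        """Count one message; return True when a checkpoint round should start."""
        self.count += 1
        if self.count >= self.k:
            self.count = 0  # assumption: counting restarts after each round
            return True
        return False


trigger = ActiveTrigger(k=3)
results = [trigger.on_same_cluster_message() for _ in range(7)]
print(results)
```

A larger K means fewer coordination rounds and fewer stored checkpoints, at the cost of more lost work after a failure, which is the trade-off behind the choice of K.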
The DC also uses the message trigger mechanism: a DC is triggered when the process receives a message from an MH in a different cluster. However, if a process took a DC every time it received a message from the same different cluster, the number of checkpoints would grow and the storage overhead would increase heavily.
As Figure 6 shows, $P_{i,j,e}$ in cluster $C_i$ receives messages $M_1$ and $M_2$ from processes $P_{m,s,c}$ and $P_{m,a,v}$ in cluster $C_m$ respectively. If $P_{i,j,e}$ takes $DC_1$ and $DC_2$ for these two messages respectively, then the checkpoint to which $P_{i,j,e}$ rolls back depends on which processes in $C_m$ fail.
The key to solve this problem is to clean up the DCs on which the SCs depend. According to the Criteria 6 and 7, the premises to eliminate DC include two conditions. One is that the process has finished new checkpoint (namely this DC is not the new checkpoint of the process). The other is that the checkpoints of processes in other clusters on which the process depends have been eliminated.
Let Ci and Cj to be two different clusters. If the processes in Ci which contains all DCs (created by the receiving of messages from MHs in Cj) related to Cj have finished the certain checkpoints (namely meet the first premise), then Ci notices Cj to initialize a new checkpointing (only the new checkpoint is done, the former SCs can be cleaned up and this meets the last premise). If some checkpoint dependent on the DCs in Ci can be cleared away after the related processes in Cj has finished new SC, then the elimination probability of the DCs increases heavily. Then the SCs in Ci which depend on the DCs could be cleaned up further. The detailed elimination rules of checkpoints will be discussed in Section 5.
Different from the active trigger, this checkpointing procedure is initialized by the CH and the related processes in cluster just act as the participants, so it is called a reactive trigger.
The DC uses the message trigger mechanism also, namely the DC is triggered when it receives messages from MHs in different clusters. However, if a process takes a DC when it receives a message from the same different cluster each time, it will increase the checkpoints and so augment the storage overhead heavily.
As Figure 6 shows, Pi,j,e in cluster Ci receives messages M1 and M2 from processes Pm,s,c and Pm,a,v in cluster Cm respectively. If Pi,j,e takes DC1 and DC2 respectively, then Pi,j,e will roll back according to the following situations when processes in Cm fail.  It can be seen, Pi,j,e will roll back to DC1 in most instances. Hence, for reducing the number of DCs, a process just takes one DC for each different cluster during one DC interval.

Strategies for Checkpoint Elimination
The storage space of MHs in an ad hoc network is quite limited, so it is necessary to clean up the overdue useless checkpoints in time. This section will firstly introduce the checkpoint dependence graph, and then design the relevant rules and strategies for checkpoint elimination. The Figure 7 below shows the IntraCDG of processes Pi,a,v and Pi,k,s (see in Figure 3). Because there are no dependence relations in proesses Pm,s,c and Pi,j do,t, they not have the dependence relations at present and so the IntraCDGm,s,c and IntraCDGi,j,t are empty. M 0 exists. If P m,a,v rolls back to C b m,s,c because of failure, P m,s,c will roll back to SC b m,s,c and P i,j,e should roll back to DC 1 ; if P m,s,c rolls back to SC b m,s,c because of failure, P i,j,e will roll back to DC 1 also. M 0 does not exist. If P m,a,v rolls back to C b m,a,v because of failure, P i,j,e will roll back to DC 2 ; if P m,s,c rolls back to SC b m,s,c because of failure, P i,j,e will roll back to DC 1 . It can be seen, P i,j,e will roll back to DC 1 in most instances. Hence, for reducing the number of DC s , a process just takes one DC for each different cluster during one DC interval.
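The two message-based triggers discussed above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the class name `TriggerState`, the returned action strings and the `new_interval` method are hypothetical, and how a DC interval ends is left to the caller because the text does not specify it here.

```python
from dataclasses import dataclass, field


@dataclass
class TriggerState:
    """Per-process trigger bookkeeping (all names are illustrative assumptions)."""
    cluster: str                      # cluster of this process
    k: int = 3                        # threshold K >= 1 (value chosen arbitrarily)
    same_cluster_msgs: int = 0        # SMs received since the last active trigger
    dc_taken_for: set = field(default_factory=set)  # clusters with a DC in this interval

    def on_message(self, sender_cluster: str) -> str:
        """Return the checkpoint action caused by one received message."""
        if sender_cluster == self.cluster:
            # Active trigger: start same-cluster checkpointing after K messages.
            self.same_cluster_msgs += 1
            if self.same_cluster_msgs >= self.k:
                self.same_cluster_msgs = 0
                return "start_SC"
            return "none"
        # Message from a different cluster: at most one DC per cluster per interval.
        if sender_cluster in self.dc_taken_for:
            return "none"
        self.dc_taken_for.add(sender_cluster)
        return "take_DC"

    def new_interval(self) -> None:
        """Begin a new DC interval (when this happens is an assumption of the caller)."""
        self.dc_taken_for.clear()
```

With `k=2`, for instance, the first message from another cluster yields `"take_DC"`, a second message from that cluster yields `"none"`, and the second same-cluster message yields `"start_SC"`.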

Strategies for Checkpoint Elimination
The storage space of MHs in an ad hoc network is quite limited, so it is necessary to clean up overdue, useless checkpoints in time. This section first introduces the checkpoint dependence graph and then designs the relevant rules and strategies for checkpoint elimination. Figure 7 shows the IntraCDGs of processes Pi,a,v and Pi,k,s (see Figure 3). Because processes Pm,s,c and Pi,j,t have no dependence relations at present, IntraCDGm,s,c and IntraCDGi,j,t are empty.

Intra-Cluster Checkpoint Dependence Graph
In order to describe clearly the dependence relations among the checkpoints of processes in the same cluster, every process stores an Intra-Cluster Checkpoint Dependence Graph (IntraCDG). The nodes of the IntraCDG are certain checkpoints or DCs, and its edges are directed. An edge starts at one checkpoint and ends at another checkpoint on which the former depends. According to Definitions 5 and 6, the IntraCDG is updated in the following two situations:
1. The update of dependence relations among processes. When a process Pi receives an SM, it first obtains the most recent checkpoint of the sender process and then checks whether the edge that starts at this checkpoint and ends at the current checkpoint of Pi exists in IntraCDGi. If the edge does not exist, it is added to IntraCDGi.
2. The update of dependence relations within the same process. When Pi takes a new certain checkpoint or DC, it adds to IntraCDGi a directed edge that starts at the current checkpoint (a certain checkpoint or DC) and ends at the new checkpoint.
Figure 7. IntraCDGi,a,v and IntraCDGi,k,s.

Inter-Cluster Checkpoint Dependence Graph
In order to describe the dependence relations among the checkpoints of processes in different clusters, the source cluster head and the target cluster head store an Inter-Cluster Checkpoint Dependence Graph (InterCDG) for sent messages and received messages. These graphs are denoted S_InterCDG and R_InterCDG, respectively. The nodes of the InterCDG are certain checkpoints or DCs, and its edges are directed. An edge starts at one checkpoint and ends at another checkpoint on which the former depends.
In Figure 4, DC^m_{i,k,s} depends on SC^t_{m,s,c}. This dependence relation can be recorded in the InterCDGs related to M1. The InterCDGs are stored in CHi and CHm, respectively. The figure is comparatively simple and is therefore not given.
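The two IntraCDG update situations from the Intra-Cluster Checkpoint Dependence Graph subsection can be sketched as follows. This is only an illustration under assumptions: checkpoints are represented by plain string labels, and the class and method names (`IntraCDG`, `on_same_cluster_message`, `on_new_checkpoint`) are hypothetical rather than taken from the paper.

```python
class IntraCDG:
    """Directed checkpoint dependence graph kept by one process (illustrative)."""

    def __init__(self):
        # Each edge is a (start_checkpoint, end_checkpoint) pair.
        self.edges = set()

    def on_same_cluster_message(self, sender_ckpt, current_ckpt):
        """Situation 1: record the edge from the sender's most recent checkpoint
        to the receiver's current checkpoint, unless it already exists.
        Returns True when a new edge was added."""
        edge = (sender_ckpt, current_ckpt)
        if edge in self.edges:
            return False
        self.edges.add(edge)
        return True

    def on_new_checkpoint(self, current_ckpt, new_ckpt):
        """Situation 2: link the current checkpoint to the newly taken one."""
        self.edges.add((current_ckpt, new_ckpt))
```

An InterCDG kept by a cluster head could reuse the same edge-set representation, with one instance for S_InterCDG and one for R_InterCDG; that reuse is likewise an assumption of this sketch.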

Rules and Strategies of Checkpoint Elimination
The checkpoint elimination includes two parts: the elimination of SC and elimination of DC.

Elimination of SC
As Criterion 5 shows, an SC can be cleared away if it is not the newest checkpoint of its process and it does not depend on any DC.
Let IntraCDGglobal be the global checkpoint dependence graph of all processes in one cluster. SCi and SCk are SCs of Pi. Set-Path(SCi) is the set of paths in IntraCDGglobal that end at SCi. DCx is any DC. The elimination rule of SC can be described formally as follows:
Rule 1: If there is a node SCk (k > i) in IntraCDGglobal and no path in Set-Path(SCi) starts from any DCx, then SCi can be cleaned up.
SCs are eliminated on two occasions:
1. When a process finishes a new SC, the initiator of the checkpointing (active trigger) or the CH (reactive trigger) collects the IntraCDGglobal and checks whether the earlier checkpoints can be cleaned up.
2. After any DC is cleared away by a process, the process notifies the CH to collect the IntraCDGglobal. Then the CH checks whether the earlier checkpoints can be cleaned up.

Elimination of DC
It can be seen from Criterion 6 that a DC cannot be cleared away if the checkpoints of processes in other clusters on which the DC directly depends have not been eliminated. Hence, one premise for cleaning up a DC is that all checkpoints from other clusters on which the DC directly depends have been cleared away. However, even if this premise is met, it is still necessary to check whether the elimination of the DC satisfies the conditions in Criterion 7. The elimination of DCs is based on the elimination of SCs. If the conditions are not satisfied, the DC still cannot be cleared away.
Let DCx be any DC of process Pi. CKPi is any kind of checkpoint of Pi. CKPj is any checkpoint of a process in the same cluster as Pi. Edge(A, B) denotes the edge in IntraCDG that starts at A and ends at B. Then the elimination rules of DC can be described formally as follows:
Rule 2: If ∃Edge(CKPi, DCx) ∈ IntraCDGi, then DCx can be cleaned up.
Rule 3: If (¬∃Edge(CKPi, DCx) ∧ ∃Edge(DCx, CKPi) ∧ ¬∃Edge(CKPj, DCx) ∈ IntraCDGi), then DCx can be cleaned up.
Let Pi be the process that cleans up DCx; the algorithm for the elimination of DC is given as Algorithm 1.
Because the CH to which the sending process of a DM belongs records S_InterCDG, the process needs to inform that CH to delete the checkpoint and the related edges in S_InterCDG when it cleans up the SC. If the CH finds an isolated DC, DCx, it informs the related cluster to judge whether the DC can be eliminated according to Rules 2 and 3. After receiving the processing request for the DC, the related process decides the next operation by its own IntraCDG:

1. If IntraCDG contains an edge that starts at a checkpoint CKPi of the process and ends at DCx, then DCx is cleared away. The edge that starts at CKPi and ends at DCx is eliminated, and CKPi replaces DCx. Finally, the CH informs the processes in the same cluster to change DCx to CKPi in their own IntraCDGs.
2. If there is no edge that starts at the process's own checkpoint CKPi and ends at DCx, but there is an edge that starts at DCx and ends at its own checkpoint CKPj, then two situations exist, given in items 3 and 4.
3. If IntraCDG contains no edge that starts at a checkpoint CKPa of a process in the same cluster and ends at DCx, then DCx can be cleared away.
4. If IntraCDG contains an edge that starts at a checkpoint CKPa of a process in the same cluster and ends at DCx, then the process turns DCx into SCk and informs the other processes in the same cluster to change DCx in their IntraCDGs to SCk.

Theorem 1. The strategy of checkpoint elimination satisfies the correctness criteria of the hybrid checkpointing model.

Proof. We prove this for the SC and the DC, respectively:
1. Let SCi and SCj be SCs of any process Pi, and let DCx be any DC. According to Rule 1, if there is no path from DCx to SCi but there is a node SCj (j > i) in IntraCDGglobal, then SCi is not the current checkpoint of Pi and there is no DCx meeting the condition SCi ⇒ DCx or SCi ⇒⇒ DCx. In this case, no rollback of a process in a different cluster will cause Pi to roll back to SCi. Hence, SCi can be eliminated. It can be seen that the elimination of SCs meets Criterion 5.
2. Let the cluster to which the DC belongs be Cm. If there is an isolated DCx in S_InterCDGm after CHm has handled the set of checkpoints that could be eliminated, then no checkpoint CKPi of a process in Cm meets the condition DCx ⇒ CKPi. Therefore, no rollback of a process in Cm will cause Pi to roll back to DCx.
Let CKPa be any kind of checkpoint of Pi, and let CKPb be any kind of checkpoint of a process in the same cluster as Pi. If ∃Edge(CKPa, DCx) ∈ IntraCDGi when DCx has no relation with checkpoints in Cm, then Pi can roll back to CKPa instead. Hence, the elimination of DCx will not cause a wrong rollback. If ¬∃Edge(CKPa, DCx) ∈ IntraCDGi and ∃Edge(DCx, CKPa) ∈ IntraCDGi ∧ ¬∃Edge(CKPb, DCx) ∈ IntraCDGi, then no rollback will cause Pi to roll back to DCx, and there is a checkpoint that stores the newest state of Pi. Hence, DCx can be eliminated. It can be seen that the elimination of DCs meets Criteria 6 and 7. In conclusion, the strategy of checkpoint elimination satisfies the correctness criteria of the hybrid checkpointing model.
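The elimination rules of this section can be made concrete with a short Python sketch. It rests on several assumptions that are not in the paper: a checkpoint is a tuple `(process, kind, sequence_number)` with `kind` equal to `"SC"` or `"DC"`, the graph is a set of `(start, end)` edges, and the function names and returned action strings are hypothetical. `can_eliminate_sc` follows Rule 1 as it is restated in the proof of Theorem 1 (a newer SC exists and no path leads from a DC to the SC); `dc_action` follows the four-item procedure for an isolated DC.

```python
def can_eliminate_sc(nodes, edges, sc):
    """Rule 1 sketch: sc is removable if its process has a newer SC and
    no path in the graph leads from any DC to sc."""
    proc, _, seq = sc
    has_newer = any(p == proc and k == "SC" and s > seq for (p, k, s) in nodes)
    if not has_newer:
        return False
    # Map every node to its direct predecessors.
    preds = {}
    for start, end in edges:
        preds.setdefault(end, set()).add(start)
    # Walk backwards from sc; meeting a DC means a path DC -> ... -> sc exists.
    seen, stack = {sc}, [sc]
    while stack:
        node = stack.pop()
        for prev in preds.get(node, ()):
            if prev in seen:
                continue
            if prev[1] == "DC":
                return False
            seen.add(prev)
            stack.append(prev)
    return True


def dc_action(edges, dc, owner):
    """Decide what the owner process does with an isolated DC (items 1 to 4)."""
    into_dc = [a for (a, b) in edges if b == dc]
    out_of_dc = [b for (a, b) in edges if a == dc]
    if any(a[0] == owner for a in into_dc):
        return "replace_with_own_checkpoint"      # item 1
    if any(b[0] == owner for b in out_of_dc):     # item 2
        if any(a[0] != owner for a in into_dc):
            return "convert_to_SC"                # item 4
        return "clear"                            # item 3
    return "keep"
```

The backward traversal is one possible way to test the path condition; the paper itself only states the rule, not how the check is carried out.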

Algorithm for Checkpoints in the Same Cluster
Different from the traditional synchronous checkpointing strategy, when a process initializes the checkpointing, it first collects the dependence list of each process in the same cluster to obtain the dependence relations among processes. Then it finds the processes that have to remain synchronous with it in order to coordinate the checkpoints. However, if multiple processes in the same cluster initialize checkpointing at the same time, unnecessary wireless communication overhead is created. Hence, the CH of each cluster controls one token: to avoid concurrent initializations, only the process that gets the token from the CH is allowed to initialize the checkpointing. The set of all processes in the active state in one cluster is called the process list (PL).
In order to reduce the space that useless checkpoints occupy in the precious storage of wireless terminals, this paper divides the SCs into uncertain checkpoints and certain checkpoints (see Definitions 2 and 3). Uncertain checkpoints are turned into certain checkpoints if and only if all processes that take part in the checkpointing finish the coordination of checkpoints. Otherwise, these uncertain checkpoints are eliminated, namely their caches are released. Different from the DL, which stores the list of processes that depend on one process directly, the LDS stores the set of processes that depend on one process directly or indirectly. The dependence relations in the LDS can be derived from the DLs of different processes:
Definition 8. If Pi sends a message to Pj, then Pj depends on Pi directly, written Pj → Pi.
Definition 9. If Pk → Pj and Pj → Pi, then Pk depends on Pi transitively, written Pk → Pi.
Definition 10. The least dependence set (LDS) of Pi is the set of processes in the same cluster as Pi that depend on Pi directly or transitively. If Pj → Pi or Pk → Pi, then Pj, Pk ∈ LDSi.
To describe the up-dependence relations, the dependence list (DL) stores the set of processes that directly depend on one process, so the transitive dependence relations for the LDS can be obtained from the DLs of different processes. The steps of checkpointing in the same cluster are as follows:

1. When process P has received K SMs, it sends a token request to the CH. If the token is not assigned, P gets the token. Otherwise, P keeps applying for the token. During this interval, if P receives a checkpoint request from another process, it takes an uncertain checkpoint.
2. After getting the token, P initializes the checkpointing. This procedure includes collecting the DLs of the other processes, calculating the LDS and so on.
3. P takes an uncertain checkpoint and sends the checkpoint request and the LDS to all of the other processes in the same cluster. If some members are in the disconnected state, it sends the information to the CH.
4. After getting the checkpoint request, the processes take uncertain checkpoints if they are part of the LDS. Then they send replies to P.
5. If P receives all of the positive replies, it turns its uncertain checkpoint into a certain checkpoint and notifies the processes in the LDS to do the same. If some members are in the disconnected state, P sends the turn request to the CH. If P does not receive all of the positive replies within a time interval, P clears away the uncertain checkpoint and informs all of the processes in the same cluster.
6. If P finishes the checkpointing successfully, P collects the IntraCDGs of the processes in the same cluster to build the IntraCDGfinal and checks whether any checkpoints can be eliminated according to Rule 1. If such checkpoints exist, P notifies the processes in the same cluster to execute the elimination.
Let Pj be the process that initializes the checkpointing. Pj stays in the cluster Cm. PLm denotes the process list of Cm. CSNi indicates the checkpoint sequence number (CSN) of Pj. UCi expresses whether Pj has taken the uncertain checkpoint; the initial value of UCi is 0. DPm indicates the set of processes that are disconnected in Cm. MNi indicates the number of messages received by Pi from processes in the same cluster. The algorithm of the process that initializes the checkpointing is given as Algorithm 2.
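Step 2 above requires the initiator to calculate the LDS from the collected DLs. A minimal sketch of that calculation is shown below; it assumes the DLs are available as a dictionary that maps each process name to the set of processes that directly depend on it, and the function name `least_dependence_set` is hypothetical.

```python
def least_dependence_set(dl, initiator):
    """Return the processes that depend on `initiator` directly or transitively.

    dl: dict mapping a process to the set of processes that directly depend
        on it (the collected dependence lists; this layout is an assumption).
    """
    lds = set()
    stack = [initiator]
    while stack:
        proc = stack.pop()
        for dependent in dl.get(proc, ()):
            # The initiator itself is not part of its own LDS.
            if dependent != initiator and dependent not in lds:
                lds.add(dependent)
                stack.append(dependent)
    return lds
```

For example, if Pj depends on Pi and Pk depends on Pj, the LDS of Pi contains both Pj and Pk, while a process on which Pi merely depends is left out.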

Algorithm for Checkpoints in Different Clusters
The DC is triggered by a message that is sent from a process in a different cluster. The processing of the DC is relatively simple. Its detailed steps are as follows:
1. When a process sends a DM to MHs in different clusters, it adds the current checkpoint to the message.
2. After receiving the DM, the CH updates S_InterCDG according to the dependence relation between the checkpoints of the sender and receiver processes and then transmits the DM.
3. After receiving the DM, the CH of the cluster to which the receiver process belongs also updates R_InterCDG and then transmits the DM to the receiver process.
4. By the time the process receives the DM, if it has already taken a DC for the sender's cluster, it receives the DM directly. Otherwise, it checks whether it has received or sent any message during the current checkpoint interval. If it has sent or received any message, it stores an OC; otherwise, it turns the current checkpoint into an OC.
Let DM be a message sent by Pi (which belongs to Cs) and received by Pj (which belongs to Ct). CKPi and DCj are the checkpoints of Pi and Pj related to DM, respectively. The algorithm for checkpoints in different clusters is described as Algorithm 3.

Algorithm 3 DC_Processing( )
SEND_message(Pi, DM); // Pi sends DM
TRANSMIT_message(CHs, DM); // CHs transmits DM
UPDATE_S_InterCDG(Pi, Pj); // update S_InterCDG
TRANSMIT_message(CHt, DM); // CHt transmits DM
UPDATE_R_InterCDG(Pi, Pj); // update R_InterCDG
RECEIVE_message(Pj, DM); // Pj receives DM
if (Pj has taken DCj) then
    RECEIVE(DM);
else if (Pj has sent or received any message during the current checkpoint interval) then
    store an OC;
else
    turn the current checkpoint to OC.

Algorithm for Failure Recovery
When a process fails, it first asks the related processes to roll back to a globally consistent state. Then these processes re-execute the operations stored in the logs. The keys to correct rollback recovery are determining the earliest rollback set of the process and finding the processes in different clusters that belong to the cascade rollback.
Definition 11. The rollback set of Pi is the set of checkpoints to which Pi and the processes in the same cluster may roll back; it is recorded as RSi.
Definition 12. The earliest rollback set is the set of earliest checkpoints to which Pi and the processes in the same cluster must roll back; it is recorded as ERSi, ERSi ⊆ RSi.
Definition 13. Let CKPi be the checkpoint to which the failed process Pi needs to roll back. If ∃CKPk (CKPk ⇒ CKPi ∨ CKPk ⇒⇒ CKPi), then CKPk ∈ RSi.
Definition 14. Let Pi be the failed process and Pj a process in the same cluster as Pi. CKP1, CKP2, ..., CKPn are the checkpoints of Pj that belong to RSi. If ∃CKPx, x ∈ {1, 2, ..., n}, such that the condition CKPx ⇒ CKPk (k ∈ {1, 2, ..., n}, x ≠ k) is not satisfied, then CKPx ∈ ERSi.
The ERS calculation procedure for the failed process Pi is as follows:
1. Pi collects the IntraCDGs of all processes in the same cluster and builds the IntraCDGglobal.
2. Pi performs a depth-first traversal of the IntraCDGglobal starting from the checkpoint to which Pi has to roll back, and adds every reachable checkpoint to RSi. Finally, the earliest checkpoint of each process in RSi is added to ERSi.
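The two-step ERS calculation can be sketched in Python as follows. The representation is an assumption made for illustration: each checkpoint is a `(process, sequence_number)` pair, a smaller sequence number means an earlier checkpoint, IntraCDGglobal is a collection of `(start, end)` edges, and the function name `rollback_sets` is hypothetical.

```python
from collections import defaultdict


def rollback_sets(edges, start):
    """Compute (RS, ERS) for a failed process.

    edges: iterable of (start_ckpt, end_ckpt) pairs from IntraCDGglobal.
    start: the checkpoint to which the failed process has to roll back.
    Returns RS as a set of checkpoints and ERS as a dict that maps each
    process to the sequence number of its earliest checkpoint in RS.
    """
    successors = defaultdict(list)
    for a, b in edges:
        successors[a].append(b)

    # Depth-first traversal with an explicit stack; RS includes the start node.
    rs = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in successors[node]:
            if nxt not in rs:
                rs.add(nxt)
                stack.append(nxt)

    # Keep only the earliest reachable checkpoint of every process.
    ers = {}
    for proc, seq in rs:
        if proc not in ers or seq < ers[proc]:
            ers[proc] = seq
    return rs, ers
```

Checkpoints that cannot be reached from the starting checkpoint, such as those of a process that only sent messages to the failed process, stay outside RS and therefore outside ERS.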
Based on the ERS, we define the following rollback rules:
Rule 4: Let Pi be the failed process and CKPx any checkpoint of a process Pj in the same cluster as Pi. If CKPx ∈ ERSi, then Pj should roll back to CKPx.
Rule 5: Let Pi be the failed process and CKPx any checkpoint of a process Pj in the same cluster as Pi, with CKPx ∈ RSi. If ∃Edge(CKPx, DCx) ∈ S_InterCDGm, then the related process in the other cluster is notified to roll back to DCx.
Let Pi be the failed process and CKPi the checkpoint to which Pi has to roll back. The rollback recovery algorithm is described as Algorithm 4.
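Rules 4 and 5 can be read as a small planning step once RS and ERS are known. The sketch below is not Algorithm 4 itself; it assumes that checkpoints are `(process, sequence_number)` pairs, that ERS is a dictionary from process to sequence number, that S_InterCDG is a set of `(local_checkpoint, remote_DC)` edges, and that the function name `rollback_plan` is hypothetical.

```python
def rollback_plan(rs, ers, s_intercdg_edges):
    """Apply Rules 4 and 5 to an already computed RS and ERS.

    Returns (local, remote): `local` maps each process in the cluster to the
    checkpoint sequence number it rolls back to (Rule 4); `remote` is the
    sorted list of DCs in other clusters whose owners must be notified (Rule 5).
    """
    # Rule 4: every process rolls back to its checkpoint in ERS.
    local = dict(ers)
    # Rule 5: any edge leaving a checkpoint in RS points at a remote DC.
    remote = sorted({dc for (ckpt, dc) in s_intercdg_edges if ckpt in rs})
    return local, remote
```

Edges in S_InterCDG that start at checkpoints outside RS are ignored, so clusters that do not depend on the rolled-back checkpoints receive no rollback request.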

Theorem 2. The strategy of rollback satisfies the correctness criteria of the hybrid checkpointing model.
Proof. By contradiction. If the rollback strategy does not satisfy the correctness criteria, then there must be a message M whose sender process has cancelled the sending event of M while the receiver process has not cancelled the receiving event of M. Let Pi and Pj be the sender and receiver processes of M, respectively. CKPi and CKPj are the newest checkpoints of Pi and Pj when they send and receive M, respectively.
1. M is an SM. By the time Pj receives M, IntraCDGj records the dependence relation CKPj ⇒ CKPi. If some process in the same cluster as Pj builds the IntraCDGglobal because of a failure, there must be a path from CKPi to CKPj. Therefore, when Pi rolls back to CKPi because of a failure or a rollback request, Pj is sure to roll back to CKPj because CKPj belongs to RSj.
2. M is a DM. In this case, CKPj is DCj. After Pi sends M, the cluster head CHk of Pi's cluster receives M first. When CHk receives M, it transmits M and records the dependence relation CKPj ⇒ CKPi in S_InterCDGk. When Pi rolls back to CKPi because of a failure or a rollback request, CHk is sure to require Pj to roll back to CKPj because there is an edge that starts at CKPi and ends at CKPj in S_InterCDGk.
Thus the assumption does not hold, and the rollback strategy meets the correctness criteria.

Processing of Handoffs
When an MH leaves one cluster and enters another cluster in the ad hoc network, namely when a handoff occurs, the relations of the MH within the same cluster and with different clusters all change. We call an MH that undergoes a handoff the HandOff Host (HOH). The cluster that the HOH leaves is called the old cluster, and the cluster that the HOH enters is called the new cluster.
After a handoff, several problems may arise. If the old cluster does not deal with the change in time, any checkpointing procedure of processes in the old cluster may fail to complete because of the absence of the HOH. The rollback of processes may also fail, because the processes cannot obtain all of the dependence relations of the checkpoints. If the new cluster does not deal with the change in time, the DCs of the HOH may not be cleared away, which further hinders the elimination of more checkpoints. Moreover, the checkpointing in the new cluster will ignore the arrival of the HOH and thus cause uncoordinated checkpoints. Hence, the HOH, the old cluster and the new cluster must all handle this change to assure the correctness of checkpoints when a handoff happens.

Process States and Treatment
Each process running in the HOH is in one of the following states when the handoff happens. Figure 8 is the state transition diagram.
1. Running state. The process runs or communicates normally.
2. Checkpoint state. The process is taking a checkpoint. The process is either a CKP Initiator or a CKP Participant. CKP Initiator means that the checkpointing procedure was initiated by this process, while CKP Participant means that this process just takes part in a checkpointing procedure initiated by another process.
3. Rollback state. The process is performing rollback recovery. This state is divided into the Active Rollback and Reactive Rollback states. The Active Rollback state indicates that the process is the failed process, namely the initiator of the rollback, while the Reactive Rollback state means that the process rolls back because of the cascade rollback caused by the rollback request of another process.
When the HOH leaves the old cluster and enters the new cluster, the processes running in the HOH should be handled according to their different states. Except for the processes in the running state, all processes running in the HOH have to finish their current work (checkpointing or rollback operation) together with the processes in the old cluster to assure the consistency of checkpoints. Hence, the HOH informs the CH of the old cluster of its location and state when it enters the new cluster. Then the CH judges whether it is necessary to notify the processes in its cluster.

The Process Stays in the Checkpoint State
CKP Initiator. If the process is in the stage of receiving DLs, then it goes on waiting for the DLs sent from processes in the old cluster. If the process has sent the LDS, then it waits for the replies from processes in the old cluster. After that, the process continues executing the checkpointing until the end.
CKP Participant. The process waits for the coordination request from the initiator of the checkpointing and replies until the end of the checkpointing procedure.

The Process Stays in the Rollback State
Active Rollback. The HOH goes on obtaining the IntraCDG of processes in the old cluster and calculates the ERS according to the IntraCDG. The LCH sends the cluster id of the new cluster to the CH of the old cluster. Then the process executes the rollback operation according to the ERS and sends the ERS to the related processes in the old cluster to inform them to roll back.
Reactive Rollback. The process waits for the ERS sent by the process running in the old cluster and executes the rollback after receiving the ERS.
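The state-dependent treatment during a handoff can be summarized as a small lookup. The sketch below is only an illustration: the names `ProcState`, `PENDING_WORK` and `ready_for_update` are hypothetical, and it assumes that a process returns to the running state once its checkpointing or rollback work with the old cluster is finished.

```python
from enum import Enum, auto


class ProcState(Enum):
    RUNNING = auto()
    CKP_INITIATOR = auto()
    CKP_PARTICIPANT = auto()
    ACTIVE_ROLLBACK = auto()
    REACTIVE_ROLLBACK = auto()


# Work each process on the HOH still has to finish with the old cluster
# (paraphrased from the text; the wording is not the paper's).
PENDING_WORK = {
    ProcState.RUNNING: "nothing to finish",
    ProcState.CKP_INITIATOR: "collect DLs or replies from the old cluster, then finish checkpointing",
    ProcState.CKP_PARTICIPANT: "answer the initiator's coordination requests until checkpointing ends",
    ProcState.ACTIVE_ROLLBACK: "obtain IntraCDG, compute ERS, roll back and send ERS to the old cluster",
    ProcState.REACTIVE_ROLLBACK: "wait for the ERS from the old cluster, then roll back",
}


def ready_for_update(states):
    """True once every process on the HOH is in the running state (assumed)."""
    return all(state is ProcState.RUNNING for state in states)


hoh_states = [ProcState.RUNNING, ProcState.CKP_PARTICIPANT]
for state in hoh_states:
    print(state.name, "->", PENDING_WORK[state])
print("ready for dependence update:", ready_for_update(hoh_states))
```

Keeping the treatment in one table makes it easy to check that every state is covered before the dependence relations are updated.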

Update of Checkpoint Dependence Relation
Because the relations of the HOH within the same cluster and across different clusters all change after it leaves the old cluster and enters the new cluster, the dependence relations among the checkpoints in the old or new cluster and the checkpoints of processes in the HOH need to be updated once all processes in the HOH have finished their current work. The main update work is to renew the related data structures such as the PL, DL, IntraCDG and InterCDG. Let C k and C i be the old cluster and the new cluster related to the HOH, respectively. PS is the set of processes running in the HOH.

Update of Checkpoints Dependence Relation in Old Cluster
Firstly, the processes in PS need to be deleted from PL k and from the DLs of all processes in the old cluster. Then the same-cluster dependence relations among checkpoints of processes in PL k and PS should be turned into dependence relations among different clusters. That means modifying the related IntraCDG and InterCDG.
Let P t ∈ PS, P j ∈ PL k . CKP a and CKP b are the checkpoints of P t and P j , respectively. The algorithm for updating the checkpoint dependence relations in the old cluster is described in Algorithm 5.
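The old-cluster update can be sketched as follows. This is not a reproduction of Algorithm 5; it is a hedged illustration built on several assumptions: checkpoints are modeled as `(process, sequence_number)` tuples, the IntraCDG is a set of `(source, target)` edges, and, by symmetry with the new-cluster rule, edges that end at a remaining old-cluster process go to R_InterCDG k while edges that end at a process in PS go to S_InterCDG k. The function name `update_old_cluster` is hypothetical.

```python
def update_old_cluster(pl_k, dls, intra_cdg, ps):
    """Return the old cluster's structures after the processes in `ps` leave.

    pl_k:      set of process ids in the old cluster
    dls:       dict mapping process id -> set of process ids (its DL)
    intra_cdg: set of (src_ckp, dst_ckp) edges, each ckp = (process, seq)
    ps:        set of process ids running on the HOH
    """
    new_pl = pl_k - ps
    new_dls = {p: dl - ps for p, dl in dls.items() if p not in ps}
    kept_intra, s_inter, r_inter = set(), set(), set()
    for src, dst in intra_cdg:
        src_left, dst_left = src[0] in ps, dst[0] in ps
        if not src_left and not dst_left:
            kept_intra.add((src, dst))   # still a same-cluster dependence
        elif src_left and dst_left:
            continue                     # both endpoints moved with the HOH
        elif dst_left:
            s_inter.add((src, dst))      # now ends in another cluster
        else:
            r_inter.add((src, dst))      # now starts in another cluster
    return new_pl, new_dls, kept_intra, s_inter, r_inter


pl_k = {"P1", "P2", "P3"}
dls = {"P1": {"P2", "P3"}, "P2": {"P1"}, "P3": {"P1"}}
intra = {
    (("P1", 1), ("P2", 1)),
    (("P1", 1), ("P3", 1)),
    (("P3", 2), ("P2", 2)),
}
result = update_old_cluster(pl_k, dls, intra, {"P3"})
print(result)
```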

Update of Checkpoints Dependence Relation in New Cluster
Firstly, the processes in PS need to be added to PL i . When CH i receives the subgraph SubInterCDG from CH k , it adds the edges that end at the checkpoints of processes in PS to R_InterCDG i and adds the rest to S_InterCDG i . Then CH i builds new same-cluster relations for the processes in PS, namely it updates the IntraCDGs and DLs of the related checkpoints.
Let P t ∈ PS, P s ∈ C i . CKP c and CKP d are the checkpoints of P t and P s , respectively. The algorithm for updating the checkpoint dependence relations in the new cluster is described in Algorithm 6.

Algorithm 6 NewCluster_CKPUpdate( )
CHANGE_dependence(CKP c , CKP d ); //change the dependence relation of CKP c and CKP d
for
    NOTICE_clusterchange(CH m , P t ); //notify CH m of the cluster change of P t
for (P x ∈ PL i ) do
    CHANGE_IntraCDG(P x ); //P x modifies its IntraCDG
    CHANGE_DL(P x ); //P x modifies its DL
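The first steps of the new-cluster update, adding PS to PL i and splitting the received SubInterCDG, can be sketched as below. This is an illustration under assumptions rather than an implementation of Algorithm 6: checkpoints are modeled as `(process, sequence_number)` tuples, the graphs are sets of `(source, target)` edges, and the function name `update_new_cluster` is hypothetical. The later steps (updating the IntraCDGs and DLs) are not covered.

```python
def update_new_cluster(pl_i, r_inter, s_inter, sub_inter_cdg, ps):
    """Merge the HOH's processes and the received subgraph into cluster i.

    Edges that end at a checkpoint of a process in `ps` are added to
    R_InterCDG_i; all other edges are added to S_InterCDG_i.
    """
    new_pl = pl_i | ps
    new_r, new_s = set(r_inter), set(s_inter)
    for src, dst in sub_inter_cdg:
        if dst[0] in ps:
            new_r.add((src, dst))
        else:
            new_s.add((src, dst))
    return new_pl, new_r, new_s


pl_i = {"P7", "P8"}
sub = {
    (("P1", 1), ("P3", 1)),  # ends at a checkpoint of P3, which is in PS
    (("P3", 2), ("P2", 2)),  # starts at a checkpoint of P3
}
print(update_new_cluster(pl_i, set(), set(), sub, {"P3"}))
```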

Experiment Environment and Parameters
This experiment builds a simulation environment of a mobile ad hoc network on Windows XP. The simulation network consists of five clusters, each of which includes five MHs. The routing protocol in the experiment is the Ad hoc On-Demand Distance Vector (AODV) routing protocol. The sizes of one checkpoint and one message log are set to 2 KB and 50 B, respectively. Table 1 shows the parameters of this experiment. The experiment compares the proposed hybrid checkpointing protocol (HCP) with the pure Synchronous Checkpointing Protocol (SCP) and the pure Asynchronous Checkpointing Protocol (ACP). In the pure synchronous checkpointing strategy, a process sends the coordination request to the related processes in the same cluster and in different clusters after receiving K messages. The process first takes an uncertain checkpoint and then turns it into a certain one after the coordination succeeds. In the pure asynchronous checkpointing strategy, a process takes a checkpoint independently after receiving K messages. According to the characteristics of ad hoc networks, this experiment adopts the following performance indexes: ACT, ART, ASS, AAM and ACFM.

Experiment Results and Analysis
This experiment is divided into two parts. The first part tests the message threshold K for checkpointing in the experiment environment. The second part tests the ACT, ART, ASS, AAM and ACFM of the three protocols under different conditions.

Experiment 1 Selection of K
The checkpoint frequency of a process is decided by the message threshold K in HCP. If K increases, then the number of checkpoints decreases and the ART increases. What is more, if K increases, then the logs between checkpoints increase. Because the storage overhead of checkpoints is much larger than the average storage overhead of message logs, the storage overhead decreases as K increases.
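The storage argument can be checked with a back-of-the-envelope model. The sketch below is a simplification and not the paper's measurement: it assumes one 2 KB checkpoint (taken as 2048 bytes) is stored for every K received messages and one 50 B log for every message, and it ignores checkpoint elimination. The function name `storage_per_message` is hypothetical.

```python
CHECKPOINT_BYTES = 2 * 1024  # one checkpoint, 2 KB as in the experiment setup
LOG_BYTES = 50               # one message log, 50 B


def storage_per_message(k):
    """Average bytes stored per received message when a checkpoint is taken every k messages."""
    return CHECKPOINT_BYTES / k + LOG_BYTES


for k in (5, 10, 20):
    print(k, storage_per_message(k))
```

Under this model the checkpoint term shrinks as K grows while the log term stays fixed, which matches the trend described above.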
For the additional messages, the checkpoint frequency decreases as K increases, so the additional messages for checkpoint coordination decrease, but the additional messages for rollback recovery increase. The AAM therefore does not decrease monotonically as K increases. Hence, this experiment takes the lowest AAM as the main criterion for selecting K. Figure 9 shows the AAM for varying values of K under different failure probabilities. It can be seen that the AAM under the different failure probabilities is lowest when K is 10. Therefore, the follow-up experiments use this value of K.

Experiment 2 Test of ACT
Figure 10 shows the ACT for varying ratios of the SM number to the DM number (R). Because the checkpoints in ACP are not coordinated with other processes and the processes just need to store their current states, the ACT of ACP does not change at all. However, the ACT of HCP and SCP decreases as R increases. For HCP, the number of processes that participate in the coordination when a process satisfies the checkpoint condition decreases as R increases, so the ACT decreases. For SCP, the number of DMs diminishes as R increases, which means that fewer processes in different clusters have to join the coordination. Hence, the ACT decreases as R increases. What is more, the checkpoint coordination time of HCP is sure to be less than that of SCP, because the checkpoint coordination of HCP is limited to one cluster.
Figure 10. Impact of R on ACT.

Experiment 3 Test of ART
Because there are major cascade rollbacks in ACP, the ART of ACP is much larger than that of HCP and SCP. In order to clearly show the difference between HCP and SCP, this experiment compares only the ART of HCP and SCP for varying M and F. The results are displayed in Figures 11 and 12.
As Figure 11 shows, the ART of SCP is lower than that of HCP when M is 2. This is because the rollback extent caused by the cascade rollback of HCP is larger than that brought by the rollback request of SCP. The ART of both protocols increases as M increases, and the ART of HCP is lower than that of SCP when M is 3 and when M is 4. The ART of the two algorithms increases because rollback requests are sent to different clusters; thus, the rollback extent increases as M increases, when processes in different clusters receive the rollback requests. However, far more rollback requests are sent to different clusters in SCP than in HCP, because the rollback of SCP is global. Hence, when M is larger, the change in rollback extent of SCP caused by M is greater than the change in rollback extent of HCP brought by the cascade rollback.
It can be seen from Figure 12 that the recovery time of both HCP and SCP increases as F increases, but the recovery time of HCP is less than that of SCP. This is because the communications between the checkpoint and the failure point decrease, and so the related rollback processes decrease. Hence, the processes that do not fail have larger rollback extents when they roll back next time, which increases the recovery time. Because fewer rollback notifications are sent to different clusters in HCP, the ART of HCP is less than that of SCP.

Experiment 4 Test of ASS
In order to test the ASS, this experiment compares the storage space occupied by the different protocols as the SCN changes. Figure 13 shows the results. It can be seen that the ASS of SCP remains basically unchanged. This is because each process in SCP just needs to store one checkpoint and a handful of message logs. However, each process in HCP and ACP has to store multiple checkpoints and many message logs, so the occupied storage space increases as the successive checkpoints number increases. But the ASS of HCP is much less than the ASS of ACP, because the DCs of HCP are cleared away in advance.

Experiment 5 Test of AAM
In order to test the AAM, this experiment compares the effects of changes in F and R, respectively. The results are shown in Figures 14 and 15. It can be seen from Figure 14 that the additional messages increase along with F. This is because the larger F is, the more additional messages related to rollbacks exist. However, only the AAM of SCP shows a downtrend as R increases. The reason is that the processes that participate in the coordination decrease as the number of DMs drops when R increases; for HCP and ACP, by contrast, the increase in R makes the dependence relations among checkpoints of processes tighter, so the additional messages created by the rollbacks of failed processes increase. In general, the AAM of SCP is much larger than that of HCP and ACP. This is because SCP has global rollback in addition to global coordination, whereas for HCP, although there are additional messages caused by both coordination and rollback, the coordination and rollback are limited to the same cluster. Hence, the additional messages of HCP are fewer than those of SCP. For ACP, the additional messages are the fewest, because there is no coordination in ACP.

Experiment 6 Test of ACFM
It can be seen that an increase in F increases the ACFM of all three protocols, although the extent of change for HCP and ACP is smaller. This is because the rollback requests forwarded by CHs increase as F increases. For HCP and ACP, fewer rollback requests are forwarded by CHs, and so the effect of F is smaller. In SCP, each rollback requires all processes to take part, so the messages forwarded by CHs become numerous and take up a high proportion of the total messages. This makes the extent of change of the messages forwarded by CHs larger when R varies. What is more, the CHs in SCP also forward the coordination requests, so the messages forwarded by CHs in SCP are the most numerous. For HCP, fewer coordination requests for different clusters and fewer rollback requests are forwarded by CHs, so the messages forwarded by CHs are fewer than in ACP.
Figure 16. Impact of F on ACFM.
As R increases, the ACFM of all three protocols decreases. Among them, the ACFM of HCP is the lowest when R is less than 1. This is because the kinds of requests related to processes in different clusters decrease as R increases, so the messages forwarded by CHs decrease. For SCP, the related coordination processes in different clusters decrease as R increases when R is larger than 1, so the coordination requests forwarded by CHs drop markedly. In HCP, although the coordination and rollback requests for different clusters decrease as R increases, there is still a certain number of elimination notices for DCs. Hence, the extent of reduction of HCP is less than that of SCP.
In short, because each process in SCP stores just one checkpoint, the occupied storage space is small. There is no cascade rollback in SCP, so the recovery time is short, but in order to assure the checkpoint coordination of all processes, the processes in SCP need to exchange a lot of coordination information, and so the additional messages are numerous.
In ACP, because the checkpoints of processes are not synchronous, the checkpoint time is just the time needed to store the state of the process, and the additional messages include only the few additional messages caused by rollback. However, the storage space of ACP is larger because many checkpoints are stored. What is more, the recovery time is longer, because there are cascade rollbacks among the checkpoints of processes.
For the proposed hybrid protocol HCP, the time to transmit and receive the various kinds of coordination information is less than that of SCP, because the coordination of checkpoints is limited to the processes in the same cluster. Hence, the checkpoint time of HCP is much shorter than that of SCP. In addition, the coordination and rollback of HCP affect the other clusters only marginally, so the retransmission overhead of CHs is smaller. Although the additional messages of HCP are a little higher than those of ACP because of coordination, the recovery time of HCP is shorter, because there are only a few constraints from different clusters and few cascade rollbacks.

Conclusions and Next Work
Mobile ad hoc networks are an important part of modern communication networks, but the existing checkpointing algorithms cannot be applied well to this network environment because of features such as the lack of support from a fixed network, multi-hop routing, changeable topological structure, poor stability and high failure probability.

In a distributed system, the performance indexes for measuring checkpointing mainly include the number of communication messages, storage overhead, recovery overhead and time for checkpoint coordination, so one can choose between coordinated and uncoordinated checkpointing according to the characteristics of the application environment. However, mobile ad hoc networks have no support from fixed networks, and their resources are very limited. For example, the wireless bandwidth is lower and the CPU capability is weak, so the checkpointing technologies for this kind of network cannot consider any single index alone; a trade-off method should be designed that takes the various performance indexes (such as checkpoint time, rollback recovery time, storage space and number of additional messages) into account together. Therefore, this paper designs a hybrid checkpointing model and algorithm that combines synchronous and asynchronous techniques. This hybrid checkpointing algorithm not only avoids major cascade rollback among MHs in the same cluster and excessive message transmission among the MHs in different clusters, but also reduces communication delay. What is more, this hybrid model has little dependence on CHs, so it has good flexibility.
This paper carries out a number of performance tests comparing the hybrid checkpointing strategy with the pure synchronous and asynchronous checkpointing strategies. The results show that the hybrid checkpointing strategy is a good trade-off between the synchronous and asynchronous checkpointing strategies. What is more, it has a shorter recovery time compared with the two methods.
Our next work is to analyze multicast communication protocols and various routing protocols and then design a new checkpointing algorithm that is suitable for the multicast environment and adapts to different routing protocols. In this way, the fault tolerance and reliability of ad hoc networks will be further improved.