Low-Overhead Accrual Failure Detector

Failure detectors are one of the fundamental components for building a distributed system with high availability. To maintain the efficiency and scalability of failure detection in complex large-scale distributed systems, accrual failure detectors that can adapt to multiple applications have been studied extensively. In this paper, a new accrual failure detector with low system overhead, LA-FD, is proposed specifically for current mobile network equipment on the Internet, whose processing power, memory space and power supply are all constrained. It relies neither on the probability distribution of message transmission time nor on the maintenance of a history message window. Using simple calculations, LA-FD provides an adaptive failure detection service with high accuracy to multiple upper applications. Related experiments and results are also presented.


Introduction
The failure detector is one of the fundamental components for building a distributed system with high availability [1]. By providing information about process failures to the system, it supports the solution of many basic problems in an asynchronous system, such as consensus and atomic broadcast. Failure detection was proposed and formally defined by Chandra and Toueg [2] as an effective way to enhance the asynchronous computational model. With the increasing demands placed on distributed systems, failure detectors have been widely applied in many fields, including grid computing [3], cluster management [4] and peer-to-peer networks [5]. As a fundamental component, failure detectors face more and more challenges to their efficiency and scalability [6], posed by the expanding system scale and increasingly complex distributed applications. How to achieve good detection speed and accuracy with low detection load has become a hot research topic in this field.
Adaptive failure detectors have been proposed as an important approach to this problem. They adjust the detector's parameters automatically so that the system's effectiveness requirements can be met with low load under different network environments. Chen [7] and Bertier [8] proposed a series of QoS-based adaptive failure detection algorithms built on a probabilistic network model. These algorithms achieve quantitative, adaptive control of the detector parameters, greatly improve the detector's control accuracy and effectively reduce detection load. However, with the development of various network applications, multiple applications often run simultaneously in large-scale systems such as grid, P2P and cloud computing, and they have different failure detection QoS requirements. Given the impact of load on scalability, we cannot supply a separate failure detector for each application. This leads to another requirement for adaptive failure detectors: they must adapt to the different QoS requirements of multiple applications. This has become an important issue in research on failure detection in large-scale distributed systems [6].
Hayashibara [9] first launched research in this area and proposed the concept of the accrual failure detector. It completely decouples monitoring from interpretation, which traditional models of failure detection combine. Because the detector outputs a continuous value associated with the status of a process rather than a binary value simply representing success or failure, upper applications can interpret detection results according to their own QoS requirements. Multiple applications can therefore share the same detector, and the failure detection load can be effectively reduced in large-scale distributed systems. Many implementations of accrual detectors have been proposed and applied satisfactorily in well-known systems, such as Facebook [10]. However, with the development of applications in the Internet of Things and cloud computing, network access equipment has become diversified, and mobile terminals such as cell phones and tablet PCs are used more and more widely. Most of this equipment consists of embedded systems whose processing power, memory space and power supply are all constrained, yet the previously proposed accrual detectors require a probability distribution model for message transmission delay. For example, the ϕ-detector uses a normal distribution [11], Cassandra uses an exponential distribution [10], and Benjamin uses a gamma distribution [12]. Furthermore, those detectors need memory space to store a large history message window, and at each detection cycle a large amount of calculation is needed to compute the probability distribution parameters and the detector parameters. For most mobile terminals, this failure detection overhead has a significant impact on system performance and battery consumption. Regarding failure detection itself, Gillen [13] has pointed out that the transmission delays caused by performance degradation also have a great impact on detection accuracy.
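The decoupling of monitoring and interpretation can be illustrated with a minimal Python sketch. The `Application` class, the threshold values and the example suspicion level are hypothetical and chosen only for illustration; they are not taken from [9] or from any particular detector.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Application:
    """Hypothetical upper application with its own suspicion threshold."""
    name: str
    threshold: float

    def suspects(self, suspicion_level: float) -> bool:
        # Interpretation step: each application applies its own QoS policy.
        return suspicion_level > self.threshold


def interpret(suspicion_level: float, apps: List[Application]) -> Dict[str, bool]:
    """Share one monitored value among several applications."""
    return {app.name: app.suspects(suspicion_level) for app in apps}


# Two illustrative applications: one reacts quickly, one avoids wrong suspicions.
apps = [
    Application("aggressive", threshold=1.0),
    Application("conservative", threshold=8.0),
]

# A single (assumed) suspicion level produced by one shared detector.
verdicts = interpret(3.5, apps)
print(verdicts)
```

The monitoring side computes one value per monitored process, and only the cheap comparison is repeated per application, which is why sharing a detector reduces the detection load.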
Therefore, targeting mobile devices with constrained resources, we propose an accrual failure detector with low system overhead. It relies neither on the probability distribution of message transmission delay nor on the maintenance of history message windows. Through simple calculations, it provides an adaptive failure detection service with high accuracy to multiple upper applications.

System Model
We consider an asynchronous distributed system consisting of n processes, $\Pi = \{p_1, p_2, \dots, p_n\}$.
Because the failure detector runs as a basic component in each node, a simple topology is considered: we assume that each pair of processes is connected by a communication channel that can be used to send and receive messages. Processes fail only by crashing, and channels are fair-lossy. No synchronized clock is assumed.

Basic Failure Detection Strategy
Heartbeat is a common method for implementing failure detectors. The detection modules detect each other's status by sending heartbeat messages periodically at interval $\Delta t_i$. Depending on the mode of implementation, there are two monitoring approaches: PUSH and PULL. For two processes p and q in the system, where q is monitoring p, the two basic approaches are described in Figure 1.
In PUSH, the monitored process p actively sends a periodic "I am alive" message to process q, informing q that p is still alive. In PULL, process q periodically sends a probing message "Are you alive?" to the monitored process p; after receiving the query, p passively replies with an "I am alive!" message to indicate its status. For traditional failure detectors based on a timeout mechanism, an appropriate timeout value $\Delta t_o$ needs to be set. If no response message is received within $\Delta t_o$, the monitored process is suspected to have failed. The PULL approach needs twice as many messages to achieve the same performance, but this does not affect its scalability. In return, PULL is an active detection method that launches detection only when needed, and it does not require the assumption of a globally synchronized clock. This is very important for current complex large-scale distributed applications. Therefore, PULL is employed as the basic detection strategy in this paper.
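The traditional timeout rule can be sketched in Python as a small simulation rather than real networking. The function name, the sample reply delays and the 200 ms timeout are assumptions made for illustration; a reply delay of `None` stands for a probe whose reply never arrives.

```python
from typing import List, Optional


def pull_timeout_detector(reply_delays_ms: List[Optional[float]],
                          timeout_ms: float) -> List[bool]:
    """Return, per probe round, whether q suspects p.

    reply_delays_ms[i] is the time between q's i-th probe and p's reply,
    or None if the reply never arrived (lost message or crashed process).
    """
    suspicions = []
    for delay in reply_delays_ms:
        # q suspects p when no reply arrives within the timeout.
        suspected = delay is None or delay > timeout_ms
        suspicions.append(suspected)
    return suspicions


# Hypothetical trace: three timely replies, one slow reply, one lost reply.
rounds = [80.0, 95.0, 400.0, None, 82.0]
result = pull_timeout_detector(rounds, timeout_ms=200.0)
print(result)
```

With a fixed timeout the slow but successful third reply is treated exactly like the lost fourth one, which illustrates why the choice of $\Delta t_o$ is critical for a binary detector.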

Basic Idea of the Algorithm
One of the key factors that affect the performance of an accrual failure detector is the method used to calculate the suspicion level sl(t). Whether the value of sl(t) accurately describes the actual failure status of a process determines, among other things, the detector's accuracy and detection delay. In current implementations of the accrual failure detector, improving the precision of sl(t) usually relies on predicting the arrival time of detection messages, and an accurate prediction model greatly increases detector performance. The most frequently used estimation methods include estimating the arrival time of detection messages from the probability distribution of message delay, and predicting the transmission delay with a learning-based linear process. These methods not only cause heavy computing and storage overhead but are also limited to specific distributed systems. For example, Avinash's prediction method based on an exponential distribution was proposed according to the particular characteristics of the Facebook system. To find a prediction method with less overhead and better adaptability, we observed transmission delays under two typical network conditions. The detection processes used in the experiment are located in Harbin, and the monitored processes are located in Beijing (China) and Pittsburgh (PA, USA), respectively. These two sets of experiments correspond to good (dataset 1, with an average delay of 82.1 ms) and poor (dataset 2, with an average delay of 1,297.8 ms) network conditions. Each dataset was observed for 24 h, and the results are shown in Figure 2. From Figure 2, we can see that in both network environments the transmission delay exhibits continuity: in Figure 2(a) the data are concentrated around 50, 80 and 100 ms, and in Figure 2(b) around 1,200 and 1,400 ms.
Only a very small number of detection messages have a strongly deviating transmission delay, caused by network congestion, message loss and similar effects. Furthermore, from the statistical data in Figure 2(a) we can obtain $P_0$, the probability that $delay_i \le delay_{i-1}$; even in Figure 2(b), for a poor network environment, $P_0$ reaches 56.3%. Therefore, the transmission time $delay_i$ of most detection messages is less than or close to the transmission time $delay_{i-1}$ of the previous message, and $delay_{i-1}$ can be used as the predicted value for $delay_i$ to support failure detection. That is, the predicted value for the i-th detection message is $prek_i = delay_{i-1}$. This method incurs no overhead for modeling or for recording a large amount of historical data, and it adapts to different network environments. However, $P_0$ shows that the accuracy of this method is not high, especially in a poor network environment. Therefore, following the estimation method proposed by Jacobson [14], we add a safety margin to the predicted value, so that the prediction becomes $delay_{i-1} + margin_{i-1}$. With $\alpha = 0.25$, the data in Figure 2(a) give $P_m[delay_i \le delay_{i-1} + margin_{i-1}] = 98.9\%$, and for Figure 2(b) $P_m$ reaches 98.4%. This new prediction method therefore greatly improves the prediction accuracy and meets the needs of most failure detection tasks. Based on this method, we propose the LA-FD failure detector.
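The effect of the safety margin can be sketched in Python. The margin update rule is not reproduced in this text, so the sketch assumes a Jacobson-style exponentially weighted mean deviation, $margin_i = (1 - \alpha) \cdot margin_{i-1} + \alpha \cdot |delay_i - delay_{i-1}|$, with an assumed initial margin of zero. It also reads $P_0$ as the fraction of messages with $delay_i \le delay_{i-1}$, by analogy with $P_m$. The function name and the sample trace are hypothetical.

```python
from typing import List, Tuple


def coverage(delays: List[float], alpha: float = 0.25) -> Tuple[float, float]:
    """Return (P_0, P_m) for a trace of transmission delays.

    P_0: fraction of messages with delay_i <= delay_{i-1}.
    P_m: fraction of messages with delay_i <= delay_{i-1} + margin_{i-1}.
    """
    if len(delays) < 2:
        raise ValueError("need at least two delay samples")
    margin = 0.0  # assumed initial margin
    hits_plain = 0
    hits_margin = 0
    for prev, cur in zip(delays, delays[1:]):
        if cur <= prev:
            hits_plain += 1
        if cur <= prev + margin:
            hits_margin += 1
        # Assumed Jacobson-style update: weighted mean deviation of the error.
        margin = (1 - alpha) * margin + alpha * abs(cur - prev)
    n = len(delays) - 1
    return hits_plain / n, hits_margin / n


# Hypothetical delay trace in milliseconds with one congestion spike.
trace = [80.0, 82.0, 81.0, 100.0, 83.0, 82.0, 84.0, 81.0]
p0, pm = coverage(trace)
print(f"P_0 = {p0:.3f}, P_m = {pm:.3f}")
```

Only the previous delay and the current margin are kept between messages, so the state and the work per message are constant regardless of how long the detector has been running.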

LA-FD Failure Detector
LA-FD employs the PULL approach as the basic failure detection strategy. To simplify the description, suppose the system consists of only two processes p and q, where q is monitoring p. The detection algorithm is shown in Figure 3.
[Figure 3: LA-FD detection algorithm]
[…] receiving the pr[…] its status. […] margin for […]age. When a[…] application w[…] when $\rho_{qp} > P$, […]
[…] Results and Analysis […] will analyze […] In order to […] the configu[…] 512 M RAM […]ed to the Internet […] There are […]ng (dataset […] from several […] FD) [11], […] -FD) [10]. […] overhead.
[…]ion Accuracy […] accrual failure […]ion delay […] for the […]tive experi[…] to calculate […] results and relat[…] between the ave[…] in Figure 4, […]
It can be seen from Figure 5 that the CPU overhead is heaviest for the ϕ-detector based on the normal distribution, and that it grows the fastest as the window size increases. This is because calculating the parameters of the normal distribution model requires the most work, and each calculation needs the statistical data from the entire window. The overhead of LA-FD is the lowest (about 0.08%) and is not affected by the window size. Each process in the experiment shown in Figure 5 maintains only five connections. In large-scale P2P systems, in order to maintain high lookup efficiency, each process is generally required to maintain log N connections, where N is the number of processes in the system. Therefore, the reduction in CPU overhead achieved by LA-FD is even more significant in real systems.
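The source of the overhead difference described above can be sketched in Python. `WindowedPhi` follows the general idea of a ϕ-style detector, fitting a normal distribution to a window of inter-arrival times and reporting $-\log_{10}$ of the probability that the message arrives later than the elapsed time. `ConstantState` keeps only the last delay and a margin. Both classes, their parameters and the Jacobson-style margin update are assumptions made for illustration, not the implementations measured in Figure 5.

```python
import math
from collections import deque
from statistics import NormalDist, mean, pstdev


class WindowedPhi:
    """Window-based suspicion level in the style of the ϕ-detector (sketch)."""

    def __init__(self, window_size: int) -> None:
        # Memory grows with the window size.
        self.intervals = deque(maxlen=window_size)

    def record(self, interval_ms: float) -> None:
        self.intervals.append(interval_ms)

    def phi(self, elapsed_ms: float) -> float:
        # Every query re-scans the whole window to fit the normal model.
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), 1e-3)  # avoid a zero deviation
        p_later = 1.0 - NormalDist(mu, sigma).cdf(elapsed_ms)
        return -math.log10(max(p_later, 1e-300))  # clamp to avoid log(0)


class ConstantState:
    """Keeps only the last delay and a margin: O(1) memory and work."""

    def __init__(self, alpha: float = 0.25) -> None:
        self.alpha = alpha
        self.last = None
        self.margin = 0.0

    def record(self, delay_ms: float) -> None:
        if self.last is not None:
            # Assumed Jacobson-style mean deviation update.
            deviation = abs(delay_ms - self.last)
            self.margin = (1 - self.alpha) * self.margin + self.alpha * deviation
        self.last = delay_ms

    def bound(self) -> float:
        """Predicted upper bound for the next delay."""
        return self.last + self.margin


windowed = WindowedPhi(window_size=1000)
for i in range(1000):
    windowed.record(100.0 + (i % 5))  # synthetic inter-arrival times

constant = ConstantState()
for delay in (80.0, 84.0):
    constant.record(delay)
```

In this sketch each call to `phi` touches every stored sample, so its cost scales with the window size and with the number of monitored connections, while `ConstantState` does a fixed amount of arithmetic per message.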

Conclusions
Accrual failure detectors can adapt to changes in network conditions and, on this basis, satisfy the different QoS requirements of multiple applications. The accrual failure detector is a fundamental component for ensuring the efficiency and scalability of applications in large-scale distributed systems. Targeting mobile network equipment such as cell phones and tablet PCs, whose resources are constrained, this paper proposes LA-FD as an accrual failure detector of class $\Diamond P_{ac}$ [9]. It needs neither a probability distribution for message transmission time nor the maintenance cost of a message history window. LA-FD can provide an adaptive detection service to multiple applications with very low overhead. Experimental analysis has shown that, compared with several other implementations of accrual detectors, LA-FD maintains high detection accuracy while effectively reducing system overhead, and it meets the needs of major distributed applications.