<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sensors</journal-id>
<journal-title>Sensors</journal-title>
<issn pub-type="epub">1424-8220</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/s120505815</article-id>
<article-id pub-id-type="publisher-id">sensors-12-05815</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Low-Overhead Accrual Failure Detector</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Ren</surname><given-names>Xiao</given-names></name></contrib>
<contrib contrib-type="author">
<name><surname>Dong</surname><given-names>Jian</given-names></name><xref ref-type="corresp" rid="c1-sensors-12-05815"><sup>*</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname><given-names>Hongwei</given-names></name></contrib>
<contrib contrib-type="author">
<name><surname>Li</surname><given-names>Yang</given-names></name></contrib>
<contrib contrib-type="author">
<name><surname>Yang</surname><given-names>Xiaozong</given-names></name></contrib>
<aff id="af1-sensors-12-05815">School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China</aff></contrib-group>
<author-notes>
<corresp id="c1-sensors-12-05815">
<label>*</label>Author to whom correspondence should be addressed; E-Mail: <email>dan@hit.edu.cn</email>; Tel.: +86-451-8640-3317; Fax: +86-451-8641-3309.</corresp></author-notes>
<pub-date pub-type="collection">
<year>2012</year></pub-date>
<pub-date pub-type="epub">
<day>04</day>
<month>05</month>
<year>2012</year></pub-date>
<volume>12</volume>
<issue>5</issue>
<fpage>5815</fpage>
<lpage>5823</lpage>
<history>
<date date-type="received">
<day>06</day>
<month>03</month>
<year>2012</year></date>
<date date-type="rev-recd">
<day>20</day>
<month>04</month>
<year>2012</year></date>
<date date-type="accepted">
<day>25</day>
<month>04</month>
<year>2012</year></date></history>
<permissions>
<copyright-statement>© 2012 by the authors; licensee MDPI, Basel, Switzerland</copyright-statement>
<copyright-year>2012</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>Failure detectors are one of the fundamental components for building a distributed system with high availability. To maintain the efficiency and scalability of failure detection in complicated large-scale distributed systems, accrual failure detectors that can adapt to multiple applications have been studied extensively. In this paper, a new accrual failure detector with low system overhead, LA-FD, is proposed specifically for current mobile network equipment on the Internet, whose processing power, memory space and power supply are all constrained. It relies neither on the probability distribution of message transmission time nor on the maintenance of a history message window. Using simple calculations, LA-FD provides an adaptive failure detection service with high accuracy to multiple upper applications. The related experiments and results are also presented.</p></abstract>
<kwd-group>
<kwd>failure detection</kwd>
<kwd>accrual failure detector</kwd>
<kwd>adaptive</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>The failure detector is one of the fundamental components for building a distributed system with high availability [<xref ref-type="bibr" rid="b1-sensors-12-05815">1</xref>]. By providing information about process failures to the system, it supports the solution of many basic problems in asynchronous systems, such as consensus and atomic broadcast. Failure detection was proposed and formally defined by Chandra and Toueg [<xref ref-type="bibr" rid="b2-sensors-12-05815">2</xref>] as an effective way to enhance the asynchronous system computational model. With the increasing demands on capability in distributed systems, failure detectors have been widely applied in many fields, including grid computing [<xref ref-type="bibr" rid="b3-sensors-12-05815">3</xref>], cluster management [<xref ref-type="bibr" rid="b4-sensors-12-05815">4</xref>] and peer-to-peer networks [<xref ref-type="bibr" rid="b5-sensors-12-05815">5</xref>]. As system scale expands and distributed applications become increasingly complex, the efficiency and scalability [<xref ref-type="bibr" rid="b6-sensors-12-05815">6</xref>] of failure detectors face more and more challenges. How to achieve good detection speed and accuracy with low detection load has become a hot research topic in this field.</p>
<p>Adaptive failure detectors have been proposed as an important approach to solving this problem. They adjust the detector's parameters automatically so that the system's effectiveness requirements can be met with low load under different network environments. Chen [<xref ref-type="bibr" rid="b7-sensors-12-05815">7</xref>] and Bertier [<xref ref-type="bibr" rid="b8-sensors-12-05815">8</xref>] proposed a series of QoS-based adaptive failure detection algorithms based on a probabilistic network model. These algorithms adjust detector parameters quantitatively and adaptively, which greatly improves the detector's control accuracy and effectively reduces detection load. However, with the development of various network applications, multiple applications with different failure detection QoS requirements often run simultaneously in large-scale systems such as grid, P2P and cloud computing. Taking into account the impact of load on scalability, we cannot supply a separate failure detector for each application. This leads to another requirement for adaptive failure detectors: they must adapt to the different QoS requirements of multiple applications. This has become an important issue in the research of failure detection in large-scale distributed systems [<xref ref-type="bibr" rid="b6-sensors-12-05815">6</xref>].</p>
<p>Hayashibara [<xref ref-type="bibr" rid="b9-sensors-12-05815">9</xref>] first launched research in this area and proposed the concept of the accrual failure detector. It completely decouples monitoring from interpretation, which traditional models of failure detection combine. Because the detector outputs a continuous value associated with the status of a process rather than a binary value that simply represents success or failure, upper applications can interpret detection results according to their own QoS requirements. Therefore, multiple applications can share the same detector and the failure detection load can be effectively reduced in large-scale distributed systems. Many implementations of accrual detectors have since been proposed and applied satisfactorily in some well-known systems, such as Facebook [<xref ref-type="bibr" rid="b10-sensors-12-05815">10</xref>]. However, with the development of applications in the Internet of Things and cloud computing, network access equipment has become diversified, and mobile terminals such as cell phones and tablet PCs are used ever more widely. Most of this equipment consists of embedded systems whose processing power, memory space and power supply are all constrained, yet the previously proposed accrual detectors require a probability distribution model for message transmission delay. For example, the <italic>ϕ</italic>-detector uses a normal distribution [<xref ref-type="bibr" rid="b11-sensors-12-05815">11</xref>], Cassandra uses an exponential distribution [<xref ref-type="bibr" rid="b10-sensors-12-05815">10</xref>], and Benjamin uses a gamma distribution [<xref ref-type="bibr" rid="b12-sensors-12-05815">12</xref>]. Furthermore, those detectors need memory space to store a large history message window, and in each detection cycle a large amount of calculation is needed to compute the probability distribution parameters and detector parameters.
For most mobile terminals, these failure detection overheads have a significant impact on system performance and battery consumption. Regarding failure detection itself, Gillen [<xref ref-type="bibr" rid="b13-sensors-12-05815">13</xref>] has pointed out that the transmission delays caused by performance degradation also have a great impact on detection accuracy.</p>
<p>Therefore, targeting mobile devices with constrained resources, we propose an accrual failure detector with low system overhead. It relies neither on the probability distribution of message transmission delay nor on the maintenance of a history message window. Through simple calculations, it is able to provide an adaptive failure detection service with high accuracy to multiple upper applications.</p></sec>
<sec>
<label>2.</label>
<title>Algorithm Description</title>
<sec>
<label>2.1.</label>
<title>System Model</title>
<p>We consider an asynchronous distributed system consisting of <italic>n</italic> processes, <italic>∏</italic> = {<italic>p<sub>1</sub></italic>, <italic>p<sub>2</sub></italic>, …, <italic>p<sub>n</sub></italic>}. Because the failure detector runs as a basic component on each node, a simple topology is considered: we assume that each pair of processes is connected by a communication channel that can be used to send and receive messages. Processes fail only by crashing, and channels are fair-lossy. No synchronized clock is assumed.</p></sec>
<sec>
<label>2.2.</label>
<title>Basic Failure Detection Strategy</title>
<p>Heartbeating is a common method for implementing failure detectors. The detection modules detect each other's status by sending heartbeat messages periodically at interval Δ<italic>t<sub>i</sub></italic>. Depending on the mode of implementation, there are two monitoring approaches: PUSH and PULL. For two processes <italic>p</italic> and <italic>q</italic> in the system, where <italic>q</italic> is monitoring <italic>p</italic>, the two basic approaches are described in <xref ref-type="fig" rid="f1-sensors-12-05815">Figure 1</xref>.</p>
<p>Both approaches detect a process's status by sending heartbeat messages periodically at interval Δ<italic>t<sub>i</sub></italic>. The difference is that, in PUSH, the monitored process <italic>p</italic> actively sends a periodic “I am alive” message to process <italic>q</italic>, informing <italic>q</italic> that <italic>p</italic> is still alive, while in PULL, process <italic>q</italic> periodically sends a probing message “Are you alive?” to the monitored process <italic>p</italic>. After receiving the query message, the monitored process <italic>p</italic> passively replies with an “I am alive!” message to indicate its status. For traditional failure detectors based on a timeout mechanism, an appropriate timeout value Δ<italic>t<sub>o</sub></italic> needs to be set. If no response message is received within Δ<italic>t<sub>o</sub></italic>, the monitored process is suspected to have failed. The PULL approach needs twice as many messages to achieve the same performance, although this does not affect its scalability. In return, PULL is an active detection method that launches detection only when needed, and it does not require the assumption of a globally synchronized clock. This is very important for current complicated large-scale distributed applications. Therefore, PULL is employed as the basic detection strategy in this paper.</p></sec>
<sec>
<label>2.3.</label>
<title>Basic Idea of the Algorithm</title>
<p>One of the key factors that affect the performance of an accrual failure detector is the method used to calculate the suspicion level <italic>sl</italic>(<italic>t</italic>). How accurately <italic>sl</italic>(<italic>t</italic>) describes the actual failure status of a process determines the detector's detection accuracy, detection delay, and so on. In current implementations of the accrual failure detector, improving the precision of <italic>sl</italic>(<italic>t</italic>) usually relies on predicting the arrival time of detection messages, and an accurate prediction model greatly improves detector performance. The most frequently used estimation methods include estimating the arrival time of detection messages from the probability distribution of message delay and predicting the transmission delay with a learning-based linear process. These methods not only incur heavy computing and storage overhead but are also limited to specific distributed systems. For example, Avinash's prediction method based on the exponential distribution was proposed according to the particular characteristics of the Facebook system. In order to find a prediction method with less overhead and better adaptability, we observed transmission delays under two typical network conditions. The detection processes used in the experiment are located in Harbin, and the monitored processes are located in Beijing (China) and Pittsburgh (PA, USA), respectively. These two sets of experiments correspond to good (dataset 1, with an average delay of 82.1 ms) and poor (dataset 2, with an average delay of 1,297.8 ms) network conditions, respectively. Each was observed for 24 h, and the results are shown in <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2</xref>.</p>
<p>From <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2</xref>, we can see that in both network environments the transmission delay shows continuity: in <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2(a)</xref> the data are concentrated around 50, 80 and 100 ms, and in <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2(b)</xref> around 1,200 and 1,400 ms. Only a very small number of detection messages have a strongly deviating transmission delay, caused by network congestion, message loss and similar events. Furthermore, from the statistical data in <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2(a)</xref>, we obtain:
<disp-formula id="FD1">
<label>(1)</label>
<mml:math id="mm1" display="block">
<mml:semantics id="sm1">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mn>0</mml:mn></mml:msub>
<mml:mo stretchy="false">[</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mtext mathvariant="italic">delay</mml:mtext>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mtext mathvariant="italic">delay</mml:mtext>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≤</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>73.1</mml:mn>
<mml:mo>%</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:semantics></mml:math></disp-formula></p>
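<p>The statistic in Equation (1) can be recomputed on any recorded delay series with a few lines of code. The following is a minimal Python sketch under stated assumptions, not the tooling used for the experiments: the function name <monospace>fraction_not_increasing</monospace> and the sample values are hypothetical and are not taken from dataset 1 or dataset 2.</p>
<preformat><![CDATA[
```python
from typing import Sequence


def fraction_not_increasing(delays: Sequence[float]) -> float:
    """Fraction of indices i > 1 for which delay_i - delay_{i-1} <= 0."""
    if len(delays) < 2:
        raise ValueError("need at least two delay samples")
    # Pair every sample with its predecessor and count the non-increasing steps.
    hits = sum(1 for prev, cur in zip(delays, delays[1:]) if cur - prev <= 0)
    return hits / (len(delays) - 1)


# Made-up round-trip delays in milliseconds (illustrative only).
sample = [82.0, 80.0, 80.0, 95.0, 79.0]
print(fraction_not_increasing(sample))
```
]]></preformat>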
<p>Even for the poor network environment in <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2(b)</xref>, <italic>P</italic><sub>0</sub> still reaches 56.3%. Therefore, the transmission time <italic>delay<sub>i</sub></italic> of most detection messages is less than or close to the transmission time of the previous message, <italic>delay<sub>i</sub></italic><sub>-1</sub>, so <italic>delay<sub>i</sub></italic><sub>-1</sub> can be used as the predicted value for <italic>delay<sub>i</sub></italic> to support failure detection; that is, the predicted value for the <italic>i-th</italic> detection message is <italic>prek<sub>i</sub></italic> = <italic>delay<sub>i</sub></italic><sub>-1</sub>. This method incurs no overhead for modeling or for recording a large amount of historical data, and it adapts to different network environments. However, <italic>P</italic><sub>0</sub> shows that the accuracy of this method is not high, especially in a poor network environment. Therefore, following the estimation method proposed by Jacobson [<xref ref-type="bibr" rid="b14-sensors-12-05815">14</xref>], we add a safety margin to the predicted value:
<disp-formula id="FD2">
<label>(2)</label>
<mml:math id="mm2" display="block">
<mml:semantics id="sm2">
<mml:mrow>
<mml:msub>
<mml:mtext mathvariant="italic">margin</mml:mtext>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mtext mathvariant="italic">margin</mml:mtext>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mtext mathvariant="italic">prek</mml:mtext>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mtext mathvariant="italic">delay</mml:mtext>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow>
<mml:mo>|</mml:mo></mml:mrow>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mtext mathvariant="italic">margin</mml:mtext>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:semantics></mml:math></disp-formula></p>
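<p>The recurrence in Equation (2), together with the predictor <italic>prek<sub>i</sub></italic> = <italic>delay<sub>i</sub></italic><sub>-1</sub>, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names <monospace>update_margin</monospace> and <monospace>coverage</monospace> are hypothetical, and the initial values (a zero initial margin and a first prediction equal to the first delay) are assumed because the text does not specify them.</p>
<preformat><![CDATA[
```python
from typing import Sequence

ALPHA = 0.25  # gain applied to the prediction error in Equation (2)


def update_margin(prev_margin: float, prev_pred: float, prev_delay: float,
                  alpha: float = ALPHA) -> float:
    """Equation (2): margin_i from margin_{i-1}, prek_{i-1} and delay_{i-1}."""
    return prev_margin + alpha * (abs(prev_pred - prev_delay) - prev_margin)


def coverage(delays: Sequence[float], alpha: float = ALPHA) -> float:
    """Fraction of i > 1 with delay_i <= delay_{i-1} + margin_{i-1}."""
    if len(delays) < 2:
        raise ValueError("need at least two delay samples")
    margin = 0.0      # ASSUMPTION: margin_1 = 0
    pred = delays[0]  # ASSUMPTION: prek_1 = delay_1 (no error on the first sample)
    hits = 0
    for k in range(1, len(delays)):
        prev_delay = delays[k - 1]
        # Does the previous delay plus the current margin cover this delay?
        if delays[k] <= prev_delay + margin:
            hits += 1
        # Advance the recurrence: new margin first, then the new prediction.
        margin = update_margin(margin, pred, prev_delay, alpha)
        pred = prev_delay
    return hits / (len(delays) - 1)


# Made-up delays in milliseconds (illustrative only).
print(update_margin(0.0, 100.0, 120.0))
print(coverage([100.0, 120.0, 100.0, 104.0]))
```
]]></preformat>
<p>In this made-up series, the last sample (104 ms) exceeds its predecessor (100 ms) and would count as a miss for the plain predictor, but it falls within the 5 ms margin accumulated from the earlier 20 ms prediction error.</p>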
<p>With <italic>α</italic> = 0.25, for the data in <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2(a)</xref> we have <italic>P<sub>m</sub></italic>[<italic>delay<sub>i</sub></italic> ≤ <italic>delay<sub>i</sub></italic><sub>-1</sub> + <italic>margin<sub>i</sub></italic><sub>-1</sub>] = 98.9%, and for <xref ref-type="fig" rid="f2-sensors-12-05815">Figure 2(b)</xref> <italic>P<sub>m</sub></italic> reaches 98.4%. This new prediction method therefore greatly improves the prediction accuracy and meets the needs of most failure detection tasks. Based on this method, we propose the LA-FD failure detector.</p></sec>
<sec>
<label>2.4.</label>
<title>LA-FD Failure Detector</title>
<p>LA-FD employs the PULL approach as the basic failure detection strategy. To simplify the description, suppose the system consists of only two processes <italic>p</italic> and <italic>q</italic>, where <italic>q</italic> is monitoring <italic>p</italic>. The detection algorithm is shown in <xref ref-type="fig" rid="f3-sensors-12-05815">Figure 3</xref>.</p>
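<p>As a rough illustration of the bookkeeping a PULL-based detector of this kind keeps on the monitoring side, the following Python sketch tracks only the last measured delay and the current margin. It is not a transcription of <xref ref-type="fig" rid="f3-sensors-12-05815">Figure 3</xref>: the class name <monospace>PullDetectorState</monospace> and its methods are hypothetical, and the suspicion value returned here (elapsed waiting time divided by the last delay plus the margin) is an assumed stand-in, not the paper's formula.</p>
<preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullDetectorState:
    """Monitoring-side state; constant size, no history window."""
    alpha: float = 0.25
    margin: float = 0.0                 # ASSUMPTION: initial margin is zero
    last_delay: Optional[float] = None  # delay of the last acknowledged probe
    sent_at: Optional[float] = None     # send time of the outstanding probe

    def on_probe_sent(self, now: float) -> None:
        self.sent_at = now

    def on_ack(self, now: float) -> None:
        if self.sent_at is None:
            return  # unsolicited or duplicate acknowledgement: ignore it
        delay = now - self.sent_at
        # The prediction for this probe was the previous delay (or, for the
        # first probe, assumed equal to the measured delay).
        pred = self.last_delay if self.last_delay is not None else delay
        # Margin update in the style of Equation (2).
        self.margin += self.alpha * (abs(pred - delay) - self.margin)
        self.last_delay = delay
        self.sent_at = None

    def suspicion(self, now: float) -> float:
        """ASSUMED suspicion value: waiting time relative to the expected delay."""
        if self.sent_at is None or self.last_delay is None:
            return 0.0
        expected = self.last_delay + self.margin
        if expected <= 0:
            return float("inf")
        return (now - self.sent_at) / expected


# Times in milliseconds; all values are made up for illustration.
d = PullDetectorState()
d.on_probe_sent(0.0)
d.on_ack(100.0)
d.on_probe_sent(1000.0)
d.on_ack(1120.0)
d.on_probe_sent(2000.0)
for threshold in (1.0, 3.0):  # two applications with different requirements
    print(threshold, d.suspicion(2250.0) > threshold)
```
]]></preformat>
<p>In the usage lines, two hypothetical applications read the same value and apply their own thresholds, so one may already suspect the monitored process while the other does not.</p>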
<p><xref ref-type="fig" rid="f3-sensors-12-05815">Figure 3</xref> shows that the LA-FD failure detector consists of a detection module and a query module. The detection module, located on process <italic>q</italic>, sends a probing message <italic>mq<sub>i</sub></italic> to the monitored process <italic>p</italic> at interval <italic>Δt<sub>i</sub></italic>. After receiving the probing message <italic>mq<sub>i</sub></italic>, process <italic>p</italic> immediately replies with an acknowledgement message <italic>ma<sub>i</sub></italic> to indicate its status. Each time it receives an acknowledgement message <italic>ma<sub>i</sub></italic>, process <italic>q</italic> calculates the margin for the next detection message and records the transmission time of the current detection message. When an upper application queries the detector, the detector replies with the value <italic>ρ<sub>qp</sub></italic>. The upper application then sets a threshold <italic>P</italic> according to its own requirement for detection accuracy. When <italic>ρ<sub>qp</sub></italic> &gt; <italic>P</italic>, process <italic>p</italic> is suspected to have failed.</p></sec></sec>
<sec sec-type="methods|results">
<label>3.</label>
<title>Experimental Results and Analysis</title>
<p>In this section, we analyze and compare the performance and overhead of the LA-FD detector through experiments. To make the results more convincing, the detection processes are designed according to the configuration of current mainstream mobile devices (ARM2440 processor, 400 MHz clock speed, 512 M RAM, 1,200 mAh battery capacity). The monitoring process is located in Harbin and connected to the Internet through WiFi. The monitored processes use the configuration described in Section 2.3: there are two sets of servers representing two typical network environments, one located in Beijing (dataset 1) and the other in Pittsburgh (dataset 2). The reference detectors are selected from several major implementations of accrual detectors: Hayashibara's ϕ-failure detector (ϕ-FD) [<xref ref-type="bibr" rid="b11-sensors-12-05815">11</xref>], Benjamin's new accrual detector (NAD) [<xref ref-type="bibr" rid="b15-sensors-12-05815">15</xref>] and Avinash's improved ϕ-failure detector (I-ϕ-FD) [<xref ref-type="bibr" rid="b10-sensors-12-05815">10</xref>]. All experiments focus on two aspects of LA-FD: detection accuracy and system overhead.</p>
<sec>
<label>3.1.</label>
<title>Analysis of Detection Accuracy</title>
<p>The accuracy of an accrual failure detector is usually affected by two main factors. One is detection delay: a lower detection delay reduces the accuracy of detection results. The other is the threshold set by upper applications for the suspicion level <italic>sl</italic>: a higher threshold leads to higher accuracy. However, the different implementations of accrual failure detectors in the comparative experiments use different approaches to calculate <italic>sl</italic> and its threshold. For the same detection data, we therefore collected the detection results and related data from the different detectors under multiple sets of thresholds and explored the relationship between the average mistake rate (<italic>λ<sub>M</sub></italic>) and detection delay. The results are shown in <xref ref-type="fig" rid="f4-sensors-12-05815">Figure 4</xref>, where (a) and (b) represent the detection results for dataset 1 and dataset 2, respectively.</p>
<p>It is obvious from <xref ref-type="fig" rid="f4-sensors-12-05815">Figure 4</xref> that LA-FD achieves higher detection accuracy in both network environments: under the same accuracy requirement (mistake rate <italic>λ<sub>M</sub></italic> on the Y-axis), LA-FD has a lower detection delay. This is more obvious under poor network conditions (<xref ref-type="fig" rid="f4-sensors-12-05815">Figure 4(b)</xref>). I-ϕ-FD, based on the exponential distribution, has the worst detection performance in both network environments. This suggests that the assumption of an exponential distribution is suitable only for specific P2P systems, and that a normal distribution describes the message transmission delay more accurately.</p></sec>
<sec>
<label>3.2.</label>
<title>Comparison of System Overhead</title>
<p>In each detection period, an accrual failure detector needs to calculate detector parameters and maintain the history window of detection messages; these two factors are the main reasons for the differences in system overhead among accrual detectors. A large history window improves the prediction accuracy of the model parameters and has a certain impact on the calculation accuracy of the detector parameters, but maintaining the window causes more overhead. For the experiments in this section, we selected different history window sizes and made a detailed comparison of CPU utilization.</p>
<p>It can be seen from <xref ref-type="fig" rid="f5-sensors-12-05815">Figure 5</xref> that the CPU overhead is heaviest in the ϕ-detector based on the normal distribution, and that it grows fastest as the window size increases. This is because calculating the parameters of the normal distribution model involves the largest workload, and each calculation needs the statistical data from the entire window. The overhead of LA-FD is the lowest (about 0.08%), and it is not affected by window size. Each process in the experiment shown in <xref ref-type="fig" rid="f5-sensors-12-05815">Figure 5</xref> maintains only five connections. In large-scale P2P systems, in order to maintain high locating efficiency, each process is generally required to maintain log <italic>N</italic> connections, where <italic>N</italic> is the number of processes in the system. Therefore, LA-FD's reduction of CPU overhead is even more significant in real systems.</p></sec></sec>
<sec sec-type="conclusions">
<label>4.</label>
<title>Conclusions</title>
<p>An accrual failure detector can adapt to changes in network conditions and, on this basis, satisfy the different QoS requirements of multiple applications. It is a fundamental component for ensuring the efficiency and scalability of applications in large-scale distributed systems. Targeting the constrained resources of mobile network equipment such as cell phones and tablet PCs, this paper proposes LA-FD, an accrual failure detector of class <italic>◊P<sub>ac</sub></italic> [<xref ref-type="bibr" rid="b9-sensors-12-05815">9</xref>]. It requires neither a probability distribution for message transmission time nor the cost of maintaining a message history window. LA-FD can provide an adaptive detection service to multiple applications with very low overhead. Experimental analysis has shown that, compared to several other implementations of accrual detectors, LA-FD maintains high detection accuracy while effectively reducing system overhead, and it meets the needs of major distributed applications.</p></sec></body>
<back>
<ack>
<p>This project was supported by the National Natural Science Foundation of China (No. 61100029), the Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.053) and the International Science &amp; Technology Cooperation Program of China (No. 2010DFA14400).</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-sensors-12-05815"><label>1.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Dixit</surname><given-names>M.</given-names></name><name><surname>Casimiro</surname><given-names>A.</given-names></name></person-group><article-title>Adaptare-fd: A Dependability-Oriented Adaptive Failure Detector</article-title><conf-name>Proceedings of IEEE Symposium on Reliable Distributed Systems</conf-name><conf-loc>New Delhi, India</conf-loc><conf-date>1–3 November 2010</conf-date></citation></ref>
<ref id="b2-sensors-12-05815"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chandra</surname><given-names>T.D.</given-names></name><name><surname>Toueg</surname><given-names>S.</given-names></name></person-group><article-title>Unreliable failure detectors for reliable distributed systems</article-title><source>J. ACM</source><year>1996</year><volume>43</volume><fpage>225</fpage><lpage>267</lpage><pub-id pub-id-type="doi">10.1145/226643.226647</pub-id></citation></ref>
<ref id="b3-sensors-12-05815"><label>3.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Horita</surname><given-names>Y.</given-names></name><name><surname>Taura</surname><given-names>K.</given-names></name><name><surname>Chikayama</surname><given-names>T.</given-names></name></person-group><article-title>A Scalable and Efficient Self-Organizing Failure Detector for Grid Applications</article-title><conf-name>Proceedings of the 6th IEEE /ACM International Workshop on Grid Computing</conf-name><conf-loc>Tokyo, Japan</conf-loc><conf-date>13–14 November 2005</conf-date></citation></ref>
<ref id="b4-sensors-12-05815"><label>4.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Lavinia</surname><given-names>A.</given-names></name><name><surname>Dobre</surname><given-names>C.</given-names></name><name><surname>Pop</surname><given-names>F.</given-names></name><name><surname>Cristea</surname><given-names>V.</given-names></name></person-group><article-title>A Failure Detection System for Large Scale Distributed Systems</article-title><conf-name>Proceedings of 2010 International Conference on Complex, Intelligent and Software Intensive Systems (CISIS)</conf-name><conf-loc>Krakow, Poland</conf-loc><conf-date>15–18 February 2010</conf-date></citation></ref>
<ref id="b5-sensors-12-05815"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>H.</given-names></name><name><surname>Xu</surname><given-names>H.</given-names></name><name><surname>Zhou</surname><given-names>Y.</given-names></name><name><surname>Song</surname><given-names>M.</given-names></name><name><surname>Song</surname><given-names>J.</given-names></name></person-group><article-title>A service delivery platform based on p2p technology for converged networks</article-title><source>J. Comput. Inf. Syst</source><year>2009</year><volume>5</volume><fpage>655</fpage><lpage>663</lpage></citation></ref>
<ref id="b6-sensors-12-05815"><label>6.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Pasin</surname><given-names>M.</given-names></name><name><surname>Fontaine</surname><given-names>S.</given-names></name><name><surname>Bouchenak</surname><given-names>S.</given-names></name></person-group><article-title>Failure Detection in Large Scale Systems: A Survey</article-title><conf-name>Proceedings of IEEE Network Operations and Management Symposium Workshops</conf-name><conf-loc>Bahia, Brazil</conf-loc><conf-date>7–11 April 2008</conf-date></citation></ref>
<ref id="b7-sensors-12-05815"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>W.</given-names></name><name><surname>Toueg</surname><given-names>S.</given-names></name><name><surname>Aguilera</surname><given-names>M.K.</given-names></name></person-group><article-title>On the quality of service of failure detectors</article-title><source>IEEE Trans. Comput</source><year>2000</year><volume>51</volume><fpage>561</fpage><lpage>580</lpage></citation></ref>
<ref id="b8-sensors-12-05815"><label>8.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Bertier</surname><given-names>M.</given-names></name><name><surname>Marin</surname><given-names>O.</given-names></name><name><surname>Sens</surname><given-names>P.</given-names></name></person-group><article-title>Implementation and Performance Evaluation of an Adaptable Failure Detector</article-title><conf-name>Proceedings of International Conference on Dependable Systems and Networks</conf-name><conf-loc>Washington, DC, USA</conf-loc><conf-date>23–26 June 2002</conf-date></citation></ref>
<ref id="b9-sensors-12-05815"><label>9.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Defago</surname><given-names>X.</given-names></name><name><surname>Urban</surname><given-names>P.</given-names></name><name><surname>Hayashibara</surname><given-names>N.</given-names></name><name><surname>Katayama</surname><given-names>T.</given-names></name></person-group><article-title>Definition and Specification of Accrual Failure Detectors</article-title><conf-name>Proceedings of International Conference on Dependable Systems and Networks</conf-name><conf-loc>Yokohama, Japan</conf-loc><conf-date>28 June–1 July 2005</conf-date></citation></ref>
<ref id="b10-sensors-12-05815"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lakshman</surname><given-names>A.</given-names></name><name><surname>Malik</surname><given-names>P.</given-names></name></person-group><article-title>Cassandra: A decentralized structured storage system</article-title><source>ACM SIGOPS Oper. Syst. Rev</source><year>2010</year><volume>44</volume><fpage>35</fpage><lpage>40</lpage></citation></ref>
<ref id="b11-sensors-12-05815"><label>11.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Hayashibara</surname><given-names>N.</given-names></name><name><surname>Defago</surname><given-names>X.</given-names></name><name><surname>Yared</surname><given-names>R.</given-names></name><name><surname>Katayama</surname><given-names>T.</given-names></name></person-group><article-title>The ϕ Accrual Failure Detector</article-title><conf-name>Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems</conf-name><conf-loc>Florianopolis, Brazil</conf-loc><conf-date>18–20 October 2004</conf-date></citation></ref>
<ref id="b12-sensors-12-05815"><label>12.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Satzger</surname><given-names>B.</given-names></name><name><surname>Pietzowski</surname><given-names>A.</given-names></name><name><surname>Trumler</surname><given-names>W.</given-names></name><name><surname>Ungerer</surname><given-names>T.</given-names></name></person-group><article-title>Variations and Evaluations of an Adaptive Accrual Failure Detector to Enable Self-Healing Properties in Distributed Systems</article-title><conf-name>Proceedings of the 20th International Conference on Architecture of Computing Systems</conf-name><conf-loc>Zurich, Switzerland</conf-loc><conf-date>12–15 March 2007</conf-date></citation></ref>
<ref id="b13-sensors-12-05815"><label>13.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Gillen</surname><given-names>M.</given-names></name><name><surname>Rohloff</surname><given-names>K.</given-names></name><name><surname>Manghwani</surname><given-names>P.</given-names></name><name><surname>Schantz</surname><given-names>R.</given-names></name></person-group><article-title>Scalable, Adaptive, Time-Bounded Node Failure Detection</article-title><conf-name>Proceedings of the 10th IEEE High Assurance Systems Engineering Symposium</conf-name><conf-loc>Dallas, TX, USA</conf-loc><conf-date>14–16 November 2007</conf-date></citation></ref>
<ref id="b14-sensors-12-05815"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jacobson</surname><given-names>V.</given-names></name></person-group><article-title>Congestion avoidance and control</article-title><source>ACM SIGCOMM Comput. Commun. Rev</source><year>1988</year><volume>18</volume><fpage>314</fpage><lpage>329</lpage><pub-id pub-id-type="doi">10.1145/52325.52356</pub-id></citation></ref>
<ref id="b15-sensors-12-05815"><label>15.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Satzger</surname><given-names>B.</given-names></name><name><surname>Pietzowski</surname><given-names>A.</given-names></name><name><surname>Trumler</surname><given-names>W.</given-names></name><name><surname>Ungerer</surname><given-names>T.</given-names></name></person-group><article-title>A New Adaptive Accrual Failure Detector for Dependable Distributed Systems</article-title><conf-name>Proceedings of ACM Symposium on Applied Computing (SAC '07)</conf-name><conf-loc>Seoul, Korea</conf-loc><conf-date>11–15 March 2007</conf-date></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures</title>
<fig id="f1-sensors-12-05815" position="float">
<label>Figure 1.</label>
<caption>
<p>Heartbeat detection approaches.</p></caption>
<graphic xlink:href="sensors-12-05815f1.gif"/></fig>
<fig id="f2-sensors-12-05815" position="float">
<label>Figure 2.</label>
<caption>
<p>Experimental results for transmission delay.</p></caption>
<graphic xlink:href="sensors-12-05815f2.gif"/></fig>
<fig id="f3-sensors-12-05815" position="float">
<label>Figure 3.</label>
<caption>
<p>LA-FD failure detector.</p></caption>
<graphic xlink:href="sensors-12-05815f3.gif"/></fig>
<fig id="f4-sensors-12-05815" position="float">
<label>Figure 4.</label>
<caption>
<p>Average mistake rate and detection delay.</p></caption>
<graphic xlink:href="sensors-12-05815f4a.gif"/>
<graphic xlink:href="sensors-12-05815f4b.gif"/></fig>
<fig id="f5-sensors-12-05815" position="float">
<label>Figure 5.</label>
<caption>
<p>Comparison of CPU overhead.</p></caption>
<graphic xlink:href="sensors-12-05815f5.gif"/></fig></sec></back></article>
