Open Access Article

Peer-Review Record

Scalable Communication-Induced Checkpointing Protocol with Little Overhead for Distributed Computing Environments

Electronics 2023, 12(12), 2702; https://doi.org/10.3390/electronics12122702

by Jinho Ahn

Reviewer 1: Tatsuhiro Tsuchiya

Reviewer 2: Da-Yin Liao

Submission received: 26 April 2023 / Revised: 6 June 2023 / Accepted: 14 June 2023 / Published: 16 June 2023

(This article belongs to the Special Issue Fault-Tolerant Design for Safety-Critical Applications)

Round 1

Reviewer 1 Report

Communication-induced checkpointing has been considered impractical, as demonstrated by Alvisi et al.’s FTCS 1999 paper. That paper is the most cited work on this particular checkpointing approach, yet the current submission simply ignores it. This attitude seems to lack scientific integrity.

The experiments compare four different protocols, including the proposed one. Three of these protocols were proposed by the author himself/herself. The remaining one, proposed in [8], seems to be little known: according to Google Scholar, [8] has been cited 10 times, and five of those citations come from the author of the current paper. By comparison, [7] has many more citations. The reason why [8] was chosen for comparison should be provided.

The correctness argument for the proposed protocol is quite casual and seems to cut corners. It rests mainly on examples, and no rigorous proof is given. For example, the paper reads: “This immediate update of the information can transform equation 3 into equation 5 as below.” But it does not give the exact meaning of “transform”; as a result, it is unclear why this claim actually holds.

"The selection of the target process to which ... has an exponential distribution..": This seems simply wrong. What is exponentially distributed should be the period between two consecutive basic checkpoints.

English could be improved by, for example, correcting typos. (ex.- ..has been already updated .. )
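
To illustrate the distinction drawn in the comment above on the exponential distribution, the following is a minimal sketch and not the simulation used in the paper. It assumes that the gaps between consecutive basic checkpoints are exponentially distributed and, as a separate assumption, that the target process of a message is chosen uniformly among the other processes. The names `basic_checkpoint_times` and `pick_target`, and the values of `mean_interval`, `horizon`, and the number of processes, are hypothetical.

```python
import random


def basic_checkpoint_times(mean_interval, horizon, rng):
    """Return basic-checkpoint times in [0, horizon).

    Assumption: the period between two consecutive basic checkpoints
    is exponentially distributed with the given mean.
    """
    times = []
    t = rng.expovariate(1.0 / mean_interval)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(1.0 / mean_interval)
    return times


def pick_target(sender, num_processes, rng):
    """Pick a destination process different from the sender.

    Assumption: the choice is uniform over the other processes;
    no exponential distribution is involved in this selection.
    """
    target = rng.randrange(num_processes - 1)
    return target if target < sender else target + 1


rng = random.Random(0)  # fixed seed so the sketch is repeatable
times = basic_checkpoint_times(mean_interval=5.0, horizon=100.0, rng=rng)
# Gaps between consecutive basic checkpoints (first gap measured from time 0).
gaps = [later - earlier for earlier, later in zip([0.0] + times, times)]
targets = [pick_target(sender=2, num_processes=5, rng=rng) for _ in range(1000)]
```

Here the exponential distribution governs only the timing of basic checkpoints, while target selection is a separate discrete choice.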

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

General Comments:

1. Some sentences are too long to read. Please consider making them shorter and more concise.

2. Some definitions (Definitions 6, 7, 8, 9) are NOT definitions. Instead, they are more like helper lemmas for Theorem 1. Please revise them and provide proofs if needed.

3. Although some definitions (such as Definition 9) and theorems (such as Theorem 1) are attributed to references ([2] and [1]), please describe the ideas behind them.

4. The HMNR [7] protocol and its family form the basis of this research. Please consider adding a description of HMNR in Section 1 (Introduction).

5. Why is the proposed Scalable CIC protocol correct and convergent?

Questions/Suggestions

Title, Page 1

Question: The author claims that the proposed protocol is SCALABLE. Why? Please describe the scalability of the proposed protocol in more detail.

Line 55

should be the piecewise deterministic one only

=> should be piecewise-deterministic only

Line 66

Question: this literature (??)

Lines 89-90

The author wrote "..., but send one not between a pair of local checkpoints of two different processes [19,20]."

Question: The description of Definition 2 contains grammatical errors.

Lines 97, 103, and more

figure 1 in Line 97, figure 2 in Line 103, ...

=> Figure 1, Figure 2, ...

Line 282

The author wrote "... C_1^{m_3} is satisfied."

Question: Why does C_1^{m_3} need to be satisfied, and for what purpose?

Lines 10-12

Evaluation results indicate ... are in the range of 12.5% to 84.2% and 2.5% to 11.5% respectively.

=> Evaluation results indicate that the proposed protocol outperforms the existing ones at the reduced forced checkpointing overheads from 12.5% to 84.2%, and at the reduced total execution times from 2.5% to 11.5%.

Line 21

Most of CIC protocols [3-15]

=> Most CIC protocols [3-15]

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

It appears that the concerns I raised during the first round of reviews have been appropriately addressed.

Reviewer 2 Report

There are still some typos and grammatical errors in the revised manuscript. Please check and revise it thoroughly.

Lines 22, 23

... each sent message some control information of other processes as well as ...

=> ... each sent message, some control information of other process as well as ...

Line 24

on the process's local variables and information piggybacked on the messages [2].

=> on the local variables of the process and the information piggybacked on the messages [2].

Line 69

limitations

=> limitation

Line 70

mentioned above, that is, the slow gathering ...

=> mentioned above. That is, the slow gathering ...

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
