Scalable Communication-Induced Checkpointing Protocol with Little Overhead for Distributed Computing Environments
Round 1
Reviewer 1 Report
Communication-induced checkpointing has been considered impractical, as demonstrated by Alvisi et al.’s FTCS 1999 paper. That older paper is the most cited work on this particular checkpointing approach, but the current submission simply ignores it. This attitude seems to lack scientific integrity.
The experiments compare four different protocols, including the proposed one. Three of these protocols were proposed by the author himself/herself. The remaining one, proposed in [8], seems to be little known: according to Google Scholar, [8] has been cited 10 times, and five of those citations come from the author of the current paper. By comparison, [7] has a much higher citation count. The reason why [8] was chosen for comparison should be provided.
The correctness argument for the proposed protocol is quite casual and seems to cut corners. It rests mainly on examples, and no rigorous proof is given. For example, the paper reads: “This immediate update of the information can transform equation 3 into equation 5 as below.” But it does not state the exact meaning of “transform”; as a result, it is unclear why this claim actually holds.
"The selection of the target process to which ... has an exponential distribution..": This seems simply wrong. What should be exponentially distributed is the interval between two consecutive basic checkpoints.
The English could be improved, for example by correcting typos (e.g., "..has been already updated ..").
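To illustrate the comment above on the exponential distribution, here is a minimal Python sketch. It draws the intervals between consecutive basic checkpoints from an exponential distribution and chooses the destination of a message separately. The function names, the mean interval, the horizon, and the uniform choice of destination are illustrative assumptions and are not taken from the manuscript.

```python
import random


def basic_checkpoint_times(mean_interval, horizon, rng):
    """Return basic-checkpoint times in (0, horizon].

    The gap between two consecutive basic checkpoints is drawn from an
    exponential distribution with the given mean (hypothetical parameter).
    """
    times = []
    t = rng.expovariate(1.0 / mean_interval)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(1.0 / mean_interval)
    return times


def pick_target(sender, num_processes, rng):
    """Pick a destination process other than the sender.

    A uniform choice is an assumption made here for illustration only.
    """
    target = rng.randrange(num_processes - 1)
    return target if target < sender else target + 1


rng = random.Random(0)  # fixed seed so the sketch is repeatable
times = basic_checkpoint_times(mean_interval=10.0, horizon=1000.0, rng=rng)
# Gaps between consecutive basic checkpoints (the exponentially distributed quantity).
intervals = [b - a for a, b in zip([0.0] + times, times)]
targets = [pick_target(sender=2, num_processes=5, rng=rng) for _ in range(100)]
```

In this sketch only the timing of basic checkpoints follows the exponential distribution; the destination choice is an independent discrete draw.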
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
General Comments:
1. Some sentences are too long to read. Please consider making them shorter and more concise.
2. Some definitions (Definitions 6, 7, 8, 9) are NOT definitions. Instead, they are more like helper lemmas for Theorem 1. Please revise them and provide proofs if needed.
3. Although some definitions (such as Definition 9) and theorems (such as Theorem 1) are attributed to references ([2] and [1]), please describe the ideas behind them.
4. The HMNR [7] protocol and its family form the basis of this research. Please consider adding a description of HMNR to Section 1 (Introduction).
5. Why is the proposed Scalable CIC protocol correct and convergent?
Questions/Suggestions
Title, Page 1
Question: The author claims that the proposed protocol is SCALABLE. Why? Please describe the scalability of the proposed protocol in more detail.
Line 55
should be the piecewise deterministic one only
=> should be piecewise-deterministic only
Line 66
Question: this literature (??)
Lines 89-90
The author wrote "..., but send one not between a pair of local checkpoints of two different processes [19,20]."
Question: The description of Definition 2 contains grammatical errors.
Lines 97, 103, and more
figure 1 in Line 97, figure 2 in Line 103, ...
=> Figure 1, Figure 2, ...
Line 282
The author wrote "... C_1^{m_3} is satisfied."
Question: Why does C_1^{m_3} need to be satisfied, and for what purpose?
Lines 10-12
Evaluation results indicate ... are in the range of 12.5% to 84.2% and 2.5% to 11.5% respectively.
=> Evaluation results indicate that the proposed protocol outperforms the existing ones at the reduced forced checkpointing overheads from 12.5% to 84.2%, and at the reduced total execution times from 2.5% to 11.5%.
Line 21
Most of CIC protocols [3-15]
=> Most CIC protocols [3-15]
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
It appears that the concerns I raised during the first round of reviews have been appropriately addressed.
Reviewer 2 Report
There are still some typos and grammatical errors in the revised manuscript. Please check and revise it carefully once more.
Lines 22, 23
... each sent message some control information of other processes as well as ...
=> ... each sent message, some control information of other process as well as ...
Line 24
on the process's local variables and information piggybacked on the messages [2].
=> on the local variables of the process and the information piggybacked on the messages [2].
Line 69
limitations
=> limitation
Line 70
mentioned above, that is, the slow gathering ...
=> mentioned above. That is, the slow gathering ...
Author Response
Please see the attachment.
Author Response File: Author Response.pdf