Efficient Sender-Based Message Logging Tolerating Simultaneous Failures with Always No Rollback Property
Abstract
1. Introduction
2. Preliminaries
2.1. Distributed System Model
2.2. Related Works
3. The New SBML Protocol
3.1. Basic Concepts and Algorithms
- None of the live processes in the system are rolled back during recovery, even in the case of concurrent process failures.
- The protocol provides fault tolerance for distributed applications that exchange messages in a mix of point-to-point and group communication modes.
- At little communication cost, it maintains the recovery information of each application message on redundant volatile storage in a symmetric manner.
- : the sequence number of the most recent message that the process has transmitted since its initial execution state.
- : the sequence number of the most recent message that the process has delivered to the application since its initial execution state.
- : a set keeping the recovery information of each message transmitted. Each element e is composed of the identifier of the receiver (), the send sequence number (), the list of receive sequence numbers (), and the data () of the message. The first field can be the identifier of a process or of a process group. The third field is a set whose components have the form (, ), one for each receipt of the message and the sequence number assigned to it. It may contain multiple such components for a single sent message if the message is transferred to a group of processes.
- : a set keeping the determinant of each message received. Each element e is composed of the identifier of the sender (), the send sequence number, the receive sequence number, and the data of the message. The second and fourth fields have the same meanings as the corresponding fields of the set of transmitted-message records above.
- : a table for detecting duplicates of application messages that were already delivered and that their senders have regenerated during their recovery procedures. Each field contains the sequence number of the most recent message that the process has received from a given other process.
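The per-process bookkeeping listed above can be sketched in Python as follows. This is a minimal illustration, not the paper's algorithm: every identifier (`SendLogEntry`, `Determinant`, `ProcessState`, `last_ssn_from`, and the method names) is hypothetical, since the original symbols did not survive extraction. The sketch further assumes that the duplicate-detection table stores send sequence numbers, that the components of the third send-log field are (receiver, receive sequence number) pairs, and that messages from a given sender arrive in send order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class SendLogEntry:
    """Recovery information kept by the sender for one transmitted message."""
    receiver: str                   # identifier of a process or a process group
    ssn: int                        # send sequence number
    rsns: Set[Tuple[str, int]]      # assumed form: (receiver id, receive sequence number)
    data: bytes


@dataclass
class Determinant:
    """Determinant kept by the receiver for one delivered message."""
    sender: str
    ssn: int                        # same meaning as in SendLogEntry
    rsn: int                        # receive sequence number assigned on delivery
    data: bytes                     # same meaning as in SendLogEntry


@dataclass
class ProcessState:
    pid: str
    ssn: int = 0                    # most recent send sequence number
    rsn: int = 0                    # most recent receive sequence number
    send_log: List[SendLogEntry] = field(default_factory=list)
    recv_log: List[Determinant] = field(default_factory=list)
    # Duplicate-detection table: sender id -> highest ssn delivered from it.
    last_ssn_from: Dict[str, int] = field(default_factory=dict)

    def send(self, receiver: str, data: bytes) -> SendLogEntry:
        """Assign the next ssn and log the message in volatile storage."""
        self.ssn += 1
        entry = SendLogEntry(receiver, self.ssn, set(), data)
        self.send_log.append(entry)
        return entry

    def deliver(self, sender: str, ssn: int, data: bytes) -> Optional[int]:
        """Deliver a message; return its rsn, or None if it is a duplicate."""
        if ssn <= self.last_ssn_from.get(sender, 0):
            return None             # regenerated copy of a delivered message
        self.last_ssn_from[sender] = ssn
        self.rsn += 1
        self.recv_log.append(Determinant(sender, ssn, self.rsn, data))
        return self.rsn

    def record_rsn(self, ssn: int, receiver: str, rsn: int) -> None:
        """Attach a receiver-assigned rsn to the matching send-log entry."""
        for entry in self.send_log:
            if entry.ssn == ssn:
                entry.rsns.add((receiver, rsn))
                return
```

In this sketch a sender calls `send`, the receiver calls `deliver` and reports the returned receive sequence number, and the sender stores it with `record_rsn`. A message sent to a group would accumulate one pair per group member in `rsns`, and a second delivery of the same send sequence number from the same sender is rejected as a duplicate.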
3.2. Correctness
4. Performance Evaluation
4.1. Simulation Environments
4.2. Comparison Results
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ahn, J. Efficient Sender-Based Message Logging Tolerating Simultaneous Failures with Always No Rollback Property. Symmetry 2023, 15, 816. https://doi.org/10.3390/sym15040816