In exa-scale HPC systems, node failures will be a common event. Checkpoint/restart is thus mandatory, but snapshotting for fault tolerance is increasingly expensive due to growing checkpoint sizes. Application-based checkpointing aims at reducing this overhead by optimizing when to write a checkpoint and what to write, i.e., storing only information that cannot be re-computed in a reasonable amount of time. Moreover, information should only be stored with the required accuracy, which may be significantly lower than double precision.

Calhoun et al. [90] investigate using lossy compression to reduce checkpoint sizes for time-stepping codes for PDE simulations. For choosing the compression tolerance, they aim at an error smaller than the simulation's discretization error, which is estimated a priori using information about the mesh width of the space discretization and the order of the numerical methods. Compression is performed using SZ [18,19]. Numerical experiments for two model problems (1D heat and 1D advection equations) and two HPC applications (2D Euler equations with PlasComCM, a multiphysics plasma combustion code, and 3D Navier-Stokes flow with the code Nek5000) demonstrate that restarting from lossy compressed checkpoints does not significantly impact the simulation, but reduces checkpoint time.

Application-specific fault tolerance, including computing optimal checkpoint intervals [91,92] or multilevel checkpointing techniques [93], has been a research topic for many years. In order to illustrate the influence of lossy compression, we derive a simple model similar to [92], relating the probability of failure and the checkpoint times to the overall runtime of the application. We assume equidistant checkpoints and aim at determining the optimal number of checkpoints $n$. For this we consider a parallel-in-time simulation application (see Section 4.1) and use the notation summarized in Table 2. The overall runtime of the simulation consists of the actual computation time ${T}_{C}$, the time it takes to write a checkpoint ${T}_{\mathrm{CP}}$, and the restart time ${T}_{\mathrm{RS}}$ for ${N}_{\mathrm{RS}}$ failures/restarts: $T={T}_{C}+n{T}_{\mathrm{CP}}+{T}_{\mathrm{RS}}{N}_{\mathrm{RS}}$. Note that here ${T}_{C}$ depends implicitly on the number of cores used. The restart time ${T}_{\mathrm{RS}}$ consists of the average re-computation time from the last written checkpoint to the time of failure, here for simplicity assumed to be $\frac{1}{2}\frac{T}{n}$ (see also [91]), and the time ${T}_{\mathrm{R}}$ to recover data structures. For $N$ compute cores and a probability of failure ${p}_{\mathrm{RS}}$ per unit time and core, the estimated number of restarts is ${N}_{\mathrm{RS}}={p}_{\mathrm{RS}}TN$. Bringing everything together and solving the resulting quadratic equation for $T$, the overall runtime amounts to

$T(n)=\frac{nb}{{p}_{\mathrm{RS}}N}\left(1-\sqrt{1-\frac{2{p}_{\mathrm{RS}}N({T}_{C}+n{T}_{\mathrm{CP}})}{n{b}^{2}}}\right),$

where for brevity we use the unit-less quantity $b=1-{T}_{\mathrm{R}}{p}_{\mathrm{RS}}N$. As ${T}_{\mathrm{RS}}$ and ${N}_{\mathrm{RS}}$ depend on the total time $T$, this model includes failures during restart as well as multiple failures during the computation of one segment between checkpoints. Given the parameters of the HPC system and the application, an optimal number of checkpoints can be determined by solving the optimization problem

$\underset{n\in\mathbb{N}}{\min}\,T(n).$

Taking into account the condition

${b}^{2}>2{p}_{\mathrm{RS}}N{T}_{\mathrm{CP}},$

required for the existence of a real solution, the minimization can be done analytically, yielding

${n}_{\mathrm{opt}}=\frac{{p}_{\mathrm{RS}}N{T}_{C}}{b\sqrt{2{p}_{\mathrm{RS}}N{T}_{\mathrm{CP}}}-2{p}_{\mathrm{RS}}N{T}_{\mathrm{CP}}}.$
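The runtime model and its minimization can be sketched in a few lines of code. All parameter values below (10 h computation on $10^4$ cores, 60 s checkpoint write, 30 s recovery, failure probability $10^{-9}$ per core-second) are illustrative assumptions, as are the function names; the sketch evaluates the quadratic-model runtime $T(n)$ and checks the closed-form minimizer against a discrete search.

```python
import math

# Sketch of the simple checkpoint runtime model (illustrative parameters).
# T_C: computation time, T_CP: checkpoint write time, T_R: recovery time,
# N: compute cores, p_RS: failure probability per unit time and core.

def total_runtime(n, T_C, T_CP, T_R, p_RS, N):
    """Smaller root of the quadratic runtime model; returns inf when
    no real solution exists (failure rate too high for n checkpoints)."""
    a = p_RS * N
    b = 1.0 - T_R * a
    disc = 1.0 - 2.0 * a * (T_C + n * T_CP) / (n * b * b)
    return math.inf if disc < 0.0 else n * b / a * (1.0 - math.sqrt(disc))

def n_opt(T_C, T_CP, T_R, p_RS, N):
    """Closed-form (continuous) minimizer of total_runtime over n."""
    a = p_RS * N
    b = 1.0 - T_R * a
    return a * T_C / (b * math.sqrt(2.0 * a * T_CP) - 2.0 * a * T_CP)

# Illustrative numbers: 10 h computation on 10^4 cores, 60 s to write
# a checkpoint, 30 s to recover data structures.
params = dict(T_C=36000.0, T_CP=60.0, T_R=30.0, p_RS=1e-9, N=10_000)
n_star = n_opt(**params)  # continuous optimum, here between 10 and 11

# The discrete optimum lies next to the analytic one:
candidates = range(max(1, int(n_star) - 2), int(n_star) + 3)
best_n = min(candidates, key=lambda n: total_runtime(n, **params))
```

Rounding the continuous minimizer to the neighboring integers and comparing $T(n)$ suffices in practice, since $T$ is flat near its minimum.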

In this model, the only influence of lossy compression is through the time to read/write checkpoints and the time to recover data structures. While a simple model for the time to checkpoint is given, e.g., in [90], here we just exemplarily show the influence in Figure 11 by comparing different write/read times for checkpointing. Reducing the checkpoint size by lossy compression, thus reducing ${T}_{\mathrm{CP}}$ and ${T}_{\mathrm{R}}$, has a small but noticeable effect on the overall runtime. Note that this model neglects the impact of inexact checkpoints on the re-computation time, which might increase, e.g., due to iterative methods requiring additional steps to reduce the compression error. For iterative linear solvers this is done in [89]; a thorough analysis for the example of parallel-in-time simulation with hybrid parareal methods can be done along the lines of [65].
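The effect compared in Figure 11 can be reproduced qualitatively by evaluating the model at its best checkpoint count for two sets of read/write times. The concrete numbers below (and the factor-of-several reduction of ${T}_{\mathrm{CP}}$ and ${T}_{\mathrm{R}}$ attributed to compression) are illustrative assumptions, not values from the paper.

```python
import math

# Sketch: effect of faster (compressed) checkpoints on overall runtime,
# mirroring the comparison in Figure 11. All parameter values are
# illustrative assumptions.

def total_runtime(n, T_C, T_CP, T_R, p_RS, N):
    """Smaller root of the quadratic runtime model (inf if no real root)."""
    a = p_RS * N
    b = 1.0 - T_R * a
    disc = 1.0 - 2.0 * a * (T_C + n * T_CP) / (n * b * b)
    return math.inf if disc < 0.0 else n * b / a * (1.0 - math.sqrt(disc))

def best_runtime(T_CP, T_R, T_C=36000.0, p_RS=1e-9, N=10_000):
    """Runtime at the best (discretely searched) number of checkpoints."""
    return min(total_runtime(n, T_C, T_CP, T_R, p_RS, N) for n in range(1, 200))

uncompressed = best_runtime(T_CP=60.0, T_R=30.0)
compressed = best_runtime(T_CP=10.0, T_R=10.0)  # smaller checkpoints
saving = uncompressed - compressed  # small but noticeable (a few percent)
```

With these numbers the saving is on the order of a few percent of the total runtime, consistent with the "small but noticeable" effect described above.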