Efficient implementation of ADER discontinuous Galerkin schemes for a scalable hyperbolic PDE engine

In this paper we discuss a new and very efficient implementation of high order accurate ADER discontinuous Galerkin (ADER-DG) finite element schemes on modern massively parallel supercomputers. The numerical methods apply to a very broad class of nonlinear systems of hyperbolic partial differential equations. ADER-DG schemes are by construction communication avoiding and cache blocking and are furthermore very well-suited for vectorization, so that they appear to be a good candidate for the future generation of exascale supercomputers. We introduce the numerical algorithm and show some applications to a set of hyperbolic equations with increasing level of complexity, ranging from the compressible Euler equations over the equations of linear elasticity and the unified Godunov-Peshkov-Romenski (GPR) model of continuum mechanics to general relativistic magnetohydrodynamics (GRMHD) and the Einstein field equations of general relativity. We present strong scaling results of the new ADER-DG schemes up to 180,000 CPU cores. To our knowledge, these are the largest runs ever carried out with high order ADER-DG schemes for nonlinear hyperbolic PDE systems. We also provide a detailed performance comparison with traditional Runge-Kutta DG schemes.


Introduction
Hyperbolic partial differential equations are omnipresent in the mathematical description of time-dependent processes in fluid and solid mechanics, in engineering and geophysics, as well as in plasma physics and even in general relativity. Among the most widespread applications nowadays are i) computational fluid mechanics in mechanical and aerospace engineering, in particular compressible gas dynamics at high Mach numbers; ii) geophysical and environmental free surface flows in oceans, rivers and lakes, describing natural hazards such as tsunami wave propagation, landslides, storm surges and floods; iii) seismic, acoustic and electromagnetic wave propagation processes in the time domain are described by systems of hyperbolic partial differential equations, namely the equations of linear elasticity, the acoustic wave equation and the well-known Maxwell equations; iv) high energy density plasma flows in nuclear fusion reactors as well as astrophysical plasma flows in the solar system and the universe, using either the Newtonian limit or the complete equations in full general relativity; v) the Einstein field equations of general relativity, which govern the evolution of dynamic spacetimes, can be written under the form of a nonlinear system of hyperbolic partial differential equations.

High order ADER discontinuous Galerkin finite element schemes
In this paper we consider hyperbolic PDE with non-conservative products and algebraic source terms of the form (see also [34,35]) where t ∈ R + 0 is the time, x ∈ Ω ⊂ R d is the spatial position vector in d space dimensions, Q ∈ Ω Q ⊂ R m is the state vector, F(Q) is the nonlinear flux tensor, B(Q) · ∇Q is a non-conservative product and S(Q) is a purely algebraic source term. Introducing the system matrix A(Q) = ∂F/∂Q + B(Q) the above system can also be written in quasi-linear form as The system is hyperbolic if for all n = 0 and for all Q ∈ Ω Q the matrix A(Q) · n has m real eigenvalues and a full set of m linearly independent right eigenvectors. The system (1) is provided with an initial condition Q(x, 0) = Q 0 (x) and appropriate boundary conditions on ∂Ω. In some parts of the paper we will also make use of the vector of primitive (physical) variables denoted by V = V(Q). For very complex PDE systems, like the general relativistic MHD equations, it may be much easier to express the flux tensor F in terms of V rather than in terms of Q, but the evaluation of V = V(Q) can become very complicated.

Unlimited ADER-DG scheme and Riemann solvers
We cover the computational domain Ω with a set of non-overlapping Cartesian control volumes in Here, x i = (x i , y i , z i ) denotes the barycenter of cell Ω i and ∆x i = (∆x i , ∆y i , ∆z i ) is the mesh spacing associated with Ω i in each space dimension. The domain Ω = Ω i is the union of all spatial control volumes. A key ingredient of the ExaHyPE engine http://exahype.eu is a cell-by-cell adaptive mesh refinement (AMR), which is built upon the space-tree implementation Peano [43,44]. For further details about cell-by-cell AMR, see [45]. For AMR in combination with better than second order accurate finite volume and DG schemes with time-accurate local time stepping (LTS) and for a literature overview of state of the art AMR methods, the reader is referred to [46][47][48][49][50][51] and references therein. Since the main focus of this paper is not on AMR, at this point we can only give a very brief summary of existing AMR methods and codes for hyperbolic PDE, without pretending to be complete. Starting point of adaptive mesh refinement for hyperbolic conservation laws was of course the pioneering work of Berger et al. [52][53][54], who were the first to introduce a patched-based block-structured AMR method. Further developments are reported in [55][56][57] based on the second order accurate wave-propagation algorithm of LeVeque. For computational astrophysics, relevant AMR techniques have been documented, e.g., in [58][59][60][61][62][63][64][65][66], including the RAMSES, PLUTO, NIRVANA, AMRVAC and BHAC codes. For a recent and more complete survey of high level AMR codes, the reader is referred to the review paper [67]. Better than second order accurate finite volume and finite difference schemes on adaptive meshes can be found, e.g., in [68][69][70][71][72][73][74].
In the following, the discrete solution of the PDE system (1) is denoted by u h and is defined in terms of tensor products of piecewise polynomials of degree N in each spatial direction. The discrete solution space is denoted by U h in the following. Since we adopt a discontinuous Galerkin (DG) finite element method, the numerical solution u h is allowed to jump across element interfaces, as in the context of finite volume schemes. Within each spatial control volume Ω i the discrete solution u h restricted to that control volume is written at time t n in terms of some nodal spatial basis functions Φ l (x) and some unknown degrees of freedomû n i,l : where l = (l 1 , l 2 , l 3 ) is a multi-index and the spatial basis functions Φ l (x) = ϕ l 1 (ξ)ϕ l 2 (η)ϕ l 3 (ζ) are generated via tensor products of one-dimensional nodal basis functions ϕ k (ξ) on the reference interval [0, 1]. The transformation from physical coordinates x ∈ Ω i to reference coordinates For the one-dimensional basis functions ϕ k (ξ) we use the Lagrange interpolation polynomials passing through the Gauss-Legendre quadrature nodes ξ j of an N + 1 point Gauss quadrature formula. Therefore, the nodal basis functions satisfy the interpolation property ϕ k (ξ j ) = δ kj , where δ kj is the usual Kronecker symbol, and the resulting basis is orthogonal. Furthermore, due to this particular choice of a nodal tensor-product basis, the entire scheme can be written in a dimension-by-dimension fashion, where all integral operators can be decomposed into a sequence of one-dimensional operators acting only on the N + 1 degrees of freedom in the respective dimension. For details on multi-dimensional quadrature, see the well-known book of Stroud [75].
In order to derive the ADER-DG method, we first multiply the governing PDE system (1) with a test function Φ k ∈ U h and integrate over the space-time control volume Ω i × [t n ; t n+1 ]. This leads to with dx = dx dy dz. As already mentioned before, the discrete solution is allowed to jump across element interfaces, which means that the resulting jump terms have to be taken properly into account. In our scheme this is achieved via numerical flux functions (approximate Riemann solvers) and via the path-conservative approach that was developed by Castro and Parés in the finite volume context [76,77]. It has later been also extended to the discontinuous Galerkin finite element framework in [35,78,79]. In classical Runge-Kutta DG schemes, only a weak form in space of the PDE is obtained, while time is still kept continuous, thus reducing the problem to a nonlinear system of ODE, which is subsequently integrated with standard ODE solvers in time. However, this requires MPI communication in each Runge-Kutta stage. Furthermore, each Runge-Kutta stage requires accesses to the entire discrete solution in memory. In the ADER-DG framework, a completely different paradigm is used. Here, higher order in time is achieved with the use of an element-local space-time predictor, denoted by q h (x, t) in the following, and which will be discussed in more detail later. Using (3), integrating the first term by parts in time and integrating the flux divergence term by parts in space, taking into account the jumps between elements and making use of this local space-time predictor solution q h instead of Q, the weak formulation (4) can be rewritten as where the first integral leads to the element mass matrix, which is diagonal since our basis is orthogonal. The boundary integral contains the approximate Riemann solver and accounts for the jumps across element interfaces, also in the presence of non-conservative products. The third and fourth integral account for the smooth part of the flux and the non-conservative product, while the right hand side takes into account the presence of the algebraic source term. According to the framework of path-conservative schemes [35,76,77,79], the jump terms are defined via a path-integral in phase space between the boundary extrapolated states at the left q − h and at the right q + h of the interface as follows: with B · n = B 1 n 1 + B 2 n 2 + B 3 n 3 . Throughout this paper, we use the simple straight-line segment path In order to achieve exactly well-balanced schemes for certain classes of hyperbolic equations with non-conservative products and source terms, the segment path is not sufficient and a more elaborate choice of the path becomes necessary, see e.g. [80][81][82][83]. In relation (6) above the symbol Θ > 0 denotes an appropriate numerical dissipation matrix. Following [35,84,85], the path integral that appears in (6) can be simply evaluated via some sufficiently accurate numerical quadrature formulae. We typically use a three-point Gauss-Legendre rule in order to approximate the path-integral. For a simple path-conservative Rusanov-type method [35,86], the numerical dissipation matrix reads where I denotes the identity matrix and s max is the maximum wave speed (eigenvalue λ of matrix A · n) at the element interface. In order to reduce numerical dissipation, one can use better Riemann solvers, such as the Osher-type schemes proposed in [85,87], or the recent extension of the original HLLEM method of Einfeldt and Munz [88] to general conservative and non-conservative hyperbolic systems recently put forward in [89]. The choice of the approximate Riemann solver and therefore of the viscosity matrix Θ completes the numerical scheme (5). In the next subsection, we shortly discuss the computation of the element-local space-time predictor q h , which is a key ingredient of our high order accurate and communication-avoiding ADER-DG schemes.

Space-time predictor and suitable initial guess
As already mentioned previously, the element-local space-time predictor is an important key feature of ADER-DG schemes and is briefly discussed in this section. The computation of the predictor solution q h (x, t) is based on a weak formulation of the governing PDE system in space-time and was first introduced in [34,90,91]. Starting from the known solution u h (x, t n ) at time t n and following the terminology of Harten et al. [92], we solve a so-called Cauchy problem in the small, i.e. without considering the interaction with the neighbor elements. In the ENO scheme of Harten et al. [92] and in the original ADER approach of Toro and Titarev [40][41][42] the strong differential form of the PDE was used, together with a combination of Taylor series expansions and the so-called Cauchy-Kovalewskaya procedure. The latter is very cumbersome, or becomes even unfeasible for very complicated nonlinear hyperbolic PDE systems, since it requires a lot of analytic manipulations of the governing PDE system, in order to replace time derivatives with known space derivatives at time t n . This is achieved by successively differentiating the governing PDE system with respect to space and time and inserting the resulting terms into the Taylor series. For an explicit example of the Cauchy-Kovalewskaya procedure applied to the three-dimensional Euler equations of compressible gas dynamics and the MHD equations, see [93] and [94]. Instead, the local space-time discontinuous Galerkin predictor introduced in [34,90,91], requires only pointwise evaluations of the fluxes, source terms and non-conservative products. For element Ω i the predictor solution q h is now expanded in terms of a local space-time basis Using the local space-time ansatz (9) Eq. (11) becomes an element-local nonlinear system for the unknown degrees of freedomq i,l of the space-time polynomials q h . The solution of (11) can be found via a simple and fast converging fixed point iteration (a discrete Picard iteration) as detailed e.g. in [34,95]. For linear homogeneous systems, the discrete Picard iteration converges in a finite number of at most N + 1 steps, since the involved iteration matrix is nilpotent, see [96]. However, we emphasize that the choice of an appropriate initial guess q 0 h (x, t) for q h (x, t) is of fundamental importance to obtain a faster convergence and thus a computationally more efficient scheme. For this purpose, one can either use an extrapolation of q h from the previous time interval [t n−1 , t n ], as suggested e.g. in [97], or one can employ a second-order accurate MUSCL-Hancock-type approach, as proposed in [95], which is based on discrete derivatives computed at time t n . The initial guess is most conveniently written in terms of a Taylor series expansion of the solution in time, where then suitable approximations of the time derivatives are computed. In the following we introduce the operator which is an approximation of the time derivative of the solution. The second-order accurate MUSCL-type initial guess [95] then reads while a third-order accurate initial guess for q h (x, t) is given by Here, we have used the abbreviations k 1 := L (u h (x, t n )) and k 2 := L (u h (x, t n ) + ∆tk 1 ). For an initial guess of even higher order of accuracy, it is possible to use the so-called continuous extension Runge-Kutta (CERK) schemes of Owren and Zennaro [98]; see also [99] for the use of CERK time integrators in the context of high order discontinuous Galerkin finite element methods. If an initial guess with polynomial degree N − 1 in time is chosen, it is sufficient to use one single Picard iteration to solve (11) to the desired accuracy. At this point, we make some comments about a suitable data-layout for high order ADER-DG schemes. In order to compute the discrete derivative operators needed in the predictor (11), especially for the computation of the discrete gradient ∇q h , it is very convenient to use an array-of-struct (AoS) data structure. In this way, the first or fastest-running unit-stride index is the one associated with the m quantities contained in the vector Q, while the other indices are associated with the space-time degrees of freedom, i.e. we arrange the data contained in the set of degrees of freedomq i l asq i v,l 1 ,l 2 ,l 3 ,l 0 , with 1 ≤ v ≤ m and 1 ≤ l k ≤ N + 1. The discrete derivatives in any spatial and in time direction can then be simply computed by the multiplication of a subset of the degrees of freedom with the transpose of a small (N + 1) × (N + 1) matrix D kl from the right, which reads where h is the respective spatial or temporal step size in the corresponding coordinate direction, i.e. either ∆x i , ∆y i , ∆z i or ∆t. For this purpose, the optimized library for small matrix multiplications libxsmm can be employed on Intel machines, see [100] and [101,102] for more details. However, the AoS data layout is not convenient for vectorization of the PDE evaluation in ADER-DG scheme, since vectorization of the fluxes, source terms and non-conservative products should preferably be done over the integration points l. For this purpose, we convert the AoS data layout on the fly into a struct-of-array (SoA) data layout via appropriate transposition of the data and then call the physical flux function F(q h ) as well as the combined algebraic source term and non-conservative product contained in the expression S(q h ) − B(q h ) · ∇q h simultaneously for a subset of VECTORLENGTH space-time degrees of freedom, where VECTORLENGTH is the length of the AVX registers of modern Intel Xeon CPUs, i.e. 4 for those with the old 256 bit AVX and AVX2 instruction sets (Sandy Bridge, Haswell, Broadwell) and 8 for the latest Intel Xeon scalable CPUs with 512 bit AVX instructions (Skylake). The result of the vectorized evaluation of the PDE, which is still in SoA format, is then converted back to the AoS data layout using appropriate vectorized shuffle commands. The element-local space-time predictor is arithmetically very intensive, but at the same time it is also by construction cache-blocking. While in traditional RKDG schemes, each Runge-Kutta stage requires touching all spatial degrees of freedom of the entire domain once per Runge-Kutta stage, in our ADER-DG approach the spatial degrees of freedom u h need to be loaded only once per element and time step, and from those all space-time degrees of freedom of q h are computed. Ideally, this procedure fits entirely into the L3 cache or even into the L2 cache of the CPU, at least up to a certain critical polynomial degree N c = N c (m), which is a function of the available L3 or L2 cache size, but also of the number of quantities m to be evolved in the PDE system.
Last but not least, it is important to note that it is possible to hide the entire MPI communication that is inevitably needed on distributed memory supercomputers behind the space-time predictor. For this purpose, the predictor is first invoked on the MPI boundaries of each CPU, which then immediately sends the boundary-extrapolated data q − h and q + h to the neighbor CPUs. While the messages containing the data of these non-blocking MPI send and receive commands are sent around, each CPU can compute the space-time predictor of purely interior elements that do not need any MPI communication.
For an efficient task-based formalism used within ExaHyPE in the context of shared memory parallelism, see [103]. This completes the description of the efficient implementation of the unlimited ADER-DG schemes used within the ExaHyPE engine.

A posteriori subcell finite volume limiter
In regions where the discrete solution is smooth, there is indeed no need for using nonlinear limiters. However, in the presence of shock waves, discontinuities or strong gradients, and taking into account the fact that even a smooth signal may become non-smooth on the discrete level if it is underresolved on the grid, we have to supplement our high order unlimited ADER-DG scheme described above with a nonlinear limiter.
In order to build a simple, robust and accurate limiter, we follow the ideas outlined in [36][37][38]104], where a novel a posteriori limiting strategy for ADER-DG schemes was developed, based on the ideas of the MOOD paradigm introduced in [105][106][107][108] in the finite volume context. In a first run, the unlimited ADER-DG scheme is used and produces a so-called candidate solution, denoted by u * h (x, t n+1 ) in the following. This candidate solution is then checked a posteriori against several physical and numerical detection criteria. For example, we require some relevant physical quantities of the solution to be positive (e.g. pressure and density), we require the absence of floating point errors (NaN) and we impose a relaxed discrete maximum principle (DMP) in the sense of polynomials, see [36]. As soon as one of these detection criteria is not satisfied, a cell is marked as troubled zone and is scheduled for limiting.
A cell Ω i that has been marked for limiting is now split into (2N + 1) d finite volume subcells, which are denoted by Ω i,s . They satisfy Ω i = s Ω i,s . Note that this very fine division of a DG element into finite volume subcells does not reduce the time step of the overall ADER-DG scheme, since the CFL number of explicit DG schemes scales with 1/(2N + 1), while the CFL number of finite volume schemes (used on the subgrid) is of the order of unity. The discrete solution in the subcells Ω i,s is represented at time t n in terms of piecewise constant subcell averagesū n i,s , i.e.
These subcell averages are now evolved in time with a second or third order accurate finite volume scheme, which actually looks very similar to the previous ADER-DG scheme (5), with the difference that now the test function is unity and the spatial control volumes Ω i are replaced by the sub-volumes Ω i,s : Here we use again a space-time predictor solution q h , but which is now computed from an initial condition given by a second order TVD reconstruction polynomial, or from a WENO [26] or CWENO reconstruction [73,[109][110][111] polynomial w h (x, t n ) computed from the cell averagesū n i,s via an appropriate reconstruction operator. The predictor is either computed via a standard second order MUSCL-Hancock-type strategy, or via the space-time DG approach of (11), but where the initial data u h (x, t n ) are now replaced by w h (x, t n ) and the spatial control volumes Ω i are replaced by the subcells Ω i,s . Once all subcell averagesū n+1 i,s inside a cell Ω i have been computed according to (17), the limited DG polynomial u h (x, t n+1 ) at the next time level is obtained again via a classical constrained least squares reconstruction procedure requiring and Here, the second relation is a constraint and means conservation at the level of the control volume Ω i . This completes the brief description of the subcell finite volume limiter used here.

Some examples of typical PDE systems solved with the ExaHyPE engine
The great advantage of ExaHyPE over other existing PDE solvers is its great flexibility and versatility for the solution of a very wide class of hyperbolic PDE systems (1). The implementation of the numerical method and the definition of the PDE system to be solved are completely independent of each other. The compute kernels are provided either as generic or as an optimized implementation for the general PDE system given by (1), while the user only needs to provide particular implementations of the functions F(Q), B(Q) and S(Q). It is obviously also possible to drop terms that are not needed. This allows to solve all the PDE systems listed below in one single software package. In all numerical examples shown below, we have used a CFL condition of the type where ∆x, ∆y and ∆z are the mesh spacings and |λ x max |, |λ x max | and |λ x max | are the maximal absolute values of the eigenvalues (wave speeds) of the matrix A · n in x, y and z direction, respectively. The coefficient α < 1/(2N + 1) can be obtained via a numerical von Neumann stability analysis and is reported for some relevant N in [34].

The Euler equations of compressible gas dynamics
The Euler equations of compressible gas dynamics are among the simplest nonlinear systems of hyperbolic conservation laws. They only involve a conservative flux F(Q) and read ∂ ∂t Here, ρ is the mass density, v is the fluid velocity, ρE is the total energy density and p is the fluid pressure, which is related to ρ, ρE and v via the so-called equation of state (EOS). In the following we show the computational results for two test problems. The first one is the smooth isentropic vortex test case first proposed in [112] and also used in [36], which has an exact solution and is therefore suitable for a numerical convergence study. Some results of [36] are summarized in Table 1 below, where N x denotes the number of cells per space dimensions. From the results one can conclude that the high order ADER-DG schemes converge with the designed order of accuracy in both space and time. In order to give a quantitative assessment for the cost of the scheme, we define and provide the TDU metric, which is the cost per degree of freedom update per CPU core, see also [34]. The TDU metric is easily computed by dividing the measured wall clock time (WCT) of a simulation by the number of elements per CPU core and time steps carried out, and by the number of spatial degrees of freedom per element, i.e. (N + 1) d . With the appropriate initial guess and AVX 512 vectorization of the code discussed in the previous section, the cost for updating one single degree of freedom for a fourth order ADER-DG scheme (N = 3) for the 3D compressible Euler equations is as low as TDU=0.25 µs when using one single CPU core of a new Intel i9-7900X Skylake test workstation with 3.3 GHz nominal clock speed, 32 GB of RAM and a total number of 10 CPU cores. This cost metric can be directly compared with the cost to update one single point or control volume of existing finite difference and finite volume schemes. Table 1. L 1 , L 2 and L ∞ errors and numerical convergence rates obtained for the two-dimensional isentropic vortex test problem using different unlimited ADER-DG schemes, see [36].  In the following we show the results obtained with an ADER-DG scheme using piecewise polynomials of degree N = 9 for a very stringent test case, which is the so-called Sedov blast wave problem detailed in [97,104,113,114]. It consists in an explosion propagating in a zero pressure gas, leading to an infinitely strong shock wave. In our setup, the outer pressure is set to 10 −14 , i.e. close to machine zero. In order to get a robust numerical scheme, it is useful to perform the reconstruction step in the subcell finite volume limiter as well as the space-time predictor of the ADER-DG scheme in primitive variables, see [97]. The computational results obtained are shown in Fig. 1, where we can observe a very good agreement with the reference solution. One furthermore can see that the discrete solution respects the circular symmetry of the problem and the a posteriori subcell limiter is only acting in the vicinity of the shock wave.

A novel diffuse interface approach for linear seismic wave propagation in complex geometries
Seismic wave propagation problems in complex 3D geometries are often very challenging due to the geometric complexity. Standard approaches either use regular curvilinear boundary-fitted meshes, or unstructured tetrahedral or hexahedral meshes. In all cases, a certain amount of user interaction for grid generation is required. Furthermore, the geometric complexity can have a negative impact on the admissible time step size due to the CFL condition, since the mesh generator may create elements with very bad aspect ratio, so-called sliver elements. In the case of regular curvilinear grids, the Jacobian of the mapping may become ill-conditioned and thus reduce the admissible time step size. In [115] a novel diffuse interface approach has been forwarded, where only the definition of a scalar volume fraction function α is required, where α = 1 is set inside the solid medium, and α = 0 in the surrounding gas or vacuum. The governing PDE system proposed in [115] reads ∂αv ∂t and clearly falls into the class of PDE systems described by (1). Here, σ denotes the symmetric stress tensor, v is the velocity vector, α ∈ [0, 1] is the volume fraction, λ and µ are the Lamé constants and ρ is the density of the solid medium. The elasticity tensor E is a function of λ and µ and relates stress and strain via the Hooke law. The last four quantities obey trivial evolution equations, which state that these parameters remain constant in time. However, they still need to be properly included in the evolution system, since they have an influence on the solution of the Riemann problem. An analysis of the eigenstructure of (21) -(23) shows that the eigenvalues are all real and are independent of the volume fraction function α. Furthermore, the exact solution of a generic Riemann problem with α = 1 on the left and α = 0 on the right yields the free surface boundary condition σ · n = 0 at the interface, see [115] for details. In this new approach, the mesh generation problem can be fully avoided, since all that is needed is the specification of the scalar volume fraction function α, which is set to unity inside the solid and to zero outside. A realistic 3D wave propagation example based on real DTM data of the Mont Blanc region is shown in Figures 2 and 3, where the 3D contour colors of the wave field as well as a set of seismogram recordings in two receiver points are reported. For this simulation, a uniform Cartesian base-grid of 80 3 elements was used, together with one level of AMR refinement close to the free surface boundary determined by the DTM model. A fourth order ADER-DG scheme (N = 3) has been used in this simulation. We stress that the entire setup of the computational model in the diffuse interface approach is completely automatic, and no manual user interaction was required. The reference solution was obtained with a high order ADER-DG scheme of the same polynomial degree N = 3 using an unstructured boundary-fitted tetrahedral mesh [116] of similar spatial resolution, containing a total of 1, 267, 717 elements. We observe an excellent agreement between the two simulations, which were obtained with two completely different PDE systems on two different grid topologies.

The unified Godunov-Peshkov-Romenski model of continuum mechanics (GPR)
A major achievement of ExaHyPE was the first successful numerical solution of the unified first order symmetric hyperbolic and thermodynamically compatible Godunov-Peshkov-Romenski (GPR) model of continuum mechanics, see [17,18]. The GPR model is based on the seminal papers by Godunov and Romenski [14,15,117] on inviscid symmetric hyperbolic systems. The dissipative mechanisms, which allow to model both plastic solids as well as viscous fluids within one single set of equations were added later in the groundbreaking work of Peshkov and Romenski in [16]. The GPR Figure 2. Wave field of a seismic wave propagation problem with the novel diffuse interface approach on adaptive Cartesian grids developed in [115] (left) compared with the reference solution obtained on a classical boundary-fitted unstructured tetrahedral mesh [116] (right).  [115] and with a reference solution obtained with high order ADER-DG schemes on boundary-fitted unstructured meshes [116]. model is briefly outlined below, while for all details the interested reader is referred to [16][17][18]. The governing equations read ∂ρ ∂t ∂ρu i ∂t ∂A ik ∂t ∂ρJ i ∂t ∂ρE ∂t Furthermore, the system is also endowed with an entropy inequality, see [17]. Here, ρ is the mass density, [u i ] = v = (u, v, w) is the velocity vector, p is a non-equilibrium pressure, [A ik ] = A is the distorsion field, [J i ] = J is the thermal impulse vector, T is the temperature and ρE is the total energy density that is defined according to [17] as in terms of the specific internal energy e = e(p, ρ) given by the usual equation of state (EOS), the kinetic energy, the energy stored in the medium due to deformations and in the thermal impulse. Furthermore, G = A T A is a metric tensor induced by the distortion field A, which allows to measure distances and thus deformations in the medium, c s is the shear sound speed and α is a heat wave propagation speed; the symbol devG = G − 1 3 tr G indicates the trace-free part of the metric tensor G. From the definition of the total energy (29) and the relations H i = E J i , ψ ik = E A ij , σ ik = −ρA mi E A mk , T = E S and q k = E S E J k the shear stress tensor and the heat flux read σ = −ρc 2 s GdevG and q = α 2 TJ. It can furthermore be shown via formal asymptotic expansion [17] that via an appropriate choice of θ 1 and θ 2 in the stiff relaxation limit τ 1 → 0 and τ 2 → 0, the stress tensor and the heat flux tend to those of the compressible Navier-Stokes equations with transport coefficients µ = µ(τ 1 , c s ) and λ = λ(τ 2 , α) related to the relaxation times τ 1 and τ 2 and to the propagation speeds c s and α, respectively. For a complete derivation, see [17,18]. In the opposite limit τ 1 → ∞ the model describes an ideal elastic solid with large deformations. This means that elastic solids as well as viscous fluids can be described at the aid of the same mathematical model. At this point we stress that numerically we always solve the unified first order hyperbolic PDE system (24)- (28), even in the stiff relaxation limit (30), when the compressible Navier-Stokes-Fourier system is retrieved asymptotically. We emphasize that we never need to discretize any parabolic terms, since the hyperbolic system (24)-(28) with algebraic relaxation source terms fits perfectly into the framework of Eqn. (1). In the Fig. 4 we show numerical results obtained in [17] for a viscous heat conducting shock wave and the comparison with the exact solution of the compressible Navier-Stokes equations.

The equations of ideal general relativistic magnetohydrodynamics (GRMHD)
A very challenging PDE system is given by the equations of ideal general relativistic magnetohydrodynamics (GRMHD). The governing PDE are a result of the Einstein field equations and can be written in compact covariant notation as follows: where ∇ µ is the usual covariant derivative operator, T µν is the energy-momentum tensor, * F µν is the Faraday tensor and u µ is the four-velocity. The compact equations above can be expanded into a so-called 3+1 formalism, which can be cast into the form of (1), see [51,118,119] for more details. The final evolution system involves 9 field variables plus the 4-metric of the background space-time, which is supposed to be stationary here. A numerical convergence study for the large amplitude Alfvén wave test problem described in [119] solved in the domain Ω = [0, 2π] 3 up to t = 1 and carried out with high order ADER-DG schemes in [51] is reported in the Table 2 below, where we also show a direct comparison with high order Runge-Kutta discontinuous Galerkin schemes. We observe that the ADER-DG schemes are competitive with RKDG methods, even for this very complex system of hyperbolic PDE. The results reported in Table 2 refer to the non-vectorized version of the code. Further significant performance improvements are expected from a carefully vectorized implementation of the GRMHD equations, in particular concerning the vectorization of the cumbersome conversion of the vector of conservative variables to the vector of primitive variables, i.e. the function V = V(Q). For the GRMHD system V cannot be computed analytically in terms of Q, but requires the iterative solution of one nonlinear scalar algebraic equation together with the computation of the roots of a third order polynomial, see [119] for details. In our vectorized implementation of the PDE, we have therefore in particular vectorized the primitive variable recovery via a direct implementation in AVX intrinsics. We have furthermore made use of careful auto-vectorization via the compiler for the evaluation of the physical flux function and for the non-conservative product. Thanks to this vectorization effort, on one single CPU core of an Intel i9-7900X Skylake test workstation with 3.3 GHz nominal clock frequency and using AVX 512 the CPU time necessary for a single degree of freedom update (TDU) for a fourth order ADER-DG scheme (N = 3) could be reduced to TDU = 2.3 µs for the GRMHD equations in three space dimensions.
As second test problem we present the results obtained for the Orszag-Tang vortex system in flat Minkowski spacetime, where the GRMHD equations reduce to the special relativitic MHD equations. The initial condition is given by x , 0, 1, − sin y , sin 2x , 0 , and we set the adiabatic index to Γ = 4/3. The computational domain is Ω = [0, 2π] 2 and is discretized with a dynamically adaptive AMR grid. For this test we chose the P 5 version of the ADER-DG scheme with FV subcell limiter and the rest mass density as indicator function for AMR, i.e. ϕ(Q) = ρ. Fig. 6 shows 1D cuts through the numerical solution at time t = 2 and at y = 0.01, while Fig. 5 shows the numerical results for the AMR-grid with limiter-status map (blue cells are unlimited, while limited cells are highlighted in red), together with Schlieren images for the rest-mass density at time t = 2.
The same simulation has been repeated with different refinement estimator functions χ that tell the AMR algorithm where and when to refine and to coarsen the mesh: (i) a simple first order derivative estimator χ 1 based on discrete gradients of the indicator function ϕ(Q), (ii) the classical second order derivative estimator χ 2 based on [120], (iii) a novel estimator χ 3 based on the action of the a posteriori subcell finite volume limiter, i.e. the mesh is refined where the limiter is active (iv) a multi-resolution estimator χ 4 based on the difference in L ∞ norm of the discrete solution on two different refinement As last test case we simulate a stationary neutron star in three space dimensions using the Cowling approximation, i.e. assuming a fixed static background spacetime. The initial data for the matter and the spacetime are both compatible with the Einstein field equations and are given by the solution of the Tolman-Oppenheimer-Volkoff (TOV) equations, which constitute a nonlinear ODE system in the radial coordinate that can be numerically solved up to any precision at the aid of a fourth order Runge-Kutta scheme using a very fine grid. We setup a stable nonrotating TOV star without magnetic field and with central rest mass density ρ(0, 0) = 1.28 · 10 −3 and adiabatic exponent Γ = 2 in a computational domain Ω = [−10, +10] 3 discretized with a fourth order ADER-DG scheme (N = 3) using 32 3 elements, which corresponds to 128 3 spatial degrees of freedom. The pressure in the atmosphere outside the compact object is set to p atm = 10 −13 . We run the simulation until a final time of t = 1000 and measure the L ∞ error norms of the rest mass density and the pressure against the exact solution, which is given by the initial condition. The error at t = 1000 for the rest mass density is L ∞ (ρ) = 1.553778 · 10 −5 while the error for the pressure is L ∞ (p) = 1.605334 · 10 −7 . The simulation was carried out with the vectorized  Fig. 7. At the final time t = 1000, the relative error of the central rest mass density is still below 0.1%. In the right panel of Fig. 7 we show the contour surfaces of the pressure at the final time t = 1000. In Fig. 8 we show a 1D cut along the x axis, comparing the numerical solution at time t = 1000 with the exact one. We note that the numerical scheme is very accurate, but it is not well-balanced for the GRMHD equations, i.e. the method cannot preserve the stationary equilibrium solution of the TOV equations exactly at the discrete level. Therefore, further work along the lines of research reported recently in [83] for the Euler equations with Newtonian gravity are needed, extending the framework of well-balanced methods [76,77,122] also to general relativity. Finally, in Fig. 9 we compare the exact and the numerical solution at time t = 1000 in the x − y plane.

A strongly hyperbolic first order reduction of the CCZ4 formulation of the Einstein field equations (FO-CCZ4)
The last PDE system under consideration here are the Einstein field equations that describe the evolution of dynamic spacetimes. Here we consider the so-called CCZ4 formulation [123], which is based on the Z4 formalism that takes into account the involutions (stationary differential constraints) inherent in the Einstein equations via an augmented system similar to the generalized Lagrangian multiplier (GLM) approach of Dedner et al. [124] that takes care of the stationary divergence-free constraint of the magnetic field in the MHD equations. In compact covariant notation the undamped Z4 Einstein equations in vacuum, which can be derived from the Einstein-Hilbert action integral associated with the Z4 Lagrangian L = g µν R µν + 2∇ µ Zν , read where g µν is the 4-metric of the spacetime, R µν is the 4-Ricci tensor and the 4-vector Z ν accounts for the stationary constraints of the Einstein equations, as already mentioned before. After introducing the usual 3+1 ADM split of the 4-metric as the equations can be cast into a time-dependent system of 25 partial differential equations that involve first order derivatives in time and both first and second order derivatives in space, see [123]. Nevertheless, the system is not dissipative, but a rather unusual formulation of a wave equation, see [125]. In the expression above, α denotes the so-called lapse, β i is the spatial shift vector and γ ij is the spatial metric. In the original form presented in [123], the PDE system does not fit into the formalism given by Eqn. (1). After the introduction of 33 auxiliary variables, which are the spatial gradients of some of the 25 primary evolution quantities, it is possible to derive a first order reduction of the system that contains a total of 58 evolution quantities. However, a naive procedure of converting the original second order evolution system into a first order system leads only to a weakly hyperbolic formulation, which is not suitable for numerical simulations since the initial value problem is not well posed in this case. Only after adding suitable first and second order ordering constraints, which arise from the definition of the auxiliary variables, it is possible to obtain a provably strongly hyperbolic and thus well-posed evolution system, denoted by FO-CCZ4 in the following. For all details of the derivation, the strong hyperbolicity proof and numerical results achieved with high order ADER-DG schemes, the reader is referred to [126]. In order to give an idea about the complexity of the Einstein field equations, it should be mentioned that one single evaluation of the FO-CCZ4 system requires about 20,000 floating point operations! In order to obtain still a computationally efficient implementation, the entire PDE system has been carefully vectorized using blocks of the size VECTORLENGTH, so that in the end a level of 99,9 % of vectorization of the code have been reached. Using a fourth order ADER-DG scheme (N = 3) the time per degree of freedom update (TDU) metric per core on a modern workstation with Intel i9-7900X CPU that supports the novel AVX 512 instructions is TDU=4.7 µs.

Strong MPI scaling study for the FO-CCZ4 system
A major focus of this paper is the efficient implementation of ADER-DG schemes for high performance computing (HPC) on massively parallel distributed memory supercomputers. For this purpose, we have very recently carried out a systematic study of the strong MPI scaling efficiency of our new high order fully-discrete one-step ADER-DG schemes on the Hazel Hen supercomputer of the HLRS center in Stuttgart, Germany, using from 720 up to 180,000 CPU cores. We have furthermore carried out a systematic comparison with conventional Runge-Kutta DG schemes using the SuperMUC phase 1 system of the LRZ center in Munich, Germany.
As already discussed before, the particular feature of ADER-DG schemes compared to traditional Runge-Kutta DG schemes (RKDG) is that they are intrinsically communication-avoiding and cache-blocking, which makes them particularly well suited for modern massively parallel distributed memory supercomputers. As governing PDE system for the strong scaling test the novel first-order reduction of the CCZ4 formulation of the 3+1 Einstein field equations has been been adopted [126]. We recall that FO-CCZ4 is a very large nonlinear hyperbolic PDE system that contains 58 evolution quantities.
The first strong scaling study on the SuperMUC phase 1 system uses 64 to 64,000 CPU cores. The test problem was the gauge wave problem [126] setup on the 3D domain Ω = [−0.5, 0.5] 3 . For the test we have compared a fourth order ADER-DG scheme (N = 3) with a fourth order accurate RKDG scheme on a uniform Cartesian grid composed of 120 3 elements. It has to be stressed, that when using 64,000 CPU cores for this setup each CPU has to update only 3 3 = 27 elements. The wall clock time as a function of the used number of CPU cores (nCPU) and the obtained parallel efficiency with respect  [126]. Left: comparison of ADER-DG schemes with conventional Runge-Kutta DG schemes from 64 to 64,000 CPU cores on the SuperMUC phase 1 system of the LRZ supercomputing center (Garching, Germany). Right: strong scaling study from 720 to 180,000 CPU cores, including a full machine run on the Hazel Hen supercomputer of HLRS (Stuttgart, Germany) with ADER-DG schemes (right). Even on the full machine we observe still more than 90 % of parallel efficiency.
to an ideal linear scaling are reported in the left panel of Fig. 10. We find that ADER-DG schemes provide a better parallel efficiency than RKDG schemes, as expected. The second strong scaling study has been performed on the Hazel-Hen supercomputer, using 720 to 180,000 CPU cores. Again we have used a fourth order accurate ADER-DG scheme (N = 3), this time using a uniform grid of 200 × 180 × 180 elements, solving again the 3D gauge wave benchmark problem detailed in [126]. The measured wall-clock-times (WCT) as a function of the employed number of CPU cores, as well as the corresponding parallel scaling-efficiency are shown in Fig. 10. The results depicted in Fig. 10 clearly show that our new ADER-DG schemes scale very well up to 90,000 CPU cores with a parallel efficiency greater than 95%, and up to 180,000 cores with a parallel efficiency that is still greater than 93%. Furthermore, the code was instrumented with manual FLOP counters in order to measure the floating point performance quantitatively. The full machine run on 180,000 CPU cores of Hazel Hen took place on 7th of May 2018. During the run, each core has provided an average performance of 8.2 GFLOPS, leading to a total of 1.476 PFLOPS of sustained performance. To our knowledge, this was the largest test run ever carried out with high order ADER-DG schemes for nonlinear hyperbolic systems of partial differential equations. For large runs with sustained petascale performance of ADER-DG schemes for linear hyperbolic PDE systems on unstructured tetrahedral meshes, see [102].

Conclusions
In this paper we have presented an efficient implementation of high order ADER-DG schemes on modern massively parallel supercomputers using the ExaHyPE engine. The key ingredients are the communication-avoiding and cache-blocking properties of ADER-DG, together with an efficient vectorization of the high level user functions that provide the evaluation of the physical fluxes F(Q), of the non-conservative products B(Q) · ∇Q and of the algebraic source terms S(Q). The engine is highly versatile and flexible and allows to solve a very broad spectrum of different hyperbolic PDE systems in a very efficient and highly scalable manner. In order to support this claim, we have provided a rather large set of different numerical examples solved with ADER-DG schemes. To show the excellent parallel scalability of the ADER-DG method, we have provided strong scaling results on 64 to 64,000 CPU cores including a detailed and quantitative comparison with RKDG schemes. We have furthermore shown strong scaling results of the vectorized ADER-DG implementation for the FO-CCZ4 formulation of the Einstein field equations using 720 to 180,000 CPU cores of the Hazel Hen supercomputer at the HLRS in Stuttgart, Germany, where a sustained performance of more than one petaflop has been reached.
Future research in ExaHyPE will concern an extension of the GPR model to full general relativity, able to describe nonlinear elastic and plastic solids as well as viscous and ideal fluids in one single governing PDE system. We furthermore plan an implementation of the FO-CCZ4 system [126] directly based on AVX intrinsics, in order to further improve the performance of the scheme and to reduce computational time. The final aim of our developments are the simulation of ongoing nonlinear dynamic rupture processes during earthquakes, as well as the inspiral and merger of binary neutron star systems and the associated generation of gravitational waves. Although both problems seem to be totally different and unrelated, it is indeed possible to write the mathematical formulation of both applications under the same form of a hyperbolic system given by (1) and thus to solve both problems within the same computer software.