1. Introduction
Supernovae are the main source of the “elements of life”. In particular, supernovae are the birthplaces of such elements as carbon, oxygen, and iron, which are the basis of all life forms. The great energy output of supernovae allows them to spread heavy elements over great distances, which contributes to the synthesis of complex chemical compounds, up to organic ones [1,2].
Together with the various mechanisms of supernova explosions and their material compositions [3,4,5,6,7,8,9,10,11,12,13], it remains important to study the hydrodynamics of supernova explosions. Naturally, because of the disparity of scales between the star, the supernova remnant, and the core, simulation of such an explosion requires the most powerful supercomputers. Analysis of the top 10 supercomputers in the world shows the prevalence of hybrid architectures, primarily equipped with graphics accelerators and vector processors.
Developing codes for such architectures is a difficult problem. It requires not only the application of appropriate technologies but also a special structure of the numerical method and mathematical models. Several codes adapted for graphics processing units (GPUs) or the Intel Xeon Phi accelerators have been developed. We review the most interesting solutions.
The WOMBAT code [14] uses a combination of the Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and high-level Single Instruction Multiple Data (SIMD) vectorization technologies. The WOMBAT code is remarkable for its orientation toward the target supercomputer architecture on which it will run. The code utilizes one-sided MPI communications, which allow one to efficiently implement the exchange of subdomain overlaps. The specialized Cray MPI is used to optimize the interprocess communications. It appears that the WOMBAT code embodies the requirements for a modern program implementation of astrophysical codes. The previously developed codes will be used for a long time, but the future belongs to codes oriented toward specific classes of supercomputer architectures.
The CHOLLA code [15,16,17] was implemented to conduct numerical experiments using GPUs. It is based on the Corner Transport Upwind method, the essence of which is the extension of the upwind scheme to the multidimensional case [18,19]. For storing the computational grid on a GPU, a mesh structure containing all the hydrodynamical parameters is used. It is stated that such data locality allows a more efficient use of the global memory of the graphics processor. The time step computation is performed on the graphics processors using the NVIDIA Compute Unified Device Architecture (CUDA) extensions.
The GAMER code [20] solves the gas dynamics equations applying the Adaptive Mesh Refinement (AMR) approach on graphics accelerators. For solving the gas dynamics equations, the Total Variation Diminishing method is used. For solving the Poisson equation, a combination of the fast Fourier transform method and successive over-relaxation is used. The main feature of this software is the implementation of the AMR approach on GPUs. A regular grid structure maps naturally onto the GPU architecture, while a tree structure requires a special approach. This approach is based on the use of octets to define the grid, with each octet mapped to a separate thread on the GPU. The main problem is in forming the ghost (buffer) cells for the octets; note that solving this problem takes around 63% of the time. However, this procedure can be implemented independently for each octet. The GAMER code was further optimized [21] and extended to solving the magnetohydrodynamic equations [22].
Approaches such as those described above were implemented in the FARGO3D [
23] and the SMAUG [
24] codes. The authors of the present work also implemented the GPUPEGAS code [
6] for supercomputers with graphics processors and the AstroPhi code for the Intel Xeon Phi in the offload mode [
25] and in the native mode using low-level vectorization [
26].
The necessity of more efficient utilization of graphics and vector processors and accelerators will result in the creation of new codes. One such code is presented in this work. We develop the OMPEGAS code for numerically solving the special relativistic hydrodynamics equations. The numerical algorithm is based on Godunov’s method and the piecewise parabolic method on the local stencil [27,28]. This technique has been successfully applied to a number of astrophysical problems [29,30]. Its main advantages are a high order of accuracy and high efficiency of calculations. The code was developed for shared-memory multicore systems with vectorization capabilities. This makes it possible to perform experiments on a wide range of computers, from laptops to nodes of cluster supercomputers. Additionally, in the future, a hybrid implementation is possible for massively parallel systems.
In Section 2, we provide the hydrodynamical model of the astrophysical flows and describe the numerical method for solving the hydrodynamical equations. Section 3 is devoted to detailing the architecture of the OMPEGAS code and to the study of its performance. Naturally, the code development raised several issues for discussion, which are considered in Section 4. Section 5 is devoted to the numerical experiments. Section 6 concludes the paper.
2. Numerical Model
To describe the gas dynamics, we use the numerical model presented in work [31]. The physical (primitive) variables are the density $\rho$, the velocity vector $\mathbf{v}$, and the pressure $p$. When describing the evolution of gas in the model of special relativistic hydrodynamics, we introduce the concept of the specific enthalpy $h$, which is determined by the formula
$$h = 1 + \frac{\gamma}{\gamma - 1}\,\frac{p}{\rho},$$
where $\gamma$ is the adiabatic exponent. The relativistic speed of sound $c_s$ is determined by the formula
$$c_s^2 = \frac{\gamma p}{\rho h}.$$
Let us choose the speed of light $c = 1$ as the dimensionless unit speed. In this case, the Lorentz factor $\Gamma$ will be determined by the formula
$$\Gamma = \frac{1}{\sqrt{1 - |\mathbf{v}|^2}}.$$
Thus, the speed modulus is limited by one. Let us introduce the conservative variables: the relativistic density $D = \Gamma\rho$ and the relativistic momentum $M_j = \rho h \Gamma^2 v_j$, where $v_j$ are the components of the velocity vector $\mathbf{v}$ for $j = 1, 2, 3$, and the total relativistic energy $E = \rho h \Gamma^2 - p$.
In the absence of relativistic velocities, the Lorentz factor is taken to be unity. Then, the speed of sound of the ideal gas is calculated by the formula
$$c_s^2 = \frac{\gamma p}{\rho}.$$
The conservative hydrodynamic quantities take the form: $D = \rho$ is the density, $M_j = \rho v_j$ is the momentum, and $E = \frac{\rho |\mathbf{v}|^2}{2} + \frac{p}{\gamma - 1}$ is the total energy.
The equations of the Newtonian and relativistic hydrodynamics in the form of the conservation laws are written in the same form
$$\frac{\partial D}{\partial t} + \frac{\partial (D v_k)}{\partial x_k} = 0, \qquad \frac{\partial M_j}{\partial t} + \frac{\partial (M_j v_k + p\,\delta_{jk})}{\partial x_k} = 0, \qquad \frac{\partial E}{\partial t} + \frac{\partial ((E + p) v_k)}{\partial x_k} = 0,$$
where $\delta_{jk}$ is the Kronecker symbol. It is exactly this common general form of the hyperbolic equations that is important to us, since we construct a new modification of the numerical scheme for it.
For complex relativistic objects, such as jets, supernovae, and collapsing stars, the correct simulation of shock waves is of great importance [32,33,34,35,36,37]. Usually, to solve the equations, Roe-type, Rusanov-type, or Harten–Lax–van Leer schemes are used [38,39,40]. Recently, a unified approach for such problems was proposed [41]. In our code, we use the piecewise parabolic method on a local stencil [29] and a combination of the Roe scheme [31] and the Rusanov scheme [42].
3. Parallel Implementation
In this section, we describe the developed OMPEGAS code for the hydrodynamics simulation. In our implementation, we use OpenMP technology, the Intel SIMD Data Layout Template (SDLT) library, and the auto-vectorization directives of the Intel C++ Compiler Classic.
3.1. Domain Decomposition
In our implementation, we use the following domain decomposition approach.
The uniform grid in Cartesian coordinates for solving the hydrodynamical equations allows one to use an arbitrary Cartesian topology for the domain decomposition. This structure of the computation provides great potential for scalability. The OMPEGAS code uses a multilevel multidimensional decomposition of the domain: along one coordinate, the inner fine-grained decomposition is performed by vectorization, while the decomposition along the two other coordinates is performed by two collapsed OpenMP loops.
Note that our method also allows one to perform additional decomposition by one of the coordinates using the MPI with one layer for subdomains overlapping.
Figure 1 shows the geometric domain decomposition implemented by the OpenMP and vectorization.
3.2. OpenMP + SIMD Implementation
The following algorithm for computing the single time step is used (with the corresponding function names used in the code):
Find a minimum time step between all spatial cells (computational_tau()).
Construct the local parabolic approximations of the primitive variables for each cell, taking the boundary conditions into account (build_parabola_sdlt()).
Solve the Riemann problem for each cell (eulerian_stage()).
Process the boundary conditions for the obtained conservative variables.
Recover the primitive variables for each cell.
Process the boundary conditions for the primitive variables (boundary_mesh()).
3.3. Data Structures
To optimize the data storage for the parabolic coefficients, we used the Intel SDLT library [43]. It allows one to achieve the performance of vectorized C++ code by automatically converting an Array of Structures to a Structure of Arrays, aligning the data to SIMD words and cache lines, and providing n-dimensional containers. The code is shown in Listing 1.
Listing 1.
Declaration of data structures for the parabolic approximations.
#include <sdlt/sdlt.h>

// type of parabola
struct parabola
{
    // parameters of the parabola
    real ulr, ull, dq, dq6, ddq;
    // right integral of the parabola
    inline real right_parabola(real ksi, real h) const;
    // left integral of the parabola
    inline real left_parabola(real ksi, real h) const;
    // build the local parabola
    inline void local_parabola(real uvalue);
    // reconstruct the parabola
    inline void reconstruct_parabola(real uvalue);
};

// type of mesh cell
struct mesh
{
    // value on the mesh
    real u;
    // parabolas along the three coordinates
    parabola parx, pary, parz;
};

// invoking SDLT macros for the structures
SDLT_PRIMITIVE(parabola, ulr, ull, dq, dq6, ddq)
SDLT_PRIMITIVE(mesh, u, parx, pary, parz)

// defining the 3D container type
typedef sdlt::n_container<mesh, sdlt::layout::soa<1>,
        sdlt::n_extent_t<int, int, int>> ContainerMesh3D;
|
3.4. Auto Vectorization
To study the code for vectorization, we used the Intel Advisor tool [44]. The initial analysis showed that automatic vectorization was prevented by assumed dependencies and by function calls. To avoid this, we used the ‘#pragma ivdep’ directive for loops and ‘#pragma forceinline recursive’ for functions. This allowed the compiler to vectorize the relevant loops automatically.
The next recommendation from the Advisor was to optimize the arithmetic operations. The costliest procedure, the Riemann solver, performed 750 divisions in the loop body. After using the compiler option ‘/Qprec-div-’, this number was reduced to approximately 300. Figure 2 and Figure 3 show the results of the Advisor survey before and after applying this option. Note that the vectorization efficiency of the Riemann solver loop (last row) rose from 2.5 times to a nearly perfect fourfold.
The procedures for constructing the parabolic approximations [
29] used a lot of branching. Thus, their vectorization potential was limited.
3.5. OpenMP Parallelization
Parallelization was performed by distributing the workload between the OpenMP threads.
Listing 2 shows the typical usage of the ‘#pragma omp for’ directive for the Riemann solver loop. The two outer loops were collapsed and parallelized using OpenMP, and the inner loop was vectorized. The scheduling was set to ‘runtime’, which allowed us to tune the program without recompiling.
Listing 2.
Riemann solver loop.
#pragma omp for schedule(runtime) collapse(2)
for (i = 1; i < NX - 1; i++)
    for (k = 1; k < NY - 1; k++)
#pragma ivdep
        for (l = 1; l < NZ - 1; l++)
        {
            // Riemann solver
#pragma forceinline recursive
            SRHD_Lamberts_Riemann (
            …
3.6. Code Research
The numerical experiments were performed on a 16-core Intel Core i9-12900K processor with AVX/AVX2 support and DDR5-6000 memory. A mesh of size … was used for the evaluation. The computing times presented below exclude the time spent on initialization, finalization, and file operations.
3.6.1. Performance of Single Thread Using Vectorization
Table 1 shows the computing time of the single-threaded program with various compilation settings. To disable auto-vectorization, we used the /Qvec- option; to use precise or optimized divisions, we used /Qprec-div or /Qprec-div-, respectively.
3.6.2. Threading Performance
For studying the threading performance, we used the speedup coefficient $S_m = T_1 / T_m$, where $T_m$ is the calculation time on $m$ threads for the same problem.
Table 2 shows the results of solving the problem with the … mesh on various numbers of OpenMP threads. The table contains the computing time of the program and the speedup for various numbers of OpenMP threads. It also shows the memory bandwidth utilization and the performance in FLOPs, as well as the maximum values obtained via the Intel Advisor. Note that these two values were affected by the profiling overhead.
5. Model Problem: Hydrodynamical Simulation of Star Explosion
Let us consider the model problem of a hydrodynamical explosion. At the center of the star (defined by the equilibrium density profile), we induce a point blast with a given density dispersion of 2%. Further, all equations and results of the simulation are presented in dimensionless form.
As the initial data for the profile of the star, let us take those of a hydrostatic equilibrium configuration. In spherical coordinates with the gravitational constant $G$, the data satisfy
$$\frac{dp}{dr} = -\rho\,\frac{d\Phi}{dr}, \qquad \frac{1}{r^{2}}\frac{d}{dr}\!\left(r^{2}\frac{d\Phi}{dr}\right) = 4\pi G \rho.$$
The initial density distribution is chosen as follows: …
Thus, the initial distributions of the pressure and of the gravitational potential have the form …
We shall use this profile to define the initial configuration of the star.
Figure 5 shows the results of the simulation and the density profiles at various instants.
Despite the random perturbations introduced into the star, the physics was dominated by the point-blast process, similar to the Sedov problem.