# Optimization of Finite-Differencing Kernels for Numerical Relativity Applications


## Abstract


## 1. Introduction

## 2. Method

In memory, the y- and z-neighbors of a grid point are located at `u[ijk + s*dj]` or `u[ijk + s*dk]`, where `dj` $={n}_{x}$ and `dk` $={n}_{x}{n}_{y}$. Thus, for the approximation of the derivatives in the y- and z-directions, strided memory access is unavoidable. Instead of using vector instructions to evaluate Equation (3) directly, we vectorize the code by grouping together points in the x-direction, i.e., the derivatives in the y-direction are computed as:
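As an illustration of this strategy, the sketch below (array layout and names are assumed, not the paper's actual kernel) computes the fourth-order ($S=2$) y-derivative so that the innermost loop runs over contiguous x-points and is therefore vectorizable despite the strided stencil:

```c
#include <stddef.h>

/* Sketch of an S = 2 (fourth-order) first derivative in the y-direction
 * on an nx*ny*nz grid stored x-fastest, so dj = nx is the y-stride.
 * Although the stencil reads u[ijk + s*dj] with stride dj, the inner
 * i-loop touches contiguous memory, which the compiler can vectorize.
 * Function and variable names are illustrative. */
void diff_y(const double *u, double *du,
            size_t nx, size_t ny, size_t nz, double inv_h)
{
    const size_t dj = nx;          /* stride between y-neighbors  */
    const double c1 = 2.0 / 3.0;   /* stencil coefficients c_s    */
    const double c2 = -1.0 / 12.0;
    for (size_t k = 0; k < nz; ++k)
        for (size_t j = 2; j < ny - 2; ++j)
            for (size_t i = 0; i < nx; ++i) {  /* unit-stride loop */
                const size_t ijk = i + nx * (j + ny * k);
                du[ijk] = inv_h * (c1 * (u[ijk + dj]     - u[ijk - dj])
                                 + c2 * (u[ijk + 2 * dj] - u[ijk - 2 * dj]));
            }
}
```

Applied to a field that is linear in y, this stencil returns the slope exactly, which is a quick sanity check on the coefficients.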

## 3. Experimental Setup

```
icc -O3 -xCORE-AVX2       # vectorization on BDW
icc -O3 -xCORE-AVX512     # vectorization on SKL
icc -O3 -xMIC-AVX512      # vectorization on KNL
icc -O3 -no-vec -no-simd  # no vectorization
```

The flag `-no-vec` disables compiler auto-vectorization, and `-no-simd` disables the `#pragma simd` statements.

## 4. Results

#### 4.1. Wave Equation

We used the `KMP_AFFINITY` environment variable to bind adjacent threads to adjacent cores on the same socket. On the KNL node, we used the Scatter mode, since hyper-threading was enabled (see Figure 3). On the SKL and BDW nodes, we used the Compact mode.

The work is distributed among threads with the `omp for` directive and the following clauses:

```
#pragma omp for collapse(1) schedule(static,1)
```
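A minimal sketch of how such a loop can be threaded with these clauses (function and variable names are hypothetical; the actual RHS kernel is more involved):

```c
#include <stddef.h>

/* Hypothetical example: distribute the outer k-loop of a grid update
 * across OpenMP threads with the clauses quoted above.  collapse(1)
 * parallelizes only the k-loop, and schedule(static,1) hands out
 * consecutive k-slabs to consecutive threads round-robin. */
void scale_field(double *u, int nx, int ny, int nz, double a)
{
#pragma omp parallel
    {
#pragma omp for collapse(1) schedule(static, 1)
        for (int k = 0; k < nz; ++k)
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                    u[i + (size_t)nx * (j + (size_t)ny * k)] *= a;
    }
}
```

Compiled without OpenMP support the pragmas are ignored and the loops run serially, so the result is identical either way.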

#### 4.2. Linearized Einstein Equations

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
GFLOPS | Billion FLoating Point Operations Per Second |
BDW | Intel Broadwell architecture |
CHA | Caching/Home Agent |
KNL | Intel Knights Landing architecture |
HBM | High-Bandwidth Memory |
HPC | High-Performance Computing |
MCDRAM | Multi-Channel Dynamic Random-Access Memory |
MPI | Message Passing Interface |
NR | Numerical Relativity |
OpenMP | Open Multiprocessing |
RHS | Right-hand side |
SIMD | Single Instruction Multiple Data |
SKL | Intel Skylake architecture |

## Appendix A. Results with GNU Compiler

```
gcc -O3 -mavx512f -mavx512cd -mavx512er -mavx512pf  # vectorization on KNL
gcc -fno-tree-vectorize                             # no vectorization
```

```
OMP_PROC_BIND=true
OMP_PLACES=cores
```

**Figure A1.** **Left** panel: Comparative performance of vectorization and OpenMP parallelization between GNU and Intel compilers on KNL nodes. The speedup is the ratio between non-vectorized and vectorized execution time. **Right** panel: Strong scaling with multiple threads using single- and double-precision floating-point numbers. Dashed lines, corresponding to non-vectorized runs, overlap. In the case of vectorized runs, the speedup with single-precision floating-point numbers is about 30% better than with double-precision ones.

## Appendix B. Results with Float Data Types


**Figure 1.** Single-core performance (**left** panel) and speedup (**right** panel) for variable block-sizes and different node architectures, in the case of the wave equation with stencil size $S=2$. The tests have been executed with vectorization enabled (solid lines) or disabled (dashed lines). The best performance is obtained for the vectorized kernel on SKL nodes.

**Figure 2.** Multi-thread performance (**left** panel) and speedup (**right** panel) for block-size $n=128$ as a function of the number of threads and for different node architectures, in the case of the wave equation with stencil size $S=2$. Solid and dashed lines refer to enabled and disabled vectorization, respectively. The KNL node demonstrates better speedup, overall performance, and scalability, especially in the case of large block-sizes.

**Figure 3.** Effect of pinning threads to KNL cores. The affinity is controlled by the `KMP_AFFINITY` environment variable. Cyan lines refer to `KMP_AFFINITY=none`, orange lines to `KMP_AFFINITY=scatter`.

**Figure 4.** Single-core performance on the KNL architecture, as a function of the block-size (**left** panels), and strong scaling with OpenMP, as a function of the number of threads (**right** panels), for the linearized Einstein equations with different stencil sizes (**top**: $S=2$, **middle**: $S=3$, **bottom**: $S=4$). We find nearly ideal vector speedup and scaling for large block-sizes. However, the code's performance appears inconsistent when using all 68 physical cores on the node, possibly because of the effect of system interrupts.

**Figure 5.** Vectorization speedup for the linearized Einstein equations for $S=2$, $S=3$, and $S=4$ stencil sizes on the KNL architecture. **Left** panel: single-core vector speedup. **Right** panel: vector speedup for increasing thread count. Good vector efficiency is achieved for large block-sizes, even though the speedup due to vectorization shows an unclear trend with $S$. The results when using 68 threads might be affected by system interrupts.

**Table 1.** Coefficients of 1D finite-differencing stencils for the evaluation of first (${c}_{s}$, top) and second (${d}_{s}$, bottom) derivatives according to Equation (3), up to a stencil size $S=4$. Stencils are symmetric with respect to $s=0$.

S | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
${c}_{s}$ for $s\le S$ | | | | | |
2 | 0 | 2/3 | −1/12 | | |
3 | 0 | 3/4 | −3/20 | 1/60 | |
4 | 0 | 4/5 | −1/5 | 4/105 | −1/280 |
${d}_{s}$ for $s\le S$ | | | | | |
2 | −5/2 | 4/3 | −1/12 | | |
3 | −49/18 | 3/2 | −3/20 | 1/90 | |
4 | −205/72 | 8/5 | −1/5 | 8/315 | −1/560 |
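The tabulated coefficients can be checked programmatically. The sketch below (names are illustrative) applies the symmetric first-derivative stencil of Equation (3) in 1D; by construction it recovers the exact slope of a linear function for every $S$:

```c
/* First-derivative coefficient tables c_s from Table 1 (index s = 0..S). */
static const double c_s2[] = { 0.0, 2.0 / 3.0, -1.0 / 12.0 };
static const double c_s3[] = { 0.0, 3.0 / 4.0, -3.0 / 20.0, 1.0 / 60.0 };
static const double c_s4[] = { 0.0, 4.0 / 5.0, -1.0 / 5.0, 4.0 / 105.0,
                               -1.0 / 280.0 };

/* First derivative at point i of a 1D array with grid spacing h:
 * du/dx ~ (1/h) * sum_{s=1..S} c_s * (u[i+s] - u[i-s]). */
double first_deriv_1d(const double *u, int i, int S, const double *c, double h)
{
    double acc = 0.0;
    for (int s = 1; s <= S; ++s)
        acc += c[s] * (u[i + s] - u[i - s]);
    return acc / h;
}
```

For $u(x)=x$ the sum $2\sum_{s} s\,{c}_{s}$ must equal 1 for every stencil size, which is a quick consistency check on the table.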

**Table 2.** Main characteristics of the compute nodes used in the tests.

Node Type | Intel Xeon | Frequency | Cores | HT | Core/Node Perf | L1/L2 Cache | L3 Cache |
---|---|---|---|---|---|---|---|
BDW | E5-2697 v4 | 2.3 GHz | 2 × 18 | off | 36/1300 GFLOPS | 576 KB/4.5 MB | 45 MB (Smart Cache) |
KNL | Phi 7250 | 1.4 GHz | 1 × 68 | on | 44/3000 GFLOPS | 32 KB/1 MB (per tile) | 16 GB (MCDRAM) |
SKL | 8160 | 2.1 GHz | 2 × 24 | off | 67/3200 GFLOPS | 768 KB/1 MB | 33 MB |

**Table 3.** Vectorization speedup on single cores for the wave equation with stencil size $S=2$. The block-size is $n=128$. The table shows the Intel compiler report information (obtained with the `-qopt-report` option) and the measured speedup (ratio between non-vectorized and vectorized execution time). The measured speedup differs by about a factor of 2 or more from the potential speedup.

Operation | BDW Potential | BDW Measured | KNL Potential | KNL Measured | SKL Potential | SKL Measured |
---|---|---|---|---|---|---|
Derivative | 5.03 | 1.7 | 6.58 | 3.7 | 5.73 | 2.22 |
Contraction | 5.61 | 1.8 | 7.77 | 4 | 5.61 | 2.27 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Alfieri, R.; Bernuzzi, S.; Perego, A.; Radice, D.
Optimization of Finite-Differencing Kernels for Numerical Relativity Applications. *J. Low Power Electron. Appl.* **2018**, *8*, 15.
https://doi.org/10.3390/jlpea8020015
