HIJING++, a Heavy Ion Jet INteraction Generator for the High-luminosity Era of the LHC and Beyond

HIJING++ (Heavy Ion Jet INteraction Generator) is the successor of the widely used original HIJING, developed almost three decades ago. While the old versions (1.x and 2.x) were written in FORTRAN, HIJING++ was completely rewritten in C++. During the development we kept in mind the requirements of the high-energy heavy-ion community: the new Monte Carlo software has a well-designed modular framework, so future modifications are much easier to implement. It contains all the physical models that were present in its predecessor and, utilizing modern C++ features, it also includes native thread-based parallelism, an easy-to-use analysis interface and a modular plugin system, which makes room for possible future improvements. In this paper we summarize the results of our performance tests measured on two widely used architectures.


Introduction
During the approaching Long Shutdown 2 (LS2) of the Large Hadron Collider (LHC) in 2019-2020, many technical improvements will be made to the accelerator complex, the detectors and the data acquisition systems. These will result in a large increase in the expected number of collisions per second, and the amount of measured data per event will also grow rapidly. This period is the forerunner of the next generation of particle accelerators, such as the High-Luminosity LHC (HL-LHC) or the Future Circular Collider (FCC), where high-energy experimental data will be accumulated at a higher rate than ever. In parallel, the numerical tools also need to be improved in order to keep up with the requirements of the high-precision era.
The new HIJING++ heavy-ion Monte Carlo framework is written from scratch with a modular, efficient C++ structure and with built-in CPU-based parallelism in order to fulfill these requirements. Though the program flow is based on the original FORTRAN HIJING [1,2], the design is completely revised so that the main components of the program can work together efficiently. Such components are the most recent versions of PYTHIA8 [3] (used for the hard scattering processes and for the hadronization), LHAPDF6 [4] and the GNU Scientific Library [5,6]. HIJING++ is designed to be efficient in several respects, not only in terms of raw CPU performance. As an example, it is possible to replace any of the main components, such as the jet quenching and shadowing algorithms, in a convenient, well-defined way, without modifying the core code. Another built-in feature is the analysis interface mentioned above, the HijAnalysis framework, which makes it possible to define any kind of data-collecting object, such as ROOT TTrees, histograms or simple ASCII files, to collect all final-state particles event by event. Utilizing modern C++ features, the result of a run is a set of data structures that can be further processed in a convenient way.
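The combination of thread-based parallelism and per-run data-collecting objects described above can be illustrated with a minimal sketch. It is not the HIJING++ or HijAnalysis API: the names run_worker, run_parallel, N_BINS and B_MAX are hypothetical, the "event" is a toy impact-parameter sample, and Python threads are used only to show the pattern (each worker fills a private histogram, and the partial results are merged at the end), not to reproduce the speedup of native C++ threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_BINS = 10   # hypothetical number of histogram bins
B_MAX = 20.0  # hypothetical upper edge of the impact-parameter axis (fm)


def run_worker(seed, n_events):
    """Generate toy events and fill a private histogram (no locking needed)."""
    rng = random.Random(seed)  # independent, reproducible stream per worker
    hist = [0] * N_BINS
    for _ in range(n_events):
        # Toy stand-in for an event: sample b with probability density ~ b.
        b = B_MAX * rng.random() ** 0.5
        hist[min(int(b / B_MAX * N_BINS), N_BINS - 1)] += 1
    return hist


def run_parallel(n_threads, events_per_thread):
    """Run one worker per thread and merge the per-thread histograms."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial = list(pool.map(run_worker,
                                range(n_threads),
                                [events_per_thread] * n_threads))
    # Bin-by-bin sum of the partial histograms.
    return [sum(column) for column in zip(*partial)]


merged = run_parallel(n_threads=4, events_per_thread=1000)
print(merged)
```

Keeping one histogram per worker avoids synchronization inside the event loop; the only serial step is the final merge, which is cheap compared to event generation.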
In the following section we present the results of the performance tests of the pre-release version of HIJING++, taking advantage of these features.

Results
We have already presented preliminary physics and performance results in Refs. [8,9]. Here we summarize the benchmark tests measured on two different machines.

Benchmark setups
In order to measure the performance in a realistic use case, we calculated 6 different histograms to collect various quantities of the current run, such as the impact parameter, the number of binary collisions, the event multiplicity, and the $p_T$ spectra and pseudorapidity distributions of different identified hadrons with various binnings. We performed each run several times in order to reduce fluctuations. The main parameters of the different run setups are summarized in Table 1 [10,11]. The tests were made on two commonly used, typical architectures, whose parameters are listed in Table 2 [12]. These setups represent common use cases in the heavy-ion community: CPUs with lower TDP values (thermal design power; the higher the value, the larger the power consumption and performance) and their variants are widely used in recent laptops and ultrabooks, while CPUs with higher TDP are common in desktop computers, larger workstations and clusters.
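Repeating each run to reduce fluctuations can be sketched as follows. This is only an illustration under stated assumptions: fill_histogram is a hypothetical toy workload standing in for an event-generation run with histogram filling, and time_repeated is a hypothetical helper, neither of which is part of HIJING++.

```python
import random
import statistics
import time


def fill_histogram(n_samples, n_bins=50, seed=1):
    """Toy workload: histogram Gaussian samples that fall in [-5, 5)."""
    rng = random.Random(seed)
    hist = [0] * n_bins
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        if -5.0 <= x < 5.0:
            hist[int((x + 5.0) / 10.0 * n_bins)] += 1
    return hist


def time_repeated(task, repeats=5):
    """Run task() several times; return mean and sample std. dev. in seconds."""
    if repeats < 2:
        raise ValueError("need at least two repeats to estimate the spread")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


mean_s, std_s = time_repeated(lambda: fill_histogram(20000), repeats=5)
print(f"runtime: {mean_s:.4f} s +/- {std_s:.4f} s over 5 runs")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two thread counts is larger than the run-to-run fluctuation.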

Results
The results of the benchmarking runs for the two different CPUs are shown in Figure 1. As expected, the measured times show significant differences between the two systems. Using the CPU with the lower TDP value (upper panels), the total runtime decreases significantly with an increasing number of threads until $N_{\mathrm{thread}} = 4$; beyond that, the speedup gained from the additional threads is compensated by the fact that more CPU cores have to share the same power budget, which results in a decrease of the CPU frequency. In accordance with this, the initialization time increases slightly with increasing thread number. In contrast, the lower panels show the results achieved with the higher-performance desktop/server CPU, where the speedup is more significant at higher numbers of threads. In this case, the initialization time increases at a much lower rate. The reason is that this CPU does not have to reduce its performance when operating with multiple cores.
Figure 1: Results of the benchmarking runs for $\sqrt{s} = 2.76$ ATeV proton-proton (left panels), proton-lead (middle panels) and lead-lead collisions (right panels), using (nuclear) parton distribution functions CT14nlo (for protons) and EPPS16nlo_CT14nlo_Pb208 (for lead nuclei), defining 6 different histogram analysis objects.
By fitting the measured results with Amdahl's law [13] we can determine the maximum theoretical speedup, compared to the single-thread run, that can be achieved on the specific architecture: $S(N) = 1/\left[\alpha + (1-\alpha)/N\right]$, where $N$ is the number of threads and $\alpha$ is the non-parallelizable part of the code, so that the speedup is bounded by $1/\alpha$ for $N \to \infty$. According to the results summarized in Table 3, the scalability on the higher-performance CPU is better: the non-parallelizable parts (such as the thread managing system itself) account for a smaller share of the runtime, resulting in a lower $\alpha$ value. However, using 3-4 threads, HIJING++ also runs more efficiently on the low-TDP CPU, resulting in a considerably reduced runtime. In order to put the performance of HIJING++ into context, we measured and compared the (single-thread) runtime of PYTHIA8.2 and HIJING v2.552. We found that HIJING++ is ∼ 30% faster than PYTHIA8.2 and ∼ 50% slower than HIJING v2.552. This is not a surprising result, because the published FORTRAN HIJING was originally written with single-precision floating-point numbers. On the one hand, this can lead to significant numerical errors (especially at LHC energies) when performing calculations with frequently occurring small quantities like $m_q/\sqrt{s} \ll 1$, where $m_q$ is the mass of a given quark species and $\sqrt{s}$ is the center-of-mass energy. On the other hand, we measured the effect of converting the FORTRAN HIJING to double precision, and we found that in that case its runtime increases by a factor of 4.
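A fit of this kind can be sketched in a few lines, assuming the standard form of Amdahl's law, T(N)/T(1) = α + (1 - α)/N, with α the non-parallelizable fraction. The runtimes below are synthetic values generated from an assumed α = 0.1, not the measurements of this paper, and the function names are hypothetical. Because the relation is linear in α, a closed-form least-squares estimate is enough and no fitting library is needed.

```python
def amdahl_speedup(n_threads, alpha):
    """Speedup predicted by Amdahl's law for a serial fraction alpha."""
    return 1.0 / (alpha + (1.0 - alpha) / n_threads)


def fit_serial_fraction(n_threads, runtimes):
    """Least-squares estimate of alpha from runtimes T(N); T(1) must be first.

    With x = 1/N and y = T(N)/T(1), Amdahl's law reads
    y - x = alpha * (1 - x), a straight line through the origin.
    """
    if n_threads[0] != 1:
        raise ValueError("the first entry must be the single-thread run")
    t1 = runtimes[0]
    num = 0.0
    den = 0.0
    for n, t in zip(n_threads, runtimes):
        x = 1.0 / n
        y = t / t1
        num += (y - x) * (1.0 - x)
        den += (1.0 - x) ** 2
    if den == 0.0:
        raise ValueError("need at least one run with more than one thread")
    return num / den


# Synthetic runtimes (seconds) built from an assumed alpha = 0.1, T(1) = 100 s.
threads = [1, 2, 4, 8]
alpha_true = 0.1
times = [100.0 * (alpha_true + (1.0 - alpha_true) / n) for n in threads]

alpha_fit = fit_serial_fraction(threads, times)
max_speedup = 1.0 / alpha_fit  # limit of the speedup for N -> infinity
print(f"alpha = {alpha_fit:.3f}, maximum speedup = {max_speedup:.1f}")
```

With real measurements the points scatter around the model, and effects such as the frequency reduction seen on the low-TDP CPU are not described by Amdahl's law, so the fitted α should be read as an effective value.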

Summary and conclusions
We presented the results of the performance benchmarks of the new HIJING++ heavy-ion Monte Carlo event generator using different CPUs and collision systems. Utilizing the built-in CPU parallelization and analysis frameworks, HIJING++ provides a significant decrease in the necessary computation time, which is especially important on higher-performance architectures. Further optimizations are planned in future development to improve the scalability.