Abstract
The latest published version of GRASP (General-purpose Relativistic Atomic Structure Package), i.e., GRASP2018, retains a few suboptimal subroutines/algorithms, which reflect the limited memory and file storage of computers available in the 1980s. Here we show how the efficiency of the relativistic self-consistent-field (SCF) procedure of the multiconfiguration-Dirac–Hartree–Fock (MCDHF) method and the relativistic configuration-interaction (RCI) calculations can be improved significantly. Compared with the original GRASP codes, the present modified version reduces the CPU times by factors of a few tens or more. The MPI performances for all the original and modified codes are carefully analyzed. Except for diagonalization, all computational processes show good MPI scaling.
1. Introduction
Atomic energy levels, oscillator strengths, transition probabilities and energies are essential parameters for abundance analysis and diagnostics in astrophysics and plasma physics. In the past decade, the atomic spectroscopy group of Fudan University carried out two projects to calculate transition characteristics with high accuracy in collaboration with other groups. One project focused on ions of low and medium nuclear charge Z, which are generally of astrophysical interest, and the other on tungsten ions (Z = 74), which are relevant to magnetic-confinement-fusion research. Employing the multiconfiguration Dirac–Hartree–Fock (MCDHF) approach [1,2,3,4,5], implemented within the GRASP2K package [6], and/or the relativistic many-body perturbation theory (RMBPT) [7], implemented within the FAC package [8,9,10], we performed a series of systematic and large-scale calculations of radiative atomic data for ions of low and medium Z belonging to the He I [11], Be I-Ne I [12,13,14,15,16,17,18,19,20,21,22], and Si I-Cl I [23,24,25,26,27] isoelectronic sequences, and for the highly charged isonuclear sequence of tungsten ions [28,29,30,31,32,33]. A large amount of atomic data, including level energies, transition wavelengths, line strengths, oscillator strengths, transition probabilities and lifetimes, was obtained. The uncertainties were comprehensively assessed by cross-validation between the MCDHF and RMBPT results and by detailed comparisons with observations. These comparisons showed that spectroscopic accuracy was achieved for the computed excitation and transition energies in most of the ions concerned, owing to the fact that electron correlation was treated at a high level of approximation by using very large expansions of configuration state functions (CSFs) built on extended sets of one-electron orbitals. To make these large-scale calculations feasible and tractable, many efforts were devoted to improving the performance and stability of the codes used. Here, we describe some improvements made in the last two years for the rmcdhf and rci programs, which have not yet been included in the latest published version of GRASP, i.e., GRASP2018 [34].
The GRASP2018 package is an updated Fortran 95 version of recommended programs from GRASP2K Version 1_1 [6], providing improvements in accuracy and efficiency in addition to the translation from Fortran 77 to Fortran 95 [34]. However, it has retained some original subroutines/algorithms that reflect the limited memory and file storage capacities of computers in the 1980s, when the first versions of GRASP were released [2]. For example, the spin-angular coefficients, which are used to build the Hamiltonian matrix and the potentials, are stored on disk in unformatted files. During the iterations of the self-consistent-field (SCF) calculations, which aim to optimize the one-electron radial functions, the spin-angular coefficients are read from disk again and again. Calculations using expansions of hundreds of thousands of CSFs are therefore very time-consuming, as the disk files easily exceed 10 GB. This kind of inefficiency, which was considered a major bottleneck of the GRASP package for a long time, was removed very recently by one of the authors (GG) through two programs, rmcdhf_mem and rmcdhf_mem_mpi, which have been uploaded to the GRASP repository [35]. The new feature of these two programs is that the spin-angular coefficients, once read from the disk files, are kept in memory in arrays. In the present work, we show that these codes can be further improved by redesigning the procedure used to obtain the direct and exchange potentials and the Lagrange multipliers, which are used to update the radial orbitals (large and small components) during the SCF procedure.
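The gain from this memory-resident strategy comes simply from replacing repeated unformatted file reads by array look-ups. The following minimal sketch only illustrates the idea; the file name, record layout and array names are invented for illustration and do not reproduce the actual mcp.XXX structure or the GRASP routines.

```fortran
! Illustrative sketch: read the spin-angular data once into allocatable
! arrays, so that the SCF iterations fetch them from memory instead of
! re-reading the unformatted disk file.  All names are hypothetical.
program load_coefficients_once
   implicit none
   integer, parameter :: iunit = 20
   integer :: ios, ncoef, i
   integer,          allocatable :: label(:)   ! packed orbital labels
   double precision, allocatable :: coeff(:)   ! spin-angular coefficients

   open (unit=iunit, file='coefficients.dat', form='unformatted', &
         status='old', action='read', iostat=ios)
   if (ios /= 0) stop 'cannot open coefficient file'
   read (iunit) ncoef                          ! number of stored coefficients
   allocate (label(ncoef), coeff(ncoef))
   do i = 1, ncoef
      read (iunit) label(i), coeff(i)          ! each record is read only once
   end do
   close (iunit)
   ! ... the SCF iterations now use label(:) and coeff(:) directly ...
end program load_coefficients_once
```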
Once the radial functions have been determined by an MCDHF calculation based on the Dirac–Coulomb Hamiltonian, subsequent relativistic configuration-interaction (RCI) calculations are often performed to include the transverse photon interaction (which reduces to the Breit interaction in the low-frequency limit) and the leading quantum electrodynamics (QED) corrections. At this stage, the CSF expansions are usually considerably enlarged to capture additional electron correlation effects. For example, our recent MCDHF calculations on C-like ions [16] were performed using an expansion of about two million CSFs, generated by single and double (SD) excitations from the outer subshells of the multi-reference (MR) configurations, taking only the valence–valence correlation into account. The subsequent RCI calculations were based on approximately 20 million CSFs to adequately account for the additional core-valence (CV) electron correlation effects.
MCDHF and RCI calculations using large CSF expansions require substantial computing resources. Firstly, the construction of the Hamiltonian matrix is very time-consuming. The spin-angular integration of the Hamiltonian between pairs of CSFs has to be performed N(N+1)/2 times, where N is the order of the interaction matrix, i.e., the size of the CSF expansion for the block of given J and parity. Fortunately, we recently implemented a computational methodology based on configuration state function generators (CSFGs) that relaxes the above scaling. Instead of having to perform the spin-angular integration for each of the elements in the Hamiltonian matrix, the use of generators makes it possible to restrict the integration to a limited number of cases and then directly infer the spin-angular coefficients for all matrix elements between groups of CSFs spanned by the generators, taking advantage of the fact that spin-angular expressions are independent of the principal quantum number [36]. Secondly, the time for solving the eigenvalue problem in MCDHF and RCI may also be significant, especially if many eigenpairs are required, as is normally the case in spectrum calculations for complex systems [16,23,24,25,26,27].
The present paper, which reports on improvements both for MCDHF and RCI, is organized as follows:
- In Section 2, we show how the diagonalization procedure in MCDHF and RCI calculations can be improved by further parallelization.
- In Section 3, we discuss the improvements in the MCDHF program resulting from the new management of spin-angular coefficients in memory and from the redesign of the procedures for calculating the potentials and Lagrange multipliers. Results are reported from a number of performance tests.
- In Section 4, we study the improvements in RCI performances thanks to the use of CSFGs. We also investigate the time ratios for constructing and diagonalizing the Hamiltonian matrix to determine the desired eigenpairs.
- Finally, in Section 5, we summarize the results of the performance tests and identify the remaining bottlenecks. This is followed by a discussion on how the latter could be circumvented in future developments.
2. Additional Parallelization for the DVDRC Library of GRASP
In the DVDRC library of GRASP, the Davidson algorithm [37], as implemented in [38], is used to extract the eigenpairs of interest from the interaction matrix. Assuming that the K lowest eigenpairs are required of a large, sparse, real and symmetric matrix of order N, the original Davidson algorithm can be described as shown in Algorithm 1, in which the upper limit of P, the order of the expanding basis, is set by an input variable in GRASP2018 [34]. The matrix-vector multiplication (6), which is the most time-consuming step, has already been parallelized in GRASP using the message passing interface (MPI) by calling one of the three subroutines named DNICMV, SPODMV, and SPICMVMPI, depending on whether the interaction matrix is sparse or dense and whether it is stored in memory or on disk.
Algorithm 1: Davidson algorithm.
It should be pointed out that the subroutines of the library in GRASP2018 [34] performing the remaining calculations, i.e., all steps except step (6) of the Davidson algorithm, are serial. Step (1), solving the small symmetric eigenvalue problem of order P, which is generally smaller than 500, is very fast as it calls the DSPEVX routine from the LAPACK library. However, in steps (3)–(5) and (7)–(8), the matrix-vector and matrix–matrix multiplications and the inner products involve vectors of size N, such as the column vectors of the expanding basis and the corresponding matrix-vector products. In MCDHF and RCI calculations, when N, the size of the CSF expansion of a given J and parity, is large and, at the same time, dozens of eigenpairs or more are searched, steps (3)–(5) and (7)–(8) can be as time-consuming as step (6). Hence, we have parallelized all the possibly time-consuming routines of the DVDRC library involved in these steps by using MPI, such as MULTBC, NRM_MGS, NEWVEC, ADDS, etc. We show in Section 3 and Section 4 that the CPU time for diagonalization can thereby be reduced by a factor of about three in relatively large-scale calculations.
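The flavor of this additional parallelization can be illustrated with a minimal sketch of a distributed inner product over vectors of length N, as needed, e.g., when orthonormalizing a new basis vector. It is only a schematic example; the actual DVDRC routines (MULTBC, NRM_MGS, NEWVEC, ADDS) are organized differently, each MPI process holding a slice of every size-N vector.

```fortran
! Sketch of an MPI-distributed inner product over size-N vectors.
! Each process works on its local slice of length nloc; the partial sums
! are combined with MPI_ALLREDUCE so that every process gets the result.
! Names and the slicing scheme are illustrative only.
subroutine ddot_mpi(nloc, x, y, global_dot)
   use mpi
   implicit none
   integer,          intent(in)  :: nloc
   double precision, intent(in)  :: x(nloc), y(nloc)
   double precision, intent(out) :: global_dot
   double precision :: local_dot
   integer :: i, ierr

   local_dot = 0.0d0
   do i = 1, nloc
      local_dot = local_dot + x(i)*y(i)
   end do
   call MPI_ALLREDUCE(local_dot, global_dot, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)
end subroutine ddot_mpi
```

A (modified) Gram–Schmidt step can then subtract global_dot times the stored basis vector from the candidate vector, each process updating only its own slice, so that no size-N vector ever needs to be gathered on a single process.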
3. Improvements for MCDHF
3.1. Outline of the MCDHF Method
The theory of the MCDHF method has been comprehensively described in the literature; see, for example, [1,2,3,4,5]. Here it is only outlined, to the extent needed to explain the modifications of the original GRASP2018 codes. Atomic units are used throughout, unless other units are given explicitly.
In MCDHF calculations with GRASP, only the Dirac–Coulomb Hamiltonian ($H_{\mathrm{DC}}$) is taken into account. The Dirac one-electron orbital $a$ is given by

\[
\phi_{a m}(\mathbf{r}) = \frac{1}{r}
\begin{pmatrix} P_a(r)\,\Omega_{\kappa_a m}(\hat{\mathbf{r}}) \\ i\,Q_a(r)\,\Omega_{-\kappa_a m}(\hat{\mathbf{r}}) \end{pmatrix},
\qquad (1)
\]

in which $P_a(r)$ and $Q_a(r)$ are the large and small radial functions, and $\Omega_{\kappa m}$ is the usual spherical spinor, i.e., the spin-angular function, labeled by the relativistic angular quantum number $\kappa$ and the magnetic quantum number $m$. For a state of given total angular momentum $J$, total magnetic quantum number $M_J$, and parity $P$, the atomic state function (ASF) is formed by a linear combination of CSFs,

\[
\Psi(\Gamma P J M_J) = \sum_{r=1}^{N_{\mathrm{CSF}}} c_r\,\Phi(\gamma_r P J M_J),
\qquad (2)
\]

where $N_{\mathrm{CSF}}$ is the number of CSFs used in the expansion. Each CSF, $\Phi(\gamma_r P J M_J)$, is constructed from the four-component spinor orbital functions (1). The label $\gamma_r$ contains all the needed information on its structure, i.e., the constituent subshells with their symmetry labels and the way their angular momenta are coupled to each other in $jj$-coupling. The level energy $E$ and the vector of expansion coefficients $\mathbf{c}$ are obtained from the secular equation

\[
H\,\mathbf{c} = E\,\mathbf{c},
\qquad (3)
\]
with $H_{rs}$ denoting the Hamiltonian matrix in the CSF basis, where the reduced matrix element (RME) is defined following Edmonds' formulation of the Wigner–Eckart theorem [39]. This RME can be developed in terms of RMEs in the CSF basis, which are generally expressed as combinations of radial integrals and spin-angular coefficients. The radial integrals $I(ab)$ and $R^k(abcd)$ are, respectively, the relativistic kinetic-energy and Slater integrals, the accompanying factors are the corresponding spin-angular coefficients, and $k$ is the tensor rank.
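For the reader's convenience, the standard GRASP form of these CSF-basis matrix elements is recalled below; the notation follows [1,2], and the exact symbols used in Equations (4) and (5) may differ slightly:

\[
H_{rs} \;=\; \sum_{ab} t_{rs}(ab)\, I(ab)
\;+\; \sum_{abcd}\sum_{k} v^{k}_{rs}(abcd)\, R^{k}(abcd),
\]

where the sums run over the orbitals occupied in the CSFs $r$ and $s$.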
The radial functions of the orbitals are not known a priori and have to be determined numerically on a grid. The stationary condition with respect to variations in the radial functions leads to the MCDHF integro-differential equations (6) for each orbital $a$ [1,2,3,4],
in which the Lagrange multipliers $\epsilon_{ab}$ ensure that orbitals of the same $\kappa$ symmetry form an orthonormal set. The direct potential $Y_a(r)$ arising from the two-body interactions, summed over the allowed tensor ranks $k$, is given by Equation (7), with $Y^k(bd;r)$ being the relativistic one-dimensional radial integrals [1,2,3,4]; the $y$-type factors defined by Equations (8) and (9) are spin-angular coefficients. The exchange potentials $X_a(r)$ entering Equation (6) are given by Equation (10), with the $x$-type factors of Equations (11) and (12) also being spin-angular coefficients. The coefficients $d_{rs}$ entering these expressions are the generalized weights, built from the weights $W_i$ attributed to the targeted levels and from the corresponding mixing coefficients. In the extended optimal level (EOL) calculation of GRASP, the MCDHF optimization procedure ensures that the average energy weighted by the $W_i$ is stationary with respect to small changes of the orbitals and of the expansion mixing coefficients. In all of the above equations, $\bar{q}_a$ is the generalized occupation number of orbital $a$, obtained as the weighted sum, over the CSFs, of the occupation numbers $q_r(a)$ of orbital $a$ in CSF $r$. The resulting direct and exchange potentials are also used to determine the Lagrange multipliers [1].
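For reference, the EOL weighting scheme and the generalized occupation numbers can be summarized as follows; these are the standard definitions [1,4], written with illustrative symbols that may differ from those of the corresponding equations:

\[
d_{rs} \;=\; \frac{\sum_{i} W_i\, c_{ri}\, c_{si}}{\sum_{i} W_i},
\qquad
\bar{q}_a \;=\; \sum_{r} d_{rr}\, q_r(a),
\qquad
\bar{E} \;=\; \frac{\sum_{i} W_i E_i}{\sum_{i} W_i},
\]

where $c_{ri}$ is the mixing coefficient of CSF $r$ in level $i$, and $\bar{E}$ is the weighted energy functional made stationary in the EOL scheme.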
It should be mentioned that distinct orbital labels are assumed in both Equations (9) and (12); their left-hand sides should be multiplied by appropriate factors when some of the orbital labels coincide, as given in Equation (8). In addition, the contributions to the exchange potential arising from off-diagonal one-body integrals are not presented here, but they have been included since the GRASP92 version [40].
The spin-angular coefficients entering Equations (8) and (11) are known in closed form [1,2] and are calculated during the construction of the potentials and of the Hamiltonian matrix, whereas those entering Equations (9) and (12), as well as the coefficients associated with a one-body integral $I(ab)$, are obtained from the unformatted disk files, namely mcp.XXX, which are generated by the rangular program [41,42,43] of GRASP.
3.2. Redesigning the Calculations of Potentials
The MCDHF calculations are generally divided into two parts: (i) determining the eigenpairs of interest by solving Equation (3) for a given set of one-electron orbitals, and (ii) updating the orbitals by iteratively solving the orbital equations (6) for a given set of mixing coefficients. In addition to the further parallelization of the DVDRC library of GRASP described in Section 2, the computational load can be reduced significantly by redesigning the calculation of the potentials.
The general MCDHF procedure used in the rmcdhf and rmcdhf_mpi programs of GRASP2018 [34] is illustrated in Algorithm 2. The notes integrated in the description of the SCF procedure outline the modifications introduced in the memory versions rmcdhf_mem and rmcdhf_mem_mpi [35], as well as in the present modified version, referred to as rmcdhf_mpi_FD for convenience. Only the parallel versions are considered hereinafter, as we focus on large-scale MCDHF calculations.
Algorithm 2: SCF procedure.
We describe some of the modifications in detail below:
- One routine, SETMCP_MEM, is added in rmcdhf_mem_mpi and retained in rmcdhf_mpi_FD to read the spin-angular coefficients, together with the corresponding packed orbital labels, from the mcp.XXX disk files into arrays. When needed, the data are then fetched from memory in rmcdhf_mem_mpi and rmcdhf_mpi_FD, whereas rmcdhf_mpi re-reads the mcp.XXX disk files in steps (0.3), (a1), (b1), (2) and (3) of Algorithm 2.
- The most time-consuming SETCOF subroutine of rmcdhf_mpi is split into two routines, i.e., SETTVCOF and SETALLCOF in rmcdhf_mpi_FD.
- During the first call, just before the SCF iterations start, SETTVCOF records the Slater integrals contributing to the NEC off-diagonal Lagrange multipliers: the packed labels LABV, obtained by packing the positions of the four orbitals in the orbital set (the packing being chosen such that LABV can be stored as a 4-byte integer), and the corresponding tensor ranks k are saved into arrays. Many identical Slater integrals arise from different blocks.
- Within the first entrance, just before the SCF iterations start, SETALLCOF constructs the NYA and NXA arrays for all the orbitals involved in the calculations of all off-diagonal Lagrange multipliers. The diagonal Slater integrals $F^k$ or $G^k$ of the Hamiltonian matrix involved in the calculations of the direct and exchange contributions (see Equations (8) and (11)), and the Slater integrals recorded by SETTVCOF and involved in Equations (9) and (12), are considered. The corresponding packed labels, LABYk and LABXk, are sorted and saved into the NYA and NXA arrays, respectively. Hence, NYA and NXA are sorted lists with distinct elements. All MPI processes are modified so that they hold the same NYA and NXA arrays. The y and x coefficients arising from the same Slater integrals but from different blocks are accumulated according to the label values stored in NYA and NXA, respectively.
- During the SCF iterations, SETALLCOF only accumulates all the needed coefficients of Equations (7) and (10) across the different blocks, employing a binary search (with time complexity $\mathcal{O}(\log n)$ in the list length $n$) to match the LABYk and LABXk values with those stored in the NYA and NXA arrays, respectively (see the sketch after this list). The accumulated y and x coefficients are saved into the YA and XA arrays at the same positions as those of their labels in the NYA and NXA arrays. This accumulation scheme significantly reduces the computational effort for the relativistic one-dimensional radial integrals entering Equations (7) and (10).
- In both the SETTVCOF and SETALLCOF routines, the computational effort is significantly reduced by taking advantage of the symmetry properties of Equations (8), (9), (11) and (12). Their right-hand sides, i.e., the summations, are the same for all the involved orbitals and are performed only once within each SCF loop. For example, given a packed label (abcd) of tensor rank k, the corresponding Slater integral $R^k(abcd)$ contributes to the exchange parts of the four orbitals a, b, c and d, and the associated four coefficients can be obtained simultaneously by taking their generalized occupation numbers into account.
- In the SETCOF routine of rmcdhf_mpi, these symmetry properties are not exploited. The NYA and NXA arrays are rebuilt at every entrance and contain repeated labels, for which the sequential search (with time complexity $\mathcal{O}(n)$ in the list length $n$), used to accumulate the corresponding y and x coefficients, is inefficient. In MCDHF calculations using many orbitals, the number of labels easily exceeds hundreds of thousands or more. This inefficiency significantly slows down the rmcdhf_mpi computations.
- In rmcdhf_mpi_FD, the subroutines YPOT and XPOT are parallelized by using MPI, whereas they are serial in both rmcdhf_mpi and rmcdhf_mem_mpi.
- Obviously, compared with rmcdhf_mpi and rmcdhf_mem_mpi, the new code rmcdhf_mpi_FD is more memory-consuming, since many additional, possibly large, arrays are maintained during the SCF procedure; dozens of additional GB of memory may be needed if the number of labels reaches several million.
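The accumulation performed by SETALLCOF during the SCF iterations can be illustrated by the sketch below (referred to in the list above). It is illustrative only: the names NXA and XA are taken from the text, but the interface, dimensions and label packing are simplified assumptions and do not reproduce the actual GRASP code.

```fortran
! Sketch: accumulate an exchange-type coefficient xcof, associated with
! the packed label lab, into xa at the position of lab in the sorted,
! distinct list nxa (binary search, O(log n) in the list length).
subroutine accumulate_x(nlab, nxa, xa, lab, xcof)
   implicit none
   integer,          intent(in)    :: nlab        ! number of stored labels
   integer,          intent(in)    :: nxa(nlab)   ! sorted, distinct packed labels
   double precision, intent(inout) :: xa(nlab)    ! accumulated coefficients
   integer,          intent(in)    :: lab         ! packed label to match
   double precision, intent(in)    :: xcof        ! coefficient to add
   integer :: lo, hi, mid

   lo = 1
   hi = nlab
   do while (lo <= hi)
      mid = (lo + hi)/2
      if (nxa(mid) == lab) then
         xa(mid) = xa(mid) + xcof                 ! accumulate at the matched position
         return
      else if (nxa(mid) < lab) then
         lo = mid + 1
      else
         hi = mid - 1
      end if
   end do
   ! Not finding lab would indicate an inconsistency with the label lists
   ! built before the SCF iterations start.
end subroutine accumulate_x
```

The same pattern applies to the direct (y-type) coefficients and the NYA/YA arrays. In contrast, the original SETCOF scans an unsorted list with repeated entries linearly, which is what makes its accumulation step scale so poorly when the number of labels reaches hundreds of thousands or millions.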
3.3. Performance Tests for MCDHF
In the present section, we compare the relative performances of the three available codes, rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD, for MCDHF calculations. Two examples, Mg VII and Be I, are chosen to illustrate and discuss the gains in efficiency obtained with the two new codes, rmcdhf_mem_mpi and rmcdhf_mpi_FD. The calculations are all performed on a Linux server with two Intel(R) Xeon(R) Gold 6278C CPUs (2.60 GHz, 52 cores in total), except in some cases for which the CPU used is explicitly specified. In this comparative work, we carefully checked that the results obtained with the three codes are identical. Throughout the present work, the reported CPU times are all wall-clock times, as these are the most meaningful for end-users.
3.3.1. Mg VII
In our recent work on C-like ions [16], large-scale MCDHF-RCI calculations were performed for the low-lying states of C-like ions from O III to Mg VII. Electron correlation effects were accounted for by using large configuration state function expansions, built from orbital sets of increasing principal quantum numbers. A consistent atomic data set, including both energies and transition data with spectroscopic accuracy, was produced for the lowest few hundred states of the C-like ions from O III to Mg VII. Here we take Mg VII as an example to investigate the performances of the rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD programs.
In the MCDHF calculations of [16] aiming at the orbital optimization, the CSF expansions were generated by SD excitations from all the configurations of the multireference to the orbitals of increasingly large active sets (more details can be found in [16]). The MCDHF calculations were performed layer by layer using a sequence of five active sets (AS) of increasing size, hereafter referred to as AS1–AS5.
Here the test calculations are carried out only for the even states with J = 0–3. The sizes of the CSF expansions for the different AS orbital sets, as well as the number of targeted levels for each block, are listed in Table 1. To keep the calculations tractable, only two SCF iterations are performed, taking the converged radial functions from [16] as initial estimates. The zero- and first-order partition techniques [4,44], often referred to as ‘Zero-First’ methods [45], are employed. The zero-space contains the CSFs with orbitals restricted to a smaller active set; its sizes are also reported in Table 1. The corresponding sizes of the mcp.XXX files are, respectively, about 5.2, 11, 19, 29, and 41 GB for the AS1 through AS5 calculations.
Table 1.
MCDHF calculations for the even states of Mg VII. For each J-block, the number of targeted levels (eigenpairs) and sizes (number of CSFs) of the zero-space and CSF-expansions for the different orbital active sets are listed.
The CPU times for these MCDHF calculations using the AS3 and AS5 orbital sets are reported in Table 2 and Table 3, respectively. To show the MPI performance, the calculations are carried out using various numbers of MPI processes (np) ranging from 1 to 48. The rmcdhf_mpi and rmcdhf_mem_mpi calculations using the AS5 orbital set are only performed with the larger np values, as the calculations with smaller np values are too time-consuming. The CPU times are presented in the time sequence of Algorithm 2. For MCDHF calculations limited to two iterations, the eigenpairs are searched three times, i.e., once at step (0.3) and twice at step (3). The three rows labeled “SetH&Diag” in Table 2 and Table 3 report the corresponding CPU times for setting up the Hamiltonian matrix (routine MATRIXmpi) and for its diagonalization (routine MANEIGmpi), whereas the row labeled “Sum(SetH&Diag)” reports their sum. Steps (1.1) and (2) of Algorithm 2 are carried out twice in all calculations, as is step (1.0) in the rmcdhf_mpi_FD calculations. The rows labeled “SetCof + LAG” and “IMPROV” report, respectively, the CPU times for the routines SETLAGmpi and IMPROVmpi, i.e., for steps (1.1) and (2) of Algorithm 2, while the row “Update” gives their sum. The rows labeled “Sum(Update)” display the total CPU times needed to update the orbitals twice. The rows “Walltime” represent the total code execution times. The differences between the summed value “Sum(Update)” + “Sum(SetH&Diag)” and the “Walltime” values represent the CPU times that are not monitored by the former two. These differences are relatively small in the cases of rmcdhf_mpi and rmcdhf_mem_mpi, implying that most of the time-consuming parts of the codes are included in the tables, while the relatively large differences in the case of rmcdhf_mpi_FD would be reduced if the CPU times needed for constructing the sorted NXA and NYA arrays in step (0.4) of Algorithm 2 were taken into account.
Table 2.
CPU times (in s) for the Mg VII AS3 SCF calculations using the rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD codes as a function of the number of MPI processes (np). See text for the label meanings.
Table 3.
CPU times (in s) for the Mg VII AS5 SCF calculations using the rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD codes as a function of the number of MPI processes (np). See text for the label meanings.
In the rmcdhf_mpi_FD calculations, five kinds of CPU times are additionally recorded, labeled, respectively, “NXA&NYA”, “SetTVCof”, “WithoutMCP”, “WithMCP”, and “SetLAG”. The row “NXA&NYA” reports the CPU times needed to construct the sorted NXA and NYA arrays. The row “SetTVCof” displays the CPU times required to perform all the summations of Equations (9) and (12) in the newly added routine SETTVCOF. The “WithoutMCP” and “WithMCP” rows report the CPU times spent in the added routine SETALLCOF to accumulate the y and x coefficients using Equations (8) and (11), and Equations (9) and (12), respectively. These three contributions, “SetTVCof”, “WithoutMCP” and “WithMCP”, correspond to the computational effort associated with step (1.0) of Algorithm 2. The “SetLAG” row represents the CPU times required to calculate the off-diagonal Lagrange multipliers in the routine SETLAGmpi using the calculated y and x coefficients. The “SetCof + LAG” CPU time values correspond approximately to the sum of the four tasks “SetTVCof” + “WithoutMCP” + “WithMCP” + “SetLAG”, as the calculations involving the one-body integral contributions are generally very fast. The “Update” row reports the sum of “SetCof + LAG” and “IMPROV”, as above for rmcdhf_mpi and rmcdhf_mem_mpi. (The CPU times with the same labels for the different codes can be compared, as they are recorded for the same computational tasks.)
Based on the CPU times reported in Table 2 and Table 3, some comparisons are illustrated in Figure 1, Figure 2, Figure 3 and Figure 4. We discuss below the relative performances of the three codes.
Figure 1.
MPI performances of the first diagonalization in the Mg VII AS3 SCF calculations. Solid lines (left y axis): CPU times (T in s) of the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles), and rmcdhf_mpi_FD (triangles) codes versus the number of MPI processes (np) (also listed in the first “SetH&Diag” line for each code in Table 2). Dashed lines (right y axis): speed-up factors for the three codes, with the same corresponding symbols, estimated as the ratios of the np = 1 CPU time to the others. Dotted line (right y axis) (square symbols): speed-up of rmcdhf_mpi_FD relative to rmcdhf_mpi, calculated as T(rmcdhf_mpi)/T(rmcdhf_mpi_FD).
Figure 2.
MPI performances for updating the orbitals in the Mg VII AS3 SCF calculations. (a) Solid lines (left y axis): orbital updating CPU times (T in s) of the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles), and rmcdhf_mpi_FD (triangles) codes versus the number of MPI processes (np) (also listed in the second “Update” line for each code in Table 2). Dashed lines (right y axis): speed-up factors for the three codes, with the same corresponding symbols, estimated as the ratios of the np = 1 CPU time to the others. (b) Speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mpi (squares) and to rmcdhf_mem_mpi (circles), calculated as T(rmcdhf_mpi)/T(rmcdhf_mpi_FD) and T(rmcdhf_mem_mpi)/T(rmcdhf_mpi_FD), respectively. Speed-up factors of rmcdhf_mem_mpi relative to rmcdhf_mpi (stars), calculated as T(rmcdhf_mpi)/T(rmcdhf_mem_mpi).
Figure 3.
MPI performances for the code running times in the Mg VII AS3 SCF calculations. (a) Solid lines (left y axis): walltimes (in s) for the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles), and rmcdhf_mpi_FD (triangles) codes versus the number of MPI processes (np). Dashed lines (right y axis): speed-up factors for the three codes, with the same corresponding symbols, estimated as the ratios of the np = 1 walltime to the others. (b) Speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mpi (squares) and to rmcdhf_mem_mpi (circles), respectively. Speed-up factors of rmcdhf_mem_mpi relative to rmcdhf_mpi (stars).
Figure 4.
MPI performances of the codes for the Mg VII AS5 SCF calculations. (a) Solid lines (left y axis): walltimes (in s) for the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles) and rmcdhf_mpi_FD (triangles) codes versus the number of MPI processes (np). Dashed line (right y axis): speed-up factors for rmcdhf_mpi_FD (triangles), calculated as the ratios of its np = 1 walltime to the others. Dotted lines (right y axis): speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mpi (squares) and to rmcdhf_mem_mpi (circles), respectively. (b) Solid lines (left y axis): orbital updating CPU times (T in s) for the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles) and rmcdhf_mpi_FD (triangles) calculations; the second “Update” times given in Table 3 are shown here. Dashed line (right y axis): speed-up factors for rmcdhf_mpi_FD (triangles), estimated as the ratios of its np = 1 CPU time to the others. Dotted lines (right y axis): speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mpi (squares) and to rmcdhf_mem_mpi (circles), respectively.
As seen in Table 2 and Figure 1 for the AS3 calculations, the MPI performances for the diagonalization are unsatisfactory for all three codes. The largest speed-up factors remain modest for rmcdhf_mpi and rmcdhf_mem_mpi, as well as for rmcdhf_mpi_FD. The optimal numbers of MPI processes for the diagonalization all lie around 16–24, and the MPI performances deteriorate when np exceeds 24. The CPU times of rmcdhf_mem_mpi and rmcdhf_mpi are very similar. Compared to these two codes, the CPU time of rmcdhf_mpi_FD is reduced by a factor of ≃2.5, thanks to the additional parallelization described in Section 2. The speed-up of rmcdhf_mpi_FD relative to rmcdhf_mpi increases slightly with the size of the CSF expansion; as seen from the first “SetH&Diag” line of Table 3, the CPU time gain factor is somewhat larger for the calculations using the AS5 orbital set. It should be noted that the CPU times needed to set up the Hamiltonian matrix are negligible in all three codes, being tens of times shorter than those for the first search of eigenpairs. The eigenpairs are searched three times, and the corresponding CPU times are included in the three rows labeled “SetH&Diag”. As seen in Table 3, the first “SetH&Diag” CPU time is 945 s in the rmcdhf_mpi_FD calculation, consisting of 14 s and 931 s for the matrix construction and the diagonalization, respectively. For the two subsequent “SetH&Diag” entries, the matrix construction CPU times are still about 14 s, whereas those for the diagonalization are reduced to 34 and 26 s, respectively, because the mixing coefficients are already converged. If the present calculations were initialized with Thomas–Fermi or hydrogen-like approximations, these CPU times would again reach about 900 s.
As far as the orbital updating process is concerned, the MPI performances of the three codes scale very well; the linearity is indeed maintained up to the largest numbers of MPI processes tested, as seen in Figure 2a and Figure 4b. The speed-up factors at np = 48 are, respectively, 26.0, 17.8 and 44.3 for the AS3 rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD calculations, while the value is 43.2 for the AS5 calculation using rmcdhf_mpi_FD. The slopes obtained by linear fits of the speed-up factors as a function of np differ correspondingly among the AS3 rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD calculations, and the slope reaches 0.91 for the AS5 rmcdhf_mpi_FD calculation. In the AS3 calculations, compared to rmcdhf_mpi and rmcdhf_mem_mpi, the rmcdhf_mpi_FD CPU times for updating the orbitals are reduced by large factors, as seen in Figure 2b; the corresponding reduction factors for the AS5 calculations are shown in Figure 4b. These large CPU time savings result from the new strategy developed to calculate the potentials, implemented in rmcdhf_mpi_FD and described in Section 3.2. Unlike for the diagonalization part, the memory version rmcdhf_mem_mpi brings some interesting improvements over rmcdhf_mpi: the orbital updating CPU times are reduced by factors of 2 and 1.6 for the AS3 and AS5 calculations, respectively.
The MPI performances for the walltimes differ among the three codes. As seen in Table 2 and Table 3, the orbital updating CPU times are predominant in most of the rmcdhf_mpi and rmcdhf_mem_mpi MPI calculations, whereas the diagonalization CPU times dominate in the rmcdhf_mpi_FD calculations for all np values. Hence, as seen in Figure 3a and Figure 4a, the global MPI performance of rmcdhf_mpi_FD is similar to the one achieved for the diagonalization. The maximum speed-up factors are about 6.6 and 7.3 for the AS3 and AS5 calculations, respectively, both reached with np = 16–32, even though the “Update” CPU times can be reduced by factors of 44 or 43 at np = 48, as shown above. The speed-ups increase with np in the rmcdhf_mpi and rmcdhf_mem_mpi calculations, reaching about 13.5 and 7.2, respectively, at np = 48. As shown in Figure 3b, compared to rmcdhf_mpi, the walltimes are reduced by factors of 11.2 and 4.3 in the rmcdhf_mpi_FD AS3 calculations at the smaller and larger np values, respectively, while the corresponding reduction factors are 15 and 4.8 for the AS5 calculations, as shown in Figure 4a. The speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mem_mpi are smaller by a factor of 1.5, as rmcdhf_mem_mpi is 1.5 times faster than rmcdhf_mpi, as seen in Figure 3b.
As mentioned above, the total CPU times for the diagonalization reported in Table 2 and Table 3 (see the rows labeled “Sum(SetH&Diag)”) are dominated by the first diagonalization, as the initial radial functions are taken from converged calculations. In SCF calculations initialized with Thomas–Fermi or screened hydrogenic approximations, more computational effort has to be devoted to the subsequent diagonalizations during the SCF iterations. It is obvious that the limited MPI performance of the diagonalization is the bottleneck in the rmcdhf_mpi_FD calculations. As seen in Table 2 and Table 3 and in Figure 3a and Figure 4a, more CPU time is required if np exceeds the optimal number of cores for the diagonalization, which is generally in the range of 16–32. In the rmcdhf_mpi and rmcdhf_mem_mpi calculations, the inefficiency of the orbital-updating procedure is another bottleneck, though this limitation may be alleviated by using more cores to perform the SCF calculations. However, this kind of alleviation is eventually limited by the MPI performance of the diagonalization: as seen in Table 3 and Figure 4a for the rmcdhf_mpi and rmcdhf_mem_mpi calculations, the walltimes at the largest np values are longer than those at intermediate np values, even though the CPU times for updating the orbitals are still reduced significantly in the former calculations.
3.3.2. Be I
To further understand the inefficiency of the orbital updating process in both the rmcdhf_mpi and rmcdhf_mem_mpi codes, a second test case is carried out for a rather simple system, Be I. The calculations target the lowest 99 levels of Be I, distributed over 15 J-parity blocks, with the largest numbers of targeted levels being 12 for two of the blocks. The MCDHF calculations are performed simultaneously for both the even and odd parity states. The largest CSF space contains 55 166 CSFs, formed by SD excitations from all the targeted states distributed over the above 15 blocks, with the largest block containing 4 868 CSFs. The orbitals are optimized with a layer-by-layer strategy. The CPU times recorded for the calculations using two of the orbital sets are given in Table 4 and Table 5; these calculations are hereafter referred to as the smaller and larger Be I calculations, respectively. The corresponding sizes of the mcp.XXX files are, respectively, 760 MB and 19 GB. As the rmcdhf_mpi and rmcdhf_mem_mpi calculations are time-consuming, they are only performed for a restricted set of np values.
Table 4.
CPU times (in s) for the smaller Be I SCF calculations using the rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD codes as a function of the number of MPI processes (np). See text for the label meanings.
Table 5.
CPU times (in s) for the larger Be I SCF calculations using the rmcdhf_mpi, rmcdhf_mem_mpi and rmcdhf_mpi_FD codes as a function of the number of MPI processes (np). See text for the label meanings.
In comparison to the Mg VII test case considered in Section 3.3.1 (see Table 1), the CSF expansions for Be I are much smaller, and fewer levels are targeted. Hence, less computational effort is expected for the construction of the Hamiltonian matrix and for the subsequent diagonalization. This is true for the diagonalization part of all the calculations, whatever the number of MPI processes. For example, the CPU times for searching the eigenpairs are roughly ten times smaller than those for building the Hamiltonian matrix, representing 14 s out of the 150 s reported by the first “SetH&Diag” value of Table 5 for the rmcdhf_mpi_FD calculation. These CPU times are negligible (<1 s) for the following two diagonalizations. Unlike in the Mg VII cases, the CPU times for setting up the Hamiltonian matrix predominate in the three “SetH&Diag” values, being all around 136 s in these calculations, as shown in Table 5. These large differences in CPU time distributions between our Mg VII and Be I test cases arise from the fact that the expansion in Be I is built on a rather large set of 171 Dirac one-electron orbitals, whereas the expansion in Mg VII involves only 88 orbitals. The number of Slater integrals possibly contributing to the matrix elements is, therefore, much larger in Be I (95 451 319) than in Mg VII (6 144 958). Consequently, the three codes report very similar “SetH&Diag” and “Sum(SetH&Diag)” CPU times and all attain maximum speed-up factors of about 10, as seen in Table 4.
The MPI performances of the smaller Be I calculations are shown in Figure 5. In general, perfect MPI scaling corresponds to a speed-up factor equal to np. With respect to this, the speed-up factors observed for rmcdhf_mpi and rmcdhf_mem_mpi are unusual, being much larger than the corresponding np values. For example, the speed-up factors of rmcdhf_mpi and rmcdhf_mem_mpi for the orbital updating are 79.5 and 93.6 at np = 48, and the corresponding slopes obtained from linear fits of the speed-up factors as a function of np are about 1.7 and 2.0, respectively. The corresponding reductions at np = 48 for the code running times are 69.4 and 79.1, with slopes of about 1.5 and 1.7, respectively. These reductions should be even larger for the larger Be I calculations. A detailed analysis shows that the inefficiency of the sequential search method largely accounts for these unexpected MPI performances. As mentioned in Section 3.2, the labels LABYk and LABXk are constructed and stored sequentially in the NYA(:,a) and NXA(:,a) arrays, respectively. In the subsequent accumulations of the y and x coefficients, the sequential search method is employed to match the labels. As mentioned above, a large number of Slater integrals contribute to the Hamiltonian matrix elements in calculations using a large set of orbitals, and they are also involved in the calculations of the potentials. In general, the number of exchange (x-type) terms is much larger than the number of direct (y-type) terms. For example, in the smaller Be I calculations the largest number of the former is 196 513, while there are at most 3 916 terms of the latter, considering all the orbitals; similarly, for the larger calculations, there are at most 2 191 507 and 17 328 terms, respectively. These values correspond to the largest sizes of the one-dimensional vectors NXA(:,a) and NYA(:,a). A sequential search in a large list is obviously less efficient than in a small one, the time complexity being $\mathcal{O}(n)$ in the list length $n$. In MPI calculations with small np values, the sequential search of the LABXk labels in NXA(:,a) lists of over two million elements is very time-consuming. The sizes of NXA(:,a) held by each MPI process decrease as np increases, alleviating the inefficiency of the sequential search method, as also observed in the Mg VII calculations. For example, at the largest np values, the size of NXA(:,a) in each MPI process is reduced to 528 802 in the larger calculations, and consequently the unusually high speed-up factors are attained with both rmcdhf_mpi and rmcdhf_mem_mpi.
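A rough worked estimate (assuming a uniform cost per comparison) makes this gap explicit. Matching one label against the largest exchange list of the larger Be I calculation, with M = 2 191 507 entries, requires on average about M/2 comparisons with a sequential scan, but only about log2 M comparisons with a bisection of the sorted list:

\[
\frac{M/2}{\log_2 M} \;\approx\; \frac{1.1\times 10^{6}}{21} \;\approx\; 5\times 10^{4},
\]

i.e., a reduction of nearly five orders of magnitude per look-up.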
Figure 5.
MPI performances for the smaller Be I SCF calculations (Table 4). (a) Solid lines (left y axis): orbital updating CPU times (T in s) for the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles), and rmcdhf_mpi_FD (triangles) codes versus the number of MPI processes (np). Dashed lines (right y axis): speed-up factors of the three codes (with the same corresponding symbols), estimated as the ratios of the smallest-np CPU times to the others. Dotted lines (right y axis): speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mpi (squares) and to rmcdhf_mem_mpi (circles), calculated as T(rmcdhf_mpi)/T(rmcdhf_mpi_FD) and T(rmcdhf_mem_mpi)/T(rmcdhf_mpi_FD), respectively. (b) Same as in (a), but for the walltimes.
In rmcdhf_mpi_FD, this inefficiency is removed by using a binary search in the sorted arrays NXA(:,a) and NYA(:,a), and this code benefits from the other improvements discussed in Section 3.2. As seen in Figure 5a and Figure 6a, the speed-up factors for updating the orbitals increase slightly with np and attain a value of about 22 for both the smaller and larger rmcdhf_mpi_FD calculations at np = 48, while the corresponding reductions for the code running times are, respectively, 12.5 and 22.5, as seen in Figure 5b and Figure 6b. It should be mentioned that in the rmcdhf_mpi and rmcdhf_mem_mpi calculations, the “Sum(SetH&Diag)” CPU time values are all smaller than the “Sum(Update)” ones, as seen in Table 4 and Table 5. The opposite holds for the rmcdhf_mpi_FD calculations, with all “Sum(SetH&Diag)” CPU time values larger than the “Sum(Update)” ones, as in the Mg VII calculations. Compared to rmcdhf_mpi, the code rmcdhf_mpi_FD reduces the CPU times required for updating the orbitals by factors in the range of 38–20 with np = 16–48 for the smaller MCDHF calculations, while for the larger calculations the corresponding reduction factors lie in the range of 242–287. The corresponding reduction factor ranges for the code running times are, respectively, 11–4.7 and 54–22.5, as seen in Figure 5b and Figure 6b. One can conclude that the larger the scale of the calculations, the larger the CPU time reduction factors. Moreover, the lower the number of cores used, the larger the reduction factors obtained with rmcdhf_mpi_FD. These features become highly relevant for extremely large-scale MCDHF calculations if they have to be performed using a small number of cores due to the limited performance of the diagonalization, as discussed in the previous subsection.
Figure 6.
MPI performances for the larger Be I SCF calculations (Table 5). (a) Solid lines (left y axis): orbital updating CPU times (T in s) for the rmcdhf_mpi (squares), rmcdhf_mem_mpi (circles), and rmcdhf_mpi_FD (triangles) codes versus the number of MPI processes (np). Dashed line (right y axis): speed-up factors of rmcdhf_mpi_FD (triangles), calculated as the ratios of its smallest-np CPU time to the others. Dotted lines (right y axis): speed-up factors of rmcdhf_mpi_FD relative to rmcdhf_mpi (squares) and to rmcdhf_mem_mpi (circles), calculated as T(rmcdhf_mpi)/T(rmcdhf_mpi_FD) and T(rmcdhf_mem_mpi)/T(rmcdhf_mpi_FD), respectively. (b) Same as in (a), but for the walltimes.
3.3.3. Possible Further Improvements for rmcdhf_mpi_FD
As discussed above, the MPI performances of rmcdhf_mpi_FD for updating the orbitals scale well in the Mg VII calculations, the speed-up factors roughly following a linear scaling with np (see Figure 2b and Figure 4b). For the Be I calculations, however, as illustrated by Figure 5a and Figure 6a, the speed-up factor increases more slowly with np and attains a maximum value of about 22. The partial CPU times for the orbital updating process, labeled “SetTVCof”, “WithoutMCP”, “WithMCP”, and “SetLAG”, are plotted in Figure 7, together with the total updating time labeled “Update”, for the Mg VII AS5 and the larger Be I calculations (these labels are explained in Section 3.3.1). The partial CPU times labeled “IMPROV” are not reported here, as they are generally negligible. It can be seen that the “SetTVCof” and “WithoutMCP” partial times dominate the total CPU times required for updating the orbitals in the Mg VII calculations, and they all scale well with np. In the Be I calculations, the partial “SetLAG” CPU times are predominant, while the remaining partial CPU times scale well. However, the scaling of both the partial “SetLAG” and the total CPU times is worse than in Mg VII, for which an extra speed-up of “SetLAG” is even observed. These different scalings can again be attributed to the large number of terms contributing to the exchange potentials in the Be I calculations, as mentioned above. For each term, the relativistic one-dimensional radial integrals are evaluated on the grid, over hundreds of r values. All these evaluations are serial and are often repeated for the same integral, associated with different terms that differ from each other only by one or two orbital indices. This kind of inefficiency could be removed by calculating all the needed integrals in advance and storing them in arrays, as sketched below; this will be implemented in future versions of the rmcdhf_mpi_FD code.
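A possible caching scheme for this future improvement is sketched below. It is only a sketch under simple assumptions: the routine names, the packed (k, b, d) label list and the grid handling are invented for illustration, and yk_on_grid stands for whatever routine already evaluates the one-dimensional radial integrals in the present code.

```fortran
! Sketch: tabulate each distinct one-dimensional radial integral once per
! SCF iteration and reuse it for every x/y term sharing the same (k,b,d).
subroutine fill_yk_cache(ndist, klist, ngrid, ykgrid)
   implicit none
   integer,          intent(in)  :: ndist            ! number of distinct (k,b,d) labels
   integer,          intent(in)  :: klist(ndist)     ! packed (k,b,d) labels
   integer,          intent(in)  :: ngrid            ! number of radial grid points
   double precision, intent(out) :: ykgrid(ngrid, ndist)
   integer :: j

   do j = 1, ndist
      ! yk_on_grid is assumed to evaluate the integral on the full grid,
      ! exactly as the existing serial code already does for every term.
      call yk_on_grid(klist(j), ngrid, ykgrid(:, j))
   end do
   ! The potentials are then assembled by indexing ykgrid(:, j) instead of
   ! recomputing the same integral for terms differing only in the other
   ! orbital indices; the loop over j could also be distributed over MPI.
end subroutine fill_yk_cache
```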
Figure 7.
MPI performances for the partial and total CPU times for updating the orbitals in (a) the Mg VII AS5 and (b) the larger Be I calculations. Solid curves: CPU times (in s) labeled “Update” (squares), “SetTVCof” (circles), “WithoutMCP” (up-triangles), “WithMCP” (down-triangles), and “SetLAG” (diamonds), respectively, for the first iteration of the rmcdhf_mpi_FD SCF calculations (some of them are also listed in Table 3 and Table 5). In addition, the “NXA&NYA” CPU times are shown as dotted lines.
The MPI performances for the construction of the NXA and NYA arrays are also displayed as dotted lines in Figure 7. As the label values obtained by the different MPI processes have to be collected, sorted and then re-distributed, the linear scaling begins to deteriorate at about 32 cores for the Be I calculations. Fortunately, this construction is performed only once, just before the SCF iterations, so this slightly poorer MPI performance should not be a bottleneck in large-scale MCDHF calculations.
After a thorough investigation of the procedures that could affect the MPI performances of rmcdhf_mpi_FD, one concludes that the poor scaling of the diagonalization could be the bottleneck for MCDHF calculations based on relatively large expansions, consisting of hundreds of thousands of CSFs and targeting dozens of eigenpairs. We discuss this issue in the next section.
4. Performance Tests for RCI Codes
The MCDHF calculations are generally followed by RCI calculations employing the GRASP rci and rci_mpi codes. In these calculations, larger CSF expansions than those considered in the MCDHF step are used to capture higher-order electron correlation effects. Corrections to the Dirac–Coulomb Hamiltonian, such as the transverse photon interaction and the leading QED corrections, are also taken into account in this configuration-interaction step, without affecting the one-electron orbitals. As mentioned in Section 1, we recently implemented in GRASP2018 [36] an original computational methodology based on configuration state function generators (CSFGs) to build the Hamiltonian matrix. This strategy takes full advantage of the fact that the spin-angular integrals, such as the coefficients in Equation (5), are independent of the principal quantum numbers. In this approach, the CSF space is divided into two parts, i.e., the labeling space and the correlation space. The former typically accounts for the major correlation effects due to close degeneracies and long-range rearrangements, while the latter typically accounts for short-range interactions and dynamical correlation. The orbital set is also divided into two parts, i.e., a subset of labeling-ordered (LO) orbitals and a subset of symmetry-ordered (SO) orbitals [36]. The labeling CSFs are built with the LO orbitals only, generated by electron excitations (single (S), double (D), triple (T), quadruple (Q), etc.) from a multireference (MR). The correlation CSFs are built with the LO orbitals together with the SO orbitals, generated by SD excitations only, also from the given MR. In the present implementation, at most two electrons are allowed to occupy the SO orbitals.
A CSFG of a given type is a correlation CSF in which one or two electrons occupy the SO orbitals with the highest allowed principal quantum number. Given a CSFG, a group of correlation CSFs can be generated by orbital de-excitations within the SO orbital set that preserve the spin-angular coupling. The generated CSFs within the same group differ from each other only by the principal quantum numbers. The use of CSFGs makes it possible to restrict the spin-angular integration to a limited number of cases rather than performing it for each of the elements of the Hamiltonian matrix. Compared to ordinary RCI calculations employing rci_mpi, the CPU times were demonstrated to be reduced by factors of ten with the newly developed code, hereafter referred to as rci_mpi_CSFG. It was also found that the Breit contributions involving orbitals of high orbital angular momentum (l) can be safely discarded, and an efficient a priori condensation technique was developed using CSFGs to significantly reduce the expansion sizes, with negligible changes to the computed transition energies. Test calculations were presented for a number of atomic systems and correlation models with increasing sets of one-electron orbitals in [36]. Compared to the original GRASP2018 rci_mpi program, the larger the scale of the calculations, the larger the CPU time reduction factors obtained with rci_mpi_CSFG; the latter is, therefore, very well suited to extremely large-scale calculations. Here we focus on the MPI performances of rci_mpi_CSFG and rci_mpi.
The MPI performance test calculations are performed for a single J-parity block in Ne VII, using a large orbital set. As in the MCDHF calculations, all the possible configurations of the chosen reference set define the MR. The correlation CSFs are formed by SD excitations from this MR, allowing at most one electron to be excited from the 1s subshell. These CSF expansions thus model both the VV and CV electron correlation. The resulting number of CSFs is 2 112 922, and this expansion is used in the rci_mpi calculation.
In the rci_mpi_CSFG calculation, the orbitals of the inner active sets are treated as LO orbitals, while the remaining ones are regarded as SO orbitals. The CSFs are generated as in the rci_mpi calculation. The labeling space contains 95 130 CSFs, while there are 197 480 CSFGs within the correlation space, spanning 2 017 792 correlation CSFs. The total size of the original CSF expansion is recovered by adding the sizes of the labeling and correlation spaces, i.e., 95 130 + 2 017 792 = 2 112 922, as it should be. However, the program rci_mpi_CSFG reads a file of only 95 130 + 197 480 = 292 610 CSFs, corresponding to the sum of the labeling CSFs and the CSFGs. The file size is thus reduced by a factor of about 7.2 compared to the file containing all the CSFs treated by rci_mpi. This ratio is very meaningful, being related to the performance enhancement of rci_mpi_CSFG, as the numbers of spin-angular integrations scale with the squares of these list sizes in the rci_mpi and rci_mpi_CSFG calculations. Ideally, a speed-up factor of about 50 is therefore expected for the latter relative to the former (see the estimate below). Such a gain cannot be fully achieved in practice, however, because the spin-angular integration is not the whole computational load of an RCI calculation.
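Assuming that the number of spin-angular integrations scales with the number of distinct matrix elements, N(N + 1)/2, the ideal gain quoted above can be estimated from the two list sizes:

\[
\frac{N_{\mathrm{full}}\,(N_{\mathrm{full}}+1)/2}{N_{\mathrm{red}}\,(N_{\mathrm{red}}+1)/2}
\;\approx\;\left(\frac{2\,112\,922}{292\,610}\right)^{2}\;\approx\; 52 .
\]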
The MPI performances of rci_mpi and rci_mpi_CSFG can be assessed from Table 6. All the MPI calculations are performed for the lowest 54 levels of this block in Ne VII, using various numbers of cores in the range of 16–128 on a Linux server with two AMD EPYC 7763 64-core processors. Rather than using the zero-first approximation as in the above MCDHF calculations, here all the matrix elements are calculated and taken into account in the RCI calculation. The disk space taken by the nonzero matrix elements is about 173 GB. The CPU times for building the Hamiltonian matrix, for searching the eigenpairs, and their sums are also shown in Figure 8. The former are accurately reproduced by allometric scaling laws of the form $T = a\,np^{-b}$ (in minutes), with a prefactor of 16 005 for rci_mpi; the fitted exponents imply that the matrix-construction CPU times are reduced by factors of 1.77 and 1.86 when doubling the number of cores for rci_mpi_CSFG and rci_mpi, respectively, showing that both codes have good MPI scaling for building the Hamiltonian matrix. However, the poor MPI scaling is again seen for the diagonalization. The optimal np values for the two codes are both around 32, and the CPU times increase significantly for larger np: the rci_mpi and rci_mpi_CSFG diagonalizations with the largest number of cores are longer than with the optimal one by factors of about 4.1 and 3.7, respectively. The different MPI scalings for the matrix construction and the diagonalization are not unexpected. For the former, once each MPI process has obtained the CSF expansion, no further communication between processes is needed. During the diagonalization procedure, in contrast, a large amount of MPI communication is needed to ensure that each process holds the same approximate eigenvectors after every matrix-vector multiplication.
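The quoted doubling factors correspond to an allometric law of the form $T(np) = a\,np^{-b}$; the exponents given below are inferred from those factors and are therefore approximate:

\[
\frac{T(2\,np)}{T(np)} = 2^{-b}
\quad\Longrightarrow\quad
b \approx \log_2 1.77 \approx 0.82 \;(\texttt{rci\_mpi\_CSFG}),
\qquad
b \approx \log_2 1.86 \approx 0.90 \;(\texttt{rci\_mpi}).
\]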
Table 6.
CPU times (in s) for the construction of the Hamiltonian matrix (H), for its diagonalization (D), and for the cumulated tasks (Sum), using rci_mpi and rci_mpi_CSFG for the Mg VII calculations, as a function of the number of MPI processes (np).
Figure 8.
MPI performances of rci_mpi and rci_mpi_CSFG. (a) CPU times (in min) for the construction of the Hamiltonian matrix (squares), its diagonalization (circles), and their sum (triangles) versus the number of MPI processes (np) for the Mg VII calculations with rci_mpi. The dashed curve is reproduced by an allometric fit; see text. (b) Same as in (a), but for calculations with rci_mpi_CSFG.
Consequently, considering both tasks (matrix construction and diagonalization), as seen in Table 6 and Figure 8, the optimal np values for the whole code running times are in the ranges of 64–96 and 32–64 for rci_mpi and rci_mpi_CSFG, respectively. The latter outperforms the former by factors of 8.7 and 4.0 for the calculations using 16 and 128 cores, respectively. The best performance of rci_mpi_CSFG is 118 min using 32 cores, while it is 693 min using 64 cores for rci_mpi. The CPU time is thus reduced by a factor of 5.9 with rci_mpi_CSFG, and this is achieved with half of the cores used by rci_mpi, which is particularly attractive on shared servers. The better performance of rci_mpi_CSFG obviously results from the improvements in both the matrix construction and the diagonalization, i.e., from the implementation of CSFGs and from the additional parallelization discussed above; the CPU times needed for these two tasks are reduced on average by factors of about 10 and 3, respectively.
The scalability of the codes is also of interest when more and more eigenpairs are searched from a given Hamiltonian matrix. In Figure 9, the diagonalization CPU times are plotted versus the number of searched eigenpairs. These calculations were also performed for the same block in Ne VII, using 16 cores. For large enough numbers of eigenpairs, the reported CPU times (in minutes) are well reproduced by quadratic polynomial fits for both rci_mpi and rci_mpi_CSFG (see the generic form below). For the latter, the CPU time is approximately linear in the number of eigenpairs, the quadratic term being over one order of magnitude smaller for this code than for rci_mpi. Consequently, rci_mpi_CSFG outperforms rci_mpi more and more significantly as the number of searched eigenpairs increases, reducing the diagonalization CPU times by a factor of 4.2 for the largest number of eigenpairs considered. This feature of the rci_mpi_CSFG code is very helpful for large-scale spectrum calculations involving hundreds of levels.
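The fits referred to above are of the generic quadratic form (the fitted coefficients themselves are not reproduced here)

\[
T_{\mathrm{diag}}(K) \;\approx\; c_0 + c_1\,K + c_2\,K^{2},
\]

where $K$ is the number of searched eigenpairs; according to the fits, $c_2$ is over an order of magnitude smaller for rci_mpi_CSFG than for rci_mpi, which is why the former becomes increasingly advantageous as $K$ grows.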
Figure 9.
CPU times (in min) for searching different numbers of eigenpairs of the selected block in Mg VII. Dashed lines: CPU times for rci_mpi (squares) and rci_mpi_CSFG (circles); the lines are reproduced by quadratic polynomial fits. Dotted line (right y axis): speed-up factors of rci_mpi_CSFG relative to rci_mpi (circles).
5. Conclusions
In summary, the computational load of MCDHF calculations employing the GRASP rmcdhf_mpi code is generally divided into the orbital-updating process and the matrix diagonalization. The inefficiency found in the former part has been removed by redesigning the calculation of the direct and exchange potentials, as well as of the Lagrange multipliers; consequently, the corresponding CPU times may be reduced by one to two orders of magnitude. For the second part, the additional parallelization of the diagonalization procedure may reduce the CPU times by a factor of about 3. The computational load of RCI calculations employing GRASP rci_mpi can also be divided into the Hamiltonian matrix construction and its diagonalization. In addition to the additional parallelization that improves the efficiency of the latter, the load of the former is reduced by a factor of ten or more thanks to the recently implemented computational methodology based on CSFGs. Compared to the original rmcdhf_mpi and rci_mpi codes, the present modified versions, i.e., rmcdhf_mpi_FD and rci_mpi_CSFG, cut down the whole computational loads of MCDHF and RCI calculations by factors of several to a few tens, depending on the calculation scale governed by (i) the size of the CSF expansion, (ii) the size of the orbital set, (iii) the number of desired eigenpairs and (iv) the number of MPI processes used. In general, the larger the first three, the larger the CPU time reduction factors obtained with rmcdhf_mpi_FD and rci_mpi_CSFG; on the other hand, the smaller the number of cores used, the larger the reduction factors observed. These features make the rmcdhf_mpi_FD and rci_mpi_CSFG codes very suitable for extremely large-scale MCDHF and RCI calculations.
The MPI performances of the above four codes, as well as of the memory version of rmcdhf_mpi, i.e., rmcdhf_mem_mpi, are carefully investigated. All codes show good MPI scaling for the orbital updating process and for the matrix construction step in MCDHF and RCI calculations, respectively, whereas the MPI scaling for the diagonalization is poor. If few eigenpairs are searched or very small CSF expansions are employed, the MPI calculations may be performed using as many cores as available. To obtain the best performance for large-scale calculations using expansions of hundreds of thousands or millions of CSFs and targeting dozens of levels or more, the relative computational loads of the diagonalization versus the orbital updating and matrix construction should be considered. As the latter two are significantly reduced by the rmcdhf_mpi_FD and rci_mpi_CSFG codes, respectively, the diagonalization will often dominate the computational load. For such cases, the MPI calculations should be performed using the optimal number of cores for the diagonalization, which is generally around 32. The poor MPI scaling of the diagonalization is obviously the bottleneck of the rmcdhf_mpi_FD and rci_mpi_CSFG codes for precise spectrum calculations involving hundreds of levels. How to improve the MPI scaling of the diagonalization is still unclear to us; an MPI/OpenMP hybridization might be helpful. For now, a temporary workaround is provided for large-scale RCI calculations: the Hamiltonian matrix is first calculated using as many cores as possible, and the files storing the nonzero matrix elements are then re-distributed by an auxiliary program to match the optimal number of processes for the diagonalization.
Author Contributions
Methodology, Y.L., J.L., C.S., C.Z., R.S., K.W., M.G., G.G., P.J. and C.C.; software, Y.L., J.L., C.S., C.Z., R.S., K.W., M.G., G.G., P.J. and C.C.; validation, Y.L., J.L., C.S., C.Z., R.S., K.W., M.G., G.G., P.J. and C.C.; investigation, Y.L., J.L., R.S., K.W., M.G., G.G., P.J. and C.C.; writing—original draft, Y.L., R.S., K.W. and C.C.; writing—review and editing, Y.L., J.L., C.S., C.Z., R.S., K.W., M.G., G.G., P.J. and C.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (Grant nos. 12104095 and 12074081). Y.L. acknowledges support from the China Scholarship Council with Grant No. 202006100114. K.W. expresses his gratitude for the support from the visiting researcher program at Fudan University. M.G. acknowledges support from the Belgian FWO and FNRS Excellence of Science Programme (EOSO022818F).
Data Availability Statement
Not applicable.
Acknowledgments
The authors wish to thank the members of the CompAS group for valuable suggestions for improvements of the computer codes. R.S. and C.Y.C. would like to thank Charlotte Froese Fischer for the suggestions about the performance test. The authors would also like to thank Jacek Bieroń for his valuable comments on the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Dyall, K.G.; Grant, I.P.; Johnson, C.T.; Parpia, F.A.; Plummer, E.P. GRASP—A General-purpose Relativistic Atomic-structure Program. Comput. Phys. Commun. 1989, 55, 425–456.
- Grant, I.P.; McKenzie, B.J.; Norrington, P.H.; Mayers, D.F.; Pyper, N.C. An atomic multiconfigurational Dirac–Fock package. Comput. Phys. Commun. 1980, 21, 207–231.
- Grant, I.P. Relativistic Quantum Theory of Atoms and Molecules. Theory and Computation (Atomic, Optical and Plasma Physics); Springer Science and Business Media, LLC: New York, NY, USA, 2007.
- Froese Fischer, C.; Godefroid, M.; Brage, T.; Jönsson, P.; Gaigalas, G. Advanced multiconfiguration methods for complex atoms: I. Energies and wave functions. J. Phys. B At. Mol. Opt. Phys. 2016, 49, 182004.
- Jönsson, P.; Godefroid, M.; Gaigalas, G.; Ekman, J.; Grumer, J.; Li, W.; Li, J.; Brage, T.; Grant, I.P.; Bieroń, J.; et al. An Introduction to Relativistic Theory as Implemented in GRASP. Atoms 2023, 11, 7.
- Jönsson, P.; Gaigalas, G.; Bieroń, J.; Froese Fischer, C.; Grant, I.P. New version: Grasp2K relativistic atomic structure package. Comput. Phys. Commun. 2013, 184, 2197–2203.
- Lindgren, I. The Rayleigh–Schrödinger perturbation and the linked-diagram theorem for a multi-configurational model space. J. Phys. B At. Mol. Phys. 1974, 7, 2441–2470.
- Gu, M.F. The flexible atomic code. Can. J. Phys. 2008, 86, 675–689.
- Gu, M.F. Energies of 1s^2 2l^q (1 ≤ q ≤ 8) states for Z ≤ 60 with a combined configuration interaction and many-body perturbation theory approach. At. Data Nucl. Data Tables 2005, 89, 267–293.
- Gu, M.F.; Holczer, T.; Behar, E.; Kahn, S.M. Inner-Shell Absorption Lines of Fe VI–Fe XVI: A Many-Body Perturbation Theory Approach. Astrophys. J. 2006, 641, 1227–1232.
- Si, R.; Guo, X.; Wang, K.; Li, S.; Yan, J.; Chen, C.; Brage, T.; Zou, Y. Energy levels and transition rates for helium-like ions with Z = 10–36. Astron. Astrophys. 2016, 592, A141.
- Wang, K.; Chen, Z.B.; Zhang, C.Y.; Si, R.; Jönsson, P.; Hartman, H.; Gu, M.F.; Chen, C.Y.; Yan, J. Benchmarking Atomic Data for Astrophysics: Be-like Ions between B II and Ne VII. Astrophys. J. Suppl. Ser. 2018, 234, 40.
- Wang, K.; Guo, X.; Liu, H.; Li, D.; Long, F.; Han, X.; Duan, B.; Li, J.; Huang, M.; Wang, Y.; et al. Systematic calculations of energy levels and transition rates of Be-like ions with Z = 10–30 using a combined configuration interaction and many-body perturbation theory approach. Astrophys. J. Suppl. Ser. 2015, 218, 16.
- Wang, K.; Song, C.X.; Jönsson, P.; Ekman, J.; Godefroid, M.; Zhang, C.Y.; Si, R.; Zhao, X.H.; Chen, C.Y.; Yan, J. Large-scale Multiconfiguration Dirac–Hartree–Fock and Relativistic Configuration Interaction Calculations of Transition Data for B-like S XII. Astrophys. J. 2018, 864, 127.
- Si, R.; Zhang, C.; Cheng, Z.; Wang, K.; Jönsson, P.; Yao, K.; Gu, M.; Chen, C. Energy Levels, Transition Rates and Electron Impact Excitation Rates for the B-like Isoelectronic Sequence with Z = 24–30. Astrophys. J. Suppl. Ser. 2018, 239, 3.
- Li, J.; Zhang, C.; Del Zanna, G.; Jönsson, P.; Godefroid, M.; Gaigalas, G.; Rynkun, P.; Radžiūtė, L.; Wang, K.; Si, R.; et al. Large-scale Multiconfiguration Dirac–Hartree–Fock Calculations for Astrophysics: C-like Ions from O III to Mg VII. Astrophys. J. Suppl. Ser. 2022, 260, 50.
- Wang, K.; Si, R.; Dang, W.; Jönsson, P.; Guo, X.L.; Li, S.; Chen, Z.B.; Zhang, H.; Long, F.Y.; Liu, H.T.; et al. Calculations with spectroscopic accuracy: Energies and transition rates in the nitrogen isoelectronic sequence from Ar XII to Zn XXIV. Astrophys. J. Suppl. Ser. 2016, 223, 3.
- Wang, K.; Jönsson, P.; Ekman, J.; Gaigalas, G.; Godefroid, M.; Si, R.; Chen, Z.; Li, S.; Chen, C.; Yan, J. Extended calculations of spectroscopic data: Energy levels, lifetimes, and transition rates for O-like ions from Cr XVII to Zn XXIII. Astrophys. J. Suppl. Ser. 2017, 229, 37.
- Song, C.; Zhang, C.; Wang, K.; Si, R.; Godefroid, M.; Jönsson, P.; Dang, W.; Zhao, X.; Yan, J.; Chen, C. Extended calculations with spectroscopic accuracy: Energy levels and radiative rates for O-like ions between Ar XI and Cr XVII. At. Data Nucl. Data Tables 2021, 138, 101377.
- Si, R.; Li, S.; Guo, X.; Chen, Z.; Brage, T.; Jönsson, P.; Wang, K.; Yan, J.; Chen, C.; Zou, Y. Extended calculations with spectroscopic accuracy: Energy levels and transition properties for the fluorine-like isoelectronic sequence with Z = 24–30. Astrophys. J. Suppl. Ser. 2016, 227, 16.
- Li, J.; Zhang, C.; Si, R.; Wang, K.; Chen, C. Calculations of energies, transition rates, and lifetimes for the fluorine-like isoelectronic sequence with Z = 31–35. At. Data Nucl. Data Tables 2019, 126, 158–294.
- Wang, K.; Chen, Z.B.; Si, R.; Jönsson, P.; Ekman, J.; Guo, X.L.; Li, S.; Long, F.Y.; Dang, W.; Zhao, X.H.; et al. Extended relativistic configuration interaction and many-body perturbation calculations of spectroscopic data for the n ≤ 6 configurations in Ne-like ions between Cr XV and Kr XXVII. Astrophys. J. Suppl. Ser. 2016, 226, 14.
- Zhang, X.; Del Zanna, G.; Wang, K.; Rynkun, P.; Jönsson, P.; Godefroid, M.; Gaigalas, G.; Radžiūtė, L.; Ma, L.; Si, R.; et al. Benchmarking Multiconfiguration Dirac–Hartree–Fock Calculations for Astrophysics: Si-like Ions from Cr XI to Zn XVII. Astrophys. J. Suppl. Ser. 2021, 257, 56.
- Wang, K.; Jönsson, P.; Gaigalas, G.; Radžiūtė, L.; Rynkun, P.; Del Zanna, G.; Chen, C. Energy levels, lifetimes, and transition rates for P-like ions from Cr X to Zn XVI from large-scale relativistic multiconfiguration calculations. Astrophys. J. Suppl. Ser. 2018, 235, 27.
- Song, C.; Wang, K.; Del Zanna, G.; Jönsson, P.; Si, R.; Godefroid, M.; Gaigalas, G.; Radžiūtė, L.; Rynkun, P.; Zhao, X.; et al. Large-scale Multiconfiguration Dirac–Hartree–Fock Calculations for Astrophysics: n = 4 Levels in P-like Ions from Mn XI to Ni XIV. Astrophys. J. Suppl. Ser. 2020, 247, 70.
- Wang, K.; Song, C.X.; Jönsson, P.; Del Zanna, G.; Schiffmann, S.; Godefroid, M.; Gaigalas, G.; Zhao, X.H.; Si, R.; Chen, C.Y.; et al. Benchmarking atomic data from large-scale multiconfiguration Dirac–Hartree–Fock calculations for astrophysics: S-like ions from Cr IX to Cu XIV. Astrophys. J. Suppl. Ser. 2018, 239, 30.
- Wang, K.; Jönsson, P.; Del Zanna, G.; Godefroid, M.; Chen, Z.; Chen, C.; Yan, J. Large-scale Multiconfiguration Dirac–Hartree–Fock Calculations for Astrophysics: Cl-like Ions from Cr VIII to Zn XIV. Astrophys. J. Suppl. Ser. 2019, 246, 1.
- Zhang, C.Y.; Wang, K.; Godefroid, M.; Jönsson, P.; Si, R.; Chen, C.Y. Benchmarking calculations with spectroscopic accuracy of excitation energies and wavelengths in sulfur-like tungsten. Phys. Rev. A 2020, 101, 032509.
- Zhang, C.Y.; Wang, K.; Si, R.; Godefroid, M.; Jönsson, P.; Xiao, J.; Gu, M.F.; Chen, C.Y. Benchmarking calculations with spectroscopic accuracy of level energies and wavelengths in W LVII–W LXII tungsten ions. J. Quant. Spectrosc. Radiat. Transf. 2021, 269, 107650.
- Zhang, C.Y.; Li, J.Q.; Wang, K.; Si, R.; Godefroid, M.; Jönsson, P.; Xiao, J.; Gu, M.F.; Chen, C.Y. Benchmarking calculations of wavelengths and transition rates with spectroscopic accuracy for W XLVIII through W LVI tungsten ions. Phys. Rev. A 2022, 105, 022817.
- Guo, X.; Li, M.; Zhang, C.; Wang, K.; Li, S.; Chen, Z.; Liu, Y.; Zhang, H.; Hutton, R.; Chen, C. High accuracy theoretical calculation of wavelengths and transition probabilities in Se- through Ga-like ions of tungsten. J. Quant. Spectrosc. Radiat. Transf. 2018, 210, 204–216.
- Guo, X.; Li, M.; Si, R.; He, X.; Wang, K.; Dai, Z.; Liu, Y.; Zhang, H.; Chen, C. Accurate study on the properties of spectral lines for Br-like W39+. J. Phys. B At. Mol. Opt. Phys. 2017, 51, 015002.
- Guo, X.; Grumer, J.; Brage, T.; Si, R.; Chen, C.; Jönsson, P.; Wang, K.; Yan, J.; Hutton, R.; Zou, Y. Energy levels and radiative data for Kr-like W38+ from MCDHF and RMBPT calculations. J. Phys. B At. Mol. Opt. Phys. 2016, 49, 135003.
- Froese Fischer, C.; Gaigalas, G.; Jönsson, P.; Bieroń, J. GRASP2018—A Fortran 95 version of the general relativistic atomic structure package. Comput. Phys. Commun. 2019, 237, 184–187.
- Gaigalas, G. Commit 77aa600ab02b58718b9c5a82ce9e6c638cc09921, 2 February 2022. Available online: https://www.github.com/compas/grasp2018 (accessed on 20 November 2022).
- Li, Y.T.; Wang, K.; Si, R.; Godefroid, M.; Gaigalas, G.; Chen, C.Y.; Jönsson, P. Reducing the Computational Load—Atomic Multiconfiguration Calculations based on Configuration State Function Generators. Comput. Phys. Commun. 2022, 283, 108562.
- Davidson, E.R. Iterative Calculation of a Few of the Lowest Eigenvalues and Corresponding Eigenvectors of Large Real-Symmetric Matrices. J. Comput. Phys. 1975, 17, 87–94.
- Stathopoulos, A.; Froese Fischer, C. A Davidson program for finding a few selected extreme eigenpairs of a large, sparse, real, symmetric matrix. Comput. Phys. Commun. 1994, 79, 268–290.
- Edmonds, A. Angular Momentum in Quantum Mechanics; Princeton University Press: Princeton, NJ, USA, 1957.
- Parpia, F.A.; Froese Fischer, C.; Grant, I.P. GRASP92: A package for large-scale relativistic atomic structure calculations. Comput. Phys. Commun. 1996, 94, 249–271.
- Gaigalas, G.; Rudzikas, Z.; Froese Fischer, C. An efficient approach for spin-angular integrations in atomic structure calculations. J. Phys. B At. Mol. Opt. Phys. 1997, 30, 3747.
- Gaigalas, G.; Fritzsche, S.; Grant, I.P. Program to calculate pure angular momentum coefficients in jj-coupling. Comput. Phys. Commun. 2001, 139, 263–278.
- Gaigalas, G. A Program Library for Computing Pure Spin-Angular Coefficients for One- and Two-Particle Operators in Relativistic Atomic Theory. Atoms 2022, 10, 129.
- Gustafsson, S.; Jönsson, P.; Froese Fischer, C.; Grant, I.P. Combining multiconfiguration and perturbation methods: Perturbative estimates of core–core electron correlation contributions to excitation energies in Mg-like iron. Atoms 2017, 5, 3.
- Gaigalas, G.; Rynkun, P.; Radžiūtė, L.; Kato, D.; Tanaka, M.; Jönsson, P. Energy Level Structure and Transition Data of Er2+. Astrophys. J. Suppl. Ser. 2020, 248, 13.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).