Scalable codes for precision calculations of properties of complex atomic systems

High precision atomic data is indispensable for experiments involving studies of fundamental interactions, astrophysics, atomic clocks, plasma science, and others. We develop new parallel atomic structure codes and explore the difficulties of load-balancing in these codes. Efficient load-balancing of matrix elements for many-electron systems is very difficult due to the intrinsic nature of the computational methods used to compute them. By arithmetically selecting determinants for each core, we achieve very even workload distribution, and attain near-perfect linear scalability and efficiency with the number of cores. We also implement dynamic memory allocation to minimize memory usage and remove the need for users to set certain array parameters. Our newly developed codes enable computations that were not possible before due to lack of memory or prohibitive computation times, and allow a broader range of correlations to be investigated in a shorter period of time. This includes calculations correlating all 60 electrons in the highly charged Ir$^{17+}$ ion and calculations predicting the $3C/3D$ line intensity ratio in Fe$^{16+}$. Our new code package will also be used to produce large volumes of high precision atomic data for a new online portal being developed at the University of Delaware.


Introduction
Studies of the fundamental symmetries with atoms and ions require knowledge of atomic properties as well as calculations to extract potential new physics from the experiments [1].For example, the most accurate atomic calculation for a heavy atom was carried out to study Cs parity violation [2,3].A precision (0.35% accuracy) experiment [4] had to be supplemented by theory computation of a comparable accuracy in order to extract the value of the weak charge and test the standard model of elementary particles.The theory accuracy is more than a order of magnitude worse in Yb, not sufficient for a similar analysis of the Yb experiments [5,6] motivating studies with multiple isotopes [6].While Yb has two valence electrons, low-lying excitations from the closed shells that cannot so far be accurately treated by any high-precision methods significantly lower attainable precision.Theory accuracy is even worse for more compacted systems such as Dy [7] or Sm of interest to parity violation and other studies.Studies of CP-violation with atoms also require computations of the enhancement factors [8].Studies of Lorentz symmetry violations need computations of the Lorentz invariance violating matrix elements [9].Moreover, development of experiments with new systems requires accurate knowledge of many atomic properties.For example, development of even more precise atomic clocks [10] that are used to test the Einstein's equivalence principle, search for the variation of the fundamental constants and dark matter [1] is accelerated by strong theory-experimental collaborations.
In the present work we improve on the capabilities of an ab initio atomic structure code package for calculating atomic properties of complex many-electron systems.The methods used here, including configuration interaction (CI) and the combination of CI with either many body perturbation theory (CI+MBPT) or the linearized coupled cluster method (CI+all-order), are very broadly applicable to any atom in the periodic table.Numerous problems that were not tractable before have been solved with our newly developed parallel codes, including problems in astrophysics, metrology of highly charged ions, neutral atoms, and negative arXiv:2103.04473v2 [physics.atom-ph]10 Mar 2021 ions.Our focus will be on the computational developments that enable these new large-scale calculations.
A version of the CI+MBPT code package was modified for public use and published in Computer Physics Communications in 2015 by M. Kozlov et al [11].Although not published yet, the inclusion of the all-order method which provides accurate solutions for a large number of properties of atoms and ions with up to 5 -6 valence electrons has been completed and made fully compatible with this code package.The main computational developments involved implementing the message passing interface (MPI) to parallelize computationally expensive portions of the programs.This allows us to fully take advantage of modern high performance computing facilities, such as the University of Delaware high performance Caviness community cluster, where we developed and tested our programs.We use the new package of parallel programs to consider two cases of particular experimental significance: Ir 17+ , which was proposed for the development of novel atomic clocks, and Fe 16+ , which has lines essential for plasma diagnostics tools for astrophysics.We found great success in the parallelization efforts, as new programs enabled precision calculations of these systems beyond what was desired.

Theory and Methods
In this section, we describe the methods used in our programs and applications.For any many-electron system, we can divide all electrons into core and valence electrons.In this way, we can separate the electron-electron correlation problem into one describing the valencevalence correlations under the frozen-core approximation, and another describing the core-core and core-valence correlations.In the initial approximation, we start from the solution of the restricted Dirac-Hartree-Fock (HFD) equations in the central field approximation to construct one-electron orbitals for the core and valence electrons.Virtual orbitals can be constructed from B-splines or by other means to account for correlations.The valence-valence correlation problem is solved using the CI method, while core-core and core-valence correlations are included using either MBPT or the all-order method.In either case, we form an effective Hamiltonian in the valence CI space, then diagonalize the effective Hamiltonian using the CI method to find energies and wave functions for the low-lying states.

The CI Method
The CI method is a standard ab initio method for calculating atomic properties of manyelectron systems.In the valence space, the CI wave function is constructed as a linear combination of all distinct states of a specified angular momentum J and parity where the set {Φ i } are Slater determinants generated by exciting electrons from some reference configurations to higher orbitals.
Varying the coefficients c i results in a generalized eigenvalue problem which can be written in matrix form and diagonalized to find the energies and wave functions of the low-lying states.The energy matrix of the CI method can be obtained as a projection of the exact Hamiltonian H onto the CI subspace H CI [12] where E core is the energy of the frozen core, N core is the number of core electrons, h CI i accounts for the kinetic energy of the valence electrons and their interaction with the central field, and V ij accounts for the valence-valence correlations.
Having obtained from CI the many-electron states |J M and |J M with the total angular momenta J, J and their projections M, M , one can form density transition matrix in terms of the one-electron states |nljm [11] ρ = ρ nljm,n l j m |nljm n l j m |, where Here un-primed indices refer to the initial state and primed indices refer to the final state.The many-electron matrix element can then be written as where the trace sums over all quantum numbers (nljm) and (n l j m ), and T L q is the spherical component of the tensor operator of rank L. Using the Wigner-Eckart theorem, one can reduce the many-electron matrix element to where We have developed new parallel programs based on these methods: conf realizes the CI method, which forms the CI Hamiltonian and uses Davidson's algorithm of diagonalization [13] to find low-lying energies and wave functions; dtm calculates reduced matrix elements (7) of one-electron operators by forming the reduced density transition matrices (8).We discuss the challenges of developing the parallel programs in Sec. 3.

Selecting important configurations with valence perturbation theory (CI+PT)
As the number of configurations contributing to the CI wave function grows exponentially with the number of valence electrons, efficient selection of the most important configurations from a set of configurations becomes the main challenge of accurate computations.To significantly reduce the number of configurations, we further developed a method suggested in Ref. [14] to predict important configurations based on a set of configurations with known weights.This method can be used to optimize the CI space by identifying the most important configurations from a list of CI configurations using perturbation theory (PT) [14].
All second-order corrections are taken into account and added to the energy calculated from CI to obtain the total energy, E CI = E 0 + E 1 , while first-order corrections to the wave functions are stored for use in subsequent CI calculations.This process of using CI on a small subspace, calculating corrections via PT, and reordering the list of configurations in descending weights can be repeated several times to form the most optimal CI subspace.Once the energy differences between subsequent CI calculations are relatively small, it can be assumed that convergence has been met.
We've developed a new parallel program conf_pt that realizes the CI+PT method.The parallel version enables computations of extremely large problems, with tests running up to 400 million determinants.

Including core correlations with other methods (CI+MBPT/CI+all-order method)
The CI+MBPT [12,15,16] and CI+all-order [16] methods include core-core and corevalence correlations using the combination of CI with second-order MBPT and the linearized coupled cluster method, respectively.The many-electron Schrödinger equation can be written as where the effective Hamiltonian has the form Here H CI is the CI Hamiltonian described by Eq. 3 and Σ = Σ 1 + Σ 2 is the core-valence correlation potential obtained from either MBPT or the all-order method, where Σ 1 and Σ 2 are the one-and two-electron parts of the core-valence correlation potential, respectively.Eq. 9 is solved by diagonalizing the effective Hamiltonian using the CI method to obtain energies and wave functions of the low-lying states, then matrix elements can be calculated as described in Sec.2.1.

Computational developments
The complete CI/CI+MBPT/CI+all-order code package scheme is illustrated in Fig. 1.Of the programs listed in the scheme, the CI program (conf) and the matrix elements program (dtm) have been parallelized with the message passing interface (MPI) [17,18].The CI+PT program (conf_pt) is an auxiliary program that has also been parallelized with MPI.There are several key features of our recently developed parallel codes over previous serial (non-parallel) programs published in Ref. [11].For all parallel programs, we focused on improving readability and usability, reducing the total memory footprint, implementing dynamic memory allocation where possible, and parallelizing the computationally expensive portions of code using MPI.
The newly developed parallel programs enable computations that were not possible before due to lack of memory or prohibitive computational times.For smaller systems, we can now do calculations with much more accuracy and much faster, allowing a broader range of effects to be investigated in a short period of time.With the original serial version of the programs, calculating properties of very small systems of 1 -2 valence electrons could take up to a few days, while runs for more complex systems with 3 -6 valence electrons could last up to weeks.Some problems involving systems with more than 6 valence electrons were simply intractable with this code due to lack of memory or prohibitively long computation times.One of the main objectives of parallelizing the programs was to reduce the computational time required for calculations of properties of complex atomic systems.

Parallelization of codes
Our main focus will be on the developments of the parallel CI, CI+PT, and density transition matrix programs.Each of these methods require calculating matrix elements between a pair of determinants, which is very difficult to parallelize due to the nature of the problem.In the CI program, the CI Hamiltonian matrix as well as the matrix of the operator J 2

Second-order second-ci
Calculates correction to the Hamiltonian that conf uses.Entire all-order + second-order or just the all-order block can be bypassed.

pol-ci
Calculates polarizabilities matrix is constructed.Construction of each of these matrices are essentially a different variant of the same problem.
We will describe only the parallel implementation to the CI program, but the methods discussed here are applicable in general to the other parallel programs.The main difficulty here is the intrinsic unpredictability of the computational method, i.e. one cannot guess how many non-zero matrix elements there are without explicitly comparing all pairs of determinants.When forming the matrices and calculating the matrix elements, one has to compare electron occupancies between determinants for each configuration.The Slater-Condon rules [19,20] describe the number of operations that are done for each matrix element depending on the number of differences between the pair of determinants forming the matrix element.
The construction of the matrix follows a two-loop structure: an outer loop iterates over the total number of determinants and an inner loop iterates over the total number of configurations.Determinants belonging to each configuration are paired and the value of the matrix element is calculated if there are less than 3 differences between the pair of determinants.We separate the computation into two stages: (i) a comparison stage to find pairs of determinants with less than 3 differences between them and (ii) a calculation stage to calculate the value of the matrix element.After all determinants have been compared, one final loop is done through the total number of non-zero matrix elements to calculate their values.With this implementation, we have achieved near-perfect linear scalability and efficiency in our tests with up to 550 computing cores.It is also possible, and much simpler, to distribute the workload by dividing the outer loop by the number of cores.However, this results in very uneven load-balancing which is exacerbated for larger computations.This is not the case when dealing with smaller matrices such as in the case of the density and transition matrices.
In the serial implementation, non-zero matrix elements were written to disk due to lack of memory of the computers used when the codes were first developed.However with modern computers, we are able to store the non-zero matrix elements in memory, which also has the advantage of reducing computation time due to the removal of file input and output (I/O) to disk.The serial program wrote each matrix element to disk with 24 bytes: 8 bytes for the counter, 4 bytes for each index, and 8 bytes for the value of the non-zero matrix element.The parallel program removes the redundant 8 bytes for the id, reducing the memory requirement of storing the Hamiltonian matrix by 33%.
Since the matrix elements of the CI Hamiltonian are distributed across cores, the Davidson procedure had to be modified to take advantage of the parallelization.The Davidson procedure diagonalizes the CI Hamiltonian matrix to find energies and wave functions of the lowlying states.In the present code, only the calculation of matrix-vector products have been parallelized.Since each core holds a portion of the Hamiltonian matrix in memory, the Davidson algorithm had to be modified so each core is only responsible for computing products of their stored matrix elements with the wave function.

Speedup of parallel programs
The performance of the parallel conf program was tested using the Ir 17+ ion.We describe Ir 17+ as the ion with 30 valence electrons in the open 4 f , 4d, and 4p shells with an [8spd f g] basis set.The basis set is designated by the highest principal quantum number for each partial wave included.For example, the [8spd f g] basis set includes 1 − 8s, 2 − 8p, 3 − 8d, 4 − 8 f , and 5 − 8g orbitals.These test calculations included 24 895 relativistic even-parity configurations and 17 431 323 determinants.As reference, the total computation time of the original serial calculation just short of 2 weeks.
In Table 1, we compare runtimes of the formation of the CI Hamiltonian matrix (FormH) and the Davidson iterative procedure (Diag4), as well as the total computational time, for increasing number of computing cores.Comparing large number of cores to the base N=50 case, we see that there is a near-perfect linear scalability up to 500 cores for the FormH subroutine, but see very minimal performance gain for Diag from 200 to 500 cores.The Davidson procedure of our parallel CI program does not scale well since only the calculation of matrix-vector products was parallelized, and a large majority of the procedure remains serial.Since the Davidson procedure typically does not run as long as the formation of the Hamiltonian, the performance of the Davidson procedure was deemed sufficient for our problems, leaving remaining optimizations to a future project.
The efficiency of the total speedup attained with the parallel CI program is about 70 -80% with the number of cores.The speedup achieved with the parallel CI+PT program and the parallel matrix element program is very similar, with the parallel CI+PT program averaging around 75 -85% efficiency and the parallel matrix element program averaging around 80 -90% efficiency with the number of cores.Since the development of these parallel programs, they have been routinely used in our group to calculate various atomic properties of many-electron systems.

Selection of important configurations
In addition to the parallel codes, we've developed algorithms and auxiliary codes for significantly reducing the number of configurations by efficiently identifying dominant configurations from a list of configurations.It is necessary to be able to select the most important configurations for the valence CI space when the dimensionality of the CI problem becomes intractably large.
The first algorithm involves using the CI+PT method to predict important configurations based on a set of already selected configurations with known weights, as discussed in Sec.2.2.Initially, a configuration list with no weights is constructed by allowing excitations from a few basic configurations.A small subspace is chosen from the top of the list to generate initial weights for each of the chosen configurations in the list using CI, then weights of all other configurations are calculated using PT.We then reorder the initial configuration list by descending weights, then repeat this process until convergence is met, i.e. the CI space has been saturated as it has taken into account the most important configurations.The convergence is checked by comparing energy levels computed from consecutive CI and PT procedures.
The second algorithm explores the importance of the configurations based on the correlations between electrons that are present in the configurations.We find that contributions of electrons in orbitals of the same partial wave are very regular.For example, if our calculations of Fe 16+ show that the contribution of the 1s 2 2s2p 5 6s 2 configuration is negligible, then all similar configurations where the two last electrons have higher principle quantum number, e.g.1s 2 2s2p 5 6s7s, 1s 2 2s2p 5 6s8s, 1s 2 2s2p 5 7s 2 , can be omitted.However, if the 1s 2 2s2p 6 6 f configuration has a large weight, then other configurations with 7 f and 8 f electrons should be included.

Applications 4.1. Optical clocks based on highly charged ions
Highly charged ions (HCI) such as Ir 17+ are of particular interest for the development of novel atomic clocks due to its very high sensitivity to the variation of the fine structure constant α and related dark matter searches [21][22][23][24].There are many advantages to creating optical clocks with HCI, such as enhanced sensitivity to the variation of α, heavily suppressed systematic effects, and estimated potential clock uncertainty beyond the 10 −18 limit [25][26][27][28].Recent developments in quantum logic techniques for HCI spectroscopy have made rapid progress in the development of HCI possible [29].
In the case of Ir 17+ , theoretical calculations are particularly difficult due to atomic configurations with holes in the 4 f shell leading to a very large uncertainty in prior theoretical results, with over 50% for the predicted clock transition energy [30,31].No clock transitions or any E1 transitions have been found despite over 6 years of experimental effort.These transitions were expected to be observed in recent experiments since their predicted transition rates were well within experimental capabilities; especially since the M1 transitions with much smaller transition rates have been observed.The lack of observations for the E1 transitions brought serious concerns over the accuracy of theoretical predictions, even to the point of doubt of approximate spectral range.With our newly developed parallel programs, we resolved this problem and for the first time definitively demonstrated the ability to converge CI in systems with a few holes in the 4 f shell and place uncertainty bounds on our results.Our results explain the lack of observations of the E1 transitions and provide a pathway towards detection of clock transitions based on highly charged ions.Here we will only summarize the results of our application of the new parallel code to Ir 17+ .A detailed discussion can be found in Ref. [32].
We find that the best initial approximation is achieved by solving restricted DHF equations with partially filled shells, namely [1s 2 . . .4d 10 ]4 f 13 5s.The hybrid approaches described in Sec.2.3 that incorporates core excitations into the CI by constructing an effective Hamiltonian using MBPT or the all-order method can not be used with such an initial approximation.Therefore, we treat the inner shells with the CI method, leading to an exponential increase in the number of configurations.We found that while the weights of most configurations are small, the number of important configurations were very large.
Our new parallel programs allowed us to increase the valence space from 14 electrons to all 60, and to include 250 000 configurations, resulting in 133 million Slater determinants, a factor of 20 increase to what was previously feasible.In order to definitively ensure the reliability of the theoretical calculations, we considered all possible contributions that may affect the accuracy of the computations and ensure the convergence in all numerical parameters, including the number and type of configurations included in the CI, the size of the orbital basis set used to construct CI configurations, inclusion of the quantum electrodynamics (QED) corrections, and inclusion of the Breit corrections beyond the Gaunt term.We found that by far the largest effect comes from the inclusion of the inner electron shells into the CI.
We start with the most straightforward CI computation that includes single and double excitations from the 4 f and 5s valence shells, similar to Ref [23].The excitations are allowed to each of the basis set orbitals up to [7spd f g].Next, we allow all 4d electrons into the valence space and allow excitations of any of the 24 electrons from the 4d 10 4 f 13 5s shells to the same basis set orbitals up to [7spd f g].We find drastic changes in the energies of the E1 transitions when excitations are allowed from the 4d shell.Due to such large contributions, we continued to include more and more electrons of the inner shells into the CI valence space, until all 60 electrons have been included.With the same [7spd f g] basis set, both single and double excitations are allowed from the 4 f , 4d, 4p, 4s and 3d shells, and only single excitations are included for all other shells.We tested that the double excitation contribution is small for these inner shells and can be omitted at the present level of accuracy.The results, obtained with different number of shells included in the CI valence space, are given in Table 2. Contributions from increased size of the basis set, triple excitations, full Breit, and QED were found to be small at the present level of accuracy.Three basis sets of increasing sizes [7spd f g], [8spd f g], and [10spd f g] were used to test basis set convergence.
While allowing excitations from the 4d shell drastically reduced the energies of the E1 transitions, allowing excitations from the 4p shell drastically reduced several E1 matrix elements and rates to well below the detector ability.We have identified several other transitions at different wavelengths for the future E1 transition search.As soon as any E1 transition is detected, we will be able to obtain a better prediction for the proposed clock transition wavelength.
Table 2. Energies of Ir 17+ (in cm −1 ) obtained using CI with increasing number of open shells.Only configurations obtained by exciting 4 f and 5s electrons are included in the "5s4 f only" column.Contributions from exciting electrons from the 4d shell are given in the column labeled "4d contr.",and contributions of all other shells are given separately in the next columns.The results with all 60 electrons correlated by the CI are listed in the column "All shells open".Sum of all other corrections is given in the column labeled "Other" -see Ref. [32]  The largest Ir 17+ run with the latest version of the parallel CI program included 96 622 configurations and 58 224 918 determinants, while the largest number of determinants included in the old serial CI program was about 5 000 000.The largest run calculated and stored over 100 billion Hamiltonian matrix elements, with the whole run requiring a total of 2 880 GB of memory, which was the maximum amount available to us at the time.Distributed across 80 cores, this run took 2 days and 18 hours to complete.A larger number of cores were not used due to the large amount of memory required to store the basis set, the Hamiltonian matrix, and subsequent arrays for the Davidson procedure.

Calculation of the 3C/3D line intensity ratio in Fe XVII
Some of the brightest lines of the spectra of many hot astrophysical objects arise from Fe 16+ around 15 Å, namely the resonance line 3C ([(2p 5 ) 1/2 3d 3/2 ] J=1 → [2p 6 ] J=0 ) and the intercombination line 3D ([(2p 5 ) 3/2 3d 5/2 ] J=1 → [2p 6 ] J=0 ).They are crucial for plasma diagnostics of electron temperatures, elemental abundances, ionization conditions, velocity turbulences, and opacities.For the last four decades, there has been a persistent discrepancy between the observed intensity ratios and advanced plasma models, diminishing the utility of high-resolution X-ray observations.A recent experiment measured the most accurate 3C/3D oscillator strength to date, in an attempt to explain this puzzle [33].
We carried out a precision calculation with our newly developed MPI version of the CI program, allowing us to drastically increase the number of included configurations to over 230 000.This calculation correlated all 10 electrons, included full Breit and QED corrections, and predicted the transition rates with 1 -2% accuracy.Our calculations ruled out incomplete inclusion of the electronic correlations in theoretical calculations as the potential explanation of the puzzle.A detailed study of the latest experiment and theoretical work done can be found in Ref. [33].
We started with all possible single and double excitations from the 2s 2 2p 6 , 2s 2 2p 5 3p even and 2s 2 2p 5 3s, 2s 2 2p 5 3d, 2s2p 6 3p, 2s 2 2p 5 4d, 2s 2 2p 5 5d odd configurations, correlating 8 electrons with a short [5spd f 6g] basis set.Separate calculations were done to establish the effects of triple and quadruple excitations, as well as full correlation with the 1s 2 shell.We found negligible corrections to both energies and matrix elements as illustrated by Table 3.The differences between our theoretical energies and experimental values were found to be less than those of the NIST database by 3000 cm −1 .The energies from the revised analysis of the Fe 16+ spectra given in Table 3 are estimated to be accurate to about 90 cm −1 .The last line of Table 3 shows the difference of the 3C and 3D energies in eV, with the final value 13.44(5) eV.

Conclusion and further developments
We have developed a new parallel atomic structure code package that has opened a lot of new possibilities of high-precision calculations of atomic properties of various complex many-electron systems.The efficiency and accuracy of our programs have been validated by solving many problems, involving astrophysics, in metrology of highly charged ions, and negative ions.
Although our parallel programs have displayed great success in efficiency and accuracy, there is still plenty of developmental work to be done.This includes optimization of current parallel algorithms, parallelization of remaining problematic serial codes (e.g. the Davidson procedure in the CI program), optimization of memory usage, completion of documentation, including check-pointing, making the codes user-friendly, and more.
The programs developed here are part of a larger project developing an online portal for high-precision atomic data and computation.The portal will feature a database of all high-precision data calculated in the last couple of decades, as well as wave functions of large number of atoms and ions stored on a back-end server.Users will be able to request transition properties and polarizabilities for systems not in the database without having to download any codes or learn any inputs.Requested data will be generated automatically through code executions on a back-end server and then updated on the database.All codes will also be released to the public, optimized and user-friendly.
are constructed, in the CI+PT program, the CI and PT blocks of the Hamiltonian matrix are constructed, and in the density transition matrix elements program, the density or transition hfd Calculates Dirac-Fock orbitals (initial approximation) bass Constructs basis set needed by all codes

Figure 1 .
Figure 1.The scheme of the CI/CI+MBPT/CI+all-order code package.
code package skips the all-order block.CI+all-order code package includes all codes.CI code package skips both parts.

Table 1 .
The runtime of subroutines FormH and Diag4 of the parallel conf program for increasing number of compute cores and the speedups of the parallel code are presented in seconds (s) for increasing number of cores N, relative to the code ran with 50 cores (N=50).These large test runs were done with Ir 17+ with 24 895 relativistic configurations and 17.4 × 10 6 determinants.Note that the total times include all serial subroutines outside of FormH and Diag.

Table 3 .
Contributions to the energies (in cm −1 ) of Fe 16+ calculated with increased basis set sizes and number of configurations.Contributions from triple excitations, excitations from the 1s 2 shell, and QED contributions are given in their respective columns.The last line gives the 3C − 3D energy difference in eV.