Article

GPU-Accelerated Fock Matrix Computation with Efficient Reduction

1 School of Advanced Science and Engineering, Hiroshima University, Higashi-Hiroshima 739-8527, Japan
2 Computing Laboratory, Fujitsu Limited, Kawasaki 211-8588, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4779; https://doi.org/10.3390/app15094779
Submission received: 31 March 2025 / Revised: 21 April 2025 / Accepted: 23 April 2025 / Published: 25 April 2025
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))

Abstract: In quantum chemistry, constructing the Fock matrix is essential to compute Coulomb interactions among atoms and electrons and, thus, to determine electron orbitals and densities. In fundamental frameworks of quantum chemistry such as the Hartree–Fock method, the iterative computation of the Fock matrix is a dominant process, constituting a critical computational bottleneck. Although the Fock matrix computation has been accelerated by parallel processing using GPUs, the issue of performance degradation due to memory contention remains unresolved. This is due to frequent conflicts of atomic operations accessing the same memory addresses when multiple threads update the Fock matrix elements concurrently. To address this issue, we propose a parallel algorithm that efficiently distributes the atomic operations and significantly reduces memory contention by decomposing the Fock matrix into multiple replicas, allowing each GPU thread to contribute to a different replica. Experimental results using a relevant set/configuration of molecules on an NVIDIA A100 GPU show that our approach achieves up to a 3.75× speedup in Fock matrix computation compared to conventional high-contention approaches. Furthermore, our proposed method can also be readily combined with existing implementations that reduce the number of atomic operations, leading to a 1.98× improvement.

1. Introduction

Quantum chemistry is a field of computational chemistry that theoretically elucidates the energy and concomitant electronic structure of chemical systems based on the principles of quantum mechanics. By treating the behavior of electrons as wave functions, it is possible to simulate the kinematics and dynamics of molecular structures with an accuracy unattainable through approaches rooted in classical mechanics and phenomenological observations of chemical phenomena. Quantum chemistry has opened the door to a wide range of applications, including (1) drug discovery screening [1,2], to narrow down drug candidates from a vast set of compounds across diverse libraries, and (2) catalyst development [3,4], to predict the emerging molecular activities/properties and chemical reactions arising from atom/molecule interactions. In current practice, however, quantum chemical calculations remain largely confined to supplementing observations derived from phenomenological/experimental results, and the substantial computational cost behind the quantum chemistry machinery hinders broader practical applications. In quantum chemistry, the Hartree–Fock method [5,6] and its post-Hartree–Fock derivatives [7] provide fundamental and established frameworks for theoretically solving the Schrödinger equation. These methods require computational cost proportional to at least the fourth power of the atom count in the input system and have undergone decades of development focused on efficient implementation. In the Hartree–Fock method in particular, the evaluation of molecular integrals [8] describing interatomic and interelectronic interactions, and the subsequent construction of the Fock matrix, dominate the computational cost.
The Fock matrix encodes not only the information on the Hamiltonian in the Hartree–Fock mean-field approximation, but also the interactions, such as the Coulomb energy and the kinetic energy between atoms and electrons. The Hartree–Fock method iteratively updates and diagonalizes the Fock matrix until the electron orbitals and energies converge. Each element of the Fock matrix can be calculated by accumulating values of two-electron repulsion integrals (ERIs) onto an initial value consisting of one-electron integrals, which account for factors such as the Coulomb and the kinetic energy.
In the restricted Hartree–Fock method for closed-shell systems, the update of the Fock matrix of size M × M is defined [9] by
F_{μν} = F_{μν} + 4 D_{λσ} (μν|λσ),    F_{λσ} = F_{λσ} + 4 D_{μν} (μν|λσ),
F_{μλ} = F_{μλ} − D_{νσ} (μν|λσ),    F_{νσ} = F_{νσ} − D_{μλ} (μν|λσ),
F_{μσ} = F_{μσ} − D_{νλ} (μν|λσ),    F_{νλ} = F_{νλ} − D_{μσ} (μν|λσ),    (1)
where F_{μν} denotes an element of the Fock matrix, D_{μν} denotes an element of the density matrix, and (μν|λσ) represents the value of an ERI (1 ≤ μ, ν, λ, σ ≤ M). ERIs literally describe the repulsive force between two electrons, and each value is added to six locations of the Fock matrix, weighted by the corresponding density matrix elements. In quantum chemistry, the computational cost typically scales with the number M of basis functions representing the electron orbitals. Although the Fock matrix itself is a two-dimensional matrix, its construction requires consideration of all O(M^4) ERIs, leading to a significant computational burden. In addition to evaluating the ERIs, efficient reduction of these values based on Equation (1) is crucial for accelerating the Fock matrix construction.
In quantum chemistry, contemporary efforts have focused on accelerating computations with high parallelism, such as ERI evaluation and Fock matrix construction, using GPUs [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]. Although GPUs offer high-performance machinery with thousands of cores tailored for matrix computations, capitalizing on their capabilities requires nontrivial and efficient implementations that address GPU-specific architectures and programming challenges. Among these, memory contention arising from parallel implementations of Fock matrix construction has been a persistent challenge. In the parallel processing of Equation (1), multiple threads concurrently update a single Fock matrix, leading to frequent conflicts in memory accesses at the same addresses. Although data inconsistency due to conflicts can be prevented with mutual exclusion, atomic additions on the same address are executed sequentially, significantly degrading overall parallel performance. Previous studies [15,16,18,20,27] have mitigated the memory contention issue by accumulating updates within thread-local buffers, thereby reducing the number of atomic operations to the Fock matrix in global memory. However, the number of ERIs aggregated within each thread is limited, even with these thread-local reduction approaches, and O(M^4) atomic operations are still required. This leads to severe memory contention because O(M^4) operations concentrate on the limited O(M^2) memory space of the Fock matrix.

Contributions

To address the memory contention problem in current Fock matrix construction frameworks, we propose a GPU acceleration scheme that efficiently distributes the destination addresses of the atomic operations issued by multiple GPU threads. Our approach capitalizes on a distributed atomic reduction scheme and differs from existing approaches in that it accelerates Fock matrix updates without reducing the number of atomic operations. Concretely speaking, our primary contributions are as follows:
  • Distributed atomic reduction across replicated Fock matrices:
We propose a parallelization method that accelerates the Fock matrix computation using an efficient reduction scheme. Parallel computation of the Fock matrix often suffers from overhead due to memory contention arising from multiple threads simultaneously attempting to write to the same address using atomic operations. To address the arising memory contention problem, we distribute the targets of the atomic addition operations in Equation (1). As such, we decompose the Fock matrix into multiple replicas located at distinct memory addresses, allowing each thread to contribute to a different replica. Then, each replica serves as an intermediate buffer of the same size as the original matrix, and the desired Fock matrix is constructed by summing all the corresponding replicas. Each thread determines the target replica based on its own index, significantly reducing the frequency of the memory contention at the same address compared to concentrating the atomic additions on a single Fock matrix.
  • Hybrid approach with thread-local reduction:
Furthermore, we demonstrate that the proposed reduction scheme can further accelerate Fock matrix construction built on existing thread-local reduction techniques, enabling a hybridized Fock matrix construction scheme. The memory contention that cannot be avoided even with existing approaches that minimize the number of atomic operations can be further reduced by applying our replicated Fock matrices approach. The experimental results suggest that our proposal does not conflict with existing implementations of the Fock matrix construction and can enhance their computational performance by alleviating their bottlenecks.
  • Computational experiments with relevant molecules:
We implemented the above two reduction methods on an NVIDIA A100 GPU and evaluated the performance landscape using a relevant set/configuration of molecules. Our proposed replicated Fock matrix approach achieved up to a 3.75× speedup compared to the case where atomic additions concentrate on a single Fock matrix. Furthermore, the hybrid approach combining our proposed reduction scheme with the existing thread-local reduction method rendered an additional performance improvement of up to 1.98×.
This paper is organized as follows: Section 2 briefly summarizes the Hartree–Fock method, outlining the key concepts and major computational bottlenecks behind the Schrödinger equation and the Fock matrix computation. Section 3 reviews previous studies addressing the memory contention problem in Fock matrix construction, outlines the contemporary challenges, and highlights the key aspects/motivations behind our proposal. Section 4 describes the main proposal of this paper, the distributed atomic reduction method using replicated Fock matrices. Section 5 evaluates the performance landscape of our proposed approach, as well as of the hybridized Fock matrix construction scheme combining it with the existing thread-local reduction technique. Finally, Section 6 concludes and summarizes the major findings of this study.

2. Background

This section describes the basic approximation method for numerically solving the Schrödinger equation in quantum chemistry. We focus on the Hartree–Fock method, where the Fock matrix computation is a major bottleneck. Here, we provide the formulation for the restricted Hartree–Fock method, which assumes a closed-shell system where all electrons are paired in occupied orbitals.

2.1. Hartree–Fock Method

The Schrödinger equation,
Ĥ Ψ = E Ψ,    (2)
is a fundamental equation of quantum mechanics and primarily describes the behavior of electrons in quantum chemistry. The Hamiltonian Ĥ is an operator that accounts for the kinetic energy and Coulomb energy of the atomic nuclei and electrons, and E represents the energy of a given system. Equation (2) takes the form of an eigenvalue equation, and by solving it with the atoms and their coordinates of the chemical system as input, we obtain the wave function Ψ as output. While the wave function describes the orbitals of electrons within the system, Equation (2) involves the multivariate function Ψ(r_1, r_2, …, r_{N_e}) of the positions r of the N_e electrons, making it a many-body problem.
In the Hartree–Fock method, the N_e-electron wave function Ψ is approximated using single-electron wave functions ϕ, which simplifies Equation (2) into the Hartree–Fock equation for M molecular orbitals
F̂ ϕ_i(r) = ε_i ϕ_i(r),    (3)
where F̂ is the Fock operator, ε_i represents the orbital energies, and the molecular orbital ϕ_i is the wave function of a single electron in a given system or molecule (1 ≤ i ≤ M). However, Equation (3) is an integro-differential equation, making it difficult to directly determine the molecular orbitals ϕ_i.
Therefore, Equation (3) is transformed into an algebraic equation by introducing the Linear Combination of Atomic Orbitals (LCAO) approximation
ϕ_i(r) = C_{1i} χ_1(r) + C_{2i} χ_2(r) + ⋯ + C_{Mi} χ_M(r),    (4)
which expresses the molecular orbitals ϕ_i as a linear combination of atomic orbitals χ_μ (1 ≤ μ ≤ M). Equation (4) allows us to reduce the complex problem of finding molecular orbitals to determining the orbital coefficients C_{μi}. By expanding the molecular orbitals using Equation (4), the Hartree–Fock equation is reduced to the Roothaan–Hall equation
F C = ε S C,    (5)
where S is the overlap matrix and F is the Fock matrix, the central focus of this paper. Equation (5) is a generalized eigenvalue problem and can be expressed in matrix form using M × M matrices as follows:   
\begin{pmatrix} F_{11} & ⋯ & F_{1M} \\ ⋮ & ⋱ & ⋮ \\ F_{M1} & ⋯ & F_{MM} \end{pmatrix} \begin{pmatrix} C_{11} & ⋯ & C_{1M} \\ ⋮ & ⋱ & ⋮ \\ C_{M1} & ⋯ & C_{MM} \end{pmatrix} = \begin{pmatrix} ε_1 & & O \\ & ⋱ & \\ O & & ε_M \end{pmatrix} \begin{pmatrix} S_{11} & ⋯ & S_{1M} \\ ⋮ & ⋱ & ⋮ \\ S_{M1} & ⋯ & S_{MM} \end{pmatrix} \begin{pmatrix} C_{11} & ⋯ & C_{1M} \\ ⋮ & ⋱ & ⋮ \\ C_{M1} & ⋯ & C_{MM} \end{pmatrix}.    (6)
As shown in Equation (6), the number of orbitals M determines the computational cost of the Hartree–Fock method, and its value depends on the atom count and atomic numbers in a chemical system. By solving the Roothaan–Hall equation using atomic orbitals approximated with known trial functions, the orbital coefficients C μ i and orbital energies ε i are obtained as eigenvectors and eigenvalues, respectively. The Fock matrix is defined using several molecular integrals and orbital coefficients,
F_{μν} = h_{μν} + ∑_{λ=1}^{M} ∑_{σ=1}^{M} D_{λσ} [ 2 (μν|λσ) − (μλ|νσ) ] = h_{μν} + 2 J_{μν} − K_{μν},    (7)
J_{μν} = ∑_{λ=1}^{M} ∑_{σ=1}^{M} D_{λσ} (μν|λσ),    (8)
K_{μν} = ∑_{λ=1}^{M} ∑_{σ=1}^{M} D_{λσ} (μλ|νσ),    (9)
where h_{μν} denotes a value of the one-electron integrals, calculated as the sum of the kinetic energy integrals and nuclear attraction integrals [29]. The two-dimensional matrix h is calculated as a preprocessing step in the Hartree–Fock method. The ERIs (μν|λσ) describe the repulsive force between electron pairs. Evaluating (μν|λσ) and reducing them into the J- and K-matrices constitutes the computational bottleneck in constructing the Fock matrix. In the restricted Hartree–Fock method, the density matrix D is defined using the orbital coefficients, expressed as
D_{λσ} = ∑_{i=1}^{N_e/2} C_{λi} C_{σi},    (10)
where N e denotes the number of electrons in a given system.
Although the Roothaan–Hall equation appears straightforward to solve, it has a complex form where the Fock matrix depends on the desired eigenvectors C . Therefore, Equation (5) is solved using a Self-Consistent Field (SCF) procedure, which involves initializing the orbital coefficients C μ i with some values and performing iterative calculations. The high computational cost of the Hartree–Fock method stems primarily from the iterative calculation of the Fock matrix in the SCF procedure. The Roothaan–Hall equation is solved using the SCF procedure as follows:
Step 1: Initialize the orbital coefficients C.
Step 2: Calculate the Fock matrix F and update C by solving Equation (5).
Step 3: Iterate Step 2 until C and the energy converge.
While the density matrix D (and the orbital coefficient matrix C) changes with each SCF iteration, the values of the molecular integrals S_{μν}, h_{μν}, and (μν|λσ) remain constant during the procedure. Therefore, evaluating these integrals once and storing them in advance avoids redundant computations. However, storing all ERI values involves an O(M^4) space complexity, which can exceed the available memory or storage capacity for large molecules. To address this issue, a method that allows redundant computations is widely used to reduce the required space complexity from O(M^4) to O(M^2) by evaluating the ERIs on the fly in every SCF iteration. This approach is known as Direct-SCF [9], and it allows us to handle larger chemical systems compared to Stored-SCF, which requires storing all molecular integrals in memory. On the other hand, constructing the Fock matrix in the SCF procedure becomes computationally more demanding. In this paper, we focus on acceleration techniques for the Fock matrix computation in Direct-SCF.
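To make the storage gap concrete, here is a back-of-the-envelope sketch (our illustration, not from the paper; it assumes double-precision values, a hypothetical M = 1000, and ignores the eight-fold symmetry) comparing the O(M^4) ERI tensor required by Stored-SCF with a single O(M^2) matrix used in Direct-SCF:

```cuda
#include <cstdio>

int main() {
    // Hypothetical basis-set size used only for this estimate.
    const double M = 1000.0;
    const double bytes_per_value = 8.0;              // double precision
    const double gib = 1024.0 * 1024.0 * 1024.0;

    // Stored-SCF: keep all M^4 ERI values in memory.
    double stored_gib = M * M * M * M * bytes_per_value / gib;
    // Direct-SCF working set: one M x M matrix (e.g., the Fock matrix).
    double matrix_gib = M * M * bytes_per_value / gib;

    std::printf("All ERIs (Stored-SCF): %8.1f GiB\n", stored_gib);  // ~7450.6 GiB
    std::printf("One M x M matrix:      %8.4f GiB\n", matrix_gib);  // ~0.0075 GiB
    return 0;
}
```

Even at this modest basis-set size, the full ERI tensor would occupy several terabytes, whereas a single M × M matrix fits in a few megabytes, which is why the ERIs are re-evaluated on the fly.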

2.2. Fock Matrix Computation

Constructing the Fock matrix is computationally dominated by the evaluation of ERIs and the reduction of these integral values. This subsection explains the input and computational complexity of ERIs, and the update equations for the Fock matrix in Direct-SCF.

2.2.1. Two-Electron Repulsion Integrals

Two-electron repulsion integrals (ERIs) describe the Coulomb interactions between two electrons in a given chemical system and are defined as a double integral over basis functions:
(μν|λσ) = ∬ χ_μ(r_1) χ_ν(r_1) (1/r_{12}) χ_λ(r_2) χ_σ(r_2) dr_1 dr_2,    (11)
where r_{12} denotes the Euclidean distance between electrons 1 and 2. The basis functions χ(r) are trial functions for atomic orbitals at the 3-dimensional coordinates r = (x, y, z), representing electron orbitals around an atom. The chemical notation (μν|λσ) denotes an ERI over the four basis functions χ_μ, χ_ν, χ_λ, and χ_σ (1 ≤ μ, ν, λ, σ ≤ M), corresponding to a single integral value. To simplify the calculation of molecular integrals, the basis functions are typically given as linear combinations of Gauss-type orbitals,
χ_μ(r) = ∑_{a=1}^{K_μ} w_a G_{lmn}(r, A, ρ_a),    (12)
where K μ denotes the number of Gauss-type orbitals and w a are the weight coefficients for the linear combination. The Gauss-type orbital is defined as
G_{lmn}(r, A, ρ_a) = c · (x − A_x)^l (y − A_y)^m (z − A_z)^n exp(−ρ_a |r − A|^2),    (13)
where c is a normalization constant, A = (A_x, A_y, A_z) denotes the 3-dimensional coordinates of a nucleus, l, m, n are non-negative integers representing angular momenta, and ρ_a is the exponent that determines the radial extent of the orbital. The orbital shape is determined by the total angular momentum L = l + m + n and is labeled as an s-, p-, or d-orbital, corresponding to L = 0, 1, 2, respectively. The set of parameters K, w_a, L, and ρ_a is known as a basis set. Various basis sets have been proposed and are available in databases such as Basis Set Exchange [30]. In quantum chemical calculations, an appropriate basis set is selected based on the target system and application, with common examples including the Pople basis sets (e.g., STO-NG) and the Dunning basis sets (e.g., cc-pVXZ).
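As a concrete illustration of Equation (13), the following sketch evaluates a single Cartesian Gauss-type orbital at a point. It is our own example rather than the authors' code, and the normalization constant c is assumed to be precomputed and passed in.

```cuda
#include <cmath>
#include <cstdio>

// G_lmn(r; A, rho) = c * (x - Ax)^l * (y - Ay)^m * (z - Az)^n * exp(-rho * |r - A|^2)
double gto_value(double c, int l, int m, int n,
                 const double r[3], const double A[3], double rho) {
    const double dx = r[0] - A[0], dy = r[1] - A[1], dz = r[2] - A[2];
    const double r2 = dx * dx + dy * dy + dz * dz;
    return c * std::pow(dx, l) * std::pow(dy, m) * std::pow(dz, n) * std::exp(-rho * r2);
}

int main() {
    const double A[3] = {0.0, 0.0, 0.0};   // nucleus at the origin (illustrative)
    const double r[3] = {0.5, 0.2, 0.0};   // evaluation point (illustrative)
    // A p_x-type primitive: (l, m, n) = (1, 0, 0); c and rho are placeholder values.
    std::printf("G_100(r) = %.6f\n", gto_value(1.0, 1, 0, 0, r, A, 1.2));
    return 0;
}
```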
By expanding Equation (11) with Equation (12), the ERI over basis functions can be calculated as a weighted sum of ERIs over Gauss-type orbitals,
(μν|λσ) = ∑_{a=1}^{K_μ} ∑_{b=1}^{K_ν} ∑_{c=1}^{K_λ} ∑_{d=1}^{K_σ} w_a w_b w_c w_d [ab|cd].    (14)
In contrast to the contracted ERI ( μ ν | λ σ ) , [ a b | c d ] is referred to as a primitive ERI, similarly defined by four Gauss-type orbitals,
[ab|cd] = ∬ G_a(r_1) G_b(r_1) (1/r_{12}) G_c(r_2) G_d(r_2) dr_1 dr_2,    (15)
where G_a(r) is a shortened form of the Gauss-type orbital G_{lmn}(r, A, ρ_a). The primitive ERI [ab|cd] is often calculated using recurrence relations such as the McMurchie–Davidson method [31], the Obara–Saika method [32], and the Head–Gordon–Pople method [33]. With any of these recurrence schemes, Equation (15) is known to be expandable into a linear combination of a generalized mathematical function called the Boys function [34]:
[ab|cd] = ∑_{j=0}^{L_a + L_b + L_c + L_d} c_j F_j(X),    (16)
where the expansion coefficients c j and X are determined by the orbital center A and exponent ρ a , and the total angular momentum L. The Boys function is given as an integral function as follows:
F_j(X) = ∫_0^1 t^{2j} exp(−X t^2) dt,    (17)
and several numerical evaluation methods have been devised for its efficient computation [21,35]. In this work, we compute the primitive ERI [ a b | c d ] using kernels expanded to the form of Equation (16) through the Head–Gordon–Pople method [25].
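As a self-contained numerical check of Equation (17) (our own illustration, not one of the optimized evaluation schemes cited above), the sketch below approximates F_j(X) with a composite Simpson rule and compares F_0 against its closed form F_0(X) = (1/2)√(π/X) erf(√X):

```cuda
#include <cmath>
#include <cstdio>

// Composite Simpson approximation of F_j(X) = \int_0^1 t^{2j} exp(-X t^2) dt.
double boys_simpson(int j, double X, int intervals) {
    auto f = [&](double t) { return std::pow(t, 2 * j) * std::exp(-X * t * t); };
    const double h = 1.0 / intervals;           // intervals must be even
    double sum = f(0.0) + f(1.0);
    for (int i = 1; i < intervals; ++i)
        sum += (i % 2 ? 4.0 : 2.0) * f(i * h);
    return sum * h / 3.0;
}

int main() {
    const double X = 2.5;
    const double pi = std::acos(-1.0);
    const double numeric = boys_simpson(0, X, 1000);
    const double closed  = 0.5 * std::sqrt(pi / X) * std::erf(std::sqrt(X));  // analytic F_0
    std::printf("F_0(%.1f): Simpson = %.10f, closed form = %.10f\n", X, numeric, closed);
    return 0;
}
```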
The evaluation of ERIs requires calculating the integral values for all possible combinations of four basis functions, resulting in a computational complexity of O(M^4) for M basis functions. The value of M is determined by factors such as the number of atoms in a given system, their atomic numbers, and the chosen basis set. However, due to the symmetry of basis functions in Equation (11) as follows:
(μν|λσ) = (νμ|λσ) = (μν|σλ) = (νμ|σλ) = (λσ|μν) = (λσ|νμ) = (σλ|μν) = (σλ|νμ),    (18)
the number of unique integral values is not M^4 but rather M(M+1)(M^2+M+2)/8, i.e., approximately 1/8 of the total.
Despite the symmetry in Equation (18), the evaluation of ERIs requires computational effort of the order of O(M^4). However, screening techniques have been developed to significantly reduce the number of integrals requiring computation. Among these, a prominent approach involves calculating an upper bound for each ERI based on the Schwarz inequality [36], which is a simple yet powerful technique widely used in various libraries and applications in quantum chemistry [37,38,39,40,41,42]. If this bound is negligibly small, the corresponding integral evaluation is omitted. By applying the Schwarz inequality to Equation (11), an upper bound for the absolute value of each ERI is given by
|(μν|λσ)| ≤ √( (μν|μν) (λσ|λσ) ).    (19)
If the Schwarz upper bound falls below a predefined cutoff threshold θ, we can approximate the ERI as (μν|λσ) ≈ 0 and skip the complex integral computation. The factors of the upper bound are ERIs over basis function pairs, and since the number of such pairs scales only as O(M^2), all factors required for the screening can be computed at low cost. Once these factors are stored in memory, the significance of the desired ERI over a quartet of basis functions can be determined simply by reading the two factors (μν|μν) and (λσ|λσ). The cutoff threshold θ controls the trade-off between accuracy and computational cost: while a larger value significantly reduces computational expense, it can also introduce non-negligible errors in computed quantities such as the orbital coefficients C and the energies of a system. The upper bound using the Schwarz inequality can be calculated not only for contracted ERIs but also for primitive ERIs as follows:
|[ab|cd]| ≤ √( [ab|ab] [cd|cd] ),    (20)
where the factors [ab|ab] and [cd|cd] are calculated as primitive ERIs over a pair of Gauss-type orbitals. By utilizing the upper bound of each primitive ERI given by Equation (20), we can perform a finer screening of integrals that cannot be eliminated by Equation (19) alone.
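The screening logic described above can be summarized in a small host-side sketch (our illustration with made-up bound factors; in a real code the factors would be the precomputed √((μν|μν)) values):

```cuda
#include <cstdio>
#include <vector>

// Toy Schwarz screening: Q[p] holds sqrt((mu nu | mu nu)) for basis pair p.
// A quartet (pq, rs) is skipped when Q[pq] * Q[rs] falls below theta.
int main() {
    const double theta = 1.0e-12;                      // cutoff threshold
    std::vector<double> Q = {3.2e-1, 5.0e-7, 1.1e-9};  // illustrative pair factors
    int computed = 0, skipped = 0;
    for (std::size_t pq = 0; pq < Q.size(); ++pq)
        for (std::size_t rs = 0; rs < Q.size(); ++rs) {
            if (Q[pq] * Q[rs] < theta) { ++skipped; continue; }  // bound of Eq. (19)
            ++computed;  // here the full ERI (mu nu | lam sig) would be evaluated
        }
    std::printf("computed %d quartets, skipped %d\n", computed, skipped);
    return 0;
}
```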

2.2.2. Update of the Fock Matrix

We restate the definition of the Fock matrix for M basis functions,
F_{μν} = h_{μν} + ∑_{λ=1}^{M} ∑_{σ=1}^{M} D_{λσ} [ 2 (μν|λσ) − (μλ|νσ) ] = h_{μν} + 2 J_{μν} − K_{μν},    (21)
J_{μν} = ∑_{λ=1}^{M} ∑_{σ=1}^{M} D_{λσ} (μν|λσ),    (22)
K_{μν} = ∑_{λ=1}^{M} ∑_{σ=1}^{M} D_{λσ} (μλ|νσ).    (23)
During the SCF procedure, while the molecular integrals h_{μν} and (μν|λσ) remain constant, the Fock matrix F is updated in each iteration according to the density matrix D. Equation (21) is computationally demanding not only due to the evaluation of ERIs itself, but also because of the reduction of O(M^4) ERIs into the O(M^2) memory space of the Fock matrix. The efficient implementation of the ERI reduction, particularly using GPUs, is challenging and has been a subject of ongoing discussion in previous research over several years.
In Direct-SCF, where ERIs are computed on the fly, an update equation of the Fock matrix is widely used rather than the definition itself. Equation (21) can be straightforwardly implemented in Stored-SCF, where all ERI values are stored in memory. However, in Direct-SCF, redundant computations frequently occur, such as the overlap between (μν|λσ) for the Coulomb J-matrix and (μλ|νσ) for the exchange K-matrix. Therefore, an update scheme for the Fock matrix that eliminates redundant integral calculations has been devised by considering the eight-fold symmetry of ERIs. Applying Equation (18) to Equation (21), the contribution of each unique ERI (μν|λσ) to the Fock matrix can be calculated as
F_{μν} = F_{μν} + 4 D_{λσ} (μν|λσ),    F_{λσ} = F_{λσ} + 4 D_{μν} (μν|λσ),
F_{μλ} = F_{μλ} − D_{νσ} (μν|λσ),    F_{νσ} = F_{νσ} − D_{μλ} (μν|λσ),
F_{μσ} = F_{μσ} − D_{νλ} (μν|λσ),    F_{νλ} = F_{νλ} − D_{μσ} (μν|λσ).    (24)
Equation (24) can be used to implement the Coulomb and exchange terms of Equation (21) by accumulating each of the M(M+1)(M^2+M+2)/8 unique ERIs into six elements of the Fock matrix. While this update scheme can eliminate redundant integral calculations, O(M^4) ERIs are added directly to a small memory space of O(M^2). When this reduction operation is performed in parallel on a GPU, high memory contention arises from multiple threads simultaneously attempting to exclusively add integral values to the same elements of the Fock matrix, resulting in significant overhead. Consequently, even if it involves redundant ERI calculations, the strategy of constructing the Fock matrix by computing the J- and K-matrices separately is also being explored [12,24,27]. In this work, we provide an efficient GPU implementation that addresses the memory contention in the reduction of Equation (24), while fully exploiting the eight-fold symmetry of ERIs.
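To make the contention issue concrete, the following device-side sketch shows a direct realization of Equation (24) with CUDA atomicAdd. The index and ERI value arguments are assumed to be supplied by an ERI kernel; this is the high-contention baseline, not the optimized scheme proposed later in this paper.

```cuda
// Accumulate one unique ERI (mu nu | lam sig) into a single Fock matrix F of
// leading dimension M, following Equation (24). All six updates target the same
// global-memory matrix, so concurrent threads frequently collide on addresses.
__device__ void update_fock_naive(double* F, const double* D, int M,
                                  int mu, int nu, int lam, int sig, double eri) {
    atomicAdd(&F[mu  * M + nu ],  4.0 * D[lam * M + sig] * eri);
    atomicAdd(&F[lam * M + sig],  4.0 * D[mu  * M + nu ] * eri);
    atomicAdd(&F[mu  * M + lam], -D[nu * M + sig] * eri);
    atomicAdd(&F[nu  * M + sig], -D[mu * M + lam] * eri);
    atomicAdd(&F[mu  * M + sig], -D[nu * M + lam] * eri);
    atomicAdd(&F[nu  * M + lam], -D[mu * M + sig] * eri);
}
```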

2.3. GPU Programming Model

Our GPU implementation is developed in CUDA, which is a general-purpose parallel computing platform offered by NVIDIA. Graphics Processing Units (GPUs) facilitate high-performance parallel processing by decomposing substantial computational tasks into smaller, autonomous units handled by CUDA blocks. These blocks are then distributed among a multitude of streaming multiprocessors (SMs), each comprising a large number of processing cores. An SM has the capacity to process multiple CUDA blocks concurrently. Each block is further subdivided into units termed warps, composed of 32 individual threads. A thread represents the fundamental unit of sequential instruction execution. The 32 threads within a warp are concurrently scheduled onto the CUDA cores. Because each thread in a warp executes a single instruction on different data simultaneously, this architecture is characterized as Single Instruction, Multiple Threads (SIMT). Although individual threads within a warp can branch and execute independently, different instructions branched by if-else statements are executed sequentially. This phenomenon is known as warp divergence, leading to a decrease in parallel execution efficiency.
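The following minimal kernel (purely illustrative, not from the paper) exhibits the warp divergence described above: threads of the same warp take different branches of the if-else statement, and the two paths are executed one after the other.

```cuda
#include <cstdio>

__global__ void divergent_kernel(float* out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t % 2 == 0) {
        out[t] = 2.0f * t;   // path A: taken by even-numbered threads
    } else {
        out[t] = 0.5f * t;   // path B: serialized with respect to path A within a warp
    }
}

int main() {
    float* out = nullptr;
    cudaMallocManaged(&out, 64 * sizeof(float));
    divergent_kernel<<<2, 32>>>(out);   // 2 blocks of one warp (32 threads) each
    cudaDeviceSynchronize();
    std::printf("out[2] = %.1f, out[3] = %.1f\n", out[2], out[3]);
    cudaFree(out);
    return 0;
}
```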

3. Related Works

As described in Section 2, constructing the Fock matrix is a computationally demanding process involving complex molecular integrals and intricate summations. In the Hartree–Fock theory, iterative updates of the Fock matrix constitute the dominant computational bottleneck. This section briefly summarizes accumulated insights and implementation strategies for acceleration, focusing primarily on parallelization using CPUs and GPUs. The computational bottleneck in the calculation of the Fock matrix can be primarily attributed to two factors: the evaluation of ERIs (Section 2.2.1) and the subsequent aggregation of these values (Section 2.2.2). We particularly highlight the latter, directing attention to the efficient reduction of ERIs, while introducing previous research in this area.
The 2009 study by Ufimtsev and Martinez [12] is known as the first fully GPU-accelerated implementation of the Fock matrix construction within the Direct-SCF framework. Recognizing the significantly different memory access patterns, distinct algorithms were employed for the Coulomb J- and exchange K-matrices. In both algorithms, the ERIs are pre-sorted according to the upper bound derived from the Schwarz inequality, thereby enhancing the effectiveness of screening on the GPU. This approach does not fully exploit the eight-fold symmetry of ERIs in Equation (18); consequently, computing the J- and K-matrices separately involves approximately two-fold and four-fold redundant integral evaluations, respectively.
In 2012, Asadchev and Gordon [13] proposed a hybrid approach using CPUs and GPUs for the Hartree–Fock method, with a focus on efficient Fock matrix construction. They employed a blocked Fock and density matrix representation to improve cache locality on the CPU and promote coalesced memory access on the GPU. While their approach fully exploits the eight-fold symmetry of ERIs, updates to the Fock matrix using atomic operations remain susceptible to memory contention issues.
In 2017, Mironov et al. [15] proposed two techniques for aggregating the values of ERIs in the Fock matrix calculation. This paper discusses the acceleration of the Hartree–Fock method in the quantum chemistry package GAMESS [37], describing a CPU implementation using hybrid parallelization with OpenMP and MPI. The “Private Fock” algorithm allocates a unique Fock matrix for each CPU thread, reducing memory contention by multiple threads in Equation (24). Conversely, to avoid the increased memory footprint associated with per-thread Fock matrices, the “Shared Fock” algorithm was also employed, sharing a single Fock matrix. Both algorithms reduce the number of overhead-intensive atomic operations by performing partial ERI reductions thread-locally. However, in highly parallel GPU architectures, only a limited number of integrals can be locally aggregated, thereby involving frequent atomic additions across numerous GPU threads.
In 2020, Huang et al. [18] proposed methods for constructing the Fock matrix on shared-memory and distributed-memory parallel computers. Similar discussions to those by Mironov et al. were conducted, comparing two techniques: “Block buffer accumulation” and “Strip buffer accumulation”, both involving thread-local partial copies of the J - and K -matrices. The former utilizes copies of only the six elements of Equation (24), while the latter employs larger partial copies to further reduce atomic operations. They reported a 36% reduction in computation time with the Strip buffer accumulation compared to the Block buffer accumulation.
In the same year, Barca et al. [19] proposed an accelerated Fock matrix construction algorithm for the Hartree–Fock method on GPUs. Their work detailed several optimization strategies. For ERI evaluation, integrals were classified by the type of angular momentum and contraction degree K in Equation (12), enabling dynamic load balancing through batched processing. Schwarz screening is performed before computing ERIs on the GPU, avoiding conditional branching in device codes and enhancing parallel computation performance. Furthermore, a double-stream ERI reduction method exploited the eight-fold symmetry while avoiding explicit thread synchronization overhead.
The following year, Barca et al. [20] extended their work to a GPU implementation of the entire SCF procedure, including one-electron integrals, and further scaled it to multiple GPUs. The dynamic load balancing for ERI evaluation was adapted for multi-GPU environments, and hierarchical memory usage was optimized to improve memory bandwidth utilization. For the reduction in Equation (24), thread-local GPU registers were used as buffers to accumulate intermediate values, significantly reducing the number of atomic operations required for updates to the Fock matrix in global memory. However, a challenge arises with the ERI kernels for high angular momentum, where the register usage increases, leading to decreased GPU occupancy and memory access efficiency.
In 2023, Qi et al. [24] explored a heterogeneous computing approach to assemble the Fock matrix using a hybrid of CPUs and GPUs. A global task queue was employed to dynamically assign low angular momentum ERIs to the GPU and high angular momentum ERIs to the CPU. While the CPU implementation fully exploits the eight-fold symmetry of ERIs, the GPU implementation, similar to the approach by Ufimtsev et al. [12], computes the J- and K-matrices separately to avoid memory contention from atomic operations during the Fock matrix updates.
In 2024, Palethorpe et al. [27] developed two optimized algorithms, “opt-UM” and “opt-Brc”, building upon the work of Ufimtsev et al. [12] and Barca et al. [20], respectively. The opt-Brc algorithm fully utilizes the eight-fold symmetry of ERIs using Equation (24) and reduces atomic operations by partially performing ERI reductions in thread-local GPU registers. In the opt-UM algorithm, the J- and K-matrices are computed separately. For J-matrix construction, a warp-wise reduction using shared memory is employed. For K-matrix construction, the number of ERIs computed was significantly reduced by exploiting the exponential decay of density matrix elements with increasing distance between basis functions. Both algorithms incorporate screening based not only on the Schwarz upper bound but also on the density matrix.
While the reduction of ERIs in the Fock matrix construction has been refined over the years, the use of atomic operations remains unavoidable in parallel computations on both CPUs and GPUs. Particularly when exploiting the eight-fold symmetry of ERIs using Equation (24), the memory contention caused by atomic operations from a large number of threads remains unresolved. Previous studies have primarily focused on reducing the number of atomic additions to the Fock matrix, achieving this through techniques such as reduction into thread-local buffers. However, on GPUs, whose parallelism is far higher than that of CPUs, the number of ERIs that can be aggregated per thread is limited. Even with thread-local partial reduction, O(M^4) atomic operations persist in updating the Fock matrix on GPUs, and these operations concentrate on a small memory space of O(M^2). To address this, we avoid the memory contention by distributing the addresses accessed by each thread rather than reducing the number of atomic operations.

4. Proposed Method

In this section, we propose an efficient GPU implementation of the Fock matrix construction that significantly reduces memory contention while fully exploiting the eight-fold symmetry of ERIs. First, we detail an efficient parallel implementation of ERIs for computing only unique integrals, including the data structures and thread mapping. Then, we introduce the replicated Fock matrix update method using the distributed atomic reduction to avoid memory contention.

4.1. GPU Parallelization of ERIs

We begin by describing the GPU implementation of the ERI evaluation, which constitutes the first bottleneck in the Fock matrix construction. The ERI evaluation involves calculating an integral value for every combination of four basis functions, making it intuitive to parallelize by assigning each GPU thread to compute a single value of (μν|λσ). However, this approach is known to be inefficient due to redundant calculations of the Boys functions F_j(X). Therefore, we employ an efficient mapping where each thread evaluates a task, which is a set of primitive ERIs that share the same values of the Boys function. In this mapping, instead of grouping Gauss-type orbitals by basis functions, they are managed in a data structure called a primitive shell, denoted S_1, S_2, S_3, … for each shell. Figure 1 shows an example of the mathematical expression of a p-orbital with contraction degree K = 3. The p-orbital consists of three components, the p_x-, p_y-, and p_z-orbitals, corresponding to angular momenta (l, m, n) = (1, 0, 0), (0, 1, 0), and (0, 0, 1), respectively. We organize Gauss-type orbitals with common inputs for the Boys function (total angular momentum L, orbital center A, and exponent ρ_a) into a primitive shell data structure. Assigning each GPU thread to a quartet of these primitive shells minimizes the number of Boys function evaluations. A single shell quartet contains multiple primitive ERIs, each contributing to different contracted ERIs. The set of these primitive ERIs constitutes a task, defined as the combinations of ERIs over Gauss-type orbitals within the four shells S_α, S_β, S_γ, and S_δ (0 ≤ α, β, γ, δ ≤ N − 1):
[S_α S_β | S_γ S_δ] = { [ab|cd] | G_a ∈ S_α, G_b ∈ S_β, G_c ∈ S_γ, G_d ∈ S_δ },    (25)
where N denotes the number of primitive shells, and the number of primitive ERIs in a task is determined by the types of the four shells. For example, we denote a task with three s-orbitals and one p-orbital as [ss|sp] for convenience. Since the only angular momentum combination for an s-orbital is (0, 0, 0), the [ss|sp] task contains three ERIs over Gauss-type orbitals: [ss|sp_x], [ss|sp_y], and [ss|sp_z]. Because these primitive ERIs within a task share the same values of the Boys function, redundant computations can be avoided.
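A possible in-memory layout of the primitive shell and task structures is sketched below; the field names are our own illustration and are not taken from the authors' implementation.

```cuda
// A primitive shell S groups the Gauss-type orbitals that share the inputs of
// the Boys function: total angular momentum L, orbital center A, and exponent rho.
struct PrimitiveShell {
    int    L;            // total angular momentum: 0 for s, 1 for p, ...
    double A[3];         // orbital center (nuclear coordinates)
    double rho;          // Gaussian exponent shared by the shell components
    double weight;       // contraction weight w_a of this primitive
    int    basis_offset; // index of the first contracted basis function it feeds
};

// A task is a quartet of primitive shells; each GPU thread processes one task and
// sequentially evaluates every primitive ERI [ab|cd] drawn from the four shells.
struct ShellQuartetTask {
    int alpha, beta, gamma, delta;  // indices into the global primitive shell array
};
```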
Using the shell data structure, the ERI calculation can be viewed as a set of tasks evaluating primitive ERIs for all elements of an upper triangular matrix indexed by pairs of primitive shells. Figure 2 illustrates the task matrix for ERIs composed of s- and p-orbitals. We compute all unique ERI values by parallelizing this task matrix on GPUs. Each GPU thread is assigned a unique combination of four primitive shells and sequentially evaluates the primitive ERIs within that task. Equation (14) can be implemented by appropriately accumulating the obtained primitive ERIs into the corresponding contracted ERIs. The task matrix can be classified into six integral types as task sub-matrices for s- and p-orbitals: [ss|ss], [ss|sp], [ss|pp], [sp|sp], [sp|pp], and [pp|pp], by sorting the primitive shell pairs S_α S_β by their total angular momenta (L_α, L_β). According to the angular momenta, a single task in each sub-matrix has 1, 3, 9, 9, 27, and 81 primitive ERIs, respectively. In GPU parallelization, dedicated CUDA kernels can be developed for these sub-matrices to eliminate warp divergence caused by differences in the angular momenta l, m, n and contraction degrees K in Equation (14).
Our GPU implementation employs the Schwarz screening of ERIs in accordance with the task matrix based on primitive shells. To determine whether to compute or skip entire primitive ERIs within a task, we use an ERI upper bound for each task [ S α S β | S γ S δ ] . This upper bound is defined as
[ab|cd] ≤ Q_{αβ} Q_{γδ}   for all [ab|cd] ∈ [S_α S_β | S_γ S_δ],    (26)
Q_{αβ} = max_{G_a ∈ S_α, G_b ∈ S_β} √[ab|ab],
Q_{γδ} = max_{G_c ∈ S_γ, G_d ∈ S_δ} √[cd|cd],
where Q_{αβ} denotes the maximum value over the primitive ERIs within each task corresponding to a diagonal element of the task matrix in Figure 2. The upper bound factors Q_{αβ} are calculated in advance with a computational complexity of O(N^2). For each shell quartet [S_α S_β | S_γ S_δ], if its upper bound in Equation (26) is less than the predefined cutoff threshold θ, we omit the computation of all primitive ERIs within the task. However, divergent screening results among threads within a warp lead to performance degradation in parallelization, as threads that could skip tasks become idle while waiting for the other threads in the same warp to complete their ERI computations. To avoid this situation, we sort the pairs of primitive shells [S_α S_β| within each segment, [ss|, [sp|, and [pp|, according to the upper bound factors Q_{αβ}, ensuring uniform screening results within warps. Furthermore, even if a task cannot be skipped, additional screening based on Equation (20) is applied to individual primitive ERIs to omit as many ERI computations as possible.
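The pre-sorting of shell pairs by their bound factors can be expressed compactly with Thrust; the snippet below is our own illustration of the idea, not the authors' code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sort.h>

// Sort the shell-pair indices of one segment (e.g., all [sp| pairs) by their
// Schwarz bound factors Q in descending order, so that neighboring threads of a
// warp see similar bounds and take the same screening branch.
void sort_pairs_by_bound(thrust::device_vector<double>& Q_segment,
                         thrust::device_vector<int>& pair_indices) {
    thrust::sort_by_key(Q_segment.begin(), Q_segment.end(),
                        pair_indices.begin(), thrust::greater<double>());
}
```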
Efficient construction of the Fock matrix in Equation (24) based on the primitive shell data structure presents a nontrivial challenge for GPU implementation. The Fock matrix is allocated in global memory, which involves high access latency, necessitating minimization of the access frequency. GPU computing typically employs hierarchical reduction strategies to reduce global memory accesses, utilizing registers within each thread and shared memory within each CUDA block. Based on this concept, it would be ideal to aggregate integral values in shared memory in addition to thread-local reduction in registers. However, in the primitive shell-based GPU implementation, utilizing shared memory as an intermediate buffer for the Fock matrix is challenging. The contributions of the multiple primitive integrals computed by each CUDA block to contracted ERIs, and ultimately to Fock matrix elements, are input-dependent and cannot be statically determined. Consequently, an appropriate mapping mechanism of the write destination for each ERI is required to manage the limited shared memory space, which cannot store the entire Fock matrix. Additionally, sorting the pairs of primitive shells to avoid warp divergence in the Schwarz screening further exacerbates the irregular memory access pattern. The proposed method using replicated Fock matrices, described below, offers a simple yet effective approach that is readily applicable to such complex shell-structured GPU implementations.

4.2. Distributed Atomic Reduction Through Replicated Fock Matrix Update

Based on the GPU parallelization that computes only unique ERIs in Section 4.1, we efficiently update the Fock matrix while considering the eight-fold symmetry. In our GPU implementation, each thread computes multiple primitive ERIs that contribute to different contracted ERIs. Consequently, we implement Equation (24) by directly accumulating the obtained primitive ERIs into the Fock matrix. Even with partial aggregation of primitive ERIs in thread-local buffers, O(M^4) atomic additions to the O(M^2) memory space of the Fock matrix are required. These exclusive additions are implemented using the atomicAdd operation in CUDA. The atomicAdd operation is a function that increments a variable stored at a specified address, executing the read and write as an indivisible process. The atomicAdd operations competing at the same address from multiple threads are processed sequentially, resulting in performance degradation in parallel computation. To avoid the overhead caused by this memory contention, we distribute the atomicAdd operations over multiple replicated Fock matrices. Figure 3 illustrates the distributed atomic reduction method based on the replicated Fock matrices for an example of the [ss|sp] sub-matrix. We decompose a single Fock matrix F into multiple replicas F_k (0 ≤ k ≤ N_Fock − 1) and significantly reduce the memory contention by having each thread add values to a different replica. Each replica serves as an intermediate buffer of the same size as the original, and the desired Fock matrix is constructed by summing all the replicas. The index k of the replica contributed to by each thread with index t is determined by
k = t mod N_Fock,
ensuring that adjacent threads, which are prone to address conflicts, perform the atomicAdd operations on different replicas. Finally, the desired Fock matrix F is obtained by summing the N_Fock replicas, each of which accumulates the products of primitive ERIs and density matrix elements. We consider not only the eight-fold symmetry of ERIs but also the symmetry of the Fock matrix, ensuring that the updates according to Equation (24) are applied only to the upper triangular elements of F. To detail our GPU implementation, Algorithm 1 presents pseudocode for the Fock matrix construction using the distributed atomic reduction.
The distributed atomic reduction further reduces the memory contention that cannot be avoided by existing thread-local reductions. Our replicated Fock matrices approach focuses not on reducing the number of atomicAdd operations, but rather on avoiding conflicts at the same address. Note that the N_Fock replicas are allocated in global memory, just like the main Fock matrix itself, not in thread-local registers or shared memory. Ideally, the number of conflicts among atomicAdd operations from different threads is reduced to approximately 1/N_Fock of the original, but the optimal number of replicas needs to be determined experimentally. This replicated Fock matrix approach does not require additional registers or shared memory and thus does not adversely affect the occupancy of the streaming multiprocessors. Furthermore, it does not require complex data structures or specialized thread assignment, making it a simple yet powerful acceleration technique that can be easily integrated into existing GPU implementations of the Fock matrix computation.
Although additional storage space of O(M^2) × N_Fock in global memory is required, the space complexity increases only by a constant factor, having minimal impact on the size of molecules for which Direct-SCF can be executed on a GPU. Figure 4 shows the additional memory usage required by the proposed replicated Fock matrices. For reference, we consider the memory sizes available on NVIDIA A100 and H100 GPUs. Memory requirements become a significant consideration when handling very large molecules/basis sets with thousands of basis functions M on a single GPU. Assuming double-precision floating-point numbers for the Fock matrix, the memory usage of its replicas can be estimated as M^2 × N_Fock × 8 bytes, scaling quadratically with the number of basis functions M. As shown in Figure 4, for M ≤ 4000, the additional memory usage due to the Fock matrix replication has minimal impact on the computable molecular size. For M > 4000, the number of replicas N_Fock becomes somewhat limited. Considering the memory required for other matrices involved in the SCF procedure, the upper limits for the number of replicas can be estimated as N_Fock = 32 for M = 8000 and N_Fock = 8 for M = 16,000. Nevertheless, the experimental results in Section 5 show that sufficient speedup can be achieved with the proposed method even with a small number of Fock replicas.
Algorithm 1 Fock matrix construction using the distributed atomic reduction method
Require: Primitive shells S_0, S_1, …, S_{N−1};
         upper bound factors for each task Q_{00}, Q_{01}, …, Q_{N−1,N−1};
         density matrix D
 1: for each GPU thread in parallel do
 2:     Get a quartet of primitive shells S_α, S_β, S_γ, and S_δ
 3:     if Q_{αβ} Q_{γδ} ≥ θ then
 4:         k ← t mod N_Fock
 5:         for each combination G_a, G_b, G_c, and G_d from S_α, S_β, S_γ, S_δ do
 6:             if √( [ab|ab] [cd|cd] ) ≥ θ then
 7:                 g_abcd ← w_a w_b w_c w_d [ab|cd]
 8:                 F_k(μ, ν) ← F_k(μ, ν) + 4 D_{λσ} × g_abcd
 9:                 F_k(λ, σ) ← F_k(λ, σ) + 4 D_{μν} × g_abcd
10:                 F_k(μ, λ) ← F_k(μ, λ) − D_{νσ} × g_abcd
11:                 F_k(ν, σ) ← F_k(ν, σ) − D_{μλ} × g_abcd
12:                 F_k(μ, σ) ← F_k(μ, σ) − D_{νλ} × g_abcd
13:                 F_k(ν, λ) ← F_k(ν, λ) − D_{μσ} × g_abcd
14:             end if
15:         end for
16:     end if                      {θ: predefined Schwarz cutoff threshold}
17: end for                        {t: thread index within each CUDA block}
18: F ← F_0 + F_1 + ⋯ + F_{N_Fock−1}    {N_Fock: the number of Fock matrix replicas}
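The following CUDA sketch illustrates how the replica selection (line 4) and the distributed updates (lines 8–13) of Algorithm 1, together with the final summation of replicas (line 18), might look in device code. The naming and kernel structure are our own simplified illustration; the ERI value g and the basis-function indices are assumed to have been produced by the ERI kernel.

```cuda
constexpr int N_FOCK = 16;  // number of Fock replicas (tuned experimentally)

// Distributed atomic reduction: each thread selects one replica by its index,
// so concurrent atomicAdds rarely target the same global-memory address.
__device__ void update_fock_dar(double* F_replicas, const double* D, int M,
                                int mu, int nu, int lam, int sig, double g) {
    const int k = threadIdx.x % N_FOCK;              // line 4 of Algorithm 1
    double* Fk = F_replicas + (size_t)k * M * M;     // replica F_k
    atomicAdd(&Fk[mu  * M + nu ],  4.0 * D[lam * M + sig] * g);
    atomicAdd(&Fk[lam * M + sig],  4.0 * D[mu  * M + nu ] * g);
    atomicAdd(&Fk[mu  * M + lam], -D[nu * M + sig] * g);
    atomicAdd(&Fk[nu  * M + sig], -D[mu * M + lam] * g);
    atomicAdd(&Fk[mu  * M + sig], -D[nu * M + lam] * g);
    atomicAdd(&Fk[nu  * M + lam], -D[mu * M + sig] * g);
}

// Line 18 of Algorithm 1: fold the replicas back into the single Fock matrix F,
// which already holds the one-electron part h.
__global__ void sum_replicas(double* F, const double* F_replicas, int M) {
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= M * M) return;
    double acc = 0.0;
    for (int k = 0; k < N_FOCK; ++k)
        acc += F_replicas[(size_t)k * M * M + idx];
    F[idx] += acc;
}
```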

4.3. Thread-Local Reduction in Registers

We also implement and evaluate the thread-local reduction employed in previous studies [20,27]. Several primitive ERIs within each task share the same destination in the Fock matrix, allowing them to be aggregated in private registers of each thread. By accumulating these primitive ERIs in registers at low cost, we can reduce the number of expensive atomicAdd operations to global memory. As shown in Figure 2, when calculating the task sub-matrix [ss|sp] as an example, the three contributing ERIs [ss|sp_x], [ss|sp_y], and [ss|sp_z] share the same contracted-ERI indices μ, ν, and λ. Therefore, these primitive ERIs share the same destinations in the Fock matrix at F_{μν}, F_{μλ}, and F_{νλ} among the six locations. Consequently, by summing the primitive ERIs within each thread's registers before updating the Fock matrix in global memory, we can reduce the number of atomicAdd operations by one third. Table 1 shows the reduction rate of atomic operations in each ERI kernel. While higher angular momentum sub-matrices exhibit greater reduction rates, they also require a correspondingly larger number of additional register variables. We have also implemented this thread-local reduction and will evaluate its combination with our distributed atomic reduction in the next section.
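A register-level sketch of this thread-local reduction for one [ss|sp] task is shown below (our own illustration with hypothetical argument names). The three destinations shared by the p_x, p_y, and p_z components are accumulated in registers and flushed with a single atomicAdd each, while the σ-dependent destinations are still written per component; combining this with the proposed replicated matrices simply redirects every atomicAdd to the replica F_k.

```cuda
// Thread-local reduction for an [ss|sp] task: g[x] is the weighted primitive ERI
// of the p_x/p_y/p_z component and sig[x] its sigma index; mu, nu, lam are shared.
__device__ void update_fock_ss_sp_tlr(double* F, const double* D, int M,
                                      int mu, int nu, int lam,
                                      const int sig[3], const double g[3]) {
    double f_mu_nu = 0.0, f_mu_lam = 0.0, f_nu_lam = 0.0;  // register buffers
    for (int x = 0; x < 3; ++x) {
        f_mu_nu  += 4.0 * D[lam * M + sig[x]] * g[x];      // shared destinations
        f_mu_lam += -D[nu * M + sig[x]] * g[x];
        f_nu_lam += -D[mu * M + sig[x]] * g[x];
        // Destinations involving sigma differ per component and are written directly.
        atomicAdd(&F[lam * M + sig[x]],  4.0 * D[mu * M + nu ] * g[x]);
        atomicAdd(&F[nu  * M + sig[x]], -D[mu * M + lam] * g[x]);
        atomicAdd(&F[mu  * M + sig[x]], -D[nu * M + lam] * g[x]);
    }
    atomicAdd(&F[mu * M + nu ], f_mu_nu);    // one atomicAdd instead of three
    atomicAdd(&F[mu * M + lam], f_mu_lam);
    atomicAdd(&F[nu * M + lam], f_nu_lam);
}
```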
While our current GPU implementation has been developed for ERIs involving s- and p-orbitals, the proposed replicated Fock matrix approach is also expected to be effective for chemical systems involving higher angular momentum orbitals. Our distributed atomic reduction method does not affect the implementation of the shell-based ERI evaluation itself, allowing it to be applied to CUDA kernels involving d-orbitals and beyond in the same way as for low angular momenta. Furthermore, the proposed method should offer sufficient performance for higher angular momentum orbitals, because the number of atomicAdd operations for the Fock matrix update in Equation (24) increases with the total angular momentum L. Specifically, the number of primitive ERIs computed iteratively in the for loop on line 5 of Algorithm 1 increases. For instance, in the [dd|dd] kernel, each GPU thread sequentially evaluates 1296 primitive ERIs. Each integral [ab|cd] contributes to six elements of the Fock matrix, resulting in 7776 atomicAdd operations. Thread-local reduction can reduce this number to 1/36, but 216 atomicAdd operations still remain, four times more than the 54 operations in the [pp|pp] kernel. Consequently, we infer that our proposed method, which reduces the memory contention caused by atomicAdd conflicts, is effective even for Fock matrix computations involving higher angular momentum orbitals.

5. Performance Evaluation

We evaluated the performance of the Fock matrix construction method proposed in Section 4 through numerical experiments using various molecules and basis sets as inputs. Our primary contribution, the Distributed Atomic Reduction method (DAR; Section 4.2), builds upon the GPU parallelization of ERIs explained in Section 4.1. First, we examined the influence of the proposed DAR on the accuracy of the computed RHF energies for several molecules. Subsequently, we measured the total computation time, including both evaluating ERIs and updating the Fock matrix, in a single SCF iteration. Our DAR can be combined with the Thread-Local Reduction (TLR; Section 4.3) technique employed in previous work [20,27], and we also compared the computation time when using TLR in conjunction with DAR. In addition to measuring the computation time for the Fock matrix construction, we also profiled each CUDA kernel using Nsight Compute. Based on the profiling results, we discuss how the proposed method for avoiding atomicAdd conflicts contributes to reducing memory contention. Our GPU implementation was developed with CUDA 11.2 and C++17, compiled with GCC 10.5.0, and run on an A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, United States) with 80 GB of device memory and a Xeon Gold 6338 CPU (Intel, Santa Clara, CA, United States). The cutoff threshold for Schwarz screening was set to θ = 1.0 × 10^{-12}. The computation time was measured with 256 threads per CUDA block. Performance was compared by doubling the number of Fock matrix replicas N_Fock from 1 to 256. The case of N_Fock = 1 corresponds to the conventional approach where atomicAdd operations concentrate on a single Fock matrix, and it serves as the baseline.
In parallel processing, such as GPU computing, the order of floating-point operations like addition and subtraction can vary, potentially affecting the accuracy of computed results. Our proposed DAR method computes the Fock matrix by accumulating integral values into multiple Fock replicas and then summing these replica matrices. This alters the order of floating-point operations compared to the standard implementation, which directly constructs the single Fock matrix. We numerically examined whether this change in operation order introduces any loss of accuracy due to effects such as cancellation of significant digits. Table 2 shows the converged energies in the Restricted Hartree–Fock (RHF) method for different numbers of Fock replicas using our DAR method. In this experiment, the convergence criterion for the SCF procedure is set to an energy difference of less than 1.0 × 10^{-10}. For all molecules in the table, the maximum absolute error in the RHF energy due to the use of DAR is of the order of 10^{-11} or less, indicating that the proposed method has a negligible impact on the convergence of the SCF procedure.
Table 3 and Table 4 present the computation times for the Fock matrix construction on a GPU. These results are for small molecules with the 6-31G basis set and medium-sized molecules with the STO-3G basis set, respectively. For each molecule, we compare the computation times of the proposed DAR method and a hybrid approach combining DAR with TLR. The speedup rate is defined as the maximum acceleration achieved using DAR compared to the baseline case of N_Fock = 1, where the proposed method is not applied. For the small molecules in Table 3, our DAR achieves more than double the speedup in all cases. The hybrid method, which applies DAR to TLR, shows a slightly lower speedup rate but improves the performance of the existing reduction method by approximately 1.3 times. For the larger molecules in Table 4, the speedup rate increases, achieving up to 3.75× acceleration with DAR. The hybrid method also shows further improved performance, reaching a maximum speedup of 1.98× over the case where only TLR was applied. This is attributed to increased memory contention in the Fock matrix update as the molecular size grows. Since O(M^4) atomicAdd operations from all GPU threads concentrate on an O(M^2) memory space, the frequency of destination address conflicts increases with the number of basis functions M. Thus, the proposed DAR's ability to reduce these conflicts becomes more effective. On the other hand, the optimal number of Fock replicas varies slightly depending on the input molecules. For small molecules, N_Fock = 32 yields near-optimal performance in all cases, whereas N_Fock = 16 was best for the larger molecules, requiring fewer replicas. Devising a mechanism to determine the appropriate number of replicas based on the input will further enhance the practicality of the proposed DAR method.
To further investigate the reduction in memory contention achieved by the proposed method, we profiled each CUDA kernel using Nsight Compute. Table 5 presents the profiling results for the Fock matrix construction of azobenzene with the 6-311G basis set. We examined the average number of cycles per instruction for each warp and, within the warp cycles, the number of cycles stalled waiting for scoreboard dependency resolution. The scoreboard manages the execution status of each instruction and data dependencies. For example, if atomic operations from multiple threads conflict at the same address, the scoreboard stalls the atomic operations from the other threads until one thread's atomic operation is completed. This waiting time is recorded as stalled cycles, allowing us to quantitatively evaluate memory contention by referencing these values. Table 5 demonstrates that using the proposed DAR (N_Fock = 64) significantly reduced atomicAdd conflicts across all kernels. The average number of cycles each warp stalled was reduced by up to 84.8%, resulting in a 64.8% reduction in total cycles. Additionally, reducing memory contention with the proposed DAR improved memory access efficiency. The reduction in atomicAdd conflicts enabled more effective utilization of the previously bottlenecked memory bandwidth, resulting in up to a 2.82× improvement in L2 cache throughput.
Finally, we briefly discuss the relationship between our proposed method and the GPU-accelerated Fock matrix construction schemes implemented in established quantum chemistry software. TeraChem [42,43] evaluates ERIs over the same primitive-shell data structure as ours and constructs the Fock matrix by building the J- and K-matrices separately. For the J-matrix, atomic operations are avoided by allocating separate GPU memory to each GPU kernel; for the K-matrix, however, algorithms requiring atomicAdd operations have been considered, and these have been identified as a source of performance degradation. QUICK [14,44] and GPU4PySCF [45,46] appear to implement the Fock matrix construction on contracted shell structures. A contracted shell is a set of basis functions sharing the same total angular momentum L and orbital center A, which allows the aggregation in Equation (14) to be performed without atomic operations. This significantly reduces the number of atomicAdd operations, but the O(M^4) contracted ERIs must still typically be added to the Fock matrix under mutual exclusion, so the memory contention caused by atomicAdd operations remains a challenge in current GPU implementations. Our replicated Fock matrix approach is simple and independent of the ERI evaluation scheme and of any complex data structure, suggesting that it could also improve these existing GPU implementations. Furthermore, combining our DAR method with other techniques such as thread-local reduction [20,27] has been experimentally shown to further improve the Fock matrix construction performance. We emphasize that our proposed method does not compete with existing GPU implementations but rather has the potential to enhance them.

6. Conclusions

Efficient acceleration of the Fock matrix computation requires reducing a very large number of two-electron repulsion integrals (ERIs) into a limited memory space with atomic operations, which leads to a severe memory contention problem. In this paper, to address this problem in current Fock matrix construction frameworks, we introduced a GPU acceleration scheme for the Fock matrix construction based on an efficient reduction method. Specifically, we proposed a distributed atomic reduction built on replicated Fock matrices that significantly reduces the conflicts among concurrent atomic operations issued by multiple threads. The proposed reduction method effectively spreads the destination addresses of atomic operations by having each thread contribute to a different Fock replica. Furthermore, the replicated Fock matrix approach is compatible with existing implementations that reduce the number of atomic operations through thread-local reduction, and we implemented and evaluated a hybrid method that combines the proposed distributed atomic reduction with the thread-local reduction adopted in previous work. Experimental results on an NVIDIA A100 GPU with a representative set of molecules and basis sets demonstrate that the proposed distributed atomic reduction accelerates the Fock matrix computation by up to 3.75×, while the hybrid method achieves up to a 1.98× speedup over the Fock matrix computation using only the existing thread-local reduction. These results suggest that the proposed distributed atomic reduction has the potential to enhance existing GPU implementations across quantum chemistry software packages.

Author Contributions

Conceptualization, S.T. and K.N.; methodology, S.T. and Y.I.; software, S.T., N.Y., K.S., and H.F.; formal analysis, Y.I.; investigation, S.T.; resources, Y.I., K.N., and A.K.; data curation, N.Y., K.S., and H.F.; writing—original draft, S.T.; writing—review and editing, Y.I. and V.P.; supervision, Y.I., K.N., and V.P.; project administration, K.N. and A.K.; funding acquisition, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The molecular structures used in the experiments presented in this paper are openly available in PubChem at https://pubchem.ncbi.nlm.nih.gov/ (accessed on 25 March 2025).

Conflicts of Interest

Authors Satoki Tsuji and Akihiko Kasagi were employed by the company Fujitsu Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Arodola, O.; Soliman, M. Quantum mechanics implementation in drug-design workflows: Does it really help? Drug Des. Dev. Ther. 2017, 11, 2551–2564. [Google Scholar] [CrossRef] [PubMed]
  2. Cavasotto, C.N.; Adler, N.S.; Aucar, M.G. Quantum Chemical Approaches in Structure-Based Virtual Screening and Lead Optimization. Front. Chem. 2018, 6, 188. [Google Scholar] [CrossRef] [PubMed]
  3. Biz, C.; Fianchini, M.; Gracia, J. Strongly correlated electrons in catalysis: Focus on quantum exchange. ACS Catal. 2021, 11, 14249–14261. [Google Scholar] [CrossRef]
  4. von Burg, V.; Low, G.H.; Häner, T.; Steiger, D.S.; Reiher, M.; Roetteler, M.; Troyer, M. Quantum computing enhanced computational catalysis. Phys. Rev. Res. 2021, 3, 033055. [Google Scholar] [CrossRef]
  5. Roothaan, C.C.J. New Developments in Molecular Orbital Theory. Rev. Mod. Phys. 1951, 23, 69–89. [Google Scholar] [CrossRef]
  6. Ito, Y.; Tsuji, S.; Fujii, H.; Suzuki, K.; Yokogawa, N.; Nakano, K.; Kasagi, A. Introduction to Computational Quantum Chemistry for Computer Scientists. In Proceedings of the 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 27–31 May 2024; pp. 273–282. [Google Scholar] [CrossRef]
  7. Bartlett, R.J.; Stanton, J.F. Applications of Post-Hartree-Fock Methods: A Tutorial. In Reviews in Computational Chemistry; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 1994; pp. 65–169. [Google Scholar] [CrossRef]
  8. Gill, P.M. Molecular integrals Over Gaussian Basis Functions. In Advances in Quantum Chemistry; Academic Press: Cambridge, MA, USA, 1994; Volume 25, pp. 141–205. [Google Scholar] [CrossRef]
  9. Almlöf, J.; Faegri, K., Jr.; Korsell, K. Principles for a direct SCF approach to LCAO–MO ab-initio calculations. J. Comput. Chem. 1982, 3, 385–399. [Google Scholar] [CrossRef]
  10. Yasuda, K. Two-electron integral evaluation on the graphics processor unit. J. Comput. Chem. 2008, 29, 334–342. [Google Scholar] [CrossRef] [PubMed]
  11. Ufimtsev, I.S.; Martínez, T.J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222–231. [Google Scholar] [CrossRef]
  12. Ufimtsev, I.S.; Martinez, T.J. Quantum chemistry on graphical processing units. 2. Direct self-consistent-field implementation. J. Chem. Theory Comput. 2009, 5, 1004–1015. [Google Scholar] [CrossRef]
  13. Asadchev, A.; Gordon, M.S. New multithreaded hybrid CPU/GPU approach to Hartree-Fock. J. Chem. Theory Comput. 2012, 8, 4166–4176. [Google Scholar] [CrossRef]
  14. Miao, Y.; Merz, K.M. Acceleration of high angular momentum electron repulsion integrals and integral derivatives on graphics processing units. J. Chem. Theory Comput. 2015, 11, 1449–1462. [Google Scholar] [CrossRef] [PubMed]
  15. Mironov, V.; Alexeev, Y.; Keipert, K.; D’mello, M.; Moskovsky, A.; Gordon, M.S. An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 12–17 November 2017. [Google Scholar]
  16. Huang, H.; Chow, E. Accelerating quantum chemistry with vectorized and batched integrals. In Proceedings of the SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 11–16 November 2018; pp. 529–542. [Google Scholar]
  17. Tornai, G.J.; Ladjánszki, I.; Rák, A.; Kis, G.; Cserey, G. Calculation of Quantum Chemical Two-Electron Integrals by Applying Compiler Technology on GPU. J. Chem. Theory Comput. 2019, 15, 5319–5331. [Google Scholar] [CrossRef] [PubMed]
  18. Huang, H.; Sherrill, C.D.; Chow, E. Techniques for high-performance construction of Fock matrices. J. Chem. Phys. 2020, 152, 024122. [Google Scholar] [CrossRef]
  19. Barca, G.M.J.; Galvez-Vallejo, J.L.; Poole, D.L.; Rendell, A.P.; Gordon, M.S. High-performance, graphics processing unit-accelerated Fock build algorithm. J. Chem. Theory Comput. 2020, 16, 7232–7238. [Google Scholar] [CrossRef]
  20. Barca, G.M.J.; Alkan, M.; Galvez-Vallejo, J.L.; Poole, D.L.; Rendell, A.P.; Gordon, M.S. Faster Self-Consistent Field (SCF) Calculations on GPU Clusters. J. Chem. Theory Comput. 2021, 17, 7486–7503. [Google Scholar] [CrossRef]
  21. Tian, Y.; Suo, B.; Ma, Y.; Jin, Z. Optimizing two-electron repulsion integral calculations with McMurchie-Davidson method on graphic processing unit. J. Chem. Phys. 2021, 155, 34112. [Google Scholar] [CrossRef] [PubMed]
  22. Manathunga, M.; Jin, C.; Cruzeiro, V.W.D.; Miao, Y.; Mu, D.; Arumugam, K.; Keipert, K.; Aktulga, H.M.; Merz, K.M.J.; Götz, A.W. Harnessing the Power of Multi-GPU Acceleration into the Quantum Interaction Computational Kernel Program. J. Chem. Theory Comput. 2021, 17, 3955–3966. [Google Scholar] [CrossRef]
  23. Johnson, K.G.; Mirchandaney, S.; Hoag, E.; Heirich, A.; Aiken, A.; Martínez, T.J. Multinode Multi-GPU Two-Electron Integrals: Code Generation Using the Regent Language. J. Chem. Theory Comput. 2022, 18, 6522–6536. [Google Scholar] [CrossRef] [PubMed]
  24. Qi, J.; Zhang, Y.; Yang, M. A hybrid CPU/GPU method for Hartree-Fock self-consistent-field calculation. J. Chem. Phys. 2023, 159, 104101. [Google Scholar] [CrossRef]
  25. Suzuki, K.; Ito, Y.; Fujii, H.; Yokogawa, N.; Tsuji, S.; Nakano, K.; Kasagi, A. GPU acceleration of head-Gordon-Pople algorithm. In Proceedings of the 2024 Twelfth International Symposium on Computing and Networking (CANDAR), Naha, Japan, 3–6 December 2024; pp. 115–124. [Google Scholar]
  26. Tsuji, S.; Ito, Y.; Fujii, H.; Yokogawa, N.; Suzuki, K.; Nakano, K.; Kasagi, A. Dynamic Screening of Two-Electron Repulsion Integrals in GPU Parallelization. In Proceedings of the 2024 Twelfth International Symposium on Computing and Networking Workshops (CANDARW), Naha, Japan, 26–29 November 2024; pp. 211–217. [Google Scholar] [CrossRef]
  27. Palethorpe, E.; Stocks, R.; Barca, G.M.J. Advanced techniques for high-performance Fock matrix construction on GPU clusters. J. Chem. Theory Comput. 2024, 20, 10424–10442. [Google Scholar] [CrossRef]
  28. Fujii, H.; Ito, Y.; Yokogawa, N.; Suzuki, K.; Tsuji, S.; Nakano, K.; Parque, V.; Kasagi, A. Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations. Appl. Sci. 2025, 15, 2572. [Google Scholar] [CrossRef]
  29. Yokogawa, N.; Ito, Y.; Tsuji, S.; Fujii, H.; Suzuki, K.; Nakano, K.; Kasagi, A. Parallel GPU computation of nuclear attraction integrals in quantum chemistry. In Proceedings of the 2024 Twelfth International Symposium on Computing and Networking Workshops (CANDARW), Naha, Japan, 26–29 November 2024; pp. 163–169. [Google Scholar]
  30. Pritchard, B.P.; Altarawy, D.; Didier, B.; Gibson, T.D.; Windus, T.L. New Basis Set Exchange: An Open, Up-to-Date Resource for the Molecular Sciences Community. J. Chem. Inf. Model. 2019, 59, 4814–4820. [Google Scholar] [CrossRef] [PubMed]
  31. McMurchie, L.E.; Davidson, E.R. One- and two-electron integrals over cartesian Gaussian functions. J. Comput. Phys. 1978, 26, 218–231. [Google Scholar] [CrossRef]
  32. Obara, S.; Saika, A. Efficient recursive computation of molecular integrals over Cartesian Gaussian functions. J. Chem. Phys. 1986, 84, 3963–3974. [Google Scholar] [CrossRef]
  33. Head-Gordon, M.; Pople, J.A. A method for two-electron Gaussian integral and integral derivative evaluation using recurrence relations. J. Chem. Phys. 1988, 89, 5777–5786. [Google Scholar] [CrossRef]
  34. Boys, S.F. Electronic wave functions — I. A general method of calculation for the stationary states of any molecular system. Proc. R. Soc. Lond. A Math. Phys. Sci. 1950, 200, 542–554. [Google Scholar] [CrossRef]
  35. Tsuji, S.; Ito, Y.; Nakano, K.; Kasagi, A. GPU Acceleration of the Boys Function Evaluation in Computational Quantum Chemistry. Concurr. Comput. Pract. Exp. 2025, 37, e8328. [Google Scholar] [CrossRef]
  36. Gill, P.M.; Johnson, B.G.; Pople, J.A. A simple yet powerful upper bound for Coulomb integrals. Chem. Phys. Lett. 1994, 217, 65–68. [Google Scholar] [CrossRef]
  37. Gordon, M.S.; Schmidt, M.W. Chapter 41—Advances in electronic structure theory: GAMESS a decade later. In Theory and Applications of Computational Chemistry; Dykstra, C.E., Frenking, G., Kim, K.S., Scuseria, G.E., Eds.; Elsevier: Amsterdam, The Netherlands, 2005; pp. 1167–1189. [Google Scholar] [CrossRef]
  38. Sun, Q. Libcint: An efficient general integral library for Gaussian basis functions. J. Comput. Chem. 2015, 36, 1664–1671. [Google Scholar] [CrossRef]
  39. Parrish, R.M.; Burns, L.A.; Smith, D.G.A.; Simmonett, A.C.; DePrince, A.E.I.; Hohenstein, E.G.; Bozkaya, U.; Sokolov, A.Y.; Di Remigio, R.; Richard, R.M.; et al. Psi4 1.1: An Open-Source Electronic Structure Program Emphasizing Automation, Advanced Libraries, and Interoperability. J. Chem. Theory Comput. 2017, 13, 3185–3197. [Google Scholar] [CrossRef] [PubMed]
  40. Sun, Q.; Berkelbach, T.C.; Blunt, N.S.; Booth, G.H.; Guo, S.; Li, Z.; Liu, J.; McClain, J.D.; Sayfutyarova, E.R.; Sharma, S.; et al. PySCF: The Python-based simulations of chemistry framework. WIREs Comput. Mol. Sci. 2018, 8, e1340. [Google Scholar] [CrossRef]
  41. Kühne, T.D.; Iannuzzi, M.; Del Ben, M.; Rybkin, V.V.; Seewald, P.; Stein, F.; Laino, T.; Khaliullin, R.Z.; Schütt, O.; Schiffmann, F.; et al. CP2K: An electronic structure and molecular dynamics software package - Quickstep: Efficient and accurate electronic structure calculations. J. Chem. Phys. 2020, 152, 194103. [Google Scholar] [CrossRef] [PubMed]
  42. Seritan, S.; Bannwarth, C.; Fales, B.S.; Hohenstein, E.G.; Isborn, C.M.; Kokkila-Schumacher, S.I.L.; Li, X.; Liu, F.; Luehr, N.; Snyder, J.W.; et al. TeraChem: A graphical processing unit-accelerated electronic structure package for large-scale ab initio molecular dynamics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2021, 11, e1494. [Google Scholar] [CrossRef]
  43. Wang, Y.; Hait, D.; Johnson, K.G.; Fajen, O.J.; Zhang, J.H.; Guerrero, R.D.; Martínez, T.J. Extending GPU-accelerated Gaussian integrals in the TeraChem software package to f type orbitals: Implementation and applications. J. Chem. Phys. 2024, 161, 174118. [Google Scholar] [CrossRef] [PubMed]
  44. Miao, Y.; Merz, K.M., Jr. Acceleration of electron repulsion integral evaluation on graphics processing units via use of recurrence relations. J. Chem. Theory Comput. 2013, 9, 965–976. [Google Scholar] [CrossRef]
  45. Li, R.; Sun, Q.; Zhang, X.; Chan, G.K.L. Introducing GPU acceleration into the python-based simulations of chemistry framework. J. Phys. Chem. A 2025, 129, 1459–1468. [Google Scholar] [CrossRef]
  46. Wu, X.; Sun, Q.; Pu, Z.; Zheng, T.; Ma, W.; Yan, W.; Yu, X.; Wu, Z.; Huo, M.; Li, X.; et al. Enhancing GPU-acceleration in the Python-based Simulations of Chemistry Framework. arXiv 2024, arXiv:2404.09452. [Google Scholar] [CrossRef]
Figure 1. Data structure of primitive shells.
Figure 2. ERI task matrix based on the primitive shell data structure for GPU parallelization.
Figure 3. Distributed atomic reduction method through the replicated Fock matrix update.
Figure 4. Global memory usage required for the replicated Fock matrices.
Table 1. Reduction rate of the atomicAdd operations achieved by thread-local reduction and the associated increase in register usage.

Destination                       [ss|ss]  [ss|sp]  [ss|pp]  [sp|sp]  [sp|pp]  [pp|pp]
Reduction rate   F_μν             1        1/3      1/9      1/3      1/9      1/9
                 F_λσ             1        1        1        1/3      1/3      1/9
                 F_μλ             1        1/3      1/3      1/9      1/9      1/9
                 F_νσ             1        1        1/3      1        1/3      1/9
                 F_μσ             1        1        1/3      1/3      1/9      1/9
                 F_νλ             1        1/3      1/3      1/3      1/3      1/9
#Additional register variables    0        3        16       16       36       54
Table 2. Converged RHF energies using the distributed atomic reduction method with different numbers of Fock replicas.

Compound                          Azobenzene           Rivastigmine         Penicillin G
Molecular Formula                 C12H10N2             C14H22N2O2           C16H18N2O4S
Basis Set                         6-311G               6-31G                6-31G
#Fock replicas for DAR (N_Fock)
  1                               −568.970183986283    −801.229482107621    −1421.744965380880
  2                               −568.970183986282    −801.229482107608    −1421.744965380859
  4                               −568.970183986281    −801.229482107605    −1421.744965380850
  8                               −568.970183986281    −801.229482107604    −1421.744965380848
  16                              −568.970183986281    −801.229482107604    −1421.744965380846
  32                              −568.970183986281    −801.229482107603    −1421.744965380848
  64                              −568.970183986281    −801.229482107603    −1421.744965380846
  128                             −568.970183986281    −801.229482107602    −1421.744965380848
  256                             −568.970183986281    −801.229482107604    −1421.744965380847
max(|ΔE_RHF|)                     2.0 × 10^−12         1.9 × 10^−11         3.4 × 10^−11
Table 3. Computation time for the Fock matrix construction of small molecules with the 6-31G basis set (in milliseconds).

Compound                          Rivastigmine         Penicillin G         ATP
Molecular Formula                 C14H22N2O2           C16H18N2O4S          C10H16N5O13P3
#Basis Functions M                206                  247                  323
#Primitive Shells N               340                  406                  534
Reduction Method                  DAR    TLR + DAR     DAR    TLR + DAR     DAR    TLR + DAR
#Fock replicas for DAR (N_Fock)
  1                               251    152           462    276           1232   699
  2                               183    131           349    243           888    600
  4                               150    120           287    224           704    551
  8                               132    116           243    216           606    531
  16                              121    115           222    213           547    525
  32                              113    115           208    212           563    525
  64                              109    115           282    215           937    559
  128                             169    121           350    231           982    597
  256                             187    130           383    265           1041   693
Speedup rate                      2.30   1.32          2.22   1.30          2.25   1.33
Table 4. Computation time for the Fock matrix construction of medium-sized molecules with the STO-3G basis set (in seconds).

Compound                          Paclitaxel           Valinomycin          Cyclosporine
Molecular Formula                 C47H51NO14           C54H90N6O18          C62H111N11O12
#Basis Functions M                361                  480                  536
#Primitive Shells N               711                  972                  1098
Reduction Method                  DAR    TLR + DAR     DAR    TLR + DAR     DAR    TLR + DAR
#Fock replicas for DAR (N_Fock)
  1                               11.85  5.43          11.93  6.34          22.65  11.63
  2                               7.37   4.01          7.91   5.00          14.68  8.76
  4                               5.02   3.24          5.79   4.23          10.42  7.19
  8                               3.78   2.86          4.68   3.86          8.14   6.47
  16                              3.16   2.74          4.11   3.74          7.78   6.31
  32                              5.27   2.85          6.69   4.20          11.47  7.16
  64                              5.51   3.12          6.81   4.37          11.63  7.43
  128                             5.65   3.31          6.90   4.56          11.75  7.73
  256                             5.79   3.68          7.18   5.10          12.20  8.56
Speedup rate                      3.75   1.98          2.90   1.70          2.91   1.84
Table 5. Profiling results of each CUDA kernel for the ERI task matrix in azobenzene (C12H10N2) with the 6-311G basis set.

ERI Type   #Fock Replicas N_Fock   Stall [cycles]   Warp [cycles]   Stall/Warp [%]   L2 Cache [GByte/s]
[ss|ss]    1                       5.3              12.9            41.2             467.75
           64                      3.0              9.3             31.9             689.54
           (change)                −43.4%           −27.9%          −9.3 pp          1.47×
[ss|sp]    1                       23.1             31.5            73.4             626.25
           64                      3.5              11.1            31.7             1763.09
           (change)                −84.8%           −64.8%          −41.7 pp         2.82×
[ss|pp]    1                       10.6             16.1            65.6             1003.10
           64                      2.9              8.5             34.2             1850.04
           (change)                −72.6%           −47.2%          −31.4 pp         1.84×
[sp|sp]    1                       26.5             33.4            79.2             1020.25
           64                      9.6              16.3            58.7             2169.42
           (change)                −63.8%           −51.2%          −20.5 pp         2.13×
[sp|pp]    1                       10.7             16.3            65.6             1182.57
           64                      2.6              8.3             31.6             2242.28
           (change)                −75.7%           −49.1%          −31.4 pp         1.90×
[pp|pp]    1                       4.8              12.5            38.0             1303.60
           64                      2.8              8.6             32.8             1811.86
           (change)                −41.7%           −31.2%          −5.2 pp          1.39×
Warp (Warp Cycles Per Issued Instruction): average #cycles each warp was resident per instruction issued. Stall (Stall Long Scoreboard): average #cycles each warp stalled waiting for scoreboard dependency. L2 Cache (L2 Cache Throughput): achieved L2 cache throughput in bytes per second.