Data Structures for Graphics Processing Units (GPUs)

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 January 2026 | Viewed by 8864

Special Issue Editors


Guest Editor
Department of Computer and Information Science, University of Mississippi, University, MS 38677, USA
Interests: hardware architecture and compilers for parallel and heterogeneous processors; GPU computing (GPGPU) and CPU-GPU heterogeneous computing

Special Issue Information

Dear Colleagues,

This Special Issue will delve into the innovative and rapidly evolving field of data structures tailored to graphics processing units (GPUs). GPUs, originally designed for rendering graphics, have emerged as powerful parallel processors, revolutionizing computational tasks across diverse domains. This Special Issue will explore the development and optimization of data structures that leverage the parallel processing capabilities of GPUs to achieve significant performance enhancements.

Contributors to this Special Issue will present cutting-edge research into a variety of GPU-optimized data structures, including, but not limited to, stacks, queues, trees, graphs, hash tables, and priority queues. The articles will highlight novel approaches to memory management, data access patterns, and algorithmic modifications that harness the massive parallelism of GPUs. Furthermore, this Special Issue will address practical challenges, such as synchronization, load balancing, and efficient data transfer between CPU and GPU memory.

By featuring both theoretical advancements and practical implementations, this Special Issue will bridge the gap between traditional CPU-centric data structures and their GPU-optimized counterparts. Readers will gain insights into the latest techniques for maximizing GPU performance, making this Special Issue an essential resource for researchers and practitioners seeking to exploit the full potential of GPU computing for data-intensive applications.

Dr. Byunghyun Jang
Prof. Dr. Juan A. Gómez-Pulido
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • GPU computing
  • concurrent data structures
  • GPGPU
  • parallel computing

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (8 papers)


Research

20 pages, 4896 KB  
Article
GPU-Driven Acceleration of Wavelet-Based Autofocus for Practical Applications in Digital Imaging
by HyungTae Kim, Duk-Yeon Lee, Dongwoon Choi and Dong-Wook Lee
Appl. Sci. 2025, 15(19), 10455; https://doi.org/10.3390/app151910455 - 26 Sep 2025
Abstract
A parallel implementation of wavelet-based autofocus (WBA) was presented to accelerate recursive operations and reduce computational costs. WBA evaluates digital focus indices (DFIs) using first- or second-order moments of the wavelet coefficients in high-frequency subbands. WBA is generally accurate and reliable; however, its computational cost is high owing to biorthogonal decomposition. Thus, this study parallelized the Daubechies-6 wavelet and norms of the high-frequency subbands for the DFI. The kernels of the DFI computation were constructed using open sources for driving multicore processors (MCPs) and general processing units (GPUs). The standard C++, OpenCV, OpenMP, OpenCL, and CUDA open-source platforms were selected to construct the DFI kernels, considering hardware compatibility. The experiment was conducted using the MCP, peripheral GPUs, and CPU-resident GPUs on desktops for advanced users and compact devices for industrial applications. The results demonstrated that the GPUs provided sufficient performance to achieve WBA even when using budget GPUs, indicating that the GPUs are advantageous for practical applications of WBA. This study also implies that although budget GPUs are left unused, they can potentially be great resources for wavelet-based processing.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
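The core idea of a wavelet-based focus index can be sketched in a few lines: transform the signal, keep only the high-frequency (detail) coefficients, and take their second-order moment. The sketch below uses a single-level 1-D Haar transform for brevity; the paper itself parallelizes the Daubechies-6 wavelet on GPUs, and the sample signals are hypothetical.

```python
# Illustrative sketch: a digital focus index (DFI) as the energy of the
# high-frequency wavelet subband. Single-level 1-D Haar for brevity
# (the paper parallelizes Daubechies-6 on GPUs).
def haar_highpass(signal):
    """Detail (high-frequency) coefficients of one Haar level."""
    return [(signal[i] - signal[i + 1]) / 2.0
            for i in range(0, len(signal) - 1, 2)]

def focus_index(signal):
    """Second-order moment (mean energy) of the high-frequency subband."""
    detail = haar_highpass(signal)
    return sum(d * d for d in detail) / len(detail)

# A sharper (higher-contrast) signal yields a larger focus index.
sharp = [0, 10, 0, 10, 0, 10, 0, 10]
blurry = [4, 6, 4, 6, 4, 6, 4, 6]
assert focus_index(sharp) > focus_index(blurry)
```

In an autofocus loop, this index would be evaluated per candidate lens position and the position maximizing it selected; the GPU work in the paper parallelizes exactly this per-pixel transform and reduction.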

17 pages, 1460 KB  
Article
GPU-Accelerated High-Efficiency PSO with Initialization and Thread Self-Adaptation
by Zhikun Liu, Jia Wu, Bolei Dong and Ye Liu
Appl. Sci. 2025, 15(19), 10429; https://doi.org/10.3390/app151910429 - 25 Sep 2025
Abstract
Particle Swarm Optimization (PSO) is a widely used heuristic algorithm valued for its simplicity and robustness in solving diverse optimization problems. However, its high computational cost often limits large-scale applications. With the rapid development of parallel computing and Graphics Processing Units (GPUs), researchers have increasingly leveraged these technologies to enhance PSO efficiency. This paper introduces a High-Efficiency PSO (HEPSO) algorithm designed for GPU-based architectures. HEPSO improves computational performance through two key strategies: (1) transferring data initialization from the CPU to the GPU to reduce I/O overhead caused by repeated data migration, and (2) incorporating a self-adaptive thread management mechanism to enhance execution efficiency. Experiments conducted on nine benchmark optimization functions demonstrate that HEPSO achieves more than a sixfold speedup compared to conventional GPU-PSO. Moreover, in terms of convergence time, HEPSO requires only about one-third of the runtime in most cases.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
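For readers unfamiliar with the baseline being accelerated, the serial PSO update rule is compact enough to show directly. This is a minimal sketch with hypothetical parameters on the sphere function; GPU variants such as HEPSO parallelize the per-particle loop, and the paper's GPU-side initialization and thread self-adaptation are not modeled here.

```python
# Minimal serial PSO on the sphere function f(x) = sum(x_i^2).
# Parameters (w, c1, c2, swarm size) are illustrative choices only.
import random

def pso(dim=2, particles=20, iters=200, seed=1):
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5          # inertia and acceleration weights
    f = lambda x: sum(v * v for v in x)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]        # per-particle best positions
    gbest = min(pbest, key=f)[:]       # global best position
    for _ in range(iters):
        for i in range(particles):     # GPU versions map this loop to threads
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest

best = pso()
assert sum(v * v for v in best) < 1e-2   # converges near the origin
```

The inner per-particle loop is embarrassingly parallel, which is why the dominant cost in GPU implementations shifts to data movement and thread scheduling, the two overheads HEPSO targets.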

19 pages, 1845 KB  
Article
GPU-Accelerated PSO for High-Performance American Option Valuation
by Leon Xing Li and Ren-Raw Chen
Appl. Sci. 2025, 15(18), 9961; https://doi.org/10.3390/app15189961 - 11 Sep 2025
Viewed by 423
Abstract
Using artificial intelligence tools to evaluate financial derivatives has become increasingly popular. PSO (particle swarm optimization) is one such tool. We present a comprehensive study of PSO for pricing American options on GPUs using OpenCL. PSO is an increasingly popular heuristic for financial parameter search; however, its high computational cost (especially for path-dependent derivatives) poses a challenge. We review PSO-based pricing and survey prior GPU acceleration efforts. We then describe our OpenCL optimization pipeline on an Apple M3 Max GPU (OpenCL 1.2 via PyOpenCL 2024.1). Starting from a NumPy baseline (36.7 s), we apply successive enhancements: an initial GPU offload (8.0 s), restructuring loops (forward/backward) to minimize divergence (2.3 s → 0.95 s), kernel fusion (0.94 s), and explicit SIMD vectorization (float4) (0.25 s). The fully fused float4 kernel achieves 0.246 s, a ~150X speedup over CPU. We analyzed all eight intermediate kernels (named by file), detailing techniques (memory coalescing, branch avoidance, etc.) and their effects on throughput. Our results exceed prior art in speed and vector efficiency, illustrating the power of combined OpenCL strategies.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))

12 pages, 1880 KB  
Article
Feasibility of Implementing Motion-Compensated Magnetic Resonance Imaging Reconstruction on Graphics Processing Units Using Compute Unified Device Architecture
by Mohamed Aziz Zeroual, Natalia Dudysheva, Vincent Gras, Franck Mauconduit, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
Appl. Sci. 2025, 15(11), 5840; https://doi.org/10.3390/app15115840 - 22 May 2025
Viewed by 513
Abstract
Motion correction in magnetic resonance imaging (MRI) has become increasingly complex due to the high computational demands of iterative reconstruction algorithms and the heterogeneity of emerging computing platforms. However, the clinical applicability of these methods requires fast processing to ensure rapid and accurate diagnostics. Graphics processing units (GPUs) have demonstrated substantial performance gains in various reconstruction tasks. In this work, we present a GPU implementation of the reconstruction kernel for the generalized reconstruction by inversion of coupled systems (GRICS), an iterative joint optimization approach that enables 3D high-resolution image reconstruction with motion correction. Three implementations were compared: (i) a C++ CPU version, (ii) a Matlab–GPU version (with minimal code modifications allowing data storage in GPU memory), and (iii) a native GPU version using CUDA. Six distinct datasets, including various motion types, were tested. The results showed that the Matlab–GPU approach achieved speedups ranging from 1.2× to 2.0× compared to the CPU implementation, whereas the native CUDA version attained speedups of 9.7× to 13.9×. Across all datasets, the normalized root mean square error (NRMSE) remained on the order of 10⁻⁶ to 10⁻⁴, indicating that the CUDA-accelerated method preserved image quality. Furthermore, a roofline analysis was conducted to quantify the kernel's performance on one of the evaluated datasets. The kernel achieved 250 GFLOP/s, representing a 15.6× improvement over the performance of the Matlab–GPU version. These results confirm that GPU-based implementations of GRICS can drastically reduce reconstruction times while maintaining diagnostic fidelity, paving the way for more efficient clinical motion-compensated MRI workflows.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
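The roofline analysis mentioned above has a simple arithmetic core: a kernel's attainable throughput is capped either by peak compute or by memory bandwidth times its arithmetic intensity (FLOPs per byte moved). The sketch below uses hypothetical peak numbers, not the paper's measurements.

```python
# Roofline model sketch (illustrative hardware numbers, not the paper's):
# attainable GFLOP/s = min(compute roof, intensity * bandwidth roof).
def attainable_gflops(intensity, peak_gflops, peak_bw_gb_s):
    """Performance cap for a kernel with the given arithmetic intensity."""
    return min(peak_gflops, intensity * peak_bw_gb_s)

peak_gflops = 10000.0   # hypothetical GPU peak compute, GFLOP/s
peak_bw = 1000.0        # hypothetical memory bandwidth, GB/s
ridge = peak_gflops / peak_bw   # intensity where the two roofs meet

# Below the ridge point the kernel is memory-bound...
assert attainable_gflops(1.0, peak_gflops, peak_bw) == 1000.0
# ...above it, compute-bound.
assert attainable_gflops(2 * ridge, peak_gflops, peak_bw) == peak_gflops
```

Placing a measured kernel (such as the GRICS reconstruction kernel) against these two roofs tells you whether further optimization effort should target data movement or arithmetic.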

24 pages, 740 KB  
Article
GPU-Accelerated Fock Matrix Computation with Efficient Reduction
by Satoki Tsuji, Yasuaki Ito, Haruto Fujii, Nobuya Yokogawa, Kanta Suzuki, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(9), 4779; https://doi.org/10.3390/app15094779 - 25 Apr 2025
Viewed by 762
Abstract
In quantum chemistry, constructing the Fock matrix is essential to compute Coulomb interactions among atoms and electrons and, thus, to determine electron orbitals and densities. In fundamental frameworks of quantum chemistry such as the Hartree–Fock method, the iterative computation of the Fock matrix is a dominant process, constituting a critical computational bottleneck. Although the Fock matrix computation has been accelerated by parallel processing using GPUs, the issue of performance degradation due to memory contention remains unresolved. This is due to frequent conflicts of atomic operations accessing the same memory addresses when multiple threads update the Fock matrix elements concurrently. To address this issue, we propose a parallel algorithm that efficiently distributes the atomic operations and significantly reduces memory contention by decomposing the Fock matrix into multiple replicas, allowing each GPU thread to contribute to different replicas. Experimental results using a relevant set of molecules on an NVIDIA A100 GPU show that our approach achieves up to a 3.75× speedup in Fock matrix computation compared to conventional high-contention approaches. Furthermore, our proposed method can also be readily combined with existing implementations that reduce the number of atomic operations, leading to a 1.98× improvement.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
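The replica idea generalizes beyond Fock matrices: instead of every worker atomically updating one shared accumulator, each worker updates one of several private replicas, which are summed in a final reduction. The serial sketch below illustrates the data layout only; the worker-to-replica mapping is a hypothetical round-robin, and the GPU atomics it replaces are not modeled.

```python
# Contention-reduction sketch: scatter updates across replicas, then
# reduce. On a GPU this trades atomic conflicts on one array for extra
# memory and a cheap final element-wise sum.
def replicated_accumulate(updates, size, n_replicas=4):
    """updates: list of (index, value) pairs, one per worker."""
    replicas = [[0.0] * size for _ in range(n_replicas)]
    for worker_id, (index, value) in enumerate(updates):
        # Each worker writes to "its" replica (round-robin mapping here),
        # so concurrent workers would rarely touch the same address.
        replicas[worker_id % n_replicas][index] += value
    # Final reduction: sum the replicas element-wise.
    return [sum(rep[i] for rep in replicas) for i in range(size)]

# Three workers hit index 0; with replicas they no longer collide.
updates = [(0, 1.0), (0, 2.0), (1, 3.0), (0, 4.0)]
assert replicated_accumulate(updates, size=2) == [7.0, 3.0]
```

The trade-off is memory: replica count must balance contention relief against the O(replicas × matrix size) storage cost, which is why the paper tunes how threads are distributed over replicas.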

30 pages, 1684 KB  
Article
Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations
by Haruto Fujii, Yasuaki Ito, Nobuya Yokogawa, Kanta Suzuki, Satoki Tsuji, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(5), 2572; https://doi.org/10.3390/app15052572 - 27 Feb 2025
Cited by 1 | Viewed by 1306
Abstract
Quantum chemistry offers the formal machinery to derive molecular and physical properties arising from (sub)atomic interactions. However, as molecules of practical interest are largely polyatomic, contemporary approximation schemes such as the Hartree–Fock scheme are computationally expensive due to the large number of electron repulsion integrals (ERIs). Central to the Hartree–Fock method is the efficient computation of ERIs over Gaussian functions (GTO-ERIs). Here, the well-known McMurchie–Davidson method (MD) offers an elegant formalism by incrementally extending Hermite Gaussian functions and auxiliary tabulated functions. Although the MD method offers a high degree of versatility to acceleration schemes through Graphics Processing Units (GPUs), the current GPU implementations limit the practical use of supported values of the azimuthal quantum number. In this paper, we propose a generalized framework capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms of the MD method can be stored. Our approach benefits from extending the MD recurrence relations through shells, batches, and triple-buffering of the shared memory, and ordering similar ERIs, thus enabling the effective parallelization and use of GPU resources. Furthermore, our approach proposes four GPU implementation schemes considering the suitable mappings between Gaussian basis and CUDA blocks and threads. Our computational experiments involving the GTO-ERI computations of molecules of interest on an NVIDIA A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA) have revealed the merits of the proposed acceleration schemes in terms of computation time, including up to a 72× improvement over our previous GPU implementation and up to a 4500× speedup compared to a naive CPU implementation, highlighting the effectiveness of our method in accelerating ERI computations for both monatomic and polyatomic molecules. Our work has the potential to explore new parallelization schemes of distinct and complex computation paths involved in ERI computation.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
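Gaussian-basis ERI schemes such as McMurchie–Davidson rest on the Gaussian product theorem: the product of two Gaussians is itself a Gaussian centered between them, which is what makes the Hermite-Gaussian recurrences tractable. The sketch below verifies the 1-D identity numerically; the exponents and centers are arbitrary illustration values.

```python
# Gaussian product theorem sketch: exp(-a(x-A)^2) * exp(-b(x-B)^2)
# equals K * exp(-p(x-P)^2) with p = a+b, P the weighted midpoint,
# and K a pre-exponential factor. Values below are illustrative.
import math

def gaussian_product(a, A, b, B):
    """Combine two 1-D Gaussian exponents/centers into one Gaussian."""
    p = a + b
    P = (a * A + b * B) / p                    # exponent-weighted midpoint
    K = math.exp(-a * b / p * (A - B) ** 2)    # pre-exponential factor
    return p, P, K

a, A, b, B = 0.5, 0.0, 0.8, 1.0
p, P, K = gaussian_product(a, A, b, B)
for x in (-1.0, 0.3, 2.5):
    lhs = math.exp(-a * (x - A) ** 2) * math.exp(-b * (x - B) ** 2)
    rhs = K * math.exp(-p * (x - P) ** 2)
    assert abs(lhs - rhs) < 1e-12              # identity holds pointwise
```

Because every pairwise product collapses to a single Gaussian this way, four-center ERIs reduce to sums over Hermite Gaussians at two composite centers, the structure the MD recurrences (and the paper's GPU mappings) exploit.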

20 pages, 899 KB  
Article
Boundary-Aware Concurrent Queue: A Fast and Scalable Concurrent FIFO Queue on GPU Environments
by Md. Sabbir Hossain Polak, David A. Troendle and Byunghyun Jang
Appl. Sci. 2025, 15(4), 1834; https://doi.org/10.3390/app15041834 - 11 Feb 2025
Viewed by 1352
Abstract
This paper presents Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs, which focuses on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. A key to BACQ's design is its ability to replace conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and supports infinite growth of the head and tail across its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue's state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking results show that BACQ outperforms the BWD (Broker Queue Work Distributor), the fastest known GPU queue, by more than 2× while preserving FIFO semantics. The paper demonstrates BACQ's superior performance through real-world empirical evaluations.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
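The "infinite growth of head and tail across a ring buffer" idea is easiest to see in serial form: tickets increase monotonically and are mapped onto a fixed ring by modulo indexing, so slot reuse never reorders items. This is only an illustration of the ticketing scheme; BACQ's warp-level leader coordination, priority adjustment, and virtual caching are GPU-specific and not modeled here.

```python
# Serial sketch of a ticket-ordered ring-buffer FIFO. On a GPU the
# head/tail increments would be atomic fetch-and-adds performed by
# warp leaders; here a single thread stands in for them.
class TicketQueue:
    def __init__(self, capacity):
        self.ring = [None] * capacity
        self.capacity = capacity
        self.head = 0          # next ticket to dequeue (grows forever)
        self.tail = 0          # next ticket to enqueue (grows forever)

    def enqueue(self, item):
        if self.tail - self.head == self.capacity:
            return False       # full: the boundary case BACQ prioritizes
        self.ring[self.tail % self.capacity] = item
        self.tail += 1
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None        # empty
        item = self.ring[self.head % self.capacity]
        self.head += 1
        return item

q = TicketQueue(2)
assert q.enqueue(1) and q.enqueue(2) and not q.enqueue(3)
assert q.dequeue() == 1 and q.enqueue(3)   # ticket 2 wraps onto slot 0
assert [q.dequeue(), q.dequeue()] == [2, 3]
```

Because tickets never reset, comparing `tail - head` against capacity gives exact full/empty detection without ambiguous wrap-around flags, which is what makes fair FIFO ordering cheap to enforce.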

21 pages, 6218 KB  
Article
Multi-GPU Acceleration for Finite Element Analysis in Structural Mechanics
by David Herrero-Pérez and Humberto Martínez-Barberá
Appl. Sci. 2025, 15(3), 1095; https://doi.org/10.3390/app15031095 - 22 Jan 2025
Viewed by 2787
Abstract
This work evaluates the computing performance of finite element analysis in structural mechanics using modern multi-GPU systems. Using multiple GPUs for scientific computing avoids the usual memory limitations of many-core computing on a single GPU device. We use a GPU-aware MPI approach implementing a suitable smoothed aggregation multigrid for preconditioning an iterative distributed conjugate gradient solver for GPU computing. We evaluate the performance and scalability of different models, problem sizes, and computing resources. We take an efficient multi-core implementation as the reference to assess the computing performance of the numerical results. The numerical results show the advantages and limitations of using distributed many-core architectures to address structural mechanics problems.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
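The solver being distributed here is the classic conjugate gradient iteration for symmetric positive definite systems. The minimal serial, unpreconditioned sketch below shows the kernel structure; in the paper's setting the matrix-vector products and dot products are distributed across GPUs via MPI and preconditioned with smoothed-aggregation multigrid, none of which is modeled here. The example matrix is an arbitrary 2×2 SPD illustration.

```python
# Minimal serial conjugate gradient for an SPD system Ax = b.
# The matvec and dot products are the operations a multi-GPU solver
# distributes; the preconditioner is omitted for brevity.
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-12, max_iters=100):
    x = [0.0] * len(b)
    r = b[:]                   # residual r = b - A*x with x = 0
    p = r[:]                   # initial search direction
    rs = dot(r, r)
    for _ in range(max_iters):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
assert all(abs(v - t) < 1e-6 for v, t in zip(matvec(A, x), b))
```

Each iteration needs one matvec, two dot products, and three vector updates; the global reductions in the dot products are the main communication cost a GPU-aware MPI implementation must hide.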
