Search Results (16)

Search Parameters:
Keywords = CUDA-C code

12 pages, 1880 KB  
Article
Feasibility of Implementing Motion-Compensated Magnetic Resonance Imaging Reconstruction on Graphics Processing Units Using Compute Unified Device Architecture
by Mohamed Aziz Zeroual, Natalia Dudysheva, Vincent Gras, Franck Mauconduit, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
Appl. Sci. 2025, 15(11), 5840; https://doi.org/10.3390/app15115840 - 22 May 2025
Viewed by 916
Abstract
Motion correction in magnetic resonance imaging (MRI) has become increasingly complex due to the high computational demands of iterative reconstruction algorithms and the heterogeneity of emerging computing platforms. However, the clinical applicability of these methods requires fast processing to ensure rapid and accurate diagnostics. Graphics processing units (GPUs) have demonstrated substantial performance gains in various reconstruction tasks. In this work, we present a GPU implementation of the reconstruction kernel for the generalized reconstruction by inversion of coupled systems (GRICS), an iterative joint optimization approach that enables 3D high-resolution image reconstruction with motion correction. Three implementations were compared: (i) a C++ CPU version, (ii) a Matlab–GPU version (with minimal code modifications allowing data storage in GPU memory), and (iii) a native GPU version using CUDA. Six distinct datasets, including various motion types, were tested. The results showed that the Matlab–GPU approach achieved speedups ranging from 1.2× to 2.0× compared to the CPU implementation, whereas the native CUDA version attained speedups of 9.7× to 13.9×. Across all datasets, the normalized root mean square error (NRMSE) remained on the order of 10⁻⁶ to 10⁻⁴, indicating that the CUDA-accelerated method preserved image quality. Furthermore, a roofline analysis was conducted to quantify the kernel's performance on one of the evaluated datasets. The kernel achieved 250 GFLOP/s, representing a 15.6× improvement over the performance of the Matlab–GPU version. These results confirm that GPU-based implementations of GRICS can drastically reduce reconstruction times while maintaining diagnostic fidelity, paving the way for more efficient clinical motion-compensated MRI workflows.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
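As a rough illustration of the style of kernel such a CUDA reconstruction launches many times per iteration, here is a minimal sketch of an elementwise complex coil-weighting step; the kernel, names, and sizes are ours, not GRICS code.

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cstdio>

// Hypothetical elementwise step: weight an image x by coil
// sensitivities s, the sort of kernel an iterative MRI solver
// launches on every conjugate-gradient iteration.
__global__ void coilWeight(const cuFloatComplex* x, const cuFloatComplex* s,
                           cuFloatComplex* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = cuCmulf(s[i], x[i]);
}

int main()
{
    const int n = 1 << 20;
    cuFloatComplex *x, *s, *y;
    cudaMallocManaged(&x, n * sizeof(*x));   // unified memory for brevity
    cudaMallocManaged(&s, n * sizeof(*s));
    cudaMallocManaged(&y, n * sizeof(*y));
    for (int i = 0; i < n; ++i) {
        x[i] = make_cuFloatComplex(1.f, 0.f);
        s[i] = make_cuFloatComplex(0.f, 1.f);
    }
    coilWeight<<<(n + 255) / 256, 256>>>(x, s, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = (%f, %f)\n", cuCrealf(y[0]), cuCimagf(y[0]));
    cudaFree(x); cudaFree(s); cudaFree(y);
    return 0;
}
```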

35 pages, 11134 KB  
Article
Error Classification and Static Detection Methods in Tri-Programming Models: MPI, OpenMP, and CUDA
by Saeed Musaad Altalhi, Fathy Elbouraey Eassa, Sanaa Abdullah Sharaf, Ahmed Mohammed Alghamdi, Khalid Ali Almarhabi and Rana Ahmad Bilal Khalid
Computers 2025, 14(5), 164; https://doi.org/10.3390/computers14050164 - 28 Apr 2025
Viewed by 1573
Abstract
The growing adoption of supercomputers across various scientific disciplines, particularly by researchers without a background in computer science, has intensified the demand for parallel applications. These applications are typically developed using a combination of programming models within languages such as C, C++, and Fortran. However, modern multi-core processors and accelerators necessitate fine-grained control to achieve effective parallelism, complicating the development process. To address this, developers commonly utilize high-level programming models such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACC), Message Passing Interface (MPI), and Compute Unified Device Architecture (CUDA). These models may be used independently or combined into dual- or tri-model applications to leverage their complementary strengths. However, integrating multiple models introduces subtle and difficult-to-detect runtime errors such as data races, deadlocks, and livelocks that often elude conventional compilers. This complexity is exacerbated in applications that simultaneously incorporate MPI, OpenMP, and CUDA, where the origin of runtime errors, whether from individual models, user logic, or their interactions, becomes ambiguous. Moreover, existing tools are inadequate for detecting such errors in tri-model applications, leaving a critical gap in development support. To address this gap, the present study introduces a static analysis tool designed specifically for tri-model applications combining MPI, OpenMP, and CUDA in C++-based environments. The tool analyzes source code to identify both actual and potential runtime errors prior to execution. Central to this approach is the introduction of error dependency graphs, a novel mechanism for systematically representing and analyzing error correlations in hybrid applications. By offering both error classification and comprehensive static detection, the proposed tool enhances error visibility and reduces manual testing effort. This contributes significantly to the development of more robust parallel applications for high-performance computing (HPC) and future exascale systems.
(This article belongs to the Special Issue Best Practices, Challenges and Opportunities in Software Engineering)
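For context, a minimal tri-model sketch (ours, not from the paper) containing exactly the kind of OpenMP data race such a static detector should flag; build details (mpicxx plus nvcc) are omitted.

```cuda
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* v, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    float sum = 0.f, *d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) d[i] = 1.f;

    scale<<<(n + 255) / 256, 256>>>(d, 2.f, n);   // CUDA level
    cudaDeviceSynchronize();

    // OpenMP level: DATA RACE -- 'sum' is shared and updated without
    // a reduction clause; this is what static analysis should report.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) sum += d[i];

    float total;                                   // MPI level
    MPI_Reduce(&sum, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f (nondeterministic)\n", total);

    cudaFree(d);
    MPI_Finalize();
    return 0;
}
```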

17 pages, 1369 KB  
Article
Enabling Parallel Performance and Portability of Solid Mechanics Simulations Across CPU and GPU Architectures
by Nathaniel Morgan, Caleb Yenusah, Adrian Diaz, Daniel Dunning, Jacob Moore, Erin Heilman, Evan Lieberman, Steven Walton, Sarah Brown, Daniel Holladay, Russell Marki, Robert Robey and Marko Knezevic
Information 2024, 15(11), 716; https://doi.org/10.3390/info15110716 - 7 Nov 2024
Cited by 3 | Viewed by 2104
Abstract
Efficiently simulating solid mechanics is vital across various engineering applications. As constitutive models grow more complex and simulations scale up in size, harnessing the capabilities of modern computer architectures has become essential for achieving timely results. This paper presents advancements in running parallel simulations of solid mechanics on multi-core CPUs and GPUs using a single-code implementation. This portability is made possible by the C++ matrix and array (MATAR) library, which interfaces with the C++ Kokkos library, enabling the selection of fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads, etc.) at compile time. MATAR simplifies the transition from Fortran to C++ and Kokkos, making it easier to modernize legacy solid mechanics codes. We applied this approach to modernize a suite of constitutive models and to demonstrate substantial performance improvements across different computer architectures. This paper includes comparative performance studies using multi-core CPUs along with AMD and NVIDIA GPUs. Results are presented using a hypoelastic–plastic model, a crystal plasticity model, and the viscoplastic self-consistent generalized material model (VPSC-GMM). The results underscore the potential of using the MATAR library and modern computer architectures to accelerate solid mechanics simulations.
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)
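To make the single-source idea concrete, here is a toy parallel_for wrapper in the spirit of such portability layers; this is emphatically not the MATAR or Kokkos API, only a sketch of the pattern (build with nvcc --extended-lambda).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

template <typename F>
__global__ void forEachKernel(int n, F f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

// Single-source loop: the backend (here a CUDA launch) is chosen once,
// in the wrapper, instead of in every loop of the application.
template <typename F>
void parallel_for(int n, F f)
{
    forEachKernel<<<(n + 255) / 256, 256>>>(n, f);
    cudaDeviceSynchronize();
}

int main()
{
    const int n = 1000;
    double* stress;
    cudaMallocManaged(&stress, n * sizeof(double));
    // A constitutive-update-style loop written once, portable in spirit.
    parallel_for(n, [=] __device__ (int i) { stress[i] = 2.0 * i; });
    printf("stress[%d] = %f\n", n - 1, stress[n - 1]);
    cudaFree(stress);
    return 0;
}
```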

24 pages, 830 KB  
Article
On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures
by Nathaniel Morgan, Caleb Yenusah, Adrian Diaz, Daniel Dunning, Jacob Moore, Erin Heilman, Calvin Roth, Evan Lieberman, Steven Walton, Sarah Brown, Daniel Holladay, Marko Knezevic, Gavin Whetstone, Zachary Baker and Robert Robey
Information 2024, 15(11), 673; https://doi.org/10.3390/info15110673 - 28 Oct 2024
Cited by 4 | Viewed by 4278
Abstract
This paper presents software advances to easily exploit computer architectures consisting of a multi-core CPU and CPU+GPU to accelerate diverse types of high-performance computing (HPC) applications using a single code implementation. The paper describes and demonstrates the performance of the open-source C++ matrix and array (MATAR) library that uniquely offers: (1) a straightforward syntax for programming productivity, (2) usable data structures for data-oriented programming (DOP) for performance, and (3) a simple interface to the open-source C++ Kokkos library for portability and memory management across CPUs and GPUs. The portability across architectures with a single code implementation is achieved by automatically switching between diverse fine-grained parallelism backends (e.g., CUDA, HIP, OpenMP, pthreads, etc.) at compile time. The MATAR library solves many longstanding challenges associated with easily writing software that can run in parallel on any computer architecture. This work benefits projects seeking to write new C++ codes while also addressing the challenges of quickly making existing Fortran codes performant and portable over modern computer architectures with minimal syntactical changes from Fortran to C++. We demonstrate the feasibility of readily writing new C++ codes and modernizing existing codes with MATAR to be performant, parallel, and portable across diverse computer architectures.
(This article belongs to the Special Issue Advances in High Performance Computing and Scalable Software)
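A small sketch of the structure-of-arrays layout that data-oriented programming favors on GPUs (our example, not MATAR code): consecutive threads read consecutive addresses, so loads coalesce.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Structure-of-arrays: one contiguous array per field, rather than
// an array of {x, y, z} structs, so thread i touches nodes.x[i]
// right next to thread i+1's nodes.x[i+1].
struct NodesSoA { float *x, *y, *z; };

__global__ void translateX(NodesSoA nodes, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) nodes.x[i] += dx;   // fully coalesced access
}

int main()
{
    const int n = 1 << 16;
    NodesSoA nodes;
    cudaMallocManaged(&nodes.x, n * sizeof(float));
    cudaMallocManaged(&nodes.y, n * sizeof(float));
    cudaMallocManaged(&nodes.z, n * sizeof(float));
    for (int i = 0; i < n; ++i) nodes.x[i] = (float)i;
    translateX<<<(n + 255) / 256, 256>>>(nodes, 0.5f, n);
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", nodes.x[0]);
    cudaFree(nodes.x); cudaFree(nodes.y); cudaFree(nodes.z);
    return 0;
}
```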

18 pages, 6787 KB  
Article
An Implementation of LASER Beam Welding Simulation on Graphics Processing Unit Using CUDA
by Ernandes Nascimento, Elisan Magalhães, Arthur Azevedo, Luiz E. S. Paes and Ariel Oliveira
Computation 2024, 12(4), 83; https://doi.org/10.3390/computation12040083 - 17 Apr 2024
Cited by 4 | Viewed by 2573
Abstract
The maximum number of parallel threads in traditional CFD solutions is limited by the Central Processing Unit (CPU) capacity, which is lower than the capabilities of a modern Graphics Processing Unit (GPU). In this context, the GPU allows for simultaneous processing of several parallel threads with double-precision floating-point formatting. The present study was focused on evaluating the advantages and drawbacks of implementing LASER Beam Welding (LBW) simulations using the CUDA platform. The performance of the developed code was compared to that of three top-rated commercial codes executed on the CPU. The unsteady three-dimensional heat conduction Partial Differential Equation (PDE) was discretized in space and time using the Finite Volume Method (FVM). The Volumetric Thermal Capacitor (VTC) approach was employed to model the melting-solidification. The GPU solutions were computed using a CUDA-C language in-house code, running on a Gigabyte Nvidia GeForce RTX 3090 video card and an MSI 4090 video card (both made in Hsinchu, Taiwan), each with 24 GB of memory. The commercial solutions were executed on an Intel® Core i9-12900KF CPU (made in Hillsboro, Oregon, United States of America) with a 3.6 GHz base clock and 16 cores. The results demonstrated that GPU and CPU processing achieve similar precision, but the GPU solution exhibited significantly faster speeds and greater power efficiency, resulting in speed-ups ranging from 75.6 to 1351.2 times compared to the CPU solutions. The in-house code also demonstrated optimized memory usage, with an average of 3.86 times less RAM utilization. Therefore, adopting parallelized algorithms run on GPU can lead to reduced CFD computational costs compared to traditional codes while maintaining high accuracy.
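As a hedged illustration of the discretization described above, a minimal CUDA sketch of one explicit finite-volume step for heat conduction; the grid sizes, coefficients, and crude "laser spot" are ours, and the VTC melting model is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <utility>

// One explicit step of the 2D heat equation on a uniform grid:
// T_new = T + r * (sum of neighbors - 4T), with r = alpha*dt/h^2.
__global__ void heatStep(const double* T, double* Tn,
                         int nx, int ny, double r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        int c = j * nx + i;
        Tn[c] = T[c] + r * (T[c-1] + T[c+1] + T[c-nx] + T[c+nx] - 4.0*T[c]);
    }
}

int main()
{
    const int nx = 256, ny = 256;
    double *T, *Tn;
    cudaMallocManaged(&T,  nx * ny * sizeof(double));
    cudaMallocManaged(&Tn, nx * ny * sizeof(double));
    for (int k = 0; k < nx * ny; ++k) { T[k] = 300.0; Tn[k] = 300.0; }
    T[ny/2 * nx + nx/2] = 1800.0;              // crude "laser spot"
    dim3 b(16, 16), g((nx + 15) / 16, (ny + 15) / 16);
    for (int step = 0; step < 100; ++step) {
        heatStep<<<g, b>>>(T, Tn, nx, ny, 0.2);
        std::swap(T, Tn);                      // ping-pong buffers
    }
    cudaDeviceSynchronize();
    printf("T at spot after 100 steps: %f\n", T[ny/2 * nx + nx/2]);
    cudaFree(T); cudaFree(Tn);
    return 0;
}
```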

16 pages, 920 KB  
Article
An Architecture for a Tri-Programming Model-Based Parallel Hybrid Testing Tool
by Saeed Musaad Altalhi, Fathy Elbouraey Eassa, Abdullah Saad Al-Malaise Al-Ghamdi, Sanaa Abdullah Sharaf, Ahmed Mohammed Alghamdi, Khalid Ali Almarhabi and Maher Ali Khemakhem
Appl. Sci. 2023, 13(21), 11960; https://doi.org/10.3390/app132111960 - 1 Nov 2023
Cited by 5 | Viewed by 2651
Abstract
As the development of high-performance computing (HPC) grows, exascale computing is on the horizon. It is therefore imperative to develop parallel systems, such as graphics processing units (GPUs) and programming models, that can effectively utilise the powerful processing resources of exascale computing. A tri-level programming model comprising the message passing interface (MPI), compute unified device architecture (CUDA), and open multi-processing (OpenMP) models may significantly enhance the parallelism, performance, productivity, and programmability of heterogeneous architectures. However, the use of multiple programming models often leads to unexpected errors and behaviours at run-time, and such errors are difficult to detect in high-level parallel programming languages. This study therefore proposes a parallel hybrid testing tool that employs both static and dynamic testing techniques. The tool is designed to identify the run-time errors of C++ and MPI + OpenMP + CUDA systems by analysing the source code during run-time, thereby optimising the testing process and ensuring comprehensive error detection, and it is able to identify and categorise the run-time errors of tri-level programming models. As contemporary parallel testing tools cannot, at present, test software applications built on tri-level MPI + OpenMP + CUDA programming models, this study proposes the architecture of a parallel testing tool that detects run-time errors in such models.
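A minimal example (ours, not from the paper) of another run-time error class such a tool must detect: the classic MPI receive-before-send deadlock, where both ranks block forever.

```cpp
#include <mpi.h>
#include <cstdio>

// DEADLOCK sketch: with exactly 2 ranks, both post a blocking
// MPI_Recv before the matching MPI_Send, so neither can proceed.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, buf = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1 - rank;                    // assumes exactly 2 ranks

    // Blocking receive posted first on BOTH ranks -> deadlock.
    MPI_Recv(&buf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&rank, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

    printf("rank %d got %d (never printed)\n", rank, buf);
    MPI_Finalize();
    return 0;
}
```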

38 pages, 3886 KB  
Article
Multiple-Relaxation-Time Lattice Boltzmann Simulation of Soret and Dufour Effects on the Thermosolutal Natural Convection of a Nanofluid in a U-Shaped Porous Enclosure
by Md. Mahadul Islam, Md Farhad Hasan and Md. Mamun Molla
Energies 2023, 16(21), 7229; https://doi.org/10.3390/en16217229 - 24 Oct 2023
Cited by 17 | Viewed by 1836
Abstract
This article reports an investigation of the Soret and Dufour effects on the double-diffusive natural convection of Al₂O₃–H₂O nanofluids in a U-shaped porous enclosure. Numerical problems were resolved using the multiple-relaxation-time (MRT) lattice Boltzmann method (LBM). The indented part of the U-shape was cold, and the right and left walls were heated, while the bottom and upper walls were adiabatic. Experimental data-based temperature- and nanoparticle-size-dependent correlations for the Al₂O₃–water nanofluids are used here. The in-house graphics processing unit (GPU)-based compute unified device architecture (CUDA) C/C++ code is thoroughly validated against benchmark results. Numerical simulations were performed for a variety of dimensionless variables, including the Rayleigh number (Ra = 10⁴, 10⁵, 10⁶), the Darcy number (Da = 10⁻², 10⁻³, 10⁻⁴), the Soret number (Sr = 0.0, 0.1, 0.2), the Dufour number (Df = 0.0, 0.1, 0.2), the buoyancy ratio (−2 ≤ Br ≤ 2), the Lewis number (Le = 1, 3, 5), the volume fraction (0 ≤ ϕ ≤ 0.04), and the porosity (0.2 ≤ ϵ ≤ 0.8), while the Prandtl number is fixed at Pr = 6.2 to represent the base fluid (water). The numerical results are presented in terms of streamlines, isotherms, isoconcentrations, temperature, velocity, mean Nusselt number, mean Sherwood number, entropy generation, and a statistical analysis using a response surface methodology (RSM). The investigation found that fluid mobility was enhanced as the Ra number and buoyancy force increased. The isoconcentration and isotherm density close to the heated wall increased when the buoyancy force shifted from a negative magnitude to a positive one. The local Nu increased as the Rayleigh number increased but decreased as the volume fraction was augmented. Furthermore, the mean Nusselt number decreased by 3.12% and 6.81%, and the mean Sherwood number increased by 83.17% and 117.91%, with rising Lewis number for (Ra = 10⁵, Da = 10⁻³) and (Ra = 10⁶, Da = 10⁻⁴), respectively. Finally, Br and Sr demonstrated positive sensitivity, and Ra and ϕ showed negative sensitivity only for higher values of ϕ, based on the RSM.
(This article belongs to the Special Issue Research on Fluid Mechanics and Heat Transfer)
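For orientation, a compact CUDA sketch of a D2Q9 lattice Boltzmann collision kernel; single-relaxation-time BGK is shown for brevity, whereas the paper uses the more involved MRT operator, and all sizes are illustrative.

```cuda
#include <cuda_runtime.h>

// D2Q9 lattice constants: weights and discrete velocities.
__constant__ float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                            1.f/36, 1.f/36, 1.f/36, 1.f/36};
__constant__ int   cx[9] = {0, 1, 0,-1, 0, 1,-1,-1, 1};
__constant__ int   cy[9] = {0, 0, 1, 0,-1, 1, 1,-1,-1};

__global__ void collideBGK(float* f, int nx, int ny, float tau)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;
    int cell = j * nx + i, np = nx * ny;

    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int q = 0; q < 9; ++q) {              // macroscopic moments
        float fq = f[q * np + cell];
        rho += fq; ux += fq * cx[q]; uy += fq * cy[q];
    }
    ux /= rho; uy /= rho;

    for (int q = 0; q < 9; ++q) {              // relax toward equilibrium
        float cu  = 3.f * (cx[q] * ux + cy[q] * uy);
        float feq = w[q] * rho * (1.f + cu + 0.5f * cu * cu
                                  - 1.5f * (ux * ux + uy * uy));
        f[q * np + cell] -= (f[q * np + cell] - feq) / tau;
    }
}

int main()
{
    const int nx = 128, ny = 128, np = nx * ny;
    float* f;
    cudaMallocManaged(&f, 9 * np * sizeof(float));
    float w0[9] = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                   1.f/36, 1.f/36, 1.f/36, 1.f/36};
    for (int q = 0; q < 9; ++q)                // rho = 1, u = 0 start
        for (int c = 0; c < np; ++c) f[q * np + c] = w0[q];
    dim3 b(16, 16), g((nx + 15) / 16, (ny + 15) / 16);
    collideBGK<<<g, b>>>(f, nx, ny, 0.6f);
    cudaDeviceSynchronize();
    cudaFree(f);
    return 0;
}
```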

22 pages, 6594 KB  
Article
Massively Parallel Monte Carlo Sampling for Xinanjiang Hydrological Model Parameter Optimization Using CPU-GPU Computer Cluster
by Guangyuan Kan, Chenliang Li, Depeng Zuo, Xiaodi Fu and Ke Liang
Water 2023, 15(15), 2810; https://doi.org/10.3390/w15152810 - 3 Aug 2023
Cited by 3 | Viewed by 2366
Abstract
The Monte Carlo sampling (MCS) method is a simple and practical way for hydrological model parameter optimization. The MCS procedure is used to generate a large number of data points. Therefore, its computational efficiency is a key issue when applied to large-scale problems. The MCS method is an internally concurrent algorithm that can be parallelized. It has the potential to execute on massively parallel hardware systems such as multi-node computer clusters equipped with multiple CPUs and GPUs, which are known as heterogeneous hardware systems. To take advantage of this, we parallelize the algorithm and implement it on a multi-node computer cluster that hosts multiple INTEL multi-core CPUs and NVIDIA many-core GPUs by using C++ programming language combined with the MPI, OpenMP, and CUDA parallel programming libraries. The parallel parameter optimization method is coupled with the Xinanjiang hydrological model to test the acceleration efficiency when tackling real-world applications that have a very high computational burden. Numerical experiments indicate, on the one hand, that the computational efficiency of the massively parallel parameter optimization method is significantly improved compared to single-core CPU code, and the multi-GPU code achieves the fastest speed. On the other hand, the scalability property of the proposed method is also satisfactory. In addition, the correctness of the proposed method is also tested using sensitivity and uncertainty analysis of the model parameters. Study results indicate good acceleration efficiency and reliable correctness of the proposed parallel optimization methods, which demonstrates excellent prospects in practical applications.
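The MCS core is embarrassingly parallel, which is what makes this acceleration possible; here is a hedged one-thread-per-sample sketch using NVIDIA's cuRAND device API, with a toy objective standing in for the Xinanjiang model.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

// Each thread draws one candidate parameter value and scores it.
// A real code would run the hydrological model per sample; here a
// quadratic "error" with optimum at k = 0.7 stands in for it.
__global__ void mcsSample(float* score, float lo, float hi,
                          unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState st;
    curand_init(seed, i, 0, &st);
    float k = lo + (hi - lo) * curand_uniform(&st);   // sampled parameter
    score[i] = (k - 0.7f) * (k - 0.7f);               // toy objective
}

int main()
{
    const int n = 1 << 20;
    float* score;
    cudaMallocManaged(&score, n * sizeof(float));
    mcsSample<<<(n + 255) / 256, 256>>>(score, 0.f, 1.f, 42ULL, n);
    cudaDeviceSynchronize();
    int best = 0;                                     // host-side argmin
    for (int i = 1; i < n; ++i) if (score[i] < score[best]) best = i;
    printf("best sample index %d, error %g\n", best, score[best]);
    cudaFree(score);
    return 0;
}
```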

21 pages, 449 KB  
Article
Hybrid GPU–CPU Efficient Implementation of a Parallel Numerical Algorithm for Solving the Cauchy Problem for a Nonlinear Differential Riccati Equation of Fractional Variable Order
by Dmitrii Tverdyi and Roman Parovik
Mathematics 2023, 11(15), 3358; https://doi.org/10.3390/math11153358 - 31 Jul 2023
Cited by 7 | Viewed by 2234
Abstract
The numerical solution of fractional dynamics problems can create a high computational load, which makes it necessary to implement efficient algorithms for their solution. The main contribution to the computational load of such computations comes from heredity (memory), which is determined by the dependence of the current value of the solution function on previous values in the time interval. In mathematical terms, the heredity here is described using a fractional differentiation operator in the Gerasimov–Caputo sense of variable order. As an example, we consider the Cauchy problem for the non-linear fractional Riccati equation with non-constant coefficients. An efficient parallel implementation algorithm has been proposed for the known sequential non-local explicit finite-difference numerical solution scheme. The implementation is a hybrid one, since it uses both GPU and CPU computational nodes. The program code is written in the C and CUDA C languages and is developed using the OpenMP and CUDA hardware and software architectures. This paper presents a study of the computational efficiency of the proposed parallel algorithm, based on data from a series of computational experiments obtained using an NVIDIA DGX Station computing server. The average computation time is analyzed in terms of running time, acceleration, efficiency, and the cost of the algorithm. As a result, it is shown on test examples that the hybrid version of the numerical algorithm can give a significant performance increase of 3–5 times in comparison with both the sequential version of the algorithm and the OpenMP implementation.
(This article belongs to the Special Issue Computational Mathematics and Mathematical Modelling)
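A sketch of why the heredity term parallelizes: time stepping remains sequential, but the memory sum at each step can be spread across GPU threads. The weights and right-hand side below are toys, not the paper's scheme, and double-precision atomicAdd requires compute capability 6.0+ (nvcc -arch=sm_60).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Parallel heredity term: at step k, sum weighted contributions of
// all previous solution values (w[] stands in for the variable-order
// Gerasimov-Caputo kernel weights).
__global__ void memorySum(const double* u, const double* w,
                          double* acc, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < k) atomicAdd(acc, w[k - j] * (u[j + 1] - u[j]));
}

int main()
{
    const int nt = 4096;
    double *u, *w, *acc;
    cudaMallocManaged(&u, (nt + 1) * sizeof(double));
    cudaMallocManaged(&w, (nt + 1) * sizeof(double));
    cudaMallocManaged(&acc, sizeof(double));
    u[0] = 0.0;
    for (int j = 0; j <= nt; ++j) w[j] = 1.0 / (j + 1.0);  // toy weights
    for (int k = 0; k < nt; ++k) {             // sequential in time ...
        *acc = 0.0;
        memorySum<<<(k + 256) / 256, 256>>>(u, w, acc, k);  // ... parallel in memory
        cudaDeviceSynchronize();
        u[k + 1] = u[k] + 1e-3 * (1.0 - u[k] * u[k] - *acc); // toy Riccati RHS
    }
    printf("u(T) = %f\n", u[nt]);
    cudaFree(u); cudaFree(w); cudaFree(acc);
    return 0;
}
```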

15 pages, 1158 KB  
Article
A Flexible and General-Purpose Platform for Heterogeneous Computing
by Jose Juan Garcia-Hernandez, Miguel Morales-Sandoval and Erick Elizondo-Rodríguez
Computation 2023, 11(5), 97; https://doi.org/10.3390/computation11050097 - 11 May 2023
Cited by 4 | Viewed by 2953
Abstract
In the big data era, processing large amounts of data imposes several challenges, mainly in terms of performance. Complex operations in data science, such as deep learning, large-scale simulations, and visualization applications, can consume a significant amount of computing time. Heterogeneous computing is an attractive alternative for algorithm acceleration, using not one but several different kinds of computing devices (CPUs, GPUs, or FPGAs) simultaneously. Accelerating an algorithm for a specific device under a specific framework, i.e., CUDA/GPU, provides a solution with the highest possible performance at the cost of a loss in generality and requires an experienced programmer. On the contrary, heterogeneous computing allows one to hide the details pertaining to the simultaneous use of different technologies in order to accelerate computation. However, effective heterogeneous computing implementation still requires mastering the underlying design flow. Aiming to fill this gap, in this paper we present a heterogeneous computing platform (HCP). Regarding its main features, this platform allows non-experts in heterogeneous computing to deploy, run, and evaluate high-computational-demand algorithms following a semi-automatic design flow. Given the implementation of an algorithm in C with minimal format requirements, the platform automatically generates the parallel code using a code analyzer, which is adapted to target a set of available computing devices. Thus, while an experienced heterogeneous computing programmer is not required, the process can run over the available computing devices on the platform as it is not an ad hoc solution for a specific computing device. The proposed HCP relies on the OpenCL specification for interoperability and generality. The platform was validated and evaluated in terms of generality and efficiency through a set of experiments using the algorithms of the Polybench/C suite (version 3.2) as the input. Different configurations for the platform were used, considering CPUs only, GPUs only, and a combination of both. The results revealed that the proposed HCP was able to achieve accelerations of up to 270× for specific classes of algorithms, i.e., parallel-friendly algorithms, while its use required almost no expertise in either OpenCL or heterogeneous computing from the programmer/end-user.
(This article belongs to the Section Computational Engineering)

33 pages, 9890 KB  
Article
The VM2D Open Source Code for Two-Dimensional Incompressible Flow Simulation by Using Fully Lagrangian Vortex Particle Methods
by Ilia Marchevsky, Kseniia Sokol, Evgeniya Ryatina and Yulia Izmailova
Axioms 2023, 12(3), 248; https://doi.org/10.3390/axioms12030248 - 28 Feb 2023
Cited by 10 | Viewed by 4680
Abstract
This article describes the open-source C++ code VM2D for the simulation of two-dimensional viscous incompressible flows and solving fluid-structure interaction problems. The code is based on the Viscous Vortex Domains (VVD) method developed by Prof. G. Ya. Dynnikova, where the viscosity influence is taken into account by introducing the diffusive velocity. The original VVD method was supplemented by the author's algorithms for boundary condition satisfaction, which made it possible to increase the accuracy of flow simulation near the airfoil's surface line and reduce oscillations when calculating hydrodynamic loads. This paper is aimed primarily at assessing the efficiency of the parallelization of the algorithm. OpenMP, MPI, and Nvidia CUDA parallel programming technologies are used in VM2D, which allow performing simulations on computer systems of various architectures, including those equipped with graphics accelerators. Since the VVD method belongs to the particle methods, the efficiency of parallelization with the usage of graphics accelerators turns out to be quite high. It is shown that in a real simulation, one graphics card can replace about 80 nodes, each of which is equipped with 28 CPU cores. The source code of VM2D is available on GitHub under GNU GPL license.
(This article belongs to the Special Issue Computational and Experimental Fluid Dynamics)
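Particle methods map well to GPUs because the velocity evaluation is an all-pairs O(N²) loop; below is a minimal 2D Biot-Savart sketch with a smoothing radius (names and sizes are ours, not VM2D's).

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

// Velocity induced at each particle i by all vortex particles j,
// with circulation gamma[j] and smoothing parameter eps2 to avoid
// the 1/r^2 singularity at coincident points.
__global__ void inducedVelocity(const float2* pos, const float* gamma,
                                float2* vel, int n, float eps2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 pi = pos[i], v = make_float2(0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        float dx = pi.x - pos[j].x, dy = pi.y - pos[j].y;
        float r2 = dx * dx + dy * dy + eps2;
        float c  = gamma[j] / (2.f * 3.14159265f * r2);
        v.x += -c * dy;                        // perpendicular kernel
        v.y +=  c * dx;
    }
    vel[i] = v;
}

int main()
{
    const int n = 4096;
    float2 *pos, *vel;
    float* gamma;
    cudaMallocManaged(&pos, n * sizeof(float2));
    cudaMallocManaged(&vel, n * sizeof(float2));
    cudaMallocManaged(&gamma, n * sizeof(float));
    for (int i = 0; i < n; ++i) {              // particles on a circle
        pos[i] = make_float2(cosf(0.01f * i), sinf(0.01f * i));
        gamma[i] = 1.f / n;
    }
    inducedVelocity<<<(n + 255) / 256, 256>>>(pos, gamma, vel, n, 1e-4f);
    cudaDeviceSynchronize();
    printf("v[0] = (%f, %f)\n", vel[0].x, vel[0].y);
    cudaFree(pos); cudaFree(vel); cudaFree(gamma);
    return 0;
}
```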

21 pages, 957 KB  
Article
Enhancement of In-Plane Seismic Full Waveform Inversion with CPU and GPU Parallelization
by Min Bahadur Basnet, Mohammad Anas, Zarghaam Haider Rizvi, Asmer Hamid Ali, Mohammad Zain, Giovanni Cascante and Frank Wuttke
Appl. Sci. 2022, 12(17), 8844; https://doi.org/10.3390/app12178844 - 2 Sep 2022
Cited by 6 | Viewed by 3980
Abstract
Full waveform inversion is a widely used technique to estimate subsurface parameters with the help of seismic measurements on the surface. Due to the amount of data, the model size, and the non-linear iterative procedures involved, the numerical computation of full waveform inversion is computationally intensive and time-consuming. This paper addresses the parallel computation of seismic full waveform inversion with graphics processing units. Seismic full-waveform inversion of in-plane wave propagation in the finite difference method is presented here. The stress-velocity formulation of the wave equation in the time domain is used. A four-node staggered-grid finite-difference method is applied to solve the equation, and perfectly matched layers are used to satisfy Sommerfeld's radiation condition at infinity. The gradient descent method with the conjugate gradient method is used for adjoint modelling in full-waveform inversion. The host code is written in C++, and the parallel computation codes are written in CUDA C. The computational time and performance gained from CUDA C and OpenMP parallel computation on different hardware are compared to the serial code. The performance improvement grows with increasing model dimensions and remains almost constant after a certain threshold. A GPU performance gain of up to 90 times is obtained compared to the serial code.
(This article belongs to the Section Earth Sciences)
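A minimal sketch of the staggered-grid inner loop such a time-domain forward model runs every step; this updates one velocity component at interior points, with illustrative constants and no PML (all names are ours).

```cuda
#include <cuda_runtime.h>

// Staggered-grid velocity update for 2D elastic wave propagation:
// vx += (dt/rho) * (d(sxx)/dx + d(sxz)/dz), interior points only.
__global__ void updateVx(float* vx, const float* sxx, const float* sxz,
                         int nx, int nz, float dtRho, float invH)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || k < 1 || k >= nz - 1) return;
    int c = k * nx + i;
    float dsxx = (sxx[c] - sxx[c - 1])  * invH;   // d(sxx)/dx
    float dsxz = (sxz[c] - sxz[c - nx]) * invH;   // d(sxz)/dz
    vx[c] += dtRho * (dsxx + dsxz);
}

int main()
{
    const int nx = 512, nz = 512, np = nx * nz;
    float *vx, *sxx, *sxz;
    cudaMallocManaged(&vx,  np * sizeof(float));
    cudaMallocManaged(&sxx, np * sizeof(float));
    cudaMallocManaged(&sxz, np * sizeof(float));
    cudaMemset(vx,  0, np * sizeof(float));
    cudaMemset(sxx, 0, np * sizeof(float));
    cudaMemset(sxz, 0, np * sizeof(float));
    sxx[(nz / 2) * nx + nx / 2] = 1.f;            // point source in stress
    dim3 b(16, 16), g((nx + 15) / 16, (nz + 15) / 16);
    updateVx<<<g, b>>>(vx, sxx, sxz, nx, nz, 4e-4f, 0.1f);
    cudaDeviceSynchronize();
    cudaFree(vx); cudaFree(sxx); cudaFree(sxz);
    return 0;
}
```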

18 pages, 3712 KB  
Article
Cross-Platform GPU-Based Implementation of Lattice Boltzmann Method Solver Using ArrayFire Library
by Michal Takáč and Ivo Petráš
Mathematics 2021, 9(15), 1793; https://doi.org/10.3390/math9151793 - 28 Jul 2021
Cited by 5 | Viewed by 6389
Abstract
This paper deals with the design and implementation of a cross-platform lattice Boltzmann method solver (D2Q9-BGK and D3Q27-MRT) for 2D and 3D flows, developed with the ArrayFire library for high-performance computing. The solver leverages ArrayFire's just-in-time compilation engine to compile high-level code into optimized kernels for both the CUDA and OpenCL GPU backends. We also provide C++ and Rust implementations and show that it is possible to produce fast cross-platform lattice Boltzmann method simulations with minimal code, effectively less than 90 lines of code. Illustrative benchmarks (lid-driven cavity and Kármán vortex street) for single- and double-precision floating-point simulations on 4 different GPUs are provided.
(This article belongs to the Special Issue Numerical Analysis with Applications in Machine Learning)

19 pages, 4497 KB  
Article
A Hydrodynamic-Based Robust Numerical Model for Debris Hazard and Risk Assessment
by Yongde Kang, Jingming Hou, Yu Tong and Baoshan Shi
Sustainability 2021, 13(14), 7955; https://doi.org/10.3390/su13147955 - 16 Jul 2021
Cited by 6 | Viewed by 3257
Abstract
Debris flow simulations are important in practical engineering. In this study, a graphics processing unit (GPU)-based numerical model that couples hydrodynamic and morphological processes was developed to simulate debris flow, transport, and morphological changes. To accurately predict the debris flow sediment transport and sediment scouring processes, a GPU-based parallel computing technique was used to accelerate the calculation. This model was created in the framework of a Godunov-type finite volume scheme and discretized into algebraic equations by the finite volume method. The mass and momentum fluxes were computed using the Harten, Lax, and van Leer Contact (HLLC) approximate Riemann solver, and the friction source terms were calculated using the proposed splitting point-implicit method. These values were evaluated using a novel 2D edge-based MUSCL scheme. The code was programmed using C++ and CUDA, which can run on GPUs to substantially accelerate the computation. After verification, the model was applied to the simulation of the debris flow process of an idealized example. The results of the new scheme better reflect the characteristics of the discontinuity of its movement and the actual law of the evolution of erosion and deposition over time. The research results provide guidance and a reference for the in-depth study of debris flow processes and disaster prevention and mitigation.

23 pages, 7755 KB  
Article
An Accelerated Tool for Flood Modelling Based on Iber
by Orlando García-Feal, José González-Cao, Moncho Gómez-Gesteira, Luis Cea, José Manuel Domínguez and Arno Formella
Water 2018, 10(10), 1459; https://doi.org/10.3390/w10101459 - 16 Oct 2018
Cited by 96 | Viewed by 9014
Abstract
This paper presents Iber+, a new parallel code based on the numerical model Iber for two-dimensional (2D) flood inundation modelling. The new implementation, which is coded in C++ and takes advantage of the parallelization functionalities both on CPUs (central processing units) and GPUs (graphics processing units), was validated using different benchmark cases and compared, in terms of numerical output and computational efficiency, with other well-known hydraulic software packages. Depending on the complexity of the specific test case, the new parallel implementation can achieve speedups up to two orders of magnitude when compared with the standard version. The speedup is especially remarkable for the GPU parallelization that uses Nvidia CUDA (compute unified device architecture). The efficiency is as good as the one provided by some of the most popular hydraulic models. We also present the application of Iber+ to model an extreme flash flood that took place in the Spanish Pyrenees in October 2012. The new implementation was used to simulate 24 h of real time in roughly eight minutes of computing time, while the standard version needed more than 15 h. This huge improvement in computational efficiency opens up the possibility of using the code for real-time forecasting of flood events in early-warning systems, in order to help decision making under hazardous events that need a fast intervention to deploy countermeasures.
