Search Results (43)

Search Parameters:
Keywords = Message Passing Interface (MPI)

26 pages, 2688 KiB  
Article
Improved Parallel Differential Evolution Algorithm with Small Population for Multi-Period Optimal Dispatch Problem of Microgrids
by Tianle Li, Yifei Li, Fang Wang, Cheng Gong, Jingrui Zhang and Hao Ma
Energies 2025, 18(14), 3852; https://doi.org/10.3390/en18143852 - 19 Jul 2025
Abstract
Microgrids have drawn attention for their role in the development of renewable energy. An optimal power dispatch scheme for each micro-source in a microgrid is needed to make the best use of fluctuating and unpredictable renewable energy. However, the computational time of solving the optimal dispatch problem grows greatly as the grid’s structure becomes more complex. An improved parallel differential evolution (PDE) approach based on a message-passing interface (MPI) is proposed for the optimal dispatch problem of a microgrid (MG), effectively reducing computation time without degrading solution quality. In the new approach, the main population of the parallel algorithm is divided into several small populations, each of which performs the original operators of a differential evolution algorithm, i.e., mutation, crossover, and selection, in different processes concurrently. Gather and scatter operations are employed after several iterations to enhance population diversity. Improvements to mutation, adaptive parameters, and the introduction of a migration operation are also proposed. Two test systems are employed to verify and evaluate the proposed approach, and comparisons with traditional differential evolution are reported. The results show that the proposed PDE algorithm reduces computation time while obtaining solutions no worse than those of the serial algorithm.
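The small-population scheme described above rests on the three standard differential evolution operators. A minimal single-process sketch of one DE/rand/1/bin generation on a toy objective (the population size, parameter values, and objective function are illustrative, not taken from the paper):

```python
import random

random.seed(0)

def sphere(x):
    """Toy objective: sum of squares (to be minimised)."""
    return sum(v * v for v in x)

def de_generation(pop, f=0.5, cr=0.9):
    """One DE/rand/1/bin generation: mutation, crossover, selection."""
    new_pop = []
    for i, target in enumerate(pop):
        # Mutation: combine three distinct vectors other than the target.
        a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
        mutant = [ai + f * (bi - ci) for ai, bi, ci in zip(a, b, c)]
        # Binomial crossover; j_rand guarantees at least one mutant gene.
        j_rand = random.randrange(len(target))
        trial = [m if (random.random() < cr or j == j_rand) else t
                 for j, (t, m) in enumerate(zip(target, mutant))]
        # Greedy selection: keep whichever vector scores better.
        new_pop.append(trial if sphere(trial) <= sphere(target) else target)
    return new_pop

pop = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(8)]
for _ in range(50):
    pop = de_generation(pop)
best = min(sphere(x) for x in pop)
```

In the parallel version described in the abstract, each MPI process would run these operators on its own small population, with periodic gather/scatter exchanges to restore diversity.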

22 pages, 2191 KiB  
Review
Towards Efficient HPC: Exploring Overlap Strategies Using MPI Non-Blocking Communication
by Yuntian Zheng and Jianping Wu
Mathematics 2025, 13(11), 1848; https://doi.org/10.3390/math13111848 - 2 Jun 2025
Abstract
As high-performance computing (HPC) platforms continue to scale up, communication costs have become a critical bottleneck affecting overall application performance. An effective strategy to overcome this limitation is to overlap communication with computation. The Message Passing Interface (MPI), as the de facto standard for communication in HPC, provides non-blocking communication primitives that make such overlapping feasible. By enabling asynchronous communication, non-blocking operations reduce the idle time of cores caused by data transfer delays, thereby improving resource utilization. Overlapping communication with computation is particularly important for enhancing the performance of large-scale scientific applications, such as numerical simulations, climate modeling, and other data-intensive tasks. However, achieving efficient overlapping is non-trivial and depends not only on advances in hardware technologies such as Remote Direct Memory Access (RDMA), but also on well-designed and optimized MPI implementations. This paper presents a comprehensive survey of the principles of MPI non-blocking communication, the core techniques for achieving computation–communication overlap, and some representative applications in scientific computing. Alongside the survey, we include a preliminary experimental study evaluating the effectiveness of the asynchronous progress mechanism on modern HPC platforms, to support the development of parallel programs by HPC researchers and practitioners.
(This article belongs to the Special Issue Numerical Analysis and Algorithms for High-Performance Computing)

35 pages, 11134 KiB  
Article
Error Classification and Static Detection Methods in Tri-Programming Models: MPI, OpenMP, and CUDA
by Saeed Musaad Altalhi, Fathy Elbouraey Eassa, Sanaa Abdullah Sharaf, Ahmed Mohammed Alghamdi, Khalid Ali Almarhabi and Rana Ahmad Bilal Khalid
Computers 2025, 14(5), 164; https://doi.org/10.3390/computers14050164 - 28 Apr 2025
Abstract
The growing adoption of supercomputers across various scientific disciplines, particularly by researchers without a background in computer science, has intensified the demand for parallel applications. These applications are typically developed using a combination of programming models within languages such as C, C++, and Fortran. However, modern multi-core processors and accelerators necessitate fine-grained control to achieve effective parallelism, complicating the development process. To address this, developers commonly utilize high-level programming models such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACCs), Message Passing Interface (MPI), and Compute Unified Device Architecture (CUDA). These models may be used independently or combined into dual- or tri-model applications to leverage their complementary strengths. However, integrating multiple models introduces subtle and difficult-to-detect runtime errors such as data races, deadlocks, and livelocks that often elude conventional compilers. This complexity is exacerbated in applications that simultaneously incorporate MPI, OpenMP, and CUDA, where the origin of runtime errors, whether from individual models, user logic, or their interactions, becomes ambiguous. Moreover, existing tools are inadequate for detecting such errors in tri-model applications, leaving a critical gap in development support. To address this gap, the present study introduces a static analysis tool designed specifically for tri-model applications combining MPI, OpenMP, and CUDA in C++-based environments. The tool analyzes source code to identify both actual and potential runtime errors prior to execution. Central to this approach is the introduction of error dependency graphs, a novel mechanism for systematically representing and analyzing error correlations in hybrid applications. By offering both error classification and comprehensive static detection, the proposed tool enhances error visibility and reduces manual testing effort. This contributes significantly to the development of more robust parallel applications for high-performance computing (HPC) and future exascale systems.
(This article belongs to the Special Issue Best Practices, Challenges and Opportunities in Software Engineering)

22 pages, 3570 KiB  
Article
High-Performance Computing and Parallel Algorithms for Urban Water Demand Forecasting
by Georgios Myllis, Alkiviadis Tsimpiris, Stamatios Aggelopoulos and Vasiliki G. Vrana
Algorithms 2025, 18(4), 182; https://doi.org/10.3390/a18040182 - 22 Mar 2025
Abstract
This paper explores the application of parallel algorithms and high-performance computing (HPC) in the processing and forecasting of large-scale water demand data. Building upon prior work, which identified the need for more robust and scalable forecasting models, this study integrates parallel computing frameworks such as Apache Spark for distributed data processing, Message Passing Interface (MPI) for fine-grained parallel execution, and CUDA-enabled GPUs for deep learning acceleration. These advancements significantly improve model training and deployment speed, enabling near-real-time data processing. Apache Spark’s in-memory computing and distributed data handling optimize data preprocessing and model execution, while MPI provides enhanced control over custom parallel algorithms, ensuring high performance in complex simulations. By leveraging these techniques, urban water utilities can implement scalable, efficient, and reliable forecasting solutions critical for sustainable water resource management in increasingly complex environments. Additionally, expanding these models to larger datasets and diverse regional contexts will be essential for validating their robustness and applicability in different urban settings. Addressing these challenges will help bridge the gap between theoretical advancements and practical implementation, ensuring that HPC-driven forecasting models provide actionable insights for real-world water management decision-making.

23 pages, 6475 KiB  
Article
Genetic Algorithm-Enhanced Direct Method in Protein Crystallography
by Ruijiang Fu, Wu-Pei Su and Hongxing He
Molecules 2025, 30(2), 288; https://doi.org/10.3390/molecules30020288 - 13 Jan 2025
Abstract
Direct methods based on iterative projection algorithms can determine protein crystal structures directly from X-ray diffraction data without prior structural information. However, traditional direct methods often converge to local minima during electron density iteration, leading to reconstruction failure. Here, we present an enhanced direct method incorporating genetic algorithms for electron density modification in real space. The method features customized selection, crossover, and mutation strategies; premature convergence prevention; and efficient message passing interface (MPI) parallelization. We systematically tested the method on 15 protein structures from different space groups with diffraction resolutions of 1.35–2.5 Å. The test cases included high-solvent-content structures, high-resolution structures with medium solvent content, and structures with low solvent content and non-crystallographic symmetry (NCS). Results showed that the enhanced method significantly improved success rates from below 30% to nearly 100%, with average phase errors reduced below 40°. The reconstructed electron density maps were of sufficient quality for automated model building. This method provides an effective alternative for solving structures that are difficult to predict accurately by AlphaFold3 or challenging to solve by molecular replacement and experimental phasing methods. The implementation is available on GitHub.
(This article belongs to the Special Issue Advanced Research in Macromolecular Crystallography)

25 pages, 1511 KiB  
Article
Performance Study of an MRI Motion-Compensated Reconstruction Program on Intel CPUs, AMD EPYC CPUs, and NVIDIA GPUs
by Mohamed Aziz Zeroual, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
Appl. Sci. 2024, 14(21), 9663; https://doi.org/10.3390/app14219663 - 23 Oct 2024
Abstract
Motion-compensated image reconstruction enables new clinical applications of Magnetic Resonance Imaging (MRI), but it relies on computationally intensive algorithms. This study focuses on the Generalized Reconstruction by Inversion of Coupled Systems (GRICS) program, applied to the reconstruction of 3D images in cases of non-rigid or rigid motion. It uses hybrid parallelization with MPI (Message Passing Interface) and OpenMP (Open Multi-Processing). For clinical integration, GRICS needs to harness the computational resources of compute nodes efficiently. We aim to improve GRICS’s performance without any code modification. This work presents a performance study of GRICS on two CPU architectures: Intel Xeon Gold and AMD EPYC. The roofline model is used to study the software–hardware interaction and quantify the code’s performance. For CPU–GPU comparison purposes, we propose a preliminary MATLAB–GPU implementation of GRICS’s reconstruction kernel. We establish the roofline model of the kernel on two NVIDIA GPU architectures: Quadro RTX 5000 and A100. After the performance study, we propose some optimization patterns for the code’s execution on CPUs, first considering only the OpenMP implementation, using thread binding and affinity and appropriate architecture-compilation flags, and then looking for the optimal combination of MPI processes and OpenMP threads in the case of the hybrid MPI–OpenMP implementation. The results show that GRICS performed well on the AMD EPYC CPUs, with an architectural efficiency of 52%. The kernel’s execution was fast on the NVIDIA A100 GPU, but the roofline model reported low architectural efficiency and utilization.

13 pages, 3590 KiB  
Proceeding Paper
Performance Evaluation of Recursive Mean Filter Using Scilab, MATLAB, and MPI (Message Passing Interface)
by Hristina Andreeva and Atanaska Bosakova-Ardenska
Eng. Proc. 2024, 70(1), 33; https://doi.org/10.3390/engproc2024070033 - 8 Aug 2024
Abstract
As a popular linear filter, the mean filter is widely used in different applications as a basic tool for image enhancement. Its main purpose is to reduce the noise in an image and thus to prepare the picture for other image-processing operations depending on the current task. In the last decade, the amount of data, particularly images, that has to be processed in a variety of applications has increased significantly, and thus the usage of effective and fast filtering algorithms has become crucial. The aim of the present research is to identify which type of software (MATLAB, Scilab, or MPI-based) is preferable for reducing the filtering time and consequently saving energy. The aim thus aligns with current trends in information processing and with green computing concepts. A set of experimental images divided into two groups, one of small images and one of big images, is used for performance evaluation of the recursive mean filter. This type of linear filter was chosen due to its very good denoising characteristics. The filter is implemented in the MATLAB and Scilab environments using their specific commands, and it is also implemented in the C language with the MPI library to provide the opportunity for parallel execution. Two mobile computer systems are used for experimental performance evaluation, and the results indicate that the slowest filtering execution is registered when Scilab is used and the fastest is achieved with the C implementation using MPI. Depending on the amount and size of the images that have to be filtered, this study formulates advice for achieving effective performance throughout the whole process of working with images.
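What makes the recursive mean filter fast is the running-sum update: each window sum is derived from the previous one instead of being recomputed from scratch. A 1D sketch in plain Python (the window radius, data, and edge handling are illustrative; the paper's implementations use MATLAB, Scilab, and C with MPI):

```python
def recursive_mean_filter(signal, radius):
    """Running-mean filter: O(n) total work, independent of window size.

    Each window sum is obtained from the previous one by adding the
    sample entering the window and subtracting the one leaving it.
    Edges are handled by clamping the window to the signal bounds.
    """
    n = len(signal)
    out = []
    window_sum = sum(signal[: min(radius + 1, n)])
    count = min(radius + 1, n)
    for i in range(n):
        out.append(window_sum / count)
        enter = i + radius + 1  # index entering the next window
        leave = i - radius      # index leaving the next window
        if enter < n:
            window_sum += signal[enter]
            count += 1
        if leave >= 0:
            window_sum -= signal[leave]
            count -= 1
    return out

smoothed = recursive_mean_filter([1.0, 2.0, 3.0, 4.0, 5.0], radius=1)
# smoothed == [1.5, 2.0, 3.0, 4.0, 4.5]
```

A 2D image version applies the same recursion along rows and then columns; the MPI variant would assign each process a band of rows.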

9 pages, 5236 KiB  
Article
Beamline Optimisation for High-Intensity Muon Beams at PSI Using the Heterogeneous Island Model
by Eremey Valetov, Giovanni Dal Maso, Peter-Raymond Kettle, Andreas Knecht and Angela Papa
Particles 2024, 7(3), 683-691; https://doi.org/10.3390/particles7030039 - 1 Aug 2024
Abstract
The High Intensity Muon Beams (HIMB) project at the Paul Scherrer Institute (PSI) will deliver muon beams with unprecedented intensities of up to 10¹⁰ muons/s for next-generation particle physics and material science experiments. This represents a hundredfold increase over the current state-of-the-art muon intensities, also provided by PSI. We performed beam dynamics optimisations and studies for the design of the HIMB beamlines MUH2 and MUH3 using Graphics Transport, Graphics Turtle, and G4beamline, the latter incorporating PSI’s own measured π+ cross-sections and variance reduction. We initially performed large-scale beamline optimisations using asynchronous Bayesian optimisation with DeepHyper. We are now developing an island-based evolutionary optimisation code, glyfada, based on the Paradiseo framework, in which we implemented Message Passing Interface (MPI) islands with OpenMP parallelisation within each island. Furthermore, we implemented an island model that is also suitable for high-throughput computing (HTC) environments, with asynchronous communication via a Redis database. The code interfaces with COSY INFINITY and G4beamline. glyfada will provide heterogeneous island-model optimisation using evolutionary optimisation and local search methods, as well as part-wise optimisation of the beamline with automatic advancement through stages. We will use glyfada for a future large-scale optimisation of the HIMB beamlines.

13 pages, 8672 KiB  
Article
Efficient Parallel FDTD Method Based on Non-Uniform Conformal Mesh
by Kaihui Liu, Tao Huang, Liang Zheng, Xiaolin Jin, Guanjie Lin, Luo Huang, Wenjing Cai, Dapeng Gong and Chunwang Fang
Appl. Sci. 2024, 14(11), 4364; https://doi.org/10.3390/app14114364 - 21 May 2024
Abstract
The finite-difference time-domain (FDTD) method is a versatile electromagnetic simulation technique, widely used for solving various broadband problems. However, when dealing with complex structures and large dimensions, especially when applying perfectly matched layer (PML) absorbing boundaries, tremendous computational burdens will occur. To reduce the computational time and memory, this paper presents a Message Passing Interface (MPI) parallel scheme based on non-uniform conformal FDTD, which is suitable for convolutional perfectly matched layer (CPML) absorbing boundaries, and adopts a domain decomposition approach, dividing the entire computational domain into several subdomains. More importantly, only one magnetic field exchange is required during the iterations, and the electric field update is divided into internal and external parts, facilitating the synchronous communication of magnetic fields between adjacent subdomains and internal electric field updates. Finally, unmanned helicopters, helical antennas, 100-period folded waveguides, and 16 × 16 phased array antennas are designed to verify the accuracy and efficiency of the algorithm. Moreover, we conducted parallel tests on a supercomputing platform, showing its satisfactory reduction in computational time and excellent parallel efficiency.
(This article belongs to the Special Issue Parallel Computing and Grid Computing: Technologies and Applications)

30 pages, 5007 KiB  
Article
Temporal-Logic-Based Testing Tool Architecture for Dual-Programming Model Systems
by Salwa Saad, Etimad Fadel, Ohoud Alzamzami, Fathy Eassa and Ahmed M. Alghamdi
Computers 2024, 13(4), 86; https://doi.org/10.3390/computers13040086 - 25 Mar 2024
Abstract
Today, various applications in different domains increasingly rely on high-performance computing (HPC) to accomplish computations swiftly. Integrating one or more programming models alongside the used programming language enhances system parallelism, thereby improving its performance. However, this integration can introduce runtime errors such as race conditions, deadlocks, or livelocks. Some of these errors may go undetected using conventional testing techniques, necessitating the exploration of additional methods for enhanced reliability. Formal methods, such as temporal logic, can be useful for detecting runtime errors, since they have been widely used in real-time systems. Additionally, many software systems must adhere to temporal properties to ensure correct functionality. Temporal logics serve as a formal framework that takes the temporal aspect into account when describing changes in elements or states over time. This paper proposes a temporal-logic-based testing tool utilizing instrumentation techniques designed for a dual-level programming model, namely, Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), integrated with the C++ programming language. After a comprehensive study of temporal logic types, we found and proved that linear temporal logic is well suited as the foundation for our tool. While the tool is still in development, our approach is designed to address the highlighted examples of runtime errors. This paper thoroughly explores various types and operators of temporal logic to inform the design of the testing tool based on temporal properties, aiming for a robust and reliable system.

31 pages, 3433 KiB  
Article
Predicting Software Defects in Hybrid MPI and OpenMP Parallel Programs Using Machine Learning
by Amani S. Althiban, Hajar M. Alharbi, Lama A. Al Khuzayem and Fathy Elbouraey Eassa
Electronics 2024, 13(1), 182; https://doi.org/10.3390/electronics13010182 - 30 Dec 2023
Abstract
High-performance computing (HPC) and its supercomputers are essential for solving the most difficult issues in many scientific computing domains. The proliferation of computational resources utilized by HPC systems has resulted in an increase in the associated error rates. As such, modern HPC systems promote a hybrid programming style that integrates the message-passing interface (MPI) and open multi-processing (OpenMP). However, this integration often leads to complex defects, such as deadlocks and race conditions, that are challenging to detect and resolve. This paper presents a novel approach: using machine learning algorithms to predict defects in C++-based systems employing hybrid MPI and OpenMP models. We focus on employing a balanced dataset to enhance prediction accuracy and reliability. Our study highlights the effectiveness of the support vector machine (SVM) classifier, enhanced with term frequency (TF) and recursive feature elimination (RFE) techniques, which demonstrates superior accuracy and performance in defect prediction when compared to other classifiers. This research contributes significantly to the field by providing a robust method for early defect detection in hybrid programming environments, thereby reducing development time and costs and improving the overall reliability of HPC systems.
(This article belongs to the Special Issue New Insights and Techniques for Neural Networks)

15 pages, 14472 KiB  
Article
Speed Up of Volumetric Non-Local Transform-Domain Filter Utilising HPC Architecture
by Petr Strakos, Milan Jaros, Lubomir Riha and Tomas Kozubek
J. Imaging 2023, 9(11), 254; https://doi.org/10.3390/jimaging9110254 - 20 Nov 2023
Abstract
This paper presents a parallel implementation of a non-local transform-domain filter (BM4D). The effectiveness of the parallel implementation is demonstrated by denoising image series from computed tomography (CT) and magnetic resonance imaging (MRI). The basic idea of the filter is based on grouping and filtering similar data within the image. Due to the high level of similarity and data redundancy, the filter can provide even better denoising quality than current extensively used approaches based on deep learning (DL). In BM4D, cubes of voxels named patches are the essential image elements for filtering. Using voxels instead of pixels means that the area for searching similar patches is large. Because of this and the application of multi-dimensional transformations, the computation time of the filter is exceptionally long. The original implementation of BM4D is only single-threaded. We provide a parallel version of the filter that supports multi-core and many-core processors and scales on such versatile hardware resources, typical for high-performance computing clusters, even if they are concurrently used for the task. Our algorithm uses hybrid parallelisation that combines open multi-processing (OpenMP) and message passing interface (MPI) technologies and provides up to 283× speedup, a 99.65% reduction in processing time compared to the sequential version of the algorithm. In denoising quality, the method performs considerably better than recent DL methods on data types on which these methods have not yet been trained.
(This article belongs to the Section Medical Imaging)

16 pages, 920 KiB  
Article
An Architecture for a Tri-Programming Model-Based Parallel Hybrid Testing Tool
by Saeed Musaad Altalhi, Fathy Elbouraey Eassa, Abdullah Saad Al-Malaise Al-Ghamdi, Sanaa Abdullah Sharaf, Ahmed Mohammed Alghamdi, Khalid Ali Almarhabi and Maher Ali Khemakhem
Appl. Sci. 2023, 13(21), 11960; https://doi.org/10.3390/app132111960 - 1 Nov 2023
Abstract
As the development of high-performance computing (HPC) grows, exascale computing is on the horizon. It is therefore imperative to develop parallel systems, such as graphics processing units (GPUs), and programming models that can effectively utilise the powerful processing resources of exascale computing. A tri-level programming model comprising the message passing interface (MPI), compute unified device architecture (CUDA), and open multi-processing (OpenMP) models may significantly enhance the parallelism, performance, productivity, and programmability of heterogeneous architectures. However, the use of multiple programming models often leads to unexpected errors and behaviours during run-time, and such errors are difficult to detect in high-level parallel programming languages. This study therefore proposes a parallel hybrid testing tool that employs both static and dynamic testing techniques. The tool is designed to identify the run-time errors of C++ and MPI + OpenMP + CUDA systems by analysing the source code during run-time, thereby optimising the testing process and ensuring comprehensive error detection, and it is able to identify and categorise the run-time errors of tri-level programming models. As contemporary parallel testing tools cannot, at present, test software applications produced using tri-level MPI + OpenMP + CUDA programming models, this study proposes the architecture of a parallel testing tool to detect run-time errors in such models.

17 pages, 7511 KiB  
Article
Acceleration of a Production-Level Unstructured Grid Finite Volume CFD Code on GPU
by Jian Zhang, Zhe Dai, Ruitian Li, Liang Deng, Jie Liu and Naichun Zhou
Appl. Sci. 2023, 13(10), 6193; https://doi.org/10.3390/app13106193 - 18 May 2023
Abstract
Due to the complex topological relationships, poor data locality, and data-racing problems in unstructured CFD computing, parallelizing finite volume method algorithms in shared memory to efficiently exploit the hardware capabilities of many-core GPUs is a significant challenge. Based on production-level unstructured CFD software, three shared-memory parallel programming strategies (atomic operation, colouring, and reduction) were designed and implemented by deeply analysing its computing behaviour and memory access mode. Several data locality optimization methods, namely grid reordering, loop fusion, and multi-level memory access, were proposed. To handle the sequential nature of the LU-SGS solution, two methods based on cell colouring and hyperplanes were implemented. All the parallel methods and optimization techniques implemented were comprehensively analysed and evaluated on three-dimensional grids of the M6 wing and the CHN-T1 aeroplane. The results show that the Cuthill–McKee grid renumbering and loop fusion optimization techniques improve memory access performance by 10%. The proposed reduction strategy, combined with multi-level memory access optimization, has a significant acceleration effect, speeding up the hot-spot subroutine with data races by a factor of three. Compared with the serial CPU version, the overall speed-up of the GPU code reaches 127×; compared with the parallel CPU version, the GPU code achieves a speed-up of more than thirty with the same number of Message Passing Interface (MPI) ranks.
(This article belongs to the Topic Theory and Applications of High Performance Computing)
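The colouring strategy mentioned in the abstract removes data races by grouping cells so that no two cells sharing data are updated in the same parallel sweep. A greedy sketch on a toy adjacency list (the mesh and function names are illustrative, not the paper's):

```python
def greedy_coloring(adjacency):
    """Assign each cell the smallest colour unused by its neighbours.

    Cells with the same colour touch no common data, so each colour
    class can be updated in parallel without atomic operations.
    """
    colors = {}
    for cell in sorted(adjacency):
        taken = {colors[n] for n in adjacency[cell] if n in colors}
        color = 0
        while color in taken:
            color += 1
        colors[cell] = color
    return colors

# Toy 2x2 cell mesh: each cell is adjacent to its horizontal and
# vertical neighbours.
mesh = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2],
}
coloring = greedy_coloring(mesh)
```

The GPU kernel would then launch one pass per colour class, with all cells of a class processed concurrently and race-free.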

27 pages, 958 KiB  
Article
Parallel Algorithm for Solving Overdetermined Systems of Linear Equations, Taking into Account Round-Off Errors
by Dmitry Lukyanenko
Algorithms 2023, 16(5), 242; https://doi.org/10.3390/a16050242 - 7 May 2023
Abstract
The paper proposes a parallel algorithm for solving large overdetermined systems of linear algebraic equations with a dense matrix. This algorithm is based on the use of a modification of the conjugate gradient method, which is able to take into account rounding errors accumulated during calculations when making a decision to terminate the iterative process. The parallel algorithm is constructed in such a way that it takes into account the capabilities of the message passing interface (MPI) parallel programming technology, which is used for the software implementation of the proposed algorithm. The programming examples are shown using the Python programming language and the mpi4py package, but all programs are built in such a way that they can be easily rewritten using the C/C++/Fortran programming languages. The advantage of using the modern MPI-4.0 standard is demonstrated.
(This article belongs to the Collection Parallel and Distributed Computing: Algorithms and Applications)
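The conjugate gradient method the paper modifies operates on the normal equations of the overdetermined system. A minimal serial CGLS sketch in pure Python (names and the fixed iteration count are illustrative; the paper's MPI version distributes the matrix rows, turns the dot products into reductions, and adds a round-off-aware stopping rule):

```python
def cgls(a_rows, b, iterations):
    """Conjugate gradient on the normal equations A^T A x = A^T b,
    minimising ||Ax - b|| for an overdetermined system.

    a_rows is a list of rows of A; in an MPI version each process
    would own a block of rows and the dot products would become
    allreduce operations.
    """
    m, n = len(a_rows), len(a_rows[0])
    matvec = lambda x: [sum(r[j] * x[j] for j in range(n)) for r in a_rows]
    rmatvec = lambda y: [sum(a_rows[i][j] * y[i] for i in range(m))
                         for j in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)          # residual b - A x (with x = 0)
    s = rmatvec(r)       # gradient A^T r
    p = list(s)
    gamma = dot(s, s)
    for _ in range(iterations):
        q = matvec(p)
        alpha = gamma / dot(q, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(r)
        gamma_new = dot(s, s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x

# Fit y = c0 + c1 * t through (0, 1), (1, 2), (2, 4): 3 equations, 2 unknowns.
a = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
x = cgls(a, [1.0, 2.0, 4.0], iterations=2)
```

In exact arithmetic CG terminates in at most n iterations; the paper's contribution is deciding, in floating point, when accumulated round-off makes further iterations pointless.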
