Article

Error Classification and Static Detection Methods in Tri-Programming Models: MPI, OpenMP, and CUDA

by Saeed Musaad Altalhi 1,2,*, Fathy Elbouraey Eassa 1, Sanaa Abdullah Sharaf 1, Ahmed Mohammed Alghamdi 3,*, Khalid Ali Almarhabi 4 and Rana Ahmad Bilal Khalid 5
1 Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
2 Department of Computer Science and Artificial Intelligence, Umm Al-Qura University, Makkah 21955, Saudi Arabia
3 Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 21493, Saudi Arabia
4 Department of Computer Science, College of Computing at Alqunfudah, Umm Al-Qura University, Makkah 21514, Saudi Arabia
5 College of Engineering and Physical Sciences, Aston University, Aston Triangle, Birmingham B4 7ET, UK
* Authors to whom correspondence should be addressed.
Computers 2025, 14(5), 164; https://doi.org/10.3390/computers14050164
Submission received: 21 March 2025 / Revised: 22 April 2025 / Accepted: 23 April 2025 / Published: 28 April 2025
(This article belongs to the Special Issue Best Practices, Challenges and Opportunities in Software Engineering)

Abstract
The growing adoption of supercomputers across various scientific disciplines, particularly by researchers without a background in computer science, has intensified the demand for parallel applications. These applications are typically developed using a combination of programming models within languages such as C, C++, and Fortran. However, modern multi-core processors and accelerators necessitate fine-grained control to achieve effective parallelism, complicating the development process. To address this, developers commonly utilize high-level programming models such as Open Multi-Processing (OpenMP), Open Accelerators (OpenACC), Message Passing Interface (MPI), and Compute Unified Device Architecture (CUDA). These models may be used independently or combined into dual- or tri-model applications to leverage their complementary strengths. However, integrating multiple models introduces subtle and difficult-to-detect runtime errors such as data races, deadlocks, and livelocks that often elude conventional compilers. This complexity is exacerbated in applications that simultaneously incorporate MPI, OpenMP, and CUDA, where the origin of runtime errors, whether from individual models, user logic, or their interactions, becomes ambiguous. Moreover, existing tools are inadequate for detecting such errors in tri-model applications, leaving a critical gap in development support. To address this gap, the present study introduces a static analysis tool designed specifically for tri-model applications combining MPI, OpenMP, and CUDA in C++-based environments. The tool analyzes source code to identify both actual and potential runtime errors prior to execution. Central to this approach is the introduction of error dependency graphs, a novel mechanism for systematically representing and analyzing error correlations in hybrid applications. By offering both error classification and comprehensive static detection, the proposed tool enhances error visibility and reduces manual testing effort. This contributes significantly to the development of more robust parallel applications for high-performance computing (HPC) and future exascale systems.

1. Introduction

Exascale computing relies on advanced hardware and accessible resources to dramatically improve computational performance. Although programming languages such as C, C++, and Fortran have evolved to support memory-level parallelism, they still exhibit limitations, particularly in efficiently distributing tasks across accelerators and multi-core processors. To overcome these limitations, developers adopt additional programming paradigms that support greater concurrency, optimize resource utilization, and accelerate execution. High-level programming models including the Message Passing Interface (MPI) [1,2], Open Accelerator (OpenACC) [3,4,5], Compute Unified Device Architecture (CUDA) [5,6,7,8], Open Multi-Processing (OpenMP) application program interface (API) [1,9] and Open Computing Language (OpenCL) [10] extend traditional languages by enabling the more efficient use of hardware resources such as shared memory, GPUs, and multi-core CPUs. Each model introduces distinct parallelization strategies and implementation practices. As a result, combining multiple models can significantly improve performance and portability. For example, hybrid applications that integrate OpenMP with OpenACC or MPI with CUDA can distribute workloads across heterogeneous systems, achieving superior throughput. However, integrating these models introduces substantial complexity. While some emphasize GPU utilization, others prioritize memory locality or thread-level parallelism. Consequently, developers must carefully manage these interactions to prevent subtle runtime issues, including race conditions, deadlocks, livelocks, and other nondeterministic behaviors that often elude compiler detection. Moreover, even when integration is syntactically correct, extensive testing is required to ensure functional correctness and runtime stability. Multiple scientific and engineering domains have demonstrated the potential of hybrid programming models, employing combinations of MPI, OpenMP, CUDA, and others to accelerate workloads in fields such as molecular dynamics, embedded computing, and machine learning. However, the concurrent execution of numerous threads and diverse memory hierarchies in such systems increases the likelihood of hard-to-detect runtime errors.
Several studies have demonstrated that the characteristics of different programming models can be combined to create hybrid models that balance computation, communication, and memory efficiency. These models are classified as follows:
  • Single-level, including independent models such as CUDA, MPI, or OpenMP.
  • Two programming models, such as MPI + OpenMP or OpenMP + CUDA, are combined to generate dual-level (X + Y) systems that increase parallelism.
  • Tri-level (MPI + X + Y) systems combine three different programming models to further increase parallelism; a minimal skeleton of such an application is sketched below.
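To make the tri-level structure concrete, the following minimal C++ sketch shows how the three models typically nest: MPI distributes work across processes, OpenMP parallelizes host-side loops within each process, and CUDA offloads a kernel to the GPU. The kernel, array size, and compilation command (e.g., nvcc -Xcompiler -fopenmp together with an MPI library) are illustrative assumptions rather than part of the proposed tool.

// Minimal tri-model sketch: MPI across processes, OpenMP across host threads, CUDA on the device.
// Assumed build command: nvcc -Xcompiler -fopenmp tri_model.cu -lmpi
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;              // each CUDA thread handles one element
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                    // MPI level: one rank per process
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float* host = new float[n];

    #pragma omp parallel for                   // OpenMP level: host threads initialize the array
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(rank);

    float* dev = nullptr;                      // CUDA level: offload the scaling step
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    cudaDeviceSynchronize();                   // wait for the kernel before copying back
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    float local = host[0], global = 0.0f;      // MPI level: combine per-rank results
    MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum of scaled rank ids = %f\n", global);

    delete[] host;
    MPI_Finalize();
    return 0;
}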
As a result, hybrid parallel applications are prone to more runtime errors and require considerable testing effort. Furthermore, when several models are merged into a single application, the causes of these errors shift at runtime as the models interact. To date, no compiler can detect these forms of runtime errors.
To address this challenge, the present study proposes a static analysis tool specifically designed for C++-based applications that integrate MPI, OpenMP, and CUDA. A key contribution of this work is the development of a novel error dependency graph which systematically models the relationships between potential runtime errors in hybrid applications. The proposed tool also introduces a comprehensive classification scheme for these errors, supplemented by illustrative examples that demonstrate their causes and implications.
With the increasing significance of multi-GPU computing in high-performance computing (HPC), particularly in fields such as deep learning, climate modeling, and molecular dynamics, the combination of MPI, OpenMP, and CUDA is viewed as important for achieving performance and scalability. However, the complexity of these heterogeneous systems increases the possibility of subtle, hard-to-detect runtime errors. Scientific workloads rely heavily on hybrid multi-GPU environments, as demonstrated by real-world applications such as MDScale [11] and GPU-accelerated molecular dynamics codes that have been tuned and ported from NVIDIA CUDA to AMD HIP [12]. These instances highlight the need for robust debugging and static analysis tools that can handle the complexities of tri-model programming. By addressing these issues, the proposed tool aims to assist developers in improving the accuracy, portability, and overall performance of HPC applications running at scale.
Because the current compilers and existing tools are unable to detect these errors in tri-model applications, the proposed solution was rigorously evaluated using a suite of benchmarks. These tests reflect the latest MPI, OpenMP, and CUDA specifications, ensuring the relevance and applicability of the results. Accuracy and performance were compared against state-of-the-art static analysis tools.
To the best of our knowledge, this study represents the first comprehensive effort to detect and classify runtime errors arising from the integration of MPI, OpenMP, and CUDA in C++-based environments. It also introduces the first static analysis tool tailored specifically for this purpose. At present, no existing frameworks offer such capabilities.
The remainder of this paper is structured as follows: Section 2 outlines the programming models utilized. Section 3 introduces the challenges and testing techniques. Section 4 reviews related work, identifies integration-specific runtime errors, and presents a classification and taxonomy of these errors. Section 5 describes the implementation of the proposed tool, detailing the analysis and detection phases, including the algorithms employed. Section 6 presents the testing and evaluation, reporting the results and discussing the experimental findings. Finally, Section 7 concludes the study and outlines future research directions, summarizing the main contributions of this work.

2. Background

This section provides a brief overview of the MPI, OpenMP, and CUDA programming models, along with the presently available testing techniques.
MPI was first introduced in 1994 (version 1.0) to support the message-passing paradigm across large-scale parallel systems, and its capabilities were expanded in subsequent versions. For instance, parallel I/O and process management were introduced in MPI 2.0, while the number of procedures increased to approximately 430 in MPI 3.1. In the latest release, MPI 4.0, support for hybrid programming models (MPI + X) was improved and fault-tolerance mechanisms were added to enhance the robustness of large-scale applications [2].
The widely adopted OpenMP API, first introduced in 1997, facilitates parallelism in shared-memory systems by distributing tasks across multiple threads, thereby allowing concurrent execution within a node. OpenMP 4.0, released in 2013, added support for accelerator offloading, which enabled hybrid CPU-GPU parallelism. The latest version, OpenMP 5.2, released in 2021, further enhanced compatibility across C, C++, and Fortran and is supported by multiple implementations [13,14,15,16,17].
CUDA, which was developed by NVIDIA® and released in 2006, is a specialized programming model designed to fully exploit GPU acceleration. It supports C, C++, and Fortran, offers direct access to GPU memory, and enables low-level optimizations that maximize parallel performance. CUDA 12.0, released in December 2022, further optimized GPU resource utilization for parallel tasks [18].
Developing parallel applications for heterogeneous systems, particularly those that combine multiple programming models, introduces considerable complexity. For instance, runtime errors may arise at the interfaces of the programming models, leading to race conditions, deadlocks, and data races, among other issues. Because these errors often depend on a combination of factors, such as input data and execution order, they are difficult to predict and even harder to diagnose once an application is running.

3. Challenges and Testing Techniques

Testing parallel applications is particularly challenging when the code is poorly structured or ambiguously written, as it becomes difficult to determine whether errors originate from a specific programming model, the user’s implementation logic, or the interactions between multiple models. Existing static testing tools analyze source code to detect potential runtime errors prior to execution. Compared with dynamic testing, which evaluates program behavior during execution, static testing is significantly faster and capable of identifying a wider range of issues early in the development cycle. Dynamic testing, while useful, is often impractical for large-scale parallel applications due to its computational overhead. Although comprehensive, model-checking techniques require exhaustive test case generation, rendering them unsuitable for large, complex systems.
To address these challenges, this study introduces a testing tool that integrates static analysis with error dependency graphs to detect runtime errors arising from the combined use of MPI, OpenMP, and CUDA in C++-based applications. In contrast to existing tools, the proposed approach enhances detection coverage and testing efficiency. Within the context of high-performance and exascale computing, this reduces both the testing burden and runtime overhead.

4. Literature Review

The use of parallel applications and HPC has been extensively studied for a range of objectives. Error detection techniques include static analysis, dynamic analysis, or a combination of both in hybrid approaches. Static testing evaluates the application codebase prior to execution [19,20]. On the other hand, dynamic testing evaluates the codebase of an application while it is running [21]. Although there are numerous advantages to dynamic testing, it requires several test cases and may fail to detect some errors during start-up. On the other hand, a hybrid testing tool integrates the best features of both static and dynamic testing [19,22].
Applications such as the Intel® Trace Analyzer and Collector [23] and MPI-Checker [24,25] use relatively lightweight static methods to detect runtime issues in MPI. In contrast, tools such as MEMCHECKER [26], Umpire [26,27], the Marmot umpire scalable tool (MUST) [28,29], MAD [30], MOPPER [31], and runtime error detection (RTED) [32,33] use dynamic methods to identify runtime errors in MPI. In addition, hybrid methods, which combine different techniques to evaluate different runtime issues, are used [33] to identify deadlocks in MPI. ACC_TEST [34] is one such tool, developed specifically to quickly and accurately pinpoint runtime issues such as data races, deadlocks, and mismatches.
Applications such as PolyOMP [4], OmpVerify [35], and DRACO [36] are static methods of identifying runtime errors in the OpenMP API, whereas Helgrind [36,37], Valgrind [38,39], Intel® Thread Checker [40,41], Sun Studio® thread analyzer [40,42], ROMP [43], and RTED [32] are dynamic methods. On the other hand, AddressSanitizer (ASan) and ThreadSanitizer (TSan) [44,45] use OpenMP’s directives to identify runtime issues related to architecture. Meanwhile, ARCHER [46] uses a hybrid method to identify data races in extensive OpenMP applications. It is also used to validate dual-architecture-based systems such as Marmot, which is a combination of OpenMP and MPI programming models [47,48].
Applications such as GPUVerify [49,50] and PolyOMP [4] are static methods of identifying data races in CUDA, whereas RaceTM [51], GUARD [52], and KUDA [52,53] are dynamic methods. On the other hand, GMRace [54], GRace [55], Grace [56,57], and SESA [58] are hybrid methods of identifying runtime errors in CUDA. Our literature review revealed a clear gap in the testing options for tri-level programming frameworks. As such, the present study proposes a validation tool tailored specifically to the MPI, OpenMP, and CUDA programming models.
Although debuggers can be used to validate code, they typically require considerable manual effort to interpret their output. Debugging tools also differ in availability: some are commercial, while others, such as PDT [59], AutomaDeD [60], Linaro DDT [61] (formerly known as Allinea DDT [62]), TotalView [63,64], MPVisualizer [65], Intel® Inspector [66], and Arm DDT [67], are not. Moreover, because such debuggers identify the causes of a problem rather than evaluating or detecting issues, they cannot be categorized according to their evaluation methods. Many other tools have also been examined, such as a temporal-logic-based testing tool for runtime error detection [68].
Our literature review revealed a lack of testing tools capable of identifying runtime issues in a tri-level MPI + OpenMP + CUDA application. Therefore, the present study proposes a hybrid tool that pre-emptively pinpoints several real and potential issues in the C++ codebase.
Although extant studies have proposed many tools for evaluating parallel architectures, the static and dynamic detection of issues in tri-level frameworks warrants closer attention. Moreover, tool support for three-level frameworks on heterogeneous architectures still requires further improvement. It is therefore critical to address the lack of testing tools capable of identifying runtime issues in applications coded with a tri-level MPI + OpenMP + CUDA framework so that these applications can run effectively and efficiently.

Runtime Errors in MPI, OpenMP, and CUDA Programming Models

A compiler cannot identify runtime errors in parallel systems because such errors may arise only once the program is running. Hybrid models experience a higher number of errors, and the causes of these errors differ from those of conventional errors [1,20,69,70,71,72,73,74]. For parallel systems to prove advantageous, they must be error-free. This section lists several errors that occur when the MPI, OpenMP, and CUDA programming models are integrated into C++-based applications, and actual examples are used to illustrate these concerns. The code examples that follow are written in C++ and use the MPI, OpenMP, and CUDA programming models. Some errors arise from faults in any one of the three models, whereas others may occur even when no individual model contains a fault. The dependency graphs below illustrate how these dependencies affect the errors.
Multiple studies have thoroughly examined deadlocks and their causes [75,76,77,78]. Deadlocks usually occur when two or more threads each wait for input from the other, creating a circular wait that prevents any of them from moving forward. The MPI, OpenMP, and CUDA programming models are all prone to deadlocks [72].
A data race occurs when a shared memory location is accessed by multiple threads, with at least one of the threads writing to it, and the accesses are not synchronized. Because the result may differ depending on the order in which the threads are scheduled, shared variables should be synchronized to prevent data races. Each programming model uses different techniques to detect data races [79,80,81,82,83,84,85,86].
The OpenMP model simplifies the development of a parallel program in a shared-memory setting. When working with shared resources, OpenMP requires synchronization tools, such as locks, to avoid data races and deadlocks. The basic locking and unlocking functions provided by OpenMP include omp_set_lock, omp_unset_lock, and omp_test_lock, while the omp_init_lock and omp_destroy_lock functions are used for initialization and destruction. To prevent concurrent changes to the data in a shared resource, these functions ensure that only a single thread can access a critical section at a time.
The OpenMP model also features nested locks, which enable a thread to set the same lock more than once. A nested lock records how many times it has been set by the owning thread and requires the same number of unset operations to be released. Corresponding functions are provided to initialize, destroy, set, unset, and test a nested lock, offering a higher level of control in complicated synchronization settings. When a lock is no longer required, it should be destroyed using the omp_destroy_lock or omp_destroy_nest_lock function. Because data races occur when multiple threads concurrently access the same shared resource without proper synchronization, their outcomes are unpredictable and depend on the timing of the threads. For instance, updating a shared resource in a parallel region without establishing locks can lead to a data race. This can be solved by using the omp_set_lock and omp_unset_lock functions to protect the shared resource, ensuring that only one thread can modify it at a time.
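As an illustration of this point (and not one of the appendix listings), the short sketch below protects a shared counter with omp_set_lock and omp_unset_lock; removing the lock calls would reintroduce the data race described above. The counter and iteration count are arbitrary.

#include <omp.h>
#include <cstdio>

int main() {
    omp_lock_t lock_a;
    omp_init_lock(&lock_a);        // initialize the lock before the parallel region
    long shared_counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 100000; ++i) {
        omp_set_lock(&lock_a);     // only one thread at a time enters this section
        ++shared_counter;          // without the lock, this update would be a data race
        omp_unset_lock(&lock_a);   // release the lock on every path
    }

    omp_destroy_lock(&lock_a);     // destroy the lock only after all threads are done
    std::printf("counter = %ld\n", shared_counter);
    return 0;
}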
A deadlock occurs when multiple threads acquire multiple locks but in different orders. This causes two or more of the threads to be blocked forever, perpetually waiting for the other thread to release them. In this case, both threads will be stuck indefinitely if one thread sets lock A and waits for lock B while another thread sets lock B and waits for lock A. Therefore, all the threads must acquire the locks in the same order to prevent deadlocks and avoid circular dependencies.
Parallel programs can use OpenMP’s basic and nested locks to synchronize their access to shared resources and prevent data races. However, it is imperative to ensure consistent locking across the threads and to destroy locks after they are no longer required to prevent deadlocks. It is believed that when developers understand and apply these concepts, parallel applications can be written safely and efficiently.
Listing A1 (Appendix A), below, provides an example of a deadlock that occurs when the same lock, lock_a in this case, is set twice by the same thread without being unset in between. Because the second omp_set_lock call on lock_a cannot succeed until the same thread releases the lock, the program hangs.
Listing A2 (Appendix A) below, provides another example of a deadlock that occurs when lock_a is set in each iteration of the loop in the parallel region of the OpenMP section. Here, each thread tried to set the lock, perform an operation, and then unset the lock. However, the lock was destroyed in the parallel region, which led to undefined behavior. Therefore, in this example, the improper handling of the lock caused the threads to wait indefinitely for the lock to be released, resulting in a deadlock.
Listing A3 (Appendix A), below, provides an example of a deadlock that occurs when lock_a is set in the first section, in Line 10, and then unset and destroyed in the second section, in Lines 15 and 16. As the sections ran concurrently, the first section never released the lock, causing the second section to wait indefinitely for the lock to be released before it could destroy it. Therefore, in this example, the lock should have been set and unset within the same section, and the lock should have been destroyed after all the sections had been completed.
The OpenMP model requires the use of lock and unlock commands in conjunction with conditional statements. As such, it is crucial to ensure that locks are handled correctly in all possible execution paths, as leaving a lock set indefinitely could result in a deadlock or threads that are blocked indefinitely. Listing A4 (Appendix A), below, provides an example of a deadlock that occurs in the first section of OpenMP in Lines 8–15. As observed, lock_a was set in Line 8 and conditionally unset based on the available flag in Lines 11–13. If the available flag was false, the lock was properly unset. However, if the available flag was true, the lock remained set while the remaining work was performed before it was unset at the end. Later on, if another section or thread attempts to set the lock without first ensuring that it has been properly released from all conditions, it causes a deadlock. Therefore, every path that sets the lock must also unset it before the end of the work or before another section or thread tries to set the same lock again.
In Listing A5 (Appendix A), the code enters the parallel region of the OpenMP section using the nowait clause with the #pragma omp parallel and #pragma omp sections directives. The nowait clause allows multiple sections to run concurrently without having to wait for each section to complete. However, a deadlock occurs due to the incorrect placement of the locks and the order in which they are acquired in the two OpenMP sections.
In the first section, which began at Line 2, lock_a was acquired using the omp_set_lock(&lock_a) function (Line 3). After some of the work was carried out, lock_b was acquired using the omp_set_lock(&lock_b) function (Line 5). After the work was complete, lock_b was released using the omp_unset_lock(&lock_b) function (Line 7), and lock_a should then have been released using the omp_unset_lock(&lock_a) function, a call that is incorrectly missing from the code.
In the second section, which began at Line 11, lock_b was acquired using the omp_set_lock(&lock_b) function (Line 13). After some of the work was carried out, lock_a was acquired using the omp_set_lock(&lock_a) function (Line 15). After the work was complete, lock_a was released using the omp_unset_lock(&lock_a) function (Line 17), followed by lock_b using the omp_unset_lock(&lock_b) function (Line 18). This created a deadlock as each section had to wait for the other section to release the lock, creating a circular dependency.
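Because Listing A5 itself is available only in [96], the following minimal sketch reproduces the pattern described above: the two sections acquire lock_a and lock_b in opposite orders, so each section can end up holding one lock while waiting for the other. Acquiring both locks in the same order in every section removes the circular dependency.

#include <omp.h>

int main() {
    omp_lock_t lock_a, lock_b;
    omp_init_lock(&lock_a);
    omp_init_lock(&lock_b);

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            omp_set_lock(&lock_a);   // section 1: acquires A then B
            omp_set_lock(&lock_b);   // may block forever if section 2 already holds B
            omp_unset_lock(&lock_b);
            omp_unset_lock(&lock_a);
        }
        #pragma omp section
        {
            omp_set_lock(&lock_b);   // section 2: acquires B then A (opposite order -> possible deadlock)
            omp_set_lock(&lock_a);
            omp_unset_lock(&lock_a);
            omp_unset_lock(&lock_b);
        }
    }

    omp_destroy_lock(&lock_b);
    omp_destroy_lock(&lock_a);
    return 0;
}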
In Listing A6 (Appendix A), Lines 36 and 37 caused a deadlock. If Process 2 (P2) arrives at the first receiver (MPI_Recv) in Line 36 but the second receiver is in Line 37 and is still waiting for a message from Process 2, it will end up waiting indefinitely. However, if Process 1 (P1) arrives first and sends its message, the program will run without issues.
Listing A7 (Appendix A) depicts the specific section of the MPI code where a deadlock occurs due to mismatched tags. The deadlock occurred as Process 0 was waiting to receive a message from Process 1 with Tag 2. However, Process 1 sent a message with Tag 1 and Process 0 waited to receive the message with Tag 2. As a result, both processes ended up waiting indefinitely for messages that the other process did not send.
Listing A8 (Appendix A) depicts the specific section of the MPI code where a deadlock occurred due to a mismatch of the types of data in the MPI_Send and MPI_Recv operations. More specifically, Process 0 sent an integer (MPI_INT) but expected to receive a double (MPI_DOUBLE). This mismatch caused undefined behavior and the program to crash, as the data types must match for successful communication. A deadlock can also occur due to a mismatch in the tags. For instance, as observed in Line 27 of Process 0, the MPI_Send operation sent a message with Tag 0 but the MPI_Recv operation in Line 28 expected to receive a message with Tag 2. Furthermore, as observed in Line 32 of Process 1, the MPI_Recv operation expected to receive a message with Tag 0 but the MPI_Send operation in Line 44 sent a message with Tag 1. As a result, both processes ended up waiting indefinitely for messages that the other process did not send.
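The corresponding listings are available in [96]; the minimal sketch below illustrates the tag-mismatch pattern discussed above, with the ranks, tags, and payload chosen purely for illustration. Rank 0 sends with tag 1 while rank 1 posts a receive for tag 2, so the receive blocks indefinitely and the program hangs.

#include <mpi.h>
#include <cstdio>

// Run with at least two processes, e.g., mpirun -np 2 ./a.out (assumed launch command).
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int value = 42;

    if (rank == 0) {
        // Sent with tag 1 ...
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // ... but the receive expects tag 2, so it blocks forever (deadlock).
        MPI_Recv(&value, 1, MPI_INT, 0, /*tag=*/2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d\n", value);   // never reached
    }

    MPI_Finalize();
    return 0;
}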
In Listing A9 (Appendix A), Process 0 transmitted Array A to Process 1 with Tag 0 and Array B to Process 2 with Tag 0. Process 1 received Array A from Process 0 with Tag 0 and transmitted the updated Array A to Process 3 with Tag 0. At the same time, Process 2 received Array B from Process 0 with Tag 0 and then transmitted the updated Array B to Process 3 with Tag 0. Finally, Process 3 waited to receive Array A from Process 1 with Tag 0 and Array B from Process 2 with Tag 0. This sequence of actions causes a deadlock if any of the tags do not match or if the operations have not been properly synchronized, as Process 3 will wait indefinitely for messages that may not arrive.
In Listing A10 (Appendix A), below, in Line 2, Process 0 sent Array A to Process 1 with Tag 0. In Line 3, Process 0 sent Array B to Process 2 with Tag 0. In Line 4, Process 0 sent Array B to Process 3 with Tag 0 without a corresponding receiver, which can lead to hidden resource leak issues. In Line 8, Process 1 received Array A from Process 0 with Tag 0. Process 1 later sent the updated Array A to Process 3 in Line 16 with the same tag. Similarly, in Line 21, Process 2 received Array B from Process 0 with Tag 0 and then sent the updated Array B to Process 3 in Line 26. Finally, in Lines 31 and 32, Process 3 waited to receive Array A from Process 1 with Tag 0 and Array B from Process 2 with Tag 0. However, Process 3 ended up waiting indefinitely for messages that have not been sent. For example, if Processes 1 or 2 fail to send the updated arrays to Process 3, or if Process 3 starts receiving before the arrays are sent, a deadlock will occur.
In Listing A11 (Appendix A), an atomic compare-and-swap (CAS) operation was used to implement a semaphore with which to control access to a critical section of the CUDA kernel. As observed, a deadlock occurred in Line 13 due to the way that the atomic CAS operation was used. If the semaphore was not released properly, the threads spin indefinitely.
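Listing A11 is available in [96]; the sketch below shows the general spin-lock pattern it describes, with the kernel and variable names chosen for illustration. On architectures where the threads of a warp execute in lock-step, the thread that wins the atomicCAS may be held back while its warp-mates keep spinning, so the release at the end of the critical section may never execute and the kernel can hang.

#include <cuda_runtime.h>
#include <cstdio>

__device__ int semaphore = 0;   // 0 = free, 1 = taken

__global__ void critical_section_kernel(int* counter) {
    // Spin until this thread swaps the semaphore from 0 to 1. With several threads
    // of the same warp competing, the winner may be stalled behind the spinning
    // losers on some GPU architectures, so the loop can spin indefinitely.
    while (atomicCAS(&semaphore, 0, 1) != 0) { /* busy wait */ }
    *counter += 1;                // critical section
    atomicExch(&semaphore, 0);    // release (may never be reached)
}

int main() {
    int* counter = nullptr;
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    critical_section_kernel<<<1, 32>>>(counter);  // 32 threads contend for the lock
    cudaDeviceSynchronize();                      // may hang on architectures without independent thread scheduling
    cudaFree(counter);
    return 0;
}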
In Listing A12 (Appendix A), in Line 16, a data race occurred because the threads had not been synchronized properly after executing the kernel. The CUDA kernel does not ensure that the read–modify–write operations on device_vecA are performed atomically or synchronized across threads. In other words, it attempts to update device_vecA[index] for all indices except the last one, and because multiple threads may simultaneously read and write device_vecA[index] with operations that are neither atomic nor synchronized, a data race results.
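Listing A12 is likewise available in [96]; the sketch below illustrates the same class of race with a slightly simplified kernel (the array name device_vecA is retained, and a neighbour-index update is chosen for clarity). Each thread reads an element that another thread may be writing at the same moment, and no atomic operation or synchronization orders the two accesses, so the final values are nondeterministic.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void shift_add(float* device_vecA, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n - 1) {
        // Data race: thread `index` reads element index + 1 while the thread
        // assigned to index + 1 may be writing it; the accesses are neither
        // atomic nor ordered by any synchronization.
        device_vecA[index] += device_vecA[index + 1];
    }
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* device_vecA = nullptr;
    cudaMalloc(&device_vecA, n * sizeof(float));
    cudaMemcpy(device_vecA, host, n * sizeof(float), cudaMemcpyHostToDevice);

    shift_add<<<(n + 255) / 256, 256>>>(device_vecA, n);
    cudaDeviceSynchronize();   // synchronizes host and device, but cannot repair the intra-kernel race

    cudaMemcpy(host, device_vecA, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("host[0] = %f (value is nondeterministic)\n", host[0]);
    cudaFree(device_vecA);
    return 0;
}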
In Listing A13 (Appendix A), the kernel had an infinite loop (while (true)) that continuously checked and updated elements of device_vecA between Lines 13 and 19. Therefore, the threads were perpetually running without ever exiting the loop. More specifically, the threads kept updating the same memory location in an infinite loop without any mechanism with which to break out of it, leading to a livelock.
Table 1 lists the errors that may occur in each of the programming models and how these errors will impact the tri-programming model. For example, a CUDA model may experience a deadlock, data race, potential deadlock, or potential livelock, each of which may cause distinct runtime errors in the CUDA model; however, the MPI and OpenMP models may still function flawlessly. The table shows the serious repercussions of even one model failing, including data races that lead to unreliable outcomes as well as deadlocks that can cause the program to crash. Furthermore, apart from their complexity, deadlocks and livelocks can make programs sluggish and even cause them to fail.
Table 2 lists the errors that may occur in any two of the constituting programming models and how these errors will impact the tri-programming model. For example, if the OpenMP model does not experience any errors but the CUDA model experiences a data race and the MPI model experiences a deadlock, the deadlock will cause a system-wide failure in the tri-programming model. Furthermore, data races and livelocks may yield wrong results as they keep the program busy with concurrent access problems.
Table 3 lists the errors that may occur in all three constituting programming models and how these errors will impact the tri-programming model.
This highlights the complex correlations between the potential errors. The present study discusses several scenarios, including data races, deadlocks, and livelocks in the CUDA, MPI, and OpenMP programming models. The outcomes range from deadlocks to incorrect results caused by data races. In particular, the table identifies scenarios in which errors may result in livelocks, deadlocks, or incorrect results.
It also provides a thorough examination of how these errors affect the overall performance of multi-model applications as well as the specific models in which they arise. These data are vital for developers working in HPC environments, where distributed and parallel systems are critical, as they can use them to anticipate, diagnose, and resolve complex runtime errors.
The occurrence and interdependence of runtime errors, specifically data races, deadlocks, prospective deadlocks, and livelocks, in programs that combine the MPI, OpenMP, and CUDA programming models are depicted in Figure 1, Figure 2 and Figure 3. These images supplement the textual information by providing a visual summary of how such errors develop and change across multi-model systems.
The complex nature of hybrid parallel settings is highlighted in Figure 1, which also presents a dependency graph depicting how the introduction of one model affects the others. The paths illustrate the necessity of strict synchronization and resource coordination by displaying both optimal (error-free) execution and transitions into different error states.
In Figure 2, the error-handling pathways for two of the three models are traced, both independently and convergently. It illustrates how errors can affect adjacent models even when they originate in the same execution stream. Their systemic significance is further supported by the fact that identical errors are noted in several models.
This idea is applied concurrently to all three models in Figure 3. It shows that if a single error type is handled incorrectly, especially data races and deadlocks, the entire program may fail, producing either inaccurate results or a system hang.
All these figures highlight an important point: runtime errors in hybrid parallel systems are not isolated. When runtime errors appear in one component (such as MPI), they frequently set off a series of problems in other components (such as CUDA or OpenMP). As a result, the diagrams go beyond basic illustration and serve as diagnostic tools for understanding how failures propagate and the crucial role that synchronized execution plays in HPC.

5. Static Testing Approach Implementation

Parallel systems are very difficult to test due to the many challenges that arise during the implementation process. As such, multiple studies have examined the possibility of developing a tool that covers all potential test case situations and data [78]. A static automated testing tool was therefore developed to detect errors in the MPI, OpenMP, and CUDA programming models in C++-based applications. The design of the proposed tool takes into account the error dependencies outlined in Table 1, Table 2 and Table 3 and Figure 1, Figure 2 and Figure 3. The tool enables programmers to identify problems in the source code before a program is executed, log the problems, and view them in a separate log file containing all the necessary information to locate the errors; all these tasks are completed before the program is compiled. The runtime errors are also listed in a dependency graph, which enables programmers to resolve an error before running the program.

5.1. Analysis Phase

The proposed tool was created using the C++ programming language. The source code to be tested was first entered into a C or C++ development environment which contained the source code for the MPI, OpenMP, and CUDA tri-level programming model. In order to detect runtime errors, the proposed tool analyses the code and gathers information as follows:
  • The MPI, OpenMP, and CUDA data are stored in vectors.
  • These data include the start line number and the end line number of each model.
  • The tool gathers information about the explicit barriers in the OpenMP model.
  • The kernel definition, kernel launch, data transfer, and memory cleanup regions contain information about the start and end lines.
  • The tool sets the start and end values of the variables in every parallel region in the two programming models.
  • Finally, the tool records the ‘while’, ‘for’, and ‘do while’ loop data, the dependent variables, and the start and end values of these loops; a hypothetical sketch of the kind of record gathered is shown below.
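For illustration only, the information listed above could be held in simple records of the following form; these hypothetical C++ structures are a sketch of the kind of data gathered, not the tool's actual implementation.

#include <string>
#include <vector>

// Hypothetical records for the information gathered during the analysis phase.
struct ModelRegion {
    std::string model;       // "MPI", "OpenMP", or "CUDA"
    int startLine = 0;       // first line of the region in the source file
    int endLine = 0;         // last line of the region
};

struct LoopInfo {
    std::string kind;        // "for", "while", or "do while"
    std::string loopVariable;
    int startLine = 0;
    int endLine = 0;
};

struct AnalysisData {
    std::vector<ModelRegion> regions;     // MPI, OpenMP, and CUDA regions
    std::vector<int> openmpBarrierLines;  // explicit OpenMP barriers
    std::vector<LoopInfo> loops;          // loops and their dependent variables
};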
An organized log file is then created, where all these data are stored and presented to the programmer for review. Figure 4 illustrates part of the log file, depicting all the data collected from the MPI section, and reveals how the processes communicate in a distributed computing environment. It includes comprehensive data, such as the total number of MPI send and receive operations; their locations within the code, with specific line numbers; and confirmation on whether the corresponding send/receive operations were successfully found. It also details the specific variables involved in the MPI operations, outlines the data types and the amount of data transferred, and describes the MPI tags and communicators used in each operation.
Figure 5 illustrates part of the log file depicting all the data collected from the CUDA and OpenMP sections, both of which are crucial components of parallel programming but with a focus on the different aspects of HPC. Using functions such as omp_init_lock, omp_set_lock, and omp_parallel, the OpenMP section targets parallel programming in a shared-memory computing environment. To help developers effectively manage synchronization, it highlights critical variables, such as locks (OMP Lock), and captures where they were initialized, set, and unset. As a result, the parallel sections of code run smoothly and efficiently.
The CUDA section focuses on the parallel computing capabilities of NVIDIA’s GPUs. The CUDA model’s memory management functions, such as cudaMalloc and cudaMemcpy, are examined and their call locations are identified to ensure efficient memory management in the GPU. It also documents the launch of the CUDA kernels, indicating where and how they were executed on the GPUs. In-depth details about each memory transfer, including the sources, destinations, and sizes, are also recorded. This meticulous documentation facilitates the optimal use of memory and increases the efficiency of data transfer between the host and the GPU. This demonstrates the synergistic correlation between these two powerful technologies in the field of modern computing.

5.2. Detection Phase

It is necessary to acknowledge and address conditions that are often overlooked, such as the interactions between various regions, critical regions, sections, shared resources, reduction clauses, and synchronization constructs—all of which are common when programming in parallel. The proposed tool is capable of identifying potential runtime problems involving these constructs. Even though such errors may not be evident during compilation, the tool can identify them and clearly display them to the programmer.
Algorithm 1 determines whether an MPI deadlock will arise when there is no matching MPI_Recv for an MPI_Send or no matching MPI_Send for an MPI_Recv. A deadlock will also occur if there is any mismatch in the data type, tag, communicator, or count.
Algorithm 1: The MPI deadlock print error when there is a send without a receive or a receive without a send
1:OPEN THE INPUT CODE FILE
2:SEARCH for mpi_send and mpi_recv to determine which one occurs first
3:if (mpi_send is found)
4:  IDENTIFY receiver rank
5:  SEARCH for matching mpi_recv at the receiver
6:  if (mpi_recv is found)
7:  MATCH  tag, count, sender rank, communicator, datatype
8:  if (there is any MISMATCH)
9:     CONTINUE  Search
10:   else
11:     Matching  mpi_recv found
12:   end if
13:  if (no matching mpi_recv is found at the receiver rank)
14:   “Print Error: Deadlock No corresponding RECV found”
15:   end if
16:end if
17:if (mpi_recv is found)
18:IDENTIFY  Sender rank
19:SEARCH  for a matching mpi_send at the Sender
20:if (mpi_send is found)
21:   MATCH  tag, count, receiver rank, communicator, datatype
22:  if (there is any MISMATCH)
23:   CONTINUE Search
24:   else
25:    Matching  mpi_send found
26:   end if
27:  if (no matching mpi_send is found at the sender rank)
28:     “Print Error: Deadlock No corresponding SEND found”
29:   end if
30:end if
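To illustrate how the matching step of Algorithm 1 can be expressed in code, the following simplified C++ sketch operates on hypothetical records of already-extracted MPI_Send and MPI_Recv calls (the MpiCall structure and its field names are assumptions, not the tool's actual data model). A send is reported as a deadlock when no receive agrees with it on ranks, tag, count, datatype, and communicator.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record for an extracted MPI_Send or MPI_Recv call.
struct MpiCall {
    bool isSend = false;
    int ownerRank = 0;       // rank that executes the call
    int peerRank = 0;        // destination for sends, source for receives
    int tag = 0;
    int count = 0;
    std::string datatype;
    std::string communicator;
    int line = 0;
};

// Report a deadlock for every send that has no matching receive (first half of Algorithm 1).
void checkSendsHaveReceives(const std::vector<MpiCall>& calls) {
    for (const MpiCall& send : calls) {
        if (!send.isSend) continue;
        bool matched = false;
        for (const MpiCall& recv : calls) {
            if (recv.isSend) continue;
            if (recv.ownerRank == send.peerRank && recv.peerRank == send.ownerRank &&
                recv.tag == send.tag && recv.count == send.count &&
                recv.datatype == send.datatype && recv.communicator == send.communicator) {
                matched = true;
                break;
            }
        }
        if (!matched)
            std::printf("Error: Deadlock, no corresponding RECV found for send at line %d\n",
                        send.line);
    }
}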
An MPI livelock occurs when processes are stuck in an infinite loop exchanging messages with each other. As observed in Algorithm 2, if an MPI process is found to be communicating inside an infinite loop that is never broken, this error is reported.
Algorithm 2: The MPI livelock print error when an infinite loop develops
1:OPEN THE INPUT CODE FILE
2:if (while(true) statement is found in any process)
3:   SEARCH  for a possible BREAK of the loop
4:  if (loop BREAK statement is not found)
5:     “Print Error: Livelock MPI process stuck in an infinite loop”
6:    end if
7:end if
An MPI data race occurs when a process issues multiple immediate (non-blocking) sends to another process without calling the MPI_Wait routine. As observed in Algorithm 3, this can create a conflict in how the messages are completed and received, so the messages may be processed out of order, leading to inaccurate results.
Algorithm 3: The MPI data race print error when an immediate send is issued without a matching MPI_Wait
1:OPEN THE INPUT CODE FILE
2:if (mpi_Isend is found)
3:   IDENTIFY  receiver rank
4:     SEARCH  for a matching mpi_Irecv at the receiver
5:  if (mpi_Irecv is found)
6:      MATCH  tag, count, sender rank, communicator, datatype
7:     if (there is any MISMATCH)
8:        CONTINUE  Search
9:      else
10:        Matching  mpi_Irecv found
11:    end if
12:   if (matching mpi_Irecv is found)
13:      SEARCH  for mpi_wait statement after mpi_Isend
14:     if (there is no mpi_wait statement)
15:      “Print Error: Data Race Not waiting for messages to be received before sending again”
16:      end if
17:    else
18:      “Print Error: Deadlock No corresponding  mpi_Irecv  found”
19:end if
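The situation Algorithm 3 targets can be sketched as follows (ranks, tag, and buffer contents are illustrative): rank 0 issues two immediate sends that reuse the same buffer and only calls MPI_Wait afterwards, so the value actually transmitted by the first send is unpredictable.

#include <mpi.h>
#include <cstdio>

// Run with at least two processes, e.g., mpirun -np 2 ./a.out (assumed launch command).
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int buffer = 1;
        MPI_Request req1, req2;
        MPI_Isend(&buffer, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req1);
        buffer = 2;   // modifying the buffer before MPI_Wait: the first send may transmit 1 or 2
        MPI_Isend(&buffer, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req2);
        MPI_Wait(&req1, MPI_STATUS_IGNORE);   // the waits are issued only after the damage is done
        MPI_Wait(&req2, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        int first = 0, second = 0;
        MPI_Recv(&first, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&second, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received %d then %d (nondeterministic)\n", first, second);
    }

    MPI_Finalize();
    return 0;
}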
An OpenMP deadlock will occur if a lock is set but not unset. In this scenario, the same logic is applied. As seen in Algorithm 4, the name of the lock variable is identified if a lock is set. If a corresponding unset lock is not found in the remaining lines of the code, it means that a deadlock will occur.
Algorithm 4: The OpenMP deadlock print error when a lock is misused
1:OPEN THE INPUT CODE FILE
2:if (OMPSetlock is found)
3:IDENTIFY  lock  variable
4:    SEARCH  for a corresponding OMPUnsetlock
5:   if (OMPUnsetlock is not found)
6:      “Print Error: Deadlock  Lock  is set but not unset”
7:    end if
8:end if
An OpenMP deadlock will occur if an infinite loop develops in a critical section as the threads access the critical section sequentially. Therefore, when a thread enters a critical section, the other threads wait for it to finish executing. As observed in Algorithm 5, in an infinite loop, one thread will be stuck in the critical section and the other threads will be stuck outside it, creating a deadlock.
Algorithm 5: The OpenMP deadlock print error when an infinite loop develops in a critical section
1:OPEN THE INPUT CODE FILE
2:if (OMPcritical construct is found)
3:    CHECK  that there is no infinite while loop in the critical section
4:   if (infinite while loop is found inside critical section)
5:      “Print Error: Deadlock because One thread enters the critical section and the others keep on waiting”
6:    end if
7:end if
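A minimal sketch of the situation Algorithm 5 targets is shown below (the loop body is illustrative): the first thread to enter the critical section never leaves it, so every other thread blocks at the critical construct and the program hangs.

#include <omp.h>

int main() {
    int shared_value = 0;

    #pragma omp parallel
    {
        #pragma omp critical
        {
            while (true) {          // infinite loop inside the critical section
                ++shared_value;     // the owning thread never exits, so all other threads wait forever
            }
        }
    }
    return 0;
}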
An OpenMP livelock occurs if an infinite loop develops in the OpenMP section. Algorithm 6 checks for a possible break in the loop. If a break statement is not found, a livelock will develop in the OpenMP section.
Algorithm 6: The OpenMP livelock print error when an infinite loop develops in a #pragma construct
1:OPEN THE INPUT CODE FILE
2:if (while(true) statement is found inside a #pragma construct)
3:    SEARCH for a possible  BREAK  of the loop
4:   if (loop BREAK statement is not found)
5:      “Print Error: Livelock occur because threads stuck in an infinite loop”
6:    end If
7:end if
A CUDA deadlock occurs if an atomicCAS operation is used improperly. In the example targeted by Algorithm 7, the statement releasing the semaphore that controls access to a critical section was deliberately omitted from the kernel for illustration purposes. An improperly released semaphore can cause the threads to spin indefinitely as they wait to acquire it.
Algorithm 7: The CUDA print error when an atomicCAS operation is misused
1:OPEN THE INPUT CODE FILE
2:SEARCH for  cudakernel
3:IDENTIFY  cudakernel  name
4:if (atomicCAS statement is found inside cudakernel)
5:    “Print Error: Deadlock Because Only one thread allowed to access critical section”
6:end if
As seen in Algorithm 8, a CUDA livelock occurs when a kernel is called repeatedly in an infinite loop without a break statement.
Algorithm 8: The CUDA livelock print error when a kernel is called repeatedly in an infinite loop
1:OPEN THE INPUT CODE FILE
2:IDENTIFY  cudakernel  name
3:FIND where  cudakernel  is being called
4:if (cudakernel is being called in an infinite loop)
5:   “Print Error: Livelock: Kernel is being called repeatedly”
6:end if
An OpenMP data race occurs when a thread tries to update or access values that are updated by another thread. As seen in Algorithm 9, when a thread tries to access the index of an array other than its assigned indices, a data race may occur. In the pseudocode below, the loop variable of an OpenMP loop is first identified. Then, if a thread accesses the index of an array other than its loop variable, a data race occurs.
Algorithm 9: The OpenMP data race print error
1:OPEN THE INPUT CODE FILE
2:SEARCH for #PRAGMA construct
3:If (#PRAGMA construct is found)
4:  FIND if it is OMP PARALLEL or OMP PARALLEL FOR
5:  If (OMP PARALLEL is found || OMP PARALLEL FOR IS FOUND)
6:    CHECK if there is a for loop in this section
7:    If (for loop IS FOUND)
8:      IDENTIFY the loop variable
9:      If (an array is using an index other than the loop variable)
10:        “Print Error: DATA RACE: Accessing values that may be
      updated by other threads”
11:       end If
12:      end If
13:     end If
14:end If
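A minimal example of the access pattern Algorithm 9 flags is given below (the array and bounds are illustrative): each iteration writes vec[i + 1], an element that belongs to a different iteration and therefore potentially to a different thread, so the loop contains a data race.

#include <omp.h>
#include <cstdio>

int main() {
    const int n = 1000;
    int vec[n] = {0};

    #pragma omp parallel for
    for (int i = 0; i < n - 1; ++i) {
        vec[i + 1] = vec[i] + 1;   // data race: the index differs from the loop variable,
                                   // so two threads may touch the same element concurrently
    }

    std::printf("vec[n - 1] = %d (nondeterministic)\n", vec[n - 1]);
    return 0;
}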
In Algorithm 10, a CUDA data race occurs if the threads are not synchronized after the kernel call. This is because, if there are multiple kernels in a program, some of the threads may start to execute the next kernel before all the threads have finished completing the first.
Algorithm 10: The CUDA data race print error
1:OPEN THE INPUT CODE FILE
2:SEARCH for CUDA Kernel
3:IDENTIFY CUDA Kernel name
4:FIND where the CUDA Kernel is called
5:if (threads are not synchronized after the kernel call)
6:   “Print Error: DATA RACE: Threads are not synchronized after the kernel call”
7:end if
The proposed error dependency graph is a reliable tool with which to pinpoint issues after analyzing the source code and generating lists of runtime errors for each programming model. It clearly demonstrates that a deadlock or a livelock in any model can cause a deadlock in the entire program. Leveraging Algorithm 11, an error dependency scheme was effectively implemented. Furthermore, it is evident that data races in all three models can lead to data races or unexpected behavior in the program.
Algorithm 11: The dependency errors between the three models in the tri-programming model
1:IF MPIDEADLOCK == TRUE
2:“Print Error: Deadlock in the whole system because there is a deadlock in MPI”
3:  else if ompdeadlock == True
4:   “Print Error: Deadlock in the whole system because there is deadlock in OpenMP”
5:    else if  cudadeadlock ==True
6:      “Print Error: Deadlock in the whole system because there is deadlock in CUDA”
7:if  mpiLivelock  == True
8:“Print Error: Livelock in the whole system because there is livelock in MPI”
9:else if  ompLivelock ==  True
10:    “Print Error: Deadlock in the whole system because there is livelock in OpenMP”
11:    else if  CudaLivelock ==  True
12:     “Print Error: Deadlock in the whole system because there is livelock in CUDA”
13:     else if  mpirace ==  True AND  omprace ==  True
14:      “Print Error: Race condition in the system because there is race condition in MPI and OpenMP”
15:     else if  mpirace ==  True AND  cudarace ==  True
16:      “Print Error: Race condition in the system because there is race condition in MPI and CUDA”
17:     else if  mpirace ==  True
18:      “Print Error: Race condition and wrong result in the whole system because there is race condition in MPI”
19:     else if  omprace ==  True AND  cudarace ==  True
20:      “Print Error: Race condition in the system because there is race condition in OpenMP and CUDA”
21:     else if  omprace ==  True
22:      “Print Error: Race condition in the system because there is race condition in OpenMP”
23:     else if  CudaRace ==  True
24:      “Print Error: Race condition in the system because there is race condition in CUDA”
The threads have to be closely monitored to prevent deadlocks. Dynamic testing code, which will be implemented in a subsequent study, will be integrated into the header file to deliver comprehensive reports to programmers regarding the test status.

6. Testing and Evaluation

The runtime error detection performance of the proposed tool was compared with that of multiple existing benchmark programs. The tests were conducted on a laptop equipped with an 11th Generation Intel® Core™ i7-10750H processor (2.66 GHz, 12 threads) that had 16 GB of RAM and an NVIDIA® GeForce GTX 1650 GPU. The operating system (OS) was Ubuntu™ 20.04.4 LTS, which served as the development environment. Specialized compilers that suited each parallel computing framework were used in the compiling process. More specifically, the NVIDIA® CUDA compiler (NVCC), which is provided in the CUDA Toolkit (recommended version 12.2 or newer), was used for the CUDA model. The MPI compiler wrapper (MPICC), which is provided in OpenMPI (version 4.1.5) or MPICH (version 3.4.3), was used for the MPI model. The GNU compiler collection (GCC) (version 11.3.0 or newer) was used for the OpenMP model as it supports OpenMP 5.0 and facilitates efficient multi-threading. Therefore, in the compilation process, NVCC was used for the CUDA code, MPICC was used for MPI-based parallelization, and the OpenMP flag in the GCC was used to enable the functionality of OpenMP.
Several benchmark suites were utilized to evaluate the MPI section of the proposed tool, including the NAS parallel benchmarks (NPB) [87], the Edinburgh Parallel Computing Centre (EPCC) OpenMP microbenchmarks [88], and mpiBench [89,90]. In addition, OpenMP DataRaceBench version 1.4.0 [91,92], NPB-C++ parallel programming (CPP) version 3.3.1 [93], NPB version 3.4.1 [94], and the HPC group at RWTH Aachen University’s DRACC microbenchmark suite [95] were used to benchmark the performance of the CUDA section. Table 4 provides a brief overview of the benchmarks used in this study. Figure 6, Figure 7 and Figure 8 illustrate the effectiveness of the proposed tool. The x-axis indicates which codes were tested using the selected benchmark programs, the left y-axis indicates the number of lines in each code, and the right y-axis indicates the execution time of each benchmark program, measured in clocks per second.
Figure 9, which is a screenshot of the log file of one of the tests, shows the errors that the compiler did not detect during the test.
Figure 10 shows the errors that the proposed tool detected in one of the codes during the test. The final summary highlights the primary problem with integrating these three programming models. The proposed tool provides a summary, displayed at the end of the report, of the errors that would occur in the program. In the dynamic testing stage, which will be developed in the next study, these potential errors will be verified.
The proposed tool is the first of its kind to analyze C++-based applications integrating MPI, OpenMP, and CUDA programming models. This study aimed to not only compare the proposed tool with existing static analysis tools but to also assess error detection challenges introduced by increasing model complexity. To this end, during these studies, we systematically evaluated runtime error patterns across single-model (MPI, OpenMP, or CUDA individually), dual-model (MPI + OpenMP), and tri-model (MPI + OpenMP + CUDA) applications.
The investigation found a significant increase in error complexity and propagation as more models were integrated. Single-model configurations were characterized by isolated errors in the model, such as data races or deadlocks. Tri-model applications had the highest frequency of interdependent runtime errors, whereas dual-model systems displayed interactions that led to severe synchronization problems. These results underscore the drawbacks of conventional tools in managing hybrid environments.
Table 5 provides an overview of related runtime-testing tools. Notably, the proposed tool is the only one capable of detecting both deadlocks and livelocks across all three models in C++ applications. It also offers structured error logs and dependency graphs that contextualize runtime behaviors, further validating its applicability in multi-model parallel programming environments.

7. Conclusions and Future Directions

In this study, the researchers designed a tool for identifying runtime errors caused by data races, deadlocks, and livelocks in C++-based systems that employ the MPI, OpenMP, and CUDA programming models. Errors are detected using a static testing technique before the code is executed. The proposed tool can also detect errors in any of the constituent programming models and generate an error dependency tree for MPI, OpenMP, and CUDA runtime issues. Furthermore, the researchers proposed a taxonomy of the runtime errors that result from combining the MPI, OpenMP, and CUDA programming paradigms and described various strategies for monitoring these errors.
The proposed application demonstrated better performance than other tools in detecting issues during benchmarking tests using C and C++ code. The proposed tool was the only tool that could identify deadlock errors in the MPI, OpenMP, and CUDA programming models as well as identify the problems that could occur during integration. These features help to guarantee that the application is error-free and increase its dependability. It is worth noting that using the proposed tool does not slow down an application; instead, it displays a list of errors that the programmer needs to address. This paper presents the static component of our hybrid tool, which is available as open-source software in [96].
The researchers of this study aim to improve the effectiveness of the proposed tool by identifying several MPI, OpenMP, and CUDA runtime issues that are currently not supported by the tool. A hybrid version of the proposed tool that incorporates a dynamic technique to detect runtime problems as they are being executed will also be designed in a future study.

Author Contributions

Conceptualization, S.M.A., F.E.E., A.M.A., S.A.S., and K.A.A.; Methodology, S.M.A., F.E.E., A.M.A., and S.A.S.; Software, S.M.A., A.M.A., and R.A.B.K.; Validation, S.M.A., F.E.E., and R.A.B.K.; Formal Analysis, S.M.A. and S.A.S.; Writing—Original Draft, S.M.A. and K.A.A.; Writing—Review and Editing, S.M.A., F.E.E., A.M.A., and K.A.A.; Visualization, S.M.A.; Supervision, F.E.E., A.M.A., and S.A.S.; Funding Acquisition S.M.A.; All authors have read and agreed to the published version of the manuscript.

Funding

This research work was funded by Umm Al-Qura University, Saudi Arabia, under grant number 25UQU4350478GSSR01.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author [Saeed Musaad Altalhi] upon reasonable request.

Acknowledgments

The authors extend their appreciation to Umm Al-Qura University, Saudi Arabia, for funding this research work through grant number 25UQU4350478GSSR01.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Listings A1–A13
You can access these listings in [96].
Listing A1: The deadlock that occurs when lock_a is set twice in Lines 5 and 7 of the same thread without being unset between the lines.
Computers 14 00164 i001
Listing A2: The deadlock that occurs when lock_a is set in each iteration of the loop in the parallel region of the OpenMP section.
Computers 14 00164 i002
Listing A3: The deadlock that occurs when lock_a is set in the first section and then unset and destroyed in the second section.
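An illustrative sketch (not the exact published code): the two sections may run on different threads, so the second section can unset a lock it does not own, or destroy a lock that is still set.

```cpp
#include <omp.h>

int main() {
    omp_lock_t lock_a;
    omp_init_lock(&lock_a);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            omp_set_lock(&lock_a);       // acquired here and never released by this thread
        }
        #pragma omp section
        {
            omp_unset_lock(&lock_a);     // may unset a lock owned by the other section
            omp_destroy_lock(&lock_a);   // or destroy a lock that is still set: deadlock/undefined behavior
        }
    }
    return 0;
}
```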
Listing A4: The deadlock that occurs when a lock termination is conditionally executed with an if/else statement. The lock will be released when the condition is met. However, if the condition is not met, the lock remains in place, creating a deadlock.
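A minimal sketch with a hypothetical condition: the lock is released only on one branch of the if/else, so whenever the condition is false the remaining threads block forever.

```cpp
#include <omp.h>

int main(int argc, char **argv) {
    omp_lock_t lock_a;
    int condition = (argc > 1);      // hypothetical condition
    omp_init_lock(&lock_a);
    #pragma omp parallel
    {
        omp_set_lock(&lock_a);
        if (condition) {
            omp_unset_lock(&lock_a); // released only when the condition holds
        }
        // else branch: the lock is never released, so every other thread
        // blocks forever in omp_set_lock
    }
    omp_destroy_lock(&lock_a);
    return 0;
}
```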
Listing A5: The deadlock that occurs when locks are incorrectly placed and ordered in two OpenMP sections.
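A sketch of the lock-ordering problem (illustrative names): the two sections acquire lock_a and lock_b in opposite orders, which can produce a circular wait when their executions interleave.

```cpp
#include <omp.h>

int main() {
    omp_lock_t lock_a, lock_b;
    omp_init_lock(&lock_a);
    omp_init_lock(&lock_b);
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            omp_set_lock(&lock_a);
            omp_set_lock(&lock_b);   // waits for lock_b, held by the other section
            omp_unset_lock(&lock_b);
            omp_unset_lock(&lock_a);
        }
        #pragma omp section
        {
            omp_set_lock(&lock_b);
            omp_set_lock(&lock_a);   // waits for lock_a: circular wait, deadlock
            omp_unset_lock(&lock_a);
            omp_unset_lock(&lock_b);
        }
    }
    omp_destroy_lock(&lock_a);
    omp_destroy_lock(&lock_b);
    return 0;
}
```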
Listing A6: The deadlock that occurs when P2 arrives at the first receiver (recv), causing the second recv to wait forever.
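A minimal sketch, assuming at least three MPI processes: the wildcard receive in rank 0 is matched by the single message from P2 (rank 2), so the second receive, which expects a message from rank 2, blocks forever.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, buf = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // matched by P2's message
        MPI_Recv(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);              // no second message from P2: blocks forever
    } else if (rank == 2) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   // P2 sends exactly once
    }
    // the remaining ranks send nothing in this sketch
    MPI_Finalize();
    return 0;
}
```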
Listing A7: The deadlock that occurs when tags are mismatched or the order of operations has not been properly synchronized.
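A sketch of a simple tag mismatch (tag values chosen for illustration): the receiver waits for tag 2, but only a tag-1 message is ever sent.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, buf = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);                     // sends with tag 1
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // expects tag 2: never matches, deadlock
    }
    MPI_Finalize();
    return 0;
}
```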
Listing A8: The deadlock that occurs when the data types and tags of the MPI_Send and MPI_Recv operations are mismatched.
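An illustrative sketch with assumed values: the mismatched tag prevents the receive from ever matching, and even with matching tags the MPI_DOUBLE/MPI_INT datatype mismatch would corrupt the received value.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD);                    // MPI_DOUBLE, tag 7
    } else if (rank == 1) {
        int y = 0;
        MPI_Recv(&y, 1, MPI_INT, 0, 9, MPI_COMM_WORLD, MPI_STATUS_IGNORE);    // MPI_INT, tag 9: never matches, deadlock
    }
    MPI_Finalize();
    return 0;
}
```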
Listing A9: The deadlock that occurs when tags are mismatched and the order of operations has not been properly synchronized.
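A sketch combining both problems (tags are illustrative): each rank posts a blocking receive before its send, and the expected tags do not correspond to the tags that would actually be sent, so both ranks block in MPI_Recv.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, buf = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Recv(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // expects tag 1
        MPI_Send(&buf, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);                     // would send tag 2
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // also expects tag 1
        MPI_Send(&buf, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);                     // would send tag 2
    }
    // both ranks block in MPI_Recv before either MPI_Send executes: deadlock
    MPI_Finalize();
    return 0;
}
```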
Listing A10: The deadlock that occurs when a hidden resource leak (a message that is sent but never received) leaves one of the processes waiting indefinitely for a matching message.
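A minimal sketch under assumed tags: the second send from rank 0 is never received (a leaked message), while rank 1 waits for a tag that is never sent.

```cpp
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, buf = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // matched by the first receive
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // never received: leaked message
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // tag 1 is never sent: waits indefinitely
    }
    MPI_Finalize();
    return 0;
}
```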
Listing A11: The deadlock that occurs when an atomicCAS operation is improperly used in the CUDA kernel.
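A sketch of a per-thread spinlock built from atomicCAS (names are illustrative, not the published code). On GPUs that execute a warp in lock-step, the thread that wins the lock cannot advance to the release while its warp-mates keep spinning, so the kernel hangs; architectures with independent thread scheduling may behave differently.

```cuda
#include <cuda_runtime.h>

__device__ int lock_a = 0;                 // 0 = free, 1 = taken

__global__ void bad_spinlock(int *counter) {
    while (atomicCAS(&lock_a, 0, 1) != 0) { /* spin until the lock is free */ }
    *counter += 1;                         // critical section
    atomicExch(&lock_a, 0);                // release (unreachable while the warp spins)
}

int main() {
    int *counter;
    cudaMalloc((void **)&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    bad_spinlock<<<1, 32>>>(counter);      // all 32 threads of one warp contend
    cudaDeviceSynchronize();               // hangs here when the kernel deadlocks
    cudaFree(counter);
    return 0;
}
```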
Listing A12: The data race that occurs when the threads are not synchronized post-kernel execution.
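A sketch of the missing host/device synchronization, assuming managed memory: the kernel launch is asynchronous, and the host reads the buffer without calling cudaDeviceSynchronize(), racing with the device writes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;               // device writes
}

int main() {
    const int n = 256;
    int *data;
    cudaMallocManaged((void **)&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = 0;

    increment<<<1, n>>>(data, n);          // asynchronous launch

    // Missing cudaDeviceSynchronize(): the host read below can overlap with
    // the kernel's writes to the same managed buffer, i.e., a data race.
    printf("data[0] = %d\n", data[0]);

    cudaFree(data);
    return 0;
}
```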
Listing A13: The livelock that occurs due to an infinite loop in the kernel.
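A sketch of one way such a livelock can arise (the flag name and kernel split are assumptions): the first kernel spins on a device flag that only a later kernel would set, but launches on the same stream are serialized, so the spinning threads keep executing without ever making progress.

```cuda
#include <cuda_runtime.h>

__device__ volatile int flag = 0;

__global__ void waiter() {
    while (flag == 0) { /* busy-wait: the loop never exits */ }
}

__global__ void setter() {
    flag = 1;                              // would end the wait, but never runs
}

int main() {
    waiter<<<1, 1>>>();                    // occupies the default stream
    setter<<<1, 1>>>();                    // queued behind the waiter, never starts
    cudaDeviceSynchronize();               // never returns: livelock
    return 0;
}
```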

References

  1. Miri Rostami, S.R.; Ghaffari-Miab, M. Finite Difference Generated Transient Potentials of Open-Layered Media by Parallel Computing Using OpenMP, MPI, OpenACC, and CUDA. IEEE Trans. Antennas Propag. 2019, 67, 6541–6550. [Google Scholar] [CrossRef]
  2. MPI Forum MPI Documents. Available online: https://www.mpi-forum.org/docs/ (accessed on 6 February 2023).
  3. Eassa, F.E.; Alghamdi, A.M.; Haridi, S.; Khemakhem, M.A.; Al-Ghamdi, A.S.A.-M.; Alsolami, E.A. ACC_TEST: Hybrid Testing Approach for OpenACC-Based Programs. IEEE Access 2020, 8, 80358–80368. [Google Scholar] [CrossRef]
  4. Chatarasi, P.; Shirako, J.; Kong, M.; Sarkar, V. An Extended Polyhedral Model for SPMD Programs and Its Use in Static Data Race Detection. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; Springer International Publishing: Cham, Switzerland, 2017; Volume 10136 LNCS, pp. 106–120. [Google Scholar]
  5. Norman, M.; Larkin, J.; Vose, A.; Evans, K. A Case Study of CUDA FORTRAN and OpenACC for an Atmospheric Climate Kernel. J. Comput. Sci. 2015, 9, 1–6. [Google Scholar] [CrossRef]
  6. Hoshino, T.; Maruyama, N.; Matsuoka, S.; Takaki, R. CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application. In Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2013, Delft, The Netherlands, 13–16 May 2013; pp. 136–143. [Google Scholar] [CrossRef]
  7. Sunitha, N.V.; Raju, K.; Chiplunkar, N.N. Performance Improvement of CUDA Applications by Reducing CPU-GPU Data Transfer Overhead. In Proceedings of the 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 10–11 March 2017; IEEE: New York, NY, USA, 2017; pp. 211–215. [Google Scholar]
  8. NVIDIA About CUDA|NVIDIA Developer. Available online: https://developer.nvidia.com/about-cuda (accessed on 6 February 2023).
  9. OpenMP ARB About Us—OpenMP. Available online: https://www.openmp.org/about/about-us/ (accessed on 6 February 2023).
  10. Jin, Z.; Finkel, H. Performance-Oriented Optimizations for OpenCL Streaming Kernels on the FPGA. In Proceedings of the IWOCL’18: International Workshop on OpenCL, Oxford, UK, 14–16 May 2018; ACM: New York, NY, USA, 2018. [Google Scholar] [CrossRef]
  11. Barreales, G.N.; Novalbos, M.; Otaduy, M.A.; Sanchez, A. MDScale: Scalable Multi-GPU Bonded and Short-Range Molecular Dynamics. J. Parallel Distrib. Comput. 2021, 157, 243–255. [Google Scholar] [CrossRef]
  12. Kondratyuk, N.; Nikolskiy, V.; Pavlov, D.; Stegailov, V. GPU-Accelerated Molecular Dynamics: State-of-Art Software Performance and Porting from Nvidia CUDA to AMD HIP. Int. J. High Perform. Comput. Appl. 2021, 35, 312–324. [Google Scholar] [CrossRef]
  13. Strout, M.M.; De Supinski, B.R.; Scogland, T.R.W.; Davis, E.C.; Olschanowsky, C. Evolving OpenMP for Evolving Architectures. In Proceedings of the 14th International Workshop on OpenMP, IWOMP 2018, Barcelona, Spain, 26–28 September 2018; Proceedings. Volume 11128, ISBN 978-3-319-98520-6. [Google Scholar]
  14. Münchhalfen, J.F.; Hilbrich, T.; Protze, J.; Terboven, C.; Müller, M.S. Classification of Common Errors in OpenMP Applications. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in, Bioinformatics); DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S., Eds.; Springer International Publishing: Cham, Switzerland, 2014; Volume 8766, pp. 58–72. [Google Scholar]
  15. Supinski, B.R.D.; Scogland, T.R.W.; Duran, A.; Klemm, M.; Bellido, S.M.; Olivier, S.L.; Terboven, C.; Mattson, T.G. The Ongoing Evolution of OpenMP. Proc. IEEE 2018, 106, 2004–2019. [Google Scholar] [CrossRef]
  16. Bertolacci, I.; Strout, M.M.; de Supinski, B.R.; Scogland, T.R.W.; Davis, E.C.; Olschanowsky, C.; de Supinski, B.R.; Labarta, J.; Valero-Lara, P.; Martorell, X.; et al. Extending OpenMP to Facilitate Loop Optimization. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11128, pp. 53–65. ISBN 0302-9743. [Google Scholar]
  17. Sato, M.; Hanawa, T.; Müller, M.S.; Chapman, B.M.; de Supinski, B.R. (Eds.) Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6132, ISBN 978-3-642-13216-2. [Google Scholar]
  18. Harakal, M. Compute Unified Device Architecture (CUDA) GPU Programming Model and Possible Integration to the Parallel Environment. Sci. Mil. J. 2008, 3, 64–68. [Google Scholar]
  19. Saillard, E. Static/Dynamic Analyses for Validation and Improvements of Multi-Model HPC Applications. Ph.D. Thesis, Université de Bordeaux, Bordeaux, France, 2015. Volume 1228. [Google Scholar]
  20. Basloom, H.S.; Dahab, M.Y.; Alghamdi, A.M.; Eassa, F.E.; Al-Ghamdi, A.S.A.-M.; Haridi, S. Errors Classification and Static Detection Techniques for Dual-Programming Model (OpenMP and OpenACC). IEEE Access 2022, 10, 117808–117826. [Google Scholar] [CrossRef]
  21. Cai, Y.; Lu, Q. Dynamic Testing for Deadlocks via Constraints. IEEE Trans. Softw. Eng. 2016, 42, 825–842. [Google Scholar] [CrossRef]
  22. Saillard, E.; Carribault, P.; Barthou, D. Static/Dynamic Validation of MPI Collective Communications in Multi-Threaded Context. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, San Francisco, CA, USA, 7–11 February 2015; ACM: New York, NY, USA, 2015; Volume 2015, pp. 279–280. [Google Scholar]
  23. Intel® Trace Analyzer and Collector. Available online: https://www.intel.com/content/www/us/en/docs/trace-analyzer-collector/user-guide-reference/2023-1/correctness-checking-of-mpi-applications.html (accessed on 18 June 2023).
  24. Droste, A.; Kuhn, M.; Ludwig, T.M.-C. MPI-Checker-Static Analysis for MPI. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, Austin, TX, USA, 15 November 2015; ACM: New York, NY, USA, 2015; pp. 1–10. [Google Scholar] [CrossRef]
  25. Fan, S.; Keller, R.; Resch, M. Enhanced Memory Debugging of MPI-Parallel Applications in Open MPI. In Tools for High Performance Computing; Resch, M., Keller, R., Himmler, V., Krammer, B., Schulz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 49–60. [Google Scholar]
  26. Vetter, J.S.; de Supinski, B.R. Dynamic Software Testing of MPI Applications with Umpire. In Proceedings of SC ‘00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, Dallas, TX, USA, 4–10 November 2000; IEEE: Dallas, TX, USA, 2000; pp. 4–10. [Google Scholar] [CrossRef]
  27. Hilbrich, T.; Schulz, M.; de Supinski, B.R.; Müller, M.S. MUST: A Scalable Approach to Runtime Error Detection in MPI Programs. In Tools for High Performance Computing 2009; Springer: Berlin/Heidelberg, Germany, 2010; pp. 53–66. ISBN 978-3-642-11260-7. [Google Scholar]
  28. Kranzlmueller, D.; Schaubschlaeger, C.; Volkert, J. A Brief Overview of the MAD Debugging Activities. In Proceedings of the AADEBUG 2000, 4th International Workshop on Automated Testing, Munich, Germany, 28–30 August 2000; pp. 1–6. [Google Scholar]
  29. MUST—RWTH AACHEN UNIVERSITY Lehrstuhl Für Informatik 12—Deutsch. Available online: https://www.i12.rwth-aachen.de/cms/Lehrstuhl-fuer-Informatik/Forschung/Forschungsschwerpunkte/Lehrstuhl-fuer-Hochleistungsrechnen/~nrbe/MUST/ (accessed on 19 February 2023).
  30. Hilbrich, T.; Protze, J.; Schulz, M.; de Supinski, B.R.; Muller, M.S. MPI Runtime Error Detection with MUST: Advances in Deadlock Detection. In Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, Washington, DC, USA, 10–16 November 2012; IEEE: New York, NY, USA, 2012; Volume 21, pp. 1–10. [Google Scholar]
  31. Forejt, V.; Joshi, S.; Kroening, D.; Narayanaswamy, G.; Sharma, S. Precise Predictive Analysis for Discovering Communication Deadlocks in MPI Programs. ACM Trans. Program. 2017, 39, 1–27. [Google Scholar] [CrossRef]
  32. Luecke, G.R.; Coyle, J.; Hoekstra, J.; Kraeva, M.; Xu, Y.; Park, M.-Y.; Kleiman, E.; Weiss, O.; Wehe, A.; Yahya, M. The Importance of Run-Time Error Detection. In Tools for High Performance Computing 2009; Müller, M.S., Resch, M.M., Schulz, A., Nagel, W.E., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 145–155. ISBN 978-3-642-11261-4. [Google Scholar]
  33. Saillard, E.; Carribault, P.; Barthou, D. Combining Static and Dynamic Validation of MPI Collective Communications. In Proceedings of the 20th European MPI Users’ Group Meeting, Madrid, Spain, 15–18 September 2013; ACM: New York, NY, USA, 2013; pp. 117–122. [Google Scholar]
  34. Alghamdi, A.S.A.A.M.; Alghamdi, A.S.A.A.M.; Eassa, F.E.; Khemakhem, M.A.A. ACC_TEST: Hybrid Testing Techniques for MPI-Based Programs. IEEE Access 2020, 8, 91488–91500. [Google Scholar] [CrossRef]
  35. Basupalli, V.; Yuki, T.; Rajopadhye, S.; Morvan, A.; Derrien, S.; Quinton, P.; Wonnacott, D. OmpVerify: Polyhedral Analysis for the OpenMP Programmer. In OpenMP in the Petascale Era; IWOMP 2011 Lecture Notes in Computer Science; Chapman, B.M., Gropp, W.D., Kumaran, K., Müller, M.S., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6665. [Google Scholar] [CrossRef]
  36. Ye, F.; Schordan, M.; Liao, C.; Lin, P.-H.; Karlin, I.; Sarkar, V. Using Polyhedral Analysis to Verify OpenMP Applications Are Data Race Free. In Proceedings of the 2018 IEEE/ACM 2nd International Workshop on Software Correctness for HPC Applications (Correctness), Dallas, TX, USA, 12 November 2018; IEEE: New York, NY, USA, 2018; pp. 42–50. [Google Scholar]
  37. Jannesari, A.; Bao, K.; Pankratius, V.; Tichy, W.F. Helgrind+: An Efficient Dynamic Race Detector. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, Rome Italy, 23–29 May 2009; IEEE: New York, NY, USA, 2009; pp. 1–13. [Google Scholar]
  38. Valgrind: Tool Suite. Available online: https://valgrind.org/info/tools.html#memcheck (accessed on 8 March 2023).
  39. Nethercote, N.; Seward, J. Valgrind. ACM Sigplan Not. 2007, 42, 89–100. [Google Scholar] [CrossRef]
  40. Terboven, C. Comparing Intel Thread Checker and Sun Thread Analyzer. Adv. Parallel Comput. 2008, 15, 669–676. [Google Scholar]
  41. Intel(R) Thread Checker 3.1 Release Notes. Available online: https://registrationcenter-download.intel.com/akdlm/irc_nas/1366/ReleaseNotes.htm (accessed on 8 March 2023).
  42. Sun Studio 12: Thread Analyzer User’s Guide. Available online: https://docs.oracle.com/cd/E19205-01/820-0619/820-0619.pdf (accessed on 8 March 2023).
  43. Gu, Y.; Mellor-Crummey, J. Dynamic Data Race Detection for OpenMP Programs. In Proceedings of the SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, USA, 11–16 November 2018; IEEE: New York, NY, USA, 2018; pp. 767–778. [Google Scholar]
  44. Serebryany, K.; Bruening, D.; Potapenko, A.; Vyukov, D. AddressSanitizer: A Fast Address Sanity Checker. In Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), Boston, MA, USA, 13–15 June 2012; USENIX Association: Boston, MA, USA, 2012; pp. 309–318. [Google Scholar]
  45. Serebryany, K.; Potapenko, A.; Iskhodzhanov, T.; Vyukov, D. Dynamic Race Detection with LLVM Compiler: Compile-Time Instrumentation for ThreadSanitizer. In Proceedings of the International Conference on Runtime Verification, Istanbul, Turkey, 25–28 September 2012; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7186 LNCS, pp. 110–114. [Google Scholar]
  46. Atzeni, S.; Gopalakrishnan, G.; Rakamaric, Z.; Ahn, D.H.; Laguna, I.; Schulz, M.; Lee, G.L.; Protze, J.; Muller, M.S. ARCHER: Effectively Spotting Data Races in Large OpenMP Applications. In Proceedings of the 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, 23–27 May 2016; pp. 53–62. [Google Scholar] [CrossRef]
  47. Hilbrich, T.; Müller, M.S.; Krammer, B. Detection of Violations to the Mpi Standard in Hybrid Openmp/Mpi Applications. In Lecture Notes in Computer Science; Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5004 LNCS, pp. 26–35. ISBN 354079560X. [Google Scholar]
  48. Krammer, B.; Bidmon, K.; Müller, M.S.; Resch, M.M. MARMOT: An MPI Analysis and Checking Tool. Adv. Parallel Comput. 2004, 13, 493–500. [Google Scholar] [CrossRef]
  49. Betts, A.; Chong, N.; Donaldson, A.F.; Qadeer, S.; Thomson, P. GPU Verify: A Verifier for GPU Kernels. In Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA, New York, NY, USA, 19–26 October 2012; pp. 113–131. [Google Scholar]
  50. Bardsley, E.; Betts, A.; Chong, N.; Collingbourne, P.; Deligiannis, P.; Donaldson, A.F.; Ketema, J.; Liew, D.; Qadeer, S. Engineering a Static Verification Tool for GPU Kernels. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; Springer International Publishing: Cham, Switzerland, 2014; Volume 8559 LNCS, pp. 226–242. ISBN 9783319088662. [Google Scholar]
  51. Gupta, S.; Sultan, F.; Cadambi, S.; Ivancić, F.; Rotteler, M. Using Hardware Transactional Memory for Data Race Detection. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, Rome, Italy, 23–29 May 2009; pp. 1–11. [Google Scholar] [CrossRef]
  52. Mekkat, V.; Holey, A.; Zhai, A. Accelerating Data Race Detection Utilizing On-Chip Data-Parallel Cores. In Runtime Verification: 4th International Conference, RV 2013, Rennes, France, 24–27 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8174, pp. 201–218. [Google Scholar]
  53. Bekar, C.; Elmas, T.; Okur, S.; Tasiran, S. KUDA: GPU Accelerated Split Race Checker. In Proceedings of the Workshop on Determinism and Correctness in Parallel Programming (WoDet), London, UK, 3 March 2012; Elsevier: Amsterdam, The Netherlands, 2012. [Google Scholar]
  54. Zheng, M.; Ravi, V.T.; Qin, F.; Agrawal, G. GMRace: Detecting Data Races in GPU Programs via a Low-Overhead Scheme. IEEE Trans. Parallel Distrib. Syst. 2014, 25, 104–115. [Google Scholar] [CrossRef]
  55. Zheng, M.; Ravi, V.T.; Qin, F.; Agrawal, G. GRace: A Low-Overhead Mechanism for Detecting Data Races in GPU Programs. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, San Antonio, TX USA, 12–16 February 2011; pp. 135–145. [Google Scholar] [CrossRef]
  56. Dai, Z.; Zhang, Z.; Wang, H.; Li, Y.; Zhang, W. Parallelized Race Detection Based on GPU Architecture. Commun. Comput. Inf. Sci. 2014, 451 CCIS, 113–127. [Google Scholar] [CrossRef]
  57. Boyer, M.; Skadron, K.; Weimer, W. Automated Dynamic Analysis of CUDA Programs. In Proceedings of the Third Workshop on Software Tools for MultiCore Systems, Amsterdam Netherlands, 1 November 2016. [Google Scholar]
  58. Li, P.; Li, G.; Gopalakrishnan, G. Practical Symbolic Race Checking of GPU Programs. In Proceedings of the SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, 16–21 November 2014; Volume 14. pp. 179–190. [Google Scholar]
  59. Clemencon, C.; Fritscher, J.; Ruhl, R. Visualization, Execution Control and Replay of Massively Parallel Programs within Annai’s Debugging Tool. In Proceedings of the High-Performance Computing Symposium (HPCS’95), Raleigh, NC, USA, 22–25 January 1995; pp. 393–404. [Google Scholar]
  60. Bronevetsky, G.; Laguna, I.; Bagchi, S.; De Supinski, B.R.; Ahn, D.H.; Schulz, M. AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks. In Proceedings of the International Conference on Dependable Systems and Networks, Chicago, IL, USA, 28 June–1 July 2010; IEEE: New York, NY, USA, 2010; pp. 231–240. [Google Scholar]
  61. Linaro DDT. Available online: https://www.linaroforge.com/linaro-ddt (accessed on 31 October 2023).
  62. Allinea DDT|HPC@LLNL. Available online: https://hpc.llnl.gov/software/development-environment-software/allinea-ddt (accessed on 31 October 2023).
  63. Totalview Technologies: Totalview—Parallel and Thread Debugger. Available online: https://help.totalview.io/ (accessed on 14 July 2023).
  64. TotalView Debugger|HPC@LLNL. Available online: https://hpc.llnl.gov/software/development-environment-software/totalview-debugger (accessed on 31 October 2023).
  65. Claudio, A.P.; Cunha, J.D.; Carmo, M.B. Monitoring and Debugging Message Passing Applications with MPVisualizer. In Proceedings of the 8th Euromicro Workshop on Parallel and Distributed Processing, Rhodes, Greece, 19–21 January 2000; pp. 376–382. [Google Scholar] [CrossRef]
  66. Intel Inspector|HPC@LLNL. Available online: https://hpc.llnl.gov/software/development-environment-software/intel-inspector (accessed on 8 March 2023).
  67. Documentation—Arm DDT. Available online: https://www.alcf.anl.gov/support-center/training/debugging-arm (accessed on 31 October 2023).
  68. Saad, S.; Fadel, E.; Alzamzami, O.; Eassa, F.; Alghamdi, A.M. Temporal-Logic-Based Testing Tool Architecture for Dual-Programming Model Systems. Computers 2024, 13, 86. [Google Scholar] [CrossRef]
  69. Alghamdi, A.M.; Eassa, F.E. Openacc Errors Classification and Static Detection Techniques. IEEE Access 2019, 7, 113235–113253. [Google Scholar] [CrossRef]
  70. Checkaraou, A.W.M.; Rousset, A.; Besseron, X.; Varrette, S.; Peters, B. Hybrid MPI+openMP Implementation of EXtended Discrete Element Method. In Proceedings of the 2018 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD, Lyon, France, 24–27 September 2018; pp. 450–457. [Google Scholar] [CrossRef]
  71. Garcia-Gasulla, M.; Houzeaux, G.; Ferrer, R.; Artigues, A.; López, V.; Labarta, J.; Vázquez, M. MPI+X: Task-Based Parallelisation and Dynamic Load Balance of Finite Element Assembly. Int. J. Comput Fluid Dyn. 2019, 33, 115–136. [Google Scholar] [CrossRef]
  72. Altalhi, S.M.; Eassa, F.E.; Al-Ghamdi, A.S.A.M.; Sharaf, S.A.; Alghamdi, A.M.; Almarhabi, K.A.; Khemakhem, M.A. An Architecture for a Tri-Programming Model-Based Parallel Hybrid Testing Tool. Appl. Sci. 2023, 13, 1960. [Google Scholar] [CrossRef]
  73. Freire, Y.N.; Senger, H. Integrating CUDA Memory Management Mechanisms for Domain Decomposition of an Acoustic Wave Kernel Implemented in OpenMP. In Escola Regional de Alto Desempenho de São Paulo (ERAD-SP); SBC: Bento Gonçalves, Brazil, 2023; pp. 21–24. [Google Scholar] [CrossRef]
  74. Lai, J.; Yu, H.; Tian, Z.; Li, H. Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters. Sci. Program. 2020, 2020, 8862123. [Google Scholar] [CrossRef]
  75. Haque, W. Concurrent Deadlock Detection in Parallel Programs. Int. J. Comput. Appl. 2006, 28, 19–25. [Google Scholar] [CrossRef]
  76. Eslamimehr, M.; Palsberg, J. Sherlock: Scalable Deadlock Detection for Concurrent Programs. In Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering, (FSE 2014). Association for Computing Machinery, Hong Kong China, 16–21 November 2014; pp. 353–365. [Google Scholar] [CrossRef]
  77. Agarwal, R.; Bensalem, S.; Farchi, E.; Havelund, K.; Nir-Buchbinder, Y.; Stoller, S.D.; Ur, S.; Wang, L. Detection of Deadlock Potentials in Multithreaded Programs. IBM J. Res. Dev. 2010, 54, 1–15. [Google Scholar] [CrossRef]
  78. OpenMP Application Programming Interface. Available online: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf (accessed on 16 April 2025).
  79. Yang, C.T.; Huang, C.L.; Lin, C.F. Hybrid CUDA, OpenMP, and MPI Parallel Programming on Multicore GPU Clusters. Comput. Phys. Commun. 2011, 182, 266–269. [Google Scholar] [CrossRef]
  80. Diener, M.; Kale, L.V.; Bodony, D.J. Heterogeneous Computing with OpenMP and Hydra. Concurr. Comput. 2020, 32, e5728. [Google Scholar] [CrossRef]
  81. Akhmetova, D.; Iakymchuk, R.; Ekeberg, O.; Laure, E. Performance Study of Multithreaded MPI and Openmp Tasking in a Large Scientific Code. In Proceedings of the 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017, Orlando, FL, USA, 29 May–2 June 2017; pp. 756–765. [Google Scholar] [CrossRef]
  82. Gonzalez Tallada, M.; Morancho, E. Heterogeneous Programming Using OpenMP and CUDA/HIP for Hybrid CPU-GPU Scientific Applications. Int. J. High. Perform. Comput. Appl. 2023, 37, 626–646. [Google Scholar] [CrossRef]
  83. Gil-Costa, V.; Senger, H. High-Performance Computing for Computational Science. Concurr. Comput. 2020, 32, 18–19. [Google Scholar] [CrossRef]
  84. Aji, A.M.; Panwar, L.S.; Ji, F.; Chabbi, M.; Murthy, K.; Balaji, P.; Bisset, K.R.; Dinan, J.; Feng, W.C.; Mellor-Crummey, J.; et al. On the Efficacy of GPU-Integrated MPI for Scientific Applications. In Proceedings of the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing—HPDC, Minneapolis, MN, USA, 27 June–1 July 2022; pp. 191–202. [Google Scholar] [CrossRef]
  85. Gottschlich, J.; Boehm, H. Generic Programming Needs Transactional Memory. In Proceedings of the Transact 2013: 8th ACM SIGPLAN Workshop on Transactional Computing, Houston, TX, USA, 17 March 2013. [Google Scholar]
  86. Suess, M.; Leopold, C. Generic Locking and Deadlock-Prevention with C++. Adv. Parallel Comput. 2008, 15, 211–218. [Google Scholar]
  87. NAS Parallel Benchmarks Version 3.4.3. Available online: https://www.nas.nasa.gov/software/npb.html (accessed on 14 October 2024).
  88. Bull, J.M.; Enright, J.; Guo, X.; Maynard, C.; Reid, F. Performance Evaluation of Mixed-Mode OpenMP/MPI Implementations. Int. J. Parallel Program. 2010, 38, 396–417. [Google Scholar] [CrossRef]
  89. Grove, D.A.; Coddington, P.D. Precise MPI Performance Measurement Using MPIBench. In Proceedings of the HPC Asia, Nagoya, Japan, 25–27 January 2024. [Google Scholar]
  90. GitHub—LLNL/MpiBench: MPI Benchmark to Test and Measure Collective Performance. Available online: https://github.com/LLNL/mpiBench (accessed on 14 October 2024).
  91. Liao, C.; Lin, P.H.; Asplund, J.; Schordan, M.; Karlin, I. DataRaceBench: A Benchmark Suite for Systematic Evaluation of Data Race Detection Tools. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 9–19 November 2020. [Google Scholar] [CrossRef]
  92. GitHub—LLNL/Dataracebench: Data Race Benchmark Suite for Evaluating OpenMP Correctness Tools Aimed to Detect Data Races. Available online: https://github.com/LLNL/dataracebench (accessed on 14 October 2024).
  93. Griebler, D.; Loff, J.; Mencagli, G.; Danelutto, M.; Fernandes, L.G. Efficient NAS Benchmark Kernels with C++ Parallel Programming. In Proceedings of the 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Cambridge, UK, 21–23 March 2018; pp. 733–740. [Google Scholar]
  94. Löff, J.; Griebler, D.; Mencagli, G.; Araujo, G.; Torquati, M.; Danelutto, M.; Fernandes, L.G. The NAS Parallel Benchmarks for Evaluating C++ Parallel Programming Frameworks on Shared-Memory Architectures. Future Gener. Comput. Syst. 2021, 125, 743–757. [Google Scholar] [CrossRef]
  95. GitHub—RWTH-HPC/DRACC: Benchmarks for Data Race Detection on Accelerators. Available online: https://github.com/RWTH-HPC/DRACC (accessed on 14 October 2024).
  96. Altalhi, S.M.; Eassa, F.E.; Alghamdi, A.M.; Khalid, R.A.B. Static-Tools-for-Detecting-Tri-Level-Programming-Models-MPI-OpenMP-CUDA-MOC-: Static Analysis Components for Tri-Level-Programming Model Using MPI, OpenMP, and CUDA (MOC). Available online: https://github.com/saeedaltalhi/Static-Tools-for-Detecting-Tri-Level-Programming-Models-MPI-OpenMP-CUDA-MOC- (accessed on 22 April 2025).
Figure 1. The single models in the tri-programming model’s runtime error dependency graph.
Figure 2. The dual models in the tri-programming model’s runtime error dependency graph.
Figure 3. The three models in the tri-programming model’s runtime error dependency graph.
Figure 4. A snapshot from the log file, which provides information about the data collected from the MPI section in the tri-level programming model.
Figure 5. A snapshot from the log file which provides information about the data collected from the CUDA and OpenMP sections in the tri-level programming model.
Figure 6. The execution times of the OpenMP benchmark programs.
Figure 7. The execution times of the CUDA benchmark programs.
Figure 8. The execution times of the MPI benchmark programs.
Figure 9. The errors in the program according to the log that the proposed tool created.
Figure 10. One of the code errors that the proposed tool found while testing one of the codes.
Table 1. The errors that may occur in each programming model and their effects on the tri-programming model.
CUDA | MPI | OpenMP | Tri-Programming Model
Error-free | Error-free | Error-free | No error
Data race | Error-free | Error-free | Data race
Deadlock | Error-free | Error-free | Deadlock
Potential deadlock | Error-free | Error-free | Potential deadlock
Livelock | Error-free | Error-free | Livelock
Potential livelock | Error-free | Error-free | Potential livelock
Error-free | Deadlock | Error-free | Deadlock
Error-free | Potential deadlock | Error-free | Potential deadlock
Error-free | Data race | Error-free | Data race
Error-free | Potential data race | Error-free | Potential data race
Error-free | Error-free | Deadlock | Deadlock
Error-free | Error-free | Potential deadlock | Potential deadlock
Error-free | Error-free | Data race | Data race
Error-free | Error-free | Potential data race | Potential data race
Error-free | Error-free | Livelock | Livelock
Error-free | Error-free | Potential livelock | Potential livelock
Table 2. The errors that may occur in either of the two programming models and their effects on the tri-programming model.
CUDA | MPI | OpenMP | Tri-Programming Model
Data race | Deadlock | Error-free | Deadlock
Data race | Race condition | Error-free | Data race, with possibly incorrect results
Deadlock | Deadlock | Error-free | Deadlock
Deadlock | Race condition | Error-free | Deadlock
Livelock | Deadlock | Error-free | Stuck, depending on which error occurs first
Livelock | Data race | Error-free | Livelock
Data race | Error-free | Deadlock | Deadlock
Data race | Error-free | Data race | Data race, with possibly incorrect results
Data race | Error-free | Livelock | Livelock
Deadlock | Error-free | Deadlock | Deadlock
Deadlock | Error-free | Data race | Deadlock
Deadlock | Error-free | Livelock | Deadlock
Livelock | Error-free | Deadlock | Stuck, depending on which error occurs first
Livelock | Error-free | Data race | Livelock
Livelock | Error-free | Livelock | Livelock
Error-free | Deadlock | Deadlock | Stuck, depending on which deadlock occurs first
Error-free | Deadlock | Data race | Deadlock
Error-free | Deadlock | Livelock | Stuck, depending on which error occurs first
Error-free | Data race | Deadlock | Deadlock
Error-free | Data race | Data race | Data race, with possibly incorrect results
Error-free | Data race | Livelock | Livelock
Table 3. The errors that may occur in all three programming models and their effects on the tri-programming model.
CUDA | MPI | OpenMP | Tri-Programming Model
Data race | Deadlock | Deadlock | Stuck, depending on which deadlock occurs first
Data race | Deadlock | Data race | Deadlock
Data race | Deadlock | Livelock | Stuck, depending on which error occurs first
Data race | Data race | Deadlock | Deadlock
Data race | Data race | Data race | Data race, with possibly incorrect results
Data race | Data race | Livelock | Livelock
Deadlock | Deadlock | Deadlock | Stuck, depending on which deadlock occurs first
Deadlock | Data race | Data race | Deadlock
Deadlock | Deadlock | Livelock | Stuck, depending on which error occurs first
Deadlock | Data race | Deadlock | Stuck, depending on which deadlock occurs first
Deadlock | Deadlock | Data race | Stuck, depending on which deadlock occurs first
Deadlock | Data race | Livelock | Stuck, depending on which error occurs first
Livelock | Deadlock | Deadlock | Stuck, depending on which error occurs first
Livelock | Data race | Data race | Livelock
Livelock | Deadlock | Livelock | Stuck, depending on which livelock occurs first
Livelock | Data race | Deadlock | Stuck, depending on which error occurs first
Livelock | Deadlock | Data race | Stuck, depending on which error occurs first
Livelock | Data race | Livelock | Stuck, depending on which livelock occurs first
Table 4. The benchmark programs used to compare the performance of the tri-programming model.
Model | Benchmark | Program | Lines
MPI | NAS | DT | 714
MPI | NAS | IS | 1179
MPI | EPCC | PingPong | 774
MPI | EPCC | PingPing | 400
MPI | EPCC | Broadcast | 169
MPI | EPCC | ParallelEnvironment | 353
MPI | mpiBench | mpiBench | 1077
OpenMP | RWTH-HPC/DRACC | DRB001 | 19
OpenMP | RWTH-HPC/DRACC | DRB017 | 22
OpenMP | RWTH-HPC/DRACC | DRB021 | 24
OpenMP | RWTH-HPC/DRACC | DRB023 | 15
OpenMP | RWTH-HPC/DRACC | DRB031 | 21
OpenMP | RWTH-HPC/DRACC | DRB037 | 17
OpenMP | RWTH-HPC/DRACC | DRB104 | 29
OpenMP | NPB-CPP/NPB-OMP | CG | 1027
OpenMP | NPB-CPP/NPB-OMP | EP | 334
OpenMP | NPB-CPP/NPB-OMP | FT | 1184
OpenMP | NPB-CPP/NPB-OMP | IS | 796
OpenMP | NPB-CPP/NPB-OMP | MG | 1388
CUDA | RWTH-HPC/DRACC | DRAAC_CUDA_001 | 66
CUDA | RWTH-HPC/DRACC | DRAAC_CUDA_004 | 93
CUDA | RWTH-HPC/DRACC | DRAAC_CUDA_007 | 85
CUDA | RWTH-HPC/DRACC | DRAAC_CUDA_020 | 120
CUDA | RWTH-HPC/DRACC | DRAAC_CUDA_023 | 75
CUDA | RWTH-HPC/DRACC | DRAAC_CUDA_027 | 67
Table 5. The error coverage of the proposed tool for runtime error detection in tri-programming models.
Programming Model | Runtime Error | Our Tool (Static Approach) * | ACC_TEST * | ROMP * | LLOV * | ARCHER * | MPRACER * | GPUVerify
OpenMP | Race Condition
Loop Parallelization Race
Nowait Clause
Shared Clause
Barrier Construct
Atomic Construct
Critical Construct
Master Construct P
Single Construct P
SIMD Directives
Teams Construct
Deadlock
Lock Routines
Section Construct
Barrier Construct
Nowait Clause P
Livelock P
Device Directives
CUDA | Race Condition
Host/Device Synchronization Race
Asynchronous Directive Race
Deadlock
Device Deadlock
Host Deadlock P
Livelock P
MPI | Point-to-Point Blocking/Nonblocking Communications
Illegal MPI Calls
Data Type Mismatching
Data Size Mismatching
Resource Leaks
Inconsistent Send/Recv Pairs (Wildcard Receive)
Race Condition P P
Deadlock P P
Collective Blocking/Nonblocking Communications
Illegal MPI Calls
Data Type Mismatching
Data Size Mismatching
Inconsistent Send/Recv Pairs (Wildcard Receive) P P
Race Condition P P
Deadlock P P
* ✓: Fully detected; P: Partially detected; ✕: Not detected.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
