Parallel Hybrid Testing Techniques for the Dual-Programming Models-Based Programs

Abstract: The importance of high-performance computing is increasing, and Exascale systems will be feasible in a few years. These systems can be achieved by enhancing the hardware's ability as well as the parallelism in the application by integrating more than one programming model. One of the dual-programming model combinations is Message Passing Interface (MPI) + OpenACC, which has several features including increased system parallelism, support for different platforms with more performance, better productivity, and less programming effort. Several testing tools target parallel applications built by using programming models, but more effort is needed, especially for high-level Graphics Processing Unit (GPU)-related programming models. Owing to the integration of different programming models, errors will be more frequent and unpredictable. Testing techniques are required to detect these errors, especially runtime errors resulting from the integration of MPI and OpenACC; studying their behavior is also important, especially some OpenACC runtime errors that cannot be detected by any compiler. In this paper, we enhance the capabilities of ACC_TEST to test the programs built by using the dual-programming models MPI + OpenACC and detect their related errors. Our tool integrates both static and dynamic testing techniques, which allows us to benefit from the advantages of both: reducing overheads, enhancing system execution time, and covering a wide range of errors. Finally, ACC_TEST is a parallel testing tool that creates testing threads based on the number of application threads for detecting runtime errors.


Introduction
Exascale systems will be available in a few years. These systems can be achieved by enhancing hardware ability as well as parallelism in the application by integrating different programming models using dual- and tri-programming models. Exascale systems can achieve 10^18 floating-point operations per second. Integrated programming models offer features to support parallelism, including portability, flexibility, compatibility, and less programming effort and time.
In this paper, ACC_TEST has been extended to test programs built using the MPI + OpenACC dual-programming model and to detect related errors. Our solution aims to cover a wide range of errors that occur in the dual-programming model MPI + OpenACC with less overhead and better system execution time. Finally, our testing tool works in parallel, detecting runtime errors with testing threads created based on the number of application threads.
This paper is structured as follows: Section 2 reviews related work, including testing tools classified by the testing techniques they use. We explain our techniques for testing the programs based on dual-programming models in Section 3. We discuss our implementation, testing, and evaluation of ACC_TEST in Section 4 and show some results from our experiments. The conclusion and future scope are discussed in Section 5.

Related Work
In our study, approximately 30 different tools and techniques were reviewed, ranging from open-source to commercial tools. They apply different types of testing techniques to programming models and serve different purposes, either finding runtime errors or tracking the cause of these errors (debuggers). Only testing related to parallel systems has been included in our study, with a focus on the testing techniques used to detect runtime errors that occur on parallel systems using programming models. The tools and techniques were chosen from the wide range of testing tools and techniques in our full published survey [13]; we eliminated any tool or technique that did not meet our objectives. We classified the testing techniques used into four categories, namely static, dynamic, hybrid, and symbolic testing techniques. Further, we classified these techniques into two subcategories according to the targeted programming model level: single- or dual-level programming models. The following subsections discuss our classifications.

Static Testing Techniques
Five testing tools were classified as using the static testing technique to detect errors in parallel programs parallelized using programming models. The testing tools [14] and GPUVerify [15] used the static technique to detect data races in the CUDA, OpenCL, and OpenMP programming models individually. For OpenMP, the testing tools [16] and ompVerify [17] were used to detect data races. Finally, MPI-Checker [18] used static techniques to detect MPI mismatching errors.

Dynamic Testing Techniques
In our study, many testing tools use dynamic testing to detect runtime errors in parallel programs. Regarding detecting errors in the MPI programming model, fourteen testing tools use a dynamic technique that targets MPI. The testing tools MEMCHECKER [19], MUST [20], STAT [21], Nasty-MPI [22], and Intel Message Checker [23] were used to detect MPI runtime errors, including deadlocks, data race, and mismatching. For detecting deadlocks and mismatching, MPI-CHECK [24], GEM [25], and Umpire [26] were used. The tools PDT [27], MAD [28], and [29] were used for detecting MPI deadlocks and race conditions. For deadlocks, MOPPER [30] and ISP [31] were used. Finally, MPIRace-Check [32] was used for detecting MPI race condition.
For detecting data races in CUDA using dynamic techniques, the testing tool in [33] was used. Regarding data race detection, several other testing tools exist, including GUARD [34], RaceTM [35], and KUDA [36] for CUDA. For detecting errors in a heterogeneous programming model by using dynamic testing, WEFT [37] was used to detect deadlocks and race conditions. Finally, for testing the hybrid MPI/OpenMP programming model using dynamic testing, the testing tool MARMOT [38] was used for detecting deadlocks, race conditions, and mismatching.

Hybrid Testing Techniques
Several reviewed testing tools used hybrid testing techniques, which combine static and dynamic testing. In our survey, five tools used hybrid testing techniques. These tools are classified into tools targeting single- and dual-level programming models.
In terms of testing tools designed for the single-level programming model, four testing tools targeted single-level programming models, including OpenMP, OpenCL, and CUDA. ARCHER [39] and Dragon [40] are testing tools that use hybrid testing techniques to detect data race in the OpenMP programming model. GMRace [41] and GRace [42] use hybrid testing techniques to detect data race in the CUDA programming model. Finally, GRace is also used to test the OpenCL programming model for detecting data race. All the previous testing tools used static/dynamic hybrid testing techniques to detect runtime errors.
Two testing tools used static/dynamic hybrid testing techniques to detect runtime errors in MPI + OpenMP dual programming model. These tools are PARCOACH [43] and [44], which used the hybrid model to detect deadlocks and other runtime errors resulting from the dual programming model. Even though combining two programming models is beneficial, it creates complex runtime errors that are difficult to detect and determine.
It is noticeable that dynamic testing is mainly used for detecting runtime errors in different programming models. Single-level programming models were the main testing targets, especially MPI and OpenMP, because of their wide use and their long history among programming models. Regarding heterogeneous programming models in the reviewed testing tools, CUDA is the most targeted programming model, while OpenACC has not been targeted by any reviewed testing tool for detecting runtime errors, despite its benefits and growing use. However, there are some OpenACC-related studies, including evaluations of compilers for OpenACC 2.0 in [7] and OpenACC 2.5 in [8]. In [9], a comparative study evaluated the PGI, CRAY, and CAPS compilers. For testing numerically intensive programs, PGI Compiler Assisted Software Testing (PCAST) [45] was released as a feature in PGI compilers and runtime. PCAST is useful for detecting errors when numerical results diverge between CPU and GPU versions of code and when the same code runs on different processor architectures. However, PCAST cannot detect runtime errors, including errors in the OpenACC data directives, race conditions, and deadlock. Finally, we believe that a lot of work needs to be done in creating and developing testing tools for massively parallel systems, especially heterogeneous parallel systems, which will be needed when Exascale systems are applied in different fields.

Our Techniques for Testing Dual-Programming Models Related Programs
We designed ACC_TEST for detecting errors in parallel programs created by using the integrated model MPI + OpenACC. Our solution integrates static and dynamic analysis to check for actual and potential errors with lower overheads while covering a wide range of errors. We use static analysis to discover as many errors as possible and to annotate any potential errors for further dynamic testing. This method helps us reduce unnecessary code instrumentation and limit the dynamic testing to only the code parts that need to be tested during runtime. The dynamic analysis is used to detect any errors that cannot be detected during our static analysis and to examine thread interactions during runtime for possible errors such as race conditions and deadlock. An overview of our solution is shown in Figure 1.
ACC_TEST classifies the targeted source code into several parts to ensure efficient error detection, so that only the parts that need to be investigated are covered. This classification includes OpenACC data and compute regions, MPI point-to-point and collective communications, as well as non-parallel code parts. ACC_TEST detects errors based on the OpenACC error classification published in [4], which classified OpenACC errors into OpenACC data management errors, race condition, deadlock, and livelock. In addition, MPI errors are detected by ACC_TEST based on [46], which classified MPI errors into deadlock, data race, mismatches, resource handling, and memory errors.
OpenACC data clause detection is built on the assumption that OpenACC data clauses can be used inefficiently, which makes them error-prone, and that the resulting errors cannot be detected by a compiler. As a result, ACC_TEST investigates all OpenACC data clauses in the source code using our static approach to ensure their correctness, including their syntax and semantics. There are two main types of OpenACC data clauses, structured and unstructured, which need to be checked differently based on their nature. Algorithm 1 was built to check the structured data clauses in both data and compute regions because of their similar behavior and role. This algorithm uses our static approach to examine any variable related to OpenACC data clause directives and determine its locations in three places in the targeted source code: in the data region, before this region, and after it. If our static approach detects any error, an error message is written to the error list produced by ACC_TEST.
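The three-place check can be illustrated with a minimal Python sketch. It is not Algorithm 1 itself: the function names, the whole-word text search, and the three rules shown are assumptions chosen only to show how occurrences before, inside, and after a region could feed an error list.

```python
import re


def occurs(name, code):
    """Return True if `name` appears as a whole word in `code` (a crude stand-in for real parsing)."""
    return re.search(r"\b" + re.escape(name) + r"\b", code) is not None


def check_structured_clauses(clauses, before, region, after):
    """Hypothetical check: `clauses` maps a clause name to the variables it lists.

    `before`, `region`, and `after` are the source text preceding, inside,
    and following the data or compute region.
    """
    messages = []
    for clause, names in clauses.items():
        for name in names:
            if not occurs(name, region):
                messages.append(f"{clause}({name}): variable is not used inside the region")
            # Assumed rule: data copied in should already appear in host code.
            if clause == "copyin" and not occurs(name, before):
                messages.append(f"{clause}({name}): variable does not appear on the host before the region")
            # Assumed rule: data copied out should be consumed by later host code.
            if clause == "copyout" and not occurs(name, after):
                messages.append(f"{clause}({name}): variable does not appear on the host after the region")
    return messages
```

Calling `check_structured_clauses({"copyin": ["b"]}, before_text, region_text, after_text)` returns a list of messages, one per suspicious variable; a real implementation would work on a parsed representation rather than raw text.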
In terms of the OpenACC unstructured data clause, Algorithm 2 is responsible for detecting any error related to the unstructured data clause as well as detecting the data movements between the enter and exit region. In addition, some behaviors that are not considered an error, but could cause memory-related problems, will be detected by our static approach when variables are moved to the GPU without being used or a temporary GPU variable is created and not deleted at the exit region.
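As a rough illustration of the enter/exit pairing described above, the following Python sketch scans source text for `enter data` and `exit data` directives and reports variables that are placed on the device but never released, or released without having been entered. The function name, the regular expressions, and the message wording are assumptions for illustration and do not reproduce Algorithm 2.

```python
import re

# Assumed directive shape: "#pragma acc enter data <clauses>" on a single line.
DIRECTIVE = re.compile(r"#pragma\s+acc\s+(enter|exit)\s+data\s+(.*)")
CLAUSE = re.compile(r"(\w+)\s*\(([^)]*)\)")


def check_unstructured_data(source):
    """Pair variables from enter data directives with those in exit data directives."""
    on_device = {}   # variable name -> line number of the enter data directive
    warnings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        found = DIRECTIVE.search(line)
        if not found:
            continue
        kind, clause_text = found.groups()
        for clause, var_list in CLAUSE.findall(clause_text):
            for var in var_list.split(","):
                name = var.split("[")[0].strip()   # drop an array section such as a[0:n]
                if kind == "enter" and clause in ("copyin", "create"):
                    on_device[name] = lineno
                elif kind == "exit" and clause in ("copyout", "delete"):
                    if name in on_device:
                        del on_device[name]
                    else:
                        warnings.append(f"line {lineno}: '{name}' leaves the device but was never entered")
    for name, lineno in on_device.items():
        warnings.append(f"line {lineno}: '{name}' is placed on the device but never released")
    return warnings
```

The sketch ignores control flow and multi-line directives, so it only conveys the bookkeeping idea: every variable that enters the device should have a matching exit.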
The ACC_TEST dynamic approach uses the information resulting from our static testing to determine the place of the present data clause in the source code and its variable list. Then, our dynamic instrumentation uses the API function (acc_is_present), which indicates whether the variable is present on the device, to test the present clause. In addition, an if-statement is inserted to test the result of the OpenACC API function and issue an error message indicating the error type and place to the programmer. The code in Figure 2 is an example of our insertion statements before being instrumented, where var_name, var_size, and var_type are determined by our static testing and inserted with the comment character followed by the word "ASSERT" so that our instrumenter can distinguish them. ACC_TEST identifies a variable to receive the value from the API function (acc_is_present), which takes the variable name, size, and type as shown in the previous code. This OpenACC API function tests if the data are present on the GPU. In C and C++, the function returns a nonzero value if the data are present and zero if not. If (acc_is_present) returns a nonzero value, the targeted variable is present on the current GPU, and the program can continue. When the user code and the inserted test code move to the instrumenter phase, our instrumenter removes any comment character followed by our inserted label "ASSERT" so that the inserted test code is compiled.
Symmetry 2020, 12, 1555 9 of 25
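The insert-then-uncomment workflow can be sketched in Python as follows. The marker string `//ASSERT `, the helper names, and the exact shape of the generated C statement are assumptions, since the code of Figure 2 is not reproduced here; only `acc_is_present(data, bytes)` is the OpenACC runtime routine.

```python
MARKER = "//ASSERT "   # assumed marker: comment character followed by the label ASSERT


def make_present_check(var_name, var_size, var_type, line_no):
    """Build a commented-out C test statement for one variable of a present clause."""
    return (
        f"{MARKER}if (!acc_is_present({var_name}, {var_size} * sizeof({var_type}))) "
        f'fprintf(stderr, "present clause error: {var_name} is not on the device '
        f'(line {line_no})\\n");'
    )


def instrument(source):
    """Activate inserted test statements by stripping the marker; leave other lines untouched."""
    out = []
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(MARKER):
            indent = line[: len(line) - len(stripped)]
            out.append(indent + stripped[len(MARKER):])
        else:
            out.append(line)
    return "\n".join(out)
```

Because the test statement stays commented until `instrument` runs, the uninstrumented source still compiles exactly as the user wrote it, and ordinary comments are unaffected.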
In terms of detecting OpenACC race conditions, several situations cause race conditions in OpenACC, including host and device synchronization, loop parallelization, shared data read-and-write, asynchronous directives, reduction clause, and independent clause races. Our instrumentation mechanism uses the static approach annotation to insert code into the targeted source code for collecting information during runtime. In the code given in Figure 3, our insertion mechanism inserts a data structure for collecting actual information during runtime, which is used for several test cases in our dynamic testing phase after the instrumentation phase. For detecting data dependency in OpenACC compute region loops, Algorithm 3 shows the process of detecting loop race conditions by finding any dependency and the possibility of a race condition among the threads of each loop.
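To make the idea of a loop-carried dependency concrete, here is a small Python sketch that inspects array subscripts of the form `index + constant` in a loop body and flags arrays written at one offset and accessed at another. It is a simplified stand-in, not Algorithm 3; the function names and the restriction to single-index affine subscripts are assumptions.

```python
import re


def find_accesses(expr, index):
    """Return (array, offset) pairs for subscripts like a[i], a[i-1], a[i + 2]."""
    pattern = re.compile(
        r"(\w+)\[\s*" + re.escape(index) + r"\s*(?:([+-])\s*(\d+))?\s*\]"
    )
    accesses = []
    for name, sign, number in pattern.findall(expr):
        offset = int(sign + number) if number else 0
        accesses.append((name, offset))
    return accesses


def loop_carried_dependencies(body, index="i"):
    """Flag arrays that are written at one offset and read or written at another."""
    writes, reads = [], []
    for statement in body.split(";"):
        if "=" not in statement:
            continue
        lhs, rhs = statement.split("=", 1)
        writes += find_accesses(lhs, index)
        reads += find_accesses(rhs, index)
    conflicts = set()
    for w_name, w_offset in writes:
        for name, offset in reads + writes:
            if name == w_name and offset != w_offset:
                conflicts.add((w_name, w_offset, offset))
    return sorted(conflicts)
```

For a body such as `a[i] = a[i-1] + b[i];` the function reports array `a`, because iteration `i` reads what iteration `i-1` writes, so running the iterations on different threads could race. A body such as `c[i] = a[i] + b[i];` yields no conflicts.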
To ensure that code actually runs in parallel, we detect the threads generated in OpenACC, including gangs and vectors. Our static testing phase generates tested gangs and vectors for each compute region for further comparison with the actual number of gangs and vectors generated by the original source code. This comparison can help enhance the performance of the user code, because users may assume that some compute regions work in parallel when they actually work sequentially. For each thread, some information is collected during runtime, as shown in the inserted test code in Figure 4, which is used in our dynamic testing phase after instrumentation. These inserted statements are added to each OpenACC compute region.
All thread information in each OpenACC compute region from the inserted test code in Figure 5 is used to test the actual parallelism of each OpenACC compute region and to detect any differences between the tested and actual parallelism, which indicate a compute region that is not parallelized.
In the case of read-write race conditions, Algorithm 4 is used by the ACC_TEST static approach to detect different types of related race conditions. For data races on memory reads and writes, if the addresses cannot be determined at compile time or potentially conflict, the ACC_TEST static approach inserts code to be investigated during the dynamic phase. In our dynamic approach, each memory access statement marked by our static analysis is monitored, and its information is recorded to detect any data race. The information includes the variable, thread id, operation (R/W), and memory address.
To detect race conditions that result from reading and writing to the same address in the Host or Device, our dynamic tester instruments the inserted test statements in the user code to register each address space in the Host and Device and then checks whether there is any writing to the same address at the same time. In this case, the dynamic tester executes the user code and the inserted code to discover whether there is any reading or writing to the same address. If so, an error message is shown to the user, indicating the race condition along with the reason. The inserted test code is shown in Figure 6; it uses the data structure to store actual information during runtime for each thread, as explained earlier. These test codes investigate the device addresses in the same compute region as well as across different compute regions.
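The recorded access information (variable, thread id, operation, and address) lends itself to a simple conflict check. The Python sketch below is an assumed, simplified analysis of such a log, not the test code of Figure 6: it treats all logged accesses as unsynchronized and reports an address when at least one thread writes it and a different thread also touches it.

```python
from collections import defaultdict, namedtuple

# One record per monitored memory access; op is "R" or "W".
Access = namedtuple("Access", "variable thread_id op address")


def find_races(log):
    """Report addresses written by one thread and accessed by at least one other thread."""
    by_address = defaultdict(list)
    for access in log:
        by_address[access.address].append(access)

    races = []
    for address, accesses in by_address.items():
        writers = {a.thread_id for a in accesses if a.op == "W"}
        threads = {a.thread_id for a in accesses}
        # A writer plus any second thread on the same address is a conflict.
        if writers and len(threads) > 1:
            races.append((accesses[0].variable, address, sorted(threads)))
    return races
```

Two threads that only read the same address are not reported, and neither is a single thread that writes an address nobody else touches. A production checker would additionally need ordering information to exclude accesses separated by synchronization.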
In terms of detecting OpenACC deadlock, ACC_TEST uses static analysis to partially detect deadlock by annotating the code parts that need further investigation during runtime. The dynamic part of ACC_TEST uses the annotated instrumentation to test the threads' arrival at the end of each OpenACC compute region. This approach was used because we assumed that each OpenACC parallel compute region could have deadlock. At the end of each OpenACC compute region, our dynamic phase checks the number of threads included in the region and compares it with the number of threads at the end of the OpenACC region. Because the implicit barrier in OpenACC hides information from developers, our approach makes sure any unexpected behavior is investigated and reported to the developers along with related information. The ACC_TEST dynamic approach is also used to detect deadlock caused by GPU livelock as well as livelock that can occur in the Host and Device interaction.
When a deadlock occurs, the source code keeps executing, causing the execution to hang without revealing the problem. Therefore, our dynamic tester adds an asynchronous directive to each compute region that has potential deadlock and numbers these asynchronous directives with the same number as the compute region. However, if our static tester does not detect any potential deadlock in the compute regions, our dynamic tester does not add the asynchronous directives. Our static tester is responsible for marking each compute region that has potential deadlock and providing our dynamic tester with the information needed to proceed with the dynamic testing. Our insertion mechanism inserts test code for testing the threads' arrival at the end of each OpenACC compute region as well as at the end of all regions. As shown in Figure 7, at the end of all compute regions, our tester inserts a timer and tests the arrival of all threads at the end of each OpenACC compute region. If all threads arrive at the end of all OpenACC compute regions, the source code is deadlock-free; if the timer expires before all threads have arrived at the end of all regions, this indicates deadlock.
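The timer-and-arrival idea can be mimicked on the host with ordinary threads. In the Python sketch below, each callable stands in for an asynchronously launched compute region, and an event records arrival at the end of the region; the function name, the use of Python threads, and the default timeout are assumptions for illustration, not the inserted code of Figure 7.

```python
import threading
import time


def find_stalled_regions(regions, timeout_s=0.5):
    """Run each region asynchronously and return the ids that miss the deadline.

    `regions` maps a region number to a callable standing in for that compute region.
    """
    arrived = {region_id: threading.Event() for region_id in regions}

    def run(region_id, body):
        body()
        arrived[region_id].set()   # the region reached its end

    for region_id, body in regions.items():
        # Daemon threads so that a hung region cannot keep the tester alive.
        threading.Thread(target=run, args=(region_id, body), daemon=True).start()

    deadline = time.monotonic() + timeout_s   # one shared timer for all regions
    stalled = []
    for region_id, event in arrived.items():
        remaining = max(0.0, deadline - time.monotonic())
        if not event.wait(remaining):
            stalled.append(region_id)
    return stalled
```

An empty result corresponds to the deadlock-free case; any returned region number identifies where execution hung. Launching the regions asynchronously is what lets the tester itself keep running and report the problem instead of hanging with the program.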
During runtime and after executing the inserted test code, ACC_TEST investigates the OpenACC compute region that caused the deadlock and provides the developers with an error message that indicates the error type and its related compute region. The timer takes into consideration the sending, receiving, and execution time. This can detect the CPU deadlock that results from GPU livelock. Our static phase also analyzes the user source code, without execution, to detect any potential livelock in the GPU. Using the hybrid testing technique helps ensure the user code correctness and detect any potential deadlock.
In terms of MPI error detection, our static approach starts by checking each MPI communication, whether it is point-to-point or collective, blocking or non-blocking. This process determines each MPI call's location in the source code, rank, and communication type. In addition, our static approach retrieves all attributes related to the MPI sending and receiving calls, including data, count, data type, source, destination, tag, and MPI communicator, as well as other information related to each type of MPI call. Algorithm 5 shows the process of storing all related MPI calls and checking MPI data type and size mismatching as well as pairing MPI sends and receives to detect any differences between the numbers of sends and receives. This pairing also examines the message tag to detect any unmatched message pairing. This process helps avoid potential race conditions and deadlocks.
In addition, if there are more MPI sending calls than receiving calls, this is reported, as there will be messages sent without being received, which can cause potential errors. On the other hand, if there are more MPI receiving calls than sending calls, it also causes potential errors, including deadlock and race condition, as there will be processes waiting forever to receive messages. Some cases of deadlock are detected when there is any MPI_Recv call without a sender, and potential deadlock is detected when a wildcard receive is used. An MPI_Send without a matching receive causes lack-of-resource errors. In addition, when there is a data exchange between two different processes where the same process sends and receives, it can cause potential deadlock, which needs further detection during the ACC_TEST dynamic phase. Finally, the receive/receive deadlock is detected by the static phase, but the send/send deadlock needs further investigation during the dynamic phase.
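The send/receive pairing can be sketched in Python over a list of call records extracted from the source. This is an illustrative simplification rather than Algorithm 5: the `Call` record, the `ANY` wildcard constant, the greedy first-match strategy, and the message texts are all assumptions.

```python
from collections import namedtuple

# One record per MPI point-to-point call found by a static scan (hypothetical layout).
Call = namedtuple("Call", "kind src dst tag datatype count line")

ANY = "ANY"   # stands in for MPI_ANY_SOURCE / MPI_ANY_TAG


def check_pairs(calls):
    """Pair receives with sends by source, destination, and tag, and report problems."""
    issues = []
    pending_sends = [c for c in calls if c.kind == "send"]
    for recv in (c for c in calls if c.kind == "recv"):
        if recv.src == ANY or recv.tag == ANY:
            issues.append(f"line {recv.line}: wildcard receive, potential deadlock to check at runtime")
        match = next(
            (s for s in pending_sends
             if s.dst == recv.dst
             and recv.src in (ANY, s.src)
             and recv.tag in (ANY, s.tag)),
            None,
        )
        if match is None:
            issues.append(f"line {recv.line}: receive without a matching send, possible deadlock")
            continue
        pending_sends.remove(match)
        if match.datatype != recv.datatype:
            issues.append(f"lines {match.line}/{recv.line}: data type mismatch")
        if match.count > recv.count:
            issues.append(f"lines {match.line}/{recv.line}: message larger than receive buffer")
    for send in pending_sends:
        issues.append(f"line {send.line}: send without a matching receive")
    return issues
```

Each returned string corresponds to one of the cases discussed above: an unmatched receive, an unmatched send, a wildcard receive that is deferred to runtime checks, or a type or size mismatch within a matched pair.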
ACC_TEST dynamic phase is responsible for detecting deadlocks and race conditions by using the annotation from the static phase and using the insertion mechanism for executing testing during runtime. Connections that have potential errors will be investigated during our dynamic phase, which enhances the testing time and system performance by only testing the required parts of the code and minimizing overheads. In terms of detecting deadlock in point-to-point blocking communication, the following code in Figure 8 shows the inserted codes that test MPI_Send and MPI_Recv calls. the annotation from the static phase and using the insertion mechanism for executing testing during runtime. Connections that have potential errors will be investigated during our dynamic phase, which enhances the testing time and system performance by only testing the required parts of the code and minimizing overheads. In terms of detecting deadlock in point-to-point blocking communication, the following code in Figure 8 shows the inserted codes that test MPI_Send and MPI_Recv calls. For detecting race condition, ACC_TEST will compare all received calls' actual information with our static phase information to detect any potential race conditions, as displayed in Figure 9, where some values will be inserted and compared with the result from actual values. Similarly, to test the MPI_sendrecv, ACC_TEST will split this connection into MPI_Send and MPI_Recv and test them individually, as shown in Figure 10.  For detecting race condition, ACC_TEST will compare all received calls' actual information with our static phase information to detect any potential race conditions, as displayed in Figure 9, where some values will be inserted and compared with the result from actual values. Similarly, to test the MPI_sendrecv, ACC_TEST will split this connection into MPI_Send and MPI_Recv and test them individually, as shown in Figure 10.
In the case of collective communication, MPI_Bcast is examined during runtime to detect any deadlock by inserting testing code, as displayed in Figure 11, which tests the data exchange between different MPI broadcasts. The annotations are used to replace each blocking broadcast (MPI_Bcast) with a non-blocking broadcast (MPI_Ibcast), which avoids any runtime blocking behavior and makes it possible to test the arrival of all broadcast calls and to compare the actual information with the tested broadcast calls.
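The arrival and consistency check for broadcasts can be modeled in plain C as shown below. This is a sketch under stated assumptions rather than the code of Figure 11: the types `bcast_obs` and `bcast_report` and the function `check_bcast` are hypothetical, and the `arrived` flag is assumed to be filled in by the tester, for example by polling the request returned by MPI_Ibcast with MPI_Test until a timeout expires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-rank observation of one broadcast. */
struct bcast_obs {
    int arrived;  /* 1 if this rank's broadcast completed before the timeout */
    int root;     /* root argument passed by this rank */
    int count;    /* element count passed by this rank */
};

struct bcast_report {
    int missing;         /* ranks that never completed: deadlock candidates */
    int root_mismatch;   /* ranks whose root differs from the expected root */
    int count_mismatch;  /* ranks whose count differs from the expected count */
};

/* Compare every rank's observation with the values expected from the
   static phase. Ranks that did not arrive are only counted as missing. */
struct bcast_report check_bcast(const struct bcast_obs *obs, size_t nranks,
                                int expected_root, int expected_count)
{
    assert(obs != NULL && nranks > 0);
    struct bcast_report r = {0, 0, 0};
    for (size_t i = 0; i < nranks; i++) {
        if (!obs[i].arrived) {
            r.missing++;
            continue;
        }
        if (obs[i].root != expected_root) r.root_mismatch++;
        if (obs[i].count != expected_count) r.count_mismatch++;
    }
    return r;
}
```

Because the broadcast is non-blocking in this model, a rank that never reaches the call shows up as `missing` instead of hanging the whole test run.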
In terms of the instrumentation method, we add the testing code to the source code. Because the testing code is added to the user code, the testing is distributed, which increases the reliability of our testing tool and avoids the single point of failure that arises with the second method, in which a centralized controller calls testing functions. In our case, reliability is more important than a smaller code size. In addition, the testing code is used only in the testing house. In the operation phase, the user can choose the uninstrumented source code, in which case the compiler ignores the test code and the code size does not affect the operational user code. The chosen method also enhances performance by having distributed testing code rather than centralized control.
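One way to picture how inserted test statements can be ignored by the compiler until an instrumenter activates them is the following sketch. It assumes that each inserted statement is stored behind a comment marker; the marker `//@TEST ` and the function `activate_line` are hypothetical and do not describe the actual ACC_TEST format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical marker: inserted test statements are kept as comments
   so that an ordinary compiler ignores them. */
#define TEST_MARKER "//@TEST "

/* Copy one source line into out. If the line is a marked test statement,
   strip the marker (keeping the indentation) and return 1; otherwise copy
   the line unchanged and return 0. */
int activate_line(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL && out_size > 0);
    size_t indent = strspn(line, " \t");
    size_t marker_len = strlen(TEST_MARKER);

    if (strncmp(line + indent, TEST_MARKER, marker_len) == 0) {
        snprintf(out, out_size, "%.*s%s",
                 (int)indent, line, line + indent + marker_len);
        return 1;
    }
    snprintf(out, out_size, "%s", line);
    return 0;
}
```

Under this assumption, the uninstrumented file compiles exactly as the user wrote it, and only the copy produced by the instrumenter contains live test code.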
ACC_TEST collects historical information during the static and dynamic phases and stores it in a log file for use in debugging and for tracking changes in user code. The runtime errors detected by the dynamic phase are stored in a report separate from the report of the static analysis. After testing MPI and OpenACC-related programs, ACC_TEST creates a historical log file containing several types of information that are useful for debugging.
This file includes the following information related to MPI + OpenACC programs: (1) a summary of OpenACC regions: the total number of compute, structured, and unstructured data regions, as well as the starting and ending points of each region; (2) compute region variable information; (3) loop information; and (4) equation information.
Finally, our testing tool produces several files reporting the errors detected in the static and dynamic phases. In summary, the outputs of our testing tool are: (1) the inserted source code, including user code and uninstrumented test code; (2) the static errors report, including information about errors detected during the static phase; (3) the dynamic errors report, including information about errors detected during the dynamic phase; (4) a historical log file, including information from our static analysis; and (5) a historical log file for our dynamic analysis.
ACC_TEST can test MPI + OpenACC hybrid programs as well as OpenACC-based and MPI-based programs individually.

Discussion and Evaluation
Our tool has been implemented and tested to verify and validate ACC_TEST. Several experiments were conducted, covering several scenarios and test suites, to test our proposed solution and to ensure ACC_TEST's capability to detect different types of errors in MPI, OpenACC, and MPI + OpenACC dual-programming models. Because of the lack of benchmarks for the dual-programming model MPI + OpenACC, we created our own hybrid programming model test suites for evaluating our testing techniques as well as the error coverage. We built several test cases for testing our proposed techniques and our testing tool, as shown in Table 1. These test suites include both OpenACC directives and MPI calls for building parallel programs using the dual-programming model MPI + OpenACC. We built these test suites to evaluate our hybrid testing tool, to examine its ability to cover the runtime errors that we targeted, and to measure different overheads, including size, compilation, and execution overheads.
Table 2 shows our hybrid testing tool's ability to detect errors that occur in MPI + OpenACC dual-programming models. We collected all errors that can be identified by our OpenACC and MPI testers and examined them with our hybrid testing tool for the dual-programming model. We found that our integrated tool could detect all targeted errors. Figure 12 displays the number of errors detected by our tool, including the number of errors detected by our static and dynamic approaches.
In terms of size overheads, we used Equation (1) for measuring the size overhead, as shown in Figure 13.
We noted that the size overheads range between 79% and 135%, depending on the nature of the source code and its behavior. However, these size overheads do not affect the user code, because the compiler treats all inserted statements as comments, as they all start with the comment character. The inserted statements affect the user source code only in the testing house and only when the code passes through our instrumenter. We believe that these size overheads are necessary because of the nature of runtime errors, which cannot be detected during our static phase analysis and need to be tested during runtime.
Figure 13. MPI + OpenACC hybrid program size overhead (by bytes).
In terms of compilation and execution times, we measured our test cases and compared the compilation and execution times before and after our insertion process. Figure 14 shows the average compilation time, which is 198 milliseconds before insertion and 230 milliseconds after insertion.
The execution time varies with the number of processes. The average execution time before insertion ranges from 230 milliseconds with 4 processes to 1578 milliseconds with 128 processes. After insertion, the execution time ranges from 241 to 1765 milliseconds for 4 and 128 processes, respectively.
Compilation Time (CT) and Execution Time (ET) overheads are calculated using Equation (2):
$$\text{CT/ET Overhead} = \frac{T_{\text{with inserted test code}} - T_{\text{without inserted test code}}}{T_{\text{without inserted test code}}} \tag{2}$$
Figure 15 shows the compilation time overhead resulting from ACC_TEST. The compilation overheads range from 12% to 27%, which is considered acceptable and can vary depending on the compiler used and the system behavior. In terms of measuring the overheads and their relation to the number of MPI processes, Figure 16 shows the execution time overhead for MPI + OpenACC-related test cases.
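As a worked illustration of Equation (2), the relative overhead can be computed with a small helper. The function name `relative_overhead` is a hypothetical choice for this sketch; the inputs used in the usage note are the average timings reported in this section.

```c
#include <assert.h>

/* Relative overhead as in Equation (2):
   (T with inserted test code - T without) / T without.
   Both times must use the same unit; the result is a fraction,
   so multiply by 100 to obtain a percentage. */
double relative_overhead(double t_with, double t_without)
{
    assert(t_without > 0.0);
    return (t_with - t_without) / t_without;
}
```

With the averages above, `relative_overhead(230.0, 198.0)` is about 0.162 (roughly 16% average compilation overhead), `relative_overhead(241.0, 230.0)` is about 0.048 for 4 processes, and `relative_overhead(1765.0, 1578.0)` is about 0.119 for 128 processes. These averages are not the per-test-case values plotted in Figures 15 and 16, but they are consistent with the reported ranges.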
We successfully kept the execution overheads under 18%; they range between 1% and just below 18%, depending on the system behavior, the machine status, and the number of processes.
Finally, Figure 17 shows the execution time for testing programs built using the MPI + OpenACC dual-programming model, for which the average testing time is 115 milliseconds.
In comparison to the MPI testing techniques published in [47], our testing tool minimizes the size overhead for testing MPI-related programs because we avoid adding unnecessary messages (communications) to test the connection between senders and receivers when detecting deadlock. Another main advantage of ACC_TEST is that it uses distributed testing techniques, unlike some other tools [48,49], which use a centralized manager for detecting MPI-related errors, creating a single point of failure and a single point of attack.
OpenACC has been used for building ACC_TEST, which makes it portable and hardware architecture-independent. It is therefore intended to work with any GPU accelerator, hardware, platform, and operating system on which OpenACC is supported. In addition, ACC_TEST is easy to maintain and requires less effort because of the high maintainability of OpenACC. Our insertion techniques help increase the reliability of ACC_TEST because they avoid centralized control and single-point-of-failure problems, and they increase performance by distributing our testing tasks. ACC_TEST also helps produce higher-quality systems with fewer errors.
In Table 3, we summarize the comparative study conducted in our research. Because there is no published work or existing testing tool that detects errors in OpenACC or in the dual-programming model MPI + OpenACC, we chose the closest work to compare with our techniques across different attributes. ACC_TEST can cover different types of errors in OpenACC, MPI, and dual-programming models. In addition, ACC_TEST uses hybrid testing techniques to cover a wide range of errors while minimizing overheads. ACC_TEST is built on a distributed mechanism, which avoids single points of failure. Additionally, to the best of our knowledge, ACC_TEST is the first tool to support the dual-programming model MPI + OpenACC. ACC_TEST is scalable and adaptable and can run on any platform.

Conclusions and Future Work
To conclude, considerable effort has been made in testing parallel systems to detect runtime errors, but it is still insufficient, especially for systems using heterogeneous programming models as well as dual- and tri-level programming models. The integration of different programming models into the same system requires new testing techniques to detect runtime errors in heterogeneous parallel systems. We believe that, to achieve reliable systems that can be used in Exascale supercomputers, we should focus on testing those systems because of their massively parallel nature and their huge size, which increase the difficulties and issues involved. We enhanced the capabilities of ACC_TEST to detect errors occurring in the hybrid dual-programming model MPI + OpenACC. In addition, ACC_TEST works in parallel, detecting runtime errors with testing threads created based on the threads of the targeted application. Our tool can also work with any heterogeneous architecture, which increases the portability of ACC_TEST. We have implemented our solution and evaluated its ability to detect runtime errors. Using our parallel hybrid testing techniques yields benefits such as reduced overhead, enhanced system execution time, and coverage of a wide range of errors.
Errors in MPI + OpenACC applications have been successfully detected by using the ACC_TEST hybrid testing techniques. The main objectives of creating our testing tool are to help increase reliability and to ensure error-free code. Our tool also covers a wide range of errors with an acceptable execution overhead, which is less than 20%. Because the testing techniques introduce overheads, the testing processes are used only in the testing house and do not affect the delivered user applications. Finally, to the best of our knowledge, ACC_TEST is the first parallel testing tool built to test applications programmed by using the dual-programming model MPI + OpenACC. ACC_TEST focuses only on the integration of MPI and OpenACC and the resulting errors that can occur in this integration as well as in MPI and OpenACC individually.
In future work, ACC_TEST will be enhanced to cover errors that occur in tri-programming models MPI + OpenACC + X. In addition, we will enhance our techniques to run on real Exascale systems when they become available. Finally, we will make ACC_TEST more intelligent by using AI techniques to generate test cases automatically and by using deep learning during testing.

Data Availability:
The data used to support the findings of this study are available from the corresponding author upon request.