Assessment of OpenMP Master–Slave Implementations for Selected Irregular Parallel Applications

The paper investigates various implementations of the master–slave paradigm using the popular OpenMP API and the relative performance of these implementations on modern multi-core workstation CPUs. It is assumed that a master partitions available input into a batch of a predefined number of data chunks, which are then processed in parallel by a set of slaves, and the procedure is repeated until all input data has been processed. The paper experimentally assesses the performance of six implementations using OpenMP locks, the tasking construct and a dynamically partitioned for loop, without and with overlapping of merging results and data generation, using the gcc compiler. Two distinct parallel applications are tested, each using the six aforementioned implementations, on two systems representing desktop and workstation environments: one with an Intel i7-7700 3.60 GHz Kaby Lake CPU and eight logical processors and the other with two Intel Xeon E5-2620 v4 2.10 GHz Broadwell CPUs and 32 logical processors. From the application point of view, irregular adaptive quadrature numerical integration, as well as finding a region of interest within an irregular image, is tested. Various compute intensities are investigated by setting various computing accuracies per subrange and numbers of image passes, respectively. The results allow programmers to assess which solution and which configuration settings, such as the numbers of threads and thread affinities, should be preferred.


Introduction
In today's parallel programming, a variety of general purpose Application Programming Interfaces (APIs) are widely used, such as OpenMP and OpenCL for shared memory systems including CPUs and GPUs; CUDA, OpenCL and OpenACC for GPUs; MPI for cluster systems; or combinations of these APIs such as MPI+OpenMP+CUDA, MPI+OpenCL [1], etc. On the other hand, these APIs allow implementation of a variety of parallel applications falling into the following main paradigms: master-slave, geometric single program multiple data, pipelining and divide-and-conquer.
At the same time, multi-core CPUs have become widespread and are present in all computer systems, both desktop and server type systems. For this reason, optimization of implementations of such paradigms on such hardware is of key importance nowadays, especially as such implementations can serve as templates for coding domain-specific applications. Consequently, within this paper we investigate various implementations of one such popular programming pattern, master-slave, implemented with one of the leading APIs for programming parallel applications for shared memory systems, OpenMP [2].

Related Work
Works related to the research addressed in this paper can be associated with one of the following areas, described in more detail in subsequent subsections: 1. frameworks related to or using OpenMP that target programming abstractions even easier to use or at a higher level than OpenMP itself. In OmpSs, which uses #pragma omp directives, one creates work using #pragma omp task or #pragma omp for, and the program already starts with a team of threads out of which one executes the main function; #pragma omp parallel is ignored. For the for loop, the compiler creates a task which will internally create several more tasks, each of which implements some part of the iteration space of the corresponding parallel loop. OmpSs has been an OpenMP forerunner for some of these features [14,15]. A recent paper [16] presents an architecture and a solution that extends the OmpSs@FPGA environment with the possibility for tasks offloaded to an FPGA to create and synchronize nested tasks without the need to involve the host. OmpSs-2, following its specification (https://pm.bsc.es/ftp/ompss-2/doc/spec), extends the tasking model of OmpSs/OpenMP so that both task nesting and fine-grained dependencies across different nesting levels are supported. It uses #pragma oss constructs. Important features include, in particular: nested dependency domain connection, early release of dependencies, weak dependencies, a native offload API and a task Pause/Resume API. It should be noted that the latest OpenMP standard also allows tasking as well as offloading to external devices such as Intel Xeon Phi or GPUs [2]. Paper [17] presents PLASMA, the Parallel Linear Algebra Software for Multicore Architectures, in a version which is an OpenMP task-based implementation adopting a tile-based approach to storage, along with algorithms that operate on tiles and use OpenMP for dynamic scheduling based on tasks with dependencies and priorities.
A detailed assessment of the software's performance is presented in the paper using three platforms: with 2 × Intel Xeon E5-2650 v3 CPUs at 2.3 GHz, an Intel Xeon Phi 7250, and 2 × IBM POWER8 CPUs at 3.5 GHz, respectively, using gcc, compared to MKL (for Intel) and ESSL (for IBM). PLASMA resulted in better performance for algorithms suited to its tile-based approach, such as LDL^T factorization as well as QR factorization in the case of tall and skinny matrices.
In [18], the authors presented parts of the first prototype of the sLaSs library with auto-tunable implementations of linear algebra operations. They used OmpSs with its task-based programming model and features such as weak dependencies and regions with the final clause. They benchmarked their solution using a supercomputer featuring nodes with 2 sockets with Intel Xeon Platinum 8160 CPUs, with 24 cores and 48 logical processors. Results are shown for TRSM for the original LASs, sLaSs, PLASMA, MKL and ATLAS, for NPGETRF for LASs, sLaSs and MKL, and for NPGESV for LASs and sLaSs, demonstrating an improvement of the proposed solution of about 18% compared to LASs.

Parallelization of Master-Slave with OpenMP
Master-slave can be thought of as a paradigm that enables parallelization of processing among independently working slaves that receive input data chunks from the master and return results to the master.
OpenMP by itself offers ways of implementing the master-slave paradigm, in particular using:

1. #pragma omp parallel along with #pragma omp master directives, or #pragma omp parallel with distinguishing master and slave codes based on thread ids;
2. #pragma omp parallel with threads fetching tasks in a critical section, where a counter can be used to iterate over available tasks; in [19] this is called an all slave model;
3. assignment of work through dynamic scheduling of independent iterations of a for loop.
In [19], the author presented virtually identical and almost perfectly linear speed-up of the all slave model and the (dynamic,1) loop distribution for the Mandelbrot application on 8 processors. In our case, we provide extended analysis of more implementations and many more CPU cores.
In work [20], the authors proposed a way to extend OpenMP for master-slave programs that can be executed on top of a cluster of multiprocessors. A source-to-source translator translates programs that use an extended version of OpenMP into versions with calls to their runtime library. OpenMP's API is proposed to be extended with #pragma domp parallel taskq for initialization of a work queue, #pragma domp task for starting tasks, as well as #pragma domp function for specification of an MPI description of the arguments of a function. The authors presented performance results for applications such as computing Fibonacci numbers, as well as embarrassingly parallel examples such as generation of Gaussian random deviates and Synthetic Matrix Addition, showing very good scalability with configurations up to 4 × 2 and 8 × 1 (processes × threads). More interesting in the context of this paper were the results for MAND, a master-slave application that computes the Mandelbrot set for a 2-d image of size 512 × 512 pixels. Speed-up on an SMP machine for the best 1 × 4 configuration (4 CPUs) amounted to 3.72, while on a cluster of machines (8 CPUs) it was 6.4, with a task stealing mechanism.
OpenMP will typically be used for parallelization within cluster nodes and integrated with MPI at a higher level for parallelization of master-slave computations among cluster nodes [1,21]. Such a technique should yield better performance in a cluster with multi-core CPUs than an MPI-only approach in which several processes are used as slaves, as opposed to threads within a process communicating with MPI. Furthermore, overlapping communication and computations can be used for earlier sending out of data packets by the master, hiding slave idle times. Such a hybrid MPI/OpenMP scheme has been further extended in terms of dynamic behavior and malleability (the ability to adapt to a changing number of processors) in [22]. Specifically, the authors implemented a solution and investigated MPI's support in terms of the features needed for an extended and dynamic master-slave scheme. A specific application was used, called WaterGAP, which computes current and future water availability worldwide. It partitions the tested global region into basins of various sizes which are forwarded to slaves for processing independent of other slaves. Speed-up is limited by the processing of the slave that takes the maximum of the slaves' times. In order to deal with load imbalance, dynamic arrival of slaves has been adopted. The master assigns the tasks by size, starting from the largest task. Good allocation results in large basins being allocated to a process with many (powerful) processors and smaller basins to a process with fewer (weaker) processors. If a more powerful (in the aforementioned sense) slave arrives, the system can reassign a large basin. Furthermore, slave processes can dynamically split into either processes or threads for parallelization. The authors concluded that MPI-2 provides the needed support for these features, apart from a scenario of sudden withdrawal of slaves in the context of proper finalization of an MPI application. No numerical results were presented, though.
In the case of OpenMP, implementations of master-slave and the producer-consumer pattern might share some elements. A buffer could be (but does not have to be) used for passing data between the master and slaves and is naturally used in producer-consumer implementations. In master-slave, the master would typically manage several data chunks ready to be distributed among slaves, while in producer-consumer, a producer or producers will typically add one data chunk at a time to a buffer. Furthermore, in the producer-consumer pattern, consumers do not return results to the producer(s). In the producer-consumer model we typically consider one or more producers and one or more consumers of data chunks. Data chunk production and consumption rates might differ, in which case a limited-capacity buffer is used, into which producer(s) insert data and from which consumer(s) fetch data for processing.
Book [1] contains three implementations of the master-slave paradigm in OpenMP. These include the designated-master, integrated-master and tasking versions, also considered in this work. The research presented in this paper directly extends those OpenMP implementations. Specifically, the paper extends the implementations with the dynamic-for version, as well as with versions overlapping merging and data generation: tasking2 and dynamic-for2. Additionally, tests within this paper are run for a variety of thread affinity configurations, for various compute intensities, as well as on four multi-core CPU models of modern generations, including Kaby Lake, Coffee Lake, Broadwell and Skylake.
There have been several works focused on optimization of tasking in OpenMP which, as previously mentioned, can be used for implementation of master-slave. Specifically, in paper [23], the authors proposed extensions of the tasking and related constructs with dependencies produce and consume, which create a multi-producer multi-consumer queue associated with a list item. Such a queue can be reused if it already exists. The lifetime of such a queue is linked to the lifetime of the parallel region that encompasses the construct. Such a construct can then be used for implementation of the master-slave model as well. In paper [24], the authors proposed an automatic correction algorithm meant for the OpenMP tasking model. It automatically generates correct task clauses and inserts appropriate task synchronization to maintain data dependence relationships. The authors of paper [25] show that when using OpenMP's tasks for stencil-type computations, where tasks are generated with #pragma omp task for a block of a 3D space, significant gains in performance are possible by adding block objects to locality queues from which a given thread executing a task dequeues blocks using an optimized policy.

Motivations, Application Model and Implementations
It should be emphasized that since the master-slave processing paradigm is widespread and, at the same time, multi-core CPUs are present in practically all desktops and workstations/cluster nodes, it is important to investigate various implementations and determine preferred settings for such scenarios. At the same time, the processor families tested in this work are representatives of the new generations of CPUs in their respective CPU lines. The contribution of this work is an experimental assessment of the performance of the proposed master-slave codes using OpenMP directives and library calls, compiled with gcc and the -fopenmp flag, for representative desktop and workstation systems with multi-core CPUs listed in Table 1.
The model analyzed in this paper distinguishes the following conceptual steps, which are repeated:

1. The master generates a predefined number of data chunks from a data source if there is still data to be fetched from the data source.
2. Data chunks are distributed among slaves for parallel processing.
3. Results of individually processed data chunks are provided to the master for integration into a global result.

It should be noted that this model, assuming that the buffer size is smaller than the size of the total input data, differs from a model in which all input data is generated at once by the master. It might be especially well suited to processing, e.g., data from streams such as from the network, sensors or devices such as cameras, microphones, etc.

Implementations of the Master-Slave Pattern with OpenMP
The OpenMP-based implementations of the analyzed master-slave model described in Section 3 and used for benchmarking are as follows:

1. designated-master (Figure 1): a direct implementation of master-slave in which a separate thread performs the master's tasks of input data packet generation as well as data merging upon filling of the output buffer. The other launched threads perform slaves' tasks.
2. integrated-master (Figure 2): a modified implementation of the designated-master code. The master's tasks are moved into a slave thread. Specifically, if a consumer thread has inserted the last result into the result buffer, it merges the results into a global shared result, clears its space and generates new data packets into the input buffer. If the buffer were large enough to contain all input data, such an implementation would be similar to the all slave implementation shown in [19].
3. tasking (Figure 3): code using the tasking construct. Within a region in which threads operate in parallel (created with #pragma omp parallel), one of the threads generates input data packets and launches tasks (in a loop), each of which is assigned processing of one data packet. These are assigned to the aforementioned threads. Upon completion of processing of all the assigned tasks, results are merged by the one designated thread, new input data is generated and the procedure is repeated.
4. tasking2: this version is an evolution of tasking. It potentially allows overlapping of generation of new data into the buffer and merging of the latest results into the final result by the thread that launched computational tasks in the tasking version. The only difference compared to the tasking version is that data generation is executed using #pragma omp task.
5. dynamic-for (Figure 4): this version is similar to the tasking one with the exception that, instead of tasks, in each iteration of the loop a function processing a given input data packet is launched. Parallelization of the for loop is performed with #pragma omp for with a dynamic scheduling clause with chunk size 1. Upon completion, output is merged, new input data is generated and the procedure is repeated.
6. dynamic-for2 (Figure 5): this version is an evolution of dynamic-for. It allows overlapping of generation of new data into the buffer and merging of the latest results into the final result through assignment of both operations to threads with specific ids (such as 0 and 4 in the listing). It should be noted that the ids of these threads can be controlled in order to make sure that these are threads running on different physical cores, as was the case for the two systems tested in the following experiments.
For test purposes, all implementations used a buffer of 512 elements, which is a multiple of the numbers of logical processors.

Parametrized Irregular Testbed Applications
The following two applications are irregular in nature which results in various execution times per data chunk and subsequently exploits the dynamic load balancing capabilities of the tested master-slave implementations.

Parallel Adaptive Quadrature Numerical Integration
The first, compute-intensive application is numerical integration of any given function. For benchmarking, integration of f(x) = x · sin²(x²) was run over the [0, 100] range. The range was partitioned into 100,000 subranges which were regarded as data chunks in the processing scheme. Each subrange was then integrated (by a slave) using the following adaptive quadrature [26] and recursive technique for a given range [a, b] being considered:

1. if the approximation of the integral over [a, b] meets the required accuracy condition, its value is accepted;
2. otherwise, recursive partitioning into two subranges (a, (a+b)/2) and ((a+b)/2, b) is performed and the aforementioned procedure is repeated for each of these until the condition is met.
In this way, increasing the partitioning coefficient increases the accuracy of computations and consequently increases the compute-to-synchronization ratio. Furthermore, this application does not require much memory and is not memory bound.

Parallel Image Recognition
In contrast to the previous application, parallel image recognition was used as a benchmark that requires much memory and frequent memory reads. Specifically, the goal of the application is to search for at least one occurrence of a template (sized TEMPLATEXSIZE × TEMPLATEYSIZE in pixels) within an image (sized IMAGEXSIZE × IMAGEYSIZE).
In this case, the initial image is partitioned and, within each chunk, a part of the initial image of size (TEMPLATEXSIZE + BLOCKXSIZE) × (TEMPLATEYSIZE + BLOCKYSIZE) is searched for an occurrence of the template. In the actual implementation, values of IMAGEXSIZE = IMAGEYSIZE = 20,000, BLOCKXSIZE = BLOCKYSIZE = 20 and TEMPLATEXSIZE = TEMPLATEYSIZE = 500 pixels were used.
The image was initialized with every third row and every third column having pixels not matching the template. This results in earlier termination of the search for the template, depending also on the starting search location in the initial image, which results in various search times per chunk.
In the case of this application, a compute coefficient reflects how many passes over the initial image are performed. In actual use cases it might correspond to scanning slightly updated images in a series (e.g., satellite images or images of a location taken with a drone) for objects. On the other hand, it allows simulation of scenarios with various relative compute to memory access and synchronization overheads for various systems.

Testbed Environment and Methodology of Tests
Experiments were performed on two systems typical of modern desktop and workstation systems, with specifications outlined in Table 1. The range of thread counts tested depends on the implementation and varied as follows, based on preliminary tests that identified the most interesting values in terms of the most promising execution times, where npl denotes the number of logical processors: for designated-master these were npl/2, 1 + npl/2, npl and 1 + npl; for all other versions the following were tested: npl/4, npl/2, npl and 2 · npl. Thread affinity settings were imposed with the environment variables OMP_PLACES and OMP_PROC_BIND [27,28]. Specifically, the following combinations were tested independently: default (no additional affinity settings) marked with default; OMP_PROC_BIND=false, which turns off thread affinity (marked in results as noprocbind); OMP_PLACES=cores and OMP_PROC_BIND=close marked with corclose; OMP_PLACES=cores and OMP_PROC_BIND=spread marked with corspread; OMP_PLACES=threads and OMP_PROC_BIND=close marked with thrclose; and OMP_PLACES=sockets without setting OMP_PROC_BIND marked with sockets, as OMP_PROC_BIND defaults to true if OMP_PLACES is set for gcc (https://gcc.gnu.org/onlinedocs/gcc-9.3.0/libgomp/OMP_005fPROC_005fBIND.html). If OMP_PROC_BIND equals true, then the behavior is implementation defined and thus the above concrete settings were tested. In the experiments, the code was tested with compilation flags -O3 and also -O3 -march=native. Best values are reported for each configuration; an average value out of 20 runs is presented along with the corresponding standard deviation.

Results
Since all combinations of tested configurations resulted in a very large number of execution times, we present the best results as follows. For each partitioning coefficient (for numerical integration) and compute coefficient (for image recognition), and for each code implementation, the three best results with a configuration description are presented in Tables 2 and 3 for numerical integration and in Tables 4 and 5 for image recognition, along with the standard deviation computed from the results. Consequently, it is possible to identify how code versions compare to each other and how configurations affect execution times.
Additionally, for the coefficients, execution times and corresponding standard deviation values are shown for various numbers of threads. These are presented in Figures 6 and 7 for numerical integration as well as in Figures 8 and 9 for image recognition.

Figure 9. Image recognition: system 2 results for various numbers of threads.

Performance
From the performance point of view, based on the results, the following observations can be drawn and subsequently generalized:

1. For numerical integration, the best implementations are tasking and dynamic-for2 (or dynamic-for for system 1) with practically very similar results. These are very closely followed by tasking2 and dynamic-for and then by visibly slower integrated-master and designated-master.
2. For image recognition, the best implementations for system 1 are dynamic-for2/dynamic-for and integrated-master with very similar results, followed by tasking, designated-master and tasking2. For system 2, the best results are shown by dynamic-for2/dynamic-for and tasking2, followed by tasking and then by visibly slower integrated-master and designated-master.
3. For system 2, we can see benefits from overlapping for dynamic-for2 over dynamic-for for numerical integration and for both tasking2 over tasking as well as dynamic-for2 over dynamic-for for image recognition. The latter is expected as those configurations operate on considerably larger data and memory access times constitute a larger part of the total execution time, compared to integration.
4. For the compute-intensive numerical integration example, we see that the best results were generally obtained with oversubscription, i.e., for tasking* and dynamic-for* the best numbers of threads were 64 rather than 32 for system 2 and 16 rather than 8 for system 1. The former configurations apparently help mitigate idle time without the accumulated cost of memory access in the case of oversubscription.
5. In terms of thread affinity, for the two applications, the best configurations were measured for default/noprocbind for numerical integration on both systems, and for thrclose/corspread for system 1 and sockets for system 2 for smaller compute coefficients, with default for system 1 and noprocbind for system 2 for compute coefficient 8.
6. For image recognition, configurations generally show visibly larger standard deviations than for numerical integration, apparently due to the impact of memory accesses.
7. We can notice that the relative performance of the two systems is slightly different for the two applications. Taking into account the best configurations, for numerical integration system 2's times are approx. 46-48% of system 1's times, while for image recognition system 2's times are approx. 53-61% of system 1's times, depending on the partitioning and compute coefficients.
8. We can assess the gain from Hyper-Threading for the two applications and the two systems (between 4 and 8 threads for system 1 and between 16 and 32 threads for system 2) as follows: for numerical integration and system 1 it is between 24.6% and 25.3% for the coefficients tested, and for system 2 between 20.4% and 20.9%; for image recognition and system 1 it is between 10.9% and 11.3%, and similarly for system 2 between 10.4% and 11.3%.
9. We can see that the ratios of the best system 2 to system 1 times for image recognition are approx. 0.61 for coefficient 2, 0.57 for coefficient 4 and 0.53 for coefficient 8, which means that results for system 2 for this application get relatively better compared to system 1's. As outlined in Table 1, system 2 has a larger cache and, for subsequent passes, more data can reside in the cache. This behavior can also be seen when results for 8 threads are compared: for coefficients 2 and 4, system 1 gives shorter times, but for coefficient 8, system 2 is faster.
10. integrated-master is relatively better compared to the best configuration for system 1 as opposed to system 2; in this case, the master's role can be taken by any thread, running on one of the 2 CPUs.
The bottom line, taking the results into consideration, is that the preferred configurations are the tasking and dynamic-for based ones, preferring thread oversubscription (2 threads per logical processor) for the compute-intensive numerical integration and 1 thread per logical processor for the memory-demanding image recognition. In terms of affinity, default/noprocbind are to be preferred for numerical integration on both systems, and thrclose/corspread for system 1 and sockets for system 2 for smaller compute coefficients, with default for system 1 and noprocbind for system 2 for compute coefficient 8.

Ease of Programming
Apart from the performance of the proposed implementations, ease of programming can be assessed in terms of the following aspects:

1. code length: the order from the shortest to the longest version of the code is as follows: tasking, dynamic-for, tasking2, integrated-master, dynamic-for2 and designated-master;
2. the numbers of OpenMP directives and function calls. In this respect, the versions can be characterized as follows: • designated-master: 3 directives and 13 function calls; • integrated-master: 1 directive and 6 function calls; • tasking: 4 directives and 0 function calls; • tasking2: 6 directives and 0 function calls; • dynamic-for: 7 directives and 0 function calls; • dynamic-for2: 7 directives and 1 function call, which makes tasking the most elegant and compact solution;
3. controlling synchronization: from the programmer's point of view this seems more problematic than code length, specifically how many distinct points in the threads' code need to synchronize explicitly. In this respect, the easiest code to manage is tasking/tasking2, as synchronization of independently executed tasks is performed in a single thread. It is followed by integrated-master, which synchronizes with a lock in two places; dynamic-for/dynamic-for2, which require thread synchronization within #pragma omp parallel, specifically using atomics; and designated-master, which uses two locks, each in two places. This aspect potentially indicates how prone to errors each of these implementations can be for a programmer.

Conclusions and Future Work
Within the paper, we compared six different implementations of the master-slave paradigm in OpenMP and tested the relative performance of these solutions using a typical desktop system with one multi-core CPU (an Intel i7 Kaby Lake) and a workstation system with two multi-core CPUs (Intel Xeon E5 v4 Broadwell).
Tests were performed for irregular numerical integration and irregular image recognition with three different compute intensities and for various thread affinities, compiled with the popular gcc compiler. The best results were generally obtained for the OpenMP task and dynamic for based implementations, either with thread oversubscription (numerical integration) or without oversubscription (image recognition) for the aforementioned applications.
Future work includes investigation of aspects such as the impact of buffer length and false sharing on the overall performance of the model, as well as performing tests using other compilers and libraries. In particular, tests with a different compiler and OpenMP runtime, e.g., icc -openmp, would be practical and interesting for their users. Another research direction relates to consideration of potential performance-energy aspects of the implementations in the context of the CPUs used and configurations, also when using power capping, as an extension of previous works in this field [29-31]. Finally, investigation of the performance of basic OpenMP constructs for modern multi-core systems and compilers is of interest, as an extension of previous works such as [32,33].