A Theoretical Model for Global Optimization of Parallel Algorithms

Abstract: With the quickly evolving hardware landscape of high-performance computing (HPC) and its increasing specialization, the implementation of efficient software applications becomes more challenging. This challenge is especially pronounced for domain scientists and may hinder advances in large-scale simulation software. One idea to overcome these challenges is software abstraction. We present a parallel algorithm model that allows for global optimization of the synchronization and dataflow of parallel algorithms and for their optimal mapping to complex and heterogeneous architectures. The presented model strictly separates the structure of an algorithm from its executed functions. It utilizes a hierarchical decomposition of parallel design patterns as well-established building blocks for algorithmic structures and captures them in an abstract pattern tree (APT). A data-centric flow graph is constructed based on the APT, which acts as an intermediate representation for rich and automated structural transformations. We demonstrate the applicability of this model to three representative algorithms and show runtime speedups between 1.83 and 2.45 on a typical heterogeneous CPU/GPU architecture.


Introduction
Advances in science are intrinsically linked to a steady rise in computing power. Due to the three walls [1], this demand is met with increasingly parallel processors and hardware specialization. Modern clusters are thus heterogeneous and the computing nodes are often equipped with different, specialized processors. However, the size and complexity of typical legacy codes challenge the programming productivity of many domain scientists and foster their reliance on (automatic) performance portability between architectures.
The major challenge for maintaining the performance of a parallel algorithm on new architectures is the optimal utilization of parallelism on three levels: (i) instruction-level, (ii) routine (local), and (iii) algorithm (global). Optimization on all these levels is laborious and requires expert knowledge. Thus, significant efforts target the design of programming abstractions such as parallel programming models and automatic transformation techniques. There are various transformation techniques and capable production compilers for level (i) [2]. Level (ii) is mainly addressed with parallel programming models providing an abstraction over specific processor features such as OpenMP [3], CUDA, and MPI [4]. Furthermore, a sophisticated parallel programming methodology has been developed around the concept of parallel patterns [5,6] and the abstraction to local structural parallelism of algorithms. Based on these advances, recent programming models like RAJA [7], Kokkos [8], and Stateful Dataflow Multigraphs [9] enable the implementation and (partially) automatic optimization for different target architectures. These optimizations include local dataflow and control flow transformations such as data layouts and nested parallelism.
However, there is a lack of effective global transformation techniques. On the one hand, global transformations require large-scale static analyses for identifying concurrency. These analyses are combinatorially complex or even infeasible for dynamic programs. On the other hand, global transformations intertwine with the more general decision of mapping the workload to different processors. While lower-level transformations are applied for a single target architecture, global transformations need to trade off significant structural changes based on their effect on the concurrency of the algorithm and its utilization of the available hardware. For instance, the fusion of multiple routines might enable the use of a massively parallel accelerator. Alternatively, each of these routines might be executed simultaneously on dedicated processors. With increasing heterogeneity, the global transformation of parallel algorithms and their mapping decision cannot be separated.
This paper provides the theoretical model of a framework that enables global optimizations and automatic hardware mapping. It abstracts parallel algorithms in a global, structural representation called abstract pattern tree (APT). This APT captures high-level data dependencies between local parallel structures formalized in a generic parallel pattern definition. Global transformations and automatic mappings are then derived based on optimizing algorithmic efficiencies. These efficiencies are necessary performance conditions defined over the global properties of the APT.
In summary, our key contributions are as follows:
• A model of parallel algorithms is introduced, which abstracts the algorithmic structure from the executed functions. This model facilitates the analysis of global algorithmic properties while building on a flexible definition of local parallel structures.
• A new class of global transformation techniques is enabled by introducing necessary performance conditions called algorithmic efficiencies. Three main efficiencies are identified: synchronization, inter-processor dataflow, and intra-processor dataflow efficiency.
• The model's applicability is demonstrated on three typical parallel algorithms showcasing the major transformation capabilities, and their performance improvements are investigated.
The remainder of this paper is structured as follows: Related work is analyzed in Section 2. Section 3 introduces the essential algorithm representation in the form of the APT. Section 4 introduces the idea of algorithmic efficiencies and identifies three main efficiencies. The applicability of the model is investigated with three case studies of typical parallel algorithms in Section 5. Section 6 provides a discussion of the presented techniques towards optimality, applicability, and its integration into compiler frameworks. Section 7 concludes the work and provides an outlook into future work.

Related Work
There is extensive literature that addresses related problems or subproblems of this work. In the following, the most relevant literature for this work is discussed in groups based on the specific research question they target.

Abstractions for Parallel Programming
The separation of structure and function used in this work relates to the original work on design patterns introduced by Christopher Alexander et al. [10] for the architectural domain and later applied to the design of software [11,12]. Based on these works, algorithmic skeletons [13] and parallel design patterns were developed by Mattson et al. [5] and McCool et al. [6]. These building blocks can be found in most parallel programming models and provide interoperability and a common terminology with this work. Cole [13] and Darlington et al. [14] have developed algorithmic skeleton frameworks as the first approach to this end. They were later extended to a wide variety of similar approaches such as GrPPI [15], Fastflow [16], and many others [17]. These frameworks are typically designed as libraries of pre-defined patterns. While they provide high-performance implementations, they typically lack the global transformations and optimal hardware mapping targeted in this work.
Similarly, OpenMP [3], OpenACC [18], OpenCL [19], and SYCL [20] focus on local loop-level optimizations, RAJA [7] and Kokkos [8] on task-based parallelism and rule-based performance portability, Halide [21] on image and array processing, and MPI [4] on communication and SPMD optimizations. Julia [22], MATLAB [23], and similar programming languages focus on computer algebra, while Tensorflow [24], MapReduce [25], and similar programming languages mainly target data-centric and machine learning workloads. These approaches allow expressing parallel algorithms adequately but focus on specific domains or local optimization. Instead, our holistic approach aims at global optimizations for the broad domains of scientific software. We use generic parallel patterns as formal elements that provide structural and syntactical information about local concurrency. The proposed model exposes concurrency, guarantees correctness, and globally optimizes the execution for a specific hardware architecture.

Transformation Techniques
The optimization techniques developed in this work are targeted at global optimizations. For instruction-level optimizations, there is extensive literature on optimizing compilers targeted at HPC, such as Bacon et al. [2]. The framework CHiLL [26] proposes transformations to complex loop nests described by a sequence of high-level transformations. Flattening transformations as seen in NESL [27] and data-parallel Haskell [28], as well as studies by Blelloch et al. [29] and Chakravarty and Keller [30], provide means to compile nested data parallelism to flattened data-parallel code. While nested parallelism allows for high-level abstractions, its transformations can significantly impact the performance of the generated code [31]. Moreover, there exist many rule-based approaches for transforming routines such as Lift [32], Steuwer et al. [33], Rasch et al. [34]. In contrast, this work provides automatic global transformations and hardware mappings based on a static performance model.

Architectural Mapping and Code Generation
There are multiple approaches for mapping parallel algorithms to specific hardware architectures. The NP-hard MAKESPAN SCHEDULING problem on unrelated machines is a static formulation for which different approximation algorithms were proposed [35]. Beaumont et al. [36] have discussed the automatic mapping of parallelism to heterogeneous architectures for local parts of programs instead of the global approach suggested in this work. While this work focuses on static optimizations, there exist many dynamic approaches such as cluster resource management systems [37] and runtime systems that place threads and processes according to their memory affinities and communication patterns [38]. Furthermore, this work integrates the mapping decision into the global optimization to minimize the overall runtime and optimize the utilization of the given hardware architecture.
Intel's Array Building Blocks [39] dynamically generates code from a high-level specification of data-parallel patterns to target heterogeneous architectures. Similarly, Copperhead [40] and Stateful Dataflow Multigraphs [9] optimize and lower data-parallel Python code. Furthermore, modern polyhedral compilers such as Pluto [41], PetaBricks [42], PPCG [43], and Tensor Comprehensions [44,45] can expose parallelism and target multiple hardware architectures. They typically provide advanced optimizations for common problems such as loop-level parallelism.

A Theoretical Model of Parallel Algorithms
In the following, the basic theoretical model of the optimization framework is defined, and a representation of parallel algorithms based on parallel building blocks, denoted abstract pattern tree (APT), is introduced. The APT is a self-contained structure to which the global optimizations are applied, and the whole framework is therefore widely decoupled from language-specific properties.

Performance Definition and Model Assumptions
The framework's goals are global optimization and the optimal mapping of parallel algorithms onto a target architecture. Thereby, optimality is defined in terms of minimal overall execution time for given hardware. This can include multiple nodes with heterogeneous architectures. The focus on global properties allows for the following separation: A parallel algorithm consists of structural information, such as concurrency, synchronization, dependencies, and input and output data, and of functional information, which provides the actual computation. While the functional information may supply specific information, e.g., to a static performance model, the global transformations rely on structural information only. Furthermore, the model uses the following assumptions on the parallelism of the algorithm provided by the developer:
• Local Optimality: Locally, the parallel hotspots have been identified, and the potential independence of their operations is expressed optimally in the algorithmic structure. This assumption can be assured by methods for identifying concurrency, such as those introduced by Mattson et al. [5].
• Correctness: All dependencies are well-defined, and the algorithm is free of data races, deadlocks, and similar correctness issues.

Algorithmic Representation
The algorithmic structure is represented in a dataflow-centric fashion with the following structural elements:
• A data item, in short data, is produced and consumed as the result of computations during the execution of an algorithm. Thereby, a data item is immutable and does not refer to a memory location.
• An operation is a set of instructions producing and consuming data, which resembles the task definition of typical parallel programming models. This set of instructions is interpreted atomically.
• A place is a source (does not consume data) or a sink (does not produce data), i.e., external inputs and outputs of the parallel algorithm.
• A data dependency between operations occurs when one operation consumes a data item another operation produces. In the definition by A. J. Bernstein [46], this corresponds to a flow dependency (or true dependency, read-after-write, RAW).
Hence, the model represents a directed data-dependency graph based on flow dependencies. It covers data and control dependencies, while name dependencies like the anti-dependency (write-after-read) and the output dependency (write-after-write) are not interpretable in this model because of the missing relation to memory locations. This also ensures that data-dependency graphs are always acyclic in this model.
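The structural elements above can be sketched as a small data structure. The following Python sketch is illustrative only: the names `Operation` and `flow_edges` are hypothetical and not part of the presented model, and data items are assumed to be identified by plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """An atomically interpreted set of instructions (hypothetical type)."""
    name: str
    consumes: frozenset  # names of data items the operation reads
    produces: frozenset  # names of data items the operation creates


def flow_edges(operations):
    """Return flow (read-after-write) dependencies as (producer, consumer) pairs."""
    producer_of = {}
    for op in operations:
        for item in op.produces:
            # Data items are immutable, so a second producer would be an
            # output dependency, which the model does not represent.
            if item in producer_of:
                raise ValueError(f"data item {item!r} is produced twice")
            producer_of[item] = op.name
    edges = set()
    for op in operations:
        for item in op.consumes:
            # Items without a producer are assumed to come from a source place.
            if item in producer_of:
                edges.add((producer_of[item], op.name))
    return edges


ops = [
    Operation("load", frozenset({"input"}), frozenset({"a"})),
    Operation("square", frozenset({"a"}), frozenset({"b"})),
    Operation("sum", frozenset({"a", "b"}), frozenset({"c"})),
]
edges = flow_edges(ops)
```

Because every data item has exactly one producer in this sketch, only flow dependencies can be expressed, which mirrors the restriction described above.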

Local Structures: Serial and Parallel Patterns
Local structures like loops or function calls reoccur throughout parallel algorithms and are thus called patterns. Formally, each pattern is a data-dependency graph with operations and data items; the places of this graph are the results of preceding local structures. The resulting directed graph is denoted a pattern diagram (PD), $PD = (V_{PD}, E_{PD})$. PDs allow for the efficient analysis of local parallelism, which is defined as follows: An operation $o \in V_{PD}$ depends on another operation $o' \in V_{PD}$ iff there is a non-empty directed path from $o'$ to $o$. All other operations are independent (cf. happens-before relation [47]). A pattern is called a parallel pattern if there are at least two parallel operations. Otherwise, it is called a serial pattern. For example, Figure 1 shows four operations, each consuming overlapping items from an array of data in a regular access scheme. This is the structural abstraction of the four applications of a 2 × 2 kernel on a flattened 3 × 3 matrix. The structure is commonly called a stencil, which is a specific instance of the generic parallel pattern definition.
Figure 1. Pattern diagram of a stencil on a 3 × 3 matrix with a 2 × 2 filter.
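The stencil example can be reproduced with a few lines of code. The sketch below is a hypothetical illustration (the helper names `stencil_inputs` and `depends_on` are not from the paper): it derives which flattened matrix entries each of the four kernel applications consumes and checks independence through path reachability.

```python
def stencil_inputs(rows, cols, k):
    """Map each k x k kernel application to the flattened indices it consumes."""
    inputs = {}
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            inputs[(r, c)] = [
                (r + dr) * cols + (c + dc) for dr in range(k) for dc in range(k)
            ]
    return inputs


def depends_on(edges, consumer, producer):
    """True iff a non-empty directed path leads from producer to consumer."""
    stack, seen = [producer], set()
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                if dst == consumer:
                    return True
                seen.add(dst)
                stack.append(dst)
    return False


pd_inputs = stencil_inputs(3, 3, 2)
# Edges run from input data items ("d", index) to operations ("op", r, c).
pd_edges = {(("d", i), ("op",) + pos) for pos, idx in pd_inputs.items() for i in idx}
operations = [("op",) + pos for pos in pd_inputs]
independent = all(
    not depends_on(pd_edges, a, b) for a in operations for b in operations if a != b
)
```

All four operations read the center element (flattened index 4), yet none depends on another, so the pattern qualifies as a parallel pattern under the definition above.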

Global Structure: Abstract Pattern Tree
The APT is the high-level representation of an algorithmic structure providing a global perspective on a parallel algorithm. Formally, the APT is an undirected graph with the nodes being the patterns occurring in the algorithm and the edges reflecting the execution order specified by the developer. The execution starts with the topmost serial node, which calls its child nodes in sequential order from left to right. Rectangular boxes represent the serial nodes, and undirected edges represent the dependencies between patterns. The parallel nodes are shown through circular boxes and summarize the local pattern diagrams. In a typical application, the pattern nodes are instances of specific parallel patterns such as a map or a reduction. Such patterns can be understood as higher-order functions with fixed schemes of data dependencies repeated over the input data. In these cases, the well-defined regularity of the pattern allows significant compression of the structural information making a global static analysis between different local patterns in the APT feasible.
Furthermore, the APT is enriched with information regarding hardware mapping. During the optimization step, as described in the following section, the execution schedule and target hardware are determined and added as metadata to the nodes of the APT. This includes the splitting of parallel patterns into partitions to be executed by processors and the required data transfers. Additionally, the hardware description used by the optimization can be stored in the APT or provided through a separate hardware description language. This includes the abstraction of the hardware with the main performance metrics such as computational throughput, sustainable memory bandwidths and latencies, and the memory hierarchy. The final APT, after the optimization steps, then contains all information required to generate machine code.
An example of an APT is provided in Figure 2, which shows the algorithmic structure of a typical image manipulation algorithm that first computes the image gradients and then applies a 1D filter to the image. Its structure consists of a stencil with a PD similar to Figure 1 and a subsequent composition of a map and a reduction corresponding to the structure of matrix-vector multiplication.
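A tree of this kind can be sketched as follows. This is a minimal illustration rather than the authors' data structure: the `Node` class, the node names, and the exact tree shape (a map with a nested reduction for the matrix-vector part) are assumptions, since the figure is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """APT node; `kind` is 'serial' or 'parallel' (hypothetical encoding)."""
    name: str
    kind: str
    children: list = field(default_factory=list)


def execution_order(node):
    """List parallel patterns in call order: pre-order, children left to right."""
    order = [node.name] if node.kind == "parallel" else []
    for child in node.children:
        order.extend(execution_order(child))
    return order


apt = Node("main", "serial", [
    Node("gradient_stencil", "parallel"),
    Node("row_map", "parallel", [
        Node("dot_reduction", "parallel"),
    ]),
])
order = execution_order(apt)
```

Traversing the children of the topmost serial node from left to right yields the stencil first, followed by the map and its nested reduction.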

Basic Notations
Throughout this paper, a target architecture comprises a set of processors P. Graphically, an operation in a PD, o ∈ V PD , is described by a rectangular box, a place by a circular box, and a data dependency by a directed edge. Parallel operations of a PD form a STEP and are arranged on the same horizontal axis. The APT consists of rectangular boxes representing serial patterns, circular boxes representing parallel patterns, and undirected edges representing the dependencies between patterns. The children of a serial pattern are to be executed from left to right. Additionally, the STEP notation introduced on the level of PDs is also used at the scope of the whole algorithm. These global steps GSTEP 1 , . . . , GSTEP N are defined analogously and can be constructed directly from the local steps STEP n . The disjoint union of all global steps equals the set of all operations O.

Algorithmic Efficiencies
This section introduces a static performance model, enabling the derivation of global mapping decisions and transformations. The proposed performance model makes use of the concept of algorithmic efficiencies as sketched in [48]. Algorithmic efficiencies define necessary optimality conditions of performance over different global properties of algorithms. The separation into different properties allows the performance to be optimized separately for each property and reduces the complexity compared to a joint optimization. Runtime estimates in algorithmic efficiencies are thereby parameterized into cost functions, modeled with existing performance models. Furthermore, each efficiency is defined over the properties of the APT. Thus, transformations of the APT are also directly captured by the performance model.

Algorithmic Steps and Synchronization
The synchronization efficiency seeks to maximize the potential parallelism before mapping the operations to an architecture. On the global algorithmic level, this potential is mainly limited by unidentified parallelism such as false linearization of independent parallel patterns. Linearization of parallel patterns is thereby defined as the sequential order of two patterns due to data dependencies.
Although asynchronous techniques on the instruction level may hide such linearization to some extent, linearization on the global level still reduces the potential parallelism during optimization. Therefore, maximizing this potential by resolving false linearization is a prerequisite before deriving the actual mapping and applying transformations. Formally, this problem is described by the global steps, $\mathrm{GSTEP}_1, \dots, \mathrm{GSTEP}_N$, defined by the developer and the goal of pulling patterns into earlier steps so that the overall number of steps is reduced and the width of the steps is maximized. Utilizing static data dependency analysis [46,49], false data dependencies might be identified and the number of steps reduced:
Definition 1 (Synchronization Efficiency). An algorithm is synchronization efficient if it has a minimal number of global steps.
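Pulling patterns into earlier steps can be illustrated with a simple layering procedure. The sketch below is an assumption-laden illustration, not the paper's algorithm: it presumes that the true dependencies have already been identified by a static analysis and that the developer-specified order respects them; the function name `global_steps` is hypothetical.

```python
def global_steps(patterns, true_deps):
    """Place every pattern in the earliest step allowed by its true dependencies.

    patterns: pattern names in developer-specified order.
    true_deps: dict mapping a pattern to the set of patterns it depends on.
    """
    step_of = {}
    for p in patterns:  # the developer order is assumed to respect all dependencies
        step_of[p] = 1 + max((step_of[d] for d in true_deps.get(p, ())), default=0)
    n_steps = max(step_of.values(), default=0)
    return [[p for p in patterns if step_of[p] == s] for s in range(1, n_steps + 1)]


# Four patterns written one after another by the developer (four steps),
# although B does not depend on A.
steps = global_steps(
    ["A", "B", "C", "D"],
    {"C": {"A"}, "D": {"B", "C"}},
)
```

In this example the falsely linearized pattern B moves into the first step next to A, which reduces the number of global steps from four to three. The resulting step count equals the length of the longest dependency chain, which is the minimum any layering can achieve.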

Inter-Processor Dataflow
The inter-processor dataflow efficiency guides the mapping of operations to processors. In this context, the abstract term processor refers to any homogeneous group of cores sharing the same processor-local cache. Furthermore, the space of mappings is restricted to functions, i.e., each operation must be executed on a single processor.
Without loss of generality, a mapping $M : O \to \mathcal{P}$ from operations to the set of processors $\mathcal{P}$ can be decomposed into a sequence of step-wise mappings $M_1^T = (M_1, \dots, M_T)$, where each step-mapping $M_t : \mathrm{GSTEP}_t \to \mathcal{P}$ is a function on the subset of operations of the global step. To compare mappings, the efficiency assigns any mapping a cost as follows: The execution costs $E_t : \mathcal{P} \times 2^{\mathrm{GSTEP}_t} \to \mathbb{R}$ define the costs for executing a set of operations on a processor. Hence, the execution costs are step-local. Furthermore, the network costs $N_t : \mathcal{P} \times 2^{\mathrm{GSTEP}_t} \to \mathbb{R}$ account for the costs of communicating the data between two operations. Those costs may depend on previous steps and mappings, which is indicated by the semicolon notation in the function $N_t(\cdot, \cdot; M_1^{t-1})$. The total costs of a mapping are then the sum, over all steps $t = 1, \dots, T$, of the maximal costs of the operations $M_t^{-1}(P)$ assigned to a processor $P$, leading to the following efficiency:

Definition 2 (Inter-Processor Dataflow Efficiency). A mapping $M_1^T$ is inter-processor dataflow efficient if it has minimal total costs for the execution and network.
Conceptually, this efficiency implies a multi-step scheduling problem, where processors are distinguishable in the execution of operations and data access times. This distinction of two dueling properties is an essential aspect of effective mapping and transformation decisions: Minimizing the execution costs is typically achieved by spreading the computation across processors. Simultaneously, the network costs and their full context dependence require minimizing the dataflow between different processors to keep the need for communication minimal.
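The total costs described in prose before Definition 2 can be written compactly. The following is a reconstruction from that description under the assumption that the execution and network costs of a processor are simply added within a step; the symbol $C$ is introduced here for illustration only.

```latex
C\left(M_1^T\right) = \sum_{t=1}^{T} \max_{P \in \mathcal{P}}
  \Big[ E_t\big(P, M_t^{-1}(P)\big) + N_t\big(P, M_t^{-1}(P);\, M_1^{t-1}\big) \Big]
```

Under this reading, the slowest processor determines the cost of each step, and a mapping is inter-processor dataflow efficient if it minimizes $C$ over all admissible step-wise mappings.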

Cost Modeling
The costs of the above efficiency are introduced in a modular manner. Thus, different performance models can be used to refine the exact costs with respect to the target architectures and the complexity of the resulting optimization problem. For instance, the modeling may be similar to the roofline model [50]:
• Execution costs: The execution of operations is captured by the number of floating-point operations (FLOPs) divided by the peak performance $\pi_P$ (clock frequency times FLOPs per cycle).
• Network costs: The network costs are defined as the slowest data transfer between two processors. A data transfer thereby bundles all bytes to be transferred from one processor to another to satisfy the data dependencies. The bandwidth $\beta_s(P', P)$ is determined by the slowest interconnect between these two processors, and a latency penalty $\Gamma_s(P', P)$ is added.
Furthermore, the roofline model assumes the execution and network costs to overlap entirely. In general, the degree of overlap may, however, be controlled by a set of hyperparameters $\kappa_{P,t}$.
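A roofline-like cost function of this kind might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, the hardware numbers are made up for illustration, and the way `kappa` interpolates between full overlap and no overlap is one possible choice, not necessarily the paper's formula.

```python
def execution_cost(flops, peak_flops_per_s):
    """Roofline-style execution cost: work divided by peak performance."""
    return flops / peak_flops_per_s


def network_cost(transfers):
    """Slowest incoming transfer; each entry is (bytes, bytes_per_s, latency_s)."""
    return max((b / bw + lat for b, bw, lat in transfers), default=0.0)


def step_cost(exec_s, net_s, kappa):
    """Blend full overlap (kappa = 0) and no overlap (kappa = 1).

    The linear interpolation is an assumption made for illustration.
    """
    return max(exec_s, net_s) + kappa * min(exec_s, net_s)


# Illustrative numbers only: 2 GFLOP on a 1 TFLOP/s processor and two
# incoming transfers over a 16 GB/s link with 10 microseconds latency.
e = execution_cost(flops=2.0e9, peak_flops_per_s=1.0e12)
n = network_cost([(8.0e6, 1.6e10, 1.0e-5), (4.0e6, 1.6e10, 1.0e-5)])
overlapped = step_cost(e, n, kappa=0.0)
serialized = step_cost(e, n, kappa=1.0)
```

With `kappa = 0` the step cost reduces to the larger of the two contributions, as in the roofline model; with `kappa = 1` both contributions add up.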

Intra-Processor Dataflow
Previous efficiencies result in a global mapping of operations to the different processors. To this end, the notion of steps and linearization is a convenient simplification on the global level, where workloads are assumed to be significant. However, the simplification becomes inadequate for analyzing the execution of operations on the cores of a processor. For instance, multiple operations may be executed simultaneously through vectorization. Furthermore, operations of different steps may overlap due to asynchronous techniques in hardware such as prefetching or hyperthreading.
The intra-processor dataflow efficiency therefore seeks to optimize the execution of operations on a processor's cores targeting the execution units and core-local caches. In principle, this assignment may be solved as part of the previous efficiency. However, the cores of a processor are assumed to be homogeneous, and the resulting scheduling problem may therefore be adequately solved with simpler heuristics such as static scheduling based on loop indices. Furthermore, the efficiency involves other instruction-level optimizations, such as improving the overlap of operations through asynchronous techniques. Because of the global scope of this paper, this optimization is delegated to downstream compilers relying on a comprehensive toolset of best practices; see the related work for approaches such as RAJA and Kokkos.
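Static scheduling based on loop indices can be sketched in a few lines. The helper name `static_chunks` is hypothetical; the sketch assigns contiguous, near-equal index ranges to the homogeneous cores of one processor, similar in spirit to a static loop schedule.

```python
def static_chunks(n_iterations, n_cores):
    """Split range(n_iterations) into contiguous, near-equal chunks, one per core."""
    base, extra = divmod(n_iterations, n_cores)
    chunks, start = [], 0
    for core in range(n_cores):
        # The first `extra` cores take one additional iteration.
        size = base + (1 if core < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = static_chunks(10, 4)
```

Because the cores are assumed to be homogeneous, equal chunk sizes are a reasonable default, and no cost model is needed at this level.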

Evaluation
The theoretical model formulates the automatic mapping and global transformation problem as an optimization over costs. The following evaluation's purpose is to assess whether this provides a suitable basis for a class of optimization and mapping algorithms targeting heterogeneous architectures. In detail, it must be shown that performance-critical transformation and mapping decisions, as typically applied by a performance engineer, can be reproduced as the result of cost minimization on practical problems. This work focuses on algorithmic changes, accelerator offloading, and distributed computing. The evaluation is based on benchmarks representing typical parallel algorithms and two representative heterogeneous CPU-GPU nodes found in modern clusters.

Experimental Setup
Due to the combinatorial complexity of possible mappings, the evaluation focuses on comparing two mapping hypotheses: A baseline version and an optimized version are identified for each benchmark. The baseline version closely follows the numerical definition of the algorithm, whereas the optimized version comprises the typical performance-critical transformation and mapping decisions. Based on these transformations and the cost definition sketched in Section 4.2, it is investigated whether the optimized version is also preferred in the cost-based framework. Furthermore, both hypotheses are manually implemented in C, and the runtimes of both versions are compared to their costs. An algorithmic setup would differ in the space of mappings, i.e., an algorithm can search through a more expansive space of mappings with finer distinctions. The exemplary cost function may not be detailed enough to reproduce the quality of the optima in these cases, and a cost function that captures more architecture-specific properties would be required.
The evaluation comprises the following benchmarks:

Results
The experiments and results for each benchmark are explained in the following sections. Table 1 provides a brief overview of costs, runtimes, and transformations.

Jacobi Algorithm
The benchmark consists of two linear equation systems $Ax = b$ and $Ax' = b'$ sharing the same matrix $A$. Both systems are to be solved iteratively with the Jacobi algorithm. Each iteration is a map over the rows of the matrix, yielding an APT consisting of a sequence of maps. The setup exhibits two high-level properties to be considered by an optimization framework:
1. The two Jacobi applications are independent and could be fused into a single sequence of Jacobi iterations.
2. In each iteration, the corresponding rows between both equation systems share the same data from matrix A.
Hypotheses: The baseline hypothesis does not fuse the Jacobi applications and solves the systems one after another. Each Jacobi iteration is split into an upper half and a lower half of equations. These halves are always assigned to the two 24-core CPUs of a single node. The optimized version fuses the two Jacobi applications and executes the whole workload on the same two 24-core CPUs of a single node. It thereby assigns the corresponding halves of both equation systems of a single iteration to the same CPU according to the shared data of matrix A. The resulting costs are reported in Table 1, which shows that the baseline costs are 0.410 s and the optimized costs are 0.288 s. Furthermore, the implementations of both versions yield median runtimes of 0.986 s and 0.539 s, respectively.
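The fusion of the two Jacobi applications can be illustrated on a toy system. The sketch below is not the benchmark implementation (which is written in C and partitioned across CPUs); it only shows, in plain Python with hypothetical names and made-up data, how a fused sweep reads each row of the shared matrix once for both right-hand sides.

```python
def jacobi_fused(A, b1, b2, iterations):
    """Run Jacobi on A x1 = b1 and A x2 = b2, reading each row of A once per sweep."""
    n = len(A)
    x1, x2 = [0.0] * n, [0.0] * n
    for _ in range(iterations):
        new1, new2 = [0.0] * n, [0.0] * n
        for i in range(n):  # map over rows; rows are independent within a sweep
            row = A[i]
            s1 = s2 = 0.0
            for j in range(n):
                if j != i:
                    s1 += row[j] * x1[j]
                    s2 += row[j] * x2[j]
            new1[i] = (b1[i] - s1) / row[i]
            new2[i] = (b2[i] - s2) / row[i]
        x1, x2 = new1, new2
    return x1, x2


# Small diagonally dominant example, chosen so that Jacobi converges.
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
x1, x2 = jacobi_fused(A, [5.0, 6.0, 5.0], [4.0, 2.0, 4.0], iterations=50)
```

For this example the iterates approach the solutions (1, 1, 1) and (1, 0, 1). Assigning the same row range of both systems to one CPU keeps the shared rows of A local to that CPU.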

k-Means Algorithm
The benchmark consists of a clustering task on a dataset with the k-means algorithm. In detail, $10^7$ two-dimensional points of a synthetic dataset need to be assigned to k = 128 clusters, where the result is obtained after 100 iterations. Each iteration of the k-means algorithm comprises two stages, an assignment and an update step. Both steps contain map patterns on large, dense data defining a massive, data-parallel task. Therefore, a typical manual optimization is to offload the whole computation to a suitable accelerator such as the GPU in the considered setup.
Hypotheses: The baseline version executes the whole computation on the CPUs of a single node with 48 cores. The optimized version offloads the computation to the GPU of a single node considering the massive data parallelism of the application. The resulting costs are reported in Table 1: 15.543 s for the baseline and 5.052 s for the optimized version. The measured runtimes are 9.594 s and 3.921 s, respectively.
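The two stages of a k-means iteration can be sketched as follows. This is a plain-Python illustration with hypothetical names and toy data, not the benchmark code; it only makes the assignment map and the update step explicit.

```python
def kmeans_iteration(points, centroids):
    """One k-means iteration on 2D points: assignment map, then centroid update."""
    # Assignment step: independent per point (a map pattern).
    labels = [
        min(
            range(len(centroids)),
            key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2,
        )
        for p in points
    ]
    # Update step: accumulate per-cluster sums and counts, then divide.
    sums = [[0.0, 0.0, 0] for _ in centroids]
    for p, k in zip(points, labels):
        sums[k][0] += p[0]
        sums[k][1] += p[1]
        sums[k][2] += 1
    new_centroids = [
        (sx / cnt, sy / cnt) if cnt else centroids[k]  # keep empty clusters unchanged
        for k, (sx, sy, cnt) in enumerate(sums)
    ]
    return labels, new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels, centroids = kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)])
```

The per-point independence of the assignment step is what makes the workload attractive for a massively parallel accelerator.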

Monte Carlo Pi
The benchmark approximates π by accumulating the area of a unit circle obtained from $10^9$ random draws. The estimation is repeated 96 times, and the final result is obtained by averaging. The benchmark defines a typical, compute-bound Monte Carlo method, where each estimation is embarrassingly parallel. Therefore, a typical manual mapping decision distributes the independent estimations over multiple nodes as the communication overhead is minimal.
Hypotheses: The baseline version executes the whole computation on the CPUs of a single node with 48 cores. The optimized version, which represents the optimized mapping decision, executes the same computation on the four CPUs of both nodes. The resulting costs and median runtimes are shown in Table 1. The costs are 12.381 s and 6.190 s, respectively, while the runtimes are 43.449 s and 22.238 s.
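The structure of the benchmark can be sketched at a much smaller scale. The following Python illustration uses hypothetical function names, far fewer draws than the benchmark, and one independently seeded generator per estimation; the seeding scheme is an assumption made so that the estimations are independent and reproducible.

```python
import random


def estimate_pi(n_draws, rng):
    """Estimate pi from the fraction of uniform draws inside the quarter unit circle."""
    hits = 0
    for _ in range(n_draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / n_draws


def averaged_pi(n_estimations, n_draws, seed=0):
    """Average independent estimations; each one could run on a different node."""
    estimates = [
        estimate_pi(n_draws, random.Random(seed + i)) for i in range(n_estimations)
    ]
    return sum(estimates) / len(estimates)


pi_hat = averaged_pi(n_estimations=8, n_draws=20_000)
```

Only the final averaging requires communication, a single number per estimation, which is why distributing the estimations over nodes adds little overhead.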

Discussion
This paper provides a theoretical model for global optimizations of parallel algorithms and their mapping to specific target architectures. The optimization techniques are derived from algorithmic efficiencies, which state generic criteria for performance.

Analysis of the Results
The model's capabilities to automatically optimize parallel algorithms through algorithmic changes, accelerator offloading, and distributed computing were shown. The three real-world use cases were optimized successfully, and a significant speedup in the estimated runtime costs between 1.42 and 3.08 was achieved. Similarly, the measured runtimes show speedups between 1.83 and 2.45. Hence, complex transformations utilizing the specific hardware characteristics were automatically applied, significantly increasing the developer's productivity and aiding performance portability.

Performance Model
The costs for the k-means algorithm are overestimated by a factor of 1.62 for the baseline and 1.29 for the optimized version. The other two benchmarks show underestimations between 0.28 and 0.53, where the Monte Carlo Pi example shows the highest discrepancies. However, the actual mapping and transformation decisions are guided by the relative cost improvement of the optimized version over the baseline version, which is consistent with runtime improvements on all benchmarks.
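The quoted factors follow directly from the costs and runtimes reported in the results. The short script below recomputes them from those numbers; the dictionary layout and key names are chosen here for illustration.

```python
# (cost, runtime) in seconds for the baseline and optimized versions, as reported.
results = {
    "Jacobi": {"baseline": (0.410, 0.986), "optimized": (0.288, 0.539)},
    "k-means": {"baseline": (15.543, 9.594), "optimized": (5.052, 3.921)},
    "Monte Carlo Pi": {"baseline": (12.381, 43.449), "optimized": (6.190, 22.238)},
}

summary = {}
for name, r in results.items():
    (c_base, t_base), (c_opt, t_opt) = r["baseline"], r["optimized"]
    summary[name] = {
        "cost_speedup": c_base / c_opt,          # predicted improvement
        "runtime_speedup": t_base / t_opt,       # measured improvement
        "cost_over_runtime_baseline": c_base / t_base,
        "cost_over_runtime_optimized": c_opt / t_opt,
    }
```

In all three benchmarks both speedups exceed one, so the cost model prefers the same version as the measurement, even where the absolute costs deviate from the runtimes.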
In general, the costs of the static performance model must be interpreted as a loss/utility rather than a direct runtime estimate. As considered in the evaluation, coarse-grained cost modeling leads to sufficient mapping and transformation decisions if the investigated set of mappings is sparse and each hypothesis differs significantly from others. On the global level, this assumption is valid in cases where the processors of heterogeneous architectures are distinctly different. However, a predictive performance model would improve the resulting transformations' interpretability and allow for finer mapping and transformation decisions. Extensions to the roofline model to include further hardware characteristics, including accelerator-specifics or a more detailed performance model such as the execution-cache-memory performance model [51], could be utilized.

Optimizations
While the model allows for rich optimizations of parallel algorithms, it is limited in finding completely different algorithms for a specific problem. This could be enabled by adding a library of equivalent structures used as replacements for user-provided parts of the algorithms. Furthermore, the model relies on static information and, thus, dynamic information cannot be included in the algorithm's specification. This limitation could be mitigated by combining the model with a dynamic approach such as autotuning techniques [52][53][54] to incorporate runtime information into the optimization process. Similarly, monitoring data could be collected and fed back to the optimization step for commonly executed programs.
Moreover, the optimality of the optimization algorithms highly depends on the choice of initial hypotheses and their pruning. Well-tested heuristics will be required to effectively handle the tradeoff between optimality and compile times. Finally, the analysis of the quality of the optimizations is challenging due to the lack of a predictive model, as discussed above. The combination with detailed performance models could enable estimations of lower bounds of necessary synchronization and data movements. Furthermore, additional constraints could be included to cater for energy efficiency and power capping, cost-effective usage of the available resources by minimizing the total cost of ownership of the cluster, and multi-job scheduling problems, such as specific times by which one needs the respective simulation results. Weighting factors could handle the tradeoff between multiple target metrics.
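One possible reading of such weighting factors is a weighted sum over normalized metrics. The following is a minimal sketch under that assumption; the metric names, the candidate mappings, and all numbers are hypothetical, and the paper does not prescribe this particular scalarization.

```python
from typing import Dict

Metrics = Dict[str, float]


def normalize(candidates: Dict[str, Metrics]) -> Dict[str, Metrics]:
    """Scale each metric by its largest value so that units become comparable.

    Assumes every candidate reports the same metrics and that each
    metric has a nonzero maximum.
    """
    metric_names = next(iter(candidates.values())).keys()
    worst = {m: max(c[m] for c in candidates.values()) for m in metric_names}
    return {
        name: {m: c[m] / worst[m] for m in metric_names}
        for name, c in candidates.items()
    }


def select(candidates: Dict[str, Metrics], weights: Metrics) -> str:
    """Pick the candidate with the smallest weighted sum of normalized metrics."""
    scaled = normalize(candidates)
    scores = {
        name: sum(weights[m] * values[m] for m in weights)
        for name, values in scaled.items()
    }
    return min(scores, key=scores.get)


# Hypothetical mappings with made-up runtime and energy figures.
candidates = {
    "cpu_only": {"runtime_s": 10.0, "energy_j": 800.0},
    "gpu_offload": {"runtime_s": 4.0, "energy_j": 1200.0},
}

runtime_focused = select(candidates, {"runtime_s": 0.7, "energy_j": 0.3})
energy_focused = select(candidates, {"runtime_s": 0.1, "energy_j": 0.9})
print(runtime_focused, energy_focused)
```

With these illustrative numbers, a runtime-heavy weighting favors the offloaded mapping, while an energy-heavy weighting favors the CPU-only mapping. Hard limits such as power caps would more naturally be expressed as constraints that filter candidates before scoring.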

Integration into Compiler Frameworks
Three main stages are required to implement the proposed global optimization framework: In the frontend, the hierarchy of patterns needs to be extracted from existing code bases. To this end, pattern recognition tools such as AutoPar [55], Pluto [41,56] or DiscoPoP [57] could be leveraged. In the middleware, the exposed structure is optimized and mapped to target hardware. In the backend, the optimized structure is lowered to a representation that enables loop-level and ILP optimizations. This could be integrated as a pre-processing step into existing compiler frameworks via an intermediate representation (IR). Alternatively, one could implement the framework in a source-to-source fashion, where optimized and target-specific code is generated in an existing programming language. The backend could utilize existing projects such as LLVM [58] and libraries such as Boost [59] to reduce the required implementation efforts. A typical production compiler could then further lower the code to machine code, which can be dynamically optimized with autotuning techniques [52][53][54].
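The three stages can be sketched as a toy pipeline. Everything in the following is a hypothetical illustration: the `Pattern` class is a stand-in for a node of the pattern hierarchy, the frontend returns a hand-written tree instead of running a pattern recognition tool, the middleware only picks the cheapest device per pattern from a made-up cost table, and the backend prints an annotated outline instead of generating code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pattern:
    kind: str                      # e.g. "serial", "map", "reduction"
    name: str
    children: List["Pattern"] = field(default_factory=list)
    device: Optional[str] = None   # filled in by the middleware


def frontend() -> Pattern:
    """Stand-in for pattern extraction; the tree is written by hand here."""
    return Pattern("serial", "main", [
        Pattern("map", "assign_clusters"),
        Pattern("reduction", "update_centroids"),
    ])


def middleware(root: Pattern, costs: Dict[str, Dict[str, float]]) -> Pattern:
    """Annotate each pattern that has a cost entry with its cheapest device."""
    for child in root.children:
        middleware(child, costs)
    if root.name in costs:
        root.device = min(costs[root.name], key=costs[root.name].get)
    return root


def backend(root: Pattern, depth: int = 0) -> List[str]:
    """Emit an indented outline as a placeholder for real code generation."""
    suffix = f" @ {root.device}" if root.device else ""
    lines = ["  " * depth + f"{root.kind} {root.name}{suffix}"]
    for child in root.children:
        lines += backend(child, depth + 1)
    return lines


# Made-up per-device costs for the two leaf patterns.
costs = {
    "assign_clusters": {"cpu": 4.0, "gpu": 1.0},
    "update_centroids": {"cpu": 0.5, "gpu": 0.8},
}

lines = backend(middleware(frontend(), costs))
print("\n".join(lines))
```

The sketch only shows how the stages hand a shared structure to each other. A real middleware would additionally apply structural transformations and account for data movement between devices before the backend lowers the result.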

Conclusions
We present a systematic approach to optimize parallel algorithms globally and efficiently map them to heterogeneous architectures. The approach leverages an abstract representation of parallel algorithms via a hierarchical decomposition of parallel patterns. This representation allows for global transformations to optimize their synchronization and dataflow efficiencies, including pattern matching, siblings fusion, pipelining, cache blocking, and reordering.
We have demonstrated that the proposed model can identify parallelism and restructure real-world parallel algorithms. To this end, dataflow optimizations, including pipelining and cache-blocking, were applied automatically. Furthermore, the model proposed the optimal hardware mapping for these algorithms and provided speedups between 1.83 and 2.45.
The next steps are the design and implementation of the global transformations in a compiler framework. This includes the representation of the APT, the optimization algorithms, and a code generator. Furthermore, a more detailed performance model could improve the cost estimation, and heuristics and libraries could aid in substituting algorithms with more advanced candidates.

Conflicts of Interest:
The authors declare no conflict of interest.