GRASP and Iterated Local Search-Based Cellular Processing algorithm for Precedence-Constraint Task List Scheduling on Heterogeneous Systems

High-Performance Computing systems rely on software that can be highly parallelized into individual computing tasks. However, even with a high parallelization level, poor scheduling can lead to long runtimes; this scheduling is in itself an NP-hard problem. Therefore, we are interested in heuristic approaches, particularly Cellular Processing Algorithms (CPAs), a novel metaheuristic framework for optimization. This framework is founded on exploring the search space with multiple Processing Cells that communicate to exploit the search, and on an individual stagnation-detection mechanism in each Processing Cell. In this paper, we propose using a Greedy Randomized Adaptive Search Procedure (GRASP) to look for promising task execution orders; a CPA formed with Iterated Local Search (ILS) Processing Cells is then used for the optimization. We assess our approach against a high-performance, state-of-the-art ILS. Experimental results show that the CPA outperforms the previous ILS on real applications and synthetic instances.


Introduction
According to the website www.top500.org, the supercomputer Fugaku from the Fujitsu RIKEN Center for Computational Science in Japan consists of 7,299,072 processing units, which gives it outstanding parallel computing power. However, without accurate and efficient scheduling methods, parallel programs can be very computationally inefficient. In this paper, we approach the precedence-constraint task scheduling of parallel programs on systems formed by heterogeneous processing units to minimize the final computing time [1]. Scheduling is a well-known NP-hard optimization problem [2], as is task scheduling for parallel systems [3]. Therefore, different scheduling approaches have been developed for this problem: heuristics [4][5][6][7], local searches [8][9][10], and metaheuristics [11][12][13][14][15]. Unfortunately, the works in the state-of-the-art differ widely in their objective definitions and use different sets of instances. From the relevant works in the state-of-the-art, we highlight [11]. To our knowledge, it is the only work that compares the obtained results against optimal values.

Instance of the Problem
An instance of the problem is made up of two parts: a Directed Acyclic Graph (DAG) and the computation cost of the tasks on every machine. We represent the set of tasks of a parallel computing program and their precedences as a DAG. Therefore, the parallel program is represented as the graph G = (T, C), where T is the set of tasks (vertices) and C is the set of communication costs between tasks (edges) (see Figure 1). The complete instance of the problem is formed by G and the computational costs P i,j of each task t i on every machine m j (see Table 1). Any task t i cannot be initiated until its precedent tasks t j ∈ T | (t j , t i ) ∈ C finalize their executions and communications (C j,i ). However, when a pair of tasks is scheduled on the same machine, the communication cost C j,i between them is taken as zero.

Objective Function
In this work, we follow the approach of list scheduling algorithms. List scheduling algorithms are a family of heuristics in which tasks are ordered according to a particular priority criterion. The task execution order is equivalent to a topological order of the DAG G = (T, C), so it does not violate the precedence constraints. Table 2 shows an example of a feasible order for the task graph from Figure 1: t 0 , t 4 , t 3 , t 1 , t 2 , t 5 , t 6 , t 7 . Although a task execution order is not indispensable for the scheduling, it simplifies the objective function computation [1], because it is not necessary to compute different combinations of the tasks' starting and finish times to compute the minimum makespan [3,15,17]. However, this approach has the deficiency that the optimal value may not be reachable from every task execution order. Algorithm 1 details the computation of the makespan objective function, using the computation times P i,j from Table 1 and a time counter (Time j ) for each machine to keep track of the last executed task on each machine.
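The paper works only with feasible (topological) execution orders. As an illustrative sketch, not part of the original algorithms, a candidate order can be checked against the precedence edges as follows (the function name and the small edge set are ours, not from the paper):

```python
def is_feasible_order(order, edges):
    """Check that `order` respects every precedence (u, v) in `edges`:
    task u must appear before task v in the execution order."""
    position = {task: i for i, task in enumerate(order)}
    return all(position[u] < position[v] for (u, v) in edges)

# Precedences of a small hypothetical DAG, for illustration only
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
ok = is_feasible_order([0, 1, 2, 3], edges)      # a valid topological order
bad = is_feasible_order([0, 3, 1, 2], edges)     # task 3 before its predecessors
```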

Algorithm 1 Makespan objective function
Input: G = (T, C), computational costs P t i ,j , and an execution order of the tasks O = {o 1 , . . . , o |T| }. Output: makespan
1: Time j ← 0, ∀m j ∈ M
2: for x = 1 to |O| do
3: t current ← o x , and j ← the index of the machine m j assigned to t current
4: ts current ← Time j
5: for all t p ∈ T | (t p , t current ) ∈ C do
6: ready ← t f p + C p,current
7: if ts current < ready then
8: ts current ← ready
9: end if
10: end for
11: t f current ← ts current + P current,j
12: if Time j < t f current then
13: Time j ← t f current
14: end if
15: end for
16: return makespan ← max{t f i } ∀t i ∈ T
The makespan objective function uses the auxiliary variables ts i (the starting time of task i), t f i (the finishing time of task i), and the communication cost C i,j , which is zero when both tasks are executed on the same machine. The algorithm processes the tasks from the first to the last of the feasible execution order. Finally, the parallel program makespan (computation time) is the difference between the start of the first task and the end of the last task. The complexity of Algorithm 1 is O(|T| · |C|), although, in practice, it is remarkably lower than that, because not all the edges in G are connected to every node.
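The computation described by Algorithm 1 can be sketched in Python as follows; this is a minimal interpretation under our own data-structure choices (dictionaries for P, preds, and C are assumptions, not the paper's notation):

```python
def makespan(order, assign, P, preds, C, num_machines):
    """Compute the makespan of a schedule.
    order       : feasible execution order of task ids
    assign      : assign[t] = machine of task t
    P           : P[t][m] = computation cost of task t on machine m
    preds       : preds[t] = list of predecessor tasks of t
    C           : C[(u, v)] = communication cost of edge (u, v)
    """
    machine_time = [0.0] * num_machines           # the Time_j counters
    finish = {}                                   # tf_i per task
    for t in order:
        m = assign[t]
        # Earliest start: all predecessors done, paying communication
        # only when a predecessor ran on a different machine.
        ready = max(
            (finish[p] + (0 if assign[p] == m else C[(p, t)]) for p in preds[t]),
            default=0.0,
        )
        start = max(machine_time[m], ready)
        finish[t] = start + P[t][m]
        machine_time[m] = finish[t]
    return max(finish.values())

# Toy instance (ours, not from the paper): two tasks, edge (0, 1) with cost 5
P = {0: [2, 3], 1: [4, 1]}
preds = {0: [], 1: [0]}
C = {(0, 1): 5}
same = makespan([0, 1], {0: 0, 1: 0}, P, preds, C, 2)   # both tasks on m0
split = makespan([0, 1], {0: 0, 1: 1}, P, preds, C, 2)  # pays the communication
```

Splitting the two tasks across machines is worse here because the communication cost dominates the faster computation on m1.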

Algorithms Descriptions
This section introduces the generic metaheuristic frameworks (Sections 3.1 and 3.2), as well as a high-performance algorithm in the state-of-the-art (Section 3.3). Finally, our proposed algorithm is detailed in Section 3.4.

Iterated Local Search (ILS)
The ILS is a multi-start metaheuristic search based on local improvements (LocalSearch) and solution alterations (Perturbation) [31], see Algorithm 2. This algorithm starts by initializing the current solution s with a random solution, which is also assigned to the best solution s best (see lines 1 and 2). The main loop of the algorithm iterates over the solution s, applying a perturbation followed by a Local Search [32]; if the ILS detects a new best-known solution, then s best is updated (see line 7). The above process continues until the stopping criterion is reached, usually a maximum Central Processing Unit (CPU) time or a fixed number of objective function evaluations.

Greedy Randomized Adaptive Search Procedure (GRASP)
GRASP is a multi-start metaheuristic algorithm that builds a solution by selecting one promising fragment of the solution at a time [33], see Algorithm 3. The inner loop of the GRASP, in line 5, builds a Solution by adding random individual elements from a Restricted Candidate List (RCL) (see line 11). In order to build the RCL, the algorithm evaluates the increase of the partial objective and stores its maximum and minimum values (see lines 7 and 8) to set a limit for the original candidate list (CL) (see line 9), thus creating the RCL (see line 10); this process occurs at every step of the construction.

Algorithm 3 GRASP
Input: α and a stopping criterion. Output: s best
1: f (s best ) ← ∞
2: while the stopping criterion is not reached do
3: Solution ← ∅
4: i ← 1
5: while Solution is not complete do
6: CL ← SelectFeasibleElements()
7: f min ← minimum partial objective increment over CL
8: f max ← maximum partial objective increment over CL
9: limit ← f min + α( f max − f min )
10: RCL ← BuildRCL(CL, limit)
11: Solution i ← t r | t r ∈ RCL, i ← i + 1 (Add to Solution a random element from the RCL)
12: end while
13: Evaluate f (Solution)
14: Solution ← LocalSearch(Solution) (The Local Search procedure is optional)
15: if f (Solution) < f (s best ) then
16: s best ← Solution
17: end if
18: end while

The RCL only includes candidate tasks whose incremental costs are bounded by f min + α( f max − f min ) in line 9, where f max and f min are the maximum and minimum incremental costs of the objective function over all the candidate elements t i ∈ CL, calculated with a modification of Algorithm 1 named PartialObjectiveEvaluation, which evaluates up to the last element in the partial solution. Additionally, α ∈ [0, 1] defines the greediness level of the algorithm, where α = 0 defines a completely greedy search and α = 1 defines a completely random search. The candidate list (CL) must be created or updated at every iteration of the inner loop (see line 6). Once the Solution is constructed, an optional LocalSearch procedure can be used to improve the current Solution (see line 14). Furthermore, s best is updated every time a complete Solution outperforms the objective value of s best (see line 16). Finally, the outer loop in line 2 restarts the Solution and iterates until the algorithm reaches the stopping criterion.
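The GRASP construction with the α-bounded RCL can be sketched as follows; the incremental-cost proxy cost[t] and the helper names are our assumptions, standing in for PartialObjectiveEvaluation:

```python
import random

def grasp_construction(tasks, preds, cost, alpha, rng=random):
    """Build a feasible task order one element at a time.
    cost[t] : hypothetical incremental-cost proxy of scheduling t next.
    The RCL keeps candidates with cost <= f_min + alpha * (f_max - f_min)."""
    order, scheduled = [], set()
    while len(order) < len(tasks):
        # CL: unscheduled tasks whose predecessors are all scheduled
        cl = [t for t in tasks if t not in scheduled
              and all(p in scheduled for p in preds[t])]
        f_min = min(cost[t] for t in cl)
        f_max = max(cost[t] for t in cl)
        limit = f_min + alpha * (f_max - f_min)
        rcl = [t for t in cl if cost[t] <= limit]
        choice = rng.choice(rcl)            # random element from the RCL
        order.append(choice)
        scheduled.add(choice)
    return order

# Hypothetical 3-task DAG: task 2 depends on task 0
preds = {0: [], 1: [], 2: [0]}
cost = {0: 5, 1: 1, 2: 1}
greedy_order = grasp_construction([0, 1, 2], preds, cost, alpha=0.0)
```

With α = 0 the RCL collapses to the cheapest feasible candidate (a completely greedy build), while α = 1 admits every feasible candidate (a completely random build), matching the roles described above.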

State-of-the-Art (Earliest Finish Time) EFT-ILS
In [11], the authors introduce an ILS, which in this paper will be called Earliest Finish Time (EFT)-ILS, see Algorithm 4. EFT-ILS consists of two phases; the first explores random feasible execution orders of the task graph from lines 3 to 10. After each new ordering o', the algorithm assigns to the tasks the machines that produce their Earliest Finish Time (EFT) (see line 5 and Algorithm 5). For heavy computational instances, we suggest using an external stopping criterion such as CPU time, see line 10. The second phase initializes an ILS, described in Section 3.1, using the best order o best and solution s best found by the first phase as the initial solution. The next subsection details the Local Search and perturbation processes.

Algorithm 5 EFT machine assignment
1: for i = 1 to |O| do
2: Assign to the task o i the machine m j that produces its minimum finish time.
3: end for
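The EFT assignment rule of Algorithm 5 can be sketched by reusing the same bookkeeping as the makespan computation; the data structures below are our assumptions, not the paper's notation:

```python
def eft_assign(order, P, preds, C, num_machines):
    """Greedily assign each task, in the given execution order, to the
    machine that yields its earliest finish time."""
    machine_time = [0.0] * num_machines
    assign, finish = {}, {}
    for t in order:
        best = None
        for m in range(num_machines):
            # Earliest moment all predecessors (plus communication) are ready
            ready = max(
                (finish[p] + (0 if assign.get(p) == m else C[(p, t)])
                 for p in preds[t]),
                default=0.0,
            )
            ft = max(machine_time[m], ready) + P[t][m]
            if best is None or ft < best[0]:
                best = (ft, m)                # keep the earliest finish time
        ft, m = best
        assign[t], finish[t] = m, ft
        machine_time[m] = ft
    return assign, max(finish.values())

# Toy instance (ours): EFT keeps both tasks on m0 to avoid the communication
P = {0: [2, 3], 1: [4, 1]}
preds = {0: [], 1: [0]}
C = {(0, 1): 5}
assignment, ms = eft_assign([0, 1], P, preds, C, 2)
```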

EFT-ILS Local Search
The Local Search (LS) in EFT-ILS is based on the first-improvement pivoting rule, see Algorithm 6. This algorithm evaluates the tasks in the execution order O (see line 2). The algorithm generates neighbors s' by assigning machines m j ∈ M to the current task t current ; if a neighbor improves the solution s, then s is updated (see line 7) and, as a consequence, the search is reinitialized (see line 8). Finally, the algorithm verifies an auxiliary external stopping criterion before continuing with the neighbor generation process, to avoid exceeding the maximum CPU time or number of objective function evaluations.

Algorithm 6 EFT-ILS Local Search procedure.
Input: Solution to improve s, and an execution order of the tasks O = {o 1 , ..., o |T| } Output: s
1: for i = 1 to |T| do
2: t current ← o i (Assigns the task o i in the execution order as the current task)
3: for j = 1 to |M| do
4: s' ← s
5: Assign t current in s' to the machine m j
6: if f (s') < f (s) then
7: s ← s'
8: Reinitialize the search (restart from line 1)
9: end if
10: end for (If the external stopping criterion is reached, stop the Local Search)
11: end for
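A sketch of the first-improvement Local Search with the restart-on-improvement behavior described above; the evaluate callback, the assignment dictionary, and the evaluation budget are our assumptions:

```python
def first_improvement_ls(assign, order, evaluate, num_machines, max_evals=10_000):
    """First-improvement local search over machine reassignments.
    evaluate(assign) returns the objective (e.g., makespan); the scan
    restarts from the first task whenever an improving neighbor is found."""
    best_val = evaluate(assign)
    evals = 1
    improved = True
    while improved and evals < max_evals:
        improved = False
        for t in order:                        # tasks in execution order
            for m in range(num_machines):
                if m == assign[t]:
                    continue
                neighbor = dict(assign)
                neighbor[t] = m                # move task t to machine m
                val = evaluate(neighbor)
                evals += 1
                if val < best_val:             # first improvement: accept, restart
                    assign, best_val = neighbor, val
                    improved = True
                    break
                if evals >= max_evals:         # external stopping criterion
                    return assign, best_val
            if improved:
                break
    return assign, best_val
```

The double break plus the outer while is what re-enters the scan at the first task of O after every accepted move, as the paper's description of line 8 indicates.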

EFT-ILS Perturbation
EFT-ILS uses a probability-based perturbation process. Every task t i of the solution has a probability of being changed from its current machine; if the change occurs, the task t i is moved from its current machine and assigned to a new random one. For our experiments, we use a probability of 5%, which is the best probability reported in [11].
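The perturbation step can be sketched as follows (helper names are ours; p = 0.05 corresponds to the 5% probability used in the experiments):

```python
import random

def perturb(assign, num_machines, p=0.05, rng=random):
    """With probability p, move each task to a different random machine."""
    new_assign = dict(assign)
    for t in new_assign:
        if rng.random() < p:
            # Candidate machines exclude the task's current one
            choices = [m for m in range(num_machines) if m != new_assign[t]]
            if choices:
                new_assign[t] = rng.choice(choices)
    return new_assign
```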

Proposed GRASP-Cellular Processing Algorithm (GRASP-CPA)
In a similar manner to EFT-ILS, our proposed algorithm GRASP-CPA consists of two phases (see Figure 2). First, a GRASP explores feasible task orders for the next phase of the algorithm. In the second phase, the algorithm uses the best order o best and solution s best found by GRASP in a homogeneous Cellular Processing Algorithm (CPA). The algorithm is composed of three ILS Processing Cells (PCells); the PCells have two functions that are independent of their ILS procedure. The first is to update the global best solution s best if the PCell finds a better solution. The second is to update their current solutions through the communication processes. The communication is performed using the well-known single-point crossover from Genetic Algorithms (GAs), where two solutions from different PCells split and combine their information [12,34]. This phase continues until a fixed number of iterations or CPU seconds is reached.
Algorithm 7 describes our GRASP-CPA proposal, where GRASPConstruction produces feasible orderings o' that are evaluated using EFT to produce the solution s (see lines 5 and 6). Here, we can see that the algorithm uses a GRASPConstruction that receives an α value that can be either 0.9 or 1 with the same probability. This α value is used to restrict the candidate list (see lines 9 and 10 of Algorithm 3). GRASP algorithms usually use α values between 0.1 and 0.3; however, preliminary experimentation showed that our proposed α values were the ones with the best performance for the instances used. Finally, line 12 executes the cellular processing section of the algorithm.
Algorithm 8 shows the general idea of the CPA(s best , o best ) function. Here, each ILS Processing Cell (see lines 5 and 7) iterates five times, which limits the inner computational effort of the Processing Cells. After the Processing Cells' execution, the Communication process recombines the current solutions: the s best of PCell1 is recombined with the s best of PCell2, and the first offspring becomes the new current solution of PCell1. At the same time, the second offspring is used for a second recombination with the s best of PCell3. The resulting offspring from the second recombination become the new current solutions of PCell2 and PCell3. This process continues until the stopping criterion is reached (see line 4).
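The single-point crossover used by the Communication process can be sketched over machine-assignment vectors (a hypothetical solution encoding; the paper does not fix one here):

```python
import random

def single_point_crossover(parent1, parent2, rng=random):
    """Split two machine-assignment vectors at a random point and
    exchange their tails, producing two offspring."""
    point = rng.randrange(1, len(parent1))   # split point in [1, len - 1]
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# Two hypothetical parent assignments (one machine id per task)
p1, p2 = [0, 0, 0, 0], [1, 1, 1, 1]
c1, c2 = single_point_crossover(p1, p2, rng=random.Random(0))
```

Following the communication scheme above, the first offspring of crossing PCell1's and PCell2's best solutions would replace PCell1's current solution, while the second offspring would be recombined again with PCell3's best solution.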

Experimental Setup
This section describes the experimental set of instances, the experimental configuration, and the statistical indicators of confidence in the results.

Parallel Application Instances
The applications used in the experimentation are:
• Double precision floating point FORTRAN benchmark (Fpppp) [35].
• The Laser Interferometer Gravitational-Wave Observatory application (LIGO).
• Robot control application (Robot).
• Sparse matrix solver application (Sparse).
• A benchmark of fourteen small synthetic instances from [11].
The applications Fpppp, Robot, and Sparse are included in the Standard Task Graph Set (STG) in [37]. We applied the same treatment to the original application instances as in [8], considering different Communication-to-Computation Ratios (CCRs) = {0.1, 0.5, 1, 5, 10}, Heterogeneity Factors (HFs) = {0.1, 0.25, 0.5, 0.75, 1}, and numbers of machines |M| = {8, 16, 32, 64}. The combination of the mentioned configurations for the four parallel applications gives a total of 4 · 5 · 5 · 4 = 400 instances. The nomenclature for the large instance set of Fpppp, Robot, and Sparse instances used in this work is Application-Machines-Tasks-HF-CCR. For the small benchmark in [11], the nomenclature is Name-Machines-Tasks. We made the complete instance set available in [38].

Experimental Settings
In the case of EFT-ILS, we use the best configuration from [11] (see Table 3). GRASP-CPA uses a few extra parameters. One of them is the α value of the GRASP algorithm, for which we carried out extensive experimentation with α = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, finding that the best values for these instances were 1.0 and 0.9, without a clear dominance between them. Thus, we decided to choose randomly between 1.0 and 0.9 as the α value, with equal probability, before each GRASPConstruction (see line 5 of Algorithm 7). We use the same Local Search, perturbation process, and perturbation probability p m in both algorithms for a fair comparison. Finally, to ensure the CPA communication process, we set the probability of recombination p r to 100%. In both algorithms, the first phase has two stopping criteria, 50 iterations or a maximum of 5 min, while the second phase uses a stopping criterion of 100,000 objective function evaluations. Every instance from Section 4.1 is run 100 independent times. The complete experimental parameter settings are shown in Table 3. We made the algorithm implementations available at [39,40].

Statistical Indicators
To assess statistical confidence in our experimentation, we compute the median value of the independent runs as well as the interquartile range (IQR). The tables presenting our results follow the format MEDIAN IQR , with the IQR as a subscript. The tables also emphasize the best and second-best reported values for every problem with a gray and a light background, respectively. For the sake of completeness, we apply the non-parametric Wilcoxon signed-ranks test to the results to assess statistical differences in a pairwise comparison for every problem, at a 95% confidence level [41]. A dedicated symbol indicates that EFT-ILS was statistically worse than GRASP-CPA according to the Wilcoxon signed-ranks test; a second symbol is used otherwise. Finally, we mark with '-' the cases where there were no statistical differences.
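The MEDIAN/IQR summary can be computed with the Python standard library; the summarize helper and the sample run values below are ours, for illustration only:

```python
import statistics

def summarize(runs):
    """Median and interquartile range of a list of makespan values,
    matching the MEDIAN/IQR reporting format used in the tables."""
    q1, med, q3 = statistics.quantiles(runs, n=4)  # quartiles (exclusive method)
    return med, q3 - q1

runs = [1, 2, 3, 4, 5, 6, 7, 8]   # makespans of 8 hypothetical runs
med, iqr = summarize(runs)
```

For the pairwise comparison itself, the paired Wilcoxon signed-ranks test is available as scipy.stats.wilcoxon in SciPy (not shown here to keep the sketch dependency-free).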

Results
First, we analyze the results of the large benchmark set of 400 scheduling problems. Focusing on the Fpppp instance set (see Table 4), GRASP-CPA outperformed EFT-ILS with statistical significance in 53 instances, and EFT-ILS did not statistically outperform GRASP-CPA in any instance with 8 or 16 machines. However, for the instances with 32 and 64 machines, EFT-ILS outperformed GRASP-CPA in 3 and 11 instances, respectively.
Regarding the LIGO benchmark results from Table 5, GRASP-CPA outperformed EFT-ILS in 45 instances. EFT-ILS only outperformed GRASP-CPA in 13 instances, distributed as follows: two, three, five, and three for the instances with 8, 16, 32, and 64 machines, respectively.
For the Robot benchmark (see Table 6), GRASP-CPA outperformed EFT-ILS with statistical significance in 48 cases, while EFT-ILS only outperformed GRASP-CPA in 12 instances, where most of them occurred for the instances with 16 machines.
The results for the Sparse benchmark from Table 7 show that GRASP-CPA outperformed EFT-ILS with statistical significance in 33 instances, while EFT-ILS outperformed GRASP-CPA in 10 instances.
Finally, for these instance sets, GRASP-CPA achieved 265% more best median values than EFT-ILS, with statistical significance. Therefore, we consider that GRASP-CPA is superior to EFT-ILS in a relevant proportion of the studied cases.
Furthermore, we analyze the results for the small benchmark of 14 synthetic instances in Table 8. For the 14 synthetic problems, GRASP-CPA achieves the best median value in all the cases, with statistical significance in ten of them. In addition, Table 8 shows the average computing time of the enumerative optimal algorithm (Time Enum ), EFT-ILS (Time EFT-ILS ), and GRASP-CPA (Time GRASP-CPA ) in CPU seconds on a MacBook Pro 13-inch (late 2011). Table 9 presents the best solutions found by EFT-ILS and GRASP-CPA, where EFT-ILS computes twelve optimal values, while GRASP-CPA computes thirteen. Additionally, GRASP-CPA achieves an IQR of 0.0 in six cases, where the algorithm reached the optimal value in the median.
Table 7. Sparse instances: median and IQR of EFT-ILS and GRASP-CPA over 100 independent runs. Light gray emphasizes the best results.
Table 9. Small synthetic benchmark instances: best results found by EFT-ILS and GRASP-CPA over 100 independent runs. Light gray emphasizes the best results.

Conclusions and Future Work
This paper proposes a new Cellular Processing Algorithm that uses a GRASP construction, called GRASP-CPA, for scheduling precedence-constraint tasks on heterogeneous systems. Experimental results showed that GRASP-CPA outperformed a high-performance algorithm from the state-of-the-art, called EFT-ILS, regarding optimal and median values, with statistical significance for the proposed set of instances. Two main features of GRASP-CPA contribute to its performance. The first is the generation of task execution orders using a GRASP algorithm, as opposed to the completely random order generation in EFT-ILS. The second is the communication between different Processing Cells, which helps explore the search space. We encourage researchers to apply the Cellular Processing Algorithm approach to their problems. This approach is more a framework than a strict algorithm, allowing flexible implementations with homogeneous or heterogeneous Processing Cells. Despite being a novel approach, it has proven effective for several problems and still has many open research areas. As future work, we would like to research other methods to produce task execution orderings, to improve the results yielded by GRASP-CPA.