PkMin: Peak Power Minimization for Multi-Threaded Many-Core Applications

: Multiple multi-threaded tasks constitute a modern many-core application. An accompanying generic Directed Acyclic Graph (DAG) represents the execution precedence relationship between the tasks. The application comes with a hard deadline and high peak power consumption. Parallel execution of multiple tasks on multiple cores results in a quicker execution, but higher peak power. Peak power single-handedly determines the involved cooling costs in many-cores, while its violations could induce performance-crippling execution uncertainties. Less task parallelization, on the other hand, results in lower peak power, but a more prolonged deadline violating execution. The problem of peak power minimization in many-cores is to determine task-to-core mapping configuration in the spatio-temporal domain that minimizes the peak power consumption of an application, but ensures application still meets the deadline. All previous works on peak power minimization for many-core applications (with or without DAG) assume only single-threaded tasks. We are the first to propose a framework, called PkMin , which minimizes the peak power of many-core applications with DAG that have multi-threaded tasks. PkMin leverages the inherent convexity in the execution characteristics of multi-threaded tasks to find a configuration that satisfies the deadline, as well as minimizes peak power. Evaluation on hundreds of applications shows PkMin on average results in 49.2% lower peak power than a similar state-of-the-art framework.


Introduction
A many-core application is made up of tens of tasks. All of the tasks are inherently multi-threaded. An accompanying generic Directed Acyclic Graph (DAG) models the execution dependency between the tasks and therefore determines the precedence order for execution [1]. The application must complete execution within a given hard deadline. One way to meet the deadline is to execute as many tasks as possible in parallel. Executing them with the maximum parallelization permitted under the DAG with all available cores results in a short execution time, but also results in high peak power.
The highest power consumption observed during the task's execution defines its peak power. Rated peak power consumption predominantly determines the cost (and weight) of cooling infrastructure that accompanies the many-core. Higher peak power also results in higher on-chip temperatures, which leads to reliability issues [2][3][4]. Higher temperatures can also trigger performance crippling thermal-throttling, which makes execution unpredictable [5]. We can minimize peak power by executing all of the tasks sequentially on a single core, but such an execution will violate the deadline.
The problem of peak power minimization [6,7] in many-cores is to determine a spatio-temporal task-to-core mapping (configuration) that still meets the application deadline, but minimizes the peak power. We make use of two well-known observations in the execution of multi-threaded tasks in order to efficiently solve the problem. First, Figure 1a shows that the execution time of tasks in a many-core application is discretely convex with increasing core allocation [8][9][10][11]. Second, Figure 1b shows that the peak power of tasks is discretely linear with increasing core allocation. These two observations open up the possibility of employing convex optimization in order to solve this problem.  Most works on peak power minimization for many-core applications only assume single-threaded tasks. Therefore, they have no hope for the exploitation of execution characteristics that are shown in Figure 1, which are specific to multi-threaded tasks. In this work, we propose a framework called PkMin that minimizes the peak power of multi-threaded many-core applications with DAG under deadline constraint by exploiting the observations made in Figure 1. We evaluate PkMin against a similar state-of-the-art framework [12] while using hundreds of applications for a thorough evaluation. Empirical evaluations show that PkMin results in on average 48% lower peak power than the state-of-the-art. Figure 2 shows a motivational example of the problem of peak power minimization in many-cores. We use Sniper [13] multicore x86 simulator to execute the binaries that are associated with the task. Sniper directly reports the execution time values (in clock cycles) for the binaries and also generates traces that is then used by downstream tools, like McPAT [14], to estimate the power consumed by various per core components like L1 (I/D) caches, instruction fetch units, L2 caches, etc., as well uncore components, like memory controllers and DRAM subsystem. This methodology avoids the insertion of costly instrumentation hooks within the program. We account for the power consumption only on the "active" cores, i.e., those that are directly reponsible for execution of the application. Figure 2a gives a DAG for an application composed of four tasks that we need to execute on many-core within 425 ms. Tasks Figure 2b shows the execution time of the benchmark in a configuration that executes the tasks serially with all available cores. The execution in Figure 2b meets the deadline, but leads to a high peak power of 75.95 W. Figure 2c shows the execution time of the application in a configuration that parallelizes the execution, but does not use multi-threading in tasks. Execution in Figure 2c leads to low peak power, but it violates the deadline. Figure 2d shows the execution time of the application in an optimal configuration generated by PkMin that allocates just enough cores to tasks, such that both precedence and deadline constraints are met, but with a minimal peak power of 10.77 W. Figure 2d results in 85.82% lower and only 17.45% higher peak power than the configuration in Figure 2b,c, respectively.

Related Work
Peak power minimization in the context of multi-/many-core scheduling is an active subject of research [15]. The problem is important in both embedded [16][17][18] as well as super-computing domain [19,20]. The authors in [21] propose an algorithm to minimize peak power for an application with a task graph without a deadline. The authors of [22] were the first to study the problem of peak power minimization in the context of task graph-based many-core applications with a deadline. They propose an optimal algorithm that schedules a set of independent tasks from the application on a many-core to minimize application's peak power while meeting its deadline. The authors of [23] work on the same problem, but proposed an alternative light-weight heuristic algorithm. Most recently, the authors in [24] proposed an algorithm to minimize peak power for many-core applications with DAG under reliability constraints. All of the works that target the peak power minimization problem directly assume the tasks with DAG within the many-core application to be single-threaded and thereby individually schedulable only on a single core. On the contrary, we focus on tasks that are all individually multi-threaded, wherein the level of multi-threading in each of them is an independently configurable design knob for solving the peak power minimization problem.

Convex Optimization Sub-Routine for Solving Serialized DAG
This section provides the details of the convex optimization sub-routine that is at the heart of PkMin. This sub-routine solves the problem of peak power minimization for a many-core application with a serial DAG optimally in the continuous domain and near-optimally in the discrete domain, as the problem is NP-Hard in the discrete domain. PkMin uses this sub-routine to perform peak power minimization of any generic DAG in Section 5.
System Model: we need to execute an application with M ∈ N multi-threaded tasks indexed while using i on a many-core with N ∈ N cores. Tasks execute serially in an order ordained by the serialized DAG. N i is the maximum number of cores that can be allocated to the task i. Each task i is executed with C i ∈ R + number of cores with the domain constraint 1 ≤ C i ≤ N i . The number of cores C i allocated to the task i in practice needs to be discrete. However, we, at first, assume it to be a non-negative real value greater than equal to 1 and less than equal to N i for tractability.
Based on the observations made in Figure 1a, we assume execution time τ i : R + → R + of task i with C i cores allocated to be a univariate convex function of the number of allocated cores. In the domain [1, N i ] ∈ R + allocated cores, the execution time τ i (C i ) is a monotonically decreasing convex function of the number of allocated cores C i . Based on observations made in Figure 1b, we assume the power ρ i : R + → R + of a task i with C i cores allocated to be a univariate linear function of the number of allocated cores.
Execution Model: the execution time of the application in totality τ : R M + → R + with configuration (core allocation vector) C = C 1 , C 2 , ..., C M ∈ R M + is the sum of the execution time of the individual tasks.
Because the execution time of each task is individually convex and the sum of convex functions is a convex function, then the application's execution time τ( C) is a multivariate convex function of configuration C. Figure 3a shows the non-negative convex execution time surface of a two-task application.
The application's peak powerρ( C) is a multivariate convex function of configuration C, as the peak power of each task is individually linear and the max of linear functions is a convex function. Figure 3b shows the non-negative convex peak power surface of a two-task application.
Deadline Model: the peak power functionρ( C) attains its lowest value when all of the tasks execute with bare minimum cores (∀ i C i = n i ), but this is only permitted when there is no constraint on the execution time. The application's hard deadline ofτ ∈ R + put a constraint τ( C) ≤τ on its execution time. The deadlineτ divides the domain for minimization of peak power functionρ( C) into feasible and infeasible regions.
Given a deadline, we are required to minimize the peak power over the feasible region F ⊂ R M + . We now prove the feasible region F to be a convex set. Let C x , C y ∈ F be two feasible configurations. By feasibility definition Because τ( C) is a convex function, Using Equation (3), we obtain Therefore, F is a convex set, since, for any C x , C y ∈ F and any λ ∈ [0, 1], we have Figure 4 shows the feasible region F for an application with two tasks with a given hard deadline.  Solution: our problem reduces to minimizing a convex peak power functionρ( C) over the feasible convex set F as its domain. Formally, it can also be summarized, as follows We solve the above convex optimization problem in the continuous domain using NLOpt [25] tool that internally solves the problem whlileusing the Method of Moving Asymptotes (MMA) [26] algorithm. The problem-solving time is insignificant, even with an extremely large number of tasks.
Solution Discretization: when we modify the constraint in Equation (4) to force cores allocated to the task to be integers i.e., ∀C i ∈ Z + in the domain [1, N i ] ∈ Z + instead of real-numbers, the problem becomes an NP-Hard Convex Mixed Integer Non-Linear Programming (CMINLP) problem [27]. Still, we can expect local optima in the discrete domain to be close to the discrete global optimum, because the optimization is tractable using convex programming in its relaxed continuous domain, unlike an arbitrary optimization problem [28].
We first solve Equation (4) to obtain the optimal real-valued configuration in the continuous domain. We then round up all of the individual real-valued core allocations to the nearest integer. The rounding up/down decision is tricky, because there are 2 task graph size possibilities, all of which cannot be exhaustively tested for optimality. Two choices are immediately obvious, i.e., to round-down all the allocations or round-up all of them. Because the execution time is non-increasing with increasing core allocation for any task (Figure 1a), rounding down can make the allocation infeasible while rounding up will preserve the feasibility, although it not guaranteed to be optimal in the discrete domain. For example, if the optimal configuration in a continuous domain for a three task application is found to be C = 4.1, 3.4, 5.9 , then we round up to discrete configuration C = 5, 4, 6 .

Peak Power Minimization with PkMin
The problem of peak power minimization for many-core applications with precedence and deadline constraints is inherently a multi-dimensional bin-packing problem, which is well-known to be NP-Hard [29]. We introduce a framework, called PkMin, which solves the problem near-optimally by exploiting the observations in Figure 1. At the heart of PkMin is a convex optimization sub-routine that can solve the problem optimally for a serialized DAG in a continuous domain. Section 4 provides the details of the sub-routine. The problem is NP-Hard, even with a serialized DAG, in the discrete domain [27]. Therefore, the sub-routine extrapolates the real-valued solution to the discrete domain.
In general, an application DAG allows for the possibility of multiple tasks to be run in parallel. For a given deadline, parallel execution allows for tasks to individually stretch out further along in the time domain more than a serialized execution. Therefore, we can execute tasks with a fewer number of cores being allocated to them individually and still meet the deadline. Tasks execute with a lower peak power (Figure 1b) with a smaller number of cores. However, parallel execution is not guaranteed to lower the peak power of the application, because the peak power of tasks that execute in parallel adds up. Figure 5 shows the functioning of PkMin with the help of a flowchart. PkMin begins by serializing the DAG for an application by applying a topological sorting procedure [30]. It then passes the DAG to a convex optimization sub-routine (Section 4) that computes an allocation for the serialized DAG.
PkMin then enumerates all of the task pairs that can be executed in parallel by computing the transitive closure [31] of the DAG that exposes the pairwise independent tasks. A pair of independent tasks is then "stacked" together to form a single unified task.  All of the sub-tasks in the unified task always execute with the same number of cores. The execution time and peak power of the unified task is the max and sum of the execution times and peak powers of the sub-tasks, respectively. The sum operator preserves the linearity of power characteristics, while the maximum operator preserves the convexity of execution time characteristics. The new unified task has characteristics that are similar to its constituent tasks. If there are multiple task pairs to choose from, PkMin chooses the task pair that gives the greatest reduction in execution time on unification.
The unified task replaces its sub-tasks in the original DAG. PkMin then serializes the modified DAG and passes it to the convex optimization sub-routine again in order to obtain a new feasible configuration. It repeats the process of DAG modification, followed by convex optimization iteratively until an exit condition is encountered in one of the following ways.
1. The new configuration yields a higher peak power than the previous configuration, i.e., a local minimum is reached. 2. There are no more candidate task pairs that can be parallelized.
PkMin reports the configuration from the previous iteration as the final solution configuration.

Working Example
This section explains the functioning of PkMin with the help of a working example. Figure 6 visualizes the steps that were taken by PkMin to solve the motivational example shown in Figure 2. PkMin begins with the original DAG that is shown in Figure 2a. It then serializes the DAG, as shown in Figure 6a. It then runs the convex optimization module from Section 4 in order to obtain the core allocation for the serialized DAG, as shown in Figure 6b. It then tries to stack Task B and Task C together by combining them into a new Task B||C and create a new DAG, as shown in Figure 6c. Figure 6d gives the characteristics of unified Task B||C, which inherits the convexity properties of the parent tasks under equal core distribution. Because the DAG in Figure 6c is already serialized, then PkMin can directly operate on it. PkMin runs convex optimization sub-routine again on the unified DAG to obtain a new core allocation as shown in Figure 6e. However, allocation in Figure 6e has worse peak power than the allocation in Figure 6b, and, therefore, the algorithm terminates with allocation in Figure 6b as the reported solution.
When compared to the worst-case peak power in Figure 2b, the solution reported by PkMin has 85.15% lower peak power. The solution reported by PkMin has only 4.27% higher peak power when compared to the optimal solution in Figure 2d.  Figure 2 using PkMin.

Experimental Evaluation
Experimental Setup: we use Sniper simulator [32] to simulate the execution of multi-threaded many-core applications. The simulated multi-core is composed of eight tiles-with two cores each-arranged in a 4 × 2 grid connected while using a Network on Chip (NoC) with hop latency of four cycles and link bandwidth of 256 bits. Two cores within the tile share a 1 MB L2 cache. Cores implement Intel x86 Instruction Set Architecture (ISA) and run at a frequency of 4 GHz with each core holding a 32 KB private L1 data and instruction caches. Many-core's power consumption is provided by the integrated McPat [14] assuming a 22 nm technology node fabrication.
Application Task Graphs: we use a set of five benchmarks-CilkSort, DFS, Fibonacci, Pi, and Queens-from Lace benchmark suite [33] to create our tasks. In order to generate random DAGs of size N, we first sample with replacement N tasks from the benchmark set, thereafter with a probability p, we add an edge between select pair of nodes, such that the acyclic property of the resulting directed graph is preserved. The setup allows for us to thoroughly evaluate PkMin with an arbitrarily large number of tasks while simultaneously generating a large number of randomized applications for a given number of tasks.
Application Deadline: setting up arbitrarily short deadlines will render application execution infeasible. In order to set up a feasible deadline, we first note the minimum execution time that is achievable by all of the benchmarks, as, for example, illustrated in Figure 1a for DFS and CilkSort. Let B be the benchmark execution time that is worst among all of the benchmarks considered. We then set the deadline to B · N for an application task graph with N tasks. This ensures the existence of a feasible solution. This is also a fairly tight deadline, as all of the tasks are forced to execute with maximum available cores, if they choose to execute one after the other in a serial fashion. If the application deadline is relaxed further, then other execution configurations with much lower cores (and hence peak power) may become feasible.
Baseline: we are unaware of any work that also solves the problem of peak power minimization for multi-threaded many-cores applications with DAG under deadline constraints. The authors of [12] propose a framework, called D&C, which uses a divide and conquer algorithm to minimize execution time for multi-threaded many-core applications with DAG under a peak power constraint. Therefore, D&C solves dual of the problem solved by PkMin. We modify D&C to DCPace that solves the same problem as PkMin by replacing the constraint from peak power to deadline and replacing the objective function from minimizing executing time to minimizing peak power. Modification keeps the underlying algorithm's ethos intact. DCPace thus acts as a suitable baseline for PkMin.
DCPace begins by allocating cores to tasks, to run them in their most energy-efficient configuration, i.e., the one with minimum energy consumed. It then generates an intermediate schedule by scheduling the tasks at the earliest possible time permitted under precedence constraints. It then identifies the midpoint of the schedule along the time axis. All of the tasks that are actively executing at the midpoint must be independent. DCPace divides the task into three bins beg, mid, and end. All three bins are assigned a sub-deadline that is equal to the third of the original deadline. All independent tasks in mid are greedily scheduled, such that the bin's peak power is minimized under the available core and sub-deadline constraints. This is done using a strip packing heuristic, like Next-Fit Decreasing Height (NFDH) [34]. In a strip-packing problem, a collection of rectangles of different height and width are to be packed on a rectangular bin, with a fixed width and unbounded height, such that no two rectangles overlap. This problem is NP-complete. NFDH begins by sorting the rectangles in decreasing order of their heights. After that, it packs the rectangle in a left-justified fashion until the next rectangle can no longer fit in the remaining space to the right. A new level is defined as the packing restarts from the remaining rectangles. DCPace continues to divide recursively mid and end using their midpoint. Recursion breaks when a bin becomes a singleton with only one task.
Power and Energy Consumption Analyses: Figure 7 illustrates the working of DCPace and PkMin algorithms. In this experiment, we use tasks graph with 100 tasks and set the deadline to 1700 million clock cycles or 425 ms. DCPace chooses the most energy-efficient core allocation to execute each task, which is not changed thereafter. Given the task execution time and task peak power characteristics, as shown in Figure 1, the minimum energy allocation can only occur either when all of the cores are allocated, or a minimum number of core is allocated to each task. Figure 7b, shows the variation in the total cores allocated as the application execution proceeds in time. Both DCPace and PkMin show considerable variations in the total-cores allocated, although the former exclusively varies between the maximum and the minimum possible allocations. Because the goal of PkMin is to reduce peak power exclusively, its allocations amongst tasks under PkMin are such that any two different non-overlapping tasks have almost similar peak power consumption when compared to DCPace. PkMin exploits the convexity properties of the task characteristics in order to achieve this "equivalent power" allocations. The power trace of application execution in Figure 7a under PkMin has almost no peaks and troughs as compared to DCPace. and PkMin show considerable variations in the total-cores allocated, although the former exclusively varies between the maximum and the minimum possible allocations. Because the goal of PkMin is to reduce peak power exclusively, its allocations amongst tasks under PkMin are such that any two different non-overlapping tasks have almost similar peak power consumption when compared to DCPace. PkMin exploits the convexity properties of the task characteristics in order to achieve this "equivalent power" allocations. The power trace of application execution in Figure 7a under PkMin has almost no peaks and troughs as compared to DCPace. Performance Evaluation: we evaluate the efficacy of PkMin in minimizing peak power for applications with an increasing number of tasks. We also evaluate the same applications using DCPace to put the performance of PkMin in context. First, we show the peak power savings for application task graph of sizes varying from 10 to 100. We set the deadline to 17 · N million clock cycles, where the best possible execution time of each task is 17 million clock cycles and N is the number of tasks in the application. Figure 8a shows that PkMin has, on average, 48% lower peak power when compared to the DCPace. The energy consumption of PkMin is, however, around 0.8% higher than the DCPace, as illustrated in Figure 8b. Figure 9 orthogonally shows the efficacy of PkMin in minimizing the peak power of hundred random applications, each with 100 tasks. The deadline is similarly set to 1700 million clock cycles. PkMin results in lower peak power than DCPace, with only 1% additional energy overhead. Figure 10 shows the efficacy of PkMin in minimizing peak power in a random application (with 100 tasks) as its deadline is relaxed. In this case, the improvement of peak power for PkMin over DCPace comes at the cost of worsening energy consumption. However, a significant reduction in the peak power of approximately 88% is possible with less than 10% increase in energy consumed. Performance Evaluation: we evaluate the efficacy of PkMin in minimizing peak power for applications with an increasing number of tasks. We also evaluate the same applications using DCPace to put the performance of PkMin in context. First, we show the peak power savings for application task graph of sizes varying from 10 to 100. We set the deadline to 17 · N million clock cycles, where the best possible execution time of each task is 17 million clock cycles and N is the number of tasks in the application. Figure 8a shows that PkMin has, on average, 48% lower peak power when compared to the DCPace. The energy consumption of PkMin is, however, around 0.8% higher than the DCPace, as illustrated in Figure 8b. Figure 9 orthogonally shows the efficacy of PkMin in minimizing the peak power of hundred random applications, each with 100 tasks. The deadline is similarly set to 1700 million clock cycles. PkMin results in lower peak power than DCPace, with only 1% additional energy overhead. Figure 10 shows the efficacy of PkMin in minimizing peak power in a random application (with 100 tasks) as its deadline is relaxed. In this case, the improvement of peak power for PkMin over DCPace comes at the cost of worsening energy consumption. However, a significant reduction in the peak power of approximately 88% is possible with less than 10% increase in energy consumed.    Figure 10. Peak power under PkMin for a 100-task application with different deadlines that are normalized against its peak power under DCPace. Scalability: PkMin uses NLOpt internally, which has a low polynomial-time computational complexity. It invokes NLOpt at the max number of tasks |M| times, keeping the computational complexity still polynomial. PkMin also uses topological sort and transitive closure graph algorithms that also have a worst-case polynomial computational complexity of O(|M|) and O(|M| 2 ), respectively. This low polynomial-time computational complexity makes PkMin highly scalable. Figure 11 shows the increase in worst-case problem-solving time that is required under PkMin with an increase in the number of tasks in applications. For a 100-task application, PkMin requires 1.3 s to compute the near-optimal configuration.

Conclusions
We introduced a framework, called PkMin, in this work that solves the problem of peak power minimization for a multi-thread many-core application with DAG under a deadline. PkMin exploits the execution characteristics of multi-threaded tasks in many-core applications to optimally solve the problem for a serialized DAG in the continuous domain while using convex optimization. It then uses the convex optimization sub-routine to solve the problem near-optimally for any generic DAG.
Empirical evaluations on hundreds of applications show configurations obtained under PkMin have, on average, up to 48% lower peak power than similar state-of-the-art with less than 1% additional total energy. The peak power savings can be further increased to 88% with less than 10% energy overheads whenever the deadline is relaxed. PkMin has polynomial-time computation complexity with negligible problem-solving overheads, which makes it suitable for use at both run-time and design-time.