Task-Level Aware Scheduling of Energy-Constrained Applications on Heterogeneous Multi-Core System

Abstract: Minimizing the schedule length of parallel applications that run on a heterogeneous multi-core system under an energy consumption constraint has recently attracted much attention. The key to this problem is the strategy used to pre-allocate energy consumption to unscheduled tasks. Previous articles used the minimum value, the average value, or a power consumption weight value as the pre-allocated energy consumption of tasks. However, they all ignored the different levels of tasks. Tasks at different levels have a different impact on the overall schedule length when they are allocated the same energy consumption. Considering the task levels, we design a novel task energy consumption pre-allocation strategy that is conducive to minimizing the schedule length, and we develop a novel task scheduling algorithm based on it. After obtaining the preliminary scheduling results, we also propose a frequency re-adjustment mechanism that re-adjusts the execution frequency of tasks to further reduce the overall schedule length. We carried out a considerable number of experiments with practical parallel application models. The results show that our method achieves better performance than the existing algorithms.


Introduction
Computer systems nowadays must perform much better than ever before while running many applications simultaneously. By using heterogeneous multi-core processors and increasing the number of processor cores, it is possible to improve performance while keeping energy consumption at bay [1][2][3][4][5]. Heterogeneous multi-core systems are widely used, from small embedded devices to large data centers. It is expected that in the near future, the number of heterogeneous processors and cores in these systems will increase dramatically [6][7][8]. On the other hand, although the performance of such systems has greatly improved, their power consumption is also increasing. Huge energy consumption has caused various economic, environmental and technical problems [9][10][11][12][13]. Therefore, energy consumption is one of the main design constraints for such heterogeneous multi-core systems. A well-known mechanism for reducing the power consumption of computing systems is dynamic voltage and frequency scaling (DVFS), which balances energy consumption against performance by reducing the power supply

• We design a novel energy pre-allocation strategy considering task level and prove its feasibility.
• We develop a novel scheduling algorithm to minimize the schedule length under energy consumption constraints based on the new energy pre-allocation strategy.
• We introduce a frequency re-adjustment mechanism after task scheduling to reduce the negative impact of local optimization.
• We evaluate our algorithm based on real parallel applications. The experimental results consistently prove the superiority and competitiveness of our algorithm.
The structure of this article is as follows. Section 2 reviews some existing studies that are relevant to us. Section 3 gives some preliminaries related to the problem of minimizing the schedule length for energy consumption constrained parallel applications. In Section 4, we present our approach for this problem. In Section 5, we discuss and analyze the experimental results. Finally, we conclude the paper in Section 6.

Related Work
Energy saving design technology based on DVFS was first proposed by [1]. Nowadays, DVFS is widely used in multi-core task scheduling problems related to energy consumption. Reference [2] studied the problem of minimizing the schedule length of independent sequential applications with energy consumption constraints. In Reference [29], the task scheduling problem with energy consumption constraints is considered as a combinatorial optimization problem. In Reference [31], the authors consider three constraints (energy consumption, deadline and reward). These studies mainly focus on homogeneous systems, so they differ from our study.
Electronics 2020, 9, 2077
In addition to the above research, many researchers have studied the task scheduling problem on heterogeneous multi-core systems. For example, Reference [32] proposed an energy-saving workflow task scheduling algorithm based on DVFS. Huang et al. [33] proposed an enhanced energy-saving scheduling algorithm to minimize energy consumption while satisfying a certain performance level. Rusu et al. [31] added constraints and proposed an efficient algorithm for minimizing energy consumption under multiple constraints. The goal of these studies is generally the opposite of ours. We study the minimization of the schedule length under energy consumption constraints, while those studies focus on minimizing energy consumption under other constraints.
There are also many excellent studies that are closely related to ours. For example, a representative paper proposed the classical Heterogeneous Earliest Finish Time (HEFT) algorithm, which was developed to minimize the schedule length in heterogeneous multi-core systems. The application model and energy consumption model they use are consistent with ours, but they do not consider energy consumption constraints. At present, the studies closest to ours are the four articles mentioned in Section 1 [20][21][22][23]. They pre-allocate energy consumption to each task according to the minimum energy consumption of the task [20], the execution time ratio of the task [21], the overall average energy consumption [22], or a defined task energy consumption weight [23]. What these studies do not consider is that different task levels have different impacts on the overall schedule length. Moreover, they ignore the negative impact of the locally optimal nature of the scheduling algorithm on the schedule length. Based on this, we propose a novel energy pre-allocation strategy that considers task level, and, in order to reduce the negative impact of local optimization, we develop a task execution frequency re-adjustment mechanism. Finally, our method achieves better performance than previous studies.

Models and Preliminaries
In this section, we first introduce the application model (Section 3.1), then the energy consumption model (Section 3.2), and next we describe in detail the issues that need to be addressed (Section 3.3). Finally, we briefly introduce the current situation and reveal its limitations (Section 3.4). Table 1 shows the main notations we use.

Table 1. Main notations used in this article.
Notation | Description
$w_{i,k}$ | Execution time of task $n_i$ on the processor core $u_k$ with the maximum frequency
$c_{i,j}$ | Communication time from $n_i$ to $n_j$
$pred(n_i)$ | The set of direct predecessor tasks of task $n_i$
$succ(n_i)$ | The set of direct successor tasks of task $n_i$
$n_{entry}$ | Entry task of an application
$n_{exit}$ | Exit task of an application
$E(n_i, u_k, f_{k,h})$ | The energy consumption of the task $n_i$ on the processor core $u_k$ with the frequency $f_{k,h}$
$EST(n_i, u_k)$ | The earliest start time of task $n_i$ running on processor core $u_k$
$EFT(n_i, u_k, f_{k,h})$ | The earliest finish time of task $n_i$ running on processor core $u_k$ with frequency $f_{k,h}$
$AST(n_i)$ | The actual start time of task $n_i$
$AET(n_i)$ | The actual execution time of task $n_i$
$AFT(n_i)$ | The actual finish time of task $n_i$
$LFT(n_i)$ | The latest finish time of task $n_i$
$L(n_i)$ | The level of task $n_i$
$u_{pr(i)}$ | The processor core allocated to task $n_i$
$f_{pr(i),hz(i)}$ | The execution frequency allocated to task $n_i$ on processor core $u_{pr(i)}$
$E_{given}(n_i)$ | The calculated energy consumption constraint of task $n_i$
$E_{pre}(n_i)$ | The pre-allocated energy consumption of task $n_i$
$E_{given}(G)$ | The given energy consumption constraint of application $G$
$E(G)$ | The energy consumption of application $G$
$SL(G)$ | The schedule length of application $G$

Application Model
As in previous studies [20][21][22][23][24][34], we use a directed acyclic graph (DAG) to represent parallel application models. We define $U = \{u_1, u_2, \ldots, u_k, \ldots, u_{|U|}\}$ to represent the set of processor cores, where $|U|$ is the number of processor cores. Note that for any set $X$, we use $|X|$ to denote its size. We define the DAG application model as $G = \{N, W, C\}$. $N = \{n_1, n_2, \ldots, n_i, \ldots, n_{|N|}\}$ represents the set of nodes in the graph, that is, the set of tasks in the application. Due to the heterogeneous nature of the processor, the execution time of $n_i \in N$ differs across processor cores. $W$ is a matrix of size $|N| \times |U|$, where $w_{i,k}$ denotes the execution time of $n_i$ on $u_k$ at the maximum frequency. $C$ denotes the weights of the edges between connected nodes in the DAG, that is, the communication times between tasks. $c_{i,j} \in C$ represents the communication time from $n_i$ to $n_j$. If $c_{i,j} = 0$, there is no communication from $n_i$ to $n_j$. We define $pred(n_i)$ and $succ(n_i)$ as the set of direct predecessor tasks and the set of direct successor tasks of task $n_i$. For example, $pred(n_2) = \{n_1\}$ and $succ(n_2) = \{n_8, n_9\}$. We define $n_{entry}$ and $n_{exit}$ as the entry task and the exit task of an application. In Figure 1, $n_{entry}$ and $n_{exit}$ are $n_1$ and $n_{10}$.
Figure 1 shows an example of a DAG-based parallel application with ten tasks. Each node in Figure 1 represents a task, and the value on each edge represents the communication time between the two connected tasks if they are not assigned to the same processor core. For example, the value 18 on the edge between $n_1$ and $n_2$ indicates that the communication time between $n_1$ and $n_2$ is 18.
Assuming that there are three heterogeneous processor cores $\{u_1, u_2, u_3\}$ in the system, Table 2 shows the execution time of the tasks in Figure 1 running on each processor core with the maximum frequency. For example, the first number 14 in Table 2 indicates that the execution time of $n_1$ running on $u_1$ with the maximum frequency is 14.
Table 2. Execution time of tasks on different processors with the maximum frequency of the application in Figure 1.
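To make the application model concrete, the following is a minimal Python sketch of $G = \{N, W, C\}$ together with $pred$, $succ$, $n_{entry}$ and $n_{exit}$. The class name `Dag` and the four-task, two-core example values are hypothetical and are not the data of Figure 1 or Table 2. One simplification is assumed: an edge exists exactly when its key is present in `c`, so an edge with zero communication time can still be represented.

```python
from dataclasses import dataclass, field


@dataclass
class Dag:
    """G = {N, W, C}: tasks are the keys of w; c maps edges to communication times."""
    w: dict                      # w[i][k]: execution time of n_i on u_k at max frequency
    c: dict = field(default_factory=dict)   # c[(i, j)]: communication time n_i -> n_j

    def succ(self, i):
        """Direct successors of task i."""
        return sorted(j for (a, j) in self.c if a == i)

    def pred(self, i):
        """Direct predecessors of task i."""
        return sorted(a for (a, j) in self.c if j == i)

    def entry(self):
        """The task without predecessors (assumes a single entry task)."""
        return next(i for i in self.w if not self.pred(i))

    def exit(self):
        """The task without successors (assumes a single exit task)."""
        return next(i for i in self.w if not self.succ(i))


# Hypothetical 4-task diamond on 2 cores; all numbers are invented for illustration.
g = Dag(
    w={1: [14, 16], 2: [13, 19], 3: [11, 13], 4: [21, 7]},
    c={(1, 2): 18, (1, 3): 12, (2, 4): 19, (3, 4): 20},
)
print(g.pred(4), g.succ(1), g.entry(), g.exit())
```

Storing $C$ as a sparse edge map rather than a dense matrix keeps the predecessor and successor queries short, at the cost of a linear scan per query; for large graphs an adjacency list would be the more usual choice.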

Energy Model
In DVFS technology, the relationship between supply voltage and operating frequency is almost linear, so DVFS also adjusts the supply voltage when it adjusts the clock frequency. Similar to [20][21][22][23], we use frequency regulation to indicate simultaneous regulation of supply voltage and frequency. In this article, we use the same energy model as references [20][21][22][23]. The system power consumption with respect to frequency is therefore
$$P = P_s + h\,(P_{ind} + P_d) = P_s + h\,(P_{ind} + C_{ef} f^m) \tag{1}$$
In the above equation, $P_s$ denotes static power, which can only be removed when the system is completely powered down. $P_{ind}$ is a constant that represents the frequency-independent dynamic power, that is, power that is independent of CPU processing speed. $P_d$ denotes frequency-dependent power, including the power primarily consumed by the CPU and any power that depends on the system processing frequency $f$. $h$ denotes the system state: $h = 1$ means the system is active and the application is executing; $h = 0$ means the system is in sleep mode or powered down. $C_{ef}$ denotes the effective capacitance, and $m$ denotes the dynamic power exponent, which is no smaller than 2. $C_{ef}$ and $m$ are constants related to the processor system.
Our study considers the active state of the system ($h = 1$), so dynamic power consumption is the main part of the whole energy consumption. Because static power consumption cannot be managed, this article, like references [20][21][22][23], does not consider it. Therefore, the system power consumption in this article becomes
$$P = P_{ind} + P_d = P_{ind} + C_{ef} f^m \tag{2}$$
Due to the heterogeneity of processors, each processor should have its own parameters. Assuming for now that the frequency of the processor core $u_k$ ranges from the lowest frequency $f_{k,\min}$ to the maximum frequency $f_{k,\max}$, we define the following sets of parameters:

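• The set of $P_{ind}$: $\{P_{1,ind}, P_{2,ind}, \ldots, P_{|U|,ind}\}$;
• The set of $P_d$: $\{P_{1,d}, P_{2,d}, \ldots, P_{|U|,d}\}$;
• The set of $C_{ef}$: $\{C_{1,ef}, C_{2,ef}, \ldots, C_{|U|,ef}\}$;
• The set of $m$: $\{m_1, m_2, \ldots, m_{|U|}\}$;
• The set of execution frequencies: $\{\{f_{1,\min}, \ldots, f_{1,\max}\}, \{f_{2,\min}, \ldots, f_{2,\max}\}, \ldots, \{f_{|U|,\min}, \ldots, f_{|U|,\max}\}\}$.
The execution time of the task $n_i$ on the processor core $u_k$ with the frequency $f_{k,h}$ can be obtained by the following equation:
$$w_{i,(k,h)} = w_{i,k} \times \frac{f_{k,\max}}{f_{k,h}} \tag{3}$$
Then the energy consumption $E(n_i, u_k, f_{k,h})$ of the task $n_i$ on the processor core $u_k$ with the frequency $f_{k,h}$ can be obtained by the following equation:
$$E(n_i, u_k, f_{k,h}) = \left(P_{k,ind} + C_{k,ef} \times f_{k,h}^{m_k}\right) \times w_{i,k} \times \frac{f_{k,\max}}{f_{k,h}} \tag{4}$$
Therefore, the energy consumption of application $G$ will be
$$E(G) = \sum_{i=1}^{|N|} E(n_i, u_{pr(i)}, f_{pr(i),hz(i)}) \tag{5}$$
Because of $P_{ind}$, $E$ is not monotonic in $f$, and a lower $f$ does not always result in less energy. Therefore, we can obtain the energy-efficient frequency by finding the minimum of Equation (4) with respect to the frequency. Similar to [20][21][22][23], we denote this energy-efficient frequency as $f_{ee}$. Setting the derivative of Equation (4) with respect to the frequency to zero gives
$$f_{ee} = \sqrt[m]{\frac{P_{ind}}{(m - 1)\, C_{ef}}}$$
When the execution frequency is less than $f_{ee}$, it is meaningless to continue to reduce the frequency, because this will increase energy consumption. Therefore, the execution frequency of $u_k$ varies in the range $f_{k,low} \le f_{k,h} \le f_{k,\max}$, where $f_{k,low} = \max(f_{k,\min}, f_{k,ee})$. The new set of execution frequencies becomes $\{\{f_{1,low}, \ldots, f_{1,\max}\}, \{f_{2,low}, \ldots, f_{2,\max}\}, \ldots, \{f_{|U|,low}, \ldots, f_{|U|,\max}\}\}$.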
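The energy model above can be checked numerically with a short Python sketch. The function names and the core parameters ($P_{ind} = 0.05$, $C_{ef} = 1.0$, $m = 3$, $f_{\max} = 1$) are illustrative assumptions. The sketch evaluates the execution time and task energy at a given frequency and compares the closed-form energy-efficient frequency with a brute-force search over a 0.01 grid.

```python
def exec_time(w_max, f, f_max=1.0):
    """Execution time at frequency f, given the time w_max measured at f_max."""
    return w_max * f_max / f


def energy(w_max, f, p_ind, c_ef, m, f_max=1.0):
    """Task energy: power (P_ind + C_ef * f^m) multiplied by the execution time."""
    return (p_ind + c_ef * f ** m) * exec_time(w_max, f, f_max)


def f_ee(p_ind, c_ef, m):
    """Closed-form energy-efficient frequency (minimizer of energy over f)."""
    return (p_ind / ((m - 1) * c_ef)) ** (1.0 / m)


def f_low(f_min, p_ind, c_ef, m):
    """Lowest useful frequency: never go below the energy-efficient frequency."""
    return max(f_min, f_ee(p_ind, c_ef, m))


# Hypothetical core parameters, chosen inside the ranges used in the experiments.
p_ind, c_ef, m = 0.05, 1.0, 3.0
fe = f_ee(p_ind, c_ef, m)

# Brute-force check on a 0.01 grid: the grid minimizer should sit next to fe.
grid = [k / 100 for k in range(1, 101)]
best = min(grid, key=lambda f: energy(14, f, p_ind, c_ef, m))
print(round(fe, 4), best)
```

The brute-force search is only a sanity check; a scheduler would use the closed form once per core and then restrict the discrete frequency set to values at or above `f_low`.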

Problem Description
The problem to be solved in this study is to assign a suitable frequency and processor core to each task, and minimize the schedule length of the application under the condition that the energy consumption of the application does not exceed the energy consumption constraint [35][36][37][38].
First, we define the earliest start time (EST) and the earliest finish time (EFT) of tasks. Given a task $n_i$ executed on processor core $u_k$, its earliest start time is denoted as $EST(n_i, u_k)$, which is computed as
$$EST(n_i, u_k) = \begin{cases} 0, & n_i = n_{entry} \\ \max\left\{ avail[k],\; \max_{n_j \in pred(n_i)} \left\{ AFT(n_j) + c'_{j,i} \right\} \right\}, & \text{otherwise} \end{cases}$$
where $avail[k]$ is the earliest available time of processor core $u_k$, that is, the time at which all tasks previously assigned to $u_k$ have completed and $u_k$ is ready to execute new tasks. $AFT(n_j)$ is the actual finish time of task $n_j$. $c'_{j,i}$ is the actual communication time from $n_j$ to $n_i$: if $n_j$ and $n_i$ are assigned to the same processor core, $c'_{j,i} = 0$; otherwise, $c'_{j,i} = c_{j,i}$. The earliest finish time of task $n_i$ executed on processor core $u_k$ with frequency $f_{k,h}$ is the earliest start time plus the execution time of task $n_i$, which is computed as
$$EFT(n_i, u_k, f_{k,h}) = EST(n_i, u_k) + w_{i,k} \times \frac{f_{k,\max}}{f_{k,h}}$$
We define $SL(G)$ as the schedule length of the application, where $SL(G) = AFT(n_{exit})$.
We define $E_{given}(G)$ as the given energy consumption constraint of application $G$. Therefore, the problem to solve can be expressed as minimizing $SL(G)$ subject to
$$E(G) = \sum_{i=1}^{|N|} E(n_i, u_{pr(i)}, f_{pr(i),hz(i)}) \le E_{given}(G) \tag{8}$$
where $u_{pr(i)}$ denotes the processor core assigned to the task $n_i$, and $f_{pr(i),hz(i)}$ denotes the execution frequency assigned to the task $n_i$.
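As an illustration of the EST and EFT definitions, the following Python sketch evaluates both for one unscheduled task, given the state left by already scheduled predecessors. The data layout (`pred`, `aft`, `core_of`, `avail` as plain dictionaries and lists) and all numbers are hypothetical.

```python
def est(i, k, pred, aft, core_of, c, avail):
    """Earliest start time of task i on core k."""
    ready = 0.0
    for j in pred.get(i, []):
        # Communication is free when the predecessor ran on the same core.
        comm = 0.0 if core_of[j] == k else c[(j, i)]
        ready = max(ready, aft[j] + comm)
    return max(avail[k], ready)


def eft(i, k, f, w, f_max, pred, aft, core_of, c, avail):
    """Earliest finish time: EST plus the execution time at frequency f."""
    return est(i, k, pred, aft, core_of, c, avail) + w[i][k] * f_max[k] / f


# Hypothetical state: tasks 1 and 2 are scheduled, task 3 depends on both.
pred = {3: [1, 2]}
core_of = {1: 0, 2: 1}          # task -> core index
aft = {1: 14.0, 2: 30.0}        # actual finish times
c = {(1, 3): 12.0, (2, 3): 5.0}
avail = [14.0, 30.0]            # earliest available time per core
w = {3: [11.0, 13.0]}           # execution times at maximum frequency
f_max = [1.0, 1.0]

state = (pred, aft, core_of, c, avail)
print(est(3, 0, *state), est(3, 1, *state))
print(eft(3, 0, 1.0, w, f_max, *state), eft(3, 1, 0.5, w, f_max, *state))
```

In this example core 1 offers the earlier start because the slower predecessor already ran there, which shows why the scheduler has to evaluate every core rather than simply the fastest one.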

Effective Range of Energy Consumption Constraint
Since the execution time of each task on each processor core is known, we can obtain the minimum and maximum energy consumption of $n_i$, denoted by $E_{min}(n_i)$ and $E_{max}(n_i)$ respectively, by traversing all processor cores. $E_{min}(n_i)$ and $E_{max}(n_i)$ correspond to executing task $n_i$ at the minimum and maximum frequencies, respectively:
$$E_{min}(n_i) = \min_{u_k \in U} E(n_i, u_k, f_{k,low}) \tag{9}$$
$$E_{max}(n_i) = \max_{u_k \in U} E(n_i, u_k, f_{k,\max}) \tag{10}$$
Therefore, the minimum and maximum energy consumption of application $G$ can be computed as follows:
$$E_{min}(G) = \sum_{i=1}^{|N|} E_{min}(n_i) \tag{11}$$
$$E_{max}(G) = \sum_{i=1}^{|N|} E_{max}(n_i) \tag{12}$$
It should be noted that the given energy consumption constraint has a reasonable range. If $E_{given}(G) < E_{min}(G)$, the energy consumption constraint can never be satisfied; if $E_{given}(G) > E_{max}(G)$, the energy constraint can always be met. Both situations are unreasonable, so the reasonable range of the given energy consumption constraint is $E_{min}(G) \le E_{given}(G) \le E_{max}(G)$.
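The range check described above can be sketched in Python as follows. The dictionary keys (`p_ind`, `c_ef`, `m`, `f_low`, `f_max`), the helper names and the two-core, two-task example are assumptions made for illustration only.

```python
def energy(w_max, f, core):
    """Energy of a task with max-frequency time w_max on a core at frequency f."""
    return (core["p_ind"] + core["c_ef"] * f ** core["m"]) * w_max * core["f_max"] / f


def task_bounds(w_row, cores):
    """(E_min, E_max) of one task, obtained by traversing all cores."""
    e_min = min(energy(w, c["f_low"], c) for w, c in zip(w_row, cores))
    e_max = max(energy(w, c["f_max"], c) for w, c in zip(w_row, cores))
    return e_min, e_max


def application_bounds(w, cores):
    """(E_min(G), E_max(G)): sums of the per-task bounds."""
    bounds = [task_bounds(row, cores) for row in w]
    return sum(b[0] for b in bounds), sum(b[1] for b in bounds)


def is_reasonable(e_given, w, cores):
    """True when E_min(G) <= E_given(G) <= E_max(G)."""
    e_min_g, e_max_g = application_bounds(w, cores)
    return e_min_g <= e_given <= e_max_g


# Hypothetical platform and workload.
cores = [
    {"p_ind": 0.05, "c_ef": 1.0, "m": 3.0, "f_low": 0.3, "f_max": 1.0},
    {"p_ind": 0.04, "c_ef": 0.8, "m": 2.5, "f_low": 0.3, "f_max": 1.0},
]
w = [[14.0, 16.0], [13.0, 19.0]]
lo, hi = application_bounds(w, cores)
print(lo, hi, is_reasonable((lo + hi) / 2, w, cores))
```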

Task Priority Determination
Before scheduling, we need to determine the priority of tasks. Similar to [20][21][22][23], we use the upward rank value ($rank_u$) as the criterion to determine the priority of tasks. $rank_u$ is defined as follows:
$$rank_u(n_i) = \overline{w_i} + \max_{n_j \in succ(n_i)} \left\{ c_{i,j} + rank_u(n_j) \right\} \tag{13}$$
where $\overline{w_i} = \left(\sum_{k=1}^{|U|} w_{i,k}\right) / |U|$ is the average execution time of $n_i$ over all processor cores, and $rank_u(n_{exit}) = \overline{w_{exit}}$. Tasks are sorted in descending order of $rank_u$, that is, the higher the $rank_u$ of a task, the higher its priority. Table 3 shows the upward rank values of all the tasks in Figure 1. Therefore, the task priority list of the application in Figure 1 is $\{n_1, n_3, n_4, n_2, n_5, n_6, n_9, n_7, n_8, n_{10}\}$.
Table 3. Upward rank values for tasks of the application in Figure 1.
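A minimal Python sketch of the upward rank computation follows. It uses the common HEFT form of $rank_u$ (average execution time plus the most expensive path to the exit task); the function name `upward_ranks` and the small example DAG are hypothetical and do not reproduce Table 3.

```python
from functools import lru_cache


def upward_ranks(w, c):
    """Upward rank of every task.

    w[i] lists the execution times of task i on each core at max frequency;
    c[(i, j)] is the communication time on edge n_i -> n_j.
    """
    succ = {}
    for (i, j) in c:
        succ.setdefault(i, []).append(j)

    @lru_cache(maxsize=None)
    def rank(i):
        avg = sum(w[i]) / len(w[i])
        # The exit task has no successors, so its rank is its average time.
        tail = max((c[(i, j)] + rank(j) for j in succ.get(i, [])), default=0.0)
        return avg + tail

    return {i: rank(i) for i in w}


# Hypothetical 4-task diamond on 2 cores.
w = {1: [14, 16], 2: [13, 19], 3: [11, 13], 4: [21, 7]}
c = {(1, 2): 18, (1, 3): 12, (2, 4): 19, (3, 4): 20}

ranks = upward_ranks(w, c)
priority = sorted(ranks, key=ranks.get, reverse=True)
print(ranks, priority)
```

Sorting by descending rank always places a task before its successors, because each rank strictly exceeds the rank of every successor when execution times are positive.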

The ISAECC Method
In this subsection, we review the existing method that is closest to us and reveal its limitations.

Method Description
The ISAECC method is proposed in [23], and it consists of several major steps:
1. It prioritizes tasks by using the upward rank value, which is defined in Section 3.3.3;
2. It uses a self-defined energy consumption weight value to give each unscheduled task a pre-allocated energy consumption;
3. It calculates the energy consumption constraint of each task according to the given energy consumption constraint of the whole application and the pre-allocated energy consumption value of each task;
4. Following the order of the task priority list, it assigns each task the processor core and frequency that minimize its EFT by traversing each processor core and each available execution frequency.
For the sake of generality, we use $n_{o(1)}, n_{o(2)}, \ldots, n_{o(|N|)}$ to represent the task priority order.
Assuming that the currently scheduled task is $n_{o(j)}$, the scheduled task set is $\{n_{o(1)}, n_{o(2)}, \ldots, n_{o(j-1)}\}$ and the unscheduled task set is $\{n_{o(j+1)}, n_{o(j+2)}, \ldots, n_{o(|N|)}\}$. Therefore, when scheduling the task $n_{o(j)}$, the overall energy consumption of application $G$ can be expressed as
$$E(G) = \sum_{x=1}^{j-1} E(n_{o(x)}, u_{pr(o(x))}, f_{pr(o(x)),hz(o(x))}) + E(n_{o(j)}, u_k, f_{k,h}) + \sum_{y=j+1}^{|N|} E_{pre}(n_{o(y)}) \tag{14}$$
where $E_{pre}(n_{o(y)})$ denotes the pre-allocated energy consumption of task $n_{o(y)}$. According to the energy consumption constraint shown in Equation (8), we get
$$\sum_{x=1}^{j-1} E(n_{o(x)}, u_{pr(o(x))}, f_{pr(o(x)),hz(o(x))}) + E(n_{o(j)}, u_k, f_{k,h}) + \sum_{y=j+1}^{|N|} E_{pre}(n_{o(y)}) \le E_{given}(G) \tag{15}$$
Therefore,
$$E(n_{o(j)}, u_k, f_{k,h}) \le E_{given}(G) - \sum_{x=1}^{j-1} E(n_{o(x)}, u_{pr(o(x))}, f_{pr(o(x)),hz(o(x))}) - \sum_{y=j+1}^{|N|} E_{pre}(n_{o(y)}) \tag{16}$$
Let the energy consumption constraint of task $n_{o(j)}$ be
$$E_{given}(n_{o(j)}) = E_{given}(G) - \sum_{x=1}^{j-1} E(n_{o(x)}, u_{pr(o(x))}, f_{pr(o(x)),hz(o(x))}) - \sum_{y=j+1}^{|N|} E_{pre}(n_{o(y)}) \tag{17}$$
With Equation (17), we only need to consider the energy consumption constraint of each task, which is
$$E(n_{o(j)}, u_k, f_{k,h}) \le E_{given}(n_{o(j)}) \tag{18}$$
Therefore, the key problem is how to determine the pre-allocated energy consumption $E_{pre}(n_{o(j)})$ of each task. The central idea of ISAECC is to pre-allocate the energy consumption of unscheduled tasks by a weight mechanism. First, they define the improvable energy of application $G$, called $E_{ie}(G)$. Then, they define $E_{ave}(n_i)$ and $E_{ave}(G)$ as the energy consumption level of task $n_i$ and the energy consumption level of application $G$. Next, they define $el(n_i)$ as the weight of the energy consumption level of task $n_i$, and from it they calculate the pre-allocated energy consumption $E_{pre}(n_i)$ of task $n_i$. After the pre-allocated energy consumption of each task is determined, the task scheduling can be completed according to steps 3 and 4 described at the beginning of this section (Section 3.4.1).
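The per-task energy constraint derived above reduces to simple bookkeeping, as the following Python sketch shows. The function name `task_energy_budget`, the 0-based position `j` and the example numbers are assumptions for illustration; the pre-allocated values here are arbitrary and do not come from any particular pre-allocation rule.

```python
def task_energy_budget(j, order, e_given_g, e_actual, e_pre):
    """Energy constraint of the task at 0-based position j of the priority list.

    e_actual holds the energy of already scheduled tasks (positions < j);
    e_pre holds the pre-allocated energy of unscheduled tasks (positions > j).
    """
    scheduled = sum(e_actual[t] for t in order[:j])
    unscheduled = sum(e_pre[t] for t in order[j + 1:])
    return e_given_g - scheduled - unscheduled


# Hypothetical run: 4 tasks, application constraint of 100 energy units.
order = [1, 2, 3, 4]
e_pre = {1: 20.0, 2: 20.0, 3: 20.0, 4: 20.0}

budget_first = task_energy_budget(0, order, 100.0, {}, e_pre)
# Suppose the first task actually consumed 30 units once it was scheduled.
budget_second = task_energy_budget(1, order, 100.0, {1: 30.0}, e_pre)
print(budget_first, budget_second)
```

Any energy a task leaves unused flows automatically into the budget of the next task, which is why the choice of pre-allocated values decides how the slack is distributed along the priority list.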

Limitations of ISAECC
ISAECC solves the problem of increased schedule length caused by the unfair energy constraint allocation to low priority tasks in MSLECC [20]. However, we find that treating each task fairly is not the best practice, because tasks in different levels have different impacts on the whole application. For example, in Figure 1, tasks $n_1$ and $n_{10}$ should be assigned more energy than other tasks, because their execution time directly affects the overall schedule length of the application $G$. If the execution time of $n_1$ or $n_{10}$ is shortened or lengthened with all other conditions unchanged, the overall schedule length is shortened or lengthened by the same amount. Other tasks such as $n_2$ behave differently: if the execution time of $n_2$ is shortened or lengthened with all other conditions unchanged, the schedule length of application $G$ may change only slightly, or not at all. Therefore, we should consider the levels of the tasks and the number of tasks in each level when pre-allocating the energy consumption of tasks. In addition, previous studies ignored the negative impact of the locally optimal nature of the scheduling algorithm on the schedule length. Our design addresses both problems.

Our Solution
In this section, we introduce our new strategy in detail. First, we introduce a new task energy pre-allocation method considering task level (Section 4.1). Then, we give a new task scheduling algorithm to minimize the schedule length under the constraint of energy consumption (Section 4.2). Finally, we describe the task execution frequency re-adjustment mechanism we added after getting the preliminary scheduling results (Section 4.3).

The New Task Energy Pre-Allocation Method
The central idea of our pre-allocation method is to consider the levels of tasks. Tasks in different levels have a different impact on the overall schedule length of an application. For example, in Figure 1, if we shorten the execution time of task $n_1$ by increasing its execution frequency, the schedule length of application $G$ will certainly be shortened by the corresponding time; but if we shorten the execution time of task $n_2$, the schedule length of application $G$ is unlikely to change much, because there are still many tasks in a similar position to it. Therefore, a more reasonable approach is to pre-allocate more energy consumption to task $n_1$ in the case of Figure 1. Based on this idea, we put forward a new energy pre-allocation method based on the weight value of energy consumption and the levels of tasks.

Method Description
We denote the level of task $n_i$ in the DAG by $L(n_i)$, and we define $N_l = \{n_i, n_j, \ldots\}$ as the set of tasks contained in level $l$, where $L(n_i) = L(n_j) = \ldots = l$. The number of tasks in level $l$ is $|N_l|$. The maximum task level is $L(n_{exit})$.
We define the improvable energy of application $G$, $E_{ie}(G)$, and the energy consumption level of task $n_i$, $E_{ave}(n_i)$, in the same way as ISAECC. The energy consumption level of task level $l$ is the sum of the energy consumption levels of the tasks in level $l$:
$$E_{ave}(N_l) = \sum_{n_i \in N_l} E_{ave}(n_i)$$
We further define $el(n_{i,l})$ as the energy consumption weight of task $n_i$ within its level $l$, $E_{var}(N_l)$ as the variation energy consumption of $N_l$, $E_{var}(G)$ as the corresponding variation energy consumption of application $G$, $el(N_l)$ as the energy consumption weight of $N_l$ in application $G$, and $E_{ie}(N_l)$ as the improvable energy of $N_l$. From these quantities, the new energy pre-allocation formula of $n_i$ is
$$E_{pre}(n_i) = \min\{E_{wa}(n_i), E_{max}(n_i)\}$$
where $E_{wa}(n_i)$ is the weight-based allocation of $n_i$ built from the quantities above.
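The grouping of tasks into levels can be sketched in Python as below. The sketch assumes one common convention, which is not confirmed by the text above: the entry task has level 1, and every other task is one level below its deepest direct predecessor. Under this assumption the exit task receives the maximum level. The function name `task_levels` and the example DAG are hypothetical.

```python
def task_levels(tasks, c):
    """Return (level per task, tasks grouped by level).

    Assumed convention: level 1 for tasks without predecessors, otherwise
    1 + the maximum level among the direct predecessors.
    """
    pred = {i: [] for i in tasks}
    for (i, j) in c:
        pred[j].append(i)

    level = {}

    def compute(i):
        if i not in level:
            level[i] = 1 if not pred[i] else 1 + max(compute(j) for j in pred[i])
        return level[i]

    for i in tasks:
        compute(i)

    groups = {}
    for i in sorted(level):
        groups.setdefault(level[i], []).append(i)
    return level, groups


# Hypothetical 5-task DAG: 1 -> {2, 3}, {2, 3} -> 4, 4 -> 5.
c = {(1, 2): 18, (1, 3): 12, (2, 4): 19, (3, 4): 20, (4, 5): 7}
level, groups = task_levels([1, 2, 3, 4, 5], c)
single_task_levels = [l for l, members in groups.items() if len(members) == 1]
print(level, groups, single_task_levels)
```

Levels that contain a single task are of particular interest, because every path from the entry task to the exit task passes through such a task in a strictly levelled graph.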

Feasibility of the Task Energy Pre-Allocation Mechanism
In order to prove the feasibility of our method, we need to prove the following theorem: Given an application $G$, if the unscheduled tasks are pre-allocated energy consumption according to our method, then each task $n_{o(j)}$ can satisfy Equation (15).
We use mathematical induction to prove the above theorem. First, we prove that task $n_{o(1)}$ can satisfy Equation (15) while all other tasks are unscheduled. By Equations (14) and (22)-(29), we can find at least one case in which the constraint holds: at least when $E(n_{o(1)}) = E_{min}(n_{o(1)})$, task $n_{o(1)}$ satisfies Equation (15). Then, we assume that task $n_{o(j)}$ can satisfy Equation (15), that is, $E(n_{o(j)}) \le E_{given}(n_{o(j)})$. Next, we prove that task $n_{o(j+1)}$ can satisfy Equation (15). From the definition of the per-task energy constraint, moving from $n_{o(j)}$ to $n_{o(j+1)}$ replaces the pre-allocated energy of $n_{o(j+1)}$ by the actual energy of $n_{o(j)}$, so
$$E_{given}(n_{o(j+1)}) = E_{given}(n_{o(j)}) - E(n_{o(j)}) + E_{pre}(n_{o(j+1)}) \ge E_{pre}(n_{o(j+1)}).$$
From Equations (28) and (30), we have
$$E_{pre}(n_{o(j+1)}) = \min\{E_{wa}(n_{o(j+1)}), E_{max}(n_{o(j+1)})\} \ge E_{min}(n_{o(j+1)}).$$
Therefore, at least when $E(n_{o(j+1)}) = E_{min}(n_{o(j+1)})$, task $n_{o(j+1)}$ satisfies Equation (15). In summary, given an application $G$, if the unscheduled tasks are pre-allocated energy consumption according to our method, then each task $n_{o(j)}$ can satisfy Equation (15). The feasibility of our method has been proved.

The Proposed Algorithm for Minimizing Schedule Length
In this section, we show our new task scheduling algorithm in Algorithm 1. In the algorithm, Line 1 is to prioritize tasks in the input application; Lines 2-10 calculate some required values for each task, each level and the application G; Lines 11 and 12 calculate the pre-allocation energy consumption of each task; Lines 13-26 are to select processor and frequency for each task; Lines 27 and 28 are to calculate the actual energy consumption E(G) and the final schedule length SL(G).
For each task, selecting the processor with the minimum EFT has complexity $O(|N| \times |U| \times |F|)$, where $|F|$ represents the maximum number of discrete frequencies from $f_{k,low}$ to $f_{k,\max}$. Therefore, the complexity of Algorithm 1 is $O(|N|^2 \times |U| \times |F|)$, the same as ISAECC in [23].
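The processor and frequency selection step can be illustrated with the following Python sketch. It is not a reproduction of Algorithm 1: it only shows the inner double loop over cores and discrete frequencies for a single task, under the assumption that the earliest start time on each core (`est_row`) and the energy budget of the task (`budget`) have already been computed. All names and numbers are hypothetical.

```python
def select_core_and_frequency(w_row, est_row, budget, cores):
    """Pick the (core, frequency) pair with minimum EFT that fits the budget.

    Returns (eft, core_index, frequency, energy) or None if nothing fits.
    """
    best = None
    for k, core in enumerate(cores):
        for f in core["freqs"]:
            t = w_row[k] * core["f_max"] / f                    # execution time at f
            e = (core["p_ind"] + core["c_ef"] * f ** core["m"]) * t
            if e > budget:
                continue                                        # violates the task constraint
            eft = est_row[k] + t
            if best is None or eft < best[0]:
                best = (eft, k, f, e)
    return best


# Hypothetical 2-core platform with three discrete frequencies per core.
cores = [
    {"p_ind": 0.05, "c_ef": 1.0, "m": 3.0, "f_max": 1.0, "freqs": [0.5, 0.75, 1.0]},
    {"p_ind": 0.05, "c_ef": 0.8, "m": 3.0, "f_max": 1.0, "freqs": [0.5, 0.75, 1.0]},
]
w_row = [10.0, 12.0]
est_row = [0.0, 0.0]

tight = select_core_and_frequency(w_row, est_row, 7.0, cores)
loose = select_core_and_frequency(w_row, est_row, 20.0, cores)
print(tight, loose)
```

With the tighter budget the maximum frequency is no longer affordable on either core, so the task is slowed down; this is the mechanism through which the pre-allocation strategy shapes the final schedule length.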

The Task Execution Frequency Re-Adjustment Mechanism
Through the new task energy pre-allocation method in Section 4.1 and the new task scheduling algorithm in Section 4.2, we obtain the preliminary scheduling results. Other methods, such as those in [20][21][22][23], generally end here, but they do not exploit the fact that the scheduling results can be optimized to further shorten the schedule length. Like the scheduling algorithms in [20][21][22][23], the algorithm in Section 4.2 makes each task finish as soon as possible, which is not entirely reasonable: finishing some tasks early cannot shorten the overall schedule length, but it does consume more energy. Therefore, we introduce the concept of the latest finish time (LFT) of tasks [28,39,40], which is defined as follows:
$$LFT(n_i) = \begin{cases} AFT(n_{exit}), & n_i = n_{exit} \\ \min\left\{ \min_{n_j \in succ(n_i)} \left\{ AST(n_j) - c'_{i,j} \right\},\; AST(n_{dn(i)}) \right\}, & \text{otherwise} \end{cases}$$
where $n_{dn(i)}$ represents the downward neighbor task of $n_i$, that is, the first task executed after $n_i$ on the same processor core, and $c'_{i,j}$ is the actual communication time from $n_i$ to $n_j$ (zero if both tasks run on the same core). Through the LFT, we can replace the AFT of tasks with their LFT to delay the finish time of some tasks without increasing the schedule length. As the execution time of a task is prolonged, its running frequency is reduced accordingly, which saves some energy. The new execution time of task $n_i$ changes from $AFT(n_i) - AST(n_i)$ to $LFT(n_i) - AST(n_i)$, and the new frequency of task $n_i$ becomes
$$f_{pr(i),nhz(i)} = f_{pr(i),hz(i)} \times \frac{AFT(n_i) - AST(n_i)}{LFT(n_i) - AST(n_i)}$$
The frequency range of task $n_i$ is $[f_{pr(i),low}, f_{pr(i),\max}]$, so $f_{pr(i),nhz(i)}$ should be
$$f_{pr(i),nhz(i)} = \max\left\{ f_{pr(i),low},\; f_{pr(i),hz(i)} \times \frac{AFT(n_i) - AST(n_i)}{LFT(n_i) - AST(n_i)} \right\}$$
Therefore, the actual execution time (AET) of task $n_i$ will be
$$AET(n_i) = w_{i,pr(i)} \times \frac{f_{pr(i),\max}}{f_{pr(i),nhz(i)}}$$
Finally, the new $AST(n_i)$ should be updated as
$$AST(n_i) = LFT(n_i) - AET(n_i)$$
On the basis of the above equations, the algorithm to save energy after preliminary scheduling is shown in Algorithm 2. In Line 2, we reorder the tasks in descending order of their actual finish time (AFT) according to the scheduling results of Algorithm 1. In Lines 4-10, $AFT(n_i)$, $AST(n_i)$ and $f_{pr(i),hz(i)}$ are updated. In Line 11, we compute the new $E(n_i)$ using the updated values. In Lines 12 and 13, we compute the saved energy $E_{save}(G)$.
Algorithm 2, Lines 4-14:
4: $n_i$ = tl.out();
5: Compute $LFT(n_i)$; //By (33)
6: Compute the new frequency $f_{pr(i),nhz(i)}$; //By (34)
7: Compute the new $AET(n_i)$; //By (35)
8: Update $AFT(n_i) \leftarrow LFT(n_i)$;
9: Update $AST(n_i) \leftarrow LFT(n_i) - AET(n_i)$; //By (36)
10: Update $f_{pr(i),hz(i)} \leftarrow f_{pr(i),nhz(i)}$;
11: Compute the new $E(n_i)$; //By (4)
12: Compute the new energy consumption of $G$, $E_{new}(G)$; //By (5)
13: Compute the saved energy $E_{save}(G) = E(G) - E_{new}(G)$;
14: return $E_{save}(G)$.
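The per-task update performed in Lines 5-10 can be sketched in Python as follows, assuming the latest finish time of the task is already known. The function name `readjust` and the numbers are hypothetical. As a simplification, the sketch treats the frequency as continuous; with the discrete frequencies used in the experiments, the new frequency would have to be rounded up to the next available value so that the task still finishes by its LFT.

```python
def readjust(ast, aft, lft, f_old, f_low, w_max, f_max):
    """Stretch one task so that it finishes at its latest finish time.

    Returns the updated (AST, AFT, frequency).
    """
    # Execution time is inversely proportional to frequency, so stretching the
    # time window from (AFT - AST) to (LFT - AST) scales the frequency down.
    f_new = f_old * (aft - ast) / (lft - ast)
    f_new = max(f_low, f_new)                 # never drop below the lowest useful frequency
    aet = w_max * f_max / f_new               # actual execution time at the new frequency
    return lft - aet, lft, f_new


# Hypothetical task: 10 time units at f_max = 1.0, scheduled in [0, 10].
moderate = readjust(0.0, 10.0, 20.0, 1.0, 0.3, 10.0, 1.0)
clamped = readjust(0.0, 10.0, 50.0, 1.0, 0.3, 10.0, 1.0)
print(moderate, clamped)
```

In the second call the slack is larger than the lowest useful frequency can absorb, so the frequency is clamped and the task starts later instead of running more slowly.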
Generally speaking, after the task scheduling is completed, some of the given energy remains unused, namely $E_{given}(G) - E(G)$. Therefore, the energy consumption that can be reused is
$$E_{reu}(G) = E_{given}(G) - E(G) + E_{save}(G)$$
After calculating the energy consumption that can be reused, we need to re-allocate it to the tasks that directly affect the overall schedule length. A task directly affects the overall schedule length if any change in its running time changes the overall schedule length by the same amount. For example, for task $n_1$ in Figure 1, the total schedule length is shortened or extended by exactly as much as its execution time is shortened or extended. However, due to the diversity of task models, it is very difficult to find out which tasks directly affect the overall schedule length. Therefore, we adopt a more direct way: if the application model is strictly levelled (the tasks in level $l$ only communicate with the tasks in the adjacent levels $l+1$ and $l-1$), we re-allocate the reused energy consumption $E_{reu}(G)$ to the tasks that have not reached the highest execution frequency in the levels with $|N_l| = 1$; otherwise, we only re-allocate the reused energy consumption to $n_{entry}$ and $n_{exit}$. For ease of description, we define $N_{das}$ as the set of tasks that can directly affect the overall schedule length.
After the recipients of $E_{reu}(G)$ are determined, we need to determine the allocation proportion. The maximum energy consumption of $n_i \in N_{das}$ on its processor core $u_{pr(i)}$ is $E(n_i, u_{pr(i)}, f_{pr(i),\max})$. Therefore, the growable energy consumption of $n_i \in N_{das}$ on processor core $u_{pr(i)}$ is
$$E_{gro}(n_i) = E(n_i, u_{pr(i)}, f_{pr(i),\max}) - E(n_i)$$
We take the values of $E_{gro}(n_i)$ as the allocation proportion of $E_{reu}(G)$. Therefore, the reused energy consumption of $n_i \in N_{das}$ is
$$E_{reu}(n_i) = E_{reu}(G) \times \frac{E_{gro}(n_i)}{\sum_{n_j \in N_{das}} E_{gro}(n_j)}$$
Adding $E_{reu}(n_i)$ to the $E(n_i)$ computed in Algorithm 2, the energy consumption that $n_i \in N_{das}$ can use is $E_{use}(n_i) = E_{reu}(n_i) + E(n_i)$.
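The proportional split of the reusable energy can be written as a few lines of Python. The function name `reallocate`, the dictionary layout and the numbers are hypothetical; the guard for the case in which no task can grow is an added assumption that the text above does not discuss.

```python
def reallocate(e_reu_g, e_now, e_cap):
    """Usable energy per task in N_das after sharing out the reusable energy.

    e_now[t]: current energy of task t; e_cap[t]: its energy at maximum
    frequency on its assigned core. Shares are proportional to the growable
    energy e_cap[t] - e_now[t].
    """
    gro = {t: e_cap[t] - e_now[t] for t in e_now}
    total = sum(gro.values())
    if total <= 0:
        return dict(e_now)            # every task already runs at its maximum frequency
    return {t: e_now[t] + e_reu_g * gro[t] / total for t in e_now}


# Hypothetical N_das = {n_1, n_10} with 4 units of reusable energy.
e_now = {1: 4.0, 10: 6.0}
e_cap = {1: 10.0, 10: 8.0}
e_use = reallocate(4.0, e_now, e_cap)
print(e_use)
```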
According to E use (n i ), we can find the new frequency f pr(i),nh of n i ∈ N das by traversing the execution frequency on processor core u pr(i) . Then we can compute the shortened actual execution time of n i ∈ N das as follows: Therefore, the new length of the application G will be Combined with the Algorithms 1 and 2, the task execution frequency re-adjustment mechanism is shown in Algorithm 3. Lines 1 and 2 call Algorithms 1 and 2 to get the preliminary scheduling results and E save (G). Line 3 compute the reused energy consumption E reu (G). Lines 6 and 7 calculate some required values for each task belonging to N das . Lines 8-14 are to select the new frequency for each task belonging to N das . Finally, we compute the new schedule length of the application G SL new (G) after the task execution frequency re-adjustment mechanism in Line 15. The value of SL new (G) is the final minimum schedule length we get.
In general, our method consists of Algorithms 1-3. Algorithm 1 is a task energy pre-allocation strategy, and Algorithms 2 and 3 constitute the task execution frequency re-adjustment mechanism. For a given application, we run Algorithms 1, 2 and 3 in this order to obtain the scheduling result that minimizes the schedule length of the application.

Algorithm 3. The task execution frequency re-adjustment mechanism.

Experiments
In this section, we compare our proposed method with four algorithms that pursue the same goal as this article: MSLECC [20], WALECC [21], EECC [22] and ISAECC [23]. The experimental platform is an AMD Ryzen 5 2500U CPU @ 2.00 GHz (Santa Clara, CA, USA) with 8 GB RAM, running 64-bit Windows 10 Home Edition. The code is mainly implemented in C and scripts. The final schedule length $SL(G)$ is the only evaluation criterion for these algorithms.
The parameters of the processors and applications are as follows: 10 ms ≤ $w_{i,k}$ ≤ 100 ms, 10 ms ≤ $c_{i,j}$ ≤ 100 ms, 0.03 ≤ $P_{k,ind}$ ≤ 0.07, 0.8 ≤ $C_{k,ef}$ ≤ 1.2, 2.5 ≤ $m_k$ ≤ 3.0, and $f_{k,max}$ = 1 GHz. The execution frequency is discrete, with a precision of 0.01 GHz. The simulated heterogeneous platform used to test the schedule length minimization problem has four processor cores.
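A simulated platform with these parameter ranges could be generated along the following lines. The text does not state how values are drawn within each range, so the uniform distribution, the fixed seed and the function name `sample_platform` are assumptions made for this sketch; communication times $c_{i,j}$ would be drawn in the same way for each edge of the DAG.

```python
import random

def sample_platform(num_cores=4, num_tasks=10, seed=0):
    """Draw core parameters and execution times within the stated ranges."""
    rng = random.Random(seed)
    cores = [
        {
            "p_ind": rng.uniform(0.03, 0.07),  # frequency-independent power
            "c_ef": rng.uniform(0.8, 1.2),     # effective capacitance
            "m": rng.uniform(2.5, 3.0),        # dynamic power exponent
            "f_max": 1.0,                      # GHz
        }
        for _ in range(num_cores)
    ]
    # w[i][k]: execution time (ms) of task i on core k at maximum frequency.
    w = [[rng.uniform(10.0, 100.0) for _ in range(num_cores)]
         for _ in range(num_tasks)]
    return cores, w
```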
We chose the DAG models of two real-world applications, fast Fourier transform and Gaussian elimination, to evaluate our algorithm.

Fast Fourier Transform Application
We first consider the fast Fourier transform (FFT). Figure 2a shows an example of the FFT parallel application with $\rho = 4$, where the parameter $\rho$ represents the size of the application model. For the FFT application, the number of tasks is $|N| = (2 \times \rho - 1) + \rho \times \log_2 \rho$, where $\rho = 2^x$ and $x$ is an integer. There are four exit tasks (tasks 12, 13, 14 and 15) in the FFT graph in Figure 2a, and in general an FFT graph with parameter $\rho$ has $\rho$ exit tasks. To match the application model in Section 3.1, we add a dummy exit task whose execution time is 0, connect it to the $\rho$ exit tasks, and set the corresponding communication times to 0. For example, Figure 2b shows the modified FFT parallel application with $\rho = 4$ after the dummy exit task has been added.
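The task-count formula can be checked with a small Python helper. The function name `fft_task_count` is hypothetical; the count excludes the dummy exit task, which adds one more node to the graph.

```python
def fft_task_count(rho):
    """Number of tasks in the FFT graph of size rho, where rho is a power of two."""
    x = rho.bit_length() - 1          # x = log2(rho) when rho is a power of two
    if rho < 1 or (1 << x) != rho:
        raise ValueError("rho must be a power of two")
    return (2 * rho - 1) + rho * x

print(fft_task_count(4))  # 15, matching tasks 1 to 15 in Figure 2a
```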
Experiment 1. To observe the performance under different energy consumption constraints, this experiment compares the final schedule lengths of the FFT application for varying energy consumption constraints. We use the FFT application with $\rho = 32$, that is, the number of tasks is 233 ($|N| = 233$). We set the energy consumption constraints to $E_{given}(G) = E_{min}(G) + \frac{M}{233}(E_{max}(G) - E_{min}(G))$, where $1 \le M \le 232$ and $M$ is an integer. Table 4 lists the final schedule lengths of the FFT application with $\rho = 32$ for varying $E_{given}(G)$ for all the algorithms, and Figure 3 presents them graphically. Our algorithm has a clear advantage in schedule length $SL(G)$ over the other algorithms: it outperforms MSLECC by about 28.13%-36.35%, and it outperforms the newest method, ISAECC, by about 3.65%-10.48%. The results of WALECC and EECC are similar to those of ISAECC.

Further, the gaps between our method and the other algorithms increase as $E_{given}(G)$ decreases. This is because, when the energy consumption constraint is large, every task can be assigned a relatively large amount of energy, so the impact of the task level is smaller. Moreover, in that case most tasks belonging to $N_{das}$ have already reached the maximum frequency and cannot raise their frequency further to shorten the schedule length. In addition, as expected, the larger $E_{given}(G)$ is, the shorter the schedule length we obtain.

Experiment 2. To observe the performance under different numbers of tasks, this experiment compares the final schedule lengths of the FFT application for a varying number of tasks. The parameter $\rho$ is varied from 8 to 256. To obtain relatively clear results, we set the energy consumption constraint to a relatively small value above $E_{min}(G)$. Table 5 lists the results of the FFT application for different numbers of tasks for all the algorithms, and Figure 4 presents them graphically. The results show that our method performs better than the other algorithms: it outperforms MSLECC by about 21.24%-31.10% and ISAECC by about 4.80%-7.32%. The more tasks there are, the better our method performs. This is because an FFT graph has more levels as the number of tasks increases, so our new energy pre-allocation strategy becomes more advantageous.
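The sweep of energy consumption constraints used in Experiment 1 can be written as a short Python sketch. The function name `energy_constraints` and the example bounds are illustrative only; the actual values of $E_{min}(G)$ and $E_{max}(G)$ depend on the application and platform.

```python
def energy_constraints(e_min, e_max, n_tasks):
    """E_given values for M = 1 .. n_tasks - 1, evenly spaced between e_min and e_max."""
    return [e_min + m / n_tasks * (e_max - e_min) for m in range(1, n_tasks)]

# Illustrative bounds only; n_tasks = 233 as in Experiment 1.
values = energy_constraints(100.0, 333.0, 233)
print(len(values))  # 232 constraints, one for each M
```

Every generated constraint lies strictly between the two bounds, so each run has more energy than the minimum and less than the maximum.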

Gaussian Elimination Application
Similarly, for the Gaussian elimination (GE) application, we define $\rho$ as the size of the application, and the number of tasks is $|N| = \frac{\rho^2 + \rho - 2}{2}$. Figure 5 shows the GE application model with $\rho = 5$.
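As a quick check of this formula, the following Python helper (the name `ge_task_count` is hypothetical) reproduces the task count used in the GE experiments.

```python
def ge_task_count(rho):
    """Number of tasks in the Gaussian elimination graph of size rho."""
    # rho**2 + rho is always even, so integer division is exact.
    return (rho * rho + rho - 2) // 2

print(ge_task_count(5))   # 14
print(ge_task_count(21))  # 230
```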

Experiment 3. This experiment compares the final schedule lengths of the GE application for varying energy consumption constraints. We use the GE application with $\rho = 21$, that is, the number of tasks is 230 ($|N| = 230$). We set the energy consumption constraints to $E_{given}(G) = E_{min}(G) + \frac{M}{230}(E_{max}(G) - E_{min}(G))$, where $1 \le M \le 229$ and $M$ is an integer. Table 6 lists the final schedule lengths of the GE application with $\rho = 21$ for varying $E_{given}(G)$ for all the algorithms, and Figure 6 presents them graphically. Our method still performs better than the other algorithms: it outperforms MSLECC by about 14.69%-34.55% and ISAECC by about 3.94%-9.60%, although the improvement is less pronounced than for the FFT models. This is because the levels of the GE application are not as strict as those of the FFT application, which diminishes the effect of our method.

Table 6. Final schedule lengths of the GE application with ρ = 21 for varying E_given(G).

E_given(G)   MSLECC   WALECC   EECC   ISAECC   Ours
2436         6826     6144     5912   5973     5692
2763         6155     5389     5603   5398     5128
3665         5519     4457     4521   4324     4119
6535         4721     3776     3502   3418     3090
7191         3864     3093     3214   3186     3043
8749         3397     3012     3129   3017     2898

We then change the number of tasks. Table 7 lists the final schedule lengths of the GE application for a varying number of tasks for all the algorithms, and Figure 7 presents them graphically. Our method again yields shorter schedule lengths than the other algorithms: it outperforms MSLECC by about 16.00%-28.85% and ISAECC by about 3.36%-10.98%.
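The improvement percentages quoted for Experiment 3 can be reproduced from the Table 6 values with the following Python sketch. It assumes that "outperforms by x%" means the relative reduction in schedule length, and that within each six-number row of Table 6 the second value is MSLECC, the fifth is ISAECC and the last is our method; both points are inferred from the numbers rather than stated in the text.

```python
def improvement(baseline, ours):
    """Relative reduction in schedule length, in percent."""
    return 100.0 * (baseline - ours) / baseline

# (MSLECC, ISAECC, ours) per row of Table 6, under the assumed column order.
rows = [(6826, 5973, 5692), (6155, 5398, 5128), (5519, 4324, 4119),
        (4721, 3418, 3090), (3864, 3186, 3043), (3397, 3017, 2898)]

vs_mslecc = [round(improvement(m, o), 2) for m, _, o in rows]
vs_isaecc = [round(improvement(i, o), 2) for _, i, o in rows]

print(min(vs_mslecc), max(vs_mslecc))  # 14.69 34.55
print(min(vs_isaecc), max(vs_isaecc))  # 3.94 9.6
```

The computed ranges agree with the 14.69%-34.55% and 3.94%-9.60% figures reported in the text, which supports the assumed column order.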

Analysis and Summary of Experimental Results
The experimental results show that our method performs better than the other algorithms. However, when $E_{given}(G)$ becomes larger or the number of tasks is relatively small, the advantage of our method is less pronounced. The former is because, when $E_{given}(G)$ is large, every task can be allocated a relatively large amount of energy, so the impact of the task level is smaller. The latter is because results obtained with a small number of tasks are more subject to chance and do not reflect the general trend. We also find that our method performs better on FFT models than on GE models with similar $E_{given}(G)$ and number of tasks. This is because the task levels of the GE application are not as strict as those of the FFT application, which diminishes the effect of our method. In summary, our method is most applicable when the energy constraints are stringent, the number of tasks is large, or the task levels of the application are strict.

Conclusions
In this article, we propose a novel method to minimize the schedule length of energy-constrained applications that run on a heterogeneous multi-core system. Our method consists of two parts: a novel task energy pre-allocation strategy with a schedule algorithm based on it, and a re-adjustment mechanism for the task execution frequency after preliminary scheduling. The core idea of our method is that tasks in different levels have different impacts on the whole application, and that the negative impact of locally optimal scheduling should be reduced. Our method can be integrated into actual multi-core embedded systems, and it is particularly suitable for wearable devices, mobile robots and other products with high requirements for energy saving and performance. We carried out a considerable number of experiments with two practical parallel application models (FFT and GE). The results show that our method is generally superior to the existing algorithms. However, the experimental results also reveal the limitations of our method: it does not offer much of an advantage when the energy constraints are not stringent, the number of tasks is small, or the task levels of the application are not strict.
In the future, we will improve and extend our method, and some further studies will be done. The points that can be further studied are as follows:

• Consider other factors that affect the schedule length of an application, such as the way the priority of tasks is determined.
• Explore ways to overcome the limitations of our method and make it more universal.
• Integrate our method into an actual embedded multi-core system and test its performance.
• Extend our method to study other indicators in multi-core task scheduling, such as reliability.