Sw/Hw Partitioning and Scheduling on Region-Based Dynamic Partial Reconfigurable System-on-Chip

A heterogeneous system-on-chip (SoC) integrates multiple types of processors on the same chip. It offers advantages in processing capacity, size, weight, cost, power, and energy consumption, which have led to its wide adoption in many fields. SoCs based on region-based dynamic partial reconfigurable (DPR) FPGAs play an important role in this field. However, delivering their capacity to the consumer depends on efficient Sw/Hw partitioning and scheduling technology that determines the resource volume of each DPR region, the mapping of the application to the DPR regions and other processors, and the schedule of the tasks and their reconfigurations. This paper first proposes an exact approach based on mixed integer linear programming (MILP) for the Sw/Hw partitioning and scheduling problem. The proposed MILP solves the problem optimally; however, its scalability is poor, even though its formulation was carefully designed to be as concise as possible. Therefore, a multi-step hybrid method that combines graph partitioning and MILP is proposed, which reduces the time complexity significantly while degrading the solution quality only marginally. A set of experiments is carried out using real-life applications, and the results demonstrate the effectiveness of the proposed methods.


Introduction
As user requirements keep increasing and technology advances, reconfigurable systems have become ubiquitous, including traditional computers, smart phones, multimedia systems, and software-defined radio systems. Reconfigurable systems utilize various kinds of flexible structures and can be applied to high-performance computing, e.g., multimedia, communication, and radar signal processing. The primary feature of reconfigurable systems, despite the diversity of their structures, is programmability. The field programmable gate array (FPGA) is an important programmable device, the function of which can be changed by dynamically loading new bitstreams that implement various kinds of hardware accelerators. Such a reconfiguration, however, needs to rewrite all the logic of the FPGA. Since the bitstream file for the whole FPGA is quite large, a considerable reconfiguration overhead results. Moreover, the application may be small, requiring only a few reconfigurable resources; thus, a full reconfiguration increases the reconfiguration latency and also wastes resources. Therefore, dynamic partial reconfiguration has been introduced in most modern FPGAs. Dynamic partial reconfigurable (DPR) FPGAs enable run-time reconfiguration of a specific portion of the FPGA circuit without affecting the execution of other parts of the FPGA. Compared with full dynamic reconfiguration, dynamic partial reconfiguration can speed up the loading of a hardware accelerator, thus improving the response speed of the platform. For example, it takes more than 8 ms to [...] with the partial result of the former sub-problem as the input. Further, a priority-based graph partitioning method is proposed to effectively divide the problem into a set of small sub-problems. Therefore, the contributions of this paper are:
• The formal definition of the partitioning and scheduling problem on a DPR SoC integrating the CPU and a 2D DPR FPGA.
• An accurate MILP formulation for the partitioning and scheduling problem.
• A multi-step hybrid method that integrates the priority-based graph partitioning method.
• Extensive experiments with a set of real-life applications.
The remainder of the paper is organized as follows. In Section 2, we discuss the related work. Section 3 introduces the platform model, the application model, and the studied problem. The proposed MILP model is introduced in Section 4, and the multi-step hybrid approach is illustrated in Section 5. Section 6 presents and analyzes the experimental results. We conclude the paper in Section 7.

Related Work
The authors of [14] studied the effectiveness of using DPR at reducing the size, cost, power, and energy consumption by experimentally evaluating a set of real-life signal processing hardware accelerators on the DPR FPGA. The result also demonstrated the usefulness of DPR in improving real-time performance and reducing resource consumption.
The work in [15] utilized the DPR FPGA to tolerate the faults occurring at runtime and improve the system reliability. The proposed method was demonstrated to be advantageous at improving the resource utilization and task acceptance ratio and reducing the fragmentation ratio.
The usage of the DPR FPGA in cloud computing to speed up provided services is an emerging research area, which was studied in [16]. The work in [16] proposed a benefit-based scheduling metric to guide task assignment on the DPR FPGA, and the result demonstrated that the method was advantageous at reducing resource consumption.
In [17], the authors proposed two optimal module/task placement and scheduling methods for applications with temporal precedence constraints on 2D DPR FPGAs. The proposed methods can be used to optimize either the schedule length or the FPGA size. In this work, each task was modeled as a three-dimensional box in space and time; hence, the placing and scheduling problem was transformed into a packing problem in a container with a specific size. However, the proposed approach models the reconfiguration time in the task execution time, without taking into account the single reconfigure port constraint and configuration prefetch. Therefore, the produced result is inaccurate, since multiple reconfigurations can be carried out in parallel. Besides, ignoring the reconfiguration pre-fetching also worsens the schedule performance. The work in [18] presented a time-step ILP-based exact approach for the Sw/Hw partitioning problem (placing and scheduling) on an SoC with a microprocessor and a 1D DPR FPGA. The proposed model takes into account practical physical constraints, i.e., an Hw should occupy continuous FPGA columns and one reconfiguration port constraint; and configuration prefetch was utilized to reduce reconfiguration overhead so as to further improve the performance. Besides, a heuristic algorithm based on Kernighan-Lin Fiduccia-Mattheyses (KLFM) was also proposed. However, module reuse was not shown in this work.
The paper [4] proposed a time-step ILP formulation for placing and scheduling task graphs on the 1D DPR FPGA, taking into account configuration prefetch, a single reconfiguration port, and module reuse. The 1D ILP formulation was also extended to the SoC with multiple 2D DPR FPGAs. Since there are multiple FPGAs on the SoC, the same number of concurrent reconfigurations is enabled in the 2D ILP; however, it ignores the fact that each FPGA supports only one reconfiguration at a time. The authors also proposed a heuristic called Napoleon for the problem. The heuristic considers reconfiguration pre-fetching, module reuse, and anti-fragmentation techniques. The result showed the effectiveness of module reuse compared to the work in [18].
The authors of [19] proposed an ant colony optimization (ACO) approach for mapping, scheduling, and placing directed acyclic graphs (DAG) on the SoC with a 1D DPR FPGA and one or several instruction processors (i.e., microprocessor and/or digital signal processor (DSP)). Reconfiguration pre-fetching is utilized in the proposed method to hide reconfiguration overhead, and reconfiguration port contention is taken into account. The result demonstrated that ACO is more effective in schedule length compared with KLFM [18].
The work in [20] introduced a duplication-based time-step ILP model with reconfiguration pre-fetching and a single reconfiguration port constraint for placing and scheduling precedence-constrained task chains on the 1D DPR FPGA. The proposed method makes use of the data parallelism provided by the application task and replicates the same task several times on the DPR FPGA to improve the real-time performance. The proposed model assumes that the upper bound on the duplication number of each task is known a priori, and each copy of a child task is assumed to start after all instances of the ancestor task finish execution. Besides, the execution time of each duplication of the same task can differ, which is resolved by the ILP model. Further, the authors proposed a semi-online heuristic algorithm, PARLGRAN, for the same problem.
Region-based Sw/Hw partitioning on the SoC with the 2D DPR FPGA and microprocessor was studied in [5,10,21], with reconfiguration pre-fetching and a single reconfiguration port constraint being taken into account. These methods assume that the upper bound on the number of reconfigurable regions is fixed. The work in [5] proposed a mixed-integer linear programming (MILP) formulation for mapping and scheduling the DAG and sizing the reconfigurable regions on the heterogeneous SoC with a microprocessor and a 2D region-based DPR FPGA to optimize power, energy, and schedule length. The work in [21] proposed two heuristics for the same problem to reduce the algorithm complexity. Floorplanning [22] is required to ensure the feasibility of the final solution produced by the algorithms in [5,21], and if the solution is infeasible, the proposed algorithms have to be re-executed by virtually reducing the number of FPGA resources [21]. Note that in [5,21], the reconfiguration of the first task on each reconfiguration region was not taken into account.
The authors of [10] proposed an MILP model for the Sw/Hw partitioning problem on the heterogeneous SoC with a microprocessor and a 2D region-based DPR FPGA, assuming tasks of the application to be clustered with the available heuristic algorithm, and the task of a cluster was limited to being mapped to either the microprocessor or a specific reconfigurable region of the FPGA. The independent set-based algorithm (ISBA) [23] is used to map the DAG to clusters by merging tasks that are not active in one or more periods to a cluster. The work of [10] is the most related work to the studied problem in this paper; hence, it was also selected for performance evaluation. Again, as in [5,21], the reconfiguration of the first task on the region was ignored.
The authors of [24] developed a simulated annealing algorithm for partitioning, scheduling, and floorplanning of the application on the DPR FPGA. The novelty of this work was that it solved all critical problems with respect to deploying an application on the DPR FPGA. However, the proposed method ignores the possibility of deploying the task of the application on the CPU. The work in [25] studied how to evaluate the reconfiguration cost in terms of time for the DPR FPGA. The authors coined the term "DPR profitability", targeting real-time systems, and an innovative approach for calculating the DPR time and the worst-case bound was developed. The work in [26] studied the automotive placement on the DPR FPGA to reduce resource utilization and the total wirelength. The selection of the Partial Reconfigurable Region (PRR) was based on the shapes, location, and communication with others.
The works in [8,9] proposed a hybrid mapping-scheduling algorithm (HMS) to capture the pipeline behaviors of the application and reduce the reconfiguration overhead on the region-based DPR FPGA with a fixed number of reconfigurable regions. The application is first translated to sequential snapshots that are composed of a set of Hw cores; each snapshot is then mapped separately by clustering the snapshot into islands [6,7]; each island is then assigned to a reconfigurable region. Since islands in subsequent snapshots may be compatible, i.e., they share the same Hw core, the reconfiguration overhead can be potentially reduced, hence improving the schedule performance.
The papers [6,7] studied reducing reconfiguration overhead when scheduling multiple applications sequentially on the region-based DPR FPGA with a fixed number of reconfigurable regions of the same size. The proposed algorithm captures the fact that many applications share some hardware cores to perform common operations. This kind of similarity in applications can be used to reduce reconfigurations when switching applications on the FPGA using dynamic partial reconfiguration.
One novelty of the works in [6][7][8][9] was that different Hw cores were allowed to be combined into a single island/cluster that was then assigned to a reconfiguration region to take into account that the region size may be large enough to hold multiple Hw cores. Such a combination contributes to less reconfiguration overhead, since the reconfiguration overhead of an Hw core is directly related to the size of the reconfigurable region. If some Hw cores are combined into one Hw, then less reconfiguration is required.

Problem Description
We consider Sw/Hw partitioning and scheduling of an application on the hardware platform with the system architecture shown in Figure 1. The application is specified as a directed acyclic graph, which is also called the task dependency graph since each edge in the graph represents an inter-task data dependency. The application model can be extracted from the high-level language of the application code, e.g., C, VHDL, etc. As in the works [10,18,23], the feasibility of the Sw/Hw partitioning and scheduling result still needs to be verified using a floorplanning tool [22,27,28]. If the result is infeasible, the algorithm can be iterated by reducing the total resource number of the platform until a feasible result is found. Further, integrating the MILP-based floorplanning technique [22] into the proposed MILP approach is also possible.

Platform Model
The target platform architecture is shown in Figure 1. The platform is composed of a microprocessor/CPU and an FPGA with DPR ability. The FPGA can be divided into a static region and a set of dynamic reconfigurable regions denoted as $PR = \{PR_1, PR_2, \ldots, PR_{|PR|}\}$. Each reconfigurable region comprises a set of different reconfigurable resources, including CLB, DSP, BRAM, etc. We use $H = \{h_1, h_2, \ldots, h_{|H|}\}$ to denote the set of resource types of the FPGA. For each kind of resource $h \in H$, $N_h^k$ represents the number of resources $h$ in the reconfigurable region $PR_k \in PR$, and $TN_h$ denotes the total number of resources $h$ on the FPGA. Each reconfigurable region $PR_i$ is associated with an integer $RT_i$, representing the time needed to reconfigure this region: $RT_i = S_{bit}^i / B_{cfg}$, where $S_{bit}^i$ is the bitstream size of region $PR_i$ and $B_{cfg}$ is the bandwidth of the reconfiguration port. The bitstream size follows from the resources of the region as $S_{bit}^i = \sum_{h \in H} N_h^i \cdot S_h$,
where $S_h$ is the size of the resource $h$ measured in bits. In the platform, local communication, either within the microprocessor or within the FPGA, is treated as delay-free by using local memory access [10]. An example DPR SoC platform abstracted from the Xilinx Zynq-series SoC is shown in Figure 1. The SoC is composed of a microprocessor and an FPGA, and the FPGA has three dynamic partial reconfigurable regions and a static region. The FPGA can be partially reconfigured by the microprocessor using the ICAP/PCAP port.
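The relation between region resources and reconfiguration time can be illustrated with a short sketch. The resource sizes, region contents, and port bandwidth below are hypothetical values chosen for easy arithmetic, and rounding up to an integer is an assumption made here so that the result stays an integer time.

```python
from math import ceil


def bitstream_bits(region_resources, bits_per_resource):
    """S_bit of one region: sum over resource types of (count * size in bits)."""
    return sum(n * bits_per_resource[h] for h, n in region_resources.items())


def reconfiguration_time(region_resources, bits_per_resource, port_bandwidth):
    """RT = S_bit / B_cfg, rounded up (an assumption) so that RT stays an integer."""
    return ceil(bitstream_bits(region_resources, bits_per_resource) / port_bandwidth)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
bits_per_resource = {"CLB": 1000, "DSP": 4000, "BRAM": 8000}  # S_h in bits
region = {"CLB": 20, "DSP": 2, "BRAM": 1}                     # N_h of one region
rt = reconfiguration_time(region, bits_per_resource, port_bandwidth=400)
print(rt)  # 36000 bits / 400 bits per time unit = 90
```

Because the region size is itself a decision of the partitioning step, a larger region that hosts more tasks also makes every reconfiguration of that region slower.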

Application Model
As in many other works [2,10,12], the application is modeled as a directed acyclic graph (DAG) $G(V, E)$, where $V$ is a finite set of nodes representing the tasks of an application and $E$ is a finite set of directed edges denoting data precedences among tasks. Each task $v \in V$ is associated with two integer computation costs $cs_v$ and $ch_v$, representing the time it takes to complete the execution of the task on the microprocessor and on the FPGA, respectively. Each task is also associated with a set of resource consumptions corresponding to each kind of resource, e.g., CLB, DSP, BRAM, etc. For all $v \in V$ and $h \in H$, $N_h^v$ represents the number of resources $h$ required to execute task $v$ on the FPGA. A task can be mapped to a specific reconfigurable region only if the type and number of resources provided by the region match those required by the task. Each edge $e \in E$ is defined as a tuple $(src, dst)$, with $src$ and $dst$ being the source task and the destination/child task of the edge, respectively. Each edge $e \in E$ is associated with an integer communication cost $w_e$, denoting the communication time required to move the dependent data from the source task to the destination task when they are deployed onto different kinds of processing elements. A task without any child is called an exit task, and we use $V_{exit}$ to represent the set of exit tasks of the DAG. Figure 2 shows an example application [18] modeled as a DAG. The application has eight tasks and nine edges. Table 1 records the task parameters of the DAG in Figure 2. Task parameters include the execution time on the CPU (Sw time, in cycles), the execution time on the FPGA (Hw time, in cycles), and the required CLB number of each task (CLB num).
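The application model maps naturally onto a few small data structures. The sketch below is only an illustration: the class and function names are hypothetical, the three-task graph is invented and is not the application of Figure 2, and the region check is read as "the region provides at least what the task requires".

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cs: int                                    # execution time on the microprocessor
    ch: int                                    # execution time on the FPGA
    res: dict = field(default_factory=dict)    # required resources per type (N_h^v)


@dataclass
class Edge:
    src: str
    dst: str
    w: int    # communication time when src and dst run on different element kinds


def exit_tasks(tasks, edges):
    """Tasks without any child, i.e., the set V_exit."""
    has_child = {e.src for e in edges}
    return [t.name for t in tasks if t.name not in has_child]


def fits(task, region_resources):
    """True if the region offers every resource type in the required amount."""
    return all(region_resources.get(h, 0) >= n for h, n in task.res.items())


# Invented example graph: a -> b and a -> c.
tasks = [
    Task("a", cs=10, ch=4, res={"CLB": 3}),
    Task("b", cs=8, ch=2, res={"CLB": 2, "DSP": 1}),
    Task("c", cs=6, ch=3, res={"CLB": 4}),
]
edges = [Edge("a", "b", w=1), Edge("a", "c", w=1)]
print(exit_tasks(tasks, edges))      # ['b', 'c']
print(fits(tasks[1], {"CLB": 4}))    # False: the region offers no DSP
```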

The Sw/Hw Partitioning and Scheduling Problem
With the platform and application models, the Sw/Hw partitioning and scheduling problem is to map and schedule the tasks of the application onto different computing elements (the CPU and the FPGA static and dynamic regions) while complying with the physical constraints of the DPR-enabled heterogeneous SoC. These constraints include those that are common to static non-preemptive scheduling on multiprocessors, e.g., tasks on the processor should execute in series, and a task cannot be interrupted by other tasks while it is executing [15,29,30]. They also include those that are specific to the DPR FPGA: the single reconfigure port constraint [12,31]; a task on the FPGA must be reconfigured before executing; reconfiguration and execution on the same region of the FPGA must happen in series; the number of each kind of resource of a region must be enough to accommodate each task assigned to it; and the aggregate resource volume of all regions should not exceed that provided by the FPGA. What is more, the features provided by the hardware have to be employed for the sake of better performance, e.g., reconfiguration pre-fetching. Reconfiguration pre-fetching hides part of the reconfiguration delay by exploiting a physical characteristic of the FPGA: the reconfiguration of one region and the execution on other regions can be carried out in parallel [31].
To summarize, for the target Sw/Hw partitioning and scheduling problem, the following are to be determined while satisfying the physical constraints and employing the features of the platform:
• The resource type and number on each reconfiguration region.
• The mapping of each task (the task may be mapped to the CPU or a reconfiguration region).
• The start execution time of each task.
• The start reconfiguration time of each task that is mapped to the FPGA.
• The execution order and reconfiguration order of all tasks.
For the application in Figure 2, with the task parameters described in Table 1, assuming that the communication cost of each edge in Figure 2 is one and that the reconfiguration delay of each task is equal to its CLB number, Figure 3 shows the Sw/Hw partitioning and scheduling of the application on the platform with eight CLBs. In Figure 3, the rectangles with labels $n_i$ and $r_i$ represent the execution and reconfiguration of task $n_i$, respectively.
According to the Sw/Hw partitioning and scheduling result in Figure 3, the following observations can be made:
• The resource number of the FPGA is eight, while the aggregate resource number of the application, obtained by adding the resource requirement of each task of the DAG, is 17. This shows that the DPR ability of the hardware enables the execution of the application on a resource-constrained platform. By partitioning the FPGA into three partial dynamic regions, the resource of the FPGA is shared by the application in both time and space.
• Reconfigurations of different tasks are carried out in series. As shown in the figure, the reconfiguration port is busy during the time interval [0, 19], and no pair of reconfigurations overlaps in time.
• Reconfiguration and execution of two tasks can happen in parallel (reconfiguration pre-fetching), e.g., $n_0$ on $PR_1$ and $r_2$ on $PR_2$ overlap during [5, 6]. While task $n_0$ is executing on $PR_1$, the reconfiguration of task $n_2$ is carried out on $PR_2$, thus hiding its reconfiguration delay.
• In each region, reconfiguration and execution happen in turn, e.g., in region $PR_2$, the reconfiguration and execution sequence is $r_2$, $n_2$, $r_3$, $n_3$, $r_7$, $n_7$.
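The observations above can be turned into a small feasibility check on the FPGA part of a schedule. The sketch below is an illustration under stated assumptions: the record layout and function names are hypothetical, intervals are half-open `[start, end)`, and the example times are invented rather than read from Figure 3.

```python
from itertools import combinations


def overlaps(a, b):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def non_overlapping(intervals):
    """True if no two intervals in the list overlap."""
    return all(not overlaps(x, y) for x, y in combinations(intervals, 2))


def check_fpga_schedule(schedule):
    """schedule: list of dicts with keys 'region', 'reconf' and 'exec' (intervals)."""
    # Single reconfiguration port: reconfigurations never overlap.
    port_ok = non_overlapping([s["reconf"] for s in schedule])
    # A task executes only after its own reconfiguration has finished.
    order_ok = all(s["reconf"][1] <= s["exec"][0] for s in schedule)
    # Within one region, reconfigurations and executions are serialized.
    per_region = {}
    for s in schedule:
        per_region.setdefault(s["region"], []).extend([s["reconf"], s["exec"]])
    region_ok = all(non_overlapping(iv) for iv in per_region.values())
    return port_ok and order_ok and region_ok


# Invented schedule: the second reconfiguration is pre-fetched while the
# first task is executing on another region, which is allowed.
good = [
    {"region": "PR1", "reconf": (0, 3), "exec": (3, 8)},
    {"region": "PR2", "reconf": (3, 5), "exec": (8, 10)},
]
# Same schedule, but the two reconfigurations now contend for the port.
bad = [
    {"region": "PR1", "reconf": (0, 3), "exec": (3, 8)},
    {"region": "PR2", "reconf": (2, 5), "exec": (8, 10)},
]
print(check_fpga_schedule(good), check_fpga_schedule(bad))
```

The check covers only the DPR-specific constraints; data precedence and the serial execution on the microprocessor would need analogous tests.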

The Proposed MILP Formulation
This section introduces the exact approach for the Sw/Hw partitioning and scheduling problem based on the MILP.
Reconfiguration is an important part of the Sw/Hw partitioning and scheduling problem. Due to the physical features of the FPGA, each task assigned to an FPGA region has to be reconfigured before it is executed. The reconfiguration incurs extra time overhead, and the length of the reconfiguration time is directly related to the resource volume of the FPGA region to be reconfigured and the bandwidth of the reconfiguration port. To take the reconfiguration time into account in the schedule, we make use of the concept of the reconfiguration node [10]. Each task of the application is associated with a reconfiguration node: for all $v \in V$, $R_v$ is the reconfiguration node of task $v$. Note that the reconfiguration node $R_v$ may ultimately not be carried out in the schedule; whether or not it is executed is decided by the algorithm. When the task is assigned to the microprocessor, reconfiguration is unnecessary; otherwise, it is indispensable. The time taken by the reconfiguration node $R_v$ is determined by the resource volume of the reconfiguration region to which task $v$ is assigned, and the resource volume of the region is determined by the set of tasks that are finally mapped onto it.

Variables
To model the Sw/Hw partitioning and scheduling problem with MILP, the following variables are used. For all $v \in V$, we use the integer variables $s_v$ and $r_v$ to represent the start execution time and the start reconfiguration time of the task (the latter only if the task is mapped to a reconfigurable region). For all $v \in V$ and $PR_i \in PR$, we introduce the binary variable $m_{vi}$ to represent whether task $v$ is finally mapped to the reconfigurable region $PR_i$. If $m_{vi} = 1$, task $v$ is mapped to $PR_i$; otherwise, the task is not mapped to $PR_i$, i.e., it is mapped either to the microprocessor or to another reconfigurable region. If $m_{vi} = 0$ for all $PR_i \in PR$, then task $v$ is mapped to the microprocessor.
For all $e \in E$, we use the binary variable $d_e$ to represent whether the source and destination tasks of edge $e$ are mapped to different computation elements (one to the microprocessor and the other to the FPGA). It equals one if the source task and the destination task are mapped to different processing elements; otherwise, they are mapped to the same processing element (both to the microprocessor or both to the FPGA).
For all $a, b \in V$, $a \ne b$, we use the auxiliary binary variable $z_{ab}$ to represent the execution order of these two tasks on the microprocessor if they are both mapped to the microprocessor (otherwise, it is a free variable). It equals zero if task $a$ executes before $b$ on the microprocessor; otherwise, $b$ is executed before $a$.
For all $a, b \in V$, $a \ne b$, we use the auxiliary binary variable $x_{ab}$ to represent the reconfiguration order of tasks $a$ and $b$ on the FPGA if they are both mapped to the FPGA (otherwise, it is a free variable), so as to model the single reconfiguration controller constraint. It equals zero if task $a$ is reconfigured before $b$ on the FPGA; otherwise, task $b$ is reconfigured before $a$.
For each $PR_k \in PR$ and $h \in H$, we use the integer variable $N_h^k$ to represent the number of resources $h$ in the reconfiguration region $PR_k$.
The integer variable $sl$ is used to represent the schedule length.

Constraints
Each task of the application can be mapped to only one processing element (the microprocessor or the FPGA); and if the task is assigned to the FPGA, it can be mapped to only one reconfigurable region. Therefore, the following constraint should hold for all v ∈ V.
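The constraint itself is not reproduced in the text. A form consistent with the description, offered as a reconstruction rather than the published equation, is:

```latex
\sum_{PR_i \in PR} m_{vi} \le 1, \qquad \forall v \in V \tag{1}
```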
Note that if $\sum_{PR_i \in PR} m_{vi} = 1$, then the task is mapped to the FPGA; otherwise, it is mapped to the microprocessor. Hence, $\sum_{PR_i \in PR} m_{vi}$ indicates where the task is mapped. For simplicity, we use the notation $h_v$ to denote $\sum_{PR_i \in PR} m_{vi}$ in the following.
The following constraints should hold for all $a, b \in V$, $a \ne b$, to guarantee sequential task execution on the microprocessor.
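Constraints (2) and (3) are not reproduced in the text. One pair of big-M constraints that reduces to exactly the cases discussed next is sketched here; the arrangement of the terms multiplying $M_1$ is an assumption.

```latex
s_a + (1 - h_a)\, cs_a \le s_b + M_1 \,( z_{ab} + h_a + h_b ) \tag{2}
s_b + (1 - h_b)\, cs_b \le s_a + M_1 \,( 1 - z_{ab} + h_a + h_b ) \tag{3}
```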
where $M_1$ is a large positive constant integer that makes $s_a + (1 - h_a) \cdot cs_a \le s_b + M_1$ and $s_b + (1 - h_b) \cdot cs_b \le s_a + M_1$ always hold. We first consider the case when $h_a = 0$ and $h_b = 0$, i.e., both tasks $a$ and $b$ are mapped to the microprocessor. If $z_{ab} = 0$, the constraint (2) is reduced to $s_a + (1 - h_a) \cdot cs_a \le s_b$, which guarantees that task $b$ starts execution after task $a$ finishes, while the constraint (3) is reduced to $s_b + (1 - h_b) \cdot cs_b \le s_a + M_1$, which always holds owing to the existence of $M_1$. Otherwise, if $z_{ab} = 1$, the constraint (2) is reduced to $s_a + (1 - h_a) \cdot cs_a \le s_b + M_1$, which always holds, while the constraint (3) is reduced to $s_b + (1 - h_b) \cdot cs_b \le s_a$, which guarantees that task $a$ starts execution after task $b$ finishes. However, if $h_a = 1$ or $h_b = 1$, both the constraints (2) and (3) always hold owing to the definition of $M_1$; in this case, variables $s_a$, $s_b$, and $z_{ab}$ are free. The use of the large positive constant integer $M_1$ is a trick to make the constraint linear [29]. The value of $M_1$ is application related. Generally, the single-processor schedule length on the CPU, i.e., the sum of the execution times of all tasks of the application on the CPU, satisfies the requirement.
The following constraints guarantee the data precedence for each edge e(a, b) ∈ E.
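Constraints (4)-(6) are not reproduced in the text. A reconstruction that matches the explanation below, with the finish time of task $a$ written as a linear expression in $h_a$, is given here as an assumption about their form:

```latex
d_e \ge h_a - h_b \tag{4}
d_e \ge h_b - h_a \tag{5}
s_a + h_a\, ch_a + (1 - h_a)\, cs_a + d_e\, w_e \le s_b \tag{6}
```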
If $h_a \ne h_b$, the constraints (4) and (5) make $d_e = 1$, i.e., tasks $a$ and $b$ are mapped to different processing elements, and the data precedence between them incurs a communication delay. Note that the constraints (4)-(6) do not force $d_e$ to be zero when $h_a = h_b$. However, the MILP solver sets it to zero in order to minimize the schedule length. Based on the value of $d_e$, the communication delay is taken into account by the constraint (6), which makes sure that task $b$ starts execution after the dependent data from task $a$ are ready.
The following constraint should hold for all a ∈ V to make sure that a task executes after its reconfiguration if it is mapped to the FPGA.
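Constraint (7) is not reproduced in the text. A form consistent with the two cases explained next, given as a reconstruction, is:

```latex
r_a + \sum_{PR_i \in PR} ( m_{ai} \cdot RT_i ) \le s_a + M_2 \,( 1 - h_a ) \tag{7}
```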
where $M_2$ is a large positive constant integer that makes $r_a + \sum_{PR_i \in PR} (m_{ai} \cdot RT_i) \le s_a + M_2$ always hold. If $h_a = 1$, i.e., task $a$ is mapped to the FPGA, then the constraint (7) is reduced to $r_a + \sum_{PR_i \in PR} (m_{ai} \cdot RT_i) \le s_a$, making task $a$ execute after its reconfiguration finishes. Otherwise, if $h_a = 0$, i.e., task $a$ is mapped to the microprocessor, the constraint is reduced to $r_a + \sum_{PR_i \in PR} (m_{ai} \cdot RT_i) \le s_a + M_2$, which always holds due to the definition of $M_2$.
Since both m ai and RT i are variables, the constraint (7) is not linear. To make it linear, we introduce the constraint (8) to replace (7).
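Constraint (8) is likewise missing from the text. A per-region linear form that reduces to the two cases described next is sketched below as an assumption:

```latex
r_a + RT_i \le s_a + M_2 \,( 1 - m_{ai} ), \qquad \forall a \in V,\ \forall PR_i \in PR \tag{8}
```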
When $m_{ai} = 1$, task $a$ is mapped to region $PR_i$ for execution, and the constraint is reduced to $r_a + RT_i \le s_a$, making the reconfiguration of task $a$ finish before its execution. If $m_{ai} = 0$, the constraint is reduced to $r_a + RT_i \le s_a + M_2$, which always holds since $M_2$ is a large positive constant integer.
The following constraints are introduced for all $a, b \in V$, $a \ne b$, to guarantee serial reconfiguration on the FPGA. When both $h_a$ and $h_b$ equal one, i.e., tasks $a$ and $b$ are both assigned to the FPGA, the constraints (9) and (10) ensure that the reconfiguration times of tasks $a$ and $b$ do not overlap.
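Constraints (9) and (10) are not reproduced in the text. One linear form consistent with the description is sketched below. It is written per region because the reconfiguration time of a task depends on the region it is mapped to; the constant $M$ and the arrangement of the relaxation terms are assumptions, and the published constraints may be written differently.

```latex
r_a + RT_i \le r_b + M \,\bigl( x_{ab} + (1 - m_{ai}) + (1 - h_b) \bigr), \qquad \forall PR_i \in PR \tag{9}
r_b + RT_i \le r_a + M \,\bigl( (1 - x_{ab}) + (1 - m_{bi}) + (1 - h_a) \bigr), \qquad \forall PR_i \in PR \tag{10}
```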
On each reconfigurable region, the execution and reconfiguration of different tasks should not overlap in the time domain, which is captured by the following constraints that should hold for all $a, b \in V$, $a \ne b$, and $PR_i \in PR$. When both $m_{ai}$ and $m_{bi}$ equal one, i.e., tasks $a$ and $b$ are both assigned to the FPGA reconfigurable region $PR_i$, the constraints (11) and (12) guarantee that the execution of task $a$ is carried out before the reconfiguration of task $b$ (when $x_{ab} = 0$) or that the execution of task $b$ is carried out before the reconfiguration of task $a$ (when $x_{ab} = 1$).
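Constraints (11) and (12) are also missing from the text. A big-M pair that behaves as described, given only as a plausible reconstruction with an assumed constant $M$, is:

```latex
s_a + ch_a \le r_b + M \,( x_{ab} + 2 - m_{ai} - m_{bi} ) \tag{11}
s_b + ch_b \le r_a + M \,( 1 - x_{ab} + 2 - m_{ai} - m_{bi} ) \tag{12}
```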
To model the resource constraint, we use the integer variable $N_h^k$ for each $PR_k \in PR$ and $h \in H$, denoting the number of resources $h$ of the reconfiguration region $PR_k$. The following equations are introduced to model the resource constraint.
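The two equations are not reproduced in the text. A reconstruction consistent with the explanation that follows is:

```latex
N_h^k \ge m_{vk} \cdot N_h^v, \qquad \forall v \in V,\ \forall PR_k \in PR,\ \forall h \in H \tag{13}
\sum_{PR_k \in PR} N_h^k \le TN_h, \qquad \forall h \in H \tag{14}
```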
where $TN_h$ is the total volume of resource $h$ on the FPGA. Constraint (13) makes sure that the resource number of each reconfiguration region is large enough to hold any task assigned to it. Constraint (14) guarantees that, for each kind of resource, the aggregate resource number of all reconfigurable regions does not exceed that provided by the platform.
The following constraint is introduced to bound the schedule length by the finish time of each exit task $a \in V_{exit}$.
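The constraint is missing from the text. Assuming the finish time of a task is its start time plus the execution time on the element it is mapped to, and that $sl$ is the quantity being minimized, it can be written as:

```latex
sl \ge s_a + h_a\, ch_a + (1 - h_a)\, cs_a, \qquad \forall a \in V_{exit} \tag{15}
```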

Comment
The variable and constraint complexity of the proposed MILP formulation can potentially be reduced using the structural information of the DAG. One critical piece of information encoded in the DAG is the data precedences (direct and indirect) among the tasks of the application. For an edge of the DAG, its source task and destination task have to be executed in a fixed order, as captured by the constraint (6); hence, the variables $z_{ab}$ for such pairs can be removed, together with the corresponding constraints. The same trick can be applied to pairs of tasks that are related by an indirect data precedence. Further, on the same reconfigurable region, the reconfiguration order of the tasks should also comply with the data precedence, which can be used to reduce the number of variables $x_{ab}$ and the corresponding constraints.
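The reduction described above amounts to keeping ordering variables only for task pairs that are not related by any direct or indirect precedence. A minimal sketch, with hypothetical function names and an invented four-task diamond graph, could look as follows.

```python
from itertools import combinations


def reachability(tasks, edges):
    """Return desc, where desc[v] is the set of all tasks reachable from v."""
    children = {v: [] for v in tasks}
    for src, dst in edges:
        children[src].append(dst)
    desc = {}

    def visit(v):
        # Memoized depth-first search; safe because the graph is acyclic.
        if v not in desc:
            desc[v] = set()
            for c in children[v]:
                desc[v] |= {c} | visit(c)
        return desc[v]

    for v in tasks:
        visit(v)
    return desc


def free_order_pairs(tasks, edges):
    """Pairs whose relative order is not already fixed by data precedence."""
    desc = reachability(tasks, edges)
    return [(a, b) for a, b in combinations(tasks, 2)
            if b not in desc[a] and a not in desc[b]]


# Invented diamond graph: a -> b, a -> c, b -> d, c -> d.
tasks = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(free_order_pairs(tasks, edges))  # [('b', 'c')]
```

In this example, only one of the six task pairs still needs an ordering variable, since every other pair is ordered by a path in the DAG.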

Multi-Step Hybrid Algorithm
The MILP-based approach described above makes it possible to find the optimal solution of the Sw/Hw partitioning and scheduling problem; however, this exact approach does not work well even for medium-scale problems, because its time complexity increases exponentially with the problem scale. Its poor scalability motivated us to develop a method with low time complexity that produces reasonably good solutions for the target Sw/Hw partitioning and scheduling problem.
The proposed method is a multi-step hybrid approach. The whole problem is divided into a set of sub-problems that are easier to solve than the original problem, and the sub-problems are solved one by one, with the partial result of the former one being the input of the next one. The approach combines a graph partitioning method, which divides the whole problem into a set of nested sub-problems, with the MILP method proposed above, which is applied to each sub-problem. The graph partitioning approach partitions the DAG of the application into a sequence of ordered nested sub-graphs, with the former sub-graph being contained in the next sub-graph. Hence, the problem is reduced to solving the sub-problem corresponding to each sub-graph generated by the graph partitioning approach. Since the former sub-graph is contained in the next sub-graph, the complexity of solving the sub-problems with MILP would increase with the order of the sub-problem. To reduce the complexity, part of the result of the former sub-problem is used as the input of the next sub-problem, thus reducing the problem scale. In the following, we first introduce the structure of the multi-step hybrid algorithm and then propose the priority-based graph partitioning method that it uses.

Algorithm Structure
Algorithm 1 illustrates the structure of the multi-step hybrid algorithm. First, the DAG of the application is partitioned, generating an ordered set of sub-DAGs ($\mathcal{G} = \{G_1, G_2, \ldots, G_{|\mathcal{G}|}\}$) that satisfies two requirements. The first condition is $G_i \subset G_{i+1}$: since $G_{i+1}$ is scheduled directly after $G_i$, the partial result yielded by scheduling $G_i$ can be used as the input of scheduling $G_{i+1}$, and nesting the sub-DAGs allows the scheduling result of the former sub-problem to be reused.
The second condition means that there is no communication edge directed from $G_j$ to $G_i$, where the tasks in $G_i$ are scheduled before those in $G_j$, as the proposed algorithm presents. Intuitively, it is better to schedule the ancestors in the DAG first (a task is an ancestor of its child tasks): they are more significant, since their schedules affect more child tasks of the application. Hence, this condition is imposed while partitioning the application.
Then, each sub-DAG $G_i \in \mathcal{G}$ is scheduled using the MILP approach with the partial schedule result $psr_{G_{i-1}}$ of the former sub-DAG $G_{i-1}$ as the input. Note that the sub-DAGs in $\mathcal{G}$ are scheduled in the order obtained by the graph partitioning method, and the first sub-DAG $G_1$ is scheduled without any partial schedule result. The partial schedule result is selected carefully: it includes the mapping of each task, the execution and reconfiguration order of all tasks on each independent processing element (microprocessor and FPGA region), and the inter-processor communication indicator $d_e$. However, the exact start times of task execution and reconfiguration and the resource volume of each FPGA region are not fixed by the former sub-problem, which leaves more solution space for performance optimization. Finally, scheduling the last sub-DAG $G_{|\mathcal{G}|}$ produces the result for the original problem.

Algorithm 1 MSHA ($G$, $P$)
1: partition the DAG $G$ into an ordered set of sub-DAGs $\mathcal{G} = \{G_1, G_2, \ldots, G_{|\mathcal{G}|}\}$
2: for each $G_i \in \mathcal{G}$, in order, do
3:   extract partial schedule result $psr_{G_{i-1}}$ for the sub-problem corresponding to $G_{i-1}$
4:   solve the Sw/Hw partitioning and scheduling problem of $G_i$ on $P$ with $psr_{G_{i-1}}$ as the input
5: end for
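The control flow of the multi-step hybrid algorithm can be sketched as a small driver loop. This is only an illustration under stated assumptions: the partitioner, the sub-problem solver, and the extraction of the partial schedule result are passed in as callables, and the three `demo_*` functions are hypothetical stand-ins (no MILP is solved here).

```python
def msha(dag, platform, partition, solve, extract_partial):
    """Solve nested sub-DAGs in order, feeding part of each result forward."""
    sub_dags = partition(dag)                 # step 1: ordered, nested sub-DAGs
    solution = None
    for sub_dag in sub_dags:                  # step 2
        # Step 3: the first sub-DAG has no predecessor, hence no partial result.
        partial = extract_partial(solution) if solution is not None else None
        solution = solve(sub_dag, platform, partial)   # step 4
    return solution                           # result for the last sub-DAG


# Hypothetical stand-ins, only to make the sketch executable.
def demo_partition(dag):
    return [{"t1", "t2"}, {"t1", "t2", "t3"}]


def demo_solve(sub_dag, platform, partial):
    # Reuse the mapping fixed by the former sub-problem, if there is one.
    mapping = dict(partial["mapping"]) if partial else {}
    for task in sorted(sub_dag):
        mapping.setdefault(task, platform[0])  # new tasks go to the first element
    start = {task: i for i, task in enumerate(sorted(sub_dag))}
    return {"mapping": mapping, "start": start}


def demo_extract(solution):
    # Keep the mapping but drop the exact start times, so they can be re-optimized.
    return {"mapping": solution["mapping"]}


result = msha(None, ["cpu", "region0"], demo_partition, demo_solve, demo_extract)
```

The stand-in `demo_extract` mirrors the idea of fixing only part of the former result: it forwards the mapping and leaves the start times free for the next sub-problem.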

Priority-Based Graph Partitioning
As described in the previous subsection, partitioning the original problem into a set of small sub-problems is a key step of the multi-step hybrid algorithm. We introduced the concept of the sub-DAG, each of which corresponds to a sub-problem. The sub-DAGs are nested one inside the next; hence, the result of the former sub-problem can be partly transferred to the next sub-problem, which reduces the complexity of solving it. In this subsection, we develop a priority-based graph partitioning method that partitions the original DAG into a set of nested sub-DAGs that meet our requirements.
To obtain the set of sub-DAGs while preserving the properties needed by the proposed multi-step hybrid algorithm, we order the tasks of the original DAG by priorities that respect the topological order of the DAG. We use two kinds of static priorities: the static bottom level (sbl) and the static top level (stl). The static bottom level of a task is the maximum length of all paths from the task to the exit task of the DAG, without considering the communication cost. Since execution on the FPGA matters most for performance (task execution on the microprocessor is 3-5 times slower than the Hw implementation on the FPGA [18]), the task cost on the FPGA is used for computing the priorities. A task with a high static bottom level may have more direct and/or indirect successors. Such a task is more important and should be scheduled earlier, i.e., be placed in a sub-DAG near the front of the ordered sub-DAG list $\mathcal{G}$. For two tasks $a$ and $b$ with $sbl_a > sbl_b$, either $a$ and $b$ are unconnected or $b$ is a direct or indirect successor of $a$, i.e., $a$ precedes $b$. The static top level of a task is the maximum length of all paths from the source task of the DAG to the task. A task with a large static top level may have more predecessors, and scheduling it earlier may reduce the finish time of the last task on the longest path originating from it.
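The two priorities can be computed recursively, as in the following minimal sketch. It assumes that `cost` holds the FPGA execution time of each task, that communication costs are ignored, and that the static top level excludes the cost of the task itself (a common convention, taken here as an assumption); the function name and the example DAG are hypothetical.

```python
from functools import lru_cache


def static_levels(succ, cost):
    """Return (sbl, stl) dictionaries for a DAG given as successor lists."""
    pred = {v: [] for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].append(v)

    @lru_cache(maxsize=None)
    def sbl(v):
        # Longest path from v to an exit task, including the cost of v.
        return cost[v] + max((sbl(w) for w in succ[v]), default=0)

    @lru_cache(maxsize=None)
    def stl(v):
        # Longest path from a source task to v, excluding the cost of v (assumption).
        return max((stl(u) + cost[u] for u in pred[v]), default=0)

    return {v: sbl(v) for v in succ}, {v: stl(v) for v in succ}


# Hypothetical diamond-shaped DAG with hypothetical FPGA costs.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 2, "b": 3, "c": 1, "d": 4}
sbl, stl = static_levels(succ, cost)
```

For this example, the entry task `a` has the largest static bottom level (9) and the exit task `d` the largest static top level (5), so `a` would be ordered first.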
Algorithm 2 gives the details of the priority-based graph partitioning (PBGP) method. The algorithm has two inputs: the DAG and the maximum task number $m$ of the sub-DAGs. The value of $m$ should be set according to the solving time of the MILP formulation: generally, the larger the value of $m$ is, the higher the time complexity. According to our experiments, $m \le 10$ works well. The PBGP algorithm first computes the sbl and stl of each task of the DAG recursively. Then, the tasks of the DAG are ordered by these two priorities. Tasks in the ordered list are then assigned to the sub-DAGs, as the first two for-loops show, after which edges are added to each sub-DAG. Note that an output edge of a task in a sub-DAG is assigned to the next sub-DAG that contains the destination task of the edge, in accordance with the first condition required to partition the DAG into sub-DAGs.
The sub-DAGs generated by Algorithm 2 comply with the conditions required by Algorithm 1. For $G_i$, its tasks and edges also belong to $G_{i+1}$, as the second and the third for-loops show; hence, the first condition is satisfied. Since the proposed algorithm orders the tasks of the DAG by static bottom level, no edge is directed from a later task to an earlier one in the ordered task list. While assigning tasks in the ordered task list to the sub-DAGs, a task with a smaller index in the task list is assigned to the sub-DAG with a smaller index, as the first for-loop shows; hence, the second condition is satisfied.

Algorithm 2 PBGP ($G$, $m$)
1: compute the static bottom level of each task of $G$
2: compute the static top level of each task of $G$
3: order the tasks of the DAG in non-increasing order of the static bottom level; for tasks with the same static bottom level, order them in non-increasing order of the static top level, thus obtaining the task list $V = v_1, v_2, \ldots, v_{|V|}$
4: for each $v_i \in V$ do
5:   assign $v_i$ to sub-graph $G_{\lfloor i/m \rfloor + 1}$
6: end for
7: for each $G_i \in \mathcal{G}$ do
8:   assign $G_{i-1}$ to sub-graph $G_i$ (suppose $G_0 = \emptyset$)
9: end for
10: for each $G_i \in \mathcal{G}$ do
11:   for each $v \in G_i$ do
12:     assign each input edge of $v$ to $G_i$
13:     for each $e \in outputEdge_v$ do
14:       if $dst(e) \in G_i$ && $e \notin G_i$ then
15:         assign $e$ to $G_i$
16:       end if
17:     end for
18:   end for
19: end for

To illustrate how the PBGP algorithm works, we take the application in Figure 2 with the parameters in Table 1 as an example. Table 2 shows the sbl and stl of each task of the DAG in Figure 2 and the sub-DAG(s) to which each task is assigned. Figure 4 shows the graph partitioning result of the PBGP algorithm with $m = 5$. As shown in the figure, the DAG is divided into two sub-DAGs, i.e., $G_1$ and $G_2$. Each task and edge in $G_1$ is contained in $G_2$, and no edge is directed from $G_2$ to $G_1$, showing that the proposed PBGP algorithm complies with the requirements of the graph partitioning approach. Note that this partition also leads to the same optimal result as the MILP approach.
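The partitioning steps can be sketched compactly as follows. The sketch assumes zero-based positions in the ordered task list, so that each sub-DAG receives at most $m$ new tasks, and it builds each sub-DAG as the sub-graph induced by its tasks, which is assumed to be equivalent to the edge assignment of Algorithm 2 when all predecessors of a task appear earlier in the list. The priority values and the DAG are hypothetical, and ties beyond sbl and stl are broken by input order.

```python
def pbgp(succ, sbl, stl, m):
    """Return the ordered task list and the nested sub-DAGs as (tasks, edges)."""
    # Non-increasing sbl first, then non-increasing stl.
    order = sorted(succ, key=lambda v: (-sbl[v], -stl[v]))
    n_groups = (len(order) + m - 1) // m
    sub_dags = []
    for k in range(n_groups):
        tasks = set(order[: (k + 1) * m])     # nested: G_k contains G_{k-1}
        edges = {(u, w) for u in tasks for w in succ[u] if w in tasks}
        sub_dags.append((tasks, edges))
    return order, sub_dags


# Hypothetical diamond-shaped DAG with hypothetical priority values.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
sbl = {"a": 9, "b": 7, "c": 5, "d": 4}
stl = {"a": 0, "b": 2, "c": 2, "d": 5}
order, sub_dags = pbgp(succ, sbl, stl, m=2)
```

With $m = 2$, the first sub-DAG contains tasks `a` and `b` with the single edge between them, and the second sub-DAG is the whole graph, so the first is nested in the second and no edge points back into it.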

Experiments and Results
In this section, we evaluate the performance of the proposed methods experimentally. The proposed methods are compared with the cluster-based MILP (cMILP) [10] combined with ISBA (independent set-based algorithm) [23], since cMILP solves a problem similar to that of this work. In the following, the target platform and the benchmarks used in the experiment are described; then, the experimental results are presented and discussed.

Target Platform Configuration
The target platform was composed of a CPU and a DPR-enabled FPGA, as illustrated in Section 3. Specifically, we targeted the Zynq-7000 series SoC platform, which provides an ARM Cortex-A9 CPU and a Xilinx XC7Z020 FPGA. Reconfiguration of the FPGA was carried out by the ICAP, which has a 32 bit input and output interface and is clocked at 100 MHz; hence, the reconfiguration speed reaches up to 3.2 Gbit/s.
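The peak reconfiguration speed follows directly from the interface width and the clock frequency, and it gives a lower bound on the reconfiguration time $t_{rec}$ of a partial bitstream of $B$ bits. The symbols $t_{rec}$ and $B$ and the 4 Mbit bitstream size are introduced here for illustration only and do not describe any measured bitstream.

```latex
\[
  32\ \text{bit} \times 100 \times 10^{6}\ \text{s}^{-1} = 3.2 \times 10^{9}\ \text{bit/s},
  \qquad
  t_{rec} \ge \frac{B}{3.2 \times 10^{9}\ \text{bit/s}}
\]
\[
  B = 4 \times 10^{6}\ \text{bit}
  \;\Rightarrow\;
  t_{rec} \ge \frac{4 \times 10^{6}}{3.2 \times 10^{9}}\ \text{s} = 1.25\ \text{ms}
\]
```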
For each application to be scheduled on the target platform, the resource volume of the FPGA was set to 50% or 70% of the aggregate resource requirement of the application.

Benchmark
We used a set of real-life applications [32,33] to test the performance of the proposed methods. These applications included Ferret, fast Fourier transform (FFT), Gauss elimination, Gauss-Jordan, the JPEG encoder, the Laplace equation, LU decomposition, parallel Gauss elimination, parallel mean value analysis (MVA), parallel tiled QR factorization, the quadratic equation solver, cyber shake, epigenomics, LIGO, montage, SIPHT, molecular dynamics, channel equalizer, modem, MP3 decoder block parallelism, time division-synchronous code division multiple access (TDSCDMA), and wireless local area network (WLAN) 802.11a receiver. The parameters of these applications are shown in Table 3. As shown in the table, the task number of these applications ranged from eight to 50.

Experimental Results
Table 4 shows the Sw/Hw partitioning and scheduling results for the set of real applications in Table 3 on the platform with the resource volume of the FPGA set to 70% of the aggregate resource requirement of the application. Table 4 compares the proposed MILP and multi-step hybrid methods with cMILP. The comparison covers two aspects, i.e., the schedule length (SL) and the time used to find the solution. Since the MILP-based method is hard to solve even for medium-scale problems, we set a timeout of 2 h for the MILP solver. When a timeout occurs, the best feasible solution found so far by the MILP solver is used as the final solution, and the time is reported as the timeout. When no timeout occurs, the optimal solution is found, and the time is that used to find the optimal result. The proposed MILP approach is competitive in terms of schedule length. As shown in the table, for 14 of the 22 applications, the MILP produces schedules that are better than or equal to those of the other two methods. Moreover, the proposed MILP approach is able to find the optimal solution of the Sw/Hw partitioning and scheduling problem.
As shown in the table, for all problem instances without a timeout, i.e., Gauss elimination, JPEG encoder, LU decomposition, parallel Gauss elimination, and parallel tiled QR factorization, the proposed MILP approach produces the Sw/Hw partitioning and scheduling solution with the smallest schedule length; e.g., for Gauss elimination, the SL of the MILP is 51, while those of the hybrid approach and cMILP are 53 and 72, respectively. Note that for the JPEG encoder, all three methods produce the optimal solution. For the problem instances in which a timeout occurs, the SL of the MILP approach may be worse than that of the other two methods owing to the large solution space to be searched. Moreover, the proposed MILP approach is not the fastest of the three methods for any problem instance. For the modem with 50 tasks, the MILP cannot find any feasible solution before the timeout, which demonstrates the complexity of the MILP.
Among the 22 applications, the multi-step hybrid approach produces the best solution among the three methods for nine applications, i.e., JPEG encoder, cyber shake, epigenomics, montage, SIPHT, molecular dynamics, channel equalizer, modem, and WLAN 802.11a receiver. Moreover, for seven applications, i.e., cyber shake, epigenomics, montage, SIPHT, molecular dynamics, modem, and WLAN 802.11a receiver, the multi-step hybrid approach produces the best solution in terms of both SL and time. In the other cases, the multi-step hybrid approach also balances SL and time well; e.g., for Ferret, it produces a schedule with an SL of 64 in only 4.180 s, while the MILP produces a better schedule but takes more than 2 h (timeout), and cMILP takes 3810.6 s to produce a schedule with an SL of 70, which is far worse than the hybrid approach.
The multi-step hybrid approach performs better than or equal to cMILP for 21 applications in terms of SL. Among these 21 applications, the hybrid approach is faster than cMILP for 17 applications, while for the other four applications, i.e., JPEG encoder, LU decomposition, parallel Gauss elimination, and channel equalizer, the hybrid approach is still competitive in terms of time. For these applications, the hybrid approach is slower than cMILP by several seconds at most, which is negligible; e.g., for channel equalizer, the time for cMILP is 0.470 s, while that for the hybrid approach is 2.264 s. Note that for the other three applications, the difference in time between the hybrid approach and cMILP is much smaller than for channel equalizer.
The competitiveness of cMILP in terms of SL can only be observed for three applications, i.e., JPEG encoder, LIGO, and MP3 decoder block parallelism. For JPEG encoder, the task number is only eight, and all three methods produce the optimal result in terms of SL in less than 1 s, which makes the competitiveness of cMILP negligible. For LIGO, cMILP finds a solution with an SL of 90 using more than 2 h (timeout), while the hybrid approach takes 9.477 s to yield a schedule with an SL of 93, showing that the hybrid approach achieves a better balance between SL and time used. For MP3 decoder block parallelism, cMILP produces a solution with an SL of 136 using more than 2 h (timeout), while the hybrid approach takes 9.477 s to yield a schedule with the SL being 143 using about 2 min. Though the SL of the hybrid approach is 4.8% worse than that of cMILP, its time is hundreds of times less than that of cMILP.
Table 5 shows the Sw/Hw partitioning and scheduling results for the set of real applications in Table 3 on the platform with the resource volume of the FPGA set to 50% of the aggregate resource requirement of the application. As in the former experiment, a timeout of 2 h is set for the MILP solver, and the best feasible solution found by the MILP solver is used as the solution when a timeout occurs.
Similar to Table 4, the data in Table 5 show the optimality of the proposed MILP approach and the advantage of the proposed multi-step hybrid method in achieving a good balance between schedule length and time. Among the 22 applications, MILP produces the best solution in terms of SL for 13 applications while taking the longest time among the three methods. The multi-step hybrid method is the best of the three methods in terms of SL for 14 applications and in terms of time for 11 applications. By contrast, cMILP produces the shortest SL for only one application, i.e., LIGO, and for this application, the multi-step hybrid method produces the same result in much less time.
Comparing Tables 4 and 5, we find that the same schedule performance can be obtained with fewer resources in some cases; e.g., for Gauss elimination, the MILP approach finds the optimal result in both Tables 4 and 5, since no timeout occurs, and the schedule length is 51 in both cases. Note that it is possible to use the proposed method to optimize the resource volume of the platform by iteratively executing the algorithm on the problem with various resource volumes. However, minimizing the resources of the platform while respecting the performance requirement of the application remains for further study.
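The iterative use mentioned above could look like the following minimal sketch, which tries candidate resource volumes from smallest to largest and returns the first one that meets a target schedule length. The function name, the `schedule_length` callable, and the 30% entry of the lookup table are hypothetical; only the values of 51 at 50% and 70% correspond to the Gauss elimination results discussed above.

```python
def smallest_sufficient_volume(fractions, schedule_length, target_sl):
    """Return the smallest resource fraction whose schedule meets target_sl."""
    for fraction in sorted(fractions):
        sl = schedule_length(fraction)   # e.g., one run of the scheduling method
        if sl is not None and sl <= target_sl:
            return fraction
    return None                          # no candidate meets the target


# Lookup table standing in for real scheduling runs (the 0.3 entry is hypothetical).
demo_sl = {0.3: 60, 0.5: 51, 0.7: 51}
best = smallest_sufficient_volume(demo_sl, demo_sl.get, target_sl=51)
```

A linear scan is used rather than a binary search because the schedule length returned by a heuristic or timed-out run is not guaranteed to decrease monotonically with the resource volume.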