
Optimizing Federated Scheduling for Real-Time DAG Tasks via Node-Level Parallelization

School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
* Author to whom correspondence should be addressed.
Computers 2025, 14(10), 449; https://doi.org/10.3390/computers14100449
Submission received: 11 September 2025 / Revised: 17 October 2025 / Accepted: 18 October 2025 / Published: 21 October 2025

Abstract

Real-time task scheduling in multi-core systems is a crucial research area, especially for parallel task scheduling, where the Directed Acyclic Graph (DAG) model is commonly used to represent task dependencies. However, existing research shows that resource utilization and schedulability ratios for DAG task set scheduling remain relatively low. Meanwhile, some studies have identified that certain parallel task nodes exhibit “parallelization freedom,” allowing them to be decomposed into sub-threads that execute concurrently. This presents a promising opportunity for improving task schedulability. Building on this, we propose an approach that jointly optimizes node parallelization and processor core allocation under federated scheduling. Simulation experiments demonstrate that, by parallelizing nodes, we can significantly reduce the number of cores required for each task and increase the percentage of schedulable task sets.

1. Introduction

Multi-core processors are widely employed in real-time systems, significantly enhancing application execution efficiency. DAGs serve as a representation of the complexity and parallelism of an application. Each node in the DAG is referred to as a subtask of the DAG task. The DAG task model is a universal framework in parallel task scheduling that effectively illustrates the temporal constraints between subtasks and provides an abstract structure describing the fundamental nature of most practical parallel tasks. A crucial characteristic of a DAG is its “longest path” (or critical path), which is the sequence of dependent subtasks whose total execution time is the longest; this path determines the minimum possible execution time for the entire DAG task.
DAG tasks are widely used in industrial applications. For example, a leading open-source autonomous driving framework can easily incorporate over 20 tightly integrated modules solely for the perception stack. This complex program model is described by a DAG. Due to this intricate structure and intensive computational demands, extensive system-wide optimization is required to ensure stable real-time execution [1]. The advantage of a DAG structure lies in its ability to clearly represent task dependencies, ensuring that tasks are executed in a valid order that respects predecessor and successor constraints, thereby reducing resource conflicts and execution delays. By parallelizing the nodes within a DAG, these applications can significantly improve processing efficiency and real-time performance. For example, in autonomous driving, parallel processing of multiple subtasks within the perception module can accelerate data processing, reduce latency, and ensure real-time responsiveness.
The development of parallel frameworks such as OpenCL [2] and OpenMP [3] has enabled the parallelization of subtasks within DAG tasks into multiple versions under computing resource constraints, a concept referred to as “parallelization freedom”. This term describes the flexibility to execute a given task using a variable number of parallel threads; for instance, a task can be run on one, two, or more cores, which in turn affects its execution time and resource requirements. For example, an object detection program using the Euclidean clustering algorithm can be executed with a chosen number of threads; however, parallelization incurs overhead that increases the total execution time of the task across its threads. Therefore, we can maximize system schedulability while meeting real-time and parallelization constraints by carefully selecting an appropriate “parallelization option” for each task, i.e., parallelizing each task with the optimal number of threads.
Research on parallelization has garnered attention due to the potential for maximizing task schedulability through a reasonable selection of the parallelization options. For instance, in the context of multi-threaded task models, parallelization freedom was applied to a fluid scheduling algorithm [4,5]. Parallelization freedom has also been applied in global Fixed Priority (FP) [6] and global Earliest Deadline First (EDF) [7,8] scheduling algorithms. Specifically for DAG parallel task models, Cho et al. [1] employed global EDF to handle the issue of DAG node parallelization and analyzed how appropriate parallelization can better meet the conditions for schedulability analysis under global EDF scheduling. The article in [1] also presented the relation between the total execution time of DAG nodes after parallelization and the parallelization overhead. Recently, different types of scheduling algorithms have been proposed for parallel real-time tasks modeled as DAG, among which federated scheduling [9] has shown its superiority in real-time performance. However, to our knowledge, no solution has been proposed for parallelizing the DAG task model under federated scheduling. Therefore, this paper proposes a parallelization algorithm under federated scheduling.
In federated scheduling [9], each high-density task (with the ratio of the worst-case execution time to the deadline greater than 1) is executed on dedicated processors, while all low-density tasks (with the ratio of the worst-case execution time to the deadline less than or equal to 1) are executed on the remaining processors. However, federated scheduling can lead to a significant waste of resources. Parallelization overhead will increase the number of cores allocated to high-density tasks. We use state-of-the-art federated scheduling based on long paths to further improve the task schedulability [10]. By analyzing the method of allocating cores to high-density tasks presented in [10], we find that assigning reasonable parallelization options to the nodes on the longest path of the high-density DAG tasks can also reduce the number of cores allocated to high-density tasks, thereby reducing resource waste. The impact of parallelization overhead is also present. Leveraging these characteristics, we develop a federated scheduling parallelization algorithm based on long paths.
Our algorithm applies to scenarios in which a node’s parallelization changes the longest path in a DAG task. Meanwhile, we experimentally verify the order of the nodes’ parallelization, and based on this order, we select the node on the longest path that can maximize the schedulability.
This paper is organized as follows: Section 2 introduces related work. Section 3 formally defines the structure of a DAG task and explains the problem. Section 4 introduces federated scheduling based on Graham’s bound algorithm for sporadic DAG tasks and the federated scheduling algorithm based on long paths. The extension of DAG task parallelism is then considered under these two federated scheduling algorithms, and an analysis is presented of the formula for the allocation of core numbers in high-density DAG tasks parallelized under the federated scheduling algorithm based on long paths. Section 5 proposes the algorithm for allocating parallelization options for DAG tasks and validates the order of nodes’ parallelization. Subsequently, Section 6 reports our experimental results. Section 7 summarizes the entire paper.

2. Related Work

The research presented in this paper is based on federated scheduling. Significant research has been conducted on this topic in recent years. For example, a semi-federated scheduling algorithm based on the federated scheduling algorithm was proposed in [11]. A new reservation-based joint scheduling method was proposed in [12] for federated scheduling of DAG tasks, and it was then used to schedule fragmented DAG tasks with arbitrary deadlines. A virtual federated scheduling algorithm for DAG tasks with constrained deadlines was proposed in [13]. Another federated scheduling algorithm based on long paths was proposed in [10].
The aforementioned research does not consider parallelization freedom. Parallelization freedom is allowed in modern parallel programming frameworks (such as OpenCL and OpenMP). Recently, Cho et al. [6,7] adopted the BCL schedulability analysis method in global FP and global EDF, respectively, for a multi-threading task model to allocate the optimal parallelization options to the tasks. In global fluid scheduling, Kwon et al. [4] proposed using parallelization for a multi-segment task model to determine and constrain the parallelization options for each segment and their respective deadlines. They utilized the relation between individual segment deadlines and the overall deadline to improve the schedulability. In subsequent fluid scheduling research, four controllable variables have been identified: parallelization options, artificial deadlines, offsets, and artificial cycles. A method was proposed to balance the density of the entire task system and the execution time of each task by adjusting these four parameters. By controlling the deadlines of the task through parallelization and subsequently controlling the offset of the next task, the method aims to minimize the cumulative density of the tasks after parallelization and ensure that each task is completed before its deadline, thereby enhancing the task schedulability [5]. Lastly, while the previously presented research on the DAG task model does not consider parallelization, Cho et al. [1] used EDF and BCL analyzable scheduling conditions in a study of the DAG task model to propose a node parallelization option algorithm.
It can be seen from the above review that research on federated scheduling of the DAG model and on parallelization freedom under scheduling algorithms has achieved some results. Federated scheduling shows superiority in real-time performance, and reasonable parallelization options can also increase the task schedulability. However, the existing studies have not investigated parallelization algorithms for DAG tasks under federated scheduling. It has been shown that the task schedulability conditions affected by DAG node parallelization can be further improved through reasonable parallelization [1]. Additionally, the latest results on federated scheduling based on long paths were presented in [10]. In this work, we apply DAG node parallelization to these results, assigning parallelization options to the nodes of high-density DAG tasks. This approach reduces the total length of the long paths, balances the impact of parallelization overhead, and subsequently reduces the resource waste caused by high-density tasks. As a result, it significantly improves the schedulability of the system.

3. System Model and Problem Description

We consider a DAG task set Γ composed of n sporadic DAG tasks {τ_1, τ_2, …, τ_n}, scheduled on M identical processors.
A DAG task is represented by a triplet τ_k = (G, D, T), where D denotes the relative deadline and T denotes the period, i.e., the minimum interval between the release times of two consecutive instances of τ_k. The graph G is represented by G = (V, E), where V is a collection of n_k nodes, denoted by V = {v_1, v_2, …, v_{n_k}}, and E is the set of directed edges between these nodes. An edge (v_i, v_k) ∈ E represents the precedence relation between v_i and v_k, signifying that v_k may only begin execution after v_i has finished. In this case, v_i is the predecessor of v_k, and conversely, v_k is the successor of v_i. A vertex v ∈ V represents a workload characterized by its worst-case execution time (WCET) c(v). A node without a predecessor is called a source node, and a node without a successor is called a sink node. A node is eligible for execution only after all of its predecessors have completed.
Considering the parallelization freedom of the nodes, each node can be parallelized into an ideal number of threads. In the example in Figure 1, node v_1 can select any value between 1 and O_max as its parallelization option, where O_max denotes the maximum resources available to the DAG task. In this case, selecting 2 as the node’s parallelization option, i.e., O_1 = 2, means that node v_1 is parallelized into two sibling threads with the same priorities, release times, and deadlines. These sibling threads are denoted by v_1(2) = {v_1^1(2), v_1^2(2)}. From this, we generalize a more common expression for sibling threads, parallelizing node v_i into O_i threads, resulting in

$$v_i(O_i) = \{v_i^1(O_i),\ v_i^2(O_i),\ \ldots,\ v_i^{O_i}(O_i)\}$$

where v_i^l(O_i) denotes the l-th sibling thread. The WCET of a node’s sub-thread is denoted by e(v_i^l(O_i)). This paper assumes that, after parallelization, each thread of a task has an identical execution time. This assumption is motivated by the widespread adoption of load-balancing strategies in modern parallel programming frameworks such as OpenCL and OpenMP, where tasks are divided into threads with equal workloads to maximize parallel efficiency. Furthermore, this assumption has been adopted in the field of real-time multi-core scheduling research [5], where scheduling analysis and optimization also build models of scheduling density and timing bounds on this premise. The total computation time of the O_i sibling threads is denoted by c(v_i(O_i)) = Σ_{l=1}^{O_i} e(v_i^l(O_i)). Due to the presence of parallelization overhead, the total computation of all sibling threads increases with the parallelization option.
Now, let us consider the definition of path length and the total WCET for DAG tasks with parallelization freedom. A path is defined as a sequence of nodes starting from a source node and ending at a sink node, with each consecutive pair of nodes connected by an edge. For example, in Figure 1, (v_1, v_3, v_5) is one possible path within G. A path of a DAG with parallelization freedom is denoted by λ. The length of λ is defined by

$$len(\lambda) = \sum_{v_i \in \lambda} e(v_i^1(O_i))$$

This is the sum of the WCETs of the longest threads of all nodes on the path. The longest path of τ_k is denoted by λ*, and G represents the actual structure of the DAG, including nodes and directed edges. The length of the longest path λ* in the DAG is defined as follows:

$$len(G) = len(\lambda^*) = \max_{\lambda \in G} len(\lambda)$$

The threads that constitute this path are referred to as the longest threads. However, as the threads of the DAG task nodes change, the path may change with variations in the execution time of the threads, and the longest threads may also change with the parallelization. After parallelization, the total execution time of all threads in G is referred to as the volume of G:

$$vol(G) = \sum_{v_i \in V} c(v_i(O_i))$$
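To make these definitions concrete, the following minimal Python sketch computes len(λ), len(G), and vol(G) for a toy single-threaded task; the node WCETs and the explicit path enumeration are illustrative assumptions, not data from Figure 1.

```python
def path_len(path, e):
    # len(lambda): sum of longest-thread WCETs e(v_i^1(O_i)) along the path
    return sum(e[v] for v in path)

def longest_path_len(paths, e):
    # len(G) = len(lambda*): maximum over all source-to-sink paths
    return max(path_len(p, e) for p in paths)

def volume(c):
    # vol(G): total work c(v_i(O_i)) summed over all nodes
    return sum(c.values())

# Toy single-threaded task (assumed values, for illustration only):
e = {"v1": 3, "v2": 5, "v3": 2, "v4": 4, "v5": 1}  # longest-thread WCETs
c = dict(e)                                        # single-threaded: c == e
paths = [["v1", "v2", "v5"], ["v1", "v3", "v5"], ["v4", "v5"]]
print(longest_path_len(paths, e), volume(c))       # 9 15
```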
We employ a federated scheduling algorithm to schedule n DAG tasks, where the scheduling cost is considered negligible or is included in the WCET. The problem is defined as follows.
  • Problem Definition: For the high-density tasks within a task set, our problem is to identify parallelization options for the nodes on the longest path so as to minimize the number of cores allocated to each high-density task. However, as node parallelization progresses, the longest path may change. We aim to design a strategy that allocates parallelization options to nodes on the changing longest path, thereby minimizing the number of cores allocated to each high-density task and enhancing schedulability under federated scheduling.

4. Extension of Federated Scheduling for DAG Tasks with Parallelization Freedom

In this section, we review federated scheduling. In the first subsection, we start with a review of federated scheduling based on Graham’s bound, followed by an introduction to the response time bound for DAG tasks under federated scheduling based on long paths and the method for allocating cores to high-density DAG tasks. In the second subsection, we first derive the condition for reducing the number of cores allocated to high-density DAG tasks with parallelization freedom under federated scheduling based on Graham’s bound. Subsequently, we derive the corresponding condition under federated scheduling based on long paths and provide a formula for core allocation. The main notations used in this paper are shown in Table 1.

4.1. Overview of Federated Scheduling Based on Graham’s Bound and Long Paths

We first review the method of federated scheduling [9]. Upon the arrival of a DAG, we classify the tasks into high-density tasks (C greater than D) and low-density tasks (C less than or equal to D). This paper focuses on high-density tasks (with the assumption that D is equal to its period T). To each task in the set of high-density tasks, we allocate a dedicated number of cores for its execution. Each high-density task  τ k can be executed independently on m processors, where m can be determined using the following formula:
$$m = \left\lceil \frac{C - L}{D - L} \right\rceil \qquad (1)$$
Low-density tasks are treated and executed as though they were sequential tasks, and any multiprocessor scheduling algorithm, such as partitioned EDF or global EDF, can be employed for them. Formula (1) ensures that any greedy parallel scheduler can be used to schedule the high-density DAG tasks; a greedy scheduler is one that never leaves cores idle when nodes of the DAG task are ready for execution. However, a significant portion of the processor resources is wasted during the execution of task τ_k under this allocation. Next, we introduce the response time bound for DAG tasks under federated scheduling based on long paths and the method for allocating cores to high-density DAG tasks.
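Formula (1) is a one-line computation; the following snippet checks it against the Figure 2a parameters quoted later in Section 4.2 (C = 14, L = 9, D = 11).

```python
import math

def cores_graham(C, L, D):
    # Formula (1): dedicated cores for a high-density task (requires D > L)
    return math.ceil((C - L) / (D - L))

print(cores_graham(14, 9, 11))  # 3, matching the example in Section 4.2
```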
The response time bound for a DAG with federated scheduling based on long paths is
$$R \le len(G) + \frac{vol(G) - \sum_{i=0}^{k} len(\lambda_i)}{m - k} \qquad (2)$$
After the list of generalized paths is given, the response time of the DAG can be written as
$$R \le \min_{pa \in [0, k]} \left\{ len(G) + \frac{vol(G) - \sum_{i=0}^{pa} len(\lambda_i)}{m - pa} \right\} \qquad (3)$$
We now introduce the concept of a generalized path. The subtasks in a generalized path can never execute during the same time unit in any execution sequence; that is, consecutive nodes on a generalized path are ordered by precedence constraints even when they are not connected by a direct edge. The generalized path thus evolves from the ordinary path: it may skip some vertices of a path while preserving the predecessor/successor ordering among the remaining nodes. For example, in Figure 1, (v_1, v_3, v_5) is both a path and a generalized path, while (v_1, v_5) is a generalized path but not an ordinary path. Algorithm 1 describes how to generate the generalized paths of a DAG task. Algorithm 1 is adapted from Reference [14]; however, we have made minor modifications to its details to align with the context of this study.
In Algorithm 1, k̄ + 1 denotes the number of generalized paths in the DAG, and k̄ denotes the index of the last generalized path. The value of k in Formula (3) cannot exceed min(k̄, m − 1); that is, k is chosen from [0, min(k̄, m − 1)] when evaluating Formula (3). In Algorithm 1, res(G′, λ) (the residual graph) retains the same structure as G′ but sets the WCETs of the nodes on λ to zero. Let G′ = G (line 1); while vol(G′) ≠ 0, the loop is entered (lines 2–6). In each iteration, the longest path of G′ is selected; then, the WCETs of the nodes on this path are set to zero, and G′ is updated. The loop continues until vol(G′) = 0. Therefore, vol(G) = Σ_{i=0}^{k̄} len(λ_i). Then, from Formula (3), the number of dedicated cores for a high-density DAG task can be derived as follows:
$$m(pa) = \begin{cases} \left\lceil \dfrac{C - \sum_{i=0}^{pa} L_i}{D - L} \right\rceil + pa, & pa < \bar{k},\ D > L \\[6pt] \bar{k} + 1, & pa = \bar{k} \end{cases} \qquad m = \min_{pa \in [0, \bar{k}]} m(pa) \qquad (4)$$
in which k̄ + 1 denotes the total number of generalized paths in a DAG task; this total is computed by Algorithm 1. By varying the value of pa in Formula (4), we can compute the number of cores required for each value of pa. We then select the case with the minimum number of cores and record the corresponding value of pa, where pa + 1 denotes the number of generalized paths required to attain this minimum via Formula (4). Through this analysis, multiple generalized paths can be used to determine the number of dedicated cores required for a high-density DAG task in federated scheduling, as well as the necessary number of generalized paths.
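The following sketch evaluates Formula (4) over all values of pa. The generalized-path lengths [9, 3, 2] are consistent with the Figure 2a example used later (the lengths must sum to vol(G) = 14, with L_0 = 9 and L_1 = 3), but are otherwise our own illustration.

```python
import math

def cores_long_paths(C, D, lengths):
    # Formula (4): lengths = [len(lambda_0), ..., len(lambda_kbar)],
    # with lambda_0 the longest path; returns (m, pa) minimizing m.
    k_bar = len(lengths) - 1
    best = (k_bar + 1, k_bar)                    # the pa = k_bar case
    for pa in range(k_bar):                      # the pa < k_bar cases
        m = math.ceil((C - sum(lengths[: pa + 1])) / (D - lengths[0])) + pa
        best = min(best, (m, pa))
    return best

print(cores_long_paths(14, 11, [9, 3, 2]))  # (2, 1): m = 2 with pa = 1
```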
Algorithm 1 Computing_Generalized_Path_List(G) [14]
Input: G = (V, E)
Output: (λ_i)_{i=0}^{k̄}
1: G′ ← G; i ← 0; k̄ ← −1; (λ_i)_{i=0}^{k̄} ← {}
2: while vol(G′) ≠ 0 do
3:     λ_i ← the longest path of G′
4:     k̄ ← k̄ + 1
5:     (λ_i)_{i=0}^{k̄} ← (λ_i)_{i=0}^{k̄} ∪ λ_i
6:     λ_i ← λ_i ∖ {v ∈ λ_i | c(v) of G′ is 0}
7:     G′ ← res(G′, λ_i); i ← i + 1
8: end while
9: return (λ_i)_{i=0}^{k̄}
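For reference, the following Python sketch transcribes Algorithm 1. It assumes the DAG is supplied as successor lists together with a topological order, and it stores each generalized path with its already-zeroed vertices skipped, as in line 6.

```python
def longest_path(succ, topo, w):
    # Max-total-WCET path via dynamic programming over a topological order.
    best_len = {v: w[v] for v in topo}
    best_next = {v: None for v in topo}
    for v in reversed(topo):
        for u in succ.get(v, []):
            if w[v] + best_len[u] > best_len[v]:
                best_len[v] = w[v] + best_len[u]
                best_next[v] = u
    v = max(topo, key=lambda x: best_len[x])
    path = []
    while v is not None:
        path.append(v)
        v = best_next[v]
    return path

def generalized_path_list(succ, topo, wcet):
    # Algorithm 1: repeatedly take the longest path of the residual graph
    # res(G', lambda) (same structure, zeroed WCETs) until vol(G') = 0.
    w = dict(wcet)
    paths = []
    while sum(w.values()) > 0:
        lam = longest_path(succ, topo, w)
        paths.append([v for v in lam if w[v] > 0])  # drop zeroed vertices
        for v in lam:
            w[v] = 0
    return paths
```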

4.2. Extension of Federated Scheduling Based on Graham’s Bound and Long Paths with Parallelization Freedom

We first derive the condition for decreasing the number of cores allocated to high-density DAG tasks with parallelization freedom under federated scheduling based on Graham’s bound. In Formula (1), which distributes the cores, L denotes the length of the longest path and C denotes the total WCET. To reduce the number of cores allocated to τ_k, it is evident from the formula that only L and C can be altered, with D being fixed. Due to parallelization overhead, splitting nodes on the longest path will increase C. However, if the parallelization overhead is sufficiently small, splitting can decrease the longest path length L. This leads to two cases:  
  • Case 1: In this case, nodes on the longest path are split, and the longest path remains unchanged. Since the nodes along the path are parallelized, the longest thread is reduced; hence, the total length of this path is shorter than before the parallelization.  
  • Case 2: In this case, nodes on the longest path are split, leading to a change in the longest path. However, since the new longest path is shorter than the previous longest path, the length of the longest path is still reduced.
For instance, as shown in Figure 2, the longest path before parallelization is (v_1, v_2, v_5). Assume a parallelization overhead of 0.2 and split v_2 into two threads. The parallelization overhead is α = (c(v_i(O_i + 1)) − c(v_i(O_i))) / c(v_i(O_i)); following the concept introduced in [1], it is defined as the ratio between the difference in the total WCETs before and after parallelization and the WCET before parallelization. The WCET after node parallelization is calculated according to the method described in [1], giving c(v_i(O_i + 1)) = (1 + α) c(v_i(O_i)), where α denotes the parallelization overhead. Therefore, the WCET becomes c(v_2(2)) = (1 + 0.2) × c(v_2) = 6, where c(v_2) = 5. The longest thread path becomes (v_1, v_2^1, v_5), which aligns with Case 1.
To focus on the core impact of the node parallelization sequence on scheduling performance, this paper models the parallelization overhead coefficient  α as a fixed value derived from a pessimistic estimation. While this simplified treatment does not capture the full dynamic characteristics of  α in real-world systems, it provides a clear baseline for theoretical analysis and algorithm design. It is important to note that, despite this simplification, the experimental data in Section 6 robustly demonstrate that the proposed algorithm still achieves a significant schedulability improvement under the assumption of a fixed  α . Looking forward, constructing a generalized model that treats  α as a stochastic variable or a function of the degree of parallelism to more accurately reflect system behavior represents a highly valuable direction for future work.
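Under this model, raising a node’s option from 1 to O multiplies its total work by (1 + α)^(O−1), and the equal-split assumption of Section 3 divides that work evenly across the O sibling threads; a minimal sketch:

```python
def split_node(c1, opt, alpha):
    # c(v(O)) via the recurrence c(v(O+1)) = (1 + alpha) * c(v(O))
    total = c1 * (1.0 + alpha) ** (opt - 1)
    return total, total / opt  # (total work, per-thread WCET)

# Figure 2 example from the text: c(v2) = 5, alpha = 0.2, two threads.
print(split_node(5, 2, 0.2))   # (6.0, 3.0)
```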
Based on the aforementioned analysis, it is necessary to parallelize the nodes on the longest path to reduce the length of the longest thread, L. Subsequently, we will discuss the impact of changes in L and C, after parallelization, on the changes in the number of cores. To analyze the trend of the change in Formula (1), we omit the ceiling function part of this formula. The new formula is as follows:
$$y = \frac{C - L}{D - L} \qquad (5)$$
When C increases, it is evident that y will also increase, leading to a tendency for m to also increase. Below, we illustrate the impact of a decrease in L on y.
Lemma 1. 
When C is greater than D, a decrease in L will lead to a tendency for m to decrease. The proof is as follows:
Proof. 
Taking the derivative of Formula (5) with respect to L, we obtain
$$\frac{dy}{dL} = \frac{-(D - L) + (C - L)}{(D - L)^2} = \frac{C - D}{(D - L)^2}$$
For high-density tasks where  C > D , the aforementioned expression is greater than zero, and y increases monotonically with L. Therefore, as L decreases, y also decreases, indicating a tendency for m to decrease too. C and L have opposite effects on y.    □
Lemma 2 illustrates the condition under which y decreases after node parallelization, which is the condition for a decreasing trend in m.
Lemma 2. 
After parallelization, if  τ k satisfies Formula (6), then y decreases.
$$\frac{C' - C}{L - L'} < \frac{C - D}{D - L} \qquad (6)$$
Proof. 
Let C become C′ and L become L′ after the next parallelization step:

$$y = \frac{C - L}{D - L}, \qquad y' = \frac{C' - L'}{D - L'}$$

Let y′ < y. Consequently, we have

$$\begin{aligned} &\frac{C' - L'}{D - L'} < \frac{C - L}{D - L} \qquad (C > D,\; D > L)\\ \Rightarrow\;& (C' - L')(D - L) < (D - L')(C - L)\\ \Rightarrow\;& C'D - C'L - DL' + LL' < CD - CL' - LD + LL'\\ \Rightarrow\;& C'D - C'L - DL' < CD - CL' - LD\\ \Rightarrow\;& (C' - C)D + D(L - L') < C'L - CL'\\ \Rightarrow\;& (C' - C)D + D(L - L') < C'L - CL + CL - CL'\\ \Rightarrow\;& (C' - C)D + D(L - L') < (C' - C)L + C(L - L')\\ \Rightarrow\;& (C' - C)(D - L) < (C - D)(L - L') \end{aligned}$$

This yields

$$\frac{C' - C}{L - L'} < \frac{C - D}{D - L}$$
   □
According to Lemma 2, if we aim to ensure a decreasing trend in the number of cores after node parallelization, the parameters before and after parallelization need to satisfy Formula (6). For instance, as shown in Figure 2a, before parallelization, the required number of cores calculated using Formula (1) is ⌈(14 − 9)/(11 − 9)⌉ = 3. Figure 2b depicts the graph after parallelizing node v_2, with a parallelization overhead of 0.2. The required number of cores, calculated using Formula (1), is ⌈(15 − 7)/(11 − 7)⌉ = 2. At this point, (15 − 14)/(9 − 7) = 0.5 < (14 − 11)/(11 − 9) = 1.5, which satisfies Formula (6).
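These numbers can be verified in a few lines (values quoted above from Figure 2a,b):

```python
import math

C, L, D = 14, 9, 11     # before parallelizing v2
Cp, Lp = 15, 7          # after parallelizing v2 (alpha = 0.2)

# Formula (6): (C' - C)/(L - L') < (C - D)/(D - L)
print((Cp - C) / (L - Lp) < (C - D) / (D - L))   # True
print(math.ceil((C - L) / (D - L)),              # 3 cores before
      math.ceil((Cp - Lp) / (D - Lp)))           # 2 cores after
```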
Next, we introduce the extension of federated scheduling based on long paths with parallelization freedom. Formula (4) gives the number of cores allocated to high-density DAG tasks under federated scheduling based on long paths. In that setting, each DAG node is considered single-threaded; here, we must consider the case with parallelization freedom.
Similarly, we select nodes on the longest path for parallel execution. Through parallelization freedom, each node can be parallelized into an ideal number of threads. As mentioned in Section 4.1, it is crucial to find an effective parallelization strategy that minimizes the number of cores calculated by Formula (4). Clearly, after identifying all generalized paths in the DAG, one must use Formula (4) to find the number of generalized paths that minimizes the value of m; it is not necessarily true that using more generalized paths leads to fewer required cores. Here, we assume that the minimum number of cores for a DAG is achieved with pa + 1 paths. Our algorithm also applies when pa = k̄, which will be discussed later. For the sake of analysis, let us define

$$m = \left\lceil \frac{C - \sum_{i=0}^{pa} L_i}{D - L} \right\rceil + pa$$

As in the analysis in Section 4.1, it is necessary to identify the parameters that can vary with node parallelization so as to produce a decreasing trend in the value of m. The values of pa and D are fixed. Unlike the analysis in Section 4.1, aside from the total WCET of the DAG, denoted by C, and the longest path length, denoted by L, which change with node parallelization, the sum Σ_{i=1}^{pa} L_i may also vary as nodes are parallelized. By eliminating the ceiling function and pa from the above expression, we can analyze the trend as follows:

$$y = \frac{C - L - \sum_{i=1}^{pa} L_i}{D - L} \qquad (7)$$
To analyze Formula (7), observe that y exhibits a decreasing trend as Σ_{i=1}^{pa} L_i increases. To illustrate how Σ_{i=1}^{pa} L_i changes, we consider three cases. Prior to the analysis, we set pa to 2 to facilitate the examination, meaning that m is minimized under three generalized paths.  
  • Case 1: After the node on the longest path is parallelized, the longest path becomes the second longest, and the second longest path becomes the longest. The third path remains unchanged. In this case, L and Σ_{i=1}^{pa} L_i both decrease.
  • Case 2: After the node on the longest path is parallelized, the nodes of the longest path remain unchanged, while the other two paths now run through sibling threads of the parallelized node, so their lengths increase. As a result, L decreases and Σ_{i=1}^{pa} L_i increases.
  • Case 3: After the node on the longest path is parallelized, the nodes of the longest path remain unchanged, and the nodes of the other two paths also remain unchanged. In this case, L decreases and Σ_{i=1}^{pa} L_i remains unchanged.  
The selection of the node parallelization order will be explained in Section 5.2.
Based on the analysis of the aforementioned three cases, we can see that Σ_{i=1}^{pa} L_i may increase or decrease depending on the relations between the lengths of the three paths. Importantly, this change is independent of the selection of the nodes for parallelization.
Let Σ_{i=1}^{pa} L_i = Z. Then, Formula (7) becomes

$$y = \frac{C - L - Z}{D - L} \qquad (8)$$

Through Formula (8), we can see that after node parallelization, the monotonic increase in C continuously impedes the decrease in y, while Z may either increase or decrease; an increase in Z is conducive to a decrease in y. To illustrate this, suppose C = 12, L = 5, Z = 4, and D = 10; then, from Formula (8), y = 3/5. After parallelization, suppose C′ = 12.5, L′ = 4, Z′ = 5, and D = 10; then, from Formula (8), y′ = 7/12, showing that y′ is less than y. Therefore, it is necessary to identify the conditions under which y′ < y, and based on Lemma 2, Lemma 3 is derived.
Lemma 3. 
After parallelization, y decreases when  τ k satisfies Formula (9).
$$\frac{(C' - Z') - (C - Z)}{L - L'} < \frac{(C - Z) - D}{D - L} \qquad (9)$$
Proof. 
Following the next parallelization step, C changes to C′, L changes to L′, and Z changes to Z′; hence,

$$y = \frac{C - L - Z}{D - L}, \qquad y' = \frac{C' - L' - Z'}{D - L'}$$

Let y′ < y:

$$\begin{aligned} &\frac{C' - L' - Z'}{D - L'} < \frac{C - L - Z}{D - L} \qquad (C > D,\; D > L)\\ \Rightarrow\;& (C' - Z' - L')(D - L) < (D - L')(C - Z - L)\\ \Rightarrow\;& (C' - Z')D - (C' - Z')L - DL' < (C - Z)D - (C - Z)L' - LD\\ \Rightarrow\;& (C' - Z' - C + Z)D + D(L - L') < (C' - Z')L - (C - Z)L'\\ \Rightarrow\;& (C' - Z' - C + Z)D + D(L - L') < (C' - Z')L - (C - Z)L + (C - Z)L - (C - Z)L'\\ \Rightarrow\;& (C' - Z' - C + Z)D + D(L - L') < (C' - Z' - C + Z)L + (C - Z)(L - L')\\ \Rightarrow\;& (C' - Z' - C + Z)(D - L) < (C - Z - D)(L - L') \end{aligned}$$

This yields

$$\frac{(C' - Z') - (C - Z)}{L - L'} < \frac{(C - Z) - D}{D - L}$$
   □
According to Lemma 3, to ensure that the number of cores decreases after the parallelization of the nodes, the parameters before and after parallelization need to satisfy Formula (9). For instance, as shown in Figure 2a, without parallelization and calculated according to Formula (4), the resulting pa is 1. This means that with two generalized paths, the required number of cores m is minimized, and m = ⌈(14 − 9 − 3)/(11 − 9)⌉ + 1 = 2. At this point, Formula (7) computes y as 1. Figure 2b presents the graph after the parallelization of node v_2, with a parallelization overhead of 0.2. According to Formula (7), y′ is calculated as 1.25. However, this time, ((15 − 3) − (14 − 3))/(9 − 7) = 1/2 > (14 − 3 − 11)/(11 − 9) = 0, which does not satisfy Formula (9). The selection of nodes on the longest path for parallelization will be introduced in Section 5.2.
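Again, the numbers can be checked directly; here Z denotes Σ_{i=1}^{pa} L_i, and the value Z′ = 3 after parallelization is implied by the reported y′ = 1.25:

```python
C, L, Z, D = 14, 9, 3, 11    # before parallelizing v2
Cp, Lp, Zp = 15, 7, 3        # after parallelizing v2 (alpha = 0.2)

# Formula (9): ((C'-Z') - (C-Z))/(L - L') < ((C-Z) - D)/(D - L)
lhs = ((Cp - Zp) - (C - Z)) / (L - Lp)   # 0.5
rhs = ((C - Z) - D) / (D - L)            # 0.0
print(lhs < rhs)                         # False: (9) is not satisfied

print((C - L - Z) / (D - L),             # y  = 1.0  (Formula (7))
      (Cp - Lp - Zp) / (D - Lp))         # y' = 1.25: y increases
```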
Following the above analysis, let the list of parallelization options for the DAG nodes be denoted by O_cur. The formula for calculating the number of cores after parallelization is then derived as Formula (10):

$$m = \left\lceil \frac{C(O_{cur}) - \sum_{i=0}^{pa} L_i(O_{cur})}{D - L(O_{cur})} \right\rceil + pa \qquad (10)$$

If the pa calculated according to Formula (4) equals k̄, the generalized paths encompass all nodes of the DAG; that is, when the number of generalized paths is k̄ + 1, the number of cores calculated by Formula (4) is minimized and C − Σ_{i=0}^{pa} L_i = 0. Since the number of generalized paths increases after parallelization, the case C(O_cur) − Σ_{i=0}^{pa} L_i(O_cur) = 0 does not occur, and Formula (10) remains applicable.

5. Parallelization Algorithm

In this section, we will introduce our parallelization algorithm, followed by a discussion of the node selection algorithm, and conclude with an analysis of the algorithm’s time complexity.

5.1. Parallelization Algorithm

Based on the above theoretical analysis, we now introduce the parallelization algorithm under federated scheduling based on long paths. The vertices selected for parallelization are those located on the longest path of the DAG task, as parallelizing them has the most significant impact. After a vertex is parallelized, the longest path is likely to change. Therefore, at each iteration, our algorithm selects a node on the new longest path to parallelize. All nodes start running in single-threaded mode, and the algorithm progressively increases the parallelization options.
Assume a task τ_k whose structure includes n_k nodes. We use O_cur to denote the list of parallelization options of the task’s nodes, leading to the following expression:

$$O_{cur} = [O_1, O_2, \ldots, O_{n_k}]$$

Here, O_i denotes the parallelization option of node v_i, and Θ denotes the collection of parallelization option lists for all tasks in the task set. A task begins its parallelization with all nodes set to O_cur = [1, 1, …, 1]. It is important to note that the maximum value of each element of O_cur is bounded by the currently calculated number of available cores. For example, if the currently calculated number of available cores is 4, then each element of O_cur is limited to a maximum of 4. A potential issue is illustrated by the following example: suppose O_2 = 4, and the number of dedicated cores for the DAG as calculated from O_cur is 3; then the maximum limit would be 3, making O_2 = 4 infeasible. However, if O_2 were not equal to 4, the calculated number of cores would not be 3, leading to a contradiction. Our algorithm calculates the number of cores after parallelization without such contradictions, as demonstrated below.
In the task set Γ, let the set of high-density tasks be denoted by Γ_h, and let M_h denote the total number of cores allocated to high-density tasks; τ_k is a task within Γ_h. We use Arr_dex to denote the list of indices of the nodes on the longest path of the DAG. For example, if the longest path is (v_1, v_2, v_3), then Arr_dex is [0, 1, 2]. Algorithm 2 presents our parallelization algorithm, which allocates node parallelization options for high-density tasks. It determines the list of node parallelization options for all tasks in the task set, as well as the number of dedicated cores allocated after parallelizing the high-density task set. Initially, O_cur of each τ_k is initialized (lines 1–3). Using Algorithm 1, we identify all generalized paths of the current τ_k and then calculate the number of dedicated cores m and the required number of long paths pa + 1 based on Formula (4) (line 5). Parallelization is considered for task τ_k only if the calculated number of cores m is greater than 2 (lines 6–7). This is because if m is 1, there is no need for parallelization. If m = 2, parallelization would either leave m unchanged or reduce it to 1; if reduced to 1, the maximum parallelization limit becomes 1, contradicting the fact that some node of τ_k has a parallelization option of 2. Thus, for m = 2, there is no need for parallelization, and m is directly added to M_h (line 35). Within lines 8–30, the for loop is initiated, with each iteration subject to a limit O_limit that starts from 2 and iterates up to the initially calculated core count. Under each limit, we parallelize to obtain the minimum value of m that is not less than the limit. Eventually, the minimum m across all limits is selected and m is updated; then, the current O_cur is updated and set equal to the temporary list of node parallelization options, O_pre (lines 26–28). Under a given limit, Arr_dex is initialized. According to Node_Selection (Algorithm 3), the node indexed by dex is to be parallelized, and the temporary node parallelization option list O_pre is prepared (lines 9–11). Subsequently, a while loop is entered (lines 12–29), where we determine the minimum number of cores m and the list of parallelization options O_cur under the current parallelization limit. At the start of each iteration, Arr_dex is updated, and, using Algorithm 1, all generalized paths of τ_k(O_pre) are obtained, along with the lengths of the first pa + 1 generalized paths (line 13). If O_pre[dex] (the (dex + 1)-th element of O_pre) has reached the current O_limit, the value of dex is updated using Node_Selection (Algorithm 3); if dex changes, then O_pre[dex] = O_pre[dex] + 1, and otherwise the loop exits. If O_pre[dex] has not reached the limit, the value of dex is updated using Node_Selection (Algorithm 3), and O_pre[dex] = O_pre[dex] + 1 (lines 14–23). Subsequently, all generalized paths of τ_k(O_pre) are calculated using Algorithm 1, and the lengths of the first pa + 1 paths are computed to determine the new number of cores (line 24). If the new number of cores is less than m and greater than or equal to O_limit, then m and O_cur are updated (lines 25–27). At this point, every parallelization option is less than or equal to O_limit, and the calculated number of cores is greater than or equal to O_limit, thereby resolving the aforementioned contradiction. 
Finally, the list Θ containing the parallelization options for all tasks in the task set and the number of cores M_h for the high-density tasks are returned.
The pseudocode is illustrated using Figure 1. According to [5], after task parallelization, the execution times of all threads are considered equal. According to [1], with parallelization overhead α, the total execution time of a node satisfies c(v_i(O_i + 1)) = (1 + α) c(v_i(O_i)). Assuming α = 0.2, we first use Formula (4) to calculate pa = 1 and m = 3. Entering the loop (lines 8–30), O_limit ranges over the closed interval from 2 to 3. First, O_limit = 2. According to Algorithm 3, a node is selected from the longest path (v_1, v_2, v_5). After the parallelization of node v_1, we have C = 15.6, L = 7.8, and Σ_{i=1}^{pa} L_i = 4.8; Formula (7) gives (15.6 − 7.8 − 4.8)/(11 − 7.8) = 0.94. For node v_2, after parallelization, C = 16, L = 7, and Σ_{i=1}^{pa} L_i = 3; Formula (7) gives (16 − 7 − 3)/(11 − 7) = 1.50. After the parallelization of node v_5, we have C = 15.2, L = 8.6, and Σ_{i=1}^{pa} L_i = 3.6; Formula (7) gives (15.2 − 8.6 − 3.6)/(11 − 8.6) = 1.25. The result corresponding to v_1 is the smallest, so the initial node index dex = 0, which corresponds to node v_1 in Figure 1, is selected. This node is parallelized into two threads, each with an execution time of 1.8. At this point, the required m_cur = ⌈(15.6 − 7.8 − 4.8)/(11 − 7.8)⌉ + 1 = 2, and since m_cur meets the condition on line 26, m = 2 and O_cur = [2, 1, 1, 1, 1]. After parallelizing v_1, the new longest path in the DAG becomes (v_1^1, v_2, v_5), and Arr_dex is [0, 1, 4] (line 13), at which point O_pre[0] = 2 meets the current limit requirement (line 14). Re-selecting the node index with Algorithm 3: parallelizing v_2 on top of the parallelized v_1 gives C = 16.6, L = 5.8, and Σ_{i=1}^{pa} L_i = 4.8; Formula (7) gives (16.6 − 5.8 − 4.8)/(11 − 5.8) = 1.154. Parallelizing v_5 on top of the parallelized v_1 gives C = 15.8, L = 7.4, and Σ_{i=1}^{pa} L_i = 5.4; Formula (7) gives (15.8 − 7.4 − 5.4)/(11 − 7.4) = 0.833. Since the result for v_5 is smaller, node v_5 is selected, so dex = 4. The required m_cur = ⌈(15.8 − 7.4 − 5.4)/(11 − 7.4)⌉ + 1 = 2, which does not satisfy the condition on line 26 (m_cur is not less than the current m = 2); thus, the while loop continues. After parallelizing v_1 and v_5, the new longest path in the DAG is (v_1^1, v_2, v_5^1), and Arr_dex is [0, 1, 4] (line 13). At this point, O_pre[4] = 2 meets the current limit requirement (line 14), which only allows the selection of node v_2 (line 15), setting dex = 1. The required m_cur = ⌈(16.8 − 5.4 − 5.4)/(11 − 5.4)⌉ + 1 = 3 does not satisfy the condition on line 26; thus, the while loop continues. However, since all nodes on the longest path have reached O_limit, the loop exits. Within the loop where O_limit = 2, m is updated to 2 and O_cur = [2, 1, 1, 1, 1]. The value of O_limit is then updated to 3, and the loop continues, iteratively updating the smallest value of m and the current O_cur until all limits have been explored. The final value of m is the new number of cores allocated to the high-density task after parallelization. If m equals the number of cores allocated without parallelization, parallelization is not applied to this high-density task (lines 31–32).

5.2. Node Selection Method

This section introduces the parallelization order of the nodes on the longest path of the DAG, that is, how to select the node on the longest path for parallel processing. We first analyze the federated scheduling method based on Graham’s bound. Let O_cur = [O_1, O_2, …, O_{n_k}] denote the parallelization option list of the DAG task nodes, where each element denotes the parallelization option of the correspondingly indexed node. For a task’s longest path, the choice of which node to parallelize determines how much the newly calculated number of cores can be reduced. To illustrate this, suppose all nodes have a parallelization option of 1 before parallelization, as shown in Figure 2. We choose nodes v_2 and v_5, which have the maximum and minimum execution times, respectively, for parallel execution. Before parallelization, both nodes have a parallelization option of 1. After parallelization, v_2 remains on the longest thread path; with a parallelization overhead of 0.2, the result, as shown in Figure 2b, is a longest path length of L(O_cur) = 7 with C(O_cur) = 15. The calculation (15 − 14)/(9 − 7) = 0.5 < (14 − 11)/(11 − 9) = 1.5 satisfies Formula (6), and using Formula (5) we calculate y_1 = (15 − 7)/(11 − 7) = 2 < y = (14 − 9)/(11 − 9) = 2.5. Similarly, after parallelization, v_5 remains on the longest thread path, and the result, as shown in Figure 2c, is a longest path length of L(O_cur) = 8.6 with C(O_cur) = 14.2. The calculation (14.2 − 14)/(9 − 8.6) = 0.5 < (14 − 11)/(11 − 9) = 1.5 satisfies Formula (6), and using Formula (5) we calculate y_2 = (14.2 − 8.6)/(11 − 8.6) ≈ 2.3 < y = 2.5.
Following the above analysis, where y_2 > y_1, we observe that parallelizing v_2 produces the better effect. Furthermore, the results from using Formula (6) remain unchanged whether v_2 or v_5 is parallelized, under the supposition that the initial parallelization options for v_2 and v_5 are both 1. This raises the question of whether the conclusion drawn from the above analysis still holds if the initial parallelization options for both are equal but take values other than 1. Suppose the initial parallelization options for both are x, with the total WCET and the longest path length of the DAG task denoted by C and L, respectively. Assuming idealized parallelism, where the total WCET is evenly distributed across the threads, suppose the parallelization option for v_2 is x with each thread’s WCET being 5, and the option for v_5 is x with each thread’s WCET being 1. The parallelization overhead is α. After increasing the parallelization option of v_2 by 1, and considering the impact of the parallelization overhead on the WCET of a DAG node, the total WCET of the node changes from 5x to 5x(1 + α). After parallelization, the execution time per thread becomes 5x(1 + α)/(x + 1); therefore, the length of the longest path is reduced by 5 − 5x(1 + α)/(x + 1). The left-hand side of Formula (6) then evaluates as follows:

$$\frac{5x(1+\alpha) - 5x}{5 - \dfrac{5x(1+\alpha)}{x+1}} = \frac{x\alpha(x+1)}{1 - x\alpha}$$

After increasing the parallelization option of v_5 by 1, the corresponding calculation for Formula (6) would be as follows:

$$\frac{x(1+\alpha) - x}{1 - \dfrac{x(1+\alpha)}{x+1}} = \frac{x\alpha(x+1)}{1 - x\alpha}$$
It can then be seen that when the initial parallelization options of v_2 and v_5 are equal, they can be set to any arbitrary value and Formula (6) yields the same result for both nodes.
Consequently, if the initial parallelization options of v_2 and v_5 are equal (set to any arbitrary value) and the longest path remains unchanged, one might question whether parallelizing the node with the longer execution time still yields the better outcome. We introduce Lemma 4 to address this question.
Lemma 4. 
When the parallelization options for two nodes are equal (but an arbitrary value), selecting the node with a longer execution time does not necessarily result in a greater reduction in the number of cores required.
Proof. 
Let the total WCET and the longest path length of the DAG task be denoted by C and L, respectively, with relative deadline D. Suppose the longest path length is reduced by ΔL due to node parallelization. Based on the aforementioned evidence, parallel processing of any nodes with identical parallelization options yields the same value from Formula (6), which expresses the ratio of the change in the DAG’s total WCET to the change in the longest path length. Let this ratio be z. Then, the total WCET increases by ΔL · z. Therefore, Formula (5) can be rewritten as follows:

$$Y = \frac{C - L + \Delta L \cdot z + \Delta L}{D - L + \Delta L} \qquad (11)$$
Taking the derivative of Formula (11) with respect to ΔL yields

$$Y' = \frac{z(D - L) + D - C}{(D - L + \Delta L)^2} < \frac{(C - D) + D - C}{(D - L + \Delta L)^2} = 0 \qquad (12)$$

where the inequality holds because, by Formula (6), z < (C − D)/(D − L). Therefore, Y decreases monotonically in ΔL: the larger ΔL is, the smaller Y becomes. Let the current parallelization option of a node be x and the execution time per thread of this node be b, with parallelization overhead α. After increasing the parallelization option by 1, the execution time per thread becomes

$$\frac{bx(1+\alpha)}{x+1}$$

Thus, the reduction ΔL in the longest path length is

$$\Delta L = b - \frac{bx(1+\alpha)}{x+1} \qquad (13)$$

Taking the derivative of Formula (13) with respect to b yields

$$\frac{d(\Delta L)}{db} = \frac{1 - x\alpha}{x + 1} \qquad (14)$$
Formula (14) is not strictly greater than 0, indicating that Formula (13) does not strictly increase with respect to b. Therefore, it is not necessarily true that a larger single-thread execution time results in a greater reduction in the longest path length. The current parallelization option of the node also plays a crucial role. Hence, choosing a node with a longer execution time does not necessarily lead to a significant decrease in the computed number of cores.    □
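A quick numeric illustration of this point, using Formula (13) with assumed values of b, x, and α:

```python
def delta_L(b, x, alpha):
    # Formula (13): reduction in the longest-path length when a node with
    # per-thread WCET b and current option x gains one more thread.
    return b - b * x * (1 + alpha) / (x + 1)

# With alpha = 0.4: at x = 1 the longer thread shrinks the path more, but
# at x = 3 (where x * alpha > 1) a further split lengthens the path
# regardless of b, so thread length alone does not decide the choice.
print(delta_L(5, 1, 0.4), delta_L(1, 1, 0.4))   # 1.5   0.3
print(delta_L(5, 3, 0.4), delta_L(1, 3, 0.4))   # -0.25 -0.05
```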
Lemma 4 informs us that selecting nodes with longer execution times does not necessarily lead to a larger decrease in the required number of cores: the current parallelization option of the node must also be considered. Moreover, parallelizing the node with a longer thread time increases the likelihood of altering the longest path. The above analysis is based on the assumption that the longest path remains unchanged. If selecting the node with a longer thread time for parallelization results in a change to the longest path, this could lead to a relatively smaller variation in the longest path length. However, choosing the node with a shorter thread time is still not advisable. Assuming that the current parallelization options are identical for the node with a shorter thread time and the node with a longer thread time, when the longest path remains unchanged after parallelization, the impact of the node with a shorter thread time is not as significant as that of the node with a longer thread time.
The analysis presented above is based on federated scheduling using Graham’s bound. Under federated scheduling based on long paths, additional paths must be considered. When parallelizing nodes on the longest path, those with larger thread times are more likely to change the longest path, and the decreasing trend in the calculated number of cores is not necessarily more significant; the sum of the other required path lengths may increase, decrease, or remain unchanged. Nodes with shorter thread times reduce the likelihood of changes to the longest path. By analyzing the monotonicity of Formula (13), it can be concluded that when the longest path remains unchanged, for a given parallelization overhead and within a certain range of parallelization options, selecting nodes with shorter thread times can lead to a greater reduction in the number of cores. The impact of selecting nodes with either shorter or longer thread times is therefore uncertain, so we employ five different node selection strategies to determine the best choice:
  • Select the node that minimizes the outcome of Formula (7), as shown in Algorithm 3.
  • Based on Algorithm 3, the selection criteria have been modified so that a node is selected only if it meets the condition of Formula (9). Subsequently, select the node that minimizes the result of Formula (7).
  • If the current node’s option does not reach the limit, the node index remains unchanged; otherwise, select a node according to its topological order in the longest path.
  • In the longest path, select the node with the longest single-thread execution time among the nodes where the parallelization option has not reached its limit.
  • In the longest path, among the nodes where the parallelization option has not reached its limit, select the node with the shortest single-thread execution time.
Algorithm 2 Parallelization_Algorithm(Γ_high)
Input: the high-density task set Γ_high
Output: parallelization option combination Θ, the number of cores required for high-density tasks M_h
1:  for each τ_k ∈ Γ_high do
2:      O_cur = [1, 1, …, 1]
3:  end for
4:  for each τ_k ∈ Γ_high do
5:      pa, m ← Formula (4)
6:      if m > 2 then
7:          m_min ← m
8:          for O_limit ∈ [2, m_min] do
9:              Update Arr_dex
10:             O_pre = [1, 1, …, 1]
11:             dex ← Node_Selection(dex, O_pre, O_limit, Arr_dex)
12:             while True do
13:                 Update Arr_dex; use Algorithm 1 to get the generalized paths after node parallelization
14:                 if O_pre[dex] = O_limit then
15:                     x ← dex; dex ← Node_Selection(dex, O_pre, O_limit, Arr_dex)
16:                     if dex = x then
17:                         break
18:                     else
19:                         O_pre[dex] = O_pre[dex] + 1
20:                     end if
21:                 else
22:                     dex ← Node_Selection(dex, O_pre, O_limit, Arr_dex); O_pre[dex] = O_pre[dex] + 1
23:                 end if
24:                 Use Algorithm 1 to get the parallelized generalized paths
25:                 m_cur ← Formula (10); y ← Formula (7)
26:                 if m_cur < m and m_cur ≥ O_limit and y > 0 then
27:                     m ← m_cur; O_cur ← O_pre
28:                 end if
29:             end while
30:         end for
31:         if m_min = m then
32:             O_cur = [1, 1, …, 1]
33:         end if
34:     end if
35:     Θ ← Θ ∪ O_cur; M_h ← M_h + m
36: end for
37: return Θ, M_h
We denote the above five selection strategies as N-S-Min, N-S-Condition-Min, N-S-Topology, N-S-Descend, and N-S-Ascend, and we compared their acceptance ratios on a 16-core system, conducting experiments with parallelization overheads of 0.1, 0.2, and 0.4. Figure 3 displays the scheduling results for these selection methods. From the results in Figure 3, it can be seen that the first selection method achieves the best scheduling outcome in the majority of cases. Furthermore, as the parallelization overhead increases, the acceptance ratios of the first two selection methods tend to converge; however, the acceptance ratio of the first method remains slightly higher. This first method aligns with the characteristics of a greedy algorithm. Therefore, we choose Algorithm 3 for node selection in each iteration.
Algorithm 3 Node_Selection(dex, O_pre, O_limit, Arr_dex)
Input: dex, O_pre, O_limit, Arr_dex
Output: dex
1:  initialize y_min ← ∞
2:  for i in Arr_dex do
3:      O_pt ← Copy(O_pre)
4:      if O_pt[i] ≠ O_limit then
5:          O_pt[i] ← O_pt[i] + 1
6:          Use Algorithm 1 to get the parallelized generalized paths
7:          y ← Formula (7)
8:          if y < y_min and y > 0 then
9:              y_min ← y; dex ← i
10:         end if
11:     end if
12: end for
13: return dex
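A compact Python rendering of Algorithm 3 is given below; evaluate_y is a stand-in callback (our naming) that re-derives the generalized paths via Algorithm 1 for the tentative option list and returns the value of Formula (7).

```python
def node_selection(dex, O_pre, O_limit, Arr_dex, evaluate_y):
    # Try incrementing each longest-path node's option by one and keep
    # the candidate that minimizes Formula (7); nodes already at O_limit
    # are skipped, and dex is returned unchanged if nothing qualifies.
    y_min = float("inf")
    for i in Arr_dex:
        if O_pre[i] == O_limit:
            continue
        O_try = list(O_pre)
        O_try[i] += 1
        y = evaluate_y(O_try)   # Formula (7) after re-deriving the paths
        if 0 < y < y_min:
            y_min, dex = y, i
    return dex
```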

5.3. Time Complexity of Algorithm

This section analyzes the time complexity of Algorithm 2. The outermost loop iterates once over the tasks (lines 4–36). For n high-density tasks, each high-density task has n_k nodes, and the number of cores dedicated to each task is denoted by m_k. The inner loop iterates from 2 to m_k (lines 8–30); however, not all tasks can be parallelized, so the time complexity at this point is less than O(n · m_k). Under a given limit d_k ∈ [2, m_k], after entering the while loop (lines 12–29), the number of nodes on the longest path is less than n_k, so the time complexity of Algorithm 3 is less than O(n_k). The maximum parallelization limit for the nodes is d_k, but not every node needs to be parallelized, nor can every node be parallelized to the maximum limit. Therefore, the time complexity of the while loop is less than O(d_k · n_k²), and the overall upper bound of the time complexity of Algorithm 2 is O(d_k · n_k² · n · m_k).

6. Experimental Results

In this section, we provide a comprehensive evaluation of the proposed algorithm. Initially, in the first segment, we conduct a comparative analysis between our parallelization strategy and alternative parallelization methods in terms of scheduling performance for an identical task set. Following this, in the second segment, we compare our algorithm with existing algorithms in real-time scheduling to determine its effectiveness and performance.

6.1. Compared to Other Parallel Methods

We generate a DAG task set with the following parameter settings for each DAG task τ_k = (G, D, T): (1) The number of nodes, n_k, is randomly chosen from the range 3 to 10. (2) Implicit deadlines are considered, where D = T and D is calculated as β(C − L) + L, with L denoting the length of the longest path, C denoting the volume of the DAG, and β being a parameter. According to Formula (1), the maximum number of cores required for a task is then ⌈1/β⌉. The number of cores dedicated to a high-density task is set between 2 and 40, so β is assumed to lie within the range of 0.025 to 0.5. (3) The WCET of each node in τ_k is selected from the range 200 to 900. (4) Based on [1], the parallelization overhead is α, and the total execution time of a node satisfies c(v_i(O_i + 1)) = (1 + α) c(v_i(O_i)). (5) According to [5], after parallelizing a node of a DAG task, the execution times of all threads are assumed equal, i.e., e(v_i^1(O_i)) = e(v_i^2(O_i)) = ⋯ = e(v_i^{O_i}(O_i)). (6) Each pair of nodes within τ_k is connected with probability pr = 0.3, forming the set of edges E.
To ensure the reproducibility of our simulation study, we generated synthetic DAG task sets with the following principled parameter choices:
  • The number of nodes per DAG,  n k , was uniformly sampled between 3 and 10 to reflect typical parallel task complexity.
  • Node WCETs were drawn from  [ 200 , 900 ] to model heterogeneous computational loads.
  • Implicit deadlines were set as  D = T = β ( C L ) + L , where  β [ 0.025 , 0.5 ] controls task density and core allocation bounds.
  • Edge connectivity probability  p r = 0.3 was chosen to balance sparsity and dependency richness.
  • Parallelization overhead  α was varied as 0.1, 0.2, and 0.4 to assess robustness under different overhead regimes.
These settings align with common practices in the real-time systems literature and allow for systematic evaluation of schedulability under controlled yet realistic conditions.
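The following Python sketch implements the settings listed above; the adjacency-list representation and function signature are our own illustrative choices, not the authors' implementation.

```python
import random

def generate_dag_task(beta, n_min=3, n_max=10, wcet_lo=200, wcet_hi=900, pr=0.3):
    """Generate one DAG task (G, D, T) under the parameter settings above."""
    n = random.randint(n_min, n_max)                  # number of nodes n_k
    wcet = [random.randint(wcet_lo, wcet_hi) for _ in range(n)]
    # Edges only from lower to higher indices, so the graph stays acyclic.
    succ = {i: [j for j in range(i + 1, n) if random.random() < pr]
            for i in range(n)}
    C = sum(wcet)                                     # volume of the DAG
    # Longest path via dynamic programming in reverse topological order.
    longest = [0] * n
    for i in reversed(range(n)):
        longest[i] = wcet[i] + max((longest[j] for j in succ[i]), default=0)
    L = max(longest)
    D = beta * (C - L) + L                            # implicit deadline, T = D
    return {"wcet": wcet, "succ": succ, "C": C, "L": L, "D": D, "T": D}
```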
To construct a task set, we first initialize an empty task set, denoted by Γ. Using the above method, DAG tasks are generated and added to Γ. The necessary condition for the schedulability of the task set is then checked: ∑_{τ_k ∈ Γ} vol(τ_k)/T_k < m. If this condition is satisfied, Γ becomes a valid task set; otherwise, it is regenerated.
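Building on the generator sketch above, the admission check can be written directly; regenerating the whole set on failure and sampling β per task are simplifications of the procedure described in the text.

```python
import random  # reuses generate_dag_task from the previous sketch

def generate_task_set(m, n_tasks, beta_lo=0.025, beta_hi=0.5):
    """Build a candidate task set Γ; accept it only if the necessary
    schedulability condition sum(vol(τ_k)/T_k) < m holds, else retry."""
    while True:
        gamma = [generate_dag_task(random.uniform(beta_lo, beta_hi))  # β per task (assumption)
                 for _ in range(n_tasks)]
        if sum(t["C"] / t["T"] for t in gamma) < m:   # necessary condition
            return gamma
```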
Following the described method, we generated 2000 task sets for experimental testing. We compare the acceptance ratios of the following four parallelization approaches on a 16-core system:
  • Ours: Parallelization according to Algorithm 2, denoted as Federate-Parallel.
  • Single: Tasks are not parallelized and run in a single thread, denoted as Federate-Single.
  • Max: Tasks are parallelized to the maximum extent possible within constraints, denoted as Federate-Max.
  • Random: For each task node, the limit on parallelization is randomly selected, and then the parallelization options are randomly chosen within the limit, denoted as Federate-Random.
In our experiments, we set the parallelization overhead to 0.1, 0.2, and 0.4; Figure 4a–c display the results. The x-axis represents task set utilization, and the y-axis represents the acceptance ratio, so each data point is the fraction of generated task sets at a given utilization that are schedulable. The results demonstrate that the acceptance ratio of our parallelization method significantly surpasses that of the other methods. As the parallelization overhead increases, the effectiveness of “Max” and “Random” decreases because the execution times of individual parallel threads grow, making it harder to satisfy Formula (9). Because “Single” parallelizes no threads, its acceptance ratio is unaffected.
Figure 4a shows that, under low parallelization overhead, “Max” outperforms “Random”. Figure 4b,c show that although both “Max” and “Random” suffer as the parallelization overhead increases, “Max” is affected more severely: under high overhead, it yields significantly longer total execution times for a node’s threads, with individual thread execution times exceeding those of the other methods. When the parallelization overhead reaches a certain level, the results of “Ours” also decline; nevertheless, “Ours” retains its scheduling advantage, demonstrating strong resilience to parallelization overhead, although its performance gradually approaches that of “Single”. This trend is consistent with the findings reported in [1]: as the overhead grows, for some task sets the benefit of critical-path reduction achieved through node parallelization diminishes significantly relative to the “Single” method, so the acceptance ratios of the two methods become increasingly similar.

6.2. Comparison with Other Scheduling Algorithms

In this section, we use the task generation method described in the previous subsection to compare our parallel algorithm with other state-of-the-art real-time parallel task scheduling algorithms. The comparison includes the following algorithms:
  • A global DM scheduling algorithm based on the TDTA strategy, denoted as TDTA-DM [15].
  • An improved virtual federated scheduling algorithm, denoted as Virtual-Federate [13].
  • A federated scheduling algorithm based on decomposition, denoted as Federate-Decompose [16].
  • A gang reservation scheduling algorithm based on federated scheduling, denoted as Gang-Reservation [17].
  • An ordinary reservation scheduling algorithm based on federated scheduling, denoted as Ordinary-Reservation [17].
  • A parallel algorithm based on the global EDF scheduling algorithm, denoted as EDF-Parallel [1].
  • A fluid scheduling algorithm for implicit deadlines, denoted as Fluid-Implicit [18].
  • A basic federated scheduling algorithm, denoted as Federate-Li [9].
  • A federated scheduling algorithm based on the long path, denoted as Federate-Path [14].
We conduct experiments with parallelization overheads of 0.1 and 0.2 under 8-core, 16-core, and 32-core configurations to investigate acceptance ratios across different normalized task set utilizations. The results are depicted in Figure 5a–f. The acceptance ratio under our proposed Federate-Parallel algorithm consistently matches or slightly exceeds that of the state-of-the-art Federate-Path algorithm [14] across most utilization levels and core configurations. This limited yet stable improvement stems primarily from our node-level parallelization strategy, which further optimizes core allocation efficiency for high-density tasks. It should be noted that the Federate-Path algorithm already bounds response times tightly using long-path techniques, so its scheduling performance is near the upper limit of such methods; the room for further compressing the schedulability boundary on top of such an efficient baseline is inherently limited. The value of our work lies in demonstrating that, even under constraints close to this theoretical boundary, introducing node-level parallelization freedom can still yield additional schedulability within an already high-performance scheduling framework. Moreover, as the number of cores increases, the proportion of high-density tasks rises. Our algorithm effectively reduces the number of cores allocated to high-density tasks, so its acceptance ratio increases slightly as the number of system cores grows, whereas the other algorithms experience a slight decrease. As the parallelization overhead increases, however, as shown in Figure 5b,e, our algorithm’s performance declines slightly.

7. Conclusions

This paper presents a method for parallelizing DAG task nodes under federated scheduling and analyzes the parallelization sequence of nodes on the longest path of a DAG task. The algorithm applies DAG node parallelization to long-path-based federated scheduling, handling changes in the longest path and in the maximum parallelization limit of DAG nodes, thereby further enhancing task schedulability. We conducted experiments to evaluate the effectiveness of our algorithm. First, we compared our parallelization algorithm with other parallelization approaches; the results demonstrate that our method outperforms the others when the parallelization overhead is not particularly high. Second, we compared our algorithm with other real-time scheduling algorithms; the results indicate that it achieves a measurable improvement in schedulability over them.
While the proposed algorithm demonstrates significant improvements in schedulability under the considered model, we acknowledge three primary limitations that outline directions for future research.
First, our algorithm assumes that all sub-threads of a parallelized node have identical execution times. This assumption is grounded in the load-balancing strategies commonly employed in modern parallel programming frameworks such as OpenMP and OpenCL, and has been widely adopted in real-time scheduling research. However, in practice, thread execution times may vary due to data dependencies, memory access patterns, cache behaviors, or contention for hardware resources. Such variations could lead to deviations in the estimation of the longest path length L and the total workload C, thereby affecting the accuracy of core allocation. Future work could consider introducing an uncertainty model for execution times or incorporating runtime monitoring mechanisms to enhance the algorithm’s robustness.
Second, our algorithm and its theoretical analysis are developed within the context of identical (homogeneous) multi-core processors. However, modern embedded and high-performance systems increasingly leverage heterogeneous processor architectures (e.g., big.LITTLE, CPU-GPU systems). In such environments, the execution time of a parallel thread can vary significantly depending on the type of core it is allocated to. Our current method, which relies on a single WCET value per thread, does not account for this variability. The core allocation strategy in federated scheduling would also need to be re-evaluated to handle the complex trade-offs between core performance and power consumption. Extending the proposed parallelization algorithm to heterogeneous platforms presents a core challenge and critical step: the algorithm must be capable of characterizing and adapting to the variations in execution time across different types of processing cores. To achieve this, it is essential to establish a multiple-version WCET model to enable effective extension of the algorithm.
Third, our model assumes a fixed parallelization overhead ratio for all nodes. In practice, this overhead may not be constant and could vary with the number of threads, the specific node’s computational characteristics, or the system’s current contention level. This variable overhead could affect the optimality of the node selection and parallelization option allocation determined by our algorithm. The condition for reducing core count (as defined in Lemma 3) might be violated under variable overhead scenarios, potentially leading to suboptimal decisions. Future work will involve refining the algorithm to be robust against such variable overheads, perhaps by adopting a more adaptive or worst-case overhead estimation model.
A highly promising research direction involves extending the deterministic model of this work into stochastic environments to more accurately characterize system behavior. Specifically, building upon the work of Li et al. [19] on federated scheduling for stochastic parallel tasks, the DAG task model presented in this paper can be extended to stochastic DAG tasks, where the execution time of each node and the critical path length L are modeled as random variables. The hard real-time scheduling objective would then be relaxed to “ensuring bounded expected delay” or “meeting probabilistic timing constraints.” Under this stochastic framework, analyzing the response time distribution of tasks becomes crucial. For this purpose, the method introduced by Kocian and Chessa [20] can be adopted, which treats the DAG as a synchronous dataflow graph and utilizes their generalized rejection sampling Monte Carlo technique to efficiently estimate the probability of tasks meeting their deadlines. This approach provides system designers with performance metrics that are richer and more practical than WCET. Ultimately, this framework can be further extended to accommodate dynamic parallelization overhead, transforming the fixed coefficient  α into a function of the degree of parallelism or a stochastic process. This would establish a unified and comprehensive stochastic performance prediction model, aiming to maximize resource utilization and system throughput while guaranteeing soft real-time performance.
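As a rough illustration of this stochastic direction, the sketch below uses plain Monte Carlo sampling (a deliberate simplification, not the generalized rejection sampling technique of Kocian and Chessa [20]) to estimate the probability that a stochastic DAG meets its deadline, approximating the makespan by the longest path under sampled node execution times.

```python
import random

def deadline_meet_probability(succ, sample_exec, D, trials=10_000):
    """Estimate P(longest-path length <= D) for a stochastic DAG.

    succ: dict mapping each node index to its successor indices
          (indices ascending, so reversed index order is topological).
    sample_exec: callable(node index) -> one sampled execution time.
    """
    n = len(succ)
    hits = 0
    for _ in range(trials):
        times = [sample_exec(i) for i in range(n)]    # one random scenario
        longest = [0.0] * n
        for i in reversed(range(n)):                  # reverse topological order
            longest[i] = times[i] + max((longest[j] for j in succ[i]), default=0.0)
        if max(longest) <= D:
            hits += 1
    return hits / trials

# Hypothetical usage with uniform execution-time jitter:
# p = deadline_meet_probability(succ, lambda i: random.uniform(200, 900), D=2500)
```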

Author Contributions

Conceptualization, L.F. and J.Q.; methodology, J.Q. and S.C.; formal analysis, T.C. and S.C.; validation, J.Q. and T.C.; writing—original draft preparation, S.C.; writing—review and editing, J.Q.; supervision, L.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this simulation study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cho, Y.; Shin, D.; Park, J.; Lee, C. Conditionally Optimal Parallelization of Real-Time DAG Tasks for Global EDF. In Proceedings of the 2021 IEEE Real-Time Systems Symposium (RTSS), Virtual Event, 7–10 December 2021; pp. 188–200.
  2. Stone, J.E.; Gohara, D.; Shi, G. OpenCL: A parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 2010, 12, 66–73.
  3. de Supinski, B.R.; Scogland, T.R.W.; Duran, A.; Klemm, M.; Bellido, S.M.; Olivier, S.L.; Terboven, C.; Mattson, T.G. The ongoing evolution of OpenMP. Proc. IEEE 2018, 106, 2004–2019.
  4. Kwon, J.; Kim, K.; Paik, S.; Lee, J.; Lee, C. Multicore scheduling of parallel real-time tasks with multiple parallelization options. In Proceedings of the 2015 IEEE Real-Time and Embedded Technology and Applications Symposium, Seattle, WA, USA, 13–16 April 2015; pp. 232–244.
  5. Kim, K.; Cho, Y.; Eo, J.; Lee, C.; Han, J. System-wide time versus density tradeoff in real-time multicore fluid scheduling. IEEE Trans. Comput. 2018, 67, 1007–1022.
  6. Park, D.; Cho, Y.; Lee, C. Conditionally optimal parallelization for global FP on multi-core systems. In Proceedings of the 2020 3rd International Conference on Information and Computer Technologies (ICICT), San Jose, CA, USA, 9–12 March 2020; pp. 403–412.
  7. Cho, Y.; Kim, D.H.; Park, D.; Lee, S.S.; Lee, C. Conditionally optimal task parallelization for global EDF on multi-core systems. In Proceedings of the 2019 IEEE Real-Time Systems Symposium (RTSS), Hong Kong, China, 3–6 December 2019; pp. 194–206.
  8. Cho, Y.; Kim, D.H.; Park, D.; Lee, S.S.; Lee, C. Optimal parallelization of single/multi-segment real-time tasks for global EDF. IEEE Trans. Comput. 2021, 71, 1077–1091.
  9. Li, J.; Chen, J.J.; Agrawal, K.; Lu, C.; Gill, C.; Saifullah, A. Analysis of federated and global scheduling for parallel real-time tasks. In Proceedings of the 2014 26th Euromicro Conference on Real-Time Systems, Madrid, Spain, 8–11 July 2014; pp. 85–96.
  10. He, Q.; Guan, N.; Lv, M.; Jiang, X.; Chang, W. The shape of a DAG: Bounding the response time using long paths. Real-Time Syst. 2023, 60, 199–238.
  11. Jiang, X.; Guan, N.; Long, X.; Yi, W. Semi-federated scheduling of parallel real-time tasks on multiprocessors. In Proceedings of the 2017 IEEE Real-Time Systems Symposium (RTSS), Paris, France, 5–8 December 2017; pp. 80–91.
  12. Ueter, N.; von der Brüggen, G.; Chen, J.; Li, J.; Agrawal, K. Reservation-based federated scheduling for parallel real-time tasks. In Proceedings of the 2018 IEEE Real-Time Systems Symposium (RTSS), Nashville, TN, USA, 11–14 December 2018; pp. 482–494.
  13. Jiang, X.; Liang, H.; Guan, N.; Tang, Y.; Qiao, L.; Wang, Y. Scheduling Parallel Real-Time Tasks on Virtual Processors. IEEE Trans. Parallel Distrib. Syst. 2022, 34, 33–47.
  14. He, Q.; Guan, N.; Lv, M.; Jiang, X.; Chang, W. Bounding the Response Time of DAG Tasks Using Long Paths. In Proceedings of the 2022 IEEE Real-Time Systems Symposium (RTSS), Houston, TX, USA, 5–8 December 2022; pp. 474–486.
  15. Wu, Y.; Zhang, W.; Guan, N.; Ma, Y. TDTA: Topology-based Real-Time DAG Task Allocation on Identical Multiprocessor Platforms. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 2895–2909.
  16. Jiang, X.; Guan, N.; Long, X.; Tang, Y.; He, Q. Real-time scheduling of parallel tasks with tight deadlines. J. Syst. Archit. 2020, 108, 101742.
  17. Ueter, N.; Günzel, M.; von der Brüggen, G.; Chen, J. Parallel Path Progression DAG Scheduling. IEEE Trans. Comput. 2023, 72, 3002–3016.
  18. Guan, F.; Qiao, J.; Han, Y. DAG-fluid: A real-time scheduling algorithm for DAGs. IEEE Trans. Comput. 2020, 70, 471–482.
  19. Li, J.; Agrawal, K.; Gill, C.; Lu, C. Federated Scheduling for Stochastic Parallel Real-Time Tasks. In Proceedings of the 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, Chongqing, China, 20–22 August 2014; pp. 1–10.
  20. Kocian, A.; Chessa, S. Iterative Probabilistic Performance Prediction for Multiple IoT Applications in Contention. IEEE Internet Things J. 2022, 9, 13416–13424.
Figure 1. A detailed example of a DAG task, which serves as a running case to illustrate key concepts throughout the paper. Each node v_i is labeled with its WCET, e.g., c(v_1) = 3. The key parameters for this task are a total WCET of C = 15, a longest path length of L = 9 for the path ⟨v_1, v_2, v_5⟩, and a relative deadline of D = 11. As C > D, this is classified as a “high-density” task, the primary target for optimization by our proposed algorithm.
Figure 2. A DAG task following the parallel execution of v_2 and v_5. (a) The DAG without any node splits; the key parameters for this task are a total WCET of C = 14, a longest path length of L = 9 for the path ⟨v_1, v_2, v_5⟩, and a relative deadline of D = 11. (b) The DAG after splitting node v_2 into two threads, with an assumed parallelization overhead of α = 0.2. (c) The DAG after parallelizing node v_5 into two threads, also with α = 0.2. This visualization supports the analysis of scenarios where parallelizing different nodes on the longest path leads to distinct changes in the graph’s temporal parameters (C, L).
Figure 3. The acceptance ratios of five node selection strategies (N-S-Min, N-S-Condition-Min, N-S-Topology, N-S-Descend, and N-S-Ascend) under different parallelization overheads (α = 0.1, 0.2, 0.4) on a 16-core system (α is the parallelization overhead; m is the number of cores).
Figure 4. The acceptance ratios of four parallelization approaches (Ours, Single, Max, Random) under varying overheads (α = 0.1, 0.2, 0.4) on a 16-core system (α is the parallelization overhead; m is the number of cores).
Figure 5. The acceptance ratios of multiple state-of-the-art scheduling algorithms (e.g., TDTA-DM, Virtual-Federate, and EDF-Parallel) under different system configurations (8, 16, and 32 cores) and parallelization overheads (α = 0.1, 0.2) (α is the parallelization overhead; m is the number of cores).
Table 1. Notations used in the paper.

| Notation | Description |
|---|---|
| c(v) | the WCET of vertex v |
| O_i | the parallelization option for node v_i |
| len(λ) | the length of path λ |
| len(G) | the length of the longest path of DAG G |
| vol(G) | the volume of G |
| (λ_i)_0^{k_} | a generalized path list |
| k_ + 1 | the number of all generalized paths in G |
| C | the total WCET of all nodes within G |
| L | the length of the longest path of G |
| D | the relative deadline of G |
| T | the period of G |
| v_i(O_i) | node v_i after being parallelized into O_i threads |
| v_i^l(O_i) | the l-th sibling thread of v_i(O_i) |
| e(v_i^l(O_i)) | the WCET of v_i^l(O_i) |
| m | the number of cores for a high-density DAG |
| ∑_{i=0}^{p_a} len(λ_i) | the sum of the lengths of the first p_a + 1 paths in a DAG |
| m(p_a) | the number of cores calculated by Formula (4) using the first p_a + 1 paths |
| c(v_i(O_i)) | the total WCET of all threads within v_i(O_i) |
| ∑_{i=0}^{p_a} L_i | the sum of the lengths of the first p_a + 1 paths in a DAG |
| Z | ∑_{i=1}^{p_a} L_i |
| O_cur | the list of parallelization options for the DAG nodes |
| O_pre | the temporary list of parallelization options |
| τ_k(O_cur) | the DAG task with O_cur |
| τ_k(O_pre) | the DAG task with O_pre |
| y | the expression inside the ceiling function in Formula (1) or (4) |
| y′ | the value of y after the next parallelization step in a DAG |
| C′ | the value of C after the next parallelization step in a DAG |
| L′ | the value of L after the next parallelization step in a DAG |
| Z′ | the value of Z after the next parallelization step in a DAG |
| ∑_{i=0}^{p_a} L_i(O_cur) | the sum of the lengths of the first p_a + 1 paths in the DAG task with O_cur |
| C(O_cur) | the total WCET of all nodes within the DAG with option list O_cur |
| L(O_cur) | the length of the longest path of the DAG with option list O_cur |
| x | the current parallelization option for a node |
| α | the parallelization overhead |
| b | the execution time per thread for a node |
| ΔL | the change in the longest path length of a DAG |
| Y | the transformed expression for y |