Article

Hybrid Energy-Aware Ranking and Optimization

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
*
Author to whom correspondence should be addressed.
Future Internet 2026, 18(5), 226; https://doi.org/10.3390/fi18050226
Submission received: 6 March 2026 / Revised: 19 April 2026 / Accepted: 20 April 2026 / Published: 22 April 2026

Abstract

The increase in delay-sensitive application tasks requires heterogeneous edge clusters to maintain low online latency and energy efficiency without relying on rigid scheduling policies. To address this, we propose HERO (Hybrid Energy-aware Ranking and Optimization), a lightweight collaborative scheduling framework. HERO utilizes a perturbation-based communication-aware multi-layer perceptron (MLP) predictor to quantify global time sensitivity and discover latent time slack in non-critical paths. A hybrid budget mechanism then converts this slack into customized DVFS decisions. These decisions are based on the inherent computational load and topological criticality to optimize energy consumption. A communication-aware hole-filling strategy dynamically recovers sporadic idle times fragmented by heterogeneous communication overhead. Extensive simulations were conducted across varying DAG depths, parallelism levels, and system utilizations. Compared to state-of-the-art algorithms (NSGA-II, SSA, TOM, and DPMC), HERO reduced the completion time by an average of 10.89% under high-density topologies, and achieved up to 4.04% energy savings across varying task depths.

1. Introduction

Multi-access Edge Computing (MEC) has emerged as a critical infrastructure for latency-sensitive applications in the 5G/6G and IoT era [1,2]. To process complex Deep Neural Network (DNN) inferences, edge computing is shifting towards multi-node heterogeneous clusters that collaborate via task offloading [3,4]. However, edge servers exhibit significant differences in computing capabilities and face strict battery capacity or Thermal Design Power (TDP) constraints [5,6]. Achieving the joint optimization of millisecond-level real-time response (makespan) and total system energy consumption in such heterogeneous and dynamic environments has become a highly challenging NP-hard bi-objective scheduling problem [7,8].
To tackle this problem, edge computational workloads are typically modeled as complex Directed Acyclic Graphs (DAGs) to capture fine-grained dependencies and strict millisecond-level deadlines. Although academia has proposed various advanced DAG scheduling strategies, severe challenges remain in resource-constrained and highly dynamic edge environments. First, evolutionary algorithms, such as NSGA-II [1], rely on heavy iterative searches. Consequently, they incur excessively high online scheduling latency. Second, existing deep reinforcement learning-based algorithms (such as SSA-DAG) [9,10,11] often exhibit a “performance-heavy, energy-light” tendency, lacking proactive energy budget planning. More crucially, they typically treat edge weights as static features. This makes it difficult to capture the condition-triggered characteristics of communication overheads in heterogeneous clusters. As a result, they fail to accurately identify the true system bottlenecks. Finally, dedicated algorithms based on specific structures or mixed criticality (such as TOM and DPMC) [12,13,14] usually adopt pessimistic resource reservation or rigid slack allocation strategies. They ignore differences in the intrinsic computational workloads of individual tasks. Therefore, they are highly prone to over-scaling the frequency of heavy-workload tasks on non-critical paths, thereby creating new system bottlenecks.
To address these limitations, this paper proposes HERO (Hybrid Energy-aware Ranking and Optimization), a lightweight and efficient collaborative scheduling framework. HERO builds a closed-loop “perception–decision–compensation” optimization process that aims to overcome the limitations of traditional static heuristics and greedy learning strategies. The main contributions of this paper are summarized as follows:
  • Establishment of a Communication-Aware Sensitivity Quantification Model: We propose a perturbation-based mechanism to quantify the marginal effect of task execution fluctuations on the global makespan, accurately stripping away pseudo-critical paths.
  • Hybrid Budget Allocation: We design a multi-factor energy arbitration mechanism that balances critical path progression with the resource needs of heavy-load tasks on non-critical paths.
  • Time Fragment Recovery Mechanism: We introduce an aggressive hole-filling strategy to reclaim discrete idle time slots induced by heterogeneous communication overheads.
  • Performance Validation: Extensive experiments on a diverse testbed (Raspberry Pi 4, Jetson Orin Nano, and Xeon D) demonstrate that HERO reduced the completion time by an average of 10.89% under high-density topologies, and achieved up to 4.04% energy savings across varying task depths.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 formalizes the system and communication models, alongside the bi-objective optimization problem. Section 4 elaborates on the detailed design of the proposed HERO framework. Section 5 presents the experimental setup, comparative results, and performance evaluation. Finally, Section 6 concludes the paper.

2. Related Work

Directed Acyclic Graph (DAG) task scheduling and offloading in heterogeneous edge environments is a classical NP-hard problem. As summarized in Table 1, existing solutions can generally be divided into three categories: heuristic and evolutionary algorithms, deep learning-based intelligent scheduling, and structure-aware specific offloading strategies.

2.1. Heuristic and Evolutionary Algorithms

Due to the heterogeneity of edge environments, multi-objective evolutionary and heuristic algorithms are widely adopted to address the trade-offs between latency, energy consumption, and system reliability [1,5,7,8,12,15,16]. While these methods excel in finding Pareto-optimal solutions or ensuring fault tolerance, they face severe challenges in real-time edge collaboration scenarios. Algorithms represented by NSGA-II rely on heavy population iteration and mutation processes, resulting in scheduling delays that are too high to meet the millisecond-level response requirements of intelligent edge services. Furthermore, their energy optimization methods often employ rigid strategies, allocating energy budgets solely based on time margins while ignoring intrinsic computational workload differences, which easily creates new system bottlenecks.

2.2. Intelligent Scheduling Based on Deep Reinforcement Learning

Schedulers based on Deep Reinforcement Learning (DRL) have become a research hotspot due to their environmental adaptability. Recent advancements frequently combine DRL with Graph Neural Networks (GNNs) or Transformers to capture complex DAG dependencies and optimize offloading decisions in highly dynamic environments [3,9,10,11,17,18,19,20].
While DRL methods perform well in finding high-quality solutions, deploying them on resource-constrained edge nodes remains challenging. The high-dimensional state encoding required by complex GNNs or Transformers introduces unacceptable millisecond-level inference latency for lightweight edge tasks. Additionally, as detailed in Table 1, existing DRL schedulers predominantly prioritize performance over energy consumption, lacking an active energy budget mechanism and often neglecting the critical impact of heterogeneous communication overheads on feature extraction.

2.3. Structure-Aware and Specific Offloading Strategies

Structure-aware and specific offloading strategies simplify the scheduling problem by exploiting DAG structural characteristics or predefined rules, such as service caching, 1-Opt local search, or decentralized game theory [2,4,6,13,14,21,22,23].
Although effective in specific use cases, these strategies lack generality. Chain-based optimizations make strong assumptions about the DAG’s shape, making it difficult to handle highly irregular edge workflows. Moreover, unlike the aggressive gap-filling strategy proposed in our HERO framework, these structured methods often struggle to flexibly recycle discrete idle time slots caused by heterogeneous communication, resulting in limited overall resource utilization.

3. System Model and Problem Description

To aid in reading the mathematical models presented in this section, the primary notations used throughout this paper are summarized in the table in the Abbreviations section.
To provide a rigorous mathematical abstraction for the task scheduling problem in an edge collaborative environment, this section first defines an application model based on a DAG to depict the complex dependencies and heterogeneous computational workloads among tasks. Subsequently, the heterogeneous edge cluster architecture and communication model are constructed to quantify the heterogeneous transmission overhead incurred by cross-node collaboration. Furthermore, considering the resource-constrained nature of edge devices, a power and energy consumption model based on DVFS is introduced. Finally, building upon the aforementioned models, the joint optimization of latency and energy consumption is formulated as a constrained bi-objective optimization problem.

3.1. Application Model

We model the dependency-driven task flow—specifically the deep learning inference pipeline in an edge computing environment—as a DAG, as illustrated in Figure 1, and defined as
$$G = (V, E)$$
The vertex set $V = \{v_1, v_2, \ldots, v_n\}$ represents the computational tasks, where each task $v_i$ has a computational load $w_i$. The set of directed edges $E$ denotes task dependencies. A directed edge $e_{i,j} = (v_i, v_j) \in E$ indicates that $v_j$ cannot start until $v_i$ completes, involving a data transmission volume $d_{i,j}$ over a network with bandwidth $B$.
To quantitatively characterize the divergent requirements for computational resources and communication bandwidth across various DAG applications, we define the Communication-to-Computation Ratio (CCR) as the ratio of the average communication cost $\bar{T}_{\text{comm}}$ to the average computational cost $\bar{T}_{\text{comp}}$ of the entire graph:
$$\text{CCR} = \frac{\dfrac{1}{|E|} \sum_{(v_i, v_j) \in E} \dfrac{d_{i,j}}{\bar{B}}}{\dfrac{1}{|V|} \sum_{v_i \in V} \bar{t}_i}$$
where the numerator represents the average data transfer time across all dependency edges given the average bandwidth $\bar{B}$ within the heterogeneous edge cluster; the denominator represents the average execution time of all tasks. Due to system heterogeneity and Dynamic Voltage and Frequency Scaling (DVFS) capabilities, the execution time of task $v_i$ is not constant. Therefore, we define $\bar{t}_i$ as the expected execution time of task $v_i$ across all available physical configurations in the cluster $\mathcal{P}$:
$$\bar{t}_i = \frac{1}{\sum_{p_k \in \mathcal{P}} |F_k|} \sum_{p_k \in \mathcal{P}} \sum_{f \in F_k} \frac{w_i}{s_k \cdot f}$$
In this formula, $F_k$ is the set of available frequencies supported by server $p_k$, and $s_k$ is its architectural performance coefficient.
For any task $v_i$, we define $\text{Pred}(v_i)$ as its set of direct predecessors and $\text{Succ}(v_i)$ as its set of direct successors. To simplify the model, we assume $G$ has a unique entry task $v_{\text{entry}}$ and a unique exit task $v_{\text{exit}}$. For practical workflows with multiple entries or exits, this can be unified by adding virtual nodes with zero computational load and zero communication overhead.
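As a rough illustration of the two definitions above, the following Python sketch computes the expected execution time and the CCR for a toy graph. The data layout (`servers` as `(s_k, frequencies)` pairs, dictionaries keyed by task and edge) and all numeric values are assumptions made for this example, not part of the paper's implementation.

```python
from statistics import mean


def expected_exec_time(w_i, servers):
    """Mean of w_i / (s_k * f) over every (server, frequency) configuration."""
    times = [w_i / (s_k * f) for s_k, freqs in servers for f in freqs]
    return mean(times)


def ccr(workloads, edge_data, servers, avg_bandwidth):
    """Average transfer time per edge divided by average expected execution time."""
    avg_comm = mean(d / avg_bandwidth for d in edge_data.values())
    avg_comp = mean(expected_exec_time(w, servers) for w in workloads.values())
    return avg_comm / avg_comp


# Hypothetical cluster: (performance coefficient s_k, available frequencies F_k)
servers = [(2.0, [1.0, 2.0]), (1.0, [1.0])]
workloads = {"v1": 4.0, "v2": 8.0}   # w_i per task
edge_data = {("v1", "v2"): 10.0}     # d_ij per dependency edge

ccr_value = ccr(workloads, edge_data, servers, avg_bandwidth=5.0)
print(round(ccr_value, 4))
```

Because every (server, frequency) pair counts once, the averaging weights servers with more frequency levels more heavily, which matches the normalization by the total number of configurations in the formula.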

3.2. Architecture and Communication Model

We model the edge computing cluster as a set of $M$ heterogeneous edge servers, denoted by $\mathcal{P} = \{p_1, p_2, \ldots, p_M\}$. Due to the diverse hardware architectures (CPU, GPU, or specialized accelerators) of the edge servers, significant disparities exist in their efficiency when processing the same task. In practical implementation, this cluster typically operates under a master–worker paradigm: one resource-sufficient node is designated as the master controller to maintain cluster state and handle scheduling logic, while the remaining heterogeneous devices act as worker nodes to execute dispatched tasks. For any task $v_i \in V$, if it is assigned to processor $p_k \in \mathcal{P}$ and executed at a frequency $f_{i,k}$, its execution time $t_{i,k}$ is expressed as
$$t_{i,k} = \frac{w_i}{s_k \cdot f_{i,k}}$$
where $w_i$ represents the computational workload of the task, and $s_k$ denotes the architectural performance coefficient of server $p_k$, reflecting the processor’s Instructions Per Cycle (IPC) throughput per unit frequency.
Communication Cost Model: The communication overhead between tasks is determined by the data transmission volume and network bandwidth. Let $B$ denote the average transmission bandwidth within the edge cluster. For a dependency edge $e_{i,j}$, if task $v_i$ is assigned to server $p_m$ and task $v_j$ is assigned to server $p_n$, the communication time $c_{i,j}$ is defined as
$$c_{i,j} = \begin{cases} 0, & \text{if } m = n \\ \dfrac{d_{i,j}}{B}, & \text{if } m \neq n \end{cases}$$
When parent and child tasks are scheduled on the same processor, data is exchanged via shared memory; thus, the communication overhead is considered to be zero.
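The execution-time and communication-cost definitions in this subsection can be sketched as two small Python functions. The function names and the sample values are hypothetical and serve only to make the two cases of the communication model concrete.

```python
def exec_time(w_i, s_k, f_ik):
    """t_{i,k} = w_i / (s_k * f_{i,k})."""
    return w_i / (s_k * f_ik)


def comm_time(d_ij, bandwidth, server_i, server_j):
    """Zero when both tasks share a server, otherwise d_ij / B."""
    return 0.0 if server_i == server_j else d_ij / bandwidth


# Hypothetical values: workload 12 on a server with s_k = 2 running at f = 1.5
t_example = exec_time(12.0, 2.0, 1.5)
same_server = comm_time(10.0, 5.0, "p1", "p1")
cross_server = comm_time(10.0, 5.0, "p1", "p2")
print(t_example, same_server, cross_server)
```

The placement-dependent branch is what makes the edge weight conditional: the same dependency edge costs nothing or `d_ij / B` depending on where the scheduler puts the two tasks.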

3.3. Power and Energy Consumption Model

To support energy-efficiency optimization, we assume that each edge server supports Dynamic Voltage and Frequency Scaling (DVFS) technology.
DVFS Frequency Set: For each server $p_k$, the processor supports a set of discrete voltage–frequency pairs. The available frequency set for $p_k$ is defined as $F_k = \{f_{k,1}, f_{k,2}, \ldots, f_{k,\max}\}$, where $f_{k,\max}$ is the maximum clock frequency of the processor.
Following widely adopted dynamic voltage and frequency scaling (DVFS) power models, the instantaneous power $P_k(f)$ of edge server $p_k$ operating at frequency $f$ is modeled according to the well-known Cubic Law:
$$P_k(f) = P_k^{\text{stat}} + \xi_k \cdot f^3$$
where $P_k^{\text{stat}}$ is the static baseline power, and $\xi_k$ is a hardware-specific constant reflecting the processor’s capacitive characteristics.
The total system energy consumption $E_{\text{total}}$ is the sum of the energy consumed during task execution and the energy consumed during idle periods. For task $v_i$ running on server $p_k$ at frequency $f_{i,k}$, its execution energy is defined as
$$E_i^{\text{exec}} = P_k(f_{i,k}) \times t_{i,k} = \left( P_k^{\text{stat}} + \xi_k \cdot f_{i,k}^3 \right) \times \frac{w_i}{s_k \cdot f_{i,k}}$$
The objective function for the total energy consumption of the entire edge cluster during the scheduling cycle is
$$\text{Minimize:} \quad E_{\text{total}} = \sum_{v_i \in V} E_i^{\text{exec}} + \sum_{p_k \in \mathcal{P}} P_k^{\text{stat}} \times T_{\text{idle},k}$$
where $T_{\text{idle},k}$ is the idle wait time of server $p_k$.
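The power and energy model above can be sketched in a few lines of Python. The function names and all parameter values below are assumptions chosen for illustration; they are not measurements from the paper's testbed.

```python
def power(p_stat, xi, f):
    """Cubic-law power: P_k(f) = P_stat + xi * f^3."""
    return p_stat + xi * f ** 3


def exec_energy(w_i, s_k, f, p_stat, xi):
    """Execution energy: power at frequency f times execution time w_i / (s_k * f)."""
    return power(p_stat, xi, f) * w_i / (s_k * f)


def total_energy(exec_energies, static_powers, idle_times):
    """Sum of task execution energies plus static energy burned while servers idle."""
    idle = sum(p * t for p, t in zip(static_powers, idle_times))
    return sum(exec_energies) + idle


# Hypothetical task and server parameters
e_task = exec_energy(w_i=6.0, s_k=2.0, f=1.5, p_stat=1.0, xi=0.5)
e_total = total_energy([e_task, 2.0], static_powers=[1.0, 0.5], idle_times=[1.0, 4.0])
print(e_task, e_total)
```

One consequence of this model is that the lowest frequency is not automatically the most energy-efficient one: expanding the product gives $E = \frac{w_i}{s_k}\left(\frac{P_k^{\text{stat}}}{f} + \xi_k f^2\right)$, which for a continuous frequency is minimized at $f^3 = P_k^{\text{stat}} / (2\xi_k)$. Below that point the static term dominates because the task simply runs longer.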

3.4. Formal Problem Description

The scheduling of DAG tasks in a heterogeneous edge environment is formulated as finding an optimal mapping of tasks to processors, determining their execution sequence, and allocating operating frequencies. For any task $v_i$ assigned to processor $p_k$, its timing constraints are governed by the completion status of its predecessors and the resource availability of the assigned processor.
Following standard task scheduling semantics, the Data Ready Time (DRT), Earliest Start Time (EST), and Earliest Finish Time (EFT) for a task $v_i$ on processor $p_k$ are calculated recursively:
$$\text{DRT}(v_i, p_k) = \max_{v_j \in \text{Pred}(v_i)} \left\{ \text{AFT}(v_j) + c_{j,i} \right\}$$
$$\text{EST}(v_i, p_k) = \max \left\{ \text{Avail}(p_k),\ \text{DRT}(v_i, p_k) \right\}$$
$$\text{EFT}(v_i, p_k) = \text{EST}(v_i, p_k) + t_{i,k}$$
where $\text{AFT}(v_j)$ denotes the actual finish time of predecessor $v_j$, and $\text{Avail}(p_k)$ is the earliest time at which $p_k$ becomes ready to execute a new task.
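The three timing quantities can be evaluated for one candidate placement as in the sketch below. The dictionary-based data layout, the function names, and the convention that an entry task has a data ready time of 0 are assumptions of this example.

```python
def data_ready_time(task, proc, preds, aft, placement, data, bandwidth):
    """Latest arrival time of predecessor outputs at `proc` (0.0 for an entry task)."""
    arrivals = []
    for p in preds.get(task, []):
        # Same-server transfers are free; cross-server transfers cost d / B.
        comm = 0.0 if placement[p] == proc else data[(p, task)] / bandwidth
        arrivals.append(aft[p] + comm)
    return max(arrivals, default=0.0)


def est_eft(task, proc, exec_time, avail, preds, aft, placement, data, bandwidth):
    """Return (EST, EFT) of `task` on `proc` given the current partial schedule."""
    drt = data_ready_time(task, proc, preds, aft, placement, data, bandwidth)
    est = max(avail[proc], drt)
    return est, est + exec_time


# Hypothetical partial schedule: a and b are finished, c is being placed.
preds = {"c": ["a", "b"]}
aft = {"a": 3.0, "b": 4.0}
placement = {"a": "p0", "b": "p1"}
data = {("a", "c"): 10.0, ("b", "c"): 2.0}
avail = {"p0": 5.0, "p1": 4.0}

on_p0 = est_eft("c", "p0", 2.0, avail, preds, aft, placement, data, 5.0)
on_p1 = est_eft("c", "p1", 3.0, avail, preds, aft, placement, data, 5.0)
print(on_p0, on_p1)
```

In this toy case both placements start at the same time, but for different reasons: on `p0` the processor is the limiting factor, while on `p1` the task waits for data from `a`.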
The primary objective of this study is to address a bi-objective optimization problem: minimizing the application makespan while simultaneously reducing total system energy consumption.
The makespan minimization aims to minimize the completion time of the entire application, which is determined by the actual finish time of the exit task $v_{\text{exit}}$:
$$\min T_{\text{makespan}} = \text{AFT}(v_{\text{exit}})$$
The total energy minimization aims to minimize the total energy consumption, which includes the execution energy of all tasks and the static energy consumed during idle periods:
$$\min E_{\text{total}} = \sum_{v_i \in V} E_i^{\text{exec}} + \sum_{p_k \in \mathcal{P}} P_k^{\text{stat}} \times T_{\text{idle},k}$$
Given a DAG $G$ and a heterogeneous cluster $\mathcal{P}$, the goal is to identify an optimal scheduling strategy $S = (\mathcal{M}, \mathcal{O}, \mathcal{F})$, where $\mathcal{M}$ is the task-to-processor mapping, $\mathcal{O}$ is the execution order, and $\mathcal{F}$ is the frequency allocation, that minimizes the objective vector:
$$\min_{S} J(S) = \left[ T_{\text{makespan}}(S),\ E_{\text{total}}(S) \right]$$

4. HERO Framework Design

This section describes the design of the HERO framework. The framework comprises two core phases: communication-aware priority learning and bottleneck-aware resource allocation.

4.1. Framework Overview

As shown in Figure 2 and Algorithm 1, HERO establishes a closed-loop identify–utilize–reclaim cascade process. In the identification phase, an MLP predictor is used to unravel complex dependencies and isolate exploitable time redundancy. In the utilization phase, we exploit this time redundancy through a hybrid budgeting mechanism. This mechanism converts non-critical time slack into energy efficiency via DVFS. This conversion leads to scheduling fragmentation. Finally, the reclaim phase acts as a compensator by employing a hole-filling strategy. This strategy recovers these fragmented gaps for small tasks, thereby maximizing resource density through compute–communication overlap.
From an implementation perspective, to ensure strict online real-time performance, the deep-enhanced MLP predictor is trained offline. During the online phase, the master controller only performs an $O(1)$ lightweight forward inference, strictly bounding the decision-making overhead to the microsecond level. The master then dispatches the parsed subtasks and specific DVFS frequency commands to designated worker nodes via lightweight remote procedure calls (e.g., gRPC), fully bridging the theoretical algorithm with practical edge orchestration.
Algorithm 1: HERO: Hybrid Energy-aware Ranking and Optimization

4.2. Task Ranking Based on Enhanced MLP

To extract nonlinear scheduling features from high-dimensional heterogeneous DAGs with low inference latency, HERO employs a Deep Enhanced Multilayer Perceptron (MLP).

4.2.1. DAG Feature Extraction

For each task $v_i$ in the DAG, we extract an 11-dimensional feature vector $x_i$, as detailed in Table 2. By encoding macro-level path criticality (e.g., $\mathrm{Rank}_{\text{up}}$, $T_{\text{path}}$) and micro-level topological attributes (e.g., $d_{\text{in}}$, $\mathrm{Level}_{\text{topo}}$) into a unified linear vector space, this feature set effectively captures the topological integrity of the DAG.

4.2.2. Perturbation-Based Sensitivity Generation

In learning-based DAG scheduling research, obtaining high-quality supervision signals is a key bottleneck for model performance. Existing imitation learning methods typically use static priority sequences generated by heuristic algorithms (such as HEFT) as training labels, which caps the learned model's performance at that of the heuristic being imitated. To overcome this limitation, HERO proposes a perturbation-based sensitivity analysis mechanism. The model learns the global time sensitivity of each task, that is, the extent to which local execution fluctuations of that task degrade the completion time of the entire system.
We define a sensitivity label $y_i$ for task $v_i$ as the marginal effect of load variation on global completion time. As shown in Figure 3, for each instance $G$ in the training set, the label generation process includes the following three standardized steps:
First, DAG $G$ is scheduled under the target heterogeneous cluster configuration using a standard list scheduling algorithm (HEFT in this paper), which yields a baseline scheduling scheme $S_{\text{base}}$ and its corresponding baseline completion time $T_{\text{base}}$:
$$T_{\text{base}} = \text{Schedule}(G, \text{HEFT})$$
Then, for each task $v_i \in V$ in the DAG, a perturbed instance $G_i$ is constructed. In $G_i$, we increase the computational load $w_i$ of task $v_i$ by a significant perturbation factor $\delta$ (in the experimental setup, $\delta = w_i$, i.e., simulating a doubling of the load):
$$w_i' = w_i + \delta, \quad w_j' = w_j \ (j \neq i)$$
Subsequently, $G_i$ is re-evaluated using the same scheduling algorithm to obtain the perturbed completion time $T_i$.
Finally, the absolute impact of the delay of task $v_i$ on the system is $\Delta T_i = T_i - T_{\text{base}}$. To eliminate the dimensional differences caused by different DAG sizes, we define the final training label $y_i$ as the normalized marginal deterioration rate:
$$y_i = \frac{T_i - T_{\text{base}}}{T_{\text{base}}}$$
If $y_i > 0$, it indicates that the task is in some form of bottleneck state, and the larger the value, the greater its impact on the system completion time. If $y_i \approx 0$, it indicates that the disturbance of the task is absorbed by the system’s parallel gaps or communication latency.
Through this mechanism, HERO’s MLP predictor learns the global time sensitivity of each task, enabling HERO to dynamically identify hidden bottlenecks masked by traditional static CCR metrics during the inference phase.
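The label generation procedure can be sketched as follows. Note that the scheduler here is a simplified HEFT-style list scheduler (upward-rank ordering, earliest-finish-time placement, no insertion policy, positive task loads assumed) used as a stand-in for the HEFT baseline; the function names, data layout, and the toy fork-join graph are all assumptions of this example.

```python
def upward_rank(tasks, succ, w, speeds, data, bw):
    """HEFT-style upward rank using the mean execution time across servers."""
    rank = {}

    def visit(v):
        if v not in rank:
            mean_exec = sum(w[v] / s for s in speeds) / len(speeds)
            tail = max((data[(v, u)] / bw + visit(u) for u in succ.get(v, [])),
                       default=0.0)
            rank[v] = mean_exec + tail
        return rank[v]

    for v in tasks:
        visit(v)
    return rank


def list_schedule(tasks, succ, w, speeds, data, bw):
    """Simplified list scheduler (stand-in for HEFT); returns the makespan."""
    pred = {v: [] for v in tasks}
    for u, children in succ.items():
        for v in children:
            pred[v].append(u)
    rank = upward_rank(tasks, succ, w, speeds, data, bw)
    avail = [0.0] * len(speeds)
    aft, place = {}, {}
    for v in sorted(tasks, key=lambda t: -rank[t]):
        best = None
        for k, s in enumerate(speeds):
            drt = max((aft[u] + (0.0 if place[u] == k else data[(u, v)] / bw)
                       for u in pred[v]), default=0.0)
            eft = max(avail[k], drt) + w[v] / s
            if best is None or eft < best[0]:
                best = (eft, k)
        aft[v], place[v] = best
        avail[place[v]] = aft[v]
    return max(aft.values())


def sensitivity_labels(tasks, succ, w, speeds, data, bw):
    """y_i = (T_i - T_base) / T_base with task i's load doubled (delta = w_i)."""
    t_base = list_schedule(tasks, succ, w, speeds, data, bw)
    labels = {}
    for v in tasks:
        perturbed = dict(w)
        perturbed[v] = 2 * w[v]
        t_i = list_schedule(tasks, succ, perturbed, speeds, data, bw)
        labels[v] = (t_i - t_base) / t_base
    return labels


# Toy fork-join DAG: a -> {b, c} -> d, two identical servers, free communication.
tasks = ["a", "b", "c", "d"]
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
w = {"a": 1.0, "b": 4.0, "c": 1.0, "d": 1.0}
data = {("a", "b"): 0.0, ("a", "c"): 0.0, ("b", "d"): 0.0, ("c", "d"): 0.0}
labels = sensitivity_labels(tasks, succ, w, [1.0, 1.0], data, 1.0)
print(labels)
```

In this toy graph the short branch `c` receives a label of zero because doubling its load is absorbed by the slack next to the long branch `b`, whereas `b` receives the largest label.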

4.2.3. Model Design and Optimization Strategies

To address the challenges of high-dimensional non-linearity and sparsity inherent in DAG task scheduling features, we designed a deep funnel-like feature extraction network coupled with a robust training mechanism:
Funnel-like Network Architecture: To capture the implicit coupling among topological features (e.g., $\mathrm{Rank}_{\text{up}}$ and communication overhead), we construct a five-layer descending network. Specifically, a funnel-shaped information bottleneck structure is utilized. Combined with batch normalization and ReLU activation functions in the first three layers, it can effectively filter out noise and distill high-level abstract features. A decaying Dropout strategy ($0.15 \to 0.1 \to 0.05$) is adopted to prevent structural overfitting while maintaining the stability of deep semantic representations.
Dynamic Optimization and Training Strategy: The model uses the Mean Squared Error (MSE) loss function to heavily penalize prediction deviations, forcing it to quickly lock onto key path nodes. The AdamW optimizer is selected to enhance generalization ability. We also integrate a ReduceLROnPlateau learning rate scheduler (halving the learning rate if validation loss stagnates for five consecutive epochs) and an early stopping mechanism (training stops if there is no improvement for 15 consecutive epochs).
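The interaction between the learning-rate schedule and early stopping can be illustrated with a plain-Python replay of a validation-loss trace. This is a simplified sketch of the control logic only, not the PyTorch `ReduceLROnPlateau` class itself (which also has threshold and cooldown options); the function name, the initial learning rate, and the loss trace are assumptions of this example.

```python
def replay_training(val_losses, lr=1e-3, lr_patience=5, stop_patience=15, factor=0.5):
    """Replay a validation-loss trace; return (epochs_run, final_learning_rate)."""
    best = float("inf")
    since_best = 0     # consecutive epochs without improvement (early stopping)
    since_lr_cut = 0   # stagnant epochs since the last improvement or LR cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_cut = loss, 0, 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= factor          # halve the learning rate on a plateau
            since_lr_cut = 0
        if since_best >= stop_patience:
            return epoch, lr      # early stop
    return len(val_losses), lr


# Hypothetical trace: two improving epochs followed by a long plateau.
trace = [1.0, 0.9] + [0.95] * 20
epochs_run, final_lr = replay_training(trace)
print(epochs_run, final_lr)
```

With a patience of 5 for the learning rate and 15 for stopping, a persistent plateau triggers three halvings before training is abandoned, so the optimizer gets several chances at a smaller step size before the run ends.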

4.2.4. Model Performance Analysis

To ensure the robustness of our model, we constructed a comprehensive dataset containing 100,000 DAG task samples. This dataset is randomly divided into a training set, validation set, and test set in a 90:5:5 ratio. As shown in Figure 4a, the training loss and validation loss converge rapidly and stabilize after approximately 30 epochs. The slight gap between the two curves indicates that no significant overfitting has occurred, demonstrating the effectiveness of batch normalization and the Dropout mechanism. Figure 4b shows the comparison between actual sensitivity and predicted values on the test set. The randomly selected 5000 data points are evenly distributed around the y = x diagonal, indicating that the MLP can accurately predict the potential impact of tasks in heterogeneous environments.
We selected four classic regression models (linear regression, decision tree, random forest, XGBoost) to verify the necessity of deep learning, and conducted an ablation study (2-layer, 5-layer, and 12-layer MLPs) to justify the architectural depth of HERO-MLP. All models were trained on the same 11-dimensional features and evaluated using MSE, MAE, $R^2$, inference latency, and parameter count.
As shown in Table 3 and Table 4, classic models fail to capture the complex non-linear relationships, with the best ensemble method (XGBoost) only reaching an $R^2$ of 0.5597. In the neural network ablation study, the five-layer HERO-MLP achieves the highest prediction accuracy ($R^2 = 0.707295$, MSE = 0.005880). Compared to a shallow two-layer MLP (limited feature extraction, $R^2 = 0.690191$) and a 12-layer MLP (where accuracy degrades as complexity increases, $R^2 = 0.696074$), the five-layer HERO-MLP offers the best balance among the tested depths. It delivers the highest accuracy while maintaining an ultra-low inference latency (199.09 μs) and a compact parameter size (50,049), which is crucial for real-time edge scheduling.

4.2.5. Analysis of Key Node Identification Capability

To investigate the actual performance of the MLP in scheduling decisions, we designed a stress test to observe its ability to identify key nodes. Traditional heuristic list scheduling algorithms (such as the $\mathrm{Rank}_u$ strategy in HEFT) primarily rely on static Communication-to-Computation Ratio (CCR) to construct task priorities. To verify whether the MLP predictor in the HERO framework truly learns a global bottleneck awareness capability beyond simple data memorization, we conducted a targeted evaluation.
The experiment was deployed in a heterogeneous computing environment containing two types of computing nodes: high-performance cores ($p_0$) with processing speed $s_k = 6.0$, simulating the master computing node in an edge cluster, and medium-performance cores ($p_1$) with processing speed $s_k = 2.0$, simulating auxiliary computing nodes or low-power cores.
To simulate stress load scenarios, we constructed a set of synthetic DAG datasets with bimodal distribution characteristics. This dataset contains two types of tasks with opposing properties:
  • Computationally intensive isolated tasks (Type-C): High computational load with low communication overhead, representing potential “computational bottlenecks.”
  • Communication-intensive coupled tasks (Type-D): Low computational load but requiring significant data transfer, representing “communication bottlenecks.”
Experimental results reveal the fundamental difference in scheduling decision logic between the benchmark algorithm ($\mathrm{Rank}_u$) and the MLP, as shown in Figure 5. Faced with the substantial communication overhead generated by Type-D tasks, the $\mathrm{Rank}_u$ algorithm schedules all such tasks to the high-performance core $p_0$ to eliminate cross-node data transmission latency. This locally greedy strategy leads to scarce $p_0$ resources being occupied by a large number of low-computation-value tasks. Consequently, when Type-C tasks (which truly determine the global completion time) arrive, they are relegated to the low-speed core $p_1$, ultimately deteriorating the total makespan.
In contrast, the MLP predictor, by learning global time sensitivity, successfully identifies that although Type-C tasks lack explicit communication constraints, their heavy computational burden constitutes the global critical path. Therefore, HERO assigns Type-C tasks a higher scheduling priority, allowing them to preempt the computational resources of $p_0$.
As shown in Figure 6, across test cases with different graph depths and parallelism, HERO’s normalized makespan consistently outperforms the benchmark. Particularly in deep graph structures (Depth $> 8$), HERO’s average normalized makespan is 0.8468, achieving a performance improvement of approximately 15.3%; in high-concurrency scenarios (Layers $> 8$), the average normalized makespan is 0.8479, an improvement of 15.2%. These results demonstrate that HERO possesses global bottleneck identification capabilities that surpass local greedy strategies, validating the model’s generalization effectiveness in extremely heterogeneous environments.

4.2.6. Model Interpretability and Microbehavior Analysis

Using permutation feature importance analysis, we verified the effectiveness of the HERO feature set and the nonlinear learning capability of deep MLPs (as shown in Figure 7). Experimental results show that computational load ($w_i$) and communication overhead ($\mathrm{Comm}_{\max}$) make the dominant contributions to prediction accuracy, confirming the importance of cross-node communication overhead in heterogeneous edge environments. Meanwhile, the model also assigns high importance to the number of paths ($N_{\text{path}}$) and the exit distance ($\mathrm{Rank}_{\text{up}}$), indicating that it has learned to prioritize topological intersections and key nodes in long chains.
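For readers unfamiliar with the technique, permutation feature importance can be sketched in plain Python as below: each feature column is shuffled in turn and the resulting increase in MSE is recorded. The function names, the number of repeats, and the two-feature toy model are assumptions of this example and do not reproduce the paper's 11-dimensional setup.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` (a callable on one row) over a dataset."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, repeats=5, seed=0):
    """Average MSE increase when each feature column is shuffled independently."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    scores = []
    for j in range(len(X[0])):
        increase = 0.0
        for _ in range(repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [c] + row[j + 1:] for row, c in zip(X, column)]
            increase += mse(model, shuffled, y) - base
        scores.append(increase / repeats)
    return scores


# Toy model that depends only on feature 0; feature 1 is irrelevant.
def toy_model(row):
    return 2.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(8)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

A feature the model ignores scores exactly zero, while a feature it relies on scores higher the more the predictions degrade once that column is scrambled.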

4.3. Hybrid Budget Allocation

After determining the task scheduling order, HERO introduces a dynamic hybrid budget mechanism to address the trade-off between energy consumption and performance. Unlike traditional methods that allocate resources solely based on static load, this mechanism combines topology importance and computational volume for global budget planning and introduces runtime energy recovery strategies.

4.3.1. Task Energy Consumption Boundary and Global Energy Margin Definition

Before allocating resources, it is necessary to first define the energy consumption boundaries of each task and the entire task flow within the current heterogeneous edge cluster.
For any task $v_i \in V$, considering that it can be executed on any processor $p_k \in \mathcal{P}$ at any frequency $f_{k,j} \in F_k$, we can pre-calculate the upper and lower bounds of the task’s energy consumption.
Minimum execution energy consumption: $E_{\min}(i)$ represents the lowest energy consumption value that task $v_i$ can achieve among all possible combinations of processors and frequencies:
$$E_{\min}(i) = \min_{p_k \in \mathcal{P},\ f_{k,j} \in F_k} \left\{ P_k(f_{k,j}) \times \frac{w_i}{s_k \cdot f_{k,j}} \right\}$$
Maximum execution energy consumption: $E_{\max}(i)$ represents the highest energy consumption that task $v_i$ may generate at the highest performance configuration:
$$E_{\max}(i) = \max_{p_k \in \mathcal{P},\ f_{k,j} \in F_k} \left\{ P_k(f_{k,j}) \times \frac{w_i}{s_k \cdot f_{k,j}} \right\}$$
System minimum energy consumption: $E_{\text{sys}}^{\min}$ is the minimum energy required to complete the entire application, which is the total energy required when all tasks are executed at their respective most energy-efficient configurations:
$$E_{\text{sys}}^{\min} = \sum_{v_i \in V} E_{\min}(i)$$
System peak energy consumption: $E_{\text{sys}}^{\max}$ is the upper limit of energy consumption when pursuing maximum performance:
$$E_{\text{sys}}^{\max} = \sum_{v_i \in V} E_{\max}(i)$$
Energy Constraints and Global Margin: To balance performance and energy consumption, the system sets a total energy consumption constraint $E_{\text{constraint}}$. This constraint must lie within the system’s physically feasible region, satisfying
$$E_{\text{constraint}} = \gamma E_{\text{sys}}^{\max} + (1 - \gamma) E_{\text{sys}}^{\min}$$
Under this constraint, we define the global energy reserve $E_{\text{slack}}^{\text{global}}$ as the additional energy pool that the system can use for higher performance beyond meeting the minimum operating requirements:
$$E_{\text{slack}}^{\text{global}} = E_{\text{constraint}} - E_{\text{sys}}^{\min}$$
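The boundary and margin definitions in this subsection can be sketched as follows. The server description (dictionaries with keys `s`, `p_stat`, `xi`, `freqs`), the function names, and the numeric values are assumptions made for this example.

```python
def energy(w, s, f, p_stat, xi):
    """Execution energy under the cubic power model."""
    return (p_stat + xi * f ** 3) * w / (s * f)


def task_bounds(w, servers):
    """(E_min, E_max) of one task over every server/frequency combination."""
    values = [energy(w, sv["s"], f, sv["p_stat"], sv["xi"])
              for sv in servers for f in sv["freqs"]]
    return min(values), max(values)


def global_slack(workloads, servers, gamma):
    """Return (E_constraint, E_slack_global) for a given gamma."""
    bounds = [task_bounds(w, servers) for w in workloads]
    e_sys_min = sum(lo for lo, _ in bounds)
    e_sys_max = sum(hi for _, hi in bounds)
    e_constraint = gamma * e_sys_max + (1 - gamma) * e_sys_min
    return e_constraint, e_constraint - e_sys_min


# Hypothetical single-server cluster with two frequency levels.
servers = [{"s": 1.0, "p_stat": 1.0, "xi": 1.0, "freqs": [1.0, 2.0]}]
e_constraint, e_slack = global_slack([2.0, 4.0], servers, gamma=0.9)
print(e_constraint, e_slack)
```

Substituting the constraint into the slack definition gives $E_{\text{slack}}^{\text{global}} = \gamma \left( E_{\text{sys}}^{\max} - E_{\text{sys}}^{\min} \right)$, so $\gamma$ directly sets the fraction of the feasible energy range that is released for performance.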

4.3.2. Two-Factor Mixed Weight Definition

To allocate energy margin more scientifically, we no longer rely solely on computational cost, but instead define a hybrid importance score $S_i$. This score combines the task’s topological criticality and relative workload:
$$S_i = \theta \cdot \frac{\mathrm{Rank}(v_i)}{\mathrm{Rank}_{\max}} + (1 - \theta) \cdot \frac{w_i}{W_{\max}}$$
Topological criticality ($\mathrm{Rank}(v_i)$): Reflects the task’s position on the critical path of the DAG (based on $\mathrm{Rank}_{\text{up}} + \mathrm{Rank}_{\text{down}}$). Relative workload ($w_i$): Reflects the size of the task itself. $\theta$ is a balancing factor used to adjust the weights of the two.
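A minimal sketch of the hybrid score follows. The function name and the two-task example are hypothetical; the default `theta=0.85` is taken from the configuration reported in the paper's experiments.

```python
def hybrid_scores(rank, workload, theta=0.85):
    """S_i = theta * Rank(v_i) / Rank_max + (1 - theta) * w_i / W_max."""
    rank_max = max(rank.values())
    w_max = max(workload.values())
    return {
        v: theta * rank[v] / rank_max + (1 - theta) * workload[v] / w_max
        for v in rank
    }


# Hypothetical tasks: "a" is topologically critical but light, "b" is heavy but off-path.
rank = {"a": 10.0, "b": 5.0}
workload = {"a": 2.0, "b": 8.0}
scores = hybrid_scores(rank, workload)
print(scores)
```

Setting `theta=1.0` or `theta=0.0` recovers a purely rank-based or purely workload-based score, respectively.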

4.3.3. Initial Budget Allocation

Based on the mixed score, the global margin $E_{\mathrm{slack}}^{\mathrm{global}}$ is proportionally allocated to each task to form the initial budget $E_{\mathrm{budget}}(i)$:
$$E_{\mathrm{budget}}(i) = E_{\min}(i) + \frac{S_i}{\sum_{v_j \in V} S_j} \cdot E_{\mathrm{slack}}^{\mathrm{global}}$$
During online scheduling, HERO employs a dynamic budget reclamation strategy. The accumulated surplus $E_{\mathrm{saved}}(t)$ is defined as the sum of unused budgets from previously scheduled tasks. The actual available dynamic upper limit $E_{\mathrm{limit}}(i)$ for the current task $v_i$ is
$$E_{\mathrm{limit}}(i) = E_{\mathrm{budget}}(i) + E_{\mathrm{saved}}(t)$$
The scheduler selects the highest frequency, $f_{i,k}^{*}$, on the chosen processor $p_k$ that meets the dynamic upper limit:
$$f_{i,k}^{*} = \max \{ f \in F_k \mid E_{i,k}(f) \le E_{\mathrm{limit}}(i) \}$$
After the task is executed, the accumulated surplus is updated:
$$E_{\mathrm{saved}}(t+1) \leftarrow E_{\mathrm{saved}}(t) + E_{\mathrm{budget}}(i) - E_{\mathrm{actual}}(i)$$
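The allocation, frequency selection, and reclamation steps above can be sketched as a single loop. This is an illustrative sketch under stated assumptions rather than HERO's actual scheduler: the toy energy model $E(f) = w \cdot f^2$, the function names, and the fallback to the cheapest level when no frequency fits the limit are all hypothetical, since the text does not specify these details.

```python
from typing import Callable, List, Sequence, Tuple

EnergyModel = Callable[[float], float]


def initial_budgets(
    e_min: Sequence[float], scores: Sequence[float], e_slack_global: float
) -> List[float]:
    """E_budget(i) = E_min(i) + S_i / sum_j S_j * E_slack_global."""
    total_score = sum(scores)
    return [em + s / total_score * e_slack_global for em, s in zip(e_min, scores)]


def pick_frequency(energy_at: EnergyModel, freqs: Sequence[float], e_limit: float) -> float:
    """Highest frequency whose energy stays within the dynamic limit."""
    feasible = [f for f in freqs if energy_at(f) <= e_limit]
    if feasible:
        return max(feasible)
    # Assumption: if no level fits, fall back to the cheapest level.
    return min(freqs, key=energy_at)


def run_budgeted(
    e_min: Sequence[float],
    scores: Sequence[float],
    energy_models: Sequence[EnergyModel],
    freqs: Sequence[float],
    e_slack_global: float,
) -> Tuple[List[float], float]:
    """Schedule tasks in order and reclaim unused budget as E_saved."""
    budgets = initial_budgets(e_min, scores, e_slack_global)
    e_saved = 0.0
    chosen: List[float] = []
    for budget, energy_at in zip(budgets, energy_models):
        e_limit = budget + e_saved                # E_limit(i)
        f_star = pick_frequency(energy_at, freqs, e_limit)
        e_saved += budget - energy_at(f_star)     # E_saved(t + 1)
        chosen.append(f_star)
    return chosen, e_saved


LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]  # normalized frequency levels
# Toy energy model E(f) = w * f^2 for two tasks with w = 10 and w = 20.
models = [lambda f, w=w: w * f ** 2 for w in (10.0, 20.0)]
chosen, e_saved = run_budgeted([0.4, 0.8], [0.5, 0.5], models, LEVELS, 10.0)
print(chosen, round(e_saved, 6))
```

In this toy instance the second task's own budget (5.8) would only admit the 0.4 level, but the surplus left by the first task raises its limit to 7.6 and admits the 0.6 level, which is the effect the reclamation rule is meant to achieve.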

4.3.4. Effectiveness of the Hybrid Budgeting Mechanism

To further examine the effectiveness of HERO's two-factor hybrid weighting mechanism, we designed a set of controlled-variable experiments. They test one core claim: under identical energy consumption constraints, HERO's hybrid strategy outperforms single-dimensional allocation strategies.
We configured HERO's budget allocation module in three variants and compared their completion times (makespan) under the same total energy consumption constraint ($E_{\mathrm{constraint}}$ with $\gamma = 0.9$):
  • Rank-Only: Allocation of budget based solely on the task's position on the critical path, with a balance factor $\theta = 1.0$.
  • Workload-Only: Allocation of budget based solely on the task's base energy consumption (i.e., computational load), with $\theta = 0.0$.
  • Hybrid (Ours): HERO's default configuration ($\theta = 0.85$), considering both topological criticality and relative workload.
Figure 8 compares the three strategies across 10 groups of DAG instances. The data show that Rank-Only can starve heavily loaded tasks on non-critical paths: excessive frequency reduction on those tasks causes new cascading blockages. Workload-Only ignores global dependencies and wastes budget on non-critical tasks with high slack, leaving critical nodes insufficiently accelerated. In contrast, HERO's hybrid mechanism balances topological criticality and computational volume. Under the same energy constraint, HERO's completion time improves by 8% and 2% relative to the Rank-Only and Workload-Only strategies, respectively, which indicates that jointly considering topology and load is key to energy-efficient scheduling in heterogeneous environments.

4.4. Processor Selection Based on Hole-Filling Strategy

4.4.1. Hole-Filling Strategy

To overcome the resource fragmentation problem caused by heterogeneous communication latency, HERO introduces a hole-filling strategy to actively reclaim idle time slices on the processors.
Define the $m$-th idle time slice as the interval $\mathrm{Slot}_m = [t_{\mathrm{end}}(m), t_{\mathrm{start}}(m+1)]$. For task $v_i$ to be safely inserted into $\mathrm{Slot}_m$, the following timing constraint must be met:
$$\min\big(t_{\mathrm{start}}(m+1), \mathrm{Deadline}_i\big) - \max\big(t_{\mathrm{end}}(m), \mathrm{DRT}_{i,k}\big) \ge \frac{w_i}{s_k \cdot f_{i,k}^{*}}$$
Based on the hole-filling strategy, the earliest start time (EST) of the task is the earliest time among all feasible slots:
$$\mathrm{EST}_{i,k} = \min_{m} \big\{ \max\big(t_{\mathrm{end}}(m), \mathrm{DRT}_{i,k}\big) \;\big|\; \mathrm{Slot}_m \text{ is feasible} \big\}$$
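A minimal sketch of the slot search on a single processor is given below. It is one plausible realization under several assumptions that the text does not state: busy intervals are sorted and non-overlapping, idle time before the first scheduled task counts as a slot starting at time 0, and the open-ended interval after the last task is treated as a final slot. All function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def execution_time(w_i: float, s_k: float, f_star: float) -> float:
    """Execution time w_i / (s_k * f*) on processor p_k at frequency f*."""
    return w_i / (s_k * f_star)


def earliest_start(
    busy: Sequence[Tuple[float, float]],
    drt: float,
    exec_time: float,
    deadline: float = math.inf,
) -> Optional[float]:
    """Earliest feasible start time with hole-filling, or None if no slot fits.

    `busy` holds sorted, non-overlapping (start, end) intervals already
    scheduled on the processor.
    """
    prev_end = 0.0  # assumption: the processor is idle from time 0
    for start, end in busy:
        begin = max(prev_end, drt)
        # Feasibility test: usable slot length must cover the execution time.
        if min(start, deadline) - begin >= exec_time:
            return begin
        prev_end = end
    # Assumption: the open-ended interval after the last task is also a slot.
    begin = max(prev_end, drt)
    return begin if deadline - begin >= exec_time else None


def append_only_start(busy: Sequence[Tuple[float, float]], drt: float) -> float:
    """Tail-append baseline: start no earlier than the last task's end."""
    avail = busy[-1][1] if busy else 0.0
    return max(avail, drt)


schedule = [(0.0, 4.0), (10.0, 12.0)]
need = execution_time(w_i=10.0, s_k=2.0, f_star=1.0)
print(earliest_start(schedule, drt=3.0, exec_time=need))
print(append_only_start(schedule, drt=3.0))
```

Because the slots are scanned in time order and $\max(t_{\mathrm{end}}(m), \mathrm{DRT}_{i,k})$ is non-decreasing in $m$, the first feasible slot found is also the minimizer in the EST definition.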

4.4.2. Verify the Effectiveness of the Hole-Filling Strategy

To demonstrate the effectiveness of reclaiming fragmented idle time in heterogeneous edge clusters, we designed a controlled ablation experiment with the task mapping strategy as the sole independent variable. All other scheduling components remained completely consistent in both schemes.
Baseline Scheme (Append-Only): This strategy strictly follows the tail-append principle. For any task $v_i$ assigned to processor $p_j$, its earliest start time (EST) is restricted to no earlier than the processor's current available time $\mathrm{Avail}(p_j)$, i.e., the completion time of the last scheduled task:
$$\mathrm{EST}(v_i, p_j) = \max\big(\mathrm{Avail}(p_j), \mathrm{DRT}(v_i, p_j)\big)$$
The experimental results are shown in Figure 9. The results demonstrate that the hole-filling strategy achieves synergistic optimization of latency and energy consumption. In terms of execution efficiency, this strategy effectively compresses the system’s critical path by dynamically filling tasks into fragmented time, resulting in an average completion time reduction of 6.96%. Regarding energy efficiency, compared to the static power waste caused by processor idling in the traditional Append-Only mode, the hole-filling strategy achieves energy savings of up to 9.46% by eliminating invalid waiting periods.

5. Experiments

To verify the effectiveness of the HERO framework in heterogeneous edge computing environments, we built a high-fidelity simulation platform in Python 3.8.20 and conducted extensive comparative experiments against four mainstream scheduling algorithms from the recent literature.

5.1. Experimental Setup

5.1.1. Task Generation

To evaluate the performance of the scheduling algorithm in different application scenarios, we use a parameter-controlled random DAG generator to construct a diverse benchmark set. This generation model strictly follows the mathematical formulations below:
Topology generation employs a hierarchical (layered) method to construct the DAG structure, which guarantees acyclicity by construction. We divide the vertex set $V$ into $L$ disjoint hierarchical subsets:
$$V = \bigcup_{k=1}^{L} V_k, \quad V_a \cap V_b = \varnothing, \quad \forall a \neq b$$
Graph depth: The number of layers $L$ follows a uniform distribution $L \sim U[L_{\min}, L_{\max}]$, reflecting the serial length of the task flow. Graph width: The parallelism (number of nodes) of each layer $|V_k|$ follows a distribution $|V_k| \sim U[P_{\min}, P_{\max}]$. Connection constraint: For any edge $(v_i, v_j) \in E$, if $v_i \in V_a$ and $v_j \in V_b$, then the hierarchical constraint $a < b$ must be satisfied.
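The layered construction can be sketched as follows. This is an illustrative generator, not the authors' code: it assumes integer-valued uniform sampling for depth and width, and it assumes that every forward pair of nodes is connected independently with a fixed probability, which is one simple way to satisfy the constraint $a < b$. The function name and parameters are hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple


def layered_dag(
    l_min: int,
    l_max: int,
    p_min: int,
    p_max: int,
    edge_prob: float = 0.3,
    seed: Optional[int] = None,
) -> Tuple[List[List[int]], Set[Tuple[int, int]]]:
    """Return (layers, edges) for a random layered DAG."""
    rng = random.Random(seed)
    n_layers = rng.randint(l_min, l_max)      # L ~ U[L_min, L_max]
    layers: List[List[int]] = []
    next_id = 0
    for _ in range(n_layers):
        width = rng.randint(p_min, p_max)     # |V_k| ~ U[P_min, P_max]
        layers.append(list(range(next_id, next_id + width)))
        next_id += width

    edges: Set[Tuple[int, int]] = set()
    for a in range(n_layers):
        for b in range(a + 1, n_layers):      # only forward edges, a < b
            for u in layers[a]:
                for v in layers[b]:
                    # Assumption: independent Bernoulli draw per forward pair.
                    if rng.random() < edge_prob:
                        edges.add((u, v))
    return layers, edges


layers, edges = layered_dag(3, 5, 2, 4, edge_prob=0.3, seed=42)
print(len(layers), sum(len(layer) for layer in layers), len(edges))
```

Since every edge points from a lower layer index to a higher one, any topological order of the layers is a valid topological order of the graph, so no cycle check is needed after generation.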
For computational load generation, we employ the UUniFast algorithm. This algorithm performs uniform sampling within the simplex defined by the total system utilization $U_{\mathrm{sys}}$, generating an unbiased utilization vector $\mathbf{u} = \{u_1, u_2, \ldots, u_n\}$ that satisfies the following conditions:
$$\sum_{i=1}^{n} u_i = U_{\mathrm{sys}}, \quad 0 < u_i < 1$$
Subsequently, the computational load $w_i$ of task $v_i$ is jointly determined by its allocated utilization $u_i$ and task period $T_i$ (scaling with the product $u_i \cdot T_i$), ensuring the statistical uniformity of the load distribution.
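A compact sketch of UUniFast is shown below. Because the experiments use $U_{\mathrm{sys}} > 1$ while requiring $0 < u_i < 1$, the sketch resamples any vector that violates the per-task bound (the variant usually called UUniFast-Discard); whether the authors handle the bound this way is an assumption. The task periods and the direct product $w_i = u_i \cdot T_i$ are likewise illustrative assumptions.

```python
import random
from typing import List


def uunifast(n: int, u_total: float, rng: random.Random) -> List[float]:
    """Draw n utilizations that sum to u_total, uniformly over the simplex."""
    utils: List[float] = []
    remaining = u_total
    for i in range(1, n):
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils


def uunifast_discard(
    n: int, u_total: float, rng: random.Random, max_tries: int = 10_000
) -> List[float]:
    """Resample until every utilization lies strictly inside (0, 1)."""
    for _ in range(max_tries):
        utils = uunifast(n, u_total, rng)
        if all(0.0 < u < 1.0 for u in utils):
            return utils
    raise RuntimeError("no valid utilization vector found; check n and u_total")


rng = random.Random(7)
utils = uunifast_discard(n=20, u_total=2.5, rng=rng)
# Hypothetical periods T_i; the workload is taken as the product u_i * T_i.
periods = [rng.choice([100, 200, 500, 1000]) for _ in utils]
workloads = [u * t for u, t in zip(utils, periods)]
print(round(sum(utils), 6), round(min(workloads), 2), round(max(workloads), 2))
```

The exponent $1/(n-i)$ is what keeps the sample uniform over the simplex; drawing each $u_i$ independently and rescaling would bias the distribution toward equal shares.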
To cover diverse application characteristics, we introduce the Communication-to-Computation Ratio (CCR) as a control parameter. The average computational load of the graph is calculated as
$$\bar{w} = \frac{1}{n} \sum_{i=1}^{n} w_i$$
The data transfer volume $d_{i,j}$ of edge $e_{i,j}$ is generated based on the following random process:
$$d_{i,j} = \bar{w} \times \mathrm{CCR} \times \delta, \quad \delta \sim U[0.8, 1.2]$$
Here, $\delta$ is a random perturbation factor used to simulate the random fluctuations in communication overhead between different tasks in a real environment.
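The edge-volume rule translates into a few lines of code. The sketch below assumes a continuous uniform draw for $\delta$ and uses hypothetical names and sample values; edges are processed in sorted order only so that a fixed seed gives reproducible output.

```python
import random
from typing import Dict, Iterable, Optional, Sequence, Tuple

Edge = Tuple[int, int]


def edge_volumes(
    workloads: Sequence[float],
    edges: Iterable[Edge],
    ccr: float,
    seed: Optional[int] = None,
) -> Dict[Edge, float]:
    """Assign d_ij = w_bar * CCR * delta with delta ~ U[0.8, 1.2]."""
    rng = random.Random(seed)
    w_bar = sum(workloads) / len(workloads)  # average computational load
    return {
        edge: w_bar * ccr * rng.uniform(0.8, 1.2)
        for edge in sorted(edges)
    }


# Hypothetical three-task chain with CCR = 2.0.
volumes = edge_volumes([100.0, 200.0, 300.0], [(0, 1), (1, 2)], ccr=2.0, seed=0)
for edge, volume in volumes.items():
    print(edge, round(volume, 2))
```

With these inputs $\bar{w} = 200$, so every volume falls between 320 and 480, i.e., within 20% of $\bar{w} \times \mathrm{CCR} = 400$.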

5.1.2. Simulation Platform and Tools

The proposed HERO framework is implemented in Python 3; specifically, the communication-aware predictor of HERO uses the PyTorch 2.4.1+cu118 framework for offline training and forward inference. Data aggregation and visualization are handled by the Pandas 2.0.3 and Seaborn 0.13.2 libraries. Furthermore, to ensure computational efficiency, the NSGA-II benchmark was independently implemented in C++ and dynamically invoked by the Python-based main scheduler during the simulation.
To construct a representative modern heterogeneous edge environment, we employ three distinct types of processors: Raspberry Pi 4 (Raspberry Pi Foundation, Cambridge, UK), Jetson Orin Nano (NVIDIA Corporation, Santa Clara, CA, USA), and Intel Xeon D (Intel Corporation, Santa Clara, CA, USA). Their corresponding frequencies (f) and power consumption profiles (P) across different levels are detailed in Table 5. For systematic comparison and frequency scaling analysis, the processor frequencies are normalized across five discrete levels with a step size of 0.2. The cluster consists of three heterogeneous nodes, and the static leakage power of each processor is set to 10% of its peak dynamic power at the maximum frequency level (1.0).
To ensure the objectivity and reproducibility of the benchmark set, the average connection probability of the generated DAGs is set to 0.3 during the topology generation phase, and the basic computational load of tasks is randomly sampled between 10 and 3000. The Communication-to-Computation Ratio (CCR) is sampled from the wide range $[0.5, 3.0]$. The energy budget factor $\gamma$ is empirically set to 0.85 (i.e., $E_{\mathrm{constraint}} = 0.85 \times E_{\mathrm{sys}}^{\max} + 0.15 \times E_{\mathrm{sys}}^{\min}$) during the resource allocation phase.

5.1.3. Evaluation Metrics

To quantitatively evaluate the performance of the proposed HERO framework and baseline algorithms, we employ two primary absolute metrics defined in Section 3.4: Makespan ($T_{\mathrm{makespan}}$) and Total Energy Consumption ($E_{\mathrm{total}}$). Furthermore, to clearly illustrate the relative advantages of HERO, we define the Performance Improvement Ratio (PIR) for both latency and energy. Let $M_{\mathrm{base}}$ denote the metric value (Makespan or Energy) obtained by the benchmark algorithm, and $M_{\mathrm{HERO}}$ denote the corresponding value obtained by HERO. The improvement percentage is calculated as
$$\mathrm{PIR} = \frac{M_{\mathrm{base}} - M_{\mathrm{HERO}}}{M_{\mathrm{base}}} \times 100\%$$
A positive PIR indicates that HERO outperforms the baseline algorithm, while a negative value indicates performance degradation. In our multi-trial experiments, all reported results are the arithmetic mean over multiple random instances to reduce the influence of statistical outliers.
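As a small worked example of the PIR definition, the sketch below evaluates it for two hypothetical measurements; the function name and the numbers are illustrative only and are not taken from the experiments.

```python
def pir(m_base: float, m_hero: float) -> float:
    """Performance Improvement Ratio in percent (positive means HERO is better)."""
    if m_base <= 0:
        raise ValueError("baseline metric must be positive")
    return (m_base - m_hero) / m_base * 100.0


# Hypothetical makespans: the baseline needs 120 time units, HERO needs 100.
print(round(pir(120.0, 100.0), 2))
# Hypothetical energies: the baseline uses 100 units, HERO uses 110.
print(round(pir(100.0, 110.0), 2))
```

The first call yields a positive value (HERO is faster), while the second is negative (HERO consumes more energy than that baseline), matching the sign convention stated above.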

5.2. Comparison with Benchmark Algorithms

To comprehensively evaluate the performance boundaries of the HERO framework, we selected four state-of-the-art (SOTA) algorithms as benchmarks, each representing a distinct scheduling strategy:
  • NSGA-II: A classic Pareto-based multi-objective evolutionary algorithm. In theory, NSGA-II can approach the global optimum given unbounded search time. In our experiments, NSGA-II runs under a scale-adaptive budget: the population size is a linear function of the number of tasks, and the maximum number of iterations is limited to three times the number of tasks.
  • DPMC: A heuristic algorithm for mixed-criticality systems that distinguishes between high- and low-criticality tasks. It employs a relatively conservative frequency reservation strategy to ensure the deadlines of high-priority tasks.
  • SSA (denoted SSA-DAG in the results): A structure-aware scheduling algorithm based on an MLP. It uses a neural network to learn node importance and introduces a dual-queue mechanism that reserves resources for high-priority tasks waiting for their predecessors to complete, with the goal of reducing task completion time.
  • TOM: An algorithm based on time-triggered scheduling and chain-structure optimization. It performs well when merging linear task chains to reduce synchronization overhead.

5.3. Experimental Results

5.3.1. Impact of Task Size (Number of Layers) on Performance

Figure 10 shows algorithm performance as DAG depth $M$ increases from 8 to 13 (with fixed width $N = 15$).
HERO achieves the fastest completion time across all depths. Without fine-grained communication awareness, SSA-DAG lags behind HERO by 28.89% on average (peaking at 32.85% at $M = 12$). DPMC's rigid resource reservations similarly cause a 15.46% scheduling delay.
Regarding energy trade-offs, NSGA-II and DPMC save 9.58% and 3.29% energy, respectively, but severely sacrifice real-time performance (lagging 12.29% and 15.46% in makespan). Meanwhile, SSA-DAG and TOM consume more energy than HERO (4.04% and 1.92%, respectively) while remaining slower. Overall, HERO's hybrid budgeting secures the best completion time without excessive energy waste.

5.3.2. Impact of Task Parallelism (Width) on Performance

Figure 11 shows the performance trends of the algorithms under different levels of parallelism. We fixed the DAG depth at $M = 10$ and varied the average number of parallel nodes per layer $N$ from 10 to 30.
HERO maintained the fastest completion time in all tests. SSA-DAG's completion time was on average 22.51% slower than HERO's. The TOM algorithm, which adapts poorly to complex mesh dependencies, was on average 9.13% slower. DPMC's static rule-based reservation mechanism struggles to adapt to dynamic concurrent workloads, resulting in performance fluctuations with an average lag of 13.19%.
In terms of energy optimization, DPMC and NSGA-II saved 3.52% and 7.71% of energy on average compared to HERO, respectively. NSGA-II achieved this at the cost of a 6.39% longer completion time on average, with the worst-case degradation reaching 7.78%. Moreover, NSGA-II relies on heavy population iterations (requiring minutes), whereas HERO completes inference in microseconds using neural network forward propagation.

5.3.3. Impact of Graph Density (Connection Probability)

Figure 12 illustrates algorithm performance across varying graph densities. We adjusted the connection probability $C$ from 0.1 (highly sparse) to 0.7 (highly dense), with fixed depth $M = 12$, width $N = 15$, and task utilization $U = 2.5$.
Increasing graph density markedly exacerbates communication bottlenecks and synchronization barriers. HERO consistently achieves the lowest makespan across all densities. In highly dense scenarios ($C \ge 0.5$), algorithms lacking fine-grained communication awareness struggle to resolve complex mesh dependencies: SSA-DAG and TOM lag behind HERO by an average of 10.89% and 5.36%, respectively.
In contrast, HERO's deep-enhanced MLP explicitly incorporates the maximum communication overhead ($Comm_{\max}$) to identify true critical paths amid large data transfer delays. While DPMC and NSGA-II exhibit energy-saving tendencies under dense topologies (saving 3.67% and 10.54% on average, respectively), they severely compromise real-time performance, with makespans lagging behind HERO by 8.24% and 4.53%.

5.3.4. System Load Stress Test

Figure 13 shows the performance of the scheduling algorithms as the system task utilization ($U$) increases from $U = 1.0$ to overload ($U = 3.0$). We keep the other parameters fixed at $M = 12$, $N = 15$.
For DPMC, the completion-time lag behind HERO remains large at every load level, rising slightly from 19.75% under light load to 19.97% under heavy load ($U = 3.0$), which indicates severe underutilization of resources. SSA-DAG has an average completion-time lag of 40.90%, illustrating the limitation of relying solely on structural features without fine-grained communication awareness in high-density computing scenarios. NSGA-II sacrifices 14.66% of execution speed to achieve 12.18% energy savings. Across all tests, HERO maintains its lead in completion time.

6. Conclusions and Future Work

In this paper, we proposed the lightweight HERO framework to address the bi-objective scheduling of delay-sensitive DAG tasks in heterogeneous edge clusters. Our extensive simulations demonstrate the framework's practical advantages. Notably, when compared to the representative learning-based baseline (SSA-DAG), HERO achieves an average 10.89% reduction in makespan under high-density topologies and saves up to 4.04% of system energy across varying task depths. For resource-constrained edge devices, this energy margin is significant, as it cumulatively extends battery life and helps prevent hardware thermal throttling during sustained workloads. HERO achieves these savings without introducing new system bottlenecks, while maintaining the microsecond-level scheduling latency crucial for real-time edge intelligence.
Building upon these quantitative results, our future research will focus on three main avenues: (1) adapting the framework for dynamic, online environments (e.g., vehicular networks) with unpredictable task generation and topologies; (2) integrating lightweight fault-tolerance mechanisms to ensure high reliability against transient edge node failures; and (3) advancing to hardware-in-the-loop deployments on actual microcontrollers and embedded IoT sensor nodes to assess real-world physical overhead and end-to-end adaptability outside of simulated environments.

Author Contributions

Conceptualization, methodology, data curation, writing—original draft preparation, project administration, and funding acquisition, Z.Z.; visualization, writing—review and editing, validation, and funding acquisition, Y.J.; supervision, N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Summary of Key Notations.
| Notation | Description | Notation | Description |
|---|---|---|---|
| $G, V, E$ | DAG components ($G$, $V$, $E$) | $v_i$ | $i$-th computational task |
| $e_{i,j}$ | Data dependency edge | $w_i$ | Task computational workload |
| $d_{i,j}$ | Data transmission volume | $B$ | Average network bandwidth |
| CCR | Comm-to-computation ratio | $\mathrm{Pred}, \mathrm{Succ}$ | Predecessor and successor sets |
| $P, p_k$ | Processor set and $k$-th server | $s_k$ | Processing speed coefficient |
| $F_k, f_{i,k}$ | Frequency set and selected frequency | $t_{i,k}$ | Task execution time on $p_k$ |
| $c_{i,j}$ | Communication time cost | $P_k^{\mathrm{stat}}$ | Static baseline power of $p_k$ |
| $\xi_k$ | Hardware capacitive constant | $E_i^{\mathrm{exec}}$ | Task execution energy |
| $E_{\min}(i)$ | Task minimum energy bound | $E_{\max}(i)$ | Task maximum energy bound |
| $E_{\mathrm{total}}$ | Total system energy consumption | $T_{\mathrm{makespan}}$ | Total completion time |
| $x_i$ | 11-D task feature vector | $y_i$ | Marginal sensitivity label |
| $E_{\mathrm{constraint}}$ | System energy constraint | $\gamma$ | Energy budget control factor |
| $E_{\mathrm{slack}}^{\mathrm{global}}$ | Global energy margin pool | $S_i$ | Hybrid importance score |
| $\theta$ | Weight balancing factor | $E_{\mathrm{budget}}(i)$ | Initial task energy budget |
| $E_{\mathrm{saved}}(t)$ | Accumulated energy surplus | $E_{\mathrm{limit}}(i)$ | Actual dynamic energy limit |
| $E_{\mathrm{actual}}(i)$ | Final energy consumed by $v_i$ | $\mathrm{DRT}_{i,k}$ | Data ready time on $p_k$ |
| $\mathrm{EST}_{i,k}$ | Earliest start time on $p_k$ | $\mathrm{Slot}_m$ | $m$-th idle time slice |
| $\mathrm{Deadline}_i$ | Task execution deadline | $\delta$ | Random perturbation factor |
| PIR | Performance improvement ratio | $U_{\mathrm{sys}}$ | System task utilization |

References

  1. Li, J.; Shang, Y.; Qin, M.; Yang, Q.; Cheng, N.; Gao, W.; Kwak, K.S. Multiobjective oriented task scheduling in heterogeneous mobile edge computing networks. IEEE Trans. Veh. Technol. 2022, 71, 8955–8966.
  2. Zhou, X.; Ge, S.; Liu, P.; Qiu, T. DAG-based dependent tasks offloading in MEC-enabled IoT with soft cooperation. IEEE Trans. Mob. Comput. 2023, 23, 6908–6920.
  3. Cao, Z.; Deng, X.; Yue, S.; Jiang, P.; Ren, J.; Gui, J. Dependent task offloading in edge computing using GNN and deep reinforcement learning. IEEE Internet Things J. 2024, 11, 21632–21646.
  4. Peng, Q.; Wu, C.; Xia, Y.; Ma, Y.; Wang, X.; Jiang, N. DoSRA: A decentralized approach to online edge task scheduling and resource allocation. IEEE Internet Things J. 2021, 9, 4677–4692.
  5. Taghinezhad-Niar, A.; Taheri, J. Fault-Tolerant Cost-Efficient Scheduling for Energy and Deadline-Constrained IoT Workflows in Edge-Cloud Continuum. IEEE Trans. Serv. Comput. 2025, 18, 2892–2903.
  6. He, X.; Pang, S.; Gui, H.; Zhang, K.; Wang, N.; Yu, S. Online offloading and mobility awareness of DAG tasks for vehicle edge computing. IEEE Trans. Netw. Serv. Manag. 2024, 22, 675–690.
  7. Jiang, Q.; Xin, X.; Zhang, T.; Chen, K. Energy-Efficient Task Scheduling and Resource Allocation in Edge Heterogeneous Computing Systems Using Multi-Objective Optimization. IEEE Internet Things J. 2025, 12, 36747–36764.
  8. Biswas, S.K.; Muhuri, P.K.; Roy, U.K. Binary search-based fast scheduling algorithms for reliability-aware energy-efficient task graph scheduling with fault tolerance. IEEE Trans. Sustain. Comput. 2023, 9, 433–451.
  9. Yu, Z.; Liu, W.; Liu, X.; Wang, G. Drag-JDEC: A deep reinforcement learning and graph neural network-based job dispatching model in edge computing. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS); IEEE: Piscataway, NJ, USA, 2021; pp. 1–10.
  10. Zhou, Y.; Li, X.; Luo, J.; Yuan, M.; Zeng, J.; Yao, J. Learning to optimize dag scheduling in heterogeneous environment. In Proceedings of the 2022 23rd IEEE International Conference on Mobile Data Management (MDM); IEEE: Piscataway, NJ, USA, 2022; pp. 137–146.
  11. Deng, X.; Yang, H.; Zhang, J.; Gui, J.; Lin, S.; Wang, X.; Min, G. Task offloading in internet of vehicles: A drl-based approach with representation learning for dag scheduling. IEEE Trans. Mob. Comput. 2025, 24, 5045–5060.
  12. Zhang, J.; Mo, L.; Wang, X.; Yang, C.; Wang, M.; Niu, D. Mixed-criticality DAGs Scheduling and Performance Optimization for Heterogeneous Multicore Systems. In Proceedings of the 2025 37th Chinese Control and Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2025; pp. 3013–3019.
  13. Gao, Y.; Yi, H.; Chen, H.; Fang, X.; Zhao, S. A structure-aware DAG scheduling and allocation on heterogeneous multicore systems. In Proceedings of the 2024 IEEE 14th International Symposium on Industrial Embedded Systems (SIES); IEEE: Piscataway, NJ, USA, 2024; pp. 26–33.
  14. Wang, S.; Li, D.; Huang, S.Y.; Deng, X.; Sifat, A.H.; Huang, J.B.; Jung, C.; Williams, R.; Zeng, H. Time-triggered scheduling for nonpreemptive real-time DAG tasks using 1-opt local search. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3650–3661.
  15. Liu, D.; Chen, J.; Huang, X.; Hong, H. A reliability-aware and energy-aware task scheduling algorithm for heterogeneous multi-core systems. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2024; pp. 3212–3217.
  16. Zhang, Y.; Zhao, S.; Chen, G.; Huang, K. Fault-tolerant DAG scheduling with runtime reconfiguration on multicore real-time systems. In Proceedings of the 2024 IEEE 35th International Conference on Application-Specific Systems, Architectures and Processors (ASAP); IEEE: Piscataway, NJ, USA, 2024; pp. 19–27.
  17. Sun, B.; Theile, M.; Qin, Z.; Bernardini, D.; Roy, D.; Bastoni, A.; Caccamo, M. Edge generation scheduling for dag tasks using deep reinforcement learning. IEEE Trans. Comput. 2024, 73, 1034–1047.
  18. Song, X.; Feng, J.; Liu, L.; Pei, Q.; Yu, F.R.; Zhang, N. A Deep Reinforcement Learning with Transformer Integration for Directed Acyclic Graph Scheduling in Edge Networks. IEEE Trans. Wirel. Commun. 2025, 25, 5506–5520.
  19. Liu, Z.; Huang, L.; Gao, Z.; Luo, M.; Hosseinalipour, S.; Dai, H. GA-DRL: Graph neural network-augmented deep reinforcement learning for DAG task scheduling over dynamic vehicular clouds. IEEE Trans. Netw. Serv. Manag. 2024, 21, 4226–4242.
  20. Ding, W.; Luo, F.; Gu, C.; Dai, Z.; Lu, H. A multiagent meta-based task offloading strategy for mobile-edge computing. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 100–114.
  21. Zhao, G.; Xu, H.; Zhao, Y.; Qiao, C.; Huang, L. Offloading dependent tasks in mobile edge computing with service caching. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications; IEEE: Piscataway, NJ, USA, 2020; pp. 1997–2006.
  22. Lou, J.; Tang, Z.; Zhang, S.; Jia, W.; Zhao, W.; Li, J. Cost-effective scheduling for dependent tasks with tight deadline constraints in mobile edge computing. IEEE Trans. Mob. Comput. 2022, 22, 5829–5845.
  23. Chen, Y.; Liu, S.; Zhou, J.; Ling, X. Real-Time DAG Task Allocation Strategy for Multiprocessor by Optimistic Parallelism. In Proceedings of the 2024 IEEE 24th International Conference on Communication Technology (ICCT); IEEE: Piscataway, NJ, USA, 2024; pp. 1016–1021.
Figure 1. An example of a Directed Acyclic Graph (DAG) task model. Nodes ($v_0$–$v_9$) represent computational tasks, and directed edges indicate execution dependencies and data transmission flow.
Figure 2. The architecture of the HERO framework.
Figure 3. The task sensitivity analysis process, showing how marginal delay impacts are calculated to generate training labels.
Figure 4. Training performance of the proposed deep-enhanced MLP predictor. (a) The rapid convergence of training and validation loss within 80 epochs. (b) A scatter plot comparing true vs. predicted sensitivity on the test set; the dashed line represents the ideal prediction ($y = x$), demonstrating the model's high prediction accuracy.
Figure 5. Micro-behavior analysis comparing the traditional $Rank_u$ strategy with the proposed MLP-based strategy.
Figure 6. Performance comparison of $Rank_u$ and MLP under varying graph depths and parallelism levels.
Figure 7. Permutation feature importance analysis, highlighting computational load and communication overhead as the most critical scheduling features.
Figure 8. Performance comparison of Rank-Only, Workload-Only, and HERO scheduling strategies under identical energy constraints.
Figure 9. Completion time and energy consumption comparison between hole-filling and Append-Only strategies.
Figure 10. Performance Comparison of Algorithms at Different Depths (M).
Figure 11. Performance comparison of the algorithms under different parallelism levels (N).
Figure 12. Performance comparison of algorithms under different graph densities (C).
Figure 13. Performance comparison of algorithms under different system utilizations (U).
Table 1. Comprehensive Feature Comparison of Existing Task Scheduling Strategies and the Proposed HERO Framework.
| Category | Ref. & Method | Optimization Objectives | Energy Strategy | Communication Overhead Handling |
|---|---|---|---|---|
| Heuristic & Evolutionary Algorithms | Li et al. [1] | Makespan, Energy | Standard DVFS | Static transmission assumption |
| | Jiang et al. [7] | Energy, Delay | DVFS auto-adjustment | Partially considered |
| | Zhang et al. [12] | Performance, Service Quality | Dynamic DVFS | Not strictly prioritized |
| | Liu et al. [15] | Reliability, Energy | Standard DVFS | Redundancy-based |
| | Biswas et al. [8] | Reliability, Energy | Fast DVFS switching | Static bounds |
| | Zhang et al. [16] | Makespan, Fault-tolerance | None (Re-execution) | Unpredictable failure status |
| | Taghinezhad-Niar [5] | Cost, Energy, Deadline | Energy-constrained | Edge-cloud congestion modeled |
| DRL & Intelligent Scheduling | Drag-JDEC [9] | Makespan, QoS | None | GNN feature extraction |
| | Cao et al. [3] | Makespan | None | GAT-based dependency |
| | Sun et al. [17] | DAG Width, Deadline | None | Edge generation representation |
| | Song et al. [18] | Energy, Makespan | Transmit power & CPU freq. | Attention-based feature |
| | GA-DRL [19] | Makespan, Timeliness | None | Topology two-way aggregation |
| | DVTP [11] | Makespan | None | Spatiotemporal representation |
| | Ding et al. [20] | Latency, Energy | Charging time trade-off | Dynamic environment-aware |
| | LACHESIS [10] | Completion time | None | Topological perception |
| Structure-Aware & Specific Strategies | Zhao et al. [21] | Execution time | None | Wireless interference |
| | LOU [22] | Latency, Cost | Strict Constraints | Dependency-aware |
| | Zhou et al. [2] | Latency, Energy, Gain | End-device frequency | Soft cooperation (Data sharing) |
| | He et al. [6] | Makespan, Queue stability | None | Cross-slot queue (Lyapunov) |
| | Gao et al. [13] | Makespan | None | Pre-calculated node priority |
| | Wang et al. [14] | Worst-case latency | None | 1-Opt Local Search path |
| | DoSRA [4] | Efficiency, Delay | None | Decentralized provision |
| | OPSA [23] | Processor utilization | None | Parallelism limitation |
| Proposed | HERO (Ours) | Makespan, Energy | Preventive-bottleneck budget | Aggressive hole-filling strategy |
Table 2. The 11-dimensional feature vector for task representation.
| Feature Categories | Feature Name | Specific Meaning |
|---|---|---|
| Computation & Communication | $w_i$ | The task's own computational load |
| | $Comm_{\max}$ | Maximum data transfer weight of all output edges of the task |
| Static Priority Features | $Rank_{\mathrm{up}}$ | Distance from the task to the exit node |
| | $Rank_{\mathrm{down}}$ | Distance from the task to the entry node |
| | $Level_{\mathrm{topo}}$ | Task level depth in the DAG topological sorting |
| Graph Structure & Path | $d_{\mathrm{in}}$ | In-degree of the task node |
| | $d_{\mathrm{out}}$ | Out-degree of the task node |
| | $d_{\mathrm{tot}}$ | Total degree of the task node |
| | $d_{\mathrm{diff}}$ | The difference between in-degree and out-degree |
| | $T_{\mathrm{path}}$ | Execution time of the critical path passing through the node |
| | $N_{\mathrm{path}}$ | Number of paths from entry node to exit node passing through the node |
Table 3. Model performance comparison against classic baselines.
| Model | MSE | MAE | $R^2$ |
|---|---|---|---|
| Linear Regression | 0.018663 | 0.075654 | 0.1786 |
| Decision Tree | 0.017255 | 0.070981 | 0.2406 |
| Random Forest | 0.010575 | 0.055387 | 0.5346 |
| XGBoost | 0.010004 | 0.054553 | 0.5597 |
Table 4. Ablation study of MLP architectures on prediction accuracy and computational overhead.
| Architecture | MSE | $R^2$ | Latency (μs) | Parameters |
|---|---|---|---|---|
| Shallow-MLP (2-layer) | 0.006224 | 0.690191 | 43.53 | 1537 |
| HERO-MLP (5-layer) | 0.005880 | 0.707295 | 199.09 | 50,049 |
| Deep-MLP (12-layer) | 0.006106 | 0.696074 | 254.53 | 2,638,849 |
Table 5. Processor configurations.
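| Level | Normalized Freq | Raspberry Pi 4 f (MHz) | Raspberry Pi 4 P (mW) | Jetson Orin Nano f (MHz) | Jetson Orin Nano P (mW) | Xeon D f (MHz) | Xeon D P (mW) |
|---|---|---|---|---|---|---|---|
| 1 | 0.2 | 300 | 2500 | 302 | 5000 | 600 | 20,000 |
| 2 | 0.4 | 600 | 3200 | 605 | 7000 | 1200 | 28,000 |
| 3 | 0.6 | 900 | 4200 | 907 | 10,000 | 1800 | 38,000 |
| 4 | 0.8 | 1200 | 5500 | 1209 | 15,000 | 2400 | 50,000 |
| 5 | 1.0 | 1500 | 7000 | 1512 | 25,000 | 3000 | 65,000 |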
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zeng, Z.; Jiang, Y.; Niu, N. Hybrid Energy-Aware Ranking and Optimization. Future Internet 2026, 18, 226. https://doi.org/10.3390/fi18050226

