Hybrid Energy-Aware Ranking and Optimization

Zeng, Zhiling; Jiang, Yuxuan; Niu, Na

doi:10.3390/fi18050226

by Zhiling Zeng, Yuxuan Jiang and Na Niu *

College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(5), 226; https://doi.org/10.3390/fi18050226

Submission received: 6 March 2026 / Revised: 19 April 2026 / Accepted: 20 April 2026 / Published: 22 April 2026


Abstract

The increase in delay-sensitive application tasks requires heterogeneous edge clusters to maintain low online latency and energy efficiency without relying on rigid scheduling policies. To address this, we propose HERO (Hybrid Energy-aware Ranking and Optimization), a lightweight collaborative scheduling framework. HERO utilizes a perturbation-based communication-aware multi-layer perceptron (MLP) predictor to quantify global time sensitivity and discover latent time slack in non-critical paths. A hybrid budget mechanism then converts this slack into customized DVFS decisions. These decisions are based on the inherent computational load and topological criticality to optimize energy consumption. A communication-aware hole-filling strategy dynamically recovers sporadic idle times fragmented by heterogeneous communication overhead. Extensive simulations were conducted across varying DAG depths, parallelism levels, and system utilizations. Compared to state-of-the-art algorithms (NSGA-II, SSA, TOM, and DPMC), HERO reduced the completion time by an average of 10.89% under high-density topologies, and achieved up to 4.04% energy savings across varying task depths.

Keywords: edge intelligence; Directed Acyclic Graph (DAG) scheduling; deep learning; DVFS

1. Introduction

Multi-access Edge Computing (MEC) has emerged as a critical infrastructure for latency-sensitive applications in the 5G/6G and IoT era [1,2]. To process complex Deep Neural Network (DNN) inferences, edge computing is shifting towards multi-node heterogeneous clusters that collaborate via task offloading [3,4]. However, edge servers exhibit significant differences in computing capabilities and face strict battery capacity or Thermal Design Power (TDP) constraints [5,6]. Achieving the joint optimization of millisecond-level real-time response (makespan) and total system energy consumption in such heterogeneous and dynamic environments has become a highly challenging NP-hard bi-objective scheduling problem [7,8].

To tackle this problem, edge computational workloads are typically modeled as complex Directed Acyclic Graphs (DAGs) to capture fine-grained dependencies and strict millisecond-level deadlines. Although academia has proposed various advanced DAG scheduling strategies, severe challenges remain in resource-constrained and highly dynamic edge environments. First, evolutionary algorithms, such as NSGA-II [1], rely on heavy iterative searches. Consequently, they incur excessively high online scheduling latency. Second, existing deep reinforcement learning-based algorithms (such as SSA-DAG) [9,10,11] often exhibit a “performance-heavy, energy-light” tendency, lacking proactive energy budget planning. More crucially, they typically treat edge weights as static features. This makes it difficult to capture the condition-triggered characteristics of communication overheads in heterogeneous clusters. As a result, they fail to accurately identify the true system bottlenecks. Finally, dedicated algorithms based on specific structures or mixed criticality (such as TOM and DPMC) [12,13,14] usually adopt pessimistic resource reservation or rigid slack allocation strategies. They ignore differences in the intrinsic computational workloads of individual tasks. Therefore, they are highly prone to over-scaling the frequency of heavy-workload tasks on non-critical paths, thereby creating new system bottlenecks.

To address these shortcomings, this paper proposes a lightweight and efficient collaborative scheduling framework: HERO (Hybrid Energy-aware Ranking and Optimization). The framework builds a closed-loop “perception–decision–compensation” optimization system that aims to overcome the limitations of traditional static heuristics and greedy learning strategies. The main contributions of this paper are summarized as follows:

Establishment of a Communication-Aware Sensitivity Quantification Model: We propose a perturbation-based mechanism to quantify the marginal effect of task execution fluctuations on the global makespan, accurately stripping away pseudo-critical paths.
Hybrid Budget Allocation: We design a multi-factor energy arbitration mechanism that balances critical path progression with the resource needs of heavy-load tasks on non-critical paths.
Time Fragment Recovery Mechanism: We introduce an aggressive hole-filling strategy to reclaim discrete idle time slots induced by heterogeneous communication overheads.
Performance Validation: Extensive experiments on a diverse testbed (Raspberry Pi 4, Jetson Orin Nano, and Xeon D) demonstrate that HERO reduced the completion time by an average of 10.89% under high-density topologies, and achieved up to 4.04% energy savings across varying task depths.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 formalizes the system and communication models, alongside the bi-objective optimization problem. Section 4 elaborates on the detailed design of the proposed HERO framework. Section 5 presents the experimental setup, comparative results, and performance evaluation. Finally, Section 6 concludes the paper.

2. Related Work

Directed Acyclic Graph (DAG) task scheduling and offloading in heterogeneous edge environments is a classical NP-hard problem. As summarized in Table 1, existing solutions can generally be divided into three categories: heuristic and evolutionary algorithms, deep learning-based intelligent scheduling, and structure-aware specific offloading strategies.

2.1. Heuristic and Evolutionary Algorithms

Due to the heterogeneity of edge environments, multi-objective evolutionary and heuristic algorithms are widely adopted to address the trade-offs between latency, energy consumption, and system reliability [1,5,7,8,12,15,16]. While these methods excel in finding Pareto-optimal solutions or ensuring fault tolerance, they face severe challenges in real-time edge collaboration scenarios. Algorithms represented by NSGA-II rely on heavy population iteration and mutation processes, resulting in scheduling delays that are too high to meet the millisecond-level response requirements of intelligent edge services. Furthermore, their energy optimization methods often employ rigid strategies, allocating energy budgets solely based on time margins while ignoring intrinsic computational workload differences, which easily creates new system bottlenecks.

2.2. Intelligent Scheduling Based on Deep Reinforcement Learning

Schedulers based on Deep Reinforcement Learning (DRL) have become a research hotspot due to their environmental adaptability. Recent advancements frequently combine DRL with Graph Neural Networks (GNNs) or Transformers to capture complex DAG dependencies and optimize offloading decisions in highly dynamic environments [3,9,10,11,17,18,19,20].

While DRL methods perform well in finding high-quality solutions, deploying them on resource-constrained edge nodes remains challenging. The high-dimensional state encoding required by complex GNNs or Transformers introduces unacceptable millisecond-level inference latency for lightweight edge tasks. Additionally, as detailed in Table 1, existing DRL schedulers predominantly prioritize performance over energy consumption, lacking an active energy budget mechanism and often neglecting the critical impact of heterogeneous communication overheads on feature extraction.

2.3. Structure-Aware and Specific Offloading Strategies

Structure-aware and specific offloading strategies simplify the scheduling problem by exploiting DAG structural characteristics or predefined rules, such as service caching, 1-Opt local search, or decentralized game theory [2,4,6,13,14,21,22,23].

Although effective in specific use cases, these strategies lack generality. Chain-based optimizations make strong assumptions about the DAG’s shape, making it difficult to handle highly irregular edge workflows. Moreover, unlike the aggressive hole-filling strategy proposed in our HERO framework, these structured methods often struggle to flexibly recycle discrete idle time slots caused by heterogeneous communication, resulting in limited overall resource utilization.

3. System Model and Problem Description

To facilitate a clear understanding of the mathematical models presented in this section, the primary notations used throughout this paper are summarized in the table in the Abbreviations section.

To provide a rigorous mathematical abstraction for the task scheduling problem in an edge collaborative environment, this section first defines an application model based on a DAG to depict the complex dependencies and heterogeneous computational workloads among tasks. Subsequently, the heterogeneous edge cluster architecture and communication model are constructed to quantify the heterogeneous transmission overhead incurred by cross-node collaboration. Furthermore, considering the resource-constrained nature of edge devices, a power and energy consumption model based on DVFS is introduced. Finally, building upon the aforementioned models, the joint optimization of latency and energy consumption is formulated as a constrained bi-objective optimization problem.

3.1. Application Model

We model the dependency-driven task flow—specifically the deep learning inference pipeline in an edge computing environment—as a DAG, as illustrated in Figure 1, and defined as

G = (V, E)

(1)

The vertex set $V = \{v_{1}, v_{2}, \dots, v_{n}\}$ represents the computational tasks, where each task $v_{i}$ has a computational load $w_{i}$. The set of directed edges $E$ denotes task dependencies. A directed edge $e_{i, j} = (v_{i}, v_{j}) \in E$ indicates that $v_{j}$ cannot start until $v_{i}$ completes, involving a data transmission volume $d_{i, j}$ over a network with bandwidth $B$.

To quantitatively characterize the divergent requirements for computational resources and communication bandwidth across various DAG applications, we define the Communication-to-Computation Ratio (CCR) as the ratio of the average communication cost $\bar{T}_{comm}$ to the average computational cost $\bar{T}_{comp}$ of the entire graph:

CCR = \frac{\frac{1}{| E |} \sum_{(v_{i}, v_{j}) \in E} \frac{d_{i, j}}{\bar{B}}}{\frac{1}{| V |} \sum_{v_{i} \in V} {\bar{t}}_{i}}

(2)

where the numerator represents the average data transfer time across all dependency edges given the average bandwidth $\bar{B}$ within the heterogeneous edge cluster; the denominator represents the average execution time of all tasks. Due to system heterogeneity and Dynamic Voltage and Frequency Scaling (DVFS) capabilities, the execution time of task $v_{i}$ is not constant. Therefore, we define $\bar{t}_{i}$ as the expected execution time of task $v_{i}$ across all available physical configurations in the cluster $P$:

{\bar{t}}_{i} = \frac{1}{\sum_{p_{k} \in P} | F_{k} |} \sum_{p_{k} \in P} \sum_{f \in F_{k}} \frac{w_{i}}{s_{k} \cdot f}

(3)

In this formula, $F_{k}$ is the set of available frequencies supported by server $p_{k}$, and $s_{k}$ is its architectural performance coefficient.
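Equations (2) and (3) can be checked numerically with a minimal sketch such as the one below. The `Server` class, the function names, and the toy numbers are illustrative assumptions, not part of the HERO implementation.

```python
from dataclasses import dataclass


@dataclass
class Server:
    speed: float   # architectural performance coefficient s_k
    freqs: list    # available frequency set F_k


def expected_exec_time(w, servers):
    """Eq. (3): mean of w / (s_k * f) over every (server, frequency) pair."""
    n_configs = sum(len(p.freqs) for p in servers)
    total = sum(w / (p.speed * f) for p in servers for f in p.freqs)
    return total / n_configs


def ccr(loads, edges, servers, avg_bandwidth):
    """Eq. (2): mean transfer time over edges / mean expected time over tasks.

    loads maps task -> w_i; edges maps (i, j) -> d_ij.
    """
    avg_comm = sum(d / avg_bandwidth for d in edges.values()) / len(edges)
    avg_comp = sum(expected_exec_time(w, servers) for w in loads.values()) / len(loads)
    return avg_comm / avg_comp


# Toy values chosen for easy arithmetic; they are not from the paper.
servers = [Server(speed=1.0, freqs=[1.0, 2.0]), Server(speed=2.0, freqs=[1.0])]
loads = {"v1": 4.0, "v2": 8.0}
edges = {("v1", "v2"): 6.0}
ratio = ccr(loads, edges, servers, avg_bandwidth=2.0)
```

With these toy values the three configurations give expected times of 8/3 and 16/3, so the average computation time is 4, the average transfer time is 3, and the CCR is 0.75.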

For any task $v_{i}$, we define $Pred(v_{i})$ as its set of direct predecessors and $Succ(v_{i})$ as its set of direct successors. To simplify the model, we assume $G$ has a unique entry task $v_{entry}$ and a unique exit task $v_{exit}$. For practical workflows with multiple entries or exits, this can be unified by adding virtual nodes with zero computational load and zero communication overhead.

3.2. Architecture and Communication Model

We model the edge computing cluster as a set of $M$ heterogeneous edge servers, denoted by $P = \{p_{1}, p_{2}, \dots, p_{M}\}$. Due to the diverse hardware architectures (CPU, GPU, or specialized accelerators) of the edge servers, significant disparities exist in their efficiency when processing the same task. In practical implementation, this cluster typically operates under a master–worker paradigm: one resource-sufficient node is designated as the master controller to maintain cluster state and handle scheduling logic, while the remaining heterogeneous devices act as worker nodes to execute dispatched tasks. For any task $v_{i} \in V$, if it is assigned to processor $p_{k} \in P$ and executed at a frequency $f_{i, k}$, its execution time $t_{i, k}$ is expressed as

t_{i, k} = \frac{w_{i}}{s_{k} \cdot f_{i, k}}

(4)

where $w_{i}$ represents the computational workload of the task, and $s_{k}$ denotes the architectural performance coefficient of server $p_{k}$, reflecting the processor’s Instructions Per Cycle (IPC) throughput per unit frequency.

Communication Cost Model: The communication overhead between tasks is determined by the data transmission volume and network bandwidth. Let $B$ denote the average transmission bandwidth within the edge cluster. For a dependency edge $e_{i, j}$, if task $v_{i}$ is assigned to server $p_{m}$ and task $v_{j}$ is assigned to server $p_{n}$, the communication time $c_{i, j}$ is defined as

c_{i, j} = \begin{cases} 0, & \text{if } m = n \\ \frac{d_{i, j}}{B}, & \text{if } m \neq n \end{cases}

(5)

When parent and child tasks are scheduled on the same processor, data is exchanged via shared memory; thus, the communication overhead is considered to be zero.
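Equations (4) and (5) translate directly into two small functions. The sketch below uses hypothetical function names and toy values; it only illustrates that the communication cost is triggered by placement rather than being a fixed edge weight.

```python
def exec_time(w, s_k, f):
    """Eq. (4): t_{i,k} = w_i / (s_k * f_{i,k})."""
    return w / (s_k * f)


def comm_time(d_ij, bandwidth, server_of_i, server_of_j):
    """Eq. (5): zero on the same server, d_ij / B across servers."""
    return 0.0 if server_of_i == server_of_j else d_ij / bandwidth


# Toy values, not taken from the paper.
t = exec_time(12.0, s_k=2.0, f=1.5)          # 12 / (2 * 1.5)
same = comm_time(10.0, 5.0, "p1", "p1")      # co-located: no transfer
cross = comm_time(10.0, 5.0, "p1", "p2")     # cross-node: 10 / 5
```

The same edge therefore costs 0 or 2 time units depending only on where the two tasks are placed.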

3.3. Power and Energy Consumption Model

To support energy-efficiency optimization, we assume that each edge server supports Dynamic Voltage and Frequency Scaling (DVFS) technology.

DVFS Frequency Set: For each server $p_{k}$, the processor supports a set of discrete voltage–frequency pairs. The available frequency set for $p_{k}$ is defined as $F_{k} = \{f_{k, 1}, f_{k, 2}, \dots, f_{k, \max}\}$, where $f_{k, \max}$ is the maximum clock frequency of the processor.

Following widely adopted dynamic voltage and frequency scaling (DVFS) power models, the instantaneous power $P_{k}(f)$ of edge server $p_{k}$ operating at frequency $f$ is modeled according to the well-known Cubic Law:

P_{k} (f) = P_{k}^{stat} + ξ_{k} \cdot f^{3}

(6)

where $P_{k}^{stat}$ is the static baseline power, and $ξ_{k}$ is a hardware-specific constant reflecting the processor’s capacitive characteristics.

The total system energy consumption $E_{total}$ is the sum of the energy consumed during task execution and the energy consumed during idle periods. For task $v_{i}$ running on server $p_{k}$ at frequency $f_{i, k}$, its execution energy is defined as

E_{i}^{exec} = P_{k} (f_{i, k}) \times t_{i, k} = (P_{k}^{stat} + ξ_{k} \cdot f_{i, k}^{3}) \times \frac{w_{i}}{s_{k} \cdot f_{i, k}}

(7)
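A short derivation that is not stated in the source but follows from Equation (7) shows why the lowest frequency is not necessarily the most energy-efficient one. It assumes, for illustration only, that the frequency $f$ can be treated as a continuous variable:

```latex
E_{i}^{exec}(f) = \frac{w_{i}}{s_{k}} \left( \frac{P_{k}^{stat}}{f} + \xi_{k} f^{2} \right),
\qquad
\frac{d E_{i}^{exec}}{d f} = \frac{w_{i}}{s_{k}} \left( -\frac{P_{k}^{stat}}{f^{2}} + 2 \xi_{k} f \right) = 0
\;\Longrightarrow\;
f^{\ast} = \left( \frac{P_{k}^{stat}}{2 \xi_{k}} \right)^{1/3}
```

Because the second derivative $\frac{w_{i}}{s_{k}} ( 2 P_{k}^{stat} / f^{3} + 2 \xi_{k} )$ is positive, $f^{\ast}$ is a minimum. Below $f^{\ast}$ the static term dominates, so slowing down further increases the execution energy; with a discrete set $F_{k}$, the best choice would be one of the available frequencies adjacent to $f^{\ast}$.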

The objective function for the total energy consumption of the entire edge cluster during the scheduling cycle is

Minimize : E_{total} = \sum_{v_{i} \in V} E_{i}^{exec} + \sum_{p_{k} \in P} P_{k}^{stat} \times T_{idle, k}

(8)

where $T_{idle, k}$ is the idle wait time of server $p_{k}$.
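The power and energy model of Equations (6) to (8) can be sketched as follows. Function names and numeric values are assumptions made for illustration.

```python
def power(p_stat, xi, f):
    """Eq. (6): static power plus cubic dynamic power."""
    return p_stat + xi * f ** 3


def exec_energy(w, s_k, p_stat, xi, f):
    """Eq. (7): power at frequency f times execution time w / (s_k * f)."""
    return power(p_stat, xi, f) * w / (s_k * f)


def total_energy(exec_energies, idle):
    """Eq. (8): execution energy plus static energy drawn while idle.

    idle is a list of (p_stat, idle_time) pairs, one per server.
    """
    return sum(exec_energies) + sum(p * t for p, t in idle)


# Toy values, not taken from the paper: the same task at two frequencies.
e_fast = exec_energy(w=8.0, s_k=2.0, p_stat=1.0, xi=0.5, f=2.0)
e_slow = exec_energy(w=8.0, s_k=2.0, p_stat=1.0, xi=0.5, f=1.0)
e_total = total_energy([e_fast, e_slow], idle=[(1.0, 3.0)])
```

In this toy case halving the frequency doubles the execution time (2 to 4 time units) but lowers the execution energy from 10 to 6, which is the slack-for-energy trade that DVFS exploits.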

3.4. Formal Problem Description

The scheduling of DAG tasks in a heterogeneous edge environment is formulated as finding an optimal mapping of tasks to processors, determining their execution sequence, and allocating operating frequencies. For any task $v_{i}$ assigned to processor $p_{k}$, its timing constraints are governed by the completion status of its predecessors and the resource availability of the assigned processor.

Following standard task scheduling semantics, the Data Ready Time (DRT), Earliest Start Time (EST), and Earliest Finish Time (EFT) for a task $v_{i}$ on processor $p_{k}$ are calculated recursively:

DRT (v_{i}, p_{k}) = \max_{v_{j} \in Pred (v_{i})} \{AFT (v_{j}) + c_{j, i}\}

(9)

EST (v_{i}, p_{k}) = \max (Avail (p_{k}), DRT (v_{i}, p_{k}))

(10)

EFT (v_{i}, p_{k}) = EST (v_{i}, p_{k}) + t_{i, k}

(11)

where $AFT(v_{j})$ denotes the actual finish time of predecessor $v_{j}$, and $Avail(p_{k})$ is the earliest time at which $p_{k}$ becomes ready to execute a new task.
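The recursion in Equations (9) to (11) can be illustrated with a small sketch. The data layout (plain dictionaries), the function names, and the toy state are assumptions; the communication term is charged only for predecessors placed on a different processor, as in Equation (5).

```python
def data_ready_time(task, proc, preds, aft, placement, comm):
    """Eq. (9): latest arrival of predecessor output on processor `proc`.

    comm[(j, task)] is charged only if predecessor j ran on another
    processor, following Eq. (5). An entry task has DRT 0.
    """
    arrivals = [
        aft[j] + (0.0 if placement[j] == proc else comm[(j, task)])
        for j in preds.get(task, [])
    ]
    return max(arrivals, default=0.0)


def est_eft(task, proc, exec_time, preds, aft, placement, comm, avail):
    """Eqs. (10)-(11): earliest start and finish of `task` on `proc`."""
    est = max(avail[proc], data_ready_time(task, proc, preds, aft, placement, comm))
    return est, est + exec_time


# Toy state: v1 finished on p1 at t=4, v2 finished on p2 at t=5.
preds = {"v3": ["v1", "v2"]}
aft = {"v1": 4.0, "v2": 5.0}
placement = {"v1": "p1", "v2": "p2"}
comm = {("v1", "v3"): 3.0, ("v2", "v3"): 1.0}
avail = {"p1": 4.0, "p2": 5.0}

on_p1 = est_eft("v3", "p1", 2.0, preds, aft, placement, comm, avail)
on_p2 = est_eft("v3", "p2", 0.5, preds, aft, placement, comm, avail)
```

Here `on_p1` is (6.0, 8.0) and `on_p2` is (7.0, 7.5): the start time on each processor is set by whichever predecessor has to ship its data across the network.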

The primary objective of this study is to address a bi-objective optimization problem: minimizing the application makespan while simultaneously reducing total system energy consumption.

The makespan minimization aims to minimize the completion time of the entire application, which is determined by the actual finish time of the exit task $v_{exit}$:

\min T_{makespan} = AFT (v_{exit})

(12)

The total energy minimization aims to minimize the total energy consumption, which includes the dynamic execution energy of all tasks and the static energy consumed during idle periods:

\min E_{total} = \sum_{v_{i} \in V} E_{i}^{exec} + \sum_{p_{k} \in P} P_{k}^{stat} \times T_{idle, k}

(13)

Given a DAG $G$ and a heterogeneous cluster $P$, the goal is to identify an optimal scheduling strategy $S = (M, O, F)$, where $M$ is the task-to-processor mapping, $O$ is the execution order, and $F$ is the frequency allocation, that minimizes the objective vector:

\min_{S} J (S) = (T_{makespan} (S), E_{total} (S))

(14)
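Because Equation (14) minimizes a vector rather than a scalar, two schedules are not always directly comparable. One common way to compare such objective vectors is Pareto dominance; the sketch below illustrates that general notion and is not claimed to be the comparison rule used inside HERO. The candidate values are invented.

```python
def dominates(a, b):
    """True if a = (makespan, energy) is no worse than b in both
    objectives and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the schedules that no other schedule dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented (makespan, energy) pairs for four candidate schedules.
candidates = [(10.0, 50.0), (12.0, 40.0), (11.0, 55.0), (12.0, 45.0)]
front = pareto_front(candidates)
```

Only (10.0, 50.0) and (12.0, 40.0) survive: each of the other two is beaten in both objectives by one of them.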

4. HERO Framework Design

This section elaborates on the design of the HERO framework. The framework comprises two core phases: communication-aware priority learning and bottleneck-aware resource allocation.

4.1. Framework Overview

As shown in Figure 2 and Algorithm 1, HERO establishes a closed-loop identify–utilize–reclaim cascade process. In the identification phase, an MLP predictor is used to unravel complex dependencies and isolate exploitable time redundancy. In the utilization phase, we exploit this time redundancy through a hybrid budgeting mechanism. This mechanism converts non-critical time slack into energy efficiency via DVFS. This conversion leads to scheduling fragmentation. Finally, the reclaim phase acts as a compensator by employing a hole-filling strategy. This strategy recovers these fragmented gaps for small tasks, thereby maximizing resource density through compute–communication overlap.

From an implementation perspective, to ensure strict online real-time performance, the deep-enhanced MLP predictor is trained offline. During the online phase, the master controller only performs an $O(1)$ lightweight forward inference, strictly bounding the decision-making overhead to the microsecond level. The master then dispatches the parsed subtasks and specific DVFS frequency commands to designated worker nodes via lightweight remote procedure calls (e.g., gRPC), fully bridging the theoretical algorithm with practical edge orchestration.

Algorithm 1: HERO: Hybrid Energy-aware Ranking and Optimization

4.2. Task Ranking Based on Enhanced MLP

To extract nonlinear scheduling features from high-dimensional heterogeneous DAGs with low inference latency, HERO employs a Deep Enhanced Multilayer Perceptron (MLP).

4.2.1. DAG Feature Extraction

For each task $v_{i}$ in the DAG, we extract an 11-dimensional feature vector $x_{i}$, as detailed in Table 2. By encoding macro-level path criticality (e.g., $Rank_{up}$, $T_{path}$) and micro-level topological attributes (e.g., $d_{in}$, $Level_{topo}$) into a unified linear vector space, this feature set effectively captures the topological integrity of the DAG.
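Three of the named features can be computed with a short sketch. Table 2 is not reproduced here, so the definitions are assumptions: `rank_up` follows the standard HEFT upward rank, `d_in` is the in-degree, and `level_topo` is taken to be the longest hop count from an entry task. The remaining features of the 11-dimensional vector are not covered.

```python
def topo_features(succ, avg_time, avg_comm):
    """Return rank_up, d_in and level_topo for every task.

    succ maps task -> list of successors; avg_time maps task -> mean
    execution time; avg_comm maps (i, j) -> mean communication time.
    """
    preds = {v: [] for v in succ}
    for u, children in succ.items():
        for v in children:
            preds[v].append(u)

    rank_cache, level_cache = {}, {}

    def rank_up(v):
        # Standard HEFT upward rank: own cost plus the costliest path to the exit.
        if v not in rank_cache:
            rank_cache[v] = avg_time[v] + max(
                (avg_comm[(v, s)] + rank_up(s) for s in succ[v]), default=0.0
            )
        return rank_cache[v]

    def level(v):
        # Assumed definition: longest hop count from an entry task (entry = 0).
        if v not in level_cache:
            level_cache[v] = max((level(p) + 1 for p in preds[v]), default=0)
        return level_cache[v]

    return {
        v: {"rank_up": rank_up(v), "d_in": len(preds[v]), "level_topo": level(v)}
        for v in succ
    }


# Toy diamond DAG a -> {b, c} -> d with invented costs.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
avg_time = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}
avg_comm = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "d"): 2.0, ("c", "d"): 1.0}
features = topo_features(succ, avg_time, avg_comm)
```

For the diamond, the upward ranks are 10, 7, 4 and 2 for a, b, c and d, so the branch through b is the more critical one.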

4.2.2. Perturbation-Based Sensitivity Generation

In learning-based DAG scheduling research, obtaining high-quality supervision signals is a key bottleneck for model performance. Existing imitation learning methods typically use static priority sequences generated by heuristic algorithms (such as HEFT) as training labels. This approach caps the model’s performance at that of the heuristic that produced the labels. To overcome this limitation, HERO proposes a perturbation-based sensitivity analysis mechanism. The model learns the global time sensitivity of each task, that is, the extent to which local execution fluctuations of that task degrade the completion time of the entire system.

We define a sensitivity label $y_{i}$ for task $v_{i}$ as the marginal effect of load variation on global completion time. As shown in Figure 3, for each instance $G$ in the training set, the label generation process includes the following three standardized steps:

First, DAG $G$ is scheduled under the target heterogeneous cluster configuration using a standard list scheduling algorithm (HEFT in this paper), yielding a baseline scheduling scheme $S_{base}$ and its corresponding baseline completion time $T_{base}$:

T_{base} = Schedule (G, HEFT)

(15)

Then, for each task $v_{i} \in V$ in the DAG, a perturbed instance $G_{i}^{'}$ is constructed. In $G_{i}^{'}$, we increase the computational load $w_{i}$ of task $v_{i}$ by a significant perturbation $δ$ (in the experimental setup, $δ = w_{i}$, i.e., simulating a doubling of the load):

w_{i}^{'} = w_{i} + δ, w_{j}^{'} = w_{j} (\forall j \neq i)

(16)

Subsequently, $G_{i}^{'}$ is re-evaluated using the same scheduling algorithm to obtain the perturbed completion time $T_{i}^{'}$.

Finally, the absolute impact of the delay of task $v_{i}$ on the system is $Δ T_{i} = T_{i}^{'} - T_{base}$. To eliminate the dimensional differences caused by different DAG sizes, we define the final training label $y_{i}$ as the normalized marginal deterioration rate:

y_{i} = \frac{T_{i}^{'} - T_{base}}{T_{base}}

(17)

If $y_{i} > 0$, the task is in some form of bottleneck state, and the larger the value, the greater its impact on the system completion time. If $y_{i} \approx 0$, the perturbation of the task is absorbed by the system’s parallel gaps or communication latency.
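The three labeling steps of Equations (15) to (17) can be sketched as below. To keep the example self-contained, the scheduler is replaced by a deliberately simple stand-in: the longest path through the DAG, which is the makespan with unlimited identical processors and no communication cost. This stand-in is an assumption and is not HEFT; any function with the same signature could be passed as `schedule`.

```python
def makespan(succ, load):
    """Stand-in evaluator (NOT HEFT): longest path through the DAG."""
    cache = {}

    def longest_from(v):
        if v not in cache:
            cache[v] = load[v] + max((longest_from(s) for s in succ[v]), default=0.0)
        return cache[v]

    return max(longest_from(v) for v in succ)


def sensitivity_labels(succ, load, schedule=makespan):
    """Perturb one task at a time and record the relative makespan increase."""
    t_base = schedule(succ, load)                 # Eq. (15)
    labels = {}
    for v in succ:
        perturbed = dict(load)
        perturbed[v] = load[v] + load[v]          # Eq. (16) with delta = w_i
        labels[v] = (schedule(succ, perturbed) - t_base) / t_base   # Eq. (17)
    return labels


# Toy diamond DAG with invented loads; a -> b -> d is the critical path.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
load = {"a": 2.0, "b": 6.0, "c": 1.0, "d": 2.0}
labels = sensitivity_labels(succ, load)
```

With a baseline of 10, doubling b raises the makespan to 16 (label 0.6), doubling a or d raises it to 12 (label 0.2), and doubling c changes nothing (label 0) because its slack absorbs the perturbation.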

Through this mechanism, HERO’s MLP predictor learns the global time sensitivity of each task, enabling HERO to dynamically identify hidden bottlenecks masked by traditional static CCR metrics during the inference phase.

4.2.3. Model Design and Optimization Strategies

To address the challenges of high-dimensional non-linearity and sparsity inherent in DAG task scheduling features, we designed a deep funnel-like feature extraction network coupled with a robust training mechanism:

Funnel-like Network Architecture: To capture the implicit coupling among topological features (e.g., $Rank_{up}$ and communication overhead), we construct a five-layer descending network. Specifically, a funnel-shaped information bottleneck structure is utilized. Combined with batch normalization and ReLU activation functions in the first three layers, it can effectively filter out noise and distill high-level abstract features. A decaying Dropout strategy ($0.15 \to 0.1 \to 0.05$) is adopted to prevent structural overfitting while maintaining the stability of deep semantic representations.

Dynamic Optimization and Training Strategy: The model uses the Mean Squared Error (MSE) loss function to heavily penalize prediction deviations, forcing it to quickly lock onto key path nodes. The AdamW optimizer is selected to enhance generalization ability. We also integrate a ReduceLROnPlateau learning rate scheduler (halving the learning rate if validation loss stagnates for five consecutive epochs) and an early stopping mechanism (training stops if there is no improvement for 15 consecutive epochs).
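The two training controls described above (halving the learning rate after five stagnant epochs and stopping after fifteen) can be expressed in a few lines of plain Python. The class name and the loss sequence are hypothetical, and this simplified reading does not reproduce the exact counting rules of `torch.optim.lr_scheduler.ReduceLROnPlateau`; the network, the MSE loss, and the AdamW optimizer are outside the sketch.

```python
class PlateauController:
    """Halve the learning rate every `lr_patience` consecutive epochs without
    improvement and signal a stop after `stop_patience` such epochs."""

    def __init__(self, lr, lr_patience=5, stop_patience=15, factor=0.5):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs % self.lr_patience == 0:
            self.lr *= self.factor
        return self.bad_epochs >= self.stop_patience


# Invented validation losses: two improving epochs, then a long plateau.
ctrl = PlateauController(lr=1e-3)
losses = [1.0, 0.5] + [0.6] * 20
stopped_at = None
for epoch, loss in enumerate(losses):
    if ctrl.step(loss):
        stopped_at = epoch
        break
```

On this sequence the learning rate is halved three times (after 5, 10 and 15 stagnant epochs) and training stops at epoch index 16.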

4.2.4. Model Performance Analysis

To ensure the robustness of our model, we constructed a comprehensive dataset containing 100,000 DAG task samples. This dataset is randomly divided into a training set, validation set, and test set in a 90:5:5 ratio. As shown in Figure 4a, the training loss and validation loss converge rapidly and stabilize after approximately 30 epochs. The slight gap between the two curves indicates that no significant overfitting has occurred, demonstrating the effectiveness of batch normalization and the Dropout mechanism. Figure 4b shows the comparison between actual sensitivity and predicted values on the test set. The randomly selected 5000 data points are evenly distributed around the $y = x$ diagonal, indicating that the MLP can accurately predict the potential impact of tasks in heterogeneous environments.

We selected four classic regression models (linear regression, decision tree, random forest, XGBoost) to verify the necessity of deep learning, and conducted an ablation study (2-layer, 5-layer, and 12-layer MLPs) to justify the architectural depth of HERO-MLP. All models were trained on the same 11-dimensional features and evaluated using MSE, MAE, $R^{2}$, inference latency, and parameter count.

As shown in Table 3 and Table 4, classic models fail to capture the complex non-linear relationships, with the best ensemble method (XGBoost) only reaching an $R^{2}$ of 0.5597. In the neural network ablation study, the five-layer HERO-MLP achieves the highest prediction accuracy ($R^{2} = 0.707295$, MSE = 0.005880). Compared to a shallow two-layer MLP (limited feature extraction, $R^{2} = 0.690191$) and a 12-layer MLP (where the added complexity yields diminishing returns, $R^{2} = 0.696074$), the five-layer HERO-MLP strikes the best balance among the tested depths. It delivers superior accuracy while maintaining an ultra-low inference latency (199.09 μs) and a compact parameter size (50,049), which is crucial for real-time edge scheduling.

4.2.5. Analysis of Key Node Identification Capability

To investigate the actual performance of the MLP in scheduling decisions, we designed a stress test to observe its ability to identify key nodes. Traditional heuristic list scheduling algorithms (such as the $Rank_{u}$ strategy in HEFT) primarily rely on static Communication-to-Computation Ratio (CCR) to construct task priorities. To verify whether the MLP predictor in the HERO framework truly learns a global bottleneck awareness capability beyond simple data memorization, we conducted a targeted evaluation.

The experiment was deployed in a heterogeneous computing environment containing two types of computing nodes: high-performance cores ($p_{0}$) with processing speed $s_{k} = 6.0$, simulating the master computing node in an edge cluster, and medium-performance cores ($p_{1}$) with processing speed $s_{k} = 2.0$, simulating auxiliary computing nodes or low-power cores.

To simulate stress load scenarios, we constructed a set of synthetic DAG datasets with bimodal distribution characteristics. This dataset contains two types of tasks with opposing properties:

Computationally intensive isolated tasks (Type-C): High computational load with low communication overhead, representing potential “computational bottlenecks.”
Communication-intensive coupled tasks (Type-D): Low computational load but requiring significant data transfer, representing “communication bottlenecks.”

Experimental results reveal the fundamental difference in scheduling decision logic between the benchmark algorithm ($Rank_{u}$) and the MLP, as shown in Figure 5. Faced with the substantial communication overhead generated by Type-D tasks, the $Rank_{u}$ algorithm schedules all such tasks to the high-performance core $p_{0}$ to eliminate cross-node data transmission latency. This locally greedy strategy leads to scarce $p_{0}$ resources being occupied by a large number of low-computation-value tasks. Consequently, when Type-C tasks, which truly determine the global completion time, arrive, they are relegated to the low-speed core $p_{1}$, ultimately deteriorating the total makespan.

In contrast, the MLP predictor, by learning global time sensitivity, successfully identifies that although Type-C tasks lack explicit communication constraints, their heavy computational burden constitutes the global critical path. Therefore, HERO assigns Type-C tasks a higher scheduling priority, allowing them to preempt the computational resources of $p_{0}$.

As shown in Figure 6, across test cases with different graph depths and parallelism, HERO’s normalized makespan consistently outperforms the benchmark. Particularly in deep graph structures (Depth $> 8$), HERO’s average normalized makespan is 0.8468, achieving a performance improvement of approximately 15.3%; in high-concurrency scenarios (Layers $> 8$), the average normalized makespan is 0.8479, an improvement of 15.2%. These results demonstrate that HERO possesses global bottleneck identification capabilities that surpass local greedy strategies, validating the model’s generalization effectiveness in extremely heterogeneous environments.

4.2.6. Model Interpretability and Microbehavior Analysis

Based on a permutation feature importance analysis, we verified the effectiveness of the HERO feature set and the nonlinear learning capability of deep MLPs (as shown in Figure 7). Experimental results show that computational load ($w_{i}$) and communication overhead ($Comm_{\max}$) make the dominant contributions to prediction accuracy, confirming the importance of communication overhead in cross-node data transmission in heterogeneous edge environments. Meanwhile, the model places high weights on the number of paths ($N_{path}$) and the exit distance ($Rank_{up}$), indicating that it has successfully learned to prioritize the scheduling of topological intersections and key nodes in long chains.
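Permutation feature importance measures how much the prediction error grows when one feature column is shuffled across samples. The sketch below shows the idea in plain Python with a toy model and synthetic data; the function names, the seed, and the two-feature dataset are assumptions and have nothing to do with the 11 HERO features or the trained MLP.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of `model` on the given samples."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, seed=0):
    """Increase in MSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    base = mse(model, rows, targets)
    scores = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        scores.append(mse(model, permuted, targets) - base)
    return scores


# Toy model that depends only on feature 0; feature 1 is pure noise to it.
def model(row):
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [model(r) for r in rows]
scores = permutation_importance(model, rows, targets)
```

Shuffling feature 0 breaks the predictions and yields a positive score, while shuffling feature 1 leaves the error unchanged, which is the pattern such an analysis uses to rank features.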

4.3. Hybrid Budget Allocation

After determining the task scheduling order, HERO introduces a dynamic hybrid budget mechanism to address the trade-off between energy consumption and performance. Unlike traditional methods that allocate resources solely based on static load, this mechanism combines topology importance and computational volume for global budget planning and introduces runtime energy recovery strategies.

4.3.1. Task Energy Consumption Boundary and Global Energy Margin Definition

Before allocating resources, it is necessary to first define the energy consumption boundaries of each task and the entire task flow within the current heterogeneous edge cluster.

For any task $v_{i} \in V$, considering that it can be executed on any processor $p_{k} \in P$ at any frequency $f_{k, j} \in F_{k}$, we can pre-calculate the upper and lower bounds of the task’s energy consumption.

Minimum execution energy consumption: $E_{\min}^{(i)}$ represents the lowest energy consumption value that task $v_{i}$ can achieve among all possible combinations of processors and frequencies:

E_{\min}^{(i)} = \min_{p_{k} \in P, f_{k, j} \in F_{k}} \{P_{k} (f_{k, j}) \times \frac{w_{i}}{s_{k} \cdot f_{k, j}}\}

(18)

Maximum execution energy consumption: $E_{\max}^{(i)}$ represents the highest energy consumption that task $v_{i}$ may generate at the highest performance configuration:

E_{\max}^{(i)} = \max_{p_{k} \in P, f_{k, j} \in F_{k}} \{P_{k} (f_{k, j}) \times \frac{w_{i}}{s_{k} \cdot f_{k, j}}\}

(19)

System minimum energy consumption: $E_{sys}^{\min}$ is the minimum energy required to complete the entire application, which is the total energy required when all tasks are executed at their respective most energy-efficient configurations:

E_{sys}^{\min} = \sum_{v_{i} \in V} E_{\min}^{(i)}

(20)

System peak energy consumption: $E_{sys}^{\max}$ is the upper limit of energy consumption when pursuing maximum performance:

E_{sys}^{\max} = \sum_{v_{i} \in V} E_{\max}^{(i)}

(21)

Energy Constraints and Global Margin: To balance performance and energy consumption, the system sets a total energy consumption constraint $E_{constraint}$. This constraint must lie within the system’s physically feasible region, which corresponds to a coefficient $γ \in [0, 1]$ in

E_{constraint} = γ E_{sys}^{\max} + (1 - γ) E_{sys}^{\min}

(22)

Under this constraint, we define the global energy reserve $E_{slack}^{global}$ as the additional energy pool that the system can use for higher performance beyond meeting the minimum operating requirements:

E_{slack}^{global} = E_{constraint} - E_{sys}^{\min}

(23)
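Equations (18)–(23) can be traced numerically. The following is a minimal sketch under stated assumptions: the two processor profiles and the workloads are made-up values (not the Table 5 profiles), and the helper names `exec_energy`, `task_bounds`, and `energy_margin` are hypothetical and do not come from the HERO implementation.

```python
# Hypothetical cluster: speed coefficient s_k and a {frequency: power} map per processor.
# These numbers are illustrative and are not the Table 5 profiles.
PROCESSORS = {
    "p0": {"s": 1.0, "power": {0.5: 2.0, 1.0: 6.0}},
    "p1": {"s": 2.0, "power": {0.5: 5.0, 1.0: 16.0}},
}


def exec_energy(w, s, f, p):
    """Energy of one configuration: P_k(f) * w / (s_k * f)."""
    return p * w / (s * f)


def task_bounds(w, processors=PROCESSORS):
    """Eqs. (18)-(19): (E_min, E_max) over all processor/frequency pairs."""
    values = [
        exec_energy(w, proc["s"], f, p)
        for proc in processors.values()
        for f, p in proc["power"].items()
    ]
    return min(values), max(values)


def energy_margin(workloads, gamma, processors=PROCESSORS):
    """Eqs. (20)-(23): system bounds, energy constraint, and global slack."""
    bounds = [task_bounds(w, processors) for w in workloads]
    e_sys_min = sum(lo for lo, _ in bounds)
    e_sys_max = sum(hi for _, hi in bounds)
    e_constraint = gamma * e_sys_max + (1.0 - gamma) * e_sys_min
    return e_sys_min, e_sys_max, e_constraint, e_constraint - e_sys_min


print(energy_margin([10.0, 20.0], gamma=0.5))  # (120.0, 240.0, 180.0, 60.0)
```

Substituting Equation (22) into Equation (23) gives E_{slack}^{global} = γ (E_{sys}^{\max} - E_{sys}^{\min}), so γ directly sets the fraction of the feasible energy range that is released as margin.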

4.3.2. Two-Factor Mixed Weight Definition

To allocate the energy margin in a more principled way, we do not rely solely on computational cost; instead, we define a hybrid importance score

S_{i}

. This score combines the task’s topological criticality and relative workload:

S_{i} = θ \cdot \frac{Rank (v_{i})}{Rank_{\max}} + (1 - θ) \cdot \frac{w_{i}}{W_{\max}}

(24)

Topological criticality (

Rank (v_{i})

): Reflects the task’s position on the critical path of the DAG (based on

Rank_{up} + Rank_{down}

). Relative workload (

w_{i}

): Reflects the size of the task itself.

θ

is a balancing factor used to adjust the weights of the two.

4.3.3. Initial Budget Allocation

Based on the mixed score, the global margin

E_{slack}^{global}

is proportionally allocated to each task to form the initial budget

E_{budget}^{(i)}

:

E_{budget}^{(i)} = E_{\min}^{(i)} + \frac{S_{i}}{\sum_{v_{j} \in V} S_{j}} \cdot E_{slack}^{global}

(25)
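Equations (24) and (25) can be illustrated with a short sketch. It assumes that Rank_{\max} and W_{\max} are the maxima over all tasks in the DAG, which the text does not state explicitly; the function names and input values are hypothetical.

```python
def hybrid_scores(ranks, workloads, theta):
    """Eq. (24): S_i = theta * Rank_i / Rank_max + (1 - theta) * w_i / W_max.

    Assumption: Rank_max and W_max are the maxima over all tasks.
    """
    rank_max = max(ranks)
    w_max = max(workloads)
    return [
        theta * r / rank_max + (1.0 - theta) * w / w_max
        for r, w in zip(ranks, workloads)
    ]


def initial_budgets(e_min, scores, e_slack_global):
    """Eq. (25): each task gets its minimum energy plus a score-proportional share."""
    total = sum(scores)
    return [e + s / total * e_slack_global for e, s in zip(e_min, scores)]


# Two hypothetical tasks: the first is more critical, the second is heavier.
scores = hybrid_scores(ranks=[10.0, 5.0], workloads=[20.0, 40.0], theta=1.0)
budgets = initial_budgets(e_min=[40.0, 80.0], scores=scores, e_slack_global=60.0)
print(scores, budgets)
```

Because the shares sum to one, the budgets always add up to E_{sys}^{\min} + E_{slack}^{global}, that is, to E_{constraint}. The initial allocation therefore never exceeds the global energy constraint, whatever the value of θ.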

During online scheduling, HERO employs a dynamic budget reclamation strategy. The accumulated surplus

E_{saved}^{(t)}

is defined as the sum of unused budgets from previously scheduled tasks. The actual available dynamic upper limit

E_{limit}^{(i)}

for the current task

v_{i}

is

E_{limit}^{(i)} = E_{budget}^{(i)} + E_{saved}^{(t)}

(26)

The scheduler selects the highest frequency,

f_{i, k}^{*}

, on the chosen processor

p_{k}

, whose energy consumption stays within the dynamic upper limit:

f_{i, k}^{*} = \max \{f \in F_{k} \mid E_{i, k} (f) \leq E_{limit}^{(i)}\}

(27)

After the task is executed, the accumulated surplus is updated:

E_{saved}^{(t + 1)} \leftarrow E_{saved}^{(t)} + (E_{budget}^{(i)} - E_{actual}^{(i)})

(28)
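The reclamation loop in Equations (26)–(28) can be sketched as follows. This is an illustration, not the authors' code: each task is reduced to a budget plus a hypothetical `{frequency: energy}` map for its already chosen processor, and the fallback used when no frequency fits the limit (taking the lowest-energy frequency) is an assumption, since the text does not describe that case.

```python
def pick_frequency(freq_energy, e_limit):
    """Eq. (27): highest frequency whose energy does not exceed e_limit."""
    feasible = [f for f, e in freq_energy.items() if e <= e_limit]
    return max(feasible) if feasible else None


def schedule_with_reclamation(tasks):
    """tasks: list of (budget, {frequency: energy}) in scheduling order."""
    saved = 0.0
    chosen = []
    for budget, freq_energy in tasks:
        limit = budget + saved                      # Eq. (26)
        f = pick_frequency(freq_energy, limit)
        if f is None:
            # Assumed fallback (not specified in the text): cheapest frequency.
            f = min(freq_energy, key=freq_energy.get)
        actual = freq_energy[f]
        saved += budget - actual                    # Eq. (28)
        chosen.append((f, actual))
    return chosen, saved


tasks = [
    (70.0, {0.5: 40.0, 1.0: 60.0}),   # leaves 10 units unused
    (50.0, {0.5: 50.0, 1.0: 58.0}),   # needs 58 to run at full frequency
]
print(schedule_with_reclamation(tasks))  # ([(1.0, 60.0), (1.0, 58.0)], 2.0)
```

In this example the second task could only run at frequency 0.5 on its own budget of 50; the 10 units left over by the first task raise its limit to 60, so it can run at full frequency.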

4.3.4. Effectiveness of the Hybrid Budgeting Mechanism

To assess the effectiveness of HERO’s two-factor hybrid weighting mechanism, we designed a set of controlled-variable experiments. The experiments test one core claim: under identical energy consumption constraints, HERO’s hybrid strategy outperforms single-dimensional allocation strategies.

We configured HERO’s budget allocation module in three variants and compared their completion times (makespan) under the same total energy consumption constraint (

E_{constraint}

with

γ = 0.9

):

Rank-Only: Allocation of budget based solely on the task’s position on the critical path, with a balance factor $θ = 1.0$ .
Workload-Only: Allocation of budget based solely on the task’s base energy consumption (i.e., computational load), with $θ = 0.0$ .
Hybrid (Ours): HERO’s default configuration ( $θ = 0.85$ ), considering both topological criticality and relative workload.

Figure 8 compares the three strategies across 10 groups of DAG instances. The data show that Rank-Only tends to starve heavily loaded tasks on non-critical paths: excessive frequency reduction on those tasks creates new cascading blockages. Workload-Only ignores global dependencies and spends budget on non-critical tasks with high slack, leaving critical nodes insufficiently accelerated. In contrast, HERO’s hybrid mechanism balances topological criticality and computational volume. Under the same energy consumption, HERO improves completion time by 8% and 2%, respectively, over the Rank-Only and Workload-Only strategies, which indicates that jointly considering topology and load is key to energy-efficient scheduling in heterogeneous environments.

4.4. Processor Selection Based on Hole-Filling Strategy

4.4.1. Hole-Filling Strategy

To overcome the resource fragmentation problem caused by heterogeneous communication latency, HERO introduces a hole-filling strategy to actively reclaim idle time slices on the processors.

Define the m-th idle time slice as the interval

{Slot}_{m} = [t_{end}^{(m)}, t_{start}^{(m + 1)}]

. For task

v_{i}

to be safely inserted into

{Slot}_{m}

, the following timing constraints must be met:

\min (t_{start}^{(m + 1)}, {Deadline}_{i}) - \max (t_{end}^{(m)}, {DRT}_{i, k}) \geq \frac{w_{i}}{s_{k} \cdot f_{i, k}^{*}}

(29)

Based on the hole-filling strategy, the earliest start time (EST) of the task is the earliest time among all feasible slots:

{EST}_{i, k} = \min_{m} \{\max (t_{end}^{(m)}, {DRT}_{i, k}) \mid {Slot}_{m} \text{ is feasible}\}

(30)
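The slot test in Equation (29) and the search in Equation (30) can be sketched for a single processor. The sketch makes two assumptions that the text leaves open: the idle interval before the first scheduled task (starting at time 0) and the open-ended interval after the last task both count as slots. The function name and the interval representation are hypothetical.

```python
import math


def earliest_start(busy, drt, duration, deadline=math.inf):
    """Return EST on one processor, or None if no slot is feasible.

    busy: sorted, non-overlapping (start, end) intervals already scheduled.
    drt: data ready time of the task on this processor.
    duration: execution time w_i / (s_k * f*).
    """
    slots = []
    prev_end = 0.0
    for start, end in busy:
        slots.append((prev_end, start))
        prev_end = end
    slots.append((prev_end, math.inf))  # assumed open slot after the last task

    for slot_start, slot_end in slots:
        begin = max(slot_start, drt)
        if min(slot_end, deadline) - begin >= duration:  # Eq. (29)
            return begin  # slots are time-ordered, so this is the minimum in Eq. (30)
    return None


busy = [(0.0, 4.0), (10.0, 15.0)]
print(earliest_start(busy, drt=3.0, duration=5.0))  # 4.0: fits into the hole [4, 10]
print(earliest_start(busy, drt=3.0, duration=7.0))  # 15.0: too long for the hole
```

Because the slots are visited in time order and max(t_{end}^{(m)}, {DRT}_{i, k}) is non-decreasing in m, the first feasible slot already yields the minimum required by Equation (30).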

4.4.2. Effectiveness of the Hole-Filling Strategy

To demonstrate the effectiveness of reclaiming fragmented idle time in heterogeneous edge clusters, we designed a controlled ablation experiment with the task mapping strategy as the sole independent variable. All other scheduling components remained completely consistent in both schemes.

Baseline scheme (Append-Only): This strategy strictly follows the tail-append principle. For any task

v_{i}

assigned to processor

p_{j}

, its earliest start time (EST) is restricted to no earlier than the processor’s current available time

Avail (p_{j})

, i.e., the completion time of the last scheduled task:

EST (v_{i}, p_{j}) = max (Avail (p_{j}), DRT (v_{i}, p_{j}))

(31)

The experimental results are shown in Figure 9. The results demonstrate that the hole-filling strategy achieves synergistic optimization of latency and energy consumption. In terms of execution efficiency, this strategy effectively compresses the system’s critical path by dynamically filling tasks into fragmented time, resulting in an average completion time reduction of 6.96%. Regarding energy efficiency, compared to the static power waste caused by processor idling in the traditional Append-Only mode, the hole-filling strategy achieves energy savings of up to 9.46% by eliminating invalid waiting periods.

5. Experiments

To verify the effectiveness of the HERO framework in heterogeneous edge computing environments, we built a high-fidelity simulation platform in Python 3.8.20 and conducted extensive comparative experiments against four representative scheduling algorithms from the recent literature.

5.1. Experimental Setup

5.1.1. Task Generation

To evaluate the performance of the scheduling algorithm in different application scenarios, we use a parameter-controlled random DAG generator to construct a diverse benchmark set. This generation model strictly follows the mathematical formulations below:

Topology generation: A layered (hierarchical) generation method constructs the DAG structure, which guarantees acyclicity by construction. We divide the vertex set

V

into L disjoint hierarchical subsets:

V = ⋃_{k = 1}^{L} V_{k}, V_{a} \cap V_{b} = Ø, \forall a \neq b

(32)

Graph depth: The number of layers L follows a uniform distribution

L \sim U [L_{\min}, L_{\max}]

, reflecting the serial length of the task flow. Graph width: The parallelism (number of nodes) of each layer

| V_{k} |

follows a uniform distribution

| V_{k} | \sim U [P_{\min}, P_{\max}]

. Connection constraint: For any edge

(v_{i}, v_{j}) \in E

, if

v_{i} \in V_{a}

and

v_{j} \in V_{b}

, then the hierarchical constraint

a < b

must be satisfied.
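The layered construction above can be sketched in a few lines. This is a minimal illustration rather than the generator used in the experiments: it assumes integer-valued uniform sampling for depth and width, and it assumes that every node pair in strictly increasing layers is connected independently with probability `conn_prob`. The text does not say whether edges may skip layers, and this sketch does not force every node to have a predecessor or successor.

```python
import random


def layered_dag(l_min, l_max, p_min, p_max, conn_prob, seed=None):
    """Generate (layers, edges) for a random layered DAG.

    layers: list of lists of node ids, one list per layer V_k.
    edges: list of (u, v) with layer(u) < layer(v), so the graph is acyclic.
    """
    rng = random.Random(seed)
    depth = rng.randint(l_min, l_max)           # L ~ U[L_min, L_max]
    layers, next_id = [], 0
    for _ in range(depth):
        width = rng.randint(p_min, p_max)       # |V_k| ~ U[P_min, P_max]
        layers.append(list(range(next_id, next_id + width)))
        next_id += width

    layer_of = {v: a for a, layer in enumerate(layers) for v in layer}
    edges = [
        (u, v)
        for u in range(next_id)
        for v in range(next_id)
        if layer_of[u] < layer_of[v] and rng.random() < conn_prob
    ]
    return layers, edges


layers, edges = layered_dag(8, 13, 10, 20, conn_prob=0.3, seed=42)
print(len(layers), sum(len(layer) for layer in layers), len(edges))
```

Since every edge points from a lower layer to a higher one, any ordering of the nodes by layer index is a valid topological order, which is why no explicit cycle check is needed.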

For computational load generation, we employ the UUniFast algorithm. This algorithm performs uniform sampling within the simplex defined by the total system utilization

U_{sys}

, generating an unbiased utilization vector

u = \{u_{1}, u_{2}, \dots, u_{n}\}

that satisfies the following conditions:

\sum_{i = 1}^{n} u_{i} = U_{sys}, 0 < u_{i} < 1

(33)

Subsequently, the computational load

w_{i}

of task

v_{i}

is jointly determined by its allocated utilization

u_{i}

and task period

T_{i}

(

w_{i} \propto u_{i} \cdot T_{i}

), ensuring the statistical uniformity of the load distribution.
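For reference, the standard UUniFast recurrence can be written as below. Plain UUniFast does not by itself guarantee u_i < 1 when U_{sys} > 1, so the sketch wraps it in a rejection loop in the style of UUniFast-Discard; whether the experiments use this variant is an assumption. The workload helper takes the proportionality constant in w_i ∝ u_i · T_i to be 1, which is also an assumption, and the periods are made-up values.

```python
import random


def uunifast(n, u_total, rng):
    """Standard UUniFast: n utilizations that sum to u_total."""
    utils = []
    remaining = u_total
    for i in range(1, n):
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils


def uunifast_discard(n, u_total, rng, max_tries=1000):
    """Resample until every utilization lies strictly inside (0, 1)."""
    for _ in range(max_tries):
        utils = uunifast(n, u_total, rng)
        if all(0.0 < u < 1.0 for u in utils):
            return utils
    raise RuntimeError("no valid utilization vector found")


def workloads(utils, periods):
    """w_i = u_i * T_i, assuming a proportionality constant of 1."""
    return [u * t for u, t in zip(utils, periods)]


rng = random.Random(0)
utils = uunifast_discard(n=20, u_total=2.5, rng=rng)
print(round(sum(utils), 6), max(utils) < 1.0)
```

The rejection step matters for the overload settings in the stress test (U up to 3.0), where a single draw can otherwise assign a utilization of 1 or more to one task.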

To cover diverse application characteristics, we introduce CCR as a control parameter. The average computational load of the graph is calculated as

\bar{w} = \frac{1}{n} \sum_{i = 1}^{n} w_{i}

(34)

The data transfer volume

d_{i, j}

of edge

e_{i, j}

is generated based on the following random process:

d_{i, j} = \bar{w} \times CCR \times δ, δ \sim U [0.8, 1.2]

(35)

Here,

δ

is a random perturbation factor used to simulate the random fluctuations in communication overhead between different tasks in a real environment.
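Equations (34) and (35) amount to scaling the mean workload by the CCR and a per-edge random factor. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
import random


def edge_volumes(task_workloads, edges, ccr, rng):
    """Eq. (35): d_ij = mean(w) * CCR * delta, with delta ~ U[0.8, 1.2] per edge."""
    w_bar = sum(task_workloads) / len(task_workloads)   # Eq. (34)
    return {edge: w_bar * ccr * rng.uniform(0.8, 1.2) for edge in edges}


rng = random.Random(0)
volumes = edge_volumes([100.0, 300.0, 200.0], [(0, 1), (0, 2), (1, 2)], ccr=2.0, rng=rng)
print(volumes)
```

With a mean workload of 200 and CCR = 2.0, every generated volume falls between 320 and 480, so the CCR controls the scale of the communication cost while δ only adds ±20% variation around it.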

5.1.2. Simulation Platform and Tools

The proposed HERO framework is implemented in Python 3; specifically, the communication-aware predictor of HERO uses the PyTorch 2.4.1+cu118 framework for offline training and forward inference. Data aggregation and visualization are handled by the Pandas 2.0.3 and Seaborn 0.13.2 libraries. Furthermore, to ensure computational efficiency, the NSGA-II benchmark was independently implemented in C++ and dynamically invoked by the Python-based main scheduler during the simulation.

To construct a representative modern heterogeneous edge environment, we employ three distinct types of processors: Raspberry Pi 4 (Raspberry Pi Foundation, Cambridge, UK), Jetson Orin Nano (NVIDIA Corporation, Santa Clara, CA, USA), and Intel Xeon D (Intel Corporation, Santa Clara, CA, USA). Their corresponding frequencies (f) and power consumption profiles (P) across different levels are detailed in Table 5. For systematic comparison and frequency scaling analysis, the processor frequencies are normalized across five discrete levels with a step size of 0.2. The cluster consists of three heterogeneous nodes, and the static leakage power of each processor is set to 10% of its peak dynamic power at the maximum frequency level (1.0).

To ensure the objectivity and reproducibility of the benchmark set, the average connection probability of the DAGs is set to 0.3 during topology generation, and the base computational load of each task is randomly sampled between 10 and 3000. The Communication-to-Computation Ratio (CCR) is sampled from the wide range

[0.5, 3.0]

. The energy budget factor

γ

is empirically set to 0.85 (i.e.,

E_{constraint} = 0.85 \times E_{sys}^{\max} + 0.15 \times E_{sys}^{\min}

) during the resource allocation phase.

5.1.3. Evaluation Metrics

To quantitatively evaluate the performance of the proposed HERO framework and baseline algorithms, we employ two primary absolute metrics defined in Section 3.4: Makespan (

T_{makespan}

) and Total Energy Consumption (

E_{total}

). Furthermore, to clearly illustrate the relative advantages of HERO, we define the Performance Improvement Ratio (PIR) for both latency and energy. Let

M_{base}

denote the metric value (Makespan or Energy) obtained by the benchmark algorithm, and

M_{HERO}

denote the corresponding value obtained by HERO. The improvement percentage is calculated as

PIR = \frac{M_{base} - M_{HERO}}{M_{base}} \times 100 %

(36)

A positive PIR indicates that HERO outperforms the baseline algorithm, while a negative value indicates performance degradation. In our multi-trial experiments, all reported results are the arithmetic mean of multiple random instances to eliminate statistical outliers.
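Equation (36) translates directly into code. The function name and the sample values below are illustrative only and are not measurements from the experiments.

```python
def pir(m_base, m_hero):
    """Eq. (36): percentage improvement of HERO over a baseline metric value."""
    return (m_base - m_hero) / m_base * 100.0


# Hypothetical makespans: the baseline takes 120 time units, HERO takes 100.
print(round(pir(120.0, 100.0), 2))  # 16.67
# Hypothetical energies: HERO uses more energy than the baseline, so PIR is negative.
print(round(pir(100.0, 104.0), 2))  # -4.0
```

Note that the ratio is normalized by the baseline value, so a baseline that is 20% slower than HERO corresponds to a PIR of about 16.67%, not 20%.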

5.2. Comparison with Benchmark Algorithms

To evaluate the performance boundaries of the HERO framework, we selected four state-of-the-art (SOTA) algorithms that represent distinct scheduling strategies as benchmarks:

NSGA-II: A classic Pareto-based multi-objective evolutionary algorithm. In theory, NSGA-II can approach the global optimum given unbounded search time. In our experiments, NSGA-II runs under a scale-adaptive budget: the population size is a linear function of the number of tasks, and the maximum number of iterations is limited to three times the number of tasks.
DPMC: A heuristic algorithm for mixed-criticality systems that distinguishes between high- and low-criticality tasks. It employs a relatively conservative frequency reservation strategy to ensure the deadlines of high-priority tasks.
SSA (labelled SSA-DAG in the results below): A structure-aware scheduling algorithm based on an MLP. It uses a neural network to learn node importance and introduces a dual-queue mechanism that reserves resources for high-priority tasks waiting for their predecessors to complete, in order to reduce task completion time.
TOM: An algorithm based on time-triggered and chain-structure optimization. It performs excellently in merging linear task chains to reduce synchronization overhead.

5.3. Experimental Results

5.3.1. Impact of Task Size (Number of Layers) on Performance

Figure 10 shows algorithm performance as DAG depth M increases from 8 to 13 (with fixed width

N = 15

).

HERO achieves the fastest completion time across all depths. Without fine-grained communication awareness, SSA-DAG lags behind HERO by 28.89% on average (peaking at 32.85% at

M = 12

). DPMC’s rigid resource reservations similarly cause a 15.46% scheduling delay.

Regarding energy trade-offs, NSGA-II and DPMC consume 9.58% and 3.29% less energy than HERO, respectively, but severely sacrifice real-time performance (lagging 12.29% and 15.46% in makespan). Meanwhile, SSA-DAG and TOM consume more energy than HERO (4.04% and 1.92%, respectively) while remaining slower. Overall, HERO’s hybrid budgeting secures the best completion time without excessive energy waste.

5.3.2. Impact of Task Parallelism (Width) on Performance

Figure 11 shows the performance trends of various algorithms under different levels of parallelism. We fixed the DAG task depth at

M = 10

, and the average number of parallel nodes per layer N varied from 10 to 30.

HERO maintained the fastest completion time in all tests. SSA-DAG’s completion time was on average 22.51% slower than HERO. The TOM algorithm, due to insufficient adaptability to complex mesh dependencies, was on average 9.13% slower. DPMC’s static rule-based reservation mechanism struggles to adapt to dynamic concurrent workloads, resulting in performance fluctuations with an average lag of 13.19%.

In terms of energy optimization, DPMC and NSGA-II saved 3.52% and 7.71% of energy on average compared to HERO, respectively. NSGA-II achieved this at the cost of a 6.39% longer completion time on average, with the worst-case degradation reaching 7.78%. Moreover, NSGA-II relies on heavy population iterations (requiring minutes), whereas HERO completes inference in microseconds using neural network forward propagation.

5.3.3. Impact of Graph Density (Connection Probability)

Figure 12 illustrates algorithm performance across varying graph densities. We adjusted the connection probability C from 0.1 (highly sparse) to 0.7 (highly dense), with fixed depth

M = 12

, width

N = 15

, and task utilization

U = 2.5

.

Increasing graph density sharply exacerbates communication bottlenecks and synchronization barriers. HERO consistently achieves the lowest makespan across all densities. In highly dense scenarios (

C \geq 0.5

), algorithms lacking fine-grained communication awareness struggle to resolve complex mesh dependencies: SSA-DAG and TOM lag behind HERO by an average of 10.89% and 5.36%, respectively.

Conversely, HERO’s deep-enhanced MLP explicitly incorporates maximum communication overhead (

{Comm}_{\max}

) to accurately identify true critical paths amidst massive data transfer delays. While DPMC and NSGA-II exhibit energy-saving tendencies under dense topologies (saving 3.67% and 10.54% on average, respectively), they severely compromise real-time performance, with makespans lagging behind HERO by 8.24% and 4.53%.

5.3.4. System Load Stress Test

Figure 13 shows the performance of various scheduling algorithms as the system task utilization (U) increases from

U = 1.0

to overload (

U = 3.0

). We keep the other parameters fixed at

M = 12

,

N = 15

.

For DPMC, the completion-time lag behind HERO stays large as the load increases, moving from 19.75% under light load to 19.97% under heavy load (

U = 3.0

), indicating severe underutilization of resources. SSA-DAG has an average completion time delay of 40.90%. This illustrates the limitation of relying solely on structural features without fine-grained communication awareness in high-density computing scenarios. NSGA-II sacrifices 14.66% of execution speed to achieve 12.18% energy savings. Across all tests, HERO maintains its absolute lead in completion time.

6. Conclusions and Future Work

In this paper, we proposed the lightweight HERO framework to address the bi-objective scheduling of delay-sensitive DAG tasks in heterogeneous edge clusters. Our extensive evaluations demonstrate the framework’s practical advantages. Compared to the representative learning-based baseline (SSA-DAG), HERO achieves an average 10.89% reduction in makespan under high-density topologies and saves up to 4.04% of system energy across varying task depths. For resource-constrained edge devices, this energy margin is significant, as it cumulatively extends battery life and helps prevent hardware thermal throttling during sustained workloads. These gains come without introducing new system bottlenecks, while maintaining the microsecond-level scheduling latency that real-time edge intelligence requires. Building on these results, our future research will focus on three main avenues: (1) adapting the framework for dynamic, online environments (e.g., vehicular networks) with unpredictable task generation and topologies; (2) integrating lightweight fault-tolerance mechanisms to ensure high reliability against transient edge node failures; and (3) advancing to hardware-in-the-loop deployments on actual microcontrollers and embedded IoT sensor nodes to assess real-world physical overhead and end-to-end adaptability outside of simulated environments.

Author Contributions

Conceptualization, methodology, data curation, writing—original draft preparation, project administration, and funding acquisition, Z.Z.; visualization, writing—review and editing, validation, and funding acquisition, Y.J.; supervision, N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Summary of Key Notations.

Notation	Description	Notation	Description
$G, V, E$	DAG components ( $G, V, E$ )	$v_{i}$	i-th computational task
$e_{i, j}$	Data dependency edge	$w_{i}$	Task computational workload
$d_{i, j}$	Data transmission volume	B	Average network bandwidth
CCR	Comm-to-computation ratio	$Pred, Succ$	Predecessor and Successor sets
$P, p_{k}$	Processor set and k-th server	$s_{k}$	Processing speed coefficient
$F_{k}, f_{i, k}$	Frequency set and selected freq	$t_{i, k}$	Task execution time on $p_{k}$
$c_{i, j}$	Communication time cost	$P_{k}^{stat}$	Static baseline power of $p_{k}$
$ξ_{k}$	Hardware capacitive constant	$E_{i}^{exec}$	Task execution energy
$E_{\min}^{(i)}$	Task minimum energy bound	$E_{\max}^{(i)}$	Task maximum energy bound
$E_{total}$	Total system energy consumption	$T_{makespan}$	Total completion time
$x_{i}$	11-D task feature vector	$y_{i}$	Marginal sensitivity label
$E_{constraint}$	System energy constraint	$γ$	Energy budget control factor
$E_{slack}^{global}$	Global energy margin pool	$S_{i}$	Hybrid importance score
$θ$	Weight balancing factor	$E_{budget}^{(i)}$	Initial task energy budget
$E_{saved}^{(t)}$	Accumulated energy surplus	$E_{limit}^{(i)}$	Actual dynamic energy limit
$E_{actual}^{(i)}$	Final energy consumed by $v_{i}$	${DRT}_{i, k}$	Data ready time on $p_{k}$
${EST}_{i, k}$	Earliest start time on $p_{k}$	${Slot}_{m}$	m-th idle time slice
${Deadline}_{i}$	Task execution deadline	$δ$	Random perturbation factor
PIR	Performance improvement ratio	$U_{sys}$	System task utilization

References

[1] Li, J.; Shang, Y.; Qin, M.; Yang, Q.; Cheng, N.; Gao, W.; Kwak, K.S. Multiobjective oriented task scheduling in heterogeneous mobile edge computing networks. IEEE Trans. Veh. Technol. 2022, 71, 8955–8966.
[2] Zhou, X.; Ge, S.; Liu, P.; Qiu, T. DAG-based dependent tasks offloading in MEC-enabled IoT with soft cooperation. IEEE Trans. Mob. Comput. 2023, 23, 6908–6920.
[3] Cao, Z.; Deng, X.; Yue, S.; Jiang, P.; Ren, J.; Gui, J. Dependent task offloading in edge computing using GNN and deep reinforcement learning. IEEE Internet Things J. 2024, 11, 21632–21646.
[4] Peng, Q.; Wu, C.; Xia, Y.; Ma, Y.; Wang, X.; Jiang, N. DoSRA: A decentralized approach to online edge task scheduling and resource allocation. IEEE Internet Things J. 2021, 9, 4677–4692.
[5] Taghinezhad-Niar, A.; Taheri, J. Fault-Tolerant Cost-Efficient Scheduling for Energy and Deadline-Constrained IoT Workflows in Edge-Cloud Continuum. IEEE Trans. Serv. Comput. 2025, 18, 2892–2903.
[6] He, X.; Pang, S.; Gui, H.; Zhang, K.; Wang, N.; Yu, S. Online offloading and mobility awareness of DAG tasks for vehicle edge computing. IEEE Trans. Netw. Serv. Manag. 2024, 22, 675–690.
[7] Jiang, Q.; Xin, X.; Zhang, T.; Chen, K. Energy-Efficient Task Scheduling and Resource Allocation in Edge Heterogeneous Computing Systems Using Multi-Objective Optimization. IEEE Internet Things J. 2025, 12, 36747–36764.
[8] Biswas, S.K.; Muhuri, P.K.; Roy, U.K. Binary search-based fast scheduling algorithms for reliability-aware energy-efficient task graph scheduling with fault tolerance. IEEE Trans. Sustain. Comput. 2023, 9, 433–451.
[9] Yu, Z.; Liu, W.; Liu, X.; Wang, G. Drag-JDEC: A deep reinforcement learning and graph neural network-based job dispatching model in edge computing. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS); IEEE: Piscataway, NJ, USA, 2021; pp. 1–10.
[10] Zhou, Y.; Li, X.; Luo, J.; Yuan, M.; Zeng, J.; Yao, J. Learning to optimize dag scheduling in heterogeneous environment. In Proceedings of the 2022 23rd IEEE International Conference on Mobile Data Management (MDM); IEEE: Piscataway, NJ, USA, 2022; pp. 137–146.
[11] Deng, X.; Yang, H.; Zhang, J.; Gui, J.; Lin, S.; Wang, X.; Min, G. Task offloading in internet of vehicles: A drl-based approach with representation learning for dag scheduling. IEEE Trans. Mob. Comput. 2025, 24, 5045–5060.
[12] Zhang, J.; Mo, L.; Wang, X.; Yang, C.; Wang, M.; Niu, D. Mixed-criticality DAGs Scheduling and Performance Optimization for Heterogeneous Multicore Systems. In Proceedings of the 2025 37th Chinese Control and Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2025; pp. 3013–3019.
[13] Gao, Y.; Yi, H.; Chen, H.; Fang, X.; Zhao, S. A structure-aware DAG scheduling and allocation on heterogeneous multicore systems. In Proceedings of the 2024 IEEE 14th International Symposium on Industrial Embedded Systems (SIES); IEEE: Piscataway, NJ, USA, 2024; pp. 26–33.
[14] Wang, S.; Li, D.; Huang, S.Y.; Deng, X.; Sifat, A.H.; Huang, J.B.; Jung, C.; Williams, R.; Zeng, H. Time-triggered scheduling for nonpreemptive real-time DAG tasks using 1-opt local search. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3650–3661.
[15] Liu, D.; Chen, J.; Huang, X.; Hong, H. A reliability-aware and energy-aware task scheduling algorithm for heterogeneous multi-core systems. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2024; pp. 3212–3217.
[16] Zhang, Y.; Zhao, S.; Chen, G.; Huang, K. Fault-tolerant DAG scheduling with runtime reconfiguration on multicore real-time systems. In Proceedings of the 2024 IEEE 35th International Conference on Application-Specific Systems, Architectures and Processors (ASAP); IEEE: Piscataway, NJ, USA, 2024; pp. 19–27.
[17] Sun, B.; Theile, M.; Qin, Z.; Bernardini, D.; Roy, D.; Bastoni, A.; Caccamo, M. Edge generation scheduling for dag tasks using deep reinforcement learning. IEEE Trans. Comput. 2024, 73, 1034–1047.
[18] Song, X.; Feng, J.; Liu, L.; Pei, Q.; Yu, F.R.; Zhang, N. A Deep Reinforcement Learning with Transformer Integration for Directed Acyclic Graph Scheduling in Edge Networks. IEEE Trans. Wirel. Commun. 2025, 25, 5506–5520.
[19] Liu, Z.; Huang, L.; Gao, Z.; Luo, M.; Hosseinalipour, S.; Dai, H. GA-DRL: Graph neural network-augmented deep reinforcement learning for DAG task scheduling over dynamic vehicular clouds. IEEE Trans. Netw. Serv. Manag. 2024, 21, 4226–4242.
[20] Ding, W.; Luo, F.; Gu, C.; Dai, Z.; Lu, H. A multiagent meta-based task offloading strategy for mobile-edge computing. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 100–114.
[21] Zhao, G.; Xu, H.; Zhao, Y.; Qiao, C.; Huang, L. Offloading dependent tasks in mobile edge computing with service caching. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications; IEEE: Piscataway, NJ, USA, 2020; pp. 1997–2006.
[22] Lou, J.; Tang, Z.; Zhang, S.; Jia, W.; Zhao, W.; Li, J. Cost-effective scheduling for dependent tasks with tight deadline constraints in mobile edge computing. IEEE Trans. Mob. Comput. 2022, 22, 5829–5845.
[23] Chen, Y.; Liu, S.; Zhou, J.; Ling, X. Real-Time DAG Task Allocation Strategy for Multiprocessor by Optimistic Parallelism. In Proceedings of the 2024 IEEE 24th International Conference on Communication Technology (ICCT); IEEE: Piscataway, NJ, USA, 2024; pp. 1016–1021.

Figure 1. An example of a Directed Acyclic Graph (DAG) task model. Nodes (

v_{0}

–

v_{9}

) represent computational tasks, and directed edges indicate execution dependencies and data transmission flow.

Figure 2. The architecture of the HERO framework.

Figure 3. The task sensitivity analysis process, showing how marginal delay impacts are calculated to generate training labels.

Figure 4. Training performance of the proposed deep-enhanced MLP predictor. (a) The rapid convergence of training and validation loss within 80 epochs. (b) A scatter plot comparing true vs. predicted sensitivity on the test set; the dashed line represents the ideal prediction (

y = x

), demonstrating the model’s high prediction accuracy.

Figure 5. Micro-behavior comparison of the traditional (

Rank_{u}

) strategy and the proposed MLP-based strategy.

Figure 6. Performance comparison of

Rank_{u}

and MLP under varying graph depths and parallelism levels.

Figure 7. Permutation feature importance analysis, highlighting computational load and communication overhead as the most critical scheduling features.

Figure 8. Performance comparison of Rank-Only, Workload-Only, and HERO scheduling strategies under identical energy constraints.

Figure 9. Completion time and energy consumption comparison between hole-filling and Append-Only strategies.

Figure 10. Performance comparison of algorithms at different depths (M).

Figure 11. Performance comparison of the algorithms under different parallelism levels (N).

Figure 12. Performance comparison of algorithms under different graph densities (C).

Figure 13. Performance comparison of algorithms under different system utilizations (U).

Table 1. Comprehensive Feature Comparison of Existing Task Scheduling Strategies and the Proposed HERO Framework.

Category	Ref. & Method	Optimization Objectives	Energy Strategy	Communication Overhead Handling
Heuristic & Evolutionary Algorithms	Li et al. [1]	Makespan, Energy	Standard DVFS	Static transmission assumption
	Jiang et al. [7]	Energy, Delay	DVFS auto-adjustment	Partially considered
	Zhang et al. [12]	Performance, Service Quality	Dynamic DVFS	Not strictly prioritized
	Liu et al. [15]	Reliability, Energy	Standard DVFS	Redundancy-based
	Biswas et al. [8]	Reliability, Energy	Fast DVFS switching	Static bounds
	Zhang et al. [16]	Makespan, Fault-tolerance	None (Re-execution)	Unpredictable failure status
	Taghinezhad-Niar [5]	Cost, Energy, Deadline	Energy-constrained	Edge-cloud congestion modeled
DRL & Intelligent Scheduling	Drag-JDEC [9]	Makespan, QoS	None	GNN feature extraction
	Cao et al. [3]	Makespan	None	GAT-based dependency
	Sun et al. [17]	DAG Width, Deadline	None	Edge generation representation
	Song et al. [18]	Energy, Makespan	Transmit power & CPU freq.	Attention-based feature
	GA-DRL [19]	Makespan, Timeliness	None	Topology two-way aggregation
	DVTP [11]	Makespan	None	Spatiotemporal representation
	Ding et al. [20]	Latency, Energy	Charging time trade-off	Dynamic environment-aware
	LACHESIS [10]	Completion time	None	Topological perception
Structure-Aware & Specific Strategies	Zhao et al. [21]	Execution time	None	Wireless interference
	LOU [22]	Latency, Cost	Strict Constraints	Dependency-aware
	Zhou et al. [2]	Latency, Energy, Gain	End-device frequency	Soft cooperation (Data sharing)
	He et al. [6]	Makespan, Queue stability	None	Cross-slot queue (Lyapunov)
	Gao et al. [13]	Makespan	None	Pre-calculated node priority
	Wang et al. [14]	Worst-case latency	None	1-Opt Local Search path
	DoSRA [4]	Efficiency, Delay	None	Decentralized provision
	OPSA [23]	Processor utilization	None	Parallelism limitation
Proposed	HERO (Ours)	Makespan, Energy	Preventive-bottleneck budget	Aggressive hole-filling strategy

Table 2. The 11-dimensional feature vector for task representation.

Feature Categories	Feature Name	Specific Meaning
Computation & Communication	$w_{i}$	The task’s own computational load
Computation & Communication	${Comm}_{\max}$	Maximum data transfer weight of all output edges of the task
Static Priority Features	${Rank}_{up}$	Distance from the task to the exit node
	${Rank}_{down}$	Distance from task to the entry node
	${Level}_{topo}$	Task level depth in the DAG topological sorting
Graph Structure & Path	$d_{in}$	In-degree of the task node
	$d_{out}$	Out-degree of the task node
	$d_{tot}$	Total degree of the task node
	$d_{diff}$	The difference between in-degree and out-degree
	$T_{path}$	Execution time of the critical path passing through the node
	$N_{path}$	Number of paths from entry node to exit node passing through the node

Table 3. Model performance comparison against classic baselines.

Model	MSE	MAE	$R^{2}$
Linear Regression	0.018663	0.075654	0.1786
Decision Tree	0.017255	0.070981	0.2406
Random Forest	0.010575	0.055387	0.5346
XGBoost	0.010004	0.054553	0.5597

Table 4. Ablation study of MLP architectures on prediction accuracy and computational overhead.

Architecture	MSE	$R^{2}$	Latency (μs)	Parameters
Shallow-MLP (2-layer)	0.006224	0.690191	43.53	1537
HERO-MLP (5-layer)	0.005880	0.707295	199.09	50,049
Deep-MLP (12-layer)	0.006106	0.696074	254.53	2,638,849

Table 5. Processor configurations.

Level	Normalized Freq	Raspberry Pi 4 $f$ (MHz)	Raspberry Pi 4 $P$ (mW)	Jetson Orin Nano $f$ (MHz)	Jetson Orin Nano $P$ (mW)	Xeon D $f$ (MHz)	Xeon D $P$ (mW)
1	0.2	300	2500	302	5000	600	20,000
2	0.4	600	3200	605	7000	1200	28,000
3	0.6	900	4200	907	10,000	1800	38,000
4	0.8	1200	5500	1209	15,000	2400	50,000
5	1.0	1500	7000	1512	25,000	3000	65,000
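
To illustrate how the operating points in Table 5 translate into an energy-latency trade-off, the sketch below evaluates each frequency level for a single task. It assumes, for illustration only, that execution time scales inversely with the normalized frequency and that energy equals power multiplied by execution time. The function `cheapest_level` and its deadline parameter are hypothetical and do not reproduce HERO's hybrid budget mechanism.

```python
# Operating points copied from Table 5: (normalized frequency, power in mW),
# listed in order of frequency level 1 to 5.
LEVELS = {
    "Raspberry Pi 4": [(0.2, 2500), (0.4, 3200), (0.6, 4200), (0.8, 5500), (1.0, 7000)],
    "Jetson Orin Nano": [(0.2, 5000), (0.4, 7000), (0.6, 10000), (0.8, 15000), (1.0, 25000)],
    "Xeon D": [(0.2, 20000), (0.4, 28000), (0.6, 38000), (0.8, 50000), (1.0, 65000)],
}


def cheapest_level(device, t_full_s, deadline_s):
    """Pick the frequency level with the lowest energy that meets a deadline.

    t_full_s is the task's execution time at the maximum frequency.
    Returns (level, execution time in s, energy in mJ), or None if no
    level finishes within deadline_s.
    """
    best = None
    for level, (norm_freq, power_mw) in enumerate(LEVELS[device], start=1):
        # Assumption: execution time is inversely proportional to frequency.
        exec_time_s = t_full_s / norm_freq
        if exec_time_s > deadline_s:
            continue  # this level is too slow for the deadline
        energy_mj = power_mw * exec_time_s  # mW * s = mJ
        if best is None or energy_mj < best[2]:
            best = (level, exec_time_s, energy_mj)
    return best


# A task that needs 1 s at full frequency, with a loose and a tight deadline.
print(cheapest_level("Raspberry Pi 4", 1.0, 10.0))
print(cheapest_level("Raspberry Pi 4", 1.0, 1.1))
```

Under these assumptions, the lowest frequency level is not the most energy-efficient for the Raspberry Pi 4 values: a task that takes 1 s at full frequency costs 12,500 mJ at level 1 but 6875 mJ at level 4, because the longer execution time outweighs the lower power draw.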
