1. Introduction
Multi-access Edge Computing (MEC) has emerged as a critical infrastructure for latency-sensitive applications in the 5G/6G and IoT era [1,2]. To process complex Deep Neural Network (DNN) inferences, edge computing is shifting towards multi-node heterogeneous clusters that collaborate via task offloading [3,4]. However, edge servers exhibit significant differences in computing capabilities and face strict battery capacity or Thermal Design Power (TDP) constraints [5,6]. Achieving the joint optimization of millisecond-level real-time response (makespan) and total system energy consumption in such heterogeneous and dynamic environments has become a highly challenging NP-hard bi-objective scheduling problem [7,8].
To tackle this problem, edge computational workloads are typically modeled as complex Directed Acyclic Graphs (DAGs) to capture fine-grained dependencies and strict millisecond-level deadlines. Although academia has proposed various advanced DAG scheduling strategies, severe challenges remain in resource-constrained and highly dynamic edge environments. First, evolutionary algorithms, such as NSGA-II [1], rely on heavy iterative searches. Consequently, they incur excessively high online scheduling latency. Second, existing deep reinforcement learning-based algorithms (such as SSA-DAG) [9,10,11] often exhibit a “performance-heavy, energy-light” tendency, lacking proactive energy budget planning. More crucially, they typically treat edge weights as static features. This makes it difficult to capture the condition-triggered characteristics of communication overheads in heterogeneous clusters. As a result, they fail to accurately identify the true system bottlenecks. Finally, dedicated algorithms based on specific structures or mixed criticality (such as TOM and DPMC) [12,13,14] usually adopt pessimistic resource reservation or rigid slack allocation strategies. They ignore differences in the intrinsic computational workloads of individual tasks. Therefore, they are highly prone to over-scaling the frequency of heavy-workload tasks on non-critical paths, thereby creating new system bottlenecks.
To address the aforementioned pain points, this paper proposes a lightweight and efficient collaborative scheduling framework: HERO (Hybrid Energy-aware Ranking and Optimization). This framework constructs a “perception–decision–compensation” closed-loop optimization system, aiming to break the limitations of traditional static heuristic and greedy learning strategies. The main contributions of this paper are summarized as follows:
Establishment of a Communication-Aware Sensitivity Quantification Model: We propose a perturbation-based mechanism to quantify the marginal effect of task execution fluctuations on the global makespan, accurately stripping away pseudo-critical paths.
Hybrid Budget Allocation: We design a multi-factor energy arbitration mechanism that balances critical path progression with the resource needs of heavy-load tasks on non-critical paths.
Time Fragment Recovery Mechanism: We introduce an aggressive hole-filling strategy to reclaim discrete idle time slots induced by heterogeneous communication overheads.
Performance Validation: Extensive experiments on a diverse testbed (Raspberry Pi 4, Jetson Orin Nano, and Xeon D) demonstrate that HERO reduces the completion time by an average of 10.89% under high-density topologies and achieves up to 4.04% energy savings across varying task depths.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 formalizes the system and communication models, alongside the bi-objective optimization problem. Section 4 elaborates on the detailed design of the proposed HERO framework. Section 5 presents the experimental setup, comparative results, and performance evaluation. Finally, Section 6 concludes the paper.
2. Related Work
Directed Acyclic Graph (DAG) task scheduling and offloading in heterogeneous edge environments is a classical NP-hard problem. As summarized in Table 1, existing solutions can generally be divided into three categories: heuristic and evolutionary algorithms, deep learning-based intelligent scheduling, and structure-aware specific offloading strategies.
2.1. Heuristic and Evolutionary Algorithms
Due to the heterogeneity of edge environments, multi-objective evolutionary and heuristic algorithms are widely adopted to address the trade-offs between latency, energy consumption, and system reliability [1,5,7,8,12,15,16]. While these methods excel in finding Pareto-optimal solutions or ensuring fault tolerance, they face severe challenges in real-time edge collaboration scenarios. Algorithms represented by NSGA-II rely on heavy population iteration and mutation processes, resulting in scheduling delays that are too high to meet the millisecond-level response requirements of intelligent edge services. Furthermore, their energy optimization methods often employ rigid strategies, allocating energy budgets solely based on time margins while ignoring intrinsic computational workload differences, which easily creates new system bottlenecks.
2.2. Intelligent Scheduling Based on Deep Reinforcement Learning
Schedulers based on Deep Reinforcement Learning (DRL) have become a research hotspot due to their environmental adaptability. Recent advancements frequently combine DRL with Graph Neural Networks (GNNs) or Transformers to capture complex DAG dependencies and optimize offloading decisions in highly dynamic environments [3,9,10,11,17,18,19,20].
While DRL methods perform well in finding high-quality solutions, deploying them on resource-constrained edge nodes remains challenging. The high-dimensional state encoding required by complex GNNs or Transformers introduces unacceptable millisecond-level inference latency for lightweight edge tasks. Additionally, as detailed in Table 1, existing DRL schedulers predominantly prioritize performance over energy consumption, lacking an active energy budget mechanism and often neglecting the critical impact of heterogeneous communication overheads on feature extraction.
2.3. Structure-Aware and Specific Offloading Strategies
Structure-aware and specific offloading strategies simplify the scheduling problem by exploiting DAG structural characteristics or predefined rules, such as service caching, 1-Opt local search, or decentralized game theory [2,4,6,13,14,21,22,23].
Although effective in specific use cases, these strategies lack generality. Chain-based optimizations make strong assumptions about the DAG’s shape, making it difficult to handle highly irregular edge workflows. Moreover, unlike the aggressive gap-filling strategy proposed in our HERO framework, these structured methods often struggle to flexibly recycle discrete idle time slots caused by heterogeneous communication, resulting in limited overall resource utilization.
3. System Model and Problem Description
To facilitate a clear understanding of the mathematical models presented in this section, the primary notations used throughout this paper are summarized in the table in the Abbreviations.
To provide a rigorous mathematical abstraction for the task scheduling problem in an edge collaborative environment, this section first defines an application model based on a DAG to depict the complex dependencies and heterogeneous computational workloads among tasks. Subsequently, the heterogeneous edge cluster architecture and communication model are constructed to quantify the heterogeneous transmission overhead incurred by cross-node collaboration. Furthermore, considering the resource-constrained nature of edge devices, a power and energy consumption model based on DVFS is introduced. Finally, building upon the aforementioned models, the joint optimization of latency and energy consumption is formulated as a constrained bi-objective optimization problem.
3.1. Application Model
We model the dependency-driven task flow, specifically the deep learning inference pipeline in an edge computing environment, as a DAG, as illustrated in Figure 1, and defined as
$$G = (V, E).$$
The vertex set $V = \{v_1, v_2, \dots, v_N\}$ represents the computational tasks, where each task $v_i$ has a computational load $w_i$. The set of directed edges $E$ denotes task dependencies. A directed edge $e_{i,j} = (v_i, v_j) \in E$ indicates that $v_j$ cannot start until $v_i$ completes, involving a data transmission volume $d_{i,j}$ over a network with bandwidth $B$.
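To quantitatively characterize the divergent requirements for computational resources and communication bandwidth across various DAG applications, we define the Communication-to-Computation Ratio (CCR) as the ratio of the average communication cost $\bar{C}_{\mathrm{comm}}$ to the average computational cost $\bar{C}_{\mathrm{comp}}$ of the entire graph:
$$\mathrm{CCR} = \frac{\bar{C}_{\mathrm{comm}}}{\bar{C}_{\mathrm{comp}}} = \frac{\frac{1}{|E|} \sum_{e_{i,j} \in E} d_{i,j} / \bar{B}}{\frac{1}{|V|} \sum_{v_i \in V} \bar{t}_i},$$
where the numerator represents the average data transfer time across all dependency edges given the average bandwidth $\bar{B}$ within the heterogeneous edge cluster; the denominator represents the average execution time of all tasks. Due to system heterogeneity and Dynamic Voltage and Frequency Scaling (DVFS) capabilities, the execution time of task $v_i$ is not constant. Therefore, we define $\bar{t}_i$ as the expected execution time of task $v_i$ across all available physical configurations in the cluster $P$:
$$\bar{t}_i = \frac{1}{\sum_{k=1}^{M} |F_k|} \sum_{k=1}^{M} \sum_{f \in F_k} \frac{w_i}{\alpha_k f}.$$
In this formula, $w_i$ is the computational load of $v_i$, $M$ is the number of servers, $F_k$ is the set of available frequencies supported by server $p_k$, and $\alpha_k$ is its architectural performance coefficient.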
For any task $v_i$, we define $\mathrm{pred}(v_i)$ as its set of direct predecessors and $\mathrm{succ}(v_i)$ as its set of direct successors. To simplify the model, we assume $G$ has a unique entry task $v_{\mathrm{entry}}$ and a unique exit task $v_{\mathrm{exit}}$. For practical workflows with multiple entries or exits, this can be unified by adding virtual nodes with zero computational load and zero communication overhead.
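The virtual-node unification described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dictionary-based graph representation and the names `add_virtual_entry_exit`, `v_entry`, and `v_exit` are assumptions introduced here.

```python
def add_virtual_entry_exit(loads, edges):
    """Return (loads, edges, entry, exit) with one entry and one exit task.

    loads: dict task -> computational load
    edges: dict (parent, child) -> data transmission volume
    The representation and the node names are illustrative assumptions.
    """
    loads = dict(loads)
    edges = dict(edges)
    has_parent = {child for _, child in edges}
    has_child = {parent for parent, _ in edges}
    # Identify the original entries and exits before adding virtual nodes.
    entries = [t for t in loads if t not in has_parent]
    exits = [t for t in loads if t not in has_child]
    entry, exit_ = "v_entry", "v_exit"  # hypothetical names
    loads[entry] = 0.0  # zero computational load
    loads[exit_] = 0.0
    for t in entries:
        edges[(entry, t)] = 0.0  # zero communication overhead
    for t in exits:
        edges[(t, exit_)] = 0.0
    return loads, edges, entry, exit_
```

Because the virtual nodes carry zero load and zero data, they change neither the makespan nor the energy of any schedule; they only give the recursive timing definitions a single start and end point.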
3.2. Architecture and Communication Model
We model the edge computing cluster as a set of $M$ heterogeneous edge servers, denoted by $P = \{p_1, p_2, \dots, p_M\}$. Due to the diverse hardware architectures (CPU, GPU, or specialized accelerators) of the edge servers, significant disparities exist in their efficiency when processing the same task. In practical implementation, this cluster typically operates under a master–worker paradigm: one resource-sufficient node is designated as the master controller to maintain cluster state and handle scheduling logic, while the remaining heterogeneous devices act as worker nodes to execute dispatched tasks. For any task $v_i$, if it is assigned to processor $p_k$ and executed at a frequency $f$, its execution time $T_{\mathrm{exec}}(v_i, p_k, f)$ is expressed as
$$T_{\mathrm{exec}}(v_i, p_k, f) = \frac{w_i}{\alpha_k \cdot f},$$
where $w_i$ represents the computational workload of the task, and $\alpha_k$ denotes the architectural performance coefficient of server $p_k$, reflecting the processor’s Instructions Per Cycle (IPC) throughput per unit frequency.
Communication Cost Model: The communication overhead between tasks is determined by the data transmission volume and network bandwidth. Let $B$ denote the average transmission bandwidth within the edge cluster. For a dependency edge $e_{i,j}$, if task $v_i$ is assigned to server $p_a$ and task $v_j$ is assigned to server $p_b$, the communication time $T_{\mathrm{comm}}(e_{i,j})$ is defined as
$$T_{\mathrm{comm}}(e_{i,j}) = \begin{cases} d_{i,j} / B, & p_a \neq p_b, \\ 0, & p_a = p_b, \end{cases}$$
where $d_{i,j}$ is the data transmission volume on the edge. When parent and child tasks are scheduled on the same processor, data is exchanged via shared memory; thus, the communication overhead is considered to be zero.
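The two timing models of this subsection can be written as small helper functions. This is a sketch under stated assumptions: the execution time is taken as workload divided by the product of the performance coefficient and the frequency, which is how the text describes the coefficient; the function and parameter names are hypothetical.

```python
def exec_time(load, alpha, freq):
    """Execution time of a task, assuming time = load / (alpha * freq).

    alpha is the architectural performance coefficient (work per cycle),
    freq the operating frequency; both must be positive.
    """
    return load / (alpha * freq)


def comm_time(data_volume, bandwidth, src_server, dst_server):
    """Transfer time of one dependency edge.

    Tasks on the same server exchange data through shared memory,
    so the cost is zero in that case.
    """
    if src_server == dst_server:
        return 0.0
    return data_volume / bandwidth
```

The conditional in `comm_time` is what makes communication cost placement-dependent: the same edge is free or expensive depending on where the scheduler puts its two endpoints.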
3.3. Power and Energy Consumption Model
To support energy-efficiency optimization, we assume that each edge server supports Dynamic Voltage and Frequency Scaling (DVFS) technology.
DVFS Frequency Set: For each server $p_k$, the processor supports a set of discrete voltage–frequency pairs. The available frequency set for $p_k$ is defined as $F_k = \{f_{k,1}, f_{k,2}, \dots, f_k^{\max}\}$, where $f_k^{\max}$ is the maximum clock frequency of the processor.
Following widely adopted DVFS power models, the instantaneous power $P_k(f)$ of edge server $p_k$ operating at frequency $f$ is modeled according to the well-known Cubic Law:
$$P_k(f) = P_k^{\mathrm{static}} + \kappa_k f^3,$$
where $P_k^{\mathrm{static}}$ is the static baseline power, and $\kappa_k$ is a hardware-specific constant reflecting the processor’s capacitive characteristics.
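The total system energy consumption $E_{\mathrm{total}}$ is the sum of the energy consumed during task execution and the energy consumed during idle periods. For task $v_i$ running on server $p_k$ at frequency $f$, its execution energy is defined as
$$E_{\mathrm{exec}}(v_i, p_k, f) = P_k(f) \cdot T_{\mathrm{exec}}(v_i, p_k, f),$$
where $P_k(f)$ is the instantaneous power of the server and $T_{\mathrm{exec}}$ is the execution time of the task. The objective function for the total energy consumption of the entire edge cluster during the scheduling cycle is
$$E_{\mathrm{total}} = \sum_{v_i \in V} E_{\mathrm{exec}}(v_i, p_{k(i)}, f_i) + \sum_{k=1}^{M} P_k^{\mathrm{static}} \cdot T_k^{\mathrm{idle}},$$
where $p_{k(i)}$ and $f_i$ are the server and frequency assigned to $v_i$, $P_k^{\mathrm{static}}$ is the static power of server $p_k$, and $T_k^{\mathrm{idle}}$ is the idle wait time of server $p_k$.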
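The energy model can be sketched as follows. The code assumes the cubic power form (static power plus a constant times the cube of the frequency), the execution-time form of workload over coefficient times frequency, and idle energy charged at the static power; the names `power`, `exec_energy`, and `total_energy` are illustrative.

```python
def power(p_static, kappa, freq):
    """Instantaneous power under the cubic law: static part + kappa * f^3."""
    return p_static + kappa * freq ** 3


def exec_energy(load, alpha, freq, p_static, kappa):
    """Energy of one task: power at freq times its execution time."""
    return power(p_static, kappa, freq) * load / (alpha * freq)


def total_energy(exec_energies, idle_times, static_powers):
    """Cluster energy: all execution energies plus static energy while idle.

    idle_times[k] and static_powers[k] refer to server k.
    """
    idle = sum(t * p for t, p in zip(idle_times, static_powers))
    return sum(exec_energies) + idle
```

Under these assumptions the execution energy behaves as `p_static * load / (alpha * f) + kappa * load * f**2 / alpha`, so the lowest frequency is not automatically the most energy-efficient one: running slower saves dynamic energy but pays static power for longer.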
3.4. Formal Problem Description
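The scheduling of DAG tasks in a heterogeneous edge environment is formulated as finding an optimal mapping of tasks to processors, determining their execution sequence, and allocating operating frequencies. For any task $v_i$ assigned to processor $p_k$, its timing constraints are governed by the completion status of its predecessors and the resource availability of the assigned processor.
Following standard task scheduling semantics, the Data Ready Time (DRT), Earliest Start Time (EST), and Earliest Finish Time (EFT) for a task $v_i$ on processor $p_k$ are calculated recursively:
$$\mathrm{DRT}(v_i, p_k) = \max_{v_j \in \mathrm{pred}(v_i)} \left( \mathrm{AFT}(v_j) + T_{\mathrm{comm}}(e_{j,i}) \right),$$
$$\mathrm{EST}(v_i, p_k) = \max\left( \mathrm{Avail}(p_k), \mathrm{DRT}(v_i, p_k) \right),$$
$$\mathrm{EFT}(v_i, p_k) = \mathrm{EST}(v_i, p_k) + T_{\mathrm{exec}}(v_i, p_k, f),$$
where $\mathrm{AFT}(v_j)$ denotes the actual finish time of predecessor $v_j$, $T_{\mathrm{comm}}(e_{j,i})$ is the communication time of the edge (zero when both tasks share a processor), $T_{\mathrm{exec}}(v_i, p_k, f)$ is the execution time at the selected frequency $f$, and $\mathrm{Avail}(p_k)$ is the earliest time at which $p_k$ becomes ready to execute a new task.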
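The recursion for DRT, EST, and EFT can be illustrated with two short functions. This is a sketch of the standard list-scheduling semantics described above; the argument layout (dictionaries keyed by predecessor) is an assumption made for the example.

```python
def data_ready_time(preds, finish, placed_on, comm, server):
    """Time at which all inputs of a task are available on `server`.

    preds:     iterable of predecessor ids
    finish:    dict predecessor -> actual finish time
    placed_on: dict predecessor -> server it ran on
    comm:      dict predecessor -> transfer time if servers differ
    """
    return max(
        (finish[j] + (0.0 if placed_on[j] == server else comm[j]) for j in preds),
        default=0.0,  # an entry task has no predecessors
    )


def est_eft(drt, avail, exec_t):
    """Earliest start and finish under append-only placement."""
    est = max(drt, avail)
    return est, est + exec_t
```

Evaluating `data_ready_time` once per candidate server shows why placement matters: co-locating a task with its slowest predecessor removes that predecessor's transfer time from the maximum.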
The primary objective of this study is to address a bi-objective optimization problem: minimizing the application makespan while simultaneously reducing total system energy consumption.
The makespan minimization aims to minimize the completion time of the entire application, which is determined by the actual finish time of the exit task $v_{\mathrm{exit}}$:
$$\min \; \mathrm{Makespan} = \mathrm{AFT}(v_{\mathrm{exit}}).$$
The total energy minimization aims to minimize the total energy consumption, which includes the dynamic execution energy of all tasks and the static energy consumed during idle periods:
$$\min \; E_{\mathrm{total}}.$$
Given a DAG $G$ and a heterogeneous cluster $P$, the goal is to identify an optimal scheduling strategy $S = (\pi, \sigma, \phi)$, where $\pi$ is the task-to-processor mapping, $\sigma$ is the execution order, and $\phi$ is the frequency allocation, that minimizes the objective vector:
$$\min_{S} \; \left[ \mathrm{Makespan}(S), \; E_{\mathrm{total}}(S) \right].$$
4. HERO Framework Design
This section elaborates on the design details of the HERO framework. The framework comprises two core phases: communication-aware priority learning and bottleneck-aware resource allocation.
4.1. Framework Overview
As shown in Figure 2 and Algorithm 1, HERO establishes a closed-loop identify–utilize–reclaim cascade process. In the identification phase, an MLP predictor is used to unravel complex dependencies and isolate exploitable time redundancy. In the utilization phase, we exploit this time redundancy through a hybrid budgeting mechanism. This mechanism converts non-critical time slack into energy efficiency via DVFS. This conversion leads to scheduling fragmentation. Finally, the reclaim phase acts as a compensator by employing a hole-filling strategy. This strategy recovers these fragmented gaps for small tasks, thereby maximizing resource density through compute–communication overlap.
From an implementation perspective, to ensure strict online real-time performance, the deep-enhanced MLP predictor is trained offline. During the online phase, the master controller only performs a lightweight forward inference, strictly bounding the decision-making overhead to the microsecond level. The master then dispatches the parsed subtasks and specific DVFS frequency commands to designated worker nodes via lightweight remote procedure calls (e.g., gRPC), fully bridging the theoretical algorithm with practical edge orchestration.
Algorithm 1: HERO: Hybrid Energy-aware Ranking and Optimization
4.2. Task Ranking Based on Enhanced MLP
To extract nonlinear scheduling features from high-dimensional heterogeneous DAGs with low inference latency, HERO employs a Deep Enhanced Multilayer Perceptron (MLP).
4.2.1. DAG Feature Extraction
For each task $v_i$ in the DAG, we extract an 11-dimensional feature vector $\mathbf{x}_i \in \mathbb{R}^{11}$, as detailed in Table 2. By encoding macro-level path criticality and micro-level topological attributes into a unified linear vector space, this feature set effectively captures the topological integrity of the DAG.
4.2.2. Perturbation-Based Sensitivity Generation
In learning-based DAG scheduling research, obtaining high-quality supervision signals is a key bottleneck for model performance. Existing imitation learning methods typically use static priority sequences generated by heuristic algorithms (such as HEFT) as training labels. This caps the learned model’s performance at that of the heuristic it imitates. To overcome this limitation, HERO proposes a perturbation-based sensitivity analysis mechanism. The model learns the global time sensitivity of each task, that is, the extent to which local execution fluctuations of that task degrade the completion time of the entire system.
We define a sensitivity label $s_i$ for task $v_i$ as the marginal effect of load variation on global completion time. As shown in Figure 3, for each instance $G$ in the training set, the label generation process includes the following three standardized steps:
First, using a standard list scheduling algorithm (the HEFT algorithm is used in this paper) to schedule DAG $G$ under the target heterogeneous cluster configuration, a baseline scheduling scheme $S_{\mathrm{base}}$ and its corresponding baseline completion time $M_{\mathrm{base}}$ are obtained:
$$S_{\mathrm{base}} = \mathrm{HEFT}(G), \qquad M_{\mathrm{base}} = \mathrm{Makespan}(S_{\mathrm{base}}).$$
Then, for each task $v_i$ in the DAG, a perturbed instance $G_i'$ is constructed. In $G_i'$, we scale the worst-case execution time $w_i$ of task $v_i$ by a significant perturbation factor $\delta$ (the experimental setup simulates a doubling of the load), leaving all other tasks unchanged:
$$w_i' = \delta \cdot w_i.$$
Subsequently, $G_i'$ is re-evaluated using the same scheduling algorithm to obtain the perturbed completion time $M_i'$.
Finally, the absolute impact of the delay of task $v_i$ on the system is $\Delta M_i = M_i' - M_{\mathrm{base}}$. To eliminate the dimensional differences caused by different DAG sizes, we define the final training label $s_i$ as the normalized marginal deterioration rate:
$$s_i = \frac{M_i' - M_{\mathrm{base}}}{M_{\mathrm{base}}}.$$
If $s_i > 0$, it indicates that the task is in some form of bottleneck state, and the larger the value, the greater its impact on the system completion time. If $s_i = 0$, it indicates that the disturbance of the task is absorbed by the system’s parallel gaps or communication latency.
Through this mechanism, HERO’s MLP predictor learns the global time sensitivity of each task, enabling HERO to dynamically identify hidden bottlenecks masked by traditional static CCR metrics during the inference phase.
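The label-generation loop can be sketched end to end. To keep the example short, HEFT is replaced here by a simplified stand-in: tasks are visited in a given topological order and each is appended to the processor that gives the earliest finish time, with no rank computation and no insertion. The doubling factor, the data layout, and all names are assumptions made for illustration.

```python
def makespan(loads, edges, speeds, order):
    """Greedy earliest-finish list scheduling (simplified stand-in for HEFT).

    loads:  dict task -> workload
    edges:  dict (parent, child) -> transfer time if on different processors
    speeds: list of processor speeds
    order:  tasks in a topological order
    """
    avail = [0.0] * len(speeds)
    finish, where = {}, {}
    for t in order:
        best = None
        for k, speed in enumerate(speeds):
            drt = max(
                (finish[p] + (0.0 if where[p] == k else c)
                 for (p, ch), c in edges.items() if ch == t),
                default=0.0,
            )
            eft = max(drt, avail[k]) + loads[t] / speed
            if best is None or eft < best[0]:
                best = (eft, k)
        finish[t], where[t] = best
        avail[best[1]] = best[0]
    return max(finish.values())


def sensitivity_labels(loads, edges, speeds, order, factor=2.0):
    """Normalized makespan degradation when each task's load is scaled."""
    base = makespan(loads, edges, speeds, order)
    labels = {}
    for t in loads:
        perturbed = dict(loads)
        perturbed[t] = loads[t] * factor  # perturb one task at a time
        labels[t] = (makespan(perturbed, edges, speeds, order) - base) / base
    return labels
```

For a chain `a -> b` plus an independent short task `c` on two equal processors, the chain tasks receive positive labels while `c` receives zero, because its doubled load still fits in the idle time of the second processor.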
4.2.3. Model Design and Optimization Strategies
To address the challenges of high-dimensional non-linearity and sparsity inherent in DAG task scheduling features, we designed a deep funnel-like feature extraction network coupled with a robust training mechanism:
Funnel-like Network Architecture: To capture the implicit coupling among topological features (such as those involving communication overhead), we construct a five-layer descending network. Specifically, a funnel-shaped information bottleneck structure is utilized. Combined with batch normalization and ReLU activation functions in the first three layers, it can effectively filter out noise and distill high-level abstract features. A decaying Dropout strategy is adopted to prevent structural overfitting while maintaining the stability of deep semantic representations.
Dynamic Optimization and Training Strategy: The model uses the Mean Squared Error (MSE) loss function to heavily penalize prediction deviations, forcing it to quickly lock onto key path nodes. The AdamW optimizer is selected to enhance generalization ability. We also integrate a ReduceLROnPlateau learning rate scheduler (halving the learning rate if validation loss stagnates for five consecutive epochs) and an early stopping mechanism (training stops if there is no improvement for 15 consecutive epochs).
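The interaction between the plateau-based learning rate schedule and early stopping can be sketched without any framework. The patience values (5 and 15) and the halving factor come from the text; the initial learning rate, the function name, and the exact counting rule are assumptions, and framework schedulers such as PyTorch's `ReduceLROnPlateau` may count patience slightly differently.

```python
def run_schedule(val_losses, lr=1e-3, lr_patience=5, stop_patience=15, factor=0.5):
    """Replay a validation-loss curve; return (stopping epoch, final lr).

    lr is a hypothetical starting value; the source does not state one.
    """
    best = float("inf")
    since_best = 0    # epochs since the last improvement (early stopping)
    since_change = 0  # stagnant epochs since the last improvement or LR cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_change = loss, 0, 0
            continue
        since_best += 1
        since_change += 1
        if since_change >= lr_patience:
            lr *= factor  # halve the learning rate on a plateau
            since_change = 0
        if since_best >= stop_patience:
            return epoch, lr  # early stop
    return len(val_losses), lr
```

With these settings a run that stops improving has its learning rate halved three times before early stopping triggers, which gives the optimizer several chances to escape a plateau at finer step sizes.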
4.2.4. Model Performance Analysis
To ensure the robustness of our model, we constructed a comprehensive dataset containing 100,000 DAG task samples. This dataset is randomly divided into a training set, validation set, and test set in a 90:5:5 ratio. As shown in Figure 4a, the training loss and validation loss converge rapidly and stabilize after approximately 30 epochs. The slight gap between the two curves indicates that no significant overfitting has occurred, demonstrating the effectiveness of batch normalization and the Dropout mechanism.
Figure 4b shows the comparison between actual sensitivity and predicted values on the test set. The randomly selected 5000 data points are evenly distributed around the $y = x$ diagonal, indicating that the MLP can accurately predict the potential impact of tasks in heterogeneous environments.
We selected four classic regression models (linear regression, decision tree, random forest, XGBoost) to verify the necessity of deep learning, and conducted an ablation study (2-layer, 5-layer, and 12-layer MLPs) to justify the architectural depth of HERO-MLP. All models were trained on the same 11-dimensional features and evaluated using MSE, MAE, $R^2$, inference latency, and parameter count.
As shown in Table 3 and Table 4, classic models fail to capture the complex non-linear relationships, with the best ensemble method (XGBoost) only reaching an $R^2$ of 0.5597. In the neural network ablation study, the five-layer HERO-MLP achieves the highest prediction accuracy (MSE = 0.005880). Compared to a shallow two-layer MLP (limited feature extraction) and a 12-layer MLP (where the added complexity yields diminishing returns), the five-layer HERO-MLP strikes the best balance. It delivers superior accuracy while maintaining an ultra-low inference latency (199.09 μs) and a compact parameter size (50,049), which is crucial for real-time edge scheduling.
4.2.5. Analysis of Key Node Identification Capability
To investigate the actual performance of the MLP in scheduling decisions, we designed a stress test to observe its ability to identify key nodes. Traditional heuristic list scheduling algorithms (such as the ranking strategy in HEFT) primarily rely on static Communication-to-Computation Ratio (CCR) to construct task priorities. To verify whether the MLP predictor in the HERO framework truly learns a global bottleneck awareness capability beyond simple data memorization, we conducted a targeted evaluation.
The experiment was deployed in a heterogeneous computing environment containing two types of computing nodes: high-performance cores, simulating the master computing node in an edge cluster, and medium-performance cores, simulating auxiliary computing nodes or low-power cores.
To simulate stress load scenarios, we constructed a set of synthetic DAG datasets with bimodal distribution characteristics. This dataset contains two types of tasks with opposing properties:
Computationally intensive isolated tasks (Type-C): High computational load with low communication overhead, representing potential “computational bottlenecks.”
Communication-intensive coupled tasks (Type-D): Low computational load but requiring significant data transfer, representing “communication bottlenecks.”
Experimental results reveal the fundamental difference in scheduling decision logic between the benchmark algorithm and the MLP, as shown in Figure 5. Faced with the substantial communication overhead generated by Type-D tasks, the benchmark algorithm schedules all such tasks to the high-performance core to eliminate cross-node data transmission latency. This locally greedy strategy leads to scarce high-performance resources being occupied by a large number of low-computation-value tasks. Consequently, when Type-C tasks, which truly determine the global completion time, arrive, they are relegated to the lower-speed core, ultimately deteriorating the total makespan.
In contrast, the MLP predictor, by learning global time sensitivity, successfully identifies that although Type-C tasks lack explicit communication constraints, their heavy computational burden constitutes the global critical path. Therefore, HERO assigns Type-C tasks a higher scheduling priority, allowing them to preempt the computational resources of the high-performance core.
As shown in Figure 6, across test cases with different graph depths and parallelism, HERO’s normalized makespan is consistently lower than that of the benchmark. Particularly in deep graph structures, HERO’s average normalized makespan is 0.8468, an improvement of approximately 15.3%; in high-concurrency scenarios, the average normalized makespan is 0.8479, an improvement of 15.2%. These results demonstrate that HERO possesses global bottleneck identification capabilities that surpass local greedy strategies, validating the model’s generalization effectiveness in extremely heterogeneous environments.
4.2.6. Model Interpretability and Microbehavior Analysis
Based on permutation feature importance analysis, we verified the effectiveness of the HERO feature set and the nonlinear learning capability of deep MLPs (as shown in Figure 7). Experimental results show that computational load and communication overhead make the dominant contribution to prediction accuracy, confirming the importance of communication overhead in cross-node data transmission in heterogeneous edge environments. Meanwhile, the model also assigns high importance to the number of paths and the exit distance, indicating that it has learned to prioritize topological intersections and key nodes in long chains.
4.3. Hybrid Budget Allocation
After determining the task scheduling order, HERO introduces a dynamic hybrid budget mechanism to address the trade-off between energy consumption and performance. Unlike traditional methods that allocate resources solely based on static load, this mechanism combines topology importance and computational volume for global budget planning and introduces runtime energy recovery strategies.
4.3.1. Task Energy Consumption Boundary and Global Energy Margin Definition
Before allocating resources, it is necessary to first define the energy consumption boundaries of each task and the entire task flow within the current heterogeneous edge cluster.
For any task $v_i$, considering that it can be executed on any processor $p_k \in P$ at any frequency $f \in F_k$, we can pre-calculate the upper and lower bounds of the task’s energy consumption.
Minimum execution energy consumption: $E_{\min}(v_i)$ represents the lowest energy consumption value that task $v_i$ can achieve among all possible combinations of processors and frequencies:
$$E_{\min}(v_i) = \min_{p_k \in P,\; f \in F_k} E_{\mathrm{exec}}(v_i, p_k, f).$$
Maximum execution energy consumption: $E_{\max}(v_i)$ represents the highest energy consumption that task $v_i$ may generate at the highest performance configuration:
$$E_{\max}(v_i) = \max_{p_k \in P} E_{\mathrm{exec}}(v_i, p_k, f_k^{\max}).$$
System minimum energy consumption: $E_{\min}^{\mathrm{sys}}$ is the minimum energy required to complete the entire application, which is the total energy required when all tasks are executed at their respective most energy-efficient configurations:
$$E_{\min}^{\mathrm{sys}} = \sum_{v_i \in V} E_{\min}(v_i).$$
System peak energy consumption: $E_{\max}^{\mathrm{sys}}$ is the upper limit of energy consumption when pursuing maximum performance:
$$E_{\max}^{\mathrm{sys}} = \sum_{v_i \in V} E_{\max}(v_i).$$
Energy Constraints and Global Margin: To balance performance and energy consumption, the system sets a total energy consumption constraint $E_{\mathrm{constraint}}$. This constraint must lie within the system’s physically feasible region, satisfying
$$E_{\min}^{\mathrm{sys}} \le E_{\mathrm{constraint}} \le E_{\max}^{\mathrm{sys}}.$$
Under this constraint, we define the global energy reserve $E_{\mathrm{reserve}}$ as the additional energy pool that the system can use for higher performance beyond meeting the minimum operating requirements:
$$E_{\mathrm{reserve}} = E_{\mathrm{constraint}} - E_{\min}^{\mathrm{sys}}.$$
To allocate the energy margin in a more principled way, we no longer rely solely on computational cost, but instead define a hybrid importance score $S_i$. This score combines the task’s topological criticality and relative workload:
$$S_i = \lambda \cdot C_i + (1 - \lambda) \cdot W_i.$$
Topological criticality ($C_i$): Reflects the task’s position on the critical path of the DAG (based on its rank). Relative workload ($W_i$): Reflects the size of the task itself. $\lambda \in [0, 1]$ is a balancing factor used to adjust the weights of the two.
4.3.3. Initial Budget Allocation
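Based on the mixed score, the global margin $E_{\mathrm{reserve}}$ is proportionally allocated to each task to form the initial budget $E_{\mathrm{budget}}(v_i)$:
$$E_{\mathrm{budget}}(v_i) = E_{\min}(v_i) + \frac{S_i}{\sum_{v_j \in V} S_j} \cdot E_{\mathrm{reserve}}.$$
During online scheduling, HERO employs a dynamic budget reclamation strategy. The accumulated surplus $E_{\mathrm{surplus}}$ is defined as the sum of unused budgets from previously scheduled tasks. The actual available dynamic upper limit $E_{\mathrm{limit}}(v_i)$ for the current task $v_i$ is
$$E_{\mathrm{limit}}(v_i) = E_{\mathrm{budget}}(v_i) + E_{\mathrm{surplus}}.$$
The scheduler selects the highest frequency, $f^{*}$, on the chosen processor $p_k$, which meets the dynamic upper limit:
$$f^{*} = \max \{ f \in F_k \mid E_{\mathrm{exec}}(v_i, p_k, f) \le E_{\mathrm{limit}}(v_i) \}.$$
After the task is executed, update the accumulated balance:
$$E_{\mathrm{surplus}} \leftarrow E_{\mathrm{limit}}(v_i) - E_{\mathrm{exec}}(v_i, p_k, f^{*}).$$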
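The budget allocation and reclamation loop can be sketched as follows. Several details are assumptions made for illustration rather than taken from the paper: each task's budget is its minimum energy plus a score-proportional share of the reserve, the fallback when no frequency fits the limit is the lowest-energy setting, and all function names are hypothetical.

```python
def initial_budgets(e_min, scores, reserve):
    """Minimum energy per task plus a score-proportional share of the reserve."""
    total = sum(scores.values())
    return {t: e_min[t] + reserve * scores[t] / total for t in e_min}


def pick_frequency(energy_at, limit):
    """Highest frequency whose execution energy stays within the limit.

    energy_at: dict frequency -> execution energy on the chosen processor.
    Falls back to the lowest-energy frequency if nothing fits (assumption).
    """
    feasible = [f for f, e in energy_at.items() if e <= limit]
    if feasible:
        return max(feasible)
    return min(energy_at, key=energy_at.get)


def schedule_with_reclaim(order, budgets, energy_tables):
    """Walk tasks in priority order, carrying unused budget forward."""
    surplus, chosen = 0.0, {}
    for t in order:
        limit = budgets[t] + surplus
        f = pick_frequency(energy_tables[t], limit)
        chosen[t] = f
        surplus = max(0.0, limit - energy_tables[t][f])
    return chosen, surplus
```

The carried surplus is what lets a later task run at a frequency its own budget would not cover, which is the runtime energy recovery the text refers to.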
4.3.4. Effectiveness of the Hybrid Budgeting Mechanism
To further explore the effectiveness of HERO’s proposed two-factor hybrid weighting mechanism, we designed a set of controlled variable experiments. The experiments aim to demonstrate a core issue: under identical energy consumption constraints, HERO’s hybrid strategy outperforms single-dimensional allocation strategies.
We configured HERO’s budget allocation module as three strategies and compared their completion times (makespan) under the same total energy consumption constraint:
Rank-Only: Allocation of budget based solely on the task’s position on the critical path, with the balance factor placing all weight on topological criticality.
Workload-Only: Allocation of budget based solely on the task’s base energy consumption (i.e., computational load), with the balance factor placing all weight on relative workload.
Hybrid (Ours): HERO’s default configuration, considering both topological criticality and relative workload.
Figure 8 shows the performance comparison of the three strategies across 10 groups of DAG instances. Experimental data reveals that Rank-Only can easily lead to resource starvation for heavily loaded tasks on non-critical paths, causing new cascading blockages due to excessive frequency reduction, while Workload-Only ignores global dependencies, wasting budget on non-critical tasks with high slack, resulting in insufficient acceleration of critical nodes. In contrast, HERO’s hybrid mechanism successfully balances topological criticality and computational volume. Experiments show that, under the same energy consumption, HERO’s completion time is improved by 8% and 2% respectively compared to the aforementioned single strategies, demonstrating that jointly considering topology and load is key to achieving energy-efficient scheduling in heterogeneous environments.
4.4. Processor Selection Based on Hole-Filling Strategy
4.4.1. Hole-Filling Strategy
To overcome the resource fragmentation problem caused by heterogeneous communication latency, HERO introduces a hole-filling strategy to actively reclaim idle time slices on the processors.
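Define the $m$-th idle time slice on processor $p_k$ as the interval $[t_m^{\mathrm{start}}, t_m^{\mathrm{end}}]$. For task $v_i$ to be safely inserted into this slice, the following timing constraint must be met:
$$\max\left( t_m^{\mathrm{start}}, \mathrm{DRT}(v_i, p_k) \right) + T_{\mathrm{exec}}(v_i, p_k, f) \le t_m^{\mathrm{end}}.$$
Based on the hole-filling strategy, the earliest start time (EST) of the task is the earliest time among all feasible slots:
$$\mathrm{EST}(v_i, p_k) = \min_{m \text{ feasible}} \max\left( t_m^{\mathrm{start}}, \mathrm{DRT}(v_i, p_k) \right).$$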
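The slot search can be sketched as a single pass over the busy intervals of one processor. This is a generic insertion-based search written for illustration; the interval representation and the name `earliest_start` are assumptions.

```python
def earliest_start(busy, drt, exec_t):
    """Earliest feasible start on one processor with gap insertion.

    busy:   sorted, non-overlapping (start, end) intervals already scheduled
    drt:    data ready time of the task on this processor
    exec_t: execution time of the task at the chosen frequency
    """
    cursor = 0.0  # start of the idle slot currently being examined
    for start, end in busy:
        candidate = max(cursor, drt)
        if candidate + exec_t <= start:
            return candidate  # the task fits before the next busy interval
        cursor = end
    # No gap fits: append after the last scheduled task.
    return max(cursor, drt)
```

For a processor busy during `(0, 2)` and `(5, 8)`, a task with data ready at time 1 and execution time 3 starts at time 2 inside the gap, whereas a tail-append policy would delay it to time 8.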
4.4.2. Verify the Effectiveness of the Hole-Filling Strategy
To demonstrate the effectiveness of reclaiming fragmented idle time in heterogeneous edge clusters, we designed a controlled ablation experiment with the task mapping strategy as the sole independent variable. All other scheduling components remained completely consistent in both schemes.
Baseline Scheme: This strategy strictly follows the tail-append principle. For any task $v_i$ assigned to processor $p_k$, its earliest start time (EST) is restricted to no earlier than the processor’s current available time $\mathrm{Avail}(p_k)$, i.e., the completion time of the last scheduled task:
$$\mathrm{EST}(v_i, p_k) = \max\left( \mathrm{Avail}(p_k), \mathrm{DRT}(v_i, p_k) \right).$$
The experimental results are shown in Figure 9. The results demonstrate that the hole-filling strategy achieves synergistic optimization of latency and energy consumption. In terms of execution efficiency, this strategy effectively compresses the system’s critical path by dynamically filling tasks into fragmented time, resulting in an average completion time reduction of 6.96%. Regarding energy efficiency, compared to the static power waste caused by processor idling in the traditional Append-Only mode, the hole-filling strategy achieves energy savings of up to 9.46% by eliminating invalid waiting periods.
5. Experiments
To fully verify the effectiveness of the HERO framework in heterogeneous edge computing environments, we built a high-fidelity simulation platform in Python 3.8.20 and conducted extensive comparative experiments against four mainstream scheduling algorithms from the current literature.
5.1. Experimental Setup
5.1.1. Task Generation
To evaluate the performance of the scheduling algorithm in different application scenarios, we use a parameter-controlled random DAG generator to construct a diverse benchmark set. This generation model strictly follows the mathematical formulations below:
Topology generation employs a hierarchical generation method to construct the DAG structure, fundamentally guaranteeing the acyclic property. We divide the vertex set
into
L disjoint hierarchical subsets:
Graph depth: The number of layers L follows a uniform distribution , reflecting the serial length of the task flow.
Graph width: The parallelism (number of nodes) of each layer follows a distribution .
Connection constraint: For any edge , if and , then the hierarchical constraint must be satisfied.
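The hierarchical generation rules above can be sketched as follows. This is a minimal illustration rather than the generator used in the experiments: the depth and width bounds are placeholder values, the name `generate_layered_dag` is hypothetical, and the sketch assumes that an edge may connect any earlier layer to any later layer with a fixed probability, which is one way to satisfy the hierarchical constraint.

```python
import random
from typing import List, Tuple


def generate_layered_dag(
    depth_range: Tuple[int, int] = (8, 13),   # assumed bounds for the layer count L
    width_range: Tuple[int, int] = (10, 30),  # assumed bounds for nodes per layer
    p_connect: float = 0.3,                   # connection probability
    seed: int = 0,
) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Return (layers, edges); every edge goes from a lower to a higher layer."""
    rng = random.Random(seed)
    num_layers = rng.randint(*depth_range)

    # Partition consecutive vertex ids into disjoint layers.
    layers: List[List[int]] = []
    next_id = 0
    for _ in range(num_layers):
        width = rng.randint(*width_range)
        layers.append(list(range(next_id, next_id + width)))
        next_id += width

    # Only forward edges (layer a -> layer b with a < b) are allowed,
    # so the resulting graph cannot contain a cycle.
    edges: List[Tuple[int, int]] = []
    for a in range(num_layers - 1):
        for b in range(a + 1, num_layers):
            for u in layers[a]:
                for v in layers[b]:
                    if rng.random() < p_connect:
                        edges.append((u, v))
    return layers, edges


layers, edges = generate_layered_dag(seed=42)
print(len(layers), sum(len(layer) for layer in layers), len(edges))
```

Because every edge points from a lower layer index to a higher one, the layer order is itself a valid topological order, which is why no explicit cycle check is needed.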
For computational load generation, we employ the UUniFast algorithm. This algorithm performs uniform sampling within a simplex space defined by the total system utilization
, generating a set of unbiased utilization vectors
that satisfy the following conditions:
Subsequently, the computational load
of task
is jointly determined by its allocated utilization
and task period
(
), ensuring the statistical uniformity of the load distribution.
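For reference, the standard UUniFast procedure can be sketched as below. The sketch assumes that the load of a task is the product of its utilization and its period, which is the usual real-time convention; the helper name `workloads` and the example period values are illustrative only.

```python
import random
from typing import List


def uunifast(n: int, total_u: float, rng: random.Random) -> List[float]:
    """Draw n non-negative utilizations that sum to total_u (UUniFast)."""
    utils: List[float] = []
    remaining = total_u
    for i in range(1, n):
        # Split off one task's share so the rest stays uniform on the simplex.
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils


def workloads(utils: List[float], periods: List[float]) -> List[float]:
    """Assumed mapping: load = utilization * period."""
    return [u * t for u, t in zip(utils, periods)]


rng = random.Random(7)
u = uunifast(n=5, total_u=0.8, rng=rng)
print(u, sum(u))
print(workloads(u, periods=[100.0, 200.0, 50.0, 400.0, 100.0]))
```

The utilizations always sum to the requested total (up to floating-point rounding), so the overall system load is controlled by a single parameter while individual task loads remain random.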
To cover diverse application characteristics, we introduce CCR as a control parameter. The average computational load of the graph is calculated as
The data transfer volume
of edge
is generated based on the following random process:
Here,
is a random perturbation factor used to simulate the random fluctuations in communication overhead between different tasks in a real environment.
5.1.2. Simulation Platform and Tools
The proposed HERO framework is implemented in Python 3; specifically, the communication-aware predictor of HERO uses the PyTorch 2.4.1+cu118 framework for offline training and forward inference. Data aggregation and visualization are handled by the Pandas 2.0.3 and Seaborn 0.13.2 libraries. Furthermore, to ensure computational efficiency, the NSGA-II benchmark was independently implemented in C++ and dynamically invoked by the Python-based main scheduler during the simulation.
To construct a representative modern heterogeneous edge environment, we employ three distinct types of processors: Raspberry Pi 4 (Raspberry Pi Foundation, Cambridge, UK), Jetson Orin Nano (NVIDIA Corporation, Santa Clara, CA, USA), and Intel Xeon D (Intel Corporation, Santa Clara, CA, USA). Their corresponding frequencies (
f) and power consumption profiles (
P) across different levels are detailed in
Table 5. For systematic comparison and frequency scaling analysis, the processor frequencies are normalized across five discrete levels with a step size of 0.2. The cluster consists of three heterogeneous nodes, and the static leakage power of each processor is set to 10% of its peak dynamic power at the maximum frequency level (1.0).
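As a worked illustration of these settings, the following sketch derives the normalized frequency levels and the static power of each node from the values in Table 5. It assumes that the power column of Table 5 is the dynamic power at each level; the dictionary layout and function names are illustrative.

```python
from typing import Dict, List, Tuple

# (frequency in MHz, power in mW) per level, transcribed from Table 5.
PROFILES: Dict[str, List[Tuple[int, int]]] = {
    "Raspberry Pi 4": [(300, 2500), (600, 3200), (900, 4200), (1200, 5500), (1500, 7000)],
    "Jetson Orin Nano": [(302, 5000), (605, 7000), (907, 10000), (1209, 15000), (1512, 25000)],
    "Xeon D": [(600, 20000), (1200, 28000), (1800, 38000), (2400, 50000), (3000, 65000)],
}

STATIC_RATIO = 0.10  # static power as a fraction of the power at level 1.0


def normalized_levels(name: str) -> List[float]:
    """Frequency of each level divided by the maximum frequency, to one decimal."""
    f_max = PROFILES[name][-1][0]
    return [round(f / f_max, 1) for f, _ in PROFILES[name]]


def static_power_mw(name: str) -> float:
    """Static power: 10% of the power drawn at the maximum frequency level."""
    peak_mw = PROFILES[name][-1][1]
    return STATIC_RATIO * peak_mw


for name in PROFILES:
    print(name, normalized_levels(name), static_power_mw(name))
```

Under this reading, the static power is 700 mW for the Raspberry Pi 4, 2500 mW for the Jetson Orin Nano, and 6500 mW for the Xeon D.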
To ensure the objectivity and reproducibility of the benchmark set, during the topology generation phase, the average connection probability of our DAG is set to 0.3, and the basic computational load of tasks is randomly sampled between 10 and 3000. The Communication-to-Computation Ratio (CCR) was dynamically sampled between a wide range of . The energy budget factor is empirically set to 0.85 (i.e., ) during the resource allocation phase.
5.1.3. Evaluation Metrics
To quantitatively evaluate the performance of the proposed HERO framework and baseline algorithms, we employ two primary absolute metrics defined in
Section 3.4: Makespan (
) and Total Energy Consumption (
). Furthermore, to clearly illustrate the relative advantages of HERO, we define the Performance Improvement Ratio (PIR) for both latency and energy. Let
denote the metric value (Makespan or Energy) obtained by the benchmark algorithm, and
denote the corresponding value obtained by HERO. The improvement percentage is calculated as
A positive PIR indicates that HERO outperforms the baseline algorithm, while a negative value indicates performance degradation. All reported results are the arithmetic mean over multiple random instances, which reduces the influence of statistical outliers.
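A minimal sketch of the metric is given below. It assumes that the improvement is normalized by the baseline value, which matches the sign convention stated above; the function names `pir` and `mean_pir` are illustrative.

```python
from typing import List, Tuple


def pir(baseline: float, hero: float) -> float:
    """Improvement of HERO over a baseline, in percent (positive = HERO better)."""
    if baseline <= 0:
        raise ValueError("baseline metric must be positive")
    return 100.0 * (baseline - hero) / baseline


def mean_pir(pairs: List[Tuple[float, float]]) -> float:
    """Arithmetic mean of per-instance PIR values over (baseline, hero) pairs."""
    return sum(pir(b, h) for b, h in pairs) / len(pairs)


print(pir(baseline=100.0, hero=80.0))   # 20.0  (HERO is better)
print(pir(baseline=100.0, hero=120.0))  # -20.0 (HERO is worse)
```

The same function applies to both makespan and energy, since lower is better for each metric.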
5.2. Comparison with Benchmark Algorithms
To comprehensively evaluate the performance boundaries of the HERO framework, we selected four state-of-the-art (SOTA) algorithms, each representing a distinct scheduling strategy, as benchmarks:
NSGA-II: A classic Pareto-based multi-objective evolutionary algorithm. In theory, NSGA-II can approach the global optimum given unlimited search time. In this experiment, we ran NSGA-II under a scale-adaptive search budget: the population size is a linear function of the number of tasks, and the maximum number of iterations is limited to three times the number of tasks.
DPMC: A heuristic algorithm for mixed-criticality systems that distinguishes between high- and low-criticality tasks. It employs a relatively conservative frequency reservation strategy to ensure the deadlines of high-priority tasks.
SSA-DAG: A structure-aware scheduling algorithm based on an MLP. It uses a neural network to learn node importance and introduces a dual-queue mechanism that reserves resources for high-priority tasks still waiting for their predecessors to complete, with the aim of reducing task completion time.
TOM: An algorithm based on time-triggered and chain-structure optimization. It performs excellently in merging linear task chains to reduce synchronization overhead.
5.3. Experimental Results
5.3.1. Impact of Task Size (Number of Layers) on Performance
Figure 10 shows algorithm performance as DAG depth
M increases from 8 to 13 (with fixed width
).
HERO achieves the fastest completion time across all depths. Without fine-grained communication awareness, SSA-DAG lags behind HERO by 28.89% on average (peaking at 32.85% at ). DPMC’s rigid resource reservations similarly cause a 15.46% scheduling delay.
Regarding energy trade-offs, NSGA-II and DPMC save 9.58% and 3.29% energy, respectively, but severely sacrifice real-time performance (lagging 12.29% and 15.46% in makespan). Meanwhile, SSA-DAG and TOM consume more energy than HERO (4.04% and 1.92% more, respectively) while remaining slower. Ultimately, HERO’s hybrid budgeting secures the best completion time without excessive energy waste.
5.3.2. Impact of Task Parallelism (Width) on Performance
Figure 11 shows the performance trends of various algorithms under different levels of parallelism. We fixed the DAG task depth at
, and the average number of parallel nodes per layer
N varied from 10 to 30.
HERO maintained the fastest completion time in all tests. SSA-DAG’s completion time was on average 22.51% slower than HERO. The TOM algorithm, due to insufficient adaptability to complex mesh dependencies, was on average 9.13% slower. DPMC’s static rule-based reservation mechanism struggles to adapt to dynamic concurrent workloads, resulting in performance fluctuations with an average lag of 13.19%.
In terms of energy optimization, DPMC and NSGA-II saved 3.52% and 7.71% of energy on average compared to HERO, respectively. NSGA-II achieved this at the cost of a 6.39% longer average completion time, with the worst-case degradation reaching 7.78%. Moreover, NSGA-II relies on heavy population iterations (requiring minutes), whereas HERO completes inference in microseconds using neural network forward propagation.
5.3.3. Impact of Graph Density (Connection Probability)
Figure 12 illustrates algorithm performance across varying graph densities. We adjusted the connection probability
C from 0.1 (highly sparse) to 0.7 (highly dense), with fixed depth
, width
, and task utilization
.
Increasing graph density exponentially exacerbates communication bottlenecks and synchronization barriers. HERO consistently achieves the lowest makespan across all densities. In highly dense scenarios (), algorithms lacking fine-grained communication awareness struggle to resolve complex mesh dependencies: SSA-DAG and TOM lag behind HERO by an average of 10.89% and 5.36%, respectively.
Conversely, HERO’s deep-enhanced MLP explicitly incorporates maximum communication overhead () to accurately identify true critical paths amidst massive data transfer delays. While DPMC and NSGA-II exhibit energy-saving tendencies under dense topologies (saving 3.67% and 10.54% on average, respectively), they severely compromise real-time performance, with makespans lagging behind HERO by 8.24% and 4.53%.
5.3.4. System Load Stress Test
Figure 13 shows the performance of various scheduling algorithms as the system task utilization (
U) increases from
to overload (
). We keep the other parameters fixed at
,
.
For DPMC, the completion time lag remains high as the load increases, rising slightly from 19.75% under light load to 19.97% under heavy load (), indicating severe underutilization of resources. SSA-DAG has an average completion time lag of 40.90%. This illustrates the limitation of relying solely on structural features without fine-grained communication awareness in high-load scenarios. NSGA-II sacrifices 14.66% of execution speed to achieve 12.18% energy savings. Across all tests, HERO maintains the shortest completion time.
6. Conclusions and Future Work
In this paper, we proposed the lightweight HERO framework to resolve the bi-objective scheduling challenges of delay-sensitive DAG tasks in heterogeneous edge clusters. Rather than reiterating the algorithmic design, our extensive evaluations directly demonstrate the framework’s practical superiority. Notably, when compared to the representative learning-based baseline (SSA-DAG), HERO achieves an average reduction in makespan under high-density topologies, and saves up to of system energy across varying task depths. For resource-constrained edge devices, this continuous energy margin is highly significant, as it cumulatively extends battery lifespan and prevents hardware thermal throttling during sustained workloads. It pushes energy optimization to the extreme without introducing new system bottlenecks, all while strictly maintaining the ultra-low, microsecond-level scheduling latency crucial for real-time edge intelligence. Building upon these promising quantitative results, our future research will focus on three main avenues: (1) adapting the framework for dynamic, online environments (e.g., vehicular networks) with unpredictable task generation and topologies; (2) integrating lightweight fault-tolerance mechanisms to ensure high reliability against transient edge node failures; and (3) advancing to hardware-in-the-loop deployments on actual microcontrollers and embedded IoT sensor nodes to assess real-world physical overhead and end-to-end adaptability outside of simulated environments.
Author Contributions
Conceptualization, methodology, data curation, writing—original draft preparation, project administration, and funding acquisition, Z.Z.; visualization, writing—review and editing, validation, and funding acquisition, Y.J.; supervision, N.N. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
Summary of Key Notations.
| Notation | Description | Notation | Description |
|---|---|---|---|
| | DAG components () | | i-th computational task |
| | Data dependency edge | | Task computational workload |
| | Data transmission volume | B | Average network bandwidth |
| CCR | Comm-to-computation ratio | | Predecessor and Successor sets |
| | Processor set and k-th server | | Processing speed coefficient |
| | Frequency set and selected freq | | Task execution time on |
| | Communication time cost | | Static baseline power of |
| | Hardware capacitive constant | | Task execution energy |
| | Task minimum energy bound | | Task maximum energy bound |
| | Total system energy consumption | | Total completion time |
| | 11-D task feature vector | | Marginal sensitivity label |
| | System energy constraint | | Energy budget control factor |
| | Global energy margin pool | | Hybrid importance score |
| | Weight balancing factor | | Initial task energy budget |
| | Accumulated energy surplus | | Actual dynamic energy limit |
| | Final energy consumed by | | Data ready time on |
| | Earliest start time on | | m-th idle time slice |
| | Task execution deadline | | Random perturbation factor |
| PIR | Performance improvement ratio | | System task utilization |
References
1. Li, J.; Shang, Y.; Qin, M.; Yang, Q.; Cheng, N.; Gao, W.; Kwak, K.S. Multiobjective oriented task scheduling in heterogeneous mobile edge computing networks. IEEE Trans. Veh. Technol. 2022, 71, 8955–8966.
2. Zhou, X.; Ge, S.; Liu, P.; Qiu, T. DAG-based dependent tasks offloading in MEC-enabled IoT with soft cooperation. IEEE Trans. Mob. Comput. 2023, 23, 6908–6920.
3. Cao, Z.; Deng, X.; Yue, S.; Jiang, P.; Ren, J.; Gui, J. Dependent task offloading in edge computing using GNN and deep reinforcement learning. IEEE Internet Things J. 2024, 11, 21632–21646.
4. Peng, Q.; Wu, C.; Xia, Y.; Ma, Y.; Wang, X.; Jiang, N. DoSRA: A decentralized approach to online edge task scheduling and resource allocation. IEEE Internet Things J. 2021, 9, 4677–4692.
5. Taghinezhad-Niar, A.; Taheri, J. Fault-Tolerant Cost-Efficient Scheduling for Energy and Deadline-Constrained IoT Workflows in Edge-Cloud Continuum. IEEE Trans. Serv. Comput. 2025, 18, 2892–2903.
6. He, X.; Pang, S.; Gui, H.; Zhang, K.; Wang, N.; Yu, S. Online offloading and mobility awareness of DAG tasks for vehicle edge computing. IEEE Trans. Netw. Serv. Manag. 2024, 22, 675–690.
7. Jiang, Q.; Xin, X.; Zhang, T.; Chen, K. Energy-Efficient Task Scheduling and Resource Allocation in Edge Heterogeneous Computing Systems Using Multi-Objective Optimization. IEEE Internet Things J. 2025, 12, 36747–36764.
8. Biswas, S.K.; Muhuri, P.K.; Roy, U.K. Binary search-based fast scheduling algorithms for reliability-aware energy-efficient task graph scheduling with fault tolerance. IEEE Trans. Sustain. Comput. 2023, 9, 433–451.
9. Yu, Z.; Liu, W.; Liu, X.; Wang, G. Drag-JDEC: A deep reinforcement learning and graph neural network-based job dispatching model in edge computing. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS); IEEE: Piscataway, NJ, USA, 2021; pp. 1–10.
10. Zhou, Y.; Li, X.; Luo, J.; Yuan, M.; Zeng, J.; Yao, J. Learning to optimize dag scheduling in heterogeneous environment. In Proceedings of the 2022 23rd IEEE International Conference on Mobile Data Management (MDM); IEEE: Piscataway, NJ, USA, 2022; pp. 137–146.
11. Deng, X.; Yang, H.; Zhang, J.; Gui, J.; Lin, S.; Wang, X.; Min, G. Task offloading in internet of vehicles: A drl-based approach with representation learning for dag scheduling. IEEE Trans. Mob. Comput. 2025, 24, 5045–5060.
12. Zhang, J.; Mo, L.; Wang, X.; Yang, C.; Wang, M.; Niu, D. Mixed-criticality DAGs Scheduling and Performance Optimization for Heterogeneous Multicore Systems. In Proceedings of the 2025 37th Chinese Control and Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2025; pp. 3013–3019.
13. Gao, Y.; Yi, H.; Chen, H.; Fang, X.; Zhao, S. A structure-aware DAG scheduling and allocation on heterogeneous multicore systems. In Proceedings of the 2024 IEEE 14th International Symposium on Industrial Embedded Systems (SIES); IEEE: Piscataway, NJ, USA, 2024; pp. 26–33.
14. Wang, S.; Li, D.; Huang, S.Y.; Deng, X.; Sifat, A.H.; Huang, J.B.; Jung, C.; Williams, R.; Zeng, H. Time-triggered scheduling for nonpreemptive real-time DAG tasks using 1-opt local search. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3650–3661.
15. Liu, D.; Chen, J.; Huang, X.; Hong, H. A reliability-aware and energy-aware task scheduling algorithm for heterogeneous multi-core systems. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC); IEEE: Piscataway, NJ, USA, 2024; pp. 3212–3217.
16. Zhang, Y.; Zhao, S.; Chen, G.; Huang, K. Fault-tolerant DAG scheduling with runtime reconfiguration on multicore real-time systems. In Proceedings of the 2024 IEEE 35th International Conference on Application-Specific Systems, Architectures and Processors (ASAP); IEEE: Piscataway, NJ, USA, 2024; pp. 19–27.
17. Sun, B.; Theile, M.; Qin, Z.; Bernardini, D.; Roy, D.; Bastoni, A.; Caccamo, M. Edge generation scheduling for dag tasks using deep reinforcement learning. IEEE Trans. Comput. 2024, 73, 1034–1047.
18. Song, X.; Feng, J.; Liu, L.; Pei, Q.; Yu, F.R.; Zhang, N. A Deep Reinforcement Learning with Transformer Integration for Directed Acyclic Graph Scheduling in Edge Networks. IEEE Trans. Wirel. Commun. 2025, 25, 5506–5520.
19. Liu, Z.; Huang, L.; Gao, Z.; Luo, M.; Hosseinalipour, S.; Dai, H. GA-DRL: Graph neural network-augmented deep reinforcement learning for DAG task scheduling over dynamic vehicular clouds. IEEE Trans. Netw. Serv. Manag. 2024, 21, 4226–4242.
20. Ding, W.; Luo, F.; Gu, C.; Dai, Z.; Lu, H. A multiagent meta-based task offloading strategy for mobile-edge computing. IEEE Trans. Cogn. Dev. Syst. 2023, 16, 100–114.
21. Zhao, G.; Xu, H.; Zhao, Y.; Qiao, C.; Huang, L. Offloading dependent tasks in mobile edge computing with service caching. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications; IEEE: Piscataway, NJ, USA, 2020; pp. 1997–2006.
22. Lou, J.; Tang, Z.; Zhang, S.; Jia, W.; Zhao, W.; Li, J. Cost-effective scheduling for dependent tasks with tight deadline constraints in mobile edge computing. IEEE Trans. Mob. Comput. 2022, 22, 5829–5845.
23. Chen, Y.; Liu, S.; Zhou, J.; Ling, X. Real-Time DAG Task Allocation Strategy for Multiprocessor by Optimistic Parallelism. In Proceedings of the 2024 IEEE 24th International Conference on Communication Technology (ICCT); IEEE: Piscataway, NJ, USA, 2024; pp. 1016–1021.
Figure 1.
An example of a Directed Acyclic Graph (DAG) task model. Nodes (–) represent computational tasks, and directed edges indicate execution dependencies and data transmission flow.
Figure 2.
The architecture of the HERO framework.
Figure 3.
The task sensitivity analysis process, showing how marginal delay impacts are calculated to generate training labels.
Figure 4.
Training performance of the proposed deep-enhanced MLP predictor. (a) The rapid convergence of training and validation loss within 80 epochs. (b) A scatter plot comparing true vs. predicted sensitivity on the test set; the dashed line represents the ideal prediction (), demonstrating the model’s high prediction accuracy.
Figure 5.
Through micro-behavior analysis, we compared and contrasted the traditional () strategy with the proposed MLP-based strategy.
Figure 6.
Performance comparison of and MLP under varying graph depths and parallelism levels.
Figure 7.
Permutation feature importance analysis, highlighting computational load and communication overhead as the most critical scheduling features.
Figure 8.
Performance comparison of Rank-Only, Workload-Only, and HERO scheduling strategies under identical energy constraints.
Figure 9.
Completion time and energy consumption comparison between hole-filling and Append-Only strategies.
Figure 10.
Performance comparison of algorithms at different depths (M).
Figure 11.
Performance comparison of the algorithms under different parallelism levels (N).
Figure 12.
Performance comparison of algorithms under different graph densities (C).
Figure 13.
Performance comparison of algorithms under different system utilizations (U).
Table 1.
Comprehensive feature comparison of existing task scheduling strategies and the proposed HERO framework.
| Category | Ref. & Method | Optimization Objectives | Energy Strategy | Communication Overhead Handling |
|---|---|---|---|---|
| Heuristic & Evolutionary Algorithms | Li et al. [1] | Makespan, Energy | Standard DVFS | Static transmission assumption |
| | Jiang et al. [7] | Energy, Delay | DVFS auto-adjustment | Partially considered |
| | Zhang et al. [12] | Performance, Service Quality | Dynamic DVFS | Not strictly prioritized |
| | Liu et al. [15] | Reliability, Energy | Standard DVFS | Redundancy-based |
| | Biswas et al. [8] | Reliability, Energy | Fast DVFS switching | Static bounds |
| | Zhang et al. [16] | Makespan, Fault-tolerance | None (Re-execution) | Unpredictable failure status |
| | Taghinezhad-Niar [5] | Cost, Energy, Deadline | Energy-constrained | Edge-cloud congestion modeled |
| DRL & Intelligent Scheduling | Drag-JDEC [9] | Makespan, QoS | None | GNN feature extraction |
| | Cao et al. [3] | Makespan | None | GAT-based dependency |
| | Sun et al. [17] | DAG Width, Deadline | None | Edge generation representation |
| | Song et al. [18] | Energy, Makespan | Transmit power & CPU freq. | Attention-based feature |
| | GA-DRL [19] | Makespan, Timeliness | None | Topology two-way aggregation |
| | DVTP [11] | Makespan | None | Spatiotemporal representation |
| | Ding et al. [20] | Latency, Energy | Charging time trade-off | Dynamic environment-aware |
| | LACHESIS [10] | Completion time | None | Topological perception |
| Structure-Aware & Specific Strategies | Zhao et al. [21] | Execution time | None | Wireless interference |
| | LOU [22] | Latency, Cost | Strict Constraints | Dependency-aware |
| | Zhou et al. [2] | Latency, Energy, Gain | End-device frequency | Soft cooperation (Data sharing) |
| | He et al. [6] | Makespan, Queue stability | None | Cross-slot queue (Lyapunov) |
| | Gao et al. [13] | Makespan | None | Pre-calculated node priority |
| | Wang et al. [14] | Worst-case latency | None | 1-Opt Local Search path |
| | DoSRA [4] | Efficiency, Delay | None | Decentralized provision |
| | OPSA [23] | Processor utilization | None | Parallelism limitation |
| Proposed | HERO (Ours) | Makespan, Energy | Preventive-bottleneck budget | Aggressive hole-filling strategy |
Table 2.
The 11-dimensional feature vector for task representation.
| Feature Categories | Feature Name | Specific Meaning |
|---|---|---|
| Computation & Communication | | The task’s own computational load |
| | | Maximum data transfer weight of all output edges of the task |
| Static Priority Features | | Distance from the task to the exit node |
| | | Distance from the task to the entry node |
| | | Task level depth in the DAG topological sorting |
| Graph Structure & Path | | In-degree of the task node |
| | | Out-degree of the task node |
| | | Total degree of the task node |
| | | The difference between in-degree and out-degree |
| | | Execution time of the critical path passing through the node |
| | | Number of paths from entry node to exit node passing through the node |
Table 3.
Model performance comparison against classic baselines.
| Model | MSE | MAE | R² |
|---|---|---|---|
| Linear Regression | 0.018663 | 0.075654 | 0.1786 |
| Decision Tree | 0.017255 | 0.070981 | 0.2406 |
| Random Forest | 0.010575 | 0.055387 | 0.5346 |
| XGBoost | 0.010004 | 0.054553 | 0.5597 |
Table 4.
Ablation study of MLP architectures on prediction accuracy and computational overhead.
| Architecture | MSE | R² | Latency (μs) | Parameters |
|---|---|---|---|---|
| Shallow-MLP (2-layer) | 0.006224 | 0.690191 | 43.53 | 1537 |
| HERO-MLP (5-layer) | 0.005880 | 0.707295 | 199.09 | 50,049 |
| Deep-MLP (12-layer) | 0.006106 | 0.696074 | 254.53 | 2,638,849 |
Table 5.
Processor configurations.
| Level | Normalized Freq | Raspberry Pi 4 f (MHz) | Raspberry Pi 4 P (mW) | Jetson Orin Nano f (MHz) | Jetson Orin Nano P (mW) | Xeon D f (MHz) | Xeon D P (mW) |
|---|---|---|---|---|---|---|---|
| 1 | 0.2 | 300 | 2500 | 302 | 5000 | 600 | 20,000 |
| 2 | 0.4 | 600 | 3200 | 605 | 7000 | 1200 | 28,000 |
| 3 | 0.6 | 900 | 4200 | 907 | 10,000 | 1800 | 38,000 |
| 4 | 0.8 | 1200 | 5500 | 1209 | 15,000 | 2400 | 50,000 |
| 5 | 1.0 | 1500 | 7000 | 1512 | 25,000 | 3000 | 65,000 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.