4.1. Introduction to the WOA
The WOA is a nature-inspired metaheuristic modeled on the social hunting behavior of humpback whales, especially their Bubble-Net Feeding Strategy. The WOA iteratively transitions between exploration (global search) and exploitation (local search). Using different mathematical formulations for these phases, WOA maintains a balance between wide-ranging search and precise local optimization. The algorithm simulates key behaviors such as the Spiral Bubble-Net Attack and Prey Encircling through the following procedures:
Whales identify the position of their prey and proceed to envelop it. This behavior is represented mathematically as

\vec{D} = \left| \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \right|, \qquad \vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D},

where \vec{X}^{*} represents the current best solution's position, \vec{X} is the position vector of a whale, and \vec{A} and \vec{C} are coefficient vectors given by

\vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a}, \qquad \vec{C} = 2\vec{r},

where \vec{a} linearly decreases from 2 to 0 over iterations and \vec{r} \in [0, 1].
The Bubble-Net Strategy is modeled through a spiral movement, allowing whales to refine their search around high-quality solutions. This motion is expressed by the spiral update equation:

\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t),

where \vec{D}' = \left| \vec{X}^{*}(t) - \vec{X}(t) \right|, the parameter b controls the spiral's shape, and l represents a uniformly distributed random variable within the interval [-1, 1].
To maintain diversity and avoid premature convergence, whales perform random searches across the global space. When \left| \vec{A} \right| \geq 1, the update is modeled as

\vec{D} = \left| \vec{C} \cdot \vec{X}_{\mathrm{rand}} - \vec{X}(t) \right|, \qquad \vec{X}(t+1) = \vec{X}_{\mathrm{rand}} - \vec{A} \cdot \vec{D},

where \vec{X}_{\mathrm{rand}} represents a randomly chosen whale's position from the current population.
The Whale Optimization Algorithm (WOA) starts by generating an initial set of candidate solutions, representing whales, with random placements across the search domain. During every cycle, the fitness of these potential solutions is evaluated through a problem-specific objective function. Subsequently, the whales adjust their locations either by approaching the current best-known solution or by exploring new regions, according to a probabilistic rule. This iterative process continues until a stopping criterion is reached, such as a maximum number of iterations or attainment of a convergence threshold.
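To make the three phases concrete, the following minimal Python sketch reproduces the canonical WOA loop on a continuous test objective. It is illustrative only: names such as `woa_minimize` and the sphere objective are our own, and WORL-RTGS replaces these continuous updates with the discrete mechanism of Section 4.3.

```python
import numpy as np

def woa_minimize(objective, dim, n_whales=30, k_max=200, lb=-10.0, ub=10.0, b=1.0):
    """Minimal continuous WOA sketch: encircling, spiral bubble-net, random search."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, size=(n_whales, dim))       # whale positions
    leader = min(X, key=objective).copy()               # current best solution X*

    for k in range(k_max):
        a = 2.0 - 2.0 * k / k_max                       # a decreases linearly from 2 to 0
        for i in range(n_whales):
            r = rng.random()
            A, C = 2.0 * a * r - a, 2.0 * rng.random()  # coefficient scalars
            p, l = rng.random(), rng.uniform(-1.0, 1.0)
            if p < 0.5:
                if abs(A) < 1.0:                        # exploitation: encircle the leader
                    X[i] = leader - A * np.abs(C * leader - X[i])
                else:                                   # exploration: follow a random whale
                    x_rand = X[rng.integers(n_whales)]
                    X[i] = x_rand - A * np.abs(C * x_rand - X[i])
            else:                                       # spiral update around the leader
                X[i] = np.abs(leader - X[i]) * np.exp(b * l) * np.cos(2 * np.pi * l) + leader
            X[i] = np.clip(X[i], lb, ub)
        best = min(X, key=objective)
        if objective(best) < objective(leader):
            leader = best.copy()
    return leader

best = woa_minimize(lambda x: float(np.sum(x**2)), dim=5)  # sphere-function demo
```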
Task scheduling in distributed systems with DAG-structured workloads constitutes a challenging combinatorial optimization problem. It involves assigning heterogeneous tasks to diverse resources while adhering to constraints on execution time, resource utilization, and data locality. Effective solutions must strike a balance between broad exploration of the search space and fine-grained exploitation of promising regions. WOA is well suited to this requirement because its mathematical model alternates naturally between exploration and exploitation, thereby mitigating premature convergence and enabling near-optimal schedules to be discovered. In this formulation, the position of each whale encodes a mapping of tasks to resources, offering a direct and intuitive representation of scheduling decisions. Compared with alternative metaheuristics such as Genetic Algorithms (GAs) and Particle Swarm Optimization (PSO), WOA requires fewer control parameters and relies on simpler update mechanisms, which reduces computational overhead and accelerates convergence. Furthermore, its flexible structure allows seamless integration of domain-specific heuristics, such as awareness of data locality and task precedence. Consequently, we employ WOA as the optimization backbone of this work, with additional problem-driven enhancements incorporated to further improve scheduling efficiency. These refinements are elaborated in the following subsections.
4.2. Analysis of Positive Correlation
This section analyzes the relationship between Scheduling Plan Distance (SPD) and Finish Time Gap (FTG) using both theoretical arguments and experimental evidence. To highlight their inherent positive correlation, we begin by analyzing a representative workload with a clearly defined network structure as described in Section 3.3. In this scenario, we derive an analytical expression showing that SPD = FTG, revealing a direct linear dependency between the two metrics. This theoretical insight lays the foundation for our subsequent experimental validation across a range of diverse scheduling cases.
The Finish Time Gap (FTG) represents the absolute difference in completion times between two scheduling strategies, formulated as

\mathrm{FTG}(S_1, S_2) = \left| T_{S_1} - T_{S_2} \right|, \tag{25}

where T_{S_i} is the final finish time for scheduling plan S_i, i \in \{1, 2\}. This metric reflects the performance gap between two scheduling strategies in terms of makespan.
Scheduling Plan Distance (SPD) is a novel metric introduced in this work to quantify the distance between two scheduling plans, denoted as S_1 and S_2. It captures the degree of difference in workload distribution across GPUs between the two plans. The metric is formally defined as

\mathrm{SPD}(S_1, S_2) = \frac{1}{n} \sum_{l=1}^{n} \left| W_{S_1}(l) - W_{S_2}(l) \right|, \tag{26}

where n is the total number of nodes in the neural network, and W_{S_i}(l) denotes the cumulative execution time, computed according to Equation (14), of all nodes mapped to the same GPU as node l under scheduling strategy S_i, with i being either 1 or 2. To compute SPD, we first determine, for each node l, the absolute difference in workload between the GPUs to which l is assigned in the two scheduling plans. These differences reflect how much the GPU workloads vary between the two plans from the perspective of each node. Finally, we take the average of these absolute differences over all nodes to obtain the overall SPD value. This metric provides an intuitive and quantitative means of comparing scheduling plans, especially in terms of their impact on load balancing and GPU utilization.
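For illustration, the two metrics can be computed directly from a pair of plans. In this hypothetical sketch a plan is an array of per-node GPU indices, and the finish time is approximated by the maximum per-GPU load rather than the full DAG finish time of Equation (14):

```python
import numpy as np

def gpu_loads(plan, exec_times, n_gpus):
    """Per-GPU total execution time under a plan (plan[l] = GPU index of node l)."""
    return np.bincount(plan, weights=exec_times, minlength=n_gpus)

def ftg(plan1, plan2, exec_times, n_gpus):
    """Eq. (25), with the finish time approximated by the maximum per-GPU load."""
    t1 = gpu_loads(plan1, exec_times, n_gpus).max()
    t2 = gpu_loads(plan2, exec_times, n_gpus).max()
    return abs(t1 - t2)

def spd(plan1, plan2, exec_times, n_gpus):
    """Eq. (26): mean over nodes of |W_S1(l) - W_S2(l)| of each node's host GPUs."""
    w1 = gpu_loads(plan1, exec_times, n_gpus)[plan1]   # W_S1(l) for every node l
    w2 = gpu_loads(plan2, exec_times, n_gpus)[plan2]   # W_S2(l) for every node l
    return float(np.mean(np.abs(w1 - w2)))
```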
In the WORL-RTGS algorithm, we utilize the Finish Time Gap (FTG) as a measure of dissimilarity between two scheduling plans. This metric provides an intuitive and direct reflection of how the overall performance differs between two scheduling approaches. Although FTG serves as a useful evaluation metric, it is not directly applicable for steering the creation of new scheduling strategies in iterative optimization. To address this, we investigate the mathematical connection between FTG and the Scheduling Plan Distance (SPD), the novel structural measure introduced above. Our findings indicate a robust positive correlation, where an increase in SPD typically corresponds to a proportional rise in FTG. This finding is crucial, as it allows us to indirectly control and optimize FTG by manipulating SPD. Leveraging this relationship, we treat SPD as a proxy objective that is differentiable and structurally informative, making it suitable for guiding the search process. In particular, we use the functional form of SPD to derive directions for generating next-step scheduling plans, as elaborated in Section 4.3.
To explore the connection between SPD and FTG, we analyze a distinct workload scenario featuring a clear network structure. The workload consists of n task nodes, including n − 1 entry nodes and a single exit node. The computing platform comprises two identical GPUs on a server. The execution time of the exit node, as well as all data transfer costs from the entry nodes to the exit node, are considered negligible. Under these conditions, the DAG reduces to a star topology, while the platform becomes two homogeneous GPUs without bandwidth limitations. Each entry node requires 1 s of execution time. In the baseline scheduling strategy S_1, the entry nodes are split evenly, with each GPU handling (n − 1)/2 nodes. Conversely, in the second strategy S_2, GPU g_1 processes (n − 1)/2 + d entry nodes, while GPU g_2 manages (n − 1)/2 − d nodes, where 0 < d ≤ (n − 1)/2. Let d represent the degree of imbalance between the two plans. Since every node's host GPU carries a load of (n − 1)/2 under S_1 and (n − 1)/2 ± d under S_2, each node contributes |W_{S_1}(l) − W_{S_2}(l)| = d. We then derive the Scheduling Plan Distance as

\mathrm{SPD}(S_1, S_2) = \frac{1}{n} \sum_{l=1}^{n} \left| W_{S_1}(l) - W_{S_2}(l) \right| = \frac{1}{n} \cdot n \cdot d = d. \tag{28}

Since each node has a unit execution time, the Finish Time Gap between the two scheduling plans is determined solely by the workload imbalance across the GPUs. In particular, as S_1 balances the workload equally and S_2 does not, the Finish Time Gap is given by

\mathrm{FTG}(S_1, S_2) = \left| \frac{n-1}{2} - \left( \frac{n-1}{2} + d \right) \right| = d. \tag{29}

Substituting Equation (29) into Equation (28), we obtain the following:

\mathrm{SPD}(S_1, S_2) = \mathrm{FTG}(S_1, S_2). \tag{30}
This reveals a clear linear relationship between SPD and FTG within this particular workload context. This theoretical result supports the idea that SPD can be used as a proxy for controlling FTG during scheduling optimization.
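A quick numerical check of this star-topology result (unit-time entry nodes, negligible exit node, two GPUs) confirms the equality; the grouping of the exit node with GPU g_2 is an arbitrary illustrative choice:

```python
n, d = 101, 10                      # n - 1 = 100 unit-time entry nodes, imbalance d
m = (n - 1) // 2                    # 50 entry nodes per GPU under S1
# Host-GPU load seen by each node: m everywhere under S1; m +/- d under S2.
w_s1 = [m] * n
w_s2 = [m + d] * (m + d) + [m - d] * (n - m - d)        # exit node grouped with GPU g2
spd = sum(abs(a - b) for a, b in zip(w_s1, w_s2)) / n   # Eq. (28) -> 10.0
ftg = abs(max(m, m) - max(m + d, m - d))                # Eq. (29) -> 10
assert spd == ftg == d
```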
Furthermore, to comprehensively investigate the relationship between
and
, we conduct a series of large-scale experiments using both real-world data from the Alibaba Cluster Trace Program (
https://github.com/alibaba/clusterdata, accessed on 6 November 2025) and synthetically generated workloads. The dataset from Alibaba’s operational clusters contains detailed job scheduling information, including DAG structures for both batch workloads and long-running online services, making it highly suitable for evaluating task scheduling strategies.
Real-World Data Experiments. The experimental analysis using the Alibaba Cluster Trace data is divided into two groups based on server status, busy and idle, to reflect different practical workload conditions. In the busy state, each GPU's resource usage rate is randomly sampled from a high-utilization interval, while in the idle state it is sampled from a low-utilization interval. Jobs are randomly selected from the trace across various computational scales. Specifically, the number of nodes ranges from 100 to 1000 in increments of 100 for small-scale jobs, from 1200 to 3000 in increments of 200 for medium-scale jobs, and from 4000 to 10,000 in increments of 1000 for large-scale jobs. We simulate six cluster environments with 10, 20, 50, 100, 200, and 500 GPUs, each capable of processing multiple nodes concurrently. For each combination of GPU count, system state (busy/idle), and job scale (small/medium/large), we perform 1000 repetitions to ensure statistical reliability. In each repetition, two distinct scheduling plans, S_1 and S_2, are randomly generated, and FTG and SPD are computed according to Equations (25) and (26), respectively. The resulting (SPD, FTG) pairs are recorded for correlation analysis. After completing all repetitions for each combination of GPU count, system state, and job scale, we compute the Pearson Correlation Coefficient (PCC) [42] to quantify the strength of the correlation between SPD and FTG.
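The per-configuration analysis can be sketched as follows; the random-plan generator and workload sampling below are simplified placeholders for the trace-driven setup, and `scipy.stats` supplies both the PCC and the Spearman coefficient used later in this section:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def random_plan(n_tasks, n_gpus, rng):
    """Placeholder for the paper's random scheduling-plan generator."""
    return rng.integers(0, n_gpus, size=n_tasks)

def correlation_for_config(n_tasks=500, n_gpus=50, reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    exec_times = rng.uniform(0.5, 2.0, size=n_tasks)            # stand-in node runtimes
    spds, ftgs = [], []
    for _ in range(reps):
        p1 = random_plan(n_tasks, n_gpus, rng)
        p2 = random_plan(n_tasks, n_gpus, rng)
        loads1 = np.bincount(p1, weights=exec_times, minlength=n_gpus)
        loads2 = np.bincount(p2, weights=exec_times, minlength=n_gpus)
        spds.append(np.mean(np.abs(loads1[p1] - loads2[p2])))  # SPD, Eq. (26)
        ftgs.append(abs(loads1.max() - loads2.max()))          # FTG, Eq. (25), max-load proxy
    return pearsonr(spds, ftgs)[0], spearmanr(spds, ftgs)[0]

pcc, rho = correlation_for_config()
print(f"PCC = {pcc:.3f}, Spearman rho = {rho:.3f}")
```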
The experimental results in Table 1 report the PCC values between SPD and FTG across different GPU counts, workload scales, and system states, based on real-world Alibaba cluster data. Each PCC value in Table 1 represents the mean correlation calculated for a specific combination of GPU count, system state, and job scale. Overall, the results indicate a consistent positive correlation across all configurations, suggesting that higher SPD values generally correspond to higher FTG values. Figure 3 and Figure 4 show this trend specifically for 50-GPU and 500-GPU clusters, respectively. Both line charts show a clear positive relationship between SPD and FTG, reinforcing that the SPD metric can serve as a reliable indicator of FTG across different cluster scales. Notably, while the correlation strength fluctuates slightly depending on job scale and system state, no negative correlations are observed, highlighting the robustness of SPD as a predictive metric. These PCC values indicate a moderate correlation strength, aligning well with benchmarks in the scheduling literature, where proxy metrics for makespan (e.g., load imbalance or resource utilization) typically yield PCCs in the range of 0.3-0.6 [43,44]. For instance, PCC has been used for load correlation detection in VM placement decisions to minimize Service Level Agreement (SLA) violations and migration counts; experiments show that PCC-guided load balancing can reduce energy consumption by 15% to 20%, with PCC values in the range of 0.4-0.55 considered sufficient for dynamic optimization [43]. Higher PCCs (>0.7) are rarer in such complex, dynamic environments and often limited to synthetic, low-variability benchmarks; thus, our results are not only comparable but also indicative of SPD's suitability as a proxy for FTG in practical DAG scheduling, where full linear predictability is not required.
We acknowledge that the average PCC of 0.4603 reflects moderate rather than strong linear correlation (PCC > 0.7 is conventionally considered strong). However, in complex DAG scheduling under resource contention and precedence constraints, perfect linearity may not exist, and consistent directional alignment is of greater practical significance: SPD and FTG monotonically co-vary across all tested conditions. To further validate this non-parametric robustness, we supplement the PCC analysis with Spearman's rank correlation coefficient (ρ), which measures monotonicity without assuming linearity.
Table 2 reports Spearman’s
for the same experimental configurations. Values range from 0.49 to 0.55 with mean = 0.52 and variance = 0.004, consistently higher than PCC and uniformly positive, confirming that the relationship between SPD and FTG is even stronger than linear trends. This is expected because structural divergence SPD influences makespan FTG through cumulative, nonlinear dependency chains, precisely the regime where rank-based metrics outperform Pearson’s. Critically, no configuration yields negative
, and the gap
holds in 92% of the cases, reinforcing that SPD provides reliable monotonic guidance even when linear fit is moderate. In optimization contexts such as WORL-RTGS, this ensures that minimizing SPD would reduce FTG on average. Edge cases with lower PCC (e.g., 0.386) still yield
, preserving directivity.
Synthetic Workload Experiments. The experimental analysis with synthetically generated workloads is likewise divided into busy and idle states. GPU resource usage is sampled in the same ranges as in the real-world experiments. We simulate the same six cluster environments, with job scales and node counts as previously described. Execution times for jobs are sampled from normal distributions with scale-specific parameters for small-, medium-, and large-scale jobs. DAG topologies are generated using the Erdős–Rényi model [45], with the average in-degree/out-degree set to either 1 or 3, resulting in 54 workloads in total: 27 with degree 1 and 27 with degree 3. For each combination of system state, job scale, GPU count, and average degree, we perform 1000 independent repetitions. In each repetition, two scheduling plans, S_1 and S_2, are randomly created, and FTG and SPD are computed. The (SPD, FTG) pairs are recorded for correlation analysis, and the PCC is computed for each combination to measure the correlation strength between SPD and FTG.
Table 3 reports the PCC values between SPD and FTG under synthetic workloads across different system states, job scales, GPU counts, and average in-degrees/out-degrees. Overall, the results demonstrate a consistently positive correlation, with PCC values ranging approximately from 0.39 to 0.55. Stronger correlations are observed for medium and large workloads in the busy state, especially when the average degree is 3 (e.g., 0.550 at 20 GPUs and 0.534 at 200 GPUs). In the idle state, small workloads exhibit greater variability, while medium and large workloads remain relatively stable, often exceeding 0.45 and reaching above 0.54 in several cases.
Figure 5 and Figure 6 further illustrate the positive trend, showing consistent correlations for the 50-GPU and 500-GPU clusters. These PCC values indicate a moderate correlation strength, aligning well with benchmarks in the scheduling literature, where proxy metrics for makespan (e.g., load imbalance or resource utilization) typically yield PCCs in the range of 0.3-0.6 [43,44], particularly in heterogeneous or variable-degree topologies. In our case, the slightly higher PCCs for degree-3 topologies (e.g., up to 0.550) underscore SPD's enhanced predictiveness in denser graphs, where inter-task dependencies amplify load-makespan sensitivity. This is ideal for our proxy's role in WORL-RTGS, as it enables reliable indirect minimization of FTG amid topological variability.
As with the real-world data, we compute Spearman's ρ for the synthetic workloads. Across all 108 configurations, ρ ranges from 0.42 to 0.59, with mean 0.508 and variance 0.004, again outperforming the PCC and uniformly positive. The improvement is most pronounced in high-degree (degree-3) topologies, where dependency chains amplify nonlinear effects (e.g., ρ exceeding the corresponding PCC of 0.550 at 20 GPUs, medium scale, busy state). This further validates SPD's robustness as a proxy in topologically complex scenarios.
Furthermore, the positive correlation observed in
Figure 3,
Figure 4,
Figure 5 and
Figure 6 has profound implications: it empirically substantiates that greater structural divergence in scheduling plans (
SPD) reliably predicts larger makespan disparities (
FTG), enabling
SPD to steer search toward balanced, low-makespan solutions without exhaustive
FTG evaluations. This not only accelerates convergence in metaheuristic frameworks but also enhances load balancing, as higher
SPD correlates with imbalance-induced delays, directly tying to energy efficiency and SLA compliance on GPU clusters.
The results of the real-world and synthetic experiments confirm a persistent positive correlation between SPD and FTG across varying conditions. In all cases, the PCC value remains above zero. From a statistical perspective, PCC values ranging from 0.16 to 0.57, with an average of 0.4603 and a variance of around 0.003, suggest a consistent trend: as SPD increases, FTG tends to increase as well. These experiments clearly illustrate the increasing trend of FTG with larger values of SPD. This persistent positive correlation, spanning workload scale, topology, cluster size, and utilization, validates SPD as a reliable proxy for FTG. The moderate PCC strength (mean 0.46) is reasonable for DAG scheduling problems, where exact linearity may not exist due to precedence constraints and resource heterogeneity [43,44]; instead, it ensures directional guidance for optimization, reducing makespan by aligning structural adjustments with performance gaps. Notably, the absence of negative correlations across diverse setups and the low variance (0.003) make SPD a computationally tractable surrogate in iterative algorithms such as WORL-RTGS. The global Spearman's ρ reported above reinforces this stability and predictability. In WORL-RTGS, we exploit this relationship to indirectly optimize makespan by minimizing SPD via differentiable, structural updates (Section 4.3). The data and code used to generate these visualizations are publicly available in our GitHub repository (https://github.com/AIprinter/WORL-RTGS.git, accessed on 6 November 2025). This observed correlation plays a crucial role in our algorithm design: it provides theoretical and empirical support for using SPD as a proxy to indirectly control and optimize FTG, thereby improving the efficiency of our scheduling strategy.
4.3. Whale Optimization and Reinforcement Learning-Based Running Time Gap Strategy
We design WORL-RTGS by integrating the WOA with a Deep Reinforcement Learning (DRL) framework. WOA is a nature-inspired metaheuristic optimization algorithm introduced by Mirjalili et al. [7] in 2016, modeled on the unique hunting behavior of humpback whales [46].
In the context of WORL-RTGS, each humpback whale metaphorically represents a candidate scheduling plan. This plan is encoded as an n-dimensional vector, where each dimension specifies the GPU assigned to a corresponding task node. A population of whales is maintained across iterations, with each whale representing a different scheduling plan. Following the core principles of WOA, the optimization process in WORL-RTGS is divided into three phases: the Encircling Prey phase simulates convergence toward an optimal solution, the Spiral Bubble-Net Feeding Maneuver balances exploration and exploitation through a logarithmic spiral, and the Search for Prey phase diversifies the search by moving whales randomly. In each iteration, every whale updates its position by executing one of these three behaviors. However, the traditional WOA mechanism is insufficient for accurately determining the position matrix of a new whale when applied to complex task scheduling problems, especially in high-dimensional GPU assignment spaces. To overcome this limitation, WORL-RTGS integrates a Double Deep Q-Network (DDQN) module within the DRL framework. Based on the direction selected by WOA, the DDQN is responsible for predicting the next-step position matrix for each whale, thereby enhancing the quality and precision of the scheduling plan evolution.
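The encoding itself is simple to illustrate; the array layout below is a minimal rendering of the n-dimensional assignment vector, with illustrative sizes:

```python
import numpy as np

n_tasks, n_gpus, pop_size = 8, 3, 5                 # illustrative sizes
rng = np.random.default_rng(42)

# Each whale is one candidate scheduling plan: position[l] = GPU index of task node l.
population = rng.integers(0, n_gpus, size=(pop_size, n_tasks))
print(population[0])  # e.g. [0 2 1 ...] -> task 0 on GPU 0, task 1 on GPU 2, ...
```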
4.3.1. Encircling Prey
Humpback whales are able to detect prey and move around it. In WORL-RTGS, the prey's position is analogous to the Optimal Scheduling Plan (OSP). Since the exact OSP is unknown beforehand, the algorithm instead designates the best available scheduling plan at each step as the target prey, represented by the leading whale. Once this leader is identified, the remaining whales adjust their positions in relation to it, which is expressed mathematically as

D = \left| C \cdot X^{*}(k) - X(k) \right|, \tag{31}

X(k+1) = X^{*}(k) - A \cdot D, \tag{32}

where k denotes the iteration index, while A and C are coefficients obtained from Equations (33) and (34). The term X^{*}(k) represents the leader's position, i.e., the best scheduling plan in iteration k. Each whale in the population is represented by X(k), and its updated position in the next iteration is given by X(k+1). The symbol |·| indicates element-wise absolute values, and the operator · refers to element-wise multiplication. The distance between a whale X(k) and the leader X^{*}(k) is expressed as D. The definitions of A and C are provided below:

A = 2a \cdot r - a, \tag{33}

C = 2r, \tag{34}

a = 2 - \frac{2k}{k_{\max}}, \tag{35}
where r is a random value uniformly sampled from the interval [0, 1]. The parameter r introduces stochasticity into both A and C, ensuring population diversity and preventing premature convergence. Its value in [0, 1] represents the normalized probability range for random behavior: r = 0 corresponds to no influence from the leader (pure exploration), whereas r = 1 enforces complete attraction toward the leader (pure exploitation). This probabilistic modulation helps balance the two phases during optimization. The coefficient a controls the exploration–exploitation transition. As Equation (35) shows, a decreases linearly from 2 to 0 as the iteration index k increases, meaning that early iterations favor exploration (a larger range for A) and later iterations emphasize exploitation (a smaller A). Consequently, A is confined to the range [-a, a], gradually shrinking to zero as convergence progresses. k_{\max} denotes the maximum number of iterations and governs the temporal decay of a. Together, a, A, and C dynamically adjust the search pressure toward the best scheduling plan X^{*}(k).
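A small sketch of this coefficient schedule (Equations (33)-(35)) may help; the function name is our own:

```python
import numpy as np

def woa_coefficients(k, k_max, rng):
    """Equations (33)-(35): a decays linearly from 2 to 0; A lies in [-a, a]; C in [0, 2]."""
    a = 2.0 - 2.0 * k / k_max           # Eq. (35)
    r = rng.random()                    # r ~ U[0, 1]
    return a, 2.0 * a * r - a, 2.0 * r  # a, A (Eq. (33)), C (Eq. (34))

rng = np.random.default_rng(1)
for k in (0, 50, 99):
    a, A, C = woa_coefficients(k, k_max=100, rng=rng)
    print(f"k={k}: a={a:.2f}, A={A:+.2f}, C={C:.2f}")  # |A| shrinks as k grows
```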
In practice, Equation (31) can measure the actual distance between humpbacks, but applying it directly to scheduling plans is not meaningful. Each dimension of a plan corresponds to a GPU index, and subtracting one index from another provides no useful interpretation. To resolve this, we introduce FTG, defined in Equation (25), as a metric for quantifying differences between humpbacks. However, FTG by itself is insufficient for directly producing new scheduling plans. To overcome this limitation, we leverage its positive correlation with SPD, as discussed in Section 4.2. By translating FTG into SPD, we derive a more actionable metric that facilitates the generation of new humpback positions, corresponding to scheduling plans, throughout the iterative procedure.
It is not feasible to directly generate the next-step humpback whale location, X(k+1), using Equation (32) based solely on a given value of FTG, as computed from Equation (25). The challenge stems from the fact that FTG is a single scalar capturing only the absolute gap in final finish times, while scheduling plans are inherently high-dimensional vectors. As a result, FTG lacks the necessary structure and dimensionality to guide the generation of a new scheduling plan vector directly. To overcome this challenge, we must find a function that not only reflects the behavior of FTG but also supports extension into a multidimensional space consistent with the representation of scheduling plans. This is where our earlier analysis proves useful: we observed a medium positive correlation between SPD and FTG in Section 4.2, meaning that as the SPD between two scheduling plans increases, the corresponding FTG also tends to increase. This correlation enables us to indirectly control FTG by manipulating SPD. Therefore, in the optimization process, we replace the original abstract distance term D in Equation (32) with FTG to guide the humpback's movement. Unlike FTG, SPD is computable from the structural differences between scheduling plans and can be naturally expanded to higher dimensions. This allows us to use SPD not only as a proxy for FTG but also as a foundation for deriving new, high-dimensional scheduling plans through Reinforcement Learning-driven methods.
The next position of a humpback whale, X(k+1), is obtained by first deriving a set of target values, referred to as hope values h_l. Each h_l represents the expected workload, i.e., the total execution time of the GPU to which task node l is expected to be assigned in the upcoming scheduling plan X(k+1). Formally, this is expressed as

h_l = W_{X(k+1)}(l),

where the subscript X(k+1) indicates the plan under which task node l is mapped to its GPU, and W_{X(k+1)}(l) computes the total execution time of the tasks assigned to that GPU under the new plan, consistent with the definition used in Equation (26).
To construct this list of hope values for all task nodes l = 1, …, n, we rely on both the leader whale's location and the current whale's position in iteration k. Specifically, W_{X^{*}(k)}(l) denotes the current load on the GPU handling task node l in the leader whale's scheduling plan X^{*}(k):

W_{X^{*}(k)}(l) = \sum_{j:\, g_{X^{*}(k)}(j) = g_{X^{*}(k)}(l)} t_j,

and W_{X(k)}(l) represents the current load on the GPU handling l in the given humpback's current location X(k):

W_{X(k)}(l) = \sum_{j:\, g_{X(k)}(j) = g_{X(k)}(l)} t_j,

where g_P(j) denotes the GPU assigned to node j under plan P and t_j is the execution time of node j. With both W_{X^{*}(k)}(l) and W_{X(k)}(l) known, the algorithm derives h_l based on the relative influence of the leader and current humpback states, guided by the coefficients defined in the WOA mechanism. This hope value list then serves as a basis for reassigning task nodes in order to construct the whale's next-step scheduling plan X(k+1).
Substituting Equation (31) into Equation (32) and comparing with Equation (25), we obtain the following:

\mathrm{FTG}\left(X^{*}(k), X(k+1)\right) = A \cdot \mathrm{FTG}\left(X^{*}(k), X(k)\right). \tag{39}

Here, D is substituted with FTG to quantify the gap between two scheduling plans. As shown in Equation (39), the gap between the leader whale and its follower in the next step is scaled by the factor A. During this stage, the randomness term C is omitted; assigning C = 1 effectively removes stochasticity and ensures the leader humpback's position remains deterministic.
Since Section 4.2 shows a positive correlation between SPD and FTG, we suggest employing SPD as a substitute for FTG. By making this substitution in Equation (39), we bridge the distance with a more interpretable and extendable metric in multidimensional scheduling space. Thus, the distance between the leader's load distribution and the hope distribution in the next-step schedule can be expressed as

\mathrm{SPD}\left(X^{*}(k), X(k+1)\right) = A \cdot \mathrm{SPD}\left(X^{*}(k), X(k)\right),

\frac{1}{n} \sum_{l=1}^{n} \left| W_{X^{*}(k)}(l) - h_l \right| = A \cdot \frac{1}{n} \sum_{l=1}^{n} \left| W_{X^{*}(k)}(l) - W_{X(k)}(l) \right|.

To derive the above expressions, we make a simplifying assumption: for each task node l, the corresponding terms on both sides of the equation are equal, that is,

\left| W_{X^{*}(k)}(l) - h_l \right| = A \cdot \left| W_{X^{*}(k)}(l) - W_{X(k)}(l) \right|, \quad l = 1, \dots, n.

Under this assumption, the hope load for the GPU hosting each task node at the next position X(k+1) can be calculated through the following transformation:

h_l = W_{X^{*}(k)}(l) \pm A \cdot \left| W_{X^{*}(k)}(l) - W_{X(k)}(l) \right|. \tag{43}

Equation (43) expresses the estimated load on the GPU hosting task node l at the humpback whale's next position. It is calculated from a weighted difference between the leader whale's load and the current whale's load, scaled by the factor A. The ± sign introduces a bifurcation in movement direction, which aligns with the behavior modeled by the WOA, enabling the search to either approach or diverge from the leader depending on the optimization dynamics in a given iteration. WORL-RTGS feeds the value of h_l from Equation (43) into the DDQN module detailed in Section 4.3.4, which then determines the next humpback position according to the optimization objective.
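A possible implementation of the encircling-phase hope values of Equation (43) is sketched below; the load helper and the per-node random choice of the ± branch are our own assumptions:

```python
import numpy as np

def host_gpu_loads(plan, exec_times, n_gpus):
    """W_X(l): total execution time on the GPU hosting each task node l under plan X."""
    loads = np.bincount(plan, weights=exec_times, minlength=n_gpus)
    return loads[plan]

def encircling_hope_values(leader, current, exec_times, n_gpus, A, rng):
    """Eq. (43): h_l = W_leader(l) +/- A * |W_leader(l) - W_current(l)|."""
    w_leader = host_gpu_loads(leader, exec_times, n_gpus)
    w_current = host_gpu_loads(current, exec_times, n_gpus)
    signs = rng.choice([-1.0, 1.0], size=len(leader))   # assumed per-node +/- branch
    return w_leader + signs * A * np.abs(w_leader - w_current)
```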
In WORL-RTGS, every humpback whale chooses a unique direction of movement in each iteration. For each selected path, a distinct derivation process is applied to assess the computational demand on the GPU hosting each task node l at the whale's anticipated next location. For instance, in the Encircling Prey phase, this estimation is formulated in Equation (43), which provides the predicted load values, referred to as hope values, based on the current and leader positions. In the subsequent sections, namely Section 4.3.2 and Section 4.3.3, we apply a similar derivation methodology to the other two behavioral strategies: the Bubble-Net Attacking Method and Search for Prey. These derivations likewise yield a list of hope values tailored to each phase, enabling us to generate updated scheduling plans that reflect the selected movement strategy within the WOA-inspired optimization framework.
4.3.2. Bubble-Net Feeding Strategy (Exploitation Phase)
Humpback whales employ a distinctive hunting technique known as Bubble-Net Feeding [
47]. They submerge approximately 12 m, then generate bubbles along a path resembling a ‘9’ to encircle a school of fish, ascending toward their target. This Bubble-Net Feeding behavior comprises two key mechanisms: a shrinking encirclement process and a spiral position update.
Shrinking encirclement process. In Equation (32), the coefficient A is randomly generated from the interval [-a, a], with a decreasing linearly as the iteration index k grows. Consequently, the absolute value of A gradually reduces over the course of the iterations. This gradual reduction causes the search area around the leader solution to contract progressively. In the context of the Encircling Prey phase, this mechanism reflects the humpback whale's strategy of narrowing its focus and moving closer to the optimal solution as the optimization progresses. The shrinking encirclement guides the algorithm toward exploitation by reducing randomness and encouraging convergence in the later stages of the search process.
Spiral position update. Mirjalili et al. [7] proposed a spiral equation to emulate the distinctive helical motion of humpback whales. The procedure first measures the distance between a whale at X(k) and the current leader at X^{*}(k). Based on this distance, a new position along a spiral path between X(k) and X^{*}(k) is computed using the following formulation:

X(k+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(k),

where D' = \left| X^{*}(k) - X(k) \right| represents the distance between the whale's current position and that of the leader, b is a constant parameter defining the spiral's shape, and l is a random value sampled from the interval [-1, 1]. The random variable l controls the direction and amplitude of the spiral motion. Values of l close to -1 produce outward spirals (encouraging exploration), while values near 1 generate inward spirals that contract toward the leader (favoring exploitation). Sampling l symmetrically around zero allows the algorithm to maintain a balanced probabilistic tendency between expansion and contraction around the current best solution.
As in the Encircling Prey strategy, the Spiral Bubble-Net Movement determines the whale's next position X(k+1) by first estimating the hope load on the GPU assigned to task node l at that location. Using a derivation approach analogous to that of the encircling phase, we obtain the following expression for the hope value:

h_l = \left| W_{X^{*}(k)}(l) - W_{X(k)}(l) \right| \cdot e^{bl} \cdot \cos(2\pi l) + W_{X^{*}(k)}(l). \tag{45}

Humpbacks shrink the encircling circle while simultaneously following a spiral trajectory. To replicate this dual behavior, a random variable p \in [0, 1] is introduced to decide the movement pattern. If p < 0.5, the whale updates its position using the shrinking-encircling mechanism; otherwise, the spiral updating rule is applied as follows:

X(k+1) = \begin{cases} X^{*}(k) - A \cdot D, & p < 0.5, \\ D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(k), & p \geq 0.5. \end{cases}
4.3.3. Searching for Prey (Exploration Phase)
In natural hunting scenarios, humpback whales know the prey's location and head straight toward it. However, when searching for a near-optimal scheduling plan, whales in WORL-RTGS are not guaranteed to converge on a globally optimal solution. Instead, they may become trapped in a local optimum by repeatedly moving toward the currently best-known solution.
To enhance global exploration and avoid local optima, WORL-RTGS adopts a strategy in which a humpback whale is forced to explore new regions of the solution space when |A| ≥ 1. Under this condition, the current leader whale is no longer followed. Instead, a random whale X_{\mathrm{rand}} is selected to act as a temporary guide, and the current whale at position X(k) is repelled from this randomly chosen whale. This mechanism increases diversity in the population and encourages exploration of unvisited scheduling plans. The corresponding position update rule is given by

D = \left| C \cdot X_{\mathrm{rand}}(k) - X(k) \right|, \qquad X(k+1) = X_{\mathrm{rand}}(k) - A \cdot D,

where X_{\mathrm{rand}}(k) represents the scheduling plan of a randomly chosen whale, while D is the distance from the current whale to this whale, scaled by the random factor C.
To derive the expected workload, i.e., the hope value, on the GPU that hosts task node l at the next-step location X(k+1), we apply the same transformation used in the other movement strategies. The resulting expression is

h_l = W_{X_{\mathrm{rand}}(k)}(l) \pm A \cdot \left| W_{X_{\mathrm{rand}}(k)}(l) - W_{X(k)}(l) \right|, \tag{50}

where W_{X_{\mathrm{rand}}(k)}(l) is the current total execution time of all task nodes mapped to the GPU processing l in the randomly selected whale's plan. This equation allows WORL-RTGS to estimate the updated task distribution under the exploration-driven movement and facilitates the generation of new scheduling candidates with greater diversity.
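The three hope-value rules can be gathered into one dispatch mirroring the WOA behavior selection; this sketch implements Equations (43), (45), and (50) under the same illustrative assumptions as before (the load helper and the per-node ± choice are ours):

```python
import numpy as np

def host_gpu_loads(plan, exec_times, n_gpus):
    loads = np.bincount(plan, weights=exec_times, minlength=n_gpus)
    return loads[plan]  # W_X(l) for every task node l

def hope_values(leader, current, population, exec_times, n_gpus, A, b, rng):
    """Select a WOA behavior and return the per-node hope value list."""
    w_cur = host_gpu_loads(current, exec_times, n_gpus)
    p, l = rng.random(), rng.uniform(-1.0, 1.0)
    if p >= 0.5:                                   # spiral bubble-net, Eq. (45)
        w_lead = host_gpu_loads(leader, exec_times, n_gpus)
        return np.abs(w_lead - w_cur) * np.exp(b * l) * np.cos(2 * np.pi * l) + w_lead
    if abs(A) < 1.0:                               # encircling prey, Eq. (43)
        ref = host_gpu_loads(leader, exec_times, n_gpus)
    else:                                          # search for prey, Eq. (50)
        rand_plan = population[rng.integers(len(population))]
        ref = host_gpu_loads(rand_plan, exec_times, n_gpus)
    signs = rng.choice([-1.0, 1.0], size=len(current))  # assumed per-node +/- branch
    return ref + signs * A * np.abs(ref - w_cur)
```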
Having introduced the three behavioral mechanisms of the humpback whale (Encircling Prey, Bubble-Net Attacking, and Search for Prey), we have derived three corresponding sets of target values, namely the hope value lists given by Equations (43), (45), and (50). Each set represents the desired workload outcomes for task nodes under one of the three movement strategies, guiding how the whale (i.e., the scheduling plan) should evolve in the next iteration. The next step is to translate these hope values into a concrete next-step scheduling plan, i.e., the whale's updated position. To achieve this, the WORL-RTGS algorithm employs a Double Deep Q-Network (DDQN) framework. The DDQN is used to generate a scheduling plan that aligns as closely as possible with the desired workload distribution specified in the corresponding hope list, thereby guiding the whale, or scheduling plan, toward more optimal solutions in a structured and adaptive manner.
4.3.4. DDQN-Based Whale Position Generation
To generate an optimized scheduling plan X(k+1), we design a Double Deep Q-Network (DDQN) module that predicts the optimal GPU assignment for each task node based on its expected workload, denoted as h_l. The DDQN agent is trained to learn a mapping strategy that minimizes the deviation between the actual and expected GPU workloads, thereby enhancing scheduling efficiency in high-dimensional task allocation scenarios. The corresponding Markov Decision Process (MDP) is defined as follows.
Each state s characterizes the scheduling environment of task node l and includes the following components: the index of the current task node l; the expected workload vector H = (h_1, \dots, h_n), where n denotes the total number of task nodes and each entry corresponds to the desired total load on the GPU to which the node is expected to be assigned; the computational workload vector W = (w_1, \dots, w_n) for all task nodes (e.g., measured in FLOPs); and the current GPU load vector G = (g_1, \dots, g_m), where m is the number of GPUs. Thus, each state is represented as

s = (l, H, W, G).
The action space consists of all available GPUs:

\mathcal{A} = \{1, 2, \dots, m\}.

Each action a \in \mathcal{A} corresponds to assigning task node l to a specific GPU.
The reward function encourages the agent to minimize the deviation between the predicted and desired GPU workloads:

r = -\left| \hat{g}_a - h_l \right|.

Here, \hat{g}_a denotes the predicted load on GPU a after assigning task node l to it. The closer this predicted load is to the desired workload value h_l, the higher (i.e., less negative) the reward.
The training objective is to derive a scheduling policy that reduces the overall Load Deviation (LD) once all task nodes have been allocated. The LD is defined as

\mathrm{LD} = \frac{1}{n} \sum_{l=1}^{n} \left| W_{X(k+1)}(l) - h_l \right|,

where W_{X(k+1)}(l) denotes the total runtime of all tasks assigned to the same GPU as l in the final plan, and h_l represents the expected workload prior to scheduling.
The DDQN module is implemented using a lightweight multilayer perceptron (MLP) architecture. The network receives a state vector as input, which encodes the scheduling environment of the current task node. The output is a Q-value vector Q(s, \cdot) \in \mathbb{R}^m, where each element corresponds to the estimated cumulative reward of assigning task node l to one of the m available GPUs. The network consists of an input layer whose size matches the dimension of the state vector. This is followed by two hidden layers containing 128 and 64 units, respectively, each using the ReLU activation function. The final output layer has m units and uses a linear activation to produce Q-values for each possible GPU assignment.
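A PyTorch rendering of this network is sketched below; the state dimension 1 + 2n + m follows the state layout above and should be read as an assumption:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP mapping a state vector to m Q-values, one per candidate GPU."""
    def __init__(self, n_tasks: int, n_gpus: int):
        super().__init__()
        state_dim = 1 + 2 * n_tasks + n_gpus   # assumed: node index, H, W, G
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_gpus),             # linear output: Q(s, a) per GPU
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_online = QNetwork(n_tasks=100, n_gpus=8)
q_values = q_online(torch.randn(1, 1 + 2 * 100 + 8))  # -> shape (1, 8)
```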
During training, we employ experience replay and the Double DQN target update strategy to stabilize learning. The Q-value update is computed as

y = r + \gamma\, Q_{\theta^{-}}\!\left(s', \arg\max_{a'} Q_{\theta}(s', a')\right),

with the corresponding loss function defined as

L(\theta) = \mathbb{E}\left[\left(y - Q_{\theta}(s, a)\right)^2\right].

To mitigate overestimation bias, the parameters \theta^{-} of the target network are periodically synchronized with those of the online network, \theta.
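One training step under this rule might look as follows, given batched tensors from the replay buffer; the hyperparameter names are placeholders:

```python
import torch
import torch.nn.functional as F

def ddqn_update(q_online, q_target, optimizer, batch, gamma=0.99):
    """Double DQN step: online net selects actions, target net evaluates them."""
    s, a, r, s_next = batch                       # tensors sampled from the replay buffer
    with torch.no_grad():
        best_next = q_online(s_next).argmax(dim=1, keepdim=True)     # action selection
        target_q = q_target(s_next).gather(1, best_next).squeeze(1)  # action evaluation
        y = r + gamma * target_q                  # y = r + gamma * Q_target(s', a*)
    q_sa = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)                    # L = E[(y - Q(s, a))^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```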
The pseudocode of the DDQN-based whale position generation method is presented in Algorithm 1. This method is applied in conjunction with the three behavioral mechanisms of the humpback whale (Encircling Prey, Bubble-Net Attacking, and Search for Prey), each of which yields a corresponding set of target values, namely the hope value lists of Equations (43), (45), and (50). The complete pseudocode of the overall WORL-RTGS algorithm is provided in Algorithm 2. We begin by presenting Algorithm 1, followed by the full process of the WORL-RTGS algorithm in Algorithm 2. Algorithm 1 presents the DDQN-based task-to-GPU scheduling method. In each training episode, task nodes are scheduled sequentially according to a topological order. For each task node, the current state is constructed and used as input to the online Q-network to select the GPU with the highest predicted Q-value. The selected action is applied by assigning the task to the chosen GPU and updating the current load. A reward is computed based on the deviation between the expected and predicted GPU load, and the transition tuple (s, a, r, s') is stored in the replay buffer. Mini-batches drawn from the buffer are employed to update the online network through Q-learning, with target Q-values computed using the Double DQN approach. The target network is synchronized with the online network at regular intervals. After training, the learned Q-network is used to generate the final scheduling plan X(k+1). Algorithm 2 outlines the integrated optimization procedure of the WORL-RTGS algorithm, which combines the Whale Optimization Algorithm (WOA) with the DDQN-based scheduling module. The algorithm begins by initializing a population of humpback whales and evaluating their fitness values using Equation (17). The best-performing whale is selected as the current leader. At each iteration, whale positions are updated according to one of three behaviors, Encircling Prey, Searching for Prey, or Spiral Bubble-Net Attack, determined by the randomly chosen control parameters r and p. For each behavior, a corresponding hope value list is computed, and the DDQN module of Algorithm 1 is invoked to guide the position update. After all whales are updated, their fitness values are re-evaluated and the leader is updated accordingly. The process continues until the maximum iteration limit is reached, at which point the optimal solution X^{*} is delivered. As iterations progress, the magnitude of a steadily declines from 2 to 0, facilitating a shift from exploration to exploitation. In each cycle, WORL-RTGS chooses between a spiral or circular trajectory based on the randomly determined parameter p. Once a humpback whale selects its movement pattern, a set of hope values is computed. Guided by the Q-network, a new whale position is established by fulfilling these hope values. Successful fulfillment of the hope values realizes the intended SPD and, owing to the positive correlation between SPD and FTG discussed in Section 4.2, drives the FTG objective as well. For reference, the primary parameters utilized in the algorithm are outlined in Table 4.
| Algorithm 1 DDQN-Based Task-to-GPU Scheduling Algorithm. |
- 1: Input: Task set V, initial GPU loads G, hope vector H
- 2: Output: Scheduling plan X(k+1)
- 3: Initialize Q_θ, Q_{θ^-}, and replay buffer B
- 4: for episode = 1 to E_max do
- 5:   for each task node l in topological order do
- 6:     Construct state s
- 7:     a ← argmax_{a'} Q_θ(s, a')
- 8:     Assign l to GPU a, update G
- 9:     Compute reward r
- 10:    Construct next state s'
- 11:    Store (s, a, r, s') in B
- 12:    Sample a batch from B
- 13:    for each (s, a, r, s') in the batch do
- 14:      a* ← argmax_{a'} Q_θ(s', a')
- 15:      y ← r + γ Q_{θ^-}(s', a*)
- 16:    Update Q_θ with loss (y − Q_θ(s, a))²
- 17:    Periodically update θ^- ← θ
- 18: return Final scheduling plan X(k+1)
|
The overall computational complexity of the WORL-RTGS algorithm is determined by the interaction between the Whale Optimization loop and the DDQN-based scheduling procedure embedded within each iteration. In the DDQN-based whale next-position generation method (Algorithm 1), let B denote the batch size used in the replay buffer during training, n denote the number of task nodes, and E_max denote the number of training episodes. In each episode, the algorithm iterates over all n task nodes. For each task node, it performs state construction and action selection using the Q-network, O(m), where m is the number of GPUs; reward computation and buffer update, O(1); and backpropagation on the sampled batch of size B, O(B). Thus, the total complexity of the DDQN module per episode is O(n(m + B)), and the overall complexity across E_max episodes is O(E_max · n(m + B)). In the WORL-RTGS optimization loop (Algorithm 2), let N_w denote the number of whales in the population and k_max the total number of optimization iterations. In each iteration, the algorithm evaluates and updates the position of each whale using one of the three behavioral strategies, and each position update invokes the DDQN module once. Therefore, the total complexity of the WORL-RTGS algorithm is

O\left(k_{\max} \cdot N_w \cdot E_{\max} \cdot n \cdot (m + B)\right).

This reflects the nested nature of the hybrid metaheuristic-learning framework, where each metaheuristic update is guided by a learned scheduling policy.
| Algorithm 2 WORL-RTGS Integrated Optimization Procedure. |
- 1: Input: Maximum number of iterations k_max; number of humpback whales N_w;
- 2: Randomly initialize the population of humpback whales X_i, where i = 1, …, N_w;
- 3: Compute the fitness value F(X_i) of each whale according to Equation (17);
- 4: Identify the whale with the smallest F(X_i) and assign it as the leader X^*;
- 5: Initialize iteration counter: k ← 1;
- 6: while k ≤ k_max do
- 7:   for each whale X_i in the population do
- 8:     Randomly generate control parameters r and p;
- 9:     Update the coefficient A using Equation (33);
- 10:    if p < 0.5 then
- 11:      if |A| < 1 then
- 12:        Compute the hope value list using Equation (43);
- 13:        Update the whale position using the DDQN-based method (Algorithm 1);
- 14:      else if |A| ≥ 1 then
- 15:        Randomly select a whale and assign it as X_rand;
- 16:        Compute the hope value list using Equation (50);
- 17:        Update the whale position using the DDQN-based method (Algorithm 1);
- 18:    else if p ≥ 0.5 then
- 19:      Update A using the spiral equation: A ← e^{bl} cos(2πl), with l sampled from [−1, 1];
- 20:      Compute the hope value list using Equation (45);
- 21:      Update the whale position using the DDQN-based method (Algorithm 1);
- 22:    Recalculate the fitness F(X_i) for all whales;
- 23:    Update the leader X^* based on the best fitness;
- 24:    Increment iteration: k ← k + 1;
- 25: Output: Final optimal solution X^*;
|
The space complexity of the DDQN-based scheduler mainly consists of three parts: the parameters of the online and target Q-networks, the experience replay buffer, and auxiliary data structures for state and load tracking. Assuming an L-layer neural network with hidden dimension h, the space required for the network parameters is O(L · h²). Since both Q_θ and Q_{θ^-} are maintained, the total remains O(2 · L · h²) = O(L · h²). The replay buffer stores |B| experience tuples, each containing a state, action, reward, and next state, leading to a space cost of O(|B| · d_s), where d_s is the state vector dimension. Additionally, the scheduling plan and GPU load tracking introduce negligible overhead, O(n + m). In total, the space complexity is O(L · h² + |B| · d_s + n + m).