D3QN-Guided Sand Cat Swarm Optimization with Hybrid Exploration for Multi-Objective Cloud Task Scheduling
Abstract
1. Introduction
1.1. Background and Motivation
1.2. Related Work
1.2.1. Heuristic and Meta-Heuristic Methods
1.2.2. Hybrid Algorithm
1.2.3. Learning-Based Approach
1.3. Main Contributions
- We propose the MoSCO algorithm, a hybrid framework that combines the rapid environmental perception of a D3QN agent with the multi-objective search capabilities of the Sand Cat Swarm Optimization algorithm.
- We design a knowledge-injection mechanism in which the D3QN network extracts task features to guide population initialization, aiming to reduce blind exploration commonly observed in traditional metaheuristics.
- We formulate a multi-objective optimization model for dynamic cloud environments that accounts for node failure probabilities and simultaneously optimizes Makespan, total task Tardiness, and Average Resource Utilization.
- We evaluate the proposed MoSCO algorithm in simulation against PureD3QN, NSGA-II, and H-SCSO. In the reported results, MoSCO attains the lowest Tardiness (4187), the highest Utilization (0.922), the shortest Makespan (528), and the lowest CPI (0.2698) among the four algorithms, indicating that it balances global exploration with local exploitation while remaining stable.
2. Materials and Methods
2.1. System Model
2.1.1. Environment Model
- Resource model: The data center contains M heterogeneous virtual machines. Resource heterogeneity manifests as differences in processing capability, so the same task takes different processing times on different virtual machines. Each virtual machine maintains a state object recording its current task queue and start/end times (Start, End). To simulate real-world uncertainty, a random failure mechanism is introduced: during task scheduling, a virtual machine fails with a base failure probability, BP, and then incurs a random repair time, Repair_time. Notably, high-load nodes (those in the top 90% by completion time) are assigned a higher failure probability.
- Task Characteristics: Task flows arrive dynamically following a stochastic process. The arrival intervals of the tasks follow an exponential distribution, and each task $j$ has a distinct arrival time $a_j$. The processing relationship between tasks and resources is defined by a two-dimensional matrix $P = [p_{j,m}]$, where $p_{j,m}$ is the execution time of task $j$ on virtual machine $m$ (a designated marker value indicates incompatibility). QoS objectives are defined by the desired completion time $D_j$, which is the sum of the arrival time and the estimated processing duration. If the task completion time $C_j$ exceeds $D_j$, the task is tardy and a delay penalty is incurred.
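The task model above can be illustrated with a small generator. This is a minimal sketch, not the authors' simulator: the function name `generate_tasks`, the uniform range of execution times, the 5% incompatibility rate, the use of infinity as the incompatibility marker, and the choice of the mean compatible execution time as the "estimated processing duration" are all assumptions made for illustration.

```python
import numpy as np

def generate_tasks(num_tasks, num_vms, mean_interarrival=1.0, seed=0):
    """Sample arrival times, an execution-time matrix, and desired completion times."""
    rng = np.random.default_rng(seed)
    # Exponential inter-arrival gaps, accumulated into arrival times.
    gaps = rng.exponential(mean_interarrival, size=num_tasks)
    arrival = np.cumsum(gaps)
    # Heterogeneous execution times: one row per task, one column per VM (assumed range).
    exec_time = rng.uniform(5.0, 20.0, size=(num_tasks, num_vms))
    # Assumption: incompatible (task, VM) pairs are marked with infinity.
    incompatible = rng.random((num_tasks, num_vms)) < 0.05
    # Guarantee that every task keeps at least one compatible VM.
    incompatible[np.arange(num_tasks), rng.integers(num_vms, size=num_tasks)] = False
    exec_time[incompatible] = np.inf
    # Assumption: estimated duration = mean execution time over compatible VMs.
    estimate = np.array([row[np.isfinite(row)].mean() for row in exec_time])
    due = arrival + estimate
    return arrival, exec_time, due

arrival, exec_time, due = generate_tasks(num_tasks=8, num_vms=3)
```

Each row of `exec_time` plays the role of one task's row in the processing matrix, and `due` holds the desired completion times used later for tardiness.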
2.1.2. Multi-Objective Optimization Model
- 1. Minimize Maximum Completion Time: The maximum completion time (Makespan) is defined as the maximum value among all task completion times, reflecting the system throughput:
$$C_{\max} = \max_{j \in J} C_j$$
where $J$ is the set of all tasks, and $C_j$ is the completion time of task $j$.
- 2. Minimize Timeliness Deviation: This metric measures how far a scheduling strategy deviates from the expected completion times. It is minimized as the cumulative lag between the actual and expected completion times of all tasks:
$$T = \sum_{j \in J} \max\left(0,\; C_j - D_j\right)$$
where $C_j$ is the actual completion time and $D_j$ is the expected completion time of task $j$. The $\max(0, \cdot)$ function ensures that the difference is only included in the cumulative deviation when the task falls behind the expected time; if the task is completed ahead of schedule or on time, this term is zero, indicating full compliance with expectations.
- 3. Maximize Average Utilization: This metric measures resource utilization efficiency and is defined as the average of the proportion of busy time across all virtual machines:
$$U_{avg} = \frac{1}{M} \sum_{m=1}^{M} U_m$$
where $U_m$, the proportion of busy time of virtual machine $m$, is calculated by Equation (6).
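- 4. Comprehensive Optimization Objective: In summary, this paper aims to identify the Pareto optimal strategy by optimizing the vector objective function:
$$\min F = \left[\, C_{\max},\; T,\; -U_{avg} \,\right]$$
where $C_{\max}$ is the Makespan, $T$ is the total Tardiness, and $U_{avg}$ is the Average Utilization (negated so that all three components are minimized).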
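As a worked illustration of the three objectives, the following sketch computes them for a finished schedule. The function name `evaluate_schedule` and the definition of per-machine utilization as busy time divided by a common observation horizon are assumptions, since the equation that defines per-machine utilization is not reproduced in this text.

```python
import numpy as np

def evaluate_schedule(completion, due, busy_time, horizon):
    """Return (makespan, total tardiness, average utilization) for one schedule."""
    completion = np.asarray(completion, dtype=float)
    due = np.asarray(due, dtype=float)
    busy_time = np.asarray(busy_time, dtype=float)
    # Makespan: the latest completion time over all tasks.
    makespan = completion.max()
    # Tardiness: only tasks finishing after their expected time contribute.
    tardiness = np.maximum(0.0, completion - due).sum()
    # Assumption: per-VM utilization = busy time / observation horizon.
    utilization = (busy_time / horizon).mean()
    return makespan, tardiness, utilization

# Three tasks on two VMs: only the second task is late (by 2 time units).
print(evaluate_schedule([10, 20, 15], [12, 18, 15], [10, 15], horizon=20))
```

In this example the makespan is 20, the total tardiness is 2, and the average utilization is 0.625.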
2.2. The Proposed MoSCO Algorithm
2.2.1. Overall Algorithm Framework
- D3QN Guidance Module: This module perceives the environmental state s and, based on the D3QN agent’s current policy, outputs a candidate action probability distribution for the current task. This distribution is passed to the optimization module as the ’elite individual’. Simultaneously, the module interacts with the Replay Memory Buffer, continuously training the network on historical experience tuples to improve guidance quality.
- SCSO Optimization Module: This module receives the elite individual and incorporates it into the initial population via a Knowledge Injection mechanism, combining it with randomly generated individuals to balance solution quality and diversity. Subsequently, SCSO iterative optimization and Pareto non-dominated sorting are performed to select the optimal scheduling policy from the resulting Pareto frontier.
- Execution and Feedback Loop: The optimal policy is applied to the cloud environment. The environment’s multidimensional performance metrics are normalized and weighted to produce a scalar reward. Finally, the new state and the reward are fed back into the D3QN module, forming a complete adaptive closed loop of “guidance-optimization-feedback.”
2.2.2. D3QN Guidance Module
- 1. State and Action Space Definition:
- State Space: To characterize the dynamic cloud environment, the state $s_t$ is defined as a six-dimensional continuous vector encompassing resource load characteristics and task queue status. Its components are, in order: the average resource utilization rate; the standard deviation of resource utilization; the average task completion rate; the standard deviation of the task completion rate; the proportion of pending tasks; and the proportion of urgent tasks, i.e., the percentage of tasks approaching their expected completion time. To enhance training stability, all state features are normalized before being fed into the network.
- Action Space: To address the convergence challenges arising from large-scale combinatorial action spaces, this paper proposes a dynamic candidate action generation mechanism based on $\epsilon$-greedy selection. At decision time t, the algorithm constructs a candidate action set of fixed size, following the strategy outlined below:
- Heuristic Exploitation: Select the tasks with the tightest deadlines for the candidate pool, ensuring priority response to urgent demands.
- Random Exploration: With probability $\epsilon$, sample randomly from the remaining feasible tasks to maintain diversity in the solution space and avoid becoming stuck in local optima.
The output action $a_t$ represents the index of the selected virtual machine at step t. This discrete action is then converted into a “one-hot” probability vector, which serves as the initial elite individual for the subsequent SCSO micro-optimization phase.
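One possible reading of the candidate generation strategy is sketched below. It is an interpretation rather than the authors' procedure: the function name `build_candidates`, the parameter `k` for the candidate set size, and the per-slot coin flip (with probability `epsilon` a random remaining task is taken, otherwise the most urgent remaining one) are assumptions.

```python
import numpy as np

def build_candidates(due, pending, k, epsilon, rng):
    """Build a candidate list of at most k pending task ids.

    Assumed rule: each slot takes the most urgent remaining task, except that
    with probability epsilon it takes a uniformly random remaining task.
    """
    # Sort pending task ids so that the tightest deadline comes first.
    remaining = sorted(pending, key=lambda j: due[j])
    chosen = []
    while remaining and len(chosen) < k:
        if rng.random() < epsilon:
            idx = int(rng.integers(len(remaining)))  # random exploration
        else:
            idx = 0  # heuristic exploitation: most urgent remaining task
        chosen.append(remaining.pop(idx))
    return chosen

rng = np.random.default_rng(0)
due = {0: 30.0, 1: 12.0, 2: 25.0, 3: 8.0, 4: 40.0}
print(build_candidates(due, pending=[0, 1, 2, 3, 4], k=3, epsilon=0.2, rng=rng))
```

With `epsilon = 0` the sketch reduces to a pure earliest-deadline selection, and with `epsilon = 1` to uniform sampling without replacement.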
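- 2. D3QN Network Architecture and Learning Process: We employ the Dueling Double Deep Q-Network (D3QN) as the value function approximator. Its architecture consists of a shared feature extraction layer and two independent output streams: the state value stream $V(s)$ and the action advantage stream $A(s, a)$. The final Q-value is aggregated using the following formula:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)$$
where $\theta$, $\beta$, and $\alpha$ are the network parameters for the shared layer, value stream, and advantage stream, respectively, and $|\mathcal{A}|$ is the number of actions.
The D3QN agent learns by minimizing the temporal difference (TD) error. After sampling an experience tuple $(s_t, a_t, r_t, s_{t+1})$ from the experience replay pool, the loss function is defined as the mean squared error between the predicted Q-value and the target value $y_t$:
$$L(\theta) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta) \right)^2 \right]$$
In particular, $y_t$ is estimated using the Double DQN mechanism, which combines the immediate reward with a future value estimate:
$$y_t = r_t + \gamma\, Q\left( s_{t+1},\; \arg\max_{a'} Q(s_{t+1}, a'; \theta);\; \theta^{-} \right)$$
Here, $\gamma$ is the discount factor, and $\theta$ and $\theta^{-}$ denote the full parameter sets of the online network and target network, respectively. The scalar reward $r_t$ is obtained by linearly weighting the normalized multi-objective reward vector according to preset weights.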
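The aggregation and target computations described above can be checked numerically without a deep learning framework. The sketch below uses plain NumPy arrays in place of network outputs; the function names, the mean-subtracted form of the dueling aggregation, and the `done` mask for terminal transitions are assumptions for illustration.

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine V(s) of shape (batch, 1) and A(s, a) of shape (batch, n_actions)."""
    value = np.asarray(value, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    # Subtract the mean advantage so that V and A are identifiable.
    return value + advantage - advantage.mean(axis=1, keepdims=True)

def double_dqn_target(reward, q_next_online, q_next_target, gamma=0.99, done=None):
    """Online network selects the next action; target network evaluates it."""
    reward = np.asarray(reward, dtype=float)
    q_next_online = np.asarray(q_next_online, dtype=float)
    q_next_target = np.asarray(q_next_target, dtype=float)
    done = np.zeros_like(reward) if done is None else np.asarray(done, dtype=float)
    best = q_next_online.argmax(axis=1)                       # action selection
    next_value = q_next_target[np.arange(len(reward)), best]  # action evaluation
    return reward + gamma * (1.0 - done) * next_value

def td_loss(q_pred, target):
    """Mean squared TD error between predicted and target Q-values."""
    diff = np.asarray(q_pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

print(dueling_q([[1.0]], [[1.0, 3.0]]))
print(double_dqn_target([1.0], [[0.2, 0.9]], [[5.0, 3.0]], gamma=0.5))
```

In the second call the online network prefers action 1, so the target uses the target network's value for that action (3.0) rather than its own maximum (5.0), which is the point of the Double DQN decoupling.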
2.2.3. SCSO Multi-Objective Optimization Module
- 1. Knowledge-Guided Population Initialization: Instead of mapping a sequence of actions for a batch of tasks, the proposed method operates dynamically, step by step. At each decision time step t, the D3QN provides a macro-level assignment direction for the current task, which is then injected into the SCSO population to accelerate the local search.
- Elite Solution Construction: At time step t, given the current system state $s_t$, the D3QN agent selects the optimal target virtual machine by maximizing the Q-value:
$$a_t^{*} = \arg\max_{a} Q(s_t, a; \theta)$$
To map this discrete decision into the continuous search space of the SCSO algorithm, we define each individual in the population as an M-dimensional continuous probability distribution vector. This vector represents the preference or probability of assigning the current task to each of the M available virtual machines. The optimal action recommended by the D3QN is transformed into a one-hot encoded vector and directly injected as the first elite individual $X_1$:
$$X_1 = [x_1, \dots, x_M], \qquad x_m = \begin{cases} 1, & m = a_t^{*} \\ 0, & \text{otherwise} \end{cases}$$
- Diversity Preservation: To maintain search space diversity and prevent premature convergence, the remaining individuals $X_i$ ($i = 2, \dots, N$) are randomly generated. Specifically, they are sampled from a Dirichlet distribution, $X_i \sim \mathrm{Dir}(\boldsymbol{\alpha})$, which naturally ensures that each generated vector satisfies the probability constraint (non-negative entries that sum to one).
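The knowledge-injection step can be sketched in a few lines. The function name `init_population` and the symmetric concentration value of 1.0 are assumptions; the text does not state which Dirichlet parameters are used.

```python
import numpy as np

def init_population(q_values, pop_size, rng, concentration=1.0):
    """Row 0 is the one-hot elite from the Q-values; the rest are Dirichlet samples."""
    q_values = np.asarray(q_values, dtype=float)
    m = q_values.shape[0]
    # Elite individual: one-hot vector at the VM with the highest Q-value.
    elite = np.zeros(m)
    elite[int(q_values.argmax())] = 1.0
    # Remaining individuals: points on the probability simplex.
    # Assumption: symmetric Dirichlet with the given concentration.
    rest = rng.dirichlet(np.full(m, concentration), size=pop_size - 1)
    return np.vstack([elite, rest])

rng = np.random.default_rng(0)
population = init_population(q_values=[0.1, 0.7, 0.3, 0.2], pop_size=5, rng=rng)
print(population.shape, population[0])
```

Every row is a valid probability vector over the M virtual machines, so the elite and the random individuals live in the same search space and can be mixed by the later SCSO operators.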
- 2. Multi-Objective Evaluation and Pareto Elite Selection: To accurately select high-quality solutions during the iteration process, this section introduces a decoding evaluation and diversity preservation mechanism.
- Fitness Evaluation: Since each individual $X_i$ is a continuous probability distribution vector, the algorithm first employs an Argmax strategy to decode it into a discrete scheduling action $a_i = \arg\max_m X_{i,m}$. Subsequently, this action is simulated in an isolated, temporary replica of the current cloud environment. By executing this single assignment, we compute its immediate impact on the global system state across three dimensions: the updated Makespan, the newly accumulated total Tardiness, and the real-time average Utilization. This ensures that the global consequences of a single-step decision can be accurately evaluated without interfering with the DRL main environment state.
- Pareto Elitism: Non-dominated sorting is performed on the fitness vectors to identify the Pareto frontier. To maintain a fixed population size N and prevent premature convergence, a Crowding Distance mechanism is introduced: when the number of non-dominated solutions exceeds the population capacity, individuals with low crowding distance are removed first, so that sparsely distributed solutions are retained and population diversity is preserved. The resulting elite set guides the evolution of the next generation.
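The elite selection described above follows the familiar non-dominated filtering plus crowding distance pattern. The sketch below is a generic version of that pattern, not the authors' code; all function names are hypothetical, and every objective is expressed as a cost to minimize (utilization is negated).

```python
import numpy as np

def to_minimization(makespan, tardiness, utilization):
    """Stack objectives as costs; utilization is negated because it is maximized."""
    return np.column_stack([makespan, tardiness, -np.asarray(utilization, dtype=float)])

def non_dominated_mask(costs):
    """True for rows that no other row dominates (all costs minimized)."""
    costs = np.asarray(costs, dtype=float)
    mask = np.ones(costs.shape[0], dtype=bool)
    for i in range(costs.shape[0]):
        dominators = np.all(costs <= costs[i], axis=1) & np.any(costs < costs[i], axis=1)
        mask[i] = not dominators.any()
    return mask

def crowding_distance(costs):
    """Crowding distance; boundary points of each objective get infinity."""
    costs = np.asarray(costs, dtype=float)
    n, k = costs.shape
    dist = np.zeros(n)
    for c in range(k):
        order = np.argsort(costs[:, c])
        span = costs[order[-1], c] - costs[order[0], c]
        dist[order[0]] = dist[order[-1]] = np.inf
        if span > 0 and n > 2:
            dist[order[1:-1]] += (costs[order[2:], c] - costs[order[:-2], c]) / span
    return dist

def select_elites(costs, capacity):
    """Indices of the Pareto front, truncated by crowding distance if too large."""
    costs = np.asarray(costs, dtype=float)
    front = np.flatnonzero(non_dominated_mask(costs))
    if len(front) <= capacity:
        return front
    dist = crowding_distance(costs[front])
    keep = np.argsort(-dist, kind="stable")[:capacity]  # most isolated first
    return front[np.sort(keep)]

costs = [[1, 4], [2, 3], [3, 3], [4, 1], [2.1, 2.9]]
print(select_elites(costs, capacity=3))
```

In the two-objective example, the point `[3, 3]` is dominated by `[2, 3]`; of the four remaining points, the two extremes are always kept and the more isolated interior point `[2.1, 2.9]` wins the last slot.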
- 3. Population Renewal (Exploration and Attack): Based on the selected elite population, the SCSO module generates the next-generation population by simulating the predatory behavior of sand cats. This process is regulated by the adaptive sensitivity parameter $r_G$, which decreases linearly from its initial value to its final value as the iteration count l increases. To balance solution quality and diversity, the algorithm combines elite retention with evolutionary generation: first, the top-ranked individuals from the elite population are copied directly into the next generation. Subsequently, the remaining elite individuals serve as parents, and each remaining population slot is filled by applying one of the following two operators, chosen at random with equal (50%) probability:
- Global Exploration: This operator simulates the wide-area search behavior of sand cats. Random perturbations with amplitude controlled by $r_G$ are applied to a selected elite individual $X_{e}$ to generate a new individual $X_{new}$, thereby escaping local optima:
$$X_{new} = X_{e} + r_G \cdot \mathcal{N}(0, I)$$
Here, $\mathcal{N}(0, I)$ denotes a standard normal random vector. A larger value of $r_G$ ensures that the algorithm possesses a broad search horizon during its initial stages.
- Local Attack: This operator simulates the sand cat’s precise attack on its prey. Because multi-objective optimization yields multiple non-dominated Pareto-optimal solutions, there is no single globally optimal solution. Therefore, the algorithm employs a random guidance strategy: it randomly samples an individual from the current elite set as a temporary “prey” target, and the individual being updated is moved toward this target. This mechanism not only enhances local exploitation but also guides the population to distribute along the entire Pareto frontier, preventing the solution set from becoming overly concentrated.
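The renewal step can be sketched as follows, with several loudly labeled assumptions, because the exact update formulas are not reproduced in this text: the sensitivity bounds of 2.0 and 0.0, the attack step as a random interpolation toward the prey, parent sampling from the whole elite set, and the clipping and renormalization that keeps each individual a probability vector are all illustrative choices, and every function name is hypothetical.

```python
import numpy as np

def sensitivity(iteration, max_iter, r_start=2.0, r_end=0.0):
    """Linearly decaying sensitivity parameter (bounds are assumed values)."""
    return r_start + (r_end - r_start) * (iteration / max_iter)

def project_to_simplex(x):
    """Assumption: clip to positive values and renormalize to sum to one."""
    x = np.clip(x, 1e-12, None)
    return x / x.sum()

def renew_population(elites, pop_size, n_keep, r_g, rng):
    """Copy the first n_keep elites, then fill the rest with the two operators."""
    elites = np.asarray(elites, dtype=float)
    new_pop = [elites[i].copy() for i in range(min(n_keep, len(elites)))]
    while len(new_pop) < pop_size:
        # Assumption: parents are drawn uniformly from the whole elite set.
        parent = elites[rng.integers(len(elites))]
        if rng.random() < 0.5:
            # Global exploration: Gaussian perturbation scaled by r_g.
            child = parent + r_g * rng.standard_normal(parent.shape)
        else:
            # Local attack (assumed form): random step toward a random elite "prey".
            prey = elites[rng.integers(len(elites))]
            child = parent + rng.random() * (prey - parent)
        new_pop.append(project_to_simplex(child))
    return np.vstack(new_pop)

rng = np.random.default_rng(0)
elites = np.array([[1.0, 0.0, 0.0], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]])
r_g = sensitivity(iteration=3, max_iter=10)
print(renew_population(elites, pop_size=6, n_keep=2, r_g=r_g, rng=rng))
```

Because the perturbation amplitude shrinks with `r_g`, early generations spread widely over the simplex while later generations stay close to the elite set.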
2.2.4. DRL Feedback and Learning Mechanism
- 1. Reward Definition and Scaling: To transform lagging system performance into immediate feedback, we define a multi-objective reward vector $\mathbf{R}_t$ composed of delay penalties, utilization rewards, and completion time penalties. Given the substantial differences in the scale of each component, the algorithm employs dynamic Z-score normalization based on exponential moving averages. The normalized vector $\hat{\mathbf{R}}_t$ is then combined with the preference weights W to compute the final scalar reward $r_t$:
$$r_t = W \cdot \hat{\mathbf{R}}_t = \sum_{k=1}^{3} w_k \hat{R}_{t,k}$$
where $W = [w_1, w_2, w_3]$ holds the preference weights for the three components. By setting different values for W, we can simulate scenarios where users have varying needs in real-world applications. In the following experiment, W is set to a fixed value to simulate a specific application scenario.
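A running Z-score based on exponential moving averages can be sketched as below. The class name `EmaZScore`, the decay of 0.99, the initialization from the first sample, and the example weights are assumptions; the paper's actual weight values are not reproduced in this text.

```python
import numpy as np

class EmaZScore:
    """Z-score normalizer with exponentially weighted mean and variance."""

    def __init__(self, dim, decay=0.99, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.decay = decay
        self.eps = eps
        self.initialized = False

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        if not self.initialized:
            # Assumption: start the running mean at the first observation.
            self.mean = x.copy()
            self.initialized = True
        else:
            delta = x - self.mean
            self.mean = self.mean + (1.0 - self.decay) * delta
            self.var = self.decay * (self.var + (1.0 - self.decay) * delta ** 2)
        return (x - self.mean) / np.sqrt(self.var + self.eps)

def scalar_reward(raw, weights, normalizer):
    """Weighted sum of the normalized reward components."""
    return float(np.dot(weights, normalizer(raw)))

normalizer = EmaZScore(dim=3)
weights = np.array([0.4, 0.3, 0.3])  # hypothetical preference weights
# Components: (delay penalty, utilization reward, completion time penalty).
print(scalar_reward([-5.0, 0.8, -2.0], weights, normalizer))
print(scalar_reward([-1.0, 0.9, -1.0], weights, normalizer))
```

The first call returns zero because the running mean is initialized at the first sample; later calls score each component relative to its recent history, so that components with very different magnitudes contribute on a comparable scale.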
- 2. Experience Replay and Meta-Learning Strategies: We store the experience tuple $(s_t, a_t, r_t, s_{t+1})$ in the replay pool. In this mechanism, D3QN serves as a “meta-learner”: although SCSO executes the refined action, the system reward is attributed to the state-action pair that generated the initial solution. Through this approach, the agent can learn to identify high-potential initial search directions, thereby achieving a deep integration of macro-level guidance and micro-level optimization.
2.3. Overall Procedure of the MoSCO Algorithm
Algorithm 1: MoSCO: D3QN-Guided Multi-Objective Sand Cat Swarm Optimization
3. Experimental Simulation and Result Analysis
3.1. Experimental Setup
3.2. Multi-Objective Optimization Performance Evaluation
3.3. Dynamic Convergence Analysis of Tardiness
3.4. Dynamic Convergence Analysis of Utilization
3.5. Dynamic Convergence Analysis of Makespan
3.6. Multi-Objective Trade-Off and Pareto Front Analysis
3.7. Stability Analysis
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| QoS | Quality-of-service |
| MoSCO | D3QN-Guided Multi-Objective Sand Cat Swarm Optimization |
| SCSO | Sand Cat Swarm Optimization |
| H-SCSO | Heuristic Sand Cat Swarm Optimization |
| NSGA | Non-dominated Sorting Genetic Algorithm |
| DRL | Deep Reinforcement Learning |
| D3QN | Dueling Double Deep Q-Network |
| TD | Temporal Difference |
| CPI | Comprehensive Performance Index |








| Algorithms | Tardiness | Utilization | Makespan | CPI |
|---|---|---|---|---|
| MoSCO | 4187 | 0.922 | 528 | 0.2698 |
| PureD3QN | 4604 | 0.914 | 632 | 0.3053 |
| NSGA-II | 5079 | 0.869 | 656 | 0.3227 |
| H-SCSO | 4538 | 0.891 | 713 | 0.3576 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Shao, M.; Guo, Y.; Wang, J.; Zhang, H. D3QN-Guided Sand Cat Swarm Optimization with Hybrid Exploration for Multi-Objective Cloud Task Scheduling. Algorithms 2026, 19, 321. https://doi.org/10.3390/a19040321

