1. Introduction
With the rapid advancement of cloud computing, artificial intelligence, and big data analytics, data centers have become essential components of modern computing infrastructure. These centers process vast volumes of data and perform complex computational tasks, often utilizing cloud computing containers. The efficiency of their operations significantly affects the quality of computing services, resource utilization, and overall energy consumption. As the energy consumption of data centers continues to increase annually, energy-efficient container scheduling has become a priority: managing resources in such complex and dynamic environments, minimizing energy consumption while maximizing computing efficiency, is a critical issue requiring immediate attention from both academia and industry [1]. However, the exponential growth in task volume, combined with the diversity of computing resources, presents significant challenges to traditional container scheduling methods.
In typical data center environments, containerization technologies such as Docker and Kubernetes have become predominant for resource scheduling. These technologies offer benefits including lightweight operation, rapid deployment, and support for elastic scaling, thereby enabling effective management of distributed workloads [2]. However, as data centers expand and incorporate heterogeneous computing resources—such as CPUs, GPUs, and TPUs—the complexity of container scheduling increases significantly. First, data centers generally consist of multiple hosts of diverse types, exhibiting substantial variations in computing power, memory capacity, network bandwidth, and other resources, which present considerable challenges for scheduling algorithms. Second, task arrivals often occur in bursts, with resource demands and inter-task communication patterns dynamically evolving, requiring scheduling algorithms to demonstrate strong real-time responsiveness and adaptability [3]. Finally, growing concerns about energy consumption in data centers underscore the critical need to optimize energy efficiency without compromising service quality [4].
Traditional container scheduling approaches predominantly rely on static rules or heuristic strategies [5], such as Shortest Job First and Minimum Remaining Resource First. While effective in relatively simple scenarios, the static and rule-based nature of these methods limits their ability to adapt to dynamically changing workloads and complex resource constraints. As task volumes increase and resource requirements diversify, these conventional methods increasingly exhibit shortcomings, including low efficiency, higher energy consumption, and insufficient flexibility in responding to workload fluctuations.
Moreover, existing scheduling methods often focus narrowly on optimizing computational efficiency and resource utilization. For example, DCHG-TS [6] enhances genetic algorithms by introducing modified operators—such as inversion, improved crossover, and mutation—to more effectively explore a broader solution space. This approach provides an efficient solution for workflow task scheduling in cloud infrastructures by minimizing the makespan of scheduled tasks. However, real-world data center environments frequently involve trade-offs between task execution time and energy consumption. For instance, deploying containers on high-performance hosts to reduce task completion time may lead to unbalanced resource utilization and increased energy consumption. Therefore, achieving a balance among multiple objectives has become a critical challenge in container scheduling research.
In recent years, deep reinforcement learning (DRL), a powerful adaptive decision-making framework, has been increasingly applied to container scheduling problems [7]. Unlike traditional optimization methods, DRL enables agents to autonomously discover optimal scheduling policies through continuous interaction with the environment, dynamically adjusting decisions in response to environmental changes. Reference [8] presents an online resource scheduling framework based on the Deep Q-Network (DQN) algorithm, which balances two optimization objectives—energy consumption and task interval—by modulating the reward weights assigned to each. The Soft Actor–Critic (SAC) algorithm, grounded in a maximum entropy framework, not only maximizes cumulative rewards but also encourages policy exploration, thereby reducing the risk of convergence to local optima. SAC demonstrates strong adaptability in multi-objective optimization and dynamic environments, effectively addressing challenges related to task duration and energy consumption in container scheduling.
In this study, we propose a Soft Actor–Critic based Container Scheduling method, namely SAC-CS, designed to improve container execution efficiency while reducing energy consumption in data centers. By integrating SAC’s maximum entropy framework with multi-objective optimization goals, SAC-CS adaptively balances the trade-off between task execution time and energy usage in environments characterized by dynamic workloads and heterogeneous resources. Compared to traditional heuristic and reinforcement learning approaches, SAC-CS offers more flexible and efficient scheduling solutions, enabling the joint optimization of computing performance and energy efficiency in container scheduling for data centers.
The main contributions of this study are as follows: (i) We propose a container scheduling model based on the SAC algorithm, establishing a multi-objective optimization framework that simultaneously minimizes task execution time and energy consumption in data centers. (ii) We develop a simulation environment for container scheduling experiments in data centers, specifically designed for deep reinforcement learning to address the heterogeneity of computing resources, diverse data center environments, and energy consumption challenges. (iii) We empirically validate the effectiveness of SAC-CS across multiple scheduling scenarios, demonstrating that our approach significantly improves efficiency and reduces energy consumption in data centers.
2. Related Work
In data center environments, conventional task scheduling algorithms predominantly utilize mathematical modeling and heuristic techniques. The mathematical modeling approach involves representing scheduling problems through the formulation of exact mathematical models, including integer linear programming [9] and mixed-integer programming frameworks, subsequently deriving optimal solutions via optimization theory. While this approach provides a robust theoretical foundation, it is associated with significant computational complexity, thereby limiting its applicability in large-scale and dynamic settings.
Heuristic approaches are commonly employed to tackle optimization challenges in traditional task scheduling by leveraging intuition and experiential knowledge. Because of their relatively low computational complexity, these algorithms are widely used in scenarios where prompt decision-making is prioritized over exhaustive optimization [10]. Moreover, the worst-case performance of heuristic algorithms is often predictable, which helps minimize errors in resource allocation [11]. Typical heuristic algorithms include First Fit, Best Fit, Round-Robin, and Performance First. The Longest Loaded Interval First algorithm [12] is a 2-approximation method designed to minimize the reserved energy consumption of virtual machines in cloud environments, with theoretical validation provided. Additionally, the Peak Efficiency Aware Scheduling algorithm [13] aims to optimize both energy consumption and quality of service during the allocation and reallocation of online virtual machines in cloud systems. Reference [14] presents the heuristic task scheduling algorithm ECOTS, which improves cloud energy efficiency based on an energy efficiency model. Despite their advantages, heuristic algorithms have several limitations. They are generally designed for single-objective problems, and their solutions often remain open to further optimization. Moreover, most heuristic algorithms are tailored to specific scenarios; consequently, even minor changes in the environment may require redesigning the algorithm, posing challenges in managing complex and dynamically evolving conditions.
The core mechanism of reinforcement learning (RL) involves optimizing strategies through interaction with the environment to achieve specific objectives [15]. Unlike traditional supervised or unsupervised learning, reinforcement learning focuses on decision-making and behavior optimization through trial and error. Its primary advantage is adaptability; it can continuously learn and adjust to environmental changes, thereby progressively enhancing scheduling algorithms. Reinforcement learning algorithms have demonstrated remarkable success in scheduling applications. For instance, QEEC [16] is an energy-efficient cloud computing scheduling system that incorporates Q-learning. Its workflow consists of two stages: in the first stage, a global load balancer constructs an M/M/S queuing model [17] to dynamically allocate user requests across the cloud server cluster; in the second stage, an intelligent scheduling module based on Q-learning is deployed on each server.
DRL represents an advanced integration of deep learning and reinforcement learning by embedding a deep neural network within the perception–decision loop of the RL agent. Unlike classical methods and traditional RL, DRL offers enhanced modeling capabilities for complex systems and decision-making strategies, greater adaptability to diverse optimization objectives, and improved potential to handle large-scale tasks. Consequently, it is widely applied in scheduling algorithms in data center settings [18,19,20]. For example, DQN, a representative architecture combining deep learning and Q-learning, fundamentally constructs a deep neural network to approximate the optimal action-value function [21]. The more recent Proximal Policy Optimization (PPO) algorithm improves training stability by introducing a policy clipping mechanism, resulting in better convergence characteristics in continuous decision-making tasks. However, PPO is an on-policy algorithm that requires a large number of interaction samples to update the policy, leading to low sample efficiency [22,23] and making it challenging to balance the dynamic trade-off between efficiency and energy consumption in multi-objective optimization scenarios.
Compared to DQN and PPO, the SAC algorithm integrates an entropy regularization term within the maximum entropy reinforcement learning framework. By incorporating policy entropy into the reward function, SAC balances the pursuit of high rewards with extensive exploration. It combines the high sample efficiency of off-policy algorithms with the ability to regulate exploration intensity by automatically adjusting the temperature parameter, effectively preventing the policy from converging to local optima. Furthermore, SAC employs a stochastic policy structure that reliably optimizes complex scheduling strategies in continuous action spaces, enhancing both convergence speed and generalization capability in high-dimensional, heterogeneous environments. Consequently, SAC is well-suited for dynamic multi-constraint optimization problems, such as data center container scheduling, enabling synergistic optimization of energy consumption while ensuring efficient task execution [24,25].
3. Problem Definition
The container scheduling problem is fundamentally a multi-constraint combinatorial optimization challenge. Its primary objective is to maximize overall system efficiency while satisfying resource capacity, task priority, and service quality requirements [26].
It is assumed that the scheduling system comprises n tasks and m hosts. The task set is represented as Tasks = {T0, T1, …, Tn}. In this scenario, the number of tasks arriving at each time step is variable, and tasks within the same job may require communication. In summary, the attributes of a task include: submission time, task ID, estimated task duration, required CPU capacity, required GPU capacity, required memory, communication time, number of communications, communication data volume, start time, and end time. The host set is represented as Machines = {M0, M1, …, Mm}. Each host Mj has indicators such as ID, CPU capacity, GPU capacity, memory capacity, CPU speed, GPU speed, memory speed, and price.
The scheduling system initiates the simulation at time step 0, with tasks arriving sequentially. At each time step, newly arrived tasks are scheduled, and those successfully scheduled are assigned to one of the hosts for execution. If no machine can meet the requirements of a particular task at the current time step, the task is retained for rescheduling in the next time step, and its waiting time is incremented by 1. Additionally, each task has a designated communication time point; when the simulation reaches this time step, the task begins communication, which incurs additional time. In real-world scenarios, multiple machines exhibit varying performance levels; to replicate this, the hosts in our system operate at different speeds. Based on these parameters, the total runtime T_i of the i-th task is defined as follows:

T_i = D_i / v_M + C_i

where D_i represents the estimated task duration, v_M denotes the speed of the allocated host M, and C_i is the communication time for the task.
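The runtime definition above can be sketched directly; the function and parameter names (`duration`, `host_speed`, `comm_time`) are illustrative, not taken from the paper's implementation.

```python
def task_runtime(duration, host_speed, comm_time):
    """Total runtime of a task: the estimated duration divided by the
    allocated host's speed, plus the task's communication time."""
    return duration / host_speed + comm_time

# A task with estimated duration 10 on a host running at twice the
# baseline speed, with 1 time step of communication overhead:
print(task_runtime(10.0, 2.0, 1.0))  # 6.0
```

Note that a faster host (larger v_M) shortens the compute portion but leaves the communication time unchanged, which is why high-speed placement alone cannot eliminate runtime for communication-heavy tasks.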
The average runtime T_avg is defined as follows:

T_avg = (1/n) Σ_{i=1}^{n} T_i

where Σ T_i is the total runtime of all tasks, and n represents the number of tasks.
It is assumed that a host’s energy consumption increases linearly with its load rate. The energy consumption at full load is E_F, while at no load it is E_0. Therefore, the energy consumption E of a host at a given time step is defined as follows:

E = E_0 + a (E_F − E_0)

where a is the load rate of the host at the given time step, 0 ≤ a ≤ 1.

The total energy consumption is the sum of the energy used by all hosts across all time steps. The total energy consumption E_total of the data center is defined as follows:

E_total = Σ_{t=1}^{S} Σ_{j=1}^{m} E_j(t)

where m denotes the number of hosts in the data center, S is the total number of simulation time steps, and E_j(t) is the energy consumption of host M_j at time step t.
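The linear energy model and its summation over hosts and time steps can be sketched as follows; the names and the example load trace are illustrative assumptions.

```python
def host_energy(load, e_full, e_idle):
    """Per-step energy of one host under the linear model
    E = E0 + a * (EF - E0), with load rate a in [0, 1]."""
    assert 0.0 <= load <= 1.0, "load rate must lie in [0, 1]"
    return e_idle + load * (e_full - e_idle)

def total_energy(loads_per_step, e_full, e_idle):
    """Sum per-host energy over all time steps; loads_per_step[t][j]
    is host j's load rate at time step t."""
    return sum(host_energy(a, e_full, e_idle)
               for step in loads_per_step for a in step)

# Two hosts over two time steps with EF = 100 and E0 = 40:
print(total_energy([[0.0, 1.0], [0.5, 0.5]], 100.0, 40.0))  # 280.0
```

The example illustrates a consequence of the nonzero idle term E_0: an idle host still draws 40 units per step, so consolidating load onto fewer hosts can reduce total energy even when per-host load rises.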
In this paper, a new evaluation metric, termed the Time–Energy product F, is introduced for experimental assessment, defined as follows:

F = (T_avg / T_base) × (E_total / E_base)

where T_avg is the average runtime, E_total is the total energy consumption of the machines, T_base is the time benchmark value, representing the maximum possible running time, and E_base is the energy consumption benchmark value, representing the energy consumed by all machines running at full load for T_base.

This composite metric, F, is constructed using dual-objective normalization. The product form reflects the overall optimization level along the Pareto front; a smaller value indicates a better balance between execution-time control and energy-efficiency improvement achieved by the algorithm.
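The metric is a straightforward product of two normalized ratios; a minimal sketch (with illustrative benchmark values) is:

```python
def time_energy_product(avg_runtime, energy_total, t_base, e_base):
    """Time-Energy product F: normalized average runtime multiplied by
    normalized total energy; smaller values indicate a better balance."""
    return (avg_runtime / t_base) * (energy_total / e_base)

# Average runtime 6 against a 12-step time benchmark, total energy 280
# against a full-load energy benchmark of 560:
print(time_energy_product(6.0, 280.0, 12.0, 560.0))  # 0.25
```

Because both ratios lie in (0, 1] under the stated benchmarks, F is dimensionless and comparable across scheduling scenarios of different scales.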
4. Scheduling Algorithm
In this study, the problem of container scheduling is modeled using the Markov Decision Process (MDP), and the proposed SAC-CS uses an agent based on the SAC algorithm for optimization decision-making.
Within the framework of the MDP model, an agent selects actions from a predefined action space, where each action induces a transition between states and yields an immediate reward. The objective of the decision-making process is to identify a sequence of actions that maximizes the expected cumulative reward from the current state over a specified future horizon, thereby providing a theoretically optimal solution to complex decision-making problems. The MDP model is characterized by a particular state space, an action space, a transition kernel, and a reward function, as outlined below:

M = (S, A, P, R, ρ_0, γ)

where S is the state space, A is the action space, P is the transition kernel, R is the reward function, ρ_0 is the initial-state distribution, and γ is the discount factor; s ∈ S is a state, a ∈ A is an action, r is a reward, and π_θ is the parameterized policy.
The specific state space, action space, and reward function utilized in the approach are delineated as follows.
State Space: To comprehensively represent both task characteristics and host status, the state space dimension in this model is defined as the product of the number of hosts and five features per host. These features comprise task affinity, host processing speed, and three differential values that quantify the discrepancies between the task’s resource requirements and the host’s available idle resources (CPU, memory, and GPU):

s = [f_1, f_2, …, f_N],  f_i ∈ R^5

where N is the number of hosts; f_i is the 5-dimensional feature vector for host M_i; d is the current task’s resource demand; u_i is host M_i’s idle resources; and d − u_i yields the discrepancy features (the CPU, memory, and GPU gaps).
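Construction of this state vector can be sketched as below; the dictionary layout, field names, and example host values are illustrative assumptions, not the paper's data format.

```python
def host_features(affinity, speed, task_demand, host_idle):
    """Five features for one host: task affinity, processing speed, and
    the per-resource gaps (CPU, memory, GPU) between the task's demand
    and the host's idle resources."""
    gaps = [d - u for d, u in zip(task_demand, host_idle)]
    return [affinity, speed] + gaps

def build_state(hosts, task_demand):
    """Concatenate per-host feature vectors into the flat state the
    agent observes; its dimension is the number of hosts times five."""
    state = []
    for h in hosts:
        state += host_features(h["affinity"], h["speed"], task_demand, h["idle"])
    return state

# Two hypothetical hosts and a task demanding (cpu=2, mem=4, gpu=0):
hosts = [
    {"affinity": 1.0, "speed": 2.0, "idle": (4.0, 8.0, 1.0)},
    {"affinity": 0.0, "speed": 1.0, "idle": (1.0, 2.0, 0.0)},
]
state = build_state(hosts, (2.0, 4.0, 0.0))
print(len(state))  # 10
```

Negative gaps indicate the host has slack for that resource, while any positive gap signals an infeasible placement; encoding the signed gaps rather than raw capacities lets the agent read feasibility directly from the state.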
Action Space: For each task, the agent’s action consists of allocating the task to one of the available hosts. Therefore, the size of the action space corresponds to the total number of hosts, with each action uniquely identified by a host ID.
Reward Function: The agent receives rewards based on the actions taken in response to the observed state. The proposed reward function integrates two components: task runtime and host energy consumption. After normalization, these components are combined using a weighted sum to generate the overall reward signal:

r = −( α T / (T_base + ε) + β E / (E_base + ε) )

where T is the task runtime after placement, and E is the attributed host energy consumption over the decision interval; T_base and E_base are normalization baselines; ε ensures numerical stability; and α and β trade off time vs. energy.
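A minimal sketch of this weighted, baseline-normalized reward follows; the default weights and epsilon value are illustrative assumptions.

```python
def placement_reward(runtime, energy, t_base, e_base,
                     alpha=0.5, beta=0.5, eps=1e-8):
    """Negative weighted cost: normalized task runtime plus normalized
    energy consumption; eps guards against zero baselines."""
    return -(alpha * runtime / (t_base + eps) + beta * energy / (e_base + eps))

# A placement with runtime 6 (baseline 12) and energy 70 (baseline 140)
# costs 0.25 + 0.25, so the reward is approximately -0.5:
print(round(placement_reward(6.0, 70.0, 12.0, 140.0), 6))  # -0.5
```

Normalizing by the baselines keeps both terms on comparable scales, so α and β express the intended trade-off rather than compensating for unit differences.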
The process of SAC-CS is shown in Figure 1, where s_t is the environment state at step t, a_t is the action taken by the actor at step t, s_{t+1} is the next state after executing a_t, and r_t is the immediate reward at step t.
The SAC-CS approach is a deep reinforcement learning method based on the maximum entropy reinforcement learning framework. A key advantage of SAC is its ability to simultaneously maximize cumulative reward and policy entropy, thereby improving both exploration capabilities and policy stability. The SAC-CS algorithm includes several essential components:
Maximum Entropy Objective: In addition to optimizing task execution time, SAC-CS incorporates an objective to maximize policy entropy. This approach promotes sustained exploration throughout the optimization process. Compared to conventional reinforcement learning methods, SAC-CS demonstrates superior performance in handling complex multi-objective problems and reducing the risk of convergence to suboptimal local minima.
Actor–Critic Architecture: SAC-CS utilizes an actor–critic framework in which the actor network is responsible for selecting actions (e.g., scheduling decisions), while the critic network estimates the value function (Q-value) associated with state–action pairs. This dual-network structure enables the simultaneous optimization of both the policy and the value function, thereby enhancing learning efficiency. The soft update is defined as follows:

θ′ ← τ θ + (1 − τ) θ′

where θ denotes the parameters of the critic network, θ′ denotes the parameters of the target critic network, and τ is the soft-update coefficient.
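The soft (Polyak) update can be sketched on flat parameter lists; real implementations apply the same elementwise rule to each network tensor, and the default τ here is an illustrative value.

```python
def soft_update(critic_params, target_params, tau=0.005):
    """Polyak averaging: target <- tau * critic + (1 - tau) * target,
    applied elementwise to each scalar parameter."""
    return [tau * c + (1.0 - tau) * t
            for c, t in zip(critic_params, target_params)]

# With tau = 0.5 the target parameters move halfway toward the critic's:
print(soft_update([1.0, 2.0], [0.0, 0.0], tau=0.5))  # [0.5, 1.0]
```

Small τ makes the target network a slowly moving average of the critic, which stabilizes the bootstrapped Q-value targets during training.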
Stochastic Policy: Unlike traditional deterministic policies, SAC-CS employs a stochastic policy that outputs a probability distribution over possible actions. This stochasticity enables the algorithm to make more robust and effective decisions in uncertain and dynamic environments.
Reward Function: In the context of container scheduling, the reward function is designed to consider both task execution time and energy consumption. Through this compound reward mechanism, effective multi-objective optimization can be achieved. The specific reward functions are as follows:

e = (E_after − E_before) / E_ref

t = ((T_after − T_before) + T_comp / N_comp) / T_ref

r = −(w_e · e + w_t · t)

where E_before is the energy consumption before the action, E_after is the energy consumption after the action, E_ref is the reference value for normalization, and e is the energy consumption component of the reward function; T_before is the time before the action is executed, T_after is the time after the action is executed, T_comp is the total time spent on completed tasks, N_comp is the number of completed tasks, T_ref is the reference value for normalization, and t is the time component of the reward function; w_e and w_t are weight values, both set equal to 0.5 in the experiments.
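The two reward components can be sketched as below. The exact way the completed-task statistics enter the time component is not fully recoverable from the text, so this sketch assumes they contribute an average-completion-time term; all names are illustrative.

```python
def energy_component(e_before, e_after, e_ref):
    """Normalized change in energy consumption caused by the action."""
    return (e_after - e_before) / e_ref

def time_component(t_before, t_after, t_completed, n_completed, t_ref):
    """Normalized time cost: elapsed time for this step plus the average
    runtime of tasks completed so far (an assumed combination)."""
    avg_completed = t_completed / n_completed if n_completed else 0.0
    return ((t_after - t_before) + avg_completed) / t_ref

def composite_reward(e, t, w_e=0.5, w_t=0.5):
    """Equal-weight combination, negated so lower cost means higher reward."""
    return -(w_e * e + w_t * t)
```

With equal weights of 0.5, a placement that raises energy by 10% of E_ref and costs 0.6 of T_ref in normalized time yields a reward of about −0.35.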
Furthermore, the algorithm dynamically adjusts the relative weighting between execution time and energy consumption objectives, enabling real-time optimization of task scheduling in response to prevailing environmental conditions, such as resource utilization and network bandwidth.
6. Conclusions
In this paper, the SAC-CS container scheduling algorithm based on the Soft Actor–Critic framework is introduced. By integrating the maximum entropy principle with an actor–critic architecture, SAC-CS effectively balances exploration and exploitation strategies. Experimental evaluations on synthetic workloads and Alibaba cluster traces demonstrate that SAC-CS reduces the combined time–energy metric more effectively than heuristic methods and other DRL approaches such as DQN and PPO.
The implications of this study are significant, demonstrating that stochastic policies outperform deterministic ones in managing the uncertainty of cloud workloads. This provides operators with a viable approach to reduce operational costs and carbon footprints. However, certain limitations must be acknowledged. The evaluation is based on a simulation environment with simplified network dynamics and assumes a linear energy consumption model, which may not accurately reflect the non-linear characteristics of real-world scenarios. Additionally, the algorithm may experience instability during the initial ‘cold start’ training phase.
Future research will address these limitations by deploying SAC-CS within a more comprehensive computing and networking integration simulation environment [38], as well as on a physical Kubernetes testbed, to rigorously evaluate its robustness under real-world complexities and dynamic conditions. Additionally, upcoming efforts will focus on integrating thermal-aware objectives into the reward function and exploring transfer learning techniques to enable faster model adaptation across heterogeneous hardware configurations.