2. Computational–Physical Collaborative Optimization Model
This paper considers a system managed by a single operator, which comprises the following core components:
DC: Each DC operates in a distinct external environment, characterized by its real-time electricity price and outdoor temperature.
Zones: The interior of each DC is divided into independent zones, and the cooling parameters of each zone can be adjusted independently.
Aggregated server (AS): All servers within a cooling zone are abstracted as a single AS with the following attributes: (1) Total computing capacity: the sum of the physical cores of all servers in the zone. (2) Peak single-core processing capacity: the processing speed of the servers in the zone when running at the maximum operating frequency. (3) Active core ratio: the ratio of the number of online active cores to the total number of cores. (4) Frequency scaling factor: the ratio of the actual operating frequency of the cores to their peak frequency. (5) Core occupancy rate: the ratio of the number of cores executing tasks to the total number of active cores.
Computational tasks: Each task i is represented by a directed acyclic graph (DAG) whose node set is the set of subtasks inside task i and whose edge set defines the dependency relationships between subtasks. Task i is characterized by its arrival time, its initial arrival DC, its latest completion time, and its bandwidth requirement. A subtask is the minimum unit of scheduling and execution and is characterized by its initial computing demand, its remaining computing demand at each time, and its core demand.
The operator must dynamically make the following decisions to minimize the total cost while satisfying task service quality and ensuring the safe operation of equipment: (1) spatial migration of tasks across DCs; (2) the execution zone and execution time of subtasks within a DC; (3) IT equipment regulation, namely the active core ratio and the frequency scaling factor; and (4) cooling system regulation, namely the fan speed or the set temperature.
Figure 1 depicts the architecture of the computational–physical collaborative optimization model. The computational domain handles task decomposition, spatiotemporal migration, and subtask scheduling. The physical domain manages IT equipment regulation (active core ratio and frequency) and cooling system control (fan speed or set temperature). The two domains are coupled through thermal and computational constraints: task allocation affects IT power consumption, which in turn impacts cooling load and indoor temperature, while equipment regulation influences task execution efficiency.
2.1. Modeling of Task Allocation and Execution
The number of subtasks within each task is determined by its internal DAG structure and is assumed to be known upon task arrival. Subtasks are not dynamically split by the system during runtime; instead, they are predefined at the time of task submission. This ensures that all dependency relationships and computing demands are fully observable to the scheduler.
Two binary allocation variables are used. The task-level allocation variable equals 1 if task i is allocated to DC k and 0 otherwise. The subtask-level allocation variable equals 1 if a given subtask of task i is allocated to a given zone of DC k for execution and 0 otherwise.
Each task must be allocated to exactly one DC:
Because the minimum granularity of spatial allocation is a complete task, and task decomposition is performed only after the task has been assigned to its target DC, all subtasks belonging to the same task must be allocated to the same DC as the task itself:
This constraint ensures that if task i is allocated to DC k, the allocations of each subtask of task i across all zones within DC k sum to 1, i.e., the subtask must be allocated within DC k; otherwise, the sum equals 0, i.e., the subtask cannot be allocated within DC k.
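As a minimal illustration, the two allocation constraints can be checked on a candidate solution as sketched below. The container names `task_dc` and `sub_zone` and their layouts are hypothetical and are not the paper's notation.

```python
# Hypothetical containers (assumed layout, not the paper's notation):
#   task_dc[i][k]          -> 1 if task i is allocated to DC k, else 0
#   sub_zone[(i, j)][k][z] -> 1 if subtask j of task i runs in zone z of DC k, else 0


def check_allocation(task_dc, sub_zone):
    """Return True if both allocation constraints hold."""
    # Constraint 1: each task is allocated to exactly one DC.
    for row in task_dc.values():
        if sum(row) != 1:
            return False
    # Constraint 2: for every subtask and every DC, the zone allocations
    # sum to the task-level indicator of that DC (1 in the chosen DC, 0 elsewhere).
    for (i, _j), per_dc in sub_zone.items():
        for k, zones in enumerate(per_dc):
            if sum(zones) != task_dc[i][k]:
                return False
    return True
```

With two DCs of two zones each, a task allocated to the second DC passes the check only if every one of its subtasks occupies exactly one zone of that DC.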
To characterize the arrival of tasks at their original DC, an arrival indicator variable is introduced, which denotes whether task i initially arrives at a given DC at a given time.
If a task initially arrives at one DC and is ultimately allocated to a different DC, the task must be transmitted across DCs, which occupies bandwidth. Transmission is assumed to start immediately upon the task's arrival and to complete instantaneously. The sum of the bandwidth requirements of all tasks to be transmitted from DC k to another DC must not exceed the capacity of the link between the two DCs, so that the bandwidth limit is respected at the transmission instant:
where the sum is taken over the set of tasks arriving at the source DC at that time.
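The bandwidth constraint can be sketched as follows, under the stated assumption of instantaneous transmission at arrival. The tuple layout of `arrivals` and the dictionary `capacity` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def link_loads(arrivals):
    """Sum bandwidth per directed link for tasks arriving at one time step.

    arrivals: iterable of (origin_dc, target_dc, bandwidth) tuples (assumed layout).
    """
    loads = defaultdict(float)
    for origin, target, bandwidth in arrivals:
        if origin != target:  # only migrated tasks occupy a link
            loads[(origin, target)] += bandwidth
    return dict(loads)


def bandwidth_ok(arrivals, capacity):
    """capacity: dict (origin, target) -> link capacity; a missing link counts as 0."""
    return all(
        load <= capacity.get(link, 0.0)
        for link, load in link_loads(arrivals).items()
    )
```

Tasks that stay at their arrival DC contribute no load, so only the cross-DC migrations decided at that time step are constrained.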
For each task at a given time, the set of its subtasks being executed is defined as follows:
where the two bounds denote the start execution time and finish execution time of each subtask of the task, respectively.
Consider the number of active cores in a zone at a given time. Among these active cores, some are occupied by executing tasks, while the rest are idle. At any time, the number of cores executing tasks in a zone must not exceed the number of currently available active cores in that zone, so that the activated computing cores are sufficient to meet the demand:
The finish time of a subtask is determined by its start time, computing demand, core demand, and the processing capacity of the AS in the corresponding zone. The remaining computing demand of a subtask at a given time can be expressed as:
where the processing term uses the actual operating frequency after frequency scaling. Since the subtask-level allocation variable is binary and each subtask is allocated to exactly one AS (as shown in (1) and (2)), only one term in the summation is nonzero.
A subtask is considered to be completed when its remaining computing demand is less than or equal to zero.
For any dependency edge, which indicates that one subtask is the predecessor of another, the successor subtask can only start execution after the predecessor subtask is completed, and the following constraint must be satisfied:
Herein, the subtask level is defined. A higher level means that the subtask is executed later in the dependency chain and can only be initiated after more predecessor subtasks are completed. The quantification rule for the level can be expressed as follows:
where the maximum is taken over the predecessor subtasks of the subtask in question: if a subtask has no predecessor subtasks, its level is 1; otherwise, its level is equal to the maximum level of all its predecessor subtasks plus one.
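The level rule described above (level 1 without predecessors, otherwise one plus the maximum predecessor level) can be sketched as a memoized recursion over the DAG. The dictionary `preds`, which maps each subtask to its list of predecessors, is a hypothetical input format.

```python
def subtask_levels(preds):
    """Compute the level of every subtask in an acyclic dependency graph.

    preds: dict mapping each subtask id to a list of predecessor ids (assumed format).
    """
    levels = {}

    def level(j):
        # Memoize so that each subtask is evaluated once.
        if j not in levels:
            parents = preds.get(j, [])
            levels[j] = 1 if not parents else 1 + max(level(p) for p in parents)
        return levels[j]

    for j in preds:
        level(j)
    return levels
```

For a diamond-shaped DAG in which `d` depends on `b` and `c`, and both depend on `a`, the sketch assigns level 1 to `a`, level 2 to `b` and `c`, and level 3 to `d`.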
For each task, the start time of its first subtask must not be earlier than the task's arrival time, and the completion time of its last subtask must be no later than the task's latest completion time [40]:
2.2. Modeling of AS
The core occupancy rate of the AS in a zone at a given time is defined as the ratio of the number of cores executing tasks to the total number of active cores:
where fixed tasks refer to critical workloads that cannot be delayed, migrated, or preempted due to operational continuity, security constraints, or user specifications. They occupy a fixed number of cores and are not subject to spatiotemporal migration optimization. However, they remain subject to SLA constraints and must be completed before their deadlines.
The total IT power of the AS in a zone at a given time consists of three components: static power, dynamic computing power, and state transition power [41].
where the static power is the minimum power required to maintain the basic operation of the server and is parameterized by the idle power of a single core. The dynamic computing power is the additional power drawn during task execution and is parameterized by the peak power and the dynamic power exponent. The state transition power is the instantaneous power overhead incurred when the server state is altered; it depends on the variations in the active core ratio and the frequency scaling factor, weighted by the energy consumption coefficients for core count transition and frequency transition, respectively.
The total power of a DC at a given time is the sum of the IT power, cooling power, and auxiliary facility power. The auxiliary facility power is taken to be proportional to the IT power:
where the proportional coefficient covers the conversion losses of equipment such as the UPS and the basic lighting energy consumption.
The PUE of each DC is the ratio of the total input power of the DC to the power actually consumed by its IT equipment.
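The total power and PUE definitions can be illustrated numerically as follows. The function and parameter names are hypothetical, and the numbers in the usage note are illustrative rather than taken from the case study.

```python
def dc_total_power(p_it, p_cool, aux_coeff):
    """Total DC power: IT + cooling + auxiliary, with auxiliary = aux_coeff * IT."""
    return p_it + p_cool + aux_coeff * p_it


def pue(p_it, p_cool, aux_coeff):
    """Power usage effectiveness: total input power divided by IT power."""
    if p_it <= 0:
        raise ValueError("IT power must be positive")
    return dc_total_power(p_it, p_cool, aux_coeff) / p_it
```

For example, with 1000 kW of IT power, 300 kW of cooling power, and an auxiliary coefficient of 0.05, the total power is 1350 kW and the PUE is 1.35.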
2.3. Modeling of Cooling System
The cooling system switches between free cooling mode and mechanical cooling mode according to the outdoor temperature:
where one value of the mode indicator represents the free cooling mode and the other represents the mechanical cooling mode.
When the system is in the free cooling mode, the compressor is completely shut down, and heat is removed by low-temperature outdoor air solely by adjusting the flow rate of the fluid. The cooling output is regulated by controlling the rotation speed of the fans/water pumps. When the equipment operates at full speed, the maximum heat power that can be removed, which depends on the temperature difference, is given by:
where the coefficient is the total heat transfer coefficient of the zone, which depends on the heat exchanger area and air duct design.
The actual heat removal capacity is proportional to the rotation speed:
To ensure that the heat generated by IT equipment is completely removed, the following condition must be satisfied:
The power consumption of fans/water pumps is proportional to the cube of the rotation speed:
where the proportionality constant is the rated maximum power of the equipment.
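The free cooling relations above can be combined into a small sketch that returns the minimum fan speed removing all IT heat and the corresponding fan power. It assumes that the full-speed heat removal equals the heat transfer coefficient times the indoor-outdoor temperature difference; this form and all names (`ua`, `p_fan_rated`) are assumptions for illustration.

```python
def free_cooling_fan(p_it_heat, t_in, t_out, ua, p_fan_rated):
    """Return (speed, fan_power) for free cooling, or None if it is infeasible.

    Assumed model: full-speed heat removal q_max = ua * (t_in - t_out);
    actual heat removal = speed * q_max; fan power = p_fan_rated * speed**3.
    """
    q_max = ua * (t_in - t_out)
    if q_max <= 0:
        return None  # outdoor air is not colder than indoor air
    speed = p_it_heat / q_max  # smallest speed that removes all IT heat
    if speed > 1.0:
        return None  # even full speed cannot remove the heat
    return speed, p_fan_rated * speed ** 3
```

Because fan power grows with the cube of the speed, halving the speed cuts the fan power to one eighth of its rated value, which is why running fans at a fixed high speed wastes energy when the heat load is low.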
When the system is in the mechanical cooling mode, the compressor is activated to actively transfer heat through the refrigeration cycle. The evaporation temperature is changed by adjusting the temperature set point.
The coefficient of performance (COP) is defined as the ratio of the heat removal capacity to the input electrical power, which can be modeled as:
where the coefficients are empirical fitting parameters calibrated to real-world DC cooling system characteristics.
The power consumption of the cooling system is jointly determined by the heat to be removed (approximately equal to the IT power consumption) and the cooling efficiency:
where the first factor is the cooling efficiency coefficient, whose dependence is described by fitting coefficients, and the constant term is the baseline power consumption of the equipment during compressor operation.
To ensure equipment safety, the server air inlet temperature must be maintained within the allowable range. Its dynamics are described by the continuous-time heat balance equation:
where the two zone parameters are the thermal capacity and the thermal resistance of the zone. In practical control, the discrete form is usually adopted:
where the step size is the length of the time interval.
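A discrete temperature update of this kind can be sketched with a forward-Euler step. The specific balance used here (IT heat minus removed heat minus leakage through the thermal resistance, divided by the thermal capacity) is an assumed form, not necessarily the paper's exact equation, and all names are hypothetical. The default safe range of 18 to 27 °C follows the case study settings.

```python
def step_temperature(t_in, t_out, p_it, q_cool, c_th, r_th, dt):
    """One forward-Euler step of an assumed zone heat balance.

    Assumed form: c_th * dT/dt = p_it - q_cool - (t_in - t_out) / r_th.
    """
    d_temp = (p_it - q_cool - (t_in - t_out) / r_th) / c_th
    return t_in + dt * d_temp


def within_safe_range(temp, t_min=18.0, t_max=27.0):
    """Temperature safety constraint with the case-study bounds as defaults."""
    return t_min <= temp <= t_max
```

When the removed heat equals the IT heat and indoor and outdoor temperatures coincide, the update leaves the temperature unchanged; any surplus heat raises it in proportion to the step length and inversely to the thermal capacity.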
The temperature safety constraint is expressed as follows:
where the lower and upper bounds denote the minimum and maximum values of the safe temperature, respectively.
2.4. Modeling of Cost
The optimization objective of the model is to minimize the total operating cost, which includes the computing cost, cooling cost, transmission cost, and SLA violation cost:
where the two coefficients denote the communication cost coefficient and the SLA violation cost coefficient, respectively, and ρ represents the SLA violation rate.
3. Two-Layer POMG
Based on the computational–physical collaborative optimization model, this section further formulates the high-dimensional, dynamic, and strongly coupled optimization problem as a two-layer POMG. By defining hierarchical agents, state space, action space, and the reward mechanism, it provides a unified decision-making framework for the subsequent H-MADDPG algorithm.
3.1. Agent Architecture and Division of Responsibilities
This paper adopts a two-layer agent architecture to balance the efficiency of global coordination and local control.
GCA: As the central decision-making unit, the GCA is responsible for the top-level global scheduling decisions that minimize the overall cost of the system, and its decisions correspond to the task-level allocation variables in the optimization model.
LDCA: One LDCA is deployed in each DC, and it is responsible for all micro-level decision-making within the DC: (1) subtask scheduling; (2) zone AS state control; and (3) cooling control decision-making.
3.2. Task Priority Ranking
At each time step, the GCA collects information about the tasks arriving at each DC. The GCA maintains a set of pending tasks and, from it, selects the set of tasks to be processed at the current time step. Because the number of tasks that the GCA can process per time step is bounded, the pending tasks must be ranked by priority. The priority score of each task depends on its total computing demand and its deadline. A higher priority score indicates a more urgent task, which should be processed first:
where the two coefficients are the weights of the task priority score.
The tasks with the highest priority scores, up to the per-step maximum, are included in the processed set, and their spatiotemporal migration is determined by the GCA.
Each LDCA likewise maintains a set of pending subtasks in its DC and selects the set of subtasks to be processed in the current time step, subject to a maximum number of subtasks per time step. Similarly, the subtasks are prioritized according to their computing demand and deadline, and those with the highest priority scores, up to the per-step maximum, are selected for decision-making.
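The ranking-and-truncation step can be sketched as below. The scoring form (a weighted sum of the computing demand and the inverse of the remaining time to the deadline) is an assumption made for illustration, since only the dependence on demand and deadline is stated; the dictionary keys and default weights are likewise hypothetical.

```python
import heapq


def priority_score(demand, deadline, now, w_demand=0.5, w_deadline=0.5):
    """Assumed scoring form: more demand and less slack give a higher score."""
    slack = max(deadline - now, 1e-9)  # guard against division by zero
    return w_demand * demand + w_deadline / slack


def select_top(tasks, now, n_max, **weights):
    """Return the n_max pending tasks with the highest priority scores.

    tasks: list of dicts with hypothetical keys 'id', 'demand', 'deadline'.
    """
    return heapq.nlargest(
        n_max,
        tasks,
        key=lambda t: priority_score(t["demand"], t["deadline"], now, **weights),
    )
```

In practice the demand and slack terms would need to be normalized to comparable scales before weighting; the same routine applies to subtasks at the LDCA level.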
3.3. State Space
The state observed by the GCA at each time step comprises the current time step, the task states, and the DC states. The task states cover the task set processed by the GCA at the current time step and include each task's computing demand, deadline, and number of subtasks. The DC states cover all DCs and include the electricity price, the outdoor temperature, the predicted values of electricity price and outdoor temperature over a look-ahead horizon, the zone-average indoor temperature, the zone-average frequency, and the total number of available cores of the DC. Based on these states, the GCA is trained to identify the urgency of tasks and to make allocation decisions according to task requirements and the real-time state of each DC.
Two policy networks are configured for each LDCA: the subtask scheduling actor, which decides the start execution time and execution zone of subtasks, and the resource state regulation actor, which regulates the state of each zone within the DC. Both policy networks take the same state as input but output different actions, as described in Section 3.4. The state observed by LDCA-k at each time step comprises the current time step; the states of the subtask set currently processed by LDCA-k, including each subtask's level and computing demand; the set of tasks allocated to DC k up to the current time step; the state of each zone within the DC, including its indoor temperature, frequency, and number of available cores; and the state of the DC itself, including the current electricity price and outdoor temperature together with their predicted values over the look-ahead horizon.
3.4. Action Space
The GCA is responsible for the spatial migration of the tasks processed at each time step, i.e., determining the DC to which each task is reallocated. Its action is a vector with one entry per task in the task set processed at the current time step, where each entry indicates the DC to which the corresponding task is allocated.
The subtask scheduling actor of LDCA-k is responsible for subtask decision-making. Its action has two components for each subtask in the subtask set processed by LDCA-k at the current time step: the zone allocated to the subtask and the planned start execution time of the subtask. The planned start time is converted to an actual time step by a mapping between the earliest start time, namely the arrival time of the subtask, and the latest start time, namely the deadline of the subtask. The resource state regulation actor of LDCA-k is responsible for regulating the resource state of each zone. Its action consists of the variations in the active core ratio, frequency scaling factor, fan speed, and set temperature. In the free cooling mode, the cooling component output by the policy network is the fan speed variation; in the mechanical cooling mode, it is the set temperature variation.
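The conversion of a planned start time into an actual time step can be sketched as a clipped linear map. The assumption that the actor output lies in [0, 1] and is mapped linearly between the earliest and latest start times is illustrative only; the function name is hypothetical.

```python
def map_start_time(action, t_earliest, t_latest):
    """Map a normalized actor output to an integer time step.

    Assumption: action in [0, 1] is mapped linearly onto [t_earliest, t_latest];
    out-of-range outputs are clipped to the nearest bound.
    """
    a = min(max(action, 0.0), 1.0)
    return t_earliest + int(round(a * (t_latest - t_earliest)))
```

Clipping guarantees that the resulting start time never precedes the subtask's arrival or exceeds its latest start time, regardless of the raw network output.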
3.5. Reward Function
The reward function of the GCA consists of the global electricity cost reward, the global SLA compliance reward, the transmission bandwidth violation penalty, and the load balancing reward:
where the electricity cost reward is defined in terms of the total electricity cost at the current time step and the maximum observed cost in the training buffer, and the bandwidth violation penalty is defined in terms of the bandwidth requirement of the tasks that violate the link capacity and the total link capacity.
The reward weights are set based on a preliminary grid search over a range for each weight.
The reward function of LDCA-k comprises the local electricity cost reward, the local SLA compliance reward, and the physical resource safety reward, where the latter covers the temperature safety and core occupancy rate constraints.
The local electricity cost reward and the local SLA compliance reward are defined analogously to those of the GCA, and each reward term carries its own weight.
5. Case Study
5.1. Parameter Settings
The stochastic elements include workload arrivals, electricity prices, and ambient temperature, which are modeled based on real-world data traces and standard stochastic processes [43,44,45]. This paper sets up three heterogeneous DCs, and the configuration parameters of each DC are provided in Table 3. The electricity price and ambient temperature used during the testing phase are shown in Figure 3. Long-term power purchase agreements for DCs are not considered in this study. The algorithm parameters are presented in Table 4. All experiments are carried out under the two typical operation scenarios mentioned in the introduction, where DCs adjust their power consumption according to real-time electricity price signals and take part in demand response as valuable flexible resources.
To ensure the reproducibility of this case study, all key parameters are specified as follows. Electricity price traces are sourced from the U.S. Energy Information Administration (EIA) for three ISO regions (PJM, CAISO, and ERCOT) covering January to December 2024, while ambient temperature data are obtained from the NOAA Global Historical Climatology Network (GHCN-Daily) for three corresponding cities (Chicago, Los Angeles, and Houston) with hourly averaging. Workload arrivals follow a Poisson process with a mean rate of 120 tasks per hour, split among the three DCs in proportion to their core counts. Task computing demand follows a lognormal distribution with a mean of 500 core·GHz·ms and a standard deviation of 200. The number of subtasks per task is uniformly distributed between one and eight, and the DAG structure of each task is generated using the Gaussian elimination method with a dependency density of 0.3, based on the Alibaba Cluster Trace V2018. The bandwidth capacity between any two DCs is set to 1 Gbps (symmetric), with a communication cost coefficient of 0.005 $/MB and an SLA violation penalty coefficient of 50 $/violation. For the cooling system, the COP fitting coefficients are calibrated from a real CRAC unit; the rated fan power per zone is 15 kW, and the safe indoor temperature range is set to [18 °C, 27 °C]. All stochastic processes are initialized with a random seed of 42, and each experimental configuration is run over 10 independent trials with results reported as mean values.
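The synthetic workload described above can be sketched with the Python standard library. The conversion from the stated mean and standard deviation to the parameters of the underlying normal distribution is the standard lognormal identity; the function names and the dictionary keys are hypothetical, and the DAG generation and the per-DC split are omitted.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert a lognormal mean/std to the (mu, sigma) of the underlying normal."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def generate_tasks(hours, rate_per_hour=120.0, seed=42):
    """Generate Poisson task arrivals with lognormal demand and 1 to 8 subtasks."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(500.0, 200.0)
    tasks, t = [], 0.0
    while True:
        # A Poisson process has exponential inter-arrival times (in hours here).
        t += rng.expovariate(rate_per_hour)
        if t >= hours:
            break
        tasks.append({
            "arrival_h": t,
            "demand": rng.lognormvariate(mu, sigma),  # core·GHz·ms
            "n_subtasks": rng.randint(1, 8),          # inclusive bounds
        })
    return tasks
```

Calling `generate_tasks(24)` yields roughly 2880 tasks for one day, and repeated calls with the same seed return identical traces.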
5.2. Decision Result Analysis
To verify the superiority of the method proposed in this paper, comparative experiments were designed for result analysis:
Case 1: The proposed method.
Case 2: Double-layer scheduling without physical regulation. The hierarchical multi-agent architecture is retained, focusing on the global spatiotemporal optimization of computing tasks while physical parameters are fixed.
Case 3: Single-layer local scheduling. The GCA and the inter-DC migration mechanism are removed, and the LDCA only optimizes the locally arrived tasks, without any physical parameter regulation.
Case 4: Rule-based static scheduling. Upon arrival, tasks are evenly allocated across the zones inside the DC and executed immediately.
Table 5 presents the SLA violation rate, average overtime duration, total cost, and PUE under the four cases. First, all four cases completed the computing tasks without violating the SLA, which indicates that the scheduling methods meet the basic service requirements. Second, relative to Case 4, Cases 3, 2, and 1 reduce the total cost by 1.77%, 5.12%, and 36.19%, respectively. Costs can therefore be reduced by delaying computing tasks, migrating them across DCs, and adjusting physical equipment parameters, although the magnitude of the reduction differs significantly among these methods. Compared with Case 4, Case 3 only delays computing tasks in time and migrates them between zones within the DC, so that task execution avoids peak electricity price periods and zone resources are better utilized. Building on Case 3, Case 2 enables cross-DC task migration. Because electricity prices and temperatures differ significantly between DCs, migrating computing tasks to DCs with low electricity prices and low temperatures prevents the sharp increase in cooling costs caused by excessive IT equipment power, thereby reducing the total cost. Case 1 further adjusts the active cores and frequencies of the AS on top of Case 2. By combining the spatiotemporal adjustment of computing tasks with the adjustment of physical equipment operating parameters, Case 1 achieves the lowest cost, which reflects the advantage of its collaborative hierarchical architecture. Finally, in terms of PUE, Cases 2–4 do not adjust the operating parameters of physical equipment, which results in higher PUE values. In contrast, Case 1 adjusts both the AS and the cooling equipment, which not only reduces the total power but also limits cooling energy consumption, thereby lowering the PUE.
Figure 4 shows the task execution status under the four cases. Each color in the figure represents one DC, and a darker color indicates a larger number of tasks being executed. In Case 4, after receiving computing tasks, they are evenly allocated to each region for execution with zero delay. Tasks are concentrated in the early stage of the cycle, and the number of tasks is dense during the peak electricity price period. However, the number of executing tasks is small during the low electricity price period after 21:00, which leads to high electricity costs. In Case 3, tasks are executed with delay and migrated between regions within the DC, and most computing tasks are executed during the low electricity price period (00:00–08:00). However, due to the large number of arriving tasks, the small capacity of DC3, and its high electricity price and temperature, there is almost no possibility of delay, and a large number of computing tasks are executed throughout the scheduling cycle, resulting in limited adjustment capability of this strategy. In Case 2, a large number of tasks are migrated to DC2, which has low electricity prices and low temperatures, and are concentrated in the low electricity price period (00:00–08:00) for execution. This reduces the electricity cost to a certain extent but leads to unbalanced load distribution. DC3 only executes fixed loads, resulting in serious core idleness. In contrast, Case 1 avoids load imbalance and achieves cost reduction through load balancing and physical equipment adjustment. The cores and frequency of AS match the tasks being executed, and the cooling power consumption is coordinated with the IT power consumption, ensuring that the temperature does not exceed the limit.
Figure 5 shows the total power of each DC under the four cases. For every DC, the total power in Case 1 is the lowest, because Case 1 combines the spatiotemporal adjustment of computing tasks with the adjustment of physical equipment parameters. In Case 2, the excessive migration of computing tasks to DC2 makes the power of DC2 extremely high, while DC3 operates only with its fixed load and therefore has extremely low power, which leads to a high degree of load imbalance. The power of Case 3 is similar to that of Case 4, but Case 3 avoids excessive load during peak periods by delaying tasks, for example in DC3 during 08:00–10:00 and 18:00–21:00.
Figure 6 presents the core occupancy rate of servers within each DC under the four cases. It is evident that the core occupancy rate in Case 1 is concentrated between 60% and 90%. This range not only reduces idle cores to lower IT power consumption but also ensures that there are standby cores to cope with the sudden arrival of new tasks, achieving a balance between resource utilization and service stability. In Case 2, DC1 and DC3 are allocated a small number of tasks, but all cores remain active, resulting in low core occupancy rates and significant resource waste. In Case 3 and Case 4, the core occupancy rate is related to the number of arriving tasks. In this experiment, the number of computing tasks arriving at the three DCs is set to be nearly the same. However, due to differences in DC capacity, the core occupancy rate of DC1 is the lowest, followed by DC2, and DC3 has the highest core occupancy rate.
Figure 7 shows the box plots of indoor temperature in each DC under the four cases. The effect of different scheduling strategies on indoor temperature can be clearly observed. The temperature distribution of the three DCs under Case 1 is the most concentrated and stable, with the core range focusing on 23–25 °C and the smallest average interquartile range. By dynamically adapting the operating parameters of cooling equipment, it can respond to real-time changes in IT power consumption, effectively suppress temperature fluctuations, and achieve precise temperature control. In contrast, Cases 2–4 exhibit obvious shortcomings in temperature regulation performance. While their temperatures in DC1 and DC3 fall within a reasonable range, DC2 suffers from excessively low temperatures, leading to unnecessary waste of cooling power consumption. This is because such strategies do not dynamically adjust the parameters of cooling equipment; the fan speed is fixed at a high level during natural cooling periods. When the heat generation of IT equipment is mismatched with cooling capacity, over-cooling becomes prominent. Notably, the temperature of DC2 in Case 2 is slightly higher than that in Cases 3 and 4. The core reason is that Case 2 allocates more computing tasks to DC2, and the increase in IT power consumption partially offsets the impact of over-cooling, resulting in a relative rise in temperature.
5.3. Sensitivity Analysis
5.3.1. Differences in Regulation Results Under Different Seasons
Different seasons require distinct regulation strategies for DCs. This paper analyzes the regulatory differences across three seasonal categories: spring/autumn, summer, and winter.
Figure 8 depicts the total cost and PUE of the four scenarios under three seasonal conditions, which intuitively reflects the adaptability of each scheduling strategy to seasonal changes. The horizontal axis of the figure represents different seasons, while the vertical axis is divided into two parts to respectively display the total cost and PUE, realizing the synchronous comparison of the two key indicators. In terms of seasonal variation characteristics, summer has the highest total cost and PUE in all four cases, followed by spring/autumn, and winter has the lowest. This is mainly because the high ambient temperature in summer increases the load of the cooling system, leading to a significant rise in cooling power consumption and further increasing the total cost and PUE. In winter, the low ambient temperature allows most DCs to adopt natural cooling, which greatly reduces cooling energy consumption, thus reducing the total cost and PUE.
Figure 9 illustrates the power consumption of each DC under different seasons. From the perspective of cooling power, the proportion of cooling power in the total power is the highest in summer, followed by spring/autumn, and the lowest in winter. This difference stems from the dominant role of seasonal ambient temperature in cooling modes: the outdoor temperature in summer is significantly high, so each DC needs to operate mechanical cooling throughout the day, and the continuous operation of compressors leads to a surge in cooling power consumption. In contrast, the outdoor temperature in winter is low, and natural cooling can be achieved only through fan-driven heat exchange between indoor and outdoor environments without starting compressors, resulting in a substantial reduction in cooling power consumption. In terms of IT power allocation, the IT power in spring/autumn and summer is evenly distributed among the three DCs. Given that the ambient temperature is moderate in spring/autumn and the cooling load is already high in the hot summer environment, excessive concentration of IT power in a single DC would lead to a sudden increase in local IT power consumption, which in turn triggers a synchronous rise in cooling energy consumption and ultimately drives up the total cost. However, the IT power allocation in winter adopts a differentiated strategy: since winter can fully rely on natural cooling, there is no need to worry about the surge in cooling power consumption caused by excessively high IT power. Therefore, most tasks are prioritized to DC1 and DC2 with lower electricity prices, while DC3, with the highest electricity price, only undertakes a small number of tasks. Through the differentiated configuration of electricity prices, the total cost is significantly reduced.
Figure 10 presents the box plots of server state regulation of each DC under different seasons, which focuses on reflecting the regulation effect on server state and adaptability to seasonal changes. Each subgraph corresponds to one DC, and the box plots of active core ratio and frequency scaling factor are distinguished by different colors. In summer, each DC tends to reduce the active core ratio and frequency scaling factor to minimize additional power consumption. This is reflected in the fact that the median frequency scaling factor of the three DCs is the lowest in summer, and the median active core ratio of DC1 and DC2 is also the lowest in summer. Notably, the active core ratio and frequency scaling factor of DC2 are consistently higher than those of DC1 and DC3. The core reason is that DC2 has a lower electricity price, allowing it to appropriately activate more cores and increase the operating frequency to ensure fast task execution and avoid task accumulation or timeout.
Figure 11 shows the variation of the indoor temperature of each DC under different seasons. In spring/autumn, the indoor temperature of each DC is stably maintained at 23–26 °C; in summer, it ranges from 24 to 27 °C and shows a slight overall upward trend; in winter, it is the lowest, ranging from 18 to 25 °C, and decreases slowly throughout the period. The winter behavior arises because the outdoor ambient temperature is significantly low and the share of time spent in natural cooling in each DC is greatly increased. Fan-driven heat exchange between the indoor and outdoor environments alone is sufficient to keep the temperature above the safe lower limit, without additional mechanical heating. The temperature in spring/autumn is the most stable because the outdoor ambient temperature in these seasons is moderate and close to the optimal operating temperature range of DCs. The cooling system does not need to switch operating modes frequently, and natural cooling combined with short-term mechanical cooling can balance the heat generated by IT equipment. The upward trend in summer arises because high outdoor temperatures persist for a long time and each DC must operate mechanical cooling for an extended period. The outdoor temperature peaks from noon to evening, which reduces the heat dissipation efficiency of the cooling system, and the resulting imbalance between cooling capacity and heat generation causes the indoor temperature to rise gradually.
5.3.2. Impact of Learning Rate on Training Results
The learning rate is the step-size coefficient for network updates and affects both the convergence efficiency and the stability of training. We vary its value and study the impact on convergence, on the stability of the training process, and on the final regulation effect.
Figure 12 shows the impact of the learning rate on the training process. It can be found that convergence can be achieved at around 700 iterations under different learning rates, but the reward after convergence is the maximum when the learning rate is 1 × 10⁻⁵.
5.4. Computational Complexity and Deployment Feasibility
This section discusses the online inference time, scalability, and real-time deployment feasibility of the proposed H-MADDPG algorithm.
Online decision-making time: The inference time of the trained H-MADDPG policy is dominated by forward passes through the actor networks. On a standard server (Intel Xeon Gold 6248, 2.5 GHz, 32 GB RAM), the average inference time per time step is approximately 23 ms for the GCA and 15 ms for each LDCA, totaling less than 80 ms for three DCs. Since each time step in the simulation corresponds to 15 min in real operation, this inference overhead is negligible and well within real-time control constraints.
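The timing argument above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the 23 ms and 15 ms figures from the text, assumes the three LDCAs are evaluated one after another on the same machine (the text does not say whether they run sequentially or in parallel), and uses a hypothetical helper name, `inference_budget`.

```python
# Timing figures quoted in the text; everything else is an illustrative assumption.
GCA_MS = 23.0               # average GCA inference time per time step
LDCA_MS = 15.0              # average inference time of a single LDCA
STEP_MS = 15 * 60 * 1000    # one 15-minute control step, in milliseconds


def inference_budget(num_dcs: int) -> tuple:
    """Return (total inference time in ms, fraction of the control step used).

    Assumes the LDCAs are evaluated sequentially, which is the worst case
    for a single machine.
    """
    total = GCA_MS + LDCA_MS * num_dcs
    return total, total / STEP_MS


total_ms, fraction = inference_budget(3)
print(f"{total_ms:.0f} ms per step, {fraction:.4%} of the 15 min step")
```

Under these assumptions the sequential total is 68 ms, consistent with the "less than 80 ms" stated above, and it occupies well under 0.01% of a control step.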
Scalability: The hierarchical decomposition keeps the decision problem tractable as the system grows. The GCA handles only high-level task allocation, while each LDCA independently manages its local subtask scheduling and equipment regulation. Because no single agent has to select a joint action for all DCs, this divide-and-conquer design reduces the action space from exponential to polynomial complexity in the number of DCs, and additional DCs or zones can be accommodated without fundamentally altering the decision structure. This makes the framework operationally feasible for geo-distributed data center deployments of realistic scale.
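The scaling argument can be made concrete with a simple count. The sketch below is purely illustrative and rests on an assumption that is not part of the proposed method: each agent is imagined to pick one of `choices_per_agent` discrete options, solely so that actions can be counted. The function names are hypothetical.

```python
def flat_joint_actions(num_agents: int, choices_per_agent: int) -> int:
    """Joint actions a single centralized controller would have to choose among."""
    return choices_per_agent ** num_agents


def decomposed_actions(num_agents: int, choices_per_agent: int) -> int:
    """Total options when every agent chooses only its own action."""
    return num_agents * choices_per_agent


# Compare the two counts as the number of agents grows (10 options each).
for n in (3, 6, 12):
    print(n, flat_joint_actions(n, 10), decomposed_actions(n, 10))
```

With 10 options per agent, the flat count grows from 1000 (3 agents) to 10^12 (12 agents), while the decomposed count grows only from 30 to 120.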
Real-time deployment feasibility: During the decentralized execution phase, each agent makes decisions based solely on its local observations and requires no online communication with the centralized critic or with other agents. The GCA can run on a central controller, while each LDCA runs on an edge server co-located with its corresponding DC. The lightweight inference makes real-time control feasible. However, practical deployment on physical hardware would require additional validation against hardware delays, measurement noise, and occasional communication faults, as acknowledged in the limitations discussed in Section 6.
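A minimal sketch of decentralized execution is given below. It is not the authors' implementation: the `Agent` class, the `placeholder_actor` function (a stand-in for a trained actor network), and all observation values are hypothetical and serve only to show that each agent maps its own local observation to an action without consulting a critic or any other agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Agent:
    """One decision-maker holding only its own (already trained) actor."""
    name: str
    actor: Callable[[Vector], Vector]

    def act(self, local_obs: Vector) -> Vector:
        # Execution uses the local observation only: no critic, no peer messages.
        return self.actor(local_obs)


def placeholder_actor(obs: Vector) -> Vector:
    """Stand-in for a trained actor network: clips each input to [0, 1]."""
    return [min(1.0, max(0.0, x)) for x in obs]


# One global agent plus one local agent per DC, as in the three-DC setting.
agents = [Agent("GCA", placeholder_actor)] + [
    Agent(f"LDCA{i}", placeholder_actor) for i in (1, 2, 3)
]

# Hypothetical local observations, one entry per agent (values are made up).
local_observations: Dict[str, Vector] = {
    "GCA": [0.4, 0.9, 0.1],
    "LDCA1": [0.7, 1.3],
    "LDCA2": [0.2, 0.5],
    "LDCA3": [-0.1, 0.8],
}

# Each agent acts independently on its own observation.
actions = {a.name: a.act(local_observations[a.name]) for a in agents}
for name, action in actions.items():
    print(name, action)
```

Because each call to `act` touches only that agent's observation, the agents in this sketch could run on separate machines, which mirrors the controller and edge-server placement described above.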
6. Conclusions
This paper proposes a computational–physical collaborative optimization model for geographically distributed DCs based on hierarchical reinforcement learning. By integrating spatiotemporal task migration, adaptive adjustment of server core count and frequency, and mode-switchable cooling control, the proposed H-MADDPG algorithm effectively bridges the gap between global coordination and local autonomy. The experimental results demonstrate that the proposed strategy reduces the total cost by 36.19%, lowers PUE to 1.47–1.60, maintains a 0% SLA violation rate, and achieves balanced resource utilization and stable temperature control compared to benchmark schemes.
Despite the promising results, several limitations must be acknowledged to frame the contribution realistically. First, this study is entirely simulation-based and has not been validated on a real DC control platform, where hardware delays, measurement noise, and communication faults may affect performance. Second, the cooling system employs a lumped thermal parameter model, and servers are abstracted as AS models, which simplify spatial temperature gradients and transient thermal dynamics. Third, long-term power purchase agreements (PPAs) are not considered, although they could significantly affect optimal migration decisions under stable electricity prices. Finally, the current framework assumes a single operator and does not address multi-operator coordination or data privacy concerns.
Based on these limitations, we propose the following specific research directions:
(1) Real-platform validation: Deploy the H-MADDPG algorithm on a small-scale experimental DC testbed (e.g., with 3–5 servers and controllable cooling units) to evaluate its real-time feasibility, robustness to communication delays, and generalization to non-stationary workloads.
(2) Refined cooling and server modeling: Replace the aggregated thermal model with zonal computational fluid dynamics (CFD) surrogate models or physics-informed neural networks (PINNs) to capture spatial temperature distributions and transient responses.
(3) Long-term PPA integration: Formulate a two-timescale optimization framework where PPAs are signed on a monthly/yearly basis (upper level) and real-time task migration is optimized on an hourly basis (lower level), with contract compliance constraints.
(4) Privacy-preserving multi-operator coordination: Develop a federated multi-agent reinforcement learning (Fed-MADDPG) framework that enables DCs of different operators to coordinate without sharing local workload or cooling state data.
(5) Stress condition evaluation: Conduct a systematic evaluation under stressed operating conditions, including tighter deadlines (e.g., 30% reduction), heavier workloads (e.g., doubled arrival rate), and limited inter-DC bandwidth, to further validate the robustness of the hierarchical framework.
(6) Carbon-aware co-optimization: Incorporate real-time carbon intensity signals into the reward function to achieve a joint reduction in electricity cost and carbon footprint, supporting green DC operations.
In summary, this work establishes a systematic and extensible hierarchical RL framework for DC energy optimization. The explicit discussion of limitations and corresponding concrete future directions provides a realistic foundation for transitioning from simulation-based research to practical deployment.