Flexibility Resource Services and Electricity Cost Optimization Oriented Control Strategy of Data Centers Based on Hierarchical Reinforcement Learning

He, Pengfei; Sun, Rongfu; Pfeifer, Antun; Wang, Ge; Liu, Qinzhe; Duić, Neven; Zhen, Zhao; Wang, Fei; Xiao, Yunpeng

doi:10.3390/electronics15091901


by Pengfei He ¹, Rongfu Sun ², Antun Pfeifer ³, Ge Wang ¹,⁴, Qinzhe Liu ², Neven Duić ³, Zhao Zhen ¹,*, Fei Wang ¹ and Yunpeng Xiao ⁵

¹ Yanzhao Electric Power Laboratory, North China Electric Power University, Baoding 071003, China

² State Grid Jibei Electric Power Co., Ltd., Beijing 100045, China

³ Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb, HR-10000 Zagreb, Croatia

⁴ Department of Electrical and Electronic Engineering, The University of Manchester, Manchester M13 9PL, UK

⁵ School of Electrical Engineering, Xi’an Jiaotong University, Xi’an 710049, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(9), 1901; https://doi.org/10.3390/electronics15091901

Submission received: 23 April 2026 / Revised: 27 April 2026 / Accepted: 28 April 2026 / Published: 30 April 2026

(This article belongs to the Special Issue Control, Diagnostics and Protection for Electrical Machines, Power Electronics and Drives)


Abstract

As the core of digital infrastructure, the exceptionally rapid development of data centers (DCs) faces serious challenges due to their high electricity costs. Traditional approaches treat computational task scheduling separately from different physical control mechanisms, such as server group management, overlooking the synergistic potential between the two aspects. To address this problem, this paper proposes a computational–physical collaborative optimization model that realizes spatiotemporal task migration on the computational side and adaptive parameter regulation of IT equipment and cooling devices on the physical side. In response to the lack of global coordination in conventional distributed optimization, a two-layer partially observable Markov game (POMG) is constructed to unify global cooperative decision-making and local autonomous control. On this basis, the hierarchical multi-agent deep deterministic policy gradient (H-MADDPG) algorithm is designed by introducing task priority ranking and a variable-dimension action mask mechanism, which effectively handles the discrete–continuous hybrid action space and adapts to the dynamic variation in action dimensions caused by uncertain task arrivals. Comparative experiments with various benchmark schemes are conducted to verify the effectiveness and superiority of the proposed strategy in total cost, power usage effectiveness (PUE), resource utilization, and load balancing.

Keywords:

geographically distributed data centers; spatiotemporal migration; adaptive regulation of physical parameters; two-layer partially observable Markov game; multi-agent reinforcement learning

1. Introduction

1.1. Background and Motivation

With the vigorous development of the digital economy, the application of emerging technologies, such as cloud computing and artificial intelligence (AI), has been expanding continuously. As the physical carrier supporting their operation, DCs have become major energy consumers [1,2]. The power consumption of DCs accounted for approximately 1.5% of the global total power consumption in 2024 [3], reaching 415 terawatt-hours (TWh). By 2030, the power consumption of DCs is projected to more than double [4]. On the one hand, DCs can adjust their power consumption according to real-time electricity price signals to reduce the operating cost. On the other hand, DCs can also act as important flexible resources to participate in demand response programs, helping the power grid maintain safe and stable operation. Therefore, how to regulate the energy load of DCs, unlock their flexibility potential, and improve energy efficiency has become a key concern for DC operators.

1.2. Literature Review

Unlike traditional loads that only have temporal regulation potential [5,6], DCs are characterized by the core attribute of spatiotemporal adjustability [7,8]. By exploiting spatiotemporal electricity price disparities, computational tasks can be migrated among DCs or rescheduled within a single DC, thus cutting energy costs [9]. Ref. [10] established a two-layer model to explore workload migration strategies for urban DCs and discussed the impact of migration costs on the total system cost. Ref. [11] explored the spatiotemporal adjustability potential of DCs in peer-to-peer (P2P) trading. Refs. [12,13] mathematically characterized the temporal flexibility of batch-processing loads through the transferable power matrix and data-driven methods, respectively. Refs. [14,15] proposed a low-carbon-oriented spatiotemporal task migration mechanism. Ref. [16] formulated the bidding strategy for DC aggregators participating in the electricity market.

Moreover, a DC internally comprises IT equipment, cooling systems, and auxiliary infrastructure [17]. Among these components, IT equipment and cooling systems account for more than 80% of the total power consumption of the DC [18,19], and thus their physical parameters can be directly regulated to achieve significant energy-saving effects. In terms of IT equipment regulation, dynamic voltage and frequency scaling (DVFS) technology reduces server power consumption by real-time sensing of system load changes and dynamically adjusting the operating voltage and frequency of chips [20,21,22]. Ref. [23] maximized system performance by determining the optimal number of cores and the corresponding thread layout. Ref. [24] proposed an energy-aware virtual machine scheduling method based on thermal models. In terms of cooling system regulation, Ref. [25] constructed an integrated energy system for DCs to achieve efficient and low-carbon operation through the electro-thermal coupling among subsystems. Refs. [26,27] investigated the energy-saving benefits of air-side free-cooling technology under different meteorological conditions. Ref. [28] minimized the energy consumption of cooling systems through a novel economic model predictive control strategy while maintaining the stability of the thermal environment in DCs.

However, the above studies only optimize the computational domain (i.e., spatiotemporal scheduling of computing tasks) and physical domain (i.e., regulation of IT and cooling equipment) in isolation, ignoring their coupling relationship. Isolated optimization of either domain may degrade the performance of the other. Some studies have attempted to break down the barriers between task scheduling and physical control. Ref. [29] integrated three coupled regulation methods, namely geographic load balancing, batch task scheduling, and building thermal inertia-based cold storage, to achieve a reduction in energy consumption costs. Ref. [30] realized the goal of DC energy efficiency optimization through the regulation of internal components and load scheduling of DCs. Ref. [31] constructed a collaborative optimization model for spatiotemporal load balancing and energy management in DCs. However, these works still consider limited factors, often overlooking the intricate coupling between computing and physical domains under multidimensional constraints. In fact, the collaborative optimization is a complex process that requires balancing multiple objectives under the following key factors: (1) Network constraints: cross-DC task migration must ensure sufficient bandwidth to prevent transmission latency from violating service level agreements (SLAs). (2) Thermal constraints: task migration or IT equipment adjustment reshapes internal power distribution, requiring cooling regulation to avoid both local temperature violations and surges in cooling energy consumption [32,33]. (3) Computing constraints: adjustments to physical equipment parameters may lead to task execution timeout or failure, necessitating a balance between energy efficiency optimization and computational reliability. (4) Uncertainty: fluctuations in task arrivals, electricity prices, and other factors necessitate the design of algorithms to cope with decision-making risks in dynamic environments.

Thus, the regulation problem of DCs is essentially a high-dimensional, stochastic, and dynamic sequential decision-making problem. Traditional model-based or rule-based optimization methods struggle to achieve real-time and robust control in such environments. With the development of AI technology, reinforcement learning (RL) has been gradually introduced into the field of DC energy efficiency optimization [34,35,36,37,38,39]. However, most of them are not designed with a hierarchical decision-making structure, making it difficult to balance global coordination and local autonomy. Moreover, they cannot effectively handle heterogeneous hybrid actions coexisting with discrete task scheduling and continuous physical regulation, nor adapt to scenarios where the action dimension varies dynamically with tasks.

1.3. Contributions

Table 1 compares this study with the existing literature. This paper comprehensively considers multiple regulatory approaches to achieve the integrated energy management of DCs. The main contributions are as follows:

(1): A computational–physical deeply coupled collaborative optimization model is established, which integrates spatiotemporal migration of tasks, dynamic adjustment of server core count and operating frequency, and adaptive switching of cooling modes.
(2): The optimization problem is formulated as a two-layer POMG, which realizes the organic coordination between global cooperative decision-making and local autonomous control.
(3): The H-MADDPG algorithm is proposed with a global coordinator agent (GCA) and dual heterogeneous actors in each local DC agent (LDCA), combined with task priority and a variable-dimension action mask to adapt to hybrid decision-making and dynamic task arrivals.
(4): A geographically heterogeneous DC simulation is conducted, and the superior performance of the proposed method in cost, PUE, resource utilization, and load balancing is verified through benchmark, seasonal adaptability, and sensitivity analyses.

1.4. Structure of the Paper

The remainder of this paper is organized as follows. Section 2 establishes a collaborative optimization model. Section 3 transforms the model into a two-layer POMG. Section 4 proposes an H-MADDPG algorithm. Section 5 presents the experimental analysis, and Section 6 summarizes the work of this paper and outlines future research directions.

2. Computational–Physical Collaborative Optimization Model

This paper considers a system managed by a single operator, which comprises the following core components:

DC: Each DC $k \in K = \{1, 2, \dots, K\}$ operates in a distinct external environment, including real-time electricity prices $C_{k}^{e}(t)$ and outdoor temperatures $T_{k}^{out}(t)$.

Zones: The interior of DC $k$ is divided into $M_{k}$ independent zones, and the cooling parameter of each zone $m \in M_{k} = \{1, 2, \dots, M_{k}\}$ can be adjusted independently.

Aggregated server (AS): All servers within cooling zone $m$ are abstracted as a single AS with the following attributes: (1) Total computing capacity $N_{k,m}$: the sum of the physical cores of all servers in the zone. (2) Peak single-core processing capacity $\mu_{k,m}^{peak}$: the processing speed of servers in the zone when running at the maximum operating frequency. (3) Active core ratio $a_{k,m}(t) \in [0, 1]$: the ratio of the number of online active cores to the total number of cores. (4) Frequency scaling factor $f_{k,m}(t) \in [f_{\min}, 1]$: the ratio of the actual operating frequency of cores to their peak frequency. (5) Core occupancy rate $U_{k,m}(t) \in [0, 1]$: the ratio of the number of cores executing tasks to the total number of active cores.

Computational tasks: Task $i \in I = \{1, 2, \dots, I\}$ can be represented by a directed acyclic graph (DAG) $G_{i} = \{V_{i}, \mathcal{E}_{i}\}$, where $V_{i}$ denotes the set of subtasks inside task $i$ and $\mathcal{E}_{i}$ defines the dependency relationships between subtasks. For task $i$, its arrival time is denoted as $A_{i}$, the initial arrival DC is $O_{i}$, the latest completion time is $D_{i}$, and the bandwidth requirement is $B_{i}$. Subtask $v \in V_{i} = \{1, 2, \dots, V_{i}\}$ is the minimum unit for scheduling and execution, with an initial computing demand of $Com_{i,v}^{ini}$, a remaining computing demand at time $t$ of $Com_{i,v}(t)$, and a core demand of $Cor_{i,v}$.

The operator is required to dynamically make the following decisions to minimize the total cost on the premise of satisfying task service quality and ensuring the safe operation of equipment: (1) Spatial migration of tasks across DCs $z_{i,k}$. (2) Execution zone $y_{i,v,k,m}$ and execution time $s_{i,v}$ of subtasks within a DC. (3) IT equipment regulation: active core ratio $a_{k,m}(t)$ and frequency scaling factor $f_{k,m}(t)$. (4) Cooling system regulation: fan speed $s_{k,m}(t)$ or set temperature $T_{k,m}^{set}(t)$. Figure 1 depicts the architecture of the computational–physical collaborative optimization model. The computational domain handles task decomposition, spatiotemporal migration, and subtask scheduling. The physical domain manages IT equipment regulation (active core ratio and frequency) and cooling system control (fan speed or set temperature). The two domains are coupled through thermal and computational constraints: task allocation affects IT power consumption, which in turn impacts cooling load and indoor temperature, while equipment regulation influences task execution efficiency.

2.1. Modeling of Task Allocation and Execution

The number of subtasks within each task is determined by its internal DAG structure and is assumed to be known upon task arrival. Subtasks are not dynamically split by the system during runtime; instead, they are predefined at the time of task submission. This ensures that all dependency relationships and computing demands are fully observable to the scheduler.

$z_{i,k} \in \{0, 1\}$ is a task-level allocation variable, where $z_{i,k} = 1$ indicates that task $i$ is allocated to DC $k$, and $z_{i,k} = 0$ otherwise.

$y_{i,v,k,m} \in \{0, 1\}$ is a subtask-level allocation variable, where $y_{i,v,k,m} = 1$ indicates that subtask $v$ of task $i$ is allocated to zone $m$ of DC $k$ for execution, and $y_{i,v,k,m} = 0$ otherwise.

Each task must be allocated to exactly one DC:

$$\sum_{k \in K} z_{i,k} = 1 \quad \forall i \in I \tag{1}$$

Since the minimum granularity of spatial allocation is a complete task, and task decomposition is only performed after the task has been assigned to the target DC, all subtasks belonging to the same task must be allocated to the same DC as the task itself:

$$\sum_{m \in M_{k}} y_{i,v,k,m} = z_{i,k} \quad \forall i \in I,\ v \in V_{i},\ k \in K \tag{2}$$

This constraint ensures that if $z_{i,k} = 1$, the allocations of each subtask $v$ of task $i$ across all zones within DC $k$ sum to 1, i.e., the subtask must be allocated within DC $k$; if $z_{i,k} = 0$, the sum equals 0, i.e., the subtask cannot be allocated within DC $k$.
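As a minimal sketch of how constraints (1) and (2) could be checked for a candidate allocation, the following Python function tests both conditions on nested lists. The function name `allocation_is_valid` and the list layout `z[i][k]`, `y[i][v][k][m]` are illustrative assumptions, not part of the original model.

```python
def allocation_is_valid(z, y):
    """Check Eqs. (1) and (2) for binary allocation variables.

    z[i][k]       -- 1 if task i is allocated to DC k (hypothetical layout)
    y[i][v][k][m] -- 1 if subtask v of task i runs in zone m of DC k
    """
    for i, z_i in enumerate(z):
        # Eq. (1): each task is allocated to exactly one DC.
        if sum(z_i) != 1:
            return False
        for y_iv in y[i]:
            for k, z_ik in enumerate(z_i):
                # Eq. (2): zone allocations inside DC k sum to z_{i,k}.
                if sum(y_iv[k]) != z_ik:
                    return False
    return True


# One task with two subtasks, two DCs with two zones each.
z = [[0, 1]]                          # task 0 is allocated to DC 1
y_ok = [[[[0, 0], [1, 0]],            # subtask 0 -> DC 1, zone 0
         [[0, 0], [0, 1]]]]           # subtask 1 -> DC 1, zone 1
y_bad = [[[[1, 0], [0, 0]],           # subtask 0 wrongly placed in DC 0
          [[0, 0], [0, 1]]]]
print(allocation_is_valid(z, y_ok), allocation_is_valid(z, y_bad))
```

The second allocation fails because a subtask is placed in a DC other than the one hosting its parent task.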

To characterize the arrival time of tasks at the original DC, a variable $\delta_{i,k}(t)$ is introduced, where $\delta_{i,k}(t) = 1$ denotes that task $i$ initially arrives at DC $k$ at time $t$:

$$\delta_{i,k}(t) = \begin{cases} 1, & \text{if } O_{i} = k \text{ and } A_{i} = t \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

At time $t$, if task $i$ initially arrives at DC $k$ and is ultimately allocated to DC $l$, it needs to be transmitted across DCs, which occupies bandwidth. Transmission is assumed to be executed immediately upon the task’s arrival and completed instantaneously. The sum of the bandwidth requirements of all tasks to be transmitted from DC $k$ to DC $l$ must not exceed the link capacity $B_{k,l}$, so that the bandwidth limit is not exceeded at the transmission instant:

$$\sum_{i \in I_{k}^{arr}(t)} B_{i} \delta_{i,k}(t) z_{i,l} \leq B_{k,l}, \quad k \neq l \tag{4}$$

where $I_{k}^{arr}(t)$ denotes the set of tasks arriving at DC $k$ at time $t$.
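The bandwidth constraint (4) can be illustrated with a short Python sketch that accumulates the load on each directed link for the tasks arriving in one time step. The tuple layout `(task_id, origin, bandwidth)`, the `destination` mapping, and the function names are assumptions made for this example only.

```python
from collections import defaultdict


def link_loads(arrivals, destination):
    """Sum bandwidth demand per directed link (k, l) with k != l.

    arrivals    -- list of (task_id, origin_dc, bandwidth) for one time step
    destination -- dict mapping task_id to the DC chosen for the task
    """
    loads = defaultdict(float)
    for task_id, origin, bandwidth in arrivals:
        target = destination[task_id]
        if target != origin:              # only migrated tasks use the link
            loads[(origin, target)] += bandwidth
    return dict(loads)


def bandwidth_feasible(arrivals, destination, capacity):
    """Eq. (4): every link load must stay within its capacity B_{k,l}."""
    loads = link_loads(arrivals, destination)
    # A link missing from `capacity` is treated as having zero capacity.
    return all(load <= capacity.get(link, 0.0) for link, load in loads.items())


arrivals = [("t1", 0, 4.0), ("t2", 0, 3.0), ("t3", 1, 2.0)]
destination = {"t1": 1, "t2": 1, "t3": 1}
print(link_loads(arrivals, destination))
print(bandwidth_feasible(arrivals, destination, {(0, 1): 10.0}))
```

Tasks that stay at their arrival DC contribute nothing to any link, which mirrors the condition $k \neq l$ in (4).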

For task $i$ at time $t$, the set of its subtasks being executed, denoted as $V_{i}^{exe}(t)$, is defined as follows:

$$V_{i}^{exe}(t) = \{v \in V_{i} \mid s_{i,v} \leq t \leq f_{i,v}\} \tag{5}$$

where $s_{i,v}$ and $f_{i,v}$ denote the start execution time and finish execution time of subtask $v$ in task $i$, respectively.

Let $N_{k,m}^{active}(t) = a_{k,m}(t) N_{k,m}$ denote the number of active cores at time $t$. Among these active cores, some are occupied by executing tasks, while the remaining ones are idle. At any time $t$, the number of cores executing tasks in zone $m$ must not exceed its current number of available cores $N_{k,m}^{a}(t)$, so as to ensure that the activated computing cores are sufficient to meet the demand:

$$N_{k,m}^{a}(t) = N_{k,m}^{active}(t) \left(1 - U_{k,m}(t)\right) \tag{6}$$

$$\sum_{i \in I} \sum_{v \in V_{i}^{exe}(t)} y_{i,v,k,m} Cor_{i,v} \leq N_{k,m}^{a}(t) \tag{7}$$

The finish time of subtask $v$ is determined by its start time, computing demand, core demand, and the processing capacity of the AS in the corresponding zone. The remaining computing demand of subtask $v$ at time $t$ can be expressed as:

$$Com_{i,v}(t) = Com_{i,v}(t-1) - \sum_{k \in K} \sum_{m \in M_{k}} y_{i,v,k,m} Cor_{i,v} \mu_{k,m}(t-1) \Delta t \tag{8}$$

$$\mu_{k,m}(t-1) = f_{k,m}(t-1) \mu_{k,m}^{peak} \tag{9}$$

where $\mu_{k,m}(t)$ denotes the actual single-core processing capacity after frequency scaling. Since $y_{i,v,k,m}$ is a binary variable and each subtask is allocated to exactly one AS (as shown in (1) and (2)), only one term in the summation is nonzero.

A subtask is considered to be completed when its remaining computing demand is less than or equal to zero:

$$f_{i,v} = \min \{t \mid Com_{i,v}(t) \leq 0\} \tag{10}$$
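The recursion in (8)-(10) can be traced step by step. The sketch below assumes a subtask already placed in a single zone, with a known sequence of frequency scaling factors from its start time onward; the function name `finish_time` and its argument names are hypothetical.

```python
def finish_time(com_init, cores, f_schedule, mu_peak, start, dt=1.0):
    """Return the first time step t with Com(t) <= 0, or None if unfinished.

    com_init   -- initial computing demand Com^{ini}
    cores      -- core demand Cor of the subtask
    f_schedule -- f_schedule[j] is the frequency scaling factor at step start + j
    mu_peak    -- peak single-core processing capacity of the zone
    """
    com = com_init
    t = start
    for f in f_schedule:
        mu = f * mu_peak              # Eq. (9): capacity after frequency scaling
        com -= cores * mu * dt        # Eq. (8): demand processed during one step
        t += 1
        if com <= 0:                  # Eq. (10): completion condition
            return t
    return None


# 10 units of work on 2 cores; frequency is halved in the second step.
print(finish_time(10.0, 2, [1.0, 0.5, 1.0], mu_peak=2.0, start=3))  # -> 6
```

Lowering the frequency in one step delays completion, which is the coupling between IT regulation and task deadlines that the model has to balance.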

For any dependency edge $(v, u) \in \mathcal{E}_{i}$, which indicates that subtask $v$ is the predecessor of subtask $u$, subtask $u$ can only start execution after subtask $v$ is completed, and the following constraint must be satisfied:

$$s_{i,u} \geq f_{i,v} \quad \forall (v, u) \in \mathcal{E}_{i},\ i \in I \tag{11}$$

Herein, the subtask level $H_{i,u}$ is defined. A higher level means that the subtask is executed later in the dependency chain, and it can only be initiated after more predecessor subtasks are completed. The quantification rule for the level can be expressed as follows:

$$H_{i,u} = \begin{cases} 1, & \text{if } \{v \in V_{i} \mid (v, u) \in \mathcal{E}_{i}\} = \emptyset \\ \max \{H_{i,v} + 1 \mid v \in V_{i},\ (v, u) \in \mathcal{E}_{i}\}, & \text{otherwise} \end{cases} \tag{12}$$

where $v$ is a predecessor subtask of $u$. If subtask $u$ has no predecessor subtasks, its level is 1; if subtask $u$ has predecessor subtasks, its level is equal to the maximum level value of all its predecessor subtasks plus one.
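The level rule (12) is a longest-path computation on the DAG. A minimal Python sketch, assuming the DAG is given as a list of subtask identifiers and a list of `(v, u)` dependency edges (both hypothetical input formats), is:

```python
from functools import lru_cache


def subtask_levels(subtasks, edges):
    """Compute H_u for every subtask of one task according to Eq. (12).

    subtasks -- iterable of hashable subtask identifiers
    edges    -- iterable of (v, u) pairs meaning v must finish before u starts
    The graph is assumed to be acyclic, as required for a DAG.
    """
    predecessors = {u: [] for u in subtasks}
    for v, u in edges:
        predecessors[u].append(v)

    @lru_cache(maxsize=None)
    def level(u):
        if not predecessors[u]:       # no predecessors: level 1
            return 1
        return max(level(v) + 1 for v in predecessors[u])

    return {u: level(u) for u in predecessors}


# Diamond-shaped DAG: 1 -> {2, 3} -> 4
print(subtask_levels([1, 2, 3, 4], [(1, 2), (1, 3), (2, 4), (3, 4)]))
# -> {1: 1, 2: 2, 3: 2, 4: 3}
```

Memoization keeps the recursion linear in the number of edges; very deep dependency chains would need an iterative topological sort instead.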

For each task $i$, the start time of its first subtask shall not be earlier than its arrival time, and the completion time of its last subtask must be no later than its latest completion time [40]:

$$\min_{v \in V_{i}} s_{i,v} \geq A_{i} \quad \forall i \in I \tag{13}$$

$$\max_{v \in V_{i}} f_{i,v} \leq D_{i} \quad \forall i \in I \tag{14}$$

2.2. Modeling of AS

The core occupancy rate $U_{k,m}(t)$ of the AS in zone $m$ at time $t$ is defined as the ratio of the number of cores executing tasks to the total number of active cores:

$$U_{k,m}(t) = \frac{\sum_{i \in I} \sum_{v \in V_{i}^{exe}(t)} y_{i,v,k,m} Cor_{i,v} + N_{k,m}^{pre}(t)}{N_{k,m}^{active}(t)} \tag{15}$$

where $N_{k,m}^{pre}(t)$ is the number of cores occupied by fixed tasks, i.e., critical workloads that cannot be delayed, migrated, or preempted due to operational continuity, security constraints, or user specifications. Fixed tasks are not subject to spatiotemporal migration optimization. However, they still participate in SLA constraints and must be completed before their deadlines.

The total IT power $P_{k,m}^{IT}(t)$ of the AS in zone $m$ at time $t$ consists of three components: static power $P_{k,m}^{st}(t)$, dynamic computing power $P_{k,m}^{dy}(t)$, and state transition power $P_{k,m}^{tr}(t)$ [41]:

$$\begin{cases} P_{k,m}^{IT}(t) = P_{k,m}^{st}(t) + P_{k,m}^{dy}(t) + P_{k,m}^{tr}(t) \\ P_{k,m}^{st}(t) = N_{k,m}^{active}(t) P_{k,m}^{idle} \\ P_{k,m}^{dy}(t) = N_{k,m}^{active}(t) \left(P_{k,m}^{peak} - P_{k,m}^{idle}\right) \left(f_{k,m}(t)\right)^{\gamma} U_{k,m}(t) \\ P_{k,m}^{tr}(t) = \beta_{a} \left|\Delta a_{k,m}(t)\right| N_{k,m} + \beta_{f} \left|\Delta f_{k,m}(t)\right| N_{k,m} \end{cases} \tag{16}$$

where the static power refers to the minimum power required to maintain the basic operation of the server, and $P_{k,m}^{idle}$ denotes the idle power of a single core. The dynamic computing power is the additional power generated during the execution of tasks; $P_{k,m}^{peak}$ represents the peak power, and $\gamma$ is the dynamic power exponent. The state transition power refers to the instantaneous power overhead generated when the server state is altered; $\Delta a_{k,m}(t)$ and $\Delta f_{k,m}(t)$ denote the variation in the active core ratio and the frequency scaling factor, respectively, and $\beta_{a}$ and $\beta_{f}$ are the energy consumption coefficients for core count transition and frequency transition, respectively.
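To make the three components of (16) concrete, the following Python sketch evaluates the IT power of one AS. All numerical values in the usage lines are illustrative assumptions, not parameters taken from the paper.

```python
def it_power(n_cores, a, f, u, p_idle, p_peak, gamma,
             delta_a=0.0, delta_f=0.0, beta_a=0.0, beta_f=0.0):
    """Evaluate Eq. (16) for one aggregated server.

    n_cores -- total number of cores N_{k,m}
    a, f, u -- active core ratio, frequency scaling factor, core occupancy rate
    p_idle, p_peak -- idle and peak power per core
    gamma   -- dynamic power exponent
    delta_a, delta_f -- changes of a and f relative to the previous step
    beta_a, beta_f   -- transition cost coefficients
    """
    n_active = a * n_cores
    p_static = n_active * p_idle
    p_dynamic = n_active * (p_peak - p_idle) * f ** gamma * u
    p_transition = (beta_a * abs(delta_a) + beta_f * abs(delta_f)) * n_cores
    return p_static + p_dynamic + p_transition


# Illustrative numbers: 100 cores, half of them active, half of those busy.
print(it_power(100, a=0.5, f=1.0, u=0.5, p_idle=2.0, p_peak=10.0, gamma=3))  # 300.0
print(it_power(100, a=0.5, f=0.5, u=0.5, p_idle=2.0, p_peak=10.0, gamma=3))  # 125.0
```

With a cubic exponent, halving the frequency cuts the dynamic term by a factor of eight, while the static term is unchanged; this is why frequency scaling and core deactivation are complementary levers.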

The total power of DC $k$ at time $t$ is the sum of the IT power, cooling power, and auxiliary facility power. The power of auxiliary facilities is assumed to be proportional to the power of IT equipment:

$$P_{k,m}^{aux}(t) = \beta \cdot P_{k,m}^{IT}(t) \tag{17}$$

$$P_{k}^{total}(t) = \sum_{m \in M_{k}} \left(P_{k,m}^{IT}(t) + P_{k,m}^{cool}(t) + P_{k,m}^{aux}(t)\right) \tag{18}$$

where $\beta$ is a proportional coefficient covering the conversion power consumption of equipment, such as UPS, and the basic lighting energy consumption.

The PUE of DC $k$ is the ratio of the total input power of the DC to the actual power consumed by IT equipment, which can be expressed as:

$$PUE_{k}(t) = \frac{P_{k}^{total}(t)}{P_{k}^{IT}(t)} \tag{19}$$

where $P_{k}^{IT}(t) = \sum_{m \in M_{k}} P_{k,m}^{IT}(t)$ is the total IT power of DC $k$.
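Equations (17)-(19) can be combined in a few lines. In this sketch, each zone is given as a hypothetical `(p_it, p_cool)` pair, the IT power in the PUE denominator is taken as the sum over zones, and the value of `beta` is an arbitrary example.

```python
def total_power_and_pue(zones, beta):
    """Return (P_total, PUE) of one DC from per-zone IT and cooling power.

    zones -- list of (p_it, p_cool) tuples, one per zone
    beta  -- auxiliary power coefficient of Eq. (17)
    """
    p_it_total = sum(p_it for p_it, _ in zones)
    # Eq. (18): IT + cooling + auxiliary power, summed over zones.
    p_total = sum(p_it + p_cool + beta * p_it for p_it, p_cool in zones)
    # Eq. (19): ratio of total power to IT power.
    return p_total, p_total / p_it_total


p_total, pue = total_power_and_pue([(100.0, 20.0), (200.0, 40.0)], beta=0.1)
print(round(p_total, 6), round(pue, 6))  # 390.0 1.3
```

A PUE of 1.3 here means that 30% extra power is spent on cooling and auxiliary facilities for every unit of IT power.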

2.3. Modeling of Cooling System

The cooling system can switch between free cooling mode and mechanical cooling mode according to the outdoor temperature $T_{k}^{out}(t)$:

$$Mode_{k}(t) = \begin{cases} 0, & T_{k}^{out}(t) \leq T^{th} \\ 1, & T_{k}^{out}(t) > T^{th} \end{cases} \tag{20}$$

where $Mode_{k}(t) = 0$ represents the free cooling mode and $Mode_{k}(t) = 1$ represents the mechanical cooling mode.

When the system is in the free cooling mode, the compressor is completely shut down, and heat is removed by low-temperature outdoor air, with only the flow rate of the fluid being adjusted. The cooling power is regulated by controlling the rotation speed $s_{k,m}(t)$ of fans/water pumps. When the equipment operates at full speed, the maximum heat power that can be removed based on the temperature difference is given by:

$$Q_{k,m}^{\max}(t) = \kappa_{k,m} \left(T_{k,m}^{in}(t) - T_{k}^{out}(t)\right) \tag{21}$$

where $\kappa_{k,m}$ is the total heat transfer coefficient of the zone, which depends on the heat exchanger area and air duct design.

The actual heat removal capacity is proportional to the rotation speed:

$$Q_{k,m}^{cool}(t) = s_{k,m}(t) \cdot Q_{k,m}^{\max}(t) \tag{22}$$

To ensure that the heat generated by IT equipment is completely removed, the following condition must be satisfied:

$$Q_{k,m}^{cool}(t) \geq P_{k,m}^{IT}(t) \tag{23}$$

The power consumption of fans/water pumps is proportional to the cube of the rotation speed:

$$P_{k,m}^{cool}(t) = P^{rated} \left(s_{k,m}(t)\right)^{3} \tag{24}$$

where $P^{rated}$ is the rated maximum power of the equipment.
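The free cooling relations (21)-(24) can be chained to find the fan power needed to remove a given IT heat load. The sketch below picks the smallest speed that satisfies (23); this choice is an assumption made for illustration, since in the paper the speed is a decision variable of the control strategy. All numbers are hypothetical.

```python
def free_cooling(p_it, t_in, t_out, kappa, p_rated):
    """Return (speed, cooling_power) for the lowest feasible fan speed.

    Returns None when free cooling cannot remove the IT heat, i.e., when even
    full speed (s = 1) violates Eq. (23).
    """
    q_max = kappa * (t_in - t_out)        # Eq. (21): capacity at full speed
    if q_max <= 0 or p_it > q_max:
        return None
    speed = p_it / q_max                  # Eqs. (22)-(23) with equality
    power = p_rated * speed ** 3          # Eq. (24): cubic fan law
    return speed, power


print(free_cooling(p_it=50.0, t_in=25.0, t_out=5.0, kappa=5.0, p_rated=8.0))
# -> (0.5, 1.0)
print(free_cooling(p_it=200.0, t_in=25.0, t_out=5.0, kappa=5.0, p_rated=8.0))
# -> None
```

Because of the cubic law, running at half speed costs only one eighth of the rated power, which explains the energy-saving potential of free cooling at low outdoor temperatures.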

When the system is in the mechanical cooling mode, the compressor is activated to actively transfer heat through the refrigeration cycle. The evaporation temperature is changed by adjusting the temperature set point $T_{k,m}^{set}(t)$.

The coefficient of performance (COP) is defined as the ratio of the heat removal capacity to the input electrical power, which can be modeled as:

$$COP_{k,m}(t) = a + b\, T_{k,m}^{set}(t) \tag{25}$$

where $a$ and $b$ are empirical fitting parameters calibrated to real-world DC cooling system characteristics.

The power consumption of the cooling system is jointly determined by the heat to be removed (approximately equal to the IT power consumption) and the cooling efficiency:

$$P_{k,m}^{cool}(t) = \frac{Q_{k,m}^{cool}(t)}{COP_{k,m}(t)} + P^{fix} = \frac{P_{k,m}^{IT}(t)\, \eta_{k,m}(t)}{COP_{k,m}(t)} + P^{fix} \tag{26}$$

$$\eta_{k,m}(t) = c - d \cdot COP_{k,m}(t) \tag{27}$$

where $\eta_{k,m}(t)$ is the cooling efficiency coefficient, which is related to $COP_{k,m}(t)$; $c$ and $d$ are fitting coefficients; and $P^{fix}$ is the baseline power consumption of equipment during compressor operation.
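A short sketch of the mechanical cooling model (25)-(27) is given below. The fitting parameters `a`, `b`, `c`, `d` and the fixed power `p_fix` used in the example call are placeholder values chosen for easy arithmetic, not calibrated coefficients.

```python
def mechanical_cooling_power(p_it, t_set, a, b, c, d, p_fix):
    """Cooling power in mechanical mode for one zone.

    p_it  -- IT power of the zone (heat to be removed)
    t_set -- temperature set point
    a, b  -- COP fitting parameters of Eq. (25)
    c, d  -- efficiency fitting coefficients of Eq. (27)
    p_fix -- baseline power while the compressor is running
    """
    cop = a + b * t_set                   # Eq. (25)
    eta = c - d * cop                     # Eq. (27)
    return p_it * eta / cop + p_fix       # Eq. (26)


low = mechanical_cooling_power(100.0, t_set=20.0, a=1.0, b=0.1, c=1.3, d=0.1, p_fix=5.0)
high = mechanical_cooling_power(100.0, t_set=25.0, a=1.0, b=0.1, c=1.3, d=0.1, p_fix=5.0)
print(round(low, 3), round(high, 3))
```

Under these placeholder coefficients, a higher set point raises the COP and lowers the cooling power, at the price of a warmer inlet temperature that must still respect the thermal safety limit.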

To ensure equipment safety, the server air inlet temperature $T_{k,m}^{in}(t)$ must be maintained within the allowable range. Its dynamics are described by the continuous-time heat balance equation:

$$C_{k,m} \frac{d T_{k,m}^{in}(t)}{d t} = P_{k,m}^{IT}(t) - \frac{T_{k,m}^{in}(t) - T_{k}^{out}(t)}{R_{k,m}} - Q_{k,m}^{cool}(t) \tag{28}$$

where $C_{k,m}$ denotes the thermal capacity of the zone and $R_{k,m}$ is the thermal resistance of the zone. In practical control, the discrete form is usually adopted:

$$T_{k,m}^{in}(t+1) = T_{k,m}^{in}(t) + \frac{\Delta t}{C_{k,m}} \left[P_{k,m}^{IT}(t) - \frac{T_{k,m}^{in}(t) - T_{k}^{out}(t)}{R_{k,m}} - Q_{k,m}^{cool}(t)\right] \tag{29}$$

where $\Delta t$ denotes the length of the time interval.

The temperature safety constraint is expressed as follows:

$$T^{\min} \leq T_{k,m}^{in}(t) \leq T^{\max} \tag{30}$$

where $T^{\min}$ and $T^{\max}$ denote the minimum and maximum values of the safe temperature, respectively.
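The discrete thermal update (29) and the safety check (30) translate directly into code. The following sketch uses hypothetical function names and illustrative parameter values.

```python
def step_inlet_temperature(t_in, t_out, p_it, q_cool,
                           heat_capacity, resistance, dt=1.0):
    """One forward-Euler step of the zone heat balance, Eq. (29)."""
    heat_leak = (t_in - t_out) / resistance   # heat lost to the outdoors
    return t_in + dt / heat_capacity * (p_it - heat_leak - q_cool)


def temperature_is_safe(t_in, t_min, t_max):
    """Eq. (30): inlet temperature must stay within the safe band."""
    return t_min <= t_in <= t_max


t_next = step_inlet_temperature(t_in=25.0, t_out=15.0, p_it=100.0, q_cool=90.0,
                                heat_capacity=10.0, resistance=2.0)
print(t_next, temperature_is_safe(t_next, t_min=18.0, t_max=27.0))  # 25.5 True
```

In this example, the cooling removes less heat than the IT load generates minus the outdoor leakage, so the inlet temperature drifts upward by 0.5 per step until the cooling action is increased.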

2.4. Modeling of Cost

The optimization objective of the model is to minimize the total operating cost, which includes the computing cost $C^{IT}$, cooling cost $C^{cool}$, transmission cost $C^{tr}$, and SLA violation cost $C^{SLA}$:

$$\begin{cases} \min J = C^{IT} + C^{cool} + C^{tr} + C^{SLA} \\ C^{IT} = \sum_{t \in T} \sum_{k \in K} C_{k}^{e}(t) P_{k}^{IT}(t) \Delta t \\ C^{cool} = \sum_{t \in T} \sum_{k \in K} C_{k}^{e}(t) P_{k}^{cool}(t) \Delta t \\ C^{tr} = \sum_{t \in T} \sum_{k \in K} \sum_{l \in K,\, l \neq k} \sum_{i \in I_{k}^{arr}(t)} a^{tr} B_{i} \delta_{i,k}(t) z_{i,l} \\ C^{SLA} = a^{SLA} r^{SLA} \end{cases} \tag{31}$$

where $a^{tr}$ and $a^{SLA}$ denote the communication cost coefficient and SLA violation cost coefficient, respectively, and $r^{SLA}$ represents the SLA violation rate.
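The objective (31) can be evaluated for a recorded trajectory as sketched below. The nested-list layout `price[t][k]`, the arrival tuples, and all numbers are assumptions introduced for this example.

```python
def energy_cost(price, power, dt):
    """Sum of price[t][k] * power[t][k] * dt over all time steps and DCs."""
    return sum(c * p * dt
               for price_t, power_t in zip(price, power)
               for c, p in zip(price_t, power_t))


def transmission_cost(arrivals, destination, a_tr):
    """Charge a_tr per unit of bandwidth for every task that leaves its origin DC."""
    return sum(a_tr * bandwidth
               for task_id, origin, bandwidth in arrivals
               if destination[task_id] != origin)


def total_cost(price, p_it, p_cool, dt, arrivals, destination,
               a_tr, a_sla, sla_violation_rate):
    """Eq. (31): computing + cooling + transmission + SLA violation cost."""
    return (energy_cost(price, p_it, dt)
            + energy_cost(price, p_cool, dt)
            + transmission_cost(arrivals, destination, a_tr)
            + a_sla * sla_violation_rate)


price = [[0.1, 0.2], [0.3, 0.1]]          # two time steps, two DCs
p_it = [[100.0, 50.0], [80.0, 60.0]]
p_cool = [[20.0, 10.0], [10.0, 10.0]]
arrivals = [("t1", 0, 5.0), ("t2", 1, 3.0)]
destination = {"t1": 1, "t2": 1}          # only t1 is migrated
print(round(total_cost(price, p_it, p_cool, 1.0, arrivals, destination,
                       a_tr=0.2, a_sla=100.0, sla_violation_rate=0.02), 6))  # 61.0
```

The four terms contribute 50, 8, 1, and 2 respectively in this toy case, which shows how the SLA coefficient sets the exchange rate between energy savings and service quality.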

3. Two-Layer POMG

Based on the computational–physical collaborative optimization model, this section further formulates the high-dimensional, dynamic, and strongly coupled optimization problem as a two-layer POMG. By defining hierarchical agents, state space, action space, and the reward mechanism, it provides a unified decision-making framework for the subsequent H-MADDPG algorithm.

3.1. Agent Architecture and Division of Responsibilities

This paper adopts a two-layer agent architecture to balance the efficiency of global coordination and local control.

GCA: As the central decision-making unit, the GCA is responsible for the top-level decision-making of global scheduling to ensure the optimal overall cost of the system, and its decisions correspond to the variables $z_{i,k}$ in the optimization model.

LDCA: One LDCA is deployed in each DC, and it is responsible for all micro-level decision-making within the DC: (1) subtask scheduling; (2) zone AS state control; and (3) cooling control decision-making.

3.2. Task Priority Ranking

At each time step, the GCA collects information about the tasks arriving at each DC. Let $I^{wait}(t)$ denote the set of pending tasks of the GCA, and $I^{cur}(t)$ denote the set of tasks processed by the GCA at the current time step. Given that the maximum number of tasks that the GCA can process per time step is $N^{G}$, it is necessary to perform priority ranking on $I^{wait}(t)$. The priority score $P_{i}$ of each task $i$ depends on its total computing demand $Com_{i}$ and deadline $D_{i}$. A higher priority score indicates a more urgent task, which should be processed first:

$$P_{i} = \alpha_{1} Com_{i} + \alpha_{2} \frac{1}{D_{i}} \tag{32}$$

where $\alpha_{1}$ and $\alpha_{2}$ are the weights of the task priority score.

The top $N^{G}$ tasks with the highest priority are included in $I^{cur}(t)$, and their spatiotemporal migration is determined by the GCA.

Let $V_{k}^{wait}(t)$ denote the set of pending subtasks in DC $k$, and $V_{k}^{cur}(t)$ denote the set of subtasks processed in the current time step. The maximum number of subtasks that the LDCA can process per time step is $N^{L}$. Similarly, the subtasks are prioritized according to their computing demand and deadline, and the top $N^{L}$ subtasks with the highest priority scores are selected for decision-making.
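The ranking step can be sketched with a top-N selection on the score (32). The tuple layout `(task_id, com, deadline)`, the function name `select_tasks`, and the weights in the example are hypothetical; the same routine would apply to subtasks at the LDCA level with $N^{L}$ in place of $N^{G}$.

```python
import heapq


def select_tasks(waiting, n_max, alpha1, alpha2):
    """Return the ids of the n_max pending tasks with the highest priority score.

    waiting -- list of (task_id, com, deadline) tuples, deadline > 0
    n_max   -- processing limit per time step (N^G or N^L)
    """
    def score(task):
        _, com, deadline = task
        return alpha1 * com + alpha2 / deadline   # Eq. (32)

    return [task[0] for task in heapq.nlargest(n_max, waiting, key=score)]


waiting = [("a", 10.0, 5.0), ("b", 2.0, 1.0), ("c", 6.0, 2.0)]
# Scores with alpha1 = 1, alpha2 = 20: a = 14, b = 22, c = 16.
print(select_tasks(waiting, n_max=2, alpha1=1.0, alpha2=20.0))  # ['b', 'c']
```

Task `b` has the smallest computing demand but the tightest deadline, so it is ranked first once the deadline weight is large enough.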

3.3. State Space

The state observed by the GCA at time step $t$ is denoted as $S^{G}(t) = \{t, J^{cur}(t), R(t)\}$, where $t$ represents the current time step, and $J^{cur}(t) = \{Com_{i}, D_{i}, V_{i} \mid i \in I^{cur}(t)\}$ is the set of task states in the task set $I^{cur}(t)$ processed by the GCA at the current time step, including task computing demand $Com_{i}$, deadline $D_{i}$, and the number of subtasks $V_{i}$. $R(t) = \{C_{k}^{e}(t), T_{k}^{out}(t), \hat{C}_{k}^{e,f}(t), \hat{T}_{k}^{out,f}(t), \bar{T}_{k}^{in}(t), \bar{\mu}_{k}(t), N_{k}^{a}(t) \mid k \in K\}$ denotes the state set of all DCs, including the electricity price $C_{k}^{e}(t)$, outdoor temperature $T_{k}^{out}(t)$, predicted values of electricity price $\hat{C}_{k}^{e,f}(t)$ and outdoor temperature $\hat{T}_{k}^{out,f}(t)$ for the next $f$ time steps, zone-averaged indoor temperature $\bar{T}_{k}^{in}(t)$, zone-averaged frequency $\bar{\mu}_{k}(t)$, and the total number of available cores of the DC $N_{k}^{a}(t)$. Based on the above states, the GCA is trained to identify the urgency level of tasks and make optimal allocation decisions according to task requirements and the real-time states of each DC.

Two policy networks are configured for each LDCA: one is responsible for deciding the start execution time and execution zone of subtasks, which is referred to as the subtask scheduling actor; the other is responsible for regulating the states of each zone within the DC, which is referred to as the resource state regulation actor. Both policy networks take the same state as input but output different actions, which will be elaborated upon later. The state observed by LDCA-k at time step $t$ is denoted as $S_{k}^{L}(t) = \{t, J_{k}^{cur}(t), Z_{k}(t), R_{k}(t)\}$, where $t$ represents the current time step. $J_{k}^{cur}(t) = \{H_{i,v}, Com_{i,v} \mid i \in I_{k}^{tr}(t), v \in V_{k}^{cur}(t)\}$ denotes the set of states of the subtask set $V_{k}^{cur}(t)$ currently processed by LDCA-k, including the subtask level $H_{i,v}$ and subtask computing demand $Com_{i,v}$, where $I_{k}^{tr}(t)$ denotes the set of tasks allocated to DC $k$ up to time step $t$. $Z_{k}(t) = \{T_{k,m}^{in}(t), \mu_{k,m}(t), N_{k,m}^{a}(t) \mid m \in M_{k}\}$ denotes the states of each zone within the DC, including indoor temperature $T_{k,m}^{in}(t)$, frequency $\mu_{k,m}(t)$, and number of available cores $N_{k,m}^{a}(t)$. $R_{k}(t) = \{C_{k}^{e}(t), T_{k}^{out}(t), \hat{C}_{k}^{e,f}(t), \hat{T}_{k}^{out,f}(t)\}$ denotes the state of this DC, including the electricity price $C_{k}^{e}(t)$ and outdoor temperature $T_{k}^{out}(t)$ at the current moment, as well as their predicted values $\hat{C}_{k}^{e,f}(t)$ and $\hat{T}_{k}^{out,f}(t)$ for the next $f$ time steps.

3.4. Action Space

The GCA is responsible for the spatial migration of tasks processed at each time step, i.e., determining which DC the tasks are secondarily allocated to. Its action is defined as

A^{G} (t) = {a_{i}^{t r} | i \in I^{c u r} (t)}

, where

a_{i}^{t r} \in {1, 2, \dots, K}

indicates the DC to which the task

i

in the task set

I^{c u r} (t)

processed by the GCA at the current time step is allocated.

The subtask scheduling actor of LDCA-k is responsible for subtask decision-making, and its action can be expressed as

A_{k}^{L, s t} (t) = {r_{k, i, v}, t_{k, i, v}^{n o r m} | i \in I_{k}^{t r} (t), v \in V_{k}^{c u r} (t)}

. Herein,

r_{k, i, v} \in {1, 2, \dots, M_{k}}

denotes the zone allocated to the subtask

v

in the subtask set

V_{k}^{c u r} (t)

processed by the LDCA-k at the current time step;

t_{k, i, v}^{n o r m} \in [0, 1)

denotes the planned start execution time of the subtask

v

in the subtask set

V_{k}^{c u r} (t)

processed by the LDCA-k at the current time step, which is converted to the actual time step via the mapping formula

t_{k, i, v} = t_{k, i, v}^{\min} + t_{k, i, v}^{n o r m} (t_{k, i, v}^{\max} - t_{k, i, v}^{\min})

.

t_{k, i, v}^{\min}

is the earliest start time, namely the arrival time of the subtask;

t_{k, i, v}^{\max}

is the latest start time, namely the deadline of the subtask. The resource state regulation actor of LDCA-k is responsible for decision-making regarding resource state regulation of each zone, and its action can be expressed as

A_{k}^{L, r e} (t) = {Δ a_{k, m} (t), Δ f_{k, m} (t), Δ s_{k, m} (t) / Δ T_{k, m}^{s e t} (t) | m \in M_{k}}

. Herein,

Δ a_{k, m} (t)

,

Δ f_{k, m} (t)

,

Δ s_{k, m} (t)

and

Δ T_{k, m}^{s e t} (t)

denote the variation in the active core ratio, frequency scaling factor, fan speed, and set temperature, respectively. When in the free cooling mode, the action output by the policy network is

Δ s_{k, m} (t)

; when in the mechanical cooling mode, the output action is

Δ T_{k, m}^{s e t} (t)

.
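As a minimal sketch of the start-time mapping above, the following Python function converts a normalized actor output in [0, 1) into a time step. Flooring to a whole time step and the function name `map_start_time` are assumptions made here for illustration; the text does not state how fractional values are discretized.

```python
import math

def map_start_time(t_norm: float, t_min: int, t_max: int) -> int:
    """Map a normalized start time in [0, 1) onto the window [t_min, t_max)."""
    if not 0.0 <= t_norm < 1.0:
        raise ValueError("t_norm must lie in [0, 1)")
    if t_max < t_min:
        raise ValueError("t_max must not precede t_min")
    # t = t_min + t_norm * (t_max - t_min), floored to a whole step (assumption).
    return t_min + math.floor(t_norm * (t_max - t_min))
```

Because the normalized value never reaches 1, the mapped step stays strictly below the latest start time whenever the window is non-empty.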

3.5. Reward Function

The reward function

r^{G} (t)

of the GCA consists of the global electricity cost reward

r^{e} (t)

, the global SLA compliance reward

r^{S L A} (t)

, the transmission bandwidth violation penalty

r^{b a n d} (t)

, and the load balancing reward

r^{b a l a n c e} (t)

:

r^{G} (t) = ω_{1} r^{e} (t) + ω_{2} r^{S L A} (t) + ω_{3} r^{b a n d} (t) + ω_{4} r^{b a l a n c e} (t)

(33)

r^{e} (t) = 1 - \frac{C^{e} (t)}{C^{\max}}

(34)

r^{SLA} (t) = \begin{cases} 0, & \text{if } s_{i, v} > D_{i} \\ 1, & \text{otherwise} \end{cases}

(35)

r^{band} (t) = \begin{cases} 0, & \text{if } B (t) > B^{\max} \\ 1, & \text{otherwise} \end{cases}

(36)

r^{balance} (t) = - \sum_{k \in K} {\left( \frac{P_{k}^{total}}{P_{k}^{\max}} - \frac{\sum_{k^{'} \in K} \frac{P_{k^{'}}^{total}}{P_{k^{'}}^{\max}}}{K} \right)}^{2}

(37)

where

C^{e} (t)

is the total electricity cost at time step

t

and

C^{\max}

is the maximum observed cost in the training buffer;

B (t)

is the bandwidth requirement of tasks that violate the link capacity; and

B^{\max}

is the total link capacity.

The weights

ω_{1} ~ ω_{4}

are set to

[0.5, 0.3, 0.1, 0.1]

based on a preliminary grid search over the ranges

ω_{1} \in [0.3, 0.7]

,

ω_{2} \in [0.2, 0.5]

,

ω_{3} \in [0.05, 0.2]

.

The reward function

r_{k}^{L} (t)

of the LDCA-k comprises the local electricity cost reward

r_{k}^{e} (t)

, the local SLA compliance reward

r_{k}^{S L A} (t)

, and the physical resource safety reward

r_{k}^{r e} (t)

, where the latter includes temperature safety and core occupancy rate non-violation constraints.

r_{k}^{L} (t) = ω_{5} r_{k}^{e} (t) + ω_{6} r_{k}^{S L A} (t) + ω_{7} r_{k}^{r e} (t)

(38)

r_{k}^{r e} (t) = - \sum_{m \in M_{k}} (\frac{T_{k, m}^{i n} (t)}{T^{\max}} + U_{k, m} (t))

(39)

The definitions of

r_{k}^{e} (t)

and

r_{k}^{S L A} (t)

are analogous to those of the corresponding GCA terms. The weights

ω_{5} ~ ω_{7}

are set to

[0.4, 0.3, 0.3]

.

4. Algorithm Design

4.1. Principle of the Classical MADDPG Algorithm

The MADDPG algorithm provides a fundamental framework for addressing the non-stationarity problem in multi-agent environments. In the training phase, a centralized critic network that can access the state and action information of all agents is introduced for each agent. In the execution phase, each agent makes distributed decisions solely based on its local observations [42].

In the multi-agent setting, the goal of agent

i

is to maximize its expected discounted cumulative reward:

J (π_{i}) = E [\sum_{t \in T} γ^{t} r_{i} (t)]

(40)

where

π_{i}

denotes the policy of agent

i

,

γ \in [0, 1]

is the discount factor that balances the importance of immediate and future rewards, and

r_{i} (t)

represents the immediate reward obtained by the agent at time step

t

.
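The quantity inside the expectation of Equation (40) can be computed for a single episode as in the short sketch below. The function name is hypothetical, and the expectation itself would in practice be approximated by averaging over many episodes.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r(t) over one episode (inner term of Eq. (40))."""
    total = 0.0
    # Accumulate backwards so that each step costs one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```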

The MADDPG algorithm adopts a collaborative architecture of the centralized critic and distributed actor, and its parameter update logic is divided into two parts: the centralized critic network update and the distributed actor network update.

The core function of the centralized critic network is to evaluate the global value of the joint state–joint action. Its parameters

ϕ

are updated by minimizing the temporal difference mean-squared error between the current Q-value and the target Q-value, so as to optimize the accuracy of value assessment.

The target Q-value is jointly calculated by the target actor network and the target critic network:

y_{i} = r_{i} + γ Q^{ϕ^{'}} (s^{'}, a_{1}^{'}, \dots a_{N}^{'}) |_{a_{j}^{'} = μ^{θ_{j}^{'}} (o_{j})}

(41)

where

s^{'}

denotes the next global state,

r_{i}

is the current global reward,

γ

is the discount factor,

a_{j}^{'} = μ^{θ_{j}^{'}} (o_{j})

represents the action generated by the target actor network of agent

j

(with parameters

θ_{j}^{'}

) based on its local observation

o_{j}

, and

Q^{ϕ^{'}}

is the function of the target critic network (with parameters

ϕ^{'}

) that evaluates the future value corresponding to the next state and target action.

The loss function can be expressed as:

L (ϕ) = E_{(s, a, r, s^{'}) ~ D} [{(y_{i} - Q^{ϕ} (s, a_{1}, \dots, a_{N}))}^{2}]

(42)

where

D

denotes the experience replay buffer, stored in the form of

(S (t), A (t), r^{t o t a l} (t), S (t + 1), d o n e (t))

, where

s

is the global state,

a = [a_{1}, \dots a_{N}]

is the joint action of all agents, and

Q^{ϕ} (s, a_{1}, \dots, a_{N})

is the value assessment of the online state and joint action performed by the current critic network (with parameters

ϕ

).

The parameter update of the online critic network is completed by minimizing the aforementioned loss via gradient descent:

ϕ \leftarrow ϕ - η_{Q} \cdot \nabla_{ϕ} L (ϕ)

, where

η_{Q}

is the learning rate of the critic network.
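A minimal sketch of Equations (41) and (42) is given below, with the critic and the target actors passed in as plain Python callables. Several details are assumptions made for illustration: the transition layout (including a separate list of next-step observations), evaluating the target actors on next-step observations as is usual for MADDPG, and dropping the bootstrap term when `done` is set.

```python
def td_targets(batch, target_critic, target_actors, gamma):
    """Eq. (41): y = r + gamma * Q'(s', a'_1..a'_N), no bootstrap at terminal steps."""
    targets = []
    for s, a, r, s_next, obs_next, done in batch:
        # Each target actor maps its own local observation to an action.
        next_actions = [mu(o) for mu, o in zip(target_actors, obs_next)]
        bootstrap = 0.0 if done else target_critic(s_next, next_actions)
        targets.append(r + gamma * bootstrap)
    return targets

def critic_loss(batch, critic, targets):
    """Eq. (42): mean squared TD error over the sampled batch."""
    errors = [(y - critic(s, a)) ** 2 for (s, a, *_), y in zip(batch, targets)]
    return sum(errors) / len(errors)
```

In a full implementation the loss would be differentiated with respect to the critic parameters by an automatic differentiation library; the sketch only shows how the scalar loss is assembled.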

The actor network of each agent is updated using the deterministic policy gradient. Its objective is to maximize the Q-value estimated by the critic network, thereby improving its own policy. The expression for agent

i

is given by:

\nabla_{θ_{i}} J (μ^{θ_{i}}) \approx E_{s, a ~ D} [\nabla_{θ_{i}} μ^{θ_{i}} (o_{i}) \nabla_{a_{i}} Q^{ϕ} (s, a_{1}, \dots, a_{N})]

(43)

where

μ^{θ_{i}} (o_{i})

denotes the action generated by the online actor network (with parameters

θ_{i}

) of agent

i

based on its local observation

o_{i}

.

To adapt to the gradient descent optimization logic, the objective of maximizing the Q-value is transformed into a loss minimization problem, whose loss function is defined as follows:

L (θ_{i}) = - E_{(s, a, r, s^{'}) ~ D} [Q^{ϕ} (s, a_{1}, \dots, a_{N})]

(44)

The parameter update of the online actor network is achieved by minimizing the aforementioned loss via gradient descent:

θ_{i} \leftarrow θ_{i} - η_{A c t o r} \cdot \nabla_{θ_{i}} L (θ_{i})

, where

η_{A c t o r}

denotes the learning rate of the actor network.
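The chain rule in Equation (43) and the resulting update can be illustrated on a toy scalar problem. Everything about the toy is an assumption chosen so that the gradients can be written by hand: a linear actor μ(o) = θ·o and a quadratic critic Q(s, a) = −(a − s)², neither of which comes from the paper.

```python
def actor_gradient(theta, batch):
    """Sample estimate of Eq. (43) for a toy scalar agent.

    Toy assumptions: actor mu(o) = theta * o, critic Q(s, a) = -(a - s)**2.
    """
    grad = 0.0
    for s, o in batch:
        a = theta * o                 # action from the online actor
        dq_da = -2.0 * (a - s)        # gradient of the critic w.r.t. the action
        dmu_dtheta = o                # gradient of the actor w.r.t. theta
        grad += dmu_dtheta * dq_da    # chain rule of Eq. (43)
    return grad / len(batch)

def actor_step(theta, batch, lr):
    """Descend L(theta) = -Q, which is the same as ascending the critic's value."""
    return theta + lr * actor_gradient(theta, batch)
```

Repeating `actor_step` drives the toy actor toward the action the critic rates highest, which is the behavior the loss in Equation (44) is designed to produce.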

To ensure training stability, target networks (with parameters denoted as

ϕ^{'}

and

θ_{i}^{'}

) are constructed for all critic/actor networks, and a soft update strategy is adopted to synchronize the parameter updates of the target networks:

ϕ^{'} = τ ϕ + (1 - τ) ϕ^{'}

(45)

θ_{i}^{'} = τ θ_{i} + (1 - τ) θ_{i}^{'}

(46)

where τ denotes the soft update coefficient, which balances the stability and tracking flexibility of the target network.
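The soft update of Equations (45) and (46) amounts to an element-wise interpolation between online and target parameters. The sketch below represents parameters as flat lists of floats, which is a simplification for illustration.

```python
def soft_update(target_params, online_params, tau):
    """Eqs. (45)-(46): target <- tau * online + (1 - tau) * target, element-wise."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

A small τ makes the target network change slowly, while τ = 1 would copy the online network outright.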

4.2. H-MADDPG: Network Architecture Design for Hierarchical Decision-Making

Traditional MADDPG employs a single-layer architecture with homogeneous agents, where each agent independently makes decisions based on local observations while sharing a centralized critic. However, this architecture faces three key difficulties when applied to DC optimization: (1) the coupling between global task allocation and local equipment control; (2) the coexistence of discrete (task scheduling) and continuous (frequency, cooling) action spaces; and (3) the dynamically varying action dimensions caused by uncertain task arrivals.

To address these challenges, the proposed H-MADDPG introduces three core innovations, which are summarized in Table 2.

Specifically, the GCA handles sparse, high-level task allocation decisions, while each LDCA manages dense, low-level subtask scheduling and real-time equipment regulation. This hierarchical decomposition directly mirrors the natural structure of data center operations and reduces the joint action space complexity from exponential to polynomial.

The network architecture of H-MADDPG mainly consists of the upper-level GCA task allocation actor, the lower-level LDCA subtask scheduling actor, the resource state regulation actor, and a centralized critic, thus enabling end-to-end learning from global task allocation down to local resource regulation. The overall architecture is illustrated in Figure 2: (a) Centralized training phase: The global centralized critic network collects joint states

S

and joint actions

A

from all agents (GCA and LDCAs) to compute the global Q-value. This critic is only used during training to provide gradient signals to the actor networks. (b) Decentralized execution phase: Each agent (GCA or LDCA) makes decisions solely based on its local observations using its trained actor network. The GCA determines cross-DC task migration, while each LDCA handles subtask scheduling and equipment regulation within its DC. No communication among agents is required during execution. Algorithm 1 presents the algorithm flow of H-MADDPG.

Algorithm 1: H-MADDPG

Input: Initial parameters of each network, experience replay buffer parameters, hyperparameters
Output: Task allocation policy, subtask scheduling policy, resource regulation policy

1: Initialize parameters
2: for episode = 1 to 1000 do
       Reset environment
3:     while $t < T$ do
4:         Upper layer: the GCA outputs $A^{G} (t)$ via the online GCA actor network
5:         Lower layer: each LDCA outputs $A_{k}^{L, st} (t)$ and $A_{k}^{L, re} (t)$ via its online LDCA dual-actor network
6:         Environment interaction
7:         Store the sample $(S (t), A (t), r^{total}, S (t + 1), done (t))$ in $D$
8:         $t \leftarrow t + 1$
9:     end while
10:    if $| D | \geq 64$ then
11:        Sample a batch of samples $B$ from $D$
12:        Calculate the target Q-value: $y_{i} = r_{i} + γ Q^{ϕ^{'}} (s^{'}, a_{1}^{'}, \dots, a_{N}^{'})$
13:        Online central critic network update: $ϕ \leftarrow ϕ - η_{Q} \cdot \nabla_{ϕ} L (ϕ)$
14:        Online GCA actor network update: $θ^{G} \leftarrow θ^{G} - η^{G} \cdot \nabla_{θ^{G}} L (θ^{G})$
15:        Online LDCA dual-actor network update: $θ_{k}^{L, st} \leftarrow θ_{k}^{L, st} - η_{k}^{L, st} \cdot \nabla_{θ_{k}^{L, st}} L (θ_{k}^{L, st})$; $θ_{k}^{L, re} \leftarrow θ_{k}^{L, re} - η_{k}^{L, re} \cdot \nabla_{θ_{k}^{L, re}} L (θ_{k}^{L, re})$
16:        Target network update: $ϕ^{'} \leftarrow τ ϕ + (1 - τ) ϕ^{'}$; $θ^{G^{'}} \leftarrow τ θ^{G} + (1 - τ) θ^{G^{'}}$; $θ^{L, st^{'}} \leftarrow τ θ^{L, st} + (1 - τ) θ^{L, st^{'}}$; $θ^{L, re^{'}} \leftarrow τ θ^{L, re} + (1 - τ) θ^{L, re^{'}}$
17:    end if
18: end for
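The storage and sampling steps of Algorithm 1 (storing each transition in D and updating only once D holds at least 64 samples) can be sketched with a small replay buffer. The class name, the capacity argument, and uniform sampling without replacement are assumptions for illustration; only the threshold of 64 is taken from the algorithm.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (S, A, r, S_next, done) transitions."""

    def __init__(self, capacity, min_size=64, seed=None):
        self.data = deque(maxlen=capacity)   # oldest samples are evicted first
        self.min_size = min_size             # threshold |D| >= 64 from Algorithm 1
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.data.append((state, action, reward, next_state, done))

    def ready(self):
        """True once enough samples are stored to start network updates."""
        return len(self.data) >= self.min_size

    def sample(self, batch_size):
        # Uniform sampling without replacement (assumed; not specified in the text).
        return self.rng.sample(list(self.data), batch_size)
```

A training loop would call `store` after every environment interaction and guard the update block with `ready()`.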

4.2.1. Upper Layer: GCA Task Allocation Actor Network

The GCA output is a probability distribution over target DCs for each task. Since the original MADDPG is designed for continuous actions, we adopt the Gumbel–Softmax reparameterization trick to make discrete action sampling differentiable during training. Specifically, the GCA actor outputs logits

\log p_{i, k}

for each task–DC pair, and the sampled action is:

a_{i, k} = \frac{\exp ((\log p_{i, k} + g_{i, k}) / τ)}{\sum_{k^{'}} \exp ((\log p_{i, k^{'}} + g_{i, k^{'}}) / τ)}

(47)

where

g_{i, k} \sim Gumbel (0,1)

is the independent Gumbel noise, and

τ

is the temperature parameter (annealed from 1.0 to 0.1 during training). The action mask sets

\log p_{i, k} = - \infty

for invalid DCs (e.g., those that cannot receive tasks due to bandwidth constraints). During inference (execution), we replace the Gumbel–Softmax with deterministic argmax: the task is allocated to the DC with the highest probability.
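The sampling rule of Equation (47), the mask for invalid DCs, and the argmax rule used at execution time can be sketched in plain Python as follows. This is an illustrative sketch for a single task, not the authors' code: the function names and the boolean `valid` list are hypothetical, DC indices are zero-based here, and clamping the uniform draw away from 0 and 1 is a numerical safeguard added for this example.

```python
import math
import random

def gumbel_softmax(log_probs, temperature, valid, rng=random):
    """Relaxed one-hot sample over DCs for one task, following Eq. (47).

    log_probs: per-DC log-probabilities; valid: booleans, False masks a DC out.
    """
    if not any(valid):
        raise ValueError("at least one DC must be valid")
    scores = []
    for lp, ok in zip(log_probs, valid):
        if not ok:
            scores.append(-math.inf)          # action mask: log p = -inf
            continue
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))           # Gumbel(0, 1) noise
        scores.append((lp + g) / temperature)
    top = max(scores)                         # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_allocation(log_probs, valid):
    """Execution-time rule: pick the valid DC with the highest probability."""
    best = max((lp, k) for k, (lp, ok) in enumerate(zip(log_probs, valid)) if ok)
    return best[1]
```

Lower temperatures push the relaxed sample closer to a one-hot vector, which is why annealing toward 0.1 brings training-time samples closer to the argmax used at execution time.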

The GCA task allocation actor

μ^{θ^{G}}

is responsible for receiving the task arrival information of the entire network and determining the cross-DC task allocation targets. Its key design lies in mapping a variable number of tasks to fixed-dimensional action outputs.

Input Layer: This dimension is consistent with that of the GCA state space, with the input being

S^{G} (t)

.

Output Layer and Action Masking Mechanism: To address the uncertainty of the number of arriving tasks, an action masking mechanism is added to ensure that the output dimension remains fixed at

N^{G} \times K

(where

N^{G}

is the maximum number of parallel tasks of the GCA and

K

is the number of DCs). The GCA actor outputs the probability distribution of task allocation to each DC. For the time step with the actual number of tasks

N_{t} < N^{G}

, an action mask is introduced:

{[logits]}_{i, k} = \begin{cases} {[net\_output]}_{i, k}, & \text{if } i \leq N_{t} \\ - \infty, & \text{otherwise} \end{cases}

(48)

π^{G} (a_{i} = k | s) = \frac{\exp ({[logits]}_{i, k})}{\sum_{k^{'} \in K} \exp ({[logits]}_{i, k^{'}})}

(49)

where

N_{t}

denotes the actual number of tasks at time step

t

,

{[logits]}_{i, k}

is the masked logit (unnormalized score) for allocating task

i

to DC

k

, and

π^{G} (a_{i} = k | s)

represents the probability that task

i

is allocated to DC

k

.
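A minimal sketch of the variable-dimension mask in Equations (48) and (49) is shown below for a fixed-size output with N^G task slots and K DCs. The function name is hypothetical. One detail is an assumption added here: a softmax over a row whose logits are all −∞ is undefined, so padded slots are returned as all-zero rows and are meant to be ignored downstream.

```python
import math

def masked_allocation_probs(net_output, num_tasks):
    """Apply the mask of Eq. (48), then the row-wise softmax of Eq. (49).

    net_output: N_G rows (task slots) by K columns (DCs) of raw actor outputs.
    Rows at index num_tasks and beyond are padding and come back as zeros.
    """
    probs = []
    for i, row in enumerate(net_output):
        if i >= num_tasks:                      # padded slot: logits are -inf
            probs.append([0.0] * len(row))
            continue
        top = max(row)                          # subtract the max for stability
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs
```

The output always has the same shape regardless of how many tasks actually arrived, which is what allows a fixed-size actor network to handle a varying task count.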

4.2.2. Lower Layer: LDCA Dual-Actor Network

Each DC deploys a subtask scheduling actor and a resource state regulation actor, which share the LDCA state

S_{k}^{L} (t)

and realize two-layer interaction by receiving allocation instructions from the GCA.

Subtask scheduling actor (

μ_{k}^{θ^{L, s t}}

): This actor receives the tasks allocated by the GCA and determines the subtask zone allocation and execution time scheduling.

Input Layer: This dimension is consistent with that of the LDCA state space, with the input being

S_{k}^{L} (t)

.

Output Layer:

Zone allocation branch: This dimension is

N^{L} \times M_{k}

(where

N^{L}

is the maximum number of parallel subtasks of the LDCA and

M_{k}

is the number of zones), which is activated to output the probability of subtasks being allocated to each zone.

Start execution time branch: This dimension is

N^{L} \times 1

, with the output normalized to a continuous execution time within

[0, 1)

.

Resource state regulation actor (

μ_{k}^{θ^{L, r e}}

): Based on the subtask scheduling results, this actor dynamically adjusts the core and frequency states of zone servers as well as the cooling set temperature.

Input Layer: This layer shares the LDCA state vector with the subtask scheduling actor, with the input being

S_{k}^{L} (t)

.

Output Layer: This dimension is

M_{k} \times 3

, which outputs the action increment.

4.2.3. Global Centralized Critic Network

The global centralized critic network

Q^{ϕ}

is used to integrate the joint state and action information of the two-layer agents, evaluate the global decision value, and provide gradient signals for the actor networks.

Input Layer: This dimension is the sum of the dimensions of the joint state and joint action, with the input being

S (t) = [S^{G} (t), S_{1}^{L} (t), \dots, S_{K}^{L} (t)]

and

A (t) = [A^{G} (t), A_{1}^{L, s t} (t), A_{1}^{L, r e} (t), \dots, A_{K}^{L, s t} (t), A_{K}^{L, r e} (t)]

.

Output Layer: The output dimension is 1; this layer outputs the global Q-value

Q (t, S (t), A (t))

, which represents the expected cumulative reward under the joint state–action pair.

5. Case Study

5.1. Parameter Settings

The stochastic elements include workload arrivals, electricity prices, and ambient temperature, which are modeled based on real-world data traces and standard stochastic processes [43,44,45]. This paper sets up three heterogeneous DCs, and the configuration parameters of each DC are provided in Table 3. The electricity price and ambient temperature used during the testing phase are shown in Figure 3. In the assumptions of this study, long-term power purchase agreements for DCs are not considered. The algorithm parameters are presented in Table 4. All experiments are carried out under the two typical operation scenarios mentioned in the introduction, where DCs adjust their power consumption according to real-time electricity price signals and take part in demand response as valuable flexible resources.

To ensure the reproducibility of this case study, all key parameters are specified as follows. Electricity price traces are sourced from the U.S. Energy Information Administration (EIA) for three ISO regions (PJM, CAISO, and ERCOT) covering January to December 2024, while ambient temperature data are obtained from the NOAA Global Historical Climatology Network (GHCN-Daily) for three corresponding cities (Chicago, Los Angeles, and Houston) with hourly averaging. Workload arrivals follow a Poisson process with a mean rate of 120 tasks per hour, split among the three DCs proportionally to their core counts. Task computing demand follows a lognormal distribution with a mean of 500 core·GHz·ms and a standard deviation of 200. The number of subtasks per task is uniformly distributed between one and eight, and the DAG structure of each task is generated using the Gaussian elimination method with a dependency density of 0.3, based on the Alibaba Cluster Trace V2018. The bandwidth capacity between any two DCs is set to 1 Gbps (symmetric), with a communication cost coefficient of 0.005 $/MB and an SLA violation penalty coefficient of 50 $/violation. For the cooling system, the COP fitting coefficients are calibrated from a real CRAC unit as

a = 6.0

,

b = 0.02

,

c = 0.1

, and

d = 5.0

; the rated fan power per zone is 15 kW, and the safe indoor temperature range is set to [18 °C, 27 °C]. All stochastic processes are initialized with a random seed of 42, and each experimental configuration is run over 10 independent trials with results reported as mean values.
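The workload settings above can be turned into a small generator, sketched here with the Python standard library only. Several points are assumptions rather than details from the paper: the mean of 500 and standard deviation of 200 are taken to describe the demand itself (not its logarithm), Poisson arrivals are produced from exponential inter-arrival times, and the function names are hypothetical. The split of arrivals among DCs by core count and the DAG generation are omitted.

```python
import math
import random

def lognormal_params(mean, std):
    """Convert the mean/std of a lognormal variable to (mu, sigma) of its log."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)

def sample_hour(rng, rate_per_hour=120.0, mean_demand=500.0, std_demand=200.0):
    """Draw one hour of task arrivals as (arrival_time_h, demand, n_subtasks)."""
    mu, sigma = lognormal_params(mean_demand, std_demand)
    tasks, t = [], rng.expovariate(rate_per_hour)
    # Exponential inter-arrival times with a fixed rate give a Poisson process.
    while t < 1.0:
        demand = rng.lognormvariate(mu, sigma)   # core*GHz*ms
        n_subtasks = rng.randint(1, 8)           # uniform on {1, ..., 8}
        tasks.append((t, demand, n_subtasks))
        t += rng.expovariate(rate_per_hour)
    return tasks

rng = random.Random(42)                          # seed value taken from the text
hours = [sample_hour(rng) for _ in range(200)]
mean_count = sum(len(h) for h in hours) / len(hours)
```

Averaged over many simulated hours, `mean_count` should sit close to the stated rate of 120 tasks per hour.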

5.2. Decision Result Analysis

To verify the superiority of the method proposed in this paper, comparative experiments were designed for result analysis:

Case 1: The proposed method.

Case 2: Double-layer scheduling without physical regulation. The hierarchical multi-agent architecture is retained, focusing on the global spatiotemporal optimization of computing tasks while physical parameters are fixed.

Case 3: Single-layer local scheduling. The GCA and the inter-DC migration mechanism are removed, and the LDCA only optimizes the locally arrived tasks, without any physical parameter regulation.

Case 4: Rule-based static scheduling. After the arrival of tasks, they are evenly allocated to each region inside the DC and executed immediately.

Table 5 presents the SLA violation rate, average overtime duration, total cost, and PUE under the four cases. First, all four cases completed the computing tasks without violating the SLA, which indicates that the designed scheduling methods can meet the basic service requirements. Second, in terms of total cost, Case 3, Case 2, and Case 1 reduce the cost relative to Case 4 by 1.77%, 5.12%, and 36.19%, respectively. This shows that costs can be reduced through temporal shifting of computing tasks, spatial migration across DCs, and adjustment of physical equipment parameters, although the magnitude of the reduction differs significantly among the methods. Compared with Case 4, Case 3 only delays computing tasks in time and migrates them between regions within the DC, enabling task execution to avoid peak electricity price periods and optimize the utilization of regional resources. On the basis of Case 3, Case 2 enables cross-DC task migration. Due to the significant differences in electricity prices and temperatures between different DCs, migrating computing tasks to DCs with low electricity prices and low temperatures can prevent a sharp increase in cooling costs caused by excessive IT equipment power, thereby reducing the total cost. Case 1 further adjusts the cores and frequencies of AS on the basis of Case 2. By combining the spatiotemporal adjustment of computing tasks with the adjustment of physical equipment operating parameters, Case 1 achieves the lowest cost, reflecting the advantages of its collaborative hierarchical architecture. Finally, in terms of PUE, Cases 2–4 do not adjust the operating parameters of physical equipment and at most shift or migrate computing tasks, resulting in higher PUE values. In contrast, Case 1 can adjust AS and cooling equipment, which not only reduces the total power but also effectively restricts cooling energy consumption, thereby lowering the PUE.

Figure 4 shows the task execution status under the four cases. Each color in the figure represents one DC, and a darker color indicates a larger number of tasks being executed. In Case 4, computing tasks are evenly allocated to each region upon arrival and executed with zero delay. Tasks are therefore concentrated in the early stage of the cycle, and many tasks run during the peak electricity price period, whereas few tasks run during the low electricity price period after 21:00, which leads to high electricity costs. In Case 3, tasks are executed with delay and migrated between regions within the DC, and most computing tasks are executed during the low electricity price period (00:00–08:00). However, due to the large number of arriving tasks and the small capacity, high electricity price, and high temperature of DC3, there is almost no room for delay, and a large number of computing tasks are executed throughout the scheduling cycle, which limits the adjustment capability of this strategy. In Case 2, a large number of tasks are migrated to DC2, which has low electricity prices and low temperatures, and are concentrated in the low electricity price period (00:00–08:00) for execution. This reduces the electricity cost to a certain extent but leads to unbalanced load distribution: DC3 only executes fixed loads, resulting in serious core idleness. In contrast, Case 1 avoids load imbalance and achieves cost reduction through load balancing and physical equipment adjustment. The cores and frequency of AS match the tasks being executed, and the cooling power consumption is coordinated with the IT power consumption, ensuring that the temperature does not exceed the limit.

Figure 5 shows the total power of each DC under the four cases. In every DC, the total power is lowest in Case 1, which reduces costs by combining the spatiotemporal adjustment of computing tasks with the adjustment of physical equipment parameters. In Case 2, due to the excessive migration of computing tasks to DC2, the power of DC2 is extremely high, while DC3 always operates with a fixed load, resulting in extremely low power. This leads to a high degree of load imbalance. The power of Case 3 is similar to that of Case 4, but Case 3 avoids excessive load during peak periods by delaying tasks, for example in DC3 during 08:00–10:00 and 18:00–21:00.

Figure 6 presents the core occupancy rate of servers within each DC under the four cases. It is evident that the core occupancy rate in Case 1 is concentrated between 60% and 90%. This range not only reduces idle cores to lower IT power consumption but also ensures that there are standby cores to cope with the sudden arrival of new tasks, achieving a balance between resource utilization and service stability. In Case 2, DC1 and DC3 are allocated a small number of tasks, but all cores remain active, resulting in low core occupancy rates and significant resource waste. In Case 3 and Case 4, the core occupancy rate is related to the number of arriving tasks. In this experiment, the number of computing tasks arriving at the three DCs is set to be nearly the same. However, due to differences in DC capacity, the core occupancy rate of DC1 is the lowest, followed by DC2, and DC3 has the highest core occupancy rate.

Figure 7 shows the box plots of indoor temperature in each DC under the four cases. The effect of different scheduling strategies on indoor temperature can be clearly observed. The temperature distribution of the three DCs under Case 1 is the most concentrated and stable, with the core range focusing on 23–25 °C and the smallest average interquartile range. By dynamically adapting the operating parameters of cooling equipment, it can respond to real-time changes in IT power consumption, effectively suppress temperature fluctuations, and achieve precise temperature control. In contrast, Cases 2–4 exhibit obvious shortcomings in temperature regulation performance. While their temperatures in DC1 and DC3 fall within a reasonable range, DC2 suffers from excessively low temperatures, leading to unnecessary waste of cooling power consumption. This is because such strategies do not dynamically adjust the parameters of cooling equipment; the fan speed is fixed at a high level during natural cooling periods. When the heat generation of IT equipment is mismatched with cooling capacity, over-cooling becomes prominent. Notably, the temperature of DC2 in Case 2 is slightly higher than that in Cases 3 and 4. The core reason is that Case 2 allocates more computing tasks to DC2, and the increase in IT power consumption partially offsets the impact of over-cooling, resulting in a relative rise in temperature.

5.3. Sensitivity Analysis

5.3.1. Differences in Regulation Results Under Different Seasons

Different seasons require distinct regulation strategies for DCs. This paper analyzes the regulatory differences across three seasonal categories: spring/autumn, summer, and winter. Figure 8 depicts the total cost and PUE of the four scenarios under three seasonal conditions, which intuitively reflects the adaptability of each scheduling strategy to seasonal changes. The horizontal axis of the figure represents different seasons, while the vertical axis is divided into two parts to respectively display the total cost and PUE, realizing the synchronous comparison of the two key indicators. In terms of seasonal variation characteristics, summer has the highest total cost and PUE in all four cases, followed by spring/autumn, and winter has the lowest. This is mainly because the high ambient temperature in summer increases the load of the cooling system, leading to a significant rise in cooling power consumption and further increasing the total cost and PUE. In winter, the low ambient temperature allows most DCs to adopt natural cooling, which greatly reduces cooling energy consumption, thus reducing the total cost and PUE.

Figure 9 illustrates the power consumption of each DC under different seasons. From the perspective of cooling power, the proportion of cooling power in the total power is the highest in summer, followed by spring/autumn, and the lowest in winter. This difference stems from the dominant role of seasonal ambient temperature in cooling modes: the outdoor temperature in summer is significantly high, so each DC needs to operate mechanical cooling throughout the day, and the continuous operation of compressors leads to a surge in cooling power consumption. In contrast, the outdoor temperature in winter is low, and natural cooling can be achieved only through fan-driven heat exchange between indoor and outdoor environments without starting compressors, resulting in a substantial reduction in cooling power consumption. In terms of IT power allocation, the IT power in spring/autumn and summer is evenly distributed among the three DCs. Given that the ambient temperature is moderate in spring/autumn and the cooling load is already high in the hot summer environment, excessive concentration of IT power in a single DC would lead to a sudden increase in local IT power consumption, which in turn triggers a synchronous rise in cooling energy consumption and ultimately drives up the total cost. However, the IT power allocation in winter adopts a differentiated strategy: since winter can fully rely on natural cooling, there is no need to worry about the surge in cooling power consumption caused by excessively high IT power. Therefore, most tasks are prioritized to DC1 and DC2 with lower electricity prices, while DC3, with the highest electricity price, only undertakes a small number of tasks. Through the differentiated configuration of electricity prices, the total cost is significantly reduced.

Figure 10 presents the box plots of server state regulation of each DC under different seasons, which focuses on reflecting the regulation effect on server state and adaptability to seasonal changes. Each subgraph corresponds to one DC, and the box plots of active core ratio and frequency scaling factor are distinguished by different colors. In summer, each DC tends to reduce the active core ratio and frequency scaling factor to minimize additional power consumption. This is reflected in the fact that the median frequency scaling factor of the three DCs is the lowest in summer, and the median active core ratio of DC1 and DC2 is also the lowest in summer. Notably, the active core ratio and frequency scaling factor of DC2 are consistently higher than those of DC1 and DC3. The core reason is that DC2 has a lower electricity price, allowing it to appropriately activate more cores and increase the operating frequency to ensure fast task execution and avoid task accumulation or timeout.

Figure 11 shows the variation curve of the indoor temperature of each DC under different seasons. In spring/autumn, the indoor temperature of each DC is stably maintained at 23–26 °C; in summer, the indoor temperature ranges from 24 to 27 °C, showing a slight upward trend overall; and in winter, the indoor temperature is the lowest, ranging from 18 to 25 °C, and exhibits a slow downward trend throughout the whole period. The winter behavior arises because the outdoor ambient temperature is significantly low, and the proportion of natural cooling duration in each DC is greatly increased. It is only necessary to drive heat exchange between indoor and outdoor environments through fans to maintain the temperature above the safe lower limit without additional mechanical heating assistance. The temperature in spring/autumn is the most stable. The core reason is that the outdoor ambient temperature in this season is moderate, which is highly consistent with the optimal operating temperature range of DCs. The cooling system does not need to frequently switch operating modes, and natural cooling combined with short-term mechanical cooling can balance the heat generation of IT equipment. The temperature in summer shows an upward trend. This is because the outdoor high temperature lasts for a long time in summer, and each DC needs to operate mechanical cooling for an extended period. The outdoor temperature reaches its peak from noon to evening, resulting in a decrease in the heat dissipation efficiency of the cooling system. The resulting imbalance between cooling capacity and heat generation causes the indoor temperature to rise gradually.

5.3.2. Impact of Learning Rate on Training Results

The learning rate is the step-size coefficient for network updates and affects both the convergence efficiency and the stability of training. Its value is varied to study the impact on convergence, training stability, and the final regulation effect. Figure 12 shows the impact of the learning rate on the training process. Training converges at around 700 iterations for all tested learning rates, but the converged reward is highest when the learning rate is 1 × 10⁻⁵.
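
As a toy illustration of why the step size matters, the sketch below runs plain gradient descent on a one-dimensional quadratic. It is not the H-MADDPG training loop, and the function name and learning rates are hypothetical; it only shows that a step that is too small converges slowly while one that is too large diverges.

```python
def gradient_descent(lr, steps=50, theta0=1.0):
    """Minimize f(theta) = theta**2 with a fixed step size; return the final theta."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta      # derivative of theta**2
        theta -= lr * grad      # the learning rate scales every update
    return theta


for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final theta = {gradient_descent(lr):.3e}")
```

Each update multiplies theta by (1 − 2·lr), so the iterate shrinks toward zero when that factor has magnitude below 1 and grows otherwise. The behavior reported in Figure 12 is an empirical result for deep actor and critic networks and cannot be inferred from this toy problem.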

5.4. Computational Complexity and Deployment Feasibility

This section discusses the online inference time, scalability, and real-time deployment feasibility of the proposed H-MADDPG algorithm.

Online decision-making time: The inference time of the trained H-MADDPG policy is dominated by forward passes through the actor networks. On a standard server (Intel Xeon Gold 6248, 2.5 GHz, 32 GB RAM), the average inference time per time step is approximately 23 ms for the GCA and 15 ms for each LDCA, totaling less than 80 ms for three DCs. Since each time step in the simulation corresponds to 15 min in real operation, this inference overhead is negligible and well within real-time control constraints.
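
The timing argument above can be checked with a short calculation. The sketch uses only the reported averages (23 ms for the GCA, 15 ms per LDCA, three DCs, 15 min per step); the assumption that all agents run one after another is a conservative reading and is labeled in the code.

```python
GCA_MS = 23.0     # reported average GCA inference time per time step
LDCA_MS = 15.0    # reported average inference time per LDCA
NUM_DCS = 3
STEP_MIN = 15     # one simulation step corresponds to 15 min of operation


def inference_budget_ms(num_dcs, sequential=True):
    """Total inference time per step for one GCA and num_dcs LDCAs."""
    # Assumption: in the sequential case all LDCAs run back to back on one
    # machine; in the parallel case each LDCA runs on its own edge server.
    ldca_ms = LDCA_MS * num_dcs if sequential else LDCA_MS
    return GCA_MS + ldca_ms


total_ms = inference_budget_ms(NUM_DCS)
step_ms = STEP_MIN * 60 * 1000
share = total_ms / step_ms
print(f"total = {total_ms:.0f} ms, share of step = {share:.4%}")
```

Even in the sequential case the total is 68 ms, which is below the stated 80 ms bound and is less than 0.01% of a 15 min control interval.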

Scalability: The hierarchical decomposition reduces the joint action space from exponential to polynomial complexity, making the framework inherently easy to scale. Specifically, the GCA handles only high-level task allocation while each LDCA independently manages its local subtask scheduling and equipment regulation. This divide-and-conquer design avoids the exponential growth of the joint action space and allows the system to accommodate additional DCs or zones without fundamentally altering the decision structure. In practice, the framework exhibits strong operational feasibility for geo-distributed data center deployments of realistic scale.

Real-time deployment feasibility: During execution (decentralized phase), each agent makes decisions based solely on its local observations and requires no online communication with the centralized critic or other agents. The GCA can run on a central controller, while each LDCA runs on an edge server co-located with its corresponding DC. The lightweight inference makes real-time control feasible. However, practical deployment on physical hardware would require additional validation against hardware delays, measurement noise, and occasional communication faults, as acknowledged in the limitations discussed in Section 6.

6. Conclusions

This paper proposes a computational–physical collaborative optimization model for geographically distributed DCs based on hierarchical reinforcement learning. By integrating spatiotemporal task migration, adaptive adjustment of server core count and frequency, and mode-switchable cooling control, the proposed H-MADDPG algorithm effectively bridges the gap between global coordination and local autonomy. The experimental results demonstrate that, compared with the benchmark schemes, the proposed strategy reduces the total cost by up to 36.19% (Table 5), lowers PUE to 1.47–1.60, maintains a 0% SLA violation rate, and achieves balanced resource utilization and stable temperature control.

Despite the promising results, several limitations must be acknowledged to frame the contribution realistically. First, this study is entirely simulation-based and has not been validated on a real DC control platform, where hardware delays, measurement noise, and communication faults may affect performance. Second, the cooling system employs a lumped thermal parameter model, and servers are abstracted as AS models, which simplify spatial temperature gradients and transient thermal dynamics. Third, long-term power purchase agreements (PPAs) are not considered, although they could significantly affect optimal migration decisions under stable electricity prices. Finally, the current framework assumes a single operator and does not address multi-operator coordination or data privacy concerns.

Based on these limitations, we propose the following specific research directions:

(1): Real-platform validation: Deploy the H-MADDPG algorithm on a small-scale experimental DC testbed (e.g., with 3–5 servers and controllable cooling units) to evaluate its real-time feasibility, robustness to communication delays, and generalization to non-stationary workloads.
(2): Refined cooling and server modeling: Replace the aggregated thermal model with zonal computational fluid dynamics (CFD) surrogate models or physics-informed neural networks (PINNs) to capture spatial temperature distributions and transient responses.
(3): Long-term PPA integration: Formulate a two-timescale optimization framework where PPAs are signed on a monthly/yearly basis (upper level) and real-time task migration is optimized on an hourly basis (lower level), with contract compliance constraints.
(4): Privacy-preserving multi-operator coordination: Develop a federated multi-agent reinforcement learning (Fed-MADDPG) framework that enables DCs of different operators to coordinate without sharing local workload or cooling state data.
(5): Stress condition evaluation: Conduct a systematic evaluation under stressed operating conditions, including tighter deadlines (e.g., 30% reduction), heavier workloads (e.g., doubled arrival rate), and limited inter-DC bandwidth, to further validate the robustness of the hierarchical framework.
(6): Carbon-aware co-optimization: Incorporate real-time carbon intensity signals into the reward function to achieve a joint reduction in electricity cost and carbon footprint, supporting green DC operations.

In summary, this work establishes a systematic and extensible hierarchical RL framework for DC energy optimization. The explicit discussion of limitations and corresponding concrete future directions provides a realistic foundation for transitioning from simulation-based research to practical deployment.

Author Contributions

Conceptualization, P.H. and G.W.; methodology, R.S.; software, P.H.; validation, F.W., A.P. and Z.Z.; formal analysis, Q.L.; investigation, P.H.; resources, Z.Z.; data curation, N.D.; writing—original draft preparation, P.H.; writing—review and editing, F.W.; visualization, Z.Z.; supervision, R.S.; project administration, Y.X.; funding acquisition, R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the S&T Program of Hebei, grant number 246Z4301G.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This research was funded by the S&T Program of Hebei, grant number 246Z4301G. Authors Neven Duić and Antun Pfeifer received funding from the European Union (NextGenerationEU) under the National Recovery and Resilience Plan 2021–2026 (NRRP), through the UNIZAG FSB institutional project “Energy Transition of Hard-to-Abate Sectors (ET-SOD)”, approved by the Ministry of Science, Education and Youth of the Republic of Croatia (component C3.2, source 581).

Conflicts of Interest

Authors Rongfu Sun and Qinzhe Liu were employed by State Grid Jibei Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Nomenclature

Indices
$i$	Index of tasks
$v$	Index of subtasks
$k$	Index of DCs
$m$	Index of zones within a DC
$t$	Index of time step
Parameters and Variables
$C_{k}^{e}(t)$	Electricity price of DC $k$ at time $t$
$T_{k}^{\mathrm{out}}(t)$	Outdoor temperature of DC $k$ at time $t$
$T_{k,m}^{\mathrm{in}}(t)$	Indoor temperature of zone $m$ in DC $k$ at time $t$
$N_{k,m}$	Total number of servers of zone $m$ in DC $k$
$\mu_{k,m}^{\mathrm{peak}}$	Peak frequency of zone $m$ in DC $k$
$f_{k,m}(t)$	Frequency scaling factor of zone $m$ in DC $k$ at time $t$
$U_{k,m}(t)$	Core occupancy rate of zone $m$ in DC $k$ at time $t$
$z_{i,k}$	Indicator of whether task $i$ is assigned to DC $k$
$y_{i,v,k,m}$	Indicator of whether subtask $v$ of task $i$ is assigned to zone $m$ in DC $k$
$\mu_{k,m}(t)$	Actual operating frequency of zone $m$ in DC $k$ at time $t$
$a_{k,m}(t)$	Active core ratio of zone $m$ in DC $k$ at time $t$
$N_{k,m}^{a}(t)$	Available cores of zone $m$ in DC $k$ at time $t$
$s_{i,v}$	Start execution time of subtask $v$ in task $i$
$f_{i,v}$	Finish execution time of subtask $v$ in task $i$
$H_{i,v}$	Level of subtask $v$ in task $i$
$A_{i}$	Arrival time of task $i$
$D_{i}$	Deadline of task $i$
$P_{k,m}^{\mathrm{IT}}(t)$	IT power of zone $m$ in DC $k$ at time $t$
$P_{k,m}^{\mathrm{st}}(t)$	Static power of zone $m$ in DC $k$ at time $t$
$P_{k,m}^{\mathrm{dy}}(t)$	Dynamic computing power of zone $m$ in DC $k$ at time $t$
$P_{k,m}^{\mathrm{tr}}(t)$	State transition power of zone $m$ in DC $k$ at time $t$
$\mathrm{PUE}_{k}(t)$	PUE of DC $k$ at time $t$
$\mathrm{Mode}_{k}(t)$	Cooling mode of DC $k$ at time $t$
$s_{k,m}(t)$	Rotation speed of fans/water pumps of zone $m$ in DC $k$ at time $t$
$Q_{k,m}^{\mathrm{cool}}(t)$	Actual heat removal capacity of zone $m$ in DC $k$ at time $t$
$P_{k,m}^{\mathrm{cool}}(t)$	Cooling power of zone $m$ in DC $k$ at time $t$
$\mathrm{COP}_{k,m}(t)$	Coefficient of performance (COP) of zone $m$ in DC $k$ at time $t$

References

1. Yang, M.; Guo, S.; Che, J.; He, W.; Wu, K.; Xu, W. Day-Ahead Photovoltaic Station Power Prediction Driven by Weather Typing: A Collaborative Modelling Approach Based on Multi-Feature Fusion Spectral Clustering and DCS-NsT-BiLSTM. Electronics 2025, 14, 3836.
2. Liu, Y.; Yang, M. Ultra-Short-Term Photovoltaic Cluster Power Prediction Based on Photovoltaic Cluster Dynamic Clustering and Spatiotemporal Heterogeneous Dynamic Graph Modeling. Electronics 2025, 14, 3641.
3. Wang, B.; Zhang, B.; Fu, S.; Gao, P.; Wu, W.; Wang, L. A Novel High-Efficiency Solar Photovoltaic/Thermal Cooling and Power Synergistic System for Decarbonizing Data Centers. Energy Convers. Manag. 2025, 345, 120420.
4. International Energy Agency. Energy and AI. 2025. Available online: https://www.iea.org/reports/energy-and-ai/ (accessed on 3 January 2026).
5. Wang, F.; Xiang, B.; Li, K.; Ge, X.; Lu, H.; Lai, J. Smart Households’ Aggregated Capacity Forecasting for Load Aggregators Under Incentive-Based Demand Response Programs. IEEE Trans. Ind. Appl. 2020, 56, 1086–1097.
6. Chen, Q.; Wang, Q.; Hodge, B.; Zhang, J.; Li, Z.; Shafie-Khah, M. Dynamic Price Vector Formation Model-Based Automatic Demand Response Strategy for PV-Assisted EV Charging Stations. IEEE Trans. Ind. Appl. 2017, 8, 2903–2915.
7. Armghan, A.; Hassan, M.; Armghan, H.; Yang, M.; Alenezi, F.; Azeem, M.K.; Ali, N. Barrier Function Based Adaptive Sliding Mode Controller for a Hybrid AC/DC Microgrid Involving Multiple Renewables. Appl. Sci. 2021, 11, 8672.
8. Wang, Z.; Zhang, J.; Liu, J.; Huang, J.; Zhu, G.; Yu, C.; Zhou, H. Data Association Load Uncertainty and Risk Aversion in Electricity Markets with Data Center Participation in the Demand Response. Energy Rep. 2024, 11, 483–497.
9. Ye, G.; Gao, F.; Fang, J.; Zhang, Q. Joint Workload Scheduling in Geo-Distributed Data Centers Considering UPS Power Losses. IEEE Trans. Ind. Appl. 2022, 59, 612–626.
10. Sun, J.; Chen, M.; Liu, H.; Yang, Q.; Yang, Z. Workload Transfer Strategy of Urban Neighboring Data Centers with Market Power in Local Electricity Market. IEEE Trans. Smart Grid 2020, 11, 3083–3094.
11. Jin, T.; Bai, L.; Yan, M.; Chen, X. Unlocking Spatio-Temporal Flexibility of Data Centers in Multiple Regional Peer-to-Peer Energy Transaction Markets. IEEE Trans. Power Syst. 2025, 40, 3914–3927.
12. He, Y.; Fan, J.; Lin, J.; Li, Z.; Zhang, J.; Tang, W.; Yang, Q. Two-Stage Robust Planning of Data Center Microgrid Considering Batch Load Flexibility and Multi-Energy Complementarity. Energy Convers. Manag. X 2025, 28, 101266.
13. Cao, Y.; Cheng, M.; Zhang, S.; Mao, H.; Wang, P.; Li, C.; Feng, Y.; Ding, Z. Data-Driven Flexibility Assessment for Internet Data Center Towards Periodic Batch Workloads. Appl. Energy 2022, 324, 119665.
14. Yang, T.; Jiang, H.; Hou, Y.; Geng, Y. Carbon Management of Multi-Datacenter Based on Spatio-Temporal Task Migration. IEEE Trans. Cloud Comput. 2023, 11, 1078–1090.
15. Zhang, S.; Wei, M.; Li, Y.; Chen, Y. A Stackelberg-Game Based Bi-Level Scheduling Model of Data Center Combined with Shared Energy Storage Considering Price Linkage and Demand Response. Energy 2025, 336, 138509.
16. Lu, X.; Zhang, P.; Li, K.; Wang, F.; Li, Z.; Zhen, Z. Data Center Aggregators’ Optimal Bidding and Benefit Allocation Strategy Considering the Spatiotemporal Transfer Characteristics. IEEE Trans. Ind. Appl. 2021, 57, 4486–4499.
17. Wang, Y.; Lin, J.; Han, Y.; Han, K.; Han, J.; Han, T.; Wei, Y. Comprehensive Evaluation of All-Element Flexibility Resources in Data Centers: Considering Synergistic Benefits of Computing, Electricity, and Heat. Appl. Energy 2025, 399, 126442.
18. Zhou, F.; Li, C.; Zhu, W.; Zhou, J.; Mao, G.; Liu, Z. Energy-Saving Analysis of a Case Data Center with a Pump-Driven Loop Heat Pipe System in Different Climate Regions in China. Energy Build. 2018, 169, 295–304.
19. Ren, X.; Wang, J.; Hu, X.; Sun, Z.; Zhao, Q.; Chong, D.; Xue, K.; Yan, J. A Novel Demand Response Based Distributed Multi-Energy System Optimal Operation Framework for Data Centers. Energy Build. 2024, 305, 113886.
20. Yang, W.; Zhao, M.; Li, J.; Zhang, X. Energy-Efficient DAG Scheduling with DVFS for Cloud Data Centers. J. Supercomput. 2024, 80, 14799–14823.
21. Lin, X.; Luo, X.; Li, C.; Liang, J.; Wu, G.; Li, K. An Energy-Efficient Tuning Method for Cloud Servers Combining DVFS and Parameter Optimization. IEEE Trans. Cloud Comput. 2023, 11, 3643–3655.
22. Asghari, A.; Sohrabi, M.K. Combined Use of Coral Reefs Optimization and Multi-Agent Deep Q-Network for Energy-Aware Resource Provisioning in Cloud Data Centers Using DVFS Technique. Clust. Comput. 2022, 25, 119–140.
23. Luan, G.; Pang, P.; Chen, Q.; Xue, S.; Song, Z.; Guo, M. Online Thread Auto-Tuning for Performance Improvement and Resource Saving. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 3746–3759.
24. Lin, J.; Lin, W.; Wu, W.; Lin, W.; Li, K. Energy-Aware Virtual Machine Placement Based on a Holistic Thermal Model for Cloud Data Centers. Future Gener. Comput. Syst. 2024, 161, 302–314.
25. Silva-Llanca, L.; Ponce, C.; Bermúdez, E.; Martínez, D.; Díaz, A.; Aguirre, F. Improving Energy and Water Consumption of a Data Center via Air Free-Cooling Economization: The Effect Weather on Its Performance. Energy Convers. Manag. 2023, 292, 117344.
26. Zhou, Y.; Li, S.; Li, Q.; Wei, F.; Yang, D.; Liu, J.; Yu, D. Energy Savings in Direct Air-Side Free Cooling Data Centers: A Cross-System Modeling and Optimization Framework. Energy Build. 2024, 308, 114003.
27. Zou, S.; Liu, J.; Dai, Y. Performance of a Multi-Cooling Sources Cooling System with Photovoltaics and Waste Heat Recovery in Data Center. Energy Convers. Manag. 2025, 324, 119319.
28. Deng, W.; Wang, J.; Yue, C.; Guo, Y.; Zhang, Q. Model-Based Control Strategy with Linear Parameter-Varying State-Space Model for Rack-Based Cooling Data Centers. Energy Build. 2024, 319, 114528.
29. Chen, M.; Gao, C.; Shahidehpour, M.; Li, Z.; Chen, S.; Li, D. Internet Data Center Load Modeling for Demand Response Considering the Coupling of Multiple Regulation Methods. IEEE Trans. Smart Grid 2020, 12, 2060–2076.
30. Yin, X.; Ye, C.; Ding, Y.; Song, Y. Exploiting Internet Data Centers as Energy Prosumers in Integrated Electricity-Heat System. IEEE Trans. Smart Grid 2022, 14, 167–182.
31. Han, J.; Tong, N.; Lin, J.; Han, Y.; Wang, Y.; Han, K.; Li, Y. Distributionally Robust Co-Optimization of Computing Workloads and Renewable Energy Uncertainties in Geo-Distributed Data Centers Considering Multi-Element Influences. Energy Convers. Manag. X 2025, 29, 101432.
32. Long, S.; Li, Y.; Huang, J.; Li, Z.; Li, Y. A Review of Energy Efficiency Evaluation Technologies in Cloud Data Centers. Energy Build. 2022, 260, 111848.
33. Han, O.; Ding, T.; Mu, C.; Jia, Z.; Ma, Z. Waste Heat Reutilization and Integrated Demand Response for Decentralized Optimization of Data Centers. Energy 2023, 264, 111871.
34. Ran, J.; Zhang, Q.; Zhu, Y.; Zhai, J.; Li, J.; Guo, Z.; Wang, T. Co-Optimization of Thermal-Aware Workload Scheduling with Deep Reinforcement Learning-Based Cooling Control in Data Centers. Energy 2026, 344, 139965.
35. Sun, Y.; Ding, Z.; Dehghanian, P.; Teng, F. Learning-Enabled Adaptive Power Capping Scheme for Cloud Data Centers. IEEE Trans. Smart Grid 2025, 16, 4755–4767.
36. Zhang, Y.; Ye, Y.; Hu, J.; Hu, H.; Zhang, X.; Xu, D. Constrained Semi-MDP Formulation and Perception-Enhanced Safe Policy Learning for Efficient Dynamic Task Scheduling of Data Centers. IEEE Trans. Smart Grid 2026, 17, 1209–1224.
37. Zhang, Y.; Wang, Z.; Filonenko, K.; Dominković, D.; Wang, S. Physics-Guided Deep Reinforcement Learning for Optimized Data Center Cooling and Waste Heat Recovery Utilizing Aquifer Thermal Energy Storage. Appl. Energy 2026, 402, 126984.
38. Sun, Y.; Ding, Z.; Yan, Y.; Wang, Z.; Dehghanian, P.; Lee, W. Privacy-Preserving Energy Sharing Among Cloud Service Providers via Collaborative Job Scheduling. IEEE Trans. Smart Grid 2024, 16, 1168–1180.
39. Chen, S.; Li, J.; Yuan, Q.; He, H.; Li, S.; Yang, J. Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 2331–2346.
40. Liu, W.; Yan, Y.; Sun, Y.; Mao, H.; Cheng, M.; Wang, P.; Ding, Z. Online Job Scheduling Scheme for Low-Carbon Data Center Operation: An Information and Energy Nexus Perspective. Appl. Energy 2023, 338, 120918.
41. Bian, Y.; Xie, L.; Zou, Y.; Huang, C.; Zhang, H. A Novel Two-Stage Multi-Energy Sharing Model for Data Center-Based Microgrids Considering Joint Grouping and Matching Priority. Energy 2025, 336, 138395.
42. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6379–6390.
43. Alibaba Group. Alibaba Cluster Trace V2018. Dataset. Available online: https://github.com/alibaba/clusterdata/blob/master/cluster-trace-v2018/ (accessed on 13 January 2026).
44. National Oceanic and Atmospheric Administration (NOAA). Global Historical Climatology Network-Daily (GHCN-Daily) [Global Meteorological Dataset]. Dataset. Available online: https://www.ncdc.noaa.gov/cdo-web/ (accessed on 13 January 2026).
45. U.S. Energy Information Administration (EIA). Electricity Data (Including Prices, Generation, and Sales) [U.S. Electricity Price Database]. 24 July 2025. Available online: https://www.eia.gov/electricity/data.php (accessed on 13 January 2026).

Figure 1. The architecture of the computational–physical collaborative optimization model.

Figure 2. H-MADDPG algorithm architecture diagram.

Figure 3. Electricity prices and outdoor temperatures of DCs: (a) DC1; (b) DC2; and (c) DC3.

Figure 4. Task execution under four cases: (a) Case 1; (b) Case 2; (c) Case 3; and (d) Case 4.

Figure 5. Total power of each DC under four cases: (a) DC1; (b) DC2; and (c) DC3.

Figure 6. Core occupancy rate of servers in each DC under four cases: (a) DC1; (b) DC2; and (c) DC3.

Figure 7. Box plots of indoor temperature in each DC under four cases: (a) DC1; (b) DC2; and (c) DC3.

Figure 8. Total cost and PUE under different seasons.

Figure 9. Power consumption under different seasons: (a) DC1 in spring/autumn; (b) DC2 in spring/autumn; (c) DC3 in spring/autumn; (d) DC1 in summer; (e) DC2 in summer; (f) DC3 in summer; (g) DC1 in winter; (h) DC2 in winter; and (i) DC3 in winter.

Figure 10. Box plots of server state regulation under different seasons: (a) DC1; (b) DC2; and (c) DC3.

Figure 11. Temperature under different seasons: (a) DC1 in spring/autumn; (b) DC2 in spring/autumn; (c) DC3 in spring/autumn; (d) DC1 in summer; (e) DC2 in summer; (f) DC3 in summer; (g) DC1 in winter; (h) DC2 in winter; and (i) DC3 in winter.

Figure 12. Impact of learning rate on the training process.

Table 1. Comparison between this paper and the existing literature.

Reference	[12]	[17]	[18]	[19]	[29]	[30]	[36]	[38]	[39]	This Paper
Task decomposition and subtask dependencies	×	×	×	×	×	×	√	√	×	√
Task temporal optimization	√	√	×	√	√	√	√	√	√	√
Cross-DC task migration	×	×	×	×	√	√	×	√	√	√
Intra-DC task migration	×	√	×	×	×	×	√	×	×	√
Proactive IT equipment state regulation	×	√	×	√	√	√	√	×	√	√
Cooling system operation optimization	√	√	√	√	√	√	×	×	×	√
Uncertainty handling	√	×	×	√	√	√	√	√	√	√

Table 2. Three core innovations of the proposed H-MADDPG.

Aspects	Traditional MADDPG	H-MADDPG
Architecture	Single-layer, homogeneous agents	Two-layer heterogeneous agents
Action space	Continuous	Hybrid (discrete + continuous) with variable-dimension masking
Decision logic	Flat, peer-to-peer	Hierarchical: global allocation, local scheduling, and control
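
The "variable-dimension masking" entry in Table 2 can be illustrated with a generic action-mask sketch. This is a common masking pattern and not necessarily the mechanism defined in Section 4.2: the actor is assumed to emit logits for a fixed number of task slots, slots beyond the number of arrived tasks are ignored, and unavailable DCs receive zero probability. All names (`masked_softmax`, `allocate`, `slot_logits`, `dc_available`) and the example values are hypothetical.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over entries where mask is True; masked entries get probability 0."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    valid = [x for x, keep in zip(logits, mask) if keep]
    if not valid:
        raise ValueError("at least one action must be valid")
    peak = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) if keep else 0.0 for x, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def allocate(slot_logits, n_arrived, dc_available):
    """Greedy DC choice for each arrived task; padded slots are never read."""
    choices = []
    for logits in slot_logits[:n_arrived]:
        probs = masked_softmax(logits, dc_available)
        choices.append(max(range(len(probs)), key=probs.__getitem__))
    return choices


# Hypothetical step: 5 task slots, 3 DCs, 3 tasks arrived, third DC unavailable.
slot_logits = [
    [2.0, 1.0, 0.5],
    [0.1, 0.3, 3.0],
    [1.5, 1.4, 0.2],
    [0.0, 0.0, 0.0],  # padding slot
    [0.0, 0.0, 0.0],  # padding slot
]
choices = allocate(slot_logits, n_arrived=3, dc_available=[True, True, False])
print(choices)  # [0, 1, 0]
```

In the second slot the largest logit belongs to the unavailable DC, so the mask redirects the choice to the best remaining DC. Keeping the network output at a fixed size and masking at decision time is one way to cope with a number of tasks that changes from step to step.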

Table 3. Zone parameters of each DC.

	Number of Zones	Zone Heat Capacity (kWh/°C)	Zone Thermal Resistance (°C/kW)	Zone Peak Frequency (GHz)	Number of Zone Cores
DC1	4	[8000, 7200, 6800, 7200]	[0.02, 0.02, 0.02, 0.02]	[3.8, 3.6, 3.5, 3.7]	[3200, 2800, 2500, 3000]
DC2	3	[3500, 3200, 2400]	[0.04, 0.04, 0.04]	[3.0, 2.9, 3.1]	[1800, 1500, 1600]
DC3	2	[800, 700]	[0.08, 0.08]	[2.6, 2.5]	[800, 600]

Table 4. Parameters of the algorithm.

Parameter Name	Value
Learning rate	1 × 10⁻⁵
Discount factor	0.99
Batch size	128
Experience replay buffer capacity	10⁶
Simulation steps per episode	96
Number of training episodes	1000
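
For readers reproducing the setup, the Table 4 values can be collected into a small configuration object and cross-checked against the 15 min step length stated in Section 5.4. The class and field names are hypothetical, and the replay-buffer check assumes that one transition is stored per simulation step, which the paper does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in Table 4 (field names are illustrative)."""
    learning_rate: float = 1e-5
    discount_factor: float = 0.99
    batch_size: int = 128
    replay_capacity: int = 10**6
    steps_per_episode: int = 96
    num_episodes: int = 1000


cfg = TrainConfig()
STEP_MINUTES = 15  # one simulation step, as stated in Section 5.4

# 96 steps of 15 min cover exactly one day.
horizon_hours = cfg.steps_per_episode * STEP_MINUTES / 60

# Assumption: one transition is stored per step, so the buffer never fills.
total_transitions = cfg.steps_per_episode * cfg.num_episodes

# Common rule of thumb for the horizon implied by the discount factor.
effective_horizon_steps = 1.0 / (1.0 - cfg.discount_factor)

print(horizon_hours, total_transitions, round(effective_horizon_steps))
```

Under these assumptions each episode spans 24 h, the 96,000 stored transitions stay well below the buffer capacity of 10⁶, and the discount factor of 0.99 corresponds to a look-ahead of roughly 100 steps, which is close to the episode length.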

Table 5. Comparison of multidimensional indices under four cases.

Case	Total Cost ($)	PUE (per DC)
Case 1	15,741.23	1.60/1.50/1.47
Case 2	23,403.37	1.80/1.51/1.60
Case 3	24,229.59	1.73/1.64/1.47
Case 4	24,667.14	1.73/1.64/1.47
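
The relative cost reductions implied by Table 5 can be recomputed directly from the listed totals. The sketch assumes that Case 1 is the proposed strategy, since it has the lowest total cost and its PUE values match the 1.47–1.60 range quoted in the Conclusions; the function and variable names are hypothetical.

```python
total_cost = {
    "Case 1": 15741.23,
    "Case 2": 23403.37,
    "Case 3": 24229.59,
    "Case 4": 24667.14,
}


def reduction_pct(proposed, baseline):
    """Relative cost reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Assumption: Case 1 is the proposed strategy, Cases 2-4 are the benchmarks.
reductions = {
    case: reduction_pct(total_cost["Case 1"], total_cost[case])
    for case in ("Case 2", "Case 3", "Case 4")
}
for case, pct in reductions.items():
    print(f"{case}: {pct:.2f}%")
```

With these numbers the reductions are about 32.74%, 35.03%, and 36.19%, so the 36.19% figure quoted in the Conclusions corresponds to the comparison with Case 4.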
