Server-cluster task scheduling and UPS energy management differ substantially in state-space complexity and control flexibility. Task migration is a multi-objective decision problem: it must handle highly stochastic task arrivals, a large queue state space, and the joint balancing of delay and power. It is therefore modeled and optimized within a Markov decision process (MDP) framework. By contrast, the UPS battery bank is controlled by a computationally efficient greedy algorithm that directly tracks the low-frequency power target while satisfying physical constraints, thereby avoiding unnecessary training overhead.
To mitigate sample insufficiency and local overfitting in single-DC training, this paper proposes Fed-AdaPPO, a federated adaptive proximal policy optimization framework for collaborative tie-line power scheduling among multiple DCs. The framework integrates UCB-guided adaptive exploration with Critic-only federated gradient aggregation, enabling improved policy learning efficiency while keeping raw workload data and local scheduling states decentralized.
2.5.1. Markov Decision Process Modeling
To improve user satisfaction and fully exploit the scheduling capability, in DC d, tasks in the processing state at time t are sorted in descending order of allowable delay time, and tasks with the same allowable delay time are further sorted in ascending order of required processing time, thereby forming the task-processing queue SP. Tasks in the delayed state are sorted in ascending order of allowable delay time, and tasks with the same allowable delay time are further sorted in descending order of required processing time, thereby forming the task-delay queue SQ.
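The two ordering rules can be sketched in Python as follows. This is only an illustration of the sorting described above: the `Task` record and the function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    allowable_delay: float   # allowable delay time of the task
    processing_time: float   # required processing time of the task


def build_sp_queue(processing_tasks):
    # SP: descending allowable delay, ties broken by ascending processing time.
    return sorted(processing_tasks,
                  key=lambda t: (-t.allowable_delay, t.processing_time))


def build_sq_queue(delayed_tasks):
    # SQ: ascending allowable delay, ties broken by descending processing time.
    return sorted(delayed_tasks,
                  key=lambda t: (t.allowable_delay, -t.processing_time))


tasks = [Task("a", 4.0, 2.0), Task("b", 4.0, 1.0), Task("c", 9.0, 3.0)]
sp_ids = [t.task_id for t in build_sp_queue(tasks)]
sq_ids = [t.task_id for t in build_sq_queue(tasks)]
```

With these orderings, the head of SP holds the task that tolerates suspension best, and the head of SQ holds the task that most urgently needs activation.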
To achieve coordinated optimization of server-layer task migration, the tie-line power smoothing problem for multiple DCs is formulated as a unified MDP model.
- (1)
State space: The state vector includes the operating state of the server cluster and the task-sequence information at time t, and is defined as follows:
$$s_t = \left[ N_t^{SP},\; N_t^{SQ},\; \rho_t \right]$$
where $N_t^{SP}$ denotes the length of the SP queue; $N_t^{SQ}$ denotes the length of the SQ queue; $\rho_t$ denotes the normalized residual delay ratio, which is introduced as a soft constraint to ensure service quality, defined as follows:
$$\rho_t = \frac{1}{N_t^{SQ}} \sum_{k=1}^{N_t^{SQ}} \frac{\tau_{k,t}}{\tau_k^{\max}}$$
where $\tau_{k,t}$ denotes the remaining allowable delay time of the k-th task in the SQ queue; $\tau_k^{\max}$ denotes the maximum allowable delay time of the k-th task.
- (2)
Action space: To avoid the curse of dimensionality caused by a large discrete action space and to enable the agent to continuously perceive the scale of task migration, a continuous action $a_t \in [-1, 1]$ is constructed and then mapped to a physically executable discrete number of tasks as follows:
$$\Delta N_t = \mathrm{round}\left( a_t N^{\max} \right)$$
where $\Delta N_t$ denotes the number of activated or suspended tasks. A positive value indicates that tasks are sequentially activated from the head of the SQ queue and moved to the SP queue, whereas a negative value indicates that tasks are sequentially suspended from the head of the SP queue and moved to the SQ queue; $N^{\max}$ denotes the maximum number of tasks allowed to migrate.
- (3)
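Reward function: The objective of tie-line power smoothing in a DC is to minimize the deviation between the actual power and the target power while ensuring service quality. Accordingly, the reward function is defined as follows:
$$r_t = -\omega_1 \frac{\left| P_{d,t} - P_{d,t}^{\mathrm{ref}} \right|}{P_d^{\max}} - \omega_2 \left( 1 - \rho_t \right)$$
where $P_{d,t}$ and $P_{d,t}^{\mathrm{ref}}$ denote the actual power and the target power of DC d at time t; $\rho_t$ is the normalized residual delay ratio contained in the state vector; $\omega_1$, $\omega_2$ denote weighting coefficients used to balance the power-smoothing performance and the task-delay risk; $P_d^{\max}$ denotes the maximum power consumption of DC d, defined as follows: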
Normalizing the power-deviation term in the reward function by the maximum power consumption effectively eliminates the dimensional effect caused by capacity differences among DCs of different scales. This ensures a consistent reward scale and further improves the training stability and convergence efficiency of the reinforcement learning algorithm.
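As an illustration of items (1) to (3), the following Python sketch assembles a state vector, maps a continuous action to a signed task count, moves tasks between the two queues, and evaluates a reward. It is not the authors' implementation. The function names, the averaging form of the residual delay ratio, the rounding used in the action mapping, and the `1 - rho` delay-risk term are assumptions chosen to match the verbal descriptions.

```python
def residual_delay_ratio(remaining, maximum):
    """Mean of remaining / maximum allowable delay over the SQ queue (assumed form)."""
    if not remaining:
        return 1.0  # assumption: an empty SQ queue carries no delay risk
    return sum(r / m for r, m in zip(remaining, maximum)) / len(remaining)


def build_state(sp_len, sq_len, remaining, maximum):
    """State vector: SP length, SQ length, normalized residual delay ratio."""
    return [sp_len, sq_len, residual_delay_ratio(remaining, maximum)]


def map_action(a, n_max):
    """Map a continuous action in [-1, 1] to a signed integer task count.

    Note: Python's round() sends exact halves to the nearest even integer.
    """
    a = max(-1.0, min(1.0, a))
    return int(round(a * n_max))


def migrate(sp, sq, n):
    """n > 0: activate n tasks from the head of SQ into SP.
    n <= 0: suspend |n| tasks from the head of SP into SQ.
    Both queues would be re-sorted afterwards (not shown)."""
    if n > 0:
        k = min(n, len(sq))
        return sp + sq[:k], sq[k:]
    k = min(-n, len(sp))
    return sp[k:], sq + sp[:k]


def reward(p_actual, p_target, p_max, rho, w1=1.0, w2=0.5):
    """Normalized power deviation plus an assumed delay-risk penalty (1 - rho)."""
    return -w1 * abs(p_actual - p_target) / p_max - w2 * (1.0 - rho)
```

Dividing the power deviation by `p_max` keeps the first term within [0, 1] for any DC size, which is the scale-consistency argument made above.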
2.5.2. Federated Learning Framework Mechanism
Traditional discrete RL methods suffer from action-space explosion in large-scale task scheduling and cannot capture fine-grained scheduling characteristics. Although conventional PPO with a continuous action space avoids this explosion, its exploration strategy is too simple for complex, high-dimensional environments, limiting its ability to balance stability and exploration.
AdaPPO employs a parameterized stochastic policy network as the Actor network and a state-value function network as the Critic network. The Actor network consists of a new policy network and an old policy network. The new policy network is used to update the policy based on the latest sampled data, whereas the old policy network is responsible for generating actions during interaction with the environment.
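The Critic network, with weights $\phi$, is used to estimate the state-value function $V_\phi(s_t)$ and is trained by minimizing the mean squared error between the predicted value and the empirical return:
$$L_C(\phi) = \hat{\mathbb{E}}_t \left[ \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right]$$
where $L_C(\phi)$ denotes the loss function of the Critic network, and $\hat{R}_t$ denotes the empirical return used as the training target.
The weights of the Critic network are updated by backpropagation according to:
$$\phi \leftarrow \phi - \alpha \nabla_\phi L_C(\phi)$$
where $\alpha$ denotes the learning rate of the Critic network, and $\nabla_\phi$ denotes the gradient operator.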
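A minimal sketch of one Critic update is given below. It assumes a linear value function in place of the Critic network so that the gradient of the mean-squared-error loss can be written by hand. The function names and the toy data are illustrative only.

```python
def critic_value(w, s):
    # Linear stand-in for the Critic network: V(s) = w . s
    return sum(wi * si for wi, si in zip(w, s))


def critic_loss(w, states, returns):
    # Mean squared error between predicted values and empirical returns.
    return sum((critic_value(w, s) - g) ** 2
               for s, g in zip(states, returns)) / len(states)


def critic_grad(w, states, returns):
    # Gradient of the MSE loss with respect to the weights.
    grad = [0.0] * len(w)
    for s, g in zip(states, returns):
        err = critic_value(w, s) - g
        for i, si in enumerate(s):
            grad[i] += 2.0 * err * si / len(states)
    return grad


def critic_step(w, states, returns, lr):
    # One gradient-descent step on the Critic weights.
    grad = critic_grad(w, states, returns)
    return [wi - lr * gi for wi, gi in zip(w, grad)]


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
returns = [1.0, 2.0, 3.0]
w0 = [0.0, 0.0]
w1 = critic_step(w0, states, returns, lr=0.1)
```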
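The weights of the Actor network are optimized through the following loss function:
$$L_A(\theta) = -\hat{\mathbb{E}}_t \left[ \min\left( \kappa_t(\theta) \hat{A}_t,\; \mathrm{clip}\left( \kappa_t(\theta),\, 1 - \epsilon_c,\, 1 + \epsilon_c \right) \hat{A}_t \right) \right]$$
where $\hat{\mathbb{E}}_t[\cdot]$ denotes the average over sampled transitions; $\mathrm{clip}(\cdot)$ denotes the clipping function, which constrains $\kappa_t(\theta)$ to the interval $[1 - \epsilon_c,\, 1 + \epsilon_c]$ to prevent excessively large policy updates in a single step; $\epsilon_c$ denotes the clipping threshold; $\kappa_t(\theta)$ denotes the probability ratio between the new and old policies; $\hat{A}_t$ denotes the estimated advantage function, which measures the relative advantage of taking action $a_t$ in state $s_t$ under the current policy. The leading minus sign turns the clipped surrogate objective, which PPO maximizes, into a loss to be minimized. In this paper, $\hat{A}_t$ is computed using generalized advantage estimation (GAE). The corresponding expressions are
$$\kappa_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad \hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}^{V}$$
where $\pi_\theta(a_t \mid s_t)$ denotes the probability of taking action $a_t$ in state $s_t$ under the current policy parameterized by $\theta$, $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ denotes the corresponding probability under the old policy, $\gamma$ denotes the discount factor, $\lambda$ denotes the balancing parameter used to trade off bias and variance, l denotes the step offset in the summation, and $\delta_{t+l}^{V}$ denotes the temporal-difference residual at step t + l, defined as follows:
$$\delta_{t+l}^{V} = r_{t+l} + \gamma V_\phi(s_{t+l+1}) - V_\phi(s_{t+l})$$
The weights of the Actor network are updated via backpropagation based on its loss function:
$$\theta \leftarrow \theta - \beta \nabla_\theta L_A(\theta)$$
where $\beta$ is the learning rate of the Actor network.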
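The advantage estimate and the clipped Actor loss can be sketched in plain Python as follows. The sketch assumes a finite trajectory, so the GAE sum is truncated at the last step and bootstrapped with one extra value estimate. The loss is written as the negative clipped surrogate so that it is minimized. Function and argument names are illustrative.

```python
import math


def td_residuals(rewards, values, gamma):
    # values holds len(rewards) + 1 entries; the last one bootstraps the final state.
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


def gae(rewards, values, gamma=0.99, lam=0.95):
    # Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    # equivalent to the discounted sum of TD residuals.
    deltas = td_residuals(rewards, values, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_actor_loss(logp_new, logp_old, advantages, eps=0.2):
    # Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

For a positive advantage the ratio stops contributing once it exceeds `1 + eps`. For a negative advantage the unclipped term is kept, so a harmful move away from the old policy is still penalized in full.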
Because the server-cluster scheduling problem involves a high-dimensional continuous action space, uniform random exploration may lead to slow convergence and inefficient sampling. To improve exploration efficiency, a UCB-based adaptive exploration mechanism is introduced. By assigning higher exploration priority to action regions with greater uncertainty, the proposed strategy encourages the agent to explore potentially valuable actions more effectively than conventional random perturbation. Specifically, the action space is discretized into K intervals $\{I_1, \ldots, I_K\}$, and the UCB value of interval $I_k$ is defined as follows:
$$U_k = \bar{r}_k + c \sqrt{\frac{\ln n}{n_k}}$$
where $\bar{r}_k$ denotes the average reward of interval $I_k$; $n$ denotes the number of training steps; $n_k$ denotes the number of times that interval $I_k$ has been selected; $c$ denotes the exploration parameter.
During training, with probability $p_u$, the center of the interval with the highest UCB value is selected as the action; with probability $1 - p_u$, the action is sampled from policy $\pi_\theta$. Unlike epsilon-greedy exploration, which samples actions uniformly at random, this method prioritizes under-explored action regions while continuing to exploit high-return power-regulation actions, thereby balancing exploration and exploitation more effectively.
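The exploration rule can be illustrated with the sketch below. It assumes a scalar action in [-1, 1], equal-width intervals, and the UCB1-style bonus `c * sqrt(ln n / n_k)`. Giving unvisited intervals an infinite score is a common convention and also an assumption here. The class `UCBExplorer`, the function `select_action`, and the parameter `p_ucb` are hypothetical names.

```python
import math
import random


class UCBExplorer:
    def __init__(self, k, low=-1.0, high=1.0, c=1.0):
        self.k, self.low, self.high, self.c = k, low, high, c
        self.counts = [0] * k          # times each interval was selected
        self.mean_reward = [0.0] * k   # running average reward per interval
        self.total = 0                 # total number of recorded steps

    def center(self, i):
        width = (self.high - self.low) / self.k
        return self.low + (i + 0.5) * width

    def index_of(self, action):
        width = (self.high - self.low) / self.k
        i = int((action - self.low) / width)
        return max(0, min(self.k - 1, i))

    def ucb(self, i):
        if self.counts[i] == 0:
            return math.inf  # unvisited intervals get top priority
        bonus = self.c * math.sqrt(math.log(self.total) / self.counts[i])
        return self.mean_reward[i] + bonus

    def best_interval(self):
        return max(range(self.k), key=self.ucb)

    def update(self, action, reward):
        i = self.index_of(action)
        self.counts[i] += 1
        self.total += 1
        self.mean_reward[i] += (reward - self.mean_reward[i]) / self.counts[i]


def select_action(explorer, policy_sample, p_ucb, rng=random):
    # With probability p_ucb take the center of the best UCB interval,
    # otherwise sample from the policy.
    if rng.random() < p_ucb:
        return explorer.center(explorer.best_interval())
    return policy_sample()
```

Every executed action, whether it came from the UCB branch or from the policy, would be passed to `update` so that the interval statistics reflect all collected experience.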
To further enhance generalization capability under heterogeneous multi-DC operating conditions, a horizontal federated learning framework is incorporated into PPO training. In the proposed design, each DC acts as an independent client and only shares Critic-side gradient information with the central server, while Actor policies, task queues, and other local operational data remain on-site. Since Actor policies are directly associated with local state-action mappings, not sharing Actor gradients helps reduce the exposure of sensitive scheduling information. With this design, collaborative value-function learning is achieved without exchanging raw workload data.
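To bound the sensitivity of the uploaded gradients, gradient clipping is first performed before perturbation:
$$\tilde{g}_d = g_d \cdot \min\left( 1,\; \frac{C}{\lVert g_d \rVert_2} \right)$$
where $g_d = \nabla_\phi L_C^d(\phi)$ denotes the local Critic gradient of DC d, $\tilde{g}_d$ denotes the clipped Critic gradient, and C denotes the clipping threshold. Then, zero-mean Gaussian noise $z_d \sim \mathcal{N}(0, \sigma^2 I)$ is added before parameter uploading:
$$\hat{g}_d = \tilde{g}_d + z_d$$
where $\hat{g}_d$ denotes the Critic gradients after noise injection; $\sigma$ denotes the noise intensity; $I$ denotes the identity matrix.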
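A minimal sketch of the client-side upload step follows. It assumes norm-based clipping, in which the whole gradient is rescaled whenever its l2 norm exceeds the threshold, and independent Gaussian noise on every coordinate. The function names are illustrative.

```python
import math
import random


def l2_norm(g):
    return math.sqrt(sum(x * x for x in g))


def clip_gradient(g, c):
    # Rescale g so that its l2 norm does not exceed c; short gradients pass unchanged.
    norm = l2_norm(g)
    scale = min(1.0, c / norm) if norm > 0.0 else 1.0
    return [x * scale for x in g]


def perturb(g, sigma, rng):
    # Add independent zero-mean Gaussian noise with standard deviation sigma.
    return [x + rng.gauss(0.0, sigma) for x in g]


rng = random.Random(0)
raw = [3.0, 4.0]                      # l2 norm 5
clipped = clip_gradient(raw, c=1.0)   # rescaled to l2 norm 1
uploaded = perturb(clipped, sigma=0.1, rng=rng)
```

Clipping must precede the noise: the noise scale is calibrated to the clipping threshold, so an unclipped gradient would void the sensitivity bound.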
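During the federated phase, the server aggregates the perturbed Critic gradients as follows:
$$\bar{g} = \frac{1}{D} \sum_{d=1}^{D} \hat{g}_d$$
where $\hat{g}_d$ denotes the perturbed Critic gradient uploaded by DC d, and D denotes the number of DC clients participating in the aggregation.
After aggregation, the global Critic parameter $\phi$ is updated according to the aggregated gradient, with $\alpha$ the Critic learning rate:
$$\phi \leftarrow \phi - \alpha \bar{g}$$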
Then, the updated parameter is redistributed to each DC as the initialization for the next stage of training, thereby forming a closed-loop federated optimization process consisting of local training, gradient clipping, noise perturbation, secure aggregation, and parameter updating. In this way, collaborative Critic optimization and improved policy generalization are achieved without exchanging raw workload data, task queues, user information, or Actor policies.
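The server side of one communication round can be sketched as below, assuming plain averaging of the uploaded gradients followed by a single gradient-descent step on the global Critic parameters. The uploads are taken as given, already clipped and perturbed by the clients. Names and numbers are illustrative.

```python
def aggregate(gradients):
    # Coordinate-wise mean of the gradients uploaded by the D clients.
    d = len(gradients)
    return [sum(column) / d for column in zip(*gradients)]


def update_global_critic(phi, aggregated, lr):
    # One gradient-descent step on the global Critic parameters.
    return [p - lr * g for p, g in zip(phi, aggregated)]


uploads = [[0.2, -0.4], [0.4, 0.0], [0.0, 0.1]]   # perturbed gradients from 3 DCs
g_bar = aggregate(uploads)
phi_next = update_global_critic([1.0, 1.0], g_bar, lr=0.5)
```

`phi_next` is what the server would send back to every DC as the starting point of the next local training stage.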
The overall procedure of the tie-line power smoothing method for multiple DCs based on Fed-AdaPPO is illustrated in Figure 3.
2.5.3. Security and Convergence Analysis
Based on the Critic-only federated gradient aggregation mechanism, this subsection analyzes the theoretical properties of Fed-AdaPPO in terms of differential privacy, communication complexity, and convergence stability. Since Actor-Critic training in a continuous action space involves nonlinear function approximation and stochastic policy optimization, the resulting optimization problem is generally non-convex. Therefore, this paper does not claim global optimality, but instead analyzes the first-order convergence behavior of the noisy federated Critic update under standard smoothness and bounded-variance assumptions.
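For convergence analysis, the global Critic objective is defined as follows:
$$F(\phi) = \frac{1}{D} \sum_{d=1}^{D} L_C^d(\phi)$$
where $L_C^d(\phi)$ denotes the local Critic loss of DC d. The noisy federated gradient aggregation is treated as a stochastic estimate of the gradient of this objective.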
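Assumption 1. For each DC d, the local Critic loss function $L_C^d(\phi)$ is lower bounded and has an L-Lipschitz continuous gradient. The mini-batch stochastic gradient estimator is unbiased and has a bounded second moment. The bias induced by DC heterogeneity and gradient clipping is denoted by $b^r$, satisfying
$$\lVert b^r \rVert_2^2 \le B^2$$
where $B^2$ is introduced to denote the upper bound of the heterogeneity- and clipping-induced bias. The stochastic gradient noise is denoted by $\xi^r$, satisfying
$$\mathbb{E}\left[ \xi^r \mid \mathcal{F}_r \right] = 0, \qquad \mathbb{E}\left[ \lVert \xi^r \rVert_2^2 \mid \mathcal{F}_r \right] \le \sigma_g^2$$
where $\mathcal{F}_r$ denotes the historical information before communication round r, and $\sigma_g^2$ is the upper bound of the stochastic gradient noise variance.
Proposition 1. Under the gradient clipping mechanism, the $\ell_2$-sensitivity of the uploaded gradient from each DC satisfies
$$\Delta_2 = \max_{\mathcal{D}_d \sim \mathcal{D}_d'} \left\lVert \tilde{g}_d(\mathcal{D}_d) - \tilde{g}_d(\mathcal{D}_d') \right\rVert_2 \le 2C$$
where $\mathcal{D}_d$ and $\mathcal{D}_d'$ are neighboring local datasets that differ in only one training sample, $\tilde{g}_d(\cdot)$ is the clipped Critic gradient computed on the given dataset, C is the clipping threshold, and $\Delta_2$ denotes the $\ell_2$-sensitivity of the uploaded gradient mechanism.
Proof. By the definition of gradient clipping, any clipped local Critic gradient satisfies
$$\lVert \tilde{g}_d \rVert_2 \le C$$
Therefore, for any neighboring datasets $\mathcal{D}_d$ and $\mathcal{D}_d'$, the triangle inequality gives
$$\left\lVert \tilde{g}_d(\mathcal{D}_d) - \tilde{g}_d(\mathcal{D}_d') \right\rVert_2 \le \left\lVert \tilde{g}_d(\mathcal{D}_d) \right\rVert_2 + \left\lVert \tilde{g}_d(\mathcal{D}_d') \right\rVert_2 \le 2C$$
This proves Proposition 1. □
Proposition 2. If the Gaussian noise intensity $\sigma$ satisfies
$$\sigma \ge \frac{2C \sqrt{2 \ln(1.25/\delta)}}{\varepsilon}$$
then the perturbed Critic gradient uploaded by each DC in one communication round satisfies $(\varepsilon, \delta)$-differential privacy.
Proof. According to Proposition 1, the $\ell_2$-sensitivity of the clipped uploaded gradient is upper bounded by $2C$. By the Gaussian mechanism, adding zero-mean Gaussian noise with covariance $\sigma^2 I$ ensures $(\varepsilon, \delta)$-differential privacy when $\sigma$ satisfies the above condition. This proves Proposition 2. □
Substituting the sensitivity bound from Proposition 1 into the Gaussian mechanism yields a conservative per-round privacy budget of
$$\varepsilon = \frac{2C \sqrt{2 \ln(1.25/\delta)}}{\sigma}$$
This result provides an explicit relationship among the privacy budget $(\varepsilon, \delta)$, the clipping threshold $C$, and the noise intensity $\sigma$. A smaller $\varepsilon$ corresponds to stronger privacy protection, but usually requires a larger $\sigma$, which may increase the variance of the aggregated gradient.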
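The trade-off between privacy budget and noise intensity can be checked numerically with the sketch below. It assumes the classical Gaussian-mechanism calibration, noise standard deviation = sensitivity × sqrt(2 ln(1.25/delta)) / epsilon, with the sensitivity taken as twice the clipping threshold. That calibration is stated in the literature for epsilon between 0 and 1. The function names are hypothetical.

```python
import math


def gaussian_sigma(c, eps, delta):
    # Noise standard deviation needed for (eps, delta)-DP with l2-sensitivity 2 * c.
    sensitivity = 2.0 * c
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def per_round_epsilon(c, sigma, delta):
    # Inverse relation: privacy budget implied by a given noise standard deviation.
    return 2.0 * c * math.sqrt(2.0 * math.log(1.25 / delta)) / sigma


sigma_strict = gaussian_sigma(c=1.0, eps=0.25, delta=1e-5)
sigma_loose = gaussian_sigma(c=1.0, eps=0.5, delta=1e-5)
eps_back = per_round_epsilon(c=1.0, sigma=sigma_loose, delta=1e-5)
```

Halving `eps` doubles the required `sigma`, and a smaller clipping threshold `c` lowers it proportionally at the cost of a larger clipping bias.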
Furthermore, let $\mathcal{M}$ denote the upload mechanism consisting of gradient clipping and Gaussian perturbation, and let $S$ denote any event observable by the server. For any neighboring datasets $\mathcal{D}_d \sim \mathcal{D}_d'$, we have
$$\Pr\left[ \mathcal{M}(\mathcal{D}_d) \in S \right] \le e^{\varepsilon} \Pr\left[ \mathcal{M}(\mathcal{D}_d') \in S \right] + \delta$$
Thus, the server’s ability to distinguish neighboring datasets based on uploaded gradients is bounded by $\varepsilon$ and $\delta$, which limits the advantage of membership inference attacks. Moreover, gradient clipping limits the maximum contribution of local updates, and Gaussian perturbation weakens the deterministic mapping between local data and uploaded gradients, thereby mitigating the risks of gradient inversion and membership inference attacks [24]. In addition, since the uploaded gradient $\hat{g}_d$ equals the clipped gradient $\tilde{g}_d$ plus independent Gaussian perturbation of intensity $\sigma$,
$$\mathbb{E}\left[ \lVert \hat{g}_d - \tilde{g}_d \rVert_2^2 \right] = \sigma^2 m$$
where $m$ denotes the dimension of the Critic parameter vector. This indicates that the unperturbed clipped gradient cannot be exactly recovered from a single noisy upload, thereby reducing the risk of gradient inversion attacks. Moreover, Fed-AdaPPO uploads only Critic gradients, while raw task data, task queues, user information, and Actor policy parameters remain local, further reducing the possibility of sensitive scheduling information leakage.
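Next, the influence of server-side noise aggregation on training stability is analyzed. Let $z_d \sim \mathcal{N}(0, \sigma^2 I)$ denote the Gaussian noise added by DC d. Since the noises added by different DCs are independent, the equivalent noise after server-side averaging is
$$\bar{z} = \frac{1}{D} \sum_{d=1}^{D} z_d$$
which satisfies
$$\bar{z} \sim \mathcal{N}\left( 0,\; \frac{\sigma^2}{D} I \right), \qquad \mathbb{E}\left[ \lVert \bar{z} \rVert_2^2 \right] = \frac{\sigma^2 m}{D}$$
where $m$ is the dimension of the Critic parameter vector. Therefore, as the number of participating DCs D increases, the variance of the aggregated noise decreases at a rate of 1/D. This property indicates that, with more participating DCs, the disturbance caused by privacy noise in the global Critic update direction can be partially offset by averaging, thereby improving training stability. Since each DC only uploads Critic gradients and receives the updated Critic parameters in each communication round, if $\phi \in \mathbb{R}^m$, the per-round communication complexity and server-side aggregation complexity are both $O(Dm)$.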
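The 1/D scaling of the averaged privacy noise can be checked with a small Monte Carlo experiment. The sketch below draws independent Gaussian noise vectors for each of the D clients, averages them, and estimates the expected squared norm of the average. The dimension, trial count, and seed are arbitrary illustrative choices.

```python
import random


def averaged_noise_energy(d, dim, sigma, trials, seed=0):
    # Monte Carlo estimate of E[||mean of d noise vectors||^2].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        averaged = [sum(rng.gauss(0.0, sigma) for _ in range(d)) / d
                    for _ in range(dim)]
        total += sum(x * x for x in averaged)
    return total / trials


# Theory predicts sigma^2 * dim / d, i.e. about 8 for d = 1 and about 0.5 for d = 16.
energy_1 = averaged_noise_energy(d=1, dim=8, sigma=1.0, trials=2000)
energy_16 = averaged_noise_energy(d=16, dim=8, sigma=1.0, trials=2000)
```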
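Proposition 3. Under Assumption 1 and an appropriately selected Critic learning rate $\alpha$, the noisy federated Critic update in Fed-AdaPPO satisfies the following average first-order convergence bound:
$$\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\left[ \lVert \nabla F(\phi^r) \rVert_2^2 \right] \le \frac{4\left( F(\phi^0) - F^* \right)}{\alpha R} + c_1 B^2 + c_2\, \alpha L \left( \sigma_g^2 + \frac{\sigma^2 m}{D} \right)$$
where $R$ is the number of communication rounds, $F^*$ is the lower bound of the global Critic loss function, and $c_1$ and $c_2$ are positive constants independent of the communication round.
Proof. Using the aggregated gradient $\bar{g}^r$, the Critic parameter update at communication round r can be written as follows:
$$\phi^{r+1} = \phi^r - \alpha \bar{g}^r$$
Considering client heterogeneity, clipping bias, stochastic sampling error, and differential-privacy noise, the update can be expressed as follows:
$$\phi^{r+1} = \phi^r - \alpha \left( \nabla F(\phi^r) + b^r + \xi^r + \bar{z}^r \right)$$
where $b^r$ is the heterogeneity- and clipping-induced bias, $\xi^r$ is the stochastic gradient noise, and $\bar{z}^r$ is the averaged privacy noise. By the L-smoothness of $F$, we have
$$F(\phi^{r+1}) \le F(\phi^r) + \left\langle \nabla F(\phi^r),\, \phi^{r+1} - \phi^r \right\rangle + \frac{L}{2} \lVert \phi^{r+1} - \phi^r \rVert_2^2$$
Substituting the update rule into the above inequality and taking conditional expectation under the bounded-bias and bounded-variance conditions in Assumption 1 yield
$$\mathbb{E}\left[ F(\phi^{r+1}) \mid \mathcal{F}_r \right] \le F(\phi^r) - \frac{\alpha}{4} \lVert \nabla F(\phi^r) \rVert_2^2 + \frac{c_1 \alpha}{4} B^2 + \frac{c_2\, \alpha^2 L}{4} \left( \sigma_g^2 + \frac{\sigma^2 m}{D} \right)$$
Summing the above inequality over r = 0, …, R − 1 and using $F(\phi^R) \ge F^*$ gives the bound in Proposition 3. This completes the proof. □
Proposition 3 shows that the Critic update of Fed-AdaPPO converges to a first-order stationary neighborhood. The size of this neighborhood is jointly determined by the heterogeneity and clipping bias $B^2$, the stochastic gradient variance $\sigma_g^2$, and the differential-privacy noise term $\sigma^2 m / D$. In particular, as D increases, the privacy-noise term decreases, indicating that multi-DC aggregation can alleviate the negative influence of noisy gradients on training stability.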
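The dependence of the stationary neighborhood on D can be illustrated on a toy problem. The sketch below is not the Critic loss of the paper. It assumes the quadratic objective F(phi) = 0.5 * ||phi||^2, identical clients, no clipping bias, and privacy noise as the only perturbation, then measures the average squared gradient norm over the second half of training. All names and parameter values are illustrative.

```python
import random


def tail_gradient_energy(d, rounds=2000, dim=10, lr=0.1, sigma=1.0, seed=0):
    # Noisy federated gradient descent on F(phi) = 0.5 * ||phi||^2.
    rng = random.Random(seed)
    phi = [1.0] * dim
    grad_sq = []
    for _ in range(rounds):
        # The exact gradient of F is phi itself; record its squared norm.
        grad_sq.append(sum(p * p for p in phi))
        # Each of the d clients uploads phi plus its own Gaussian noise;
        # the server averages the uploads and takes one descent step.
        aggregated = [p + sum(rng.gauss(0.0, sigma) for _ in range(d)) / d
                      for p in phi]
        phi = [p - lr * g for p, g in zip(phi, aggregated)]
    tail = grad_sq[rounds // 2:]
    return sum(tail) / len(tail)


energy_single = tail_gradient_energy(d=1)
energy_many = tail_gradient_energy(d=16)
```

In this toy setting the iterates do not converge to the minimizer but fluctuate around it, and the fluctuation level drops when more clients are averaged, which is the qualitative behavior described by the privacy-noise term of the bound.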
The Actor network is updated locally through the clipped PPO objective, and its parameters are not involved in federated aggregation. Since the clipped PPO objective constrains the magnitude of each policy update, the Actor update can be regarded as local policy improvement based on the stabilized Critic estimate. The training process of Fed-AdaPPO can therefore be viewed as a privacy-constrained, two-level stochastic optimization: the Critic layer learns a collaborative value function through noisy federated gradient aggregation across DCs, and the Actor layer performs policy optimization locally.