The proposed approach adopts the Multi-Agent Markov Decision Process (MAMDP) as the underlying framework for MADRL. An MAMDP can be formally defined as a tuple $\langle N, S, A, P, R, \gamma \rangle$. Here, $N$ denotes the number of agents. $S_n$ represents the state set of agent $n$, where $s_n^t$ signifies the state of agent $n$ at time step $t$. The aggregation of the states of all agents constitutes the joint state space $S = S_1 \times \cdots \times S_N$, with the joint state $s^t = (s_1^t, \ldots, s_N^t)$. Similarly, $A_n$ denotes the action set of agent $n$, where $a_n^t$ represents the action selected by agent $n$ at time step $t$. The joint action space is formed by the aggregation of all individual action sets, $A = A_1 \times \cdots \times A_N$, with the joint action $a^t = (a_1^t, \ldots, a_N^t)$. $P(s^{t+1} \mid s^t, a^t)$ is the state transition probability function, representing the probability of transitioning to the next state $s^{t+1}$ when the joint action $a^t$ is executed in state $s^t$. $R$ is the reward function, indicating the reward $r^t = R(s^t, a^t)$ received by the agents from the environment after executing the joint action $a^t$ in state $s^t$. Additionally, $\gamma \in [0, 1)$ represents the reward discount factor. At each time step $t$, each agent executes an action based on its observation, and these actions jointly act upon the environment. The agents then receive reward feedback from the environment, and the environmental state transitions to the next state $s^{t+1}$. Through this cyclical interaction with the environment, the agents use the generated data to optimize their policies—the mappings from states to actions—so as to maximize the cumulative return.
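The interaction loop described above can be sketched as follows. The environment dynamics, reward, and policy here are toy placeholders (not the community energy model), used only to illustrate the MAMDP cycle of joint action, shared reward, and state transition.

```python
class ToyMAMDP:
    """Minimal illustrative MAMDP with N agents and scalar per-agent states.

    The transition and reward below are placeholders; in the paper the
    environment is the community energy system.
    """

    def __init__(self, n_agents, gamma=0.99):
        self.n_agents = n_agents
        self.gamma = gamma
        self.state = [1.0] * n_agents  # joint state s^t = (s_1^t, ..., s_N^t)

    def step(self, joint_action):
        # Toy transition: each agent's state drifts by its own action.
        next_state = [s + a for s, a in zip(self.state, joint_action)]
        # Toy shared reward: negative total deviation from zero.
        reward = -sum(abs(s) for s in next_state)
        self.state = next_state
        return next_state, reward


def rollout(env, policy, horizon):
    """Collect one trajectory and its discounted return G = sum_t gamma^t r^t."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        joint_action = [policy(s) for s in env.state]  # decentralized decisions
        _, r = env.step(joint_action)
        g += discount * r
        discount *= env.gamma
    return g


env = ToyMAMDP(n_agents=3)
ret = rollout(env, policy=lambda s: -0.5 * s, horizon=10)
```

Under this placeholder policy each agent halves its state deviation per step, so the shared reward improves toward zero over the horizon.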
3.4. Deep Reinforcement Learning Framework Based on Spatiotemporal Attention Mechanism
To address the challenges of partial observability and the privacy protection requirements among multi-agents in community energy management, this paper adopts a CTDE architecture and proposes a MATPPO algorithm that integrates LSTM and Transformer. The network framework of this algorithm is illustrated in
Figure 3.
During the distributed execution phase, to accommodate the strong stochasticity of source-load variations and to preserve data privacy, each agent relies solely on local observations for independent decision-making. Given that PV output, load demand, and dynamic prices exhibit strong sequential dependencies over time, simple fully connected networks cannot effectively capture their future evolution. Consequently, this study integrates an LSTM network into the Actor as a state encoder to process the observation sequences over the upcoming three-hour forecast window and extract long-term temporal features. Compared with relying directly on discrete point forecasts, the high-dimensional hidden states extracted by the LSTM preserve the probability distribution and temporal trend characteristics of the upcoming source-load fluctuations. This feature representation gives agents a forward-looking perspective, enabling robust scheduling decisions even under source-load uncertainty. Furthermore, considering the homogeneity of prosumer devices within the community, and to avoid network parameter explosion as the number of agents increases, a parameter-sharing mechanism is adopted in the Actor network: all agents share the same set of Actor network weights for policy learning while performing decentralized inference based on their respective local observations. Finally, the temporal dependency features are concatenated with the device internal-state triplet and linearly mapped to construct the final local observation state feature for agent $n$. Based on this feature, the shared Actor network outputs a hybrid action distribution.
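The parameter-sharing idea can be sketched in a few lines: a single weight set is applied independently to each agent's local observation sequence, so inference stays decentralized while learning is shared. A toy exponential-smoothing recurrent cell stands in for the LSTM here, and all dimensions and weight values are illustrative placeholders.

```python
import math
import random

random.seed(0)


class SharedActor:
    """One weight set shared by all agents; each agent runs its own forward pass."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha  # toy recurrent smoothing weight (stands in for LSTM gates)
        self.w = [random.gauss(0, 1) for _ in range(2)]  # linear policy-head weights

    def encode(self, obs_seq):
        # Toy recurrent encoder over the forecast window, standing in for
        # the LSTM hidden state that summarizes the temporal trend.
        h = 0.0
        for x in obs_seq:
            h = self.alpha * h + (1 - self.alpha) * x
        return h

    def act(self, obs_seq, device_state):
        h = self.encode(obs_seq)
        feat = [h, device_state]  # concat temporal feature with device internal state
        logit = sum(w * f for w, f in zip(self.w, feat))
        return math.tanh(logit)   # bounded continuous action


shared = SharedActor()
# Two agents, identical weights, different local observations -> different actions.
a1 = shared.act([0.2, 0.4, 0.6], device_state=0.5)
a2 = shared.act([0.9, 0.1, 0.3], device_state=-0.2)
```

Because both calls go through the same `shared.w`, one gradient update would improve the policy for every agent simultaneously, which is the point of the sharing mechanism.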
In the centralized training phase, to tackle the dual shortcomings of traditional multi-agent algorithms—namely, the input dimensionality explosion of the value network caused by crude concatenation of global states, and the inability of mean-field methods to capture complex game relationships—this paper introduces a Transformer encoder into the centralized value network (Critic) to establish a spatial attention mechanism. Under the CTDE architecture, the single Critic network is deployed on the aforementioned Cloud Trading Platform during the training phase, and its input is reconstructed as a feature sequence that includes a global aggregation feature characterizing the collective trading behaviors of the community. In the community energy system, the interaction relationships among prosumers change dynamically in real time according to supply and demand states. For instance, when prosumer $i$ is in a power-deficit state, it will prioritize attention toward a neighbor $j$ that is in a surplus state. The Transformer utilizes a Multi-Head Self-Attention mechanism to accurately model this dynamic coupling at both the physical and economic levels.
Specifically, for each agent $i$, the network maps its state embedding $h_i$ to a Query vector $Q_i$, a Key vector $K_i$, and a Value vector $V_i$ using the learnable weight matrices $W_Q$, $W_K$, and $W_V$:

$$Q_i = W_Q h_i, \qquad K_i = W_K h_i, \qquad V_i = W_V h_i$$
Subsequently, the attention coefficient $\alpha_{ij}$ between agent $i$ and each other agent $j$ is calculated; it characterizes the importance and coupling strength of agent $j$ to the decision-making of agent $i$, as shown below:

$$\alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{Q_i K_j^{\mathsf{T}}}{\sqrt{d_k}}\right)$$
where $K_j^{\mathsf{T}}$ is the transpose of the key vector of agent $j$, and $d_k$ is the feature dimension of the key vector. The denominator $\sqrt{d_k}$ acts as a scaling factor that normalizes the dot product. The softmax function is applied over all neighbor agents $j$ (i.e., along the sequence dimension).
By leveraging this attention mechanism, the Critic network automatically filters redundant information and adaptively reconstructs the current critical interaction topology from the global joint state embedding. Consequently, it generates a feature vector $c_i$ that integrates global collaborative information, as calculated in Equation (24). This feature vector not only alleviates the limitation of a single agent’s restricted field of view but also provides a value benchmark with a global perspective for computing the advantage function in the PPO algorithm.
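The attention aggregation described here can be sketched numerically. The agent embeddings and weight matrices below are random placeholders (in training they would be learned), and the feature dimension is kept tiny for readability.

```python
import math
import random

random.seed(1)


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_context(embeddings, d_k=2):
    """Scaled dot-product attention over agent embeddings.

    Returns one context vector per agent: c_i = sum_j alpha_ij * V_j,
    with alpha_ij = softmax_j(Q_i . K_j / sqrt(d_k)).
    """
    d = len(embeddings[0])
    rand_mat = lambda: [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_k)]
    W_q, W_k, W_v = rand_mat(), rand_mat(), rand_mat()  # learnable in training; random here
    Q = [matvec(W_q, h) for h in embeddings]
    K = [matvec(W_k, h) for h in embeddings]
    V = [matvec(W_v, h) for h in embeddings]
    contexts = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)                       # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha = [e / z for e in exps]         # attention coefficients alpha_ij
        c = [sum(a * v[t] for a, v in zip(alpha, V)) for t in range(d_k)]
        contexts.append(c)
    return contexts


# Three agents with 3-dimensional state embeddings.
h = [[0.4, -0.1, 0.7], [0.9, 0.2, -0.3], [-0.5, 0.6, 0.1]]
C = attention_context(h)
```

A multi-head version would simply run several such maps in parallel and concatenate the resulting context vectors.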
Given that community energy management is intrinsically a control problem within a continuous action space, the PPO algorithm based on the Actor–Critic framework is well suited to effectively address challenges associated with high-dimensional state spaces and stochasticity. Thus, this paper adopts the PPO algorithm for policy optimization. The specific execution and parameter update process of the MATPPO algorithm is illustrated in
Figure 4. The architecture adopts a parameter-separated design. For the Actor network, each agent uses only its local observation state feature as input, constructing an independent decision-making structure for the hybrid action space. This design ensures that agents require no communication with one another during the execution phase; relying solely on local information to make decisions preserves operational privacy and guarantees response speed. In contrast, the Critic network fully leverages the global information available at the training center: it takes the sequence of context vectors output by the Transformer, aggregates them via a global pooling layer, and uses the resulting feature to fit the global state value function. The theoretical definition of the state value function is presented in Equation (25). This function characterizes the expected cumulative return obtained by executing the hybrid policy under the global state. The estimated value output by the value network is used not only to compute the advantage function that assesses the quality of current actions but also to drive the joint update of the Transformer attention weights and the Critic fully connected layer parameters by minimizing the prediction error. We denote the parameter set of the entire network as $\theta = \{\theta_\pi, \theta_v\}$, where $\theta_\pi$ represents the shared Actor parameters and $\theta_v$ includes the Transformer and Critic parameters.
During the interaction phase, agents execute decentralized actions based on the current shared policy. The generated transition data—comprising local observations, individual actions, the global state, and the global reward—are stored in a rollout buffer. Once a sufficient batch of trajectories has been collected, the algorithm uses this data to update the parameters of both the Actor and the Transformer-based Critic through the following two key steps:
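The collection step can be sketched as a simple on-policy buffer; the field names below are illustrative, not the paper's implementation.

```python
from collections import namedtuple

# One transition as described in the text: local observations, individual
# actions, the global state, and the shared global reward.
Transition = namedtuple("Transition", ["local_obs", "actions", "global_state", "reward"])


class RolloutBuffer:
    """Accumulates on-policy transitions until a training batch is ready."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.storage = []

    def add(self, *fields):
        self.storage.append(Transition(*fields))

    def ready(self):
        return len(self.storage) >= self.batch_size

    def drain(self):
        # On-policy data is consumed once per update, then discarded.
        batch, self.storage = self.storage, []
        return batch


buf = RolloutBuffer(batch_size=4)
for t in range(5):
    buf.add([0.1 * t, 0.2 * t], [0, 1], [0.1 * t, 0.2 * t, 0.3], -1.0)
ready = buf.ready()
batch = buf.drain()
```

Draining (rather than sampling with replacement) reflects PPO's on-policy nature: stale trajectories from an older policy would bias the clipped objective.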
- (1)
Generalized Advantage Estimation
To effectively evaluate the quality of the current action $a^t$ while balancing bias and variance in the policy gradient estimate, we employ the Generalized Advantage Estimation (GAE) technique. Using the state value $V(s^t)$ output by the Critic network, the GAE advantage function $\hat{A}^t$ at time step $t$ is computed as follows:

$$\delta^t = r^t + \gamma V(s^{t+1}) - V(s^t), \qquad \hat{A}^t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta^{t+l}$$

where $r^t$ is the community global immediate reward; $\delta^t$ is the Temporal-Difference (TD) error, reflecting the deviation between the immediate reward and the expected state value; $T$ is the time horizon of the sampled trajectory; and $\lambda$ is the GAE smoothing factor that regulates the trade-off between variance and bias.
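The GAE computation admits a compact backward recursion, since the truncated sum satisfies $\hat{A}^t = \delta^t + \gamma\lambda \hat{A}^{t+1}$. The rewards and values below are placeholder numbers.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    `values` must contain one extra bootstrap entry V(s^T).
    delta^t = r^t + gamma * V(s^{t+1}) - V(s^t)
    A^t = sum_l (gamma * lam)^l * delta^{t+l}, evaluated backwards in O(T).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.5, -0.2]
values = [0.8, 0.6, 0.4, 0.0]  # includes the bootstrap value V(s^T) = 0
adv = gae_advantages(rewards, values)
```

Setting `lam=0` recovers the one-step TD error (low variance, high bias); `lam=1` recovers the full Monte Carlo advantage (high variance, low bias).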
- (2)
Optimization Targets and Parameter Updates
To implement the CTDE architecture and prevent training collapse caused by excessively large policy update steps, we construct independent optimization targets for the Actor and the Critic. For the Actor network, we define a policy objective function $L^{\mathrm{CLIP}}(\theta_\pi)$ that incorporates the PPO clipping mechanism and an entropy regularization term. For the Critic network, we define a value loss function $L^{V}(\theta_v)$ based on the Mean Squared Error (MSE), as detailed below:

$$L^{\mathrm{CLIP}}(\theta_\pi) = \mathbb{E}\!\left[\min\!\left(\rho^t \hat{A}^t,\; \operatorname{clip}(\rho^t, 1-\epsilon, 1+\epsilon)\, \hat{A}^t\right)\right] + \beta\, \mathcal{H}(\pi_{\theta_\pi}), \qquad \rho^t = \frac{\pi_{\theta_\pi}(a_n^t \mid o_n^t)}{\pi_{\theta_\pi^{\mathrm{old}}}(a_n^t \mid o_n^t)}$$

$$L^{V}(\theta_v) = \mathbb{E}\!\left[\left(V_{\theta_v}(s^t) - V^{\mathrm{tar}}_t\right)^2\right]$$

where $o_n^t$ serves as the local observation input to the decentralized Actor of agent $n$, while $s^t$ represents the global joint state fed into the centralized Critic. To encourage exploration and prevent premature convergence, the policy entropy $\mathcal{H}(\pi_{\theta_\pi})$ is introduced, scaled by the regularization coefficient $\beta$. Furthermore, $V^{\mathrm{tar}}_t$ denotes the target state value. Equation (30) formulates the clipped surrogate objective $L^{\mathrm{CLIP}}(\theta_\pi)$. By bounding the probability ratio $\rho^t$ (defined in Equation (30)) between the current and previous policies, the clipping mechanism effectively prevents destructively large policy updates, thereby guaranteeing monotonic improvement and training stability.
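A minimal numeric sketch of the two optimization targets, under stated assumptions: log-probabilities stand in for the policy densities, the batch has two transitions, and the clipping and entropy coefficients are illustrative.

```python
import math


def ppo_losses(logp_new, logp_old, advantages, values, value_targets,
               entropy, eps=0.2, beta=0.01):
    """Clipped surrogate objective (to maximize) and MSE value loss (to minimize)."""
    n = len(advantages)
    policy_obj = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)              # rho^t = pi_new / pi_old
        clipped = max(min(ratio, 1 + eps), 1 - eps)    # clip(rho^t, 1-eps, 1+eps)
        policy_obj += min(ratio * adv, clipped * adv)  # pessimistic (lower) bound
    policy_obj = policy_obj / n + beta * entropy       # entropy bonus aids exploration
    value_loss = sum((v - vt) ** 2 for v, vt in zip(values, value_targets)) / n
    return policy_obj, value_loss


pol, vl = ppo_losses(
    logp_new=[-0.9, -1.2], logp_old=[-1.0, -1.0],
    advantages=[1.0, -0.5], values=[0.3, 0.7],
    value_targets=[0.5, 0.4], entropy=1.2,
)
```

Taking the `min` of the unclipped and clipped terms means the objective never rewards pushing the ratio beyond the trust interval in the direction the advantage favors, which is what bounds the update step.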
Through the above parameter-sharing and joint gradient-update mechanism, the shared Actor network can absorb and generalize the exploration experiences of all $N$ agents under different local states within a single backpropagation pass. This effectively avoids the curse of dimensionality and the parameter explosion caused by growth in the number of nodes, and greatly improves the convergence efficiency of the algorithm in large-scale community scenarios.