As shown in Figure 1, the framework comprises three tightly coupled layers. (i) A topology-aware STT ingests Node Temporal Feature Embeddings together with a graph-based Spatial Embedding (the graph icon denotes adjacency/diffusion/priors). It stacks alternating Spatial Transformer and Temporal Transformer blocks (STT Blocks × L) and a Prediction Head to produce short-term, multi-horizon node forecasts $\hat{Y}$. (ii) A cooperative controller adopts CTDE with an Agents pool and a lightweight Coordinator that computes congestion-aware couplings $w_{ij}$ from path headroom, modulating a sparsified, permutation-invariant interaction critic; consensus exploration is applied along electrical neighborhoods. Actions are optimized under a single shared reward that includes a coordination term (weighted by the couplings $w_{ij}$) to promote congestion-aware cooperation, rather than separate “agent” and “coordinator” rewards. (iii) A Safety Layer performs Device Projection at the device level (bounds, ramp limits, SoC gating, on/off feasibility) and then solves a compact, sensitivity-based QP to enforce feeder voltage/current limits, outputting feasible set-points $\tilde{u}_t$. Together, these modules realize a closed loop of forecast → multi-agent decision → safe execution over real DERs (PV, storage, and controllable loads), improving reliability and scalability while preserving interpretability and real-time feasibility.
2.1. Topology-Aware Spatio-Temporal Transformer (STT)
The Topology-Aware Spatio-Temporal Transformer (STT) is designed to address the unique challenges of coordinating DERs in complex power systems. This model integrates spatial and temporal forecasting by leveraging the strengths of transformer architectures while accounting for the inherent topology of the distribution network.
We model the distribution network as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes (DERs or aggregations). Over an input window of length $T$, node features are stacked as $X \in \mathbb{R}^{N \times T \times F}$, where the goal is to forecast $H$ future steps for all nodes. For each node $i$, we first extract its temporal sequence $X_i \in \mathbb{R}^{T \times F}$ and apply feature embedding:
$$Z_i^{(0)} = X_i W_e + E_{\mathrm{pos}},$$
where $W_e \in \mathbb{R}^{F \times d}$ is a linear projection to dimension $d$ and $E_{\mathrm{pos}}$ denotes positional encodings. In practice, we sum multi-resolution temporal encodings (time-of-day and day-of-week) with a learned affine calibration; this preserves seasonality while letting the model adapt to feeder-specific rhythms. Inputs are standardized per channel using training-set statistics, and a binary mask handles missing samples without information leakage. In our SimBench experiments there is no missingness; the mask is part of the general implementation.
The encoder alternates temporal attention (along each node’s timeline) and spatial attention (across nodes within a time slice), each followed by pre-norm LayerNorm, residual connections, and a lightweight position-wise feed-forward block (GELU + linear). Stacking $L$ such layers empirically balances accuracy and latency; we use modest $d$ and head counts to keep the memory footprint linear in $T$ and quasi-linear in $N$ after sparsification. A minimal sketch of one such block is given below.
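To make the block structure concrete, the following is a minimal PyTorch sketch of one STT block; the layer sizes, module names, and the additive spatial attention mask are illustrative assumptions, not the paper’s exact code.

```python
# A sketch of one STT block: temporal attention along each node's timeline,
# spatial attention across nodes within a time slice, pre-norm residuals,
# and a GELU feed-forward block. Sizes/names are assumed for exposition.
import torch
import torch.nn as nn

class STTBlock(nn.Module):
    def __init__(self, d: int = 64, heads: int = 4):
        super().__init__()
        self.t_norm = nn.LayerNorm(d)
        self.t_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.s_norm = nn.LayerNorm(d)
        self.s_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.f_norm = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, z: torch.Tensor, spatial_bias=None) -> torch.Tensor:
        # z: (N, T, d) -- N nodes, T time steps, d channels.
        h = self.t_norm(z)                         # pre-norm
        z = z + self.t_attn(h, h, h)[0]            # temporal attention per node
        zs = z.transpose(0, 1)                     # (T, N, d): nodes as sequence
        h = self.s_norm(zs)
        # spatial_bias: optional (N, N) additive float mask carrying kernel biases.
        zs = zs + self.s_attn(h, h, h, attn_mask=spatial_bias)[0]
        z = zs.transpose(0, 1)
        return z + self.ffn(self.f_norm(z))        # position-wise feed-forward

# Usage: stack L blocks, then a prediction head maps (N, T, d) to H-step outputs.
z = torch.randn(8, 24, 64)                         # 8 nodes, 24-step window
out = STTBlock()(z)                                # (8, 24, 64)
```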
Temporal attention updates each time step by attending over the node’s own history:
$$\mathrm{TA}(Z_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i,$$
where $Q_i$, $K_i$, $V_i$ are query, key, and value projections of $Z_i$. During training we default to a direct multi-horizon head to avoid exposure bias; when autoregression is used (e.g., for long horizons), we apply causal masks and scheduled sampling to stabilize rollouts.
Topology-aware spatial attention injects graph structure and electrical priors into attention scores at each time slice. Let $A$ denote adjacency and $L_{\mathcal{G}} = D - A$ the graph Laplacian, where $D$ is the degree matrix. We build a smooth proximity kernel by graph diffusion
$$K_{\mathrm{diff}} = \exp(-\beta L_{\mathcal{G}}),$$
where $\beta > 0$ controls the diffusion range (typically chosen based on the network diameter). We combine this with optional electrical priors $K_{\mathrm{elec}}$ (e.g., phase consistency, feeder membership, or inverse-impedance surrogates). The spatial attention logit between nodes $i$ and $j$ at time $t$ is
$$e_{ij}^{(t)} = \frac{(q_i^{(t)})^{\top} k_j^{(t)}}{\sqrt{d}} + \alpha_1 [K_{\mathrm{diff}}]_{ij} + \alpha_2 [K_{\mathrm{elec}}]_{ij},$$
with attention weights $a_{ij}^{(t)} = \mathrm{softmax}_{j \in \mathcal{N}(i)}\big(e_{ij}^{(t)}\big)$ over a sparsified neighborhood $\mathcal{N}(i)$, which is hop-limited with a top-$k$ cutoff in our implementation. The coefficients $\alpha_1, \alpha_2$ are learned and passed through a softplus function to keep them nonnegative, ensuring that structural and electrical proximity never become adversarially down-weighted. We precompute $K_{\mathrm{diff}}$ and $K_{\mathrm{elec}}$ per feeder, and optionally refresh them when a reconfiguration is detected. This amortizes runtime while remaining robust to moderate topology drift. After sparsification, the spatial step scales as $O\big(\sum_i |\mathcal{N}(i)|\, d\big)$ per head, with $|\mathcal{N}(i)| \ll N$.
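As an illustration of the kernel construction, the sketch below computes the diffusion kernel and the kernel-biased logits with NumPy/SciPy; the function names, toy feeder, and fixed coefficient values are assumptions for exposition (in the model, $\alpha_1, \alpha_2$ are learned through a softplus).

```python
# A sketch of the graph-diffusion proximity kernel K_diff = expm(-beta * L)
# and the kernel-biased spatial attention logits. Values are illustrative.
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A: np.ndarray, beta: float) -> np.ndarray:
    """Smooth proximity kernel from the graph Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return expm(-beta * L)

def spatial_logits(q, k, K_diff, K_elec, alpha1, alpha2):
    """Dot-product logits biased by structural and electrical proximity."""
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d) + alpha1 * K_diff + alpha2 * K_elec

# Toy 3-node line feeder: 0 -- 1 -- 2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
K = diffusion_kernel(A, beta=1.0)
print(K[0, 1] > K[0, 2])  # closer nodes get higher proximity -> True
```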
A lightweight prediction head maps the final spatio-temporal embeddings to multi-horizon outputs. For direct forecasting we use a two-layer head (linear → GELU → linear) to produce the full forecast $\hat{Y}$ in one shot; for autoregressive decoding the same head shares weights across steps. Dropout is applied only in the feed-forward and attention projections to preserve the inductive biases of the kernels in (4) and (5).
The model is trained end-to-end with a node- and horizon-averaged prediction loss and a lightweight topology–physics regularizer:
$$\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{topo}}\, \mathcal{L}_{\mathrm{topo}} + \lambda_{\mathrm{phys}}\, \mathcal{L}_{\mathrm{phys}},$$
where $\mathcal{L}_{\mathrm{topo}}$ promotes graph-consistent smoothness and $\mathcal{L}_{\mathrm{phys}}$ penalizes linearized power-flow inconsistencies. Concretely, we instantiate
$$\mathcal{L}_{\mathrm{topo}} = \sum_{i,j} \omega_{ij}\, \big\| \hat{y}_i - \hat{y}_j \big\|_2^2,$$
with nonnegative $\omega_{ij}$ learned and row-normalized to keep the scale of (7) compatible with $\mathcal{L}_{\mathrm{pred}}$. This Laplacian quadratic form is equivalent to summing squared differences across edges weighted by structural/electrical proximity and empirically stabilizes training in sparse feeders.
To connect forecasts with network physics, we define a sensitivity-based residual using a linearized DistFlow/AC model around the latest operating point. Let $\Delta \hat{p}$ denote predicted net injections and $\Delta \hat{v}$ the predicted voltage-magnitude deviations (all taken from the relevant channels of $\hat{Y}$). With precomputed sensitivities $S_v$, we set
$$\mathcal{L}_{\mathrm{phys}} = \big\| \Delta \hat{v} - S_v\, \Delta \hat{p} \big\|_2^2,$$
which softly enforces first-order voltage consistency without running a full power flow in the learning loop. We stop gradients through $S_v$ and normalize each term by feeder-specific scales to avoid dominance by a single channel.
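For concreteness, a minimal sketch of the two regularizers follows; tensor shapes and names are assumed, and $S_v$ is treated as a constant via gradient stopping, matching the text.

```python
# A sketch of the topology regularizer (Laplacian quadratic form over learned,
# row-normalized weights) and the first-order voltage-consistency residual.
import torch

def topo_loss(y_hat: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # y_hat: (N, H) forecasts; W: (N, N) nonnegative, row-normalized weights.
    diff = y_hat.unsqueeze(1) - y_hat.unsqueeze(0)      # (N, N, H) pairwise gaps
    return (W * diff.pow(2).sum(-1)).sum() / W.numel()  # weighted edge smoothness

def phys_loss(dv_hat: torch.Tensor, dp_hat: torch.Tensor, S_v: torch.Tensor) -> torch.Tensor:
    # Penalize deviation from the linearized voltage response dv ~= S_v @ dp;
    # S_v is detached so no gradients flow through the sensitivities.
    return (dv_hat - dp_hat @ S_v.detach().T).pow(2).mean()
```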
Implementation-wise, both kernels ($K_{\mathrm{diff}}$, $K_{\mathrm{elec}}$) are ablated in experiments to quantify their marginal utility; turning off $K_{\mathrm{elec}}$ reduces spatial attention to structure-only, turning off $K_{\mathrm{diff}}$ further collapses to hop-limited attention, and disabling both kernel biases recovers a vanilla Transformer. Hyperparameters are selected on a validation feeder; we found modest $L$ and $d$ together with small regularization weights $\lambda_{\mathrm{topo}}, \lambda_{\mathrm{phys}}$ to yield stable gains across feeders. At inference time, all preprocessing (standardization, kernel lookup, neighborhood construction) is batched, and the overall complexity scales linearly in $T$ and quasi-linearly in $N$, which fits comfortably within real-time constraints for the feeder sizes considered.
2.2. Cooperative Multi-Agent Reinforcement Learning with Adaptive Coordination
We cast DER coordination as a cooperative Markov game with $N$ agents. Beyond standard MADDPG, we introduce three key innovations: (i) hierarchical action decomposition for handling hybrid discrete-continuous control, (ii) adaptive coordination weights based on network congestion, and (iii) consensus-based exploration for improved sample efficiency. From a systems perspective, these choices reflect practical DER actuation: modes (on/off/charge/discharge) are intrinsically discrete, whereas set-points are continuous and must respect ramp, SoC, and inverter capability limits; network congestion further couples otherwise local decisions along electrically salient paths. We adopt centralized training and decentralized execution (CTDE); actors are local policies executed with each agent’s observation, while a centralized critic leverages global (or enlarged) information during training to shape credit assignment and stabilize learning. At deployment, each agent uses only local (plus short-horizon forecast) information, preserving scalability and privacy.
At time $t$, the environment state is $s_t$, agent $i$ observes $o_t^i$, and selects a hybrid action $a_t^i = (m_t^i, u_t^i)$, where $m_t^i$ represents discrete modes (e.g., charge/discharge/idle) and $u_t^i$ is the continuous power set-point. Observations include local measurements (voltages, powers, and device states) and neighborhood summaries; normalization and clipping are applied per channel using training-set statistics to avoid out-of-distribution magnitudes. We adopt per-channel affine normalization with statistics frozen after the warm-up phase and clip observations to a fixed symmetric range to bound critic targets.
Each agent employs a hierarchical policy as follows:
$$\pi^i\big(a_t^i \mid o_t^i\big) = \pi^i_{\mathrm{disc}}\big(m_t^i \mid o_t^i\big)\; \pi^i_{\mathrm{cont}}\big(u_t^i \mid o_t^i, m_t^i\big),$$
where the discrete policy selects the operation mode, and the continuous policy determines the magnitude conditioned on the mode. In training, the discrete branch is implemented with a differentiable Gumbel-Softmax relaxation (straight-through at execution), and the continuous branch outputs a squashed Gaussian (Tanh) to respect device bounds. This decomposition naturally encodes mode-conditioned limits and facilitates credit assignment when discrete switches are sparse yet impactful. We anneal the Gumbel temperature $\tau_g$ from a high to a low value over the first $K$ epochs to gradually sharpen mode selection, and we clip the continuous standard deviation to a bounded interval to avoid vanishing/explosive exploration. Illegal modes (e.g., charging at $\mathrm{SoC} = \mathrm{SoC}_{\max}$) are masked by $-\infty$ logits, yielding zero probability mass and eliminating infeasible gradients.
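A compact sketch of such a hierarchical actor is given below; network sizes, head names, and the three-mode example are illustrative assumptions.

```python
# A sketch of the hierarchical actor: a Gumbel-Softmax discrete mode head with
# -inf masking of illegal modes, and a mode-conditioned tanh-squashed Gaussian
# for the continuous set-point. Sizes and names are assumed for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalActor(nn.Module):
    def __init__(self, obs_dim: int, n_modes: int = 3, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mode_head = nn.Linear(hidden, n_modes)
        self.mu_head = nn.Linear(hidden + n_modes, 1)
        self.log_std = nn.Parameter(torch.zeros(1))

    def forward(self, obs, legal_mask, tau: float = 1.0):
        h = self.trunk(obs)
        # Illegal modes get -inf logits: zero probability, no infeasible gradients.
        logits = self.mode_head(h).masked_fill(~legal_mask, float('-inf'))
        # Differentiable discrete sample; straight-through at execution time.
        mode = F.gumbel_softmax(logits, tau=tau, hard=True)
        mu = self.mu_head(torch.cat([h, mode], dim=-1))
        std = self.log_std.exp().clamp(1e-3, 1.0)        # bounded exploration
        u = torch.tanh(mu + std * torch.randn_like(mu))  # squashed set-point
        return mode, u

actor = HierarchicalActor(obs_dim=10)
obs = torch.randn(4, 10)
legal = torch.tensor([[True, True, False]] * 4)  # e.g., charging masked at SoC max
mode, u = actor(obs, legal)
```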
We introduce a congestion-aware coordination weight that dynamically adjusts inter-agent coupling as follows:
$$w_{ij}^t = \sigma\big(\gamma_1\, c_{ij}^t + \gamma_2\, |v_i^t - v_j^t|\big),$$
where $c_{ij}^t$ is the congestion index on the path between agents $i$ and $j$, $v_i^t, v_j^t$ are voltage magnitudes, and $\sigma(\cdot)$ is the sigmoid function. The index $c_{ij}^t$ aggregates thermal/voltage headroom along electrical paths computed from the current feeder configuration; gradients do not backpropagate through path selection, while the coefficients $\gamma_1, \gamma_2$ are learned (with softplus reparameterization for stability). Concretely, let $\mathcal{P}_{ij}$ denote the set of branches on the unique (radial) or selected (meshed) electrical path between the buses of agents $i$ and $j$. Define per-branch thermal headroom $h_b = 1 - \rho_b$ with loading ratio $\rho_b = |S_b| / S_b^{\max}$, and per-bus voltage headroom $h_n = \min(v_n - v_{\min},\, v_{\max} - v_n)$, where $v_{\min}$ and $v_{\max}$ delimit the admissible voltage band and $n$ ranges over buses on the path. We set a path headroom
$$h_{ij} = \min\Big( \min_{b \in \mathcal{P}_{ij}} h_b,\; \min_{n \in \mathcal{P}_{ij}} h_n \Big)$$
and normalize the congestion index as $c_{ij}^t = \mathrm{clip}(1 - h_{ij},\, 0,\, 1)$. This choice emphasizes the most limiting element along the corridor and yields a large $w_{ij}^t$ only when genuinely scarce headroom necessitates coordinated action. The centralized critic incorporates these weights as follows:
$$Q_{\mathrm{tot}}(s_t, \mathbf{a}_t) = \sum_i Q_i\big(s_t, a_t^i\big) + \sum_{(i,j) \in \mathcal{E}_c} w_{ij}^t\, Q_{ij}\big(s_t, a_t^i, a_t^j\big),$$
where $Q_i$ captures individual contributions and $Q_{ij}$ models pairwise interactions. The pairwise term is computed only over a sparsified electrical neighborhood (two-hop or $k$-nearest), keeping the critic evaluation $O(|\mathcal{E}_c|)$. We evaluate $Q_{ij}$ with a symmetric aggregator to preserve permutation invariance and detach $w_{ij}^t$ with regard to critic inputs to avoid feedback loops while learning $\gamma_1, \gamma_2$.
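The following sketch illustrates the path-headroom computation and the resulting coupling weight; the data layout and the fixed coefficient values `g1`, `g2` are assumptions (in training these coefficients are learned through a softplus).

```python
# A sketch of the congestion-aware coupling: path headroom is the minimum
# thermal/voltage headroom along the corridor between two agents, and the
# weight w_ij saturates through a sigmoid of the coefficient mix.
import numpy as np

def path_headroom(branch_loadings, buses):
    """branch_loadings: loading ratios rho_b of corridor branches;
    buses: (v, v_min, v_max) tuples for buses on the corridor."""
    h_thermal = min(1.0 - rho for rho in branch_loadings)
    h_voltage = min(min(v - vmin, vmax - v) for v, vmin, vmax in buses)
    return min(h_thermal, h_voltage)   # most limiting element on the corridor

def coupling_weight(h_ij, dv_ij, g1=4.0, g2=2.0):
    c_ij = np.clip(1.0 - h_ij, 0.0, 1.0)          # normalized congestion index
    return 1.0 / (1.0 + np.exp(-(g1 * c_ij + g2 * abs(dv_ij))))  # sigmoid

# Corridor with one heavily loaded branch and a bus near its upper voltage limit.
h = path_headroom(branch_loadings=[0.55, 0.92], buses=[(1.04, 0.95, 1.05)])
print(coupling_weight(h, dv_ij=0.02))  # large w_ij: coordination is warranted
```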
Each agent’s observation is augmented with STT predictions as follows:
$$\tilde{o}_t^i = \big[\, o_t^i;\; \hat{y}_{t+1:t+H}^i;\; c_t;\; \Delta v_t^i \,\big],$$
where $\hat{y}_{t+1:t+H}^i$ are local forecasts, $c_t$ is the network congestion index, and $\Delta v_t^i$ represents local voltage deviations. To mitigate covariate shift, forecasts are stop-gradient features and are standardized per horizon; we additionally include a recency flag and an uncertainty proxy (e.g., predictive variance) when available. Empirically, exposing only short-horizon summaries (e.g., next-step mean/variance or small-window embeddings) yields lower-variance critics than feeding full forecast trajectories; we therefore concatenate compact statistics and keep their gradients blocked.
To improve coordination during exploration, we introduce a consensus term in the exploration noise as follows:
$$\epsilon_t^i = (1 - \rho)\, \xi_t^i + \rho\, \bar{\epsilon}_t^{\,\mathcal{N}(i)},$$
where $\bar{\epsilon}_t^{\,\mathcal{N}(i)}$ is the average noise of neighboring agents and $\rho$ controls the consensus strength. Neighborhoods are electrical (not geometric), and $\rho$ is annealed from 0 to a small value early in training to encourage coherent but diverse exploration; an Ornstein–Uhlenbeck variant can be used for temporally correlated devices. We compute $\bar{\epsilon}_t^{\,\mathcal{N}(i)}$ from independent Gaussian draws $\xi_t^j$ to avoid circular dependence. The consensus term is disabled at evaluation time. Intuitively, when congestion couples agents along a corridor, correlated exploration accelerates discovery of coordinated policies while keeping dispersion low off-corridor due to the small $\rho$.
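A minimal sketch of the consensus-coupled noise follows; the dictionary-based neighborhood layout is an assumed convenience, not the actual data structure.

```python
# A sketch of consensus-coupled exploration: each agent blends its own Gaussian
# draw with the mean of its electrical neighbors' independent draws.
import numpy as np

def consensus_noise(neighbors: dict, sigma: float, rho: float, rng) -> dict:
    xi = {i: rng.normal(0.0, sigma) for i in neighbors}          # independent draws
    return {i: (1 - rho) * xi[i] + rho * np.mean([xi[j] for j in nbrs])
            for i, nbrs in neighbors.items()}                    # blended noise

rng = np.random.default_rng(0)
nbrs = {0: [1], 1: [0, 2], 2: [1]}      # a three-agent electrical corridor
print(consensus_noise(nbrs, sigma=0.1, rho=0.2, rng=rng))
```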
The shared reward balances multiple objectives as follows:
$$r_t = -\,\alpha_{\mathrm{cost}}\, C_t - \alpha_V\, P_t^{V} - \alpha_I\, P_t^{I} - \alpha_{\mathrm{tv}}\, \mathrm{TV}_t + \alpha_{\mathrm{coord}}\, R_t^{\mathrm{coord}},$$
where $R_t^{\mathrm{coord}}$ promotes coordination during congestion. Cost $C_t$ aggregates energy price and cycling penalties; $P_t^{V}$ and $P_t^{I}$ are hinge losses on voltage and current headroom; and $\mathrm{TV}_t$ is the total variation of actions over time. All weights are nonnegative and tuned on validation feeders; we clip rewards to a fixed range to stabilize critic targets. We normalize each term by feeder-level scales (e.g., average load or nominal voltage band) to balance gradients across feeders and adopt per-timestep reward clipping to a fixed symmetric interval for stability.
Training follows an off-policy CTDE pipeline with a replay buffer storing $\big(s_t, \mathbf{o}_t, \mathbf{a}_t, r_t, s_{t+1}, \mathbf{o}_{t+1}\big)$ and target networks. The critic is optimized by minimizing the temporal-difference loss
$$\mathcal{L}_Q = \mathbb{E}\Big[\big(Q_{\mathrm{tot}}(s_t, \mathbf{a}_t) - y_t\big)^2\Big], \qquad y_t = r_t + \gamma\, \bar{Q}_{\mathrm{tot}}\big(s_{t+1}, \bar{\pi}(\mathbf{o}_{t+1})\big),$$
where $\bar{Q}_{\mathrm{tot}}$ is the target critic with delayed parameters. For robustness against overestimation we use twin critics and target-min (TD3-style) for the continuous branch. The actor update maximizes the expected $Q$ under the hierarchical policy; with reparameterizations for both branches, the gradient takes the standard form
$$\nabla_{\theta_i} J = \mathbb{E}\Big[ \nabla_{a^i} Q_{\mathrm{tot}}(s_t, \mathbf{a}_t)\, \nabla_{\theta_i} \pi_i(o_t^i) + \beta\, \nabla_{\theta_i} \mathcal{H}(\pi_i) \Big],$$
where $\mathcal{H}$ denotes an entropy regularizer (on the discrete logits and continuous variance) with temperature $\beta$. Target networks are updated by Polyak averaging $\bar{\phi} \leftarrow \tau \phi + (1 - \tau)\bar{\phi}$, and similarly for actor parameters, with a small $\tau$. In addition to the pairwise critic, we optionally compute a counterfactual baseline
$$b^i\big(s_t, \mathbf{a}_t^{-i}\big) = \mathbb{E}_{a^i \sim \pi_i}\Big[ Q_{\mathrm{tot}}\big(s_t, (a^i, \mathbf{a}_t^{-i})\big) \Big],$$
which reduces gradient variance (COMA-style) and is evaluated efficiently using the sparsified interaction structure. We apply gradient clipping ($\ell_2$ norm bounded by a fixed threshold) and prioritized replay with importance weights to stabilize updates across feeders and stress conditions.
Implementation details improve stability and scalability. (i) We mask illegal modes in $\pi^i_{\mathrm{disc}}$ via $-\infty$ logits to enforce device availability and ramp limits at the policy level. (ii) The pairwise term $Q_{ij}$ uses a symmetric aggregator over neighbors to maintain permutation invariance and is computed on a pruned graph to bound cost. (iii) Congestion weights $w_{ij}^t$ are detached with regard to the critic input but keep gradients for the learnable coefficients $\gamma_1, \gamma_2$, preventing spurious feedback loops. (iv) We employ prioritized replay with importance-sampling corrections, and gradient-norm clipping for both actor and critic. (v) During data collection, actions are post-processed by the safety layer (Section 2.3) before environment execution; both raw and filtered actions are logged to decouple exploration from feasibility and to reduce distributional drift. Complexity-wise, training and inference scale linearly with the number of active electrical edges due to sparsified neighborhoods; decentralized execution requires only $\tilde{o}_t^i$ and neighbor summaries, meeting real-time constraints on the studied feeders. Ablations show that replacing the hierarchical policy with a flat continuous policy degrades discrete-switching efficiency, fixing $w_{ij}^t$ to a constant harms performance under heavy congestion, and disabling consensus noise delays the emergence of corridor-level coordination.
2.3. Safety Layer
The Safety Layer guarantees operational feasibility through real-time constraint enforcement. It acts after the policy proposes actions and before interacting with the environment, so that device- and network-level limits are always respected. The layer is modular: a fast device-wise projection enforces local bounds and ramping/SoC feasibility, followed by a compact network QP that corrects any residual violations due to voltage/current limits.
For each agent $i$, we define the state-dependent safe set
$$\mathcal{A}_i(s_t) = \Big\{ u \;:\; u_{\min}^i \le u \le u_{\max}^i,\;\; |u - u_{t-1}^i| \le \Delta u_{\max}^i,\;\; \mathrm{SoC}_{\min} \le \mathrm{SoC}_{t+1}^i \le \mathrm{SoC}_{\max} \Big\},$$
where $\eta^i$ is the (dis)charging efficiency and $E^i$ the energy capacity. The SoC update map is $\mathrm{SoC}_{t+1}^i = \mathrm{SoC}_t^i + \eta^i u_t^i\, \Delta t / E^i$ under the sign convention that $u_t^i > 0$ denotes charging (adjusted accordingly for devices with opposite sign conventions). In this formulation, we do not enforce a daily cycle constraint (e.g., returning the SoC to its initial value) at the end of the time horizon. Instead, we allow the state of charge (SoC) to evolve dynamically across multiple time steps, reflecting operational flexibility over time. This flexibility is particularly useful in real-world grid environments where energy storage and generation are subject to high variability.
Given a proposed action $a_t^i = (m_t^i, u_t^i)$, we apply the projection $\hat{u}_t^i = \Pi_{\mathcal{A}_i(s_t)}(u_t^i)$, which reduces to clamping $u_t^i$ to the intersection interval $[\ell_t^i, h_t^i]$ with
$$\ell_t^i = \max\big(u_{\min}^i,\; u_{t-1}^i - \Delta u_{\max}^i,\; u_{\mathrm{SoC}}^{i,\min}\big), \qquad h_t^i = \min\big(u_{\max}^i,\; u_{t-1}^i + \Delta u_{\max}^i,\; u_{\mathrm{SoC}}^{i,\max}\big).$$
If $[\ell_t^i, h_t^i]$ is empty due to incompatible limits, we enlarge the ramp window minimally (equivalently, introduce a small nonnegative slack penalized in the network QP below), which preserves convexity and ensures feasibility even under abrupt disturbances. Additional device rules (e.g., apparent-power P–Q capability) are handled by masking illegal discrete modes upstream and by shrinking $[\ell_t^i, h_t^i]$ online.
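The device-wise stage thus amounts to an interval clamp; a minimal sketch under the stated sign conventions follows (argument names are assumed).

```python
# A sketch of the device-wise projection: clamp the proposed set-point to the
# intersection of power bounds, the ramp window, and SoC-implied limits, with
# a minimal ramp-window relaxation if the interval is empty.
def project_device(u, u_min, u_max, u_prev, du_max, u_soc_lo, u_soc_hi):
    lo = max(u_min, u_prev - du_max, u_soc_lo)
    hi = min(u_max, u_prev + du_max, u_soc_hi)
    if lo > hi:                       # incompatible limits: relax ramp minimally
        slack = lo - hi
        lo, hi = lo - slack / 2, hi + slack / 2
    return min(max(u, lo), hi)        # Euclidean projection onto [lo, hi]

# Battery near full SoC: the charging headroom shrinks the upper limit.
print(project_device(u=0.8, u_min=-1.0, u_max=1.0, u_prev=0.2,
                     du_max=0.3, u_soc_lo=-1.0, u_soc_hi=0.4))  # -> 0.4
```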
To respect feeder limits, we solve a small QP with precomputed sensitivities around the current operating point as follows:
$$\min_{\tilde{u}}\; \tfrac{1}{2}\,\|\tilde{u} - \hat{u}\|_2^2 \quad \text{s.t.} \quad v_{\min} \le v_0 + S_v(\tilde{u} - u_0) \le v_{\max}, \qquad |i_0 + S_i(\tilde{u} - u_0)| \le i_{\max},$$
where $S_v$ and $S_i$ are voltage and current sensitivity matrices. In practice we augment (20) with box constraints $\tilde{u} \in [\ell, h]$ (device bounds after the projection in (19)) and optional slacks weighted by large penalties to improve feasibility (under a linearized model) under model mismatch. Note that constraints are enforced with respect to linearized sensitivities around the current operating point; hence feasibility refers to the linear model. In experiments, we verify a posteriori with full AC power flow that violations remain small.
The objective is strictly convex and the feasible region is convex; hence a unique minimizer exists and depends continuously on the proposed action $\hat{u}$. We pre-factor the KKT system once per operating-point refresh and reuse it across steps, yielding sub-millisecond solves at the feeder scales considered. The overall complexity scales with the number of active inequality constraints; sparsity from feeder topology further reduces cost. A toy instance of this correction step is sketched below.
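The sketch poses the correction as a small QP with cvxpy; the solver choice and the linearization around the proposed point $\hat{u}$ (rather than a separate operating point $u_0$) are simplifying assumptions for exposition.

```python
# A sketch of the sensitivity-based network QP: find the minimal correction to
# the projected set-points that respects linearized voltage limits, with
# heavily penalized slacks guaranteeing feasibility under model mismatch.
import cvxpy as cp
import numpy as np

def network_qp(u_hat, S_v, v0, v_min, v_max, lo, hi, penalty=1e4):
    u = cp.Variable(len(u_hat))
    s = cp.Variable(len(v0), nonneg=True)          # feasibility slack
    v = v0 + S_v @ (u - u_hat)                     # linearized voltage response
    cons = [v_min - s <= v, v <= v_max + s, lo <= u, u <= hi]
    obj = cp.Minimize(cp.sum_squares(u - u_hat) + penalty * cp.sum_squares(s))
    cp.Problem(obj, cons).solve()
    return u.value

u_hat = np.array([0.4, 0.4])
S_v = np.array([[0.05, 0.02], [0.02, 0.05]])       # toy voltage sensitivities
out = network_qp(u_hat, S_v, v0=np.array([1.06, 1.0]),
                 v_min=0.95, v_max=1.05, lo=np.full(2, -1.0), hi=np.full(2, 1.0))
print(out)  # set-points nudged to pull the overvoltage bus back into band
```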
2.4. Theoretical Guarantees
We provide brief theoretical analysis for the key components of our framework, focusing on convergence and safety properties that hold under mild regularity conditions.
Under standard assumptions (Lipschitz-continuous loss and bounded gradients), the STT training converges to a stationary point using first-order methods. The topology-aware regularization also yields a graph-smoothing effect. Let $f^\star$ be the optimal STT predictor. For connected nodes $i, j$ with $[K_{\mathrm{diff}}]_{ij} > 0$, the following bound holds:
$$\big\| f_i^\star - f_j^\star \big\| \;\le\; \frac{C}{\lambda_{\mathrm{topo}}\, [K_{\mathrm{diff}}]_{ij}},$$
where $C$ depends on data smoothness and model capacity. Thus larger diffusion proximity $[K_{\mathrm{diff}}]_{ij}$ tightens cross-node prediction coupling, while $\lambda_{\mathrm{topo}}$ controls the bias–variance trade-off.
For the MARL component, we consider hierarchical policies under CTDE with bounded rewards and compact action spaces. With learning rates satisfying the Robbins–Monro conditions and with target networks for stabilization, the policy updates produce asymptotic improvement up to vanishing errors as follows:
$$J(\pi_{k+1}) \;\ge\; J(\pi_k) - \epsilon_k,$$
where $\epsilon_k \to 0$ as $k \to \infty$. The pairwise interaction critic remains well-defined because the sparsified neighborhood bounds the number of interaction terms; in particular, the critic gradient variance is controlled by the congestion weights, which shrink toward zero off-congestion, mitigating nonstationarity.
For safety guarantees, the device-wise stage is a projection onto a nonempty closed interval (firmly nonexpansive), while the network-wise stage solves a strictly convex QP (a prox-like mapping). Device-wise, $\Pi_{\mathcal{A}_i(s_t)}$ is the Euclidean projection onto a nonempty closed interval and is firmly nonexpansive:
$$\big\| \Pi(u) - \Pi(u') \big\|^2 \;\le\; \big\langle \Pi(u) - \Pi(u'),\; u - u' \big\rangle,$$
which implies robustness to actor noise. Network-wise, the QP in (21) is strictly convex; thus the filtered action is unique and satisfies all constraints (or violates them minimally via slacks with quadratic penalties). Consequently, for any proposed action $\hat{u}$, the filtered action $\tilde{u}$ obeys the linearized feeder constraints, and among all feasible actions it minimizes the Euclidean deviation from $\hat{u}$. Moreover, the deviation admits a sensitivity-type upper bound that scales with raw violations as follows:
$$\|\tilde{u} - \hat{u}\| \;\le\; \kappa_v\, \big\| [\,g_v(\hat{u})\,]_+ \big\| + \kappa_i\, \big\| [\,g_i(\hat{u})\,]_+ \big\|,$$
where $[\cdot]_+$ denotes the positive part, $g_v$ and $g_i$ are the voltage and current constraint residuals, and $\kappa_v, \kappa_i$ depend on the QP KKT matrix condition numbers.
Combining the three layers yields an additive performance decomposition. Let $J^\star$ be the optimal value under perfect information and unconstrained execution. The deployed policy $\pi_{\mathrm{dep}}$ satisfies
$$J^\star - J(\pi_{\mathrm{dep}}) \;\le\; \epsilon_{\mathrm{pred}} + \epsilon_{\mathrm{learn}} + \epsilon_{\mathrm{safe}},$$
where $\epsilon_{\mathrm{pred}}$ decreases with STT generalization error and regularization strength, $\epsilon_{\mathrm{learn}}$ decreases with training time and critic accuracy under CTDE, and $\epsilon_{\mathrm{safe}}$ depends on typical magnitudes of raw violations and on the QP penalty parameters. Since both projections are nonexpansive and the QP correction is minimal in the Euclidean norm, the Safety Layer does not amplify action noise and yields a controlled, monotone trade-off between feasibility and optimality. These guarantees ensure reliable closed-loop performance with approximate safety under linearized constraints, while the modular design allows each component to be improved independently without re-deriving the others.