1. Introduction
The rapid integration of renewable generation, distributed energy resources, flexible demand, and pervasive sensing has fundamentally changed the operational paradigm of modern power systems. In high-renewable smart grids, dispatch is no longer a static scheduling task under mild uncertainty. Instead, it has become a sequential decision problem in which multiple control entities must react to volatile renewable output, changing net load, congestion propagation, reserve shortages, and incomplete system visibility under physical and operational constraints [1,2]. Under these conditions, both state representation and control coordination become strongly time dependent.
Recent studies have advanced uncertainty-aware dispatch and security-constrained scheduling from different perspectives. Optimization-based methods, including stochastic and uncertainty-aware dispatch, remain important because they preserve engineering interpretability and explicit feasibility constraints [3]. In parallel, reinforcement learning has shown promise for adaptive dispatch and rolling corrective control under renewable variability [4,5]. However, these methods often focus on either optimization or sequential control, while the interaction between topology-aware representation, multi-agent coordination, and corrective feasibility is still insufficiently addressed.
Graph learning provides another important direction because power grids are naturally structured systems. Recent graph neural network studies have improved state estimation, operational risk assessment, and topology-aware system analysis by capturing spatial dependence more effectively than flat feature encodings [6,7]. Nevertheless, the dispatch setting is more demanding than representation-only tasks. Although the physical network topology is relatively stable, the effective operational coupling among buses and controllable resources changes over time with congestion, reserve activation, and renewable ramps. Existing studies rarely examine whether learned couplings remain physically meaningful and security-relevant when they are embedded in a closed-loop dispatch policy [8].
Multi-agent reinforcement learning (MARL) is also attractive for smart-grid dispatch because different controllable entities operate with partial observations and heterogeneous responsibilities. Recent graph-based multi-agent learning and adjacent-domain resource-allocation studies have shown that adaptive coordination can improve distributed decision making under uncertainty [9,10,11]. However, smart-grid dispatch differs from generic resource-allocation problems because it must simultaneously satisfy power balance, ramping limits, reserve adequacy, and transmission-security constraints. As a result, coordination should not only improve reward, but also remain interpretable in terms of stressed corridors, balancing responsibility, and corrective reserve deployment.
A further challenge arises from data governance and operator trust. In practical dispatch environments, local operational data are often commercially sensitive or infrastructure-sensitive, which limits direct information sharing across agents or regions. Federated learning and differential privacy offer a natural way to enable collaborative training without exposing raw data [12,13]. At the same time, dispatch support requires more than predictive accuracy: operators need to understand how changes in wind support, load escalation, and line stress affect cost, curtailment, and congestion. Yet most existing interpretability analyses remain correlational and are not explicitly linked to intervention-oriented operational questions [14].
Against this background, three research gaps remain. First, existing GNN- and RL-based dispatch studies usually improve either representation or control, but rarely validate whether learned couplings remain physically meaningful under stressed dispatch conditions. Second, existing MARL dispatch studies often omit privacy-preserving collaborative training, or they introduce privacy without quantifying the utility-security trade-off under an explicit threat model. Third, most interpretability analyses remain descriptive and are not connected to intervention-based operator questions. These gaps limit the practical value of current methods because renewable-rich smart-grid dispatch inherently involves a closed loop of state representation, coordinated control, privacy-constrained training, and operator-facing explanation.
To address these issues, this paper proposes a unified framework for uncertainty-aware smart-grid dispatch that integrates topology-anchored dynamic graph learning, privacy-aware multi-agent symbiotic reinforcement learning, and structural causal intervention analysis. The purpose of this integration is not to combine multiple techniques for their own sake, but to organize them around the actual logic of renewable-rich dispatch. The graph module captures fixed physical topology together with stress-dependent residual coupling. The MARL module learns coordinated corrective policies under partial observability and heterogeneous agent roles. Federated optimization with Rényi differential privacy supports collaborative training while limiting local information leakage. The causal module provides post hoc intervention-based interpretation of how wind variation, load growth, and line stress influence dispatch outcomes.
The main contributions of this study are summarized as follows:
(1) The adaptive graph is decomposed into a fixed physical-topology component and a learned residual component, enabling the model to capture time-varying operational coupling while preserving topological grounding.
(2) The dispatch task is formulated as a constrained partially observable stochastic game, and a task-masked coordination regularization strategy is adopted to improve collaboration among tightly coupled agents.
(3) Rather than claiming novelty in the privacy mechanism itself, this study focuses on integrating Rényi differential privacy into collaborative dispatch learning and quantifying the resulting privacy-utility trade-off.
(4) The proposed framework estimates intervention effects associated with wind support, load escalation, and line stress, thereby linking learned dispatch behavior to operationally meaningful questions.
(5) The empirical study is conducted on a public-trace-grounded benchmark based on a modified IEEE 30-bus system, with evaluation covering graph behavior, coordination performance, privacy diagnostics, causal robustness, and comparison with strong learning-based and optimization-based baselines.
2. Methodology
Figure 1 summarizes the overall framework. Multimodal operational data are first encoded by a topology-informed dynamic graph convolutional network (DGCN) and a temporal encoder. The resulting latent representation is then consumed by the multi-agent symbiotic reinforcement learning (MASRL) module for dispatch optimization. During training, clients participate in federated optimization under a Rényi differential privacy (RDP) guarantee, and the final policy is analyzed using a structural causal model.
2.1. Problem Formulation as a Constrained Partially Observable Stochastic Game
Consider a smart grid with $N$ controllable entities and a scheduling horizon of $T$ intervals. Agent $i$ may represent a thermal unit, gas unit, wind-farm aggregator, storage device, or regional load-serving controller. Because each agent observes only local measurements and delayed neighborhood information, the dispatch problem is naturally modeled as a constrained partially observable stochastic game [15]. Under centralized training, the same system can be viewed as a constrained partially observable Markov decision process with decentralized execution.
Let $s_t$ denote the global state at interval $t$. The state includes bus-level injections, line-loading ratios, reserve margin, locational prices, meteorological variables, renewable forecasts, realized renewable outputs, unit operating points, and topology descriptors. Agent $i$ receives a local observation $o_{i,t}$ that is a partial function of the state. The joint observation is $o_t = (o_{1,t}, \dots, o_{N,t})$, and the joint action is $a_t = (a_{1,t}, \dots, a_{N,t})$.
The action of each agent depends on its physical role. For dispatchable generation units, actions include active-power adjustment, reserve participation, and upward or downward ramp decisions. For storage units, actions include charging, discharging, and reserve allocation. For load-serving entities, actions may include demand response activation, flexible load shifting, and interruption scheduling. These actions are coupled through network and market constraints.
The global state vector stacks 30 nodal net injections, 41 branch-loading ratios, six thermal-unit outputs, three wind realizations, three wind forecasts, two system reserve indicators, 30 locational-price proxies, four calendar markers, and six weather-derived drivers, giving 125 continuous features per interval. Each agent receives a role-specific observation of 18–32 features formed from local device states, neighboring line stress, and delayed regional measurements. The action dimension is three for dispatchable generators (active-power adjustment, upward reserve, downward reserve), two for storage devices (charge or discharge command and reserve share), and two for load-serving entities (flexible-load shift and curtailment request). The base reward uses
before adaptive penalty rescaling, so that security and reliability violations remain more expensive than marginal energy-redispatch changes.
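The game is specified by the tuple
$$\mathcal{M} = \langle \mathcal{S}, \mathcal{O}, \mathcal{A}, P, R, \mathcal{C}, \mathcal{N}, T \rangle,$$
where $\mathcal{S}$ is the global state space, $\mathcal{O}$ is the local observation space, $\mathcal{A}$ is the joint action space, $P$ is the stochastic transition kernel, $R$ is the reward function, $\mathcal{C}$ denotes the system constraints, $\mathcal{N}$ is the agent set, and $T$ is the horizon.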
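The state-vector composition described above can be checked with a few lines of bookkeeping. The sketch below only restates the group sizes and action dimensions given in the text; the dictionary keys and helper names are illustrative assumptions, and the ordering of groups inside the stacked vector is assumed to follow the order in which the text lists them.

```python
# Feature-group sizes as listed in the text for the modified IEEE 30-bus case.
# Key names and stacking order are illustrative assumptions.
STATE_GROUPS = {
    "nodal_net_injections": 30,
    "branch_loading_ratios": 41,
    "thermal_unit_outputs": 6,
    "wind_realizations": 3,
    "wind_forecasts": 3,
    "reserve_indicators": 2,
    "locational_price_proxies": 30,
    "calendar_markers": 4,
    "weather_drivers": 6,
}

# Action dimension per agent role, also taken from the text.
ACTION_DIMS = {"generator": 3, "storage": 2, "load_serving": 2}


def state_dim(groups=STATE_GROUPS):
    """Total number of continuous features per interval."""
    return sum(groups.values())


def group_slices(groups=STATE_GROUPS):
    """Map each feature group to its slice in the stacked state vector."""
    slices, start = {}, 0
    for name, size in groups.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices
```

Summing the groups reproduces the 125 features per interval stated in the text, and the slices give a simple way to build role-specific observations from the global vector.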
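The instantaneous reward is designed to reflect the multi-objective nature of practical dispatch:
$$r_t = -\left(C_t^{\mathrm{op}} + C_t^{\mathrm{shed}} + C_t^{\mathrm{curt}} + C_t^{\mathrm{carb}} + C_t^{\mathrm{viol}}\right),$$
where $C_t^{\mathrm{op}}$ is the operating cost, $C_t^{\mathrm{shed}}$ is the load-shedding penalty, $C_t^{\mathrm{curt}}$ is the renewable-curtailment penalty, $C_t^{\mathrm{carb}}$ is the carbon-emission cost, and $C_t^{\mathrm{viol}}$ aggregates security-constraint violations such as line overloads, reserve shortage, and ramp breaches.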
The main physical constraints are listed below:
Equation (5) enforces system-wide power balance; (6) enforces generator capacity limits; (7) enforces ramping limits; and (8) enforces line-flow security. The proposed learning framework is motivated by the fact that these constraints operate under stochastic renewable output, partial observability, and distributed actions (Table 1 and Table 2).
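For orientation, one standard way to write the four constraint families named above is sketched here. The symbols are assumptions introduced for illustration and are not taken from the source: $P_{g,t}$ is the output of dispatchable unit $g$, $P_{w,t}$ the accepted wind injection, $D_{b,t}$ the served load at bus $b$, $R_g^{\mathrm{up}}$ and $R_g^{\mathrm{dn}}$ the ramp limits, and $F_{\ell,t}$ the flow on line $\ell$. Storage, load shedding, and losses are omitted for brevity, so this is a simplified form rather than the paper's exact formulation.

```latex
\begin{align}
\sum_{g \in \mathcal{G}} P_{g,t} + \sum_{w \in \mathcal{W}} P_{w,t}
  &= \sum_{b \in \mathcal{B}} D_{b,t}, \tag{5} \\
P_g^{\min} \le P_{g,t} &\le P_g^{\max}, \tag{6} \\
-R_g^{\mathrm{dn}} \le P_{g,t} - P_{g,t-1} &\le R_g^{\mathrm{up}}, \tag{7} \\
\lvert F_{\ell,t} \rvert &\le F_\ell^{\max}. \tag{8}
\end{align}
```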
2.2. Topology-Informed Adaptive Dynamic Graph Representation
A central challenge in renewable-rich grids is that node dependence is both structured and time-varying. The physical network topology is relatively fixed, but the effective operational coupling among buses changes with congestion, redispatch, reserve activation, and weather-driven renewable patterns. Therefore, a static adjacency matrix is often insufficient.
To capture this behavior, we define the adaptive adjacency matrix as the sum of a physically grounded base matrix and a learned residual matrix (Figure 2):
$$A_t = A^{\mathrm{phy}} + \Delta A_t.$$
Here $A^{\mathrm{phy}}$ is derived from the electrical network using bus connectivity and electrical distance, while $\Delta A_t$ is a data-driven residual that reflects time-varying couplings. This decomposition constrains the learned graph to remain physically plausible while preserving adaptability.
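Given input feature tensor $X_{t-T_w+1:t}$ over a rolling window, the graph representation is written as:
$$H_t^{(l+1)} = \sigma\!\left(\hat{A}_t H_t^{(l)} W^{(l)}\right), \qquad H_t^{(0)} = X_t,$$
$$Z_t = \mathrm{STCN}\!\left(H_{t-T_w+1}^{(L)}, \dots, H_t^{(L)}\right),$$
where $\hat{A}_t$ is the normalized adaptive adjacency matrix, $H_t^{(l)}$ is the $l$-th graph-convolutional feature map, $W^{(l)}$ are learnable parameters, $\sigma(\cdot)$ is a nonlinear activation, and STCN is the temporal encoder that aggregates historical dynamics over window length $T_w$. The final output $Z_t$ is the latent spatio-temporal state representation that feeds the reinforcement learner.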
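The encoder described above can be illustrated with a small NumPy forward pass. This is a minimal sketch under stated assumptions: symmetric normalization with self-loops and a ReLU activation are common choices that the text does not confirm, and a fixed averaging kernel stands in for the STCN temporal encoder, whose internal design is not specified here. All function names are hypothetical.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}.

    Assumes a non-negative adjacency so every degree is at least 1.
    """
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, h, w):
    """One graph-convolution layer with a ReLU activation (assumed)."""
    return np.maximum(a_norm @ h @ w, 0.0)


def encode_window(adj_seq, x_seq, weights, temporal_kernel):
    """Encode a rolling window into one latent matrix.

    adj_seq: (window, nodes, nodes) adaptive adjacency per interval
    x_seq:   (window, nodes, features) node features per interval
    weights: list of layer weight matrices
    temporal_kernel: 1-D weights over the most recent frames; a simple
        stand-in for the temporal encoder, not the paper's STCN.
    Returns an array of shape (nodes, hidden).
    """
    frames = []
    for adj, x in zip(adj_seq, x_seq):
        a_norm = normalize_adjacency(adj)
        h = x
        for w in weights:
            h = gcn_layer(a_norm, h, w)
        frames.append(h)
    frames = np.stack(frames)  # (window, nodes, hidden)
    k = len(temporal_kernel)
    # Weighted sum over the last k frames along the time axis.
    return np.tensordot(temporal_kernel, frames[-k:], axes=(0, 0))
```

With a window of 16 intervals, 30 buses, two layers of hidden width 64, and a kernel of length 3 (the sizes reported in Section 3.3), the output has shape (30, 64).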
The DGCN and the dispatch policy are optimized in a coupled fashion. The DGCN reduces observation noise and compresses spatial dependencies into the latent representation $Z_t$, thereby improving state identifiability for policy learning. Conversely, the downstream actor-critic gradients encourage the representation to emphasize dispatch-relevant patterns such as congestion precursors, renewable ramps, and reserve shortfalls. The result is not a loose pipeline but a representation-control co-optimization mechanism.
The learned graph is further regularized so that adaptation remains physically plausible. Specifically, the residual component is stress gated according to line loading, reserve shortage, and renewable ramp intensity, which prevents the model from inventing strong couplings in electrically quiet periods. A complementary plausibility penalty discourages electrically implausible long-range edges, aligns the learned adjacency with electrical–distance structure, and bounds spectral instability under stressed operation. This addition is intended to make dynamic graph learning dispatch-relevant and safety-aware rather than to claim a new generic graph-learning family. During evaluation, the learned graph is judged by topology consistency, electrical–distance correlation, congestion–edge recall, and N-1 secure decision ratio. These diagnostics provide direct evidence of whether the residual graph highlights physically meaningful stress paths instead of arbitrary correlations.
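The stress gating and plausibility penalty described above might look roughly as follows. This is a sketch under assumptions: the sigmoid gate, its weights and bias, the symmetrization of the residual, and the distance-weighted L1 penalty are illustrative choices, not the paper's stated formulas.

```python
import numpy as np


def stress_gate(line_loading, reserve_shortage, ramp_intensity,
                weights=(1.0, 1.0, 1.0), bias=-3.0):
    """Scalar gate in (0, 1) from three stress indicators.

    The sigmoid form, weights, and bias are hypothetical; a negative bias
    keeps the gate near zero in electrically quiet periods.
    """
    z = (weights[0] * line_loading
         + weights[1] * reserve_shortage
         + weights[2] * ramp_intensity
         + bias)
    return 1.0 / (1.0 + np.exp(-z))


def adaptive_adjacency(a_phys, residual, gate):
    """Physical base graph plus a gated, symmetric, non-negative residual."""
    sym = 0.5 * (residual + residual.T)
    return a_phys + gate * np.maximum(sym, 0.0)


def plausibility_penalty(residual, elec_distance):
    """Penalize residual weight placed on electrically distant bus pairs."""
    return float(np.sum(np.abs(residual) * elec_distance))
```

When the gate is close to zero the adjacency collapses to the physical topology, which matches the intent that the model should not introduce strong couplings when the system is unstressed.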
2.3. Multi-Agent Symbiotic Reinforcement Learning
Each controllable entity learns a policy that maps local observations and graph-enhanced latent states to dispatch actions. Since multiple agents jointly affect the same network-level objective, independent learning may lead to unstable or redundant decisions. To address this issue, we introduce a symbiotic regularization term that encourages coordination only among agents with coupled operating responsibilities.
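The training objective is defined as follows:
$$\max_{\{\theta_i\}} \; \sum_{i \in \mathcal{N}} J_i(\theta_i) - \beta \sum_{i \in \mathcal{N}} \sum_{j \neq i} M_{ij}\, D_{\mathrm{KL}}\!\left(\pi_{\theta_i} \,\|\, \pi_{\theta_j}\right),$$
where $J_i$ is the expected discounted return of agent $i$, $\beta$ is the regularization weight, and $M_{ij}$ is a task-interaction mask. If two agents share a congested corridor, co-provide balancing reserves, or serve the same regional load pocket, then $M_{ij} = 1$; otherwise $M_{ij} = 0$. Thus, the KL regularization term is not used to force global policy homogeneity. Instead, it reduces inconsistent reactions among tightly coupled agents and improves task-allocation smoothness in shared operating regions.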
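A masked KL regularizer of the kind described above can be sketched for diagonal Gaussian policies. Two assumptions are made loudly here: policies are taken to be diagonal Gaussians, and coupled agents are assumed to share an action space (or to be compared on a shared action component), since a KL divergence between policies of different dimension is not defined. The function names are hypothetical.

```python
import numpy as np


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)) )."""
    var_p, var_q = std_p ** 2, std_q ** 2
    terms = (np.log(std_q / std_p)
             + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
             - 0.5)
    return float(np.sum(terms))


def masked_kl_regularizer(mus, stds, mask):
    """Sum pairwise KL terms only over agent pairs flagged by the mask.

    mus, stds: per-agent policy means and standard deviations
    mask: (agents, agents) 0/1 task-interaction matrix
    """
    total = 0.0
    n = len(mus)
    for i in range(n):
        for j in range(n):
            if i != j and mask[i, j]:
                total += gaussian_kl(mus[i], stds[i], mus[j], stds[j])
    return total
```

Uncoupled pairs contribute nothing, so agents outside a shared corridor or reserve group remain free to keep distinct policies.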
The actor uses local information enhanced by topology-aware features, while the critic evaluates the joint action against global constraints and reward. This centralized-training, decentralized-execution design is suitable for dispatch problems where local agents have partial observations but system-level performance must remain coherent (Figure 3).
Agent coordination is further modulated by a topology- and stress-aware coupling signal derived from the adaptive graph and system-security indicators. Under normal operating conditions, the coordination mask follows a static task-coupling pattern determined by resource roles and regional responsibilities. When congestion intensity, reserve shortage, or renewable-ramp stress exceeds predefined thresholds, the coordination mask is updated to emphasize agents located in the affected corridor or balancing region. This event-driven update concentrates coordination on reserve sharing, congestion relief, and ramp balancing during stressed intervals, thereby allowing the collaboration mechanism to adapt to evolving operational couplings rather than relying on a fixed interaction structure.
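The event-driven mask update described above can be sketched as a simple thresholded rule. The scalar stress index, the single threshold, and the choice to fully couple all agents in the affected set are assumptions made for illustration; the text does not give the exact update rule.

```python
import numpy as np


def update_mask(static_mask, stress, affected, threshold):
    """Return the coordination mask for the current interval.

    static_mask: (agents, agents) 0/1 task-coupling pattern
    stress: scalar stress index (hypothetical aggregate of congestion,
        reserve shortage, and ramp intensity)
    affected: boolean vector marking agents in the stressed corridor
    threshold: level above which the event-driven update is triggered
    """
    mask = static_mask.copy()
    if stress > threshold:
        idx = np.flatnonzero(affected)
        # Couple every pair of agents inside the affected region.
        mask[np.ix_(idx, idx)] = 1
    np.fill_diagonal(mask, 0)  # an agent is never coupled with itself
    return mask
```

Below the threshold the static pattern is returned unchanged, so the extra coupling appears only during stressed intervals.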
To improve training stability under heterogeneous operating regimes, the penalty weights associated with overload, reserve shortage, and ramp violations are adjusted online according to recent constraint-violation statistics. Let $v_k$ denote the observed violation rate of constraint type $k$ over the latest $E$ episodes and let $v_k^{\ast}$ denote its target rate. The corresponding penalty weight is updated as
$$\lambda_k \leftarrow \lambda_k \left(1 + \eta\, \frac{v_k - v_k^{\ast}}{v_k^{\ast} + \xi}\right),$$
where $\eta$ is the adaptation rate and $\xi$ is a small stabilizing constant. In this way, constraints that are violated more frequently receive stronger penalties, whereas consistently satisfied constraints do not dominate the reward signal. The adaptive scheme reduces manual retuning across scenarios, stabilizes critic learning, and improves policy transferability under different volatility conditions.
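One multiplicative rule that is consistent with this description is sketched below. The functional form, the default adaptation rate, and the clipping bounds `w_min` and `w_max` are assumptions added for illustration; only the roles of the violation rate, target rate, adaptation rate, and stabilizing constant come from the text.

```python
def update_penalty(weight, violation_rate, target_rate,
                   eta=0.1, eps=1e-6, w_min=1e-3, w_max=1e3):
    """Scale a penalty weight by the relative gap to its target rate.

    eta is the adaptation rate and eps the small stabilizing constant.
    w_min and w_max are hypothetical safety bounds, not from the source.
    """
    rel_gap = (violation_rate - target_rate) / (target_rate + eps)
    new_weight = weight * (1.0 + eta * rel_gap)
    return min(max(new_weight, w_min), w_max)
```

For example, a weight of 18 with an observed rate of twice the target grows by roughly 10% at `eta=0.1`, while a constraint with no recent violations shrinks by roughly 10%.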
2.4. Federated Learning with Rényi Differential Privacy
Distributed smart-grid environments often impose strict data-sharing limitations. Operational records may contain commercially sensitive, geographically sensitive, or infrastructure-sensitive information. The proposed framework therefore adopts federated learning so that local training remains on each client while only clipped and perturbed model updates are transmitted, following the broader direction of privacy-preserving renewable and multi-energy learning.
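Let $\mathcal{D}_k$ be the local dataset on client $k$ and $F_k(\theta)$ be its empirical loss. The federated objective is
$$\min_{\theta} \; F(\theta) = \sum_{k \in \mathcal{K}} \frac{n_k}{n} F_k(\theta), \qquad n = \sum_{k \in \mathcal{K}} n_k,$$
where $n_k = |\mathcal{D}_k|$ is the local sample size. Before upload, each client clips its gradient and perturbs it with Gaussian noise:
$$\tilde{g}_k = \frac{g_k}{\max\!\left(1, \lVert g_k \rVert_2 / C\right)} + \mathcal{N}\!\left(0, \sigma^2 C^2 I\right).$$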
Server aggregation uses sample-size-weighted averaging after local clipping and perturbation, i.e., each round combines client parameters in proportion to local sample count so that no client dominates the global update. The local gradient norm is clipped before Gaussian perturbation, and the privacy accountant accumulates the effective epsilon across communication rounds under
. This design makes the aggregation rule, privacy budget, and leakage evaluation directly comparable across privacy settings.
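The clip, perturb, and aggregate steps described above follow the usual Gaussian-mechanism pattern and can be sketched as follows. The function names are hypothetical, and the sketch treats each client update as a flat vector; it is an illustration of the general technique, not the paper's implementation.

```python
import numpy as np


def clip_and_perturb(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm clip_norm, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm.
    """
    norm = np.linalg.norm(update)
    clipped = update / max(1.0, norm / clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise


def aggregate(updates, sample_counts):
    """Sample-size-weighted average of client updates."""
    weights = np.asarray(sample_counts, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.stack(updates), axes=(0, 0))
```

With the settings reported in Section 3.3, a client would call `clip_and_perturb(update, 1.0, 0.8, rng)` before upload, and the server would call `aggregate` on the received vectors.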
where $C$ is the clipping norm and $\sigma$ is the Gaussian noise multiplier. In this notation, $\varepsilon$ denotes the privacy budget and $\delta$ denotes the failure probability, consistent with standard differential privacy definitions. The pair ($\varepsilon$, $\delta$) is obtained from a Rényi differential privacy accountant after composition across communication rounds.
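To make the accounting step concrete, the sketch below composes the Rényi privacy loss of a plain Gaussian mechanism over several rounds and converts it to an (epsilon, delta) pair with the standard conversion bound. It deliberately ignores client or sample subsampling, so it gives a looser (larger) epsilon than an accountant that exploits subsampling amplification, and it should not be read as the accountant used in this study. The candidate orders and any delta value passed in are assumptions.

```python
import math


def gaussian_rdp(alpha, noise_multiplier, rounds):
    """RDP of order alpha after `rounds` Gaussian releases.

    Each release has unit sensitivity after clipping and noise standard
    deviation equal to noise_multiplier; RDP composes additively.
    """
    return rounds * alpha / (2.0 * noise_multiplier ** 2)


def rdp_to_dp(noise_multiplier, rounds, delta, orders):
    """Convert composed RDP to (epsilon, delta)-DP, minimizing over orders.

    Uses epsilon = rdp(alpha) + log(1/delta) / (alpha - 1), alpha > 1.
    Returns (best_epsilon, best_order).
    """
    best_eps, best_alpha = math.inf, None
    for alpha in orders:
        rdp = gaussian_rdp(alpha, noise_multiplier, rounds)
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        if eps < best_eps:
            best_eps, best_alpha = eps, alpha
    return best_eps, best_alpha
```

Scanning a grid of orders and keeping the smallest epsilon is what makes Rényi accounting tighter than naive composition over many rounds.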
This design clarifies what is shared among agents under privacy protection: clipped and perturbed parameter updates, not raw measurements, labels, market records, or local grid traces. RDP is adopted because it yields tighter privacy accounting over repeated communication rounds and therefore offers better utility under the same target privacy guarantee. In addition, the final operating point is selected jointly from utility and attack-resistance curves rather than from accuracy alone.
We assume an honest-but-curious server or external eavesdropper that can observe model updates, per-round losses, or output confidence scores, but cannot access raw client data. The primary attack target is membership inference: the attacker attempts to decide whether a given local trajectory participated in training by exploiting confidence or update statistics [16].
2.5. Structural Causal Modeling and Intervention-Based Interpretation
Correlation-based analysis is not sufficient for smart-grid decision support because many operational variables are confounded by weather, temporal effects, and control reactions. To increase interpretability, we construct a structural causal model that links weather, renewable generation, net load, dispatch actions, congestion, and dispatch outcomes.
For modality $m$, the causal graph $\mathcal{G}_m$ is obtained through a hybrid procedure. First, we define a prior skeleton based on power-system knowledge and time precedence. For example, weather variables may influence renewable output, renewable output influences net load and reserve requirements, and dispatch actions influence congestion and emissions. Second, admissible edges are refined using data-driven structure search and conditional-independence testing under temporal-order constraints. Edges that violate physical logic or time precedence are removed. The resulting model and the intervention effect of interest are written as
$$\mathcal{M}_m = \langle V, U, \mathcal{G}_m \rangle, \qquad \mathrm{ATE} = \mathbb{E}\!\left[Y \mid do(X = x_1)\right] - \mathbb{E}\!\left[Y \mid do(X = x_0)\right],$$
where $V$ denotes observed variables, $U$ denotes exogenous disturbances, $\mathcal{G}_m$ is the causal graph, $X$ is the intervened variable with treatment level $x_1$ and reference level $x_0$, and $Y$ is an outcome such as dispatch cost, curtailment, or congestion probability. The final causal analysis reports intervention effects together with bootstrap confidence intervals, placebo checks, and sensitivity-to-confounding diagnostics so that the interpretation is more rigorous than descriptive association alone. Explicit intervention design is required for rigorous evaluation. For example, the intervention $do(W = w_{90})$ can be compared with the control setting $do(W = w_{50})$, where $w_{90}$ and $w_{50}$ are the 90th and 50th percentiles of wind speed. The resulting change in renewable output, dispatch cost, or congestion risk can then be estimated under the structural causal model (Figure 4).
The structural causal analysis is conducted under four identification assumptions. First, temporal precedence is imposed within each 15 min interval so that exogenous drivers precede dispatch responses and outcome variables. Second, weather and calendar variables are treated as pre-treatment covariates. Third, dispatch actions are modeled as mediators that transmit the effects of exogenous variation to congestion, curtailment, and cost outcomes. Fourth, no unmodeled fast-timescale controller with simultaneous feedback is assumed to operate within the analysis window. Under these assumptions, average treatment effects are estimated by g-computation with bootstrap confidence intervals. Robustness is evaluated through placebo interventions on temporally shifted variables, leave-one-covariate-out sensitivity analysis, and bounded hidden-confounding stress tests.
The structural causal model includes the variables {weather, wind generation, net load, reserve margin, dispatch action, line stress, operating cost, curtailment, congestion risk}. Candidate edges are first screened according to temporal precedence and engineering feasibility, and are then retained only when they satisfy conditional-independence screening and remain stable across bootstrap structure-search runs. Intervention effects are estimated by g-computation on the fitted causal graph. Placebo interventions are generated by permuting treatment labels within the same diurnal block, and sensitivity is summarized by the minimum latent-confounding strength required to reverse the sign of the estimated effect.
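The estimation steps above (g-computation, bootstrap intervals, and block-wise placebo permutation) can be sketched with a deliberately simple outcome model. The linear regression, the i.i.d. bootstrap (which ignores temporal dependence), and all function names are assumptions for illustration; the study's fitted causal graph and outcome models may differ.

```python
import numpy as np


def fit_linear(features, target):
    """Least-squares fit with an intercept; returns the coefficients."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def g_computation_ate(treatment, covariates, outcome, t_high, t_low):
    """Average effect of setting the treatment to t_high versus t_low.

    Fits outcome ~ treatment + covariates, then averages the predicted
    outcome difference over the observed covariate distribution.
    """
    coef = fit_linear(np.column_stack([treatment, covariates]), outcome)

    def predict(t_value):
        t_col = np.full(len(outcome), t_value)
        design = np.column_stack([np.ones(len(outcome)), t_col, covariates])
        return design @ coef

    return float(np.mean(predict(t_high) - predict(t_low)))


def bootstrap_ci(treatment, covariates, outcome, t_high, t_low,
                 n_boot, rng, level=0.95):
    """Percentile bootstrap interval (i.i.d. resampling, a simplification)."""
    n = len(outcome)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        estimates.append(g_computation_ate(
            treatment[idx], covariates[idx], outcome[idx], t_high, t_low))
    lo, hi = np.percentile(
        estimates, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return float(lo), float(hi)


def placebo_ate(treatment, covariates, outcome, blocks, t_high, t_low, rng):
    """Re-estimate the effect after permuting treatment within each block."""
    permuted = treatment.copy()
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        permuted[idx] = rng.permutation(permuted[idx])
    return g_computation_ate(permuted, covariates, outcome, t_high, t_low)
```

In this setup the treatment levels would be the 90th and 50th wind-speed percentiles, the covariates would be the pre-treatment weather and calendar variables, and `blocks` would index the diurnal block of each interval. A placebo estimate close to zero supports the claim that the main estimate is not an artifact of the fitting procedure.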
3. Experimental Design
3.1. Test System, Data, and Scenario Definition
Experiments are conducted on a Texas/ERCOT-grounded public-trace dispatch dataset mapped to a modified IEEE 30-bus system with 41 transmission connections, six thermal units, three wind-farm injection points, and heterogeneous load centers. Historical ERCOT load and market-price records are combined with wind-generation traces derived from the NREL WIND Toolkit, while the MATPOWER-based 30-bus network serves as the physical dispatch backbone [17,18,19,20].
The remapped study window spans 90 consecutive days at 15 min resolution, corresponding to 8640 intervals. Public ERCOT load and market-price traces are temporally aligned with remapped wind trajectories so that the resulting dataset preserves realistic variability patterns while remaining fully reproducible. It records unit dispatch, bus-level load, reserve requirement, congestion proxy, locational marginal price, renewable output, and operating cost. The thermal fleet covers base-load, mid-merit, and flexible operating roles with unit limits ranging from 6 to 82 MW and ramp limits from 8 to 16 MW per 15 min, while the three wind plants are connected to buses 7, 19, and 27 with installed capacities of 58, 42, and 38 MW.
Three evaluation scenarios are defined using reproducible thresholds rather than qualitative labels only. The normal scenario contains intervals whose 1 h wind-ramp magnitude does not exceed 18.0 MW and whose congestion index remains below 12.45. The high-wind-volatility scenario contains intervals whose 1 h wind-ramp magnitude exceeds 32.9 MW, whereas the congestion-sensitive scenario contains intervals whose congestion index exceeds 13.93. For all methods, the first 60 days are used for training, the next 15 days for validation, and the final 15 days for testing to avoid temporal leakage.
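The scenario thresholds and the chronological split stated above translate directly into a few lines of code. The thresholds and day counts are taken from the text; the function names, the use of the absolute ramp value, and the array-based interface are assumptions.

```python
import numpy as np

INTERVALS_PER_DAY = 96  # 15 min resolution


def chronological_split(n_days=90, train_days=60, val_days=15):
    """Return index slices for the train, validation, and test partitions."""
    total = n_days * INTERVALS_PER_DAY
    train_end = train_days * INTERVALS_PER_DAY
    val_end = train_end + val_days * INTERVALS_PER_DAY
    return slice(0, train_end), slice(train_end, val_end), slice(val_end, total)


def label_scenarios(wind_ramp_1h, congestion_index):
    """Boolean masks for the three evaluation scenarios.

    wind_ramp_1h: 1 h wind-ramp values in MW (magnitude is used)
    congestion_index: congestion index per interval
    """
    ramp = np.abs(wind_ramp_1h)
    return {
        "normal": (ramp <= 18.0) & (congestion_index < 12.45),
        "high_wind_volatility": ramp > 32.9,
        "congestion_sensitive": congestion_index > 13.93,
    }
```

Note that, as defined, the three masks are neither exhaustive nor mutually exclusive: intervals with intermediate ramp or congestion values belong to no scenario, and a single interval can be both high-wind-volatility and congestion-sensitive.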
3.2. Data Cleaning, Missing-Value Handling, and Feature Engineering
Engineering range checks are first applied to identify impossible values in generator output, bus load, reserve requirement, and congestion variables. Short isolated spikes are detected through robust filtering and temporal consistency checks on 15 min ramp increments. Missing gaps shorter than four intervals are imputed by linear interpolation for electrical variables and seasonally matched interpolation for exogenous weather-derived drivers, whereas longer gaps are masked during training and excluded from metric aggregation. All continuous variables are normalized using statistics computed from the training partition only.
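The gap-handling and normalization rules above can be sketched as follows. Only the linear-interpolation branch for electrical variables is shown; the seasonally matched interpolation for weather-derived drivers is not reproduced. Leaving gaps at the series boundaries unfilled is an added assumption, and the function names are hypothetical.

```python
import numpy as np


def impute_short_gaps(series, max_gap=3):
    """Linearly interpolate interior NaN runs of length <= max_gap.

    Gaps shorter than four intervals means runs of at most three NaNs.
    Longer runs stay NaN so they can be masked during training.
    Returns (filled_values, valid_mask).
    """
    values = np.asarray(series, dtype=float).copy()
    missing = np.isnan(values)
    observed = ~missing
    if observed.sum() < 2:
        return values, observed
    idx = np.arange(len(values))
    interp = np.interp(idx, idx[observed], values[observed])
    start = None
    for i in range(len(values) + 1):
        in_gap = i < len(values) and missing[i]
        if in_gap and start is None:
            start = i
        elif not in_gap and start is not None:
            interior = start > 0 and i < len(values)
            if interior and i - start <= max_gap:
                values[start:i] = interp[start:i]
            start = None
    return values, ~np.isnan(values)


def standardize(values, train_slice):
    """Z-score using statistics from the training partition only."""
    mean = np.nanmean(values[train_slice])
    std = np.nanstd(values[train_slice])
    return (values - mean) / (std if std > 0 else 1.0)
```

The returned validity mask is what allows long gaps to be excluded from both the loss and the metric aggregation.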
Feature engineering then transforms the cleaned series into model inputs. The final tensor includes lagged load and wind features, net injection, reserve margin, rolling price volatility, line-loading proxy, bus-type embedding, electrical–distance encoding, and calendar–weather interaction features. The look-back window is fixed to 16 intervals, identical feature definitions are used for all learning-based baselines, and every experiment is repeated with five random seeds (Table 3).
3.3. Baselines and Implementation Details
The baseline set spans classical forecasting-driven dispatch, graph-learning baselines, strong MARL baselines, and advanced uncertainty-aware optimization comparators. ARIMA and LSTM represent forecasting-driven baselines, while stochastic dispatch and adaptive look-ahead economic dispatch represent stronger engineering dispatch comparators [21,22]. DeepOPF is included as a learning-based optimization comparator, and MADDPG, QMIX, and MAPPO serve as strong MARL baselines [23,24,25,26]. Topology-aware graph comparators are represented through recent GNN studies for operational risk assessment and state estimation [27,28].
To ensure fairness, all methods use the same chronological split, identical scenario definitions, and the same evaluation horizon (Table 4). Hyperparameters are tuned on the validation set only. The DGCN contains two graph-convolutional layers with hidden dimension 64 followed by a temporal encoder with kernel size 3, while the actor and critic each use two fully connected layers of width 128. Optimization uses Adam with learning rate $10^{-4}$, batch size 256, discount factor $\gamma$ = 0.99, early stopping on validation reward, and 1200 episodes. Federated optimization uses 50 communication rounds, clipping norm $C$ = 1.0, noise multiplier $\sigma$ = 0.8, and an RDP accountant to track privacy loss. All stochastic baselines are trained with the same observation space, action bounds, engineering constraints, and five-seed protocol so that mean values, standard deviations, and significance checks remain comparable.
3.4. Evaluation Metrics
The evaluation reports operating cost, renewable curtailment rate, load shedding, average overload rate, ramp-violation rate, carbon-emission intensity, cumulative reward, topology consistency, electrical–distance correlation, congestion–edge recall, and N-1 secure decision ratio. Prediction indicators, such as RMSE and MAE, are retained only to support the representation-learning discussion.
In addition, the following coordination, privacy, and causal indicators are reported: collaboration-efficiency score, reserve mismatch, convergence speed, policy adaptation lag, privacy budget (ε, δ), membership-inference AUC, TPR at 10% FPR, attack accuracy, average treatment effect, placebo effect, and bootstrap confidence intervals. All stochastic results are reported as averages over five runs, and the extended comparator tables report mean ± standard deviation whenever repeated training is applicable.
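Two of the privacy diagnostics listed above, membership-inference AUC and TPR at 10% FPR, can be computed from attack scores without external libraries. The sketch assumes higher scores mean "more likely a training member" and labels of 1 for members and 0 for non-members; the function names are hypothetical.

```python
import numpy as np


def roc_auc(scores, labels):
    """AUC as the probability that a member outscores a non-member.

    Ties between a member and a non-member count as one half.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


def tpr_at_fpr(scores, labels, max_fpr=0.10):
    """Largest true-positive rate among thresholds with FPR <= max_fpr."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    best = 0.0
    for thr in np.unique(scores):
        if np.mean(neg >= thr) <= max_fpr:
            best = max(best, float(np.mean(pos >= thr)))
    return best
```

An AUC near 0.5 and a low TPR at 10% FPR indicate that the attacker cannot reliably distinguish training members, which is the desired outcome under stronger privacy settings.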
4. Results and Discussion
A first question is whether the dynamic adjacency learned by the DGCN reflects actual grid structure rather than arbitrary latent connectivity. Figure 5 addresses this issue by jointly reporting topology consistency and the N-1 secure decision ratio under varying privacy budgets. Under normal conditions, the learned adjacency remains close to the physical topology, which indicates that the model does not invent unnecessary long-range couplings.
4.1. Adaptive Graph Behavior and Coordination Under Stressed Operation
A central issue in the proposed framework is whether the learned adaptive adjacency reflects meaningful grid couplings rather than arbitrary latent similarity. The value of the graph module does not lie in dynamic graph learning alone, but in learning a dispatch-oriented residual graph that remains anchored to physical structure while adapting to stressed operating conditions. Figure 5 therefore reports graph-side diagnostics, whereas Figure 6 shows the corresponding coordination behavior under the same stressed operating conditions.
Figure 5 indicates that the adaptive residual graph remains physically interpretable across the practical privacy range. Topology consistency rises from 0.72 at $\varepsilon$ = 0.9 to 0.92 at $\varepsilon$ = 3.1, after which the gain becomes marginal. The N-1 secure decision ratio follows the same pattern and reaches 0.956 at $\varepsilon$ = 5.8, while the electrical–distance alignment metric remains above 0.80 throughout the recommended operating region. These results show that the learned residual couplings do not drift toward an unconstrained latent graph. Instead, they strengthen dependencies that remain plausible from the viewpoints of electrical proximity, congestion propagation, and stressed transfer corridors. The graph is therefore acting as a topology-anchored stress amplifier rather than as a free-form similarity matrix.
The stressed-interval behavior supports the same interpretation. When congestion intensifies or reserve deployment becomes necessary after renewable ramps, the residual edge weights increase mainly around congestion corridors and reserve-coupled subregions rather than across electrically distant buses. This behavior is important because it demonstrates that graph adaptability and engineering plausibility improve together instead of trading off against one another. In other words, the learned graph is not only adaptive, but also operationally meaningful under the same situations in which coordination pressure actually increases.
The operational effect of this graph adaptation becomes clear in Figure 6. Removing the KL-based symbiotic coordination term increases unserved energy from 5.8 to 15.1 MWh/day, reduces collaboration efficiency from 0.97 to 0.81, and lengthens policy-adaptation lag from 2.1 to 3.8 dispatch intervals. Table 5 further shows that removing KL coordination increases reserve mismatch from 1.2 to 2.9 MW even though topology consistency remains close to the full-model level. This difference clarifies the complementary roles of the two modules: the adaptive graph identifies changing dependencies, whereas the coordination term translates those dependencies into smoother task allocation and faster corrective alignment among coupled agents.
These results also explain how multi-agent collaboration adapts to topology changes and stressed operation. In the proposed framework, changes in congestion pressure, reserve coupling, and renewable ramps alter the residual graph and, through the latent state, affect the effective interaction pattern among agents. The KL-based coordination term is therefore selective rather than homogenizing. It acts only on agents that share stressed corridors, balancing responsibilities, or regional support tasks. The resulting coordination gain reflects more coherent dispatch behavior under pressure, not a loss of policy diversity.
4.2. Dispatch Performance, Strong-Baseline Comparison, and Strategy Refinement Under Renewable Uncertainty
The benchmark results evaluate whether the proposed closed-loop design improves dispatch performance under renewable uncertainty relative to both learning-based and optimization-based comparators. Figure 7, Table 5, Table 6 and Table 7 provide complementary evidence through event-level strategy refinement, core benchmark comparison, strong-baseline comparison, and external validation.
Figure 7 illustrates a representative wind-ramp day. The proposed controller reallocates thermal generation and reserve promptly after the strongest renewable ramps, while the reserve trajectory remains feasible over the full 24 h horizon. The dispatch response is not dominated by one delayed redispatch action. Instead, the policy refines its strategy over the following dispatch intervals, which is consistent with the shorter adaptation lag reported in Section 4.1. This event-level behavior is important because it shows how the learned policy responds to stochastic renewable conditions in time, rather than only through average performance summaries.
The benchmark results in Table 5 show that the proposed method outperforms forecasting-based, static-graph, and initial sequential-decision baselines across all core metrics. Relative to the strongest baseline in Table 5, the hybrid GNN-RL model, the proposed framework lowers daily operating cost from 482.7 to 468.9 × 10³ USD, reduces renewable curtailment from 5.9% to 4.8%, decreases overload rate from 0.60% to 0.30%, and improves reward from −41.2 to −36.4. Load shedding is almost eliminated, decreasing from 0.3% to 0.1%, while carbon intensity falls from 0.58 to 0.55 tCO2/MWh. These gains indicate that the advantage of the full framework is not confined to one economic indicator. It improves economy, security, flexibility, and emissions-related performance simultaneously.
Table 6 shows that the performance advantage persists when stronger baselines are included. Relative to MAPPO, operating cost falls from 474.6 to 468.9 × 10³ USD/day and overload rate drops from 0.41% to 0.30%. Compared with interval-uncertainty OPF and chance-constrained SCED, the proposed method also achieves lower curtailment and a better reward while retaining engineering feasibility. This comparison is particularly important because it shows that the proposed framework is not merely outperforming weak forecasting baselines. Its gains remain visible against both mature MARL coordination schemes and stronger uncertainty-aware dispatch optimizers.
The constraint-handling design is further clarified by the penalty-sensitivity study. Sweeping the overload penalty over {12, 18, 24} and the reserve-shortage penalty over {6, 10, 14} shows that the selected pair (18, 10) yields the best compromise between economy and security. Lower overload penalties allow more frequent corridor violations, whereas higher penalties slightly reduce overload risk but increase curtailment and operating cost. The selected operating point therefore represents a balanced constrained-RL configuration rather than an aggressive cost-minimizing setting.
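A penalty sweep of this kind can be organized as a small grid search. The sketch below is illustrative only: `evaluate` is a hypothetical caller-supplied function that trains or replays the policy for one penalty pair and returns a single scalar score (lower is better), and collapsing economy and security into one score is an assumption rather than the procedure reported here.

```python
from itertools import product

def sweep_penalties(evaluate,
                    overload_grid=(12, 18, 24),
                    reserve_grid=(6, 10, 14)):
    """Score every (overload, reserve-shortage) penalty pair on the grid.

    evaluate: hypothetical callable (overload_penalty, reserve_penalty) -> float,
              where a lower value means a better economy/security compromise.
    Returns the best pair and the full score table for inspection.
    """
    results = {}
    for overload_penalty, reserve_penalty in product(overload_grid, reserve_grid):
        results[(overload_penalty, reserve_penalty)] = evaluate(
            overload_penalty, reserve_penalty
        )
    best_pair = min(results, key=results.get)
    return best_pair, results
```

Returning the full table alongside the best pair matters in practice, because the trade-off described above (fewer violations against higher curtailment and cost) is easier to judge from all nine grid points than from the minimizer alone.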
External validation leads to the same overall conclusion. Table 7 shows that the performance ranking remains stable under public-trace remapping across the base remapped case, the high-volatility remapped case, and the winter-peak remapped case. Although operating cost rises under the harder scenarios, overload rate remains below 0.50% and reward degradation remains moderate. This result suggests that the learned policy generalizes beyond a single benchmark realization and remains effective under more challenging public-trace conditions.
4.3. Privacy-Utility Trade-Off and Leakage Evaluation
The privacy results evaluate whether RDP-protected federated training can be incorporated into the dispatch loop with limited utility degradation and whether the selected privacy budget is supported by explicit leakage evaluation. Table 8 summarizes the joint behavior of privacy budget, dispatch utility, security performance, and membership-inference success.
The results reveal a clear privacy-utility frontier. Moving from the non-private model to ε = 3.1 increases daily operating cost only from 467.6 to 468.9 × 10³ USD/day and raises curtailment only from 4.7% to 4.8%, while the N-1 secure ratio remains nearly unchanged, decreasing only from 0.957 to 0.956. At the same time, membership-inference attack AUC drops from 0.73 to 0.57. This trade-off explains the selection of the mid-range privacy setting, which preserves most of the dispatch value while substantially reducing leakage risk.
The two extremes of the privacy sweep are less attractive. Very strong privacy protection at ε = 0.9 suppresses attack AUC further to 0.54, but operating cost increases to 476.8 × 10³ USD/day, curtailment rises to 5.8%, and the N-1 secure ratio drops to 0.921. By contrast, very weak protection at ε = 5.8 recovers almost all utility, but attack AUC rises to 0.63. The recommended operating point is therefore not chosen because it optimizes one metric in isolation. It is selected because it occupies the most balanced region of the privacy-utility curve [29,30].
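The trade-off can be made concrete with a short calculation on the values quoted above. The sketch below is only illustrative: it measures leakage as the attack advantage over chance (AUC minus 0.5), which is an assumed summary rather than a metric reported in Table 8, and it omits ε = 5.8 because its operating cost is not quoted in the text.

```python
# Values quoted in the text: daily operating cost (10^3 USD/day) and
# membership-inference attack AUC for three settings of the privacy sweep.
sweep = {
    "non-private": {"cost": 467.6, "auc": 0.73},
    "eps=3.1": {"cost": 468.9, "auc": 0.57},
    "eps=0.9": {"cost": 476.8, "auc": 0.54},
}

def tradeoff(baseline, setting):
    """Return (cost increase in %, reduction of attack advantage in %).

    Attack advantage is taken as AUC - 0.5, i.e. the margin over random
    guessing (an assumed leakage summary, not a reported metric).
    """
    cost_increase_pct = 100.0 * (setting["cost"] - baseline["cost"]) / baseline["cost"]
    base_advantage = baseline["auc"] - 0.5
    advantage = setting["auc"] - 0.5
    advantage_reduction_pct = 100.0 * (base_advantage - advantage) / base_advantage
    return cost_increase_pct, advantage_reduction_pct

for name in ("eps=3.1", "eps=0.9"):
    cost_pct, adv_pct = tradeoff(sweep["non-private"], sweep[name])
    print(f"{name}: cost +{cost_pct:.2f}%, attack advantage -{adv_pct:.1f}%")
```

Under this summary, ε = 3.1 removes roughly 70% of the attack advantage for a cost increase of about 0.28%, whereas ε = 0.9 removes roughly 83% at a cost increase of about 1.97%, which is consistent with the diminishing privacy return described above.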
These results also clarify the contribution of the privacy module. The paper does not claim novelty from proposing a new differential-privacy mechanism by itself. Rather, the contribution lies in integrating RDP-protected federated training into a renewable-rich dispatch loop and quantifying the resulting trade-off between privacy, dispatch performance, and leakage resistance under an explicit threat model. From a deployment perspective, this is the relevant question: not whether privacy improves reward, but whether privacy can be added at an acceptable operational cost.
4.4. Causal Intervention Analysis and Module-Level Validation
The causal and ablation results assess whether the structural causal module provides evidence beyond descriptive association and whether the overall gain of the framework can be attributed to the intended components. Figure 8, Table 9, Table 10 and Table 11 provide the corresponding evidence from intervention analysis and module-level diagnostics.
Figure 8 and Table 9 show that the structural causal model yields stable intervention-oriented results that are consistent with system intuition. The intervention do(Wind high) reduces expected operating cost from 63.9 to 56.2 USD/MWh and lowers congestion risk from 0.31 to 0.24. Table 4 quantifies these effects: the average treatment effect of do(Wind high) is −7.7 USD/MWh for operating cost and −0.07 for congestion risk, with narrow bootstrap confidence intervals and small placebo effects. By contrast, do(Load high) increases operating cost by +6.4 USD/MWh, whereas do(Line stress) increases curtailment by +1.9%. The reported sensitivity statistics remain moderate, which suggests that the intervention direction is not driven by unstable model specification alone.
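For readers who want to reproduce this type of estimate, the following minimal sketch computes an average treatment effect as a difference of mean outcomes with a percentile bootstrap interval. It assumes that outcome samples under the intervention and under the reference condition are available as two arrays. The function name, the resampling scheme, and the default settings are illustrative assumptions, not the exact estimator behind the reported intervals.

```python
import numpy as np

def ate_bootstrap(y_intervened, y_reference, n_boot=2000, alpha=0.05, seed=0):
    """Difference-in-means ATE with a percentile bootstrap confidence interval.

    y_intervened: outcomes simulated under the intervention, e.g. do(Wind high).
    y_reference:  outcomes under the reference condition.
    Both groups are resampled independently with replacement.
    """
    rng = np.random.default_rng(seed)
    y_intervened = np.asarray(y_intervened, dtype=float)
    y_reference = np.asarray(y_reference, dtype=float)
    point = y_intervened.mean() - y_reference.mean()
    boots = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(y_intervened, size=y_intervened.size, replace=True)
        c = rng.choice(y_reference, size=y_reference.size, replace=True)
        boots[b] = t.mean() - c.mean()
    lower, upper = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return point, (lower, upper)
```

With mean operating costs of 56.2 and 63.9 USD/MWh under the intervention and the reference condition, the point estimate is the −7.7 USD/MWh quoted above. A placebo check follows the same pattern by applying the function to a variable that should have no effect and confirming that the interval covers zero.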
The ablation results strengthen this interpretation. Table 10 shows that removing the dynamic residual graph causes a substantial deterioration in RMSE, operating cost, curtailment, overload rate, and reward. Table 11 explains why: topology consistency drops from 0.92 to 0.81 and the N-1 secure ratio falls from 0.956 to 0.921. This finding indicates that the adaptive residual graph is not a cosmetic component. It is the main source of topology-aware situational awareness under stressed dispatch conditions.
Removing KL symbiosis causes the largest coordination-side deterioration. Collaboration efficiency falls from 0.97 to 0.81 and reserve mismatch rises from 1.2 to 2.9 MW, even though topology consistency remains close to the full-model level. This confirms that the graph module and the coordination module serve different roles. The graph identifies changing dependencies, whereas the symbiotic objective translates those dependencies into more coherent dispatch and reserve allocation under pressure.
Removing federated RDP yields only a small utility gain but sharply increases privacy exposure, with attack AUC rising from 0.57 to 0.73. This result reinforces the conclusion that privacy protection is introduced at limited operational cost. Finally, removing the causal module changes the point metrics only moderately, but wind-cost stability declines from 0.86 to 0.61. This indicates that the causal component contributes primarily to the robustness of intervention-oriented interpretation rather than to large gains in average reward.
Taken together, these results support the modular logic of the framework. The superiority of the full model does not come from one oversized component. It arises from the complementary interaction of topology-anchored graph adaptation, selective multi-agent coordination, privacy-aware collaborative training, and intervention-based analysis.
4.5. External Validity, Practical Scope, and Study Limitations
Although the proposed framework performs consistently across the benchmark, strong baselines, and public-trace remapping cases, several limitations remain. First, the study is conducted on a public-trace-driven benchmark mapped to a benchmark network rather than on confidential utility-grade dispatch logs. The present evidence therefore supports algorithmic validity and engineering plausibility, but it should not yet be interpreted as a direct claim of utility-scale deployment readiness.
Second, privacy evaluation is strengthened by membership-inference experiments, but the attack space is not exhausted. In particular, broader reconstruction-style attacks are discussed at the threat-model level rather than benchmarked as extensively as membership inference. Third, the causal conclusions depend on structural assumptions, temporal-order restrictions, and intervention design choices. For this reason, the causal module should be interpreted as an operator-oriented decision-support layer rather than as unrestricted causal truth.
These limitations also help define the practical scope of the present study. The current results suggest that dispatch intelligence for renewable-rich grids should be judged not only by average cost reduction, but also by physical consistency of learned couplings, coordination quality under stressed operation, privacy-compliant collaborative training, and intervention-oriented interpretability. Within that scope, the framework shows favorable performance on the test benchmark and provides a structured basis for future evaluation in larger security-constrained and utility-facing environments.
5. Conclusions
This paper presents a unified framework for uncertainty-aware smart-grid dispatch that integrates a topology-informed residual graph, multi-agent symbiotic reinforcement learning, RDP-protected federated training, and structural causal intervention analysis within a closed-loop decision architecture. By formulating the dispatch task as a constrained partially observable stochastic game, the proposed method connects state representation, distributed coordination, privacy-preserving collaborative training, and operator-facing interpretation under renewable uncertainty.
The empirical results show that the proposed framework achieves the best overall performance among the compared baselines on the public-trace-driven benchmark. Daily operating cost decreases to 468.9 × 10³ USD, renewable curtailment falls to 4.8%, load shedding is reduced to 0.1%, and overload rate declines to 0.30%. The graph-side diagnostics indicate that the learned residual graph remains physically anchored while adapting to stressed operation, and the coordination results confirm the value of KL-based symbiotic coordination in reducing unserved energy and shortening adaptation lag. The privacy sweep further shows that the selected mid-range privacy budget preserves most dispatch utility while reducing membership-inference attack AUC from 0.73 to 0.57. In addition, the causal analysis provides intervention-oriented evidence showing that higher wind support improves dispatch outcomes, whereas load escalation and line stress worsen cost, congestion, and curtailment.
Overall, these results suggest that dispatch intelligence for renewable-rich power systems should be evaluated not only by economic efficiency, but also by physical consistency, coordination quality, privacy compliance, and operational interpretability. Although the present study is conducted on a public-trace-driven benchmark, it provides a structured basis for future evaluation in larger security-constrained and utility-facing environments. Future work will extend the framework to richer multi-energy settings, larger-scale dispatch systems, and more realistic communication and cyber-physical disturbances.