Article

Adversarially Robust Reinforcement Learning for Energy Management in Microgrids with Voltage Regulation Under Partial Observability

Department of Electrical and Computer Engineering, University of Alberta, 9211-116 Street NW, Edmonton, AB T6G 1H9, Canada
*
Author to whom correspondence should be addressed.
Energies 2026, 19(6), 1497; https://doi.org/10.3390/en19061497
Submission received: 10 February 2026 / Revised: 10 March 2026 / Accepted: 12 March 2026 / Published: 17 March 2026
(This article belongs to the Special Issue Transforming Power Systems and Smart Grids with Deep Learning)

Abstract

Modern microgrids increasingly rely on learning-based energy management systems (EMSs) for real-time decision-making, yet remain vulnerable to cyber–physical disturbances, sensor tampering, and model uncertainty. Existing resilient control and robust reinforcement learning methods provide useful foundations, but rarely address adversarial measurement perturbations that distort belief evolution under partial observability. This gap is critical, as structured perturbations in sensing channels can destabilize learning-based policies and propagate into voltage-regulation violations. This paper proposes an adversarially robust reinforcement learning framework for energy management with voltage regulation under partial observability in microgrids. The EMS decision-making problem is formulated as a partially observable Markov decision process (POMDP) that accounts for adversarial measurement perturbations, belief evolution, and system-level economic and voltage constraints. To avoid excessive conservatism under worst-case uncertainty, an adversary-aware belief construction based on adversarial belief balancing (A3B) is employed to focus on policy-relevant perturbations. Building on this belief representation, an adversarially robust learning framework is developed by incorporating adversarial counterfactual error (ACoE) as a learning regularization mechanism, enabling a balance between nominal operating efficiency and robustness under adversarial measurement distortion. The case study is conducted on a medium-voltage radial distribution feeder (IEEE 123-Node Test Feeder). Case study results demonstrate that the proposed ACoE-regularized policies substantially reduce voltage-deficit events, improve policy stability, and maintain operational constraints under adversarial perturbations, consistently outperforming standard proximal policy optimization (PPO)-based controllers. 
These results indicate that counterfactual-aware, belief-based learning substantially enhances voltage quality and operational resilience in microgrids with high penetration of distributed energy resources.

1. Introduction

Modern microgrids have evolved into highly dynamic cyber–physical systems in which distributed energy resources, communication networks, and automated control software interact continuously to regulate voltage and maintain operational resilience. This tight coupling between sensing, computation, and control enhances operational flexibility but simultaneously increases exposure to cyber–physical disturbances. Recent advances in adversarial reinforcement learning have introduced adversary-aware belief construction and counterfactual regularization mechanisms to reduce excessive conservatism under adversarial partial observability [1]. In this work, we focus on distribution-level medium-voltage feeders (4.16 kV class, IEEE 123-Node Test Feeder), in which inductive reactance dominates line impedances and reactive power strongly influences voltage dynamics. Recent studies and operational incidents indicate that even small disturbances in sensor measurements or control signals can propagate through distribution feeders, leading to widespread voltage deviations, instability, or loss of controllability [2]. Moreover, in practical distribution feeders, disturbances rarely remain localized within a restricted spatial region. In a medium-scale radial feeder comprising approximately 100–150 buses, a coordinated perturbation affecting only a subset of measurement channels can propagate through high-impedance branches and induce service-voltage violations at a large fraction of downstream nodes. In residential distribution networks, a single service transformer may supply numerous households, implying that voltage fluctuations at the feeder level can directly affect hundreds of end users.
By incorporating steady-state bidirectional converter models for hybrid AC/DC networked microgrids, convex reformulations of the energy management problem yield globally optimal solutions when system models are sufficiently accurate and measurements are reliable [3]. These formulations, however, presuppose deterministic state information and are typically posed as static optimization problems. In the presence of cyber–physical disturbances and measurement corruption, estimation errors can propagate directly into the optimization layer, potentially resulting in suboptimal or even unsafe control actions. Learning-based control constitutes a natural framework for sequential decision-making under uncertainty. Nonetheless, the majority of existing deep reinforcement learning (DRL) formulations assume noise-free, reliable observations and do not explicitly model or mitigate adversarial measurement distortions [4,5]. This gap motivates the development of a belief-aware learning framework that can maintain voltage regulation performance in the face of adversarial partial observability.
These vulnerabilities are further exacerbated in learning-based control architectures, wherein policy updates and real-time decision-making depend critically on the fidelity and reliability of observed system data. In the presence of adversarial noise or partial observability, corrupted measurements can distort belief states and policy updates, thereby inducing substantial divergence between nominal and perturbed system trajectories. This divergence is especially detrimental for voltage regulation tasks, where even minor sensing inaccuracies can be amplified by the nonlinear dynamics of distribution feeders and ultimately give rise to large-scale service-voltage violations.
Recent studies have demonstrated that microgrids are especially vulnerable to coordinated cyberattacks targeting sensors, actuators, and communication channels. In parallel, learning-based control strategies have gained traction for enhancing voltage stability and dynamic performance, including DRL approaches for low-voltage ride-through in grid-connected photovoltaic systems [4] and transient stabilization of converter-dominated microgrids [5]. While these methods achieve improved nominal performance, they typically assume reliable state observations and do not explicitly address adversarial corruption of measurement signals.
To improve robustness, several reinforcement learning (RL) formulations have been proposed. Offline and constrained robust RL methods introduce safety margins or convex relaxations to reduce constraint violations under uncertainty [6]. However, offline robustness does not capture the evolving and strategic nature of online adversarial disturbances. Scalable online learning architectures, such as multi-agent DRL and decentralized control frameworks, have also been explored for large-scale power systems [7,8,9,10]. Despite their scalability, these approaches remain critically dependent on accurate system observations and are vulnerable to persistent measurement perturbations.
Recent advances further integrate physical insights into learning-based energy management and voltage control. Two-timescale voltage regulation strategies [11], reinforcement-learning-based energy management for networked microgrids [12,13], and physics-informed DRL frameworks [14] incorporate system dynamics and economic objectives to improve feasibility and interpretability. Hybrid DRL optimization approaches have also been proposed for unit commitment and microgrid scheduling [15,16]. Nevertheless, these methods continue to rely on clean or fully observable measurements and do not explicitly quantify the impact of adversarial observation distortion on policy behavior.
This limitation becomes critical in distribution networks governed by nonlinear voltage–power coupling. Under DistFlow dynamics, voltage magnitudes exhibit strong sensitivity to reactive power injections, while active power control influences voltage through cumulative effects along radial feeders [17]. Although DistFlow-based models admit convex relaxations under specific operating assumptions, such relaxations rely on accurate state information and consistent power-flow measurements. In adversarial settings, corrupted observations can distort the inferred power injections and invalidate the operating conditions required for reliable convex reformulations, thereby amplifying voltage deviations through feeders’ nonlinear dynamics. As demonstrated later in this work, adversarial perturbations in power measurements can interact with these physical sensitivities, leading to actuator saturation and systemic undervoltage across large-scale feeders. Despite encouraging results under standard operating conditions, most existing DRL-based voltage management methods either assume full observability or handle partial observability through conservative safety mechanisms without explicitly modeling adversarially structured measurement distortions.
Recent contributions in the literature have begun to explicitly address uncertainty and partial observability in learning-based voltage control for distribution networks. Stability- and safety-constrained RL frameworks impose stringent feasibility constraints to guarantee voltage security under uncertain operating conditions; however, these constraints typically restrict effective control authority and yield conservative control policies with degraded nominal performance [18]. Robust DRL formulations additionally incorporate uncertainty via adversarial or min–max optimization schemes, emphasizing worst-case voltage security but incurring increased power losses under typical operating regimes [19]. Similarly, partially observable RL methods that rely on limited local measurements employ conservative value-estimation mechanisms to preserve safety, thereby constraining policy optimality in the presence of persistent observation uncertainty [20]. Taken together, these studies suggest that current approaches to handling partial observability and robustness systematically trade operational efficiency for strong worst-case guarantees, thereby motivating the development of learning mechanisms that explicitly balance adversarial robustness with nominal performance in microgrid energy management.
Motivated by these challenges, this paper proposes an adversarially robust reinforcement learning framework for energy management with voltage regulation in microgrids under partial observability. The energy management system (EMS) decision-making problem is formulated as a partially observable Markov decision process (POMDP) that explicitly accounts for adversarial measurement perturbations, belief evolution, and system-level economic and voltage constraints. To mitigate excessive conservatism in worst-case robust designs, an adversary-aware belief construction based on adversarial belief balancing (A3B) is developed to emphasize perturbations that meaningfully affect control decisions. Building on this belief representation, adversarial counterfactual error (ACoE) is incorporated as a learning regularization mechanism to penalize policies that are highly sensitive to adversarial belief distortion. Beyond algorithmic robustness, adversarial perturbations in distribution systems can be conceptualized as sensor-level cyberattacks that corrupt the mapping from measured quantities to physical control actions. When such corruptions alter the precise power setpoints, the nonlinear power-flow characteristics of feeders can amplify these disturbances, potentially resulting in pronounced voltage depressions. This design facilitates a systematic and controlled evaluation of how identical sensor-level attack realizations propagate through power-flow dynamics under varying action parameterizations. Together, these components enable stable and resilient EMS operation under sustained adversarial measurement perturbations while respecting the physical and operational constraints of microgrids. The main contributions of this paper are summarized as follows:
  • We formulate an energy management problem under adversarial partial observability, where distorted measurement signals induce uncertainty in household operating states. The problem is modeled as a POMDP that explicitly integrates adversarial observation models, belief evolution, and system-level economic and voltage constraints.
  • An adversary-aware belief construction method is designed for the EMS based on A3B. It reinterprets adversarial measurements by emphasizing perturbations that meaningfully affect control decisions while suppressing unlikely or ineffective adversarial deviations. This approach mitigates excessive conservatism and preserves nominal performance under unperturbed operation.
  • An adversarially robust reinforcement learning framework that incorporates adversarial counterfactual error as a learning regularization mechanism is proposed. By penalizing policies that are highly sensitive to adversarial belief distortion, the resulting EMS can achieve a balance between nominal operating efficiency and robustness, enabling stable EMS operation under sustained adversarial measurement perturbations.
The remainder of this paper is organized as follows. Section 2 presents the cyber–physical system model and distribution-feeder formulation. Section 3 introduces the adversarial observation/attack model and the resulting belief-based problem formulation under partial observability. Section 4 describes the proposed method (A3B belief construction and ACoE-regularized policy optimization) and its implementation details. Section 5 reports empirical results on the IEEE 123-Node Test Feeder, including learning dynamics during training and feeder-wide voltage-quality metrics under matched adversarial sensing realizations across baselines. Section 6 interprets the results and summarizes practical implications and limitations for EMS deployment under adversarial partial observability. Finally, Section 7 summarizes the main contributions.

2. Materials and Methods

2.1. System Model

As shown in Figure 1, the system under study is modeled as a cyber–physical architecture that couples the electrical layer with a cyber-control layer through structured power and communication flows. The physical layer comprises generation, transmission, and distribution subsystems, which deliver electrical energy from upstream sources to downstream residential loads. For analysis and control purposes, the distribution network—comprising feeders, transformers, and residential nodes—is abstracted into a tractable mathematical representation. Specifically, the distribution feeder is modeled using a DistFlow formulation with buses indexed by i, j, and k, capturing nodal voltage magnitudes as well as active and reactive power flows. At the feeder level, household power injections are aggregated at each service transformer j via a service-transformer aggregation point, which computes the net active and reactive power exchanged with the upstream feeder. The service-transformer aggregation point denotes the secondary-side boundary where household net power injections are aggregated and where voltage compliance is enforced. Each household is modeled with a local load, photovoltaic (PV) generation, and a battery energy storage system (BESS), all of which are subject to inverter operational constraints.
The cyber-control layer operates on top of this physical model by collecting aggregated measurements and nodal state information through a communication network and dispatching control actions to household inverters. As illustrated by the dashed links in Figure 1, this communication infrastructure conveys telemetry from the measurement and aggregation interface to the policy controller, thereby defining the cyber channels through which control decisions are implemented. At the same time, these channels constitute a potential attack surface, where an adversary may manipulate or spoof transmitted measurements, influencing the control flow while leaving the underlying physical power-flow dynamics intact. In the following subsections, we present detailed formulations of the distribution feeder and service-transformer aggregation-point models, followed by the battery dynamics.

2.2. Distribution Feeder and Service-Transformer Aggregation Point Modeling

The system boundary is defined at the service-transformer level, which constitutes the interface between household-level energy resources and the upstream distribution feeder. Each transformer j ∈ T aggregates the electrical characteristics of downstream households indexed by the set H(j). Time is discretized into intervals t = 0, 1, …, enabling the evaluation of operational costs, feeder losses, and voltage compliance over time. This modeling choice separates local device-level control from feeder-level physical constraints, allowing household batteries to be controlled individually while enforcing voltage regulation and loss evaluation at the transformer level.
Household-level power exchanges follow a sign convention in which positive values denote injection into the grid. For each household n ∈ H(j), the net active and reactive power exchanged with the grid are given by
P_{t,\mathrm{grid},n} = -P_{t,l,n} + P_{t,s,n} + P_{t,b,n},
Q_{t,\mathrm{grid},n} = -Q_{t,l,n} + Q_{t,s,n} + Q_{t,b,n},
where P_{t,l,n} and Q_{t,l,n} denote household demand, P_{t,s,n} and Q_{t,s,n} represent distributed generation, and P_{t,b,n} and Q_{t,b,n} correspond to battery-inverter setpoints, with positive values indicating discharging and negative values indicating charging.
At the transformer level, household injections are consolidated through a service-transformer aggregation point to determine the net power exchanged with the upstream feeder:
P_{t,\mathrm{tot},j} = \sum_{n \in H(j)} P_{t,\mathrm{grid},n},
Q_{t,\mathrm{tot},j} = \sum_{n \in H(j)} Q_{t,\mathrm{grid},n}.
This aggregation paradigm reflects operational realities of distribution systems, in which feeder-level performance is monitored and regulated primarily via service-transformer aggregation points rather than through direct observation and control of individual devices.
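The household net-injection and transformer-level aggregation relations above can be sketched in a few lines. This is a minimal illustration, assuming the injection-positive sign convention of Section 2.2; the function and field names are illustrative and not taken from the paper's implementation.

```python
def household_net_injection(p_pv, p_batt, p_load):
    """Net active injection into the grid (positive = injection);
    the battery setpoint is positive when discharging."""
    return p_pv + p_batt - p_load

def transformer_totals(households):
    """Aggregate household injections at a service transformer.
    Each entry is a dict with illustrative keys p_pv, p_batt, p_load,
    q_pv, q_batt, q_load (not the paper's data structure)."""
    p_tot = sum(h["p_pv"] + h["p_batt"] - h["p_load"] for h in households)
    q_tot = sum(h["q_pv"] + h["q_batt"] - h["q_load"] for h in households)
    return p_tot, q_tot
```

A negative total indicates that the transformer draws net power from the upstream feeder, which is the typical evening operating regime for residential clusters.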
Feeder power flows and voltage propagation are modeled using the DistFlow (LinDistFlow) approximation [2,17], which assumes small voltage-angle differences and near-unity voltage magnitudes. In this context, for each feeder branch connecting nodes i and j, the active and reactive power balances are expressed as
P_{t,ij} = P_{t,L,j} + \sum_{k:\, j \to k} P_{t,jk} + r_{ij}\, l_{t,ij},
Q_{t,ij} = Q_{t,L,j} + \sum_{k:\, j \to k} Q_{t,jk} + x_{ij}\, l_{t,ij},
where r_{ij} and x_{ij} denote branch resistance and reactance, respectively, and l_{t,ij} represents the squared branch-current magnitude, normalized by the squared sending-end voltage (v_i^t)^2:
l_{t,ij} = \frac{(P_{t,ij})^2 + (Q_{t,ij})^2}{(v_i^t)^2}.
Voltage propagation along the radial feeder is characterized by
(v_j^t)^2 = (v_i^t)^2 - 2\left( r_{ij} P_{t,ij} + x_{ij} Q_{t,ij} \right) + \left( r_{ij}^2 + x_{ij}^2 \right) l_{t,ij},
which explicitly links aggregated household control actions to nodal voltage variations. Active power losses P_{t,\mathrm{loss},ij} on each branch follow directly from the current-squared term, enabling explicit evaluation of feeder efficiency:
P_{t,\mathrm{loss},ij} = \frac{r_{ij}}{(v_i^t)^2} \left[ (P_{t,ij})^2 + (Q_{t,ij})^2 \right].
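For intuition, the branch-flow and voltage-propagation relations can be exercised with a simplified backward/forward sweep on a single radial line. This sketch deliberately neglects the loss terms r_{ij} l_{t,ij} and x_{ij} l_{t,ij} in the flow sums (keeping the current-squared term only in the voltage update), so it is an approximation for illustration, not the OpenDSS solver used in the paper's experiments.

```python
import math

def distflow_line(v0_sq, branches, loads):
    """Simplified LinDistFlow sweep on a single radial line.

    branches[i] = (r, x) for the branch between node i and node i+1;
    loads[i] = (P_L, Q_L) consumed at node i+1. Loss terms are dropped
    from the flow sums for brevity but kept in the voltage update.
    Returns per-node voltage magnitudes, root first (per unit)."""
    n = len(branches)
    P, Q = [0.0] * n, [0.0] * n
    # Backward sweep: each branch carries the sum of downstream loads.
    for i in reversed(range(n)):
        pl, ql = loads[i]
        P[i] = pl + (P[i + 1] if i + 1 < n else 0.0)
        Q[i] = ql + (Q[i + 1] if i + 1 < n else 0.0)
    # Forward sweep: squared-voltage drop along the feeder.
    v_sq = [v0_sq]
    for i, (r, x) in enumerate(branches):
        l = (P[i] ** 2 + Q[i] ** 2) / v_sq[i]  # squared branch current
        v_sq.append(v_sq[i] - 2 * (r * P[i] + x * Q[i]) + (r * r + x * x) * l)
    return [math.sqrt(v) for v in v_sq]
```

Loading the end of the line depresses downstream voltages, mirroring the undervoltage mechanism discussed later for adversarially biased setpoints.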
Service-voltage compliance is enforced at the transformer level by constraining the voltage magnitudes within prescribed bounds:
v_{\min,j} \le v_j^t \le v_{\max,j},
which constitute the primary physical safety constraints governing voltage regulation in the distribution network. To evaluate feeder-level voltage quality, we use an ex-post voltage-deficit index. Let v_i(k) denote the voltage magnitude at bus i at evaluation step k, and let v_{\mathrm{ref}} be the minimum admissible service voltage. The index is defined as
\mathrm{def}(k) = \sum_{i=1}^{N_{\mathrm{buses}}} \max\left\{ 0,\; v_{\mathrm{ref}} - v_i(k) \right\},
which captures both the number of buses that violate the prescribed service threshold and the magnitude of the corresponding undervoltage deviations. This metric is used for comparative performance evaluation across controllers.
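The deficit index above is a direct sum of per-bus undervoltage depths and can be transcribed in one line; the default threshold of 0.95 p.u. below is an illustrative choice, not a value specified in this section.

```python
def voltage_deficit(v, v_ref=0.95):
    """Ex-post voltage-deficit index: sum over buses of the depth of any
    undervoltage below v_ref (all quantities in per unit). Buses at or
    above v_ref contribute zero, so the index grows with both the count
    and the severity of violations."""
    return sum(max(0.0, v_ref - vi) for vi in v)
```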
The system model and experimental evaluation in this paper focus on inverter-interfaced DERs (PV and battery inverters) for distribution-level voltage regulation. Extending the physical model and control action space to explicitly include dispatchable generators and combined heat-and-power (CHP) thermal coupling is left for future work.

2.3. Battery Dynamics

Battery dynamics couple short-term power regulation with long-term device health and degradation and, therefore, play a critical role in household-level control. For each household n, the battery state is characterized by the state of charge (SoC) s_n^t ∈ [0, 1] and a normalized battery health (state-of-health) variable ψ_n^t ∈ (0, 1]. The SoC variation over one time step is denoted by ω_n^t, where ω_n^t > 0 corresponds to charging and ω_n^t < 0 corresponds to discharging. Battery degradation is modeled on the DC side using an empirical cycle-aging formulation adapted from [21]. The incremental degradation variable E_n^t captures the wear induced by the SoC-variation magnitude and operating point, and the SoC and health states evolve according to
E_n^t = \psi_n^t \left( \alpha \left| \omega_n^t \right|^{\beta} + \mu \exp\!\left( \kappa \left| s_n^t + \frac{\omega_n^t}{2} - 0.5 \right| \right) \right), \qquad s_n^{t+1} = \operatorname{clip}\!\left( s_n^t + \omega_n^t,\; \underline{s}_n,\; \overline{s}_n \right), \qquad \psi_n^{t+1} = \psi_n^t - E_n^t\, \psi_n^t,
where α , β , μ , and κ are positive degradation parameters that control the sensitivity of battery aging to cycle depth and the SoC operating region. Specifically, β captures the nonlinear impact of cycle depth, and κ penalizes sustained operation away from the nominal mid-SoC region. This formulation enforces SoC feasibility while explicitly accounting for cumulative battery degradation in the control objective.
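A minimal sketch of one SoC/health update step under the cycle-aging formulation above. The parameter values are placeholders (the paper does not list them in this section), and the absolute value inside the exponential is an assumption about the exact form of the stated mid-SoC penalty.

```python
import math

def battery_step(s, psi, omega, s_min=0.0, s_max=1.0,
                 alpha=5e-4, beta=1.5, mu=1e-5, kappa=2.0):
    """One SoC/health update. alpha, beta, mu, kappa are placeholder
    degradation parameters; |.| in the exponential encodes the penalty
    for operating away from mid-SoC (an assumed functional form)."""
    E = psi * (alpha * abs(omega) ** beta
               + mu * math.exp(kappa * abs(s + omega / 2.0 - 0.5)))
    s_next = min(max(s + omega, s_min), s_max)  # clip to SoC bounds
    psi_next = psi - E * psi                    # health decays with wear
    return s_next, psi_next, E
```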
To interface the electrochemical battery model with the distribution network, we introduce a consistent mapping between AC-side power operation and DC-side battery variables. Let η inv dis and η inv ch denote inverter efficiencies during discharging and charging, respectively, and let η bat dis and η bat ch denote the corresponding battery efficiencies. The battery capacity is Ψ n (kWh), and Δ t denotes the time-step duration. The DC-side power associated with the AC-side active-power setpoint P t , b , n is given by
p_{t,b,n}^{\mathrm{dc}} = \begin{cases} \dfrac{P_{t,b,n}}{\eta_{\mathrm{inv}}^{\mathrm{dis}}}, & P_{t,b,n} > 0, \\[4pt] -\eta_{\mathrm{inv}}^{\mathrm{ch}} \left| P_{t,b,n} \right|, & P_{t,b,n} < 0. \end{cases}
Accordingly, the resulting SoC increment satisfies
\omega_n^t = \begin{cases} -\dfrac{P_{t,b,n}\, \Delta t}{\eta_{\mathrm{inv}}^{\mathrm{dis}}\, \eta_{\mathrm{bat}}^{\mathrm{dis}}\, \Psi_n}, & P_{t,b,n} > 0, \\[4pt] \dfrac{\eta_{\mathrm{inv}}^{\mathrm{ch}}\, \eta_{\mathrm{bat}}^{\mathrm{ch}} \left| P_{t,b,n} \right| \Delta t}{\Psi_n}, & P_{t,b,n} < 0. \end{cases}
This mapping enforces consistent sign conventions across the electrical and electrochemical domains and ensures physically meaningful energy accounting.
In certain control implementations, the policy output is defined directly in terms of the SoC increment ω n t rather than the AC-side power setpoint. This representation allows the controller to reason explicitly about electrochemical energy variation and battery health while interacting with the electrical network through physically realizable power injections. In such cases, the corresponding AC-side power command is recovered by inverting the SoC update relations, yielding
P_{t,b,n} = \begin{cases} -\dfrac{\omega_n^t\, \eta_{\mathrm{inv}}^{\mathrm{dis}}\, \eta_{\mathrm{bat}}^{\mathrm{dis}}\, \Psi_n}{\Delta t}, & \omega_n^t < 0 \ (\text{discharge}), \\[4pt] -\dfrac{\omega_n^t\, \Psi_n}{\eta_{\mathrm{inv}}^{\mathrm{ch}}\, \eta_{\mathrm{bat}}^{\mathrm{ch}}\, \Delta t}, & \omega_n^t > 0 \ (\text{charge}). \end{cases}
This inversion maintains dimensional consistency between electrochemical SoC changes and AC-side power commands while enforcing inverter-efficiency constraints and ensuring physically realistic charging and discharging rates.
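The AC/DC mapping and its inversion can be verified numerically; the efficiency values below are illustrative assumptions. A round trip (power → SoC increment → power) should recover the original setpoint, which is a useful sanity check on the sign conventions.

```python
def power_to_soc_delta(p_ac, dt_h, cap_kwh,
                       eta_inv_dis=0.96, eta_bat_dis=0.97,
                       eta_inv_ch=0.96, eta_bat_ch=0.97):
    """AC setpoint (kW, + = discharge, - = charge) -> SoC increment."""
    if p_ac > 0:   # discharging draws extra energy from the cells
        return -p_ac * dt_h / (eta_inv_dis * eta_bat_dis * cap_kwh)
    if p_ac < 0:   # charging stores less than the AC energy absorbed
        return eta_inv_ch * eta_bat_ch * abs(p_ac) * dt_h / cap_kwh
    return 0.0

def soc_delta_to_power(omega, dt_h, cap_kwh,
                       eta_inv_dis=0.96, eta_bat_dis=0.97,
                       eta_inv_ch=0.96, eta_bat_ch=0.97):
    """Inverse mapping: SoC increment -> AC setpoint (kW)."""
    if omega < 0:  # discharge: positive AC power
        return -omega * eta_inv_dis * eta_bat_dis * cap_kwh / dt_h
    if omega > 0:  # charge: negative AC power
        return -omega * cap_kwh / (eta_inv_ch * eta_bat_ch * dt_h)
    return 0.0
```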

3. Attack Model

3.1. Energy Management Under Adversarial Partial Observability

The energy management problem considered in this work is subject to both physical constraints imposed by the distribution network and information uncertainty arising from adversarial manipulation of measurement channels. In particular, the EMS does not have direct access to the true physical state. Instead, it must rely on measurements that may be deliberately corrupted by an adversary. As a result, the problem is formulated as a POMDP, where the EMS selects inverter control actions based on imperfect observations to minimize long-term operational costs while satisfying network- and device-level constraints. This section formalizes the energy management problem as a POMDP by introducing an observation model with adversarial measurements and the associated belief-update mechanism, which together characterize the system under adversarial partial observability.

3.2. Sets, States, Actions, and Notation

We first define the state and action variables that describe the interaction between the EMS and the distribution system. The distribution system consists of a set of service transformers indexed by j ∈ T, where each transformer j supplies a collection of households indexed by H(j). For each household n ∈ H(j), the true physical state at time t is denoted by x_n^t and is represented using a formulation that captures generation, demand, and battery conditions relevant to control. Specifically,
x_n^t = \left[\, P_{t,s,n} \;\; Q_{t,s,n} \;\; P_{t,l,n} \;\; Q_{t,l,n} \;\; s_n^t \;\; \psi_n^t \;\; v_n^t \,\right]^{\top}.
The control action applied by the EMS at time t is denoted by a n t and consists of the active and reactive power setpoints issued to the household battery inverter, given by
a_n^t = \left[\, P_{t,b,n} \;\; Q_{t,b,n} \,\right]^{\top}.
To ensure physical realizability, the action space is constrained by inverter operating limits, including bounds on apparent power, current, and admissible power factor. These constraints collectively define the feasible action set A n , within which all admissible control actions are required to satisfy
\left( P_{t,b,n} \right)^2 + \left( Q_{t,b,n} \right)^2 \le S_{\mathrm{rated},n}^2,
\left( P_{t,b,n} \right)^2 + \left( Q_{t,b,n} \right)^2 \le 3 \left( V_{LL,n}^t \right)^2 I_{\max,n}^2,
\left| Q_{t,b,n} \right| \le \left| P_{t,b,n} \right| \tan \phi_{\max},
where S rated , n denotes the inverter apparent-power rating, I max , n is the maximum allowable current, V L L , n t is the line-to-line voltage, and ϕ max is the maximum permissible power-factor angle. All inverter constraints are expressed in physical units, aligning with the network model and the OpenDSS power-flow solver used in this study.
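A feasibility check over the three inverter limits follows directly from the constraints above. The sketch below takes all arguments in physical units; the numeric values used to exercise it are illustrative.

```python
import math

def action_feasible(p, q, s_rated, v_ll, i_max, phi_max):
    """Check the apparent-power, current, and power-factor limits for a
    candidate (P, Q) inverter setpoint; phi_max is in radians."""
    s_sq = p * p + q * q
    return (s_sq <= s_rated ** 2
            and s_sq <= 3.0 * v_ll ** 2 * i_max ** 2
            and abs(q) <= abs(p) * math.tan(phi_max))
```

In practice, a controller would project an infeasible action onto the boundary of this set rather than reject it, but the membership test is the building block either way.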

3.3. Observation Model with Adversarial Measurements

Due to potential adversarial manipulation, the EMS does not have direct access to the true physical state x n t and instead relies on local measurements communicated through a cyber layer. In modern microgrids, such information is typically provided by advanced metering infrastructure (AMI) and smart meters, which provide redundant measurements of nodal voltages and active/reactive power injections, thereby improving observability and enabling state estimation under normal operating conditions. However, in practice, smart meter deployment may not achieve full network-wide coverage, and certain nodes or devices may remain unmonitored. Moreover, in adversarial environments, the integrity of measurement data cannot be fully guaranteed. If sensing or communication channels are compromised, the EMS may rely on biased or corrupted observations rather than the true physical state. Consequently, the EMS operates on potentially corrupted or incomplete observations instead of the true system state, leading to residual partial observability and motivating the adoption of a belief-based POMDP framework.
Therefore, we employ an aggregated EMS-level observation that is compatible with the simulator environment. At time t, the observation provided to the control policy is represented as a four-dimensional vector:
o^t = \left[\, P_{t,s} \;\; P_{t,l} \;\; Q_{t,s} \;\; Q_{t,l} \,\right]^{\top} + \eta^t, \qquad \left\| \eta^t \right\|_{\infty} \le \epsilon,
where P_{t,s} and Q_{t,s} denote the aggregated measured active and reactive power injections from DERs, and P_{t,l} and Q_{t,l} denote the aggregated measured active and reactive load, respectively. In the proposed implementation, aggregation is carried out by summing the corresponding household-level quantities over all controlled units or, equivalently, by using transformer-level totals, depending on the simulator configuration. The perturbation process η^t is modeled as memoryless and independently realized at each discrete time step and is implemented as i.i.d. componentwise bounded noise (e.g., η_i^t ∼ Uniform([−ϵ, ϵ])).
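The bounded i.i.d. perturbation model can be sketched as follows; the channel ordering follows the observation vector above, and `eps` plays the role of the bound ϵ. Names are illustrative, not the paper's code.

```python
import random

def observe(p_pv, p_load, q_pv, q_load, eps, rng=random):
    """EMS-level observation: the clean 4-vector plus i.i.d. bounded
    noise, with one independent draw per channel and per time step."""
    clean = [p_pv, p_load, q_pv, q_load]
    return [c + rng.uniform(-eps, eps) for c in clean]
```

Setting `eps = 0` recovers the nominal (unperturbed) observation, which is the baseline against which robustness is later evaluated.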
From a cyber–physical-systems perspective, this adversarial observation model is equivalent to a sensor-level attack targeting the measurement layer of the energy management system. Such an attack induces partial observability and systematically biased control actions, which subsequently propagate through and are amplified by the power-flow dynamics. The resulting EMS-level observation vector is four-dimensional, given by [P_{t,s},\, P_{t,l},\, Q_{t,s},\, Q_{t,l}]^{\top} + \eta^t.

3.4. Economic Signal and Immediate Cost Structure

Economic incentives are incorporated into the energy management problem through a time-varying electricity price signal γ t , which represents the unit cost of active energy exchanged with the upstream grid. This price signal couples economic optimality with physical grid operation and directly influences inverter control decisions. At the household level, for household n at time t, the immediate cost is defined as
r_n^t\!\left( x_n^t, a_n^t \right) = -P_{t,\mathrm{grid},n}\, \gamma_t + c_Q\!\left( Q_{t,\mathrm{grid},n} \right) + \varepsilon_n E_n^t,
where the first term captures the economic cost (or revenue) associated with active power consumption (or injection) under the injection-positive sign convention, the second term represents optional penalties related to reactive power usage or power-factor violations, and the third term penalizes battery degradation through the health-related increment E_n^t, scaled by ε_n. This formulation prevents short-term economic optimization from inducing excessive long-term battery wear. In scenarios where reactive power penalties are neglected, the function c_Q(·) is set to zero, and the household cost reduces to the standard energy-exchange form \left[ -P_{t,\mathrm{grid},n}\, \gamma_t + \varepsilon_n E_n^t \right].
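The immediate household cost can be transcribed directly, assuming the injection-positive convention of Section 2.2 (consumption, i.e., negative grid injection, incurs a positive energy charge). The default reactive-penalty function and the degradation weight below are placeholder assumptions.

```python
def household_cost(p_grid, q_grid, price, deg_increment,
                   c_q=lambda q: 0.0, eps_deg=100.0):
    """Immediate household cost with injection-positive p_grid:
    consumption (p_grid < 0) is charged, injection earns revenue.
    c_q (reactive penalty) and eps_deg (degradation weight) are
    illustrative placeholders, not values from the paper."""
    return -p_grid * price + c_q(q_grid) + eps_deg * deg_increment
```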
In addition to household-level costs, network-level efficiency is explicitly considered through feeder losses evaluated at each service-transformer aggregation point. For transformer j, the instantaneous loss-related cost is defined as the sum of active power losses over all feeder branches upstream of that transformer:
r_{\mathrm{loss},j}^t = \sum_{(i \to j)} P_{t,\mathrm{loss},ij} = \sum_{(i \to j)} \frac{r_{ij}}{(v_i^t)^2} \left[ \left( P_{t,ij} \right)^2 + \left( Q_{t,ij} \right)^2 \right].
This expression explicitly links control actions to physical network losses through current magnitudes and voltage levels. When the impact of reactive power is negligible, feeder losses may be approximated using an aggregated model:
r_{\mathrm{loss},j}^t \approx \frac{R_{\mathrm{eq},j}}{V_0^2} \left( P_{t,\mathrm{tot},j} \right)^2,
where R eq , j is the equivalent feeder resistance and V 0 denotes the nominal voltage magnitude. This approximation enables scalable analysis while preserving the dominant dependence of losses on transformer-level active power injections.
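The aggregated model reduces to a one-line quadratic in the transformer-level active injection; the numeric inputs below are illustrative.

```python
def feeder_loss_approx(p_tot, r_eq, v0):
    """Aggregated feeder-loss model: quadratic in the transformer-level
    active power injection p_tot, with equivalent feeder resistance
    r_eq and nominal voltage magnitude v0. Losses are even in p_tot:
    import and export of the same magnitude dissipate equally."""
    return r_eq / v0 ** 2 * p_tot ** 2
```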
In this context, the feeder-loss term is not amplified by the electricity price signal; $r_{\mathrm{loss},j}^{t}$ represents a physical efficiency penalty rather than a direct economic cost. By modeling feeder losses separately from regulated energy pricing, the framework maintains a clear separation between household-level financial incentives and system-level physical performance. Monetization of losses may be incorporated by scaling $r_{\mathrm{loss},j}^{t}$ with an appropriate price factor; however, this is not essential for assessing the voltage-regulation and resilience objectives discussed in this study.
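To make the cost structure concrete, the household stage cost and the aggregated loss approximation can be sketched as follows. This is an illustrative Python sketch; the function names and the quadratic form of the reactive-power penalty are assumptions for illustration, not the paper's implementation.

```python
def household_cost(p_grid, q_grid, price, battery_wear, eps_n, c_q=0.0):
    """Household stage cost: energy cost + optional reactive penalty + degradation.

    c_q weights an illustrative quadratic reactive-power penalty; setting it
    to zero recovers the DC energy-exchange form discussed in the text.
    """
    return p_grid * price + c_q * q_grid**2 + eps_n * battery_wear

def feeder_loss_approx(p_tot, r_eq, v_nom):
    """Aggregated feeder-loss approximation: r_loss ~ (R_eq / V0^2) * P_tot^2."""
    return (r_eq / v_nom**2) * p_tot**2

# Example: 2 kW import at a 0.10 $/kWh step price, no reactive penalty
cost = household_cost(p_grid=2.0, q_grid=0.5, price=0.10, battery_wear=0.01, eps_n=1.0)
loss = feeder_loss_approx(p_tot=10.0, r_eq=0.05, v_nom=1.0)
```

With `c_q=0.0` the household cost reduces to the DC energy-exchange form, mirroring the simplification stated above.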

3.5. Exogenous Processes and State Transition

Uncertainty in the energy management problem arises predominantly from exogenous stochastic processes linked to photovoltaic (PV) power generation and residential electrical-load demand. These processes are not directly controllable by the EMS and exhibit random temporal evolution. For each household n, photovoltaic generation and load demand are modeled as stochastic processes with temporal dependence. The active power production of photovoltaic systems is contingent on external factors such as solar irradiation and weather conditions, whereas the corresponding reactive-power component is derived from inverter operating and control parameters. Consequently, both active and reactive power components are incorporated in the state description to maintain alignment with power-flow models:
$$ \left(P_{t+1,s,n},\, Q_{t+1,s,n}\right) \sim \mathcal{P}_t^{(s)}\!\left(\,\cdot \mid P_{t,s,n},\, Q_{t,s,n}\right), \qquad \left(P_{t+1,l,n},\, Q_{t+1,l,n}\right) \sim \mathcal{P}_t^{(l)}\!\left(\,\cdot \mid P_{t,l,n},\, Q_{t,l,n}\right), $$
where $\mathcal{P}_t^{(s)}$ and $\mathcal{P}_t^{(l)}$ include temporal correlations, daily patterns, and external variability in inverter-based photovoltaic operation and residential demand, respectively.
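As a minimal illustration of such a transition kernel, the sketch below draws the next PV output from a mean-reverting process around a diurnal clear-sky profile. The AR(1) structure, the sinusoidal profile, and all parameter values are illustrative assumptions, not the stochastic model used in the paper.

```python
import math, random

def next_pv_power(p_now, t, rho=0.9, sigma=0.05, rng=random):
    """One illustrative draw from a PV transition kernel P_t^(s):
    mean reversion toward a hypothetical diurnal clear-sky profile."""
    clear_sky = max(0.0, math.sin(math.pi * (t % 24) / 24))  # daily pattern
    mean = rho * p_now + (1 - rho) * clear_sky               # temporal correlation
    return max(0.0, mean + sigma * rng.gauss(0, 1))          # bounded-below noise
```

A load-demand kernel $\mathcal{P}_t^{(l)}$ could be sketched analogously with a different daily profile.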
The transition of the true physical state is governed by system dynamics, including battery dynamics, feeder power-flow relationships, and exogenous stochastic processes such as photovoltaic generation and load demand. These dynamics collectively induce a state transition $T\!\left(x_n^{t+1} \mid x_n^t, a_n^t\right)$, which specifies the conditional probability of transitioning from state $x_n^t$ to state $x_n^{t+1}$ under control action $a_n^t$. In particular, the stochastic components of $T$ arise from the exogenous processes, while the remaining state components evolve deterministically according to the physical models presented in Section 2.1.
Together, the physical power-flow dynamics, device-level constraints, stochastic exogenous processes, and adversarial observation model define the evolution of the true system state $x_n^t$ and the information available to the EMS. This separation between system dynamics and information structure motivates the formulation of the energy management problem in terms of belief states, which represent the controller’s probabilistic understanding of the underlying physical system.

3.6. Belief State and Update Under Adversarial Partial Observability

Under partial observability, the EMS does not have direct access to the true physical state $x_n^t$ of the system. Instead, it maintains a belief state $b_n^t(x)$, i.e., a probability distribution over the state space that captures the controller’s uncertainty about the current operating condition of household n. This belief is updated recursively based on the prior belief, the executed control action, and the received (possibly corrupted) observation.
In our IEEE-123 implementation, the EMS acts as the decision-making controller but does not directly observe the full physical state (e.g., feeder-wide voltages and internal device states). Instead, it relies on the adversarially perturbed four-dimensional power-measurement observation model defined in Section 3.2 (21). Feeder-wide quantities such as per-bus voltages are computed via OpenDSS and recorded only as auxiliary outputs (the simulator info dictionary) for evaluation; they are not included in the policy observation.
Observations available to the EMS are generated from the true state through an adversarial measurement model. Because observations are subject to bounded, memoryless (stepwise i.i.d.) adversarial perturbations, we model the observation likelihood as $P_o\!\left(o^t \mid x, a\right)$, which captures the probability of receiving $o^t$ given the underlying state $x$ and control action $a$ under the $\ell_\infty$-bounded disturbance model.
Although the belief update is formulated using standard Bayesian filtering, its role in this context is fundamentally different because the observation likelihood function is explicitly modeled to account for adversarial perturbations. Given a prior belief $b_n^t(x_n^t)$, a control action $a_n^t$, and a received observation $o^t$, the belief is updated according to the Bayesian filtering equation:
$$ b_n^{t+1}\!\left(x_n^{t+1}\right) = \frac{ P_o\!\left(o^t \mid x_n^{t+1}, a_n^t\right) \sum_{x_n^t} T\!\left(x_n^{t+1} \mid x_n^t, a_n^t\right) b_n^t\!\left(x_n^t\right) }{ \sum_{x_n^{t+1}} P_o\!\left(o^t \mid x_n^{t+1}, a_n^t\right) \sum_{x_n^t} T\!\left(x_n^{t+1} \mid x_n^t, a_n^t\right) b_n^t\!\left(x_n^t\right) }. $$
This belief update captures uncertainty arising from stochastic exogenous processes, unobserved internal states, and adversarial corruption of measurement channels. By propagating beliefs rather than point estimates, the EMS can reason probabilistically about future system evolution and select control actions that remain effective under incomplete or corrupted information.
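For a finite (sampled) state set, the filtering equation above can be sketched directly. This is an illustrative Python sketch; `transition` and `obs_lik` are hypothetical callables standing in for $T$ and $P_o$.

```python
def belief_update(belief, transition, obs_lik, action, obs):
    """Discrete Bayes filter: predict with T(x'|x,a), correct with P_o(o|x',a).

    belief: dict mapping state -> probability over a finite candidate set.
    """
    states = list(belief)
    # Prediction step: propagate belief mass through the transition model
    predicted = {x2: sum(transition(x2, x1, action) * belief[x1] for x1 in states)
                 for x2 in states}
    # Correction step: weight by the observation likelihood, then normalize
    unnorm = {x2: obs_lik(obs, x2, action) * predicted[x2] for x2 in states}
    z = sum(unnorm.values())
    return {x2: v / z for x2, v in unnorm.items()}
```

In an adversary-aware setting, `obs_lik` would encode the bounded-perturbation observation model rather than pure sensor noise.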

3.7. Global Operator Objective at Service-Transformer Aggregation Points

This subsection defines the system-level objective used to evaluate and coordinate household-level control actions through service-transformer aggregation. Although control actions are implemented locally at individual devices, their aggregate effect is evaluated at the service-transformer aggregation points with respect to economic performance, feeder power losses, and adherence to operational constraints, in alignment with established practices in distribution network operation. Let $\gamma \in (0,1)$ denote the discount factor. The system operator seeks to minimize the expected discounted cumulative cost, aggregated at these points, over an infinite time horizon. The objective accounts for household-level operating costs and feeder-level efficiency losses and is given by:
$$ \min_{\{a_n^t\}} \ \mathbb{E}_{x_n^t \sim b_n^t}\!\left[ \sum_{t=0}^{\infty} \gamma^t \sum_{j \in \mathcal{T}} \left( \sum_{n \in H(j)} r_n^t + r_{\mathrm{loss},j}^t \right) \right]. $$
The above optimization problem is subject to physical and operational constraints that must be satisfied at all time steps t, for all households n, transformers j, and feeder branches ( i , j ) . These constraints include:
  • Battery dynamics and health:
    Battery state-of-charge and degradation evolve according to the battery model, including the associated AC–DC power mappings, enforcing $\underline{s}_n \le s_n^{t+1} \le \overline{s}_n$.
  • Inverter feasibility: Apparent power, current, and power-factor limits are enforced via the inverter constraints.
  • Feeder physics and losses: Power-flow relations and line-loss models govern how residential power injections propagate along the distribution feeder.
  • Transformer aggregation: The active and reactive power injections originating from individual households are aggregated at transformer-level nodes, thereby enabling systematic analysis and performance evaluation at the distribution feeder scale.
  • Voltage limits: The voltage magnitudes at transformer buses are constrained to remain within predefined admissible bounds in order to ensure compliance with power quality standards and the provision of an acceptable level of service.
  • Robust beliefs and optimality: Belief evolution and decision-making follow the robust belief-update rules and the Bellman operator.
The objective is structurally distinct from steady-state energy management formulations [3]. Classical energy management solves an optimization problem over a deterministic or estimated state vector. By contrast, the formulation in this work optimizes a sequence of control actions whose expected performance depends on belief-conditioned state realizations $x_n^t \sim b_n^t$. Decision-making is therefore inherently sequential and partially observable, as actions are conditioned on time-varying belief states rather than fully observed system states. Thus, the above formulation characterizes a belief-based energy management problem under adversarial partial observability. While the system objective and physical constraints are defined, solving this problem presents several fundamental challenges. First, the EMS does not have direct access to the true household operating states, and belief evolution is driven by measurements that may be adversarially perturbed, leading to uncertainty that propagates over time. Second, naively accounting for all admissible perturbations within the uncertainty set can result in overly conservative control actions, significantly degrading nominal (unperturbed) performance. Finally, standard belief-based or robust learning approaches that optimize only the expected operating cost may yield control policies that are highly sensitive to counterfactual observation distortions, causing severe performance degradation when adversarial perturbations persist.
These challenges motivate the design of an integrated solution framework that simultaneously tackles adversarial partial observability, physical feasibility requirements, and the semantic meaning of control actions in microgrid energy management. They require a framework that (i) reconstructs the true system state from adversarially corrupted measurements without resorting to excessive conservatism; and (ii) learns control policies that balance nominal performance with resilience to adversarial distortions of the belief state over time. The next section introduces such a framework, which couples adversary-aware belief construction with robustness-regularized value learning.

4. Proposed Method

4.1. Overview

To address the challenges identified in the problem formulation, this section develops a learning-based solution framework tailored to energy management under adversarial partial observability. The key challenge is not only to estimate the partially observable system state from adversarial measurements but also to ensure that the resulting control policy remains effective when belief trajectories are distorted over time without sacrificing nominal operational performance.
The proposed solution consists of three tightly coupled components. First, an adversary-aware belief construction mechanism based on A3B reinterprets adversarial measurements. It prioritizes perturbations that meaningfully affect control decisions while suppressing unlikely or ineffective deviations, which mitigates excessive conservatism and preserves unperturbed performance. Second, a robust value-learning formulation incorporating ACoE penalizes policies whose long-term performance is highly sensitive to adversarial belief distortion, thereby promoting stability under sustained attacks. Third, the resulting belief-based robust control policy is trained using a proximal policy optimization (PPO)-based actor-critic implementation, enabling scalable learning and real-time inverter dispatch under practical EMS constraints. Together, these components provide a unified solution that balances nominal cost efficiency and robustness, allowing the EMS to operate reliably despite adversarial measurement perturbations while respecting physical and operational constraints. For system-level decision-making under adversarial partial observability, we adopt a joint representation of the EMS decision process.
To maintain consistency with Section 3.1, the household-specific variables $(x_{t,n}, o_{t,n}, a_{t,n})$ are collected into the joint vectors $(x_t, o_t, a_t)$. This yields a compact and tractable formulation of belief formation and value learning. Let $x_t$ denote the joint system state at time t, which aggregates the electrical, operational, and battery-related states of all households connected to the distribution network. The joint observation $o_t$ collects all measurements available to the EMS at time t, potentially corrupted by adversarial perturbations. Under partial observability, the EMS maintains a belief $b_t$, defined as a probability distribution over the joint state space. The joint control action $a_t$ represents the collection of inverter setpoints issued to all households and may be factorized into individual household-level actions during execution.

4.2. Adversary-Aware Belief Construction via A3B

During real-time EMS operation, the controller receives joint system measurements $o^t$ transmitted over a potentially compromised cyber layer, while the true joint system state $x^t$ is not directly observable. Under the $\ell_\infty$-bounded and memoryless (stepwise i.i.d.) attack model introduced in Section 3.1, a naive robust design would treat all states within the uncertainty set as equally plausible. Such worst-case reasoning often leads to overly conservative dispatch decisions, as the EMS may sacrifice nominal (unperturbed) performance to guard against perturbations that are technically feasible but unlikely to be selected by an adversary because they have little impact on the control policy. Additionally, the A3B mechanism used in this study is grounded in the adversarial belief-balancing principle originally proposed by Belaire et al. [1]. In the present work, this principle is extended and specialized to accommodate continuous power-flow state variables, inverter-level control inputs, and voltage-regulation dynamics subject to physical network constraints.
To address this issue, we adopt A3B to construct an adversary-aware belief that emphasizes effective attacks while downweighting perturbations that are unlikely adversarial choices under the current EMS policy. Intuitively, a candidate true state should receive a low weight if the received observation o t is not a meaningful (policy-impacting) perturbation of that state. By concentrating belief mass on states for which the received observation o t can plausibly be explained as an effective attack, A3B enables the EMS to balance unperturbed performance with robustness under adversarial partial observability.
Given the attacked joint observation o t , we define the set of candidate true states consistent with bounded and memoryless measurement disturbance as:
$$ \mathcal{N}_\epsilon(o^t) \triangleq \left\{ x^t \in \mathcal{X} : \left\| x^t - o^t \right\|_\infty \le \epsilon \right\}, $$
where $\mathcal{X}$ denotes the joint system state space. In implementation, we approximate belief mass locally via Monte Carlo sampling within $\mathcal{N}_\epsilon(o^t)$ rather than via global discretization. We use a small Monte Carlo sample set per time step and approximate the inner maximization using a fixed-budget projected-gradient routine. The local sampling radius used for belief construction is chosen separately from the adversarial evaluation budget used for stress testing. From a complexity standpoint, online A3B scoring adds computation beyond a single policy forward pass: it requires evaluating KL terms over sampled candidates in $\mathcal{N}_\epsilon(o^t)$ and running a projected-gradient routine to approximate the normalization term. This overhead scales with the number of local candidates and the number of gradient steps used in the inner maximization, but is parallel across candidates and can be reduced via fewer samples, fewer gradient steps, warm starts, or adaptive triggering when telemetry quality is high.
Let $\pi_\theta(\cdot \mid x^t)$ denote the EMS control policy parameterized by $\theta$ and evaluated at a candidate joint state $x^t$. For each $x^t \in \mathcal{N}_\epsilon(o^t)$, A3B assigns a policy-sensitivity score:
$$ z(x^t) \triangleq \frac{ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid o^t) \,\middle\|\, \pi_\theta(\cdot \mid x^t) \right) }{ D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid \nu(x^t)) \,\middle\|\, \pi_\theta(\cdot \mid x^t) \right) }, \quad (29) $$
$$ \nu(x^t) \triangleq \arg\max_{\left\| x - x^t \right\|_\infty \le \epsilon} D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_\theta(\cdot \mid x^t) \right). $$
To approximate the inner maximization in (29), we do not solve for $\nu(x^t)$ in closed form and do not claim an exact analytical supremum over the $\ell_\infty$ ball. Because the KL objective is induced by a neural network policy, the optimization landscape is generally non-convex, and projected gradient methods cannot guarantee convergence to a global maximizer. Instead, for each sampled candidate state $x^t \in \mathcal{N}_\epsilon(o^t)$, we approximate $\nu(x^t)$ via an $\ell_\infty$-constrained projected gradient ascent (PGD) routine on the KL objective using automatic differentiation. Starting from a randomly perturbed initialization within $\mathcal{N}_\epsilon(x^t)$, the routine performs a fixed budget of ascent steps along the sign of the KL gradient and projects each iterate back onto $\mathcal{N}_\epsilon(x^t)$ via componentwise clipping; among the iterates, the one achieving the largest KL value is retained as the approximate maximizer. Accordingly, the denominator of (29) is treated as a finite-step approximation of the local worst-case policy deviation rather than an exact closed-form quantity. The fixed iteration budget, rather than an adaptive stopping criterion, ensures reproducible and bounded computation during training and evaluation.
Implementation details such as the iteration budget, step-size selection, and numerical safeguards used in the normalization term are provided in the experimental setup section. In addition, to reduce sensitivity to finite-sample stochasticity in the local candidate set, we compute the normalization term using the maximum achieved KL value over the locally sampled admissible perturbations at each time step (under the same $\ell_\infty$ constraint), which empirically yields stable normalization across rollouts.
The convergence behavior of PGD and its effect on wall-clock time are examined. While PGD carries no guarantee of reaching the global maximizer, in our experiments the KL objective generally rises quickly over the first few iterations and then yields progressively smaller gains, which supports using a fixed optimization budget rather than a convergence-based stopping criterion. From a runtime standpoint, enabling A3B adds a bounded per-step overhead: for each environment step, the additional computation consists of (i) evaluating the policy to compute KL terms for a fixed local Monte Carlo candidate set and (ii) running $I$ PGD iterations for the inner maximization. Since both the candidate count and $I$ are fixed, the additional wall-clock time scales linearly with these two budgets and is therefore predictable; as noted above, it is also parallel across candidates.
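The fixed-budget projected sign-gradient routine can be sketched as follows. This is illustrative only: the paper differentiates a KL objective via automatic differentiation, whereas this dependency-free sketch uses a central-difference gradient on a generic objective.

```python
def pgd_linf(objective, x0, eps, step, iters, delta=1e-4):
    """Fixed-budget projected sign-gradient ascent on an l_inf ball around x0.

    Returns the best iterate seen and its objective value (mirroring the
    best-iterate selection described in the text).
    """
    def grad(x):
        # Central-difference gradient (stand-in for automatic differentiation)
        g = []
        for i in range(len(x)):
            xp, xm = list(x), list(x)
            xp[i] += delta; xm[i] -= delta
            g.append((objective(xp) - objective(xm)) / (2 * delta))
        return g

    x = list(x0)
    best_x, best_val = list(x), objective(x)
    for _ in range(iters):
        g = grad(x)
        x = [xi + step * (1 if gi >= 0 else -1) for xi, gi in zip(x, g)]
        # Projection: componentwise clipping back onto the l_inf ball
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
        val = objective(x)
        if val > best_val:
            best_x, best_val = list(x), val
    return best_x, best_val
```

With a convex toy objective the routine walks to a corner of the ball in a few steps, consistent with the fast-then-saturating behavior described above.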
In (29), the numerator measures the policy deviation induced by the received observation relative to a candidate state $x^t$, while the denominator normalizes this deviation by an estimated maximum policy shift within the same $\ell_\infty$ neighborhood (approximated via the PGD routine above). Thus, the denominator captures the local worst-case policy sensitivity around $x^t$ under the attack constraint; in practice, this quantity is approximated via an $\ell_\infty$-constrained PGD routine and evaluated over sampled admissible perturbations within the neighborhood. Consequently, $z(x^t)$ is small when the received observation $o^t$ is not an effective perturbation of $x^t$, and such candidates are downweighted as implausible adversarial choices under the current EMS policy. The resulting reweighted belief $\tilde{b}^t$ is computed via the softmax normalization in (31) and can be propagated using (32).
Based on these scores, the belief mass is reallocated via a softmax weighting:
$$ \tilde{b}^t(x^t) = \frac{ \exp\!\left( z(x^t) \right) }{ \sum_{u \in \mathcal{N}_\epsilon(o^t)} \exp\!\left( z(u) \right) }, \quad x^t \in \mathcal{N}_\epsilon(o^t), \quad (31) $$
with $\tilde{b}^t(x^t) = 0$ for $x^t \notin \mathcal{N}_\epsilon(o^t)$.
Conceptually, the adversary-aware belief can be propagated through the joint system transition dynamics induced by physical models and exogenous processes as follows:
$$ b^{t+1}\!\left(x^{t+1}\right) = \frac{ \sum_{x^t \in \mathcal{X}} T\!\left(x^{t+1} \mid x^t, a^t\right) \tilde{b}^t(x^t) }{ \sum_{\bar{x}^{t+1} \in \mathcal{X}} \sum_{x^t \in \mathcal{X}} T\!\left(\bar{x}^{t+1} \mid x^t, a^t\right) \tilde{b}^t(x^t) }. \quad (32) $$
The updated belief state $b^{t+1}$ functions as the sufficient information state for subsequent robust value evaluation and control. The propagation equation above formalizes a Bayesian belief update over the state space. In implementation, belief mass is maintained only over a locally sampled candidate set, and no exhaustive summation over the full space $\mathcal{X}$ is performed.
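The A3B scoring and softmax reallocation over a sampled candidate set can be sketched as follows. This is illustrative: policies are represented as discrete action distributions, and a small floor constant guards against a vanishing denominator.

```python
import math

def kl(p, q):
    """KL divergence between two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def a3b_score(pi_obs, pi_cand, pi_nu, floor=1e-8):
    """Policy-sensitivity score z(x): the observation-induced policy shift,
    normalized by the (approximate) local worst-case shift at nu(x)."""
    return kl(pi_obs, pi_cand) / max(kl(pi_nu, pi_cand), floor)

def a3b_weights(scores):
    """Softmax reallocation of belief mass over sampled candidates
    (numerically stabilized by subtracting the maximum score)."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Candidates whose scores are small (the received observation barely moves the policy relative to them) receive little belief mass, matching the downweighting described above.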

4.3. Robust Value Learning with ACoE Regularization

While A3B constructs an adversary-aware belief that prioritizes policy-impacting perturbations, robustness at the belief level alone is insufficient to guarantee reliable EMS operation over time. In particular, a policy optimized solely to minimize the belief-averaged operating cost may still exhibit substantial performance degradation when adversarial observation distortions persist and propagate through successive belief updates. Such sensitivity is undesirable in energy management systems, where control actions must remain effective under sustained measurement corruption.
To address this challenge, we incorporate ACoE into the value-learning process. Rather than redefining the underlying energy management objective, ACoE acts as a learning regularization signal that penalizes policies whose long-term cost is highly sensitive to adversarial belief distortion. This encourages dispatch strategies that achieve a principled balance between nominal operating efficiency and robustness under adversarial partial observability.

4.3.1. Belief-Based Cost Evaluation

For notational clarity, we define the system-level stage cost at time t as
$$ r_{\mathrm{sys}}^t(x^t, a^t) \triangleq \sum_{j \in \mathcal{T}} \left( \sum_{n \in H(j)} r_n^t + r_{\mathrm{loss},j}^t \right). $$
Given the adversary-aware belief $b^t$ constructed via A3B, the EMS evaluates the expected instantaneous cost of applying a joint inverter control action $a^t$ as follows:
$$ \mathbb{E}_{x^t \sim b^t}\!\left[ r_{\mathrm{sys}}^t(x^t, a^t) \right] = \sum_{x^t \in \mathcal{X}} b^t(x^t)\, r_{\mathrm{sys}}^t(x^t, a^t), $$
which accounts for uncertainty in the physical system state induced by partial observability and adversarial measurement perturbations.

4.3.2. Counterfactual Robustness Signal

To quantify the sensitivity of a control policy to adversarial belief distortion, we introduce the ACoE-to-go $\delta(o^t)$, which measures the discrepancy between the cost-to-go evaluated at the received joint observation and that evaluated under the corresponding adversary-aware belief. The value of $\delta(o^t)$ satisfies the following Bellman-style recursion:
$$ \delta(o^t) = r_{\mathrm{sys}}^t\!\left(o^t, \pi_\theta(o^t)\right) - r_{\mathrm{sys}}^t\!\left(b^t, \pi_\theta(o^t)\right) + \gamma\, \mathbb{E}\!\left[ \delta(o^{t+1}) \,\middle|\, o^t, a^t = \pi_\theta(o^t) \right], \quad (35) $$
where $r_{\mathrm{sys}}^t(o^t, a^t)$ is an observation-based surrogate cost that treats the received (possibly corrupted) measurements as a nominal proxy for the true system input (the true state itself is not directly observable), and
$$ r_{\mathrm{sys}}^t(b^t, a^t) \triangleq \sum_{x^t \in \mathcal{X}} b^t(x^t)\, r_{\mathrm{sys}}^t(x^t, a^t) $$
denotes the belief-based expected instantaneous cost evaluated with respect to the adversary-aware belief state. The expectation in (35) is taken over the next observation $o^{t+1}$ induced by the physical transition dynamics, the adversarial observation model, and the belief-update mechanism. This signal captures not only the immediate cost deviation caused by observation corruption but also its long-term propagation through future belief updates and control actions.
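On a sampled finite-horizon trajectory, the recursion in (35) can be evaluated backward. This is an illustrative sketch in which per-step costs stand in for the expectation over next observations.

```python
def acoe_to_go(obs_costs, belief_costs, gamma):
    """Finite-horizon backward evaluation of the ACoE recursion:
    delta_t = (r_obs_t - r_belief_t) + gamma * delta_{t+1}.

    obs_costs / belief_costs: per-step costs evaluated at the observation
    and under the belief, respectively (sampled-trajectory stand-ins).
    """
    delta = 0.0
    out = [0.0] * len(obs_costs)
    for t in reversed(range(len(obs_costs))):
        delta = (obs_costs[t] - belief_costs[t]) + gamma * delta
        out[t] = delta
    return out
```

When observation-based and belief-based costs agree at every step, the ACoE-to-go is identically zero, i.e., the policy is insensitive to the observation distortion.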

4.3.3. Robust Q-Value Formulation and Bellman Operator

Building on the belief-based cost and the ACoE signal, the robust state-action value used by the EMS for online decision-making is defined as
$$ Q(b^t, a^t) = \mathbb{E}_{x^t \sim b^t}\!\left[ r_{\mathrm{sys}}^t(x^t, a^t) \right] + \lambda\, \delta(o^t) + \gamma\, \mathbb{E}\!\left[ V(b^{t+1}) \right], \quad (37) $$
Here $\lambda \ge 0$ in (37) denotes the robustness coefficient for ACoE regularization, balancing nominal performance and robustness; it is distinct from the GAE parameter $\lambda_{\mathrm{GAE}}$ used for advantage estimation in PPO. The ACoE term acts as a learning regularization signal that penalizes policies whose performance is highly sensitive to adversarial observation distortion, and it incorporates the observation acquired at time t in association with the belief state $b^t$. The corresponding robust Bellman operator is given by
$$ V(b^t) = \min_{a^t \in \mathcal{A}} Q(b^t, a^t), $$
where $\mathcal{A}$ denotes the joint feasible action set induced by inverter capability limits and network constraints. Accordingly, the EMS selects the real-time inverter dispatch as
$$ a^{t\star} = \arg\min_{a^t \in \mathcal{A}} Q(b^t, a^t), $$
which may be factorized into household-level inverter setpoints during execution. In scenarios where only active-power control is considered, the formulation naturally reduces to a single control dimension while preserving the robustness properties induced by ACoE regularization.
Importantly, ACoE preserves the fundamental operational objective and imposes a regularization on the learning procedure to reduce susceptibility to adversarial belief distortions while retaining the original EMS cost structure.
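For a finite belief support and a discretized action set, the robust Q-evaluation and greedy dispatch can be sketched as follows. This is illustrative; `stage_cost`, `acoe`, and `value_next` are hypothetical callables standing in for the corresponding terms in (37).

```python
def robust_q(belief, actions, stage_cost, acoe, value_next, lam, gamma):
    """Robust Q over a finite action set and greedy (minimum-cost) dispatch.

    belief: dict state -> probability; stage_cost(x, a) is the system stage
    cost; acoe(a) the ACoE-to-go signal; value_next(a) the expected next
    belief value. Returns the selected action and its Q-value.
    """
    def q(a):
        exp_cost = sum(p * stage_cost(x, a) for x, p in belief.items())
        return exp_cost + lam * acoe(a) + gamma * value_next(a)
    best = min(actions, key=q)
    return best, q(best)
```

With a continuous action space, the minimization would instead be performed implicitly by the trained actor network, as in the PPO implementation described next.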

4.4. Training Procedure and Implementation Details

This subsection describes the practical implementation of the proposed adversary-aware belief construction and ACoE-regularized value-learning framework. The EMS control policy is trained using an actor–critic architecture based on proximal policy optimization (PPO). At each time step t, the actor receives the joint observation o t together with belief features derived from the adversary-aware belief b t , and outputs a stochastic joint action a t A . The critic estimates the belief-based value function V ( b t ) , corresponding to the expected discounted system-level cost defined in Section 3.1.
To enhance robustness under adversarial partial observability, the PPO training objective is augmented with an ACoE-to-go signal. Specifically, the advantage used in PPO is modified to penalize actions whose estimated value degrades sharply under adversarially distorted belief trajectories. In parallel with the value function, a separate network is trained to approximate the ACoE-to-go, yielding stable estimates of robustness penalties throughout policy optimization. This setup promotes control policies that maintain low operating costs under nominal conditions while remaining robust to adversarial perturbations in observations. PPO is therefore employed as a stable baseline optimizer, ensuring that observed robustness gains can be attributed to A3B and ACoE rather than to increased model capacity or heuristic hyperparameter tuning.
Moreover, the standard advantage estimator is computed using generalized advantage estimation (GAE), denoted by A ^ t GAE . To enhance robustness against adversarial observation perturbations, we additionally compute an ACoE-based advantage term A ^ t ACoE from the adversarial counterfactual error signal. The modified advantage used for policy updates is defined as
$$ \tilde{A}_t = \hat{A}_t^{\mathrm{GAE}} - \beta\, \hat{A}_t^{\mathrm{ACoE}}, \quad (40) $$
where $\beta > 0$ controls the robustness–performance trade-off. This combined advantage $\tilde{A}_t$ directly replaces the original GAE advantage in the standard clipped PPO surrogate objective; in our implementation, $\beta$ corresponds to the counterfactual-loss weight reported in Table 1.
We conclude by clarifying the mapping between the theoretical robustness coefficient $\lambda$ and the implementation weight $\beta$. In the robust Q-value formulation in (37), the ACoE-to-go signal enters the objective through the term $\lambda\, \delta(o^t)$, which balances nominal operational performance against robustness to adversarial belief perturbations. The PPO implementation does not optimize (37) in closed form; instead, the same trade-off is realized by injecting a counterfactual penalty into the policy-gradient update through the modified advantage in (40). Specifically, $\hat{A}_t^{\mathrm{ACoE}}$ provides a sample-based training signal induced by $\delta(o^t)$ in (35), while $\beta$ scales its influence during policy updates. Accordingly, $\beta$ serves as the implementation-level counterpart of $\lambda$, up to the standard advantage normalization used in PPO with GAE.
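The advantage construction can be sketched as follows: a standard GAE recursion over one finite trajectory, followed by the ACoE penalty in (40). This is an illustrative sketch, not the paper's training code.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimation over one finite trajectory.

    values has length len(rewards) + 1 (a bootstrap value is appended).
    """
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # GAE recursion
        adv[t] = running
    return adv

def combined_advantage(a_gae, a_acoe, beta):
    """Modified advantage used for policy updates: A~ = A_GAE - beta * A_ACoE."""
    return [g - beta * c for g, c in zip(a_gae, a_acoe)]
```

Larger `beta` penalizes actions whose value is sensitive to adversarially distorted belief trajectories, at some cost in nominal performance.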
Let $\theta$ and $\phi$ denote the parameters of the actor and critic networks, respectively. The PPO likelihood ratio is defined as
$$ R_t(\theta) = \frac{ \pi_\theta(a_t \mid o_t) }{ \pi_{\theta_{\mathrm{old}}}(a_t \mid o_t) }. $$
The clipped surrogate objective is then computed using $\tilde{A}_t$:
$$ L_t^{\mathrm{CLIP}}(\theta) = \min\!\left( R_t(\theta)\, \tilde{A}_t,\ \mathrm{clip}\!\left( R_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon \right) \tilde{A}_t \right). $$
The policy loss at iteration k is computed as
$$ L_{\mathrm{policy}}^{(k)} = \frac{1}{M^{(k)}} \sum_{t=1}^{M^{(k)}} L_t^{\mathrm{CLIP}}(\theta) - c_{\mathrm{ent}}\, \hat{H}^{(k)}, $$
where $M^{(k)}$ is the batch size, $c_{\mathrm{ent}}$ is the entropy regularization coefficient, and $\hat{H}^{(k)}$ is the batch-averaged policy entropy.
Let $V_\phi(b_t)$ denote the critic estimate parameterized by $\phi$, and let $R_t^{\mathrm{target}}$ denote the target cumulative reward computed from sampled trajectories. The critic is trained by minimizing the value loss,
$$ L_{\mathrm{value}}^{(k)} = \frac{1}{M^{(k)}} \sum_{t=1}^{M^{(k)}} \left( V_\phi(b_t) - R_t^{\mathrm{target}} \right)^2. $$
An analogous regression loss is employed to train the auxiliary network that approximates the ACoE-to-go.
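The per-sample clipped surrogate and the critic's regression loss can be sketched as follows (illustrative scalar and list versions of the batched quantities above):

```python
def clipped_surrogate(ratio, adv, eps_clip):
    """PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1 - eps_clip), 1 + eps_clip)
    return min(ratio * adv, clipped * adv)

def value_loss(v_pred, v_target):
    """Mean squared error between critic predictions and return targets."""
    return sum((v - t) ** 2 for v, t in zip(v_pred, v_target)) / len(v_pred)
```

The clipping keeps the update conservative: for a positive advantage the ratio is capped at `1 + eps_clip`, and for a negative advantage at `1 - eps_clip`.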
To monitor learning progress and robustness characteristics, several metrics are evaluated at each training iteration k. The mean episodic reward provides an aggregate measure of EMS performance:
$$ \mathrm{mean\_reward}^{(k)} = \frac{1}{M^{(k)}} \sum_{i=1}^{M^{(k)}} r_{\mathrm{sys}}^{i}. $$
The policy entropy $\hat{H}^{(k)}$ is tracked to capture the trade-off between exploration and exploitation. In parallel, the value loss $L_{\mathrm{value}}^{(k)}$ evaluates how well the belief-based value function is approximated under adversarial partial observability.
Collectively, these metrics enable a systematic and quantitative evaluation of both the nominal performance of the energy management system (EMS) and the robustness of the learned inverter-dispatch policies over the course of training.

4.5. Case Study Setup

This section presents a case study on the IEEE 123-Node Test Feeder [22] to evaluate the proposed PPO+ACoE learning framework for microgrid energy management, using feeder-wide voltage regulation as the primary physical performance indicator.
Figure 2 illustrates the electrical and control architecture of the case study, based on the IEEE 123-Node Test Feeder. The physical layer comprises interconnected buses and lateral branches, with selected nodes hosting residential units equipped with battery energy storage systems (BESS) and local photovoltaic generation. A transformer-level aggregator combines residential power injections and provides feeder-level voltage monitoring. In the control layer, each BESS is associated with an autonomous learning agent that implements either conventional PPO or its PPO+ACoE variant. Agents receive locally available measurements through communication channels that may be subject to adversarial perturbations, and they output active and reactive power setpoints to support voltage regulation. This tiered structure—from feeder topology to transformer aggregation to household-level control—enables a systematic evaluation of how counterfactual-aware policy optimization affects physical performance under cyber-induced measurement corruption.
In this context, adversarial measurement perturbations are modeled as reproducible, sensor-level corruptions. At each decision step t, the EMS receives a perturbed, aggregated observation $o^t = \bar{x}^t + \eta^t$, where $\bar{x}^t$ contains the selected channels (aggregated active/reactive power measurements) and $\left\| \eta^t \right\|_\infty \le \epsilon$. During evaluation, $\eta^t$ is generated online as stepwise i.i.d. (memoryless) bounded noise, with each component sampled independently as $\eta_i^t \sim \mathrm{Uniform}([-\epsilon, \epsilon])$. “Reproducible” indicates that a fixed pseudorandom seed (seed = 42) is used so that all controllers are evaluated under identical perturbation realizations, enabling controlled comparisons.
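The evaluation-time perturbation model can be sketched as follows. `perturb_observation` is a hypothetical helper reproducing the componentwise uniform noise and seeded reproducibility described above, not the simulator's actual interface.

```python
import random

def perturb_observation(x_bar, eps, rng):
    """Stepwise i.i.d. bounded perturbation: o_t = x_bar_t + eta_t,
    with eta_i ~ Uniform([-eps, eps]) componentwise, so ||eta||_inf <= eps."""
    return [x + rng.uniform(-eps, eps) for x in x_bar]

# Fixed seed so all controllers see identical perturbation realizations
rng = random.Random(42)
o_t = perturb_observation([1.0, 0.5, 0.2, 0.0], eps=0.05, rng=rng)
```

Re-seeding with the same value regenerates the exact same perturbation sequence, which is what makes cross-controller comparisons controlled.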
For these comparisons, we report results in three stages: (i) learning dynamics and convergence under adversarial partial observability; (ii) feeder-wide voltage compliance and grid-level benefits; and (iii) spatial interpretation using representative voltage-profile snapshots. Simulation outputs are stored in a structured format and grouped into optimization histories, final performance outcomes, and configuration descriptions; the optimization histories include per-iteration training measures (e.g., mean episodic reward, surrogate loss, policy loss, entropy bonus, action dispersion, and value-function loss) to characterize learning stability. To evaluate the role of action semantics in robustness, we benchmark against multiple established learning-based baselines under identical simulation conditions.

Training Setup and Implementation Details

The inner maximization in the A3B normalization term is approximated using projected gradient ascent with a fixed iteration budget. In our experiments, we use I = 50 iterations with a constant step size proportional to the ℓ∞ perturbation radius. The initialization is sampled uniformly within the admissible perturbation set, followed by projection to enforce the constraint. A small constant (10⁻⁸) is used in the denominator to avoid numerical instabilities. Using a fixed iteration budget ensures predictable computational cost and reproducible behavior across training runs.
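A minimal sketch of this fixed-budget projected gradient ascent, assuming a generic differentiable objective supplied as a callable; the function names, the 10% step fraction, and the gradient interface are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def pgd_inner_max(obj_grad, obj, dim, eps=0.05, iters=50, step_frac=0.1, seed=0):
    """Approximate max_{||eta||_inf <= eps} f(eta) by projected gradient ascent.

    obj_grad : callable returning the gradient of the objective at eta.
    obj      : callable returning the objective value at eta.
    The iteration budget is fixed for predictable per-step cost, the step
    size is a constant fraction of the l_inf radius, the starting point is
    sampled uniformly in the admissible set, and every iterate is projected
    back onto the box [-eps, eps]^dim.
    """
    rng = np.random.default_rng(seed)
    eta = rng.uniform(-eps, eps, size=dim)   # uniform init in the admissible set
    alpha = step_frac * eps                  # step size proportional to the radius
    for _ in range(iters):
        eta = eta + alpha * obj_grad(eta)    # gradient-ascent step
        eta = np.clip(eta, -eps, eps)        # projection onto the l_inf ball
    return eta, obj(eta)
```

For a linear objective the iterates saturate at a corner of the box, which is the expected worst case under an ℓ∞ constraint.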
In addition to the proposed PPO+ACoE framework, we benchmark against three learning-based baselines: (1) standard PPO with absolute action scaling [23]; (2) Robust PPO trained with memoryless (i.i.d.) bounded observation perturbations (Uniform([−ϵ, ϵ])) injected during training; and (3) Baseline PPO employing the same incremental reactive-power actuation with device-level clipping (Δ/CLIP). PPO-style controllers have been widely used in smart-grid and power-system control applications, including voltage regulation and DER coordination [6,7,13]. All controllers are evaluated under an identical feeder configuration (IEEE-123), voltage-band constraint (±5%), device limits, and adversarial perturbation budget (ϵ = 0.05), which allows us to isolate the effect of action-interface design from algorithmic adversarial robustness.
Concretely, we test six control configurations: (i) an All-Zero controller as a purely physical reference; (ii) PPO with absolute active-power actions; (iii) Robust PPO (noise-trained) with bounded observation-noise injection; (iv) Baseline PPO (Δ/CLIP) using incremental ΔP/ΔQ actions with device-level clipping; (v) PPO+ACoE with the same absolute-action mapping; and (vi) PPO+ACoE with incremental ΔP/ΔQ actions and device-level clipping (ACoE Δ/CLIP). Across all experiments, voltage-quality metrics assess physical performance in terms of (i) the number and proportion of buses outside the nominal range of [0.95, 1.05] p.u.; (ii) the final number of undervoltage buses; and (iii) the aggregate voltage deficit.
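The three voltage-quality metrics can be computed from a vector of per-bus voltage magnitudes roughly as follows; this is a hypothetical helper for illustration, not the authors' evaluation code:

```python
import numpy as np

def voltage_metrics(v_pu, lo=0.95, hi=1.05):
    """Compute feeder-wide voltage-quality metrics from per-bus magnitudes.

    v_pu : iterable of per-bus voltage magnitudes in p.u.
    Returns the number and percentage of buses outside [lo, hi], the
    number of undervoltage buses (V < lo), and the aggregate voltage
    deficit, i.e. the sum of per-bus shortfalls below lo (in p.u.).
    """
    v = np.asarray(v_pu, dtype=float)
    out_of_range = (v < lo) | (v > hi)
    undervoltage = v < lo
    deficit = float(np.sum(np.maximum(lo - v, 0.0)))  # aggregate undervoltage deficit
    return {
        "n_out": int(out_of_range.sum()),
        "pct_out": 100.0 * float(out_of_range.mean()),
        "n_under": int(undervoltage.sum()),
        "deficit_pu": deficit,
    }
```

In the case study these quantities would be evaluated on the bus voltages returned by the OpenDSS power-flow solution at each checkpoint.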
Table 1 summarizes the parameter configuration used for the Baseline PPO controller and the proposed PPO+ACoE variant. Both agents share an identical core PPO setup to ensure a fair and controlled comparison, such that observed performance differences can be attributed to the counterfactual robustness mechanism rather than hyperparameter tuning. In addition, for the A3B module, the local Monte-Carlo candidate set size and the inner PGD iteration budget are kept fixed across training and evaluation to ensure bounded and predictable per-step overhead.
Moreover, both controllers are trained and evaluated under identical adversarial measurement perturbations so that robustness gains can be attributed to ACoE regularization rather than differences in attack exposure. The threat model is gray-box: the adversary can corrupt the measurement channels within an ℓ∞-bounded radius but has no access to policy parameters or gradients. Consistent with this assumption, η_t is implemented as stepwise i.i.d. (memoryless) bounded noise with components sampled as η_{i,t} ∼ Uniform([−ϵ, ϵ]) under a fixed pseudorandom seed (seed = 42), ensuring identical perturbation trajectories across all controllers. The counterfactual-loss weight β controls the robustness–performance trade-off in the PPO implementation (40); in practice, it can be selected via a simple sweep on held-out environments or operating conditions, and we emphasize that extremely large values can over-regularize updates while extremely small values can reduce the effect of counterfactual regularization. Each iteration corresponds to a single PPO policy update performed using trajectories sampled from the interaction between the agent and the environment. Here, PPO is employed as the primary baseline, as it is among the most widely used and well-established policy-gradient algorithms for continuous control and has been extensively utilized in recent reinforcement-learning-based studies on inverter energy management and voltage regulation.
In this context, the learning rate α and the batch size jointly govern the magnitude and stability of gradient updates throughout the training process, whereas the clipping range ϵ_clip constrains the extent of policy updates across successive iterations, thereby mitigating excessive policy shifts and improving the robustness of the training procedure. Here, ϵ represents the adversarial measurement limit, which is distinct from the PPO clipping parameter ϵ_clip. The discount factor γ and the generalized advantage estimation (GAE) parameter λ together optimize long-term rewards while reducing variance in advantage estimation. Entropy regularization, controlled by the entropy coefficient, improves exploration in the early training phases and reduces the risk of rapid convergence to inferior deterministic strategies. The value loss coefficient regulates the importance of critic accuracy within the combined actor-critic optimization setting. In general, the PPO+ACoE controller extends this baseline design by adding two parameters. The adversarial noise level parameter quantifies the magnitude of norm-bounded perturbations applied to observation signals during training, thereby providing a systematic mechanism for addressing both partial observability and adversarially perturbed environments. The counterfactual loss weight regulates the contribution of the adversarial counterfactual error term to the overall policy optimization objective, thereby explicitly penalizing susceptibility to corrupted observations. Consequently, both agents undergo an equivalent number of iterations to provide comparable convergence horizons and uniform assessment conditions.
To ensure strict reproducibility and enable a controlled cyber–physical comparison, all controllers are evaluated under an identical, fixed pseudorandom seed. This guarantees that each method encounters the same sequence of adversarial observation realizations, isolating the effect of the proposed ACoE-enhanced learning mechanism from stochastic variability in the attack trajectories.

5. Results

This section evaluates the proposed ACoE-enhanced reinforcement learning framework on the IEEE 123-Node feeder and compares its learning behavior and voltage-regulation performance against baseline configurations. Throughout this section, episode reward refers to the discounted cumulative sum of stage rewards collected over an episode, computed from the same stage-cost signal used during training. The analysis is structured in three levels. First, we analyze the learning dynamics to assess whether the agent converges reliably under adversarial partial observability. Second, we evaluate feeder-wide voltage compliance to determine whether learning improvements translate into grid-level benefits. Finally, we provide a spatial and physical interpretation using a representative voltage-profile snapshot.

5.1. Learning Dynamics and Training Stability

The first level of analysis focuses on learning dynamics to determine whether the proposed framework achieves stable and reliable policy convergence under adversarial observation noise. Figure 3 displays the mean-reward trajectories during the first 1000 training iterations for the Baseline PPO and the ACoE-enhanced controller. Both agents obtain markedly negative episode rewards during the initial training phase, attributable to predominantly unguided exploration and the absence of a well-developed control policy. In this early period, their actions are primarily exploratory and only weakly aligned with the voltage-regulation objective, which leads to persistently low cumulative rewards.
As training progresses, the performance of both policies improves, demonstrating that each agent incrementally acquires an effective control strategy through interaction with the environment. However, the Baseline PPO policy exhibits pronounced performance fluctuations in later stages of training. These fluctuations are consistent with instability in policy-gradient updates under partially corrupted observations, as the corresponding advantage estimates—and thus the resulting gradient signals—become increasingly sensitive to measurement perturbations. In practical terms, these results indicate that Baseline PPO is prone to converging to a policy that is locally effective but remains susceptible to intermittent performance deterioration under adversarial partial observability.
In contrast, the controller augmented with ACoE exhibits a more gradual and stable convergence behavior and attains its performance plateau earlier, at approximately iteration 600. This behavior suggests that counterfactual regularization attenuates the impact of adversarial perturbations on policy updates by constraining the divergence between nominal and perturbed trajectories. Consequently, the ACoE-augmented controller demonstrates more gradual and stable convergence over the examined training horizon, reflecting reduced oscillatory update dynamics under the fixed experimental conditions. This stability is particularly relevant for power-system applications, where learning-induced volatility may lead to unsafe intermediate behavior. From a power-systems perspective, the reward trend reflects how quickly each policy identifies control actions that mitigate voltage violations without inducing large reactive-power fluctuations. In radial feeders, voltage deviations can accumulate along high-impedance branches; policies that overreact to corrupted measurements may drive inverter setpoints toward saturation, reducing rewards and inducing oscillatory learning behavior.
Figure 3 is complemented by the policy-entropy trajectories in Figure 4, which provide additional evidence regarding training stability and the exploration–exploitation transition. For both controllers, the policy entropy decreases monotonically over the course of training as the policies become increasingly deterministic, which is consistent with the agent progressively favoring actions with higher expected rewards. Notably, the ACoE-augmented policy exhibits a persistently lower entropy after approximately 400 iterations, indicating an earlier stabilization toward confident and repeatable action selections. This reduction in entropy does not degrade reward performance; instead, it coincides with the smoother reward convergence behavior in Figure 3. Collectively, these observations suggest that ACoE mitigates the need for prolonged exploration under observation uncertainty and facilitates a more robust and reliable convergence process in adversarial, partially observable environments [6,7].
From an electrical engineering perspective, the early entropy reduction observed with ACoE is consistent with a more rapid identification of a stable control policy that respects inverter operating constraints. Once the controller has internalized a reliable mapping between local voltage deviations and admissible Δ Q responses, additional exploratory behavior provides diminishing marginal benefits. This is particularly true under adversarial measurement noise, where excessive exploration can induce undesirable or erratic reactive-power responses. Although stable learning behavior is a prerequisite, learning improvements are only meaningful if they translate into tangible grid-level performance. The next level of analysis, therefore, examines whether stabilized policies yield effective voltage regulation across the distribution network.

5.2. Voltage Analysis and Physical Performance

The second level of analysis investigates whether the stability of the learning process leads to feeder-wide compliance with prescribed voltage limits. Voltage violations are characterized using complementary performance indices that capture both the magnitude and the duration (or frequency) of deviations from the nominal service band. This behavior aligns with underlying distribution network physics, which can be approximated by DistFlow-type formulations, wherein bus voltage magnitudes exhibit substantially higher sensitivity to reactive power injections than to active power variations. In particular, the voltage-drop relation contains a dominant linear contribution proportional to x_{ij}Q_{t,ij} along each branch, so distributing incremental ΔQ support across buses can directly offset cumulative drops on long, high-impedance radial segments. As a result, control schemes that can accurately and robustly regulate ΔQ within device-specific operational constraints are intrinsically more effective in restoring bus voltages to acceptable operating ranges. The voltage-recovery phenomenon observed in this study arises directly from fundamental AC distribution feeder physics (specifically, the dependence of bus voltage magnitudes on reactive power injections and the aggregated series impedance of radial distribution lines) and is therefore not an artifact specific to the IEEE 123-Node Test Feeder. Rather, the selected feeder serves as a representative benchmark network employed to illustrate this mechanism under realistic system scales and loading conditions.
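For reference, the dominant linear sensitivity invoked here is the standard (Lin)DistFlow voltage-drop relation, restated below in textbook form under the usual single-phase radial approximation; this is not a verbatim copy of the paper's Equation (8):

```latex
% LinDistFlow voltage drop along branch (i,j) with resistance r_{ij},
% reactance x_{ij}, and sending-end flows P_{t,ij}, Q_{t,ij}
% (quadratic loss terms neglected):
v_{t,j} \approx v_{t,i} - 2\left( r_{ij}\,P_{t,ij} + x_{ij}\,Q_{t,ij} \right),
\qquad v_{t,i} := |V_{t,i}|^{2}.
% An incremental injection \Delta Q at a downstream bus reduces Q_{t,ij}
% on every branch of its upstream path, raising all downstream squared
% voltages by approximately 2 \sum_{(i,j)\in\mathrm{path}} x_{ij}\,\Delta Q.
```

The x_{ij}Q_{t,ij} term is the one referred to throughout Sections 5.2 and 5.3 as the dominant lever for undervoltage recovery on high-impedance radial segments.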
Figure 5 shows the percentage of buses operating outside the nominal voltage range [0.95, 1.05] p.u. Both the Baseline PPO and the PPO+ACoE controllers have more than 90% of buses outside acceptable limits throughout training, confirming their inability to restore voltage compliance. In comparison, the PPO+ACoE (Δ/CLIP) controller exhibits a sharp decrease after approximately 200 training iterations and stabilizes below 10% out-of-range buses, indicating substantial corrective capability and sustained performance across the feeder. Under the Baseline PPO configuration and the PPO+ACoE configuration employing absolute active-power control actions, pronounced undervoltage conditions persist throughout training. Across the evaluated bus set in the OpenDSS implementation, more than 120 buses operate below the minimum service-voltage threshold of 0.95 p.u. These observations indicate that adversarial robustness enforced solely at the optimization or policy-learning level is insufficient to restore acceptable voltage profiles when learned control actions are structurally misaligned with underlying feeder physics.
In contrast, the proposed PPO+ACoE (Δ/CLIP) controller endowed with reactive-power control authority exhibits qualitatively different behavior: the average per-training-iteration bus voltage increases to approximately 0.98 p.u., and the total voltage deficit decreases sharply from 44.19 p.u. to less than 0.41 p.u. across all evaluation checkpoints (five evenly spaced checkpoints across the 1000 training iterations). From a cyber–physical-systems standpoint, these findings elucidate a vulnerability pathway in which adversarial perturbations to measurements, when translated into absolute active-power setpoints, exacerbate voltage drops via cumulative impedance effects along radial feeders. In contrast, incremental ΔP/ΔQ modulation combined with device-level clipping enables direct compensation of voltage deviations through localized reactive-power support, thereby attenuating the physical propagation of sensor-level cyberattacks throughout the distribution network. These performance differences can be interpreted through the DistFlow relations in (8). Incremental reactive-power adjustments act on the dominant linear voltage term x_{ij}Q_{t,ij}, which governs first-order voltage sensitivity along radial feeders. By contrast, absolute active-power setpoints influence voltage primarily through accumulated quadratic current terms associated with line losses. This structural asymmetry explains why the Δ/CLIP parameterization yields improved voltage recovery under adversarial perturbations: it counteracts voltage deviations more directly through targeted reactive-power modulation.
Figure 6 reports the number of buses experiencing undervoltage (V < 0.95 p.u.) across the training horizon. The all-zero controller reference scenario remains near zero, confirming that the feeder is inherently stable in the absence of aggressive control actions. Both the Baseline PPO and PPO+ACoE controllers sustain high undervoltage counts near 120 buses, whereas the PPO+ACoE (Δ/CLIP) controller reduces this number below 15 by iteration 700. This outcome highlights the critical role of reactive-power modulation in mitigating systemic undervoltage, consistent with established distribution-system control insights [11]. In addition, reactive-power modulation directly counteracts voltage drops induced by cumulative feeder impedance along radial branches. In practice, it is ΔQ that influences the dominant term in the local voltage-drop relation, enabling distributed voltage recovery despite partial measurement corruption, whereas active-power-only interventions provide limited voltage control and may require excessively large setpoint variations.
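The incremental, clipped actuation interface discussed above can be sketched as follows; the limit values, function name, and apparent-power projection are illustrative assumptions, since the paper's device parameters are not reproduced here:

```python
import numpy as np

def delta_clip_step(p_prev, q_prev, dp, dq,
                    dp_max=0.1, dq_max=0.1, s_max=1.0):
    """Apply an incremental ΔP/ΔQ action with device-level clipping.

    The policy outputs increments (dp, dq) rather than absolute setpoints;
    each increment is rate-limited to [-dp_max, dp_max] / [-dq_max, dq_max],
    and the updated setpoint is projected back inside the inverter
    apparent-power limit s_max. Bounding the per-step increment prevents a
    single corrupted observation from commanding a large setpoint jump.
    """
    p = p_prev + float(np.clip(dp, -dp_max, dp_max))   # rate-limited active-power update
    q = q_prev + float(np.clip(dq, -dq_max, dq_max))   # rate-limited reactive-power update
    s = float(np.hypot(p, q))
    if s > s_max:                                      # project onto the S-limit circle
        p, q = p * s_max / s, q * s_max / s
    return p, q
```

Under this interface, an adversarially inflated action request is saturated twice: first by the per-step rate limit and then by the device feasibility projection, which is the mechanism credited for attenuating sensor-level attacks.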
While the previous results compared ACoE against clean PPO, we now examine whether the observed robustness gains can be attributed solely to adversarial training. We additionally compare Robust PPO (noise-trained) and ACoE under (a) OpenDSS evaluation and (b) a critic-based online perturbation with the same ℓ∞ budget (ϵ = 0.05). The critic-based protocol serves as an additional stress test and is distinct from the stochastic perturbations described earlier. These results indicate that while adversarially trained PPO improves robustness relative to clean PPO, ACoE further reduces performance degradation under critic-based perturbations without compromising voltage-regulation quality. The quantitative evaluation outcomes are consolidated and presented in Table 2.
In general, the results demonstrate that adversarial robustness in learning-based voltage control emerges from the interaction between learning objectives and power-flow physics. Although ACoE regularization stabilizes policy updates in the presence of adversarial partial observability, significant voltage recovery is achieved only when control actions are expressed through physically grounded ΔP/ΔQ modulation with device-level clipping. When absolute power setpoints are applied, adversarial regularization alone is insufficient to mitigate voltage deviations driven by feeder impedance and load aggregation, resulting in persistent undervoltage across the network. This observation helps explain why many learning-based voltage-regulation methods, despite exhibiting stable training behavior, display limited robustness under non-ideal measurement conditions. In such settings, robustness is often attributed to the learning algorithm itself without explicitly accounting for the interaction between control-action semantics and the nonlinear voltage sensitivity of distribution feeders.

5.3. Voltage Violation Structure and Statistical Validation

Figure 7 decomposes voltage violations into undervoltage and overvoltage events for the baseline controller and the proposed PPO+ACoE (Δ/CLIP) method across five evaluation checkpoints. Under baseline control, undervoltages dominate both in frequency and magnitude. The PPO+ACoE (Δ/CLIP) controller substantially suppresses these events while introducing a small number of mild overvoltage excursions (V > 1.05 p.u.), which are typically transient and limited to a few buses. From an engineering perspective, mild overvoltage is a common consequence of strong reactive-power support. When multiple inverters inject ΔQ in electrically proximate locations, the local voltage may slightly exceed the upper limit, particularly near voltage-regulated nodes. This observation suggests that incorporating explicit overvoltage penalties or adaptive clipping could further improve compliance while preserving undervoltage rectification. The residual-violation patterns also illustrate the link between control-action semantics and feeder physics. Specifically, absolute-action controllers are dominated by undervoltage events, whereas the Δ/CLIP formulation shifts the profile toward fewer and milder overvoltage events.
As reported in Table 3, the proposed PPO+ACoE (Δ/CLIP) formulation reduces the aggregate voltage deficit by more than one order of magnitude relative to absolute-action controllers. Physically, this reduction is consistent with injecting incremental reactive support at multiple downstream locations, which counteracts the dominant x_{ij}Q_{t,ij} voltage-drop term along radial segments and limits the accumulation of undervoltage deficits over high-impedance branches.
To further disentangle the contribution of the physics-aware action interface from the proposed learning regularization, we introduce an additional ablation baseline, Baseline PPO (Δ/CLIP), which employs the same incremental ΔP/ΔQ actuation interface with device-level clipping as PPO+ACoE (Δ/CLIP) but without the ACoE regularization mechanism. Table 4 reports (a) feeder-wide voltage statistics under OpenDSS and (b) adversarial performance under critic-based perturbations (ϵ = 0.05). The results indicate that while the Δ/CLIP interface ensures nominal voltage compliance, the proposed ACoE mechanism provides additional performance gains under critic-based perturbations.
Finally, Figure 8 summarizes the OpenDSS-based feeder-wide voltage outcomes at the reported evaluation checkpoint under adversarial partial observability. Panel (a) reports the feeder-wide average bus-voltage magnitude (p.u.), while Panel (b) reports the percentage of buses outside the service band [0.95, 1.05] p.u. Baseline PPO and PPO+ACoE (setpoint actions) drive the feeder into a collapse-like undervoltage regime with 93.94% of buses out of range, consistent with prior evidence that coordinated false-data injection can corrupt telemetry and precipitate unsafe voltage profiles in power systems [24,25]. In contrast, PPO+ACoE (Δ/CLIP) maintains near-nominal operation and reduces out-of-range buses to 8.33% by enforcing incremental, feasibility-aware reactive-power coordination, aligning with established coordinated voltage-control principles [26]. The All-Zero controller remains within the service-voltage band (0% out of range), indicating that the extreme undervoltage arises from control decisions under corrupted telemetry rather than from an inherent limitation of the feeder model in this setting. From a standards-compliance standpoint, the baseline outcomes are outside permissible DER operating envelopes per IEEE Std 1547-2018, whereas PPO+ACoE (Δ/CLIP) avoids severe undervoltage conditions and substantially reduces service-band violations [27]. In practice, such baseline operating points would be expected to trigger DER protection or LVRT actions rather than sustained operation.
To clarify the mechanism behind the feeder-wide outcomes in Figure 8, Figure 9 separates controllers by action-interface family at an early checkpoint. The left column shows that both absolute-setpoint policies, Baseline PPO and PPO+ACoE, already enter a severe-undervoltage regime with ≈94% of buses outside [0.95, 1.05], indicating that ACoE regularization alone does not prevent failure when the action interface permits large setpoint jumps under corrupted telemetry. In contrast, the right column shows that controllers using the feasibility-enforced Δ/CLIP interface maintain near-nominal average voltages and markedly reduced service-band violations (at most 8.33% of buses out of range) at this stage, even when trained with different robustness objectives (Baseline Δ/CLIP, RobustPPO, ACoE Δ/CLIP). Together, Figure 8 and Figure 9 support the conclusion that robust voltage regulation under adversarial sensing hinges on co-designing the learning objective with electrically meaningful, incremental actuation semantics.
It is important to distinguish the complementary roles of action semantics and adversarial regularization. The Δ/CLIP interface alone is sufficient to maintain service-band compliance for Baseline (Δ/CLIP) and RobustPPO (zero buses out of range in Figure 9). However, ACoE is not introduced to improve nominal voltage metrics per se; rather, it enhances robustness to adversarial critic perturbations and stabilizes learning under partial observability.
Overall, these findings suggest that voltage feasibility in the presence of adversarial sensing is predominantly determined by actuation semantics (i.e., incremental, saturated control actions), whereas adversarial regularization primarily contributes to improved learning stability and critic robustness. Consequently, achieving robust voltage regulation benefits from the joint co-design of (i) the control-action interface and (ii) the learning objective.

6. Discussion

This section interprets the results, situates the contributions relative to prior robust and uncertainty-aware voltage-regulation work, and distills practical implications, limitations, and promising research directions for EMS deployment under adversarial partial observability [6,7,11,19].

6.1. Key Findings and Mechanistic Interpretation

The empirical study on the IEEE 123-Node feeder [22] indicates that robustness under corrupted measurements depends jointly on (i) how the controller reconstructs state uncertainty from adversarial observations and (ii) how policy optimization penalizes sensitivity to counterfactual belief distortions. First, A3B reallocates belief mass toward candidate states for which the received observation can be explained as a policy-impacting perturbation, mitigating the overly conservative “all-admissible-states-are-equally-likely” interpretation that commonly arises under ℓ∞ uncertainty [1]. Second, incorporating ACoE into the PPO-based actor-critic learning objective [1,23] stabilizes policy updates under adversarial partial observability by discouraging policies whose cost-to-go is highly sensitive to observation-induced belief distortion.
A central finding is that adversarial robustness is not determined by the learning objective alone. The case study shows that action semantics, specifically incremental actions with device-level clipping, materially improve feeder-wide voltage recovery compared to absolute setpoint actions under identical adversarial sensing realizations. Mechanistically, incremental actuation with saturation limits constrains step-to-step control variation and acts as a bounded-rate, feasibility-enforced interface between noisy measurements and inverter setpoints. In radial feeders with strong voltage-reactive-power sensitivity, this reduces the effective closed-loop gain by preventing large, measurement-driven jumps from being mapped into aggressive injections, thereby limiting the amplification of observation corruption through nonlinear voltage-power coupling [2,17]. This complements robustness regularization: even when training appears stable, absolute setpoints can still induce physically aggressive responses that propagate errors feeder-wide. Building on this steady-state mechanism, the following points clarify what the observed behavior suggests for practical microgrids and for larger-scale deployments.
In the steady-state OpenDSS evaluation, the Δ/CLIP interface acts as a bounded-rate, saturation-aware mapping from corrupted measurements to inverter commands, which reduces chattering and repeated saturation events. Although our primary evaluation uses quasi-steady-state power-flow coupling rather than an electromagnetic-transient model, this behavior is consistent with improved closed-loop stability margins in the presence of noisy telemetry. In practical inverter-dominated microgrids, where some resources may implement grid-forming or virtual-synchronous-machine behaviors, incremental and clipped power commands can be expected to reduce large excursions in internal control states. Explicit validation of this linkage requires dynamic inverter models and is an important extension beyond the present steady-state study.
These stability-motivated design choices also have direct implications for scalability. The results suggest that robustness mechanisms must jointly co-design (i) the information model and (ii) the actuation interface so that local measurement corruption does not cascade into feeder-wide instability. For larger feeders and higher DER counts, scalable deployment will likely require hierarchical or distributed implementations, richer telemetry when available, and computationally efficient belief updates that preserve policy-relevant uncertainty without incurring prohibitive online overhead.

6.2. Implications for Microgrid Operation and EMS Design

From an operational viewpoint, these results yield actionable guidelines for EMS design under untrusted measurements, complementing existing learning-based voltage-control and microgrid coordination studies [7,8,9,10,11]:
  • Prefer incremental, feasibility-enforced actuation. Exposing incremental inverter commands and enforcing device-level clipping reduces the risk that corrupted observations induce large, infeasible, or destabilizing setpoint jumps.
  • Couple learning with physics-aware feasibility layers. Embedding hard constraints (apparent power limits, current limits, voltage bounds) and using power-flow-aware evaluation (OpenDSS) helps ensure that learned policies remain physically deployable.
  • Treat measurement corruption as a first-class design variable. In practical AMI/SCADA settings, state estimation and bad-data detection can provide additional redundancy, but the EMS should still maintain resilience when such layers are imperfect or compromised [2]. Our belief-based formulation provides an explicit mechanism for reasoning over corrupted observations at decision time.
Beyond these high-level guidelines, the results also have implications for real-world implementation: (i) robustness is intrinsically sensitive to the adversarial budget ϵ, so a systematic ϵ-sweep including temporally correlated disturbances is important for mapping robustness margins beyond the fixed setting used for controlled comparisons in this study; (ii) scaling to larger feeders and higher DER counts will likely require hierarchical or distributed implementations and belief updates that preserve only policy-relevant uncertainty, since naively enlarging the candidate set can make online inference and robustness checks the bottleneck; and (iii) relative to standard PPO, A3B adds per-decision computation due to scoring sampled candidates and approximating the denominator via a projected-gradient inner loop, but this overhead is embarrassingly parallel and can be reduced or amortized when telemetry quality is high.

6.3. Limitations

This work has several limitations that motivate further study. (i) The main evaluation is conducted on a single benchmark feeder (IEEE-123) with a fixed experimental configuration; generalization to other topologies and DER penetrations remains to be established [22]. (ii) The primary adversarial observation model is memoryless and ℓ∞-bounded; correlated, delayed, or drifting attacks may induce different failure modes [2]. (iii) For controlled comparisons, the evaluation emphasizes identical perturbation realizations; broader multi-seed campaigns are needed for stronger statistical claims. (iv) The EMS observation used in the case study is an aggregated, low-dimensional vector; incorporating richer sensing may change the belief-construction and robustness trade-offs. (v) The learned policies are obtained via offline training in a simulated environment with quasi-steady-state power-flow coupling; deployment therefore depends on simulator fidelity and does not yet capture fast inverter and communication dynamics. (vi) The case study focuses on inverter-interfaced DERs (PV and battery inverters) for voltage regulation; extending the physical model and control action space to explicitly include dispatchable generators and combined heat-and-power (CHP) thermal coupling is left for future work. (vii) The threat model is gray-box and does not include a fully adaptive adversary that explicitly optimizes against the A3B belief-construction mechanism; analyzing such strategic feedback is an important direction for strengthening guarantees under adaptive attacks. (viii) While β is conceptually a robustness knob, a systematic sensitivity analysis over β (and its interaction with ϵ and attack correlation) is left for future work.
A related practical consideration is the occurrence of extreme undervoltage events and associated IEEE 1547 compliance-protection actions. Under adversarial partial observability, Baseline PPO and the absolute-setpoint (scaled) variants may produce extreme undervoltage profiles in the OpenDSS evaluation, where a small subset of buses reaches near-zero voltage magnitudes and some buses drop well below typical operating limits. The same locations remain within the normal operating range under the All-Zero controller, indicating that these extremes are induced by control behavior rather than by a structural artifact of the feeder model.
According to IEEE Std 1547-2018 (see Table 15 in [27], abnormal operating performance, Category II), sufficiently deep undervoltages require the DER to cease to energize within a short clearing time. Hence, sustained operation at near-zero voltage magnitudes lies far outside the permissible ride-through envelope defined for practical inverter-based resources. In real distribution systems, such voltage levels would activate inverter protection limits and feeder-level protection mechanisms.
In this work, these outcomes are intentionally classified as severe infeasibility events, as they reflect a loss of operational viability under adversarial measurement corruption rather than realistic steady-state operating conditions.
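For concreteness, this severity classification can be sketched alongside the ordinary service-band metrics as follows; the 0.5 p.u. "severe" cutoff and the aggregate-deficit definition sum(max(0, 0.95 − V)) are illustrative assumptions of this sketch, not thresholds prescribed by IEEE Std 1547-2018:

```python
import numpy as np

def voltage_diagnostics(v_pu, band=(0.95, 1.05), severe_below=0.5):
    """Classify bus voltages (p.u.) and summarize feeder-wide violations.

    Illustrative assumptions (not taken from IEEE Std 1547-2018):
      - 'severe' means V < 0.5 p.u., a proxy for loss of operational viability;
      - the aggregate deficit is sum(max(0, lower_band - V)) over all buses.
    """
    v = np.asarray(v_pu, dtype=float)
    out_of_band = (v < band[0]) | (v > band[1])
    return {
        "pct_out_of_band": 100.0 * float(out_of_band.mean()),
        "aggregate_deficit_pu": float(np.maximum(band[0] - v, 0.0).sum()),
        "severe_events": int((v < severe_below).sum()),
    }
```

Counting severe events separately from ordinary band violations keeps "loss of operational viability" distinct from routine voltage-quality degradation in the evaluation summaries.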

6.4. Promising Directions for Research and Development

Multiple extensions are promising for both research and deployment: (i) adversary models beyond i.i.d. bounded noise (temporally correlated attacks, targeted channel selection, stealthy data injection) [2]; (ii) closed-loop integration with state estimation and bad-data detection to combine cyber–physical redundancy with learning-based robustness; (iii) multi-feeder and multi-seed evaluation protocols, including domain randomization over load/PV profiles and DER portfolios [13,14,15]; (iv) phase-unbalanced and three-phase models with spatially resolved voltage metrics [17]; (v) hardware-in-the-loop evaluations to quantify robustness under realistic communication delays, packet loss, and actuator latencies, including dynamic inverter models; and (vi) transfer learning or domain adaptation to reduce the simulator-to-field gap and enable robust training against evolving adversarial disturbances.
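As a concrete example of direction (i), a temporally correlated attack can be sketched as an AR(1) perturbation process projected back onto the ℓ∞ budget at every step; the function name and parameters are illustrative, not part of the proposed framework:

```python
import numpy as np

def ar1_attack(T, dim, epsilon, rho=0.9, seed=0):
    """Generate a temporally correlated (AR(1)) observation-perturbation
    sequence, projecting each step back onto the l-infinity ball of
    radius `epsilon` so the attack budget is never exceeded."""
    rng = np.random.default_rng(seed)
    deltas = np.zeros((T, dim))
    d = np.zeros(dim)
    for t in range(T):
        # AR(1) update with variance-matched innovations, then projection.
        d = rho * d + np.sqrt(1.0 - rho**2) * rng.uniform(-epsilon, epsilon, dim)
        d = np.clip(d, -epsilon, epsilon)
        deltas[t] = d
    return deltas
```

Unlike the memoryless model used in the main evaluation, such a sequence drifts slowly, which can defeat detectors tuned to i.i.d. noise while remaining within the same per-step budget.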

7. Conclusions

This paper presented an adversary-aware, belief-based reinforcement learning framework for microgrid energy management under adversarial partial observability. By combining adversarial belief balancing (A3B), adversarial counterfactual error (ACoE) regularization, and a feasibility-enforced ΔP/ΔQ actuation interface with device-level clipping, the proposed design explicitly couples corrupted measurements, belief evolution, and distribution-level voltage physics.
Experiments on the IEEE 123-Node feeder show that robust learning-based voltage regulation depends on the joint co-design of (i) belief modeling, (ii) robustness-regularized value learning, and (iii) electrically meaningful action semantics, rather than on the learning objective alone. Under ℓ∞-bounded adversarial sensing perturbations (ϵ = 0.05), Baseline PPO and absolute-action controllers induce severe undervoltage, with 93.94% of buses outside the [0.95, 1.05] p.u. service band and minimum voltages near zero. In contrast, the proposed PPO+ACoE (Δ/CLIP) controller restores near-nominal operation, increasing the feeder-wide average voltage to 0.987 p.u. and reducing out-of-range buses to 8.33% while avoiding extreme infeasibility events.
Ablation results highlight complementary contributions from the control interface and adversarial regularization. The Δ/CLIP parameterization is critical for feasibility in nominal operation because it limits measurement-driven setpoint jumps and attenuates the physical amplification of corrupted telemetry through nonlinear voltage-reactive-power coupling. ACoE further improves robustness under adversarial stress testing by discouraging policies whose cost-to-go is overly sensitive to belief distortion. Under critic-based perturbations, PPO+ACoE (Δ/CLIP) improves episode return relative to Baseline PPO (Δ/CLIP) without degrading voltage statistics, indicating that counterfactual regularization strengthens resilience beyond nominal feasibility.
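A minimal sketch of an incremental, feasibility-enforced actuation interface of this kind (the function name, step limit, and device bounds are illustrative assumptions, not the paper's exact Δ/CLIP implementation):

```python
import numpy as np

def delta_clip_action(prev_setpoint, delta, step_limit, device_min, device_max):
    """Incremental (Delta) actuation with device-level clipping.

    The policy outputs a bounded increment rather than an absolute setpoint:
    the increment is first clipped to a per-step limit, added to the previous
    setpoint, and then clipped to the device's feasible range. All limits
    here are illustrative placeholders.
    """
    delta = np.clip(np.asarray(delta, dtype=float), -step_limit, step_limit)
    return np.clip(np.asarray(prev_setpoint, dtype=float) + delta,
                   device_min, device_max)
```

Because a corrupted observation can move the setpoint by at most `step_limit` per decision, the closed-loop gain from measurement distortion to actuation is bounded by construction, which is the mechanism the ablation attributes the feasibility gains to.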
These trends are consistent with distribution-feeder physics: incremental reactive-power modulation directly affects the dominant linear voltage-drop term in DistFlow-type relations, whereas absolute active-power setpoints primarily influence quadratic loss-related terms. Accordingly, incremental, feasibility-enforced actuation reduces closed-loop gain under corrupted measurements and limits feeder-wide propagation of adversarial perturbations.
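For reference, the DistFlow branch relation underlying this argument [17], written for a line from bus $i$ to bus $j$ with impedance $r_{ij} + jx_{ij}$ and sending-end power flows $P_{ij}$, $Q_{ij}$, is

```latex
V_j^2 = V_i^2 - 2\left(r_{ij} P_{ij} + x_{ij} Q_{ij}\right)
      + \left(r_{ij}^2 + x_{ij}^2\right)\frac{P_{ij}^2 + Q_{ij}^2}{V_i^2}.
```

Dropping the final quadratic loss term gives the linearized (LinDistFlow) approximation, in which reactive-power increments act directly on the dominant $-2\,x_{ij} Q_{ij}$ voltage-drop term, consistent with the closed-loop-gain argument above.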
Overall, resilient voltage regulation under corrupted telemetry requires more than adversarial training alone; it requires the principled integration of belief-aware learning, counterfactual robustness regularization, and physically grounded control interfaces. The proposed framework, therefore, supports the design of cyber-resilient energy management systems that maintain voltage quality in inverter-dominated microgrids under measurement corruption.

Author Contributions

Conceptualization, E.D., X.Z. and H.L.; Methodology, E.D., X.Z. and H.L.; Software, E.D., X.Z. and H.L.; Validation, E.D., X.Z. and H.L.; Formal analysis, E.D., X.Z. and H.L.; Investigation, E.D., X.Z. and H.L.; Resources, E.D., X.Z. and H.L.; Data curation, E.D., X.Z. and H.L.; Writing—original draft, E.D., X.Z. and H.L.; Writing—review & editing, E.D., X.Z. and H.L.; Visualization, E.D., X.Z. and H.L.; Supervision, H.L.; Project administration, H.L.; Funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the Secretaría Nacional de Ciencia, Tecnología e Innovación (SENACYT), Panama Master’s Degree for Women in Technological Sciences Scholarship Program. The APC was funded by NSERC.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the supporting institutions and all contributors to this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Belaire, R.; Sinha, A.; Varakantham, P. On Minimizing Adversarial Counterfactual Error in Adversarial Reinforcement Learning. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
2. Zuo, S.; Pullaguram, D.; Rajabinezhad, M.; Lewis, F.L.; Davoudi, A. Resilient AC Microgrids Against Correlated Attacks. IEEE Access 2023, 11, 1603–1612.
3. Liang, Z.; Chung, C.Y.; Zhang, W.; Wang, Q.; Lin, W.; Wang, C. Enabling High-Efficiency Economic Dispatch of Hybrid AC/DC Networked Microgrids: Steady-State Convex Bi-Directional Converter Models. IEEE Trans. Smart Grid 2025, 16, 45–61.
4. Özgenç, B.; Baysal Aslanhan, Y.A.; Altaş, İ.H. Improving LVRT Capability in Grid Connected PV System Using DRL-Based Controller. IEEE Access 2025, 13, 41025–41039.
5. Tong, N.; Li, J.; Li, G.; Huang, W.; Cao, R.; Wang, Y. Soft Single-Pole Reclosing Control for MMC-Connected AC Transmission Lines. IEEE Trans. Power Deliv. 2024, 39, 2841–2854.
6. Yang, S.; Zhu, Y. Offline Reinforcement Learning for Microgrid Voltage Regulation. arXiv 2025, arXiv:2505.09920.
7. Chen, D.; Chen, K.; Li, Z.; Chu, T.; Yao, R.; Qiu, F.; Lin, K. PowerNet: Multi-Agent Deep Reinforcement Learning for Scalable Powergrid Control. IEEE Trans. Power Syst. 2022, 37, 1007–1017.
8. Zhao, J.; Li, F.; Mukherjee, S.; Sticht, C. Deep Reinforcement Learning-Based Model-Free On-Line Dynamic Multi-Microgrid Formation to Enhance Resilience. IEEE Trans. Smart Grid 2022, 13, 2557–2567.
9. Xia, Y.; Xu, Y.; Feng, X. Hierarchical Coordination of Networked-Microgrids Toward Decentralized Operation: A Safe Deep Reinforcement Learning Method. IEEE Trans. Sustain. Energy 2024, 15, 1981–1993.
10. Mukherjee, S.; Hossain, R.R.; Mohiuddin, S.M.; Liu, Y.; Du, W.; Adetola, V.; Jinsiwale, R.A.; Huang, Q.; Yin, T.; Singhal, A. Resilient Control of Networked Microgrids Using Vertical Federated Reinforcement Learning: Designs and Real-Time Test-Bed Validations. IEEE Trans. Smart Grid 2025, 16, 1897–1910.
11. Cao, D.; Zhao, J.; Hu, W.; Yu, N.; Ding, F.; Huang, Q.; Chen, Z. Deep Reinforcement Learning Enabled Physical-Model-Free Two-Timescale Voltage Control Method for Active Distribution Systems. IEEE Trans. Smart Grid 2022, 13, 149–165.
12. Ying, Y.; Liu, Q.; Wu, M.; Zhai, Y. Online Energy Management Strategy of the Flexible Smart Traction Power Supply System. IEEE Trans. Transp. Electrif. 2023, 9, 981–994.
13. Cui, G.; Jia, Q.S.; Guan, X. Energy Management of Networked Microgrids with Real-Time Pricing by Reinforcement Learning. IEEE Trans. Smart Grid 2024, 15, 570–580.
14. Li, Y.; Gao, J.; Li, Y.; Chen, C.; Li, S.; Shahidehpour, M.; Chen, Z. Physical Informed-Inspired Deep Reinforcement Learning Based Bi-Level Programming for Microgrid Scheduling. IEEE Trans. Ind. Appl. 2025, 61, 1488–1500.
15. Li, H.; Jin, T.; Xu, X.; Shi, L. Hierarchical Collaborative Optimization of Shared Energy Storage with Co-Generation Based on Deep Reinforcement Learning and P2P Network Game Theory. IEEE Access 2024, 12, 155718–155732.
16. Sayed, A.R.; Zhang, X.; Wang, G.; Wang, Y.; Shaaban, M.; Shahidehpour, M. Deep Reinforcement Learning-Assisted Convex Programming for AC Unit Commitment and Its Variants. IEEE Trans. Power Syst. 2024, 39, 5561–5574.
17. Baran, M.E.; Wu, F.F. Optimal Capacitor Placement on Radial Distribution Systems. IEEE Trans. Power Deliv. 1989, 4, 725–734.
18. Feng, J.; Shi, Y.; Qu, G.; Low, S.H.; Anandkumar, A.; Wierman, A. Stability Constrained Reinforcement Learning for Decentralized Real-Time Voltage Control. IEEE Trans. Control Netw. Syst. 2024, 11, 1370–1381.
19. Chen, Z.; Cai, S.; Meliopoulos, A.P.S. Robust Deep Reinforcement Learning for Volt-VAR Optimization in Active Distribution System Under Uncertainty. IEEE Trans. Smart Grid 2025, 16, 4463–4474.
20. Bouchkati, S.; Sabirov, R.; Kortmann, S.; Ulbig, A. Partially Observable Residual Reinforcement Learning for PV-Inverter-Based Voltage Control in Distribution Grids. In Proceedings of the 2025 IEEE Kiel PowerTech, Kiel, Germany, 29 June–3 July 2025.
21. Zhuang, P. Stochastic Energy Management and Cyber-Physical Security of Battery Energy Storage Systems in Smart Distribution Systems. Ph.D. Thesis, University of Alberta, Edmonton, AB, Canada, 2020.
22. Bolognani, S.; Zampieri, S. On the Existence and Linear Approximation of the Power Flow Solution in Power Distribution Networks. IEEE Trans. Power Syst. 2016, 31, 163–172.
23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
24. Sahu, A.; Nguyen, T.; Chen, K.; Zhang, X.; Hassanaly, M. Detection of False Data Injection Attacks (FDIA) on Power Dynamical Systems with a State Prediction Method. IEEE Access 2025, 13, 12411–12426.
25. Naderi, E.; Asrari, A.; Fajri, P. False Data Injection Cyberattacks Targeting Electric Vehicles in Smart Power Distribution Systems. In Proceedings of the 2024 IEEE Transportation Electrification Conference and Expo (ITEC), Chicago, IL, USA, 19–21 June 2024.
26. Tang, Z.; Hill, D.J.; Liu, T. Distributed Coordinated Reactive Power Control for Voltage Regulation in Distribution Networks. IEEE Trans. Smart Grid 2021, 12, 312–323.
27. IEEE Std 1547-2018; IEEE Standard for Interconnection and Interoperability of Distributed Energy Resources with Associated Electric Power Systems Interfaces. IEEE: New York, NY, USA, 2018; pp. 1–138.
Figure 1. Illustration of the cyber–physical system architecture of the microgrid.
Figure 2. An illustration of the case study configuration based on the IEEE 123-Node Test Feeder.
Figure 3. Mean episodic reward over 1000 training iterations for Baseline PPO and PPO+ACoE under adversarial partial observability.
Figure 4. Policy entropy over 1000 training iterations for Baseline PPO and PPO+ACoE under adversarial partial observability.
Figure 5. Fraction of buses outside the service-voltage band ([0.95, 1.05] p.u.) under adversarial sensing perturbations. Results are reported at five evaluation checkpoints across the 1000-iteration training horizon.
Figure 6. Number of buses in undervoltage (V < 0.95 p.u.) under adversarial sensing perturbations, evaluated at five checkpoints (iterations {0, 250, 500, 750, 1000}).
Figure 7. Undervoltage and overvoltage counts (V < 0.95 p.u., V > 1.05 p.u.) under adversarial sensing perturbations, comparing Baseline PPO and PPO+ACoE (Δ/CLIP) at five evaluation checkpoints across the 1000-iteration training horizon.
Figure 8. OpenDSS-based feeder-wide voltage outcomes at the final evaluation checkpoint under adversarial partial observability: (a) average bus-voltage magnitude (p.u.); (b) percentage of buses outside the [0.95, 1.05] service band. Bar shading indicates the action-interface family: light gray denotes absolute (setpoint) actions (Baseline PPO, PPO+ACoE), dark gray denotes Δ/CLIP (incremental/clipped) actions, and medium gray denotes the All-Zero baseline.
Figure 9. OpenDSS feeder-wide voltage outcomes under adversarial partial observability, grouped by action-interface family. Left: absolute setpoints (undervoltage, many violations). Right: Δ/CLIP (near-nominal, fewer violations). Shading: light = absolute, dark = Δ/CLIP, medium = All-Zero controller. Note: the y-axis ranges differ between the left and right groups.
Table 1. Parameters for Baseline PPO and PPO+ACoE training.

| Parameter | Baseline PPO | PPO+ACoE |
|---|---|---|
| Seed | 42 | 42 |
| Learning rate (α) | 3 × 10⁻⁴ | 3 × 10⁻⁴ |
| Batch size | 256 | 256 |
| Clip range (ϵ_clip) | 0.2 | 0.2 |
| Discount factor (γ) | 0.99 | 0.99 |
| GAE parameter (λ_GAE) | 0.95 | 0.95 |
| Entropy coefficient | 0.01 | 0.01 |
| Value loss coefficient | 0.5 | 0.5 |
| Adversarial noise level (ϵ) | 0.05 | 0.05 |
| Counterfactual loss weight (β) | n/a | 0.1 |
| Training horizon (PPO update iterations) | 1000 | 1000 |
Table 2. Robust baseline comparison on the IEEE-123 feeder under (a) OpenDSS evaluation and (b) critic-based perturbation (ϵ = 0.05). Results shown as mean ± std.

(a) OpenDSS evaluation

| Metric | Robust PPO (Noise-Trained) | ACoE |
|---|---|---|
| min V (p.u.) | 0.97899 ± 0.00016 | 0.98095 ± 0.00799 |
| mean V (p.u.) | 0.99455 ± 0.00011 | 0.99599 ± 0.00490 |
| max V (p.u.) | 1.03673 | 1.03610 ± 0.00345 |

(b) Critic-based perturbation

| Method | min bus V (p.u.) | Episode Reward |
|---|---|---|
| Robust PPO (noise-trained) | 0.979 | -3.006 ± 0.255 |
| ACoE | 0.979 | -0.940 ± 0.082 |
Table 3. Summary of feeder-wide voltage violation metrics (fixed-seed evaluation at iteration 1000).

| Controller | Buses < 0.95 p.u. (%) | Aggregate Voltage Deficit (p.u.) |
|---|---|---|
| Baseline PPO | ≈90% | ≈44.19 |
| PPO+ACoE | ≈90% | ≈44.19 |
| PPO+ACoE (Δ/CLIP) | <10% | ≈0.27 |
| All-Zero controller | 0% | 0 |
Table 4. Study under an identical physics-aware Δ/CLIP action interface on the IEEE-123 system: (a) feeder-wide voltage statistics under OpenDSS and (b) adversarial performance under critic-based perturbations (ϵ = 0.05). Results are reported as mean ± std.

(a) OpenDSS voltage statistics

| Metric (Mean ± Std) | Baseline PPO (Δ/CLIP) | PPO+ACoE (Δ/CLIP) |
|---|---|---|
| min V (p.u.) | 0.9812 ± 0.0055 | 0.9809 ± 0.0080 |
| mean V (p.u.) | 0.9957 ± 0.0038 | 0.9960 ± 0.0049 |
| max V (p.u.) | 1.0352 ± 0.0040 | 1.0361 ± 0.0035 |
| std of V across buses (p.u.) | 0.0126 ± 0.0022 | 0.0130 ± 0.0031 |

(b) Critic-attack performance

| Method | Episode Return (ϵ = 0.05) |
|---|---|
| Baseline PPO (Δ/CLIP) | -0.9848 ± 0.1015 |
| PPO+ACoE (Δ/CLIP) | -0.9399 ± 0.0836 |

Domínguez, E.; Zhou, X.; Liang, H. Adversarially Robust Reinforcement Learning for Energy Management in Microgrids with Voltage Regulation Under Partial Observability. Energies 2026, 19, 1497. https://doi.org/10.3390/en19061497
