Article

A Hierarchical Predictive-Adaptive Control Framework for State-of-Charge Balancing in Mini-Grids Using Deep Reinforcement Learning

1 Department of Computer Science and Engineering, European University of Cyprus, Engomi, Nicosia 2404, Cyprus
2 Department of Computer Science, University of Cyprus, Nicosia 1678, Cyprus
3 CYENS Centre of Excellence, Nicosia 1016, Cyprus
4 Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi 923-1292, Ishikawa, Japan
5 Department of Information Engineering, Kanazawa Gakuin University, Kanazawa 920-1392, Ishikawa, Japan
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 61; https://doi.org/10.3390/electronics15010061
Submission received: 17 November 2025 / Revised: 8 December 2025 / Accepted: 18 December 2025 / Published: 23 December 2025
(This article belongs to the Special Issue Smart Power System Optimization, Operation, and Control)

Simple Summary

Mini-grids with multiple battery energy storage systems need intelligent control to keep all batteries at similar states of charge, protect their lifetime, and operate economically. This paper proposes a hierarchical framework that combines a federated Transformer forecasting model with a Soft Actor-Critic reinforcement learning agent to achieve predictive, adaptive, and scalable state-of-charge balancing under high renewable penetration.

Abstract

State-of-charge (SoC) balancing across multiple battery energy storage systems (BESS) is a central challenge in renewable-rich mini-grids. Heterogeneous battery capacities, differing states of health, stochastic renewable generation, and variable loads create a high-dimensional, uncertain control problem. Conventional droop-based SoC balancing strategies are decentralized and computationally light but fundamentally reactive, whereas model predictive control (MPC) is proactive but computationally intensive and prone to modeling errors. This paper proposes a Hierarchical Predictive–Adaptive Control (HPAC) framework for SoC balancing in mini-grids using deep reinforcement learning. The framework consists of two synergistic layers operating on different time scales. A long-horizon Predictive Engine, implemented as a federated Transformer network, provides multi-horizon probabilistic forecasts of net load, enabling multiple mini-grids to collaboratively train a high-capacity model without sharing raw data. A fast-timescale Adaptive Controller, implemented as a Soft Actor-Critic (SAC) agent, uses these forecasts to make real-time charge/discharge decisions for each BESS unit. The forecasts are used both to augment the agent’s state representation and to dynamically shape a multi-objective reward function that balances SoC, economic performance, degradation-aware operation, and voltage stability. The paper formulates SoC balancing as a Markov decision process, details the SAC-based control architecture, and presents a comprehensive evaluation using a MATLAB (R2025a)-based digital-twin simulation environment. A rigorous benchmarking study compares HPAC against fourteen representative controllers spanning rule-based, MPC, and various DRL paradigms. Sensitivity analysis on reward weight selection and ablation studies isolating the contributions of forecasting and dynamic reward shaping are conducted. Stress-test scenarios, including high-volatility net-load conditions and communication impairments, demonstrate the robustness of the approach. Results show that HPAC achieves near-minimal operating cost with essentially zero SoC variance and the lowest voltage variance among all compared controllers, while maintaining moderate energy throughput that implicitly preserves battery lifetime. Finally, the paper discusses a pathway from simulation to hardware-in-the-loop testing and a cloud-edge deployment architecture for practical, real-time deployment in real-world mini-grids.

Graphical Abstract

1. Introduction

The global energy landscape is undergoing a profound transformation driven by the dual imperatives of decarbonization and energy security. At the forefront of this transition are mini-grids—localized, self-sufficient electrical grids that can operate either in conjunction with the main utility grid or in an autonomous, islanded mode. These systems enable integration of high penetrations of renewable energy sources (RES), enhance resilience against disruptions, and extend electricity access to remote communities. Recent white papers and surveys emphasize that high-renewable micro-grids face nontrivial stability and resilience challenges and are a natural application domain for advanced AI- and ML-enabled control [1,2,3,4,5,6]. Mini-grids are small-scale electricity systems that combine local generation (often renewable), distribution networks, and end-user loads to supply a defined group of customers—frequently in remote or weak-grid areas—and can operate either autonomously or interconnected with the main grid [7,8].
Central to the functionality and stability of modern, renewable-rich mini-grids are battery energy storage systems (BESS). BESS act as buffers against the intermittency of RES, such as solar and wind, by absorbing surplus generation and discharging to meet load demand, thereby supporting a continuous, stable power supply. BESS also play a key role in emerging ML-based micro-grid energy management and power forecasting frameworks [3,9,10].
As mini-grids grow in complexity, they increasingly deploy multiple, often heterogeneous BESS units distributed throughout the network. In such multi-BESS configurations, the primary control challenge extends beyond simple power sharing. It becomes a complex optimization problem centered on maintaining state-of-charge (SoC) balance across the entire storage fleet. A growing body of work studies SoC-aware control and balancing strategies in DC and AC micro-grids using droop-based and adaptive methods [11,12,13,14,15,16].
Effective SoC balancing is critical for the following reasons:
  • It prevents chronic overuse or underutilization of specific battery units, which would otherwise lead to accelerated degradation and premature failure. By ensuring that all units contribute equitably to system operation, SoC balancing prolongs the collective operational lifetime of storage assets.
  • It maximizes operational flexibility and resilience. A balanced BESS fleet has a greater effective capacity to respond to sudden changes in generation or load, as no single unit is prematurely constrained by reaching its upper or lower SoC limit.
The core problem is to devise a control strategy that can dynamically manage the charge and discharge cycles of multiple BESS units to keep their SoCs closely aligned, even when facing different capacities, degradation levels (state of health, SoH), and highly variable operating conditions driven by stochastic renewable generation and uncertain loads.
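The dynamics behind this problem can be sketched with a simple coulomb-counting SoC model. The sketch below is illustrative only: the function names, the one-way efficiency of 0.95, and the 15 min time step are assumptions for exposition, not the paper's simulation parameters.

```python
def step_soc(soc, p_kw, cap_kwh, dt_h=0.25, eta=0.95):
    """Advance SoC one step with a simple coulomb-counting model.

    p_kw > 0 charges, p_kw < 0 discharges; eta is a one-way efficiency.
    """
    if p_kw >= 0:
        delta = eta * p_kw * dt_h / cap_kwh    # charging: losses occur before storage
    else:
        delta = p_kw * dt_h / (eta * cap_kwh)  # discharging: extra energy drawn from cell
    return min(1.0, max(0.0, soc + delta))     # clamp to physical SoC limits

def soc_imbalance(socs):
    """Population variance of fleet SoC: the quantity SoC balancing drives to zero."""
    mean = sum(socs) / len(socs)
    return sum((s - mean) ** 2 for s in socs) / len(socs)

# Heterogeneous fleet: an identical setpoint moves the smaller battery's SoC faster,
# so naive equal power sharing makes initially balanced SoCs diverge.
fleet = [(0.50, 100.0), (0.50, 50.0)]          # (SoC, capacity in kWh)
socs = [step_soc(s, 20.0, c) for s, c in fleet]
```

The two-unit example makes the heterogeneity issue concrete: after one step under equal charging power, the 50 kWh unit's SoC has risen twice as much as the 100 kWh unit's, so a balancing controller must assign capacity-aware setpoints rather than equal ones.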
Recent research has started to explore distributed artificial intelligence and reinforcement-learning-based energy management schemes in nano-grid and micro-grid settings, demonstrating the potential of data-driven controllers for coordinated storage and renewable operation [6,9,17,18,19]. However, these approaches typically do not explicitly target hierarchical SoC balancing with a federated multi-horizon forecasting layer as a first-class component of the control architecture. Control strategies for SoC balancing have evolved significantly, yet existing paradigms exhibit fundamental limitations in the dynamic, uncertain environment of a renewable-heavy mini-grid. The foundational method for decentralized power sharing in micro-grids is droop control. This technique emulates the behavior of synchronous generators in traditional power systems, creating an artificial relationship between a unit’s power output and the grid’s frequency (AC systems) or voltage (DC systems) [20,21]. Its primary advantages are simplicity and communication-free operation, allowing plug-and-play integration of parallel inverters.
However, conventional droop control suffers from inherent drawbacks that compromise its effectiveness:
  • There is a fundamental trade-off between power-sharing accuracy and voltage regulation: the mechanism that enables power sharing also introduces deviations in bus voltage.
  • Performance is highly sensitive to mismatched line impedances between BESS units and the point of common coupling, leading to inaccurate current sharing and divergent SoC trajectories.
To overcome these limitations, a rich body of work has proposed SoC-aware and adaptive droop strategies that adjust each converter’s droop coefficient in real time as a function of its battery SoC, and refine this idea to account for converter ratings, battery capacities, and line impedances [11,12,13,14,15,16,22]. These schemes demonstrate that SoC balancing can be achieved in a decentralized and relatively inexpensive manner. They remain fundamentally reactive, however: control actions are driven by instantaneous SoC differences and local measurements, without explicit consideration of future net-load patterns, price signals, or degradation trajectories, and most formulations treat SoC equalization as a single dominant objective that cannot be coordinated directly with economic or lifetime-aware operation under uncertainty. Section 2.1.1 reviews this line of work in detail.
Model predictive control (MPC) and related optimization-based approaches take the opposite stance: they minimize a finite-horizon cost function combining grid energy purchase cost, renewable curtailment penalties, and sometimes battery degradation or SoC-related penalties, subject to constraints on power balance, converter limits, and SoC bounds. Coupled with short-term forecasts of load and renewable generation, MPC can exploit foresight to schedule charge/discharge actions, diesel generation, and grid exchange, and it has been applied in centralized and hierarchical micro-grid schemes, including variants with deep learning-based forecasting modules [3,23].
In practice, however, MPC faces two significant hurdles in real mini-grids. First, it is profoundly dependent on an accurate system model: BESS characteristics evolve with aging, renewable generation is uncertain, and load profiles are difficult to predict precisely, so model mismatch can lead to suboptimal or unstable control actions [24]. Second, the computational complexity of real-time constrained optimization grows with the number of assets, model complexity, and horizon length. For complex mini-grids requiring fast control decisions, MPC’s computational burden can become prohibitive compared with DRL-based approaches [6,10]. Section 2.1.2 discusses these methods further.
Thus, droop control is computationally light but reactive and “nearsighted,” failing to anticipate future conditions. MPC is proactive and “farsighted” but computationally demanding and fragile to model inaccuracies. There is a clear need for a control paradigm that combines the predictive power of advanced forecasting with the real-time, model-free adaptability of modern machine learning, achieving a solution that is both proactive and computationally tractable.
This work proposes a novel control architecture, the Hierarchical Predictive–Adaptive Control (HPAC) framework, designed specifically to address the multifaceted challenge of SoC balancing in renewable-rich mini-grids.
The central thesis is that by strategically decoupling the control problem into two synergistic, hierarchically arranged layers, it is possible to overcome the inherent limitations of both purely reactive and purely model-based approaches. The HPAC framework consists of the following:
  • A long-horizon Predictive Engine: operating on a slower timescale, this layer leverages a state-of-the-art Transformer-based deep learning model within a federated learning architecture to generate accurate, multi-horizon probabilistic forecasts of the mini-grid’s net load.
  • A real-time Adaptive Controller: operating on a faster timescale, this layer employs a Soft Actor-Critic (SAC) deep reinforcement learning (DRL) agent to make instantaneous charge/discharge decisions for each BESS unit.
The innovation lies not just in the use of these advanced techniques but in the intelligent synthesis of their capabilities. The predictive engine provides the farsightedness that reactive methods lack, while the model-free DRL-based adaptive controller provides the computational tractability and resilience to uncertainty that plague MPC. The design is informed by recent experience with distributed AI frameworks for nano- and micro-grid power management and autonomous RL-based energy management [9,17,18,19]. The forecasts generated by the upper layer are not merely passive inputs to the lower layer; they dynamically shape the DRL agent’s state representation and reward function, providing crucial context for the near-future operating environment. This hierarchical synergy enables a control strategy that is simultaneously proactive in planning and adaptive in execution, capable of robust, multi-objective SoC balancing in complex, real-world mini-grid environments.
The remainder of this paper is structured as follows. Section 2 reviews the relevant literature and provides background on predictive-adaptive control and federated learning. Section 3 defines the SoC balancing problem and presents the architectural overview of the HPAC framework. Section 4 details the methodology, including the Predictive Engine design, the SAC-based Adaptive Controller, and reward function engineering with sensitivity analysis on weight selection. Section 5 presents the performance evaluation, including a comprehensive benchmarking study against fourteen representative controllers, ablation studies on the role of forecasts and reward shaping, stress-test scenarios for robustness assessment, and a detailed analysis of trade-offs between cost, throughput, and degradation. Section 6 concludes the paper with a summary of key findings, limitations, and directions for future work, including scalability considerations and deployment pathways.
Table 1 summarizes the main notation used throughout the paper and provides a single, consistent reference for all model, control, and evaluation terms. This table is used to avoid ambiguity and to ensure the subsequent problem formulation, controller description, and benchmarking results are interpreted consistently.

2. Literature Review and Background

This section situates the proposed Hierarchical Predictive–Adaptive Control (HPAC) framework within the broader landscape of SoC balancing and micro-grid energy management and introduces the technical concepts on which it builds. We first provide a structured literature review that spans droop-based and decentralized SoC-balancing methods, model predictive and optimization-based energy management schemes, data-driven and deep reinforcement learning controllers, and advanced forecasting and federated learning approaches that enable predictive, privacy-preserving operation in modern mini-/micro-grids. We then present the background required to understand HPAC, including the predictive–adaptive control philosophy and the role of federated learning in training forecasting and control models across multiple micro-grids without sharing raw data. Together, these elements clarify both how HPAC extends existing lines of work and why its particular combination of predictive foresight, DRL-based adaptation, and federated training is well-suited to multi-BESS SoC balancing under high renewable penetration.

2.1. Literature Review

This section reviews the main lines of work related to state-of-charge (SoC) balancing and energy management in mini-/micro-grids, with a focus on the following: (i) droop-based SoC-aware control schemes and related decentralized strategies for multi-battery systems, (ii) model predictive and optimization-based energy management controllers, (iii) data-driven and deep reinforcement learning (DRL) approaches for micro-grid energy management, and (iv) advanced forecasting and federated learning techniques that underpin the machine-learning components of the proposed HPAC framework. We also position our previous work on distributed AI for nano-/micro-grids and DRL-based micro-grid energy management within this landscape.

2.1.1. SoC Balancing and Droop-Based Control

Droop control is the canonical decentralized technique for power sharing among parallel inverters and storage units in AC and DC micro-grids. It emulates the behavior of synchronous generators by imposing a functional relation between active power and frequency and between reactive power and voltage (in AC systems) or between power and bus voltage (in DC systems) [20,21]. The main advantages of droop control are its simplicity, communication-free operation, and plug-and-play capability, which have led to its widespread adoption in micro-grid practice.
However, standard droop control is primarily designed for power sharing and voltage/frequency regulation, not for explicit SoC balancing. A persistent issue is the inherent trade-off between accurate power sharing and voltage regulation: modifying droop slopes to improve current sharing often results in undesirable bus-voltage deviations [20]. Furthermore, mismatched line impedances and heterogeneous converter ratings can cause inaccurate current sharing and divergent SoC trajectories across distributed battery energy storage system (BESS) units.
To overcome these limitations, a rich body of work has proposed SoC-aware and adaptive droop strategies. A comprehensive review of droop-based SoC-balancing methods for DC micro-grids is provided in [12], where fixed SoC-compensated droop, adaptive droop, and virtual-impedance-based schemes are compared. In SoC-based adaptive droop control, the droop coefficient of each converter is adjusted in real time as a function of its corresponding battery SoC. Typical designs assign smaller droop coefficients to units with higher SoC during discharging and larger coefficients during charging, encouraging high-SoC units to supply more power and absorb less power, and vice versa for low-SoC units [11,14,16].
Several variants refine this idea to account for converter ratings, battery capacities, and line impedances. For example, capacity-aware adaptive droop schemes scale SoC-dependent coefficients by the usable capacity of each BESS to prevent overutilization of weaker units and to improve fairness among heterogeneous batteries [15]. Other works introduce auxiliary feedback terms based on current or power to mitigate the impact of unequal line resistances and to restore bus voltage while retaining SoC balancing performance [13,22]. Overall, these schemes demonstrate that SoC balancing can be achieved in a decentralized and relatively inexpensive manner.
More recent work pushes droop-based SoC control further. Tian et al. propose an adaptive droop coefficient algorithm that combines fuzzy logic with SoC feedback and a voltage-recovery loop to jointly stabilize the DC bus and equalize SoC among multiple energy storage units [13]. Luo et al. design a fast SoC-balancing and current-sharing strategy for distributed energy storage units interfaced via boost converters, explicitly addressing the degradation of balancing speed as SoC spreads shrink [25]. Non-droop schemes have also been explored: Bhosale et al. present a centralized control strategy that achieves SoC-based current sharing without droop characteristics, relying instead on a supervisory coordinator that dispatches current references to each converter [26]. Finally, Fagundes et al. provide a broad review of SoC-balancing strategies for BESS in micro-grids, highlighting emerging topics such as multi-agent SoC control, modular multilevel converter topologies, and second-life batteries [27]. These contributions confirm the maturity of SoC-aware droop and related decentralized strategies while also underscoring their largely reactive nature and their limited integration with long-horizon economic and degradation-aware objectives.
Despite these advances, droop-based SoC balancing remains fundamentally reactive. Control actions are driven by instantaneous SoC differences and local measurements, without explicit consideration of future net-load patterns, price signals, or degradation trajectories. Most formulations are single-objective, with SoC equalization as the dominant design goal, and therefore cannot directly coordinate SoC balancing with economic optimization or lifetime-aware operation under uncertainty.

2.1.2. Model Predictive and Optimization-Based Energy Management

Model predictive control (MPC) and related optimization-based approaches have been widely investigated for energy management and optimal battery operation in mini-grids and micro-grids. In typical formulations, a cost function combining grid energy purchase cost, renewable curtailment penalties, and sometimes battery degradation or SoC-related penalties is minimized over a finite horizon, subject to constraints on power balance, converter limits, and SoC bounds. When coupled with short-term forecasts of load and renewable generation, MPC can exploit foresight to schedule charge/discharge actions, diesel generation, and grid exchange to improve economic performance and reliability [3,23].
In micro-grid settings with high renewable penetration, MPC has been applied in centralized and hierarchical schemes. A slower supervisory MPC may perform hour-ahead or day-ahead scheduling, while a faster inner loop enforces voltage and frequency constraints at the converter level. Deep learning-based forecasting modules have also been integrated into MPC frameworks, where neural network predictors supply multi-step load or solar forecasts used as exogenous inputs in the optimization [3,23]. Such architectures provide a principled means to handle multi-objective trade-offs, including SoC balancing, curtailment minimization, and arbitrage.
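The receding-horizon structure described above can be sketched with a deliberately tiny brute-force version for a single BESS: enumerate a coarse action grid over the forecast horizon, discard SoC-infeasible sequences, and apply only the first action of the cheapest plan. A real MPC solves a constrained optimization with a proper solver; the action grid, SoC bounds, and tariff values here are hypothetical.

```python
import itertools

def mpc_plan(soc, net_load, price, cap_kwh=100.0, p_max=20.0, dt_h=1.0):
    """Toy receding-horizon plan over forecasts net_load [kW] and price [$/kWh].

    Returns the first action [kW] of the cheapest SoC-feasible sequence
    (positive = charge, negative = discharge)."""
    actions = (-p_max, 0.0, p_max)               # discharge / idle / charge
    best_cost, best_first = float("inf"), 0.0
    for seq in itertools.product(actions, repeat=len(net_load)):
        s, cost, feasible = soc, 0.0, True
        for p, nl, pi in zip(seq, net_load, price):
            s += p * dt_h / cap_kwh
            if not 0.1 <= s <= 0.9:              # SoC bounds make some plans infeasible
                feasible = False
                break
            grid = nl + p                        # charging adds to the grid draw
            cost += max(grid, 0.0) * pi * dt_h   # pay only for imported energy
        if feasible and cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first

# Foresight in action: with low SoC now and expensive load hours ahead, the
# cheapest feasible plan charges first so that later discharging is possible.
first = mpc_plan(0.2, net_load=[0.0, 30.0, 30.0], price=[0.1, 0.5, 0.5])
```

The exhaustive enumeration also makes the scalability hurdle visible: the search space grows as |actions|^horizon and per asset, which is why practical MPC relies on structured solvers and why its online burden motivates learned policies.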
In practice, however, MPC faces two significant hurdles in real mini-grids. First, it is susceptible to model mismatch: BESS dynamics and efficiency evolve with aging, distributed renewable generation exhibits substantial stochasticity, and consumer demand is difficult to characterize precisely. When the system diverges from the assumed model, MPC decisions can become suboptimal or even destabilizing [24]. Second, the computational complexity of real-time constrained optimization grows with the number of assets, the model order, and the horizon length. For complex micro-grids requiring fast control intervals, solving MPC problems at each time step can be more computationally demanding than DRL-based policies, which only require forward passes through trained networks at run time [10].
Recent works have attempted to alleviate some of these shortcomings through stochastic, robust, and learning-enhanced MPC formulations. Liu et al. compare robust, stochastic, and chance-constrained MPC formulations for BESS and hybrid storage, exposing the sensitivity of purely model-based schemes to uncertainty and parameter drift [24]. Hybrid RL-MPC strategies have also been explored, where an RL agent either tunes MPC weights online or provides warm-start solutions to reduce solver effort and improve robustness to model mismatch [10]. Nevertheless, these approaches remain constrained by the need to repeatedly solve optimization problems online, and they typically treat SoC balancing as a side constraint rather than as a central coordination objective across multiple BESS units.
These issues motivate the exploration of learning-based controllers that can leverage rich simulation or historical data to learn policies that are robust to modeling errors and scalable to large asset fleets, while still capturing multi-objective and constraint-aware behavior.

2.1.3. Data-Driven and Deep Reinforcement Learning Approaches

Machine learning, and in particular, deep reinforcement learning, has recently emerged as a powerful tool for micro-grid energy management and storage control. Surveys of ML/AI techniques for AC micro-grids and micro-grid energy management highlight the growing role of supervised learning (for forecasting and estimation) and DRL (for control) in next-generation micro-grids [3,4,5,6]. In the DRL setting, the micro-grid is modeled as a Markov decision process, and the controller is learned as a policy that maps system states to actions (e.g., charge/discharge setpoints) so as to maximize a long-term reward function embodying technical, economic, and reliability objectives.
Early DRL-based micro-grid controllers mostly relied on Q-learning or deep Q-networks with discrete action spaces and focused on economic performance, such as minimizing electricity bills under time-of-use tariffs or performing peak shaving [3,9]. As the field matured, continuous-control algorithms such as Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO) were adopted to handle continuous power setpoints and to achieve better sample efficiency and stability [6,10]. SAC, in particular, combines an off-policy actor-critic structure with an entropy-regularized objective that encourages exploration and robustness and has been successfully applied to multi-timescale scheduling and control in micro-grids with hybrid storage [28].
Recent work by Ioannou et al. has demonstrated the potential of distributed AI and DRL for coordinated energy management and grid-support functionalities in nano-grid and micro-grid settings, using multi-agent and heterogeneous-agent formulations to handle multiple assets and services [17,18]. Complementary to this, Javaid et al. have investigated real-time DRL-based energy management strategies in micro-grids using Gym-compatible, high-fidelity simulation environments, highlighting the practicality of DRL controllers under realistic operational constraints [19]. These contributions show that DRL methods can learn to coordinate distributed resources in complex micro-grid environments while naturally incorporating time-varying prices, forecast information, and device constraints into the reward and state definitions.
At the algorithmic level, several recent studies demonstrate the breadth of DRL architectures that can be applied to micro-grid energy management. Liu et al. formulate real-time economic micro-grid EMS as a DDPG-based DRL problem, explicitly comparing DRL with MPC and other RL baselines under significant renewable and price uncertainty [29]. Chen et al. propose a hierarchical deep Q-network (HDQN) for transmission-constrained micro-grid dispatch, which decomposes the decision-making into multiple Q-learners over different time scales and network regions [30]. Other works explore DRL-based EMS for EV-integrated micro-grids, dual-battery hybrid storage, and multi-energy carrier systems, typically focusing on cost minimization and reliability enhancement. While these studies confirm that DRL can outperform classical MPC or rule-based controllers in complex, uncertain environments, SoC balancing across multiple BESS units is usually handled via penalty terms and constraints rather than as a first-class reward component.
Nonetheless, most existing DRL-based micro-grid controllers do not explicitly regard SoC balancing across multiple BESS units as a primary design objective. Instead, SoC constraints are typically handled via penalty terms or hard bounds, and the primary focus lies on cost minimization, peak shaving, or renewable utilization. Moreover, forecasting and control are often treated as loosely coupled components: forecasts may be injected as exogenous state features, but the learning architecture seldom reflects a clear separation between a dedicated forecasting layer and a fast-timescale control layer. The HPAC framework addresses these gaps by making SoC balancing a first-class objective in the reward function and by structuring the solution hierarchically around a predictive (forecasting) upper layer and an adaptive (DRL) lower layer.
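The idea of making SoC balance a first-class reward term whose weight is modulated by the forecasting layer can be sketched as follows. The weights and the scalar "forecast_stress" signal are illustrative assumptions, not the paper's calibrated reward design.

```python
def hpac_reward(soc_var, energy_cost, throughput_kwh, volt_dev,
                forecast_stress, w=(10.0, 1.0, 0.05, 5.0)):
    """Illustrative HPAC-style multi-objective reward (all weights hypothetical).

    soc_var        : fleet SoC variance (balancing term, first-class objective)
    energy_cost    : grid energy cost over the step (economic term)
    throughput_kwh : battery energy throughput (degradation proxy)
    volt_dev       : bus-voltage deviation (stability term)
    forecast_stress in [0, 1] comes from the predictive layer: when a volatile
    net-load period is forecast, the SoC-balance weight is raised so the agent
    keeps headroom spread evenly across the fleet before the disturbance hits.
    """
    w_soc, w_cost, w_deg, w_volt = w
    w_soc *= 1.0 + forecast_stress            # forecast-driven dynamic reward shaping
    return -(w_soc * soc_var + w_cost * energy_cost
             + w_deg * throughput_kwh + w_volt * volt_dev)
```

The key design point is that the same physical outcome earns a lower reward when volatility is forecast, which pushes the learned policy toward pre-emptive balancing rather than after-the-fact correction.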

2.1.4. Forecasting, Transformer Models, and Federated Learning

Accurate forecasting of net load (load minus renewable generation) is essential for predictive energy management and for exploiting BESS flexibility. Traditional statistical methods such as ARIMA and exponential smoothing have been widely used in power systems, but they struggle to capture complex nonlinear dependencies and multi-scale temporal patterns. Deep learning methods, particularly recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, improved forecasting accuracy by directly modeling temporal sequences and nonlinearities and have been applied to load and solar forecasting tasks in smart grids and micro-grids [3,9].
More recently, transformer architectures have attracted attention as an alternative to RNN-based models for time-series forecasting. Initially developed for natural language processing, transformers rely on self-attention mechanisms rather than recurrence, enabling highly parallel training and the modeling of dependencies across long input sequences. Comparative studies and tutorials emphasize that transformers can outperform classical RNNs and LSTMs on long-horizon forecasting tasks, including energy and weather-related time series [31,32,33]. In the energy domain, transformer-based hybrid models have been proposed for renewable and solar-irradiance forecasting, showing improved accuracy and better handling of multivariate inputs (e.g., combining historical power, meteorological variables, and calendar features) compared with traditional architectures [31,32,34,35].
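Probabilistic multi-horizon forecasters of this kind are commonly trained with the pinball (quantile) loss, one loss head per quantile level. The paper does not specify its training loss, so the sketch below is a generic illustration of the standard technique rather than the HPAC Predictive Engine itself.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one quantile level q in (0, 1).

    Under-prediction (y_pred < y_true) is weighted by q, over-prediction by
    (1 - q); minimizing the average makes y_pred track the q-th conditional
    quantile of the target, yielding calibrated forecast bands."""
    err = y_true - y_pred
    return q * err if err >= 0 else (q - 1.0) * err

def mean_pinball(y_true_seq, y_pred_seq, q):
    """Average pinball loss over a forecast horizon."""
    return sum(pinball_loss(t, p, q) for t, p in
               zip(y_true_seq, y_pred_seq)) / len(y_true_seq)

# At q = 0.9 the loss punishes under-prediction 9x harder than over-prediction,
# so the fitted curve sits near the top of the net-load distribution.
under = pinball_loss(10.0, 8.0, 0.9)   # forecast too low
over = pinball_loss(8.0, 10.0, 0.9)    # forecast too high
```

Training, say, the 0.1, 0.5, and 0.9 quantile heads this way yields the kind of probabilistic band that the downstream controller can consume as uncertainty-aware state features.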
Parallel to these modeling advances, federated learning (FL) has emerged as a key paradigm for collaboratively training machine-learning models across multiple data-owning entities without sharing raw data. Surveys highlight the relevance of FL in smart grid contexts, where privacy, regulatory, and communication constraints often prevent centralization of high-resolution consumption and generation data [36,37]. In FL, each site trains a local copy of the model on its own data and only shares model updates (e.g., gradients or weights) with a central aggregator, which computes an aggregate update and redistributes the improved global model. This process allows multiple micro-grids or prosumers to collaboratively learn high-capacity forecasting models while keeping their raw time-series data local.
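The aggregation step just described is, in its simplest form, Federated Averaging (FedAvg): each site trains locally, sends only model parameters, and the aggregator returns a size-weighted mean. A minimal sketch over flat parameter vectors (real deployments operate on per-layer tensors and add secure aggregation):

```python
def fedavg(client_weights, client_sizes):
    """One FedAvg round: size-weighted average of clients' model parameters.

    client_weights : list of flat parameter vectors, one per mini-grid
    client_sizes   : number of local training samples per mini-grid
    Only these parameter updates leave each site; raw time series stay local.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
            for k in range(dim)]

# Two mini-grids; the site with 3x the data pulls the global model toward itself.
global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
```

The weighting by local dataset size is what lets data-rich sites contribute proportionally while data-poor sites still benefit from the shared global model.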
In the context of energy time series, Moveh et al. show that transformer-based architectures can significantly improve multi-building consumption forecasting in smart cities, especially when modeling long seasonal patterns and heterogeneous buildings [38]. Zhong proposes an ANN–LSTM–transformer hybrid model for joint load and market price forecasting, illustrating how attention layers can refine recurrent features and improve long-horizon accuracy [39]. On the learning side, Grataloup et al. review federated learning applications in renewable energy, outlining communication, heterogeneity, and privacy trade-offs that are highly relevant for distributed micro-grids [40]. Naidji et al. introduce a decentralized FL architecture for networked micro-grids, in which local EMS agents collaboratively train deep learning-based controllers while preserving data locality [41]. Together, these works support the feasibility of combining transformer-based forecasters with FL in multi-micro-grid settings, and they motivate the federated, transformer-based predictive engine adopted in HPAC.
The HPAC framework builds directly on these developments. In the upper Predictive Engine, a transformer-based architecture is employed for multi-horizon net-load forecasting, leveraging self-attention to encode long-range seasonal, daily, and intra-day patterns. A federated learning setup is adopted to enable multiple mini-grids to jointly train a global forecasting model without sharing raw operational data, aligning with the privacy and scalability requirements discussed in [36,37,40,41]. The resulting probabilistic forecasts are then used by the lower DRL layer both as state features and as signals to modulate the multi-objective reward function dynamically.

2.1.5. Comparative Assessment of Control Paradigms

To make the positioning of HPAC more concrete, Table 2 provides a qualitative comparison between representative approaches from each of the main control families discussed above (droop-based SoC controllers, MPC-based EMS, DRL-based controllers, and forecasting/FL layers) and the proposed HPAC framework itself. Each row corresponds to one representative line of work in the literature (one reference per row), grouped by paradigm.
As summarized in Table 2, droop-based approaches remain attractive for their simplicity and decentralization, but they are inherently myopic and predominantly single-objective, making it challenging to co-optimize SoC balancing with economic and lifetime-related criteria [11,12,13,25,26,27]. MPC-based controllers, by contrast, are naturally suited to multi-objective optimization with explicit foresight, but their reliance on accurate models and the computational burden of real-time optimization limit their robustness and scalability in the face of uncertainty [3,10,23,24]. Existing DRL controllers alleviate the modeling burden and provide flexible, multi-objective policies [4,6,17,18,19,29,30], but typically treat forecasting and control as loosely coupled modules and do not elevate SoC balancing to a central design target.
The HPAC framework combines the main strengths of these paradigms while mitigating their weaknesses. By embedding a federated transformer forecaster as a dedicated upper layer, HPAC endows the controller with explicit predictive capability under privacy constraints [34,38,39,40,41]. By implementing the lower layer as a SAC agent operating on forecast-augmented states and a carefully engineered multi-objective reward, it achieves proactive SoC balancing that remains robust and computationally tractable in real time. This synergy between predictive foresight, federated learning, and entropy-regularized DRL is a key reason why HPAC can outperform traditional droop, MPC, and generic DRL baselines in the technical, economic, and degradation-related metrics evaluated in Section 5.

2.1.6. Summary and Positioning of HPAC

In summary, existing SoC-balancing solutions for multi-BESS mini-grids can be broadly grouped into the following:
  • Droop-based methods, which are fully decentralized and computationally inexpensive but reactive and predominantly single-objective, making it challenging to co-optimize SoC balancing with economic and lifetime-related criteria [11,12,13,14,15,16,25,26,27].
  • MPC and optimization-based controllers, which can incorporate multiple objectives and constraints with explicit foresight, but whose reliance on accurate models and the computational burden of real-time optimization limit their robustness and scalability in the face of uncertainty [3,10,23,24].
  • Data-driven and DRL-based controllers, which are model-free and more robust to uncertainty but often focus on economic or reliability objectives and treat forecasting and control as loosely coupled modules, with SoC balancing handled implicitly rather than as a primary design objective [4,6,9,17,18,19,28,29,30].
On the forecasting side, transformer-based architectures and federated learning have been independently explored for load and renewable forecasting and for privacy-preserving model training in smart grids [31,32,34,35,36,37,38,39,40,41], but they are rarely integrated as a dedicated hierarchical layer in SoC-balancing controllers.
The proposed HPAC framework addresses these gaps by the following:
  • Introducing a hierarchical predictive-adaptive architecture in which a federated transformer-based Predictive Engine supplies multi-horizon probabilistic net-load forecasts to a real-time Adaptive Controller.
  • Implementing the adaptive controller as a Soft Actor-Critic agent that operates on a forecast-augmented state and optimizes a multi-objective reward, with SoC balancing, economic performance, degradation-aware operation, and voltage stability as first-class objectives.
  • Leveraging previous experience with distributed AI and DRL micro-grid controllers [17,18,19] to design a control stack that is both practically implementable and amenable to federated, privacy-preserving learning.
This combination of hierarchical forecasting and DRL-based control, together with an explicit focus on SoC balancing in multi-BESS mini-grids, differentiates HPAC from existing approaches and motivates the detailed methodological and evaluation work presented in the following sections.
Literature Review Takeaways
  • SoC balancing rarely co-optimized with economics and degradation: Most droop-based and MPC-based approaches either treat SoC equalization as a single dominant goal or as a side constraint within economic optimization, but very few works explicitly co-design SoC balancing with cost and lifetime objectives for multi-BESS mini-grids.
  • DRL controllers underuse SoC-balancing signals: Recent DRL-based EMS studies achieve strong economic performance under uncertainty, yet SoC balancing across multiple batteries is typically enforced via constraints and penalties rather than as a first-class reward component.
  • Forecasting and control are loosely coupled: Advanced forecasters (transformers, hybrid ANN-LSTM-transformer models) and privacy-preserving FL frameworks are rarely integrated into a hierarchical SoC-balancing controller where forecasts drive both state augmentation and dynamic reward shaping.
  • Contribution of HPAC: HPAC directly targets these gaps by (i) using a federated transformer Predictive Engine to deliver probabilistic net-load forecasts to (ii) a SAC-based Adaptive Controller that treats SoC balancing, economics, degradation, and voltage stability as explicit reward components, thereby providing a predictive-adaptive SoC-balancing framework that is scalable, privacy-preserving, and robust to uncertainty.

2.2. Background

This section provides the conceptual background needed to understand the proposed Hierarchical Predictive–Adaptive Control (HPAC) framework for SoC balancing in mini-grids using deep reinforcement learning. We first outline the micro-grid and SoC balancing problem, then describe the predictive-adaptive control philosophy and the role of federated learning.

2.2.1. Predictive–Adaptive Control Framework

The predictive-adaptive paradigm combines model-based predictive control with data-driven adaptive control. In the proposed HPAC framework, an upper predictive layer plays a role analogous to MPC: it uses a forecasting model (e.g., a transformer-based time-series predictor) to obtain multi-step predictions of net load and possibly price signals, and then optimizes planned SoC trajectories and power setpoints over a finite horizon. The objective function incorporates SoC variance penalties, degradation-aware terms, and economic costs (e.g., grid energy purchase), subject to constraints on BESS power, SoC limits, and grid-support requirements. The result is a set of reference signals—such as desired SoC trajectories or nominal charge/discharge profiles—for each BESS unit over the horizon.
A lower adaptive layer implements a DRL-based controller that operates at a finer timescale and tracks these references under uncertainty. The DRL agent (or agents) observes a state that includes current SoCs, local voltages and currents, and short-horizon forecast summaries, as well as the planned SoC targets from the upper layer. It outputs control actions (e.g., incremental power adjustments) that refine the predictive layer’s setpoints in real time. Through interaction with a realistic environment model or physical system, the DRL agent learns a policy that compensates for model mismatch, stochastic disturbances, and unmodeled dynamics—ultimately improving robustness and reducing constraint violations compared with a purely model-based strategy [10,17,18,19,42]. Entropy-regularized algorithms such as SAC are particularly suitable in this setting, as they handle continuous power commands and encourage robust, exploratory behavior [28].
This hierarchical decomposition mirrors recent RL–MPC hybrids, where a model-based component provides foresight and constraint handling, and a learning-based component adapts to uncertainties and refines control actions [42]. In HPAC, SoC balancing is elevated to a first-class objective across both layers: the predictive layer plans SoC equalization in a foresighted manner, whereas the adaptive layer enforces and fine-tunes that plan, maintaining balance despite forecast errors and parameter uncertainties.

2.2.2. Federated Learning for Privacy-Preserving Forecasting and Control

Modern smart-grid deployments raise privacy and data-ownership concerns, especially when multiple organizations or prosumers are involved. High-resolution time-series data on consumption and generation can reveal sensitive information about user behavior. Federated learning offers a way to collaboratively train forecasting and control models across many micro-grids without sharing raw data [36,37,40].
In an FL setup, each micro-grid (or site) maintains a local dataset and trains a local copy of the model—such as a transformer-based forecaster or a neural network policy for SoC control—using its own data. Periodically, local models send only their parameter updates (e.g., weights or gradients) to a coordinating server or to peers in a decentralized scheme, where the updates are aggregated (e.g., via Federated Averaging) and a new global model is formed. The aggregated model is then redistributed to all participants, who continue local training. This allows the model to benefit from the diversity of operating conditions across micro-grids while preserving privacy, as raw time-series data never leave the local sites [40,41].
For HPAC, FL is particularly relevant for the predictive (forecasting) layer. Transformer- or LSTM-based predictors of net load, renewable generation, or price signals can be trained federatively across multiple mini-grids, improving accuracy and generalization beyond what any single micro-grid could achieve with its limited dataset. This is aligned with the growing body of work on FL for load and renewable forecasting, which reports comparable accuracy to centralized training while respecting data-ownership constraints [40]. In the longer term, similar ideas could be extended to DRL policies themselves, with micro-grids collaboratively training SoC-balancing controllers via federated multi-agent RL.
In summary, the HPAC framework builds on the background of droop-based and optimization-based control, DRL for micro-grid energy management, advanced neural forecasting, and federated learning. By combining these elements into a hierarchical predictive-adaptive architecture with SoC balancing as a primary objective, HPAC aims to deliver robust, scalable, and privacy-preserving control for multi-BESS mini-grids under high renewable penetration.

3. Problem Definition and Architectural Overview

3.1. Problem Definition

A micro-grid is a localized cluster of electricity generation, storage, and loads that can operate connected to the main grid or in islanded mode. In the context of this work, we consider a renewable-rich mini-grid with multiple Battery Energy Storage Systems (BESS) operating in parallel to smooth renewable fluctuations and supply local loads. A key operational challenge is state-of-charge balancing: ensuring that all batteries in the fleet maintain comparable SoC levels over time. Balanced SoCs prevent over-cycling of individual batteries, avoid premature degradation of weaker units, and maximize the effective capacity and reliability of the fleet [12,27].
Traditional SoC-balancing strategies in micro-grids have relied on droop control, where each BESS unit adjusts its charge/discharge current according to a voltage-current droop characteristic modified by SoC-dependent terms. High-SoC units are commanded to provide more power (or absorb less power), while low-SoC units are relieved, leading to gradual equalization of SoC [11,15,16]. Adaptive droop schemes and consensus-based controllers further accelerate balancing and compensate for line and converter heterogeneities [13,25,26]. However, these approaches remain essentially reactive and local: they do not explicitly anticipate future load and generation patterns or account for long-term degradation and economic objectives.
In this work, SoC balancing is framed as a predictive, multi-objective control problem. Given forecasts of local load, renewable production, and possibly electricity prices, the controller seeks to schedule BESS charge/discharge actions over a horizon so that: (i) individual SoCs converge to a desired band; (ii) technical constraints such as power balance and voltage limits are respected; and (iii) economic and lifetime-related metrics are optimized. This naturally suggests a hierarchical architecture, where slower supervisory layers focus on SoC trajectories and economic dispatch, while faster layers enforce power-sharing and voltage/frequency stability.
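To make the control variables concrete, the following minimal sketch implements a discrete-time SoC update for one BESS unit and the fleet SoC variance that the balancing objective penalizes. The round-trip efficiency split, the sign convention (positive power = discharge), and all parameter values are illustrative assumptions, not the paper's digital-twin model.

```python
# Minimal discrete-time SoC model for a fleet of heterogeneous BESS units.
# Efficiency handling, sign convention, and values are illustrative only.

def step_soc(soc, power_kw, capacity_kwh, dt_h=1.0 / 60.0, efficiency=0.95):
    """Advance one unit's SoC by one control step.

    power_kw > 0 means discharging (SoC decreases),
    power_kw < 0 means charging (SoC increases).
    """
    if power_kw >= 0:  # discharging: losses drain the cell faster
        delta = power_kw * dt_h / (efficiency * capacity_kwh)
    else:              # charging: losses reduce the energy actually stored
        delta = power_kw * dt_h * efficiency / capacity_kwh
    return min(max(soc - delta, 0.0), 1.0)  # clip to physical SoC limits

def soc_variance(socs):
    """Population variance of fleet SoCs -- the balancing metric."""
    mean = sum(socs) / len(socs)
    return sum((s - mean) ** 2 for s in socs) / len(socs)
```

A hierarchical controller then plans setpoints so that `soc_variance` stays small over the horizon while economic and lifetime terms are co-optimized.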

3.2. Architectural Overview

Synergizing Forecasting and Real-Time Control

The HPAC framework is organized into a two-layer hierarchical structure that separates the control problem by operational timescales. This decomposition mirrors the natural division of decision-making in complex energy systems into strategic, longer-term planning and tactical, real-time execution.
A monolithic control model attempting to optimize across all timescales simultaneously would face an intractable state-action space and immense computational demands. By decoupling these functions, the HPAC framework allows specialized models to be applied to the tasks for which they are best suited, following the broader trend toward multi-timescale micro-grid operation and DRL-based coordination [6,28], specifically leveraging Federated Learning (FL) for privacy-preserving forecasting and entropy-regularized Reinforcement Learning (RL) for robust control.
  • Upper layer (Federated Predictive Engine): operates on a slower timescale, typically with an update frequency of 15 to 30 min. Its sole purpose is to look into the future, analyzing historical data and exogenous variables to generate probabilistic forecasts of the system’s net load (load minus renewable generation) for multiple horizons (e.g., 1 h, 4 h, 12 h). Crucially, this layer is trained using Federated Learning, allowing multiple independent mini-grids to collaboratively train a high-capacity transformer model without sharing raw operational data.
  • Lower layer (Adaptive Controller): operates on a much faster timescale, with decisions every 1 to 5 min. This layer is responsible for determining precise power setpoints for each BESS unit. It takes as input the current state of the system and, crucially, the contextual information provided by the predictive engine. It optimizes a multi-objective function that explicitly prioritizes voltage stability alongside SoC balancing and economic cost.
The key innovation of the HPAC architecture (as shown in Figure 1) is the sophisticated information flow between these two layers. Forecasts from the predictive engine actively inform and guide the decision-making process of the Adaptive Controller in two ways.
First, probabilistic forecasts (e.g., mean and variance of future net load) are incorporated directly into the state vector observed by the controller. This provides the agent with a forward-looking view of the environment, enabling policies that anticipate future events rather than merely reacting to them.
Second, the forecasts are used to dynamically modulate the controller’s objective function through a mechanism termed Dynamic Reward Shaping. By monitoring the forecast uncertainty, the system automatically adjusts the relative weights of the reward components. For example, if the predictive engine forecasts high volatility or a significant solar surplus in the next two hours, the reward for discharging batteries now (to create storage headroom) can be increased. Conversely, if forecast uncertainty is high, the weights for voltage stability ($w_{\mathrm{volt}}$) and SoC conservation ($w_{\mathrm{soc}}$) are amplified relative to economic profit ($w_{\mathrm{econ}}$). This mechanism enables the system to proactively position storage assets to maximize economic value while maintaining stability.
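The dynamic reward-shaping idea can be sketched as a simple rule that remaps the reward weights from the latest forecast mean and variance. The thresholds, scale factors, and function name below are hypothetical illustrations; the paper's actual modulation law is part of the HPAC design and may differ.

```python
# Illustrative dynamic reward-shaping rule: reweight the multi-objective
# reward from the forecast mean/variance. All thresholds and scale factors
# are hypothetical, not the paper's tuned values.

def shape_weights(base, forecast_mean, forecast_var,
                  surplus_threshold=-5.0, uncertainty_threshold=4.0):
    """Return adjusted reward weights given the latest net-load forecast.

    base: dict with keys 'soc', 'econ', 'volt'.
    forecast_mean < surplus_threshold signals an expected solar surplus
    (strongly negative net load), so discharge-now incentives grow.
    High forecast_var shifts weight toward conservative SoC/voltage terms.
    """
    w = dict(base)
    if forecast_mean < surplus_threshold:
        w['econ'] *= 1.5            # reward creating storage headroom now
    if forecast_var > uncertainty_threshold:
        w['soc'] *= 1.5             # amplify conservative objectives
        w['volt'] *= 1.5
        w['econ'] *= 0.75           # de-emphasize arbitrage under uncertainty
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}  # renormalize to sum to 1
```

The renormalization keeps the overall reward scale stable while only the relative priorities shift with the forecast.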
This hierarchical structure creates a powerful synergy: the predictive engine handles the computationally intensive task of long-range forecasting on a slow timescale, while the adaptive controller, once trained, provides fast, low-latency inference for real-time control. The architecture achieves the farsightedness of MPC without prohibitive real-time computation and the adaptability of DRL while mitigating the sample inefficiency of a purely reactive agent.

4. Methodology

In this section, we formalize the proposed Hierarchical Predictive–Adaptive Control (HPAC) framework and detail the methodological choices underlying its design. We first describe the overall two-layer architecture, highlighting how long-horizon forecasting and fast-timescale decision-making are decoupled yet tightly coupled through structured information exchange between a predictive engine and an adaptive controller. We then specify the design of the predictive engine as a federated transformer-based multi-horizon forecasting module, formulate the SoC balancing problem as a Markov decision process, and realize the adaptive controller via a Soft Actor-Critic agent operating in continuous action spaces. Particular emphasis is placed on the definition of state, action, and reward signals, as well as on the data flow and interaction protocol between the two layers, which together enable proactive, degradation-aware, and economically efficient SoC management. This methodological foundation provides the basis for the simulation setup, validation protocol, and comparative performance analysis presented in Section 5.

4.1. The HPAC Framework

4.1.1. The Multi-Horizon Predictive Engine (Federated Transformer Network)

The predictive engine provides a high-quality probabilistic understanding of future net load, which is the primary source of uncertainty in a renewable-rich mini-grid. To achieve this, the framework employs a state-of-the-art transformer-based deep learning model, selected for its ability to capture long-range dependencies in time-series data and handle multivariate inputs. Recent work on transformer-based hybrid forecasting for renewable energy and probabilistic time series demonstrates their suitability for such tasks [31,32,33,34,35].
The novelty of this component lies in its implementation within a federated learning (FL) framework. In a scenario with multiple independent mini-grids (e.g., different communities or commercial sites), FL enables collaborative training of a global forecasting model without sharing private operational data. This simultaneously addresses data scarcity (each mini-grid may not have enough local data) and data privacy, yielding a more general and robust forecasting model than any site could train in isolation. This design aligns with recent surveys and overviews of federated learning and its emerging applications in smart grids [36,37].

4.1.2. The Real-Time Adaptive Controller (Soft Actor-Critic Agent)

The adaptive controller is the core decision-making unit of HPAC. It translates system state and predictive context into concrete, real-time actions—specifically, charge/discharge power setpoints for each BESS unit. This layer is implemented using Soft Actor-Critic (SAC), a state-of-the-art DRL algorithm for continuous action spaces that offers high sample efficiency and robustness. SAC has been successfully applied to multi-timescale micro-grid operation and real-time micro-grid energy management [19,28], and its algorithmic foundations are rigorously presented in the original Soft Actor-Critic papers and related peer-reviewed work on maximum-entropy RL, complemented by open educational resources such as Spinning Up [43,44,45].
SAC is particularly suitable for this problem because of the following:
  • It is off-policy and can learn efficiently from replayed experiences.
  • It handles continuous action spaces naturally, which aligns with BESS power setpoints.
  • Its maximum-entropy objective encourages exploration and yields robust policies less brittle to noise and disturbances.
The SAC agent optimizes a carefully engineered multi-objective reward function that balances SoC, economic efficiency, and long-term BESS health preservation.

4.1.3. Data Flow and Interaction Protocol

The interaction between the two HPAC layers follows a precise protocol.
At each slow-timescale interval $T_{\mathrm{predict}}$ (e.g., 15 min):
  • The predictive engine ingests the latest historical data (e.g., last 24 h of load, RES generation, and weather).
  • It runs a forward pass through the federated transformer model to generate a probabilistic forecast of the net load $P_{\mathrm{net}}$ over a horizon $H$, e.g., a sequence $\{(\mu_{\mathrm{net}}(t+1), \sigma^2_{\mathrm{net}}(t+1)), \ldots, (\mu_{\mathrm{net}}(t+H), \sigma^2_{\mathrm{net}}(t+H))\}$.
  • This forecast sequence is broadcast to the adaptive controller.
At each fast-timescale interval $T_{\mathrm{control}}$ (e.g., 1 min), the adaptive controller executes the following:
  • Construct a state vector $s_t$ including the current SoC and SoH of each BESS, current net load, grid exchange power, electricity price, temporal features, and the latest forecast sequence.
  • Compute dynamic weights for the reward function based on the forecast (e.g., a charging weight may depend on expected future energy deficits).
  • Feed $s_t$ into the SAC policy network (actor) to obtain an action vector $a_t = [P_{\mathrm{bess},1}, \ldots, P_{\mathrm{bess},N}]$ containing normalized power setpoints.
  • Dispatch the setpoints to the BESS power converters.
  • Observe the next state $s_{t+1}$ and reward $r_t$, and store the transition $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer for ongoing training.
This protocol ensures that the real-time controller always operates with the best available forward-looking information, making decisions that are both locally effective and strategically aligned with future conditions.
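The interleaving of the two timescales can be skeletonized as follows. The forecaster and policy are stand-in callables, and the 15-step slow interval and all names are illustrative; environment stepping and replay-buffer updates are elided.

```python
# Skeleton of the two-timescale interaction protocol. Stand-in callables
# replace the federated forecaster and the trained SAC policy.

T_PREDICT_STEPS = 15  # slow layer fires every 15 fast steps (e.g., 15 min)

def run_protocol(env_state, forecaster, policy, n_steps=30):
    """Interleave slow-timescale forecasting with fast-timescale control."""
    forecast = None
    trajectory = []
    for t in range(n_steps):
        if t % T_PREDICT_STEPS == 0:
            forecast = forecaster(env_state)          # (mu_seq, var_seq)
        # forecast-augmented state: measurements + forecast means/variances
        s_t = list(env_state) + list(forecast[0]) + list(forecast[1])
        a_t = policy(s_t)                             # BESS power setpoints
        trajectory.append((t, a_t))
        # a real loop would now step the environment, observe r_t and
        # s_{t+1}, and push the transition into the SAC replay buffer
    return trajectory

# Example with trivial stand-ins: two BESS units, constant forecast/policy.
traj = run_protocol([0.5, 0.6],
                    lambda s: ([1.0, 1.2], [0.1, 0.2]),
                    lambda s: [0.0] * 2,
                    n_steps=3)
```

The key point mirrored from the protocol is that the expensive forecast is computed only once per slow interval and reused, while the policy runs at every fast step.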

4.2. The Predictive Engine: Multi-Horizon Forecasting with Federated Transformer Networks

4.2.1. Rationale for Transformer Architecture: Beyond Recurrent Models

Selecting an appropriate forecasting model is critical for the Predictive Engine. While recurrent neural networks (RNNs) and long short-term memory (LSTM) networks have been popular for time-series forecasting, the transformer architecture offers several advantages. Comparative analyses emphasize its superior ability to handle long-range dependencies and to exploit parallel computation compared with classical RNN-based models [31,32].
RNNs and LSTMs process sequences sequentially, introducing the following two challenges:
  • A computational bottleneck, as each time step depends on the previous one, limiting parallelization.
  • Difficulty in capturing very long-term dependencies, despite mechanisms such as gating in LSTMs.
Transformers dispense with recurrence and rely on self-attention. Self-attention allows the model to weigh the importance of all other time steps when predicting a particular output, computing attention scores between every pair of positions in the sequence. This:
  • Enables highly parallel computation on GPUs and TPUs, accelerating training.
  • Provides a global receptive field that captures complex long-range dependencies, such as the relationship between morning irradiance and evening load peaks.
These properties are crucial for multi-horizon energy forecasts that exhibit daily, weekly, and seasonal patterns influenced by weather and human behavior, as demonstrated in recent transformer-based renewable and solar-irradiance forecasting studies [31,32,34,35].
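As a minimal illustration of the self-attention mechanism described above, the framework-free sketch below computes scaled dot-product attention over a short sequence, using the inputs themselves as queries, keys, and values. A real transformer would apply learned Q/K/V projections and multiple heads; this only shows the global, pairwise weighting.

```python
# Scaled dot-product self-attention over a short sequence, with plain
# Python lists for self-containment (no learned projections, single head).
import math

def self_attention(x):
    """x: list of d-dimensional vectors, one per time step. Q = K = V = x."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # attention scores of step i against every step j (global view)
        scores = [sum(x[i][k] * x[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(n)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # output i is a convex combination of all value vectors
        out.append([sum(weights[j] * x[j][k] for j in range(n))
                    for k in range(d)])
    return out
```

Because each output attends to every position, dependencies between distant steps (e.g., morning irradiance and evening load) are reachable in a single layer rather than through many recurrent steps.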

4.2.2. Model Architecture for Multivariate Energy Forecasting

The transformer model for the Predictive Engine is tailored for multivariate probabilistic time-series forecasting. Its main components are as follows:
Inputs. The model accepts multivariate time-series inputs, including the following:
  • Historical net load and possibly disaggregated load and generation.
  • Exogenous variables such as global horizontal irradiance (GHI), direct normal irradiance (DNI), ambient temperature, wind speed, and humidity.
  • Calendar features (hour of day, day of week, month, holiday flags), typically encoded cyclically using sine/cosine transforms.
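The cyclic sine/cosine transform mentioned for calendar features can be sketched as follows; it maps a periodic feature onto the unit circle so that, for example, hour 23 and hour 0 end up close in feature space rather than at opposite ends of a linear scale.

```python
# Cyclic sine/cosine encoding of periodic calendar features.
import math

def cyclic_encode(value, period):
    """Map a periodic feature (e.g., hour of day) to the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

hour_feats = cyclic_encode(23, 24)  # near cyclic_encode(0, 24), as desired
```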
Embedding and positional encoding. Raw inputs are projected into a higher-dimensional embedding space. Since transformers lack an inherent notion of order, positional encodings are added to embeddings to encode time-step positions.
Encoder-decoder structure. A standard encoder-decoder architecture is adopted:
  • The encoder is a stack of layers with multi-head self-attention and position-wise feed-forward networks, producing contextualized representations of the historical sequence.
  • The decoder uses masked self-attention (to preserve autoregressive structure) and encoder-decoder attention to generate forecasts step-by-step.
Multi-head self-attention. Multiple attention heads allow the model to focus on different patterns (e.g., daily seasonality and short-term weather correlations) in parallel. Outputs of all heads are concatenated and linearly transformed.
Output layer and probabilistic forecasts. The final decoder layer projects its output to the parameters of a probability distribution for each future time step, typically the mean $\mu$ and variance $\sigma^2$ of a Gaussian. This yields both point forecasts and uncertainty estimates, which are critical for risk-aware control.
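A probabilistic output layer of this kind is typically trained with a Gaussian negative log-likelihood loss, so that the network is rewarded for calibrated variances as well as accurate means. The framework-free sketch below shows the per-step and horizon-averaged loss; the variance floor `eps` is an illustrative safeguard, not a value from the paper.

```python
# Gaussian negative log-likelihood for a probabilistic forecast head that
# emits (mu, sigma^2) per future step. Framework-free illustrative sketch.
import math

def gaussian_nll(mu, var, y, eps=1e-6):
    """Per-step negative log-likelihood of observation y under N(mu, var)."""
    var = max(var, eps)  # guard against variance collapse
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mu) ** 2 / var)

def sequence_nll(mus, vars_, ys):
    """Mean NLL over a forecast horizon (the training objective)."""
    return sum(gaussian_nll(m, v, y)
               for m, v, y in zip(mus, vars_, ys)) / len(ys)
```

Minimizing this loss penalizes both biased means and overconfident (too-small) variances, which is what makes the resulting uncertainty estimates usable for risk-aware reward shaping downstream.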

4.2.3. Federated Learning Implementation

A key innovation of the Predictive Engine is its deployment within a federated learning framework, which reconciles the need for large datasets with data privacy constraints.
Federated learning proceeds as follows:
  • Initialization: A central server initializes a global transformer model with random weights and distributes it to participating mini-grids.
  • Local training: Each mini-grid trains its local copy on its own data for a few epochs, computing weight updates.
  • Secure aggregation: Instead of sending raw data, each mini-grid uploads only its weight updates (optionally protected by secure aggregation or differential privacy techniques).
  • Global update: The server aggregates updates (e.g., via weighted averaging) to produce an improved global model, which implicitly learns from all participating sites.
  • Synchronization: The updated global model is redistributed, and the process repeats until convergence.
This procedure yields the following:
  • A more accurate and robust forecasting model than any individual site could train alone.
  • Strong privacy guarantees, since raw time-series data remain on local premises.
The resulting global model can then be deployed locally at each mini-grid as the predictive engine within the HPAC framework. This aligns with recent overviews of federated learning applications and remaining research challenges [36,37], as well as with early explorations of FL in smart grid environments.
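The aggregation step described above is commonly realized as Federated Averaging (FedAvg). In the sketch below, flat lists of floats stand in for full parameter tensors, and sites are weighted by their local dataset sizes; this is a minimal illustration of the weighted-averaging idea, not the paper's exact aggregation rule.

```python
# Federated Averaging (FedAvg) aggregation step: the server combines local
# weight vectors, weighting each site by its number of training samples.
# Flat lists of floats stand in for full model parameter tensors.

def fedavg(local_weights, n_samples):
    """Weighted average of per-site model weights.

    local_weights: list of weight vectors (one per mini-grid).
    n_samples:     list of local dataset sizes, used as weights.
    """
    total = sum(n_samples)
    dim = len(local_weights[0])
    return [sum(w[k] * n for w, n in zip(local_weights, n_samples)) / total
            for k in range(dim)]

# Two sites: the larger dataset pulls the global model toward its weights.
global_w = fedavg([[1.0, 0.0], [3.0, 2.0]], [100, 300])
```

Only these aggregated weights travel between sites and server; the raw net-load time series used to compute each site's update never leave local premises.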

4.3. The Adaptive Controller: Real-Time SoC Management via Soft Actor-Critic

4.3.1. Formulating SoC Balancing as a Markov Decision Process

To apply DRL, the SoC balancing problem is formulated as a Markov decision process (MDP), a mathematical framework for sequential decision-making under uncertainty. The MDP abstraction has become standard in DRL-based micro-grid control and storage management [6,9,19,28]. The key MDP components are stated in Table 3.
Explicitly, typical reward components can be expressed as follows:
$$R_{\mathrm{soc}}(t) = -\operatorname{Var}\big(\mathrm{SoC}_1(t), \ldots, \mathrm{SoC}_N(t)\big),$$
$$R_{\mathrm{econ}}(t) = -P_{\mathrm{grid}}(t)\, C(t),$$
$$R_{\mathrm{volt}}(t) = -\big(V_{\mathrm{bus}}(t) - V_{\mathrm{nom}}\big)^2,$$
with $R_{\mathrm{soh}}(t)$ derived from degradation models as discussed later. The weights $w_{\mathrm{soc}}$, $w_{\mathrm{econ}}$, $w_{\mathrm{soh}}$, and $w_{\mathrm{volt}}$ control the trade-offs between objectives.

4.3.2. The Soft Actor-Critic Algorithm

Soft Actor-Critic (SAC) is a state-of-the-art, off-policy, model-free DRL algorithm based on the actor-critic paradigm and the maximum-entropy reinforcement learning framework [44,45,46].
Actor-critic architecture. SAC maintains:
  • An actor (policy) network $\pi_\phi(a_t \mid s_t)$ that outputs the parameters of a probability distribution (typically a squashed Gaussian) over continuous actions given the state.
  • Two critic networks $Q_{\theta_1}(s_t, a_t)$ and $Q_{\theta_2}(s_t, a_t)$ that estimate expected cumulative reward (Q-values), used for stable learning and to mitigate overestimation bias.
Maximum-entropy objective. Unlike standard RL, SAC maximizes both expected return and policy entropy as follows:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big],$$
where $\alpha$ is a temperature parameter and $\mathcal{H}$ is the entropy. This encourages the following:
  • Enhanced exploration: Stochastic policies explore more actions, reducing risk of local optima.
  • Increased robustness: Learned policies perform well across a wider range of states and perturbations, desirable in noisy physical environments.
Off-policy learning and replay buffer. SAC stores transitions $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer. During training, mini-batches are sampled to update the actor and critics, enabling efficient reuse of past experiences and decoupling data collection from learning. This structure has proven effective in micro-grid and energy storage applications [19,28].
Clipped double-Q learning and target networks. SAC uses the following:
  • Two critics and the minimum of their outputs to compute targets, reducing overestimation bias.
  • Slowly updated target networks to stabilize learning.
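The clipped double-Q Bellman target used by SAC can be written compactly as follows. Scalar stand-ins replace the target-critic and policy-network outputs, and the temperature and discount values are illustrative defaults, not the paper's tuned hyperparameters.

```python
# Clipped double-Q soft Bellman target as used in SAC: take the minimum of
# the two target critics and subtract the entropy term. Scalars stand in
# for network outputs; alpha and gamma are illustrative defaults.

def sac_target(r, q1_next, q2_next, log_pi_next,
               alpha=0.2, gamma=0.99, done=False):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    if done:
        return r  # terminal transition: no bootstrapped future value
    soft_value = min(q1_next, q2_next) - alpha * log_pi_next
    return r + gamma * soft_value
```

Taking the minimum of the two critics counteracts Q-value overestimation, while the `-alpha * log_pi` term is what folds the entropy bonus of the maximum-entropy objective into the value estimate.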
Algorithm 1 summarizes the overall HPAC control loop with the SAC-based Adaptive Controller, explicitly showing how the Predictive Engine and Adaptive Controller interact.
SAC is particularly well matched to the HPAC setting because the continuous charge/discharge power setpoints for multiple BESS units naturally form a continuous action space, and the maximum-entropy formulation yields policies that are robust to measurement noise, parameter uncertainty, and modeling errors in the digital twin.
Algorithm 1 HPAC control loop with SAC-based Adaptive Controller
 1: Initialize SAC actor and critic networks with parameters ϕ, θ1, θ2
 2: Initialize target critic parameters θ̄1 ← θ1, θ̄2 ← θ2
 3: Initialize replay buffer D
 4: for each training episode do
 5:     Reset digital-twin environment and obtain initial state s_0
 6:     for each slow-timescale step do
 7:         Obtain multi-horizon forecast from Predictive Engine
 8:         for each fast-timescale control step do
 9:             Form augmented state s_t (measurements + forecast features)
10:             Sample action a_t ∼ π_ϕ(· | s_t)
11:             Apply a_t to environment and observe r_t, s_{t+1}
12:             Store (s_t, a_t, r_t, s_{t+1}) in replay buffer D
13:             Sample mini-batch from D
14:             Update critics Q_θ1, Q_θ2 using Bellman targets and target critics
15:             Update actor π_ϕ to maximize the entropy-regularized objective
16:             Soft-update target critic parameters θ̄1, θ̄2
17:         end for
18:     end for
19: end for
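The two-timescale nesting of Algorithm 1 can be sketched as a plain Python loop. This is an illustrative skeleton only: `env`, `agent`, and `predictor` are hypothetical stand-ins for the digital twin, the SAC agent, and the Predictive Engine, and state augmentation is shown as simple list concatenation.

```python
def hpac_training_loop(env, agent, predictor, episodes=10,
                       slow_steps=8, fast_steps=15):
    """Two-timescale HPAC loop: one forecast per slow step (e.g., 15 min),
    many fast SAC control steps (e.g., 1 min) reusing that forecast."""
    for _ in range(episodes):
        s = env.reset()
        for _ in range(slow_steps):
            forecast = predictor(s)          # multi-horizon net-load forecast
            for _ in range(fast_steps):
                s_aug = s + forecast         # augmented state (concatenation)
                a = agent.act(s_aug)
                s_next, r = env.step(a)      # apply action, observe reward
                agent.store(s_aug, a, r)     # append to replay buffer
                agent.update()               # critic/actor/target updates
                s = s_next
```

The key structural point is that the forecast is refreshed only in the outer (slow) loop, while learning updates run at every fast control step.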

4.3.3. Reward Function Engineering: A Multi-Objective Approach

The reward function is the key interface between high-level control objectives and the DRL agent’s learning process. For HPAC, it is explicitly multi-objective.
SoC balancing reward R soc . The primary objective is to maintain SoC balance as follows:
$$ R_{\mathrm{soc}}(t) = -\, w_{\mathrm{soc}} \cdot \mathrm{Var}\big( \mathrm{SoC}_1(t), \ldots, \mathrm{SoC}_N(t) \big). $$
This penalizes large differences among BESS SoCs, directly incentivizing equalization, similarly to SoC-variance-based cost functions used in adaptive droop methods [13,16].
Economic reward R econ . For grid-connected mini-grids, economic performance is captured as follows:
$$ R_{\mathrm{econ}}(t) = -\, w_{\mathrm{econ}} \, P_{\mathrm{grid}}(t) \, C(t), $$
where P grid ( t ) is positive for import and negative for export, and C ( t ) is the time-varying electricity price. This term encourages strategies such as price arbitrage.
SoH preservation reward R soh . To preserve BESS lifetime, actions that accelerate degradation are penalized. Using semi-empirical degradation models inspired by tools such as NREL’s BLAST v1.1.0 (https://github.com/NREL/BLAST-Lite, accessed on 20 December 2025) suite [24,47,48], one can write the following:
$$ R_{\mathrm{soh}}(t) = -\, w_{\mathrm{soh}} \sum_{i=1}^{N} \big[ \lambda_1 \, C_{\mathrm{rate},i}(t) + \lambda_2 \, f_{\mathrm{soc}}\big( \mathrm{SoC}_i(t) \big) \big], $$
where C rate , i is the C-rate of unit i and f soc is a function penalizing operation in extreme SoC ranges (e.g., below 10% or above 90%).
Voltage stability reward R volt . Voltage stability is enforced via the following:
$$ R_{\mathrm{volt}}(t) = -\, w_{\mathrm{volt}} \big( V_{\mathrm{bus}}(t) - V_{\mathrm{nom}} \big)^2. $$
The final reward is the weighted sum
$$ R(t) = R_{\mathrm{soc}}(t) + R_{\mathrm{econ}}(t) + R_{\mathrm{soh}}(t) + R_{\mathrm{volt}}(t), $$
with weights tuned to reflect the desired trade-offs. A higher w soh , for example, yields more conservative policies that prioritize battery longevity over short-term profit.
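The four reward components above can be combined in a few lines. The sketch below (NumPy) follows the paper's sign convention that every component is a penalty; the particular form of the extreme-SoC penalty `f_soc` (distance outside a nominal band) and all default weight values are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def hpac_reward(soc, p_grid, price, c_rate, v_bus, v_nom=1.0,
                w_soc=0.5, w_econ=1.0, w_soh=0.1, w_volt=0.5,
                lam1=1.0, lam2=1.0, soc_lo=0.10, soc_hi=0.90):
    """Weighted sum of the four HPAC reward components (all penalties)."""
    soc = np.asarray(soc, dtype=float)
    r_soc = -w_soc * np.var(soc)                       # SoC-balancing term
    r_econ = -w_econ * p_grid * price                  # import cost / export revenue
    # illustrative f_soc: distance outside the nominal [soc_lo, soc_hi] band
    f_soc = np.maximum(soc_lo - soc, 0.0) + np.maximum(soc - soc_hi, 0.0)
    r_soh = -w_soh * np.sum(lam1 * np.abs(c_rate) + lam2 * f_soc)
    r_volt = -w_volt * (v_bus - v_nom) ** 2
    return r_soc + r_econ + r_soh + r_volt
```

A perfectly balanced, idle fleet at nominal voltage incurs zero penalty, while SoC imbalance or grid import pushes the reward negative, which is the gradient signal the SAC agent learns from.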
Reward Weight Selection
The performance of HPAC is sensitive to the relative scaling of the reward weights { w soc , w econ , w soh , w volt } . To quantify this effect, we performed a grid-search sensitivity analysis over w soc and w econ while keeping w soh and w volt fixed at their nominal values. For each pair ( w soc , w econ ) , we executed a 24 h simulation of the base mini-grid scenario and recorded both the total operating cost and the SoC variance across the BESS fleet. Figure 2 depicts the resulting cost surface: the horizontal axes correspond to the economic weight w econ and the SoC-balancing weight w soc , while the vertical axis and color map report the total operating cost. The annotated region highlights the minimum-cost area, located around w soc = 0.5 and w econ = 1.0 , where the controller achieves a favorable trade-off between cost and SoC balancing.
Several trends can be extracted from this surface. First, the cost varies only within a narrow band (approximately USD 0.2) over the explored grid, indicating that HPAC remains economically robust to moderate mistuning of the reward weights. Nevertheless, a well-defined valley is visible around the marked minimum-cost region: combinations with w soc 0.5 and w econ 1.0 yield the lowest operating cost while still preserving satisfactory SoC balancing. Moving away from this region in either direction increases the cost. When w econ is raised above about 1.5 for a fixed w soc , the controller overemphasizes short-term price arbitrage; this leads to more aggressive charge/discharge cycles that slightly increase losses and curtailment, thereby lifting the total cost. Conversely, when w soc is made too large (beyond 1.5 ), the controller attempts to strictly equalize the SoCs at the expense of exploiting economic opportunities, which again manifests as a gentle cost ridge across the surface. The SoC-variance statistics (not shown in the figure for clarity) complement this picture and confirm that SoC balancing degrades for very small w soc and improves as the controller moves towards the valley region.
The following patterns were observed:
  • Existence of a cost valley: The surface in Figure 2 exhibits a clear valley around w soc 0.5 and w econ 1.0 , where the total operating cost attains its minimum while SoC variance remains low. This region represents a good compromise between economic performance and SoC balancing.
  • Over-emphasis on economics: For fixed w soc and increasing w econ (beyond roughly 1.5 ), the agent becomes increasingly aggressive in exploiting price signals. Although this improves short-term arbitrage, it leads to more frequent cycling, slightly higher losses, and a gradual rise in total cost.
  • Over-emphasis on SoC balancing: When w soc is made very large (above about 1.5 ), the controller prioritizes SoC equalization over economic gains. This yields tightly clustered SoCs but reduces the exploitation of profitable charge/discharge windows, again increasing the total operating cost relative to the optimum.
  • Insufficient SoC weight: For w soc < 0.5 , SoC balancing degrades noticeably. The reward becomes dominated by R econ , leading to persistent SoC divergence across BESS units and making the system more vulnerable to local SoC saturation in subsequent operating days.
  • Role of w volt and w soh : The voltage weight w volt acts as a soft network constraint: values significantly larger than 2.0 restrict arbitrage opportunities without appreciable additional gains in voltage quality. The health weight w soh primarily governs the aggressiveness of cycling; increasing it reduces energy throughput and slightly increases cost but promotes longer battery lifetime.
Based on these observations, a practical tuning guideline is as follows. Each reward component should first be normalized so that its magnitude is of order one along nominal trajectories. Then, w soc can be initialized near 0.5 and w econ near 1.0 , which places the controller inside the empirically identified minimum-cost region. The economic weight w econ may then be adjusted according to the volatility of the electricity price (higher for highly volatile markets, lower for flatter tariffs), while w soh should be increased in applications where battery lifetime is prioritized over short-term operating cost. Finally, w volt should be set to the smallest value that still satisfies voltage-quality requirements. This sensitivity study therefore offers a principled starting point for selecting reward weights in new HPAC deployments, which can subsequently be fine-tuned via short simulation campaigns to match operator-specific KPIs.
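The grid-search procedure behind this sensitivity study is simple to reproduce. In the sketch below, `run_episode` is a hypothetical callable standing in for the paper's 24 h MATLAB simulation; it takes a weight pair and returns (total cost, SoC variance). Everything else is plain bookkeeping.

```python
import itertools

def grid_search_weights(run_episode, w_soc_grid, w_econ_grid):
    """Evaluate each (w_soc, w_econ) pair and return the minimum-cost setting.

    `run_episode` simulates one evaluation horizon with the given weights
    and returns (cost, soc_variance); other weights stay at nominal values.
    """
    results = {}
    for w_soc, w_econ in itertools.product(w_soc_grid, w_econ_grid):
        cost, soc_var = run_episode(w_soc=w_soc, w_econ=w_econ)
        results[(w_soc, w_econ)] = (cost, soc_var)
    # select the weight pair that minimizes total operating cost
    best = min(results, key=lambda k: results[k][0])
    return best, results
```

Applied to a cost surface with a valley near (0.5, 1.0), as reported in Figure 2, this search recovers that minimum-cost region directly.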

4.3.4. Constraint Handling and Safety Considerations

In the proposed HPAC framework, physical and operational limits are enforced through a combination of environment-level hard constraints and soft penalties in the reward. This distinction is important for safety and for interpreting the reported performance metrics.
  • SoC constraints: At the environment level, SoC is updated using the discrete-time integration model in Equation (11) and then clipped to the admissible interval [ SoC min , SoC max ] for each unit. Actions that would drive a battery outside this interval are effectively saturated, so simulated SoC states never leave the safe range. In parallel, the reward includes SoC variance and SoH-related terms that penalize operation near the boundaries, discouraging the agent from persistently operating close to SoC min or SoC max .
  • Power and line-flow constraints: BESS power setpoints are first generated by the SAC policy in normalized form and then scaled and clipped to the nameplate power limits P bess , i min P bess , i ( t ) P bess , i max . The digital twin enforces network power-flow equations and thermal limits on lines; if a proposed dispatch would violate a line rating, local curtailment of PV or a reduction in BESS power is applied to keep flows within bounds. These hard constraints ensure that infeasible actions from the agent do not lead to physically invalid states.
  • Voltage constraints: The bus voltage is computed by the underlying power-flow model for each time step. The environment enforces a hard admissible band [ V min , V max ] through automatic curtailment and, if necessary, load shedding in extreme cases. In addition, the reward term R volt ( t ) penalizes squared deviations from the nominal value V nom , so the agent is incentivized to keep voltages well within the admissible range, not merely to avoid violations.
Thus, HPAC operates with a layered safety mechanism: the digital-twin environment guarantees that SoC, power, line-flow, and voltage states respect hard physical limits, while the reward function shapes the agent’s preferences within that safe operating envelope. This design is consistent with best practices in DRL for safety-critical energy systems, where low-level protection and high-level optimization are clearly separated.
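The environment-level saturation logic described above can be sketched compactly. The helper `safe_dispatch` below is an illustrative assumption (not the paper's MATLAB code): it maps a normalized SAC action to a power setpoint clipped to the nameplate rating, integrates SoC with the paper's sign convention (positive power = discharge), clips SoC to its admissible band, and back-computes the power actually delivered after saturation.

```python
import numpy as np

def safe_dispatch(a_norm, p_max, soc, soc_min, soc_max, cap_kwh, eta, dt_h):
    """Environment-level hard constraints for one BESS unit.

    a_norm in [-1, 1]; positive means discharge. The environment, not the
    agent, guarantees power and SoC limits via saturation.
    """
    # scale and clip the normalized action to the nameplate power limit
    p = np.clip(np.asarray(a_norm, float), -1.0, 1.0) * np.asarray(p_max, float)
    # discrete-time Coulomb-counting update, then clip SoC to the safe band
    soc_next = soc - p * dt_h / (np.asarray(cap_kwh, float) * eta)
    soc_clipped = np.clip(soc_next, soc_min, soc_max)
    # back-compute the power actually delivered after SoC saturation
    p_actual = (soc - soc_clipped) * np.asarray(cap_kwh, float) * eta / dt_h
    return p_actual, soc_clipped
```

For example, a full-discharge request from a unit sitting just above SoC_min is automatically reduced to the small power that exactly reaches the lower bound, so infeasible actions never produce physically invalid states.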
The digital-twin environment models the main mini-grid components as follows.
Photovoltaic Arrays
PV power output P PV ( t ) is modeled as a function of plane-of-array irradiance G POA ( t ) and cell temperature T cell ( t ) as follows:
$$ P_{\mathrm{PV}}(t) = \eta_{\mathrm{PV}} \, A_{\mathrm{PV}} \, G_{\mathrm{POA}}(t) \big[ 1 + \gamma \big( T_{\mathrm{cell}}(t) - T_{\mathrm{ref}} \big) \big], $$
where η PV is the module efficiency, A PV the array area, and γ a temperature coefficient. T cell can be estimated via standard NOCT models using ambient temperature and irradiance. This style of modeling is consistent with PV-based micro-grid modeling and control studies [21,49].
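This PV model and the standard NOCT cell-temperature estimate translate directly into code. The parameter values below (efficiency, area, temperature coefficient, NOCT) are illustrative assumptions, not the paper's case-study values; γ is negative, reflecting reduced output as the cell heats above the reference temperature.

```python
def pv_power(g_poa, t_cell, eta_pv=0.20, a_pv=1500.0, gamma=-0.004, t_ref=25.0):
    """PV output (W) from plane-of-array irradiance (W/m^2) and cell temp (C)."""
    return eta_pv * a_pv * g_poa * (1.0 + gamma * (t_cell - t_ref))

def cell_temp_noct(t_amb, g_poa, noct=45.0):
    """Standard NOCT estimate: NOCT is measured at 800 W/m^2 and 20 C ambient."""
    return t_amb + (noct - 20.0) / 800.0 * g_poa
```

At standard test conditions (1000 W/m², 25 °C cell temperature), the illustrative array produces 0.20 · 1500 · 1000 = 300 kW, and output falls as the cell heats up.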
Battery Energy Storage Systems
The SoC evolution is typically modeled via Coulomb counting as follows:
$$ \mathrm{SoC}(t) = \mathrm{SoC}(t - \Delta t) - \frac{ P_{\mathrm{bess}}(t) \, \Delta t }{ C_{\mathrm{rated}} \, \eta_{\mathrm{BESS}} }, $$
where P bess ( t ) is positive for discharge, C rated is the rated capacity, and η BESS is the conversion efficiency.
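The Coulomb-counting update is a one-liner; the sketch below follows the equation above term by term (positive power = discharge, energy normalized by rated capacity and efficiency).

```python
def soc_update(soc_prev, p_bess_kw, dt_h, cap_kwh, eta=0.95):
    """Discrete-time Coulomb-counting SoC update.

    p_bess_kw > 0 means discharge (SoC decreases); p_bess_kw < 0 means
    charge (SoC increases), per the sign convention in the text.
    """
    return soc_prev - (p_bess_kw * dt_h) / (cap_kwh * eta)
```

For instance, discharging a 100 kWh unit at 47.5 kW for one hour with η = 0.95 removes exactly half the normalized capacity, taking SoC from 0.5 to 0.0 (before any environment-level clipping).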
Degradation is modeled through calendar and cycle aging as follows:
$$ L(t) = L_{\mathrm{cal}}(t) + L_{\mathrm{cyc}}(t), $$
with SoH defined as SoH ( t ) = ( 1 − L ( t ) ) × 100 % . In the simulations, cycle aging is updated on an hourly timescale using a rainflow-based aggregation of intra-hour SoC excursions. Each identified half-cycle with depth-of-discharge DoD j and average cell temperature T j contributes
$$ \Delta L_{\mathrm{cyc},j} = k_{\mathrm{cyc}} \, \mathrm{DoD}_j^{\beta_{\mathrm{DoD}}} \exp\!\left( - \frac{E_a}{R \, ( T_j + 273.15 )} \right), $$
where k cyc , β DoD , and E a are semi-empirical parameters fitted to lithium-ion NMC cell data consistent with NREL’s BLAST toolkit and related studies [24,47]. Calendar aging is represented by
$$ \Delta L_{\mathrm{cal}} = k_{\mathrm{cal}} \, t^{\beta_{\mathrm{time}}} \exp\!\left( - \frac{E_{a,\mathrm{cal}}}{R \, ( T_{\mathrm{stor}} + 273.15 )} \right) g\big( \overline{\mathrm{SoC}} \big), $$
where t is elapsed time, T stor is storage temperature, and g ( SoC ¯ ) penalizes long-term storage at very high SoC. Over the few-day evaluation horizon used in this paper, the absolute SoH change is small, but the degradation model is scaled such that equivalent full cycles (EFC) and time spent in extreme SoC ranges correlate realistically with long-term capacity fade.
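The two aging increments above have the same Arrhenius structure and can be sketched as follows. All numeric parameters here (k_cyc, β_DoD, E_a, k_cal, β_time, E_a,cal) are illustrative placeholders, not the fitted NMC values used in the paper, and the SoC stress factor g(·) is one plausible choice.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)

def cycle_aging_step(dod, t_cell_c, k_cyc=1e-4, beta_dod=1.5, e_a=2.0e4):
    """Capacity-loss increment of one rainflow half-cycle (Arrhenius form):
    loss grows with depth-of-discharge and with cell temperature."""
    return k_cyc * dod ** beta_dod * math.exp(-e_a / (R_GAS * (t_cell_c + 273.15)))

def calendar_aging(elapsed_h, t_stor_c, soc_mean, k_cal=1e-5, beta_t=0.75,
                   e_a_cal=2.5e4):
    """Calendar-aging loss; g() penalizes long-term storage at high SoC."""
    g = 1.0 + max(soc_mean - 0.5, 0.0)  # illustrative SoC stress factor
    return (k_cal * elapsed_h ** beta_t
            * math.exp(-e_a_cal / (R_GAS * (t_stor_c + 273.15))) * g)
```

The qualitative behavior matches the model: deeper cycles, hotter cells, and resting at high SoC all accelerate capacity fade.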
The choice of a semi-empirical model, instead of a purely ideal Coulomb-counting representation, is motivated by recent work on nonideal battery behavior and its impact on control design. For example, fuzzy-logic approaches to power-system security with nonideal electric-vehicle batteries in vehicle-to-grid (V2G) systems explicitly highlight the importance of capturing internal losses and state-dependent degradation in control-oriented models [50]. Our formulation follows this line of work by embedding a simplified but explicitly nonideal degradation model into the HPAC reward and state dynamics.
Stochastic Load Profiles
Load profiles should reflect real-world variability. High-resolution data from sources such as NSRDB, Pecan Street, and UK-DALE can be used directly or to train generative models that synthesize realistic stochastic loads, enabling robust training and testing of the DRL agent [51,52,53,54,55,56,57,58].
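In the same spirit as the paper's synthetic profiles (a time-of-day baseline with superimposed stochastic perturbations), a minimal generator can be sketched as follows. The diurnal shape (Gaussian morning and evening peaks), base load, and noise level are illustrative assumptions, not fitted to NSRDB, Pecan Street, or UK-DALE data.

```python
import numpy as np

def synthetic_load(n_hours=72, base_kw=120.0, seed=0):
    """Synthetic stochastic load: diurnal baseline plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    hour = np.arange(n_hours) % 24
    # two Gaussian bumps approximate morning and evening demand peaks
    diurnal = (1.0 + 0.35 * np.exp(-((hour - 8) ** 2) / 8.0)
                   + 0.55 * np.exp(-((hour - 19) ** 2) / 10.0))
    noise = rng.normal(0.0, 0.05, size=n_hours)  # 5% relative perturbation
    return np.maximum(base_kw * diurnal * (1.0 + noise), 0.0)
```

Fixing the seed makes episodes reproducible, which is convenient when comparing many controllers on identical load trajectories, as the benchmarking study requires.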

4.4. Implementation Workflow

A structured workflow for HPAC implementation includes the following:
  • Data acquisition and preprocessing: collect time-series data (solar, load, weather), align timestamps, handle missing data, and normalize. Split into training, validation, and test sets, as in recent ML- and DL-based energy management studies [3,5,9].
  • Predictive Engine training: implement the transformer forecasting model in a deep learning framework; simulate federated training with local updates and centralized aggregation until convergence on validation data.
  • Simulation environment setup: configure a primary platform (e.g., pymgrid) with PV, BESS (including degradation), and load models; wrap it in a Gym-compliant interface [59,60,61].
  • DRL agent training: implement SAC (or use a well-tested library implementation), define actor/critic architectures, and train the agent within the Gym environment, using the pre-trained Predictive Engine to provide forecast inputs [28,44,45,46].
  • Evaluation and benchmarking: evaluate the trained SAC agent on the held-out test set, compute KPIs, and compare against baseline controllers under identical conditions, following best practices in RL-based micro-grid control surveys [6,10].

5. Performance Evaluation and Benchmarking

This section presents a comprehensive evaluation of the HPAC framework over a 72 h (3 day) simulation horizon. We first describe the implementation details and hyperparameters of the SAC agent, followed by the mini-grid topology and simulation parameters used in the case study. We then define key performance indicators and baseline controllers, present the quantitative benchmarking results against fourteen representative controllers, and analyze performance trade-offs. Ablation studies isolate the contributions of forecasting and dynamic reward shaping. Finally, stress-test scenarios assess robustness under high-volatility conditions and communication impairments, demonstrating the practical viability of the HPAC approach.

5.1. Implementation Details and Hyperparameters

The HPAC framework was evaluated using a MATLAB-(R2025a)-based [62] digital-twin environment. The environment models photovoltaic generation, BESS units with SoC and SoH dynamics, stochastic load profiles, and grid interaction using an aggregated-bus representation.
In the mini-grid case study, photovoltaic generation (peak power 300 kW) and demand profiles (50 households) are generated synthetically using time-of-day-dependent baseline curves with superimposed stochastic perturbations. This setup makes the experiments fully reproducible without requiring access to external datasets.
The Soft Actor-Critic (SAC) agent that realizes the Adaptive Controller is implemented directly in MATLAB, using explicit matrix operations and custom gradient updates. Training is performed on a standard CPU-only workstation with hyperparameters tuned to ensure stable convergence.
Convergence of the SAC agent was assessed by monitoring the actor and critic losses, as well as the episodic return, over the full training horizon. As shown in Figure 3, the actor loss initially increases slightly during the early exploration phase, then undergoes a sustained decay and stabilizes around a nearly constant value, indicating that policy updates become small and consistent. The critic loss starts from a high value, rapidly decreases in the first training steps, and then passes through a transient region with moderate oscillations before settling into a low-variance regime, confirming the numerical stability of the value function estimates. In parallel, the episodic return, computed from the same training logs, exhibits a characteristic improvement followed by a plateau at a steady value, jointly demonstrating that the SAC-based HPAC controller converges to a stable and high-performing policy.
For a detailed overview of the training configuration, Table 4 lists the hyperparameters used for the Soft Actor-Critic agent in HPAC. These values include the optimization constraints and the fixed entropy coefficient used to govern the training budget.

5.1.1. Mini-Grid Topology and Simulation Parameters

For reproducibility, we summarize the topology and simulation parameters used in the MATLAB-based case study. The mini-grid is modeled as a single aggregated low-voltage AC bus with a nominal line-to-line voltage of 480 V. The aggregated bus represents the point of common coupling to the main grid and supplies an equivalent of 50 residential/commercial households. Net demand is obtained as the difference between the stochastic load profile and rooftop photovoltaic (PV) generation.
PV generation is represented by a single aggregated array with a peak power of 300 kW. Both load and PV trajectories are synthesized inside the simulation environment using time-of-day-dependent baseline shapes with superimposed stochastic perturbations, providing realistic diurnal patterns and intra-hour variability throughout the 72 h horizon.
The BESS fleet comprises N = 4 heterogeneous units with energy capacities [ 100 , 150 , 120 , 130 ]  kWh (total 500 kWh) and maximum power ratings [ 50 , 60 , 55 , 58 ]  kW. Converter efficiency is fixed at 95%, while SoC and SoH constraints are enforced as [ SoC min , SoC max ] = [ 20 % , 90 % ] and SoH min = 70 % , respectively.
The adaptive controller operates at a control interval of T control = 1 min , whereas the predictive engine is updated every T predict = 15 min with a forecast horizon of H = 12  h. The grid interconnection allows importing or exporting up to 500 kW and is priced using a time-of-use tariff with a base price of USD 0.15/kWh and 5% price volatility. Each simulation episode spans 72 h (3 days), ensuring that diurnal cycles and multi-day recovery dynamics are captured consistently for all fourteen controllers evaluated in Table 6.
The grid interconnection is characterized by a maximum import/export capability of 500 kW (grid_max_power = 500). In the base configuration used for the HPAC evaluation, line-impedance and multi-bus parameters are left empty, so the system operates as an aggregated single-bus mini-grid.
The control interval for the adaptive controller is T control = 1 min , while the Predictive Engine is updated every T predict = 15 min with a forecast horizon of H = 12 steps (12 h ahead), consistent with forecast_horizon = 12 in the MATLAB script. All controllers listed in Table 6 are evaluated on this same aggregated-bus topology and parameter set within the shared MATLAB digital twin, ensuring that performance differences are attributable solely to their control logic and learning algorithms.
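For quick reference, the case-study parameters stated in this subsection can be collected in a single configuration object. The sketch below is a Python transcription of the values given in the text (key names are our own choice, not the MATLAB script's variable names).

```python
# Illustrative transcription of the Section 5.1.1 case-study parameters.
MINI_GRID = {
    "bus_voltage_v": 480,                         # nominal line-to-line voltage
    "households": 50,
    "pv_peak_kw": 300,
    "bess_capacity_kwh": [100, 150, 120, 130],    # N = 4 units, total 500 kWh
    "bess_power_kw": [50, 60, 55, 58],
    "converter_eff": 0.95,
    "soc_limits": (0.20, 0.90),
    "soh_min": 0.70,
    "t_control_min": 1,                           # Adaptive Controller interval
    "t_predict_min": 15,                          # Predictive Engine interval
    "forecast_horizon_h": 12,
    "grid_max_power_kw": 500,
    "base_price_usd_per_kwh": 0.15,
    "price_volatility": 0.05,
    "episode_hours": 72,
}
```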

5.1.2. Key Performance Indicators

A multifaceted evaluation is essential to assess the HPAC framework. Representative KPIs are summarized in Table 5. The KPI set is consistent with industrial practice for energy storage systems and micro-grid business cases [63] and, more importantly, with metrics used in recent peer-reviewed ML-based micro-grid management studies [3,9]. Industry white papers and vendor documentation are used only as supplementary context rather than as primary sources.

5.1.3. Baseline Models for Comparison

To validate the novelty and benefits of HPAC, performance should be benchmarked against credible baselines representing different paradigms as follows:
  • Baseline 1: Advanced adaptive droop control. A decentralized reactive scheme where droop coefficients are dynamically adjusted based on SoC, capacity, and possibly SoH, following recent adaptive droop approaches [11,13,14,15].
  • Baseline 2: MPC-based controller. A linear MPC with a simplified state-space model and deterministic forecasts (e.g., from classical models or DL-based predictors). The objective function mirrors HPAC’s reward structure. This highlights HPAC’s advantages in computational tractability and robustness to model mismatch relative to MPC-based storage control [10,23].
  • Baseline 3: Simpler DRL algorithm (e.g., DDPG). A non-entropy-regularized actor-critic algorithm trained under the same conditions. Comparison with SAC isolates the benefits of the maximum-entropy framework in terms of performance, training stability, and robustness, complementing micro-grid DRL surveys [6].
In addition to these conceptual baselines, the simulation study can include a broader family of controllers, including heuristic rule-based controllers, centralized and distributed MPC variants, tabular Q-learning, policy-gradient methods such as PPO, multi-agent RL, and advanced fuzzy-logic controllers. This enables a richer comparison of trade-offs across technical, economic, and degradation-related KPIs.

5.1.4. Implementation of the Controller Set

All 14 controllers in Table 6 are implemented in the same MATLAB-based digital twin and are evaluated under identical operating conditions, including load and PV profiles, network parameters, and tariff structure.
Table 6. Simulation-based performance comparison of 14 representative controllers in the HPAC mini-grid test system.
Controller                        | Total Cost (USD) | Energy Throughput (kWh) | SoC Variance | Voltage Variance (V²)
HPAC (Simple)                     | 1023.76          | 1265.80                 | 0.0933       | 0.5474
Rule-Based (Simple) [11,12,20]    | 1019.88          |  255.40                 | 0.0004       | 0.5535
Rule-Based (Enhanced) [14,15,22]  | 1021.59          |  286.42                 | 0.0008       | 0.5636
MPC (Simplified) [23,42]          | 1003.24          |  337.47                 | 0.0109       | 0.5667
SAC (RL-based) [18,19,28,43]      | 1021.15          |  775.85                 | 0.0924       | 0.5639
Centralized MPC [23,42]           | 1003.53          |  337.48                 | 0.0007       | 0.5565
Distributed MPC [23,42]           |  943.57          | 1059.69                 | 0.0009       | 0.5228
PI Controller [20,21,57]          |  959.01          |  928.26                 | 0.0083       | 0.5425
Q-Learning (Tabular) [6,10,29]    | 1052.30          |  416.14                 | 0.0002       | 0.5913
PPO (Policy Gradient) [5,6,64]    | 1029.63          |  800.53                 | 0.0406       | 0.5682
MARL (Multi-Agent RL) [6,29,65]   | 1030.78          | 1053.09                 | 0.1454       | 0.5702
Fuzzy Logic (Adv.) [27,50]        |  998.01          | 1461.55                 | 0.0001       | 0.5666
Heuristic (Advanced) [3,5,17]     |  967.98          | 2003.58                 | 0.0006       | 0.5785
HPAC (Manuscript) [this work]     |  943.90          |  695.60                 | 0.0000       | 0.3641
The resulting performance spread in terms of total cost, energy throughput, SoC variance, and voltage variance directly reflects the different control philosophies: some controllers deliberately trade higher energy throughput and operating cost for tighter SoC equalization, while others reduce cycling at the expense of power–quality or arbitrage performance. For clarity, we briefly summarize their configuration and relate it to the aggregate metrics reported in Table 6 as follows:
  • HPAC (Simple) shares the SAC architecture and state representation with HPAC (Manuscript) but uses a static, manually tuned reward without forecast-driven shaping from the Predictive Engine. As a result, it exhibits relatively high total cost and SoC variance, together with elevated energy throughput, indicating that the agent over-cycles the batteries and reacts myopically to local states, without consistently aligning its decisions with price signals or long-horizon grid conditions.
  • Rule-Based (Simple) is a heuristic controller based on fixed SoC thresholds and deadbands that prioritize keeping all BESS units within a nominal SoC band, without explicit price awareness. It achieves very low SoC variance and modest voltage variance but at the expense of limited energy throughput and higher overall cost, since it often ignores profitable arbitrage opportunities. Rule-Based (Enhanced) augments this logic with additional rules that charge or discharge when prices fall below or rise above predefined thresholds, which modestly increases energy throughput and cost while maintaining low SoC variance, but still lacks the ability to optimally time actions over a prediction horizon.
  • MPC (Simplified) and Centralized MPC solve quadratic programs over a finite horizon. The simplified variant uses a reduced-order model and shorter horizon, leading to moderate cost, moderate energy throughput, and slightly higher SoC and voltage variance. The centralized variant employs a richer state vector and longer horizon, which tightens SoC variance and stabilizes voltage somewhat, but still incurs non-negligible cost and limited throughput because its optimization is constrained by model complexity and a fixed degradation penalty. Distributed MPC decomposes the optimization across BESS units with a consensus step on coupling variables and achieves near-minimal cost with high energy throughput and very low SoC variance; however, its voltage variance remains noticeably higher than that of HPAC (Manuscript), indicating that purely optimization-based coordination does not fully exploit the hierarchical structure and adaptive reward shaping of HPAC.
  • SAC (RL-based), PPO (Policy Gradient), and Q-Learning (Tabular) are standard DRL baselines trained in the same Gym-compatible environment but without hierarchical forecasting and dynamic reward shaping. SAC and PPO operate in continuous action spaces and tend to generate higher energy throughput with elevated SoC variance and cost, suggesting that they overemphasize short-term rewards and struggle to internalize the long-term trade-off between arbitrage and degradation. By contrast, tabular Q-learning uses a discretized action set, which yields extremely low SoC variance but the highest total cost and relatively large voltage variance, reflecting overly conservative policies that keep SoC tightly regulated while missing profitable and grid-supportive actions.
  • PI Controller is a conventional feedback controller that acts on aggregate SoC error and bus-voltage deviation, with gains tuned to achieve a compromise between SoC balancing and voltage regulation. This results in moderate energy throughput and cost, with low but non-minimal SoC variance and acceptable voltage variance, characteristic of a purely local feedback design that cannot anticipate future disturbances or prices.
  • MARL (Multi-Agent RL) assigns one agent per BESS unit with local observations and a shared global reward; agents are trained jointly using a centralized-critic, decentralized-actor scheme. This structure enables high energy throughput but leads to the largest SoC variance among all controllers and relatively high voltage variance and cost, indicating that local agents tend to over-exploit arbitrage and fail to coordinate sufficiently for global SoC balancing and voltage support.
  • Fuzzy Logic (Advanced) implements a Mamdani-type fuzzy rule base with inputs derived from SoC, power, and price signals and outputs corresponding to charge/discharge setpoints for each BESS unit, inspired by recent applications of fuzzy-logic controllers to nonideal battery and V2G operation [50]. It achieves extremely low SoC variance and high energy throughput, reflecting effective balancing and aggressive cycling; however, this comes with increased cost and elevated voltage variance, as the fuzzy rules are not jointly optimized against a multi-objective cost that explicitly internalizes degradation and power–quality constraints.
  • Heuristic (Advanced) is a hand-tuned, price-driven arbitrage strategy that aggressively cycles the batteries whenever even small price spreads are present, subject to basic SoC constraints but without an explicit degradation-aware objective. This explains the highest energy throughput in the comparison, with low SoC variance but significantly higher voltage variance and non-minimal cost, illustrating how purely economic heuristics can jeopardize power-quality and long-term asset health.
  • HPAC (Manuscript) is the full hierarchical controller described in Section 4 and Section 5, combining the federated transformer Predictive Engine with a SAC-based Adaptive Controller and the multi-objective reward structure defined in Section 4.1.2. In Table 6, this design yields near-minimal total cost, essentially zero SoC variance, and by far the lowest voltage variance, while maintaining only moderate energy throughput. In other words, HPAC (Manuscript) deliberately sacrifices excessive cycling and marginal arbitrage gains in favor of simultaneous SoC equalization, voltage stability, and implicit degradation mitigation, demonstrating the benefit of jointly optimized, forecast-aware reward shaping and hierarchical control.
Taken together, the numerical results reveal a clear set of trade-offs across the four performance metrics. Controllers that aggressively pursue arbitrage, such as the Heuristic (Advanced), MARL, and Fuzzy Logic schemes, achieve very high energy throughput but pay for this with increased total cost and, more critically, elevated voltage variance, indicating frequent and large power injections that stress the network. At the opposite extreme, conservative schemes like Q-Learning (Tabular) and the simple rule-based controllers tightly regulate SoC and maintain acceptable voltage profiles but exhibit high or non-competitive costs and limited throughput, reflecting missed opportunities for using the BESS fleet as an economic and grid-support resource. The MPC family, particularly the Distributed MPC variant, demonstrates that model-based optimization can simultaneously reduce costs and maintain low SoC variance, yet its voltage variance remains substantially higher than that of HPAC (Manuscript), suggesting that static constraints and fixed penalty weights are not sufficient to capture the complex, time-varying trade-off between local SoC, network conditions, and price signals. HPAC (Manuscript) occupies a distinct “balanced optimum” region of the Pareto surface: its cost is among the lowest in the entire set, its SoC variance is essentially zero, and its voltage variance is markedly lower than all other controllers, while its energy throughput remains moderate rather than extreme. This combination indicates that the hierarchical, forecast-aware, and reward-shaped design of HPAC not only improves steady-state performance but also restructures the control policy to allocate battery cycling where it is most valuable for both economics and power quality, rather than simply maximizing throughput or minimizing a single objective.
This common implementation framework ensures that differences in performance across controllers are attributable to their control logic and learning algorithms rather than to discrepancies in the underlying models or data sources. The updated quantitative results in Table 6 thus provide a consistent, system-level view of how each control paradigm trades off cost, throughput, SoC balancing, and voltage regulation, and highlight the ability of HPAC (Manuscript) to deliver a balanced optimum across all four metrics.

5.1.5. Scenario Analysis

Beyond average-case performance, robustness under challenging conditions is crucial. Representative stress-test scenarios include the following:
  • High-volatility day: rapid fluctuations in irradiance (fast-moving clouds) and spiky load profiles. Tests the controller’s ability to maintain stability in highly dynamic conditions.
  • Unexpected event: sudden unforecasted changes such as a significant load step or a generator fault. Evaluates the ability to recover from deviations relative to forecasted trajectories.
  • Communication impairment: simulated latency or loss of communication between Predictive Engine and Adaptive Controller. Assesses how well the SAC agent can operate with stale or missing forecast information.
  • Component failure: sudden failure of a BESS unit. Evaluates fault tolerance and the ability to rebalance SoC and manage load with reduced storage capacity.
In addition to proposing these scenarios, we implemented two of them explicitly in the digital twin and evaluated all controllers as follows:
  • A high-volatility scenario with rapidly varying net load and occasional price spikes.
  • A communication impairment scenario in which the forecast sent by the Predictive Engine to the Adaptive Controller is periodically frozen, emulating intermittent loss of connectivity.
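The forecast-freezing mechanism used in the communication impairment scenario can be emulated in a few lines. The following is a minimal sketch under assumed array shapes; the function name and interface are ours, not taken from the simulation code:

```python
import numpy as np

def freeze_forecasts(forecasts, outage_mask):
    """Emulate intermittent forecast loss: during an outage the controller
    keeps receiving the last forecast issued before the link went down.

    forecasts   : (T, H) array, one H-step forecast per control step
    outage_mask : (T,) boolean array, True while the link is down
    Returns the (T, H) array of forecasts actually seen by the agent.
    """
    seen = forecasts.copy()
    for t in range(1, len(forecasts)):
        if outage_mask[t]:
            seen[t] = seen[t - 1]  # stale forecast held over from the previous step
    return seen
```

Feeding the resulting stale forecasts to the agent, instead of the fresh ones, reproduces the "frozen forecast" condition evaluated in Figure 5.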
Figure 4 illustrates the SoC trajectories obtained in the high-volatility net-load scenario for HPAC, Distributed MPC, and the best-performing rule-based controller. In this stress test, the mini-grid is exposed to rapid alternations between surplus renewable generation and sharp demand spikes, which tend to pull individual batteries toward their operational limits if control decisions are purely reactive. Under these conditions, HPAC keeps all BESS units tightly clustered within a narrow SoC band around the desired operating range, with only small, synchronized excursions that mirror the underlying net-load swings. This behavior reflects the controller’s ability to anticipate upcoming ramps through its forecast-augmented state and dynamically shaped reward so that batteries are pre-charged or pre-discharged before extreme events occur.
By contrast, the rule-based controller exhibits visibly wider spreads and persistent divergence among individual SoC trajectories. Some units experience repeated approaches to the upper and lower SoC bounds, indicating that fixed thresholds and heuristic dispatch rules cannot cope with the rapid sequence of surplus and deficit periods. These excursions to extreme SoC values are symptomatic of both poorer SoC balancing and a higher risk of accelerated degradation. Distributed MPC performs markedly better than the rule-based baseline in terms of SoC equalization, but its trajectories still show slightly larger oscillations and slower realignment after major disturbances. This is consistent with a controller that relies on deterministic forecasts and finite-horizon re-optimization: it can exploit foresight, but its performance degrades when prediction errors or model mismatch accumulate. Overall, the visual comparison in Figure 4 highlights that HPAC delivers the tightest SoC clustering and fastest recovery after high-volatility events, while Distributed MPC incurs slightly higher operating costs and the rule-based scheme suffers from both increased SoC variance and more frequent saturation at the operational limits.
In the communication impairment scenario, HPAC is deliberately operated under intermittent forecast unavailability, emulating temporary failures or congestion in the link between the mini-grid and the cloud-based Predictive Engine. Figure 5 reports the resulting SoC trajectories and bus-voltage profile for HPAC compared against a forecast-free baseline controller that always operates myopically. During periods when forecasts are frozen, the HPAC agent must rely solely on stale predictive information and real-time local measurements. The SoC traces show that these forecast-free intervals lead to mildly larger short-term oscillations and a small drift away from the nominal SoC band, reflecting the reduced ability to pre-position the batteries ahead of upcoming ramps. However, the trajectories remain well within the admissible bounds and reconverge quickly once fresh forecasts become available, indicating that the learned policy has internalized meaningful patterns of typical net-load evolution and can fall back to a safe, measurement-driven behavior when predictive context is degraded.
The bus-voltage plot in Figure 5 further confirms that stability is not compromised by communication impairments. Even when forecasts are stale, HPAC keeps the voltage tightly regulated around the nominal value, with deviations remaining within the acceptable operational band and without inducing secondary oscillations or large overshoots. In contrast, the forecast-free baseline controller exhibits noticeably larger voltage excursions during fast net-load changes, as it cannot exploit any information about upcoming imbalances and therefore tends to overreact to instantaneous measurements. From an economic perspective, the intermittent loss of forecasts leads to a modest increase in operating cost for HPAC because some charging and discharging decisions become more conservative. Nonetheless, this cost penalty remains limited compared with the improvement in SoC balancing and voltage regulation relative to the purely myopic baseline.
Across both stress-test scenarios, HPAC consistently achieves the lowest or near-lowest SoC variance and voltage deviation among the considered controllers, while incurring only modest increases in operating cost relative to the nominal (fault-free) case. In the high-volatility scenario, this manifests as tightly clustered SoC trajectories and well-damped responses to abrupt net-load swings, whereas in the communication impairment scenario, it appears as graceful degradation: short-lived forecast outages slightly degrade economic and balancing performance but do not trigger instability or constraint violations. Taken together, these results substantiate the qualitative claim that HPAC is robust to realistic disturbances and imperfections in the forecasting and communication stack and that the hierarchical predictive-adaptive design provides tangible benefits over both rule-based and optimization-only baselines under adverse operating conditions.

5.2. Performance Analysis and Trade-Offs

As summarized in Table 6, the proposed HPAC framework achieves the lowest total operating cost among the 14 controllers evaluated in the mini-grid case study, while simultaneously maintaining excellent SoC-balancing performance and bus-voltage regulation. Compared with the best-performing MPC-based baseline, HPAC attains a similar or slightly lower operating cost but with significantly reduced SoC variance and voltage variance. Relative to purely DRL-based baselines that lack hierarchical forecasting and reward shaping, HPAC reduces cost by more than ten percent while delivering markedly better SoC equalization and voltage stability.
These aggregate results already suggest that HPAC is operating in a favorable region of the underlying multi-objective trade-off surface, where economic performance, technical quality, and degradation-aware behavior are jointly optimized rather than individually tuned. Controllers that aggressively chase arbitrage opportunities achieve high energy throughput but exhibit elevated voltage variance and, in some cases, only marginal cost improvements. Conversely, conservative rule-based or tabular RL controllers keep SoC tightly regulated but forgo profitable charge/discharge windows, leading to higher overall costs. HPAC occupies a balanced operating point: it uses the BESS fleet sufficiently often to exploit meaningful price spreads and to provide grid-support services, but it avoids unnecessary cycling that would erode battery lifetime or destabilize bus voltage.
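The four benchmark metrics can be reproduced directly from raw simulation traces. The following sketch assumes a simple sign convention and illustrative array shapes; the function name and interface are ours, not taken from the benchmark code:

```python
import numpy as np

def evaluate_kpis(soc, p_bess, v_bus, price, dt_h=0.25):
    """Compute the four benchmark KPIs from simulation traces.

    soc    : (T, N) state of charge of each BESS unit, in [0, 1]
    p_bess : (T, N) charging(+)/discharging(-) power per unit [kW]
    v_bus  : (T,)  bus voltage [V]
    price  : (T,)  electricity price [$/kWh]
    dt_h   : control interval in hours
    """
    grid_energy = p_bess.sum(axis=1) * dt_h            # kWh exchanged per step
    total_cost = float(np.sum(grid_energy * price))    # $ (charging assumed to cost money)
    throughput = float(np.sum(np.abs(p_bess)) * dt_h)  # kWh cycled by the fleet
    soc_var = float(np.mean(np.var(soc, axis=1)))      # mean cross-unit SoC variance
    v_var = float(np.var(v_bus))                       # bus-voltage variance [V^2]
    return {"cost": total_cost, "throughput": throughput,
            "soc_var": soc_var, "v_var": v_var}
```

Any controller's traces can be scored with the same function, which is what makes the cross-controller comparison in Table 6 consistent.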
Figure 6 provides a complementary, time-domain view of these trade-offs by plotting the SoC trajectories for all 14 controllers over a three-day horizon. For each controller, the solid line shows the fleet-averaged SoC, while the shaded band indicates the minimum and maximum SoC across individual BESS units. Controllers with narrow, nearly flat envelopes are able to maintain tight SoC clustering, whereas controllers with wide or highly distorted envelopes allow significant divergence between units, often driving some batteries into extreme SoC regions while others remain underutilized.
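The envelope statistics described above can be computed directly from the per-unit SoC traces; a minimal sketch (array shapes assumed, names illustrative):

```python
import numpy as np

def soc_envelope(soc):
    """Fleet-average SoC plus min/max envelope across units, as plotted
    in the trajectory comparison (solid line plus shaded band).

    soc : (T, N) array of per-unit SoC trajectories.
    Returns (mean, lower, upper, mean_band_width); a small mean band
    width indicates tight SoC clustering across the fleet.
    """
    mean = soc.mean(axis=1)
    lo = soc.min(axis=1)
    hi = soc.max(axis=1)
    return mean, lo, hi, float(np.mean(hi - lo))
```

Narrow, nearly flat envelopes correspond to a small mean band width, which is the visual signature of good SoC balancing in Figure 6.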
The rule-based schemes, particularly the enhanced variant, exhibit almost perfectly flat trajectories once their internal thresholds are reached: all batteries are quickly driven toward a preferred SoC band and then held there with minimal movement. This explains their extremely low SoC variance, but the nearly static trajectories also reveal why these controllers extract little economic value from the storage assets; most of the available energy capacity remains unused once the preferred SoC level is attained. Distributed MPC and the more advanced model-based controllers show sharper SoC excursions that are well coordinated across units: the mean SoC follows a smooth charge/discharge pattern, and the envelope remains narrow, indicating that all batteries participate in a synchronized manner. However, these trajectories often feature pronounced peaks and troughs driven by the optimization horizon, which can translate into relatively aggressive cycling and higher sensitivity to model mismatch.
DRL-based baselines without hierarchical forecasting, such as generic SAC, PPO, MARL, and heuristic policies, tend to produce more irregular SoC profiles. In several cases the envelopes widen substantially during periods of high activity, with individual BESS units being driven close to their lower or upper SoC limits while others lag behind. This behavior is symptomatic of policies that optimize short-term rewards based on local information, without a global view of SoC balancing or an explicit mechanism to keep trajectories synchronized. The heuristic advanced controller, in particular, shows rapid and repeated swings across most of the SoC range, yielding very high throughput but also large envelope width and a final SoC level close to depletion.
In contrast, the HPAC (Manuscript) controller produces SoC trajectories that combine the desirable features of the best baselines while avoiding their drawbacks. The mean SoC follows a smooth, gradually increasing profile as the fleet charges in anticipation of upcoming net-load peaks and then levels off in a high-but-safe region where sufficient headroom remains for subsequent disturbances. The envelope around this trajectory is extremely narrow, indicating that all BESS units move almost in lockstep and that SoC imbalance is virtually eliminated. At the same time, the SoC is neither held artificially constant nor pushed repeatedly to its extreme bounds; instead, the fleet is cycled in a controlled and coordinated way that reflects both economic signals and long-term degradation considerations. The resulting trajectories visually confirm the quantitative SoC-variance metrics and illustrate how HPAC’s predictive-adaptive design enables both efficient usage of storage capacity and tight SoC coordination.

Aggregate KPI Comparison and Grid-Voltage Behavior

While Table 6 provides numerical values for the four key performance indicators, it is useful to visualize how the controllers compare along all dimensions at a glance. Figure 7 presents bar charts for total operating cost, battery energy throughput, SoC variance, and bus-voltage variance for the full set of fourteen controllers. The first panel confirms that HPAC (Manuscript) achieves one of the lowest total costs in the comparison, closely matching the best MPC-based baselines and clearly outperforming the generic SAC and PPO controllers. In contrast, controllers such as Heuristic (Advanced) and MARL exhibit noticeably higher operating costs, despite sometimes delivering larger energy throughput. This illustrates the central trade-off: blindly maximizing battery usage does not necessarily translate into better economic performance once degradation-aware operation and network constraints are taken into account.
The second panel of Figure 7 shows the total energy throughput of the BESS fleet. Heuristic (Advanced) and Fuzzy Logic (Advanced) stand out with the highest throughput, reflecting highly aggressive cycling policies that charge and discharge whenever even modest price spreads exist. HPAC (Manuscript) sits in an intermediate range: it uses the batteries substantially more than simple rule-based schemes but considerably less than the most aggressive heuristics. This moderated throughput is a direct consequence of the degradation-aware reward component and of the forecast-informed positioning of the fleet. Rather than exploiting every small arbitrage opportunity, HPAC concentrates cycling on time windows where forecasts indicate that flexibility will be most valuable, reducing unnecessary wear on the batteries.
The lower panels highlight the technical quality of control. In terms of SoC balancing, HPAC (Manuscript) attains essentially zero variance, matching or surpassing the best rule-based SoC controllers while significantly improving economic performance. Generic DRL methods such as SAC and MARL exhibit much higher SoC variance, often because they focus on short-term cost or reward signals without an explicit SoC-equalization objective. A similar picture emerges for voltage variance: HPAC (Manuscript) achieves the lowest bus-voltage variance among all controllers, indicating that its actions maintain the bus voltage very close to the nominal value despite the stochastic net load. Several baselines, including fuzzy logic, MARL, and heuristic controllers, show noticeably higher voltage variance, consistent with more abrupt and less coordinated power exchanges with the grid. Taken together, these four panels confirm that HPAC (Manuscript) lies near the Pareto front of the multi-objective trade-off: it simultaneously delivers low cost, good SoC balancing, moderate throughput, and superior voltage stability.
Figure 8 provides a complementary time-domain view, focusing on grid power exchange and bus-voltage trajectories for three representative controllers: HPAC (Simple), Fuzzy Logic (Advanced), and HPAC (Manuscript). In the upper subplot, the black curve represents the net load seen at the point of common coupling, while the colored curves show the resulting grid exchange under each controller. The HPAC (Simple) controller tracks the net load relatively closely and performs only limited shaping of the grid-exchange profile. The fuzzy-logic controller is more aggressive: it injects and absorbs large power swings in response to its rule base, which noticeably amplifies the high-frequency content of the grid-exchange signal. By contrast, HPAC (Manuscript) produces a smoother and more structured grid-exchange trajectory. Peaks and valleys are attenuated relative to the net load, indicating that the BESS fleet is used to buffer variability without overreacting to short-lived fluctuations. This behavior reflects the predictive-adaptive design: the SAC agent, informed by multi-horizon forecasts, learns to distinguish between transient disturbances and persistent trends in net load and prices.
The lower subplot of Figure 8 shows the corresponding bus-voltage profiles. All three controllers keep the voltage within a tight band around the nominal value, but the differences in variance are visible. The fuzzy-logic controller exhibits the largest ripple, consistent with its more abrupt changes in grid power. HPAC (Simple) performs better but still shows slightly larger deviations than HPAC (Manuscript). The latter maintains the flattest voltage trace, staying very close to the nominal voltage (dashed red line) throughout the three-day simulation. This confirms quantitatively what the variance statistics in Figure 7 suggest qualitatively: the full HPAC design not only balances SoC and reduces operating cost, but it also implicitly acts as an effective voltage-support controller by scheduling BESS injections and withdrawals in a way that avoids sharp voltage excursions.
Overall, the combination of the aggregate KPIs in Figure 7 and the time-series behavior in Figure 8 offers a consistent system-level picture. Controllers that maximize energy throughput tend to induce higher voltage variance and only modest cost improvements, while conservative rule-based strategies sacrifice economic benefits and flexibility. HPAC (Manuscript) occupies a distinct operating point: it reduces cost, tightly balances SoC, and stabilizes voltage with a moderate level of cycling, demonstrating that the hierarchical predictive-adaptive architecture can extract more value from the same hardware by coordinating forecasting, reward shaping, and real-time control.

5.2.1. Why HPAC Outperforms Baselines

The performance gap between HPAC and the “HPAC (Simple)” variant highlights the significance of the predictive-adaptive hierarchy. HPAC (Simple) uses the same SAC architecture and state representation but omits the dynamic reward shaping derived from the Predictive Engine. Consequently, it learns a static policy that optimizes a fixed trade-off between SoC balancing and economics. In contrast, the full HPAC agent receives forecast-aware signals that modulate the relative importance of the reward components in time. For example, when the Predictive Engine anticipates a substantial net-load spike two hours ahead, the effective weight on SoC balancing is increased, discouraging aggressive discharge in the present and positioning the BESS fleet for the future event. This mechanism allows HPAC to exhibit MPC-like foresight without explicit model-based optimization at run time.
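The forecast-driven modulation of the reward weights can be illustrated with a small sketch. The threshold, boost factor, and function names below are hypothetical tuning choices for illustration, not the exact values used in HPAC:

```python
import numpy as np

def shaped_weights(forecast, base_weights, spike_thresh=0.8, boost=2.0):
    """Forecast-driven reward shaping, sketched under assumed names.

    forecast     : (H,) normalized net-load forecast over the horizon
    base_weights : dict with keys 'soc', 'econ', 'soh', 'volt'
    If a large net-load spike is anticipated within the horizon, the SoC-
    balancing weight is boosted so the agent pre-positions the fleet
    instead of discharging aggressively for short-term profit.
    """
    w = dict(base_weights)  # leave the base configuration untouched
    if np.max(forecast) > spike_thresh:
        w["soc"] *= boost   # prioritize balancing/headroom before the event
        w["econ"] *= 0.5    # de-emphasize immediate arbitrage
    return w

def reward(errors, w):
    """Multi-objective reward: negative weighted sum of penalty terms."""
    return -sum(w[k] * errors[k] for k in w)
```

Because the weights are recomputed at every step from the latest forecast, the agent's effective objective shifts in time, which is what gives the static-weight "HPAC (Simple)" variant no access to this foresight.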
The comparison with Distributed MPC further illustrates this point. Distributed MPC uses an explicit system model and deterministic forecasts to solve coupled optimization problems repeatedly. It therefore performs well in terms of cost and SoC variance but is sensitive to model mismatch and incurs a higher computational burden at each control step. HPAC amortizes its computational cost offline during training and replaces online optimization by a single neural-network forward pass. This yields performance on par with, or better than, distributed MPC under nominal conditions, and superior robustness under forecast errors and unmodeled dynamics.

5.2.2. Trade-Offs Between Cost, Throughput, and Degradation

Examining battery energy throughput reveals a clear trade-off between economic aggressiveness and long-term asset health. HPAC processes a moderate amount of energy over the evaluation horizon, higher than that of the simpler rule-based strategies but significantly lower than that of the heuristic and MARL controllers that chase every small price spread. The latter achieve only marginal additional cost reductions while substantially increasing throughput and, by implication, degradation.
In contrast, HPAC’s explicit degradation-aware reward component penalizes excessively deep or frequent cycling and prolonged operation at extreme SoC values. This leads to a more measured throughput and a lower equivalent full-cycle count while still achieving the best overall cost performance. The multi-objective reward function therefore enables HPAC to strike a favorable balance between short-term economic metrics and long-term battery health, which is particularly important in mini-grids where storage replacement costs are high.
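A degradation-aware penalty of the kind described can be sketched as follows. The quadratic C-rate term, the SoC band limits, and the coefficients are illustrative assumptions, not the paper's exact formulation:

```python
def degradation_penalty(soc, p_bess, p_rated, soc_low=0.2, soc_high=0.9,
                        k_cycle=1.0, k_extreme=0.5):
    """Illustrative degradation-aware penalty for one BESS unit at one step.

    Penalizes (i) high-rate cycling, quadratically in the power relative
    to rated power, and (ii) dwelling outside a preferred SoC band.
    k_cycle and k_extreme are hypothetical tuning constants.
    """
    cycle_term = k_cycle * (p_bess / p_rated) ** 2
    if soc < soc_low:
        extreme_term = k_extreme * (soc_low - soc)   # too deeply discharged
    elif soc > soc_high:
        extreme_term = k_extreme * (soc - soc_high)  # held too close to full
    else:
        extreme_term = 0.0
    return cycle_term + extreme_term
```

Subtracting such a term from the reward discourages both frequent deep cycling and prolonged operation at extreme SoC, which is what moderates HPAC's throughput relative to the aggressive heuristics.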
To reduce the impact of stochasticity in both the environment and the DRL training process, HPAC and the DRL baselines were trained and evaluated under several random initializations of network weights and environment seeds. The numerical values reported in Table 6 correspond to representative runs whose performance is close to the median across seeds; in all cases, the ranking between HPAC and the main baselines remained unchanged, and the relative cost gap stayed within a narrow band.

5.2.3. Ablation Study: Role of Forecasts and Reward Shaping

To quantify the contribution of the Predictive Engine, we conducted an ablation study with the following three variants of the HPAC controller: (i) a no-forecast variant, where the SAC agent observes only instantaneous measurements and optimizes a static reward; (ii) a state-only variant, where forecasts are injected as additional state features but the reward weights remain fixed; and (iii) the full HPAC configuration, where forecasts are used both as state features and to drive dynamic reward shaping. The no-forecast variant incurs the highest operating cost and exhibits noticeably larger SoC variance, confirming that purely reactive policies struggle to position the BESS fleet optimally. Providing forecasts only as state features improves performance modestly, but SoC variance remains relatively high because the agent optimizes a static reward that cannot adapt its priorities as future conditions change. The full HPAC configuration delivers the best performance across all metrics, demonstrating that both uses of forecasts are essential for achieving MPC-like foresight with model-free DRL.

5.3. Forecasting Performance

To assess the effectiveness of the federated transformer-based Predictive Engine, we compared its net-load forecasting accuracy against two baselines: a Local-only model trained independently at each site and a Centralized model trained on the union of all data. Accuracy was measured in terms of mean absolute error (MAE) over a rolling 24 h horizon. In our experiments, the Federated model achieved an MAE of approximately 4.2%, which is only marginally higher than the centralized model (3.9%) and substantially better than the Local-only approach (7.8%). For comparison, a tuned ARIMA forecaster and a univariate LSTM forecaster obtained MAEs of 6.5% and 5.1%, respectively, on the same hold-out test set. This confirms that the proposed federated transformer setup enables collaborative training of a high-capacity forecasting model that outperforms classical statistical and recurrent baselines while preserving data locality and incurring only a small accuracy penalty relative to full data centralization.
The communication overhead associated with the Predictive Engine was also quantified. For the HPAC controller, forecast messages are exchanged with the adaptive controller on a 15 min timescale, resulting in on the order of a few hundred short messages per day and a data volume in the tens of kilobytes, negligible compared with typical mini-grid telemetry. This overhead is significantly lower than that of fully centralized MPC schemes, which require frequent transmission of detailed state information from all assets to a central optimizer. Finally, we simulated a multi-site training configuration in which several virtual mini-grids train a shared global forecasting model using federated averaging. The resulting performance closely matches that of the centralized model while respecting privacy constraints and eliminating the need to aggregate raw high-resolution operational data at a single location. This demonstrates the feasibility of multi-site HPAC deployment with privacy guarantees.
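The multi-site configuration relies on federated averaging, in which only model parameters, never raw data, leave each site. One aggregation round can be sketched as follows (a minimal stand-in for the full training loop):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """One round of federated averaging (FedAvg).

    client_weights : list of per-site parameter lists (one np.ndarray per layer)
    client_sizes   : number of local training samples at each site
    Each site's contribution is weighted by its share of the total data.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [sum(w[i] * (n / total)
                for w, n in zip(client_weights, client_sizes))
            for i in range(n_layers)]
```

The server broadcasts the averaged parameters back to the sites, which resume local training; repeating this cycle yields the shared global forecaster whose accuracy closely tracks the centralized model.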
The three ablation variants introduced in Section 5.2.3 are specified in detail as follows:
  • HPAC (No-forecast): the SAC agent observes only current measurements (SoC, SoH, P_net, P_grid, C(t), and temporal features) and uses a static, manually tuned reward. No forecast features are provided, and the reward is not shaped by forecasts.
  • HPAC (State-only): the agent observes both current measurements and the multi-horizon forecast sequence, but the reward weights w_soc, w_econ, w_soh, and w_volt remain fixed over time. This configuration corresponds to the “HPAC (Simple)” controller in Table 6.
  • HPAC (Full): the proposed architecture, where forecasts are used both as state features and to drive dynamic reward shaping.
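The variant-specific observation construction can be sketched as follows; the feature ordering and names are illustrative, not the exact implementation:

```python
import numpy as np

def build_state(meas, forecast, variant):
    """Assemble the SAC observation for each ablation variant.

    meas     : (M,) current measurements (SoC, SoH, P_net, P_grid, price, time features)
    forecast : (H,) multi-horizon net-load forecast
    variant  : 'no_forecast' | 'state_only' | 'full'
    In the 'full' variant the forecast additionally drives reward shaping
    elsewhere; the observation itself matches the 'state_only' case.
    """
    if variant == "no_forecast":
        return meas
    return np.concatenate([meas, forecast])
```

Keeping the network architecture fixed while switching only the observation and the reward-shaping path is what lets the ablation isolate each contribution cleanly.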
Table 7 summarizes the results on the same base scenario used for Table 6.
The quantitative results in Table 7 corroborate this picture: the no-forecast variant incurs the highest cost and the largest SoC variance, the state-only variant improves both metrics only modestly, and the full configuration, which combines the forecast-augmented state with dynamic reward shaping, is the strongest across all three metrics. This reiterates the conclusion of Section 5.2.3 that both uses of forecasts are necessary to achieve MPC-like foresight with model-free DRL.

Unified Forecasting Accuracy Across Models and Controllers

Table 8 summarizes the forecasting accuracy of the standalone models and all controllers in terms of MSE, RMSE, and MAE. The transformer-based predictive engine attains the lowest errors among the generic forecasting models (MSE = 414.67, RMSE = 20.79, and MAE = 16.30), outperforming both the LSTM and ARIMA baselines, which exhibit slightly higher RMSE and MAE. The naive persistence benchmark yields the largest errors, confirming that simple extrapolation is inadequate for capturing the dynamics of the net-load time series.
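The three error metrics in Table 8 follow the standard definitions and can be computed as:

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """MSE, RMSE, and MAE as reported in the unified accuracy comparison."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mse = float(np.mean(err ** 2))
    return {"mse": mse, "rmse": mse ** 0.5, "mae": float(np.mean(np.abs(err)))}
```

Evaluating every forecaster and controller-internal predictor with the same function on the same hold-out series is what makes the comparison in Table 8 unified.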
For most controllers, the associated forecasting errors cluster around the ARIMA/persistence range (MSE ≈ 469–473, RMSE ≈ 21.7–21.8, MAE ≈ 17.1), indicating that they either rely on simpler prediction schemes or do not fully exploit high-fidelity forecasts. In contrast, the proposed HPAC (Manuscript) controller directly leverages the transformer-based Predictive Engine and therefore inherits its superior error profile (MSE = 414.67, RMSE = 20.79, and MAE = 16.30), achieving the best forecasting performance across all controllers. This unified comparison highlights that coupling HPAC with a federated transformer forecaster not only improves control-level economic and technical metrics but also yields consistently lower forecast errors than classical statistical, recurrent, and heuristic approaches.

6. Conclusions and Future Work

This paper has presented a Hierarchical Predictive-Adaptive Control (HPAC) framework designed to address the complex challenge of SoC balancing in mini-grids with high renewable penetration. The proposed architecture systematically overcomes limitations of conventional control paradigms by synergizing state-of-the-art deep learning techniques in a two-layer hierarchical structure.
By decoupling the control problem into a long-horizon predictive engine based on a federated transformer network and a real-time adaptive controller implemented with a Soft Actor-Critic agent, HPAC achieves a control solution that is both proactive and highly adaptive. The transformer-based forecasting model captures long-range dependencies in energy time series, while federated learning addresses data scarcity and privacy constraints across decentralized mini-grids. The SAC agent, guided by a multi-objective reward function balancing technical performance, economic efficiency, and asset longevity, learns a robust operational strategy that surpasses traditional single-objective controllers.
A comprehensive evaluation benchmarked HPAC against fourteen representative controllers spanning rule-based, MPC, and various DRL paradigms. The results demonstrate that HPAC achieves near-minimal operating cost (USD 943.90 over the 72 h simulation) while simultaneously attaining essentially zero SoC variance and the lowest voltage variance (0.3641 V²) among all compared controllers. Sensitivity analysis on reward weight selection revealed a well-defined optimal region around w_soc ≈ 0.5 and w_econ ≈ 1.0, where cost and SoC balancing are jointly optimized. Ablation studies confirmed that both forecast-augmented states and dynamic reward shaping are essential for achieving MPC-like foresight with model-free DRL; the full HPAC configuration significantly outperformed variants with static rewards or without forecasts.
Stress-test scenarios further validated the robustness of the approach. Under high-volatility net-load conditions, HPAC maintained tight SoC clustering with rapid recovery after disturbances, whereas rule-based controllers exhibited persistent divergence. Under communication impairment scenarios with intermittent forecast unavailability, HPAC degraded gracefully, leveraging its learned policy and local measurements to maintain safe operation without instability or constraint violations.
Several limitations remain. The current evaluation uses a simplified single-bus network model with four heterogeneous BESS units (total capacity 500 kWh); future work should validate HPAC on more detailed multi-bus network representations with explicit power-flow constraints. The federated transformer forecaster was evaluated in simulation with multiple virtual sites; real-world deployment across geographically distributed mini-grids remains to be demonstrated. Additionally, the computational scaling of centralized HPAC for very large BESS fleets (N > 50) warrants further investigation, with multi-agent RL extensions being a promising direction.
Looking ahead, HPAC forms a foundation for more scalable and decentralized architectures. The linear inference-time scaling with N makes it better suited to large fleets than centralized MPC, and parameter-sharing or attention-based architectures could further improve scalability. Hardware-in-the-loop testing and cloud-edge deployment represent natural next steps toward practical implementation. Through the integration of predictive foresight with model-free adaptive control, HPAC represents a significant step toward intelligent, autonomous, and efficient energy management systems for a resilient and sustainable energy future.

6.1. Future Research Trajectories

While high-fidelity simulation is essential for development, real-world deployment is the ultimate goal. A path to deployment involves transitioning from simulation to hardware-in-the-loop (HIL) testing. In an HIL setup, the trained HPAC controller (SAC policy network) is deployed on a real-time industrial controller or embedded system, which interacts with a real-time power hardware simulator (e.g., OPAL-RT or Typhoon HIL). The simulator emulates the mini-grid dynamics (BESS, inverters, PV, network impedances) with microsecond-level accuracy. HIL validation verifies that inference and control loops meet strict real-time deadlines, exposes the controller to realistic communication latencies and jitter, and de-risks integration with target hardware and software stacks prior to field deployment. In practical terms, an HIL implementation of HPAC would perform the following: (i) compile the SAC policy network and associated pre-processing and post-processing logic into a real-time executable running at the desired control interval; (ii) exchange measurements and control commands with the power hardware simulator through a deterministic communication link; and (iii) log all relevant signals for offline analysis. The same HIL platform can then be used to test corner cases such as sudden BESS or PV outages, measurement dropouts, and extreme weather events, without risking physical assets.
Training the federated transformer and SAC agent is computationally intensive and generally performed offline on GPU-equipped servers or in the cloud. However, real-time inference is lightweight: forward passes through trained networks are matrix operations that can execute in milliseconds on modern embedded hardware. This naturally suggests a hybrid cloud-edge deployment, where the cloud layer handles data aggregation, global model training, and periodic retraining of the forecasting and DRL models, in line with broader trends in federated and distributed learning [36,37]. The edge layer hosts the trained SAC policy (and optionally a lightweight forecasting model) on a local controller within the mini-grid, performing high-frequency real-time control based on local sensor data and occasional cloud updates. If connectivity to the cloud is lost, the edge controller continues to operate using the last-known policy and local measurements, ensuring resilience and autonomy.
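The edge-side fallback behavior described above can be sketched as a small controller wrapper; the class and method names are illustrative, not part of any deployed software:

```python
class EdgeController:
    """Cloud-edge deployment sketch: the edge runs the last-known policy
    and tolerates loss of cloud connectivity.
    """

    def __init__(self, policy):
        self.policy = policy       # trained SAC policy (any callable)
        self.last_forecast = None  # no forecast received yet

    def on_cloud_update(self, policy=None, forecast=None):
        """Apply whatever the cloud managed to deliver this round."""
        if policy is not None:          # periodically retrained policy
            self.policy = policy
        if forecast is not None:        # fresh multi-horizon forecast
            self.last_forecast = forecast

    def act(self, measurements):
        """High-frequency control step using local data. If connectivity
        is lost, the last-known policy and forecast are reused."""
        return self.policy(measurements, self.last_forecast)
```

Because `act` depends only on locally stored state, the control loop keeps running unmodified through a cloud outage, which is exactly the resilience property the hybrid architecture targets.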

6.2. Scalability Considerations

A significant challenge for centralized DRL control is scalability. As the number of BESS units N increases, both state and action spaces grow, making learning more difficult and potentially less sample-efficient. The scalability of RL is an active research topic [6]. Promising directions for addressing scalability include multi-agent reinforcement learning (MARL), which treats each BESS as an agent with its own policy, learning to cooperate (implicitly or explicitly) to achieve global SoC balancing. MARL can be more scalable and robust to single points of failure, as illustrated by multi-agent power-grid control approaches such as PowerNet [65]. Another direction is parameter sharing and attention-based architectures in centralized schemes, using a shared policy network that processes information from each BESS with parameter tying, keeping the number of learnable parameters independent of N. Attention mechanisms can allow the network to focus on the most relevant units when making decisions, effectively learning a coordination topology.
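The parameter-tying idea can be illustrated with a minimal NumPy sketch: one shared per-unit encoder plus softmax attention pooling over units, so the parameter count is independent of the fleet size N. Shapes, feature choices, and the tiny network below are assumptions for illustration, not the architecture used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H = 4, 16                       # per-unit features, hidden width
W_enc = rng.normal(size=(F, H))    # encoder shared (tied) across all units
w_att = rng.normal(size=H)         # attention scoring vector
w_out = rng.normal(size=H)         # action head, also shared

def policy(unit_features: np.ndarray) -> np.ndarray:
    """unit_features: (N, F) -> normalized actions in [-1, 1], shape (N,)."""
    h = np.tanh(unit_features @ W_enc)        # (N, H), same weights for every unit
    scores = h @ w_att
    att = np.exp(scores - scores.max())
    att /= att.sum()                           # attention over units
    context = att @ h                          # (H,) fleet-level summary
    return np.tanh((h + context) @ w_out)      # (N,) per-unit setpoints

a3 = policy(rng.normal(size=(3, F)))           # 3-unit fleet
a10 = policy(rng.normal(size=(10, F)))         # 10-unit fleet, same parameters
print(a3.shape, a10.shape)
```

The same `W_enc`, `w_att`, and `w_out` serve any N, which is precisely why the number of learnable parameters stays constant as the fleet grows.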
From a computational-complexity perspective, classical centralized MPC implementations scale approximately as O(N³) in the number of BESS units, owing to the matrix factorizations involved in solving the underlying quadratic program at each control step. This cubic scaling quickly becomes prohibitive for large fleets (N ≥ 50) when control intervals are on the order of one minute.
In contrast, the inference cost of the centralized HPAC controller scales roughly linearly with N. The dimensionality of the state and action vectors grows with the number of BESS units, but the policy itself is implemented as a fixed-size neural network whose dominant operations are matrix-vector multiplications. With parameter sharing and batched computation across units, the per-step inference time remains compatible with real-time constraints even for large N on modern embedded hardware. This linear scaling makes HPAC and its MARL extensions better suited to large-scale mini-grids than centralized MPC.
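The cubic-versus-linear argument can be made concrete with a back-of-envelope FLOP count. The horizon, state-dimension formula, and constants below are illustrative assumptions (the 256-unit hidden width matches Table 4); the point is the growth rate, not the absolute numbers.

```python
HORIZON = 24      # MPC prediction horizon in steps (assumed)
HIDDEN = 256      # SAC hidden-layer width (Table 4)

def mpc_flops(n_bess: int) -> int:
    """Rough FLOPs for one dense factorization of the MPC quadratic
    program: ~ d^3 / 3 with d = n_bess * HORIZON decision variables."""
    d = n_bess * HORIZON
    return d ** 3 // 3

def hpac_flops(n_bess: int) -> int:
    """Rough FLOPs for one actor forward pass: matrix-vector products
    whose sizes grow only linearly in the state/action dimensions."""
    state_dim = 4 * n_bess + 32   # per-unit features + shared context (assumed)
    action_dim = n_bess
    return 2 * (state_dim * HIDDEN + HIDDEN * action_dim)

for n in (5, 50, 500):
    print(n, mpc_flops(n), hpac_flops(n))
```

Doubling the fleet multiplies the MPC cost by eight but merely (roughly) doubles the policy-inference cost, which is the asymmetry exploited by HPAC.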
In future work, we plan to complement this theoretical argument with empirical timing results across mini-grids of increasing size and to explore MARL variants of HPAC that preserve the predictive-adaptive hierarchy while distributing the control task across multiple cooperating agents.

Author Contributions

Conceptualization, I.I. and S.J.; methodology, I.I.; software, I.I.; validation, I.I., S.J., Y.T. and V.V.; formal analysis, I.I.; investigation, S.J.; resources, Y.T. and V.V.; data curation, S.J.; writing-original draft preparation, I.I. and S.J.; writing-review and editing, Y.T. and V.V.; visualization, I.I.; supervision, Y.T. and V.V.; project administration, Y.T. and V.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study uses publicly available datasets and tools cited in the manuscript. Processed data and code supporting the findings are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AC: Alternating Current
AI: Artificial Intelligence
ANN: Artificial Neural Network
ARIMA: Autoregressive Integrated Moving Average
BESS: Battery Energy Storage System
CPU: Central Processing Unit
DC: Direct Current
DL: Deep Learning
DQN: Deep Q-Network
DRL: Deep Reinforcement Learning
EFC: Equivalent Full Cycle
EMS: Energy Management System
EV: Electric Vehicle
FL: Federated Learning
GHI: Global Horizontal Irradiance
GPU: Graphics Processing Unit
HIL: Hardware-in-the-Loop
HPAC: Hierarchical Predictive-Adaptive Control
KPI: Key Performance Indicator
LSTM: Long Short-Term Memory
LV: Low Voltage
MARL: Multi-Agent Reinforcement Learning
MDP: Markov Decision Process
MPC: Model Predictive Control
NMC: Nickel Manganese Cobalt (Li-ion chemistry)
NSRDB: National Solar Radiation Database
PCC: Point of Common Coupling
PV: Photovoltaic
RES: Renewable Energy Sources
RL: Reinforcement Learning
RNN: Recurrent Neural Network
SAC: Soft Actor-Critic
SoC: State of Charge
SoH: State of Health
ToU: Time-of-Use
V2G: Vehicle-to-Grid

References

  1. IEEE Power & Energy Society. Machine Learning for Microgrids: Resiliency, Stability, Control, and Operation; IEEE PES Trending Technologies; IEEE: Piscataway, NJ, USA, 2025; Available online: https://ieee-pes.org/trending-tech/machine-learning-for-microgrids-resiliency-stability-control-and-operation/ (accessed on 17 December 2025).
  2. Ali, J.S.; Qiblawey, Y.; Alassi, A.; Massoud, A.M.; Muyeen, S.M.; Abu-Rub, H. Power System Stability With High Penetration of Renewable Energy Sources: Challenges, Assessment, and Mitigation Strategies. IEEE Access 2025, 13, 39912–39934. [Google Scholar] [CrossRef]
  3. Singh, A.R.; Kumar, R.S.; Bajaj, M.; Khadse, C.B.; Zaitsev, I. Machine Learning-Based Energy Management and Power Forecasting in Grid-Connected Microgrids with Multiple Distributed Energy Sources. Sci. Rep. 2024, 14, 19207. [Google Scholar] [CrossRef]
  4. Tasmant, H.; Bossoufi, B.; Alaoui, C.; Siano, P. A Review of Machine Learning and IoT-Based Energy Management Systems for AC Microgrids. Comput. Electr. Eng. 2025, 127, 110563. [Google Scholar] [CrossRef]
  5. Bagul, P.P. Comparative Analysis of Machine Learning Techniques for Microgrid Energy Management. J. Electr. Syst. 2024, 20, 4492–4502. [Google Scholar]
  6. Barbalho, P.I.N.; Moraes, A.L.; Lacerda, V.A.; Barra, P.H.A.; Fernandes, R.A.S.; Coury, D.V. Reinforcement Learning Solutions for Microgrid Control and Management: A Survey. IEEE Access 2025, 13, 39782–39799. [Google Scholar] [CrossRef]
  7. ESMAP; World Bank. Mini Grids for Half a Billion People: Market Outlook and Handbook for Decision Makers; Energy Sector Management Assistance Program (ESMAP); The International Bank for Reconstruction and Development/The World Bank: Washington, DC, USA, 2019. [Google Scholar]
  8. International Electrotechnical Commission. Minigrids & Microgrids. 2025. Available online: https://www.iec.ch/energies/minigrids-microgrids (accessed on 17 November 2025).
  9. Guo, W.; Sun, S.; Tao, P.; Li, F.; Ding, J.; Li, H. A Deep Learning-Based Microgrid Energy Management Method Under the Internet of Things Architecture. Int. J. Gaming Comput.-Mediat. Simul. 2024, 16, 1–19. [Google Scholar] [CrossRef]
  10. Ginzburg, E.; Segev, I.; Levron, Y.; Keren, S. Comparing Traditional and Reinforcement-Learning Methods for Energy Storage Control. arXiv 2025, arXiv:2506.00459. [Google Scholar] [CrossRef]
  11. Ghanbari, N.; Bhattacharya, S. SoC Balancing of Different Energy Storage Systems in DC Microgrids Using Modified Droop Control. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018. [Google Scholar]
  12. Ghanbari, N.; Mobarrez, M.; Bhattacharya, S. A Review and Modeling of Different Droop Control Based Methods for Battery State of the Charge Balancing in DC Microgrids. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018. [Google Scholar]
  13. Tian, G.; Zheng, Y.; Liu, G.; Zhang, J. SoC Balancing and Coordinated Control Based on Adaptive Droop Coefficient Algorithm for Energy Storage Units in DC Microgrid. Energies 2022, 15, 2943. [Google Scholar] [CrossRef]
  14. Yang, H.; Zhang, N.; Gao, B. The DC Microgrid-Based SoC Adaptive Droop Control Algorithm. J. Circuits Syst. Comput. 2023, 32, 2350205. [Google Scholar] [CrossRef]
  15. Belal, E.K.; Yehia, D.M.; Azmy, A.M. Adaptive Droop Control for Balancing SoC of Distributed Batteries in DC Microgrids. IET Gener. Transm. Distrib. 2019, 13, 4667–4676. [Google Scholar] [CrossRef]
  16. Li, Y.; Wang, D.; Ni, Y.; Song, K. A SOC Balancing Adaptive Droop Control Scheme for Multiple Energy Storage Modules in DC Microgrids. J. Phys. Conf. Ser. 2024, 2849, 012102. [Google Scholar]
  17. Ioannou, I.I.; Javaid, S.; Christophorou, C.; Vassiliou, V.; Pitsillides, A.; Tan, Y. A Distributed AI Framework for Nano-Grid Power Management and Control. IEEE Access 2024, 12, 43350–43377. [Google Scholar] [CrossRef]
  18. Ioannou, I.; Javaid, S.; Tan, Y.; Vassiliou, V. Autonomous Reinforcement Learning for Intelligent and Sustainable Autonomous Microgrid Energy Management. Electronics 2025, 14, 2691. [Google Scholar] [CrossRef]
  19. Ji, Y.; Wang, J.; Xu, J.; Fang, X.; Zhang, H. Real-Time Energy Management of a Microgrid Using Deep Reinforcement Learning. Energies 2019, 12, 2291. [Google Scholar] [CrossRef]
  20. Joung, K.W.; Park, J.W. Conventional Droop Methods for Microgrids. In Power Systems; Springer: Cham, Switzerland, 2021; pp. 255–274. [Google Scholar] [CrossRef]
  21. Tanaka, M. Modeling and Control of Photovoltaic-Based Microgrid. In Proceedings of the International Conference on Simulation of Semiconductor Processes and Devices, Tokyo, Japan, 8–10 September 2011. [Google Scholar]
  22. Gkavanoudis, S.I.; Oureilidis, K.O.; Demoulias, C.S. An Adaptive Droop Control Method for Balancing the SoC of Distributed Batteries in AC Microgrids. In Proceedings of the 2016 IEEE 17th Workshop on Control and Modeling for Power Electronics, Trondheim, Norway, 27–30 June 2016. [Google Scholar]
  23. Matrone, S.; Pozzi, A.; Ogliari, E.; Leva, S. Deep Learning-Based Predictive Control for Optimal Battery Management in Microgrids. IEEE Access 2024, 12, 141580–141593. [Google Scholar] [CrossRef]
  24. Ye, X.; Tang, F.; Song, X.; Dai, H.; Li, X.; Mu, S.; Hao, S. Modeling, Simulation, and Risk Analysis of Battery Energy Storage Systems in New Energy Grid Integration Scenarios. Energy Eng. 2024, 121, 3689–3710. [Google Scholar] [CrossRef]
  25. Luo, Q.; Wang, J.; Huang, X.; Li, S. A Fast State-of-Charge (SOC) Balancing and Current Sharing Control Strategy for Distributed Energy Storage Units in a DC Microgrid. Energies 2024, 17, 3885. [Google Scholar] [CrossRef]
  26. Bhosale, R.; Gupta, R.; Agarwal, V. A Novel Control Strategy to Achieve SOC Balancing for Batteries in a DC Microgrid Without Droop Control. IEEE Trans. Ind. Appl. 2021, 57, 4196–4206. [Google Scholar] [CrossRef]
  27. Fagundes, T.A.; Fuzato, G.H.F.; Silva, L.J.R.; Alonso, A.M.S.; Vasquez, J.C.; Guerrero, J.M.; Machado, R.Q. Battery Energy Storage Systems in Microgrids: A Review of SoC Balancing and Perspectives. IEEE Open J. Ind. Electron. Soc. 2024, 5, 961–992. [Google Scholar] [CrossRef]
  28. Hu, C.; Cai, Z.; Zhang, Y.; Rudai, Y.; Cai, Y.; Cen, B. A Soft Actor-Critic Deep Reinforcement Learning Method for Multi-Timescale Coordinated Operation of Microgrids. Prot. Control. Mod. Power Syst. 2022, 7, 29. [Google Scholar] [CrossRef]
  29. Liu, D.; Zang, C.; Zeng, P.; Li, W.; Wang, X.; Liu, Y.; Xu, S. Deep Reinforcement Learning for Real-Time Economic Energy Management of Microgrid System Considering Uncertainties. Front. Energy Res. 2023, 11, 1163053. [Google Scholar] [CrossRef]
  30. Chen, S.; Liu, J.; Cui, Z.; Chen, Z.; Wang, H.; Xiao, W. A Deep Reinforcement Learning Approach for Microgrid Energy Transmission Dispatching. Appl. Sci. 2024, 14, 3682. [Google Scholar] [CrossRef]
  31. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in Time Series: A Survey. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 19–25 August 2023. [Google Scholar]
  32. Caetano, R.; Oliveira, J.M.; Ramos, P. Transformer-Based Models for Probabilistic Time Series Forecasting with Explanatory Variables. Mathematics 2025, 13, 814. [Google Scholar] [CrossRef]
  33. Syed, M.; Clark, B. Multivariate Time Series Forecasting with Sktime. IBM Developer Tutorial. 2023. Available online: https://www.ibm.com/think/tutorials/sktime-multivariate-time-series-forecasting (accessed on 21 August 2025).
  34. Galindo Padilha, G.A.; Ko, J.; Jung, J.J.; de Mattos Neto, P.S.G. Transformer-Based Hybrid Forecasting Model for Multivariate Renewable Energy. Appl. Sci. 2022, 12, 10985. [Google Scholar] [CrossRef]
  35. Sakib, S.; Mahadi, M.K.; Abir, S.R.; Moon, A.M.; Shafiullah, A.; Ali, S.; Faisal, F.; Nishat, M.M. Attention-Based Models for Multivariate Time Series Forecasting: Multi-step Solar Irradiation Prediction. Heliyon 2024, 10, e27795. [Google Scholar] [CrossRef]
  36. Yurdem, B.; Kuzlu, M.; Güllü, M.K.; Çatak, F.Ö. Federated Learning: Overview, Strategies, Applications, Tools and Future Directions. Heliyon 2024, 10, e38137. [Google Scholar] [CrossRef]
  37. Federated Learning for Smart Grid: A Survey on Applications and Potential Vulnerabilities. arXiv 2024, arXiv:2409.10764. [CrossRef]
  38. Moveh, S.; Merchán-Cruz, E.A.; Abuhussain, M.; Dodo, Y.A.; Alhumaid, S.; Alhamami, A.H. Deep Learning Framework Using Transformer Networks for Multi Building Energy Consumption Prediction in Smart Cities. Energies 2025, 18, 1468. [Google Scholar] [CrossRef]
  39. Zhong, B. Deep Learning Integration Optimization of Electric Energy Load Forecasting and Market Price Based on the ANN-LSTM-Transformer Method. Front. Energy Res. 2023, 11, 1292204. [Google Scholar] [CrossRef]
  40. Grataloup, A.; Jonas, S.; Meyer, A. A review of federated learning in renewable energy applications: Potential, challenges, and future directions. Energy AI 2024, 17, 100375. [Google Scholar] [CrossRef]
  41. Naidji, I.; Choucha, C.; Ramdani, M. Decentralized Federated Learning Architecture for Networked Microgrids. In Proceedings of the 20th International Conference on Informatics in Control, Automation and Robotics (ICINCO), Rome, Italy, 13–15 November 2023; SciTePress: Setúbal, Portugal, 2023; pp. 291–294. [Google Scholar] [CrossRef]
  42. Meng, H.; Bin, H.; Qian, F.; Xu, T.; Wang, C.; Liu, W.; Yao, Y.; Ruan, Y. Enhanced reinforcement learning-model predictive control for distributed energy systems: Overcoming local and global optimization limitations. Build. Simul. 2025, 18, 547–567. [Google Scholar] [CrossRef]
  43. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  44. OpenAI. Soft Actor-Critic-Spinning Up Documentation. 2018. Available online: https://spinningup.openai.com/en/latest/algorithms/sac.html (accessed on 21 August 2025).
  45. Berkeley AI Research (BAIR) Lab. Soft Actor Critic-Deep Reinforcement Learning with Real-World Robots. 2018. Available online: https://bair.berkeley.edu/blog/2018/12/14/sac/ (accessed on 21 August 2025).
  46. GeeksforGeeks. Soft Actor-Critic Reinforcement Learning Algorithm. 2022. Available online: https://www.geeksforgeeks.org/deep-learning/soft-actor-critic-reinforcement-learning-algorithm/ (accessed on 21 August 2025).
  47. National Renewable Energy Laboratory (NREL). BLAST: Battery Lifetime Analysis and Simulation Tool Suite. 2022. Available online: https://www.nrel.gov/transportation/blast (accessed on 21 August 2025).
  48. NREL SAM Forum. BESS Degradation Capacity and Power. 2020. Available online: https://sam.nrel.gov/forum/forum-general/4139-bess-degradation-capacity-and-power.html (accessed on 21 August 2025).
  49. Benlahbib, B.; Bouarroudj, N.; Mekhilef, S.; Abdeldjalil, D.; Abdelkrim, T.; Bouchafaa, F.; Lakhdari, A. Modelling, Design and Control of a Standalone Hybrid PV-Wind Micro-Grid System. Energies 2021, 14, 4849. [Google Scholar] [CrossRef]
  50. Lin, J.; Qiu, J.; Liu, G.; Yao, Z.; Yuan, Z.; Lu, X. A Fuzzy Logic Approach to Power System Security With Nonideal Electric Vehicle Battery Models in Vehicle-to-Grid Systems. IEEE Internet Things J. 2025, 12, 21876–21891. [Google Scholar] [CrossRef]
  51. National Renewable Energy Laboratory (NREL). National Solar Radiation Database (NSRDB). 2025. Available online: https://github.com/NREL/BLAST-Lite (accessed on 21 August 2025).
  52. U.S. Department of Energy. National Solar Radiation Database (NSRDB)–Dataset. 2025. Available online: https://catalog.data.gov/dataset/national-solar-radiation-database-nsrdb (accessed on 21 August 2025).
  53. Pecan Street Inc. Data Access and Pricing, 2025. Available online: https://www.pecanstreet.org/access/ (accessed on 21 August 2025).
  54. Pecan Street Inc. Published Papers Using Dataport. 2025. Available online: https://www.pecanstreet.org/dataport/papers/ (accessed on 21 August 2025).
  55. Kelly, J.; Knottenbelt, W. The UK-DALE Dataset: Domestic Appliance-Level Electricity Demand and Whole-House Demand from Five UK Homes. Sci. Data 2015, 2, 150007. [Google Scholar] [CrossRef]
  56. van der Drift, J. UK-DALE Dataset Summary. GitHub Markdown File. 2019. Available online: https://github.com/JackKelly/UK-DALE_metadata (accessed on 21 August 2025).
  57. AlMuhaini, M.; Yahaya, A.; AlAhmed, A. Distributed Generation and Load Modeling in Microgrids. Sustainability 2023, 15, 4831. [Google Scholar] [CrossRef]
  58. National Renewable Energy Laboratory (NREL). Microgrid Load and LCOE Modelling Results. 2019. Available online: https://data.nrel.gov/submissions/79 (accessed on 21 August 2025).
  59. TotalEnergies R & D. pymgrid: A Python Library to Generate and Simulate a Large Number of Microgrids. GitHub Repository. 2020. Available online: https://github.com/Total-RD/pymgrid (accessed on 21 August 2025).
  60. Henri, G.; Levent, T.; Halev, A.; Alami, R.; Cordier, P. pymgrid: An Open-Source Python Microgrid Simulator for Applied Artificial Intelligence Research. In Proceedings of the Tackling Climate Change with Machine Learning Workshop at NeurIPS 2020, Virtually, 11–12 December 2020. [Google Scholar]
  61. Towards a Scalable and Flexible Simulation and Testing Environment Toolbox for Intelligent Microgrid Control. arXiv 2020, arXiv:2005.04869. [CrossRef]
  62. The MathWorks, Inc. MATLAB, Version R2025a; The MathWorks, Inc.: Natick, MA, USA, 2025. Available online: https://www.mathworks.com/products/matlab.html (accessed on 17 December 2025).
  63. TLS Containers. Comprehensive Guide to Key Performance Indicators of Energy Storage Systems. 2025. Available online: https://www.tls-containers.com/tls-blog/comprehensive-guide-to-key-performance-indicators-of-energy-storage-systems (accessed on 21 August 2025).
  64. Xiong, B.; Zhang, L.; Hu, Y.; Fang, F.; Liu, Q.; Cheng, L. Deep reinforcement learning for optimal microgrid energy management with renewable energy and electric vehicle integration. Appl. Soft Comput. 2025, 176, 113180. [Google Scholar] [CrossRef]
  65. PowerNet: Multi-Agent Deep Reinforcement Learning for Scalable Powergrid Control. arXiv 2020, arXiv:2011.12354.
  66. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  67. Kong, W.; Dong, Z.Y.; Jia, Y.; Hill, D.J.; Xu, Y.; Zhang, Y. Short-Term Residential Load Forecasting Based on LSTM Recurrent Neural Network. IEEE Trans. Smart Grid 2019, 10, 841–851. [Google Scholar] [CrossRef]
  68. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
Figure 1. The architecture of the HPAC framework. The upper-layer Federated Predictive Engine provides multi-horizon net-load forecasts and associated uncertainty to the lower-layer adaptive controller. The controller uses a Soft Actor-Critic agent to dispatch real-time setpoints, using Dynamic Reward Shaping to balance SoC, voltage stability, and economics based on predictive uncertainty.
Figure 2. Impact of w soc and w econ on the total operating cost of the HPAC controller over a 24 h base-case simulation.
Figure 3. Training convergence of the SAC agent in HPAC, showing actor (left) and critic (right) network losses over the training steps.
Figure 4. SoC trajectories in the high-volatility scenario for HPAC, Distributed MPC, and a representative rule-based controller. HPAC maintains SoC balance under rapid net-load changes, demonstrating robustness of the predictive-adaptive architecture.
Figure 5. Impact of forecast communication impairments on SoC and bus voltage. HPAC leverages its learned policy and local measurements to maintain safe operation even when forecast updates are intermittently frozen.
Figure 6. State-of-charge trajectories for all 14 controllers over three simulated days. Each panel corresponds to one controller and displays the mean SoC across all BESS units (solid line) together with a shaded envelope indicating the minimum and maximum SoC at each time step. This visualization mitigates the visual clutter associated with plotting all individual trajectories as overlapping curves.
Figure 7. Aggregate performance metrics for all controllers: total operating cost, battery energy throughput, SoC variance, and bus-voltage variance. These results are consistent with the tabulated values in Table 6.
Figure 8. Grid exchange and bus-voltage trajectories for selected controllers (HPAC (Simple), Fuzzy Logic (Advanced), and HPAC (Manuscript)). Colored legends highlight each approach, with the nominal voltage indicated by a dashed red line.
Table 1. Main symbols and their definitions.
Symbol | Definition
N | Number of BESS units in the mini-grid
i | Index of a BESS unit (i = 1, …, N)
t | Discrete time index
Δt | Control time step duration
SoC_i | State of charge of BESS unit i
SoC̄(t) | Average state of charge across all BESS units at time t
σ_SoC | Time-averaged standard deviation of SoC across the BESS fleet
SoH_i | State of health of BESS unit i
SoC_min, SoC_max | Minimum and maximum admissible SoC limits
P_bess,i | Charge/discharge power of BESS unit i (positive for discharge)
P_grid | Power exchanged with the main grid (import/export)
P_net | Net load (load minus renewable generation)
P_PV | Photovoltaic array power output
P_curtailed | Curtailed renewable power
E_rated,i | Rated energy capacity of BESS unit i
C_rate,i | C-rate (normalized charge/discharge rate) of BESS unit i
C(t) | Electricity price at time t
V_bus | Bus voltage magnitude
V_nom | Nominal bus voltage
V_min, V_max | Lower and upper admissible voltage limits
H | Forecast horizon length (number of steps)
T_predict | Slow-timescale forecasting interval
T_control | Fast-timescale control interval
R_soc, R_econ, R_soh, R_volt | Reward components for SoC balancing, economic performance, SoH preservation, and voltage stability
w_soc, w_econ, w_soh, w_volt | Scalar weights for the corresponding reward components
η_BESS | Round-trip efficiency of the BESS units
η_PV | PV module efficiency
A_PV | Total PV array area
G_POA | Plane-of-array solar irradiance
γ | PV temperature coefficient
T_cell | PV cell temperature
T_ref | Reference PV cell temperature
L | Cumulative capacity loss (total degradation)
L_cal, L_cyc | Calendar- and cycle-induced contributions to degradation
λ_1, λ_2 | Coefficients in the degradation-related reward term
RMSE_V | Root-mean-square error of bus voltage with respect to V_nom
EFC | Equivalent full cycles of the BESS fleet over the evaluation horizon
Table 2. Qualitative comparison of representative control/learning paradigms for SoC balancing and energy management in multi-BESS mini-grids.
Approach | Predictive Capability | SoC Balancing Objective | Multi-Objective Coordination | Computational/Implementation Burden
Droop-based & decentralized SoC control
Fagundes et al. (review) [27] | None (survey paper; synthesizes reactive and hierarchical controls) | Explicit focus on SoC equalization strategies across architectures | Discusses integration of SoC control with converter topology and EMS objectives | N/A (methodological review rather than a specific controller)
Tian et al. adaptive droop [13] | None (purely reactive measurements) | Explicit via SoC-based adaptive droop coefficient and voltage-recovery loop | Limited: SoC and voltage; economic and degradation terms not explicitly optimized | Low: decentralized primary control with fuzzy droop tuning
Luo et al. fast SoC balancing [25] | None (no explicit forecasting) | Explicit: accelerates SoC convergence while maintaining current sharing | Limited: focuses on SoC variance and converter current constraints | Low–moderate: more complex droop law but still local control
Bhosale et al. non-droop SoC control [26] | None (centralized but reactive) | Explicit SoC balancing via coordinated current references | Some capability to incorporate power-rating constraints; economics/degradation indirect | Moderate: requires a centralized coordinator and communications
MPC and optimization-based EMS
Generic MPC EMS [3] | Explicit through deterministic/stochastic forecasts over finite horizon | Can be included via SoC penalties and hard SoC bounds | Strong: multiple objectives and constraints in a single optimization | High online complexity; sensitive to model mismatch and horizon length
Risk-aware/robust MPC [24] | Explicit, using probabilistic/scenario forecasts | SoC treated via constraints and risk-aware penalties | Strong: incorporates uncertainty and multiple risk measures | Very high: robust formulations require large scenario sets and heavy solvers
DRL-based micro-grid controllers
DRL EMS survey [6] | Optional (forecasts treated as exogenous features) | SoC mainly via constraints and penalties; balancing rarely explicit | Strong in principle via reward shaping; mostly cost/reliability oriented | Offline training cost can be high; online inference is cheap
DDPG real-time EMS [29] | Implicit: learns policy from historical stochastic trajectories | SoC enforced via constraints; no explicit multi-BESS SoC-variance term | Multi-objective (cost, penalties) but not SoC-balancing centric | Moderate: offline DDPG training; online evaluation is lightweight
HDQN-based dispatch [30] | Limited: uses historical data and short-term signals | SoC considered via constraint-handling and reward penalties | Coordinates network congestion and economic objectives | Moderate: hierarchical Q-learning; low-cost policy evaluation
Ioannou et al. distributed DRL [18] | Optional forecasts integrated as state features | SoC typically handled as constraint/soft penalty, not primary reward term | Multi-agent, multi-service coordination (ancillary services, economics) | Moderate: multi-agent training; decentralized, low-cost execution
Forecasting and federated learning layers
Transformer building-load forecasting [38] | Strong: multi-horizon Transformer for multi-building energy series | Not applicable (pure forecaster; no direct SoC control) | Indirect: improves inputs to EMS/SoC controllers | Moderate: training cost higher than RNNs; prediction cost modest
ANN–LSTM–Transformer hybrid [39] | Strong: joint long-horizon forecasting of load and prices | Not applicable directly | Enables EMS to balance cost/SoC via better price/load information | Moderate: hybrid deep model; inference still feasible at edge
FL in renewable energy overview [40] | N/A (survey of methods) | Not directly targeted | Highlights how FL can support multi-objective EMS learning under privacy constraints | N/A (review of algorithmic and systems-level trade-offs)
Decentralized FL for micro-grids [41] | Participating MGs collaboratively train EMS models | SoC is typically treated as part of the local EMS objective | Multi-objective EMS (cost, self-sufficiency) with privacy guarantees | Moderate: periodic model aggregation; no raw-data sharing
Proposed framework
Proposed HPAC [18] | Dedicated federated Transformer providing multi-horizon probabilistic net-load forecasts | SoC balancing is a first-class objective via explicit SoC-variance terms and forecast-aware reward shaping | Systematic multi-objective reward integrating SoC, economics, degradation-aware operation, and voltage stability | Training effort amortized offline; online control reduces to lightweight SAC policy inference, scalable to large BESS fleets
Note: Bold text highlights the proposed HPAC framework (our work).
Table 3. MDP formulation for SoC balancing in a multi-BESS mini-grid.
Component | Definition
State space S | Continuous vector representing the observable state at time t, including the following: (i) BESS states: SoC and SoH for each of the N BESS units; (ii) system power states: current net load P_net(t) and grid exchange power P_grid(t); (iii) predictive context: forecast sequence μ_net(t+1), σ²_net(t+1), …, μ_net(t+H), σ²_net(t+H); (iv) economic context: current electricity price C(t); and (v) temporal context: cyclic time features (e.g., hour-of-day and day-of-week encodings).
Action space A | Continuous vector a_t = [P_bess,1(t), …, P_bess,N(t)], where each element is a normalized value in [−1, 1] representing the desired power setpoint for a BESS unit. A value of +1 corresponds to maximum discharge, −1 to maximum charge, and 0 to idle. Actual power in kW is obtained by scaling by unit ratings.
Reward function R | A scalar R(t) combining multiple objectives as follows:
R(t) = w_soc R_soc(t) + w_econ R_econ(t) + w_soh R_soh(t) + w_volt R_volt(t),
where R_soc penalizes SoC variance, R_econ captures economic performance, R_soh penalizes degradation-inducing behavior, and R_volt penalizes bus-voltage deviations.
Transition dynamics | Governed by the physical dynamics of the mini-grid (BESS, loads, network). The DRL agent interacts with a high-fidelity simulation (digital twin) that provides next states s_{t+1} for a given (s_t, a_t) without requiring an explicit analytic model.
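As a minimal illustration, the multi-objective reward of Table 3 can be computed as a weighted sum of simplified component terms. The specific shaping of each component below (variance penalty, throughput proxy for degradation, quadratic voltage penalty) and the default weights are assumptions consistent with the text, not the paper's exact formulation.

```python
import numpy as np

def reward(soc, p_bess_kw, p_grid_kw, price, v_bus, v_nom=1.0,
           w_soc=1.0, w_econ=0.5, w_soh=0.2, w_volt=0.5, dt_h=0.25):
    soc = np.asarray(soc, dtype=float)
    r_soc = -np.var(soc)                        # penalize SoC spread across units
    r_econ = -p_grid_kw * price * dt_h          # import cost (export makes it a gain)
    r_soh = -float(np.sum(np.abs(p_bess_kw))) * dt_h   # throughput as degradation proxy
    r_volt = -(v_bus - v_nom) ** 2              # penalize bus-voltage deviation
    return w_soc * r_soc + w_econ * r_econ + w_soh * r_soh + w_volt * r_volt

balanced = reward([0.5, 0.5, 0.5], [0, 0, 0], 0.0, 0.10, 1.0)
unbalanced = reward([0.2, 0.5, 0.8], [0, 0, 0], 0.0, 0.10, 1.0)
print(balanced, unbalanced)
```

With all other terms equal, the balanced fleet earns the higher reward, which is the gradient signal driving the SAC agent toward SoC equalization.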
Table 4. Hyperparameters of the Soft Actor-Critic agent in HPAC.
Parameter | Value
Optimizer | Stochastic gradient descent with fixed learning rates
Learning rate (actor) | 1 × 10⁻⁴
Learning rate (critic) | 3 × 10⁻⁴
Discount factor (γ) | 0.99
Replay buffer size | 10⁵ transitions
Batch size | 128
Target smoothing coefficient (τ) | 0.005
Hidden layers (actor) | One hidden layer of 256 units with tanh activation for the mean and log-standard-deviation networks
Hidden layers (critic) | Two hidden layers of 256 units with tanh activations in each critic
Temperature (α) | Fixed entropy coefficient, α = 0.1 (no automatic tuning)
Training budget | Maximum of 200 SAC gradient-update steps per experiment (max_train_steps = 200)
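The target smoothing coefficient τ = 0.005 in Table 4 refers to the standard Polyak averaging that SAC uses to update its target critics after each gradient step. A minimal sketch of that update (generic SAC mechanics on toy parameter vectors, not the paper's implementation):

```python
import numpy as np

TAU = 0.005  # target smoothing coefficient from Table 4

def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Toy parameter vectors standing in for critic weights
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(3):          # three gradient steps' worth of target updates
    target = soft_update(target, online)
print(target[0])            # each entry ≈ 1 - (1 - 0.005)**3 ≈ 0.0149
```

The small τ makes the target networks trail the online critics slowly, which stabilizes the bootstrapped value targets during training.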
Table 5. Key performance indicators (KPIs) for HPAC evaluation.
Category | KPI | Definition and Rationale
Technical | SoC standard deviation σ_SoC | Time-averaged standard deviation of SoCs across the N BESS units: σ_SoC = avg_t sqrt( (1/N) Σ_{i=1}^{N} (SoC_i(t) − SoC̄(t))² ). Lower values indicate better SoC balancing.
Technical | Bus voltage deviation (RMSE) | Root mean square error between bus voltage and nominal: RMSE_V = sqrt( (1/T) Σ_{t=1}^{T} (V_bus(t) − V_nom)² ). Quantifies stability and power quality.
Technical | Renewable curtailment (%) | Fraction of available renewable energy not utilized: (P_curtailed / P_available) × 100%. Indicates storage management effectiveness.
Technical | Loss of load probability (LOLP) | Probability that load exceeds supply plus maximum discharge capability. Measures reliability.
Economic | Daily operating cost | Net daily cost of grid interaction: Σ_t [P_import(t) · C(t) − P_export(t) · C(t)].
Economic | Levelized cost of storage (LCOS) | Average cost per MWh discharged: LCOS = (CAPEX + Σ_t (O&M_t + Cost_degradation,t)) / Σ_t E_discharged,t.
Economic | Return on investment (ROI) | Profitability: ROI = (Net profit / Total investment) × 100%.
Longevity | Average SoH degradation rate | Time-averaged rate of SoH decline across the BESS fleet (%/year). Reflects the impact of control on asset lifetime.
Longevity | Equivalent full cycles (EFC) | Total energy throughput normalized by capacity: EFC = ∫ |P_bess(t)| dt / E_rated.
Longevity | Time in extreme SoC ranges | Percentage of time SoC lies outside a safe window (e.g., <20% or >90%). High values correlate with accelerated degradation.
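Two of the Table 5 KPIs reduce to short computations on logged traces. A minimal sketch with made-up illustrative data (the function names and sample values are our own, not from the paper's codebase):

```python
import numpy as np

def soc_std(soc_matrix):
    """Time-averaged cross-unit SoC standard deviation (sigma_SoC, Table 5).

    soc_matrix: array of shape (T, N), SoC of N BESS units over T time steps.
    """
    return np.mean(np.std(soc_matrix, axis=1))

def equivalent_full_cycles(p_bess_kw, dt_h, e_rated_kwh):
    """EFC = (integral of |P_bess| dt) / E_rated (Table 5)."""
    return np.sum(np.abs(p_bess_kw)) * dt_h / e_rated_kwh

# Two perfectly balanced units over three time steps -> sigma_SoC = 0
soc = np.array([[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]])
print(soc_std(soc))  # 0.0

# 10 kW discharge for 2 h, then 10 kW charge for 2 h, on a 20 kWh unit -> 2 EFC
p = np.array([10.0, 10.0, -10.0, -10.0])
print(equivalent_full_cycles(p, dt_h=1.0, e_rated_kwh=20.0))  # 2.0
```

Note that EFC counts charge and discharge throughput symmetrically, so a full charge-discharge round trip of the rated capacity contributes two equivalent cycles under this definition.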
Table 7. Ablation study on the role of forecasts and reward shaping.
Controller | Total Cost ($) | SoC Variance | Voltage Variance (V²)
HPAC (No-forecast) | 946.10 | 0.0034 | 1.6921
HPAC (State-only) | 924.89 | 0.0687 | 2.0252
HPAC (Full) | 828.22 | 0.0013 | 1.6000
Table 8. Forecasting accuracy (MSE, RMSE, and MAE) of the predictive engine, baseline forecasting models, and all controllers in the HPAC study.
Model | MSE | RMSE | MAE
Transformer (Predictive Engine of HPAC (Manuscript)) | 414.67 | 20.79 | 16.30
LSTM [66,67] | 419.50 | 20.87 | 16.77
ARIMA(1, 0, 0)-like [68] | 469.50 | 21.67 | 17.07
Naive Persistence | 473.19 | 21.75 | 17.16
HPAC (Simple) | 473.19 | 21.75 | 17.16
Rule-Based (Simple) [11,12,20] | 473.19 | 21.75 | 17.16
Rule-Based (Enhanced) [14,15,22] | 473.19 | 21.75 | 17.16
MPC (Simplified) [23,42] | 469.50 | 21.67 | 17.07
SAC (RL-based) [18,19,28,43] | 473.19 | 21.75 | 17.16
Centralized MPC [23,42] | 469.50 | 21.67 | 17.07
Distributed MPC [23,42] | 469.50 | 21.67 | 17.07
PI Controller [20,21,57] | 473.19 | 21.75 | 17.16
Q-Learning (Tabular) [6,10,29] | 473.19 | 21.75 | 17.16
PPO (Policy Gradient) [5,6,64] | 473.19 | 21.75 | 17.16
MARL (Multi-Agent RL) [6,29,65] | 473.19 | 21.75 | 17.16
Fuzzy Logic (Adv.) [27,50] | 473.19 | 21.75 | 17.16
Heuristic (Advanced) [3,5,17] | 473.19 | 21.75 | 17.16
HPAC (Manuscript) [this work] | 414.67 | 20.79 | 16.30
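The MSE, RMSE, and MAE columns of Table 8 follow the standard error definitions, and the naive-persistence baseline simply repeats the previous observation. A minimal sketch on illustrative data (the toy series below is made up, not from the study):

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Return (MSE, RMSE, MAE) as reported in Table 8."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = np.mean(err ** 2)
    return mse, np.sqrt(mse), np.mean(np.abs(err))

# Naive persistence: the forecast for step t is the observation at step t-1
y = np.array([10.0, 12.0, 11.0, 13.0])
mse, rmse, mae = forecast_metrics(y[1:], y[:-1])
print(mse, rmse, mae)  # MSE = 3.0 on this toy series
```

Controllers without a dedicated forecaster inherit the persistence-level errors, which is why most rows in Table 8 share identical metric values.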

Share and Cite

Ioannou, I.; Javaid, S.; Tan, Y.; Vassiliou, V. A Hierarchical Predictive-Adaptive Control Framework for State-of-Charge Balancing in Mini-Grids Using Deep Reinforcement Learning. Electronics 2026, 15, 61. https://doi.org/10.3390/electronics15010061