1. Introduction
The global energy landscape is undergoing a profound transformation driven by the dual imperatives of decarbonization and energy security. At the forefront of this transition are mini-grids—localized, self-sufficient electrical grids that can operate either in conjunction with the main utility grid or in an autonomous, islanded mode. These systems enable integration of high penetrations of renewable energy sources (RES), enhance resilience against disruptions, and extend electricity access to remote communities. Recent white papers and surveys emphasize that high-renewable micro-grids face nontrivial stability and resilience challenges and are a natural application domain for advanced AI- and ML-enabled control [
1,
2,
3,
4,
5,
6]. Mini-grids are small-scale electricity systems that combine local generation (often renewable), distribution networks, and end-user loads to supply a defined group of customers—frequently in remote or weak-grid areas—and can operate either autonomously or interconnected with the main grid [
7,
8].
Central to the functionality and stability of modern, renewable-rich mini-grids are battery energy storage systems (BESS). BESS act as buffers against the intermittency of RES, such as solar and wind, by absorbing surplus generation and discharging to meet load demand, thereby supporting a continuous, stable power supply. BESS also play a key role in emerging ML-based micro-grid energy management and power forecasting frameworks [
3,
9,
10].
As mini-grids grow in complexity, they increasingly deploy multiple, often heterogeneous BESS units distributed throughout the network. In such multi-BESS configurations, the primary control challenge extends beyond simple power sharing. It becomes a complex optimization problem centered on maintaining SoC balance across the entire storage fleet. A growing body of work studies SoC-aware control and balancing strategies in DC and AC micro-grids using droop-based and adaptive methods [
11,
12,
13,
14,
15,
16].
Effective SoC balancing is critical for the following reasons:
It prevents chronic overuse or underutilization of specific battery units, which would otherwise lead to accelerated degradation and premature failure. By ensuring that all units contribute equitably to system operation, SoC balancing prolongs the collective operational lifetime of storage assets.
It maximizes operational flexibility and resilience. A balanced BESS fleet has a greater effective capacity to respond to sudden changes in generation or load, as no single unit is prematurely constrained by reaching its upper or lower SoC limit.
The core problem is to devise a control strategy that can dynamically manage the charge and discharge cycles of multiple BESS units to keep their SoCs closely aligned, even when facing different capacities, degradation levels (state of health, SoH), and highly variable operating conditions driven by stochastic renewable generation and uncertain loads.
Recent research has started to explore distributed artificial intelligence and reinforcement-learning-based energy management schemes in nano-grid and micro-grid settings, demonstrating the potential of data-driven controllers for coordinated storage and renewable operation [
6,
9,
17,
18,
19]. However, these approaches typically do not explicitly target hierarchical SoC balancing with a federated multi-horizon forecasting layer as a first-class component of the control architecture. Control strategies for SoC balancing have evolved significantly, yet existing paradigms exhibit fundamental limitations in the dynamic, uncertain environment of a renewable-heavy mini-grid. The foundational method for decentralized power sharing in micro-grids is droop control. This technique emulates the behavior of synchronous generators in traditional power systems, creating an artificial relationship between a unit’s power output and the grid’s frequency (AC systems) or voltage (DC systems) [
20,
21]. Its primary advantages are simplicity and communication-free operation, allowing plug-and-play integration of parallel inverters.
However, conventional droop control suffers from inherent drawbacks that compromise its effectiveness as follows:
There is a fundamental trade-off between power-sharing accuracy and voltage regulation: the mechanism that enables power sharing also introduces deviations in bus voltage.
Performance is highly sensitive to mismatched line impedances between BESS units and the point of common coupling, leading to inaccurate current sharing and divergent SoC trajectories.
To overcome these limitations, a rich body of work has proposed SoC-aware and adaptive droop strategies. A comprehensive review of droop-based SoC-balancing methods for DC micro-grids is provided in [
12], where fixed SoC-compensated droop, adaptive droop, and virtual-impedance-based schemes are compared. In SoC-based adaptive droop control, the droop coefficient of each converter is adjusted in real time as a function of its corresponding battery SoC. Typical designs assign smaller droop coefficients to units with higher SoC during discharging and larger coefficients during charging, encouraging high-SoC units to supply more power and absorb less power, and vice versa for low-SoC units [
11,
14,
16].
Several variants refine this idea to account for converter ratings, battery capacities, and line impedances. For example, capacity-aware adaptive droop schemes scale SoC-dependent coefficients by the usable capacity of each BESS to prevent overutilization of weaker units and to improve fairness among heterogeneous batteries [
15]. Other works introduce auxiliary feedback terms based on current or power to mitigate the impact of unequal line resistances and to restore bus voltage while retaining SoC balancing performance [
13,
22]. Overall, these schemes demonstrate that SoC balancing can be achieved in a decentralized and relatively inexpensive manner. Despite these advances, droop-based SoC balancing remains fundamentally reactive. Control actions are driven by instantaneous SoC differences and local measurements, without explicit consideration of future net-load patterns, price signals, or degradation trajectories. Most formulations are single-objective, with SoC equalization as the dominant design goal, and therefore cannot directly coordinate SoC balancing with economic optimization or lifetime-aware operation under uncertainty.
Model predictive control (MPC) and related optimization-based approaches have been widely investigated for energy management and optimal battery operation in mini-grids and micro-grids. In typical formulations, a cost function combining grid energy purchase cost, renewable curtailment penalties, and sometimes battery degradation or SoC-related penalties is minimized over a finite horizon, subject to constraints on power balance, converter limits, and SoC bounds. When coupled with short-term forecasts of load and renewable generation, MPC can exploit foresight to schedule charge/discharge actions, diesel generation, and grid exchange to improve economic performance and reliability [
3,
23]. In micro-grid settings with high renewable penetration, MPC has been applied in centralized and hierarchical schemes. A slower supervisory MPC may perform hour-ahead or day-ahead scheduling, while a faster inner loop enforces voltage and frequency constraints at the converter level. Deep learning-based forecasting modules have also been integrated into MPC frameworks, where neural network predictors supply multi-step load or solar forecasts used as exogenous inputs in the optimization [
3,
23]. Such architectures provide a principled means to handle multi-objective trade-offs, including SoC balancing, curtailment minimization, and arbitrage.
In practice, however, MPC faces two significant hurdles in real mini-grids. First, it is profoundly dependent on an accurate system model. BESS characteristics evolve with aging, renewable generation is uncertain, and load profiles are difficult to predict precisely. Model mismatch can lead to suboptimal or unstable control actions [
24]. Second, the computational complexity of real-time constrained optimization grows with the number of assets, model complexity, and horizon length. For complex mini-grids requiring fast control decisions, MPC’s computational burden can become prohibitive compared with DRL-based approaches [
6,
10].
Thus, droop control is computationally light but reactive and “nearsighted,” failing to anticipate future conditions. MPC is proactive and “farsighted” but computationally demanding and fragile to model inaccuracies. There is a clear need for a control paradigm that combines the predictive power of advanced forecasting with the real-time, model-free adaptability of modern machine learning, achieving a solution that is both proactive and computationally tractable.
This work proposes a novel control architecture, the Hierarchical Predictive–Adaptive Control (HPAC) framework, designed specifically to address the multifaceted challenge of SoC balancing in renewable-rich mini-grids.
The central thesis is that by strategically decoupling the control problem into two synergistic, hierarchically arranged layers, it is possible to overcome the inherent limitations of both purely reactive and purely model-based approaches. The HPAC framework consists of the following:
A long-horizon predictive engine: operating on a slower timescale, this layer leverages a state-of-the-art transformer-based deep learning model within a federated learning architecture to generate accurate, multi-horizon probabilistic forecasts of the mini-grid’s net load.
A real-time Adaptive Controller: operating on a faster timescale, this layer employs a Soft Actor-Critic (SAC) deep reinforcement learning (DRL) agent to make instantaneous charge/discharge decisions for each BESS unit.
The innovation lies not just in the use of these advanced techniques but in the intelligent synthesis of their capabilities. The predictive engine provides the farsightedness that reactive methods lack, while the model-free DRL-based adaptive controller provides the computational tractability and resilience to uncertainty that plague MPC. The design is informed by recent experience with distributed AI frameworks for nano- and micro-grid power management and autonomous RL-based energy management [
9,
17,
18,
19]. The forecasts generated by the upper layer are not merely passive inputs to the lower layer; they dynamically shape the DRL agent’s state representation and reward function, providing crucial context for the near-future operating environment. This hierarchical synergy enables a control strategy that is simultaneously proactive in planning and adaptive in execution, capable of robust, multi-objective SoC balancing in complex, real-world mini-grid environments.
The remainder of this paper is structured as follows.
Section 2 reviews the relevant literature and provides background on predictive-adaptive control and federated learning.
Section 3 defines the SoC balancing problem and presents the architectural overview of the HPAC framework.
Section 4 details the methodology, including the Predictive Engine design, the SAC-based Adaptive Controller, and reward function engineering with sensitivity analysis on weight selection.
Section 5 presents the performance evaluation, including a comprehensive benchmarking study against fourteen representative controllers, ablation studies on the role of forecasts and reward shaping, stress-test scenarios for robustness assessment, and a detailed analysis of trade-offs between cost, throughput, and degradation.
Section 6 concludes the paper with a summary of key findings, limitations, and directions for future work, including scalability considerations and deployment pathways.
Table 1 summarizes the main notation used throughout the paper and provides a single, consistent reference for all model, control, and evaluation terms. This table is used to avoid ambiguity and to ensure the subsequent problem formulation, controller description, and benchmarking results are interpreted consistently.
5. Performance Evaluation and Benchmarking
This section presents a comprehensive evaluation of the HPAC framework over a 72 h (3 day) simulation horizon. We first describe the implementation details and hyperparameters of the SAC agent, followed by the mini-grid topology and simulation parameters used in the case study. We then define key performance indicators and baseline controllers, present the quantitative benchmarking results against fourteen representative controllers, and analyze performance trade-offs. Ablation studies isolate the contributions of forecasting and dynamic reward shaping. Finally, stress-test scenarios assess robustness under high-volatility conditions and communication impairments, demonstrating the practical viability of the HPAC approach.
5.1. Implementation Details and Hyperparameters
The HPAC framework was evaluated using a MATLAB-(R2025a)-based [
62] digital-twin environment. The environment models photovoltaic generation, BESS units with SoC and SoH dynamics, stochastic load profiles, and grid interaction using an aggregated-bus representation.
In the mini-grid case study, photovoltaic generation (peak power 300 kW) and demand profiles (50 households) are generated synthetically using time-of-day-dependent baseline curves with superimposed stochastic perturbations. This setup makes the experiments fully reproducible without requiring access to external datasets.
The Soft Actor-Critic (SAC) agent implementing the adaptive controller is implemented directly in MATLAB, using explicit matrix operations and custom gradient updates. Training is performed on a standard CPU-only workstation with hyperparameters tuned to ensure stable convergence.
Convergence of the SAC agent was assessed by monitoring the actor and critic losses, as well as the episodic return, over the full training horizon. As shown in
Figure 3, the actor loss initially increases slightly during the early exploration phase, then undergoes a sustained decay and stabilizes around a nearly constant value, indicating that policy updates become small and consistent. The critic loss starts from a high value, rapidly decreases in the first training steps, and then passes through a transient region with moderate oscillations before settling into a low-variance regime, confirming the numerical stability of the value function estimates. In parallel, the episodic return, computed from the same training logs, exhibits a characteristic improvement followed by a plateau at a steady value, jointly demonstrating that the SAC-based HPAC controller converges to a stable and high-performing policy.
For a detailed overview of the training configuration,
Table 4 lists the hyperparameters used for the Soft Actor-Critic agent in HPAC. These values include the optimization constraints and the fixed entropy coefficient used to govern the training budget.
5.1.1. Mini-Grid Topology and Simulation Parameters
For reproducibility, we summarize the topology and simulation parameters used in the MATLAB-based case study. The mini-grid is modeled as a single aggregated low-voltage AC bus with a nominal line-to-line voltage of 480 V. The aggregated bus represents the point of common coupling to the main grid and supplies an equivalent of 50 residential/commercial households. Net demand is obtained as the difference between the stochastic load profile and rooftop photovoltaic (PV) generation.
PV generation is represented by a single aggregated array with a peak power of 300 kW. Both load and PV trajectories are synthesized inside the simulation environment using time-of-day-dependent baseline shapes with superimposed stochastic perturbations, providing realistic diurnal patterns and intra-hour variability throughout the 72 h horizon.
The BESS fleet comprises heterogeneous units with energy capacities kWh (total 500 kWh) and maximum power ratings kW. Converter efficiency is fixed at 95%, while SoC and SoH constraints are enforced as and , respectively.
The adaptive controller operates at a control interval of
, whereas the predictive engine is updated every
with a forecast horizon of
h. The grid interconnection allows importing or exporting up to 500 kW and is priced using a time-of-use tariff with a base price of USD 0.15/kWh and 5% price volatility. Each simulation episode spans 72 h (3 days), ensuring that diurnal cycles and multi-day recovery dynamics are captured consistently for all fourteen controllers evaluated in
Table 6.
The grid interconnection is characterized by a maximum import/export capability of 500 kW (grid_max_power = 500). In the base configuration used for the HPAC evaluation, line-impedance and multi-bus parameters are left empty, so the system operates as an aggregated single-bus mini-grid.
The control interval for the adaptive controller is
, while the Predictive Engine is updated every
with a forecast horizon of
steps (12 h ahead), consistent with
forecast_horizon = 12 in the MATLAB script. All controllers listed in
Table 6 are evaluated on this same aggregated-bus topology and parameter set within the shared MATLAB digital twin, ensuring that performance differences are attributable solely to their control logic and learning algorithms.
5.1.2. Key Performance Indicators
A multifaceted evaluation is essential to assess the HPAC framework. Representative KPIs are summarized in
Table 5. The KPI set is consistent with industrial practice for energy storage systems and micro-grid business cases [
63] and, more importantly, with metrics used in recent peer-reviewed ML-based micro-grid management studies [
3,
9]. Industry white papers and vendor documentation are used only as supplementary context rather than as primary sources.
5.1.3. Baseline Models for Comparison
To validate the novelty and benefits of HPAC, performance should be benchmarked against credible baselines representing different paradigms as follows:
Baseline 1: Advanced adaptive droop control. A decentralized reactive scheme where droop coefficients are dynamically adjusted based on SoC, capacity, and possibly SoH, following recent adaptive droop approaches [
11,
13,
14,
15].
Baseline 2: MPC-based controller. A linear MPC with a simplified state-space model and deterministic forecasts (e.g., from classical models or DL-based predictors). The objective function mirrors HPAC’s reward structure. This highlights HPAC’s advantages in computational tractability and robustness to model mismatch relative to MPC-based storage control [
10,
23].
Baseline 3: Simpler DRL algorithm (e.g., DDPG). A non-entropy-regularized actor-critic algorithm trained under the same conditions. Comparison with SAC isolates the benefits of the maximum-entropy framework in terms of performance, training stability, and robustness, complementing micro-grid DRL surveys [
6].
In addition to these conceptual baselines, the simulation study can include a broader family of controllers, including heuristic rule-based controllers, centralized and distributed MPC variants, tabular Q-learning, policy-gradient methods such as PPO, multi-agent RL, and advanced fuzzy-logic controllers. This enables a richer comparison of trade-offs across technical, economic, and degradation-related KPIs.
5.1.4. Implementation of the Controller Set
All 14 controllers in
Table 6 are implemented in the same MATLAB-based digital twin and are evaluated under identical operating conditions, including load and PV profiles, network parameters, and tariff structure.
Table 6.
Simulation-based performance comparison of 14 representative controllers in the HPAC mini-grid test system.
Table 6.
Simulation-based performance comparison of 14 representative controllers in the HPAC mini-grid test system.
| Controller | Total Cost (USD) | Energy Throughput (kWh) | SoC Variance | Voltage Variance () |
|---|
| HPAC (Simple) | 1023.76 | 1265.80 | 0.0933 | 0.5474 |
| Rule-Based (Simple) [11,12,20] | 1019.88 | 255.40 | 0.0004 | 0.5535 |
| Rule-Based (Enhanced) [14,15,22] | 1021.59 | 286.42 | 0.0008 | 0.5636 |
| MPC (Simplified) [23,42] | 1003.24 | 337.47 | 0.0109 | 0.5667 |
| SAC (RL-based) [18,19,28,43] | 1021.15 | 775.85 | 0.0924 | 0.5639 |
| Centralized MPC [23,42] | 1003.53 | 337.48 | 0.0007 | 0.5565 |
| Distributed MPC [23,42] | 943.57 | 1059.69 | 0.0009 | 0.5228 |
| PI Controller [20,21,57] | 959.01 | 928.26 | 0.0083 | 0.5425 |
| Q-Learning (Tabular) [6,10,29] | 1052.30 | 416.14 | 0.0002 | 0.5913 |
| PPO (Policy Gradient) [5,6,64] | 1029.63 | 800.53 | 0.0406 | 0.5682 |
| MARL (Multi-Agent RL) [6,29,65] | 1030.78 | 1053.09 | 0.1454 | 0.5702 |
| Fuzzy Logic (Adv.) [27,50] | 998.01 | 1461.55 | 0.0001 | 0.5666 |
| Heuristic (Advanced) [3,5,17] | 967.98 | 2003.58 | 0.0006 | 0.5785 |
| HPAC (Manuscript) [this work] | 943.90 | 695.60 | 0.0000 | 0.3641 |
All 14 controllers in
Table 6 are implemented in the same MATLAB-based digital twin and are evaluated under identical operating conditions, including load and PV profiles, network parameters, and tariff structure. The resulting performance spread in terms of total cost, energy throughput, SoC variance, and voltage variance directly reflects the different control philosophies: some controllers deliberately trade higher energy throughput and operating cost for tighter SoC equalization, while others reduce cycling at the expense of power–quality or arbitrage performance. For clarity, we briefly summarize their configuration and relate it to the aggregate metrics reported in
Table 6 as follows:
HPAC (Simple) shares the SAC architecture and state representation with HPAC (Manuscript) but uses a static, manually tuned reward without forecast-driven shaping from the Predictive Engine. As a result, it exhibits relatively high total cost and SoC variance, together with elevated energy throughput, indicating that the agent over-cycles the batteries and reacts myopically to local states, without consistently aligning its decisions with price signals or long-horizon grid conditions.
Rule-Based (Simple) is a heuristic controller based on fixed SoC thresholds and deadbands that prioritize keeping all BESS units within a nominal SoC band, without explicit price awareness. It achieves very low SoC variance and modest voltage variance but at the expense of limited energy throughput and higher overall cost, since it often ignores profitable arbitrage opportunities. Rule-Based (Enhanced) augments this logic with additional rules that charge or discharge when prices fall below or rise above predefined thresholds, which modestly increases energy throughput and cost while maintaining low SoC variance, but still lacks the ability to optimally time actions over a prediction horizon.
MPC (Simplified) and Centralized MPC solve quadratic programs over a finite horizon. The simplified variant uses a reduced-order model and shorter horizon, leading to moderate cost, moderate energy throughput, and slightly higher SoC and voltage variance. The centralized variant employs a richer state vector and longer horizon, which tightens SoC variance and stabilizes voltage somewhat, but still incurs non-negligible cost and limited throughput because its optimization is constrained by model complexity and a fixed degradation penalty. Distributed MPC decomposes the optimization across BESS units with a consensus step on coupling variables and achieves near-minimal cost with high energy throughput and very low SoC variance; however, its voltage variance remains noticeably higher than that of HPAC (Manuscript), indicating that purely optimization-based coordination does not fully exploit the hierarchical structure and adaptive reward shaping of HPAC.
SAC (RL-based), PPO (Policy Gradient), and Q-Learning (Tabular) are standard DRL baselines trained in the same Gym-compatible environment but without hierarchical forecasting and dynamic reward shaping. SAC and PPO operate in continuous action spaces and tend to generate higher energy throughput with elevated SoC variance and cost, suggesting that they overemphasize short-term rewards and struggle to internalize the long-term trade-off between arbitrage and degradation. By contrast, tabular Q-learning uses a discretized action set, which yields extremely low SoC variance but the highest total cost and relatively large voltage variance, reflecting overly conservative policies that keep SoC tightly regulated while missing profitable and grid-supportive actions.
PI Controller is a conventional feedback controller that acts on aggregate SoC error and bus-voltage deviation, with gains tuned to achieve a compromise between SoC balancing and voltage regulation. This results in moderate energy throughput and cost, with low but non-minimal SoC variance and acceptable voltage variance, characteristic of a purely local feedback design that cannot anticipate future disturbances or prices.
MARL (Multi-Agent RL) assigns one agent per BESS unit with local observations and a shared global reward; agents are trained jointly using a centralized-critic, decentralized-actor scheme. This structure enables high energy throughput but leads to the largest SoC variance among all controllers and relatively high voltage variance and cost, indicating that local agents tend to over-exploit arbitrage and fail to coordinate sufficiently for global SoC balancing and voltage support.
Fuzzy Logic (Advanced) implements a MAMDani-type fuzzy rule base with inputs derived from SoC, power, and price signals and outputs corresponding to charge/discharge setpoints for each BESS unit, inspired by recent applications of fuzzy-logic controllers to nonideal battery and V2G operation [
50]. It achieves extremely low SoC variance and high energy throughput, revealing effective balancing and aggressive cycling; however, this comes with increased cost and elevated voltage variance, as the fuzzy rules are not jointly optimized against a multi-objective cost that explicitly internalizes degradation and power–quality constraints.
Heuristic (Advanced) is a hand-tuned, price-driven arbitrage strategy that aggressively cycles the batteries whenever even small price spreads are present, subject to basic SoC constraints but without an explicit degradation-aware objective. This explains the highest energy throughput in the comparison, with low SoC variance but significantly higher voltage variance and non-minimal cost, illustrating how purely economic heuristics can jeopardize power-quality and long-term asset health.
HPAC (Manuscript) is the full hierarchical controller described in
Section 4 and
Section 5, combining the federated transformer Predictive Engine with a SAC-based Adaptive Controller and the multi-objective reward structure defined in
Section 4.1.2. In
Table 6, this design yields near-minimal total cost, essentially zero SoC variance, and by far the lowest voltage variance, while maintaining only moderate energy throughput. In other words, HPAC (Manuscript) deliberately sacrifices excessive cycling and marginal arbitrage gains in favor of simultaneous SoC equalization, voltage stability, and implicit degradation mitigation, demonstrating the benefit of jointly optimized, forecast-aware reward shaping and hierarchical control.
Taken together, the numerical results reveal a clear set of trade-offs across the four performance metrics. Controllers that aggressively pursue arbitrage, such as the Heuristic (Advanced), MARL, and Fuzzy Logic schemes, achieve very high energy throughput but pay for this with increased total cost and, more critically, elevated voltage variance, indicating frequent and large power injections that stress the network. At the opposite extreme, conservative schemes like Q-Learning (Tabular) and the simple rule-based controllers tightly regulate SoC and maintain acceptable voltage profiles but exhibit high or non-competitive costs and limited throughput, reflecting missed opportunities for using the BESS fleet as an economic and grid-support resource. The MPC family, particularly the Distributed MPC variant, demonstrates that model-based optimization can simultaneously reduce costs and maintain low SoC variance, yet its voltage variance remains substantially higher than that of HPAC (Manuscript), suggesting that static constraints and fixed penalty weights are not sufficient to capture the complex, time-varying trade-off between local SoC, network conditions, and price signals. HPAC (Manuscript) occupies a distinct “balanced optimum” region of the Pareto surface: its cost is among the lowest in the entire set, its SoC variance is essentially zero, and its voltage variance is markedly lower than all other controllers, while its energy throughput remains moderate rather than extreme. This combination indicates that the hierarchical, forecast-aware, and reward-shaped design of HPAC not only improves steady-state performance but also restructures the control policy to allocate battery cycling where it is most valuable for both economics and power quality, rather than simply maximizing throughput or minimizing a single objective.
This common implementation framework ensures that differences in performance across controllers are attributable to their control logic and learning algorithms rather than to discrepancies in the underlying models or data sources. The updated quantitative results in
Table 6 thus provide a consistent, system-level view of how each control paradigm trades off cost, throughput, SoC balancing, and voltage regulation, and highlight the ability of HPAC (Manuscript) to deliver a balanced optimum across all four metrics.
5.1.5. Scenario Analysis
Beyond average-case performance, robustness under challenging conditions is crucial. Representative stress-test scenarios include the following:
High-volatility day: rapid fluctuations in irradiance (fast-moving clouds) and spiky load profiles. Tests the controller’s ability to maintain stability in highly dynamic conditions.
Unexpected event: sudden unforecasted changes such as a significant load step or a generator fault. Evaluates the ability to recover from deviations relative to forecasted trajectories.
Communication impairment: simulated latency or loss of communication between Predictive Engine and Adaptive Controller. Assesses how well the SAC agent can operate with stale or missing forecast information.
Component failure: sudden failure of a BESS unit. Evaluates fault tolerance and the ability to rebalance SoC and manage load with reduced storage capacity.
In addition to proposing these scenarios, we implemented two of them explicitly in the digital twin and evaluated all controllers as follows:
A high-volatility scenario with rapidly varying net load and occasional price spikes.
A communication impairment scenario in which the forecast sent by the Predictive Engine to the Adaptive Controller is periodically frozen, emulating intermittent loss of connectivity.
Figure 4 illustrates the SoC trajectories obtained in the high-volatility net-load scenario for HPAC, Distributed MPC, and the best-performing rule-based controller. In this stress test, the mini-grid is exposed to rapid alternations between surplus renewable generation and sharp demand spikes, which tend to pull individual batteries toward their operational limits if control decisions are purely reactive. Under these conditions, HPAC keeps all BESS units tightly clustered within a narrow SoC band around the desired operating range, with only small, synchronized excursions that mirror the underlying net-load swings. This behavior reflects the controller’s ability to anticipate upcoming ramps through its forecast-augmented state and dynamically shaped reward so that batteries are pre-charged or pre-discharged before extreme events occur.
By contrast, the rule-based controller exhibits visibly wider spreads and persistent divergence among individual SoC trajectories. Some units experience repeated approaches to the upper and lower SoC bounds, indicating that fixed thresholds and heuristic dispatch rules cannot cope with the rapid sequence of surplus and deficit periods. These excursions to extreme SoC values are symptomatic of both poorer SoC balancing and a higher risk of accelerated degradation. Distributed MPC performs markedly better than the rule-based baseline in terms of SoC equalization, but its trajectories still show slightly larger oscillations and slower realignment after major disturbances. This is consistent with a controller that relies on deterministic forecasts and finite-horizon re-optimization: it can exploit foresight, but its performance degrades when prediction errors or model mismatch accumulate. Overall, the visual comparison in
Figure 4 highlights that HPAC delivers the tightest SoC clustering and fastest recovery after high-volatility events, while Distributed MPC incurs slightly higher operating costs and the rule-based scheme suffers from both increased SoC variance and more frequent saturation at the operational limits.
In the communication impairment scenario, HPAC is deliberately operated under intermittent forecast unavailability, emulating temporary failures or congestion in the link between the mini-grid and the cloud-based Predictive Engine.
Figure 5 reports the resulting SoC trajectories and bus-voltage profile for HPAC compared against a forecast-free baseline controller that always operates myopically. During periods when forecasts are frozen, the HPAC agent must rely solely on stale predictive information and real-time local measurements. The SoC traces show that these forecast-free intervals lead to mildly larger short-term oscillations and a small drift away from the nominal SoC band, reflecting the reduced ability to pre-position the batteries ahead of upcoming ramps. However, the trajectories remain well within the admissible bounds and reconverge quickly once fresh forecasts become available, indicating that the learned policy has internalized meaningful patterns of typical net-load evolution and can fall back to a safe, measurement-driven behavior when predictive context is degraded.
The bus-voltage plot in
Figure 5 further confirms that stability is not compromised by communication impairments. Even when forecasts are stale, HPAC keeps the voltage tightly regulated around the nominal value, with deviations remaining within the acceptable operational band and without inducing secondary oscillations or large overshoots. In contrast, the forecast-free baseline controller exhibits noticeably larger voltage excursions during fast net-load changes, as it cannot exploit any information about upcoming imbalances and therefore tends to overreact to instantaneous measurements. From an economic perspective, the intermittent loss of forecasts leads to a modest increase in operating cost for HPAC because some charging and discharging decisions become more conservative. Nonetheless, this cost penalty remains limited compared with the improvement in SoC balancing and voltage regulation relative to the purely myopic baseline.
Across both stress-test scenarios, HPAC consistently achieves the lowest or near-lowest SoC variance and voltage deviation among the considered controllers, while incurring only modest increases in operating cost relative to the nominal (fault-free) case. In the high-volatility scenario, this manifests as tightly clustered SoC trajectories and well-damped responses to abrupt net-load swings, whereas in the communication impairment scenario, it appears as graceful degradation: short-lived forecast outages slightly degrade economic and balancing performance but do not trigger instability or constraint violations. Taken together, these results substantiate the qualitative claim that HPAC is robust to realistic disturbances and imperfections in the forecasting and communication stack and that the hierarchical predictive-adaptive design provides tangible benefits over both rule-based and optimization-only baselines under adverse operating conditions.
5.2. Performance Analysis and Trade-Offs
As summarized in
Table 6, the proposed HPAC framework achieves the lowest total operating cost among the 14 controllers evaluated in the mini-grid case study, while simultaneously maintaining excellent SoC-balancing performance and bus-voltage regulation. Compared with the best-performing MPC-based baseline, HPAC attains a similar or slightly lower operating cost but with significantly reduced SoC variance and voltage variance. Relative to purely DRL-based baselines that lack hierarchical forecasting and reward shaping, HPAC reduces cost by more than ten percent while delivering markedly better SoC equalization and voltage stability.
These aggregate results already suggest that HPAC is operating in a favorable region of the underlying multi-objective trade-off surface, where economic performance, technical quality, and degradation-aware behavior are jointly optimized rather than individually tuned. Controllers that aggressively chase arbitrage opportunities achieve high energy throughput but exhibit elevated voltage variance and, in some cases, only marginal cost improvements. Conversely, conservative rule-based or tabular RL controllers keep SoC tightly regulated but forgo profitable charge/discharge windows, leading to higher overall costs. HPAC occupies a balanced operating point: it uses the BESS fleet sufficiently often to exploit meaningful price spreads and to provide grid-support services, but it avoids unnecessary cycling that would erode battery lifetime or destabilize bus voltage.
Figure 6 provides a complementary, time-domain view of these trade-offs by plotting the SoC trajectories for all 14 controllers over a three-day horizon. For each controller, the solid line shows the fleet-averaged SoC, while the shaded band indicates the minimum and maximum SoC across individual BESS units. Controllers with narrow, nearly flat envelopes are able to maintain tight SoC clustering, whereas controllers with wide or highly distorted envelopes allow significant divergence between units, often driving some batteries into extreme SoC regions while others remain underutilized.
The rule-based schemes, particularly the enhanced variant, exhibit almost perfectly flat trajectories once their internal thresholds are reached: all batteries are quickly driven toward a preferred SoC band and then held there with minimal movement. This explains their extremely low SoC variance, but the nearly static trajectories also reveal why these controllers extract little economic value from the storage assets; most of the available energy capacity remains unused once the preferred SoC level is attained. Distributed MPC and the more advanced model-based controllers show sharper SoC excursions that are well coordinated across units: the mean SoC follows a smooth charge/discharge pattern, and the envelope remains narrow, indicating that all batteries participate in a synchronized manner. However, these trajectories often feature pronounced peaks and troughs driven by the optimization horizon, which can translate into relatively aggressive cycling and higher sensitivity to model mismatch.
DRL-based baselines without hierarchical forecasting, such as generic SAC, PPO, MARL, and heuristic policies, tend to produce more irregular SoC profiles. In several cases the envelopes widen substantially during periods of high activity, with individual BESS units being driven close to their lower or upper SoC limits while others lag behind. This behavior is symptomatic of policies that optimize short-term rewards based on local information, without a global view of SoC balancing or an explicit mechanism to keep trajectories synchronized. The heuristic advanced controller, in particular, shows rapid and repeated swings across most of the SoC range, yielding very high throughput but also large envelope width and a final SoC level close to depletion.
In contrast, the HPAC (Manuscript) controller produces SoC trajectories that combine the desirable features of the best baselines while avoiding their drawbacks. The mean SoC follows a smooth, gradually increasing profile as the fleet charges in anticipation of upcoming net-load peaks and then levels off in a high-but-safe region where sufficient headroom remains for subsequent disturbances. The envelope around this trajectory is extremely narrow, indicating that all BESS units move almost in lockstep and that SoC imbalance is virtually eliminated. At the same time, the SoC is neither held artificially constant nor pushed repeatedly to its extreme bounds; instead, the fleet is cycled in a controlled and coordinated way that reflects both economic signals and long-term degradation considerations. The resulting trajectories visually confirm the quantitative SoC-variance metrics and illustrate how HPAC’s predictive-adaptive design enables both efficient usage of storage capacity and tight SoC coordination.
Aggregate KPI Comparison and Grid-Voltage Behaviour
While
Table 6 provides numerical values for the four key performance indicators, it is useful to visualize how the controllers compare along all dimensions at a glance.
Figure 7 presents bar charts for total operating cost, battery energy throughput, SoC variance, and bus-voltage variance for the full set of fourteen controllers. The first panel confirms that HPAC (Manuscript) achieves one of the lowest total costs in the comparison, closely matching the best MPC-based baselines and clearly outperforming the generic SAC and PPO controllers. In contrast, controllers such as Heuristic (Advanced) and MARL exhibit noticeably higher operating costs, despite sometimes delivering larger energy throughput. This illustrates the central trade-off: blindly maximizing battery usage does not necessarily translate into better economic performance once degradation-aware operation and network constraints are taken into account.
The second panel of
Figure 7 shows the total energy throughput of the BESS fleet. Heuristic (Advanced) and Fuzzy Logic (Advanced) stand out with the highest throughput, reflecting highly aggressive cycling policies that charge and discharge whenever even modest price spreads exist. HPAC (Manuscript) sits in an intermediate range: it uses the batteries substantially more than simple rule-based schemes but considerably less than the most aggressive heuristics. This moderated throughput is a direct consequence of the degradation-aware reward component and of the forecast-informed positioning of the fleet. Rather than exploiting every small arbitrage opportunity, HPAC concentrates cycling on time windows where forecasts indicate that flexibility will be most valuable, reducing unnecessary wear on the batteries.
The lower panels highlight the technical quality of control. In terms of SoC balancing, HPAC (Manuscript) attains essentially zero variance, matching or surpassing the best rule-based SoC controllers while significantly improving economic performance. Generic DRL methods such as SAC and MARL exhibit much higher SoC variance, often because they focus on short-term cost or reward signals without an explicit SoC-equalization objective. A similar picture emerges for voltage variance: HPAC (Manuscript) achieves the lowest bus-voltage variance among all controllers, indicating that its actions maintain the bus voltage very close to the nominal value despite the stochastic net load. Several baselines, including fuzzy logic, MARL, and heuristic controllers, show noticeably higher voltage variance, consistent with more abrupt and less coordinated power exchanges with the grid. Taken together, these four panels confirm that HPAC (Manuscript) lies near the Pareto front of the multi-objective trade-off: it simultaneously delivers low cost, good SoC balancing, moderate throughput, and superior voltage stability.
Figure 8 provides a complementary time-domain view, focusing on grid power exchange and bus-voltage trajectories for three representative controllers: HPAC (Simple), Fuzzy Logic (Advanced), and HPAC (Manuscript). In the upper subplot, the black curve represents the net load seen at the point of common coupling, while the colored curves show the resulting grid exchange under each controller. The HPAC (Simple) controller tracks the net load relatively closely and performs only limited shaping of the grid-exchange profile. The fuzzy-logic controller is more aggressive: it injects and absorbs large power swings in response to its rule base, which noticeably amplifies the high-frequency content of the grid-exchange signal. By contrast, HPAC (Manuscript) produces a smoother and more structured grid-exchange trajectory. Peaks and valleys are attenuated relative to the net load, indicating that the BESS fleet is used to buffer variability without overreacting to short-lived fluctuations. This behavior reflects the predictive-adaptive design: the SAC agent, informed by multi-horizon forecasts, learns to distinguish between transient disturbances and persistent trends in net load and prices.
The lower subplot of
Figure 8 shows the corresponding bus-voltage profiles. All three controllers keep the voltage within a tight band around the nominal value, but the differences in variance are visible. The fuzzy-logic controller exhibits the largest ripple, consistent with its more abrupt changes in grid power. HPAC (Simple) performs better but still shows slightly larger deviations than HPAC (Manuscript). The latter maintains the flattest voltage trace, staying very close to the nominal voltage (dashed red line) throughout the three-day simulation. This confirms quantitatively what the variance statistics in
Figure 7 suggest qualitatively: the full HPAC design not only balances SoC and reduces operating cost, but it also implicitly acts as an effective voltage-support controller by scheduling BESS injections and withdrawals in a way that avoids sharp voltage excursions.
Overall, the combination of the aggregate KPIs in
Figure 7 and the time-series behavior in
Figure 8 offers a consistent system-level picture. Controllers that maximize energy throughput tend to induce higher voltage variance and only modest cost improvements, while conservative rule-based strategies sacrifice economic benefits and flexibility. HPAC (Manuscript) occupies a distinct operating point: it reduces cost, tightly balances SoC, and stabilizes voltage with a moderate level of cycling, demonstrating that the hierarchical predictive-adaptive architecture can extract more value from the same hardware by coordinating forecasting, reward shaping, and real-time control.
5.2.1. Why HPAC Outperforms Baselines
The performance gap between HPAC and the “HPAC (Simple)” variant highlights the significance of the predictive-adaptive hierarchy. HPAC (Simple) uses the same SAC architecture and state representation but omits the dynamic reward shaping derived from the Predictive Engine. Consequently, it learns a static policy that optimizes a fixed trade-off between SoC balancing and economics. In contrast, the full HPAC agent receives forecast-aware signals that modulate the relative importance of the reward components in time. For example, when the Predictive Engine anticipates a substantial net-load spike two hours ahead, the effective weight on SoC balancing is increased, discouraging aggressive discharge in the present and positioning the BESS fleet for the future event. This mechanism allows HPAC to exhibit MPC-like foresight without explicit model-based optimization at run time.
The comparison with Distributed MPC further illustrates this point. Distributed MPC uses an explicit system model and deterministic forecasts to solve coupled optimization problems repeatedly. It therefore performs well in terms of cost and SoC variance but is sensitive to model mismatch and incurs a higher computational burden at each control step. HPAC amortizes its computational cost offline during training and replaces online optimization by a single neural-network forward pass. This yields performance on par with, or better than, distributed MPC under nominal conditions, and superior robustness under forecast errors and unmodeled dynamics.
5.2.2. Trade-Offs Between Cost, Throughput, and Degradation
Examining battery energy throughput reveals a clear trade-off between economic aggressiveness and long-term asset health. HPAC processes a moderate amount of energy over the evaluation horizon that is higher than the simpler rule-based strategies but significantly lower than the heuristic and MARL controllers that chase every small price spread. The latter achieve only marginal additional cost reductions while substantially increasing throughput and, by implication, degradation.
In contrast, HPAC’s explicit degradation-aware reward component penalizes excessively deep or frequent cycling and prolonged operation at extreme SoC values. This leads to a more measured throughput and a lower equivalent full-cycle count while still achieving the best overall cost performance. The multi-objective reward function therefore enables HPAC to strike a favorable balance between short-term economic metrics and long-term battery health, which is particularly important in mini-grids where storage replacement costs are high.
To reduce the impact of stochasticity in both the environment and the DRL training process, HPAC and the DRL baselines were trained and evaluated under several random initializations of network weights and environment seeds. The numerical values reported in
Table 6 correspond to representative runs whose performance is close to the median across seeds; in all cases, the ranking between HPAC and the main baselines remained unchanged, and the relative cost gap stayed within a narrow band.
5.2.3. Ablation Study: Role of Forecasts and Reward Shaping
To quantify the contribution of the Predictive Engine, we conducted an ablation study with the following three variants of the HPAC controller: (i) a no-forecast variant, where the SAC agent observes only instantaneous measurements and optimizes a static reward; (ii) a state-only variant, where forecasts are injected as additional state features but the reward weights remain fixed; and (iii) the full HPAC configuration, where forecasts are used both as state features and to drive dynamic reward shaping. The no-forecast variant incurs the highest operating cost and exhibits noticeably larger SoC variance, confirming that purely reactive policies struggle to position the BESS fleet optimally. Providing forecasts only as state features improves performance modestly, but SoC variance remains relatively high because the agent optimizes a static reward that cannot adapt its priorities as future conditions change. The full HPAC configuration delivers the best performance across all metrics, demonstrating that both uses of forecasts are essential for achieving MPC-like foresight with model-free DRL.
5.3. Forecasting Performance
To assess the effectiveness of the federated transformer-based Predictive Engine, we compared its net-load forecasting accuracy against two baselines: a Local-only model trained independently at each site and a Centralized model trained on the union of all data. Accuracy was measured in terms of mean absolute error (MAE) over a rolling 24 h horizon. In our experiments, the Federated model achieved an MAE of approximately 4.2%, which is only marginally higher than the centralized model (3.9%) and substantially better than the Local-only approach (7.8%). For comparison, a tuned ARIMA forecaster and a univariate LSTM forecaster obtained MAEs of 6.5% and 5.1%, respectively, on the same hold-out test set. This confirms that the proposed federated transformer setup enables collaborative training of a high-capacity forecasting model that outperforms classical statistical and recurrent baselines while preserving data locality and incurring only a small accuracy penalty relative to full data centralization.
The communication overhead associated with the Predictive Engine was also quantified. For the HPAC controller, forecast messages are exchanged with the adaptive controller on a 15 min timescale, resulting in on the order of a few hundred short messages per day and a data volume in the tens of kilobytes, negligible compared with typical micro-grid telemetry. This overhead is significantly lower than that of fully centralized MPC schemes, which require frequent transmission of detailed state information from all assets to a central optimizer. Finally, we simulated a multi-site training configuration in which several virtual mini-grids train a shared global forecasting model using federated averaging. The resulting performance closely matches that of the centralized model while respecting privacy constraints and eliminating the need to aggregate raw high-resolution operational data at a single location. This demonstrates the feasibility of multi-site HPAC deployment with privacy guarantees.
Overall, to quantify the contribution of the predictive engine, we conducted an ablation study with the following three variants of the HPAC controller:
HPAC (No-forecast): the SAC agent observes only current measurements (SoC, SoH, , , , and temporal features) and uses a static, manually tuned reward. No forecast features are provided, and the reward is not shaped by forecasts.
HPAC (State-only): the agent observes both current measurements and the multi-horizon forecast sequence, but the reward weights
,
,
, and
remain fixed over time. This configuration corresponds to the “HPAC (Simple)” controller in
Table 6.
HPAC (Full): the proposed architecture, where forecasts are used both as state features and to drive dynamic reward shaping.
Table 7 summarizes the results on the same base scenario used for
Table 6.
The “No-forecast” variant incurs the highest operating cost and exhibits noticeably larger SoC variance, confirming that purely reactive policies struggle to position the BESS fleet optimally. Providing forecasts only as state features improves performance modestly, but SoC variance remains relatively high because the agent optimizes a static reward that cannot adapt its priorities as future conditions change. The full HPAC configuration, which combines forecast-augmented state with dynamic reward shaping, delivers the best performance across all three metrics, demonstrating that both uses of forecasts are essential for achieving MPC-like foresight with model-free DRL.
Unified Forecasting Accuracy Across Models and Controllers
Table 8 summarizes the forecasting accuracy of the standalone models and all controllers in terms of MSE, RMSE, and MAE. The transformer-based predictive engine attains the lowest errors among the generic forecasting models (MSE = 414.67, RMSE = 20.79, and MAE = 16.30), outperforming both the LSTM and ARIMA baselines, which exhibit slightly higher RMSE and MAE. The naive persistence benchmark yields the largest errors, confirming that simple extrapolation is inadequate for capturing the dynamics of the net-load time series.
For most controllers, the associated forecasting errors cluster around the ARIMA/persis tence range (MSE ≈ 469–473, RMSE ≈ 21.7–21.8, MAE ≈ 17.1), indicating that they either rely on simpler prediction schemes or do not fully exploit high-fidelity forecasts. In contrast, the proposed HPAC (Manuscript) controller directly leverages the transformer-based Predictive Engine and therefore inherits its superior error profile (MSE = 414.67, RMSE = 20.79, and MAE = 16.30), achieving the best forecasting performance across all controllers. This unified comparison highlights that coupling HPAC with a federated transformer forecaster not only improves control-level economic and technical metrics but also yields consistently lower forecast errors than classical statistical, recurrent, and heuristic approaches.
6. Conclusions and Future Work
This paper has presented a Hierarchical Predictive-Adaptive Control (HPAC) framework designed to address the complex challenge of SoC balancing in mini-grids with high renewable penetration. The proposed architecture systematically overcomes limitations of conventional control paradigms by synergizing state-of-the-art deep learning techniques in a two-layer hierarchical structure.
By decoupling the control problem into a long-horizon predictive engine based on a federated transformer network and a real-time adaptive controller implemented with a Soft Actor-Critic agent, HPAC achieves a control solution that is both proactive and highly adaptive. The transformer-based forecasting model captures long-range dependencies in energy time series, while federated learning addresses data scarcity and privacy constraints across decentralized mini-grids. The SAC agent, guided by a multi-objective reward function balancing technical performance, economic efficiency, and asset longevity, learns a robust operational strategy that surpasses traditional single-objective controllers.
A comprehensive evaluation benchmarked HPAC against fourteen representative controllers spanning rule-based, MPC, and various DRL paradigms. The results demonstrate that HPAC achieves near-minimal operating cost (USD 943.90 over the 72 h simulation) while simultaneously attaining essentially zero SoC variance and the lowest voltage variance (0.3641 ) among all compared controllers. Sensitivity analysis on reward weight selection revealed a well-defined optimal region around and , where cost and SoC balancing are jointly optimized. Ablation studies confirmed that both forecast-augmented states and dynamic reward shaping are essential for achieving MPC-like foresight with model-free DRL; the full HPAC configuration significantly outperformed variants with static rewards or without forecasts.
Stress-test scenarios further validated the robustness of the approach. Under high-volatility net-load conditions, HPAC maintained tight SoC clustering with rapid recovery after disturbances, whereas rule-based controllers exhibited persistent divergence. Under communication impairment scenarios with intermittent forecast unavailability, HPAC degraded gracefully, leveraging its learned policy and local measurements to maintain safe operation without instability or constraint violations.
Several limitations remain. The current evaluation uses a simplified single-bus network model with four heterogeneous BESS units (total capacity 500 kWh); future work should validate HPAC on more detailed multi-bus network representations with explicit power-flow constraints. The federated transformer forecaster was evaluated in simulation with multiple virtual sites; real-world deployment across geographically distributed mini-grids remains to be demonstrated. Additionally, the computational scaling of centralized HPAC for very large BESS fleets () warrants further investigation, with multi-agent RL extensions being a promising direction.
Looking ahead, HPAC forms a foundation for more scalable and decentralized architectures. The linear inference-time scaling with N makes it better suited to large fleets than centralized MPC, and parameter-sharing or attention-based architectures could further improve scalability. Hardware-in-the-loop testing and cloud-edge deployment represent natural next steps toward practical implementation. Through the integration of predictive foresight with model-free adaptive control, HPAC represents a significant step toward intelligent, autonomous, and efficient energy management systems for a resilient and sustainable energy future.
6.1. Future Research Trajectories
While high-fidelity simulation is essential for development, real-world deployment is the ultimate goal. A path to deployment involves transitioning from simulation to hardware-in-the-loop (HIL) testing. In an HIL setup, the trained HPAC controller (SAC policy network) is deployed on a real-time industrial controller or embedded system, which interacts with a real-time power hardware simulator (e.g., OPAL-RT or Typhoon HIL). The simulator emulates the mini-grid dynamics (BESS, inverters, PV, network impedances) with microsecond-level accuracy. HIL validation verifies that inference and control loops meet strict real-time deadlines, exposes the controller to realistic communication latencies and jitter, and de-risks integration with target hardware and software stacks prior to field deployment. In practical terms, an HIL implementation of HPAC would perform the following: (i) compile the SAC policy network and associated pre-processing and post-processing logic into a real-time executable running at the desired control interval; (ii) exchange measurements and control commands with the power hardware simulator through a deterministic communication link; and (iii) log all relevant signals for offline analysis. The same HIL platform can then be used to test corner cases such as sudden BESS or PV outages, measurement dropouts, and extreme weather events, without risking physical assets.
Training the federated transformer and SAC agent is computationally intensive and generally performed offline on GPU-equipped servers or in the cloud. However, real-time inference is lightweight: forward passes through trained networks are matrix operations that can execute in milliseconds on modern embedded hardware. This naturally suggests a hybrid cloud-edge deployment, where the cloud layer handles data aggregation, global model training, and periodic retraining of the forecasting and DRL models, in line with broader trends in federated and distributed learning [
36,
37]. The edge layer hosts the trained SAC policy (and optionally a lightweight forecasting model) on a local controller within the mini-grid, performing high-frequency real-time control based on local sensor data and occasional cloud updates. If connectivity to the cloud is lost, the edge controller continues to operate using the last-known policy and local measurements, ensuring resilience and autonomy.
6.2. Scalability Considerations
A significant challenge for centralized DRL control is scalability. As the number of BESS units
N increases, both state and action spaces grow, making learning more difficult and potentially less sample-efficient. The scalability of RL is an active research topic [
6]. Promising directions for addressing scalability include multi-agent reinforcement learning (MARL), which treats each BESS as an agent with its own policy, learning to cooperate (implicitly or explicitly) to achieve global SoC balancing. MARL can be more scalable and robust to single points of failure, as illustrated by multi-agent power-grid control approaches such as PowerNet [
65]. Another direction is parameter sharing and attention-based architectures in centralized schemes, using a shared policy network that processes information from each BESS with parameter tying, keeping the number of learnable parameters independent of
N. Attention mechanisms can allow the network to focus on the most relevant units when making decisions, effectively learning a coordination topology.
From a computational-complexity perspective, classical centralized MPC implementations scale approximately as in the number of BESS units due to the matrix factorizations involved in solving the underlying quadratic program at each control step. This cubic scaling quickly becomes prohibitive for large fleets () when control intervals are on the order of one minute.
In contrast, the inference cost of the centralized HPAC controller scales roughly linearly with N. The dimensionality of the state and action vectors grows with the number of BESS units, but the policy itself is implemented as a fixed-size neural network whose dominant operations are matrix-vector multiplications. With parameter sharing and batched computation across units, the per-step inference time remains compatible with real-time constraints even for large N on modern embedded hardware. This linear scaling makes HPAC and its MARL extensions better suited to large-scale mini-grids than centralized MPC.
In future work, we plan to complement this theoretical argument with empirical timing results across mini-grids of increasing size and to explore MARL variants of HPAC that preserve the predictive-adaptive hierarchy while distributing the control task across multiple cooperating agents.