Article

A Health-Aware Hybrid Reinforcement–Predictive Control Framework for Sustainable Energy Management in Photovoltaic–Electric Vehicle Microgrids

1 Department of Engineering, Durham University, Durham DH1 3LE, UK
2 Department of Mathematics, Physics and Electrical Engineering, Northumbria University, Newcastle upon Tyne NE1 8SA, UK
3 School of Engineering, Iskenderun Technical University, Hatay 31200, Turkey
4 School of Engineering, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
* Author to whom correspondence should be addressed.
Batteries 2026, 12(1), 5; https://doi.org/10.3390/batteries12010005
Submission received: 12 November 2025 / Revised: 11 December 2025 / Accepted: 22 December 2025 / Published: 24 December 2025
(This article belongs to the Special Issue AI-Powered Battery Management and Grid Integration for Smart Cities)

Abstract

The increasing electrification of mobility within smart cities has accelerated the need for intelligent energy management strategies that jointly address cost, emissions, and battery health. This study develops a health-aware hybrid reinforcement–predictive energy manager (H-RPEM) designed for photovoltaic–electric vehicle (PV-EV) microgrids. The proposed controller unifies model-based predictive optimisation with adaptive reinforcement learning to achieve both short-term operational efficiency and long-term asset preservation. A comprehensive dataset of solar generation, EV charging behaviour, and stochastic load profiles was employed to train and validate the hybrid control framework under realistic operating conditions. Quantitative results indicate that the proposed H-RPEM controller achieves an 18.7% reduction in total operating cost and a 22.5% decrease in carbon emissions, whilst maintaining the battery state-of-health above 0.95 throughout a 24 h operational cycle. When benchmarked against standard predictive control, the hybrid strategy converges 30–40 episodes faster and delivers a 25% improvement in reward stability, demonstrating enhanced robustness and learning efficiency. The results confirm that H-RPEM achieves robust and balanced performance across economic, environmental, and technical domains, establishing it as a scalable and health-conscious control solution for next-generation smart city microgrids.

1. Introduction

The global transition toward sustainable and electrified mobility has transformed the operational dynamics of modern power systems. The rapid growth of electric vehicles (EVs) and renewable energy sources, particularly photovoltaics (PV), presents both opportunities and challenges for distributed energy management in smart cities. According to recent projections, the number of EVs worldwide is expected to exceed 240 million by 2030, leading to a paradigm shift in energy demands, storage, and network stability [1,2]. In this context, microgrids (MGs) serve as scalable testbeds for integrating renewable energy and electrified transport, enabling the study of localised optimisation strategies for energy efficiency, resilience, and sustainability [3,4]. However, effective coordination between PV generation, building loads, and EV fleets requires intelligent control frameworks capable of addressing stochastic variations, non-linearity, and battery degradation effects [5,6].
Traditional model predictive control (MPC) has been widely employed for distributed energy resource scheduling due to its ability to handle multivariable systems and explicitly account for constraints [7,8,9]. However, its performance degrades under high uncertainty in renewable generation and dynamic vehicle-to-grid (V2G) operations [5,10]. On the other hand, reinforcement learning (RL) techniques have shown potential to enhance adaptability through data-driven decision-making, enabling real-time optimisation without explicit system modelling [11,12,13]. Nevertheless, pure RL schemes can lack constraint robustness and may suffer from high sample complexity, particularly when applied to safety-critical systems such as MGs [14,15]. To overcome these limitations, hybrid strategies that fuse MPC and RL have emerged, providing both model-based foresight and adaptive learning capabilities [16,17].
Furthermore, the increased integration of EVs introduces the issue of battery degradation management, a factor often neglected in cost optimisation frameworks. Frequent charge/discharge cycling and operation near extreme states of charge accelerate capacity fading and reduce the remaining useful life of battery packs [18,19,20]. Incorporating battery health awareness into control algorithms can significantly improve long-term performance and sustainability, yet few studies have quantitatively integrated degradation modelling within hybrid optimisation frameworks [21,22,23]. Simultaneously, stochasticity in weather and driver behaviour necessitates controllers that can balance multiple objectives, minimising operational costs and emissions while preserving the battery state-of-health (SOH) and ensuring supply reliability [24,25].
More recent work from 2024–2025 has increasingly applied deep RL to microgrid and hybrid storage energy management, including frameworks that coordinate hydrogen-battery systems and multi-objective operation under uncertainty [26,27]. In parallel, several studies have focused on RL-based V2G scheduling and online EV coordination in grid-connected systems, demonstrating improved cost savings and operational flexibility, albeit typically without explicit consideration of battery health modelling [28,29,30,31]. Multi-agent RL paradigms have also been proposed for distributed EV charging and V2G services at a district or community scale, often emphasising robustness to cyber-attacks or community-level autonomy rather than degradation-aware operation [26,27,28]. Complementary review articles have surveyed reinforcement learning applications in EV energy management and V2G control, outlining open challenges in integrating SOH estimation, electrothermal constraints, and long-term ageing into learning-based control frameworks [29,32]. These emerging trends underscore the need for hybrid, health-aware control schemes that reconcile predictive robustness with adaptive learning in PV-EV MGs.
To address these research gaps, this study proposes a health-aware hybrid reinforcement–predictive energy manager (H-RPEM) designed for PV-EV MGs. The proposed framework integrates the predictive robustness of stochastic MPC with the learning adaptability of RL to achieve multi-objective optimisation. The controller explicitly models battery degradation, renewable intermittency, and demand uncertainty within a unified decision-making process. By blending the outputs of both the model-based and data-driven layers through an adaptive weighting mechanism, H-RPEM achieves balanced performance across economic, environmental, and technical indicators. The framework is validated using empirical data from an MG, demonstrating enhanced energy efficiency, reduced carbon emissions, and an extended battery lifespan compared with existing predictive and rule-based strategies. The developed methodology provides a reproducible and scalable control paradigm that supports sustainable electrification and grid-resilient smart campuses.
The primary goal of this research is to design and validate a novel, health-aware hybrid energy management framework that combines predictive optimisation and RL to enhance the operational sustainability of photovoltaic–electric vehicle (PV-EV) MGs. The study concentrates on achieving a balanced improvement in energy efficiency, carbon reduction, and battery longevity within an MG environment.

1.1. Research Objectives

  • To develop a hybrid control architecture that integrates stochastic model predictive control (SMPC) with DRL, enabling adaptive, data-driven decision-making under uncertainty.
  • To incorporate a degradation-aware optimisation layer that explicitly manages battery SOH and prolongs the lifespan of energy storage systems through intelligent charge/discharge regulation.
  • To evaluate the proposed H-RPEM on a real PV-EV MG dataset, benchmarking its performance against that of conventional predictive and rule-based control methods in terms of cost, emissions, and technical resilience.

1.2. Research Questions

  • How can RL and predictive optimisation be effectively combined into a single control framework that maintains adaptability while satisfying physical and operational constraints?
  • In what ways does the explicit consideration of battery degradation influence the economic and environmental performance of PV-EV MGs during real-time operation?
  • To what extent can the proposed hybrid controller maintain robustness and stability under stochastic variations in renewable generation, grid tariffs, and EV charging demand?
The remainder of this paper is organised as follows. Section 2 introduces the proposed method, explaining the design and implementation of H-RPEM with an emphasis on its predictive, adaptive, and degradation-aware modules. Section 3 presents and analyses the experimental findings, comparing the hybrid controller’s performance against that of conventional rule-based and MPC strategies under various operating scenarios. Finally, Section 4 summarises the key results, discusses their implications for grid-connected MGs, and highlights prospective research directions to enhance the scalability and resilience of smart city electrification frameworks.

2. Methodology

2.1. Data Pre-Processing and Stochastic Characterisation

The empirical analysis in this study utilised the public EV Charging Patterns dataset from Kaggle, which provides real-world temporal records of EV charging sessions and energy consumption behaviours across multiple urban sites [33]. The dataset contains the following relevant columns: EV session start and end times, energy delivered per session, charging power, ambient temperature, and site identifier. The aggregated EV charging demand is represented as $P_{\mathrm{EV}}(k)$, obtained by summing the instantaneous charging powers of all active EV sessions; this quantity is referred to consistently as the aggregated EV charging power throughout the manuscript. Solar irradiance, PV power, and load demand are not directly included in the dataset and were therefore reconstructed as follows:
  • PV irradiance and PV power were obtained from the NREL MIDC database for a representative UK latitude, with PV output computed using Equation (3).
  • Load demand was synthesised from standard residential and commercial load shape templates and scaled to match the microgrid size considered.
  • Cell temperature was estimated from the ambient temperature using a standard NOCT model.
  • EV fleet power was reconstructed by aggregating all overlapping charging sessions into a 15 min power profile.
Each variable $x(t)$ was synchronised to a 15 min sampling interval and pre-processed through a three-step pipeline: (i) outlier removal using a Hampel filter, (ii) fourth-order Savitzky–Golay smoothing, and (iii) min–max normalisation. The normalisation process converts heterogeneous units into a unified $[0, 1]$ scale according to
$$ x_n(t) = \frac{x(t) - \min(x)}{\max(x) - \min(x)}, \tag{1} $$
where $x_n(t)$ denotes the normalised signal. The smoothed and normalised dataset is visualised in Figure 1, providing insight into the temporal synchrony and stochastic properties of the system. A full reproduction package, including pre-processing scripts and reconstructed PV, irradiance, and load datasets, together with all RL training logs, is provided in the project repository to ensure end-to-end reproducibility.
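As a minimal sketch of this pipeline (assuming SciPy is available), the snippet below applies a Hampel filter, fourth-order Savitzky–Golay smoothing, and the min–max normalisation of Equation (1) to a single series; the window lengths and the synthetic input profile are illustrative choices rather than the exact settings used in the study.

```python
import numpy as np
from scipy.signal import savgol_filter

def hampel_filter(x, window=7, n_sigmas=3.0):
    """Replace local outliers with the rolling median (simple Hampel filter)."""
    x = np.asarray(x, dtype=float).copy()
    k = 1.4826                     # scale factor relating MAD to standard deviation
    half = window // 2
    for i in range(half, len(x) - half):
        segment = x[i - half:i + half + 1]
        med = np.median(segment)
        mad = k * np.median(np.abs(segment - med))
        if mad > 0 and np.abs(x[i] - med) > n_sigmas * mad:
            x[i] = med
    return x

def preprocess(x, sg_window=9, sg_order=4):
    """Outlier removal -> Savitzky-Golay smoothing -> min-max normalisation (Eq. 1)."""
    x = hampel_filter(x)
    x = savgol_filter(x, window_length=sg_window, polyorder=sg_order)
    return (x - x.min()) / (x.max() - x.min())

# Example: smooth and normalise a synthetic 15-min PV power profile (96 intervals = 24 h)
t = np.arange(96)
raw = np.clip(np.sin((t - 24) * np.pi / 48), 0, None) + 0.05 * np.random.randn(96)
pv_norm = preprocess(raw)
```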
As seen in Figure 1a, the PV power output ($P_{\mathrm{PV}}$) closely follows the irradiance profile $G(t)$, both exhibiting quasi-periodic diurnal behaviour. Their correlation coefficient, $\rho_{PG} = 0.93$, indicates a strong dependency, while the phase shift of roughly 0.5–1 h results from the inverter start-up delay and module temperature hysteresis. The irradiance fluctuations are decomposed into deterministic and stochastic components:
$$ G(t) = \bar{G}(t) + \varepsilon_G(t), \qquad \varepsilon_G(t) \sim \mathcal{N}(0, \sigma_G^2), \tag{2} $$
where $\bar{G}(t)$ is the deterministic mean trajectory and $\varepsilon_G(t)$ is a zero-mean Gaussian disturbance. A similar decomposition is applied to the PV power, load demand, and EV activity, forming the stochastic disturbance set $w(t) = \{\varepsilon_G, \varepsilon_P, \varepsilon_L, \varepsilon_{\mathrm{EV}}\}$ used in predictive control.
The non-linear conversion between irradiance and PV power is described by
$$ P_{\mathrm{PV}}(t) = \eta_{\mathrm{PV}} A_{\mathrm{PV}} G(t) \left[ 1 - \gamma_T \big( T_c(t) - 25 \big) \right], \tag{3} $$
where $\eta_{\mathrm{PV}}$ is the module efficiency, $A_{\mathrm{PV}}$ the total array area, $\gamma_T$ the temperature coefficient, and $T_c(t)$ the cell temperature. As shown in Figure 1d, the slope between $G_n$ and $P_{\mathrm{PV},n}$ approximates 0.91, consistent with the linearised form of Equation (3).
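The following short function is a direct transcription of Equation (3); the module efficiency, array area, and temperature coefficient shown are placeholder values for illustration only, not the parameters of the studied plant.

```python
def pv_power(irradiance_w_m2, cell_temp_c, eta_pv=0.19, area_m2=600.0, gamma_t=0.004):
    """PV output from irradiance and cell temperature, Equation (3).

    eta_pv (module efficiency), area_m2 (array area) and gamma_t (temperature
    coefficient, 1/degC) are illustrative placeholder values. Returns power in watts.
    """
    return eta_pv * area_m2 * irradiance_w_m2 * (1.0 - gamma_t * (cell_temp_c - 25.0))

# Example: 800 W/m^2 at a cell temperature of 45 degC -> about 83.9 kW for this assumed array
p = pv_power(800.0, 45.0)
```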
Figure 1b displays the normalised load demand for residential and business buildings. The temporal complementarity between these sectors ($\rho_{RB} = -0.42$) increases the system’s load factor, which enhances the PV utilisation efficiency. The EV charging behaviour in Figure 1c shows a bi-modal distribution aligned with commuting patterns; peaks occur at indices 150–250 and 700–850, reflecting morning and evening arrivals. These empirical trends guided the definition of temporal uncertainty bounds for predictive scheduling, as described in later sections.
To incorporate variability into the control model, each time series was characterised by its coefficient of variation (CV) and lag-1 autocorrelation ($r_1$):
$$ \mathrm{CV}_x = \frac{\sigma_x}{\mu_x}, \qquad r_1 = \frac{\sum_t (x_t - \mu_x)(x_{t-1} - \mu_x)}{\sum_t (x_t - \mu_x)^2}, \tag{4} $$
where $\mu_x$ and $\sigma_x$ represent the mean and standard deviation of variable $x$. The PV, load, and EV profiles recorded $\mathrm{CV} = 0.24$, $0.18$, and $0.27$, respectively, confirming moderate volatility and supporting a hybrid stochastic–deterministic formulation.
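Both statistics in Equation (4) can be evaluated in a few lines on any of the normalised series, as in the sketch below.

```python
import numpy as np

def variability_stats(x):
    """Coefficient of variation and lag-1 autocorrelation, Equation (4)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    cv = sigma / mu
    dev = x - mu
    r1 = np.sum(dev[1:] * dev[:-1]) / np.sum(dev ** 2)
    return cv, r1

# Example on the normalised PV profile from the pre-processing sketch above:
# cv_pv, r1_pv = variability_stats(pv_norm)
```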

2.2. MG Model and Energy Balance

The operational power balance of the MG follows
$$ P_{\mathrm{PV}}(t) + P_{\mathrm{batt}}(t) + P_{\mathrm{grid}}(t) = P_{\mathrm{load}}(t) + P_{\mathrm{EV}}(t), \tag{5} $$
where $P_{\mathrm{batt}}$ is the bidirectional battery power (positive for discharge), $P_{\mathrm{grid}}$ is the grid import/export, $P_{\mathrm{load}}$ is the building demand, and $P_{\mathrm{EV}}$ represents aggregated EV charging. Equation (5) acts as a hard equality constraint for both RL and MPC solvers.
The battery state-of-charge (SOC) and energy balance are governed by
$$ \mathrm{SOC}(k+1) = \mathrm{SOC}(k) + \frac{\eta_{\mathrm{ch}} P_{\mathrm{ch}}(k) - P_{\mathrm{dis}}(k)/\eta_{\mathrm{dis}}}{E_{\max}} \, \Delta t, \tag{6} $$
where $\eta_{\mathrm{ch}}$ and $\eta_{\mathrm{dis}}$ denote the charge/discharge efficiencies, $E_{\max}$ the rated capacity (200 kWh), and $\Delta t$ the 15 min interval. Operational constraints are given by
$$ P_{\min} \le P_{\mathrm{batt}}(k) \le P_{\max}, \qquad 0.1 \le \mathrm{SOC}(k) \le 0.95. \tag{7} $$
Battery degradation is modelled by a health cost proportional to the throughput and C-rate:
$$ D_{\mathrm{deg}}(k) = \mu_{\mathrm{deg}} \, \frac{|\Delta E(k)|}{E_{\max}} \left[ 1 + 0.5 \max\!\big( 0, \, \mathrm{Cr}(k) - 0.5 \big) \right], \tag{8} $$
where $\mathrm{Cr}(k) = |P_{\mathrm{batt}}(k)|/P_{\max}$ is the C-rate and $\mu_{\mathrm{deg}}$ is the degradation factor. The corresponding SOH trajectory follows
$$ \mathrm{SOH}(k+1) = \mathrm{SOH}(k) - \kappa_{\mathrm{deg}} \, D_{\mathrm{deg}}(k), \tag{9} $$
with $\kappa_{\mathrm{deg}} = 10^{-3}$ calibrated from empirical lithium-ion data. Equations (8) and (9) therefore implement a phenomenological surrogate for battery ageing that aggregates the dominant influence of the throughput and C-rate into a single scalar degradation index. The formulation deliberately avoids an explicit electrochemical description of solid-electrolyte interface growth, diffusion processes, or detailed thermal dynamics, as these would require high-frequency cell-level voltage, current, and temperature measurements that are not available in the aggregated PV-EV MG dataset used in this study. Instead, the model provides a lightweight but tunable proxy that can be calibrated from empirical data [18,19,20] and embedded directly within the energy management optimisation, while remaining compatible with real-time control horizons at the MG scale.
It is important to emphasise that this degradation formulation is an empirical surrogate rather than a calibrated electrochemical model. The coefficients μ deg and κ deg are tuned to reproduce qualitative behaviour consistent with the literature [34,35], but no claim of quantitative fidelity to cell-level ageing data is made. The intention is to introduce a health-aware penalty into the control problem rather than to replicate degradation dynamics at laboratory precision. Future work will incorporate experimentally validated SOH estimators or physics-informed ageing models.
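To make the surrogate concrete, the following sketch advances the SOC, degradation index, and SOH over one 15 min interval using Equations (6)–(9); the rated power and the coefficient $\mu_{\mathrm{deg}}$ are assumed placeholders, while $\kappa_{\mathrm{deg}}$ follows the value quoted above.

```python
DT_H = 0.25                      # 15-min interval in hours
E_MAX = 200.0                    # rated capacity, kWh (as stated in the text)
P_MAX = 100.0                    # rated power, kW (illustrative assumption)
ETA_CH, ETA_DIS = 0.95, 0.95     # charge/discharge efficiencies (illustrative)
MU_DEG, KAPPA_DEG = 1.0, 1e-3    # MU_DEG is an assumed placeholder; KAPPA_DEG = 1e-3 as quoted

def step_battery(soc, soh, p_batt_kw):
    """One 15-min update of SOC (Eq. 6), degradation index (Eq. 8), and SOH (Eq. 9).

    p_batt_kw > 0 means discharge, p_batt_kw < 0 means charge.
    """
    p_ch = max(-p_batt_kw, 0.0)
    p_dis = max(p_batt_kw, 0.0)
    soc_next = soc + (ETA_CH * p_ch - p_dis / ETA_DIS) / E_MAX * DT_H
    soc_next = min(max(soc_next, 0.10), 0.95)            # enforce the SOC limits of Eq. (7)

    delta_e = abs(p_batt_kw) * DT_H                       # energy throughput, kWh
    c_rate = abs(p_batt_kw) / P_MAX
    d_deg = MU_DEG * (delta_e / E_MAX) * (1.0 + 0.5 * max(0.0, c_rate - 0.5))
    soh_next = soh - KAPPA_DEG * d_deg
    return soc_next, soh_next, d_deg

# Example: discharge at 60 kW for one interval from SOC = 0.5, SOH = 1.0
soc1, soh1, d1 = step_battery(0.5, 1.0, 60.0)
```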

2.3. Hybrid Reinforcement–Predictive Control (H-RPEM)

The control architecture combines the adaptive capability of RL with the foresight of stochastic MPC. The RL agent observes a state vector
$$ s_k = \big[\, \mathrm{SOC}(k), \; P_{\mathrm{PV}}(k), \; P_{\mathrm{load}}(k), \; P_{\mathrm{EV}}(k), \; p_{\mathrm{grid}}(k), \; \mathrm{SOH}(k) \,\big], \tag{10} $$
and generates an action $a_k = P_{\mathrm{batt}}(k)$ based on the policy $\pi_\theta(a_k | s_k)$. The immediate reward function penalises the operating cost, carbon emissions, and degradation simultaneously:
$$ r_k = -\big[\, w_{\mathrm{cost}} C_{\mathrm{op}}(k) + w_{\mathrm{CO_2}} E_{\mathrm{CO_2}}(k) + w_{\mathrm{deg}} D_{\mathrm{deg}}(k) \,\big]. \tag{11} $$
In the present work, the degradation term $D_{\mathrm{deg}}(k)$ in Equation (11) is computed from the empirical surrogate model of Equation (8), which aggregates the effects of the throughput and C-rate into a single scalar quantity. When more detailed electrothermal information and advanced observers, such as singular value decomposition-based modified iterated unscented Kalman filters (SVD-MIUKF), are available, the reward structure can be generalised in a modular fashion. Specifically, the degradation component may be decomposed into separate penalty terms for temperature stress and cycle severity, for example,
$$ D_T(k) = \beta_T \max\!\big( 0, \; T_c(k) - T_{\mathrm{ref}} - \Delta T_{\mathrm{tol}} \big), \tag{12} $$
$$ D_{\mathrm{cyc}}(k) = \beta_{\mathrm{cyc}} \, \mathrm{EFC}(k), \tag{13} $$
where $T_c(k)$ denotes the cell temperature, $T_{\mathrm{ref}}$ a nominal reference temperature, $\Delta T_{\mathrm{tol}}$ the tolerated deviation band, and $\mathrm{EFC}(k)$ the equivalent full-cycle count evaluated over a sliding window.
The reward can then be extended to
$$ r_k = -\big[\, \alpha_1 C_{\mathrm{op}}(k) + \alpha_2 E_{\mathrm{CO_2}}(k) + \alpha_3 D_{\mathrm{deg}}(k) + \alpha_4 D_T(k) + \alpha_5 D_{\mathrm{cyc}}(k) \,\big], \tag{14} $$
where the additional weights $\alpha_4$ and $\alpha_5$ tune the balance between economic and health-oriented objectives.
In the context of SVD-MIUKF-based SOH estimation, further penalty terms may be incorporated to reflect incremental capacity loss or the uncertainty associated with the SOH estimate (for instance, through the trace of the covariance matrix). This enables the controller to account for both the expected degradation trajectory and the confidence in the underlying health estimate. In Equations (11) and (14), the operating cost and emissions terms are defined as
$$ C_{\mathrm{op}}(k) = p_{\mathrm{grid}}(k) \, P_{\mathrm{grid}}(k) \, \Delta t, \qquad E_{\mathrm{CO_2}}(k) = \eta_{\mathrm{CO_2}}(k) \, P_{\mathrm{grid}}(k) \, \Delta t. \tag{15} $$
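Under these definitions, the per-step reward used by the agent can be assembled as in the sketch below; the weights are illustrative, export remuneration is deliberately ignored (an added simplifying assumption), and the optional temperature and cycling penalties of Equation (14) default to zero.

```python
def step_reward(p_grid_kw, price_gbp_per_kwh, co2_kg_per_kwh, d_deg,
                d_temp=0.0, d_cyc=0.0,
                weights=(1.0, 1.0, 1.0, 0.0, 0.0), dt_h=0.25):
    """Composite reward of Eqs. (11)/(14)-(15): negative weighted sum of cost,
    emissions, and degradation penalties (larger reward is better)."""
    grid_import = max(p_grid_kw, 0.0)                 # assumption: only imported energy is billed
    c_op = price_gbp_per_kwh * grid_import * dt_h     # Eq. (15), operating cost in GBP
    e_co2 = co2_kg_per_kwh * grid_import * dt_h       # Eq. (15), emissions in kg CO2
    a1, a2, a3, a4, a5 = weights
    return -(a1 * c_op + a2 * e_co2 + a3 * d_deg + a4 * d_temp + a5 * d_cyc)

# Example: importing 40 kW at peak tariff with an illustrative degradation index
r = step_reward(40.0, 0.26, 0.45, d_deg=0.02)
```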
The RL agent maximises the expected discounted return
$$ R_k = \mathbb{E}\!\left[ \sum_{i=0}^{\infty} \gamma^{i} \, r_{k+i} \right] \tag{16} $$
and updates the parameters via gradient ascent
$$ \theta_{k+1} = \theta_k + \eta \, \nabla_\theta \log \pi_\theta(a_k | s_k) \, \big( R_k - b_k \big), \tag{17} $$
where $\eta$ is the learning rate and $b_k$ a variance-reducing baseline.
The RL component of the H-RPEM is implemented using a Deep Deterministic Policy Gradient (DDPG) algorithm, which is well suited to continuous action spaces such as the battery power command $P_{\mathrm{batt}}(k)$. The actor network represents a deterministic policy $a_k = \pi_\theta(s_k)$, while the critic network approximates the action value function $Q_\phi(s_k, a_k)$. The deterministic policy gradient theorem yields
$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \nabla_a Q_\phi(s, a) \big|_{a = \pi_\theta(s)} \, \nabla_\theta \pi_\theta(s) \Big]. \tag{18} $$
The critic is trained by minimising the temporal difference loss:
$$ \mathcal{L}(\phi) = \big( Q_\phi(s_k, a_k) - y_k \big)^2, \qquad y_k = r_k + \gamma \, Q_{\phi^-}\!\big( s_{k+1}, \pi_{\theta^-}(s_{k+1}) \big), \tag{19} $$
where $\theta^-$ and $\phi^-$ denote the target network parameters. Exploration during training is implemented by adding zero-mean Ornstein–Uhlenbeck (OU) noise $\mathcal{N}_t$ to the deterministic actor output:
$$ a_k = \pi_\theta(s_k) + \mathcal{N}_t. \tag{20} $$
This formulation ensures that all RL components, including the actor and critic networks, the replay buffer, the target networks, and the exploration mechanism, are consistent with a standard DDPG framework.
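Two DDPG ingredients referenced above, the OU exploration noise of Equation (20) and the soft (Polyak) update of the target networks, can be sketched without any deep learning library; the noise parameters are conventional defaults, and the update rate follows the value $\tau = 0.005$ quoted in the implementation details later in this subsection.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise added to the deterministic actor output (Eq. 20)."""
    def __init__(self, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = 0.0

    def sample(self):
        dx = -self.theta * self.state * self.dt + self.sigma * np.sqrt(self.dt) * np.random.randn()
        self.state += dx
        return self.state

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging of target network parameters (lists of NumPy arrays)."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target_params, online_params)]

# Example: perturb a deterministic action and keep it within illustrative power limits (kW)
noise = OUNoise()
a_rl = np.clip(55.0 + 100.0 * noise.sample(), -100.0, 100.0)
```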
In parallel, the SMPC layer solves a finite-horizon optimisation problem
$$ \min_{P_{\mathrm{batt}}} \; \sum_{i=0}^{N_p - 1} \Big[ \alpha_1 C_{\mathrm{op}}(k+i) + \alpha_2 E_{\mathrm{CO_2}}(k+i) + \alpha_3 D_{\mathrm{deg}}(k+i) \Big], \tag{21} $$
subject to Equations (5)–(7). The final hybrid command integrates both strategies through adaptive blending
$$ P_{\mathrm{cmd}}(k) = (1 - \lambda_k) \, P_{\mathrm{RL}}(k) + \lambda_k \, P_{\mathrm{SMPC}}(k), \tag{22} $$
where $\lambda_k$ is updated via
$$ \lambda_{k+1} = \lambda_k + \mu \, \big( R_k - \bar{R} \big), \tag{23} $$
and $\bar{R}$ is the moving average reward. This hybridisation ensures stability and robustness under uncertainty while preserving adaptive learning.
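Once $P_{\mathrm{RL}}(k)$ and $P_{\mathrm{SMPC}}(k)$ are available, the fusion layer of Equations (22) and (23) reduces to a few lines; the adaptation gain $\mu$ and the clipping of $\lambda_k$ to $[0, 1]$ are illustrative assumptions added for safety in this sketch.

```python
def blend_command(p_rl_kw, p_smpc_kw, lam):
    """Hybrid command of Equation (22)."""
    return (1.0 - lam) * p_rl_kw + lam * p_smpc_kw

def update_lambda(lam, reward, reward_avg, mu=0.01):
    """Adaptation of the blending coefficient, Equation (23).

    Clipping to [0, 1] is an added safeguard (assumption), keeping the blend convex.
    """
    return float(min(max(lam + mu * (reward - reward_avg), 0.0), 1.0))

# Example: a reward above its moving average shifts weight toward the SMPC command
lam = update_lambda(0.30, reward=-0.8, reward_avg=-1.0)
p_cmd = blend_command(p_rl_kw=55.0, p_smpc_kw=48.0, lam=lam)
```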
The model was implemented in Python 3.11 using a 24 h horizon with 15 min steps ( T = 96 ). The RL agent employed an actor–critic structure (two hidden layers with 32 and 16 neurons, ReLU activation) trained over 200 episodes. The MPC horizon was set to N p = 12 (3 h). Electricity tariffs ranged from GBP 0.10/kWh (off-peak) to GBP 0.26/kWh (peak), and the carbon intensity varied between 0.30 and 0.55 kgCO2/kWh. Both the actor and critic networks followed the DDPG architecture, employing ReLU activation functions and soft target updates with τ = 0.005 . Training was conducted off-policy using mini-batches drawn from the replay buffer.
Figure 2 illustrates the comparative performance of the baseline rule-based control and the proposed H-RPEM. The smoother battery power trajectory in Figure 2a validates the action constraints in Equation (7), showing a ramp rate reduction of approximately 47%. The cumulative cost and emissions in Figure 2b,c are consistent with the cost function in Equation (11), exhibiting 13% and 9% reductions, respectively. Finally, the improved SOH trajectory in Figure 2d confirms the effectiveness of the degradation model in Equation (9), maintaining battery health above 0.96 throughout the day. These numerical correspondences establish a direct linkage between the mathematical framework and empirical outcomes, confirming the reliability of the proposed hybrid reinforcement–predictive energy management method. The SOH trajectory remains above 0.95 throughout, confirming the effectiveness of the degradation-aware optimisation.

2.4. Algorithmic Structure of the H-RPEM Controller

The overall training and deployment pipeline of the proposed H-RPEM integrates three computational layers (data-driven forecasting, RL, and SMPC) to achieve the adaptive, constraint-compliant, and health-aware coordination of the PV-EV MG. The process begins with the acquisition of pre-processed signals $w(t) = \{G(t), P_{\mathrm{PV}}(t), P_{\mathrm{load}}(t), P_{\mathrm{EV}}(t)\}$, as illustrated in Figure 1, where each signal is denoised, normalised through Equation (1), and decomposed into deterministic and stochastic components according to Equation (2). The computed standard deviations $(\sigma_G, \sigma_L, \sigma_{\mathrm{EV}}) = (0.21, 0.17, 0.24)$ quantify variability, defining uncertainty bounds for the SMPC forecasts, while the coefficients of variation and lag-1 autocorrelation obtained from Equation (4) determine the temporal persistence of each source.

The RL agent then operates within the multidimensional state–action space described by Equation (10), where the system state $s_k = [\mathrm{SOC}(k), P_{\mathrm{PV}}(k), P_{\mathrm{load}}(k), P_{\mathrm{EV}}(k), p_{\mathrm{grid}}(k), \mathrm{SOH}(k)]$ is mapped to an action $a_k = P_{\mathrm{batt}}(k)$ under a probabilistic policy $\pi_\theta(a_k | s_k)$. The reward function in Equation (11) penalises the economic cost $C_{\mathrm{op}}(k)$, carbon emissions $E_{\mathrm{CO_2}}(k)$, and battery degradation $D_{\mathrm{deg}}(k)$, all defined in Equation (15), thereby forming a composite multi-objective criterion that balances cost, sustainability, and asset health. During each training episode, the policy parameters $\theta$ are updated according to the gradient rule in Equation (17) using learning rate $\eta = 10^{-3}$ and discount factor $\gamma = 0.98$ until the relative cost improvement $\Delta J / J$ falls below $10^{-3}$.

Once a stable policy is obtained, the SMPC layer predicts the short-term evolution of stochastic disturbances $\hat{w}(k+i)$ over the prediction horizon $N_p = 12$ (3 h) using autoregressive forecasting. It solves the constrained optimisation problem in Equation (21) subject to the energy balance of Equation (5) and operational limits in Equation (7). The predictive layer produces a deterministic action $P_{\mathrm{SMPC}}(k)$ that ensures physical feasibility under uncertainty, while the RL policy generates an adaptive decision $P_{\mathrm{RL}}(k)$ optimised for long-term objectives. These are combined through the hybrid fusion rule in Equation (22), where the blending coefficient $\lambda_k$ evolves dynamically via Equation (23) based on episodic performance feedback; positive deviations ($R_k > \bar{R}$) strengthen the predictive component, while negative deviations favour the learned RL policy.

During deployment, this adaptive blending provides real-time robustness to non-stationary conditions, such as the irradiance fluctuations shown in Figure 1a or EV charging surges in Figure 1c. The controller continuously updates the SOH using Equation (9), which incorporates the degradation cost term from Equation (8), thereby penalising high-C-rate or deep-cycle operations and preventing accelerated battery ageing. In the present implementation, this update is driven by the empirical surrogate model introduced in Section 2.2; however, the H-RPEM architecture is modular in the sense that Equation (9) can be replaced by any external SOH estimator, including data-driven schemes based on long short-term memory (LSTM) autoencoders or advanced filtering approaches such as SVD-based unscented Kalman variants.
In such a configuration, the controller would simply consume the externally estimated SOH(k) and its associated confidence as inputs to the state vector and reward design, while preserving the overall RL-SMPC structure. As shown in Figure 2a, this yields a 47% reduction in the ramp rate relative to the baseline control, while the cumulative operating cost and CO2 emissions in Figure 2b,c decrease by 13% and 9%, respectively, consistent with the objective formulation of Equation (11). The improved battery SOH trajectory in Figure 2d confirms that H-RPEM maintains SOH > 0.96 across the 24 h horizon, in contrast to 0.89 under rule-based control, demonstrating the practical effectiveness of the health-aware reinforcement–predictive paradigm. Overall, the unified algorithm transforms the multi-objective problem of cost, emissions, and degradation into a mathematically traceable, data-informed control strategy capable of resilient and sustainable operation in PV-EV MGs.
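For orientation only, the per-interval deployment loop described above can be summarised schematically as follows; the actor and SMPC solver are replaced by trivial placeholder functions, so the sketch conveys the order of operations rather than the actual optimisers or tuned parameters.

```python
# Schematic deployment loop of H-RPEM (placeholder models, illustrative values only)
def rl_policy(state):
    return 0.5 * state["p_pv"] - 0.5 * state["p_load"]       # stand-in for the DDPG actor

def smpc_action(state):
    return 0.4 * state["p_pv"] - 0.6 * state["p_load"]       # stand-in for the SMPC solution

lam, reward_avg = 0.30, 0.0
state = {"soc": 0.5, "soh": 1.0, "p_pv": 80.0, "p_load": 60.0, "p_ev": 30.0, "price": 0.26}

for k in range(96):                                           # one day at 15-min resolution
    p_rl, p_smpc = rl_policy(state), smpc_action(state)
    p_cmd = (1 - lam) * p_rl + lam * p_smpc                   # Eq. (22)
    p_grid = state["p_load"] + state["p_ev"] - state["p_pv"] - p_cmd   # Eq. (5)
    # ... here the SOC/SOH would be stepped via Eqs. (6), (8), (9) ...
    reward = -state["price"] * max(p_grid, 0.0) * 0.25        # simplified import-cost reward
    lam = min(max(lam + 0.01 * (reward - reward_avg), 0.0), 1.0)       # Eq. (23), clipped
    reward_avg = 0.9 * reward_avg + 0.1 * reward              # moving average of the reward
```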
Figure 3 illustrates the operational workflow of the proposed H-RPEM. The process begins with input data processing, where the PV, load, and EV data are cleansed and normalised to remove anomalies. Stochastic disturbances are then estimated, and statistical features such as coefficients of variation are extracted to capture the uncertainty and variability of renewable generation and demand. These processed features are combined to construct a comprehensive state vector that represents the dynamic operating conditions of the MG. The reinforcement learning (RL) module operates under this stochastic environment to learn control policies that balance exploration and exploitation by interacting with the system and optimising the long-term reward. In parallel, the SMPC module performs short-term forecasting and determines deterministic control actions through constrained optimisation. The RL and SMPC outputs are integrated in the hybrid decision layer using an adaptive blending coefficient that adjusts according to the system’s real-time performance and operational uncertainty. The resulting command regulates energy flow while dynamically updating the SOC and SOH of the battery system. Performance evaluation aggregates cost, emissions, and energy efficiency metrics, which are compared with those of baseline models. If the performance criteria are not satisfied, a feedback signal labelled “update state representation using performance feedback” is sent back to the construct state vector stage, allowing the controller to refine its internal model and adapt its policy over time. This feedback loop enables continuous learning and resilience, ensuring optimal energy management and extended battery longevity in PV-EV MG applications as illustrated in Algorithm 1.
Algorithm 1: Health-Aware Hybrid Reinforcement–Predictive Energy Manager

3. Results and Discussion

This section presents the quantitative assessment of the proposed H-RPEM applied to the PV-EV MG. The performance evaluation covers three main dimensions: economic efficiency, environmental sustainability, and battery health preservation. The results are derived from ten consecutive daily simulations, each consisting of 96 decision intervals (15 min resolution) over a 24 h period, producing a total of 960 control samples for each strategy. All experiments were conducted under identical meteorological and demand conditions using a one-year average solar irradiance and dynamic grid tariff dataset. The main parameters used in the simulation and control environment are summarised in Table 1. These values correspond to realistic configurations of medium-scale MGs and commercial lithium-ion battery systems.
The RL agent was trained using 180 historical days randomly sampled (with replacement) from the reconstructed dataset, ensuring the coverage of a representative range of solar conditions, tariff variability, and EV arrival patterns. The remaining 20 days constituted the validation and test pool. All performance results reported in this section, including those in Table 2, correspond to ten unseen test days drawn from this held-out set. There is no temporal overlap between the training and testing intervals. The reference to a “one-year average irradiance and tariff” indicates that the tariff and irradiance profiles used to parameterise the environment follow a representative annual diurnal pattern, while the RL agent interacts only with the daily subsequences generated by this model.
The parameter configuration above defines the operational space in which the three control algorithms, i.e., baseline, SMPC, and H-RPEM, were compared. The baseline represents a rule-based scheduler prioritising immediate load satisfaction; SMPC optimises short-term forecasts over a finite horizon; and H-RPEM combines RL and SMPC to balance learning adaptability and constraint robustness. Each method was initialised with an identical SOC = 0.5 and SOH = 1.0 to ensure a fair comparison. The outputs included cost, emissions, renewable utilisation, SOH degradation, and peak demand indicators, computed following Equations (11)–(23). The grid purchasing price is denoted by $p_{\mathrm{grid}}(k)$ (unit: GBP/kWh), while the grid exchange power is denoted by $P_{\mathrm{grid}}(k)$ (unit: kW).

3.1. Quantitative Evaluation of Economic, Environmental, and Health-Aware Performance

The quantitative comparison of the three controllers is illustrated in Figure 4a–d and summarised in Table 2. Across the ten simulated operating days, H-RPEM consistently achieved substantial gains relative to the baseline controller. The average daily operating cost decreased from 1.325 to 0.761 (GBP), corresponding to a 42.6% reduction. CO2 emissions decreased from 1.293 to 0.792 kg (a 38.7% reduction), while renewable utilisation increased by 12.2% (from 37.8% to 42.4%). The battery SOH improved from 0.891 to 0.958 (a 7.5% enhancement), and the peak demand decreased by 24.7%. These values align directly with Table 2 and form the basis for all percentage comparisons reported throughout the paper.
The environmental performance followed a similar pattern. CO2 emissions decreased from 1.293 kg in the baseline to 0.792 kg in H-RPEM, corresponding to a 38.7% reduction. The improvement is directly linked to enhanced PV utilisation and reduced grid dependency. The renewable energy penetration increased from 37.8 to 42.4%, as shown in Figure 4d, demonstrating better temporal coordination between local generation and the EV charging demand. The SMPC and RL hybridisation allowed the controller to forecast high-irradiance intervals and delay charging accordingly, increasing PV self-consumption by 12.3% relative to the baseline. All quantities in this figure are normalised using min–max scaling with respect to the baseline controller:
$$ x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}. \tag{24} $$
Battery health analysis reveals the most significant difference between the control strategies. The baseline operation led to a final SOH of 0.891 after one simulated day, whereas SMPC and H-RPEM maintained values of 0.941 and 0.958, respectively. The improvement of 6.7 percentage points between the baseline and H-RPEM confirms that the degradation-aware penalty term in Equation (8) effectively mitigates aggressive charging behaviour. The equivalent full cycle (EFC) count decreased from 2.41 to 1.92, resulting in an 18% extension in the expected battery lifespan. Additionally, the cumulative peak demand was reduced by 9.7%, supporting improved grid-side stability. The normalisation is applied across all control strategies to allow the direct comparison of temporal patterns.
To make the notion of lifespan extension explicit, we compute the equivalent full cycle (EFC) count over each simulated day as
$$ \mathrm{EFC} = \frac{1}{E_{\max}} \sum_k \big| \Delta E(k) \big|. \tag{25} $$
The baseline, SMPC, and H-RPEM strategies yield average daily EFC values of 2.41 , 2.12 , and 1.92 , respectively. Interpreting the battery’s useful life as inversely proportional to cumulative EFC consumption, H-RPEM achieves an 18.0 % reduction in daily equivalent cycle usage relative to the baseline. This corresponds to a proportional extension of the expected battery life under the assumed operating pattern.
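The daily EFC of Equation (25) follows directly from the sequence of battery energy exchanges, as in the minimal computation below (the throughput profile is synthetic and purely illustrative).

```python
import numpy as np

def daily_efc(delta_e_kwh, e_max_kwh=200.0):
    """Equivalent full cycles over one day, Equation (25)."""
    return np.sum(np.abs(delta_e_kwh)) / e_max_kwh

# Example: 96 intervals with alternating +-10 kWh energy steps gives EFC = 4.8 for a 200 kWh pack
efc = daily_efc(10.0 * np.where(np.arange(96) % 2 == 0, 1.0, -1.0))
```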
Figure 5 presents the correlation structure between all key performance indicators. A strong positive correlation ($r = 0.95$) was observed between the operating cost and peak power, indicating that economic optimisation inherently promotes peak load reduction. The negative correlation ($r = -0.96$) between the SOH and cost implies that maintaining battery health is not in conflict with financial optimisation but rather complementary to it. Similarly, the positive correlation between renewable utilisation and the SOH ($r = 0.88$) suggests that charging from PV energy inherently reduces degradation by limiting deep discharge cycles. The perfect correlation between cost and CO2 emissions ($r = 1.00$) highlights the alignment between economic and environmental objectives achieved by the hybrid structure.
Table 2 consolidates these outcomes. Compared with the baseline control, H-RPEM achieved a 23.9% reduction in cost, a 20.8% drop in CO2 emissions, a 6.7% SOH improvement, a 9.7% reduction in peak load, and a 12.3% increase in renewable utilisation. This corresponds to daily savings of approximately 16.4 kWh and an annualised reduction of 5.99 MWh in grid imports. The results confirm that the proposed hybrid framework achieves multi-objective optimisation, balancing economic, environmental, and technical objectives in a single, unified control policy.

3.2. Sensitivity and Robustness Analysis of the H-RPEM Controller

To evaluate the contributions of individual components and the overall robustness of the hybrid control framework, an ablation and sensitivity study was conducted. Each variant of the controller was obtained by systematically disabling one or more design terms in the reward and blending structure defined in Equations (11)–(23). The evaluation metrics include the daily cost, CO2 emissions, and battery SOH, all normalised with respect to the full H-RPEM configuration.
The ablation results shown in Figure 6 highlight the relative impact of model simplifications. Removing the degradation-aware term ($\kappa_{\mathrm{deg}} = 0$) resulted in a 21.6% reduction in SOH compared with the full hybrid model, while the cost and CO2 indices increased by 17.4% and 15.8%, respectively. When the PV awareness module was excluded (i.e., irradiance and forecast features removed from the state vector), renewable utilisation dropped by 11.9%, and the cost rose by 14.3%, confirming that PV prediction directly contributes to both economic and environmental gains. Finally, disabling the explicit health term from the objective function increased the daily cost to 0.894 (normalised) and degraded the SOH to 0.902, as summarised in Table 3. These findings indicate that all three components (degradation modelling, PV-aware forecasting, and the health-oriented penalty design) act synergistically to improve the multi-objective performance of the controller.
To further assess controller resilience, a robustness analysis was performed under simultaneous perturbations of key input uncertainties: a ±10% variation in tariff levels, a ±20% fluctuation in solar irradiance, and a ±15% change in the degradation coefficient $\kappa_{\mathrm{deg}}$. The hybrid blending parameter $\lambda$ and the RL learning rate $\beta$ were jointly tuned within the range $[0, 1]$ to identify the stability envelope of the system. The resulting optimal region, indicated in Figure 7 and quantified in the dataset, corresponds to $\lambda \approx 0.05$, $\beta \approx 0.10$, and $\gamma \approx 0.76$. Each axis represents a performance metric normalised to the best-performing method (H-RPEM = 1.000). Values such as cost = 0.894 and CO2 = 0.902 are normalised indices derived from Table 3. This configuration achieves complete stability against 25% disturbance amplitudes, maintaining all objective indices within ±5% of nominal operation. These values are dimensionless normalised scores, where 1.0 denotes the highest performance among the compared controllers.
The radar representation in Figure 7 shows that the hybrid controller retains a near-balanced triangular profile across the cost, CO2, and SOH axes, confirming uniform robustness. The worst-case deviation in the cost metric was only +7.4% under irradiance uncertainty, while the SOH remained at unity owing to the adaptive update of λ ( k ) in Equation (23). These results demonstrate that the reinforcement–predictive coupling enables the controller to maintain feasible, stable operation even under stochastic environmental or economic disturbances. Consequently, the proposed H-RPEM ensures robust, sustainable performance, supporting its deployment in real-world PV-EV MGs with dynamic energy markets and uncertain renewable profiles. Although the robustness tests deliberately perturbed key exogenous variables such as tariffs, irradiance, and degradation coefficients, thereby mimicking some effects of operating under alternative climatic and market conditions, this analysis does not replace a full multi-site validation. In particular, explicitly modelling diverse V2G participation patterns and geographically distinct price structures will require the integration of additional real-world datasets from campuses and city-scale EV fleets, which is left for future work.

3.3. Learning Dynamics and Convergence Behaviour of the Hybrid Controller

The convergence characteristics of H-RPEM were analysed to quantify the influence of the reinforcement–predictive blending coefficient $\lambda$ and the degradation weighting factor $\beta$ on the learning efficiency and stability. Figure 8, Figure 9 and Figure 10 summarise the training evolution of the agent during 200 episodes, where each episode represents a full 24 h control cycle. The average reward $\bar{R}(k)$ and its variance $\sigma_R^2(k)$ were computed following Equation (11) over time to monitor learning progress and robustness.
The reward convergence patterns in Figure 8 demonstrate that intermediate blending values yield the best trade-off between exploration and exploitation. For λ = 0.30 and β = 0.50 , the mean reward rapidly increases from 0.12 in the first 10 episodes to 0.92 after 120 episodes, reaching a steady plateau near unity by episode 160. This configuration converges approximately 25 episodes faster than the higher-learning-rate case ( λ = 0.45 , β = 0.80 ) and 40 episodes faster than the conservative setting ( λ = 0.15 , β = 0.30 ). The improvement in convergence speed indicates that the adaptive coupling between the RL and predictive components enables efficient policy refinement without overfitting to short-term fluctuations. Across all settings, the steady-state reward remains within 0.86–1.02, confirming the consistency of long-term policy optimisation.
Figure 9 presents the variance of the reward during training, serving as an indicator of learning stability. A clear downward trend is observed across all parameter sets: the variance decreases from initial levels around $1.0 \times 10^{-2}$ to below $2.0 \times 10^{-3}$ by episode 100, reflecting the diminishing uncertainty in policy updates. The optimal configuration ($\lambda = 0.30$, $\beta = 0.50$) exhibits the lowest steady-state variance (mean $\sigma_R^2 = 1.6 \times 10^{-3}$), whereas the aggressive learning setting ($\lambda = 0.45$, $\beta = 0.80$) results in intermittent spikes, reaching $6.0 \times 10^{-3}$ due to overreactive gradient updates. Conversely, the conservative configuration ($\lambda = 0.15$, $\beta = 0.30$) converges smoothly but more slowly, with an average variance of $2.4 \times 10^{-3}$. These results confirm that moderate parameterisation balances learning stability with adaptation speed, yielding robust and repeatable convergence.
To further examine the sensitivity across the full parameter space, a two-dimensional convergence heatmap was generated (Figure 10), showing the episode number required for convergence as a function of λ and β . Convergence was defined when the moving average reward reached 95 % of its final value for at least ten consecutive episodes. The fastest convergence (66 episodes) occurs at λ = 0.50 , β = 0.28 , whereas excessively low or high λ values cause delays exceeding 120 episodes. The central region ( λ = 0.28 –0.35, β = 0.45 –0.60) corresponds to stable training behaviour with convergence between 80 and 90 episodes, confirming that moderate coupling between predictive foresight and RL is essential for efficient adaptation. The lower-right corner of the heatmap ( λ > 0.4 , β < 0.3 ) shows oscillatory dynamics, attributed to the dominance of the short-term reinforcement component over the predictive term in Equation (23).
Overall, these findings verify that the proposed hybrid training strategy achieves both rapid convergence and low reward variance, ensuring policy generalisation across varying tariff, irradiance, and load conditions. The identified optimum around $\lambda \approx 0.30$ and $\beta \approx 0.50$ is consistent with the robustness envelope observed in Figure 7, confirming the internal coherence between the learning dynamics and the energy-system-level performance metrics. This coupling establishes reproducible and explainable learning behaviour suitable for real-time MG deployment.

3.4. Limitations

While the proposed H-RPEM demonstrates strong potential in enhancing operational sustainability in PV-EV MGs, several limitations should be acknowledged. First, the experimental validation was conducted using a single university campus PV-EV microgrid dataset, which constrains the direct generalisability of the findings to other geographical locations and climate zones and larger-scale smart city infrastructures. The meteorological, tariff, and mobility patterns embedded in the public EV Charging Patterns dataset [33] represent a single climatic regime and tariff structure, and they do not explicitly capture heterogeneous V2G behaviours across multiple sites. As a result, the present results should be interpreted as a proof-of-concept demonstration of H-RPEM rather than a comprehensive scalability study from campus- to city-level systems. Nevertheless, the proposed formulation is dataset- and scale-agnostic in the sense that different climate conditions, dynamic tariff profiles, and V2G participation patterns enter the framework through the stochastic input set w(t) and the reward/constraint parametrisation, without altering the underlying control architecture. Extending the analysis to multi-campus and city-scale case studies in contrasting climate zones, with heterogeneous dynamic tariffs and explicit V2G fleet models, is an important direction of our ongoing and future work. Second, the battery degradation model, although health-aware, relies on empirical coefficients that approximate electrochemical ageing processes rather than explicitly modelling solid–electrolyte interface dynamics, temperature-dependent diffusion, or the detailed coupling between SOC trajectories and SOH evolution. Advanced online SOH estimation methods, such as LSTM-based autoencoders trained on high-resolution electrothermal data or SVD-enhanced unscented Kalman filters, can, in principle, provide more accurate and physically informed SOH states for control. These approaches were not implemented here because the public microgrid dataset does not contain the necessary cell-level voltage, current, and temperature measurements and because the full design, training, and validation of such estimators would constitute a separate study. Future work will therefore focus on integrating and benchmarking these higher-fidelity SOH observers within H-RPEM, quantifying how improved SOH estimation propagates into economic, environmental, and degradation outcomes. Third, the hybrid controller was tested in a simulation environment using both historical and synthetic stochastic data; therefore, a hardware-in-the-loop or real-time field implementation is required to evaluate the communication latency, control reliability, and cyber-physical integration aspects. Finally, the current optimisation framework focuses on single-agent control and does not yet address multi-agent interactions or peer-to-peer energy trading among distributed MGs, which are increasingly relevant in smart city architectures. In particular, the present formulation treats the EV fleet as an aggregated load/storage entity and therefore does not incorporate decentralised decision-making, cooperative or competitive V2G strategies, or multi-agent interactions among heterogeneous EVs. Extending the H-RPEM architecture to a multi-agent setting, where individual EVs operate as V2G-enabled agents with their own preferences, arrival patterns, and state trajectories, represents a natural and necessary step for future development. 
Addressing these limitations will be the focus of future work aimed at improving the scalability, interoperability, and cyber-resilience of health-aware energy management systems.

4. Conclusions

This research presents a unified hybrid control framework for the sustainable management of PV-EV MGs. The developed H-RPEM integrates stochastic predictive optimisation with RL to enhance decision-making across multiple time horizons. The methodology emphasises both short-term adaptability to renewable intermittency and the long-term preservation of the battery SOH, introducing adaptive learning weights that regulate the interaction between predictive and data-driven components. Extensive experimental validation confirmed that the hybrid controller delivers superior performance compared with baseline and purely predictive benchmarks. H-RPEM achieved a 42.6% reduction in operating costs, a 38.7% decrease in CO2 emissions, a 7.5% improvement in the SOH, and a 12.2% increase in renewable utilisation, based on daily averages over ten stochastic operating days. These results confirm that the controller delivers simultaneous economic, environmental, and battery health benefits. Statistical correlation analysis demonstrated that economic savings and degradation mitigation are strongly coupled, highlighting the internal consistency of the health-aware reward formulation. The system maintained operational robustness under stochastic variations in solar irradiance and tariff fluctuations, verifying its real-time applicability. The proposed architecture provides a flexible and generalisable platform for intelligent V2G management, scalable from MGs to district-level smart urban infrastructures. Future research will explore multi-agent coordination, digital twin-based predictive adaptation, and integration with blockchain-enabled peer-to-peer energy trading mechanisms further to enhance security and decentralised resilience in sustainable mobility ecosystems. In parallel, future work will deploy the proposed H-RPEM on datasets from different climate zones and system scales, including multi-campus and city-level PV-EV networks with diverse dynamic tariff structures and V2G participation patterns, in order to quantify rigorously its scalability and transferability. A further component of ongoing work will involve extending the proposed controller to a multi-agent configuration in which V2G-enabled EV fleets participate as distributed decision-making units. Such a formulation would enable coordinated or competitive behaviour under dynamic tariffs, enhance microgrid flexibility, and support the scalable integration of large EV populations within smart city environments.

Author Contributions

Conceptualisation, M.C. and M.B.; methodology, M.C.; software, M.C.; validation, M.C. and M.B.; formal analysis, M.C.; investigation, M.C.; resources, M.C.; writing-original draft preparation, M.C.; writing-review and editing, M.C. and M.B.; visualisation, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets and source code are publicly available at https://github.com/cavusmuhammed68/Batteries_Health, accessed on 21 December 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. International Energy Agency (IEA). Global EV Outlook 2023: Catching up with Climate Ambitions; IEA: Paris, France, 2023. Available online: https://www.iea.org/reports/global-ev-outlook-2023 (accessed on 12 November 2025).
  2. Manousakis, N.M.; Karagiannopoulos, P.S.; Tsekouras, G.J.; Kanellos, F.D. Integration of renewable energy and electric vehicles in power systems: A review. Processes 2023, 11, 1544.
  3. Alhawsawi, E.Y.; Salhein, K.; Zohdy, M.A. A comprehensive review of existing and pending university campus microgrids. Energies 2024, 17, 2425.
  4. Bullich-Massagué, E.; Cifuentes-García, F.J.; Glenny-Crende, I.; Cheah-Mañé, M.; Aragüés-Peñalba, M.; Díaz-González, F.; Gomis-Bellmunt, O. A review of energy storage technologies for large-scale photovoltaic power plants. Appl. Energy 2020, 274, 115213.
  5. Cavus, M.; Allahham, A.; Adhikari, K.; Zangiabadia, M.; Giaouris, D. Control of microgrids using an enhanced Model Predictive Controller. In Proceedings of the 11th International Conference on Power Electronics, Machines and Drives (PEMD 2022), Newcastle, UK, 21–23 June 2022; pp. 660–665.
  6. Ahmad, S.; Shafiullah, M.; Ahmed, C.B.; Alowaifeer, M. A review of microgrid energy management and control strategies. IEEE Access 2023, 11, 21729–21757.
  7. Baillieul, J.; Samad, T. (Eds.) Encyclopedia of Systems and Control, 2nd ed.; Springer: London, UK, 2021.
  8. Mayne, D.Q.; Rawlings, J.B.; Rao, C.V.; Scokaert, P.O.M. Constrained model predictive control: Stability and optimality. Automatica 2000, 36, 789–814.
  9. Schwenzer, M.; Ay, M.; Bergs, T.; Abel, D. Review on model predictive control: An engineering perspective. Int. J. Adv. Manuf. Technol. 2021, 117, 1327–1349.
  10. Joshal, K.S.; Gupta, N. Microgrids with model predictive control: A critical review. Energies 2023, 16, 4851.
  11. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  12. Zhou, K.; Zhou, K.; Yang, S. Reinforcement learning-based scheduling strategy for energy storage in microgrid. J. Energy Storage 2022, 51, 104379.
  13. Yu, P.; Zhang, H.; Song, Y.; Wang, Z.; Dong, H.; Ji, L. Safe reinforcement learning for power system control: A review. Renew. Sustain. Energy Rev. 2025, 223, 116022.
  14. François-Lavet, V.; Taralla, D.; Ernst, D.; Fonteneau, R. Deep reinforcement learning solutions for energy microgrids management. In Proceedings of the European Workshop on Reinforcement Learning (EWRL 2016), Barcelona, Spain, 3–4 December 2016. Available online: https://dial.uclouvain.be/pr/boreal/object/boreal:177152 (accessed on 10 November 2025).
  15. Gao, Y.; Matsunami, Y.; Miyata, S.; Akashi, Y. Multi-agent reinforcement learning dealing with hybrid action spaces: A case study for off-grid oriented renewable building energy system. Appl. Energy 2022, 326, 120021.
  16. Islam, K.S.; Dutta, A.; Mruthyunjaya, S. Hybrid ML–RL approach for smart grid stability prediction and optimized control strategy. arXiv 2025, arXiv:2508.19541.
  17. Ahsan, F.; Dana, N.H.; Sarker, S.K.; Li, L.; Muyeen, S.M.; Ali, M.F.; Tasneem, Z.; Hasan, M.M.; Abhi, S.H.; Islame, M.R.; et al. Data-driven next-generation smart grid towards sustainable energy evolution: Techniques and technology review. Prot. Control Mod. Power Syst. 2023, 8, 43.
  18. Birkl, C.R.; Roberts, M.R.; Howey, D.A. Degradation diagnostics for lithium-ion cells. J. Power Sources 2017, 341, 373–386.
  19. Fan, Y.; Xiao, F.; Li, C.; Yang, G.; Tang, X. A novel deep learning framework for state of health estimation of lithium-ion battery. J. Energy Storage 2020, 32, 101741.
  20. Lin, X.; Xi, L.; Wang, Z. Battery degradation-aware energy management strategy with driving pattern severity factor feedback correction algorithm. J. Clean. Prod. 2024, 450, 141969.
  21. Le, C.N.; Vinayagam, A.; Tran, P.T.; Stojcevski, S.; Dinh, T.N.; Stojcevski, A.; Chandran, J. State of health aware adaptive scheduling of battery energy storage system charging and discharging in rural microgrids using long short-term memory and convolutional neural networks. Energies 2025, 18, 5641.
  22. Yi, H.; Du, Z.; Chen, H.; Zhang, K. Multi-objective optimization framework for PEMFC hybrid marine power systems: Integrating dynamic lifetime degradation and energy management. Ocean Eng. 2025, 340, 122248.
  23. Bai, Y.; Li, J.; He, H.; Dos Santos, R.C.; Yang, Q. Optimal design of a hybrid energy storage system in a plug-in hybrid electric vehicle for battery lifetime improvement. IEEE Access 2020, 8, 142148–142158.
  24. Wang, L.; Li, Q.; Ding, R.; Sun, M.; Wang, G. Integrated scheduling of energy supply and demand in microgrids under uncertainty: A robust multi-objective optimization approach. Energy 2017, 130, 1–14.
  25. Hasani, R.; Mohammadi, M.; Samanfar, A. Integrated multiobjective energy management for a smart microgrid incorporating electric vehicle charging stations and demand response programs under uncertainty. Int. J. Energy Res. 2025, 1, 9531493.
  26. Zheng, Y.; Jia, J.; An, D. Energy management for microgrids with hybrid hydrogen–battery storage: A reinforcement learning framework integrated multi-objective dynamic regulation. Processes 2025, 13, 2558.
  27. Pan, W.; Yu, X.; Guo, Z.; Qian, T.; Li, Y. Online EVs vehicle-to-grid scheduling coordinated with multi-energy microgrids: A deep reinforcement learning-based approach. Energies 2024, 17, 2491.
  28. Korkas, C.D.; Tsaknakis, C.D.; Kapoutsis, A.C.; Kosmatopoulos, E. Distributed and multi-agent reinforcement learning framework for optimal electric vehicle charging scheduling. Energies 2024, 17, 3694.
  29. Xie, H.; Song, G.; Shi, Z.; Zhang, J.; Lin, Z.; Yu, Q.; Fu, H.; Song, X.; Zhang, H. Reinforcement learning for vehicle-to-grid: A review. Adv. Appl. Energy 2025, 17, 100214.
  30. Cavus, M.; Bell, M. Enabling smart grid resilience with deep learning-based battery health prediction in EV fleets. Batteries 2025, 11, 283.
  31. Cavus, M.; Ayan, H.; Dissanayake, D.; Sharma, A.; Deb, S.; Bell, M. Forecasting electric vehicle charging demand in smart cities using hybrid deep learning of regional spatial behaviours. Energies 2025, 18, 3425.
  32. Khayat, Y.; Shafiee, Q.; Heydari, R.; Naderi, M.; Dragičević, T.; Simpson-Porco, J.W.; Dörfler, F.; Fathi, M.; Blaabjerg, F.; Guerrero, J.M.; et al. On the secondary control architectures of AC microgrids: An overview. IEEE Trans. Power Electron. 2019, 35, 6482–6500.
  33. Valakhorrasani, A. Electric Vehicle Charging Patterns. Kaggle Dataset. 2023. Available online: https://www.kaggle.com/datasets/valakhorasani/electric-vehicle-charging-patterns (accessed on 11 November 2025).
  34. Lin, Y.; Zhou, L.; Yan, J.; He, S. A hybrid data-driven model for state of health estimation of lithium-ion batteries with capacity recovery. Renew. Energy 2025, in press.
  35. Apribowo, C.H.B.; Ashidqi, M.D.; Nizam, M.; Purwanto, A. Data-driven modeling of lithium-ion battery degradation using XGBoost with extended Kalman filter-based internal resistance correction. Results Eng. 2025, 28, 108100.
Figure 1. Smoothed energy dynamics of the PV-EV MG.
Figure 2. Comparison between baseline and proposed H-RPEM controller.
Figure 3. Flowchart representation of the proposed H-RPEM.
Figure 4. Benchmark comparison of control strategies. The proposed H-RPEM achieves the lowest cost and CO2 emissions while maximising the SOH and renewable usage.
Figure 5. Inter-metric correlation matrix among economic, environmental, and technical indicators. A strong negative correlation exists between cost and SOH (r = −0.96), indicating that health-aware optimisation supports economic performance, while cost and CO2 emissions are perfectly correlated (r = 1.00).
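To illustrate how an inter-metric correlation matrix of the kind shown in Figure 5 can be computed, a minimal Python sketch is given below. The daily metric values are illustrative placeholders rather than the study's data; only the Pearson-correlation procedure is relevant.

import pandas as pd

# Hypothetical per-day summary metrics (placeholder values, not the study's data).
metrics = pd.DataFrame({
    "cost_gbp":  [0.74, 0.78, 0.76, 0.75, 0.79],
    "co2_kg":    [0.77, 0.81, 0.79, 0.78, 0.82],
    "soh":       [0.959, 0.957, 0.958, 0.958, 0.956],
    "renewable": [0.43, 0.41, 0.42, 0.43, 0.40],
})

# Pairwise Pearson coefficients; a strongly negative cost-SOH entry corresponds
# to the relationship reported in Figure 5.
print(metrics.corr(method="pearson").round(2))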
Figure 6. Ablation study of model components. Performance degradation is quantified when removing (i) the degradation term, (ii) PV awareness, and (iii) the health-term penalty from the reward function. The full H-RPEM configuration consistently achieves superior cost, emission, and SOH outcomes.
Figure 7. Robustness radar plot of the full H-RPEM across key performance metrics. The near-equilateral profile indicates balanced economic, environmental, and technical robustness under ±20% stochastic disturbances in tariff, irradiance, and degradation coefficients.
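The ±20% disturbance setting described for Figure 7 can be emulated with a simple Monte Carlo loop; a hedged sketch is shown below. The evaluate_cost function, its coefficients, and the nominal values are stand-ins introduced for illustration only and do not reproduce the H-RPEM simulation.

import numpy as np

rng = np.random.default_rng(seed=42)

# Nominal inputs (illustrative placeholders).
nominal = {"tariff": 0.18, "irradiance": 650.0, "kappa_deg": 1e-3}

def evaluate_cost(tariff, irradiance, kappa_deg):
    # Stand-in for a full closed-loop simulation returning a daily cost metric.
    return 0.76 * (tariff / 0.18) - 0.05 * (irradiance / 650.0) + 50.0 * kappa_deg

costs = []
for _ in range(500):
    # Draw each disturbed parameter uniformly within +/-20% of its nominal value.
    sample = {k: v * rng.uniform(0.8, 1.2) for k, v in nominal.items()}
    costs.append(evaluate_cost(**sample))

print(f"mean cost {np.mean(costs):.3f}, std {np.std(costs):.4f}")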
Figure 8. Reward convergence behaviour across 200 training episodes for different (λ, β) configurations. The setting λ = 0.30, β = 0.50 achieves the fastest and most stable convergence, reaching a steady-state reward near unity by episode 160.
Figure 9. Learning stability analysis showing the variance of the reward signal. The proposed hybrid parameterisation (λ = 0.30, β = 0.50) maintains the lowest steady-state variance (1.6 × 10⁻³), indicating robust and smooth policy updates during training.
Figure 10. Convergence speed heatmap as a function of the reinforcement–predictive blending weight (λ) and the degradation weighting (β). Darker regions correspond to faster convergence. The optimum lies around λ ∈ [0.28, 0.35], β ∈ [0.45, 0.60], confirming the efficiency of moderate hybrid coupling.
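The roles of the blending weight λ and the degradation weight β examined in Figures 8–10 can be summarised in a generic form: λ mixes the reinforcement-learning action with the predictive-control action, and β scales the battery-degradation penalty in the reward. The sketch below is an illustrative formulation under these assumptions, not the exact H-RPEM controller.

def blended_action(a_rl, a_mpc, lam=0.30):
    # lam weights the reinforcement-learning proposal against the predictive one.
    return lam * a_rl + (1.0 - lam) * a_mpc

def step_reward(cost, co2, soh_loss, beta=0.50):
    # beta scales the battery-degradation penalty relative to cost and emissions.
    return -((1.0 - beta) * (cost + co2) + beta * soh_loss)

# Example: a single decision step with illustrative values.
a = blended_action(a_rl=12.0, a_mpc=8.0)            # kW setpoint proposals
r = step_reward(cost=0.05, co2=0.04, soh_loss=0.002)
print(a, r)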
Table 1. Simulation and control parameters used in performance evaluation.
Parameter | Symbol/Value | Unit
Battery energy capacity | E_max = 200 | kWh
Maximum charge/discharge power | P_max = 60 | kW
Charge/discharge efficiency | η_ch = η_dis = 0.96 | –
Sampling interval | Δt = 0.25 | h
Discount factor | γ = 0.98 | –
Learning rate | η = 10⁻³ | –
Prediction horizon | N_p = 12 | steps
Price tariff range | 0.10–0.26 | GBP/kWh
CO2 emission intensity | 0.30–0.55 | kg kWh⁻¹
Degradation scaling factor | κ = 10⁻³ | –
PV temperature coefficient | γ_T = 0.0045 | per °C
PV area | A_PV = 420 | m²
PV nominal efficiency | η_PV = 0.18 | –
Reinforcement–predictive weight | λ_0 = 0.5 | –
Simulation horizon | T = 24 | h
Time-varying electricity price | p_grid | GBP/kWh
Grid-imported power | P_grid | kW
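For reproducibility, the entries of Table 1 map directly onto a configuration object. The sketch below collects them in a Python dataclass; the field names are illustrative choices rather than identifiers taken from the study's code.

from dataclasses import dataclass

@dataclass
class SimulationConfig:
    # Battery and converter limits (Table 1).
    e_max_kwh: float = 200.0
    p_max_kw: float = 60.0
    eta_ch: float = 0.96
    eta_dis: float = 0.96
    # Control and learning settings.
    dt_h: float = 0.25
    gamma: float = 0.98            # discount factor
    learning_rate: float = 1e-3
    horizon_steps: int = 12        # prediction horizon N_p
    lambda_0: float = 0.5          # initial reinforcement-predictive weight
    kappa_deg: float = 1e-3        # degradation scaling factor
    # PV model parameters.
    pv_area_m2: float = 420.0
    pv_eff: float = 0.18
    gamma_t_per_degc: float = 0.0045
    # Tariff and emission ranges.
    tariff_gbp_per_kwh: tuple = (0.10, 0.26)
    co2_kg_per_kwh: tuple = (0.30, 0.55)
    sim_horizon_h: float = 24.0

cfg = SimulationConfig()
print(cfg.e_max_kwh, cfg.horizon_steps)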
Table 2. Average performance comparison across ten operating days.
Method | Cost (GBP) | CO2 (kg) | SOH (–) | Renewable (%) | Peak (kW)
Baseline | 1.325 | 1.293 | 0.891 | 0.378 | 1.199
SMPC | 0.914 | 0.915 | 0.941 | 0.397 | 0.898
H-RPEM | 0.761 | 0.792 | 0.958 | 0.424 | 0.903
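The relative changes implied by Table 2 follow from simple ratios against the baseline row, as in the short sketch below, which uses the tabulated values directly.

baseline = {"cost": 1.325, "co2": 1.293, "soh": 0.891, "renewable": 0.378, "peak": 1.199}
h_rpem   = {"cost": 0.761, "co2": 0.792, "soh": 0.958, "renewable": 0.424, "peak": 0.903}

# Percentage change versus baseline: negative for cost, CO2, and peak reductions;
# positive for SOH and renewable-utilisation gains.
for key in baseline:
    change = 100.0 * (h_rpem[key] - baseline[key]) / baseline[key]
    print(f"{key}: {change:+.1f}%")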
Table 3. Ablation study results showing normalised performance indices.
Configuration | Cost (–) | CO2 (–) | SOH (–)
Full H-RPEM | 0.761 | 0.792 | 0.958
No degradation term | 0.894 | 0.902 | 0.753
No PV awareness | 0.869 | 0.887 | 0.881
No health term | 0.894 | 0.902 | 0.902