3.2. HRES Generation and Network Load Predictions Using TensorFlow
The forecasts for an HRES (Figure 1) obtained using the provided code cover the generation of solar, wind and biomass energy sources over a 24 h period (0 to 1440 min). These forecasts are based on an LSTM neural network model that learns from historical data and predicts future trends, with DQN and SAC agents used for optimal system control. The forecasts show that solar generation starts from 0 MWh at 0 min (night), gradually increases to about 3.6 MWh at 600 min (around 10 a.m.), and then decreases back to 0 MWh between 600 and 1440 min (evening or night). This pattern resembles a classic bell-shaped solar power curve, similar to a Gaussian one, with a single daytime peak.
The main reason is the rotation of the Earth around its axis, which causes the sun’s rays to reach a given geographic location only during the day. Starting in the morning (approximately 0–600 min, depending on the season and latitude), the sun’s elevation increases, increasing the intensity of direct radiation. This allows photovoltaic (PV) panels to convert more solar energy into electricity. The peak is reached around solar noon, when the sun is at its highest point in the sky, and then the radiation decreases until sunset. Although the forecast shows a smooth rise and fall, in reality fluctuations can occur due to clouds, fog or rain that block the rays. The LSTM model learns from historical data, so if there are fluctuations in the data, the forecast can reflect them indirectly [72]. For example, if a dry day is forecast, the curve will be smoother; in changeable weather, short-term drops are possible.
Wind power generation starts at 3 MWh at 0 min and gradually decreases to 1.1 MWh at 700 min (around 11:40). Then, from 700 to 1440 min, it increases back to 3 MWh. The decreases and increases are uneven and fluctuating, which indicates dynamic behavior. Wind generation depends directly on wind speed and direction, which are variable and difficult to predict. From 0 to 700 min the decrease may be due to weakening winds (e.g., at night or in the morning, when temperature differences are smaller, causing weaker air flows). The increase after 700 min may be due to strengthening winds in the evening or at night, e.g., thermally driven winds (sea breezes) or synoptic systems (high- and low-pressure areas). Fluctuations occur due to turbulence, sudden gusts, or changes in air masses: wind is far less regular than the solar cycle, so the forecast shows “intermittent” behavior.
Biomass power plants (e.g., burning wood, waste, or biogas) operate like traditional thermal power plants, where fuel (biomass) is supplied continuously, regardless of external conditions. This allows for stable operation of steam turbines, generating constant power. Unlike solar or wind generation, biomass does not depend on the weather: its output is controlled by the operator through the fuel supply, so it can operate 24/7.
The variation in the load demand of the electricity network shows how the trends in the generation of renewable energy sources affect the energy balance of the network during the day (Figure 2).
From the beginning of the day (0 min) to approximately 600 min (around 10:00), the grid load demand increases from 2.5 MWh to 3.7 MWh. This increase coincides with the trend of solar power generation, which during the same period rises from 0 MWh to a peak of ~3.6 MWh at 600 min. As solar output strengthens in the morning, it covers a growing share of the grid load, but it reaches its peak only at about 600 min, so the grid demand is still growing during this period. After 600 min, solar generation starts to decrease, returning towards 0 MWh in the evening and at night, which is accompanied by a decrease in grid load demand from 3.7 MWh to 2.1 MWh by the end of the day (1440 min). Wind power generation acts as a variable source that can partially compensate for the decreasing solar contribution. From 700 min, wind power generation begins to increase, helping to stabilize the drop in grid load demand in the evening and at night. However, wind energy fluctuates naturally due to changes in weather conditions, turbulence or gusts, so this compensation effect is not constant, and unevenness occurs in the grid load demand [73]. Biomass power generation remains stable around the clock because it does not depend on weather conditions. Its constancy ensures a minimum energy supply even at night, when solar generation is zero and wind energy can be unpredictable.
In summary, the change in grid load demand during the day is determined by the dynamics of generation of the different renewable energy sources. Growing solar generation during the day increases the grid supply capability and partially covers the load growth; in the evening and at night, when the solar contribution decreases, stable biomass energy and fluctuating wind energy are most important for the grid balance. In this way, hybrid renewable energy system (HRES) trends directly shape the network load demand profile, and the synergy of different sources allows for the optimization of energy supply around the clock.
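The balance reasoning above can be made concrete with a short sketch. The values below are illustrative placeholders loosely based on the figures quoted in the text (the wind value at 600 min and the biomass level of 1.5 MWh are assumptions), and `net_balance` is a hypothetical helper name, not part of the authors' code.

```python
def net_balance(solar, wind, biomass, load):
    """Return generation minus load (MWh) for each time step."""
    return [s + w + b - l for s, w, b, l in zip(solar, wind, biomass, load)]


# Illustrative values at 0, 600 and 1440 min (assumed, not the study's data).
solar = [0.0, 3.6, 0.0]
wind = [3.0, 1.3, 3.0]
biomass = [1.5, 1.5, 1.5]
load = [2.5, 3.7, 2.1]

balance = net_balance(solar, wind, biomass, load)
# A positive balance is a surplus that storage can absorb; a negative one is a deficit.
labels = ["surplus" if b > 0 else "deficit" for b in balance]
```

A positive entry marks a step in which the battery or electrolyzer could absorb energy, while a negative entry marks a step that storage or the fuel cell would have to cover.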
CO2 emission trends in the HRES during the day reflect changes in generation and load from different energy sources (Figure 3). From the beginning of the day to approximately 200 min, CO2 intensity increases slightly from 99.5 to 100 kgCO2/MWh, which reflects the growing load and the start of solar generation, which cannot yet fully offset the demand for backup sources. During this period, excess solar and wind energy is also not yet sufficiently used in batteries or electrolyzers. From 200 to 600 min, emissions decrease from 100 to 98.5 kgCO2/MWh, as the increasing solar generation covers part of the load demand, reducing the need to use fuel cells, biomass or other reserve sources with higher emissions. The period from 600 to 800 min shows an increase from 98.5 to 100 kgCO2/MWh, corresponding to the decline in solar generation after the midday peak and the increasing participation of fuel cells or biomass in the energy supply. Finally, from 1000 to 1440 min, emissions decrease from 100.5 to 98.5 kgCO2/MWh, as the contribution of wind energy reduces the use of reserve sources and the energy balance of the system is optimized with the help of batteries and stored hydrogen.
The overall CO2 dynamics are determined by the fluctuations in HRES generation and by the use of energy storage and reserve sources, so emissions vary with the instantaneous balance between generation and load. These changes over the diurnal period are nevertheless slow and inertial: visually, the daily dynamics of CO2 appear almost stable, and emission fluctuations remain insignificant despite the variability in HRES generation and network load.
The dropout rate applied in the model was chosen empirically, but also took into account theoretical guidelines based on the Bayesian interpretation of dropout. According to this approach, dropout can be interpreted as an approximation to Bayesian inference in neural networks, and moderate dropout rates of 0.2–0.5 usually ensure a good balance between model stability and generalization ability. After preliminary tests, a dropout rate of 0.4 was selected, which was consistent with these guidelines and reduced the risk of overfitting without sacrificing prediction accuracy. Additionally, a pilot analysis was performed using concrete dropout, an adaptive version of dropout that allows the network to learn its dropout probabilities during training. Although this method was more flexible, it did not yield a statistically significant improvement in accuracy in the experiments, so the final model retained a fixed dropout rate.
Analysis of the forecasting results showed that the bidirectional LSTM (bi-LSTM) architecture improved the accuracy of energy production time-series forecasting compared to the unidirectional LSTM. The bidirectional structure allowed the model to use both the past and “future” context within the input sequence, which helped it to recognize irregular and multivariate relationships between HRES parameters. The root mean square error (RMSE) was reduced by approximately 11% compared to the unidirectional LSTM model. However, the bidirectional architecture increased the computational cost: the average time to generate a single forecast increased by approximately 25%, since the model processes the entire input sequence in both the forward and backward directions. In summary, the bi-LSTM architecture offers better time-series forecasting accuracy at the price of higher latency.
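To illustrate what a fixed dropout rate of 0.4 does during training, the following minimal sketch applies inverted dropout to a list of activations. It is a plain-Python illustration under stated assumptions (the function name, seed and input values are hypothetical) and is not the layer implementation used in the study.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors.

    Rescaling by 1 / (1 - rate) keeps the expected activation unchanged,
    so no correction is needed at inference time.
    """
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000      # hypothetical layer outputs
dropped = inverted_dropout(activations, rate=0.4, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)   # close to the rate
mean_after = sum(dropped) / len(dropped)            # close to the original mean
```

Roughly 40% of the activations are zeroed in each pass, while the mean activation stays close to its original value, which is why a rate in this range regularizes the network without shifting its output scale.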
3.3. HRES Accumulation System Predictions Using TensorFlow
The lithium-ion battery state of charge (SOC) predictions reveal clear links between the battery state, the intensity of renewable energy generation and the load dynamics of the electrical grid (Figure 4). The model predictions showed that over the entire 1440 min period, the battery state of charge follows a cyclical but rather uneven pattern, which is determined by external energy flows and control strategies.
In the early period (0–220 min), the battery charge level increased significantly from 40% to 80%. This phase coincides with intensive solar generation, when the system generates excess energy that is directed to storage. This indicates that the model correctly identifies the moments when battery charging is optimal for grid balancing. In the subsequent interval (200–580 min), the charge level decreased sharply from 80% to 15%, reflecting increased electricity consumption and reduced renewable generation. This phase reveals the battery’s function as a fast-reacting energy reservoir that is discharged to compensate for the energy shortage in the grid. From 600 to 950 min, the increase in charge (from 15% to 35%) indicates that the battery is again responding to improved generation conditions, especially during wind power surges. The subsequent decline to 10% (950–1300 min) is consistent with a control strategy that prioritizes grid stability. At the end of the day (1300–1440 min), another increase in charge to 38% is observed, which can be attributed to reduced consumption and excess renewable generation in the evening hours.
The forecast results reveal that the dynamics of the battery charge level are irregular but physically reasonable, reflecting both the volatility of solar and wind energy and the response of the control algorithm to real-time conditions. This behavior is typical of hybrid renewable energy systems, where lithium-ion batteries are most often used for grid load balancing and compensation of instantaneous power fluctuations rather than for long-term energy storage. These fluctuations arise from intermittent solar and wind generation and variable consumption. When the production of renewable sources exceeds demand, the excess energy is stored in the battery, and in case of a shortage, the battery is discharged, thus stabilizing the operation of the grid. Due to their fast response time, high efficiency and reliability, lithium-ion technologies are particularly suitable for frequency and voltage stabilization, emergency reserve and power quality maintenance. In this way, batteries become an essential element of the system, ensuring grid stability and the reliability of renewable source integration. In addition, the short-term use of batteries for balancing helps to avoid excessive charge–discharge cycles, thus extending their service life.
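The charge and discharge logic described above can be sketched as a simple SOC update rule. The battery capacity, SOC limits and efficiency below are hypothetical values chosen only for illustration, and `update_soc` is not a function from the authors' code.

```python
def update_soc(soc, net_energy_mwh, capacity_mwh,
               soc_min=0.10, soc_max=0.90, efficiency=0.95):
    """Return the new state of charge (0-1) after one time step.

    net_energy_mwh > 0 means a generation surplus (charging),
    net_energy_mwh < 0 means a deficit (discharging).
    """
    if net_energy_mwh >= 0:
        # Charging losses: less energy reaches the cells than is supplied.
        delta = net_energy_mwh * efficiency / capacity_mwh
    else:
        # Discharging losses: more energy leaves the cells than is delivered.
        delta = net_energy_mwh / (efficiency * capacity_mwh)
    # Clamp to the allowed SOC window.
    return min(soc_max, max(soc_min, soc + delta))


soc = 0.40                      # starting SOC, as at 0 min in the forecast
trajectory = [soc]
for net in [0.5, 0.5, -1.0]:    # assumed surplus/deficit values in MWh
    soc = update_soc(soc, net, capacity_mwh=5.0)
    trajectory.append(soc)
```

The clamping step reflects the idea that the battery serves as a short-term buffer: once a limit is reached, any remaining surplus or deficit has to be handled by the electrolyzer, the fuel cell or the grid.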
An analysis of the electrolyzer operating trends in the HRES shows a clear relationship between the intensity of hydrogen production and the excess energy from renewable sources (Figure 5).
The electrolyzer is activated only during those time intervals when the HRES generates excess electricity, i.e., when the instantaneous generation exceeds the needs of the consumers. During a power shortage, the electrolyzer is automatically switched off to avoid additional loads on the system. According to the data provided, hydrogen production gradually decreases from about 40 kg to 0 kg between 200 and 700 min. This period corresponds to a situation in which solar or wind energy production gradually decreases (e.g., with a decrease in the intensity of solar radiation or wind speed). Therefore, the amount of excess electricity in the system decreases, and the electrolyzer activity becomes less and less intense until it stops completely. From 700 to 1250 min, no hydrogen is produced, which indicates that the system experiences a power shortage during this period. Such a situation is most often characteristic of evening or night hours, when solar energy generation is zero and wind energy production is also insufficient to compensate for the load. The electrolyzer is then switched off to preserve the energy balance for other, more important loads. From 1250 to 1440 min (i.e., at the end of the day), the electrolyzer is switched on again, and hydrogen production rapidly increases from 0 to 32 kg. This sudden increase in activity indicates that there is again an energy surplus in the system; it is likely that a new daily cycle is starting, when solar radiation increases or wind conditions improve. In this way, the electrolyzer can again use the excess energy for hydrogen production.
In summary, the operating cycle of the electrolyzer depends directly on the dynamics of HRES generation, which is determined by weather conditions and the time of day. During the day, when solar and wind energy resources are abundant, the electrolyzer operates most intensively and produces the largest amounts of hydrogen. At night or in adverse meteorological conditions, when energy is lacking, the electrolyzer is stopped. Such a control strategy allows for effective balancing of energy flows in the system and maximum use of renewable resources without additional energy losses. When weather conditions are favorable (high solar irradiation or steady wind), the system generates a surplus of electricity, which the electrolyzer uses for hydrogen production; in unfavorable conditions, when solar intensity or wind speed decreases, generation falls and the electrolyzer is automatically stopped. A forecast of the electrolyzer’s electricity consumption is presented in Figure 6.
The results of the electrolyzer’s electricity consumption clearly reflect the trends in hydrogen production and the overall energy balance of the HRES. Analyzing the data, it can be seen that from 0 to 700 min the electrolyzer’s electricity consumption gradually decreases from approximately 1700 kWh to 0 kWh. This indicates that during this period, the amount of excess electricity in the system decreases, and therefore the electrolyzer’s operation gradually weakens until finally it is completely turned off.
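The surplus-driven operation of the electrolyzer can be sketched as a single decision step. The 1700 kWh input limit follows the peak consumption quoted in the text, whereas the specific energy of 50 kWh per kg of hydrogen is an assumed, typical order-of-magnitude value and not a parameter reported by the authors; `electrolyzer_step` is a hypothetical name.

```python
def electrolyzer_step(surplus_kwh, max_input_kwh=1700.0, kwh_per_kg_h2=50.0):
    """Return (energy_used_kwh, hydrogen_kg) for one time step.

    The electrolyzer runs only on surplus energy and is off during a deficit.
    """
    if surplus_kwh <= 0:
        return 0.0, 0.0                       # shortage: electrolyzer stays off
    used = min(surplus_kwh, max_input_kwh)    # cannot exceed the rated input
    return used, used / kwh_per_kg_h2         # assumed constant specific energy


off = electrolyzer_step(-200.0)       # deficit
partial = electrolyzer_step(1000.0)   # moderate surplus
capped = electrolyzer_step(2500.0)    # surplus above the rated input
```

Under these assumptions, hydrogen output follows the electricity consumption curve proportionally, which matches the qualitative link between the two forecasts.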
From 700 to 1250 min, the electrolyzer does not consume electricity, as the system operates in the energy shortage mode. During this period, all the energy produced is directed to meet the direct needs of consumers, and therefore the electrolyzer remains turned off so as not to cause additional load on the network. From 1250 to 1440 min, the electrolyzer’s operation resumes, and its electricity consumption rapidly increases from 0 to 1700 kWh. This means that the system again generates excess electricity, usually due to improved meteorological conditions such as higher solar radiation or increased wind speed. In this case, the electrolyzer is switched on so that excess energy is not lost but is converted into hydrogen as a form of energy storage. The operation of the electrolyzer thus corresponds to the daily HRES generation cycles: hydrogen production and electricity consumption increase when there is a surplus of energy in the system and decrease or stop completely when there is a shortage. In this way, the electrolyzer responds to fluctuations in HRES generation depending on solar and wind conditions and helps ensure the efficiency and stability of the system.
The electricity production of the fuel cell does not completely coincide with the trends in hydrogen consumption, as it is determined not only by the operation of the electrolyzer, but also by the demand for electricity in the transmission networks, the time of day, and the season (Figure 7).
Forecast analysis shows that during the night period (0–300 min), the fuel cell electricity production decreases from 0.07 MWh to 0. This decrease directly reflects the decreasing energy demand in the grids, as the total electricity consumption is low during the night. In addition, the electrolyzer is usually turned off at this time due to the lack of HRES generation, so hydrogen production stops and the fuel cell receives less hydrogen for conversion into electricity. This means that during the night the fuel cell operates minimally or is not used at all, since both the energy demand and the energy supply from the HRES are limited. In the period from 300 to 800 min, the fuel cell produces practically no electricity. This results from a combination of two factors: HRES generation is low due to low solar radiation or weak wind, and the energy demand in the grids has not yet reached a high level. The fuel cell is therefore temporarily idle, and the system waits for suitable conditions for hydrogen conversion and electricity production, ensuring that energy is not wasted and grid stability is maintained. Later in the day (800–1440 min), the fuel cell production gradually increases to 0.075 MWh. This growth reflects the increased energy demand due to more active consumer activity. In addition, HRES generation may be higher during this period due to intense solar radiation or stronger winds, creating excess energy that can be used by the electrolyzer and, through the stored hydrogen, by the fuel cell. In this way, the fuel cell adapts to the daily HRES generation cycles, optimizing hydrogen utilization and electricity production.
3.4. Model Training and Accuracy Results
The training results show a very rapid drop in loss over the first eight epochs: the training loss (train_loss) decreases from approximately 2.5 to 0.1, and the validation loss (val_loss) from 1.7 to 0.1 (Figure 8).
After this initial drop, both the training and validation losses remain practically unchanged at around 0.1, even when training is extended to 80 epochs or more. This trend indicates that the model quickly reaches its “saturation” phase: it learns practically all the basic data structures and dependencies, so additional training no longer provides a significant benefit in reducing the loss. There are several main reasons for this trend. First, the amount and variety of data are limited, especially if synthetic or small-scale real data are used, so the model “learns” all possible sequences and average trends within several epochs, including changes in energy production, load demand, and the states of the battery, electrolyzer, and fuel cells. Second, the data characteristics are relatively monotonic and have little noise, so the LSTM network quickly detects recurring trends, and the losses fall rapidly at the beginning and then stabilize. Third, the regularization built into the model architecture (dropout and L2) helps to prevent overfitting [74,75,76,77]: both training and validation losses reach a similar level, which shows that the model generalizes well to the available data. Fourth, with the chosen optimizer learning rate and the ReduceLROnPlateau schedule, training quickly reaches a plateau of the loss surface, so additional epochs do not reduce the loss further. Together, these factors explain why the losses on both the training and validation sets decrease rapidly at the beginning and then hardly change: the model stably predicts the main trends, converges on the available data, and does not require a large number of epochs, which is especially useful in scenarios with a small amount of data or in real-time systems, where fast adaptation is an advantage.
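The plateau behavior mentioned above can be illustrated with a simplified, plain-Python imitation of a reduce-on-plateau learning-rate rule. This is a sketch of the general idea, not the Keras callback itself; the loss values are illustrative numbers shaped like the reported validation curve, and the factor, patience and other settings are assumptions.

```python
def reduce_lr_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=3,
                         min_delta=1e-4, min_lr=1e-6):
    """Return the learning rate used at each epoch.

    The rate is multiplied by `factor` whenever the validation loss has not
    improved by more than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)          # rate in effect during this epoch
        if loss < best - min_delta:
            best = loss              # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule


# Illustrative validation losses: fast drop, then a flat plateau near 0.1.
val_losses = [1.7, 0.9, 0.4, 0.2, 0.12, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
schedule = reduce_lr_on_plateau(val_losses)
```

Once the loss stops improving, the rule keeps shrinking the step size, so later epochs change the weights less and less; this is one reason why extended training beyond the plateau brings little further reduction in loss.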
The accuracy of the predictions was assessed using four standard metrics: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R2). The results are presented in Table 5.
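For reference, the four metrics can be computed as in the following minimal sketch. The sample values are made up for illustration and are not taken from Table 5; skipping zero-valued targets in MAPE is an assumption made here because night-time solar output is zero, and the study does not state how such samples were handled.

```python
import math


def forecast_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (%) and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined where the true value is zero, so those samples
    # are skipped here (an assumption of this sketch).
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    mape = 100.0 * sum(abs(e / t) for t, e in nonzero) / len(nonzero)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Made-up generation values in MWh, for illustration only.
metrics = forecast_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

MAE and RMSE are expressed in the units of the target (MWh here), whereas MAPE and R2 are dimensionless, which is why the four metrics are usually reported together.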
The forecasts for all three energy sources are highly accurate, confirming the suitability of the bi-LSTM model for reflecting the dynamics of the HRES even with limited data and high variability.
Solar energy generation (MAE = 0.25 MWh, R2 = 0.98) shows an excellent fit between the predicted and actual diurnal curves. The low MAPE value (5.2%) confirms that the model accurately captures the growth and decline in generation caused by the daylight cycle, including the peak time (~10 a.m., ~3.6 MWh). Such results are especially valuable for planning battery charging and electrolyzer activation during periods of excess solar energy.
Wind energy generation (MAE = 0.28 MWh, R2 = 0.96) exhibits a slightly higher error, which is consistent with the natural stochasticity of wind speed. However, MAPE = 7.1% and RMSE = 0.32 MWh remain at an acceptable level, and the high R2 shows that the model successfully recognizes the general trends: the high night-time level (~3 MWh), the midday minimum (~1.1 MWh) and the evening recovery. These data allow for reliable forecasting of the wind contribution to the energy balance, especially in the evening and night periods, when solar generation disappears.
Biomass energy generation (MAE = 0.15 MWh, R2 = 0.99) has the best accuracy among all sources, which reflects the constant biomass supply and controlled combustion process. The low MAPE value (3.5%) and RMSE = 0.20 MWh confirm that the model almost perfectly reproduces the base-load profile (~1.5–2 MWh), providing a reliable basis for the stability of the entire HRES.
All forecasts exhibit MAE < 0.3 MWh, RMSE of approximately 0.3 MWh or lower, and R2 ≥ 0.96, which compares favorably with many HRES forecasting results reported in the literature, especially when using short-term (24 h) datasets.
Such indicators allow for safe integration of forecasts into the decision-making process of DQN and SAC agents, ensuring that control actions (battery charging/discharging, electrolyzer and fuel cell activation) are based on reliable future expectations. This directly contributes to the minimization of energy imbalance (<0.5 MWh), CO2 emission reduction, and overall HRES efficiency improvement in real time.
The obtained forecasting and control results are consistent with recent studies investigating deep reinforcement learning applications in hybrid renewable energy systems [1,2,3,4,5,57,58,59,60,61,62]. Similar to previously reported DRL-based HRES control frameworks, the proposed SAC-based supervisory strategy demonstrated improved operational stability and adaptive energy balancing capability under stochastic renewable generation conditions. However, compared to earlier studies focused mainly on simplified HRES architectures or single-component optimization, the proposed framework integrates multiple renewable sources together with battery and hydrogen-based storage technologies within a unified predictive–adaptive control environment. The achieved energy imbalance below 0.5 MWh and reduced component switching behavior indicate competitive performance relative to existing DRL-based HRES management approaches reported in the literature.
To further evaluate the effectiveness of the proposed DRL-based supervisory control framework, an additional comparison was performed using a conventional rule-based energy management system (RB-EMS). The rule-based controller operated using fixed-threshold logic, commonly applied in hybrid renewable energy systems. The battery was charged when renewable generation exceeded load demand and discharged during deficit conditions. The electrolyzer was activated only when excess renewable generation exceeded a predefined threshold, while the fuel cell operated when the battery state-of-charge decreased below 20%. The comparative analysis demonstrated that the conventional RB-EMS exhibited significantly lower operational performance compared to the DRL-based approaches. Under the same 24 h operating scenario, the RB-EMS resulted in an average long-term energy imbalance of approximately 1.34 MWh, whereas the DQN agent reduced the imbalance to approximately 0.78 MWh and the SAC agent achieved the lowest imbalance—below 0.5 MWh. In addition, the rule-based strategy produced slightly higher average CO2 intensity due to less efficient coordination of renewable generation and storage utilization. The RB-EMS also generated more frequent battery, electrolyzer, and fuel cell switching actions because the fixed-threshold logic could not adapt smoothly to rapidly changing renewable generation and load conditions. Compared to the deterministic rule-based strategy, both DRL agents demonstrated improved adaptive behavior and more efficient coordination of battery storage and hydrogen-based components. The DQN agent achieved better operational stability through experience-based policy learning, while the SAC agent provided the most stable long-term control performance due to entropy-regularized optimization and improved exploration capability under stochastic operating conditions. 
Furthermore, the SAC agent reduced unnecessary component switching behavior, which may contribute to improved system reliability and longer operational lifetime of HRES components. The obtained results confirm that DRL-based supervisory control provides substantial operational advantages over conventional deterministic energy management strategies, particularly under highly variable renewable generation conditions typical for modern hybrid renewable energy systems. To further evaluate the statistical robustness of the DRL comparison, both DQN and SAC agents were additionally assessed over multiple independent training runs with different random initialization seeds. The obtained results confirmed that the SAC agent maintained more stable performance across repeated runs, while the DQN agent exhibited higher variability due to its ε-greedy exploration mechanism and value-based learning structure. Across five independent runs, the DQN agent achieved an average cumulative reward of 142.6 ± 18.4, whereas the SAC agent achieved 176.8 ± 7.9. Similarly, the average energy imbalance was 0.79 ± 0.12 MWh for DQN and 0.47 ± 0.06 MWh for SAC. These results indicate that SAC not only achieved better average control performance, but also demonstrated lower sensitivity to random initialization and training stochasticity.
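The fixed-threshold logic of the rule-based baseline described above can be sketched as follows. The 20% state-of-charge limit is taken from the text, while the electrolyzer threshold of 0.5 MWh is a hypothetical placeholder because the predefined threshold is not specified; `rule_based_ems` is an illustrative name rather than the authors' implementation.

```python
def rule_based_ems(generation_mwh, load_mwh, soc,
                   electrolyzer_threshold_mwh=0.5, soc_fuel_cell=0.20):
    """Return the actions of a fixed-threshold controller for one time step."""
    net = generation_mwh - load_mwh
    if net > 0:
        battery = "charge"        # renewable generation exceeds load demand
    elif net < 0:
        battery = "discharge"     # deficit conditions
    else:
        battery = "idle"
    return {
        "battery": battery,
        # Electrolyzer runs only when the surplus exceeds the threshold.
        "electrolyzer_on": net > electrolyzer_threshold_mwh,
        # Fuel cell runs when the battery state of charge drops below 20%.
        "fuel_cell_on": soc < soc_fuel_cell,
    }


surplus_case = rule_based_ems(generation_mwh=5.0, load_mwh=3.0, soc=0.50)
deficit_case = rule_based_ems(generation_mwh=2.0, load_mwh=3.0, soc=0.15)
```

Because every decision is a hard comparison against a constant, small changes in generation or load near a threshold flip the component state, which is consistent with the more frequent switching reported for the RB-EMS.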
As shown in Table 6, the proposed bi-LSTM model achieved the highest forecasting accuracy among the evaluated approaches, demonstrating lower RMSE, MAE, and MAPE values together with the highest coefficient of determination (R2). Compared to conventional LSTM and GRU models, the bidirectional architecture improved the capability to capture temporal dependencies and stochastic variations in renewable energy generation. In contrast, the Prophet model exhibited lower prediction accuracy due to its limited ability to represent highly nonlinear and dynamic HRES behavior. These results confirm the suitability of the proposed bi-LSTM framework for short-term renewable energy forecasting under variable operating conditions.
The reported forecasting accuracy metrics were obtained using independent unseen testing data separated from the training dataset through chronological train-test splitting without temporal mixing. Specifically, approximately 80% of the sequential time-series samples were used for model training, while the remaining 20% were reserved exclusively for validation and testing purposes. This approach ensured that the bi-LSTM model was evaluated on operational sequences not previously observed during training. Due to the sequential and time-dependent nature of renewable energy forecasting, conventional random cross-validation was not applied, since temporal shuffling may introduce information leakage between training and testing samples. Instead, chronological validation was adopted to preserve realistic forecasting conditions and temporal causality within the HRES environment. The authors acknowledge that the present study primarily focused on methodological validation of the integrated forecasting-control framework using a representative operational scenario. Therefore, full seasonal cross-validation and long-term multi-season testing were beyond the scope of the current work. Nevertheless, the obtained forecasting performance demonstrates that the proposed bi-LSTM architecture can accurately capture short-term renewable generation dynamics and load variations under stochastic operating conditions.
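The chronological 80/20 split described above can be expressed in a few lines. The 10 min sampling step used to build the example series is an assumption made only for illustration.

```python
def chronological_split(samples, train_fraction=0.8):
    """Split a time-ordered sequence without shuffling."""
    split = int(len(samples) * train_fraction)
    return samples[:split], samples[split:]


# One day of time stamps in minutes at an assumed 10 min resolution.
minutes = list(range(0, 1440, 10))
train, test = chronological_split(minutes)
```

Every test sample lies strictly after every training sample, so no information from the evaluation period can leak into training.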
Although the proposed forecasting–control framework demonstrated stable performance under stochastic HRES operating conditions, the present study did not explicitly investigate the impact of large forecasting errors, sensor uncertainty, or unexpected external disturbances on reinforcement learning control stability. In practical deployment scenarios, renewable generation forecasts may deviate from real operating conditions due to sudden meteorological changes, communication delays, measurement noise, or unforeseen load fluctuations. Such disturbances may affect energy balancing performance and lead to temporary deviations from the optimal control policy. Nevertheless, the entropy-regularized SAC framework exhibited comparatively robust learning dynamics and lower reward oscillation under stochastic conditions, indicating improved adaptability to moderate environmental uncertainty. Future work will therefore focus on dedicated robustness and sensitivity analyses involving artificially perturbed forecasting inputs, noise-injected operational scenarios, and extreme renewable variability conditions in order to quantitatively evaluate control resilience, policy stability, and fault tolerance under realistic uncertain HRES environments.
Future work will extend the forecasting framework using long-term seasonal datasets and rolling-window temporal cross-validation in order to further evaluate model robustness, inter-seasonal transferability, and forecasting generalization capability under diverse meteorological and demand conditions.
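A rolling-window temporal cross-validation scheme of the kind mentioned here could look like the following sketch. The window sizes are arbitrary illustrative values, and `rolling_window_splits` is a hypothetical helper rather than part of the presented framework.

```python
def rolling_window_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) pairs that move forward in time."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = range(start, start + train_size)
        test_idx = range(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step       # slide both windows forward


splits = list(rolling_window_splits(n_samples=10, train_size=4, test_size=2))
```

Each fold trains on a fixed-length window and tests on the block that immediately follows it, so the evaluation is repeated across different parts of the series while temporal order is preserved.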
3.5. Learning Dynamics and Control Strategies of SAC and DQN Agents for the HRES Accumulation System
The entropy coefficient α (alpha) in the SAC algorithm controls the balance between exploration and exploitation. Lower alpha values indicate greater reliance on the learned policy, while higher values indicate greater exploration and randomness in the choice of action (Figure 9).
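The text does not state how α was adapted during training. A common choice in SAC implementations is automatic temperature tuning, in which α is adjusted so that the policy entropy tracks a target value. The sketch below shows one gradient step of that idea in plain Python; the function name, learning rate, target entropy, and sample log-probabilities are hypothetical and are not taken from this study.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha) = mean(-alpha * (log_pi + H_target)).

    alpha is parameterised as exp(log_alpha) so that it stays positive.
    All argument names and default values are illustrative assumptions.
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # dJ/d(log_alpha) = -alpha * mean(log_pi + H_target)
    grad = -alpha * mean_term
    return log_alpha - lr * grad


log_alpha = math.log(0.20)  # initial alpha value reported in the text
# Here the policy entropy (-mean log_pi = 0.25) is below the target of 1.0,
# so the update pushes alpha upward.
new_log_alpha = update_log_alpha(log_alpha, [-0.2, -0.3], target_entropy=1.0)
print(math.exp(new_log_alpha) > 0.20)
```

Under this update rule, α rises while the policy entropy stays below the target and falls once the entropy exceeds it.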
Analyzing the results, it can be seen that the alpha value increased consistently from 0.20 to 0.73 over 200 episodes, i.e., about a 265% change. Such growth indicates that the agent gradually increased the variety of actions during the learning process to avoid premature convergence to a local optimum. In the initial phases (episodes 1–50), Alpha rises slowly (from 0.20 to ~0.33), which indicates that the agent first stabilized the basic policy—learned the basic HRES energy balancing laws and reward structure. In the middle phase (episodes 50–150), Alpha growth accelerates (from 0.33 to ~0.60), which indicates an active exploration period: the agent experiments with alternative battery, electrolyzer and fuel cell control modes. In the later phase (episodes 150–200), the alpha curve stabilizes (~0.73), indicating that the system has reached a balance between exploration and exploitation. This trend indicates good adaptation of the SAC algorithm—the agent maintains sufficient behavioral diversity to respond to dynamic HRES conditions (e.g., generation and load fluctuations), but at the same time does not go beyond the limits of chaotic behavior. This type of alpha behavior is considered a desirable learning dynamic: it indicates that the SAC agent is learning stably while maintaining entropic flexibility. This is especially important in hybrid renewable energy systems, where variability (weather conditions, load) requires that the control policy is not too deterministic. Analyzing the dynamics of the SAC agent’s reward values over 200 episodes, one can observe a characteristic fluctuating, but eventually stabilizing learning process (
Figure 9). In the initial phase (episodes 1–30), the reward values ranged from 90 to 200, with an average of around 150. This period reflects the initial adaptation phase of the agent, when the policy is not yet stable, and therefore significant jumps in results occur (e.g., 119 → 204 → 115). In the middle phase (episodes 30–120), a moderate increase in the average reward is observed: the values stabilize in the range 140–190, with occasional high jumps (e.g., episodes 54 and 75, where the reward reaches ~200–237). This phase shows that the SAC agent gradually learns to manage the HRES energy balance more efficiently, coordinating battery charging, electrolyzer activation, and fuel cell operation depending on generation and load fluctuations. In the late phase (episodes 120–200), the reward values fluctuate within a smaller amplitude range (~150–200), and the average stabilizes around 170–180. This indicates that the agent has reached the learning convergence stage: the system maintains stable performance, responds effectively to generation changes, and reduces energy imbalance and CO2 emissions. Single reward drops (e.g., episodes 156 or 176, where the reward is <110) are likely associated with stochastic learning episodes, when the entropy coefficient (α) deliberately encourages exploration to avoid local optima. The overall reward trend indicates that the SAC agent has successfully learned to stabilize the system, achieving a balance between energy production and consumption in real time. The increase in the average reward and the decreasing level of variation suggest that the learning process is convergent and that the policy is robust and adaptive.
To further evaluate the statistical robustness of the reinforcement learning results, additional exploratory training runs were performed using different random initialization seeds for both SAC and DQN agents. The obtained results demonstrated that the SAC agent consistently achieved higher average cumulative rewards and lower reward variance compared to the DQN agent across independent runs. In the later training phase (episodes 150–200), the SAC agent achieved an average reward of approximately 175 ± 12, whereas the DQN agent demonstrated larger variability with average rewards of approximately 158 ± 27. The lower standard deviation observed for the SAC agent indicates improved training stability and more consistent policy convergence under stochastic HRES operating conditions. This behavior is primarily attributed to the entropy-regularized learning mechanism of SAC, which promotes smoother exploration–exploitation balance and reduces sensitivity to random initialization effects. In contrast, the ε-greedy exploration strategy used by DQN produced higher reward oscillations and greater sensitivity to stochastic transitions within the environment. Although the present statistical evaluation was limited to exploratory multi-seed experiments, the obtained results further support the conclusion that the SAC-based supervisory control framework provides more stable and reliable learning dynamics compared to DQN for complex HRES management tasks.
This confirms the suitability of the SAC method for dynamic HRES environments, where flexible and energy imbalance-insensitive control is required. In addition to the DRL-based comparison, the obtained control behavior was qualitatively compared with conventional fixed-threshold supervisory energy management strategies commonly used in simplified HRES applications. Unlike deterministic rule-based control, which typically relies on predefined battery charging/discharging thresholds and fixed activation logic for electrolyzers and fuel cells, the proposed DRL framework demonstrated improved adaptability to stochastic renewable generation and dynamically changing load conditions. In particular, the SAC agent exhibited smoother control transitions, lower reward oscillation, and more stable long-term energy balancing behavior under variable operating scenarios. Nevertheless, a comprehensive quantitative benchmarking analysis involving multiple conventional EMS strategies remains an important direction for future work.
The presented data show that over a period of 1440 min (24 h), the SAC agent changes the battery state between charging (1), discharging (−1) and inactivity (0) modes (
Figure 10). Analyzing the sequence of results, it can be seen that for most of the day (about 80–85% of the time) the battery remains neutral (0). This means that the system maintains energy balance without the need to actively charge or discharge the battery. Such behavior indicates that the SAC agent has learned an energy stability maintenance strategy, in which battery operation is used only at critical moments—when generation and load flows become unbalanced.
Short discharge periods (−1) (e.g., at 150, 240, 450, 540 min) correspond to the morning and midday periods when solar generation has not yet reached its peak and the grid load is increasing. In such cases, the agent initiates discharge to ensure a constant energy supply, compensating for the missing power from the battery. Charging states (1) appear in the afternoon and evening intervals (around 690–1140 and 1410 min). This period corresponds to the phase of excess solar and wind energy, when HRES generation exceeds the load. The SAC agent activates charging at such times so that the excess energy is stored for later use and not lost. This cyclical, yet economical distribution of battery activity shows that the SAC agent effectively balances instantaneous power flows: charging only when the energy surplus is significant; discharging only when the generation briefly decreases; maintains long neutral phases, when the system itself reaches a balance between production and consumption. This type of behavior reflects the result of the learning process, when the SAC algorithm reaches the optimal compromise of the energy storage strategy—reducing the number of battery cycles, preserving its service life and at the same time ensuring a reliable energy supply. Analyzing the electrolyzer states (1—active, 0—inactive), it is seen that the SAC agent activates the electrolyzer several times a day only at short, strategically justified intervals. The first activations occur early in the morning—at 60 and 90 min, when solar generation begins to increase, but has not yet reached its peak. This indicates that the agent is able to detect a momentary energy surplus and use it for hydrogen production, while the system load is not yet maximum (
Figure 10). Later, the electrolyzer remains inactive for a long time (from ~120 to 930 min), which coincides with the midday period, when most of the excess energy is directed to direct grid supply or battery charging. This confirms that the SAC agent optimizes the distribution of energy flows: it gives priority to faster-reacting storage devices (batteries) and activates the electrolyzer only when there is a stable and sustainable energy surplus suitable for long-term storage. The second episode of activity is recorded at about 330 and 960 min, and later, at 1170 min, corresponding to the afternoon and evening periods. These moments correlate with the strengthening of wind energy and the decrease in solar generation. At such a time, the SAC agent uses the temporary excess energy so that the electrolyzer can generate hydrogen and replenish reserves for fuel cell operation at night. The general trend is that the SAC agent does not support continuous operation of the electrolyzer, but activates it only when the energy balance is positive, i.e., when the instantaneous HRES generation exceeds the load. Such a control strategy ensures energy efficiency, reduces unnecessary load on the network and optimizes the hydrogen production process so that it only occurs when there are sufficient resources.
This behavior reflects the learning results: the agent has learned that switching on the electrolyzer at the wrong moment (during an energy deficit) would increase overall costs and CO2 emissions. Therefore, SAC control ensures adaptive, energy-sensitive activation of the electrolyzer, maintaining the stability and efficiency of the entire HRES. According to the data, the fuel cell is switched on only at certain intervals during the day, when there is an energy deficit in the system or it is necessary to maintain grid stability. The fuel cell state “1” means active electricity production from stored hydrogen, and “0” means an inactive state (
Figure 10). The first activation episodes occur at 120–630 min, the period when solar and wind generation has not yet reached its peak. Early switching on at 120 and 300 min indicates that the SAC agent detects a momentary energy shortage in the morning, when the load starts to grow but generation is still low. At such times, the fuel cell acts as a reserve source, ensuring uninterrupted energy supply. Later activations (e.g., at 480–510, 630, 720–750 min) show that the agent adapts the fuel cell activation to short-term imbalances related to cloudiness, wind or load fluctuations. This indicates a high sensitivity of SAC control to real-time conditions: the agent prevents an energy deficit from accumulating by quickly activating the fuel cell as a stabilizing component. In the second half of the day (from ~960 to 1320 min), the fuel cell essentially remains off, which coincides with the later period of the day when the energy balance becomes positive due to higher solar or wind generation and the reserves accumulated in the batteries and in the hydrogen produced by the electrolyzer. Only at 1350 min is one short activation recorded, which is most likely related to the imbalance at the end of the evening, when solar generation is already zero and the load remains average. These operating dynamics show that the SAC agent has learned to use the fuel cell strategically as a last-resort balancing device, activated only when other reserve options (battery, electrolyzer) have been exhausted. This not only reduces fuel consumption, but also helps to maintain low CO2 intensity, since stored hydrogen is consumed only when necessary. In this way, the SAC agent’s control logic achieves a compromise between energy reliability and efficiency: the fuel cell remains a guarantor of the energy system’s security, but its activity is limited to the minimum necessary.
Analyzing the evolution of the DQN agent’s exploration coefficient (ε), it is seen that the value consistently decreased from 0.92 to 0.12 over 200 episodes, which reflects the classic convergence course of the ε-greedy strategy (Figure 11).
In the initial learning phase (episodes 1–40), ε decreases from 0.92 to 0.61. This period marks the stage of active exploration, when the agent deliberately chooses a large proportion of random actions in order to learn about the state space of the HRES and the consequences of various actions for the energy balance. This helps to avoid an early commitment to suboptimal solutions while there is not yet enough data. In the middle phase (episodes 40–120), the value of ε decreases from 0.61 to ~0.27. This stage marks the stabilization of learning: the agent has already accumulated enough experience, so it begins to rely more on the learned Q-function. More and more actions are therefore chosen based on accumulated experience rather than at random, and the agent moves from exploration to more efficient exploitation. The late phase (episodes 120–200) shows ε leveling off in the range 0.18–0.12, which means that the agent has reached a balance between learning and policy application. At this stage, the DQN agent relies on the almost fully learned policy but maintains a minimal level of randomness (about 12–18% of actions) to avoid local optima and adapt to changing system conditions (e.g., fluctuations in RES generation). This decreasing trajectory of ε indicates that the learning process of the DQN agent proceeded stably and followed the intended exploration-reduction schedule. Such dynamics ensured that the agent sufficiently explored the state space in the early stages and later switched to reliable, knowledge-based HRES control. The result shows that the DQN method, although requiring longer learning than SAC, is able to achieve efficient policy convergence using structured exploration reduction, which leads to stable energy balancing and effective resource coordination in the hybrid system.
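The decay law behind this trajectory is not specified in the text. As a minimal sketch, assuming a geometric (multiplicative) per-episode decay between the reported end points 0.92 and 0.12, the schedule and the ε-greedy action choice could look as follows. The function names are hypothetical. With this assumption the schedule passes close to the reported intermediate values (roughly 0.62 at episode 40 and 0.27 at episode 120).

```python
import random

EPS_START, EPS_END, EPISODES = 0.92, 0.12, 200  # end points reported in the text

# Per-episode factor chosen so that episode 1 gives 0.92 and episode 200 gives 0.12.
DECAY = (EPS_END / EPS_START) ** (1.0 / (EPISODES - 1))


def epsilon_at(episode):
    """Exploration rate for a 1-based episode index (assumed geometric decay)."""
    return EPS_START * DECAY ** (episode - 1)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


for ep in (1, 40, 120, 200):
    print(ep, round(epsilon_at(ep), 2))

# Hypothetical Q-values for the three battery actions (discharge, idle, charge).
action_index = epsilon_greedy([0.1, 0.9, 0.3], epsilon_at(200))
```

Other schedules (for example, linear decay) would reach the same end points but would not reproduce the intermediate values as closely.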
Analyzing the distribution of battery states during the day, it can be seen that the DQN agent applies more frequent and fragmented charging and discharging cycles compared to the behavior of the SAC agent (
Figure 12).
This reflects the discrete nature of decision-making inherent in the DQN architecture, where policies are based on updates to Q-functions at specific states without a direct entropy control mechanism. In the early daily period (0–480 min), the battery remains in a neutral state most of the time, but individual activations are recorded, e.g., discharge at 150 min and charging at 210 and 270 min. These episodes show that the agent begins to respond to momentary fluctuations in load and generation, but the decisions are not continuous and are reactive rather than predictive. In the middle daily phase (480–960 min), a combination of several activity changes is observed: charging at 660 and 720 min, and discharging at 480, 900 and 990 min. This shows that the DQN agent is able to detect short-term energy surpluses and use them for storage, but also activates the battery when the energy balance becomes negative. However, the decisions appear episodic, since the battery activation does not take the form of a long cycle but is rather an instantaneous reaction to an imbalance. In the evening phase (960–1440 min), higher activity is observed, with several charging and discharging episodes repeated (e.g., at 1170, 1230, 1290, 1350 and 1410 min). This indicates that the DQN agent learns to compensate for the decrease in solar generation in the evening, but its performance is not fully optimized: the battery is switched on more often than necessary, which can lead to a higher number of cycles and a shorter service life. The general trend shows that the DQN agent has learned the basic principle of energy balancing, i.e., to charge during excess generation and discharge during deficiency, but its actions are characterized by a higher frequency and lower stability than in the case of the SAC agent.
This is related to the deterministic decision-making of the DQN method, without additional entropic regulation, so the agent switches states more often according to the direct Q-reward signal. This type of behavior indicates that although the DQN agent effectively supports the system balance, its control strategy is less uniform, but quite effective, especially in the initial stages of the model application.
To further evaluate the effectiveness of the proposed DRL-based HRES control framework, an additional comparison was performed using a conventional rule-based energy management strategy as a baseline reference. The rule-based controller operated according to predefined supervisory thresholds: the battery was charged when renewable generation exceeded load demand by more than 10%, discharged during energy deficit conditions, the electrolyzer was activated only during sustained excess generation periods, and the fuel cell was enabled when the battery state-of-charge decreased below 20% under insufficient renewable generation. The comparative analysis demonstrated that both DRL agents outperformed the conventional rule-based controller in terms of energy balancing stability and operational flexibility. The rule-based strategy produced larger short-term energy imbalance fluctuations and more frequent switching events due to its static threshold-dependent behavior. In contrast, the reinforcement learning agents adapted dynamically to stochastic renewable generation and varying demand conditions through learned state-action policies.
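The threshold logic of the rule-based baseline described above can be sketched as a single decision function. This is an illustrative reading of the stated rules, not the controller used in the study: the function name, the return format, and in particular the `sustain_steps` parameter (the length of a "sustained" surplus, which the text does not quantify) are assumptions.

```python
def rule_based_action(generation, load, soc, surplus_steps, sustain_steps=2):
    """Return component commands for one time step of a threshold controller.

    generation, load: values in the same unit for the current step
    soc: battery state of charge in [0, 1]
    surplus_steps: number of consecutive previous steps with generation > load
    sustain_steps: hypothetical number of steps that counts as "sustained"
    """
    surplus = generation > load
    deficit = generation < load
    streak = surplus_steps + 1 if surplus else 0

    battery = 0
    if generation > 1.10 * load:
        battery = 1    # charge: generation exceeds load by more than 10%
    elif deficit:
        battery = -1   # discharge during an energy deficit

    # Electrolyzer only after the surplus has persisted long enough (assumption).
    electrolyzer = 1 if streak >= sustain_steps else 0
    # Fuel cell when generation is insufficient and SOC is below 20%.
    fuel_cell = 1 if deficit and soc < 0.20 else 0

    return {"battery": battery, "electrolyzer": electrolyzer,
            "fuel_cell": fuel_cell, "surplus_steps": streak}


print(rule_based_action(generation=3.0, load=2.0, soc=0.5, surplus_steps=1))
print(rule_based_action(generation=1.0, load=2.0, soc=0.1, surplus_steps=3))
```

Because every decision depends only on fixed thresholds, small fluctuations of generation around the load level can flip the outputs from one step to the next, which is consistent with the frequent switching reported for this baseline.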
Among the evaluated DRL approaches, the SAC agent achieved the most stable control performance, maintaining lower cumulative energy imbalance and smoother component coordination compared to both the DQN agent and the rule-based baseline. The entropy-regularized SAC policy enabled improved adaptation to rapidly changing HRES operating conditions, reducing unnecessary battery cycling and avoiding abrupt electrolyzer–fuel cell switching behavior commonly observed in threshold-based control. The obtained results therefore confirm that DRL-based supervisory control provides measurable operational advantages over conventional static energy management approaches, particularly under highly variable renewable generation conditions where predefined deterministic control rules may become suboptimal.
The DQN strategy is suitable for reducing short-term imbalances, but for long-term system stability, the SAC method remains superior. Analyzing the presented electrolyzer states (1 = active, 0 = inactive), it can be seen that the DQN agent activates the electrolyzer only a few times a day, at short intervals, in response to momentary periods of excess energy (
Figure 12). This shows that the agent is able to recognize situations when the generation of renewable energy sources exceeds the load, but decisions are made reactively, not predictively. The first activation of the electrolyzer occurs between 330–390 min, i.e., early in the morning, when solar generation begins to increase, but the load remains relatively low. This period often coincides with the first occurrence of excess energy, so the agent uses it for hydrogen production. The next activation is recorded at about 780 min (noon), when additional energy may appear in the system surplus due to peak solar or wind generation. This shows that the DQN agent is able to detect local episodes of surplus, but they are short-term—the electrolyzer is not maintained longer than necessary. A later activation at 1380 min (evening) shows that the agent turns on the electrolyzer also during the evening energy balance, possibly in response to a short-term generation spike or load reduction. These short fragments of activity allow us to conclude that the DQN agent acts conservatively, turning on the electrolyzer only when the system state clearly indicates an excess balance. Unlike the SAC agent, which can maintain soft, smooth dynamics of the electrolyzer activity through entropic control, the DQN behavior is more discrete and impulsive. This means that DQN makes decisions based on specific momentary signals (high Q-reward), rather than relying on a smooth forecast of energy flows. Such a strategy preserves the stability of the system, but may lead to incomplete use of excess energy for hydrogen production, as some short-term excesses remain unused. However, this behavior reflects the learning logic of the DQN method: it successfully learned to turn on the electrolyzer only under appropriate conditions, avoiding unnecessary energy consumption and reducing system losses. 
This confirms that DQN is able to play a key role in energy balancing in the HRES, although its control remains more reactive than predictive. Analyzing the fuel cell states of the HRES (hybrid renewable energy system) during the day (1440 min) in 30 min intervals, it can be seen that the DQN agent turns on the fuel cell only at rare, strategic moments (
Figure 12). Out of 48 intervals, the fuel cell was turned on only six times (90, 570, 870, 1080, 1140, 1200 min), which is about 12.5% of the total day. The fuel cell remained off during the rest of the time, which indicates the agent’s ability to maximize the use of renewable energy sources and turn on the fuel cell only when the energy demand is highest. The distribution of switching on shows the dynamic response to the system state. The intervals between fuel cell switching on are not periodic, ranging from 60 to 480 min depending on the energy balance of the system. This shows that the agent does not rely on a fixed schedule, but makes decisions based on real-time system information and the optimization goal of minimizing fuel consumption while maintaining the reliability of energy supply. These results confirm the effectiveness of the DQN agent in the control of hybrid energy systems: (a) economic efficiency—the fuel cell operates only when its operation is necessary, reducing fuel consumption; (b) energy stability—the agent ensures that the energy supply is not disrupted, switching on the fuel cell only at critical moments; and (c) adaptation to changing conditions—decision-making is not periodic, which allows the agent to respond to real energy needs and fluctuations in renewable sources. The overall result shows that the DQN agent control strategy allows for the optimal combination of fuel cell and renewable sources, ensuring both economic and energy efficiency of the HRES. This analysis highlights the deep ability of the DQN agent to make complex decisions in real time, optimizing both fuel consumption and system reliability.
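The activation share and the spacing between activations quoted above follow directly from the listed switching times, as the short calculation below shows. The variable names are illustrative; the numbers are those given in the text.

```python
activation_minutes = [90, 570, 870, 1080, 1140, 1200]  # from the text
STEP_MIN, DAY_MIN = 30, 1440

n_intervals = DAY_MIN // STEP_MIN                       # 48 half-hour intervals
active_share = len(activation_minutes) / n_intervals    # 6 / 48
# Time between consecutive activations, in minutes.
gaps = [b - a for a, b in zip(activation_minutes, activation_minutes[1:])]

print(n_intervals, active_share)   # 48 0.125
print(gaps, min(gaps), max(gaps))  # [480, 300, 210, 60, 60] 60 480
```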
Figure 10, Figure 11 and Figure 12 additionally illustrate the temporal evolution of battery SOC, component activation cycles, and HRES power balancing behavior during the simulation period. The obtained results demonstrate that the SAC agent maintains smoother power balance dynamics and lower component switching frequency compared to DQN, contributing to reduced operational instability and lower emission intensity.
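Switching frequency, as compared above, can be quantified by counting the time steps at which a component changes state. The helper below is a minimal sketch; the two state sequences are hypothetical and serve only to show how a smoother and a more fragmented schedule would differ under this measure.

```python
def count_switches(states):
    """Number of time steps at which a component changes its state."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical battery schedules (1 = charge, 0 = idle, -1 = discharge).
smooth = [0, 0, 0, -1, 0, 0, 1, 1, 0, 0]
fragmented = [0, -1, 0, 1, 0, -1, 1, 0, -1, 0]

print(count_switches(smooth), count_switches(fragmented))
```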
3.6. Comparison of SAC and DQN Agent Control Strategies
This section compares the soft actor–critic (SAC) and deep Q-network (DQN) agent control strategies in a hybrid renewable energy system (HRES) based on the obtained results of learning dynamics, reward evolution, and component control. The comparison covers the learning processes, energy balancing efficiency, activation frequency of components (battery, electrolyzer, and fuel cell), and overall system stability. Both agents were trained in a similar environment formulated as a Markov decision process (MDP), but their algorithmic differences (SAC entropic regularization and DQN ε-greedy exploration) lead to different strategies that affect the practical application of HRES. In this study, system stability is interpreted as the ability of the control agent to maintain low reward variability, avoid excessive switching of batteries, electrolyzers, and fuel cells, and preserve energy imbalance within predefined operational thresholds. Adaptability refers to the capability of the agent to respond dynamically to stochastic variations in renewable generation and load demand, while energy balancing efficiency is evaluated based on the minimization of energy mismatch and the effective coordination of storage and hydrogen-based components. First, let us compare the learning dynamics of the agents. The entropy coefficient α of the SAC agent increased from 0.20 to 0.73 over 200 episodes (
Figure 9), which indicates a gradual transition to a greater variety of actions and more flexible exploration. This allows SAC to maintain a balance between exploration and exploitation, ensuring more stable learning in a dynamic HRES environment where generation fluctuations (e.g., solar and wind) are unpredictable. In contrast, the exploration coefficient ε of the DQN agent decreased from 0.92 to 0.12 (
Figure 11), reflecting classical deterministic convergence: initially many random actions, later increasing reliance on the learned Q-function. Although both agents reached convergence, the SAC process was smoother and less sensitive to local minima, while DQN required more episodes to reach stability due to limited entropic regulation. A comparison of the reward dynamics shows the advantage of SAC in stability. SAC reward values stabilized around 170–180 over episodes, with less variation in the late phase (
Figure 9), indicating effective energy imbalance minimization and CO2 emission reduction. DQN, while achieving similar average rewards, exhibited higher volatility, especially at the beginning, due to a more stringent exploration reduction strategy. This suggests that SAC is better at adapting to real-time variability, while DQN may be more efficient in simpler, less stochastic systems. When evaluating component control strategies, SAC demonstrates a more even and economical approach compared to the more reactive behavior of DQN. In the case of the battery, the SAC agent maintains longer neutral phases (about 80–85% of the time), with infrequent, cyclical charge/discharge episodes that correlate with excess generation (e.g., at noon) or deficit (e.g., in the morning) (
Figure 10). This reduces the number of battery cycles and extends the lifetime by prioritizing system stability. DQN, on the contrary, causes more frequent and fragmented switching (e.g., several discharges in the morning and evening), which indicates a reactive response to momentary imbalances, but can lead to higher wear (
Figure 12). The electrolyzer control in the case of SAC is adaptive and dependent on excess energy: it is activated at short intervals (e.g., in the morning and evening) when generation exceeds the load, ensuring efficient hydrogen production without unnecessary energy waste (
Figure 10). DQN activates the electrolyzer less frequently, but impulsively (e.g., only three or four times per day), which indicates conservatism, but may miss some excesses due to lower flexibility (
Figure 12). Similarly, in the fuel cell strategy, SAC uses the fuel cell as a reserve, activating it only during deficits (e.g., in the morning and in the afternoon), while DQN activates it rarely (about 12.5% of the time) but strategically, avoiding unnecessary hydrogen consumption. Overall, the SAC agent exhibits better flexibility and stability in dynamic HRES environments, thanks to entropic regulation, which allows for better handling of uncertainty and achieving a more favorable energy balance with lower losses. DQN, although efficient and simpler to implement, is more reactive and suitable for systems with lower variability, but may require additional hybridization for more complex scenarios. These results indicate that SAC is superior for real-time HRES control, contributing to higher energy efficiency and sustainability.
To further strengthen the evaluation of the proposed reinforcement learning-based control strategies, a quantitative reference to traditional baseline approaches is introduced. In conventional HRES control, rule-based or deterministic strategies typically operate using predefined thresholds for battery charging/discharging and fixed activation logic for electrolyzers and fuel cells, without adaptive response to stochastic system dynamics. Based on widely reported characteristics of such methods in the literature, traditional control strategies typically maintain energy imbalance within the range of 1.0–1.5 MWh under comparable renewable variability conditions. In addition, due to the absence of predictive adaptation, these approaches often result in frequent and inefficient switching of system components, leading to increased operational stress and reduced overall efficiency. In contrast, the results obtained in this study show that the SAC agent maintains energy imbalance consistently below 0.5 MWh across the entire simulation horizon, representing a reduction of more than 50% compared to conventional baseline performance. Furthermore, the SAC-based control significantly reduces unnecessary switching cycles by maintaining stable operational states for approximately 80–85% of the time, thereby improving system reliability and component lifetime. The DQN agent also demonstrates improved performance relative to traditional approaches, achieving lower imbalance levels compared to rule-based control; however, it exhibits more frequent switching behavior and higher variability compared to SAC, indicating a more reactive control strategy. These findings confirm that reinforcement learning-based control not only enhances energy balancing accuracy but also improves overall system efficiency and sustainability. 
The observed performance gains are primarily attributed to the ability of RL agents to learn adaptive policies that account for both real-time system states and predicted dynamics, which is not achievable using conventional static control methods.
To further assess the robustness of the proposed control framework, additional evaluation was conducted under varying operating conditions, including fluctuations in renewable generation and load demand (±10–20%). These variations simulate different operational scenarios, such as high renewable penetration, low-generation periods, and peak load conditions, allowing the trained agents to be tested across a wide range of dynamic system states. The results demonstrate that the SAC agent maintains stable performance under these varying conditions, consistently keeping energy imbalance below 0.5 MWh while avoiding excessive component switching. In contrast, the DQN agent shows higher sensitivity to rapid fluctuations, leading to more frequent control actions and increased variability in system response. Furthermore, the stochastic nature of the environment implicitly introduces uncertainty into the decision-making process, allowing the reinforcement learning agents to adapt their policies to unseen states. This indicates that the proposed approach is capable of handling uncertainty without requiring explicit probabilistic modeling. Although the present study focuses on a representative daily scenario, the obtained results suggest that the proposed framework is generalizable to broader operating conditions, as the learning-based control strategy does not rely on fixed rules but adapts to system dynamics in real time.
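The text does not describe how the ±10–20% variations were generated. One possible way to construct such scenarios is sketched below: each value of a profile is scaled up or down by a random factor between 10% and 20%, and the largest mismatch between generation and load is reported. The function names, the random scheme, and the sample profiles are assumptions, and the imbalance measure deliberately ignores storage and hydrogen flows.

```python
import random


def perturb_profile(profile, low=0.10, high=0.20, rng=None):
    """Scale each value up or down by a random relative amount in [low, high]."""
    rng = rng or random.Random(0)
    out = []
    for value in profile:
        magnitude = rng.uniform(low, high)
        sign = rng.choice((-1.0, 1.0))
        out.append(value * (1.0 + sign * magnitude))
    return out


def max_abs_imbalance(generation, load):
    """Largest absolute generation-load mismatch (storage flows not included)."""
    return max(abs(g - l) for g, l in zip(generation, load))


# Hypothetical hourly values in MWh, used only to exercise the functions.
generation = [3.0, 2.6, 2.1, 3.4]
load = [2.8, 2.7, 2.3, 3.0]
scenario_gen = perturb_profile(generation, rng=random.Random(1))
scenario_load = perturb_profile(load, rng=random.Random(2))
print(max_abs_imbalance(scenario_gen, scenario_load))
```

In a full evaluation, the trained agent would be run on each perturbed scenario and the residual imbalance after its storage actions would be compared against the 0.5 MWh level quoted above.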
As shown in Table 7, the SAC agent achieved superior overall control performance compared to the DQN agent across multiple operational metrics. SAC maintained lower average and maximum energy imbalance values, reduced unnecessary switching cycles of storage and hydrogen-based components, and achieved lower CO2 emission intensity. In addition, the SAC agent demonstrated smoother reward convergence and more stable battery SOC behavior, indicating improved adaptability to stochastic renewable generation and load fluctuations. These results confirm that entropy-regularized SAC control provides more stable and energy-efficient HRES management under dynamic operating conditions.
HRESs controlled using SAC and DQN agents demonstrate significantly improved performance compared to traditional control approaches, as they are able to adapt in real time to changing generation and load conditions, coordinate energy storage and reserve utilization, and reduce energy imbalance and CO2 emissions. Unlike static or rule-based methods, which operate within predefined limits and may result in inefficient energy utilization, reinforcement learning agents continuously learn from system behavior and adjust their control strategies accordingly. This leads to improved operational efficiency, enhanced system reliability, and better suitability for highly dynamic renewable energy environments.