Co-Optimization of Capacity and Operation for Battery-Hydrogen Hybrid Energy Storage Systems Based on Deep Reinforcement Learning and Mixed Integer Programming

Qian, Tiantian; Zhang, Kaifeng; Shi, Difen; Zhang, Lei

doi:10.3390/en18215638


¹ School of Electronic Engineering, Nanjing Xiaozhuang University, Nanjing 211171, China

² School of Automation, Southeast University, Nanjing 210096, China

* Author to whom correspondence should be addressed.

Energies 2025, 18(21), 5638; https://doi.org/10.3390/en18215638 (registering DOI)

Submission received: 1 October 2025 / Revised: 22 October 2025 / Accepted: 24 October 2025 / Published: 27 October 2025

(This article belongs to the Special Issue AI Solutions for Energy Management: Smart Grids and EV Charging)


Abstract

The hybrid energy storage system (HESS) that combines battery with hydrogen storage exploits complementary power/energy characteristics, but most studies optimize capacity and operation separately, leading to suboptimal overall performance. To address this issue, this paper proposes a bi-level co-optimization framework that integrates deep reinforcement learning (DRL) and mixed integer programming (MIP). The outer layer employs the TD3 algorithm for capacity configuration, while the inner layer uses the Gurobi solver for optimal operation under constraints. On a standalone PV–wind–load-HESS system, the method attains near-optimal quality at dramatically lower runtime. Relative to GA + Gurobi and PSO + Gurobi, the cost is lower by 4.67% and 1.31%, while requiring only 0.52% and 0.58% of their runtime; compared with a direct Gurobi solve, the cost remains comparable while runtime decreases to 0.07%. Sensitivity analysis further validates the model’s robustness under various cost parameters and renewable energy penetration levels. These results indicate that the proposed DRL–MIP cooperation achieves near-optimal solutions with orders of magnitude speedups. This study provides a new DRL–MIP paradigm for efficiently solving strongly coupled bi-level optimization problems in energy systems.

Keywords:

hybrid energy storage system (HESS); hydrogen storage; battery energy storage system; co-optimization; bi-level; deep reinforcement learning (DRL)

1. Introduction

The hybrid energy storage system (HESS) integrates power-based energy storage technologies, such as supercapacitors, flywheels, and batteries, with energy-based energy storage technologies, such as hydrogen, thermal, compressed-air, and gravity. This combined approach balances fast power response with long-duration energy supply, addressing the shortcomings of single storage technologies. By adjusting power output across different time scales, HESS improves renewable energy utilization and strengthens the overall stability of energy networks. With properly optimized resource allocation, HESS not only reduces overall system investment costs but also extends the service life of storage devices. Owing to these technical and economic merits, HESS has emerged as a prominent research focus in the energy sector and has been widely deployed in renewable energy systems, active distribution networks, electric vehicles, and integrated energy systems.

Existing studies on HESS optimization primarily focus on three aspects: capacity (sizing) optimization, operation (energy management, scheduling, and power allocation) optimization, and the co-optimization of both capacity and operation. This paper systematically reviews and analyzes the current literature from these three perspectives.

(1): Capacity Optimization

References [1,2,3,4,5] investigated the application of HESS in grid-connected photovoltaic (PV), wind, and wave energy systems, with a focus on capacity optimization to smooth power fluctuations. Reference [6] proposed a multi-objective capacity optimization approach for standalone wind power systems, jointly considering economic performance, reliability, and energy consumption rate. References [7,8] explored HESS deployment in renewable energy systems, optimizing system sizing in response to load demands under varying climatic conditions. Reference [9] examined the role of HESS in islanded microgrids, configuring storage capacity to minimize both overall operating costs and the flexibility-deficiency rate. References [10,11,12,13] focused on integrated energy systems, where capacity optimization of HESS improved system economics and stability while mitigating wind/PV power curtailment and voltage fluctuations.

In summary, HESS capacity optimization models primarily focus on economy, reliability, and equipment operating conditions; however, their operational strategies are mostly rule-based deterministic methods, lacking systematic operation optimization modeling.

(2): Operation Optimization

References [14,15,16] investigated energy management strategies for HESS in electric vehicles, aiming to improve energy efficiency and driving range. References [17,18,19,20,21,22] focused on renewable energy grid-integration scenarios, optimizing energy management to enhance system stability. References [23,24,25,26,27,28,29,30,31,32,33,34,35] explored the optimal operation of HESS in microgrids to achieve global energy management objectives. References [36,37,38] discussed the scheduling optimization methods of HESS in integrated energy systems.

These studies generally assume that storage capacity is predetermined, concentrating instead on optimizing decision variables at the operational level.

The aforementioned studies consider capacity configuration and operation optimization separately; however, the two are strongly coupled. Consequently, an increasing number of studies in recent years have focused on optimizing them jointly.

(3): Co-Optimization of Capacity and Operation

The collaborative optimization of capacity and operation for HESS is inherently a mixed-integer nonlinear optimization problem that is difficult to both model and solve. The strongly coupled capacity–operation problem poses three major challenges:

(1): Strong coupling between long-term investment and short-term operation;
(2): High nonlinearity and mixed-integer characteristics;
(3): Environmental uncertainty and computational complexity.

To this end, this part of the literature review focuses on model construction methods and solution strategies for the co-optimization of HESS capacity and operation, with a summary of relevant studies presented in Table 1.

In general, existing studies typically model the co-optimization of HESS capacity and operation as a bi-level or two-stage optimization model. Specifically, the capacity level focuses on capacity configuration optimization, with objectives usually being the minimization of the system’s life-cycle investment cost or comprehensive economic indicators; the operation level emphasizes operation optimization, targeting the minimization of the system’s daily operating cost while maintaining reliability. In terms of modeling, the capacity configuration problem is often formulated as a nonlinear programming model (NLP); due to the need to consider energy storage charging/discharging logic and energy balance constraints, the operation optimization problem is usually modeled as a mixed-integer linear programming (MILP), mixed-integer nonlinear programming (MINLP), or dynamic programming model (DP).

Solution methods for the bi-level optimization model fall into two main categories: (1) bi-level interactive iterative solution methods [39,40,41,42,43,44,45]; (2) methods that convert the bi-level model into a single-level MILP model and then solve it directly using commercial solvers [46,47]. Among them, the former is more widely applied, mainly because the interactive iterative approach can better handle the nonlinear characteristics and cross-time-scale coupling present in HESS. In contrast, although the bi-level to single-level conversion can significantly reduce computational complexity, it often requires simplifying assumptions during model conversion, which tend to neglect nonlinear coupling relationships and degrade numerical stability and robustness.

For the capacity layer, existing studies mostly adopt heuristic algorithms (such as PSO, GA, GWO, etc.) [40,41,42,43] to address the nonlinear and non-convex characteristics of the upper-level model; some studies use surrogate model-based space reduction algorithms to reduce computational complexity [39]; other scholars employ deterministic mathematical programming methods (such as sequential quadratic programming algorithms [44], ε-constraint multi-objective optimization algorithms [45]) for solution. In general, deterministic mathematical programming methods can obtain high-precision solutions but are sensitive to initial values and time-consuming; heuristic algorithms have relatively faster solution speeds but often yield suboptimal solutions in complex non-convex spaces; surrogate model methods feature high computational efficiency but their accuracy depends on the quality of approximation. The trade-off between solution accuracy and computational efficiency remains prominent.

For the operation layer, dynamic programming (DP) algorithms [39,40,41] or commercial solvers (such as CPLEX, Gurobi) [42,43,44,45] are commonly used for operation optimization, and the solution methods are relatively mature.

In addition, the volatility of renewable energy (e.g., photovoltaic and wind power) and the uncertainty of load demand result in high uncertainty in system operation [48,49]. Constrained by the model structure, the aforementioned methods usually perform optimization under a single typical scenario, making it difficult to fully reflect the uncertainties of renewable energy output and load fluctuations under multi-scenario conditions.

In recent years, deep reinforcement learning (DRL) has emerged as a method that can learn optimal strategies through multi-scenario training in an interactive environment, thereby capturing complex dynamic characteristics and various uncertainties. It exhibits stronger generalization ability and robustness when dealing with complex stochastic future environments. Moreover, once a DRL model is trained, its inference is much faster than that of other methods, which gives it a significant advantage in real-time decision-making [50].

At the same time, the degradation characteristics of energy storage systems (especially non-battery energy storage) have an important impact on both capacity planning and operation decisions, but current research has not given this issue sufficient attention.

To address the above issues, this paper constructs a hybrid energy storage system (HESS) integrating a battery energy storage system (BESS), electrolyzer (EL), fuel cell (FC), and hydrogen storage tank (HST), and applies it to a standalone hybrid renewable energy system to improve the reliability and economy of power supply. On this basis, a bi-level collaborative optimization method combining deep reinforcement learning and mixed-integer programming is proposed as follows: the outer layer (RL) adaptively learns capacity configuration strategies through interaction with the environment to cope with complex, variable, and uncertain operating environments; the inner layer (MIP) achieves precise operation scheduling and constraint satisfaction under given capacity conditions, thereby fully leveraging the complementary advantages of the two methods. In addition, the degradation characteristics of battery and hydrogen energy storage are explicitly incorporated into the model to further enhance the accuracy of optimization results and engineering applicability.

The main contributions of this paper are as follows:

(1): A novel bi-level collaborative optimization model for capacity and operation is proposed. In the capacity optimization layer, the model incorporates an adaptive mechanism for net load fluctuations into capacity boundary calculation; in the operation optimization layer, a dynamic operation model considering the degradation processes of battery and hydrogen energy storage is established to improve the model’s accuracy and engineering applicability.
(2): A novel hybrid solution algorithm that combines RL and mixed-integer programming (MIP) is proposed for the bi-level optimization model. The proposed approach leverages the adaptive learning and environment-perception capabilities of RL to dynamically respond to complex and uncertain scenarios, while employing MIP to ensure accurate optimal operation and strict constraint satisfaction. This hybrid algorithm achieves near-optimal performance with significantly reduced computational time.
(3): The proposed model is validated through multi-scenario simulations and sensitivity analyses, demonstrating its robustness and generalization capability. The results show that the proposed method maintains stable optimization performance under varying operating conditions and system parameters.

The remainder of this paper is organized as follows. Section 2 describes the configuration and mathematical model of the proposed HESS integrated with hybrid renewable energy systems. Section 3 presents the bi-level collaborative optimization framework that combines deep reinforcement learning (DRL) and mixed-integer programming (MIP). Section 4 discusses the case studies and sensitivity analyses used to validate the proposed method. Finally, Section 5 concludes the paper and outlines future research directions.

2. System Model

The hybrid energy storage system considered in this paper comprises two subsystems, a lithium-ion BESS and a hydrogen energy storage system. The hydrogen energy storage system consists of an EL, a hydrogen storage tank (HST), and an FC, collectively providing power balancing and energy management services for a standalone hybrid renewable energy system, as shown in Figure 1. The system is assumed to operate in a standalone configuration under normal conditions, without considering failure contingencies.

The working principle is as follows. Surplus electricity is converted into hydrogen by the electrolyzer and stored in the hydrogen tank; when electricity is required, the fuel cell reconverts the stored hydrogen into power. The lithium battery primarily provides fast, short-term power response, while the two subsystems operate in a complementary manner to achieve multi-timescale energy management.

To achieve the co-optimization of capacity configuration and operational strategy for the lithium-hydrogen HESS, a bi-level optimization model is developed. The core idea is to decompose the comprehensive optimization problem into two interrelated but hierarchical sub-problems. The outer layer deals with capacity configuration (long-term investment decision), and the inner layer addresses operation optimization (short-term dispatch decision). The outer layer determines the optimal capacities of each storage component and provides boundary constraints to the inner layer; the inner layer, in turn, derives the optimal operational strategy under the given capacity limits and feeds back its objective value to the outer layer as the economic performance metric for investment decisions. The detailed formulation of the bi-level optimization model is presented below.

2.1. Inner-Layer Operation Optimization

This layer primarily addresses the optimal operating strategy of the HESS in typical daily scenarios, subject to the capacity configuration parameters determined by the outer layer. The decision variables include the electrolyzer power, fuel cell power, and battery charging/discharging power, defined as $P_{ele}(t)$, $P_{hfc}(t)$, $P_{bat.c}(t)$ and $P_{bat.d}(t)$, respectively. The objective function and constraints of the inner layer are formulated as follows.

2.1.1. Objective Function

The objective of the inner-layer optimization is to minimize the weighted sum of the daily operating cost and power deviation, thereby achieving coordinated optimization of the system’s economic efficiency and reliability.

$$O_{inner} = \min (C_{op} + PD) \tag{1}$$

where $C_{op}$ denotes the total daily operation and maintenance cost of each component, as calculated by Equation (2). It consists of three parts: the degradation cost of the electrolyzer, the degradation cost of the fuel cell, and the aging cost associated with battery charging and discharging.

$$C_{op} = \sum_{t=1}^{T} \left( D_{ele} P_{ele}(t) + D_{hfc} P_{hfc}(t) + D_{bat}(t) \left( P_{bat.c}(t) + P_{bat.d}(t) \right) \right) \times \Delta t \tag{2}$$

where $T$ represents the length of the optimization time horizon (typically set to 24 h), and the calculation formula for the degradation cost coefficient for the battery is given in Equation (3).

$$D_{bat}(t) = \frac{C_{inv.bat} - RS_{bat}}{2 \, DoD(t) \, ACC(t) \sqrt{\eta_{bat.c} \eta_{bat.d}}} \tag{3}$$

where $C_{inv.bat}$ denotes the battery investment cost, $RS_{bat}$ represents the salvage value, $DoD(t)$ denotes the depth of discharge (DoD), $ACC(t)$ denotes the accelerated aging coefficient of the battery, and $\eta_{bat.c}$ and $\eta_{bat.d}$ represent the charging and discharging efficiencies of the battery, respectively. $DoD(t)$ and $ACC(t)$ are given by Equations (4) and (5), where $a$ and $b$ are the parameters of the battery aging model, with values a = 1.8 × 10⁻³ and b = 2, based on experimental references.

$$DoD(t) = 1 - SoC(t) \tag{4}$$

$$ACC(t) = \frac{a}{(DoD(t))^{b}} \tag{5}$$
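As an illustration of how Equations (3)–(5) chain together, the following minimal Python sketch evaluates the battery degradation cost coefficient for a given state of charge. The function name, argument names, and the numerical values in the usage line are hypothetical; only the formulas and the values of a and b come from the text. The guard against a zero depth of discharge is an added assumption, since Equation (5) is undefined at SoC = 1.

```python
import math


def battery_degradation_coeff(soc, c_inv_bat, rs_bat, eta_c, eta_d,
                              a=1.8e-3, b=2.0):
    """Degradation cost coefficient D_bat(t) following Eqs. (3)-(5).

    soc is the state of charge in [0, 1); cost units follow c_inv_bat.
    """
    dod = 1.0 - soc                       # Eq. (4): depth of discharge
    if dod <= 0.0:
        # Assumption of this sketch: reject SoC = 1, where Eq. (5) divides by zero.
        raise ValueError("DoD must be positive to evaluate Eq. (5)")
    acc = a / dod ** b                    # Eq. (5): accelerated aging coefficient
    denom = 2.0 * dod * acc * math.sqrt(eta_c * eta_d)
    return (c_inv_bat - rs_bat) / denom   # Eq. (3)


# Hypothetical inputs, for illustration only.
d_bat = battery_degradation_coeff(soc=0.5, c_inv_bat=1000.0, rs_bat=100.0,
                                  eta_c=0.95, eta_d=0.95)
```

Because $D_{bat}(t)$ depends on $SoC(t)$, the coefficient would have to be re-evaluated at every time step of the horizon.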

The degradation cost coefficient of the electrolyzer, $D_{ele}$, is given in Equation (6), where $f_{ele}$ denotes the ratio of maintenance and replacement costs to the investment cost of the electrolyzer, $C_{inv.ele}$ represents the unit power investment cost of the electrolyzer, and $H_{ele}$ denotes the designed service life of the electrolyzer.

$$D_{ele} = \frac{f_{ele} C_{inv.ele}}{H_{ele}} \tag{6}$$

The degradation cost coefficient of the fuel cell, $D_{hfc}$, is given in Equation (7), where $f_{hfc}$ denotes the ratio of maintenance and replacement costs to the investment cost of the fuel cell, $C_{inv.hfc}$ represents the unit power investment cost of the fuel cell, and $H_{hfc}$ denotes the designed service life of the fuel cell.

$$D_{hfc} = \frac{f_{hfc} C_{inv.hfc}}{H_{hfc}} \tag{7}$$
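The daily operating cost of Equation (2), together with the constant coefficients of Equations (6) and (7), can be sketched in a few lines of Python. The function and argument names are hypothetical, the service life is assumed to be expressed in operating hours, and the power series are assumed to be plain lists of equal length; none of these choices is stated in the text.

```python
def component_degradation_coeff(f_ratio, c_inv_unit, service_life):
    """Eqs. (6)/(7): D = f * C_inv / H for the electrolyzer or the fuel cell."""
    return f_ratio * c_inv_unit / service_life


def daily_operating_cost(p_ele, p_hfc, p_bat_c, p_bat_d,
                         d_ele, d_hfc, d_bat, dt=1.0):
    """Eq. (2): sum of degradation costs over the horizon.

    d_bat is a per-step sequence because D_bat(t) varies with SoC(t);
    d_ele and d_hfc are constants.
    """
    return sum(
        (d_ele * pe + d_hfc * pf + db * (pc + pd)) * dt
        for pe, pf, pc, pd, db in zip(p_ele, p_hfc, p_bat_c, p_bat_d, d_bat)
    )


# Hypothetical two-step example.
d_ele = component_degradation_coeff(0.3, 1000.0, 30000.0)
c_op = daily_operating_cost(
    p_ele=[10.0, 0.0], p_hfc=[0.0, 5.0],
    p_bat_c=[2.0, 0.0], p_bat_d=[0.0, 3.0],
    d_ele=d_ele, d_hfc=0.02, d_bat=[0.1, 0.1],
)
```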

$PD$ denotes the power deviation, which is used to evaluate the supply–demand balancing capability of the HESS. It is typically required that PD ≤ 1–2%, and its calculation formula is given as follows:

$$PD = \frac{\sum_{t=1}^{T} \left| P_{net}(t) - \left( P_{ele}(t) + P_{hfc}(t) + P_{bat.c}(t) + P_{bat.d}(t) \right) \right|}{\sum_{t=1}^{T_{2}} \left| P_{net}(t) \right|} \tag{8}$$

where $P_{net}(t)$ denotes the net load power at time t, calculated as the load demand minus the renewable energy power output, so that a positive value indicates a power deficit, whereas a negative value indicates a power surplus.
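The power deviation is a normalized absolute tracking error, which the short Python sketch below makes concrete. It rests on two assumptions that go beyond the text: the HESS contribution is passed in as a single signed net series (discharge positive, charge negative, in line with the power balance of Equation (26)), and the numerator and denominator are summed over the same horizon. The function name is hypothetical.

```python
def power_deviation(p_net, p_hess_net):
    """Normalized absolute mismatch between net load and net HESS output.

    p_net:      net load per step (deficit > 0, surplus < 0)
    p_hess_net: signed net HESS power per step (assumed convention:
                discharge > 0, charge < 0)
    """
    mismatch = sum(abs(n - h) for n, h in zip(p_net, p_hess_net))
    total = sum(abs(n) for n in p_net)
    if total == 0.0:
        raise ValueError("net load is identically zero; PD is undefined")
    return mismatch / total


# Hypothetical three-step profile: the HESS misses 0.2 kW in the last step.
pd_value = power_deviation([10.0, -5.0, 5.0], [10.0, -5.0, 4.8])
```

In this example the deviation is about 1%, which sits at the lower end of the 1–2% requirement quoted above.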

2.1.2. Constraints

Hydrogen Energy Storage System Constraints

(1): Power operation constraints.

$$0 \leq P_{ele}(t) \leq u_{ele}(t) \cdot P_{ele}^{rated} \tag{9}$$

$$0 \leq P_{hfc}(t) \leq u_{hfc}(t) \cdot P_{hfc}^{rated} \tag{10}$$

(2): Mutual-exclusion constraint (to prevent the electrolyzer and fuel cell from operating simultaneously).

$$u_{ele}(t) + u_{hfc}(t) \leq 1 \tag{11}$$

(3): Hydrogen storage tank dynamic balance equation.

$$SOH_{HT}(t) = SOH_{HT}(t-1) + \left( \frac{G_{ele}(t) \eta_{s}}{Q_{HT}} - \frac{G_{hfc}(t)}{\eta_{c} Q_{HT}} \right) \cdot \Delta t \tag{12}$$

(4): Hydrogen storage tank state constraints.

$$SOH_{HT}^{\min} \leq SOH_{HT}(t) \leq SOH_{HT}^{\max} \tag{13}$$

where the hydrogen production rate and hydrogen consumption rate are given as

$$G_{ele}(t) = \frac{\eta_{ele} P_{ele}(t)}{E_{H_{2}}} \tag{14}$$

$$G_{hfc}(t) = \frac{P_{hfc}(t)}{\eta_{hfc} E_{H_{2}}} \tag{15}$$

(5): To ensure the feasibility and stability of the system during multi-day continuous operation, periodic constraints are imposed.

$$\left| SOH_{T-1} - SOH_{0} \right| \leq \epsilon \tag{16}$$

(6): To prevent frequent start–stop cycling of the electrolyzer, which may accelerate its degradation, a start–stop operation constraint is imposed.

$$\sum_{t=0}^{T-1} \left( startup_{ele,t} + shutdown_{ele,t} \right) \leq N_{switch,max} \tag{17}$$

where the start–stop indicator variables are defined as

$$startup_{ele,t} \geq u_{ele,t} - u_{ele,t-1} \tag{18}$$

$$shutdown_{ele,t} \geq u_{ele,t-1} - u_{ele,t} \tag{19}$$

where $u_{ele}(t)$ and $u_{hfc}(t)$ are the binary variables indicating whether the electrolyzer and the fuel cell are operating, respectively; they are equal to 1 when operating and 0 when shut down. $SOH_{HT}$ denotes the state of the hydrogen storage tank; $\eta_{ele}$ is the electricity-to-hydrogen conversion efficiency of the electrolyzer; $\eta_{hfc}$ is the hydrogen-to-electricity conversion efficiency of the fuel cell; $\eta_{s}$ represents the hydrogen compression storage efficiency, and $\eta_{c}$ denotes the hydrogen decompression efficiency, representing the energy transfer efficiency during the release process from the hydrogen tank to the fuel cell; $E_{H_{2}}$ is the lower heating value (LHV) of hydrogen (kWh/kg); the unit of $Q_{HT}$ is kilograms (kg); $\epsilon$ represents the allowable difference between the starting state and the ending state, typically 10%; and $N_{switch,max}$ denotes the maximum allowable total number of start–stop operations.
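To make the tank dynamics concrete, the following Python sketch rolls Equations (12), (14), and (15) forward for a given power schedule and counts electrolyzer state changes in the spirit of Equations (17)–(19). It is a simulation aid under stated assumptions, not the optimization model itself: the function names are hypothetical, the default $E_{H_2}$ of 33.3 kWh/kg is a commonly quoted lower heating value rather than a figure from the text, and all other numbers in the usage lines are illustrative.

```python
def simulate_tank(p_ele, p_hfc, q_ht, soh0,
                  eta_ele, eta_hfc, eta_s, eta_c, e_h2=33.3, dt=1.0):
    """Return the tank state trajectory [SOH(0), SOH(1), ...] per Eq. (12)."""
    soh = [soh0]
    for pe, pf in zip(p_ele, p_hfc):
        g_ele = eta_ele * pe / e_h2        # Eq. (14): hydrogen produced, kg/h
        g_hfc = pf / (eta_hfc * e_h2)      # Eq. (15): hydrogen consumed, kg/h
        delta = (g_ele * eta_s / q_ht - g_hfc / (eta_c * q_ht)) * dt
        soh.append(soh[-1] + delta)        # Eq. (12)
    return soh


def count_switches(u):
    """Number of on/off transitions of a binary schedule.

    This equals the smallest value of the left-hand side of Eq. (17)
    that satisfies Eqs. (18) and (19).
    """
    return sum(abs(cur - prev) for prev, cur in zip(u, u[1:]))


# Idealized example (all efficiencies set to 1 for easy checking).
trajectory = simulate_tank([33.3, 0.0], [0.0, 33.3], q_ht=10.0, soh0=0.5,
                           eta_ele=1.0, eta_hfc=1.0, eta_s=1.0, eta_c=1.0)
n_switch = count_switches([0, 1, 1, 0, 1])
```

Checking the resulting trajectory against the bounds of Equation (13) and the periodic condition of Equation (16) is then a matter of comparing list entries.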

Battery System Constraints

(1): Power operation constraints

$$0 \leq P_{bat.c}(t) \leq u_{bat.c}(t) \cdot P_{bat}^{rated} \tag{20}$$

$$0 \leq P_{bat.d}(t) \leq u_{bat.d}(t) \cdot P_{bat}^{rated} \tag{21}$$

(2): Charging/discharging mutual-exclusion constraint (to prevent simultaneous charging and discharging)

$$u_{bat.c}(t) + u_{bat.d}(t) \leq 1 \tag{22}$$

(3): Battery state balance equation

$$SoC_{bat}(t) = SoC_{bat}(t-1) + \frac{\eta_{bat.c} P_{bat.c}(t) \Delta t}{C_{bat}} - \frac{P_{bat.d}(t) \Delta t}{\eta_{bat.d} C_{bat}} \tag{23}$$

(4): Battery state upper and lower bound constraints

$$SoC_{bat}^{\min} \leq SoC_{bat}(t) \leq SoC_{bat}^{\max} \tag{24}$$

To prevent overdraw of the energy storage system (e.g., full discharge or overcharge of the battery) caused by single-day optimization and to ensure the feasibility and stability of the system during multi-day continuous operation, a periodic constraint is imposed.

$$\left| SoC_{T-1} - SoC_{0} \right| \leq \epsilon \tag{25}$$

where $u_{bat.c}(t)$ and $u_{bat.d}(t)$ are the binary variables representing the charging and discharging states of the battery, respectively; $SoC_{bat}$ denotes the state of charge (SoC) of the battery; $\eta_{bat.c}$ and $\eta_{bat.d}$ represent the charging efficiency and discharging efficiency of the battery, respectively.
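A candidate battery schedule can be checked against Equations (23)–(25) with a small simulation, sketched below in Python. The function names and numerical values are hypothetical, and the periodic check compares the first and last entries of the simulated trajectory, which is one possible reading of the indices in Equation (25).

```python
def simulate_soc(p_c, p_d, c_bat, soc0, eta_c, eta_d, dt=1.0):
    """Return the SoC trajectory [SoC(0), SoC(1), ...] per Eq. (23)."""
    soc = [soc0]
    for pc, pd in zip(p_c, p_d):
        charge = eta_c * pc * dt / c_bat          # energy stored, as SoC fraction
        discharge = pd * dt / (eta_d * c_bat)     # energy withdrawn, as SoC fraction
        soc.append(soc[-1] + charge - discharge)
    return soc


def soc_feasible(soc, soc_min, soc_max, eps):
    """Check the bounds of Eq. (24) and the periodic condition of Eq. (25)."""
    within_bounds = all(soc_min <= s <= soc_max for s in soc)
    periodic = abs(soc[-1] - soc[0]) <= eps
    return within_bounds and periodic


# Hypothetical 100 kWh battery: charge 10 kW for 1 h, then discharge 9.025 kW.
soc_path = simulate_soc([10.0, 0.0], [0.0, 9.025], c_bat=100.0,
                        soc0=0.5, eta_c=0.95, eta_d=0.95)
ok = soc_feasible(soc_path, soc_min=0.1, soc_max=0.9, eps=0.1)
```

The example also shows the round-trip loss: 10 kWh of charging supports only 9.025 kWh of discharge before the SoC returns to its starting value.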

System Power Balance Equation

$$P_{hfc,t} + P_{bd,t} - P_{ele,t} - P_{bc,t} + \xi = P_{net,t} \tag{26}$$

where $\xi$ denotes the deficit or surplus power.

2.2. Outer-Layer Capacity Optimization

The outer-layer optimization aims to determine the optimal rated power and capacity configuration for each HESS component, representing a long-term planning decision focused on the economic performance of the system over its entire life cycle. The decision variables include the rated power of the electrolyzer, the rated power of the fuel cell, the capacity of the hydrogen storage tank, the rated power of the battery, and the rated energy capacity of the battery, defined as $P_{ele}^{rated}$, $P_{hfc}^{rated}$, $Q_{HT}$, $P_{bat}^{rated}$ and $C_{bat}$, respectively.

The objective function and constraints of the outer layer are formulated as follows.

2.2.1. Objective Function

The objective of outer-layer optimization is to minimize the total system cost, comprising the daily annualized investment cost and the daily operating cost, while satisfying the system technical constraints and operational reliability requirements.

$$C^{total} = \min \left( \frac{1}{365} \times \frac{i (1+i)^{N_{l}}}{(1+i)^{N_{l}} - 1} \times INV + C_{op} \right) \tag{27}$$

$$INV = C_{inv.ele} P_{ele}^{rated} + C_{inv.hfc} P_{hfc}^{rated} + C_{inv.HT} Q_{HT} + C_{inv.bat_p} P_{bat}^{rated} + C_{inv.bat_c} C_{bat} \tag{28}$$

where $C^{total}$ denotes the minimized daily total cost of the system; $INV$ denotes the investment cost, with its calculation given by Equation (28); $N_{l}$ denotes the technical lifetime (years); $i$ denotes the discount rate; $C_{op}$ denotes the minimum operating cost obtained from the inner-layer optimization; and $C_{inv}$ denotes the investment cost per unit power or capacity of each component.
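The first term of Equation (27) is the standard capital recovery factor spread over 365 days. A minimal Python sketch of Equations (27) and (28) follows; the function names, dictionary keys, and numbers are hypothetical, and a strictly positive discount rate is assumed because the factor is undefined at i = 0.

```python
def capital_recovery_factor(i, n_years):
    """Annuity factor i(1+i)^N / ((1+i)^N - 1); requires i > 0."""
    growth = (1.0 + i) ** n_years
    return i * growth / (growth - 1.0)


def investment_cost(unit_costs, sizes):
    """Eq. (28): sum of unit cost times rated size over all components."""
    return sum(unit_costs[name] * sizes[name] for name in sizes)


def daily_total_cost(inv, c_op, i, n_years):
    """Eq. (27): daily annualized investment plus daily operating cost."""
    return capital_recovery_factor(i, n_years) * inv / 365.0 + c_op


# Hypothetical check: a one-year lifetime at 5% gives a factor of 1.05.
crf = capital_recovery_factor(0.05, 1)
total = daily_total_cost(inv=365.0, c_op=2.0, i=0.05, n_years=1)
inv_example = investment_cost({"ele": 2.0, "bat": 3.0}, {"ele": 10.0, "bat": 1.0})
```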

2.2.2. Constraints

(1): System reliability constraints

$$PD \leq PD_{threshold} \tag{29}$$

(2): Capacity configuration boundary constraints

$$P_{ele}^{rated} \in \left( 0, \left| P_{net}^{min} \right| \times \lambda \right] \tag{30}$$

$$P_{hfc}^{rated} \in \left( 0, P_{net}^{max} \times \lambda \right] \tag{31}$$

$$Q_{HT} \in \left( 0, \max (E_{surplus}, E_{deficit}) / E_{H_{2}} \times \lambda \right] \tag{32}$$

$$P_{bat}^{rated} \in \left( 0, \max \left( P_{net}^{max}, \left| P_{net}^{min} \right| \right) \times \lambda \right] \tag{33}$$

$$C_{bat} \in \left( 0, E_{continuous}^{max} \times \lambda \right] \tag{34}$$

where $\lambda$ denotes the margin factor. Among the characteristic parameters, $P_{net}^{min}$ denotes the minimum of the net load, $P_{net}^{max}$ denotes the maximum of the net load, $E_{surplus}$ denotes the surplus energy, $E_{deficit}$ denotes the deficit energy, and $E_{continuous}^{max}$ denotes the maximum continuous energy requirement of the system. The corresponding formulas are given in Equations (35)–(39).

$$P_{net}^{min} = \min_{t} \{ P_{net,t} : P_{net,t} < 0 \} \tag{35}$$

$$P_{net}^{max} = \max_{t} \{ P_{net,t} : P_{net,t} > 0 \} \tag{36}$$

$$E_{deficit} = \sum_{t : P_{net,t} > 0} P_{net,t} \Delta t \tag{37}$$

$$E_{surplus} = \sum_{t : P_{net,t} < 0} \left| P_{net,t} \right| \Delta t \tag{38}$$

$$E_{continuous}^{max} = \max \left( \max_{t} \sum_{i=0}^{k} P_{net,t+i}^{+}, \; \max_{t} \sum_{i=0}^{k} \left| P_{net,t+i}^{-} \right| \right) \tag{39}$$

(3): Battery charging/discharging duration constraints

$$\frac{C_{bat}}{P_{bat}^{rated}} \geq \tau_{min} \tag{40}$$

where $\tau_{min}$ denotes the required minimum charging/discharging duration of the battery system.
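The adaptive upper bounds of Equations (30)–(39) depend only on the net-load profile, so they can be computed once per profile. The Python sketch below is one possible reading: the function name, the margin factor of 1.2, and the window length k are hypothetical, the default $E_{H_2}$ of 33.3 kWh/kg is a commonly quoted value rather than one from the text, and the windowed sums are multiplied by Δt so that the result is an energy (identical to Equation (39) when Δt = 1 h).

```python
def capacity_upper_bounds(p_net, lam=1.2, k=3, e_h2=33.3, dt=1.0):
    """Upper bounds of Eqs. (30)-(34) from a net-load profile (deficit > 0)."""
    if len(p_net) <= k:
        raise ValueError("profile must be longer than the window length k")
    deficits = [p for p in p_net if p > 0]
    surpluses = [p for p in p_net if p < 0]
    p_max = max(deficits, default=0.0)               # Eq. (36)
    p_min = min(surpluses, default=0.0)              # Eq. (35)
    e_deficit = sum(deficits) * dt                   # Eq. (37)
    e_surplus = sum(-p for p in surpluses) * dt      # Eq. (38)
    # Eq. (39): sliding windows of k + 1 consecutive steps.
    windows = [p_net[t:t + k + 1] for t in range(len(p_net) - k)]
    e_cont = dt * max(
        max(sum(max(p, 0.0) for p in w) for w in windows),
        max(sum(max(-p, 0.0) for p in w) for w in windows),
    )
    return {
        "P_ele_rated": abs(p_min) * lam,                        # Eq. (30)
        "P_hfc_rated": p_max * lam,                             # Eq. (31)
        "Q_HT": max(e_surplus, e_deficit) / e_h2 * lam,         # Eq. (32)
        "P_bat_rated": max(p_max, abs(p_min)) * lam,            # Eq. (33)
        "C_bat": e_cont * lam,                                  # Eq. (34)
    }


# Hypothetical five-step profile with unit margin and unit heating value.
bounds = capacity_upper_bounds([2, -1, 3, -4, 1], lam=1.0, k=1, e_h2=1.0)
```

The duration requirement of Equation (40) is a separate check on the chosen pair ($C_{bat}$, $P_{bat}^{rated}$) and is not part of these bounds.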

3. A Cooperative DRL–MIP Framework for HESS Capacity Configuration and Operation Optimization

To solve the bi-level co-optimization model proposed in Section 2, this study develops a cooperative algorithm that integrates DRL with MIP. The outer layer employs a DRL method, exemplified by the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for adaptive exploration of capacity configuration, while the inner layer formulates the operational optimization as an MIP problem and solves it exactly using Gurobi. In this way, efficient co-optimization of capacity configuration and operation optimization is achieved.

3.1. Collaborative Optimization Mechanism

3.1.1. Outer Layer Design

The HESS capacity configuration problem can be modeled as a Markov Decision Process (MDP), defined by a five-tuple $M = (S, A, P, R, \gamma)$. Here, $S$ denotes the state space, represented by an 18-dimensional vector comprising load characteristics, the current capacity configuration, and historical optimization information; $A$ denotes the action space, consisting of five continuous adjustment variables for capacity configuration; $P : S \times A \times S \to [0, 1]$ denotes the state transition probability, jointly determined by the capacity update rules and the outcomes of the inner-layer optimization; $R : S \times A \to \mathbb{R}$ denotes the reward function, constructed based on cost improvements and constraint satisfaction; and $\gamma \in [0, 1]$ denotes the discount factor, which balances immediate rewards and long-term returns.

Within this MDP framework, the agent incrementally optimizes the capacity configuration through sequential decisions. At each decision step, the agent observes the current state $s_{t} \in S$, selects an action $a_{t} \in A$ to adjust the capacities, and the environment executes the inner-layer optimization based on the updated configuration, returning the reward $r_{t}$ and the next state $s_{t+1}$. The state-transition process satisfies the Markov property $P(s_{t+1} \mid s_{0}, a_{0}, \dots, s_{t}, a_{t}) = P(s_{t+1} \mid s_{t}, a_{t})$, ensuring that the current decision depends only on the current state, without requiring knowledge of the entire historical trajectory.

3.1.2. Inner Layer Design

With the operating cost and power-balance deviation as the optimization objectives, a constrained model is formulated, and its results are fed back to the outer layer as the reward signal, thereby forming a closed-loop optimization mechanism.

The overall algorithmic workflow is illustrated in Figure 2 and outlined as follows:

Step 1: Algorithm initialization. Configure the network architecture and hyperparameters, including the TD3 framework, training parameters, and the experience replay buffer, and then enter the outer-layer TD3 decision loop.

Step 2: Environment reset and adaptive boundary computation. At the beginning of each episode, reset the environment and compute adaptive bounds based on the typical daily net-load profile to provide reasonable constraint ranges for capacity configuration.

Step 3: Action selection and configuration update. Select actions using the TD3 networks and update the capacity configuration.

Step 4: Inner-layer scheduling optimization. Given the current capacities, use Gurobi to solve a mixed-integer programming model that minimizes the operating cost and power-balance deviation, thereby yielding the optimal operating strategy.

Step 5: Reward calculation and configuration update. If the current solution outperforms the historical best in terms of cost and power-balance constraints, update the best configuration; otherwise, keep it unchanged.

Step 6: Network parameter updates. Store experience samples in the replay buffer and train the TD3 networks by sampling mini-batches, including value-function updates for the critic network and policy-gradient optimization for the actor network.

Step 7: Convergence checks and output. When the number of steps in an episode reaches the preset limit, terminate the episode and output the optimal capacity configuration and operational strategy.

Step 8: Iterative loop. Repeat the above process until the training converges or the preset maximum number of episodes is reached.
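The control flow of Steps 2 to 5 and Step 7 can be sketched as a short Python skeleton. This is deliberately not the proposed algorithm: a uniform random policy stands in for the TD3 actor, a toy quadratic function stands in for the Gurobi solve, the additive clipped update of the normalized capacities is an assumption, and Step 6 (replay buffer and network training) is omitted. All names and constants are hypothetical.

```python
import random


def inner_solve_stub(config):
    """Hypothetical stand-in for the inner-layer MIP solve (Step 4).

    Takes five normalized capacities in [0, 1] and returns a toy
    (daily total cost, power deviation) pair; a real implementation
    would build and solve the scheduling model instead.
    """
    cost = 1.0 + sum((x - 0.6) ** 2 for x in config)
    pd = 0.05 * max(0.0, 0.3 - min(config))
    return cost, pd


def run_episode(n_steps, pd_threshold=0.02, step_size=0.1, seed=0):
    """One episode of the outer loop with a random policy in place of TD3."""
    rng = random.Random(seed)
    best = [0.5] * 5                       # Step 2: initial normalized configuration
    best_cost, _ = inner_solve_stub(best)
    for _ in range(n_steps):
        # Step 3: a trained actor would map the state to this action vector.
        action = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        candidate = [min(1.0, max(0.0, x + step_size * a))
                     for x, a in zip(best, action)]
        cost, pd = inner_solve_stub(candidate)         # Step 4
        if cost < best_cost and pd <= pd_threshold:    # Step 5
            best, best_cost = candidate, cost
        # Step 6 (replay buffer and network updates) is omitted in this sketch.
    return best, best_cost                 # Step 7: report the best configuration


best_config, best_cost = run_episode(n_steps=200)
```

The point of the skeleton is the division of labour: the outer loop only proposes capacities, while feasibility and cost are always evaluated by the inner solve.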

This cooperative solution algorithm integrates the exploratory capability of reinforcement learning with the exactness of mathematical programming, achieving coordinated optimality of HESS capacity configuration and operating strategy under strict technical constraints. Its core modules include the following:

State and action space design

The state space comprises statistical features of the net-load profile (10 dimensions), the normalized values of the current optimal configuration (five dimensions), and historical information features (three dimensions).

s = [f_{p r o f i l e}, x_{n o r m a l i z e d}, h_{h i s t o r y}] \in ℝ^{18}

(41)

where $f_{profile}$ denotes the feature vector of the net-load profile (normalized values of the mean, standard deviation, maximum, minimum, number of deficit periods, number of surplus periods, total deficit energy, total surplus energy, longest consecutive deficit duration, and longest consecutive surplus duration); $x_{normalized}$ denotes the normalized value of the current best capacity configuration; and $h_{history}$ denotes the normalized value of the historical information features (best daily total cost, PD value, and minimum boundary distance).
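As an illustration, the ten net-load statistics listed above could be computed as in the following sketch. The function names and the time step `dt_hours` are assumptions, the normalization step is omitted because its exact form is not specified here, and negative net load is treated as surplus, in line with the operating description in Section 4.2.

```python
from statistics import mean, pstdev

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def profile_features(net_load, dt_hours=1.0):
    """Ten raw statistics of a net-load profile (kW), before normalization.

    Positive values are treated as deficits (load exceeds generation) and
    negative values as surpluses.
    """
    deficit = [p > 0 for p in net_load]
    surplus = [p < 0 for p in net_load]
    return [
        mean(net_load),
        pstdev(net_load),
        max(net_load),
        min(net_load),
        sum(deficit),                                   # number of deficit periods
        sum(surplus),                                   # number of surplus periods
        sum(p for p in net_load if p > 0) * dt_hours,   # total deficit energy (kWh)
        -sum(p for p in net_load if p < 0) * dt_hours,  # total surplus energy (kWh)
        longest_run(deficit) * dt_hours,                # longest deficit duration (h)
        longest_run(surplus) * dt_hours,                # longest surplus duration (h)
    ]
```

Concatenating this vector with the five normalized capacities and the three history features gives the 18-dimensional state of Equation (41).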

The action space is a five-dimensional continuous space, corresponding to the adjustment increments of the capacity configuration: $Δ P_{ele}^{rated}$, $Δ P_{hfc}^{rated}$, $Δ Q_{HT}$, $Δ P_{bat}^{rated}$, and $Δ C_{bat}$.

a = [a_{1}, a_{2}, a_{3}, a_{4}, a_{5}], a_{i} \in [- 1, 1]

(42)

The update formula for the capacity configuration is,

s = [f_{p r o f i l e}, x_{n o r m a l i z e d}, h_{h i s t o r y}] \in ℝ^{18}

(43)
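To illustrate how a normalized action could be turned into a capacity update, the sketch below assumes an additive form in which each component $a_{i}$ scales a per-component maximum step and the result is clipped to capacity bounds. The function name, the step sizes, and the bounds are hypothetical and are not taken from this paper.

```python
def apply_action(capacity, action, max_step, lower, upper):
    """Shift each capacity by action * max_step and clip to [lower, upper]."""
    updated = []
    for x, a, step, lo, hi in zip(capacity, action, max_step, lower, upper):
        a = max(-1.0, min(1.0, a))            # enforce a_i in [-1, 1]
        updated.append(max(lo, min(hi, x + a * step)))
    return updated
```

Clipping keeps every candidate configuration inside its technical bounds, so the inner-layer scheduling model is only ever solved for admissible capacities.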

2. Reward function design

The reward function is constructed based on the magnitude of cost improvement and the power-balance constraints. Equations (44) and (45), respectively, quantify the degree of cost improvement or deterioration of the current solution relative to the previous best solution, whereas Equation (46) evaluates whether the current power-balance deviation exceeds the specified threshold.

ρ = \frac{\max (0, C_{b e s t}^{p r e v} - C_{c u r r e n t})}{C_{b e s t}^{p r e v} + ϵ_{1}}

(44)

ρ_{n e g} = \frac{\max (0, C_{c u r r e n t} - C_{b e s t}^{p r e v})}{C_{b e s t}^{p r e v} + ϵ_{1}}

(45)

ρ_{n e g} = \frac{\max (0, C_{c u r r e n t} - C_{b e s t}^{p r e v})}{C_{b e s t}^{p r e v} + ϵ_{1}}

(46)

When the optimal cost decreases and the power deviation does not exceed the threshold, the reward is computed by Equation (47). If the cost decreases but the power deviation exceeds the threshold, it is calculated by Equation (48). Otherwise, a negative reward is assigned, as given in Equation (49).

R = R_{b o n u s} + α \cdot \min ({(\max (\frac{ρ}{s_{c}}, 1.0))}^{p}, R_{c a p})

(47)

R = - β \cdot v

(48)

R = - γ (1 - e^{- ρ_{n e g} / s_{w}}) - β \cdot v

(49)

where $PD_{threshold}$ denotes the power deviation threshold; $ϵ_{1}$ denotes a numerical stability term; $R_{bonus}$ denotes the base improvement reward; $α$ denotes the improvement amplification coefficient; $p$ denotes the superlinear exponent; $R_{cap}$ denotes the reward cap; $γ$ denotes the penalty coefficient for no improvement; $β$ denotes the penalty coefficient for PD violation; and $s_{c}$ and $s_{w}$ denote the normalization scale parameters.

3. Network architecture and training strategy
(1): Network architecture

The Actor network adopts a multilayer fully connected architecture, incorporating LayerNorm and Dropout to enhance generalization. The Critic network also uses a multilayer structure, with the state–action pair as its input.

(2): TD3 core strategy

Find the optimal actor policy $π_{θ}$ that minimizes the expected total system cost $C^{total}(s, a)$, where θ denotes the parameters of the actor network [51].

\min E_{s, a \sim π_{θ}} [C^{t o t a l} (s, a)]

(50)

(3): Network update mechanism

The actor parameters are updated by gradient ascent on a critic's Q-value estimate, and the parameters of each critic are updated by gradient descent on its loss, as given in Equations (51) and (52), respectively.

θ_{μ} \leftarrow θ_{μ} + α_{μ} \nabla_{θ_{μ}} E [Q_{i} (s, μ_{θ_{μ}} (s))]

(51)

θ_{Q_{i}} \leftarrow θ_{Q_{i}} - α_{Q} \nabla_{θ_{Q_{i}}} L_{i}

(52)

where $μ$ denotes the actor network (policy network), describing how the agent selects actions based on the observed state; $θ_{μ}$ denotes the parameter set of the policy function; $α_{μ}$ denotes the actor learning rate; $μ_{θ_{μ}}(s)$ denotes the action output by the actor network under state s, i.e., the policy function; $\nabla_{θ_{μ}}$ denotes the gradient operator with respect to $θ_{μ}$; $Q_{i}$ denotes the i-th critic network’s Q-function (action–value function); $θ_{Q_{i}}$ denotes the parameters of the i-th critic network; $α_{Q}$ denotes the critic learning rate; $\nabla_{θ_{Q_{i}}}$ denotes the gradient operator with respect to $θ_{Q_{i}}$; and $L_{i}$ denotes the loss function of the i-th critic network, computed as in Equation (53).

L_{i} = E [{(Q_{i} (s_{t}, a_{t}) - y_{t})}^{2}]

(53)

The target Q-value $y_{t}$ is computed as

y_{t} = r_{t} + γ \min_{i = 1, 2} Q_{i}^{'} (s_{t + 1}, μ^{'} (s_{t + 1}) + noise)

(54)

where $r_{t}$ denotes the immediate reward; $γ$ denotes the discount factor; $\min_{i = 1, 2} Q_{i}^{'}$ denotes the minimum of the two target critic network outputs; $μ^{'}(s_{t + 1})$ denotes the action generated by the target actor network at the next state $s_{t + 1}$; and $noise$ denotes the random noise added to the target action.
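The target in Equation (54) can be sketched for a single transition as follows. The clipping of the noise and of the smoothed action follows standard TD3 practice, and the default values of `noise_std`, `noise_clip`, and `discount` are common TD3 settings, not values reported in this paper. The actor and critics are passed in as plain callables for illustration.

```python
import random

def td3_target(reward, next_state, target_actor, target_critics,
               discount=0.99, noise_std=0.2, noise_clip=0.5, rng=random):
    """Clipped double-Q target y_t for one transition, as in Equation (54)."""
    action = target_actor(next_state)
    smoothed = []
    for a in action:
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
        smoothed.append(max(-1.0, min(1.0, a + noise)))  # keep action in [-1, 1]
    # Taking the smaller of the two target critics limits overestimation.
    q_min = min(q(next_state, smoothed) for q in target_critics)
    return reward + discount * q_min
```

Using the minimum of the two target critics is what distinguishes this target from the single-critic target of DDPG.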

(4): Soft update mechanism

The soft update mechanism ensures smooth iteration of the target network parameters, with the update process given by Equations (55) and (56).

θ_{Q_{i}}^{'} \leftarrow τ θ_{Q_{i}} + (1 - τ) θ_{Q_{i}}^{'}

(55)

θ_{μ}^{'} \leftarrow τ θ_{μ} + (1 - τ) θ_{μ}^{'}

(56)

where $τ$ denotes the soft update parameter; $θ_{Q_{i}}^{'}$ denotes the parameters of the i-th target critic network; and $θ_{μ}^{'}$ denotes the parameters of the target actor network.
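Applied element-wise to a flat list of parameters, the soft update of Equations (55) and (56) reduces to the following sketch. The function name and the default value of `tau` are illustrative assumptions.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return tau * online + (1 - tau) * target for each parameter."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]
```

With a small $τ$, the target parameters track the online parameters slowly, which keeps the regression target of the critics from shifting abruptly between updates.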

(5): Prioritized experience replay

To improve learning efficiency, a prioritized experience replay mechanism is employed. The core idea is to assign sampling probabilities based on the gap between the current Q-value prediction and the “true” target value (Temporal Difference Error, TDE). The corresponding formulas are given in Equations (57)–(59).

P (i) = \frac{p_{i}^{α_{p r i o r i t y}}}{\sum_{k} p_{k}^{α_{p r i o r i t y}}}

(57)

p_{i} = | δ_{i}^{(1)} + δ_{i}^{(2)} | / 2 + ϵ_{p r i o r i t y}

(58)

δ_{i}^{(j)} = Q_{j} (s_{i}, a_{i}) - y_{i}

(59)

where $p_{i}$ denotes the priority of sample i; $α_{priority}$ denotes the priority-sampling hyperparameter that controls the influence of the TDE on the sampling probability; $ϵ_{priority}$ denotes the priority stability term; and $δ_{i}^{(j)}$ denotes the temporal-difference error.

An importance-sampling weight $w_{i}$ is introduced to reweight samples and correct the bias induced by non-uniform sampling, where $β_{priority}$ denotes the importance-sampling hyperparameter that controls the degree of correction.

w_{i} = {(\frac{1}{N} \cdot \frac{1}{P (i)})}^{β_{p r i o r i t y}} / \max_{j} w_{j}

(60)
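Equations (57), (58) and (60) can be sketched as below for a small buffer. The function names are illustrative, the default values of `alpha`, `beta`, and `eps` are typical choices for prioritized replay rather than values from this paper, and N is taken to be the number of stored samples.

```python
def per_probabilities(td_errors_1, td_errors_2, alpha=0.6, eps=1e-6):
    """Sampling probabilities from the two critics' TD errors, Eqs. (57)-(58)."""
    priorities = [abs(d1 + d2) / 2.0 + eps
                  for d1, d2 in zip(td_errors_1, td_errors_2)]
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probabilities, beta=0.4):
    """Importance-sampling weights normalized by their maximum, Eq. (60)."""
    n = len(probabilities)
    raw = [(1.0 / (n * p)) ** beta for p in probabilities]
    largest = max(raw)
    return [w / largest for w in raw]
```

Samples with large TD errors are drawn more often but receive smaller weights in the loss, so their stronger presence in the mini-batches does not bias the expected gradient as much.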

4. Results and Discussion

This Section aims to validate the proposed TD3–Gurobi cooperative algorithm that integrates deep reinforcement learning with mathematical programming. First, the parameter settings of the case study are introduced. Next, the HESS capacity configuration solution obtained by the algorithm and its operating strategy under a typical day are presented and analyzed. Subsequently, three categories of comparative cases are designed from different validation perspectives. Finally, a sensitivity analysis is conducted to examine the impact of key parameters on system configuration and economic performance, thereby discussing the model’s applicability and robustness.

4.1. Case Setting

The system studied is a hybrid renewable energy system comprising photovoltaics (PV), wind power, local load, and a lithium-hydrogen HESS. The daily curves of PV output, wind output, and load are shown in Figure 3, Figure 4 and Figure 5, in which different colored lines represent different curves. System optimization is conducted based on annual net-load data. To present the load characteristics more intuitively and to select representative scenarios for validating the algorithm, the annual net-load profiles are clustered by renewable energy penetration into high, medium, and low scenarios, as shown in Figure 6. The main technical and economic parameters of the HESS are summarized in Table 2.

4.2. Algorithmic Solution and Results Analysis

The computing platform used for the solution is configured as follows: an Intel Core i9-14900K processor (24 cores, 32 threads), 128 GB RAM, and an NVIDIA RTX 4090 24 GB discrete GPU.

A total of 1500 daily net-load scenarios, derived from a standard dataset, were employed to train the RL agent. During the testing phase, for clarity of presentation, the net-load profiles were classified into three representative categories (high-, medium-, and low-penetration) using the k-means clustering method (shown in Figure 6). One typical scenario from each category was then selected for detailed analysis.

The training curves of the DRL–Gurobi cooperative algorithm are shown in Figure 7, demonstrating stable convergence. The parameters N (maximum number of training episodes) and N_step (maximum decision steps per episode), as illustrated in Figure 2, were empirically determined through convergence tests to balance training stability and computational efficiency. Specifically, N = 1500 episodes ensures convergence of the TD3 learning curve, while N_step = 150 steps per episode provides sufficient learning opportunities for sequential capacity adjustment decisions in the outer-layer optimization.

Using the proposed cooperative approach, the optimal HESS configuration is obtained. As shown in Table 3, under the condition that the power deviation does not exceed 0.01, the minimum daily total cost is USD 209.10.

To further illustrate the internal operation mechanism of the system, Figure 8 presents the coordinated operation strategy of each HESS component under a representative net-load profile (high-penetration scenario).

Figure 8a illustrates the power profile over the typical day, capturing the dynamic relationships among the electrolyzer, fuel cell, battery charging/discharging, and net load. Figure 8b depicts the evolution of the battery state of charge (SoC) and the state of the hydrogen storage tank (SoH).

At night (00:00–06:00 and 18:00–24:00), the net load remains negative. During these periods, the electrolyzer operates at a large scale, converting surplus electricity into hydrogen for storage; battery charging mainly occurs during the low-demand hours, cooperating with the electrolyzer to absorb excess energy.

During the daytime (09:00–12:00 and 15:00–18:00), the net load is markedly positive. The fuel cell and battery discharge jointly support power supply, performing peak shaving and valley filling. The battery SoC shows obvious charge–discharge cycles: charging in the early morning and at night, and rapid discharging during daytime load peaks. The SoC peaks at about 90% and falls to as low as 10%, indicating that the battery undertakes fast and deep regulation tasks.

The SoH exhibits relatively small variations with a gradual rise–fall pattern: it increases at night due to hydrogen production by the electrolyzer and declines during the day as the fuel cell consumes hydrogen for power generation. The changes are smooth, highlighting the long-cycle energy shifting characteristic of hydrogen storage. The battery and hydrogen storage are complementary across time scales; the battery primarily addresses short-term, rapid fluctuations, while hydrogen storage achieves inter-period energy balancing.

4.3. Comparative Analysis

To comprehensively validate the effectiveness of the proposed TD3–Gurobi cooperative optimization framework, three comparative cases are designed from different validation perspectives as follows:

Case 1 (Proposed method): TD3-based deep reinforcement learning in the outer layer and the Gurobi mixed-integer programming solver in the inner layer.

Case 2 (Comparative algorithm verification): The outer layer uses a standard genetic algorithm (GA) or particle swarm optimization (PSO), each combined with the inner-layer Gurobi solver. In addition, Gurobi is used to solve the bi-level optimization problem directly for comparison. GA and PSO are selected to examine whether the proposed DRL method outperforms mature metaheuristics on high-dimensional continuous decision problems in terms of solution quality (lower cost) and convergence efficiency (shorter computation time). Although mathematical programming methods can theoretically obtain globally optimal solutions, they often suffer from excessively long solution times or even fail to converge. The comparison with the direct Gurobi solve therefore tests whether the proposed algorithm can significantly improve computational efficiency while preserving solution quality.

Case 3 (System architecture validation): Configure a single-storage system only (either battery storage or hydrogen storage) and solve it using the same optimization method as in Case 1. This comparison is designed to verify the advantages of the hybrid energy storage system over single-storage technologies in terms of economic performance and technical efficacy.

The performance of each scheme on key economic and technical indicators is summarized in Table 4.

The comparative analysis indicates substantial differences in economic performance and computational efficiency across optimization methods and storage configurations.

Case 1 (TD3 + Gurobi) achieves the most balanced performance, with a total cost of only USD 209.10 and a computation time of just 1.3 s.

In Case 2, the total costs of GA + Gurobi and PSO + Gurobi are 4.90% and 1.32% higher than that of Case 1, respectively, and their computation times are 250 s and 225 s, approximately 192× and 173× that of Case 1. This indicates that the metaheuristic algorithms converge more slowly and have limited global search capability on this problem. Although the direct Gurobi solve attains the lowest total cost, its computation time is about 1385× that of Case 1: it can theoretically approach the global optimum, but it is too slow for complex, non-convex, high-dimensional problems and thus fails to meet rapid decision-making requirements.
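The runtime multiples follow directly from the reported times, and a baseline cost that is x% higher than Case 1 means that Case 1 is x/(100 + x) lower than that baseline. The short check below reproduces this arithmetic from the figures quoted in the text; the function names are illustrative.

```python
def times_slower(runtime_s, reference_s):
    """How many times longer a run takes than the reference run."""
    return runtime_s / reference_s

def percent_lower(percent_higher):
    """If B costs x% more than A, A costs this many percent less than B."""
    return 100.0 * (1.0 - 1.0 / (1.0 + percent_higher / 100.0))
```

With the reported values, 250 s and 225 s against 1.3 s give roughly 192× and 173×, and a 4.90% higher cost corresponds to Case 1 being about 4.67% cheaper than GA + Gurobi.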

Case 3 further validates the structural advantages of the hybrid storage system. Although the battery storage scheme provides fast response, it requires an oversized capacity (3297.74 kWh) to satisfy energy balance, resulting in a total cost of USD 473.35, which is substantially higher than that of the hybrid system. The hydrogen energy storage solution has the lowest cost (USD 140.19), but its response rate cannot meet the system flexibility requirements. In contrast, the hybrid energy storage system achieves complementary capacity configuration between the battery and hydrogen storage, optimizes the balance of power and energy, and simultaneously delivers economy, flexibility, and operational stability.

4.4. Sensitivity Analysis

4.4.1. Sensitivity Analysis of Key Component Costs

To evaluate the robustness of the proposed method and identify the main cost drivers, a sensitivity analysis was conducted on the unit costs of the key components: the electrolyzer power, the fuel cell power, the hydrogen tank, and the power and energy capacity of the lithium battery. The corresponding optimal HESS configurations are shown in Table 5, Table 6, Table 7, Table 8 and Table 9, respectively. The details are as follows:

(1): Sensitivity to Electrolyzer Power Cost

Variations in the electrolyzer power cost have a pronounced impact on system configuration and total cost. When the cost increases from 550 USD/kW to 1021 USD/kW, the average daily total cost rises from USD 186.71 to USD 231.62 (+24.0%). The hydrogen storage tank capacity decreases from 232.94 kg to 72.45 kg (−68.9%), indicating that the system mitigates higher electrolyzer costs by reducing hydrogen storage capacity. Correspondingly, the battery storage capacity increases from 158.19 kWh to 521.15 kWh (+229.4%), reflecting the substitution effect of battery storage for hydrogen storage. These results indicate that the electrolyzer cost is a key determinant of the economic performance of the hybrid storage system.

(2): Sensitivity to Fuel Cell Power Cost

Changes in the fuel cell power cost have a relatively moderate impact on the system. When the cost increases from 200 USD/kW to 371 USD/kW, the average daily total cost rises only from USD 205.98 to USD 215.13 (+4.4%).

(3): Sensitivity to Hydrogen Tank Cost

Changes in the hydrogen tank cost significantly affect the system’s configuration strategy. When the cost increases from 800 USD/kg to 1486 USD/kg, the average daily total cost rises from USD 187.76 to USD 219.14 (+16.7%). The hydrogen storage tank capacity decreases from 255.82 kg to 57.52 kg (−77.5%). To compensate for the reduction in hydrogen storage, the battery storage capacity increases from 120.08 kWh to 553.54 kWh (+361.0%), reflecting a cost-driven technology substitution effect and indicating that hydrogen tank cost is a key economic factor influencing long-term storage technology choices.

(4): Sensitivity to Lithium Battery Power Cost

Variations in the lithium battery power cost have a relatively small effect on the total system cost. When the cost increases from 300 USD/kW to 557 USD/kW, the average daily total cost rises from USD 207.81 to USD 217.57 (+4.7%).

(5): Sensitivity to Lithium Battery Energy-Capacity Cost

Changes in the lithium battery energy-capacity cost have a pronounced impact on system configuration. When the cost rises from 250 USD/kWh to 464 USD/kWh, the average daily total cost increases from USD 179.71 to USD 220.08 (+22.5%). When the capacity cost is relatively low (250–321 USD/kWh), the system substantially increases the battery energy capacity (up to 611.26 kWh) while reducing investment in hydrogen storage, reflecting a cost-driven technology substitution effect.

Ranked by the magnitude of total cost variation, the cost sensitivities of the components from highest to lowest are as follows: electrolyzer power cost (24.0%) > lithium battery energy-capacity cost (22.5%) > hydrogen tank cost (16.7%) > lithium battery power cost (4.7%) > fuel cell power cost (4.4%).
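The ranking can be reproduced from the endpoint costs quoted in items (1) to (5). The sketch below is only a consistency check on those reported figures; the dictionary keys are shorthand labels introduced here.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return 100.0 * (end - start) / start

# (cost at lowest unit price, cost at highest unit price), USD per day,
# as reported in Section 4.4.1.
cost_ranges = {
    "electrolyzer power": (186.71, 231.62),
    "fuel cell power": (205.98, 215.13),
    "hydrogen tank": (187.76, 219.14),
    "battery power": (207.81, 217.57),
    "battery energy capacity": (179.71, 220.08),
}

sensitivity = {name: percent_change(lo, hi)
               for name, (lo, hi) in cost_ranges.items()}
ranking = sorted(sensitivity, key=sensitivity.get, reverse=True)
```

Each unit cost is swept over roughly the same relative range (an increase of about 86% from the lowest to the highest value), so the resulting changes in total cost are directly comparable across components.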

The proposed algorithm enables robust configuration adjustments in response to changes in component costs, maintaining overall economic performance through a dynamic balance between battery storage and hydrogen storage. When the cost of one storage technology increases, the system automatically raises the configuration share of the other, demonstrating the technological complementarity of the hybrid energy storage system.

4.4.2. Sensitivity Analysis of Renewable Energy Penetration

Optimization is performed separately under the three typical renewable energy penetration scenarios: high, medium, and low. The optimal HESS configuration is shown in Table 10. The optimal operation results are shown in Figure 8, Figure 9, and Figure 10, respectively.

Under the low-penetration scenario, renewable generation is insufficient: the required electrolyzer capacity is relatively small, but a higher-power fuel cell is needed to supplement electricity and ensure supply stability, and large hydrogen reserves are required to address long-term energy deficits. Under the medium-penetration scenario, the battery undertakes more of the short-term regulation and fluctuation mitigation, and the hydrogen demand is relatively low. Both the medium- and low-penetration scenarios require larger battery capacities to balance energy. In the low-penetration case, maintaining stable system operation also necessitates larger-capacity fuel cells and hydrogen storage, leading to a significant increase in cost.

From Figure 8, Figure 9 and Figure 10, the following can be observed.

Under the high-penetration scenario, the battery SoC exhibits a dual-peak charging pattern, reaching about 90% around 03:00–04:00 and 15:00. Discharging mainly coincides with the load peaks at 10:00–11:00 and 19:00–20:00, with the minimum SoC dropping to around 10%. Under medium penetration, the battery SoC can reach nearly 100% and remain there for an extended period, followed by rapid discharging during peak-load hours to about 25%. Under low penetration, the battery charges overnight to approximately 90% and then remains stable, with a gradual daytime discharge to a minimum of roughly 15%.

For the hydrogen storage tank, under high penetration, SoH slowly rises from about 50% to 65%, reflecting the smooth characteristics of long-duration storage. Under medium penetration, SoH quickly increases to around 100%, remains high for a considerable time, and then gradually declines, indicating higher utilization intensity. Under low penetration, SoH varies most gently, remaining essentially within the 45–52% range, implying relatively low utilization.

The TD3–Gurobi cooperative optimization algorithm yields appropriate optimal configurations across all scenarios, demonstrating its robustness and adaptability. As renewable energy penetration increases, the importance of long-duration storage becomes more pronounced, while battery storage consistently provides indispensable fast regulation in all scenarios.

5. Conclusions

This paper proposes a bi-level cooperative optimization framework that integrates deep reinforcement learning (DRL) with mixed-integer programming (MIP) to jointly optimize the capacity configuration and operating strategy of a hybrid battery-hydrogen energy storage system. Targeting the minimization of average daily total cost and power deviation, the method determines the capacity configuration and operating schedule of all system components together. The results show that the proposed approach achieves a lower cost than the GA- and PSO-based baselines at a small fraction of their runtime, and a cost comparable to the direct Gurobi solve at a far lower runtime. The comparison with single-storage schemes further verifies the advantages of hybrid storage over single-technology schemes. These findings demonstrate the feasibility and effectiveness of combining DRL with mathematical programming, providing a valuable reference for optimizing the configuration of complex energy systems.

The theoretical contribution lies in providing an efficient solution paradigm that fuses artificial intelligence with operations research for bi-level optimization problems characterized by strong coupling in energy systems. Future work may incorporate uncertainties in wind/PV and load forecasting using stochastic programming or robust optimization to enhance reliability under complex conditions. In addition, future studies will extend the proposed model to account for system resilience under islanded operation and potential component failures. This can be achieved by integrating probabilistic reliability modeling or fault-tolerant control mechanisms into the DRL–MIP framework, enabling a more comprehensive assessment of system robustness. In parallel, exploring diversified business models, such as the participation of storage in ancillary service markets, could further unlock the economic potential of hybrid energy storage systems.

Author Contributions

Conceptualization, T.Q.; methodology, T.Q.; software, T.Q.; validation, T.Q.; formal analysis, T.Q.; investigation, T.Q.; resources, T.Q.; data curation, T.Q.; writing—original draft preparation, T.Q.; writing—review and editing, T.Q.; visualization, T.Q.; supervision, K.Z.; project administration, T.Q.; funding acquisition, T.Q., L.Z. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Jiangsu Provincial Science and Technology Project No. BK20230114, the Jiangsu Provincial Universities’ Basic Science Research General Project No. 23KJD470009, and the Talent-Introduction Research Start-up Fund No. 4170084.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations and Nomenclature

The following abbreviations are used in this manuscript:

HESS	Hybrid energy storage system
DRL	Deep reinforcement learning
EL	Electrolyzer
BESS	Battery energy storage system
MIP	Mixed integer programming
FC	Fuel cell
HST	Hydrogen storage tank
MDP	Markov Decision Process
TD3	Twin Delayed Deep Deterministic Policy Gradient
SoC	State of charge
SoH	State of hydrogen storage tank
Cost related variables
$C_{o p}$	Total daily operation and maintenance cost of each component
$C_{i n v . b a t}$	Battery investment cost
$C_{i n v . e l e}$	Unit power investment cost of the electrolyzer
$C_{i n v . h f c}$	Unit power investment cost of the fuel cell
$C^{t o t a l}$	Minimized daily total cost of the system
$R S_{b a t}$	Salvage value
$f_{e l e}$	Ratio of maintenance costs to the investment cost of the electrolyzer
$f_{h f c}$	Ratio of maintenance cost to the investment cost of the fuel cell
$D_{e l e}$	Degradation cost coefficient of electrolyzer
$D_{h f c}$	Degradation cost coefficient of fuel cell
Power and energy related variables
$P_{n e t}$	Net load power at time t
$P_{e l e}$	Electrolyzer power
$P_{h f c}$	Fuel cell power
$P_{b a t . c}$ $, P_{b a t . d}$	Battery charging/discharging power
$P_{e l e}^{r a t e d}$	Rated power of the electrolyzer
$P_{h f c}^{r a t e d}$	Rated power of the fuel cell
$P_{n e t}^{m i n}$ $, P_{n e t}^{m a x}$	Minimum and maximum of the net load
$E_{s u r p l u s}$ $, E_{d e f i c i t}$	Surplus and deficit energy
$E_{c o n t i n u o u s}^{m a x}$	Maximum continuous energy requirement of the system
$ξ$	Deficit or surplus power
$Q_{H T}$	Capacity of the hydrogen storage tank
$G_{e l e}$	Hydrogen production rate
$G_{h f c}$	Hydrogen consumption rate
$A C C$	Accelerated aging coefficient of battery
Efficiency and parameters
$η_{b a t . c}$ $, η_{b a t . d}$	Charging and discharging efficiencies of the battery
$η_{e l e}$	Electricity-to-hydrogen conversion efficiency of the electrolyzer
$η_{h f c}$	Hydrogen-to-electricity conversion efficiency of the fuel cell
$η_{s}$	Hydrogen compression storage efficiency
$η_{c}$	Hydrogen decompression efficiency
a,b	Parameters of the battery aging model
$ϵ$	Difference between the starting state and the ending state
i	Discount rate
λ	Margin factor
T	Length of optimization time horizon
$N_{l}$	Technical lifetime
$H_{e l e}$	Designed service life of the electrolyzer
$H_{h f c}$	Designed service life of the fuel cell
$τ_{m i n}$	Required minimum charging/discharging duration of the battery system
$E_{H_{2}}$	Lower heating value (LHV) of hydrogen
$N_{s w i t c h, m a x}$	Total number of start–stop cycles
Binary variables
$u_{e l e}$	Binary variable for the status of the electrolyzer: 1 on, 0 off
$u_{h f c}$	Binary variable for the status of the fuel cell: 1 on, 0 off
$u_{b a t . c}$	Binary variable for the charging state of the battery: 1 on, 0 off
$u_{b a t . d}$	Binary variable for the discharging state of the battery: 1 on, 0 off
Variables related to reinforcement learning and neural networks
$f_{p r o f i l e}$	Feature vector of the net-load profile
$x_{n o r m a l i z e d}$	Normalized value of the current best capacity configuration
$h_{h i s t o r y}$	Normalized value of the historical information features
$P D_{t h r e s h o l d}$	Power deviation threshold
$ϵ_{1}$	Numerical stability term
$R_{b o n u s}$	Base improvement reward
$α$	Improvement amplification coefficient
$p$	Superlinear exponent
$R_{c a p}$	Reward cap
$γ$	Penalty coefficient for no improvement
$β$	Penalty coefficient for PD violation
$s_{c}$ $, s_{w}$	Normalization scale parameters
$π_{θ}$	Optimal actor policy
θ	Parameters of the actor network
$C^{t o t a l} (s, a)$	Expected total system cost
$μ$	Policy network
$θ_{μ}$	Parameter set of the policy function
$α_{μ}$	Actor network learning rate
$μ_{θ_{μ}} (s)$	Action output by the actor network under state s
$\nabla_{θ_{μ}}$	Gradient operator with respect to $θ_{μ}$
$Q_{i}$	i-th critic network’s Q-function (action–value function)
$θ_{Q_{i}}$	Parameters of the critic network
$α_{Q}$	Critic learning rate
$\nabla_{θ_{Q_{i}}}$	Gradient operator with respect to $θ_{Q_{i}}$
$L_{i}$	Loss function of the critic network
$r_{t}$	Immediate reward
$γ$	Discount factor
$\min_{i = 1, 2} Q_{i}^{'}$	Minimum of the two Critic network outputs
$μ^{'} (s_{t + 1})$	Action generated by the target actor network at the next state $s_{t + 1}$
$noise$	Random noise added to the target action
$τ$	Soft update parameter
$θ_{Q_{i}}^{'}$	Parameters of the i-th target critic network
$θ_{μ}^{'}$	Parameters of the target actor network
$p_{i}$	Priority of sample i
$α_{p r i o r i t y}$	Priority sampling hyperparameter
$δ_{i}^{(j)}$	Temporal difference error
$w_{i}$	Importance sampling weight
$β_{p r i o r i t y}$	Importance sampling hyperparameter

References

Zhou, Z.; Ma, Z.; Mu, T. Hybrid energy storage capacity optimization based on VMD-SG and improved Firehawk optimization. Electr. Power Syst. Res. 2025, 239, 111218.
Lu, Q.; Yang, Y.; Chen, J.; Liu, Y.; Liu, N.; Cao, F. Capacity optimization of hybrid energy storage systems for offshore wind power volatility smoothing. Energy Rep. 2023, 9, 575–583.
Wu, X.; Shang, W.; Feng, G.; Huang, B.; Xiong, X. Coordinated control algorithm of hydrogen production-battery based hybrid energy storage system for suppressing fluctuation of PV power. Int. J. Hydrogen Energy 2024, 88, 931–944.
El-Ghazaly, M.; Abdel-Salam, M.; Nayel, M.; Hashem, M. Techno-economic utilization of hybrid optimized gravity-supercapacitor energy-storage system for enriching the stability of grid-connected renewable energy sources. J. Energy Storage 2025, 107, 115002.
Elkholy, M.; Schwarz, S.; Aziz, M. Advancing renewable energy: Strategic modeling and optimization of flywheel and hydrogen-based energy system. J. Energy Storage 2024, 101, 113771.
Hu, S.; Yang, H.; Ding, S.; Tian, Z.; Guo, B.; Chen, H.; Yang, F.; Xu, N. Model simulation and multi-objective capacity optimization of wind power coupled hybrid energy storage system. Energy 2025, 319, 134887.
Al-Quraan, A.; Athamnah, I. Economic tri-level control-based sizing and energy management optimization for efficiency maximization of stand-alone HRES. Energy Convers. Manag. 2024, 302, 118140.
Guven, A.F.; Abdelaziz, A.Y.; Samy, M.M.; Barakat, S. Optimizing energy dynamics: A comprehensive analysis of hybrid energy storage systems integrating battery banks and supercapacitors. Energy Convers. Manag. 2024, 312, 118560.
Li, B.; Wang, H.; Tan, Z. Capacity optimization of hybrid energy storage system for flexible islanded microgrid based on real-time price-based demand response. Int. J. Electr. Power Energy Syst. 2022, 136, 107581.
Wang, J.; Deng, H.; Qi, X. Cost-based site and capacity optimization of multi-energy storage system in the regional integrated energy networks. Energy 2022, 261, 125240.
Hu, Y.; Yang, B.; Wu, P.; Wang, X.; Li, J.; Huang, Y.; Su, R.; He, G.; Yang, J.; Su, S.; et al. Optimal planning of electric-heating integrated energy system in low-carbon park with energy storage system. J. Energy Storage 2024, 99, 113327.
Rowe, K.; Mokryani, G.; Cooke, K.; Campean, F.; Chambers, T. Bi-level optimal sizing, siting and operation of utility-scale multi-energy storage system to reduce power losses with peer-to-peer trading in an electricity/heat/gas integrated network. J. Energy Storage 2024, 83, 110738.
Wang, Y.; Zhang, Y.; Xue, L.; Liu, C.; Song, F.; Sun, Y.; Liu, Y.; Che, B. Research on planning optimization of integrated energy system based on the differential features of hybrid energy storage system. J. Energy Storage 2022, 55, 105368.
Li, X.; Li, M.; Habibi, M.; Najaafi, N.; Safarpour, H. Optimization of hybrid energy management system based on high-energy solid-state lithium batteries and reversible fuel cells. Energy 2023, 283, 128454.
Ye, Y.; Xu, B.; Wang, H.; Zhang, J.; Lawler, B.; Ayalew, B. Deep reinforcement learning-based energy management system enhancement using digital twin for electric vehicles. Energy 2024, 312, 133384.
Wu, Y.; Huang, Z.; Li, D.; Li, H.; Peng, J.; Guerrero, J.M.; Song, Z. Integrated battery thermal and energy management for electric vehicles with hybrid energy storage system: A hierarchical approach. Energy Convers. Manag. 2024, 317, 118853.
Barelli, L.; Bidini, G.; Ciupageanu, D.A.; Pelosi, D. Integrating hybrid energy storage system on a wind generator to enhance grid safety and stability: A levelized cost of electricity analysis. J. Energy Storage 2021, 34, 102050.
Roy, P.; Liao, Y.; He, J. Economic dispatch for grid-connected wind power with battery-supercapacitor hybrid energy storage system. IEEE Trans. Ind. Appl. 2023, 59, 1118–1128.
Bharatee, A.; Ray, P.K.; Ghosh, A. A power management scheme for grid-connected PV integrated with hybrid energy storage system. J. Mod. Power Syst. Clean. Energy 2022, 10, 954–963.
Wu, X.; Liu, L.; Wu, Y.; Luo, C.; Tang, Z.; Kerekes, T. Near-optimal energy management strategy for a grid-forming PV and hybrid energy storage system. IEEE Trans. Smart Grid 2025, 16, 1422–1433.
Dsouza, O.D.; Shilpa, G.; Rajnikanth; Irusapparajan, G. Optimized energy management for hybrid renewable energy sources with hybrid energy storage: An SMO-KNN approach. J. Energy Storage 2024, 96, 112152.
Sathishkumar, R.; Venkateswaran, M.; Deepamangai, P.; Rajan, P.S. An efficient power management control strategy for grid-independent hybrid renewable energy systems with hybrid energy storage: Hybrid approach. J. Energy Storage 2024, 96, 112685.
Adam, A.H.A.; Chen, J.; Kamel, S.; Safaraliev, M.; Matrenin, P. Power management and control of hybrid renewable energy systems with integrated diesel generators for remote areas. Int. J. Hydrogen Energy 2024, 89, 320–341.
Manandhar, U.; Zhang, X.; Beng, G.H.; Subramanian, L.; Lu, H.H.C.; Fernando, T. Enhanced energy management system for isolated microgrid with diesel generators, renewable generation, and energy storages. Appl. Energy 2023, 350, 121624.
Behera, P.K.; Pattnaik, M. Supervisory power management scheme of a laboratory scale wind-PV based LVDC microgrid integrated with hybrid energy storage system. IEEE Trans. Ind. Appl. 2024, 60, 4723–4735. [Google Scholar] [CrossRef]
Ramu, S.K.; Vairavasundaram, I.; Palaniyappan, B.; Bragadeshwaran, A.; Aljafari, B. Enhanced energy management of DC microgrid: Artificial neural networks-driven hybrid energy storage system with integration of bidirectional DC-DC converter. J. Energy Storage 2024, 88, 111562. [Google Scholar] [CrossRef]
Nkwanyana, T.B.; Siti, M.W.; Wang, Z.; Mulumba, W. Hybrid energy storage lifespan optimization based on an enhanced fuel-cell degradation model and meta-heuristic algorithm. Energy Rep. 2024, 12, 5712–5727. [Google Scholar] [CrossRef]
Duong, H.-N.; Tran, L.; Vu, T.; Vo-Duy, T.; Nguyễn, B.-H. A global optimal benchmark for energy management of microgrid (GoBuG) integrating hybrid energy storage system. IEEE Trans. Smart Grid 2024, 15, 5429–5440. [Google Scholar] [CrossRef]
Zhang, K.; Zou, G.; Zhang, J.; Li, H.; Sun, Y.; Li, G. Microgrid energy management strategy considering source-load forecast error. Int. J. Electr. Power Energy Syst. 2025, 164, 110372. [Google Scholar] [CrossRef]
Tang, Y.; Xun, Q.; Liserre, M.; Yang, H. Energy management of electric-hydrogen hybrid energy storage systems in photovoltaic microgrids. Int. J. Hydrogen Energy 2024, 80, 1–10. [Google Scholar] [CrossRef]
Sepehrzad, R.; Moridi, A.R.; Hassanzadeh, M.E.; Seifi, A.R. Intelligent energy management and multi-objective power distribution control in hybrid micro-grids based on the advanced fuzzy-PSO method. ISA Trans. 2021, 112, 199–213. [Google Scholar] [CrossRef]
Chekira, O.; Boujoudar, Y.; El Moussaoui, H.; Boharb, A.; Lamhamdi, T.; El Markhi, H. An improved microgrid energy management system based on hybrid energy storage system using ANN NARMA-L2 controller. J. Energy Storage 2024, 98, 113096. [Google Scholar] [CrossRef]
Wang, J.; Lyu, C.; Bai, Y.; Yang, K.; Song, Z.; Meng, J. Optimal scheduling strategy for hybrid energy storage systems of battery and flywheel combined multi-stress battery degradation model. J. Energy Storage 2024, 99, 113208. [Google Scholar] [CrossRef]
Elkholy, M.H.; Senjyu, T.; Metwally, H.; Farahat, M.; Irshad, A.S.; Hemeida, A.M.; Lotfy, M.E. A resilient and intelligent multi-objective energy management for a hydrogen-battery hybrid energy storage system based on MFO technique. Renew. Energy 2024, 222, 119768. [Google Scholar] [CrossRef]
Yang, C.; Li, X.; Chen, L.; Mei, S. Intra-day and seasonal peak shaving oriented operation strategies for electric–hydrogen hybrid energy storage in isolated energy systems. Sustainability 2024, 16, 7010. [Google Scholar] [CrossRef]
Han, F.; Zeng, J.; Lin, J.; Gao, C. Multi-stage distributionally robust optimization for hybrid energy storage in regional integrated energy system considering robustness and nonanticipativity. Energy 2023, 277, 127729. [Google Scholar] [CrossRef]
Shan, J.; Lu, R. Multi-objective economic optimization scheduling of CCHP micro-grid based on improved bee colony algorithm considering the selection of hybrid energy storage system. Energy Rep. 2021, 7, 326–341. [Google Scholar] [CrossRef]
Deng, J.; Wang, X.; Chen, T.; Meng, F. An energy router based on multi-hybrid energy storage system with energy coordinated management strategy in island operation mode. Renew. Energy 2023, 212, 274–284. [Google Scholar] [CrossRef]
Pang, B.; Zhu, H.; Tong, Y.; Dong, Z. Optimal design and control of battery-ultracapacitor hybrid energy storage system for BEV operating at extreme temperatures. J. Energy Storage 2024, 101, 113963. [Google Scholar] [CrossRef]
Li, M.; Wang, L.; Wang, Y.; Chen, Z. Sizing optimization and energy management strategy for hybrid energy storage system using multiobjective optimization and random forests. IEEE Trans. Power Electron. 2021, 36, 11421–11430. [Google Scholar] [CrossRef]
Xu, F.; Li, X.; Jin, C. Optimal capacity configuration and dynamic pricing strategy of a shared hybrid hydrogen energy storage system for integrated energy system alliance: A bi-level programming approach. Int. J. Hydrogen Energy 2024, 69, 331–346. [Google Scholar] [CrossRef]
He, Y.; Guo, S.; Zhou, J.; Song, G.; Kurban, A.; Wang, H. The multi-stage framework for optimal sizing and operation of hybrid electrical-thermal energy storage system. Energy 2022, 245, 123248. [Google Scholar] [CrossRef]
Li, C.; Zhang, X. Optimal sizing of hybrid energy storage system under multiple typical conditions of sources and loads. Int. J. Sustain. Energy 2025, 44, 2439298. [Google Scholar] [CrossRef]
Tsao, Y.-C.; Banyupramesta, I.G.A.; Lu, J.-C. Optimal operation and capacity sizing for a sustainable shared energy storage system with solar power and hydropower generator. J. Energy Storage 2025, 110, 115173. [Google Scholar] [CrossRef]
Wang, G.; Blondeau, J. Optimal combination of daily and seasonal energy storage using battery and hydrogen production to increase the self-sufficiency of local energy communities. J. Energy Storage 2024, 92, 112206. [Google Scholar] [CrossRef]
Yang, H.; Chu, Y.; Ma, Y.; Zhang, D. Operation strategy and optimization configuration of hybrid energy storage system for enhancing cycle life. J. Energy Storage 2024, 95, 112560. [Google Scholar] [CrossRef]
Li, H.; Sun, D.; Li, B.; Wang, X.; Zhao, Y.; Wei, M.; Dang, X. Collaborative optimization of VRB-PS hybrid energy storage system for large-scale wind power grid integration. Energy 2023, 265, 126292. [Google Scholar] [CrossRef]
Liang, Z.; Chung, C.Y.; Wang, Q.; Chen, H.; Yang, H.; Wu, C. Fortifying renewable-dominant hybrid microgrids: A bi-directional converter-based interconnection planning approach. Engineering 2025, 51, 130–143. [Google Scholar] [CrossRef]
Lou, Q.; Li, Y.; Li, Z.; Han, L.; Xu, Y.; Yi, Z. Multi-stage planning approach for distribution network considering long-term variations in load and renewable energy. Energies 2025, 18, 152. [Google Scholar] [CrossRef]
Huang, B.; Zhao, T.; Yue, M.; Wang, J. Bi-level adaptive storage expansion strategy for microgrids using deep reinforcement learning. IEEE Trans. Smart Grid 2023, 15, 1362–1375. [Google Scholar] [CrossRef]
Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018. [Google Scholar]

Figure 1. Schematic diagram of a hybrid energy storage system.

Figure 2. Overall flowchart of the proposed algorithm.

Figure 3. The daily photovoltaic output curve.

Figure 4. The daily wind power output curve.

Figure 5. The daily load curve.

Figure 6. The daily net load curve.

Figure 7. The training curve of deep reinforcement learning in the proposed algorithm.

Figure 8. The HESS coordinated operation strategy under typical high penetration conditions: (a) Power profile over the typical day. A positive net load value indicates a system power deficit, while a negative value indicates a system power surplus; (b) Evolution of the battery state of charge (SoC) and the hydrogen storage tank state (SoH).

Figure 9. The HESS coordinated operation strategy under typical medium penetration conditions: (a) Power profile over the typical day; (b) Evolution of the battery state of charge (SoC) and the hydrogen storage tank state (SoH).

Figure 10. The HESS coordinated operation strategy under typical low penetration conditions: (a) Power profile over the typical day; (b) Evolution of the battery state of charge (SoC) and the hydrogen storage tank state (SoH).

Table 1. Literature Review on Co-Optimization of Capacity and Operation.

Reference	Co-Optimization Module	Solution Methods	Whether Scenario Uncertainty Is Considered	Whether Energy Storage Degradation Is Considered
[39]	Capacity level: nonlinear programming (NLP); Operation level: dynamic programming model (DP)	Capacity level: multi-start space-reduction algorithm; Operation level: dynamic programming algorithm; Interactive iteration between two levels	No	Yes, only battery
[40]	Capacity level: multi-objective NLP; Operation level: DP	Capacity level: multi-objective grey wolf optimization algorithm; Operation level: dynamic programming algorithm; Interactive iteration between two levels	No	Yes, only battery
[41]	Capacity level: NLP; Operation level: mixed-integer linear programming (MILP)	Capacity level: improved PSO–GA hybrid algorithm; Operation level: commercial solver; Interactive iteration between two levels	Partially considered by typical day data based on clustering	No
[42]	Capacity level: NLP; Operation level: MILP	Capacity level: PSO algorithm; Operation level: commercial solver; Interactive iteration between two levels	Partially considered by typical day data based on clustering	Yes, only battery
[43]	Capacity level: multi-objective NLP; Operation level: MILP	Capacity level: NSGA-III algorithm; Operation level: commercial solver; Interactive iteration between two levels	Partially considered by typical day data based on clustering	No
[44]	Capacity level: NLP; Operation level: mixed-integer nonlinear programming (MINLP)	Capacity level: sequential quadratic programming algorithm; Operation level: commercial solver; Interactive iteration between two levels	No	No
[45]	Capacity level: multi-objective NLP; Operation level: MILP	Capacity level: ε-constraint multi-objective optimization algorithm; Operation level: commercial solver; Interactive iteration between two levels	No	No
[46]	Capacity level: NLP; Operation level: MINLP	Transformed into a single-level MILP and solved by a commercial solver	No	Yes, only battery
[47]	Capacity level: NLP; Operation level: MINLP	Transformed into a single-level MILP and solved by a commercial solver	No	No

Table 2. The main economic and technical parameters of the HESS.

Component	Economic/Technical Parameter	Value	Unit
EL (Electrolyzer)	$C_{inv.ele}$	786	USD/kW
	$\eta_{ele}$	0.8	- ¹
	$f_{ele}$	0.4	-
	$H_{ele}$	70,000	Hour
HFC (Fuel Cell)	$C_{inv.hfc}$	286	USD/kW
	$\eta_{hfc}$	0.6	-
	$f_{hfc}$	0.3	-
	$H_{hfc}$	30,000	Hour
HT (Hydrogen Tank)	$C_{inv.HT}$	1143	USD/kg
	$SOH_{HT}^{\min}$	0	-
	$SOH_{HT}^{\max}$	1	-
	$\eta_{s}$	0.97	-
	$\eta_{c}$	0.98	-
Battery	$C_{inv.bat_p}$	429	USD/kW
	$C_{inv.bat_c}$	357	USD/kWh
	$\eta_{bat.c}$	0.98	-
	$\eta_{bat.d}$	0.98	-
	$SoC_{bat}^{\min}$	0.1	-
	$SoC_{bat}^{\max}$	0.9	-
	$RS_{bat}$	10	USD/kWh
HESS (Hybrid Energy Storage System)	$N_{l}$	20	Year
	$i$	0.08	-
Other	$E_{H_{2}}$	33.33	kWh/kg

¹ The symbol “-” indicates no unit.
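
As a reading aid for Table 2, the following minimal sketch derives two quantities implied by the listed parameters: the electricity-to-electricity round-trip efficiency of each storage path, and a capital recovery factor computed from $N_{l}$ and $i$. The capital recovery expression is the standard annuity formula and is an assumption here, since this part of the article does not show how investment is annualized. The variable names are illustrative, and the hydrogen path leaves out the tank efficiencies $\eta_{s}$ and $\eta_{c}$ because their exact role is not defined in this table.

```python
# Parameters copied from Table 2.
eta_ele, eta_hfc = 0.8, 0.6          # electrolyzer / fuel cell efficiency
eta_bat_c, eta_bat_d = 0.98, 0.98    # battery charge / discharge efficiency
n_years, rate = 20, 0.08             # HESS lifetime N_l and discount rate i

# Electricity-to-electricity efficiency of each path (tank terms omitted).
rt_hydrogen = eta_ele * eta_hfc
rt_battery = eta_bat_c * eta_bat_d

# Standard capital recovery factor (assumed, not taken from the article).
growth = (1 + rate) ** n_years
crf = rate * growth / (growth - 1)

print(f"hydrogen round trip: {rt_hydrogen:.2f}")
print(f"battery round trip:  {rt_battery:.4f}")
print(f"capital recovery factor: {crf:.4f}")
```

The gap between the two round-trip efficiencies is one way to see why the battery suits frequent short cycling while hydrogen suits bulk energy shifting, in line with the complementary characteristics described in the abstract.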

Table 3. The optimal configuration results of HESS based on the proposed method.

Decision Variable	Optimized Result	Unit
$P_{ele}^{rated}$	312.23	kW
$P_{hfc}^{rated}$	173.26	kW
$Q_{HT}$	225.90	kg
$P_{bat}^{rated}$	71.60	kW
$C_{bat}$	174.68	kWh
Minimum daily total cost	209.10	USD

Table 4. Performance comparison of different optimization methods and energy storage configurations.

Case	Method/Configuration	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)	Computation Time (s)
Case 1	DRL + Gurobi	312.23	173.26	225.90	71.60	174.68	209.10	1.3
Case 2	GA + Gurobi	262.88	93.19	112.77	133.42	460.72	219.34	250
Case 2	PSO + Gurobi	279.96	115.35	146.30	104.32	353.56	211.87	225
Case 2	Gurobi	309.16	173.33	220.43	74.39	186.89	208.73	1800
Case 3	Battery-only	-	-	-	383.57	3297.74	473.35	-
Case 3	Hydrogen-only	383.57	238.93	76.25	-	-	140.19	-
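
The relative figures quoted in the abstract (cost lower by 4.67% and 1.31%, runtime at 0.52%, 0.58% and 0.07% of the baselines) can be checked directly against the Case 1 and Case 2 rows of Table 4. The short sketch below does that arithmetic; the dictionary layout and the `compare` helper are illustrative names, not part of the article.

```python
# Total daily cost (USD) and computation time (s) copied from Table 4.
results = {
    "DRL + Gurobi": (209.10, 1.3),
    "GA + Gurobi": (219.34, 250.0),
    "PSO + Gurobi": (211.87, 225.0),
    "Gurobi": (208.73, 1800.0),
}

def compare(baseline, proposed="DRL + Gurobi"):
    """Return (cost reduction in %, runtime as % of the baseline runtime)."""
    cost_p, time_p = results[proposed]
    cost_b, time_b = results[baseline]
    cost_reduction = 100.0 * (cost_b - cost_p) / cost_b
    runtime_share = 100.0 * time_p / time_b
    return cost_reduction, runtime_share

for name in ("GA + Gurobi", "PSO + Gurobi", "Gurobi"):
    reduction, share = compare(name)
    # A negative reduction means the proposed method is slightly more expensive.
    print(f"{name}: cost reduction {reduction:.2f}%, runtime {share:.2f}% of baseline")
```

Against the direct Gurobi solve the cost reduction comes out slightly negative, which matches the abstract's wording that the cost "remains comparable" rather than lower.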

Table 5. The sensitivity analysis of electrolyzer power cost on HESS optimal configuration and minimum daily total cost.

$C_{inv . ele}$ (USD/kW)	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)
550	319.07	172.27	232.94	64.53	158.19	186.71
629	274.13	109.31	132.36	109.37	386	198.5
707	307.63	156.85	217.02	75.94	194.1	203.3
786	312.23	173.26	225.9	71.6	174.68	209.1
864	270.75	106.74	122.61	112.82	406.87	218.51
943	266.8	102.38	109.69	116.77	435.96	225.15
1021	255.13	93.96	72.45	128.44	521.15	231.62
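
To make the sensitivity in Table 5 easier to read, the sketch below expresses each row as a percentage change relative to the baseline row (786 USD/kW, 209.10 USD). It only rearranges the tabulated values; the variable names are illustrative, and the choice of the 786 USD/kW row as baseline is taken from its agreement with Tables 2 and 3.

```python
# (C_inv.ele in USD/kW, C_total in USD) pairs copied from Table 5.
rows = [(550, 186.71), (629, 198.50), (707, 203.30), (786, 209.10),
        (864, 218.51), (943, 225.15), (1021, 231.62)]
base_price, base_cost = 786, 209.10  # baseline row

for price, cost in rows:
    d_price = 100.0 * (price - base_price) / base_price
    d_cost = 100.0 * (cost - base_cost) / base_cost
    print(f"{price:>5} USD/kW: price {d_price:+6.1f}%, daily cost {d_cost:+6.2f}%")

# Cost change across the full sweep, relative to the baseline cost.
spread = 100.0 * (rows[-1][1] - rows[0][1]) / base_cost
print(f"cost spread over the sweep: {spread:.2f}% of baseline")
```

The same loop can be reused for Tables 6 to 9 by swapping in the corresponding parameter and cost columns.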

Table 6. The sensitivity analysis of fuel cell power cost on HESS optimal configuration and minimum daily total cost.

$C_{inv . hfc}$ (USD/kW)	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)
200	304.59	167.7	210.58	81.9	208.87	205.98
229	317.57	175.64	230.91	68.55	163.37	207.27
257	317.28	173.78	231.34	68.11	162.31	207.79
286	312.23	173.26	225.9	71.6	174.68	209.1
314	318.06	173.25	231.93	65.71	160.49	210.95
343	280.19	108.37	154.28	103.38	338.23	213.84
371	273.75	109.99	132.11	109.82	385.45	215.13

Table 7. The sensitivity analysis of hydrogen storage tank cost on HESS optimal configuration and minimum daily total cost.

$C_{inv . HT}$ (USD/kg)	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)
800	334.53	181.63	255.82	49.54	120.08	187.76
914	320.07	175.68	234.14	63.54	155.56	194.59
1029	286.16	115.47	164.88	96.73	327.47	206.44
1143	312.23	173.26	225.9	71.6	174.68	209.1
1257	253.81	102.13	65.9	129.77	534.64	215.21
1371	251.5	90.13	59.94	132.91	548.25	216.69
1486	254.86	95.91	57.52	128.72	553.54	219.14

Table 8. The sensitivity analysis of lithium battery power cost on HESS optimal configuration and minimum daily total cost.

$C_{inv . bat_p}$ (USD/kW)	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)
300	305.77	169.31	212.97	80.29	203.19	207.81
343	306.89	171.54	215.44	77.49	197.67	208.86
386	306.34	171.57	215.45	78.51	199.52	209.82
429	312.23	173.26	225.9	71.6	174.68	209.1
471	273.55	125.53	131.58	110.09	386.62	213.42
514	314.97	180.85	259.79	57.69	143.77	216.62
557	278.22	146.66	131.77	107.87	386.84	217.57

Table 9. The sensitivity analysis of lithium battery capacity cost on HESS optimal configuration and minimum daily total cost.

$C_{inv . bat_c}$ (USD/kWh)	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)
250	249.66	88.36	53.83	134.04	561.86	179.71
286	244.15	99.73	53.04	146.54	611.26	198.23
321	264.4	99.8	101.19	119.17	455.11	203.01
357	312.23	173.26	225.9	71.6	174.68	209.1
393	285.23	150.21	168.27	98.33	304.06	217.79
429	311.65	168.48	225.19	77.5	177.39	218.42
464	325.33	174.73	241.01	58.71	142.63	220.08

Table 10. The optimal HESS configuration and performance under different renewable energy penetration scenarios.

Renewable Energy Penetration Scenario	$P_{ele}^{rated}$ (kW)	$P_{hfc}^{rated}$ (kW)	$Q_{HT}$ (kg)	$P_{bat}^{rated}$ (kW)	$C_{bat}$ (kWh)	$C^{total}$ (USD)
High	312.23	173.26	225.9	71.6	174.68	209.1
Medium	318.37	212.37	51.83	117.86	372.66	203.48
Low	123.18	290.64	1083.62	69.76	327.83	470.96

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Qian, T.; Zhang, K.; Shi, D.; Zhang, L. Co-Optimization of Capacity and Operation for Battery-Hydrogen Hybrid Energy Storage Systems Based on Deep Reinforcement Learning and Mixed Integer Programming. Energies 2025, 18, 5638. https://doi.org/10.3390/en18215638


