The following case studies are arranged as a deliberate progression of ascending complexity. The initial case establishes a foundational scenario, and each subsequent study introduces additional layers or variables that build directly on the previous ones, so the amount of inherent uncertainty to be managed increases systematically with each level. The case studies are presented with an integrated structure that combines methodology and results; this format was chosen to clearly associate the specific procedures used in each case with the findings they produced. Accordingly, each case study begins with a description of the methodology, followed directly by an analysis of the resulting data.
3.1. Ideal Storage Devices with Convex Cost Functions
To initiate our comparative analysis, we first address the classical problem of managing an ideal grid-connected energy storage device. This scenario, extensively explored in the prior literature using traditional optimal control methods, for instance in [15,17], serves as a crucial baseline. In this study, we develop a systematic approach to this problem using reinforcement learning methods, which operate with incomplete information regarding the system's model dynamics. A key objective is to demonstrate and quantify the effect of this missing knowledge on the operational results, in contrast with traditional optimal control methods that presume complete, deterministic knowledge of both the system model and all relevant signals, such as load and generation profiles. This initial, simplified case is fundamental for understanding the performance degradation that may occur when relying on data-driven approaches under ideal conditions where optimal control is expected to excel.
In this context, we consider a system comprising a grid-connected storage device and a photovoltaic (PV) panel, as illustrated in Figure 2. This configuration involves the storage device being charged from both the main electrical grid and the local PV generation system. Subsequently, the stored energy, along with direct generation, is used to supply an aggregated electrical load, which is characterized by its active power consumption. The active power demand of this load, denoted here as $p_d(t)$, is modeled as a continuous positive function over a finite and known time interval $[0, T]$. Complementing this, the renewable energy generated by the PV system is represented by $p_{pv}(t)$, a piece-wise continuous and semi-positive function reflecting the available solar power. The net power consumption of the load is then defined as $p_L(t) = p_d(t) - p_{pv}(t)$. The charging or discharging power of the storage unit, $p_b(t)$, is determined by the balance between the power drawn from the grid, $p_g(t)$, and the net load demand, such that $p_b(t) = p_g(t) - p_L(t)$. Correspondingly, $E_g(t)$, $E_L(t)$, and $E_b(t)$ represent the cumulative generated energy from the grid, the cumulative energy consumed by the load, and the energy stored in the battery, respectively. These energy quantities are derived by integrating their respective power functions over time, formally $E(t) = \int_0^t p(\tau)\,\mathrm{d}\tau$. For this foundational scenario, complexities such as internal storage losses or transmission line losses are intentionally omitted to establish an unambiguous performance benchmark for an ideal system.
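To make the power and energy bookkeeping concrete, the short sketch below computes the net load, the implied storage power, and the cumulative energies by numerical integration; the function name, sampling step, and rectangle-rule integration are illustrative assumptions, not the original implementation.

```python
import numpy as np

# Illustrative sketch of the ideal (lossless) power/energy bookkeeping.
# p_d: load demand, p_pv: PV generation, p_g: grid power, all sampled on a
# uniform grid of step dt over [0, T]; names and sampling are assumptions.
def energy_balance(p_d, p_pv, p_g, dt):
    p_L = p_d - p_pv           # net load demand p_L(t) = p_d(t) - p_pv(t)
    p_b = p_g - p_L            # storage power   p_b(t) = p_g(t) - p_L(t)
    # cumulative energies E(t) = integral of p(tau) d tau (rectangle rule)
    E_g = np.cumsum(p_g) * dt
    E_L = np.cumsum(p_L) * dt
    E_b = np.cumsum(p_b) * dt  # stored energy; must remain within [0, B_max]
    return p_L, p_b, E_g, E_L, E_b
```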
The power drawn from or supplied to the grid, $p_g(t)$, is the primary controllable variable in this system. Its usage is characterized by a fuel consumption function, $f(p_g)$, which quantifies the cost, for example fuel consumed or monetary cost, associated with generating power $p_g$ at any time $t$. This cost function can also be interpreted more broadly to represent other objectives, such as minimizing carbon emissions or other environmental impacts linked to power generation. A critical assumption, following the original work [15], is that this function is twice differentiable and strictly convex, i.e., $f''(p_g) > 0$. This convexity is significant because it ensures that any local minimum is also the global minimum, simplifying the optimization process. The overall objective is to minimize the total cost of fuel consumption over the entire period, $\int_0^T f(p_g(t))\,\mathrm{d}t$. This leads to an optimization problem, sketched below, whose constraints ensure energy balance, respect the storage capacity limits $0 \le E_b(t) \le B_{\max}$, and define the operational cycle of the storage.
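A minimal sketch of this formulation in the notation introduced above is given next; the boundary condition $E_b(0) = E_b(T)$ and the exact constraint set are our assumptions and may differ from the original problem statement.

\[
\begin{aligned}
\min_{p_g(\cdot)} \quad & \int_{0}^{T} f\big(p_g(t)\big)\,\mathrm{d}t \\
\text{s.t.} \quad & \dot{E}_b(t) = p_b(t) = p_g(t) - p_L(t), \\
& 0 \le E_b(t) \le B_{\max}, \qquad t \in [0, T], \\
& E_b(0) = E_b(T).
\end{aligned}
\]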
An optimal solution to this problem, under the assumption of complete knowledge, can be efficiently found using the Shortest-Path (SP) method introduced in [15]. This method posits that the optimal trajectory of generated energy, $E_g(t)$, corresponds to a minimal-cost path. This path is constrained to lie between the cumulative energy demand $E_L(t)$ and the maximum possible stored energy level $E_L(t) + B_{\max}$. The identification of this path is achieved by applying Dijkstra's algorithm to a specially constructed graph that represents the feasible energy states and transitions over time. To construct this graph, the continuous time horizon $[0, T]$ is discretized into $N$ uniform intervals of duration $\Delta = T/N$, such that $t_k = k\Delta$ for $k = 0, 1, \dots, N$. At each discrete time step $t_k$, the permissible energy levels $E_{g,k}$ are bounded by $E_L(t_k) \le E_{g,k} \le E_L(t_k) + B_{\max}$. The search graph $G$ is then defined such that each vertex signifies a discretized energy level $E_{g,k}$ at time $t_k$, and each edge represents a feasible transition from an energy state $E_{g,k}$ at time $t_k$ to another state $E_{g,k+1}$ at the subsequent time step $t_{k+1}$. These transitions are governed by feasible charging/discharging rates, which must lie within $[-B_{\max}, B_{\max}]$ per interval (assuming $B_{\max}$ can also represent the maximum charge/discharge capability within a $\Delta$ interval). The power drawn from the generator for such a transition is $p_g = (E_{g,k+1} - E_{g,k})/\Delta$. The weight assigned to each edge is determined by the cost function, $f\big((E_{g,k+1} - E_{g,k})/\Delta\big)\,\Delta$, leveraging the strict convexity of $f(\cdot)$. Due to the continuous nature of the energy variable, a uniform discretization with resolution $\delta$ is applied in the energy space, yielding a finite set of allowable energy levels at each time step. This results in a layered graph structure, where layers correspond to the time steps $t_0, \dots, t_N$, and nodes within each layer represent feasible energy states at that step. Dijkstra's algorithm then systematically explores paths starting from the initial node at $t_0$, expanding paths of incrementally increasing cumulative cost until it reaches a terminal node at $t_N = T$ that satisfies the end condition on the final stored energy. The path identified with the overall minimum cumulative cost dictates the optimal energy dispatch trajectory for the storage system. This method's efficacy is fundamentally tied to the availability of complete prior knowledge of system dynamics and load signals.
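The following sketch illustrates the layered-graph search described above; the discretization parameters, cost function, feasibility check, and treatment of the terminal condition are illustrative assumptions rather than the authors' implementation.

```python
import heapq
import numpy as np

def shortest_path_dispatch(E_L, B_max, delta, dt, f):
    """Minimal-cost trajectory of generated energy E_g over a layered graph.

    E_L   : cumulative load energy at t_0..t_N (N+1 values)
    B_max : storage capacity (also used here as the per-interval transfer limit)
    delta : energy discretization resolution
    dt    : interval duration Delta
    f     : strictly convex generation cost f(p_g)
    """
    N = len(E_L) - 1
    # allowable E_g levels per layer: E_L(t_k) <= E_g <= E_L(t_k) + B_max
    layers = [np.arange(E_L[k], E_L[k] + B_max + 1e-9, delta) for k in range(N + 1)]
    dist = {(0, 0): 0.0}               # start from E_g(0) = E_L(0)
    parent = {}
    pq = [(0.0, 0, 0)]                 # (cumulative cost, layer k, level index)
    while pq:
        cost, k, i = heapq.heappop(pq)
        if cost > dist.get((k, i), np.inf):
            continue                   # stale queue entry
        if k == N:
            continue                   # terminal layer: no outgoing edges
        Eg_k = layers[k][i]
        for j, Eg_next in enumerate(layers[k + 1]):
            p_g = (Eg_next - Eg_k) / dt                  # generator power on this edge
            p_b = p_g - (E_L[k + 1] - E_L[k]) / dt       # implied storage power
            if abs(p_b) > B_max / dt:                    # per-interval transfer limit
                continue
            new_cost = cost + f(p_g) * dt                # edge weight from convex cost f
            if new_cost < dist.get((k + 1, j), np.inf):
                dist[(k + 1, j)] = new_cost
                parent[(k + 1, j)] = (k, i)
                heapq.heappush(pq, (new_cost, k + 1, j))
    # The end condition on the final stored energy would restrict the admissible
    # terminal nodes; here we simply take the cheapest reachable terminal node.
    end = min((dist[(N, j)], j) for j in range(len(layers[N])) if (N, j) in dist)[1]
    path, node = [], (N, end)
    while node in parent:
        path.append(layers[node[0]][node[1]])
        node = parent[node]
    path.append(layers[0][0])
    return list(reversed(path)), dist[(N, end)]
```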
In contrast, the reinforcement learning formulation approaches the problem by replacing the assumption of a deterministic and fully known load function with a generative or learned model of the environment. The environment is conceptualized as an MDP, characterized by a continuous state space $\mathcal{S}$ and a continuous action space $\mathcal{A}$. At each discrete time step $k$, the state of the system is defined by a tuple $s_k = (E_k, h_k, p_{L,k})$, where $E_k$ is the state of charge of the battery, $h_k$ indicates the current hour of the day, providing temporal context, and $p_{L,k}$ is the net load demand at that time. The action $a_k$, taken by the RL agent, determines the amount of energy to be charged into or discharged from the battery; it is constrained by $0 \le a_k \le a_{\max}$ for charging, with similar bounds for discharging, where $a_{\max}$ represents the energy-transfer limit, and also by the battery's maximum capacity $B_{\max}$. Following an action $a_k$, the system transitions deterministically, in this ideal, lossless model, to a new state $s_{k+1}$, and the agent receives a scalar reward $r_k$. This reward is defined as the negative of the generation cost, $r_k = -f(p_{g,k})$, where $p_{g,k} = p_{L,k} + a_k/\Delta$ is the grid power required to serve the net load and deliver the commanded energy transfer. This interaction process unfolds over the specified time horizon until the terminal time $T$ is reached, yielding a complete trajectory of states, actions, and rewards and, thus, a cumulative reward or cost. The core objective for the RL agent is to learn an optimal policy that minimizes the expected cumulative operational cost over time. This learning occurs through trial and error, without explicit programming of the optimal strategy, making the problem suitable for model-free approaches in which the agent directly optimizes its policy based on environmental interactions.
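As an illustration of this MDP, the sketch below implements a minimal Gymnasium-style environment for the ideal, lossless case; the observation layout and bounds follow the description above, but the class name, default parameters, and quadratic placeholder cost are our own assumptions.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class IdealStorageEnv(gym.Env):
    """Ideal, lossless grid-connected storage MDP (illustrative sketch)."""

    def __init__(self, net_load, B_max=1.0, a_max=0.5, dt=1.0, cost=lambda p: p ** 2):
        super().__init__()
        self.net_load = np.asarray(net_load)  # net load p_L for each hour of the day
        self.B_max, self.a_max, self.dt, self.cost = B_max, a_max, dt, cost
        # state: (state of charge E_k, hour of day h_k, net load p_Lk)
        self.observation_space = spaces.Box(
            low=np.array([0.0, 0.0, -np.inf], dtype=np.float32),
            high=np.array([B_max, 24.0, np.inf], dtype=np.float32))
        # action: energy charged (>0) or discharged (<0) during the step
        self.action_space = spaces.Box(low=-a_max, high=a_max, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.k, self.E = 0, 0.0
        return self._obs(), {}

    def _obs(self):
        return np.array([self.E, float(self.k), self.net_load[self.k]], dtype=np.float32)

    def step(self, action):
        a = float(np.clip(action[0], -self.a_max, self.a_max))
        a = float(np.clip(a, -self.E, self.B_max - self.E))   # capacity limits
        p_g = self.net_load[self.k] + a / self.dt             # grid power: load + transfer
        reward = -self.cost(p_g) * self.dt                    # negative generation cost
        self.E += a                                           # lossless SoC update
        self.k += 1
        terminated = self.k >= len(self.net_load)
        obs = self._obs() if not terminated else np.zeros(3, dtype=np.float32)
        return obs, reward, terminated, False, {}
```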
To empirically evaluate and compare these methodologies, the optimal control policy derived from the Shortest-Path method was implemented in MATLAB R2024a, while three distinct model-free RL algorithms, namely Soft Actor–Critic (SAC), Proximal Policy Optimization (PPO), and Twin Delayed DDPG (TD3), were implemented in Python 3.9 using the Stable-Baselines library, version 2.7.1a3. The corresponding code files are accessible via a public Git repository [20]. The operational task is to determine a policy that specifies the generation actions, and consequently induces charging or discharging of the storage unit, over a 24 h interval, setting $T = 24$ h for this study. The selection of these model-free RL approaches was driven by their widespread use in various control applications, particularly for energy storage, owing to their general applicability and relative ease of adaptation to new or poorly defined domains where system dynamics are not explicitly modeled.
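For reference, a minimal training loop with the Stable-Baselines3 API might look as follows; the net-load profile, hyperparameters, and timestep budget are placeholders, not the values used in the study.

```python
import numpy as np
from stable_baselines3 import PPO, SAC, TD3

# hypothetical daily net-load profile (24 hourly values) for this sketch
net_load = np.clip(np.sin(np.linspace(0, 2 * np.pi, 24)) + 1.0, 0.0, None)
env = IdealStorageEnv(net_load)  # the environment sketched above

agents = {
    "ppo": PPO("MlpPolicy", env, verbose=0),
    "sac": SAC("MlpPolicy", env, verbose=0),
    "td3": TD3("MlpPolicy", env, verbose=0),
}
for name, agent in agents.items():
    agent.learn(total_timesteps=50_000)  # placeholder training budget
    agent.save(f"{name}_ideal_storage")
```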
Results: The results from this first scenario, presented in Figure 3, are consistent with the initial expectations. None of the examined RL methods succeeded in learning an effective operational policy when benchmarked against the optimal path determined by the classical Shortest-Path algorithm. An intriguing secondary observation was the similarity between the policies learned by the PPO and TD3 methods. This is noteworthy because PPO employs a stochastic policy search, whereas TD3 relies on a deterministic policy. This convergence to similar policies, despite the algorithmic differences, offers several insights into the underlying structure of this specific control problem.
Firstly, if two different days exhibit similar patterns of energy demand and PV production, and if their initial storage conditions are close, they are likely to result in similar operational policies. This consistency can be attributed to the fact that the value function the RL agent seeks to optimize is bounded and continuous, so small changes in input conditions lead to correspondingly small changes in output values or actions. Secondly, the inherent dynamics governing the interactions among the system's components (the storage device, PV generation, grid power, and load demand) are assumed to be time invariant in this idealized model; the fundamental relationships and operational characteristics of the system do not change over the simulation period, providing a stable learning environment. Thirdly, the reward function, defined as the negative of the strictly convex generation cost function $f(\cdot)$, is concave, so the resulting optimization landscape contains no spurious local optima; different algorithms, even if they explore the solution space differently, are therefore more likely to be guided towards the same global optimum or a similar region of high performance. Finally, the reward function may exhibit symmetries with respect to certain actions or states, thereby guiding different algorithms towards similar policies. For instance, consider two distinct scenarios: in one (state $s$), high PV production fully meets the current demand, but high future demand is predicted, prompting a decision to charge the storage with 1 [p.u.] drawn from the grid; in another (state $s'$), low PV production and an empty storage unit necessitate the purchase of 1 [p.u.] from the grid merely to satisfy the current demand. Despite these being entirely different operational situations, they could yield the same immediate reward (cost). If such reward equivalences are common, RL algorithms might develop policies that are relatively insensitive to the nuanced differences between these states, causing disparate algorithms like PPO and TD3 to converge towards similar, possibly generalized, strategies. This insensitivity, if it leads to overlooking critical state distinctions, could also contribute to the observed suboptimal performance of the RL agents compared to the fully informed optimal control solution. The daily costs over a 100-day test set, shown in Figure 4 and Table 1, further quantify this performance gap, with the RL methods incurring substantially higher mean costs (in USD/p.u.) than the optimal control approach based on the SP algorithm; the specific values are reported in Table 1. Moreover, the model-free RL approaches exhibited greater variance than the SP method. The cost deviations depicted in Figure 5 highlight the tight concentration of the SP costs around their mean, in comparison to the markedly wider spread of the RL algorithms, which is especially evident in the PPO and TD3 results.
3.2. Grid-Connected Lossy Storage Devices
To continue the comparative analysis and move towards more realistic operational conditions, we now introduce a model with more intricate dynamics by incorporating a lossy storage device, as detailed in [16]. This progression from an ideal to a lossy model inherently increases the complexity and the amount of uncertainty that both traditional optimal control and reinforcement learning algorithms must handle. Energy losses during charging and discharging, as well as self-decay, introduce a new layer of challenge in optimizing the storage operation for minimal cost.
To formally incorporate these battery losses, we define an efficiency function $\eta(p_b)$ as a piece-wise continuous function. This function characterizes the efficiency of the storage device and depends on the power flowing into or out of it, $p_b(t)$: it takes the value of the charging efficiency $\eta_c$ while the device charges, the discharging efficiency $\eta_d$ while it discharges, and a decay factor $\eta_0$ that accounts for inherent energy decay or self-discharge over time. Furthermore, we assess a generalized non-affine dynamic model for the battery's state of charge, $E_b(t)$, in which the rate of change of stored energy is governed by the power flow scaled by the state-dependent efficiency. This formulation leads to an optimization problem, sketched below, in which $f(\cdot)$ is the primary cost function accounting for generation cost (the power flowing bidirectionally from the battery, $p_b(t)$, influences the generated power $p_g(t)$), and an auxiliary function $P(E_b)$ introduces a penalty for violating the storage capacity constraints, scaled by a large penalty coefficient $M$. By setting $M$ to a sufficiently large value, we implicitly enforce that the stored energy $E_b(t)$ remains within the operational bounds $0 \le E_b(t) \le B_{\max}$, as deviations would incur extremely high costs. Solving for the optimal energy storage trajectory $E_b^*(t)$ and the corresponding power flow $p_b^*(t)$ allows for the determination of the grid power $p_g(t)$ and the generated energy $E_g(t)$ through the system's power balance equations. For this scenario, Pontryagin's minimum principle (PMP) is employed as the classical optimal control method to find an analytical solution.
Consequently, the explicit analytical solution derived using PMP is described by the dynamics of the optimal state $E_b^*(t)$ and an associated Lagrange multiplier, the costate variable $\lambda(t)$. The optimal rate of change of stored energy follows the lossy storage dynamics, with boundary conditions imposed on the initial and final stored energy. The costate variable evolves according to a differential equation driven by $P'(E_b)$, the derivative of the penalty function. The optimal storage power $p_b^*(t)$ is then determined based on $\lambda(t)$, the net load $p_L(t)$, and the charging and discharging efficiencies, referred to as $\eta_c$ and $\eta_d$ in this context; a sketch of these optimality conditions is given below. This solution provides the benchmark optimal control policy under the assumption of known loss characteristics.
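For concreteness, the standard PMP conditions for a problem of this form, written under the notational assumptions above with Hamiltonian $H$ (the exact expressions in the original derivation may differ), read:

\[
H\big(E_b, p_b, \lambda, t\big) = f\big(p_L(t) + p_b\big) + P(E_b) + \lambda\,\eta(p_b)\,p_b,
\]
\[
\dot{\lambda}(t) = -\frac{\partial H}{\partial E_b} = -P'\big(E_b^*(t)\big),
\qquad
p_b^*(t) = \arg\min_{p_b}\, H\big(E_b^*(t), p_b, \lambda(t), t\big).
\]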
The RL formulation for this lossy storage case closely mirrors that of the ideal storage scenario, with the crucial difference lying in the definition of the MDP's state transition function. To accurately reflect the system's behavior, the transition function must now explicitly account for the energy losses inherent in the storage device. When the agent takes an action $a_k$, representing the energy intended to be charged or discharged before losses, the resulting next state $s_{k+1}$, whose new state of charge is $E_{k+1}$, is obtained by applying an effective efficiency term $\eta_k$ to the intended transition, under the assumption that the transition respects the battery's physical capacity limits, formally $0 \le E_{k+1} \le B_{\max}$. If $a_k \neq 0$, meaning active charging or discharging occurs, $\eta_k$ incorporates both the relevant charge or discharge efficiency, denoted by $\eta_c$ or $\eta_d$, respectively, and the self-decay component $\eta_0$. If $a_k = 0$, meaning no active charging or discharging takes place, $\eta_k$ simply reflects the self-decay over the time step $\Delta$. This modification ensures that the RL agent learns to operate the storage device while being subjected to realistic energy loss dynamics.
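A possible implementation of this lossy transition is sketched below; the exact way the charge, discharge, and self-decay efficiencies are combined, and the default efficiency values, are our assumptions.

```python
def lossy_transition(E_k, a_k, B_max, eta_c=0.95, eta_d=0.95, eta_0=0.999):
    """One-step state-of-charge update with charge/discharge losses and self-decay.

    E_k : current stored energy; a_k : intended energy transfer (+charge, -discharge).
    The specific combination of efficiencies below is an illustrative assumption.
    """
    E_decayed = eta_0 * E_k              # self-decay acts on the stored energy
    if a_k > 0:                          # charging: only part of a_k is stored
        E_next = E_decayed + eta_c * a_k
    elif a_k < 0:                        # discharging: more than |a_k| is drawn
        E_next = E_decayed + a_k / eta_d
    else:                                # idle: only self-decay applies
        E_next = E_decayed
    return min(max(E_next, 0.0), B_max)  # respect physical capacity limits
```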
Results: The experimental results for this scenario, presented in Figure 6, reveal a notable shift in comparative performance. While the classical PMP algorithm still achieved the lowest operational costs, the performance gap between the optimal control algorithm and some of the RL methods narrowed considerably. Specifically, the policies learned by PPO and TD3 were again observed to be quite similar, and their mean daily operational costs were significantly closer to the PMP benchmark than in the ideal storage case. As shown in Table 2, the PMP method achieved the lowest mean daily cost (in USD/p.u.), while both PPO and TD3 achieved mean costs only slightly higher than the PMP optimum. This suggests that when the uncertainty, in the form of power loss, is primarily confined to the internal dynamics of the storage device, these RL algorithms can learn the storage behavior quite effectively and achieve a substantial improvement in their relative cost performance, even without explicit prior knowledge of the loss model. The SAC algorithm, however, did not perform as well, with a noticeably higher mean cost, indicating potential struggles with this specific problem configuration or with hyperparameter tuning.
An interesting observation from Table 2 is that both the mean and the variance of the classical PMP algorithm's costs were lower than those achieved by the SP algorithm in the ideal case (compare Table 2 with Table 1). This might suggest that PMP, which is well suited to continuous-time optimal control problems, handles the continuous dynamics of the lossy storage model more adeptly than the Dijkstra-based SP method would if applied to a discretized version of this more complex problem. Furthermore, the mean costs of PPO and TD3 also decreased compared with their performance in the ideal case. This may be attributed to their inherent ability to navigate and learn within large and complex state spaces, which allowed them to capture some of the nuances of the lossy environment more effectively than in the simpler ideal environment, where their learning appeared less focused. However, it is also important to note that the introduction of model complexity in the form of losses led to an increase in the variance of the daily costs for all RL algorithms. Figure 7, which presents the daily costs, and Figure 8, which presents the deviations of the costs, visually confirm these trends, showing PPO and TD3 tracking much closer to the PMP optimum than in the first scenario, while SAC remains significantly higher. The cost deviations for PMP remained tightly centered around the mean, whereas SAC exhibited the largest spread, indicating higher variability.
3.3. Storage Devices Within a Transmission Grid with Losses
In this section, we advance our comparative analysis by eliminating an additional simplifying assumption. We now introduce and analyze a model that incorporates power losses occurring over the transmission lines situated between the power generation source and the consumer block, which includes the energy storage system. This represents the most complex scenario in our study, aiming to capture a more comprehensive set of uncertainties that real-world systems face. The system under consideration can be conceptualized as consisting of a primary power source, modeled as a synchronous generator, and a non-linear aggregated load, similarly to the illustration in
Figure 2. This non-linear load block comprises the photovoltaic generation unit, the energy storage device with a defined capacity $B_{\max}$, and the end-consumer load that consumes active power $p_d(t)$. A key modeling assumption here is that the components encapsulated within this non-linear load block (PV, storage, and consumer load) are located in close physical proximity to each other, rendering the energy transmission between these internal components effectively lossless. However, the transmission from the main synchronous generator to this aggregated load block is subject to losses.
To perform this analysis, we adopt a distributed circuit model perspective, specifically considering a short-length transmission line. The power generated at the source, $p_g(t)$, and the power received at the aggregated load and storage blocks, $p_b(t) + p_d(t)$, are related through a loss model, with $p_b(t)$ being the power flowing bidirectionally from the storage and $p_d(t)$ being the consumer load demand. The power generated by the source, $p_g(t)$, must be greater than $p_b(t) + p_d(t)$ to compensate for these losses. We approximate the generated power required as
\[
p_g(t) \approx p_b(t) + p_d(t) + \beta\big(p_b(t) + p_d(t)\big)^2 .
\]
Here, $p_b(t)$ is the power flowing into or out of the storage device. The quadratic term represents the transmission losses, where $\beta = R/|V|^2$ is a constant determined by the transmission line resistance $R$ and the square of the magnitude of the line voltage $|V|$. Such a quadratic relationship for transmission losses is a well-established approximation in the power systems literature, as discussed, for example, in [23]. This formulation leads to an optimization problem, sketched below, in which the cost function $f(\cdot)$ is applied to the generation at the source, i.e., to the total power that includes losses, and the efficiency of the storage device itself enters as in the previous lossy storage scenario, potentially including charge, discharge, and decay effects. For this complex problem, dynamic programming (DP) is chosen as the classical optimal control method.
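A sketch of this formulation, under the same notational assumptions as before, with $\eta(\cdot)$ the storage efficiency and $P(\cdot)$ the capacity penalty (the exact form in the original may differ), is:

\[
\min_{p_b(\cdot)}\; \int_{0}^{T} \Big[\, f\Big( p_b(t) + p_d(t) + \beta\big(p_b(t) + p_d(t)\big)^2 \Big) + P\big(E_b(t)\big) \Big]\,\mathrm{d}t,
\qquad
\dot{E}_b(t) = \eta\big(p_b(t)\big)\,p_b(t).
\]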
To apply DP, the problem is discretized. We define a time resolution $\Delta$, with discrete time steps $t_i = i\Delta$ for $i = 0, 1, \dots, N$. Accordingly, the energy stored at time $t_i$ is $E_i = E_b(t_i)$, and the power values are approximated as constant over each interval: $p_{b,i}$, $p_{d,i}$, and $p_{g,i}$. The discrete version of the optimization problem is obtained by replacing the integral cost with a sum over the $N$ intervals and the continuous dynamics with a per-interval energy update, in which $\eta_i$ represents the aggregated energy loss efficiency of the storage device for the $i$-th interval; a sketch of the resulting recursion is given below.
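The following dynamic-programming sketch illustrates this discrete formulation by backward induction over a discretized state-of-charge grid; the grid resolutions, the constant per-interval efficiency, and the cost function are illustrative assumptions rather than the study's actual parameters.

```python
import numpy as np

def dp_dispatch(p_d, B_max, dt, beta, f, eta_i, n_levels=51, n_actions=21):
    """Backward-induction DP over a discretized state-of-charge grid.

    p_d   : consumer load per interval (length N)
    beta  : transmission-loss coefficient R / |V|^2
    f     : convex generation cost f(p_g)
    eta_i : per-interval aggregated storage efficiency (assumed constant here)
    """
    N = len(p_d)
    E_grid = np.linspace(0.0, B_max, n_levels)                   # stored-energy levels
    p_b_grid = np.linspace(-B_max / dt, B_max / dt, n_actions)   # storage power options
    V = np.zeros(n_levels)                                       # terminal cost-to-go
    policy = np.zeros((N, n_levels), dtype=int)
    for i in range(N - 1, -1, -1):                               # backward induction
        V_next = V.copy()                                        # cost-to-go of stage i+1
        for s, E in enumerate(E_grid):
            best = np.inf
            for a, p_b in enumerate(p_b_grid):
                E_next = E + eta_i * p_b * dt                    # lossy storage update
                if E_next < 0.0 or E_next > B_max:
                    continue                                     # capacity constraint
                received = p_b + p_d[i]                          # power delivered to the block
                p_g = received + beta * received ** 2            # add quadratic line losses
                s_next = int(round(E_next / B_max * (n_levels - 1)))
                total = f(p_g) * dt + V_next[s_next]
                if total < best:
                    best, policy[i, s] = total, a
            V[s] = best                                          # cost-to-go of stage i
    return V, policy, E_grid, p_b_grid
```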
For the reinforcement learning solution in this scenario, the model must now account for transmission line losses in addition to the storage's internal losses. Consequently, the state transition function is further modified. When an action $a_k$ is taken from a state with storage energy $E_k$, the resulting next state with storage energy $E_{k+1}$ is determined by a two-stage loss application. First, the energy transfer is subject to transmission line losses, represented by an efficiency factor $\eta_{line}$, so the energy effectively reaching or leaving the storage block is $\eta_{line}\,a_k$. This amount is then subject to the storage's internal charge or discharge and decay efficiencies, collectively denoted by $\eta_k$, as in the previous case study. Thus, the new state of charge reflects the compounded application of $\eta_{line}$ and $\eta_k$ to the intended transfer; one plausible form is sketched below. This compounded loss model makes the environment significantly more challenging for the RL agent to learn and control optimally.
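Building on the lossy transition sketched earlier, a possible two-stage update is shown below; treating the line efficiency as a constant factor applied to the transferred energy is our simplification.

```python
def lossy_transition_with_line(E_k, a_k, B_max, eta_line=0.97, **storage_eff):
    """Two-stage loss model: transmission losses, then internal storage losses.

    The transferred energy is first scaled by the line efficiency eta_line,
    and the reduced transfer is then passed through the storage's own
    charge/discharge/decay model (lossy_transition defined above).
    """
    a_eff = eta_line * a_k  # energy effectively reaching or leaving the storage block
    return lossy_transition(E_k, a_eff, B_max, **storage_eff)
```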
Results: The experimental results for this third scenario, presented in Figure 9, demonstrate a further degradation in the performance of the RL algorithms relative to the DP optimal solution. As detailed in Table 3, the DP method achieved the lowest mean daily cost (in USD/p.u.). In contrast, PPO and SAC both incurred noticeably higher mean costs than DP, and TD3 performed notably worse, with a mean cost more than 3.5 times higher than that of the DP algorithm. These results underscore the difficulty RL methods face when dealing with external, system-level uncertainties such as transmission line losses, which are not part of the storage device's internal dynamics that the agent can more easily learn through interaction. The mean and variance of the classical DP algorithm also increased in this scenario compared with PMP in the lossy-storage-only case (compare Table 3 with Table 2), reflecting the added difficulty posed by the transmission losses even for the optimal control approach. For the RL algorithms, there was a general trend of improvement relative to their own baseline in the ideal case, but a degradation compared with the lossy-storage case for PPO and SAC, while TD3 remained poor. This suggests that while RL can handle some forms of uncertainty, the nature and location of that uncertainty, whether internal and captured by the model or external to the modeled MDP, significantly impacts its effectiveness.
A particularly interesting observation in this scenario was that the PPO and TD3 algorithms no longer produced similar policies, unlike in the previous two cases. The policy generated by the PPO algorithm, which employs a stochastic policy search, was observed to be smoother and more effective than that of the TD3 algorithm, which relies on a deterministic policy. This divergence, and the superior performance of PPO, further emphasizes the potential benefit of stochastic policies when navigating environments with highly unpredictable or complex dynamics, such as those introduced by combined storage and transmission losses. The deterministic policy of TD3 appeared more vulnerable to these compounded uncertainties, leading to significantly higher operational costs. The daily cost comparisons in Figure 10 and the cost deviation histograms in Figure 11 visually corroborate these findings, showing PPO and SAC performing relatively better than TD3, but all RL methods trailing the DP benchmark by a wider margin than in the previous scenario. The cost deviations for PPO and SAC remained comparatively concentrated around their means, while TD3 exhibited a very wide spread, confirming that its deterministic policy struggled significantly with the unpredictable dynamics of this scenario.
Figure 10 presents, for each algorithm, the daily cost of energy production over a test set of 100 days. The mean and variance for each algorithm are presented in Table 3. The results in the table show several trends. (a) The mean and variance of the classical algorithm have increased, which is caused by the additional uncertainty of the dynamic model. (b) For all RL algorithms, we see an improvement in performance relative to the ideal case, which highlights their effectiveness when operating in highly uncertain environments with stochastic transitions. (c) The PPO algorithm indeed achieves better results than TD3, which further emphasizes the effectiveness of stochastic policies for such unpredictable model dynamics. The deviation from the mean of the daily costs of each algorithm over these 100 test days is depicted in Figure 11.