Article

Uniqueness of Optimal Power Management Strategies for Energy Storage Dynamic Models

1
The Andrew and Erna Viterbi Faculty of Electrical and Computer Engineering, Technion—Israel Institute of Technology, Haifa 3200003, Israel
2
Department of Software Science, Tallinn University of Technology, Akadeemia tee 15a, 12618 Tallinn, Estonia
*
Author to whom correspondence should be addressed.
Energies 2025, 18(6), 1483; https://doi.org/10.3390/en18061483
Submission received: 18 February 2025 / Revised: 11 March 2025 / Accepted: 14 March 2025 / Published: 17 March 2025
(This article belongs to the Section D: Energy Storage and Application)

Abstract
This paper contributes to the field of analytic and semi-analytic solutions for optimal power flow problems involving storage systems. Its primary contribution is a rigorous proof of the uniqueness of the “shortest path” optimal solution, a key element in this class of algorithms, building on a previously introduced graphical design procedure. The proof is constructed through five successive lemmas, each of which establishes a distinct characteristic of the optimal solution. These characteristics are then combined to demonstrate the uniqueness of the optimal solution, which corresponds to the shortest path of generated energy within the defined bounds. This proof not only provides a solid theoretical foundation for this class of algorithms but also paves the way for developing analytic solutions to more complex optimal control problems involving storage. Furthermore, the efficacy of this unique solution is validated through two comparative tests. The first uses synthetic data to benchmark the proposed solution against recent reinforcement learning algorithms, including actor–critic, PPO, and TD3. The second compares the proposed solution with the optimal solutions obtained by other numerical methods on real-world data from an electric vehicle storage device.

1. Introduction

Energy storage devices are emerging as an essential component of modern power systems [1,2]. These devices can be used in many applications to increase efficiency, reduce operating costs, limit peak power generation, or increase stability. However, the optimal management of storage devices is often very challenging, since the optimal energy curve must be evaluated in the time domain and therefore always involves a high-dimensional solution space.
As indicated in [3], many methods and algorithms have been developed to solve storage system optimization problems. Linear programming approaches, for instance, are relatively simple and intuitive and require few computational resources, but they are limited to problems in which the objective function is linear and the constraints are linear equalities or inequalities. Work [4] aims to minimize the daily electricity cost of a campus central chiller plant, comprising electric chillers and thermal energy storage, by means of a hybrid optimization approach: mixed-integer linear programming determines the optimal chiller operation, while dynamic programming manages the thermal energy storage. Alternatively, work [5] investigates the optimization of an islanded microgrid that incorporates renewable energy sources, diesel generators, and battery energy storage systems through linear and mixed-integer programming. The linear programming formulation aims to minimize the diesel generator power output, whereas the mixed-integer formulation seeks to minimize the total operational cost; both formulations include battery state-of-charge constraints.
Dynamic programming is a widely used traditional optimization method. It divides a complex problem into a series of simpler subproblems, each with low computational overhead, and solves them recursively. For example, in the energy market category, work [6] is concerned with minimizing the average cost of electricity use and storage investment while satisfying demand in the dynamic pricing problem. The objective of [7] is to calculate the optimal control of a residential energy storage system (ESS), with and without local generation, in real-time pricing problems. To address uncertainties, several approaches are available for determining the optimal control. Model Predictive Control (MPC) is one dominant method for managing storage devices; another frames the problem as a Markov Decision Process (MDP) and solves it with stochastic dynamic programming. Unlike classical dynamic programming, which optimizes control policies for predetermined signals, stochastic dynamic programming optimizes control policies across a spectrum of potential signals. Consequently, rather than yielding a single optimal solution, the algorithm produces an optimal control strategy that adapts in real time according to actual measurements. In addition, robust control techniques, particularly sliding mode control and its higher-order variants, are gaining extensive attention. Sliding mode control is, in essence, a nonlinear control technique designed to force a system’s state to reach and remain on a predefined surface. First, a sliding surface, that is, a mathematical function that defines the desired system behavior, is formulated. Next, the control law is designed to drive the system state from any initial condition to the sliding surface. Once the system state reaches this surface, it remains there despite disturbances and follows reduced-order dynamics known as the sliding mode dynamics.
The system thus becomes insensitive to parameter variations and disturbances, which yields high robustness. For example, [8] introduces a super-twisting sliding mode control (STSMC) strategy for grid-side inverters (GSIs) in wind power generation systems to improve power delivery under unpredictable weather. The proposed method employs three integral-type STSMC controllers within a multi-loop GSI model to regulate active and reactive power exchange, mitigating chattering and improving robustness against parameter variations. Simulation and experimental results demonstrate that STSMC significantly improves the active/reactive power response and the stability of the three-phase voltage/current signals, with lower harmonic content than proportional-integral control. Similarly, work [9] proposes a universal controller for battery energy storage systems in microgrids that uses impedance shaping and sliding mode control to address issues with grid-feeding and grid-forming voltage source converters. The controller adapts to different operating modes, improving performance in weak grid connections and during fault ride-through transients, while also ensuring a smooth transition between grid-connected and islanded modes. The proposed control method was evaluated through simulations in MATLAB/Simulink R2024a. Other works apply stochastic control methods to grid-connected storage device control problems; for instance, work [6] aims to minimize the average cost of electricity use and storage investment while satisfying demand. Pontryagin’s Minimum Principle (PMP) is yet another technique often used for managing storage devices, and various works leverage this powerful mathematical principle. For example, among control problems for grid-connected devices, refs. [10,11] focus on optimizing the revenue from electricity generated in concentrated solar power plants.
In [12], the objective is to minimize the total expenses of a capacitor-type energy system. From a slightly different perspective, works [13,14] focus on managing and optimizing the power flows in the power system of a hybrid vehicle. The authors model the control problem as a constrained optimization problem and solve it using Pontryagin’s Minimum Principle. A multi-objective optimization problem is considered, aiming to minimize both the overall energy cost and the pollutant emissions for a given driving cycle. The model is tested in a simulated environment.
Furthermore, with the surge of machine learning tools and recent advances in computational power, there is an emerging trend of applying machine learning, specifically reinforcement learning (RL) algorithms, to storage management problems. For instance, ref. [15] introduces a deep reinforcement learning framework for optimizing battery storage, incorporating a lithium-ion battery degradation model and a noisy network architecture for effective exploration of the action space. In [16], Double Deep Q-Learning is used to optimize community battery storage in microgrids. Further related works include [17,18].
It is evident from the literature review that various optimization methods have been explored for managing energy storage systems, each with its own strengths and limitations. For instance, dynamic programming suffers from the “curse of dimensionality”, making it computationally expensive for large-scale applications. Pontryagin’s Minimum Principle provides an analytical framework for solving optimal control problems, yet it relies on precise system modeling, making it less flexible in handling uncertainties and stochastic variations in energy demand and generation. Additionally, reinforcement learning has gained popularity for its ability to adapt to unknown system dynamics, yet model-free RL struggles with long-term dependencies, constraint enforcement, and convergence stability, especially in highly variable environments.
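To make the “curse of dimensionality” remark concrete, the following sketch applies backward-recursion dynamic programming to a hypothetical toy version of the storage problem (it is an illustration only, not the method of any cited work). It assumes unit time steps, restricts the generated energy to integer levels between the load energy and the load energy plus the storage capacity, and charges a convex stage cost to the per-step generated power. The function name `dp_schedule` and its arguments are illustrative assumptions. With N steps and M = W_max + 1 levels per step, the recursion examines N·M² transitions, and each additional storage device multiplies the number of states per step by M.

```python
from typing import Callable, Dict, List, Tuple


def dp_schedule(w_load: List[int], w_max: int,
                cost: Callable[[int], float]) -> Tuple[float, List[int], int]:
    """Backward-recursion DP over integer energy levels (hypothetical toy model).

    State: generated energy W_g[k], restricted to W_L[k] <= W_g[k] <= W_L[k] + w_max.
    Stage cost: cost(W_g[k+1] - W_g[k]), i.e., F applied to the power of step k.
    Returns (optimal cost, optimal W_g trajectory, number of transitions examined).
    """
    n = len(w_load) - 1
    levels = [range(w, w + w_max + 1) for w in w_load]
    inf = float("inf")
    # value[k][m] = minimal cost-to-go from level m at step k
    value: List[Dict[int, float]] = [{m: inf for m in lv} for lv in levels]
    best_next: List[Dict[int, int]] = [{} for _ in levels]
    value[n][w_load[n]] = 0.0  # terminal condition W_g(T) = W_L(T)
    examined = 0
    for k in range(n - 1, -1, -1):          # backward in time
        for m in levels[k]:
            for nxt in levels[k + 1]:        # every admissible successor level
                examined += 1
                c = cost(nxt - m) + value[k + 1][nxt]
                if c < value[k][m]:
                    value[k][m], best_next[k][m] = c, nxt
    # Forward pass from the initial condition W_g(0) = W_L(0)
    path = [w_load[0]]
    for k in range(n):
        path.append(best_next[k][path[-1]])
    return value[0][w_load[0]], path, examined
```

For example, `dp_schedule([0, 1, 2, 5, 8, 9, 10], 2, lambda p: p * p)` examines 6 · 3² = 54 transitions; the same table for two independent devices would already need 3⁴ = 81 transitions per step.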
The shortest path method [19] offers a complementary approach that serves as an intuitive benchmark for evaluating energy storage optimization techniques. Unlike reinforcement learning, Pontryagin’s Minimum Principle, or dynamic programming, the shortest path method provides an approximate but interpretable solution based on a graphical representation of energy trajectories. While it does not incorporate detailed system constraints or real-world transients, its smooth trajectories help identify structural properties of optimal solutions, making it a valuable qualitative tool. By comparing advanced algorithms against the shortest path benchmark, researchers can pinpoint where additional complexity is necessary, guiding the development of optimization strategies that are more computationally efficient and more readily applicable in practice.
In this light, the current paper continues several recent works that study optimal power management for energy storage dynamic models. The original work [19] provides a graphical design procedure that reveals the optimal power flow solution in a system comprising a load and a network-connected storage device. Later works building on this result provide a deeper understanding and intuitive characteristics of the optimal solution in several cases, such as optimal peak shaving, scenarios with storage devices connected by a network, and an extension of the initial result to stochastic load profiles. However, the basic graphical design procedure presented in the initial work [19] has never been rigorously proven.
Therefore, the purpose of this paper is to bridge this gap by providing a mathematically complete proof of the central theorem in [19]. Moreover, this proof not only provides a sound theoretical foundation for this class of algorithms but may also help develop additional analytic solutions for more complex optimal control problems involving storage systems. The unique solution is validated via two distinct comparative tests. The first uses synthetic data to benchmark the proposed solution against recent reinforcement learning algorithms, including actor–critic, PPO, and TD3. The second compares the proposed solution with the optimal solutions obtained by other numerical methods on real-world data from an electric vehicle storage device. This comparative analysis shows how the shortest path method can be used to learn about the performance bounds of other, more advanced methods. As discussed in previous works, the shortest path method has minimal computational complexity, so it can serve as a benchmark for other methods, for instance, new methods that use reinforcement learning.
It is important to note that while “uniqueness” often implies both exclusivity and optimality, here we refer strictly to the mathematical uniqueness of the solution rather than to its real-world superiority over alternative methods. Although the shortest path method is proven to yield a unique solution within the given framework, this does not necessarily make it the most effective approach in practical applications, where additional constraints, uncertainties, and real-world imperfections may affect performance. The structure of the article is presented in Figure 1.

2. Main Result

The original problem explored in [19] is a continuous-time, finite-horizon optimization problem that may be concisely formulated as follows:
$$
\begin{aligned}
\underset{P_g}{\text{minimize}}\quad & \int_{0}^{T} F\big(P_g(\tau)\big)\,d\tau, \\
\text{subject to}\quad & W_g(t) = \int_{0}^{t} P_g(\tau)\,d\tau, \\
& W_L(t) \le W_g(t) \le W_L(t) + W_{\max}, \\
& W_g(0) = W_L(0), \qquad W_g(T) = W_L(T),
\end{aligned}
\tag{1}
$$
where $F(P_g)$ is a general convex cost function, $P_g(t)$ is the generated power, $W_L(t)$ is the energy consumption of the load (its derivative $P_L(t)$ is the load power), $W_g(t)$ is the generated energy, $[0,T]$ is the integration interval with a known horizon $T$, and $W_{\max}$ is the capacity of the storage device. The solution of this problem may be obtained by the “shortest path” method, as explained in [19]. Our main objective in this paper is to provide an analytical and rigorous proof of this result, as formulated in the following theorem:
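The “shortest path” idea can be sketched numerically on sampled data. The function below is a minimal sketch under stated assumptions, not necessarily the procedure of [19]: time is sampled with unit steps, the bounds are enforced only at the samples, and the path is built by a greedy “funnel” sweep, a standard way of computing a taut string between two bounds. From the current anchor it keeps the cone of feasible slopes and, when the cone collapses, bends at the bound vertex that limited it. The names `taut_string`, `lower`, `upper`, `start`, and `end` are hypothetical. Note that the sketch never takes a cost function $F$ as input, which mirrors the independence from $F$ asserted in Theorem 1.

```python
from typing import List, Sequence, Tuple


def taut_string(lower: Sequence[float], upper: Sequence[float],
                start: float, end: float) -> List[float]:
    """Shortest piecewise-linear path from (0, start) to (n, end) that satisfies
    lower[k] <= w[k] <= upper[k] at every sample k (unit time step assumed).

    Greedy funnel sweep: from the current anchor, track the cone of feasible
    slopes; when the cone collapses, bend at the vertex that bounded it.
    """
    n = len(lower) - 1
    knots: List[Tuple[int, float]] = [(0, start)]
    i, wi = 0, start
    while i < n:
        lo_s, hi_s = float("-inf"), float("inf")  # feasible slope cone
        lo_k = hi_k = i                            # vertices defining the cone
        for j in range(i + 1, n + 1):
            d = j - i
            if j == n:  # the end point is fixed: W_g(T) = W_L(T)
                s_end = (end - wi) / d
                if s_end > hi_s:
                    i, wi = hi_k, upper[hi_k]      # bend under an upper vertex
                elif s_end < lo_s:
                    i, wi = lo_k, lower[lo_k]      # bend over a lower vertex
                else:
                    i, wi = n, end                 # straight run to the end
                break
            s_l = (lower[j] - wi) / d
            s_u = (upper[j] - wi) / d
            if s_l > hi_s:   # lower bound rises above the cone
                i, wi = hi_k, upper[hi_k]
                break
            if s_u < lo_s:   # upper bound drops below the cone
                i, wi = lo_k, lower[lo_k]
                break
            if s_l > lo_s:
                lo_s, lo_k = s_l, j
            if s_u < hi_s:
                hi_s, hi_k = s_u, j
        knots.append((i, wi))
    # Linear interpolation between consecutive knots gives W_g at every sample.
    path: List[float] = []
    for (k0, w0), (k1, w1) in zip(knots, knots[1:]):
        for k in range(k0, k1):
            path.append(w0 + (w1 - w0) * (k - k0) / (k1 - k0))
    path.append(end)
    return path
```

For instance, with the sampled load energy `[0, 1, 2, 5, 8, 9, 10]` and a capacity of 2 (upper bound `w + 2` at each sample), the sketch returns the generated energy 0, 2, 4, 6, 8, 9, 10: constant power 2 while the path touches the upper bound at k = 2 and the lower bound at k = 4, then power 1 following the load.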
Theorem 1
(Unique Solution). Consider the optimization problem described in (1). If $F(\cdot)$ is convex, then this problem has a unique solution $W_g^*(t)$, and this solution is independent of $F(\cdot)$.
Proof. 
First, recall two lemmas from [19]:
Lemma 1
(Straight Flow (SF), [19]). If $[a,b] \subseteq [0,T]$ is such that $\forall t \in (a,b):\ W_L(t) < W_g(t) < W_L(t) + W_{\max}$, then the optimal generated power $P_g(t)$ is constant in $[a,b]$ (when the generated energy is not constrained, it is a straight line).
Lemma 2
(Tangent And Continuous (TC), [19]). The curve of the optimal generated energy must be tangent to its bounding constraints. In other words, whenever the generated energy meets a constraint, the generated power equals the load power. Formally, $W_g(t) = W_L(t) \Rightarrow P_g(t) = P_L(t)$ and $W_g(t) = W_L(t) + W_{\max} \Rightarrow P_g(t) = P_L(t)$. Moreover, the optimal generated power $P_g(t)$ is a continuous function.
The aforementioned lemmas can be interpreted as follows:
  • When not bounded, the generated energy is a straight line.
  • When bounded, the generated energy is tangent to the constraint.
In [19], these two lemmas were used to assert that only one feasible solution has both features, so that together the two features define the optimal solution. The lemmas describe what happens to the generated energy when it is bounded and when it is not. To complete the proof, we need to show that the times at which the generated energy is bounded are unique.
The outline of the proof is depicted in Figure 2. To prove the uniqueness of the generated energy, we present additional lemmas that serve as the basis of our proof. The first of these, Lemma 3, shows that the timeline can be split into segments. These intervals divide the power demand of the load into segments on which it is monotonically increasing or decreasing, as shown in Figure 3. Moreover, inside these intervals, the optimal solution can be tangent only to the corresponding constraint: the upper bound in an increasing interval and the lower bound in a decreasing interval. Whenever the solution touches a constraint, it stays on it for a finite interval of time. Hence, the optimal solution can be described by the monotone intervals (MIs) in which it is tangent to one of the constraints. As seen in Figure 2, Lemma 3 is based on the two lemmas from [19]. Finally, two further new lemmas, Lemmas 4 and 5, demonstrate that the times at which the optimal generated energy is bounded are unique. To prove this, it must be shown that the optimal generated energy is bounded inside a unique MI, which fixes these times uniquely. As presented in Figure 2, these lemmas rely on Lemma 3. Consequently, we combine Lemmas 4 and 5 to show the uniqueness of the optimal generated energy.
Let us set a new framework and present formal definitions of the various segments and points that characterize the load power consumption $P_L(t)$ and the optimal solution $W_g(t)$.
Definition 1
(Increasing Interval). A real-valued function $f(t): [0,T] \to \mathbb{R}$ has an increasing interval $[c,d] \subseteq [0,T]$ if
1.
The function is increasing in $[c,d]$:
$\forall x, y \in [c,d]:\ x \le y \Rightarrow f(x) \le f(y)$.
2.
If the point $t = c$ is not at the edge of the domain of the function, then the function is decreasing immediately to the left of $c$:
$c \ne 0 \Rightarrow \exists \delta_1 > 0,\ \forall t \in [c - \delta_1, c):\ f(t) > f(c)$.
3.
If the point $t = d$ is not at the edge of the domain of the function, then the function does not increase immediately after $d$ and drops below $f(d)$ somewhere in $(d, d + \delta_2]$:
$d \ne T \Rightarrow \exists \delta_2 > 0:\ \big(\forall x \in (d, d + \delta_2]:\ f(x) \le f(d)\big) \wedge \big(\exists y \in (d, d + \delta_2]:\ f(y) < f(d)\big)$.
Definition 2
(Decreasing Interval). A real-valued function $f(t): [0,T] \to \mathbb{R}$ has a decreasing interval $[c,d] \subseteq [0,T]$ if
1.
The function is decreasing in $[c,d]$:
$\forall x, y \in [c,d]:\ x \le y \Rightarrow f(x) \ge f(y)$.
2.
If the point $t = c$ is not at the edge of the domain of the function, then the function is increasing immediately to the left of $c$:
$c \ne 0 \Rightarrow \exists \delta_1 > 0,\ \forall t \in [c - \delta_1, c):\ f(t) < f(c)$.
3.
If the point $t = d$ is not at the edge of the domain of the function, then the function does not decrease immediately after $d$ and rises above $f(d)$ somewhere in $(d, d + \delta_2]$:
$d \ne T \Rightarrow \exists \delta_2 > 0:\ \big(\forall x \in (d, d + \delta_2]:\ f(x) \ge f(d)\big) \wedge \big(\exists y \in (d, d + \delta_2]:\ f(y) > f(d)\big)$.
An illustration of those definitions can be seen in Figure 3.
Definition 3
(Monotone Interval). A real-valued function $f(t): [0,T] \to \mathbb{R}$ has a monotone interval $[c,d] \subseteq [0,T]$ if $[c,d]$ is either a decreasing or an increasing interval of $f(t)$.
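Definitions 1–3 can be mimicked on sampled load data. The helper below is a hypothetical, simplified discrete analogue (not taken from the paper): it splits samples of $P_L$ into maximal non-decreasing (`"inc"`) and non-increasing (`"dec"`) runs, keeps flat steps in the current run, and lets adjacent runs share their turning sample. It does not reproduce the one-sided conditions in items 2 and 3 exactly; the name `monotone_intervals` is an assumption.

```python
from typing import List, Sequence, Tuple


def monotone_intervals(p: Sequence[float]) -> List[Tuple[int, int, str]]:
    """Split samples p[0..n] into maximal monotone runs (start, end, kind).

    Flat steps stay in the current run; a run closes at the last sample
    before a step in the opposite direction, which also opens the next run.
    """
    runs: List[Tuple[int, int, str]] = []
    start = 0
    direction = 0  # +1 increasing, -1 decreasing, 0 not yet known
    for k in range(len(p) - 1):
        step = p[k + 1] - p[k]
        sign = (step > 0) - (step < 0)
        if sign == 0:
            continue  # flat step: keep it in the current run
        if direction == 0:
            direction = sign
        elif sign != direction:
            runs.append((start, k, "inc" if direction > 0 else "dec"))
            start, direction = k, sign
    runs.append((start, len(p) - 1, "inc" if direction >= 0 else "dec"))
    return runs
```

For the sampled load power `[1, 1, 3, 3, 1, 1]`, the helper returns `[(0, 3, "inc"), (3, 5, "dec")]`: one increasing interval followed by one decreasing interval that share the turning sample 3.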
Definition 4
(Union Point). Continuous real-valued functions $g(t), f(t): [0,T] \to \mathbb{R}$ have a union point at $t_m \in [0,T]$ if:
1.
$\exists \delta_1 > 0:\ \forall t \in [t_m - \delta_1, t_m),\ f(t) \ne g(t)$,
2.
$\exists \delta_2 > 0:\ \forall t \in [t_m, t_m + \delta_2],\ f(t) = g(t)$.
Definition 5
(Separation Point). Continuous real-valued functions $g(t), f(t): [0,T] \to \mathbb{R}$ have a separation point at $t_s \in [0,T]$ if
1.
$\exists \delta_1 > 0:\ \forall t \in [t_s - \delta_1, t_s],\ f(t) = g(t)$,
2.
$\exists \delta_2 > 0:\ \forall t \in (t_s, t_s + \delta_2],\ f(t) \ne g(t)$.
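On sampled data, Definitions 4 and 5 reduce to detecting the first and last samples of each run on which a trajectory coincides with a bound. The helper below is a hypothetical discrete analogue (the name `contact_events` and the tolerance `tol` are assumptions): a union point is the first on-bound sample preceded by an off-bound sample, and a separation point is the last on-bound sample followed by an off-bound sample. A single-sample touch is reported as both, which is a discretization artifact; Definition 4 requires equality over an interval.

```python
from typing import List, Sequence, Tuple


def contact_events(w_g: Sequence[float], bound: Sequence[float],
                   tol: float = 1e-9) -> Tuple[List[int], List[int]]:
    """Return (union_points, separation_points) as sample indices."""
    on = [abs(a - b) <= tol for a, b in zip(w_g, bound)]
    unions: List[int] = []
    separations: List[int] = []
    for k in range(1, len(on)):
        if on[k] and not on[k - 1]:
            unions.append(k)            # trajectory joins the bound at k
        elif on[k - 1] and not on[k]:
            separations.append(k - 1)   # trajectory leaves the bound after k - 1
    return unions, separations
```

For the sampled trajectory `[0, 2, 4, 6, 8, 9, 10]` against the lower bound `[0, 1, 2, 5, 8, 9, 10]`, the helper reports a separation point at sample 0 and a union point at sample 4.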

2.1. The Third Lemma

Lemma 3
(Up Increasing, Down Decreasing (UIDD)). Let $W_g(t)$ be an optimal generated energy for (1).
If $\exists t_0: W_g(t_0) = W_L(t_0) + W_{\max}$, then $\exists t_1, t_2 \in [0,T]$ with $t_1 < t_2$ such that $t_0 \in [t_1, t_2]$, $\forall t \in [t_1, t_2]: W_g(t) = W_L(t) + W_{\max}$, and $\exists [a,b)$ such that $[t_1, t_2] \subseteq [a,b)$ and $[a,b)$ is an increasing interval of $P_L(t)$.
Similarly, if $\exists t_3: W_g(t_3) = W_L(t_3)$, then $\exists t_4, t_5 \in [0,T]$ with $t_4 < t_5$ such that $t_3 \in [t_4, t_5]$, $\forall t \in [t_4, t_5]: W_g(t) = W_L(t)$, and $\exists [c,d]$ such that $[t_4, t_5] \subseteq [c,d]$ and $[c,d]$ is a decreasing interval of $P_L(t)$.
Proof. 
First, let us prove that $\nexists t_0 \in [0,T]$ such that $t_0$ belongs to an increasing interval of $P_L(t)$ and $W_g(t_0) = W_L(t_0)$.
Assume, by contradiction, that there exist an increasing interval $[a,b] \subseteq [0,T]$ of $P_L(t)$ and a time $t_0 \in [a,b]$ such that an optimal generated energy $W_g(t)$ satisfies $W_g(t_0) = W_L(t_0)$. We establish two opposing facts about the behavior of $W_g(t)$ and reach a contradiction.
$$\forall t \in [a,b]:\ W_g(t) = W_L(t). \tag{2}$$
Without loss of generality, assume by contradiction that $\exists t_1 \in [a,b]$ such that $t_1 > t_0$ and $W_g(t_1) > W_L(t_1)$. Because $P_L(t)$ is increasing on $[t_0, t_1] \subseteq [a,b]$, $W_L(t)$ is convex on this section, and therefore the line connecting $W_L(t_0) = W_g(t_0)$ and $W_g(t_1)$ lies above $W_L(t)$. Since $W_g(t)$ and $W_L(t)$ are antiderivatives, they are continuous, and therefore $(W_g - W_L)(t)$ is continuous. By the intermediate value theorem, $\exists t_{sep} \in [t_0, t_1]$ such that
$$\forall t \in [t_0, t_{sep}]:\ W_g(t) - W_L(t) = 0, \qquad \exists \delta > 0 \ \text{such that}\ \forall\, 0 < h < \delta:\ 0 < W_g(t_{sep} + h) - W_L(t_{sep} + h). \tag{3}$$
Since $\frac{d}{dt}\big(W_g(t) - W_L(t)\big) = P_g(t) - P_L(t)$ is continuous (by the TC lemma), and therefore bounded on $[a,b]$, we can also choose $\delta > 0$ such that
$$\forall\, 0 < h < \delta:\ 0 < (W_g - W_L)(t_{sep} + h) < W_{\max}. \tag{4}$$
So, we can conclude that
$$W_g(t_{sep}) = W_L(t_{sep}), \qquad \forall h \in (0, \delta):\ W_L(t_{sep} + h) < W_g(t_{sep} + h) < W_L(t_{sep} + h) + W_{\max}. \tag{5}$$
Therefore, by the SF lemma, $W_g(t)$ must be a straight line on the time segment $\left[t_{sep}, t_{sep} + \frac{\delta}{2}\right]$, and its slope satisfies
$$\forall t \in \left[t_{sep}, t_{sep} + \tfrac{\delta}{2}\right]:\ P_g(t) = \frac{2\left(W_g\!\left(t_{sep} + \frac{\delta}{2}\right) - W_g(t_{sep})\right)}{\delta} \triangleq p_{g0}. \tag{6}$$
From the TC lemma, we also know that $W_g(t_{sep}) = W_L(t_{sep}) \Rightarrow P_g(t_{sep}) = P_L(t_{sep})$; however,
$$
\begin{aligned}
p_{g0} \cdot \frac{\delta}{2} &= \int_{t_{sep}}^{t_{sep} + \delta/2} P_g(t)\,dt = W_g\!\left(t_{sep} + \tfrac{\delta}{2}\right) - W_g(t_{sep}) \\
&> W_L\!\left(t_{sep} + \tfrac{\delta}{2}\right) - W_L(t_{sep}) = \int_{t_{sep}}^{t_{sep} + \delta/2} P_L(t)\,dt
\;\Rightarrow\; p_{g0} > \frac{2}{\delta} \int_{t_{sep}}^{t_{sep} + \delta/2} P_L(t)\,dt.
\end{aligned}
\tag{7}
$$
Since $P_L(t)$ is increasing on $\left[t_{sep}, t_{sep} + \frac{\delta}{2}\right] \subseteq [a,b]$, its average over this segment is at least $P_L(t_{sep})$, which means $P_g(t_{sep}) = p_{g0} > P_L(t_{sep})$. This contradicts the TC lemma, which proves that (2) holds. Next, we explicitly define a time $t_{big} \in [a,b]$ at which $W_g(t_{big}) > W_L(t_{big})$. Notice that the load power $P_L(t)$ is increasing on $[a,b]$, so $W_L(t)$ is convex there. This means that for all $t_s, t_e \in [t_0, b]$ with $t_s < t_e$, the line connecting $W_L(t_s)$ and $W_L(t_e)$ lies above $W_L(t)$, and it is strictly above $W_L(t)$ for some $t_s < t_e$. Because $P_L(t)$ is continuous on a compact interval, it is bounded. So,
$$\exists M \in \mathbb{R}:\ \forall t \in [0,T],\ |P_L(t)| < M. \tag{8}$$
Therefore, if we choose $t_s, t_e$ such that $t_e - t_s < W_{\max}/M$, the line connecting $W_L(t_s)$ and $W_L(t_e)$ is also strictly below $W_L(t) + W_{\max}$. Hence, according to the SF lemma, if $W_g(t_s) = W_L(t_s)$ and $W_g(t_e) = W_L(t_e)$, then $W_g(t)$ must be a straight line between these times. For that reason, taking $t_{big} = (t_s + t_e)/2$ yields
$$W_g(t_{big}) = W_g\!\left(\frac{t_s + t_e}{2}\right) > W_L\!\left(\frac{t_s + t_e}{2}\right) = W_L(t_{big}). \tag{9}$$
Notice that (2) and (9) contradict each other; therefore, $\nexists t_0 \in [a,b]$ such that $W_g(t_0) = W_L(t_0)$. An illustration of this argument is shown in Figure 4.
Note that the same argument applies when $W_g(t)$ equals the upper energy bound $W_L(t) + W_{\max}$ in a decreasing interval of $P_L(t)$, simply by replacing convexity with concavity. An illustration is shown in Figure 5.
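The chord argument behind (8) and (9) can be checked numerically on a hypothetical example (not from the paper). Assume the load power $P_L(t) = 1 + t$ on $[0, 2]$, so $W_L(t) = t + t^2/2$ is convex and any $M > 3$ bounds $|P_L|$, and assume $W_{\max} = 0.5$. A window of length 0.15 satisfies $t_e - t_s < W_{\max}/M$, and the chord over it stays strictly between $W_L$ and $W_L + W_{\max}$; its midpoint gap is $(t_e - t_s)^2/8 = 0.0028125$. Over the whole interval $[0, 2]$, which violates the window rule, the midpoint gap already reaches $W_{\max}$. The helper names are illustrative assumptions.

```python
from typing import Callable


def chord_gap(w: Callable[[float], float], t_s: float, t_e: float, t: float) -> float:
    """Height of the chord of w over [t_s, t_e] above w at time t."""
    lam = (t - t_s) / (t_e - t_s)
    return (1 - lam) * w(t_s) + lam * w(t_e) - w(t)


def chord_in_band(w: Callable[[float], float], t_s: float, t_e: float,
                  w_max: float, samples: int = 101) -> bool:
    """True if 0 < chord - w < w_max at every interior sample of (t_s, t_e)."""
    for i in range(1, samples - 1):
        t = t_s + (t_e - t_s) * i / (samples - 1)
        if not 0 < chord_gap(w, t_s, t_e, t) < w_max:
            return False
    return True


def w_load(t: float) -> float:
    # Hypothetical load energy with increasing power P_L(t) = 1 + t (convex W_L).
    return t + 0.5 * t * t
```

Here `chord_in_band(w_load, 1.0, 1.15, 0.5)` holds, matching the choice of $t_{big}$ at the window midpoint, whereas `chord_in_band(w_load, 0.0, 2.0, 0.5)` fails because the midpoint gap equals $0.5$.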
Lastly, let us confirm that when the generated energy $W_g(t)$ equals a constraint, it must equal it over a continuous time segment inside the monotone interval of $P_L(t)$. Without loss of generality, let $W_g(t_0) = W_L(t_0) + W_{\max}$ for some $t_0 \in [0,T]$. Using the fact that $\exists [a,b] \subseteq [0,T]$ such that $t_0 \in [a,b]$ and $[a,b]$ is an increasing interval of $P_L(t)$, let
$$A = \{\, t \in [a,b] \mid W_g(t) = W_L(t) + W_{\max} \,\}. \tag{10}$$
Next, we show that the set $A$ is a segment. Since $A$ is bounded, it has a supremum and an infimum. Since $W_g(t)$ and $W_L(t)$ are continuous, $A$ is closed, so its supremum and infimum belong to $A$. Let
$$\min(A) = c, \qquad \max(A) = d. \tag{11}$$
Assume, by contradiction, that $\exists t_1 \in [c,d]:\ W_g(t_1) \ne W_L(t_1) + W_{\max}$. Then,
$$W_g(c) = W_L(c) + W_{\max}, \qquad W_g(d) = W_L(d) + W_{\max}, \qquad W_g(t_1) < W_L(t_1) + W_{\max}. \tag{12}$$
Since $W_g(t)$ and $W_L(t)$ are continuous, let $t_l$ be the first separation point of $W_g(t)$ and $W_L(t) + W_{\max}$ after $t = c$. Two scenarios need to be addressed:
1.
If $\exists t_m \in [c,d]$ such that $t_m > t_l$ and $t_m$ is a union point of $W_g(t)$ and $W_L(t) + W_{\max}$ (take the first such point), then $W_g(t)$ lies strictly inside the bounds on $(t_l, t_m)$, so $P_g(t)$ is constant there by the SF lemma. Since $[c,d] \subseteq [a,b]$ lies in an increasing interval of $P_L(t)$, this yields
$$P_g(t_m) = P_g(t_l) = P_L(t_l) < P_L(t_m). \tag{13}$$
This contradicts the TC lemma at $t_m$. An illustration of this case can be seen in Figure 6.
2.
Otherwise, $\nexists t_m \in [c,d]$ such that $t_m > t_l$ and $t_m$ is a union point of $W_g(t)$ and $W_L(t) + W_{\max}$. But then, by the SF lemma, $P_g(t)$ is constant for $t \in [t_l, d]$, and
$$
\begin{aligned}
W_g(d) &= W_g(t_l) + \int_{t_l}^{d} P_g(t)\,dt = W_L(t_l) + W_{\max} + \int_{t_l}^{d} P_L(t_l)\,dt \\
&< W_L(t_l) + W_{\max} + \int_{t_l}^{d} P_L(t)\,dt = W_L(d) + W_{\max}.
\end{aligned}
\tag{14}
$$
This contradicts (12). An illustration of this case can be seen in Figure 7.
Both cases lead to a contradiction. Hence, $\forall t_1 \in [c,d]:\ W_g(t_1) = W_L(t_1) + W_{\max}$, and the set $A$ is a segment. □

2.2. The Fourth Lemma

The proof of this lemma is based on the following definitions.
Definition 6
(Candidate Optimal Solution). Let $[t_a, t_b] \subseteq [0,T]$ be an interval. A differentiable function $W_g(t): [t_a, t_b] \to \mathbb{R}$ is a Candidate Optimal Solution (COS) on $[t_a, t_b]$ if
1.
$\forall t \in [t_a, t_b]:\ W_L(t) \le W_g(t) \le W_L(t) + W_{\max}$,
2.
$W_g(t)$ satisfies the SF lemma on $[t_a, t_b]$,
3.
$W_g(t)$ satisfies the TC lemma on $[t_a, t_b]$.
Definition 7
(Reachable Monotone Interval). Let $[t_a, t_b]$ be a monotone interval of $P_L(t)$, where $t_b < T$. Also, let $C(i,t) = W_L(t) + i \cdot W_{\max}$ denote the optimization problem constraints for $i \in \{0,1\}$. Let $[t_c, t_d]$ be another monotone interval of $P_L(t)$ with $t_c > t_b$. Then, $[t_c, t_d]$ is reachable from $[t_a, t_b]$ if
1.
$\exists W_g(t): [t_a, t_d] \to \mathbb{R}$ such that $W_g(t)$ is a COS on $[t_a, t_d]$,
2.
$\exists t_{sep} \in [t_a, t_b],\ i_0 \in \{0,1\}$ such that $t_{sep}$ is a separation point of $W_g(t)$ and $C(i_0, t)$,
3.
$\exists t_{meet} \in [t_c, t_d],\ i_1 \in \{0,1\}$ such that $t_{meet}$ is a union point of $W_g(t)$ and $C(i_1, t)$,
4.
$\forall t \in (t_{sep}, t_{meet}):\ W_L(t) < W_g(t) < W_L(t) + W_{\max}$.
Definition 8
(Routed Between Monotone Intervals). Let $[t_a, t_b]$ be a monotone interval of $P_L(t)$, where $t_b < T$. Also, let $C(i,t) = W_L(t) + i \cdot W_{\max}$ denote the optimization problem constraints for $i \in \{0,1\}$. Let $[t_c, t_d]$ be a monotone interval reachable from $[t_a, t_b]$. A COS $W_g(t)$ on $[t_a, t_d]$ is said to be routed between $[t_a, t_b]$ and $[t_c, t_d]$ if
1.
$\exists t_{sep} \in [t_a, t_b],\ i_0 \in \{0,1\}$ such that $t_{sep}$ is a separation point of $W_g(t)$ and $C(i_0, t)$,
2.
$\exists t_{meet} \in [t_c, t_d],\ i_1 \in \{0,1\}$ such that $t_{meet}$ is a union point of $W_g(t)$ and $C(i_1, t)$,
3.
$\exists t_1 \in [t_c, t_d]$ with $t_1 > t_{meet}$ such that $P_g(t_1) \ne P_g(t_{meet})$.
Remark. 
A COS that is routed between an increasing interval and a decreasing interval is said to be discharging. A COS that is routed between a decreasing interval and an increasing interval is said to be charging.
Lemma 4
(Unique Between Monotone Intervals (UBMI)). Let $[t_a, t_b]$ and $[t_c, t_d]$ be two monotone intervals of $P_L(t)$ with $t_b < t_c$, and let $W_g(t)$ be a COS of (1). Also, let $C(i,t) = W_L(t) + i \cdot W_{\max}$ denote the optimization problem constraints for $i \in \{0,1\}$. If
1.
$\exists [t_1, t_2] \subseteq [t_a, t_b],\ i_1 \in \{0,1\}$ such that $\forall t \in [t_1, t_2]:\ W_g(t) = C(i_1, t)$,
2.
$\exists [t_3, t_4] \subseteq [t_c, t_d],\ i_2 \in \{0,1\}$ such that $\forall t \in [t_3, t_4]:\ W_g(t) = C(i_2, t)$,
3.
$\forall t \in (t_2, t_3):\ W_L(t) < W_g(t) < W_L(t) + W_{\max}$,
then $t_2$ and $t_3$ are unique.
Proof. 
Without loss of generality, assume that the COS $W_g(t)$ is discharging between an increasing interval $[t_a, t_b]$ and a decreasing interval $[t_c, t_d]$ of $P_L(t)$ (where $t_b < t_c$). Let $W_g^1(t)$ and $W_g^2(t)$ be two COS of the problem described in (1) such that
$$
\begin{aligned}
&\exists [t_1^i, t_2^i] \subseteq [t_a, t_b] \ \text{such that}\ \forall t \in [t_1^i, t_2^i]:\ W_g^i(t) = W_L(t) + W_{\max}, \\
&\exists [t_3^i, t_4^i] \subseteq [t_c, t_d] \ \text{such that}\ \forall t \in [t_3^i, t_4^i]:\ W_g^i(t) = W_L(t), \\
&\forall t \in (t_2^i, t_3^i):\ W_L(t) < W_g^i(t) < W_L(t) + W_{\max},
\end{aligned}
$$
where $i \in \{1,2\}$. By the SF lemma, $\frac{d}{dt} W_g^i(t) = P_g^i(t)$ is constant on $[t_2^i, t_3^i]$. Without loss of generality, assume by contradiction that $t_2^1 < t_2^2$. By the TC lemma and this constancy, $P_L(t_2^i) = P_g^i(t_2^i) = P_g^i(t_3^i) = P_L(t_3^i)$. Because $[t_a, t_b]$ is an increasing interval of $P_L(t)$ while $[t_c, t_d]$ is a decreasing interval, $t_2^1 < t_2^2$ gives $P_L(t_2^1) < P_L(t_2^2)$, hence $t_3^1 > t_3^2$ and
$$\forall t \in (t_2^1, t_3^1):\ P_g^1(t) < P_g^2(t).$$
This results in
$$-W_{\max} = \int_{t_2^1}^{t_3^1} (P_g^1 - P_L)(t)\,dt < \int_{t_2^1}^{t_3^1} (P_g^2 - P_L)(t)\,dt = \int_{t_2^2}^{t_3^2} (P_g^2 - P_L)(t)\,dt = -W_{\max} \;\Rightarrow\; -W_{\max} < -W_{\max},$$
where the middle equality holds because $W_g^2(t)$ follows a constraint, so that $P_g^2 = P_L$, on $[t_2^1, t_2^2]$ and $[t_3^2, t_3^1]$. This contradiction shows that $t_2^1 = t_2^2$ and $t_3^1 = t_3^2$. An illustration of the proof can be seen in Figure 8. □
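The uniqueness argument can be made concrete for a single load peak. During a discharge, the generated power equals a constant level $p$ with $P_L(t_2) = P_L(t_3) = p$, and the storage releases $\int_{t_2}^{t_3} (P_L(t) - p)\,dt = W_{\max}$. Because this released energy strictly decreases as $p$ grows, at most one level, and hence one pair $(t_2, t_3)$, satisfies the condition. The sketch below finds that level by bisection, assuming a single-peaked $P_L$ on $[a, b]$ and a midpoint-rule integral; the name `discharge_window` and its parameters are hypothetical.

```python
from typing import Callable, List, Tuple


def discharge_window(p_load: Callable[[float], float], a: float, b: float,
                     w_max: float, steps: int = 4000) -> Tuple[float, float, float]:
    """Return (p, t2, t3) with P_L(t2) ~ P_L(t3) ~ p and the integral of
    max(P_L - p, 0) over [a, b] ~ w_max, for a single-peaked P_L."""
    h = (b - a) / steps
    mids: List[float] = [a + (k + 0.5) * h for k in range(steps)]
    vals = [p_load(t) for t in mids]

    def released(level: float) -> float:
        # Midpoint-rule estimate of the energy drawn from storage above `level`.
        return h * sum(max(v - level, 0.0) for v in vals)

    lo, hi = min(vals), max(vals)
    if released(lo) < w_max:
        raise ValueError("storage is large enough to flatten the whole peak")
    for _ in range(50):  # released() is non-increasing in the level: bisect
        mid = 0.5 * (lo + hi)
        if released(mid) > w_max:
            lo = mid     # too much energy above the level: raise it
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    above = [t for t, v in zip(mids, vals) if v >= level]
    return level, above[0], above[-1]
```

For the triangular peak `lambda t: 1 - abs(t)` on $[-1, 1]$ with $W_{\max} = 0.25$, the released energy is $(1 - p)^2$, so the sketch returns a level close to 0.5 and a window close to $[-0.5, 0.5]$.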

2.3. The Fifth Lemma

Lemma 5
(Most Distant Reachable Interval (MDRI)). Let $[t_a, t_b]$ and $[t_c, t_d]$ be monotone intervals of $P_L(t)$, where $t_b < t_c$. Also, let $C(i,t) = W_L(t) + i \cdot W_{\max}$ denote the optimization problem constraints for $i \in \{0,1\}$, and let $W_g(t)$ be an optimal solution of (1). If $W_g(t)$ is routed between $[t_a, t_b]$ and $[t_c, t_d]$, then $[t_c, t_d]$ is unique, and it is either the most distant reachable increasing interval or the most distant reachable decreasing interval.
Proof. 
The proof of the lemma consists of three parts.
1.
The first part formulates the connection between the slopes of generated energies routed between monotone intervals of $P_L(t)$ and the distance between these monotone intervals.
2.
The second part uses this connection to prove that an optimal solution cannot be routed to a monotone interval that is neither the most distant reachable increasing interval of $P_L(t)$ nor the most distant reachable decreasing interval of $P_L(t)$.
3.
The last part proves that only one of those monotone intervals is a valid solution.
Let $[t_a, t_b]$ be a monotone interval, and let $[t_c, t_d]$ and $[t_e, t_f]$ be two other monotone intervals reachable from $[t_a, t_b]$ (where $t_b < t_c$ and $t_d < t_e$). We show that the following characteristics hold and refer to them as the “Fan Behavior” of the COS:
Let $W_g^{\langle 1\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_c, t_d]$, and let $W_g^{\langle 2\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_e, t_f]$. Let $P_g^{\langle i\rangle}(t)$ be the corresponding generated power for $i \in \{1,2\}$. Denote the separation point of $W_g^{\langle i\rangle}(t)$ and the constraint at $[t_a, t_b]$ by $t_{sep}^{\langle i\rangle}$, and the union point of $W_g^{\langle i\rangle}(t)$ and the constraint at its target interval ($[t_c, t_d]$ for $i = 1$, $[t_e, t_f]$ for $i = 2$) by $t_{meet}^{\langle i\rangle}$. Then,
1.
if [ t c , t d ] is an increasing interval, P g < 1 > ( t m e e t < 1 > ) = P g < 1 > ( t s e p < 1 > ) < P g < 2 > ( t s e p < 2 > ) = P g < 2 > ( t m e e t < 2 > ) ;
2.
if [ t c , t d ] is a decreasing interval, P g < 1 > ( t m e e t < 1 > ) = P g < 1 > ( t s e p < 1 > ) > P g < 2 > ( t s e p < 2 > ) = P g < 2 > ( t m e e t < 2 > ) .
To prove it, without loss of generality, we assume [ t a , t b ] and [ t c , t d ] are increasing intervals. We assume by negation that P g < 2 > ( t s e p < 2 > ) > P g < 1 > ( t s e p < 1 > ) . Because [ t a , t b ] is an increasing interval, we know t s e p < 1 > < t s e p < 2 > . Therefore,
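$$
\begin{aligned}
W_g^{\langle 2\rangle}(t_{meet}^{\langle 1\rangle}) &= W_g^{\langle 2\rangle}(t_{sep}^{\langle 1\rangle}) + \int_{t_{sep}^{\langle 1\rangle}}^{t_{sep}^{\langle 2\rangle}} P_g^{\langle 2\rangle}(t)\,dt + \int_{t_{sep}^{\langle 2\rangle}}^{t_{meet}^{\langle 1\rangle}} P_g^{\langle 2\rangle}(t)\,dt \\
&= W_L(t_{sep}^{\langle 1\rangle}) + W_{\max} + \int_{t_{sep}^{\langle 1\rangle}}^{t_{sep}^{\langle 2\rangle}} P_L(t)\,dt + P_L(t_{sep}^{\langle 2\rangle})\,(t_{meet}^{\langle 1\rangle} - t_{sep}^{\langle 2\rangle}) \\
&> W_L(t_{sep}^{\langle 1\rangle}) + W_{\max} + P_L(t_{sep}^{\langle 1\rangle})\,(t_{sep}^{\langle 2\rangle} - t_{sep}^{\langle 1\rangle}) + P_L(t_{sep}^{\langle 1\rangle})\,(t_{meet}^{\langle 1\rangle} - t_{sep}^{\langle 2\rangle}) \\
&= W_g^{\langle 1\rangle}(t_{sep}^{\langle 1\rangle}) + \int_{t_{sep}^{\langle 1\rangle}}^{t_{sep}^{\langle 2\rangle}} P_g^{\langle 1\rangle}(t)\,dt + \int_{t_{sep}^{\langle 2\rangle}}^{t_{meet}^{\langle 1\rangle}} P_g^{\langle 1\rangle}(t)\,dt \\
&= W_g^{\langle 1\rangle}(t_{meet}^{\langle 1\rangle}) = W_L(t_{meet}^{\langle 1\rangle}) + W_{\max}.
\end{aligned}
$$
This means that $W_g^{\langle 2\rangle}(t)$ violates the upper constraint at $t_{meet}^{\langle 1\rangle}$, which is a contradiction. Therefore, the “Fan Behavior” holds. An illustration of this behavior can be seen in Figure 9 and Figure 10.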
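A worked special case may make the inequality chain more tangible. Assume, for illustration only, that the load grows linearly on $[t_a, t_b]$, $P_L(t) = t$, so that $W_L(t) = W_L(t_a) + \tfrac{1}{2}(t^2 - t_a^2)$ there. Writing $\tau_1 = t_{sep}^{\langle 1\rangle} < \tau_2 = t_{sep}^{\langle 2\rangle}$, each COS leaves the upper constraint as a straight line, $W_g^{\langle i\rangle}(t) = W_L(\tau_i) + W_{\max} + \tau_i\,(t - \tau_i)$, so for every $t \in [\tau_2, t_{meet}^{\langle 1\rangle}]$,

$$
W_g^{\langle 2\rangle}(t) - W_g^{\langle 1\rangle}(t) = \tfrac{1}{2}\left(\tau_2^2 - \tau_1^2\right) + \tau_2\,(t - \tau_2) - \tau_1\,(t - \tau_1) = (\tau_2 - \tau_1)\left(t - \frac{\tau_1 + \tau_2}{2}\right) > 0.
$$

The later, steeper departure therefore lies strictly above the earlier one, in particular at $t_{meet}^{\langle 1\rangle}$, where $W_g^{\langle 1\rangle}(t)$ touches the upper bound; this is the mechanism behind the contradiction above.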
Now, consider again the monotone intervals $[t_a, t_b]$, $[t_c, t_d]$ and $[t_e, t_f]$, where $t_b < t_c$, $t_d < t_e$, and $[t_c, t_d]$, $[t_e, t_f]$ are reachable from $[t_a, t_b]$. We prove that
1.
If $[t_c, t_d]$ and $[t_e, t_f]$ are both increasing intervals, $W_g(t)$ cannot be a COS on $[t_a, t_f]$ that is routed from $[t_a, t_b]$ to $[t_c, t_d]$.
2.
If $[t_c, t_d]$ and $[t_e, t_f]$ are both decreasing intervals, $W_g(t)$ cannot be a COS on $[t_a, t_f]$ that is routed from $[t_a, t_b]$ to $[t_c, t_d]$.
Without loss of generality, we assume that $[t_a, t_b]$, $[t_c, t_d]$ and $[t_e, t_f]$ are all increasing intervals of $P_L(t)$. Let $W_g^{\langle 1\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_c, t_d]$, and let $W_g^{\langle 2\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_e, t_f]$. Let $t_2^1 \in [t_c, t_d]$ be the union point of $W_g^{\langle 1\rangle}(t)$ and $W_L(t) + W_{\max}$ in $[t_c, t_d]$, and let $t_2^2 \in [t_e, t_f]$ be the union point of $W_g^{\langle 2\rangle}(t)$ and $W_L(t) + W_{\max}$ in $[t_e, t_f]$. Similarly, let $t_1^1 \in [t_a, t_b]$ and $t_1^2 \in [t_a, t_b]$ be the separation points of $W_g^{\langle 1\rangle}(t)$ and $W_g^{\langle 2\rangle}(t)$, respectively, from $W_L(t) + W_{\max}$ in $[t_a, t_b]$.
From the “Fan Behavior” property, we know that $P_g^{\langle 1\rangle}(t_1^1) > P_g^{\langle 2\rangle}(t_1^2)$.
Because $[t_a, t_b]$ is increasing, $t_1^1 > t_1^2$. Let $[t_g, t_h]$ be a decreasing interval of $P_L(t)$ with $t_g > t_d$ and $t_h < t_e$. Suppose that $[t_g, t_h]$ is reachable from $[t_a, t_b]$, and let $W_g^{\langle 3\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_g, t_h]$, with separation point $t_1^3$. From the “Fan Behavior”, we know that $P_g^{\langle 3\rangle}(t_1^3) < P_g^{\langle 2\rangle}(t_1^2)$, and also $P_g^{\langle 2\rangle}(t_1^2) < P_g^{\langle 1\rangle}(t_1^1)$. Hence, $P_g^{\langle 3\rangle}(t_1^3) < P_g^{\langle 1\rangle}(t_1^1)$. This means that $W_g^{\langle 1\rangle}(t)$ cannot meet any decreasing interval before time $t_2^2$, and therefore $\forall t \in (t_1^2, t_2^2]: P_g^{\langle 1\rangle}(t) > P_g^{\langle 2\rangle}(t)$.
Therefore, we know
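$$
\begin{aligned}
W_g^{\langle 1\rangle}(t_2^2) &= W_g^{\langle 1\rangle}(t_1^2) + \int_{t_1^2}^{t_2^2} P_g^{\langle 1\rangle}(t)\,dt = W_L(t_1^2) + W_{\max} + \int_{t_1^2}^{t_2^2} P_g^{\langle 1\rangle}(t)\,dt \\
&> W_L(t_1^2) + W_{\max} + \int_{t_1^2}^{t_2^2} P_g^{\langle 2\rangle}(t)\,dt = W_g^{\langle 2\rangle}(t_1^2) + \int_{t_1^2}^{t_2^2} P_g^{\langle 2\rangle}(t)\,dt \\
&= W_g^{\langle 2\rangle}(t_2^2) = W_L(t_2^2) + W_{\max}.
\end{aligned}
$$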
Consequently, $W_g^{\langle 1\rangle}(t)$ violates the upper bound at $t_2^2$ and is therefore not a valid COS. Hence, the second claim of the MDRI lemma holds: an optimal solution must be routed to the most distant increasing or the most distant decreasing reachable interval.
As a final step, let us prove that the first claim of the MDRI lemma holds, which means that at least one of the most distant reachable monotone intervals is not valid, since a COS that is routed to it violates a previous lemma or a boundary constraint. Let $[t_a, t_b]$, $[t_c, t_d]$ and $[t_e, t_f]$ be monotone intervals where $t_b < t_c$, $t_d < t_e$, and $[t_c, t_d]$, $[t_e, t_f]$ are the most distant increasing and decreasing intervals reachable from $[t_a, t_b]$. Let $W_g^{\langle 1\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_c, t_d]$, and let $W_g^{\langle 2\rangle}(t)$ be a COS routed from $[t_a, t_b]$ to $[t_e, t_f]$. Without loss of generality, we assume that $[t_a, t_b]$ and $[t_e, t_f]$ are increasing intervals and $[t_c, t_d]$ is a decreasing interval. Let $t_1^1 \in [t_a, t_b]$ be the separation point of $W_g^{\langle 1\rangle}(t)$ and $W_L(t) + W_{\max}$ at $[t_a, t_b]$, and let $t_1^2 \in [t_a, t_b]$ be the separation point of $W_g^{\langle 2\rangle}(t)$ and $W_L(t) + W_{\max}$ at $[t_a, t_b]$. Because $[t_a, t_b]$ is increasing, $t_1^2 > t_1^1$. Also, let $t_2^1$ be the union point of $W_g^{\langle 1\rangle}(t)$ and $W_L(t)$ at $[t_c, t_d]$, and let $t_2^2$ be the union point of $W_g^{\langle 2\rangle}(t)$ and $W_L(t) + W_{\max}$ at $[t_e, t_f]$.
We know that $W_g^{\langle 1\rangle}(T) = W_g^{\langle 2\rangle}(T)$, because both solutions must end with an empty battery at time $t = T$. We use the following facts:
1.
$[t_a, t_b]$ is increasing; hence, $\forall t \in (t_1^1, t_1^2]: P_g^{\langle 1\rangle}(t) = P_L(t_1^1) < P_L(t) = P_g^{\langle 2\rangle}(t)$;
2.
Utilizing the “Fan Behavior”, one may conclude that $\forall t \in [t_1^2, t_2^1]: P_g^{\langle 1\rangle}(t) = P_g^{\langle 1\rangle}(t_1^1) < P_g^{\langle 2\rangle}(t_1^2) = P_g^{\langle 2\rangle}(t)$;
3.
$[t_c, t_d]$ is decreasing; thus, $\forall t \in [t_2^1, t_d]: P_g^{\langle 1\rangle}(t) = P_L(t) \le P_L(t_2^1) = P_L(t_1^1) < P_g^{\langle 2\rangle}(t_1^2) = P_g^{\langle 2\rangle}(t)$.
Therefore, for $t \in [t_c, t_d]$, we know that $W_g^{\langle 2\rangle}(t) > W_g^{\langle 1\rangle}(t)$ and $P_g^{\langle 2\rangle}(t) > P_g^{\langle 1\rangle}(t)$. Moreover, $P_g^{\langle 1\rangle}(t)$ can only decrease as long as $W_g^{\langle 1\rangle}(t)$ does not meet $W_L(t) + W_{\max}$. The same holds in reverse for $W_g^{\langle 2\rangle}(t)$: $P_g^{\langle 2\rangle}(t)$ can only increase as long as $W_g^{\langle 2\rangle}(t)$ does not meet $W_L(t)$. If both solutions are valid, they must have a union point (at least at time $t = T$), so at least one of them must meet the other's constraint: either $W_g^{\langle 2\rangle}(t)$ has a union point with $W_L(t)$, or $W_g^{\langle 1\rangle}(t)$ has a union point with $W_L(t) + W_{\max}$. Without loss of generality, we assume that $W_g^{\langle 2\rangle}(t)$ has a union point $t_0 > t_f$ with $W_L(t)$, and that $t_0$ is the first such union point. Then,
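$$
\begin{aligned}
W_L(t_0) = W_g^{\langle 2\rangle}(t_0) &= W_g^{\langle 2\rangle}(t_1^1) + \int_{t_1^1}^{t_0} P_g^{\langle 2\rangle}(t)\,dt \\
&> W_g^{\langle 2\rangle}(t_1^1) + \int_{t_1^1}^{t_0} P_g^{\langle 1\rangle}(t)\,dt \\
&= W_g^{\langle 1\rangle}(t_1^1) + \int_{t_1^1}^{t_0} P_g^{\langle 1\rangle}(t)\,dt = W_g^{\langle 1\rangle}(t_0).
\end{aligned}
$$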
This means that $W_g^{\langle 1\rangle}(t)$ violates the lower constraint at $t = t_0$ and is therefore not a valid COS on $[t_a, T]$. Hence, only one of $W_g^{\langle 1\rangle}(t)$ and $W_g^{\langle 2\rangle}(t)$ is valid, and there is only a single valid COS. □
As the final stage of the proof, let $W_g^{\langle 1\rangle}(t)$ and $W_g^{\langle 2\rangle}(t)$ be two optimal solutions to the optimization problem described in (1). Let $A = \{t \in [0, T] \mid W_g^{\langle 1\rangle}(t) \neq W_g^{\langle 2\rangle}(t)\}$. Assume, by contradiction, that $A \neq \emptyset$, and let $t_0 = \inf(A)$. We now check the following cases:
1.
If $t_0 = 0$, then we have a contradiction to the starting condition.
2.
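The case where $W_L(t_0) < W_g^{\langle i\rangle}(t_0) < W_L(t_0) + W_{\max}$ for $i \in \{1, 2\}$ is not possible, because both $W_g^{\langle 1\rangle}(t)$ and $W_g^{\langle 2\rangle}(t)$ must be straight lines in the neighborhood of $t_0$ (SF lemma), and two straight lines that agree at $t_0$ but differ just after it must also differ just before it. Therefore, there must be $\delta > 0$ such that $\forall t \in [t_0 - \delta, t_0): W_g^{\langle 1\rangle}(t) \neq W_g^{\langle 2\rangle}(t)$, which contradicts the definition of $t_0$.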
3.
If $W_g^{\langle i\rangle}(t_0) = W_L(t_0)$ for $i \in \{1, 2\}$, then $t_0 \in [a, b]$, where $[a, b]$ is a decreasing interval of $P_L(t)$. We now have a contradiction, because
(a)
If both $W_g^{\langle 1\rangle}(t)$ and $W_g^{\langle 2\rangle}(t)$ are routed from $[a, b]$ to the same reachable monotone interval, we have a contradiction to the UBMI lemma.
(b)
If $W_g^{\langle 1\rangle}(t)$ and $W_g^{\langle 2\rangle}(t)$ are routed from $[a, b]$ to different monotone intervals, we have a contradiction to the MDRI lemma.
4.
The case where $W_g^{\langle i\rangle}(t_0) = W_L(t_0) + W_{\max}$ for $i \in \{1, 2\}$ is analogous to the previous one.
5.
Lastly, if $t_0 = T$, we have a contradiction to the ending condition.
In conclusion, $A = \emptyset$, and the solution to (1) is unique. □
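The unique solution is the shortest path of generated energy inside the band $W_L(t) \le W_g(t) \le W_L(t) + W_{\max}$. As a rough illustration only, the sketch below computes a discrete analogue: the shortest (taut-string) polyline through bounds sampled on a uniform grid, with fixed start and end values. It uses a greedy funnel construction, which is a swapped-in technique and not the authors' graphical procedure; the function name `shortest_path` and the sampled-bounds input format are hypothetical.

```python
from typing import List, Sequence


def shortest_path(lower: Sequence[float], upper: Sequence[float],
                  start: float, end: float) -> List[float]:
    """Shortest polyline through lower[k] <= W[k] <= upper[k] on a unit grid.

    W[0] = start and W[n] = end are imposed in place of the endpoint bounds.
    Greedy "funnel" (taut-string) construction; O(n^2) in the worst case.
    """
    n = len(lower) - 1
    if n < 1 or len(upper) != n + 1:
        raise ValueError("need matching bound sequences with at least two samples")
    if any(l > u for l, u in zip(lower, upper)):
        raise ValueError("empty band: some lower bound exceeds its upper bound")
    lo = [start] + list(lower[1:n]) + [end]
    hi = [start] + list(upper[1:n]) + [end]
    path = [0.0] * (n + 1)
    i, y = 0, start
    path[0] = y
    while i < n:
        s_max, k_max = float("inf"), i    # tightest slope allowed by upper bounds
        s_min, k_min = float("-inf"), i   # tightest slope required by lower bounds
        bend = None
        for k in range(i + 1, n + 1):
            s_lo = (lo[k] - y) / (k - i)
            s_hi = (hi[k] - y) / (k - i)
            if s_lo > s_max:      # cannot clear lower[k]: bend upward on the upper bound
                bend = (k_max, s_max)
                break
            if s_hi < s_min:      # cannot stay under upper[k]: bend downward on the lower bound
                bend = (k_min, s_min)
                break
            if s_hi <= s_max:
                s_max, k_max = s_hi, k
            if s_lo >= s_min:
                s_min, k_min = s_lo, k
        if bend is None:          # the end point is visible: go straight to it
            bend = (n, (end - y) / (n - i))
        k, s = bend
        for j in range(i + 1, k + 1):
            path[j] = y + s * (j - i)
        i, y = k, path[k]
    return path
```

For a wide, flat band, `shortest_path([0, 0, 0, 0, 0], [10] * 5, 0.0, 4.0)` returns the straight line `[0.0, 1.0, 2.0, 3.0, 4.0]`. Bends occur only where the string wraps around the upper bound (slope increasing) or the lower bound (slope decreasing), mirroring the constraint-following segments of the COS.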

3. Comparative Analysis

3.1. Energy Balancing with Transients

We consider a power system comprising a grid-connected storage device and a photovoltaic (PV) unit, as illustrated in Figure 11. The storage device is charged from both the grid and the PV unit and supplies an aggregated load characterized by its active power consumption. The load's active power demand is modeled as a continuous positive function, $P_a : \mathbb{R}_{\ge 0} \to \mathbb{R}$, defined over a finite time interval $[0, T]$ for a given parameter $T$. The power supplied by the PV unit is represented by a piecewise continuous function, $P_{pv} : \mathbb{R}_{\ge 0} \to \mathbb{R}$, which partially offsets the load's power demand. Consequently, the net power consumption of the load is $P_L(t) = P_a(t) - P_{pv}(t)$. The power flowing into or out of the storage device is $P_s(t) = P_g(t) - P_L(t)$, where $P_g : \mathbb{R}_{\ge 0} \to \mathbb{R}$ is the power injected from the grid. Furthermore, $W_g(t)$, $W_L(t)$, and $W(t)$ denote the generated energy, load energy, and stored energy, respectively. These energy terms are related to the corresponding power functions by $E(t) = \int_0^t P(\tau)\,d\tau$. The maximum storage capacity, denoted by $E_{\max}$, is varied in the simulations over the values $E_{\max} \in \{0, 2, 6, 8, 15, 27\}$ p.u. The load consumption and PV generation data (both in p.u.) are synthesized to reflect real-world conditions. The simulation results for $E_{\max} = 2$ and $6$ p.u. are presented in Figure 12. Regarding the generation sources, we use a standard synchronous generator model in which the generator operates with frequency and voltage regulation, providing a stable power output while following load fluctuations. The PV plant model assumes a typical irradiance-dependent power output, in which solar generation varies over time but follows a predefined profile based on environmental conditions.
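The power and energy bookkeeping of this setup can be sketched in discrete time. The sketch assumes uniformly sampled profiles with step `dt`, a left-rectangle approximation of $E(t) = \int_0^t P(\tau)\,d\tau$, and an initially empty storage device; the function names and the toy numbers in the test are hypothetical and are not data from the study.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def cumulative_energy(p: Sequence[float], dt: float) -> List[float]:
    """E(t_k) ~ integral of P from 0 to t_k (left-rectangle rule), with E(0) = 0."""
    return [0.0] + list(accumulate(x * dt for x in p))


def energy_profiles(p_a: Sequence[float], p_pv: Sequence[float],
                    p_g: Sequence[float], dt: float
                    ) -> Tuple[List[float], List[float], List[float]]:
    """Return (W_g, W_L, W) at t = 0, dt, ..., N*dt for one generation schedule."""
    p_l = [a - pv for a, pv in zip(p_a, p_pv)]   # net load P_L = P_a - P_pv
    p_s = [g - l for g, l in zip(p_g, p_l)]      # storage power P_s = P_g - P_L
    return (cumulative_energy(p_g, dt), cumulative_energy(p_l, dt),
            cumulative_energy(p_s, dt))


def within_capacity(w: Sequence[float], e_max: float, tol: float = 1e-9) -> bool:
    """True if the stored energy W stays within [0, E_max]."""
    return all(-tol <= x <= e_max + tol for x in w)
```

Because $W = W_g - W_L$, the capacity check is equivalent to keeping $W_g$ inside the band $[W_L, W_L + E_{\max}]$ used throughout the proof.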
To explore the behavior of this model in reinforcement learning (RL) settings, we first examine its ideal-case performance before assessing how transient effects influence algorithmic efficiency. In the RL formulation, the deterministic load function is replaced with a generative model. The environment's state lies in a continuous state space, $s = (W, H, P_L) \in S$, where $W$ denotes the state of charge (SOC), $H$ indicates the current hour, and $P_L$ represents the current load. At each time step $t$, the agent selects an action $a \in A$ that determines the energy allocated to charge the battery. The action space aligns with the original search problem and is defined as $A = \{a = \Delta E : 0 \le \Delta E \le W_L + E_{\max}\}$, where $\Delta E \in \mathbb{R}$ signifies the energy added to the battery's current level. The SOC and the battery capacity establish the permissible action bounds. After executing an action, the agent observes the new state and receives a scalar reward $r$. The simplest reward function follows a quadratic cost structure [20] and is expressed as $r = -f(P_g)$ with $f(P_g) = P_g^2(t)$. The transition function deterministically maps the battery state to the prescribed SOC, ensuring consistent decision-making.
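A minimal sketch of the ideal-storage transition and quadratic reward, written without any RL library, may help fix the formulation. It assumes a fixed step length `dt` (48 steps over 24 h, i.e. 0.5 h), that the action is the energy $\Delta E$ added to the battery, and that grid power covers the net load plus the charging energy, $P_g = P_L + \Delta E/\Delta t$ (from $P_s = P_g - P_L$). Clipping the action to $0 \le W + \Delta E \le E_{\max}$ and using the negative quadratic cost as the reward are assumptions; the class and its methods are hypothetical and do not reproduce the authors' environment.

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[float, int, float]   # (W, step index standing in for the hour H, net load P_L)


@dataclass
class IdealStorageEnv:
    """Hypothetical ideal-storage MDP: lossless transitions, quadratic generation cost."""
    load: List[float]    # net load P_L for each step (p.u.)
    e_max: float         # storage capacity E_max
    dt: float = 0.5      # 48 steps over 24 h -> 0.5 h per step
    w: float = 0.0       # state of charge W
    k: int = 0           # current step index

    def state(self) -> State:
        p_l = self.load[self.k] if self.k < len(self.load) else 0.0
        return (self.w, self.k, p_l)

    def step(self, delta_e: float) -> Tuple[State, float, bool]:
        # Assumed action bound: clip so that 0 <= W + dE <= E_max.
        delta_e = min(max(delta_e, -self.w), self.e_max - self.w)
        p_g = self.load[self.k] + delta_e / self.dt   # from P_s = P_g - P_L
        reward = -(p_g ** 2)                          # r = -f(P_g), f(P_g) = P_g^2
        self.w += delta_e                             # ideal transition: W' = W + dE
        self.k += 1
        return self.state(), reward, self.k >= len(self.load)
```

Charging 0.25 p.u.h in the first half-hour step against a unit load, for example, raises $P_g$ to 1.5 p.u. and yields a reward of $-2.25$.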
To determine the optimal control policy, we implemented the analytical method in MATLAB and trained three model-free RL algorithms in Python 3.12. The policy specifies the generation and charge/discharge actions over 48 time steps spanning 24 h. Our study focuses on model-free approaches, which forgo explicit modeling of the environment dynamics in favor of direct policy optimization. These methods are widely used, particularly for storage control, because of their adaptability across domains. Nevertheless, model-based approaches present an intriguing avenue for future research. The RL methods analyzed are Soft Actor–Critic (SAC), Proximal Policy Optimization (PPO), and Twin Delayed DDPG (TD3), implemented via Stable Baselines. These methods were chosen for their effectiveness in continuous action spaces, which makes them suitable for energy storage control tasks. The RL agents were trained in a simulated environment, where they learned charging and discharging policies through trial-and-error interaction with the system. The training objective was to minimize the total energy cost while adhering to the storage constraints.
The hyperparameter settings followed standard configurations for stability and efficiency. The discount factor $\gamma$ was set to 0.99 to prioritize long-term rewards, and the learning rate was set per algorithm: $3 \times 10^{-4}$ for PPO and SAC, and $10^{-3}$ for TD3. The policy and value networks consisted of two hidden layers with 256 neurons each, using ReLU activations. To improve exploration, we incorporated Gaussian noise for SAC and TD3 and used a clipped surrogate objective for PPO. Training was conducted over one million time steps, with a batch size of 256 and the Adam optimizer for gradient updates. The final policy was evaluated over a 100-day test set, comparing RL-based energy management decisions with classical optimization methods.
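The settings above can be collected into constructor arguments. This sketch assumes the Stable-Baselines3 package (the text says only “Stable Baselines”) and maps the stated values onto its `learning_rate`, `gamma`, `batch_size` and `policy_kwargs` arguments; the `NormalActionNoise` scale of 0.1 for SAC and TD3 is an assumption, since the text gives no value. The library is imported lazily, so the configuration itself can be inspected without it; `build_and_train` is a hypothetical helper, not the authors' script.

```python
from typing import Any, Dict

COMMON: Dict[str, Any] = {"gamma": 0.99, "batch_size": 256}   # shared settings from the text
LEARNING_RATE = {"PPO": 3e-4, "SAC": 3e-4, "TD3": 1e-3}
TOTAL_TIMESTEPS = 1_000_000


def sb3_kwargs(algo: str) -> Dict[str, Any]:
    """Library-independent part of the constructor arguments."""
    kwargs = dict(COMMON)
    kwargs["learning_rate"] = LEARNING_RATE[algo]
    return kwargs


def build_and_train(algo: str, env):
    """Build and train one agent on a Gymnasium env; needs stable-baselines3, torch, numpy."""
    import numpy as np
    import torch
    import stable_baselines3 as sb3
    from stable_baselines3.common.noise import NormalActionNoise

    kwargs = sb3_kwargs(algo)
    # Two hidden layers of 256 ReLU units for the policy and value/critic networks.
    kwargs["policy_kwargs"] = dict(net_arch=[256, 256], activation_fn=torch.nn.ReLU)
    if algo in ("SAC", "TD3"):
        n = env.action_space.shape[0]
        # Assumed Gaussian exploration noise scale (not given in the text).
        kwargs["action_noise"] = NormalActionNoise(mean=np.zeros(n), sigma=0.1 * np.ones(n))
    # PPO's clipped surrogate objective and the Adam optimizer are library defaults.
    model = getattr(sb3, algo)("MlpPolicy", env, **kwargs)
    model.learn(total_timesteps=TOTAL_TIMESTEPS)
    return model
```

PPO is excluded from the action-noise branch because it explores through its stochastic policy and does not take an `action_noise` argument.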
The results shown in Figure 13 reveal that none of the RL methods succeeded in matching the optimal policy identified by the classical algorithm. Notably, PPO and TD3 yielded similar policies, despite TD3 employing a deterministic policy while PPO relies on a stochastic one. A few key insights emerge.
1.
When examining consecutive days with similar demand and PV production patterns, policies tend to be similar due to the Cauchy-bounded nature of the value function.
2.
The system dynamics remain consistent over time, ensuring stable relationships between storage, PV generation, grid interaction, and load.
3.
The reward function exhibits convexity, preventing local minima.
4.
The reward function exhibits symmetry with respect to certain states and actions, guiding both algorithms toward similar policies. For example, consider state $s_0$, where high PV generation satisfies demand but additional charging ($\Delta P_s = 1$ p.u.) is necessary due to anticipated future demand, versus state $s_1$, where low PV production requires purchasing 1 p.u. from the grid. Despite the differing conditions, identical rewards may cause the algorithms to converge to similar policies.
Figure 14 presents, for each algorithm, the daily cost of energy production over a test set of 100 days. The mean and variance for each algorithm are presented in Table 1. The deviation from the mean of the daily costs of each algorithm over these 100 test days is depicted in Figure 15.
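Per-day statistics of the kind reported in Table 1 can be computed in a few lines. This sketch assumes the daily cost is the quadratic generation cost $\sum_k P_g^2\,\Delta t$ over the day's steps and uses the population variance (`statistics.pvariance`); both choices are assumptions, and the numbers in the test are toy values.

```python
import statistics
from typing import Dict, List


def daily_cost(p_g: List[float], dt: float) -> float:
    """Quadratic generation cost of one day: sum of P_g^2 * dt."""
    return sum(p * p for p in p_g) * dt


def summarize(costs: List[float]) -> Dict[str, object]:
    """Mean, variance and per-day deviation from the mean of the daily costs."""
    mean = statistics.mean(costs)
    return {
        "mean": mean,
        "variance": statistics.pvariance(costs),
        "deviation": [c - mean for c in costs],
    }
```

Switching to `statistics.variance` would give the sample (n - 1) variance instead.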
To extend the comparative analysis, we now introduce a more complex model that incorporates a lossy storage device [21]. In this formulation, storage losses are introduced explicitly, meaning that energy stored at one time step is no longer perfectly available in subsequent time steps. This modification affects the construction of the optimal trajectory, because the system must now compensate for energy dissipation over time, which requires a more conservative approach to energy allocation. When losses are taken into account, the shortest path solution is affected because the system must optimize not only the immediate energy balance but also the accumulated losses over time. As a result, the optimal trajectory smooths out power fluctuations to avoid unnecessary energy cycling, which would otherwise exacerbate losses. The addition of losses reduces the feasibility of high-frequency energy exchanges and forces a more structured, long-term planning approach in the optimization. This additional complexity heightens the uncertainty faced by both traditional and reinforcement learning algorithms. The RL formulation remains largely the same as in the ideal storage case, the key distinction being the definition of the MDP's transition function. Instead of a direct mapping from charge and discharge actions to a specified SOC, storage losses must be considered. Thus, action $a = \Delta E$ constitutes a valid transition from state $s = W$ to $s' = (W + \Delta E)\,\eta$, provided the transition adheres to the battery's capacity constraint, i.e., $W + \Delta E < E_{\max}$. To account for energy dissipation, if $\Delta E \neq 0$, the decay factor is given by $\eta = \eta_{dis}\,\eta_{decay}$. Conversely, when no charge or discharge occurs, the decay factor simplifies to $\eta = \eta_{decay}$.
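The lossy transition rule can be written as a small function. It follows the stated rule $s' = (W + \Delta E)\,\eta$, with $\eta = \eta_{dis}\,\eta_{decay}$ when $\Delta E \neq 0$ and $\eta = \eta_{decay}$ otherwise; rejecting actions that would drive the charge negative is an added assumption, and the function name and the loss factors in the test are hypothetical.

```python
def lossy_transition(w: float, delta_e: float, e_max: float,
                     eta_dis: float, eta_decay: float) -> float:
    """Next SOC s' = (W + dE) * eta for a lossy storage device (hypothetical helper)."""
    # Capacity rule from the text: W + dE < E_max; the non-negativity check is assumed.
    if not 0.0 <= w + delta_e < e_max:
        raise ValueError("action violates the storage constraints")
    # Charging or discharging adds the conversion loss on top of self-decay.
    eta = eta_dis * eta_decay if delta_e != 0 else eta_decay
    return (w + delta_e) * eta
```

Idling still costs energy through $\eta_{decay}$, which is what discourages holding charge longer than needed.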
The corresponding results are illustrated in Figure 16. Once again, none of the RL methods learned a policy that matched the optimal path found by the classical algorithm, and PPO and TD3 again produced similar policies. The mean and variance for each algorithm are presented in Table 2.
From the table, it can be observed that the mean and variance of the classical algorithm decreased, indicating improved battery operation compared to the initial case. This highlights the characteristics of each classical approach: “SP”, which employs “Dijkstra” search, appears to be more suited for discrete-state dynamics, whereas “PMP” demonstrates better performance in continuous environments. Furthermore, the mean values for both PPO and TD3 also declined, possibly due to their capacity to manage large and complex state spaces, enabling them to capture environmental subtleties more effectively. However, the increased model complexity led to a rise in the variance across all RL algorithms. Figure 17 presents, for each algorithm, the daily cost of energy production over a test set of 100 days. The deviation from the mean of the daily costs of each algorithm over these 100 test days is depicted in Figure 18.
Next, we remove the assumption of idealized power transmission between sources, storage, and loads. In the previous paragraphs, we assumed that only storage losses existed; now we also consider losses occurring between different components in the system and propose a model for analyzing losses over the transmission lines between the power source and the consumer, including the storage device. The system under consideration consists of a source, modeled as a synchronous generator, and a non-linear load. This non-linear load encompasses the PV unit, the storage device with capacity $E_{\max}$, and the load consuming active power $P_L : \mathbb{R}_{\ge 0} \to \mathbb{R}$. We assume that the components within the non-linear load are closely situated, ensuring lossless transmission among them. On this basis, we conduct the analysis using a distributed circuit model that incorporates a short transmission line. The power approximation is given by
$$
P_g \approx P + \left|\frac{P}{V_g}\right|^2 R \approx P + \alpha P^2,
$$
where $P_g$ represents the generated power, $P$ denotes the power flowing into the storage device, and $\alpha = R/|V_g|^2$ is a constant that depends on the line voltage $V_g$ and the line resistance $R$. This quadratic relation is well documented in the literature, as discussed in [22].
For the RL solution, we incorporate the transmission line loss model. Consequently, the transition from state $s = W$ after taking action $a = \Delta E$ leads to a new state $s' = (W + \Delta E\,\eta_{tr})\,\eta$, where $\eta_{tr}$ accounts for transmission line losses and $\eta$ is defined as in the previous case studies.
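A sketch of the transmission-line loss model and the resulting transition may make the two formulas concrete. It evaluates $P_g \approx P + \alpha P^2$ with $\alpha = R/|V_g|^2$ and applies $s' = (W + \Delta E\,\eta_{tr})\,\eta$, with $\eta$ passed in already computed as in the lossy-storage case; the function names and all numeric values in the test are illustrative assumptions.

```python
def generated_power(p: float, r_line: float, v_g: complex) -> float:
    """Short-line approximation P_g ~= P + alpha * P^2 with alpha = R / |V_g|^2."""
    alpha = r_line / abs(v_g) ** 2   # abs() also accepts a complex voltage phasor
    return p + alpha * p * p


def transmission_transition(w: float, delta_e: float,
                            eta_tr: float, eta: float) -> float:
    """Next SOC when transmission losses scale the injected energy: (W + dE*eta_tr)*eta."""
    return (w + delta_e * eta_tr) * eta
```

The quadratic term grows with the transferred power, so large bursts of charging are penalized more than the same energy spread over several steps.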
The corresponding results are illustrated in Figure 19. In this scenario, PPO and TD3 algorithms yield distinct policies. Notably, the policy generated by the PPO algorithm, which employs a stochastic policy, appears smoother compared to that of the TD3 algorithm, which follows a deterministic policy.
Figure 20 presents, for each algorithm, the daily cost of energy production over a test set of 100 days. The mean and variance for each algorithm are presented in Table 3.
The results in the table reveal several trends: (a) The mean and variance of the classical algorithm increase due to the added uncertainty in the dynamic model. (b) Performance improvements are observed across all RL algorithms, highlighting their effectiveness in highly uncertain environments with stochastic transitions. (c) The PPO algorithm outperforms TD3, underscoring the benefits of stochastic policies in handling unpredictable model dynamics. The deviation from the mean daily costs of each algorithm over 100 test days is depicted in Figure 21.

3.2. Hybrid Electrical Vehicle Simulation

In this section, we present a series of numerical simulations on electric vehicles to demonstrate the validity of the algorithm whose properties were studied in this paper. The data set used for the experiments is available from the Argonne National Laboratory. The analysis focuses on the real-time energy management of a hybrid electric vehicle, in which the car's battery is used alongside a fuel cell to meet its power demand during driving. In a hybrid vehicle, energy management involves optimally distributing power between the battery and the fuel cell to achieve efficiency and longevity. The optimization does not involve an external power system; rather, it ensures that battery discharge is managed effectively to minimize fuel consumption while maintaining performance. The decision variables include when to draw energy from the battery versus when to rely on the fuel cell, taking into account state-of-charge constraints, efficiency losses, and future energy needs. We consider the Volkswagen e-Golf electric vehicle with an 85 kW, 279 Nm permanent magnet synchronous AC motor and a 24.2 kWh, 323 V rated lithium-ion battery. In the simulations, we set $E_{\max} = 300$ kJ and select the simulation time as $T = 45$ s, with a sample time of $\Delta T = 0.1$ s. The results are presented in Figure 22, which shows the resulting optimal generated power and the load profile $P_L(t)$. The ripples seen in the PMP solution are likely due to the transients in the load and the properties of the numerical solver. It is important to note, however, that the shortest path method provides an intuitive reference solution that serves as a benchmark when evaluating other optimization techniques. While it is not directly applicable in real-world scenarios due to its reliance on a graphical interpretation, it may be used as an analytical tool.
By offering a qualitative baseline, it enables researchers and practitioners to assess the effectiveness of more advanced control strategies, such as those based on Pontryagin’s principle or reinforcement learning. Given that it is an approximation-based method, the shortest path approach naturally produces smooth trajectories, which may not fully capture system transients or high-frequency variations in load demand, and thus the fluctuations that are observed in the other methods are not apparent. Despite these limitations, the shortest path solution can assist in planning and algorithmic design. Its ability to highlight structural properties of the optimal solution makes it useful for generating heuristic insights into problem formulations. Additionally, by comparing new algorithms against this benchmark, it is possible to identify areas where more sophisticated numerical methods are required to account for system dynamics, constraints, and uncertainties. In this sense, the shortest path method acts as a guiding principle rather than an operational tool, shaping the development of more practical and computationally feasible optimization strategies for real-world applications.
The MSE between the shortest path method and the minimum principle method is 0.6803 , and the MSE between the shortest path method and the dynamic programming method is 1.6211 .
We also tested another application: computing the peak generated power. Table 4 shows the peak power calculated by the shortest path algorithm for other vehicles available in the data set, together with the corresponding optimal power flows. The database contains various car data samples; the last column of the table therefore gives the record serial number.
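The two comparison metrics used in this subsection, the mean squared error between trajectories and the peak generated power, take only a few lines each. This sketch assumes that both trajectories are sampled on the same time grid (for instance, $\Delta T = 0.1$ s over $T = 45$ s); the function names are hypothetical, and this is not the authors' evaluation script.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sampled power trajectories."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def peak_power(p_g: Sequence[float]) -> float:
    """Peak generated power of a trajectory."""
    return max(p_g)
```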

4. Discussion

The results presented in this work highlight that the shortest path method serves as a useful graphical benchmark tool for evaluating optimization techniques for energy storage management. While it is not directly applicable in real-world implementations, it provides a qualitative baseline that enables researchers to assess the effectiveness of more advanced control strategies, such as Pontryagin’s principle or reinforcement learning. Given its approximate and inherently smooth trajectories, the shortest path method may overlook fast system transients but remains valuable for identifying structural properties of optimal solutions. By comparing new algorithms against this benchmark, it becomes possible to pinpoint areas where more sophisticated methods are needed to better capture system dynamics, constraints, and uncertainties. In this sense, the shortest path method guides the design of more practical optimization strategies, contributing to the development of computationally efficient and real-world-applicable control solutions. In particular, the comparison between Pontryagin’s Maximum Principle (PMP), dynamic programming (DP) method, the shortest path method, and reinforcement learning (RL) algorithms reveals important insights into the performance of each method and their practical applicability.
When examining the trajectories produced by the PMP or DP methods, it is apparent that they exhibit more fluctuations than the shortest path solution. In other words, the shortest path method appears to perform better than the other two algorithms. However, this result should be interpreted with caution. The shortest path method serves primarily as a reference signal, offering an idealized solution that may not fully capture real-world transients and constraints. Since it arises from a graphical solution, it naturally smooths out power variations and provides an approximate, but interpretable, optimal trajectory. This smoothing effect can make the shortest path solution appear superior in terms of cost and feasibility; however, it does not necessarily mean that the method is applicable in real-world scenarios.
Furthermore, the reinforcement learning algorithms struggled to match the analytical solutions, as evident in Figure 13. This is not necessarily due to insufficient training but rather reflects an inherent limitation of model-free RL methods in high-variability environments. The RL algorithms used in this study are model-free, meaning that they do not have access to a mathematical model of the environment; instead, they must learn the behavior of the system through exploration and interaction. This learning process is particularly difficult because of the stochastic nature of load variations and the complexity of the energy constraints. In contrast to the analytical methods, which leverage full knowledge of the system dynamics, RL algorithms must learn optimal policies purely from experience, without access to the explicit equations describing the system. This makes it particularly challenging to capture long-term dependencies and constraints, such as energy storage limits and future load variations, especially when the observation window is short. In this analysis, the RL agents could only observe a limited time window, which prevented them from fully anticipating future constraints, such as battery state-of-charge limitations.
Additionally, the RL solutions in Figure 19 suggest that storage capacity constraints are not fully respected, likely because of the difficulty of learning accurate representations of the system boundaries. Unlike the shortest path, PMP, and DP approaches, which explicitly incorporate constraints into their optimization frameworks, RL algorithms must infer them from observed rewards. If the training process does not include sufficient penalties or incentives to learn the boundary conditions, the resulting policy may fail to enforce them effectively.
Another key factor is the impact of problem complexity on RL generalization. Interestingly, in the more complex model with lossy storage, RL algorithms performed closer to the ideal solution. This seemingly counterintuitive result can be explained by the fact that energy losses naturally regularize the problem, discouraging rapid charge–discharge cycles and promoting smoother control policies. With losses, RL algorithms are forced to learn long-term energy conservation strategies, which align more closely with the analytical approaches. This highlights an important interaction between problem complexity and RL generalization, where certain real-world constraints can help RL agents learn better policies by discouraging unrealistic strategies. These insights are summarized in Table 5.
To conclude, the shortest path method, while not a direct solution for practical applications, offers a comparative standard for energy storage optimization techniques. Its smooth trajectory provides a basis for understanding solution structures, allowing researchers to identify where more complex methods are necessary to address system dynamics and constraints. Therefore, this method aids in the development of optimization strategies that are both computationally efficient and relevant to real-world scenarios.

5. Conclusions

The current paper belongs to a class of research works that aim to find analytic and semi-analytic solutions to optimal power flow problems that involve storage systems. Following is a summary of the most important points stemming from the above analysis.
  • The main contribution of this work is a rigorous proof of the central result provided in [19], one of the first papers in this group. This proof justifies the “shortest path” graphical design method and guarantees that the optimal solution obtained is indeed unique, thus avoiding possible conflicts between competing optimal solutions.
  • Most importantly, the uniqueness proof presented in this paper has practical implications, since a guarantee that the solution is unique allows for more confident decision-making in real-world applications such as grid management and energy dispatch. Furthermore, the analytical nature of the solution, in contrast to purely numerical approaches, offers potential advantages in terms of computational efficiency and interpretability. The graphical design procedure, coupled with a guarantee of a unique solution, facilitates a deeper understanding of the system's behavior and can aid in the design and optimization of storage systems.
  • The validation of the proposed solution through two distinct comparative studies further strengthens its credibility. The comparison with reinforcement learning algorithms on synthetic data highlights the potential advantages of the proposed method in terms of convergence speed and solution quality. The analysis of real-world data from an electric vehicle storage device demonstrates the practical applicability and effectiveness of the proposed solution in a realistic scenario.
Future research directions could explore extending this approach to more complex scenarios, such as systems with multiple storage devices, time-varying constraints, or uncertainties in the system parameters. Investigating the applicability of this method to other types of energy storage technologies, beyond electrical vehicle batteries, is also a promising area for future work.
As a final remark, we must acknowledge that the comparison between the shortest path and reinforcement learning (RL) methods in this study is problematic. This is because the selected optimization problem inherently favors the shortest path approach, neglecting the uncertainty-rich environments where RL thrives. Furthermore, the problem’s low dimensionality and constrained solution space severely limit RL’s potential, and consequently the observed superiority of the shortest path method may be misleading, as RL’s performance was significantly degraded. While a comprehensive analysis of RL’s capabilities in this context falls outside the scope of this paper, which primarily focuses on the theoretical proof of the shortest path method, we recognize this as another valuable avenue for future research.

Author Contributions

Conceptualization, T.G.-T. and Y.L.; methodology, T.G.-T.; software, T.G.-T. and E.G.-G.; validation, E.G.-G.; formal analysis, T.G.-T.; investigation, T.G.-T.; resources, J.B.; data curation, E.G.-G. and J.B.; writing—original draft preparation, T.G.-T.; writing—review and editing, E.G.-G.; visualization, T.G.-T., J.B. and E.G.-G.; supervision, Y.L.; project administration, Y.L.; funding acquisition, J.B. All authors have read and agreed to the published version of the manuscript.

Funding

The work of J. Belikov was partly supported by the Estonian Research Council grant PRG1463.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study, which use MATLAB R2024a and Python 3.12, are openly available online on GitHub at https://github.com/ElinorG11/UniquenessGraphicalMethod.git, accessed on 19 February 2025. The other raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

$[0, T]$: Integration interval
$P_g$: Generated power
$P_L$: Load power consumption
$P_s$: Power that flows into the battery
$W$: State of charge of the battery
$W_g$: Generated energy
$W_L$: Energy demand of the load
$W_{\max}$: Energy capacity of the battery
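The energy bands mentioned in the figure captions can be related to the symbols above. As a sketch, assume the lossless base model in which the generated power supplies the load and the battery ($P_g = P_L + P_s$), an initial state of charge $W(0) = W_0$ (the symbol $W_0$ is introduced here for illustration), and $W_g(0) = W_L(0) = 0$. The state-of-charge limits then become bounds on the generated energy:

```latex
W(t) = W_0 + \int_0^t P_s(\tau)\,d\tau = W_0 + W_g(t) - W_L(t), \qquad t \in [0, T],

0 \le W(t) \le W_{\max}
\;\Longleftrightarrow\;
W_L(t) - W_0 \;\le\; W_g(t) \;\le\; W_L(t) + W_{\max} - W_0 .
```

Under these assumptions, the lower band is the cumulative load energy shifted down by the initial charge, and the upper band is the same curve shifted up by the remaining capacity $W_{\max} - W_0$.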

References

1. Akinyele, D.; Rayudu, R. Review of energy storage technologies for sustainable power networks. Sustain. Energy Technol. Assessments 2014, 8, 74–91.
2. Khodadoost Arani, A.; Gharehpetian, G.B.; Abedi, M. Review on Energy Storage Systems Control Methods in Microgrids. Int. J. Electr. Power Energy Syst. 2019, 107, 745–757.
3. Machlev, R.; Zargari, N.; Chowdhury, N.; Belikov, J.; Levron, Y. A review of optimal control methods for energy storage systems–Energy trading, energy balancing and electric vehicles. J. Energy Storage 2020, 32, 101787.
4. Deng, K.; Sun, Y.; Li, S.; Lu, Y.; Brouwer, J.; Mehta, P.G.; Zhou, M.; Chakraborty, A. Model predictive control of central chiller plant with thermal energy storage via dynamic programming and mixed-integer linear programming. IEEE Trans. Autom. Sci. Eng. 2014, 12, 565–579.
5. Dolara, A.; Grimaccia, F.; Magistrati, G.; Marchegiani, G. Optimization Models for islanded micro-grids: A comparative analysis between linear programming and mixed integer programming. Energies 2017, 10, 241.
6. Harsha, P.; Dahleh, M. Optimal Management and Sizing of Energy Storage Under Dynamic Pricing for the Efficient Integration of Renewable Energy. IEEE Trans. Power Syst. 2015, 30, 1164–1181.
7. Yoon, Y.; Kim, Y.H. Effective scheduling of residential energy storage systems under dynamic pricing. Renew. Energy 2016, 87, 936–945.
8. Zhang, W.; Wang, Y.; Zeeshan, M.; Han, F.; Song, K. Super-twisting sliding mode control of grid-side inverters for wind power generation systems with parameter perturbation. Int. J. Electr. Power Energy Syst. 2025, 165, 110501.
9. Asadi, Y.; Eskandari, M.; Mansouri, M.; Moradi, M.H.; Savkin, A.V. A universal model for power converters of battery energy storage systems utilizing the impedance-shaping concepts. Int. J. Electr. Power Energy Syst. 2023, 149, 109055.
10. Cirocco, L.R.; Belusko, M.; Bruno, F.; Boland, J.; Pudney, P. Controlling stored energy in a concentrating solar thermal power plant to maximise revenue. IET Renew. Power Gener. 2015, 9, 379–388.
11. Cirocco, L.; Pudney, P.; Boland, J.; Bruno, F.; Belusko, M. Maximising revenue via optimal control of a concentrating solar thermal power plant with limited storage capacity. IET Renew. Power Gener. 2016, 10, 729–734.
12. Lifshitz, D.; Weiss, G. Optimal Control of a Capacitor-Type Energy Storage System. IEEE Trans. Autom. Control 2015, 60, 216–220.
13. Nguyen, A.; Lauber, J.; Dambrine, M. Optimal control based algorithms for energy management of automotive power systems with battery/supercapacitor storage devices. Energy Convers. Manag. 2014, 87, 410–420.
14. Li, Q.; Huang, W.; Chen, W.; Yan, Y.; Shang, W.; Li, M. Regenerative braking energy recovery strategy based on Pontryagin’s minimum principle for fell cell/supercapacitor hybrid locomotive. Int. J. Hydrogen Energy 2019, 44, 5454–5461.
15. Cao, J.; Harrold, D.; Fan, Z.; Morstyn, T.; Healey, D.; Li, K. Deep reinforcement learning-based energy storage arbitrage with accurate lithium-ion battery degradation model. IEEE Trans. Smart Grid 2020, 11, 4513–4521.
16. Bui, V.H.; Hussain, A.; Kim, H.M. Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties. IEEE Trans. Smart Grid 2019, 11, 457–469.
17. Lee, H.; Song, C.; Kim, N.; Cha, S.W. Comparative analysis of energy management strategies for HEV: Dynamic Programming and Reinforcement Learning. IEEE Access 2020, 8, 67112–67123.
18. Jiang, D.R.; Pham, T.V.; Powell, W.B.; Salas, D.F.; Scott, W.R. A comparison of approximate dynamic programming techniques on benchmark energy storage problems: Does anything work? In Proceedings of the 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Orlando, FL, USA, 9–12 December 2014; pp. 1–8.
19. Levron, Y.; Shmilovitz, D. Optimal power management in fueled systems with finite storage capacity. IEEE Trans. Circuits Syst. I Regul. Pap. 2010, 57, 2221–2231.
20. Zivic Djurovic, M.; Milacic, A.; Krsulja, M. A simplified model of quadratic cost function for thermal generators. In Proceedings of the 23rd International DAAAM Symposium, Vienna, Austria, 24–25 October 2012; pp. 24–27.
21. Chowdhury, N.R.; Ofir, R.; Zargari, N.; Baimel, D.; Belikov, J.; Levron, Y. Optimal control of lossy energy storage systems with nonlinear efficiency based on dynamic programming and Pontryagin’s Minimum Principle. IEEE Trans. Energy Convers. 2021, 36, 524–533.
22. Hobbs, B.F.; Drayton, G.; Bartholomew Fisher, E.; Lise, W. Improved transmission representations in oligopolistic market models: Quadratic losses, phase shifters, and DC lines. IEEE Trans. Power Syst. 2008, 23, 1018–1029.
Figure 1. A conceptual illustration of the main approach: the proposed analytical proof contributes to our understanding of the graphical procedure method, the “shortest path”. This allows better benchmarking of high-complexity solution methods, such as those based on reinforcement learning. Additionally, it may aid the planning and development of basic and advanced numerical methods.
Figure 2. An illustration of the proof strategy. The first two lemmas are based on [19]. As depicted above, the third lemma utilizes the lemmas proven in [19]. Next, the fourth and fifth lemmas are based on all preceding lemmas. Finally, uniqueness is proved using the last two lemmas.
Figure 3. Illustration of the two categories of intervals.
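Figures 4 to 7 distinguish intervals in which the load power increases from intervals in which it decreases. A minimal sketch of how a sampled load profile might be split into such intervals is shown below. The function name `monotone_intervals`, the `tol` threshold, and the choice to let flat steps extend the current run are illustrative assumptions, not definitions taken from the paper.

```python
import numpy as np


def monotone_intervals(p_load, tol=0.0):
    """Split a sampled load profile P_L[k] into maximal monotone runs.

    Returns a list of (start_index, end_index, label) tuples, where the
    inclusive indices are the samples bounding each run and label is
    "increasing" or "decreasing" ("flat" if the profile never changes).
    Hypothetical helper for illustration only.
    """
    p = np.asarray(p_load, dtype=float)
    diffs = np.diff(p)
    intervals = []
    start = 0
    label = None
    for k, d in enumerate(diffs):
        if abs(d) <= tol:
            # Flat steps (within tol) extend whichever run is in progress.
            continue
        step = "increasing" if d > 0 else "decreasing"
        if label is None:
            label = step
        elif step != label:
            # Direction changed at sample k: close the current run there.
            intervals.append((start, k, label))
            start, label = k, step
    intervals.append((start, len(p) - 1, label if label is not None else "flat"))
    return intervals
```

For example, a profile that rises, falls, stays flat, and rises again yields three runs, with the flat step absorbed into the decreasing run.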
Figure 4. Illustration of the generated energy and power inside an increasing interval. Note that the generated energy cannot coincide with the lower energy band in this interval.
Figure 5. Illustration of the generated energy and power in a decreasing interval. Note that the generated energy cannot coincide with the upper energy band in this interval.
Figure 6. Illustration of TC lemma violation. Notice that the generated energy cannot leave the upper energy constraint and meet the lower one.
Figure 7. Illustration of the generated energy detaching from the upper bound constraint in an increasing interval of the load power demand. Note that, once separated, it does not rejoin the bound.
Figure 8. Illustration of the proof of the fourth lemma. Segments where the integral equals minus the maximal energy capacity of the battery ($-W_{\max}$), meaning the battery is fully discharged, are colored red; segments where it equals the maximal energy capacity ($W_{\max}$), meaning the battery is fully charged, are colored green.
Figure 9. Illustration of the Fan Behavior of energy trajectories, demonstrating how energy and power transition differently across increasing and decreasing intervals. The steeper slopes of closer increasing intervals and the more gradual changes in more distant decreasing intervals highlight the underlying optimization structure. Top subplot: Green solid line: energy trajectory when transitioning through a closer increasing interval, so that energy accumulation follows a steeper slope due to a nearby increase in power demand. Purple solid line: energy trajectory for a more distant increasing interval, with a shallower slope than the green line, reflecting the Fan Behavior in which closer increasing intervals have steeper slopes. Blue dashed line (–): reference energy trajectory, likely representing an intermediate path between increasing and decreasing intervals. Orange dashed line (–): energy trajectory associated with a decreasing interval, showing a downward correction in the energy profile. Blue dotted line (:): reference energy trajectory indicating a baseline or average energy demand, used for comparison across different energy transition strategies. Orange dotted line (:): energy trajectory associated with a decreasing interval, showing how energy production adjusts in response to declining demand. The bottom subplot follows the same legend for the power function instead of the energy function.
Figure 10. Detailed examination of the “Fan Behavior” focusing on the beginning of the trajectory.
Figure 11. Illustration of a grid-connected energy storage system, showing its components and their interconnections. In addition to the inherent uncertainty due to possible malfunction of different elements and the intermittent, unreliable nature of the photovoltaic cells, there are power losses when charging and discharging the storage devices, as well as energy dissipation due to aging. Moreover, the transmission lines in the illustration are lossy. The numbered key components are: (1) and (4) conventional production sources, associated with the generated power and energy ($P_g$ and $W_g$, respectively); (2) photovoltaic cells, representing the renewable energy source; (3) the system controller, situated in the center, which manages and coordinates the operation of all other components; (5) the battery bank, representing the system’s energy storage capability and associated with the power flowing into the battery, $P_s$, and the state of charge $0 \le W \le W_{\max}$.
Figure 12. The top plot shows the optimal generated energy (solid black line) between the operational boundaries of the storage device (red and green lines). The middle plot shows the energy of the storage device (solid black line). The bottom plot shows the generated power (blue line) and the power consumed by the load (black line).
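The top plot of Figure 12 shows the optimal generated energy as a curve held between two operational boundaries. As a rough numerical stand-in for the graphical shortest path procedure (not the authors' implementation), one can sample time and, assuming lossless storage and a strictly convex generation cost, compute the taut curve between the sampled bounds. The sketch relies on the known property that, for a sampled corridor with fixed endpoints, the taut string also minimizes the sum of squared increments, so it solves that box-constrained quadratic problem by projected Gauss-Seidel. The function name `shortest_path_energy`, its parameters, and the solver choice are hypothetical; convergence is slow for long horizons, where a dedicated taut-string or QP solver would be preferable.

```python
import numpy as np


def shortest_path_energy(lower, upper, w_start, w_end, iters=20000, tol=1e-10):
    """Approximate the taut ("shortest path") curve W_g[k] with
    lower[k] <= W_g[k] <= upper[k] and fixed endpoints (hypothetical helper).

    Minimizes sum_k (W_g[k+1] - W_g[k])**2 under the box constraints using
    projected Gauss-Seidel, i.e. cyclic coordinate descent with clipping.
    """
    lo = np.asarray(lower, dtype=float)
    hi = np.asarray(upper, dtype=float)
    n = len(lo)
    # Start from the straight line between the endpoints, clipped into the band.
    w = np.clip(np.linspace(w_start, w_end, n), lo, hi)
    w[0], w[-1] = w_start, w_end
    for _ in range(iters):
        max_change = 0.0
        for k in range(1, n - 1):
            # The unconstrained minimizer in w[k] is the average of its
            # neighbours; clipping it to [lo[k], hi[k]] gives the exact
            # minimizer of this coordinate subproblem.
            new = min(max(0.5 * (w[k - 1] + w[k + 1]), lo[k]), hi[k])
            max_change = max(max_change, abs(new - w[k]))
            w[k] = new
        if max_change < tol:
            break
    return w
```

With a sampling step `dt`, the corresponding generated power can then be recovered as `np.diff(w) / dt`; the band edges would come from the cumulative load energy $W_L$ and the capacity $W_{\max}$.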
Figure 13. The right subfigures show two scenarios (different days of the year) in which the net consumption (the load remaining after the production from renewable sources is consumed) is considered. The left subfigures show the optimal generation policy, where different algorithms attempt to match power generation to the fluctuating demand. The solid black line represents the shortest path method, which serves as a benchmark solution with a smooth energy trajectory; the red dashed line corresponds to PPO; the blue dotted line represents SAC; and the magenta dash-dotted line corresponds to TD3.
Figure 14. Daily generation cost for different optimization algorithms over a 100-day period. The cost is measured in USD/p.u. and reflects the efficiency of each method in managing energy generation. The fluctuations in cost illustrate how well each approach adapts to daily variations in demand and system constraints. The blue solid line with circles represents the SP method, which exhibits the lowest and most stable generation costs; the green dashed line with squares corresponds to PPO; the red dash-dotted line with triangles represents SAC; and the purple dotted line with diamonds corresponds to TD3.
Figure 15. Histograms of cost deviations for four different optimization algorithms: DP, PPO, SAC and TD3. The x-axis represents cost deviation, measured as the difference between the algorithm’s daily generation cost and a reference benchmark (e.g., the shortest path or an ideal cost trajectory). The y-axis represents the count (frequency) of occurrences for each cost deviation range over the evaluated time horizon.
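The histogram captions define cost deviation as the difference between an algorithm's daily generation cost and a reference benchmark. A minimal sketch of that computation follows. The function name `cost_deviation_stats`, the default bin count, and the use of the population variance (`ddof=0`, NumPy's default) are assumptions. The mean and variance here describe the deviations only, so this is an illustration rather than a reproduction of Tables 1 to 3.

```python
import numpy as np


def cost_deviation_stats(daily_cost, reference_cost, bins=10):
    """Return (mean, variance, counts, bin_edges) of the daily cost deviation
    cost - reference (hypothetical helper for illustration)."""
    cost = np.asarray(daily_cost, dtype=float)
    ref = np.asarray(reference_cost, dtype=float)
    deviation = cost - ref
    # np.histogram returns the per-bin counts and the bin edges.
    counts, edges = np.histogram(deviation, bins=bins)
    # ndarray.var() computes the population variance (ddof=0) by default.
    return deviation.mean(), deviation.var(), counts, edges
```

The counts returned here correspond to the y-axis of the histograms, and the bin edges to the cost deviation ranges on the x-axis.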
Figure 16. The right subfigures show two scenarios (different days of the year) in which the net consumption (the load remaining after the production from renewable sources is consumed) is considered. The left subfigures show the optimal generation policy, where different algorithms attempt to match power generation to the fluctuating demand. The solid black line represents the PMP method, which serves as a benchmark solution with a smooth energy trajectory; the red dashed line corresponds to PPO; the blue dotted line represents SAC; and the magenta dash-dotted line corresponds to TD3.
Figure 17. Daily generation cost for different optimization algorithms over a 100-day period. The cost is measured in USD/p.u. and reflects the efficiency of each method in managing energy generation. The fluctuations in cost illustrate how well each approach adapts to daily variations in demand and system constraints. The blue solid line with circles represents the PMP method, which exhibits the lowest and most stable generation costs; the green dashed line with squares corresponds to PPO; the red dash-dotted line with triangles represents SAC; and the purple dotted line with diamonds corresponds to TD3.
Figure 18. Histograms of cost deviations for four different optimization algorithms: DP, PPO, SAC and TD3. The x-axis represents cost deviation, measured as the difference between the algorithm’s daily generation cost and a reference benchmark (e.g., the shortest path or an ideal cost trajectory). The y-axis represents the count (frequency) of occurrences for each cost deviation range over the evaluated time horizon.
Figure 19. The right subfigures show two scenarios (different days of the year) in which the net consumption (the load remaining after the production from renewable sources is consumed) is considered. The left subfigures show the optimal generation policy, where different algorithms attempt to match power generation to the fluctuating demand. The solid black line represents the dynamic programming method, which serves as a benchmark solution with a smooth energy trajectory; the red dashed line corresponds to PPO; the blue dotted line represents SAC; and the magenta dash-dotted line corresponds to TD3.
Figure 20. Daily generation cost for different optimization algorithms over a 100-day period. The cost is measured in USD/p.u. and reflects the efficiency of each method in managing energy generation. The fluctuations in cost illustrate how well each approach adapts to daily variations in demand and system constraints. The blue solid line with circles represents the DP method, which exhibits the lowest and most stable generation costs; the green dashed line with squares corresponds to PPO; the red dash-dotted line with triangles represents SAC; and the purple dotted line with diamonds corresponds to TD3.
Figure 21. Histograms of cost deviations for four different optimization algorithms: DP, PPO, SAC and TD3. The x-axis represents cost deviation, measured as the difference between the algorithm’s daily generation cost and a reference benchmark (e.g., the shortest path or an ideal cost trajectory). The y-axis represents the count (frequency) of occurrences for each cost deviation range over the evaluated time horizon.
Figure 22. The load profile is shown as a red solid line. The optimal generated power computed by the shortest path, minimum principle, and dynamic programming methods is shown as a green dashed line, a black solid line, and a blue solid line, respectively. Only some of the data markers are shown for clarity.
Table 1. Mean and variance calculation. The method exhibiting the minimal mean and variance is shown in bold.
| Algorithm | Mean | Variance |
|---|---|---|
| **SP** | **711.85** | **70,854.35** |
| PPO | 2428.24 | 271,321.64 |
| SAC | 1935.02 | 154,239.84 |
| TD3 | 2787.36 | 342,948.40 |
Table 2. Mean and variance calculation. The method exhibiting the minimal mean and variance is shown in bold.
| Algorithm | Mean | Variance |
|---|---|---|
| **PMP** | **705.82** | **69,024.48** |
| PPO | 748.17 | 88,234.85 |
| SAC | 2280.81 | 207,902.61 |
| TD3 | 748.17 | 88,234.85 |
Table 3. Mean and variance calculation. The method exhibiting the minimal mean and variance is shown in bold.
| Algorithm | Mean | Variance |
|---|---|---|
| **DP** | **773.10** | **89,220.93** |
| PPO | 1165.14 | 146,485.45 |
| SAC | 1293.96 | 109,379.59 |
| TD3 | 2779.89 | 342,000.82 |
Table 4. Generated peak-power for different vehicle categories.
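| Model Name | Nominal | Shortest Path | No |
|---|---|---|---|
| Mercedes-Benz | 13.1000 | 2.4811 | 79 |
| Nissan Leaf SV | 20.0757 | 3.0375 | 144 |
| Mitsubishi I-MiEV | 6.5887 | 1.1180 | 100 |
| Chevrolet Spark EV | 7.8028 | 1.9977 | 21 |
| Volkswagen e-Golf | 38.6643 | 19.2181 | 207 |
| Smart EV | 16.1382 | 8.9181 | 188 |
| BMW i3 BEV | 21.7496 | 5.6388 | 1 |
| Ford Focus | 8.8122 | 1.7193 | 42 |
| Kia Soul | 82.2908 | 30.2048 | 64 |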
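One way to read Table 4 is as a relative reduction of peak generated power when the shortest path strategy is used instead of the nominal operation. The short sketch below computes that percentage from the transcribed table values. Interpreting the "Nominal" column as the baseline peak, and the metric itself, are assumptions; the units are whatever the table uses, and the "No" column is not used.

```python
# Peak generated power transcribed from Table 4: (Nominal, Shortest Path).
PEAK_POWER = {
    "Mercedes-Benz": (13.1000, 2.4811),
    "Nissan Leaf SV": (20.0757, 3.0375),
    "Mitsubishi I-MiEV": (6.5887, 1.1180),
    "Chevrolet Spark EV": (7.8028, 1.9977),
    "Volkswagen e-Golf": (38.6643, 19.2181),
    "Smart EV": (16.1382, 8.9181),
    "BMW i3 BEV": (21.7496, 5.6388),
    "Ford Focus": (8.8122, 1.7193),
    "Kia Soul": (82.2908, 30.2048),
}


def peak_reduction_percent(nominal, shortest_path):
    """Relative reduction of peak generated power, in percent (assumed metric)."""
    return 100.0 * (1.0 - shortest_path / nominal)


if __name__ == "__main__":
    for model, (nominal, sp) in PEAK_POWER.items():
        print(f"{model:20s} {peak_reduction_percent(nominal, sp):6.1f} %")
```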
Table 5. Summary of experimental results and analysis.
| Experiment | Key Observations | Role of SP as a Benchmarking Tool |
|---|---|---|
| Baseline Case | SP exhibits the lowest mean cost (711.85) and variance (70,854.35). RL algorithms perform significantly worse, with higher variance, indicating unstable policies. | SP serves as an interpretable reference, revealing structural properties of the optimal energy trajectory. It highlights RL inefficiencies in capturing the long-term dynamics of the system. |
| Lossy Storage Model | PMP achieves the lowest mean cost (705.82) and variance (69,024.48), demonstrating the effect of explicitly incorporating physical constraints. RL methods improve but still exhibit performance gaps. | SP provides a qualitative baseline for assessing the effect of adding realistic losses. Comparing the RL outputs with SP and PMP shows that RL struggles with long-term energy planning. |
| Lossy Transmission Line Model | DP achieves the lowest mean cost (773.10) and variance (89,220.93), outperforming the RL methods. RL variance remains high, indicating unstable learning behavior. | SP acts as an initial structural guide, allowing researchers to judge whether more advanced algorithms follow the expected energy trajectories. The graphical approach aids in evaluating solution smoothness and feasibility. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Goldstein-Tweg, T.; Ginzburg-Ganz, E.; Belikov, J.; Levron, Y. Uniqueness of Optimal Power Management Strategies for Energy Storage Dynamic Models. Energies 2025, 18, 1483. https://doi.org/10.3390/en18061483

AMA Style

Goldstein-Tweg T, Ginzburg-Ganz E, Belikov J, Levron Y. Uniqueness of Optimal Power Management Strategies for Energy Storage Dynamic Models. Energies. 2025; 18(6):1483. https://doi.org/10.3390/en18061483

Chicago/Turabian Style

Goldstein-Tweg, Tom, Elinor Ginzburg-Ganz, Juri Belikov, and Yoash Levron. 2025. "Uniqueness of Optimal Power Management Strategies for Energy Storage Dynamic Models" Energies 18, no. 6: 1483. https://doi.org/10.3390/en18061483

APA Style

Goldstein-Tweg, T., Ginzburg-Ganz, E., Belikov, J., & Levron, Y. (2025). Uniqueness of Optimal Power Management Strategies for Energy Storage Dynamic Models. Energies, 18(6), 1483. https://doi.org/10.3390/en18061483

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
