Article

Improved Reinforcement Learning for Multi-Objective Optimization Operation of Cascade Reservoir System Based on Monotonic Property

1 Hubei Key Laboratory of Intelligent Yangtze and Hydroelectric Science, China Yangtze Power Co., Ltd., Yichang 443000, China
2 State Key Laboratory of Water Resources Engineering and Management, Wuhan University, Wuhan 430072, China
3 School of Resource and Environmental Sciences, Wuhan University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Water 2025, 17(11), 1681; https://doi.org/10.3390/w17111681
Submission received: 24 April 2025 / Revised: 24 May 2025 / Accepted: 25 May 2025 / Published: 2 June 2025
(This article belongs to the Special Issue Machine Learning Applications in the Water Domain)

Abstract

In this paper, an improved reinforcement learning (IRL) method is designed for the multi-objective optimization operation of a cascade reservoir system. The primary improvement of IRL is that it searches within a limited solution space, based on a derived monotonic property: the first-order derivative relationship between individual reservoir water release decisions, for mainstream use (i.e., hydropower generation) and tributary use (i.e., regional water supply), and the water availability of the cascade system or of a particular reservoir, together with the assumed synchronicity and substitutability of the storage distribution in the cascade system. The improved algorithm is then applied to a real-world cascade reservoir system in the Yangtze River of China. The results demonstrate the high computational efficiency and reasonable interpretability of IRL.

1. Introduction

The operation of multi-reservoir systems is essential for regulating the spatial and temporal distribution of water resources to satisfy multiple water use needs, such as hydropower generation and regional water demand [1,2]. However, optimizing the coordinated operation of a complex multi-reservoir system, particularly for reservoirs in series, is always challenged by the high-dimensional decision-making process, nonlinear system dynamics, and the stochasticity of runoff [1,2,3]. Various optimization techniques have been proposed to address these challenges, as surveyed in the state-of-the-art reviews of Yeh [1], Wurbs [4], Labadie [5], Rani and Moreira [6], Dobson [7], and Giuliani et al. [8].
According to Dobson et al. [7], optimization methods for multi-reservoir system operation can be categorized into release sequences, operating policy, and real-time optimization. Among these three categories, only value function estimation based on Markov Decision Processes (MDP) can explicitly capture the dynamic nature of the optimization problem as well as the stochasticity of inflow, by decomposing the complex optimization problem into a sequence of sub-problems. As one of the best-known MDP-based optimization algorithms, stochastic dynamic programming (SDP) has been widely used to solve reservoir optimization operation problems, since it can handle the nonlinear characteristics of the objective and constraints [1,5,9]. However, the applicability of SDP is restricted to a limited number of individual reservoirs due to the so-called curse of dimensionality, i.e., the exponential increase in computing time and memory requirements as the number of individual reservoirs increases. Various improvements to SDP have been proposed to achieve controlled computational efficiency [10,11]. For example, Tilmant and Kelman [12] introduced dynamic programming with successive approximation (DPSA), which decomposes multi-reservoir operation problems into a series of single-reservoir operation problems; Howson and Sancho [13] developed the progressive optimality algorithm (POA), transforming the multi-stage operation problem into multiple one-stage operation problems; Turgeon [14] and Turgeon and Charbonneau [15] proposed aggregation–disaggregation methods to reduce the high dimensionality of the state variables. Although these improvements can significantly reduce the computational complexity of SDP, they may cause a severe loss of solution accuracy [1,16,17,18,19].
As another MDP-based optimization method, reinforcement learning (RL) integrates trial-and-error learning mechanisms with Bellman's principle of optimality [20,21,22,23] to approximate the optimal solutions. As one of the most popular RL methods for solving Markov decision problems, Q-learning [11] employs a depth-first search strategy and incrementally optimizes over the visited state variables [24], in contrast to SDP, which enumerates all discrete combinations of state variables to develop a value function and derive the optimal decisions. Owing to this computational advantage, Q-learning has been widely adopted to solve single-reservoir and multi-reservoir optimization operation problems [24,25,26]. For example, Xu et al. [26] suggested a stochastic optimization method integrating Deep Q-Networks (DQN) with aggregation–disaggregation methods for cascade reservoir system optimization operation. However, the tendency to converge to locally optimal solutions, as well as the relatively low interpretability of the decision-making process, still needs to be addressed for Q-learning applications in complex reservoir systems [27,28].
Recently, the use of mathematical characteristics of a reservoir system, e.g., the relationships between the decision variables and state variables [17,29,30], has been proposed to improve MDP-based methods, achieving a high level of computational efficiency and effectiveness. For example, Zhao et al. [29,30,31] proposed improved SDP algorithms based on a monotonic relationship between the decision variable and the state variable for single-reservoir water supply, hydropower generation, and flood control operations. Using the spatially distributed characteristics of individual reservoir storage, Zeng et al. [17] proposed an improved SDP algorithm based on a derived monotonic relationship between individual reservoir carryover storage and the water availability of a parallel reservoir system; its high computational efficiency and effectiveness with controlled solution accuracy were demonstrated in a real-world case. However, the application of the monotonic decision property to overcome the dimensionality problem of cascade reservoir optimization operation has received little attention.
Following the work of Zeng et al. [17], the objective of this paper is to reduce the computational complexity of RL using a derived monotonic relationship between individual reservoir water release decisions and the water availability of the cascade system or of a specific individual reservoir, which provides the theoretical basis for the improved RL algorithm design. To our knowledge, this study is the first to apply the monotonic property to address the curse of dimensionality in cascade reservoir optimization operation.
The rest of this paper is organized as follows: Section 2 presents the formulation of the stochastic dynamic optimization model for the operation of reservoirs in series and specifies the optimality condition for the model without binding inequality constraints. On the basis of this optimality condition, Section 3 derives the first-order derivative relationship between individual reservoir water release decisions and the water availability of the cascade system or of a specific individual reservoir. The monotonic property and the reduced searching region of the improved RL method are described in Section 4. A cascade reservoir system in the Yangtze River of China is presented as a case study in Section 5. Finally, the conclusions of this paper are given in Section 6.

2. Optimality Condition for Multi-Objective Operation of Cascade Reservoir System

This section first describes a long-term optimization problem for the multi-objective operation of the cascade reservoir system, then develops a two-stage optimization operation model based on the stochastic dynamic programming and reinforcement learning methods. Following that, the optimality condition for the multi-objective operation of the cascade reservoir system is derived.

2.1. Stochastic Optimization Operation Model for Cascade Reservoir System

Considering the stochasticity of hydrologic inflow, the long-term operation of the cascade reservoir system illustrated in Figure 1 requires balancing water release for hydropower generation in the mainstream and water release for regional use in the tributaries at each period. The objective is to maximize the expected value of multi-regional basin-wide benefits, integrating diverse water use purposes, such as ecology and hydropower production. In addition, the water release decision at each period must obey the physical and engineering constraints of the cascade reservoir system, which include the mass balance equations and the release and carryover storage bounds. The mathematical expression of the long-term multi-objective optimization operation of the cascade reservoir system can be written as follows:
$$\max_{\mathbf{IR},\,\mathbf{OR}}\; E_{I}\left[\sum_{t=1}^{T}\sum_{i=1}^{n}\left(B_{i}\left(IR_{i,t}\right)+B_{i}'\left(OR_{i,t}\right)\right)\right]\tag{1a}$$
$$\text{s.t.}\quad\begin{cases}WA_{1,t}=S_{1,t-1}+I_{1,t}-E_{1,t}=S_{1,t}+IR_{1,t}+OR_{1,t}, & \forall t\\WA_{i,t}=S_{i,t-1}+I_{i,t}-E_{i,t}=S_{i,t}+IR_{i,t}+OR_{i,t}-IR_{i-1,t}, & i\in[2,n],\ \forall t\\0\leq IR_{i,t}\leq ID_{i,t}, & i\in[1,n],\ \forall t\\0\leq OR_{i,t}\leq OD_{i,t}, & i\in[1,n],\ \forall t\\S_{i,t}^{\min}\leq S_{i,t}\leq S_{i,t}^{\max}, & i\in[1,n],\ \forall t\end{cases}\tag{1b}$$
where $E[\cdot]$ is the expectation function; $B_i(\cdot)$ is the benefit function of in-stream water release from reservoir i to meet hydropower generation or ecological water use [32,33,34,35]; $B_i'(\cdot)$ is the benefit function of out-stream water release from reservoir i to meet the regional consumptive water demand; in particular, both $B_i(\cdot)$ and $B_i'(\cdot)$ are assumed to be concave [16,32,33,34,36]; T is the planning horizon of the cascade reservoir system; n is the number of reservoirs in the cascade system; $WA_{i,t}$ is the water availability of reservoir i at period t, defined as the initial water storage plus the current inflow minus the evaporation loss during the period; $S_{i,t-1}$ is the initial storage of reservoir i at the beginning of period t; $I_{i,t}$ is the inflow of reservoir i at period t; $E_{i,t}$ is the evaporation loss of reservoir i at period t; $IR_{i,t}$ is the water release for hydropower generation from reservoir i into the mainstream at period t; $ID_{i,t}$ is the mainstream water demand at period t; $OR_{i,t}$ is the water supply for the tributary water users of reservoir i at period t; $OD_{i,t}$ is the tributary water demand of reservoir i at period t; $S_{i,t}^{\min}$ and $S_{i,t}^{\max}$ are the minimum and maximum storage of reservoir i, respectively.
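As a concrete illustration of the mass balance constraints in Equation (1b), the following minimal Python sketch (our illustration, not part of the paper's implementation; all function and variable names are hypothetical) propagates the water balance through a cascade, with the upstream mainstream release entering the next reservoir downstream:

```python
# Sketch of Equation (1b): WA_i = S_{i,t-1} + I_i - E_i, and the carryover
# storage implied by the release decisions, where the mainstream release of
# reservoir i-1 enters reservoir i. Illustrative only.
def water_availability(s_prev, inflow, evap, ir, orr):
    """Return (water availability, carryover storage) for each reservoir."""
    wa, s_end = [], []
    for i in range(len(s_prev)):
        wa_i = s_prev[i] + inflow[i] - evap[i]
        upstream = ir[i - 1] if i > 0 else 0.0  # release from reservoir i-1
        # mass balance: WA_i + IR_{i-1} = S_i + IR_i + OR_i
        s_i = wa_i + upstream - ir[i] - orr[i]
        wa.append(wa_i)
        s_end.append(s_i)
    return wa, s_end
```

Feasibility against the bound constraints of Equation (1b) would then be checked by comparing each `s_end[i]` with the storage limits.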

2.2. Stochastic Dynamic Programming for System Optimization Operation

The aforementioned long-term optimization model encompasses a substantial number of water release decisions for each reservoir, and deriving optimal solutions for such an extensive nonlinear optimization model poses significant challenges [1,2,5]. Based on the optimality principle of dynamic programming, the long-term optimization problem can be decomposed into a series of two-stage stochastic optimization problems. If the runoff of each reservoir can be described as a first-order Markov process, then the randomness of the inflow at the subsequent period t + 1 can be characterized by the inflow transition probability $P(\mathbf{I}_{t+1}\mid\mathbf{I}_{t})$ for any specific $\mathbf{I}_{t}$. Under this assumption, the long-term stochastic optimization model can be expressed using the following backward recursive equation:
$$CV_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t}\right)=\max_{\mathbf{IR}_{t},\,\mathbf{OR}_{t}}\left\{B_{t}\left(\mathbf{IR}_{t},\mathbf{OR}_{t}\right)+\sum_{\mathbf{I}_{t+1}}P\left(\mathbf{I}_{t+1}\mid\mathbf{I}_{t}\right)\times CV_{t}\left(\mathbf{S}_{t},\mathbf{I}_{t+1}\right)\right\}\tag{2a}$$
$$B_{t}\left(\mathbf{IR}_{t},\mathbf{OR}_{t}\right)=\sum_{i=1}^{n}\left(B_{i}\left(IR_{i,t}\right)+B_{i}'\left(OR_{i,t}\right)\right)\tag{2b}$$
$$\mathbf{S}_{t-1}=\left[S_{1,t-1},\ldots,S_{n,t-1}\right];\ \mathbf{I}_{t}=\left[I_{1,t},\ldots,I_{n,t}\right];\ \mathbf{IR}_{t}=\left[IR_{1,t},\ldots,IR_{n,t}\right];\ \mathbf{OR}_{t}=\left[OR_{1,t},\ldots,OR_{n,t}\right]\tag{2c}$$
where $\mathbf{S}_{t-1}$, $\mathbf{I}_{t}$, $\mathbf{IR}_{t}$, and $\mathbf{OR}_{t}$ represent the vectors of the initial storage, inflow, and water release in the mainstream and tributaries of the cascade reservoir system at period t, respectively; $CV(\cdot)$ is the value function of the cascade reservoir system at period t, representing the maximum benefit that can be achieved from period t to the end of the planning horizon, conditioned on the inflow and storage state combination. Previous studies [35,37] have shown that because the value function represents the cumulative utilities from period t to T, it shares similar mathematical characteristics with the benefit function [29,31]. In other words, the value function $CV(\cdot)$ is concave.
To maintain consistency with the mathematical expression of the benefit function used in the reinforcement learning method, Equation (2a) is rewritten with the Q-value function as follows:
$$Q_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t},\mathbf{IR}_{t},\mathbf{OR}_{t}\right)=B_{t}\left(\mathbf{IR}_{t},\mathbf{OR}_{t}\right)+\sum_{\mathbf{I}_{t+1}}P\left(\mathbf{I}_{t+1}\mid\mathbf{I}_{t}\right)\times\max_{\mathbf{IR}_{t+1},\,\mathbf{OR}_{t+1}}Q_{t}\left(\mathbf{S}_{t},\mathbf{I}_{t+1},\mathbf{IR}_{t+1},\mathbf{OR}_{t+1}\right)\tag{3a}$$
$$CV_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t}\right)=\max_{\mathbf{IR}_{t},\,\mathbf{OR}_{t}}Q_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t},\mathbf{IR}_{t},\mathbf{OR}_{t}\right)\tag{3b}$$
where $Q(\cdot)$ is the approximation of $CV(\cdot)$ used in the reinforcement learning method.
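The recursion in Equation (2a) can be sketched as a backward pass over discretized states. The following toy Python example (our illustration, collapsed to a single reservoir with scalar storage so the structure stays visible; the grids, transition matrix, and benefit function are placeholders) mirrors that structure:

```python
# Toy backward recursion in the shape of Equation (2a), for one reservoir.
# Grids, the transition matrix P, and the benefit function are placeholders.
def feasible_moves(s, inflow, storages):
    """Enumerate (release, carryover) pairs consistent with the mass balance."""
    for s_next in storages:
        r = s + inflow - s_next
        if r >= 0:
            yield r, s_next

def backward_sdp(T, storages, inflows, P, benefit):
    """CV[t][(s, q)] = max_r { benefit(r) + sum_q' P[q][q'] * CV[t+1][(s', q')] }."""
    CV = {T: {(s, q): 0.0 for s in storages for q in range(len(inflows))}}
    policy = {}
    for t in range(T - 1, -1, -1):
        CV[t] = {}
        for s in storages:
            for q in range(len(inflows)):
                best, best_r = float('-inf'), None
                for r, s_next in feasible_moves(s, inflows[q], storages):
                    val = benefit(r) + sum(P[q][q2] * CV[t + 1][(s_next, q2)]
                                           for q2 in range(len(inflows)))
                    if val > best:
                        best, best_r = val, r
                CV[t][(s, q)] = best
                policy[(t, s, q)] = best_r
    return CV, policy
```

The full cascade model enumerates vector-valued states, which is exactly the exponential blow-up the improved method targets.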

2.3. Reinforcement Learning for System Optimization Operation

In order to construct the value function of the cascade reservoir system, the dynamic programming method based on the Bellman equation requires perfect information on the underlying stochastic hydrologic processes throughout the entire planning horizon. Although this requirement guarantees the best solutions for long-term operation, it also gives rise to the well-known "curse of dimensionality". To make the searching process efficient and reduce the computational complexity, the Q-learning method in reinforcement learning adopts random sampling to gradually approximate the value function through incremental temporal-difference updates. The multi-objective optimization model of the cascade reservoir system in the Q-learning approach can be written as follows:
$$Q_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t},\mathbf{IR}_{t},\mathbf{OR}_{t}\right)\leftarrow Q_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t},\mathbf{IR}_{t},\mathbf{OR}_{t}\right)+\alpha\left[B_{t}\left(\mathbf{IR}_{t},\mathbf{OR}_{t}\right)+\max_{\mathbf{IR}_{t+1},\,\mathbf{OR}_{t+1}}Q_{t}\left(\mathbf{S}_{t},\mathbf{I}_{t+1},\mathbf{IR}_{t+1},\mathbf{OR}_{t+1}\right)-Q_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t},\mathbf{IR}_{t},\mathbf{OR}_{t}\right)\right]\tag{4}$$
where α is the learning rate parameter controlling how strongly each sampled temporal-difference error, i.e., the gap between the immediate water use benefit plus the future value return and the current estimate, updates the Q-value.
Notably, the Q-learning approach only carries out incremental optimization for the sampled state combinations, as opposed to the exhaustive enumeration of every discrete combination of carryover storage at each period in SDP. This paper specifically assumes that the immediate water use benefits ($B_i(\cdot)$ and $B_i'(\cdot)$) are available at each period. Therefore, for any state combination, the convergence condition of the Q-value function is:
$$Q_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t},\mathbf{IR}_{t},\mathbf{OR}_{t}\right)\rightarrow B_{t}\left(\mathbf{IR}_{t},\mathbf{OR}_{t}\right)+\max_{\mathbf{IR}_{t+1},\,\mathbf{OR}_{t+1}}Q_{t}\left(\mathbf{S}_{t},\mathbf{I}_{t+1},\mathbf{IR}_{t+1},\mathbf{OR}_{t+1}\right)\tag{5}$$
Combining the above equation with Equation (3b), we have:
$$CV_{t-1}\left(\mathbf{S}_{t-1},\mathbf{I}_{t}\right)=\max_{\mathbf{IR}_{t},\,\mathbf{OR}_{t}}\left\{B_{t}\left(\mathbf{IR}_{t},\mathbf{OR}_{t}\right)+\max_{\mathbf{IR}_{t+1},\,\mathbf{OR}_{t+1}}Q_{t}\left(\mathbf{S}_{t},\mathbf{I}_{t+1},\mathbf{IR}_{t+1},\mathbf{OR}_{t+1}\right)\right\}\tag{6}$$
According to the equation above, "Q-Learning converges to the optimum action values with probability 1 so long as all actions are repeatedly sampled in all states and the action values are represented discretely" [21] (p. 279). It follows that the optimality conditions of the Q-learning and dynamic programming methods are equivalent, while the computational complexity of Q-learning can be significantly lower.
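The temporal-difference update of Equation (4) can be sketched in a few lines of Python for a tabular Q-function (an illustrative fragment, not the paper's implementation; the state and action encodings are hypothetical):

```python
# Tabular Q-learning update in the spirit of Equation (4). No discount factor
# appears, matching the finite-horizon recursion. Encodings are illustrative.
def q_update(Q, state, action, reward, next_state, actions, alpha):
    """Q(s,a) <- Q(s,a) + alpha * (r + max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    td_error = reward + best_next - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
    return Q[(state, action)]
```

Here `reward` plays the role of the immediate water use benefit $B_t$, and `best_next` the role of the future value return.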

2.4. The Optimality Principle of Equal Marginal Utility

If boundary constraints on release and carryover storage (i.e., Equation (1b)) are non-binding, the Lagrange multiplier approach [36] can be used to specify the optimality condition of the two-stage optimization model with the mass balance equation mentioned above, which leads to:
$$\begin{cases}\dfrac{dB_{i}}{dIR_{i,t}^{*}}+\dfrac{\partial Q_{t}}{\partial S_{i,t}^{*}}\times\dfrac{\partial S_{i,t}^{*}}{\partial IR_{i,t}^{*}}+\dfrac{\partial Q_{t}}{\partial S_{i+1,t}^{*}}\times\dfrac{\partial S_{i+1,t}^{*}}{\partial IR_{i,t}^{*}}=0\\[2ex]\dfrac{dB_{i}'}{dOR_{i,t}^{*}}+\dfrac{\partial Q_{t}}{\partial S_{i,t}^{*}}\times\dfrac{\partial S_{i,t}^{*}}{\partial OR_{i,t}^{*}}=0\end{cases}\;\Rightarrow\;\begin{cases}\dfrac{dB_{i}}{dIR_{i,t}^{*}}=\dfrac{\partial Q_{t}}{\partial S_{i,t}^{*}}-\dfrac{\partial Q_{t}}{\partial S_{i+1,t}^{*}}\\[2ex]\dfrac{dB_{i}'}{dOR_{i,t}^{*}}=\dfrac{\partial Q_{t}}{\partial S_{i,t}^{*}}\end{cases}\quad i\neq n\tag{7a}$$
$$\begin{cases}\dfrac{dB_{n}}{dIR_{n,t}^{*}}+\dfrac{\partial Q_{t}}{\partial S_{n,t}^{*}}\times\dfrac{\partial S_{n,t}^{*}}{\partial IR_{n,t}^{*}}=0\\[2ex]\dfrac{dB_{n}'}{dOR_{n,t}^{*}}+\dfrac{\partial Q_{t}}{\partial S_{n,t}^{*}}\times\dfrac{\partial S_{n,t}^{*}}{\partial OR_{n,t}^{*}}=0\end{cases}\;\Rightarrow\;\begin{cases}\dfrac{dB_{n}}{dIR_{n,t}^{*}}=\dfrac{\partial Q_{t}}{\partial S_{n,t}^{*}}\\[2ex]\dfrac{dB_{n}'}{dOR_{n,t}^{*}}=\dfrac{\partial Q_{t}}{\partial S_{n,t}^{*}}\end{cases}\quad i=n\tag{7b}$$
The formulas above illustrate the equal marginal utility optimality principle for the multi-objective optimization operation of the cascade reservoir system. That is, (1) for an individual reservoir not at the lowest elevation ($i\neq n$), the marginal benefit of releasing water for tributary use should equal the marginal benefit of carryover storage, and the marginal benefit of releasing water in the mainstream for hydropower generation should equal the difference between this reservoir's and the downstream reservoir's marginal benefits of carryover storage; (2) for the individual reservoir at the lowest elevation ($i=n$), the marginal benefits of releasing water for both mainstream and tributary use should equal the marginal benefit of carryover storage.
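As a numerical illustration of the equal marginal utility principle, consider allocating a fixed amount of available water among competing uses with concave power benefits. Under the assumed form $B_j(x)=w_j x^a$ (illustrative weights and exponent, not the paper's calibrated functions), equating marginal benefits yields a closed-form allocation:

```python
# Toy illustration of the equal-marginal-utility condition: split available
# water WA among uses so their marginal benefits are equal. For B_j(x)=w_j*x^a
# the marginal is a*w_j*x^(a-1); equality implies x_j proportional to
# w_j^(1/(1-a)). Benefit forms and weights are illustrative.
def allocate_equal_marginal(wa, weights, a=0.5):
    p = 1.0 / (1.0 - a)
    shares = [w ** p for w in weights]
    total = sum(shares)
    return [wa * s / total for s in shares]
```

For the most downstream reservoir ($i=n$), the three "uses" would be the mainstream release, the tributary release, and the carryover storage valued through $Q_t$.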

3. Monotonic Property for Multi-Objective Optimization Operation of Cascade Reservoir Systems

This section derives the monotonic relationship between the optimal water release decision of an individual reservoir and the (total) water availability of the cascade system or a specific individual reservoir.

3.1. Monotonic Property of Water Release for the Tributary Use

For further mathematical derivation, two assumptions regarding the synchronicity and substitutability are used to characterize the spatial distribution of individual reservoir storage.
Assumption 1.
The carryover storage of each reservoir is a non-decreasing function of the cascade system's total storage (the synchronicity property). The mathematical expression of this assumption is
$$\frac{\partial S_{i,t}^{*}}{\partial S_{t}^{Sum}}\geq 0\tag{8a}$$
$$S_{t}^{Sum}=\sum_{i=1}^{n}S_{i,t}^{*}\tag{8b}$$
In these equations, $S_{t}^{Sum}$ is the total carryover storage of the cascade system at period t. The storage state synchronicity assumption (Equation (8a)) indicates that the reservoir system should allocate any increase in storage according to the prior optimal storage distribution. Theoretical research and empirical findings [14,15,16,17] have demonstrated that reservoir operation rules based on this assumption can successfully reduce the system's unproductive spillage and improve the coordination of water supply and storage decisions across individual reservoirs [16,38,39]. Therefore, this property is widely adopted in the operation of complex reservoir systems, for example, through storage target curves [40,41] and the parametric rule [42].
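A minimal way to realize Assumption 1 in practice, in the spirit of storage target curves, is to allocate the system total along fixed fractions capped at capacity, so that each reservoir's storage is a non-decreasing function of the system total (a sketch under assumed fractions, not the paper's rule):

```python
# Sketch of a synchronic storage distribution (Assumption 1): each reservoir
# receives a fixed fraction of the system total, capped at its capacity.
# Fractions and capacities are illustrative.
def synchronic_allocation(total_storage, fractions, s_max):
    """Each entry is non-decreasing in total_storage, satisfying Eq. (8a)."""
    return [min(total_storage * f, cap) for f, cap in zip(fractions, s_max)]
```

Raising `total_storage` can only raise (or leave unchanged) each reservoir's allocation, which is exactly the non-decreasing property of Equation (8a).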
Assumption 2.
The stored water is substitutable among the reservoirs of the system. The mathematical expression of resource substitutability is defined as follows [17]:
$$\frac{\partial^{2}Q_{t}}{\partial S_{i,t}^{*}\,\partial\left(S_{t}^{Sum}-S_{i,t}^{*}\right)}\leq 0\tag{9}$$
The above equation indicates that if the total storage of the other reservoirs increases, the marginal value of carryover storage for each reservoir either decreases or stays the same, because more water is available in the system for both mainstream and tributary demands, particularly in successive drought years [17].
Based on the above two assumptions, the monotonic relationship between the optimal water release for tributary use and the total system water availability at period t can be derived.
For any two optimal total carryover storages of the cascade system at period t, $S_{t}^{Sum,1}$ and $S_{t}^{Sum,2}$, if $S_{t}^{Sum,1}\geq S_{t}^{Sum,2}$, then according to Assumption 1 we have $S_{i,t}^{1}\geq S_{i,t}^{2}$. By contradiction, it can be deduced that $S_{t}^{Sum,1}-S_{i,t}^{1}\geq S_{t}^{Sum,2}-S_{i,t}^{2}$; otherwise, there would exist an individual reservoir whose carryover storage satisfies $S_{i,t}^{1}<S_{i,t}^{2}$, in violation of Assumption 1.
Subsequently, according to Assumption 2, we have
$$\frac{\partial Q_{t}}{\partial S_{i,t}^{1}}\left(S_{1,t}^{1},\ldots,S_{n,t}^{1}\right)\leq\frac{\partial Q_{t}}{\partial S_{i,t}^{1}}\left(S_{1,t}^{2},\ldots,S_{n,t}^{2}\right)\tag{10}$$
According to the concavity of the value function (decreasing marginal benefit), i.e., conditioned on the unchanged carryover storage of other reservoirs in the system, a higher storage will lead to a lower marginal benefit:
$$\frac{\partial Q_{t}}{\partial S_{i,t}^{1}}\left(S_{1,t}^{2},\ldots,S_{n,t}^{2}\right)\leq\frac{\partial Q_{t}}{\partial S_{i,t}^{2}}\left(S_{1,t}^{2},\ldots,S_{n,t}^{2}\right)\tag{11}$$
Combining Equations (10) and (11), we can obtain:
$$\frac{\partial Q_{t}}{\partial S_{i,t}^{1}}\left(S_{1,t}^{1},\ldots,S_{n,t}^{1}\right)\leq\frac{\partial Q_{t}}{\partial S_{i,t}^{2}}\left(S_{1,t}^{2},\ldots,S_{n,t}^{2}\right)\tag{12}$$
Based on the marginal benefit relationships in Equations (7a) and (7b), it can be derived that $\frac{dB_{i}'}{dOR_{i,t}^{1}}\leq\frac{dB_{i}'}{dOR_{i,t}^{2}}$. According to the decreasing marginal benefit property of water release, the optimal water release for tributary use must satisfy $OR_{i,t}^{1}\geq OR_{i,t}^{2}$. Similarly, it can be deduced that the optimal water release for hydropower generation from the reservoir at the lowest elevation ($i=n$) must satisfy $IR_{n,t}^{1}\geq IR_{n,t}^{2}$. The cascade system's total water balance equation can then be used to obtain:
$$WA_{t}^{Sum,1}=S_{t}^{Sum,1}+\sum_{i=1}^{n}OR_{i,t}^{1}+IR_{n,t}^{1}\geq S_{t}^{Sum,2}+\sum_{i=1}^{n}OR_{i,t}^{2}+IR_{n,t}^{2}=WA_{t}^{Sum,2}\tag{13a}$$
$$WA_{t}^{Sum}=\sum_{i=1}^{n}WA_{i,t}\tag{13b}$$
That is, if the total carryover storage of the cascade reservoir system increases, the carryover storage of each reservoir, water release for the tributary use, and the total available water of the cascade system should increase or remain unchanged, and vice versa. As a result, the non-decreasing monotonicity between a cascade reservoir system’s total water availability, water release for tributary use, and optimal carryover storage is determined.

3.2. Monotonic Property of Water Release for the Mainstream Use

Assume that only the water availability of reservoir k increases while the water availability of the other reservoirs remains the same, i.e., for any two available water combinations, $WA_{k,t}^{1}\geq WA_{k,t}^{2}$ and $WA_{i,t}^{1}=WA_{i,t}^{2}$ ($i\neq k$). According to the mass balance equation in Equation (1b), the optimal water releases from the upstream and downstream reservoirs can be expressed as:
$$IR_{k-1,t}=\sum_{i=1}^{k-1}\left(WA_{i,t}-S_{i,t}-OR_{i,t}\right)\tag{14a}$$
$$IR_{k,t}=\sum_{i=k+1}^{n}\left(S_{i,t}+OR_{i,t}-WA_{i,t}\right)+IR_{n,t}\tag{14b}$$
The tributary release and carryover storage of each reservoir satisfy the non-decreasing monotonic property derived above. Therefore, according to Equation (14a), the water release of the upstream reservoir satisfies $IR_{k-1,t}^{1}\leq IR_{k-1,t}^{2}$, since the water availability of the upstream reservoirs ($i<k$) remains constant.
Similarly, from Equation (14b), it can be deduced that the water release of reservoir k and its downstream reservoirs should satisfy $IR_{k,t}^{1}\geq IR_{k,t}^{2}$.
The above derivation implies that as the available water of a specific individual reservoir increases, the water release for mainstream demand from this reservoir and the corresponding downstream reservoirs should not decrease, while that from the upstream reservoirs should not increase. That is, a non-decreasing monotonicity holds between the mainstream release from a particular reservoir (or any downstream reservoir) and the water availability of that individual reservoir, whereas a non-increasing monotonicity holds between the upstream releases and that water availability.
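Equations (14a) and (14b) can be checked numerically on a small cascade. The sketch below (our illustration with made-up data; reservoir index k is 1-based) recovers the mainstream releases from the telescoped mass balance:

```python
# Numerical check of Equations (14a)-(14b): mainstream releases recovered by
# telescoping the per-reservoir mass balances. Arrays are illustrative.
def mainstream_release_upstream(wa, s, orr, k):
    """IR_{k-1,t} = sum_{i=1}^{k-1} (WA_i - S_i - OR_i), k 1-based."""
    return sum(wa[i] - s[i] - orr[i] for i in range(k - 1))

def mainstream_release_at_k(wa, s, orr, ir_n, k, n):
    """IR_{k,t} = sum_{i=k+1}^{n} (S_i + OR_i - WA_i) + IR_{n,t}, k 1-based."""
    return sum(s[i] + orr[i] - wa[i] for i in range(k, n)) + ir_n
```

For a consistent cascade (e.g., WA = [3, 2, 2], S = [1, 1, 1], OR = [0.5, 0.5, 0.5], which implies IR = [1.5, 2.0, 2.5]), both formulas reproduce the releases obtained by stepping through the balance reservoir by reservoir.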

3.3. Quantitative Analysis Based on the Monotonic Property

Based on the monotonicity of the water storage and release decisions of the cascade reservoir system, the following quantitative analysis can be further conducted:
If the cascade system's total available water increases by an equal interval length (i.e., IL), the combined increase in the total water storage and the tributary releases would not exceed IL, according to the system's total water balance equation (Equation (13a)). Following the non-decreasing monotonic relationship between the optimal decision of an individual reservoir and the total available water of the system, it can be deduced by contradiction that the increase in the carryover storage and water release of any individual reservoir will not exceed IL.
Similarly, if reservoir k's water availability increases by one interval length (i.e., IL) while the available water of the other reservoirs stays constant, then both the decrease in release from the upstream reservoirs and the increase in release from the downstream reservoirs will not exceed IL, since the total increment or decrement does not exceed IL. By the same chain of reasoning, the monotonic property is maintained in the subsequent operation periods.
In summary, the monotonic relationships between the different water releases and the water availability in the cascade system are shown in Figure 2.

4. Improved Reinforcement Learning Method for Optimization Operation of Cascade Reservoir System

Standard reinforcement learning for the joint optimization operation of the cascade reservoir system does not utilize prior knowledge of the water mass balance of each reservoir, nor of the hydrologic connection between upper and lower reservoirs, to specify the monotonic decision-making property. Based on the monotonic relationship between the optimal release decisions and the water availability, the efficiency of the reinforcement learning method can be effectively enhanced. The main principle of the improved optimization method can be summarized in two steps: initial state optimization and searching space reduction, as shown in Figure 3.
(1) Initial state optimization: First, the storage volume of each reservoir at each period is discretized with an equal interval length (i.e., IL); the number of discretization levels of reservoir i is $NS_i$. After determining the feasible space of the water release decision for each initial state combination based on the water supply and storage constraints, Newton's method is applied to solve for the optimal releases to the mainstream and tributary from the reservoir water balance equation (Equation (1b)) and the marginal benefit equalization equations (Equations (7a) and (7b)). The feasibility of the decision is then checked against the water release and carryover storage inequality constraints: if a decision variable violates one of these constraints, following the suggested procedure [43], the boundary of the inequality constraint is set as the optimal value, and the corresponding water release is adjusted according to the mass balance equation. In this process, the reinforcement learning method employs incremental sampling to identify the set of optimal water release solutions for each reservoir over the scheduling horizon.
(2) Searching space reduction: Assume that the optimal water releases and carryover storages over all periods corresponding to the j-th sample are $\{IR_{1}^{\#},\ldots,IR_{T}^{\#}\}$, $\{OR_{1}^{\#},\ldots,OR_{T}^{\#}\}$, and $\{S_{1}^{\#},\ldots,S_{T}^{\#}\}$, respectively. If the initial water storage state of the j-th sample increases by IL, then the tributary release can be taken as the decision variable in Newton's method, and the mainstream release can be specified using Equation (14a). According to the monotonicity of the water release decision, the optimal release must lie within the range between $\{OR_{1}^{\#},\ldots,OR_{T}^{\#}\}$ and $\{OR_{1}^{\#}+IL,\ldots,OR_{T}^{\#}+IL\}$. As a result, the improved method only searches a feasible interval of width IL, which yields a higher computational efficiency than searching the entire feasible region, as the traditional reinforcement learning method does.
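The effect of step (2) can be sketched as follows: once the optimal tributary release for one sample is known, the optimum for a sample whose initial storage is one interval larger is sought only within a window of width IL. The fragment below uses a hypothetical grid scan in place of the paper's Newton iteration, with a placeholder benefit function:

```python
# Sketch of the search-space reduction: scan only [or_prev, or_prev + il]
# instead of the whole feasible range. `benefit` maps a candidate release to
# total (immediate + future) value; it is a placeholder here.
def reduced_search(benefit, or_prev, il, grid_points=11):
    """Return the best tributary release within the IL-wide window."""
    candidates = [or_prev + il * k / (grid_points - 1) for k in range(grid_points)]
    return max(candidates, key=benefit)
```

Because the window width is fixed at IL regardless of reservoir size, the per-state search cost no longer grows with the span of the feasible release range.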

5. Case Study

5.1. System Description

As illustrated in Figure 4, the WB-Xll-Xjb-Sx reservoir system is a giant cascade reservoir system on the mainstream of the upper reaches of the Yangtze River in China. Specifically, the storage capacity of the WB reservoir is 13.46 billion m3, and its annual average runoff is 241.34 billion m3, with an annual demand of 120 billion m3 in the mainstream and 3.20 billion m3 in the tributary; the storage capacity of the Xll reservoir is 6.46 billion m3, and its annual average runoff is 20.64 billion m3, with an annual demand of 65 billion m3 in the mainstream and 0.5 billion m3 in the tributary; the storage capacity of Xjb is 0.90 billion m3, and its annual average runoff is 7.28 billion m3, with an annual demand of 15 billion m3 in the mainstream and 1.8 billion m3 in the tributary; the storage capacity of Sx is 22.15 billion m3, and its annual average runoff is 280.18 billion m3, with an annual demand of 200 billion m3 in the mainstream and 4.5 billion m3 in the tributary.
In this paper, the commonly used benefit function [44] is set as the objective function to evaluate the system operation performance, as below:
$$B_{i}\left(IR_{i,t}\right)=\begin{cases}\left(\dfrac{E_{i,t}}{IE_{i,t}}\right)^{\alpha}=\left(\dfrac{\gamma\,IR_{i,t}}{\gamma\,ID_{i,t}}\right)^{\alpha}=\left(\dfrac{IR_{i,t}}{ID_{i,t}}\right)^{\alpha}, & IR_{i,t}\leq ID_{i,t}\\[2ex]1, & IR_{i,t}>ID_{i,t}\end{cases}\tag{15a}$$
$$B_{i}'\left(OR_{i,t}\right)=\begin{cases}\left(\dfrac{OR_{i,t}}{OD_{i,t}}\right)^{\alpha}, & OR_{i,t}\leq OD_{i,t}\\[2ex]1, & OR_{i,t}>OD_{i,t}\end{cases}\tag{15b}$$
where $E_{i,t}$ and $IE_{i,t}$ are the hydropower generation of the actual release and of the maximum release from reservoir i to the mainstream at period t, respectively; γ is the transition coefficient from water to hydropower, obtained from experimental tests; α is the exponential parameter of the benefit function generated by the water release, with α < 1 ensuring concavity. According to the literature [35] and a parameter sensitivity analysis, the value of α is set to 0.8 in this paper. The weighting factors for the various water utilities are assumed to be equal to 1 for the sake of the subsequent comparison analysis.
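Equations (15a) and (15b) reduce to a single concave ratio benefit capped at 1. A direct Python transcription (with α = 0.8 as adopted in the paper) reads:

```python
# Benefit of a release against its demand, per Equations (15a)-(15b):
# a concave power of the demand-satisfaction ratio, capped at 1 once the
# demand is fully met. alpha = 0.8 as in the paper.
def release_benefit(release, demand, alpha=0.8):
    if release > demand:
        return 1.0
    return (release / demand) ** alpha
```

The same function serves for both mainstream ($IR/ID$) and tributary ($OR/OD$) releases, since the γ coefficients cancel in Equation (15a).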

5.2. Comparison Analysis of Optimization Results

The comparison analysis of the multi-objective optimization techniques for the WB-Xll-Xjb-Sx cascade reservoir system divides a hydrological year into 12 periods, with a monthly operation time step. Historical inflow runoff from 1959 to 2023 is collected for each reservoir to generate the corresponding runoff state transition probability function. The inflow of each reservoir is discretized into four levels, and six storage discretization levels (i.e., interval lengths), ranging from 0.2 to 0.4 billion m3, are employed for the algorithm implementation. The stochastic dynamic programming (SDP), reinforcement learning (RL), and improved reinforcement learning (IRL) methods, with a learning rate of 0.9, are implemented in GNU Fortran to solve the optimization operation of the cascade reservoir system on a ThinkPad T15g workstation (i7-10750H/16 GB), provided by Lenovo, Beijing, China. The optimization results under different storage discretization levels are shown in Figure 5, Figure 6 and Figure 7.
To compare the results from different methods, the optimization results of SDP are set as a benchmark, and the differences between SDP and RL or SDP and IRL are measured by relative error in Figure 5. As can be seen, there is no significant difference in the optimization results obtained by the three optimization algorithms under six different discretization intervals, with relative errors all kept within 10%. Only when the reservoir capacity discretization interval is set to 0.24 billion m3, the relative errors between SDP and RL, as well as the error between SDP and IRL, both reach the maximum of 5.73% and 9.87%, respectively. As further illustrated in Figure 5b,c, the maximum relative error mainly results from the optimization operation of the Sx reservoir, which has a relatively large storage capacity and water demand.
The computing times of SDP, RL, and IRL under different discretization levels are illustrated in Figure 6. The computation time of IRL is consistently less than 0.10% of that of SDP and less than 5% of that of RL across all storage discretization levels. For example, under the discretization level of 0.32 billion m3 and a one-year time horizon, SDP takes 4109.47 s and RL takes 52.83 s, while IRL takes only 2.10 s. This result indicates the high computational efficiency of IRL.
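The quoted percentage bounds follow directly from the reported timings:

```python
# Quick arithmetic check of the speedups implied by the timings reported in
# the text for the 0.32 billion m3 discretization level.
t_sdp, t_rl, t_irl = 4109.47, 52.83, 2.10   # seconds, from the text

assert t_irl / t_sdp < 0.0010   # IRL takes < 0.10% of SDP's time (~0.05%)
assert t_irl / t_rl < 0.05      # IRL takes < 5% of RL's time (~4.0%)
```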
The carryover storage distribution of the cascade reservoir system in September is given in Figure 7. As can be seen from Figure 7a,b, the carryover storage of the Xll reservoir obtained from SDP or RL does not always increase as the total storage of the cascade reservoir system increases, particularly at high system storage levels. In contrast, the carryover storage of every individual reservoir derived from IRL increases with the total storage of the cascade system in Figure 7c, which follows the synchronic storage distribution property. Since September is typically defined as the refill period in the Yangtze River, a relatively high storage of the system should seldom result in a relatively low storage of an individual reservoir; the non-monotonic SDP and RL results are therefore inconsistent with the previous numerical study [45] and the commonly used operating rules [32,33,34]. These distribution results imply the relatively high interpretability of the IRL solutions.
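The synchronicity check behind this comparison can be expressed compactly: each reservoir's carryover storage should be non-decreasing along increasing total system storage. The storage arrays below are hypothetical illustrations of the two behaviors, not values from Figure 7:

```python
import numpy as np

# Sketch of the synchronicity check discussed above: a reservoir's carryover
# storage should be non-decreasing in the system's total storage.
# All storage arrays below are hypothetical.
def is_synchronic(total_storage, carryover):
    """True if carryover is non-decreasing along increasing total storage."""
    order = np.argsort(total_storage)
    return bool(np.all(np.diff(carryover[order]) >= -1e-9))

total = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # system storage levels
xll_a = np.array([0.3, 0.5, 0.8, 1.1, 1.5])     # monotone (IRL-like behavior)
xll_b = np.array([0.3, 0.6, 0.5, 1.2, 1.0])     # dips at high storage (RL-like)

assert is_synchronic(total, xll_a)
assert not is_synchronic(total, xll_b)
```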

6. Conclusions

Based on the synchronicity and substitutability property of storage distribution in the cascade reservoir system, the first-order derivative relationship between individual reservoir water release decisions and the water availability of the cascade system or of a specific individual reservoir is derived. The theoretical analysis shows that: (1) the optimal carryover storage and the water release for tributary use from each individual reservoir are non-decreasing monotonic functions of the cascade reservoir system's overall water availability; (2) the water release for mainstream use from a specific reservoir, or from any downstream reservoir, is a non-decreasing monotonic function of the water availability of that individual reservoir, whereas the relationship between upstream water release and that water availability is characterized by non-increasing monotonicity.
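Property (1) can be illustrated numerically on a toy one-reservoir, two-use allocation with a concave benefit: as water availability W grows, both the optimal release and the optimal carryover grow. This toy brute-force model is an illustration of the monotonic property, not the paper's full cascade formulation:

```python
import numpy as np

# Toy numeric illustration of property (1): with a concave benefit
# (exponent < 1), the optimal split of availability W between release and
# carryover makes both non-decreasing in W. One reservoir, two stages.
ALPHA = 0.8

def optimal_split(W, grid=2001):
    """Brute-force the release r in [0, W] maximizing r^a + (W - r)^a."""
    r = np.linspace(0.0, W, grid)
    val = r ** ALPHA + (W - r) ** ALPHA
    best = r[np.argmax(val)]
    return best, W - best          # (release, carryover storage)

W_levels = np.linspace(0.5, 5.0, 10)
releases, carryovers = zip(*(optimal_split(W) for W in W_levels))

assert np.all(np.diff(releases) >= -1e-6)    # release non-decreasing in W
assert np.all(np.diff(carryovers) >= -1e-6)  # carryover non-decreasing in W
```

With a symmetric concave objective the optimum splits W evenly, so both decisions grow linearly with availability; the paper's derivation generalizes this behavior to the multi-reservoir setting.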
Considering the derived monotonic properties among carryover storage, water release decisions of each individual reservoir, and the water availability of the cascade reservoir system, an improved reinforcement learning method with a reduced search region has been developed. To demonstrate the effectiveness and efficiency of the proposed algorithm, the WB-Xll-Xjb-Sx reservoir system in the upper reaches of the Yangtze River in China is employed as a case study. Numerical results from this case study indicate that, compared to stochastic dynamic programming (SDP) and traditional reinforcement learning (RL), the improved method significantly reduces computation time without substantial loss of solution accuracy, while preserving highly interpretable optimization results.
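The core idea of the reduced search region can be sketched in a few lines of tabular Q-learning: at a given water-availability state, actions below the greedy action already learned at the next-lower availability state need not be searched, since the optimal release is non-decreasing in availability. Everything below is a toy sketch (states, actions, reward, and exploration scheme are all hypothetical); only the learning rate of 0.9 comes from the text:

```python
import random

# Hedged sketch of the IRL idea: prune the Q-learning action search using
# the non-decreasing monotonic property. All numbers are toy values except
# the learning rate of 0.9, which matches the text.
random.seed(0)
N_STATES, N_ACTIONS, EPISODES = 10, 10, 2000
ALPHA_LR, GAMMA_DF, EPS = 0.9, 0.9, 0.1

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def reward(state, action):
    # Toy concave benefit of releasing `action` out of availability `state`.
    release = min(action, state)
    return release ** 0.8 + (state - release) ** 0.8

for _ in range(EPISODES):
    s = random.randrange(N_STATES)
    # Monotonic pruning: at availability s, skip actions below the greedy
    # action of the next-lower availability state s - 1.
    lo = max(range(N_ACTIONS), key=Q[s - 1].__getitem__) if s > 0 else 0
    if random.random() < EPS:
        a = random.randrange(lo, N_ACTIONS)             # explore (pruned)
    else:
        a = max(range(lo, N_ACTIONS), key=Q[s].__getitem__)  # exploit (pruned)
    s2 = random.randrange(N_STATES)                     # toy stochastic step
    target = reward(s, a) + GAMMA_DF * max(Q[s2])
    Q[s][a] += ALPHA_LR * (target - Q[s][a])
```

The pruning shrinks the per-state action scan from the full action set to a suffix of it, which is the source of the speedups reported in Section 5.2.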

Author Contributions

Conceptualization, X.L.; Methodology, X.L.; Software, S.C.; Validation, H.M.; Formal analysis, X.L.; Investigation, H.M.; Resources, S.C.; Data curation, Y.X.; Writing—original draft, X.L.; Writing—review & editing, X.Z.; Visualization, Y.X.; Supervision, X.Z.; Project administration, X.Z.; Funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Major Science and Technology Projects of Ministry of Water Resources (SKS-2022119), the Open Research Fund of Hubei Key Laboratory of Intelligent Yangtze and Hydroelectric Science, China Yangtze Power Co., Ltd. (2422020009), National Key Research and Development Program of China (2023YFC3208403) and National Natural Science Foundation of China (52209016; 52479025).

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request, due to privacy and legislation.

Conflicts of Interest

Authors Xiang Li, Haoyu Ma and Yang Xu were employed by China Yangtze Power Co., Ltd. The research design, analysis, and conclusions presented in this paper were conducted independently and do not represent the official views or interests of the company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yeh, W.W.G. Reservoir Management and Operations Models: A State-of-the-Art Review. Water Resour. Res. 1985, 21, 1797–1818. [Google Scholar] [CrossRef]
  2. Oliveira, R.; Loucks, D.P. Operating Rules for Multireservoir Systems. Water Resour. Res. 1997, 33, 839–852. [Google Scholar] [CrossRef]
  3. Ahmad, A.; El-Shafie, A.; Razali, S.F.M.; Mohamad, Z.S. Reservoir Optimization in Water Resources: A Review. Water Resour. Manag. 2014, 28, 3391–3405. [Google Scholar] [CrossRef]
  4. Wurbs, R.A. Reservoir System Simulation and Optimization Models. J. Water Resour. Plan. Manag. 1993, 119, 455–472. [Google Scholar] [CrossRef]
  5. Labadie, J. Optimal Operation of Multi-Reservoir Systems: State-of-the-Art Review. J. Water Resour. Plan. Manag. 2004, 130, 93–111. [Google Scholar] [CrossRef]
  6. Rani, D.; Moreira, M.M. Simulation-Optimization Modeling: A Survey and Potential Application in Reservoir Systems Operation. Water Resour. Manag. 2010, 24, 1107–1138. [Google Scholar] [CrossRef]
  7. Dobson, B.; Wagener, T.; Pianosi, F. An Argument-Driven Classification and Comparison of Reservoir Operation Optimization Methods. Adv. Water Resour. 2019, 128, 74–86. [Google Scholar] [CrossRef]
  8. Giuliani, M.; Lamontagne, J.R.; Reed, P.M.; Castelletti, A. A State-of-the-Art Review of Optimal Reservoir Control for Managing Conflicting Demands in a Changing World. Water Resour. Res. 2021, 57, e2021WR029927. [Google Scholar] [CrossRef]
  9. Mendoza-Ramírez, R.; Silva, R.; Domínguez-Mora, R.; Juan-Diego, E.; Carrizosa-Elizondo, E. Comparison of Two Convergence Criterion in the Optimization Process Using a Recursive Method in a Multi-Reservoir System. Water 2022, 14, 2952. [Google Scholar] [CrossRef]
  10. Powell, W. Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed.; Wiley: Hoboken, NJ, USA, 2011; ISBN 9780470604458. [Google Scholar]
  11. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; ISBN 9780262039246. [Google Scholar]
  12. Tilmant, A.; Kelman, R. A Stochastic Approach to Analyze Trade-Offs and Risks Associated with Large-Scale Water Resources Systems. Water Resour. Res. 2007, 43, W06425. [Google Scholar] [CrossRef]
  13. Howson, H.R.; Sancho, N.G.F. A New Algorithm for the Solution of Multistate Dynamic Programming Problems. Math. Program. 1975, 8, 104–116. [Google Scholar] [CrossRef]
  14. Turgeon, A. Optimal Operation of Multireservoir Power Systems with Stochastic Inflows. Water Resour. Res. 1980, 16, 275–283. [Google Scholar] [CrossRef]
  15. Turgeon, A.; Charbonneau, R. An Aggregation-Disaggregation Approach to Long-Term Reservoir Management. Water Resour. Res. 1998, 34, 3585–3594. [Google Scholar] [CrossRef]
  16. Zeng, X.; Hu, T.S.; Xiong, L.H.; Cao, Z.X.; Xu, C.Y. Derivation of Operation Rules for Reservoirs in Parallel with Joint Water Demand. Water Resour. Res. 2015, 51, 9539–9563. [Google Scholar] [CrossRef]
  17. Zeng, X.; Hu, T.S.; Cai, X.M.; Zhou, Y.L.; Wang, X. Improved Dynamic Programming for Parallel Reservoir System Operation Optimization. Adv. Water Resour. 2019, 131, 103373. [Google Scholar] [CrossRef]
  18. Beiranvand, B.; Ashofteh, P.S. A Systematic Review of Optimization of Dams Reservoir Operation Using the Meta-heuristic Algorithms. Water Resour. Manag. 2023, 37, 3457–3526. [Google Scholar] [CrossRef]
  19. Emami, M.; Nazif, S.; Mousavi, S.F.; Karami, H.; Daccache, A. A hybrid constrained coral reefs optimization algorithm with machine learning for optimizing multi-reservoir systems operation. J. Environ. Manag. 2021, 286, 112250. [Google Scholar] [CrossRef] [PubMed]
  20. Sutton, R.S. Learning to Predict by the Methods of Temporal Differences. Mach. Learn. 1988, 3, 9–44. [Google Scholar] [CrossRef]
  21. Watkins, C.J.C.H.; Dayan, P. Technical Note: Q-Learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  22. Castro Freibott, R.; García Sánchez, Á.; Espiga-Fernández, F.; González-Santander de la Cruz, G. Intraday Multireservoir Hydropower Optimization with Alternative Deep Reinforcement Learning Configurations. In Decision Sciences (DSA ISC 2024); Lecture Notes in Computer Science; Juan, A.A., Faulin, J., Lopez-Lopez, D., Eds.; Springer: Cham, Switzerland, 2025; Volume 14778. [Google Scholar]
  23. Luo, W.; Wang, C.; Zhang, Y.; Zhao, J.; Huang, Z.; Wang, J.; Zhang, C. A Deep Reinforcement Learning Approach for Joint Scheduling of Cascade Reservoir System. J. Hydrol. 2025, 651, 132515. [Google Scholar] [CrossRef]
  24. Lee, J.H.; Labadie, J.W. Stochastic Optimization of Multireservoir Systems via Reinforcement Learning. Water Resour. Res. 2007, 43, W11408. [Google Scholar] [CrossRef]
  25. Castelletti, A. A Reinforcement Learning Approach for the Operational Management of a Water System. In Proceedings of IFAC Workshop Modelling and Control in Environmental Issues; Elsevier: Yokohama, Japan, 2001; pp. 22–23. [Google Scholar]
  26. Xu, W.; Zhang, X.; Peng, A.B.; Liang, Y. Deep Reinforcement Learning for Cascaded Hydropower Reservoirs Considering Inflow Forecasts. Water Resour. Manag. 2020, 34, 3003–3018. [Google Scholar] [CrossRef]
  27. Wu, R.; Wang, R.; Hao, J.; Wu, Q.; Wang, P. Multiobjective Hydropower Reservoir Operation Optimization with Transformer-Based Deep Reinforcement Learning. arXiv 2023, arXiv:2307.05643. [Google Scholar]
  28. Mitjana, F.; Denault, M.; Demeester, K. Managing Chance-Constrained Hydropower with Reinforcement Learning and Backoffs. Adv. Water Resour. 2022, 169, 104308. [Google Scholar] [CrossRef]
  29. Zhao, T.T.G.; Cai, X.M.; Lei, X.H.; Wang, H. Improved Dynamic Programming for Reservoir Operation Optimization with a Concave Objective Function. J. Water Resour. Plan. Manag. 2012, 138, 590–596. [Google Scholar] [CrossRef]
  30. Zhao, T.T.G.; Zhao, J.S.; Yang, D.W. Improved Dynamic Programming for Hydropower Reservoir Operation. J. Water Resour. Plan. Manag. 2014, 140, 365–374. [Google Scholar] [CrossRef]
  31. Zhao, T.T.G.; Zhao, J.S.; Lei, X.H.; Wang, X.; Wu, B.S. Improved Dynamic Programming for Reservoir Flood Control Operation. Water Resour. Manag. 2017, 31, 2047–2063. [Google Scholar] [CrossRef]
  32. You, J.Y.; Cai, X.M. Hedging Rule for Reservoir Operations: 1. A Theoretical Analysis. Water Resour. Res. 2008, 44, W01415. [Google Scholar] [CrossRef]
  33. Shiau, J.T. Analytical Optimal Hedging with Explicit Incorporation of Reservoir Release and Carryover Storage Targets. Water Resour. Res. 2011, 47, W01515. [Google Scholar] [CrossRef]
  34. Zeng, X.; Lund, J.R.; Cai, X.M. Linear versus Nonlinear (Convex and Concave) Hedging Rules for Reservoir Optimization Operation. Water Resour. Res. 2021, 57, e2020WR029160. [Google Scholar] [CrossRef]
  35. Fama, E.F. Multiperiod Consumption-Investment Decisions. Am. Econ. Rev. 1970, 60, 163–174. [Google Scholar]
  36. Draper, A.J.; Lund, J.R. Optimal Hedging and Carryover Storage Value. J. Water Resour. Plan. Manag. 2004, 130, 83–87. [Google Scholar] [CrossRef]
  37. Carroll, C.D.; Kimball, M.S. On the Concavity of the Consumption Function. Econometrica 1996, 64, 981–992. [Google Scholar] [CrossRef]
  38. Johnson, S.A.; Stedinger, J.R.; Staschus, K. Heuristic Operating Policies for Reservoir System Simulation. Water Resour. Res. 1991, 27, 673–685. [Google Scholar] [CrossRef]
  39. Sand, G.M. An Analytical Investigation of Operating Policies for Water-Supply Reservoirs in Parallel. Ph.D. Dissertation, Cornell University, Ithaca, NY, USA, 1984. [Google Scholar]
  40. Lund, J.R.; Ferreira, I. Operating Rule Optimization for Missouri River Reservoir System. J. Water Resour. Plan. Manag. 1996, 122, 287–295. [Google Scholar] [CrossRef]
  41. Perera, B.J.C.; Codner, P.G. Reservoir Targets for Urban Water Supply Systems. J. Water Resour. Plan. Manag. 1996, 122, 270–279. [Google Scholar] [CrossRef]
  42. Nalbantis, I.; Koutsoyiannis, D. A Parametric Rule for Planning and Management of Multiple Reservoir Systems. Water Resour. Res. 1997, 33, 2165–2177. [Google Scholar] [CrossRef]
  43. Zhao, J.S.; Cai, X.M.; Wang, Z.J. Optimality conditions for a two-stage reservoir operation problem. Water Resour. Res. 2011, 47, W08503. [Google Scholar] [CrossRef]
  44. You, J.Y.; Cai, X.M. Hedging Rule for Reservoir Operations: 2. A Numerical Model. Water Resour. Res. 2008, 44, W01416. [Google Scholar] [CrossRef]
  45. Wang, C.; Jiang, Z.; Xu, Y.; Wang, S.; Wang, P. Discussion on the monotonicity principle of the two-stage problem in joint optimal operation of cascade hydropower stations. J. Hydrol. 2023, 623, 129803. [Google Scholar] [CrossRef]
Figure 1. Cascade reservoir system with hydropower generation and tributary water supply objectives.
Figure 2. Monotonic relationship between optimization decisions and water availability; where, ↑ denotes non-decreasing monotonic relationship; ↓ denotes non-increasing monotonic relationship; ─ denotes without any change.
Figure 3. Schematic diagram of the improved reinforcement learning method.
Figure 4. The layout of WB-Xll-Xjb-Sx cascade reservoir system.
Figure 5. Differences between SDP and RL or SDP and IRL (a) relative error under different combinations of storage discretization levels; (b) hydropower generation under the discretization level of 0.24 billion m3; (c) tributary release under the discretization level of 0.24 billion m3.
Figure 6. Computing time under the discretization level of 0.32 billion m3 for running the models over a one-year time horizon (a) SDP; (b) RL; (c) IRL.
Figure 7. Carryover storage distribution of the cascade reservoir system in September (a) SDP; (b) RL; (c) IRL.
