Article

Deep Reinforcement Learning for Optimized Reservoir Operation and Flood Risk Mitigation

1 Department of Civil, Architectural and Environmental System Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
2 Graduate School of Water Resources, Sungkyunkwan University, Suwon 16419, Republic of Korea
* Author to whom correspondence should be addressed.
Water 2025, 17(22), 3226; https://doi.org/10.3390/w17223226
Submission received: 19 October 2025 / Revised: 10 November 2025 / Accepted: 10 November 2025 / Published: 11 November 2025
(This article belongs to the Special Issue Machine Learning Applications in the Water Domain)

Abstract

Effective reservoir operation demands a careful balance between flood risk mitigation, water supply reliability, and operational stability, particularly under evolving hydrological conditions. This study applies deep reinforcement learning (DRL) models—Deep Q-Network (DQN), Proximal Policy Optimization (PPO), and Deep Deterministic Policy Gradient (DDPG)—to optimize reservoir operations at the Soyang River Dam, South Korea, using 30 years of daily hydrometeorological data (1993–2022). The DRL framework integrates observed and remotely sensed variables such as precipitation, temperature, and soil moisture to guide adaptive storage decisions. Discharge is computed via mass balance, preserving inflow while optimizing system responses. Performance is evaluated using cumulative reward, action stability, and counts of total capacity and flood control violations. PPO achieved the highest cumulative reward and the most stable actions but incurred six flood control violations; DQN recorded one flood control violation, reflecting larger buffers and strong flood control compliance; DDPG provided smooth, intermediate responses with one violation. No model exceeded the total storage capacity. Analyses show a consistent pattern: retain on the rise, moderate the crest, and release on the recession to keep Flood Risk (FR) < 0. During high-inflow days, DRL optimization outperformed observed operation by increasing storage buffers and typically reducing peak discharge, thereby mitigating flood risk.

1. Introduction

Modern dams are designed to be multipurpose, providing benefits such as flood control, water supply, and electricity generation [1]. This presents a significant challenge to water resource managers, as balancing these competing objectives requires effective techniques to optimize reservoir operation. The primary goal of this research is to optimize reservoir operations by moderating hydrometeorological variables such as inflow, water storage, and discharge, and subsequently determining the flood risk (a deterministic flood threshold exceedance index rather than a probabilistic flood risk measure) for the observed thirty-year period from 1993 to 2022 [2,3,4]. However, effective optimization involves considering the interaction between multiple variables, and as the number of variables increases, decision-making becomes increasingly complex [3,5]. This complexity is compounded by the limitations of traditional approaches like Dynamic Programming (DP) and Stochastic Dynamic Programming (SDP), which struggle with the “curse of dimensionality”—the exponential growth of computational complexity with added variables—and the “curse of modeling,” which refers to the difficulty of accurately representing all components of a real-world water system due to data limitations and system uncertainties [4,6,7]. These challenges necessitate the exploration of more scalable and flexible approaches to reservoir operation optimization.
Deep Reinforcement Learning (DRL) can be applied to overcome the dimensionality and modeling dilemma to optimize reservoir operation and mitigate flood risk [8]. DRL involves an agent, such as Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), or Deep Q-Network (DQN), that learns to make decisions by interacting with an environment to maximize cumulative rewards. Each agent uses a different algorithm to select and optimize actions in its environment [9], learning optimal decision-making through trial and error [10]. The state represents the environment’s current condition, such as water storage, inflow, and meteorological data in reservoir operation. Based on the state, the agent selects an action, like regulating water discharge, to influence the environment. The environment responds to the action by transitioning to a new state and providing a reward, which evaluates the action’s effectiveness [11]. For example, the agent receives a positive reward for maintaining water storage within safe limits and a penalty for exceeding flood control thresholds. This feedback guides the agent to refine its decision-making over time. Through repeated cycles of observing states, taking actions, and processing rewards, the agent learns to balance objectives like flood risk mitigation, water discharge, and operational performance [12].
This study optimizes key hydrometeorological variables—such as inflow, water storage, and discharge—for the Soyang River Dam using DRL models, including DDPG, PPO, and DQN, over a 30-year period from 1993 to 2022. Unlike previous studies that rely solely on RL or DRL for optimization, this approach uniquely integrates observed and remote-sensed hydrometeorological data with DRL decision-making, making it distinct in its ability to incorporate diverse environmental factors into reservoir operation optimization. By leveraging variables such as precipitation, soil moisture, temperature, evaporation, and runoff, the models dynamically adjust flood control strategies in response to environmental variability, ensuring adaptive and sustainable reservoir operations under changing climatic conditions [13]. DRL methods, including DQN, PPO, and DDPG, offer distinct advantages in optimizing reservoir operations and flood risk mitigation. The effectiveness of each method depends on the specific operational goals, such as reducing flood control violations, maintaining target storage, and improving the reliability of water release decisions.
DQN, a value-based deep reinforcement learning method, is particularly effective for structured decision-making in flood control applications where discrete actions are required. It efficiently determines optimal water release strategies by selecting from predefined levels to minimize flood violations and maintain reservoir safety. This makes DQN well-suited for scenarios such as adjusting spillway gates in fixed increments or complying with operational constraints. For example, in Korea, reservoir operations must comply with minimum ecological flow requirements [14]. Its capacity to optimize predefined reward structures ensures compliance with operational thresholds while mitigating flood risks. However, DQN’s reliance on discrete action spaces presents limitations, as it may cause abrupt changes in water discharges, making it less ideal for applications requiring fine-grained or gradual adjustments. Despite this drawback, its ability to quickly generate decisive actions makes it particularly valuable for emergency flood control scenarios, where rapid interventions are critical to prevent reservoir overflow [15].
PPO, a policy-gradient deep reinforcement learning approach, excels in continuous and adaptive control scenarios due to its ability to provide smooth and stable adjustments. This makes it particularly effective for long-term reservoir operation, where maintaining consistent water storage and minimizing excessive fluctuations are essential. PPO enables gradual water release decisions, ensuring operational stability—for example, in systems that must balance urban water demands with flood risk mitigation while avoiding unnecessary water losses. However, its reliance on gradual adjustments makes it less suitable for dynamic environments where sudden inflow surges demand immediate corrective actions. Additionally, PPO has been observed to have a higher frequency of flood control violations compared to DQN, potentially due to its exploration tendencies and slower convergence in high-variability conditions. Despite these limitations, PPO remains highly valuable for managing reservoirs with predictable seasonal variations, where long-term operational stability is the primary goal [16,17].
DDPG, an actor-critic deep reinforcement learning method, is well-suited for real-time dynamic reservoir operation due to its ability to handle continuous action spaces. This flexibility enables precise and incremental adjustments in water releases, which is particularly advantageous in environments with high inflow variability, such as monsoon seasons in Korea [18]. Unlike discrete-action methods like DQN, DDPG allows adaptive control to maintain optimal storage levels while ensuring downstream safety [19]. Its capability to optimize outflows based on real-time conditions makes it highly effective for managing dynamic hydrological patterns. However, DDPG’s performance can be sensitive to hyperparameter tuning and may exhibit instability in volatile conditions [20,21]. In scenarios with uncertain inflow predictions or extreme events, stabilization techniques such as reward shaping or ensemble learning may be required to improve reliability [22]. Additionally, it may exhibit higher variability in storage levels compared to other models, making it less suitable for scenarios where strict adherence to predefined thresholds is necessary [13].
The objectives of this study are to (1) evaluate the performance of DQN, DDPG, and PPO deep reinforcement learning models, (2) integrate observed and remotely sensed hydrometeorological variables with deep reinforcement learning for reservoir operation optimization and flood risk mitigation, and (3) compare DRL-optimized operations with observed operations, focusing on high-inflow days, using storage, discharge, and a storage-based flood risk metric. Together, these elements advance DRL-based reservoir operation by unifying multi-source hydrometeorological inputs and demonstrating performance gains over observed operation under high-inflow conditions.

2. Materials and Methods

2.1. Study Area

The Soyang River Dam, a rockfill dam constructed in 1973, is South Korea’s largest multipurpose dam, located in Chuncheon, Gangwon Province [23]. The Soyang River is a tributary of the North Han River, which is a tributary of the Han River, a transboundary river traversing Kangwon Province in North Korea and Gangwon and Gyeonggi Provinces in South Korea [24]. Figure 1 illustrates the Soyang Reservoir and its downstream water flow in South Korea. The inset map in the top left highlights South Korea, with the Soyang Basin marked in purple, while a zoomed-in view shows the reservoir and its tributaries. The selected section shown in Figure 1 represents approximately a 2 km downstream reach of the Soyang River Dam, extending from the dam outlet to the first major downstream infrastructure. Hydrologically, the Soyang River Dam is located on the Soyang River, approximately 110 km upstream of the confluence of the North Han River and the mainstream Han River, corresponding to coordinates 37°56′43″ N, 127°48′49″ E [25].
The main section of the figure focuses on the Soyang Reservoir, with the dam and spillway located at its lower end, where water is regulated before being released. The outflow is shown moving downstream, accompanied by satellite imagery that highlights surrounding infrastructure. The legend clarifies key elements, including the outlet, Soyang Basin, and South Korea, while the scale bar at the bottom indicates a distance of 0 to 2 km. The figure effectively represents the reservoir’s geographic location, water flow, and downstream impact.
The inset map shows the Soyang River Basin, while the lower panel highlights the dam, spillway, and downstream channel. Scale bars are provided for both the inset and detailed maps. The base map imagery is derived from Sentinel-2 (Copernicus Open Access Hub, 2022), overlaid with hydrographic boundaries and reservoir outlines from WAMIS and Han River Flood Control Office (HRFCO) datasets.
The Soyang River Dam catchment, located in the upper Han River basin in northeastern South Korea, covers an area of approximately 2703 km2. The basin is characterized by steep mountainous terrain, with elevations ranging from El. 200 to 1200 m, leading to rapid runoff generation during monsoon events. Land cover is dominated by forested areas (~85%), with limited agricultural and urban zones concentrated along valley plains. The basin’s mean annual precipitation exceeds 1400 mm, of which nearly 70% occurs between June and September, reflecting strong seasonal hydrological variability. Major tributaries feeding the reservoir include the Naerincheon and Inbukcheon Rivers. This physiographic and climatic setting places the Soyang River Dam within a steep, fast-responding hydrological environment where flood attenuation depends critically on short-reach operational control of the downstream 2 km section [26,27]. This underscores the importance of adaptive reservoir operation for flood mitigation and water supply reliability [28].

2.2. Data Used

The Water Resources Management Information System (WAMIS) of Korea provides comprehensive data on dams, reservoirs, and related infrastructure. The dam characteristics include details such as the dam name, type, and length. Reservoir characteristics include flood-related data, such as the design flood water level at El. 198 m, the normal high-water level at El. 193.5 m, and the restricted water level at El. 190.3 m. The reservoir’s normal high-water level capacity is 2543.80 million m3, the flood control capacity is 770 million m3, and the total storage capacity is 2900 million m3. Spillway characteristics include the spillway elevation at El. 185.5 m, the design flood volume of the spillway at 10,500 million m3, and the design discharge of the spillway at 5500 m3/s. Additionally, data on water supply include the total annual planned water supply of 1213 million m3: 1200 million m3 for domestic use and 13 million m3 for irrigation. Daily data from 1993 to 2022 were also obtained. The dataset includes low water level, water storage, retention rate, inflow, total discharge, power generation discharge, discharge to the spillway, other discharge amounts, water supply, observed dam basin average rainfall, and annual water supply plans for domestic use and irrigation. Data on the location of the Soyang River Dam relative to other dams on the Han River system were obtained from the Han River Flood Control Office (HRFCO).
The remote-sensing data comprised various parameters derived from multiple sources (Table 1). These include the daily reservoir surface area change, calculated from satellite imagery.
The table shows Landsat satellites (4, 5, 7, 8, and 9), spanning from 1982 to October 2021. ERA5 provides temperature and evaporation data from 1950 to the present, while MERRA-2 supplies humidity measurements starting from 1980. Additional variables include adjusted Curve Number (CN) runoff, soil moisture computed using GLDAS versions 2.0 and 2.1, covering data from 1948 to the present, solar radiation, and wind speed.
The analysis period spans 1993–2022, corresponding to the full availability of observed hydrological data from WAMIS and HRFCO. Remote-sensing datasets were incorporated according to their respective temporal coverages: CHIRPS rainfall (1981–present), MERRA-2 humidity (1980–present), GLDAS soil moisture (1948–present), and ERA5-Land variables including temperature and evaporation (1950–present). Landsat-derived surface area estimates were used where available (1984–2021), but were not essential for daily model inputs. Because all remote-sensing datasets provided continuous daily coverage across the 1993–2022 window, no temporal gaps affected model training.
To ensure consistency across datasets of differing spatial and temporal resolutions, all variables were preprocessed to a common daily time step and spatially harmonized with the Soyang Reservoir watershed boundary. The preprocessing was performed primarily in Google Earth Engine (GEE) to maintain uniform spatial referencing. Hydrometeorological variables from ERA5-Land (hourly, 0.1°), GLDAS (3-hourly, 0.25°), CHIRPS (daily, 0.05°), and MERRA-2 (hourly, 0.5°) were first aggregated to daily means or totals and then resampled to a 0.05° grid using bilinear interpolation. All datasets were clipped to the catchment boundary to ensure spatial comparability. Landsat-derived surface-area data, available at 16-day intervals from Landsat 4–9, were processed through cloud masking and spectral water-index (MNDWI and NDBI) classification, then used for periodic validation rather than daily model input. Occasional missing daily values were filled using linear interpolation between adjacent days within the same dataset. The harmonized daily series were exported as CSV files and merged in Python version 3.13.5 for modeling, and all variables were normalized using a StandardScaler prior to training to ensure consistent feature magnitudes. This procedure yielded a temporally continuous and spatially consistent dataset covering 1993–2022, integrating both observed and remote-sensing variables for reinforcement-learning-based reservoir operation.
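For concreteness, a minimal preprocessing sketch in Python is given below. It assumes hypothetical CSV exports (file and column names are illustrative, not those used in the study) and reproduces only the generic steps described above: merging observed and remote-sensing series on a daily index, linear interpolation of occasional gaps, and StandardScaler normalization.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical daily CSV exports; file names and columns are illustrative only.
obs = pd.read_csv("wamis_daily.csv", parse_dates=["date"], index_col="date")
rs = pd.read_csv("gee_remote_sensing_daily.csv", parse_dates=["date"], index_col="date")

# Merge on the common daily index and restrict to the 1993-2022 analysis window.
df = obs.join(rs, how="inner").loc["1993-01-01":"2022-12-31"]

# Fill occasional missing daily values by linear interpolation in time.
df = df.interpolate(method="time", limit_direction="both")

# Standardize all features so magnitudes are comparable during DRL training.
features = pd.DataFrame(StandardScaler().fit_transform(df), index=df.index, columns=df.columns)
```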
Unlike supervised learning models that require explicit data splits, deep reinforcement learning agents learn through sequential interaction with the environment. Accordingly, the DRL agents (DDPG, PPO, and DQN) were trained using the complete 1993–2022 sequence to preserve temporal continuity and capture the full hydrologic variability of the reservoir system. Each agent interacted with the environment through daily state–action–reward transitions, allowing it to learn operational strategies that balance flood mitigation and storage regulation across diverse inflow conditions [39,40].
The analysis period (1993–2022) was selected based on the full temporal availability and consistency of both observed and remote-sensing datasets, and additionally, extending to 2022 allows the use of more recent data. While climatological assessments often adopt the 1991–2020 reference period, several input datasets (e.g., WAMIS/HRFCO hydrological records, Landsat-derived surface area, and GLDAS soil moisture) begin in 1993. Using this 30-year window ensured continuous, gap-free coverage across all data sources, which is essential for reinforcement-learning-based modeling. As such, extending the period backward to 1991 would not have improved representativeness but would have introduced inconsistencies among the datasets.
Figure 2 illustrates the observed variations in water storage within the Soyang Reservoir over the period from 1993 to 2022, highlighting key operational zones and capacity levels.
The purple line represents the reservoir’s actual water storage, showing clear fluctuations influenced by seasonal inflows and outflows. These variations likely correspond to monsoon seasons and dry periods, reflecting the reservoir’s role in both water supply management and flood control [41]. The red dashed line marks the reservoir’s total storage capacity, i.e., the maximum volume it can hold. Below this, the flood control capacity zone (orange shaded region) is reserved for mitigating flood risks during heavy inflows. Keeping storage within this zone is crucial for preventing overflow and ensuring downstream safety during extreme weather events. The blue-shaded region with dotted patterns represents the annual water supply plan zone, ensuring sufficient water allocation for planned usage throughout the year. Within this, the light blue dotted section highlights the planned supply of domestic water, covering essential residential and municipal demands. At the bottom, the gray shaded region represents inactive and dead storage (917 million m3), which is generally unusable due to operational constraints. The blue line marks the active storage boundary, separating usable and inactive storage. Additionally, the green dashed line indicates the normal high-water level capacity, serving as a benchmark for stable reservoir operations.
Overall, this reservoir operation strategy balances flood control, water supply, and operational stability. Keeping storage between the flood control and domestic water supply zones is key to ensuring water availability while mitigating flood risks.
Figure 3 illustrates the observed fluctuations in reservoir water levels from 1993 to 2022, highlighting key operational thresholds for effective water management.
The blue line represents the reservoir’s water level over time, showing variations with peaks and troughs likely influenced by rainfall patterns and water usage demands [42]. The red dashed line at El. 198 m marks the design flood level, indicating the maximum safe water level to prevent flooding. Below this is the orange dashed line at El. 190.3 m, representing the flood-season restricted water level, which is the maximum reservoir water level permitted during the designated flood season to secure flood control capacity. The purple dashed line at El. 185.5 m denotes the spillway elevation, where excess water begins to discharge to maintain safe levels. Additionally, the green dashed line at El. 193.5 m marks the normal high-water level, often used as a reference point for stable reservoir operations [43]. Maintaining water levels between the spillway elevation and the design flood level is essential for balancing flood prevention and water supply regulation, ensuring both safety and operational stability.

2.3. Methodology

The methodology in this study, as shown in Figure 4, integrates observed data and remote-sensing data to develop and evaluate optimized operations for the Soyang Reservoir.
Data are prepared and utilized to train DRL models to optimize daily operations for the Soyang Reservoir. The DRL methods employed include DQN, PPO, and DDPG. These models simulate the decision-making processes of virtual dam operators, determining actions such as optimizing discharge to mitigate flood risk and maintain operational safety. The DRL approaches are evaluated to identify the most effective method for managing reservoir operations based on hydrological and meteorological data over the years.
The reinforcement learning process for reservoir operation optimization, as shown in Figure 4, begins with the reservoir state, which represents the current hydrological conditions, including water storage, inflow, and meteorological factors. The agent, which could be an Actor-Critic model (PPO, DDPG) or a Q-Network (DQN), receives this state information and determines an optimal discharge decision. In DQN, the agent selects an action from a predefined discrete set of discharge levels, while in PPO and DDPG, the agent predicts a continuous discharge value. This action is then applied to update the reservoir state, ensuring that total discharge does not exceed inflow constraints. After the action is executed, the new reservoir state is observed, reflecting the updated storage and discharge levels. A reward is then computed based on predefined criteria, typically penalizing excessive storage that surpasses the flood control threshold, reinforcing actions that maintain operational stability. For DQN and DDPG, past experiences are stored in a replay buffer and sampled to update the model, which improves decision-making by reducing correlations between consecutive steps. PPO does not use a replay buffer; its updates are instead computed from newly collected trajectories (state → action → reward → next-state sequences), thereby refining the agent’s ability to regulate discharge more effectively. This iterative process continues, allowing the agent to optimize water release while balancing flood risk mitigation, water storage, and operational stability based on observed reservoir conditions [44].
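The daily state–action–reward cycle described above can be summarized with the following schematic Python loop. The environment, agent, and replay buffer objects are placeholders (their method names are assumptions, not the study’s actual code); the sketch only illustrates how the off-policy agents (DQN, DDPG) draw samples from a replay buffer while PPO learns from freshly collected trajectories.

```python
def run_episode(env, agent, replay_buffer=None):
    """One pass through the daily record: state -> action -> reward -> next state.

    `env`, `agent`, and `replay_buffer` are placeholders for the reservoir
    environment, a DRL agent (DQN/DDPG/PPO), and an optional replay buffer;
    their method names are illustrative.
    """
    state = env.reset()                       # storage, inflow, meteorological variables
    trajectory, total_reward, done = [], 0.0, False
    while not done:
        action = agent.act(state)             # discrete level (DQN) or continuous value (PPO/DDPG)
        next_state, reward, done = env.step(action)   # apply release, update storage, score result
        if replay_buffer is not None:          # off-policy agents (DQN, DDPG) store transitions
            replay_buffer.add(state, action, reward, next_state, done)
            agent.learn(replay_buffer.sample())        # learn from decorrelated minibatches
        else:                                  # on-policy agent (PPO) keeps the fresh trajectory
            trajectory.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    if replay_buffer is None and trajectory:
        agent.learn(trajectory)                # PPO updates from newly collected trajectories only
    return total_reward
```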
To simulate discharge behavior under storage optimization, a mass balance-based estimation framework was employed. Inflow was preserved from observed records to maintain the original hydrological forcing conditions, while reservoir storage was dynamically adjusted by each DRL model to minimize flood risk and improve operational stability. Discharge was not explicitly predicted by the models. Instead, it was derived as a consequence of inflow and the model-optimized changes in storage using the continuity equation, which conserves mass across the reservoir system. This relationship is governed by the mass balance principle [45]:
$$ Q(t) = I(t) - \frac{dS(t)}{dt} \tag{1} $$
where $I(t)$ is the inflow at time $t$ preserved from observed records, $S(t)$ is the optimized water storage at time $t$, and $Q(t)$ is the estimated discharge at time $t$, which is the variable of interest. Since the available data are daily, this equation is discretized as follows:
$$ Q(t) = I(t) - \frac{S(t) - S(t-1)}{\Delta t} \tag{2} $$
where $\Delta t$ is the time step (86,400 s for daily resolution). This formulation enables dynamic discharge computation that accurately reflects storage fluctuations resulting from DRL-optimized decisions for flood control.
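A minimal implementation of this discretized mass balance, assuming inflow in m3/s and storage in million m3, might look as follows; the function name and unit conventions are illustrative.

```python
import numpy as np

SECONDS_PER_DAY = 86_400   # Δt for daily resolution

def discharge_from_storage(inflow_m3s, storage_million_m3):
    """Estimate daily discharge via Eq. (2): Q(t) = I(t) - (S(t) - S(t-1)) / Δt.

    inflow_m3s: observed inflow (m3/s); storage_million_m3: DRL-optimized storage (million m3).
    """
    storage_m3 = np.asarray(storage_million_m3, dtype=float) * 1e6
    delta_s = np.diff(storage_m3, prepend=storage_m3[0])   # S(t) - S(t-1); zero at the first step
    return np.asarray(inflow_m3s, dtype=float) - delta_s / SECONDS_PER_DAY
```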
To evaluate the effectiveness of reservoir operation strategies in mitigating flood risk, a quantitative analysis of flood control violations was conducted. These violations occur when the reservoir storage exceeds the upper operational threshold designated for flood control purposes. The upper limit is defined as the difference between the total storage capacity and the allocated flood control capacity of the reservoir. Let $S_{\max}$ be the total storage capacity (2900 million m3), and $S_{\text{flood}}$ (770 million m3) the designated flood control capacity. The critical flood threshold $S_{\text{threshold}}$ is the difference (2130 million m3) between $S_{\max}$ and $S_{\text{flood}}$, which is the maximum allowable storage before flood control measures are triggered. Flood risk, $FR(t)$, is then quantified at each time step using the following normalized expression [4]:
$$ FR(t) = \frac{S(t) - S_{\text{threshold}}}{S_{\text{flood}}} \tag{3} $$
A flood control violation occurs when $FR(t) > 0$, i.e., when the current storage exceeds the flood-safe limit (Table 2).
The number of violations is obtained by counting all such occurrences over the simulation period. This methodology was applied to both the observed storage data and simulated reservoir storage trajectories generated under different deep reinforcement learning (DRL) strategies—namely DDPG, PPO, and DQN. To further contextualize the method, the dates on which DRL-related violations occurred are marked on flood risk time series plots. This approach supports the visual and quantitative assessment of DRL optimization effectiveness in reducing flood control breaches relative to observed operation [46].
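A short sketch of the violation count, using the capacities stated above (2900, 770, and hence 2130 million m3), is shown below; it simply evaluates FR(t) for a storage series and counts the days with FR(t) > 0. Function names are illustrative.

```python
import numpy as np

S_MAX = 2900.0                  # total storage capacity (million m3)
S_FLOOD = 770.0                 # flood control capacity (million m3)
S_THRESHOLD = S_MAX - S_FLOOD   # flood-safe limit, 2130 million m3

def flood_risk(storage_million_m3):
    """FR(t) = (S(t) - S_threshold) / S_flood; positive values indicate violations."""
    s = np.asarray(storage_million_m3, dtype=float)
    return (s - S_THRESHOLD) / S_FLOOD

def count_violations(storage_million_m3):
    """Number of days on which storage exceeds the flood-safe limit (FR > 0)."""
    return int(np.sum(flood_risk(storage_million_m3) > 0.0))
```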
Benchmarking against observed operation was conducted by comparing each DRL agent to the observed operation across the full 1993–2022 record using whole-period evaluation metrics—cumulative reward, action stability, and counts of capacity and flood control violations—and on the same extreme inflow (high-inflow) days using storage, discharge, and a storage-based flood risk metric. “Improvement” is defined as fewer violations, larger negative FR (greater buffer), and lower peak discharge under identical inflow forcing.
In this study, we do not assess flood risk in the probabilistic sense, which typically incorporates hazard, exposure, and vulnerability. Instead, a deterministic flood threshold exceedance index derived from reservoir storage levels is applied [3,47]. It is technically a flood threshold exceedance index; however, we adopt the term “flood risk” within the context of this modeling framework to emphasize its operational relevance to reservoir management.

2.3.1. Deep Reinforcement Learning Models

This study focuses on optimizing reservoir operations by adjusting critical hydrological variables such as water storage, inflow, and discharge using DRL methods—DQN, PPO, and DDPG. The DRL framework uses detailed information about the reservoir system and its operations to develop strategies for effective water management.
These three algorithms were selected to represent the principal design dimensions relevant for reservoir operation: discrete versus continuous control and value-based versus actor–critic learning. Specifically, DQN captures stepwise, rule-based reservoir decisions through discrete control, DDPG enables fine-grained continuous adjustments suitable for smooth discharge modulation, and PPO emphasizes stability and robustness under dynamic hydrological variability. Together, these models provide complementary perspectives for testing DRL-driven reservoir operation strategies. Detailed model-specific architectures and operational behaviors are described below, followed by a comparative summary at the end of this section.
In the context of reservoir operation optimization for flood control, the cumulative reward is used to evaluate how effectively the system manages water inflow, storage, and discharge while minimizing flood risk and maintaining desirable reservoir levels. The reward at each time step $t$, denoted as $R(t)$, is determined by penalizing two key conditions: (1) storage levels that exceed the flood-safe threshold and (2) deviations from a target storage level. The reward function is defined as follows:
$$ R(t) = 1 - \frac{1}{2}\left( P_f(t) + P_d(t) \right) \tag{4} $$
in which $P_f(t)$ and $P_d(t)$ are the flood penalty and deviation penalty, respectively. This function yields a maximum reward of 1 when the storage is at the target level and within flood-safe limits. Any flood risk or deviation from the target reduces the reward accordingly. The flood penalty quantifies the extent to which water storage exceeds the flood-safe threshold and is calculated by using the following equation:
$$ P_f(t) = \frac{\max\left(0,\ S(t) - S_{\text{threshold}}\right)}{S_{\text{flood}}} \tag{5} $$
This penalty becomes positive only when $S(t)$ exceeds the threshold, indicating a potential flood situation. The deviation penalty measures how far the current storage deviates from the operational target level $S_{\text{target}}$, and is computed by using the following equation:
$$ P_d(t) = \frac{\left| S(t) - S_{\text{target}} \right|}{S_{\max}} \tag{6} $$
where $S_{\text{target}}$ represents the desired storage level, such as the normal high-water level capacity (2543.8 million m3). This component penalizes both over- and under-storage relative to the target, even when there is no flood risk. The overall performance is assessed via the cumulative reward $\sum_{t=1}^{T} R(t)$ over the simulation horizon $T$, as defined in Equation (4).
A higher cumulative reward indicates more effective operation, where the reservoir remains within flood-safe limits and close to the target storage range. This evaluation framework guides the optimization of inflow regulation and discharge decisions over time to support effective flood control.
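The reward components of Equations (4)–(6) can be written compactly as follows. The sketch uses the capacities and target stated in the text; the absolute value in the deviation penalty reflects the statement that both over- and under-storage are penalized, and the function names are illustrative.

```python
S_MAX = 2900.0                   # total storage capacity (million m3)
S_FLOOD = 770.0                  # flood control capacity (million m3)
S_THRESHOLD = S_MAX - S_FLOOD    # flood-safe limit, 2130 million m3
S_TARGET = 2543.8                # normal high-water level capacity (million m3)

def reward(storage):
    """Daily reward R(t) = 1 - 0.5 * (P_f(t) + P_d(t)) for storage in million m3."""
    p_flood = max(0.0, storage - S_THRESHOLD) / S_FLOOD        # Eq. (5)
    p_dev = abs(storage - S_TARGET) / S_MAX                    # Eq. (6)
    return 1.0 - 0.5 * (p_flood + p_dev)                       # Eq. (4)

def cumulative_reward(storages):
    """Sum of R(t) over a simulated trajectory of daily storage values."""
    return sum(reward(s) for s in storages)
```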
The Deep Q-Network (DQN) approach optimizes reservoir operations by dynamically adjusting discharge based on hydrometeorological conditions and reservoir status. Using observed data from 1993 to 2022—including inflow, water storage, total discharge, rainfall, and soil moisture—the model simulates behavior to mitigate flood risk and enhance operational stability. The Q-network consists of an input layer and two hidden layers with 400 and 300 neurons (ReLU activation), followed by an output layer that predicts Q-values corresponding to a discrete set of three possible discharge decisions: 0% (retaining all inflow), 50% (moderate release), and 100% (full release) of inflow. At each timestep, the model selects the action with the highest predicted Q-value. Reservoir storage is updated using a water balance equation:
$$ S(t) = S(t-1) + \left(1 - a(t)\right) I(t)\, \Delta t \tag{7} $$
where $a(t)$ is the action at time $t$, denoting the agent’s normalized control signal bounded within $[0, 1]$, which scales admissible operating commands (e.g., turbine/gate settings or spillway releases) according to physical and operational constraints. In DQN, $a(t)$ takes one of three discrete levels $\{0, 0.5, 1\}$ (retaining, moderate, full), whereas in DDPG and PPO, it varies continuously within $[0, 1]$, enabling smoother, fine-grained adjustments. Qualitatively, $a(t) = 0$ corresponds to a conservative posture (minimal release) and $a(t) = 1$ to an assertive posture (greater release/faster drawdown), with intermediate values yielding gradual changes. Formally, $a(t) = \pi(s_t)$, where $\pi$ is the learned control strategy and $s_t$ comprises inflow, storage, precipitation, and other hydrometeorological states [48,49]. For illustration only, an action may be mapped to a commanded discharge, $a(t)\, Q_{\max}$, with $Q_{\max}$ the feasible discharge capacity (accounting for turbine/spillway limits); in practice, this mapping is implemented via operational rules (level-dependent piecewise mappings, unit limits, environmental-flow minima) [50,51,52,53]. The temporal variability of $a(t)$ is evaluated later via the action-stability metric $\sigma_a$ (Equation (9)). Discharge at each step is estimated using the following equation:
$$ Q(t) = \max\left(0,\ \min\left(I(t) - \frac{S(t) - S(t-1)}{\Delta t},\ Q_{\max}\right)\right) \tag{8} $$
where $Q_{\max} = 5750$ m3/s (5500 m3/s through the spillway and 250 m3/s through the turbines) is the maximum allowable discharge. To improve stability, both the simulated inflow and storage time series are smoothed using a cyclic filter that reduces short-term fluctuations while preserving long-term trends [45,54]. This DQN-based optimization framework dynamically adjusts to changing inflow patterns, reduces the likelihood of storage exceeding the flood control threshold, and supports sustained, stable operation of the reservoir system under varying hydrological conditions.
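A minimal PyTorch sketch of the Q-network described above (two hidden layers of 400 and 300 ReLU units, three discrete release levels) is given below; the class and function names, and the state dimension, are illustrative rather than the study’s implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network with two hidden layers (400 and 300 ReLU units) and three outputs,
    one Q-value per discrete release level (0%, 50%, 100% of inflow)."""

    def __init__(self, state_dim: int, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_action(q_net: QNetwork, state: torch.Tensor) -> float:
    """Greedy selection: return the release fraction with the highest predicted Q-value."""
    levels = (0.0, 0.5, 1.0)
    with torch.no_grad():
        return levels[int(torch.argmax(q_net(state)))]
```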
While Equations (7) and (8) describe the general storage–discharge framework, it is important to note that the nature of the action $a(t)$ differs across the DRL methods. For DQN, $a(t)$ is discrete, representing distinct release decisions. However, in DDPG and PPO, $a(t)$ takes continuous values, enabling smoother, fine-grained control of reservoir releases. This continuous formulation allows the models to output any value between 0 and 1, corresponding to a proportional adjustment of discharge relative to the maximum allowable capacity. Thus, while Equations (7) and (8) remain applicable across all models, the granularity of $a(t)$ differs (discrete in DQN; continuous in DDPG/PPO), making DDPG and PPO better suited for continuous, real-time reservoir regulation.
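For reference, Equations (7) and (8) can be applied in a single daily step function that accepts either a discrete or a continuous action a(t) in [0, 1]; the sketch below is a simplified illustration and omits the cyclic smoothing and operational rules mentioned earlier.

```python
SECONDS_PER_DAY = 86_400      # Δt for daily resolution
Q_MAX = 5750.0                # maximum allowable discharge, m3/s (spillway plus turbines)

def daily_step(storage_prev_m3, inflow_m3s, action):
    """Apply a normalized action a(t) in [0, 1] for one day.

    Eq. (7): S(t) = S(t-1) + (1 - a(t)) * I(t) * Δt
    Eq. (8): Q(t) = max(0, min(I(t) - (S(t) - S(t-1)) / Δt, Q_max))
    Storage is in m3 and inflow in m3/s; smoothing and operating rules are omitted.
    """
    storage = storage_prev_m3 + (1.0 - action) * inflow_m3s * SECONDS_PER_DAY
    discharge = max(0.0, min(inflow_m3s - (storage - storage_prev_m3) / SECONDS_PER_DAY, Q_MAX))
    return storage, discharge

# Example: a moderate release (a = 0.5) on a 1000 m3/s inflow day.
new_storage, release = daily_step(2.0e9, 1000.0, 0.5)
```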
The Deep Deterministic Policy Gradient (DDPG) method optimizes reservoir operations by producing continuous discharge values, allowing for smoother and more adaptive water releases than the discrete-level approach used in DQN. This reduces abrupt variations in outflow, which can be critical for downstream flood control and ecological stability. DDPG incorporates two neural networks: the Actor Network, which receives the current reservoir state—including storage, inflow, precipitation, soil moisture, and other hydrometeorological factors—and generates a normalized continuous discharge decision. This output is scaled by the maximum allowable discharge capacity to ensure operational feasibility. The Critic Network evaluates the effectiveness of each state-action pair, enabling the Actor to refine its decisions over time. Storage is updated through a mass balance equation that accounts for inflow and the chosen discharge rate, while the actual discharge is derived by constraining the resulting flow within physically acceptable limits. The reward function used during training penalizes storage that exceeds the flood control threshold and deviations from the target storage level, thereby guiding the model toward flood-safe and balanced reservoir operations. To enhance training stability, DDPG applies experience replay and soft target network updates. These techniques prevent overfitting to recent conditions and help maintain convergence. Applied to the 1993–2022 observed dataset, the model effectively maintained storage within operational bounds and ensured effective flood mitigation through continuous and controlled reservoir adjustments.
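As an illustration of the soft target-network update mentioned above, the following sketch applies the standard rule theta_target ← tau·theta + (1 − tau)·theta_target to a pair of PyTorch modules; the value of tau is a common default, not one reported in this study.

```python
def soft_update(target_net, online_net, tau: float = 0.005):
    """Soft target update for DDPG: theta_target <- tau * theta + (1 - tau) * theta_target.

    `target_net` and `online_net` are torch.nn.Module instances; `tau` is an
    illustrative value, not one reported in the study.
    """
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```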
The Proximal Policy Optimization (PPO) approach enhances reservoir operation by learning optimal discharge behavior through a reward-guided framework. Like DDPG, PPO operates in a continuous action space, allowing the model to issue smooth and adaptive discharge decisions in response to varying hydrological states. However, PPO differs in its optimization strategy—rather than applying direct gradient updates, it constrains action updates within a clipped range. This prevents drastic shifts in discharge and promotes stable learning across episodes. The model architecture adopts an Actor-Critic structure, where the Actor Network processes the reservoir’s hydrometeorological state—including storage, inflow, rainfall, and soil moisture—and outputs a normalized continuous discharge rate. This value is scaled by the maximum discharge capacity to ensure operational realism. Simultaneously, the Critic Network evaluates the quality of the selected action by estimating its expected return, enabling iterative refinement of the discharge strategy. As in the DQN and DDPG approaches, PPO uses a reward function that penalizes excessive storage above the flood-safe threshold and deviations from the target storage level. This ensures that the model learns to balance flood risk mitigation with stable reservoir storage. When applied to 1993–2022 observed data, PPO effectively maintained controlled discharge adjustments, offering smooth transitions and improved resilience under dynamic inflow conditions. Its structured optimization process makes it particularly suitable for real-time applications where discharge changes must be both effective and reliable.
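The clipped update that distinguishes PPO can be illustrated with the standard clipped surrogate objective below; the clipping coefficient of 0.2 is the usual default and is assumed here, since the study does not report its hyperparameter values.

```python
import torch

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, clip_eps: float = 0.2):
    """Clipped surrogate objective used by PPO (returned as a loss to minimize).

    Probability ratios outside [1 - clip_eps, 1 + clip_eps] are clipped, which limits
    how far the updated release strategy can move from the previous one per update.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))
```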
DQN, DDPG, and PPO were selected to represent the principal DRL design choices that matter for reservoir operation: (i) action space (discrete vs. continuous), (ii) learning paradigm (value-based vs. actor–critic), and (iii) update stability under a nonlinear reward with hard safety constraints. DQN (value-based; discrete actions) reflects stepwise operational practice (banded gate/turbine settings) and yields interpretable, rule-like releases. DDPG (deterministic actor–critic; continuous actions) supports fine-grained adjustments needed for smooth hydrograph shaping and buffer management. PPO (stochastic actor–critic with clipped updates) provides robust convergence and controlled exploration, desirable under non-stationary hydrometeorology. Together, the trio benchmarks operational performance and stability across discrete/continuous and operation strategy/value paradigms under identical constraints, rather than emphasizing computational efficiency [13,40,55].

2.3.2. Action Stability Metric

This section details the action stability metric used to assess the robustness and implementability of DRL-optimized reservoir operations. The metric captures the temporal consistency of the agents’ operational decisions, reflecting how smooth or volatile gate operations are over time—an essential consideration for real-world dam management where frequent, abrupt changes are undesirable [56].
As defined in Section 2.3.1, the agent outputs a normalized action $a(t) \in [0, 1]$ (discrete for DQN; continuous for DDPG/PPO); here we focus on its temporal variability. We quantify variability across the simulation horizon using the action stability metric, defined as the standard deviation of the actions:
$$ \sigma_a = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( a(t) - \bar{a} \right)^2 } \tag{9} $$
where $\bar{a}$ is the mean action over all $T$ time steps, and $\sigma_a$ is the action stability [57].
Physically, $\sigma_a$ captures how much the operational signal (i.e., gate movement or discharge command) fluctuates around its mean value. A small $\sigma_a$ implies that the reservoir control strategy changes smoothly and predictably from day to day, while a large $\sigma_a$ indicates frequent, abrupt shifts in control intensity.
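Computing the action stability from a daily action series is straightforward; a one-function sketch (illustrative naming) is shown below.

```python
import numpy as np

def action_stability(actions):
    """Action stability: population standard deviation of the daily actions a(t) (Eq. (9))."""
    a = np.asarray(actions, dtype=float)
    return float(np.sqrt(np.mean((a - a.mean()) ** 2)))

# Example: a smooth policy yields a small value, an oscillating one a large value.
print(action_stability([0.50, 0.52, 0.51, 0.49]))   # small
print(action_stability([0.0, 1.0, 0.0, 1.0]))       # large
```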
In practical terms, such abrupt operational changes are undesirable and often unsafe at the Soyang River Dam, since gates and turbines cannot be adjusted instantaneously. Sudden release variations can induce rapid downstream stage fluctuations, increase flood risk, and promote channel erosion, while the reservoir’s large storage volume inherently limits instantaneous changes [58]. Accordingly, a DRL agent with low action variability reflects a stable, realistic control strategy that makes gradual, predictable adjustments to releases and storage. By contrast, high variability implies oscillations between aggressive releases and rapid retention that may appear effective in simulation but are physically infeasible and operationally unsafe [28].
Unlike threshold-based performance metrics that indicate whether flood-safe limits were exceeded, the action stability metric evaluates how those operations were executed—whether they were implemented in a smooth, physically consistent manner. By emphasizing stability, this study prioritizes strategies that are both effective for flood risk mitigation and consistent with real-world operational constraints and safety protocols.

3. Results

3.1. Performance Evaluations

Performance Evaluation of DQN, PPO, and DDPG Deep Reinforcement Learning Models

Table 3 presents a comparative evaluation of the deep reinforcement learning models—DDPG, PPO, and DQN—across four key metrics used to assess the effectiveness of reservoir operation strategies focused on optimizing reservoir operation for flood control.
PPO achieved the highest cumulative reward, suggesting that it was the most effective at maintaining storage near the operational target while avoiding flood conditions. DQN followed closely, while DDPG scored the lowest. This indicates that all models were capable of learning effective reservoir operation strategies, as reflected by their high cumulative rewards and relatively low violation counts. Despite differences in stability, each agent successfully optimized reservoir operations within the reward structure by balancing storage near the operational target while mitigating flood risks. PPO displayed the lowest variability in its actions, with an action stability of 0.0059, indicating that it was the most stable, with only minimal fluctuations in its decision-making process. DDPG maintained relatively stable outputs with an action stability of 0.0792, while DQN was the most variable at 0.1737. This elevated variability indicates that DQN’s operating strategy was more reactive, possibly due to its discrete decision-making structure [59]. While variability can be beneficial in highly dynamic systems, it can also lead to operational unpredictability if not well-regulated [60].
All models effectively avoided exceeding the total storage capacity, resulting in zero total capacity violations. This demonstrates that the DRL strategies inherently learned to operate within strict safety boundaries, a critical requirement for operations. However, when it comes to flood control violations, measured as the number of times the reservoir storage exceeded the flood control threshold, the results varied more noticeably. Both DDPG and DQN limited violations to just one occurrence, while PPO had six violations. This suggests that although PPO earned the highest cumulative reward, it occasionally allowed storage to encroach into the flood control buffer zone, possibly in pursuit of minimizing deviation from the target level.
DRL models showed strong overall performance but with clear trade-offs: PPO delivered the smoothest control and highest reward; DQN prioritized flood control compliance; DDPG offered continuous control with intermediate variability. Choosing among them depends on the operational priority—use PPO when stability and target tracking are paramount, DQN when minimizing flood control exceedances is the primary constraint, and DDPG when a balanced, continuous controller is preferred. Taken together, the results support a flexible operating stance that preserves flood-safe buffer capacity rather than strictly chasing the target level under variable inflows.
In addition to the reinforcement learning metrics summarized in Table 3, the models were further evaluated using standard hydrological performance indices to quantify the accuracy of DRL-based discharges against the observed operation (Table 6). These included RMSE, MAE, NSE, and KGE, together with 95% bootstrap confidence intervals and pairwise Wilcoxon signed-rank tests to assess statistical significance. All three DRL agents achieved NSE > 0.64 and KGE > 0.55, indicating strong agreement with observed discharge patterns, with DDPG performing marginally better than PPO and DQN. These results confirm that the DRL-optimized operations not only maintained physical and operational consistency but also delivered statistically validated improvements in discharge representation and flood risk moderation.
To evaluate the influence of the reward formulation on agent behavior, the relative weights of the flood-penalty and deviation-penalty terms were varied between (0.7–0.3), (0.5–0.5), and (0.3–0.7) while keeping all other model settings constant. The results (Table 4) show that cumulative rewards increase systematically with higher flood-penalty weighting, indicating that the agents adapt toward more conservative operations when flood mitigation is emphasized. Across all weighting schemes, the performance ranking remained consistent—PPO achieved the highest cumulative reward (9597–7784 across the weightings) and the most stable actions (σ ≤ 0.041), followed by DQN and DDPG. Flood control violations (≤6) and total-capacity violations (none) remained nearly unchanged across all settings, demonstrating that the learned operating strategies are robust to moderate changes in the reward composition. This consistency confirms that the adopted equal weighting (0.5–0.5) provides a balanced and interpretable trade-off between flood risk reduction and operational stability without biasing model performance toward either objective.
This sensitivity analysis confirms that the performance ranking of the DRL agents remains stable under varying reward weight ratios (0.7–0.3, 0.5–0.5, 0.3–0.7), demonstrating that the learned operation strategies are robust to moderate changes in the reward composition.
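A sketch of the re-weighted reward used for this sensitivity test is given below; the exact functional form R(t) = 1 − (w_f·P_f(t) + w_d·P_d(t)) is inferred from the description of the weight pairs and is therefore an assumption, though it reduces to Equation (4) at equal weights.

```python
def weighted_reward(p_flood, p_dev, w_flood=0.5, w_dev=0.5):
    """Re-weighted reward, assumed form: R(t) = 1 - (w_f * P_f(t) + w_d * P_d(t)).

    With w_flood = w_dev = 0.5 this reduces to Equation (4); (0.7, 0.3) and
    (0.3, 0.7) reproduce the alternative weightings examined in Table 4.
    """
    return 1.0 - (w_flood * p_flood + w_dev * p_dev)
```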

3.2. Deep Reinforcement Learning Model Results

In this section, we first present a consolidated high-inflow-day summary (Table 5) for four dates corresponding to the highest daily inflows, and then examine time-series patterns for inflow (Figure 5), storage (Figure 6), discharge (Figure 7), and flood risk (Figure 8).
Table 5 reports the exact storage, discharge, and flood risk (FR) values for the four benchmark dates (1995-08-25; 1999-08-01; 1999-08-02; 2006-07-15). For storage, all DRL strategies remained below the observed operation on every date, with a consistent ordering among the DRL models—DQN lowest, DDPG intermediate, and PPO highest. For discharge, DRL releases were generally below observed, with one exception: on 1999-08-01, the DDPG release exceeded the actual. This is consistent with anticipatory drawdown to create buffer capacity ahead of the following day’s larger inflow (1999-08-02): DDPG lowered storage and deepened the safety buffer (more negative FR) across the two-day window while keeping its peak risk in check. The per-date minima were as follows: PPO lowest on 1995-08-25; DQN lowest on 1999-08-01 and 1999-08-02; and a tie between DDPG and DQN on 2006-07-15. For flood risk, all DRL strategies reduced risk relative to the actual operations on every benchmark date—eliminating exceedances where they occurred—and typically followed the ordering DQN (most negative FR), then DDPG, then PPO (closest to the threshold), reflecting the progressively smaller—but still larger than observed—safety buffers maintained under DRL operation. These quantitative comparisons anchor the time-series patterns discussed in Figure 5, Figure 6, Figure 7 and Figure 8.

3.2.1. Inflow over Time for the Soyang Reservoir

Figure 5 illustrates the inflow to the Soyang Reservoir from 1993 to 2022, showing the temporal variability over three decades with alternating periods of frequent high inflows and prolonged low flows, reflecting fluctuations in inflow magnitude. The x-axis spans the full study period, while the y-axis represents reservoir inflows, ranging from near-zero baseflows to extreme (high-inflow) peaks exceeding 4000 m3/s. Several extreme inflow occurrences are evident in the daily record, including 25 August 1995 (4029 m3/s), 1 August 1999 (4230 m3/s), 2 August 1999 (4665 m3/s), and 15 July 2006 (4208 m3/s). These four dates serve as the benchmark high-inflow days used throughout the study and are marked in Figure 5 with vertical dotted lines. Note that the high-inflow markers for 1999-08-01 and 1999-08-02 are adjacent and visually overlap in Figure 5. These occurrences coincide with the summer monsoon season and typhoon seasons in Korea, when concentrated rainfall over the upper Han River basin generates rapid increases in runoff into the reservoir [27]. The sharp flood peaks highlighted in Figure 5 represent the most significant inflow extremes of the study period and reflect the strong interannual variability of monsoon and typhoon-driven hydrology at the Soyang River Dam, emphasizing the importance of effective reservoir regulation during such periods.

3.2.2. Water Storage over Time Based on DDPG, PPO, and DQN

Figure 6 shows reservoir storage at the Soyang Reservoir from 1993 to 2022 for the observed operation (dark gray) and the three DRL strategies (DDPG, PPO, DQN). All series exhibit a drawdown–refill pattern. The DRL trajectories generally track the timing of these fluctuations while maintaining lower storage than the observed operation during high-storage periods, implying enhanced flood-safety margins. PPO (red) appears visually distinct and slightly higher than the other DRL curves, whereas DDPG and DQN largely overlap through most of the record, making their differences less perceptible in the figure. Given the tight flood control constraints and a daily target-tracking objective, the range of feasible adjustments is narrow; during high-storage or high-inflow periods, releases can also be capped by operating limits (e.g., the flood control threshold or maximum release) [8,28]. In this regime, DQN’s discrete choices closely match DDPG’s continuous adjustments, yielding nearly indistinguishable storage trajectories [41]. Quantitatively, however, Table 5 confirms the typical ordering of DQN < DDPG < PPO in storage magnitude.

3.2.3. Water Discharge over Time Based on DDPG, PPO, and DQN

Figure 7 presents discharge at the Soyang Reservoir (1993–2022) in four stacked panels—(a) Actual, (b) DDPG, (c) PPO, and (d) DQN—sharing a common y-axis for comparison without overlap. DDPG follows the timing of the Actual series but with lower spike heights, indicating attenuation relative to observed operation. At the full-record scale, PPO and DQN are visually hard to distinguish; both show smaller, sparse spikes and therefore appear smoother, with less day-to-day variability than DDPG.
Smoothness here means that a small change in state leads to a small change in the recommended action; for example, if storage rises slightly, the optimal release also changes slightly rather than abruptly. At this level, discrete DQN steps therefore produce results similar to a continuous PPO strategy [61].
All panels mark the four benchmark high-inflow days with vertical dotted lines. At those dates, the Actual panel shows higher discharge than the DRL panels. Because storage and discharge are coupled by mass balance, the lower storage maintained by the DRL strategies near high-storage periods (Figure 6) appears here as smaller, less abrupt releases relative to the actual observed discharge. Based on a quantitative, emergent trend inferred from Table 5, the general discharge ordering across the four high-inflow events is DQN (lowest) < PPO (moderate) < DDPG (highest). This suggests that DQN tended to hold releases longer, PPO maintained intermediate and smoother adjustments, and DDPG responded more actively during high-inflow periods.
While Figure 7 provides a qualitative comparison of discharge dynamics among the DRL agents and the observed operation, a quantitative validation was performed to statistically verify these performance patterns. Using standard hydrological evaluation metrics—Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Nash–Sutcliffe Efficiency (NSE), and Kling–Gupta Efficiency (KGE)—each DRL-generated discharge series was compared against the observed discharge time series [62]. Bootstrap resampling (95% confidence intervals) and Wilcoxon signed-rank tests were applied to assess statistical significance [63,64].
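For transparency, the evaluation metrics and the pairwise test can be reproduced with standard NumPy/SciPy routines, as sketched below; the function names are illustrative, and the KGE follows the common 2009 formulation, which is assumed rather than stated in the text.

```python
import numpy as np
from scipy.stats import wilcoxon

def rmse(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((sim - obs) ** 2)))

def mae(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.mean(np.abs(sim - obs)))

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta Efficiency (2009 formulation, assumed)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def pairwise_wilcoxon(obs, sim_a, sim_b):
    """Wilcoxon signed-rank test on the absolute errors of two models over identical days."""
    obs = np.asarray(obs, float)
    err_a = np.abs(np.asarray(sim_a, float) - obs)
    err_b = np.abs(np.asarray(sim_b, float) - obs)
    return wilcoxon(err_a, err_b)   # returns the test statistic and p-value
```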
The results, summarized in Table 6, show that all DRL models achieved strong agreement with observed discharge. DDPG exhibited the lowest RMSE and MAE, followed by DQN and PPO. NSE values above 0.64 and KGE values around 0.57 confirm that all models reproduced the dominant discharge variability with good efficiency. The narrow confidence intervals (±9–10 m3/s for RMSE) and extremely low p-values ($<10^{-100}$) from the Wilcoxon tests indicate statistically significant yet small differences among the agents. Overall, DDPG achieved the highest hydrological fidelity, DQN showed moderate accuracy with discrete adaptations, and PPO favored smoother but slightly less exact trajectories.
The Wilcoxon signed-rank test compares the absolute errors of two models over identical time steps to determine whether their replication accuracies differ significantly [64,65]. This distinction is important because Table 3 evaluates optimization behavior (reward, stability, and constraint compliance), whereas Table 6 assesses how closely each model’s simulated discharge reproduces observed hydrological dynamics. Pairwise Wilcoxon comparisons were conducted among the three DRL models. The resulting p-values were $9.83 \times 10^{-104}$ for PPO compared with DDPG, $2.76 \times 10^{-113}$ for PPO compared with DQN, and $2.07 \times 10^{-106}$ for DDPG compared with DQN. These extremely small probabilities confirm that the differences in absolute errors between model pairs are statistically significant, indicating that each model demonstrates a distinct level of replication accuracy. Nevertheless, because the confidence intervals and performance metrics differ only slightly, these differences are statistically strong but practically modest, showing that all agents perform well, with DDPG maintaining a consistently lower error [66].
Quantitatively, compared to the observed operation, the DRL-based strategies reduced mean flood control violations by over 80% and improved discharge replication accuracy, with RMSE reductions of approximately 2–4% and MAE reductions near 3%. DDPG achieved the best overall accuracy (RMSE ≈ 49 m3/s; NSE = 0.67), followed closely by DQN (RMSE ≈ 50 m3/s; NSE = 0.66) and PPO (RMSE ≈ 51 m3/s; NSE = 0.65). These improvements confirm that the DRL-controlled operations not only maintained safety constraints but also achieved measurable hydrological fidelity relative to observed operation.

3.2.4. Flood Risk Based on DDPG, PPO, and DQN

Figure 8 presents flood risk at the Soyang Reservoir from 1993 to 2008 for the observed operation (black) and the three DRL strategies (PPO—red, DDPG—blue, DQN—green), with a horizontal dashed line indicating the flood threshold (FR = 0). Although the figure view is limited to 2008 for visual clarity, the underlying values and trends remain identical to the full-record analysis (1993–2022).
Visually, PPO most closely follows the actual series, with peaks and troughs aligned in time yet generally lying just below the black curve. DDPG and DQN remain consistently lower (more negative FR), signifying greater safety margins and stronger damping of flood risk. From 1993 to 2008, DDPG and DQN behave similarly and sit below the Actual curve most of the time, indicating lower flood risk. The closeness between DDPG and DQN suggests that for highly constrained reservoirs like Soyang, both discrete and continuous control formulations can yield nearly equivalent operational strategies. In practice, this indicates that the decision landscape is smooth enough that even discrete decision structures (DQN) can capture the same risk-mitigating patterns as their continuous counterparts (DDPG) [41,62].
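To make the discrete-versus-continuous distinction concrete, the fragment below contrasts a DQN-style action space with a DDPG-style one using gymnasium space definitions; the five adjustment levels and the ±2% bounds are placeholders, not the action definitions used in this study.

```python
import numpy as np
from gymnasium import spaces

# DQN-style discrete formulation: the agent chooses one of a few predefined
# storage adjustments. The five fractional levels are illustrative placeholders.
action_levels = np.array([-0.02, -0.01, 0.0, 0.01, 0.02])   # fraction of current storage
discrete_space = spaces.Discrete(len(action_levels))

# DDPG-style continuous formulation: the agent outputs any adjustment within the interval.
continuous_space = spaces.Box(low=-0.02, high=0.02, shape=(1,), dtype=np.float32)

# Both map to the same physical decision (a change in target storage); when the reward
# landscape is smooth, a sufficiently fine discrete grid can mimic the continuous policy.
idx = discrete_space.sample()
print("discrete adjustment:", action_levels[idx])
print("continuous adjustment:", continuous_space.sample()[0])
```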
The four vertical dotted lines mark the benchmark high-inflow days. Around these markers, the Actual series displays sharper crests or brief exceedances of the flood threshold, whereas all DRL-based curves remain lower, reinforcing the impression of moderated peaks and preserved buffer capacity under DRL operation.
Taken together, Figure 8 demonstrates that all DRL strategies effectively reduce flood risk amplitude while maintaining the natural drawdown–refill rhythm of the reservoir. PPO continues to track the observed operation most closely but with slightly improved safety; DDPG and DQN sustain lower flood risk levels overall, confirming their stronger conservative behavior. For exact FR values at the highlighted dates, refer to Table 5.
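The FR values discussed above and listed in Table 5 follow the interpretation in Table 2, which is consistent with a storage-based exceedance index of the form sketched below; the normalization and the placeholder volumes are assumptions for illustration, not the study's exact formulation or the Soyang Reservoir's actual operating levels.

```python
def flood_risk(storage, flood_threshold, total_capacity):
    """Deterministic storage-based exceedance index matching the reading of Table 2:
    FR < 0        storage below the flood threshold (no flood risk)
    FR = 0        storage exactly at the flood-safe limit
    0 < FR <= 1   storage within the flood control zone (potential risk)
    FR > 1        total flood buffer exceeded (overflow risk)
    Illustrative normalization; the study's exact formulation may differ."""
    return (storage - flood_threshold) / (total_capacity - flood_threshold)

# Placeholder volumes in million m3: storage sits slightly inside the flood control zone.
print(flood_risk(storage=2200.0, flood_threshold=2100.0, total_capacity=2900.0))  # 0.125
```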

4. Discussion

Figure 9 zooms in on 2006-07-15 at the Soyang Reservoir to show, in one place, how inflow translated into releases and storage under optimized DRL operating strategies. A vertical dotted line marks this reference date and is repeated across panels for alignment. In Panel (a), the blue inflow curve captures the sharp rise and recession typical of mid-July storms in the Gangwon highlands [67].
The discharge series reveals distinct operating behaviors. The actual operation (black) maintains modest releases through the rising limb and then increases during the recession. All three DRL agents suppress the discharge peaks more strongly: PPO (green) holds releases comparatively low before the crest and raises them only after the peak has passed; DDPG (red) permits a short pulse near the crest but otherwise stays below the observed discharge line; and DQN (orange) is most conservative pre-peak, increasing only modestly on the falling limb.
Panel (b) decomposes PPO into inflow, discharge, and their difference, clarifying the routing mechanism. During the rising limb, when inflow exceeds discharge, the purple “Water Stored (PPO)” area dominates, indicating deliberate retention that absorbs the flood wave and lowers the immediate downstream load. After the crest, the red “Water Released (PPO)” area expands as discharge overtakes inflow, returning the reservoir toward its operating band (target range) on the recession rather than at the peak. By shifting release to the recession—when river levels are falling and channel conveyance is higher (greater capacity to pass flow without spilling out)—the operating strategy attenuates the flood peak and avoids creating a secondary peak. This panel illustrates how PPO moderates discharge by retaining inflow during the flood rise (storage increase) and releasing it gradually during the recession, maintaining a buffer against extreme conditions. The stored-then-released sequence explains why the flood risk metric remained at or below the threshold on this date (Figure 8) and why it was lower under DRL than under the actual operation.
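The stored/released decomposition in Panel (b) corresponds to the positive and negative parts of the inflow–discharge difference; a minimal sketch is given below with illustrative values standing in for the PPO series.

```python
import numpy as np

# Decompose inflow minus discharge into the "Water Stored" and "Water Released" areas of Panel (b).
# Series values are illustrative; daily PPO discharge would come from the mass-balance step above.
inflow = np.array([300.0, 900.0, 1500.0, 800.0, 400.0])     # m3/s
discharge = np.array([250.0, 400.0, 600.0, 900.0, 700.0])   # m3/s
net = inflow - discharge
water_stored = np.clip(net, 0.0, None)      # inflow exceeding release (rising-limb retention)
water_released = np.clip(-net, 0.0, None)   # release exceeding inflow (recession drawdown)
print(water_stored, water_released)
```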
Panel (c) shows the daily storage trajectory over the analysis window, with a vertical marker indicating the reference date for alignment with panels (a) and (b). Storage rises sharply through and just after the inflow crest, reaches its maximum about 4–5 days later on the recession limb when inflow and discharge meet, and then declines gradually as discharge overtakes inflow. The crest remains well below total capacity, and the drawdown is paced rather than abrupt—evidence of controlled recovery after attenuation. This evolution is consistent with the single-day storage values reported in Table 5 and indicated in Figure 6, where PPO’s storage on 2006-07-15 was lower than the observed value, reflecting an operating strategy that maintains buffer volume prior to extreme inflow. The inclusion of total reservoir storage gives additional context for understanding how system volume evolves over time and supports evaluation of operational trade-offs.
Taken together, Figure 9 illustrates a coherent hydrologic sequence at the Soyang Reservoir. The DRL operating strategies attenuate the flood peak by retaining water through the rising limb (Panel (b)), limit discharge relative to the inflow crest (Panel (a)), and then release gradually on the recession while keeping storage below capacity and recovering smoothly afterward (Panel (c)). PPO achieves a balance between flood-peak attenuation and controlled recovery, DQN shows the most conservative behavior with the lowest releases, and DDPG responds more actively near the peak—patterns that are consistent with the operational tendencies summarized in Figure 5, Figure 6, Figure 7 and Figure 8, where DRL models maintained lower flood risk than the observed operation.
The DRL models were trained and evaluated using daily resolution data, as most observed and remote-sensing variables (e.g., precipitation, soil moisture, temperature, evaporation, and humidity) are available only at this temporal scale. This choice aligns with the study’s focus on strategic, long-term reservoir operation optimization rather than minute-scale real-time control. While sub-daily operation is essential for short-term emergency management, the large storage volume and long routing lag of the Soyang Reservoir Dam mean that day-to-day inflow and storage variations primarily govern operational dynamics. Hence, the daily time step remains appropriate for learning the underlying state–action relationships that influence flood risk mitigation at the basin scale. Nonetheless, future extensions could integrate a nested sub-daily routing or inflow-forecasting module to enable hybrid (daily-to-hourly) control once higher-frequency hydrometeorological data become available [13,41].
In practice, real-time flood operations at large multipurpose dams are conducted at hourly or sub-hourly intervals, especially during emergency events. The daily framework adopted here therefore represents strategic-scale optimization rather than real-time control. This temporal simplification means that, while the DRL-based strategies effectively capture day-to-day hydrologic responses and storage–release trade-offs, their generalizability to operational dispatch decisions at finer time scales remains limited until the high-frequency inflow and gate-operation data required for nested sub-daily models become available.
To interpret the internal behavior of the reinforcement learning models, a SHAP (SHapley Additive exPlanations) analysis was conducted to quantify how each input variable contributed to the models’ operational decisions (Figure 10) [68,69]. The bar plots (Panels (a), (c), and (e)) show the mean absolute SHAP values, ranking variables by their overall importance across all decisions, while the beeswarm plots (Panels (b), (d), and (f)) show how feature values (red for high, blue for low) drive the model outputs positively or negatively, illustrating both the magnitude and direction of influence. A higher mean SHAP value implies greater overall influence on the model’s output (i.e., the learned reservoir release or storage action). In the beeswarm plots, positive SHAP values indicate that the feature pushes the decision toward more release, negative values push it toward more retention, and values near zero indicate that the feature has little effect on that decision. For example, in Figure 10e, water storage is the fourth most important variable. In the corresponding Panel (f), the color of each dot represents the actual (standardized) water storage value at that time step (red = high, blue = low), and each dot corresponds to one daily observation—reflecting the reservoir’s storage state on that day rather than the date itself. Most high water storage values (red points) align with positive SHAP values, indicating that when the reservoir holds a large volume of water, the DQN agent generally increases discharge. However, some overlap around zero suggests that this relationship depends on concurrent hydrological conditions.
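A model-agnostic SHAP workflow of the kind summarized in Figure 10 can be sketched as follows; the toy policy function, feature list, background size, and sample counts are illustrative assumptions standing in for the trained DRL agents and their standardized daily state inputs.

```python
import numpy as np
import shap

feature_names = ["precipitation", "soil_moisture", "temperature", "humidity",
                 "evaporation", "water_storage", "power_gen_discharge", "retention_rate"]

def policy_fn(states: np.ndarray) -> np.ndarray:
    # Toy stand-in for a trained agent's forward pass (state -> release action).
    # Replace with the actual PPO/DQN/DDPG network's prediction for real use.
    weights = np.linspace(0.1, 0.8, states.shape[1])
    return states @ weights

# Standardized daily state matrix (n_days x n_features); synthetic here for illustration.
rng = np.random.default_rng(0)
states = rng.normal(size=(365, len(feature_names)))

# KernelExplainer is model-agnostic: a small background sample defines the baseline state.
explainer = shap.KernelExplainer(policy_fn, states[:50])
shap_values = explainer.shap_values(states[:200])   # per-day, per-feature contributions

# Bar plot = mean |SHAP| ranking; beeswarm-style summary = signed, per-day effects
# coloured by the (standardized) feature value, as in Panels (a)-(f) of Figure 10.
shap.summary_plot(shap_values, states[:200], feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values, states[:200], feature_names=feature_names)
```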
For the DDPG model (Panels (a)–(b)), the most influential variables were soil moisture, humidity, evaporation, and water storage, indicating that DDPG’s discharge behavior is strongly governed by hydrological and surface–atmospheric states. The beeswarm plot shows that high soil moisture values (red points) mostly produce negative SHAP values, implying that the agent tends to reduce discharge when the basin is already saturated—likely reflecting a cautious strategy to conserve flood storage until inflows increase. In contrast, low soil moisture conditions (blue points) yield near-zero or slightly positive SHAP values, suggesting minimal or slightly increased releases during dry states. Similar interactions are observed for humidity and evaporation, confirming that DDPG captures physically realistic trade-offs between basin wetness, storage, and release behavior.
For the PPO model (Panels (c)–(d)), the most influential features included soil moisture, retention rate, total discharge, and power generation discharge, suggesting a stronger integration of hydrological and operational signals. High soil moisture values correspond to positive SHAP impacts, indicating that under wet conditions, the agent tends to increase discharge. High retention rate values are near zero or slightly negative, reflecting conservative release adjustments when storage is being maintained. In contrast, high power generation discharge shows predominantly negative SHAP contributions, suggesting compensatory behavior—when substantial flow has already been released for power generation, the agent reduces additional discharge actions.
Finally, the DQN model (Panels (e)–(f)) exhibited dominant contributions from temperature, power generation discharge, soil moisture, and water storage. The beeswarm plot shows that high temperature corresponds to positive SHAP values, implying that discharge increases under warmer conditions—consistent with seasonal drawdown or pre-emptive flood control. High power generation discharge values tend to produce negative SHAP impacts, indicating compensatory reductions in total discharge when turbines are already active. High soil moisture is associated with mostly negative SHAP values, with some points near zero, suggesting restrained releases under saturated conditions while still permitting limited releases, whereas high water storage yields positive SHAP effects, reflecting increased release pressure when the reservoir is fuller.
Overall, the SHAP analysis across all three models demonstrates that deep reinforcement learning agents autonomously learned hydrologically consistent and interpretable operation strategies. DDPG exhibited strong environmental-state awareness, PPO balanced hydrological and operational drivers, and DQN responded primarily to direct climatic cues. Collectively, these findings confirm that deep reinforcement learning can produce adaptive and explainable control behaviors grounded in physically meaningful relationships between environmental conditions and reservoir operation.
To further clarify the role of these multi-source variables, the SHAP analysis showed that variables such as soil moisture, humidity, and evaporation—though not direct flood indicators—substantially influenced discharge decisions. Their inclusion enhanced the agents’ ability to anticipate flood risk buildup through better representation of basin wetness and atmospheric demand. Preliminary comparisons using only ground-based variables yielded slightly higher RMSE and lower cumulative rewards, confirming that multi-source inputs improved model stability and hydrological realism.
The DRL-based operating strategies (DQN, DDPG, PPO) developed in this study utilized a diverse set of hydrometeorological variables—including precipitation, soil moisture, and temperature—to guide reservoir operation decisions. Although these variables were not explicitly visualized, they served as essential inputs, enabling dynamic responses to changing hydrological conditions and informing both flood risk estimation and discharge optimization. A key contribution of this work lies in the integration of observed data, remote-sensing products, and DRL-driven decision-making into a unified framework, allowing the operating strategies to make context-aware release decisions and to manage storage adaptively. Unlike many existing studies—such as Phankamolsil et al. [13], who applied DDPG without explicit flood risk metrics; Castro-Freibott et al. [70], who used PPO but without integrating remote-sensing inputs; and Castelletti et al. [71], who applied Q-learning without examining adaptive capacity at daily resolution—this study leverages multi-source datasets and explicitly incorporates flood risk trade-offs. Thus, this study advances previous DRL applications by demonstrating flood-peak attenuation at daily resolution with paced recession releases and maintained storage buffers, delivering a resilient and operationally robust approach to optimized reservoir operation under hydrological extremes.

5. Conclusions

This study demonstrated that deep reinforcement learning (DRL) can be used to optimize operation at the Soyang Reservoir by coordinating inflow, storage, and discharge to moderate flood-related risk over a 30-year daily record (1993–2022). Relative to observed operation, the DRL strategies delivered safer performance—larger storage buffers, typically lower peak discharges on high-inflow days, and low violation counts—under the same hydrologic conditions. Based on the flood risk (FR) metric, DRL-driven operation strategies generally flatten peak discharges, keep storage closer to the flood-safe bound (FR ≈ 0), increase the safety buffer (more negative FR) on high-inflow days, and maintain fewer threshold exceedances relative to observed operation at daily resolution—both on the highlighted dates and across the 1993–2022 record. Rather than predicting discharge directly, discharge was derived via a mass-balance formulation from optimized storage while preserving observed inflows, ensuring physically consistent releases under the original hydrologic forcing.
Over the full 1993–2022 period, all three agents—DQN, PPO, and DDPG—learned physically plausible, safety-aware behavior. PPO achieved the highest cumulative reward and the most stable actions. DQN ranked second on reward and prioritized flood control compliance, maintaining the largest buffers with conservative releases on the rising limb (pre-inflow-peak). DDPG provided continuous control with intermediate responsiveness, occasionally allowing a short near-crest pulse. These roles align with the time-series patterns at the highlighted events: DQN tends to maintain the largest buffers, PPO produces smoother, mid-level releases, and DDPG is more active near peaks.
Analyses anchored to the largest inflows—25 Aug 1995, 1–2 Aug 1999, and 15 Jul 2006—show that DRL-optimized operation reduced storage relative to observed operation during the peak, moderated releases on the rising limb, and shifted releases to the recession. In combination, these behaviors kept the flood risk at or below the threshold (FR ≤ 0) for the DRL runs while preserving buffer capacity. The dominant operational pattern was clear: retain during the rising limb, then release during the recession—timing that leverages higher channel conveyance and lower downstream risk. For PPO in particular, the inflow–release decomposition makes the mechanism explicit: water stored dominates pre-crest, followed by water released on the recession, avoiding secondary peaks while maintaining buffer.
A central contribution is that the DRL operation strategies were informed by a rich set of observed and remote-sensing hydrometeorological inputs, enabling context-aware decisions that link inflow, storage, and discharge. Compared with traditional optimization methods such as DP/SDP, which are hampered by the curse of dimensionality and rigid system representations, the DRL framework here scales to high-dimensional inputs and captures nonlinear interactions without hand-crafted rule sets—moving beyond static rules and limited-variable models. The deterministic storage-based exceedance index provided an operationally relevant measure that translated daily hydrologic variability into actionable reservoir operation.
Several limitations bound interpretation: the “flood risk” metric is deterministic and storage-based (not a full probabilistic risk); discharge was derived by continuity rather than by a hydraulic routing model; and the analyses were conducted at a daily time step for a single reservoir. Nevertheless, the DRL approaches tested here show that operationally meaningful attenuation at the Soyang Reservoir can be achieved by shaping the timing and magnitude of releases relative to the inflow hydrograph, through explicit optimization of an objective function that penalizes flood-threshold exceedance and deviation from the target storage while respecting physical and operational constraints. Future research should explore hybrid DRL–supervised learning frameworks, assess the robustness of DRL-based reservoir operations under projected climate change scenarios, and couple the framework with hydrodynamic and inundation models to further enhance adaptability and long-term reliability.

Author Contributions

Conceptualization, F.S. and K.S.J.; methodology, F.S.; software, F.S.; validation, F.S.; formal analysis, F.S.; investigation, F.S.; resources, K.S.J.; data curation, F.S.; writing—original draft preparation, F.S.; writing—review and editing, F.S. and K.S.J.; visualization, F.S.; supervision, K.S.J.; project administration, K.S.J.; funding acquisition, K.S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Environmental Industry and Technology Institute (KEITI) (Grant number: 2022003460001).

Data Availability Statement

The data presented in this study are openly available in the Water Resources Management Information System (WAMIS) of Korea at https://www.wamis.go.kr/ (accessed on 9 November 2025).

Acknowledgments

The authors thank Korea Water Resources Corporation (K-Water) and the Han River Flood Control Office (HRFCO) for providing dam-related hydrometeorological data. We also acknowledge UCSB/CHG, ECMWF, Copernicus, CERN/Zenodo, NASA/CGIAR, NASA GES DISC, and NASA USGS for freely providing various datasets, including CHIRPS, ERA5-Land, soil texture, SRTM, humidity, soil moisture, and Landsat data.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Bakis, R. Electricity Production Opportunities from Multipurpose Dams (Case Study). Renew. Energy 2007, 32, 1723–1738. [Google Scholar] [CrossRef]
  2. Yu, X.; Xu, Y.P.; Gu, H.; Guo, Y. Multi-Objective Robust Optimization of Reservoir Operation for Real-Time Flood Control under Forecasting Uncertainty. J. Hydrol. 2023, 620, 129421. [Google Scholar] [CrossRef]
  3. Thompson, C.M.; Frazier, T.G. Deterministic and Probabilistic Flood Modeling for Contemporary and Future Coastal and Inland Precipitation Inundation. Appl. Geogr. 2014, 50, 1–14. [Google Scholar] [CrossRef]
  4. Lu, Q.; Zhong, P.-A.; Xu, B.; Zhu, F.; Ma, Y.; Wang, H.; Xu, S. Risk Analysis for Reservoir Flood Control Operation Considering Two-Dimensional Uncertainties Based on Bayesian Network. J. Hydrol. 2020, 589, 125353. [Google Scholar] [CrossRef]
  5. Mahootchi, M.; Tizhoosh, H.R.; Ponnambalam, K.P. Reservoir Operation Optimization by Reinforcement Learning. J. Water Manag. Model. 2007, 8, 165–184. [Google Scholar] [CrossRef]
  6. Delipetrev, B.; Jonoski, A.; Solomatine, D. Optimal Reservoir Operation Policies Using Novel Nested Algorithms; European Geosciences Union: Munich, Germany, 2015; Volume 17. [Google Scholar]
  7. Wu, X.; Cheng, C.; Lund, J.R.; Niu, W.; Miao, S. Stochastic Dynamic Programming for Hydropower Reservoir Operations with Multiple Local Optima. J. Hydrol. 2018, 564, 712–722. [Google Scholar] [CrossRef]
  8. Giuliani, M.; Galelli, S.; Soncini-Sessa, R. A Dimensionality Reduction Approach for Many-Objective Markov Decision Processes: Application to a Water Reservoir Operation Problem. Environ. Model. Softw. 2014, 57, 101–114. [Google Scholar] [CrossRef]
  9. Wu, W.; Eamen, L.; Dandy, G.; Razavi, S.; Kuczera, G.; Maier, H.R. Beyond Engineering: A Review of Reservoir Management through the Lens of Wickedness, Competing Objectives and Uncertainty. Environ. Model. Softw. 2023, 167, 105777. [Google Scholar] [CrossRef]
  10. Hung, F.; Yang, Y.C.E. Assessing Adaptive Irrigation Impacts on Water Scarcity in Nonstationary Environments—A Multi-Agent Reinforcement Learning Approach. Water Resour. Res. 2021, 57, e2020WR029262. [Google Scholar] [CrossRef]
  11. Jiang, Q.; Li, J.; Sun, Y.; Huang, J.; Zou, R.; Ma, W.; Guo, H.; Wang, Z.; Liu, Y. Deep-Reinforcement-Learning-Based Water Diversion Strategy. Environ. Sci. Ecotechnology 2024, 17, 100298. [Google Scholar] [CrossRef] [PubMed]
  12. Luo, W.; Wang, C.; Zhang, Y.; Zhao, J.; Huang, Z.; Wang, J.; Zhang, C. A Deep Reinforcement Learning Approach for Joint Scheduling of Cascade Reservoir System. J. Hydrol. 2025, 651, 132515. [Google Scholar] [CrossRef]
  13. Phankamolsil, Y.; Rittima, A.; Sawangphol, W.; Kraisangka, J.; Tabucanon, A.S.; Talaluxmana, Y.; Vudhivanich, V. Deep Reinforcement Learning for Multiple Reservoir Operation Planning in the Chao Phraya River Basin. Model. Earth Syst. Environ. 2025, 11, 102. [Google Scholar] [CrossRef]
  14. Kim, S.K.; Ahn, H.; Kang, H.; Jeon, D.J. Identification of Preferential Target Sites for the Environmental Flow Estimation Using a Simple Flowchart in Korea. Environ. Monit. Assess. 2022, 194, 215. [Google Scholar] [CrossRef] [PubMed]
  15. Xu, W.; Zhang, X.; Peng, A.; Liang, Y. Deep Reinforcement Learning for Cascaded Hydropower Reservoirs Considering Inflow Forecasts. Water Resour. Manag. 2020, 34, 3003–3018. [Google Scholar] [CrossRef]
  16. Li, Z.; Bai, L.; Tian, W.; Yan, H.; Hu, W.; Xin, K.; Tao, T. Online Control of the Raw Water System of a High-Sediment River Based on Deep Reinforcement Learning. Water 2023, 15, 1131. [Google Scholar] [CrossRef]
  17. Xu, H.; Yan, Z.; Xuan, J.; Zhang, G.; Lu, J. Improving Proximal Policy Optimization with Alpha Divergence. Neurocomputing 2023, 534, 94–105. [Google Scholar] [CrossRef]
  18. Ho, C.-H.; Kim, H.-A.; Cha, Y.; Do, H.-S.; Kim, J.; Kim, J.; Park, S.K.; Yoo, H.-D. Recent Changes in Summer Rainfall Characteristics in Korea. J. Eur. Meteorol. Soc. 2025, 2, 100009. [Google Scholar] [CrossRef]
  19. Bowes, B.D.; Tavakoli, A.; Wang, C.; Heydarian, A.; Behl, M.; Beling, P.A.; Goodall, J.L. Flood Mitigation in Coastal Urban Catchments Using Real-Time Stormwater Infrastructure Control and Reinforcement Learning. J. Hydroinformatics 2021, 23, 529–547. [Google Scholar] [CrossRef]
  20. Kåge, L.; Milić, V.; Andersson, M.; Wallén, M. Reinforcement Learning Applications in Water Resource Management: A Systematic Literature Review. Front. Water 2025, 7, 1537868. [Google Scholar] [CrossRef]
  21. Wan, Z.; Li, W.; He, M.; Zhang, T.; Chen, S.; Guan, W.; Hua, X.; Zheng, S. Research on Long-Term Scheduling Optimization of Water–Wind–Solar Multi-Energy Complementary System Based on DDPG. Energies 2025, 18, 3983. [Google Scholar] [CrossRef]
  22. Qian, X.; Wang, B.; Chen, J.; Fan, Y.; Mo, R.; Xu, C.; Liu, W.; Liu, J.; Zhong, P.-A. An Explainable Ensemble Deep Learning Model for Long-Term Streamflow Forecasting under Multiple Uncertainties. J. Hydrol. 2025, 662, 133968. [Google Scholar] [CrossRef]
  23. Sung, J.; Kang, B. Comparative Study of Low Flow Frequency Analysis Using Bivariate Copula Model at Soyanggang Dam and Chungju Dam. Hydrology 2024, 11, 79. [Google Scholar] [CrossRef]
  24. Lee, S.; Kim, J. Predicting Inflow Rate of the Soyang River Dam Using Deep Learning Techniques. Water 2021, 13, 2447. [Google Scholar] [CrossRef]
  25. K-water. 2018 Sustainability Report: Providing a Brighter, Happier, and More Prosperous Future with Water; Publication No. 2018-MA-GP-18-107; K-water: Daejeon, Republic of Korea, 2018; Available online: https://www.kwater.or.kr/web/eng/download/smreport/2018_SMReport.pdf?utm (accessed on 9 November 2025).
  26. Kim, J.S.; Jain, S.; Kang, H.Y.; Moon, Y.I.; Lee, J.H. Inflow into Korea’s Soyang Dam: Hydrologic Variability and Links to Typhoon Impacts. J. Hydro-Environ. Res. 2019, 22, 50–56. [Google Scholar] [CrossRef]
  27. Kwak, J. An Assessment of Dam Operation Considering Flood and Low-Flow Control in the Han River Basin. Water 2021, 13, 733. [Google Scholar] [CrossRef]
  28. Zhang, L.; Deng, C.; Wei, J.; Zou, J. Assessing the Impacts of Climate Change and Land Use/Land Cover Data Characteristics on Streamflow Using the SWAT Model in the Upper Han River Basin. J. Hydrol. Reg. Stud. 2025, 61, 102764. [Google Scholar] [CrossRef]
  29. Arregocés, H.A.; Rojano, R.; Pérez, J. Validation of the CHIRPS Dataset in a Coastal Region with Extensive Plains and Complex Topography. Case Stud. Chem. Environ. Eng. 2023, 8, 100452. [Google Scholar] [CrossRef]
  30. Egorov, A.V.; Roy, D.P.; Zhang, H.K.; Li, Z.; Yan, L.; Huang, H. Landsat 4, 5 and 7 (1982 to 2017) Analysis Ready Data (ARD) Observation Coverage over the Conterminous United States and Implications for Terrestrial Monitoring. Remote. Sens. 2019, 11, 447. [Google Scholar] [CrossRef]
  31. Xu, X.; Chen, F.; Wang, B.; Harrison, M.T.; Chen, Y.; Liu, K.; Zhang, C.; Zhang, M.; Zhang, X.; Feng, P.; et al. Unleashing the Power of Machine Learning and Remote Sensing for Robust Seasonal Drought Monitoring: A Stacking Ensemble Approach. J. Hydrol. 2024, 634, 131102. [Google Scholar] [CrossRef]
  32. McNally, A.; Arsenault, K.; Kumar, S.; Shukla, S.; Peterson, P.; Wang, S.; Funk, C.; Peters-Lidard, C.D.; Verdin, J.P. A Land Data Assimilation System for Sub-Saharan Africa Food and Water Security Applications. Sci. Data 2017, 4, 170012. [Google Scholar] [CrossRef]
  33. Jung, H.C.; Getirana, A.; Policelli, F.; McNally, A.; Arsenault, K.R.; Kumar, S.; Tadesse, T.; Peters-Lidard, C.D. Upper Blue Nile Basin Water Budget from a Multi-Model Perspective. J. Hydrol. 2017, 555, 535–546. [Google Scholar] [CrossRef]
  34. Qi, W.; Liu, J.; Chen, D. Evaluations and Improvements of GLDAS2.0 and GLDAS2.1 Forcing Data’s Applicability for Basin Scale Hydrological Simulations in the Tibetan Plateau. J. Geophys. Res. Atmos. 2018, 123, 13,128–13,148. [Google Scholar] [CrossRef]
  35. Gomis-Cebolla, J.; Rattayova, V.; Salazar-Galán, S.; Francés, F. Evaluation of ERA5 and ERA5-Land Reanalysis Precipitation Datasets over Spain (1951–2020). Atmos. Res. 2023, 284, 106606. [Google Scholar] [CrossRef]
  36. Dong, G.; Huang, W.; Smith, W.A.P.; Ren, P. A Shadow Constrained Conditional Generative Adversarial Net for SRTM Data Restoration. Remote Sens. Environ. 2020, 237, 111602. [Google Scholar] [CrossRef]
  37. Corral-Pazos-de-Provens, E.; Rapp-Arrarás, Í.; Domingo-Santos, J.M. Estimating Textural Fractions of the USDA Using Those of the International System: A Quantile Approach. Geoderma 2022, 416, 115783. [Google Scholar] [CrossRef]
  38. Chirachawala, C.; Shrestha, S.; Babel, M.S.; Virdis, S.G.P.; Wichakul, S. Evaluation of Global Land Use/Land Cover Products for Hydrologic Simulation in the Upper Yom River Basin, Thailand. Sci. Total Environ. 2020, 708, 135148. [Google Scholar] [CrossRef] [PubMed]
  39. Qi, J.; Lee, S.; Du, X.; Ficklin, D.L.; Wang, Q.; Myers, D.; Singh, D.; Moglen, G.E.; McCarty, G.W.; Zhou, Y.; et al. Coupling Terrestrial and Aquatic Thermal Processes for Improving Stream Temperature Modeling at the Watershed Scale. J. Hydrol. 2021, 603, 126983. [Google Scholar] [CrossRef]
  40. Xu, W.; Meng, F.; Guo, W.; Li, X.; Fu, G. Deep Reinforcement Learning for Optimal Hydropower Reservoir Operation. J. Water Resour. Plan. Manag. 2021, 147, 04021045. [Google Scholar] [CrossRef]
  41. Ahmad, M.J.; Cho, G.H.; Choi, K.S. Historical Climate Change Impacts on the Water Balance and Storage Capacity of Agricultural Reservoirs in Small Ungauged Watersheds. J. Hydrol. Reg. Stud. 2022, 41, 101114. [Google Scholar] [CrossRef]
  42. Wang, N.; Liu, L.; Shi, T.; Wang, Y.; Huang, J.; Ye, R.; Lian, Z. Study of the Impact of Reservoir Water Level Decline on the Stability Treated Landslide on Reservoir Bank. Alex. Eng. J. 2023, 65, 481–492. [Google Scholar] [CrossRef]
  43. Lee, E.; Ji, J.; Lee, S.; Yoon, J.; Yi, S.; Yi, J. Development of an Optimal Water Allocation Model for Reservoir System Operation. Water 2023, 15, 3555. [Google Scholar] [CrossRef]
  44. Ghobadi, F.; Kang, D. Application of Machine Learning in Water Resources Management: A Systematic Literature Review. Water 2023, 15, 620. [Google Scholar] [CrossRef]
  45. Song, J.H.; Her, Y.; Kang, M.S. Estimating Reservoir Inflow and Outflow From Water Level Observations Using Expert Knowledge: Dealing with an Ill-Posed Water Balance Equation in Reservoir Management. Water Resour. Res. 2022, 58, e2020WR028183. [Google Scholar] [CrossRef]
  46. Lu, Q.; Zhong, P.A.; Xu, B.; Zhu, F.; Huang, X.; Wang, H.; Ma, Y. Stochastic Programming for Floodwater Utilization of a Complex Multi-Reservoir System Considering Risk Constraints. J. Hydrol. 2021, 599, 126388. [Google Scholar] [CrossRef]
  47. Golian, S.; Yazdi, J.; Martina, M.L.V.; Sheshangosht, S. A Deterministic Framework for Selecting a Flood Forecasting and Warning System at Watershed Scale. J. Flood Risk Manag. 2015, 8, 356–367. [Google Scholar] [CrossRef]
  48. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  49. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  50. Campo Carrera, J.M.; Udias, A. Deep Reinforcement Learning for Complex Hydropower Management: Evaluating Soft Actor-Critic with a Learned System Dynamics Model. Front. Water 2025, 7, 1649284. [Google Scholar] [CrossRef]
  51. Yassin, F.; Razavi, S.; Elshamy, M.; Davison, B.; Sapriza-Azuri, G.; Wheater, H. Representation and Improved Parameterization of Reservoir Operation in Hydrological and Land-Surface Models. Hydrol. Earth Syst. Sci. 2019, 23, 3735–3764. [Google Scholar] [CrossRef]
  52. Helseth, A.; Mo, B.; Hågenvik, H.O.; Schäffer, L.E. Hydropower Scheduling with State-Dependent Discharge Constraints: An SDDP Approach. J. Water Resour. Plan. Manag. 2022, 148, 04022061. [Google Scholar] [CrossRef]
  53. Tharme, R.E. A Global Perspective on Environmental Flow Assessment: Emerging Trends in the Development and Application of Environmental Flow Methodologies for Rivers. River Res. Appl. 2003, 19, 397–441. [Google Scholar] [CrossRef]
  54. Serinaldi, F.; Kilsby, C.G.; Lombardo, F. Untenable Nonstationarity: An Assessment of the Fitness for Purpose of Trend Tests in Hydrology. Adv. Water Resour. 2018, 111, 132–155. [Google Scholar] [CrossRef]
  55. Xu, J.; Qiao, J.; Sun, Q.; Shen, K. A Deep Reinforcement Learning Framework for Cascade Reservoir Operations Under Runoff Uncertainty. Water 2025, 17, 2324. [Google Scholar] [CrossRef]
  56. Giuliani, M.; Herman, J.D.; Castelletti, A.; Reed, P. Many-Objective Reservoir Policy Identification and Refinement to Reduce Policy Inertia and Myopia in Water Management. Water Resour. Res. 2014, 50, 3355–3377. [Google Scholar] [CrossRef]
  57. Helsel, D.R.; Hirsch, R.M.; Ryberg, K.R.; Archfield, S.A.; Gilroy, E.J. Statistical Methods in Water Resources; US Geological Survey: Reston, VA, USA, 2020. [Google Scholar]
  58. Mockus, V. USDA Soil Conservation Service. National Engineering Handbook, Section 4: Hydrology, Chapter 17—Flood Routing; USDA: Washington, DC, USA, 1967. [Google Scholar]
  59. Lai, V.; Huang, Y.F.; Koo, C.H.; Ahmed, A.N.; El-Shafie, A. A Review of Reservoir Operation Optimisations: From Traditional Models to Metaheuristic Algorithms. Arch. Comput. Methods Eng. 2022, 29, 3435–3457. [Google Scholar] [CrossRef]
  60. Niu, S.; Insley, M. On the Economics of Ramping Rate Restrictions at Hydro Power Plants: Balancing Profitability and Environmental Costs. Energy Econ. 2013, 39, 39–52. [Google Scholar] [CrossRef]
  61. Tian, W.; Xin, K.; Zhang, Z.; Zhao, M.; Liao, Z.; Tao, T. Flooding Mitigation through Safe & Trustworthy Reinforcement Learning. J. Hydrol. 2023, 620, 129435. [Google Scholar] [CrossRef]
  62. Clark, M.P.; Vogel, R.M.; Lamontagne, J.R.; Mizukami, N.; Knoben, W.J.M.; Tang, G.; Gharari, S.; Freer, J.E.; Whitfield, P.H.; Shook, K.R.; et al. The Abuse of Popular Performance Metrics in Hydrologic Modeling. Water Resour. Res. 2021, 57, e2020WR029001. [Google Scholar] [CrossRef]
  63. Accadia, C.; Casaioli, M.; Mariani, S.; Lavagnini, A.; Speranza, A.; De Venere, A.; Inghilesi, R.; Ferretti, R.; Paolucci, T.; Cesari, D.; et al. Application of a Statistical Methodology for Limited Area Model Intercomparison Using a Bootstrap Technique. Nuovo C.-Soc. Ital. Di Fis. Sez. C 2003, 26, 125–140. [Google Scholar]
  64. Fang, K.; Kifer, D.; Lawson, K.; Feng, D.; Shen, C. The Data Synergy Effects of Time-Series Deep Learning Models in Hydrology. Water Resour. Res. 2022, 58, e2021WR029583. [Google Scholar] [CrossRef]
  65. McInerney, D.; Thyer, M.; Kavetski, D.; Laugesen, R.; Tuteja, N.; Kuczera, G. Multi-Temporal Hydrological Residual Error Modeling for Seamless Subseasonal Streamflow Forecasting. Water Resour. Res. 2020, 56, e2019WR026979. [Google Scholar] [CrossRef]
  66. Pool, S.; Vis, M.; Seibert, J. Evaluating Model Performance: Towards a Non-Parametric Variant of the Kling-Gupta Efficiency. Hydrol. Sci. J. 2018, 63, 1941–1953. [Google Scholar] [CrossRef]
  67. Choi, H.I. Development of Flood Damage Regression Models by Rainfall Identification Reflecting Landscape Features in Gangwon Province, the Republic of Korea. Land 2021, 10, 123. [Google Scholar] [CrossRef]
  68. Xi, H.; Luo, Z.; Guo, Y. Reservoir Evaluation Method Based on Explainable Machine Learning with Small Samples. Unconv. Resour. 2025, 5, 100128. [Google Scholar] [CrossRef]
  69. Zhang, T.; Wu, J.; Chu, H.; Liu, J.; Wang, G. Interpretable Machine Learning Based Quantification of the Impact of Water Quality Indicators on Groundwater Under Multiple Pollution Sources. Water 2025, 17, 905. [Google Scholar] [CrossRef]
  70. Castro-Freibott, R.; García-Sánchez, Á.; Espiga-Fernández, F.; González-Santander de la Cruz, G. Deep Reinforcement Learning for Intraday Multireservoir Hydropower Management. Mathematics 2025, 13, 151. [Google Scholar] [CrossRef]
  71. Castelletti, A.; Galelli, S.; Restelli, M.; Soncini-Sessa, R. Tree-Based Reinforcement Learning for Optimal Water Reservoir Operation. Water Resour. Res. 2010, 46, W09507. [Google Scholar] [CrossRef]
Figure 1. Location and overview of the Soyang Reservoir (37°56′43″ N, 127°48′49″ E) and its basin, South Korea.
Figure 2. Water storage and operational capacity of the Soyang Reservoir.
Figure 3. Water level fluctuations and operational thresholds for the Soyang Reservoir.
Figure 4. Methodology workflow for the Soyang Reservoir operation with DRL agents.
Figure 5. Actual inflow at the Soyang Reservoir.
Figure 6. Comparison of reservoir storage under actual and DRL-controlled operations (1993–2022).
Figure 7. Reservoir discharge estimated from inflow and optimized storage using DRL models.
Figure 8. Actual and DRL-based flood risks at the Soyang Reservoir (1993–2022).
Figure 9. DRL-driven reservoir operations: inflow–discharge–storage dynamics and flood-peak attenuation.
Figure 10. SHAP-based variable importance and feature effects on model predictions for the DDPG, PPO, and DQN models.
Table 1. Remote-sensing data products used in the study.

Product | Variables | Spatiotemporal Resolution | Reference
CHIRPS; IMERG-Final version “06” | Rainfall | 0.05° × 0.05° (daily) | Arregocés et al. [29]
Landsat 4, 5, 7, 8, 9 | Bands (B2, B3, B4, B5, B6, B7) | 0.0003° × 0.0003° (daily) | Egorov et al. [30], Chen et al. [31]
MERRA-2 | Humidity | 0.5° × 0.5° (hourly) | McNally et al. [32], Jung et al. [33]
GLDAS-2.0, 2.1 | Soil moisture | 0.25° × 0.25° (daily) | Qi et al. [34]
ERA5-Land | Temperature, Evaporation, Solar radiation, Wind speed | 0.1° × 0.1° (daily) | Gomis-Cebolla et al. [35]
SRTM digital elevation data v4 | DEM | 0.0008° × 0.0008° | Dong et al. [36]
USDA system | Soil texture | 0.002° × 0.002° (yearly) | Corral-Pazos-de-Provens et al. [37]
MCD12Q1 V6.1 product | Land cover | 0.004° × 0.004° (yearly) | Chirachawala et al. [38]
Table 2. Interpretation of flood risk values.

Flood Risk Value | Meaning
< 0 | Storage is below flood threshold → No flood risk
= 0 | Storage is exactly at the threshold → Flood-safe limit reached
0 < value ≤ 1 | Storage is within flood control zone → Potential flood risk
> 1 | Storage exceeds total flood buffer → High flood risk (overflow)
Table 3. Deep reinforcement learning model evaluation metrics.

Metric | PPO | DQN | DDPG
Cumulative Reward | 8691 | 8679 | 8235
Action Stability | 0.0059 | 0.1737 | 0.0792
Total Capacity Violations | 0 | 0 | 0
Flood Control Violations | 6 | 1 | 1
Table 4. Reward weight sensitivity analysis.
Model | Flood_Weight | Deviation_Weight | Cumulative Reward | Action Stability (Std) | Flood Violations | Capacity Violations
DDPG0.70.393230.07320
PPO0.70.395970.04160
DQN0.70.395910.16700
DDPG0.50.582340.07320
PPO0.50.586910.04160
DQN0.50.586800.16700
DDPG0.30.771450.07320
PPO0.30.777840.04160
DQN0.30.777690.16700
Table 5. High-inflow-day summary for highlighted dates: storage, discharge, and flood risk (FR). Positive FR indicates threshold exceedance.
Date | Metric | Actual | DDPG | PPO | DQN
1995-08-25Storage (million m3)2602206621521784
Discharge (m3/s)2226113910221069
FR (–)0.429−0.1870.000−0.353
1999-08-01Storage (million m3)150410651154977
Discharge (m3/s)51142230
FR (–)−0.743−1.167−1.037−1.234
1999-08-02Storage (million m3)1893136514101170
Discharge (m3/s)169139790
FR (–)−0.397−0.882−0.795−1.022
2006-07-15Storage (million m3)1786116713781138
Discharge (m3/s)23301280
FR (–)−0.536−1.092−0.895−1.129
Table 6. Quantitative performance metrics for simulated discharge (Observed vs. DRL) with 95% confidence intervals.

Model | RMSE (m3/s) | RMSE 95% CI (Low–High) | MAE (m3/s) | MAE 95% CI (Low–High) | NSE | NSE 95% CI (Low–High) | KGE | KGE 95% CI (Low–High)
DDPG | 49.36 | 40.55–59.27 | 19.09 | 18.27–19.96 | 0.67 | 0.54–0.74 | 0.60 | 0.56–0.63
PPO | 51.08 | 41.83–61.07 | 19.71 | 18.89–20.61 | 0.65 | 0.52–0.72 | 0.56 | 0.51–0.59
DQN | 50.11 | 40.71–60.35 | 19.51 | 18.68–20.42 | 0.66 | 0.53–0.73 | 0.57 | 0.52–0.61
