Applying Machine Learning Methods to Improve Rainfall–Runoff Modeling in Subtropical River Basins

Abstract: The performance of machine learning models in simulating monthly rainfall–runoff in subtropical regions has not been sufficiently investigated. In this study, we evaluate the performance of six widely used machine learning models, namely Long Short-Term Memory Networks (LSTMs), Support Vector Machines (SVMs), Gaussian Process Regression (GPR), LASSO Regression (LR), Extreme Gradient Boosting (XGB), and the Light Gradient Boosting Machine (LGBM), against a rainfall–runoff model (the WAPABA model) in simulating monthly streamflow across three subtropical sub-basins of the Pearl River Basin (PRB). The results indicate that LSTM generally simulates monthly streamflow better than the other five machine learning models. Using the streamflow of the previous month as an input variable improves the performance of all the machine learning models. When compared with the WAPABA model, LSTM demonstrates better performance in two of the three sub-basins. For simulations in wet seasons, LSTM shows slightly better performance than the WAPABA model. Overall, this study confirms the suitability of machine learning methods for rainfall–runoff modeling at the monthly scale in subtropical basins and proposes an effective strategy for improving their performance.


Introduction
Rainfall-runoff modeling is critical for understanding and predicting the transformation of rainfall into runoff [1–4]. Accurately simulating streamflow is essential for flood forecasting, reservoir operations, water supply planning, irrigation scheduling, and the design of hydraulic structures such as dams and reservoirs [5–8]. In the context of climate change, a well-performing runoff model allows us to assess the potential impacts of changing precipitation patterns on water availability and flood risks and thus to formulate effective strategies for managing precious water resources and mitigating hazards [9–11].
In recent years, machine learning models, as a subset of data-driven models, have become increasingly popular in hydrological investigations. One of the merits of such models is their flexibility, allowing users to train, test, and deploy runoff simulations without an extensive understanding of the physical processes regulating the water cycle [12,13]. Furthermore, machine learning models stand out for their ability to capture the non-linearity and intricate interactions in data, making them especially suitable for modeling complex relationships between meteorological variables and hydrological processes [13–15]. However, other researchers have raised concerns about the application of machine learning methods in hydrology because these methods have limited interpretability and physical consistency, potentially leading to physically incorrect or unreasonable simulations, especially when the training data are of low quality [16].
Water 2024, 16, 2199 2 of 24
Despite concerns about the interpretability of machine learning models, recent studies have recognized their promising performance in runoff simulation [17,18]. In previous studies, machine learning models have mainly been used to simulate daily or hourly streamflow [19–21], with much less attention paid to monthly streamflow. However, monthly streamflow simulations are equally, if not more, important for water resource management [22,23]. Unlike at the daily scale, climate variables (e.g., precipitation and evapotranspiration), rather than water storage in basins, play a dominant role in affecting streamflow at the monthly scale. Whether machine learning models are capable of capturing such features remains unclear.
As a result, the performance of machine learning models in simulating monthly streamflow warrants further investigation. First, the types of machine learning methods suitable for hydrological modeling need to be elucidated. Commonly used machine learning models in hydrology include regression analysis models, support vector machines, ensemble learning methods, and deep learning models with multi-layer neural network structures [24,25]. Many studies have compared the performance of multiple machine learning models and shown that LSTM may be the best model in daily streamflow simulations. For instance, Rahimzad et al. [26] indicated that LSTM demonstrated better performance than linear regression and support vector machines (SVMs) in the Kentucky River in the U.S. In Latif and Ahmed [27], LSTM outperformed Random Forest (RF) and Tree Boost for forecasting the daily streamflow of the Warragamba dam in Australia. Adnan et al. [28] also showed that LSTM had greater accuracy than the Extreme Learning Machine (ELM) and RF techniques. The LSTM model has demonstrated capability in reconstructing streamflow across hundreds of basins in the United States [29], and it has also shown better performance than traditional process-based approaches in applications to ungauged basins [30]. The LSTM's outstanding ability to learn the influence of past events on future outcomes is a critical factor responsible for its superior performance, enabling it to effectively capture hydrological processes, such as snowmelt and groundwater recharge, which are critical for streamflow dynamics in temperate regions [31]. However, whether the performance of such models still holds at the monthly scale needs further investigation.
Recently, hybrid frameworks integrating physical processes with machine learning models have been developed to simulate hydrological processes [16], showing promise in improving the performance of data-driven models [32]. For example, LSTM combined with physically based hydrological models has shown a better capacity in streamflow simulation [33,34]. RF and regression models also achieved improved performance based on a hybrid framework [35,36]. However, these new directions of applying machine learning models are still under development.
Second, strategies for setting up machine learning models in hydrological applications need further investigation. Machine learning models often exhibit varied performance across different river basins, suggesting that their full potential in hydrological simulations has not been realized. In an investigation across a group of river basins in Australia, machine learning methods demonstrated better performance in larger river basins [37]. The dependency of model performance on river basin characteristics indicates the necessity of including additional variables other than meteorological forcing inputs in machine learning simulations. In another investigation across different basins in the U.S., Kratzert et al. [30] found that the LSTM could achieve improved performance in simulating daily streamflow by adding river basin characteristics (e.g., soil and topography) as input data. However, which river basin properties should be included in machine learning-based simulations for subtropical basins remains unclear.
Third, the majority of previous studies focus predominantly on the overall performance of the models, neglecting the variability of model performance under different hydrological conditions (e.g., wet vs. dry periods). Compared with normal flow, water resource managers are often more concerned about large flow events, which could lead to risks such as floods, especially in subtropical and tropical regions. Accurately simulating large flow events has been a challenge in hydrological modeling. Evaluating the performance of machine learning models under various hydrological conditions will help to clarify the suitability of using such models in predicting the highly variable streamflow [38].
River basins located in subtropical or tropical regions have special hydrological processes compared with temperate basins. Subtropical and tropical basins often do not have snowfall, and thus lack the response of streamflow to snowmelt. In addition, abundant rainfall and high evapotranspiration rates make these basins' hydrological response times relatively short [39]. Given these special features of hydrological processes in subtropical/tropical basins, whether machine learning models are applicable to those basins needs to be investigated.
This study aims to comprehensively investigate the suitability of machine learning models for streamflow simulations at the monthly scale in a typical region of China. The performance of machine learning models in simulating monthly streamflow is compared with that of a rainfall-runoff model (i.e., the WAPABA model). The objectives of this study are to answer the following questions: (1) What is the relative performance of machine learning models compared with a traditional hydrological model for monthly streamflow simulation in subtropical regions? (2) Which types of machine learning models are more suitable for subtropical streamflow simulations at the monthly scale? (3) How can machine learning models be effectively set up to improve their performance?

Study Area
The Pearl River Basin (PRB), located in southern China, was selected as the study area. We chose the Pearl River Basin's three major sub-basins, namely the North River sub-basin, the East River sub-basin, and the West River sub-basin (Figure 1), to evaluate streamflow simulations by different models. The Pearl River, also known as the Zhujiang River, is one of the largest and most important river systems in China. The main stream of the river travels over 2214 km from its headwaters to the estuary, with a drainage area of about 453,690 km², making it the third-longest river in China [40].
The PRB has a tropical and subtropical climate, with hot and wet summers and mild winters, and is free of snow throughout the year. Annual mean temperatures in the PRB  C, accompanied by high precipitation averaging between 1200 and 2200 mm, primarily occurring from April to September [40,41]. Such a climate supports a wide range of flora and fauna and thriving agriculture. The PRB experiences heavy precipitation events during the wet season, leading to hazards such as flooding. Existing studies have highlighted the necessity of developing flood mitigation measures for this region [42].

Data
In this study, three gauge stations, namely Shijiao, Boluo, and Wuzhou, were selected for the North River, East River, and West River sub-basins, respectively (Figure 1). The boundaries of the sub-basins were obtained from the Global Runoff Data Centre database [43]. The areas of these three sub-basins are 38,363, 25,325, and 329,705 km², respectively (Figure 1). Gauge records archived by the GRDC [44] database and the Ministry of Water Resources, PRC, were collected for model calibration and evaluation in this study. The depth of runoff in each sub-basin was calculated based on the streamflow observations from these stations and their corresponding drainage areas.
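As an illustration of the runoff depth calculation, the following minimal sketch converts a monthly mean discharge into a depth in mm/month by dividing the monthly flow volume by the drainage area. It assumes that the gauge records are monthly mean discharges in m³/s, which the text does not state; the function name and the example discharge of 1000 m³/s are hypothetical, while the drainage areas are those given above.

```python
import calendar

# Drainage areas (km^2) of the three sub-basins, as reported in the text.
AREAS_KM2 = {"Shijiao": 38363.0, "Boluo": 25325.0, "Wuzhou": 329705.0}

def runoff_depth_mm(mean_discharge_m3s, area_km2, year, month):
    """Convert a monthly mean discharge (m^3/s, assumed unit) to runoff depth (mm/month)."""
    days = calendar.monthrange(year, month)[1]        # number of days in the month
    volume_m3 = mean_discharge_m3s * days * 86400.0   # total monthly flow volume
    area_m2 = area_km2 * 1.0e6
    return volume_m3 / area_m2 * 1000.0               # m -> mm

# Hypothetical example: 1000 m^3/s at Shijiao in June 2004 (about 67.6 mm).
depth = runoff_depth_mm(1000.0, AREAS_KM2["Shijiao"], 2004, 6)
print(f"{depth:.1f} mm/month")
```

Expressing streamflow as a depth puts it in the same units as rainfall and evapotranspiration, which is what allows basins of very different sizes to be compared.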
The climate data for hydrological simulations were obtained from the ERA5-Land dataset produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) [45]. In this study, all meteorological and hydrological data were aggregated to the monthly scale and averaged over each sub-basin for hydrological simulations. The climatic and hydrological variables used in this study, their units, and multi-year averages are shown in Table 1, while the other statistics are shown in Table S1. The North River sub-basin has the highest rainfall, while the East River sub-basin receives the least among the three sub-basins. The North River sub-basin has the highest runoff, while the West River sub-basin has the lowest.
The WAPABA (water partition and balance) model is a lumped rainfall-runoff model developed by Wang et al. [46]. The model has been widely used in monthly streamflow simulations and verified to perform well at this time scale [46–48]. Therefore, WAPABA was used in this study as the benchmark model against which the performance of multiple machine learning methods was compared.
The WAPABA model requires two fundamental inputs (i.e., the mean monthly rainfall and $ET_o$) to simulate the monthly runoff of a river basin. $ET_o$ is calculated using the meteorological variables listed in Table 1 and the FAO Penman-Monteith Equation [49]. The schematic diagram of the WAPABA model is shown in Figure S1 of the Supplementary Material. There are five parameters in the WAPABA model: the parameter of the catchment consumption curve ($\alpha_1$), the parameter of the evapotranspiration curve ($\alpha_2$), the percentage of groundwater yield from the catchment ($\beta$), the soil's maximum water-holding capacity ($S_{max}$), and the inverse of the groundwater store time constant ($K$). These parameters were calibrated in our study with the Nelder-Mead optimization algorithm [50].
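The calibration step can be sketched with SciPy's Nelder-Mead implementation. Because the WAPABA equations are not reproduced here, the sketch below calibrates a hypothetical two-parameter bucket model (`toy_bucket`) as a stand-in; the objective function (1 − NSE), the penalty for invalid parameters, the starting point, and the synthetic forcing data are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np
from scipy.optimize import minimize

def toy_bucket(params, rain, pet):
    """Hypothetical two-parameter monthly bucket model (a stand-in, not WAPABA)."""
    s_max, k = params
    storage, runoff = 0.0, []
    for p, e in zip(rain, pet):
        storage += p
        storage -= min(e, storage)           # evapotranspiration limited by available water
        spill = max(storage - s_max, 0.0)    # saturation excess above the storage capacity
        storage -= spill
        base = k * storage                   # linear-reservoir release
        storage -= base
        runoff.append(spill + base)
    return np.array(runoff)

def nse(sim, obs):
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def cost(params, rain, pet, obs):
    s_max, k = params
    if s_max <= 0.0 or not 0.0 < k < 1.0:
        return 1.0e6                         # penalty keeps the search in a valid range
    return 1.0 - nse(toy_bucket(params, rain, pet), obs)

# Synthetic forcing and "observations" generated with known parameters (assumed values).
rng = np.random.default_rng(1)
rain = rng.gamma(shape=2.0, scale=60.0, size=120)
pet = np.full(120, 80.0)
obs = toy_bucket((150.0, 0.3), rain, pet)

x0 = np.array([100.0, 0.5])                  # arbitrary starting guess
result = minimize(cost, x0, args=(rain, pet, obs), method="Nelder-Mead")
print(result.x, 1.0 - result.fun)            # calibrated parameters and their NSE
```

Nelder-Mead needs no gradients, which suits conceptual models whose storage thresholds make the objective non-smooth; the same pattern extends to five parameters by lengthening the parameter vector.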

Machine Learning Models
Long Short-Term Memory Networks
The Long Short-Term Memory (LSTM) network is a recurrent neural network (RNN) designed to address the problem of gradient vanishing or explosion that exists in traditional RNNs [51]. A standard LSTM unit consists of a cell with an input gate, an output gate, and a forget gate [52].
In this study, we built a Sequential model using the Keras API of the TensorFlow library. The architecture starts with an LSTM layer of ten neurons with a ReLU activation function, followed by a densely connected layer that employs an L2 regularizer and outputs a single real value. For model compilation, we chose the Adam optimizer for its adaptive learning rates, together with the MSE loss function. The model was then trained for up to 2000 epochs with an initial learning rate of 0.01. Additionally, we adjusted the learning rate using the 'ReduceLROnPlateau' strategy, which multiplies the learning rate by a factor of 0.5 if the loss has not improved after 20 epochs, until the minimum learning rate of 0.00001 is reached. We also used the 'ModelCheckpoint' function to save the optimal state and implemented the 'EarlyStopping' command with a patience of 100 epochs to prevent overfitting. These hyperparameters were determined following a commonly used procedure for neural network models: we initially set the model parameters based on empirical values and then split the data into training and validation sets for hyperparameter optimization. The training data were divided into a training set (80%) and a validation set (the remaining 20%) to prevent overfitting. Ten distinct LSTM models were trained, and their outputs were averaged to mitigate potential errors caused by randomness. The topological configuration of the LSTM model is shown in the Supplementary Material (Figure S2). A comprehensive introduction to the LSTM algorithm can be found in Sherstinsky [53].
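The configuration described above could be set up roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the L2 strength, the monitored quantity (`val_loss`), and the use of the last 20% of the training samples for validation are not specified in the text and are assumed here, and `restore_best_weights=True` in `EarlyStopping` is swapped in for the 'ModelCheckpoint' step to keep the example free of file paths. All function names are hypothetical.

```python
import numpy as np
from tensorflow import keras

def build_lstm(n_features, n_steps=1, l2_strength=0.01):
    """LSTM layer (10 units, ReLU) followed by a single-output dense layer with L2."""
    model = keras.Sequential([
        keras.Input(shape=(n_steps, n_features)),
        keras.layers.LSTM(10, activation="relu"),
        keras.layers.Dense(1, kernel_regularizer=keras.regularizers.l2(l2_strength)),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    return model

def make_callbacks():
    return [
        # Halve the learning rate after 20 epochs without improvement, down to 1e-5.
        keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                          patience=20, min_lr=1e-5),
        # Stop after 100 epochs without improvement and keep the best weights.
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=100,
                                      restore_best_weights=True),
    ]

def train_ensemble(X, y, n_members=10, epochs=2000):
    """Train several independently initialized LSTMs on an 80/20 split (assumed chronological)."""
    n_val = int(0.2 * len(X))
    X_tr, y_tr, X_val, y_val = X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:]
    models = []
    for _ in range(n_members):
        model = build_lstm(X.shape[2], X.shape[1])
        model.fit(X_tr, y_tr, validation_data=(X_val, y_val), epochs=epochs,
                  callbacks=make_callbacks(), verbose=0)
        models.append(model)
    return models

def ensemble_predict(models, X):
    """Average the members' predictions to reduce the effect of random initialization."""
    return np.mean([m.predict(X, verbose=0).ravel() for m in models], axis=0)
```

The input array is expected to have shape (samples, time steps, features); with a one-month preceding step, the time-step dimension is 1.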

Support Vector Machine
The Support Vector Machine (SVM) is a supervised machine learning method; when it is used for regression, the decision boundary becomes a hyperplane that fits the data points so as to minimize the overall error. It works by mapping input vectors into a high-dimensional feature space and then searching for the hyperplane that optimally fits the data [54]. SVM has also been widely used in hydrological studies [55], especially in rainfall-runoff forecasting [56–58].
In our study, we implemented the SVM model with a radial basis function (RBF) kernel using the scikit-learn library. The RBF kernel is capable of handling non-linear relationships between the predictors and the response variable [59]. The cost parameter was set to 100 and the gamma coefficient to 0.01, as selected by a grid search within a predefined parameter range. A general topological configuration of SVM is shown in Figure S3 of the Supplementary Material. The specific algorithm and related equations of support vector regression can be found in Raghavendra and Deka [55].
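A minimal scikit-learn sketch of this setup is given below. The search grid, the three-fold cross-validation, the use of `StandardScaler` for normalization, and the synthetic data are assumptions for illustration; the study reports only the selected values (C = 100, gamma = 0.01).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for six monthly predictors and a runoff target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 50.0 + 20.0 * X[:, 0] + rng.normal(scale=2.0, size=120)

# Scaling is placed inside the pipeline so it is fitted on training folds only.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {"svr__C": [1, 10, 100], "svr__gamma": [0.001, 0.01, 0.1]}  # assumed grid
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X[:96], y[:96])

pred = search.predict(X[96:])   # predictions from the best configuration
print(search.best_params_)
```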

Gaussian Process Regression
Gaussian Process Regression (GPR) is a nonparametric model for data regression analysis using a Gaussian process prior. The modeling assumptions for GPR include both noise (regression residuals) and a Gaussian process prior, and the model is solved according to Bayesian inference [60]. It is known for providing both predictions and estimates of the predictive uncertainty. This method is suitable for hydrological science due to its capacity for handling non-linear relationships and multiple input features and its robustness to noisy data [61–63].
In this study, we initialized the GPR model with a radial basis function (RBF) kernel, which is a good general-purpose kernel for capturing the smoothness of the rainfall-runoff relationship [59]. The hyperparameters of the kernel were then optimized using the built-in optimization methods in the GPy Python library, which flexibly re-adjust the kernel to best fit the data. The specific equations for the calculation of GPR can be found in Schulz et al. [64].
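The idea of an RBF-kernel GPR that returns both a prediction and its uncertainty can be illustrated as follows. Note that this sketch uses scikit-learn's `GaussianProcessRegressor` rather than the GPy library used in the study, and that the added white-noise kernel, the initial kernel values, and the synthetic one-predictor data are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in: monthly rainfall (mm) and a noisy runoff response (mm).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 300.0, size=(60, 1))
y = 0.5 * X[:, 0] + rng.normal(scale=5.0, size=60)

# RBF captures the smooth relationship; WhiteKernel models the regression residuals.
kernel = RBF(length_scale=100.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, y)                   # kernel hyperparameters are optimized during fitting

X_new = np.array([[50.0], [150.0], [250.0]])
mean, std = gpr.predict(X_new, return_std=True)   # prediction and its standard deviation
print(gpr.kernel_)              # kernel with fitted hyperparameters
```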

LASSO Regression
LASSO Regression (LR), also known as the Least Absolute Shrinkage and Selection Operator, is a linear regression model that uses shrinkage, where data values are shrunk towards a central point, typically the mean [65]. LR stands out for its inherent ability to perform automatic feature selection and reduce the dimensional complexity of the model [66]. These characteristics make LR suitable for rainfall-runoff modeling that involves a wide array of variously correlated meteorological variables [67,68]. In our study, LR was implemented through the scikit-learn library in Python 3.8.18, with an alpha (regularization strength) of 0.1. The algorithm of LR can be found in Roth [69].
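The feature-selection behavior mentioned above can be seen in a small sketch with `alpha=0.1`, the regularization strength reported in the text. The synthetic data, in which only the first two of six predictors carry signal, are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Six synthetic predictors; only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of predictors with non-zero weight
print(lasso.coef_, selected)
```

The L1 penalty shrinks the weights of uninformative predictors to zero or close to it, while the informative predictors keep slightly reduced weights.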

Extreme Gradient Boosting
XGBoost (Extreme Gradient Boosting, XGB) is an optimized distributed gradient boosting library designed for large-scale machine learning problems with a focus on efficiency, flexibility, and portability [70]. In this study, we used Python's XGBoost library to implement an XGBoost Regressor. The hyperparameters were optimized through a grid search within a predefined range of values. The XGB model was configured to minimize the squared error loss. Notably, 80% of the features were subsampled for each tree. The specific hyperparameter set included a learning rate of 0.001, a maximum tree depth of 3, an alpha (for L1 regularization) of 10, and a limit of 5000 boosting rounds. To avoid overfitting, the training data were split, with 80% treated as the training set and the remaining 20% as the validation set, using 'train_test_split' with a random state of 42 for reproducibility. During training, an early stopping mechanism was triggered if model performance failed to improve over 100 consecutive rounds, thus stopping the training and preserving the best-performing model. The specific algorithm of XGBoost can be found in Mitchell and Frank [71].

Light Gradient Boosting Machine
LightGBM (LGBM) is a powerful gradient-boosting framework developed by Microsoft [72]. In our study, we used Python's LightGBM library to construct a gradient-boosting model. As with the XGB model, a grid search was conducted to optimize the hyperparameters. The model was configured with the 'GBDT' boosting type to address our regression problem. We employed both 'l2' (Mean Squared Error) and 'l1' (Mean Absolute Error) as metrics for model evaluation. Specifications included 31 leaves, a maximum of 5000 iterations, and a learning rate of 0.001. Each split considered 90% of the features (feature fraction = 0.9). We implemented bagging with 80% of the data (bagging fraction = 0.8), with the bagged subset updated every five iterations (bagging frequency = 5). The methodology for splitting the training data and implementing early stopping was consistent with our approach in the XGB model. The algorithm of the LGBM can be found in Ke et al. [72].
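For reference, the settings reported for the two gradient-boosting models can be collected as parameter dictionaries, together with the 80/20 split described for the XGB model. This is a configuration sketch: the mapping of the reported values to the parameter names of the XGBoost scikit-learn wrapper and of LightGBM is our reading of the text, and the placeholder arrays stand in for the real training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Reported XGB settings, written with XGBoost scikit-learn wrapper parameter names.
xgb_params = {
    "objective": "reg:squarederror",
    "learning_rate": 0.001,
    "max_depth": 3,
    "reg_alpha": 10,            # L1 regularization
    "colsample_bytree": 0.8,    # 80% of features subsampled for each tree
}

# Reported LGBM settings, written with LightGBM parameter names.
lgbm_params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": ["l2", "l1"],
    "num_leaves": 31,
    "learning_rate": 0.001,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
}

MAX_BOOSTING_ROUNDS = 5000      # upper limit on boosting rounds for both models
EARLY_STOPPING_ROUNDS = 100     # stop if no improvement over 100 consecutive rounds

# Placeholder arrays with the size of the 396-month training period.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(396, 6)), rng.normal(size=396)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_val))
```

With a learning rate as small as 0.001, a large round limit combined with early stopping on the validation set lets the data decide how many trees are actually needed.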
The six machine learning models selected for this study are widely used in hydrology, and the strengths and limitations of each model are summarized in Table 2.

Model Simulations
For all the machine learning models employed in this study, both meteorological and hydrological data (shown in Table 1) were split into training and evaluation periods for all sub-basins. The training period spans January 1954 to December 1986, a total of 396 months. During the training of LSTM, XGB, and LGBM, we set aside 20% of the data as the validation dataset for the early stopping mechanism to prevent the models from overfitting. Based on our investigation, data normalization was not necessary for model training and testing, except for the SVM model, for which normalization is a required step. SVM, GPR, and LR are not trained through multiple iterations that gradually approach the optimal solution [55,64,69], and therefore we did not use a validation set for early stopping when training them. The evaluation period spans January 2004 to May 2023, a total of about 210 months after accounting for missing runoff data (20 months are missing in the North River and West River sub-basins, and 23 months are missing in the East River sub-basin). The period of 1987–2003 was not included in model simulations because of missing observations. The periods for the calibration and evaluation of the WAPABA model were aligned with the training and evaluation periods for machine learning models, respectively.
It is worth noting that we used the term 'evaluation' for the assessment of model performance against observations that were not used for model training for both WAPABA and machine learning models.This term is equivalent to 'testing' used in many machine learning modeling studies.
In this study, we conducted three sets of model simulations to evaluate the impacts of three input data combinations on the performance of machine learning models. We first used the climatic forcings only (Experiment 1) as model input data, following the design of many previous investigations [73–75]. Since the streamflow of the previous month represents overall river basin conditions (e.g., wetness), using this variable as input data in addition to the meteorological variables might add skill to streamflow simulations. As a result, we conducted another two sets of simulations to evaluate model performance in response to using the simulated (Experiment 2) or observed (Experiment 3) runoff of the previous month as an additional model input.
We conducted a preliminary test to understand the impact of the length of the preceding input data on the performance of the LSTM model. The results indicated that using preceding data of one month resulted in better performance than using longer preceding data (Figure S4). Consequently, we adopted a one-month time step as the preceding time step for LSTM simulations in this study. The input data combinations of each experiment are shown in Table 3. Experiment 1 uses six meteorological variables (i.e., $P(t)$, $e_a(t)$, $u_2(t)$, $R_n(t)$, $T_{max}(t)$, and $T_{min}(t)$) as the models' input. In Experiments 2 and 3, the runoff of the previous month ($R(t-1)$) is added as an extra input variable to drive these machine learning models. Considering the availability of streamflow observations and the applicability to long-term forecasts, the simulated runoff is used as the input in Experiment 2, whereas the observed runoff is used as the input in Experiment 3 during the evaluation period.
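The difference between Experiments 2 and 3 during the evaluation period can be sketched as follows. In the Experiment 2 style, each simulated value is fed back as the antecedent runoff of the next month; in the Experiment 3 style, the observed value is used. The function names, the treatment of the first month's antecedent runoff (`r_init`), and the toy linear predictor are hypothetical; any fitted model exposing a prediction function could be passed instead.

```python
import numpy as np

def add_lagged_runoff(met, runoff_prev):
    """Append R(t-1) as an extra column to the meteorological inputs."""
    return np.column_stack([met, runoff_prev])

def simulate_recursive(predict, met_eval, r_init):
    """Experiment 2 style: feed each simulated R(t) back in as R(t-1)."""
    sims, prev = [], r_init
    for row in met_eval:
        x = np.append(row, prev).reshape(1, -1)
        prev = float(predict(x)[0])
        sims.append(prev)
    return np.array(sims)

def simulate_with_observed(predict, met_eval, r_obs, r_init):
    """Experiment 3 style: use the observed R(t-1) at every step."""
    prev = np.concatenate([[r_init], r_obs[:-1]])
    return predict(add_lagged_runoff(met_eval, prev))

# Toy predictor standing in for a trained model: first column is rainfall,
# last column is the antecedent runoff.
predict = lambda x: 0.5 * x[:, 0] + 0.4 * x[:, -1]

met_eval = np.array([[10.0, 0.0], [20.0, 0.0], [30.0, 0.0]])
r_obs = np.array([8.0, 12.0, 20.0])
exp2 = simulate_recursive(predict, met_eval, r_init=5.0)
exp3 = simulate_with_observed(predict, met_eval, r_obs, r_init=5.0)
print(exp2, exp3)
```

Because errors can accumulate when simulated values are fed back, the recursive setting is the more demanding one, but it is the only option when observations are unavailable, as in long-term forecasts.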

Evaluation Metrics
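In this study, Bias, Root Mean Squared Error (RMSE), Correlation Coefficient (r), and the Nash–Sutcliffe efficiency coefficient (NSE) [76] were used for model performance evaluation. The formulas of these metrics are presented in Equations (1)–(4).

$$\mathrm{Bias} = \frac{1}{T}\sum_{t=1}^{T}\left(Q_s(t) - Q_o(t)\right) \quad (1)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(Q_s(t) - Q_o(t)\right)^2} \quad (2)$$

$$r = \frac{\sum_{t=1}^{T}\left(Q_s(t) - \bar{Q}_s\right)\left(Q_o(t) - \bar{Q}_o\right)}{\sqrt{\sum_{t=1}^{T}\left(Q_s(t) - \bar{Q}_s\right)^2}\,\sqrt{\sum_{t=1}^{T}\left(Q_o(t) - \bar{Q}_o\right)^2}} \quad (3)$$

$$\mathrm{NSE} = 1 - \frac{\sum_{t=1}^{T}\left(Q_s(t) - Q_o(t)\right)^2}{\sum_{t=1}^{T}\left(Q_o(t) - \bar{Q}_o\right)^2} \quad (4)$$

where $T$ is the length of the time series data; $Q_s(t)$ and $Q_o(t)$ denote the simulation and observation at time $t$; and $\bar{Q}_s$ and $\bar{Q}_o$ are the time averages of simulations and observations. In this research, the units of Bias and RMSE are mm/month, while r and NSE are unitless measures. Bias measures the average difference between simulated and observed streamflow, while RMSE evaluates the average error in model simulations. r assesses the ability of models to reconstruct the overall timing and magnitude of streamflow, while NSE is a common metric in hydrology for assessing the fit of model simulations to observed data. For simulations matching observations perfectly, Bias and RMSE should be 0, while r and NSE should be 1.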
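The four metrics can be computed with a few lines of NumPy, as in the sketch below. The function name is hypothetical, and skipping months with missing values is an assumption motivated by the gaps in the runoff records; the sign convention (simulation minus observation) follows the description of Bias in the text.

```python
import numpy as np

def evaluate(sim, obs):
    """Return Bias, RMSE, r, and NSE for paired simulated and observed series."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    mask = ~np.isnan(sim) & ~np.isnan(obs)     # skip months with missing data
    sim, obs = sim[mask], obs[mask]
    err = sim - obs
    return {
        "bias": err.mean(),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "r": np.corrcoef(sim, obs)[0, 1],
        "nse": 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2),
    }

obs = np.array([10.0, 20.0, 30.0, 40.0])
metrics = evaluate(obs - 2.0, obs)   # a simulation that is 2 mm/month too low
print(metrics)
```

In this example the simulation tracks the observations perfectly in timing (r = 1) but carries a constant offset, so Bias is −2, RMSE is 2, and NSE falls slightly below 1 (0.968), which shows why the metrics are reported together.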

Performance of the WAPABA Model
We first evaluated the performance of the WAPABA model in simulating monthly runoff in the PRB. Figure 2 presents the WAPABA simulations and runoff observations in the North, East, and West River sub-basins during the calibration and evaluation periods. The performance of the WAPABA model varies across the three sub-basins, as suggested by the evaluation metrics (Figure 2). During the calibration period, the North River sub-basin has a low r and NSE of 0.59 and 0.33, respectively, as a result of the underestimated peak flows, whereas the other two sub-basins both show r above 0.8 and NSE above 0.6, with the highest values (0.88 and 0.77 for r and NSE, respectively) found in the West River sub-basin.

Compared with the calibration period, the WAPABA model shows comparable or even better performance during the evaluation period. We found marked improvements in WAPABA simulations during the evaluation period for the North River sub-basin, as suggested by the 42% increase in r and the 88% increase in NSE. The bias during the evaluation period is −9.80 mm/month, −1.74 mm/month, and −2.23 mm/month in the North, East, and West River sub-basins, respectively, indicating that the WAPABA model underestimated the streamflow, particularly for peak flows, in the three sub-basins. The values of r are higher than 0.8 in all sub-basins during the evaluation period (0.84, 0.83, and 0.88 in the North, East, and West River sub-basins, respectively), indicating that WAPABA reconstructed the seasonal variations in the observations well. When NSE is used to evaluate hydrological simulations, a value greater than 0.5 is generally considered to indicate satisfactory model performance [77]. In this study, the NSEs for the three sub-basins during the evaluation period are 0.62, 0.69, and 0.77, respectively, confirming that the WAPABA model performed satisfactorily in the study area in the evaluation period. However, it is worth mentioning that NSE is only 0.33 for the North River sub-basin during the calibration period. This may result from the relatively poorer quality of observations in this sub-basin as well as other factors not considered in WAPABA, such as land use changes and anthropogenic activities, which may have larger impacts in this sub-basin. Overall, given that the modeling procedure is sound and the performance is acceptable during the evaluation period, we consider the WAPABA model an acceptable benchmark for evaluating the relative performance of machine learning models.

Simulation of Machine Learning Models Based on Climate Forcings Only
The time series of the observed and simulated runoff from multiple machine learning models in Experiment 1 are shown in Figure 3. Here, we only present results from 2021 to 2022 during the evaluation period to keep the time series readable, while results for the entire training and evaluation periods are shown in Figure S5 of the Supplementary Material. Evaluation metrics for both the training and evaluation periods are shown in Table 4.
In Experiment 1, all machine learning models demonstrate the best performance in the West River sub-basin and the poorest performance in the North River sub-basin, consistent with the WAPABA simulation (Figure 2). These results suggest that despite differences in the structures and algorithms of the models, using the same input data can lead to similar simulation outcomes across different river basins. The differences between model simulations and observations might be attributable to hydrological processes in the sub-basin that the models do not account for, such as anthropogenic disturbances (reservoir operation and water withdrawal) of natural water cycling.
Compared with the WAPABA model, all machine learning models of Experiment 1 exhibit noticeably higher RMSE and lower r and NSE during the evaluation period in the North and the East River sub-basins. Specifically, the average RMSE, r, and NSE values of the machine learning models in the North River sub-basin are 63.31, 0.65, and 0.42, respectively, while the WAPABA model shows values of 51.33, 0.84, and 0.62 during the evaluation period. In the East River sub-basin, machine learning models produce evaluation metrics of 37.82, 0.75, and 0.54, worse than the WAPABA model's values of 31.27, 0.83, and 0.69, for RMSE, r, and NSE, respectively. Evaluation metrics show slightly worse performance of the machine learning models relative to WAPABA in the West River sub-basin. In this sub-basin, the average RMSE of the machine learning models is 18.70.

Simulations of Machine Learning Models with Antecedent Runoff Input
Meteorological conditions play a crucial role in determining water entering (e.g., precipitation) and leaving (e.g., evapotranspiration) river basins. However, using these variables as the only input for machine learning modeling could not represent water storage in river basins well. This limitation may have resulted in the worse performance of machine learning models (Experiment 1) than the WAPABA model (Figures 2 and 3, Table 4). To better represent the process of water storage within the river sub-basin, we used runoff of the previous month as an additional input variable to drive machine learning simulations.
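The evaluation metrics of three experiments during the evaluation period are shown in Figure 4. Evaluation metrics during the training period are shown in Figure S6 of the Supplementary Material. The simulated streamflow by multiple machine learning models in Experiments 2 and 3 is shown in Figures S7 and S8 of the Supplementary Material. The evaluation metrics of Experiment 2 are shown in Table S2 of the Supplementary Material, while the metrics of Experiment 3 are shown in Table 5. In each sub-basin, most of the models have lower RMSE as well as higher r and NSE in Experiments 2 and 3 compared with those of Experiment 1, indicating that adding runoff of the previous month as an additional input variable effectively improves the performance of machine learning models (Figure 4). Furthermore, almost all models in Experiment 3 demonstrate better performance than those in Experiment 2, as suggested by the lower RMSE as well as higher r and NSE. Therefore, including the previous month's runoff could significantly improve the performance of machine learning models, and using observed runoff data could achieve better performance than using simulated runoff.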
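The difference between feeding observed and simulated antecedent runoff can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: a least-squares linear model stands in for the machine learning models, the data are synthetic, and the recursive scheme in `predict_simulated_lag` (the model's own previous output is used as R(t-1)) is only one plausible reading of how simulated runoff of the last month is supplied. All function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; a stand-in for any regressor."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def build_inputs(met, runoff):
    """Pair the forcing at month t with runoff at month t-1; target is R(t)."""
    X = np.column_stack([met[1:], runoff[:-1]])
    return X, runoff[1:]

def predict_observed_lag(coef, met, runoff_obs):
    """Observed-lag setting: R(t-1) comes from observations."""
    X, _ = build_inputs(met, runoff_obs)
    return np.column_stack([X, np.ones(len(X))]) @ coef

def predict_simulated_lag(coef, met, r_init):
    """Simulated-lag setting (assumed): R(t-1) is the previous model output."""
    sims, prev = [], float(r_init)
    for row in met[1:]:
        prev = float(np.concatenate([row, [prev, 1.0]]) @ coef)
        sims.append(prev)
    return np.array(sims)

# Synthetic monthly series (mm/month); not data from the study.
rng = np.random.default_rng(0)
precip = rng.gamma(2.0, 60.0, size=120)
met = precip.reshape(-1, 1)
runoff = np.zeros(120)
for t in range(1, 120):
    runoff[t] = 0.5 * precip[t] + 0.4 * runoff[t - 1] + rng.normal(0.0, 5.0)

X, y = build_inputs(met, runoff)
coef = fit_linear(X, y)
sim_obs_lag = predict_observed_lag(coef, met, runoff)
sim_own_lag = predict_simulated_lag(coef, met, runoff[0])
print(sim_obs_lag.shape, sim_own_lag.shape)
```

In the simulated-lag setting, errors in one month propagate into the next month's input, which is one intuitive reason why an observed lag would be expected to perform better.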

Comparison between Machine Learning Models and WAPABA
Figure 5 and Table 5 show the comparison of Bias and NSE of simulations by machine learning models and the WAPABA model during the evaluation period. Except for the SVM, XGB, and LGBM models in the North River sub-basin, the NSE of all model simulations is higher than 0.5, showing the acceptable performance of machine learning models from Experiment 3 in simulating runoff. Absolute biases are generally lower than 10 mm/month or even 5 mm/month for most model simulations. Among all machine learning models, LSTM demonstrates the highest NSE, which is also higher than that of the WAPABA model in the East River and West River sub-basins. In the North River sub-basin, although the WAPABA simulation has a larger bias than five of the six machine learning models, the NSE, r, and RMSE suggest that it has slightly better performance than the machine learning models (Figure 5).
We further visualize the performance of the machine learning models and WAPABA using Taylor diagrams (Figure 6). Across the three sub-basins, the standard deviation of observed runoff is higher than that in all model simulations, mainly because of the underestimation of the peak runoff during the wet seasons (e.g., in June 2022). In the North River sub-basin, the LSTM model exhibits the lowest RMSE and highest r among all machine learning models, but it performs slightly worse than the WAPABA model in terms of these two metrics. In the East and West River sub-basins, the WAPABA model shows higher RMSE and lower r than machine learning models, indicating worse performance. LSTM still exhibits the lowest RMSE and highest r in the East River sub-basin, while all machine learning models demonstrate comparable results in these two metrics in the West River sub-basin.
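The three quantities summarized in a Taylor diagram are linked geometrically, which the short sketch below makes explicit. It assumes the conventional Taylor-diagram formulation, in which the plotted RMSE is the centered root-mean-square difference (means removed); whether Figure 6 uses the centered or the uncentered form is not stated here, so this is an assumption. The function name `taylor_stats` and the sample values are illustrative.

```python
import numpy as np

def taylor_stats(obs, sim):
    """Return (std_obs, std_sim, r, centered RMS difference)."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    std_obs = obs.std()                    # population standard deviation
    std_sim = sim.std()
    r = np.corrcoef(obs, sim)[0, 1]
    # Centered RMS difference: remove each series' mean before differencing.
    crmsd = np.sqrt(np.mean(((sim - sim.mean()) - (obs - obs.mean())) ** 2))
    return std_obs, std_sim, r, crmsd

# Illustrative values only (mm/month), not data from the study.
obs = [20.0, 35.0, 120.0, 210.0, 90.0, 40.0]
sim = [25.0, 30.0, 100.0, 170.0, 95.0, 45.0]
so, ss, r, e = taylor_stats(obs, sim)

# Law-of-cosines identity behind the diagram:
# crmsd^2 = std_obs^2 + std_sim^2 - 2 * std_obs * std_sim * r
print(np.isclose(e ** 2, so ** 2 + ss ** 2 - 2.0 * so * ss * r))
```

A simulation that underestimates peaks has a smaller standard deviation than the observations, which places it closer to the origin than the observation point on the diagram.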
Figure 7 compares the observed and simulated Flow Duration Curves (FDCs) from the WAPABA model as well as the LSTM method, which shows better performance than other machine learning models in Experiment 3 (the FDCs for the remaining models are shown in Figure S9 in the Supplementary Material). The FDC depicts the probability that a given runoff quantity is exceeded. Overall, the FDCs for the WAPABA and the two best-performing machine learning models demonstrate strong consistency with the observed runoff, particularly during the dry seasons. However, for high-flow events, where the exceedance probability is less than approximately 10%, model simulations in each basin are significantly lower than observations. This suggests that these models underestimate runoff during the wet seasons. The LSTM model shows runoff values that are approximately 10% higher than those of the WAPABA model at low exceedance probabilities, suggesting that its performance during wet seasons is slightly superior to that of the WAPABA model. For low-flow events with exceedance probabilities higher than approximately 90%, both LSTM and WAPABA are able to accurately reconstruct the magnitude of runoff in the North River sub-basin. However, in the other two sub-basins, WAPABA tends to underestimate streamflow in dry seasons by around 20%, while LSTM demonstrates better performance in capturing the low streamflow.
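An FDC of the kind discussed above can be built by ranking monthly runoff from high to low, as in this minimal sketch. The Weibull plotting position rank/(n+1) is an assumption, since the plotting position used for Figure 7 is not stated; the helper `segment_mean` and the sample series are hypothetical and only illustrate how high-flow and low-flow segments can be compared.

```python
import numpy as np

def flow_duration_curve(runoff):
    """Return (exceedance probability in %, runoff sorted high to low)."""
    q = np.sort(np.asarray(runoff, dtype=float))[::-1]
    rank = np.arange(1, len(q) + 1)
    prob = 100.0 * rank / (len(q) + 1)     # Weibull plotting position (assumed)
    return prob, q

def segment_mean(runoff, lo, hi):
    """Mean runoff over the FDC segment with lo <= probability < hi (in %)."""
    prob, q = flow_duration_curve(runoff)
    return q[(prob >= lo) & (prob < hi)].mean()

# Illustrative series only (mm/month), not data from the study.
obs = np.array([12, 15, 30, 80, 160, 310, 240, 150, 90, 40, 20, 14,
                10, 18, 45, 95, 200, 280, 190, 120, 70, 35, 22, 13], float)
sim = 0.85 * obs + 3.0                     # a simulation that damps the peaks

high_obs = segment_mean(obs, 0.0, 10.0)    # high flows: exceedance < 10%
high_sim = segment_mean(sim, 0.0, 10.0)
print(high_sim < high_obs)
```

Comparing segment means at the two ends of the curve gives a compact way to express statements such as an underestimate of high flows or low flows by a given percentage.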
Water 2024, 16, x FOR PEER REVIEW 15 of 25
In summary, based on the selected evaluation metrics (RMSE, r, and NSE) and the FDC, LSTM performs the best among all machine learning models tested in this study (Figures 5 and 6). Compared with the WAPABA model, LSTM shows similar performance in the North River sub-basin and better performance in the East and West River sub-basins. For simulations in wet seasons, the LSTM model shows slightly improved performance relative to the WAPABA model.

Performance of Monthly Runoff Simulations
This investigation in the subtropical zone indicates that the overall performance of machine learning models is better than or comparable with a widely used rainfall-runoff model. The WAPABA model has been applied in climate change impact assessment and seasonal runoff forecasting [46][47][48]. This study further confirms the effectiveness of the model in applications in subtropical regions. Compared with the WAPABA model, the overall performance of machine learning models is slightly poorer in the North River sub-basin but better in the East and West River sub-basins. The comparable performance between the LSTM and the WAPABA model is consistent with the conclusions drawn by Clark et al. [37], who found that the LSTM model performed better in 69% of the catchments in Australia when compared to the WAPABA model.
During the wet seasons, WAPABA demonstrates a larger negative bias in the extreme runoff simulations compared to LSTM. This indicates that machine learning models have greater potential in improving the prediction of extreme events, as reported by Frame et al. [38], who demonstrated that the performance of the LSTM model in simulating streamflow of high return periods was better than a conceptual model and a process-based model across the United States.
With the increasing use of machine learning-based models in hydrological modeling in recent years, concerns are growing about the performance of such data-driven models [18,78]. Our study confirms the suitability of machine learning models for rainfall-runoff simulation in subtropical basins, especially the LSTM approach, which performs comparably to or better than the conceptual rainfall-runoff model.

Deep Learning in Rainfall-Runoff Modeling
Deep learning models provide a powerful method for learning and representing complex, non-linear patterns from data, which is mainly due to their multilayer structure and non-linear activation functions [79]. Models like LSTM consist of multiple hidden layers of neurons, each of which takes the output from the previous layer as input and processes it by utilizing non-linear activation functions, enabling the network to capture a variety of complex processes [80]. Meanwhile, the stacked structure of these models facilitates the understanding and learning of complex interactions between different features [81]. These characteristics equip deep learning models with more robust capability in handling complex and non-linear problems, such as hydrological processes, when compared to linear models (e.g., LR) and models dealing with simple nonlinearities (e.g., SVM and GPR).
Among the six distinct machine learning models investigated in this study, LSTM, which belongs to the category of deep learning, exhibits superior performance in simulating the rainfall-runoff relationship across different river basins. This result suggests that this model is more suitable for monthly streamflow simulations than other machine learning models. The superior performance of the LSTM from this study is in line with the fact that it has been widely applied in hydrology [82]. This finding also provides valuable implications for the selection of machine learning models in future hydrological investigations.
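To make the role of the non-linear activation functions concrete, the sketch below implements one time step of a standard LSTM cell in NumPy and runs it over twelve months of inputs. It is a generic textbook formulation with random, untrained weights; the layer sizes, the gate ordering in the stacked weight matrices, and the input variables are assumptions for illustration and do not reflect the configuration used in this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate
    c = f * c_prev + i * g         # cell state carries memory across months
    h = o * np.tanh(c)             # hidden state passed to the next step
    return h, c

# Twelve months of three standardized inputs (hypothetical), untrained weights.
rng = np.random.default_rng(0)
D, H = 3, 8
W = rng.normal(0.0, 0.3, size=(4 * H, D))
U = rng.normal(0.0, 0.3, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(12, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The cell state `c` is updated additively from one month to the next, which is what lets an LSTM retain information about earlier inputs in a way that LR, SVM, or GPR applied to a single month's inputs cannot.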

Strategies for Setting Up Machine Learning Models in Rainfall-Runoff Modeling
Unlike traditional hydrological models, which simulate key components of the terrestrial water cycle (e.g., infiltration, evapotranspiration, and groundwater recharge), machine learning models represent water accumulation in storage pools insufficiently, which limits their performance in simulating streamflow. Our Experiment 1 suggested that when using meteorological variables alone as inputs, machine learning-based simulations could not reconstruct streamflow well. Before runoff generation, precipitation undergoes infiltration, evapotranspiration, and exchanges with basin storages. In these processes, water storage in soil and aquifers significantly modulates the amount of water that eventually becomes runoff and shapes the temporal patterns of streamflow [83]. Traditional hydrological models, such as WAPABA, use state variables (e.g., S_max and K) to simulate the accumulation of water in these pools and the subsequent release of water to river channels. However, machine learning methods do not have such variables, limiting this type of model in simulating temporally continuous processes [16]. Using meteorological variables as the only input will not allow machine learning models to simulate water storage in soils and groundwater pools (Figure 3 and Table 4). This could explain the inferior performance of these machine learning models relative to that of the WAPABA model in Experiment 1 (Table 4).
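The effect of a storage state on runoff can be illustrated with a deliberately simple single-bucket water balance. This sketch is not the WAPABA formulation: the update order, the linear release, and the parameter values are assumptions, and the names `s_max` and `k` only echo the kind of parameters mentioned above.

```python
def bucket_model(precip, pet, s_max=150.0, k=0.3, s0=50.0):
    """Single-bucket monthly water balance (illustrative, not WAPABA).

    precip, pet: monthly precipitation and potential evapotranspiration (mm).
    s_max: storage capacity (mm); k: fraction of storage released per month.
    """
    storage = s0
    runoff = []
    for p, e in zip(precip, pet):
        storage += p
        et = min(e, storage)               # evapotranspiration limited by water
        storage -= et
        spill = max(storage - s_max, 0.0)  # saturation excess leaves at once
        storage -= spill
        base = k * storage                 # slow release from remaining storage
        storage -= base
        runoff.append(spill + base)
    return runoff

# Two months with identical forcing produce different runoff because the
# storage carried over from the first month differs from the initial storage.
print(bucket_model([100.0, 100.0], [0.0, 0.0], s_max=150.0, k=0.5, s0=0.0))
# -> [50.0, 75.0]
```

A model that sees only the current month's forcing would have to return the same runoff for both months, which is the gap that the previous month's runoff is intended to fill as an input.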
In addition, although including river basin properties as input variables contributes to improving streamflow simulations with machine learning models, these datasets are not always readily available, challenging the inclusion of basin-specific information in hydrological modeling based on machine learning models. To account for the spatial variability of streamflow, watershed characteristics (e.g., plant growth, reservoir operations, soils, and topography) are also treated as predictors of streamflow modeling based on machine learning models across multiple river basins. However, using watershed properties as inputs often requires substantial effort in data collection [84][85][86].
Our modeling experiments provide valuable information for dealing with the above two limitations in hydrological modeling using machine learning models. The streamflow of the previous month carries the information on how precipitation is transformed into runoff and thus could be used as a surrogate variable to represent the impacts of both water storage and watershed properties on streamflow [87]. The improvement in model performance with the adoption of the previous month's runoff as model input confirms the necessity of dealing with the intrinsic limitations of machine learning models in modeling continuous hydrological processes and provides a cost-effective solution for considering watershed properties in streamflow modeling. The strategy (Experiments 2 and 3) tested in this study could be used in future applications of machine learning models in streamflow simulation.
Success in improving the performance of machine learning models by incorporating streamflow observation from the previous month (e.g., Experiment 3) indicates the advantage of such data-driven models. Recent studies have demonstrated that applying machine learning techniques can significantly improve the performance of the original model in data assimilation in the Earth's system learning [88,89]. The strategy of incorporating previous observational data information to enhance the performance of data-driven models can be adopted by future applications of machine learning models.
Machine learning methods are often more computationally efficient than process-based models. In addition, using machine learning models does not require expertise and understanding of hydrological processes, allowing them to be adopted quickly by a broad range of stakeholders. Such conveniences and their rapidly evolving power in simulating complex processes make machine learning models very promising techniques in helping move hydrologic modeling forward.
Although the WAPABA model exhibited worse performance than the LSTM model in our study, its advantage in simulating the physical processes of the water cycle should not be ignored. The WAPABA model employs simplified physical equations to represent water storage processes within watersheds, providing a degree of physical constraint and interpretability, the lack of which is a pitfall of machine learning models. However, the oversimplification of these processes in WAPABA may also introduce uncertainties to runoff simulations. Considering the advantages and disadvantages of both types of models, we believe that developing hybrid modeling frameworks to couple both process-based and data-driven models could be a promising direction in future hydrological modeling.
More broadly, this study further emphasizes the importance of a physics-guided setting in machine learning modeling. Traditionally, many users adopt the machine learning model as a black-box approach, focusing on the input and output only, without learning the physical processes. While such a black-box approach can still provide reasonable simulations in many cases, increasing investigations suggest the necessity of reflecting a certain degree of the physical processes in the applications of such models [90,91]. Our investigation suggested that a good understanding of the physical processes is essential for maximizing machine learning models' capability in simulating complex processes.

Future Work
Although this investigation confirms the suitability of machine learning models in monthly rainfall-runoff modeling, especially LSTM techniques, further investigations are still required to address a few challenges in the application of machine learning models in hydrology.
First, machine learning modeling processes that map directly from inputs to outputs can obscure basin-specific details, making it infeasible to fully characterize the individuality of watersheds [92,93]. Thus, traditional data-driven models should be complemented with physics-informed approaches as they lack physical constraints and interpretability. In recent years, researchers have begun to work on building a hybrid framework that combines physical processes with deep learning structures to address such limitations [16,94]. This framework has been utilized in rainfall-runoff modeling [18], revealing flood mechanisms [34,95], and predicting groundwater levels [96,97]. These works reveal physical mechanisms to some extent while using machine learning structures in hydrologic modeling, which is at the frontier of research in this field.
Second, applications of machine learning models in simulating high flows need to be further investigated. Machine learning models require ample data for optimal performance. However, the mechanism regulating extreme runoff differs from the rainfall-runoff relationship of normal flows [98,99]. As a result, models trained on normal flow periods may not be suitable for predicting extreme flows. A possible solution to this limitation is to generate 'virtual' extreme events to increase training samples, thereby improving the model's prediction of extreme events [100]. In addition, using generative AI, such as the Generative Adversarial Network (GAN), has the potential to simulate large flow events but requires additional testing in future studies [101].

Conclusions
In this study, we carried out a comprehensive evaluation of the performance of six machine learning models in simulating monthly runoff across three subtropical sub-basins of the Pearl River Basin. These models' performance compared to a traditional hydrological model, different strategies for setting up model simulations, and model performance in simulating extreme runoff were thoroughly evaluated. The findings of this study include: (1) LSTM performs better in simulating runoff in the PRB relative to the other five machine learning models (SVM, GPR, LR, XGB, and LGBM), with about 11.7%, 5.1%, and 10.8% improvements in RMSE, r, and NSE, respectively. (2) Adding the previous month's runoff as an additional input variable can achieve better performance than using meteorological forcing only, and using observed runoff as input can achieve better performance than using simulated runoff of the last month. (3) LSTM outperforms the WAPABA model in two out of three sub-basins. Although all models underestimate the peak streamflow, the performance of the LSTM model is slightly better than that of WAPABA in all sub-basins during wet seasons. Additionally, the LSTM performs slightly better than WAPABA in the East River and West River sub-basins.
The findings of this study on the relative performance of different machine learning models to a conventional hydrological model provide evidence of the suitability of using these models in hydrological studies in subtropical regions. The strategies for setting up the model will help guide the future use of machine learning models in simulating runoff and hydrological forecasting. We suggest that future research could expand upon the methodologies used in this study by incorporating more comprehensive input data, such as a longer range of antecedent conditions and basin characteristics, into machine learning models. Such enhancements could potentially resolve the limitations exhibited in the present study and further increase the performance and interpretability of machine learning models.

Supplementary Materials:
The following Supplementary Material can be downloaded at: https://www.mdpi.com/article/10.3390/w16152199/s1, Figure S1; Table S1: Statistical summary of variables used in this study. 'Mean' refers to the average value, 'Std' refers to the standard deviation. 'Min' refers to the minimum value in the data. '25%', '50%', and '75%' refer to the 25th, 50th, and 75th percentiles, indicating that 25%, 50%, and 75% of the data values are lower than this value, respectively. 'Max' refers to the maximum value in the data; Table S2

Figure 1. Location of the Pearl River Basin, the three sub-basins, and gauge stations.

Figure 2. Evaluation of rainfall-runoff simulation by the WAPABA model against observations among different sub-basins in calibration (January 1954-December 1986) and evaluation (January 2004-May 2023) periods.
87, and 0.75, respectively, and the WAPABA model shows values of 17.83, 0.88, and 0.77.The results indicate that machine learning methods in Experiment 1 underperformed compared to the WAPABA model when the model input included meteorological variables only.It is noteworthy that, except for the LR model, all machine learning models exhibited NSE values less than 0.5 in the North River sub-basin, indicating the limitations of Experiment 1 in simulating the streamflow of this sub-basin.

Figure 3. Runoff simulations by multiple machine learning models using meteorological forcings alone as input data (Experiment 1) against observations among different sub-basins during 2021-2022.

Figure 4 .
Evaluation metrics during the training period are shown in Figure S6 of the Supplementary Material.The simulated streamflow by multiple machine learning models in Experiments 2 and 3 are shown in Figures S7 and S8 of the Supplementary Material.
The evaluation metrics of Experiment 2 are shown in

Figure 4. RMSE, r, and NSE of different machine learning models in Experiments 1, 2, and 3 during the evaluation period (legend is shown in the subplot of the first column and the third row).

Figure 5. Comparison of runoff simulations by WAPABA and the six machine learning models from Experiment 3. Color bars show absolute bias, and dotted lines indicate the NSE of the simulations.

Figure 6. Taylor diagram of machine learning models from Experiment 3 and WAPABA models across three sub-basins during the evaluation period (the units of standard deviation and RMSE are mm/month).

Figure 7. The Flow Duration Curves (FDCs) of observations as well as simulations by LSTM and WAPABA in Experiment 3 in the North, East, and West River sub-basins. The x-axis represents the exceedance probability, indicating the probability of runoff exceeding a specific runoff level shown by the y-axis.

Figure S1: Schematic diagram of the WAPABA (water partition and balance) model. P means precipitation, and ET means reference crop evapotranspiration; Figure S2: Schematic diagram of the Long Short-Term Memory Network; Figure S3: Schematic diagram of the Support Vector Machine; Figure S4: RMSE, NSE, and r for different input time lags (months ahead) for the LSTM model; Figure S5: Runoff simulations using multiple machine learning models in Experiment 1 against observations among different sub-basins in training (1954-1986) and evaluation (2004-2023) periods; Figure S6: Root Mean Squared Error (RMSE), Correlation Coefficient (r), and Nash Sutcliffe efficiency coefficient (NSE) of different machine learning models in Experiments 1, 2, and 3 during the training period. Note that Experiments 2 and 3 have the same training results; Figure S7: Runoff simulations using multiple machine learning models in Experiment 2 against observations among different sub-basins in training (1954-1986) and evaluation (2004-2023) periods; Figure S8: Runoff simulations using multiple machine learning models in Experiment 3 against observations among different sub-basins in training (1954-1986) and evaluation (2004-2023) periods; Figure S9: The Flow Duration Curves (FDCs) of observations and simulations by all machine learning models and WAPABA in Experiment 3 in the North, East, and West River sub-basins. The x-axis represents the exceedance probability, indicating the probability that a specific runoff amount equals or exceeds a given runoff level shown on the y-axis.

Table S2: Evaluation metrics results among each machine learning model in Experiment 2 during the training (January 1954 to December 1986) and evaluation periods (January 2004 to May 2023) in different river sub-basins. The unit of Bias and RMSE is mm/month, and the other two evaluation metrics (r and NSE) are unitless. Note that the training results of Experiment 3 are the same as those in Experiment 2.

Table 1. Variables used for model simulations in each sub-basin. All data are monthly averages.

Table 2. Summary of different machine learning models used in this study.

Table 3. Input data of three machine learning simulations. The full name of all variables is listed in Table 1. Model output data is R(t), where t denotes the data of the current month and t − 1 the data of the previous month.

Table 4. Performance of machine learning models in Experiment 1 during the training (January 1954 to December 1986) and the evaluation period (January 2004 to May 2023) in different sub-basins. The unit of Bias and RMSE is mm/month, and the other two evaluation metrics (r and NSE) are unitless.
Table S2 of Supplementary Material, while the metrics of Experiment 3 are shown in Table 5.

Table 5. Evaluation metrics for machine learning models in Experiment 3 and the WAPABA model during the evaluation period in different river sub-basins. The unit of Bias and RMSE is mm/month, and the other two evaluation metrics (r and NSE) are unitless. Bold text indicates models with better performance.