Advancing Hydrology through Machine Learning: Insights, Challenges, and Future Directions Using the CAMELS, Caravan, GRDC, CHIRPS, PERSIANN, NLDAS, GLDAS, and GRACE Datasets

Fahad Hasan; Paul Medley; Jason Drake; Gang Chen

doi:10.3390/w16131904

¹ Department of Civil & Environmental Engineering, FAMU-FSU College of Engineering, 2525 Pottsdamer Street, Tallahassee, FL 32310, USA

² Center for Spatial Ecology & Restoration, Florida A&M University, 407 Frederick S. Humphries Science Research Center, 1515 S. Martin Luther King Jr. Blvd., Tallahassee, FL 32307, USA

* Author to whom correspondence should be addressed.

Water 2024, 16(13), 1904; https://doi.org/10.3390/w16131904

This article belongs to the Special Issue Water Resource Management in Artificial Intelligence and Big Data Analytics

Abstract

Machine learning (ML) applications in hydrology are revolutionizing our understanding and prediction of hydrological processes, driven by advancements in artificial intelligence and the availability of large, high-quality datasets. This review explores the current state of ML applications in hydrology, emphasizing the utilization of extensive datasets such as CAMELS, Caravan, GRDC, CHIRPS, NLDAS, GLDAS, PERSIANN, and GRACE. These datasets provide critical data for modeling various hydrological parameters, including streamflow, precipitation, groundwater levels, and flood frequency, particularly in data-scarce regions. We discuss the types of ML methods used in hydrology and the significant successes achieved with them, highlighting their enhanced predictive accuracy and the integration of diverse data sources. The review also addresses the challenges inherent in hydrological ML applications, such as data heterogeneity, spatial and temporal inconsistencies, difficulties in downscaling large-sample hydrology (LSH) datasets, and the need to incorporate human activities. Beyond these limitations, the article highlights the benefits of utilizing high-resolution datasets compared to traditional ones. We also examine emerging trends and future directions, including the integration of real-time data and the quantification of uncertainties to improve model reliability, and we place a strong emphasis on incorporating citizen science and the Internet of Things (IoT) for data collection in hydrology. By synthesizing the latest research, this paper aims to guide future efforts in leveraging large datasets and ML techniques to advance hydrological science and enhance water resource management practices.

Keywords: machine learning; hydrology; big data; streamflow prediction; CAMELS; AI in hydrology; data scarcity; large-sample hydrology; downscaling; uncertainty quantification; citizen science

1. Introduction

Machine learning applications in hydrology have been gaining momentum, transforming how we understand and predict various hydrological processes [1]. The rapid advancement of artificial intelligence technologies is fostering increased research and applications of machine learning in this field, promising significant advancements in the near future [2,3]. These technologies are being used to improve the accuracy and efficiency of hydrological models, addressing complex problems such as climate change impacts, water resource management, and disaster preparedness.

One of the fundamental requirements for effective machine learning models is access to large, high-quality datasets, which provide the necessary data for accurate predictions and robust model training [2,4,5]. In hydrology, machine learning models have seen tremendous successes in predicting different parameters, including streamflow, precipitation, groundwater level, and flood frequency, especially in data-scarce regions [6,7,8,9,10,11,12].

The CAMELS dataset, for instance, integrates meteorological and hydrological data across multiple catchments, providing valuable insights into catchment behavior and facilitating the development of robust machine learning models [13,14]. Its detailed and coherent description of catchment characteristics makes it a valuable resource for exploring interrelationships among different attributes and understanding their influence on hydrological processes. Similarly, the Caravan dataset offers a comprehensive archive of hydrological responses, making it an invaluable resource for modeling and prediction tasks [15,16]. The increasing availability of high-resolution datasets like CHIRPS has enabled significant improvements in precipitation and drought monitoring, particularly in regions where traditional observational data are sparse [17]. This dataset, combined with advanced machine learning techniques, has demonstrated superior performance in various applications, from drought assessment in East Africa to streamflow forecasting in India [18,19]. The integration of satellite observations with ground-based measurements has also enhanced the reliability and accuracy of hydrological predictions, supporting better water resource management and climate impact assessments. Similarly, the PERSIANN dataset has significantly advanced hydrological research by improving flood prediction, precipitation estimation, and runoff simulation, demonstrating its versatility and critical role in enhancing predictive accuracy and supporting drought assessments [20,21,22,23,24,25,26,27,28]. Moreover, datasets like GRDC provide an extensive archive of river discharge data, which is critical for global water resource management, climate impact studies, and hydrological modeling. 
The GRDC dataset’s extensive coverage and long record length facilitate comprehensive analyses of hydrological patterns and trends, aiding in the development of robust hydrological models and enhancing our understanding of global water cycles [29,30].

Other datasets, such as the NLDAS (North American land data assimilation system), GLDAS (global land data assimilation system), and GRACE (gravity recovery and climate experiment), have been pivotal in advancing the understanding of hydrological processes. These datasets offer diverse and comprehensive data, ranging from catchment attributes and meteorological time series to river discharge and precipitation estimates, which are essential for developing and validating hydrological models. The selection of these datasets is driven by their extensive spatial and temporal coverage, high-quality data, broad applicability, and community acceptance. Specifically, CAMELS offers detailed catchment characteristics, while Caravan provides a global perspective with consistent data quality. GRDC, CHIRPS, PERSIANN, NLDAS, GLDAS, and GRACE offer reliable and comprehensive hydrological data. Their widespread use and acceptance within the hydrology research community ensure that our review covers the most relevant and impactful data sources. By focusing on these well-established datasets, we aim to provide insights that are both scientifically robust and widely applicable, supporting the development of accurate and reliable machine learning models in hydrology.

Despite the advances, challenges, such as data heterogeneity, spatial and temporal inconsistencies, and the need for the integration of human activities remain prevalent [31,32,33]. Addressing these challenges is crucial for advancing the field and ensuring that machine learning models can provide reliable and actionable insights.

This review paper aims to explore the current state of machine learning applications in hydrology, with a particular focus on the utilization of large datasets, such as CAMELS, Caravan, GRDC, CHIRPS, PERSIANN, NLDAS, GLDAS, and GRACE. We will discuss the successes, challenges, and future directions in this rapidly evolving field, highlighting the key studies and emerging trends. By synthesizing the latest research, this article seeks to guide future efforts in leveraging these datasets to advance hydrological science and improve water resource management practices. The paper will also address the integration of human impact on data, the quantification of uncertainties, and the potential of real-time data integration to enhance the accuracy and applicability of machine learning models in hydrology.

2. Trends of ML Applications in Hydrology

To understand the trends in machine learning (ML) applications within hydrology, a comprehensive search was conducted on the Web of Science database. The search utilized the following keywords: “Machine Learning” and “Streamflow”; “Machine Learning” and “Groundwater Level”; “Machine Learning” and “Flood Forecasting” or “Flood Prediction” or “Flood Estimation”; “Machine Learning” and “Soil Moisture Forecasting” or “Soil Moisture Prediction” or “Soil Moisture Estimation”; “Machine Learning” and “Drought Forecasting”; “Machine Learning” and “Evaporation Estimation” or “Pan Evaporation Modeling”; “Machine Learning” and “Precipitation Forecasting” or “Rainfall Forecasting”; “Machine Learning” and “Bathymetric Mapping”. This initial search yielded a total of 6594 articles.

The abstracts of these articles were carefully reviewed, and the studies that did not explicitly include ML or AI applications were excluded. This rigorous screening process resulted in a final selection of 5292 relevant articles. These articles were then categorized based on their focus on streamflow forecasting, groundwater level forecasting, flood forecasting, and soil moisture forecasting. The analysis of these selected articles revealed distinct trends over the past five years, as shown in the accompanying graph in Figure 1. The data indicates a significant increase in the number of studies employing ML techniques for hydrological applications, particularly in streamflow forecasting and flood forecasting.

Figure 1. Trends in machine learning applications in hydrology (2020–2024).

Precipitation forecasting remained the most extensively studied parameter with a high number of articles, which peaked at 1037 in 2022 before decreasing to 472 in the first part of 2024. This trend demonstrates the ongoing importance of precipitation forecasting in weather prediction and climate modeling. Streamflow forecasting articles increased from 22 in 2020 to a peak of 107 in 2023 before decreasing to 32 in the first five months of 2024. This suggests a growing interest in leveraging ML for predicting streamflow, driven by the need for accurate water resource management and flood risk assessment. Groundwater level forecasting showed a steady rise, starting at 12 articles in 2020 and reaching 35 articles in 2023 before a slight decline to 15 articles in the early months of 2024. This trend reflects the increasing recognition of ML’s potential in managing groundwater resources amid growing concerns about groundwater depletion and contamination.

Flood forecasting also saw substantial growth, with articles increasing from 41 in 2020 to 82 in 2023. The consistent interest in flood forecasting is likely due to the critical need for timely and accurate flood predictions to mitigate the impacts of climate change and extreme weather events. Soil moisture forecasting saw an increase from 7 articles in 2020 to 41 articles in 2023, highlighting the expanding application of ML in soil moisture estimation, which is crucial for agriculture, drought management, and ecosystem monitoring. Drought forecasting articles rose from 19 in 2020 to 69 in 2023, reflecting the heightened focus on understanding and mitigating the impacts of drought conditions. Evaporation estimation attracted steady interest, with articles fluctuating between 40 and 56 from 2020 to 2023. This interest is driven by the need to accurately estimate evaporation rates for water balance studies and agricultural management. Lastly, bathymetric mapping, although less prominent, showed a gradual increase in studies, peaking at 16 in 2022, indicating a growing interest in using ML for underwater terrain mapping and related applications. Overall, the trends reflect a broadening scope and growing interest in applying ML to various hydrological forecasting and estimation challenges, driven by the urgent need for improved accuracy and efficiency in water resource management amidst changing climatic conditions.
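
The growth described above can be put on a common scale with a little arithmetic. The following is a minimal sketch that uses only the 2020 and 2023 article counts quoted in the text; the function names are illustrative, and the compound annual rate assumes three yearly intervals between 2020 and 2023.

```python
# Article counts quoted in the text: (count in 2020, count in 2023) per topic.
counts = {
    "streamflow forecasting": (22, 107),
    "groundwater level forecasting": (12, 35),
    "flood forecasting": (41, 82),
    "soil moisture forecasting": (7, 41),
    "drought forecasting": (19, 69),
}


def growth_factor(start: int, end: int) -> float:
    """Ratio of the later count to the earlier count."""
    return end / start


def annual_growth_rate(start: int, end: int, years: int) -> float:
    """Compound annual growth rate over `years` yearly intervals."""
    return (end / start) ** (1 / years) - 1


for topic, (n2020, n2023) in counts.items():
    factor = growth_factor(n2020, n2023)
    rate = annual_growth_rate(n2020, n2023, years=3)
    print(f"{topic}: x{factor:.2f} overall, {rate:.0%} per year")
```

Partial-year counts for 2024 are left out on purpose, since comparing them with full-year totals would understate the trend.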

These trends underscore the expanding role of ML in hydrological research and the diverse applications it supports. The accompanying graph visually represents these trends, illustrating the dynamic growth and shifts in research focus over recent years.

3. Machine Learning Methods in Hydrology

3.1. Long Short-Term Memory (LSTM)

LSTM networks, a type of recurrent neural network (RNN), are effective for time series prediction tasks such as streamflow prediction, rainfall-runoff modeling, and groundwater level forecasting [34,35,36]. They capture temporal patterns and dependencies in hydrological data, offering improved predictive accuracy over traditional methods [37]. However, LSTMs require significant computational resources and are prone to overfitting and interpretability issues [37,38,39].
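
To make the idea of learning from temporal dependencies concrete, the sketch below shows one common way of preparing data for a sequence model such as an LSTM: each training sample is a window of consecutive daily forcing vectors, and the target is the streamflow on the last day of the window. This is an illustration under stated assumptions, not the setup of any cited study; the function name `make_windows`, the window length, and the same-day target are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def make_windows(
    forcing: Sequence[Sequence[float]],
    streamflow: Sequence[float],
    seq_len: int,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Build (input window, target) pairs for a sequence-to-one model.

    forcing[t] is the feature vector for day t (e.g., precipitation, temperature);
    the target for a window ending on day t is streamflow[t].
    """
    if len(forcing) != len(streamflow):
        raise ValueError("forcing and streamflow must have the same length")
    if seq_len < 1 or seq_len > len(forcing):
        raise ValueError("seq_len must be between 1 and the series length")

    inputs: List[List[List[float]]] = []
    targets: List[float] = []
    for end in range(seq_len, len(forcing) + 1):
        window = [list(step) for step in forcing[end - seq_len:end]]
        inputs.append(window)
        targets.append(streamflow[end - 1])
    return inputs, targets


# Toy example: 4 days, 2 features per day, windows of 3 days.
forcing = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 13.0]]
flow = [0.1, 0.2, 0.3, 0.4]
X, y = make_windows(forcing, flow, seq_len=3)
print(len(X), "windows; targets:", y)
```

A series of length n yields n - seq_len + 1 samples, which is one reason long, gap-free records matter for this class of models.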

3.2. Random Forests (RFs)

Random forests combine multiple decision trees to enhance predictive accuracy and control overfitting [40,41]. They are used in creating flood susceptibility maps, drought assessments, precipitation downscaling, and forecasting [42,43,44,45]. RF models are robust to noisy data and provide feature-importance insights. However, they may introduce bias in small datasets.

3.3. Support Vector Machines (SVMs)

SVMs classify data by finding the optimal hyperplane that separates different classes, making them suitable for streamflow prediction, groundwater level forecasting, and precipitation downscaling [46,47,48,49]. They are effective in high-dimensional spaces and robust to overfitting, but they can be sensitive to noise [50,51,52].

The machine learning techniques discussed in this section, their applications in hydrology, their advantages, and their potential limitations are summarized in Table 1.

Table 1. Applications, advantages, and disadvantages of machine learning techniques.

3.4. Artificial Neural Networks (ANNs)

ANNs model complex non-linear relationships [53] and are used for rainfall-runoff modeling, flood forecasting, and water quality prediction [54,55,56,57]. They are flexible for various tasks but are prone to overfitting and are often seen as black boxes [58,59,60].

3.5. Gradient Boosting Machines (GBMs)

GBMs sequentially build multiple decision trees, each correcting the errors of the previous ones, which improves prediction tasks like flood prediction, soil moisture estimation, and groundwater level prediction [61,62,63,64]. They offer high predictive accuracy and feature-importance insights but require careful parameter tuning to avoid overfitting [65,66].
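
The sequential tree-building idea can be sketched in a few lines. The example below is a deliberately minimal illustration, assuming a single predictor, squared-error loss, and one-split trees (stumps); it is not the algorithm of any particular library, and the names `fit_stump`, `fit_boosted`, and `predict` are hypothetical.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_stump(x: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Find the single split that minimizes the squared error of the residuals."""
    best_sse, best_stump = None, None
    for threshold in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    if best_stump is None:
        raise ValueError("need at least two distinct predictor values")
    return best_stump


def fit_boosted(x: Sequence[float], y: Sequence[float], n_rounds: int, learning_rate: float) -> Model:
    """Each round fits a stump to the current residuals and adds a damped correction."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left_val, right_val = fit_stump(x, residuals)
        stumps.append((threshold, left_val, right_val))
        preds = [
            p + learning_rate * (left_val if xi <= threshold else right_val)
            for xi, p in zip(x, preds)
        ]
    return base, learning_rate, stumps


def predict(model: Model, xi: float) -> float:
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left_val if xi <= threshold else right_val)
        for threshold, left_val, right_val in stumps
    )
```

The learning rate and the number of rounds are the kind of parameters the text refers to: a small rate with too few rounds underfits, while many rounds on noisy data can overfit.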

3.6. Convolutional Neural Networks (CNNs)

CNNs, primarily used for spatial data analysis, excel at recognizing spatial patterns in remote sensing data analysis, precipitation estimation, and flood mapping [67,68,69,70]. They handle large-scale datasets effectively and learn features automatically but require large amounts of labeled data [71,72].

3.7. Transformers

Originally developed for natural language processing, transformers have been adapted for hydrological applications due to their ability to handle sequential data and capture long-range dependencies [73]. They have shown superior performance in streamflow prediction and flood forecasting [74,75,76,77].

Some traditional models have also been used for hydrological research along with machine learning models. For instance, a five-level nested experimental watershed was developed to study the water cycle at multiple scales, and it was found that in humid regions, surface runoff constitutes a significant portion of the total runoff [78].

4. Key Datasets

The CAMELS (catchment attributes and meteorology for large-sample studies) dataset offers comprehensive data for 671 minimally impacted catchments across the contiguous United States (CONUS), encompassing various attributes, such as topography, climate, streamflow, land cover, soil, and geology. This diversity facilitates extensive hydrological research and aids in understanding the interrelationships among catchment characteristics [14]. Similarly, the Caravan dataset aggregates data from seven large-sample hydrology datasets, covering 6830 catchments globally over nearly four decades. It includes meteorological forcing, streamflow data, and static catchment attributes, promoting accessible, high-quality hydrological research [16]. The datasets extensively used in hydrological ML applications, their spatial and temporal coverage, data resolution, key attributes, and their primary applications in hydrology are summarized in Table 2.

Table 2. Key datasets used in hydrological ML applications.

The Global Runoff Data Centre (GRDC) archives river discharge data from over 9800 stations worldwide, with some records dating back 200 years. This extensive archive supports global water resource management, climate studies, and hydrological modeling [79]. CHIRPS (climate hazards group infrared precipitation with stations) provides high-resolution precipitation data from 1981 to the present by combining satellite observations with station data, which is essential for monitoring climate extremes and drought forecasting [17]. The PERSIANN (precipitation estimation from remotely sensed information using artificial neural networks) suite includes several high-resolution precipitation products. For example, PERSIANN-CCS provides near-global, high-resolution (0.04°) estimates at multiple temporal resolutions from 2003 to the present, which are ideal for real-time weather monitoring and severe weather analyses [80]. PERSIANN-CDR offers daily estimates at a 0.25° resolution from 1983 to the present, supporting long-term climatological and hydrological studies [81]. PERSIANN-CCS-CDR combines both, offering three-hourly estimates at a 0.04° resolution from 1983 to the present for extreme weather analyses and climatological studies [82]. The NLDAS (North American land data assimilation system) offers high-resolution, gridded datasets for North America from 1979 onwards, supporting water resource management, drought monitoring, and flood forecasting. On the other hand, the GLDAS (global land data assimilation system) generates high-resolution land surface states and fluxes using satellite and ground-based data from 1948 to the present, aiding global land surface condition monitoring and hydrological modeling. The GRACE (gravity recovery and climate experiment) and its follow-on mission (GRACE-FO) provide monthly data on Earth’s gravitational field variations from 2002 onwards, crucial for studying groundwater depletion, glacial melting, and sea-level rise. 
This dataset enhances our understanding of global water distribution and climate dynamics.

5. Case Studies

5.1. CAMELS

The CAMELS dataset has been instrumental in streamflow forecasting through various machine learning approaches. Studies have demonstrated the effectiveness of LSTM networks, transfer learning, and other advanced models, consistently showing improvements over traditional models and significant regional performance variations [83,84,85,86,87,88,89,90,91,92,93,94,95]. In rainfall-runoff modeling, CAMELS has enhanced predictive accuracy and robustness, with LSTM and transformer-based models outperforming traditional approaches [96,97,98,99,100,101,102,103]. For flood forecasting, machine learning frameworks have achieved high accuracy in storm classification and flood peak estimation [12,33,104,105,106]. Groundwater level forecasting, though less explored, has seen improved model performance through the integration of regional characteristics [107,108]. The dataset also advances various hydrological modeling techniques, including knowledge-guided frameworks, hybrid models, and AI-enhanced parameter learning, showcasing its versatility and robustness in hydrological research [109,110,111,112,113,114,115,116].

Notable studies using the CAMELS-GB dataset include investigations of urbanization’s impact on river discharge and hybrid hydroclimatic forecasting, while CAMELS-CL has seen the development of LSTM and random forest models for enhanced hydrological predictions [117,118,119,120,121,122]. CAMELS-BR has applied the FS-LSTM model for streamflow prediction [123], and CAMELS-AUS has focused on hybrid models for streamflow prediction and global water flux partitioning analyses [124,125].

The key case studies and findings using the chosen datasets are shown in Table 3.

Table 3. Key case studies and findings using the datasets.

5.2. Caravan

The Caravan dataset, despite being relatively new, shows significant potential in hydrology and machine learning. It has demonstrated superior performance in streamflow prediction, flood forecasting, and catchment model instance prediction through advanced models like temporal fusion transformers and latent factor models [127,128,129,130,131].

5.3. GRDC

The GRDC dataset has been extensively used for streamflow and water balance studies, improving monthly runoff reconstructions and enhancing streamflow and water storage predictions [29,30,160,161]. It has also been pivotal in flood prediction, hydrological modeling, and simulation, demonstrating strong performance in data-scarce areas [132,133,134,135].

5.4. CHIRPS

CHIRPS has been widely utilized in drought assessment, runoff estimation, flood modeling, and improving precipitation models. Studies have shown its superior performance in various regions, enhancing drought monitoring and flood prediction accuracy [18,19,136,137,138,139,140,141,142,162,163,164,165,166].

5.5. PERSIANN

The PERSIANN dataset has significantly contributed to hydrological modeling, flood prediction, and precipitation estimation. It has been used for streamflow and sediment load simulation, rainfall-runoff modeling, and reliable flood forecasting. Advanced techniques like cGANs and deep neural networks have enhanced precipitation estimation, supporting drought assessment and runoff simulation, demonstrating PERSIANN’s versatility and importance in hydrological research [20,21,22,23,24,25,26,27,28].

5.6. NLDAS

NLDAS data has advanced hydrological modeling, runoff and flood prediction, and evapotranspiration and soil moisture estimation. It has shown improved accuracy in predicting lake water temperatures, precipitation, soil moisture, and runoff, demonstrating the dataset’s utility in diverse hydrological applications [42,143,145,147,148,167,168,169,170,171,172].

5.7. GLDAS

GLDAS data has also significantly advanced hydrological research by improving hydrological modeling, soil moisture and evapotranspiration estimation, and groundwater and storage data predictions. It has enhanced the accuracy of terrestrial water storage variations and streamflow simulations and provided detailed spatial and temporal resolution for global land surface conditions [148,149,150,151,152,156,173,174,175].

5.8. GRACE

GRACE data has been used in groundwater and water storage anomaly studies, groundwater level prediction, enhancing spatial resolution, and filling temporal gaps. Machine learning techniques have successfully downscaled GRACE data, improving groundwater monitoring and providing high-resolution predictions, showcasing the dataset’s critical role in hydrological research [9,100,153,154,155,157,158,159,176,177,178,179].

6. Data Challenges in the ML Approach

6.1. Spatial and Temporal Resolution

One of the primary limitations of these datasets is their spatial and temporal resolution. The GLDAS dataset offers data at 0.25° × 0.25° and 1° × 1° resolutions, which are too coarse for detailed local studies such as urban hydrology or small watershed modeling. The NLDAS dataset, with a finer spatial resolution of 1/8th degree (~12.5 km), still may not suffice for applications demanding even higher granularity.
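
Grid resolutions given in degrees translate into different ground distances depending on latitude, which matters when judging whether a product is fine enough for a small watershed. The sketch below converts a resolution in degrees to an approximate cell size in kilometres. It assumes a spherical Earth with a mean radius of 6371 km (about 111 km per degree), so the results are approximations, as are the nominal figures usually quoted for these products; the function name and the example latitude are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)
KM_PER_DEGREE = math.pi * EARTH_RADIUS_KM / 180.0  # length of one degree of arc


def cell_size_km(resolution_deg: float, latitude_deg: float) -> tuple:
    """Return the approximate (north-south, east-west) extent of a grid cell in km."""
    north_south = resolution_deg * KM_PER_DEGREE
    # Meridians converge toward the poles, so east-west extent shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Resolutions mentioned in the text, evaluated at an example latitude of 35 degrees.
for name, res in [("GLDAS 1 deg", 1.0), ("GLDAS 0.25 deg", 0.25), ("NLDAS 1/8 deg", 0.125)]:
    ns, ew = cell_size_km(res, latitude_deg=35.0)
    print(f"{name}: {ns:.1f} km (N-S) x {ew:.1f} km (E-W), area ~{ns * ew:.0f} km^2")
```

Comparing the resulting cell area with the catchment area gives a quick first check on whether a gridded product can resolve within-catchment variability.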

The challenges of current LSH datasets are shown in Figure 2.

Figure 2. Summary of the challenges of current LSH datasets.

Temporal resolution is another critical factor. The CAMELS dataset, for instance, has a spatial focus on the contiguous United States (CONUS) and provides daily data from 1980 to 2015. Although this temporal span is useful, the daily resolution might not capture the finer temporal variations necessary for short-term forecasting or real-time applications. Datasets like the NLDAS provide hourly data, enabling more detailed temporal analysis, whereas others like CHIRPS offer daily data, which may miss significant sub-daily variations crucial for certain applications, such as flash flood forecasting. PERSIANN-CCS-CDR faces challenges in accurately representing spatial distribution patterns, especially at high temporal resolutions [82]. Furthermore, NLDAS precipitation data shows discrepancies with observations at hourly timescales, as shown in Figure 3. This is attributed to the inherent variability of precipitation and the analysis scheme used by the NLDAS [180].

Figure 3. Comparison of NLDAS forcing with local forcing for precipitation at Station EF-4 (ARM/CART, Plevna, Kansas), which is representative of other stations. Each point in the hourly panel represents one hour during the period from 0000 UT on 1 January 1998 to 2300 UT on 30 September 1999. The averaging period for the other panels is indicated accordingly [180].
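
The loss of sub-daily information under daily aggregation can be illustrated with a small synthetic example. The hourly values below are invented for illustration only and do not come from any of the datasets discussed; the helper names are hypothetical.

```python
from typing import List, Sequence

HOURS_PER_DAY = 24


def split_days(hourly_mm: Sequence[float]) -> List[Sequence[float]]:
    """Split an hourly series into consecutive 24-hour blocks."""
    if len(hourly_mm) % HOURS_PER_DAY != 0:
        raise ValueError("series length must be a multiple of 24")
    return [hourly_mm[i:i + HOURS_PER_DAY] for i in range(0, len(hourly_mm), HOURS_PER_DAY)]


def daily_totals(hourly_mm: Sequence[float]) -> List[float]:
    """Daily precipitation depth (mm), which is what a daily product reports."""
    return [sum(day) for day in split_days(hourly_mm)]


def daily_peak_intensity(hourly_mm: Sequence[float]) -> List[float]:
    """Largest hourly depth (mm/h) within each day, invisible at daily resolution."""
    return [max(day) for day in split_days(hourly_mm)]


# Synthetic data: a day of steady drizzle followed by a day with one intense burst.
steady_day = [1.0] * HOURS_PER_DAY
burst_day = [0.0] * HOURS_PER_DAY
burst_day[15] = 24.0
series = steady_day + burst_day

print("daily totals:", daily_totals(series))
print("peak hourly intensity:", daily_peak_intensity(series))
```

Both days have the same daily total, yet only the hourly series reveals the burst that would drive a flash flood response.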

CHIRPS accuracy is influenced by complex local factors like geography and topography. Daily data are less accurate than monthly or interannual data [181], as shown in Figure 4. The GRDC dataset’s temporal resolution varies widely, with some stations providing daily data and others offering only monthly or annual data, which reduces the consistency and utility of the dataset for machine learning models that require uniform temporal granularity.

Figure 4. Mean rainfall data from rain gauge and CHIRPS: (a) daily and (b) monthly.

6.2. Data Quality and Consistency

The quality and consistency of data vary significantly across different regions and periods within these datasets. For instance, the GRDC dataset suffers from inconsistent data quality due to variations in measurement techniques and station maintenance over time. To address this, data homogenization and quality control protocols can be used to standardize measurements and reduce inconsistencies [182,183,184]. Similarly, the CAMELS and Caravan datasets face challenges with data gaps and missing values, which can introduce noise and biases into machine learning models. The accurate imputation of these gaps is necessary but can introduce further uncertainties. Researchers have developed advanced imputation techniques (e.g., multiple imputation, principal component analysis, and autoregressive conditional heteroscedasticity (ARCH)), machine learning-based methods (e.g., K-nearest neighbors and neural networks), and robust random regression imputation (RRRI), which can be applied to better handle missing data while minimizing bias [185,186]. CHIRPS tends to overestimate precipitation in most areas (69% of stations), particularly during El Niño and drier periods. Biases in CHIRPS data can be significant for high-detail studies like flood or drought risk analyses [181,187]. PERSIANN products generally underestimate precipitation and exhibit low correlation and efficiency metrics when compared to ground-based observations in the Kelani River Basin, suggesting limited reliability [187]. NLDAS data often show a warm bias in solar radiation and a cool bias in longwave radiation. Furthermore, the NLDAS may not capture the exact amount of precipitation for individual events, especially small-scale convective precipitation events, though the total amount over a longer period (21 months in the study) can be within 10% of the observed values [180]. 
Bias correction methods, including quantile mapping, the convolutional autoencoder (ConvAE) neural network, non-linear power bias correction, and power transformation, have been proven to be effective in reducing these biases, helping to adjust the precipitation estimates to more closely match the observed values [188,189,190,191]. GLDAS-1 forcing data is not suitable for detecting long-term changes as the forcing data sources were switched several times in the past, which created discontinuity. In GLDAS-2 data, the continuity is much better; however, as the dataset was bias-corrected, the bias correction makes GLDAS-2 precipitation less correlated with observed precipitation [192].
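
Of the bias correction methods listed above, quantile mapping is the simplest to sketch. The idea is to find where a satellite estimate falls in the distribution of past satellite estimates and to replace it with the gauge value at the same position in the distribution of past gauge observations. The code below is a minimal empirical version under stated assumptions: both calibration samples cover the same period, linear interpolation is used between sample points, and values outside the calibration range are clipped. The function names are illustrative, and operational implementations differ in many details.

```python
from bisect import bisect_left
from typing import List, Sequence


def _empirical_cdf(sorted_vals: List[float], x: float) -> float:
    """Non-exceedance probability of x in [0, 1], linearly interpolated."""
    if x <= sorted_vals[0]:
        return 0.0
    if x >= sorted_vals[-1]:
        return 1.0
    hi = bisect_left(sorted_vals, x)  # first index whose value is >= x
    lo = hi - 1
    frac = (x - sorted_vals[lo]) / (sorted_vals[hi] - sorted_vals[lo])
    return (lo + frac) / (len(sorted_vals) - 1)


def _quantile(sorted_vals: List[float], p: float) -> float:
    """Value at probability p in [0, 1], linearly interpolated."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def quantile_map(x: float, sat_hist: Sequence[float], obs_hist: Sequence[float]) -> float:
    """Map a satellite estimate x onto the distribution of gauge observations."""
    if len(sat_hist) < 2 or len(obs_hist) < 2:
        raise ValueError("need at least two calibration values in each sample")
    p = _empirical_cdf(sorted(sat_hist), x)
    return _quantile(sorted(obs_hist), p)


# Toy calibration data in which the satellite product reads twice the gauge value.
sat_hist = [0.0, 2.0, 4.0, 6.0, 8.0]
obs_hist = [0.0, 1.0, 2.0, 3.0, 4.0]
print(quantile_map(5.0, sat_hist, obs_hist))
```

Because the mapping is learned from a calibration period, it inherits any non-stationarity in the bias, which is one source of the residual uncertainty noted in the text.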

Many datasets, such as GRDC and Caravan, have gaps in their records for reasons such as equipment failure or data loss. These gaps pose significant challenges for machine learning applications that require continuous and complete datasets. Techniques for data imputation can mitigate this issue but often introduce additional uncertainties.
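
One way to limit the uncertainty introduced by imputation is to fill only short gaps and to keep a flag for every filled value, so that downstream models can mask or down-weight them. The sketch below uses plain linear interpolation as a stand-in for the more advanced techniques cited earlier; the function name, the `max_gap` parameter, and the use of `None` for missing values are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def fill_gaps_linear(
    series: Sequence[Optional[float]], max_gap: int
) -> Tuple[List[Optional[float]], List[bool]]:
    """Linearly interpolate interior gaps of at most `max_gap` missing values.

    Returns the filled series and a parallel list of flags marking imputed values.
    Longer gaps and gaps at either end of the record are left as None.
    """
    filled: List[Optional[float]] = list(series)
    imputed = [False] * len(series)
    n = len(series)
    i = 0
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        start = i
        while i < n and series[i] is None:
            i += 1
        end = i  # index of the first valid value after the gap (or n)
        gap_len = end - start
        if start > 0 and end < n and gap_len <= max_gap:
            left, right = series[start - 1], series[end]
            for k in range(start, end):
                weight = (k - start + 1) / (gap_len + 1)
                filled[k] = left + (right - left) * weight
                imputed[k] = True
    return filled, imputed


# Toy discharge record with a short interior gap and a trailing gap.
record = [1.0, None, None, 4.0, None]
values, flags = fill_gaps_linear(record, max_gap=2)
print(values)
print(flags)
```

Keeping the flags separate from the values preserves the distinction between observed and imputed data, which matters when the record is later used as a training target.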

6.3. Regional and Climatic Representation

The regional focus of datasets like Caravan and CHIRPS can limit their applicability. Caravan, which primarily focuses on the Neotropics, and CHIRPS, which is designed to perform well in diverse climatic regions, may not generalize well to other areas with different climatic conditions. CHIRPS underestimates precipitation mainly in western and southern Antioquia (31% of stations) [181] and in areas of complex topography [187]. Furthermore, the dataset diverges from observed data at higher elevations, although this finding is limited to the specific region studied [193]. This regional bias can affect the representativeness and generalizability of machine learning models trained on these datasets. PERSIANN-CCS-CDR performs well in monthly precipitation assessment, but it tends to slightly underestimate observed precipitation, and its accuracy varies depending on the region and season [194].
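
Statements about over- or underestimation, correlation, and efficiency are usually backed by a small set of standard scores computed against ground observations. The sketch below implements three of them: percent bias, the Pearson correlation coefficient, and the Nash-Sutcliffe efficiency (NSE). The sign convention for percent bias (positive means the product overestimates) is an assumption chosen here, since both conventions appear in practice, and the cited studies may use additional or different metrics.

```python
import math
from typing import Sequence


def percent_bias(est: Sequence[float], obs: Sequence[float]) -> float:
    """Percent bias; positive values mean the estimates exceed the observations."""
    return 100.0 * sum(e - o for e, o in zip(est, obs)) / sum(obs)


def pearson_r(est: Sequence[float], obs: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if either series is constant."""
    mean_e = sum(est) / len(est)
    mean_o = sum(obs) / len(obs)
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(est, obs))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    if var_e == 0 or var_o == 0:
        return float("nan")
    return cov / math.sqrt(var_e * var_o)


def nse(est: Sequence[float], obs: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_o = sum(obs) / len(obs)
    sq_err = sum((e - o) ** 2 for e, o in zip(est, obs))
    sq_dev = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - sq_err / sq_dev


# Toy example: a product that doubles every observed value.
obs = [1.0, 2.0, 3.0, 4.0]
est = [2.0, 4.0, 6.0, 8.0]
print(percent_bias(est, obs), pearson_r(est, obs), nse(est, obs))
```

The toy case shows why the scores should be read together: a product can be perfectly correlated with observations and still carry a large bias and a poor efficiency.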

Datasets like the NLDAS and GLDAS rely on specific parameterizations to represent land surface processes. These parameterizations may not accurately capture local conditions, particularly in areas with complex terrain or unique land surface characteristics. NLDAS models (like MOSAIC and NOAH36) tend to overestimate ET in mountainous regions [195]. This is attributed to potential errors in both NLDAS data and the way ET is derived in these areas. The assumption of homogeneity within each catchment or grid cell, as seen in CAMELS and CHIRPS, can also lead to inaccuracies in models that depend on the spatial variability of environmental processes.

Furthermore, accurately capturing extreme events such as floods, droughts, and hurricanes is a common challenge. Datasets like CHIRPS may not adequately represent these events due to their spatial resolution and data recording practices. For example, CHIRPS may not accurately represent the intensity of rainfall events like storms due to overestimation or underestimation biases [181]. Previous research also shows that the data has difficulty in detecting extremely high precipitation events [196]. Similarly, PERSIANN-CDR has good detection abilities for small precipitation events but struggles with extreme precipitation events, often underestimating them. This limitation can affect its accuracy in long-term drought trend analyses [197].
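
How well a product detects extreme events is commonly summarized with categorical scores built from a contingency table of threshold exceedances. The sketch below computes the probability of detection (POD), the false alarm ratio (FAR), and the critical success index (CSI). The 50 mm threshold and the sample values are invented for illustration, and the cited studies may define extremes differently.

```python
from typing import Optional, Sequence, Tuple


def contingency(est: Sequence[float], obs: Sequence[float], threshold: float) -> Tuple[int, int, int]:
    """Count hits, misses, and false alarms for events at or above the threshold."""
    hits = misses = false_alarms = 0
    for e, o in zip(est, obs):
        est_event, obs_event = e >= threshold, o >= threshold
        if est_event and obs_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif est_event:
            false_alarms += 1
    return hits, misses, false_alarms


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None when the score is undefined (no relevant events)."""
    return numerator / denominator if denominator else None


def detection_scores(est: Sequence[float], obs: Sequence[float], threshold: float) -> dict:
    hits, misses, false_alarms = contingency(est, obs, threshold)
    return {
        "POD": _ratio(hits, hits + misses),  # share of observed events detected
        "FAR": _ratio(false_alarms, hits + false_alarms),  # share of estimated events that were wrong
        "CSI": _ratio(hits, hits + misses + false_alarms),  # combined skill
    }


# Synthetic daily precipitation (mm): gauge observations and a gridded estimate.
gauge = [0.0, 60.0, 5.0, 80.0, 55.0, 10.0]
gridded = [0.0, 52.0, 55.0, 30.0, 70.0, 0.0]
print(detection_scores(gridded, gauge, threshold=50.0))
```

A product that systematically underestimates heavy rainfall shows up here as a low POD at high thresholds, even when its scores at low thresholds look good.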

6.4. Downscaling of LSH

Downscaling is essential for large-sample hydrology (LSH) datasets to improve spatial and temporal resolution, capture local variability, and support accurate regional impact assessments and decision-making. It translates coarse-resolution data into detailed, actionable information for local water resource management, climate change adaptation, and integration with regional models. However, the downscaling of LSH datasets such as CAMELS, Caravan, GRDC, CHIRPS, PERSIANN, NLDAS, GLDAS, and GRACE involves multiple challenges, primarily due to the inherent complexities in translating coarse-resolution data to finer scales. Inconsistent measurements or missing data in datasets like GRDC can lead to inaccuracies in the downscaled product.

Ensuring the physical plausibility and climate realism of downscaled outputs is crucial for accurate regional impact assessments [198]. For instance, downscaling datasets like the GLDAS or NLDAS to a resolution suitable for urban hydrology requires detailed information that may not be present in the original dataset. Furthermore, datasets like PERSIANN, which offer hourly to daily data, may not accurately capture sub-hourly precipitation events when downscaled. The variability in downscaling methodologies, from statistical to dynamical approaches, adds another layer of complexity. Each method comes with its own assumptions and uncertainties, which need a thorough evaluation to ensure their validity for specific applications [199,200]. A downscaling method that is effective in one region (e.g., temperate climates) may not work as well in another (e.g., arid or tropical regions) due to different hydrological processes and data characteristics. Uncertainties in the original CHIRPS dataset can become more pronounced when downscaled, affecting the reliability of precipitation estimates at local scales. Moreover, the computational costs associated with running high-resolution models over long periods can be prohibitive, making the process resource-intensive [201]. Applying downscaling techniques to a global dataset like GRACE that measures terrestrial water storage requires extensive computational power to ensure accuracy. Additionally, the integration of multiple datasets, each with different resolutions and temporal spans, requires sophisticated techniques to maintain consistency and reliability in the downscaled data.

Effectively downscaling large-scale hydrological datasets involves combining statistical and dynamical methods, integrating multi-source data, and enhancing computational capabilities. Statistical techniques, such as regression models and machine learning, refine coarse data cost-effectively by identifying relationships between large-scale and local variables. Dynamical downscaling with regional climate models (RCMs) simulates physical processes at finer scales. Hybrid models leverage both methods for improved accuracy and robustness. Advances in computational power facilitate high-resolution simulations, making downscaling more feasible. Effective downscaling also requires collaboration among climatologists, hydrologists, data scientists, and policymakers to ensure the reliability and applicability of downscaled data.
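
As a minimal sketch of the statistical approach described above, the following Python example fits a linear relationship between a local covariate (station elevation) and the difference between station observations and a coarse grid-cell value, then uses it to estimate precipitation at an unobserved location. All numbers, variable names, and the choice of elevation as the only predictor are assumptions made for illustration; this is not a method taken from the reviewed studies.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y ~ intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def downscale(coarse_value, elevation_m, intercept, slope):
    """Add an elevation-dependent correction to the coarse grid-cell value."""
    return coarse_value + intercept + slope * elevation_m

# Hypothetical training data: four stations inside one coarse grid cell.
coarse_precip = 100.0                           # mm/month from the coarse product
station_elev = [200.0, 400.0, 600.0, 800.0]     # station elevation in m
station_precip = [90.0, 98.0, 106.0, 114.0]     # observed mm/month

# Learn how the local departure from the coarse value varies with elevation.
residuals = [p - coarse_precip for p in station_precip]
intercept, slope = fit_line(station_elev, residuals)

# Estimate precipitation at an unobserved location at 500 m elevation.
estimate = downscale(coarse_precip, 500.0, intercept, slope)
print(f"slope={slope:.3f} mm/m, estimate={estimate:.1f} mm/month")
```

In practice the regression would use more predictors and would be validated on held-out stations, since a relationship fitted in one region may not transfer to another.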

6.5. Data Accessibility

To advance large-sample hydrology (LSH), it is crucial to make datasets more FAIR (findable, accessible, interoperable, and reusable) [202]. Currently, many datasets are stored in obscure local repositories, making them hard to find. Accessibility is limited in many regions, biasing studies toward areas with better data availability. Interoperability issues arise from inconsistent maintenance practices, and restrictive licensing hampers data reuse. Global disparities in streamflow records present significant barriers. While North America and Europe have extensive records, many regions lack data, often due to a lack of stations or unprocessed, non-digitized data. Issues such as paywalls and cumbersome retrieval processes further complicate data access. Addressing these issues involves standardizing data storage and metadata, ensuring open access, and revising licensing practices to promote data sharing. Making LSH datasets FAIR will enhance hydrological research, enabling comprehensive global studies and better water management strategies.

Figure 5 compares the eight datasets in terms of temporal and spatial resolution, data accessibility, and data coverage:

Figure 5. Comparison of dataset limitations in hydrology.

Comparative hydrology requires consistent data processing across different catchments for meaningful comparisons. While it is relatively easy to compare catchments within the same large-sample hydrology (LSH) dataset, cross-dataset comparisons are challenging because of varying naming conventions, data sources, and calculation methods [202]. Several efforts aim to standardize measurement techniques and data management across LSH datasets. For instance, the Caravan project addresses this by creating a globally consistent and open dataset using sources such as ERA5-Land and HydroATLAS, which are processed in the cloud to reduce the burden of handling large datasets [16]. For satellite-based datasets like PERSIANN, GLDAS, and GRACE, ongoing efforts harmonize data products and reduce discrepancies. PERSIANN combines satellite and ground-based observations for high-resolution precipitation estimates, with continued refinement of algorithms and validation techniques for consistency [81]. The data processing methodologies of GLDAS and GRACE are continually updated to enhance resolution and integration with other datasets [203,204]. Despite these initiatives, full standardization across LSH datasets is still limited. These advancements and collaborative approaches are crucial for overcoming the challenges of hydrological data heterogeneity. Standardizing measurement techniques and data management practices will improve the reliability and comparability of LSH datasets, enabling more robust research and better-informed water management strategies.
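
To make the harmonization problem concrete, the sketch below converts discharge in m³/s to an area-normalized runoff depth in mm/day, one common convention for comparing catchments of different sizes, and maps two differently named records onto a single schema. The record layouts and field names (`discharge_m3s`, `runoff_mm_day`, `gauge`, `station_id`) are hypothetical and do not correspond to the actual column names of any dataset discussed here.

```python
def discharge_to_runoff_depth(q_m3s, area_km2):
    """Convert discharge (m^3/s) to area-normalized runoff depth (mm/day)."""
    if area_km2 <= 0:
        raise ValueError("catchment area must be positive")
    seconds_per_day = 86_400
    volume_m3_per_day = q_m3s * seconds_per_day
    depth_m_per_day = volume_m3_per_day / (area_km2 * 1e6)  # km^2 -> m^2
    return depth_m_per_day * 1000.0                          # m -> mm

def harmonize(record):
    """Map either hypothetical layout onto one schema: id + runoff in mm/day."""
    if "discharge_m3s" in record:
        runoff = discharge_to_runoff_depth(record["discharge_m3s"],
                                           record["area_km2"])
        return {"id": record["gauge"], "runoff_mm_day": runoff}
    return {"id": record["station_id"], "runoff_mm_day": record["runoff_mm_day"]}

# Hypothetical records from two sources with different names and units.
record_a = {"gauge": "A-001", "discharge_m3s": 10.0, "area_km2": 864.0}
record_b = {"station_id": "B-17", "runoff_mm_day": 1.0}

harmonized = [harmonize(record_a), harmonize(record_b)]
for row in harmonized:
    print(row)
```

After conversion, both records express streamflow in the same unit under the same key, so they can be compared directly.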

7. Benefits of High-Resolution Datasets over Traditional Methods

Although high-resolution datasets have limitations related to spatial and temporal resolution and data accessibility, their integration has significantly improved the monitoring and assessment of various hydrological and meteorological parameters, addressing several shortcomings inherent in traditional methods. Table 4 highlights the comparative advantages of high-resolution datasets over conventional approaches across different aspects, such as precipitation monitoring, streamflow assessment, and general data issues. Traditional methods, such as rain gauges and empirical models, often suffer from limitations like sparse distribution, point-specific data, and extensive calibration requirements. In contrast, high-resolution satellite data (e.g., PERSIANN-CDR and CHIRPS) provide comprehensive spatial and temporal coverage, enhancing the accuracy and reliability of precipitation and streamflow estimates. Additionally, NLDAS and GLDAS offer detailed temporal data and integrate advanced observational inputs, significantly improving model accuracy. The table also underscores the benefits of standardized, high-resolution datasets in ensuring consistent data quality, filling data gaps, and improving the detection and modeling of extreme weather events.

Table 4. Comparison of the advantages of high-resolution datasets over conventional approaches.

8. Future Directions

8.1. Focusing on Specific Hydrologic Regimes

There is a growing need for large-sample hydrology datasets that cater to the specific needs of researchers studying different hydrologic regimes. These datasets should include relevant data tailored to the dominant hydrologic processes in each regime. For example, datasets for snow-dominated catchments might include information on the snowpack, snow water equivalent (SWE), snowmelt rates, and freezing and thawing cycles. Permafrost regions are particularly sensitive to climate change [81,207,208], and datasets should include data on permafrost extent, temperature, and thaw depth. Similarly, datasets for arid regions could encompass data on soil moisture, evapotranspiration, and groundwater recharge. Furthermore, urbanization significantly alters hydrological processes [209,210,211], and datasets should include data on impervious cover, drainage networks, and water use patterns. Datasets for monsoon-influenced catchments could include precipitation data with high temporal resolution to capture the intense bursts of rainfall that occur during these events, as well as data on soil infiltration capacity and surface runoff processes. By focusing on specific hydrologic regimes, researchers can develop more accurate and transferable machine learning models for different hydrological settings.

The future directions of current LSH datasets are summarized in Figure 6.

Figure 6. Summary of the future directions of current LSH datasets.

8.2. Incorporating Human Impacts

Current large-sample hydrology datasets offer a wealth of information for machine learning applications. However, as noted in the limitations discussed in this review, a critical gap exists in capturing the influence of human activities on hydrological systems. While datasets such as CAMELS, NLDAS, and GLDAS excel at capturing natural climatic and geographic drivers of streamflow, precipitation, and other hydrological variables, they often lack data on how human actions such as irrigation, urbanization, and water management practices are altering the water cycle [202]. This gap can limit the applicability of machine learning models in regions heavily influenced by anthropogenic factors, potentially leading to inaccurate predictions and analyses. Integrating data on water use, infrastructure development, and land management practices into large-sample hydrology datasets is therefore crucial for a more comprehensive picture. By accounting for human activities, machine learning models for hydrological applications can achieve better accuracy and generalizability. Models that account for the complex interplay between human water use and natural climate variability will be more reliable for forecasting future streamflow patterns and water availability. Datasets that encompass human impacts also allow researchers to develop machine learning models that simulate the combined effects of climate change and human water use on hydrological systems and predict future water availability under different water management strategies and climate change projections.

Data on human activities that impact hydrology, such as water withdrawals, irrigation practices, and reservoir operations, are often limited in spatial and temporal coverage, and data quality and consistency can vary significantly across regions. Human impact data also come in various formats and units, requiring careful standardization and harmonization before integration with hydrological data. Furthermore, data on water use, especially for agricultural or industrial purposes, may be subject to privacy restrictions. Nevertheless, several existing datasets offer valuable information on human activities that impact hydrology. The Socioeconomic Data and Applications Center (SEDAC) [212], Global Water Resources Modeling Coalition (GWRC) [213], Food and Agriculture Organization (FAO), and Aquastat (FAO’s global information system on water and agriculture) [214] databases provide data on water use, infrastructure, and land management practices. Integrating data from these sources with traditional hydrological datasets might significantly enhance the information available for machine learning applications.

8.3. Uncertainty Quantification

A crucial element often missing is a clear understanding of the uncertainties associated with a dataset. Like any scientific measurement, hydrological data are inherently uncertain because of various factors [215,216]. Ignoring these uncertainties can lead to misleading results and unreliable predictions from machine learning models. Uncertainty quantification allows researchers to assess the reliability and limitations of the data used to train machine learning models [217,218]. By understanding the range of potential values and the likelihood of errors, more robust and informative models can be built. Models trained on data without uncertainty estimates may perform well on historical data but struggle when applied to new scenarios. Uncertainty quantification is necessary to quantify the model’s confidence in its predictions, providing a more realistic picture of its generalizability [217,219]. Table 5 summarizes the methods of uncertainty quantification in hydrology, their descriptions and applications, and their impact on hydrological models.

Table 5. Impacts of uncertainty quantification methods on hydrological models.

Real-world water resource management decisions often involve inherent uncertainties. By incorporating uncertainty estimates into machine learning models, a more complete picture of the risks and potential outcomes associated with different water management strategies can be provided to decision-makers. Uncertainty quantification can help identify potential biases in the data. For example, systematic errors in precipitation measurements might lead to the underestimation of streamflow in certain regions [230,231,232]. Identifying these biases allows for data correction or the development of models that are less sensitive to them. Propagating known measurement errors through models used to generate data allows for estimating the overall uncertainty in the derived variables. Running multiple hydrological models with different parameterizations on the same data can also provide an ensemble of potential outcomes. The spread of these outcomes indicates the uncertainty associated with the model predictions. Furthermore, Bayesian methods allow for incorporating prior knowledge about the uncertainties associated with the data into the analysis [233]. This can be particularly useful when dealing with limited data or missing information.
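
The error-propagation and ensemble ideas above can be sketched with a simple Monte Carlo experiment. The example below perturbs a precipitation input with an assumed relative measurement error, runs a toy runoff model under several alternative parameter values, and reports the mean, spread, and an approximate 5th to 95th percentile range of the simulated runoff. The linear runoff model, the 10% error level, and the runoff coefficients are illustrative assumptions only, not values from the cited literature.

```python
import random
from statistics import mean, stdev

def toy_runoff_model(precip_mm, runoff_coeff):
    """Illustrative stand-in for a hydrological model: runoff = coeff * precip."""
    return runoff_coeff * precip_mm

def propagate(precip_mm, rel_error_sd, coeffs, n_draws=2000, seed=42):
    """Monte Carlo propagation of input error across an ensemble of parameters."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_draws):
        # Perturb the input with Gaussian relative error, truncated at zero.
        perturbed = max(0.0, precip_mm * (1.0 + rng.gauss(0.0, rel_error_sd)))
        # Sample one ensemble member (one parameterization) per draw.
        coeff = rng.choice(coeffs)
        sims.append(toy_runoff_model(perturbed, coeff))
    sims.sort()
    lower = sims[int(0.05 * n_draws)]        # approximate 5th percentile
    upper = sims[int(0.95 * n_draws) - 1]    # approximate 95th percentile
    return mean(sims), stdev(sims), (lower, upper)

# Assumed inputs: 50 mm of rain, 10% measurement error, three parameter sets.
ens_mean, ens_sd, (p05, p95) = propagate(50.0, 0.10, [0.3, 0.4, 0.5])
print(f"mean={ens_mean:.1f} mm, sd={ens_sd:.1f} mm, 90% range=({p05:.1f}, {p95:.1f})")
```

The standard deviation and percentile range play the role of the ensemble spread described in the text: a wider range signals less confidence in the prediction.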

8.4. Real-Time Data Integration

For real-time decision making and proactive water resource management, real-time data integration is becoming increasingly crucial [234]. This involves continuously collecting, processing, and incorporating the latest hydrological observations into machine learning models. It is like having a constantly updated feed of information flowing directly into the models, allowing them to react and adapt to the ever-changing hydrological landscape. By leveraging the constant flow of real-time data, machine learning models can evolve from historical analysis tools to powerful real-time decision support systems. For example, real-time data on precipitation, river stages, and soil moisture allow for more accurate and timely flood forecasts [235]. Machine learning models can be continuously updated with the latest observations, leading to earlier warnings and more effective evacuation measures. Real-time data on streamflow, groundwater levels, and evapotranspiration can be used to monitor drought conditions and identify areas at risk. Early detection allows for proactive water management strategies, such as water restrictions or targeted conservation efforts. Real-time data on inflows, outflows, and downstream water demands can be used to optimize reservoir operations. Machine learning models can be trained to predict future water availability and suggest release strategies that balance competing needs for hydropower generation, irrigation, and environmental flows. Ensuring that real-time data arrive with minimal delay is crucial, because slow or unreliable data transmission can hinder the effectiveness of machine learning models that rely on the most up-to-date information. Real-time data streams may also contain errors or inconsistencies [236], so implementing robust quality control measures is essential to ensure the accuracy and reliability of the data used by machine learning models. Finally, processing large volumes of real-time data requires significant computational power, and machine learning models need to be optimized to handle real-time data streams efficiently without compromising accuracy.
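
A minimal sketch of the kind of quality control mentioned above is a pair of rule-based checks applied to each incoming reading: a plausible-range check and a rate-of-change (spike) check against the last accepted value. The thresholds, flag names, and river-stage values below are hypothetical and would need to be set per sensor and per site.

```python
def quality_control(readings, valid_min, valid_max, max_step):
    """Flag each reading as 'ok', 'missing', 'out_of_range', or 'spike'."""
    flags = []
    last_good = None
    for value in readings:
        if value is None:
            flags.append("missing")
            continue
        if not (valid_min <= value <= valid_max):
            flags.append("out_of_range")
            continue
        # Compare against the last accepted value, not the last raw value,
        # so that one bad reading does not contaminate the next check.
        if last_good is not None and abs(value - last_good) > max_step:
            flags.append("spike")
            continue
        flags.append("ok")
        last_good = value
    return flags

# Hypothetical river-stage readings in metres at a fixed interval.
stage_m = [1.20, 1.22, 9.80, 1.25, -0.50, None, 1.30]
flags = quality_control(stage_m, valid_min=0.0, valid_max=10.0, max_step=0.5)
print(list(zip(stage_m, flags)))
```

A fixed step threshold like this would also flag a genuine rapid rise during a flood, so operational systems typically combine such rules with neighbouring-station comparisons or manual review.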

Several technologies facilitate real-time data integration in hydrology. Dense networks of sensors deployed across catchments can collect real-time data on precipitation, river stages, soil moisture, and other variables [237]. These networks provide a continuous stream of observations for machine learning models. Modern sensors with enhanced accuracy and durability can monitor water quality, streamflow, soil moisture, and groundwater levels in real time. Coupled with IoT devices and robust 5G networks, these sensors enable extensive data collection and rapid transmission, even from remote areas. This integration supports dynamic modeling and predictive analytics, allowing water managers to respond swiftly to issues like droughts, floods, and contamination, thereby enhancing sustainability and resilience in water management practices. Moreover, cloud-based platforms offer the scalable computational resources needed to process and analyze large volumes of streaming data in real time [238,239]. Machine learning models can be deployed on these platforms for continuous training and prediction. Furthermore, streaming analytics techniques are specifically designed to handle continuous data streams [240,241]. These techniques allow for the real-time processing and analysis of hydrological data, enabling near-instantaneous insights and predictions.
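
One building block of streaming analytics is a statistic that can be updated one observation at a time without storing the full history. The sketch below uses Welford's online algorithm to maintain a running mean and sample variance; the class name and the example values are chosen here for illustration and are not drawn from the cited studies.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Incorporate one new observation in constant time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; zero until at least two observations arrive."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical stream of sensor values arriving one at a time.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)

print(f"n={stats.n}, mean={stats.mean:.2f}, variance={stats.variance:.3f}")
```

Because each update needs only the previous state, the same pattern scales to long-running sensor feeds where retaining every past value would be impractical.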

8.5. Data Collection

The future of data collection in hydrology and climate science holds exciting potential, particularly with the integration of citizen science initiatives. Engaging the public in scientific research democratizes data collection and significantly enhances the scope and scale of gathered data [242,243]. Citizen science can involve the public in reporting local precipitation levels, streamflow measurements, and groundwater levels using mobile apps, complementing traditional monitoring systems, especially in under-monitored regions [244]. Additionally, citizen science enhances data density and spatial coverage, fosters community engagement and awareness, and provides timely, localized insights into hydrological events, thereby improving the accuracy and responsiveness of flood warnings and water resource management [245,246]. However, negative impacts of citizen science, including over-burdening participants, health and safety risks, decreased self-reliance, exclusion, technology barriers, decentralizing monitoring and risk, conflict creation, data privacy concerns, and demotivational impacts, should also be considered [247]. Mobile and wearable technology, equipped with GPS, cameras, and environmental sensors, enables real-time, location-based data collection [248]. Social media and online platforms can gather real-time reports from users experiencing hydrological events, providing timely georeferenced data [249,250]. The Internet of Things (IoT) offers automated, continuous monitoring through smart sensors in rivers and lakes that transmit real-time data on water levels and flow rates [251]. Thus, real-time data can be gathered for hydrological analyses. Educational programs and community engagement can train citizens in scientific data collection methods, enhancing data quality and reliability. Integration with traditional methods and robust validation protocols, including machine learning techniques, can ensure the accuracy of citizen-collected data. Supportive policies and governance frameworks are essential in recognizing the value of such data and providing necessary tools. Ethical considerations, including data privacy and ownership, must be addressed to build trust and encourage participation. Leveraging citizen science and modern technology can lead to more comprehensive and inclusive data collection, driving innovation in hydrology and climate science.
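
As one simple illustration of a validation step for citizen-collected data, the sketch below flags reports that deviate strongly from the other reports for the same event, using the modified z-score based on the median absolute deviation (MAD). The 3.5 cutoff is a commonly used rule of thumb, and the rainfall values are invented for the example; this is a basic statistical screen, not one of the machine learning techniques referred to above.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return True for values whose modified z-score exceeds the threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half of the reports are identical; the score is undefined,
        # so nothing is flagged rather than dividing by zero.
        return [False] * len(values)
    return [0.6745 * dev / mad > threshold for dev in abs_dev]

# Hypothetical rainfall reports (mm) from volunteers for the same storm.
reports_mm = [12.0, 11.5, 12.4, 11.9, 45.0, 12.1]
suspect = flag_outliers(reports_mm)
print(list(zip(reports_mm, suspect)))
```

Flagged reports are candidates for follow-up rather than automatic deletion, since a localized downpour can produce a genuine extreme value.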

9. Conclusions

The integration of machine learning (ML) in hydrology has significantly advanced our understanding and prediction of various hydrological processes. The availability of extensive datasets such as CAMELS, Caravan, GRDC, CHIRPS, PERSIANN, NLDAS, GLDAS, and GRACE has been crucial in supporting these advancements. These datasets provide diverse and comprehensive data necessary for developing robust ML models that can predict streamflow, groundwater levels, precipitation, and flood frequencies, even in data-scarce regions. The CAMELS dataset, with its detailed catchment attributes and meteorological data, has been instrumental in enhancing streamflow and rainfall-runoff modeling. Similarly, the Caravan dataset’s standardization and aggregation of global hydrology data facilitate large-scale hydrological studies. GRDC’s extensive river discharge data, along with CHIRPS’ high-resolution precipitation records, provide invaluable inputs for accurate hydrological modeling. Despite the significant progress, challenges remain. The datasets often face issues with spatial and temporal resolution, data quality, and consistency. For example, the CAMELS dataset’s daily resolution may not capture finer temporal variations, while the GLDAS dataset’s coarse spatial resolution can limit its application in detailed local studies. Additionally, the integration of human activities and impacts into these datasets is still lacking, although it is crucial for comprehensive hydrological modeling. Nevertheless, these high-resolution datasets have a number of advantages, including improved accuracy in hydrological modeling, enhanced spatial and temporal detail, and the ability to capture finer-scale processes that are often missed by coarser datasets. Future directions should focus on improving dataset resolution, integrating human impact data, enhancing real-time data integration, and including citizen science and the IoT in data collection. Developing datasets tailored to specific hydrologic regimes and incorporating uncertainty quantification will further refine ML models and their applications in hydrology. By addressing these challenges and leveraging the strengths of these datasets, the field of hydrology can continue to benefit from the transformative potential of machine learning, leading to more accurate predictions, better water resource management, and improved resilience to climatic extremes.

Funding

The research was supported by Florida State University Council on Research + Creativity (CRC): Sustainability through funding number 046725.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

Lange, H.; Sippel, S. Machine Learning Applications in Hydrology; Springer: Cham, Switzerland, 2020; pp. 233–257.
Raschka, S.; Patterson, J.; Nolet, C. Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence. Information 2020, 11, 193.
Xu, Y.; Liu, X.; Cao, X.; Huang, C.; Liu, E.; Qian, S.; Liu, X.; Wu, Y.; Dong, F.; Qiu, C.W.; et al. Artificial Intelligence: A Powerful Paradigm for Scientific Research. Innovation 2021, 2, 100179.
Zhou, L.; Pan, S.; Wang, J.; Vasilakos, A.V. Machine Learning on Big Data: Opportunities and Challenges. Neurocomputing 2017, 237, 350–361.
Kratzert, F.; Klotz, D.; Herrnegger, M.; Sampson, A.K.; Hochreiter, S.; Nearing, G.S. Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning. Water Resour. Res. 2019, 55, 11344–11354.
Arriagada, P.; Karelovic, B.; Link, O. Automatic Gap-Filling of Daily Streamflow Time Series in Data-Scarce Regions Using a Machine Learning Algorithm. J. Hydrol. 2021, 598, 126454.
Lu, D.; Konapala, G.; Painter, S.L.; Kao, S.C.; Gangrade, S. Streamflow Simulation in Data-Scarce Basins Using Bayesian and Physics-Informed Machine Learning Models. J. Hydrometeorol. 2021, 22, 1421–1438.
Yang, C.; Xu, M.; Kang, S.; Fu, C.; Hu, D. Improvement of Streamflow Simulation by Combining Physically Hydrological Model with Deep Learning Methods in Data-Scarce Glacial River Basin. J. Hydrol. 2023, 625, 129990.
Rafik, A.; Ait Brahim, Y.; Amazirh, A.; Ouarani, M.; Bargam, B.; Ouatiki, H.; Bouslihim, Y.; Bouchaou, L.; Chehbouni, A. Groundwater Level Forecasting in a Data-Scarce Region through Remote Sensing Data Downscaling, Hydrological Modeling, and Machine Learning: A Case Study from Morocco. J. Hydrol. Reg. Stud. 2023, 50, 101569.
Guzman, S.M.; Paz, J.O.; Tagert, M.L.M.; Mercer, A.E. Evaluation of Seasonally Classified Inputs for the Prediction of Daily Groundwater Levels: NARX Networks Vs Support Vector Machines. Environ. Model. Assess. 2019, 24, 223–234.
Zhu, H.; Zhou, Q. Advancing Satellite-Derived Precipitation Downscaling in Data-Sparse Area Through Deep Transfer Learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4102513.
Mangukiya, N.K.; Sharma, A. Alternate Pathway for Regional Flood Frequency Analysis in Data-Sparse Region. J. Hydrol. 2024, 629, 130635.
Newman, A.J.; Clark, M.P.; Sampson, K.; Wood, A.; Hay, L.E.; Bock, A.; Viger, R.J.; Blodgett, D.; Brekke, L.; Arnold, J.R.; et al. Development of a Large-Sample Watershed-Scale Hydrometeorological Data Set for the Contiguous USA: Data Set Characteristics and Assessment of Regional Variability in Hydrologic Model Performance. Hydrol. Earth Syst. Sci. 2015, 19, 209–223.
Addor, N.; Newman, A.J.; Mizukami, N.; Clark, M.P. The CAMELS Data Set: Catchment Attributes and Meteorology for Large-Sample Studies. Hydrol. Earth Syst. Sci. 2017, 21, 5293–5313.
Clerc-Schwarzenbach, F.M.; Selleri, G.; Neri, M.; Toth, E.; van Meerveld, I.; Seibert, J. HESS Opinions: A Few Camels or a Whole Caravan? EGUsphere 2024, 2024, 1–29.
Kratzert, F.; Nearing, G.; Addor, N.; Erickson, T.; Gauch, M.; Gilon, O.; Gudmundsson, L.; Hassidim, A.; Klotz, D.; Nevo, S.; et al. Caravan-A Global Community Dataset for Large-Sample Hydrology. Sci. Data 2023, 10, 61.
Funk, C.; Peterson, P.; Landsfeld, M.; Pedreros, D.; Verdin, J.; Shukla, S.; Husak, G.; Rowland, J.; Harrison, L.; Hoell, A.; et al. The Climate Hazards Infrared Precipitation with Stations—A New Environmental Record for Monitoring Extremes. Sci. Data 2015, 2, 150066.
Adem, E.; Elfeki, A.; Chaabani, A.; Alwegdani, A.; Hussain, S.; Elhag, M. Impact of Satellite Precipitation Estimation Methods on the Hydrological Response: Case Study Wadi Nu’man Basin, Saudi Arabia. Theor. Appl. Climatol. 2024, 155, 3907–3925.
Wang, M.; Rezaie-Balf, M.; Naganna, S.R.; Yaseen, Z.M. Sourcing CHIRPS Precipitation Data for Streamflow Forecasting Using Intrinsic Time-Scale Decomposition Based Machine Learning Models. Hydrol. Sci. J. 2021, 66, 1437–1456.
Khan, M.A.; Stamm, J. Assessment of the Hydrological and Coupled Soft Computing Models, Based on Different Satellite Precipitation Datasets, to Simulate Streamflow and Sediment Load in a Mountainous Catchment. J. Water Clim. Change 2023, 14, 610–632.
Bhusal, A.; Parajuli, U.; Regmi, S.; Kalra, A. Application of Machine Learning and Process-Based Models for Rainfall-Runoff Simulation in DuPage River Basin, Illinois. Hydrology 2022, 9, 117.
Yeditha, P.K.; Kasi, V.; Rathinasamy, M.; Agarwal, A. Forecasting of Extreme Flood Events Using Different Satellite Precipitation Products and Wavelet-Based Machine Learning Methods. Chaos 2020, 30, 063115.
Chancay, J.E.; Espitia-Sarmiento, E.F. Improving Hourly Precipitation Estimates for Flash Flood Modeling in Data-Scarce Andean-Amazon Basins: An Integrative Framework Based on Machine Learning and Multiple Remotely Sensed Data. Remote Sens. 2021, 13, 4446.
Hayatbini, N.; Kong, B.; Hsu, K.L.; Nguyen, P.; Sorooshian, S.; Stephens, G.; Fowlkes, C.; Nemani, R.; Ganguly, S. Conditional Generative Adversarial Networks (CGANs) for near Real-Time Precipitation Estimation from Multispectral GOES-16 Satellite Imageries-PERSIANN-CGAN. Remote Sens. 2019, 11, 2193.
Tao, Y.; Hsu, K.; Ihler, A.; Gao, X.; Sorooshian, S. A Two-Stage Deep Neural Network Framework for Precipitation Estimation from Bispectral Satellite Information. J. Hydrometeorol. 2018, 19, 393–408.
Das, P.; Zhang, Z.; Ren, H. Evaluating the Accuracy of Two Satellite-Based Quantitative Precipitation Estimation Products and Their Application for Meteorological Drought Monitoring over the Lake Victoria Basin, East Africa. Geo-Spat. Inf. Sci. 2022, 25, 500–518.
Yu, C.; Hu, D.; Shao, H.; Dai, X.; Liu, G.; Wu, S. Runoff Simulation Driven by Multi-Source Satellite Data Based on Hydrological Mechanism Algorithm and Deep Learning Network. J. Hydrol. Reg. Stud. 2024, 52, 101720.
Khajehali, M.; Safavi, H.R.; Nikoo, M.R.; Fooladi, M. A Fusion-Based Framework for Daily Flood Forecasting in Multiple-Step-Ahead and near-Future under Climate Change Scenarios: A Case Study of the Kan River, Iran. In Natural Hazards; Springer: Berlin/Heidelberg, Germany, 2024.
Ayzel, G.; Kurochkina, L.; Zhuravlev, S. The Influence of Regional Hydrometric Data Incorporation on the Accuracy of Gridded Reconstruction of Monthly Runoff. Hydrol. Sci. J. 2022, 67, 2429–2440.
Wang, C.; Jiang, S.; Zheng, Y.; Han, F.; Kumar, R.; Rakovec, O.; Li, S. Distributed Hydrological Modeling With Physics-Encoded Deep Learning: A General Framework and Its Application in the Amazon. Water Resour. Res. 2024, 60, e2023WR036170.
Jiang, S.; Zheng, Y.; Solomatine, D. Improving AI System Awareness of Geoscience Knowledge: Symbiotic Integration of Physical Approaches and Deep Learning. Geophys. Res. Lett. 2020, 47, e2020GL088229.
Xu, T.; Liang, F. Machine Learning for Hydrologic Sciences: An Introductory Overview. WIREs Water 2021, 8, e1533.
Rasheed, Z.; Aravamudan, A.; Gorji Sefidmazgi, A.; Anagnostopoulos, G.C.; Nikolopoulos, E.I. Advancing Flood Warning Procedures in Ungauged Basins with Machine Learning. J. Hydrol. 2022, 609, 127736.
Zhou, F.; Chen, Y.; Liu, J. Application of a New Hybrid Deep Learning Model That Considers Temporal and Feature Dependencies in Rainfall–Runoff Simulation. Remote Sens. 2023, 15, 1395.
Ehteram, M.; Ghanbari-Adivi, E. Self-Attention (SA) Temporal Convolutional Network (SATCN)-Long Short-Term Memory Neural Network (SATCN-LSTM): An Advanced Python Code for Predicting Groundwater Level. Environ. Sci. Pollut. Res. 2023, 30, 92903–92921.
Arsenault, R.; Martel, J.L.; Brunet, F.; Brissette, F.; Mai, J. Continuous Streamflow Prediction in Ungauged Basins: Long Short-Term Memory Neural Networks Clearly Outperform Traditional Hydrological Models. Hydrol. Earth Syst. Sci. 2023, 27, 139–157.
Sabzipour, B.; Arsenault, R.; Troin, M.; Martel, J.L.; Brissette, F.; Brunet, F.; Mai, J. Comparing a Long Short-Term Memory (LSTM) Neural Network with a Physically-Based Hydrological Model for Streamflow Forecasting over a Canadian Catchment. J. Hydrol. 2023, 627, 130380.
Shen, C.; Lawson, K. Applications of Deep Learning in Hydrology. In Deep Learning for the Earth Sciences: A Comprehensive Approach to Remote Sensing, Climate Science and Geosciences; John Wiley & Sons: Hoboken, NJ, USA, 2021; pp. 283–297.
Tripathy, K.P.; Mishra, A.K. Deep Learning in Hydrology and Water Resources Disciplines: Concepts, Methods, Applications, and Research Directions. J. Hydrol. 2024, 628, 130458.
Hegelich, S. Decision Trees and Random Forests: Machine Learning Techniques to Classify Rare Events. Eur. Policy Anal. 2016, 2, 98–120.
Ali, J.; Khan, R.; Ahmad, N.; Maqsood, I. Random Forests and Decision Trees. Int. J. Comput. Sci. Issues 2012, 9, 272.
He, X.; Chaney, N.W.; Schleiss, M.; Sheffield, J. Spatial Downscaling of Precipitation Using Adaptable Random Forests. Water Resour. Res. 2016, 52, 8217–8237.
Liang, Z.; Tang, T.; Li, B.; Liu, T.; Wang, J.; Hu, Y. Long-Term Streamflow Forecasting Using SWAT through the Integration of the Random Forests Precipitation Generator: Case Study of Danjiangkou Reservoir. Hydrol. Res. 2018, 49, 1513–1527.
Elbeltagi, A.; Pande, C.B.; Kumar, M.; Tolche, A.D.; Singh, S.K.; Kumar, A.; Vishwakarma, D.K. Prediction of Meteorological Drought and Standardized Precipitation Index Based on the Random Forest (RF), Random Tree (RT), and Gaussian Process Regression (GPR) Models. Environ. Sci. Pollut. Res. 2023, 30, 43183–43202.
Saber, M.; Boulmaiz, T.; Guermoui, M.; Abdrabo, K.I.; Kantoush, S.A.; Sumi, T.; Boutaghane, H.; Hori, T.; Binh, D.V.; Nguyen, B.Q.; et al. Enhancing Flood Risk Assessment through Integration of Ensemble Learning Approaches and Physical-Based Hydrological Modeling. Geomat. Nat. Hazards Risk 2023, 14, 2203798.
Anandhi, A.; Srinivas, V.V.; Nanjundiah, R.S.; Nagesh Kumar, D. Downscaling Precipitation to River Basin in India for IPCC SRES Scenarios Using Support Vector Machine. Int. J. Climatol. 2008, 28, 401–420.
Sudheer, C.; Shrivastava, N.A.; Panigrahi, B.K.; Mathur, S. Groundwater Level Forecasting Using SVM-QPSO. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2011; pp. 731–741.
Sudheer, C.; Maheswaran, R.; Panigrahi, B.K.; Mathur, S. A Hybrid SVM-PSO Model for Forecasting Monthly Streamflow. Neural Comput. Appl. 2014, 24, 1381–1389.
Raghavendra, S.; Deka, P.C. Support Vector Machine Applications in the Field of Hydrology: A Review. Appl. Soft Comput. 2014, 19, 372–386.
Pappu, V.; Pardalos, P.M. High-Dimensional Data Classification. Springer Optim. Its Appl. 2014, 92, 119–150.
Xu, H.; Caramanis, C.; Mannor, S.; Smola, A. Robustness and Regularization of Support Vector Machines. J. Mach. Learn. Res. 2009, 10, 1485–1510.
Li, H.X.; Yang, J.L.; Zhang, G.; Fan, B. Probabilistic Support Vector Machines for Classification of Noise Affected Data. Inf. Sci. 2013, 221, 60–71.
Tan, C.O.; Beklioglu, M. Modeling Complex Nonlinear Responses of Shallow Lakes to Fish and Hydrology Using Artificial Neural Networks. Ecol. Model. 2006, 196, 183–194.
Kouadri, S.; Pande, C.B.; Panneerselvam, B.; Moharir, K.N.; Elbeltagi, A. Prediction of Irrigation Groundwater Quality Parameters Using ANN, LSTM, and MLR Models. Environ. Sci. Pollut. Res. 2022, 29, 21067–21091.
Wu, W.; Dandy, G.C.; Maier, H.R. Protocol for Developing ANN Models and Its Application to the Assessment of the Quality of the ANN Model Development Process in Drinking Water Quality Modelling. Environ. Model. Softw. 2014, 54, 108–127.
Chang, L.C.; Amin, M.Z.M.; Yang, S.N.; Chang, F.J. Building ANN-Based Regional Multi-Step-Ahead Flood Inundation Forecast Models. Water 2018, 10, 1283.
Nourani, V.; Komasi, M.; Mano, A. A Multivariate ANN-Wavelet Approach for Rainfall-Runoff Modeling. Water Resour. Manag. 2009, 23, 2877–2894.
Carabantes, M. Black-Box Artificial Intelligence: An Epistemological and Critical Analysis. AI Soc. 2020, 35, 309–317.
Khalaf Jabbar Rafiqul Zaman Khan, H.D. Methods to Avoid Over-Fitting and under-Fitting in Supervised Machine Learning (Comparative Study). Comput. Sci. Commun. Instrum. Devices 2015, 70, 978–981. [Google Scholar]
Piotrowski, A.P.; Napiorkowski, J.J. A Comparison of Methods to Avoid Overfitting in Neural Networks Training in the Case of Catchment Runoff Modelling. J. Hydrol. 2013, 476, 97–111. [Google Scholar] [CrossRef]
Pham, Q.B.; Kumar, M.; Di Nunno, F.; Elbeltagi, A.; Granata, F.; Islam, A.R.M.T.; Talukdar, S.; Nguyen, X.C.; Ahmed, A.N.; Anh, D.T. Groundwater Level Prediction Using Machine Learning Algorithms in a Drought-Prone Area. Neural. Comput. Appl. 2022, 34, 10751–10773. [Google Scholar] [CrossRef]
Ibrahem Ahmed Osman, A.; Najah Ahmed, A.; Chow, M.F.; Feng Huang, Y.; El-Shafie, A. Extreme Gradient Boosting (Xgboost) Model to Predict the Groundwater Levels in Selangor Malaysia. Ain Shams Eng. J. 2021, 12, 1545–1556. [Google Scholar] [CrossRef]
Chen, L.; Xing, M.; He, B.; Wang, J.; Shang, J.; Huang, X.; Xu, M. Estimating Soil Moisture over Winter Wheat Fields during Growing Season Using Machine-Learning Methods. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3706–3718. [Google Scholar] [CrossRef]
Xu, K.; Han, Z.; Xu, H.; Bin, L. Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model. Int. J. Disaster Risk Sci. 2023, 14, 79–97. [Google Scholar] [CrossRef]
Natekin, A.; Knoll, A. Gradient Boosting Machines, a Tutorial. Front. Neurorobot. 2013, 7, 63623. [Google Scholar] [CrossRef] [PubMed]
Tao, H.; Awadh, S.M.; Salih, S.Q.; Shafik, S.S.; Yaseen, Z.M. Integration of Extreme Gradient Boosting Feature Selection Approach with Machine Learning Models: Application of Weather Relative Humidity Prediction. Neural. Comput. Appl. 2022, 34, 515–533. [Google Scholar] [CrossRef]
Wang, Y.; Fang, Z.; Hong, H.; Peng, L. Flood Susceptibility Mapping Using Convolutional Neural Network Frameworks. J. Hydrol. 2020, 582, 124482. [Google Scholar] [CrossRef]
Sadeghi, M.; Asanjan, A.A.; Faridzad, M.; Nguyen, P.H.U.; Hsu, K.; Sorooshian, S.; Braithwaite, D.A.N. PERSIANN-CNN: Precipitation Estimation from Remotely Sensed Information Using Artificial Neural Networks–Convolutional Neural Networks. J. Hydrometeorol. 2019, 20, 2273–2289. [Google Scholar] [CrossRef]
Yang, F.; Feng, T.; Xu, G.; Chen, Y. Applied Method for Water-Body Segmentation Based on Mask R-CNN. J. Appl. Remote Sens. 2020, 14, 1. [Google Scholar] [CrossRef]
Naganna, S.R.; Marulasiddappa, S.B.; Balreddy, M.S.; Yaseen, Z.M. Daily Scale Streamflow Forecasting in Multiple Stream Orders of Cauvery River, India: Application of Advanced Ensemble and Deep Learning Models. J. Hydrol. 2023, 626, 130320. [Google Scholar] [CrossRef]
Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional Neural Networks: An Overview and Application in Radiology. Insights Imaging 2018, 9, 611–629. [Google Scholar] [CrossRef] [PubMed]
Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
Castangia, M.; Grajales, L.M.M.; Aliberti, A.; Rossi, C.; Macii, A.; Macii, E.; Patti, E. Transformer Neural Networks for Interpretable Flood Forecasting. Environ. Model. Softw. 2023, 160, 105581. [Google Scholar] [CrossRef]
Liu, C.; Liu, D.; Mu, L. Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River. IEEE Access 2022, 10, 58240–58253. [Google Scholar] [CrossRef]
Ghobadi, F.; Kang, D. Improving Long-Term Streamflow Prediction in a Poorly Gauged Basin Using Geo-Spatiotemporal Mesoscale Data and Attention-Based Deep Learning: A Comparative Study. J. Hydrol. 2022, 615, 128608. [Google Scholar] [CrossRef]
Yin, L.; Wang, L.; Keim, B.D.; Konsoer, K.; Yin, Z.; Liu, M.; Zheng, W. Spatial and Wavelet Analysis of Precipitation and River Discharge during Operation of the Three Gorges Dam, China. Ecol. Indic. 2023, 154, 110837. [Google Scholar] [CrossRef]
Zhang, K.; Li, Y.; Yu, Z.; Yang, T.; Xu, J.; Chao, L.; Ni, J.; Wang, L.; Gao, Y.; Hu, Y.; et al. Xin’anjiang Nested Experimental Watershed (XAJ-NEW) for Understanding Multiscale Water Cycle: Scientific Objectives and Experimental Design. Engineering 2022, 18, 207–217. [Google Scholar] [CrossRef]
Global Runoff Data Centre (GRDC)-Dataset-Waterdata. Available online: https://wbwaterdata.org/dataset/global-runoff-data-centre-grdc (accessed on 17 May 2024).
Nguyen, P.; Shearer, E.J.; Tran, H.; Ombadi, M.; Hayatbini, N.; Palacios, T.; Huynh, P.; Braithwaite, D.; Updegraff, G.; Hsu, K.; et al. The CHRS Data Portal, an Easily Accessible Public Repository for PERSIANN Global Satellite Precipitation Data. Sci. Data 2019, 6, 180296. [Google Scholar] [CrossRef] [PubMed]
Ashouri, H.; Hsu, K.L.; Sorooshian, S.; Braithwaite, D.K.; Knapp, K.R.; Cecil, L.D.; Nelson, B.R.; Prat, O.P. PERSIANN-CDR: Daily Precipitation Climate Data Record from Multisatellite Observations for Hydrological and Climate Studies. Bull. Am. Meteorol. Soc. 2015, 96, 69–83. [Google Scholar] [CrossRef]
Sadeghi, M.; Nguyen, P.; Naeini, M.R.; Hsu, K.; Braithwaite, D.; Sorooshian, S. PERSIANN-CCS-CDR, a 3-Hourly 0.04° Global Precipitation Climate Data Record for Heavy Precipitation Studies. Sci. Data 2021, 8, 157. [Google Scholar] [CrossRef] [PubMed]
Ma, K.; Feng, D.; Lawson, K.; Tsai, W.P.; Liang, C.; Huang, X.; Sharma, A.; Shen, C. Transferring Hydrologic Data Across Continents–Leveraging Data-Rich Regions to Improve Hydrologic Prediction in Data-Sparse Regions. Water Resour. Res. 2021, 57, e2020WR028600. [Google Scholar] [CrossRef]
Ouyang, W.; Lawson, K.; Feng, D.; Ye, L.; Zhang, C.; Shen, C. Continental-Scale Streamflow Modeling of Basins with Reservoirs: Towards a Coherent Deep-Learning-Based Strategy. J. Hydrol. 2021, 599, 126455. [Google Scholar] [CrossRef]
Kratzert, R.; Klotz, F.; Brenner, D.; Schulz, C.; Herrnegger, K. Rainfall-Runoff Modelling Using Long Short-Term Memory (LSTM) Networks. Hydrol. Earth Syst. Sci. 2018, 22, 2775–2784. [Google Scholar] [CrossRef]
Khand, K.; Senay, G.B. Evaluation of Streamflow Predictions from LSTM Models in Water- and Energy-Limited Regions in the United States. Mach. Learn. Appl. 2024, 16, 100551. [Google Scholar] [CrossRef]
Xu, L.; Shi, P.; Wu, H.; Qu, S.; Li, Q.; Sun, Y.; Yang, X.; Jiang, P.; Qiu, C. Investigating the Potential of EMA-Embedded Feature Selection Method for ESVR and LSTM to Enhance the Robustness of Monthly Streamflow Forecasting from Local Meteorological Information. J. Hydrol. 2024, 636, 131230. [Google Scholar] [CrossRef]
Duan, S.; Ullrich, P.; Shu, L. Using Convolutional Neural Networks for Streamflow Projection in California. Front. Water 2020, 2, 28. [Google Scholar] [CrossRef]
Ren, K.; Fang, W.; Qu, J.; Zhang, X.; Shi, X. Comparison of Eight Filter-Based Feature Selection Methods for Monthly Streamflow Forecasting–Three Case Studies on CAMELS Data Sets. J. Hydrol. 2020, 586, 124897. [Google Scholar] [CrossRef]
Feng, D.; Fang, K.; Shen, C. Enhancing Streamflow Forecast and Extracting Insights Using Long-Short Term Memory Networks With Data Integration at Continental Scales. Water Resour. Res. 2020, 56, e2019WR026793. [Google Scholar] [CrossRef]
Sadler, J.M.; Appling, A.P.; Read, J.S.; Oliver, S.K.; Jia, X.; Zwart, J.A.; Kumar, V. Multi-Task Deep Learning of Daily Streamflow and Water Temperature. Water Resour. Res. 2022, 58, e2021WR030138. [Google Scholar] [CrossRef]
Wi, S.; Steinschneider, S. Assessing the Physical Realism of Deep Learning Hydrologic Model Projections Under Climate Change. Water Resour. Res. 2022, 58, e2022WR032123. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G.; Langousis, A. Super Ensemble Learning for Daily Streamflow Forecasting: Large-Scale Demonstration and Comparison with Multiple Machine Learning Algorithms. Neural. Comput. Appl. 2021, 33, 3053–3068. [Google Scholar] [CrossRef]
Frame, J.M.; Kratzert, F.; Raney, A.; Rahman, M.; Salas, F.R.; Nearing, G.S. Post-Processing the National Water Model with Long Short-Term Memory Networks for Streamflow Predictions and Model Diagnostics. JAWRA J. Am. Water Resour. Assoc. 2021, 57, 885–905. [Google Scholar] [CrossRef]
Feng, D.; Liu, J.; Lawson, K.; Shen, C. Differentiable, Learnable, Regionalized Process-Based Models With Multiphysical Outputs Can Approach State-Of-The-Art Hydrologic Prediction Accuracy. Water Resour. Res. 2022, 58, e2022WR032404. [Google Scholar] [CrossRef]
Kratzert, F.; Klotz, D.; Hochreiter, S.; Nearing, G.S. A Note on Leveraging Synergy in Multiple Meteorological Data Sets with Deep Learning for Rainfall-Runoff Modeling. Hydrol. Earth Syst. Sci. 2021, 25, 2685–2703. [Google Scholar] [CrossRef]
Xie, K.; Liu, P.; Zhang, J.; Han, D.; Wang, G.; Shen, C. Physics-Guided Deep Learning for Rainfall-Runoff Modeling by Considering Extreme Events and Monotonic Relationships. J. Hydrol. 2021, 603, 127043. [Google Scholar] [CrossRef]
Yin, H.; Guo, Z.; Zhang, X.; Chen, J.; Zhang, Y. RR-Former: Rainfall-Runoff Modeling Based on Transformer. J. Hydrol. 2022, 609, 127781. [Google Scholar] [CrossRef]
Herath, H.M.V.V.; Chadalawada, J.; Babovic, V. Hydrologically Informed Machine Learning for Rainfall-Runoff Modelling: Towards Distributed Modelling. Hydrol. Earth Syst. Sci. 2021, 25, 4373–4401. [Google Scholar] [CrossRef]
Yin, W.; Fan, Z.; Tangdamrongsub, N.; Hu, L.; Zhang, M. Comparison of Physical and Data-Driven Models to Forecast Groundwater Level Changes with the Inclusion of GRACE–A Case Study over the State of Victoria, Australia. J. Hydrol. 2021, 602, 126735. [Google Scholar] [CrossRef]
Jin, J.; Zhang, Y.; Hao, Z.; Xia, R.; Yang, W.; Yin, H.; Zhang, X. Benchmarking Data-Driven Rainfall-Runoff Modeling across 54 Catchments in the Yellow River Basin: Overfitting, Calibration Length, Dry Frequency. J. Hydrol. Reg. Stud. 2022, 42, 101119. [Google Scholar] [CrossRef]
Klotz, D.; Kratzert, F.; Gauch, M.; Keefe Sampson, A.; Brandstetter, J.; Klambauer, G.; Hochreiter, S.; Nearing, G. Uncertainty Estimation with Deep Learning for Rainfall-Runoff Modeling. Hydrol. Earth Syst. Sci. 2022, 26, 1673–1693. [Google Scholar] [CrossRef]
Yin, H.; Zhang, X.; Wang, F.; Zhang, Y.; Xia, R.; Jin, J. Rainfall-Runoff Modeling Using LSTM-Based Multi-State-Vector Sequence-to-Sequence Model. J. Hydrol. 2021, 598, 126378. [Google Scholar] [CrossRef]
Stein, L.; Clark, M.P.; Knoben, W.J.M.; Pianosi, F.; Woods, R.A. How Do Climate and Catchment Attributes Influence Flood Generating Processes? A Large-Sample Study for 671 Catchments Across the Contiguous USA. Water Resour. Res. 2021, 57, e2020WR028300. [Google Scholar] [CrossRef]
Jarajapu, D.C.; Rathinasamy, M.; Agarwal, A.; Bronstert, A. Design Flood Estimation Using Extreme Gradient Boosting-Based on Bayesian Optimization. J. Hydrol. 2022, 613, 128341. [Google Scholar] [CrossRef]
Liu, L.; Liu, X.; Bai, P.; Liang, K.; Liu, C. Comparison of Flood Simulation Capabilities of a Hydrologic Model and a Machine Learning Model. Int. J. Climatol. 2023, 43, 123–133. [Google Scholar] [CrossRef]
Cai, H.; Shi, H.; Liu, S.; Babovic, V. Impacts of Regional Characteristics on Improving the Accuracy of Groundwater Level Prediction Using Machine Learning: The Case of Central Eastern Continental United States. J. Hydrol. Reg. Stud. 2021, 37. [Google Scholar] [CrossRef]
Cai, H.; Liu, S.; Shi, H.; Zhou, Z.; Jiang, S.; Babovic, V. Toward Improved Lumped Groundwater Level Predictions at Catchment Scale: Mutual Integration of Water Balance Mechanism and Deep Learning Method. J. Hydrol. 2022, 613, 128495. [Google Scholar] [CrossRef]
Ghosh, R.; Renganathan, A.; Tayal, K.; Li, X.; Khandelwal, A.; Jia, X.; Duffy, C.; Nieber, J.; Kumar, V. Robust Inverse Framework Using Knowledge-Guided Self-Supervised Learning: An Application to Hydrology. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14 August 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 465–474. [Google Scholar]
Abbas, A.; Boithias, L.; Pachepsky, Y.; Kim, K.; Chun, J.A.; Cho, K.H. AI4Water v1.0: An Open-Source Python Package for Modeling Hydrological Time Series Using Data-Driven Methods. Geosci. Model. Dev. 2022, 15, 3021–3039. [Google Scholar] [CrossRef]
Feng, D.; Beck, H.; Lawson, K.; Shen, C. The Suitability of Differentiable, Physics-Informed Machine Learning Hydrologic Models for Ungauged Regions and Climate Change Impact Assessment. Hydrol. Earth Syst. Sci. 2023, 27, 2357–2373. [Google Scholar] [CrossRef]
Frame, J.M.; Kratzert, F.; Gupta, H.V.; Ullrich, P.; Nearing, G.S. On Strictly Enforced Mass Conservation Constraints for Modelling the Rainfall-Runoff Process. Hydrol. Process. 2023, 37, e14847. [Google Scholar] [CrossRef]
Tsai, W.P.; Feng, D.; Pan, M.; Beck, H.; Lawson, K.; Yang, Y.; Liu, J.; Shen, C. From Calibration to Parameter Learning: Harnessing the Scaling Effects of Big Data in Geoscientific Modeling. Nat. Commun. 2021, 12, 5988. [Google Scholar] [CrossRef] [PubMed]
Papacharalampous, G.; Tyralis, H.; Langousis, A.; Jayawardena, A.W.; Sivakumar, B.; Mamassis, N.; Montanari, A.; Koutsoyiannis, D. Probabilistic Hydrological Post-Processing at Scale: Why and How to Apply Machine-Learning Quantile Regression Algorithms. Water 2019, 11, 2126. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G.; Tantanee, S. How to Explain and Predict the Shape Parameter of the Generalized Extreme Value Distribution of Streamflow Extremes Using a Big Dataset. J. Hydrol. 2019, 574, 628–645. [Google Scholar] [CrossRef]
Li, B.; Sun, T.; Tian, F.; Ni, G. Enhancing Process-Based Hydrological Models with Embedded Neural Networks: A Hybrid Approach. J. Hydrol. 2023, 625, 130107. [Google Scholar] [CrossRef]
Han, S.; Slater, L.; Wilby, R.; Faulkner, D. Contribution of Urbanisation to Non-Stationary River Flow in the UK. J. Hydrol. 2022, 613, 128417. [Google Scholar] [CrossRef]
Slater, L.J.; Arnal, L.; Boucher, M.A.; Chang, A.Y.Y.; Moulds, S.; Murphy, C.; Nearing, G.; Shalev, G.; Shen, C.; Speight, L.; et al. Hybrid Forecasting: Blending Climate Predictions with AI Models. Hydrol. Earth Syst. Sci. 2023, 27, 1865–1889. [Google Scholar] [CrossRef]
Slater, L.; Coxon, G.; Brunner, M.; McMillan, H.; Yu, L.; Zheng, Y.; Khouakhi, A.; Moulds, S.; Berghuijs, W. Spatial Sensitivity of River Flooding to Changes in Climate and Land Cover Through Explainable AI. Earths Future 2024, 12, e2023EF004035. [Google Scholar] [CrossRef]
De la Fuente, L.A.; Gupta, H.V.; Condon, L.E. Toward a Multi-Representational Approach to Prediction and Understanding, in Support of Discovery in Hydrology. Water Resour. Res. 2023, 59, e2021WR031548. [Google Scholar] [CrossRef]
Taheri, P.; Taheri, S.; Taheri, M.; Taheri, G. A Novel 24-Hour Deep Neural Network Based Streamflow Forecasting Method in Data-Scarce Regions. In Proceedings of the 2023 13th Smart Grid Conference (SGC), Tehran, Iran, 5–6 December 2023; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023. [Google Scholar]
Vega-Briones, J.; de Jong, S.; Galleguillos, M.; Wanders, N. Identifying Driving Processes of Drought Recovery in the Southern Andes Natural Catchments. J. Hydrol. Reg. Stud. 2023, 47, 101369. [Google Scholar] [CrossRef]
Quiñones, M.P.; Zortea, M.; Martins, L.S.A. Fast-Slow Streamflow Model Using Mass-Conserving LSTM. arXiv 2021, arXiv:2107.06057. [Google Scholar]
Kapoor, A.; Pathiraja, S.; Marshall, L.; Chandra, R. DeepGR4J: A Deep Learning Hybridization Approach for Conceptual Rainfall-Runoff Modelling. Environ. Model. Softw. 2023, 169, 105831. [Google Scholar] [CrossRef]
Althoff, D.; Destouni, G. Global Patterns in Water Flux Partitioning: Irrigated and Rainfed Agriculture Drives Asymmetrical Flux to Vegetation over Runoff. One Earth 2023, 6, 1246–1257. [Google Scholar] [CrossRef]
Yin, H.; Wang, F.; Zhang, X.; Zhang, Y.; Chen, J.; Xia, R.; Jin, J. Rainfall-Runoff Modeling Using Long Short-Term Memory Based Step-Sequence Framework. J. Hydrol. 2022, 610, 127901. [Google Scholar] [CrossRef]
Koya, S.R.; Roy, T. Temporal Fusion Transformers for Streamflow Prediction: Value of Combining Attention with Recurrence. J. Hydrol. 2024, 637, 131301. [Google Scholar] [CrossRef]
Bouri, I.; Lahariya, M.; Nivron, O.; Julia, E.P.; Backes, D.; Bilinski, P.; Schumann, G. ML Framework for Global River Flood Predictions Based on the Caravan Dataset. arXiv 2022, arXiv:2212.00719. [Google Scholar]
Lima, M.; Deck, K.; Dunbar, O.R.A.; Schneider, T. Toward Routing River Water in Land Surface Models with Recurrent Neural Networks. arXiv 2024, arXiv:2404.14212. [Google Scholar]
Yang, Y.; Chui, T.F.M. Profiling and Pairing Catchments and Hydrological Models With Latent Factor Model. Water Resour. Res. 2023, 59, e2022WR033684. [Google Scholar] [CrossRef]
Renganathan, A.; Ghosh, R.; Khandelwal, A.; Kumar, V. Task Aware Modulation Using Representation Learning: An Approach for Few Shot Learning in Heterogeneous Systems. arXiv 2023, arXiv:2310.04727. [Google Scholar]
Fischer, S.; Schumann, A.; Schumann, A.H. Dominant Flood Types in Europe and Their Role in Flood Statistics Dominant Flood Types in Europe and Their Role in Flood Statistics. Authorea 2024, Preprint. [Google Scholar] [CrossRef]
Nearing, G.; Cohen, D.; Dube, V.; Gauch, M.; Gilon, O.; Harrigan, S.; Hassidim, A.; Klotz, D.; Kratzert, F.; Metzger, A.; et al. Global Prediction of Extreme Floods in Ungauged Watersheds. Nature 2024, 627, 559–563. [Google Scholar] [CrossRef] [PubMed]
Murray, A.M.; Jørgensen, G.H.; Godiksen, P.N.; Anthonj, J.; Madsen, H. DHI-GHM: Real-Time and Forecasted Hydrology for the Entire Planet. J. Hydrol. 2023, 620, 129431. [Google Scholar] [CrossRef]
Lin, Y.; Wang, D.; Jiang, T.; Kang, A. Assessing Objective Functions in Streamflow Prediction Model Training Based on the Naïve Method. Water 2024, 16, 777. [Google Scholar] [CrossRef]
Constenla-Villoslada, S.; Liu, Y.; Wen, J.; Sun, Y.; Chonabayashi, S. Large-Scale Land Restoration Improved Drought Resilience in Ethiopia’s Degraded Watersheds. Nat. Sustain. 2022, 5, 488–497. [Google Scholar] [CrossRef]
Zambrano, F.; Vrieling, A.; Nelson, A.; Meroni, M.; Tadesse, T. Prediction of Drought-Induced Reduction of Agricultural Productivity in Chile from MODIS, Rainfall Estimates, and Climate Oscillation Indices. Remote Sens. Environ. 2018, 219, 15–30. [Google Scholar] [CrossRef]
Jalayer, S.; Sharifi, A.; Abbasi-Moghadam, D.; Tariq, A.; Qin, S. Assessment of Spatiotemporal Characteristic of Droughts Using In Situ and Remote Sensing-Based Drought Indices. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1483–1502. [Google Scholar] [CrossRef]
Sulugodu, B.; Deka, P.C. Evaluating the Performance of CHIRPS Satellite Rainfall Data for Streamflow Forecasting. Water Resour. Manag. 2019, 33, 3913–3927. [Google Scholar] [CrossRef]
Riazi, M.; Khosravi, K.; Shahedi, K.; Ahmad, S.; Jun, C.; Bateni, S.M.; Kazakis, N. Enhancing Flood Susceptibility Modeling Using Multi-Temporal SAR Images, CHIRPS Data, and Hybrid Machine Learning Algorithms. Sci. Total Environ. 2023, 871, 162066. [Google Scholar] [CrossRef] [PubMed]
Iamampai, S.; Talaluxmana, Y.; Kanasut, J.; Rangsiwanichpong, P. Enhancing Rainfall-Runoff Model Accuracy with Machine Learning Models by Using Soil Water Index to Reflect Runoff Characteristics. Water Sci. Technol. 2024, 89, 368–381. [Google Scholar] [CrossRef]
Nakhaei, M.; Mohebbi Tafreshi, A.; Saadi, T. An Evaluation of Satellite Precipitation Downscaling Models Using Machine Learning Algorithms in Hashtgerd Plain, Iran. Model. Earth Syst. Environ. 2023, 9, 2829–2843. [Google Scholar] [CrossRef]
Read, J.S.; Jia, X.; Willard, J.; Appling, A.P.; Zwart, J.A.; Oliver, S.K.; Karpatne, A.; Hansen, G.J.A.; Hanson, P.C.; Watkins, W.; et al. Process-Guided Deep Learning Predictions of Lake Water Temperature. Water Resour. Res. 2019, 55, 9173–9190. [Google Scholar] [CrossRef]
Han, H.; Morrison, R.R. Data-Driven Approaches for Runoff Prediction Using Distributed Data. Stoch. Environ. Res. Risk Assess. 2021, 36, 2153–2171. [Google Scholar] [CrossRef]
Alipour, A.; Ahmadalipour, A.; Abbaszadeh, P.; Moradkhani, H. Leveraging Machine Learning for Predicting Flash Flood Damage in the Southeast US. Environ. Res. Lett. 2020, 15, 024011. [Google Scholar] [CrossRef]
Lee, J.; Park, S.; Im, J.; Yoo, C.; Seo, E. Improved Soil Moisture Estimation: Synergistic Use of Satellite Observations and Land Surface Models over CONUS Based on Machine Learning. J. Hydrol. 2022, 609, 127749. [Google Scholar] [CrossRef]
Fang, B.; Lakshmi, V.; Bindlish, R.; Jackson, T.J. AMSR2 Soil Moisture Downscaling Using Temperature and Vegetation Data. Remote Sens. 2018, 10, 1575. [Google Scholar] [CrossRef]
Wang, F.; Chen, Y.; Li, Z.; Fang, G.; Li, Y.; Wang, X.; Zhang, X.; Kayumba, P.M. Developing a Long Short-Term Memory (LSTM)-Based Model for Reconstructing Terrestrial Water Storage Variations from 1982 to 2016 in the Tarim River Basin, Northwest China. Remote Sens. 2021, 13, 889. [Google Scholar] [CrossRef]
Chen, Z.; Zeng, Y.; Shen, G.; Xiao, C.; Xu, L.; Chen, N. Spatiotemporal Characteristics and Estimates of Extreme Precipitation in the Yangtze River Basin Using GLDAS Data. Int. J. Climatol. 2021, 41, E1812–E1830. [Google Scholar] [CrossRef]
Greifeneder, F.; Notarnicola, C.; Wagner, W. A Machine Learning-Based Approach for Surface Soil Moisture Estimations with Google Earth Engine. Remote Sens. 2021, 13, 2099. [Google Scholar] [CrossRef]
Li, C.; Yang, H.; Yang, W.; Liu, Z.; Jia, Y.; Li, S.; Yang, D. Error Characterization of Global Land Evapotranspiration Products: Collocation-Based Approach. J. Hydrol. 2022, 612, 128102. [Google Scholar] [CrossRef]
Zhang, G.; Zheng, W.; Yin, W.; Lei, W. Improving the Resolution and Accuracy of Groundwater Level Anomalies Using the Machine Learning-Based Fusion Model in the North China Plain. Sensors 2020, 21, 46. [Google Scholar] [CrossRef]
Seyoum, W.M.; Kwon, D.; Milewski, A.M. Downscaling GRACE TWSA Data into High-Resolution Groundwater Level Anomaly Using Machine Learning-Based Models in a Glacial Aquifer System. Remote Sens. 2019, 11, 824. [Google Scholar] [CrossRef]
Agarwal, V.; Akyilmaz, O.; Shum, C.K.; Feng, W.; Yang, T.Y.; Forootan, E.; Syed, T.H.; Haritashya, U.K.; Uz, M. Machine Learning Based Downscaling of GRACE-Estimated Groundwater in Central Valley, California. Sci. Total Environ. 2023, 865, 161138. [Google Scholar] [CrossRef] [PubMed]
Malakar, P.; Mukherjee, A.; Bhanja, S.N.; Ray, R.K.; Sarkar, S.; Zahid, A. Machine-Learning-Based Regional-Scale Groundwater Level Prediction Using GRACE. Hydrogeol. J. 2021, 29, 1027–1042. [Google Scholar] [CrossRef]
Ali, S.; Liu, D.; Fu, Q.; Cheema, M.J.M.; Pal, S.C.; Arshad, A.; Pham, Q.B.; Zhang, L. Constructing High-Resolution Groundwater Drought at Spatio-Temporal Scale Using GRACE Satellite Data Based on Machine Learning in the Indus Basin. J. Hydrol. 2022, 612, 128295. [Google Scholar] [CrossRef]
Liu, D.; Mishra, A.K.; Yu, Z.; Lü, H.; Li, Y. Support Vector Machine and Data Assimilation Framework for Groundwater Level Forecasting Using GRACE Satellite Data. J. Hydrol. 2021, 603, 126929. [Google Scholar] [CrossRef]
Sun, A.Y.; Scanlon, B.R.; Save, H.; Rateb, A. Reconstruction of GRACE Total Water Storage Through Automated Machine Learning. Water Resour. Res. 2021, 57, e2020WR028666. [Google Scholar] [CrossRef]
Yin, W.; Zhang, G.; Liu, F.; Zhang, D.; Zhang, X.; Chen, S. Improving the Spatial Resolution of GRACE-Based Groundwater Storage Estimates Using a Machine Learning Algorithm and Hydrological Model. Hydrogeol. J. 2022, 30, 947–963. [Google Scholar] [CrossRef]
Senay, G.B.; Velpuri, N.M.; Bohms, S.; Demissie, Y.; Gebremichael, M. Understanding the Hydrologic Sources and Sinks in the Nile Basin Using Multisource Climate and Remote Sensing Data Sets. Water Resour. Res. 2014, 50, 8625–8650. [Google Scholar] [CrossRef]
Wang, J.; Gao, H.; Liu, M.; Ding, Y.; Wang, Y.; Zhao, F.; Xia, J. Parameter Regionalization of the FLEX-Global Hydrological Model. Sci. China Earth Sci. 2021, 64, 571–588. [Google Scholar] [CrossRef]
Ngoma, H.; Wen, W.; Ayugi, B.; Babaousmail, H.; Karim, R.; Ongoma, V. Evaluation of Precipitation Simulations in CMIP6 Models over Uganda. Int. J. Climatol. 2021, 41, 4743–4768. [Google Scholar] [CrossRef]
Zhang, Y.; Ye, A.; Analui, B.; Nguyen, P.; Sorooshian, S.; Hsu, K.; Wang, Y. Comparing Quantile Regression Forest and Mixture Density Long Short-Term Memory Models for Probabilistic Post-Processing of Satellite Precipitation-Driven Streamflow Simulations. Hydrol. Earth Syst. Sci. 2023, 27, 4529–4550. [Google Scholar] [CrossRef]
Neeti, N.; Arun Murali, C.M.; Chowdary, V.M.; Rao, N.H.; Kesarwani, M. Integrated Meteorological Drought Monitoring Framework Using Multi-Sensor and Multi-Temporal Earth Observation Datasets and Machine Learning Algorithms: A Case Study of Central India. J. Hydrol. 2021, 601, 126638. [Google Scholar] [CrossRef]
Kolluru, V.; Kolluru, S.; Wagle, N.; Dev, T. Secondary Precipitation Estimate Merging Using Machine Learning: Development and Evaluation over Krishna River Basin, India. Remote Sens. 2020, 12, 3013. [Google Scholar] [CrossRef]
Alquraish, M.M.; Khadr, M. Remote-Sensing-Based Streamflow Forecasting Using Artificial Neural Network and Support Vector Machine Models. Remote Sens. 2021, 13, 4147. [Google Scholar] [CrossRef]
Bair, E.H.; Calfa, A.A.; Rittger, K.; Dozier, J. Using Machine Learning for Real-Time Estimates of Snow Water Equivalent in the Watersheds of Afghanistan. Cryosphere 2018, 12, 1579–1594. [Google Scholar] [CrossRef]
Rasiya Koya, S.; Kar, K.K.; Srivastava, S.; Tadesse, T.; Svoboda, M.; Roy, T. An Autoencoder-Based Snow Drought Index. Sci. Rep. 2023, 13, 20664. [Google Scholar] [CrossRef]
Gavahi, K.; Abbaszadeh, P.; Moradkhani, H. How Does Precipitation Data Influence the Land Surface Data Assimilation for Drought Monitoring? Sci. Total Environ. 2022, 831, 154916. [Google Scholar] [CrossRef] [PubMed]
Lee, W.J.; Lee, E.H. Runoff Prediction Based on the Discharge of Pump Stations in an Urban Stream Using a Modified Multi-Layer Perceptron Combined with Meta-Heuristic Optimization. Water 2022, 14, 99. [Google Scholar] [CrossRef]
Xu, T.; Guo, Z.; Xia, Y.; Ferreira, V.G.; Liu, S.; Wang, K.; Yao, Y.; Zhang, X.; Zhao, C. Evaluation of Twelve Evapotranspiration Products from Machine Learning, Remote Sensing and Land Surface Models over Conterminous United States. J. Hydrol. 2019, 578, 124105. [Google Scholar] [CrossRef]
Kim, H.; Crow, W.T.; Wagner, W.; Li, X.; Lakshmi, V. A Bayesian Machine Learning Method to Explain the Error Characteristics of Global-Scale Soil Moisture Products. Remote Sens. Environ. 2023, 296, 113718. [Google Scholar] [CrossRef]
Evans, S.; Williams, G.P.; Jones, N.L.; Ames, D.P.; Nelson, E.J. Exploiting Earth Observation Data to Impute Groundwater Level Measurements with an Extreme Learning Machine. Remote Sens. 2020, 12, 2044. [Google Scholar] [CrossRef]
Elbeltagi, A.; Kumari, N.; Dharpure, J.K.; Mokhtar, A.; Alsafadi, K.; Kumar, M.; Mehdinejadiani, B.; Ramezani Etedali, H.; Brouziyne, Y.; Towfiqul Islam, A.R.M.; et al. Prediction of Combined Terrestrial Evapotranspiration Index (Ctei) over Large River Basin Based on Machine Learning Approaches. Water 2021, 13, 547. [Google Scholar] [CrossRef]
Zhang, J.; Liu, K.; Wang, M. Downscaling Groundwater Storage Data in China to a 1-Km Resolution Using Machine Learning Methods. Remote Sens. 2021, 13, 523. [Google Scholar] [CrossRef]
Rahaman, M.M.; Thakur, B.; Kalra, A.; Li, R.; Maheshwari, P. Estimating High-Resolution Groundwater Storage from GRACE: A Random Forest Approach. Environ. MDPI 2019, 6, 63. [Google Scholar] [CrossRef]
Khorrami, B.; Ali, S.; Gündüz, O. Investigating the Local-Scale Fluctuations of Groundwater Storage by Using Downscaled GRACE/GRACE-FO JPL Mascon Product Based on Machine Learning (ML) Algorithm. Water Resour. Manag. 2023, 37, 3439–3456. [Google Scholar] [CrossRef]
Sahour, H.; Sultan, M.; Vazifedan, M.; Abdelmohsen, K.; Karki, S.; Yellich, J.A.; Gebremichael, E.; Alshehri, F.; Elbayoumi, T.M. Statistical Applications to Downscale GRACE-Derived Terrestrialwater Storage Data and to Fill Temporal Gaps. Remote Sens. 2020, 12, 533. [Google Scholar] [CrossRef]
Satizábal-Alarcón, D.A.; Suhogusoff, A.; Ferrari, L.C. Characterization of Groundwater Storage Changes in the Amazon River Basin Based on Downscaling of GRACE/GRACE-FO Data with Machine Learning Models. Sci. Total Environ. 2024, 912, 168958. [Google Scholar] [CrossRef]
Luo, L.; Robock, A.; Mitchell, K.E.; Houser, P.R.; Wood, E.F.; Schaake, J.C.; Lohmann, D.; Cosgrove, B.A.; Wen, F.; Sheffield, J.; et al. Validation of the North American Land Data Assimilation System (NLDAS) Retrospective Forcing over the Southern Great Plains. J. Geophys. Res. Atmos. 2003, 108, 8843. [Google Scholar] [CrossRef]
López-Bermeo, C.; Montoya, R.D.; Caro-Lopera, F.J.; Díaz-García, J.A. Validation of the Accuracy of the CHIRPS Precipitation Dataset at Representing Climate Variability in a Tropical Mountainous Region of South America. Phys. Chem. Earth Parts A/B/C 2022, 127, 103184. [Google Scholar] [CrossRef]
Venema, V.K.C.; Mestre, O.; Aguilar, E.; Auer, I.; Guijarro, J.A.; Domonkos, P.; Vertacnik, G.; Szentimrey, T.; Stepanek, P.; Zahradnicek, P.; et al. Benchmarking Homogenization Algorithms for Monthly Data. Clim. Past 2012, 8, 89–115. [Google Scholar] [CrossRef]
Zhao, Q.; Zhu, Y.; Wan, D.; Yu, Y.; Cheng, X. Research on the Data-Driven Quality Control Method of Hydrological Time Series Data. Water 2018, 10, 1712. [Google Scholar] [CrossRef]
Costa, A.C.; Soares, A. Homogenization of Climate Data: Review and New Perspectives Using Geostatistics. Math. Geosci. 2009, 41, 291–305. [Google Scholar] [CrossRef]
Gao, Y.; Merz, C.; Lischeid, G.; Schneider, M. A Review on Missing Hydrological Data Processing. Environ. Earth Sci. 2018, 77, 1–12. [Google Scholar] [CrossRef]
Hamzah, F.B.; Hamzah, F.M.; Razali, S.F.M.; Samad, H. A Comparison of Multiple Imputation Methods for Recovering Missing Data in Hydrological Studies. Civ. Eng. J. 2021, 7, 1608–1619. [Google Scholar] [CrossRef]
Wu, W.; Li, Y.; Luo, X.; Zhang, Y.; Ji, X.; Li, X. Performance Evaluation of the CHIRPS Precipitation Dataset and Its Utility in Drought Monitoring over Yunnan Province, China. Geomat. Nat. Hazards Risk 2019, 10, 2145–2162. [Google Scholar] [CrossRef]
Le, X.H.; Lee, G.; Jung, K.; An, H.U.; Lee, S.; Jung, Y. Application of Convolutional Neural Network for Spatiotemporal Bias Correction of Daily Satellite-Based Precipitation. Remote Sens. 2020, 12, 2731. [Google Scholar] [CrossRef]
Katiraie-Boroujerdy, P.S.; Naeini, M.R.; Asanjan, A.A.; Chavoshian, A.; Hsu, K.L.; Sorooshian, S. Bias Correction of Satellite-Based Precipitation Estimations Using Quantile Mapping Approach in Different Climate Regions of Iran. Remote Sens. 2020, 12, 2102. [Google Scholar] [CrossRef]
Goshime, D.W.; Absi, R.; Haile, A.T.; Ledésert, B.; Rientjes, T. Bias-Corrected CHIRP Satellite Rainfall for Water Level Simulation, Lake Ziway, Ethiopia. J. Hydrol. Eng. 2020, 25, 05020024. [Google Scholar] [CrossRef]
Goshime, D.W.; Absi, R.; Ledésert, B. Evaluation and Bias Correction of CHIRP Rainfall Estimate for Rainfall-Runoff Simulation over Lake Ziway Watershed, Ethiopia. Hydrology 2019, 6, 68. [Google Scholar] [CrossRef]
Wang, W.; Cui, W.; Wang, X.; Chen, X. Evaluation of GLDAS-1 and GLDAS-2 Forcing Data and Noah Model Simulations over China at the Monthly Scale. J. Hydrometeorol. 2016, 17, 2815–2833. [Google Scholar] [CrossRef]
Mulungu, D.M.M.; Mukama, E. Evaluation and Modelling of Accuracy of Satellite-Based CHIRPS Rainfall Data in Ruvu Subbasin, Tanzania. Model. Earth Syst. Environ. 2023, 9, 1287–1300. [Google Scholar] [CrossRef]
Najmi, A.; Igmoullan, B.; Namous, M.; El Bouazzaoui, I.; Brahim, Y.A.; El Khalki, E.M.; Saidi, M.E.M. Evaluation of PERSIANN-CCS-CDR, ERA5, and SM2RAIN-ASCAT Rainfall Products for Rainfall and Drought Assessment in a Semi-Arid Watershed, Morocco. J. Water Clim. Change 2023, 14, 1569–1584. [Google Scholar] [CrossRef]
Zhang, B.; Xia, Y.; Long, B.; Hobbins, M.; Zhao, X.; Hain, C.; Li, Y.; Anderson, M.C. Evaluation and Comparison of Multiple Evapotranspiration Data Models over the Contiguous United States: Implications for the next Phase of NLDAS (NLDAS-Testbed) Development. Agric. For. Meteorol. 2020, 280, 107810. [Google Scholar] [CrossRef]
Du, H.; Tan, M.L.; Zhang, F.; Chun, K.P.; Li, L.; Kabir, M.H. Evaluating the Effectiveness of CHIRPS Data for Hydroclimatic Studies. Theor. Appl. Climatol. 2024, 155, 1519–1539. [Google Scholar] [CrossRef]
Yang, N.; Yu, H.; Lu, Y.; Zhang, Y.; Zheng, Y.; Walter, R.C.; Bechtel, T.D.; et al. Evaluating the Applicability of PERSIANN-CDR Products in Drought Monitoring: A Case Study of Long-Term Droughts over Huaihe River Basin, China. Remote Sens. 2022, 14, 4460. [Google Scholar] [CrossRef]
Ekström, M.; Grose, M.R.; Whetton, P.H. An Appraisal of Downscaling Methods Used in Climate Change Research. Wiley Interdiscip. Rev. Clim. Change 2015, 6, 301–319. [Google Scholar] [CrossRef]
Schoof, J.T. Statistical Downscaling in Climatology. Geogr. Compass 2013, 7, 249–265. [Google Scholar] [CrossRef]
Chen, J.; Brissette, F.P.; Leconte, R. Uncertainty of Downscaling Method in Quantifying the Impact of Climate Change on Hydrology. J. Hydrol. 2011, 401, 190–202. [Google Scholar] [CrossRef]
Ferraro, R.; Waliser, D.; Peters-Lidard, C. NASA Downscaling Project: Final Report; JPL Open Repository; Jet Propulsion Laboratory: Pasadena, CA, USA, 1 February 2017. [Google Scholar]
Addor, N.; Do, H.X.; Alvarez-Garreton, C.; Coxon, G.; Fowler, K.; Mendoza, P.A. Large-Sample Hydrology: Recent Progress, Guidelines for New Datasets and Grand Challenges. Hydrol. Sci. J. 2020, 65, 712–725. [Google Scholar] [CrossRef]
Rodell, M.; Houser, P.R.; Jambor, U.; Gottschalck, J.; Mitchell, K.; Meng, C.J.; Arsenault, K.; Cosgrove, B.; Radakovich, J.; Bosilovich, M.; et al. The Global Land Data Assimilation System. Bull. Am. Meteorol. Soc. 2004, 85, 381–394. [Google Scholar] [CrossRef]
Landerer, F.W.; Swenson, S.C. Accuracy of Scaled GRACE Terrestrial Water Storage Estimates. Water Resour. Res. 2012, 48, W04531. [Google Scholar] [CrossRef]
Bai, L.; Shi, C.; Li, L.; Yang, Y.; Wu, J. Accuracy of CHIRPS Satellite-Rainfall Products over Mainland China. Remote Sens. 2018, 10, 362. [Google Scholar] [CrossRef]
Miao, C.; Ashouri, H.; Hsu, K.L.; Sorooshian, S.; Duan, Q. Evaluation of the PERSIANN-CDR Daily Rainfall Estimates in Capturing the Behavior of Extreme Precipitation Events over China. J. Hydrometeorol. 2015, 16, 1387–1396. [Google Scholar] [CrossRef]
Wang, K.; Zhang, T.; Clow, G.D. Permafrost Thermal Responses to Asymmetrical Climate Changes: An Integrated Perspective. Geophys. Res. Lett. 2023, 50, e2022GL100327. [Google Scholar] [CrossRef]
Peng, X.; Zhang, T.; Frauenfeld, O.W.; Mu, C.; Wang, K.; Wu, X.; Guo, D.; Luo, J.; Hjort, J.; Aalto, J.; et al. Active Layer Thickness and Permafrost Area Projections for the 21st Century. Earths Future 2023, 11, e2023EF003573. [Google Scholar] [CrossRef]
O’Driscoll, M.; Clinton, S.; Jefferson, A.; Manda, A.; McMillan, S. Urbanization Effects on Watershed Hydrology and In-Stream Processes in the Southern United States. Water 2010, 2, 605–648. [Google Scholar] [CrossRef]
Fanelli, R.; Prestegaard, K.; Palmer, M. Evaluation of Infiltration-Based Stormwater Management to Restore Hydrological Processes in Urban Headwater Streams. Hydrol. Process. 2017, 31, 3306–3319. [Google Scholar] [CrossRef]
Oswald, C.J.; Kelleher, C.; Ledford, S.H.; Hopkins, K.G.; Sytsma, A.; Tetzlaff, D.; Toran, L.; Voter, C. Integrating Urban Water Fluxes and Moving beyond Impervious Surface Cover: A Review. J. Hydrol. 2023, 618, 129188. [Google Scholar] [CrossRef]
Socioeconomic Data and Applications Center|SEDAC. Available online: https://sedac.ciesin.columbia.edu/ (accessed on 25 May 2024).
Global Water Research Coalition (GWRC). Available online: https://globalwaterresearchcoalition.net/ (accessed on 25 May 2024).
AQUASTAT-FAO’s Global Information System on Water and Agriculture. Available online: https://www.fao.org/aquastat/en/databases/ (accessed on 25 May 2024).
Moges, E.; Demissie, Y.; Larsen, L.; Yassin, F. Review: Sources of Hydrological Model Uncertainties and Advances in Their Analysis. Water 2021, 13, 28. [Google Scholar] [CrossRef]
Renard, B.; Kavetski, D.; Kuczera, G.; Thyer, M.; Franks, S.W. Understanding Predictive Uncertainty in Hydrologic Modeling: The Challenge of Identifying Input and Structural Errors. Water Resour. Res. 2010, 46, W05521. [Google Scholar] [CrossRef]
Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
Nemani, V.; Biggio, L.; Huan, X.; Hu, Z.; Fink, O.; Tran, A.; Wang, Y.; Zhang, X.; Hu, C. Uncertainty Quantification in Machine Learning for Engineering Design and Health Prognostics: A Tutorial. Mech. Syst. Signal Process. 2023, 205, 110796. [Google Scholar] [CrossRef]
Dolezal, J.M.; Srisuwananukorn, A.; Karpeyev, D.; Ramesh, S.; Kochanny, S.; Cody, B.; Mansfield, A.S.; Rakshit, S.; Bansal, R.; Bois, M.C.; et al. Uncertainty-Informed Deep Learning Models Enable High-Confidence Predictions for Digital Histopathology. Nat. Commun. 2022, 13, 6572. [Google Scholar] [CrossRef] [PubMed]
Kimani, M.W.; Hoedjes, J.C.B.; Su, Z. Bayesian Bias Correction of Satellite Rainfall Estimates for Climate Studies. Remote Sens. 2018, 10, 1074. [Google Scholar] [CrossRef]
Abbasi, M.; Farokhnia, A.; Bahreinimotlagh, M.; Roozbahani, R. A Hybrid of Random Forest and Deep Auto-Encoder with Support Vector Regression Methods for Accuracy Improvement and Uncertainty Reduction of Long-Term Streamflow Prediction. J. Hydrol. 2021, 597, 125717. [Google Scholar] [CrossRef]
Xie, X.; Xie, B.; Cheng, J.; Chu, Q.; Dooling, T. A Simple Monte Carlo Method for Estimating the Chance of a Cyclone Impact. Nat. Hazards 2021, 107, 2573–2582. [Google Scholar] [CrossRef]
Hong, Y.; Hsu, K.L.; Moradkhani, H.; Sorooshian, S. Uncertainty Quantification of Satellite Precipitation Estimation and Monte Carlo Assessment of the Error Propagation into Hydrologic Response. Water Resour. Res. 2006, 42, W08421. [Google Scholar] [CrossRef]
Greatrex, H.; Grimes, D.; Wheeler, T. Advances in the Stochastic Modeling of Satellite-Derived Rainfall Estimates Using a Sparse Calibration Dataset. J. Hydrometeorol. 2014, 15, 1810–1831. [Google Scholar] [CrossRef]
Gan, Y.; Duan, Q.; Gong, W.; Tong, C.; Sun, Y.; Chu, W.; Ye, A.; Miao, C.; Di, Z. A Comprehensive Evaluation of Various Sensitivity Analysis Methods: A Case Study with a Hydrological Model. Environ. Model. Softw. 2014, 51, 269–285. [Google Scholar] [CrossRef]
Song, X.; Zhang, J.; Zhan, C.; Xuan, Y.; Ye, M.; Xu, C. Global Sensitivity Analysis in Hydrological Modeling: Review of Concepts, Methods, Theoretical Framework, and Applications. J. Hydrol. 2015, 523, 739–757. [Google Scholar] [CrossRef]
Mirzaei, M.; Huang, Y.F.; El-Shafie, A.; Shatirah, A. Application of the Generalized Likelihood Uncertainty Estimation (GLUE) Approach for Assessing Uncertainty in Hydrological Models: A Review. Stoch. Environ. Res. Risk Assess. 2015, 29, 1265–1273. [Google Scholar] [CrossRef]
Galavi, H.; Mirzaei, M.; Yu, B.; Lee, J. Bootstrapped Ensemble and Reliability Ensemble Averaging Approaches for Integrated Uncertainty Analysis of Streamflow Projections. Stoch. Environ. Res. Risk Assess. 2023, 37, 1213–1227. [Google Scholar] [CrossRef]
Duan, Q.; Ajami, N.K.; Gao, X.; Sorooshian, S. Multi-Model Ensemble Hydrologic Prediction Using Bayesian Model Averaging. Adv. Water Resour. 2007, 30, 1371–1386. [Google Scholar] [CrossRef]
Ehsani, M.R.; Behrangi, A. A Comparison of Correction Factors for the Systematic Gauge-Measurement Errors to Improve the Global Land Precipitation Estimate. J. Hydrol. 2022, 610, 127884. [Google Scholar] [CrossRef]
Horner, I.; Renard, B.; Le Coz, J.; Branger, F.; McMillan, H.K.; Pierrefeu, G. Impact of Stage Measurement Errors on Streamflow Uncertainty. Water Resour. Res. 2018, 54, 1952–1976. [Google Scholar] [CrossRef]
Mizukami, N.; Smith, M.B. Analysis of Inconsistencies in Multi-Year Gridded Quantitative Precipitation Estimate over Complex Terrain and Its Impact on Hydrologic Modeling. J. Hydrol. 2012, 428–429, 129–141. [Google Scholar] [CrossRef]
Van de Schoot, R.; Kaplan, D.; Denissen, J.; Asendorpf, J.B.; Neyer, F.J.; van Aken, M.A.G. A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. Child Dev. 2014, 85, 842. [Google Scholar] [CrossRef] [PubMed]
Kamyab, H.; Khademi, T.; Chelliapan, S.; SaberiKamarposhti, M.; Rezania, S.; Yusuf, M.; Farajnezhad, M.; Abbas, M.; Hun Jeon, B.; Ahn, Y. The Latest Innovative Avenues for the Utilization of Artificial Intelligence and Big Data Analytics in Water Resource Management. Results Eng. 2023, 20, 101566. [Google Scholar] [CrossRef]
Ming, X.; Liang, Q.; Xia, X.; Li, D.; Fowler, H.J. Real-Time Flood Forecasting Based on a High-Performance 2-D Hydrodynamic Model and Numerical Weather Predictions. Water Resour. Res. 2020, 56, e2019WR025583. [Google Scholar] [CrossRef]
Marz, N.; Warren, J. Big Data: Principles and Best Practices of Scalable Realtime Data Systems; Simon and Schuster: New York, NY, USA, 2015. [Google Scholar]
Fersch, B.; Francke, T.; Heistermann, M.; Schrön, M.; Döpper, V.; Jakobi, J.; Baroni, G.; Blume, T.; Bogena, H.; Budach, C.; et al. A Dense Network of Cosmic-Ray Neutron Sensors for Soil Moisture Observation in a Highly Instrumented Pre-Alpine Headwater Catchment in Germany. Earth Syst. Sci. Data 2020, 12, 2289–2309. [Google Scholar] [CrossRef]
Khan, Z.; Anjum, A.; Kiani, S.L. Cloud Based Big Data Analytics for Smart Future Cities. In Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing, Dresden, Germany, 9–12 December 2013; pp. 381–386. [Google Scholar] [CrossRef]
Khan, S.; Shakil, K.A.; Alam, M. Big Data Computing Using Cloud-Based Technologies: Challenges and Future Perspectives. In Networks of the Future; CRC: Boca Raton, FL, USA, 2017; pp. 393–414. [Google Scholar] [CrossRef]
Krishnamurthy, S.; Franklin, M.J.; Davis, J.; Farina, D.; Golovko, P.; Li, A.; Thombre, N. Continuous Analytics over Discontinuous Streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 1081–1091. [Google Scholar] [CrossRef]
Kolajo, T.; Daramola, O.; Adebiyi, A. Big Data Stream Analysis: A Systematic Literature Review. J. Big Data 2019, 6, 47. [Google Scholar] [CrossRef]
Sauermann, H.; Vohland, K.; Antoniou, V.; Balázs, B.; Göbel, C.; Karatzas, K.; Mooney, P.; Perelló, J.; Ponti, M.; Samson, R.; et al. Citizen Science and Sustainability Transitions. Res. Policy 2020, 49, 103978. [Google Scholar] [CrossRef]
Buytaert, W.; Zulkafli, Z.; Grainger, S.; Acosta, L.; Alemie, T.C.; Bastiaensen, J.; De Bièvre, B.; Bhusal, J.; Clark, J.; Dewulf, A.; et al. Citizen Science in Hydrology and Water Resources: Opportunities for Knowledge Generation, Ecosystem Service Management, and Sustainable Development. Front. Earth Sci. 2014, 2, 104024. [Google Scholar] [CrossRef]
Njue, N.; Stenfert Kroese, J.; Gräf, J.; Jacobs, S.R.; Weeser, B.; Breuer, L.; Rufino, M.C. Citizen Science in Hydrological Monitoring and Ecosystem Services Management: State of the Art and Future Prospects. Sci. Total Environ. 2019, 693, 133531. [Google Scholar] [CrossRef]
Tran, H.N.; Rutten, M.; Prajapati, R.; Tran, H.T.; Duwal, S.; Nguyen, D.T.; Davids, J.C.; Miegel, K. Citizen Scientists’ Engagement in Flood Risk-Related Data Collection: A Case Study in Bui River Basin, Vietnam. Environ. Monit. Assess. 2024, 196, 280. [Google Scholar] [CrossRef]
Paul, J.D.; Buytaert, W.; Allen, S.; Ballesteros-Canovas, J.A.; Bhusal, J.; Cieslik, K.; Clark, J.; Dugar, S.; Hannah, D.M.; Stoffel, M.; et al. Citizen Science for Hydrological Risk Reduction and Resilience Building. Wiley Interdiscip. Rev. Water 2018, 5, e1262. [Google Scholar] [CrossRef]
Walker, D.W.; Smigaj, M.; Tani, M. The Benefits and Negative Impacts of Citizen Science Applications to Water as Experienced by Participants and Communities. Wiley Interdiscip. Rev. Water 2021, 8, e1488. [Google Scholar] [CrossRef]
Salamone, F.; Masullo, M.; Sibilio, S. Wearable Devices for Environmental Monitoring in the Built Environment: A Systematic Review. Sensors 2021, 21, 4727. [Google Scholar] [CrossRef] [PubMed]
Tavra, M.; Racetin, I.; Peroš, J. The Role of Crowdsourcing and Social Media in Crisis Mapping: A Case Study of a Wildfire Reaching Croatian City of Split. Geoenvironmental Disasters 2021, 8, 10. [Google Scholar] [CrossRef]
Khan, Q.; Kalbus, E.; Zaki, N.; Mohamed, M.M. Utilization of Social Media in Floods Assessment Using Data Mining Techniques. PLoS ONE 2022, 17, e0267079. [Google Scholar] [CrossRef]
Perumal, T.; Sulaiman, M.N.; Leong, C.Y. Internet of Things (IoT) Enabled Water Monitoring System. In Proceedings of the 2015 IEEE 4th Global Conference on Consumer Electronics, GCCE 2015, Osaka, Japan, 27–30 October 2015; pp. 86–87. [Google Scholar] [CrossRef]

Figure 1. Trends in machine learning applications in hydrology (2020–2024).

Figure 2. Summary of the challenges of current LSH datasets.

Figure 3. Comparison of NLDAS forcing with local forcing for precipitation at Station EF-4 (ARM/CART, Plevna, Kansas), which is representative of other stations. Each point in the hourly panel represents one hour during the period from 0000 UT on 1 January 1998 to 2300 UT on 30 September 1999. The averaging period for the other panels is indicated accordingly [180].

Figure 4. Mean rainfall data from rain gauge and CHIRPS: (a) daily and (b) monthly.

Figure 5. Comparison of dataset limitations in hydrology.

Figure 6. Summary of the future directions of current LSH datasets.

Table 1. Applications, advantages, and disadvantages of machine learning techniques.

Machine Learning Techniques	Applications	Advantages	Disadvantages
Long short-term memory networks (LSTMs)	Streamflow prediction, rainfall-runoff modeling, groundwater level forecasting	Captures temporal patterns; improves predictive accuracy	Prone to overfitting; limited interpretability
Random forests (RFs)	Flood forecasting, drought assessments, precipitation modeling	Robust to noisy data; provides feature importance; handles large datasets	Potential bias in small datasets
Support vector machines (SVMs)	Streamflow prediction, groundwater level forecasting, precipitation estimation	Effective in high-dimensional spaces; robust to overfitting	Requires careful parameter tuning; sensitive to noise
Artificial neural networks (ANNs)	Rainfall-runoff modeling, flood forecasting, water quality prediction	Models complex non-linear relationships	Prone to overfitting; limited interpretability
Gradient boosting machines (GBMs)	Flood prediction, soil moisture estimation, groundwater level prediction	High predictive accuracy; provides feature importance; suitable for classification and regression	Requires careful parameter tuning; prone to overfitting
Convolutional neural networks (CNNs)	Remote sensing data analysis, precipitation estimation, flood mapping	Recognizes spatial patterns; handles large-scale datasets; learns features automatically	Complex to design and tune; requires large, labeled datasets
Transformer models	Streamflow prediction, flood forecasting	Captures long-range dependencies; scalable with parallel processing; superior performance on sequential data	High computational demand; complex architecture; requires large amounts of data

Table 2. Key datasets used in hydrological ML applications.

Dataset	Spatial Coverage	Temporal Coverage	Data Resolution	Key Attributes	Primary Applications
CAMELS	671 catchments in CONUS	1980–2014	Daily	Topography, climate, streamflow, land cover, soil, geology	Large-sample hydrological studies, catchment attribute analysis
Caravan	6830 catchments globally	Nearly four decades	Daily	Meteorological forcing, streamflow, static catchment attributes	Global hydrological studies, extensibility for new locations
GRDC	9800 stations worldwide	Up to 200 years	Daily, monthly	River discharge data	Global water resource management, climate impact studies
CHIRPS	Global	1981–present	Daily, pentadal, monthly; 0.05° spatial resolution	Precipitation estimates	Climate extremes monitoring, drought forecasting
PERSIANN (CCS, CDR, CCS-CDR)	Near-global (60° S to 60° N)	CCS: 2003–present; CDR: 1983–present; CCS-CDR: 1983–present	CCS (0.04°; hourly, 3-hourly, 6-hourly, daily, monthly, yearly); CDR (0.25°; daily); CCS-CDR (0.04°; 3-hourly)	Precipitation estimates	CCS: real-time weather monitoring, short-term forecasting, severe weather analysis; CDR: long-term climatological studies, precipitation analysis; CCS-CDR: extreme weather event analysis, climatological studies, hydrological modeling
GLDAS	Global (north of 60° S)	1948–present	3-hourly; 1° and 0.25° spatial resolution	Land surface states and fluxes	Global land surface condition monitoring, hydrological modeling
GRACE	Global	2002–2017 (GRACE), 2018–present (GRACE-FO)	Monthly; 1° spatial resolution	Gravitational field variations	Water distribution and mass transport studies, groundwater depletion analysis

Table 3. Key case studies and findings using the datasets.

Dataset	Applications	Case Studies and Findings
CAMELS	Streamflow forecasting	LSTM with transfer learning outperforms locally trained models in Chile and China [83]; LSTM networks outperform traditional models [85]; data integration improves accuracy [90]; multi-task learning enhances predictions [91].
	Rainfall-runoff modeling	The LSTM with multiple meteorological forcings improves accuracy [96]; the PHY-LSTM integrates physical mechanisms [97]; transformer-based RR-former outperforms LSTM models [126]; MDNs and Monte Carlo Dropout address prediction uncertainties [102].
	Flood forecasting	An ML framework for flood peak prediction [33]; random forest models for climate attributes influencing flood processes [104]; extreme gradient boosting for design flood estimation [105].
	Groundwater level forecasting	Improved model performance by integrating regional characteristics [107]; combining water balance-based processes with deep learning outperforms pure deep learning models [108].
	Other hydrological applications	Differentiable, physics-informed machine learning models demonstrate their generalizability to ungauged regions [111]; MC-LSTM models performed comparably to standard LSTM models [112]; AI4Water enhances model accuracy and interpretability [110].
CAMELS-GB	Streamflow and hydroclimatic impacts	Urbanization impacts on river discharge [117]; hybrid hydroclimatic forecasting [118].
CAMELS-CL	Hydrological predictions	LSTM and random forest models enhance predictions [120]; a deep neural network for 24-hour streamflow forecasting [121].
CAMELS-BR	Streamflow prediction	The FS-LSTM model shows improved performance for streamflow prediction [123].
CAMELS-AUS	Streamflow prediction and water flux	A hybrid model combining GR4J with CNN and LSTM networks [124]; the global analysis of water flux partitioning [125].
Caravan	Streamflow and flood prediction	Temporal fusion transformers outperform the LSTM and transformer models [127]; a two-path LSTM model for river flood prediction [128]; LSTM models outperform traditional models in streamflow prediction [129].
	Catchment model instance prediction	A latent factor model for predicting catchment model instance associations [130].
	Task-aware modulation in predictions	Task-aware modulation using representation learning (TAM-RL) for GPP and streamflow predictions [131].
GRDC	Streamflow and water balance	Improved monthly runoff reconstructions [29]; a physics-encoded deep learning framework for streamflow predictions [30]; a flood type analysis across Europe [132].
	Flood prediction and analysis	An AI-based model to predict extreme floods in ungauged watersheds [133].
	Hydrological modeling and simulation	A DHI-GHM model for real-time and forecasted hydrological simulations globally [134]; an evaluation of objective functions for streamflow prediction, showing that the LSTM excels in high-flow forecasting [135].
CHIRPS	Drought assessment	A drought assessment in Ethiopia [136]; predicting drought-induced reductions in agricultural productivity [137]; integrated CHIRPS for drought assessment in Iran [138].
	Runoff estimation and streamflow	CHIRPS’s performance in Saudi Arabia [18]; enhanced streamflow forecasting in India’s Varahi River basin [19]; better performance of CHIRPS over IMD data for streamflow forecasting [139].
	Flood modeling and susceptibility	Flood susceptibility modeling in Iran [140]; combining CHIRPS data with a soil water index for enhanced rainfall-runoff model accuracy [141].
	Precipitation model improvement	Bias correction methods for downscaling precipitation models using CHIRPS in high-altitude regions [142].
PERSIANN	Hydrological modeling	Hydrological and coupled soft computing models for streamflow and sediment load [20]; ML and process-based models for rainfall-runoff in the DuPage River Basin [21].
	Flood prediction	Forecasting extreme flood events using satellite precipitation and wavelet-based ML [22]; improving hourly precipitation estimates for flash flood modeling in the Andean-Amazon basins [23].
	Precipitation estimation	cGANs for real-time precipitation estimation from GOES-16 imagery [24]; a two-stage deep neural network for precipitation estimation from bispectral satellite information [25].
	Drought assessment	Hybrid ensemble learning for super drought computation in the Lake Victoria Basin [26].
	Runoff simulation	A runoff simulation using multi-source satellite data and deep learning [27]; a fusion-based framework for daily flood forecasting in the Kan River, Iran [28].
NLDAS	Hydrological modeling	Process-guided deep learning for lake water temperatures [143]; spatial downscaling of precipitation [42].
	Runoff and flood prediction	ML models to predict hourly runoff in California’s Russian River basin [144]; predicting flash flood damage in the Southeast US [145].
	Soil moisture and evapotranspiration	Enhanced soil moisture estimation [146]; a downscaling algorithm for soil moisture estimation [147].
GLDAS	Hydrological modeling	An LSTM-based model for terrestrial water storage [148]; a combined hydrological model for streamflow simulations in Thailand [149].
	Soil moisture and evapotranspiration	Estimating surface soil moisture using GBRT [150]; evaluating global land evapotranspiration (ET) products [151].
	Groundwater and storage data	Improving groundwater level anomaly predictions [152]; downscaling groundwater storage data using ML techniques [152].
GRACE	Groundwater and water storage	Downscaling GRACE TWSA data with boosted regression trees [153]; random forest models highlight groundwater storage loss [154]; SVM models for groundwater level prediction [155]; an XGBoost model to downscale GRACE-derived groundwater storage data in the Indus Basin [156].
	Groundwater level prediction	Combined SVMs with ensemble Kalman filtering [157]; RF and SVM models for monitoring groundwater fluctuations [9].
	Enhancing spatial resolution	An AutoML workflow for reconstructing GRACE TWSA data [158]; RF and hydrological models for higher accuracy in the Haihe River Basin [159].

Table 4. Comparison of the advantages of high-resolution datasets over conventional approaches.

Aspect	Shortcomings of Traditional Methods	How High-Resolution Datasets Address These Shortcomings
Precipitation Monitoring
Rain gauges	Point-specific data, sparse distribution, maintenance required	High-resolution satellite data (e.g., PERSIANN-CDR and CHIRPS) provide comprehensive coverage and effectively capture spatial and temporal patterns of precipitation [205,206].
Thiessen polygons	Assumes uniform precipitation within polygons, inaccurate for heterogeneous landscapes	Datasets like CHIRPS and PERSIANN-CCS resolve precipitation on fine grids (0.05° and 0.04°, respectively).
Isohyetal method	Labor-intensive, subjective, relies on the manual drawing of isohyets	Automated algorithms in datasets like PERSIANN offer consistent and objective estimates.
Empirical models	Dependent on historical data quality and availability, less reliable under varying climatic conditions	The NLDAS and GLDAS provide detailed temporal and spatial data, enhancing model accuracy.
Streamflow Assessment
Stream gauges	Measures water levels at specific points, limited spatial coverage	GRDC offers extensive global streamflow records.
Hydrological models	Based on simplified assumptions, extensive calibration and validation needed	The NLDAS and GLDAS enhance model inputs and improve accuracy by integrating advanced observational data [203].
General Data Issues
Sparse data coverage	Many regions lack sufficient ground-based measurements	High-resolution satellite datasets (e.g., PERSIANN, CHIRPS) provide global coverage.
Inconsistent data formats	Varying formats and standards make it difficult to integrate data from different sources	The Caravan project and FAIR data principles improve standardization and interoperability [16,202].
Temporal resolution	Limited temporal resolution, missing short-term variations	The NLDAS offers hourly data, and the GLDAS provides three-hourly to monthly data.
Data gaps and missing values	Equipment failure or loss of data leads to gaps in traditional datasets	CAMELS and Caravan offer consistent data coverage, minimizing gaps.
Regional bias	Traditional datasets often focus on specific regions, limiting applicability to other areas	High-resolution global datasets (e.g., PERSIANN, GLDAS, GRACE) provide consistent global or near-global coverage.
Extreme event detection	May fail to capture extreme weather events accurately	High-resolution datasets improve the detection and modeling of extreme events [206].
Data quality and consistency	Variations in measurement techniques and data quality introduce inconsistencies	Standardized high-resolution datasets ensure more consistent data quality.

Table 5. Impacts of uncertainty quantification methods on hydrological models.

Method	Description	Applications	Impact on Hydrological Models
Bayesian approaches	Updates probabilities with new data, incorporates prior knowledge.	Bayesian hierarchical models for multi-level uncertainty quantification [220].	Comprehensive uncertainty assessment, enhanced robustness.
Machine learning techniques	Generates output distributions using ensemble and deep learning methods.	Random forest models provide uncertainty estimates [221].	Improved predictions, better variability representation.
Monte Carlo simulations	Runs models with different input parameters [222].	Quantifies the error of satellite rainfall estimation [223].	Quantifies outcome range, better risk assessment.
Stochastic modeling	Incorporates random variables to represent variability.	Improves rainfall simulation with multi-zone calibration for complex terrains and sparse data [224].	Enhances hydrological models, improving crop yield forecasts and water management.
Sensitivity analysis	Assesses output variation due to different input parameters.	Global sensitivity analysis with Sobol indices [225,226].	Identifies key uncertainty sources, guides data collection.
GLUE (generalized likelihood uncertainty estimation)	Runs multiple simulations with different parameters, evaluates likelihood.	Estimates the uncertainty of water resource modeling, including quality, rainfall-runoff, and groundwater modeling [227].	Enhances parameter sensitivity understanding and performance.
Bootstrap methods	Resampling technique for estimating the distribution of a statistic.	Assesses parameter and prediction uncertainty [228].	Provides confidence intervals, enhances prediction reliability.
Data assimilation	Integrates observations with predictions for accuracy.	Ensemble Kalman filter in the NLDAS and GLDAS.	Reduces uncertainty, updates model states with observed data.
Polynomial chaos expansion	Expands random variables onto orthogonal polynomials.	Groundwater flow modeling.	Efficient uncertainty quantification in complex models.
Bayesian model averaging	Combines predictions from multiple models weighted by their probabilities.	Combines hydrological model predictions [229].	Reduces model uncertainty by averaging, provides reliable predictions.
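
As an illustration of the bootstrap row in Table 5, the following minimal Python sketch resamples paired observed and simulated streamflow values with replacement to obtain a percentile confidence interval for a skill score. The choice of the Nash-Sutcliffe efficiency (NSE) as the statistic, the function names, and the default settings are assumptions made for this example and are not taken from the studies cited above. The sketch treats time steps as independent; for strongly autocorrelated streamflow series, a block bootstrap would likely be more appropriate.

```python
import numpy as np


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squared deviations of obs from its mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)


def bootstrap_ci(obs, sim, stat=nse, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a paired skill statistic (hypothetical helper)."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    rng = np.random.default_rng(seed)
    n = obs.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample time steps with replacement, keeping obs/sim pairs together.
        idx = rng.integers(0, n, size=n)
        stats[b] = stat(obs[idx], sim[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return stat(obs, sim), lower, upper


if __name__ == "__main__":
    # Synthetic data for demonstration only (not real gauge records).
    rng = np.random.default_rng(42)
    obs = 10 + 5 * np.sin(np.linspace(0, 4 * np.pi, 365)) + rng.normal(0, 1, 365)
    sim = obs + rng.normal(0, 1.5, 365)
    point, lower, upper = bootstrap_ci(obs, sim)
    print(f"NSE = {point:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

The interval width gives a rough indication of how much the reported skill depends on the particular sample of time steps, which is the sense in which the bootstrap "provides confidence intervals" in Table 5.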

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
