1. Introduction
Accurate hydrological modeling is essential for flood prevention, water resource management, and evaluating the impacts of climate change and human activities [1,2]. Reliable inputs, including meteorological data, hydrological records, and relevant physical attributes, are crucial for accurate predictions [3]. Generally, hydrological data are less abundant and less comprehensive than meteorological data. This disparity arises partly because key hydrological processes, such as evaporation, infiltration, and subsurface flow, are inherently complex and challenging to measure [4]. Even for variables that are relatively easy to measure, such as streamflow, the global network of observation stations is much less developed than meteorological monitoring systems, leading to substantial data gaps in many regions [2].
Currently, the Global Runoff Data Centre (GRDC) contains streamflow data from 10,836 gauging stations worldwide. However, only a small subset of these stations meets the criteria for providing high-quality data essential for reliable hydrological modeling. For instance, Huang et al. identified only 1761 “high-quality” catchments globally from the GRDC dataset, using selected criteria such as catchment boundary quality, irrigation area, and a minimum of 10 years of continuous daily observations [1]. Among these criteria, the length of historical time-series records is a particularly critical factor influencing data reliability.
Figure 1 illustrates the distribution of GRDC gauging stations based on their record length, highlighting significant disparities across regions [5]. In many developing countries, stations often have only 1–10 years of available data, posing substantial challenges for long-term hydrological forecasting in these areas.
To improve hydrological modeling accuracy in data-scarce or ungauged catchments, a commonly adopted solution is to transfer hydrological information from gauged areas to ungauged areas [6]. This solution generally falls into two categories: (1) a regionalization approach for traditional hydrological models, and (2) a transfer learning approach for deep learning models.
For the first category, substantial progress has been made, especially since the predictions in ungauged basins (PUB) decade [7]. Regionalization methods have been employed predominantly with conceptual models (e.g., GR4J [8,9]) and, to a lesser extent, with physically based models (e.g., SWAT [10]). These methods can be generally grouped into three types: (i) similarity-based methods, which rely on spatial proximity or similar catchment attributes (such as climate type, land use, and geology); (ii) regression-based methods, which establish regression relationships between model parameters and catchment descriptors; and (iii) hydrological signature-based methods, which leverage key information embedded in streamflow data (e.g., mean flow, flow percentile, and baseflow index). However, the main limitation of this approach does not lie in its regionalization rationale, but in the traditional hydrological models themselves, which inevitably oversimplify the nonlinear and complex nature of streamflow [11].
In contrast, recent deep learning (DL) models have demonstrated a strong ability to capture nonlinear patterns and relationships directly from observed meteorological and historical hydrological data [12]. Commonly used architectures in hydrological studies include Artificial Neural Networks (ANN) [13], Convolutional Neural Networks (CNN) [14], and Recurrent Neural Networks (RNN) [15], along with their variants, such as the Gated Recurrent Unit (GRU) [16] and Long Short-Term Memory (LSTM) [17,18], and Transformers [19]. However, these models are generally data-hungry, requiring a large amount of labeled data for effective training. In data-scarce regions, traditional (i.e., non-transferable) DL models are often trained on limited labeled data, leading to degraded or even failed performance [20,21].
To address these challenges, Transfer Learning (TL), a technique in ML/DL, offers a promising option for data-scarce scenarios. TL enables the transfer of learned knowledge from a data-rich “source” domain (catchment) to a related data-scarce “target” domain (catchment) [22]. It can be applied to various real-world tasks, including regression (e.g., time-series regression in this study), classification, and clustering [23]. TL has been widely applied across diverse fields, such as image recognition [24], natural language processing [25], biology and medicine [26], economics [27], and military applications [28]. However, its application in hydrological studies remains relatively nascent, with research efforts emerging only in recent years.
Ma et al. used the Catchment Attributes and MEteorology for Large-sample Studies (CAMELS) streamflow dataset, comprising 671 U.S. catchments, to pre-train an LSTM model, and transferred it to other continents with varying data densities, including Great Britain, Chile, and China. This approach enhanced overall model performance, demonstrating the feasibility of cross-continental knowledge transfer for streamflow prediction [20]. Similarly, Khoshkalam et al. leveraged knowledge from the CAMELS dataset but tested transferability on snow-dominated regions in Southern Quebec using the LSTM model, achieving improved accuracy in daily streamflow predictions [29]. Muhammad and Abba used a source selection strategy based on Dynamic Time Warping (DTW) and semantic entropy calculation to select 10 source catchments from the 438-catchment Model Parameter Estimation Experiment (MOPEX) dataset. They then applied a TL with Gated Recurrent Unit (TL + GRU) model to improve the streamflow predictions for most catchments [30]. Xu et al. proposed a cross-regional interpretable machine learning (XGBoost) TL model to predict runoff in ungauged basins, leveraging flowmeter and catchment characteristic data from 5764 catchments across various climate zones in the Caravan dataset and achieving improved NSE values [31].
These studies demonstrate that current hydrological TL efforts predominantly rely on large-scale meteoro-hydrological datasets. This strategy offers the advantage of improving overall model performance across multiple target catchments due to the broad range of hydrological scenarios, resulting in strong generalizability. However, several key challenges remain:
(1) Local performance trade-off: There is a long-standing issue termed “negative transfer” in TL-related studies, which refers to the situation where leveraging source domain data undesirably reduces learning performance in the target domain [32]. It generally arises from four causes: large domain divergence, poor source data quality, poor target data quality, and inappropriate TL algorithms [33]. In the context of hydrological TL, domain divergence is particularly critical. When using all catchments from a large dataset, it remains unclear which catchments contribute positively or negatively to the final TL performance. While large datasets have been favored for their broad coverage of diverse hydrological events, their generalization can sometimes come at the cost of degraded local performance for specific target catchments [34]. In such cases, although fine-tuning helps the model adapt to local patterns, the inclusion of an excessive number of low-correlation catchments may reduce or even eliminate the positive impact of highly related catchments [34]. Therefore, this study aims to systematically investigate various source selection strategies to better quantify the impact of source-target similarity on transfer effectiveness.
(2) High computational cost: Based on our experiments and records from previous studies, pretraining the basic LSTM model on the CAMELS-GB dataset typically requires 6–10 h for one hyperparameter configuration, depending on the computational device [35]. When exploring more advanced TL architectures, such as domain adaptation networks, which align source and target domains in a high-dimensional feature space, such a large dataset imposes substantial computational demands and further increases model training complexity.
(3) Limited availability of large-scale datasets in some regions: Although more countries have recently contributed to the expansion of the CAMELS dataset, many regions in the world still lack dense hydrological monitoring networks, making it difficult to compile long-term records. Beyond applying cross-continental transfer [20], this study also explores the potential of identifying a small number of locally and highly correlated source catchments to achieve comparable or even better TL performance.
To achieve these aims, this study ranks source catchments based on their similarity to the selected target catchments. Three commonly used similarity comparison strategies are employed: (1) spatial similarity (SS), (2) physical attributes similarity (PS), and (3) flow regime similarity (FS). After ranking, source catchments are sub-grouped by similarity level and used to train TL networks. Two baseline networks are included for comparison: a Non-Transfer Learning (NTL) network and an All-Source Transfer Learning (ASTL) network, which uses the full set of source datasets. Finally, this study aims to identify which similarity comparison strategy is most effective in guiding source selection for enhancing TL performance.
The remainder of this article is organized as follows:
Section 2—Data and Methodology,
Section 3—Results and Discussions of three similarity comparison strategies,
Section 4—Limitations and Future Works, and
Section 5—Conclusions.
3. Results and Discussion
3.1. Effects of Spatial Similarity on Transferability
For the three target catchments, the Non-Transfer Learning (NTL) network, trained solely on one year of target catchment data, yielded NSEs of 0.327, 0.425, and 0.277, respectively, indicating poor ability to capture hydrological patterns. These values serve as the lower thresholds: any similarity-guided TL experiment performing below them is considered to exhibit negative transfer. In contrast, the All-Source Transfer Learning (ASTL) network, trained using data from all 668 source catchments (excluding the selected target catchment data), reached substantially higher NSEs of 0.792, 0.783, and 0.809, respectively. These values represent the upper thresholds of performance. Any similarity-guided TL experiment approaching or exceeding these thresholds, while relying on a significantly smaller subset of source catchments, demonstrates superior transferability.
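The threshold logic above can be sketched in a few lines. This is a minimal illustration and not the study's code: the function names `nse` and `classify_transfer` are hypothetical, and the NSE is written in its standard Nash-Sutcliffe form, which is assumed to match the definition used in the Methodology section.

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - (squared error) / (spread of observations)."""
    if len(observed) == 0 or len(observed) != len(simulated):
        raise ValueError("observed and simulated must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - sse / spread


def classify_transfer(nse_value: float, ntl_nse: float, astl_nse: float) -> str:
    """Label an experiment relative to the NTL (lower) and ASTL (upper) thresholds."""
    if nse_value < ntl_nse:
        return "negative transfer"
    if nse_value >= astl_nse:
        return "matches or exceeds ASTL"
    return "between NTL and ASTL"


# Threshold values taken from the text (first and second target catchments).
print(classify_transfer(0.856, ntl_nse=0.327, astl_nse=0.792))
print(classify_transfer(0.770, ntl_nse=0.425, astl_nse=0.783))
```

An experiment labelled "between NTL and ASTL" may still count as approaching the upper threshold; how close is close enough is a judgment that this sketch does not encode.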
Then, seven spatial similarity-based TL experiments (SS1–SS7) were conducted for each target catchment. The spatial distributions of the target catchments and their selected source catchments are presented in Figure 3a–c. Figure 3d–f show the corresponding NSE results. The results for KGE, %BiasFHV, %BiasFMS, and %BiasFLV of all the experiments are provided in Appendix B, Appendix C and Appendix D. For clearer visualization of streamflow dynamics across different target catchments, we have also included the streamflow plots of SS1 and ASTL experiments in Appendix E.
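One way such spatially ranked groups could be constructed is sketched below. Several details are assumptions made for illustration only: that spatial similarity is measured by great-circle (haversine) distance between catchment locations, that SS1–SS4 are consecutive ranked bins of 10 catchments, and that SS5–SS7 are cumulative sets (top 20, 30, and 40). The helper names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_distance(target: Tuple[float, float],
                     sources: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return source catchment IDs ordered from nearest to farthest."""
    return sorted(sources, key=lambda cid: haversine_km(*target, *sources[cid]))


def make_groups(ranked: List[str], size: int = 10, n_bins: int = 4) -> Dict[str, List[str]]:
    """Build ranked bins (SS1..SS4) and cumulative sets (SS5..SS7); layout is assumed."""
    groups: Dict[str, List[str]] = {}
    for i in range(n_bins):
        groups[f"SS{i + 1}"] = ranked[i * size:(i + 1) * size]
    for j in range(2, n_bins + 1):
        groups[f"SS{n_bins + j - 1}"] = ranked[:j * size]
    return groups
```

Under these assumptions, each cumulative set contains the top 10 catchments of SS1, while each ranked bin after SS1 excludes them.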
According to the results of SS1–SS4, SS1 achieved the highest performance across all the target catchments. The three target catchments yielded NSEs of 0.856, 0.770, and 0.852, respectively, approaching or even surpassing their ASTL thresholds. However, as spatial similarity decreased from SS1 to SS4, transferability degraded. The decline was most pronounced for target catchment “39010”, where SS4 performed worse than the NTL threshold. To further investigate this phenomenon, we compared catchment characteristics between the ranked groups (SS1–SS4) and the target “39010” (Appendix F). The results show that SS4 exhibited significantly greater heterogeneity in mean gauged daily flow, maximum gauged daily flow, and catchment area. Such variability can introduce learning difficulties and conflicting parameter updates in TL, which was reflected in a substantially larger high-flow prediction bias (68.04%) compared with SS1–SS3.
These results highlight the potential of leveraging a few highly spatially similar catchments to achieve superior performance compared to ASTL, while also emphasizing the risk of degraded performance when only a small number of more distant catchments are used. In regions lacking large datasets, SS-guided TL could offer a practical alternative. However, further research is needed to identify the critical similarity threshold beyond which transferability diminishes, a threshold that is likely influenced by regional heterogeneity in hydrological conditions.
In addition, the results of SS5–SS7, which progressively included more catchments, demonstrated that their performance may not always surpass SS1, but it may appear more stable than SS2–SS4. The result implies that the inclusion of top 10-ranked catchments can offset the negative impact of adding lower similarity ones. However, the ASTL experiments performed worse than some SS5–SS7 experiments, suggesting that including an excessively large and heterogeneous set of catchments can reduce the benefits of TL.
3.2. Effects of Physical Attributes Similarity on Transferability
To investigate the role of physical attributes in similarity-guided TL, 21 static catchment attributes were divided into four clusters based on k-means clustering. The optimal k = 4 was determined by statistical optimization (elbow method and silhouette analysis), while considering the hydrological interpretability. Visualizations of the elbow method, silhouette analysis, and the clustering results are shown in Appendix G. The list of attributes in each cluster is presented in Table 5.
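To illustrate how silhouette analysis scores a candidate clustering, the sketch below computes the mean silhouette coefficient for a given assignment of points to clusters. It is a generic, standard-library illustration and not the study's implementation. It assumes that each attribute is represented as a vector of its standardized values across catchments and that Euclidean distance is used. Singleton clusters (such as a cluster holding only catchment area) are given a score of 0, which is a common convention.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient; higher values indicate better-separated clusters."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette analysis needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        # Mean distance from point i to the members of each cluster (itself excluded).
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(euclidean(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: score set to 0 by convention
            continue
        a = mean_dist[own]  # cohesion: mean distance within own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # separation: nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

In practice, this score would be evaluated for the labels produced by k-means at each candidate k, and the k with a high score and a sensible physical interpretation would be retained.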
Cluster 1 mainly comprises landcover and climate-related attributes, including indicators of vegetation type (e.g., percentage of deciduous woodland, crops, and urban area), precipitation seasonality and variability (e.g., frequency and duration of high or low precipitation events), and climatic variables (e.g., mean daily potential evapotranspiration). Cluster 2 contains attributes related to topography, soil, and snow dynamics, such as mean elevation, drainage slope, sand content, hydraulic conductivity, and fraction of snowfall. Cluster 3 consists solely of catchment area, capturing spatial scale of hydrological processes. Cluster 4 contains soil texture and porosity indicators, including percentages of silt and clay and volumetric porosity, reflecting water retention and infiltration capacity.
Maps in Figure 4 show the spatial distributions of the source catchments selected from each cluster for each target catchment. Clusters 1, 2, 4, and the full attribute set (“All”) generally show spatial concentration surrounding the selected target catchment, reflecting regional similarities in land use, climate, and geology. Cluster 3, which is solely based on catchment area, exhibits no clear spatial patterns.
Before examining the characteristics of each cluster in detail, it is noteworthy that for the target catchment “12007”, the distributions of source catchments in PS-C1, PS-C2, and PS-All are highly similar to those in the SS experiments. Nearly all the selected source catchments are located near the Cairngorms National Park in Scotland, indicating that the landcover, climate, topography, and snow dynamics of this region are highly unique compared to the rest of the UK. Consistently, the NSE results of PS-C1, PS-C2, and PS-All show similarly stable and superior performance to the SS experiments, and they are highly competitive with the ASTL result (NSE = 0.783). Moreover, they outperform the results of the other two target catchments under the same similarity metrics. These findings suggest that for catchments that are strongly heterogeneous relative to other regions, the positive contribution from the full source dataset primarily comes from a small number of highly similar catchments. In such cases, comparable transferability can be more readily achieved by leveraging different physical similarity measures.
Compared with the target catchment “12007”, the other two target catchments are more sensitive to the choice of similarity metric. In the PS-C1 experiments for target catchments “39010” and “33023” (Figure 4a,k), the top 10 catchments (PS-C1-1) exhibit strong transferability, achieving NSEs of 0.793 and 0.738, respectively. Comparing the results of PS-C1-1 with PS-C1-2 to PS-C1-4, performance generally decreases. When comparing results from cumulative subsets (PS-C1-5 to PS-C1-7), performance rebounds as more data are included. These results indicate that high similarity in land cover and climate can enable effective TL, but performance is sensitive to declining similarity.
For the PS-C2 experiments (Figure 4b,l), no clear advantage is observed in the top 10 subset, nor is there a consistent trend of decreasing performance with decreasing similarity. This may be because Cluster 2 includes a wide range of attributes (elevation, slope, soil, snow, and precipitation). It is therefore difficult to identify the most relevant sources using the overall similarity derived from the entire cluster, particularly when the hydrological characteristics of the target catchment are not as strongly heterogeneous as those of catchment “12007”. Consequently, similarity rankings based on C2 may not effectively guide TL performance.
Regarding the PS-C3 results in Figure 4c,h,m, all seven ranked groupings yield lower NSEs than their ASTL upper thresholds. Some groupings for target catchment “39010” even fall below the NTL lower threshold, indicating negative transfer. This phenomenon is likely attributable to one of the most common causes of negative transfer, namely large domain divergence [33]. These findings confirm that catchment area alone is an inadequate criterion for similarity assessment in TL.
For the PS-C4 results in Figure 4d,i,n, the top 10 catchments again show strong transfer performance (NSE = 0.769, 0.723, and 0.859), comparable to their ASTL thresholds. Although no strict pattern of decreasing performance with decreasing similarity is observed across ranked groups in target catchments “39010” and “33023”, most groups achieve strong performance close to the ASTL thresholds. The results suggest that soil structure and water retention properties are informative for guiding effective transfer. However, a clear threshold of attribute similarity, beyond which performance begins to decline, cannot yet be identified and requires further investigation.
In Figure 4e,j,o, where all 21 physical attributes are used, each subset achieves performance close to or better than the ASTL. Therefore, when it is unclear which attributes to prioritize for similarity evaluation, using the full attribute set is a reliable alternative.
In summary, TL performance is sensitive to the selected attribute cluster. For regions lacking large datasets or aiming to reduce training costs with complex models, the following recommendations can be made: if an attribute cluster contains diverse and physically unrelated indicators (e.g., Cluster 2) or only includes a single feature with weak linkage to rainfall–runoff processes (e.g., Cluster 3—area), it is not recommended for guiding source selection. If the cluster relates to land cover and climate (e.g., Cluster 1), it can be used as a selection guide when high similarity is ensured. A more robust and safer choice is to use the full set of physical attributes, especially including variables such as soil porosity and infiltration capacity, as represented in Cluster 4.
3.3. Effects of Flow Regime Similarity on Transferability
Flow regime similarity was assessed by calculating Dynamic Time Warping (DTW) distances between standardized weekly-step hydrographs. The distributions of the top-ranked source catchments for each target catchment are presented in Figure 5a–c. According to the NSE results in Figure 5d–f, most ranked groups exhibit relatively stable performance comparable to the ASTL thresholds. Among them, the ranked groups for target catchment “39010” show slightly better performance than those for target catchments “12007” and “33023”. According to their hydrograph comparisons and corresponding DTW values (Appendix H, Appendix I and Appendix J), DTW ranges from 0.153 to 0.493 for the top 40 source catchments of target “39010”, from 11.663 to 35.366 for target “12007”, and from 1.721 to 6.016 for target “33023”. These values indicate that nearly all the top 40 source catchments for target “39010” are highly similar to the target itself, while target “12007” has fewer source catchments with a highly similar flow regime across the dataset.
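The DTW comparison can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: daily flows are first averaged into non-overlapping 7-day steps and then z-score standardized (the source does not specify the order or the exact standardization), the local cost is the absolute difference, and no warping-window constraint is applied. The function names are hypothetical, and the sketch is not expected to reproduce the DTW values reported above.

```python
import math
from typing import List, Sequence


def weekly_means(daily: Sequence[float], step: int = 7) -> List[float]:
    """Average a daily series over consecutive non-overlapping windows of `step` days."""
    return [sum(daily[i:i + step]) / step for i in range(0, len(daily) - step + 1, step)]


def standardize(series: Sequence[float]) -> List[float]:
    """Z-score standardization (zero mean, unit population standard deviation)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        raise ValueError("cannot standardize a constant series")
    return [(x - mean) / std for x in series]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-programming DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def flow_regime_distance(target_daily: Sequence[float], source_daily: Sequence[float]) -> float:
    """DTW distance between standardized weekly-step hydrographs of two catchments."""
    t = standardize(weekly_means(target_daily))
    s = standardize(weekly_means(source_daily))
    return dtw_distance(t, s)
```

Source catchments would then be ranked in ascending order of `flow_regime_distance` to the target, with smaller values indicating a more similar flow regime.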
Although target “12007” benefits considerably from surrounding catchments with similar physical attributes (Section 3.2), physical similarity does not always guarantee hydrological similarity. A previous study identified the major reasons behind this inconsistency: (1) these catchments often have quite specific hydrological behavior, or (2) the complex underground behavior is not accurately described by the available attributes [6]. In such cases, for catchments with unique hydrological characteristics (i.e., similar flow regimes are rarely seen across the dataset), physical attributes similarity can be prioritized as a more straightforward and effective alternative.
Apart from the special cases, flow regime similarity is still a promising selection criterion, given that it represents the final manifestation of catchment behavior and serves as a direct indicator of hydrological similarity.
It is also important to note that in this study, only one year of training data was available for the target catchment. Thus, DTW-based similarity was calculated using just one year of flow data for both the target and source catchments. In scenarios where longer training periods are available for the target catchment, comparing multi-year flow regime patterns could yield more robust and informative similarity assessments.