Error Characteristics and Scale Dependence of Current Satellite Precipitation Estimates Products in Hydrological Modeling

: Satellite precipitation estimates (SPEs) are promising alternatives to gauge observations for hydrological applications (e.g., streamﬂow simulation), especially in remote areas with sparse observation networks. However, the existing SPEs products are still biased due to imperfections in retrieval algorithms, data sources and post-processing, which makes the effective use of SPEs a challenge, especially at different spatial and temporal scales. In this study, we used a distributed hydrological model to evaluate the simulated discharge from eight quasi-global SPEs at different spatial scales and explored their potential scale effects of SPEs on a cascade of basins ranging from approximately 100 to 130,000 km 2 . The results indicate that, regardless of the difference in the accuracy of various SPEs, there is indeed a scale effect in their application in discharge simulation. Speciﬁcally, when the catchment area is larger than 20,000 km 2 , the overall performance of discharge simulation emerges an ascending trend with the increase of catchment area due to the river routing and spatial averaging. Whereas below 20,000 km 2 , the discharge simulation capability of the SPEs is more randomized and relies heavily on local precipitation accuracy. Our study also highlights the need to evaluate SPEs or other precipitation products (e.g., merge product or reanalysis data) not only at the limited station scale, but also at a ﬁner scale depending on the practical application requirements. Here we have veriﬁed that the existing SPEs are scale-dependent in hydrological simulation, and they are not enough to be directly used in very ﬁne scale distributed hydrological simulations (e.g., ﬂash ﬂood). More advanced retrieval algorithms, data sources and bias correction methods are needed to further improve the overall quality of SPEs.


Introduction
Precipitation is an important climate variable as well as a major driver of the water cycle [1][2][3]. The traditional method of measuring precipitation is to observe through gauge stations on the ground. However, due to its high cost, the rain gauge network is very sparsely distributed in some regions (e.g., remote areas or high-altitude locations), resulting in its spatial under-representation [4,5]. Radar precipitation is susceptible to many effects (e.g., topography), and usually cannot provide full coverage of precipitation observation [6]. Satellite precipitation estimation is currently the most promising tool, which can provide quasi-global precipitation estimation with high spatial and temporal resolution [7]. Existing state-or-the-art satellite precipitation estimate products (SPEs) include the Precipitation Estimate from Remotely Sensed Information using Artificial Neural Networks (PERSIANN), the Tropical Rainfall Measuring Mission (TRMM), the Global Precipitation Measurement Mission Integrated Multi-satellitE (GPM IMERG) and (or sampling error) of the satellite or radar products used. However, the conclusions they obtained are not entirely consistent due to the small number of relevant cases, so more case studies are needed. Moreover, they did not use real precipitation products or fully reveal the scale effects. As indicated, this still is an unsolved issue [28].
To date, we have a large number of available precipitation products. It is time to revisit the latest precipitation products and their possible scale effects in hydrological application. In this study, we select the Yalong River basin in China as a case. The eight latest quasi-global SPEs are used to drive a distributed hydrological model to validate their applicability and scale effects. This study not only deepens our understanding of satellite precipitation estimates products and their hydrological applications, but also contributes to the study of other rainfall-related topics, such as an estimation of the Areal Reduction Factor (ARF) [33,34] and spatial classification of Rainfall events [35][36][37]. Section 2 introduces the study area, dataset, hydrological model and performance metrics we selected. The main results are described and analyzed in Section 3. Related studies and issues are discussed in Section 4. Section 5 summarizes the conclusions.

Study Area
Yalong River is the largest tributary of the Jinsha River in the upper reaches of the Yangtze River of China ( Figure 1a). The Yalong River (YLJ) basin is located in the eastern part of the Qinghai-Tibet Plateau and stretches from 96 • 52 E to 102 • 48 E longitude and from 26 • 32 N to 33 • 58 N latitude. The basin spans 1570 km, with a total area of about 130,000 km 2 and significant variation in elevation from north to south (from 7148 m to 115 m). The precipitation in the YLJ basin increases from north to south, with an average annual precipitation of 550 mm in the source area and up to about 1500 mm in the middle and downstream areas (Figure 1d). The runoff distribution is consistent with the precipitation. For better calibration and verification of the hydrological model, we collected flow records from 2006-2015 at four hydrological stations, namely, Tongzilin (TZL), Yajiang (YJ), Daofu (DF) and Ganzi (GZ) (Figure 1b). The average annual discharge at the outlet (TZL) of the basin is about 1860 m 3 /s. The downstream areas of the YLJ basin are heavily exploited for hydropower production through a complex system of reservoirs (e.g., Pingxiang Hydropower Station and Ertan Hydropower Station). These reservoirs are useful for flood control during the rainy season, guarantee human water use during the dry season and provide abundant power resources through specific scheduling rules.

Gauge Precipitation Observations
The China Meteorological Administration (CMA)-observed gridded precipitation (0.5 • × 0.5 • and daily) was used in this study as a benchmark and reference. This product was developed by interpolating the high-density gauge observations from more than 2400 stations over China [38]. The good quality and spatial representativeness has been confirmed [38] and widely applied in climate change [39], drought monitoring [40,41] and hydrological simulations [42].

Satellite Precipitation Estimates Products (SPEs)
In this study, eight satellite precipitation estimates products (SPEs) were selected as alternatives to CMA ground observations ( Figure 2) and they have also been widely used as alternatives in previous studies. These products differ in algorithms, data sources and bias correction. They can be broadly divided into three categories: (1) Near real-time products with sole IR input (PERSIANN, PERSIANN-CCS and PDIR-Now) and without bias correction, (2) bias-corrected products using rain gauge observations (CMORPH, IMERG Final, GSMaP) and (3) climate data records (PERSIANN-CDR, PERSIANN-CCS-CDR). Detailed information of these products can be found in Table 1 as well as in some earlier studies [9,43].    [44] is the first generation of the PERSIANN family of products. PERSIANN uses a modified propagation network and clusters the satellite infrared (IR) imagery into various categories using a self-organizing feature map (SOFM). Meanwhile, the passive microwave (PMW) imagery is used to adapt the parameters of the network.

PCCS
The PERSIANN-Cloud Classification System (hereafter, PCCS) [45] uses a cloud-patchbased algorithm, which extracts features (e.g., cloud coverage) from the satellite images under specified temperature thresholds and classifies them into different categories using SOFM. Then, the relationships between brightness temperature and rainfall rate are fitted to estimate the precipitation using histogram matching and the nonlinear exponential function.

PCDR
The PERSIANN-Climate Data Record (hereafter, PCDR) [46] was intended to develop a long-time historical precipitation record (1983-present). Therefore, it uses more available satellite information from the International Satellite Cloud Climatological Project (ISCCP). Meanwhile, it uses the National Centers for Environmental Prediction (NCEP) Stage IV hourly precipitation data to tune the parameters of the modified PERSIANN model. Finally, the Global Precipitation Climatology Project (GPCP) monthly 2.5 precipitation data are also used to correct the bias of the raw PCDR data.

PDIR
PERSIANN-Dynamic Infrared Rain Rate near real-time (PDIR-Now, hereafter, PDIR) [48,49] is the latest generation of the PERSIANN family of products. It incorporates more highfrequency sampled IR imagery and provides near real-time precipitation estimates with a very short time latency (15-60 min). Moreover, the latest PDIR algorithm establishes a dynamic shifting curve between cloud-top brightness temperature and rainfall rate using rainfall climatology to reduce the uncertainties derived from satellite IR images. PIDR dataset is produced at a high spatiotemporal resolution (1 h, 0.04 • × 0.04 • ) with a quasi-global coverage (60 • N-60 • S).

CMORPH
Bias-corrected Climate Prediction Center Morphing Technique Climate Data Record (BC-CPC-CMORPH V1.0-CDR, hereafter referred to CMORPH) [50] is a reproduced version of CMORPH post-processed with bias correction by incorporating CPC daily gauge analysis over land and the Global Precipitation Climatology Project (GPCP) merged precipitation over the ocean. The original CMORPH technique is a novel method that combines the PMW data from low earth orbiting satellites (LEOS) and IR data from geostationary earth orbiting (GEO) satellites. CMORPH provides high-resolution global satellite precipitation estimates by performing a time-weighted interpolation between the PMW scans. CMORPH are now available at three temporal resolutions (30 min, 1 h and 1 day) and include a quasiglobal (60 • N-60 • S) high-spatial-resolution (0.25 • × 0.25 • ) record from 1998 to present for researchers.

IMERG
The Global Precipitation Measurement Mission Integrated Multi-satellitE (GPM IMERG) [8] provides multi-satellite precipitation estimates retrieved from PMW satellite images. Its main products include the IMERG Early Run (near real-time low-latency), the IMERG Late Run (quasi-Lagrangian time interpolation) and the IMERG Final Run (bias correction). In this study, we selected the IMERG Final Run V6B product (hereafter, IMERG). Its algorithm fully integrates the advantages of morphing technique, PCCS, and Kalman filter. Then, after bias correction by GPCC gauge analysis, research-quality long-term (2000-present) quasi-global (60 • N-60 • S) high-spatiotemporal-resolution (0.1 • × 0.1 • , 30 min) precipitation estimates are generated.

GSMaP
Gauge-calibrated Global Satellite Mapping of the Near real-time Precipitation product (GSMaP_Gauge_NRT_v6, hereafter, GSMaP) [11,12] covers quasi-global (60 • N-60 • S) coverage and provides high-spatiotemporal-resolution (0.1 • × 0.1 • , 1 h) precipitation estimates. The algorithms of this product are mainly done by the microwave-IR merged algorithm (GSMaP-MVK) and gauge calibration, which makes full use of the multiple-input information, including microwave imager and sounder, GEO IR imager, space-borne precipitation radar, atmospheric condition, sea surface temperature (SST), CPC unified global gauge analysis and topographic data.

Other Data
Other meteorological forcing data (wind speed, maximum and minimum temperature) were collected from 10 national meteorological stations located in the YLJ basin. For hydrological modeling, the precipitation and temperature data were further interpolated into each sub-basin by using the inverse distance weighting (IDW) method and elevation correction [51]. The wind speed data were also interpolated into each sub-basin using the synergistic mapping (SYMAP) algorithm [52]. Daily (2007-2014) streamflow records for four hydrological stations were collected for model calibration and verification. The 1 km soil types and land use data were clipped from the China soil map-based harmonized world soil database (HWSD) (v1.2) and the Chinese Academy of Sciences Resource and Environmental Science Land use dataset, respectively. The 90 m digital elevation model (DEM) data were downloaded from the National Aeronautics and Space Administration Shuttle Radar Topographic Mission (NASA SRTM).

Hydrological Model
In this study, we use a distributed hydrological model called the distributed timevariant gain hydrological model (DTVGM) [53,54], which was developed from rainfallrunoff nonlinear theory and can now be fully driven by remote sensing information [55].
First, given a threshold of the sub-basin (100 km 2 in this study), we use the drainage network extraction method introduced by Du et al. [56] to spit the whole YLJ basin into 522 sub-basins. The minimum threshold for sub-basin is based on the complexity of hydrological modeling, the running time step, the area of the Yalong River basin, the time scale of the hydrological simulation and the resolution of the forcing data. In each sub-basin (as known as the hydrological response unit, HRU), the runoff is calculated according to the water balance (Equation (1)).
where t is the time step; P is the precipitation (mm); Ep is the potential evapotranspiration (mm); AW represents the soil moisture (mm); WM is the field soil moisture (mm); Subscript u and g represent the upper and lower layers of soil, respectively; Ke, Kr, and Kg are the coefficients of evapotranspiration, interflow runoff and groundwater runoff, respectively; g 1 and g 2 are factors describing the nonlinear process of runoff; and C is the land cover parameter. The kinematic wave equation is used for streamflow routing [57,58]. n in the river routing procedure stands for the roughness factor [58]. The degree-day method is used to compute the snowmelt in the upper reaches of the YLJ basin [59].

Performance Metrics
To evaluate the satellite precipitation estimates products, we selected five metrics as follows: the Pearson Correlation Coefficient (R), the Nash-Sutcliffe efficiency (NSE) [60], the Kling-Gupta efficiency (KGE) [61], the Relative bias (RB) [62] and the Normalized centered root mean square error (NCRMSE) [22]. Among them, R describes the correlation; NSE and KGE are two integrated goodness-of-fit metrics; RB and NCRMSE describe the systematic bias and random bias, respectively. Their equations, intervals and optimal values are summarized in Table 2. Table 2. Performance metrics used in this study.

Metrics
Formula Interval Optimum Note: s is a coefficient matrix, here s = [1, 1, 1]. S is the satellite precipitation estimates products; O is the gauge observations.

Results
We evaluated the performance of each of the eight SPEs at the hydrologic station and sub-basin scales using the calibrated DTVGM hydrologic model from 2001 to 2019. In this paper, we took the discharge simulations driven by the observed precipitation as the reference (or "truth") and compared the satellite-driven simulations with them. The differences between the satellite-driven and observation-driven simulations were considered as the error of SPEs. We chose the observation-driven simulations instead of the observed discharge at hydrological stations as the reference for the following two considerations: (1) We only used one distributed hydrological model in this study. Although we adequately calibrated the model, there may still be uncertainties in the model structure and model parameters. Using the observation-driven simulations as the reference allowed us to focus on analyzing the uncertainties of different SPEs without considering the influence of the model structure and the model parameter uncertainties. (2) The number of stations with observed discharge is often limited, especially in remote areas. We collected data from four hydrological stations, which was valid for model calibration and verification to some extent but was far from enough for assessing the applicability of SPEs at the sub-basin or grid scale.

Model Calibration and Verification
In this study, we used CMA precipitation gauge observations to force the DTVGM and four hydrological discharge records to calibrate and verify DTVGM at a daily scale. The discharge simulation was divided into a warm-up period (2006), a calibration period (2007-2010) and a validation period (2011-2014). We manually calibrated the model parameters during the calibration period (2007-2010). The Nash-Sutcliffe efficiency (NSE) and relative bias (RB) were selected as the main objective functions, constraining the NSE > 0.7 and RB between ±10%. Firstly, we adjusted Ke (0 < Ke < 1) to reduce the overall RB of simulation, then tuned Kr (0 < Kr < 1), Kg (0 < Kg < 1), g1 (0 < g1 < 1) and g2 (g2 > 0) to achieve a good enough NSE, and finally fitted the flood peak time by changing different n (0.001 < n < 0.15). Figure 3 shows the model validation results for the four hydrological stations. Except for the TZL station, all other stations achieved good simulation accuracy (with R > 0.9; NSE and KGE > 0.85; RB < ±10%). In order to ensure the effectiveness of the hydropower plants located in the lower YLJ basin, these reservoirs store more water through regular operations compared to natural conditions. Periodic reservoir operations led to the observation of more undulating hydrologic curves than the smoother hydrologic curves we simulated ( Figure 3a). However, we did not consider the effect of downstream, regional, intense reservoir scheduling in this study, which was also not the main objective of our research. Meanwhile, we used the observed precipitation-driven simulated flows as a reference in the following section to ignore the uncertainties in the model structure and model parameters.

Validation of Discharge Simulations at Four Gauged Hydrological Stations and Sub-Basin Scale
By driving the DTVGM model with different SPEs, we obtained corresponding streamflow simulations for four hydrological stations. Figure 4 shows an example of the hydrograph for the TZL station, and those for the other three hydrological stations can be found in the Supplementary Materials (Figures S1-S3). By comparing the streamflow obtained from different datasets, there are quite a lot of mismatches as well as a severe overestimation of flood peaks in PCCS and PERSIANN. The mismatches and overestimation of flood peaks also can be quantified by the low temporal correlation coefficient and high RB values. These two main factors are also responsible for the overall poor performance of PCCS and PERSIANN. Except for PCCS and PERSIANN, the simulations obtained from the other six precipitation products are in high agreement with the reference (CMA). All of them can capture the intra-annual and inter-annual variability well. This is attributed to the fact that they are all bias corrected (except PDIR), whereas the main sources of errors are mainly overestimation and underestimation of flood peaks. For example, PCCSCDR severely overestimates during the high flow period in flood season. PCDR, PDIR and GSMaP also show different degrees of overestimation, while IMERG relatively underestimates the flood peaks. Among the eight products, CMORPH is largely consistent with the reference at the TZL station. Table 3 summarizes the results for the four hydrological stations, which are distributed at different locations (internal or at the outlet) in the YLJ basin and represent the variation in model performance with the catchment area ( Figure 1b). So, these four hydrological stations representing the results of streamflow simulations in four catchments (or sub-basins) from large to small can be used to preliminarily evaluate the performance of different SPEs in different spatial scales and their potential scale effects. To distinguish these products, we divide the results into three columns. The numbers in bold on the left and middle column represent the best values for the near real-time products without bias correction and with bias correction, respectively. Furthermore, the numbers in bold on the right column represent the best values for the climate data records.  When comparing the accuracy of different products horizontally, the performance ranking of different products is basically the same, despite the differences in the size of the catchment area. For example, in the near real-time family, PDIR expresses the best performance (R, NSE, KGE and RB) for all stations evaluated. Similarly, for the biascorrected products, IMERG has a better simulation performance than others for three of the stations evaluated (GZ, YJ and DF), except CMORPH outperforms IMERG at the TZL station, with the best values of NSE, KGE, RB and NCRMSE.
As mentioned above, we have drawn some common conclusions about the strengths and weaknesses of the products based on the horizontal comparison of the different products. However, what is more noteworthy is that the performance of these products varies greatly from the stations. For example, although PDIR obtained relatively good results at the TZL station, it performed relatively worse at the YJ, GZ and DF stations, especially the NSE value, which was 0.6 at the TZL station but only 0.32 at the GZ station. Other indicators also showed similar degradation. This phenomenon indicates that the same products may vary greatly when evaluated at different stations, and these stations are outlets of different sub-basins of the whole basin.
Further, by observing the results of these sub-basins with different catchment areas, another interesting conclusion that can be summarized is that this degradation seems to show a regular pattern according to the size of the catchment. For example, the best performance was observed at the TZL station, which has the largest catchment area, while the performance of SPEs deteriorates as the catchment size decreases. However, the generalization of this pattern is hindered by the assessment results at the DF station, which has the smallest catchment area. The performance for different products is better at the DF station than at the GZ station, and even better than at the YJ station.
In general, the comparison of the four hydrological stations allows for a ranking of the products, but this ranking varies somewhat with the stations. The latest generation of the PERSIANN family of products, PDIR, is a very significant improvement over PERSIANN and PCCS, while PCCSCDR, a blended product of PCCS and PCDR, is not a significant improvement. This can be attributed to PCCS, which uses a fixed relationship between the cloud-top brightness temperature and the rain rate, and results in an underestimation in wet regions and an overestimation in dry regions [48,49]. More importantly, the results for the different sites remind us that it is worth noting the spatial characteristics of the model performance more. Does the performance of different SPEs have a scale effect? How does the model performance correlate with the catchment area? Figure 5 shows the spatial distribution of the five different goodness-of-fit metrics at each sub-basin of the YLJ basin. To better explore the spatial characteristics of the SPEsdriven simulations, we compare and evaluate the performance of each of the eight products at the sub-basin scale. The same color bars are used to describe the relative goodness of the metrics, except for the presence of positive and negative values of the systematic bias (RB), with darker colors representing better results for that indicator. The spatial patterns of the metrics allow us to not only clearly derive the spatial differences between the different SPEs, but also to analyze the differences of each product in different internal locations of the basin. It should be noted that the results for the specific sub-basins are the model performance when the sub-basin is used as an outlet for the catchment. First, by comparing the different products, we can easily distinguish the advantages and disadvantages of the products, and this is consistent with the results obtained in the previous section at the four hydrological stations. In terms of NSE and KGE, among all products, IMERG and CHORPH belong to the first tier, showing the best performance, while PDIR, GSMaP and PCDR are in the second tier, and their spatial distribution of NSE is relatively poor in the local area of the basin. For example, the NSE of PDIR is even less than 0 near the source part and outlet of the YLJ basin. Further, the same is true for PCDR, with more low values in the middle and upper part of the YLJ basin, while for GSMaP, these cases occur mainly in the middle and lower reaches. PCCSCDR, PCCS and PERSIANN are in the third tier for the YLJ basin, where they express the worst accuracy and temporal correlation. This can be attributed to their severe overestimation in the upper and middle reaches of the YLJ basin (see the spatial distribution of RB in Figure 5), and this overestimation could be dominated by the false alarm precipitation overestimation, and it also leads to the mismatches of flood peaks. Systematic overestimation (or underestimation) of flood peaks also occurs in several other products, such as the overestimation (or underestimation) of PDIR, PCDR and GSMaP (or IMERG and CMORPH). However, these overestimates (or underestimates) did not destroy the temporal correlation.
Then, we focus on the scale effect of discharge simulation. It was not difficult to find that there is a significant spatial variability for the eight SPEs in the internal locations of the whole YLJ basin. Furthermore, the significance of spatial variability varies to a degree with the products. PCCSCDR, PCCS and PERSIANN present the most significant spatial differences, indicating that these datasets are insufficient or even far from adequate for application in the upper areas of the YLJ basin. In contrast, the other products express relatively less spatial variability and perform consistently throughout the whole basin. However, while these products perform poorly in the upstream areas of the catchment, they perform better along with the river network, suggesting that the ability to simulate streamflows driven by SPEs in these regions is very close to that of observation-driven simulations. The common feature of these regions is that they are all near the riverways, with larger catchment areas. This suggests that there is indeed a scale effect in the application of hydrological simulations forcing by SPEs. That is, the error generated in the upstream area of the catchment is attenuated with the streamflow routing. However, for the source part of the catchment, i.e., the sub-basins with smaller flow accumulation area (FAA), there are large differences in model performance. For example, PDIR performs worse at the source and near the outlet of the YLJ basin compared to IMERG, and CMORPH performs worse near the outlet of the basin.

Spatial Scale Dependence of Simulation Performance
The basic law is that the simulation performance increase with the increase of the catchment area, and there may be a threshold condition triggering the occurrence of this law. To better quantify the relationship between the simulation capacity and the catchment area, as well as to find the threshold condition for the occurrence of the scale effect, we plot the simulation accuracy of all sub-basins and the corresponding flow accumulation area (FAA, or catchment area) of these sub-basins in a two-dimensional coordinate system. Figure 6a-e shows the scores of different metrics as a function of the FAA. Figure 6f shows the distribution of the FAA in different locations of the YLJ basin. Despite the different evaluation metrics used, Figure 6a-e clearly reveals the correlation between the model performance and FAA. It is also worth noting that this relationship does have a threshold, and the threshold is about 20,000 km 2 , which is determined by all five metrics. When the FAA (or catchment area) is less than 20,000 km 2 , the model performance behaves more randomly, with different model performances in various locations of the YLJ basin. When the FAA (or catchment area) is larger than 20,000 km 2 , the model performance increases with the increase of FAA. This reaffirms our hypothesis that the scale effect affects the model performance. For the YLJ basin, 20,000 km 2 is a key condition to activate the scale effect.
According to the threshold condition, we further divided the FAA into four intervals (0-20, 20-60, 60-100 and 100-130; ×10 3 km 2 ) according to their size. The number of subbasins in the four intervals are 476, 19, 13 and 14, respectively. Meanwhile, we selected the median values of metrics in each interval to represent the average performance of that interval. For better visualization, we use radar plots to depict the capacity of different products. In the radar plot, we uniformly standardize the five metrics to the range of 0 and 1. The method of standardization is shown in Appendix A. In this way, for a specific product, 1 means that the metrics corresponding to that product perform best in the four intervals. Conversely, 0 means that the metrics corresponding to that product performs worst in the four intervals. It is important to note that we have standardized the results between 0 and 1, which can only be applied to compare the relative goodness of products in four intervals. To be able to quantitatively express the model performance of different SPEs and FAA, we summarize the results in Table 4.  Figure 7 shows the relative goodness of the eight products in the four catchment intervals with respect to five metrics. Just as the scale effect obtained in Figure 6, the simulation performance keeps increasing when the catchment area is larger than 20,000 km 2 . This pattern is expressed in the radar plot as the inclusion relationship of the circles. For example, in Figure 7c, the purple circle encloses the blue circle, while the blue circle encloses the red circle. Likewise, PERSIANN, PCCS, PCCSCDR, and CMORPH show the same patterns. However, PDIR, IMERG and GSMaP show different characteristics. Among them, PDIR and IMERG are mainly the anomalies of the systematic bias (RB). This may be due to the offset effect of positive and negative values in biases. The results may exist in three cases: (1) RB is negative in all upstream areas of the catchment, and its value may increase to a large negative one as it accumulates in the process of streamflow routing; (2) RB is positive in all upstream areas of the catchment, and it will also increase to a large positive one as it accumulates in the process of streamflow routing; (3) there are both positive and negative values in the upstream areas of the catchment, and they may accumulate (or cancel each other out), causing the RB to increase (or decrease). GSMaP does not seem to follow the patterns of the scale effect, which may be due to the relatively good performance in the upstream of the YLJ basin and the poor performance in the downstream regions ( Figure 5). Therefore, during the routing process, this routing-induced positive scale effect occurs mainly in the upstream region, which does not further improve the model performance in the downstream region due to the worse performance in these regions. The FFA of the upstream region is roughly between 20,000 and 60,000 km 2 , corresponding to the results in the radar plot (see Figure 7h). On the other hand, when the catchment area is less than 20,000 km 2 (black line in the radar plot), model performance varies in different products, which is highly correlated with the local simulation capacity, without being affected by the river routing. Furthermore, the model performance of the local area may be dominated by the accuracy of precipitation estimates.

Product Selection Based on Spatial Scale Dependence
In this study, we divided the YLJ basin into 522 sub-basins according to the 100 km 2 threshold to expand enough cases for the study of the scale effect. The FAA of different sub-basins ranges from 100 km 2 to 130,000 km 2 . Based on this framework, the simulation capability of different FFA for different SPEs is analyzed. In practical applications, models with different resolutions may be built to achieve the relevant research objectives. For example, flash flood monitoring and early warnings need high-resolution models with as many hydrological response units (HRUs) as possible to capture the local flood characteristics. For water resources planning and reservoir regulation, we only need to obtain the flow conditions at the main stream or outlet of the basin. A coarse-resolution distributed hydrological model, a semi-distributed hydrological model or a lumped model driven by the area mean precipitation can meet the requirements. Figure 8 shows the relationship between the advantages and disadvantages of different products in different catchment area intervals. Similarly, the five metrics are uniformly standardized to the range of 0-1, where 1 means that the product is the best compared to other products in a certain dimension, and 0 means that the product is the worst compared to other products in that dimension. As shown in the figure, the radar plot can more clearly guide us to identify the strengths and weaknesses of different products. It should be emphasized again that we have scaled down the metrics to between 0 and 1, which can only be applied to compare the relative goodness of different products at the same catchment scale. To quantitatively express the performance of the different products, we summarize the results in the Table 4. It can be seen from the figure that when the catchment area is less than 20,000 km 2 , the difference between the eight products is significant, and the best product is IMERG. Meanwhile, IMERG still maintains a high level of simulation performance when the catchment area increases to 100,000 km 2 . For other products, CMORPH and PDIR have comparable performance and gradually catch up with IMERG in the large catchment. The performance of CMORPH exceeds that of IMERG when the catchment area is larger than 100,000 km 2 . GSMaP and PCDR are at the same level, followed by PCCSCDR. PERSIANN and PCCS are the worst products for the YLJ basin. Interestingly, for the random error (NCRMSE), although the difference between the eight products is small, the CMORPH and PDIR products perform worse when the catchment area is in the range of 0-100,000 km 2 , which indicates that the random error of these two products is larger than other products. However, the random error decreases gradually with the increase of the FAA.

Comparison with Previous Studies
In this study, we investigate the scale dependence of the accuracy of SPEs-driven streamflow simulations. In the YLJ basin, when the catchment area is larger than 20,000 km 2 , the simulation accuracy increases with the increase of FFA (or catchment area), which is mainly attributed to the river streamflow routing. It weakens the propagation of errors in upstream areas into downstream areas. On the other hand, when the catchment area is smaller than 20,000 km 2 , the spatial distribution of simulation accuracy is more stochastic, and its performance is mainly controlled by the local simulation performance. Since we selected only one basin as the case, more evidence is needed to prove whether the results are generalizable. In fact, some previous studies have some degree of similarity with the results we present, or they can be combined with our study to reveal common patterns.
One important law is that when the catchment area is small, the accuracy of flow simulations is dominated by forcing accuracy. A large-scale hydrological simulation experiment confirms this conclusion. Jiang and Bauer-Gottwein [63] selected 300 watersheds with catchment areas smaller than 5000 km 2 in mainland China. The authors conclude that these catchments generally follow the rainfall-runoff error propagation and are not influenced by the streamflow routing process. The above studies were conducted in different basins and some other studies were conducted in the same basin with different sub-basins. Terink et al. [64] used the Bootstrap sampling method to set different gauge network densities ranging from 1 to 106 stations in a small watershed in the Netherlands and found that higher densities significantly reduced the uncertainty in runoff simulations and thus increased simulation accuracy. Similarly, the conclusion was confirmed in other studies that the degradation in precipitation resolution introduces errors in the precipitation field and propagates to the runoff simulation [65]. In addition, Cunha et al. [66] selected watersheds (with catchment areas of 20-1600 km 2 ) in central Iowa, showing that the uncertainty in peak flow simulations depends to a large extent on the size of the catchment. Uncertainty decreases with increasing catchment area due to the aggregation effect of the river network, which filters out small-scale uncertainty.
Another issue not covered in this paper but equally important is the temporal scale effect of precipitation, which has been discussed in many previous studies. In these studies, it is also called precipitation time averaging (temporal resolution), or time resampling (sampling error) [29,31,67]. An example is the density scatter plot (see Figure 4 in [64]). In their study, they assessed the quality of IMERG precipitation products at three scales (daily, monthly and annual). The results showed that the quality of precipitation products increased significantly with the increase of time scales. Similar findings can also be found in the assessment of daily and hourly precipitation [68].
However, the high temporal resolution of precipitation is still very important. Li et al. [69] conducted a case study using hourly and daily precipitation to drive SWAT models in a large-scale watershed in eastern China, respectively. The results show that the hourly precipitation-driven runoff simulation is better than the daily simulation because the hourly precipitation is better able to capture the detailed features (e.g., flood peak) of streamflow. Additionally, their results also confirm the existence of a temporal scale effect on the temporal resolution of precipitation. This effect was also confirmed by many other case studies in Tengboche, southern France and America [68,70,71].
In addition, the importance of temporal resolution relative to spatial resolution was emphasized in these studies that included both temporal and spatial resolution discussions. However, this conclusion was limited to smaller study watersheds where precipitation was more uniformly distributed in whole regions [68]. In another study about urban flood simulation, researchers discussed the response of urban hydrological dynamics to the spatial and temporal resolution of precipitation using radar precipitation data and showed that urban flood processes are more sensitive to spatial resolution compared to temporal resolution [72]. Both spatial averaging and temporal averaging result in a flattening of the flood peak simulations [29,67].
In summary, high-precision precipitation estimates with high spatial and temporal resolution are urgently needed, especially for areas with frequent flash floods or urban areas with high-proportioned impervious surfaces. In that case, equally refined high-resolution hydrological models are necessary. The two strategies can work together to improve our understanding and prevention of disasters (e.g., flash floods and urban flooding) [73].

Statement of Different SPEs in This Study
In this study, we chose eight different remote sensing precipitation products. Our main objective in selecting these products is to increase the diversity of the dataset and thus highlight the generality of the scale effects. Therefore, it is important to state that the generality of the conclusions regarding scale effects is acceptable and this is confirmed by a large number of relevant studies [63][64][65][66][67][68][69][70][71][72]. However, inter-comparisons about product performance are to some extent inappropriate because these data are not used for the same purposes.
PERSIANN, PCCS and PDIR are near real-time products without bias correction and with very short time delay (15 min to 2 days), so they are mainly used for real-time flood or drought monitoring [9,44,45]. PERSIANN and PCCS were among the first generation of global satellite precipitation estimates products, and their algorithm design provided a solid foundation for the advancement of PDIR [48,49]. In addition, they are the foundations of PCDR and PCCSCDR [46,47]. PDIR is a very promising product with a very short delay time (15 to 60 min), relatively high accuracy and the highest spatial resolution (0.04 degree), and it uses the Dynamic Infrared-Rain model, which utilizes climatological data to construct a dynamic cloud-top brightness temperature (T b )-rain rate relationship to retrieve precipitation. It can be used for a wide range of studies (e.g., atmospheric rivers, flash floods and extreme rainstorms).
CMORPH, IMERG and GSMaP are bias-corrected, near-real-time products. While GSMaP has a relatively short latency (4 h), the other two products have a much longer latency (3 to 4 months). CMORPH (1998 to present) is for the bias-corrected, reprocessed CPC Morphing Technique (CMORPH) high-resolution global satellite precipitation estimates. Bias in raw CMORPH is removed through a comparison against CPC daily gauge analysis over land and adjustment against the Global Precipitation Climatology Project (GPCP) merged analysis of pentad precipitation over ocean [50,74]. IMERG Final provides researchquality, gridded, global, multi-satellite precipitation estimates with quasi-Lagrangian time interpolation, gauge data and climatological adjustment [8]. Its algorithm is intended to intercalibrate, merge and interpolate "all" satellite microwave precipitation estimates, together with microwave-calibrated infrared (IR) satellite estimates, precipitation gauge analyses and potentially other precipitation estimators. The system is run several times for each observation time, first giving a quick estimate (IMERG Early Run) and successively providing better estimates as more data arrive (IMERG Late Run). The final step uses monthly gauge data to create research-level products (IMERG Final Run). So for IMERG Final, incorporating data from multiple sources and being fully bias corrected makes it the best performer in this study. These three datasets achieved very good accuracy with bias correction. This also confirms the importance of bias correction.
PCDR and PCCSCDR are climate-scale data records (CDR). PCDR and PCCSCDR are consistent from 1983 to present at a high resolution. The consistency in them is from the consistent global IR and GPCP datasets utilized in creating these datasets. These datasets are unique and useful for climate studies (i.e., historical trend analysis or climate change). Although PCCSCDR did not exceed the accuracy of PCDR in this study (YLJ basin), PCCSCDR has a higher spatial resolution (0.04 degree) compared to PCDR. It may exceed PCDR in other regions and can be used for some local studies [47].
In addition, the results of this study can only illustrate the basic performance of these data in the Yalong River basin. In other regions, the data may behave differently. Therefore, end-users should select products in a targeted manner according to the specific usage.

Conclusions
In this study, we evaluated eight satellite precipitation estimate products in a set of nested watersheds in the YLJ basin in China. The selected eight satellite products include Remotely Sensed Information using Artificial Neural Networks (PERSIANN), the PERSIANN-Cloud Classification System (PCCS), the PERSIANN-Climate Data Record (PCDR), the PERSIANN-Cloud Classification System-Climate Data Record (PCCSCDR), PERSIANN-Dynamic Infrared Rain Rate near real-time (PDIR), the Bias-corrected Climate Prediction Center Morphing technique Climate Data Record (CMORPH), the Global Precipitation Measurement Mission Integrated Multi-satellitE Final Run V6B (IMERG) and Gauge-calibrated Global Satellite Mapping of Near Real-time Precipitation product version 6 (GSMaP). The SPEs-driven simulated flow processes were evaluated for 522 catchments and four hydrological stations based on simulations driven by gauge observations, respectively. The 522 selected catchments range in size from 100 km 2 to 130,000 km 2 . The results of this study indicate that the quality of satellite-driven runoff simulations is characterized regionally at different spatial scales. Despite the different satellite precipitation estimates products, they tend to have poorer model performance upstream of the catchment and better performance downstream of the catchment (areas close to the river network). In addition, the scale effects contribute to these regional characteristics of simulation performance. Due to regional averaging and river routing, the "parent" basins are better simulated than the "son" basins, but this law is triggered under a threshold value (20,000 km 2 in this study).