Evaluation of Multi-Satellite Precipitation Datasets and Their Error Propagation in Hydrological Modeling in a Monsoon-Prone Region

Abstract: This study comprehensively evaluates eight satellite-based precipitation datasets in streamflow simulations on a monsoon-climate watershed in China. Two mutually independent datasets (one dense-gauge and one gauge-interpolated) are used as references because commonly used gauge-interpolated datasets may be biased and unable to reflect the real performance of satellite-based precipitation due to sparse networks. The dense-gauge dataset includes a substantial number of gauges, which can better represent the spatial variability of precipitation. The eight satellite-based precipitation datasets include two raw satellite datasets, Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) and Climate Prediction Center MORPHing raw satellite dataset (CMORPH RAW); four satellite-gauge datasets, Tropical Rainfall Measuring Mission 3B42 (TRMM), PERSIANN Climate Data Record (PERSIANN CDR), CMORPH bias-corrected (CMORPH CRT), and CMORPH gauge-blended (CMORPH BLD); and two satellite-reanalysis-gauge datasets, Multi-Source Weighted-Ensemble Precipitation (MSWEP) and Climate Hazards Group InfraRed Precipitation with Stations (CHIRPS). The uncertainty related to hydrologic model physics is investigated using two different hydrological models. A set of statistical indices is utilized to comprehensively evaluate the precipitation datasets from different perspectives, including detection, systematic, and random errors, and the precision in simulating extreme precipitation. Results show that CMORPH BLD and MSWEP generally perform better than other datasets. In terms of hydrological simulations, all satellite-based datasets show significant dampening effects for the random error during the transformation process from precipitation to runoff; however, this dampening does not hold for the systematic error.
Even though different hydrological models indeed introduce uncertainties to the simulated hydrological processes, the relative hydrological performance of the satellite-based datasets is consistent in both models. Namely, CMORPH BLD performs the best, followed by MSWEP, CMORPH CRT, and TRMM. PERSIANN CDR and CHIRPS perform moderately well, and the two raw satellite datasets are not recommended as proxies for gauged observations because of their poorer performance.

of satellite-based datasets. Generally, two main sources are (1) the error of the satellite-based datasets and (2) the error propagation of satellite-based datasets through the hydrological model [45].
The monsoon regions, having an obvious seasonal variation of precipitation, have always been a research focus of satellite-based precipitation datasets [46][47][48][49][50][51][52]. For example, Prakash et al. [51] compared four satellite-based precipitation datasets (Climate Prediction Center MORPHing-raw satellite dataset (CMORPH RAW), Naval Research Laboratory (NRL)-blended, PERSIANN, and TRMM 3B42) with the gauged-interpolated dataset in one Indian monsoon region with respect to their abilities to simulate the seasonal rainfall and the rainfall detection abilities over regions with diverse topography. The results show that although all four datasets underestimate the summer seasonal mean rainfall (June to September), TRMM 3B42 generally performs better than the other three datasets mainly due to its incorporation of rain gauge observations. Mou et al. [49] compared five satellite-based precipitation datasets (TRMM 3B42, its real-time dataset TRMM 3B42RT, GPCP-1DD, PERSIANN Climate Data Record (PERSIANN CDR), and CMORPH RAW) and a gauge-interpolated dataset (Asian Precipitation-Highly Resolved Observational Data Integration Towards Evaluation of Water Resources (APHRODITE)) at daily, monthly, seasonal, and annual scales with rain gauges over Malaysia. It was found that TRMM 3B42 and APHRODITE performed the best, while PERSIANN CDR slightly overestimated observed precipitation, and the other three satellite-based datasets showed the worst performance. In addition, all six precipitation datasets show better performances in southern Peninsular Malaysia, which receives higher precipitation, while worse performances appear in the western and dryer Peninsular Malaysia.
There also have been some studies executed in the monsoon regions aiming to evaluate the applicability of satellite-based datasets in hydrologic simulations [53][54][55][56][57][58][59][60]. For example, Tong et al. [58] evaluated four satellite-based datasets (TRMM 3B42, TRMM 3B42RT, CMORPH RAW, and PERSIANN) through comparing with the gauged China Meteorological Administration dataset (CMA) in streamflow simulations over the Tibetan Plateau based on the distributed Variable Infiltration Capacity (VIC) hydrological model. It was found that the error sources of these datasets are systematically different in different seasons. Furthermore, TRMM 3B42 shows comparable performance to CMA for both monthly and daily streamflow simulations due to its monthly gauge adjustment. However, the other three satellite-based datasets only show potentials or little capability for streamflow simulations over TP. In addition, five satellite-based precipitation datasets (TRMM 3B42, TRMM 3B42RT, CMORPH RAW, CMORPH CRT, and CMORPH BLD) were used by Wang et al. [59] to simulate the daily streamflow by driving the distributed Vegetation Interface Processes (VIP) model over two river basins in the southeastern Tibetan Plateau. The results show that these satellite-based datasets perform better in summer than other seasons, and CMORPH BLD performs the best for runoff simulations. TRMM 3B42 and CMORPH CRT show much better performance than their uncorrected counterparts: TRMM 3B42RT and CMORPH RAW.
From the previous studies, we found that, first, there are relatively few evaluations focusing on satellite-based precipitation datasets in the monsoon regions of southern China, which is a flood-prone area where both flood prediction and water resource management rely mainly on hydrological simulations. Moreover, most existing studies in monsoon-characterized regions only compare several commonly used satellite-based datasets (such as the TRMM and CMORPH series), while some promising recently released precipitation datasets, such as PERSIANN CDR, Climate Hazards Group InfraRed Precipitation with Stations V2.0 (CHIRPS), and Multi-Source Weighted-Ensemble Precipitation V2.0 (MSWEP), have not been thoroughly evaluated. Second, it is crucial to ensure that the gauged benchmark reference is sufficient to reflect the real performance of satellite-based precipitation when testing satellite-based datasets. However, many studies compared the satellite-based datasets against sparse-gauge datasets or gridded datasets generated from sparse gauges, which may not accurately reflect the spatial characteristics of precipitation [47,49,54,59,60]. Furthermore, when evaluating the accuracy of satellite-based datasets, the gauged references in some studies are not independent of the satellite-based datasets, which use the gauged precipitation as part of their source data [49,52,58].
Third, despite the fact that some studies show that the performance of hydrological simulation is highly dependent on the satellite-based datasets themselves in the monsoon regions, the uncertainties of hydrological models caused by different models' complexities could also influence the hydrological simulation. The impact of these two uncertainties has not been carefully examined.
The latest review article of Maggioni et al. [35] pointed out that one of the future research areas for satellite-based precipitation datasets is to study the conditions (climate type, basin area, acceptable error in the output, and model structure) under which satellite-based precipitation could be successfully used in hydrological models. To provide a comprehensive understanding of the error of satellite-based precipitation and its propagation through hydrological models for monsoon-characterized watersheds, this study tests the reliability of eight satellite-based precipitation datasets in hydrological modeling for a large (>80,000 km²) monsoon-characterized watershed, the Xiangjiang River Basin in southern China. Even though one of the main uses of satellite-based datasets is for ungauged watersheds or watersheds with sparse weather stations, testing their reliability requires a watershed with dense gauges. The Xiangjiang River Basin, which has 267 precipitation gauges (referred to as the dense-gauge precipitation dataset in this study) over a surface area of about 80,000 km², meets this requirement. The eight satellite-based precipitation datasets are TRMM 3B42 (TRMM), PERSIANN, PERSIANN CDR, CMORPH RAW, CMORPH bias-corrected (CMORPH CRT), CMORPH gauge blended (CMORPH BLD), MSWEP, and CHIRPS. In addition to the dense-gauge precipitation dataset, an independent gridded gauge-interpolated precipitation dataset, which incorporates far fewer stations, is also used as a reference: the CN05 dataset from the National Meteorological Information Center of the China Meteorological Administration [61]. As high-density gauged precipitation is usually not available in China, CN05 is commonly used for meteorological and hydrological studies over most watersheds [62][63][64].
This study could be extended to test whether CN05 can serve as a reliable reference for evaluating satellite-based datasets over other watersheds where gauges are much less dense. To investigate the uncertainty related to hydrological models, two models of different complexity are used: the lumped Xinanjiang (XAJ) model and the semi-distributed Soil and Water Assessment Tool (SWAT) model.

Study Area
The Xiangjiang River Basin has a complex topography with elevation ranging from 0 to 2100 m above sea level and is located between 24.5°-28.1°N and 110.5°-114.0°E in the southern part of China (Figure 1). The Xiangjiang River originates from Haiyang Mountain in Guangxi province, with a drainage area of 80,669 km² and a total length of 801 km, making it one of the largest tributaries of the Yangtze River [3,65]. The Xiangjiang River Basin, located in the subtropical and warm temperate zone and dominated by the East-Asian monsoon climate with heavy summer rainfall in the south, is an ideal experimental basin with a good relationship between precipitation and runoff [65]. The average temperature is around 17 °C, and the annual precipitation is close to 1500 mm, with occasional light snowfall in winter. More than 70% of the annual precipitation occurs between March and August. In addition, water resources are abundant in the Xiangjiang River Basin, so the study of satellite-based precipitation could provide valuable information to the administrative department for flood forecasting and water resources management.

Data
In this study, eight satellite-based datasets are selected and classified into three categories: (1) satellite-only (PERSIANN and CMORPH RAW), whose quality depends fully on the raw satellite data; (2) satellite-gauge (TRMM, PERSIANN CDR, CMORPH CRT, and CMORPH BLD), whose quality depends partly on gauge data; and (3) satellite-reanalysis-gauge, or blended (MSWEP and CHIRPS), in which reanalysis data are also blended. These datasets share the same spatial resolution of 0.25° × 0.25° in latitude and longitude and a common period from 2003 to 2013.
Although PERSIANN and CMORPH RAW both use passive microwave (PMW) and infrared (IR) observations to estimate rainfall, the relative contributions of PMW and IR differ greatly between the two datasets. Specifically, CMORPH RAW is primarily based on PMW remote sensing of rainfall, while PERSIANN is mainly based on IR imagery [66,67]. Each satellite-gauge and blended (gauges, satellites, and reanalysis data) dataset combines different source data using a different data fusion method. In general, CMORPH BLD and MSWEP directly incorporate daily gauge data, while TRMM and CMORPH CRT directly incorporate monthly gauge data. Unlike these four datasets, which are designed to provide the best instantaneous accuracy, PERSIANN CDR (monthly precipitation) and CHIRPS (5-day precipitation) are designed to provide the most temporally homogeneous record.
Specifically, TRMM blends GPCC with its satellite-only counterpart TMPA 3B42RT (which, similar to CMORPH RAW, is also estimated primarily by PMW remote sensing of rainfall) using the inverse error variance weighting method [68]. CMORPH CRT was produced by blending the CMORPH RAW dataset with Climate Prediction Center (CPC) and GPCC gauge data via the probability density function matching bias-correction method [69]. The optimal interpolation method was used to combine CMORPH CRT with a daily gauge analysis to produce CMORPH BLD [69]. Instead of using gauged observations directly, PERSIANN CDR was adjusted to match the monthly satellite-gauge GPCP, which uses gauge-interpolated GPCC, to remove its monthly biases [6,70]. Although both MSWEP and CHIRPS are categorized as blended datasets, their data sources and fusion methods are totally different. MSWEP is mainly produced by weighting, on each grid, datasets from different sources (daily and monthly gauges such as CPC and GPCC; reanalyses from ERA-Interim and the Japanese 55-year Reanalysis (JRA 55); and satellite data from CMORPH RAW, Global Satellite Mapping of Precipitation (GSMap MVK), and TRMM 3B42RT) based on their comparative performances at the surrounding gauges [71]. In contrast, CHIRPS mainly uses the NOAA Climate Forecast System (CFS) reanalysis datasets to fill the missing values in precipitation calculated from satellite datasets (such as TRMM 3B42) and from five-day gauged precipitation from datasets such as the World Meteorological Organization's Global Telecommunication System [72]. More details of the above datasets are given in Appendix A.
The reliability of the eight satellite-based precipitation datasets is evaluated by comparing them with two gauged precipitation datasets: the dense-gauge dataset and the gridded gauge-interpolated dataset (CN05). As an important experimental basin, the Xiangjiang River Basin has a dense-gauge precipitation dataset derived from a dense ground network of 267 precipitation stations with complete temporal coverage from 1963 to 2013, provided by the local hydrological department, the Water Conservation Bureau of Hunan Province. CN05, a national gauge-interpolated dataset, is composed of daily precipitation estimates at a spatial resolution of 0.5° for the quasi-China coverage of 54°N to 18°S latitude from 1961 to 2016. CN05, which is independent of the dense-gauge precipitation dataset, is generated by blending daily precipitation data (2472 Chinese national weather gauges and 44 gauges located in this study region) with Chinese mainland Digital Elevation Model (DEM) data (resampled from the Global 30 Arc Second Elevation Dataset to a spatial resolution of 0.5° × 0.5°) using the Thin Plate Spline (TPS) algorithm [73]. It is worth noting that CN05 is not independent of the eight satellite-based datasets. This is because two of the 44 CN05 gauges in the study region are international exchange gauges, which provide measurements to the gauge components (such as GPCC and CPC) of the four satellite-gauge and two blended datasets. This means that the gauged components of the satellite-gauge and blended datasets come from the same source. In other words, differences in the performance of the satellite-based datasets come from their other data sources (satellite or reanalysis) or from the blending strategies between and within the various source data.
Unlike the eight above-mentioned daily satellite-based precipitation datasets, which define a day as 0-23:59 UTC, both the dense-gauge and CN05 precipitation datasets use the same daily precipitation time interval, from 8 UTC of one day to that of the next day. This ensures that the daily precipitation measurement in China, which lies in the UTC+8 time zone, is executed simultaneously with daily precipitation measurements under the 0-23:59 UTC standard. A brief summary of the eight satellite-based datasets and two gauged datasets is presented in Table 1. The locations of the 267 dense-gauge stations, the 44 precipitation gauges used as source data for CN05, and the two international exchange gauges are shown in Figure 1.
For hydrological modeling, temperature data from 13 stations and streamflow time series at the watershed outlet are also used. In addition, a Digital Elevation Model (DEM) dataset with a spatial resolution of 30 m, a land-use dataset with a spatial resolution of 1 km, and a soil dataset from the Harmonized World Soil Database (HWSD) are used to establish the semi-distributed SWAT model.

Methodology
The comparison of datasets is carried out in both precipitation evaluations and hydrological simulations. When evaluating the precipitation, we compare the differences among all satellite-based precipitation datasets at both the areal-mean and grid scales to better understand the hydrological impacts of the errors in the satellite-based datasets. This is because the areal mean precipitation and the spatial distribution of precipitation are the decisive factors in the lumped XAJ and semi-distributed SWAT models used in this study, respectively. For the grid-scale evaluation, the dense-gauge observations are interpolated by the inverse distance weighting (IDW) method to 151 grids with a spatial resolution of 0.25° × 0.25°, the same as that of the eight satellite-based precipitation datasets [74]. For CN05, which has a spatial resolution of 0.5° × 0.5°, the four 0.25° grids within one 0.5° grid share the same precipitation value.
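The grid-scale preparation described above can be illustrated with a minimal sketch. It assumes planar distances in degrees and a distance power of 2, neither of which is stated in the text, and the gauge coordinates and values are hypothetical. The function names `idw_interpolate` and `split_coarse_cell` are illustrative only.

```python
import math

def idw_interpolate(gauges, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` (lon, lat).

    `gauges` is a list of (lon, lat, value) tuples. Distances are planar
    distances in degrees and the power of 2 is an assumed default.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lon, lat, value in gauges:
        dist = math.hypot(lon - target[0], lat - target[1])
        if dist == 0.0:
            return value  # target coincides with a gauge
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

def split_coarse_cell(value):
    """Replicate one 0.5-degree cell value onto its four 0.25-degree sub-cells."""
    return [[value, value], [value, value]]

# Hypothetical gauges: (longitude, latitude, daily precipitation in mm)
gauges = [(112.0, 26.0, 10.0), (112.5, 26.0, 20.0), (112.0, 26.5, 30.0)]
print(idw_interpolate(gauges, (112.25, 26.0)))
print(split_coarse_cell(3.0))
```

The two nearer gauges dominate the estimate because the weights fall off with the square of the distance; a different power or a search radius would change the smoothing.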

Hydrological Models
In this study, two hydrological models with different complexities, namely a conceptual lumped model and a physically-based semi-distributed model, are used for hydrological modeling. Both models have been successfully established in the Xiangjiang River Basin in many studies [3,44,75,76]. Whereas the lumped XAJ uses the areal mean precipitation as the model input, the semi-distributed SWAT uses precipitation from the single rain gauge closest to each sub-basin's centroid. Details of these two models are described below.
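The difference between the two precipitation inputs can be sketched as follows. This is an illustration only: the areal mean is taken as a simple arithmetic average (the text does not state the averaging method), and the gauge names, coordinates, and values are hypothetical.

```python
import math

def areal_mean(gauge_values):
    """Arithmetic basin-average precipitation for one day (lumped input)."""
    return sum(gauge_values) / len(gauge_values)

def nearest_gauge(centroid, gauge_coords):
    """Name of the gauge closest to a sub-basin centroid (semi-distributed input)."""
    lon0, lat0 = centroid
    return min(
        gauge_coords,
        key=lambda name: math.hypot(gauge_coords[name][0] - lon0,
                                    gauge_coords[name][1] - lat0),
    )

# Hypothetical gauges and one day of precipitation (mm)
gauge_coords = {"G1": (112.0, 26.0), "G2": (113.0, 27.0), "G3": (111.5, 25.5)}
daily_precip = {"G1": 12.0, "G2": 0.0, "G3": 30.0}

lumped_input = areal_mean(list(daily_precip.values()))
subbasin_input = daily_precip[nearest_gauge((112.2, 26.1), gauge_coords)]
print(lumped_input, subbasin_input)
```

The lumped input smooths out spatial contrasts, while the nearest-gauge input keeps them but depends on which single gauge is selected for each sub-basin.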

Xinanjiang Model (XAJ)
The XAJ model is a lumped conceptual rainfall-runoff model with 15 parameters that was developed in the 1970s [77,78]. It has been successfully used in humid regions of China [79][80][81]. The simulation of outflow at the basin outlet consists of three phases: evapotranspiration, runoff generation, and runoff routing. Four parameters account for evapotranspiration, two account for runoff generation, and nine account for runoff routing. Its hydrological cycle is based on the water balance equation:

$$S_t + W_t = S_0 + W_0 + \sum_{i=1}^{t} \left( R_{day} - Q_{surf} - E_a - Q_{lat} - Q_{gw} \right)$$

where $S_t$ and $S_0$ are the mean and initial free water storage capacity, $W_t$ and $W_0$ are the mean and initial tension water storage, $R_{day}$ is the amount of precipitation on day $i$, $Q_{surf}$ is the amount of surface runoff on day $i$, $E_a$ is the amount of evapotranspiration on day $i$, $Q_{lat}$ is the amount of lateral flow on day $i$, and $Q_{gw}$ is the amount of groundwater flow on day $i$. The evapotranspiration is calculated by dividing the soil into three layers: an upper layer, a lower layer, and a deep layer. The storage curve calculates the total runoff according to the hypothesis that when the soil moisture content reaches the field capacity, all rainfall turns into runoff. The rainfall exceeding infiltration is transformed into the surface runoff $Q_{surf}$, and the rainfall that has infiltrated becomes the lateral flow $Q_{lat}$ and groundwater flow $Q_{gw}$.

Soil and Water Assessment Tool Model (SWAT)
SWAT, a physically-based semi-distributed model, is designed to predict the effects of land management practices on hydrology, sediment, and contaminant transport [82]. SWAT can be operated under different soil compositions, land uses, and management conditions in an agricultural watershed [3,83]. Unlike the XAJ model, which uses the whole basin as the operation unit, SWAT divides the entire basin into several sub-basins, and each sub-basin is further divided into several Hydrologic Response Units (HRUs). Each HRU is calculated individually based on relatively homogeneous land use, land cover, and soil types. The water balance of SWAT is described as:

$$SW_t = SW_0 + \sum_{i=1}^{t} \left( R_{day} - Q_{surf} - E_a - W_{seep} - Q_{gw} \right)$$

where $SW_t$ is the final soil water content, $SW_0$ is the initial soil water content on day $i$, $t$ is the time, $R_{day}$ is the precipitation amount on day $i$, $Q_{surf}$ is the surface runoff amount on day $i$, and $W_{seep}$ is the water amount entering the vadose zone from the soil profile on day $i$. The Penman-Monteith method is used to estimate evapotranspiration $E_a$ [84]. The surface runoff volume $Q_{surf}$ is calculated by the Soil Conservation Service Curve Number method, and groundwater flow $Q_{gw}$ is simulated by creating a shallow aquifer. The flow at the basin outlet is obtained by routing each sub-basin's simulation results with the Muskingum method [85].
Remote Sens. 2020, 12, 3550
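As an illustration of the curve number step mentioned above, the sketch below uses the textbook form of the Soil Conservation Service Curve Number equation in metric units, with the conventional initial abstraction ratio of 0.2. It is not SWAT's code: SWAT additionally updates the retention parameter over time, which is not modeled here, and the function name and inputs are hypothetical.

```python
def scs_cn_runoff(precip_mm, curve_number, ia_ratio=0.2):
    """Daily surface runoff depth (mm) from the textbook SCS-CN equation.

    retention S = 25.4 * (1000 / CN - 10)  [mm]
    runoff    Q = (P - Ia)^2 / (P - Ia + S) for P > Ia, else 0, with Ia = ia_ratio * S
    """
    if not 0 < curve_number <= 100:
        raise ValueError("curve number must be in (0, 100]")
    retention = 25.4 * (1000.0 / curve_number - 10.0)
    initial_abstraction = ia_ratio * retention
    if precip_mm <= initial_abstraction:
        return 0.0  # all rainfall is abstracted before runoff starts
    excess = precip_mm - initial_abstraction
    return excess ** 2 / (excess + retention)

# Hypothetical storm of 50 mm on two surfaces
print(scs_cn_runoff(50.0, 80))   # moderately permeable surface
print(scs_cn_runoff(50.0, 100))  # fully impervious surface: all rain runs off
```

Because the relation is nonlinear in precipitation, a given relative error in rainfall does not translate into the same relative error in surface runoff.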

Model Calibration and Validation
XAJ and SWAT models are respectively calibrated using the Shuffled Complex Evolution (SCE-UA) algorithm [86] and Sequential Uncertainty Fitting version 2 (SUFI2) algorithm [87], using the Nash-Sutcliffe efficiency (NSE shown in Table 2

Statistical Analysis Methods
A set of statistical indices is used to evaluate the performance of the eight satellite-based datasets in reproducing precipitation and simulating watershed runoff. For precipitation evaluation, the indices include (1) four categorical statistics for the detection error, (2) three quantitative metrics, two of which reflect the systematic and random errors, and (3) four extreme precipitation statistics. For hydrological evaluation, one metric determines the overall hydrological performance, and three hydrological statistics reflect the characteristic values of streamflow. Additionally, the error propagation from precipitation to streamflow is quantified by two absolute ratios. A list of the indices can be found in Table 2, and more details are given in the following sections.

Precipitation Indices
Detection, systematic, and random errors are the three main error sources of satellite-based datasets [35,88]. False alarms (when gauges do not observe the satellite-detected precipitation) and missed rain (when the gauge-observed precipitation is not detected by satellites) constitute the detection errors [89]. When the satellite correctly detects precipitation, the errors in the estimated precipitation amount comprise the systematic and random errors [90][91][92][93].
In this study, four categorical statistics, namely the frequency bias index (FBI), the probability of detection (POD), the false alarm ratio (FAR), and the equitable threat score (ETS), are used to quantify the detection errors of each satellite-based dataset [1]. The FBI reflects the tendency to underestimate or overestimate the number of rainfall events. The POD measures the fraction of observed rain occurrences that were correctly detected, and the FAR measures the fraction of detected rain events that were false alarms. The ETS provides an overall skill measure of the correctly detected rain events relative to all events that were observed and/or detected, adjusted for hits expected by chance.
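The four categorical statistics can be computed from a 2 × 2 contingency table, as in the following sketch. It uses the standard definitions of these scores; the 1 mm/day rain/no-rain threshold is an assumption (the detection threshold is not stated here), and the two short series are hypothetical.

```python
def contingency_table(observed, estimated, threshold=1.0):
    """Count hits, misses, false alarms, and correct negatives.

    A day is 'wet' when precipitation >= threshold (mm/day); the 1 mm
    default is an assumption made for this sketch.
    """
    hits = misses = false_alarms = correct_negatives = 0
    for obs, est in zip(observed, estimated):
        obs_wet = obs >= threshold
        est_wet = est >= threshold
        if obs_wet and est_wet:
            hits += 1
        elif obs_wet:
            misses += 1
        elif est_wet:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives

def categorical_scores(observed, estimated, threshold=1.0):
    """FBI, POD, FAR, and ETS (assumes at least one observed and one estimated wet day)."""
    h, m, f, c = contingency_table(observed, estimated, threshold)
    n = h + m + f + c
    random_hits = (h + m) * (h + f) / n  # hits expected by chance
    return {
        "FBI": (h + f) / (h + m),
        "POD": h / (h + m),
        "FAR": f / (h + f),
        "ETS": (h - random_hits) / (h + m + f - random_hits),
    }

# Hypothetical gauge and satellite daily series (mm/day)
gauge = [0, 2, 5, 0, 3, 0, 0, 8]
satellite = [0, 3, 0, 1.5, 4, 0, 0, 6]
print(categorical_scores(gauge, satellite))
```

In this example the satellite has one miss and one false alarm, so the FBI equals 1 even though detection is imperfect, which is why the FBI is read together with POD, FAR, and ETS.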
The three quantitative statistics of precipitation are the relative bias (RB), the unbiased root mean squared error (ubRMSE), and the coefficient of determination (R²). RB reflects the systematic error, which is the relative difference in the long-term mean values of the two series. Although the RMSE shows the amplitude of the differences between the two series, it cannot directly reflect the random error unless the systematic error is removed; the ubRMSE is therefore obtained by removing the mean difference between the two series before computing the RMSE. R² indicates the correlation between the two series.
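A minimal sketch of the three quantitative metrics follows. It assumes RB is expressed in percent and that R² is the squared Pearson correlation coefficient; the exact formulas used in the study are given in its Table 2 and may differ in detail. The input series are hypothetical.

```python
import math

def relative_bias(observed, estimated):
    """Relative bias in percent: systematic over- or underestimation."""
    return 100.0 * (sum(estimated) - sum(observed)) / sum(observed)

def ubrmse(observed, estimated):
    """RMSE of the anomalies, i.e. after removing the mean difference."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n
    squared = sum(((e - mean_est) - (o - mean_obs)) ** 2
                  for o, e in zip(observed, estimated))
    return math.sqrt(squared / n)

def r_squared(observed, estimated):
    """Squared Pearson correlation between the two series."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_est = sum(estimated) / n
    cov = sum((o - mean_obs) * (e - mean_est) for o, e in zip(observed, estimated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_est = sum((e - mean_est) ** 2 for e in estimated)
    return cov ** 2 / (var_obs * var_est)

# Hypothetical daily series (mm/day)
gauge = [1.0, 2.0, 3.0, 4.0]
satellite = [2.0, 2.0, 5.0, 5.0]
print(relative_bias(gauge, satellite), ubrmse(gauge, satellite), r_squared(gauge, satellite))
```

For these series the squared RMSE is 1.5 and the squared mean difference is 1.0, so the squared ubRMSE is 0.5, which illustrates the identity ubRMSE² = RMSE² − bias².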
Four extreme statistics are selected from the list recommended by the joint World Meteorological Organization Commission for Climatology/World Climate Research Programme project on Climate Change Detection and Indices (https://www.climdex.org/indices.html). These are the annual total precipitation from wet days on which the daily amount exceeds the 99th percentile (R99pTOT), the mean daily precipitation amount on wet days (SDII), and the maximum lengths of wet and dry spells (CWD and CDD). R99pTOT is a threshold index, and SDII reflects the intensity of extreme precipitation. CWD (CDD) shows the duration of extreme precipitation (non-precipitation) events.
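The four extreme statistics can be sketched for a single year of daily data as below. A wet day is taken as daily precipitation ≥ 1 mm, following the CWD and CDD definitions in Table 2. The 99th-percentile threshold is passed in as an argument rather than computed, because the reference period used to derive it is not specified here; the series and the threshold value are hypothetical.

```python
WET_DAY_MM = 1.0  # wet-day threshold from the CWD/CDD definitions

def sdii(daily):
    """Mean precipitation on wet days (simple daily intensity index)."""
    wet = [p for p in daily if p >= WET_DAY_MM]
    return sum(wet) / len(wet) if wet else 0.0

def r99ptot(daily, p99_threshold):
    """Total precipitation from wet days exceeding a given 99th-percentile threshold."""
    return sum(p for p in daily if p >= WET_DAY_MM and p > p99_threshold)

def longest_spell(daily, wet):
    """Longest run of consecutive wet (wet=True) or dry (wet=False) days."""
    longest = current = 0
    for p in daily:
        if (p >= WET_DAY_MM) == wet:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

# Hypothetical daily precipitation (mm/day) and assumed threshold
daily = [0, 2, 3, 0, 0, 0.5, 12, 40, 1, 0]
print(sdii(daily))                       # SDII
print(r99ptot(daily, 30.0))              # R99pTOT with an assumed 30 mm threshold
print(longest_spell(daily, wet=True))    # CWD
print(longest_spell(daily, wet=False))   # CDD
```

Note that the 0.5 mm day counts as dry under the 1 mm threshold, so it extends the dry spell rather than breaking it.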

Hydrological Indices
The widely used NSE metric is used to evaluate the performance of each precipitation dataset for hydrological simulations. NSE is calculated as one minus the ratio of the residual variance to the variance of the measured discharge [94]. Simulated discharges using these datasets were also compared against their gauged counterparts using three hydrological statistics: daily mean discharge, winter low flow (5th percentile of the winter flow), and summer high flow (95th percentile of the summer flow).
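The hydrological indices can be sketched as follows. NSE uses its standard definition; the percentile helper uses linear interpolation between order statistics, which is one common convention and an assumption here. The caller is expected to pass only winter (or summer) flows to obtain the low-flow (or high-flow) statistic, and all discharge values are hypothetical.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - residual variance / observed variance."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical daily discharges (m3/s)
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated = [1.0, 2.0, 3.0, 4.0, 6.0]
print(nse(observed, simulated))
print(percentile([10.0, 20.0, 30.0, 40.0], 5))    # low-flow style statistic
print(percentile([10.0, 20.0, 30.0, 40.0], 95))   # high-flow style statistic
```

An NSE of 1 means a perfect fit, while 0 means the simulation is no better than using the observed mean as the prediction.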

Error Propagation Indices
Two absolute ratios (γ) between the error metrics (RB and ubRMSE) of the runoff and precipitation series are used to quantify the error propagation through the precipitation-runoff process. γ_RB and γ_ubRMSE reflect the systematic and random error propagation effects, respectively. Because they are absolute values, they are never negative, and values larger (smaller) than 1 indicate amplification (dampening) of the error from precipitation to runoff.
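The propagation ratio can be sketched as below. The runoff metric is placed in the numerator because values above 1 are said to indicate amplification from precipitation to runoff. The sketch assumes both metrics are expressed in comparable (for example, relative) units; the helper names and the numbers are hypothetical and are not results from the study.

```python
def propagation_ratio(metric_runoff, metric_precip):
    """Absolute ratio of a runoff error metric to the same precipitation error metric."""
    if metric_precip == 0:
        raise ValueError("precipitation error metric is zero; ratio undefined")
    return abs(metric_runoff / metric_precip)

def describe(gamma):
    """Interpret a propagation ratio relative to 1."""
    if gamma > 1:
        return "amplified"
    if gamma < 1:
        return "dampened"
    return "unchanged"

# Hypothetical values: RB in percent, ubRMSE in normalized units
gamma_rb = propagation_ratio(-30.0, -20.0)
gamma_ubrmse = propagation_ratio(2.0, 8.0)
print(gamma_rb, describe(gamma_rb))
print(gamma_ubrmse, describe(gamma_ubrmse))
```

Taking the absolute value discards the sign, so a bias that flips from negative in precipitation to positive in runoff is judged only by its magnitude.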

Table 2 (excerpt: extreme statistics and hydrological statistics).

Extreme statistics
R99pTOT: Annual total precipitation when daily precipitation amount on a wet day > 99th percentile
SDII: Annual daily precipitation amount on wet day
CWD: Maximum length of wet spell, maximum number of consecutive days with daily precipitation ≥ 1 mm
CDD: Maximum length of dry spell, maximum number of consecutive days with daily precipitation < 1 mm

Hydrological indices (hydrological statistics)
DMD: Daily mean discharge
WLF: Winter low flow (5th percentile)
SHF: Summer high flow (95th percentile)

Seasonal Patterns of Precipitation Datasets
Figure 2 presents the seasonality (spring: March-May, summer: June-August, autumn: September-November, winter: December-February; wet season: April-September and dry season: October-March) of the mean precipitation for all ten precipitation datasets (eight satellite-based precipitation datasets, one gauged precipitation dataset (i.e., the dense-gauge dataset), and one gauge-interpolated precipitation dataset (i.e., CN05)). All stations or grids within the watershed are averaged to a single time series to calculate the seasonal mean values. The figure shows that CN05 agrees well with the dense-gauge observation for all four seasons. Specifically, CN05 presents a small RB within ±7.0% for seasonal precipitation (−2.4% for spring, −6.1% for summer, −1.7% for autumn, and 0.3% for winter). With the exception of the satellite-only datasets, which considerably underestimate the precipitation for all seasons, the satellite-based datasets also reasonably represent the observed seasonality. However, all of them are worse than CN05 for all seasons. The better performance of PERSIANN CDR among the satellite-based datasets for seasonal precipitation, especially in spring, summer, and autumn, could reflect the effects of its blending strategy. PERSIANN CDR maintains monthly precipitation that is consistent with the monthly GPCP, and GPCP is mainly composed of gauged precipitation datasets (e.g., GPCC) [70]. In addition, all the satellite-gauge datasets overestimate the dense-gauge precipitation in summer and the wet season while underestimating it in winter. Moreover, both blended datasets (MSWEP and CHIRPS) overestimate the precipitation all year round. TRMM, CMORPH BLD, and MSWEP fit the dense-gauge precipitation better in the dry season than in the wet season, while PERSIANN CDR, CMORPH CRT, CHIRPS, and the satellite-only datasets perform better in the wet season than in the dry season.
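The seasonal comparison can be sketched with the season definitions given above. The input is assumed to be a mapping from calendar month (1-12) to basin-mean monthly precipitation; the numbers are hypothetical and are not values from Figure 2.

```python
SEASONS = {
    "spring": (3, 4, 5),
    "summer": (6, 7, 8),
    "autumn": (9, 10, 11),
    "winter": (12, 1, 2),
}

def seasonal_means(monthly_precip):
    """Average the monthly values belonging to each season."""
    return {
        season: sum(monthly_precip[m] for m in months) / len(months)
        for season, months in SEASONS.items()
    }

def seasonal_relative_bias(estimate, reference):
    """Relative bias (%) of the seasonal means of `estimate` against `reference`."""
    est = seasonal_means(estimate)
    ref = seasonal_means(reference)
    return {season: 100.0 * (est[season] - ref[season]) / ref[season]
            for season in SEASONS}

# Hypothetical basin-mean monthly precipitation (mm) for a reference dataset
reference = {1: 70, 2: 90, 3: 140, 4: 190, 5: 210, 6: 220,
             7: 140, 8: 130, 9: 80, 10: 80, 11: 70, 12: 50}
# Hypothetical estimate that is uniformly 5% too dry
estimate = {month: 0.95 * value for month, value in reference.items()}

print(seasonal_means(reference))
print(seasonal_relative_bias(estimate, reference))
```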
The spatial distributions of summer precipitation are also presented for all datasets in Figure 3. The dense-gauge dataset is presented as colored dots, while all the other datasets are presented as grids. Generally, summer precipitation is heavier in high-elevation areas (southeastern, southwestern, and southern parts) than in other regions. CN05 clearly misses some regional intense precipitation (such as the heavy precipitation in the southeastern parts of the region) that is captured even by MSWEP and CHIRPS. The poor performance of CN05 may have two causes: (1) its lower spatial resolution (0.5° × 0.5°) and (2) its smaller number of source gauges compared with the dense-gauge dataset. Satellite-only datasets underestimate precipitation for all grids, even though PERSIANN can capture the heavy precipitation signal in mountain areas. Although all satellite-gauge datasets capture this spatial distribution pattern, they still underestimate the heavy precipitation in mountain areas (southern and southeastern parts) while overestimating the light precipitation in the central plain regions. The spatial distributions of winter precipitation, shown in Figure A1 in Appendix B, display similar patterns, as CN05 still performs relatively worse than the two blended datasets, MSWEP and CHIRPS. The better performance of blended datasets for the spatial distribution of seasonal precipitation may be due to their reanalysis components.

Figure 4A presents two types of daily gridded precipitation information for the dense-gauge, CN05, and eight satellite-based datasets: (1) bar charts represent the frequency distribution of precipitation in seven rain rate classes (0, 0-1, 1-5, 5-10, 10-25, 25-50, and >50 mm/day), and (2) line charts represent the contribution of the precipitation amount in each rain rate class to the total precipitation.
As shown in the bar charts, PERSIANN CDR, CMORPH CRT, and CMORPH BLD are close to their gauged counterparts: the precipitation frequency decreases from the 0 mm class to the 0-1 mm class, increases slightly in the 1-5 mm class, and then decreases until the >50 mm class. These tendencies in the 0, 0-1, and 1-5 mm classes are inaccurately represented by MSWEP. Two other satellite-based datasets (TRMM and CHIRPS) overestimate the frequencies of no rain (0 mm) and heavy rain (>50 mm) and underestimate that of light rain (0-1 mm). The line charts show that the largest precipitation contribution occurs in the 10-25 mm class for all datasets except TRMM and CMORPH CRT. Large differences in the precipitation contribution among datasets occur in the 25-50 mm and >50 mm classes.
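As a rough illustration of how the two quantities in Figure 4A relate, the sketch below bins a daily series into the seven rain rate classes and returns both the frequency of each class and its share of the total amount. The handling of class boundaries (right-closed intervals) and the names `classify` and `class_statistics` are assumptions made for this example; the study may treat the edges differently.

```python
from bisect import bisect_left

# Upper edges (mm/day) of the wet classes; 0 mm is its own class.
EDGES = [1.0, 5.0, 10.0, 25.0, 50.0]
LABELS = ["0", "0-1", "1-5", "5-10", "10-25", "25-50", ">50"]


def classify(p):
    """Return the index into LABELS for a daily total p (mm/day)."""
    if p <= 0.0:
        return 0
    # Assumption: classes are right-closed, so exactly 1.0 mm falls in "0-1".
    return 1 + bisect_left(EDGES, p)


def class_statistics(daily):
    """Frequency of each class and its contribution to the total amount."""
    counts = [0] * len(LABELS)
    amounts = [0.0] * len(LABELS)
    for p in daily:
        k = classify(p)
        counts[k] += 1
        amounts[k] += max(p, 0.0)
    total = sum(amounts)
    freq = [c / len(daily) for c in counts]
    contrib = [a / total if total > 0 else 0.0 for a in amounts]
    return freq, contrib


freq, contrib = class_statistics([0.0, 0.0, 0.5, 3.0, 12.0, 60.0])
```

In this toy series the >50 mm class holds one day out of six but most of the total amount, which is the kind of contrast between the bar and line charts described above.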

Error Structures of Precipitation Datasets
The detection errors of each satellite-based dataset are quantified using the FBI, FAR, POD, and ETS for the 11-year (2003-2013) annual, wet season, and dry season precipitation series. Figure 4B presents the distribution of the FBI for nine precipitation datasets (CN05 and eight satellite-based datasets). The FBI values of CMORPH RAW for the 25-50 mm (13.66) and >50 mm (89.35) classes exceed 6 and are therefore not displayed; the same applies to Figure 4C,D. Although both satellite-gauge and blended datasets poorly reproduce the annual FBI for the 0 mm (e.g., the FBI of MSWEP is 2.55) and 0-1 mm (e.g., the FBI of TRMM is 3.95) classes, they overall outperform the satellite-only datasets, which have worse annual FBI results in more than half of the rain rate classes (Figure 4B). Figure 4C,D further shows that the FBI values of satellite-only datasets for most rain rate classes (5-10, 10-25, 25-50, and >50 mm) are overestimated more strongly in the dry season than in the wet season. The larger underestimation of precipitation events in the dry season agrees well with the seasonal precipitation amounts in Section 4.1.1 and could further explain the poor performance of satellite-only datasets, which may stem from the underestimation of precipitation events above 10 mm during the wet season and of all precipitation events during the dry season. As the rain rate class increases, the FBI of satellite-gauge datasets improves until the class exceeds 50 mm in both seasons. The annual FBI values of CMORPH CRT (0.78), TRMM (0.62), and CHIRPS (0.49) for this class are less than 1, indicating that these datasets overestimate the number of heavy rain events. This may also explain the overestimation of the percentage of heavy rains (Figure 4A). FAR, POD, and ETS of satellite-based datasets also show obvious seasonal patterns.
The two satellite-only datasets deteriorate markedly as the rain rate class increases in terms of the annual FBI, FAR, and POD, indicating their inability to capture heavy precipitation. These two datasets clearly perform better in the wet season than in the dry season, especially in terms of the POD (Figure 4I,J) and ETS (Figure 4L,M). In contrast, CMORPH BLD and MSWEP show the opposite seasonal pattern, performing better in the dry season than in the wet season in terms of the three statistics. In addition, both of them maintain their superiority among all the satellite-gauge datasets.
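For readers less familiar with categorical verification, the following minimal sketch computes POD, FAR, and ETS for one rain rate class from a 2 × 2 contingency table. It uses the commonly cited textbook definitions, which is an assumption here: the exact formulas, and in particular the orientation of the FBI, are those given in the methods section of the paper and are not reproduced in this sketch. The function name `detection_scores` is hypothetical.

```python
def detection_scores(obs, est, lo, hi):
    """POD, FAR and ETS for events defined as lo < value <= hi (mm/day).

    obs: reference daily series (e.g. dense-gauge); est: evaluated series.
    Textbook definitions are assumed; they may differ from the paper's.
    """
    def is_event(v):
        return lo < v <= hi

    hits = misses = false_alarms = 0
    for o, e in zip(obs, est):
        o_ev, e_ev = is_event(o), is_event(e)
        if o_ev and e_ev:
            hits += 1
        elif o_ev:
            misses += 1
        elif e_ev:
            false_alarms += 1

    n = len(obs)
    nan = float("nan")
    # Number of hits expected from random chance alone.
    hits_random = (hits + misses) * (hits + false_alarms) / n
    pod = hits / (hits + misses) if hits + misses else nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    denom = hits + misses + false_alarms - hits_random
    ets = (hits - hits_random) / denom if denom else nan
    return pod, far, ets


# Toy example for the 1-5 mm class: 2 hits, 1 miss, 1 false alarm in 8 days.
obs = [2.0, 3.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
est = [2.0, 0.0, 3.0, 0.0, 4.0, 0.0, 0.0, 0.0]
pod, far, ets = detection_scores(obs, est, 1.0, 5.0)
```

Passing `hi=float("inf")` covers the open-ended >50 mm class.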
Although gauge-interpolated CN05 performs relatively worse than CMORPH BLD and MSWEP, it shares similar seasonal patterns with them and also outperforms the other six satellite-based datasets with regard to POD and ETS for most rain rate classes (0-1, 1-5, 5-10, and 10-25 mm).
Figure 5 shows the RB, ubRMSE, and R² of nine precipitation datasets (CN05 and eight satellite-based datasets) at both grid (shown as boxplots) and watershed-average scales (shown as radar plots). Generally, the performance of each precipitation dataset is consistent across the two scales for all three statistics. Figure 5A,B shows that CN05 presents a better RB than the eight satellite-based datasets. Among all satellite-gauge datasets, TRMM, PERSIANN CDR, and CMORPH CRT show the smallest RBs, indicating their smaller systematic errors, at both grid and watershed-average scales. CMORPH BLD and the two blended datasets (MSWEP and CHIRPS) generally show positive RB at both scales, especially CHIRPS, which overestimates the daily precipitation for more than 86.1% of the grids and has an RB of 16.5% at the watershed-average scale. In contrast, satellite-only datasets considerably underestimate the mean precipitation at both scales.
Random errors of CN05 and satellite-based datasets are quantified using the ubRMSE (Figure 5C). CN05 shows relatively larger random errors than the satellite-based datasets except for CMORPH RAW and CHIRPS. In addition, large differences are observed among satellite-based datasets. Specifically, CMORPH BLD presents the smallest ubRMSE with the median value of 6.31 mm at the grid scale (Figure 5C) and 2.09 mm at the watershed-average scale (Figure 5D), while CHIRPS presents the largest ubRMSE with the median value of 10.94 mm at the grid scale and 6.28 mm at the watershed-average scale. MSWEP performs the best among all nine precipitation datasets with a median value of 5.83 mm at the grid scale and 2.14 mm at the watershed-average scale.
Figure 5E,F presents the R² values for all nine precipitation datasets, and both clearly reflect the influence of the blending methods and the incorporated gauged datasets on R². Datasets designed to provide the best instantaneous accuracy of precipitation (TRMM, CMORPH CRT, CMORPH BLD, and MSWEP) perform relatively better than those aimed at achieving the most temporally homogeneous record (PERSIANN CDR and CHIRPS). Within the four better-behaved satellite-based datasets, those that directly incorporate daily gauge data (CMORPH BLD and MSWEP) clearly perform better than those that directly incorporate monthly gauge data (TRMM and CMORPH CRT). The two satellite-only datasets show the worst performance among all the satellite-based datasets. Similarly, CN05 is also less correlated with the dense-gauge dataset than half of the satellite-based datasets (TRMM, CMORPH CRT, CMORPH BLD, and MSWEP).
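The three continuous statistics compared in Figure 5 can be sketched as follows. The formulas are the widely used forms (RB in percent of the reference total, ubRMSE as the RMSE of mean-removed series, and R² as the squared Pearson correlation); treating R² as the squared correlation is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def relative_bias(est, ref):
    """Relative bias in percent; positive values mean overestimation."""
    return 100.0 * (sum(est) - sum(ref)) / sum(ref)


def ubrmse(est, ref):
    """RMSE after removing each series' own mean (random error component)."""
    n = len(ref)
    mean_est, mean_ref = sum(est) / n, sum(ref) / n
    squared = [((e - mean_est) - (r - mean_ref)) ** 2 for e, r in zip(est, ref)]
    return math.sqrt(sum(squared) / n)


def r_squared(est, ref):
    """Squared Pearson correlation (assumed definition of R2)."""
    n = len(ref)
    mean_est, mean_ref = sum(est) / n, sum(ref) / n
    cov = sum((e - mean_est) * (r - mean_ref) for e, r in zip(est, ref))
    var_est = sum((e - mean_est) ** 2 for e in est)
    var_ref = sum((r - mean_ref) ** 2 for r in ref)
    return cov * cov / (var_est * var_ref)


ref = [0.0, 2.0, 4.0, 6.0]
shifted = [1.0, 3.0, 5.0, 7.0]   # constant offset: bias only
scaled = [0.0, 4.0, 8.0, 12.0]   # doubled amplitude: bias and random error
```

The `shifted` series has a nonzero RB but zero ubRMSE, which shows why the two statistics are read as systematic and random error respectively.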

Figure 5. Relative bias (RB), unbiased root mean squared error (ubRMSE), and R² of the daily precipitation for the nine gridded datasets at both grid (shown as boxplots in (A,C,E)) and watershed (shown as radar plots in (B,D,F)) scales. In the radar plots on the right, red and blue lines represent the optimal value of each statistic (RB = 0, ubRMSE = 0, and R² = 1) and the results of the gridded datasets, respectively.

Simulation of Extreme Precipitation
The results of four extreme precipitation statistics are presented in Figure 6 for nine datasets (CN05 and eight satellite-based datasets) at both grid (shown as relative bias compared to the dense-gauge precipitation dataset in boxplots) and watershed (shown as the absolute value in radar plots) scales. In the radar plots, red and blue lines represent the results of the dense-gauge and each dataset, respectively.
R99pTOT (Figure 6A,B) reflects the total precipitation of heavy rain. The R99pTOT values of satellite-gauge and blended datasets, except PERSIANN CDR and CHIRPS, are similar to the dense-gauge observation, with more than 50% of grids having biases within ±20.0% at the grid scale. Specifically, PERSIANN CDR underestimates R99pTOT, with more than 56.3% of grids having a negative bias smaller than −20% and a relative bias of 9.3% at the watershed scale. In contrast, CHIRPS overestimates R99pTOT at both scales (with more than 63.6% of grids having a positive bias larger than 20%, and a relative bias of 45.3% at the watershed scale). Additionally, the two satellite-only datasets underestimate R99pTOT.
The SDII values are shown in Figure 6C,D. The similarity between the SDII and R99pTOT results can be explained by two factors: (1) heavy precipitation accounts for a large proportion of the annual precipitation amount, and (2) the number of wet days is similar for all nine datasets. The CWD is presented in Figure 6E,F at the grid and watershed-average scales, respectively, and the CDD is presented in Figure 6G,H. The results show that CMORPH BLD maintains its superiority among all the datasets in simulating these two extreme statistics. However, the other seven satellite-based datasets cannot accurately capture the CDD and the CWD at the same time, especially the CDD, which is used as a criterion for representing droughts. For example, CMORPH CRT shows a small bias of CWD at both the grid scale (more than half of the grids have a bias within ±10.0%) and the watershed-average scale (CMORPH CRT: 18 days; dense-gauge: 19 days). In contrast, the CDD of CMORPH CRT is not accurately estimated, with more than 50.0% of the grids having a bias larger than 10.0% and a bias of 44.7% at the watershed-average scale (CMORPH CRT: 55 days; dense-gauge: 38 days).
CN05 better represents these four extreme precipitation statistics than all satellite-based datasets, especially the CDD, as shown in Figure 6G,H. It also shows small detection and systematic errors in the comparison in Section 4.1.2. However, CN05 misses some regional intense seasonal precipitation and has larger random errors and a worse R² than more than half of the satellite-based datasets. In other words, the poorer performance of CN05 in these respects indicates that some satellite-based datasets are effective in representing the spatial distribution of precipitation, an effect that can be missed by gauge-interpolated datasets built from sparse gauges at a relatively coarse spatial resolution. Therefore, there is a risk in using CN05 as the reference when investigating the statistical properties of satellite-based precipitation, especially for high-precision datasets.
Figure 6. Annual total precipitation when the daily precipitation amount on a wet day exceeds the 99th percentile (R99pTOT), the annual mean daily precipitation amount on wet days (SDII), and the maximum length of wet and dry spells (CWD and CDD) of the daily precipitation for the nine gridded datasets at both grid (shown as boxplots in (A,C,E,G)) and areal mean (shown as radar plots in (B,D,F,H)) scales. In the radar plots on the right, the red line and blue lines represent the results of the dense-gauge dataset and the nine other gridded datasets, respectively.
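The four extreme indices discussed in this section can be sketched for a single year of daily data as below. The 1 mm wet-day threshold follows the common ETCCDI convention and is an assumption here, as are the function names; the 99th percentile threshold is passed in as a parameter (`p99`) because the reference period used to derive it is not restated in this section.

```python
WET_DAY_MM = 1.0  # assumed wet-day threshold (ETCCDI convention)


def sdii(daily):
    """Mean precipitation on wet days (mm/day)."""
    wet = [p for p in daily if p >= WET_DAY_MM]
    return sum(wet) / len(wet) if wet else 0.0


def longest_spell(daily, wet):
    """Longest run of wet days (wet=True, CWD) or dry days (wet=False, CDD)."""
    best = run = 0
    for p in daily:
        if (p >= WET_DAY_MM) == wet:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def r99ptot(daily, p99):
    """Total precipitation on wet days that exceed the threshold p99 (mm)."""
    return sum(p for p in daily if p >= WET_DAY_MM and p > p99)


def relative_bias_pct(value, reference):
    """Relative bias of an index value against the reference, in percent."""
    return 100.0 * (value - reference) / reference


year = [0.0, 0.0, 5.0, 12.0, 0.5, 0.0, 0.0, 0.0, 30.0, 2.0]
cwd = longest_spell(year, wet=True)
cdd = longest_spell(year, wet=False)
```

As a check on the numbers quoted above, `relative_bias_pct(55, 38)` reproduces the 44.7% watershed-scale CDD bias reported for CMORPH CRT.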

Hydrological Simulations
Eight satellite-based datasets and gauge-interpolated CN05 are further compared against the dense-gauge dataset in hydrological modeling using both the XAJ and SWAT models calibrated against observed streamflow. Both models are adequately calibrated, with NSE values of 0.89 (XAJ) and 0.86 (SWAT) for calibration, and 0.89 (XAJ) and 0.84 (SWAT) for validation (Table 3).
Table 3. Comparison of Nash-Sutcliffe efficiency (NSE) of both Xinanjiang (XAJ) and Soil and Water Assessment Tool (SWAT) models in daily step simulation based on the dense-gauge and the nine precipitation datasets.
To illustrate the intra-annual variability of the hydrological process, Figure 7 shows the mean monthly hydrographs of the observed streamflow and of the streamflow simulated with the dense-gauge and the other nine precipitation datasets by the two models. A monthly rather than a daily hydrograph is used to avoid noise when calculating the climatology from the relatively short time period (i.e., 10 years) [95,96]. Two observations can be made. (1) The most precise simulation of discharge among all nine precipitation datasets is achieved by the gauge-interpolated CN05. CMORPH BLD, MSWEP, TRMM, and CMORPH CRT perform better than the other satellite-based datasets. CHIRPS and PERSIANN CDR, respectively, overestimate and underestimate the observed discharge for almost the whole year. (2) During the flood period (from April to August), the discharges simulated by the XAJ model with the dense-gauge and the other nine datasets are clearly larger than those simulated by the SWAT model. Similar results were also found by Xu et al. [3], who used the XAJ and SWAT models to test the ability of two reanalysis datasets to simulate flood events in the Xiangjiang River Basin.
To further quantify the performance of satellite-based datasets in representing streamflow time series, the NSE values of the two hydrological models based on daily streamflow are calculated and presented in Table 3. Results based on the XAJ model show that three satellite-gauge precipitation datasets (TRMM, CMORPH CRT, and CMORPH BLD) and one blended dataset (MSWEP) are satisfactory in simulating streamflow time series, with NSE larger than 0.72. CMORPH BLD (NSE = 0.84) outperforms all other satellite-based datasets. CHIRPS (NSE = 0.44) and PERSIANN CDR (NSE = 0.56) show moderate performance. Satellite-only datasets cannot represent the observed streamflow time series, with NSE = −0.97 for CMORPH RAW and NSE = −0.40 for PERSIANN. The semi-distributed SWAT shows daily simulation performance similar to that of the lumped XAJ for each dataset, but the performance of satellite-gauge datasets in SWAT is slightly worse than that in XAJ, except for PERSIANN CDR. Similar to PERSIANN CDR, both blended datasets, CHIRPS (NSE = 0.44/0.48 for XAJ/SWAT) and MSWEP (NSE = 0.78/0.79 for XAJ/SWAT), perform better in SWAT than in XAJ. Despite some differences in the simulation performance of the two models, the relative rankings of the datasets based on NSE are almost the same in both models. The best simulation was achieved by satellite-gauge CMORPH BLD, followed by blended MSWEP, TRMM, and CMORPH CRT. CHIRPS and PERSIANN CDR performed moderately, while satellite-only datasets showed the worst performance. This consistency indicates that using different models does not significantly alter the relative streamflow simulation performance of the satellite-based precipitation datasets.
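For reference, the NSE values in Table 3 follow the standard Nash-Sutcliffe definition, one minus the ratio of the squared simulation error to the variance of the observations about their mean. The minimal sketch below uses that standard form; the function name `nse` and the toy series are illustrative only.

```python
def nse(sim, obs):
    """Nash-Sutcliffe efficiency of simulated against observed streamflow.

    1 is a perfect fit, 0 means the simulation is no better than the
    observed mean, and negative values mean it is worse than the mean.
    """
    mean_obs = sum(obs) / len(obs)
    error = sum((s - o) ** 2 for s, o in zip(sim, obs))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - error / spread


obs = [1.0, 2.0, 3.0, 4.0]
perfect = nse(obs, obs)                      # identical series
mean_only = nse([2.5, 2.5, 2.5, 2.5], obs)   # constant at the observed mean
reversed_fit = nse([4.0, 3.0, 2.0, 1.0], obs)
```

The third case returns a negative value, which is the situation reported above for the satellite-only datasets: the simulation explains the observations less well than their long-term mean would.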

Three hydrological statistics (daily mean discharge, winter low flow, and summer high flow) are further used to compare the daily discharge simulated with the dense-gauge and the nine alternative datasets against the observed discharge. Figure 8 presents the annual results (shown as the relative bias between the simulated discharge of each precipitation dataset and the observed discharge) of the three statistics from 2004 to 2013. The XAJ and SWAT models show similar results for daily mean discharge (Figure 8A); however, SWAT clearly underestimates the other two hydrological statistics (Figure 8B,C), especially the winter low flow. Based on the three statistics, CMORPH BLD consistently performs better than the other satellite-gauge datasets. PERSIANN CDR and CHIRPS, respectively, underestimate and overestimate the observed discharge in both models. Blended MSWEP performs well, although its daily maximum discharge is clearly overestimated in the XAJ model (the results of 8 years are above 0) and underestimated in the SWAT model (the results of 7 years are below 0). As with the previous indices, satellite-only datasets still show the worst performance.


Error Propagation
RB and ubRMSE of streamflow respectively reflect the systematic and random errors of each dataset in simulating the streamflow. Figures 9 and 10 respectively show the RB and ubRMSE of annual, wet season, and dry season streamflow and their corresponding propagation factors (γ_RB and γ_ubRMSE) simulated using CN05 and eight satellite-based precipitation datasets from 2004 to 2013.
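The way a propagation factor is read in the following subsections (values above 1 indicate amplification, values below 1 indicate dampening) is consistent with a simple ratio of the error in streamflow to the error in precipitation. The sketch below assumes exactly that ratio form, with the ubRMSE first divided by the reference mean so that precipitation and streamflow errors are dimensionless before being compared. Both the ratio form and the normalization are assumptions of this example rather than the definition given in the methods section, and all names are hypothetical.

```python
import math


def normalized_ubrmse(est, ref):
    """ubRMSE divided by the reference mean (dimensionless; assumed form)."""
    n = len(ref)
    mean_est, mean_ref = sum(est) / n, sum(ref) / n
    squared = [((e - mean_est) - (r - mean_ref)) ** 2 for e, r in zip(est, ref)]
    return math.sqrt(sum(squared) / n) / mean_ref


def propagation_factor(err_streamflow, err_precip):
    """Ratio of error magnitudes: > 1 amplified, < 1 dampened (assumed form)."""
    if err_precip == 0:
        raise ValueError("precipitation error is zero; factor is undefined")
    return abs(err_streamflow) / abs(err_precip)


# A -20% precipitation bias that becomes a -30% streamflow bias is amplified.
gamma_rb = propagation_factor(-30.0, -20.0)
# A normalized random error of 0.8 in rainfall that becomes 0.4 in runoff is dampened.
gamma_ub = propagation_factor(0.4, 0.8)
```

Under this reading, the factor depends on the magnitude of the error only, so a dataset that underestimates precipitation can still have a factor above 1.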

Systematic Error Propagation
Generally, TRMM, with the smallest RB (systematic error) of annual streamflow, performs the best among all datasets, followed by CMORPH CRT and CMORPH BLD, which display performance comparable to that of CN05 (Figure 9A). The two satellite-only datasets considerably underestimate the annual streamflow. Moreover, their systematic error propagation factors (γ_RB, shown in Figure 9B) are larger than 1, indicating that the systematic error is amplified when precipitation is translated into runoff. TRMM, PERSIANN CDR, and CHIRPS show the same amplification of the systematic error of precipitation, while the γ_RB values of the remaining datasets are around 1.
The RB of streamflow shows a seasonal pattern for all datasets: the range of RB values for the wet season streamflow (Figure 9C) is much smaller than that for the dry season (Figure 9E). This narrow RB range means a smaller inter-annual difference in the wet season. As for the RB values themselves, six out of nine datasets (all datasets except CMORPH BLD, MSWEP, and CHIRPS) show smaller RBs (closer to 0) in the wet season than in the dry season. Thus, the systematic error of precipitation is amplified more strongly in runoff (larger γ_RB) in the dry season than in the wet season for nearly all nine datasets, the exception being the satellite-only datasets (Figure 9D,F).
Moreover, the hydrological models also influence the RB of streamflow. During the wet season, SWAT generally performs much better than XAJ for more than half of the datasets (all datasets except for TRMM, PERSIANN, and CHIRPS; Figure 9C), while during the dry season, XAJ outperforms SWAT for more than half of the datasets (all datasets except for CMORPH CRT and MSWEP; Figure 9E).

Random Error Propagation
Figure 10A demonstrates that the ubRMSE values of streamflow do not differ markedly among the nine datasets, except for the two satellite-only datasets, which have significantly larger values. Among the rest of the datasets, PERSIANN CDR and CHIRPS show the largest random errors of streamflow. CN05 shows the minimum ubRMSE of streamflow, in contrast to its relatively larger ubRMSE of precipitation (as demonstrated in Section 4.1.2). This discrepancy arises because CN05 has the largest dampening effect on random error. All the other eight datasets have similar dampening effects, with γ_ubRMSE smaller than 1 (Figure 10B), and CMORPH BLD along with MSWEP have the largest γ_ubRMSE among these datasets. The ubRMSE of streamflow also has a seasonal pattern, with its values and ranges in the wet season (Figure 10C) larger than those in the dry season (Figure 10E) for all datasets. This seasonal difference also applies to the random error propagation factor γ_ubRMSE (Figure 10D,F). The ubRMSE of the same precipitation dataset also differs between hydrological models (Figure 10C,E). Specifically, SWAT generates a larger ubRMSE than the XAJ model for nearly all datasets (all nine datasets except for PERSIANN CDR and CHIRPS), especially during the wet season (Figure 10D,F).
Satellite-only datasets directly estimate precipitation through PMW or IR sensors [22][23][24]. Their worst performance among all eight satellite-based datasets reflects the limitations of existing remote sensing retrieval algorithms and the need to blend them with gauged measurements to compensate for their limited abilities, such as distinguishing rain particles and handling electromagnetic interference from rough terrain and trees [26]. Theoretically, PMW is more accurate than VIS-IR, because the former physically links the sensor signal to the size and phase of the hydrometeors present within the observed atmospheric column [1,24,91]. However, CMORPH RAW (based primarily on PMW) performs worse than PERSIANN (based mainly on IR), which is the opposite of the findings of some other regional studies in the Asian monsoon regions, such as Japan [97] and the Tibet Plateau [98].
This inconsistent result may reflect the influence of integrating methods (PMW and IR) on the performance of satellite-only datasets and the region-dependent nature of these methods.
Compared with satellite-only datasets (CMORPH RAW and PERSIANN), better performance in both precipitation and hydrological simulations is clearly achieved by their improved satellite-gauge versions (CMORPH CRT, CMORPH BLD, and PERSIANN CDR). This improvement supports the validity of blending algorithms that use gauge precipitation to enhance the precipitation estimates of satellite-only datasets. Satellite-gauge CMORPH BLD outperforms all other satellite-gauge datasets in both precipitation and hydrological simulations, mainly because of the effectiveness of the bias correction and blending algorithms that incorporate the daily precipitation gauge dataset to improve CMORPH RAW [36]. Among blended datasets, MSWEP performs comparably well to CMORPH BLD. The superiority of MSWEP could be mainly explained by two factors: (1) the gauged component used by MSWEP takes up a higher proportion (30.0% to 50.0%) of the final precipitation dataset compared to the other satellite-based datasets; and (2) the reanalysis data used in MSWEP may bring additional information [71,99].
CN05 outperforms all satellite-based datasets in hydrological simulation. This satisfactory performance suggests that CN05 is fully able to act as the proxy of the dense-gauge precipitation dataset in the hydrological simulation in the Xiangjiang River Basin, although it could not act as the reference data to directly evaluate the statistical properties of satellite-based precipitation.
Additionally, the datasets used for model calibration influence the hydrological performance of satellite-based datasets in the validation period, and many studies have suggested recalibrating hydrological models directly using satellite-based datasets [60,100,101]. In this study, however, only the dense-gauge precipitation was used to calibrate the hydrological models, and all satellite-based datasets then used the same set of optimal parameters for hydrological modeling, even though the performance of the satellite-forced model may be degraded as a result. This choice is based on the assumption that the dense-gauge dataset is more accurate than the satellite-based datasets, and it excludes the effects of uncertainty in model parameters on hydrological simulations. In addition, one test based on the XAJ model has been conducted to show that the calibration dataset does not change the relative hydrological performance of these satellite-based datasets; the results are shown in Appendix C as Table A1 (NSE values for both calibration and validation periods) and Figure A1 (mean monthly hydrograph during 2004-2013). Therefore, it is reasonable to compare the performance of each satellite-based dataset in this way. For watersheds where a dense-gauge precipitation dataset is not available, the hydrological model may be calibrated using satellite-based datasets or other gridded datasets.

Conclusions
This study evaluates eight high-resolution satellite-based precipitation datasets (satellite-only: PERSIANN and CMORPH RAW; satellite-gauge: TRMM, PERSIANN CDR, CMORPH CRT, and CMORPH BLD; and blended: MSWEP and CHIRPS) and the gauge-interpolated CN05 dataset, using a dense-gauge dataset as the reference, for hydrological modeling over a monsoon-prone watershed in China. The following conclusions can be drawn:

(1) All satellite-gauge and blended datasets are able to capture the seasonality of precipitation in the study region, even though biases are observed. Specifically, the satellite-gauge CMORPH BLD generally outperforms all other satellite-based datasets, with the smallest detection, systematic, and random errors and the most precise simulation of extreme precipitation. In contrast, the satellite-only datasets perform the worst with respect to almost all the precipitation indices. Although CN05 presents the smallest systematic errors, it cannot be used as the reference data for statistical analysis of the satellite-based datasets because it misses some seasonal local precipitation and has larger random errors and a smaller R².

(2) There are large differences among satellite-gauge datasets in hydrological simulations. Datasets designed to provide the best instantaneous precipitation (TRMM, CMORPH CRT, CMORPH BLD, and MSWEP) perform better than those designed to achieve the most temporally homogeneous record (PERSIANN CDR and CHIRPS). Among the four better-performing datasets, the two that directly incorporate daily gauge data (CMORPH BLD and MSWEP) outperform the two that directly incorporate monthly gauge data (TRMM and CMORPH CRT). The satellite-only datasets (CMORPH RAW and PERSIANN) are the least capable of simulating streamflow and are not recommended for hydrological applications. CN05 outperforms all satellite-based datasets in the hydrological simulation, indicating its capability to act as reference data in hydrological evaluation.
(3) With different model structures, the XAJ and SWAT models perform differently for each satellite-based dataset, and the differences in model performance also depend on the season. Generally, the XAJ model performs better than the SWAT model in terms of random errors of streamflow simulations for both the wet and dry seasons and in terms of systematic errors for the dry season. However, compared with the hydrological model uncertainties, the uncertainties from different satellite-based datasets dominate the uncertainty of hydrological simulation. In other words, the hydrological model structure does not affect the overall performance ranking of satellite-based precipitation datasets in hydrological simulations in this study.

(4) The random error of all datasets shows a general decrease from precipitation to runoff, with γ_ubRMSE smaller than 1, but this does not hold for the systematic error, for which γ_RB varies among datasets. In addition, the season and the hydrological model affect the error propagation from precipitation to streamflow for all datasets. The systematic (γ_RB) and random (γ_ubRMSE) error propagation factors of the wet season are larger than those of the dry season. Systematic errors are amplified more by the XAJ model, whereas random errors are amplified more by the SWAT model.
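The error propagation factors in conclusion (4) can be illustrated with a minimal sketch. It assumes that each factor is the ratio of a streamflow error metric to the same metric computed for precipitation, that the systematic metric is the relative bias, and that the random metric is the unbiased RMSE normalized by the reference mean so that the ratio is dimensionless. These are assumptions made for illustration; the definitions given earlier in the paper take precedence, and all function names and data values here are hypothetical.

```python
import numpy as np

def relative_bias(ref, est):
    """Systematic error: total difference relative to the reference total."""
    ref, est = np.asarray(ref, dtype=float), np.asarray(est, dtype=float)
    return np.sum(est - ref) / np.sum(ref)

def normalized_ubrmse(ref, est):
    """Random error: RMSE after removing the mean bias, divided by the reference mean."""
    ref, est = np.asarray(ref, dtype=float), np.asarray(est, dtype=float)
    anomaly_diff = (est - est.mean()) - (ref - ref.mean())
    return np.sqrt(np.mean(anomaly_diff ** 2)) / ref.mean()

def propagation_factor(metric, p_ref, p_est, q_ref, q_est):
    """Assumed form: |streamflow error| / |precipitation error|.

    A value below 1 means the error is dampened from precipitation to
    streamflow; a value above 1 means it is amplified.
    """
    return abs(metric(q_ref, q_est)) / abs(metric(p_ref, p_est))

# Hypothetical precipitation (mm/day) and streamflow (m^3/s) series.
p_ref = np.array([0.0, 12.0, 30.0, 5.0, 0.0, 8.0])
p_sat = np.array([1.0, 9.0, 38.0, 2.0, 3.0, 10.0])
q_ref = np.array([100.0, 150.0, 300.0, 260.0, 180.0, 170.0])
q_sim = np.array([105.0, 150.0, 330.0, 280.0, 195.0, 185.0])

gamma_rb = propagation_factor(relative_bias, p_ref, p_sat, q_ref, q_sim)
gamma_ubrmse = propagation_factor(normalized_ubrmse, p_ref, p_sat, q_ref, q_sim)
print(f"gamma_RB = {gamma_rb:.3f}, gamma_ubRMSE = {gamma_ubrmse:.3f}")
```

Computing the two factors separately for wet-season and dry-season subsets, and for each hydrological model, would reproduce the kind of comparison summarized above.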
There are still some limitations in this study. For example, the eight satellite-based datasets were compared over only one monsoon-prone watershed, and the conclusions may not hold for other regions. In addition, the differences between using a dense-gauge dataset and satellite-based datasets to calibrate hydrological models were not fully investigated. For some data-scarce regions, the satellite-based datasets may be used directly to calibrate the hydrological models when a dense-gauge dataset is unavailable. Therefore, in future studies, more watersheds from various climate regimes should be used to generalize the conclusions drawn from this study. In addition, the impacts of using different satellite-based datasets to calibrate the hydrological models on hydrological performance also need to be investigated.