Evaluation of Six Satellite and Reanalysis Precipitation Products Using Gauge Observations over the Yellow River Basin, China

Satellite-based and reanalysis products are precipitation data sources with high potential, which may exhibit high uncertainties over areas with a complex climate and terrain. This study aimed to evaluate the accuracy of the latest versions of six precipitation products (i.e., Climate Hazards Group Infrared Precipitation with Stations (CHIRPS) V2.0, gauge-satellite blended (BLD) Climate Prediction Center Morphing technique (CMORPH) V1.0, European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis (ERA) 5-Land, Integrated Multisatellite Retrievals for Global Precipitation Measurement (IMERG) V6 Final, Global Satellite Mapping of Precipitation (GSMaP) near-real-time product (NRT) V6, and Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN)-CDR) over the Yellow River Basin, China. The daily precipitation amounts determined by these products were evaluated against gauge observations using continuous and categorical indices to reflect their quantitative accuracy and capability to detect rainfall events, respectively. The evaluation was first performed at different time scales (i.e., daily, monthly, and seasonal scales), and indices were then calculated at different precipitation grades and elevation levels. The results show that CMORPH outperforms the other products in terms of the quantitative accuracy and rainfall detection capability, while CHIRPS performs the worst. The mean absolute error (MAE), root mean square error (RMSE), probability of detection (POD), and equitable threat score (ETS) increase from northwest to southeast, which is similar to the spatial pattern of precipitation amount. The correlation coefficient (CC) exhibits a decreasing trend with increasing precipitation, and the mean error (ME), MAE, RMSE, POD and BIAS reveal an increasing trend. 
CHIRPS demonstrates the highest capability to detect no-rain events and the lowest capability to detect rain events, while ERA5 has the opposite performance. This study suggests that CMORPH is the most reliable among the six precipitation products over the Yellow River Basin considering both the quantitative accuracy and rainfall detection capability. ME, MAE, RMSE, POD (except for ERA5) and BIAS (except for ERA5) increase with the daily precipitation grade, and CC, RMSE, POD, false alarm ratio (FAR), BIAS, and ETS exhibit a negative correlation with elevation. The results of this study could be beneficial for both developers and users of satellite and reanalysis precipitation products in regions with a complex climate and terrain.

In recent decades, a series of satellite-based and reanalysis precipitation products have been developed, such as the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) [17] and its successor (Integrated Multisatellite Retrievals for Global Precipitation Measurement (IMERG)) [18], Climate Hazards Group Infrared Precipitation with Stations (CHIRPS) [19], Climate Prediction Center Morphing (CMORPH) [20], Global Satellite Mapping of Precipitation (GSMaP) [21], Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) [22], Global Precipitation Climatology Centre (GPCC) [23], European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis (ERA) [24], and Climate Forecast System Reanalysis (CFSR) [25]. Due to differences in the gauge data, the raw satellite data from different types of sensors (e.g., passive microwave (PMW) or near-infrared ray (NIR) sensors) and the variations in algorithms used to generate the datasets, these products, with various spatial and temporal resolutions, exhibit discrepant performances according to previous evaluation studies worldwide [26][27][28][29]. A comparison of CHIRPS V2.0, PERSIANN-CDR, and TRMM 3B42 V7, for example, reveals that CHIRPS performs best in South America and tropical Africa, CHIRPS and TRMM have comparable performance in southeast Asia, and that PERSIANN-CDR is not recommended in tropical forest areas [30]. TRMM, CMORPH, PERSIANN and ECMWF all underestimate rainfall over northwest Europe, and ECMWF performs well in winter and performs poorly in summer, while the other products have the opposite performance [31]. IMERG V05 Early run outperforms ERA5-HRES (high resolution) in regions dominated by convective storms, while the opposite was observed in regions with complex terrain [32].
However, satellite-based and reanalysis products have exhibited poor performance in regions with complex terrain because of the high spatial and temporal variability in precipitation and the lack of rain gauge observations [33,34]. The bias of the real-time version of TMPA (TMPA-RT) and PERSIANN over the Tibetan Plateau depends on topography, and can be explained mostly by the variability of elevation and surface roughness [35]. The IMERG and GSMaP products attain a good detection ability in southern China, except for the Sichuan Basin, where the topography and climate types are extremely complex [36]. Therefore, it is important to evaluate precipitation products before specific applications, especially in regions with a complex topography [37,38].
As the origin of Chinese civilization and an important area of economic development and ecological protection, the Yellow River originates in the Tibetan Plateau and flows across plateaus, mountains, and plains, with the Yellow River Basin covering arid, semi-arid, semi-humid and humid climate zones. The complex terrain and climate over the basin result in unevenly (spatially and temporally) distributed precipitation. In recent decades, the Yellow River Basin has experienced a decreasing annual precipitation and an increasing frequency and amount of short-duration consecutive precipitation events, which could aggravate water resource shortages and lead to higher risks of flooding in summer and drought in spring and autumn, especially in the middle reaches [39]. Thus, choosing an accurate precipitation dataset is vital for related studies in this basin.
To date, evaluation studies of precipitation products over the Yellow River Basin have been performed over mainland China [40][41][42][43], the whole basin [44,45], or subregions of the basin [37,46,47], and have mainly focused on earlier versions of precipitation products [43,48]. However, various precipitation products have been updated (e.g., IMERG V6 morphs microwave estimates to decrease the gaps at high latitudes and incorporates SAPHIR estimates, CMORPH V1.0 extends the data period, updates the estimation algorithm, and combines gauge data with satellite data). Hence, a comparative evaluation of the latest precipitation products is necessary before specific applications.
The aims of this study are: (1) to evaluate the accuracy of the latest versions of six widely applied daily precipitation products (i.e., IMERG, CHIRPS, CMORPH, ERA5, GSMaP, and PERSIANN) at multiple temporal scales, and (2) to analyze the accuracy at different precipitation grades and elevation levels.

Study Area
The Yellow River, with a length of 5464 km and a drainage area of 75.2 × 10⁴ km², is the second longest river in China and the sixth longest river globally [49]. The landforms over the basin vary, including plateaus, mountains, and plains. The climate in this basin is mainly arid and semi-arid continental monsoon, and the basin thus covers arid, semi-arid, semi-humid and humid climate zones. The average annual precipitation is between 123 and 1021 mm [50] and is distributed unevenly in both the spatial and temporal domains. The amount of rainfall increases from the northwest to the southeast, with most rainfall occurring between May and October [51]. The average annual temperature ranges from −4 to 14 °C across the basin [50]. The runoff from the upper reaches accounts for half of the total runoff discharge of the entire Yellow River, and the middle reaches are the main source of sediment [52,53]. The farmland in this basin accounts for approximately 15% of the cropland and 14% of the GDP of China [54].

Gauged Precipitation Data
Daily gauged precipitation data were extracted from the Dataset of Daily Climate Data from the Chinese Surface Stations for Global Exchange (V3.0) of the China Meteorological Data Service Center (CMDC) (http://data.cma.cn). This dataset includes daily precipitation data from 840 Chinese surface stations since 1951. There are 95 stations within the Yellow River Basin, which are not evenly distributed, especially in the lower reaches and the source region (Figure 1).

Satellite-Based and Reanalysis Precipitation Products
Six satellite-based and reanalysis precipitation products were selected to be evaluated in this study, with the basic information of these datasets listed in Table 1. Because of the data accessibility, the daily precipitation from 2001 to 2018 was chosen to be assessed. To make a comparison, the spatial and temporal resolutions of the products were unified to 0.05° and the daily scale, respectively. A nearest-neighbor interpolation method was adopted, and the target resolution of 0.05° is the same as that of CHIRPS; this avoids losing information from CHIRPS and leaves the values of the other products unchanged.
CMORPH was created based on PMW and half-hour interval infrared (IR) data. Version 1.0 includes raw (satellite-only precipitation estimates), bias-corrected (CRT) and gauge-satellite blended (BLD) precipitation products. A bias correction was first performed on the raw satellite-based precipitation estimates by matching their probability density function (PDF) with that of the gauge observations, resulting in the CRT dataset, and then the CRT was modified by combining it with gauge data through an optimal interpolation (OI) technique to generate the BLD version [55,59]. The BLD version (hereafter referred to as CMORPH) was evaluated in this study because previous evaluation studies demonstrated that BLD outperforms the raw version in the Lake Titicaca region, South America [60], and performs better than CRT in the Huaihe and Fujiang River Basins, China [61][62][63].
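The nearest-neighbor resampling to a common 0.05° grid described at the start of this section can be illustrated with a minimal sketch. It is not the processing chain used in the study: the function name, the toy grid, and the cell-center coordinates are all assumptions made for illustration. The point it shows is that each fine cell simply copies the value of the closest coarse cell, so no new values are introduced.

```python
import numpy as np

def nearest_neighbor_regrid(values, src_lat, src_lon, dst_lat, dst_lon):
    """Resample a 2-D field onto another grid by nearest-neighbor lookup.

    values has shape (len(src_lat), len(src_lon)); all coordinates are
    assumed to be cell centers in degrees (hypothetical convention).
    """
    values = np.asarray(values)
    src_lat = np.asarray(src_lat, dtype=float)
    src_lon = np.asarray(src_lon, dtype=float)
    dst_lat = np.asarray(dst_lat, dtype=float)
    dst_lon = np.asarray(dst_lon, dtype=float)
    # For every target center, find the index of the closest source center.
    lat_idx = np.abs(dst_lat[:, None] - src_lat[None, :]).argmin(axis=1)
    lon_idx = np.abs(dst_lon[:, None] - src_lon[None, :]).argmin(axis=1)
    # Copy the source values; nothing is averaged or interpolated.
    return values[np.ix_(lat_idx, lon_idx)]

# Toy example: a 2 x 2 block of 0.25-degree cells mapped to 0.05-degree cells.
src_lat = np.array([35.125, 35.375])
src_lon = np.array([100.125, 100.375])
coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
dst_lat = 35.025 + 0.05 * np.arange(10)
dst_lon = 100.025 + 0.05 * np.arange(10)
fine = nearest_neighbor_regrid(coarse, src_lat, src_lon, dst_lat, dst_lon)
```

In this toy case every 0.25° cell is replicated into a 5 × 5 block of 0.05° cells, so the set of values in the fine grid is identical to that in the coarse grid.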
PERSIANN-CDR is a daily precipitation product with a spatial resolution of 0.25° and is generated by the Center for Hydrometeorology and Remote Sensing at the University of California. The 3 h GridSat-B1 infrared window (IRWIN) brightness-temperature data from Geostationary Earth orbiting (GEO) satellites are the input data of the PERSIANN model to yield a primary dataset, and the monthly Global Precipitation Climatology Project (GPCP) version 2.2 was adopted to correct the bias of this dataset [56].
GPM IMERG is an extension of TRMM, with an improved capability to measure light rainfall and snowfall at middle and high latitudes. Version 6 is the current version of the dataset, and it provides three types of products: early, late and final runs [57]. The IMERG final run (hereafter referred to as IMERG) is selected to be evaluated in this study because the latest final version performs the best over complex terrain regions such as the Sichuan Basin of China [64], similar to earlier final versions across China [29,65].
Atmosphere 2020, 11, 1223 5 of 20 ERA5 is the atmospheric reanalysis dataset for the global climate by the ECMWF. ERA5-Land is the land component of the ERA5 climate reanalysis dataset and contains a series of improvements (e.g., higher spatial resolution), making it more suitable for all types of land applications, e.g., evaporation, precipitation, and runoff [58]. Total precipitation data (hereafter referred to as ERA5) at an hourly resolution are resampled to a daily scale to be evaluated in this study.
GSMaP is a collection of hourly global rainfall maps provided by the Japan Aerospace Exploration Agency (JAXA, Tokyo, Japan). Precipitation data are obtained using PMW sensors in low-Earth orbit and IR radiometers in geostationary Earth orbit [21]. GSMaP is mainly composed of four products, i.e., standard product (GSMaP-MVK), near-real-time product (GSMaP-NRT), real-time product (GSMaP_NOW), and reanalysis product (GSMaP_RNL), and each product has an additional gauge-adjusted version. The gauge-adjusted GSMaP-NRT product version 6 (hereafter referred to as GSMaP) is selected to be evaluated in this study.
CHIRPS was primarily developed for agricultural drought monitoring by scientists at the Climate Hazards Group, University of California, Santa Barbara, and the USGS Famine Early Warning Systems Network (FEWS NET). The dataset is quasi-global, with a high spatial resolution of 0.05° and a long data period (1981 to the near present). The dataset is available at multiple temporal resolutions, i.e., 6 h, daily, pentadal, and monthly resolutions. In the data generation process, the TMPA 3B42 V7 satellite precipitation product and IR cold cloud duration (CCD) observations are used for calibration, and GPCC monthly product and independent stations are employed to perform the validation. The dataset of version 2.0 [19] (hereafter referred to as CHIRPS) is evaluated in this study.

Digital Elevation Model Data
The ASTER GDEM V2 dataset with a resolution of 1 arc-second (~30 m) was downloaded from the Geospatial Data Cloud site, Computer Network Information Center, Chinese Academy of Sciences (http://www.gscloud.cn).

Evaluation Indices
Daily precipitation values of the above precipitation products at the pixels where gauge stations are located were extracted for comparison against gauge data in this study, and four continuous and four categorical indices were selected to quantify the performances of the products (Table 2). The four continuous indices (i.e., correlation coefficient (CC), mean error (ME), mean absolute error (MAE), and root mean square error (RMSE)) measure the quantitative accuracy of the precipitation products, and the four categorical indices (i.e., probability of detection (POD), false alarm ratio (FAR), BIAS, and equitable threat score (ETS)) reflect the capability of the products to detect rainfall events. CC represents the correlation between the observed and estimated precipitation. ME measures the average difference between the estimated and observed rainfall, where a negative ME indicates rainfall underestimation by the product, and a positive ME indicates rainfall overestimation. Both MAE and RMSE denote the magnitude of the difference between the observed and estimated rainfall, and RMSE assigns higher weights to large errors. POD is the ratio of the correctly detected rain events to the observed rain events. FAR is the ratio of the detected rain events for which no rain was observed to all detected rain events. A high POD indicates that the product exhibits a good capability to correctly detect rain events, and a low FAR indicates that the product attains a low possibility of incorrectly detecting no-rain events as rain events. The BIAS is the ratio of the estimated rain frequency to the observed rain frequency. If the BIAS is lower than 1, the number of satellite-detected rain events is smaller than the observed number of rain events. ETS is a more comprehensive index to assess the capability of satellites to detect rainfall events. 
An ETS value of 1 implies that the satellite detects all observed rain events without any misses or false detection results, and an ETS value of −1/3 implies that the satellite incorrectly detects all rain events as no-rain and detects all no-rain events as rain events. Both types of indices should be considered to determine the performance of a given precipitation product. A reliable product should attain both a high quantitative accuracy and rainfall detection capability. The rainfall threshold is set to 0 with regard to the daily precipitation, and 0.1 mm/day with respect to light rainfall.
Table 2. Continuous and categorical indices to quantify the performance of the precipitation products.

Evaluation Indices Equation Range of Values Perfect Value
Correlation coefficient
Notation: N is the total number of daily records, S_i and G_i are the estimated and gauged precipitation, respectively, S̄ and Ḡ are their mean values, respectively, and H, M, F, and C are the four events indicating whether the satellite has correctly detected an observed rainfall event (as summarized in Table 3), where H indicates that the observed rain event has been correctly detected by the satellite, M indicates that the satellite has missed the occurring rain event, F indicates that the satellite has incorrectly detected a no-rain event as rain, and C indicates that there is no rain and the satellite has detected none. Moreover, h, m, f, and c are the numbers of H, M, F, and C, respectively.
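The equations of Table 2 are not reproduced in the text above, so the following is only a sketch of the eight indices under their commonly used definitions, which agree with the verbal descriptions and the notation given here (S for estimates, G for gauge values, and h, m, f, c for the event counts). The function names and the toy series are hypothetical, and the rain threshold of 0 mm follows the value stated for daily precipitation.

```python
import numpy as np

def continuous_indices(sat, gauge):
    """CC, ME, MAE and RMSE between estimated (sat) and gauged precipitation."""
    sat = np.asarray(sat, dtype=float)
    gauge = np.asarray(gauge, dtype=float)
    diff = sat - gauge  # positive values mean overestimation by the product
    return {
        "CC": float(np.corrcoef(sat, gauge)[0, 1]),
        "ME": float(diff.mean()),
        "MAE": float(np.abs(diff).mean()),
        "RMSE": float(np.sqrt((diff ** 2).mean())),
    }

def categorical_indices(sat, gauge, threshold=0.0):
    """POD, FAR, BIAS and ETS from the H/M/F/C contingency counts.

    A day counts as rain when the value exceeds `threshold` (mm/day).
    No guard is included for empty categories (division by zero).
    """
    sat_rain = np.asarray(sat, dtype=float) > threshold
    gauge_rain = np.asarray(gauge, dtype=float) > threshold
    h = float(np.sum(sat_rain & gauge_rain))    # hits
    m = float(np.sum(~sat_rain & gauge_rain))   # misses
    f = float(np.sum(sat_rain & ~gauge_rain))   # false alarms
    c = float(np.sum(~sat_rain & ~gauge_rain))  # correct negatives
    n = h + m + f + c
    h_random = (h + m) * (h + f) / n            # hits expected by chance
    return {
        "POD": h / (h + m),
        "FAR": f / (h + f),
        "BIAS": (h + f) / (h + m),
        "ETS": (h - h_random) / (h + m + f - h_random),
    }

# Toy series of six days (values in mm/day, invented for illustration).
gauge = np.array([0.0, 0.0, 1.2, 5.0, 0.0, 10.0])
sat = np.array([0.0, 0.5, 1.0, 6.0, 0.0, 0.0])
cont = continuous_indices(sat, gauge)
cat = categorical_indices(sat, gauge)
```

For the toy series there are two hits, one miss, one false alarm and two correct negatives, so the number of detected rain days equals the number of observed rain days and BIAS is 1 even though one event is missed and one is falsely detected. This is why BIAS is read together with POD, FAR and ETS.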

Evaluation Experimental Design
The evaluation of the above precipitation products was conducted in terms of the daily rainfall at the daily, monthly, and seasonal scales. First, the indices were calculated across all stations to obtain the overall accuracy of each precipitation product over the Yellow River Basin, after which the indices at each station were calculated to better understand their spatial patterns.
In addition, to obtain a better understanding of the performance of the products at various precipitation intensities and elevations, the evaluation indices were calculated in this study at different daily precipitation grades (i.e., 0 mm, 0-0.1 mm, 0.1-1 mm, 1-2 mm, 2-5 mm, 5-10 mm, 10-20 mm, 20-40 mm, and more than 40 mm) and elevation levels (i.e., 0-500 m, 500-1000 m, ..., 4500-5000 m). The classification of precipitation grades follows the evaluation of precipitation products across the globe and mainland China by Chen et al. (2020) [66]. The elevation levels were divided into equal intervals of 500 m according to previous studies in China [36] and the Awash River basin of Ethiopia [67], and according to the distribution of gauge stations with elevation in the study area.
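The grouping by precipitation grade and elevation level described above can be sketched as follows. The grade edges come from the text, but the treatment of the boundaries is an assumption (each grade is taken as open on the left and closed on the right, and each elevation level as closed on the left), as are the function names and the toy values. Only the mean error is computed per grade to keep the example short.

```python
import numpy as np

# Upper edges of the first eight grades in mm/day; values above 40 fall in the ninth.
GRADE_EDGES = np.array([0.0, 0.1, 1.0, 2.0, 5.0, 10.0, 20.0, 40.0])
GRADE_LABELS = ["0", "0-0.1", "0.1-1", "1-2", "2-5", "5-10", "10-20", "20-40", ">40"]

def grade_index(gauge_mm):
    """Map gauged daily precipitation (non-negative) to a grade number 0..8."""
    # right=True gives edges[i-1] < x <= edges[i], so exactly 0 mm maps to grade 0.
    return np.digitize(np.asarray(gauge_mm, dtype=float), GRADE_EDGES, right=True)

def elevation_level(elev_m, width=500.0):
    """Map station elevation in metres to a level number (0 for 0-500 m, and so on)."""
    return (np.asarray(elev_m, dtype=float) // width).astype(int)

def mean_error_by_grade(sat, gauge):
    """Mean error (estimate minus gauge) within each precipitation grade."""
    sat = np.asarray(sat, dtype=float)
    gauge = np.asarray(gauge, dtype=float)
    idx = grade_index(gauge)
    result = {}
    for k, label in enumerate(GRADE_LABELS):
        sel = idx == k
        # Grades without any record are reported as NaN.
        result[label] = float(np.mean(sat[sel] - gauge[sel])) if sel.any() else float("nan")
    return result

# Toy records, one or two per grade (invented for illustration).
gauge = np.array([0.0, 0.0, 0.05, 0.5, 1.5, 3.0, 7.0, 15.0, 30.0, 50.0])
sat = np.array([0.2, 0.0, 0.10, 0.4, 1.5, 2.0, 9.0, 12.0, 25.0, 35.0])
me = mean_error_by_grade(sat, gauge)
```

Because the records are grouped by the gauged amount, the 0 mm grade contains only no-rain days, which is why a product can only overestimate (or match) precipitation in that grade.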

Basic Statistics and Distribution of the Daily Precipitation
The average daily precipitation during the study period was calculated for each gauge station and each pixel of the six products. Table 4 lists their statistics, Figure 2 shows their spatial distributions, while Figure 3 depicts the histogram and density curve of the average daily precipitation recorded by the gauge stations during the study period and the density curves of the average daily precipitation over the whole basin for the products. The statistics in Table 4 indicate that ERA5 yields the highest average daily precipitation over the Yellow River Basin (with a mean of 1.66 mm), followed by CHIRPS (with a mean of 1.35 mm), CMORPH (with a mean of 1.33 mm), GSMaP and PERSIANN (with a mean of 1.33 mm), and IMERG (with a mean of 1.30 mm). With regard to the range and deviation of the average daily precipitation, IMERG attains the narrowest range (0.41-2.33 mm), while ERA5 exhibits the widest range (0.41-4.16 mm), and the range of PERSIANN (0.38-3.02 mm) is close to that of the gauge stations (0.38-2.91 mm). IMERG exhibits the smallest standard deviation (0.40 mm), and ERA5 attains the largest standard deviation (0.58 mm), while the standard deviation of CMORPH (0.43 mm) is close to that of the gauge stations (0.43 mm). The shapes of the density curve of the average daily precipitation in Figure 3 reveal that the CHIRPS's and CMORPH's density curves are more similar to that of the gauge stations than those of the other products, and the average daily precipitation of ERA5 is obviously more concentrated at precipitation intervals larger than 1.5 mm, which might lead to precipitation overestimation.
The spatial patterns of the average daily precipitation based on the precipitation products are similar throughout the study period, and they are consistent with those based on the gauge stations (Figure 2). The average daily precipitation gradually increases from the northwest to the southeast. High daily precipitation mainly occurs in the lower reaches and the southern source region, and low precipitation occurs in the upper reaches in the northwestern areas. The average daily precipitation values of ERA5 are higher than those of the other products in the west of the source region and middle reaches.

Evaluation at the Daily Scale
With regard to the continuous indices, CMORPH performs the best, with the lowest MAE and RMSE, a high CC and low ME, followed by ERA5 and IMERG, while CHIRPS performs the worst, with the lowest CC and the highest MAE and RMSE (Table 5). Compared to the observed data, all products yield an overestimation, and ERA5 has the greatest overestimation (Table 5). In terms of the categorical indices, ERA5 yields the highest POD and FAR, a BIAS much higher than 1, and the lowest ETS, because it correctly detects the most rain events (30.15% of the total daily records) but also falsely detects the most no-rain events as rain (63.52% of the total daily records); the number of falsely detected rain events is much larger than the number of missed rain events (0.32% of the total daily records) (i.e., f is much larger than m, as defined in Table 2) (Figure S3), which results in a low ETS. CHIRPS yields the lowest POD and a BIAS much lower than 1, because it correctly detects the fewest rain events (8.34% of the total daily records), and the number of falsely detected rain events (9.78% of the total daily records) is much smaller than the number of missed rain events (22.13% of the total daily records) (i.e., f is much smaller than m, as defined in Table 2). Additionally, it correctly detects the most no-rain events (59.75% of the total daily records). CMORPH and GSMaP exhibit the lowest FAR, high PODs (0.72 and 0.58, respectively), and BIASs close or equal to 1 (1.24 and 1, respectively), thus resulting in the highest ETSs. IMERG and PERSIANN exhibit a moderate rainfall detection capability. Both falsely detect more no-rain events as rain events than they miss rain events, resulting in a BIAS higher than 1 (Figure S3). 
Therefore, the POD, FAR and ETS of ERA5 give conflicting indications of its reliability, while CMORPH yields a POD lower than that of ERA5, a low FAR, a BIAS close to 1, and the highest ETS, revealing that it performs the best. Among the six products, CMORPH attains the best quantitative accuracy and rainfall detection capability, and it is the most reliable product to select in the Yellow River Basin.
Figure 4 shows the spatial distribution of the evaluation indices at the gauge stations over the Yellow River Basin, and Figure 5 shows boxplots of these indices. High CCs are mainly located in the middle reaches, and low CCs occur in the upper reaches. MAE, RMSE, POD, and ETS increase from northwest to southeast, which is consistent with the pattern of the daily precipitation. High FARs mainly occur in the upper reaches, and low FARs are observed in the source region and southern middle reaches, which indicates that incorrect rainfall detection occurs more severely in arid areas than in humid areas. Across the basin, CMORPH attains the highest quantitative accuracy (with the lowest MAEs and RMSEs, high CCs and low MEs) and highest rainfall detection capability (with the highest ETSs, high PODs, low FARs and BIASs close to 1). Hence, CMORPH is considered to be the most reliable product in this basin (Figure 5).
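The percentages of hits, misses and false alarms quoted above for ERA5 and CHIRPS at the daily scale can be turned into categorical indices with a few lines of arithmetic. The sketch below assumes the commonly used definitions of POD, FAR, BIAS and ETS (the equations of Table 2 are not reproduced in the text), and the share of correct negatives for ERA5 is not quoted, so it is taken here as the remainder to 100%. The resulting numbers are derived for illustration and are not the values reported in Table 5.

```python
def categorical_scores(h, m, f, c):
    """POD, FAR, BIAS and ETS from shares (or counts) of H, M, F and C events."""
    n = h + m + f + c
    h_random = (h + m) * (h + f) / n  # hits expected by chance
    return {
        "POD": h / (h + m),
        "FAR": f / (h + f),
        "BIAS": (h + f) / (h + m),
        "ETS": (h - h_random) / (h + m + f - h_random),
    }

# Shares of the total daily records, in percent, as quoted in the text.
# ERA5: c is an assumption, computed as the remainder to 100%.
era5 = categorical_scores(h=30.15, m=0.32, f=63.52, c=100.0 - (30.15 + 0.32 + 63.52))
# CHIRPS: all four shares are quoted in the text.
chirps = categorical_scores(h=8.34, m=22.13, f=9.78, c=59.75)

for name, scores in (("ERA5", era5), ("CHIRPS", chirps)):
    print(name, {key: round(value, 2) for key, value in scores.items()})
```

Under these assumptions ERA5 has a POD of about 0.99 and a BIAS of about 3.07, while CHIRPS has a POD of about 0.27 and a BIAS of about 0.59. ERA5's ETS nevertheless comes out lower than that of CHIRPS, because most of its hits are offset by the hits expected from detecting rain on nearly every day.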
Compared to previous studies, the RMSEs, FARs, PODs and BIASs of CHIRPS and IMERG in this study are close to those reported by Yu (2020) [40] in the Yellow River Basin, but the CCs are lower in this study, probably because different study periods (2015-2018 in Yu and 2001-2018 in this study) were selected, which may cause a difference between the gauge observations when evaluating the products. GSMaP slightly overestimates the daily precipitation in the lower reaches of the Yellow River Basin, which is similar to the results of Ning et al. (2017) [28]. PERSIANN underestimated the rainfall amount across the Wei River Basin (a subregion in the southern middle reaches of the Yellow River Basin) in a previous study, which is consistent with the spatial pattern of ME in this study [47]. CMORPH has the best capability for detecting rainfall, with high PODs and low FARs in this basin, similar to the results in the Huaihe River Basin, which is due to the application of PDF-OI algorithms in the dataset generation process [61,68]. The POD and FAR of CMORPH and GSMaP increase from the northwest to the southeast, which is similar to the result reported by Yang et al. (2020) [44].

Evaluation at Monthly and Seasonal Scales
Values of the evaluation indices of the products' daily precipitation were calculated month by month and season by season across all stations and at each station (Figures 6 and 7), in the same way as at the daily scale. Throughout the year, ERA5 attains the highest CCs for most months, except August and October, followed by CMORPH and IMERG. Among all products, CCs from September to November are much higher than those in the other months, resulting in the highest CCs in autumn. With regard to CMORPH, ERA5 and PERSIANN, precipitation is overestimated for all months and seasons (except that PERSIANN underestimates precipitation in August). CHIRPS and IMERG underestimate precipitation in winter and overestimate precipitation in the other seasons, while GSMaP underestimates precipitation in summer and overestimates precipitation in the other seasons. MAE and RMSE first increase and then decrease throughout the year, with peak values occurring in July, which is consistent with the monthly precipitation. CHIRPS exhibits the highest MAEs and RMSEs from May to September, while these two indices of CMORPH reveal the lowest values throughout the year.
Figure 6 shows that the PODs of ERA5 are approximately 1, which means that almost all the observed rain events are detected in all months. The PODs of CHIRPS increase from January to March, then decrease from April to October and finally increase, resulting in high PODs from January to April and low PODs from March to December. The PODs of the other products first increase from January to April, then change slightly from May to September and finally decrease, resulting in high PODs from May to September and low PODs in December, January and February. The FARs of the products first decrease from January to July, then remain quite stable from July to September and finally increase from October to December. 
The months with low FARs are from June to September, which are all months with large rainfall amounts, while December and January have low rainfall, resulting in high FARs. For all products, the FARs in summer are the lowest, followed by those in autumn and spring, and winter has the highest FAR (Figure 7). ERA5 attains the highest FARs from March to November. The BIASs of the products first decrease and then increase. ERA5 exhibits the highest BIASs for all months, which are much lower from May to September. Regarding the other products, the BIASs are low from May to October. The ETSs of the products first increase and then decrease, and the highest ETSs occur in July or August except for ERA5 and CMORPH. GSMaP reaches the highest ETSs from March to November, and ERA5 has the lowest ETSs from March to November. Generally, MAE, RMSE, POD, and ETS are consistent with the monthly and seasonal rainfall, and FAR and BIAS reveal the opposite trend to the rainfall trend. The summer months (e.g., June and July) with much rainfall result in a low quantitative accuracy and high rainfall detection capability, which provide reliable precipitation estimates.
The spatial patterns of the evaluation indices in terms of the products' daily precipitation at the monthly and seasonal scales are shown in Figures S1 and S2, respectively. Similar to the patterns at the daily scale, MAE, RMSE, POD, and ETS increase from the northwest to the southeast, which is consistent with the precipitation distribution, but FAR reveals the opposite spatial pattern to that of precipitation. Seasonal characteristics can also be summarized from the spatial pattern of the indices. The CCs in autumn are obviously higher than those in the other seasons across the basin, the MAE and RMSE are visibly high in summer and low in winter, and the FAR is remarkably lower in summer than in the other seasons. 
With regard to the products, the CCs, PODs and BIASs of ERA5 and the BIASs and PODs of CHIRPS throughout the year are prominently different from those of the other products.
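As an illustration of how the monthly continuous indices discussed above can be obtained, the following minimal Python sketch groups paired daily gauge and product values by calendar month and computes CC, ME, MAE and RMSE for each group. It assumes the standard definitions of these indices, with ME taken as the product value minus the gauge value so that positive values indicate overestimation. The function names and the sample values are hypothetical and are not taken from this study.

```python
import math
from collections import defaultdict


def continuous_indices(obs, est):
    """Return CC, ME, MAE and RMSE for paired gauge (obs) and product (est) values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_e = sum(est) / n
    diffs = [e - o for o, e in zip(obs, est)]  # product minus gauge
    me = sum(diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    # CC is undefined when either series is constant.
    if var_o > 0 and var_e > 0:
        cc = cov / math.sqrt(var_o * var_e)
    else:
        cc = float("nan")
    return {"CC": cc, "ME": me, "MAE": mae, "RMSE": rmse}


def indices_by_month(records):
    """records: iterable of (month, gauge_mm, product_mm) daily tuples."""
    groups = defaultdict(lambda: ([], []))
    for month, obs, est in records:
        groups[month][0].append(obs)
        groups[month][1].append(est)
    return {m: continuous_indices(o, e) for m, (o, e) in sorted(groups.items())}


# Made-up daily values for illustration only (not data from the study).
sample = [
    (1, 0.0, 0.5), (1, 2.0, 1.0), (1, 4.0, 4.5),
    (7, 10.0, 8.0), (7, 0.0, 2.0), (7, 20.0, 14.0),
]
monthly = indices_by_month(sample)
for month, scores in monthly.items():
    print(month, {k: round(v, 3) for k, v in scores.items()})
```

The same grouping could be applied with a season key instead of a month key to obtain seasonal indices.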
Compared to previous studies, the monthly variations in ME and RMSE for PERSIANN and CHIRPS are similar to the results obtained by Wei et al. (2019) [45], but CCs are the highest in spring and slightly lower in autumn in the study of Wei et al., while CCs are the highest in autumn and slightly lower in spring in this study. This difference probably arises from the use of different validation datasets: a gridded dataset from 405 meteorological stations was used for product evaluation by Wei et al. (2019) [45], whereas gauge observations at 95 stations are considered in this study, which may lead to minor differences between the spring and autumn results. The seasonal performance order of CC, RMSE, POD and FAR for CHIRPS and IMERG matches the results reported by Yu et al. (2020) [40]. The monthly variations in ME and RMSE for CMORPH and GSMaP are similar to the results obtained by Yang et al. (2020) [44]. Few studies have evaluated ERA5-Land precipitation estimates over the Yellow River Basin. ERA5 overestimates precipitation under all wetness and slope classes in Amjad's research [69] conducted in Turkey, which is similar to the results obtained in this study.

Figure 8 shows the evaluation indices for the nine daily precipitation grades. CC generally increases with increasing precipitation for all products when the daily precipitation is larger than 1 mm, and ERA5 reveals the highest CCs when the daily precipitation is less than 10 mm. CMORPH yields the highest CCs when the daily precipitation is larger than 10 mm, while CHIRPS attains the lowest CCs at all grades. At the light rain grade (0-0.1 mm/day), ERA5 shows the highest CCs, the lowest RMSE and a low ME and MAE, which implies that it exhibits a better capability to detect light rain than the others.
All products slightly overestimate precipitation at low precipitation grades (daily rainfall lower than 2 mm) and underestimate it at high precipitation grades (daily rainfall higher than 5 mm), which is similar to the performance of IMERG-F and GSMaP-N over mainland China [36]. The magnitudes of ME, MAE, RMSE, POD (except for ERA5) and BIAS (except for ERA5) increase with the daily precipitation grade, which is similar to previous studies [36]. When the daily rainfall exceeds 0, the POD and BIAS calculation equations are the same. The associated values are lower than 1 because the observed rain events are missed to a certain extent. The PODs and BIASs of ERA5 are approximately 1 because it misses few rain events (Figure S3).

The numbers of H, M, F and C events were counted at each daily precipitation grade (Figure S3). No-rain days accounted for 69.53% of the total daily records at the 95 stations across the Yellow River Basin throughout the study period.
Hence, it is important to clarify the precipitation products' performance on no-rain days. The products' C event proportions indicate that all products underestimate the proportion of no-rain days in the total daily records (Figure S3), which is similar to previous studies. With regard to no-rain events, CHIRPS correctly detects the most no-rain events and falsely detects the fewest rain events, indicating the highest capability to detect no-rain events. ERA5 correctly detects the fewest no-rain events and falsely detects the most rain events, thus performing the worst. GSMaP and CMORPH reveal a slightly lower capability than that of CHIRPS but perform much better than IMERG and PERSIANN. In terms of rainy days, ERA5 correctly detects the most rain events and misses the fewest rain events at all daily precipitation grades, while CHIRPS correctly detects the fewest rain events and misses the most rain events. The order of the H event proportions remains consistent when the daily rainfall is between 0.1 and 20 mm, i.e., ERA5 > IMERG > CMORPH > PERSIANN > GSMaP > CHIRPS. Figure 8 shows that in the Yellow River Basin, the products overestimate precipitation on no-rain days and days with low precipitation grades (less than 2 mm) and underestimate it at high precipitation grades (more than 5 mm), which defines the overall performance of the products. The characteristics of the precipitation across the basin vary according to gauge observations.
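The counting of H, M, F and C events and the categorical indices derived from them can be sketched as follows. This is a minimal example that assumes the standard contingency-table definitions (POD = H/(H+M), FAR = F/(H+F), BIAS = (H+F)/(H+M), and ETS corrected for random hits) and an assumed rain/no-rain threshold of 0.1 mm/day; the threshold, the function names and the sample values are illustrative assumptions rather than the exact settings of this study.

```python
def contingency_counts(obs, est, threshold=0.1):
    """Count hits (H), misses (M), false alarms (F) and correct negatives (C).

    threshold is an assumed rain/no-rain limit in mm/day.
    """
    h = m = f = c = 0
    for o, e in zip(obs, est):
        obs_rain = o >= threshold
        est_rain = e >= threshold
        if obs_rain and est_rain:
            h += 1
        elif obs_rain:
            m += 1
        elif est_rain:
            f += 1
        else:
            c += 1
    return h, m, f, c


def categorical_indices(h, m, f, c):
    """Return POD, FAR, BIAS and ETS from the four contingency counts."""
    nan = float("nan")
    n = h + m + f + c
    pod = h / (h + m) if (h + m) else nan
    far = f / (h + f) if (h + f) else nan
    bias = (h + f) / (h + m) if (h + m) else nan
    # Hits expected by chance, used by the equitable threat score.
    h_random = (h + m) * (h + f) / n if n else nan
    denom = h + m + f - h_random if n else 0.0
    ets = (h - h_random) / denom if denom else nan
    return {"POD": pod, "FAR": far, "BIAS": bias, "ETS": ets}


# Made-up daily values for illustration only (not data from the study).
gauge = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 8.0, 0.5, 12.0]
product = [0.0, 0.2, 0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.4, 9.0]

counts = contingency_counts(gauge, product)
scores = categorical_indices(*counts)
print(counts)  # (3, 1, 2, 4)
print(scores)
```

In this toy sample, the product raises two false alarms on the six no-rain days and misses one of the four rain days, which gives POD = 0.75, FAR = 0.4, BIAS = 1.25 and ETS = 0.25.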
The source region has the largest number of rainy days, and daily precipitation varies in a relatively narrow range, resulting in the most days with high precipitation grades (more than 5 mm) and the lowest average daily precipitation. In the upper reaches, no-rain days represent the largest proportion, days with high precipitation grades represent the fewest, and the average daily precipitation for these days is higher than that in the source region. In the middle reaches, the average daily precipitation with high grades is greater than that in the middle region, and the average daily precipitation with high grades is also larger. With regard to the lower reaches, there is a larger number of days with high precipitation grades, and they account for the largest proportion of the annual rainfall. As a consequence, the middle reaches, with the largest portion of days with no-rain and low precipitation grades and the smallest number of days with high precipitation grades, are likely to be underestimated by products, as the results of CHIRPS, CMORPH, GSMaP and IMERG show, and the lower reaches with the largest proportion of days with high precipitation grades are underestimated. However, GSMaP, IMERG and PERSIANN exhibit overestimations in the southeastern middle reaches, where daily precipitation is generally high, because the overestimation for no-rain days and days with low precipitation grades is stronger than the underestimation for days with high precipitation grades.

Figure 9 shows the evaluation indices of the daily precipitation for the products at the different elevation levels. CCs of all products except ERA5 exhibit two valleys at elevations of 2000-2500 m and 3500-4000 m and one peak at elevations of 2500-3000 m. The highest CC of ERA5 occurs at the peak (2500-3000 m), and the lowest CC occurs at one of the valleys (3500-4000 m). ERA5 overestimates rainfall at all elevation levels, while the other products underestimate precipitation at elevations of 2500-3000 m.
MAEs first decrease with increasing elevation and reach a valley at an elevation level of 1000-1500 m, then increase and reach a peak at an elevation level of 3000-3500 m. RMSE generally tends to decrease with increasing elevation. ERA5 reaches a POD of approximately 1 at all elevation levels. PODs of the other products vary within quite narrow ranges. FARs of the products generally decrease with increasing elevation, and valley values occur at an elevation level of 3000-3500 m. ERA5 yields the highest FARs at all elevation intervals, and CMORPH and GSMaP attain the lowest values. BIASs of the products generally also show a decreasing trend. CHIRPS attains a BIAS lower than 1 at all elevation levels, and GSMaP reaches a BIAS very close to 1, while the other products yield a BIAS higher than 1 at all elevation levels. With regard to ETS, CMORPH reaches the highest ETSs at all elevation levels except for 2500-3000 m, followed by GSMaP, while ERA5 exhibits the lowest ETSs. Among the six products, ERA5 attains the highest CCs and MEs when the elevation is higher than 500 m, the highest FARs and BIASs, and the lowest ETSs for all elevation levels. CHIRPS exhibits the lowest CCs and PODs, and the highest MAEs and RMSEs for all elevation levels, while CMORPH has the lowest MAEs when the elevation is between 500 and 4500 m, the highest ETSs, high CCs and low RMSEs and BIASs. Generally, CMORPH performs the best with high quantitative accuracy and rain detection capabilities, while CHIRPS performs the worst.

To better understand the relationships between the various evaluation indices and elevation, the CCs between the indices and elevation at the stations were calculated (Table S1). The results showed that most evaluation indices (i.e., CC, RMSE, POD, FAR, BIAS, and ETS) exhibited a negative correlation with elevation, which is consistent with the result for mainland China obtained by Yu [40].
This is reasonable because these indices are consistent with the daily precipitation, which reveals a negative relationship with elevation (Figure S4) and is similar to the observed relationship in the Hengduan Mountain region in China [70]. Thus, elevation could be used to correct or downscale the precipitation estimates of the satellite and reanalysis precipitation products [71,72].
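The correlation between a station-level index and station elevation, as summarized in Table S1, can be illustrated with the following minimal Python sketch. It assumes that the Pearson correlation coefficient is used; the function name, the elevations and the RMSE values are hypothetical and serve only to show a negative relationship of the kind described above.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


# Hypothetical station elevations (m) and station-level RMSE values (mm/day);
# these numbers are invented for illustration and are not from Table S1.
elevations_m = [120.0, 850.0, 1500.0, 2300.0, 3400.0]
station_rmse = [9.1, 7.4, 6.0, 5.2, 3.9]

cc_rmse_elevation = pearson_cc(elevations_m, station_rmse)
print(round(cc_rmse_elevation, 3))  # negative: RMSE falls as elevation rises
```

Repeating the calculation for each index and each product would yield a table of the same form as Table S1.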

Evaluation at the Different Elevation Levels
The accuracy of the precipitation products is affected by terrain. All products in this study, except for CHIRPS, perform the best in the middle reaches with elevations of 500-1500 m (Figure 9). PERSIANN performs the worst in the source region and southwestern middle reaches. There are several possible reasons: the terrain in these areas is complex, resulting in varied precipitation patterns to which the satellite-based sensors are not especially sensitive [73], and the number of gauge stations available for product calibration and evaluation is limited.

Conclusions
Six state-of-the-art precipitation products were evaluated over the Yellow River Basin of China from 2001 to 2018. The main findings are as follows:
(1) At the daily scale, CMORPH outperforms the other products in terms of both quantitative accuracy and rainfall detection capability, while CHIRPS performs the worst.
(2) At the monthly and seasonal scales, ERA5 produces the highest CCs in most months. For all products, CCs are the highest in autumn, MAE and RMSE increase first and then decrease throughout the year, and peak values occur in July. The PODs of all products except for CHIRPS increase first from January to April, then remain stable for some months and finally decrease. The FARs and BIASs of the products first decrease and then increase, and the FARs in summer are the lowest, followed by those in autumn, spring, and winter. MAE, RMSE, POD, and ETS are consistent with the monthly and seasonal rainfall amounts, and FAR and BIAS show the opposite trend to that of the rainfall amount. The summer months (e.g., June and July) with large rainfall amounts result in a low quantitative accuracy and high rainfall detection capability.
(3) Spatially, at the daily, monthly, and seasonal scales, the MAE, RMSE, POD, and ETS increase from northwest to southeast, which is consistent with the precipitation pattern. The precipitation amount in the humid areas of the source region and middle reaches is likely to be overestimated, and that in the arid areas is likely to be underestimated.
(4) The CC, ME, MAE, RMSE, POD and BIAS generally exhibit an increasing trend with increasing daily precipitation. All products slightly overestimate precipitation at low precipitation grades and underestimate it at high precipitation grades. Regarding rainfall detection capability, CHIRPS exhibits the best capability to detect no-rain events, and ERA5 performs the worst. Regarding the rainy days, ERA5 misses the fewest rain events, while CHIRPS misses the most.
(5) Most evaluation indices show a negative correlation with elevation. The RMSE, FAR, BIAS, and ETS decrease with increasing elevation. CMORPH performs the best for all elevation levels, while CHIRPS performs the worst.
Overall, the precipitation evaluation indices vary with the terrain condition, temporal scale, and product. An evaluation is therefore necessary before an application. The distribution of the gauge stations is uneven across the Yellow River Basin, especially in the source region and lower reaches, where the stations are sparse. This may increase the evaluation uncertainty. More gauge data should be collected to conduct better evaluation studies. The time scales adopted to conduct the evaluation in this study are the daily, monthly, and seasonal scales. However, rain events usually occur within a very short time, such as a few hours. Hence, in further studies, evaluations conducted at the hourly scale are necessary to obtain a more accurate understanding of the validity of these products.