Evaluation of the Integrated Multi-SatellitE Retrievals for the Global Precipitation Measurement (IMERG) Product in the São Francisco Basin (Brazil)

Abstract: The São Francisco River basin is one of the largest in the Brazilian territory. This basin has enormous economic, social and cultural importance for the country. Its water is used for human and animal supply, irrigation and energy production. The basin is located in an area with different climatic characteristics (humid and semiarid), and studies related to precipitation are very important in this region. In this scenario, the objective of this investigation is to present an assessment of rainfall estimated through the Integrated Multi-SatellitE Retrievals for Global Precipitation Measurement (IMERG) product compared with rain gauges over the São Francisco River basin in Brazil. For that, a period of 20 years and 18 surface weather stations were used to evaluate the product. Based on different evaluation techniques, the study found that the IMERG is appropriate to represent precipitation over the basin. According to the results, the performance of the IMERG product depends on the location where the rain occurs. The bias ranged from −1.67 to 0.34 mm, the RMSE ranged from 5.36 to 10.36 mm and the values of the correlation coefficients between the daily data from the IMERG and rain gauges ranged from 0.28 to 0.61. The results obtained by the Student t-test, density curves and regression analysis, in general, show that the IMERG is able to satisfactorily represent rain gauge data. The exception is the eastern portion of the basin, where the product, on average, underestimates the precipitation (p-value < 0.05) and presents the worst statistical metrics. A possible explanation for the underestimation of precipitation on the NEB east coast is the type of weather systems which usually affect this region. Previous studies have suggested the formation of warm clouds over this region. Another hypothesis is that the product is not suitable to represent the variability of precipitation over the region closest to the east coast, due to the product's deficiency in efficiently representing points that are at the ocean-continent interface. This question is still open among researchers regarding precipitation estimation algorithms globally.


Introduction
Precipitation is one of the most important variables in hydrometeorological studies, whose analysis encompasses applications in various activities, such as energy generation, irrigated agriculture, water resource management and monitoring areas of hydrological disaster risk, among others. The availability of reliable precipitation data, with a high spatial and temporal resolution, is essential for studies in these areas [1,2]. Rain gauges measure the amount of rain accurately and immediately, although it has been reported that raw precipitation amounts are usually underestimated due to wind-induced undercatch, wetting and evaporation losses, trace amount of precipitation, and equipment design [3]. However, hydrometeorological monitoring in some countries is deficient, especially in developing or underdeveloped countries [4,5].
In Brazil, most river basins have low rainfall. Additionally, the precipitation database in general suffers from a high percentage of missing data. For example, in Northeast Brazil (NEB), the percentage of missing records per weather station can reach 23% in 30-year climatological series [6]. In the Amazon Basin (AMZ), the percentage of missing records is highly variable, but even for data previously filtered according to the standards of the World Meteorological Organization (WMO), it usually exceeds 6% for annual series [7]. Thus, satellite precipitation products are alternatives for hydrometeorological monitoring [5,8].
In particular, the rainfall estimated from the Tropical Rainfall Measuring Mission (TRMM) algorithms [9] and, more recently, the Global Precipitation Measurement (GPM) has been widely used in regional and global meteorological and climatological studies [8,10-15]. In this context, several studies have been performed to evaluate the precipitation estimates from the TRMM and GPM satellites in different regions of Brazil [5,16-19]. Overall, the results of these studies suggest that the accuracy of satellite precipitation estimates depends on factors such as terrain, type of precipitation and local climate.
In the tropical portion of South America, the rainfall estimated by the TRMM algorithms was analyzed for its ability to represent precipitation in the AMZ, from the high-frequency variability, represented by the diurnal precipitation cycle [1,20], up to the modes of seasonal and interannual variability [20-22]. In NEB, where precipitation falls mostly within a semiarid zone flanked by the rainy coastal region and bordered by the Amazon and Cerrado biomes [6], TRMM products have been used in various hydrometeorological applications. For example, [5] used TRMM data to estimate return periods for extreme precipitation events, whereas [23] used TRMM data to monitor drought in the semiarid region.
Most recently, the Integrated Multi-SatellitE Retrievals for GPM (IMERG) product from the GPM satellite constellation has been evaluated against rain gauges in different regions of the world, such as China [24,25], Pakistan [26], different climatic zones of Brazil [27], the Ebro River basin in Spain [28], Chile [29], and Canada [30]. It was also evaluated under specific conditions such as orographic rains [28], different microphysical rain regimes [29], and heavy rain conditions such as during hurricane days [31] and typhoons [32]. Overall, IMERG performance is satisfactory. However, it has been reported [28,29] that it strongly depends on altitude and precipitation regime. For Brazilian river basins, the performance of the IMERG product has still been poorly analyzed.
In the specific context of the São Francisco River basin (SFB), which is the largest in extent and volume in NEB, the most recent studies with TRMM data have focused on the analysis of drought variability [33] or of the precipitation trend [34]. However, these studies focused on the upper reach of the São Francisco River, which corresponds to the southern portion of the SFB. Thus, in this study we evaluated the IMERG product of the GPM mission, which was developed as a continuation and improvement of the TRMM mission, to estimate precipitation over the entire length of the São Francisco River basin, comparing it with rainfall data measured at the surface.

Study Area
The SFB is located in Brazil between the geographical coordinates 7.0°-21.0° S and 35.0°-47.7° W (Figure 1). According to the São Francisco River Basin Committee [35], the basin is one of the largest in Brazilian territory, being located in portions of the Northeast, Southeast and Midwest regions of the country. In addition, it occupies an area of 619,543.94 km², which corresponds to almost 8% of national territory. Due to its size, it is divided into four physiographic regions: Upper, Middle, Sub-Middle and Lower. Furthermore, the SFB has enormous economic, social and cultural importance to the country. Its water is used for human and animal supply, irrigation and energy production for a large part of the surrounding population, living in about 505 municipalities [36,37] with a population of 18,218,575 [38]. In addition, the river serves for transport between cities. The rainy season and the amount of rainfall differ in time and space in the SFB. In the southern portion and at the mouth the climate is humid, while in the central reach a semiarid climate predominates [39,40].


Satellite Data
The precipitation estimates used in this work are from the IMERG product, the GPM level 3 multi-satellite precipitation algorithm, which combines all microwave sensors in the constellation and infrared-based observations from geosynchronous satellites [41]. The IMERG version 6 product fuses the early precipitation estimates by the TRMM satellite from 2000 to 2015 with more recent precipitation estimates by the GPM satellite from 2014 to present. The IMERG version 6 products on a half-hourly 0.1° grid were obtained from NASA's Goddard Space Flight Center website https://pmm.nasa.gov/data-access/downloads/gpm (accessed on 17 December 2020). In this study, the IMERG data collection period was from June 2000 to December 2019, and the half-hourly data were then accumulated to daily totals.
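The accumulation from half-hourly to daily data can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing chain used in the study: it assumes the half-hourly values are precipitation rates in mm/h, so that each value is multiplied by 0.5 h before summing, and that a day is delimited by the calendar date of the time stamp. The function name `accumulate_daily`, the rule of flagging a day with any missing step, and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

HALF_HOUR_IN_HOURS = 0.5  # duration of one half-hourly step, in hours


def accumulate_daily(records):
    """Sum half-hourly rates (mm/h) into daily totals (mm).

    records: iterable of (datetime, rate) pairs, where rate is a float
    in mm/h or None when the estimate is missing.
    Assumption: a day containing any missing step is returned as None.
    """
    totals = defaultdict(float)
    incomplete = set()
    for stamp, rate in records:
        day = stamp.date()
        if rate is None:
            incomplete.add(day)
            totals[day] += 0.0  # make sure the day appears in the output
        else:
            totals[day] += rate * HALF_HOUR_IN_HOURS
    return {
        day: (None if day in incomplete else total)
        for day, total in sorted(totals.items())
    }


# Hypothetical example: 48 half-hourly steps at a constant rate of 2 mm/h
start = datetime(2000, 6, 1)
sample = [(start + timedelta(minutes=30 * k), 2.0) for k in range(48)]
daily = accumulate_daily(sample)
print(daily)
```

If the gauge totals refer to a different 24-h window than the calendar day, the time stamps would need to be shifted before grouping.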


Precipitation Data
In order to assess the daily precipitation estimates of the satellite product, a set of rainfall data observed at 18 surface gauge stations was used as a reference. These stations are distributed throughout the SFB and are managed by the National Meteorological Institute (INMET), through the Meteorological Database Project (BDMEP). Table 1 presents information about the surface gauges, such as the municipalities where the gauges are installed, latitude, longitude, elevation (m) and percentage of missing data for each gauge within the period from June 2000 to December 2019. The data series  (Table 1). This condition has been imposed by several researchers, such as [18,42,43]. To assess the IMERG estimates, the dates with missing data were excluded from the two databases, Gauge and IMERG.
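The exclusion of dates with missing data from both databases can be illustrated with a short sketch. It assumes each series is held as a dictionary keyed by date, with `None` marking a missing value; the function name `pair_valid_days` and the sample values are hypothetical and only serve to show the pairing logic.

```python
def pair_valid_days(gauge, imerg):
    """Keep only the dates for which both series have a value.

    gauge, imerg: dicts mapping a date key to a daily total in mm,
    or to None when the value is missing (assumed convention).
    Returns three aligned lists: dates, gauge values, IMERG values.
    """
    dates, observed, estimated = [], [], []
    for day in sorted(set(gauge) & set(imerg)):
        g, s = gauge[day], imerg[day]
        if g is None or s is None:
            continue  # drop the date from both databases
        dates.append(day)
        observed.append(g)
        estimated.append(s)
    return dates, observed, estimated


# Hypothetical daily totals (mm)
gauge_series = {"2000-06-01": 0.0, "2000-06-02": None, "2000-06-03": 12.4}
imerg_series = {"2000-06-01": 0.3, "2000-06-02": 5.1, "2000-06-03": 9.8,
                "2000-06-04": 1.2}
dates, observed, estimated = pair_valid_days(gauge_series, imerg_series)
print(dates, observed, estimated)
```

Pairing the series before computing any statistic keeps the two samples the same length and aligned day by day.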

Statistical Analysis
The performance of the IMERG precipitation estimates was evaluated using five statistical indices, namely: bias (BIAS), root mean squared error (RMSE), Pearson's correlation coefficient (r), probability of detection (POD) and false alarm ratio (FAR). BIAS represents the systematic errors in satellite precipitation estimates relative to observations from rain gauges, and is calculated by Equation (1). RMSE measures the average magnitude of the error of the satellite's precipitation estimates, Equation (2). The lower the RMSE, the better the IMERG estimates represent the rain gauge dataset. The value of r quantifies the relationship between the precipitation values estimated by the satellite and observed by the rain gauges, Equation (3):

$$\mathrm{BIAS} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - Y_i\right) \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - Y_i\right)^2} \tag{2}$$

$$r = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}} \tag{3}$$

where $X_i$ represents the satellite precipitation estimate on day/month i; $Y_i$ denotes the precipitation value observed by the rain gauge on day/month i; n is the number of days/months analyzed in the study; and $\bar{X}$ and $\bar{Y}$ represent the mean values of $X_i$ and $Y_i$, respectively.

For the categorical analysis of precipitation estimated by the IMERG, the variables POD and FAR from Equations (4) and (5) were used. POD measures the fraction of the total precipitation events correctly detected by the satellite, while FAR measures the fraction of precipitation events incorrectly detected by the satellite:

$$\mathrm{POD} = \frac{N_a}{N_a + N_c} \tag{4}$$

$$\mathrm{FAR} = \frac{N_b}{N_a + N_b} \tag{5}$$

where $N_a$ is the number of hits, that is, the number of days on which the precipitation values both estimated by the satellite and observed by rain gauges were greater than a predefined precipitation threshold, 1 mm/day in this study, following the criterion adopted by [44,45]; $N_b$ is the number of days on which the satellite-based precipitation estimate was greater than 1 mm/day and the rainfall observed by the rain gauge was less than 1 mm/day; and $N_c$ is the number of days on which the satellite-based precipitation estimate was less than 1 mm/day and the rainfall observed by the rain gauge was greater than 1 mm/day.

The probability density function (PDF) was also used. Denoted by $f_X(x)$, the PDF describes the behavior, in the form of a frequency polygon, of the frequency distribution of a random variable. The probability of the random variable being less than a given value of interest, x, is calculated using the cumulative distribution function (CDF), represented by Equation (6):

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du \tag{6}$$

The CDF of a continuous random variable is a non-decreasing function that satisfies $F_X(-\infty) = 0$ and $F_X(+\infty) = 1$ [46-49]. Conversely, the corresponding PDF, $f_X(x)$, can be obtained by differentiating $F_X(x)$, as represented by Equation (7):

$$f_X(x) = \frac{dF_X(x)}{dx} \tag{7}$$

The Student t-test was applied to verify whether there is a significant difference between the means (gauge and IMERG), adopting a 5% significance level. In addition, a simple linear regression model was used to assess the similarity of monthly precipitation variations between the data estimated by IMERG and observed by gauges, Equation (8):

$$Y = \beta_0 + \beta_1 X \tag{8}$$

where Y represents the dataset observed by the rain gauges; $\beta_0$ is the intercept; $\beta_1$ is the slope; and X represents the dataset estimated by the satellite. The closer the estimated values of $\beta_0$ and $\beta_1$ are to zero and one, respectively, the better the relationship between the gauge and IMERG data. The coefficient of determination (R²) represents the goodness of fit of the regression model.
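As an illustration of how the indices described in this section could be computed from paired gauge and IMERG series, the sketch below uses only the Python standard library. It is not the code used in the study. It assumes BIAS is the mean of satellite minus gauge, and it treats values equal to the 1 mm/day threshold as "no rain", a case the text does not specify. The function names and the sample values are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def bias(est, obs):
    """Mean difference, satellite minus gauge (assumed sign convention)."""
    return mean([x - y for x, y in zip(est, obs)])


def rmse(est, obs):
    """Root mean squared error between satellite and gauge."""
    return math.sqrt(mean([(x - y) ** 2 for x, y in zip(est, obs)]))


def pearson_r(est, obs):
    """Pearson's correlation coefficient."""
    mx, my = mean(est), mean(obs)
    cov = sum((x - mx) * (y - my) for x, y in zip(est, obs))
    var_x = sum((x - mx) ** 2 for x in est)
    var_y = sum((y - my) ** 2 for y in obs)
    return cov / math.sqrt(var_x * var_y)


def pod_far(est, obs, threshold=1.0):
    """POD and FAR for a rain/no-rain threshold in mm/day.

    Assumption: a value equal to the threshold counts as no rain.
    """
    hits = false_alarms = misses = 0
    for x, y in zip(est, obs):
        sat_rain, gauge_rain = x > threshold, y > threshold
        if sat_rain and gauge_rain:
            hits += 1            # N_a
        elif sat_rain:
            false_alarms += 1    # N_b
        elif gauge_rain:
            misses += 1          # N_c
    pod = hits / (hits + misses) if hits + misses else float("nan")
    far = (false_alarms / (hits + false_alarms)
           if hits + false_alarms else float("nan"))
    return pod, far


def linear_fit(est, obs):
    """Least-squares fit of gauge (Y) on satellite (X).

    Returns the intercept, the slope and R squared.
    """
    mx, my = mean(est), mean(obs)
    sxx = sum((x - mx) ** 2 for x in est)
    sxy = sum((x - mx) * (y - my) for x, y in zip(est, obs))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(est, obs))
    ss_tot = sum((y - my) ** 2 for y in obs)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical paired daily values (mm), for illustration only
imerg_mm = [0.0, 2.0, 5.0, 0.5, 3.0]
gauge_mm = [0.0, 4.0, 5.0, 2.0, 0.0]
print(bias(imerg_mm, gauge_mm), rmse(imerg_mm, gauge_mm))
print(pearson_r(imerg_mm, gauge_mm), pod_far(imerg_mm, gauge_mm))
print(linear_fit(imerg_mm, gauge_mm))
```

With this sign convention a negative BIAS indicates that the satellite underestimates the gauge, which matches how underestimation is reported in the results.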

Results and Discussion
The annual average precipitation observed by the rain gauges and estimated by the IMERG is shown in Figure 2. The precipitation in the SFB presents a characteristic distribution, with the highest annual average accumulations, above 1500 mm, in the South (Upper São Francisco). In the Middle São Francisco, the observed precipitation was between 600 and 1050 mm, decreasing to values below 500 mm in the northern portion of the basin (Lower São Francisco). This gradual decrease in precipitation from South to North observed in the rain gauges (Figure 2a) is represented consistently by the algorithm, but IMERG underestimates precipitation in the Lower São Francisco.
The analysis of precision and accuracy is shown in Figure 3. The underestimation in the Lower São Francisco (Figure 3a) is confined to the northernmost part of the SFB, located in the coastal region. In the rest of the SFB, the algorithm presents biases between −0.5 and 0.5 mm.
The RMSE exceeds 10 mm in the southernmost portion of the SFB, decreasing towards the central region (Figure 3b). The lower the RMSE values, the better the capacity of the IMERG product to represent the precipitation values observed by rain gauges. The correlation is below 0.4 in the Lower São Francisco (Figure 3c), and it is in this region that the lowest POD (Figure 3d) and highest FAR (Figure 3e) are found. Similar values were found by [18], who observed that in the inland and coastal northeastern regions, which encompass the SFB, the 3B42 product of TRMM demonstrated a better performance, as shown by the metrics for the inland and coastal northeastern regions: bias = 2.82 mm/day and −2.94 mm/day; r = 0.18 and 0.30; std = 8.53 mm/day and 6.97 mm/day; RMSE = 14.75 mm/day and 7.03 mm/day, respectively.
The precipitation in the SFB shows evident spatial and temporal variability (Figure 4). A positive aspect of the IMERG data is the ability to adequately capture the seasonality of precipitation observed over most of the basin. In this figure, it is clear that the rain regime differs among the sites. The majority have their rainy period in the austral summer, while the dry period prevails in the austral winter. For these sites, it is clearly observed that the IMERG product agrees well with the in situ data. However, for the sites whose rainy period occurs in the austral winter and whose dry period occurs in the summer, the results are not the same. In the municipalities of Água Branca (Figure 4a), Pão de Açúcar (Figure 4n) and Propriá (Figure 4r), the satellite did not consistently reproduce the distribution of precipitation between the months of May and August, underestimating the average, median and quartiles measured by the rain gauges. This could be explained by the occurrence of warm clouds in these regions [5,50], which in turn lead to poorer precipitation estimates by satellites.
The density curves (gauge and IMERG) of the monthly data distributions are similar over a large part of the basin (Figure 5). As indicated by the boxplots (Figure 4), the exception is in the municipalities of Água Branca (Figure 5a), Pão de Açúcar (Figure 5n) and Propriá (Figure 5r), where the distributions of the satellite data show a higher concentration on the left compared to the data from the rain gauges. According to the results of the Student t-test, the means of the distributions observed by the rain gauges and estimated by the satellite are statistically different over these municipalities (p-value < 0.05). When observing the estimates of the parameters of the linear regression model, we also found that Água Branca, Propriá and Pão de Açúcar showed the worst results, with r ≤ 0.68 and R² ≤ 0.46 (Table 2).
It stands out that the greatest accumulation of precipitation is concentrated between the months of October and April, a period during the rainy season in much of the region, consistent with previous analyses obtained from rain gauge and satellite data [5,6,51]. A possible explanation for the underestimation of precipitation on the NEB east coast is the type of weather systems which usually affect this region. Previous studies have suggested the formation of warm clouds over this region [50,52]. This hypothesis is consistent with the low frequency of electrical discharges over practically the entire NEB east coast [53]. Warm clouds have little vertical development and low ice content, and ice is essential for the electrification process of clouds. These conditions contribute to deficiencies in the estimation of precipitation by satellites [5,8,18,50,54]. In addition, it is reasonable to assume that IMERG's resolution may not be sufficient to represent (mainly) the local convection in the region, so that the rainfall is underestimated.
In general, the IMERG provided satisfactory results for estimating precipitation in the SFB. Data from this product are valuable because they offer more detailed spatial coverage than the network of rainfall stations can always supply, in addition to being important for the most diverse applications that demand knowledge about the quantity, duration and distribution of the rainfall regime in time and space [55-60].
It should also be emphasized that the IMERG presented limitations for estimating precipitation in the Lower São Francisco and that future studies are important to elucidate the remaining gaps. For example, the results found in this work corroborate those of [61], whose authors analyzed the following satellite-derived precipitation products: TRMM Multi-Satellite Precipitation Analysis (TMPA), Integrated Multi-satellitE Retrievals for GPM (IMERG-F, version V05) and Global Satellite Mapping of Precipitation (GSMaP). All of them underestimated precipitation when compared with rain gauge data in the coastal portion of Northeastern Brazil. Reference [62] analyzed the daily precipitation estimated by IMERG for the year 2016 and found similar results. On the other hand, [63] evaluated 13 years of rainfall estimates for Brazil by the TRMM and concluded that there was good agreement (by 97%) in all regions of the country.

Conclusions
We present an assessment of the data from the IMERG algorithm for the São Francisco River basin in Brazil. For that, we used data from 18 surface weather stations, from which statistics were determined for each location. The results obtained depend on the basin region. The bias ranged from −1.67 to 0.34 mm/day, the RMSE ranged from 5.36 to 10.36 mm/day and the values of the correlation coefficients between the daily data from the IMERG and rain gauges ranged from 0.28 to 0.61. The results obtained by the Student t-test, density curves and regression analysis, in general, show that the IMERG is able to satisfactorily represent rain gauge data. We observed that the product is suitable for representing the precipitation variability, except in the region closest to the east coast, where the product presents the worst statistical metrics. A possible explanation for the underestimation of precipitation on the NEB east coast is the type of weather systems which usually affect this region. Previous studies have suggested the formation of warm clouds over this region. Another hypothesis is that the product is not suitable to represent the variability of precipitation over the region closest to the east coast, due to the product's deficiency in efficiently representing points that are at the ocean-continent interface. This question is still open among researchers regarding precipitation estimation algorithms globally.