Statistical Study of Rainfall Control: The Dagum Distribution and Applicability to the Southwest of Spain

Fitting statistical distributions to rainfall data is of vital importance for determining the maximum amount of rainfall expected for a specific hydraulic work. Otherwise, the hydraulic capacity study could be erroneous, with the tragic consequences that this would entail. This study aims to present the Dagum distribution as a new statistical tool to calculate rainfall, compared with commonly used statistical distributions such as Gumbel, Log-Pearson Type III, Generalized Extreme Value (GEV) and SQRT-ET max. The study was performed by collecting annual rainfall data from 52 meteorological stations in the province of Badajoz (Spain) and using the Anderson–Darling and Kolmogorov–Smirnov goodness-of-fit tests to establish the degree of fit of the Dagum distribution to the maximum annual rainfall series. The results show that this distribution gave a flow 21.92% greater than that obtained with the traditional distributions. Therefore, in the southwest of Spain, the Dagum distribution fits the observed rainfall data better than other common statistical distributions, which benefits the precision of calculations for hydraulic works and river flood plains.


Introduction
River flooding and hydrologic studies are carried out on the assumption that waters will reach a certain maximum level during design rainfalls with a given return period. The possible level of flooding is estimated, and waterworks are designed, by means of different classical statistical distributions applied to rainfall, using a series of maximum annual recorded rainfall data.
Some models of future scenarios suggest that climate change will involve a significant modification in the distribution of extreme rainfall intensity [1].
There have been several significant flooding events in the southwest of Spain in the last few decades. Particularly in Badajoz (Spain) in the year 1997, there were even casualties as a result of the river Rivillas breaking its banks. Therefore, undoubtedly another solution must be sought to provide a better fit to the historical rainfall data than those options currently available [2].
According to the scientific literature, the statistical distributions most commonly used in Europe and Spain are: the Gumbel distribution, developed by the German mathematician Gumbel [3] and later applied to hydrology [4]; the Log-Pearson Type III distribution [5,6], put forward by several authors for use in hydrology [7,8]; the generalized extreme value (GEV) distribution [9]; and the SQRT-ET max distribution [10,11], which best fits the characteristics of Spanish rainfall. [...] the best results. Finally, Domma and Condino (2017) [29] carried out a simulation study that shows the good performance of the maximum likelihood estimators for finite samples.
However, Mayooran and Laheetharan (2014) [30] used the same form of the Dagum distribution as in this work and compared it with 44 other distributions. The parameters of the selected probability distributions were used to generate random numbers for both actual and estimated maximum daily precipitation. Log-Pearson 3 and Burr (4P) were found to be the best-fit probability models for the annual period and the first inter-monsoon study period, respectively [31]. The transmuted Dagum model provides a broader range of hazard behavior than the Dagum model [32]. The parameters of the new model are estimated by maximum likelihood using the Newton-Raphson approach, and the information matrix and confidence intervals are also obtained.
Other simulation results showed that both the corrected Akaike information criterion and the Bayesian information criterion (BIC) always detected nonstationarity, but the BIC selected the correct model more often, except in very small samples [33]. Simulation studies indicated that the bias-corrected and accelerated (BCa) method is best overall for the extreme percentiles that are often the focus of interest [34].
Despite all the above, it is hard to understand why such a small number of distributions are used in professional practice in Europe, and especially in Spain, considering that other efficient distributions are available in the field of hydrology.
To demonstrate the effectiveness of the Dagum distribution, the adjustment of the statistical distribution to the observed maximum annual rainfall values will be confirmed using the Anderson-Darling [35] and Kolmogorov-Smirnov goodness-of-fit tests and comparing them with the other distributions.
A statistical distribution must provide as good a fit to the rainfall data as possible, since the better the fit, the more precise the calculated rainfall value, which is then used in the sizing of waterworks and flood plains.
This study intends to introduce the Dagum distribution as a new statistical tool to calculate rainfall, testing it with a dataset from Spain to show that it fits the observed rainfall data better.

Materials and Methods
Firstly, the statistical distributions used in hydrology studies, mainly in Europe and Spain, are reviewed, including the Dagum distribution and its fundamental characteristics.
The validity of this distribution in the field of civil engineering is then analyzed by checking whether, according to the goodness-of-fit tests, the Dagum distribution fits the maximum annual rainfall series better than the distributions commonly used in the province of Badajoz, using real maximum annual rainfall data from the meteorological stations in that province.

Statistical Distribution Functions
The current method used in Spain to obtain the flow rate needed to size waterworks or to calculate the flood plain of a river is the Rational Method [13] (except for large basins). This method transforms the statistical rainfall associated with a certain return period (mm) into a flow rate (L/s), $Q = c \cdot I \cdot A$, where c is a constant called the runoff coefficient, I is the maximum intensity of precipitation and A is the area of the basin. According to Instruction 5.2-IC, the variable I can be obtained from the IDF (intensity, duration, frequency) curves, whose expression depends on the maximum daily precipitation P_d. The correct estimation of P_d is therefore very important, since it is obtained from the series of daily rainfall recorded at rainfall stations.
The rainfall associated with the return period is currently calculated using the statistical distributions commonly used in hydrology: Gumbel, Log-Pearson Type III, SQRT-ET max and GEV. Thus, starting from a historical record obtained from the rainfall stations, the maximum value of rainfall associated with a certain return period (frequency) is determined.
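The Rational Method relation described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Instruction 5.2-IC: the function name, the choice of units (mm/h, km², m³/s) and the sample values are assumptions, and any additional correction factors of the instruction are omitted.

```python
def rational_peak_flow(c, intensity_mm_per_h, area_km2):
    """Peak flow in m^3/s from Q = c * I * A (units are an assumption)."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("runoff coefficient c must lie in [0, 1]")
    # 1 mm/h falling on 1 km^2 equals 1e-3 m * 1e6 m^2 / 3600 s = 1/3.6 m^3/s
    return c * intensity_mm_per_h * area_km2 / 3.6


# Hypothetical basin: c = 0.5, I = 36 mm/h, A = 1 km^2
q = rational_peak_flow(c=0.5, intensity_mm_per_h=36.0, area_km2=1.0)
print(f"Q = {q:.2f} m^3/s ({q * 1000:.0f} L/s)")
```

Because Q is linear in I, any relative error in the design rainfall carries over directly to the design flow, which is why the choice of statistical distribution matters.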
The use of one of these statistical distributions is nearly always found in hydrological studies [12].

The Gumbel Distribution
According to Gumbel [36], the distribution function F(x) represents the probability that rainfall is less than or equal to x, where x is the value of the random variable and α and µ are fit parameters that depend on the mean and standard deviation of the variable y_i, which in turn depend on the sample size.
The parameters are obtained from ȳ and S_N, the mean and standard deviation of the variable y_i, and from x̄ and S_x, the mean and standard deviation of the sample of daily maximum values of annual rainfall. The return period, T(x), is related to the distribution function, F(x), by Equation (6). After entering the sample values and solving for x, the analytical expression of Equation (7) is reached, from which the expected daily maximum precipitation P_d is obtained for a given return period T(x).
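The chain from sample statistics to a design rainfall can be sketched as follows. This is an illustration under stated assumptions: it uses the standard relation T = 1/(1 − F(x)), a method-of-moments fit with the asymptotic constants (0.5772 and π/√6) instead of the sample-size-dependent ȳ and S_N, and a hypothetical rainfall series.

```python
import math
import statistics


def gumbel_fit_moments(sample):
    """Location and scale by moments, using asymptotic constants (assumption)."""
    mean = statistics.fmean(sample)
    std = statistics.stdev(sample)
    scale = std * math.sqrt(6.0) / math.pi
    loc = mean - 0.5772156649 * scale
    return loc, scale


def gumbel_cdf(x, loc, scale):
    """F(x) = exp(-exp(-(x - loc) / scale))."""
    return math.exp(-math.exp(-(x - loc) / scale))


def gumbel_quantile(return_period, loc, scale):
    """Rainfall with non-exceedance probability F = 1 - 1/T."""
    p = 1.0 - 1.0 / return_period
    return loc - scale * math.log(-math.log(p))


# Hypothetical annual maxima of daily rainfall (mm), for illustration only
annual_maxima_mm = [42.0, 55.5, 38.2, 61.0, 47.3, 70.1, 52.8, 44.9, 58.6, 49.0]
loc, scale = gumbel_fit_moments(annual_maxima_mm)
p100 = gumbel_quantile(100, loc, scale)
print(f"P_d for T = 100 years: {p100:.1f} mm")
```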

The Log-Pearson Type III Distribution
In the Log-Pearson Type III density function, y = log(x), Γ(β) is the gamma function and e is Euler's number. β, λ and ε are the shape, scale and location parameters, respectively, and are obtained from Equations (9), (10) and (11). The density function of this distribution cannot be integrated analytically, so it is resolved by parametric methods.

The Distribution of SQRT-ET Max
The distribution function F(x) gives the probability that the value will be less than x, and k and α are parameters to be estimated that depend on the mean and standard deviation of the data series.
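As an illustration, the commonly cited form of the SQRT-ET max distribution function, F(x) = exp(−k(1 + √(αx)) exp(−√(αx))), can be evaluated and inverted numerically. This form is assumed here because the expression itself is not reproduced above; the parameter values are hypothetical and the bisection routine is only one possible way to obtain a quantile.

```python
import math


def sqrt_et_cdf(x, k, alpha):
    """Assumed form: F(x) = exp(-k * (1 + s) * exp(-s)), with s = sqrt(alpha * x)."""
    if x < 0.0:
        return 0.0
    s = math.sqrt(alpha * x)
    return math.exp(-k * (1.0 + s) * math.exp(-s))


def sqrt_et_quantile(p, k, alpha, tol=1e-9):
    """Invert the distribution function by bisection (it has no closed-form inverse)."""
    lo, hi = 0.0, 1.0
    while sqrt_et_cdf(hi, k, alpha) < p:  # expand the bracket until it contains p
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if sqrt_et_cdf(mid, k, alpha) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical parameters, for illustration only
k, alpha = 50.0, 1.0
x100 = sqrt_et_quantile(0.99, k, alpha)
print(f"Rainfall for T = 100 years: {x100:.1f} mm")
```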

The Generalized Extreme Value (GEV) Distribution
In the GEV density function [10], z = (x − µ)/α, and k, µ and α are the shape, location and scale parameters, respectively.

The Dagum Distribution
The Dagum distribution has long been used in different fields such as economics, econometrics [37] and social sciences. However, only a few applications are found in hydrology. The importance of using this distribution in hydrology lies both in its adaptability to extreme data and in a capability similar to that of traditional distributions.
In probability theory, statistics and econometrics, the Dagum distribution is a continuous probability distribution defined on the positive real numbers. It arose from several variations of a new model of the size distribution of personal incomes and is associated above all with the study of incomes. This distribution can be used with three parameters (Type I) or with four parameters (Type II). The density function (15) and the distribution function are defined in terms of the following parameters: k is a continuous shape parameter (k > 0), a is a continuous shape parameter (a > 0), β is a continuous scale parameter (β > 0) and γ is a continuous location parameter (γ = 0 yields the three-parameter Dagum distribution), with γ ≤ x < ∞. Figure 1 shows the density function of Dagum.
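The parameter roles listed above can be made concrete with a short sketch. It assumes the usual four-parameter form, F(x) = (1 + ((x − γ)/β)^(−a))^(−k), which matches the parameter names given here but is not reproduced from the original equations; the function names and parameter values are hypothetical.

```python
def dagum_cdf(x, k, a, beta, gamma=0.0):
    """Assumed form: F(x) = (1 + z**(-a))**(-k), with z = (x - gamma) / beta."""
    if x <= gamma:
        return 0.0
    z = (x - gamma) / beta
    return (1.0 + z ** (-a)) ** (-k)


def dagum_pdf(x, k, a, beta, gamma=0.0):
    """Derivative of the assumed distribution function with respect to x."""
    if x <= gamma:
        return 0.0
    z = (x - gamma) / beta
    return a * k * z ** (a * k - 1.0) / (beta * (1.0 + z ** a) ** (k + 1.0))


def dagum_quantile(p, k, a, beta, gamma=0.0):
    """Closed-form inverse of the assumed distribution function, for 0 < p < 1."""
    return gamma + beta * (p ** (-1.0 / k) - 1.0) ** (-1.0 / a)


# Hypothetical parameters, for illustration only (gamma = 0 gives the 3-parameter case)
k, a, beta = 0.5, 4.0, 60.0
x100 = dagum_quantile(1.0 - 1.0 / 100, k, a, beta)
print(f"Rainfall for T = 100 years: {x100:.1f} mm")
```

Unlike the SQRT-ET max case, the quantile has a closed form, so the rainfall for a given return period follows directly once the parameters have been fitted.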

Tests of Goodness
To fit the distributions, the EasyFit software [38] was used, which fits the probability laws to the rainfall series and performs the goodness-of-fit tests by the Kolmogorov-Smirnov, Anderson-Darling and chi-square methods.

The Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test [39] is a nonparametric test for continuous distributions, available in one-sample and two-sample forms, that proves particularly useful for large samples and is therefore optimal for the study [40].
The Kolmogorov-Smirnov test considers two hypotheses: H0: F(X) = Fs(X), H1: F(X) ≠ Fs(X) (17), where F(X) is the distribution function to be studied, and Fs(X) is the probability or theoretical proportion of values that must be less than or equal to x assuming the proposed hypothesis to be true. Sample: n independent observations.
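A minimal sketch of the one-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical and the theoretical distribution functions, is given below. It is not EasyFit's implementation; the function name and the small uniform example are assumptions chosen so that the result can be checked by hand.

```python
def ks_statistic(sample, cdf):
    """D = max over i of max(i/n - F(x_i), F(x_i) - (i - 1)/n) on the sorted sample."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical distribution function jumps from (i - 1)/n to i/n at x_i
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1], used as a toy model."""
    return min(max(x, 0.0), 1.0)


d = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(f"D = {d:.3f}")
```

A smaller D indicates a closer fit, which is the sense in which lower statistics are preferred when comparing candidate distributions.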

The Anderson-Darling Test
The Anderson-Darling test has been widely used in hydrology due to its reliability in comparison with other tests and its common use in samples with pronounced tails. This test is of particular interest, compared with other commonly used tests, when faced with a variety of hydraulic engineering alternatives [41].
The Anderson-Darling test [42] uses a test statistic A², where N is the sample size and F(x) is the cumulative distribution function of the fitted distribution.
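The statistic can be sketched with the standard formula A² = −N − (1/N) Σ (2i − 1)[ln F(x_i) + ln(1 − F(x_(N+1−i)))], taken over the sorted sample. This is an illustration under that assumption, not EasyFit's implementation; the function name and the toy uniform example are hypothetical.

```python
import math


def anderson_darling_statistic(sample, cdf):
    """A^2 computed from the sorted sample and a fully specified distribution function."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(xs[i - 1])   # F(x_i)
        f_high = cdf(xs[n - i])  # F(x_{N+1-i})
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n


def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1], used as a toy model."""
    return min(max(x, 0.0), 1.0)


a2 = anderson_darling_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(f"A^2 = {a2:.4f}")
```

The logarithms weight both tails heavily, which is consistent with the test's use for samples with pronounced tails.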

Case Study
The present study was performed using annual rainfall data from 52 meteorological stations in Badajoz, provided by the Spanish Meteorological Institute.
An exhaustive study was made using the statistical goodness-of-fit tests of Anderson-Darling and Kolmogorov-Smirnov to establish the degree of fitness of the Dagum distribution applied to the maximum annual rainfall series, and thus, be able to compare the fits of this distribution to those of the classical statistical distributions such as Gumbel, SQRT-ET max, Log-Pearson type III and the GEV.
Goodness-of-fit tests are widely used in hydrology due to the high degree of precision with which they reflect the fit of a statistical distribution to the available rainfall data series.
To fit the distributions, the EasyFit software was used. Once the maximum annual rainfall data are entered, the program gives the degree of fit of each statistical distribution according to the two goodness-of-fit tests used.
The 52 meteorological stations used in the analysis are shown in Table 1 and Figure 2. These contained maximum and minimum temperatures and daily precipitation for the period between 1990 and 2015. The quality control procedures of the Algorithm Theoretical Basis Document (ATBD) project, developed by the Royal Netherlands Meteorological Institute (KNMI) for the European Climate Assessment & Dataset (ECA&D), have been applied [43]. The blended series passed the standard homogeneity test, the Buishand range test, the Pettitt test and the Von Neumann ratio, as described by Wijngaard et al. [44] and ECA&D. Some series presenting missing values were completed following the recommendations of WMO [45] and Allen et al. [46]. The daily data from each station were processed and analyzed. Coefficients of variation and maximum precipitations at each meteorological station are also shown in Table 1.
The rainfall data from the 52 meteorological stations were entered into the EasyFit statistical program. The density function of each statistical distribution was fitted to the rainfall histogram. Finally, the goodness of fit of each density function to the rainfall histograms was studied.

Figure 3 shows how the Dagum probability distribution fits the rainfall histogram in the town of San Vicente de Alcántara.
Figure 4 shows how the density function fits the cumulative histogram of a maximum annual rainfall series. The density function of a statistical distribution will never reproduce the exact values of the histogram, which is why goodness-of-fit tests are used to check which distribution provides the best fit to the rainfall data series when several are compared. That is to say, the chosen statistical distribution should be the one that fits the rainfall histogram most accurately.

Results
After fitting the statistical functions to the rainfall data, Figure 5 shows graphically how the Dagum distribution is aligned with the fits of both the Gumbel and the Log-Pearson type III distributions; it is difficult to decide which is best, since the curves are very close together. Similarly, Figure 6 shows that it is difficult to determine which of the three cumulative distribution functions fits the histogram of rainfall data in San Vicente de Alcántara more accurately. Since no conclusion can be drawn from the plots alone, it is necessary to apply the goodness-of-fit tests. In this study, the tests were applied to four distributions, as shown in Table 2.
The Dagum distribution clearly presents the lowest goodness-of-fit statistics for the San Vicente de Alcántara, Jerez de los Caballeros and Herrera del Duque data, which means that it fits the rainfall data better than the other three. These tests were applied to rainfall data from the remaining 51 stations [47], in which the analysis reflects a similar trend, with the Dagum distribution presenting lower goodness-of-fit statistics than the rest of the distributions.
Other criteria were considered for selecting the best distribution model, such as the corrected Akaike information criterion and the Bayesian information criterion (BIC), but the Anderson-Darling and Kolmogorov-Smirnov tests were finally chosen.

Subsequently, and to confirm the above results, goodness-of-fit tests were carried out with a series of statistical distributions (applied both in hydrology and in other disciplines), using the ten stations with the largest sample size (among the 52 stations).
In Figure 7, it can be seen that Dagum appears as one of the most frequent distributions (within the five best fits), just below the GEV distribution but above the Gumbel and Log-Pearson type III distributions.
Table 3 shows that applying the Dagum distribution to the cases of Cabeza la Vaca, Monterrubio and Campanario yields flows different from those of the traditional distributions. For example, Cabeza la Vaca shows a flow 21.92% greater than that given by the most commonly used distribution (Gumbel).

Figure 7. Number of events in the different distributions according to the goodness-of-fit test at the locations with the ten largest sample sizes.
Table 3. Columns: Distribution, Cabeza la Vaca, Monterrubio, Campanario.
It can be deduced that the statistical distributions that provide the greatest rainfall are the Log-Logistic 3P distribution followed by the Dagum distribution. The quantitative differences with respect to the value provided by the Gumbel distribution, the most widespread in studies and projects, are variable and in some cases considerable, up to 58% higher in the case of Jerez de los Caballeros. Therefore, to be on the safe side, the Log-Logistic 3P and Dagum distributions should be used, since the flows and precipitations derived from their application will be greater than those obtained with the Gumbel, SQRT-ET max and Log-Pearson 3 distributions. It is important to emphasize that these distributions give the best fit in the Kolmogorov-Smirnov and Anderson-Darling goodness-of-fit tests and in their weighting.

Discussion
It is important to note that comparing distributions is complicated. Goodness-of-fit tests are not very powerful, and with the sample sizes typically available in practice it is rarely possible to reject candidate distributions statistically. Therefore, the comparison must be done on a larger scale [48]. In this case study, the sites located throughout the province are sufficiently numerous and evenly distributed to obtain significant results. However, these results cannot be extended elsewhere; that is, the choice of a particular distribution at a given place should be carefully studied. Although some distributions, such as the Gumbel or Log-Pearson type III, have been extensively used in many hydrologic studies without any additional consideration of the particular conditions of the basins, including the spatial factor would reduce the uncertainty concerning the choice of model [49]. Physical factors such as large-scale meteorological phenomena could create regional probability dependencies, which have to be accounted for. Consequently, as previously indicated, each region or zone should first be characterized in order to choose the statistical distribution that best explains the expected rainfall events [50].
As a result of the analyses carried out at the 52 locations throughout the province of Badajoz, the Dagum model scored better than the other models traditionally used in hydrologic and hydraulic works. There are few previous studies in which the Dagum distribution has been used for these topics.
The Dagum model was found to give higher estimates than the Gumbel distribution in a great number of cases. Therefore, the Dagum distribution seems to be the most advisable distribution for a conservative design and for planning accordingly [51].
Because of the ample availability of computers nowadays, many statistical distributions should be considered when a single-site flood frequency analysis is done [52]. Moreover, as more data accumulate in the coming years, new analyses could be performed with regionalized parameters of a proven model for each location [53]. In this sense, the Dagum model can provide more accurate results in many places of southwestern Spain than those obtained using traditional distributions. The evaluation and simulation of rainfall scenarios indicate that changes in rainfall characteristics have a considerable impact on the built drainage system and that Low Impact Development (LID) practices can adequately control flooding [54].
Future work should aim at verifying the applicability of the Dagum distribution in other regions of southern Europe.

Conclusions
On analyzing the maximum annual rainfall data from 52 stations (strategically located throughout the zone) with the Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests, it is confirmed that, in addition to the distributions traditionally used in hydrology (such as the Gumbel, Log-Pearson type III and GEV distributions), there is another statistical distribution, the Dagum, which can be used in hydrology, meets the formulation of extreme values (outliers) and fits the rainfall histograms better.
Based on the statistical data from the study, it is concluded that the Dagum distribution presents lower statistics in the two goodness-of-fit tests mentioned above and therefore fits the histograms of the maximum annual rainfall data better than the commonly used distributions. In particular, this new statistical distribution is more appropriate to reflect the rainfall regime in Badajoz.
In conclusion, the Dagum statistical distribution is proposed to improve hydrological studies in Badajoz, since the rainfalls given by its density function are more precise (as shown by the goodness-of-fit tests) than those calculated through classical statistical distributions. Its use in professional practice would allow greater flow rates to be considered when designing drainage systems and carrying out flood studies, thus helping to prevent future rainfall damage.