1. Introduction
Analysis of rainfall data, and the subsequent modelling of the many variables concerning rainfall, is fundamental to many disciplines, such as agriculture, ecology and engineering. From assessing hydrological risk to crop and hydropower planning, rainfall modelling is of the utmost importance. Moreover, reliable rainfall modelling is essential in addressing the well-known issue of climate change. Due to the complexity of hydrological systems, their analysis and modelling rely heavily on historical records. Historical rainfall records come at various time scales, from hourly to annual data. However, daily rainfall series are arguably the most used information in environmental, climate, hydrological, and water resources studies [1]. Rainfall manifests one peculiar characteristic which is common to many other geophysical processes: intermittence [2]. Intermittence is found in variables related to the internal and external structure of rainfall. For the external structure, the most commonly studied are the Wet Spells ($WS$) and Dry Spells ($DS$), meaning the sequences of rainy days and non-rainy days, respectively. A way of studying the alternation of $WS$ and $DS$ is through the Interarrival Times ($IT$), that is, the time elapsed between two consecutive days of rain. If we suppose that $IT$ observations are independent and identically distributed (i.i.d.), one natural way to model them is through the well-known renewal processes [3]. Many examples can be found in the literature. The simplest renewal process, the Bernoulli process, has been used in [4] for example. In this case, the $IT$'s are geometrically distributed. Its continuous counterpart, the Poisson process, has been used for its simpler mathematical tractability, but requires treating the $IT$ random variable (r.v.) as continuous, despite its discrete nature. The need to allow for a non-constant probability of rain requires slightly more sophisticated models.
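To make the Bernoulli case concrete, the following minimal sketch simulates daily rain occurrence with a constant probability of rain and compares the empirical frequencies of the interarrival times with the geometric probabilities. The function names, the value of `p_rain` and the series length are illustrative assumptions, not quantities taken from the paper.

```python
import random
from collections import Counter

def simulate_interarrival_times(p_rain, n_days, seed=0):
    """Simulate a Bernoulli rain process and return the gaps (in days)
    between consecutive rainy days."""
    rng = random.Random(seed)
    rainy_days = [d for d in range(n_days) if rng.random() < p_rain]
    return [b - a for a, b in zip(rainy_days, rainy_days[1:])]

def geometric_pmf(k, p):
    """P(IT = k) for k = 1, 2, ... under a constant rain probability p."""
    return p * (1 - p) ** (k - 1)

p_rain = 0.3  # hypothetical constant daily probability of rain
its = simulate_interarrival_times(p_rain, 200_000)
counts = Counter(its)

# Empirical frequency versus geometric probability for the first few values
for k in range(1, 6):
    print(k, round(counts[k] / len(its), 4), round(geometric_pmf(k, p_rain), 4))
```

With a constant `p_rain` the two columns should agree up to sampling noise; real interarrival data typically show a heavier tail than this geometric benchmark, which is what motivates the richer models discussed next.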
The challenge of this paper is to propose a suitable discrete distribution to fit $IT$ at the daily scale. It is on this time scale that the intermittent character of precipitation can be appreciated and, at the same time, most practical applications are possible. The proposed distribution must be able to model both the numerous occurrences of the value one, which represent sequences of rainy days, and some large values scattered over time and responsible for drought phenomena. Our starting point is the three-parameter Hurwitz–Lerch Zeta distribution (HLZD), successfully proposed in [5]. Such a distribution represents a step forward with respect to other commonly used $IT$ modelling distributions, such as the logarithmic one. In Section 3, we summarize the main properties of the HLZD and state new results on its log-concavity and convolution. As a further step, in this paper we propose to model the $IT$ r.v. using the Poisson-stopped HLZD (PSHLZD). This discrete distribution presents excess zeroes (paralleling the excess of $IT = 1$) and a long tail [6]. The PSHLZD has been used in [6] for comparisons with the negative binomial distribution, a popular model for fitting over-dispersed data. Indeed the PSHLZD can be seen both as a Poisson-stopped sum of HLZD's and as a generalization of the negative binomial distribution. The Poisson contribution allows us to model the superposition of i.i.d. HLZD's in the observed time series as a rare event. In Section 4, we summarize its main properties using the combinatorics of exponential Bell polynomials. It is worth mentioning that Bell polynomials are used within fractional calculus, see for example [7,8], and within fractal models [9]. Moreover, new results on the PSHLZD are added, for example on log-concavity.
A second goal of this paper is to show that the PSHLZD is also a suitable model for a different feature strictly related to the internal structure of rainfall: the depth (or the intensity) of the rainy days [10]. In the literature [11,12], rainfall depths are more often treated as continuous, even though such models sometimes fail to account for the time discreteness of the sampling process [13]. Daily rainfall depth measurements are almost always performed by automatically counting how many times a small bucket, corresponding to a fixed depth in mm, is filled. This led us to treat them as discrete, because of the abundance of ties in the data. Finally, in Section 5 we have also considered a third modelling distribution: the One Inflated HLZD (OIHLZD). Such a distribution mixes two generating processes: the first generates ones and the second is governed by a HLZD. This stochastic structure takes into account the dominance of ones in the rainfall depth or interarrival time series.
In Section 6, we discuss the results of fitting all these models to rainfall data, showing that the PSHLZD provides a very general framework for rainfall modelling. Indeed the PSHLZD replicates the fitting features of the OIHLZD and outperforms the fitted HLZD in some cases. The PSHLZD has a limited number of parameters and at the same time adapts very well to data collected in very different climates, from England to Sicily. Let us underline that the analyzed dataset has never been considered in the literature and consists of measurements collected over 70 years at 5 different stations. These stations were chosen in order to represent different climates from the point of view of rainfall characteristics. In fact, the interarrival data examined are very different from each other, with a regular pattern of many rainy days in England, and a rainy winter season alternating with long summer periods without rain, typical of the Mediterranean climate of Sicily. The same holds for the rainfall depth, namely many small depths in England and a few big storms in Sicily. This made it possible to confirm the great utility of the proposed statistical models within rainfall modelling. Some concluding remarks and future developments are addressed at the end of the paper.
2. Bell Polynomials in a Nutshell
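The partial exponential Bell polynomials are usually written as [14]
$$B_{n,k}(x_1, \ldots, x_{n-k+1}) = \sum \frac{n!}{j_1!\, j_2! \cdots j_{n-k+1}!} \left(\frac{x_1}{1!}\right)^{j_1} \left(\frac{x_2}{2!}\right)^{j_2} \cdots \left(\frac{x_{n-k+1}}{(n-k+1)!}\right)^{j_{n-k+1}} \tag{1}$$
where the summation is over all the solutions in non-negative integers $j_1, \ldots, j_{n-k+1}$ of $j_1 + 2 j_2 + \cdots + (n-k+1)\, j_{n-k+1} = n$ and $j_1 + j_2 + \cdots + j_{n-k+1} = k.$ A lighter expression is obtained using partitions of the integer $n$ with length $k.$ Recall that a partition of an integer $n$ is a sequence $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)$ of weakly decreasing positive integers, named parts of $\lambda,$ such that $\lambda_1 + \lambda_2 + \cdots + \lambda_m = n.$ A different notation is $\lambda = (1^{m_1}, 2^{m_2}, \ldots),$ where $m_1, m_2, \ldots,$ named multiplicities of $\lambda,$ are the number of parts of $\lambda$ equal to $1, 2, \ldots$ respectively. The length of the partition is $l(\lambda) = m_1 + m_2 + \cdots$ and the vector of multiplicities is $m(\lambda) = (m_1, m_2, \ldots).$ We write $\lambda \vdash n$ to denote that $\lambda$ is a partition of $n.$ Thus the partial exponential Bell polynomials (1) can be rewritten as [15]
$$B_{n,k}(x_1, \ldots, x_{n-k+1}) = \sum_{\lambda \vdash n,\; l(\lambda) = k} d_\lambda\, x_\lambda \tag{2}$$
where the sum is over all the partitions $\lambda \vdash n$ with length $l(\lambda) = k$ and
$$d_\lambda = \frac{n!}{(1!)^{m_1} m_1!\, (2!)^{m_2} m_2! \cdots}, \qquad x_\lambda = x_1^{m_1} x_2^{m_2} \cdots \tag{3}$$
Using integer partitions, the explicit expression of the partial exponential Bell polynomials can be recovered in R using the kStatistics package [16]. A useful property used in the following is
$$B_{n,k}(a b\, x_1, a b^2 x_2, \ldots, a b^{n-k+1} x_{n-k+1}) = a^k b^n B_{n,k}(x_1, \ldots, x_{n-k+1}) \tag{4}$$
with $a, b$ constants. Equation (4) follows from (2) since from (3) we have
$$(a b\, x_1)^{m_1} (a b^2 x_2)^{m_2} \cdots = a^{m_1 + m_2 + \cdots}\, b^{m_1 + 2 m_2 + \cdots}\, x_\lambda = a^k b^n x_\lambda$$
taking into account that $m_1 + m_2 + \cdots = k$ and $m_1 + 2 m_2 + \cdots = n.$
The $n$-th complete exponential Bell polynomial in the indeterminates $x_1, \ldots, x_n$ is defined as [14]
$$B_n(x_1, \ldots, x_n) = \sum_{k=1}^{n} B_{n,k}(x_1, \ldots, x_{n-k+1}) \tag{5}$$
with $B_{n,k}$ the partial exponential Bell polynomials (1). Note that $n$ is the positive integer corresponding to the maximum degree of the monomials in (5). This polynomial sequence satisfies the following recurrence [14]
$$B_{n+1}(x_1, \ldots, x_{n+1}) = \sum_{k=0}^{n} \binom{n}{k} B_{n-k}(x_1, \ldots, x_{n-k})\, x_{k+1} \tag{6}$$
with the initial value $B_0 = 1.$ The generating function of $\{B_n\}$ is the formal power series composition [14]
$$\sum_{n \geq 0} B_n(x_1, \ldots, x_n) \frac{t^n}{n!} = \exp\big(x(t)\big) \in \mathbb{R}[[t]] \tag{7}$$
where $\mathbb{R}[[t]]$ is the ring of formal power series in $t$ and $x(t)$ is the generating function of $\{x_n\}$, that is
$$x(t) = \sum_{n \geq 1} x_n \frac{t^n}{n!}.$$
A different expression of the $n$-th complete exponential Bell polynomial involves integer partitions [15] as follows
$$B_n(x_1, \ldots, x_n) = \sum_{\lambda \vdash n} d_\lambda\, x_\lambda \tag{8}$$
where the sum is over all the partitions $\lambda \vdash n$ and $d_\lambda, x_\lambda$ are given in (3). In particular we have
$$B_n(a x_1, \ldots, a x_n) = \sum_{\lambda \vdash n} d_\lambda\, a^{l(\lambda)}\, x_\lambda \tag{9}$$
with $a$ a constant. Now, suppose we replace $\{a^k\}$ in (9) with a numerical sequence $\{a_k\}.$ Thanks to this device, the complete exponential Bell polynomials arise as a special case of a wider class of polynomial families, the generalized partition polynomials [16]
$$G_n(a_1, \ldots, a_n; x_1, \ldots, x_n) = \sum_{\lambda \vdash n} d_\lambda\, a_{l(\lambda)}\, x_\lambda \tag{10}$$
where the sum is again over all the partitions $\lambda \vdash n.$ A different expression of (10) involves the partial exponential Bell polynomials $B_{n,k}$ in (1)
$$G_n(a_1, \ldots, a_n; x_1, \ldots, x_n) = \sum_{k=1}^{n} a_k\, B_{n,k}(x_1, \ldots, x_{n-k+1}) \tag{11}$$
An example of a well-known polynomial family arising from (11) is the logarithmic one [14]
$$L_n(x_1, \ldots, x_n) = \sum_{k=1}^{n} (-1)^{k-1} (k-1)!\, B_{n,k}(x_1, \ldots, x_{n-k+1}) \tag{12}$$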
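The paper points to the R package kStatistics for explicit expressions. As a language-neutral illustration, the minimal Python sketch below evaluates the complete exponential Bell polynomials numerically through their binomial recurrence with initial value $B_0 = 1$, and the partial ones through the standard recurrence $B_{n,k} = \sum_{i} \binom{n-1}{i-1} x_i B_{n-i,k-1}$. The function names are illustrative and the list `x` is assumed to hold $x_1, x_2, \ldots$ in positions `0, 1, ...`.

```python
from math import comb

def complete_bell(n, x):
    """B_n(x_1..x_n) via B_{m+1} = sum_k C(m,k) B_{m-k} x_{k+1}, with B_0 = 1."""
    B = [1]
    for m in range(n):
        B.append(sum(comb(m, k) * B[m - k] * x[k] for k in range(m + 1)))
    return B[n]

def partial_bell(n, k, x):
    """B_{n,k}(x_1..x_{n-k+1}) via the standard recurrence on the first block."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return sum(comb(n - 1, i - 1) * x[i - 1] * partial_bell(n - i, k - 1, x)
               for i in range(1, n - k + 2))

# With all indeterminates equal to 1, B_n reduces to the Bell numbers
bell_numbers = [complete_bell(n, [1] * n) for n in range(1, 6)]
print(bell_numbers)

# Scaling check: B_{n,k}(ab x_1, ab^2 x_2, ...) = a^k b^n B_{n,k}(x_1, x_2, ...)
x = [2, 3, 5, 7]
a, b = 2, 3
scaled = [a * b ** (i + 1) * xi for i, xi in enumerate(x)]
print(partial_bell(4, 2, scaled) == a ** 2 * b ** 4 * partial_bell(4, 2, x))
```

The recursive `partial_bell` is exponential in the worst case and is meant only for small `n`; for larger orders a memoized or tabulated version would be preferable.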
4. The Poisson-Stopped Hurwitz-Lerch Zeta Distribution
Definition 3. A discrete random variable if its pgf is where Φ
is the Lerch Transcendent function (15). According to Definition 3,
takes non-negative integer values and belongs to the class of generalized r.v.’s [
23]. Indeed given two independent r.v.’s
Z and
with pgf
and
respectively, the generalized r.v.
X has pgf
The composition (
26) matches (
27) when
and
Z is a Poisson r.v. (PS) of parameter
independent of
since
In the following we analyse in detail the properties of the PSHLZD using the complete exponential Bell polynomials. Some of the properties given in [
6] will also be briefly recalled.
Proposition 3. If then where is the complete exponential Bell polynomial (5) of degree Proof. Observe that
and
The result follows from (
7) with
and
since from (
26) we have
□
Corollary 1. If then and where the sum is over all the partitions and is given in (21). Proof. From (
28), using (
8) and (
3), we have
by which (
29) follows observing that
and
□
As a corollary of Proposition 3 and recursion (
6), the sequence
in (
28) satisfies the following equations.
Corollary 2. If then Proof. The result follows using (
6) since we have
□
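The recursion obtained from the Bell polynomial recurrence coincides with the classical recursion for compound Poisson probabilities, which gives a practical way to tabulate the pmf. The sketch below is a minimal illustration under stated assumptions: the HLZD pmf is taken in one common parametrization from the literature, $p_k \propto \theta^k / (k+a)^{s+1}$ for $k = 1, 2, \ldots$ (the paper's own definition in Section 3 takes precedence), it is normalized by a truncated sum up to `kmax`, and the Poisson parameter is called `lam`. All names and numerical values are hypothetical.

```python
from math import exp

def hlzd_pmf(theta, s, a, kmax):
    """Assumed HLZD pmf on k = 1..kmax, p_k proportional to theta^k / (k+a)^(s+1).
    The normalizing constant is approximated by truncating the series at kmax."""
    w = [theta ** k / (k + a) ** (s + 1) for k in range(1, kmax + 1)]
    total = sum(w)
    return [v / total for v in w]

def pshlzd_pmf(lam, p, nmax):
    """P(X = n) for n = 0..nmax, where X is a Poisson(lam)-stopped sum of i.i.d.
    r.v.'s with pmf p (p[0] is the probability of the value 1):
    P(X = n+1) = lam/(n+1) * sum_k (k+1) p_{k+1} P(X = n-k)."""
    q = [exp(-lam)]
    for n in range(nmax):
        acc = sum((k + 1) * p[k] * q[n - k] for k in range(min(n + 1, len(p))))
        q.append(lam / (n + 1) * acc)
    return q

lam = 1.5                              # hypothetical Poisson parameter
p = hlzd_pmf(0.5, 1.0, 0.5, kmax=200)  # hypothetical HLZD parameters
q = pshlzd_pmf(lam, p, nmax=200)
print(q[:5], sum(q))
```

The probability of zero equals $e^{-\lambda}$ (here `exp(-lam)`), which is the source of the excess zeroes mentioned in the Introduction.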
The PSHLZD is unimodal if
and
([
6], Property 1).
4.1. Log-Concavity
Under suitable conditions, the PSHLZD is log-concave.
Proposition 4. If and then X has a log-concave cumulative distribution function (cdf), that is Proof. According to ([
24], Theorem 1), a random sum
of i.i.d. r.v.’s has a log-concave cdf if
Z is strongly unimodal and the distribution of
has a decreasing pdf. Thus, the result follows as
with
which has a log-concave pdf (strongly unimodal), and
with a decreasing pdf when
(see Proposition 2). □
Proposition 4 gives a sufficient condition to get cdf log-concavity. A different way is to consider the sequence
Indeed, if
X has a log-concave pdf (
19), then its cdf is also log-concave [
24]. In the more general setting of generalized r.v.’s,
X has a log-concave pdf if and only if the sequence
is log-concave with
and
Equation (
30) follows from Equation (2.3) in [
23] using the general partition polynomials (
8). When
a necessary and sufficient condition to recover strong unimodality is related to the magnitude of
and
as the following theorem shows.
Theorem 2. If X is a generalized r.v. with Y strongly unimodal and then X is strongly unimodal if and only if .
Note that a similar result is proved in ([
25], Theorem 4). We provide a different proof using the following lemma.
Lemma 1. If is a log-concave sequence, then the sequence is log-concave if and only if with given in (5). Proof. If
with
is a log-concave sequence of non-negative real numbers and the sequence
is defined by
then the sequence
is log-concave [
26]. Equation (
31) parallels (
7). Therefore, the sequence
results as log-concave if the sequence
is log-concave. Note that for
we have
which easily reduces to
always satisfied when
is log-concave. Now let
. We have
is log-concave if and only if
and the result follows. □
Proof of Theorem 2. Following the same arguments as in Proposition 3, for a generalized r.v. with
(
30) reduces to
with
for
The sequence
is log-concave if and only if the sequence
is log-concave. The result follows using Lemma 1. □
Corollary 3. If is strongly unimodal if and only if
4.2. Moments and Cumulants
PSHLZD moments and cumulants have closed form expressions in terms of moments of
Proposition 5. If then with the k-th complete exponential Bell polynomial (5) and the first k moments of given in (16). Proof. If
and
are the mgf’s of
and
respectively, then
from (
27). Equation (
32) follows as the rhs of (
33) can be written as (
7), with
and
□
Remark 1. Taking into account (33), if then with and that is X is a compound Poisson r.v. Therefore the PSHLZD is an infinitely divisible distribution [27]. Moments (
32) can be written explicitly using (
9). A straightforward corollary of recursion (
6) is the following.
Corollary 4. If
denotes the
k-th central moment of
then
Proposition 6. If then where are the first k factorial moments of given in (17). Proof. Let us recall that, if
is the factorial mgf of
then
with
the pgf of
Therefore we have
with
the generating function of the factorial moments
Equation (
34) follows as the rhs of (
35) can be written as (
7), with
and
□
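As a numerical sanity check of the moment formula for Poisson-stopped sums, the sketch below evaluates the complete exponential Bell polynomials at the moments of the summand multiplied by the Poisson parameter and compares the result with the moments computed directly from the pmf. To keep the example short, the summand is a toy r.v. on {1, 2, 3} rather than an HLZD; the toy pmf and the name `lam` for the Poisson parameter are assumptions made only for illustration.

```python
from math import comb, exp

def complete_bell(n, x):
    """B_n(x_1..x_n) via B_{m+1} = sum_k C(m,k) B_{m-k} x_{k+1}, with B_0 = 1."""
    B = [1]
    for m in range(n):
        B.append(sum(comb(m, k) * B[m - k] * x[k] for k in range(m + 1)))
    return B[n]

lam = 2.0                      # hypothetical Poisson parameter
p = {1: 0.6, 2: 0.3, 3: 0.1}   # toy summand pmf, standing in for the HLZD

# Raw moments E[Y^r] of the summand, r = 1, 2, 3
m = [sum(k ** r * pk for k, pk in p.items()) for r in (1, 2, 3)]

# Moments of the stopped sum: E[X^k] = B_k(lam*m_1, ..., lam*m_k)
x = [lam * mr for mr in m]
bell_moments = [complete_bell(k, x) for k in (1, 2, 3)]

# Direct computation from the pmf, tabulated with the compound Poisson recursion
q = [exp(-lam)]
for n in range(150):
    q.append(lam / (n + 1) * sum(k * p.get(k, 0.0) * q[n + 1 - k]
                                 for k in range(1, n + 2)))
direct_moments = [sum(n ** r * qn for n, qn in enumerate(q)) for r in (1, 2, 3)]

print(bell_moments)
print(direct_moments)
```

The two lists should agree up to truncation and rounding error; in particular the mean is `lam * m[0]` and the variance is `lam * m[1]`.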
Proposition 7. If is the n-th cumulant of then for where is the n-th moment of given in (16).
4.3. Maximum Likelihood Estimation
Suppose we have
independent observations of
The MLE of
is
with
the log-likelihood function and
The MLE of the PSHLZD parameters in this case must be directly tackled with the global optimization method described in
Section 3.4, since
is not analytically tractable referring to (
29).
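Since the likelihood equations have no closed form, the log-likelihood has to be maximized numerically. The sketch below illustrates the idea with a deliberately crude simulated annealing loop in pure Python. It is not the procedure of Section 3.4: the HLZD parametrization ($p_k \propto \theta^k/(k+a)^{s+1}$), the parameter box, the linear cooling schedule, the proposal scale and the frequency table `counts` are all assumptions introduced for illustration.

```python
import math
import random
from collections import Counter

def hlzd_pmf(theta, s, a, kmax):
    """Assumed HLZD pmf on k = 1..kmax with truncated normalization."""
    w = [theta ** k / (k + a) ** (s + 1) for k in range(1, kmax + 1)]
    total = sum(w)
    return [v / total for v in w]

def pshlzd_pmf(lam, p, nmax):
    """Compound Poisson recursion for P(X = n), n = 0..nmax."""
    q = [math.exp(-lam)]
    for n in range(nmax):
        acc = sum((k + 1) * p[k] * q[n - k] for k in range(min(n + 1, len(p))))
        q.append(lam / (n + 1) * acc)
    return q

def log_likelihood(params, counts, kmax=200):
    """Log-likelihood of a frequency table {value: count} under the PSHLZD."""
    lam, theta, s, a = params
    q = pshlzd_pmf(lam, hlzd_pmf(theta, s, a, kmax), max(counts))
    if any(q[n] <= 0.0 for n in counts):
        return -math.inf
    return sum(c * math.log(q[n]) for n, c in counts.items())

def anneal(counts, start, bounds, steps=300, t0=1.0, seed=0):
    """Crude simulated annealing: Gaussian proposals clipped to the box,
    Metropolis acceptance, linear cooling. Returns the best point visited."""
    rng = random.Random(seed)
    cur, cur_ll = list(start), log_likelihood(start, counts)
    best, best_ll = list(cur), cur_ll
    for i in range(steps):
        temp = t0 * (1 - i / steps) + 1e-6
        prop = [min(max(v + rng.gauss(0, 0.05 * (hi - lo)), lo), hi)
                for v, (lo, hi) in zip(cur, bounds)]
        ll = log_likelihood(prop, counts)
        if ll >= cur_ll or rng.random() < math.exp((ll - cur_ll) / temp):
            cur, cur_ll = prop, ll
            if ll > best_ll:
                best, best_ll = list(prop), ll
    return best, best_ll

# Hypothetical frequency table of shifted observations (value: count)
counts = Counter({0: 520, 1: 130, 2: 90, 3: 60, 4: 45, 5: 30, 7: 20, 10: 12, 15: 6, 25: 3})
start = [1.0, 0.5, 1.0, 1.0]                                  # (lam, theta, s, a)
bounds = [(0.01, 10.0), (0.01, 0.99), (0.0, 5.0), (0.0, 5.0)]  # assumed search box
start_ll = log_likelihood(start, counts)
best, best_ll = anneal(counts, start, bounds)
print(best, best_ll)
```

Restricting the search to a box plays the same role as the reduction of the parameter space mentioned in Section 6.2; in practice one would run several restarts with different seeds and compare the attained log-likelihoods.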
6. Data-Fitting
6.1. Rainfall Depths and Interarrival Times
By rainfall depth we mean the depth to which liquid precipitation would cover a horizontal surface in an observation period if nothing could drain, evaporate or percolate from this surface. Let a time series of rainfall data be defined as $h = \{h_1, h_2, \ldots, h_n\}$, where $h$ (mm) is the rainfall depth recorded at a fixed uniform unit $\tau$ of time (e.g., a day). A day $k$ is considered rainy if the rainfall depth $h_k \geq h^{*}$, where $h^{*}$ is a fixed rainfall threshold. The sub-series of $h$ of the rainy days can be defined as the event series $E = \{t_1, t_2, \ldots\}$, where $t_k$ is an integer multiple of the time-scale $\tau$. The sequence built with the times elapsed between each element of $E$ (except the first one) and the immediately preceding one is defined as the interarrival time series $IT = \{T_1, T_2, \ldots\}$. In order to select an appropriate distribution for $IT$, some statistical characteristics usually observed in $IT$ samples have to be considered: very high variance and skewness, a relatively high frequency associated with the observation $IT = 1$, monotonically decreasing frequencies with a slowly decaying tail, and a drop in the passage from the frequency at $IT = 1$ to the one at $IT = 2$. The HLZD in (13) has been fitted to rainfall $IT$ in [5] for stations in Sicily and in [30] for stations in Piedmont, with good results. However, it has not yet been considered for rainfall depths. Recall that in the following we model rainfall depths with a discrete r.v.
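The construction of the event series and of the interarrival time series can be sketched in a few lines. In this minimal example the daily depths, the threshold value and the convention that a day is rainy when its depth is at least the threshold are illustrative assumptions; days are numbered from 1 and the time unit is one day.

```python
def event_series(h, threshold):
    """Indices (1-based, in days) of the rainy days: depth >= threshold."""
    return [k for k, depth in enumerate(h, start=1) if depth >= threshold]

def interarrival_times(events):
    """Times elapsed between each rainy day and the previous one."""
    return [t - s for s, t in zip(events, events[1:])]

# Hypothetical daily rainfall depths in mm
h = [0.0, 3.2, 1.0, 0.0, 0.0, 0.4, 7.5, 0.0, 2.1]
E = event_series(h, threshold=1.0)
IT = interarrival_times(E)
print(E)   # rainy days
print(IT)  # interarrival times
```

An interarrival time equal to 1 corresponds to two consecutive rainy days, while a value $T > 1$ corresponds to a dry spell of $T - 1$ days between two rainy days.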
6.2. The Data
In this paper, the series analyzed were obtained from the recorded rainfall observations, using a rainfall threshold $h^{*}$ (in mm) equal to the conventional value stated by the World Meteorological Organization in order to discriminate between rainy and non-rainy days. This dataset has not been previously considered in the literature and consists of both $IT$ and $h$ measured over 70 years at the following five stations: Floresta, Trapani, Torino, Oxford, Ceva. They were chosen in order to represent different climates from the point of view of rainfall characteristics. Floresta and Trapani represent the Mediterranean climate with a very wet and a very dry situation respectively. For both stations, the rainfall is concentrated in the colder part of the year, as is typical of the Mediterranean climate. Torino and Ceva are more continental, but Ceva is more influenced by the Ligurian sea. Therefore, Torino has its maximum rain in Spring, while that of Ceva is in Autumn, because of the heating of the sea in the Summer. Finally, Oxford is a northern European station with rainfall homogeneously spread across the whole year. The recordings start in 1947 and end in 2017, for a total of 70 years. Moreover, the time series were further subdivided. Thus for each station we considered (i) the whole time series, without subdividing the different seasons; (ii) all the wet seasons and all the dry seasons; (iii) the standard meteorological seasons, for a total of 33 samples. Note that we did not consider wet and dry seasons for the Oxford station due to its climate.
More specifically, let station_name ∈ {Floresta, Trapani, Torino, Oxford, Ceva} and season_name ∈ {wet, dry, spring, summer, autumn, winter}. Then the samples tagged with station_name year span the whole length of the series for the station_name station, while the samples tagged with station_name season_name are the union of all the season_name seasons in the whole time series for the station_name station, omitting all the other seasons from the dataset. The MLE was used to fit the HLZD (Section 3), the PSHLZD (Section 4) and the OIHLZD (Section 5) to the dataset. (Note that the PSHLZD has support $\{0, 1, 2, \ldots\}$ and the r.v. $IT$ naturally has support $\{1, 2, \ldots\}$, so we had to consider the shifted r.v. $IT - 1$.) In all cases, the MLE has been tackled with the method described in Section 3.4. The global optimization procedure was further simplified by the previously mentioned statistical characteristics of the data, which allowed us to work on a subset of the parameter space.
6.3. Results
In the following we summarize the results of the distribution fitting for $IT$ and $h$ data. The fitting was satisfactory for both the PSHLZD and the HLZD. The goodness-of-fit was assessed by following the methodology suggested in [31]. In the case of long-tailed distributions, the goodness-of-fit assessed through the classical $\chi^2$ test might be biased, because if there are several small classes, strong asymmetry might occur [31] and some problems of inaccuracy might appear if the classes are grouped [32]. The alternative procedure used to test the goodness-of-fit relies on Monte Carlo simulation to numerically reconstruct the null distribution of the $\chi^2$ test statistic and to compute the $p$-values [33].
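The Monte Carlo idea can be sketched as follows: samples of the same size as the data are drawn from the fitted distribution, the $\chi^2$ statistic is computed on each of them, and the $p$-value is the fraction of simulated statistics at least as large as the observed one. This is a simplified illustration and not necessarily the procedure of [33]: the fitted pmf is held fixed instead of being re-estimated on each simulated sample, no classes are grouped, and the toy pmf, sample sizes and number of simulations are assumptions.

```python
import random
from collections import Counter

def chi_square_stat(sample, pmf):
    """Pearson chi-square statistic over the support listed in pmf."""
    n = len(sample)
    counts = Counter(sample)
    return sum((counts.get(k, 0) - n * pk) ** 2 / (n * pk)
               for k, pk in pmf.items() if pk > 0)

def monte_carlo_p_value(sample, pmf, n_sim=500, seed=0):
    """Estimate the p-value by simulating the statistic under the fitted pmf."""
    rng = random.Random(seed)
    values, probs = zip(*pmf.items())
    observed = chi_square_stat(sample, pmf)
    exceed = 0
    for _ in range(n_sim):
        sim = rng.choices(values, weights=probs, k=len(sample))
        if chi_square_stat(sim, pmf) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)

# Toy fitted pmf standing in for a fitted HLZD or PSHLZD
pmf = {1: 0.6, 2: 0.25, 3: 0.1, 4: 0.05}
good = random.Random(1).choices(list(pmf), weights=list(pmf.values()), k=200)
bad = [4] * 200  # sample clearly incompatible with the fitted pmf

p_good = monte_carlo_p_value(good, pmf)
p_bad = monte_carlo_p_value(bad, pmf)
print(p_good, p_bad)
```

Adding one to both numerator and denominator keeps the estimated $p$-value strictly positive, which is a common convention for Monte Carlo tests.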
To further inspect the differences between the distributions, we have measured the fitting errors whose magnitude is strictly related to the discrepancy between the empirical frequencies and the fitted ones. Since many empirical frequencies are zero (in the tail), the cdf has been considered. In particular we considered the mean absolute error (MAE) and the mean relative absolute error (MRAE). Let us recall that, if is an ordered sample, then and with the empirical cdf and the fitted cdf.
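A minimal sketch of the two error measures is given below. It assumes that both are averaged over the ordered sample points, that the MAE is the mean absolute difference between the empirical and the fitted cdf, and that the MRAE divides each absolute difference by the empirical cdf at the same point; the choice of denominator in particular is an assumption, and the paper's own formulas take precedence. The toy sample and the geometric cdf used as "fitted" model are illustrative.

```python
from bisect import bisect_right

def fit_errors(sample, fitted_cdf):
    """Return (MAE, MRAE) between the empirical cdf and fitted_cdf,
    both evaluated at the ordered sample points."""
    xs = sorted(sample)
    n = len(xs)
    mae = 0.0
    mrae = 0.0
    for x in xs:
        emp = bisect_right(xs, x) / n   # empirical cdf at x (ties included)
        err = abs(emp - fitted_cdf(x))
        mae += err / n
        mrae += err / emp / n           # emp > 0 at every sample point
    return mae, mrae

# Toy example: geometric cdf with success probability 0.5 as the fitted model
geometric_cdf = lambda x: 1.0 - 0.5 ** int(x)
sample = [1, 1, 2, 4]
mae, mrae = fit_errors(sample, geometric_cdf)
print(mae, mrae)
```

Working with cdf's rather than frequencies avoids divisions by the many zero empirical frequencies in the tail, which is the reason given in the text for this choice.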
6.3.1. Interarrival Times
We have compared the fitted PSHLZD with the fitted HLZD and the fitted PSHLZD with the fitted OIHLZD. To summarize the results, we have selected 4 of the 33 available samples, as they were considered particularly representative of the whole dataset. The selected samples were Floresta Summer, Trapani Wet, Trapani Dry and Torino Winter.
Figure 1 is an example of $IT$ empirical frequencies: they usually range from a high peak located at $IT = 1$ to a multitude of much smaller values in the slowly decaying tails. Therefore, to perform comparisons, a log-log scale has been adopted for all the plots.
Figure 2 shows plots of the fitted PSHLZD (solid line) and HLZD (dotted line) compared with the empirical frequencies (dots) for the 4 selected samples. The fit in both cases is very good. In particular, in the cases of Floresta Summer, Torino Winter and Trapani Dry the PSHLZD succeeds in fitting the drop from $IT = 1$ to $IT = 2$ whereas the HLZD fails. This happens in the drier periods, where this drop is more prominent.
Moreover, Figure 1 shows the dominance of the frequencies corresponding to $IT = 1$ and $IT = 2$, which are particularly meaningful in hydrology.
Figure 3 shows the plots of MAE and MRAE obtained by comparing the fitted cdf's of the PSHLZD (circle) and the HLZD (triangle) with the empirical $IT$ cdf's. Note that the MAE and the MRAE are in general lower for the PSHLZD. Due to the dominance of the frequency corresponding to $IT = 1$, we explored modelling $IT$ with the OIHLZD for all the samples. In all cases, the fitted OIHLZD and the fitted PSHLZD show minimal differences and are almost indistinguishable (see Figure 4 for an example), confirming the great flexibility of the latter distribution.
To conclude the validation analysis, we compared sample means and sample variances with the corresponding theoretical moments of the HLZD and the PSHLZD computed in Section 3 and Section 4 respectively. In Table 1, we show the results for the 4 selected samples. In all cases, the fitted distributions agree with the sample means. For the variances, the PSHLZD performs better in many cases. In Table 1, an exception is Trapani Wet because the data are highly dispersed.
6.3.2. Rainfall Depths
In this section, we summarize the fitting of the rainfall depth time series using both the PSHLZD and the HLZD. We omit the comparison with the OIHLZD since this distribution does not add further insight into the fitting, as was the case for the $IT$ datasets.
Figure 5 shows again an empirical frequency histogram ranging from a high peak at $h = 1$ to a multitude of much smaller values in the slowly decaying tails. As in the previous section, we employed a log-log scale for all the plots. The selected samples were Ceva Winter, Torino Winter, Floresta Dry and Trapani Summer.
In Figure 6 we have plotted the fitted PSHLZD and HLZD compared with the empirical frequencies. As with $IT$ samples, the fit is very good, even better than in the $IT$ case. Moreover, there is less difference between the performances of the PSHLZD and the HLZD.
Figure 7 shows the plots of the MAE and the MRAE obtained by comparing the fitted cdf's of the PSHLZD (circle) and the HLZD (triangle) with the empirical $h$ cdf's. Even though both errors are smaller for the PSHLZD, there is less difference between the two distributions, and the errors are generally lower than in the $IT$ case.
7. Conclusions
The first part of this paper focuses on a class of discrete distributions useful to describe very high counts of ones and long tails. We have reviewed their main properties using the combinatorics of exponential Bell polynomials. This device has permitted the derivation of closed form expressions for the pdf's and their convolutions, as well as for moments and cumulants. Moreover, new results on log-concavity have been presented. We have also considered the OIHLZD to compare its features with the HLZD and the PSHLZD. This in-depth analysis was aimed at investigating how to use these models to find a better fit for rainfall data. Indeed, the PSHLZD and the HLZD were fitted to Interarrival Times $IT$ and rainfall depths $h$ data coming from 5 different stations, which compose a dataset never previously analyzed in the literature. The $h$ data were treated as observations of a discrete r.v., which is not the usual practice in the literature, but seems reasonable when taking into account how they are measured. The fitting was performed with the classical MLE method, but the likelihood was maximized using the Simulated Annealing procedure, which turns out to be fundamental since there are no closed forms of the likelihood equations. The fit was very good for both distributions, with the PSHLZD performing slightly better than the HLZD. This mostly happens for the $IT$ data. Moreover, the PSHLZD was also able to replicate the fit of the OIHLZD, further validating its flexibility.
From the modelling point of view, let us underline two final remarks. Firstly, the fit was excellent for both the $IT$ and the $h$ data, suggesting that the PSHLZD can be proposed as a general framework in rainfall modelling. Secondly, it is worth underlining that these models capture the variability of rainfall stochastic phenomena, even though the 5 considered stations represent very different climates: a case study not yet considered in the literature dealing with previous applications of the HLZD. Future work will consider modelling the dependence (inter-correlation) between $IT$ and $h$. Given the remarkable performance of these distribution families in univariate modelling, a first step would be to consider bivariate modified power series distributions [34] and the methods to estimate their parameters on a rainfall time series. This is on the agenda for our future developments.