1. Introduction
The sudden rise of some harmful natural phenomena, such as earthquakes, tsunamis, and air pollution, has made it necessary to study statistical models that can evaluate these rare phenomena and thereby help to avoid their dangers. In the last two decades, extreme value theory (EVT) has emerged as one of the most significant statistical modeling disciplines for the applied sciences. The EVT can be applied to environmental studies, such as hydrology, pollution, rainfall, floods, wind gusts, and corrosion, in order to develop models for describing the distribution of extreme events. The distributional properties of the extreme and intermediate order statistics and exceedances over (below) high (low) thresholds are determined by the upper and lower tails of the underlying distribution. The most important challenges in any application of such extreme value models are the scarcity of extreme data, the choice of the threshold (the beginning of the tail), and the choice of the methods for estimating the unknown parameters. Much of the classical EVT is concerned with the distributional properties of the maximum
$X_{n:n}=\max(X_1,X_2,\dots,X_n)$ of iid RVs $X_1,X_2,\dots,X_n,$
and all of the results obtained for the maximum lead, of course, to analogous results for the minimum through the obvious relation $\min(X_1,X_2,\dots,X_n)=-\max(-X_1,-X_2,\dots,-X_n).$
The core of the EVT is the extreme value distributions, which are well known in the literature (cf. [1]), and they are used as approximations to DFs of the normalized partial maximum $X_{n:n}$ of iid RVs. A DF $F$ is said to belong to the l-max domain of attraction of an extreme value distribution $G$ under linear normalization, denoted by $F\in\mathcal{D}_{\ell}(G),$ if there exist norming constants $a_n>0$ and $b_n\in\mathbb{R}$ such that
$$P(X_{n:n}\le a_nx+b_n)=F^n(a_nx+b_n)\xrightarrow[n]{w}G(x),\qquad(1)$$
where “$\xrightarrow[n]{w}$” stands for weak convergence, as $n\to\infty.$
It is well known that the asymptotic relation (1) yields only three possible types of non-degenerate limiting DFs, which are the Fréchet, Weibull, and Gumbel DFs. Moreover, any non-degenerate DF $G$ is an extreme value distribution (i.e., it is a limit in (1)) if and only if it satisfies the stability relation
$$G^n(A_nx+B_n)=G(x)$$
for every integer $n,$ where $A_n>0$ and $B_n$ are some suitable constants (cf. [1,2]). For this reason, these limits are called l-max-stable laws. On the other hand, these l-max-stable laws may be written in the von Mises-Jenkinson format
$$G_\gamma(x;\mu,\sigma)=\exp\left\{-\left(1+\gamma\,\frac{x-\mu}{\sigma}\right)^{-1/\gamma}\right\},\qquad 1+\gamma\,\frac{x-\mu}{\sigma}>0,$$
where $\mu$ and $\sigma>0$ are the location and scale parameters, respectively, while $\gamma$ is a shape parameter that is known as the extreme value index (EVI), which is the central issue in empirical research dealing with extreme events. Clearly, the DF $G_\gamma(x;\mu,\sigma),$ which is known as the generalized extreme value distribution under linear normalization (GEVL), describes the Gumbel, Fréchet, and Weibull types in the cases $\gamma=0$ (interpreted as $\gamma\to0$), $\gamma>0,$ and $\gamma<0,$ respectively. The GEVL provides a prevailing parametric approach for modeling extreme events, which is known as the block maxima (BM) approach. Its application consists of partitioning a data set into blocks of equal length and fitting the GEVL to the set of block maxima. An extension of the BM approach is the peak over threshold (POT) approach (see [
1]), where we only consider the observations that lie above an appropriate threshold. The generalized Pareto distribution under linear normalization (GPDL), introduced by [3,4], is considered a foremost pillar of the POT approach. The GPDL is the limit distribution of scaled excesses over high thresholds, which has the form
$$W_\gamma(x;\sigma)=1-\left(1+\gamma\,\frac{x}{\sigma}\right)^{-1/\gamma},\qquad x\ge 0,\quad 1+\gamma\,\frac{x}{\sigma}>0,\quad \sigma>0.$$
In order to widen the class of limit laws in EVT for solving more approximation problems, the authors of [5] extended the EVT under the power normalization $G_n(x)=b_n|x|^{a_n}\mathcal{S}(x),\ a_n,b_n>0,$ where $\mathcal{S}(x)=-1,0,1$ according to $x<0,\ x=0,\ x>0,$ respectively. Another reason for using the power normalization in EVT is the possibility of obtaining a better rate of convergence (cf. [6]). Clearly, the power normalization is a strictly monotone continuous transformation. Therefore, this transformation does not waste any of the information that the data contain (e.g., the sufficiency property is preserved under a one-to-one transformation). Nevertheless, we might lose some flexibility by using such a normalization. For example, under this normalization we cannot change the sign of the data or get rid of zero. The DF $F$ is said to belong to the p-max domain of attraction of a non-degenerate DF $H$ under power normalization, denoted by $F\in\mathcal{D}_p(H),$ if for some norming constants $a_n>0$ and $b_n>0$ we have
$$F^n\left(b_n|x|^{a_n}\mathcal{S}(x)\right)\xrightarrow[n]{w}H(x).\qquad(3)$$
The possible p-types of limiting DFs $H$ in (3) are the p-max-stable laws satisfying the stability relation
$$H^n\left(B_n|x|^{A_n}\mathcal{S}(x)\right)=H(x)$$
for every $n,$ where $A_n>0$ and $B_n>0$ are some suitable sequences of constants. Here, two DFs, $F$ and $G,$ are of the same p-type if we can find $A>0$ and $B>0$ for which $F(x)=G\left(B|x|^{A}\mathcal{S}(x)\right)$ for all $x.$ Consequently, any non-degenerate DF $H$ is p-max stable, or equivalently $H$ is a limit in (3), if and only if for every $n$ the two DFs $H$ and $H^n$ are of the same p-type. In [
7] the author has exemplified these types by the von Mises representation
Each of these families is called a generalized extreme value distribution under power normalization (GEVP). It is well known that the p-max-stable laws attract more distributions than the l-max-stable laws. This fact virtually means that the linear model may fail to fit an extreme data set that the power model fits successfully (see [
1]). The authors of [
8] applied the BM approach under power normalization using the GEVPs. Moreover, in a series of papers, refs. [
9,
10,
11,
12,
13] developed the modeling of extreme values under power normalization by defining and using the generalized Pareto distributions under power normalization (GPDPs),
to real extreme-value data (for more details regarding the power transformation, see [
14,
15,
16,
17,
18]).
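The BM and POT approaches under linear normalization described above can be illustrated with a minimal Python sketch. It assumes the standard forms of the GEVL, $\exp\{-(1+\gamma z)^{-1/\gamma}\}$ with $z=(x-\mu)/\sigma$, and of the GPDL, $1-(1+\gamma x/\sigma)^{-1/\gamma}$. The function names, the block size, and the simulated unit-exponential data are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def gevl_cdf(x, gamma, mu=0.0, sigma=1.0):
    """GEVL CDF exp(-(1 + gamma*z)^(-1/gamma)), z = (x - mu)/sigma (Gumbel if gamma == 0)."""
    z = (x - mu) / sigma
    if gamma == 0.0:
        return math.exp(-math.exp(-z))
    t = 1.0 + gamma * z
    if t <= 0.0:
        # Outside the support: below the lower endpoint (gamma > 0)
        # or above the upper endpoint (gamma < 0).
        return 0.0 if gamma > 0.0 else 1.0
    return math.exp(-t ** (-1.0 / gamma))


def gpdl_cdf(x, gamma, sigma=1.0):
    """GPDL CDF 1 - (1 + gamma*x/sigma)^(-1/gamma) for x >= 0 (exponential if gamma == 0)."""
    if x <= 0.0:
        return 0.0
    if gamma == 0.0:
        return 1.0 - math.exp(-x / sigma)
    t = 1.0 + gamma * x / sigma
    if t <= 0.0:
        return 1.0  # beyond the finite upper endpoint (only possible when gamma < 0)
    return 1.0 - t ** (-1.0 / gamma)


def block_maxima(data, block_size):
    """BM step: split data into consecutive full blocks and keep each block's maximum."""
    n_blocks = len(data) // block_size
    return [max(data[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]


def excesses(data, threshold):
    """POT step: excesses x - threshold of the observations above the threshold."""
    return [x - threshold for x in data if x > threshold]


# Illustration on simulated unit-exponential data (an assumption, not the paper's data).
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(10_000)]
maxima = block_maxima(sample, 100)

# With a_n = 1 and b_n = log(n), maxima of n unit-exponential RVs are roughly Gumbel.
shifted = [m - math.log(100) for m in maxima]
empirical = sum(m <= 0.0 for m in shifted) / len(shifted)
print("empirical P(max - log n <= 0):", empirical)
print("Gumbel value at 0:", gevl_cdf(0.0, 0.0))
```

In practice the parameters would be estimated from `maxima` (BM) or from `excesses(sample, u)` for a high threshold `u` (POT), for example by maximum likelihood. The sketch only shows the data reduction and the two model families.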
Once more, in order to widen the class of the limit laws in EVT, in [
19] the authors extended the EVT under exponential normalization
Under this transformation, we can say that the DFs
F and
G are of the same
e-type if
for some constants
In this case, a non-degenerate DF
is said to be an
e-max-stable law if there exists a DF
F and norming constants
, such that
If (
4) is satisfied, then we can say that the DF
F belongs to the e-max-domain of attraction of the non-degenerate DF
under
e-normalization, denoted by
The authors of [
19] showed that the possible limiting DFs
in (
4) are the e-max-stable laws, which satisfy the following stability property: any non-degenerate DF
is e-max stable, or equivalently
is a limit in (
4), if and only if for every
the two DFs
and
are of the same e-type (for more details about the exponential transformation, see [
20]).
In [
19], the authors showed that the possible limit laws arising from (
4) attract more DFs than the p-max-stable laws. This fact virtually means that the linear and power models may fail to fit the given extreme data, while the exponential model succeeds. This gives us sufficient motivation for developing the modeling of extreme values via the exponential model, denoted by the e-model. This development is the first objective of this paper, and it is achieved in two stages. The first stage is to derive the generalized extreme value distributions related to the EVT under exponential normalization. These asymmetric DFs enable us to apply the BM approach. The second stage is to derive the possible generalized Pareto families of asymmetric distributions related to the EVT under exponential normalization. These families pave the way to applying the POT approach. The second objective of this paper is to compare the EVT under linear, power, and exponential normalization via real data sets of air pollution.
The rest of this paper is structured as follows. In
Section 2, we deduce the generalized extreme value distributions relating to the EVT under exponential normalization (GEVEs). In
Section 3, which is devoted to the theoretical details, we first suggest an estimate for the EVI in each of the GEVEs. This estimate corresponds to a Dubey estimate in the GEVL model (
3) and the GEVP models
(cf. [
8]). Secondly, we derive the generalized Pareto distributions under exponential normalization (GPDEs). Finally, we propose estimators for the EVI in these GPDEs.
Section 4 is devoted to a simulation study, which illustrates and corroborates the theoretical results. In
Section 5, the EVT under linear, power, and exponential normalization is applied to real data sets, and the three models are compared.
3. BM Approach and GPDEs
When considering the BM approach, let
be the set of maxima of the given blocks. Clearly, in view of the shape of the e-types (
6), the modeling under exponential normalization can only be applied if all of these maxima belong to exactly one of the non-overlapping intervals
and
More specifically, if
or
or
or
we would select the model
or
or
or
respectively. Subsequently, we compute the maximum likelihood (ML) estimates
of
as the numerical solutions of the likelihood equations based on the selected model. The estimate of the shape parameter
that corresponds to a Dubey estimate in the GEVL model is a linear combination of ratios of spacings
where
and
Clearly, the statistic
is invariant under the exponential transformation. Now, relying on the obvious relations: (1)
where
or
or
or
if
or
or
or
respectively, and
is the sample DF, (2) for large
we have
and (3)
we obtain
The relation (
7), after some algebra, yields
if
satisfy the equation
Upon taking the logarithm of both sides of (
8), we get the estimate
On the other hand, if
for some
we get the estimate family
By taking
we get
In
Section 4, we will compare the ML method and estimate
for estimating
via the
Moreover, we will detect the value of
which gives the best estimate for
It will be revealed that the estimate (
9) is very poor for large values of
(
Regardless of the fact that this estimate is based on the BM approach, this approach also suffers from other problems, among them that it considers only several maxima within several blocks and ignores most of the other data. In the spirit of the results of [
3,
4], we propose applying the POT approach based on the EVT under exponential normalization, where we deal with the right tail
for large
i.e., we deal with top-order observations. In order to adapt this approach for the e-model we derive the GPDE. Our focus will be mainly on the case
via Theorem 1. Clearly, the case
covers most of the important practical applications of the EVT. However, the case
will be briefly discussed in Theorem 3. In the next theorems and throughout the paper, we adopt the notations
and “
” to mean convergence as
Theorem 1. Let (4) be satisfied with Then there exists such that where “” means weak convergence, as and
- a.
and if;
- b.
and if
Proof. The proof of Part [a]: In view of the EVT, we obtain
which, in view of the assumption
implies that
On the other hand, (
11) cannot be true unless
for all
x for which
Thus, we can write
By using the modified Khinchin’s Theorem (cf. [
1]), the relations (
11) and (
12) yield
Now, let
n be chosen, such that
where
u is any real number such that
(note that by putting
in (
11), we get
). Subsequently, (
13) implies that
Thus, by putting
and again applying the modified Khinchin’s Theorem, (
11) may be written in the form
Therefore, by putting
in (
14), we get
By combining (
14) and (
15), we get, as
or equivalently as
,
which was to be proved. The proof of Part [b] is very similar to the proof of Part [a], with only obvious changes. This completes the proof of Theorem 1. □
Theorem 2 (the peak over threshold stability property). The left truncated GPDE again yields a GPDE. This means that, for every we have where and Moreover, for every we have where and
Proof. Let
Subsequently,
where
On the other hand, we have
where
This completes the proof of Theorem 2. □
Theorem 3. Let (4) be satisfied with Then there exists such that where “” means weak convergence, as and
- c.
if;
- d.
if
Moreover, the limits and satisfy the peak over threshold stability property.
Proof. The proof is very similar to the proofs of Theorems 1 and 2, with only obvious changes. □
Estimation of the EVI via GPDE Model
In this subsection, we derive estimates for the parameters
and
in the GPDE
These estimates correspond to Pickands’ estimates in the GEVL model (
2) (cf. [
4]). Let
n be the sample size and
be an integer much smaller than
Let
be the
ith largest observation in the sample,
The values
will be treated as though they were the descending order statistics from a sample of size
from the DF
for some
and
Because, for any
we have
we get
and
Clearly,
which implies
and
To estimate
and
, we replace the population quantiles
and
by the sample quantiles
and
Therefore,
In the next section, we will consider the problem of determining
m via a simulation study. Theoretically, the value
should satisfy the two conditions
and
(cf. [
4]).
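For comparison, the classical Pickands estimator under linear normalization, to which the estimates above are said to correspond, can be sketched as follows. This is the well-known GPDL version and not the paper's Formula (16), whose exact form depends on the GPDE; the function name and the tie handling are assumptions made for illustration. The sketch uses the $m$th, $2m$th, and $4m$th largest observations.

```python
import math


def pickands_estimates(sample, m):
    """Classical Pickands estimates (shape, scale) for the GPDL.

    Uses the m-th, 2m-th and 4m-th largest observations of the sample.
    """
    if m < 1 or 4 * m > len(sample):
        raise ValueError("need 1 <= m and 4*m <= sample size")
    desc = sorted(sample, reverse=True)          # descending order statistics
    x_m, x_2m, x_4m = desc[m - 1], desc[2 * m - 1], desc[4 * m - 1]
    upper, lower = x_m - x_2m, x_2m - x_4m
    if upper <= 0 or lower <= 0:
        raise ValueError("ties among the selected order statistics; try another m")
    # Shape: log of the ratio of spacings, divided by log 2.
    gamma = math.log(upper / lower) / math.log(2.0)
    # Scale: gamma*(x_2m - x_4m)/(2^gamma - 1), with limit (x_2m - x_4m)/log 2 at gamma = 0.
    if abs(gamma) < 1e-12:
        sigma = lower / math.log(2.0)
    else:
        sigma = gamma * lower / (2.0 ** gamma - 1.0)
    return gamma, sigma


# Tiny worked example: the spacings are 7 - 3 = 4 and 3 - 1 = 2, so the ratio is 2.
print(pickands_estimates([7.0, 3.0, 2.0, 1.0], m=1))
```

The choice of `m` trades bias against variance: a small `m` uses only the most extreme observations, while a large `m` moves away from the tail.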
4. Simulation Study
In
Table 1, we compare the ML method and Formula (
9) for estimating the EVI
via the first GEVE defined in (
6). Additionally, from
Table 1, we determine the value of
which gives the best estimate for
In
Table 1, we present estimates for each value of
by applying the ML method and computing the estimate
that resulted from (
9) for different quantiles
This procedure is repeated 1000 times to obtain the average estimates (for the given different values of
q) for
and their mean square errors (MSE’s).
Table 1 shows that the estimates (
9) are poor when compared with the ML estimates. Moreover, the precision of the estimate
closely depends on the value of
It was revealed that when
the estimates computed by (
9) became very poor; for this reason, in
Table 1 we only considered the values
In
Table 2, for each value of
we generate a random sample of size
from
Moreover, we choose the threshold values
(in the interval
). In view of Theorem 2, the DF of the simulated data, which come after any threshold value
is of the same type as the DF
Therefore, we can estimate the parameter
by using the ML method for each of these threshold values. This procedure is repeated 1000 times to obtain the average estimates and their MSE’s. Finally, we determine the value
which gives the best estimate for the parameter
by using the ML method.
Table 3 displays the computed estimates of
by using (
16). In
Table 3, the same procedure is applied with the exception that we choose
m instead of
k as
(note that
m).
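The repetition scheme used for the tables (simulate a sample, estimate, repeat 1000 times, then report the average estimate and its MSE) can be sketched as below. Since the GPDE is not reproduced here, the sketch substitutes the GPDL with inverse-transform sampling and the classical Pickands shape estimator. The function names, the sample size 2000, the value of `m`, and the true shape 0.5 are all illustrative assumptions and not the paper's settings.

```python
import math
import random


def gpdl_sample(rng, n, gamma, sigma=1.0):
    """Inverse-transform sample from 1 - (1 + gamma*x/sigma)^(-1/gamma), gamma != 0."""
    return [sigma * ((1.0 - rng.random()) ** (-gamma) - 1.0) / gamma for _ in range(n)]


def pickands_shape(sample, m):
    """Classical Pickands shape estimate from the m-th, 2m-th, 4m-th largest values."""
    desc = sorted(sample, reverse=True)
    upper = desc[m - 1] - desc[2 * m - 1]
    lower = desc[2 * m - 1] - desc[4 * m - 1]
    return math.log(upper / lower) / math.log(2.0)


def monte_carlo_summary(estimator, sampler, true_value, reps=1000, seed=0):
    """Average estimate and mean square error over `reps` simulated samples."""
    rng = random.Random(seed)
    estimates = [estimator(sampler(rng)) for _ in range(reps)]
    mean = sum(estimates) / reps
    mse = sum((e - true_value) ** 2 for e in estimates) / reps
    return mean, mse


# Hypothetical setting: true shape 0.5, samples of size 2000, m = 100, 200 repetitions.
TRUE_GAMMA = 0.5
mean, mse = monte_carlo_summary(
    estimator=lambda s: pickands_shape(s, m=100),
    sampler=lambda rng: gpdl_sample(rng, 2000, TRUE_GAMMA),
    true_value=TRUE_GAMMA,
    reps=200,
)
print("average estimate:", mean, "MSE:", mse)
```

Running the same summary over a grid of `m` values (or thresholds) and keeping the one with the average estimate closest to the true value mirrors the selection of the "best" entry described for the tables.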
In both
Table 2 and
Table 3, the asterisk in the superscript of a value means that this value is the best. Here, “best” is judged first by closeness to the actual value of
and, in the case of two or more estimates that are equally close to the true value of
, by the value of the MSE. Moreover,
Table 2 and
Table 3 show that the ML and (
16) estimators for estimating the EVI
via the GPDE
are highly accurate when compared with the estimates of
via
5. Comparison Study between the Linear, Power and Exponential Models
Air pollution is a global problem, from which most countries across the world suffer (cf. [
21,
22,
23]). In this section, we consider this problem via two data sets of pollutants, each of which consists of the maximum data of three pollutants: nitric oxide (NO), nitrogen dioxide (NO$_2$), and particulate matter with a diameter of less than 10 μm (PM$_{10}$) (for some properties of these pollutants, see [
1,
22]). The first data set is taken from the site Lambeth–Streatham Green-Urban Background (denoted by LB6). The daily maximum of these pollutants was monitored and recorded every hour. Therefore, around 21,169 records are presented from 1 January 2014 to 31 July 2016. These data sets are publicly available from the following site:
http://www.londonair.org.uk/london/asp/datadownload.asp.
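One plausible way to reduce hourly records to daily maxima before fitting the models is sketched below. The record layout (a timestamp paired with a concentration, with `None` for a missing reading) is an assumption made for illustration and does not describe the actual format of the downloaded files.

```python
from collections import defaultdict
from datetime import datetime


def daily_maxima(records):
    """Return {date: maximum value} from (datetime, value or None) records, sorted by date."""
    by_day = defaultdict(list)
    for stamp, value in records:
        if value is not None:              # skip missing readings
            by_day[stamp.date()].append(value)
    return {day: max(values) for day, values in sorted(by_day.items())}


# Hypothetical hourly readings (not taken from the LB6 data).
hourly = [
    (datetime(2014, 1, 1, 0), 3.0),
    (datetime(2014, 1, 1, 5), 7.5),
    (datetime(2014, 1, 1, 9), None),
    (datetime(2014, 1, 2, 1), 2.0),
]
print(daily_maxima(hourly))
```

Days with no valid reading are simply absent from the result, so the number of maxima can be smaller than the number of calendar days.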
Table 4 shows the summary statistics for these maximum data sets.
Table 5 gives the estimated parameters of the generalized extreme value distributions for LB6.
We checked the fit of each family by the Kolmogorov–Smirnov (K-S) test. This test returns four quantities: H, which equals 0 or 1; P, the p-value; KSSTAT, the maximum difference between the data and the fitted curve; and a critical value. Therefore,
we accept if and level of significance and
we reject if and level of significance.
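The K-S statistic and the resulting accept/reject decision can be sketched in a few lines. The sketch assumes a fully specified continuous CDF and uses the large-sample approximation $1.358/\sqrt{n}$ for the 5% critical value. That approximation is an assumption of this example and is not exact when the parameters have been estimated from the same data; the function names are also illustrative.

```python
import math


def ks_statistic(data, cdf):
    """Maximum distance between the empirical DF of `data` and the hypothesized `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical DF jumps from (i-1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_decision(data, cdf, coeff=1.358):
    """Return (h, ksstat, cv): h = 1 rejects the fit, h = 0 accepts it.

    coeff/sqrt(n) approximates the critical value; 1.358 corresponds to the 5% level.
    """
    ksstat = ks_statistic(data, cdf)
    cv = coeff / math.sqrt(len(data))
    h = 1 if ksstat > cv else 0
    return h, ksstat, cv


# Hypothetical check of three points against the uniform DF on [0, 1].
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
print(ks_decision([0.1, 0.4, 0.7], uniform_cdf))
```

For small samples, tabulated or exact critical values would be preferable to the asymptotic constant used here.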
Table 6 gives the results of the Kolmogorov–Smirnov (K-S) test for fitting the three models
, and
to the maximum data sets from LB6.
Table 7 illustrates the summary statistics for these maximum data sets. Finally, the graphical representations of the data sets and the fitted distributions are given in Figures 1–9.
The second data set is taken from the site Greenwich-Eltham (denoted by GR4). The daily maxima of these pollutants are recorded every hour, so around 43,825 records are presented from 1 January 2014 to 31 December 2018. These data are publicly available from the following site:
http://www.londonair.org.uk/london/asp/datadownload.asp.
Table 8 gives the estimated parameters of the generalized extreme value distributions for GR4.
Table 9 gives the results of the Kolmogorov–Smirnov (K-S) test for fitting the three models
and
to the maximum data sets from GR4. Finally, the graphical representations of the data sets and the fitted distributions are given in Figures 10–18.
The results of this study are summarized below, where the most favorable model is the one that has the minimum KSSTAT value among the accepted models.
Only the power and exponential models are favorable in describing the pollutant that is monitored by LB6. The power model is the best one.
The linear model is the only favorable model to describe the pollutant that is monitored by LB6.
All of the models are favorable to describe the pollutant that is monitored by LB6. The best model is the linear model followed by the power model.
Only the e-model is favorable to describe the pollutant that is monitored by GR4.
None of the three models is favorable to describe the pollutant that is monitored by GR4.
All of the models are favorable to describe the pollutant that is monitored by GR4. The best model is the e-model followed by the linear model.
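The selection rule behind this summary (keep the accepted models and pick the one with the smallest KSSTAT) can be written as a short helper. The dictionary layout and the numbers in the example are hypothetical and are not values from Tables 6 or 9.

```python
def most_favorable(results):
    """Pick the accepted model (h == 0) with the smallest KSSTAT, or None if all are rejected.

    `results` maps a model name to a pair (h, ksstat).
    """
    accepted = {name: ksstat for name, (h, ksstat) in results.items() if h == 0}
    if not accepted:
        return None
    return min(accepted, key=accepted.get)


# Hypothetical K-S outcomes for one pollutant.
example = {"linear": (1, 0.09), "power": (0, 0.04), "exponential": (0, 0.05)}
print(most_favorable(example))
```

Returning `None` when every model is rejected corresponds to the case in which none of the three models is favorable.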
It is worth remarking that the study reveals an interesting fact: the kurtosis of the data has an impact, to some extent, on the kind of extreme model that describes the data; e.g., as the kurtosis increases, the e-model becomes more favorable. Moreover, the linear, power, and exponential models become less favorable for fitting symmetric platykurtic data sets (for details about the description of data according to the skewness and kurtosis, see [
24,
25]), e.g., the case of pollutant
Finally, a quick look at Figures 1–18 reveals that the curves of the empirical DF and the tested family nearly coincide when we accept
(e.g., Figures 2–4, 6, 7, 9, 12, 15, 16, and 18), while, in the case of rejection, the two curves diverge in some regions. This result endorses the results that are given in
Table 6 and
Table 9.
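As a companion to the remark on skewness and kurtosis in this section, the moment-based sample skewness and kurtosis can be computed as in the sketch below. The use of the plain moment ratios $m_3/m_2^{3/2}$ and $m_4/m_2^{2}$ (without small-sample corrections) and the reference value 3 for classifying a data set as platykurtic are assumptions of this example; the summary tables may rely on a different convention.

```python
def skewness_kurtosis(data):
    """Moment-based sample skewness m3/m2^1.5 and kurtosis m4/m2^2 (no bias correction)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def is_platykurtic(data):
    """True when the kurtosis is below 3, the value for the normal distribution."""
    return skewness_kurtosis(data)[1] < 3.0


# Hypothetical symmetric, flat data set.
skew, kurt = skewness_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print("skewness:", skew, "kurtosis:", kurt)
```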