Abstract
Estimating high quantiles of heavy-tailed distributions has many important applications. Heavy-tailed distributions are theoretically difficult to study because they often have infinite moments, and existing methods for constructing confidence intervals (CIs) of high quantiles suffer from bias. This paper proposes a new estimator for high quantiles based on the geometric mean. The new estimator has good asymptotic properties and provides a computational algorithm for estimating confidence intervals of high quantiles. It avoids these difficulties, improves efficiency and reduces bias. The efficiencies and biases of the new estimator are compared with those of existing estimators. The theoretical results are confirmed through Monte Carlo simulations. Finally, applications to two real-world examples are provided.
1. Introduction
Extreme value analysis (EVA) was first introduced by Leonard Tippett (Fisher and Tippett, 1928 [1]). While working on how to make cotton thread stronger, Tippett realized that the strength of the weakest threads was the only factor that mattered in determining the strength of the cotton thread. Nowadays, extreme value analysis is widely used in many fields, including engineering, social science, economics, traffic prediction and insurance. People in these fields are interested in extreme events such as the shortest life span of a new engine, the maximum appreciation of the stock market, the longest driving time on a highway at rush hour, or the biggest medical claim to an insurance company. The distributions of these extreme events are usually unknown. In general, EVA involves the extrapolation of an unknown distribution and its high quantiles. Estimating a high quantile from observations is very important in EVA, since it gives the value x that corresponds to a very small exceedance probability p.
Certain risks are beyond our control and can barely be predicted until just before they occur, such as earthquakes, terrorist attacks and virus outbreaks. For these events, we need risk management to minimize, monitor and control the impact of unfortunate events, or to maximize the realization of opportunities. Estimating the confidence interval of high quantiles plays an important role in risk management. Since a high quantile is located in the tail area, it depends heavily on the behaviour of the tail of the distribution or, from the statistical point of view, on the k largest order statistics. This leads to two challenges: instability in the choice of k, and bias. There is a large body of research on mathematical models and theoretical studies for estimating confidence intervals of high quantiles; we review it in Section 2.
This paper proposes a new method to estimate high quantile of a heavy-tailed distribution. The new method has interesting improvements compared with other existing methods. This paper makes three main contributions to methodology.
(1) This paper proposes a new estimation method based on a geometric mean with good asymptotic properties. It is consistent and stable relative to the existing methods. The paper provides a computational algorithm which overcomes the mathematical difficulties and bias problems of the estimation of confidence intervals of high quantiles of a heavy tailed distribution.
(2) Monte Carlo simulation studies are carried out on three heavy-tailed distribution models: Fréchet (0.25), GPD (0.5) and GPD (2) (GPD: generalized Pareto distribution). The simulation results confirm that the proposed method is more efficient than the existing quantile estimators.
(3) This paper uses the proposed estimation method to predict extreme values in two examples: flu in Canada and gamma rays from solar flares. These data sets fit the GPD model very well. We apply the proposed method to estimate the confidence intervals of high quantiles. The numerical results show that the proposed method gives more efficient results than the existing methods.
In this paper, we review several existing high quantile estimators with their behavior in Section 2. We propose a new estimator for the confidence interval of high quantiles based on the geometric mean and explore its asymptotic properties in Section 3. To compare the new estimator with the existing estimators, Section 4 presents Monte Carlo simulation results and the improvement of the proposed quantile estimator relative to existing methods. In Section 5 we apply the proposed new method to construct confidence intervals of high quantiles on flu in Canada and gamma ray examples. Finally, conclusions and discussions are given in Section 6.
2. Existing Estimator for High Quantiles
Heavy-tailed distributions (de Haan and Ferreira, 2006 [2]) are important in the study of extreme value events.
Definition 1.
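A random variable X is said to have a heavy-tailed distribution if its distribution function F(x) satisfies
$$1 - F(x) = x^{-1/\gamma} L(x), \quad x \to \infty,$$
where $L$ is a slowly varying function, that is, $L(tx)/L(x) \to 1$ as $x \to \infty$ for all $t > 0$, and $\gamma > 0$ is the tail index.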
Notice that we can have (de Haan and Ferreira, 2006, p. 362 [2]). Since the slowly varying function $L(x)$ behaves approximately as a constant c, for simplicity, we assume that a heavy-tailed distribution satisfies
$$1 - F(x) = c\,x^{-1/\gamma}, \quad \gamma > 0, \quad x \to \infty. \qquad (1)$$
Heavy-tailed distributions decay more slowly than exponential distributions and therefore have longer tails. The tail function is defined as follows.
Definition 2.
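A tail function of any distribution function F(x) is defined as
$$U(t) = F^{\leftarrow}\left(1 - \frac{1}{t}\right), \quad t > 1,$$
where $F^{\leftarrow}$ denotes the generalized inverse of F.
For the heavy-tailed distribution in (1), with a tail of the form $1 - F(x) = c\,x^{-1/\gamma}$, we can rewrite the tail function as
$$U(t) = C\,t^{\gamma}, \quad C = c^{\gamma}. \qquad (2)$$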
Definition 3.
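The quantile function of a heavy-tailed distribution in (1) for a given probability $p \in (0, 1)$ is defined by
$$Q(1 - p) = F^{\leftarrow}(1 - p) = \inf\{x : F(x) \ge 1 - p\},$$
where $F^{\leftarrow}$ is the generalized inverse function of F; we call $Q$ the quantile function of $F$.
Value at Risk ($\mathrm{VaR}$) is widely used in risk management. When p is very small, $Q(1 - p)$ becomes a high quantile, the pth value at risk, and we define
$$\mathrm{VaR}_p = Q(1 - p).$$
Also, we can use the tail function in (2) to write $\mathrm{VaR}_p$ as
$$\mathrm{VaR}_p = U(1/p). \qquad (4)$$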
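As a quick numerical check of how the tail model and the high quantile fit together, the sketch below assumes the strict Pareto-type tail $1 - F(x) = c\,x^{-1/\gamma}$, for which solving $1 - F(x) = p$ gives $\mathrm{VaR}_p = (c/p)^{\gamma}$. The function names and the parameter values are illustrative assumptions and do not come from the paper.

```python
import math

def tail_probability(x, gamma, c=1.0):
    """Exceedance probability 1 - F(x) = c * x**(-1/gamma) (assumed model)."""
    return c * x ** (-1.0 / gamma)

def value_at_risk(p, gamma, c=1.0):
    """High quantile VaR_p that solves 1 - F(x) = p, i.e. (c / p)**gamma."""
    return (c / p) ** gamma

# Hypothetical parameter values chosen only for illustration.
gamma, c, p = 0.5, 1.0, 0.01
var_p = value_at_risk(p, gamma, c)
ln_var_p = math.log(var_p)                      # equals gamma * ln(c / p)
round_trip = tail_probability(var_p, gamma, c)  # recovers p up to rounding
```

On the ln scale the quantile is linear in the tail index, $\ln \mathrm{VaR}_p = \gamma \ln(c/p)$, which is one reason the ln-quantile is a convenient target for estimation.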
The heavy-tailed models have a compulsory infinite right endpoint. In the case of negative observations in the model, the sample size should be exclusively the number of positive observations, although a deterministic shift in the data is preferred by some authors, to work only with positive values. In this paper, we use the real line .
To estimate $\mathrm{VaR}_p$, let $X_{1:n} \le X_{2:n} \le \dots \le X_{n:n}$ be the order statistics from a random sample $X_1, \dots, X_n$. We review the four high quantile estimation methods in the literature.
2.1. Quantile Function-Tail Index Method
To estimate high quantiles, we work on the ln scale and estimate the tail index first (Dekkers and de Haan 1989 [3]). The Hill (1975) [4] estimator is a well-known consistent estimator of the tail index $\gamma$.
Definition 4.
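Consider the order statistics $X_{1:n} \le \dots \le X_{n:n}$ and let $k = k_n$ be an intermediate sequence of integers. The Hill estimator is defined as
$$H(k) = \frac{1}{k} \sum_{i=1}^{k} \left[\ln X_{n-i+1:n} - \ln X_{n-k:n}\right], \qquad (5)$$
where $k \to \infty$ and $k/n \to 0$ as $n \to \infty$.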
The Hill estimator in (5) uses the k largest order statistics of a random sample. Substituting the Hill estimator defined in (5) into (4), we obtain the ln of the pth high quantile as
This estimator depends on k: small values of k produce high volatility, whereas large values of k induce considerable bias. Hence, semi-parametric extensions may be considered for increasing the degrees of freedom in the trade-off between variance and bias. Note that the tail index is a parameter of a given distribution, and a quantile of a distribution is a function of
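The Hill estimator can be sketched in a few lines of Python. The version below follows the standard definition (the average of the log-spacings between the k largest observations and the (k+1)th largest one). The function name `hill_estimator`, the seed and the simulated strict Pareto sample are assumptions made for illustration only.

```python
import math
import random

def hill_estimator(sample, k):
    """Hill estimate of the tail index from the k largest observations.

    Averages ln X_(n-i+1:n) - ln X_(n-k:n) for i = 1..k, so it needs
    1 <= k < n and positive upper order statistics.
    """
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    threshold = xs[n - k - 1]          # X_(n-k:n)
    if threshold <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    top = xs[n - k:]                   # the k largest observations
    return sum(math.log(x) - math.log(threshold) for x in top) / k

# Illustration on a strict Pareto sample with tail index 0.5, generated
# by inverse transform: X = U**(-gamma) with U uniform on (0, 1].
rng = random.Random(0)
gamma_true = 0.5
pareto_sample = [(1.0 - rng.random()) ** (-gamma_true) for _ in range(5000)]
hill_path = {k: hill_estimator(pareto_sample, k) for k in (50, 200, 500, 1000)}
```

For a strict Pareto sample the estimates in `hill_path` should stay near 0.5 for every k, with more scatter at small k. For other heavy-tailed models the slowly varying part typically introduces visible bias as k grows, which is the trade-off described above.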
2.2. Weissman Method
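Weissman (1978) [5] proposed the following semiparametric estimator of a high quantile
$$\hat{Q}_{\hat\gamma}(p; k) = X_{n-k:n} \left(\frac{k}{np}\right)^{\hat\gamma},$$
where $\hat\gamma$ is a consistent estimator of the tail index. We substitute $H(k)$ in (5) into the function above, then we have
$$\hat{Q}_{H}(p; k) = X_{n-k:n} \left(\frac{k}{np}\right)^{H(k)}. \qquad (7)$$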
Without any prior indication on k, the Weissman estimator shows large volatility, as it depends on the sample fraction k. Although minimizing the bias and MSE can be considered as a criterion for selecting k, this is impractical because both are unknown. Other methods for selecting the sample fraction k can be found in Beirlant et al. (1996) [6]; Drees and Kaufmann (1998) [7]; Guillou and Hall (2001) [8]; Gomes and Oliveira (2001) [9].
The optimal k value through the tail index Hill estimator is given by formula (15) in Section 2.4 Optimal k Values.
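A minimal sketch of the Weissman-type estimator on the ln scale, assuming the standard form $X_{n-k:n}\,(k/(np))^{H(k)}$ with the Hill estimator plugged in for the tail index. The helper names are hypothetical, and the small sample at the end is a toy data set chosen so that the result can be checked by hand.

```python
import math

def hill_from_sorted(xs, k):
    """Hill estimate from an ascending sorted list xs, using the top k values."""
    n = len(xs)
    threshold = xs[n - k - 1]                      # X_(n-k:n)
    return sum(math.log(x / threshold) for x in xs[n - k:]) / k

def weissman_ln_quantile(sample, k, p):
    """ln of the Weissman-type estimate X_(n-k:n) * (k / (n p))**H(k)."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    h = hill_from_sorted(xs, k)
    return math.log(xs[n - k - 1]) + h * math.log(k / (n * p))

# Toy sample 1, e, e^2, e^3: with k = 2 the threshold is e and H(2) = 1.5,
# so the ln-quantile at p = 0.1 is 1 + 1.5 * ln(2 / (4 * 0.1)) = 1 + 1.5 * ln 5.
toy = [1.0, math.e, math.e ** 2, math.e ** 3]
toy_ln_q = weissman_ln_quantile(toy, k=2, p=0.1)
```

Because both the threshold $X_{n-k:n}$ and $H(k)$ change with k, the estimate moves with every choice of k, which is why the selection of k matters so much for this estimator.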
2.3. Reduced-Bias Method
Hall and Welsh (1985) [10] proposed a second-order expansion on the tail function U in (2)
Here $\beta$ is the scale second-order parameter and $\rho$ is the shape second-order parameter.
To further reduce the bias of quantile estimators, we need to examine the behavior of the estimates of the second-order parameters $\beta$ and $\rho$. Second-order reduced-bias estimation was discussed by Peng (1998) [11], Beirlant, Dierckx, Goegebeur and Matthys (1999) [12], Feuerverger and Hall (1999) [13], Gomes, Martins and Neves (2000) [14], Caeiro and Gomes (2002) [15], Gomes, Figueiredo and Mendonça (2004) [16], among others. Gomes and Pestana (2007) [17] considered estimators for the second-order parameters.
Caeiro et al. (2005, p. 122) [18] advise the use of a tuning parameter in the estimation of $\rho$. It provides higher stability as a function of the number k of top order statistics used, for a wide range of large k values, by means of any stability criterion.
Definition 5.
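Caeiro et al. (2005) [18] defined the bias-corrected Hill estimator
$$\overline{H}(k) = H(k) \left(1 - \frac{\hat\beta}{1 - \hat\rho} \left(\frac{n}{k}\right)^{\hat\rho}\right), \qquad (9)$$
where $H(k)$ is defined in (5). For a tuning real parameter,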
withas defined in (5) that, and (10) achieves consistency ifasand.
The corresponding ln-quantile estimator with the tail index estimator in (9) is
A similar estimator to the estimator in (12) is considered in Lekina et al. (2014) [19] and Lekina (2010) [20].
Gomes and Pestana (2007) [17] considered the ln-VaR estimator
Substituting the estimator in (9) into (13), we obtain another estimator for the high quantile as
2.4. Optimal k Values
As discussed previously, the estimates vary as k varies, and they become very unreliable when k is large. Gomes and Pestana (2007) [17] suggested using numerically estimated optimal k values.
The optimal k for the tail index estimator through Hill estimator in (5) is
The optimal k for the semiparametric quantile estimator in (7), is
The optimal k for the second-order reduced-bias quantile estimator in (12) and in (14) should be larger than , is ,
By using these optimal k values, all the quantile estimators provide better results. However, with an unknown distribution and estimated second-order parameters, these numerically estimated k values are not always accurate. Since all the quantile estimators are so sensitive to the value of k, in this paper we propose a new quantile estimator which does not depend on k.
3. New Estimator for High Quantile
3.1. New Estimator
Our goal is to improve the quantile estimators in Section 2. The existing estimation methods have bias issues and difficulties in determining k. In order to overcome these problems, Huang (2011) [21] proposed a new quantile estimator which is the geometric mean of the reduced-bias quantile estimator in (14).
Definition 6.
whereis thetop order statistic,is any consistent estimator for γ, and Q stands for quantile function.
Based on (16), (20) can be written as
where and is a constant that
is the adjustment term, where is defined in (13) that reduces bias using the second-order parameters. is a key value depends only on n to furthermore reduce the bias by observing the behavior of the second-order parameters. We will discuss the choice of in Section 4.
1. The new quantile estimator has the least bias, the smallest RMSE and the highest efficiency.
2. The new quantile estimator is consistent and does not depend on k as the existing quantile estimators do.
3. The confidence interval based on the new quantile estimator is the most efficient compared to the existing methods, where it not only has the shortest length of the interval, but also has the highest probability coverage of the true value in most cases.
3.2. Asymptotic Properties of the New Estimator ln
Using the Hall-Welsh class of model in (8), we derive that the new estimator in (19) has the asymptotic properties under following conditions, when in (5).
Condition 1 (C1).
For intermediate as
Condition 2 (C2).
where A is in (8).
Theorem 1.
Under (C1) and (C2), if we use the Hill estimator in (5), then the new estimator has an asymptotically normal distribution
The asymptotic mean, variance and efficiency of in (19) relative to in (14) are given by
where w is the weight, is correlation coefficient of and
See Appendix A for the proof of Theorem 1.
3.3. The C.I. for The New Estimator ln
Theorem 2.
Under conditions (C1) and (C2), a $100(1 - \alpha)\%$ confidence interval for $\ln \mathrm{VaR}_p$ using the new estimator in (19) is given by
where is the quantile of standard normal distribution, and
See Appendix A for the proof of Theorem 2.
Remark 1.
Note that in the CI in (23), the main term does not depend on k; only the error terms depend on k.
Remark 2.
In Section 4 (Simulations) and Section 5 (Applications), we use the maximum weight in Formula (23); thus, we use the maximum CI length for the proposed estimator when comparing it with existing methods. Even with the maximum CI length, Sections 4 and 5 show that the confidence interval in (23) obtained from the new estimator is still shorter than the confidence intervals obtained from existing estimators for most values of k.
4. Simulations
4.1. Computer Simulations of Quantile Estimators
To verify that the new estimator has good properties, we use simulations and compare the new estimator to the existing estimators using the following statistics
- The expected value .
- The root mean squared error (RMSE).
- The relative efficiencies
In this Section, we choose models of Fréchet (0.25), GPD (0.5), GPD (2) to compare with the simulation results of Gomes and Pestana (2007) [17]. We use four quantile estimators in Table 1 to run simulations. When estimators and in use the tuning parameter , otherwise, use
Table 1.
The four ln-quantile estimators we use in simulations.
- (1)
- The Fréchet distribution (Fréchet, 1927) [22] has the c.d.f. $F(x) = \exp(-x^{-1/\gamma})$, $x \ge 0$. An estimator of the pth ln-high quantile function is
- (2)
- The generalized Pareto distribution (GPD) (de Zea Bermudez and Kotz, 2010) [23] has the c.d.f. $F(x) = 1 - (1 + \gamma x)^{-1/\gamma}$ for $x \ge 0$. An estimator of the pth ln-high quantile function is
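The two simulation models can be sketched as follows, assuming the common parameterizations $F(x) = \exp(-x^{-1/\gamma})$ for the Fréchet model and $F(x) = 1 - (1 + \gamma x)^{-1/\gamma}$ for the GPD, both with tail index $\gamma > 0$. Under these assumptions the true ln-quantiles have closed forms, and samples can be drawn by inverse transform. All function names are hypothetical.

```python
import math
import random

def frechet_cdf(x, gamma):
    """Assumed Frechet c.d.f. exp(-x**(-1/gamma)) for x > 0."""
    return math.exp(-x ** (-1.0 / gamma)) if x > 0 else 0.0

def frechet_ln_var(p, gamma):
    """ln of the quantile x with F(x) = 1 - p under the Frechet model."""
    return -gamma * math.log(-math.log(1.0 - p))

def gpd_cdf(x, gamma):
    """Assumed GPD c.d.f. 1 - (1 + gamma x)**(-1/gamma) for x > 0."""
    return 1.0 - (1.0 + gamma * x) ** (-1.0 / gamma) if x > 0 else 0.0

def gpd_ln_var(p, gamma):
    """ln of the quantile x with F(x) = 1 - p under the GPD model."""
    return math.log((p ** (-gamma) - 1.0) / gamma)

def _uniform_open(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

def sample_frechet(n, gamma, rng):
    """Inverse transform: X = (-ln U)**(-gamma)."""
    return [(-math.log(_uniform_open(rng))) ** (-gamma) for _ in range(n)]

def sample_gpd(n, gamma, rng):
    """Inverse transform with U playing the role of 1 - F(X)."""
    return [(_uniform_open(rng) ** (-gamma) - 1.0) / gamma for _ in range(n)]

rng = random.Random(1)
frechet_draws = sample_frechet(1000, 0.25, rng)
gpd_draws = sample_gpd(1000, 0.5, rng)
```

Having the true ln-quantile available in closed form is what makes it possible to report bias and mean squared error in a simulation study.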
4.2. The Choice of
As mentioned in Section 3, is a key value to reduce the bias of the defined in (19). We developed an algorithm to estimate based on the results of simulation runs:
Step 1: For a fixed sample size n, the in ith iteration, is the true solution of equation
then Note that depends on is the true lnVaR value.
Step 2: Obtain estimator based on the linear regression (LR) models where is related to We collect data set with the sample size
Note that the estimate in (27) depends on the parameters of the models and on the LR relationship with the sample size n.
Remark 3.
If we assume in is normally distributed, based on (Bickel and Doksum, 2015, pp. 286–388) [24], then is a maximum likelihood estimator (MLE) and has an asymptotic normal distribution. Since the estimator depends only on n and is not related to the order statistics, it will not affect the asymptotic properties of the proposed estimator in (19).
4.3. Simulation of Fréchet (0.25), GPD (0.5) and GPD (2)
Table 2, Table 3 and Table 4 list the results of simulations under the Fréchet (0.25), GPD (0.5) and GPD (2), where iterations for sample size and With in (27), we compare mean values, mean squared errors (MSE) and REFF of the four estimators in Table 1, at optimal level based on (15) Note that the new estimator has the highest REFF values among the four estimators which are in bold in all three models. The simulation MSE of is defined as
where the indexed term is the value of the estimator in the ith iteration. The same definition applies to the other ln-quantile estimators.
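The simulation statistics can be illustrated with a small Monte Carlo loop. The sketch below is not the paper's simulation code: it assumes a Fréchet model with c.d.f. $\exp(-x^{-1/\gamma})$, uses a Weissman-type ln-quantile estimator with the Hill estimator as the quantity being evaluated, and takes the relative efficiency to be a ratio of RMSEs (reference over candidate), which is an assumption about the definition. Sample size, k, p and the number of iterations are arbitrary small values.

```python
import math
import random

def weissman_ln_quantile(sample, k, p):
    """ln X_(n-k:n) + H(k) * ln(k / (n p)), with H(k) the Hill estimator."""
    xs = sorted(sample)
    n = len(xs)
    threshold = xs[n - k - 1]
    hill = sum(math.log(x / threshold) for x in xs[n - k:]) / k
    return math.log(threshold) + hill * math.log(k / (n * p))

def frechet_sample(n, gamma, rng):
    """Inverse-transform draws from the assumed Frechet model."""
    out = []
    while len(out) < n:
        u = rng.random()
        if u > 0.0:
            out.append((-math.log(u)) ** (-gamma))
    return out

def simulate(n, k, p, gamma, iterations, seed=0):
    """Monte Carlo mean, MSE and RMSE of the ln-quantile estimates."""
    rng = random.Random(seed)
    true_value = -gamma * math.log(-math.log(1.0 - p))   # true ln VaR_p
    estimates = [weissman_ln_quantile(frechet_sample(n, gamma, rng), k, p)
                 for _ in range(iterations)]
    mean = sum(estimates) / iterations
    mse = sum((e - true_value) ** 2 for e in estimates) / iterations
    return {"true": true_value, "mean": mean, "mse": mse, "rmse": math.sqrt(mse)}

def relative_efficiency(rmse_reference, rmse_candidate):
    """Assumed definition: values above 1 favour the candidate estimator."""
    return rmse_reference / rmse_candidate

summary = simulate(n=500, k=50, p=0.01, gamma=0.25, iterations=200)
```

The MSE decomposes into variance plus squared bias, so comparing `summary["mean"]` with `summary["true"]` shows how much of the error is systematic.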
Table 2.
Fréchet (0.25). Mean, MSE, REFF of the estimators. The highest REFF values are in bold.
Table 3.
GPD (0.5). Mean, MSE, REFF of the estimators. The highest REFF values are in bold.
Table 4.
GPD (2). Mean, MSE, REFF of the estimators. The highest REFF values are in bold.
Figure 1, Figure 2 and Figure 3 are based on Table 2, Table 3 and Table 4 results, Figure 1 is for Fréchet (0.25), we use iterations, sample size , , , , . The new estimator has the best performance with the least bias and RMSE. It does not change as k varies. Figure 2 and Figure 3 are for GPD(0.5) and GPD(2), iterations, sample size , and 2, , , . We note that the new estimator is the best estimator as well, with the least bias, consistency as k varies, and the smallest . Note that values are very close to the true lnVaR values.
Figure 1.
Underlying Fréchet (), . (a) The means of ln-quantile estimators with the true (). (b) The RMSE of Ln-quantile estimation, ,
Figure 2.
Underlying (), , (a) The means of ln-quantile estimators with the true (b) The RMSE of Ln-quantile estimation, , .
Figure 3.
Underlying (2), , (a) The means of ln-quantile estimators with the true (). (b) The RMSE of ln-quantile estimators, ,
4.4. Simulations of Confidence Intervals
By Gomes and Pestana (2007) [17], the 95% confidence interval of the true tail index using H is
and the 95% confidence interval of the true tail index using is
Next, we compute the confidence intervals for the true ln-quantile by using the quantile estimators. We only use three out of four quantile estimators in Table 1, except which has the worst result. Therefore, we compare CIs only using , and in (30), (31) and (23). Thus
- (1)
- The 95% confidence interval for the true using iswhere , is given in (28), and
- (2)
- The 95% confidence interval for the true using iswhere , is given in (29), and
- (3)
- The 95% confidence interval for the true using is given in (24).
To compare the proposed CI in (23) with the CIs in (30) and (31), we evaluate the length and probability coverage of the CIs.
The length of CI is given as
and the efficiency of the length of 95% is given as
Also, the confidence interval is more efficient when it has a higher coverage of the true value under the simulations, where the probability coverage of 95% is defined as
and the efficiency of the probability coverage of 95% is given as
A larger value indicates a more efficient confidence interval.
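The two criteria, interval length and probability coverage, can be sketched as below. The efficiency ratios are oriented so that a value above 1 favours the candidate interval; this orientation and all function names are assumptions, since only the general idea (shorter intervals and higher coverage are better) is stated above. The intervals are made-up numbers, not simulation output.

```python
def mean_length(intervals):
    """Average length (upper minus lower) over simulated intervals."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)

def coverage(intervals, true_value):
    """Fraction of intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)

def length_efficiency(reference, candidate):
    """Assumed ratio: reference length over candidate length."""
    return mean_length(reference) / mean_length(candidate)

def coverage_efficiency(reference, candidate, true_value):
    """Assumed ratio: candidate coverage over reference coverage."""
    return coverage(candidate, true_value) / coverage(reference, true_value)

# Hypothetical intervals for a true ln-quantile equal to 2.0.
reference_cis = [(0.0, 4.0), (1.0, 5.0), (3.0, 7.0), (2.5, 6.5)]
candidate_cis = [(1.0, 3.0), (1.5, 3.5), (1.75, 3.75), (2.25, 4.25)]
len_eff = length_efficiency(reference_cis, candidate_cis)
cov_eff = coverage_efficiency(reference_cis, candidate_cis, 2.0)
```

In this toy case the candidate intervals are half as long and cover the true value more often, so both ratios exceed 1.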
Figure 4, Figure 5 and Figure 6 show the 95% confidence interval of the three ln-quantile estimators under Fréchet (0.25), GPD (0.5 and 2) with . We compare the size of each confidence interval at their optimal k level, and the probability coverage of each confidence interval at their optimal k level. Recall, the optimal k level for is at based in (16), the optimal k level for and is at based in (15).
Figure 4.
The Fréchet (0.25) model, 95% confidence intervals of the quantile estimators. Note that the new estimator (purple) has the shortest CI, with length 0.2668. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level).
Figure 5.
The GPD (0.5) model, 95% confidence intervals of the quantile estimators. Note that the new estimator (purple) has the shortest CI, with length 0.7094. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level).
Figure 6.
The GPD (2) model, 95% confidence intervals of the quantile estimators. Note that the new estimator (purple) has the shortest CI, with length 2.2511. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level).
Table 5 compares the efficiencies of the 95% CIs of the three quantile estimators under Fréchet (0.25) and GPD (0.5 and 2). The efficiency of a 95% CI can be compared by the length of the CI and by the probability coverage of the CI.
Table 5.
efficiencies of 95% CI for .
In this section, we compared the new quantile estimator in (19) with the existing methods. The new estimator has the least bias and the smallest RMSE, and depends little on k. It also has the smallest length and the highest probability coverage of the 95% confidence interval in most cases. The simulation results verify that the new estimator is the best quantile estimator among all three methods. In the next section, we apply the new estimator to real-world examples.
5. Applications
We study two real-world examples in this section. We are interested in the population that is above the threshold for each example. The goal is to estimate the high quantiles of each example, where p is very small. We use the four quantile estimators in Table 1 and in (21), and compare their performances.
- Procedure:
- Step 1:
- Choose and collect data of examples of real life extreme events.
- Step 2:
- Run goodness-of-fit tests to check whether the data are heavy-tailed.
- Step 3:
- Estimate the high quantiles and construct the confidence intervals by using the new method and the existing methods.
- Estimators
- Two tail index estimators in (5) and in (9).
- Four quantile estimators (6), (7), (12) and (19) are in Table 1.
- We use in (27) for the new estimator in (19) for the GPD model.
Remark 4.
In applications, the GPD is used as a tail approximation to the population distribution from which a sample of excesses above some suitably high threshold μ is observed. The GPD is parameterized by location, scale and shape parameters μ, σ and γ, and can equivalently be specified in terms of threshold excesses $x - \mu$ or, as here, exceedances $x > \mu$, as the three-parameter GPD in (34) (de Zea Bermudez and Kotz, 2010) [23],
$$G(x \mid \mu, \sigma, \gamma) = 1 - \left[1 + \gamma\,\frac{x - \mu}{\sigma}\right]^{-1/\gamma}, \quad x > \mu,\ \sigma > 0.$$
Traditionally, the threshold was chosen before fitting, giving the so-called fixed threshold approach (Pickands, 1975 [25], Balkema and de Haan, 1974 [26]). It is common for practitioners to assume a constant quantile level, determined by some assessment of fit across all or a subset of the datasets (Scarrott and MacDonald, 2012, p.36 [27]). In our application, the threshold is pre-determined by physical considerations, that is, the number of type A flu viruses detected weekly in Canada above the average in flu season, and the counts of gamma ray released from significant solar flares (M and X rated) during the Sun’s active years. Although it is possible to make some arbitrary definition of the choice of the threshold, it is preferable not to become involved with such a delicate question. The application of the proposed method is presented in both examples for illustrative purposes.
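The fixed threshold approach can be sketched as follows: keep only the observations above the threshold, then describe them with a three-parameter GPD. The c.d.f. used here, $1 - [1 + \gamma (x - \mu)/\sigma]^{-1/\gamma}$ for $x > \mu$ with $\gamma > 0$, is the usual parameterization and is an assumption about the form intended above. The weekly counts and the scale and shape values are invented for illustration and are not the flu data or fitted values.

```python
def exceedances(data, threshold):
    """Observations strictly above the threshold."""
    return [x for x in data if x > threshold]

def gpd_cdf(x, mu, sigma, gamma):
    """Assumed three-parameter GPD c.d.f. for exceedances x > mu (gamma > 0)."""
    if x <= mu:
        return 0.0
    return 1.0 - (1.0 + gamma * (x - mu) / sigma) ** (-1.0 / gamma)

def gpd_quantile(p, mu, sigma, gamma):
    """Level exceeded with probability p, conditional on being above mu."""
    return mu + sigma / gamma * (p ** (-gamma) - 1.0)

# Hypothetical weekly counts and a threshold of 953.
weekly_counts = [120, 980, 450, 1500, 953, 2300, 60, 1010]
above = exceedances(weekly_counts, 953)

# Hypothetical scale and shape; q is exceeded by 1% of the exceedances.
mu, sigma, gamma = 953.0, 800.0, 0.5
q = gpd_quantile(0.01, mu, sigma, gamma)
check = gpd_cdf(q, mu, sigma, gamma)
```

Note that `gpd_quantile` works with probabilities conditional on exceeding the threshold; converting to a quantile of the full sample would additionally require the fraction of observations above the threshold.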
5.1. Flu in Canada Example
According to the WHO (World Health Organization, 2020 [28]), seasonal influenza is a common infection of the airways and lungs that can spread easily among humans. There are 37 million people in Canada, and flu season usually runs from November to April. Most people recover from the flu in about a week. However, influenza may be associated with serious complications such as pneumonia, especially in infants, the elderly and those with underlying medical conditions like diabetes, anemia, cancer, and immune suppression. On average, the flu and its complications send about 12,200 Canadians to the hospital every year, and around 3500 Canadians die. There are 3 types of flu viruses, A, B and C. Type A flu virus is the most harmful, and it is constantly changing and is generally responsible for the large flu epidemics. The 1918 Spanish Flu, 1957 Asian Flu, 1968 Hong Kong Flu, 2009 Swine flu, and the most recent 2014 H5N1 Bird Flu are all type A flu. In this paper, we study type A viruses in Canada.
We collected the number of type A flu viruses detected weekly in Canada from 1 January 1997 to 31 December 2019, resulting in a sample size of 994 weeks. According to the WHO, the average number of type A flu viruses detected per week in the flu season, November to April, over the past 10 years is 953. We set 953 viruses/week as the threshold, which reduced our sample size to 111 weeks. The full data set is available at http://apps.who.int/influenza/gisrs_laboratory/flunet/en.
Figure 7a shows a chart of the type A flu viruses detected in Canada over the 994 weeks; 111 weeks remain above the threshold of 953 flu viruses, the flu-season average. For each flu incubation period, a flu virus can last from one to a few weeks, which is why some arches are narrow and others are more bell-shaped in this figure. The top three weeks are circled in the plot. Figure 7b shows a histogram of the 994 weeks of data. We are interested in the 99% quantile, the value such that there is a 99% chance that the number of viruses detected in a given week is below it or, equivalently, a 1% chance that the number of flu viruses detected in a given week exceeds it. This information is useful for monitoring and studying the virus, and is also helpful for medical organizations that deal with disease control and prevention, pharmaceutical availability, and hospital resource readiness, especially during a serious flu outbreak. The 99% quantile is approximately located in the plot. In this paper, we propose a new method for estimating high quantiles and compare it with existing methods.
Figure 7.
Flu original data from 1 January 1997 to 31 December 2019, 994 weeks. (a) Flu chart of type A flu viruses detected in Canada, with 111 weeks remaining above the threshold of 953 flu viruses, the flu-season average. (b) Histogram of the number of type A flu viruses detected in Canada.
Our interest is to find the 5% and 1% high quantiles of the number of type A flu viruses detected in a week, and their 95% confidence intervals.
5.1.1. Goodness-of-Fit Test
Through data transformation , , Take as the threshold, the maximum likelihood estimators (MLE) are and Figure 8a is the log-log plot of curve with the horizontal axis against the vertical axis . Visually the transformed data fit the one parameter in (26) the bestusing (red curve). Figure 8b shows the GPD density curve (red curve) fits the histogram very well.
Figure 8.
After threshold 953 flu viruses, flu transformation data. (a) Log-log plot of the flu in Canada example. (b) Estimated GPD curve, the 99% high quantile and the histogram of the distribution of type A flu viruses detected weekly.
Besides the visual check in Figure 8, we also carry out three goodness-of-fit tests: the Kolmogorov-Smirnov (K-S) test (Kolmogorov, 1933 [29]), the Anderson-Darling (A-D) test, and the Cramér-von Mises (C-v-M) test (Anderson-Darling, 1952 [30]). All three tests measure the discrepancy between the empirical distribution function of the observations and the parent distribution function, which here is the GPD.
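The hypothesis for all three tests is
$$H_0: F(x) = F_0(x) \quad \text{versus} \quad H_1: F(x) \ne F_0(x).$$
$F$ is the true but unknown distribution of the sample. $F_0$ is the theoretical distribution, in our project the parent distribution, the GPD. $F_n$ is the empirical distribution, a step function of the sample. It is defined as
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$
where $I(\cdot)$ is the indicator function.
The test statistic of the K-S test under $H_0$ is
$$D_n = \sup_x \left| F_n(x) - F_0(x) \right|.$$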
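As an illustration of the empirical distribution function and of the Kolmogorov-Smirnov statistic $D_n = \sup_x |F_n(x) - F_0(x)|$, the sketch below evaluates both for an arbitrary hypothesized c.d.f. It uses only the standard definitions; the function names and the three-point sample are assumptions, and p-values are not computed.

```python
import bisect

def empirical_cdf(sample):
    """Return F_n, the step function giving the fraction of points <= x."""
    xs = sorted(sample)
    n = len(xs)
    def f_n(x):
        return bisect.bisect_right(xs, x) / n
    return f_n

def ks_statistic(sample, cdf):
    """Largest vertical distance between F_n and the hypothesized cdf.

    The supremum is attained at a jump of F_n, so it is enough to compare
    the cdf with the step heights just before and at each order statistic.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d

# Toy check against the uniform distribution on [0, 1].
def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)

toy = [0.1, 0.4, 0.7]
f_n = empirical_cdf(toy)
d_n = ks_statistic(toy, uniform_cdf)   # largest gap is 1 - 0.7 at x = 0.7
```

The same `ks_statistic` function can be applied to threshold exceedances by passing a fitted GPD c.d.f. as `cdf`.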
Based on Table 6 goodness of fit tests’ results, we set the GPD model for the flu in Canada data. We define the absolute errors () in (34) and integrated errors () in (35) as
Table 6.
The goodness-of-fit tests under the model for the flu in Canada data.
For both and , we use 3 different r values by letting , and top statistics. Table 7 lists the AE and IE errors which are very small.
Table 7.
and under the model for the flu in Canada data by using .
Next, we estimate the high quantiles and their confidence interval for this example.
5.1.2. Compare Four Estimation Methods
We use the four estimators in Table 1: , and the new estimator .
We use in (10), and in (11). To decide if the tuning parameter or 1, consider , for , and compute their median , then
With , we get and , then , conclude that , thus we have and , where is the optimal k value. Figure 9 shows the results.
Figure 9.
For flu in the Canadian data, (a) Estimates of the second-order parameter and , (b) Estimates and (c) Tail index estimators, H, (d) ln-quantile estimators, The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level.
Figure 9a shows estimates of the second-order parameters through and , Figure 9b shows Estimates and Figure 9c shows the two estimated tail index, H, at its optimal level using based on (15) and at its optimal level using based on (17). Figure 9d shows four quantile estimators of flu in Canada example, with . The full circles “•” in the plot are the values of the quantile estimators at their optimal k level. We note that has a constant value, which does not depend on
Figure 10 compares the confidence intervals of three quantile estimators in (7), (12) and (19). This figure shows that the new quantile estimator has the smallest confidence interval with length 0.7966, where we use (The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level).
Figure 10.
95% confidence interval of three ln-quantile estimators after the threshold 953 for the flu in Canada example. , Note that (purple) has shortest CI with length 0.7966. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level).
In Table 8, we compare the four ln-quantile estimators and their mean, median, and Table 9 compares the size of confidence intervals at and of the three quantile estimators.
Table 8.
Estimated and for the flu in Canada data. (Unit: Type A flu viruses).
Table 9.
The 95% confidence interval for and .
In Table 9, we compare the confidence intervals of the three quantile estimators; the new estimator has the shortest confidence interval, with the highest efficiency of 2.2462.
5.1.3. Summary
5.2. Gamma Ray of Solar Flare Example
Gamma rays have the most penetrating power of all forms of radiation. Gamma-ray bursts, thought to be due to the collapse of stars called hypernovas, are the most powerful events so far discovered in the cosmos. Gamma rays are measured in counts, the number of atoms in a given quantity of radioactive material that an instrument detects as having decayed. We have collected gamma ray data from solar flares, from November 2008 to September 2020, from NASA (National Aeronautics and Space Administration, 2020 [31]). The full data set is available at http://hesperia.gsfc.nasa.gov/fermi/gbm/qlook/fermi_gbm_flare_list.txt.
A solar flare travels hundreds of miles per second and can reach the Earth within hours. It can disrupt communication and navigational equipment, damage satellites, and even cause blackouts by damaging power plants. In 1989, a strong solar storm knocked out the power grid in Québec, Canada, causing 6 million people to lose power for more than 9 hours, and it cost millions of dollars to repair. It can bring additional radiation around the north and south poles, a risk that forces airlines to reroute flights. The Fermi Gamma-ray Space Telescope was launched in late 2008 to explore high-energy phenomena in the Universe. When more than one trigger occurred during a flare, the one nearest the peak of the flare is listed, resulting in a sample size of 5128. Solar flares are classified as A, B, C, M or X according to the peak flux (in watts per square meter, W/m2) of 1 to 8 angstrom X-rays near the Earth, as measured on the GOES spacecraft (the angstrom is a unit of length equal to 1/10,000,000,000, one ten-billionth, of a meter). Gamma ray activity is correlated with X ray activity, as shown in Figure 11 (NOAA, 2020 [32]). When the amount of gamma ray released is over 5 million counts, it usually corresponds to an X rated flare or a significant M rated flare.
Figure 11.
Two weeks plot of gamma ray & X ray from July 2 to 16, 2012.
Figure 12a shows a gamma ray chart of the 5128 flares, with 104 flares remaining above the threshold of 86 million counts. The most powerful gamma ray was released on March 7, 2012, with nearly 1.5 billion counts; the sun brightened by 1000 times and became the brightest object in the gamma ray sky. The top three events are circled in the chart. Figure 12b shows a histogram of the 5128 flares. We are interested in the 99% quantile, the value such that 99% of the gamma ray released from solar flares is under it or, equivalently, such that with a 1% probability the amount of gamma ray a solar flare releases exceeds it. During the spring and fall, the satellites that are used to detect solar flares experience eclipses, in which the Earth or the Moon passes between the satellites and the Sun for a short period every day. Eclipse season lasts for about 45 to 60 days, and each eclipse ranges from minutes to just over an hour. The quantile estimation would provide useful predictions for these times. The 99% quantile is only approximately located in the plot since we do not know this value yet.
Figure 12.
Gamma ray original data from November 2008 to April 2017: (a) Gamma ray released vs. solar flare occurrence. After the threshold of 86 million counts, 104 flares remain. (b) Histogram of gamma ray released from solar flares.
We chose the threshold as the mean of the data from the peak period. The solar cycle lasts about 11.6 years, and the Sun’s activity peaked from 2011 to 2014. In Figure 12a we can see that the top 3 flares, and in fact almost 90% of the top 100 flares, are from the 2011 to 2014 period. Taking the average of all the X rated and significant M rated flares from this peak period, we obtained a mean of 86 million counts, which we use as the threshold, resulting in a remaining sample size of 104.
For the gamma ray of solar flare example, our goal is to estimate the high quantiles of the amount of gamma ray a solar flare would release, specifically those with exceedance probabilities of 5% and 1%, together with their 95% confidence intervals.
5.2.1. Goodness-of-Fit Tests
As in the Flu in Canada example, we set , and obtain , and Figure 13a is a log-log plot of the gamma ray data under the model, with the horizontal axis against the vertical axis . Figure 13b shows that the GPD model fits the histogram.
Figure 13.
Transformed data after the threshold of 86 million counts: (a) Log-log plot of gamma ray from the solar flare example. (b) The estimated GPD and the 99% high quantile of the distribution of gamma ray released by solar flares.
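A log-log diagnostic of this kind can be sketched in Python. The block below is a minimal illustration, assuming the common form in which the log of the empirical survival probability is plotted against the log of the ordered observations; the axes actually used in Figure 13a are not recoverable here, and the function names `loglog_tail_points` and `tail_slope` are hypothetical. For a Pareto-type tail the points fall close to a straight line whose slope is the negative reciprocal of the tail index.

```python
import numpy as np


def loglog_tail_points(sample):
    """Return (log x_(i), log empirical survival) for a positive sample.

    Assumption: plotting positions i / (n + 1), so the survival
    probability never reaches zero and its logarithm stays finite.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    if x[0] <= 0:
        raise ValueError("the sample must be strictly positive")
    n = x.size
    survival = 1.0 - np.arange(1, n + 1) / (n + 1.0)
    return np.log(x), np.log(survival)


def tail_slope(sample, k):
    """Least-squares slope of the log-log plot over the k largest points."""
    log_x, log_s = loglog_tail_points(sample)
    slope, _intercept = np.polyfit(log_x[-k:], log_s[-k:], 1)
    return float(slope)


if __name__ == "__main__":
    # Synthetic Pareto sample with tail index 0.5 (illustration only,
    # not the gamma ray data).
    rng = np.random.default_rng(1)
    demo = (1.0 - rng.random(2000)) ** (-0.5)
    print("slope over the top 200 points:", tail_slope(demo, 200))
```

A roughly linear upper part of the plot supports a heavy-tailed model for the exceedances; strong curvature would suggest that the threshold is too low.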
Next, we perform three goodness-of-fit tests: the Kolmogorov-Smirnov test, the Anderson-Darling test and the Cramér-von Mises test. The results are listed in Table 10; the data fits the with the best, nearly 59%.
Table 10.
Comparison of the goodness-of-fit tests under the model for the gamma ray data.
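The three tests named above can be sketched in Python with SciPy. This is a minimal illustration on synthetic data, not the computation behind Table 10: it assumes a maximum likelihood GPD fit with the location fixed at zero (exceedances measured from the threshold), and the helper name `gpd_goodness_of_fit` is hypothetical. SciPy's `anderson` function does not cover the GPD, so the Anderson-Darling statistic is computed directly from its definition and no p-value is reported for it.

```python
import numpy as np
from scipy import stats


def gpd_goodness_of_fit(exceedances):
    """Fit a GPD to exceedances and return KS, CvM and AD diagnostics."""
    x = np.sort(np.asarray(exceedances, dtype=float))
    n = x.size

    # Assumption: maximum likelihood fit with location fixed at 0.
    shape, loc, scale = stats.genpareto.fit(x, floc=0.0)
    fitted = stats.genpareto(shape, loc=loc, scale=scale)

    ks = stats.kstest(x, fitted.cdf)           # Kolmogorov-Smirnov
    cvm = stats.cramervonmises(x, fitted.cdf)  # Cramér-von Mises

    # Anderson-Darling statistic:
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) [ln u_i + ln(1 - u_{n+1-i})]
    u = np.clip(fitted.cdf(x), 1e-12, 1.0 - 1e-12)
    i = np.arange(1, n + 1)
    ad_stat = -n - np.mean((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))

    return {
        "shape": float(shape),
        "scale": float(scale),
        "ks_statistic": float(ks.statistic),
        "ks_pvalue": float(ks.pvalue),
        "cvm_statistic": float(cvm.statistic),
        "cvm_pvalue": float(cvm.pvalue),
        "ad_statistic": float(ad_stat),
    }


if __name__ == "__main__":
    # Synthetic exceedances (illustration only); 104 matches the size of
    # the exceedance sample in the text.
    demo = stats.genpareto.rvs(0.5, scale=1.0, size=104, random_state=0)
    for name, value in gpd_goodness_of_fit(demo).items():
        print(f"{name}: {value:.4f}")
```

Because the GPD parameters are estimated from the same data, the p-values returned by `kstest` and `cramervonmises` are only approximate and tend to be optimistic; a parametric bootstrap is one way to calibrate them.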
In Table 11, all the errors are less than 0.07 for AE, and less than 0.01 for .
Table 11.
and under the model for the gamma ray data using .
Next, we compare the four high quantile estimators and their confidence intervals for this example.
5.2.2. Compare Four Estimation Methods
As in Example 1, we use the four quantile estimators in Table 1: , , , and the .
We use and and , thus we have and , where is the optimal k value for the second-order parameters. The results are in Figure 14.
Figure 14.
For the gamma ray of solar flare example: (a) Estimates of the second-order parameters and , , (b) Estimates and (c) Tail index estimators, H, . (d) ln-quantile estimators. The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level.
Figure 14a shows the estimates of the second-order parameters and , Figure 14b shows , and Figure 14c shows the two different tail index estimators, H, . We have at its optimal level with , and at its optimal level with . Figure 14d shows all four quantile estimators for the gamma ray example, with . We note that has a constant value which does not depend on k.
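To make the dependence on k concrete, the following Python sketch computes the classical Hill tail index estimator (Hill, 1975 [4]) and the Weissman quantile estimator (Weissman, 1978 [5]) from the k largest order statistics. It is an illustration under stated assumptions, not the new estimator proposed in this paper or the reduced-bias versions: the function names are hypothetical, the data are synthetic, and no second-order bias correction is applied.

```python
import numpy as np


def hill_estimator(sample, k):
    """Hill tail index estimate based on the k largest order statistics."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    base = x[n - k - 1]  # order statistic X_{n-k:n}
    if base <= 0:
        raise ValueError("X_{n-k:n} must be positive")
    return float(np.mean(np.log(x[n - k:]) - np.log(base)))


def weissman_quantile(sample, k, p):
    """Weissman estimate of the quantile exceeded with probability p."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    gamma = hill_estimator(x, k)
    return float(x[n - k - 1] * (k / (n * p)) ** gamma)


if __name__ == "__main__":
    # Synthetic strict Pareto sample with tail index 0.5; the true
    # quantile at p = 0.01 is 0.01 ** (-0.5) = 10.
    rng = np.random.default_rng(0)
    demo = (1.0 - rng.random(5000)) ** (-0.5)
    for k in (50, 200, 500):
        print(k, hill_estimator(demo, k), weissman_quantile(demo, k, 0.01))
```

Evaluating both functions over a grid of k values gives curves of the kind plotted against k in Figure 14; the sensitivity of such curves to k is what motivates choosing an optimal k level.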
Figure 15 compares the confidence intervals of our ln-quantile estimators in (7), (12) and (19). This figure shows that the new quantile estimator has the shortest confidence interval, with length 1.4451, where we use . The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level.
Figure 15.
95% confidence intervals of the three ln-quantile estimators after the threshold of 86 million counts for the gamma ray example. , Note that (purple) has the shortest CI, with length 1.4451. (The solid circles “•” in the plot are the values of the quantile estimators at their optimal k level).
In Table 12, we compare all four quantile estimators under and . Table 13 compares the lengths of the confidence intervals of and given by the three quantile estimators.
Table 12.
Estimated and in the gamma ray example. (Unit: million counts).
Table 13.
The 95% confidence interval of and
Table 13 shows that the new estimator has the shortest confidence interval compared to ln and ln , and the highest efficiency, 1.6016.
5.2.3. Summary
6. Conclusions
Based on the studies in this paper, we conclude that:
1. Estimating high quantiles and their CIs provides important information for risk management and for extreme event predictions.
2. Based on the theoretical and simulation results, the proposed new method for estimating confidence intervals of high quantiles has advantageous properties compared with other existing methods. The estimation is consistent and stable, with less error. The proposed method also provides readers with a useful computational algorithm.
3. The confidence interval of a high quantile obtained by the proposed method also has the highest efficiency compared to the existing methods, in the sense that it has the shortest length and, in most cases, the highest coverage probability of the true quantile value.
4. Based on the analysis of the two real-world examples, flu in Canada and gamma ray from the solar flare, we can see that the new proposed method can be applied to many more fields, including other extreme events such as insurance claims, natural disasters, stock market predictions and pandemic disease monitoring.
Author Contributions
The authors M.L.H. and X.R.-Y. carried out this work and drafted the manuscript together. All authors have read and agreed to the published version of the manuscript.
Funding
This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) grant: MLH DDG-2019-04206.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available datasets were analyzed in this study. The datasets can be found here: http://hesperia.gsfc.nasa.gov/fermi/gbm/qlook/fermi_gbm_flare_list.txt [31] and https://www.who.int/influenza/gisrs_laboratory/flunet/en [28].
Acknowledgments
We are grateful for the comments of the reviewers and editor. They have helped us to improve the paper. We deeply appreciate the Brock Library Open Access Publishing Fund support.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs of Theorems 1 and 2
Lemma 1.
The sum of satisfies the inequality
where w is a weight
is the correlation coefficient of and
Proof of Lemma 1.
For each pair
then we obtain a bound if we add only the positive terms in the following summation
Lemma 2.
Under conditions (C1) and (C2), for in (14), by Theorem 5.1, formula (5.2) in Gomes and Pestana (2007, p. 285 [17]), as
then the asymptotic expected value and variance are
Proof of Theorem 1.
Under conditions (C1) and (C2), in the Hall-Welsh class of models in (6), where is in (8) with conditions
and
where is an asymptotic standard normal random variable.
By the Schwarz inequality and Lemma 1, formula (A1), since is a constant in (19), and based on the asymptotic properties of in (13) (Gomes and Pestana, 2007, p. 286 [17]), we have that
Therefore, when n is large enough, using Lemma 2, formula (A2),
and
we can have the following approximate relation
which proves (21). Therefore, we have the asymptotic normal distribution in (20)
Furthermore, using (21) and (A2), we obtain (22) as
Proof of Theorem 2.
Under conditions (C1) and (C2), using in Theorem 1, formula (22), and with
we have
therefore approximately
then
and using to guarantee a coverage probability of at least we have
References
1. Fisher, R.A.; Tippett, L.H.C. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Proc. Camb. Philos. Soc. 1928, 24, 180–190.
2. de Haan, L.; Ferreira, A. Extreme Value Theory; Springer: New York, NY, USA, 2006.
3. Dekkers, A.L.M.; de Haan, L. On the estimation of the extreme value index and large quantile estimation. Ann. Stat. 1989, 17, 1795–1832.
4. Hill, B.M. A simple general approach to inference about the tail of a distribution. Ann. Stat. 1975, 3, 1163–1174.
5. Weissman, I. Estimation of parameters and large quantiles based on the k largest observations. J. Am. Stat. Assoc. 1978, 73, 812–815.
6. Beirlant, J.; Vynckier, P.; Teugels, J.L. Tail index estimation, Pareto quantile plots and regression diagnostics. J. Am. Stat. Assoc. 1996, 91, 1659–1667.
7. Drees, H.; Kaufmann, E. Selecting the optimal sample fraction in univariate extreme value estimation. Stoch. Proc. Appl. 1998, 75, 149–172.
8. Guillou, A.; Hall, P.G. A diagnostic for selecting the threshold in extreme value analysis. J. R. Stat. Soc. B 2001, 63, 293–305.
9. Gomes, M.I.; Oliveira, O. The bootstrap methodology in statistics of extremes: Choice of the optimal sample fraction. Extremes 2001, 4, 331–358.
10. Hall, P.; Welsh, A.H. Adaptive estimates of parameters of regular variation. Ann. Stat. 1985, 13, 331–341.
11. Peng, L. Asymptotically unbiased estimators for the extreme-value index. Stat. Probab. Lett. 1998, 38, 107–115.
12. Beirlant, J.; Dierckx, G.; Goegebeur, Y.; Matthys, G. Tail index estimation and an exponential regression model. Extremes 1999, 2, 177–200.
13. Feuerverger, A.; Hall, P.G. Estimating a tail exponent by modelling departure from a Pareto distribution. Ann. Stat. 1999, 27, 760–781.
14. Gomes, M.I.; Martins, M.J.; Neves, M. Alternatives to a semi-parametric estimator of parameters of rare events-the jackknife methodology. Extremes 2000, 3, 207–229.
15. Caeiro, F.; Gomes, M.I. A class of asymptotically unbiased semi-parametric estimators of the tail index. Test 2002, 11, 345–364.
16. Gomes, M.I.; Figueiredo, F.; Mendonca, S. Asymptotically best linear unbiased tail estimators under second order regular variation. J. Stat. Plan. Inference 2004, 134, 409–433.
17. Gomes, M.I.; Pestana, D. A sturdy reduced-bias extreme quantile (VaR) estimator. J. Am. Stat. Assoc. 2007, 102, 280–292.
18. Caeiro, F.; Gomes, M.I.; Pestana, D. Direct reduction of bias of the classical Hill estimator. Revstat 2005, 3, 113–136.
19. Lekina, A.; Chebana, F.; Ouarda, T.B.M.J. Weighted estimate of extreme quantile: An application to the estimation of high flood return periods. Stoch. Environ. Res. Risk Assess. 2014, 28, 147–165.
20. Lekina, A. Estimation Non-Paramétrique des Quantiles Extrêmes Conditionnels. Ph.D. Thesis, Université de Grenoble, Saint-Martin-d’Hères, France, 2010.
21. Huang, M.L. A New High Quantile Estimator for Heavy Tailed Distributions; (Working Paper); Department of Mathematics, Brock University: St. Catharines, ON, Canada, 2011.
22. Fréchet, M. Sur la loi de probabilité de l’écart maximum. Ann. Soc. Pol. Math. 1927, 6, 93–116.
23. de Zea Bermudez, P.; Kotz, S. Parameter estimation of the generalized Pareto distribution. J. Stat. Plan. Inference 2010, 140, 1374–1388.
24. Bickel, P.J.; Doksum, K.A. Mathematical Statistics: Basic Ideas and Selected Topics, 2nd ed.; CRC Press, Taylor & Francis Group: Boca Raton, FL, USA, 2015; Volume 1.
25. Pickands, J. Statistical inference using extreme order statistics. Ann. Stat. 1975, 3, 119–131.
26. Balkema, A.A.; de Haan, L. Residual life time at great age. Ann. Probab. 1974, 2, 792–804.
27. Scarrott, C.; MacDonald, A. A review of extreme value threshold estimation and uncertainty quantification. Revstat 2012, 10, 33–60.
28. World Health Organization (WHO). 2020. Available online: https://www.who.int/influenza/gisrs_laboratory/flunet/en (accessed on 31 December 2020).
29. Kolmogorov, A.N. Sulla determinazione empirica di una legge di distribuzione. G. Dell. Istituto Ital. Degli Attuari 1933, 4, 83–91.
30. Anderson, T.W.; Darling, D.A. Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Stat. 1952, 23, 193–212.
31. National Aeronautics and Space Administration (NASA). Gamma Ray. 2020. Available online: http://hesperia.gsfc.nasa.gov/fermi/gbm/qlook/fermi_gbm_flare_list.txt (accessed on 31 December 2020).
32. National Weather Service (NOAA). Space Weather Prediction Center. 2020. Available online: https://satdat.ngdc.noaa.gov/sem/goes/data/plots (accessed on 31 December 2020).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).