Measuring extremal clustering in time series

. The propensity of data to cluster at extreme values is important for risk assessment. For example, heavy rains that last over time lead to catastrophic floods. The extremal index is a measure of Extreme Values Theory that allows measuring the degree of high values clustering in a time series. Inference about the extremal index requires the prior choice of values for tuning parameters which impacts the efficiency of existing estimators. In this work we propose an algorithm that avoids these constraints. Performance will be evaluated based on simulation. We also illustrate with real data.


Introduction
The occurrence of extreme values can lead to risky situations.Climate change and the global economic and financial crisis resulting from the COVID19 pandemic situation and the war in Ukraine have contributed to a continuous growing attention from analysts, namely, to the risk of extreme phenomena.The duration of extreme values in time means the generation of clusters, whose extension can aggravate the phenomenon.Extreme Values Theory (EVT) presents a set of adequate tools in this context.The extremal index is a measure of serial dependence assessing the propensity of data for the occurrence of clusters of extreme values.Figure 1 shows the maximum of sea-surge heights, where clusters of high values are visible.
More precisely, considering X = {X n } n≥1 a stationary sequence of random variables (r.v.) with common marginal distribution function (d.f.) F and denoting M n = max(X 1 , ..., X n ), then X has extremal index θ ∈ (0, 1] if for each real τ > 0 there exists a sequence of normalized levels u n , i.e., satisfying n(1 − F (u n )) → τ , as n → ∞, such that P (M n ≤ u n ) → exp(−θτ ).In the independent and identically distributed (i.i.d.) case, we have P (M n ≤ u n ) → exp(−τ ) and thus θ = 1.On the other hand, if θ = 1 then the tail behavior of X resembles an i.i.d.sequence.Clustering of extreme values takes place whenever θ < 1 and the smaller the θ the larger the propensity for clusters to appear.Under some dependence conditions, θ is stated as the arithmetic inverse of the mean cluster size (Hsing et al. [15] 1988).
Assuming F continuous, we have where Ŷni = −b n log( F (M ni )) and F denotes the empirical d.f.estimating the usually unknown F , with t n = k n or t n = n − b n + 1, whether we are using disjoint or sliding blocks, respectively.In Berghaus and Bücher ([2] 2018) it was considered with , a more amenable formulation to derive the asymptotic properties.Here we consider the Berghaus and Bücher estimator with bias adjustment given by We also consider the sliding blocks version since it usually performs better (Northrop [19] 2015, Berghaus and Bücher [2] 2018).
Observe that the estimators above only depend on a tuning parameter: the block length b ≡ b n .This is an advantage of these methods since most estimators of θ presented in literature have two sources of uncertainty and thus two parameters to be defined in advance: the clustering generation of high values and the choice of a high threshold above which the clusters occur.To mention the best known ones, there is the Nandagopalan ([17] 1990), Runs and Blocks (Weissman & Novak, [22] 1998 and references there in), K-gaps (Süveges & Davison, [16] 2010), censored/truncated (Holěsovský & Fusek, [13,14] 2020/22), cycles Estimator (Ferreira & Ferreira,[7] 2018).We also refer other estimators that require a single tuning parameter, like the Intervals estimator which needs to fix a high threshold Ferro & Segers, [9] 2003), and similar to the Northop estimator above, where we only choose the block length for maxima, we cite Gomes ([11]  As already highlighted in the literature, there is no simple optimal methodology for the best choice of block length and a single estimate for θ.In EVT we have a typical bias-variance trade-off observed in sample paths estimates of rare events parameters.For blocks estimators, the bias decreases with b while the variance increases.A recurrent method is to plot the estimates obtained for successive block size values and visually identify case-by-case plateau zones of these estimates.The stability around a value is an indicator of a reasonable estimate, and this stability region, in general, should not be neither in too small nor in too large values of b, due to the trade-off between bias and variance already mentioned.In Figure 2, it is plotted a trajectory of estimates (full line) along with 95% confidence intervals (CI) (dashed line) obtained for each block length b from 1 to 100, in a random sample of dimension 1000 generated from a moving maximum model with standard Fréchet margins.We can see a plateau region in the estimates around the true value (horizontal line) θ = 0.5 for the block sizes between 25 and 45.Observe the large variability occurring for large values of b and the higher bias for small values of b.
Some methods have been proposed in the literature to help in the choice of tuning parameters based on the stability regions of the estimates graph.See, e.g., Frahm et  will be extracted, as described above.The method will be detailed in Section 2 and analyzed through simulation in Section 3. We end with an application to real data.

Estimation method
Our proposal estimation of θ is based on the bias corrected estimator θ in (3) by considering sliding blocks and on the heuristic plateau-finding algorithm of Frahm et al. ([10] 2005).
The algorithm is described in the following steps: Step 1. Calculate estimates θ b from estimator (3), for 1 ≤ b ≤ t < n; Step 2. Smooth the results of the previous step by taking means of 2w + 1 successive estimates; we have considered bandwidth w = ⌊0.02t⌋; Step 3. Define plateaus of length m = ⌊ √ t − 2w⌋, i.e., p j = θ j , ..., θ j+m−1 , j = 1, ..., t − 2w − m + 1; Step 4. Compute the standard deviation s of θ 1 , ..., θ t−2w and choose the first plateau Step 5.The extremal index is estimated through 1 m m i=1 θ j+i−1 , i.e., taking the average of the estimates that constitute the plateau chosen in the previous step.This will be denoted plateau estimator.
The estimators (1), ( 2) and ( 3) are already implemented in package exdex of software R (Northrop & Christodoulides [20] 2019) with respective CI.We use package exdex to compute estimator (3) under sliding blocks and respective upper and lower 95% CI bounds.We also apply steps 1, 2 and 3 to the lower and upper bounds of CI.Once the plateau of theta estimates is chosen in step 4, we pick the corresponding plateau in the CI limits and in step 5 we apply the average to the plateau values of the lower limit of the CI as well as the average of the plateau values of the upper limit of the CI.

Simulation study and application
The simulation study is based on random generation of samples with size 1000 replicated 1000 times, within each of the models described above.We consider different models with different values of θ.We apply the estimation plateau method of Section 2 both to estimate θ and respective 95% CI lower and upper bounds.In Table 1 are the estimation global results of the plateau method.See also the simulation results of θ given in (3) for each block size b in Figure 3 as well as the results of plateau method.We can observe in each model that the plateau estimate (thicker gray horizontal full line) is located in a plateau zone of the sample path of estimates plotted as a function of block size b (full black line), as expected.We can also see that the plateau estimate is close to the true value (blue horizontal full line).In all cases it is verified that the limits of 95% CI estimated by the plateau method (thicker gray horizontal dotted-dashed lines) include the true value of θ.In the ARCH case, the estimates closest to the true value of θ occur for large values of b where the variability is very high, which makes it difficult to apply the plateau methodology.Even so, the root mean squared error (rmse) of 0.1126 is not very expressive.The independent model (Ind) has θ = 1 and, therefore, constitutes a frontier value of the parameter support, which typically leads to difficulties in statistical estimation.Still, the plateau method showed relatively low bias and rmse.Observe also that in the MAR model we have θ = 0.9, quite near to boundary value 1, and the plateau method does a very good job.

Conclusion
This work addresses the estimation of the extremal index θ.This is an important measure in time series, namely in assessing risky phenomena, as it measures the propensity for occurrence of clusters of extreme values.The estimation of θ requires the prior setting of tuning parameters values that impact the precision of estimates.In this work we present an algorithm that allows to estimate θ free of tuning parameters.We applied this methodology to diverse models and the results are encouraging in several cases.In the future it is intended to continue the study of this methodology and develop it in order to improve its applicability to different types of models.

Fig. 3 .Fig. 4 .
Fig.3.Simulation results: average of estimates of θ for each block size b = 2, ..., 200 using θ in (3) (full black line) and average of respective 95% CI upper and lower bounds (dotted lines); plateau estimation of θ (thicker gray horizontal full line) and respective plateau estimates of 95% CI upper and lower bounds (thicker gray horizontal dotteddashed lines).The true value of θ corresponds to the blue horizontal full line.

Table 1 .
Simulation results of plateau method: average of θ estimates (mean), average of lower and upper 95% CI bounds estimates, bias, root mean squared error (rmse) and standard deviation of θ estimates (sd).We illustrate the method with the real data newlyn available in R package exdex, consisting of 2894 sea-surge heights measured at the coast at Newlyn, Cornwall, UK, over years 1971-1976.The observations correspond to the maximum hourly surge heights during periods of 15 hours.See the left plot in Figure4.Previous analysis of this data can be seen in Northrop([19]2015) and references therein.The sample path of estimates from (3) as a function of block size b and respective 95% confidence limits are plotted on the right graph of Figure4.The horizontal full line corresponds to the plateau estimate of θ given by 0.2577 and the horizontal dotted-dashed lines to the plateau 95% CI estimate (0.2206, 0.2948).