Modelling Short- and Long-Term Dependencies of Clustered High-Threshold Exceedances in Significant Wave Heights

The peaks-over-threshold (POT) method has a long tradition in modelling extremes in environmental variables. However, the assumption of independently and identically distributed (iid) data is likely to be violated in practical settings, leading to clustering of high-threshold exceedances. These violations can be the result of shortand long-term dependencies in the underlying time series. We review popular approaches that either focus on modelling shortor long-range dynamics explicitly. In particular, we consider conditional POT variants and the Mittag-Leffler distribution modelling waiting times between exceedances. Further, we propose a two-step approach capturing both shortand long-range correlations simultaneously. We suggest the autoregressive fractionally integrated moving average (ARFIMA)-POT model, which first fits an ARFIMA model to the original series and then utilises a classical POT model for the residuals. Applying these models to an oceanographic time series of significant wave heights measured on the Sefton coast (UK), we find that neither solely modelling shortnor long-range dependencies satisfactorily explains the clustering of extremes. The ARFIMA-POT approach however provides a significant improvement in terms of model fit. We therefore conclude that there is a need for developing new models that jointly incorporate shortand long-range dependence to address extremal clustering. JEL-Numbers: C22, C52


Introduction
Extreme value theory (EVT) has emerged as an invaluable toolkit for a broad range of sciences over the past decades. Its techniques attempt to predict the occurrence of extreme events by making the best use of limited available data. They can be divided into two fundamental approaches: the block maxima method and the peaks-over-threshold (POT) method. While the former splits the observation period into non-overlapping, equally-sized periods and then only analyses the maximum of each period, the latter picks extreme observations exceeding a specific high threshold [32]. In the following, we focus on the POT method, which has proven useful in a variety of applications including financial risk management [18,52], insurance [63], coastal engineering [6,26], and oceanography [70,16]. We regard the POT approach from the perspective of a marked point process (MPP): exceedances over a high threshold u will be referred to as extreme events. Such an event is associated with an occurrence time and a mark size, the latter corresponding to the size of the excess above the threshold.
The motivation for the classical POT model can be found in the asymptotic behaviour of high threshold exceedances for identically and independently distributed (iid) data. The approach assumes that if the threshold u is chosen high enough, the extreme events follow a marked Poisson process, i.e. the number of extremes events occurring can be described by a homogeneous Poisson process and their mark sizes are iid with a generalised Pareto distribution (GPD). This process also characterises the inter-event times as being iid exponentially distributed.
In many practical settings, however, the iid assumption is violated. Short-range dependencies arising from local serial correlation of the underlying time series may cause extreme events to occur in clusters [18]. Therefore, assuming independent exponential inter-event times may seem unreasonable. These practical issues are often tackled by using declustering methods and then fitting the marked Poisson process to the largest value of each cluster only [see e.g. 24]. This approach however disregards the information contained within the clusters [cf. 82,58], and gives rise to the challenge of appropriately and consistently identifying clusters from given data. Relevant literature includes works on statistical declustering using the extremal index due to Leadbetter et al. [60] [82,33,30,22], on physical declustering [82,61,62], or on double threshold approaches combining statistical and physical declustering [66,7]. However, all these declustering methods have in common that they avoid modelling the short-range dependence directly.
It is impossible to universally characterise the short-term behaviour of stationary sequences as it can take countless different forms, but imposing certain constraints on the local dependence structure enables parametric modelling thereof. One popular approach builds on a marked point process with a self-exciting structure to model the clustering of extreme events, leading to conditional POT approaches. The Hawkes-POT model introduced by Chavez-Demoulin et al. [18] belongs to the general class of self-exciting Hawkes processes [46] and allows the intensity of the occurrence of extreme events to depend on both past event times and marks. A related approach is the so-called self-exciting POT (SEPOT) suggested by McNeil et al. in the first edition of [68], where one Hawkes process influences the intensity of both event times and marks. The various conditional POT models were reviewed in for example [51,17,52,41] amongst others, and extended to a multivariate framework by Grothe et al. [42] and Hautsch and Herrera [45]. Conditional POT models are based on similar ideas as the epidemic-type aftershock sequence (ETAS) model proposed by Ogata [72], which is designed to describe the occurrence of earthquakes based on previous ones and was further explored for example by Helmstetter and Sornette [50].
In the aforementioned approaches, the conditional intensities of the events depend exponentially on the characteristics of past events implying a focus on modelling short-term dependencies explicitly, but neglecting any possible long-term correlations. However, we also have to consider settings in which longrange dependence in the underlying sequence influences the clustering behaviour so that we cannot per se assume that widely separated events are independent. Many time series of practical interest ranging from finance [36] to hydrology [54] have long been known to exhibit long-range dependence, implying a power-like decay of the autocovariance function γ(h), h ∈ Z, i.e.
Modelling the extreme events of long-range dependent time series hence requires special attention to the characteristics that event times and marks inherit from the underlying process. Generally, long-range dependence in the underlying time series has been observed to lead to pronounced clustering of extreme events [13,28]. There is empirical evidence that the inter-event times may follow a Weibull [9], a stretched exponential distribution [12,2,86] or the product of a power law and a stretched exponential [79], but these observations lack analytical justification. The inter-event times themselves have also been found to exhibit long-range dependence [10,28]. In numerical experiments, maxima sequences were as well found to exhibit long-range dependence and, depending on the distribution and the strength of the long-term correlations of the underlying sequence, the distribution of maxima may deviate from the asymptotics for finite block sizes [27]. It could be suspected that the effects on the whole tail of the mark distribution might be more pronounced than when considering only a sequence of maxima.
In the context of EVT, not many parametric approaches exist that incorporate explicit modelling of the long-term dependence structure in the underlying series. One example is suggested by Hees et al. [49] and motivated by so-called bursty extremes where event clusters are separated by long periods without any event. Hees et al. [49] provide a parametric approach to modelling a setting in which the event occurrences follow a fractional Poisson process which was proved to be long-range dependent by Biard and Saussereau [8]. They focus on characterising the inter-event times via the Mittag-Leffler distribution, which generalises the exponential distribution [57,69]. This framework can easily be complemented by additionally modelling the mark sizes through a GPD [49].
All these variants of the POT model have in common that they explicitly model the dependence structure of the underlying times series to capture the clustering behaviour of the extreme events, either by taking into account short-term or long-term correlations. However, none of the models embeds shortand long-range dynamics at the same time. Considering real world applications, it is very likely that many financial and physical processes are governed by both short-and long-range dependencies affecting the cluster formation of their extreme events.
The focus of this paper lies on oceanographic variables which are frequently identified to be longrange dependent. Especially near the coast, sea level observations [3,31] and sea surface temperatures [64,55] were found to exhibit long-range dependence. Similar findings are reported for sea-ice thickness [74] and significant wave height [73,15,87]. The dataset we consider in our application in Section 4 also consists of significant wave heights that were measured on the Sefton coast (UK). Figure 1 depicts the observed time series and illustrates the exceedances of the high threshold u = 2.5 m, i.e. the extreme events. It is obvious that the extreme events appear in widely separated clusters implying strong local dependencies, which is confirmed by the empirical autocorrelation functions (ACFs) of both observations and their squares (cf. Figure 2). Therefore, it is not surprising that the inter-event times appear to have a distribution clearly differing from the exponential distribution expected for iid observations, as can be observed from the quantile-quantile (QQ) plot in Figure 1. The periodogram of the significant wave heights in Figure 2 further hints at the presence of long-range dependence due to the pole at the origin, which is in line with the aforementioned literature. Thus the visual inspection of the considered dataset implies that the dependence structure of the significant wave heights exhibits both short-and long-term correlations. Of course this has to be accounted for when attempting to explicitly characterise the extremal clustering. Therefore, with the aim of jointly modelling short-and long-range dependencies, we suggest a two-step approach in the spirit of McNeil and Frey [67] which we will call autoregressive fractionally integrated moving average peaks-over-threshold (ARFIMA-POT) approach. Addressing the issue of heteroscedasticity in financial data, McNeil and Frey [67] introduced the so-called GARCH-EVT approach which first fits a GARCH model to a return series and then applies standard EVT techniques to the pseudo-iid residuals from the first step. Even though there is no theory that mathematically characterises the clustering behaviour of the extremes from GARCH model residuals [71], the two-stage method has proven to work well in practice [14,56,34,65,19]. In a similar fashion, we propose a two-stage method combining a parametric model with a classical EVT model. Our ARFIMA-POT approach is based on first fitting an autoregressive fractionally integrated moving average (ARFIMA) model [40,53] to the original oceanographic dataset and then applying the standard POT model to the ARFIMA residuals. Given that the ARFIMA model captures both short-and long-range dependence in the underlying time series, the model residuals are supposed to be iid, so that all difficulties arising for the distribution of event times and mark sizes due to the short-and long-term behaviour in the original series can be avoided. The method is inspired by the idea of prewhitening the underlying series as suggested for example by Davison and Smith [24] to deal with non-stationarity and seasonality in POT modelling. For Hill's estimator of the tail index, Resnick and Stȃricȃ [76,77] show that using the residuals of an autoregressive (AR) process is superior in terms of asymptotic variance than estimation based directly on the observations, given that the series follows an AR process and the AR model has a causal representation. This clearly indicates Figure 1: Offshore significant wave heights at Sefton coast from September 10 to December 09, 2013. Exceedances above a threshold u = 2.5 m, corresponding to 7.9% of the observations, and QQ plot of the inter-event times compared to a unit exponential distribution. that prewhitening can have favourable properties in the context of EVT. Further, the idea of utilising an ARFIMA model to describe a time series in the context of oceanography is not new. ARFIMA models are a popular choice for characterising levels and trends in weather-related, hydrological and oceanographic data [85]. Specifically, Ercan at al. [31] for instance model the Caspian sea level using ARFIMA processes. Thus, our ARFIMA-POT approach can be interpreted as filtering the underlying oceanographic time series using an ARFIMA model before applying the classical POT method.
This paper contributes to the line of literature that aims to model the clustering behaviour of extreme events. Specifically, we consider strongly dependent significant wave height data for which high level exceedances appear in an extremely clustered manner. As this setting poses a particular challenge for traditional POT approaches, we analyse whether popular models dealing with short-or long-range dependence convey enough information about the dependence structure in the data. Moreover, we suggest the two-step ARFIMA-POT approach that captures both short-and long-range dependence in the underlying time series. Further, we verify our conclusions from the application to significant wave heights in a simulation study. The range of models reviewed is not meant to be exhaustive, but is rather supposed to present a selection of commonly used approaches modelling either the short-or long-term behaviour of extreme events, but not both simultaneously. Our results are meant to provide a basis for further discourse on the modelling of extremal clustering behaviour in EVT. We show that current tools are not flexible enough to simultaneously capture the short-and long-range dependence of the underlying process. Therefore, we want to spark new interest in researching the intersection of modelling local clustering behaviour and long-range dependence to find solutions to a question of practical interest.
The paper is organised as follows. Section 2 reviews the basic concepts around POT modelling. Section 3 presents the considered models. This section is divided into established models characterising extremal clustering by taking into account short-and long-range dependence respectively, and the presentation of the two-step ARFIMA-POT approach incorporating short-and long-term dynamics simultaneously. Section 4 applies the considered approaches to significant wave heights and discusses their performance in the presence of extremal clustering and long-range dependence. Section 5 examines the conclusions from the application in a simulation study and Section 6 concludes.

Peaks-over-threshold modelling
In the following, we review the univariate POT approach through the lens of point processes as introduced by Pickands [75]. In the presentation, we loosely follow McNeil et al. [68].
Consider a sequence of iid random variables X 1 , . . . , X n with some continuous distribution function F that is in the maximum domain of attraction of a non-degenerate limiting distribution, i.e. F ∈ M DA(H ξ ) for some ξ ∈ R, so that lim n→∞ nF (a n x + b n ) = − ln H ξ (x) holds for normalising sequences a n > 0 and b n whereF (x) = 1 − F (x). It follows that H ξ is a generalised extreme value (GEV) distribution with distribution function where (·) + := max (0, ·), and µ, ξ, σ > 0 denote location, shape and scale parameters respectively. The number of exceedances above a high threshold u can be expressed by K := #{i ∈ {1, . . . , n} : (X i − b n )/a n > u}. Since the X i are iid, K is binomially distributed with parameters (n, p), where p is the probability of crossing u, i.e. p = P ((X i − b i )/a n ) > u) =F (a n u + b n ) = − 1 n ln H ξ (u). Thus, as n → ∞, K converges to a Poisson random variable with mean λ(u) = − ln H ξ (u). However, not only the total number of extreme events is asymptotically Poisson, but also their occurrences can be described by a Poisson point process. We can define the point process of event times as where Y i,n = i n 1 {Xi>anu+bn} for 1 ≤ i ≤ n is either the time of an event re-scaled to X 1 = (0, 1] or zero, and A ⊂ X 1 . Then N (t) converges in distribution to a homogeneous Poisson point process on the space X 1 with intensity measure Λ(A) = (t 2 − t 1 )λ(u) for A = (t 1 , t 2 ) ⊂ X 1 , where the constant intensity is given by λ(u) = λ = − ln H ξ,µ,σ (u) and the scaling sequences are absorbed into the parameters of H [see e.g. 30].
One way to describe the POT model is by means of a non-homogeneous two-dimensional Poisson point process, for which a two-dimensional pattern of points (t, x) is recorded representing the times and magnitudes of exceedances of a high threshold u [81,68]. On a space X 2 = (0, 1] × (u, ∞), we assume for A ⊂ X 2 that the point process This intensity function does not depend on t, but on x, so that the two-dimensional Poisson process N (t,x) is non-homogeneous and we define λ(x) := λ(t, x). Considering A = (t 1 , t 2 ) × (z, ∞) ⊂ X 2 , the intensity measure of the process is given by It follows that for any z ≥ u the implied one-dimensional process of occurrences of exceedances with level z forms a homogeneous Poisson process with intensity τ (z) = − ln H ξ,µ,σ (z), which is similar to the previous result. From this representation, we can also retrieve the GPD as the limiting distribution of the mark sizes: the limiting conditional probability that (X i − b n )/a n > u + v given (X i − b n )/a n > u, i.e. the tail of the excess distribution function over the threshold u, is obtained from the ratio of the rates of exceeding the levels u + v and u: for a positive scaling parameter β = σ + ξ(u − µ).Ḡ ξ,β exactly corresponds to the tail of the GPD model for excesses of a high threshold u. Hence, this representation of the POT model incorporates a homogeneous Poisson process when only considering the re-scaled event times j/n and iid marks Y j = X j − u, following the GPD with distribution function so that we can regard this POT model as an MPP.
To fit the POT model, it is sufficient to fit the Poisson point process with the intensity function (4) for A = X 2 . The likelihood for N (t,x) (A) := N * exceedances X j can be formulated as Alternatively, the POT model can be reparameterised using τ := τ (u) = − ln H ξ,µ,σ (u) for the intensity of the one-dimensional Poisson process of event times, and β = σ + ξ(u − µ) for the GPD scaling parameter that is implied for the mark sizes. In that case, the intensity function in (4) can be reformulated using the mark sizes Y j = X j − u as with ξ ∈ R and τ, β > 0. The logarithm of the likelihood in (8) can then be shown to become ln L (ξ, σ, µ; X 1 , . . . , X N * ) = ln L 1 (ξ, β; X 1 , . . . , X N * ) + ln L 2 (τ ; N * ) where L 1 corresponds to the likelihood of fitting a GPD to the mark sizes and L 2 to the likelihood of a one-dimensional homogeneous Poisson process with intensity τ . Thus we can make inferences about the parameters for each partition of the log-likelihood separately.
In the POT framework, a central question is how to choose the threshold u. Coles [20] suggested a graphical diagnostic based on the criterion of stability domains. The method aims at identifying regions for the shape parameter ξ and the modified scale parameter σ = β − ξu, in which the estimates remain constant for an increasing u. Mazas and Hamm [66] propose to select the lowest threshold of the highest domain of stability in order to balance the trade off between choosing a high threshold to minimise the bias and choosing a lower threshold to retain enough data for minimising the variance of the estimates.
This classical POT approach relies on the assumption that the underlying series is independent. Leadbetter [59] extended this theory to stationary sequences satisfying certain weak dependency conditions. More specifically, stationary sequences can be shown to behave like iid series if they satisfy (i) a mixing condition to restrict the long-range dependence, and (ii) a condition restricting the local clustering of extremes. When the local condition is violated due to the short-range dependence structure of the underlying sequence, the extremal behaviour can still be accommodated in the asymptotic theory using the extremal index, but the exact clustering behaviour cannot be fully described providing the justification for explicit modelling of the short-range dependence structure. For detailed discussions on EVT for stationary sequences and the respective conditions for point processes, we refer to Leadbetter et al. [60] and Beirlant et al. [4].
There is no general asymptotic theory for the extremal behaviour of long-range dependent times series. Although the mixing condition for stationary sequences is meant to limit the effect of longterm correlations, certain processes exhibiting long-range dependence still satisfy it. For example, the autocovariances of Gaussian linear processes are known to fulfil Berman's condition γ(h) ln h → 0 for h → ∞, which is sufficient to establish that Gaussian sequences satisfy mixing and local conditions. This also includes the long-range dependent Gaussian ARFIMA model [see e.g. 30]. Linear processes with heavy-tailed and possibly long-range dependent innovations however exhibit local clustering of extremes measured by the extremal index [5,Theorem 4.47]. In practical applications, the observed effects of a long-range dependence structure in the underlying series suggest deviations from the asymptotic iid limits and pronounced clustering (cf. Section 1). Thus independent from the challenge of developing a more general asymptotic theory for the extremes of long-range dependent series, the presence of long-range dependence has to be accommodated for when modelling the dependency structure of extremes in finite datasets.

Modelling approaches for dependent data
This section reviews existing POT models that attempt to explicitly model short-or long-range dependencies to address the clustering of high threshold exceedances. We then suggest a two-step method to incorporate both types of dependencies jointly. This is motivated by observations from the considered wave height dataset, which appear to be governed by both local dependency structures and long-range correlations as illustrated in Section 1.

Models capturing short-range dependence
We begin by presenting selected well-known conditional POT approaches that explicitly model the shortrange behaviour affecting extremal clustering. Chavez-Demoulin et al. [18] and McNeil et al. [68] suggest conditional POT models to capture short-range dynamics. The conditional approach to POT modelling constructs an MPP for the high threshold exceedances and introduces dependence structures by utilising a self-exciting process for the intensity based on past event times and marks. This self-exciting process is also referred to as a Hawkes process [46] and is commonly used in the context of seismology [72]. Following the more general notation from Herrera and Schipp [52], we consider an event process (T j , Y j ) j∈N defined on the space (0, T ] × (u, ∞) with event times T j and mark sizes Y j = X j − u. Thus the set of observed events (t j , y j ) j≥1 only records the times and marks of extreme events. Further, let the internal history of the process be given by H t := {(t j , y i ) : j = 1, . . . , N (t) − 1} where N (t) := j≥1 1 {(tj ≤t,yj )} defines an MPP. Following Daley and Vere-Jones [21], the conditional intensity of this MPP can be factorised into the ground intensity λ g (t), which describes the event times, and the conditional probability density function of the marks g (y|H t , t) if we assume conditional independence of the times and marks given the history of the process up to time t. The factorised conditional intensity function is then given by Similar to the classical POT approach, the contributions of the mark sizes are assumed to follow a GPD, but the scale parameter β(y|H t , t) is allowed to depend on the process history. The conditional probability density function of the GPD then becomes It follows from the factorised conditional intensity function that the log likelihood of the MPP can be expressed as log L = log L 1 + log L 2 with and In the case of iid data, the intensity of the ground process λ g (t, H t ) would be constant and the mark distribution would follow an unconditional GPD, so that we retrieve the likelihood of the two-dimensional non-homogeneous Poisson process in equation (10). A wide range of functional forms has been proposed for the conditional intensity function of the ground process, whereof Chavez-Demoulin et al. [18] choose a self-exciting process of the shape where µ 0 is a positive constant loading, µ1 is a positive constant, and ω(v) is non-negative for v > 0 and zero otherwise. Such a conditional intensity function triggers clustering of the extreme events, since each previous event contributes to the conditional intensity and the amount may depend on both the time (t − t j ) elapsed since the last event and the mark size y j . Thus, the conditional intensity itself is a stochastic process depending on ω through H t . For ω, Chavez-Demoulin et al. [18] specify a monotonically decreasing ETAS-type functional form with parameter restrictions δ, γ, ρ > 0. An alternative parametric specification of ω was suggested by McNeil et al. [68,Sec. 7.4.3] and picked up by Chavez-Demoulin and McGill [17] consisting of a Hawkes process with exponential decay: for δ, γ > 0. For the scale parameter of the conditional GPD, Chavez-Demoulin et al. [18] select a form that only depends on the previous mark size: where a, b ∈ Z. This specification is referred to as Hawkes-POT. An alternative approach is the SEPOT model suggested by McNeil et al. in the first edition of [68] and reviewed by Herrera and Schipp [51]. The SEPOT model uses an ETAS or exponential decay parameterisation also for the scale parameter of the conditional GPD: where β 0 , β 1 > 0, and ω(t − t j ; y j ) as in equation (16) or (17). In this case, the ground intensity and the conditional probability density function of the marks are driven by the same parameters δ, γ, and ρ (where necessary).

Models capturing long-range dependence
Next, we turn to a modelling approach that attempts to explain extremal clustering by accounting for long-range dependence. Hees et al. [49] take a different perspective on extreme events than the conditional POT approaches presented so far. They focus on modelling the inter-event times using the heavy-tailed Mittag-Leffler distribution as opposed to the exponential distribution that arises naturally from Poisson distributed event times. In this heavy-tailed inter-event time scenario, the event times form a fractional Poisson process [57]. This process represents a renewal process with Mittag-Leffler distributed waiting times and is a generalisation of the Poisson process. Further, Biard and Saussereau [8] demonstrate that the fractional Poisson process exhibits the long-range dependence property, so that this process is well suited to a setting in which the event times are long-term correlated.
Hees et al. [49] consider a continuous time random exceedance (CTRE) model for the event times and the mark sizes. Assume that the kth observation X k for k ∈ {1, . . . , n} occurs at time T k = W 1 +· · ·+W k , where the W i > 0 are the waiting times between observations X i and X i−1 for 1 ≤ i ≤ k. Through defining stopping times for a marked renewal process (MRP), Hees et al. [49] describe the CTRE model with the pair sequence (T j , Y j ), j ∈ {1, . . . , N * }, for N * exceedances above a given threshold u. The MRP and therefore also the CTRE are called uncoupled if the waiting times W and the observations X are independent, which implies that the sequences (Y j ) and (T j ) are independent as well. The CTRE process is referred to as bursty if the tail functionF W (t) = P (W > t) is regularly varying at infinity with index β ∈ (0, 1), that is lim Hees et al. [49] strictly separate the term burst from the term cluster. The latter refers to a stationary underlying process where successive events are connected, whereas bursts refer to accumulations of events caused by a heavy-tailed waiting time distribution. For matters of simplicity and consistence, we stick to the term cluster whenever several events occur consecutively.
Assuming an uncoupled CTRE, Hees et al. [49] suppose that the sequence of observations (X i ) for i ∈ {1, . . . , n} has a distribution belonging to the maximum domain of attraction of some non-degenerate limiting distribution. Further, the waiting times W i are assumed to be in the domain of attraction of a positively skewed sum-stable law with stability parameter 0 < β < 1. For u → ∞, Hees et al. [49] then demonstrate that the event times T converge weakly to a Mittag-Leffler distributed random variable with tail parameter β and scale parameter σ = b( 1 /pu), where p u := P (X > u) and b(k) is a regular varying function at infinity with parameter 1 /β. For the distribution function of the waiting times we therefore have where E β (z) = ∞ k=0 z k Γ(1+βk) defines the Mittag-Leffler function with z ∈ R, 0 < β ≤ 1 and Γ(·) denotes the Gamma function [8]. Note that the exponential distribution is nested in the Mittag-Leffler distribution with β = 1.
Similar to the methods suggested for the threshold choice in the classical POT framework, Hees et al. [49] propose to choose the threshold u based on stability plots for the parameters β and σ of the Mittag-Leffler distribution. The estimation of the parameters requires a dataset that only records the event times and mark sizes, i.e. (t j , y j ) j where y j = x j − u given x j ≥ u. Then the parameters can be estimated via fractional log-moment estimation, or the less computationally efficient but more reliable maximum likelihood estimation. For model inference, Hees et al. [49] provide an algorithm that also encompasses choosing a suitable threshold. Let k denote the number of events. For a range of values for k, the Mittag-Leffler distribution is fitted to the corresponding inter-event times. For the stability plots, the estimates for the tail parameterβ k and a rescaled version of the scale parameterσ 0 = k 1 /βσ k are plotted against the considered range of values for k. Trading-off bias and variance, representative parameter values in the respective regions of parameter stability are selected for bothβ k andσ 0 .
Finally, Hees et al. [49] remark that this modelling approach can easily be extended to a method in the line of POT modelling by fitting a GPD to the mark sizes.

Two-step ARFIMA-POT approach capturing both short-and long-range dependence
After having presented models explicitly dealing with either short or long-range correlations, we now propose a method combining both types of dependencies. We suggest a two-step approach in which short-and long-range dependencies are captured with an ARFIMA model, and a classical POT model is fitted to the residuals of the first step. The motivation for this approach was outlined in Section 1, where the considered dataset of significant wave heights appeared to be governed by both short-and long-term dynamics.
Formally, we assume that the dynamics of the underlying time series X t can be summarised by a linear process dependent on innovations Z t that follow a white noise process with marginal distribution function F Z (z). We make no further assumptions about F Z (z) except for stationarity, but seek to model its tail behaviour using the classical POT approach. Thus, we capture the short-and long-term correlations by only assuming that X t follows an ARFIMA process. The proposed two-step method can be summarised as follows.
1. Fit an ARFIMA model to the original dataset making no assumptions about the distribution of the innovations F Z (z). Calculate the implied model residuals.
2. Assume the residuals to be realisations of a strict white noise process and use the classical POT model to describe the tail of the residual distributionF Z (z).
In stage 1, we first want to give a model for the dynamics of the underlying series X t . The ARFIMA model due to Granger and Joyeux [40] and Hosking [53] combines the short-ranged AR and MA dependence with additional long-term correlation. Assuming that E(X t ) = 0, we can define a stationary ARFIMA(p, d, q) model for the dynamics of X t as where B denotes the backshift operator BX t = X t−1 , d ∈ (0, 0.5) the fractional integration parameter, represent the AR and MA polynomials of order p and q respectively. This ARFIMA(p, d, q) process is causal and invertible if the respective roots of φ(z) and θ(z) lie outside the unit circle. Given causality, there is a unique non-deterministic stationary solution for the difference equation in (21) given by which forms a linear process and corresponds to an infinite moving average representation [11,Section 13.2]. For d ∈ (0, 0.5), the process exhibits long-range dependence, while we recover the short-range dependent ARMA model for d = 0. Inference for the parameters of the ARFIMA(p, d, q) can be done via Whittle's approximation to the maximum likelihood estimator [see e.g. 11], which was first proposed by [88] for short-range dependent models. For both Gaussian [35] and non-Gaussian series [39], this approximation has been proved to provide asymptotically consistent and normally distributed estimators. For a dataset of size n, the Whittle estimator is essentially based on minimising where I(λ j ) denotes the periodogram of the sample at the Fourier frequencies λ j = 2πj n , f (λ j , Ψ) the spectral density of the model with parameters Ψ = (φ 1 , . . . , φ p , d, θ 1 , . . . θ q ), and m the integer part of n−1 2 . The minimum Q(Ψ) gives the estimate of the model parametersΨ. Further details on the Whittle estimator can be found in [5] and [11]. Suitable model orders p and q can be chosen based on the Bayesian information criterion (BIC) [80] or Akaike information criterion (AIC) [1].
From the fitted model, we can calculate the implied model residuals aŝ where the estimated AR and MA polynomialsΦ(B) andΘ(B) are based on the estimated parameters (φ 1 , . . . ,φ p ) and (θ 1 , . . .θ q ) respectively, andd is the estimated fractional difference parameter. Given a perfect model fit, the sequenceẐ t is supposed to be a strict white noise process.
In stage 2, we now want to apply the classical POT model to the residualsẐ t . For the application to significant wave heights, we are only interested in the right tail of the residual distribution, so the extreme positive heights are of interest. Given a dataset of the form (t,Ẑ t ), we can fit the non-homogeneous twodimensional Poisson point process as discussed in Section 2 with intensity function (9) and likelihood (10). Thus, the tail of the residual distributionF Z , i.e. the mark sizes, is modelled by a GPD as given in equation (7). The threshold u can again be chosen according to parameter stability plots. The model residuals that are identified as extreme in this stage can be traced back to those observations which are very unlikely to occur under the ARFIMA model assumption. Our particular interest lies in characterising these largest model deviations, since they cannot be forecast using the parametric model describing the bulk of the data. In the case of significant wave heights, the extreme model residuals can be translated to the most extreme wave heights that cannot be predicted based on the average sea wave process, yet are most likely to contribute to flooding and beach erosion.
The two-step ARFIMA-POT approach is closely related to other two-step approaches incorporating long-term correlations that are popular in financial literature. In that context, however, the focus is on modelling the heteroscedasticity that is known to be a stylised fact of return series. Therefore, Youssef et al. [89] and Zhao et al. [90] suggest various types of fractionally integrated GARCH-EVT models, which are long-range dependence augmented versions of the original GARCH-EVT model by McNeil and Frey [67]. He et al. [47] additionally model the long-range correlated conditional mean of the return series by an ARFIMA model, resulting in ARFIMA-GARCH-EVT type models. In hydrology, a related approach only focusing on short-range dependence has been applied to river discharges by Elek and Márkus [29]. They use an ARMA-GARCH-EVT variant in which the conditional variance is rather dependent on the past values of the series than the lagged innovations. In our ARFIMA-POT approach we deliberately exclude the modelling of potential heteroscedasticity to be able to solely focus on finding an explicit representation for the short-and long-range dependence of the underlying series. We concentrate on oceanographic data which has of course fundamentally different characteristics than financial data [see e.g. 31]. So assuming that the time series is described precisely enough by an ARFIMA model, we would not expect to encounter heteroscedastic variances for the residuals. Thus, our approach is closer to the concept of prewhitening than the modelling of heteroscedasticity.
Clearly, a stationary Gaussian ARFIMA model is covered by classical EVT theory using Berman's condition and thus is supposed to exhibit the same behaviour as Gaussian iid sequences in the limit [30,Section 4.4]. So if the underlying time series can be described by such a stationary Gaussian linear process, the classical POT approach is applicable. However, there is no guarantee that the assumption of Gaussian innovations is justified. Further, assuming the presence of long-range dependence in the underlying time series, the convergence to the asymptotic limit can be slower [27], which calls for larger datasets. Considering oceanographic data, however, longer time series can actually complicate explicit modelling attempts, since the focus shifts from the physics of the sea wave process to the impact of the climatology of a given location [44]: the longer the sequence, the more deterministic elements such as long-term trends due to climate change [83] and further cyclical or seasonal behaviour [43] have to be modelled additionally to the dependencies already present in the shorter datasets. Thus, from a practitioner's point of view, it would be desirable to have an approach that is already applicable to shorter observation horizons, in which it is reasonable to assume that the average boundary conditions affecting the average wave height (e.g. average wind, temperatures and atmospheric pressure) remain stationary [83]. Our two-step ARFIMA-POT approach is, therefore, a practical attempt to solve the challenge of finding a POT formulation that simultaneously models short-and long-range dependencies in significant wave heights for shorter observation horizons.

Application to significant wave heights
In the following, we will now present the results of applying the aforementioned POT approaches for dependent data on significant wave heights. We consider a dataset of significant wave height measured at the WaveNet buoy (22 m water depth and from the Centre for Environment, Fisheries and Aquaculture Science: http://wavenet.cefas.co.uk) on the Sefton coast in Liverpool Bay, UK. This dataset has already been extensively studied by Dissanayake et al. [25,26] among others. Analogous to Dissanayake et al. [26], we restrict our analysis to a subsample encompassing the observations from 10 September to 09 December 2013, which were recorded in 30-minute intervals. This subsample contains 4,367 observations ranging from 0 to 4.55 m, amongst which are 268 missing values. The missing values were dealt with by replacing the small gaps of up to 4 observations by linear interpolation. The initial threshold for the conditional POT variants was chosen following the detailed discussion of Dissanayake et al. [25], who selected a threshold of u = 2.5 m or equivalently 7.9% of the observations yielding a total of 324 events. This threshold choice is in line with EVT applications to financial data, where 8%, or more generally a threshold between 5% and 10%, is commonly chosen [cf. 17,52]. Further, additional analyses using the thresholds 5% (u = 2.82 m) and 10% (u = 2.23 m) were conducted to examine the robustness of our findings, but did not show any different results and are thus omitted for brevity.

Results: Conditional POT models
We begin by presenting the inference results for the Hawkes-POT and SEPOT models as reviewed in Section 3.1 with both ETAS (16) and exponential (17) decay functions, which aim at accommodating extremal clustering by explicitly modelling short-range dependencies. We further compare how the results for significant wave heights differ from the ones for financial data given in [18,17,52]. Table 1 summarises the estimation results and log-likelihoods for the four considered model variants given the threshold of 7.9%. For the interpretation of these inference results, we first focus on the parameter estimates influencing the conditional intensity of the MPPs through the conditional intensity of the ground process λ g governing the evolution of event times. Compared to the estimation results from [18] for negative daily returns of Bayer shares and the Djind index, we find a considerably lower base ratê µ 0 , which underlines the distinct extremal structure in the significant wave heights time series, where event clusters are followed by long periods without any events (cf. Figure 1). We estimate a considerably higher factorμ 1 further stressing this particular behaviour, since the most recent event, depending on its mark size, leads to a sudden, large increase in the conditional intensity of the ground process. For the models with ETAS decay functions, we find the estimatesρ to be significantly different from zero as opposed to the findings of [18] for financial data. This indicates a larger influence of the time passed since a previous event on the conditional intensity for significant wave heights: the more time has passed, the smaller is the value of ω and, therefore, the increasing impact on the λ g .
Turning to the parameter estimates impacting the conditional intensity of the MPPs through the conditional probability density function of the mark sizes g, we first consider the two Hawkes-POT variants. Here, we observe a smaller estimate forâ, indicating a lower value forσ in general, and an estimate forb that is significantly different from zero, stressing the strong influence of the previous on the following mark size. For ξ, we find negative estimates implying a finite tail of the distribution of significant wave heights and more probability mass concentrated on smaller mark sizes compared to the positive estimate for ξ that is commonly found in financial data [see e.g. 18,52] . For the two SEPOT variants, we retrieve smaller estimates forβ 0 and larger estimates forβ 1 compared to the results in Herrera and Schipp [52] for the return series of various stock indices. This again shows a stronger dependence of the GPD scale parameter on past event times and mark sizes. Similar to the Hawkes-POT model variants, the estimate forξ is found to be negative with the same implications as before. Overall, the estimation results for the significant wave height dataset differ significantly from the results reported in literature for financial data, stressing that the characteristics of this oceanographic time series are fundamentally different. Comparing the results for the four considered model variants, all specifications fit the significant wave height data similarly well.
To evaluate the fit of the considered conditional POT models to the significant wave heights, we separately consider diagnostics for event time and mark modelling. For the evaluation of the fit to the event times, we look at two visual diagnostic tools: a plot of the estimated conditional intensityλ and a plot of the cumulative intensity Λ H (t j ) = tj 0 λ H (u) du against the time transformed based on the number of events. Figure 3 shows these diagnostics for the Hawkes-POT model with ETAS decay for the 7.9% threshold, whereas the very similar looking plots for the exponential decay variant and both SEPOT models can be found in the appendix in Figure 19. The plotted estimated conditional intensities  for the Hawkes-POT on the left of Figure 3 strongly differ from the results found for financial data by e.g. [18,17]. The occurrence of an event causes the conditional intensity to instantly increase by an amount dependent on the mark size, which is followed by a rather sudden decline towards the estimated base ratê µ after the last event of the cluster, likely due to the large estimated weight on the exponential dependence on past values. This is of course not realistic, but illustrates that the time span between observed events is so long that the dependencies captured by the conditional intensity fully disappear between clusters.
Given the very small estimated ground intensities, it is actually very unlikely to see another event during the observation period based on these models, which contradicts the intuition that, the further away from the last event, the more likely it should be to see another cluster of events. The spiky appearance of the estimated conditional intensity results in a cumulative intensity resembling rather a step function than a line close to the bisecting one and therefore not lying within the 95% confidence bands of the diagonal. From these visual diagnostics, we can conclude that of the conditional POT model variants are appropriate to model the event times in the significant wave height dataset. Their conditional intensities do not seem to sufficiently capture the dependencies in the data to explain the clustering behaviour. To evaluate the model fit for the event marks, we consider unit exponential QQ plots and plots of the ACF for the GPD residuals R j = 1 ξ ln(1 +ξ yĵ β ). Figure 4 depicts these visual diagnostics for the Hawkes-POT model with ETAS decay function. Similar results for the exponential decay and both SEPOT models can be found in the appendix (Figure 20). The QQ plots reveal that in none of the considered setups the residuals are well-described by a unit exponential distribution as would be expected from a perfect fit. Although the Hawkes-POT model variants seem to give a more reasonable picture than the SEPOT model variants, the QQ plots confirm that there are still dependencies present that are not accounted for in the modelling of the mark sizes. This is validated by the ACF of the GPD residuals, which clearly depicts remaining autocorrelations for the first five lags indicating that the approach neglects dependencies. The squared residuals in contrast exhibit no systematic autocorrelations. From the residual diagnostics we can conclude that neither the considered Hawkes-POT nor SEPOT models can satisfactorily characterise the mark sizes in the significant wave height data. We may suspect that the long-range correlation found in the underlying series is responsible for depedencies that cannot be captured by the considered Hawkes-POT and SEPOT models. For this reason, we now turn to the next model that incorporates long-range dependence.

Results: Mittag-Leffler distribution
Following Hees et al. [49], we fit the Mittag-Leffler distribution to the inter-event times using the R packages MittagLeffleR [37] and CTRE [48]. We begin by checking the necessary assumptions. Firstly, the inter-event times and event marks are not supposed to exhibit any (inter-)dependencies. The ACF of the logarithmic inter-event times in Figure 5 reveals small autocorrelations for low lags, whereas the ACF of the event marks in Figure 5 shows strong autocorrelation for many lags. Considering the crosscorrelation function (CCF) of inter-event times and event marks, we observe that the series are also cross-correlated, so that we clearly have to reject the independence assumption. The coupling of interevent times and event marks is further confirmed by the empirical copula presented on the left of Figure 6. If both were uncoupled, the points would be evenly distributed over the [0, 1] × [0, 1] space. What we find instead is that the empirical copula is dominated by strong extremal clustering. The particular shape can be explained by events being mostly separated by a single time unit, so that there is an accumulation of points for this inter-event time combined with all possible values of event marks. After a long inter-event time, i.e. a long period without an event, we rather record smaller event marks, indicating that event clusters usually start with smaller marks which then magnify for the successive events.
According to Hees et al. [49], the suitability of the Mittag-Leffler distribution can be verified using a logarithmic Mittag-Leffler QQ plot for the logarithmic inter-event times, displayed on the right of Figure 6. The tail parameter is determined via log-moment estimation yieldingβ = 1.1055 with a 95% confidence interval of [1.0586; 1.1524], which again strongly contradicts the model assumptions, while the scale is set to 1 for this diagnostic. It is obvious that the Mittag-Leffler distribution does not fit the inter-event times of the significant wave heights. Again, it appears that we have too much probability mass concentrated on lower, or more precisely, unit inter-event times, indicating that the present extremal clustering is too strong to be captured with this approach. Representing the long gaps between clusters, we further find a few inter-event times that are too large to appear under a Mittag-Leffler distribution. Given these results, we can conclude that the Mittag-Leffler distribution does not provide a better fit to the inter-event times of the significant wave heights than the exponential distribution arising in the limit of iid data. Even though this approach incorporates long-range dependence, it does only account for it in the event times and disregards any local, short-range dependencies affecting the distribution of event and inter-event times.
Although the assumptions are obviously violated, we include the parameter estimates of the Mittag-Leffler distribution for the significant wave height data for completeness. The stability plots for the tail and re-scaled scale are presented in Figure 7 and do not give clear stable regions for the parameter estimates. For the tail parameter, the only rather stable region we can identify is around the estimateβ = 1.12, which is consistent with the previous estimation and again clearly violates the model assumptions. Additionally, the scale parameter is dependent on the number of events, indicating that the model is not suitable for the significant wave height dataset.   So far, neither the applied conditional POT methods nor the fitted Mittag-Leffler distribution satisfactorily explain the extremal clustering in the significant wave height data. Either the models fail to account for the long-range dependence in the mark sizes inherited from the dependence structure of the underlying observations, or the short-range dependencies between the event times are disregarded. Therefore, we continue with applying the two-step ARFIMA-POT approach which aims at capturing the full dependence structure of the underlying observations, in order to avoid the need for explicit modelling of inherited properties in the event times and mark sizes all together.

Results: Two-step ARFIMA-POT approach
In the first stage of the two-step ARFIMA-POT approach, we seek to identify and fit the ARFIMA(p, d, q) model that is most suitable for describing the significant wave heights. Using the R package arfima [84], we systematically estimate the model parameters for various combinations of p and q orders and find the ARFIMA model with p = 4 and q = 4 to minimise both AIC and BIC. Table 2 reports the parameter estimates for the selected ARFIMA model. The corresponding fractional difference parameter is estimated as 0.1398 indicating the presence of mild long-range dependence. Comparing the theoretical ACF of the chosen ARFIMA model to the empirical ACF of the dataset in Figure 8, we can conclude that the selected model specification satisfactorily captures the autocorrelation structure of the significant wave heights. Further, Figure 9 provides visual diagnostics for the ARFIMA model residuals in the form of plotted ACFs of residuals and squared residuals as well the periodogram of the residuals. These diagnostic suggest that, apart from some variance clustering, there are no dependencies left in the residuals. Since we deliberately refrained from modelling the conditional variance of the significant wave heights in order to solely find an explicit representation of the dependence structure of the underlying series, the residual diagnostics confirm suitability of the classical POT model for the ARFIMA model residuals.
In the second stage of the ARFIMA-POT approach, we apply the classical POT method to the ARFIMA residuals using the R packages POT [78] and extRemes [38]. The threshold is chosen from the mean residual life plot and GPD parameter stability plots given in Figure 10. The mean excess plot shows a course linear in u from 0 to 0.12 and is rather constant between 0.12 and 0.28. Choosing a higher threshold is not sensible due to the erratic course of the mean excess and too few remaining observations. The GPD parameter stability plots suggest a long stable region for both GPD parameter estimates following a threshold of u = 0.12 that corresponds to 7.3% of data and matches the threshold from the mean residual life plot. Thus, we base the second step of the ARFIMA-POT approach on a threshold corresponding to 7.3% of data.
Based on the ARFIMA residuals, Figure 11 displays the exceedances of the selected threshold. In line with the autocorrelations found in the ACF of the squared ARFIMA residuals (cf. Figure 9), the extreme events still appear to be slightly clustered, but the long gaps without clusters that governed the original significant wave height series are no longer as distinct. The QQ plot of the inter-event times on the right of Figure 11 still depicts the tails of the inter-event times as lying outside the confidence bands of the exponential distribution, but this is much closer to the exponential distribution than what was observable for the original time series (cf. Figure 1). We can note that the two-step ARFIMA-POT approach improves the predictability of event times over the original series, but the dynamics of the significant wave height series affecting the distribution of event times are not captured in their entirety.
We fit a GPD to the ARFIMA residuals via maximum likelihood estimation and obtain the estimateŝ β = 0.0944 (0.0074) andξ = −0.0189 (0.0554) for scale and shape parameters, respectively. Similar to the results for the conditional POT approaches, the estimated shape parameter is negative, indicating a finite tail of the distribution of ARFIMA residuals and a concentration of probability mass on the smaller mark sizes. Based on the estimated GPD residuals, Figure 12 provides visual diagnostics. The    QQ plot illustrates that the GPD residuals are following a unit exponential distribution very closely. Compared with the results for the Hawkes-POT in Figure 4, we find a clearly improved model fit. Considering the ACF of the GPD residuals and squared residuals, we observe that, apart from lag 2, there is no autocorrelation left, indicating that the two-step ARFIMA-POT approach satisfactorily models the dependencies in the underlying significant wave height series which may affect the distribution of the mark sizes.
In summary, the two-step ARFIMA-POT approach improves upon the considered conditional POT approaches in terms of model fit for the mark sizes, and upon the fit of the Mittag-Leffler distribution for the inter-event times. Clearly, the two-step method is an imperfect solution to modelling significant wave heights, but we show that incorporating both short-and long-range dependencies in a parametric approach can be a more successful pathway than focusing on short-or long-term dynamics alone. To review these conclusions, we systematically break down the effects of short-and long-range dependence on the model fits in the following section.

Simulation study
In Section 4, we argued that the combination of short-and long-range dependence is pivotal to the unsatisfactory fits of the conditional POT models and the Mittag-Leffler distribution to the significant wave height data. In order to verify these results and the appropriateness of the two-step ARFIMA-POT approach, we conduct a simulation study.
The two-step ARFIMA-POT approach is based on the assumption that dependencies in the underlying time series are well-characterised by an ARFIMA model. Therefore, we consider a base setting in which we simulate data from an ARFIMA model with a specification that is similar to the one estimated for the significant wave height dataset in Section 4.3, i.e. we simulate from an ARFIMA(4, 0.1398, 4) with parameters given in Table 2 (setting A). Further, we consider two additional settings in which either the ARMA orders p and q are set to zero (setting B) or the fractional difference parameter d (setting C). Thus, the settings B and C correspond to data in which the short-or long-range correlations are eliminated respectively. Analogous to the size of the significant wave height dataset, we simulate 4, 367 data points for each setting and then apply all considered methods to the resulting datasets to evaluate the model fits.

Simulation results: Conditional POT models
We begin by examining the estimation results for the conditional POT models based on a threshold of 7.9% in the considered settings A, B, and C, which are presented in Table 3. The parameter estimates in the settings A and B are mostly found to be in a similar range as reported for the significant wave height dataset in Table 1. For setting C, however, we observe quite different results, i.e. for the setting that only includes long-range dependence: for both Hawkes-and SEPOT model variants, the base rate of the conditional intensity of the ground processμ 0 has increased approximately tenfold, which already suggests a more homogeneous behaviour of the conditional intensity. The estimates for the factorμ 1 are generally smaller implying less pronounced reactions on previous events, but this effect is potentially balanced out for the mark sizes by the mostly larger estimatesδ. With respect to the ETAS decay model variants, we find a smaller estimateρ in setting C, resulting in a higher impact of distant events. A similar effect is observed for the exponential decay models through a smaller estimateγ.
For the Hawkes-POT variants, we find estimates forb which are noticeably smaller, indicating that the scale parameter of the conditional GPD is less dependent on the previous mark size. This observation is even more apparent for the parameterβ 1 in case of the SEPOT model variants where the estimates are close to zero, indicating a rather constant GPD scale parameter. Thus, we can conclude that the strongly time-varying nature of the scale parameter of the conditional GPD in settings A and B is rather the result of short-term, ARMA-type correlations in the data. Considering the estimation results, the dataset in setting C appears to have properties that can be handled more realistically by the conditional POT approaches than the settings A and B.
To verify this presumption, we provide various visual diagnostics. Exemplarily, we present the diagnostics for the Hawkes-POT with ETAS decay function, as the results for the other conditional POT variants were found to be very similar. Figure 13 displays the estimated conditional intensities and the cumulative conditional intensities for all three simulation settings. The results for setting A show a similar pattern as was observed for the significant wave heights in Figure 3, indicating that the conditional POT models are not well-suited for ARFIMA-type time series. Setting B improves upon the results for setting A, presumably because the events appear to exhibit less pronounced extremal clustering, as can be seen from slightly more equally distributed event ticks. The simulated fractional white noise in setting C, i.e. the ARFIMA(0, 0.1398, 0), follows completely different dynamics with only slight extremal clustering. Fractional integration only seems to intensify the particular clustering behaviour when shortrange correlations are already present, as visible when comparing settings A and B, while it has little effect when added to a white noise sequence using a fractional difference parameter d that is this low. In consequence, the conditional POT model seems to be an appropriate fit for the data of setting C, leading to a well-shaped estimated conditional intensity and a cumulative intensity that evenly spreads across events.
The residual diagnostics in Figure 14, consisting of ACFs for GPD residuals and squared residuals as well as unit exponential QQ plots for the three simulation settings, confirm these findings. In the GPD residuals of the Hawkes-POT in setting A there is still a large amount of autocorrelation left, which is similar, albeit more pronounced than what was observed for the significant wave height dataset in Figure  4. This also leads to the GPD residuals being far from unit exponentially distributed. The diagnostics improve when switching to setting B, so a situation with short-range dependent, ARMA-type data. In setting C, the conditional POT appears to capture all dependencies in the data, so that assuming a unit exponential distribution for the GPD residuals seems reasonable, which strengthens our previous findings.
In summary, the simulation study for the conditional POT models shows that the combination of short-and long-range dependence in the data leads to pronounced extremal clustering. We can conclude that the more extreme the clustering behaviour in the extreme events gets, the less appropriate this type Setting A Setting B Setting C t Figure 13: Estimated conditional intensity and cumulative intensity of the Hawkes-POT model (with ETAS decay) and for the simulated settings A, B and C. Ticks on the upper axis represent events. 95% confidence bands are based on the Kolmogorov-Smirnov statistic [23]. of model is. The long-range dependence itself does not appear to be causal for the unsatisfactory model fit, but rather the impact that the long-range correlation has on the extremes in data that is already governed by short-term dependencies. Therefore, we can note that the combination of short-and longrange dynamics leads to a deterioration in the model fit for conditional POT approaches, confirming the presumptions from Section 4.1.

Simulation results: Mittag-Leffler distribution
Next, we consider fitting the Mittag-Leffler distribution to the inter-event times for the simulated settings A, B and C. Figure 15 provides the ACFs of logarithmic inter-event times and event marks as well as their CCF for all simulation settings. For setting C, we find no systematic correlations, implying that the Mittag-Leffler distribution may be suitable to a pure long-range dependence setting. However, it is obvious from setting B that the autocorrelation in the event marks and the cross-correlation between event times and event marks are induced by the short-range, ARMA-type dependencies. Comparing the results from settings A and B, we can further state that when combining the ARMA-dependencies with a fractionally integrated component, the observed correlations appear even more pronounced. Similar conclusions can be drawn from the empirical copulas for the inter-event times and event marks, depicted in the appendix in Figure 21. Figure 16 presents the logarithmic Mittag-Leffler QQ plots of the logarithmic inter-event times for all settings. Again, it is apparent that in settings A and B the Mittag-Leffler distribution does not seem to be appropriate for the inter-event times. For setting C, we note a substantially improved fit, but still observe distinct deviations for lower quantiles. The parameter stability plots for the tail and re-scaled scale estimates of the Mittag-Leffler distribution are given in Figure 17 for all simulation settings. Clearly, the presence of short-ranged ARMA dependencies in settings A and B leads to an absence of areas of stability for tail and especially re-scaled scale estimates. In setting C, however, it is easy to identify stability regions, implying that the Mittag-Leffler distribution may be well-suited for a situation in which the underlying data is only governed by long-range dependence.
From these simulation results, we can conclude that modelling the inter-event times by means of a Mittag-Leffler distribution is not appropriate when the underlying time series not only exhibits longrange dependence, but also short-term correlations. This confirms the findings from the application of the Mittag-Leffler distribution to significant wave heights in Section 4, which posed a situation in which the dependence structure of the data can only be characterised by a combination of short-and long-range dynamics.

Simulation results: Two-step ARFIMA-POT approach
Lastly, we continue the simulation study by investigating whether the two-step ARFIMA-POT approach is suitable for ARFIMA-type dependencies in the time series, even when considering only either ARMA components or fractional integration. For this, we make use of the same settings A, B, and C as before and apply the same two-step ARFIMA-POT procedure as described in detail with the application to significant wave heights in Section 4.3: we first estimate the ARFIMA model with the AR and MA orders being selected according to the BIC, and calculate the ARFIMA residuals, to which we then apply the classical POT model. The corresponding estimation results for all settings are stated in Table 4 in Appendix A.2.
Similar to the results for the significant wave height data, the ARFIMA residuals do not exhibit any remaining autocorrelation regardless of the considered setting (see Figure 22 in Appendix A.2). The thresholds for the three settings were again determined using mean residual life plots and corresponding parameter stability plots, which can be found in Appendix A.2 in Figure 23 for reference. The selected thresholds are u A = 0.165 m (4.78%), u B = 0.135 m (7.70%), and u C = 0.1 m (14.36%) for the settings A, B, and C, respectively. For setting C, it is worth mentioning that the small fractional component induces a weaker dependence structure compared to settings A and B, resulting in the GPD being appropriate for a lower threshold and therefore larger tail of the residual distribution. For all three settings, the estimation results of the second stage of the ARFIMA-POT method, i.e. fitting of a GPD to the ARFIMA residuals, can be found in Table 5 in Appendix A.2. Figure 18 provides visual diagnostics for the GPD residuals in the form of unit exponential QQ plots and ACFs of residuals and squared residuals. The GPD residuals of the two-step ARFIMA-POT approach show a noticeably better fit than the GPD residuals of the conditional POT approaches in Figure 14. In all three settings, there is no systematic autocorrelation left, and the distribution of the GPD residuals comes close to a unit exponential distribution. Thus, the two-step ARFIMA-POT approach works equally well regardless of whether we only find short-term ARMA correlations, fractional integration or both types of dependencies in the underlying dataset.
This simulation study clearly highlighted that the two-step ARFIMA-POT approach is advantageous in situations in which it is unclear what type of dependencies are present in the data. As opposed to the conditional POT models and the Mittag-Leffler distribution, the approach still provides a useful characterisation of the extremal behaviour in cases in which both short-and long-range correlations govern the underlying time series, as found for the significant wave height dataset.

Setting A
Setting B Setting C Figure 18: Unit exponential QQ plot and auto-correlation functions for the GPD residuals of the ARFIMA-POT for the three simulated settings.

Conclusion
We considered a setting in which short-and long-range dependencies in the underlying time series of significant wave heights lead to pronounced clustering behaviour of high threshold exceedances. We reviewed commonly used conditional POT models which parametrically model the short-term dynamics underlying the extremal clustering, as well as the Mittag-Leffler distribution for inter-event times incorporating long-range correlations in the implied event times. For both approaches, we found that long-or short-term dependencies were neglected respectively, resulting in unsatisfactory model fits to the significant wave height data. In a simulation study, we confirmed that the combination of short-and long-range correlations is pivotal to this observation. To prevent event times and mark sizes from inheriting properties of the dependence structure of the underlying series that would have to be modelled explicitly, we suggested the two-step ARFIMA-POT approach. This two-step approach first fits an ARFIMA model to the significant wave heights capturing the short-and long-term correlations of the observations, and then applies a classical POT model to the residuals. We found that the two-step ARFIMA-POT approach provides significant improvements in terms of model fit for the significant wave heights. Additionally, a simulation study confirmed that the two-step ARFIMA-POT approach enables a sensible characterisation of extreme events in case of ARFIMA-type dependencies in the time series, even when considering only either ARMA components or fractional integration. Our results illustrate the need for new methods filling the gap at the intersection of modelling local clustering behaviour and long-range dependence. Clearly, our solution to this issue is a pragmatic one, but it provides a strategy that is easily applicable and interpretable by practitioners. However, future work should address this challenge by developing models with a sound mathematical underpinning, possibly combining existing approaches to focus on short-term dynamics with long-range dependent components and vice versa.

A.1 Application to significant wave heights
This section contains supplementary figures for the application of the considered models to the significant wave height dataset in Section 4. Figure 19: Estimated intensity and cumulative intensity of (a) the Hawkes-POT model (with exponential decay), (b) the SEPOT model (with ETAS decay), and (b) the SEPOT model (with exponential decay). Ticks on the upper axis represent event occurrences. 95% confidence bands are based on the Kolmogorov-Smirnov statistic [23].

A.2 Simulation study
This section contains supplementary figures and tables for the simulation study in Section 5.

Setting A
Setting B Setting C Figure 21: Empirical copula plot for inter-event times and marks for the simulated settings A, B, and C.  Figure 22: Autocorrelation functions and periodogram of the ARFIMA residuals for the simulated settings A, B and C.  Figure 23: Mean residual life plot and GPD parameter stability plots for the ARFIMA residuals of the simulated settings A, B and C.