Extreme Treatment Effect: Extrapolating Dose-Response Function Into Extreme Treatment Domain

The potential outcomes framework serves as a fundamental tool for quantifying causal effects. The average dose-response function (also called the effect curve), denoted as µ(t), is typically of interest when dealing with a continuous treatment variable (exposure). The focus of this work is to determine the impact of an extreme level of treatment, potentially beyond the range of observed values; that is, estimating µ(t) for very large t. Our approach is grounded in the field of statistics known as extreme value theory. We outline key assumptions for the identifiability of the extreme treatment effect. Additionally, we present a novel and consistent estimation procedure that can potentially reduce the dimension of the confounders to at most 3. This is a significant result, since the estimation of µ(t) is typically very challenging in the presence of high-dimensional confounders. In practical applications, our framework proves valuable when assessing the effects of scenarios such as drug overdoses, extreme river discharges, or extremely high temperatures on a variable of interest.


Introduction
Quantifying causal effects is a fundamental problem in many diverse fields of research (Rosenbaum and Rubin, 1983; Holland, 1986; Robins et al., 2000; Imai et al., 2008). Some prevalent examples include the impact of smoking on developing cancer (Imai and van Dyk, 2004), the influence of education on increased wages (Heckman et al., 2018), the effects of various meteorological factors on precipitation (Hannart and Naveau, 2018), or the effect of policy design on various economic factors (Low and Meghir, 2017).
The potential outcomes framework (Rubin, 2005) has been the fundamental language to express the notion of the causal effect. The crux of this framework lies in acknowledging that, in any given scenario, multiple potential outcomes exist based on different interventions or exposures (Imbens and Rubin, 2015). This perspective challenges researchers to consider not only the observed outcome but also the unobserved outcomes that could materialize under alternative conditions. The typical focus in causal inference is on a binary treatment variable (exposure). However, binary treatment is unable to differentiate between different levels of the treatment variable. This issue can be partially solved by assuming a continuous treatment. For example, Kennedy et al. (2017) and Westling et al. (2020) proposed an estimator based on local linear smoothing. Galagate (2016) discussed the combination of parametric and non-parametric models for the effect curve estimation. Rubin and van der Laan (2006) and Neugebauer and van der Laan (2007) utilized the marginal structural causal models framework. Y. Zhang et al. (2023) and Bica et al. (2020) applied neural networks for the effect estimation. However, typical methods that work with a continuous treatment variable are not well suited for inference that goes beyond the observed range of the data.
In this paper, we are interested in the extreme treatment effect; that is, the quantity of interest is the effect of an extreme level of treatment, outside of the observed range. Consider the following example from medicine: assume that the data of a study (either randomized or observational) is available to us, with the health status (Y) of patients and the corresponding dose of a medicine administered (T). The available data only depicts Y when T ≤ 20 mg. What if we would then like to know the change in Y when the dose is increased to T = 25 mg? Answering this inquiry is hard, since we have zero data to answer it (it might be considered unethical to administer such a dose to a patient), and we must rely on strong unverifiable assumptions and extrapolation. Additionally, in the case of observational studies, high-dimensional confounders pose yet another significant challenge.
The connection between causal inference and extreme value theory has been receiving increasing interest. Zhang (2018) and Deuber et al. (2023) analyze the Extreme Quantile Treatment Effect (EQTE) of a binary treatment on a continuous, heavy-tailed outcome. The paper authored by Huang et al. (2022) develops a method to estimate the EQTE and the Extreme Average Treatment Effect (EATE) for continuous treatment. Bodik et al. (2024) developed a framework for Granger-type causality in extremes. Some other approaches for causal discovery using extreme values include Gnecco et al. (2020); Pasche et al. (2023); Krali et al. (2023); Bodik and Chavez-Demoulin (2023). Engelke and Hitz (2020) propose graphical models in the context of extremes. Naveau et al. (2020) analyzed the effect of climate change on weather extremes. Courgeau and Veraart (2021) proposed a framework for extreme event propagation. Kiriliouk and Naveau (2020) study probabilities of necessary and sufficient causation as defined in the counterfactual theory using multivariate generalized Pareto distributions. We contribute to this growing literature and provide a theoretically well-founded approach for estimation and inference of the extremal treatment effect.
Recent advancements in machine learning research have spotlighted the extrapolation capabilities of different models (Dong and Ma, 2023; Christiansen et al., 2021; Saengkyongam et al., 2024; Chen and Bühlmann, 2021). For instance, 'engression,' as proposed by Shen and Meinshausen (2024), presents a framework that serves as an extrapolating alternative to regression-based neural networks. Similarly, Pfister and Bühlmann (2024) explored extrapolation of conditional expectations by assuming that the maximum derivative occurs within the observed range of support. While these approaches are not inherently causal, they can be construed as such under certain assumptions. Despite achieving cutting-edge performance, these methods encounter two primary limitations: difficulty handling multiple confounders and reliance on often uninterpretable extrapolation assumptions. In contrast, our framework focuses on the causal aspect of the extrapolation.

The average dose-response function and the patient-specific dose-response function are defined as µ(t) := E[Y(t)] and µ_x(t) := E[Y(t) | X = x], respectively, where Y(t) denotes the potential outcome under treatment level t. Although the term "dose" is typically associated with the medical domain, we adopt here the term dose-response learning in its more general setup: estimating the causal effect of a treatment on an outcome across different (continuous) levels of treatment. Our objective is to learn the behaviour of µ(t) or µ_x(t) for t ≈ τ_R, where τ_R ∈ (−∞, ∞] denotes the upper endpoint of the support of T.
For a pair of real functions f_1, f_2, we employ the following notation: f_1(t) ∼ f_2(t) for t → τ_R if lim_{t→τ_R} f_1(t)/f_2(t) = 1. For a sequence of random variables, convergence in distribution means that the sequence approaches a common limiting distribution as the sample size grows. In the remainder of the paper, we assume that µ(t) and µ_x(t) are continuous on some neighbourhood of τ_R for all x.

Classical assumptions
Two classical assumptions in the literature for identifying the average dose-response function are:
• Unconfoundedness: Given the observed covariates, the distribution of treatment is independent of the potential outcome. Formally, T ⊥⊥ Y(t) | X for all t ∈ T, where ⊥⊥ denotes independence of random variables.
• Positivity: p_{T|X}(t | x) > 0 for all t, x, where p_{T|X} represents the conditional density function of the treatment given the covariates.
Under these assumptions, Hirano and Imbens (2004) showed the identifiability of the dose-response function via µ(t) = E[E[Y | T = t, X]], where the inner expectation is taken over Y and the outer expectation is taken over X.
Even if we are not willing to rely on the unconfoundedness assumption, it may often still be of interest to estimate the function t → E[E[Y | T = t, X]] as an adjusted measure of association, defined purely in terms of observed data. It may be interpreted as the average value of Y in a population with exposure fixed at T = t but otherwise characteristic of the study population with respect to X (Gill and Robins, 2001; Kennedy et al., 2017; Westling et al., 2020).
When the positivity assumption is violated, a different type of extrapolation problem arises (King and Zeng, 2006), distinct from the one considered in this work. This scenario occurs when the distributional support of the variable T varies across different levels of the confounding variables X. Various approaches have been devised to confront this challenge, including propensity thresholding (Crump et al., 2009).
Several algorithms have been proposed to estimate the function µ(t) in the body of the distribution of T. State-of-the-art methods estimate µ(t) via Σ_{i: T_i ≈ t} w_i Y_i for appropriate weights w_i that serve to "erase" the confounding effect of X (Ai et al., 2021; Bahadori et al., 2022; Li et al., 2020; Kennedy et al., 2017; Kreif et al., 2015). Typically, the estimation of µ(t) involves a two-step procedure (Imai and van Dyk, 2004; Hirano and Imbens, 2004; Zhao et al., 2020). In the first step, we model the distribution T | X, also known as the propensity. In the second step, we model the distribution of Y | T, suitably adjusted by the propensity, with the aim of mitigating the confounding effect of X. Hirano and Imbens (2004) introduced the generalized propensity score (GPS), defined as e(t, x) := p_{T|X}(t | x). One common approach is to model p_{T|X} using a Gaussian model. In the binary treatment case (when T = {0, 1}), the propensity score is a probability denoted as e(1, X) = P(T = 1 | X) and is typically modeled using logistic regression. Subsequently, the weights w_i are defined in terms of the propensity, for instance as stabilized inverse-probability weights w_i := p_T(t_i)/e(t_i, x_i).
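To make the two-step logic above concrete, the following R sketch fits a Gaussian propensity model for T | X and forms stabilized inverse-probability weights; the function and variable names are illustrative, and the stabilized-weight form is one common choice rather than the exact weighting of any single cited method.

# Illustrative two-step GPS weighting with a Gaussian model for T | X.
# 'X' is a data.frame of covariates, 'treat' a numeric treatment vector.
fit_gps_weights <- function(X, treat) {
  prop_fit <- lm(treat ~ ., data = X)                        # step 1: propensity model for T | X
  gps  <- dnorm(treat,
                mean = fitted(prop_fit),
                sd   = summary(prop_fit)$sigma)              # e(t, x) = p_{T|X}(t | x)
  marg <- dnorm(treat, mean = mean(treat), sd = sd(treat))   # marginal density of T
  marg / gps                                                 # stabilized weights w_i
}
# A weighted local estimate of mu(t) in the body then averages w_i * Y_i
# over observations with T_i close to t.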

Extreme value theory
When dealing with extreme values, it is easy to introduce a strong selection bias. A naive approach for estimating µ(t) for large values of t might involve only considering observations where t exceeds a certain threshold, denoted as τ, and computing µ(t) using conventional techniques, while disregarding all values below this threshold. This is a typical approach of many classical algorithms, which estimate µ(t) by focusing solely on a local neighborhood of observations around t. However, this approach can introduce a significant selection bias. In its extreme manifestation, all observations where t exceeds τ might exclusively pertain to men, for instance. The selection bias arises if the effect of T on Y differs between men and women (see Figure 2 in Section 5.1 with τ = 3). We employ the extreme value theory technique known as peaks-over-threshold to tackle this issue.

Extreme value theory is a sub-field of statistics that explores techniques for extrapolating the behavior (distribution) of T beyond the observed values. A limiting theory posits that the tail of T can be well approximated by the Generalized Pareto Distribution (GPD), as detailed in the following explanation.
Consider a sequence (T_i)_{i≥1} of independent and identically distributed (iid) random variables with a common distribution F, and let M_n = max_{i=1,...,n} T_i denote the running maximum. It is well known (Resnick, 2008) that if (Condition 1) there exist sequences a_n > 0 and b_n such that (M_n − b_n)/a_n converges in distribution to a non-degenerate distribution G, then G falls within the Generalized Extreme Value (GEV) distribution family. Condition 1 can equivalently be expressed using the following definition:

Definition 1 (Pickands (1975)). The distribution F is in the max-domain of attraction of a generalized extreme value distribution with index γ ∈ R (notation F ∈ MDA(γ)) if there exist sequences a_n > 0 and b_n such that lim_{n→∞} F^n(a_n x + b_n) = exp(−(1 + γx)^{−1/γ}) for all x with 1 + γx > 0. In case γ = 0, the right side is interpreted as exp(−e^{−x}). The parameter γ is called the extreme value index (EVI).
This condition is mild as it is satisfied for most standard distributions, for example, the normal, Student-t and beta distributions. The following crucial theorem states that the tail of T can be well approximated by the GPD if the distribution of T belongs to MDA(γ).
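For reference, the result alluded to here is the classical Pickands–Balkema–de Haan theorem; the statement below is its standard textbook formulation written in the notation of this section, with F_u(y) = P(T − u ≤ y | T > u) denoting the excess distribution above a threshold u.

% Pickands--Balkema--de Haan theorem (standard formulation)
F \in MDA(\gamma)
\quad\Longleftrightarrow\quad
\lim_{u \to \tau_R} \; \sup_{0 < y < \tau_R - u}
\bigl| F_u(y) - H_{\gamma,\sigma(u)}(y) \bigr| = 0
\quad \text{for some positive scale function } \sigma(u),
\qquad\text{where}\qquad
H_{\gamma,\sigma}(y) = 1 - \Bigl(1 + \frac{\gamma y}{\sigma}\Bigr)_{+}^{-1/\gamma},
\qquad
H_{0,\sigma}(y) = 1 - e^{-y/\sigma}.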
Assumption 1. We assume that the distributions of T and of T | X = x (for every x) are in the max-domain of attraction of a generalized extreme value distribution.

Our tail framework
We aim to model the effect of a treatment variable T in the context of extreme values of T. However, it is essential to approach the term 'extreme' with caution, considering the discrepancy between real-world implications and the interpretations within our model. Take, for instance, the case where T represents a drug dose in milligrams: our model operates under the assumption that T tends toward τ_R, which can potentially be larger than several kilograms. While this mathematical abstraction lacks practical significance, given that administering several kilograms of a drug is physically implausible, the model does allow for arbitrarily large values of T. Of course, we do not claim that our model performs well when T equals several kilograms, but only that it performs well for T in a 'reasonable neighborhood' of the observed values, where the effect of T is expected to be 'extrapolatable' from the observed values.

Assumptions
We are not aiming to estimate the complete µ(t), but rather only its values for large t. Therefore, we can relax the classical assumptions for the identification of µ(t); what we specifically need are their tail counterparts.
Assumption 2 (Unconfoundedness in tail). For all x ∈ X it holds that E[Y(t) | T = t, X = x] ∼ E[Y(t) | X = x] as t → τ_R. (Unconfoundedness in tail) We always assume the existence of the expected values.
Rather than simply writing t → τ_R, we frequently opt for the notation t(x) → τ_R to emphasize its dependence on the random variable X. Note that Assumption 2 is strictly less restrictive than the Unconfoundedness assumption introduced in Section 2.1.
Remark 1. To provide some intuition regarding the permissiveness of Assumption 2, we rephrase our framework in the language of structural causal models (SCM, Pearl (2009)). Assume that the data-generating process of the output Y is as follows: Y := f_Y(T, X, H, ε). Here, H represents a (possible) latent confounder of T and Y, and ε is an independent noise variable. Then, the dose-response function has the form µ(t) = E[f_Y(t, X, H, ε)], where the expectation is taken with respect to the joint distribution of (X, H, ε).
Assumption 2 can be rephrased as follows: there exists a function f̃_Y such that f_Y(t, x, h, e) ∼ f̃_Y(t, x, e) as t → τ_R, (Unconfoundedness in tail in SCM) for all admissible values of x, h, e. This assumption is valid, for example, in additive models; that is, when f_Y(t, x, h, e) = f̃_Y(t, x, e) + g(h) for some functions f̃_Y, g.
Additionally, we restate the positivity assumption in the context of its tail counterpart.
Assumption 3 (Positivity in tail). p_{T|X}(t | x) > 0 for all x and all t > t_0 for some t_0 ∈ T, where p_{T|X} represents the conditional density function of the treatment given the covariates.
Note that this assumption is weaker than the classical Positivity assumption introduced in Section 2.1.

Adjusting only for θ(X)
The following lemma serves as the tail counterpart of the identifiability result in the classical framework. It states that, under Assumptions 2 and 3, the tail of the dose-response function is identifiable from the observational distribution via the propensity function π_0(t, x).
Recall that the distribution of T | X = x, conditioned on T > τ(x) for large τ(x) ≈ τ_R, is approximately GPD with parameters θ(x) = (τ(x), σ(x), γ(x)). The following result suggests that instead of conditioning on the (potentially high-dimensional) covariates X, we only need to condition on θ(X).
Lemma 2. Under Assumptions 1 and 2, for all s in the support of θ(X) it holds that E[Y(t) | T = t, θ(X) = s] ∼ E[Y(t) | θ(X) = s] as t → τ_R.

Hence, Lemma 2 suggests that it is sufficient to condition only on θ(X) rather than on X. This finding is pivotal for dimension reduction, effectively reducing the dimension from dim(X) to at most 3. Nonetheless, this is merely a limiting result, and it introduces the approximation error of the GPD approximation for finite samples.

Figure 1: The upper three panels illustrate a first-order extrapolation approach, employing a linear fit at the boundary of the support of T. The lower three panels depict estimations generated by distinct models: the first utilizes a pre-additive noise model parameterized by neural networks (Shen and Meinshausen, 2024), the second employs smoothing splines (Wood, 2017), and the third utilizes a random forest approach (Breiman, 2001).
Under Assumption 2, modeling µ_X reduces to statistical modeling of E[Y | T, X]. Furthermore, under Assumptions 1 and 2, it reduces to modeling E[Y | T, θ(X)]. In principle, a wide range of models can be considered, ranging from simple linear models to non-parametric neural networks. The principle of Occam's razor suggests that, especially when extrapolating beyond the range of observed values, simpler models often prove to be the most effective choices (Soklakov, 2002). The extrapolation capabilities of various models have recently garnered attention in machine learning research. Shen and Meinshausen (2024) introduced the 'engression' framework as an extrapolating counterpart to regression-based neural networks. While we build our framework on a linear model for simplicity, it is possible to utilize different models, such as engression-based ones. Figure 1 illustrates the extrapolating properties of various commonly used models.
A straightforward approach to model E[Y | T = t, X = x] under the assumption of linearity-in-the-tail would be to assume the existence of functions α, β such that E[Y | T = t, X = x] ∼ α(x) + β(x)·t as t → τ_R. Following the notation in Remark 1, this corresponds to assuming f_Y(t, x, h, e) ∼ α(x) + β(x)·t as t → τ_R for all admissible values of h, e. This assumption is valid, for example, in additive models where f_Y(t, x, h, e) = α(x) + β(x)t + g(x, h, e) for some function g. However, using the result from Lemma 2, it is sufficient to condition only on θ(X) instead of the potentially high-dimensional X. Therefore, we introduce the following model assumption:

Assumption 4 (Conditional linearity of tail). There exist functions α and β such that for all s in the support of θ(X), the following holds:

E[Y | T = t, θ(X) = s] ∼ α(s) + β(s)·t as t → τ_R.    (5)

Such an assumption has been explored in various contexts (typically where θ(X) represents parameters of a normal distribution; see Imai and van Dyk (2004) or Section 2.2.1 in Zhao et al. (2020)); to the best of our knowledge, the extreme case has not yet been explored. We can construct our inference method (as discussed in Section 4) by estimating α and β using various machine-learning methodologies.

Inference and estimation
Let (x_i, t_i, y_i)_{i=1}^n be the observed data. In the following, we propose a methodology for the estimation of µ(t) for t ≈ τ_R under Assumptions 1, 2 and 4.
Consider the following two-step procedure. In the first step, we approximate the tail of T | X using the GPD (that is, we estimate the location, scale and shape parameters θ(X) = (τ(X), σ(X), ξ(X))). In the second step, we estimate the expectation of Y given a large T, conditional on the estimated GPD parameters θ(X).

1. Estimate θ(x):
• Estimate the covariate-dependent threshold τ(x) using quantile regression: that is, estimate the q-quantile of T | X = x.
• From now on, restrict the inference to the observations in S := {i : t_i > τ(x_i)}.
• Estimate θ(x) in the tail model: that is, estimate (σ, ξ) from the data-points in S in the model where T_i − τ(x_i), given T_i > τ(x_i) and X_i = x_i, approximately follows GPD(σ(x_i), ξ(x_i)).

2. Estimate µ(t) or µ_{x⋆}(t) using θ(x):
• Estimate α, β in model (5) from the data-points in S (that is, we only consider t > τ(x)).
The first step is a very standard procedure in the extreme value literature called 'peaks-over-threshold' (Coles, 2001); it is standard to assume a constant shape parameter γ(x) ≡ γ ∈ R (Smith, 1990; Davison and Huser, 2015), since in practice it is atypical for the shape parameter to change with covariates. For the estimation of τ(x), σ(x), α(x), β(x), we use either a linear parametrisation or non-parametric smooth estimation using splines (GAM, Wood (2017)), but any method can be used in practice. In the case of a very small sample size, we can also assume a constant scale parameter σ(x) ≡ σ ∈ R in order to reduce the dimension of the estimation.
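To fix ideas, a minimal R sketch of this two-step procedure is given below, simplified to a constant scale and shape in the GPD step and a linear parametrisation of α and β in the estimated threshold; all function and variable names are illustrative and do not correspond to the authors' code.

library(quantreg)   # rq(): quantile regression for the covariate-dependent threshold
library(ismev)      # gpd.fit(): maximum-likelihood fit of the GPD

estimate_extreme_effect <- function(X, treat, y, q = 0.9, t_new) {
  X <- as.data.frame(X)
  ## Step 1a: threshold tau(x) = q-quantile of T | X = x via quantile regression
  qr_fit  <- rq(treat ~ ., tau = q, data = X)
  tau_hat <- fitted(qr_fit)
  ## Step 1b: restrict to the exceedances S = {i : t_i > tau(x_i)}
  S <- which(treat > tau_hat)
  ## Step 1c: GPD fit to the excesses (here with constant sigma and xi)
  gpd <- gpd.fit(treat[S] - tau_hat[S], threshold = 0, show = FALSE)
  ## With constant (sigma, xi), conditioning on theta(X) amounts to conditioning on tau(X)
  dat <- data.frame(y = y[S], t = treat[S], theta = tau_hat[S])
  ## Step 2: linear-in-the-tail outcome model with alpha, beta linear in theta:
  ##   E[Y | T = t, theta] = (a0 + a1*theta) + (b0 + b1*theta)*t
  fit <- lm(y ~ t * theta, data = dat)
  ## Population-level extrapolation: average the fitted tail line over the observed theta's
  mu_hat <- mean(predict(fit, newdata = data.frame(t = t_new, theta = dat$theta)))
  list(mu_hat = mu_hat, gpd = gpd$mle, tail_fit = fit, S = S)
}

For the patient-specific quantity µ_{x⋆}(t), one would plug in the estimated θ(x⋆) instead of averaging, and the tail slope at θ(x⋆) is the fitted coefficient of t evaluated at that value.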
The choice of q in the first step is a standard issue in extreme value theory (Schneider et al., 2021; Caeiro and Gomes, 2015; Davison and Smith, 1990). For theoretical results, q should grow with the sample size; that is, q = q_n satisfying lim_{n→∞} q_n = 1 and lim_{n→∞} n(1 − q_n) = ∞. In practical terms, q should be set as high as possible while ensuring that a sufficient amount of data remains above the threshold to maintain good inferential properties. Classical choices include q = 0.9, q = 0.95 or q = 0.99, depending on the size of the dataset.
We utilize the basic bootstrap technique (sometimes also called Efron's percentile method, see Chapter 23 in van der Vaart (1998)) to establish confidence intervals. This involves randomly sampling, with replacement, from our dataset to generate multiple bootstrap samples, each matching the size of our original dataset. For each bootstrap sample, we calculate the estimate of the statistic µ̂⋆(t). Subsequently, we determine the α-percentiles of the re-sampled statistics to derive the confidence intervals. Details can be found in Appendix D.1.
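A minimal sketch of this percentile bootstrap in R, reusing the illustrative estimate_extreme_effect() helper from the previous sketch (names and defaults are again assumptions, not the authors' implementation):

bootstrap_ci <- function(X, treat, y, t_new, q = 0.9, B = 1000, level = 0.95) {
  X <- as.data.frame(X)
  n <- length(y)
  boot_est <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)   # resample rows with replacement
    estimate_extreme_effect(X[idx, , drop = FALSE], treat[idx], y[idx],
                            q = q, t_new = t_new)$mu_hat
  })
  ## percentile (Efron) confidence interval for mu(t_new)
  quantile(boot_est, probs = c((1 - level) / 2, 1 - (1 - level) / 2), na.rm = TRUE)
}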
Remark 2. One must be cautious when interpreting confidence intervals during extrapolation. Generally, estimation of µ(t) is subject to two primary sources of error: 1) bias stemming from model misspecification, and 2) error arising from estimation variance. While the former can be mitigated within the body of the distribution by comparing different models and employing cross-validation, AIC or BIC criteria to select the most suitable model, this approach becomes less reliable in the extremal region. Eliminating this bias necessitates observing data within the region of interest. The latter source, stemming from estimation uncertainty, can be addressed by computing confidence intervals (in our case, via bootstrap). It is important to note that our bootstrap confidence intervals only account for the latter source; consequently, the first type of bias presents a greater challenge during extrapolation, since it is, in principle, unquantifiable without additional data.
Theorem 1 (Idea; precise statements can be found in Appendix D). Assuming the conditions outlined in either Theorem D.1 or Theorem D.2 are met, our procedure is consistent. Furthermore, under the assumptions detailed in Theorem D.3, the bootstrap confidence intervals are asymptotically consistent at the correct level.
In our procedure, we adopt a practice common in extreme value theory, where we concentrate solely on the extreme observations (the set S) while discarding all non-extreme values. This approach stems from the rationale that extreme observations provide the most valuable insights into out-of-support behavior. Utilizing data within the body of the distribution may introduce bias, as these values may exhibit different behavioral patterns. Mathematically, this rationale can be expressed through an examination of the precision of the GPD approximation. This approximation shows high precision exclusively for extreme values while displaying bias and low precision for non-extreme values.

Illustration and experiments
In this section, we assess the performance of our methodology using both a simple illustrative example and experimental data. A comprehensive simulation study is provided in Appendix B.
The quantity of interest in the application presented in Section 6 is the difference µ(t_1) − µ(t_2) for t_1, t_2 < τ_R. Hence, in the simulations, we focus on estimating ω := lim_{t→∞} [µ(t + 1) − µ(t)] and ω_x := lim_{t→∞} [µ_x(t + 1) − µ_x(t)], assuming τ_R = ∞ and that the corresponding limits exist. Note that under the linear model (5), the limit exists and corresponds to the parameter ω_x = β[θ(x)]. Here, ω_x can be regarded as the tail counterpart of the coefficient β_x in a linear model Y = α_x + β_x T + ε, where α_x and β_x are real coefficients, possibly dependent on X.

Simple example
The subsequent illustrative example outlines our methodology and the main ideas. Consider a single confounder X = X_1 ∼ Bernoulli(0.75) (where X_1 = 1 denotes men and X_1 = 0 denotes women, for instance). Define T = X_1 + ε_T, where ε_T ∼ N(0, 1) (indicating that T generally tends to be larger for men than for women). Let Y depend on T piecewise-linearly with additive noise ε ∼ N(0, 1), such that E[Y | T = t, X_1 = 1] = t and E[Y | T = t, X_1 = 0] = 2t for t > 1, while E[Y | T = t, X_1] = −t + 2 for t ≤ 1.
A simple computation gives µ(t) = 0.75·t + (1 − 0.75)·2t = 1.25t for any t > 1, while µ(t) = −t + 2 for t ≤ 1. Consequently, our primary interest lies in estimating the slope ω = lim_{t→∞} [µ(t + 1) − µ(t)] = 1.25. We generate data as specified with a sample size of n = 500. Setting the threshold at q = 0.9, we employ the methodology outlined in Section 4 to estimate ω. This process is repeated 100 times, yielding a mean and 0.95 quantile of ω̂ = 1.26 ± 0.39.
Additionally, we employ the bootstrap technique to calculate confidence intervals, and we obtain a confidence interval of the form ω ∈ (0.72, 1.87) on average. We see from other simulations that these confidence intervals are slightly conservative for n ≤ 1000, but work well for larger sample sizes.
In Figure 2, we present one generated dataset with a sample size of n = 500, showcasing various methodologies from existing literature and their extrapolation efficacy. Notably, classical techniques often exhibit a tendency to underestimate µ(t) for large t, primarily due to the fact that only the 'men' category (X = 1) received T > 2.
We conclude with an important remark regarding the sample size: a substantial amount of valuable information is lost when we discard 90% of the data by focusing solely on the data in the set S (data above the threshold τ(x)). This is the primary reason behind the considerably large confidence intervals and the heightened variability in our estimates. We encounter the inevitable bias-variance trade-off; the inclusion of more data introduces a potential bias, given that the behavior of µ(t) differs in the body and in the tail.

Simulations
We provide a comprehensive discussion of all simulations in Appendix B. In our study, we devised several simulation setups to model diverse scenarios and explore them thoroughly. Specifically, we focused on five key scenarios:
1. Investigating how our method scales with respect to the dimension of the confounders d = dim(X).
2. Comparing our method with classical methods from the literature.
3. Expanding upon the simple example introduced in Section 5.1, wherein we evaluated performance across various dependence structures (employing different copulas), sample sizes, and a spectrum of causal effects.
4. Examining the presence of a hidden confounder affecting both T and Y.
5. Focusing on variations in the function µ(t).
In this section, we present two key findings from our simulation study. Table 1 illustrates how our methodology scales with varying dimensions of the confounders d = dim(X). Additionally, Table 2 depicts the comparison of our method with other classical methods from the literature, where we estimated µ(t̃) for t̃ = max_{i=1,...,n} t_i + 10. As discussed in Appendix B, the reason for the bias observed in larger dimensions d is that Assumption 1 and (3) are only asymptotic results, and with higher dimensions d, we require more data for the asymptotic theory for T | X to take effect. It is well known that the convergence rate of the maxima of a Gaussian random sample to an extreme value distribution is very slow, whereas it is faster for Exponential or Pareto distributions (Davis, 1982).
Analysis of Table 2 reveals that our method achieves the smallest extrapolation error. The Hirano and Imbens (2004) method utilizing a GAM outcome model showed surprisingly reasonable performance, while the IPTW method (van der Wal and Geskus, 2011) exhibits poor performance due to the quadratic nature of the extrapolating curve. Please note that our method assumes linearity in the tail. If this assumption is not met, our method may produce inferior results. In such instances, alternative regression techniques can be utilized in step 2 of our algorithm instead of linear regression. For example, neural networks, as demonstrated in Shen and Meinshausen (2024), can be employed to address nonlinear behavior and enhance performance.
6 Application: River discharge dataset

Understanding the causal relationship between extreme precipitation and river discharge is crucial for effective water resource management. In this study, we examine how extreme precipitation events impact river discharge. By utilizing a comprehensive dataset spanning various hydrological conditions, our research seeks to provide insights into the critical nexus between extreme precipitation dynamics and extreme river discharge events. The data were collected by the Swiss Federal Office for the Environment (hydrodaten.admin.ch), but were provided by the authors of Pasche et al. (2023); Engelke and Ivanovs (2021), with some useful preliminary insights. We used precipitation data and other relevant measured variables from meteorological stations provided by the Swiss Federal Office of Meteorology and Climatology, MeteoSwiss (gate.meteoswiss.ch/idaweb).
We exclusively examine the discharge levels of the River Reuss, situated near Zurich in Switzerland (Figure 3). We selected this river due to the availability of excellent measurements of its discharge levels, complemented by well-documented weather conditions from nearby meteorological stations and a diverse landscape. Our measurements include average daily discharges between January 1930 and December 2014 and daily precipitation at the nearby meteo-stations. To reduce any seasonal effects due to unobserved confounders, we only consider data during June, July and August, as the more extreme observations happen during this period, when mountain rivers are less likely to be frozen.
We center our attention on addressing two distinct research questions: one characterized by a straightforward scenario where the ground truth is known, and another presenting a more intriguing challenge.

Known ground truth
We demonstrate our methodology using a straightforward example where the ground truth is known. Consider a pair of river stations, such as stations 2 and 1. Let T represent the water discharge at station 2, Y represent the water discharge at station 1, and X denote measurements taken at a nearby meteorological station (including precipitation, humidity, etc.; the full list of confounders can be found in Appendix C). Our objective is to investigate the impact of extreme discharge levels at station 2 on the water discharge observed at station 1. In mathematical terms, we seek to ascertain µ(t) or µ_x(t) for large values of t. In this context, the ground truth is the following: µ_x(t_1) − µ_x(t_2) = t_1 − t_2 for all t_1, t_2 ≥ 0 and all x ∈ X. In words: if we pour t_1 liters of water into the river at station 2 (in causal terminology, we interpret this as an intervention do(T = T + t_1)), we expect the water discharge at station 1 (Y) to increase by exactly t_1. Hence, ω = ω_x = 1. As we will see below, our methodology consistently yields this expected outcome.

We follow the methodology introduced in Section 4 with q = 0.95. Detailed steps, diagnostics and preliminary data analysis (for the pair 2 → 1 only) can be found in Appendix C. The resulting estimates can be found in Table 3. The results are very similar for different choices of q (changing ω̂ by no more than 0.1). We observe that our results align very well with the ground truth (ω = 1). However, there is a slight bias evident in the relationships between the pairs 5 → 3 and 4 → 3: this can be attributed to distinct geographical features. Notably, the lake Vierwaldstättersee lies between stations 5 and 3, which diminishes the influence of 5 on 3. Additionally, the 3238 m Titlis mountain is situated between stations 4 and 3, amplifying the effect of 4 on 3 due to the melting glacier ice, acting as an unmeasured confounding factor.

Our methodology relies on Assumptions 1, 2, 3, and 4, along with some continuity assumptions and the SUTVA assumption discussed in Section 2. Assumptions 1 and 3 are minor and are used frequently when dealing with these types of data. Assumption 2 is a common and challenging aspect of every causal inference methodology. While our assumption is weaker than the classical unconfoundedness assumption (requiring no hidden confounder in the tail), complete rejection of the possibility of its violation is unattainable. However, we believe that the meteo-station between a pair of river stations can capture the most significant confounders (except when a lake or mountains lie between the river stations). Finally, Assumption 4 is a strong assumption that allows us to extrapolate observed values into the extremal region. However, this assumption (or at least some similar model assumption) is necessary. In this case, the linearity assumption is valid, since the underlying ground truth is known.

Effect of precipitation on river discharge
We employ our methodology to address a more complex inquiry where the ground truth is not known. Consider the water discharge at station 3 (Y) and let T denote the precipitation measured at meteo-station M2. Our focus lies in understanding the impact of extreme precipitation events (T) on the water discharge (Y). As mentioned in the introduction, on 6.6.2002 we recorded a historical maximum precipitation level of T_max = 111 mm/m², coinciding with the scenario when the river nearly breached its banks. Our inquiry centers on the question: how would the river discharge Y alter if T were to reach 120 mm/m²? In mathematical terminology, we are interested in estimating µ(120) − µ(111), or possibly µ_{x⋆}(120) − µ_{x⋆}(111), where x⋆ are the other covariates corresponding to that event. Addressing this question is challenging as we lack data within this extreme regime, necessitating reliance on extrapolation. This task is especially challenging since we anticipate that the effect of precipitation on river discharge may vary between the body of the distribution and its tail, because the ground absorbs a significant portion of the rainfall during light rain.
Table 4: Estimates β̂ and ω̂ represent the estimation of the effect of T on Y in the body and in the tail, respectively. β̂ is computed using standard regression, while ω̂ is the tail counterpart computed using the steps introduced in Section 4. *Note that station 5 is at a different altitude and relatively far from meteo-station M2, with a lake in between: hence, there is a bias due to data collection problems.

We follow the methodology introduced in Section 4. A straightforward approach would be to define T as the precipitation and Y as the water discharge on the same day, while choosing appropriate confounders X from measurements at M2. Then, we can use a classical approach for estimating µ(t) in the body and our approach to estimate it in the tail. However, some problematic issues arise in this application:

• Time issue: T_Monday → Y_Monday but also T_Monday → Y_Tuesday, since it takes time for the rain water to reach the river and rain tends to be more frequent around midnight. In fact, the correlation (and the extreme correlation coefficient as well, see Figure 15) is much higher for the pair (T_Monday, Y_Tuesday) than for (T_Monday, Y_Monday). The extreme storm on 6.6.2002 corresponded to an extremely high river discharge on 7.6.2002 (where Y was about five times larger than on 6.6.2002). Hence, our interest lies in the effect T_Monday → Y_Tuesday (that is, we consider t_i as the precipitation on day i while y_i is the discharge on day i + 1). Additionally, the presence of time introduces an auto-correlation issue. This can be handled by taking, for example, weekly maxima or discarding consecutive observations within a certain time frame to reduce the auto-correlation effect. Alternatively, applying techniques like time series decomposition, differencing, or using autoregressive models can also mitigate the issue of auto-correlation in the data analysis process. We leave the data unchanged since the temporal dependence is primarily local, spanning only a few days, and does not introduce a substantial bias.
• Variable selection issue: choosing appropriate covariates X that act as confounders of Y and T. It is not clear which variables can be safely considered as confounders: if a variable X lies on a path T → X → Y, adjusting for X would lead to a so-called path-cancelling causal effect (Pearl, 2001). Here, we are interested in the so-called total causal effect, so we need to be cautious about which covariates to adjust for. However, not adjusting for a common cause leads to a bias. Moreover, there is often a feedback loop X_i ↔ Precipitation, for X_i being for example humidity or temperature. However, some of the variables can be safely considered as common causes: for example, the temperature on Sunday (the day before measuring the precipitation). There is a huge amount of literature on such variable selection, and we do not aim to comment on this research area: we only provide the full list of chosen confounders in Appendix C.
We estimate two values: ω, the tail quantity defined as the difference µ(t + 1) − µ(t) for large t (in our case: how would Y change if it had rained 1 mm/m² more on 6.6.2002?), and β = µ(t + 1) − µ(t) corresponding to the body of the distribution (see Appendix C.2.2 for details on its computation). The resulting estimates can be found in Table 4, and a visualisation of µ̂(t) can be found in Figure 4. We observe that the effect of T on Y is larger in the extreme region than in the body of the distribution, by a factor of 3.04/2.4 ≈ 1.25.

Figure 4: Classical estimate of µ(t) in the body of the distribution and our estimate of the extremal treatment effect in the tail, together with 95% confidence intervals for µ(111) and ω.
As for the answer to our question 'how would the river discharge Y alter at station 3 if T were to reach 120 mm/m² on 6.6.2002', our results suggest that the water discharge would be larger by about 9 × 1.62 = 14.5 m³/s (note that the median of Y is 11.2 and the 95% quantile of Y is 51.2). Would this result in the river overflowing its banks? We cannot definitively say, as we lack the necessary data regarding the volume and contours of the river banks. Moreover, Y represents the daily average of the water discharge, whereas the daily maximum would be a better-suited variable for answering this question. Nonetheless, this advances us towards a more accurate understanding of the effects and impacts of extreme precipitation events, potentially enhancing statistical inference for hydroelectric power stations located along this river.

Conclusion and future work
Analyzing the impact of extreme levels of a treatment variable (exposure) is essential for comprehending its effects on diverse systems and populations. In this paper, we introduced a novel framework aimed at estimating the causal effect of extreme treatment values. Leveraging insights from extreme value theory, we enhanced the estimation of the extreme treatment effect. Our framework can handle a substantial number of confounders. Nonetheless, our methodology relies on extrapolation, presenting inherent challenges even in the absence of confounding variables, where the bias stemming from a model misspecification is impossible to quantify. Our framework holds promise for initial assessments of the impact of extreme environmental events, such as the effects of severe storms or droughts on economic damages. Future work may explore the application of our extreme value theory approach to address time-varying effects, a prevalent issue in environmental research.

A.1 Main analysis
In this section, we delve into a dataset (Yeh, 1998) focused on concrete compressive strength.
Concrete serves as a fundamental material in civil engineering, and understanding its compressive strength (denoted as Y and measured in MPa) is paramount for ensuring structural integrity (Neville, 2011). Concrete comprises ingredients such as cement (X_1), fly ash (X_2), water (X_3), superplasticizer (X_4) and blast furnace slag (T), amongst some other additions. The units of T, X_1, X_2, X_3, X_4 are kilograms per m³ of the mixture. The concrete compressive strength (Y) exhibits a highly nonlinear relationship with these ingredients and the elapsed time. Our focus is on exploring the effect of blast furnace slag (T) on compressive strength (Y). It is well-established that increasing the quantity of T can enhance Y, yet an excessive amount of T may lead to a decrease in Y.
In our dataset, X_1, X_2, X_3, X_4 may affect the decision of how much T was used (engineers often decide about the quantity of T based on the looks of the mixture of other ingredients). Our dataset contains n = 1030 instances of observational data {x_i, t_i, y_i}_{i=1}^n, where x_i = (x_{1,i}, ..., x_{4,i})^⊤. The range (min_i y_i, max_i y_i) = (2.3, 82.5) and (min_i t_i, max_i t_i) = (0, 359).
Suppose we fit a linear model E[Y] = β_0 + β_T T + β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4; then, the least-squares estimate of the coefficient β_T is β̂_T = 0.08 ± 0.006. This can be (wrongly) interpreted as 'adding one additional kg of T per m³ of mixture increases Y by 0.08 MPa'. However, we expect different behaviour for small and large values of T, we expect strong (nonlinear) interactions between the covariates and, more importantly, this result is derived from the body of the distribution, while we are interested in values of T above the observed ones.
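For illustration, the naive pooled fit mentioned above can be reproduced in R along the following lines; the column names of the concrete data frame are hypothetical placeholders.

# Naive linear fit in the body of the distribution; the coefficient on slag
# should not be read as the effect of an extreme slag quantity.
naive_fit <- lm(strength ~ slag + cement + fly_ash + water + superplasticizer,
                data = concrete)
summary(naive_fit)$coefficients["slag", ]   # roughly 0.08 MPa per kg (s.e. about 0.006)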
Our objective is to quantify the effect of an extreme amount of blast furnace slag T on Y. Specifically, we answer the following questions:
1. Given a concrete mixed with T = 359 and X = x for some specific value of x, if we intervene and change T to T = 400, what effect on the concrete compressive strength can we expect? Using the potential outcome notation, the quantity of interest is µ_x(400) − µ_x(359). Note that max_{i=1,...,n} t_i = 359 (we do not observe blast furnace slag larger than 359, and there is no observation in the interval (220, 359)); hence, we have zero data in such an extreme region.
2. How would an extreme increase in T change Y for an 'average' concrete (on a population level, i.e. integrating over the covariates)? Using the potential outcome notation, the quantity of interest is µ(400) − µ(359).
We follow the methodology introduced in Section 4 with q = 0.9. Detailed steps, diagnostics and preliminary data analysis can be found in Section A.2. The resulting estimates are summarized in the following paragraph; the results are similar for different choices of q (see Table 5). In summary, the results suggest that for a mixture of concrete with covariates X = x⋆ and T = 359, intervening on T and changing it to T = 400 would decrease the concrete compressive strength by about 4.1 MPa. On the population level, increasing T from 359 to 400 would lead to a decrease in the concrete compressive strength of about 4.5 MPa. The 95% confidence intervals suggest that these estimates can be inaccurate by about 3 MPa; however, one must be cautious about the interpretation of the confidence intervals, since they are in general unreliable when extrapolating, see Remark 2.
Figure 5 graphically shows the estimation of µ̂(t) in the body using the method introduced in Kennedy et al. (2017); Kennedy (2019), as well as our estimation of µ̂(t) for extreme values.
Our methodology relies on Assumptions 1, 2, 3, and 4 (along with some continuity assumptions and the SUTVA assumption discussed in Section 2). While we argue that Assumptions 1 and 3 are minor, we cannot disregard the possibility of a hidden confounder (Assumption 2). The validity of Assumption 2 has to be further argued using expert knowledge. Finally, the strongest assumption is Assumption 4, as its violation leads to the most significant bias. However, this assumption (or a similar assumption using a different model) is necessary when extrapolating, and it is hypothetically testable by measuring values with T ≈ 400. Appendix A.3 also discusses the differences across a range of choices of q.
A.2 Detailed computations of the estimates

Some data visualisation can be found in Figures 6 and 7. In the following, we provide detailed descriptions of the steps undertaken in the application for the specific choice q = 0.9. First, we estimate τ(x) using classical quantile regression (Koenker, 2005). We observe that all covariates are highly significant and the diagnostic plots do not show any significant problems (except that T_i = 0 for many observations); we illustrate the estimation in Figure 8, where the points above the 90% threshold (points in the set S) are marked.
In the next step, we routinely estimate θ(x). More precisely, we assume a fixed ξ(x) = ξ ∈ R and only estimate σ(x) as a smooth function of the covariates; Plot 9 shows the estimated values of σ̂(x) on a log-scale.
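As one concrete way to carry out this step, the scale can be modelled log-linearly in the covariates with a fixed shape, e.g. via ismev::gpd.fit; this is a simpler alternative to the smooth (GAM-based) fit used here, and the object names (t_obs, tau_hat, S, X) are illustrative carry-overs from the earlier sketch.

library(ismev)
# excesses above the fitted threshold, for the observations in S
excess <- t_obs[S] - tau_hat[S]
# sigma(x) = exp(b0 + b1*x_1 + ... ), shape xi held constant
gpd_cov <- gpd.fit(excess, threshold = 0,
                   ydat = as.matrix(X[S, ]), sigl = 1:ncol(X), siglink = exp,
                   show = FALSE)
# gpd_cov$mle: scale coefficients (on the log scale) followed by the shape xi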
Regarding the second question (population level), we simply take the average of µ̂_{x_i}(t) over the observed covariates x_i. Regarding the confidence intervals, we resample the data using the code: resampled_data = sample_n(data, size = length(y), replace = TRUE).
Then, we follow the same steps as above and estimate the coefficients (from the resampled dataset).

A.3 Discussion about the results regarding different threshold q
Table 5 shows the results for different choices of q. Even though they yield distinct estimates of ω, the confidence intervals overlap, and the values ω ∈ (−6.4, −1.9) are encompassed by all of them. This suggests some stability with respect to the choice of q.
The selection of q reflects the bias-variance tradeoff; as we increase q, our inference relies on values closer and closer to T = 400 (datapoints with small and intermediate T_i can bias our estimation since, in this region, increasing T_i can increase Y). However, increasing q also means disregarding more and more datapoints, and our estimate will have less power and larger variance.
The challenge of choosing q is a common problem in extreme value theory, and the rule of thumb is to select q as large as possible while maintaining an adequate number of datapoints above the q-quantile threshold to ensure reasonably good inference.

A.4 Discussion about the assumptions
Our methodology relies on Assumptions 1, 2, 3, and 4, along with some continuity assumptions and the SUTVA assumption discussed in Section 2. Below, we provide a detailed discussion of each assumption.
1. Assumptions 1 and 3 are considered minor. As mentioned in Section 2, Assumption 1 is satisfied for most common distributions, and similar model assumptions are imposed in almost all applications utilizing extreme value theory. Assumption 3 appears to be satisfied, as there is no specific range of values in the support of T that has zero probability of occurring.
2. Assumption 2 is a common and challenging aspect of every causal inference methodology. While our assumption is weaker than the classical unconfoundedness assumption (requiring no hidden confounder in the tail), complete rejection of the possibility of its violation is unattainable. A potential hidden confounder could be the 'quality of ingredients.' If the quality is low, engineers might tend to use excessive amounts of T in the mixture, potentially leading to a spurious dependence between large T and low Y. However, in this case, it seems plausible that this hidden dependence due to low ingredient quality does not introduce a substantial bias. Expert knowledge is required to ensure the validity of this assumption.
3. Assumption 4 is a strong assumption that allows us to extrapolate observed values into the extremal region. However, this assumption (or at least some similar model assumption) is necessary; estimating µ(400) from the observed values is not feasible otherwise. In essence, Assumption 4 asserts that the relationship between T and Y (given the other confounders) is linear in the unobserved region up to T = 400. Since there is no particular reason to believe that this relationship has any other specific form, a linear assumption seems to be the most suitable choice. Although this assumption is strong, it is hypothetically possible to test it by measuring values with T ≈ 400.

B Appendix: Simulations
In this section, we create various simulation setups to assess the performance of our methodology.
• Section B.1 provides insight into how our method scales with the dimension of the confounders dim(X).
• Section B.2 compares our method with classical methods from the literature.
• Section B.3 extends the simple example presented in Section 5.1, evaluating performance across different dependence structures (various copulas), sample sizes, and a range of causal effects.
• Section B.4 addresses a scenario involving a hidden confounder affecting both T and Y .
• Section B.5 focuses on variations in the function µ(t) and assesses the extent to which our method can extrapolate µ(t) into the 'extreme' region.
In some of the simulations, we use the following function: µ_x(t) = slope(x)·(c − |t − c|), where typically slope(x) = |x| and c ∈ R is a hyper-parameter. A graphical visualisation of the function µ_x for x = 3 can be found in Figure 10. In other words, µ_x grows with slope x until c, and then declines with slope x.

B.1 Simulations with a high dimensional X
In these simulations, we consider X = (X_1, ..., X_d), where the dimension d is potentially large. Consider the following data-generating process:
• X is a centered Gaussian vector with cor(X_i, X_j) = 0.1 for all i ≠ j and var(X_i) = 1.
Note that a simple linear regression Y ∼ T + X_1 + ... + X_d leads to a biased estimation of the effect of T, since the effect is different for large and for small values of T. However, simply discarding values where T_i < 1 leads to a selection bias.
We generate data as specified with a sample size of n = 5000. Setting the threshold quantile at q = 0.95, we employ the methodology outlined in Section 4 to estimate ω. Specifically, we utilize a linear parametrization of the parameters in the estimation procedure. This process is repeated 100 times. The mean of the estimates ω̂, together with the 95% quantile, for various values of d and distributions of the noise variables can be found in Table 1.
Table 1 illustrates that with a sample size of n = 5000, the results are reasonably accurate as long as d ≤ 50. The reason for the bias observed in larger dimensions d is that Assumption 1 and (3) are only asymptotic results, and with higher dimensions d, we require more data for the asymptotic theory for T | X to take effect. It is well known that the convergence rate of the maxima of a Gaussian random sample to an extreme value distribution is very slow, whereas it is faster for Exponential or Pareto distributions (Davis, 1982). With a large dimension d, we also observe a more pronounced effect of the estimation error accumulated in the first step on the second step of the algorithm.

B.2 Comparison with classical methods
We evaluate our extrapolation method by comparing it with several state-of-the-art techniques from the existing literature. Specifically, we assess the performance of four methods: the doubly robust estimation method introduced by Kennedy et al. (Kennedy et al., 2017; Kennedy, 2019), the additive spline estimator proposed by Bia et al. (Bia et al., 2014), the approach suggested by Hirano and Imbens (Hirano and Imbens, 2004) employing a GAM outcome model (taken from Galagate and Schafer (2022)), and the inverse probability of treatment weighting (IPTW) estimator by van der Wal and Geskus (2011).
Our analysis employs the same simulation setup described in Section B.1, utilizing exponentially distributed noise variables (other distributions yield similar results). After generating (x_i, t_i, y_i)_{i=1}^n, we estimate µ(t̃), where t̃ = max_{i=1,...,n}(t_i) + 10, using all aforementioned methods. Subsequently, we compute the absolute relative error (ARE), defined as ARE = |µ̂(t̃) − µ(t̃)| / |µ(t̃)|. This procedure is repeated 100 times, and the average of the obtained ARE values is presented in Table 2. Analysis of Table 2 reveals that our method achieves the smallest extrapolation error. Conversely, the IPTW method (van der Wal and Geskus, 2011) exhibits poor performance due to the quadratic nature of the extrapolating curve. Kennedy et al.'s method (Kennedy et al., 2017; Kennedy, 2019) typically produces a constant extrapolation curve. Meanwhile, both Bia et al.'s (Bia et al., 2014) and the Hirano-Imbens (Hirano and Imbens, 2004) methods are highly sensitive to extreme values in the dataset, leading to significant variability of the estimates.
Please note that our method assumes linearity in the tail. If this assumption is not met, our method may produce inferior results. In such instances, alternative regression techniques can be utilized instead of linear regression. For example, neural networks, as demonstrated in Shen and Meinshausen (2024), can be employed to address nonlinear behavior and enhance performance.

B.3 Dependence, sample size and the causal effect
In the following, we conduct simulations based on a model with covariates X = (X_1, X_2, X_3) that function as a common cause of both T and Y. The details of the simulation are as follows:
• X is generated with standard Gaussian margins and a Gumbel copula with parameter α (where α represents the degree of dependence (Kolesárová et al., 2018); α = 1 corresponds to independence, and α → ∞ corresponds to full dependence; see Plot 11 for an illustration with α = 2).
• T is generated in such a way that the marginal distribution of T follows an exponential distribution with a scale parameter of 1, and the dependence structure between X and T follows a Gumbel copula with parameter α.
• The response variable Y is generated as a function of X, T and noise, where f is a randomly generated smooth function, ε ∼ N(0, 1), and ω is a hyper-parameter (governing the causal effect of T) that we vary in our simulations. Plot 11 shows one realization of such a dataset.
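A sketch of how the covariates and the treatment in this setup can be simulated with the R copula package; using a single 4-dimensional Gumbel copula for (X_1, X_2, X_3, T) is one simple way to impose the described dependence and may differ in detail from the authors' construction (the outcome model for Y is specified separately in the text and omitted here).

library(copula)
simulate_XT <- function(n, alpha) {
  u <- rCopula(n, gumbelCopula(param = alpha, dim = 4))  # joint Gumbel dependence
  X     <- qnorm(u[, 1:3])                               # standard Gaussian margins for X
  treat <- qexp(u[, 4], rate = 1)                        # Exponential(1) margin for T
  list(X = X, treat = treat)
}
# e.g. dat <- simulate_XT(n = 5000, alpha = 2)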

C.2.1 Choice of variables
We used the following set of variables:
• Y = river discharge on day i + 2,
• T = precipitation at the corresponding meteo-station on day i + 1,
• X_1 = total (sum) precipitation during the previous 7 days (days i, i − 1, ..., i − 6),
• X_2 = daily maximum of the air temperature 2 m above ground on day i,
• X_3 = daily maximum of the relative air humidity 2 m above ground on day i,
• X_4 = daily maximum of the pressure reduced to sea level on day i,
• X_5 = daily total of the reference evaporation from FAO on day i,
where i spans from 1.6.1930 up to 29.8.2014 (recall that we only considered the summer months). The choice of Y and T was addressed in the main text: the correlation typically peaks when T represents the day prior to Y (as illustrated in Figure 15, as well as by its extreme counterpart, the extremogram (Davis and Mikosch, 2009)). The rationale behind choosing X_1 is straightforward: precipitation over the preceding days is a significant confounding factor affecting both Y and T. Regarding the additional variables, we opted for those deemed relevant and with reliable measurements across meteorological stations, all of which were recorded on the preceding day.
Is there a common cause between Y and T that remains unaccounted for? Variables X_2 through X_5, measured on day i + 1, could serve as potential common causes for both Y and T. For instance, a sudden temperature change might elevate the likelihood of intense rainfall, while alterations in river discharge could stem from specific soil characteristics affected by the temperature change. Nevertheless, we contend that the majority of these variables require more than a day to manifest their effects, which we believe are largely encapsulated by our chosen variables.

C.2.2 Computation of β
In Table 4, we introduced a quantity β that represents the effect of precipitation on the river discharge level in the body of the distribution of T. This can be defined in several ways:
1. Using the method introduced in Kennedy et al. (2017); Kennedy (2019), we estimate µ̂(t + 1) − µ̂(t) for t = E(T).
2. Using a very straightforward approach where we model the data-generating process of Y using a linear structural equation model Y = c + β_T T + β_X^⊤ X + ε and return the least-squares estimate β̂_T.
Coincidentally, both approaches return very similar values of β, and hence it does not matter which approach we use (the values in Table 4 are obtained using the second approach).

D Consistency, bootstrap and its asymptotics
In this section, we give a more detailed description of the bootstrap algorithm and a more precise statement of Theorem 1 together with its proof. Theorem D.1 presents the consistency of µ̂(t) for large t, while Theorem D.2 shows the consistency of ω̂_{x⋆} under different assumptions. Throughout, we use the notation from Sections 4 and 5.

D.1 Bootstrap
In what follows, we explain in detail the procedure for an estimator ζ̂_α satisfying lim_{n→∞} P(ω ≤ ζ̂_α) ≥ 1 − α. We only focus on the upper confidence interval; the lower and two-sided intervals can be handled analogously. Our approach is standard, and van der Vaart (1998) provides a good overview. Let P̂_n be the empirical distribution of the observations Z_1, ..., Z_n. We draw bootstrap observations Z⋆_1, ..., Z⋆_n iid ∼ P̂_n, and we compute the parameter ω̂⋆ from (Z⋆_1, ..., Z⋆_n) in the same way as we compute ω̂ from (Z_1, ..., Z_n). We define ζ̂_α as the upper α-quantile of ω̂⋆: that is, the smallest value x = ζ̂_α that satisfies P(ω̂⋆ > x | P̂_n) ≤ α. The notation P(· | P̂_n) indicates that the distribution of ω̂⋆ must be evaluated assuming that the observations are sampled according to P̂_n given the original observations. In particular, in the preceding display ω̂ is to be considered nonrandom.
It is almost never possible to calculate the bootstrap quantiles exactly (van der Vaart, 1998). In practice, these estimators are approximated by a simulation procedure. A large number of independent bootstrap samples Z⋆_1, ..., Z⋆_n are generated according to the estimated distribution P̂_n. Each sample gives rise to a bootstrap value ω̂⋆. Finally, the bootstrap quantiles are estimated by the empirical quantiles of these bootstrap values. This simulation scheme always produces an additional (random) error in the coverage probability of the resulting confidence interval. In principle, this error can be made arbitrarily small by using a sufficiently large number of bootstrap samples. Therefore, the additional error is usually ignored in the theory of the bootstrap procedure. This section follows this custom and concerns the "exact" quantiles, without taking the simulation error into account.

D.2 Simplifying assumptions
We simplify some steps of the inference procedure in order to streamline the consistency proof. In particular, we assume the following:
A) (Causality justification) Assumptions 1, 2 and 4 are valid.
In particular, following the notation in Section 4 and writing $\tau(x) = \tau_{lin}^{\top} x$, our algorithm is as follows (a code sketch of the two estimation steps is given after the proof below):
• choose $q \in (0, 1)$,
• (Step 1) estimate $\tau_{lin} \in \mathbb{R}^{d}$ by minimising
$$\hat\tau_{lin} \in \operatorname{argmin}_{b} \sum_{i=1}^{n} h_q\!\left(T_i - X_i^{\top} b\right), \quad \text{where } h_q(x) = x\left(q\,\mathbf{1}_{x \geq 0} - (1-q)\,\mathbf{1}_{x < 0}\right), \qquad (11)$$
and $\hat\beta$ is the corresponding least-squares estimate.
We prove the bullet-points here:
• The first bullet-point is a trivial consequence of the assumption $q_n \to 1$.
• The second bullet-point is a trivial consequence of Lemma 2 together with Assumption D.
• The third bullet-point is a trivial consequence of Assumptions 4 and D.
• The fourth bullet-point follows from the well-known consistency of $\hat\tau_{lin}$. For a fixed quantile q, the quantile regression estimator $\hat\tau_{lin} = \operatorname{argmin}_{b} \sum_{i=1}^{n} h_q(T_i - X_i^{\top} b)$ is consistent and even asymptotically normal (see e.g. Theorem 4.1 in Koenker (2005), noting that we assume continuous T and finite second moments of X). However, in our setting the quantile q is not fixed but increases with the sample size, with $\lim_{n\to\infty} q_n = 1$ and $\lim_{n\to\infty} n(1 - q_n) = \infty$. This is a well-known generalization of quantile regression, known as 'intermediate order regression quantiles' or 'moderately extreme quantiles' (Chernozhukov et al., 2016), and it is also consistent and asymptotically normal under Assumption C (see Theorem 5.1 in Chernozhukov (2005)).
• The fifth bullet-point: for a moment, fix $\tau_{lin} \neq 0$. It is well known that the estimation of β by least squares in model (11), with $\tau_{lin}$ fixed, is consistent and even asymptotically normal under the conditions $\mathrm{var}(Y) < \infty$, $E\|X\|^{2} < \infty$, (X, T) satisfying the Grenander conditions, and the sample size $|S| =: k_n = n(1 - q_n) \to \infty$ (see e.g. Lemma D.2).
Observe that the least-squares estimate $\hat\beta$ is linear in $\tau_{lin}$: if we express $\hat\beta$ explicitly, it equals $\tau_{lin}^{\top}$ multiplied by the coefficient vector of the linear model corresponding to (14) (where T is implicitly assumed to be larger than $\tau^{\top} X$). Using this observation, we can replace the fixed value of $\tau_{lin}$ by the random $\hat\tau_{lin}$ and the same representation still holds. Since, by increasing n, we can make this coefficient vector arbitrarily accurate with arbitrarily large probability, the same holds for $\hat\beta$. In the following paragraph, we present an illustration of the linearity of $\hat\beta$ in $\tau_{lin}$ for d = 1; an explicit expression of $\hat\beta$ as a function of $\tau_{lin}$ and the data can be written in terms of the data matrix M corresponding to model (14).
Combining all the bullet-points, we obtain, with arbitrarily large probability, a chain of approximations in which each sign '≈' represents equality up to a factor of ε (in either multiplicative or additive form), which is negligible as ε → 0. This implies consistency. Quod erat demonstrandum.
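To make Step 1 and the subsequent least-squares step concrete, the following sketch implements them with statsmodels under our reading of the algorithm. The exact form of the regression corresponding to (14) is specified in the main text, so the second step shown here (regressing Y on T and X over the exceedance set S) should be taken as illustrative only; whether X contains an intercept column is also left open.

```python
import numpy as np
import statsmodels.api as sm

def two_step_fit(X, T, Y, q=0.9):
    """Step 1: quantile regression of T on X at level q; Step 2: least squares on S."""
    # Step 1: minimise the check loss h_q over linear functions of X (equation (11)).
    tau_lin = np.asarray(sm.QuantReg(T, X).fit(q=q).params)
    # Exceedance set S = {i : T_i > tau_lin' X_i}.
    S = T > X @ tau_lin
    # Step 2: least squares on the exceedances; regressing Y on T and X here is a
    # stand-in for the regression corresponding to (14) in the main text.
    design = sm.add_constant(np.column_stack([T[S], X[S]]))
    beta_hat = sm.OLS(Y[S], design).fit().params[1]   # coefficient of T
    return tau_lin, beta_hat
```

In line with the asymptotics above, q would in practice be chosen to increase with the sample size, with $q_n \to 1$ and $n(1 - q_n) \to \infty$.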

D.4 Correctness of the bootstrap
We use results from Chapter 23.2 in van der Vaart (1998). The main step is to use the delta method for the bootstrap (Theorem 23.5 in van der Vaart (1998)) and the fact that the regression models in Step 1 and Step 2 of our algorithm are 'bootstrappable' (Theorem 3 in Hahn (1995) and Freedman (1981)).
To simplify some steps of the proof, we assume the following:
E. Assume that $\hat\omega_{x^\star}$ is consistent (which holds, for example, under Assumptions A, B, C and D).
F. We compute $\hat\tau_{lin}$ from the first $\lfloor n/2 \rfloor$ data-points and we compute $\hat\beta$ from the remaining $\lceil n/2 \rceil$ data-points.
G. In the computation of the set S, we assume that τ is known and non-random; that is, we use $S = \{i \leq n : T_i > \tau(X_i)\}$ instead of $\hat S = \{i \leq n : T_i > \hat\tau(X_i)\}$.
Theorem D.3. Assume the validity of Assumptions D, E, F, G and H. Let $q \in (0, 1)$ be chosen such that the distribution of $T \mid T > \tau_q(x), X = x$ follows $GPD(\theta(x))$ for all $x \in \mathcal{X}$. Then, $\hat\zeta_\alpha$ is asymptotically consistent; that is, the resulting confidence interval attains the nominal asymptotic coverage.
Proof. We will show that $\sqrt{n}\,(\hat\omega_{x^\star} - \omega_{x^\star})$ and $\sqrt{n}\,(\hat\omega^{\star}_{x^\star} - \hat\omega_{x^\star})$ given $\hat P_n$ both converge to the same distribution, say G; that is, $\sqrt{n}\,(\hat\omega_{x^\star} - \omega_{x^\star}) \rightsquigarrow G$ and $\sqrt{n}\,(\hat\omega^{\star}_{x^\star} - \hat\omega_{x^\star}) \mid \hat P_n \rightsquigarrow G$ as $n \to \infty$. This directly implies (see e.g. Lemma 23.3 in van der Vaart (1998)) that $\hat\zeta_\alpha$ is asymptotically consistent.
Observation 2): the least-squares estimate $\hat\beta$ equals $\hat\tau_{lin}^{\top}$ times the coefficient vector of the linear model corresponding to (14) (where T is implicitly assumed to be larger than τ(X), since we assumed τ to be known and non-random in the definition of S). Note that this coefficient vector is independent of $\hat\tau_{lin}$, since it is computed from the second half of the dataset and its computation does not involve $\hat\tau_{lin}$. Moreover, $\sqrt{n}\,(\hat\beta - \beta)$ and $\sqrt{n}\,(\hat\beta^{\star} - \hat\beta)$ given $\hat P_n$ both converge to the same Gaussian distribution (Theorem 2 in Eck (2018) or Freedman (1981)).
Note that $\hat\omega_{x^\star} = \phi(\hat\tau_{lin}, \hat\beta)$. Since $\hat\tau_{lin}$ and $\hat\beta$ satisfy the conditions of the delta method for the bootstrap (Theorem 23.5 in van der Vaart (1998)), we get that $\sqrt{n}\,(\hat\omega_{x^\star} - \omega_{x^\star})$ and $\sqrt{n}\,(\hat\omega^{\star}_{x^\star} - \hat\omega_{x^\star})$ given $\hat P_n$ both converge to the same distribution. That is what we wanted to show.
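For reference, the way the bootstrap delta method is invoked here can be stated schematically as follows, with $\theta = (\tau_{lin}, \beta)$ and $\phi$ the map for which $\omega_{x^\star} = \phi(\tau_{lin}, \beta)$; this is a paraphrase of Theorem 23.5 in van der Vaart (1998), and the precise regularity conditions (Hadamard differentiability of $\phi$ at $\theta$) are as in that reference.

```latex
% Schematic use of the bootstrap delta method (van der Vaart, 1998, Theorem 23.5),
% with theta = (tau_lin, beta) and omega_{x*} = phi(theta):
\[
\sqrt{n}\bigl(\hat\theta_n - \theta\bigr) \rightsquigarrow G
\quad\text{and}\quad
\sqrt{n}\bigl(\hat\theta_n^{\star} - \hat\theta_n\bigr) \,\big|\, \hat P_n \rightsquigarrow G
\;\Longrightarrow\;
\sqrt{n}\bigl(\phi(\hat\theta_n^{\star}) - \phi(\hat\theta_n)\bigr) \,\big|\, \hat P_n \rightsquigarrow \phi'_{\theta}(G),
\]
% provided phi is Hadamard differentiable at theta.
```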

Figure 1: Extrapolation of E[Y | T = t] under various models (without confounding). The upper three figures illustrate a first-order extrapolation approach, employing a linear fit at the boundary of the support of T. In contrast, the lower three figures depict estimations generated by distinct models: the first utilizes a pre-additive noise model parameterized by neural networks (Shen and Meinshausen, 2024), the second employs smoothing splines (Wood, 2017), and the third utilizes a random forest approach (Breiman, 2001).

Figure 2: Left: Data generated based on the simulations outlined in Section 5.1 with n = 500. Points falling within the set S are identified by a blue square. Right: Estimation of µ(t) using various methods: orange represents the true µ(t), blue depicts our estimate employing the method from Section 4 with 95% confidence intervals, grey illustrates the doubly robust estimation method introduced by Kennedy et al. (2017); Kennedy (2019), red showcases the additive spline estimator described in Bia et al. (2014), dark green demonstrates the approach proposed by Hirano and Imbens (2004) utilizing a GAM outcome model (further details in Galagate and Schafer (2022)), and purple describes the inverse probability of treatment weighting estimator (van der Wal and Geskus, 2011).

Figure 3: Map of meteostations (red) and five river stations (black). Note that the river flows from south to north (with its springs in the mountains).

Figure 4: Estimation of µ(t) using the doubly robust estimator introduced in Kennedy et al. (2017); Kennedy (2019), cut off at the second-largest observation, together with the estimates of µ(111) and ω and their 95% confidence intervals.

Figure 6: The figure illustrates the dependence between T and Y. Note that the correlation between T and Y is 0.14 ± 0.07.

Figure 7: Diagnostics of a linear model fitted to the original data.

Figure 8: Visualisation of the estimated τ(x), obtained using classical quantile regression. Blue points mark the observations above this threshold (points in the set S).

Figure 9: Estimation of the scale parameter in the tail model.

Figure 12: Visualisation of the estimated τ(x), obtained using classical quantile regression. Blue points mark the observations above this threshold (points in the set S).

Figure 14 visualises the dataset.

Figure 14: Visualisation of the estimated τ(x) for Part 2 of the application, obtained using classical quantile regression. Blue points mark the observations above this threshold (points in the set S).

Figure 15: Cross-correlation and cross-extremogram of the precipitation recorded at meteostation M2 and the water discharge at Station 3, both measured on the same day.

Table 1: Estimates of ω (true value −1) with varying dimension of the confounders d = dim(X) and with different distributions of the noise of T. The sample size is n = 5000. The full simulation setup can be found in Appendix B.1.

Table 2: Comparison of the extrapolation performance of various models using the average Absolute Relative Error (ARE) across 100 simulations with varying dimension of the confounders d = dim(X). The interpretation of the values is as follows: if the true value of $\mu(\tilde t)$ is 1, an ARE of 0.17 indicates an approximate typical error of $|\hat\mu(\tilde t) - \mu(\tilde t)| \approx 0.17$.

Table 1 suggests that the results are reasonably accurate as long as d ≤ 50.