Extreme Treatment Effect: Extrapolating Dose-Response Function into Extreme Treatment Domain

Bodik, Juraj

doi:10.3390/math12101556

Open AccessArticle

Extreme Treatment Effect: Extrapolating Dose-Response Function into Extreme Treatment Domain

by

Juraj Bodik

^1,2

¹

Faculty of Business and Economics (HEC Lausanne), University of Lausanne, 1015 Lausanne, Switzerland

²

Expertise Center for Climate Extremes (ECCE), University of Lausanne, 1015 Lausanne, Switzerland

Mathematics 2024, 12(10), 1556; https://doi.org/10.3390/math12101556

Submission received: 25 March 2024 / Revised: 6 May 2024 / Accepted: 12 May 2024 / Published: 16 May 2024

(This article belongs to the Special Issue Computational Statistical Methods and Extreme Value Theory)

Download

Browse Figures

Versions Notes

Abstract

:

The potential outcomes framework serves as a fundamental tool for quantifying causal effects. The average dose–response function

μ (t)

(also called the effect curve) is typically of interest when dealing with a continuous treatment variable (exposure). The focus of this work is to determine the impact of an extreme level of treatment, potentially beyond the range of observed values—that is, estimating

μ (t)

for very large t. Our approach is grounded in the field of statistics known as extreme value theory. We outline key assumptions for the identifiability of the extreme treatment effect. Additionally, we present a novel and consistent estimation procedure that can potentially reduce the dimension of the confounders to at most 3. This is a significant result since typically, the estimation of

μ (t)

is very challenging due to high-dimensional confounders. In practical applications, our framework proves valuable when assessing the effects of scenarios such as drug overdoses, extreme river discharges, or extremely high temperatures on a variable of interest.

Keywords:

causal inference; potential outcomes; extreme value theory; extreme causal effect; dimension reduction

MSC:

62D20; 60G70

1. Introduction

Quantifying causal effects is a fundamental problem in many diverse fields of research [1,2,3,4]. Some prevalent examples include the impact of smoking on developing cancer [5], the influence of education on increased wages [6], the effects of various meteorological factors on precipitation [7] or the effect of policy design on various economy factors [8].

The potential outcomes framework [9] has been the fundamental language to express the notion of the causal effect. The crux of this framework lies in acknowledging that, in any given scenario, multiple potential outcomes exist based on different interventions or exposures [10]. This perspective challenges researchers to consider not only the observed outcome but also the unobserved outcomes that could materialize under alternative conditions. The typical focus in causal inference is on the binary treatment variable (exposure). However, binary treatment is unable to differentiate between different levels of the treatment variable. This issue can be partially solved by assuming a continuous treatment. For example, Refs. [11,12] proposed an estimator based on local linear smoothing. Ref. [13] discussed the combination of parametric and non-parametric models for the effect curve estimation. Refs. [14,15] utilized marginal structural causal models framework. Refs. [16,17] applied neural networks for the effect estimation. However, typical methods that work with a continuous treatment variable are not well suited for the inference that goes beyond the observed range of the data.

In this paper, we are interested in the extreme treatment effect, that is, the quantity of interest is the effect of an extreme level of treatment, outside of the observed range. Consider the following example from medicine: Assume that the data of a study (either randomized or observational) are available to us, with the health status (Y) of patients and the corresponding dose of a medicine administrated (T). The data available only depict Y when

T \leq 20

mg. What if then, we would like to know the change in Y when the dose is increased to

T = 25

mg? Answering this inquiry is hard, since we have zero data to answer it (this might be considered unethical to give such a dose to a patient), and we must rely on strong unverifiable assumptions and extrapolation. Additionally, in the case of observational studies, high-dimensional confounders pose yet another significant challenge.

The connection between causal inference and extreme value theory has been receiving increasing interest. Refs. [18,19] analyze the Extreme Quantile Treatment Effect (EQTE) of a binary treatment on a continuous, heavy-tailed outcome. The paper authored by [20] develops a method to estimate the EQTE and the Extreme Average Treatment Effect (EATE) for continuous treatment. Ref. [21] developed a framework for Granger-type causality in extremes. Some other approaches for causal discovery using the extreme values include [22,23,24,25]. Ref. [26] propose graphical models in the context of extremes. Ref. [27] analyzed the the effect of climate change on weather extremes. Ref. [28] proposed a framework for extreme event propagation. Ref. [29] study probabilities of necessary and sufficient causation as defined in the counterfactual theory using the multivariate generalized Pareto distributions. We contribute to this growing literature and provide a theoretically well-founded approach for estimation and inference of the extremal treatment effect.

Recent advancements in machine learning research have spotlighted the extrapolation capabilities of different models [30,31,32,33]. For instance, ‘engression’, as proposed by [34], presents a framework that serves as an extrapolating alternative to regression-based neural networks. Similarly, Ref. [35] explored extrapolation of conditional expectations by assuming that the maximum derivative occurs within the observed range of support, and while these approaches are not inherently causal, they can be construed as such under certain assumptions. Despite achieving cutting-edge performance, these methods encounter two primary limitations: difficulty handling multiple confounders and reliance on often uninterpretable extrapolation assumptions. In contrast, our framework focuses on the causal aspect of the extrapolation and can handle a large number of confounders. This is achieved under weak assumptions commonly embraced in the extreme value theory. Moreover, our framework relies on strong yet more interpretable extrapolation assumptions; while our primary focus is on linear regression, our approach has the flexibility to integrate with various machine learning methodologies, potentially improving the overall performance. However, this integration may come at the expense of losing interpretability for certain assumptions.

As for the application in this work, we consider a dataset describing extreme precipitation and river discharge levels in Switzerland. A historical record indicates a maximum precipitation level near Zurich’s meteo-station on 6 June 2002, reaching an extreme of 111

\frac{m m}{m^{2}}

. This event coincided with the river Reuss (near Zurich) nearly breaching its banks, causing damage to adjacent settlements. We focus on the following question: how would the river discharge alter if the precipitation on that day were to reach 120

\frac{m m}{m^{2}}

? Would the river breach its banks under such circumstances? We anticipate that the effect of precipitation on river discharge may vary between the body of the distribution and its tail. This anticipation stems from several factors: During periods of light to moderate precipitation, the ground absorbs a significant portion of the rainfall, reducing its contribution to the river flow. In contrast, during severe rainstorms, a larger proportion of the precipitation directly contributes to the river flow, potentially resulting in a more pronounced impact on discharge levels. Therefore, we expect to observe differing, potentially more severe, impacts of extreme precipitation on discharge levels compared to moderate events, highlighting the importance of understanding such dynamics across varying levels of precipitation intensity.

The structure of this paper is as follows: we introduce the notation and preliminaries on causal inference and extreme value theory in Section 2. In Section 3, we present the main assumptions along with some simple theoretical implications. In Section 4, we introduce a practical statistical methodology for estimating an extreme treatment effect from the data. Section 5 explains our methodology using a simple simulated example and discusses simulation results. In Section 6, we explore the application of inferring the effect of extreme precipitation on river discharge levels.

This manuscript includes five appendices: Appendix A introduces a second real-world application regarding the compressive strength of concrete, which has been relocated to the appendix for the sake of brevity. Appendix B contains a detailed simulation study, exploring the methodology under various conditions, including (1) a varying dimension

d i m (X)

, (2) comparison with classical methods from the literature, (3) a hidden confounder, (4) different dependence structures, and (5) varying dose–response functions. Appendix C contains a detailed inference process for the river application described in Section 6. Appendix D provides a more detailed explanation of the bootstrap algorithm used in the inference process and presents the theory behind the consistency result. Finally, proofs can be found in Appendix E.

2. Problem Statement, Notation and Preliminaries

Following [36], we define dose–response functions in the potential outcomes framework. We consider the triplet of

(X, T, Y)

, where

X \in X \subseteq R^{d}

,

T \in T \subseteq R

, and

Y \in Y \subseteq R

denote the confounders, treatment, and response variable, respectively, in an observational causal study. We assume a continuous treatment setting, where

T = (τ_{L}, τ_{R})

for some

τ_{L}, τ_{R} \in \bar{R} : = R \cup {- \infty} \cup {\infty}

. Here,

τ_{R} \in R \cup {\infty}

is the right endpoint of the support of T. For simplicity, assume that the right endpoint of the support of

T ∣ X = x

is equal to

τ_{R}

for all

x \in X

. Let

Y (t)

be a set of potential outcomes corresponding to the hypothetical world in which

T = t

is set deterministically. The fundamental problem of causal inference arise, since in the real world, each individual can only receive one treatment level T, and we only observe the corresponding outcome

Y = Y (T)

.

We observe a random sample

{X_{i}, T_{i}, Y_{i}}_{i = 1}^{n}

of size

n \in N

. It follows from our setting that, given the observed covariates, the distribution of the potential outcome for one unit is assumed to be unaffected by the particular treatment assignment of another unit (Stable Unit Treatment Value Assumption). We utilize the letter H for a (possible) hidden confounder. We denote vectors by bold letters. For any pair of continuous random variables Z and

Z^{'}

, we denote its probability distribution function

P_{Z} (\cdot)

, density function

p_{Z} (\cdot)

and a conditional density

p_{Z ∣ Z^{'}} (\cdot ∣ \cdot)

.

The average dose–response function and patient-specific dose–response function are defined as

μ (t) = E [Y (t)], μ_{x} (t) = E [Y (t) ∣ X = x],

respectively. Although the term “dose” is typically associated with the medical domain, we adopt here the term dose–response learning in its more general setup: estimating the causal effect of a treatment on an outcome across different (continuous) levels of treatment. Our objective is to learn the behavior of

μ (t)

or

μ_{x} (t)

for

t \approx τ_{R}

.

For a pair of real functions

f_{1}, f_{2}

, we employ the following notation:

f_{1} (t) \sim f_{2} (t)

for

t \to τ_{R}

, if

{lim}_{t \to τ_{R}} \frac{f_{1} (t)}{f_{2} (t)} = 1

. The sequence of random variables approaches the same distribution as the sample size grows. In the remainder of the paper, we assume that

μ (t), μ_{x} (t)

are continuous on some neighborhood of

τ_{R}

for all

x

.

2.1. Classical Assumptions

Two classical assumptions in the literature for identifying the average dose–response function are as follows:

Unconfoundedness: Given the observed covariates, the distribution of treatment is independent of the potential outcome. Formally, we have $T ⊥ ⊥ Y (t) | X, \forall t \in T$ , where $⊥ ⊥$ denotes the independence of random variables.
Positivity: $p_{T ∣ X} (t ∣ x) > 0$ for all $t,$ $x$ , where $p_{T ∣ X}$ represents the conditional density function of the treatment given the covariates.

Under these assumptions, [36] showed the identifiability of the dose–response function via

μ_{x} (t) = E [Y | T = t, X = x], and μ (t) = E [μ_{X} (t)] = E [E [Y | T = t, X = x]],

(1)

where the inner expectation is taken over Y, and the outer expectation is taken over

X

.

Even if we are not willing to rely on the unconfoundedness assumption, it may often still be of interest to estimate the function

t \to E [E [Y | T = t, X = x]]

as an adjusted measure of association, defined purely in terms of observed data. It may be interpreted as the average value of Y in a population with exposure fixed at

T = t

, but it is otherwise characteristic of the study population with respect to

X

[11,12,37].

When the positivity assumption is violated, a different type of extrapolation arises [38], which is different from the one considered in this work. This scenario occurs when the distributional support of variable T varies across different levels of confounding variables

X

. Various approaches have been devised to confront this challenge, including propensity thresholding [39].

Several algorithms were proposed to estimate the function

μ (t)

in the body of the distribution of T. State-of-the-art methods estimate

μ (t)

via

\sum_{i : T_{i} \approx t} w_{i} Y_{i}

for appropriate weights

w_{i}

what serve to “erase” the confounding effect of

X

[11,40,41,42,43]. Typically, the estimation of

μ (t)

involves a two-step procedure [5,36,44]. In the first step, we model the distribution

T ∣ X

, also known as the propensity. In the second step, we model the distribution of

Y ∣ T

, suitably adjusted by the propensity, with the aim of mitigating the confounding effect of

X

.

Ref. [36] introduced a generalized propensity score (GPS) defined as

e (t, x) : = p_{T ∣ X} (t ∣ x)

. One common approach is to model

p_{T ∣ X}

using a Gaussian model. In binary treatment cases (when

T = {0, 1}

), the propensity score is a probability denoted as

e (1, X) = P (T = 1 ∣ X)

and is typically modeled using logistic regression. Subsequently, we define weights

w_{i}

as

w_{i} : = \frac{1}{\hat{e} (T_{i}, X_{i})}

or stabilized weights

w_{i} : = \frac{{\hat{p}}_{T} (T_{i})}{e (T_{i}, X_{i})}

, where we additionally model and estimate the marginal distribution of T, denoted as

{\hat{p}}_{T}

.

In a similar vein, Ref. [5] introduced the concept of a “uniquely parameterized propensity function assumption,” which states that for every value of

X

, there exists a unique finite-dimensional parameter

θ \in Θ

such that

e (\cdot ∣ X)

depends on

X

only through

θ (X)

. Since

θ (X)

contains all information about the confounding, we only model

E [Y ∣ T = t, θ (X) = s]

instead of

E [Y ∣ T = t, X = x]

in Equation (1). In a vast majority of applications,

θ (X)

corresponds to the parameters of a normal distribution; to the best of our knowledge, there has been no exploration of the extreme value distribution in this particular context.

2.2. Extreme Value Theory

When dealing with extreme values, it is easy to introduce a strong selection bias. A naive approach for estimating

μ (t)

for large values of t might involve only considering observations where t exceeds a certain threshold, denoted as

τ

, and computing

μ (t)

using conventional techniques, while disregarding all values below this threshold. This is a typical approach of many classical algorithms, which estimate

μ (t)

by focusing solely on a local neighborhood of observations around t. However, this approach can introduce a significant selection bias. In its extreme manifestation, all observations where t exceeds

τ

might exclusively pertain to men, for instance. The selection bias arises if the effect of T on Y differs between men and women (see Figure 2 in Section 5.1 with

τ = 3

). We employ the Extreme Value Theory technique known as peaks-over-threshold to tackle this issue.

Extreme value theory is a sub-field of statistics that explores techniques for extrapolating the behavior (distribution) of T beyond the observed values. A limiting theory posits that the tail of T can be well approximated by the Generalized Pareto Distribution (GPD), as detailed in the following explanation.

Consider a sequence

{(T_{i})}_{i \geq 1}

of independent and identically distributed (iid) random variables with a common distribution F, and

M_{n} = {max}_{i = 1, \dots, n} T_{i}

represents the running maximum. It is well known [45] that if there exists a non-degenerate distribution G such that

\frac{M_{n} - b_{n}}{a_{n}} \overset{D}{\to} G

as

n \to \infty

for some sequences of constants

{a_{n}, b_{n}}_{n = 1}^{\infty} \in R_{+}^{N} \times R^{N}

, then G falls within the Generalized Extreme Value (GEV) distribution family. This can equivalently be expressed using the following definition:

Definition 1

([46]). The distribution F is in the max domain of attraction of a generalized extreme value distribution (notation

F \in M D A (γ)

) if there exist

γ \in R

and sequences of constants

a_{n} > 0, b_{n} \in R, n = 1, 2, \dots

such that

{lim}_{n \to \infty} F^{n} (a_{n} x + b_{n}) = e x p (- {(1 + γ x)}^{- 1 / γ})

for all x satisfying

1 + γ x > 0

. In case

γ = 0

, the right side is interpreted as

e x p (- e^{- x})

. The parameter γ is called the extreme value index (or shape index).

This condition is mild as it is satisfied for most standard distributions, for example, the normal, Student-t and beta distributions. The following crucial theorem states that the tail of T can be well approximated by GPD if the distribution of T belongs to

M D A (γ)

.

Theorem 1

(Theorem 4.1 in [47]). Let

T \sim F \in M D A (γ)

. Then, for large

τ \approx τ_{R}

, there exist

σ > 0, γ \in R

such that the distribution of

T - τ ∣ T > τ

is approximately

G P D (0, σ, γ)

.

GPD distribution has three parameters, namely a location

τ \in R

, scale

σ > 0

and a shape

γ \in R

. Its distribution function takes the following form:

H (x) = \{\begin{matrix} 1 - {(1 + γ \frac{x - τ}{σ})}^{- 1 / γ}, γ \neq 0, \\ 1 - exp (- \frac{x - τ}{σ}), γ = 0, \end{matrix}

defined on the support

[τ - σ / γ, \infty), (- \infty, \infty), (- \infty, τ - σ / γ]

for cases

γ < 0, γ = 0, γ > 0

, respectively. Cases when

γ > 0, γ = 0

, and

γ < 0

correspond to the Fréchet, Gumbel, and Weibull distributions, respectively, [45,48].

Note that when the distribution of

T - τ

given

T > τ

follows a GPD with parameters 0,

σ

, and

γ

, an equivalent assertion can be made that T given

T > τ

follows a GPD with parameters

τ

,

σ

, and

γ

. We denote

θ = {(τ, σ, γ)}^{⊤}

.

Assumption 1.

We assume that the distributions T and

T ∣ X

are in the max domain of attraction of a generalized extreme value distribution.

3. Our Tail Framework

We aim to model the effect of a treatment variable T in the context of extreme values of T. However, it is essential to approach the term ‘extreme’ with caution, considering the discrepancy between real-world implications and the interpretations within our model. Take, for instance, if T represents a drug dose in milligrams; our model operates under the assumption that T tends toward

τ_{R}

, which can be potentially larger than several kilograms. While this mathematical abstraction lacks practical significance—given that administering several kilograms of a drug is physically implausible—the model does include values of T that are arbitrarily large. Of course, we do not claim that our model performs well when T equals several kilograms but only that it performs well for T in the ‘reasonable neighborhood’ of the observed values.

3.1. Assumptions

We are not aiming to estimate the complete

μ (t)

but rather only its values for large t. Therefore, we can relax the classical assumptions for the identification of

μ (t)

; what we specifically need is their tail counterparts.

Assumption 2

(Unconfoundedness in tail). For all

x \in X

, it holds that

E [Y (t) ∣ X = x] \sim E [Y ∣ X = x, T = t] as t \to τ_{R} . (Unconfoundedness in tail)

We always assume the existence of the expected values.

Rather than simply writing

t \to τ_{R}

, we frequently opt for the notation

t (x) \to τ_{R}

to emphasize its dependence on the random variable

X

. Note that Assumption 2 is strictly less restrictive than the Unconfoundedness assumption introduced in Section 2.1.

Remark 1.

To provide some intuition regarding the permissiveness of Assumption 2, we rephrase our framework in the language of structural causal models (SCM, [49]). Assume that the data-generating process of the output Y is as follows:

Y = f_{Y} (T, X, H, ε), ε ⊥ ⊥ (T, X, H) .

Here, H represents a (possible) latent confounder of T and Y. Then, the dose–response function has a form

μ (t) = E [f_{Y} (t, X, H, ε)]

where the expectation is taken with respect to

{(X, H, ε)}^{⊤}

.

Assumption 2 can be rephrased as follows: there exists a function

{\tilde{f}}_{Y}

such that

f_{Y} (t, x, h, e) \sim {\tilde{f}}_{Y} (t, x, e) as t \to τ_{R}, (Unconfoundedness in tail in SCM)

for all admissible values of

x, h, e

. This assumption is valid for example in additive models; that is, when

f_{Y} (t, x, h, e) = {\tilde{f}}_{Y} (t, x, e) + g (h)

for some functions

\tilde{f}, g

.

Additionally, we restate the positivity assumption in the context of its tail counterpart.

Assumption 3

(Positivity in tail).

p_{T ∣ X} (t ∣ x) > 0

for all

x

and all

t > t_{0}

for some

t_{0} \in T

, where

p_{T ∣ X}

represents the conditional density function of the treatment given the covariates.

Note that this assumption is weaker than Assumption 1.

3.2. Adjusting Only for $θ (X)$

The following lemma serves as a tail counterpart of an identifiability for the classical framework. It states that, under Assumptions 2 and 3, the tail of the dose–response function is identifiable from the observational distribution via the propensity function

π_{0} (t, x)

.

Lemma 1

(Identifiability). Under Assumptions 2 and 3, it holds that

μ (t) \sim E {π_{0} (T, X) Y ∣ T = t}, as t \to τ_{R},

(2)

where

π_{0} (t, x) : = \frac{p_{T} (t)}{p_{T ∣ X} (t ∣ x)}

is the (stabilized) propensity function.

Recall that the distribution of

T ∣ X = x

, conditioned on

T > τ (x)

for large

τ (x) \approx τ_{R}

, is approximately GPD with parameters

θ (x) = (τ (x), σ (x), γ (x))

. The following result suggests that instead of conditioning on (potentially high-dimensional) covariates

X

, we only need to condition on

θ (X)

.

Lemma 2.

Under Assumptions 1 and 2, for all

s

in the support of

θ (X)

holds

E [Y (t) ∣ θ (X) = s] \sim E [Y ∣ T = t, θ (X) = s] for t \to τ_{R} .

(3)

Hence,

μ (t) \sim \int E [Y ∣ T = t, θ (x)] p_{θ (X)} (x) d x for t \to τ_{R} .

Lemma 2 suggests that it is sufficient to condition only on

θ (X)

rather than on

X

. This finding is pivotal for dimension reduction, effectively reducing the dimension from

d i m (X)

to at most 3. Nonetheless, this is merely a limiting result, and it introduces an approximation error into the GPD approximation for finite samples.

3.3. Model for the Conditional Expectation of Y Given a T

Under Assumption 2, modeling

μ_{X}

reduces to a statistical modeling of

E [Y ∣ T, X]

. Furthermore, under Assumptions 1 and 2, it reduces to modeling

E [Y ∣ T, θ (X)]

. In principle, a wide range of models can be considered, ranging from simple linear models to non-parametric neural networks. The principle of Occam’s razor suggests that, especially when extrapolating beyond the range of observed values, simpler models often prove to be the most effective choices [50]. The extrapolation capabilities of various models have recently garnered attention in machine learning research. Ref. [34] introduced the ‘engression’ framework as an extrapolating counterpart to regression-based neural networks, and while we build our framework under a linear model for simplicity, it is possible to utilize different models, such as the engression-based ones. Figure 1 illustrates the extrapolating properties of various commonly used models.

A straightforward approach to model

E [Y ∣ T = t, X = x]

under the assumption of linearity-in-the-tail would be assuming an existence of functions

\tilde{α}, \tilde{β}

such that

E [Y ∣ T = t, X = x] \sim \tilde{α} (x) + \tilde{β} (x) t, as t \to τ_{R} .

(4)

Following the notation in Remark 1, this corresponds to assuming

f_{Y} (t, x, h, e) \sim \tilde{α} (x) + \tilde{β} (x) t as t \to τ_{R},

for all admissible values of

h, e

. This assumption is valid for example in additive models where

f_{Y} (t, x, h, e) = \tilde{α} (x) + \tilde{β} (x) t + g (x, h, e)

for some function g. However, using the result from Lemma 2, it is sufficient to condition only on

θ (X)

instead of potentially high-dimensional

X

. Therefore, we introduce the following model assumption:

Assumption 4

(Conditional linearity of tail). There exist functions α and β such that for all

s

in the support of

θ (X)

, the following holds:

E [Y ∣ T = t, θ (X) = s] \sim α (s) + β (s) t, as t \to τ_{R} .

(5)

Such an assumption was explored in various contexts (typically where

θ (X)

represents parameters of a normal distribution, see [5] or Section 2.2.1 in [44]; to the best of our knowledge, the extreme case was not yet explored). We can construct our inference method by estimating

α

and

β

using various machine-learning methodologies. This is discussed in Section 4.

4. Inference and Estimation

Let

{(x_{i}, t_{i}, y_{i})}_{i = 1}^{n}

be the observed data. In the following, we propose a methodology for the estimation of

μ (t)

for

t \approx τ_{R}

under Assumptions 1, 2, and 4.

Consider the following two-step procedure. In the first step, we approximate the tail of

T ∣ X

using GPD (that is, we estimate the location, scale, and shape parameters

θ (X) = (τ (X), σ (X), ξ (X))

). In the second step, we estimate the expectation of Y given a large T conditional on the estimated GPD parameters

\hat{θ} (X)

.

Estimate $θ (x)$ :
- Choose $q \in (0, 1)$ .
- Estimate the covariant-dependent threshold $τ (x)$ using a quantile regression, that is, estimate q-quantile of $T ∣ X = x$ .
- From now on, restrict our inference on the observations from $S : = {i : t_{i} > \hat{τ} (x_{i})}$ .
- Estimate $θ (x)$ in the tail model, that is, estimate $(σ, ξ)$ from the data points in S in the model, where
  
  $T ∣ T > \hat{τ} (x), X = x \sim G P D (\hat{τ} (x), σ (x), ξ (x)) .$
Estimate $μ (t)$ or $μ_{x^{★}} (t)$ using $\hat{θ} (x)$ :
- Estimate $α, β$ in model (5) from the data points in S (that is, we only consider $t > \hat{τ} (x)$ ).
- Return $\hat{μ} (t) : = \frac{1}{n} \sum_{i = 1}^{n} \hat{α} [\hat{θ} (x_{i})] + \hat{β} [\hat{θ} (x_{i})] t$ or ${\hat{μ}}_{x^{★}} (t) : = \hat{α} [\hat{θ} (x^{★})] + \hat{β} [\hat{θ} (x^{★})] t$ .

The first step is a very standard procedure in the extreme-value literature called ‘peak-over-threshold’ [47]; it is standard to assume a constant shape parameter

γ (x) \equiv γ \in R

since in practice, it is untypical for the shape parameter to change with covariates [53,54]. For the estimation of

τ (x), σ (x), α (x), β (x)

, we use either linear parametrization (that is,

τ (x) = τ^{⊤} x, σ (x) = σ^{⊤} x, α (s) = α^{⊤} s, β (s) = β^{⊤} s

for some real coefficients

τ, σ, α, β

and their estimation is carried out via classical maximum likelihood) or non-parametric smooth estimation using splines (GAM, [51]), but any method can be used in practice. In case of a very small sample size, we can also assume a constant scale parameter

σ (x) \equiv σ \in R

in order to reduce the dimension of the estimation.

The choice of q in the first step is a standard issue in extreme value theory [55,56,57]. For theoretical results, q should be growing with the sample size; that is,

q = q_{n}

satisfying

{lim}_{n \to \infty} q_{n} = 1

and

{lim}_{n \to \infty} n (1 - q_{n}) = \infty

. In practical terms, q should be set as high as possible while ensuring that a sufficient amount of data remains above the threshold to maintain good inferential properties. Classical choices include

q = 0.9

,

q = 0.95

, or

q = 0.99

, depending on the size of our dataset.

We utilize the basic bootstrap technique (sometimes also called Efron’s percentile method, see Chapter 23 in [58]) to establish confidence intervals. This involves random sampling, with replacement, from our dataset to generate multiple bootstrap samples, each matching the size of our original dataset. For each bootstrap sample, we calculate the estimate of the statistic

{\hat{μ}}^{★} (t)

. Subsequently, we determine the

α

-percentiles of the re-sampled statistics to derive the confidence intervals. Details can be found in Appendix D.1.

Remark 2.

One must be cautious when interpreting confidence intervals during extrapolation. Generally, estimation of

μ (t)

is subject to two primary sources of bias: (1) bias stemming from model misspecification, and (2) bias arising from estimation variance; while the former bias can be mitigated within the body of the distribution by comparing different models and employing cross-validation, AIC or BIC criteria to select the most suitable model, this approach becomes less reliable in the extremal region. Eliminating this bias necessitates observation of data within the region of interest. The latter bias stemming from estimation uncertainty can be addressed by computing confidence intervals (in our case, via bootstrapping). It is important to note that our bootstrap confidence intervals only account for the latter bias, and consequently, the first type of bias presents a greater challenge during extrapolation since it is, in principle, unquantifiable without additional data.

Theorem 2

(Idea: Precise statements can be found in Appendix D). Assuming that the conditions outlined in either Theorem A1 or Theorem A2 are met, our procedure is consistent. Furthermore, under the assumptions detailed in Theorem A3, the bootstrap confidence intervals are asymptotically consistent at a correct level.

In our procedure, we adopt a practice common in extreme value theory, where we concentrate solely on the extreme observations (set S) while discarding all non-extreme values. This approach stems from the rationale that extreme observations provide the most valuable insights into out-of-support behavior. Utilizing data within the body of the distribution may introduce bias, as these values may exhibit different behavioral patterns. Mathematically, this rationale can be expressed through an examination of the precision of the GPD approximation. This approximation shows high precision exclusively in extreme values while displaying bias and low precision for non-extreme values.

5. Illustration and Experiments

In this section, we assess the performance of our methodology using both a simple illustrative example and experimental data. A comprehensive simulation study is provided in Appendix B.

The quantity of interest in the application presented in Section 6 is the difference

μ (t_{1}) - μ (t_{2})

for

t_{1}, t_{2} < τ_{R}

. Hence, in the simulations, we focus on estimating

ω_{x} : = lim_{t \to \infty} [μ_{x} (t + 1) - μ_{x} (t)] or ω : = lim_{t \to \infty} [μ (t + 1) - μ (t)],

assuming

τ_{R} = \infty

and that the corresponding limits exist. Note that under the linear model (5), the limit exists and corresponds to the parameter

ω_{x} = β [θ (x)]

. Here,

ω_{x}

can be regarded as the tail counterpart of a coefficient

β_{x}

in a linear model

Y = α_{x} + β_{x} T + ε

, where

α_{x}

and

β_{x}

are real coefficients, possibly dependent on

X

.

5.1. Simple Example

The subsequent illustrative example outlines our methodology and the main ideas. Consider a single confounder

X = X_{1} \sim Bernoulli (0.75)

(where

X_{1} = 1

denotes men and

X_{1} = 0

denotes women, for instance). Define

T = X_{1} + ε_{T}

, where

ε_{T} \sim N (0, 1)

(indicating that T generally tends to be larger for men than for women). Let

Y = \{\begin{matrix} T + ε, when X_{1} = 1, T > 1, \\ 2 T + ε, when X_{1} = 0, T > 1, \\ 2 - T + ε, when T \leq 1, \end{matrix}

where

ε \sim N (0, 1)

.

Simple computation gives us

μ (t) = 0.75 t + (1 - 0.75) 2 t = 1.25 t

for any

t > 1

, while

μ (t) = - t + 2

for

t \leq 1

. Consequently, our primary interest lies in estimating the slope

ω = μ (t + 1) - μ (t) = 1.25 for t > 1 .

We generate data as specified with a sample size of

n = 500

. Setting the threshold at

q = 0.9

, we employ the methodology outlined in Section 4 to estimate

ω

. This process is repeated 100 times, yielding a mean and

0.95

quantile of

\hat{ω} = 1.26 \pm 0.39 .

Additionally, we employ the bootstrap technique to calculate confidence intervals, and we obtain a confidence interval of the form

ω \in (0.72, 1.87)

on average. We see from other simulations that these confidence intervals are slightly conservative for

n \leq 1000

but work well for larger sample sizes.

In Figure 2, we present one generated dataset with a sample size of

n = 500

, showcasing various methodologies from the existing literature and their extrapolation efficacy. Notably, classical techniques often exhibit a tendency to underestimate

μ (t)

for t large, primarily due to the fact that only the ‘men’ category (

X = 1

) received

T > 2

.

We conclude with an important remark regarding the sample size: a substantial amount of valuable information is lost when we discard

90 %

of the data by focusing solely on the data in the set S (data above the threshold

τ (x)

). This is the primary reason behind the considerably large confidence intervals and the heightened variability in our estimates. We encounter the inevitable bias-variance trade-off; the inclusion of more data introduces a potential bias, given that the behavior of

μ (t)

differs in the body and in the tail.

5.2. Simulations

We provide a comprehensive discussion of all simulations in detail in Appendix B. In our study, we devised several simulation setups to model diverse scenarios and explore them thoroughly. Specifically, we focused on the following five key scenarios:

Investigating how our method scales with respect to the dimension of the confounders $d = d i m (X)$ .
Comparing our method with classical methods from the literature.
Expanding upon the simple example introduced in Section 5.1, wherein we evaluated performance across various dependence structures (employing different copulas), sample sizes, and a spectrum of causal effects.
Examining the presence of a hidden confounder affecting both T and Y.
Focusing on variations in the function $μ (t)$ .

In this section, we present two key findings from our simulation study. Table 1 illustrates how our methodology scales with varying dimensions of the confounders

d = d i m (X)

. Additionally, Table 2 depicts the comparison of our method with other classical methods from the literature, where we estimated

μ (\tilde{t})

for

\tilde{t} = {max}_{i = 1, \dots, n} t_{i} + 10

.

Table 1 suggests that the results are reasonably accurate as long as

d \leq 25

. As discussed in the Appendix B, the reason for the bias observed in larger dimensions d is that Assumption 1 and Lemma 2 are only asymptotic results, and with higher dimensions d, we require more data for the asymptotic theory for

T ∣ X

to take effect. It is well known that the convergence rate of the maxima of the Gaussian random sample to an extreme value distribution is very slow, whereas it is faster with Exponential or Pareto distribution [63].

Analysis of Table 2 reveals that our method achieves the smallest extrapolation error. Hirano and Imbens [36] method utilizing a GAM outcome model showed surprisingly reasonable performance, while the IPTW method [62] exhibits poor performance due to the quadratic nature of the extrapolating curve. Please note that our method assumes linearity in the tail. If this assumption is not met, our method may produce inferior results. In such instances, alternative regression techniques can be utilized in step 2 of our algorithm instead of linear regression. For example, neural networks, as demonstrated in [34], can be employed to address nonlinear behavior and enhance performance.

6. Application: River Discharge Dataset

Understanding the causal relationship between extreme precipitation and river discharge is crucial for effective water resource management. In this study, we examine how extreme precipitation events impact river discharge. By utilizing a comprehensive dataset spanning various hydro-logical conditions, our research seeks to provide insights into the critical nexus between extreme precipitation dynamics and extreme river discharge events. The data were collected by the Swiss Federal Office for the Environment (https://www.hydrodaten.admin.ch/ (accessed on 15 October 2023)) but were provided by the authors of [23,64], with some useful preliminary insights. We used precipitation data and other relevant measured variables from meteorological stations provided by Swiss Federal Office of Meteorology and Climatology, MeteoSwiss (https://gate.meteoswiss.ch/idaweb/login.do (accessed on 15 October 2023)).

We exclusively examine the discharge levels of the River Reuss, situated near Zurich in Switzerland (Figure 3). We selected this river due to the availability of excellent measurements of its discharge levels, complemented by well-documented weather conditions from nearby meteorological stations and diverse landscape. Our measurements include average daily discharges between January 1930 and December 2014 and daily precipitation in the nearby meteo-stations. To reduce any seasonal effects due to unobserved confounders, we only consider data during June, July and August, as the more extreme observations happen during this period when mountain rivers are less likely to be frozen.

We center our attention on addressing two distinct research questions: one characterized by a straightforward scenario where the ground truth is known, and another presenting a more intriguing challenge.

6.1. Known Ground Truth

We demonstrate our methodology using a straightforward example where the ground truth is known. Consider a pair of river stations, such as stations 2 and 1. Let T represent the water discharge at station 2, Y represent the water discharge at station 1, and

X

denote measurements taken at a nearby meteorological station (including precipitation, humidity, etc.; the full list of confounders can be found in Appendix C). Our objective is to investigate the impact of extreme discharge levels at station 2 on the water discharge observed at station 1. In mathematical terms, we seek to ascertain

μ (t)

or

μ_{x} (t)

for large values of t. In this context, the ground truth is the following:

μ (t_{1} + t_{2}) - μ (t_{2}) = μ_{x} (t_{1} + t_{2}) - μ_{x} (t_{2}) = t_{1} - t_{2}, for all t_{1}, t_{2} \geq 0 and all x \in X .

This can also be explained in words as follows: if we pour

t_{1}

liters of water into the river at station 2 (in causal terminology, we interpret this as an intervention

d o (T = T + t_{1})

), we expect the water discharge at station 1 (Y) to increase by exactly

t_{1}

. Hence,

ω = ω_{x} = 1

. As we will see below, our methodology consistently yields this expected outcome.

We follow the methodology introduced in Section 4 with

q = 0.95

. Detailed steps, diagnostics and preliminary data analysis (just for a pair

2 \to 1

) can be found in Appendix C. The resulting estimates can be found in Table 3. The results are very similar with different choices of q (changing

\hat{ω}

by not more than by

0.1

). We observe that our results align very well with the ground truth (

ω = 1

). However, there is a slight bias evident in the relationships between the pairs

5 \to 3

and

4 \to 3

: this can be attributed to distinct geographical features. Notably, Lake Vierwaldstättersee lies between the pair 5 and 3, which diminishes the influence of 5 on 3. Additionally, a 3238 m Titlis mountain is situated between pair 4 and 3, amplifying the effect of 4 on 3 due to the melting glacier ice, acting as an unmeasured confounding factor. Our methodology relies on Assumptions 1–4, along with some continuity assumptions and the SUTVA assumption discussed in Section 2. Assumptions 1 and 3 are minor and are used frequently when dealing with these types of data. Assumption 2 is a common and challenging aspect of every causal inference methodology; while our assumption is weaker than the classical unconfoundness assumption (requiring no hidden confounder in the tail), complete rejection of the possibility of its violation is unattainable. However, we believe that the meteo-station between a pair of stations can capture the most significant confounders (with the exception of when the lake or mountains are present in between the river stations). Finally, Assumption 4 is a strong assumption that allows us to extrapolate observed values into the extremal region. However, this assumption (or at least some similar model assumptions) are necessary. In this case, the linear assumption is valid, since the underlying ground truth is known.

6.2. Effect of Precipitation on River Discharge

We employ our methodology to address a more complex inquiry where the ground truth is not known. Let us consider water discharge at station 3 (Y), and let T denote the precipitation measured in meteo-station M2. Our focus lies in understanding the impact of extreme precipitation events (T) on the water discharge (Y). As mentioned in the introduction, on 6 June 2002, we recorded a historical maximum precipitation level of

T_{\max} = 111

\frac{m m}{m^{2}}

, coinciding with the scenario when the river nearly breached its banks. Our inquiry centers on the question: how would the river discharge Y alter if T were to reach 120

\frac{m m}{m^{2}}

? In mathematical terminology, we are interested in estimating

μ (120) - μ (111)

or possibly

μ_{x^{★}} (120) - μ_{x^{★}} (111)

, where

x^{★}

are other covariates corresponding to that event. Addressing this question is challenging as we lack data within this extreme regime, necessitating reliance on extrapolation. This task is especially challenging, since we anticipate that the effect of precipitation on river discharge may vary between the body of the distribution and its tail, since the ground absorbs a significant portion of the rainfall during a light rain.

We follow the methodology introduced in Section 4. A straightforward approach would be to define T as precipitation and Y the water discharge on the same day, while choosing appropriate confounders

X

from some measurements at M2. Then, we can use the classical approach for estimating

μ (t)

in the body and our approach to estimate it in the tail. However, some problematic issues arise in this application:

Time issue: $T_{m o n d a y} \to Y_{m o n d a y}$ but also $T_{m o n d a y} \to Y_{t u e s d a y}$ since it takes time for the rain water to reach the river and rain tends to be more frequent around midnight. In fact, correlation (and extreme correlation coefficient as well, see Figure A11) is much higher for a pair ( $T_{m o n d a y}, Y_{t u e s d a y}$ ) than for ( $T_{m o n d a y}, Y_{m o n d a y}$ ). The extreme storm on 6 June 2002 corresponded to extremely high river discharge on 7 June 2002 (where Y was about five times larger than on 6 June 2002). Hence, our interest lies in the effect $T_{m o n d a y} \to Y_{t u e s d a y}$ (that is, we consider $t_{i}$ as precipitation on day i while $y_{i}$ is the discharge on day $i + 1$ ). Additionally, the presence of time introduces an auto-correlation issue. This can be handled by taking for example weekly maxima or discarding consecutive observations within a certain time frame to reduce the auto-correlation effect. Alternatively, applying techniques like time series decomposition, differencing, or using autoregressive models can also mitigate the issue of auto-correlation in the data analysis process. We leave the data unchanged since the temporal dependence is primarily local, spanning only a few days, and does not introduce a substantial bias.
Variable selection issue: choosing appropriate confounders $X$ that act as confounders of Y and T. It is not clear which variables can be safely considered as confounders: if a variable X lie on a path $T \to X \to Y$ , adjusting for X would lead to so-called path-canceling causal effect [65]. Here, we are interested in the so-called total causal effect, so we need to be cautious of which covariates to adjust for. However, not adjusting for a common cause leads to a bias. Moreover, there is often a feedback loop: $X_{i} \leftrightarrow P r e c i p i t a t i o n$ for $X_{i}$ for example humidity or temperature. However, some of the variables can be safely considered as common causes: for example, the temperature on Sunday (the day before measuring precipitation). There is a huge amount of literature for such a variable selection, and we do not aim to comment on this research area—we only provide a full list of chosen confounders in Appendix C.

We estimate two values:

\hat{ω}

which is the tail quantity defined as the difference between

μ (t + 1) - μ (t)

for large t: in our case, how would Y change if it was raining by 1

\frac{m m}{m^{2}}

more on 6 June 2002? Next, we also estimate

\hat{β} = μ (t + 1) - μ (t)

corresponding to the body of the distribution (see Appendix C.2.2 for details on its computation). The resulting estimates can be found in Table 4 and visualization of the

μ (t)

can be found in Figure 4. We observe that the effect of T on Y is larger in the extreme region than in the body of the distribution by a factor of

\frac{3.04}{2.4} \approx 1.25

.

As for the answer to our question ‘how would the river discharge Y alter in station 3 if T were to reach 120

\frac{m m}{m^{2}}

on 6 June 2002’, our results suggest that the water discharge would be larger by about

9 \times 1.62 = 14.5

\frac{m^{3}}{s}

(note that median of Y is

11.2

, and the

95 %

quantile of Y is

51.2

). Would this result in the river overflowing its banks? We cannot definitively say, as we lack the necessary data regarding the volume and contours of the river banks. Moreover, Y represents the daily average of the water discharge; while in order to answer this question, the daily maximum is a better suited variable for answering the question. Nonetheless, this advances us towards a more accurate understanding of the effects and impacts of extreme precipitation events and potentially enhancing statistical inference for hydroelectric power stations located along this river.

7. Conclusions and Future Work

Analyzing the impact of extreme levels of a treatment variable (exposure) is essential for comprehending its effects on diverse systems and populations. In this paper, we introduced a novel framework aimed at estimating the causal effect of extreme treatment values. Leveraging insights from extreme value theory, we enhanced the estimation of the extreme treatment effect. Our framework can handle a substantial number of confounders. Nonetheless, our methodology relies on extrapolation, presenting inherent challenges even in the absence of confounding variables, where the bias stemming from a model misspecification is impossible to quantify. Our framework holds promise for initial assessments of the impact of extreme environmental events, such as the effects of severe storms or droughts on economic damages. Future work may explore the application of our extreme value theory approach to address time-varying effects, a prevalent issue in environmental research.

Funding

This study was supported by the Swiss National Science Foundation.

Data Availability Statement

The code is available in the online repository (https://github.com/jurobodik/Extreme_treatment_effect.git (accessed on 25 March 2024)) or on request from the author, alongside with the data regarding Appendix A. While the data related to the application discussed in Section 6 are not publicly available, they can be accessed through https://www.hydrodaten.admin.ch/ (accessed on 15 October 2023) and https://gate.meteoswiss.ch/idaweb/login.do (accessed on 15 October 2023) after registration or by requesting the used data from the authors of [23].

Acknowledgments

The author would like to thank Valérie Chavez for her mentorship and Mats Stensrud for their insightful discussions.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Application 2—Concrete Compressive Strength

Appendix A.1. Main Analysis

In this section, we delve into a dataset [66] focused on concrete compressive strength.

Concrete serves as a fundamental material in civil engineering, and understanding its compressive strength (denoted as Y and measured in MPa) is paramount for ensuring structural integrity [67]. Concrete comprises ingredients such as cement (

X_{1}

), fly ash (

X_{2}

), water (

X_{3}

), superplasticizer (

X_{4}

), and blast furnace slag (T) (among some other additions). The units of

T, X_{1}, X_{2}, X_{3}, X_{4}

are in kilograms in a

m^{3}

mixture. The concrete compressive strength (Y) exhibits a highly nonlinear relationship with these ingredients and the elapsed time. Our focus is on exploring the effect of blast furnace slag (T) on compressive strength (Y). It is well-established that increasing the quantity of T can enhance Y, yet an excessive amount of T may lead to a decrease in Y.

In our dataset

X_{1}, X_{2}, X_{3}, X_{4}

may affect the decision of how much T was used (engineers often decide about the quantity of T based on the looks of the mixture of other ingredients). Our dataset contains

n = 1030

instances of observational data

{x_{i}, t_{i}, y_{i}}_{i = 1}^{n}

where

x_{i} = {(x_{1, i}, \dots, x_{4, i})}^{⊤}

. The range

({min}_{i} y_{i}, {max}_{i} y_{i}) = (2.3, 82.5)

and

({min}_{i} t_{i}, {max}_{i} t_{i}) = (0, 359)

.

Suppose we fit a linear model

E Y = β_{0} + β_{T} T + β_{1} X_{1} + β_{2} X_{2} + β_{3} X_{3} + β_{4} X_{4}

; then, a least square estimation of the coefficient

β_{T}

leads to

{\hat{β}}_{T} = 0.08 \pm 0.006

. This can be (wrongly) interpreted as ‘adding one additional kg of T in

m^{3}

mixture increases Y by

0.08 M P a

’. We expect different behavior for small and large values of T, and we expect strong (nonlinear) interactions between the covariates, and more importantly, this result is derived from the body of the distribution, while we are interested in values of T above the observed ones.

Our objective is to quantify the effect of an extreme amount of blast furnace slag T on Y. Specifically, we answer the following questions:

Given a concrete mixed with $T = 359$ and $X = x$ for some specific value of $x$ , if we intervene and change T to $T = 400$ , what effect on concrete compressive strength can we expect? Using the potential outcome notation, the quantity of interest is $μ_{x} (400) - μ_{x} (359)$ . Note that $m a x_{i = 1, \dots, n} t_{i} = 359$ (we do not observe the blast furnace slag larger than 359, and there is no observation in the interval $(220, 359)$ ), and hence, we have zero data in such an extreme region. We aim to answer this question for a choice $x = x^{★}$ where $x_{1}^{★} = 239, x_{2}^{★} = 0, x_{3}^{★} = 185, x_{4}^{★} = 0$ (the observation corresponding to $T_{i} = 359$ ).
How would an extreme increase in T change Y for an ‘average’ concrete (on a population level, i.e., integrating over the covariates)? Using the potential outcome notation, the quantity of interest is $μ (400) - μ (359)$ .

We follow the methodology introduced in Section 4 with

q = 0.9

. Detailed steps, diagnostics and preliminary data analysis can be found in Appendix A.2. The resulting estimates are as follows:

\begin{matrix} {\hat{μ}}_{x^{★}} (400) - {\hat{μ}}_{x^{★}} (359) & = - 4.1 \pm 3.0, & \hat{μ} (400) - \hat{μ} (359) & = - 4.5 \pm 2.6, \\ {\hat{ω}}_{x^{★}} & = - 0.1 \pm 0.07, & \hat{ω} & = - 0.11 \pm 0.06 . \end{matrix}

The results are similar with a different choices of q (see Table A1). In summary, the results suggest that for a mixture of concrete with covariates

X = x^{★}

and

T = 359

, intervening on T and changing it to

T = 400

would decrease the concrete compressive strength by about 4.1 MPa. On the population level, increasing T from 359 to 400 would lead to a decrease in the concrete compressive strength by about 4.5 MPa. The

95 %

confidence intervals suggest that this estimate can be inaccurate by about 3 MPa; however, one must be cautious about the interpretation of the confidence intervals, since they are in general unreliable when extrapolating, see Remark 2.

Figure A1 graphically shows the estimation of

\hat{μ} (t)

in the body using the method introduced in [11,59], as well as our estimation of

\hat{μ} (t)

for extreme values.

In the Appendix A.4, we discuss the assumptions made regarding our results. In brevity, our methodology uses Assumptions 1–4 (among some continuity assumptions and SUTVA assumption discussed in Section 2). While we argue that Assumptions 1 and 3 are minor, we can not disregard the possibility of a hidden confounder (Assumption 2). The validity of Assumption 2 has to be further argued by an expert knowledge. Finally, the strongest assumption is Assumption 4, as its violation leads to the most significant bias. However, this assumption (or a similar assumption using different model) is necessary when extrapolating, and it is hypothetically testable by measuring values with

T \approx 400

. Appendix A.3 also discusses the differences for the range of choices of q.

Figure A1. Black: The estimation of

μ (t)

using the doubly robust estimator introduced in [11,59] together with

95 %

confidence intervals. Green: Quantiles of T. Blue: Our estimation of

μ (t)

for values

t = 359, 400

for

q = 0.9

, together with the

95 %

confidence intervals for the slope. Red:

95 %

confidence interval for

μ (359)

.

Figure A1. Black: The estimation of

μ (t)

using the doubly robust estimator introduced in [11,59] together with

95 %

confidence intervals. Green: Quantiles of T. Blue: Our estimation of

μ (t)

for values

t = 359, 400

for

q = 0.9

, together with the

95 %

confidence intervals for the slope. Red:

95 %

confidence interval for

μ (359)

.

Appendix A.2. Detailed Computations of the Estimates

Some data visualization can be found in Figure A2 and Figure A4. In the following, we provide detailed descriptions of the steps undertaken in the application for the specific choice of

q = 0.9

. First, we estimate

τ (x)

using a classical quantile regression [68]. We observe that all covariates are highly significant, and the diagnostic plots do not show any significant problems (except the fact that for many observations,

T_{i} = 0

): we illustrate the estimation on Figure A5, where points above the

90 %

threshold (points in the set S) are marked.

In the next step, we routinely estimate

θ (x)

(using

evgam

function in

evgam

package [69] using the following code: evgam(list( $T_{e} \sim s (X 1_{e}) + s (X 2_{e}) + s (X 3_{e}) + s (X 4_{e}), \sim 1), d a t a = d a t a . f r a m e (T_{e}, X_{e})$ , family = “gpd”) where

T_{e}

are datapoints in S (above the estimated

90 %

threshold)). More precisely, we assume fixed

ξ (x) = ξ \in R

and only estimate

σ (x)

as a smooth function of the covariates; Figure A3 shows the estimated values of

σ (x)

on a log scale.

Finally, following the expression

E [Y ∣ T = t, X = x] = α [\hat{θ} (x)] + β [\hat{θ} (x)] t

, we estimate

α, β

from the data points in S using

gam

function [51]. Under Assumptions 1–4, we obtain

{\hat{μ}}_{x^{★}} (400) = \hat{α} [\hat{θ} (x)] + \hat{β} [\hat{θ} (x)] 400

, and in effect, we return

{\hat{ω}}_{x^{★}} = {\hat{μ}}_{x^{★}} (400) - {\hat{μ}}_{x^{★}} (359) = \hat{β} [\hat{θ} (x^{★})] (400 - 359) = - 4.1 .

Regarding the second question (population level), we simply take the average

\hat{μ} (t) : = \frac{1}{n} \sum_{i = 1}^{n} \hat{α} [\hat{θ} (x_{i})] + \hat{β} [\hat{θ} (x_{i})] t

and compute

\hat{ω} = \hat{μ} (400) - \hat{μ} (359) = \frac{1}{n} \sum_{i = 1}^{n} \hat{β} [\hat{θ} (x_{i})] (400 - 359) = - 4.5 .

Regarding the confidence intervals, we resample the data using the following code:

resampled_data = sample_n (data, size = length (y), replace = TRUE) .

Then, we follow the same steps as above and estimate the coefficients (from the resampled dataset). We repeat this procedure 500 times. Finally, we take the

95 %

quantile out of all computed resampled coefficients. For example,

{\hat{ω}}_{x^{★}} = - 6.1 \pm 3.2

represents the fact that the

95 %

quantile was

- 6.1 + 3.2 = - 2.9

, and hence, only

5 %

of values were larger than

- 2.9

.

Figure A2. The figure illustrates the dependence among T and Y. Note that the correlation between T and Y is

0.14 \pm 0.07

.

Figure A2. The figure illustrates the dependence among T and Y. Note that the correlation between T and Y is

0.14 \pm 0.07

.

Figure A3. Estimation of scale.

Figure A4. Diagnostics of a linear model fitted into the original data.

Figure A5. Visualization of the estimation of

τ (x)

estimated using classical quantile regression. Blue points characterize the observations above this threshold (points in the set S).

Figure A5. Visualization of the estimation of

τ (x)

estimated using classical quantile regression. Blue points characterize the observations above this threshold (points in the set S).

Appendix A.3. Discussion about the Results Regarding Different Threshold q

Table A1 shows the results for different choices of q. Even though they yield distinct estimates for

\hat{ω}

, the confidence intervals overlap, and the values

\hat{ω} \in (- 6.4, - 1.9)

are encompassed by all of them. This suggests some stability in the choice of q.

Table A1. Estimates of

ω_{x^{★}} : = μ_{x^{★}} (400) - μ_{x^{★}} (359)

for different thresholds q, together with the corresponding

95 %

confidence intervals.

Table A1. Estimates of

ω_{x^{★}} : = μ_{x^{★}} (400) - μ_{x^{★}} (359)

for different thresholds q, together with the corresponding

95 %

confidence intervals.

$q = 0.85$	$q = 0.9$	$q = 0.95$
${\hat{ω}}_{x^{★}} = - 6.1 \pm 3.2$	${\hat{ω}}_{x^{★}} = - 4.1 \pm 3.0$	${\hat{ω}}_{x^{★}} = - 2.8 \pm 2.8$
$\hat{ω} = - 5.3 \pm 4.0$	$\hat{ω} = - 4.5 \pm 2.6$	$\hat{ω} = - 3.3 \pm 3.1$

The selection of q reflects the bias–variance tradeoff; as we increase q, our inference relies on values closer and closer to

T = 400

(datapoints with small and intermediate

T_{i}

can bias our estimation since in this region, increasing

T_{i}

can increase Y). However, increasing q also means disregarding more and more datapoints, and our estimate will have less power and larger variance.

The challenge of choosing q is a common problem in extreme value theory, and the rule of thumb is to select q as large as possible while maintaining an adequate number of datapoints above the

1 - q

quantile to ensure reasonably good inference.

Appendix A.4. Discussion about the Assumptions

Our methodology relies on Assumptions 1–4, along with some continuity assumptions and the SUTVA assumption discussed in Section 2. Below, we provide a detailed discussion of each assumption.

Assumptions 1 and 3 are considered minor. As mentioned in Section 2, Assumption 1 is satisfied for most common distributions, and similar model assumptions are imposed in almost all applications utilizing extreme value theory. Assumption 3 appears to be satisfied, as there is no specific range of values in the support of T that has zero probability of occurring.
Assumption 2 is a common and challenging aspect of every causal inference methodology. While our assumption is weaker than the classical unconfoundness assumption (requiring no hidden confounder in the tail), complete rejection of the possibility of its violation is unattainable. A potential hidden confounder could be the ‘quality of ingredients’. If the quality is low, engineers might tend to use excessive amounts of T in the mixture, potentially leading to spurious dependence between large T and low Y. However, in this case, it seems plausible that this hidden dependence due to low ingredient quality does not introduce a substantial bias. Expert knowledge is required to ensure the validity of this assumption.
Assumption 4 is a strong assumption that allows us to extrapolate observed values into the extremal region. However, this assumption (or at least some similar model assumptions) is necessary; estimating $μ (400)$ from observed values is not feasible otherwise. In essence, Assumption 4 asserts that the relationship between T and Y (given other confounders) is linear in the unobserved region below $T = 400$ . Since there is no other reason to believe that this relationship has any particular form, a linear assumption seems to be the most suitable choice. Although this assumption is strong, it is hypothetically possible to test by measuring values with $T \approx 400$ .

Appendix B. Simulations

In this section, we create various simulations setups to assess the performance of our methodology.

Appendix B.1 provides insight into how our method scales with the dimension of the confounders $d i m (X)$ .
Appendix B.2 compares our method with classical methods from the literature.
Appendix B.3 extends the simple example presented in Section 5.1, evaluating performance across different dependence structures (various copulas), sample sizes, and a range of causal effects.
Appendix B.4 addresses a scenario involving a hidden confounder affecting both T and Y.
Appendix B.5 focuses on variations in the function $μ (t)$ and assesses the extent to which our method can extrapolate $μ (t)$ into the ‘extreme’ region.

In some of the simulations, we use the following function:

μ_{x} (t) = \{\begin{matrix} 5 - s l o p e (x) (t - c) for t \geq c, \\ 5 + s l o p e (x) (t - c) for t < c, \end{matrix}

(A1)

where typically

s l o p e (x) = | x |

and

c \in R

is a hyper-parameter. Graphical visualization of the function

μ_{x}

for

x = 3

can be found in Figure A6. In other words,

μ_{x}

grows with slope x until c and then declines with slope x.

Figure A6. Function

μ_{x} (t)

with parameters

c = 2

and

s l o p e (x) = 3

.

Figure A6. Function

μ_{x} (t)

with parameters

c = 2

and

s l o p e (x) = 3

.

Appendix B.1. Simulations with a High Dimensional X

In this simulations we consider

X = (X_{1}, \dots, X_{d})

where the dimension d is potentially large. Consider the following data-generating process:

Let $a_{1}, \dots, a_{d} \overset{i i d}{\sim} N (1, 1)$ and $b_{1}, \dots, b_{d} \overset{i i d}{\sim} N (- 1, 1)$ be fixed numbers at the beginning of the simulations.
Consider $X$ being centered Gaussian vector with $c o r (X_{i}, X_{j}) = 0.1$ for all $i \neq j$ and $v a r (X_{i}) = 1$ .
Let $T = \sum_{i = 1}^{d} a_{i} X_{i} + ε_{T}$ , where $ε_{T}$ is distributed according to either $N (0, 10)$ , $E x p (\frac{1}{10})$ , or $P a r e t o (1, 1)$ .
Let $Y = μ_{x} (T) + \sum_{i = 1}^{d} b_{i} X_{i} + ε_{Y}$ , where $μ_{x} (t)$ is defined in (A1) with hyper-parameters $c = s l o p e (x) = 1$ and where $ε_{Y} \sim N (0, 1)$ .

This data generating process leads to

μ (t + 1) - μ (t) = - s l o p e (x) = - 1

for

t \geq c

and

μ (t + 1) - μ (t) = + 1

for

t < c

. Consequently, our primary interest lies in estimating

ω = μ (t + 1) - μ (t) = - 1

for

t \geq c

.

Note that a simple linear regression

Y \sim T + X_{1} + \dots + X_{d}

leads to a biased estimation of the effect of T, since the effect is different for large and for small values of T. However, simply discarding values where

T_{i} < 1

leads to a selection bias.

We generate data as specified with a sample size of

n = 5000

. Setting the threshold at

τ = 0.95

, we employ the methodology outlined in Section 4 to estimate

ω

. Specifically, we utilize linear parametrization of the parameters in the estimation procedure. This process is repeated 100 times. The mean of the estimates

\hat{ω}

together with

95 %

quantile for various values of d and distributions of the noise variables can be found in Table 1.

Table 1 illustrates that with a sample size of

n = 5000

, the results are reasonably accurate as long as

d \leq 50

. The reason for the bias observed in larger dimensions d is that Assumption 1 and (3) are only asymptotic results, and with higher dimensions d, we require more data for the asymptotic theory for

T ∣ X

to take effect. It is well known that the convergence rate of the maxima of the Gaussian random sample to an extreme value distribution is very slow, whereas it is faster with the Exponential or Pareto distribution [63]. With a large dimension d, we also observe a more pronounced effect of the estimation error accumulated in the first step on the second step of the algorithm.

Appendix B.2. Comparison with Classical Methods

We evaluate our extrapolation method by comparing it with several state-of-the-art techniques from the existing literature. Specifically, we assess the performance of four methods: the doubly robust estimation method introduced by Kennedy et al. [11,59], the additive spline estimator proposed by Bia et al. [60], the approach suggested by Hirano and Imbens [36] employing a GAM outcome model (taken from [61]), and the inverse probability of treatment weighting (IPTW) estimator by VanderWal et al. [62].

Our analysis employs the same simulation setup described in Appendix B.1, utilizing exponentially distributed noise variables (other distributions yield similar results). After generating

{(x_{i}, t_{i}, y_{i})}_{i = 1}^{n}

, we estimate

μ (\tilde{t})

, where

\tilde{t} = {max}_{i = 1, \dots, n} (t_{i}) + 10

, using all aforementioned methods. Subsequently, we compute the absolute relative error (ARE) defined as

A R E = | \frac{\hat{μ} (\tilde{t}) - μ (\tilde{t})}{μ (\tilde{t})} | .

This procedure is repeated 100 times, and the average of the obtained ARE values is presented in Table 2.

Analysis of Table 2 reveals that our method achieves the smallest extrapolation error. Conversely, the IPTW method [62] exhibits poor performance due to the quadratic nature of the extrapolating curve. Kennedy et al.’s method [11,59] typically produces a constant extrapolation curve, and while both Bia et al.’s [60] and the HI method [36] yield estimates that are reasonably close to the expected values, they also exhibit a notable degree of variability.

Please note that our method assumes linearity in the tail. If this assumption is not met, our method may produce inferior results. In such instances, alternative regression techniques can be utilized instead of linear regression. For example, neural networks, as demonstrated in [34], can be employed to address nonlinear behavior and enhance performance.

Appendix B.3. Dependence, Sample Size and the Causal Effect

In the following, we conduct simulations based on a model with covariates

X = (X_{1}, X_{2}, X_{3})

that function as a common cause of both T and Y. The details of the simulation are as follows:

$X$ is generated with standard Gaussian margins and a Gumbel copula with a parameter $α$ (where $α$ represents the degree of dependence [70]; $α = 1$ corresponds to independence; and $α \to \infty$ corresponds to full dependence; see Figure A7 for an illustration with $α = 2$ ).
T is generated in such a way that the marginal distribution of T follows an exponential distribution with a scale parameter of 1, and the dependence structure between $X$ and T follows a Gumbel copula with parameter $α$ .
The response variable Y is generated as follows:

$Y = \{\begin{matrix} \frac{1}{2} ω T + f (X) + ε, when X_{1} > 0, T > 1, \\ \frac{3}{2} ω T + f (X) + ε, when X_{1} \leq 0, T > 1 \\ - 10 T + 15 + f (X) + ε, when T \leq 1, \end{matrix}$

(A2)

where f is a randomly generated smooth function (to randomly generate a d-dimensional function, we use the concept of the Perlin noise generator [71]; for more details, refer to the supplementary package, and readers can conceptualize this as a function ranging from quadratic to linear), $ε \sim N (0, 1)$ , and $ω$ is a hyper-parameter that we vary in our simulations. Figure A7 shows one realization of such a dataset.

Please note that

μ (t) = ω t + E f (X)

for any

t > 1

. As such, our primary focus lies in estimating the slope

μ (t + 1) - μ (t) = ω

. We generate data with varying parameters, including

ω

,

α

, and the sample size n. Employing the method outlined in Section 4, we estimate

ω

across a spectrum of data-generating processes. For

n = 1000

, we set the threshold at

τ = 0.9

, and for

n > 1000

, we use

τ = 0.95

. This process is repeated 100 times, and the mean and

0.95 %

quantile are presented in Table A2. The numbers in the brackets represent the mean of the estimated bootstrap confidence intervals. Ideally, these intervals should align with the

0.95 %

quantile.

The findings indicate that the methodology performs as anticipated in this simulation study: augmenting the sample size enhances the estimation, whereas elevating

α

(heightening the influence of the covariates) degrades the accuracy of the estimation. We observe that the bootstrap confidence intervals align relatively well with the actual

95 %

quantiles.

Figure A7. The figures illustrate the dependence among

X_{1}

, T, and Y, generated based on the simulations outlined in Appendix B.3 with a dependence parameter of

α = 2

and

ω = 5

. Points falling within the set S are identified by a blue square.

Figure A7. The figures illustrate the dependence among

X_{1}

, T, and Y, generated based on the simulations outlined in Appendix B.3 with a dependence parameter of

α = 2

and

ω = 5

. Points falling within the set S are identified by a blue square.

Table A2. Resulting estimates of parameter

ω = μ (t + 1) - μ (t), t > 1

from Appendix B.3. Parameter

α

represents the dependence between

X

, T. The notation

\hat{ω} = a \pm b (\pm c)

represent the following: given 100 estimations of

\hat{ω}

, a is the mean, b is the

95 %

quantile, and c is the (average)

95 %

quantile computed using the bootstrap technique.

Table A2. Resulting estimates of parameter

ω = μ (t + 1) - μ (t), t > 1

from Appendix B.3. Parameter

α

represents the dependence between

X

, T. The notation

\hat{ω} = a \pm b (\pm c)

represent the following: given 100 estimations of

\hat{ω}

, a is the mean, b is the

95 %

quantile, and c is the (average)

95 %

quantile computed using the bootstrap technique.

		$ω = 0$	$ω = 1$	$ω = 10$
$α = 1$	$n = 1000$	$\hat{ω} = - 0.05 \pm 0.33 (\pm 0.45)$	$\hat{ω} = 0.94 \pm 0.60 (\pm 0.55)$	$\hat{ω} = 9.86 \pm 2.99 (\pm 3.01)$
	$n = 5000$	$\hat{ω} = - 0.01 \pm 0.28 (\pm 0.25)$	$\hat{ω} = 0.98 \pm 0.29 (\pm 0.39)$	$\hat{ω} = 10.09 \pm 1.91 (\pm 2.04)$
	$n = 10, 000$	$\hat{ω} = 0.00 \pm 0.18 (\pm 0.16)$	$\hat{ω} = 0.99 \pm 0.21 (\pm 0.21)$	$\hat{ω} = 9.95 \pm 1.62 (\pm 1.59)$
$α = 1.5$	$n = 1000$	$\hat{ω} = 0.33 \pm 0.92 (\pm 0.96)$	$\hat{ω} = 1.37 \pm 0.98 (\pm 1.13)$	$\hat{ω} = 10.95 \pm 3.77 (\pm 3.38)$
	$n = 5000$	$\hat{ω} = 0.24 \pm 0.52 (\pm 0.68)$	$\hat{ω} = 0.97 \pm 0.29 (\pm 0.25)$	$\hat{ω} = 10.90 \pm 2.22 (\pm 2.43)$
	$n = 10, 000$	$\hat{ω} = 0.13 \pm 0.28 (\pm 0.49)$	$\hat{ω} = 1.20 \pm 0.46 (\pm 0.55)$	$\hat{ω} = 10.72 \pm 1.42 (\pm 1.76)$
$α = 2$	$n = 1000$	$\hat{ω} = - 0.17 \pm 1.19 (\pm 1.34)$	$\hat{ω} = 0.99 \pm 1.01 (\pm 1.31)$	$\hat{ω} = 11.14 \pm 3.32 (\pm 3.59)$
	$n = 5000$	$\hat{ω} = 0.03 \pm 0.66 (\pm 0.85)$	$\hat{ω} = 1.05 \pm 1.01 (\pm 0.99)$	$\hat{ω} = 11.15 \pm 2.91 (\pm 2.58)$
	$n = 10, 000$	$\hat{ω} = - 0.09 \pm 0.50 (\pm 0.61)$	$\hat{ω} = 0.96 \pm 0.59 (\pm 0.66)$	$\hat{ω} = 10.70 \pm 2.02 (\pm 1.83)$

Appendix B.4. Simulations with a Hidden Confounder

Consider a similar simulations setup as in Section 5.1 but with a hidden confounder. Consider an observed confounder

X = X_{1} \sim Bernoulli (0.75)

and a hidden confounder

H \sim N (1, 1)

. Let

δ \in R

and define

T = δ H + X_{1} + ε_{T}

, where

ε_{T} \sim N (0, 1)

. Note that

δ

represents the effect of a hidden confounder. Let

Y = \{\begin{matrix} δ H + \frac{2}{3} ω T + ε, when X_{1} = 1, T > 1, \\ δ H + \frac{6}{3} ω T + ε, when X_{1} = 0, T > 1, \\ δ H + 3 - 2 T + ε, when T \leq 1, \end{matrix}

(A3)

where

ε \sim N (0, 1)

. A simple computation leads to

μ (t) = 0.75 \frac{2}{3} ω t + (1 - 0.75) \frac{6}{3} ω t = ω t

for any

t > 1

, while

μ (t) = - 2 t + 3

for

t \leq 1

. Consequently, our primary interest lies in estimating

ω

for

t > 1

.

We generate data as specified with a sample size of n. Setting the threshold at

τ = 0.9

, we employ the methodology outlined in Section 4 to estimate

ω

. This process is repeated 100 times. The estimates

\hat{ω}

for a range of values of

δ

and

ω

and n can be found in Table A3.

The results in Table A3 suggest that a hidden confounder does not bias our estimate unless its strength is very large. Indeed, Remark 1 suggests that Assumption 2 is still valid since the hidden confounder enters the equality in an additive way.

Table A3. Resulting estimates of

ω = μ (t + 1) - μ (t), t > 1

from Appendix B.4, together with

95 %

quantile. Parameter

δ

represent the strength of a hidden confounder.

Table A3. Resulting estimates of

ω = μ (t + 1) - μ (t), t > 1

from Appendix B.4, together with

95 %

quantile. Parameter

δ

represent the strength of a hidden confounder.

		$ω = 0$	$ω = 5$	$ω = 10$
$δ = 0$	$n = 1000$	$\hat{ω} = 0.05 \pm 0.22$	$\hat{ω} = 5.03 \pm 0.31$	$\hat{ω} = 10.02 \pm 0.4$
	$n = 5000$	$\hat{ω} = - 0.01 \pm 0.17$	$\hat{ω} = 4.97 \pm 0.18$	$\hat{ω} = 9.97 \pm 0.20$
	$n = 10, 000$	$\hat{ω} = 0.01 \pm 0.12$	$\hat{ω} = 4.99 \pm 0.12$	$\hat{ω} = 9.97 \pm 0.14$
$δ = 1$	$n = 1000$	$\hat{ω} = 0.55 \pm 0.27$	$\hat{ω} = 5.60 \pm 0.32$	$\hat{ω} = 10.48 \pm 0.33$
	$n = 5000$	$\hat{ω} = 0.53 \pm 0.12$	$\hat{ω} = 5.50 \pm 0.18$	$\hat{ω} = 10.51 \pm 0.18$
	$n = 10, 000$	$\hat{ω} = 0.50 \pm 0.08$	$\hat{ω} = 5.48 \pm 0.09$	$\hat{ω} = 10.46 \pm 0.16$
$δ = 5$	$n = 1000$	$\hat{ω} = 0.96 \pm 0.04$	$\hat{ω} = 5.98 \pm 0.22$	$\hat{ω} = 10.91 \pm 0.21$
	$n = 5000$	$\hat{ω} = 0.96 \pm 0.03$	$\hat{ω} = 5.95 \pm 0.07$	$\hat{ω} = 10.94 \pm 0.16$
	$n = 10, 000$	$\hat{ω} = 0.96 \pm 0.025$	$\hat{ω} = 5.94 \pm 0.04$	$\hat{ω} = 10.92 \pm 0.08$
$δ = 10$	$n = 1000$	$\hat{ω} = 0.98 \pm 0.06$	$\hat{ω} = 5.96 \pm 0.15$	$\hat{ω} = 10.94 \pm 0.30$
	$n = 5000$	$\hat{ω} = 0.99 \pm 0.025$	$\hat{ω} = 5.98 \pm 0.07$	$\hat{ω} = 10.97 \pm 0.14$
	$n = 10, 000$	$\hat{ω} = 0.99 \pm 0.02$	$\hat{ω} = 5.97 \pm 0.04$	$\hat{ω} = 10.95 \pm 0.09$
$δ = 50$	$n = 1000$	$\hat{ω} = 0.99 \pm 0.01$	$\hat{ω} = 5.99 \pm 0.14$	$\hat{ω} = 10.97 \pm 0.28$
	$n = 5000$	$\hat{ω} = 0.99 \pm 0.01$	$\hat{ω} = 5.99 \pm 0.07$	$\hat{ω} = 10.98 \pm 0.15$
	$n = 10, 000$	$\hat{ω} = 1.0000 \pm 0.004$	$\hat{ω} = 5.97 \pm 0.04$	$\hat{ω} = 10.95 \pm 0.08$

Appendix B.5. Simulations with Varying Extremal Region

In the following simulations, we explore variations in the function

μ (t)

and analyze the corresponding estimations

\hat{μ} (t)

for large values of t.

Consider the following data-generating process:

\begin{matrix} X = ε_{X}, ε_{X} \sim N (0, 1) \\ T = X + ε_{T}, ε_{T} \sim t_{ν}, \\ Y = μ_{X} (T) + ε_{Y}, ε_{Y} \sim N (0, 1), \end{matrix}

Student’s t distribution with

ν

degrees of freedom. Note that if

ν \to \infty

, we obtain a Gaussian distribution, where

μ_{x} (t)

is defined in (A1). If c is too large, we only observe the region where

μ_{x}

grows, and hence, our estimation tends to be larger than the true value.

With varying c and

ν

, we estimate the parameter

\begin{matrix} ω = μ (t + 1) - μ (t) = - E | X | = - 0.798 for t \geq c . \end{matrix}

If we fit a linear model

E Y = β_{0} + β_{T} T + β_{X} X

, the estimate

{\hat{β}}_{T}

tends to be positive (depending on c and

ν

, for example, if

c = ν = 5

, then

{\hat{β}}_{T} = 0.58 \pm 0.02

). Using our methodology, we estimate

\hat{ω}

as in the previous simulations. The resulting numbers are presented in Table A4. We observe that if c grows, our estimate becomes more biased as the data above the threshold still fall below

T < c

. Specifically, if

ν = \infty

, only

0.2 %

of data points have

T > 5

, making the behavior of

μ (t)

above

t > c

challenging to estimate. Note that the degrees of freedom

ν

correspond to the heavy-tailness of T; smaller

ν

values lead to more extreme values of T. Conversely, if

ν = \infty

, T follows a Gaussian distribution. Heavier tails of T lead to better estimates.

Table A4. Estimates of

\hat{ω}

with varying c and

ν

. Note that true

ω = - E | X | \approx - 0.79

.

Table A4. Estimates of

\hat{ω}

with varying c and

ν

. Note that true

ω = - E | X | \approx - 0.79

.

True $ω \approx - 0.79$ .	$c = 1$	$c = 2$	$c = 5$	$c = 10$
$ν = \infty$	$\hat{ω} = - 0.75 \pm 0.15$	$\hat{ω} = - 0.33 \pm 0.13$	$\hat{ω} = 0.63 \pm 0.13$	$\hat{ω} = 0.71 \pm 0.38$
$ν = 5$	$\hat{ω} = - 0.78 \pm 0.18$	$\hat{ω} = - 0.72 \pm 0.15$	$\hat{ω} = - 0.13 \pm 0.15$	$\hat{ω} = 0.6 \pm 0.33$
$ν = 2$	$\hat{ω} = - 0.77 \pm 0.23$	$\hat{ω} = - 0.76 \pm 0.22$	$\hat{ω} = - 0.7 \pm 0.22$	$\hat{ω} = - 0.51 \pm 0.19$

Appendix C. River Data Application

Appendix C.1. Simple Illustration with Known Ground Truth

We used the following set of confounders:

$X_{1} =$ Total precipitation (daily);
$X_{2} =$ Total precipitation during the previous 7 days;
$X_{3} =$ Daily maximum of air temperature 2 m above ground;
$X_{4} =$ Daily maximum of relative air humidity 2 m above ground;
$X_{5} =$ Daily mean of vapor pressure 2 m above ground;
$X_{6} =$ Daily maximum of pressure reduced to sea level;
$X_{7} =$ Daily total of reference evaporation from FAO.

For the pair

2 \to 1

, we considered measurements from meteo-station M1 (station code MURI, AG), and for the remaining pairs, we used measurements from M2 (a station with a code LUZ). However, variables

X_{5}, X_{6}, X_{7}

were also measured in station M2 also for the pair

2 \to 1

, since some values were missing, and M2 has a much longer time period of measurements. All of these covariates can be safely considered as common causes of T and Y, and no feedback loop is present. For modeling of

θ (X)

, we used linear parametrization, that is,

τ (X) = c o n s t + \sum_{i = 1}^{7} β_{i, τ} X_{i}

and

σ (X) = c o n s t + \sum_{i = 1}^{7} β_{i, σ} X_{i}

, where the parameters

β_{i, τ}

were estimated using quantile regression (in R using quantreg package) and the parameters

β_{i, σ}

were estimated using evgam package. We fixed

ξ (X)

to be constant.

In the following, we focus on the pair

2 \to 1

; for other pairs of stations, the results were similar. In the modeling of

θ (X)

,

X_{3}

and

X_{5}

were not significant at the

0.05

level (note that for an estimation of

ω

, we do not care which covariates were significant since the function

θ (X)

is more or less unchanged and adding non-significant covariates only slightly increases variance in the estimation). Using non-parameteric GAM estimation of the parameters did not change the final estimation much (from

\hat{ω} = 1.03

to

\hat{ω} = 0.99

). Estimation of

τ (X)

is also plotted in Figure A8. Using the linear model in this case does not seem to be very wrong, see plot Figure A9, where except for the normality violation, the model seems to fit quite well (in the body of the distribution).

Figure A8. Visualization of the estimation of

τ (x)

, estimated using classical quantile regression. Blue points characterize the observations above this threshold (points in the set S).

Figure A8. Visualization of the estimation of

τ (x)

, estimated using classical quantile regression. Blue points characterize the observations above this threshold (points in the set S).

Figure A9. Diagnostics of a model

Y \sim X_{1} + \dots + X_{7}

.

Figure A9. Diagnostics of a model

Y \sim X_{1} + \dots + X_{7}

.

Appendix C.2. Effect of Precipitation on River Discharge

Figure A10 visualizes the dataset.

Figure A10. Visualization of the estimation of

τ (x)

for Part2 in the application, estimated using classical quantile regression. Blue points characterize the observations above this threshold (points in the set S).

Figure A10. Visualization of the estimation of

τ (x)

for Part2 in the application, estimated using classical quantile regression. Blue points characterize the observations above this threshold (points in the set S).

Appendix C.2.1. Choice of Variables

We used the following set of variables:

$Y =$ River discharge on day $i + 2$ ;
$T =$ Precipitation in the corresponding meteo-station on day $i + 1$ ;
$X_{1} =$ Total (sum) precipitation during the previous 7 days (days $i, i - 1, \dots, i - 6$ );
$X_{2} =$ Daily maximum of Air temperature 2 m above ground on day i;
$X_{3} =$ Daily maximum of Relative air humidity 2 m above ground on day i;
$X_{4} =$ Daily maximum of Pressure reduced to sea level on day i;
$X_{5} =$ Daily total of Reference evaporation from FAO on day i.

Here, i spans from 1.6.1930 up to 29.8.2014 (recall that we only considered the summer months). The choice of Y and T was addressed in the main text: since typically the auto-correlation peaking when T represents the day prior to Y (as illustrated in Figure A11, as well as its extreme counterpart extremogram [72]). The rationale behind choosing

X_{1}

is straightforward—precipitation over the preceding days emerges as a significant confounding factor affecting both Y and T. Regarding additional variables, we opted for those deemed relevant and with reliable measurements across meteorological stations, all of which were recorded on the preceding day.

Is there a common cause between Y and T that remains unaccounted for? Variables

X_{2}

through

X_{5}

, measured on day

i + 1

, could serve as potential common causes for both Y and T. For instance, a sudden temperature change might elevate the likelihood of intense rainfall, while alterations in river discharge could stem from specific soil characteristics affected by the temperature change. Nevertheless, we contend that the majority of these variables require more than a day to manifest their effects, which we believe are largely encapsulated by our chosen variables.

Figure A11. Cross-correlation and cross-extremogram of precipitation recorded at meteo-station M2 and water discharge at Station 3, both measured on the same day.

Appendix C.2.2. Computation of $\hat{β}$

In Table 4, we introduced a variable

\hat{β}

that represents the effect of precipitation on the river discharge level in the body of T. This can be defined in several ways:

Using the method introduced in [11,59], we estimate $\hat{μ} (t + 1) - \hat{μ} (t)$ for $t = E (T)$ .
Using a very straightforward approach where we model the data generating process of Y using a linear structural equation model $Y = c + β_{T} T + β_{X} X + ε$ and return the least square estimate of ${\hat{β}}_{T}$ .

Coincidentally, both approaches return a very similar value of

\hat{β}

, and hence, it does not matter which approach we use (values in Table 4 are from using the second approach).

Appendix D. Consistency, Bootstrap and Its Asymptotics

In this section, we give a more detailed description of the bootstrap algorithm and a more precise statement of Theorem 2 together with its proof. Theorem A1 presents the consistency of

\hat{μ} (t)

for large t, while Theorem A2 shows the consistency of

{\hat{ω}}_{x^{★}}

under different assumptions. Note that using the notation from Section 4 and Section 5,

\begin{matrix} {\hat{ω}}_{x^{★}} & = lim_{t \to \infty} {\hat{μ}}_{x^{★}} (t + 1) - {\hat{μ}}_{x^{★}} (t) = lim_{t \to \infty} \hat{α} [\hat{θ} (x^{★})] + \hat{β} [\hat{θ} (x^{★})] (t + 1) - \hat{α} [\hat{θ} (x^{★})] - \hat{β} [\hat{θ} (x^{★})] t \\ = \hat{β} [\hat{θ} (x^{★})] . \end{matrix}

(A4)

Appendix D.1. Bootstrap

In what follows, we explain in detail the procedure for an estimator

{\hat{ζ}}_{α}

satisfying

P (ω_{x^{★}} \leq {\hat{ζ}}_{α}) \geq 1 - α, α \in (0, 1) .

We only focus on the upper confidence interval, the lower and both-sided intervals can be carried out analogously. Our approach is standard and [58] provides a good overview.

Let

P_{n}

be the empirical distribution of the observations

Z_{i} : = (X_{i}, T_{i}, Y_{i}), i = 1, \dots, n

. We draw a random sample

(Z_{1}^{★}, \dots Z_{n}^{★}) \overset{i i d}{\sim} P_{n}

, and we compute the parameter

{\hat{ω}}^{★}

from

(Z_{1}^{★}, \dots Z_{n}^{★})

the same way as we compute

\hat{ω}

from

(Z_{1}, \dots, Z_{n})

. We define

{\hat{ζ}}_{α}

as the upper

α

-quantile of

{\hat{ω}}^{★}

, that is, the smallest value

x = {\hat{ζ}}_{α}

that satisfies

P ({\hat{ω}}^{★} - \hat{ω} \leq x ∣ P_{n}) \geq 1 - α .

The notation

P (\cdot | P_{n})

indicates that the distribution of

{\hat{ω}}^{★}

must be evaluated assuming that the observations are sampled according to

P_{n}

given the original observations. In particular, in the preceding display

\hat{ω}

is to be considered nonrandom.

It is almost never possible to calculate the bootstrap quantiles exactly [58]. In practice, these estimators are approximated by a simulation procedure. A large number of independent bootstrap samples

Z_{1}^{★}, \dots, Z_{n}^{★}

are generated according to the estimated distribution

P_{n}

. Each sample gives rise to a bootstrap value

{\hat{ω}}^{★}

. Finally, the bootstrap quantiles are estimated by the empirical quantiles of these bootstrap values. This simulation scheme always produces an additional (random) error in the coverage probability of the resulting confidence interval. In principle, this error can be made arbitrarily small by using a sufficiently large number of bootstrap samples. Therefore, the additional error is usually ignored in the theory of the bootstrap procedure. This section follows this custom and concerns the “exact” quantiles, without taking the simulation error into account.

Appendix D.2. Simplifying Assumptions

We simplify some steps in the inference process in order to simplify the proof of the consistency. In particular, we assume the following:

(A): (Causality justification) Consider Assumptions 1, 2 and 4 to be valid.
(B): (Step 2 convergence) $E Y^{2} < \infty$ , ${E | | X | |}^{2} < \infty$ and $(X, T)$ satisfy Grenander conditions (this is a minor assumption assuring that the matrix of observations have a full rank with probability tending to one. See Table 4.2 in [73]).
(C): (Step 1 convergence) We assume that conditions R1, R2, and R3 from [74] are satisfied. That is, $E ({XX}^{⊤})$ is positive semi-definite, $X$ has a compact support with existing and finite quantile densities $\frac{\partial F_{U}^{- 1} (τ | x)}{\partial τ}$ , $\frac{\partial F_{U}^{- 1} (τ)}{\partial τ}$ where $U = T - τ_{l i n}^{⊤} X$ , $τ_{l i n} \in R^{d}$ .
(D): (Linearity) Assume that functions $θ$ , $α, β$ are linear, functions $σ, ξ$ are constant and that we employ linear regression for the estimation of the parameters.

In particular, following the notation in Section 4 and using the notation

τ (x) = τ_{l i n}^{⊤} x

, our algorithm is as follows:

Choose $q \in (0, 1)$ ,
(Step 1) Estimate $τ_{l i n} \in R^{d}$ by minimizing ${\hat{τ}}_{l i n} \in a r g m i n_{b} \sum_{i = 1}^{n} h_{q} (T_{i} - X_{i}^{⊤} b)$ where $h_{q} (x) = x (q 1_{x \geq 0} - (1 - q) 1_{q < 0})$ .
(Step 2) We estimate $α, β \in R$ using least squares in a model

$E [Y ∣ T = t, τ (X) = {\hat{τ}}_{l i n}^{⊤} x] = α {\hat{τ}}_{l i n}^{⊤} x + β {\hat{τ}}_{l i n}^{⊤} x t,$

(A5)

from the data-points in S (that is, we only consider $t > {\hat{τ}}_{l i n}^{⊤} x$ ). Using $R$ language we run the following code: fit = lm(Y ∼ s +s: $T_{S}$ , data = data.frame(s, $T_{S}$ )), where $s = {\hat{τ}}_{l i n}^{⊤} X_{S}$ , $T_{S} = {T_{i} : i \in S}$ and $X_{S} = {X_{i} : i \in S}$ .
We output ${\hat{ω}}_{x^{★}} = \hat{β} {\hat{τ}}_{l i n}^{⊤} x^{★}$ (see (A4)).

Remark A1.

Assumption C implies consistency of

{\hat{τ}}_{l i n}

(under the assumption that q is chosen as a function of the sample size n, denoted as

q = q_{n}

satisfying

{lim}_{n \to \infty} q_{n} = 1

and

{lim}_{n \to \infty} n (1 - q_{n}) = \infty

), see Theorem 5.1 in [74]. This assumption can be simplified by directly assuming consistency of

{\hat{τ}}_{l i n}

.

Appendix D.3. Consistency

We present two consistency results. One concerns the consistency of

\hat{μ} (t)

under general non-linear assumptions but under a neglecting the GPD approximation error. The second result describes the consistency of

{\hat{ω}}_{x^{★}}

under linear assumptions presented in Appendix D.2.

Theorem A1

(Consistency). Consider Assumptions 1, 2, and 4 to be valid.

Assume that θ, α, and β are continuous functions, and suppose we employ consistent estimators for θ, α, and β. For instance, the Generalized Additive Model (GAM) estimator [51] has been shown to be consistent under specific smoothness conditions.
Let $q \in (0, 1)$ be chosen such that the distribution of $T ∣ T > τ_{q} (x), X = x$ follows $G P D (θ (x))$ for all $x \in X$ , where $X = s u p p (X)$ is assumed to be compact.

Under these conditions, our estimator is consistent in the sense that for all

t \in T

\hat{μ} (t) \overset{P}{\to} \tilde{μ} (t) as n \to \infty,

where

\tilde{μ}

is a function that satisfies

\tilde{μ} (t) \sim μ (t)

as

t \to τ_{R}

.

The second assumption outlined in Theorem A1 is introduced to address certain technical hurdles that arise when dealing with a quantile q which varies with the sample size n. Broadly speaking, when q is not fixed, the statistical framework becomes considerably more intricate, making the task of demonstrating the consistency of a quantile regression notably challenging. Please note that while the distribution of

T ∣ T > τ_{q} (x), X = x

converges to a Generalized Pareto Distribution (

G P D (θ (x))

) in the limit for large q, the exact validity is limited to special cases, such as when

T ∣ X

follows a Pareto distribution. However, by selecting q to be sufficiently large, one can mitigate this issue, effectively reducing the disparity between the distributions of

T ∣ T > τ_{q} (x), X = x

and

G P D (θ (x))

to insignificance. This theorem, therefore, provides valuable insight into the general consistency of the model, despite the idealized nature of the assumption.

Note that Theorem A1 can be reformulated for

μ_{X}

analogously.

The subsequent theorem does not necessitate a fixed q; however, it presupposes the linearity in the models for T and Y.

Theorem A2

(Consistency). Under Assumptions A, B, C, and D, where q is chosen as a function of the sample size n, denoted as

q = q_{n}

satisfying

{lim}_{n \to \infty} q_{n} = 1

and

{lim}_{n \to \infty} n (1 - q_{n}) = \infty

, our estimator

{\hat{ω}}_{x^{★}}

is consistent. That is,

{\hat{ω}}_{x^{★}} - ω_{x^{★}} \overset{P}{\to} 0, as n \to \infty .

Proof of Theorem A1.

The proof is very straightforward. Lemma 2 shows that

μ (t) \sim \int_{X} E [Y ∣ T = t, θ (x)] p_{θ (X)} (x) d x for t \to τ_{R} .

Assumption 4 allows us to rewrite (correctness of this step follows directly from Lemma A1 by considering

f (t, x) = E [Y ∣ T = t, θ (x)]

and

g (t, x) = α (θ (x)) + β (θ (x)) t

)

\int_{X} E [Y ∣ T = t, θ (x)] p_{θ (X)} (x) d x \sim \int_{X} [α (θ (x)) + β (θ (x)) t] p_{θ (X)} (x) d x : = \tilde{μ} (t) for t \to τ_{R} .

Since

θ

,

α

, and

β

are continuous and their estimators are consistent, we obtain that for all

t \in T

, it holds that

\int_{X} [\hat{α} (\hat{θ} (x)) + \hat{β} (\hat{θ} (x)) t] p_{θ (X)} (x) d x \overset{P}{\to} \tilde{μ} (t) as n \to \infty .

Moreover, from the law of large numbers, it holds that

\hat{μ} (t) - \int_{X} [\hat{α} (\hat{θ} (x)) + \hat{β} (\hat{θ} (x)) t] p_{θ (X)} (x) d x \overset{P}{\to} 0, as n \to \infty .

Together, we obtain

\hat{μ} (t) \overset{P}{\to} \tilde{μ} (t), as n \to \infty,

where the function on the right side is tail-equivalent with

μ (t)

, which is what we wanted to show. □

Lemma A1.

Let

X

be compact set and

τ_{R}

be the right endpoint of

T \subseteq R

. Let

f, g : T \times X \to R

be continuous functions such that for all

x \in X

holds

f (t, x) \sim g (t, x)

as

t \to τ_{R}

. Let F be a continuous distribution function. Then,

\int_{X} f (t, x) d F (x) \sim \int_{X} g (t, x) d F (x), a s t \to τ_{R} .

Proof.

Let

ε > 0

. Find

t_{0}

such that for all

t > t_{0}

and for all

x \in X

holds

1 - ε < \frac{f (t, x)}{g (t, x)} < 1 + ε

. Then, for any

t > t_{0}

, it holds that

\frac{\int_{X} f (t, x) d F (x)}{\int_{X} g (t, x) d F (x)} < \frac{\int_{X} f (t, x) d F (x)}{\int_{X} \frac{1}{1 + ε} f (t, x) d F (x)} = 1 + ε

and analogously

\frac{\int_{X} f (t, x) d F (x)}{\int_{X} g (t, x) d F (x)} > 1 - ε

. The proof is finished by sending

ε \to 0

. □

Proof of Theorem A2.

Idea: We assume that the GPD approximation and linear model approximations are correct up to a factor of ε: we argue that for a large n this is correct. Next, we use Theorem 5.1 in [74] to show consistency of

\hat{τ}

. We use Theorem 4.4 in [73] to show consistency of

\hat{β}

together with linearity of the least square estimate (to show that it does not depend on the inaccuracy of the estimate of

\hat{τ}

). Finally, we use Lemma 2 and send

ε \to 0

.

Proof: Let

ε > 0

. We claim that it is possible to find

t < τ_{R}

and

n_{0} \in N

such that for all

x \in X

and all

n \geq n_{0}

, the following five statements hold with arbitrarily large probability:

$t < F_{T ∣ X = x}^{- 1} (q_{n})$ (In other words, $q_{n_{0}}$ is large enough such that the $q_{n}$ -quantile of $T ∣ X$ is larger than t.);
It holds that

$1 - ε < \frac{μ_{x^{★}} (t)}{E [Y ∣ T = t, τ (X) = τ (x^{★})]} < 1 + ε,$

where $τ (x) = τ_{l i n}^{⊤} x$ is the $q_{n_{0}}$ -quantile of $T ∣ X = x$ ;
It holds that

$1 - ε < \frac{E [Y ∣ T = t, τ (X) = τ (x^{★})]}{α τ (x^{★}) + β τ (x^{★}) t} < 1 + ε;$
$| | {\hat{τ}}_{l i n} - τ_{l i n} | | < ε$ ;
$| \hat{β} - β | < ε$ ;

where

{\hat{τ}}_{l i n} = a r g m i n_{b} \sum_{i = 1}^{n_{0}} h_{q_{n_{0}}} (T_{i} - X_{i}^{⊤} b)

is the maximum likelihood estimator, and

β

is the real coefficient in the model

E [Y ∣ T = t, T > τ_{l i n}^{⊤} x, X = x] = α τ_{l i n}^{⊤} x + β τ_{l i n}^{⊤} x t,

(A6)

and

\hat{β}

is the corresponding least square estimate.

We prove the following bullet-points here:

The first bullet-point is a trivial consequence of the assumption $q_{n} \to 1$ .
The second bullet-point is a trivial consequence of Lemma 2 together with Assumption D.
The third bullet-point is a trivial consequence of Assumptions 4 and D;
The fourth bullet-point follows from a well-known consistency of ${\hat{τ}}_{l i n}$ . It is well known that for a fixed quantile q, the maximum likelihood estimator ${\hat{τ}}_{l i n} = a r g m i n_{b} \sum_{i = 1}^{n} h_{q} (T_{i} - X_{i}^{⊤} b)$ is consistent and even asymptotically normal (see, e.g., Theorem 4.1 in [68], noting that we assume continuous T and finite second moments of $X$ ). However, quantile q is not fixed and is increasing with the sample size with the speed ${lim}_{n \to \infty} q_{n} = 1$ and ${lim}_{n \to \infty} n (1 - q_{n}) = \infty$ . This is a well-known generalization of quantile regression known as ‘intermediate order regression quantiles’ or ‘moderately extreme quantiles’ [75] and is as consistent and asymptotically normal under Assumption C (see Theorem 5.1 in [74]).
The fifth bullet-point: For a moment, fix $τ_{l i n} \neq 0$ . It is an elementary knowledge that the estimation of $β$ using least squares in a model (A6), where $τ_{l i n}$ is fixed, consistent, and even asymptotically normal under conditions $v a r (Y) < \infty$ , ${E | | X | |}^{2} < \infty$ , $(X, T$ ) satisfying Grenander conditions and the sample-size $| S | = : k_{n} = n (1 - q_{n}) \to \infty$ (see, e.g., Lemma A2). Observe that least squares estimate $\hat{β}$ is linear in $τ_{l i n}$ , that is, if we express $\hat{β}$ explicitly, we obtain $\hat{β} = τ_{l i n}^{⊤} \hat{\tilde{β}}$ , where $\tilde{β}$ is a coefficient in a linear model corresponding to (A9) (where T is assumed to be larger than $τ^{⊤} X$ implicitly). Finally, using this observation, we can replace the fixed value of $τ_{l i n}$ by a random ${\hat{τ}}_{l i n}$ , and we still obtain $\hat{β} = {\hat{τ}}_{l i n}^{⊤} \hat{\tilde{β}}$ . Since by increasing n we can make $\hat{\tilde{β}}$ arbitrarily accurate with arbitrarily large probability, the same holds for $\hat{β}$ . In the following paragraph, we present an an illustration of the linearity of $\hat{β}$ in $τ_{l i n}$ for $d = 1$ . An explicit expression of $\hat{β}$ as a function of $τ_{l i n}$ and our data is the following:

$[\begin{matrix} \hat{α} \\ \hat{β} \end{matrix}] = {(M^{⊤} M)}^{- 1} M^{⊤} Y_{S}, w h e r e M = [\begin{matrix} τ_{l i n} x_{1} & τ_{l i n} x_{1} t_{1} \\ \dots & \dots \\ τ_{l i n} x_{k} & τ_{l i n} x_{k} t_{k} \end{matrix}]$

(A7)

where $Y_{S} = {(Y_{1}, \dots, Y_{k})}^{⊤}$ , WLOG let $S = {1, \dots, k_{n}} \subset {1, \dots, n}$ . Note that

$M = τ_{l i n} d i a g (x_{1}, \dots, x_{k}) [\begin{matrix} 1 & t_{1} \\ \dots & \dots \\ 1 & t_{k} \end{matrix}] = τ_{l i n} \tilde{M},$

(A8)

where $\tilde{M}$ is the data matrix corresponding to a model (A9).

Combining all the bullet-points, we obtain that with arbitrarily large probability that

\begin{matrix} μ_{x^{★}} (t) & \approx E [Y ∣ T = t, τ (X) = τ (x)] \\ \approx α τ (x) + β τ (x) t \\ \approx \hat{α} {\hat{τ}}_{l i n}^{⊤} x^{★} + \hat{β} {\hat{τ}}_{l i n}^{⊤} x^{★} t, \end{matrix}

where each sign ‘≈’ represent equality up to a factor of

ε

(in either multiplicative or additive form) which is negligible as

ε \to 0

. This implies consistency, Quod erat demonstrandum.

□

Lemma A2.

Consider an estimate

(\hat{α}, \hat{\tilde{β}})

of

α \in R, \tilde{β} \in R^{d}

using least squares in a model

E [Y ∣ T = t, X = x] = α x + {\tilde{β}}^{⊤} x t,

(A9)

based on a random sample

(Y_{1}, T_{1}, X_{1}), \dots (Y_{k}, T_{k}, X_{k})

. Then,

\hat{\tilde{β}}

is consistent and asymptotically normal if

v a r (Y) < \infty

and

{E | | X | |}^{2} < \infty

and

(X, T

) satisfy Grenander conditions.

Proof can be found in Theorem 4.4 in [73].

Appendix D.4. Bootstraps Correctness

We use results from Chapter 23.2 in [58]. The main step is to use the delta method for bootstrap (Theorem 23.5 in [58]) and the fact that regression models in step 1 and step 2 of our algorithm are ‘bootstrappable’ (Theorem 3 in [76,77]).

To simplify some steps of the proof, we assume the following:

E.: Assume that ${\hat{ω}}_{x^{★}}$ is consistent (which holds for example under assumptions A,B,C,D).
F.: We compute ${\hat{τ}}_{l i n}$ from the first $⌊ \frac{n}{2} ⌋$ data points, and we compute $\hat{β}$ from the remaining $⌈ \frac{n}{2} ⌉$ data points.
G.: In the computation of the set S, we assume that $τ$ is known and non-random; that is, $S = {i \leq n : T_{i} > τ (X_{i})}$ instead of $S = {i \leq n : T_{i} > \hat{τ} (X_{i})}$ .
H.: Assumption of Theorem 3 in [76] are satisfied; that is, $E [X X^{⊤}]$ is non-singular matrix, the conditional density of $Y - τ^{⊤} X$ given $X$ , denoted as f, satisfies $f (ϵ ∣ X) > r_{1}$ whenever $| ϵ | \leq r_{2}$ for some positive numbers $r_{1}$ , $r_{2}$ . Finally, there exists some function G such that $f (ϵ ∣ X) \leq G (X)$ for all $ϵ$ and $E [{(1 + G (X)) | | X | |}^{2}] < \infty .$

Theorem A3.

Assume validity of assumptions D,E,F,G and H. Let

q \in (0, 1)

be chosen such that the distribution of

T ∣ T > τ_{q} (x), X = x

follows

G P D (θ (x))

for all

x \in X

. Then,

{\hat{ζ}}_{α}

is asymptotically consistent; that is,

l i m i n f_{n \to \infty} P (ω_{x^{★}} \leq {\hat{ζ}}_{α}) \geq 1 - α .

Proof.

We will show that both

\frac{1}{\sqrt{n}} ({\hat{ω}}_{x^{★}} - ω_{x^{★}})

and

\frac{1}{\sqrt{n}} ({\hat{ω}}_{x^{★}}^{★} - {\hat{ω}}_{x^{★}})

given

P_{n}

both converge to the same distribution, say G. That is,

\frac{1}{\sqrt{n}} ({\hat{ω}}_{x^{★}} - ω_{x^{★}}) \overset{D}{\to} G \overset{D}{\leftarrow} \frac{1}{\sqrt{n}} ({\hat{ω}}_{x^{★}}^{★} - {\hat{ω}}_{x^{★}})

as

n \to \infty

. This directly implies (see, e.g., Lemma 23.3 in [58]) that

{\hat{ζ}}_{α}

is asymptotically consistent.

(Observation 1)

{\hat{τ}}_{l i n}

satisfies that

\frac{1}{\sqrt{n}} ({\hat{τ}}_{l i n} - τ_{l i n})

and

\frac{1}{\sqrt{n}} ({\hat{τ}}_{l i n}^{★} - {\hat{τ}}_{l i n})

given

P_{n}

both converge to the same Gaussian distribution (see Theorem 3 in [76]).

(Observation 2)

\hat{β} = {\hat{τ}}_{l i n}^{⊤} \hat{\tilde{β}}

, where

\tilde{β}

is a coefficient in a linear model corresponding to (A9) (where T is assumed to be larger than

τ (X)

implicitly since we assumed that

τ

is known and non-random in S). Note that

\hat{\tilde{β}} ⊥ ⊥ {\hat{τ}}_{l i n}

since

\hat{\tilde{β}}

is computed from the second half of the dataset and its computation does not contain

{\hat{τ}}_{l i n}

. However, we know that

\hat{\tilde{β}}

satisfies that

\frac{1}{\sqrt{n}} (\hat{\tilde{β}} - \tilde{β})

and

\frac{1}{\sqrt{n}} ({\hat{\tilde{β}}}^{★} - \hat{\tilde{β}})

given

P_{n}

both converge to the same Gaussian distribution (Theorem 2 in [78] or [77]).

Together: Since

{\hat{τ}}_{l i n}

is ‘bootstrappable’ and

\hat{\tilde{β}}

is ‘bootstrappable’ and they are independent, the delta method give us that

{\hat{ω}}_{x^{★}}

is ‘bootstrappable’. More formally, we use Theorem 23.5 in [58] (Delta method for bootstrap). Define

ϕ : R^{2 d} \to R : ϕ (a, b) = (a^{⊤} b) (a^{⊤} x^{★})

. Note that

{\hat{ω}}_{x^{★}} = ϕ ({\hat{τ}}_{l i n}, \hat{\tilde{β}})

. Since

{\hat{τ}}_{l i n}

and

\hat{\tilde{β}}

satisfy the conditions of the theorem, we obtain that

\frac{1}{\sqrt{n}} ({\hat{ω}}_{x^{★}} - ω_{x^{★}})

and

\frac{1}{\sqrt{n}} ({\hat{ω}}_{x^{★}}^{★} - {\hat{ω}}_{x^{★}})

given

P_{n}

both converge to the same distribution. That is what we wanted to show.

□

Appendix E. Proofs of Lemmas 1 and 2

Proof of Lemma 1.

A simple computation gives us

\begin{matrix} E [Y (t)] & = E (E [Y (t) ∣ X]) \sim E (E [Y ∣ X, T = t]) \\ = \int_{X} \int_{Y} p_{Y ∣ X, T} (y, x, t) p_{X} (x) y d y d x \\ = \int_{X} \int_{Y} \frac{p_{T} (t)}{p_{T ∣ X} (t ∣ x)} p_{Y, X ∣ T} (y, x ∣ t) y d y d x \\ = E {π_{0} (T, X) Y ∣ T = t} . \end{matrix}

□

Proof of Lemma 2.

From Assumption 2, we have that

E [Y (t) ∣ X] \sim E [Y ∣ T = t, X] as t \to τ_{R} .

(A10)

On both sides of (A10), we condition on

θ (X)

and integrate over the remaining

X

(denoted as

θ^{C} (X)

, formally it is an orthogonal complement). Note that the distribution of

T ∣ θ (X)

approaches the distribution of

T ∣ X

, given

T > τ (X)

for sufficiently large

τ (X)

, since it approaches

G P D (θ (X))

. Hence,

P_{θ^{C} (X) ∣ T = t, θ (X)}

approaches the distribution

P_{θ^{C} (X) ∣ θ (X)}

as

t \to τ_{R}

.

We obtain the following:

\begin{matrix} E [Y (t) ∣ θ (X)] & = \int E [Y (t) ∣ θ (X), θ^{C} (X) = w] d P_{θ^{C} (X) ∣ θ (X)} (w) \\ \sim \int E [Y (t) ∣ θ (X), θ^{C} (X) = w] d P_{θ^{C} (X) ∣ T = t, θ (X)} (w) \\ = \int E [Y ∣ T = t, θ (X), θ^{C} (X) = w] d P_{θ^{C} (X) ∣ T = t, θ (X)} (w) \\ = E [Y ∣ T = t, θ (X)] . \end{matrix}

The second statement in the Lemma trivially follows from the first by integrating over

θ (X)

. □

References

Rosenbaum, P.R.; Rubin, D.B. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 1983, 70, 41–55. [Google Scholar] [CrossRef]
Holland, P.W. Statistics and Causal Inference. J. Am. Stat. Assoc. 1986, 81, 945–960. [Google Scholar] [CrossRef]
Robins, J.M.; Hernández-Díaz, S.; Brumback, B. Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology 2000, 11, 550–560. [Google Scholar] [CrossRef] [PubMed]
Imai, K.; King, G.; Stuart, E.A. Misunderstandings Between Experimentalists and Observationalists about Causal Inference. J. R. Stat. Soc. Ser. A Stat. Soc. 2008, 171, 481–502. [Google Scholar] [CrossRef]
Imai, K.; van Dyk, D.A. Causal Inference With General Treatment Regimes. J. Am. Stat. Assoc. 2004, 99, 854–866. [Google Scholar] [CrossRef]
Heckman, J.J.; Humphries, J.E.; Veramendi, G. Returns to Education: The Causal Effects of Education on Earnings, Health, and Smoking. J. Political Econ. 2018, 126, 197–246. [Google Scholar] [CrossRef] [PubMed]
Hannart, A.; Naveau, P. Probabilities of Causation of Climate Changes. J. Clim. 2018, 31, 5507–5524. [Google Scholar] [CrossRef]
Low, H.; Meghir, C. The Use of Structural Models in Econometrics. J. Econ. Perspect. 2017, 31, 33–58. [Google Scholar] [CrossRef]
Rubin, D.B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. J. Am. Stat. Assoc. 2005, 100, 322–331. [Google Scholar] [CrossRef]
Imbens, G.W.; Rubin, D.B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction; Cambridge University Press: Cambridge, UK, 2015. [Google Scholar] [CrossRef]
Kennedy, E.H.; Ma, Z.; McHugh, M.D.; Small, D.S. Non-parametric Methods for Doubly Robust Estimation of Continuous Treatment Effects. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 1229–1245. [Google Scholar] [CrossRef]
Westling, T.; Gilbert, P.; Carone, M. Causal Isotonic Regression. JRSSb 2020, 82, 719–747. [Google Scholar] [CrossRef]
Galagate, D. Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions with Applications. Ph.D. Thesis, University of Maryland, College Park, MD, USA, 2016. [Google Scholar] [CrossRef]
Rubin, D.; van der Laan, M.J. Extending Marginal Structural Models through Local, Penalized, and Additive Learning; Working Paper 212; Division of Biostatistics, UC Berkeley: Berkeley, CA, USA, 2006. [Google Scholar]
Neugebauer, R.; van der Laan, M.J. Nonparametric causal effects based on marginal structural models. J. Stat. Plan. Inference 2007, 137, 419–434. [Google Scholar] [CrossRef]
Zhang, Y.F.; Zhang, H.; Lipton, C.Z.; Li, L.E.; Xing, E. Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation. NeurIPS ML Safety Workshop 2023. [Google Scholar] [CrossRef]
Bica, I.; Jordon, J.v.d.; Schaar, M. Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks. In Proceedings of the Advances in Neural Information Processing Systems, virtual, 6–12 December 2020; Volume 33, pp. 16434–16445. [Google Scholar] [CrossRef]
Zhang, Y. Extremal Quantile Treatment Effects. Ann. Stat. 2018, 46, 3707–3740. [Google Scholar] [CrossRef]
Deuber, D.; Li, J.; Engelke, S.; Maathuis, M. Estimation and Inference of Extremal Quantile Treatment Effects for Heavy-Tailed Distributions. JASA 2023, 1–11. [Google Scholar] [CrossRef]
Huang, W.; Li, S.; Peng, L. Extreme Continuous Treatment Effects: Measures, Estimation and Inference. arXiv 2022, arXiv:2209.00246. [Google Scholar]
Bodik, J.; Paluš, M.; Pawlas, Z. Causality in extremes of time series. Extremes 2024, 27, 67–121. [Google Scholar] [CrossRef]
Gnecco, N.; Meinshausen, N.; Peters, J.; Engelke, S. Causal discovery in heavy-tailed models. Ann. Stat. 2020, 49, 1755–1778. [Google Scholar] [CrossRef]
Pasche, O.C.; Chavez-Demoulin, V.; Davison, A. Causal Modelling of Heavy-Tailed Variables and Confounders with Application to River Flow. Extremes 2023, 26, 573–594. [Google Scholar] [CrossRef]
Krali, M.; Davison, A.C.; Klüppelberg, C. Heavy-tailed max-linear structural equation models in networks with hidden nodes. arXiv 2023, arXiv:2306.15356. [Google Scholar]
Bodik, J.; Chavez-Demoulin, V. Structural restrictions in local causal discovery: Identifying direct causes of a target variable. arXiv 2023, arXiv:2307.16048. [Google Scholar]
Engelke, S.; Hitz, A. Graphical models for extremes. J. R. Stat. Soc. Ser. B 2020, 82, 871–932. [Google Scholar] [CrossRef]
Naveau, P.; Hannart, A.; Ribes, A. Statistical Methods for Extreme Event Attribution in Climate Science. Annu. Rev. Stat. Its Appl. 2020, 7, 89–110. [Google Scholar] [CrossRef]
Courgeau, V.; Veraart, A.E.D. Extreme event propagation using counterfactual theory and vine copulas. arXiv 2021, arXiv:2106.13564. [Google Scholar]
Kiriliouk, A.; Naveau, P. Climate extreme event attribution using multivariate peaks-over-thresholds modeling and counterfactual theory. Ann. Appl. Stat. 2020, 14, 1342–1358. [Google Scholar] [CrossRef]
Dong, K.; Ma, T. First steps toward understanding the extrapolation of nonlinear models to unseen domains. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Christiansen, R.; Pfister, N.; Jakobsen, M.E.; Gnecco, N.; Peters, J. A Causal Framework for Distribution Generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6614–6630. [Google Scholar] [CrossRef]
Saengkyongam, S.; Rosenfeld, E.; Ravikumar, P.K.; Pfister, N.; Peters, J. Identifying Representations for Intervention Extrapolation. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Chen, Y.; Bühlmann, P. Domain Adaptation under Structural Causal Models. J. Mach. Learn. Res. 2021, 22, 1–80. [Google Scholar]
Shen, X.; Meinshausen, N. Engression: Extrapolation for Nonlinear Regression? arXiv 2024, arXiv:2307.00835. [Google Scholar]
Pfister, N.; Bühlmann, P. Extrapolation-Aware Nonparametric Statistical Inference. arXiv 2024, arXiv:2402.09758. [Google Scholar]
Hirano, K.; Imbens, G.W. The Propensity Score with Continuous Treatments. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives; John Wiley and Sons, Ltd.: Hoboken, NJ, USA, 2004; Chapter 7; pp. 73–84. [Google Scholar] [CrossRef]
Gill, R.D.; Robins, J.M. Causal inference for complex longitudinal data: The continuous case. Ann. Stat. 2001, 29, 1785–1811. [Google Scholar] [CrossRef]
King, G.; Zeng, L. The Dangers of Extreme Counterfactuals. Political Anal. 2006, 14, 131–159. [Google Scholar] [CrossRef]
Crump, R.K.; Hotz, V.J.; Imbens, G.W.; Mitnik, O.A. Dealing with Limited Overlap in Estimation of Average Treatment Effects. Biometrika 2009, 96, 187–199. [Google Scholar] [CrossRef]
Ai, C.; Linton, O.; Zhang, Z. Estimation and Inference for the Counterfactual Distribution and Quantile Functions in Continuous Treatment Models. J. Econom. 2021, 228, 39–61. [Google Scholar] [CrossRef]
Bahadori, M.T.; Tchetgen, E.; Heckerman, D. End-to-End Balancing for Causal Continuous Treatment-Effect Estimation. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 1313–1326. [Google Scholar]
Li, Y.; Kuang, K.; Li, B.; Cui, P.; Tao, J.; Yang, H.; Wu, F. Continuous Treatment Effect Estimation via Generative Adversarial De-confounding. In Proceedings of the 2020 KDD Workshop on Causal Discovery, PMLR, San Diego, CA, USA, 24 August 2020; Volume 127, pp. 4–22. [Google Scholar]
Kreif, N.; Grieve, R.; Díaz, I.; Harrison, D. Evaluation of the Effect of a Continuous Treatment: A Machine Learning Approach with an Application to Treatment for Traumatic Brain Injury. Health Econ. 2015, 24, 1213–1228. [Google Scholar] [CrossRef] [PubMed]
Zhao, S.; van Dyk, D.A.; Imai, K. Propensity Score-based Methods for Causal Inference in Observational Studies with Non-binary Treatments. Stat. Methods Med. Res. 2020, 29, 709–727. [Google Scholar] [CrossRef] [PubMed]
Resnick, S.I. Extreme Values, Regular Variation and Point Processes; Springer: New York, NY, USA, 2008. [Google Scholar]
Pickands, J. Statistical Inference Using Extreme Order Statistics. Ann. Stat. 1975, 3, 119–131. [Google Scholar] [CrossRef]
Coles, S. An Introduction to Statistical Modeling of Extreme Values; Springer Series in Statistics; Springer: London, UK, 2001. [Google Scholar]
Fisher, R.; Tippett, L. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Math. Proc. Camb. Philos. Soc. 1928, 24, 180–190. [Google Scholar] [CrossRef]
Pearl, J. Causality: Models, Reasoning and Inference; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
Soklakov, A. Occam’s razor as a formal basis for a physical theory. Found. Phys. Lett. 2002, 15, 107–135. [Google Scholar] [CrossRef]
Wood, S. Generalized Additive Models: An Introduction with R, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2017. [Google Scholar]
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Smith, R. Extreme Value Theory. Handb. Appl. Math. 1990, 1, 437–447. [Google Scholar]
Davison, A.; Huser, R. Statistics of Extremes. Annu. Rev. Stat. Its Appl. 2015, 2, 203–235. [Google Scholar] [CrossRef]
Schneider, L.F.; Krajina, A.; Krivobokova, T. Threshold selection in univariate extreme value analysis. Extremes 2021, 24, 881–913. [Google Scholar] [CrossRef]
Caeiro, F.; Gomes, M. Threshold selection in extreme value analysis. In Extreme Value Modeling and Risk Analysis: Methods and Applications; Chapman and Hall/CRC: Boca Raton, FL, USA, 2015; pp. 69–82. ISBN 9780429161193. [Google Scholar]
Davison, A.; Smith, R.L. Models for exceedances over high thresholds. J. R. Stat. Soc. Ser. B (Methodol.) 1990, 52, 393–425. [Google Scholar] [CrossRef]
van der Vaart, A.W. Bootstrap; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998; pp. 326–340. [Google Scholar]
Kennedy, E.H. Nonparametric causal effects based on incremental propensity score interventions. JASA 2019, 114, 645–656. [Google Scholar] [CrossRef]
Bia, M.; Mattei, A.; Nicolò, G. A Stata package for the application of semiparametric estimators of dose response functions. Stata J. 2014, 14, 580–604. [Google Scholar] [CrossRef]
Galagate, D.; Schafer, J. Estimating Causal Dose Response Functions, R package version 0.4.2; R Foundation for Statistical Computing: Vienna, Austria, 2022. [Google Scholar]
van der Wal, W.; Geskus, R. IPW: An R package for inverse probability weighting. J. Stat. Softw. 2011, 43, 1–23. [Google Scholar]
Davis, R.A. The rate of convergence in distribution of the maxima. Stat. Neerl. 1982, 36, 31–35. [Google Scholar] [CrossRef]
Engelke, S.; Ivanovs, J. Sparse structures for multivariate extremes. Annu. Rev. Stat. Its Appl. 2021, 8, 241–270. [Google Scholar] [CrossRef]
Pearl, J. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Seattle, WA, USA, 2–5 August 2001; UAI’01. pp. 411–420. [Google Scholar]
Yeh, I. Modeling of strength of high performance concrete using artificial neural networks. Cem. Concr. Res. 1998, 28, 1797–1808. [Google Scholar] [CrossRef]
Neville, A. Properties of Concrete; Pearson Education Limited: London, UK, 2011. [Google Scholar]
Koenker, R. Quantile Regression; Econometric Society Monographs; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar] [CrossRef]
Youngman, B.D. Evgam: An R Package for Generalized Additive Extreme Value Models. J. Stat. Softw. 2022, 103, 1–26. [Google Scholar] [CrossRef]
Kolesárová, A.; Mesiar, R.; Saminger-Platz, S. Generalized Farlie-Gumbel-Morgenstern Copulas. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations; Springer International Publishing: Cham, Switzerland, 2018; Communications in Computer and Information Science; Volume 853, pp. 244–252. [Google Scholar] [CrossRef]
Perlin, K. An Image Synthesizer. SIGGRAPH Comput. Graph. 1985, 19, 287–296. [Google Scholar] [CrossRef]
Davis, R.A.; Mikosch, T. The extremogram: A correlogram for extreme events. Bernoulli 2009, 15, 977–1009. [Google Scholar] [CrossRef]
Greene, W.H. Econometric Analysis; Pearson Education: London, UK, 2008. [Google Scholar]
Chernozhukov, V. Extremal quantile regression. Ann. Stat. 2005, 33, 806–839. [Google Scholar] [CrossRef]
Chernozhukov, V.; Fernández-Val, I.; Kaji, T. Extremal Quantile Regression: An Overview; Chapman and Hall: Boca Raton, FL, USA, 2016. [Google Scholar]
Hahn, J. Bootstrapping Quantile Regression Estimators. Econom. Theory 1995, 11, 105–121. [Google Scholar] [CrossRef]
Freedman, D.A. Bootstrapping Regression Models. Ann. Stat. 1981, 9, 1218–1228. [Google Scholar] [CrossRef]
Eck, D.J. Bootstrapping for multivariate linear regression models. Stat. Probab. Lett. 2018, 134, 141–149. [Google Scholar] [CrossRef]

Figure 1. Extrapolation of

E [Y ∣ T = t]

under various models (without confounding). The upper three figures illustrate a first-order extrapolation approach, employing a linear fit at the boundary of the support of T. In contrast, the lower three figures depict estimations generated by distinct models: the first utilizes a pre-additive noise model parameterized by neural networks [34], the second employs smoothing splines [51], and the third utilizes a random forest approach [52].

Figure 1. Extrapolation of

E [Y ∣ T = t]

under various models (without confounding). The upper three figures illustrate a first-order extrapolation approach, employing a linear fit at the boundary of the support of T. In contrast, the lower three figures depict estimations generated by distinct models: the first utilizes a pre-additive noise model parameterized by neural networks [34], the second employs smoothing splines [51], and the third utilizes a random forest approach [52].

Figure 2. Left: Dataset generated based on the simulations outlined in Section 5.1 with

n = 500

. Points falling within the set S are identified by a blue square. Right: Estimation of

μ (t)

using various methods: orange represents the true

μ (t)

, blue depicts our estimate employing the method from Section 4 with

95 %

confidence intervals, grey illustrates the doubly robust estimation method introduced by [11,59], red showcases the additive spline estimator described in [60], dark green demonstrates the approach proposed by [36] utilizing a GAM outcome model (further details in [61]), and purple describes the inverse probability of treatment weighting estimator [62].

Figure 2. Left: Dataset generated based on the simulations outlined in Section 5.1 with

n = 500

. Points falling within the set S are identified by a blue square. Right: Estimation of

μ (t)

using various methods: orange represents the true

μ (t)

, blue depicts our estimate employing the method from Section 4 with

95 %

confidence intervals, grey illustrates the doubly robust estimation method introduced by [11,59], red showcases the additive spline estimator described in [60], dark green demonstrates the approach proposed by [36] utilizing a GAM outcome model (further details in [61]), and purple describes the inverse probability of treatment weighting estimator [62].

Figure 3. Map of meteo-stations (red) and five river stations (black). Note that the river flow is from south to north (with springs in the mountains).

Figure 4. The estimation of

μ (t)

using the doubly robust estimator introduced in [11,59] cut at the second largest observation, its

95 %

confidence intervals, together with the estimation of

μ (111), ω

and their

95 %

confidence intervals.

Figure 4. The estimation of

μ (t)

using the doubly robust estimator introduced in [11,59] cut at the second largest observation, its

95 %

confidence intervals, together with the estimation of

μ (111), ω

and their

95 %

confidence intervals.

Table 1. Estimates of

ω = - 1

with varying dimensions of the confounders

d = d i m (X)

and with different distributions of the noise of T. The sample size is

n = 5000

. The full simulations setup can be found in Appendix B.1.

Table 1. Estimates of

ω = - 1

with varying dimensions of the confounders

d = d i m (X)

and with different distributions of the noise of T. The sample size is

n = 5000

. The full simulations setup can be found in Appendix B.1.

True $ω = - 1$	Gaussian $ε_{T}$	Exponential $ε_{T}$	Pareto $ε_{T}$
$d = 5$	$\hat{ω} = - 1.0 \pm 0.03$	$\hat{ω} = - 1.0 \pm 0.01$	$\hat{ω} = - 1.0 \pm 0.001$
$d = 25$	$\hat{ω} = - 0.96 \pm 0.08$	$\hat{ω} = - 0.99 \pm 0.01$	$\hat{ω} = - 1.0 \pm 0.001$
$d = 50$	$\hat{ω} = - 0.75 \pm 0.28$	$\hat{ω} = - 0.97 \pm 0.09$	$\hat{ω} = - 0.99 \pm 0.01$
$d = 200$	$\hat{ω} = - 0.44 \pm 0.37$	$\hat{ω} = - 0.53 \pm 0.42$	$\hat{ω} = - 0.91 \pm 0.68$

Table 2. Comparing the extrapolation performance of various models using the average Absolute Relative Error (ARE) across 100 simulations with varying dimension of the confounders

d = d i m (X)

. The interpretation of the values is as follows: if the true value of

μ (\tilde{t})

is 1, an ARE of 0.17 indicates an approximate typical error of

| \hat{μ} (\tilde{t}) - μ (\tilde{t}) | \approx 0.17

. Bold values correspond to the algorithm with the best performance.

Table 2. Comparing the extrapolation performance of various models using the average Absolute Relative Error (ARE) across 100 simulations with varying dimension of the confounders

d = d i m (X)

. The interpretation of the values is as follows: if the true value of

μ (\tilde{t})

is 1, an ARE of 0.17 indicates an approximate typical error of

| \hat{μ} (\tilde{t}) - μ (\tilde{t}) | \approx 0.17

. Bold values correspond to the algorithm with the best performance.

	Our Method	Bia et al. [60]	Kennedy et al. [11]	HI with GAM [36]	IPTW [62]
$d = 2$	0.18	0.68	0.64	0.42	3.89
$d = 10$	0.48	0.81	0.65	0.67	5.69
$d = 30$	0.79	0.92	could not handle	0.92	4.70

Table 3. Estimates

\hat{ω}

between each pairs of the stations.

Table 3. Estimates

\hat{ω}

between each pairs of the stations.

Truth: $ω = 1$	Stations $2 \to 1$	Stations $3 \to 2$	Stations $4 \to 3$	Stations $5 \to 3$
$\hat{ω}$	$1.03 \pm 0.05$	$1.17 \pm 0.24$	$1.21 \pm 0.19$	$0.78 \pm 0.41$

Table 4. Estimates

\hat{β}

and

\hat{ω}

represent the estimation of the effect of T on Y in the body and in the tail, respectively.

\hat{β}

is computed using standard regression, while

\hat{ω}

is the tail counterpart computed using steps introduced in Section 4. * Note that Station 5 is in different altitude and relatively far from meteo-station M2 with a lake in between; hence, there is a bias due to data collection problems.

Table 4. Estimates

\hat{β}

and

\hat{ω}

represent the estimation of the effect of T on Y in the body and in the tail, respectively.

\hat{β}

is computed using standard regression, while

\hat{ω}

is the tail counterpart computed using steps introduced in Section 4. * Note that Station 5 is in different altitude and relatively far from meteo-station M2 with a lake in between; hence, there is a bias due to data collection problems.

Truth Unknown	Station 1	Station 2	Station 3	Station 4	Station 5
$\hat{β}$	$2.4 \pm 0.1$	$2.28 \pm 0.1$	$1.44 \pm 0.02$	$0.89 \pm 0.02$	$0.38 \pm 0.01$
$\hat{ω}$	$3.04 \pm 0.95$	$2.61 \pm 0.67$	$1.62 \pm 0.35$	$0.99 \pm 0.32$	$0.36 \pm 0.13$ *

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bodik, J. Extreme Treatment Effect: Extrapolating Dose-Response Function into Extreme Treatment Domain. Mathematics 2024, 12, 1556. https://doi.org/10.3390/math12101556

AMA Style

Bodik J. Extreme Treatment Effect: Extrapolating Dose-Response Function into Extreme Treatment Domain. Mathematics. 2024; 12(10):1556. https://doi.org/10.3390/math12101556

Chicago/Turabian Style

Bodik, Juraj. 2024. "Extreme Treatment Effect: Extrapolating Dose-Response Function into Extreme Treatment Domain" Mathematics 12, no. 10: 1556. https://doi.org/10.3390/math12101556

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Extreme Treatment Effect: Extrapolating Dose-Response Function into Extreme Treatment Domain

Abstract

1. Introduction

2. Problem Statement, Notation and Preliminaries

2.1. Classical Assumptions

2.2. Extreme Value Theory

3. Our Tail Framework

3.1. Assumptions

3.2. Adjusting Only for θ ( X )

3.3. Model for the Conditional Expectation of Y Given a T

4. Inference and Estimation

5. Illustration and Experiments

5.1. Simple Example

5.2. Simulations

6. Application: River Discharge Dataset

6.1. Known Ground Truth

6.2. Effect of Precipitation on River Discharge

7. Conclusions and Future Work

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Application 2—Concrete Compressive Strength

Appendix A.1. Main Analysis

Appendix A.2. Detailed Computations of the Estimates

Appendix A.3. Discussion about the Results Regarding Different Threshold q

Appendix A.4. Discussion about the Assumptions

Appendix B. Simulations

Appendix B.1. Simulations with a High Dimensional X

Appendix B.2. Comparison with Classical Methods

Appendix B.3. Dependence, Sample Size and the Causal Effect

Appendix B.4. Simulations with a Hidden Confounder

Appendix B.5. Simulations with Varying Extremal Region

Appendix C. River Data Application

Appendix C.1. Simple Illustration with Known Ground Truth

Appendix C.2. Effect of Precipitation on River Discharge

Appendix C.2.1. Choice of Variables

Appendix C.2.2. Computation of β ^

Appendix D. Consistency, Bootstrap and Its Asymptotics

Appendix D.1. Bootstrap

Appendix D.2. Simplifying Assumptions

Appendix D.3. Consistency

Appendix D.4. Bootstraps Correctness

Appendix E. Proofs of Lemmas 1 and 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.2. Adjusting Only for $θ (X)$

Appendix C.2.2. Computation of $\hat{β}$