Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment

Hayakawa, Takashi; Asai, Satoshi

doi:10.3390/math13193092

Open AccessFeature PaperArticle

Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment

by

Takashi Hayakawa

^1,2,*

and

Satoshi Asai

^1,2

¹

Division of Pharmacology, Department of Biomedical Sciences, Nihon University School of Medicine, 30-1 Oyaguchi-Kami Machi, Itabashi-ku, Tokyo 173-8610, Japan

²

Division of Genomic Epidemiology and Clinical Trials, Clinical Trials Research Center, Nihon University School of Medicine, 30-1 Oyaguchi-Kami Machi, Itabashi-ku, Tokyo 173-8610, Japan

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(19), 3092; https://doi.org/10.3390/math13193092

Submission received: 24 July 2025 / Revised: 1 September 2025 / Accepted: 19 September 2025 / Published: 26 September 2025

(This article belongs to the Special Issue Advances in Statistical Methods and Machine Learning for Medical and Genetic Epidemiology)

Download

Browse Figures

Versions Notes

Abstract

Previous studies have shown that hazard ratios between treatment groups estimated with the Cox model are uninterpretable because the unspecified baseline hazard of the model fails to identify temporal change in the risk-set composition due to treatment assignment and unobserved factors among multiple contradictory scenarios. To alleviate this problem, especially in studies based on observational data with uncontrolled dynamic treatment and real-time measurement of many covariates, we propose abandoning the baseline hazard and using kernel-based machine learning to explicitly model the change in the risk set with or without latent variables. For this framework, we clarify the context in which hazard ratios can be causally interpreted, then develop a method based on Neyman orthogonality to compute debiased maximum likelihood estimators of hazard ratios, proving necessary convergence results. Numerical simulations confirm that the proposed method identifies the true hazard ratios with minimal bias. These results lay the foundation for the development of a useful alternative method for causal inference with uncontrolled, observational data in modern epidemiology.

Keywords:

survival analysis; hazard ratio; machine learning; causal inference; kernel method

MSC:

62N02

1. Introduction

The use of hazard ratios to measure the causal effect of treatment has recently come under debate. Although it has been standard practice in epidemiological studies to examine hazard ratios using the Cox proportional hazard model and its modified versions [1,2], several studies [3,4,5] have noted that hazard ratios are uninterpretable with regard to causation (see Martinussen (2022) [6] for a review). In particular, Martinussen et al. (2020) [5] provided concrete examples of data-generating processes in which the Cox model is correctly specified, but estimated hazard ratios are difficult to interpret. The main difficulty is that time courses of different study populations with different treatment effects are described by the same Cox model with the same set of parameter values. Although a few authors have rebutted the uninterpretability of hazard ratios in the Cox model [7,8,9], the issue concerning the unidentifiability with multiple contradictory scenarios as posed by Martinussen et al. (2020) [5] remains unresolved.

To address this issue, researchers have sought alternative measures of causal treatment effect, such as differences between counterfactual survival functions or restricted mean survival time [10,11,12]. Methods for applying machine learning to estimate these measures have also been developed [13,14,15]. However, these measures are only applicable to simple, well-controlled settings, such as randomized clinical trials or observational studies of several covariate-adjusted groups with different baseline treatments, despite such studies making few assumptions about data-generating processes in other respects. Modern epidemiology increasingly requires methods for analyzing large quantities of observational data acquired in a more uncontrolled, dynamic manner, in which many covariates are measured in real time. Examples of such data are electronic medical or health records and data acquired by electronic devices for health promotion [16]. The marginal structural Cox model potentially used for this purpose [17] suffers from the uninterpretability problem described above; thus, the development of an alternative method is required.

The present study proposes a strategy to use a hazard model based on time-dependent treatment variables and covariates but without unspecified baseline hazard. Previous studies [3,4,5] attributed the uninterpretability of hazard ratios to “selection,” i.e., the fact that less frail subjects (described by unobserved factors) are more likely to remain in the risk set at later stages of the study. Although these authors did not explicitly describe the role of the baseline hazard in the description of this selection process, it is the unspecified baseline hazard that allows the Cox model to be correctly specified for selection processes of many different patterns (see Remark 6 below). Thus, we abandon the baseline hazard and instead exploit the descriptive power of machine learning to capture how the risk set changes over time. This approach can now be implemented due to recent advances in the incorporation of machine learning into rigorous statistical analysis with effect estimation [18,19]. Prior to this development, machine learning could be used for only outcome prediction in epidemiology because estimation with machine learning was biased. Thus, we develop an algorithm that debiases maximum likelihood (ML) estimators of hazard ratios in the model, using the framework of doubly robust, debiased machine learning (DML) based on Neyman orthogonality (see Ref. [20] for an introductory review). Notably, our algorithm applies to models with latent variables, thereby enabling explicit modeling of the selection in the risk set caused by unobserved factors.

This article is organized as follows. In Section 2.1, we introduce our problem setting for an observational study with uncontrolled, dynamic treatment and real-time measurement of covariates, together with an exponential machine learning-based hazard model used for analysis. In Section 2.2, we show that the ML estimator of hazard ratios can be interpreted as a measure of causal treatment effect in this setting (Proposition 1), clarifying required (mostly testable) assumptions. In this argument, we show that the unindentifiability with multiple interpretations is alleviated (Remark 6). In Section 2.3, we construct Neyman- (near-)orthogonal scores, which are for debiasing of the ML estimators of hazard ratios in our model (Propositions 2 and 3). For these scores, additional convergence results required for DML are also provided in Appendix G. In Section 2.4 and Section 2.5, we describe how to extend the model by introducing a latent variable and multiple-kernel models, respectively. In Section 2.6, we develop an algorithm for computing the debiased estimators of hazard ratios after model selection, suitably adapting the procedure proposed in Ref. [19] to our setting. Then, in Section 3, we apply the developed method to two sets of clinically plausible simulation data. In the first simulation study described in Section 3.1, we show that our estimators appropriately identify the true hazard ratios with minimal bias under the effect of complicated nonlinear confounding among treatment, comorbidities and outcome. In the second simulation study described in Section 3.2, we present a case in which an unobserved factor causes selection in the risk set, resulting in less frail subjects in the later phase of the study. Our debiased estimator based on a model with a latent variable appropriately identifies the true hazard ratio with minimal bias. On the basis of these results, in Section 4, we discuss multiple advantages of the proposed ML approach over the conventional Cox’s maximum partial likelihood (MPL) approach, as well as its limitations. We list mathematical notations in Table A1.

2. Theories and Methods

2.1. Exponential Parametric Hazard Model Combined with Machine Learning

We consider observational studies in which the occurrence or absence of an event of interest in subjects randomly sampled from a large population and indexed by

i (\in I)

is longitudinally observed over time (indexed by

t \in T \subset R

) during a noninformatively right-censored period, i.e.,

0 \leq t \leq C_{i}

. With a set of time-dependent covariates, collectively denoted by

X_{i, t} (= {X_{i j, t}}_{j \in J}) \in X \subset R^{d}

(d \in N)

, we strive to identify the effect of time-dependent treatment described by a collection of binary variables

A_{i, t} (= {A_{i k, t}}_{k \in K} \in A \subset {0, 1}^{| K |})

. For clarity of presentation, we assume that, at most, a single treatment variable can take a value of unity, and the remaining variables must be zero. Consider an exponential hazard model whose conditional hazard is given by

\begin{matrix} h (t | A_{i, t}, X_{i, t}) \overset{def}{=} exp (θ^{'} A_{i, t} + f (X_{i, t})), \end{matrix}

(1)

where

θ = {θ_{k}}_{k \in K} \in Θ \subset R^{| K |}

is a set of parameters corresponding to the natural logarithm of the hazard ratio of the untreated samples to samples treated to different extents and the function f describes the risk variation due to the given set of covariates, which we adjust using a machine learning model (i.e., a set of functions (

M

) in which f lies). In addition, hereafter,

{(\cdot)}^{'}

denotes the transposition of vectors and matrices. Note that the above model lacks the baseline hazard as a function of t (the time elapsed after enrollment). The covariate

X_{i, t}

can include temporal information, such as the date and durations of suffered disorders, but not t itself.

Suppose that a subset of subjects (indexed by i) experiences an event of interest at time

T_{i}

in the observation period and the other subjects experience no such event. The full log-likelihood for this observation is then given by (see, e.g., Ref. [21])

\begin{matrix} ln Q_{h} (T | A, X) = \sum_{i \in I} \{I_{[0, C_{i}]} (T_{i}) (θ^{'} A_{i, T_{i}} + f (X_{i, T_{i}})) - \int_{0}^{T_{i} \land C_{i}} exp (θ^{'} A_{i, t} + f (X_{i, t})) d t\}, \end{matrix}

(2)

where

I_{[0, C_{i}]} (T_{i})

is the indicator function that returns unity for

T_{i} \in [0, C_{i}]

and zero otherwise. To see the meaning of Equation (2), we consider how the hazard model describes the occurrence of an event in a small interval

(t, t + δ t]

. Assuming, for the moment, that (i)

X_{i, s}

is continuous with respect to s and (ii)

A_{i, s} = c o n s t

over

s \in [t, t + δ t]

, the likelihoods of occurrence and non-occurrence of an event in this period, conditioned on event-free survival at time t, are

h (t | A_{i, t}, X_{i, t}) δ t

and

1 - h (t | A_{i, t}, X_{i, t}) δ t

, respectively, up to the first order in

δ t

. Simple computations show that the log-likelihood for this binary observation is

\begin{matrix} ln Q_{h} (I_{(t, t + δ t]} (T_{i}) | T_{i} > t, A_{i, t}, X_{i, t}) & = & I_{(t, t + δ t]} (T_{i}) ln h (t | A_{i, t}, X_{i, t}) δ t \\ + (1 - I_{(t, t + δ t]} (T_{i})) ln (1 - h (t | A_{i, t}, X_{i, t}) δ t) + O (δ t^{2}), \\ = & I_{(t, t + δ t]} (T_{i}) (θ^{'} A_{i, t} + f (X_{i, t})) - exp (θ^{'} A_{i, t} + f (X_{i, t})) δ t \\ + O (δ t^{2}) + c o n s t . \end{matrix}

(3)

The time integration of Equation (3) results in the summand of Equation (2) up to a constant independent of

θ

and f.

We assume that the two assumptions described above hold almost surely in the

δ t \to + 0

limit because

{A_{i t}}_{t}

and

{X_{i t}}_{t}

are continuous with respect to time, except for finite numbers of time points in real applications. Here, however, we make it clear in advance that we do not go into detail about the necessary conditions on the data-generating stochastic process, i.e.,

{A_{i t}, X_{i t}}_{t}

. We do not specify the filtered probability space for the process but, rather, assume the existence of marginal (conditional) probability measures on the measurable spaces

(A \times X \times T, M_{A} \times M_{X} \times M_{T})

and the integrability of quantities of interest with respect to these measures. For model

M

on the base space

(X, M_{X})

for covariates, we consider only reproducing-kernel Hilbert spaces (RKHSs)

H_{k}

associated with a positive-semidefinite, bounded kernel

k (\cdot, \cdot)

[22]. We refer readers to Ref. [23], which provides a basic tool for our theoretical analysis of kernel-based machine learning.

2.2. Causal Inference and ML Estimation

This section clarifies the causal interpretation of the estimation results in the model described in the previous section and discusses the necessary assumptions that support this interpretation. First, we make the conventional assumptions of consistency, no unmeasured confounder (ignorability) and positivity.

Assumption 1.

(consistency). For any counterfactual treatment schedule a, we have,

X_{(- \infty, t]}^{a} = X_{(- \infty, t]} | A_{(- \infty, t]} = a_{(- \infty, t]}

and

N_{(- \infty, t]}^{a} = N_{(- \infty, t]} | A_{(- \infty, t]} = a_{(- \infty, t]}

for counting processes

N_{t} = I_{(- \infty, t]} (T)

and

N_{t}^{a} = I_{(- \infty, t]} (T^{a})

.

Assumption 2.

(no unmeasured confounder). For some

ϵ > 0

, we have, for any t and any counterfactual treatment schedule a,

\begin{matrix} T^{a} ⫫ A_{[t, t + ϵ]} & | & T^{a} > t, X_{(- \infty, t + ϵ]}, A_{(- \infty, t)} . \end{matrix}

(4)

Assumption 3.

(positivity) [optional]. For any

t > 0

and any

a \in A

, we have

P (A_{t} = a | T > t, X_{t}) > 0

.

Here, the set subscript, such as

X_{(- \infty, t]}

, denotes the collection of all variables with subscripts included in the set. Throughout the paper, P denotes the marginal (conditional) probability measures of the argument variables derived from the data-generating process. We used variables for

- \infty < t < 0

for the technical reason that

A_{0}

and

X_{0}

must be determined on the basis of past information. This may be replaced by some finite interval before the enrollment at

t = 0

for which information about the treatment and covariate is available.

Next, in addition to the above, we make the following assumption about the regularity of hazard.

Assumption 4.

(regularity of hazard).

\begin{matrix} lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{(- \infty, t + δ t]}, X_{(- \infty, t + δ t]}) \\ = & lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{(- \infty, t]}, X_{(- \infty, t]}) \\ < & \infty . \end{matrix}

(5)

holds almost surely.

Remark 1.

Assumption 1 is a natural assumption without which we have two different processes for the same treatment schedule.

Remark 2.

Assumption 2 states that treatment at each time point t can be considered to be randomized, as it is conditioned on the value of the covariate. In the analysis of electronic medical records, for example, if the medical practitioners decide their treatment on the basis of evidence recorded as covariates and those covariates are measured in real time, this assumption is valid. It should be noted that if the decision is based on past records, a past variable should be combined with the present variable to form a vectorized covariate. Causal inference with the same assumption with discretized time steps has been performed with marginal structural Cox models (see Refs. [17,24] for examples).

Remark 3.

Assumption 3 is a technical assumption required for debiasing with inverse treatment probability weights (see Proposition 3 and Remark 12 for our model and Ref. [25] for marginal structural Cox models). If medical practitioners never choose a particular treatment option for a value of the covariate, this assumption is violated. Thus, if one analyzes the effect of a treatment considered taboo for comorbidities included in the set of covariates (for example, anticholinergics for angle-closure glaucoma), they need to take extra care of this assumption. This assumption can be omitted if one uses Proposition 2 for estimation in our study.

Remark 4.

Assumption 2 essentially implies that the treatment at time t does not affect the time evolution of covariate X for a subsequent period

(t, t + ϵ]

, as seen through the following argument. First, we have

\begin{matrix} P (I_{(t, t + ϵ]} (T^{a}) | T^{a} > t, X_{(- \infty, t + ϵ]}, A_{(- \infty, t)} = a_{(- \infty, t)}) \\ = & P (I_{(t, t + ϵ]} (T^{a}) | T^{a} > t, X_{(- \infty, t + ϵ]}, A_{(- \infty, t + ϵ]} = a_{(- \infty, t + ϵ]}) \\ = & P (I_{(t, t + ϵ]} (T) | T > t, X_{(- \infty, t + ϵ]}, A_{(- \infty, t + ϵ]} = a_{(- \infty, t + ϵ]}), \end{matrix}

(6)

where the first and second equalities are due to the absence of an unmeasured confounder and consistency, respectively. If the time evolution of covariates is immediately affected by the treatment schedule and if the outcome is immediately affected by the covariates, then under the conditioning on

X_{[t, t + ϵ]}

,

X_{[t, t + ϵ]}^{a}

depends on the value of A and, hence,

T^{a}

depends on A, which contradicts the assumption. The removal of such instantaneous interactions among treatment, covariates and outcome also makes Assumption 4 reasonable.

In addition to Assumptions 1–4, we assume that the model in Equation (1) is correctly specified. More precisely, we make the following four assumptions:

Assumption 5.

(hazard conditionally independent of past treatment and covariate). For any

t > 0

,

a \in A

,

x \in X

and

E_{A}, E_{X} \subset (- \infty, t)

, we almost surely have

\begin{matrix} lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a, X_{t} = x)}{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a, X_{t} = x, A_{E_{A}}, X_{E_{X}})} = 1 . \end{matrix}

(7)

Assumption 6.

(homogeneous treatment effect). For any

a, a^{'} \in A

and any

x \in X

, the hazard contrast, i.e.,

\begin{matrix} lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a, X_{t} = x)}{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a^{'}, X_{t} = x)}, \end{matrix}

(8)

takes a constant value, regardless of the value of x.

Assumption 7.

(time-homogeneous hazard): The hazard is independent of its timing during the observation period. Mathematically, for any

t, t^{'} \geq 0

and any fixed

a \in A

and

x \in X

,

\begin{matrix} lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a, X_{t} = x)}{P (I_{(t^{'}, t^{'} + δ t]} (T) = 1 | T > t^{'}, A_{t^{'}} = a, X_{t^{'}} = x)} = 1 . \end{matrix}

(9)

Assumption 8.

(correctly specified machine learning model): There exist unique

θ^{*} \in R

and

f^{*} \in M

that satisfy

\begin{matrix} E [ln Q_{h_{θ^{*}, f^{*}}} (T | A, X)] = sup_{θ \in R, f \in B_{X}} E [ln Q_{h_{θ, f}} (T | A, X)], \end{matrix}

(10)

where

B_{X}

is the set of Borel-measurable functions.

Remark 5.

Assumptions 5–7 can be validated by extending the model to incorporate the effect of past treatment and covariates, the inhomogeneous treatment effect and time inhomogeneity, respectively. The original and extended models can then be compared in terms of, for example, Bayesian model evidence (BME) (see, e.g., Chapter 3 of Ref. [26]). If Assumption 5 is violated, the definition of the treatment variables and covariates may be extended so that the current value retains past information. This is the same strategy as that employed for the design of variables in a marginal structural Cox model [17]. If Assumption 6 is violated, the model may be extended by incorporating a covariate-dependent heterogeneous treatment effect. Violation of Assumption 7 indicates that variations in covariate values do not account for the temporal change in the risk set. In this case, additional covariates or extending the model with latent variables (as described below) may be considered. Finally, Assumption 8 is an a priori assumption that stems from the fact that in statistical learning, inference in the function space requires a regularity assumption.

Proposition 1.

Let the maximizer of the expected likelihood be

\begin{matrix} θ^{*}, f^{*} = \arg max_{θ \in R, f \in M} E [ln Q_{h} (T | A, X)] . \end{matrix}

(11)

Under Assumptions 1–8, suppose two possible treatment schedules

a, a^{'}

such that

a_{(- \infty, t)} = a_{(- \infty, t)}^{'}

holds and that

a_{k, s} = 1

and

a_{s}^{'} = 0

(or

a_{k^{'}, s}^{'} = 1

) hold for

s \in [t, t + ϵ]

for some

ϵ > 0

. Then, we have, for any sample path,

x_{(- \infty, t + δ t]}

, for the covariate

\begin{matrix} θ_{k}^{*} (- θ_{k^{'}}^{*}) = \\ lim_{δ t \to + 0} ln \frac{P (I_{(t, t + δ t]} (T^{a}) = 1 | T^{a}, T^{a^{'}} > t, A_{(- \infty, t)} = a_{(- \infty, t)}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})}{P (I_{(t, t + δ t]} (T^{a^{'}}) = 1 | T^{a}, T^{a^{'}} > t, A_{(- \infty, t)} = a_{(- \infty, t)}^{'}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})} . \end{matrix}

(12)

Proof.

See Appendix B. □

Remark 6.

Suppose that the entire covariate

X_{t}

is unobserved. In this case, the modified version of the above proof indicates that a hazard model expanded with respect to time:

\begin{matrix} h (t | A_{t}) \overset{def}{=} exp (θ {(t)}^{'} A_{t} + f (t)) \end{matrix}

(13)

admits the optimal solution

\begin{matrix} θ^{*} (t) & = & ln lim_{δ t \to + 0} \frac{\int_{X} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{k, t} = 1, X_{t}) d P (X_{t} | A_{k, t} = 1, T > t)}{\int_{X} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = 0, X_{t}) d P (X_{t} | A_{t} = 0, T > t)}, \\ f^{*} (t) & = & ln lim_{δ t \to + 0} \frac{1}{δ t} \int_{X} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = 0, X_{t}) d P (X_{t} | A_{t} = 0, T > t), \end{matrix}

where one can interpret

exp (f^{*} (t))

as a baseline hazard. The example of the uninterpretability of hazard ratios presented by Martinussen et al. [5] was for this setting. They assumed time-independent

A_{t}

and

X_{t}

. Their point was that multiple combinations of

d P (X_{t} | A_{t}, T > t)

and

{lim}_{δ t \to + 0} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t}, X_{t}) / δ t

yield the same solution for

θ (t)

.

This ambiguity has been removed in our setting. We assume that

{lim}_{δ t \to + 0} P (I_{(t, t + δ t]}

(T) = 1 | T > t, A_{t}, X_{t}) / δ t

is independent of specific time points of the observation period, which is biologically natural. Then, the time dependence of solutions for

θ

and f indicates a lack of observation of relevant factors and encourages us to seek a better model. This proof mechanism does not work in the Cox model because the Cox model does not estimate the baseline hazard. Although our framework cannot exclude a scenario in which temporal changes in the distribution of unobserved factors miraculously keep

θ^{*} (t)

and

f^{*} (t)

constant over time, this is not a generic case.

Remark 7.

The quantity in Equation (12) is a covariate-adjusted hazard contrast for two counterfactual treatments that branch at time t, so it can be interpreted as a measure of causal effect. In the setting discussed in Ref. [17], this corresponds to the treatment effect measured in the next month (for which covariates in each month can be regarded as baseline covariates [27]).

Remark 8.

See Remarks A2 and A3 in Appendix H for more consideration of causal interpretation of hazard ratios.

2.3. Debiasing ML Estimators

Next, we assume that we have consistent estimators

\hat{θ}

and

\hat{f}

that are the maximizers of the empirical log-likelihood in Equation (2) in the presence of a suitable regularizer. Although a large body of the machine learning literature presents such consistent estimators, they are biased in the sense that we cannot expect

\sqrt{n} E [\hat{θ} - θ^{*}], \sqrt{n} E [\hat{f} - f^{*}] \overset{p}{\to} 0

for the increasing number of subjects

n (\to \infty)

. This situation prohibits making a decision on the significance of estimated results. However, we can apply to this problem debiased machine learning based on Neyman orthogonality as developed by Chernozhukov et al. (2018) [19]. Chernozhukov et al. (2018) [19] described how to systematically debias ML estimators and other M estimators. Applying their idea to our problem setting yields debiased estimators of

θ^{*}

.

Definition 1.

(Neyman near-orthogonal scores). Suppose that an estimator

\hat{θ} (\in Θ \subset R^{K})

of quantities of interest

θ^{*}

is given as a zero of the empirical average of score functions

ϕ : D \times Θ \times N \to R^{K}

for i.i.d. data

D_{i} \in D

:

\begin{matrix} \hat{E} [ϕ (D; \hat{θ}, \hat{η})] \overset{def}{=} \frac{1}{n} \sum_{i} ϕ (D_{i}; \hat{θ}, \hat{η}) = 0, \end{matrix}

(14)

where

\hat{E}

denotes the empirical average and the third argument of ϕ is a (possibly infinitely dimensional) nuisance parameter and an estimator

\hat{η}

of an optimal value

η^{*} \in N

is used. Also suppose that there exist convex neighborhoods of

η^{*}

,

N_{n} \subset N

, to which

\hat{η}

belongs, with high probabilities converging to one for

n \to \infty

. Then, ϕ is said to be “Neyman near-orthogonal” if it satisfies

\begin{matrix} E [ϕ (D; θ^{*}, η^{*})] & = & 0, \end{matrix}

(15)

\begin{matrix} \partial_{η} E [ϕ (D; θ^{*}, η^{*})] [η - η^{*}] & \leq & ϵ_{n} \sim o (n^{- 1 / 2}) (^{\forall} η \in N_{n}), \end{matrix}

(16)

where, in the second line,

\partial_{η} (\cdot) [η - η^{*}]

denotes the Gateaux derivative operator (see Table A1 and Ref. [28] for definition). If the above condition holds for

ϵ_{n} = 0

, ϕ is said to be “Neyman orthogonal”.

Given a few conditions on the regularity of score functions and the convergence of nuisance parameters, Chernozhukov et al. [19] proved

\sqrt{n} E [\hat{θ} - θ^{*}] \overset{p}{\to} 0

. In the following, we construct such orthogonal scores from the log-likelihood in Equation (2). For this purpose, we modify the following lemma proven by Chernozhukov et al. [19].

Lemma 1.

(Lemma 2.5 of [19]: Neyman orthogonal scores derived from log-likelihood). Consider an ML problem:

\begin{matrix} θ^{*}, f^{*} = arg min_{θ \in Θ, f \in M} E [ℓ (D; θ, f)], \end{matrix}

(17)

where ℓ is a sample-wise negative log-likelihood function and

M

is a convex set of high-dimensional vectors. Define

f_{θ}

with

\begin{matrix} f_{θ} = arg min_{f \in M} E [ℓ (D; θ, f)], \end{matrix}

(18)

and a convex set

N

of mappings of Θ into

M

with an optimal solution in this set

η^{*} (θ) = f_{θ}

. Further suppose that for each

η \in N

, the function

θ \mapsto ℓ (D; θ, η (θ))

is almost surely continuously differentiable. Then, under mild regularity conditions, the score ϕ defined as

\begin{matrix} ϕ (D; θ, η) = \frac{d ℓ (D; θ, η (θ))}{d θ} \end{matrix}

(19)

is Neyman orthogonal at

(θ^{*}, η^{*})

. The differentiation on the right-hand side above is the full derivative.

For a finite-dimensional f, the above lemma with the application of the implicit function theorem to

\partial_{θ} E [ℓ (D; θ, f)] = 0

(or its empirical version) yields the following orthogonal score (Section 2.2.1 of Ref. [19]):

\begin{matrix} ϕ (D; θ, (f, μ)) = \partial_{θ} ℓ (D; θ, f) - μ \partial_{f} ℓ (D; θ, f), \end{matrix}

(20)

with

μ^{*} = E [\partial_{θ} \partial_{f} ℓ |_{θ^{*}, f^{*}}] E {[\partial_{f} \partial_{f} ℓ |_{θ^{*}, f^{*}}]}^{- 1}

and its estimator

\hat{μ} = \hat{E} [\partial_{θ} \partial_{f} ℓ |_{\hat{θ}, \hat{f}}] \hat{E} {[\partial_{f} \partial_{f} ℓ |_{\hat{θ}, \hat{f}}]}^{- 1}

as long as the Hessians with respect to f are invertible.

In the following, for our high-dimensional machine learning applications, we develop an analogous approach with extra care.

Proposition 2.

(Neyman near-orthogonal scores for ML in the present model). Suppose that model

M

is given as a reproducing-kernel Hilbert space

H_{k}

associated with a bounded positive-semidefinite kernel k on

X \times X

. Then, for the sample-wise negative log-likelihood ℓ of the present study, the gradients

\partial_{f} ℓ

,

\partial_{f} \partial_{θ_{k}} ℓ

and the Hessian

\partial_{f} \partial_{f} ℓ

can be identified with elements of

H_{k}

and a linear operator from

H_{k}

into

H_{k}

in the following sense:

\begin{matrix} ℓ (D; θ, f + h) & = & ℓ (D; θ, f) + {(\partial_{f} ℓ |_{θ, f}, h)}_{H_{k}} + {O (∥ h ∥}_{H_{k}}^{2}), \\ \partial_{θ_{k}} ℓ (D; θ, f + h) & = & \partial_{θ_{k}} ℓ (D; θ, f) + {(\partial_{f} \partial_{θ_{k}} ℓ |_{θ, f}, h)}_{H_{k}} + {O (∥ h ∥}_{H_{k}}^{2}), \\ \partial_{f} {ℓ |}_{θ, f + h} & = & \partial_{f} {ℓ |}_{θ, f} + \partial_{f} \partial_{f} {ℓ |}_{θ, f} h + {O (∥ h ∥}_{H_{k}}^{2}) . \end{matrix}

(21)

If Assumption 9 (stated below) holds, for

ζ_{n} = c n^{- α}

with constants

c > 0

and

α (> \frac{1}{2} - β)

, the score function,

\begin{matrix} ϕ (D; θ, (f, H)) = \partial_{θ} ℓ (D; θ, f) - H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} \partial_{f} ℓ (D; θ, f), \end{matrix}

(22)

is Neyman near-orthogonal for the shrinking set of nuisance parameters

N_{n} = B_{n} \times S_{n}

, as defined below, with a true value

H^{*} = E [\partial_{(θ, f)} \partial_{(θ, f)} ℓ |_{θ^{*}, f^{*}}]

and its empirical estimator

\hat{H} = \hat{E} [\partial_{(θ, f)} \partial_{(θ, f)} ℓ |_{\hat{θ}, \hat{f}}]

. Here,

H_{θ f}

and

H_{f f}

are blocks of the positive-semidefinite, compact linear operator

H : R^{| K |} \times H_{k} \to R^{| K |} \times H_{k}

defined as

\begin{matrix} (\begin{matrix} H_{θ f} \\ H_{f f} \end{matrix}) : ψ \in H_{k} \mapsto H (\begin{matrix} 0 \\ ψ \end{matrix}) . \end{matrix}

(23)

The other blocks are similarly defined.

Proof.

See Appendix C. □

Assumption 9.

For positive constants

c_{1}

and

β (\leq 1 / 2)

,

Θ_{n} \overset{def}{=} {θ | ∥ θ - θ^{*} ∥_{2} \leq c_{1} n^{- β}}

and

B_{n} \overset{def}{=} {f | ∥ f - f^{*} ∥_{H_{k}} \leq c_{1} n^{- β}}

, estimator

(\hat{θ}, \hat{f})

falls into

Θ_{n} \times B_{n}

with probabilities converging to one for

n \to \infty

. Estimator

\hat{H} = \hat{E} [\partial_{(θ, f)} \partial_{(θ, f)} ℓ |_{\hat{θ}, \hat{f}}]

falls in

S_{n} \overset{def}{=} {H |

positive-semidefinite compact op. from

R^{| K |} \times H_{k}

into

R^{| K |} \times H_{k}

s.t.

∥ H - H^{*} ∥ \leq c_{2} n^{- β}}

for a positive constant

c_{2}

, with probabilities converging to one for

n \to \infty

. It is further assumed that, for each

k \in K

,

H_{f θ_{k}}^{*} = H_{f f}^{*} ρ_{k}

holds for some element

ρ_{k}

of

H_{k}

.

Remark 9.

It can be argued that Assumption 9 is likely to hold (see Appendix E). However, we leave this as a conjecture because its proof needs an explicit argument about the details of the data-generating stochastic process.

Remark 10.

The assumption of

H_{f θ_{k}}^{*} = H_{f f}^{*} ρ_{k}

can be relaxed to

H_{f θ_{k}}^{*} = H_{f f}^{* γ} ρ_{k}

(

0 < γ < 1

), which is an assumption sometimes made in the analysis of the kernel method [23]. In this case, the convergence rate should be suitably modified.

Remark 11.

Apart from the Neyman near-orthogonality, Chernozhukov et al. [19] provided additional conditions on score regularity and the quality of nuisance parameters (Assumptions 3.3 and 3.4 in [19]). Their proof of the asymptotic unbiasedness of debiased estimators relies on these assumptions. In Appendix G, we prove that these conditions are satisfied for suitably chosen values for α and β.

The above score functions may not be intuitively understood. Inspecting the concrete representations of

\partial_{θ} ℓ

,

\partial_{f} ℓ

,

H_{θ f}^{*}

and

H_{f f}^{*}

, one can derive the following more intuitively understandable formula as well.

Proposition 3.

(Intuitive Neyman orthogonal scores for ML in the present model). Assume that the positive Borel measures

μ_{a}

(

a \in A

) defined as

\begin{matrix} μ_{a} (E) \overset{def}{=} \int_{0}^{\infty} \int_{0}^{C} \int_{E} P (A_{t} = a | X_{t}, T > t) d P (X_{t} | T > t) P (T > t) d t d P (C) \end{matrix}

(24)

(E \in M_{X} :

Borel measurable set) are absolutely continuous with respect to each other and have Radon–Nykodym derivatives,

\begin{matrix} \frac{d μ_{k}}{d μ_{0}} = e^{g_{k}^{*}}, \end{matrix}

(25)

for which

μ_{k}

denotes

μ_{a}

such that

a_{k} = 1

and

g_{k}^{*}

is a measurable function on

X

. Then, the score functions for estimating θ,

\begin{matrix} ϕ_{k} (D; θ, (f, g_{k})) & = & I_{[0, C]} (T) A_{k, T} e^{- θ_{k}} (1 + e^{- g_{k} (X_{t})}) - \int_{0}^{T \land C} A_{k, t} e^{f (X_{t})} (1 + e^{- g_{k} (X_{t})}) d t \\ + \int_{0}^{T \land C} (1 - \sum_{ℓ \in K} A_{ℓ, t}) e^{f (X_{t})} (1 + e^{g_{k} (X_{t})}) d t - I_{[0, C]} (T) (1 - \sum_{ℓ \in K} A_{ℓ, T}) (1 + e^{g_{k} (X_{T})}), \end{matrix}

(26)

(k \in K)

, are Neyman orthogonal. Here, D denotes

(T, C, X, A)

.

Proof.

See Appendix F. □

Remark 12.

As one might notice, the crux of the construction of the above orthogonal scores lies in balancing the second and third terms of Equation (26) by

g_{k}

. Roughly speaking, for each value of covariates,

g_{k}

weights the integrand with the inverse treatment probability for

A_{k} = 1

and

A = 0

. Note that this treatment probability is marginalized with respect to time.

Remark 13.

Note that the estimation of the nuisance parameter

{g_{k}}_{k \in K}

can be naturally formulated as the following logistic regression for each

k \in K

:

\begin{matrix} min_{g_{k}} \hat{E} [\int_{0}^{T \land C} \{A_{k, t} ln (1 + e^{- g_{k} (X_{t})}) + (1 - \sum_{ℓ \in K} A_{ℓ, t}) ln (1 + e^{g_{k} (X_{t})})\} d t] + ζ_{n, k} {∥ g_{k} ∥}_{H_{k}}^{2}, \end{matrix}

(27)

Here, we have introduced a regularization hyperparameter

ζ_{n, k} (> 0)

.

2.4. Extension to Models with Latent Variables

Suppose that Assumption 7 is violated due to the presence of unobserved factors affecting the outcome and covariates cannot adjust the resultant temporal change in the risk set. In this case, as we see in the simulation study below, the inclusion of the time elapsed after enrollment in the set of covariates may improve the description of event occurrence. For such cases, if additional data about the unobserved factors cannot be acquired, the only alternative is to explicitly model the unobserved factors as latent variables.

To simplify the argument and notation, let us assume that the unobserved factor (denoted by

W \in {0, 1}

) is a single time-independent binary variable (e.g., a variable indicating the presence or absence of a genetic risk factor). Also suppose that we use a model with a latent variable

Z \in {0, 1}

defined by the following prior distribution of Z and Z-dependent conditional hazard:

\begin{matrix} Q_{β} (Z | X_{0}, A_{0}) \overset{def}{=} \frac{1}{1 + exp (- (2 Z - 1) (\sum_{j \in J} β_{j} X_{j 0} + β_{0}))}, \end{matrix}

(28)

and

\begin{matrix} \tilde{h} (t | A_{t}, X_{t}, Z) \overset{def}{=} exp (θ^{'} A_{t} + f (X_{t}) + κ Z) . \end{matrix}

(29)

The latter yields the following Z-dependent log-likelihood:

\begin{matrix} Q_{\tilde{h}} (T | X, A, Z) = I_{[0, C]} (T) (θ^{'} A_{T} + κ Z + f (X_{T})) - \int_{0}^{T \land C} exp (θ^{'} A_{t} + κ Z + f (X_{t})) d t . \end{matrix}

(30)

In the above, we have introduced parameters

β \in R^{\dim (X_{0}) + 1}

relating baseline values of covariates (at

t = 0

) to the prior distribution of Z and a parameter

κ \in R

representing the effect of the latent variable.

The same argument for Proposition 1 applies to this model. Suppose that the data-generating process is described by

\begin{matrix} P (W | X_{0}, A_{0}) = \frac{1}{1 + exp (- (2 W - 1) (\sum_{j \in J} β_{j}^{*} X_{j 0} + β_{0}^{*}))}, \end{matrix}

(31)

and

\begin{matrix} lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a, X_{t} = x, W = 1)}{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a, X_{t} = x, W = 0)} = c o n s t, \end{matrix}

(32)

regardless of the values of x, a and t. Then, replacing X with the combination of

(X, W)

or

(X, Z)

in Assumptions 1–8 and Proposition 1 and assuming that the model is identifiable, one obtains the same interpretation of hazard ratios.

Then, one can apply Proposition 2 to derive Neyman-orthogonal scores from the following marginal log-likelihood for this model:

\begin{matrix} \sum_{i \in I} ln Q_{\tilde{h}} (T_{i} | A_{i}, X_{i}) = \sum_{i \in I} max_{r_{i}} \sum_{Z_{i} \in {0, 1}} r_{i} (Z_{i}) ln \frac{Q_{\tilde{h}} (T_{i} | X_{i}, A_{i}, Z_{i}) Q_{β} (Z_{i} | X_{i}, A_{i})}{r_{i} (Z_{i})}, \end{matrix}

(33)

with the aid of variational distributions

{r_{i}}_{i \in I}

constrained by

r_{i} (0) + r_{i} (1) = 1

. It should be noted that, in the application of Proposition 2, f is replaced by

vec (f, κ, β)

. For the above latent-variable model, a more intuitively understandable formula corresponding to Proposition 3 is not available.

2.5. Design of Models with Multiple RKHSs

Although we have discussed a single RKHS for model

M

, from a practical point of view, it is much more convenient to define f in the following form [29,30]:

\begin{matrix} f (X) = \sum_{k_{ℓ} \in K} f_{ℓ} (X) + b, f_{ℓ} \in H_{k_{ℓ}}, b \in R . \end{matrix}

(34)

In the above, we have used a collection of RKHSs, each of which is associated with a kernel function (

k_{ℓ} \in K

). With this formulation, one can separately model the effects of different factors, such as demographics, comorbidities and drug use. We provide examples in simulation studies below. The use of multiple kernels also allows us to construct f by trial and error. One can observe whether incorporation or removal of a component function associated with a covariate improves the data description by calculating BME. On can also validate Assumptions 5–7 by checking whether incorporation of

f_{ℓ}

representing, for example, time inhomogeneity improves BME. Here, note that criteria for model selection must be BME (or Bayesian information criteria applicable to only parametric models), not cross-validation error, because the former is consistent in terms of model selection in the large sample limit, while the latter is not.

The theory developed in the previous sections applies to the above multiple-kernel model as well because the sum of functions in different RKHSs is known to belong to an RKHS associated with a single composite kernel [30,31]. The squared norm of f defined by this composite kernel is dominated by the square sum of the norms of the component functions [31]. Thus, the entire theory is not affected by the use of a multiple-kernel model. The theoretical properties of ML estimation with the model in Equation (34) under a 1-norm, 2-norm or mixed-norm regularization have been extensively studied [30,32,33,34]. We employ the 2-norm-regularized version, for which calculation of BME is tractable.

2.6. Estimation Algorithm

In this section, we describe the concrete procedure for computing the debiased estimators of the hazard ratios for treatment. The entire algorithm is summarized in the pseudocode in Algorithm 1 and is largely divided into three steps. How the algorithm is numerically implemented is detailed in Appendix A.

Algorithm 1 DML of hazard ratios with cross fitting

Input: log-likelihood Equation (2) or (33), partitioned data $D = {D^{(1)} \dots D^{(M)}}$
Output: optimal model $M^{*}$ , debiased estimate $\bar{θ}$ and its squared standard error $\bar{Σ}$
$M^{*} \leftarrow ⌀$ .
while $M^{*} = ⌀$ do
Choose a model: $M_{0} \leftarrow H_{k_{0}}$ .
Prepare alternative models: $M_{ℓ} \leftarrow H_{k_{ℓ}}$ $(ℓ = 1 \dots L)$ .
for $λ$ , $σ \in$ SearchGrid do
Perform ML estimation with D and Compute ${BME}_{λ, σ, ℓ}$ for $ℓ = 0 \dots L$ .
end for
Find $λ_{ℓ}^{*}, σ_{ℓ}^{*} \leftarrow arg max {BME}_{λ, σ, ℓ}$ for $ℓ = 0 \dots L$ .
if ${BME}_{λ_{0}^{*}, σ_{0}^{*}, 0} \geq {BME}_{λ_{ℓ}^{*}, σ_{ℓ}^{*}, ℓ}$ $(1 \leq ℓ \leq L)$ then
$M^{*}, λ^{*}, σ^{*} \leftarrow H_{k_{0}}, λ_{0}^{*}, σ_{0}^{*}$ .
end if
end while
for $m = 1 \dots M$ do
$D^{(∖ m)} \leftarrow D ∖ (D^{(m)} \cup D^{(m + 1 (\mod M))})$ .
Perform ML estimation with $D^{(∖ m)}$ to obtain ${\hat{θ}}^{(∖ m)}, {\hat{f}}^{(∖ m)}$ with $λ^{*}$ and $σ^{*}$ .
end for
for $ζ \in$ SearchGrid do
for $m = 1 \dots M$ do
$D^{(∖ m)} \leftarrow D ∖ (D^{(m)} \cup D^{(m + 1 (\mod M))})$ .
Compute ${\hat{H}}^{(∖ m)}$ (or ${{\hat{g}}_{k}^{(∖ m)}}$ ) based on ${\hat{θ}}^{(∖ m)}, {\hat{f}}^{(∖ m)}$ and $D^{(∖ m)}$ .
Compute ${CVErr}_{H, ζ}^{(m)}$ (or ${{CVErr}_{g_{k}, ζ_{k}}^{(m)}}_{k}$ ) based on ${\hat{H}}^{(∖ m)}$ (or ${{\hat{g}}_{k}^{(∖ m)}}_{k}$ )
and $D^{(m + 1 \mod M)}$ .
end for
end for
$ζ^{*} \leftarrow arg {max}_{ζ} \sum_{m = 1}^{M} {CVErr}_{H, ζ}^{(m)}$ (or $ζ_{k}^{*} \leftarrow arg {max}_{ζ} \sum_{m = 1}^{M} {CVErr}_{g_{k}}^{(m)}$ ).
for $m = 1 \dots M$ do
Compute $ϕ (D_{i}; θ, ({\hat{f}}^{(∖ m)}, {\hat{H}}^{(∖ m)}))$ (or $ϕ (D_{i}; θ, ({\hat{f}}^{(∖ m)}, {\hat{g}}^{(∖ m)}))$ ) for $i \in D^{(m)}$ .
end for
Obtain $\bar{θ}$ by finding the zero of $\sum_{m = 1}^{M} \sum_{i \in D^{(m)}} ϕ (D_{i}; θ, ({\hat{f}}^{(∖ m)}, {\hat{H}}^{(∖ m)}))$
(or $\sum_{m = 1}^{M} \sum_{i \in D^{(m)}} ϕ (D_{i}; θ, ({\hat{f}}^{(∖ m)}, {\hat{g}}^{(∖ m)}))$ ).
Compute $\bar{Σ}$ based on $\bar{θ}$ .

Step 1: Model selection and determination of hyperparameter values

We first choose the best possible combination of RKHSs by trial and error by comparing BME in the manner we described in the previous section. We also examine the validity of Assumptions 5–7 in this step. To compare BME for different models, we first perform a grid search of the optimal hyperparameter values for regularization and the bandwidth of kernels for each model, examining values with intervals of approximately

ln (1.5)

on the logarithmic scale (e.g.,

2.0, 3.0, 5.0, 7.0, 10.0 \dots

) for each hyperparameter. Then, the BME values of different models calculated with the optimal hyperparameter values are compared. When we think that we have obtained the best possible model, we proceed to the next step. There is, however, a possibility that the validity of Assumptions 5–7 is denied for the best possible model at hand. In this case, one needs to review the study design, obtain new covariates or incorporate latent variables in the model.

Step 2: Estimation of nuisance parameters

For the model chosen in Step 1, we calculate nuisance parameters. Following Chernozhukov et al. [19], we perform “cross fitting” as follows. To debias

θ

, the estimation errors in f and H (or

g_{k}

) must be independent of the data

D_{i}

in the argument of score functions. To achieve this, the entire dataset D is first partitioned into M groups of approximately equal size (

D^{(1)} \dots D^{(M)}

). We estimate nuisance parameters M times, using

D_{train}^{(m)} \overset{def}{=} D ∖ (D^{(m)} \cup D^{(m + 1 (\mod M))})

as a training set, validating the estimation result with

D_{val}^{(m)} \overset{def}{=} D^{(m + 1 (\mod M))}

, and then obtaining the values used for score functions of

D_{hout}^{(m)} \overset{def}{=} D^{(m)}

for

1 \leq m \leq M

.

Concretely, we first perform ML estimation to obtain

\hat{θ^{(∖ m)}}

and

\hat{f^{(∖ m)}}

as the minimizer of the negative log-likelihood of

D_{train}^{(m)}

using the optimal hyperparameter values obtained in Step 1. Then, we proceed to the computation of additional nuisance parameters

\hat{H}

(or

{{\hat{g}}_{k}}_{k}

) based on

D_{train}^{(m)}

. In this step, we also need to identify the value of the regularization hyperparameter

ζ_{n}

for these nuisance parameters. For determination of

ζ_{n}

for

\hat{H}

, we examine the sum of the following cross-validation error across the M folds:

\begin{matrix} {CVErr}_{H}^{(m)} \overset{def}{=} \sum_{k \in K} {∥{\hat{H}}_{θ_{k} f}^{(m, val)} - {\hat{H}}_{θ_{k} f}^{(∖ m)} {({\hat{H}}_{f f}^{(∖ m)} + ζ_{n})}^{- 1} {\hat{H}}_{f f}^{(m, val)}∥}_{H_{k}}^{2} \end{matrix}

(35)

where

{\hat{H}}^{(∖ m)}

and

{\hat{H}}^{(m, val)}

are the Hessians of the negative log-likelihoods of

D_{train}^{(m)}

and

D_{val}^{(m)}

, respectively.

For determination of

{ζ_{n, k}}_{k}

for

{g_{k}}_{k}

, the logarithmic loss in Equation (27) may underestimate the error due to large

e^{g_{k}}

and

e^{- g_{k}}

values in the score functions. We therefore suggest computing the sum of the following cross-validation error across the M folds:

\begin{matrix} {CVErr}_{g_{k}}^{(m)} \overset{def}{=} {[\sum_{i \in D_{val}^{(m)}} \int_{0}^{T_{i} \land C_{i}} \{A_{i k, t} (e^{- \hat{g_{k}^{(∖ m)}} (X_{i, t})} - 1) + (1 - \sum_{ℓ \in K} A_{i ℓ, t}) (e^{\hat{g_{k}^{(∖ m)}} (X_{i, t})} - 1)\} d t]}^{2} . \end{matrix}

(36)

The above error design exploits the fact that, for

\hat{P} (A_{k} = 1 | X) \to P (A_{k} = 1 | X)

,

\begin{matrix} P (A_{k} = 1 | X) \frac{1}{\hat{P} (A_{k} = 1 | X)} + (1 - P (A_{k} = 1 | X)) \frac{1}{1 - \hat{P} (A_{k} = 1 | X)} \to 2 . \end{matrix}

(37)

Here, the trivial solution

\hat{P} (A_{k} = 1 | X) = \frac{1}{2}

also minimizes

{CVErr}_{g_{k}}

and should be avoided by checking BME of the logistic regression (Equation (27)) as well.

Step 3: Debiased estimation of $θ$ and its standard error.

The debiased estimator

\bar{θ}

of

θ^{*}

is obtained as the zero of the following:

\begin{matrix} \sum_{1 \leq m \leq M} \sum_{i \in D_{hout}^{(m)}} ϕ (D_{i}; θ, {\hat{η}}^{(∖ m)}), \end{matrix}

(38)

with

{\hat{η}}^{(∖ m)} = ({\hat{f}}^{(∖ m)}, {\hat{H}}^{(∖ m)})

(or

({\hat{f}}^{(∖ m)}, {\hat{g}}_{k}^{(∖ m)}

)). According to Chernozhukov et al. [19], the asymptotic standard error of the estimator is given by

{\bar{Σ}}^{- 1 / 2} (\bar{θ} - θ^{*}) \overset{d}{\to} Normal (0, 1)

, with

\begin{matrix} \bar{Σ} & \overset{def}{=} & {\bar{J}}^{- 1} (\frac{1}{n^{2}} \sum_{1 \leq m \leq M} \sum_{i \in D^{(m)}} ϕ (D_{i}; \bar{θ}, ({\hat{f}}^{(∖ m)}, {\hat{η}}^{(∖ m)})) ϕ {(D_{i}; \bar{θ}, ({\hat{f}}^{(∖ m)}, {\hat{η}}^{(∖ m)}))}^{'}) {\bar{J}}^{- 1'}, \end{matrix}

(39)

\begin{matrix} \bar{J} & \overset{def}{=} & \frac{1}{n} \sum_{1 \leq m \leq M} \sum_{i \in D^{(m)}} \frac{\partial}{\partial θ} ϕ (D_{i}; θ, ({\hat{f}}^{(∖ m)}, {\hat{η}}^{(∖ m)})) |_{θ = \bar{θ}} . \end{matrix}

(40)

Thus, in the simulation study described below, the square roots of the diagonal elements of

\bar{Σ}

serve to scale the errors from the true value to obtain the t-statistics.

Remark 14.

The orthogonal scores as functions of θ are given in Proposition 3 for the case with

g_{k}

. For the case with H, a procedure for obtaining functions of θ is given in Appendix A.

3. Results of Numerical Simulations

3.1. Simulation Result 1: Adjustment for Observed Confounders

This section presents a simulation study designed to demonstrate that, in a clinically plausible setting, the proposed method estimates hazard ratios for treatment with minimal bias. Suppose we randomly sample subjects from an electronic medical record that contains the medical history of a large local population. Suppose, further, that 2000 subjects aged 50–75 years are enrolled at uniformly random times between 2000 and 2005. The chosen subjects are followed up for

C \overset{i . i . d .}{\sim} Uniform [5, 10]

years, with timely recording of the onset of comorbidities, the administration of drugs and the occurrence of a disorder defined as an outcome event.

As a typical setting in which confounders bias the estimation of the treatment effect, suppose that subjects are newly diagnosed with condition 1 at a constant rate (2.5%/month), which may be a prodrome for the outcome event. A drug that is suspected to be a risk factor for an outcome event may be used to treat this condition. In the simulation, the probability per month of initiating this drug is

0.004 + 0.2 X_{1 t} exp (- X_{1 t} / 0.6) / {(0.6)}^{2}

, with a covariate of

X_{1 t}

[year] indicating the duration of suffering from condition 1 before time t, while the probability per month of discontinuing the drug is 0.01. Use of this drug increases the risk of developing condition 2. The probability per month of developing condition 2 at time t is

0.05 + 0.05 \int_{- \infty}^{t} B_{s} Θ (t - s - 1 / 12) e^{- 3 (t - s - 1 / 12)} d s

, where

B_{s}

takes a value of 1 if the subject is a current user of the drug at time s and a value of 0 otherwise. The above integral including the Heaviside step function

Θ

describes a short-term delayed increase in the risk of developing condition 2 subsequent to drug use.

Finally, suppose that the occurrence of the outcome event depends on both treatment and conditions 1 and 2 and its true risk per month is

exp (- 7.0 + θ_{1}^{*} A_{1 t} + θ_{2}^{*} A_{2 t} + 0.04 {age}_{t} + 0.2 sin (π {date}_{t} / 8) - 0.2 cos (π {date}_{t} / 6) + 2.0 X_{1} exp (- X_{1 t} / 1.5) + P_{2} (1 - exp (- X_{2 t} / 2.5))) / 12.0

, with

θ_{1}^{*} = 1.0

and

θ_{2}^{*} = 2.0

, where we define two treatment variables

A_{1 t}

and

A_{2 t}

, which take a value of 1 if and only if the subject uses the drug for a period

\leq 18

and

> 18

months before time t, respectively.

P_{2}

in the equation is a parameter for which we use different values. Covariates

{age}_{t}

,

{date}_{t}

and

X_{2 t}

(duration of suffering condition 2) are measured in years.

For this dataset, we performed debiased ML estimation using multiple-kernel models for f. Concretely, we used the following model:

\begin{matrix} f (X_{t}) = f_{age} ({age}_{t}) + f_{date} ({date}_{t}) + f_{1} (X_{1 t}) + f_{2} (X_{2 t}) + b, \end{matrix}

(41)

with

f_{age} \in H_{k_{age}}

,

f_{date} \in H_{k_{date}}

,

f_{1} \in H_{k_{1}}

and

f_{2} \in H_{k_{2}}

. To correctly specify the model, we chose

k_{age}

to be a linear kernel and

k_{date}

,

k_{1}

and

k_{2}

to be one-dimensional Gaussian kernels [e.g.,

k_{1} (X_{i_{1}, t_{1}}, X_{i_{2}, t_{2}}) = exp (- {(X_{i_{1} 1, t_{1}} - X_{i_{2} 1, t_{2}})}^{2} / σ^{2})

] with a bandwidth hyperparameter

σ

. Although we can identify the set of necessary kernels a priori, one can find the appropriate set of kernels by trial and error, using BME as a criterion. In this step, we can also validate the model assumptions (Assumptions 5–7) by examining whether the introduction of an additional kernel improves BME. We briefly illustrate the idea through an example.

First, as we have described above, covariate

X_{2 t}

is necessary for predicting the onset of outcome events. This can be seen by comparing BME values for the correctly specified model and a misspecified model built by removing

f_{2} (X_{2 t})

and

H_{k_{2}}

from it (Figure 1A). The question naturally arising here is how one can see that there is room for improvement when they have the misspecified model at hand. We propose examination of the assumption of time homogeneity (Assumption 7) as an indicator. In our example, comparison of BME values for the second, misspecified model with or without inclusion of the time elapsed after enrollment as a covariate shows the violation of the assumption, depending on the value of

P_{2}

(Figure 1B), although the sensitivity is not high. The risk variation due to the time elapsed after enrollment indicates a temporal change in the risk set that covariates at hand do not account for; therefore, the model is missing some necessary factors (see the argument in Remark 6). For a correctly specified model, such violation of the assumption is not detected (Figure 1C). The violation of the time-homogeneity assumption is observed for models misspecified in different manners (which we omit to avoid redundancy).

Next, suppose that we successfully identified the correctly specified model and proceeded to the calculation of debiased estimators of hazard ratios. We carried out this calculation in two different ways based on either one of Propositions 2 and 3. In both applications, we followed the procedure described in Section 2.6. Additionally, in the application of Proposition 3, we use the following multiple-kernel model for the estimation of

g_{ℓ}

(

ℓ \in K

):

\begin{matrix} g_{ℓ} (X_{t}) = g_{ℓ, age} ({age}_{t}) + g_{ℓ, date} ({date}_{t}) + g_{ℓ 1} (X_{1 t}) + g_{ℓ 2} (X_{2 t}) + b, \end{matrix}

(42)

with

g_{ℓ, age} \in H_{k_{age}}

,

g_{ℓ date} \in H_{k_{date}}

,

g_{ℓ 1} \in H_{k_{1}}

and

g_{ℓ 2} \in H_{k_{2}}

, where all RKHSs are associated with a one-dimensional Gaussian kernel. On top of the kernels described above, we also examined whether the inclusion of two-dimensional Gaussian kernels such as

k_{12} (X_{i_{1}, t_{1}}, X_{i_{2}, t_{2}}) = exp (- ({(X_{i_{1} 1, t_{1}} - X_{i_{2} 1, t_{2}})}^{2} + {(X_{i_{1} 2, t_{1}} - X_{i_{2} 2, t_{2}})}^{2}) / σ_{2}^{2})

improves BME for the logistic regression in Equation (27). In the present case, the two-dimensional kernels did not improve BME.

In the tuning of hyperparameter values for nuisance parameters (H or

g_{ℓ}

), we examined how the hyperparameters affect

{CVErr}_{H}

(Figure 1D) and

{CVErr}_{g_{ℓ}}

(Figure 1F). In the application of Proposition 3, BME for the logistic regression and

{CVErr}_{g_{ℓ}}

identified different optimal values (Figure 1E,F).

After taking all of the steps in Algorithm 1, we calculated debiased estimate

\bar{θ}

using the score functions obtained with either

(H_{θ f}, H_{f f})

or

g_{1}

. To measure the bias in the estimator, we obtained estimates with 500 datasets generated from the identical process described above with different seeds for the random number generator. The t statistics (

\overset{def}{=} ({\bar{θ}}_{k} - θ_{k}^{*}) / \bar{{SE}_{θ_{k}}}

) in Figure 1G,H show that the debiased estimator identified the true value with minimal bias, in contrast with the naive ML estimator. In Figure 1H, we also observe that

g_{1}

tuned with

{CVErr}_{g_{1}}

debiases the estimator more effectively than

g_{1}

tuned with BME.

3.2. Simulation Result 2: Estimation of Treatment Effect in Population with Heterogeneous Risk

Suppose that most of the simulation settings are the same as in the previous case, but the enrolled subjects are divided into two distinct risk groups denoted by

W \in {0, 1}

, as reflected in a baseline covariate value. More precisely, the number of subjects, their age, enrollment date, censoring, onset probabilities of conditions 1 and 2 and treatment assignment are the same as in the previous case. Suppose, however, that blood-test results

X_{j 0}^{(test)} \overset{i . i . d .}{\sim} Normal (0, σ_{j}^{(test) 2})

at baseline (

t = 0

) are available (

1 \leq j \leq 3

) and the risk group to which the subject belongs is probabilistically related to

{X_{j 0}^{(test)}}_{j}

as

\begin{matrix} P (W | A, X) = \frac{1}{1 + exp (- (2 W - 1) (\sum_{j} β_{j}^{*'} X_{j 0}^{(test)} + β_{0}^{*}))}, \end{matrix}

(43)

with

{σ_{j}^{(test)}}_{1 \leq j \leq 3} = {2.0, 1.0, 4.0}

and

{β_{j}^{*}}_{0 \leq j \leq 3} = {- 0.5, 1.0, 0.0, 0.0}

. With variable W, the risk of an outcome event per month is given by

exp (- 7.0 + θ_{1}^{*} A_{1 t} + θ_{2}^{*} A_{2 t} + 0.04 {age}_{t} + 0.2 sin (π {date}_{t} / 8) - 0.2 cos (π {date}_{t} / 6) + 2.0 X_{1} exp (- X_{1 t} / 1.5) + P_{2} (1 - exp (- X_{2 t} / 2.5)) + κ^{*} W) / 12.0

with

θ_{1}^{*} = 1.0

,

θ_{2}^{*} = 2.0

,

P_{2} = 0.5

and

κ^{*} = 1.0, 2.0

or

3.0

.

We first performed an analysis with only observed covariates

(X_{t}, X_{t}^{(test)})

, not using latent variables. Here, we assumed

X_{t}^{(test)} = X_{0}^{(test)}

. In the validation of assumptions, the inclusion of the time elapsed after enrollment in the set of covariates improved BME (Figure 2A), suggesting that the enrolled subjects are heterogeneous, presumably due to an unobserved factor. Then, we analyzed the dataset with a latent variable, maximizing the marginal log-likelihood with the EM algorithm (see e.g., Chapter 9 of Ref. [26]), which led to further improvement of BME (Figure 2B). In this calculation, we observed that the optimization becomes numerically unstable for small

κ

values (≤1.0) presumably because of the singularity of the model. We examined the estimated values of

θ

with 500 datasets generated by the identical process with different seeds for its random number generator, and the results confirmed that the suggested debiased estimators identify the true value with minimal bias, in contrast with the naive ML estimator (Figure 2C).

4. Discussion

Preceding studies [3,4,5] showed that the Cox model allows for multiple contradictory interpretations due to the selection process in the risk set. To alleviate this difficulty, we proposed a new framework that combines an exponential hazard model with machine learning models. With the observation that the baseline hazard of the Cox model is a major source of the uninterpretability, we constructed the model without a baseline hazard. For this model, with a few testable assumptions, as well as conventional assumptions for causal inference in observational studies, we clarified the context in which the estimated hazard ratios are causally interpreted. In this argument, we have demonstrated that the source of uninterpretability is mostly removed (Remark 6). Then, we developed a framework to systematically seek the best possible model with the aid of machine learning, then compute debiased estimates of hazard ratios with the obtained model. Numerical simulations demonstrated that wrong description of the selection process in the risk set can be detected as violation of a time-homogeneity assumption in this framework. In the simulation, it was also demonstrated that our theoretically justified debiased ML estimators of hazard ratios identify the true values with minimal bias.

Epidemiologists and statisticians might find it psychologically difficult to abandon the Cox model with baseline hazard because of its established role as a default model. The popularity of the Cox model is due to its semiparametric nature. Leaving the difficult-to-model aspects as a black box is an attractive idea. Furthermore, the Cox’s MPL estimator [35] for log-hazard ratios of treatment groups is asymptotically efficient [36,37], which is a natural consequence of obviating the estimation of the baseline hazard. Because of this asymptotic efficiency, all other estimators, including the ML estimator, have essentially been reduced to theoretical subjects, while a relatively small number of studies have suggested the advantage of using the others (e.g., [21,38]). The present results suggest the need to reconsider this situation by re-evaluating the tradeoff between efficiency and uninterpretability due to the unspecified baseline hazard. The disadvantage of using MPL is not only uninterpretability. First of all, the baseline hazard is impossible to interpret from the physical and biological points of view. No physical process depends on the time elapsed after the enrollment. Thus, its use is solely justified by convenience. However, from a technical point of view, comparing the estimation of

g_{k}

in the present study and the inverse probability-of-treatment weight for marginal structural Cox models used for a similar purpose [17], one can see that the former is a marginal estimand in terms of time and, hence, usually easier to estimate. The standard method for the latter, that is, estimating the transition probability of treatment and censoring for each value of covariates and each time t, possibly works better in a simple model setting, such as analysis involving only a few or several discrete variables. However, carrying this out under the nonlinear effect of a high-dimensional covariate is prohibitive. In fact, the recently proposed application of doubly robust estimation to marginal structural Cox models is still limited to simple model settings in which machine learning cannot be incorporated [39]. This is why we could not directly compare the results achieved with our method with those achieved with the marginal structural Cox model. Another negative factor for the MPL approach is that the combination of the Bayes theorem and partial likelihood is only approximately justifiable, so extending Cox regression to Bayesian settings is not straightforward, although efforts in this direction have been made for Cox models as well [40]. Considering these negative factors for the MPL approach, we conclude that the debiased ML approach should be reconsidered as an option in causal inference for observational studies with uncontrolled dynamic treatment and real-time measurement of a large number of covariates.

It is, however, fair to note that the applicability of our approach at the current stage is largely limited because Assumptions 1–8 must be satisfied for the application. Among these assumptions, Assumptions 1–4 are technical assumptions that are explicitly or implicitly made in the conventional approach with the marginal structural Cox model [17]. In particular, in causal inference with observational data, Assumption 2, stating conditionally randomized assignment of treatment for each value of the covariate, is thought to be crucial. If this does not hold, an experiment (intervention) is needed. See Remarks 1–4 for the reasoning of the other assumptions. Assumptions 5–8 essentially concern specification of a model. Assumption 6 is potentially removable. Estimation of the heterogeneous treatment effect (i.e., covariate-dependent treatment effect) is a subject of active research [41], and an approach that adjusts the heterogeneity in the treatment effect in an unbiased manner may be incorporated into our approach. Then, the remaining problem is whether one can design a suitable set of variables for X and A (Assumption 5) and a correctly specified machine learning model (Assumption 8) without sacrificing the assumption of time homogeneity (Assumption 7). This will be achieved if one can construct a model that simulates the data-generating process precisely because the biophysical processes generating data do not depend on specific time points of observation. At the current stage, the relatively simple model we used in this study is likely to suffer from misspecification in real-world epidemiological problems. Therefore, our proposals presented in this study should be considered suggestions for future efforts to overcome this misspecification. Taking the current success of deep learning in real-world applications into account, we would like to suggest an optimistic view that models and data of a large scale, in combination with Bayesian inference, may remove any such misspecification problem. Before such a goal is achieved, no one knows a priori which of our approach and the Cox MPL approach yields a value closer to the true hazard ratio for a given dataset.

In technical aspects, we, indeed, restricted our study to a simple setting for the sake of clear exposition. In this study, only non-informative right censoring is considered, only a single binary treatment variable can take the value of one and only a single time-independent latent variable is allowed. However, the advantage of ML estimation has been demonstrated in the context of informative censoring [42]. Therefore, extending our framework in this direction seems promising. The design of treatment variables in our model is for the sake of an exposition, and our theories apply to general cases with a finite number of discrete treatment variables. Application of our framework to models with continuous treatment variables is also straightforward. Proposition 2 applies to models with linear functions of continuous treatment variables, while Proposition 3 cannot be used for this case. Since our theory and the framework of DML rely on the low dimensionality of treatment variables, extension to models with a nonlinear treatment effect of continuous variables is beyond the fundamental limit of the current technique. However, use of multiple binary variables can approximately describe nonlinear dependence on treatment variables. This is the reason why we chose the binary setting. The extension of our framework to models with multiple (possibly continuous, time-dependent) latent variables would face the identifiability problem. Although progress has been made in both theoretical analysis of identifiability [43,44,45,46] and numerical examination of identifiability in the context of biological modeling [47], there is still a technical challenge. Related to this topic, using a singular model can lead to effective breakdown of asymptotic normality of estimators [48]. In this case, the framework of debiased machine learning would, itself, need modifications, and the Laplace approximation of BME also fails and needs to be replaced by Monte-Carlo integration [49] (or information criteria for singular parametric models [50,51]). Given the abovementioned difficulties, the use of rich classes of latent-variable models for which inference methods have been extensively developed in machine learning (especially those for time-series data [52,53]) remains an attractive, worthy challenge. With respect to this challenge, our technique for debiasing will be useful.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math13193092/s1.

Author Contributions

T.H. conceptualized the study. T.H. performed all of the mathematical works in the study. T.H. and S.A. designed models and simulation data. T.H. developed all program codes. T.H. and S.A. interpreted the analyzed results. T.H. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Ministry of Education, Culture, Sports, Sciences and Technology (MEXT) of the Japanese government and the Japan Agency for Medical Research and Development (AMED) under grant numbers JP18km0605001 and JP223fa627011.

Data Availability Statement

All of the source codes that support the findings of this study are available as Supplementary Materials.

Acknowledgments

We would like to express our gratitude to two medical IT companies, 4DIN Ltd. (Tokyo, Japan) and Phenogen Medical Corporation (Tokyo, Japan), for financial support. However, neither company had any role in the research design, analysis, data collection, interpretation of data, or review of the manuscript, and no honoraria or payments were made for authorship.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Numerical Implementation of Inference with Multiple-Kernel Models

Regularized ML estimation of

θ (\in R)

and

f (\in H

: RKHS for the composite kernel determined by Equation (34)) with discretized timesteps (

T_{i} = {0, Δ t, 2 Δ t, \dots (\leq C_{i} \land T_{i})}

for subject i) under a two-norm regularization is formulated as

\begin{matrix} min_{θ, f} \sum_{i \in I} \sum_{t \in T_{i}} U (D_{i, t}; θ, f) + \sum_{k \in K} \frac{λ_{k}}{2} α_{k}^{'} G_{k} α_{k}, \end{matrix}

(A1)

where the Gram matrix

{(G_{k})}_{i t, i^{'} t^{'}} = k (X_{i, t}, X_{i^{'}, t^{'}})

relates f and

α_{k}

via

\begin{matrix} f (X_{i, t}) = \sum_{k \in K} \sum_{i^{'} t^{'}} {(G_{k})}_{i t, i^{'} t^{'}} α_{k, i^{'} t^{'}} + b, \end{matrix}

(A2)

together with a bias parameter

b \in R

, and

\begin{matrix} U (D_{i, t}; θ, f) = - I_{[t, t + Δ t)} (T_{i}) (θ^{'} A_{i t} + f (X_{i t})) + exp (θ^{'} A_{i t} + f (X_{i t})) \end{matrix}

(A3)

denotes the timestep-wise negative log-likelihood function. Note that the integration step width

Δ t

can be omitted in the representation of U because the use of different values of

Δ t

(namely, change of units) affects the estimate of only b. Concretely, we use linear or 1–3-dimensional Gaussian kernels, such as

k (X_{i_{1}, t_{1}}, X_{i_{2}, t_{2}}) = X_{i_{1} j, t_{1}} X_{i_{2} j, t_{2}}

and

k (X_{i_{1}, t_{1}}, X_{i_{1}, t_{1}}) = exp (- \sum_{1 \leq ℓ \leq d} {(X_{i_{1} j_{ℓ}, t_{1}} - X_{i_{2} j_{ℓ}, t_{2}})}^{2} / 2 σ_{d}^{2})

. Before calculating the Gram matrices, all covariate values are normalized to have a zero mean and a unit variance. For all d-dimensional Gaussian kernels, we use the same values

λ_{k} = λ_{d}

and

σ_{d}

. We primarily use

λ_{k} = 0

for linear kernels, except for cases in which we compare models with different numbers of linear kernels. See the Results section for concrete designs of kernels. Since

{α_{k}}_{k \in K}

has large dimensions, we use incomplete Cholesky decomposition [54] to approximate an

N \times N

Gram matrix with

N \times N^{'}

matrix

L_{k}

as

G_{k} \approx L_{k} L_{k}^{'}

for

N^{'} ≪ N

. We use

0.001

for the value of the tolerance parameter in this approximation. We then have

f = \sum_{k} L_{k} u_{k}

with

u_{k} = L_{k}^{'} α_{k}

and

\frac{λ_{k}}{2} α_{k}^{'} G_{k} α_{k} = \frac{λ_{k}}{2} {∥ u_{k} ∥}_{2}^{2}

. We solve the above optimization problem by applying the limited-memory BFGS algorithm [55] with backtracking and line search stopped by the Armijo condition. The optimization is stopped if the

ℓ_{2}

-norm of the gradient gets smaller than

ϵ_{stop} = 1.0 \times 10^{- 2}

.

Table A1. Mathmatical notations.

Symbols	Description
$N$ , $R, R^{d}$	The sets of natural and real numbers and d-dimensional Euclidean space ( $d \in N$ )
$X_{1} \times X_{2}$	Product set (or product space)
$(X_{1} \times X_{2}, M_{X_{1}} \times M_{X_{2}})$	Measure space with product space and product $σ$ algebra
$v_{1} \otimes v_{2}$	Tensor product
$v^{'}$ , $A^{'}$	Transposition of vector v and matrix A
$a \land b$	The smaller of a and $b \in R$
$a \in A$	Element a of a set $A$
$X \subset Y$	Set inclusion (with possible equality)
$X \cup Y$ , $X \cap Y$	Union and intersection of sets
$\| A \|$	The number of elements in a set $A$
$A \overset{p}{\to}$ B, $A \overset{d}{\to} B$	Convergence in probability and in distribution
$b_{n} \sim o (a_{n})$ , $O (a_{n})$	Asymptotic notations for $n \to \infty$ (see Ref. [56])
$b_{n} \sim o_{p} (a_{n})$ , $O_{p} (a_{n})$	Probabilistic asymptotic notations for $n \to \infty$ (see Ref. [56])
(arg) ${min}_{x \in X} f (x)$	(Element yielding) minimum of $f (x)$ over $X$
(arg) ${max}_{x \in X} f (x)$	(Element yielding) maximum of $f (x)$ over $X$
${sup}_{x \in X} f (x)$	Supremum of $f (x)$ over $X$
$E [\cdot]$ ( $E_{X \sim p} [\cdot]$ )	Expectation of the argument random variable (for the specified distribution)
$\overset{def}{=}$	Equation defining the object on the left-hand side
${∥ v ∥}_{2}$ , ${∥ v ∥}_{H}$	The norm of the metric vector space and the normed vector space $H$
$∥ A ∥$	The operator norm of a linear operator A
$A ⫫ B$ $(\| C_{1}, \dots)$	Independence between A and B (conditioned on $C_{1}, \dots$ )
$i . i . d .$ , ( $\overset{i . i . d .}{\sim}$ )	Independently and identically distributed (objects drawn from the right-hand side)
$\int_{E} f (X) d μ (X)$	Integration of $f (X)$ over set E with respect to measure $μ$ on the space for X
$vec (v_{1}, v_{2} \dots v_{k})$	Natural mapping of vectors into their product space
$\partial_{v} (\cdot)$	Abbreviation of partial derivative $\frac{\partial}{\partial v} (\cdot)$
$\partial_{v} (\cdot) [v_{1}] \|_{v_{2}}$	Gateaux derivative ${lim}_{ϵ \to + 0} ϵ^{- 1} ((\cdot) (v_{2} + ϵ v_{1}) - (\cdot) (v_{2}))$
$Uniform [a, b]$	Uniform probability distribution over the interval $[a, b]$
$Normal (μ, Σ)$	Gaussian probability distribution with mean $μ$ and (co)variance $Σ$

For ML estimation, one can approximate BME by using the Laplace approximation. Assigning the Hessian of the negative log-likelihood with the regularization term (A1) to

\tilde{H}

and regarding the regularization term as the negative log-likelihood of a Gaussian process prior [57], we calculate BME (see, e.g., Chapter 3 of Ref. [26]) using Gaussian integrals as follows:

\begin{matrix} ln BME (M | D) \approx - \sum_{i \in I} \sum_{t \in T_{i}} U (D_{i, t}; \hat{θ}, \sum_{k \in K} L_{k} \hat{u_{k}}) + \sum_{k \in K} \frac{ln λ_{k}}{2} \dim (u_{k}) - \frac{1}{2} ln | \tilde{H} |, \end{matrix}

(A4)

with the solution of Equation (A1),

\hat{θ}

and

\hat{f}

. In the above approximation, the space of functions perpendicular to the range of

G_{k}

does not contribute to U (the representer theorem [22]). Therefore, the contributions of the prior and posterior cancel each other out. Similarly, the space perpendicular to the range of

L_{k}

—but not perpendicular to that of

G_{k}

—makes negligible contributions to U relative to variations in the prior process. In our implementation, linear functions of treatment variables, covariates and bias parameters are treated as one-dimensional functions associated with linear kernels and are, therefore, represented as

L_{k} u_{k}

. We often use

λ_{k} = 0

for linear kernels—namely, a non-informative prior. In this case,

ln λ_{k}

in the BME is removed. This is justified when we compare two models with the same set of linear kernels. When we compare models with different sets of linear kernels, we need to tune

λ_{k} > 0

for the linear kernels.

Next, we describe how the additional nuisance parameters are calculated. Hessians

{\hat{H}}_{θ f}

and

{\hat{H}}_{f f}

are simply obtained as the block components of the numerical Hessian with respect to

{u_{k}}

and b (as well as

κ

and

β

for the latent variable model) because these coordinates are orthogonal. For the validation and held-out datasets in the cross-fitting procedure, we calculate the trained solution

L_{k} {\hat{α}}_{k} = {\hat{u}}_{k}

with the aid of the pseudoinverse of

L_{k}

, and the decomposition

{\bar{L}}_{k} {\bar{L}}_{k}^{'}

for the Gram matrix for both training and validation (held-out) datasets provides new orthogonal coordinates

{\bar{u}}_{k} = {\bar{L}}_{k} {\hat{α}}_{k}

. The Hessian of each of the negative log-likelihoods of the training and validation datasets is calculated in terms of this

{\bar{u}}_{k}

, and

{CVErr}_{H}

is calculated according to the formula in Equation (35).

For the determination of

{\hat{g}}_{k}

, we perform optimization by replacing U with

\begin{matrix} U_{g_{k}} (D_{i, t}; g_{k}) = A_{i k, t} ln (1 + e^{- g_{k} (X_{i, t})}) + (1 - \sum_{ℓ} A_{i ℓ, t}) ln (1 + e^{g_{k} (X_{i, t})}) \end{matrix}

(A5)

in Equation (A1). The cross-validation error

{CVErr}_{g_{k}}

is also calculated via

{\bar{u}}_{k}

according to Equation (36).

In the simulation with latent variable Z, we apply the EM algorithm for optimization (see, e.g., Chapter 9 of Ref. [26]). Since the latent-variable model exhibits indefiniteness with respect to permutation of the range of Z, as do many other models with discrete latent variables, we start optimization from values close to

θ^{*}, f^{*}, β^{*}

and

κ^{*}

. An alternately repeating sequence of updating

r (Z_{i})

with

\begin{matrix} r_{i} (Z) = \frac{exp (- \sum_{t \in T_{i}} U (D_{i, t}, Z; θ, f, κ, β))}{\sum_{\tilde{Z} \in {0, 1}} exp (- \sum_{t \in T_{i}} U (D_{i, t}, \tilde{Z}; θ, f, κ, β))} \end{matrix}

(A6)

and optimizing

θ, f, κ

and

β

\begin{matrix} min_{θ, f, κ, β} \sum_{i \in I} \sum_{Z \in {0, 1}} \sum_{t \in T_{i}} r_{i} (Z) U (D_{i, t}, Z; θ, f, κ, β) + \sum_{k \in K} \frac{λ_{k}}{2} α_{k}^{'} G_{k} α_{k} \end{matrix}

(A7)

maximizes the marginal log-likelihood in Equation (33). In the above equation, the stepwise negative log-likelihood is given by

\begin{matrix} U (D_{i, t}, Z; θ, f, κ, β) & = & - I_{[t, t + Δ t)} (T_{i}) (θ^{'} A_{i t} + f (X_{i t}) + κ Z) + exp (θ^{'} A_{i t} + f (X_{i t}) + κ Z) \\ + δ_{t 0} ln (1 + exp (- (2 Z - 1) (\sum_{j} β_{j} X_{j 0} + β_{0}))), \end{matrix}

(A8)

where the Kronecker delta

δ_{t 0}

is used. The procedure of estimating nuisance parameters is essentially the same as for the model without a latent variable.

For the construction of score functions with

{\hat{H}}_{θ f}

and

{\hat{H}}_{f f}

, we compute the components of

\partial_{f} {ℓ |}_{\hat{θ}, \hat{f}}

proportional to

{e^{θ_{k}}}_{k}

and the one independent of

θ

separately, then apply

{\hat{H}}_{θ f} {({\hat{H}}_{f f} + ζ_{n})}^{- 1}

to all of these components. In the case with a latent variable, we compute the components of

\partial_{f} {ℓ |}_{\hat{θ}, \hat{f}}

proportional to

r_{i} (Z) {e^{θ_{k}}}

and

r_{i} (Z)

separately, then apply

{\hat{H}}_{θ f} {({\hat{H}}_{f f} + ζ_{n})}^{- 1}

. For the case without a latent variable, the obtained linear equations of

{e^{θ_{k}}}

can be algebraically solved. For the case with a latent variable, further numerically finding the root of score equations of the following form is necessary:

\begin{matrix} \frac{1}{n} \sum_{i \in I} \sum_{Z \in {0, 1}} r_{i} [Z] (θ_{1}, θ_{2}) (b_{i k 1} (Z) e^{θ_{1}} + b_{i k 2} (Z) e^{θ_{2}} + b_{i k 3} (Z)) = 0, (k \in K) . \end{matrix}

(A9)

Solving this set of highly nonlinear and nonconvex score equations is not difficult, but we take the approach of linearizing the above equations around the initial estimate (

\hat{θ}

) and solving the resultant linear equation. This is justifiable because the orthogonalization procedure is not designed to adjust higher-order terms around

\hat{θ}

in the first place.

Appendix B. Proof of Proposition 1

Proof.

We have

\begin{matrix} {lim}_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T^{a}) = 1 | T^{a}, T^{a^{'}} > t, A_{(- \infty, t)} = a_{(- \infty, t)}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})}{P (I_{(t, t + δ t]} (T^{a^{'}}) = 1 | T^{a}, T^{a^{'}} > t, A_{(- \infty, t)} = a_{(- \infty, t)}^{'}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})} \\ = & lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T^{a}) = 1 | T^{a}, T^{a^{'}} > t, A_{(- \infty, t + δ t]} = a_{(- \infty, t + δ t]}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})}{P (I_{(t, t + δ t]} (T^{a^{'}}) = 1 | T^{a}, T^{a^{'}} > t, A_{(- \infty, t + δ t]} = a_{(- \infty, t + δ t]}^{'}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})} \\ = & lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{(- \infty, t + δ t]} = a_{(- \infty, t + δ t]}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})}{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{(- \infty, t + δ t]} = a_{(- \infty, t + δ t]}^{'}, X_{(- \infty, t + δ t]} = x_{(- \infty, t + δ t]})} \\ = & lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a_{t}, X_{t} = x_{t})}{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a_{t}^{'}, X_{t} = x_{t})} . \end{matrix}

(A10)

The first equality is due to the absence of an unmeasured confounder (Assumption 2), thanks to which the specification of

A_{[t, t + δ t]}

with any values in the conditioning does not change the conditional probabilities in the numerator and denominator. With this additional conditioning, consistency (Assumption 1) ensures equality between actual and counterfactual variables, which asserts the second equality. The last equality is obtained by first applying the regularity of hazard (Assumption 4), then the independence of past values of treatment variables and covariates (Assumption 5).

Identifying the last line with the ML estimator is straightforward, as described below. Using Assumptions 4 and 5 and Equation (3), the expected log-likelihood is represented with marginal probability measures on

X \times A \times T

as

\begin{matrix} E [Q_{h} (T | X, A)] \\ = & \int_{0}^{\infty} \int_{0}^{C} \int_{X \times A} lim_{δ t \to + 0} \frac{1}{δ t} \{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t}, X_{t}) (θ^{'} A_{t} + f (X_{t})) \\ - exp (θ^{'} A_{t} + f (X_{t})) δ t\} d P (X_{t}, A_{t} | T > t) P (T > t) d t d P (C) + c o n s t \\ = & \sum_{k \in K} \int_{0}^{\infty} \int_{0}^{C} \int_{X} lim_{δ t \to + 0} \frac{1}{δ t} \{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{k, t} = 1, X_{t}) (θ_{k} + f (X_{t})) \\ - exp (θ_{k} + f (X_{t})) δ t\} d P (X_{t} | T > t, A_{k, t} = 1) P (T > t, A_{k, t} = 1) d t d P (C) \\ + \int_{0}^{\infty} \int_{0}^{C} \int_{X} lim_{δ t \to + 0} \frac{1}{δ t} \{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = 0, X_{t}) f (X_{t}) \\ - exp (f (X_{t})) δ t\} d P (X_{t} | T > t, A_{t} = 0) P (T > t, A_{t} = 0) d t d P (C) + c o n s t, \end{matrix}

(A11)

where the second equality is obtained by decomposing the integration domain as

X \times {\cup_{k \in K} {A | A_{k} = 1} \cup {A | A = 0}}

. Here, terms independent of

θ

and f are regarded as constants, and the integration with respect to the marginal probability measures for the data-generating process is denoted by

d P

. Differentiating this representation yields

\begin{matrix} \frac{\partial}{\partial θ_{k}} E [Q_{h} (T | X, A)] & = & \int_{0}^{\infty} \int_{0}^{C} \int_{X \times A} {lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{k, t} = 1, X_{t}) \\ - exp (θ_{k} + f (X_{t}))} d P (X_{t} | T > t, A_{k, t} = 1) P (T > t, A_{k, t} = 1) d t d P (C), \end{matrix}

(A12)

\begin{matrix} lim_{ϵ \to + 0} \frac{1}{ϵ} E [Q_{h_{θ, f + ϵ ψ}} (T | X, A) - Q_{h_{θ, f}} (T | X, A)] \\ = & \int_{0}^{\infty} \int_{0}^{C} \int_{X \times A} {lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t}, X_{t}) ψ (X_{t}) \\ - ψ (X_{t}) exp (θ^{'} A_{t} + f (X_{t}))\} d P (X_{t}, A_{t} | T > t) P (T > t) d t d P (C) . \end{matrix}

(A13)

With Assumptions 6 and 7, it is seen that these derivatives vanish if and only if

\begin{matrix} θ_{k} & = & ln lim_{δ t \to + 0} \frac{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{k, t} = 1, X_{t})}{P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = 0, X_{t})}, \\ f (X_{t}) & = & ln lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = 0, X_{t}) \end{matrix}

(A14)

hold almost everywhere in

X \times [0, \infty)

with respect to

d P (X_{t}, A_{t} | T > t) P (T > t) P (C > t) d t

, and the concavity of the expected log-likelihood ensures that this solution is the maximizer. □

Appendix C. Proof of Proposition 2

Proof.

We refer readers to the next section for a proof of the identification of the gradients and Hessian with elements of

H_{k}

and an operator from

H_{k}

into

H_{k}

. Then, taking the Gateaux derivative of the expected score with respect to

f \in B_{n}

, we have

\begin{matrix} \partial_{f} E [ϕ (D; θ^{*}, (f^{*}, H^{*}))] [f - f^{*}] & = & (H_{θ f}^{*} - H_{θ f}^{*} ({(H_{f f}^{*} + ζ_{n})}^{- 1} H_{f f}^{*}) (f - f^{*}) \\ = & ζ_{n} H_{θ f}^{*} {(H_{f f}^{*} + ζ_{n})}^{- 1} (f - f^{*}) . \end{matrix}

(A15)

Using

H_{θ_{k} f}^{*} = H_{f f}^{*} ρ_{k}

, we see that the norm of this vector is given by

\begin{matrix} ζ_{n} {\{\sum_{k \in K} {({(H_{f f}^{*} + ζ_{n})}^{- 1} H_{f f}^{*} ρ_{k}, f - f^{*})}_{H_{k}}^{2}\}}^{1 / 2} \sim O (n^{- α - β}), \end{matrix}

(A16)

and hence, the proposition is asserted for

α + β > \frac{1}{2}

. Here, we use the fact that

∥ {(H_{f f}^{*} + ζ_{n})}^{- 1} H_{f f}^{*} ∥ \leq 1

. Note that the expected score and the Gateaux derivative of the expected score with respect to H at

(θ, f) = (θ^{*}, f^{*})

are zero because

E [\partial_{(θ, f)} ℓ |_{θ^{*}, f^{*}}] = 0

holds. □

Appendix D. Gradient Functional and Hessian Operator

In this section, we provide the gradient functional and the Hessian operator of the sample-wise negative log-likelihood:

\begin{matrix} ℓ (D; θ, f) = - I_{[0, C]} (T) (θ^{'} A_{T} + f (X_{T})) + \int_{0}^{T \land C} exp (θ^{'} A_{t} + f (X_{t})) d t . \end{matrix}

(A17)

Calculating the Gateaux derivative

\frac{1}{ϵ} (ℓ (D; θ, f + ϵ ψ) - ℓ (D; θ, f))

for an arbitrary element

ψ

of the RKHS, we expect that the gradient functional is given as

\begin{matrix} ψ \in H_{k} \mapsto - I_{[0, C]} (T) ψ (X_{T}) + \int_{0}^{T \land C} ψ (X_{t}) exp (θ^{'} A_{t} + f (X_{t})) d t \in R . \end{matrix}

(A18)

This mapping is actually given as the following inner product:

\begin{matrix} ψ \in H_{k} \mapsto {(- I_{[0, C]} (T) k (X_{T}, \cdot) + \int_{0}^{T \land C} exp (θ^{'} A_{t} + f (X_{t})) k (X_{t}, \cdot) d t, ψ)}_{H_{k}} \in R, \end{matrix}

(A19)

which satisfies the definition in the Frechet sense stated in Proposition 2. Note that the integration of

k (X, \cdot)

over a bounded positive measure in the left argument of the inner product belongs to

H_{k}

[23]. Similarly, the differentiation of the above functional is given by

\begin{matrix} ψ \in H_{k} \mapsto \int_{0}^{T \land C} exp (θ^{'} A_{t} + f (X_{t})) k (X_{t}, \cdot) {(k (X_{t}, \cdot), ψ)}_{H_{k}} d t, \in H_{k} . \end{matrix}

(A20)

Therefore, the Hessian operator is represented as an integration of

k (X, \cdot) \otimes k (X, \cdot)

over a bounded positive measure in a manner similar to the autocovariance operator [23]. Next, let us consider the case with a latent variable. In this case, we represent the sample-wise negative marginal log-likelihood as

\begin{matrix} ℓ (D; θ, f) = \sum_{Z \in Z} r (Z) (ℓ (D | Z; θ, f) + ln r (Z)), \end{matrix}

(A21)

with

\begin{matrix} ℓ (D | Z; θ, f) & = & - I_{[0, C]} (T) (θ^{'} A_{T} + f_{1} (X_{T}, Z)) \\ + \int_{0}^{T \land C} exp (θ^{'} A_{t} + f_{1} (X_{t}, Z)) d t - ln Q (Z | f_{2} (X_{0})), \end{matrix}

(A22)

and

\begin{matrix} r (Z) = \frac{exp (- ℓ (D | Z; θ, f))}{\sum_{\tilde{Z} \in {0, 1}} exp (- ℓ (D | \tilde{Z}; θ, f))}, \end{matrix}

(A23)

where we redefine f as

vec (f_{1}, f_{2})

. In the simulation study, we use

f_{1} (X_{t}, Z) = f (X_{t}) + κ Z + b

and

f_{2} (X_{0}) = \sum_{j} β_{j} X_{j 0} + β_{0}

. Direct computation shows that the gradient of this marginal negative log-likelihood is given by

\begin{matrix} \partial_{f} ℓ (D; θ, f) = E_{Z \sim r} [\partial_{f} ℓ (D | Z; θ, f)] \end{matrix}

(A24)

and the Hessian is given by

\begin{matrix} \partial_{f} \partial_{f} ℓ (D; θ, f) & = & E_{Z \sim r} [\partial_{f} \partial_{f} ℓ (D | Z; θ, f)] + E_{Z \sim r} [\partial_{f} ℓ (D | Z; θ, f)] \otimes E_{Z \sim r} [\partial_{f} ℓ (D | Z; θ, f)] \\ - E_{Z \sim r} [\partial_{f} ℓ (D | Z; θ, f) \otimes \partial_{f} ℓ (D | Z; θ, f)] . \end{matrix}

(A25)

Appendix E. The Validity of Assumption 9

As we have seen in the previous section,

\hat{H_{θ f}}

and

\hat{H_{f f}}

are represented as the sum of terms of the forms

\hat{E} [\int exp ({\hat{θ}}^{'} A_{t} + \hat{f} (X_{t})) A_{t} \otimes k (X_{t}, \cdot) d t]

and

\hat{E} [\int exp ({\hat{θ}}^{'} A_{t} + \hat{f} (X_{t})) k (X_{t}, \cdot) \otimes k (X_{t}, \cdot) d t]

, respectively. For a latent-variable model, the empirical estimators also involve terms such as

\hat{E} [(I_{0, C} (T) k (X_{T}, \cdot) - \int exp ({\hat{θ}}^{'} A_{t} + \hat{f} (X_{t})) k (X_{t}, \cdot) d t) \otimes (I_{0, C} (T) k (X_{T}, \cdot) - \int exp ({\hat{θ}}^{'} A_{t} + \hat{f} (X_{t})) k (X_{t}, \cdot) d t)]

. The

O_{p} (n^{- 1 / 2})

norm convergence of empirically averaged operators of the same form but without the integration of a stochastic process to their large-sample limit is known (see Section 9.1 of Ref. [22]; Ref. [58] for a stronger convergence in the Hilbert–Schmidt norm). We believe that a similar convergence result can be obtained, but we leave this as a conjecture in order to avoid the argument about the stochastic process

{(A_{i t}, X_{i t})}_{t}

, which is not the focus of the current study. From the representation of the Hessian operator, it is also justified that

\begin{matrix} ∥E [\partial_{(θ, f)} \partial_{(θ, f)} ℓ |_{\hat{θ}, \hat{f}}] - H^{*}∥ \sim O_{p} (n^{- β}) \end{matrix}

(A26)

because

f^{*}

is a bounded function and we have

\begin{matrix} sup_{X \in X} | f (X) - f^{*} (X) | = sup_{X \in X} {(f - f^{*}, k (X, \cdot))}_{H_{k}} \leq sup_{X \in X} {∥ f - f^{*} ∥}_{H_{k}} k {(X, X)}^{1 / 2} . \end{matrix}

(A27)

From the representation of the Hessian, it is also reasonably justified that

H_{f θ_{k}}^{*} = H_{f f}^{*} ρ_{k}

holds. Let us define the following marginal probability measures for

{}^{\forall}E \in M_{X}

:

\begin{matrix} ν_{0} (E) = \int_{0}^{\infty} \int_{0}^{C} \int_{E} exp (θ^{'} A_{t} + f (X_{t})) d P (X_{t}, A_{t} | T > t) P (T > t) d t d P (C), \end{matrix}

(A28)

\begin{matrix} ν_{k} (E) = \int_{0}^{\infty} \int_{0}^{C} \int_{E} A_{k} exp (θ^{'} A_{t} + f (X_{t})) d P (X_{t}, A_{t} | T > t) P (T > t) d t d P (C) . \end{matrix}

(A29)

Suppose that

ν_{k}

is absolutely continuous with respect to

ν_{0}

and the Radom–Nykodym derivative

\frac{d ν_{k}}{d ν_{0}} = ρ_{k}

belongs to

H_{k}

. Then, we can see

H_{f θ_{k}}^{*} = H_{f f}^{*} ρ_{k}

. This scenario is plausible enough. It should be noted that the sample-wise Hessian

\partial_{θ_{k}} \partial_{f} ℓ

, itself, is very unlikely to lie in the range of

H_{f f}^{*}

. Comparing the above result with Proposition 3, one can see that, roughly speaking, the two methods based on Proposition 2 and 3 perform density-ratio estimation in two different ways—namely, least-square regression and logistic regression (cf., Ref. [59] for multiple methods of density-ratio estimation based on the kernel method).

For

H_{θ f}^{*}

and

H_{f f}^{*}

of the latent-variable model, we cannot represent

ρ_{k}

as a density ratio, but it is still plausible enough that

H_{θ_{k} f}^{*}

and the functions in the range of

H_{f f}^{*}

are the integral of

k (X, \cdot)

over some smooth density on

X

in commonly studied settings and

H_{θ_{k}}^{*} = H_{f f}^{*} ρ_{k}

holds.

Appendix F. Proof of Proposition 3

Proof.

Consider the following hazard model that includes the heterogeneous treatment effect (i.e., covariate-dependent treatment effect):

\begin{matrix} {\tilde{h}}_{θ, f, f_{het}} (t | A_{t}, X_{t}) = exp (θ^{'} A_{t} + f (X_{t}) + f_{het} (X_{t}) A_{k, t}) . \end{matrix}

(A30)

Because we assumed the homogeneity of the treatment effect (Assumption 6), we have, for arbitrary

ψ

,

\begin{matrix} lim_{ϵ \to + 0} & \frac{1}{ϵ} E [ln Q_{{\tilde{h}}_{θ^{*}, f^{*}, ϵ ψ}} (T | A, X) - ln Q_{{\tilde{h}}_{θ^{*}, f^{*}, 0}} (T | A, X)] \\ = E [I_{[0, C]} (T) A_{k, T} ψ (X_{T}) - \int_{0}^{T \land C} A_{k, t} ψ (X_{t}) exp (θ_{k}^{*} A_{k, t} + f^{*} (X_{t})) d t] \\ = 0 . \end{matrix}

(A31)

Note that Assumption 8 assures that the argument of the Gateaux derivative

ψ

does not need to belong to

M

. Let the first line on the right-hand side of Equation (26) be

ϕ_{k, 1} (D; θ, (f, g_{k}))

and the second line be

ϕ_{k, 2} (D; θ, (f, g_{k}))

. Note that

E [ϕ_{k, 1}]

with

θ_{k} = θ_{k}^{*}

and

f = f^{*}

coincides with the above equation with

ψ (X_{t}) = e^{- θ_{k}^{*}} (1 + e^{- g_{k} (X_{t})})

. Similarly, for arbitrary

ψ

, we have

\begin{matrix} {lim}_{ϵ \to + 0} & \frac{1}{ϵ} E [ln Q_{h_{θ^{*}, f^{*} + ϵ ψ}} (T | A, X) - ln Q_{h_{θ^{*}, f^{*}}} (T | A, X)] \\ = E [I_{[0, C]} (T) ψ (X_{T}) - \int_{0}^{T \land C} ψ (X_{t}) exp (θ^{*'} A_{t} + f^{*} (X_{t})) d t] \\ = 0 . \end{matrix}

(A32)

Subtracting the sum of Equation (A31) for all k from Equation (A32) for

ψ (X_{t}) = 1 + e^{g_{k} (X_{t})}

, we obtain

E [ϕ_{k, 2} (D; θ^{*}, (f^{*}, g_{k}))] = 0

. Hence,

E [ϕ_{k} (D; θ^{*}, (f^{*}, g_{k}))] = 0

has been proven.

Next, we examine the Gateaux derivatives of the expected scores with respect to the nuisance parameters

(f, g_{k})

. First, the derivative of

E [ϕ_{k}]

with respect to

g_{k}

at

θ_{k} = θ_{k}^{*}

and

f = f^{*}

is

\begin{matrix} {lim}_{ϵ \to + 0} \frac{1}{ϵ} E [ϕ_{k} (D; θ^{*}, (f^{*}, g_{k} + ϵ ψ)) - ϕ_{k} (D; θ^{*}, (f^{*}, g_{k}))] \\ = E [- I_{[0, C]} (T) A_{k, T} ψ (X_{T}) e^{- θ_{k}^{*} - g_{k} (X_{T})} + \int_{0}^{T \land C} A_{k, t} ψ (X_{t}) e^{f^{*} (X_{t}) - g_{k} (X_{t})} d t] \\ + E [\int_{0}^{T \land C} (1 - \sum_{ℓ \in K} A_{ℓ, t}) ψ (X_{t}) e^{f^{*} (X_{t}) + g_{k} (X_{t})} d t - (1 - \sum_{ℓ \in K} A_{ℓ, T}) ψ (X_{T}) e^{g_{k} (X_{T})} d t] \\ = 0 . \end{matrix}

(A33)

The combination of Equations (A31) and (A32), again, proves the last equality. The derivative with respect to f at

θ = θ^{*}

,

f = f^{*}

and

g_{k} = g_{k}^{*}

is given by

\begin{matrix} {lim}_{ϵ \to + 0} {\frac{1}{ϵ} E [ϕ_{k} (D; θ^{*}, (f^{*} + ϵ ψ, g_{k}^{*}) - ϕ_{k} (D; θ^{*}, (f^{*}, g_{k}^{*}))]|}_{f = f^{*}} \\ = & E [\int_{0}^{T \land C} (1 - \sum_{ℓ \in K} A_{ℓ, t}) ψ (X_{t}) e^{f^{*} (X_{t})} (1 + e^{g_{k}^{*} (X_{t})}) - A_{k, t} ψ (X_{t}) e^{f^{*} (X_{t})} (1 + e^{- g_{k}^{*} (X_{t})}) d t] \\ = 0 . \end{matrix}

(A34)

Equation (25) shows that the expected values of the two integrands in the second line cancel each other out, which proves the last equality. □

Appendix G. Consideration of the Score Regularity and the Quality of Estimation of Nuisance Parameters Required for DML

Chernozhukov et al. [19] provided additional conditions for score regularity and convergence of nuisance parameters so that debiasing works (Assumptions 3.3 and 3.4 in [19]). Assumption 3.3 imposes a straightforward regularity condition for non-degeneracy concerning

\partial_{θ} E [ϕ (D; θ, η^{*})] |_{θ^{*}}

, which we can reasonably assume for ML estimation with identifiable models. Assumption 3.4 additionally imposes the following nontrivial conditions, together with conventionally used conditions on the covering number and non-degeneracy of the model.

Proposition A1.

(Assumption 3.4(c) of [19]). Assume that Θ is bounded and the suitably defined norm of all derivatives of the expected log-likelihood up to the third order with respect to θ and f are bounded in the product set of Θ and

N_{\bar{n}}

(for some

\bar{n} \in N

). Concretely, define a tensor,

\begin{matrix} \partial_{f} \partial_{f} \partial_{f} ℓ : (ψ_{1}, ψ_{2}) \in H_{k} \times H_{k} \mapsto ψ_{3} \in H_{k}, \end{matrix}

(A35)

in the same manner as for the gradients and Hessian and assume

∥ \partial_{f} \partial_{f} \partial_{f} {ℓ |}_{θ \in Θ, f \in N_{\bar{n}}} (ψ_{1}, ψ_{2}) ∥_{H_{k}} \leq C ∥ ψ_{1} ∥_{H_{k}} {∥ ψ_{2} ∥}_{H_{k}}

for some positive constant C. Also make the same sort of assumptions for the other derivatives as well.

Then, with three sequences of positive numbers converging to zero—

{δ_{n}}_{n}

,

{Δ_{n}}_{n}

and

{τ_{n}}_{n}

—the following three conditions are satisfied for the orthogonal scores defined in Proposition 2:

\begin{matrix} sup_{η \in N_{n}, θ \in Θ} ∥ E [ϕ (D; θ, η) - ϕ (D; θ, η^{*})] ∥ \leq δ_{n} τ_{n}, \\ r_{n} & = & sup_{η \in N_{n}, ∥ θ - θ^{*} ∥ \leq τ_{n}} E [∥ ϕ (D; θ, η) - ϕ (D; θ^{*}, η^{*}) {∥^{2}]}^{1 / 2} a n d r_{n} {ln}^{1 / 2} (1 / r_{n}) \leq δ_{n}, \\ sup_{r \in (0, 1), η \in N_{n}, ∥ θ - θ^{*} ∥ \leq τ_{n}} ∥ \partial_{r}^{2} E [ϕ (D; θ^{*} + r (θ - θ^{*}), η^{*} + r (η - η^{*}))] ∥ \leq δ_{n} n^{- 1 / 2} . \end{matrix}

(A36)

Proof.

The quantity in the first line is bounded as

\begin{matrix} ∥ E [ϕ (D; θ, η) - ϕ (D; θ, η^{*})] ∥ \\ = & ∥ E [\partial_{θ} ℓ |_{θ, f} - H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} \partial_{f} ℓ |_{θ, f} - \partial_{θ} ℓ |_{θ, f^{*}} + H_{θ f}^{*} {(H_{f f}^{*} + ζ_{n})}^{- 1} \partial_{f} ℓ |_{θ, f^{*}}] ∥ \\ \leq & ∥ E [\partial_{θ} {ℓ |}_{θ, f} - \partial_{θ} {ℓ |}_{θ, f^{*}}] ∥ + ∥ E [H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} (\partial_{f} {ℓ |}_{θ, f} - \partial_{f} {ℓ |}_{θ, f^{*}})] ∥ \\ + & ∥ E [(H_{θ f} - H_{θ f}^{*}) {(H_{f f}^{*} + ζ_{n})}^{- 1} \partial_{f} ℓ |_{θ, f^{*}}] ∥ \\ + & ∥ E [H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} (H_{f f} - H_{f f}^{*}) {(H_{f f}^{*} + ζ_{n})}^{- 1} \partial_{f} ℓ |_{θ, f^{*}}] ∥ . \end{matrix}

(A37)

For the evaluation of each term, we first use

\begin{matrix} ∥ H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} ∥ & = & lim_{m \to \infty} ∥ H_{θ f, m} {(H_{f f, m} + ζ_{n})}^{- 1} ∥ \\ = & lim_{m \to \infty} ∥ H_{θ θ, m}^{1 / 2} W_{θ f, m} H_{f f, m}^{1 / 2} {(H_{f f, m} + ζ_{n})}^{- 1} ∥ \\ \sim & O (n^{\frac{α}{2}}) . \end{matrix}

(A38)

Here, we represent the compact positive-semidefinite operator H as the limit of a sequence of positive-definite, finite-rank operators (therefore, represented as matrices), and we use the decomposition of

H_{θ f, m} = H_{θ θ, m}^{1 / 2} W_{θ f, m} H_{f f, m}^{1 / 2}

with

∥ W_{θ f, m} ∥ \leq 1

. Then, we can evaluate the terms of the upper bound in Equation (A37) as

O (n^{- β})

,

O (n^{\frac{α}{2} - β})

,

O (n^{α - β})

and

O (n^{\frac{3 α}{2} - β})

. From this evaluation, one can see that

β > \frac{3}{2} α

is sufficient for the existence of

δ_{n}

and

τ_{n}

that satisfy the first line of Equation (A36).

The norm within the expectation of the second line of Equation (A36) is similarly bounded as

\begin{matrix} E ∥ ϕ (D; θ, η) - ϕ (D; θ^{*}, η^{*}) ∥^{2} \\ = & E ∥ \partial_{θ} {ℓ |}_{θ, f} - H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} \partial_{f} {ℓ |}_{θ, f} - \partial_{θ} {ℓ |}_{θ^{*}, f^{*}} + H_{θ f}^{*} {(H_{f f}^{*} + ζ_{n})}^{- 1} \partial_{f} ℓ {|_{θ^{*}, f^{*}} ∥}^{2} \\ \leq & E ∥ \partial_{θ} {ℓ |}_{θ, f} - \partial_{θ} {ℓ |}_{θ^{*}, f^{*}} ∥^{2} + E ∥ H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} (\partial_{f} {ℓ |}_{θ, f} - \partial_{f} {ℓ |}_{θ^{*}, f^{*}} {) ∥}^{2} \\ + & E ∥ (H_{θ f} - H_{θ_{f}}^{*}) {(H_{f f}^{*} + ζ_{n})}^{- 1} \partial_{f} ℓ {|_{θ^{*}, f^{*}} ∥}^{2} \\ + & E ∥ H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} (H_{f f} - H_{f f}^{*}) {(H_{f f}^{*} + ζ_{n})}^{- 1} \partial_{f} ℓ {|_{θ^{*}, f^{*}} ∥}^{2} \end{matrix}

(A39)

and similarly evaluated as

O (τ_{n}^{2})

,

O (n^{α} τ_{n}^{2})

,

O (n^{2 (- β + α)})

and

O (n^{3 α - 2 β})

. Together with the condition obtained above, we see that values for

α

,

β

and

τ_{n}

satisfying

n^{\frac{3}{2} α - β} < τ_{n} < n^{- \frac{1}{2} α}

is sufficient for the existence of

δ_{n}

satisfying the second line of Equation (A36), which requires

β > 2 α

.

Finally, the upper bound for the expected value of the twice Gateaux derivative is written down with

δ θ = (θ - θ^{*})

and

δ f = (f - f^{*})

, etc., together with slight notational abuse for brevity:

\begin{matrix} ∥ \partial_{r}^{2} E [ϕ (D; θ^{*} + r (θ - θ^{*}), η^{*} + r (η - η^{*}))] ∥ \\ \leq & ∥ E [\partial_{θ θ θ} ℓ δ θ δ θ] ∥_{2} + 2 ∥ E [\partial_{θ θ f} ℓ δ θ δ f] ∥_{2} + {∥ E [\partial_{θ f f} ℓ δ f δ f] ∥}_{2} \\ + & 2 ∥ E [δ H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} (\partial_{f θ} ℓ δ θ + \partial_{f f} ℓ δ f)] ∥_{2} \\ + & 2 ∥ E [H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} δ H_{f f} {(H_{f f} + ζ_{n})}^{- 1} δ H_{f f} {(H_{f f} + ζ_{n})}^{- 1} \partial_{f} ℓ] ∥_{2} \\ + & 2 ∥ E [H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} δ H_{f f} {(H_{f f} + ζ_{n})}^{- 1} (\partial_{f θ} ℓ δ θ + \partial_{f f} ℓ δ f)] ∥_{2} \\ + & ∥ E [H_{θ f} {(H_{f f} + ζ_{n})}^{- 1} (\partial_{f θ θ} ℓ δ θ δ θ + 2 \partial_{f θ f} ℓ δ θ δ f + \partial_{f f f} ℓ δ f δ f)] ∥_{2} . \end{matrix}

(A40)

Evaluating each term in the same manner as for the previous conditions, we conclude values for

α

,

β

and

τ_{n}

satisfying

τ_{n} n^{\frac{3}{2} α - β}, n^{\frac{3}{2} α - 2 β}, τ_{n}^{2} n^{\frac{α}{2}} \sim o (n^{- \frac{1}{2}})

is sufficient for the existence of

δ_{n}

satisfying the third line of Equation (A36). Inserting the lower bound of

τ_{n}

with respect to

α

and

β

due to the second condition, one can see that

α

should satisfy

\frac{1}{2} - β < α < \frac{4}{7} β - \frac{1}{7}, \frac{1}{2} β

, which yields

β > \frac{63}{154}

. The existence of a set of values for

α

and

β

satisfying the above conditions proves the assertion. Here, note that terms with convergence slower than any positive power of n in the assertion do not take effect as long as we deal with strict inequalities of exponents of n. □

Remark A1.

For the orthogonal score presented in Proposition 3, the relationship between the above conditions and the convergence of f and

g_{k}

can be easily seen. Thus, we omit writing it down to avoid redundancy.

Appendix H. Additional Consideration of Causal Interpretation of Hazard Ratios

Remark A2.

Suppose that Assumption 2 holds for arbitrary

ϵ > 0

. Then, for arbitrary

t > t^{'}

, we obtain

\begin{matrix} lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T^{a}) = 1 | T^{a} > t, X_{(- \infty, t + δ t]}, A_{(- \infty, t^{'})} = a_{(- \infty, t^{'})}) \\ = lim_{δ t \to + 0} \frac{1}{δ t} P (I_{(t, t + δ t]} (T) = 1 | T > t, A_{t} = a_{t}, X_{t}), \end{matrix}

(A41)

with the aid of Assumptions 1–5 in the same manner as for Equations (6) and (A10). If we can assume

A_{(- \infty, t^{'})} = a_{(- \infty, t^{'})} = 0

with a probability of one at some

t^{'}

, the above equation implies that the counterfactual survival function is given by

\begin{matrix} S^{a} (t) = E [exp (- \int_{0}^{t} exp (θ^{*'} a_{s} + f^{*} (X_{s})) d s)] . \end{matrix}

(A42)

Although measures of causal effects can be calculated from this relation, doing so requires an unbiased estimation of

f^{*}

, which is possible only if a simple model can be used to estimate f without modeling error.

Remark A3.

As discussed by Martinussen et al. (2022) [6], of more interest is the quantity, i.e.,

\begin{matrix} lim_{δ t \to + 0} ln \frac{P (I_{(t, t + δ t]} (T^{a}) = 1 | T^{a}, T^{a^{'}} > t, X_{(- \infty, t + δ t]}^{a} = X_{(- \infty, t + δ t]}^{a^{'}})}{P (I_{(t, t + δ t]} (T^{a^{'}}) = 1 | T^{a}, T^{a^{'}} > t, X_{(- \infty, t + δ t]}^{a} = X_{(- \infty, t + δ t]}^{a^{'}})}, \end{matrix}

(A43)

for two arbitrary counterfactual treatment schedules that branch before t. A stronger untestable assumption is required for this to be related to the hazard ratios. For example,

\begin{matrix} T ⫫ T^{a}, X^{a} & | & T > t, {A_{s}}_{s \geq t}, {X_{s}}_{s \geq t} \\ T^{a} ⫫ A, T, X & | & T^{a} > t, {X_{s}^{a}}_{s \geq t}, \\ T^{a} ⫫ T^{a^{'}}, X^{a^{'}} & | & T^{a} > t, {X_{s}^{a}}_{s \geq t} \end{matrix}

(A44)

for any t and any counterfactual treatment schedules a and

a^{'}

. This condition is satisfied if no unmeasured factors affect the outcome, as described by the causal graph in Figure A1A. If unmeasured factors affect the outcome, as described by the causal graph in Figure A1B, the above conditions are not satisfied. Although the assumption of time homogeneity (Assumption 7) is strong and does not allow temporal change in the heterogeneity of risk to be left unmodeled, it does not completely exclude the existence of unmeasured factors affecting outcome. For example, suppose that the survival of subjects for an arbitrary counterfactual treatment schedule a is described by

\begin{matrix} P (T_{i}^{a} > t) = exp (- \int_{0}^{t} exp (θ^{' *} a_{s} + f^{*} (X_{i, s})) (1 + W_{i, s}) d s), X_{i, s} = X_{i, s}^{a}, W_{i, s} = W_{i, s}^{a}, \end{matrix}

(A45)

where

W_{i, t} = \sum_{j} δ (t - t_{i j})

represents instantaneous random fluctuations in the conditional hazard described by Dirac delta functions centered at time points determined by a homogeneous Poisson point process

{t_{i j}}_{j}

. The subject-wise random effect of

W_{i}

shared between

T, T^{a}

and

T^{a^{'}}

then violates Equation (A44), whereas

W_{i}

at different time points is independent and does not influence time homogeneity.

Figure A1. Two causal graphs describing the causal relationships among the counting process for outcome event

N_{t} = I_{(- \infty, t]} (T)

, treatment variable

A_{t}

, and covariate

X_{t}

with (A) or without (B) unmeasured factors

W_{t}

affecting the outcome, which bifurcate into an actual process and a counterfactual process at time t. Remark A3 discusses how the difference between the two causal relationships affect the interpretation of the estimation results in the proposed framework.

Figure A1. Two causal graphs describing the causal relationships among the counting process for outcome event

N_{t} = I_{(- \infty, t]} (T)

, treatment variable

A_{t}

, and covariate

X_{t}

with (A) or without (B) unmeasured factors

W_{t}

affecting the outcome, which bifurcate into an actual process and a counterfactual process at time t. Remark A3 discusses how the difference between the two causal relationships affect the interpretation of the estimation results in the proposed framework.

References

Lin, R.S.; Lin, J.; Roychoudhury, S.; Anderson, K.M.; Hu, T.; Huang, B.; Leon, L.F.; Liao, J.J.; Liu, R.; Luo, X.; et al. Alternative Analysis Methods for Time to Event Endpoints Under Nonproportional Hazards: A Comparative Analysis. Stat. Biopharm. Res. 2020, 12, 187–198. [Google Scholar] [CrossRef]
Bartlett, J.W.; Morris, T.P.; Stensrud, M.J.; Daniel, R.M.; Vansteelandt, S.K.; Burman, C.F. The Hazards of Period Specific and Weighted Hazard Ratios. Stat. Biopharm. Res. 2020, 12, 518–519. [Google Scholar] [CrossRef]
Hernán, M.A. The Hazards of Hazard Ratios. Epidemiology 2010, 21, 13–15. [Google Scholar] [CrossRef]
Aalen, O.O.; Cook, R.J.; Røysland, K. Does Cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Anal. 2015, 21, 579–593. [Google Scholar] [CrossRef] [PubMed]
Martinussen, T.; Vansteelandt, S.; Andersen, P.K. Subtleties in the interpretation of hazard contrasts. Lifetime Data Anal. 2020, 26, 833–855. [Google Scholar] [CrossRef] [PubMed]
Martinussen, T. Causality and the Cox Regression Model. Annu. Rev. Stat. Its Appl. 2022, 9, 249–259. [Google Scholar] [CrossRef]
Prentice, R.L.; Aragaki, A.K. Intention-to-treat comparisons in randomized trials. Stat. Sci. 2022, 37, 380–393. [Google Scholar] [CrossRef]
Ying, A.; Xu, R. On Defense of the Hazard Ratio. arXiv 2023, arXiv:2307.11971. Available online: http://arxiv.org/abs/2307.11971 (accessed on 22 July 2025). [CrossRef]
Fay, M.P.; Li, F. Causal interpretation of the hazard ratio in randomized clinical trials. Clin. Trials 2024, 21, 623–635. [Google Scholar] [CrossRef] [PubMed]
Rufibach, K. Treatment effect quantification for time-to-event endpoints—Estimands, analysis strategies, and beyond. Pharm. Stat. 2019, 18, 145–165. [Google Scholar] [CrossRef]
Kloecker, D.E.; Davies, M.J.; Khunti, K.; Zaccardi, F. Uses and Limitations of the Restricted Mean Survival Time: Illustrative Examples From Cardiovascular Outcomes and Mortality Trials in Type 2 Diabetes. Ann. Intern. Med. 2020, 172, 541–552. [Google Scholar] [CrossRef] [PubMed]
Snapinn, S.; Jiang, Q.; Ke, C. Treatment effect measures under nonproportional hazards. Pharm. Stat. 2023, 22, 181–193. [Google Scholar] [CrossRef]
Cui, Y.; Kosorok, M.R.; Sverdrup, E.; Wager, S.; Zhu, R. Estimating heterogeneous treatment effects with right-censored data via causal survival forests. J. R. Stat. Soc. Ser. Stat. Methodol. 2023, 85, 179–211. [Google Scholar] [CrossRef]
Xu, S.; Cobzaru, R.; Finkelstein, S.N.; Welsch, R.E.; Ng, K.; Shahn, Z. Estimating Heterogeneous Treatment Effects on Survival Outcomes Using Counterfactual Censoring Unbiased Transformations. arXiv 2024, arXiv:2401.11263. Available online: http://arxiv.org/abs/2401.11263 (accessed on 22 July 2025).
Frauen, D.; Schröder, M.; Hess, K.; Feuerriegel, S. Orthogonal Survival Learners for Estimating Heterogeneous Treatment Effects from Time-to-Event Data. arXiv 2025, arXiv:2505.13072. Available online: http://arxiv.org/abs/2505.13072 (accessed on 22 July 2025).
Leviton, A.; Loddenkemper, T. Design, implementation, and inferential issues associated with clinical trials that rely on data in electronic medical records: A narrative review. BMC Med. Res. Methodol. 2023, 23, 271. [Google Scholar] [CrossRef]
Hernán, M.A.; Brumback, B.; Robins, J.M. Marginal Structural Models to Estimate the Joint Causal Effect of Nonrandomized Treatments. J. Am. Stat. Assoc. 2001, 96, 440–448. [Google Scholar] [CrossRef]
Van der Laan, M.J.; Rose, S. Targeted Learning in Data Science; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/debiased machine learning for treatment and structural parameters. Econom. J. 2018, 21, C1–C68. [Google Scholar] [CrossRef]
Ahrens, A.; Chernozhukov, V.; Hansen, C.; Kozbur, D.; Schaffer, M.; Wiemann, T. An Introduction to Double/Debiased Machine Learning. arXiv 2025, arXiv:2504.08324. Available online: http://arxiv.org/abs/2504.08324 (accessed on 22 July 2025). [CrossRef]
Ren, J.J.; Zhou, M. Full likelihood inferences in the Cox model: An empirical likelihood approach. Ann. Inst. Stat. Math. 2011, 63, 1005–1018. [Google Scholar] [CrossRef]
Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
Fukumizu, K.; Song, L.; Gretton, A. Kernel Bayes’ Rule: Bayesian Inference with Positive Definite Kernels. J. Mach. Learn. Res. 2013, 14, 3753–3783. [Google Scholar]
Yang, S.; Eaton, C.B.; Lu, J.; Lapane, K.L. Application of marginal structural models in pharmacoepidemiologic studies: A systematic review. Pharmacoepidemiol. Drug Saf. 2014, 23, 560–571. [Google Scholar] [CrossRef] [PubMed]
Robins, J.M.; Hernán, M.Á.; Brumback, B. Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology 2000, 11, 550–560. [Google Scholar] [CrossRef]
Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
van der Laan, M.J.; Petersen, M.L.; Joffe, M.M. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. Int. J. Biostat. 2005, 1. [Google Scholar] [CrossRef]
Hille, E.; Phillips, R.S. Functional Analysis and Semi-Groups; 3rd Printing of Rev. Ed. of 1957; Colloquium Publications: Kyiv, Ukraine, 1974; Volume 31. [Google Scholar]
Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 2004, 5, 27–72. [Google Scholar]
Suzuki, T.; Sugiyama, M. Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness. Ann. Stat. 2013, 41, 1381–1405. [Google Scholar] [CrossRef]
Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404. [Google Scholar] [CrossRef]
Bach, F.R. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 2008, 9, 1179–1225. [Google Scholar]
Meier, L.; Van de Geer, S.; Bühlmann, P. High-dimensional additive modeling. Ann. Stat. 2009, 37, 3779–3821. [Google Scholar] [CrossRef]
Koltchinskii, V.; Yuan, M. Sparsity in multiple kernel learning. Ann. Stat. 2010, 38, 3660–3695. [Google Scholar] [CrossRef]
Cox, D.R. Regression Models and Life-Tables. J. R. Stat. Soc. Ser. B (Methodol.) 1972, 34, 187–202. [Google Scholar] [CrossRef]
Efron, B. The Efficiency of Cox’s Likelihood Function for Censored Data. J. Am. Stat. Assoc. 1977, 72, 557–565. [Google Scholar] [CrossRef]
Oakes, D. The Asymptotic Information in Censored Survival Data. Biometrika 1977, 64, 441–448. [Google Scholar] [CrossRef]
Thackham, M.; Ma, J. On maximum likelihood estimation of the semi-parametric Cox model with time-varying covariates. J. Appl. Stat. 2020, 47, 1511–1528. [Google Scholar] [CrossRef]
Luo, J.; Rava, D.; Bradic, J.; Xu, R. Doubly robust estimation under a possibly misspecified marginal structural Cox model. Biometrika 2024, 112, asae065. [Google Scholar] [CrossRef]
Zhang, Z.; Stringer, A.; Brown, P.; Stafford, J. Bayesian inference for Cox proportional hazard models with partial likelihoods, nonlinear covariate effects and correlated observations. Stat. Methods Med. Res. 2023, 32, 165–180. [Google Scholar] [CrossRef] [PubMed]
Inoue, K.; Adomi, M.; Efthimiou, O.; Komura, T.; Omae, K.; Onishi, A.; Tsutsumi, Y.; Fujii, T.; Kondo, N.; Furukawa, T.A. Machine learning approaches to evaluate heterogeneous treatment effects in randomized controlled trials: A scoping review. J. Clin. Epidemiol. 2024, 176, 111538. [Google Scholar] [CrossRef]
Ma, J.; Heritier, S.; Lô, S.N. On the maximum penalized likelihood approach for proportional hazard models with right censored survival data. Comput. Stat. Data Anal. 2014, 74, 142–156. [Google Scholar] [CrossRef]
Allman, E.S.; Matias, C.; Rhodes, J.A. Identifiability of parameters in latent structure models with many observed variables 2009.
Allman, E.S.; Rhodes, J.A.; Stanghellini, E.; Valtorta, M. Parameter identifiability of discrete Bayesian networks with hidden variables. J. Causal Inference 2015, 3, 189–205. [Google Scholar] [CrossRef]
Gassiat, E.; Cleynen, A.; Robin, S. Inference in finite state space non parametric Hidden Markov Models and applications. Stat. Comput. 2016, 26, 61–71. [Google Scholar] [CrossRef]
Gassiat, E.; Rousseau, J. Nonparametric finite translation hidden Markov models and extensions. Bernoulli 2016, 22, 193–212. [Google Scholar] [CrossRef]
Wieland, F.G.; Hauber, A.L.; Rosenblatt, M.; Tönsing, C.; Timmer, J. On structural and practical identifiability. Curr. Opin. Syst. Biol. 2021, 25, 60–69. [Google Scholar] [CrossRef]
Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: Cambridge, UK, 2009; Volume 25. [Google Scholar]
Calderhead, B.; Girolami, M. Estimating Bayes factors via thermodynamic integration and population MCMC. Comput. Stat. Data Anal. 2009, 53, 4028–4045. [Google Scholar] [CrossRef]
Watanabe, S. A widely applicable Bayesian information criterion. J. Mach. Learn. Res. 2013, 14, 867–897. [Google Scholar]
Drton, M.; Plummer, M. A Bayesian Information Criterion for Singular Models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 323–380. [Google Scholar] [CrossRef]
Moral, P. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications; Springer: Berlin/Heidelberg, Germany, 2004. [Google Scholar]
Chopin, N.; Papaspiliopoulos, O. An Introduction to Sequential Monte Carlo; Springer: Berlin/Heidelberg, Germany, 2020; Volume 4. [Google Scholar]
Bach, F.; Jordan, M. Kernel independent component analysis. J. Mach. Learn. Res. 2003. [Google Scholar]
Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528. [Google Scholar] [CrossRef]
Janson, S. Probability asymptotics: Notes on notation. arXiv 2011, arXiv:1108.3924. Available online: http://arxiv.org/abs/1108.3924 (accessed on 22 July 2025). [CrossRef]
Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006; Volume 2. [Google Scholar]
Fukumizu, K.; Bach, F.R.; Gretton, A. Statistical Consistency of Kernel Canonical Correlation Analysis. J. Mach. Learn. Res. 2007, 8, 361–383. [Google Scholar]
Kanamori, T.; Suzuki, T.; Sugiyama, M. Theoretical analysis of density ratio estimation. Ieice Trans. Fundam. Electron. Commun. Comput. Sci. 2010, 93, 787–798. [Google Scholar] [CrossRef]

Figure 1. Numerical results for a clinically plausible data generation process. (A–C): The mean and standard error of the natural logarithms of the Bayes factors for (A) the correctly specified model vs. the misspecified model lacking

f_{2} (X_{2 t})

, (B) the misspecified models with vs. without the inclusion of

f_{t} (t)

and (C) the correctly specified models with vs. without the inclusion of

f_{t} (t)

, calculated with ten bootstrap datasets and different regularization parameters for Gaussian kernels

λ_{g}

. The other hyperparameters are fixed to the optimal values for the smaller model. The log-Bayes factors are calculated by subtracting the log-BME of the larger model from that of the smaller model. (D–F) The mean and standard error of (D)

{CVErr}_{H}

, (E) log-BME for

g_{1}

and (F)

{CVErr}_{g_{1}}

calculated with ten bootstrap datasets and different regularization parameters. (G,H) Histograms of t statistics measuring the bias in the naive ML estimator of

θ_{1}^{*}

and the debiased ML estimators of

θ_{1}^{*}

based on Propositions 2 and 3. For the estimator based on Proposition 3, results with nuisance parameter

g_{1}

estimated with

ζ_{n, 1} = 70

and

ζ_{n, 1} = 1

are shown.

Figure 1. Numerical results for a clinically plausible data generation process. (A–C): The mean and standard error of the natural logarithms of the Bayes factors for (A) the correctly specified model vs. the misspecified model lacking

f_{2} (X_{2 t})

, (B) the misspecified models with vs. without the inclusion of

f_{t} (t)

and (C) the correctly specified models with vs. without the inclusion of

f_{t} (t)

, calculated with ten bootstrap datasets and different regularization parameters for Gaussian kernels

λ_{g}

. The other hyperparameters are fixed to the optimal values for the smaller model. The log-Bayes factors are calculated by subtracting the log-BME of the larger model from that of the smaller model. (D–F) The mean and standard error of (D)

{CVErr}_{H}

, (E) log-BME for

g_{1}

and (F)

{CVErr}_{g_{1}}

calculated with ten bootstrap datasets and different regularization parameters. (G,H) Histograms of t statistics measuring the bias in the naive ML estimator of

θ_{1}^{*}

and the debiased ML estimators of

θ_{1}^{*}

based on Propositions 2 and 3. For the estimator based on Proposition 3, results with nuisance parameter

g_{1}

estimated with

ζ_{n, 1} = 70

and

ζ_{n, 1} = 1

are shown.

Figure 2. Numerical results for a clinically plausible data generation process with a time-independent unobserved factor affecting the outcome. (A,B) The mean and standard errors of log-Bayes factors for (A) the misspecified models with only the observed covariates with vs. without the inclusion of

f (X_{1 t}^{(test)}, t)

and (B) the correctly specified model with a latent variable vs. the misspecified model with only the observed covariates but without the inclusion of

f (X_{1 t}^{(test)}, t)

, calculated with ten bootstrap datasets and different regularization parameters for Gaussian kernels. The other hyperparameters are fixed to the optimal values for the smaller (null) model. (C) Histograms of t statistics measuring the bias in the naive ML estimator of

θ_{1}

and the debiased ML estimator of

θ_{1}

based on the correctly specified model with a latent variable.

Figure 2. Numerical results for a clinically plausible data generation process with a time-independent unobserved factor affecting the outcome. (A,B) The mean and standard errors of log-Bayes factors for (A) the misspecified models with only the observed covariates with vs. without the inclusion of

f (X_{1 t}^{(test)}, t)

and (B) the correctly specified model with a latent variable vs. the misspecified model with only the observed covariates but without the inclusion of

f (X_{1 t}^{(test)}, t)

, calculated with ten bootstrap datasets and different regularization parameters for Gaussian kernels. The other hyperparameters are fixed to the optimal values for the smaller (null) model. (C) Histograms of t statistics measuring the bias in the naive ML estimator of

θ_{1}

and the debiased ML estimator of

θ_{1}

based on the correctly specified model with a latent variable.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hayakawa, T.; Asai, S. Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment. Mathematics 2025, 13, 3092. https://doi.org/10.3390/math13193092

AMA Style

Hayakawa T, Asai S. Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment. Mathematics. 2025; 13(19):3092. https://doi.org/10.3390/math13193092

Chicago/Turabian Style

Hayakawa, Takashi, and Satoshi Asai. 2025. "Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment" Mathematics 13, no. 19: 3092. https://doi.org/10.3390/math13193092

APA Style

Hayakawa, T., & Asai, S. (2025). Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment. Mathematics, 13(19), 3092. https://doi.org/10.3390/math13193092

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Debiased Maximum Likelihood Estimators of Hazard Ratios Under Kernel-Based Machine Learning Adjustment

Abstract

1. Introduction

2. Theories and Methods

2.1. Exponential Parametric Hazard Model Combined with Machine Learning

2.2. Causal Inference and ML Estimation

2.3. Debiasing ML Estimators

2.4. Extension to Models with Latent Variables

2.5. Design of Models with Multiple RKHSs

2.6. Estimation Algorithm

3. Results of Numerical Simulations

3.1. Simulation Result 1: Adjustment for Observed Confounders

3.2. Simulation Result 2: Estimation of Treatment Effect in Population with Heterogeneous Risk

4. Discussion

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Numerical Implementation of Inference with Multiple-Kernel Models

Appendix B. Proof of Proposition 1

Appendix C. Proof of Proposition 2

Appendix D. Gradient Functional and Hessian Operator

Appendix E. The Validity of Assumption 9

Appendix F. Proof of Proposition 3

Appendix G. Consideration of the Score Regularity and the Quality of Estimation of Nuisance Parameters Required for DML

Appendix H. Additional Consideration of Causal Interpretation of Hazard Ratios

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI