1. Introduction
Causal inference with survival outcomes has been an active research topic in recent years. When the research interest is the causal effect of an intervention on the time until an individual's first event of interest, extra care is needed when employing the usual time-to-event models. The target population for a new drug often includes patients with diverse prognostic baseline factors. As a result, it is critical to consider confounders during study design and data analysis. Furthermore, it is desirable to demonstrate intervention efficacy by showing a significant marginal intervention effect. A well-designed randomized controlled trial has both the measured and unmeasured covariates balanced between intervention groups on average; thus, even if the primary analysis does not consider any baseline covariates as confounders, the confounding bias can be eliminated efficiently [1]. However, collapsibility is a favorable model property that cannot be guaranteed by study design. Confounding and collapsibility are separate issues that have been emphasized in the literature [
2,
3,
4]. In
Section 3, we discuss the importance of confounding adjustment and collapsibility in causal survival analysis. It is known that the Cox Proportional Hazards (PH) model, the most prevalent survival model in medical research, is not a collapsible model and is not favored in causal survival analysis [
5,
6].
The first-hitting-time (FHT)-based threshold regression (abbreviated as TR) [
7,
8] is a well-accepted alternative method to analyze covariate effects on time-to-event endpoints while allowing for clinically meaningful interpretations [
9,
10,
11]. We show in
Section 4 that estimators from a TR model using a Wiener process with linear predictors are collapsible and, hence, the TR model can be easily adapted for causal survival analysis. In particular, we investigate the causal meaning of TR, as well as how to draw a marginal causal conclusion from TR with the assistance of G-computation.
This paper is organized as follows:
Section 2 presents the necessary notations and required causal assumptions for this manuscript.
Section 3 reviews the definitions of confounding and collapsibility.
Section 4 introduces the TR methodology as a useful survival model. Furthermore, this section discusses the estimator collapsibility property of TR that justifies its application in causal survival analysis.
Section 5 provides simulation results that support this finding.
Section 6 introduces TR G-computation, which can be used to estimate the marginal intervention causal effect. Next,
Section 7 presents the highlight of this manuscript—a neural network extension of the TR model, TRNN.
Section 8 provides further simulations which illustrate the accuracy of TRNN when the underlying model is correctly formulated and the potential bias when the model is misspecified. A real application is presented in
Section 9. The manuscript closes with a discussion and limitations.
2. Notations and Causal Assumptions
The intervention is denoted by a dichotomous variable A that takes values 0 or 1. We defer the discussion of continuous treatments to future research. The outcome of interest is the subject's time to first event. Due to limited follow-up or competing-risk failures, the true time-to-event random variable T may not be observed. Instead, the observed outcome variable is $Y = \min(T, C)$, where C is the independent censoring random variable. We further use $\delta = I(T \le C)$ to denote whether an event is observed or not. The outcome Y is recorded once, either during the study (event observed) or at the end of the study (right censored). The symbol Z denotes time-independent potential confounders. The observed data for individuals are, thus, drawn from the joint distribution of $(Y, \delta, A, Z)$. In this paper, we only consider studies where the intervention was introduced and covariates were measured at baseline, without additional information being gathered before the occurrence of the first event.
There are three assumptions required for the discussions in this paper: consistency, exchangeability, and positivity [12,13].
Consistency links the counterfactuals with the observables:
$$T^a = T \quad \text{if } A = a, \quad a = 0, 1. \qquad (1)$$
Exchangeability states that the conditional probability of receiving any value of treatment depends only on the measured covariates Z:
$$T^a \perp\!\!\!\perp A \mid Z, \quad a = 0, 1. \qquad (2)$$
The symbol $\perp\!\!\!\perp$ denotes statistical independence. For example, $X \perp\!\!\!\perp Y \mid Z$ means X is conditionally independent of Y given Z [14].
Positivity ensures that the probability of receiving any value of treatment conditional on Z is greater than zero, i.e., positive:
$$P(A = a \mid Z = z) > 0 \quad \text{for all } a \text{ and all } z \text{ with } f_Z(z) > 0. \qquad (3)$$
4. Parameter Estimators of TR Models with Linear Predictors Are Collapsible
The TR model conceptualizes a subject's latent health as a random process that drifts down towards a threshold (which is usually set at zero) [7,8]. For a Wiener process (with variance parameter set to 1) where the health begins at $y_0 > 0$ and the boundary is the zero level, the distribution of the first-hitting-time (time-to-event) outcome T is given by an Inverse Gaussian (IG) distribution with p.d.f.
$$f(t \mid y_0, \mu) = \frac{y_0}{\sqrt{2\pi t^3}} \exp\left[-\frac{(y_0 + \mu t)^2}{2t}\right], \quad t > 0, \qquad (4)$$
and survival function
$$S(t \mid y_0, \mu) = \Phi\left(\frac{y_0 + \mu t}{\sqrt{t}}\right) - e^{-2 y_0 \mu}\, \Phi\left(\frac{\mu t - y_0}{\sqrt{t}}\right), \qquad (5)$$
where $\mu$ is the drift of the process and $\Phi(\cdot)$ denotes the c.d.f. of the standard normal distribution. In most situations, one expects the deterioration rate $\mu$ to be negative so that the process will hit the disease threshold. If the deterioration rate is positive, such that $\mu > 0$, the Wiener process may not hit the boundary threshold and the first-hitting-time T is not certain to be finite. This case can be used to model applications with a cure rate, where the corresponding p.d.f. is improper with $P(T < \infty) = e^{-2 y_0 \mu} < 1$.
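For concreteness, Equations (4) and (5) translate directly into R. The sketch below (the function names are ours, not from any package) implements the density and survival function and checks that they are mutually consistent.

```r
# Sketch: IG first-hitting-time density (4) and survival function (5)
# for a Wiener process starting at y0 with drift mu (variance set to 1).
fht_density <- function(t, y0, mu) {
  y0 / sqrt(2 * pi * t^3) * exp(-(y0 + mu * t)^2 / (2 * t))
}

fht_survival <- function(t, y0, mu) {
  pnorm((y0 + mu * t) / sqrt(t)) -
    exp(-2 * y0 * mu) * pnorm((mu * t - y0) / sqrt(t))
}

# Sanity check: the density integrated to time 5 should equal 1 - S(5).
integrate(fht_density, 0, 5, y0 = 2, mu = -0.5)$value
1 - fht_survival(5, y0 = 2, mu = -0.5)
```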
There are two parameters in the underlying latent health process, namely, the initial health status parameter $y_0$, which describes the process starting point, and the deterioration rate parameter $\mu$, which describes the rate at which the process drifts towards the threshold. To begin with, we considered TR with linear predictors, in which the quantities $\mu$ and $y_0$ can be linked with observed regression covariate vectors $\mathbf{x}$ and $\mathbf{z}$ (each with a leading unit to include an intercept term) and associated vectors of regression coefficients $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$:
$$\mu = \mathbf{x}^\top \boldsymbol{\beta}, \qquad (6)$$
$$\ln y_0 = \mathbf{z}^\top \boldsymbol{\gamma}. \qquad (7)$$
A logarithmic function is used to link the parameter $y_0$ with covariates in Equation (7) because the initial health status $y_0$ is set to be larger than 0. It is legitimate to use the same covariate vector for both $\mu$ and $y_0$, or to consider completely different sets of risk factors for these two linear predictors. In other words, the covariates in x and z can completely differ, coincide, or partially agree. The most plausible functional form should come from expert knowledge of the scientific research. The deterioration rate $\mu$ is not observable but can be connected with covariates, such as the treatment regime A, by including the intervention A as a covariate in Equation (6). The effect size is then estimated by the coefficient of covariate A. Once the functional forms are determined by the investigator, the maximum likelihood estimation method is used to estimate the coefficient vectors $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$. Note, although $\mu$ and $y_0$ incorporate randomness after including predictors, they can still be regarded as parameters when the covariates x, z are given.
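As an illustration, a TR model with linear predictors can be fitted with the threg package used later in Section 5. The data frame and variable names below are hypothetical, and the formula convention (predictors for ln y0 before the vertical bar, predictors for μ after it) should be verified against the package documentation [29].

```r
# Sketch: fitting a TR model with linear predictors via threg.
library(survival)
library(threg)

# dat contains: time, status (1 = event), trt (intervention A), z (covariate)
fit <- threg(Surv(time, status) ~ z | trt + z, data = dat)
fit   # prints MLEs of the ln(y0) and mu regression coefficients
```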
In most chronic disease (such as CVD or kidney disease) studies, the intervention aims at slowing the health deterioration rate. As a result, we focus on investigating the causal effect of the intervention on the deterioration rate $\mu$. Figure 1 includes a binary intervention A and a potential confounder Z and illustrates the causal relationships among covariates, parameters, and the random variable of interest. The corresponding linear prediction function of Equation (6) includes predictors A and Z:
$$\mu = \beta_0 + \beta_A A + \beta_Z Z. \qquad (8)$$
The coefficient $\beta_A$ thus encodes the intervention effect on the deterioration rate $\mu$.
We now claim that the MLE of the treatment effect, $\hat{\beta}_A$, is collapsible; that is, the estimator does not systematically change after removing independent covariates (such as Z) from $\mu$ in Equation (8). See Appendix A for a proof. This property is especially pleasant because the removal of independent covariates does not change the parameters of the remaining covariates when the initial health status $y_0$ is controlled correctly. On the other hand, the estimators from the Cox PH model do not have this benefit in most cases. It has been well established that the coefficient estimate of the intervention would be biased for the causal effect even when an uncorrelated covariate is omitted from the Cox PH model [21,22,23].
Since the Maximum Likelihood Estimates (MLEs) of $(\mu, y_0)$ are correlated, the initial health status $y_0$ has to be correctly adjusted to draw a valid causal conclusion regarding the deterioration rate in TR.
Figure 1 includes a dashed arrow between $y_0$ and $\mu$. This dashed arrow indicates that $y_0$ may bias the estimation of the causal effect of A on $\mu$. The only causal pathway we are interested in is the one from A to $\mu$ and, subsequently, flowing to T. To control the effect of the initial health status, in practice one may consider (1) carefully recruiting subjects so that the participants do not vary dramatically in terms of the initial health status, or (2) correctly adjusting for the initial health status during the analysis. The former approach raises significant challenges because the initial health status is a latent variable that cannot be observed. The assumption may hold approximately if investigators carefully recruit subjects with similar demographics, but generalizing the study conclusion then becomes infeasible. On the other hand, the schema of a complex disease process (which usually can be described by a Directed Acyclic Graph (DAG)) is not always available [24]. Thus, including the correct covariates is not an easy task. Considering the clinical meaning of each parameter in TR also helps us to select the correct set of covariates to be adjusted. For example, we recognize that a post-baseline effect should not be adjusted for in the initial health status $y_0$. Nonetheless, including unrelated predictors for $\mu$ will not influence the consistency and efficiency of the causal estimator, owing to the estimator collapsibility. As a result, it is encouraged to consider comprehensive predictor sets for both $\mu$ and $y_0$.
The hazard ratio (HR) has been criticized for decades for lacking causal meaning [5,6]. In contrast, the interpretation of the TR model is completely free from the hazard concept. For the causal problem investigated in this work, we would interpret the intervention's causal effect as causing the deterioration rate to increase or decrease by the estimated amount. With the same initial health level, a faster deterioration rate (a more negative $\mu$) means an earlier first-hitting-time and, hence, a shorter survival time. Furthermore, the causal inference community has been considering other non-hazard-related models due to the interpretability issues of the HR.
Not surprisingly, less model-restrictive measures, such as differences between suitably adjusted and standardized survival curves or the Restricted Mean Survival Time (RMST) [4,25], have drawn considerable attention. Some recent literature advocates the usage of RMST over a hazard ratio as the target causal measure [26]. In addition to focusing on interpreting the causal meaning of $\beta_A$, RMST can also be estimated by TR. Given a fixed time horizon $t^*$, RMST is defined with respect to the survival function as the area under the survival curve with a cut-off at the prespecified restrictive follow-up time ($t^*$). We have also derived TR RMST formulas in a previous publication [27]. In general, the RMST for TR is calculated by
$$\text{RMST}(t^*) = \int_0^{t^*} S(t)\, dt. \qquad (9)$$
Theoretically, there are three formulas for TR RMST, depending on the sign of the deterioration rate. When $\mu < 0$, a convenient and exact closed-form expression for evaluating Equation (9) is available; the two other formulas, for $\mu = 0$ and $\mu > 0$, can be easily obtained by variable transformation [27].
5. Simulation I—Estimator Collapsibility of TR with Linear Predictors
Now, we use simulations to demonstrate the estimator collapsibility of the TR model with linear predictors. To investigate whether an intervention postpones an individual's time to the first CVD event by slowing the deterioration rate, studies can be conducted to examine the relationships among the intervention, potential confounders, and the time-to-event outcome. In this simulation, we considered hypothesized studies of sample size N. We simulated the dichotomous intervention A and two potential confounders, $Z_1$ and $Z_2$. The intervention A was generated from a Bernoulli distribution with a fixed probability of receiving the intervention. Confounder $Z_1$ is a binary variable generated from a Bernoulli distribution; some baseline characteristics, such as gender, follow this distribution. Confounder $Z_2$ is a continuous variable, generated from a Normal distribution with mean 1 and standard deviation 1. Following this setting, A, $Z_1$, and $Z_2$ are mutually independent. The time-to-event outcome T is simulated from the Inverse Gaussian (IG) distribution for TR shown in (4), with $\mu$ and $y_0$ given later. The study is scheduled to end in 3 years; thus, the endpoint is potentially right-censored due to the limited follow-up length. Later, we explore the case in which $Z_2$ is a confounder for the intervention and outcome. By using a logistic model, $\text{logit}\, P(A = 1 \mid Z_2) = \alpha_0 + \alpha_1 Z_2$, to replace the aforementioned Bernoulli distribution that generates the intervention, we create an association between the intervention and the confounder. The coefficient $\alpha_1$ thus measures the magnitude of the positive/negative association between A and $Z_2$.
Design 1: To start with, we consider Design 1, in which the initial health status is the same for all individuals in the population. Intervention A and covariate $Z_2$ both have causal effects on the deterioration rate $\mu$. The linear predictors of $\mu$ and $\ln y_0$ are simulated as:
$$\mu = \beta_0 + \beta_A A + \beta_2 Z_2, \qquad \ln y_0 = \gamma_0.$$
Design 2: For a slightly more complicated design (Design 2), $Z_1$ is related to the initial health status but not to $\mu$ by construction. Examples of such predictors include the genotype of an individual. If $Z_1$ stands for gender as suggested earlier, the model is reasonable if females have a different initial health level, reflecting a certain genetic makeup that protects them from CVD events. In conclusion, the simulation for the deterioration rate remains unchanged, whereas $Z_1$ is added to the generation of $\ln y_0$:
$$\mu = \beta_0 + \beta_A A + \beta_2 Z_2, \qquad \ln y_0 = \gamma_0 + \gamma_1 Z_1.$$
As a result, the intervention causal effect is $\beta_A$ for both simulated studies. The effect can be interpreted as the intervention changing (here, slowing) the deterioration rate by $\beta_A$. Now, we consider the fitting of $\mu$ and $\ln y_0$ separately. For the deterioration rate, we consider a reduced model that includes A only versus a correct model that includes both A and $Z_2$. Then, for the initial health status, two models for $\ln y_0$ were considered: one does not adjust for $Z_1$, while the other does. In conclusion, for each design, with a specific association parameter $\alpha_1$, four TR models were explored. The simulation results for the intervention effect are summarized in Table 1; see the first six rows for Design 1 results and rows 7-12 for Design 2 results. The first two columns present results of the $\mu$ model with A only, and the next two columns present results from the $\mu$ model that includes both A and $Z_2$. The different $\ln y_0$ models are summarized by rows, where "1" indicates the $\ln y_0$ model with the intercept only and "1, $Z_1$" indicates the $\ln y_0$ model with the intercept and $Z_1$. The standard error was computed based on 1000 repetitions. All models were fitted in R [28] with the threg package [29].
The simulation results of Design 1 first demonstrate that, when both the intervention A and the potential confounder $Z_2$ are included in the deterioration model, the coefficient estimates are unbiased. Whether or not $Z_1$ is included in the $\ln y_0$ component, the estimation is not affected, because the true initial health status is a constant. Now, if the deterioration model only includes the intervention A in Design 1, the relationship between the intervention and the potential confounder needs to be further examined. Note that $Z_2$ is not a confounder for the intervention and outcome if it is uncorrelated with A (i.e., $\alpha_1 = 0$). By examining the first two rows of the table, we see that the intervention causal effect is estimated consistently, whether $Z_2$ is omitted from the $\mu$ model or not. This is due to the independence between A and $Z_2$; hence, the estimated coefficient of A carries the intervention causal meaning. The Cox PH model would fail to provide an unbiased estimate of the intervention effect when such an uncorrelated covariate is omitted. The covariate $Z_2$ has a beneficial effect on the outcome, with its coefficient set to the same value as the intervention's. Thus, if covariate $Z_2$ is positively associated with the intervention ($\alpha_1 > 0$), exclusion of this confounder enlarges the intervention effect, as the confounder's beneficial effect has to be absorbed by the solely included intervention. If the association between the confounder and the intervention is in the reverse direction ($\alpha_1 < 0$), the intervention effect is attenuated to account for the exclusion. For Design 2, the previous discoveries remain true, but now we have to correctly account for $Z_1$ in the $\ln y_0$ component. All reduced $\ln y_0$ models (omitting $Z_1$) failed to provide unbiased intervention estimates (serious underestimation), regardless of the form of the $\mu$ model. If both the intervention and the potential confounder are included in the deterioration rate model, we still obtain an unbiased estimate of the intervention effect. The unbiasedness is preserved if $Z_2$ is not a confounder ($\alpha_1 = 0$).
6. TR G-Computation Procedure with Time-Independent Covariates
G-computation [
17] serves as a useful tool for the marginal causal effect estimation. Conducting G-computation for TR with linear predictors is straightforward. The first step of a G-computation is to fit a regression model of the outcome on the exposure and relevant covariates. The model fitted in this step is frequently called the “Q-model” [
16]. In TR G-computation, the Q-model is nothing but a standard threshold regression of the time-to-event outcome on potential predictors, such as A and Z. A correct Q-model is required for G-computation to estimate the intervention causal effect without bias.
To improve the chance of finding the correct model, the investigator may apply classical model selection tools (for example, AIC or BIC) to pick the most plausible TR model. Classical model selection methods are valid for selecting among linear prediction models, such as those used in the Q-model, and AIC and BIC are applicable because TR is a likelihood-based method. Model selection techniques for TR were discussed previously [30].
Once estimated, the Q-model is used to predict all counterfactual outcomes for each subject under all possible intervention regimes. If the intervention is dichotomous, this is accomplished by plugging in a 1 and subsequently a 0 as the intervention in the TR fit to obtain predicted outcomes. Depending on the research question, the counterfactual outcomes or other estimands of interest can be calculated. Specifically, $\mu^{a=1}$ corresponds to the potential deterioration rate that would be observed under continuing intervention, and $\mu^{a=0}$ is the counterpart that would be observed without intervention. Predicting all potential outcomes for each subject provides researchers with a full dataset that is free of confounding under the causal assumptions. In other words, comparisons or contrasts among different potential outcomes can now be estimated at the individual or population level.
Generally speaking, G-computation can be used to generate the full data under all possible intervention levels if an intervention with discrete levels is under investigation. If the research question is as straightforward as identifying the intervention's causal effect on the deterioration rate, we can merely estimate the difference between $\mu^{a=1}$ and $\mu^{a=0}$. Then, averaging the differences across the observed distribution of the confounders leads us to the marginal causal effect ($\Delta_\mu$) of the intervention on the deterioration rate. For instance, the marginal intervention causal effect on $\mu$ for a population of size N can be written as
$$\hat{\Delta}_\mu = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\mu}_i^{a=1} - \hat{\mu}_i^{a=0} \right).$$
Alternatively, a Marginal Structural Model (MSM) [31] can be used to fit the counterfactual outcomes. The estimates from the MSM are also equipped with causal meanings under certain regularity conditions. In the binary intervention case, these two approaches yield the same estimate. Nevertheless, if the researcher plans to use a more clinically relevant measure, such as RMST, estimates of $\mu$ and $y_0$ can be plugged into Equation (9) to obtain the counterfactual individual RMST. The marginal intervention causal effect on RMST with a specific $t^*$ can be calculated similarly to what has been implemented for the deterioration rate:
$$\hat{\Delta}_{\text{RMST}}(t^*) = \frac{1}{N} \sum_{i=1}^{N} \left[ \widehat{\text{RMST}}_i^{a=1}(t^*) - \widehat{\text{RMST}}_i^{a=0}(t^*) \right].$$
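Continuing the Design 2 sketch from Section 5, the G-computation steps above take only a few lines. The coefficient names extracted from the threg fit below are illustrative and should be checked against the actual fitted object.

```r
# Sketch of TR G-computation for the marginal effect on mu and on RMST.
b <- coef(fit_fc)                      # hypothetical names "mu:..." / "lny0:..."
mu_hat <- function(a, z2) b["mu:(Intercept)"] + b["mu:A"] * a + b["mu:Z2"] * z2

# Counterfactual deterioration rates for every subject:
mu1 <- mu_hat(1, dat$Z2); mu0 <- mu_hat(0, dat$Z2)
delta_mu <- mean(mu1 - mu0)            # marginal causal effect on mu

# Marginal causal effect on RMST(t* = 3), reusing tr_rmst() from Section 4:
y0_hat <- exp(b["lny0:(Intercept)"] + b["lny0:Z1"] * dat$Z1)
rmst1  <- mapply(function(y0, mu) tr_rmst(3, y0, mu), y0_hat, mu1)
rmst0  <- mapply(function(y0, mu) tr_rmst(3, y0, mu), y0_hat, mu0)
delta_rmst <- mean(rmst1 - rmst0)
```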
7. Machine Learning-Assisted TR—TRNN
TR models with linear predictors, discussed in the previous sections, enjoy benefits such as easy interpretation and estimator collapsibility. Nevertheless, the linear predictors of $\mu$ and $\ln y_0$ (i.e., Equations (6) and (7)) can be improved. To mitigate the risk of model misspecification, one may propose adding all relevant covariates, along with all possible interactions and higher-order terms, into the model. Yet, a model with linear predictors is likely to fail when the input dimension is close to, or even higher than, the number of observations [32]. A recent research work [11] introduced a novel approach to fit the TR model with linear predictors in a high-dimensional framework; using their suggested boosting algorithm, estimated coefficients can be obtained. In causal survival analysis, one may not only have to deal with high-dimensional inputs but also have to consider possible non-linearity among predictors. To address these challenges, we consider TR with non-linear predictors using machine learning methods.
Deep learning techniques, such as neural networks, are useful tools for survival data analysis [33]. In this article, we propose a neural network extension of the TR model, TRNN. TRNN includes two separate neural networks to predict $\mu$ and $y_0$ in a TR model. It is reasonable to assume that the deterioration rate $\mu$ and the initial health status $y_0$ may be influenced by different biological factors that are specific to them, in varied manners. Thus, different inputs and structures of the neural networks for $\mu$ and $y_0$ can reflect potentially distinct trajectories. Nevertheless, the neural networks for $\mu$ and $y_0$ are connected by a log-likelihood loss for right-censored data and are optimized together: the trainable parameters in the two networks are updated simultaneously while the model is trained with respect to this objective function. To the best of our knowledge, this is the first time that neural networks have been employed to fit a TR model.
To start, the linear predictors (6) and (7) can be replaced by neural networks without a hidden layer:
$$\mu = f_\mu(\mathbf{W}_\mu \mathbf{x} + \mathbf{b}_\mu) \qquad (15)$$
and
$$y_0 = f_{y_0}(\mathbf{W}_{y_0} \mathbf{z} + \mathbf{b}_{y_0}). \qquad (16)$$
In the above equations, $f_\mu$ and $f_{y_0}$ are activation functions. The quantities $\mathbf{W}_\mu$ and $\mathbf{W}_{y_0}$ are weight matrices, and $\mathbf{b}_\mu$ and $\mathbf{b}_{y_0}$ are bias vectors, all of which must be estimated. To echo the parameter ranges of TR, one can choose an activation function with output on the real line (such as a linear activation function) for $\mu$ and an activation function with a strictly positive output range (such as an exponential activation function) for $y_0$. When the activations are the identity and the exponential function, respectively, Equations (15) and (16) simply reduce to Equations (6) and (7), which have linear forms. More generally, hidden layers can be added between the input and output layers with user-specified numbers of neurons. With increased complexity, neural networks can approximate arbitrary functions [34]. The number of neurons in the output layer determines the output dimension. Regardless of the number of hidden layers and the number of neurons each layer contains, the output layers of the $\mu$ and $y_0$ networks should always have one neuron. As a result, each network outputs a one-dimensional value for $\mu$ or $y_0$ in TRNN. Equations (15) and (16) can be generalized to $\mu = \text{NN}_\mu(\mathbf{x})$ and $y_0 = \text{NN}_{y_0}(\mathbf{z})$, in which $\text{NN}_\mu$ and $\text{NN}_{y_0}$ are neural networks. An example of a TRNN that takes two three-dimensional inputs is displayed in Figure 2. This TRNN has one hidden layer with 5 units for both the $\mu$ and $y_0$ networks.
The loss function for TRNN is based on the log-likelihood function of TR. With right-censored data on N individuals, the total log-likelihood is the sum of log-density terms (for events) and log-survival terms (for censored observations), with f and S as in Equations (4) and (5):
$$\ell = \sum_{i=1}^{N} \left[ \delta_i \log f(y_i \mid y_{0i}, \mu_i) + (1 - \delta_i) \log S(y_i \mid y_{0i}, \mu_i) \right].$$
The neural network solves the optimization problem by minimizing a loss; consequently, the negative log-likelihood is the final loss function for TRNN. The MLE of the Inverse Gaussian with censored observations is unique [35] and, thus, creates no challenge for the optimization problem. Well-accepted optimization algorithms, such as Adam [36], can be used to train the model. To facilitate model convergence, we can first fit an Inverse Gaussian with intercepts only (for both $\mu$ and $\ln y_0$) using the observed data, then use the estimated intercept terms as the initial biases of the two networks; or we can simply set the initial biases to fixed constants, such as 1 for the $y_0$ network. The usual neural network training techniques are all applicable to training a TRNN. For instance, the validation loss can be monitored during training, and the algorithm stops early if the validation loss decrements between iterations do not exceed a prespecified value (delta). The weights that return the best validation loss are reported as the final weights. Employing early stopping can effectively avoid over-fitting. See Section 9 for a real example of fitting a TRNN.
The benefits of using TRNN are multi-fold, whether TR is applied for a conventional survival analysis or a causal survival analysis. TRNN is capable of handling high-dimensional data, such as gene expression. In Section 6, we claimed that a correct Q-model is the prerequisite for unbiased marginal effect estimation using G-computation. A TRNN can replace a regular TR model as the Q-model for TR G-computation. Once the TRNN is optimized, it can be used to predict $\mu$ and $y_0$ under all possible regimes. Once the Q-model provides the complete data set with $\mu$ and $y_0$ under different regimes for each subject, the marginal causal effect can be evaluated as before, depending on the causal estimands of interest. All ML models and experiments were implemented with the R package keras [37]. Keras [38] is a high-level neural network Application Programming Interface (API) for either TensorFlow or Theano. The R keras package interfaces with Python 3.8.8 and TensorFlow 2.9.1 [39]. Sample code for running a TRNN using R is included in Appendix B.
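To make the construction concrete, below is a minimal TRNN sketch in the R keras interface. It is not the Appendix B code: the helper names (build_trnn, trnn_nll), layer sizes, and hyperparameter values are our illustrative assumptions.

```r
# Sketch of a TRNN: two sub-networks for mu and y0, joined by the negative
# censored IG log-likelihood. Layer sizes and names are illustrative.
library(keras)
library(tensorflow)

Phi <- function(u) 0.5 * (1 + tf$math$erf(u / sqrt(2)))  # standard normal cdf

trnn_nll <- function(y_true, y_pred) {
  t  <- y_true[, 1]; d  <- y_true[, 2]          # observed time, event indicator
  mu <- y_pred[, 1]; y0 <- y_pred[, 2]
  eps  <- k_constant(1e-7)
  logf <- k_log(y0 + eps) - 0.5 * k_log(2 * pi * t^3) -
          (y0 + mu * t)^2 / (2 * t)             # log of Equation (4)
  S    <- Phi((y0 + mu * t) / k_sqrt(t)) -
          k_exp(-2 * y0 * mu) * Phi((mu * t - y0) / k_sqrt(t))  # Equation (5)
  -k_mean(d * logf + (1 - d) * k_log(k_maximum(S, eps)))
}

build_trnn <- function(p_x, p_z, units = 5, lr = 0.01) {
  in_x <- layer_input(shape = p_x, name = "x")  # covariates for mu
  in_z <- layer_input(shape = p_z, name = "z")  # covariates for y0
  mu <- in_x %>% layer_dense(units, activation = "tanh") %>%
    layer_dense(1, activation = "linear", name = "mu")       # real-valued
  y0 <- in_z %>% layer_dense(units, activation = "tanh") %>%
    layer_dense(1, activation = "exponential", name = "y0")  # strictly positive
  model <- keras_model(list(in_x, in_z), layer_concatenate(list(mu, y0)))
  model %>% compile(optimizer = optimizer_adam(learning_rate = lr),
                    loss = trnn_nll)
  model
}

# Usage sketch (x_train, z_train, time, status assumed prepared as matrices):
# model <- build_trnn(p_x = 3, p_z = 3)
# model %>% fit(list(x_train, z_train), cbind(time, status),
#               epochs = 100, batch_size = 256, validation_split = 0.25,
#               callbacks = callback_early_stopping(patience = 10,
#                                                   restore_best_weights = TRUE))
```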
8. Simulation II—TR G-Computation and RMST
This section uses simulated data to demonstrate the application of G-computation with TR, with either the deterioration rate or the RMST as the target causal estimand, for a study of sample size N. The outcome of interest is the time until the subject's first CVD event. The observed outcome is potentially right-censored at a prespecified study end time of 10 years. We further simulated two potential confounders. Similar to Simulation I, the binary variable gender ($Z_1$) was generated from a Bernoulli distribution with a fixed probability of being female, and $Z_2$ is a continuous baseline variable generated from a Normal(1, 1) distribution; an example of such a predictor is population-standardized blood pressure (BP). The covariates $Z_1$ and $Z_2$ are independent of each other. The binary intervention variable A was assigned by a logistic model:
$$\text{logit}\, P(A = 1 \mid Z_1, Z_2) = \alpha_0 + \alpha_1 Z_1 + \alpha_2 Z_2, \quad \alpha_1 < 0,\ \alpha_2 > 0,$$
with $Z_1 = 1$ indicating female.
This assignment mechanism indicates that subjects with higher blood pressure and/or males are more likely to receive the intervention. This is just one example of an intervention assignment mechanism; the estimation procedure described in this section follows even if other parametrizations that create associations between covariates and intervention are considered. In observational studies, assuming the intervention is given completely at random may not be reasonable. Instead, it is safer to consider the intervention as randomly assigned with a randomization probability that depends on some baseline covariates. The time-to-event outcome T is simulated from the Inverse Gaussian (IG) distribution shown in (4). The linear predictors of $\mu$ and $\ln y_0$ are formulated as:
$$\mu = \beta_0 + \beta_A A + \beta_{AZ_2} A Z_2, \qquad \ln y_0 = \gamma_0 + \gamma_A A + \gamma_1 Z_1. \qquad (19)$$
Figure 3 illustrates covariate effects on the outcome of interest for the simulated study.
Under this simple simulation setting, two covariates ($Z_1$ and $Z_2$) must be considered when analyzing the effect of intervention A on the outcome Y. Gender ($Z_1$) is a confounder for the intervention and outcome because it affects both A and the initial health status (and, subsequently, Y). Without any adjustment, one may incorrectly conclude that the intervention A has a causal effect on Y even when there is no such causal pathway from A to Y; the association appears only due to the unadjusted confounder, gender. In contrast, blood pressure ($Z_2$) affects the intervention, yet has no independent effect on the deterioration rate (and, subsequently, Y). Thus, $Z_2$ is just an effect modifier of the intervention effect on $\mu$, evident from the interaction term between $Z_2$ and A and the lack of a main-effect causal pathway from $Z_2$ to $\mu$. This data-generation schema implies that the intervention has a positive effect in slowing the deterioration rate and that the effect is further modified by $Z_2$: the intervention benefit increases as blood pressure increases. In terms of the initial health status, subjects who received the intervention and/or were female have larger initial health values.
We explored TR G-computation with four TR models (with different linear predictors for $\mu$ and/or $\ln y_0$), as well as a TRNN, as the Q-model. Model 1 assumes that the intervention influences both the initial health status and the deterioration rate; thus, one adjusts only for the intervention in both $\mu$ and $\ln y_0$.
Model 1 (linear predictors with the intervention only):
$$\mu = \beta_0 + \beta_A A, \qquad \ln y_0 = \gamma_0 + \gamma_A A.$$
Model 1 is incorrect for $\mu$, as it ignores the interaction effect, and is also incorrect for $\ln y_0$, as it omits the gender main effect. Model 2 is another incorrect model, with a similar assumption for the covariate $Z_2$.
Model 2 (linear predictors with the intervention and $Z_2$ only):
$$\mu = \beta_0 + \beta_A A + \beta_2 Z_2, \qquad \ln y_0 = \gamma_0 + \gamma_A A + \gamma_2 Z_2.$$
This model includes the main effect of $Z_2$ when estimating $\mu$, but this causal pathway is absent; $Z_2$ only influences $\mu$ by interacting with the intervention A. Further, $Z_2$ wrongly replaces $Z_1$ in the $\ln y_0$ model. Thus, the model fails to specify both the deterioration rate and the initial health status correctly. Next, Model 3 is the correct model.
Model 3 (true simulated model as specified in Equation (19)):
$$\mu = \beta_0 + \beta_A A + \beta_{AZ_2} A Z_2, \qquad \ln y_0 = \gamma_0 + \gamma_A A + \gamma_1 Z_1.$$
Given the fact that picking the correct TR model may not be an easy task for the user, it is more realistic to consider a full model that includes all potential confounders, as well as the interaction term.
Model 4 (full model with the intervention, $Z_1$, $Z_2$, and the interaction):
$$\mu = \beta_0 + \beta_A A + \beta_1 Z_1 + \beta_2 Z_2 + \beta_{AZ_2} A Z_2, \qquad \ln y_0 = \gamma_0 + \gamma_A A + \gamma_1 Z_1 + \gamma_2 Z_2.$$
Table 2 summarizes the intervention effect estimated by TR, and Table 3 summarizes the marginal intervention causal effect estimated by G-computation. The estimation standard deviation was calculated by non-parametric bootstrapping with 1000 bootstrap samples.
RMST may be more interpretable for clinicians and patients. With the same simulated data, using $t^* = 10$ years, Table 4 summarizes the RMST estimation results for each group and the group-wise difference. The estimates were obtained using TR Models 1-4, TRNN (with the same parameter settings as used in the previous G-computation), Kaplan-Meier curves, and the Cox model. The Cox model used in the simulation is similar to the TR full model, which includes $Z_1$, $Z_2$, and an interaction between the intervention A and $Z_2$ as covariates. The baseline hazard was estimated by the Breslow method [40].
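For reference, the non-parametric KM benchmark used in Table 4 can be obtained along these lines with the survival package (variable names follow the earlier sketches):

```r
# Sketch: group-specific RMST benchmark from Kaplan-Meier curves,
# with the restriction time tau = 10 matching the follow-up horizon.
library(survival)
km <- survfit(Surv(time, status) ~ A, data = dat)
print(km, rmean = 10)   # restricted means and standard errors by group
```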
The simulated true marginal effect on the deterioration rate $\mu$ is $\beta_A + \beta_{AZ_2} E[Z_2]$ per Equation (19). Specifically, the true marginal intervention causal effect on the deterioration rate is the sum of the main effect and the interaction effect, averaged over the population distribution of the effect modifier. We observed that the intervention causal effects estimated by TR are the same as those by TR with G-computation for TR Models 1 and 2. The reason is that TR with G-computation estimates the same marginal effect as the TR models when no interaction term is captured. Nonetheless, these TR models are biased due to model misspecification. Both TR Models 1 and 2 have the intervention effect slightly overestimated, since part of the intervention effect modified by BP is absorbed by the solely included main intervention effect. Though TR is estimator collapsible, these TR models failed to provide consistent estimation because the excluded interaction term is correlated with the included main effect. Further, the intervention A correlates with both $Z_1$ and $Z_2$ under this simulation setting. This emphasizes again the importance of confounder adjustment. Next, when the TR model was correctly specified (Model 3), TR with G-computation returned a consistent estimate of the marginal intervention causal effect. This marginal causal effect accounts not only for the main intervention effect but also for the effect modified by $Z_2$. After adjusting for the gender effect in the initial health level and the effect modifier of the intervention in the deterioration rate, we were able to conclude how much receiving the intervention slows the deterioration rate marginally for this population. The TR full model (Model 4) with G-computation also provided an unbiased estimate of the marginal intervention effect. Although the standard errors of the TR Model 4 coefficient estimates increased due to unnecessary terms added to the model, the marginal causal intervention effect estimate has a standard error similar to that of the correct model.
Model 5 (TRNN): Furthermore, as suggested in Section 7, we used a TRNN to fit $\mu$ and $y_0$. The algorithm took $(A, Z_1, Z_2)$ as input for both the $\mu$ and $y_0$ networks. We chose a network with one hidden layer (10 units, tanh activation) for the $\mu$ network and a network without a hidden layer for the $y_0$ network. This TRNN can be conceptualized as TR Model 5, in which no specific forms of $\mu$ and $y_0$ are assumed. Each simulated dataset was first split into a training set (90%) and a testing set (10%). Then, 25% of the training data were randomly selected as validation data, and the model was trained with the rest of the training set. The optimization process would potentially be stopped early if the validation loss no longer decreased. Once the $\mu$ and $y_0$ networks were trained and we obtained the predictions of $\mu$ and $y_0$, the G-computation for the marginal intervention effect followed as demonstrated in Section 6; however, the estimation was evaluated using the test data only. The marginal intervention causal effects on the deterioration rate and on the logarithm of the initial health status obtained by using TRNN as the Q-model are reported in Table 3. Though only three main effects were fed into the model, using neural networks allows us to capture non-linear relationships among the covariates. The estimates were fairly close to the truth, given that no specific model form was assumed. One should expect correctly specified parametric models to provide more efficient estimates, as those estimates are derived by maximum likelihood; nevertheless, the standard deviations of the marginal effect estimates were even smaller than those obtained from TR Models 3 and 4.
We would like to further quantify the marginal causal intervention effect in terms of RMST. Table 4 provides the RMST estimation results of TR Models 1-4, TRNN, Kaplan-Meier (KM), and the Cox model, with G-computation. First, the results obtained by KM were from group-specific non-parametric Kaplan-Meier curves. They serve as the benchmark values and were the average RMST in each intervention group with diverse gender and BP values. The RMST difference was simply calculated as the difference between the areas under the group-specific survival curves. Next, for the TR/Cox models, the group-specific estimates came directly from the estimated survival probabilities, while the between-group difference requires an extra step of G-computation. TR Models 3 and 4 provide nearly identical results for both the group-specific estimates and the between-group difference. Moreover, these two TR models provided estimates that are close to the non-parametric estimates. Thus, these two models with G-computation both provided an unbiased marginal intervention causal effect on RMST, as the intervention effect modified by $Z_2$ has been correctly accounted for. Yet, TR Models 1 and 2 with G-computation fail for this purpose because their underlying Q-models were incorrect: the intervention effect on RMST was underestimated. It can be seen from the TR Models 3 and 4 estimation results that the intervention was beneficial, as the RMST increased when the follow-up horizon was set to 10 years. In other words, the population-level RMST would almost double if everyone switched to the intervention.
Finally, TRNN outperforms all the other TR models, as it provides the closest results to the non-parametric method, though the standard deviations of its estimates increase significantly as well. The Cox model estimated the group-specific RMST inaccurately; however, its estimated between-group RMST difference is almost the same as that estimated by TR. Note that the standard error of the Cox model estimate is much higher than that of TR Models 3 and 4.
9. Application—CARDIA
The Coronary Artery Risk Development in Young Adults (CARDIA) study [
41,
42] examined the development and determinants of clinical and subclinical cardiovascular disease and their risk factors. This study is a population-based observational study of 5115 participants aged around 18–30 years recruited during 1985–1986, from four centers in the United States (Birmingham, AL; Chicago, IL; Minneapolis, MN; and Oakland, CA). The sample was designed to achieve approximately balanced subgroups of race, gender, education (high school or less, and more than high school), and age (18–24 and 25–30 years old). After the enrollment, follow-up in-person examinations were conducted in 1987–1988 (Year 2), 1990–1991 (Year 5), 1992–1993 (Year 7), 1995–1996 (Year 10), 2000–2001 (Year 15), 2005–2006 (Year 20), 2010–2011 (Year 25), and 2015–2016 (Year 30), with retention of 90.5%, 85.6%, 80.6%, 78.3%, 73.6%, 72.0%, 72.2%, and 71.0% of living participants, respectively. Because this study started to follow subjects during their young adulthood, it has provided evidence that supports intervening during young adulthood to prevent CVD mortality in later decades [
43].
To examine the causal effect of smoking on subjects' time to first CVD event, we consider the baseline measurements and the individual's first CVD event time measured until the end of the study. We implemented gentle data-cleaning techniques, such as removing observations with missing covariates and combining categories for extremely unbalanced categorical variables. The cleaned data are used to demonstrate the TRNN application and the subsequent G-computation. We are interested in the marginal causal effect of smoking on the heart-health deterioration rate $\mu$. Since competing-risk censoring (e.g., death, loss to follow-up) is likely to be informative, we first use Inverse Probability of Censoring Weighting (IPCW) to adjust for the informative censoring bias. A subject who is competing-risk censored receives a weight of 0, while the remaining subjects are inversely weighted by their probabilities of not being competing-risk censored. The likelihood-based loss function used in TRNN is updated accordingly to incorporate these weights.
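A minimal sketch of the IPCW step, with a Cox model for the competing-risk censoring process; the covariates and variable names below are illustrative, not the exact CARDIA specification.

```r
# Sketch: IPCW weights, with crisk = 1 indicating competing-risk censoring.
library(survival)
cens_fit <- coxph(Surv(time, crisk) ~ gender + race + age_grp, data = dat)

# P(still uncensored at each subject's own observed time):
p_unc <- sapply(seq_len(nrow(dat)), function(i) {
  summary(survfit(cens_fit, newdata = dat[i, ]),
          times = dat$time[i], extend = TRUE)$surv
})

# Weight 0 for competing-risk censored subjects, inverse probability otherwise;
# small probabilities are truncated to stabilize the weights.
dat$ipcw <- ifelse(dat$crisk == 1, 0, 1 / pmax(p_unc, 0.05))
```

These weights then multiply each subject's contribution in the TRNN log-likelihood loss.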
Figure 4 shows the Kaplan-Meier survival curves of CVD events for smokers vs. non-smokers. The outcome is the subject's time to his/her first CVD event; we truncate the follow-up time at month 400. As can be seen from the survival curves, there is an obvious trend of a delayed exposure effect. The trend is expected, as participants' likelihood of experiencing their first CVD event grows as they age.
To identify the optimal structure of the TRNN, we started by determining the hyperparameters to be used in the network. Five-fold cross-validation [44] was used for the tuning process. From the simulations, we observed that one hidden layer is sufficient to capture the non-linear nature of the covariates. In this application, we considered the sigmoid and tanh activation functions for the hidden layer. Additional hyperparameters we tuned included the number of hidden units for the $\mu$ network, the number of hidden units for the $y_0$ network, the learning rate, and the delta for early stopping. There are many other hyperparameters, such as the batch size, that could be tuned. In truth, it is impossible to tune all hyperparameters exhaustively, as the set of hyperparameter combinations would be enormous; therefore, we only include a working example here with a limited set of hyperparameters. The choice of network structure and hyperparameter set is mostly experience-based. For instance, the event rate in this dataset is low, so we do not encourage using a small batch size: the algorithm may not perform well if a batch of data does not include a single event. Thus, we set the batch size to 256 and refrained from tuning this parameter. The number of hidden units also depends on the input dimension. Referring to the proposed network structure in Figure 2, one can select different covariates for the $\mu$ and $y_0$ networks. In this application, for the $\mu$ network, we considered baseline covariates including smoking status (smoker vs. non-smoker), gender (female vs. male), race (white vs. non-white), age at enrollment (in four categories), marital status (married or living as married; divorced, separated, or widowed; never married), baseline BMI, and baseline SBP. For the $y_0$ network, we considered gender, race, age at enrollment, marital status, baseline BMI, baseline SBP, exercise intensity score, employment status (full-time/part-time/unemployed), education level (less than high school or GED/some college or college/above college), self-reported diabetes, hypertension, high cholesterol, whether the father/mother had a heart attack, and whether the father/mother had a stroke. Nevertheless, TRNN is capable of handling high-dimensional input; thus, many other potentially important covariates could be included. Using domain knowledge, we narrowed down the range of covariates, as well as the hyperparameter set, for demonstration purposes. The five-fold cross-validation selected 15 and 2 hidden units, with the sigmoid and tanh activation functions, for the $\mu$ and $y_0$ networks, respectively; the selected learning rate and early-stopping delta were then fixed for the final fit. With the selected hyperparameters, the data were further split into a training set (70%), a validation set (20%), and a testing set (10%). Once the model was trained, the G-computation was performed with the 10% testing data that the model had never seen previously. Note that, without setting a random seed, the resultant neural network parameter estimates differ between runs due to the stochastic optimization. Averaging over 100 refits of the neural networks, we estimated the amount by which smoking changes the deterioration rate in this population.
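A compact version of the tuning loop might look as follows, reusing the hypothetical build_trnn() helper sketched in Section 7; the grid, column names, and fold handling are illustrative, and covariates are assumed to be numeric or dummy-coded.

```r
# Sketch: five-fold cross-validation over a small hyperparameter grid.
x_cols <- c("smoke", "gender", "race", "bmi", "sbp")   # illustrative mu inputs
z_cols <- c("gender", "race", "bmi", "sbp")            # illustrative y0 inputs
grid   <- expand.grid(units = c(2, 5, 15), lr = c(0.01, 0.001))
folds  <- sample(rep(1:5, length.out = nrow(dat)))

cv_loss <- sapply(seq_len(nrow(grid)), function(g) {
  mean(sapply(1:5, function(k) {
    tr <- dat[folds != k, ]; va <- dat[folds == k, ]
    m  <- build_trnn(length(x_cols), length(z_cols),
                     units = grid$units[g], lr = grid$lr[g])
    m %>% fit(list(as.matrix(tr[, x_cols]), as.matrix(tr[, z_cols])),
              cbind(tr$time, tr$status),
              epochs = 100, batch_size = 256, verbose = 0)
    as.numeric(evaluate(m, list(as.matrix(va[, x_cols]), as.matrix(va[, z_cols])),
                        cbind(va$time, va$status), verbose = 0))
  }))
})

grid[which.min(cv_loss), ]   # selected hyperparameters
```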
In terms of the RMST ($t^* = 400$ months) of first-CVD-event-free time, we estimated the average difference between smokers and non-smokers. The confidence interval was computed using the bootstrapped standard deviation. The marginal smoking causal effect detected using G-computation and TRNN is significant.
10. Discussion
The threshold regression (TR) model with an inverse Gaussian first-hitting-time distribution provides profound clinical interpretations because it models the initial health status $y_0$, as well as the deterioration rate $\mu$. These two parameters jointly describe the first hitting time of a latent health process. Predictors linked to these two parameters influence the health process and can be interpreted with different biological meanings. We further argue that the estimators from the TR model with linear predictors for the deterioration rate are collapsible if $y_0$ is correctly controlled. Collapsibility of the estimator is a desirable property for a causal model because one often feels uncomfortable interpreting a conditional effect estimate. Unlike the Cox PH model, the TR model ensures an unbiased estimate of the intervention's causal effect on the outcome if an uncorrelated covariate is omitted from the model.
When analyzing the intervention effect, any proposed model will generally not be exactly correct, and results can be difficult to interpret if the model is misspecified and treatment effects differ substantially across subgroups [1]. The lack of subgroup information may complicate things further. As a result, the marginal intervention effect is of research interest, as it provides a population summary with a simple interpretation. A typical randomized controlled trial attempts to estimate the marginal effect of the investigated intervention, that is, the intervention effect at the population level regardless of the diverse characteristics. Estimator collapsibility is a particularly desirable model property because it makes a model far more likely to estimate the marginal effect without bias. Furthermore, estimator collapsibility enables other, more advanced causal inference techniques, such as instrumental variable (IV) analysis [
45]. It has been shown that TR offers an appealing opportunity to apply IV analysis in a survival context [
9].
The causal aspect of TR has been underrated, despite the availability of TR-based causal inference [9]. Section 4 and the simulation in Section 5 have demonstrated the collapsibility of estimators from TR with linear predictors when one would like to draw causal conclusions about the deterioration rate, on the condition that the initial health status is correctly controlled. The $y_0$ part in TR allows one to control for population baseline heterogeneity. Thus, we shall not treat the necessity of controlling $y_0$ in a TR deterioration rate analysis as a disadvantage. Previous research has shown that the intervention effect may interact with baseline genotypes [46]. This is a common cause of the crossing survival curves observed in medical research: subjects in the intervention group seem to experience the event sooner than the control group in the early stage of follow-up due to subgroup heterogeneity in initial health status, whereas the intervention benefit emerges later because the intervention has significant effects on the deterioration rate. The Cox model may lose power significantly when the PH assumption is violated, not to mention its lack of collapsibility and its interpretation issues. Alternatively, TR is capable of addressing distinct hazard types [10]. As we have shown in Section 5, the genotype or other baseline characteristics can be adjusted for as part of the initial health status. Once the baseline heterogeneity has been effectively controlled, the marginal intervention causal effect can be viewed as slowing the health process deterioration, providing a solid causal interpretation of the intervention on the deterioration rate $\mu$.
Though non-parametric marginal causal estimation is feasible in limited cases, such as when all covariates are discrete and data are collected for each sub-category, parametric models are more realistic choices in practice. One may have to seek help from the G-computation method when confounders of both the intervention and the outcome, or an effect modifier (interaction), are present. G-computation is a statistically reliable covariate-adjustment method for estimating an unconditional treatment effect [1]. We have demonstrated that, with a correct model specification, TR with G-computation is capable of estimating the intervention effect marginally, besides the usual conditional estimation. G-computation with TR can also provide a valid marginal causal estimate when other estimands, such as RMST, are chosen as the target. The simulation in Section 8 advocates using a comprehensive model as the Q-model to mitigate the risk of model misspecification. Furthermore, a doubly robust estimation that combines the outcome model (TR in our case) and IPW is feasible; such combined strategies are generally superior to either strategy alone [47].
TRNN is the neural network extension of the TR model, in which the linear predictor structure is replaced by non-linear neural networks. TRNN can not only handle high-dimensional input easily but also capture more sophisticated relationships between the covariates and the deterioration rate and/or initial health status. From our observations, TRNN serves as a competitive Q-model for G-computation. As we have shown with the simulations in Table 3 and Table 4, using TRNN can free practitioners from selecting the "true" model. Yet, as with other ML methods, the network structures and hyperparameters have to be tuned before implementation. With an increasing pool of ML toolkits, we only expect this method to become more accessible in the future. Last, the convergence of TRNN is fairly quick: under the simulated setting, 100 training epochs were sufficient, and thus no significant increase in computation time was observed compared to a non-NN-based TR model.
Because TR is a flexible parametric survival model, surrogate estimands, such as RMST, can be readily obtained for it. Using this estimand also resolves the dilemma in which an investigated intervention may have opposite effects on the initial health status and the deterioration rate. Given a fixed end-of-follow-up horizon ($t^*$), slowing the degradation rate and/or improving the initial health status directly improves the survival rate and, subsequently, the RMST. We can also consider other causal estimands, such as the causal survival probability difference.
There exist other aspects of TR that make it even more valuable in practice. For instance, when $\mu > 0$, TR with a cure rate can handle the case where a substantial proportion of subjects are believed to be "cured" or immune from the endpoint events [48].