3.1. Traditional Q-Learning
First, we introduce traditional Q-learning, a form of approximate dynamic programming originally proposed by [3]. Q-learning estimates the optimal DTR by postulating regression models for the Q-functions and subsequently taking the treatments that would yield the largest rewards. The Q-functions for the two stages are defined as
$$Q_2(h_2, a_2) = E\left[Y_2 \mid H_2 = h_2, A_2 = a_2\right]$$
and
$$Q_1(h_1, a_1) = E\left[Y_1 + \max_{a_2} Q_2(H_2, a_2) \,\middle|\, H_1 = h_1, A_1 = a_1\right].$$
The optimal decision rule at stage $j$ can be expressed as $d_j^{opt}(h_j) = \arg\max_{a_j} Q_j(h_j, a_j)$. Generally, we do not know the true Q-functions, and so we consider linear working models for the Q-functions of the form $Q_j(h_j, a_j; \beta_j, \psi_j) = \beta_j^\top h_{j,0} + (\psi_j^\top h_{j,1})\, a_j$, where $h_{j,0}$ and $h_{j,1}$ possibly contain different components of the history $h_j$.
The two-stage Q-learning algorithm works as follows:
The stage 2 regression is obtained by $(\hat\beta_2, \hat\psi_2) = \arg\min_{\beta_2, \psi_2} \mathbb{P}_n\bigl[Y_2 - Q_2(H_2, A_2; \beta_2, \psi_2)\bigr]^2$, where $\mathbb{P}_n$ denotes the empirical mean;
The stage 1 pseudo-outcome is given by $\widetilde{Y}_1 = Y_1 + \max_{a_2} Q_2(H_2, a_2; \hat\beta_2, \hat\psi_2)$;
Stage 1 regression: $(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \mathbb{P}_n\bigl[\widetilde{Y}_1 - Q_1(H_1, A_1; \beta_1, \psi_1)\bigr]^2$.
The optimal decision rules can further be written as $\hat d_j(h_j) = \operatorname{sign}(\hat\psi_j^\top h_{j,1})$ when we have the particular case that $a_j \in \{-1, 1\}$. We will assume that we have binary treatment options at both stages for convenience, although this can be relaxed with further assumptions.
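To make the algorithm concrete, the following is a minimal R sketch of the two regression steps above for treatments coded as ±1. The data frame columns (y1, y2, a1, a2) and the single tailoring covariate per stage (x1, x2) are hypothetical simplifications, not the notation of the data application.

```r
## Minimal two-stage Q-learning sketch (hypothetical column names).
## Stage 2 working model: Q2 = b0 + b1*x2 + (p0 + p1*x2)*a2; analogously for stage 1.
q_learning <- function(df) {
  # Stage 2 regression on the stage 2 outcome y2
  fit2 <- lm(y2 ~ x2 + a2 + x2:a2, data = df)
  # Predicted stage 2 outcome under each treatment option
  pred2 <- function(a) predict(fit2, transform(df, a2 = a))
  # Pseudo-outcome: stage 1 reward plus the best achievable stage 2 value
  df$y1_tilde <- df$y1 + pmax(pred2(1), pred2(-1))
  # Stage 1 regression on the pseudo-outcome
  fit1 <- lm(y1_tilde ~ x1 + a1 + x1:a1, data = df)
  # Decision rules: sign of the estimated treatment contrast at each stage
  rule2 <- function(newdf) sign(predict(fit2, transform(newdf, a2 = 1)) -
                                predict(fit2, transform(newdf, a2 = -1)))
  rule1 <- function(newdf) sign(predict(fit1, transform(newdf, a1 = 1)) -
                                predict(fit1, transform(newdf, a1 = -1)))
  list(fit1 = fit1, fit2 = fit2, rule1 = rule1, rule2 = rule2)
}
```

For a fitted object `ql <- q_learning(dat)`, `ql$rule1(newdat)` returns the recommended stage 1 treatment (+1 or -1) for new patients.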
3.2. Censoring Adapted Q-Learning
Our stage 2 optimization objective is complicated by the fact that some survival outcomes may be unobserved due to censoring. Let $C$ denote the censoring time, which occurs after the time of the stage 2 treatment. We assume that $C$ is conditionally independent of the survival time given the observed history and treatments (conditional independence).
3.2.1. Stage 2
Let $T^*(a_2)$ be the counterfactual survival outcome starting from the stage 2 treatment $a_2$, conditional on the previous treatment $a_1$. Correspondingly, $T^*(d_2)$ is the counterfactual outcome under decision rule $d_2$, i.e., $T^*(d_2) = T^*(1)\,\mathbb{1}\{d_2(H_2) = 1\} + T^*(-1)\,\mathbb{1}\{d_2(H_2) = -1\}$. The remaining counterfactual outcomes under $d_2$ can be obtained from their treatment-specific counterparts in the same way.
Using the linear utility function defined in Section 2 and conditioning on the previous treatment $a_1$, let $U^*(a_2)$ denote the counterfactual utility under stage 2 treatment $a_2$; correspondingly, $U^*(d_2)$ is the counterfactual utility, conditional on the previous treatments, under decision rule $d_2$.
The optimal regime, $d_2^{opt}$, satisfies $E\{U^*(d_2^{opt})\} \geq E\{U^*(d_2)\}$ for all $d_2 \in \mathcal{D}_2$, where $\mathcal{D}_2$ is the class of all potential decision rules for stage 2.
We make the following assumptions to connect the mean of counterfactual outcomes with the observed data:
Consistency:
The observed outcomes equal the counterfactual outcomes under the treatment actually received, i.e., $T = T^*(A_2)$ and $U = U^*(A_2)$;
No unmeasured confounding:
Treatment is randomly assigned with probability possibly dependent on the history $H_2$, i.e., $\{T^*(1), T^*(-1)\} \perp A_2 \mid H_2$ and $\{U^*(1), U^*(-1)\} \perp A_2 \mid H_2$;
Positivity:
There exist constants $0 < c_0 < c_1 < 1$ such that, with probability 1, the propensity score satisfies $c_0 < P(A_2 = 1 \mid H_2) < c_1$;
Latent variable independence:
The latent patient preference underlying the survey responses is conditionally independent of the counterfactual outcomes given the observed history.
The first three assumptions are standard assumptions in causal inference. The last assumption facilitates separate modeling of outcomes and preferences and can be weakened at the expense of more complicated models [14].
We denote the marginal expectation with respect to the counterfactual outcome distribution and introduce shorthand for the corresponding conditional outcome models, and we assume that the survival and preference outcomes are conditionally independent given the observed history and treatments. As in traditional Q-learning, we assume linear working models for each of our outcomes of interest; that is, each outcome can be generated through an underlying model of predictive and tailoring variables, which are possibly different components of the history and treatment.
Using the causal assumptions above, we bridge the observed data to the counterfactual outcome means:
$$E\{U^*(d_2)\} = E\bigl[\,E\{U \mid H_2, A_2 = d_2(H_2)\}\,\bigr],$$
where the separate modeling of preferences and outcomes is allowed by the fourth assumption.
With censoring, it is unlikely that all survival outcomes are observed. We propose the following estimator, which re-weights the observed complete cases using inverse probability of censoring weights (IPCW):
$$(\hat\beta_2, \hat\psi_2) = \operatorname*{arg\,min}_{\beta_2, \psi_2}\; \mathbb{P}_n\left[\frac{\Delta}{\hat S_C(T \mid H_2, A_2)}\,\bigl\{\hat U - \beta_2^\top H_{2,0} - (\psi_2^\top H_{2,1})\,A_2\bigr\}^2\right],$$
where $\hat U$ denotes the model estimate of the utility using the observed data and covariates, $\Delta$ is the event indicator that the patient's data are not censored, and $\hat S_C(\cdot)$ is a working estimator of the probability that the individual has not been censored by their event time.
Denote our Q-function here by $Q_2^{C}(h_2, a_2) = \beta_2^\top h_{2,0} + (\psi_2^\top h_{2,1})\,a_2$. The derived treatment rule is $\hat d_2(h_2) = \operatorname{sign}(\hat\psi_2^\top h_{2,1})$, where $h_{2,1}$ represents the tailoring variables in the stage 2 model.
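As an illustration of the censoring-weighted stage 2 fit, the sketch below estimates the censoring distribution with a marginal Kaplan-Meier fit and uses the resulting inverse-probability-of-censoring weights in a weighted regression of the estimated utility. The column names (time, delta, u for the estimated utility, x2, a2) and the marginal censoring model are illustrative assumptions, not the paper's exact specification.

```r
library(survival)

## Stage 2 IPCW-weighted regression (sketch, hypothetical column names).
fit_stage2_ipcw <- function(df) {
  # Kaplan-Meier estimate of the censoring distribution (censoring is the "event")
  km_cens <- survfit(Surv(time, 1 - delta) ~ 1, data = df)
  # Step-function lookup of P(not censored by t), evaluated at each subject's time
  G_hat <- stepfun(km_cens$time, c(1, km_cens$surv))(df$time)
  # IPC weights: uncensored subjects are up-weighted, censored subjects get weight 0
  df$w <- ifelse(df$delta == 1, 1 / pmax(G_hat, 1e-3), 0)
  # Weighted least squares for the stage 2 Q-function with the utility u as outcome
  lm(u ~ x2 + a2 + x2:a2, data = df, weights = w)
}

## Estimated stage 2 rule: sign of the fitted treatment contrast
rule_stage2 <- function(fit2, newdf) {
  sign(predict(fit2, transform(newdf, a2 = 1)) -
       predict(fit2, transform(newdf, a2 = -1)))
}
```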
3.2.2. Stage 1
For stages prior to the last stage, we use backward induction, and the optimal rule at stage 1 can then be derived similarly. Assuming that stage-specific rewards have been maximized after stage 1, we define the stage 1 reward as the counterfactual utility $U^*(a_1, d_2^{opt})$ obtained by taking treatment $a_1$ at stage 1 and following the future optimized treatments given by $d_2^{opt}$ thereafter, where $d_2^{opt}$ is as defined previously (and similarly for the survival and preference components). Using this reward, we define the optimal regime at stage 1 as the one that satisfies $E\{U^*(d_1^{opt}, d_2^{opt})\} \geq E\{U^*(d_1, d_2^{opt})\}$ for all $d_1 \in \mathcal{D}_1$, where $\mathcal{D}_1$ is the class of all potential regimes at stage 1.
We make the following assumptions. Note that in our set-up, the stage 1 treatment time is predetermined and fixed for every patient, but this assumption does not apply to the stage 2 treatment.
No unmeasured confounding:
The counterfactual outcomes are independent of the stage 1 treatment given the baseline history. This holds in our setting since treatment at each stage is assigned based only on the observed history. In addition, the analogous condition from the stage 2 assumptions continues to hold;
Positivity: the stage 1 propensity score $P(A_1 = 1 \mid H_1)$ is bounded away from zero and one;
Latent variable independence: the latent patient preference is conditionally independent of the stage 1 counterfactual outcomes given the observed history.
Then we have an identity, Equation (2), that expresses the stage 1 counterfactual utility mean in terms of conditional expectations of the observed data, with outcome models for the survival and preference components and for their stage 1 analogues. Similarly to stage 2, we further assume linear working models for each of these conditional expectations. Notice again that the right-hand side (RHS) can be completely estimated from the observed data. Under these assumptions, the optimization problem at stage 1, among all potential regimes in $\mathcal{D}_1$, can be written as maximizing the RHS of Equation (2).
We maximize the stage 1 outcome through a pseudo-outcome defined as the predicted utility under the estimated optimal stage 2 rule:
$$\widetilde{U} = \max_{a_2} Q_2^{C}(H_2, a_2; \hat\beta_2, \hat\psi_2).$$
Our proposed estimator for the stage 1 Q-function is the least-squares fit $(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \mathbb{P}_n\bigl[\widetilde{U} - Q_1(H_1, A_1; \beta_1, \psi_1)\bigr]^2$, where $Q_1$ is modeled linearly in predictive and tailoring variables. The estimated first stage optimal rule is given by $\hat d_1(h_1) = \operatorname{sign}(\hat\psi_1^\top h_{1,1})$, where $h_{1,1}$ represents the tailoring variables in the stage 1 model.
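Continuing the sketch, the stage 1 step builds the pseudo-outcome from the fitted stage 2 model and runs an ordinary least-squares regression. It reuses fit_stage2_ipcw() from above, and again the column names (x1, a1) are illustrative.

```r
## Stage 1 regression on the pseudo-outcome (sketch; reuses fit_stage2_ipcw from above).
fit_stage1 <- function(df, fit2) {
  pred2 <- function(a) predict(fit2, transform(df, a2 = a))
  # Pseudo-outcome: predicted utility under the better of the two stage 2 options
  df$u_tilde <- pmax(pred2(1), pred2(-1))
  lm(u_tilde ~ x1 + a1 + x1:a1, data = df)
}

## Estimated stage 1 rule: sign of the fitted treatment contrast
rule_stage1 <- function(fit1, newdf) {
  sign(predict(fit1, transform(newdf, a1 = 1)) -
       predict(fit1, transform(newdf, a1 = -1)))
}
```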
3.3. Inference
In addition to obtaining the optimal stage-specific decision rules, our second goal is to draw statistical inference on the effect of each stage's covariates on the decision, with particular emphasis on the tailoring variables. To that end, we propose using a censoring-adjusted version of the m-out-of-n method presented in [31]. In stage 1 optimization, the pseudo-outcome $\widetilde U$ is a nonsmooth function of the stage 2 coefficient estimates. In particular, if $P(\psi_2^\top H_{2,1} = 0) = 0$, then the first stage coefficient estimators converge to a normal distribution. However, if $P(\psi_2^\top H_{2,1} = 0) > 0$, the estimator $\hat\psi_1$ oscillates between two asymptotic distributions across samples, which reflects the typical challenging problem of nonregularity in the DTR literature [29]. Hence, direct estimation results in an asymptotically biased estimator and poor performance of the usual Wald-type confidence intervals. Even bootstrap-based approaches suffer from the underlying nonsmoothness. The m-out-of-n bootstrap was developed to address the bootstrap inconsistency due to nonsmoothness [32]. Although conceptually very similar to the original bootstrap, the resample size m (which must depend on n and tend to infinity with n) is selected to be of smaller order than n. Chakraborty et al. (2013) [31] showed through simulation studies that the m-out-of-n approach obtained desirable coverage probabilities for first stage tailoring variables in the two-stage DTR problem. Because censoring reduces the size of the observed stage 2 data in our scenario, we further modified the m-out-of-n algorithm to accommodate censoring. Our algorithm works as follows.
We adopted the functional form of m as presented in [31]:
$$\hat m = n^{\frac{1 + \alpha(1 - \hat p)}{1 + \alpha}}, \qquad (3)$$
where $\hat p$ estimates the degree of nonregularity and $\alpha$ is a tuning parameter.
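Assuming the resample-size formula in Equation (3), the computation of m is a one-liner:

```r
## Adaptive resample size m = n^{(1 + alpha*(1 - p_hat)) / (1 + alpha)} (Equation (3)).
## p_hat = 0 (the regular case) gives m = n; p_hat = 1 gives the smallest resample size.
choose_m <- function(n, p_hat, alpha) {
  ceiling(n^((1 + alpha * (1 - p_hat)) / (1 + alpha)))
}
```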
Let n be the total number of subjects in the dataset (including those who were censored). For stage 2, we create a bootstrap sample of size n and fit a regression model using the complete cases, weighted by the IPC weights within the bootstrapped sample, to obtain stage 2 coefficient estimates. Stage 2 confidence intervals are obtained from $\hat l$ and $\hat u$, the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{n}\,(\hat\psi_2^{(b)} - \hat\psi_2)$, where $\alpha$ is the desired significance level, $\hat\psi_2^{(b)}$ is the bootstrap estimate of the stage 2 coefficients with bootstrap-specific re-estimated censoring weights, and $\hat\psi_2$ is the plug-in estimator obtained using weighted regression on the empirical dataset. The confidence interval is given by $\bigl(\hat\psi_2 - \hat u/\sqrt{n},\; \hat\psi_2 - \hat l/\sqrt{n}\bigr)$. For stage 1, we first generate bootstrap samples of size m, which is calculated using Equation (3) after computing a sample-specific $\hat p$. We then use each bootstrap sample to re-estimate the IPC weights and fit a weighted linear model (lm) to obtain a bootstrap-specific stage 2 estimate. The stage 2 coefficients from each bootstrap sample are then used to calculate pseudo-outcomes, which are in turn used to fit a stage 1 model to obtain $\hat\psi_1^{(b)}$. We then obtain $\hat l$ and $\hat u$, the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{m}\,(\hat\psi_1^{(b)} - \hat\psi_1)$, where $\hat\psi_1$ is the plug-in estimator obtained using the complete empirical dataset, while $\hat\psi_1^{(b)}$ is the estimate obtained from each bootstrap sample of size m. The confidence set is given by $\bigl(\hat\psi_1 - \hat u/\sqrt{m},\; \hat\psi_1 - \hat l/\sqrt{m}\bigr)$.
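A condensed sketch of the censoring-adjusted m-out-of-n percentile bootstrap for a single stage 1 coefficient follows. It reuses fit_stage2_ipcw() and fit_stage1() from the earlier sketches, and coef_name names the (hypothetical) stage 1 coefficient of interest.

```r
## m-out-of-n percentile bootstrap CI for one stage 1 coefficient (sketch).
boot_ci_stage1 <- function(df, m, coef_name, B = 1000, level = 0.95) {
  psi1 <- coef(fit_stage1(df, fit_stage2_ipcw(df)))[coef_name]   # plug-in estimate
  boots <- replicate(B, {
    db <- df[sample(nrow(df), m, replace = TRUE), ]              # resample of size m
    f2 <- fit_stage2_ipcw(db)                                    # re-estimate IPC weights and stage 2 fit
    coef(fit_stage1(db, f2))[coef_name]                          # bootstrap stage 1 estimate
  })
  q <- quantile(sqrt(m) * (boots - psi1),
                probs = c((1 - level) / 2, 1 - (1 - level) / 2), na.rm = TRUE)
  c(lower = unname(psi1 - q[2] / sqrt(m)),
    upper = unname(psi1 - q[1] / sqrt(m)))
}
```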
We further selected a fixed value of the tuning parameter $\alpha$ that provided stable coverage in simulations with complete data. The calculation of $\hat m$ relies on a choice of $\hat p$, the estimated degree of nonregularity. We opted to use the plug-in estimator for $\hat p$ just as in [31], namely the proportion of subjects whose standardized stage 2 treatment effect is not significantly different from zero, computed with a plug-in sandwich estimator of the covariance of the stage 2 coefficient estimates.
In practice, however, the two hyperparameters above would need to be tuned to obtain an appropriate m. A double bootstrap instead allows the resample size m to be tuned straightforwardly and automatically; such a double bootstrap algorithm for choosing m is data-driven. Suppose we are interested in $\psi_1$, i.e., a stage 1 variable effect, and that its estimate from the original data is $\hat\psi_1$. Consider a grid of candidate values for m:
(1) Draw $B_1$ n-out-of-n first-stage bootstrap samples from the data and calculate the bootstrap estimates $\hat\psi_1^{(b_1)}$, $b_1 = 1, 2, \dots, B_1$. Fix m at the smallest value in the grid.
(2) Conditional on each first-stage bootstrap sample, draw $B_2$ m-out-of-n second-stage (nested) bootstrap samples and calculate the double bootstrap versions of the estimate, $\hat\psi_1^{(b_1 b_2)}$, $b_1 = 1, 2, \dots, B_1$, $b_2 = 1, 2, \dots, B_2$.
(3) For $b_1 = 1, 2, \dots, B_1$, compute the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{m}\,(\hat\psi_1^{(b_1 b_2)} - \hat\psi_1^{(b_1)})$, $b_2 = 1, \dots, B_2$; say $\hat l^{(b_1)}$ and $\hat u^{(b_1)}$, respectively. Construct the double-centered percentile bootstrap interval from the $b_1$-th first-stage bootstrap dataset as $\bigl(\hat\psi_1^{(b_1)} - \hat u^{(b_1)}/\sqrt{m},\; \hat\psi_1^{(b_1)} - \hat l^{(b_1)}/\sqrt{m}\bigr)$, $b_1 = 1, \dots, B_1$.
(4) Estimate the coverage rate of the double bootstrap CI from all of the first-stage bootstrap datasets as
$$\frac{1}{B_1} \sum_{b_1 = 1}^{B_1} \mathbb{1}\Bigl\{\hat\psi_1 \in \bigl(\hat\psi_1^{(b_1)} - \hat u^{(b_1)}/\sqrt{m},\; \hat\psi_1^{(b_1)} - \hat l^{(b_1)}/\sqrt{m}\bigr)\Bigr\}.$$
(5) If the current coverage rate lies within the Monte Carlo margin of error of the nominal level $1 - \alpha$, then it is not significantly different from $1 - \alpha$, and we pick the current value of m as the final value. Otherwise, increase m to the next value in the grid.
(6) Repeat steps (2)–(5) until the criterion in step (5) is satisfied or the grid is exhausted.
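The grid search above can be written compactly as follows. The sketch reuses the earlier helpers, and the acceptance check in step (5) is implemented with a simple binomial Monte Carlo margin around the nominal level, which is one reasonable choice rather than necessarily the exact criterion used in the paper.

```r
## Double bootstrap selection of m (sketch; reuses fit_stage2_ipcw and fit_stage1).
select_m <- function(df, m_grid, coef_name, B1 = 100, B2 = 100, level = 0.95) {
  n <- nrow(df)
  psi1 <- coef(fit_stage1(df, fit_stage2_ipcw(df)))[coef_name]   # estimate from original data
  # Step (1): first-stage n-out-of-n bootstrap datasets
  first <- replicate(B1, df[sample(n, n, replace = TRUE), ], simplify = FALSE)
  for (m in m_grid) {
    covered <- vapply(first, function(db) {
      psi1_b <- coef(fit_stage1(db, fit_stage2_ipcw(db)))[coef_name]
      # Step (2): nested m-out-of-n bootstrap within each first-stage dataset
      inner <- replicate(B2, {
        dbb <- db[sample(n, m, replace = TRUE), ]
        coef(fit_stage1(dbb, fit_stage2_ipcw(dbb)))[coef_name]
      })
      # Step (3): double-centered percentile interval; does it cover psi1?
      q <- quantile(sqrt(m) * (inner - psi1_b),
                    c((1 - level) / 2, 1 - (1 - level) / 2), na.rm = TRUE)
      (psi1 >= psi1_b - q[2] / sqrt(m)) && (psi1 <= psi1_b - q[1] / sqrt(m))
    }, logical(1))
    # Steps (4)-(5): accept m if the estimated coverage is within a Monte Carlo
    # margin of the nominal level (illustrative acceptance rule)
    if (abs(mean(covered) - level) <= 1.96 * sqrt(level * (1 - level) / B1)) return(m)
  }
  max(m_grid)   # grid exhausted: fall back to the largest candidate
}
```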
3.4. Survival Augmented Patient Preference Weights (SAPP-Weights)
Going forward, we assume that information in the latest stage's survey will override information from previous stages as well as other covariate information (i.e., the stage 2 survey responses override the stage 1 responses and the remaining covariates). To model the survey information as a function of latent preference, we assume a latent traits model [33]. We further assume that the latent preferences are related to the survey responses through a modified Rasch model [34,35].
For our scenario, we assume that there are K questions on a survey, each soliciting a binary answer from the patient. For each binary response, we assume that the underlying generating mechanism is a logistic (Rasch-type) model in the patient's latent preference $\theta_j$ and a question-specific parameter $b_k$, where j indicates the stage and k the question number.
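To make the assumed response model concrete, the sketch below evaluates and simulates answers under a plain Rasch-type logistic form, P(response = 1 | theta) = expit(theta - b_k); the exact modification used in the modified Rasch model may involve additional parameters, so this form is an illustrative assumption.

```r
## Rasch-type item response model (sketch): latent preferences theta, item parameters b.
expit <- function(x) 1 / (1 + exp(-x))

# Probability that a patient with latent preference theta endorses question k
p_response <- function(theta, b) expit(outer(theta, b, "-"))   # length(theta) x length(b)

# Simulate binary survey answers for the patients in theta on the K questions in b
simulate_survey <- function(theta, b) {
  p <- p_response(theta, b)
  matrix(rbinom(length(p), 1, p), nrow = length(theta))
}
```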
Algorithm 1 outlines the procedure for estimating the patient preference $\theta_j$. The survival augmented patient preference weights (SAPP-weights) are then a transformed version of $\hat\theta_j$. Essentially, we use the Expectation-Maximization (EM) algorithm [36] to iterate between estimates of the questionnaire coefficients $b_k$ and the individual patient preferences $\theta_j$ at stage j. We use Gauss-Hermite quadrature to numerically approximate the marginal likelihood integral, and we estimate $\theta_j$ using the Metropolis-Hastings algorithm.
Algorithm 1: EM algorithm for estimating patient preference $\theta_j$.
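Algorithm 1 itself appears only as a figure in the original article. As a rough sketch of the key computation it relies on, the code below integrates the latent preference out of the likelihood with Gauss-Hermite quadrature under an assumed standard normal prior and the Rasch-type response form above (reusing expit and the item parameters b); the prior and the node count are illustrative choices.

```r
## Gauss-Hermite nodes and weights via the Golub-Welsch eigen-decomposition
gauss_hermite <- function(q) {
  i <- seq_len(q - 1)
  J <- matrix(0, q, q)
  J[cbind(i, i + 1)] <- J[cbind(i + 1, i)] <- sqrt(i / 2)
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

## Marginal log-likelihood of one patient's binary answers y under a N(0, 1) prior on theta
marginal_loglik <- function(y, b, q = 20) {
  gh <- gauss_hermite(q)
  theta <- sqrt(2) * gh$nodes                    # change of variables for the N(0, 1) prior
  lik <- vapply(theta, function(t) prod(dbinom(y, 1, expit(t - b))), numeric(1))
  log(sum(gh$weights * lik) / sqrt(pi))
}

## E-step quantity: posterior weight of each quadrature node for one patient
posterior_weights <- function(y, b, q = 20) {
  gh <- gauss_hermite(q)
  theta <- sqrt(2) * gh$nodes
  lik <- vapply(theta, function(t) prod(dbinom(y, 1, expit(t - b))), numeric(1))
  w <- gh$weights * lik
  list(theta = theta, weights = w / sum(w))
}
```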