Article

Survival Augmented Patient Preference Incorporated Reinforcement Learning to Evaluate Tailoring Variables for Personalized Healthcare

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
* Author to whom correspondence should be addressed.
Stats 2021, 4(4), 776-792; https://doi.org/10.3390/stats4040046
Submission received: 3 August 2021 / Revised: 16 September 2021 / Accepted: 22 September 2021 / Published: 27 September 2021
(This article belongs to the Special Issue Survival Analysis: Models and Applications)

Abstract

In this paper, we consider personalized treatment decision strategies in the management of chronic diseases, such as chronic kidney disease, which typically consists of sequential and adaptive treatment decision making. We investigate a two-stage treatment setting with a survival outcome that could be right censored. This can be formulated through a dynamic treatment regime (DTR) framework, where the goal is to tailor treatment to each individual based on their own medical history in order to maximize a desirable health outcome. We develop a new method, Survival Augmented Patient Preference incorporated reinforcement Q-Learning (SAPP-Q-Learning), to balance between quality of life and survival restricted at the maximal follow-up. Our method incorporates the latent patient preference into a weighted utility function that balances between quality of life and survival time, within a Q-learning model framework. We further propose a corresponding m-out-of-n bootstrap procedure to accurately make statistical inferences and construct confidence intervals on the effects of tailoring variables, whose values can guide personalized treatment strategies.

1. Introduction

For chronic illnesses, patients often have to navigate a series of treatment decisions. It has been increasingly recognized that, due to patient heterogeneity arising from genetics, environmental exposures, various other factors, and the interplay among them, a good treatment plan needs to be both personalized and adaptive to a patient’s changing clinical course. Dynamic treatment regimes (DTRs) are algorithmic solutions to this clinical problem, where a DTR consists of a sequence of treatment decision rules that adapt over time in response to an individual’s clinical response and health outcome trajectory.
A large number of methods have been proposed for the evaluation of the optimal DTR. Some of the earlier and foundational works include the marginal structural model with inverse probability weighting (IPW) [1], G-estimation of structural nested mean models [2], Q-learning [3,4,5], and A-learning [6]. More recently, along with the development of data science, machine-learning-flavored methods have also been developed for DTR estimation, including tree-based and list-based methods [7,8,9,10], classification-type methods [11,12], and stochastic tree search methods [13].
Despite the large number of methods that can be used to evaluate the optimal DTR, the majority rely on a single pre-specified reward endpoint of interest. Oftentimes, in practice, multiple competing priorities need to be balanced when considering a clinical decision, and these outcomes can in fact be affected in opposing directions of desirability. A classic example is toxicity vs. efficacy: a newly introduced drug that is highly efficacious might come with a larger burden of undesirable side effects. In recent years, a few proposed methods have tackled the delicate balance between multiple outcomes of a proposed treatment. Butler et al. [14] balanced treatment efficacy and toxicity with patient-derived preferences in a Q-learning approach, while [15] assigned different rewards based on survival status, wellness (a measure of toxicity), and tumor size (a measure of drug efficacy) at each stage.
The particular scenario that we would like to address in this work is the delicate balance of quality vs. quantity of life, a dilemma commonly encountered by patients facing severe illness. For many chronic diseases, patients often receive a first treatment and are followed up after a short time to determine whether the treatment needs to be adjusted. The patient is then followed until an adverse event or a certain maximal follow-up time. Adverse events could include hospitalization, organ failure, impending surgery, or other undesired medical consequences, but for simplicity, we take mortality as our event of interest [16]. Our primary goal is then to estimate the optimal treatment regime that maximizes a patient preference weighted combination of quality of life and survival time. Secondly, we would like to provide an inference framework for more confident decision making. Although similar in flavor to the abovementioned works, our scenario brings with it a distinct set of technical challenges.
The main challenge in this scenario is the presence of censored data. Because of the long tail of survival distributions, and because of other logistical reasons (patients move, dropout due to deteriorating health, etc.), it is common to not observe the outcomes of a significant fraction of the population. However, partially observed information from censored subjects can still contribute important information and give power to the analysis if analyzed correctly. An enormous body of work has been developed to estimate the optimal rule or regime in the presence of censoring. A non-exhaustive and overlapping list includes methods for a single stage ([17,18,19]), methods for multiple stages ([20,21,22,23]), inverse probability of censoring weighted (IPCW) adjustment methods ([19,20,22,24]), Q-learning-based methods ([20,22]), tree-based methods ([17,18]), survival probability-based methods ([23]), accelerated failure time (AFT)-based models ([25,26]), and doubly robust methods ([21,27,28]), just to name some of the numerous ways by which these methods differ in scope and direction.
A secondary but equally important goal of our method is to provide inference on stage-specific parameters, particularly for tailoring variables. Inference in DTR methods is challenging due to the known issue of nonregularity caused by non-smooth functions that are carried forward through backward induction [29]. When the degree of nonregularity is large (in other words, when a larger fraction of the covariate space has no treatment effect), the sampling distribution of the estimated coefficient oscillates between two asymptotic distributions, resulting in asymptotic bias and poor Wald-type confidence intervals. In the same paper, Chakraborty et al. [29] proposed hard-threshold and soft-threshold estimators to adjust for this poor coverage. Laber et al. 2014 [30] proposed an adaptive confidence interval for first stage parameters by utilizing regular, uniformly convergent lower and upper bounds on the asymptotic distribution of interest, and then bootstrapping to obtain the confidence set. Chakraborty et al. 2013 [31] proposed an adaptive bootstrap-based method that adjusts for the bias and coverage by adjusting the bootstrap sample size.
In this work, we explore, synthesize, and adapt existing methods in the literature to create a new method for estimating the optimal treatment regime and constructing stage-specific confidence intervals that fit our scenario. We utilize IPCW to enable a complete data analysis and use the Q-learning framework for the modeling approach. We further adapt the m-out-of-n bootstrap to accommodate censoring in order to obtain covariate-specific confidence intervals for inference. We illustrate the numerical performance of our method through simulation studies.

2. Notation and Framework

As shown in Figure 1, we consider a two-stage setting where a patient, upon diagnosis, receives a stage 1 treatment. Shortly thereafter, at a scheduled follow-up time, the patient is assessed for stage 2 treatment. Following stage 2 treatment, the patient is at risk of an adverse event, i.e., death, at a time denoted by $D_i$. Although in some cases adverse events might happen early, our assumption that death does not occur in stage 1 is fairly natural in the setting of many chronic illnesses, such as diabetes, hypertension, chronic kidney disease, or many autoimmune diseases, where significant clinical decline is not acutely expected. Finally, in this setting, patient information might be lost to follow-up, either due to administrative censoring (surpassing the maximal follow-up time) or due to patient factors.
Let $T_{i1}, T_{i2}$ denote the times of treatment for stages 1 and 2, respectively, for patient i. Let $\tau$ be the maximal administrative follow-up time. Let $S_{i1}$ and $S_{i2}$ denote the amount of time survived in each stage, i.e., $S_{i1} = T_{i2} - T_{i1}$ and $S_{i2} = D_i - T_{i2}$. We assume that per protocol, everyone's $S_{i1}$ should be the same (i.e., routine assessment following stage 1 treatment at a prespecified time interval). Let $K_j$ be the number of treatment options in the jth stage, $j = 1, 2$. Let $A_{ij}$ denote the treatment indicator for the ith patient in the jth stage, with $a_{ij}$ denoting the observed treatment. Let $X_{ij}$ denote patient characteristics and other historical information, such as outcomes and prior treatments $A_1, \ldots, A_{j-1}$, prior to treatment assignment at stage j. In addition, we assume that each patient will have an evolving preference $h_{ij}$, which can be derived from answers $W_{ij}$ to a questionnaire at stage j.
At each stage, we assume that each patient i will have two observed outcomes: $q_{ij}$ (with range $[0, 1]$), the average quality of life during stage j, and $S_{ij}$, the amount of time spent in stage j. $q_{ij}$ allows us to calculate $Qu_{ij} = q_{ij} S_{ij}$, the quality adjusted life years during stage j. However, since $Qu_{ij}$ depends on $S_{ij}$, it is also subject to censoring at stage 2.
We assume a utility function of the form $Qu + \{1 - \Phi(h)\}\{S - Qu\}$, where $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal random variable. Intuitively, this utility function is a sliding scale between $S$ and $Qu$, and a patient's preference dictates where he/she falls on that scale. Let the overall outcome of interest that we would like to optimize be $R_{i1} + R_{i2}$, where $R_{ij} = Qu_{ij} + \{1 - \Phi(h_{ij})\}\{S_{ij} - Qu_{ij}\}$. This is the cumulative, preference-adjusted quality of life years experienced on a given regime. If one's preference is such that $\Phi(h_{ij}) = 0$ for both stages, then $R_{i1} + R_{i2}$ is the total survival time. Similarly, if a patient has the preference $\Phi(h_{ij}) = 1$ for both stages, then $R_{i1} + R_{i2}$ is the total quality adjusted life years. In general, we denote all history, or history up to stage K, for a given variable with an overhead bar (i.e., $\bar{W}_i$ and $\bar{W}_{iK}$, respectively).
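As a hypothetical numerical illustration of this sliding scale: suppose a stage yields $S = 300$ days with average quality of life $q = 0.8$, so that $Qu = qS = 240$ quality adjusted days, and the patient's preference gives $\Phi(h) = 0.3$. The stage reward is then $R = 240 + (1 - 0.3)(300 - 240) = 282$, which lies between the pure quality-adjusted value of 240 (obtained when $\Phi(h) = 1$) and the pure survival value of 300 (obtained when $\Phi(h) = 0$).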
Let $g_j(\bar{X}_{ij}, \bar{W}_{ij})$ be a function that maps from covariate and survey history to the domain of treatment assignment $A_{ij}$. At stage j, the expected potential reward of following decision rule $g_j(\bar{X}_{ij}, \bar{W}_{ij})$ for patient i is defined as
$$E\Big[\sum_{a_j = 1}^{K_j}\big[Qu^*_{ij}(a_j) + \{1 - \Phi(h_{ij})\}\{S^*_{ij}(a_j) - Qu^*_{ij}(a_j)\}\big]\, I\{g_j(\bar{X}_{ij}, \bar{W}_{ij}) = a_j\}\Big],$$
where $S^*_{ij}(a_j) = S^*_{ij}(A_{i1}, \ldots, A_{i,j-1}, a_j)$ and $Qu^*_{ij}(a_j) = Qu^*_{ij}(A_{i1}, \ldots, A_{i,j-1}, a_j)$ denote the counterfactual survival outcome and quality adjusted survival, respectively, where the patient is assumed to have taken treatment $a_j$ at stage j, conditional on previous treatment decisions $A_{i1}, \ldots, A_{i,j-1}$ in the sense that previous treatments are fixed to those actually given in previous stages. In our case, our primary goal is to find a sequence of individualized decision rules, $g(\bar{X}_i, \bar{W}_i) = (g_1(\bar{X}_{i1}, \bar{W}_{i1}), g_2(\bar{X}_{i2}, \bar{W}_{i2}))$, that optimizes the potential outcome $R_{i1} + R_{i2}$.
A second but equally important objective is to conduct inference on coefficients, with particular emphasis on tailoring variables (variables that interact with treatment selection). Inference on tailoring variables is important because it obviates the need to collect data for covariates that show no evidence of significant deviation from zero. Furthermore, inference allows us to know when there is insufficient evidence to support one treatment over another, so that treatment decisions could be made using other factors important to the patient. In this work, along with stage-specific decision rules, we present a censoring adapted method of obtaining confidence intervals for the covariates in both stages of the model. For the sake of brevity going forward, let us abbreviate $g_j(\bar{X}_{ij}, \bar{W}_{ij})$ with $g_j$ and drop the patient index i when there is no confusion.

3. Methods: Censoring Adapted Q-Learning

3.1. Traditional Q-Learning

First, we introduce traditional Q-learning, a form of approximate dynamic programming originally proposed by [3]. Q-learning estimates the optimal DTR by postulating regression models for Q-functions and subsequently taking solutions that would yield the largest rewards. The Q-functions for the two stages are defined as:
$$Q_2(\bar{X}_2, A_2) = E[R_2 \mid \bar{X}_2, A_2]$$
and
$$Q_1(\bar{X}_1, A_1) = E\big[R_1 + \max_{a_2} Q_2(\bar{X}_2, a_2) \,\big|\, \bar{X}_1, A_1\big].$$
The optimal decision rule at stage j can be expressed as $d_j(\bar{X}_j) = \arg\max_{a_j} Q_j(\bar{X}_j, a_j)$. Generally, we do not know the true Q-functions, and so we consider linear working models for the Q-functions of the form $Q_j(\bar{X}_j, A_j; \beta_j, \psi_j) = \beta_j^T \bar{Z}_{j,0} + (\psi_j^T \bar{Z}_{j,1}) A_j$, where $\bar{Z}_{j,0}$ and $\bar{Z}_{j,1}$ possibly contain different components of the history $\bar{X}_j$.
The two-stage Q-learning algorithm works as follows:
  • Stage 2 regression is obtained by
    $(\hat{\beta}_2, \hat{\psi}_2) = \arg\min_{\beta_2, \psi_2} E_n\big\{R_2 - Q_2(\bar{X}_2, A_2; \beta_2, \psi_2)\big\}^2$, where $E_n$ denotes the empirical mean;
  • The stage 1 pseudo-outcome is given by
    $\widetilde{PO}_1 = R_1 + \max_{a_2} Q_2(\bar{X}_2, a_2; \hat{\beta}_2, \hat{\psi}_2)$;
  • Stage 1 regression: $(\hat{\beta}_1, \hat{\psi}_1) = \arg\min_{\beta_1, \psi_1} E_n\big\{\widetilde{PO}_1 - Q_1(\bar{X}_1, A_1; \beta_1, \psi_1)\big\}^2$.
The optimal decision rules can further be written as $d_j(\bar{X}_j) = \arg\max_{a_j} Q_j(\bar{X}_j, a_j; \hat{\beta}_j, \hat{\psi}_j) = \mathrm{sign}(\hat{\psi}_j^T \bar{Z}_{j,1})$ in the particular case where $A_j \in \{-1, 1\}$. We will assume binary treatment options for both stages for convenience, although this can be relaxed with further assumptions.
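To make the backward induction concrete, here is a minimal sketch of the two-stage Q-learning recursion with linear working models and $A_j \in \{-1, 1\}$, written as plain least squares; the function and variable names are illustrative placeholders rather than the authors' code.

```python
import numpy as np

def fit_q(Z0, Z1, A, y):
    """Least-squares fit of the linear working model
    Q = beta^T Z0 + (psi^T Z1) * A. Returns (beta_hat, psi_hat)."""
    X = np.hstack([Z0, Z1 * A[:, None]])          # main effects, then tailoring terms
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:Z0.shape[1]], coef[Z0.shape[1]:]

def two_stage_q_learning(Z10, Z11, A1, R1, Z20, Z21, A2, R2):
    """Traditional two-stage Q-learning with A_j in {-1, +1}. All inputs are
    numpy arrays; Zj0 / Zj1 should include an intercept column if desired."""
    # Stage 2 regression on the observed stage 2 reward
    beta2, psi2 = fit_q(Z20, Z21, A2, R2)
    # Pseudo-outcome: R1 plus the best achievable stage 2 Q-value
    q2_pos = Z20 @ beta2 + (Z21 @ psi2) * (+1)
    q2_neg = Z20 @ beta2 + (Z21 @ psi2) * (-1)
    pseudo = R1 + np.maximum(q2_pos, q2_neg)
    # Stage 1 regression on the pseudo-outcome
    beta1, psi1 = fit_q(Z10, Z11, A1, pseudo)
    # Estimated rules: d_j(x) = sign(psi_j^T Z_j1)
    rule2 = np.sign(Z21 @ psi2)
    rule1 = np.sign(Z11 @ psi1)
    return (beta1, psi1, rule1), (beta2, psi2, rule2)
```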

3.2. Censoring Adapted Q-Learning

Our stage 2 optimization objective is complicated by the fact that some $S_2$ may be unobserved due to censoring. Let C denote the time of censoring, which happens after $T_2$, the time of stage 2 treatment. We assume that $S_2 \perp C \mid A_2, \bar{X}_2, \bar{W}_2$ (conditional independence).

3.2.1. Stage 2

Let $S_2^*(a_2)$ be the counterfactual outcome of survival starting from the stage 2 treatment, conditional on the previous treatment $A_1$. Correspondingly, $S_2^*(g_2)$ is the counterfactual outcome under decision rule $g_2$, i.e., $S_2^*(g_2) = \sum_{a_2=1}^{K_2} S_2^*(a_2)\, I\{g_2(\bar{X}_2, \bar{W}_2) = a_2\}$. Similarly, we can obtain $Qu_2^*(a_2)$ through $q_2^*(a_2)$ and $S_2^*(a_2)$.
Using the linear utility function defined in Section 2 and conditional on the previous treatment $A_1$, $R_2^*(a_2) = Qu_2^*(a_2) + \{1 - \Phi(H_2)\}\{S_2^*(a_2) - Qu_2^*(a_2)\}$.
Correspondingly,
$$R_2^*(g_2) = Qu_2^*(g_2) + \{1 - \Phi(H_2)\}\{S_2^*(g_2) - Qu_2^*(g_2)\}$$
is the counterfactual utility, conditional on the previous treatment $A_1$, under decision rule $g_2$.
The optimal regime, $g_2^{opt}$, satisfies $E\{R_2^*(g_2^{opt})\} \geq E\{R_2^*(g_2)\}$ for all $g_2 \in \mathcal{G}_2$, where $\mathcal{G}_2$ is the class of all potential decision rules for stage 2.
We make the following assumptions to connect the mean of counterfactual outcomes with the observed data:
  • Consistency:
    $S_2 = \sum_{a_2=1}^{K_2} S_2^*(a_2)\, I\{A_2 = a_2\}$ and $q_2 = \sum_{a_2=1}^{K_2} q_2^*(a_2)\, I\{A_2 = a_2\}$;
  • No unmeasured confounding:
    Treatment $A_2$ is randomly assigned with probability possibly dependent on $\bar{X}_2, \bar{W}_2$, i.e., $\{S_2^*(1), \ldots, S_2^*(K_2)\} \perp A_2 \mid \bar{X}_2, \bar{W}_2$ and $\{q_2^*(1), \ldots, q_2^*(K_2)\} \perp A_2 \mid \bar{X}_2, \bar{W}_2$;
  • Positivity:
    There exist constants $0 < c_0 < c_1$ such that, with probability 1, the propensity score $\pi_{a_2}(\bar{X}_2, \bar{W}_2) = Pr(A_2 = a_2 \mid \bar{X}_2, \bar{W}_2) \in (c_0, c_1)$;
  • Latent variable independence:
    $H_2 \perp \big(A_2, Qu_2^*(a_2), S_2^*(a_2)\big) \mid \bar{X}_2, \bar{W}_2$.
The first three assumptions are standard assumptions in causal inference. The last assumption facilitates separate modeling of outcomes and preferences and can be weakened at the expense of more complicated models [14].
We denote the marginal expectation with respect to $\bar{X}_t, \bar{W}_t$ as $E_{\bar{X}_t, \bar{W}_t}$, abbreviated as $E_t$ throughout. Furthermore, we denote $\mu^S_{2,a_2}(\bar{X}_2, \bar{W}_2) \equiv E(S_2 \mid A_2 = a_2, \bar{X}_2, \bar{W}_2)$ and $\mu^q_{2,a_2}(\bar{X}_2, \bar{W}_2) \equiv E(q_2 \mid A_2 = a_2, \bar{X}_2, \bar{W}_2)$, and assume that $S_2^*(a_2)$ and $q_2^*(a_2)$ are conditionally independent given $\bar{X}_2, \bar{W}_2$. As in traditional Q-learning, we assume linear working models for each of our outcomes of interest (i.e., $q_2, S_2$ can be generated through underlying models of predictive and tailoring variables, $\beta_2^T Z_{20} + (\psi_2^T Z_{21}) A_2$, where $Z_{20}$ and $Z_{21}$ are some, possibly different, components of $\bar{X}_2$ and $\bar{W}_2$).
Using the causal assumptions above, we bridge the observed data to the counterfactual outcome means:
$$E\, R_2^*(g_2) = E_2\Big(\sum_{a_2 \in \{-1,1\}}\Big[\mu^q_{2,a_2}(\bar{X}_2, \bar{W}_2)\,\mu^S_{2,a_2}(\bar{X}_2, \bar{W}_2) + E\{1 - \Phi(H_2) \mid \bar{X}_2, \bar{W}_2\}\big\{\mu^S_{2,a_2}(\bar{X}_2, \bar{W}_2) - \mu^q_{2,a_2}(\bar{X}_2, \bar{W}_2)\,\mu^S_{2,a_2}(\bar{X}_2, \bar{W}_2)\big\}\Big]\, I\{g_2 = a_2\}\Big),$$
where the separate modeling of preference and outcomes is allowed by the fourth assumption.
With censoring, it is unlikely that all $S_2$ are observed. We propose the following estimator that re-weights observed complete data using inverse probability of censoring weights (IPCW):
$$\underset{\beta_2, \psi_2}{\arg\min}\; P_n\left[\frac{\big\{R_2 - \hat{E}(R_2(\bar{X}_2, \bar{W}_2, A_2; \beta_2, \psi_2))\big\}^2\,\Delta}{\widehat{Pr}\{\Delta = 1 \mid \bar{X}_2, \bar{W}_2, A_2\}}\right],$$
where $\hat{E}(R_2(\bar{X}_2, \bar{W}_2, A_2; \beta_2, \psi_2))$ denotes the model estimate for $R_2$ using observed data and covariates, $\Delta = I(S_2 < C)$ is the indicator that the patient's data are not censored, and $\widehat{Pr}\{\Delta = 1 \mid \bar{X}_2, \bar{W}_2, A_2\}$ is a working estimator of the probability that the individual has not been censored by their event time.
Denote our Q-function here as $Q_2(\bar{X}_2, \bar{W}_2, A_2) = \hat{E}[R_2 \mid \bar{X}_2, \bar{W}_2, A_2]$. The derived treatment rule is $\hat{g}_2^{opt}(\bar{X}_2, \bar{W}_2) = \arg\max_{a_2} Q_2(\bar{W}_2, \bar{X}_2, a_2; \hat{\beta}_2, \hat{\psi}_2) = \mathrm{sign}(\hat{\psi}_2^T Z_{21})$, where $Z_{21}$ represents the tailoring variables in the stage 2 model.
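A minimal sketch of this stage 2 step is given below. It estimates the censoring distribution with a marginal Kaplan–Meier estimator of the censoring times, which is a simpler working model than one conditional on $\bar{X}_2, \bar{W}_2, A_2$; the function and variable names are illustrative placeholders rather than the authors' implementation.

```python
import numpy as np

def km_censoring_survival(time, delta):
    """Kaplan-Meier estimator of the censoring survival function
    K(t) = Pr(C > t), treating censoring (delta == 0) as the event.
    A sketch that ignores ties; time and delta are 1-d arrays."""
    order = np.argsort(time)
    t, d = time[order], delta[order]
    surv, probs = 1.0, []
    at_risk = len(t)
    for i in range(len(t)):
        if d[i] == 0:                      # a censoring event at t[i]
            surv *= 1.0 - 1.0 / at_risk
        probs.append(surv)
        at_risk -= 1
    t, probs = np.asarray(t), np.asarray(probs)

    def K(s):                              # K(s-): survival just before s
        idx = np.searchsorted(t, s, side="left") - 1
        return np.where(idx >= 0, probs[np.clip(idx, 0, None)], 1.0)

    return K

def stage2_ipcw_fit(Z20, Z21, A2, R2_obs, obs_time, delta):
    """IPCW-weighted least squares for the stage 2 Q-function.
    delta = 1 for complete cases; censored rows get weight zero."""
    K = km_censoring_survival(obs_time, delta)
    w = delta / np.clip(K(obs_time), 1e-8, None)
    y = np.where(delta == 1, R2_obs, 0.0)      # censored rewards never enter (weight 0)
    X = np.hstack([Z20, Z21 * A2[:, None]])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    beta2, psi2 = coef[:Z20.shape[1]], coef[Z20.shape[1]:]
    rule2 = np.sign(Z21 @ psi2)                # estimated rule: sign(psi2^T Z21)
    return beta2, psi2, rule2
```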

3.2.2. Stage 1

For stages prior to the last stage, we use backward induction, and the optimal $g_1^{opt}(\bar{X}_1, \bar{W}_1)$ at stage 1 can be derived similarly. Assuming that stage-specific rewards have been maximized after stage 1, we define the following stage 1 reward:
$$R_1^*(a_1) = q_1^*(a_1) S_1^*(a_1) + \{1 - \Phi(H_1)\}\{S_1^*(a_1) - q_1^*(a_1) S_1^*(a_1)\} + q_2^*(a_1) S_2^*(a_1) + \{1 - \Phi(H_2)\}\{S_2^*(a_1) - q_2^*(a_1) S_2^*(a_1)\},$$
where $S_1^*(a_1)$ is as defined previously and $S_2^*(a_1) = S_2^*(a_1, g_2^{opt})$ denotes the counterfactual outcome given future optimized treatments and taking treatment $a_1$ at stage 1 (similarly for $q_1^*(a_1)$ and $q_2^*(a_1)$). Using this reward, we define the optimal regime at stage 1 as the one that satisfies
$$E\{R_1^*(g_1^{opt})\} \geq E\{R_1^*(g_1)\}$$
for all $g_1 \in \mathcal{G}_1$, where $\mathcal{G}_1$ is the class of all potential regimes at stage 1.
We make the following assumptions. Note that in our set-up, $S_1$ is predetermined and fixed for every patient, but this assumption does not apply to $q_1$.
  • Consistency:
    $\sum_{a_1=1}^{K_1} S_k^*(a_1)\, I(A_1 = a_1) = \begin{cases} S_1, & \text{when } k = 1 \\ S_k(A_1, g_2^{opt}), & \text{when } k = 2 \end{cases}$
    and
    $\sum_{a_1=1}^{K_1} q_k^*(a_1)\, I(A_1 = a_1) = \begin{cases} q_1, & \text{when } k = 1 \\ q_k(A_1, g_2^{opt}), & \text{when } k = 2; \end{cases}$
  • No unmeasured confounding:
    $\{S_1^*(1), \ldots, S_1^*(K_1)\} \perp A_1 \mid \bar{X}_1, \bar{W}_1$. This holds in our setting since $S_1^*(1) = \cdots = S_1^*(K_1) = S_1$ for all $a_1$. In addition, $\{q_1^*(1), \ldots, q_1^*(K_1)\} \perp A_1 \mid \bar{X}_1, \bar{W}_1$;
  • Positivity: $\pi_{a_1}(\bar{X}_1, \bar{W}_1) = Pr(A_1 = a_1 \mid \bar{X}_1, \bar{W}_1)$ is bounded away from zero and one;
  • Latent variable independence: $H_1 \perp \big(A_1, S_k^*(a_1), Qu_k^*(a_1)\big) \mid \bar{X}_k, \bar{W}_k$, where $k = 1, 2$.
Then we have
$$\begin{aligned} E[R_1^*(g_1)] = {}& E_1\Big(\sum_{a_1 \in \{-1,1\}}\Big[\mu^q_{1,a_1}(\bar{X}_1, \bar{W}_1)\,\mu^S_{1,a_1}(\bar{X}_1, \bar{W}_1) + E\{1 - \Phi(H_1) \mid \bar{X}_1, \bar{W}_1\} \\ &\times \big\{\mu^S_{1,a_1}(\bar{X}_1, \bar{W}_1) - \mu^q_{1,a_1}(\bar{X}_1, \bar{W}_1)\,\mu^S_{1,a_1}(\bar{X}_1, \bar{W}_1)\big\}\Big]\, I\{g_1 = a_1\}\Big) \\ &+ E_2\Big(\sum_{a_1 \in \{-1,1\}}\Big[\mu^q_{2,a_1}(\bar{X}_1, \bar{W}_1)\,\mu^S_{2,a_1}(\bar{X}_1, \bar{W}_1) + E\{1 - \Phi(H_2) \mid \bar{X}_2, \bar{W}_2\} \\ &\times \big\{\mu^S_{2,a_1}(\bar{X}_1, \bar{W}_1) - \mu^q_{2,a_1}(\bar{X}_1, \bar{W}_1)\,\mu^S_{2,a_1}(\bar{X}_1, \bar{W}_1)\big\}\Big]\, I\{g_1 = a_1\}\Big), \end{aligned} \tag{2}$$
where $\mu^S_{1,a_1}(\bar{X}_1, \bar{W}_1) = E(S_1 \mid A_1 = a_1, \bar{X}_1, \bar{W}_1)$, $\mu^S_{2,a_1}(\bar{X}_1, \bar{W}_1) = E[S_2(A_1, g_2^{opt}) \mid A_1 = a_1, \bar{X}_1, \bar{W}_1]$, and similarly for the equivalents of q. Similarly to stage 2, we further assume that $q_1^*(a_1) \perp S_1^*(a_1) \mid \bar{X}_1, \bar{W}_1$ and $q_2^*(a_1) \perp S_2^*(a_1) \mid \bar{X}_1, \bar{W}_1$. Notice again that the right-hand side (RHS) can be completely estimated from observed data. Under these assumptions, the optimization problem at stage 1, among all potential regimes $\mathcal{G}_1$, can be written as $g_1^{opt} = \arg\max_{g_1 \in \mathcal{G}_1}$ RHS of Equation (2).
We maximize the stage 1 outcome through a pseudo-outcome defined as:
$$\widetilde{PO}_1 = q_1 S_1 + \{1 - \Phi(H_1)\}\{S_1 - q_1 S_1\} + \max_{a_2} Q_2(\bar{W}_2, \bar{X}_2, a_2; \hat{\beta}_2, \hat{\psi}_2).$$
Our proposed estimator for $(\hat{\beta}_1, \hat{\psi}_1)$ is $\arg\min_{\beta_1, \psi_1} P_n\big\{\widetilde{PO}_1 - Q_1(\bar{X}_1, \bar{W}_1, A_1; \beta_1, \psi_1)\big\}^2$, where $Q_1(\bar{X}_1, \bar{W}_1, A_1; \beta_1, \psi_1)$ can be modeled as $\beta_1^T Z_{10} + (\psi_1^T Z_{11}) A_1$. The first stage estimated optimal rule is given by $\hat{g}_1^{opt} = \arg\max_{a_1} Q_1(\bar{X}_1, \bar{W}_1, a_1; \hat{\beta}_1, \hat{\psi}_1) = \mathrm{sign}(\hat{\psi}_1^T Z_{11})$, where $Z_{11}$ represents the tailoring variables in the stage 1 model.
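Continuing the earlier sketches, the stage 1 step can be written as follows, where `Phi_h1` stands for the estimated SAPP-weights $\Phi(\hat{h}_1)$ introduced in Section 3.4 and the remaining names are again illustrative placeholders.

```python
import numpy as np

def stage1_fit(Z10, Z11, A1, q1, S1, Phi_h1, Z20, Z21, beta2, psi2):
    """Stage 1 Q-learning step using the preference-weighted pseudo-outcome."""
    # Observed stage 1 utility: q1*S1 + (1 - Phi(h1)) * (S1 - q1*S1)
    R1 = q1 * S1 + (1.0 - Phi_h1) * (S1 - q1 * S1)
    # max over a2 in {-1, +1} of the fitted stage 2 Q-function
    main2, tailor2 = Z20 @ beta2, Z21 @ psi2
    pseudo = R1 + main2 + np.abs(tailor2)
    # Ordinary least squares for Q1 = beta1^T Z10 + (psi1^T Z11) * A1
    X = np.hstack([Z10, Z11 * A1[:, None]])
    coef, *_ = np.linalg.lstsq(X, pseudo, rcond=None)
    beta1, psi1 = coef[:Z10.shape[1]], coef[Z10.shape[1]:]
    rule1 = np.sign(Z11 @ psi1)        # estimated rule: sign(psi1^T Z11)
    return beta1, psi1, rule1
```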

3.3. Inference

In addition to obtaining the optimal stage-specific decision rules, our second goal is to draw statistical inference on the effect of each stage's covariates on the decision, with particular emphasis on tailoring variables. To that end, we propose using a censoring-adjusted version of the m-out-of-n method presented by [31]. In stage 1 optimization, the pseudo-outcome $\widetilde{PO}_1 = q_1 S_1 + \{1 - \Phi(H_1)\}\{S_1 - q_1 S_1\} + \hat{\beta}_2^T Z_{20} + |\hat{\psi}_2^T Z_{21}|$ is a nonsmooth function of $\hat{\psi}_2$. In particular, if $P[Z_{21}: \psi_2^T Z_{21} = 0] = 0$, then the first stage coefficient estimators converge to a normal distribution. However, if $P[Z_{21}: \psi_2^T Z_{21} = 0] > 0$, the estimator $\hat{\psi}_1$ oscillates between two asymptotic distributions across samples, which reflects the typical challenging problem of nonregularity in the DTR literature [29]. Hence, direct estimation results in an asymptotically biased estimator and poor performance of the usual Wald-type confidence intervals. Even bootstrap-based approaches suffer from the underlying nonsmoothness. The m-out-of-n bootstrap was developed to address the bootstrap inconsistency caused by nonsmoothness [32]. Although conceptually very similar to the original bootstrap, the resample size m (which needs to depend on n, tend to infinity with n, and satisfy $m = o(n)$) is selected to be of a smaller order than n. Chakraborty et al. 2013 [31] showed through simulation studies that the m-out-of-n approach obtained desirable coverage probabilities in the two-stage DTR problem for first stage tailoring variables. Because censoring reduces the size of the observed stage 2 data in our scenario, we further modified the m-out-of-n algorithm to accommodate censoring. Our algorithm works as follows.
We adopted the functional form of m as presented in [31]:
$$m = n^{\frac{1 + \xi(1 - \hat{p})}{1 + \xi}}. \tag{3}$$
Let n be the total number of subjects in the dataset (including those who were censored). For stage 2, we create a bootstrap sample of size n and fit a regression model to the complete data, weighted by IPC weights re-estimated within the bootstrapped sample, to obtain stage 2 coefficient estimates. Stage 2 95% confidence intervals are obtained after getting $\hat{l}_2$ and $\hat{u}_2$, the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{n}(\hat{\theta}_{2,n}^{(b)} - \hat{\theta}_{2,n})$, where $\alpha$ is the desired significance level, $\hat{\theta}_{2,n}^{(b)}$ is the bootstrap estimate of the stage 2 coefficients with bootstrap-specific re-estimated censoring weights, and $\hat{\theta}_{2,n}$ is the plug-in estimator obtained by weighted regression on the empirical dataset. The confidence interval is given by $(\hat{\theta}_{2,n} - \hat{u}_2/\sqrt{n},\; \hat{\theta}_{2,n} - \hat{l}_2/\sqrt{n})$. For stage 1, we first generate bootstrap samples of size m, which is calculated using Equation (3) after computing a sample-specific $\hat{p}$. We further use each bootstrap sample to re-estimate the IPC weights and fit a weighted linear model to obtain a bootstrap-specific stage 2 estimate. The stage 2 coefficients from each bootstrap sample are then used to calculate pseudo-outcomes, which are in turn used to fit a stage 1 model to obtain $\hat{\theta}_{1,\hat{m}}^{(b)}$. Then we obtain $\hat{l}_1$ and $\hat{u}_1$, the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{\hat{m}}(\hat{\theta}_{1,\hat{m}}^{(b)} - \hat{\theta}_{1,n})$, where $\hat{\theta}_{1,n}$ is the plug-in estimator obtained using the complete empirical dataset, while $\hat{\theta}_{1,\hat{m}}^{(b)}$ is the estimate obtained from each bootstrap sample of size m. The confidence set is given by $(\hat{\theta}_{1,n} - \hat{u}_1/\sqrt{n},\; \hat{\theta}_{1,n} - \hat{l}_1/\sqrt{n})$.
We further selected the value of $\xi$ to be 0.01, which provided stable coverage in simulations with complete data. The calculation of $\hat{p} = P_n I\{n(Z_{21}^T \hat{\psi}_{2,n})^2 \leq \tau_n(Z_{21})\}$ relies on a choice of $\tau_n(Z_{21})$. We opted to use the plug-in estimator $\tau_n(Z_{21}) = (Z_{21}^T \hat{\Sigma}_{21} Z_{21}) \cdot \chi^2_{1, 1-\nu}$ just as in [31], where $\hat{\Sigma}_{21}$ is the plug-in sandwich estimator of $n\,\mathrm{Cov}(\hat{\psi}_{2,n}, \hat{\psi}_{2,n})$, and $\nu = 0.01$.
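The quantities $\hat{p}$ and m can be computed directly from the stage 2 fit; in the sketch below, `Sigma21` is assumed to be a user-supplied sandwich estimate $\hat{\Sigma}_{21}$ of $n\,\mathrm{Cov}(\hat{\psi}_{2,n}, \hat{\psi}_{2,n})$, and the defaults follow the values of $\xi$ and $\nu$ above.

```python
import numpy as np
from scipy.stats import chi2

def m_out_of_n_size(Z21, psi2_hat, Sigma21, xi=0.01, nu=0.01):
    """Resample size m = n^{(1 + xi*(1 - p_hat)) / (1 + xi)}, with
    p_hat = P_n I{ n (Z21^T psi2_hat)^2 <= tau_n(Z21) } and
    tau_n(Z21) = (Z21^T Sigma21 Z21) * chi2_{1, 1-nu}."""
    n = Z21.shape[0]
    contrast = Z21 @ psi2_hat                          # psi2^T Z21 per subject
    tau = np.einsum("ij,jk,ik->i", Z21, Sigma21, Z21) * chi2.ppf(1 - nu, df=1)
    p_hat = np.mean(n * contrast**2 <= tau)            # estimated degree of nonregularity
    m = int(np.floor(n ** ((1 + xi * (1 - p_hat)) / (1 + xi))))
    return m, p_hat
```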
In practice, however, the two hyperparameters $\xi$ and $\nu$ would need to be tuned to obtain an appropriate m; alternatively, a double bootstrap can be used to choose the resample size m directly and automatically. Such a double bootstrap algorithm for choosing m is data-driven. Suppose we are interested in $c^T\theta$, e.g., a stage 1 variable effect, and its estimate from the original data is $c^T\hat{\theta}_n$. Consider a grid of candidate values for m:
(1) Draw $B_1$ n-out-of-n first-stage bootstrap samples from the data and calculate the bootstrap estimates $c^T\hat{\theta}_n^{b_1}$, $b_1 = 1, 2, \ldots, B_1$. Fix m at the smallest value in the grid.
(2) Conditional on each first-stage bootstrap sample, draw $B_2$ m-out-of-n second-stage (nested) bootstrap samples and calculate the double bootstrap versions of the estimate $c^T\hat{\theta}_{n,m}^{b_1 b_2}$, $b_1 = 1, 2, \ldots, B_1$, $b_2 = 1, 2, \ldots, B_2$.
(3) For $b_1 = 1, 2, \ldots, B_1$, compute the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{m}\, c^T(\hat{\theta}_{n,m}^{b_1 b_2} - \hat{\theta}_n^{b_1})$, $b_2 = 1, \ldots, B_2$; say $\hat{l}_{DB}^{(b_1)}$ and $\hat{u}_{DB}^{(b_1)}$, respectively. Construct the double-centered percentile bootstrap interval from the $b_1$-th first-stage bootstrap data as $(c^T\hat{\theta}_n^{b_1} - \hat{u}_{DB}^{(b_1)}/\sqrt{n},\; c^T\hat{\theta}_n^{b_1} - \hat{l}_{DB}^{(b_1)}/\sqrt{n})$, $b_1 = 1, \ldots, B_1$.
(4) Estimate the coverage rate of the double bootstrap CI from all of the first-stage bootstrap datasets as:
$$\frac{1}{B_1}\sum_{b_1=1}^{B_1} I\Big\{c^T\hat{\theta}_n^{b_1} - \hat{u}_{DB}^{(b_1)}/\sqrt{n} \;\leq\; c^T\hat{\theta}_n \;\leq\; c^T\hat{\theta}_n^{b_1} - \hat{l}_{DB}^{(b_1)}/\sqrt{n}\Big\}.$$
(5) If the current coverage rate lies in $[(1 - \alpha) - 1.96\sqrt{\alpha(1 - \alpha)/B_1},\; (1 - \alpha) + 1.96\sqrt{\alpha(1 - \alpha)/B_1}]$, then it is not significantly different from $1 - \alpha$, and we pick the current value of m as the final value. Otherwise, increase m to the next value in the grid.
(6) Repeat steps (1)–(5), until (5) is satisfied or the grid is exhausted.
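A schematic version of this grid search is sketched below; `fit_stage1_coef` is a hypothetical placeholder for whatever routine returns the stage 1 quantity of interest $c^T\hat{\theta}$ (for example, the censoring-adapted fits sketched earlier), and the default values of $B_1$ and $B_2$ are arbitrary choices for illustration.

```python
import numpy as np

def choose_m_double_bootstrap(data, fit_stage1_coef, m_grid,
                              B1=100, B2=100, alpha=0.05, seed=0):
    """Data-driven choice of the m-out-of-n resample size, following
    steps (1)-(6). `data` is an array of per-subject rows and
    `fit_stage1_coef(rows)` returns the scalar c^T theta_hat."""
    rng = np.random.default_rng(seed)
    n = len(data)
    theta_n = fit_stage1_coef(data)
    # Step (1): first-stage n-out-of-n bootstrap samples and estimates
    first = [data[rng.integers(0, n, n)] for _ in range(B1)]
    theta_b1 = np.array([fit_stage1_coef(s) for s in first])
    for m in m_grid:                                    # start at the smallest m
        covered = 0
        for s, t_b1 in zip(first, theta_b1):
            # Step (2): nested m-out-of-n bootstrap within each first-stage sample
            t_nested = np.array([fit_stage1_coef(s[rng.integers(0, n, m)])
                                 for _ in range(B2)])
            # Step (3): double-centered percentile interval
            z = np.sqrt(m) * (t_nested - t_b1)
            l, u = np.percentile(z, [100 * alpha / 2, 100 * (1 - alpha / 2)])
            # Step (4): does the interval cover the original-data estimate?
            if t_b1 - u / np.sqrt(n) <= theta_n <= t_b1 - l / np.sqrt(n):
                covered += 1
        rate = covered / B1
        # Step (5): accept m if the coverage is within Monte Carlo error of 1 - alpha
        half_width = 1.96 * np.sqrt(alpha * (1 - alpha) / B1)
        if (1 - alpha) - half_width <= rate <= (1 - alpha) + half_width:
            return m
    return m_grid[-1]                                   # Step (6): grid exhausted
```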

3.4. Survival Augmented Patient Preference Weights (SAPP-Weights)

Going forward, we assume that information in the latest stage survey will override information from previous stages as well as other covariate information (i.e., $W_j$ will override $W_{j-1}$ and $\bar{X}_j$). To model survey information as a function of latent preference, we assume a latent traits model [33]. We further assume that the latent preferences are related to survey responses through a modified Rasch model [34,35].
For our scenario, we assume that there are numQ questions on a survey, each soliciting binary answer choices from the patient. For each binary response, we assume that the underlying generating mechanism is of the form $\mathrm{logit}\{P(W_{jk} = 1 \mid H_j = h_j)\} = \alpha_{0,k} + \alpha_{1,k} h_j$, where j indicates the stage and k the question number.
Algorithm 1 outlines the procedure for estimating the patient preference $\hat{h}_j$. The survival augmented patient preference weights (SAPP-weights) are then the transformed version of $\hat{h}_j$, which we denote as $\Phi(\hat{h}_{ij})$. Essentially, we use the Expectation-Maximization (EM) algorithm [36] to iterate between estimates of $\alpha$, the questionnaire coefficients, and $h_j$, the individual patient preferences at stage j. We use Gauss–Hermite quadrature to numerically approximate the required integral, and estimate $P(h_j \mid W_j) \propto P(W_j \mid h_j) P(h_j)$ using the Metropolis–Hastings algorithm.
Algorithm 1: EM algorithm for estimating patient preference $\hat{h}_j$.
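To convey the idea behind Algorithm 1, the sketch below implements one E-step: given the current item parameters $\alpha$, the posterior mean of each patient's latent preference is approximated by Gauss–Hermite quadrature under the logit response model, assuming a normal prior on $h_j$. This sketch uses quadrature for the whole E-step rather than the Metropolis–Hastings sampler mentioned above, and the function names, prior standard deviation, and number of quadrature nodes are illustrative assumptions.

```python
import numpy as np

def rasch_loglik(W_i, alpha0, alpha1, h):
    """Log-likelihood of one subject's binary answers W_i given preference h
    under logit P(W_k = 1 | h) = alpha0_k + alpha1_k * h."""
    eta = alpha0 + alpha1 * h
    return np.sum(W_i * eta - np.log1p(np.exp(eta)))

def posterior_mean_preference(W, alpha0, alpha1, n_quad=25, prior_sd=1.0):
    """E-step sketch: Gauss-Hermite approximation of E[h | W] for each subject,
    assuming h ~ N(0, prior_sd^2). W is an (n_subjects, numQ) 0/1 array."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    h_grid = np.sqrt(2.0) * prior_sd * nodes        # change of variables for N(0, sd^2)
    post_mean = np.empty(len(W))
    for i, W_i in enumerate(W):
        lik = np.exp([rasch_loglik(W_i, alpha0, alpha1, h) for h in h_grid])
        post_mean[i] = np.sum(weights * lik * h_grid) / np.sum(weights * lik)
    return post_mean
```

In a full EM iteration, an M-step would then update each question's $(\alpha_{0,k}, \alpha_{1,k})$ by maximizing the expected complete-data log-likelihood (e.g., via quadrature-weighted logistic regressions), and the two steps would alternate until convergence; the SAPP-weights are finally obtained as $\Phi(\hat{h}_{ij})$.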

4. Numerical Results

We conducted simulation studies to investigate the performance of our proposed method. We looked at two scenarios, differing by the degree of nonregularity p, the estimated probability that the stage 2 treatment does not provide a significant difference. The first scenario is an example of low nonregularity, where about 25% of people obtain similar results with both treatments. Scenario 2 is an example of higher nonregularity, where approximately 75% of patients could obtain similar results with either of the stage 2 treatments. The true value of p was estimated through simulated complete data (assuming no censoring) and true preferences.
For both scenarios, we generated baseline covariates $X_1, X_2 \sim N(0, 1)$, censoring time C, quality of life $q_1, q_2$, and survival times $S_1, S_2$. Preferences for both stages were generated from a $N(0, SD = 0.5)$ distribution, and 10 binary preference-derived questionnaire responses $W_1, W_2$ were generated according to our latent model with the coefficients shown in Table 1.
The two scenarios differ in terms of stage 2 parameters but share common stage 1 settings. Stage 1 treatment assignment $A_1$ was randomly assigned with probability 0.5. The stage 1 quality of life outcome $q_1 \in [0, 1]$ was generated from $N(\alpha_0 + \alpha_1 X_1 + \alpha_2 A_1 + \alpha_3 X_1 A_1, \sigma^2)$, where $\alpha = [0.55, 0.03, 0.06, 0.09]$ and $\sigma = 0.03$. As mentioned previously, $S_1$ is the same for everyone and indicates a routine follow-up time of 30 days. The stage 1 outcome is a weighted combination of $q_1$ and $S_1$ through the equation $R_1 = q_1 S_1 + (1 - \Phi(h_1))(S_1 - q_1 S_1)$, where $\Phi(h)$ represents the cumulative distribution function of a standard normal. $R_1$ can be interpreted as a quality of life weighted survival during the initial follow-up time. The treatment assignment $A_2$ depends on $R_1$ through $\mathrm{Bernoulli}\big(\exp(3 + R_1/3)/\{1 + \exp(3 + R_1/3)\}\big)$, and thus patients with a higher $R_1$ are more likely to remain on the same treatment as $A_1$.
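For concreteness, a minimal sketch of this shared stage 1 data-generating step is given below, using the coefficient values as stated above; the $\pm 1$ coding of treatments, the clipping of $q_1$ to $[0, 1]$, the interpretation of the Bernoulli draw as the probability of staying on the stage 1 treatment, and all names are our own illustrative assumptions (questionnaire responses and stage 2 outcomes are omitted).

```python
import numpy as np
from scipy.stats import norm

def generate_stage1(n, seed=0):
    """Stage 1 portion of the simulation: covariates, randomized A1, quality of
    life q1, fixed S1 = 30 days, latent preference h1, reward R1, and A2."""
    rng = np.random.default_rng(seed)
    X1, X2 = rng.standard_normal(n), rng.standard_normal(n)
    A1 = rng.choice([-1, 1], size=n)                 # randomized with probability 0.5
    h1 = rng.normal(0.0, 0.5, size=n)                # stage 1 preference
    a = [0.55, 0.03, 0.06, 0.09]
    q1 = rng.normal(a[0] + a[1] * X1 + a[2] * A1 + a[3] * X1 * A1, 0.03)
    q1 = np.clip(q1, 0.0, 1.0)                       # quality of life lives in [0, 1]
    S1 = np.full(n, 30.0)                            # routine 30-day follow-up
    R1 = q1 * S1 + (1.0 - norm.cdf(h1)) * (S1 - q1 * S1)
    p_stay = np.exp(3 + R1 / 3) / (1 + np.exp(3 + R1 / 3))
    A2 = np.where(rng.uniform(size=n) < p_stay, A1, -A1)   # higher R1 -> keep A1
    return X1, X2, A1, h1, q1, S1, R1, A2
```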
Using these simulated datasets, we estimated patient preferences, the optimal dynamic treatment regime, and the confidence intervals of the estimated DTR coefficients from these responses and outcomes. We evaluated each scenario through measures of bias, coverage probability, optimal mean response, and the percent of subjects correctly classified to their true optimal treatment (% opt).

4.1. Scenario 1: Low Nonregularity (p = 0.25)

We generated the stage 2 outcome $q_2 \sim N(\beta_0 + \beta_1 X_2 + \beta_2 X_2 A_2, SD = 0.03)$, where $\beta = [0.5, 0.07, 0.06]$. Similarly, we generated $S_2 \sim N(\gamma_0 + \gamma_1 R_1 + \gamma_2 (R_1 - c) A_2, SD = 5)$, where $\gamma = [50, 10, 2.5]$ and $c = 20$. $q_2$ and $S_2$ disagreed on the optimal stage 2 treatment approximately half the time, indicating that half of all patients had to make a choice between quality and quantity of life. In a randomly generated dataset, treatment 1 gave 73.4% of patients a better stage 1 reward, while treatment 1 gave 31% of patients a better stage 2 reward. The range of $S_2$ varied from 169 to 361 days.
Unlike stage 1, stage 2 survival times could be subject to censoring at time C. We generated the censoring time C from an exponential distribution with rate $\exp(\log\lambda_{0C} + X_2\beta_C)$, where $\beta_C = 0.01$, and $\lambda_{0C} = 0.00058$ for 15% censoring and $\lambda_{0C} = 0.0013$ for 30% censoring. $\tau$ was set to be one year after initiation of the stage 2 treatment.
Because $S_2$ and $q_2$ are functions of $X_2$, $A_2$, and $R_1$, the reward combination can be rearranged as:
$$\begin{aligned} R_2 ={}& q_2 S_2 + \{1 - \Phi(h_2)\}(S_2 - q_2 S_2) \\ ={}& \gamma_0 + (\beta_0\gamma_0 - \gamma_0)\Phi(h_2) + \{\gamma_1 + (\beta_0\gamma_1 - \gamma_1)\Phi(h_2)\}R_1 + \{(\beta_1\gamma_0 - \beta_2\gamma_2 c)\Phi(h_2)\}X_2 \\ &+ \{(\beta_1\gamma_1 + \beta_2\gamma_2)\Phi(h_2)\}X_2 R_1 + \{(\gamma_2 c - \beta_0\gamma_2 c)\Phi(h_2) - \gamma_2 c\}A_2 + \{\gamma_2 + (\beta_0\gamma_2 - \gamma_2)\Phi(h_2)\}R_1 A_2 \\ &+ \{(\beta_1\gamma_2 + \beta_2\gamma_1)\Phi(h_2)\}R_1 X_2 A_2 + \{(\beta_2\gamma_0 - \beta_1\gamma_2 c)\Phi(h_2)\}X_2 A_2. \end{aligned}$$
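This rearrangement can be checked symbolically; a short sketch (assuming the $\pm 1$ treatment coding, so that $A_2^2 = 1$):

```python
from sympy import symbols, expand, collect

b0, b1, b2, g0, g1, g2, c, Phi, X2, A2, R1 = symbols(
    "beta0 beta1 beta2 gamma0 gamma1 gamma2 c Phi X2 A2 R1")
q2 = b0 + b1 * X2 + b2 * X2 * A2                 # stage 2 quality of life mean
S2 = g0 + g1 * R1 + g2 * (R1 - c) * A2           # stage 2 survival mean
R2 = q2 * S2 + (1 - Phi) * (S2 - q2 * S2)        # preference-weighted reward
R2 = expand(R2).subs(A2**2, 1)                   # A2 in {-1, +1} implies A2^2 = 1
# Group terms by A2, X2 and R1 to read off the coefficients listed above
print(collect(R2, [A2, X2, R1]))
```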
On the other hand, stage 1 coefficients are obtained by regressing a pseudo-outcome, $R_1 + \hat{\beta}_2^T Z_{20} + |\hat{\psi}_2^T Z_{21}|$, on $X_1$, $A_1$, and $X_1 A_1$, so the stage 1 regression has four covariate terms (including the intercept). We obtained both the true stage 1 and stage 2 coefficients by performing Monte Carlo sampling regressions on a sample size of 10 million.
Table 2 and Table 3 summarize the simulation results of the two stages for scenario 1, including bias estimates, the empirical standard deviations (ESD), the mean bootstrap standard deviations, the mean widths of the confidence intervals, and the coverage probabilities of our proposed method. The coverage probabilities ranged from 86% to 96%, with the majority between 92% and 96%. Furthermore, we observed a general agreement between the empirical SD and the average bootstrap SD. In terms of trends, we saw slightly larger ESD, mean bootstrap SD, and mean width when censoring was increased from 15% to 30%, while a reduction in all three was observed when we increased n from 500 to 1000. Using covariate $A_2$ as an example, the ESD for n = 500 and 15% censoring was 11.72, which increased to 13.03 when censoring reached 30% but decreased to 8.38 when n was increased to 1000. Similar patterns were observed with the average bootstrap SD and the mean width of the confidence interval.
We further investigated the distributions of the observed total reward, as well as the predicted optimal reward, of one randomly selected simulation, which we illustrate in Figure 2. Aggregating means across the four sub-scenarios (based on sample size and censoring), the average observed reward was 233.77, while the average predicted optimal reward was 254.19, indicating an expected increase of 20.42 in reward when everyone follows the regime recommended by our algorithm. It is also evident in the figure that the variability of the observed rewards is significantly larger than the variability of the predicted optimal rewards. The aggregated SD of all observed rewards in this scenario is 39.2, while the aggregated SD of all predicted rewards is 5.66.

4.2. Scenario 2: High Nonregularity (p = 0.75)

In this scenario, we evaluated a case with high nonregularity. We modified the scenario 1 simulation as follows: (1) We generated the stage 2 outcome $q_2 \sim N(\beta_0 + \beta_1 X_2 + \beta_2 X_2 A_2, SD = 0.03)$, where $\beta = [0.5, 0.07, 0.021]$, and $S_2 \sim N(\gamma_0 + \gamma_1 R_1 + \gamma_2 (R_1 - c) A_2, SD = 5)$, where $\gamma = [135, 8, 0.5]$ and $c = 20$. (2) We adjusted the coefficients such that the magnitudes of $\gamma_2$ and $\beta_2$ (which influence the effect of $A_2$) are smaller than in scenario 1. (3) $q_2$ and $S_2$ again disagreed on the optimal stage 2 treatment approximately half the time.
In a randomly generated dataset, treatment 1 gave 74.8% of patients a better stage 1 reward, while treatment 1 gave 41.6% of patients a better stage 2 reward. The range of $S_2$ varied from 235 to 368 days.
Stage 2 survival times are similarly subject to censoring at time C. Censoring was generated from an exponential distribution with rate $\exp(\log\lambda_{0C} + X_2\beta_C)$, where $\beta_C = 0.01$, and $\lambda_{0C} = 0.00048$ for 15% censoring and $\lambda_{0C} = 0.0011$ for 30% censoring. As before, $\tau$ was set to be one year after initiation of the stage 2 treatment.
Table 4 and Table 5 present our simulation results for scenario 2. As before, the coverage probabilities range from 86% to 97%, with most around 95%, and the mean width, ESD, and mean bootstrap SD decrease with decreasing censoring and increasing sample size. Again using $A_2$ as an example, the ESD of its coefficient across simulations was 12.38 when n = 500 with 15% censoring, which increased to 13.85 when censoring went to 30% and decreased to 8.84 when n went to 1000.
As in Section 4.1, we investigated the distribution of the observed total reward and the predicted optimal reward of one random simulation in Figure 3. Aggregating means across the four sub-scenarios (based on sample size and censoring), the average observed reward was 263.62, while the average predicted reward was 275.35, indicating an increase of 11.73 in reward when everyone follows the regime recommended by our algorithm. The aggregated SD of the observed rewards was 37.77, while the aggregated SD of all predicted rewards was 4.39.

4.3. Optimality

Besides looking at the performance of inference, we also looked at how often our algorithm chose the correct treatment for each patient at each stage, across the scenarios we have visited (n = 500, 1000, across two levels of censoring at 15% and 30%).
Table 6 shows the simulation results. In this table, we also included the average stage 1 bootstrap resample sizes for each sub-scenario. As expected from Equation (3), increasing p indicates increasing nonregularity and decreases m. We can see that for scenario 1 (p = 0.25), the algorithm chose the optimal treatment for stage 1 over 93% of the time, and over 83% of the time for stage 2. Our algorithm was able to assign the correct treatment regime to a patient over 78% of the time across all sample sizes and censoring levels. This is significantly higher than a random-guess approach, which would have landed us at around 25%.
By contrast, scenario 2's performance was slightly weaker, coming in at over 93% for stage 1, around 66–69% for stage 2, and with an overall correct regime assignment percentage of 61–65%; however, this is still much better than the 25% of a random guess. In this scenario, we could also see a general increase in SD compared to scenario 1, indicating higher uncertainty in our decision-making process.

5. Discussion

In this paper, we proposed a method that estimates the optimal decision rules for a two-stage treatment scenario subject to censoring. We proposed to balance between quality of life and quantity of life using a sliding scale function adjusted by patient preference. Through simulation studies, we have shown that our proposed method is capable of choosing the optimal treatment dynamically a majority of the time, as well as providing convincing confidence intervals for each of the coefficients in question.
The simulation results of scenarios 1 and 2 are notable in the following ways. Most importantly, we can see that the coverage probabilities mostly hover around 95%, showing that our confidence interval has the combination of adequate width and minimal bias required for good coverage probability. The general congruence between the empirical SD and the mean bootstrap SD further supports that our method is sampling at the appropriate width. Generally, there is a slight increase in ESD and mean bootstrap SD when increasing censoring from 15% to 30%, indicating decreased certainty, but the difference is slight (e.g., for parameter $A_2$ in scenario 1, the ESD increased from 11.72 to 13.03). The decrease in ESD is more significant between n = 500 and n = 1000, where for scenario 1 and 15% censoring, the ESD of the estimated coefficient for parameter $A_2$ decreased from 11.72 to 8.38. In the boxplots in Figure 2 and Figure 3, we can see much larger variability in the observed rewards as compared to the predicted rewards. This is as expected for two reasons. First, observed rewards contain an error component that is not present in expected (predicted) rewards. Second, observed rewards include individuals who have by chance obtained their optimal reward, as well as those who did not; variability invariably reduces when more individuals are predicted to their optimal reward. As expected, the distribution of the predicted optimal rewards can be mapped to the upper part of the observed rewards. The same general patterns were observed for scenario 2.
One main challenge in this work, as is the case in [31], is the selection of m, which is a crucial factor in determining the coverage probability and confidence interval width. In our simulations, we selected the parameters ν and ξ using background knowledge and reference simulation results and used Equation (3) to select m. This approach was recommended by [31] and is straightforward and easy to implement, but it risks inappropriate values of m if either ν or ξ is selected inadequately. Hence, we recommend the double bootstrap procedure in practice, where we take our empirical dataset estimator as the truth and build confidence intervals using nested bootstrap samples of size m drawn from empirical bootstrap samples of size n. Similarly, another potential idea to further improve coverage across all covariates could be to select distinct values of m for each covariate. This might be of interest for future research. Alternatively, one may consider coupling our method with other methods that address the nonregularity issue in dynamic treatment regimes, notably [37]. Two main challenges exist for this proposed integration, namely the adaptability to censored data and the derivation of asymptotic results in censored settings, and the specific process of choosing a tuning parameter λ for our proposed data structure. The integration would be an interesting and nontrivial problem for future research.
The authors are aware of two works in the literature that balance between two outcomes and would like to highlight certain differences. Zhao et al. 2009 [15] looked at the dosage effect of a cancer drug on tumor size and toxicity. In their work, each reward function is separated into three parts: survival status, wellness, and tumor size effects. In their simulation studies, the reward was assigned to be −60 if the patient died, 15 if the patient's tumor shrank to zero, and 5 or −5 if tumor size/wellness improved or deteriorated, respectively. The method of assigning rewards is particular to the example they proposed, and there is no clear indication of how to perform inference using this method. The comparison between our method and [14] is more direct. Their work used patient preference estimation to weigh between toxicity and efficacy, both continuous outcomes. There are definite similarities between this work and [14], especially in terms of preference estimation and the common Q-learning framework. Most importantly, however, Butler et al. 2018 [14] would not accommodate censored data, nor did they provide any framework for inference, both of which are important contributions of this work.
This work can be extended in several directions for increased generalizability and impact. First, the vast number of methods for survival data is one indication of the complexity of generalizing to the various time-to-event scenarios. One main direction for extension would be to accommodate multiple stages, with potential censoring and event times that could happen at any stage, similar to the set-up of [20]. Another direction that might benefit real applications could be allowing the subset of available treatments to change depending on patient outcomes, as in [21]. As pointed out by a reviewer, there may be other ways of estimating and incorporating patient preference. Before settling on our proposed approach, we had considered hierarchical models, which condition directly on latent variables. However, this approach would lose its causal interpretation in multiple stage settings. Even though hierarchical models were not suitable for multi-stage settings, exploring other models for calculating and estimating preference variables, e.g., generalizing from the logit to other exponential families within the latent variable framework as in [33], would still be valuable. Finally, generalizing binary treatment options to a continuous version (i.e., dosage), or increasing the number of outcomes we balance, are also meaningful directions for future extensions.

Author Contributions

Conceptualization, L.W.; Formal analysis, Y.Z., and C.W.; Investigation, Y.Z. and L.W.; Methodology, Y.Z. and L.W.; Supervision, L.W.; Validation, C.W.; Visualization, Y.Z.; Writing—original draft, Y.Z. and L.W.; Writing—review & editing, Y.Z., C.W. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Robins, J.M. Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology, the Environment, and Clinical Trials; Springer: Berlin/Heidelberg, Germany, 2000; pp. 95–133.
  2. Robins, J.M. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics; Springer: New York, NY, USA, 2004; pp. 189–326.
  3. Murphy, S.A. Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 2003, 65, 331–355.
  4. Murphy, S.A. An experimental design for the development of adaptive treatment strategies. Stat. Med. 2005, 24, 1455–1481.
  5. Moodie, E.E.; Chakraborty, B.; Kramer, M.S. Q-learning for estimating optimal dynamic treatment rules from observational data. Can. J. Stat. 2012, 40, 629–645.
  6. Schulte, P.J.; Tsiatis, A.A.; Laber, E.B.; Davidian, M. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Stat. Sci. 2014, 29, 640.
  7. Laber, E.; Zhao, Y. Tree-based methods for individualized treatment regimes. Biometrika 2015, 102, 501–514.
  8. Tao, Y.; Wang, L. Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics 2017, 73, 145–155.
  9. Tao, Y.; Wang, L.; Almirall, D. Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. Ann. Appl. Stat. 2018, 12, 1914–1938.
  10. Zhang, Y.; Laber, E.B.; Davidian, M.; Tsiatis, A.A. Interpretable dynamic treatment regimes. J. Am. Stat. Assoc. 2018, 113, 1541–1549.
  11. Zhao, Y.; Zeng, D.; Rush, A.J.; Kosorok, M.R. Estimating individualized treatment rules using outcome weighted learning. J. Am. Stat. Assoc. 2012, 107, 1106–1118.
  12. Zhang, B.; Zhang, M. C-learning: A new classification framework to estimate optimal dynamic treatment regimes. Biometrics 2018, 74, 891–899.
  13. Sun, Y.; Wang, L. Stochastic Tree Search for Estimating Optimal Dynamic Treatment Regimes. J. Am. Stat. Assoc. 2021, 116, 421–432.
  14. Butler, E.L.; Laber, E.B.; Davis, S.M.; Kosorok, M.R. Incorporating patient preferences into estimation of optimal individualized treatment rules. Biometrics 2018, 74, 18–26.
  15. Zhao, Y.; Kosorok, M.R.; Zeng, D. Reinforcement learning design for cancer clinical trials. Stat. Med. 2009, 28, 3294–3315.
  16. Torrance, G.W.; Feeny, D. Utilities and quality-adjusted life years. Int. J. Technol. Assess. Health Care 1989, 5, 559–575.
  17. Cui, Y.; Zhu, R.; Kosorok, M. Tree based weighted learning for estimating individualized treatment rules with censored data. Electron. J. Stat. 2017, 11, 3927–3953.
  18. Zhu, R.; Kosorok, M.R. Recursively imputed survival trees. J. Am. Stat. Assoc. 2012, 107, 331–340.
  19. Zhao, Y.Q.; Zeng, D.; Laber, E.B.; Song, R.; Yuan, M.; Kosorok, M.R. Doubly robust learning for estimating individualized treatment with censored data. Biometrika 2015, 102, 151–168.
  20. Goldberg, Y.; Kosorok, M.R. Q-learning with censored data. Ann. Stat. 2012, 40, 529.
  21. Hager, R.; Tsiatis, A.A.; Davidian, M. Optimal two-stage dynamic treatment regimes from a classification perspective with censored survival data. Biometrics 2018, 74, 1180–1192.
  22. Zhao, Y.Q.; Zhu, R.; Chen, G.; Zheng, Y. Constructing Stabilized Dynamic Treatment Regimes. arXiv 2018, arXiv:1808.01332.
  23. Jiang, R.; Lu, W.; Song, R.; Davidian, M. On estimation of optimal treatment regimes for maximizing t-year survival probability. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 1165.
  24. Shen, J.; Wang, L.; Daignault, S.; Spratt, D.E.; Morgan, T.M.; Taylor, J.M. Estimating the Optimal Personalized Treatment Strategy Based on Selected Variables to Prolong Survival via Random Survival Forest with Weighted Bootstrap. J. Biopharm. Stat. 2018, 28, 362–381.
  25. Huang, X.; Ning, J. Analysis of multi-stage treatments for recurrent diseases. Stat. Med. 2012, 31, 2805–2821.
  26. Huang, X.; Ning, J.; Wahed, A.S. Optimization of individualized dynamic treatment regimes for recurrent diseases. Stat. Med. 2014, 33, 2363–2378.
  27. Zhang, M.; Schaubel, D.E. Contrasting treatment-specific survival using double-robust estimators. Stat. Med. 2012, 31, 4255–4268.
  28. Jiang, R.; Lu, W.; Song, R.; Hudgens, M.G.; Naprvavnik, S. Doubly robust estimation of optimal treatment regimes for survival data—With application to an HIV/AIDS study. Ann. Appl. Stat. 2017, 11, 1763.
  29. Chakraborty, B.; Murphy, S.; Strecher, V. Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 2010, 19, 317–343.
  30. Laber, E.B.; Lizotte, D.J.; Qian, M.; Pelham, W.E.; Murphy, S.A. Dynamic treatment regimes: Technical challenges and applications. Electron. J. Stat. 2014, 8, 1225.
  31. Chakraborty, B.; Laber, E.B.; Zhao, Y. Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme. Biometrics 2013, 69, 714–723.
  32. Shao, J. Bootstrap sample size in nonregular cases. Proc. Am. Math. Soc. 1994, 122, 1251–1262.
  33. Moustaki, I.; Knott, M. Generalized latent trait models. Psychometrika 2000, 65, 391–411.
  34. Rasch, G. On general laws and the meaning of measurement in psychology. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 1 January 1961; Volume 4, pp. 321–333.
  35. Rasch, G. Studies in Mathematical Psychology: I. Probabilistic Models for Some Intelligence and Attainment Tests. 1960. Available online: https://psycnet.apa.org/record/1962-07791-000 (accessed on 14 September 2021).
  36. Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60.
  37. Song, R.; Wang, W.; Zeng, D.; Kosorok, M.R. Penalized q-learning for dynamic treatment regimens. Stat. Sin. 2015, 25, 901.
Figure 1. Schematic of clinical progression timeline.
Figure 2. Comparison of observed reward and predicted optimal reward for scenario 1 (p = 0.25).
Figure 3. Comparison of observed reward and predicted optimal reward for scenario 2 (p = 0.75).
Table 1. Coefficients for the latent model used to solicit preferences. α10 and α11 are the coefficients for the stage 1 questionnaire, while α20 and α21 are for the stage 2 questionnaire.

Question    α10      α11      α20      α21
Q1           1.00     1.00     0.90     0.72
Q2           0.00    −1.00    −0.19     1.82
Q3          −1.64     2.35     1.47    −1.63
Q4           0.54    −2.35    −0.50     2.11
Q5          −0.88    −1.10     1.30     1.69
Q6           0.75     1.25    −0.40    −1.69
Q7           1.27     2.96     0.04     2.37
Q8          −1.50     2.00    −1.27     1.60
Q9           0.09    −1.50     0.62    −1.17
Q10         −0.55     1.35    −0.23     2.00
Table 2. Stage 2 simulation results for scenario 1 (p = 0.25). True parameters are [7.49, 3.23, 37.58, 0.28, −1.88, 3.21, 0.21].

n      % Censor   Parameter     Bias     ESD     AvgBootSD   Mean Width   CP
500    15         R1           −0.32     0.48     0.50         1.94       0.92
500    15         X2            0.16    11.94    12.18        47.75       0.96
500    15         A2           −4.93    11.72    11.57        45.17       0.92
500    15         R1:X2        −0.01     0.52     0.52         2.04       0.96
500    15         R1:A2         0.22     0.50     0.50         1.94       0.92
500    15         X2:A2         0.38    12.51    12.20        47.68       0.93
500    15         R1:X2:A2     −0.01     0.54     0.52         2.04       0.94
500    30         R1           −0.31     0.54     0.55         2.15       0.93
500    30         X2            0.38    13.39    13.57        53.19       0.95
500    30         A2           −4.83    13.03    12.80        50.05       0.92
500    30         R1:X2        −0.02     0.58     0.58         2.28       0.94
500    30         R1:A2         0.21     0.56     0.55         2.15       0.93
500    30         X2:A2         0.06    13.56    13.57        53.12       0.95
500    30         R1:X2:A2      0.00     0.58     0.58         2.28       0.95
1000   15         R1           −0.29     0.36     0.34         1.34       0.86
1000   15         X2           −0.14     8.23     8.34        32.64       0.95
1000   15         A2           −4.73     8.38     8.01        31.26       0.88
1000   15         R1:X2         0.01     0.35     0.36         1.40       0.95
1000   15         R1:A2         0.21     0.36     0.34         1.34       0.88
1000   15         X2:A2         0.10     8.55     8.34        32.59       0.94
1000   15         R1:X2:A2      0.00     0.37     0.36         1.39       0.94
1000   30         R1           −0.28     0.40     0.38         1.49       0.87
1000   30         X2            0.04     9.38     9.25        36.17       0.96
1000   30         A2           −4.94     8.95     8.85        34.58       0.90
1000   30         R1:X2         0.00     0.40     0.40         1.55       0.96
1000   30         R1:A2         0.22     0.39     0.38         1.48       0.89
1000   30         X2:A2         0.44     9.67     9.24        36.11       0.94
1000   30         R1:X2:A2     −0.02     0.41     0.40         1.55       0.94
Table 3. Stage 1 simulation results for scenario 1 (p = 0.25). True parameters are [3.77, 8.17, −2.37].

n      % Censor   Parameter   Bias     ESD     AvgBootSD   Mean Width   CP
500    15         X1         −0.10     0.93     0.98         3.84       0.95
500    15         A1         −0.30     1.05     1.07         4.18       0.93
500    15         X1:A1       0.41     1.21     1.23         4.78       0.94
500    30         X1         −0.10     0.93     1.00         3.93       0.96
500    30         A1         −0.29     1.07     1.11         4.34       0.94
500    30         X1:A1       0.40     1.28     1.29         5.05       0.94
1000   15         X1         −0.04     0.67     0.68         2.66       0.96
1000   15         A1         −0.20     0.73     0.74         2.90       0.94
1000   15         X1:A1       0.31     0.85     0.85         3.32       0.92
1000   30         X1         −0.04     0.68     0.69         2.71       0.95
1000   30         A1         −0.21     0.75     0.77         2.99       0.94
1000   30         X1:A1       0.31     0.88     0.90         3.49       0.94
Table 4. Stage 2 simulation results for scenario 2 (p = 0.75). True parameters are [6.00, 4.74, 7.53, 0.28, −0.38, 1.66, 0.07].

n      % Censor   Parameter     Bias     ESD     AvgBootSD   Mean Width   CP
500    15         R1           −0.33     0.51     0.52         2.05       0.92
500    15         X2            0.31    12.42    12.83        50.29       0.97
500    15         A2           −3.47    12.38    12.27        47.86       0.93
500    15         R1:X2        −0.01     0.54     0.55         2.14       0.96
500    15         R1:A2         0.16     0.53     0.52         2.05       0.94
500    15         X2:A2         0.17    12.99    12.84        50.18       0.95
500    15         R1:X2:A2     −0.01     0.55     0.55         2.14       0.95
500    30         R1           −0.31     0.56     0.58         2.27       0.93
500    30         X2            0.41    13.92    14.32        56.06       0.95
500    30         A2           −3.35    13.85    13.60        53.13       0.92
500    30         R1:X2        −0.02     0.60     0.61         2.39       0.95
500    30         R1:A2         0.15     0.59     0.58         2.28       0.91
500    30         X2:A2        −0.08    14.13    14.31        55.97       0.94
500    30         R1:X2:A2      0.00     0.60     0.61         2.39       0.95
1000   15         R1           −0.30     0.39     0.36         1.42       0.86
1000   15         X2           −0.10     8.69     8.78        34.34       0.95
1000   15         A2           −3.52     8.84     8.51        33.24       0.91
1000   15         R1:X2         0.00     0.37     0.37         1.47       0.96
1000   15         R1:A2         0.16     0.38     0.36         1.42       0.90
1000   15         X2:A2         0.08     8.99     8.78        34.27       0.94
1000   15         R1:X2:A2      0.00     0.38     0.37         1.46       0.94
1000   30         R1           −0.29     0.42     0.40         1.57       0.88
1000   30         X2            0.24     9.79     9.75        38.13       0.95
1000   30         A2           −3.76     9.46     9.40        36.69       0.93
1000   30         R1:X2        −0.01     0.42     0.42         1.63       0.94
1000   30         R1:A2         0.17     0.41     0.40         1.57       0.93
1000   30         X2:A2         0.37    10.25     9.73        38.04       0.93
1000   30         R1:X2:A2     −0.02     0.44     0.42         1.62       0.94
Table 5. Stage 1 simulation results for scenario 2 (p = 0.75). True parameters are [3.19, 6.39, −9.61].

n      % Censor   Parameter   Bias     ESD     AvgBootSD   Mean Width   CP
500    15         X1         −0.33     0.83     0.91         3.54       0.94
500    15         A1         −0.42     0.95     1.01         3.94       0.94
500    15         X1:A1       0.61     1.14     1.19         4.63       0.92
500    30         X1         −0.35     0.83     0.93         3.62       0.94
500    30         A1         −0.42     0.99     1.05         4.11       0.94
500    30         X1:A1       0.60     1.20     1.26         4.91       0.93
1000   15         X1         −0.22     0.61     0.63         2.46       0.95
1000   15         A1         −0.33     0.68     0.70         2.75       0.93
1000   15         X1:A1       0.48     0.80     0.83         3.25       0.92
1000   30         X1         −0.23     0.62     0.64         2.50       0.94
1000   30         A1         −0.34     0.69     0.73         2.86       0.93
1000   30         X1:A1       0.49     0.84     0.88         3.43       0.92
Table 6. Simulation results for percent optimal treatment chosen.

Scenario    n      % Censor   Stage 1 % Opt (sd)   Stage 2 % Opt (sd)   Overall % Opt (sd)
p = 0.25    500    15         0.936 (0.012)        0.839 (0.020)        0.785 (0.022)
p = 0.25    500    30         0.936 (0.012)        0.837 (0.022)        0.783 (0.023)
p = 0.25    1000   15         0.939 (0.008)        0.844 (0.012)        0.793 (0.013)
p = 0.25    1000   30         0.939 (0.008)        0.843 (0.013)        0.792 (0.014)
p = 0.75    500    15         0.935 (0.014)        0.667 (0.051)        0.623 (0.050)
p = 0.75    500    30         0.934 (0.014)        0.662 (0.053)        0.618 (0.051)
p = 0.75    1000   15         0.938 (0.009)        0.687 (0.029)        0.644 (0.028)
p = 0.75    1000   30         0.938 (0.009)        0.682 (0.033)        0.640 (0.032)

