Survival Augmented Patient Preference Incorporated Reinforcement Learning to Evaluate Tailoring Variables for Personalized Healthcare

In this paper, we consider personalized treatment decision strategies in the management of chronic diseases, such as chronic kidney disease, which typically involve sequential and adaptive treatment decision making. We investigate a two-stage treatment setting with a survival outcome that could be right censored. This can be formulated through a dynamic treatment regime (DTR) framework, where the goal is to tailor treatment to each individual based on their own medical history in order to maximize a desirable health outcome. We develop a new method, Survival Augmented Patient Preference incorporated reinforcement Q-Learning (SAPP-Q-Learning), to balance quality of life and survival restricted to the maximal follow-up time. Our method incorporates the latent patient preference into a weighted utility function that balances quality of life and survival time, within a Q-learning model framework. We further propose a corresponding m-out-of-n bootstrap procedure to make accurate statistical inferences and construct confidence intervals on the effects of tailoring variables, whose values can guide personalized treatment strategies.


Introduction
For chronic illnesses, patients often have to navigate a series of treatment decisions. It has been increasingly recognized that, due to patient heterogeneity arising from genetics, environmental exposures, and other factors, as well as the interplay among them, a good treatment plan needs to be both personalized and adaptive to a patient's changing clinical course. Dynamic treatment regimes (DTRs) are algorithmic solutions to this clinical problem, where a DTR consists of a sequence of treatment decision rules that adapt over time in response to an individual's clinical response and health outcome trajectory.
A large number of methods have been proposed for the evaluation of the optimal DTR. Some of the earlier and foundational works include the marginal structural model with inverse probability weighting (IPW) [1], G-estimation of structural nested mean models [2], Q-learning ( [3][4][5]), and A-learning [6]. More recently, along with the development of data science, machine-learning-flavored methods have also been developed for DTR estimation, including tree-based and list-based methods ( [7][8][9][10]), classification-type methods ( [11,12]), and stochastic tree search methods [13].
Despite the large number of methods that can be used to evaluate the optimal DTR, the majority rely on a single pre-specified reward endpoint of interest. Oftentimes, in practice, multiple competing priorities need to be balanced when considering a clinical decision, and these outcomes may respond to treatment in opposing directions of desirability. A classic example is toxicity vs. efficacy. A newly introduced drug that is highly efficacious might come with a larger burden of undesirable side effects. In recent years, a few proposed methods have tackled the delicate balance between multiple outcomes of a proposed treatment. Butler et al. [14] balanced treatment efficacy and toxicity using patient-derived preferences in a Q-learning approach, while [15] assigned different rewards based on survival status, wellness (a measure of toxicity), and tumor size (a measure of drug efficacy) at each stage.
The particular scenario that we would like to address in this work is the delicate balance of quality vs. quantity, a dilemma commonly encountered by patients facing severe illness. For many chronic diseases, patients often receive a first treatment and are followed up after a short time to determine if the treatment needs to be adjusted. The patient is then followed until an adverse event or a certain maximal follow-up time. Adverse events could include hospitalization, organ failure, impending surgery, or other undesired medical consequences, but for simplicity, we take mortality as our event of interest [16]. Our primary goal, then, is to estimate the optimal treatment regime that maximizes a patient-preference-weighted combination of quality of life and survival time. Secondly, we would like to provide an inference framework for more confident decision making. Although similar in flavor to the above-mentioned works, our scenario brings with it a unique set of distinct technical challenges.
The main challenge in this scenario is the presence of censored data. Because of the long tail of survival distributions, and because of other logistical reasons (patients move, dropout due to deteriorating health, etc.), it is common to not observe the outcomes of a significant fraction of the population. However, partially observed information from censored subjects can still contribute important information and give power to the analysis if analyzed correctly. An enormous body of work has been developed to estimate the optimal rule or regime in the presence of censoring. A non-exhaustive and overlapping list includes methods for a single stage ( [17][18][19]), methods for multiple stages ( [20][21][22][23]), inverse probability of censoring weighted (IPCW) adjustment methods ( [19,20,22,24]), Q-learning-based methods ( [20,22]), tree-based methods ( [17,18]), survival probability-based methods ( [23]), accelerated failure time (AFT)-based models ( [25,26]), and doubly robust methods ( [21,27,28]), just to name some of the numerous ways by which these methods differ in scope and direction.
A secondary but equally important goal of our method is to provide inference on stage-specific parameters, particularly for tailoring variables. Inference in DTR methods is challenging due to the known issue of nonregularity caused by non-smooth functions that get carried forward through backward induction [29]. When the degree of nonregularity is large (in other words, a larger fraction of the covariate space has no treatment effect), the estimator of the coefficient oscillates between two asymptotic distributions across samples, resulting in asymptotic bias and poor Wald-type confidence intervals. In the same paper, Chakraborty et al. [29] proposed a hard-threshold estimator and a soft-threshold estimator to adjust for this poor coverage. Laber et al. 2014 [30] proposed an adaptive confidence interval for first stage parameters by utilizing regular, uniformly convergent lower and upper bounds for the asymptotic distribution of interest, and then bootstrapping for the confidence set. Chakraborty et al. 2013 [31] proposed an adaptive bootstrap-based method that adjusts for the bias and coverage by adapting the bootstrap sample size.
In this work, we explore, synthesize, and adapt existing methods in the literature to create a new method for estimating the optimal treatment regime and constructing stage-specific confidence intervals that fit our scenario. We utilize IPCW to enable a complete data analysis and use the Q-learning framework for the modeling approach. We further adapt the m-out-of-n bootstrap to accommodate censoring in order to obtain covariate-specific confidence intervals for inference. We illustrate the numerical performance of our method through simulation studies.

Notation and Framework
As shown in Figure 1, we consider a two-stage setting where a patient, upon diagnosis, receives a stage 1 treatment. Shortly thereafter, at a scheduled follow-up time, the patient is assessed for stage 2 treatment. Following stage 2 treatment, the patient is at risk of an adverse event, i.e., death, denoted by time $D_i$. Although in some cases adverse events might happen early, our assumption that death does not occur in stage 1 is fairly natural in the setting of many chronic illnesses such as diabetes, hypertension, chronic kidney disease, or many autoimmune diseases, where significant clinical decline is not acutely expected. Finally, in this setting, patient information might be lost to follow-up, either due to administrative censoring (surpassing the maximal follow-up time) or due to patient factors.
Let $T_{i1}, T_{i2}$ denote the times of treatment for stages 1 and 2, respectively, for patient $i$. Let $\tau$ be the maximal administrative follow-up time. Let $S_{i1}$ and $S_{i2}$ denote the amount of time survived in each stage, i.e., $S_{i1} = T_{i2} - T_{i1}$ and $S_{i2} = D_i - T_{i2}$. We assume that, per protocol, everyone's $S_{i1}$ is the same (i.e., routine assessment following stage 1 treatment at a prespecified time interval). Let $K_j$ be the number of treatment options in the $j$th stage, $j = 1, 2$. Let $A_{ij}$ denote the treatment indicator for the $i$th patient in the $j$th stage, with $a_{ij}$ denoting the observed treatment. Let $X_{ij}$ denote patient characteristics and other historical information, such as outcomes and prior treatments $A_{i1}, \ldots, A_{i,j-1}$, available prior to treatment assignment at stage $j$. In addition, we assume that each patient has an evolving preference $h_{ij}$, which can be derived from answers $W_{ij}$ to a questionnaire at stage $j$.
At each stage, we assume that each patient $i$ has two observed outcomes: $q_{ij}$ (with range $[0, 1]$), the average quality of life during stage $j$, and $S_{ij}$, the amount of time spent in stage $j$. From $q_{ij}$ we can calculate $Qu_{ij} = q_{ij} \cdot S_{ij}$, the quality adjusted life years during stage $j$. However, since $Qu_{ij}$ depends on $S_{ij}$, it is also subject to censoring at stage 2. We assume a utility function of the form $Qu + \{1 - \Phi(h)\}\{S - Qu\}$, where $\Phi(\cdot)$ denotes the cumulative distribution function of a standard normal random variable. Intuitively, this utility function is a sliding scale between $S$ and $Qu$, and a patient's preference dictates where he/she falls on it. Let the overall outcome of interest that we would like to optimize be $R_{i1} + R_{i2}$, where $R_{ij} = Qu_{ij} + \{1 - \Phi(h_{ij})\}\{S_{ij} - Qu_{ij}\}$. This is a cumulative preference-weighted combination of quality adjusted life years and survival experienced on a given regime. If one's preference is such that $\Phi(h_{ij}) = 0$ for both stages, then $R_{i1} + R_{i2}$ is the total survival time. Similarly, if a patient has the preference $\Phi(h_{ij}) = 1$ for both stages, then $R_{i1} + R_{i2}$ is the total quality adjusted life years. In general, we denote the entire history, or the history up to stage $K$, of a given variable with an overhead bar (i.e., $\bar{W}_i$ and $\bar{W}_{iK}$, respectively).
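The sliding-scale utility above can be computed directly. Below is a minimal sketch in Python (the function names are ours, not the paper's), using the error function to evaluate the standard normal CDF:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stage_reward(q: float, s: float, h: float) -> float:
    """Preference-weighted utility R = Qu + (1 - Phi(h)) * (S - Qu),
    where Qu = q * S is the quality adjusted survival for the stage."""
    qu = q * s
    return qu + (1.0 - norm_cdf(h)) * (s - qu)
```

As a sanity check, a preference with $\Phi(h) \approx 0$ returns the survival time $S$, while $\Phi(h) \approx 1$ returns the quality adjusted survival $Qu$, matching the two extremes described above.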
Let $g_j(X_{ij}, W_{ij})$ be a function that maps from covariate and survey history to the domain of treatment assignment $A_{ij}$. At stage $j$, let $S^*_{ij}(A_{i1}, \ldots, A_{i,j-1}, a_j)$ and $Qu^*_{ij}(A_{i1}, \ldots, A_{i,j-1}, a_j)$ denote the counterfactual survival outcome and quality adjusted survival, respectively, where the patient is assumed to have taken treatment $a_j$ at stage $j$, conditional on previous treatment decisions $A_{i1}, \ldots, A_{i,j-1}$ in the sense that previous treatments are fixed at those actually given in earlier stages. Our primary goal is to find a sequence of individualized decision rules, $g(\bar{X}_i, \bar{W}_i) = (g_1(X_{i1}, W_{i1}), g_2(X_{i2}, W_{i2}))$, that optimizes the expected potential outcome of $R_{i1} + R_{i2}$.
A second but equally important objective is to conduct inference on coefficients, with particular emphasis on tailoring variables (variables that interact with treatment selection). Inference on tailoring variables is important because it can obviate the need to collect covariates whose tailoring effects show no evidence of deviating from zero. Furthermore, inference tells us when there is insufficient evidence to support one treatment over another, so that treatment decisions can instead be made using other factors important to the patient. In this work, along with stage-specific decision rules, we present a censoring-adapted method for obtaining confidence intervals for the covariates in both stages of the model. For brevity going forward, we abbreviate $g_j(X_{ij}, W_{ij})$ as $g_j$ and drop the patient index $i$ when there is no confusion.

Traditional Q-Learning
First, we introduce traditional Q-learning, a form of approximate dynamic programming originally proposed by [3]. Q-learning estimates the optimal DTR by postulating regression models for Q-functions and subsequently taking the decisions that yield the largest rewards. The Q-functions for the two stages are defined as
$$Q_2(X_2, A_2) = E[R_2 \mid X_2, A_2], \qquad Q_1(X_1, A_1) = E\Big[R_1 + \max_{a_2} Q_2(X_2, a_2) \,\Big|\, X_1, A_1\Big].$$
The optimal decision rule at stage $j$ can be expressed as $d_j(X_j) = \arg\max_{a_j} Q_j(X_j, a_j)$. Generally, we do not know the true Q-functions, and so we consider linear working models of the form $Q_j(X_j, A_j; \beta_j, \psi_j) = \beta_j^T Z_{j,0} + (\psi_j^T Z_{j,1}) A_j$, where $Z_{j,0}$ and $Z_{j,1}$ possibly contain different components of the history $X_j$.
The coefficients are estimated by backward induction. Stage 2 regression: $(\hat\beta_2, \hat\psi_2) = \arg\min_{\beta_2, \psi_2} \frac{1}{n}\sum_{i=1}^n \{R_{2i} - Q_2(X_{2i}, A_{2i}; \beta_2, \psi_2)\}^2$. Stage 1 regression: $(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n}\sum_{i=1}^n \{R_{1i} + \max_{a_2} Q_2(X_{2i}, a_2; \hat\beta_2, \hat\psi_2) - Q_1(X_{1i}, A_{1i}; \beta_1, \psi_1)\}^2$. The estimated optimal decision rules can further be written as $\hat d_j(X_j) = \arg\max_{a_j} Q_j(X_j, a_j; \hat\beta_j, \hat\psi_j) = \text{sign}(\hat\psi_j^T Z_{j,1})$ in the particular case that $A_j \in \{-1, 1\}$. We will assume binary treatment options at both stages for convenience, although this can be relaxed with further assumptions.
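The backward-induction recipe of traditional Q-learning can be sketched in a few lines of numpy. The generative model below is purely illustrative (our own coefficients, not the paper's simulation design); it only serves to show the two least-squares fits, the pseudo-outcome, and the sign rule for $A_j \in \{-1, 1\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Illustrative two-stage data with A_j in {-1, +1}
X1 = rng.normal(size=n)
A1 = rng.choice([-1.0, 1.0], size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
A2 = rng.choice([-1.0, 1.0], size=n)
R1 = 1.0 + 0.5 * X1 + (0.3 - 0.4 * X1) * A1 + rng.normal(scale=0.1, size=n)
R2 = 0.8 + 0.2 * X2 + (0.5 + 0.6 * X2) * A2 + rng.normal(scale=0.1, size=n)

def fit_q(Z0, Z1, A, y):
    """Least squares for the working model Q = beta^T Z0 + (psi^T Z1) * A."""
    design = np.hstack([Z0, Z1 * A[:, None]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[: Z0.shape[1]], coef[Z0.shape[1]:]

Z20 = np.column_stack([np.ones(n), X2])
Z21 = np.column_stack([np.ones(n), X2])
beta2, psi2 = fit_q(Z20, Z21, A2, R2)

# Backward induction: the pseudo-outcome adds the best achievable stage 2
# value; for a linear model, max over a2 in {-1, 1} is main effect + |interaction|.
pseudo = R1 + Z20 @ beta2 + np.abs(Z21 @ psi2)
Z10 = np.column_stack([np.ones(n), X1])
Z11 = np.column_stack([np.ones(n), X1])
beta1, psi1 = fit_q(Z10, Z11, A1, pseudo)

d2 = np.sign(Z21 @ psi2)   # estimated stage 2 rule
d1 = np.sign(Z11 @ psi1)   # estimated stage 1 rule
```

With this linear data-generating model, the fitted interaction coefficients recover the true tailoring effects, and the sign rules reproduce $\hat d_j(X_j) = \text{sign}(\hat\psi_j^T Z_{j,1})$.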

Censoring Adapted Q-Learning
Our stage 2 optimization objective is complicated by the fact that some $S_2$ may be unobserved due to censoring. Let $C$ denote the censoring time, which occurs after $T_2$, the time of stage 2 treatment. We assume conditional independence: $S_2 \perp C \mid A_2, X_2, W_2$.

Stage 2
Let $S^*_2(a_2)$ be the counterfactual survival outcome starting from the stage 2 treatment, conditional on the previous treatment $A_1$. Correspondingly, $S^*_2(g_2)$ is the counterfactual outcome under decision rule $g_2$, i.e., $S^*_2(g_2) = \sum_{a_2} S^*_2(a_2) I\{g_2(X_2, W_2) = a_2\}$. Similarly, we can obtain $Qu^*_2(a_2)$ through $q^*_2(a_2)$ and $S^*_2(a_2)$. Using the linear utility function defined in Section 2 and conditional on the previous treatment $A_1$, the counterfactual utility is $R^*_2(a_2) = Qu^*_2(a_2) + \{1 - \Phi(h_2)\}\{S^*_2(a_2) - Qu^*_2(a_2)\}$. Correspondingly, $R^*_2(g_2)$ is the counterfactual utility, conditional on the previous treatment $A_1$, under decision rule $g_2$. The optimal regime is $g^{opt}_2 = \arg\max_{g_2 \in \mathcal{G}_2} E[R^*_2(g_2)]$, where $\mathcal{G}_2$ is the class of all potential decision rules for stage 2.
We make the following assumptions to connect the mean of counterfactual outcomes with the observed data:

1. Consistency: the observed outcomes equal the counterfactual outcomes under the treatment actually received, i.e., $S_2 = S^*_2(A_2)$ and $q_2 = q^*_2(A_2)$;
2. No unmeasured confounding: treatment $A_2$ is randomly assigned with probability possibly dependent on $(X_2, W_2)$, i.e., $\{S^*_2(a_2), q^*_2(a_2)\} \perp A_2 \mid X_2, W_2$;
3. Positivity: there exist constants $0 < c_0 < c_1 < 1$ such that, with probability 1, the propensity score $\pi_{a_2}(X_2, W_2) = \Pr(A_2 = a_2 \mid X_2, W_2)$ lies in $[c_0, c_1]$;
4. Latent variable independence: the latent preference is conditionally independent of the potential outcomes given the observed history.

The first three assumptions are standard in causal inference. The last assumption facilitates separate modeling of outcomes and preferences and can be weakened at the expense of more complicated models [14]. We denote the marginal expectation with respect to $X_t, W_t$ by $E_{X_t, W_t}$, abbreviated as $E_t$ throughout. Furthermore, we denote $\mu^S_{2,a_2}(X_2, W_2) \equiv E[S_2 \mid A_2 = a_2, X_2, W_2]$ and $\mu^q_{2,a_2}(X_2, W_2) \equiv E[q_2 \mid A_2 = a_2, X_2, W_2]$, and assume that $S^*_2(a_2)$ and $q^*_2(a_2)$ are conditionally independent given $X_2, W_2$. As in traditional Q-learning, we assume linear working models for each of our outcomes of interest (i.e., $q_2$ and $S_2$ are generated through underlying models of predictive and tailoring variables of the form $\beta_2^T Z_{20} + (\psi_2^T Z_{21}) A_2$, where $Z_{20}$ and $Z_{21}$ are possibly different components of $X_2$ and $W_2$).
Using the causal assumptions above, we bridge the observed data to the counterfactual outcome means:
$$E[R^*_2(a_2)] = E_2\Big[\mu^q_{2,a_2}\mu^S_{2,a_2} + \{1 - \Phi(h_2)\}\big\{\mu^S_{2,a_2} - \mu^q_{2,a_2}\mu^S_{2,a_2}\big\}\Big],$$
where the separate modeling of preference and outcomes is allowed by the fourth assumption. With censoring, it is unlikely that all $S_2$ are observed. We propose the following estimator, which re-weights the observed complete data using inverse probability of censoring weights (IPCW):
$$(\hat\beta_2, \hat\psi_2) = \arg\min_{\beta_2, \psi_2} \frac{1}{n}\sum_{i=1}^n \frac{\Delta_i}{\widehat{\Pr}\{\Delta_i = 1 \mid X_{2i}, W_{2i}, A_{2i}\}} \big\{R_{2i} - Q_2(X_{2i}, W_{2i}, A_{2i}; \beta_2, \psi_2)\big\}^2,$$
where $Q_2(X_2, W_2, A_2; \beta_2, \psi_2)$ denotes the model estimate for $R_2$ using observed data and covariates, $\Delta = I(S_2 < C)$ is the indicator that the patient's data are not censored, and $\widehat{\Pr}\{\Delta = 1 \mid X_2, W_2, A_2\}$ is a working estimator of the probability that the individual has not been censored by their event time. Our Q-function here is $Q_2(X_2, W_2, A_2; \beta_2, \psi_2) = \beta_2^T Z_{20} + (\psi_2^T Z_{21}) A_2$, where $Z_{21}$ represents the tailoring variables in the stage 2 model.
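The IPCW re-weighting amounts to a weighted least squares in which censored subjects receive weight zero and observed subjects are up-weighted. The sketch below uses our own illustrative generative model, and, for simplicity, plugs in a known censoring survivor function where the paper would use a working estimator (e.g., Kaplan-Meier for the censoring distribution):

```python
import numpy as np

def ipcw_wls(Z0, Z1, A, y, delta, p_uncens):
    """IPCW-weighted least squares for Q2 = beta^T Z0 + (psi^T Z1) * A.
    delta == 0 (censored) gives weight 0; uncensored subjects get weight
    1 / Pr(uncensored by their event time)."""
    w = delta / p_uncens
    design = np.hstack([Z0, Z1 * A[:, None]])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(design * sw, y[:, None] * sw, rcond=None)
    coef = coef.ravel()
    return coef[: Z0.shape[1]], coef[Z0.shape[1]:]

# Illustrative stage 2 data (our model, not the paper's design)
rng = np.random.default_rng(1)
n = 8000
X2 = rng.normal(size=n)
A2 = rng.choice([-1.0, 1.0], size=n)
S2 = np.exp(0.5 + 0.2 * X2 + 0.3 * A2 + rng.normal(scale=0.2, size=n))
C = rng.exponential(scale=6.0, size=n)          # independent censoring
delta = (S2 <= C).astype(float)
y = np.where(delta == 1, S2, 0.0)               # censored outcomes never enter

# Known censoring survivor function Pr(C > s) = exp(-s/6) for illustration;
# in practice this is replaced by an estimate.
p_uncens = np.where(delta == 1, np.exp(-S2 / 6.0), 1.0)

Z = np.column_stack([np.ones(n), X2])
beta2, psi2 = ipcw_wls(Z, Z, A2, y, delta, p_uncens)

# Benchmark: the (normally unavailable) full-data least squares fit
full_coef, *_ = np.linalg.lstsq(np.hstack([Z, Z * A2[:, None]]), S2, rcond=None)
```

Because $E[\Delta \mid S_2, X_2, A_2]$ equals the censoring survivor function evaluated at the event time, the weighted fit targets the same least-squares projection as the complete, uncensored data would.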

Stage 1
For stages prior to the last, we use backward induction; the optimal $g^{opt}_1(X_1, W_1)$ at stage 1 can then be derived similarly. Assuming that stage-specific rewards have been maximized after stage 1, we define the following cumulative stage 1 reward:
$$R^*(a_1) = Qu^*_1(a_1) + \{1 - \Phi(h_1)\}\{S^*_1(a_1) - Qu^*_1(a_1)\} + Qu^*_2(a_1) + \{1 - \Phi(h_2)\}\{S^*_2(a_1) - Qu^*_2(a_1)\},$$
where $\Phi(\cdot)$ is as defined previously and $S^*_2(a_1) = S^*_2(a_1, g^{opt}_2)$ denotes the counterfactual outcome given future optimized treatments and taking treatment $a_1$ at stage 1 (and similarly for $q^*_1(a_1)$ and $q^*_2(a_1)$). Using this reward, we define the optimal regime at stage 1 as $g^{opt}_1 = \arg\max_{g_1 \in \mathcal{G}_1} E[R^*(g_1)]$, where $\mathcal{G}_1$ is the class of all potential regimes at stage 1.
We make the following assumptions. Note that in our set-up, $S_1$ is predetermined and fixed for every patient, but this does not apply to $q_1$.

1. Consistency: the observed stage 1 outcomes equal their counterfactual counterparts under the treatment actually received;
2. No unmeasured confounding: $\{S^*_2(a_1), q^*_1(a_1), q^*_2(a_1)\} \perp A_1 \mid X_1, W_1$;
3. Positivity: $\pi_{a_1}(X_1, W_1) = \Pr(A_1 = a_1 \mid X_1, W_1)$ is bounded away from zero and one;
4. Latent variable independence: the stage 1 latent preference is conditionally independent of the potential survival outcomes given the observed history, and similarly for the equivalents of $q$.

Similarly to stage 2, we further assume that $q^*_1(a_1) \perp S^*_1(a_1) \mid X_1, W_1$ and $q^*_2(a_1) \perp S^*_2(a_1) \mid X_1, W_1$. Notice again that the right-hand side (RHS) of the resulting identity (Equation (2)) can be completely estimated from observed data. Under these assumptions, the optimization problem at stage 1, among all potential regimes $\mathcal{G}_1$, can be written as $g^{opt}_1 = \arg\max_{g_1 \in \mathcal{G}_1} \text{RHS of Equation (2)}$.
We maximize the stage 1 outcome through a pseudo-outcome defined as
$$\tilde{R}_1 = R_1 + \max_{a_2} Q_2(X_2, W_2, a_2; \hat\beta_2, \hat\psi_2).$$
Our proposed estimator for $(\hat\beta_1, \hat\psi_1)$ is
$$(\hat\beta_1, \hat\psi_1) = \arg\min_{\beta_1, \psi_1} \frac{1}{n}\sum_{i=1}^n \big\{\tilde{R}_{1i} - Q_1(X_{1i}, W_{1i}, A_{1i}; \beta_1, \psi_1)\big\}^2,$$
where $Q_1(X_1, W_1, A_1; \beta_1, \psi_1) = \beta_1^T Z_{10} + (\psi_1^T Z_{11}) A_1$. The first stage estimated optimal rule is given by $\hat g_1(X_1, W_1) = \text{sign}(\hat\psi_1^T Z_{11})$, where $Z_{11}$ represents the tailoring variables in the stage 1 model.
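The pseudo-outcome construction can be isolated in a small helper. This is a sketch under our own naming conventions; it assumes a fitted linear stage 2 model with binary treatment coded in $\{-1, +1\}$, for which the maximum over $a_2$ is the main effect plus the absolute value of the interaction:

```python
import numpy as np

def stage1_pseudo_outcome(R1, Z20, Z21, beta2_hat, psi2_hat):
    """Pseudo-outcome R1 + max over a2 in {-1, +1} of the fitted Q2.
    For Q2 = beta^T Z20 + (psi^T Z21) * a2, that maximum equals
    beta^T Z20 + |psi^T Z21|."""
    return R1 + Z20 @ beta2_hat + np.abs(Z21 @ psi2_hat)
```

For example, with $R_1 = 1$, a fitted main effect of 2, and a fitted interaction of $-3$, the pseudo-outcome is $1 + 2 + |-3| = 6$.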

Inference
In addition to obtaining the optimal stage-specific decision rules, our second goal is to draw statistical inference on the effect of each stage's covariates on the decision, with particular emphasis on tailoring variables. To that end, we propose a censoring-adjusted version of the m-out-of-n method presented by [31]. In stage 1 optimization, the pseudo-outcome involves the nonsmooth maximum $\max_{a_2} Q_2(X_2, W_2, a_2; \hat\beta_2, \hat\psi_2)$, which is a non-differentiable function of $\hat\psi_2$. In particular, if $P[Z_{21} : \psi_2^T Z_{21} = 0] = 0$, then the first stage coefficient estimators converge to a normal distribution. However, if $P[Z_{21} : \psi_2^T Z_{21} = 0] > 0$, the estimator $\hat\psi_1$ oscillates between two asymptotic distributions across samples, which reflects the typical challenging problem of nonregularity in the DTR literature [29]. Hence, direct estimation results in an asymptotically biased estimator and poor performance of the usual Wald-type confidence intervals. Even bootstrap-based approaches suffer from the underlying nonsmoothness. The m-out-of-n bootstrap was developed to address bootstrap inconsistency due to nonsmoothness [32]. Although conceptually very similar to the original bootstrap, the resample size $m$ (which depends on $n$, tends to infinity with $n$, and satisfies $m = o(n)$) is selected to be of smaller order than $n$. Chakraborty et al. 2013 [31] showed through simulation studies that the m-out-of-n approach obtained desirable coverage probabilities for first stage tailoring variables in the two-stage DTR problem. Because censoring reduces the size of the observed stage 2 data in our scenario, we further modified the m-out-of-n algorithm to accommodate censoring. Our algorithm works as follows.
We adopted the functional form of $m$ presented in [31]:
$$m = n^{\frac{1 + \xi(1 - \hat p)}{1 + \xi}}, \qquad (3)$$
where $\xi > 0$ is a tuning parameter and $\hat p$ estimates the degree of nonregularity. Let $n$ be the total number of subjects in the dataset (including those who were censored). For stage 2, we create a bootstrap sample of size $n$ and fit a regression model using the complete data weighted by IPC weights within the bootstrapped sample to obtain stage 2 coefficient estimates. Stage 2 95% confidence intervals are obtained from $\hat l_2$ and $\hat u_2$, the $(\alpha/2) \times 100$ and $(1 - \alpha/2) \times 100$ percentiles of $\sqrt{n}(\hat\theta^{(b)}_{2,n} - \hat\theta_{2,n})$, where $\hat\theta^{(b)}_{2,n}$ is the bootstrap estimate of the stage 2 coefficients with bootstrap-specific re-estimated censoring weights, and $\hat\theta_{2,n}$ is the plug-in estimator obtained using weighted regression on the empirical dataset. The confidence interval is given by $(\hat\theta_{2,n} - \hat u_2/\sqrt{n}, \hat\theta_{2,n} - \hat l_2/\sqrt{n})$. For stage 1, we first generate bootstrap samples of size $m$, calculated using Equation (3) after computing a sample-specific $\hat p$. We further use each bootstrap sample to re-estimate the IPC weights and fit a weighted linear model to obtain a bootstrap-specific stage 2 estimate. The stage 2 coefficients from each bootstrap sample are then used to calculate pseudo-outcomes, which are in turn used to fit a stage 1 model to obtain $\hat\theta^{(b)}_{1,m}$. Let $\hat l_1$ and $\hat u_1$ be the corresponding percentiles of $\sqrt{m}(\hat\theta^{(b)}_{1,m} - \hat\theta_{1,n})$, where $\hat\theta_{1,n}$ is the plug-in estimator obtained using the complete empirical dataset, while $\hat\theta^{(b)}_{1,m}$ is the estimate obtained from each bootstrap sample of size $m$. The confidence set is given by $(\hat\theta_{1,n} - \hat u_1/\sqrt{n}, \hat\theta_{1,n} - \hat l_1/\sqrt{n})$. We selected $\xi = 0.01$, which provided stable coverage in simulations with complete data. The calculation of $\hat p = \mathbb{P}_n I\{n(Z_{21}^T \hat\psi_{2,n})^2 \le \tau_n(Z_{21})\}$ relies on a choice of $\tau_n(Z_{21})$. We opted for the plug-in estimator $\tau_n(Z_{21}) = (Z_{21}^T \hat\Sigma_{21} Z_{21}) \cdot \chi^2_{1,1-\nu}$, just as in [31], where $\hat\Sigma_{21}$ is the plug-in sandwich estimator of $n\,\mathrm{Cov}(\hat\psi_{2,n}, \hat\psi_{2,n})$, and $\nu = 0.01$.
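The resample-size formula and the percentile-type interval are mechanical and can be sketched directly (function names are ours):

```python
import numpy as np

def resample_size(n: int, p_hat: float, xi: float = 0.01) -> int:
    """Equation (3): m = n^{(1 + xi*(1 - p_hat)) / (1 + xi)}.
    m equals n when p_hat = 0 (regular case) and shrinks as the
    estimated degree of nonregularity p_hat grows."""
    return int(round(n ** ((1.0 + xi * (1.0 - p_hat)) / (1.0 + xi))))

def m_out_of_n_ci(theta_hat_n, boot_estimates, n, m, alpha=0.05):
    """Interval (theta_hat - u/sqrt(n), theta_hat - l/sqrt(n)), where l, u
    are the alpha/2 and 1 - alpha/2 percentiles of
    sqrt(m) * (theta*_m - theta_hat_n) across bootstrap estimates."""
    root = np.sqrt(m) * (np.asarray(boot_estimates) - theta_hat_n)
    l, u = np.percentile(root, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return theta_hat_n - u / np.sqrt(n), theta_hat_n - l / np.sqrt(n)
```

Note that with $\xi = 0.01$, $m$ stays close to $n$ unless $\hat p$ is substantial, which matches the intuition that the correction should only bite in nonregular settings.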
In practice, however, rather than fixing the two hyperparameters $\xi$ and $\nu$, we can tune the bootstrap resample size $m$ directly and automatically using a double bootstrap. Such a double bootstrap algorithm for choosing $m$ is data-driven. Suppose we are interested in $c^T \theta$, e.g., a stage 1 variable effect, and its estimate from the original data is $c^T \hat\theta_n$. Consider a grid of candidate values for $m$ and proceed as follows:
1. Draw $B_1$ n-out-of-n first-stage bootstrap samples from the data and calculate the bootstrap estimates $c^T \hat\theta^{(b_1)}_n$, $b_1 = 1, 2, \ldots, B_1$. Fix $m$ at the smallest value in the grid.
2. From each first-stage bootstrap sample, construct a double-bootstrap confidence interval for $c^T \theta$ using m-out-of-n resampling, and estimate the proportion of these intervals that cover $c^T \hat\theta_n$, which is treated as the truth.
3. If the estimated coverage does not differ significantly from $1 - \alpha$, pick the current value of $m$ as the final value. Otherwise, increase $m$ to the next value in the grid and return to step 2.
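A minimal sketch of this data-driven selection is given below. It uses a generic scalar `estimator` as a stand-in for $c^T \hat\theta$ and a simple coverage threshold in place of the formal test against $1 - \alpha$; all names and the simplifications are ours:

```python
import numpy as np

def choose_m_double_bootstrap(data, estimator, grid, b1=50, b2=50,
                              alpha=0.05, seed=0):
    """Data-driven choice of m: treat the full-sample estimate as the
    truth and pick the smallest m in the grid whose double-bootstrap
    intervals reach nominal coverage."""
    rng = np.random.default_rng(seed)
    n = len(data)
    target = estimator(data)
    for m in grid:
        covered = 0
        for _ in range(b1):
            boot = rng.choice(data, size=n, replace=True)   # n-out-of-n
            th_b = estimator(boot)
            # inner m-out-of-n bootstrap drawn from the outer sample
            roots = np.array([
                np.sqrt(m) * (estimator(rng.choice(boot, size=m, replace=True)) - th_b)
                for _ in range(b2)
            ])
            l, u = np.percentile(roots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
            lo, hi = th_b - u / np.sqrt(n), th_b - l / np.sqrt(n)
            covered += (lo <= target <= hi)
        if covered / b1 >= 1 - alpha:   # simple rule; [31] uses a formal z-test
            return m
    return grid[-1]
```

In the full procedure, `estimator` would be the entire censoring-weighted two-stage fit returning the stage 1 coefficient of interest, which is far more expensive but structurally identical.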

Survival Augmented Patient Preference Weights (SAPP-Weights)
Going forward, we assume that information in the latest stage survey overrides information from previous stages as well as other covariate information (i.e., $W_j$ overrides $W_{j-1}$ and $X_j$). To model survey information as a function of latent preference, we assume a latent traits model [33]. We further assume that the latent preferences are related to survey responses through a modified Rasch model [34,35].
For our scenario, we assume that there are numQ questions on a survey, each soliciting a binary answer from the patient. For each binary response, we assume that the underlying generating mechanism is of the form $\text{logit}\{P(W_{jk} = 1 \mid H_j = h_j)\} = \alpha_{0,k} + \alpha_{1,k} h_j$, where $j$ indexes the stage and $k$ the question number.
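Simulating responses from this item-response mechanism is straightforward; the sketch below (our own function name and shapes) draws one binary matrix of answers given latent preferences and item parameters:

```python
import numpy as np

def simulate_responses(h, alpha0, alpha1, rng):
    """Binary survey answers with logit P(W_jk = 1 | h) = alpha0[k] + alpha1[k]*h.
    h: (n,) latent preferences; alpha0, alpha1: (K,) item parameters.
    Returns an (n, K) 0/1 matrix."""
    logits = alpha0[None, :] + alpha1[None, :] * h[:, None]
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(probs.shape) < probs).astype(int)
```

Patients with strongly positive (negative) latent preference answer almost all items 1 (0) when the discrimination parameters $\alpha_{1,k}$ are positive, which is what makes the responses informative about $h_j$.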
Algorithm 1 outlines the procedure for estimating the patient preference $\hat h_j$. The survival augmented patient preference weights (SAPP-weights) are then the transformed version of $\hat h_j$, which we denote as $\Phi(\hat h_{ij})$. Essentially, we use the Expectation-Maximization algorithm [36] to iterate between estimates of $\alpha$, the questionnaire coefficients, and $h_j$, the individual patient preferences at stage $j$. We use Gauss-Hermite quadrature to numerically approximate the likelihood integral, and estimate $P(h_j \mid W_j) \propto P(W_j \mid h_j) P(h_j)$ using the Metropolis-Hastings (MH) algorithm.

Algorithm 1. Result: obtain $p(h_j \mid w_j)$ for each patient.
Set an initial value of $h_j$ for all subjects to compute an initial guess of $\alpha_0, \alpha_1$;
while not converged do
  Using MH, obtain an updated estimate of $P(h_j \mid w_j)$;
  Approximate the likelihood integral using Gauss-Hermite quadrature with $k$ abscissae $h_t$ and weights $p(h_t)$;
  Solve the likelihood equations using Newton-Raphson to obtain updated estimates of $\alpha_0, \alpha_1$;
end
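The Metropolis-Hastings step of Algorithm 1 can be sketched for a single patient with the item parameters held fixed (a simplification of the full EM loop; function names and the N(0, 0.5) prior are taken from our simulation setting):

```python
import numpy as np

def log_post(h, w, a0, a1, prior_sd=0.5):
    """log p(h | w) up to a constant: Bernoulli-logit likelihood over the
    items plus a N(0, prior_sd^2) prior on the latent preference."""
    logits = a0 + a1 * h
    loglik = np.sum(w * logits - np.log1p(np.exp(logits)))
    return loglik - 0.5 * (h / prior_sd) ** 2

def mh_sample(w, a0, a1, n_iter=4000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings draws from p(h | w)."""
    rng = np.random.default_rng(seed)
    h, lp = 0.0, log_post(0.0, w, a0, a1)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = h + step * rng.normal()
        lp_prop = log_post(prop, w, a0, a1)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject
            h, lp = prop, lp_prop
        draws[t] = h
    return draws[n_iter // 2:]   # drop burn-in
```

In the full algorithm, these posterior draws feed the quadrature-based M-step that re-estimates $\alpha_0, \alpha_1$; the posterior mean of the retained draws serves as $\hat h_j$.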

Numerical Results
We conducted simulation studies to investigate the performance of our proposed method. We looked at two scenarios differing in the degree of nonregularity $p$ (the estimated probability that stage 2 treatment does not provide a significant difference). The first scenario is an example of low nonregularity, where about 25% of people obtain similar results with both treatments. Scenario 2 is an example of higher nonregularity, where approximately 75% of patients could obtain similar results with either of the stage 2 treatments. The true value of $p$ was estimated through simulated complete data (assuming no censoring) and true preferences.
For both scenarios, we generated baseline covariates $X_1, X_2 \sim N(0, 1)$, censoring time $C$, quality of life outcomes $q_1, q_2$, and survival times $S_1, S_2$. Preferences for both stages were generated from a $N(0, \text{SD} = 0.5)$ distribution, and 10 binary preference-derived questionnaire responses $W_1, W_2$ were generated according to our latent model with the coefficients shown in Table 1. The two scenarios differ in their stage 2 parameters but share common stage 1 settings. Stage 1 treatment $A_1$ was randomly assigned with probability 0.5. The stage 1 quality of life outcome $q_1 \in [0, 1]$ was generated from $N(\alpha_0 + \alpha_1 X_1 + \alpha_2 A_1 + \alpha_3 X_1 A_1, \sigma^2)$, where $\alpha = [0.55, 0.03, 0.06, -0.09]$ and $\sigma = 0.03$. As mentioned previously, $S_1$ is the same for everyone and corresponds to a routine follow-up time of 30 days. The stage 1 outcome is a weighted combination of $q_1$ and $S_1$: $R_1 = q_1 S_1 + \{1 - \Phi(h_1)\}(S_1 - q_1 S_1)$, where $\Phi(\cdot)$ represents the cumulative distribution function of a standard normal. $R_1$ can be interpreted as a quality-of-life-weighted survival during the initial follow-up period. The treatment assignment $A_2$ depends on $R_1$, such that patients with higher $R_1$ are more likely to remain on the same treatment as $A_1$.
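The stage 1 data-generating mechanism described above can be sketched as follows. The coefficients are those reported in the text; clipping $q_1$ to $[0, 1]$ is our assumption, and the exact probability model for $A_2$ given $R_1$ is omitted since it is not reproduced here:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n = 1000

X1 = rng.normal(size=n)
A1 = rng.choice([-1.0, 1.0], size=n)          # randomized, probability 0.5
h1 = rng.normal(scale=0.5, size=n)            # latent stage 1 preference

# Stage 1 quality of life with the reported coefficients
a = [0.55, 0.03, 0.06, -0.09]
q1 = rng.normal(a[0] + a[1] * X1 + a[2] * A1 + a[3] * X1 * A1, 0.03)
q1 = np.clip(q1, 0.0, 1.0)                    # quality of life lives in [0, 1]

S1 = np.full(n, 30.0)                         # fixed 30-day follow-up
Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))  # standard normal CDF

# Preference-weighted stage 1 reward
R1 = q1 * S1 + (1.0 - Phi(h1)) * (S1 - q1 * S1)
```

By construction, $R_1$ lies between the quality adjusted survival $q_1 S_1$ and the fixed follow-up time of 30 days.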
Using these simulated datasets, we estimated patient preferences, the optimal dynamic treatment regime, and the confidence intervals of estimated DTR coefficients from these responses and outcomes. We evaluated each scenario through measures of bias, coverage probability, optimal mean response, and the percent of subjects correctly classified to their true optimal treatment %opt.
Unlike stage 1, stage 2 survival times could be subject to censoring at time $C$. We generated censoring times from an exponential model, $C \sim \text{Exp}(\lambda)$ with $\log \lambda = \log \lambda_{C0} + X_2 \beta_C$, where $\beta_C = 0.01$, and $\lambda_{C0} = 0.00058$ for 15% censoring and $\lambda_{C0} = 0.0013$ for 30% censoring. $\tau$ was set to one year after the initiation of stage 2 treatment.
Because $S_2$ and $q_2$ are functions of $X_2$, $A_2$, and $R_1$, the stage 2 reward can be rearranged into a linear model in these covariates. Stage 1 coefficients, on the other hand, are obtained by regressing the pseudo-outcome on $X_1$, $A_1$, and $X_1 A_1$, so the stage 1 regression has four covariate terms (including the intercept). We obtained the true stage 1 and stage 2 coefficients by performing Monte Carlo regressions on a sample of size 10 million. Tables 2 and 3 summarize the simulation results of the two stages for scenario 1, including bias estimates, the empirical standard deviations (ESD), the mean bootstrap standard deviations, the mean widths of the confidence intervals, and the coverage probabilities of our proposed method. The coverage probabilities ranged from 86% to 96%, with the majority between 92% and 96%. Furthermore, we observed general agreement between the empirical SD and the average bootstrap SD. In terms of trends, we saw slightly larger ESD, mean bootstrap SD, and mean width when censoring was increased from 15% to 30%, while a reduction in all three was observed when we increased $n$ from 500 to 1000. Using covariate $A_2$ as an example, the ESD for $n = 500$ and 15% censoring was 11.72, which increased to 13.03 when censoring reached 30% but decreased to 8.38 when $n$ was increased to 1000. Similar patterns were observed for the average bootstrap SD and the mean width of the confidence interval. We further investigated the distributions of the observed total reward, as well as the predicted optimal reward, for one randomly selected simulation, illustrated in Figure 2. Aggregating means across the four sub-scenarios (based on sample size and censoring), the average observed reward was 233.77, while the average predicted optimal reward was 254.19, indicating an expected increase of 20.42 in reward if everyone followed the regime to receive the treatments recommended by our algorithm.
It is also evident in the figure that the variability of the observed rewards is significantly larger than the variability of the predicted optimal reward. The aggregated SD of all observed rewards in this scenario is 39.2, while the aggregated SD of all predicted rewards is 5.66.
In a randomly generated dataset, treatment 1 gave 74.8% of patients a better stage 1 reward, while it gave 41.6% of patients a better stage 2 reward. $S_2$ ranged from 235 to 368 days.
Stage 2 survival times are similarly subject to censoring at time $C$. Censoring was generated from the same exponential model, with $\beta_C = 0.01$, and $\lambda_{C0} = 0.00048$ for 15% censoring and $\lambda_{C0} = 0.0011$ for 30% censoring. As before, $\tau$ was set to one year after the initiation of stage 2 treatment.
Tables 4 and 5 present our simulation results for scenario 2. As before, the coverage probabilities range from 86% to 97%, with most around 95%, and the mean width, ESD, and mean bootstrap SD decrease with decreasing censoring and increasing sample size. Again using $A_2$ as an example, the ESD of its coefficient across simulations was 12.38 when $n = 500$ with censoring at 15%, which increased to 13.85 when censoring went to 30% and decreased to 8.84 when $n$ went to 1000. As in Section 4.1, we investigated the distribution of the observed total reward and the predicted optimal reward of one random simulation in Figure 3. Aggregating means across the four sub-scenarios (based on sample size and censoring), the average observed reward was 263.62, while the average predicted reward was 275.35, indicating an increase of 11.73 in reward if everyone followed the regime recommended by our algorithm. The aggregated SD of observed rewards was 37.77, while the aggregated SD of all predicted rewards was 4.39.

Optimality
Besides inference performance, we also examined how often our algorithm chose the correct treatment for each patient at each stage, across the settings we have visited ($n = 500, 1000$ across two levels of censoring at 15% and 30%). Table 6 shows the simulation results. In this table we also include the average stage 1 bootstrap resample sizes for each sub-scenario. As expected from Equation (3), a larger $p$ indicates greater nonregularity and decreases $m$. We can see that for scenario 1 ($p = 0.25$), the algorithm chose the optimal treatment for stage 1 over 93% of the time, and over 83% of the time for stage 2. Our algorithm was able to assign the correct treatment regime to a patient over 78% of the time across all sample sizes and censoring levels. This is significantly higher than the random guess approach, which would have landed us around 25%. By contrast, scenario 2's performance was slightly weaker, coming in at over 93% for stage 1, around 66-69% for stage 2, and with an overall correct regime assignment percentage of 61-65%; however, this is still much better than the 25% of a random guess. In this scenario, we also observed a general increase in SD compared to scenario 1, indicating higher uncertainty in the decision making process.

Discussion
In this paper, we proposed a method that estimates the optimal decision rules for a two-stage treatment scenario subject to censoring. We balance quality of life against quantity of life using a sliding-scale utility adjusted by patient preference. Through simulation studies, we have shown that our proposed method is capable of choosing the optimal treatment dynamically a majority of the time, as well as providing reliable confidence intervals for each of the coefficients in question.
The simulation results of scenarios 1 and 2 are notable in the following ways. Most importantly, we can see that the coverage probabilities mostly hover around 95%, showing that our confidence interval has the combination of adequate width and minimal bias required for good coverage. The general congruence between the empirical SD and the mean bootstrap SD further supports that our bootstrap procedure captures variability at the appropriate scale. Generally, there is a slight increase in ESD and mean bootstrap SD when increasing censoring from 15% to 30%, indicating decreased certainty, but the difference is slight (e.g., for parameter $A_2$ in scenario 1, the ESD increased from 11.72 to 13.03). The decrease in ESD is more pronounced between $n = 500$ and $n = 1000$: for scenario 1 with 15% censoring, the ESD of the estimated coefficient for $A_2$ decreased from 11.72 to 8.38. In the boxplots in Figures 2 and 3, we can see much larger variability in the observed rewards than in the predicted rewards. This is as expected for two reasons. First, observed rewards contain an error component that is not present in expected (predicted) rewards. Second, observed rewards include individuals who have by chance obtained their optimal reward, as well as those who have not; variability is naturally reduced when individuals are assigned to their predicted optimal regimes. As expected, the distribution of the predicted optimal rewards maps onto the upper part of the observed rewards. The same general patterns were observed for scenario 2.
One main challenge in this work, as in [31], is the selection of m, which largely determines the coverage probability and confidence interval width. In our simulations, we selected the parameters ν and ξ using background knowledge and reference simulation results, and then used Equation (3) to select m. This approach, recommended by [31], is straightforward and easy to implement, but it risks inappropriate values of m if either ν or ξ is chosen inadequately. Hence, in practice we recommend the double bootstrap procedure: we treat the estimator from the empirical dataset as the truth and build confidence intervals using nested bootstrap samples of size m drawn from empirical bootstrap samples of size n. Relatedly, another potential way to further improve coverage across all covariates could be to select a distinct value of m for each covariate; this may be of interest for future research. Alternatively, one may consider coupling our method with other methods that address the nonregularity issue in dynamic treatment regimes, notably [37]. Two main challenges exist for this proposed integration: the adaptability to censored data and the derivation of asymptotic results in censored settings, and the specific process of choosing a tuning parameter λ for our proposed data structure. The integration would be an interesting and nontrivial problem for future research.
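The double bootstrap calibration of m described above can be sketched in generic form. This is a minimal illustration, not our implementation: the statistic here is an arbitrary user-supplied estimator (our actual estimator is the Q-learning coefficient fit, which we do not reproduce here), the function names are hypothetical, and the candidate grid `m_grid` is an assumption the analyst must supply.

```python
import numpy as np

def m_out_of_n_ci(data, statistic, m, n_boot=500, alpha=0.05, rng=None):
    """Percentile CI from n_boot resamples of size m (m <= n)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    stats = np.array([statistic(data[rng.integers(0, n, size=m)])
                      for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def choose_m_double_bootstrap(data, statistic, m_grid, n_outer=100,
                              n_inner=200, alpha=0.05, rng=None):
    """Pick the m whose nested-bootstrap coverage of the full-sample
    estimate (treated as the 'truth') is closest to 1 - alpha."""
    rng = np.random.default_rng(rng)
    n = len(data)
    theta_hat = statistic(data)          # empirical estimate as the "truth"
    best_m, best_gap = None, np.inf
    for m in m_grid:
        covered = 0
        for _ in range(n_outer):
            # first-level bootstrap sample of size n from the data
            outer = data[rng.integers(0, n, size=n)]
            # nested size-m bootstrap CI from that sample
            lo, hi = m_out_of_n_ci(outer, statistic, m,
                                   n_boot=n_inner, alpha=alpha, rng=rng)
            covered += lo <= theta_hat <= hi
        gap = abs(covered / n_outer - (1 - alpha))
        if gap < best_gap:
            best_m, best_gap = m, gap
    return best_m
```

In practice, `statistic` would be the coefficient estimator for a single tailoring variable, and the selected m would then be used to form the reported m-out-of-n bootstrap confidence interval on the original data.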
The authors are aware of two works in the literature that balance between two outcomes and would like to highlight certain differences. Zhao et al. 2009 [15] examined the dosage effect of a cancer drug on tumor size and toxicity. In their work, each reward function is separated into three parts: survival status, wellness, and tumor size effects. In their simulation studies, the reward was −60 if the patient died, 15 if the patient's tumor shrank to zero, and 5 or −5 if tumor size/wellness improved or deteriorated, respectively. This method of assigning rewards is particular to their example, and there is no clear indication of how to perform inference within it. The comparison between our method and [14] is more direct. Their work used patient preference estimation to weigh between toxicity and efficacy, both continuous outcomes. There are definite similarities between our work and [14], especially the preference estimation and the common Q-learning framework. Most importantly, however, Butler et al. 2018 [14] did not accommodate censored data, nor did they provide a framework for inference; both are important contributions of this work.
This work can be improved in several directions for increased generalizability and impact. First, the vast number of methods for survival data indicates the complexity of generalizing to the various time-to-event scenarios. One main direction for extension would be to accommodate multiple stages, with censoring and event times that could occur at any stage, similar to the setup of [20]. Another direction that might benefit real applications is allowing the set of available treatments to change depending on patient outcomes, as in [21]. As pointed out by a reviewer, there may be other ways of estimating and incorporating patient preference. Before settling on our proposed approach, we had considered hierarchical models, which condition directly on latent variables; however, that approach would lose causal interpretation in multi-stage settings. Even so, exploring other models for calculating and estimating preference variables, e.g., generalizing from the logit to other exponential families within the latent variable framework as in [33], would still be valuable. Finally, generalizing binary treatment options to a continuous version (i.e., dosage), or increasing the number of outcomes being balanced, are also meaningful directions for future extension.