Personalized Risk-Based Screening Design for Comparative Two-Arm Group Sequential Clinical Trials

Personalized medicine has emerged to account for individual variability in genes and environment. In this era, it is critical to incorporate patients' characteristics and improve the clinical benefit for patients. Patients' characteristics are incorporated in adaptive randomization to identify patients who are expected to benefit more from a treatment and to optimize the treatment allocation. However, it is challenging to control the potential selection bias that arises from using observed efficacy data and the effect of prognostic covariates in adaptive randomization. This paper proposes a personalized risk-based screening design using Bayesian covariate-adjusted response-adaptive randomization that compares an experimental screening method to a standard screening method based on indicators of having a disease. A personalized risk-based allocation probability is built for adaptive randomization, and Bayesian adaptive decision rules are calibrated to preserve error rates. A simulation study shows that the proposed design controls error rates and yields a much smaller number of failures and a larger number of patients allocated to the better intervention compared to existing randomized controlled trial designs. Therefore, the proposed design performs well for randomized controlled clinical trials under personalized medicine.


Introduction
Personalized medicine is a new paradigm motivated by the observation that patients' response to a particular treatment is heterogeneous, possibly due to biological covariates. Only a subset of patients is sensitive to, and benefits from, the treatment. Thus, a traditional one-size-fits-all remedy may not be the best option for some patients, even though the standard of care for a disease generally has a well-established track record. In the era of personalized medicine, molecularly targeted agents have been developed for disease treatment and prevention, e.g., trastuzumab [1,2], crizotinib [3,4], and erlotinib [5,6]. Novel statistical methods and clinical trial designs have been proposed for such targeted therapies. Park [7] reviewed statistical methods evaluating the effect of a targeted therapy for a certain genetic mutation across multiple disease types. Biomarker-based clinical trial designs have been proposed to address the one-size-fits-all issue [8][9][10][11]. Adaptive enrichment designs propose an enrichment rule to identify the patients who are expected to benefit more from the experimental treatment and adaptively restrict enrollment to the treatment-sensitive patients [12,13]. In this paper, we are interested in how personalized medicine affects the randomization of treatments in clinical trials.
Randomization is critical in clinical trials to remove systematic bias in detecting the treatment effect and is thus powerful for ensuring validity in comparative clinical trials. Most randomized controlled trials use fixed randomization to allocate participants to the treatments being compared; e.g., an allocation ratio of 1:1 or 2:1 is commonly used in comparative two-arm clinical trials. Fixed randomization makes the trial simple to execute. However, in the era of personalized medicine, investigators may hesitate to assign an equal number of patients to each treatment if the trial enrolls patients without restricting enrollment to a targeted subgroup based on empirical evidence of the efficacy of the treatments. As an effective approach to this ethical problem, adaptive randomization assigns more future patients to the better performing treatment based on the accumulating information on patients' responses to the treatments. Using such skewed allocation probabilities, response-adaptive randomization (RAR) designs for binary response trials have been proposed [14][15][16][17]. Optimal allocation probabilities have been proposed that minimize the sample size [18] or the total number of failures [19,20]. To incorporate patients' covariate information in RAR designs, the response probability conditional on the covariates is estimated for RAR [17,[21][22][23][24]].
In this paper, we propose a personalized risk-based screening design for comparative two-arm group sequential clinical trials. The proposed design follows the group sequential manner, with the first look used as a burn-in stage: it collects preliminary data to facilitate the regression fitting and the adaptive intervention assignment in the subsequent stages. We propose personalized randomization using Bayesian covariate-adjusted response-adaptive randomization based on an adaptive regression of response on informative covariates, which randomizes a patient with a given vector of covariates to the intervention from which the patient is expected to benefit more based on the accumulating information. Using risk factors to build the personalized risk-based allocation probability, the design provides individually tailored randomization of the screening modality. Moreover, we propose a group sequential test under personalized allocation and a Bayesian monitoring rule to compare screening effects and maintain the error rates.
The rest of this paper is organized as follows. In Section 2, we describe a motivating trial for cancer screening and propose a design structure, probability model, and methods for the personalized screening trial design. In Section 3, we evaluate the operating characteristics of the proposed design using simulation studies. We provide discussion in Section 4.

Motivating Trial
The Tomosynthesis Mammographic Imaging Screening Trial (TMIST) is a Phase III trial that started in July 2017 and is expected to be completed by August 2030 (study identifier NCT03233191). TMIST randomizes women between the ages of 45 and 74 to either tomosynthesis mammography (3D mammography) or standard digital mammography (2D mammography) with equal probability and evaluates the mammographic accuracy for breast cancer screening. The primary endpoint of the study is the incidence of advanced breast cancer, and the trial was designed to compare the proportion of women diagnosed with an advanced breast cancer between the two screening modalities. In the era of personalized medicine, it is essential to develop methods and trial designs for personalized risk-based screening using breast density, tumor subtyping, and genomics [25].

Design Structure
Motivated by TMIST, which compares two screening modalities, digital breast tomosynthesis mammography and standard digital mammography, we consider a comparative group sequential clinical trial with patients individually randomized to experimental treatment A or control B based on accumulating data.
Our design enrolls a maximum of N patients sequentially in cohorts of sizes n_1, ..., n_K with N = ∑_{k=1}^{K} n_k. The design uses Bayesian group sequential monitoring, described in Section 2.5 below, for superiority or futility at interims to compare A to B in the adaptively randomized patients. The schema of the design is shown in Figure 1. The trial begins by enrolling the first cohort of n_1 patients according to the eligibility criteria and randomizes them to A or B with equal probability. When the n_1 patients have been enrolled and their outcomes are available, the superiority or futility of the experimental treatment A against the control B is monitored at the first interim. If the monitoring shows that A is superior or futile, the trial is terminated. If the trial is not stopped early, we fit the regression model of response on a vector of patients' characteristics and treatment to estimate the personalized allocation probability, given in (2) in Section 2.4 below. The allocation probability is updated to randomize the treatment adaptively and individually for the second cohort. This procedure is repeated until the end of the trial. If the maximum sample size N is reached and the last patient's outcome has been evaluated, a final analysis is performed.

Figure 1. Schema of the proposed design.

Probability Model
Let G be an indicator of treatment group taking 1 for receiving experimental treatment A and 0 for receiving control B. Let Y be a binary indicator of events, e.g., deaths. For each patient, we assume that a vector of informative covariates x is available at enrollment.
We describe a probability distribution for Y assuming a probit regression model

Pr(Y = 1 | G, x) = Φ(x'β + G x'γ), (1)

where Φ(·) denotes the cumulative distribution function of a standard normal variable, x = (1, x_1, ..., x_q)' includes an intercept, and θ ≡ (β', γ')' denotes the regression coefficient parameter vector. Specifically, β is the vector of covariate main effects and γ is the vector of interaction effects between treatment and covariates, including the main experimental versus control effect. Returning to the motivating trial, Y is the indicator of having breast cancer. The probability in (1) gives the chance of having an advanced breast cancer for a given screening method G and vector of patient characteristics x. To interpret breast cancer risk for screening, electronic health records, breast density, age, tumor subtyping, first-degree breast cancer family history, and genomics are candidate predictive covariates in the risk prediction model [26][27][28][29][30][31].
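As an illustration, the probit model in (1) can be evaluated directly. The sketch below assumes Python with NumPy/SciPy; the function name and argument layout are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm


def response_prob(x, g, beta, gamma):
    """Event probability Pr(Y = 1 | G = g, x) under the probit model (1).

    beta holds the intercept and covariate main effects; gamma holds the
    main treatment effect and treatment-covariate interactions.
    """
    xt = np.concatenate(([1.0], np.asarray(x, dtype=float)))  # prepend intercept
    return float(norm.cdf(xt @ beta + g * (xt @ gamma)))


# With all coefficients zero, the event rate is Phi(0) = 0.5 on both arms.
beta = np.zeros(3)
gamma = np.zeros(3)
print(response_prob([0, 1], g=1, beta=beta, gamma=gamma))  # -> 0.5
```

A negative main treatment effect (gamma intercept below zero) lowers the event rate on arm A relative to arm B, which is the direction of benefit throughout the paper.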
Assigning β and γ normal priors, the parameters are estimated by Bayesian inference. We used the LearnBayes R package to fit the Bayesian probit regression model.

Personalized Allocation for Adaptive Randomization
For each k = 1, ..., K − 1, let D_k denote the accumulating data at the kth interim, i.e., the set of (Y, G, x) over the first k cohorts. Let p_A(x) = Pr(Y = 1 | G = 1, x) and p_B(x) = Pr(Y = 1 | G = 0, x). Then p_A(x) − p_B(x) = Φ(x'β + x'γ) − Φ(x'β), which is a function of the unknown parameter θ = (β', γ')'. To assign more patients to the better performing personalized treatment, we quantify the likelihood that a patient with covariates x benefits more from treatment A than from B, i.e., that p_A(x) − p_B(x) < 0. Let p_{k−1}(x) = Pr(p_A(x) < p_B(x) | D_{k−1}) denote the posterior probability that a patient with covariates x is less likely to have an event under A than under B based on the accumulating data D_{k−1}. Assuming a normal prior on θ for the Bayesian probit regression model in Section 2.3, samples of θ are generated from the posterior distribution

posterior(θ | data) ∝ lik(data | θ) × prior(θ),

where lik(data | θ) denotes the likelihood function and prior(θ) the prior distribution of the parameter θ, and the posterior probability p_{k−1}(x) is calculated. We describe how to compute p_{k−1}(x) in Appendix A. The posterior probability reflects personalized medicine in that the patient's characteristics x are incorporated into the posterior probability Pr(p_A < p_B | D_{k−1}) used in Bayesian adaptive randomization [32]. Then, we define the probability of randomizing a patient with covariates x in the kth cohort to treatment A as

π_{k,A}(x) = p_{k−1}(x) = Pr(p_A(x) < p_B(x) | D_{k−1}). (2)

This personalized allocation probability is a type of covariate-adjusted response-adaptive randomization (CARA). To emphasize in the randomization ratio that patients can respond differently to the treatments, we prefer the term personalized randomization over CARA. We use the allocation probability (2) in the proposed design to perform personalized randomization.
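A minimal sketch of computing the personalized allocation probability (2) from posterior draws of (β, γ): the allocation probability is the fraction of draws under which the event rate on A is below that on B for the given covariate profile. The draws below are synthetic stand-ins for MCMC output, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm


def alloc_prob_A(x, beta_draws, gamma_draws):
    """Personalized allocation probability (2): the posterior probability
    that a patient with covariates x has a lower event rate on arm A than
    on arm B, estimated from posterior draws of (beta, gamma)."""
    xt = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    p_A = norm.cdf(beta_draws @ xt + gamma_draws @ xt)  # per-draw Pr(Y=1 | A, x)
    p_B = norm.cdf(beta_draws @ xt)                     # per-draw Pr(Y=1 | B, x)
    return float(np.mean(p_A < p_B))


# Draws concentrated on a negative main treatment effect: allocation is
# strongly skewed toward A for any covariate profile.
rng = np.random.default_rng(0)
beta_draws = rng.normal(0.0, 0.05, size=(4000, 3))
gamma_draws = rng.normal([-0.5, 0.0, 0.0], 0.05, size=(4000, 3))
print(alloc_prob_A([1, 0], beta_draws, gamma_draws))  # close to 1
```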
An alternative option is another type of CARA given by

π_{k,A}(x) = √(1 − p_{k−1,A}(x)) / { √(1 − p_{k−1,A}(x)) + √(1 − p_{k−1,B}(x)) }, (3)

where p_{k−1,A}(x) = Pr(Y = 1 | G = 1, x, D_{k−1}) and p_{k−1,B}(x) = Pr(Y = 1 | G = 0, x, D_{k−1}). The personalized allocation probability (3) uses the estimated response rates of treatments A and B, p_{k−1,A}(x) and p_{k−1,B}(x), obtained from the posterior means of the parameters. In our motivating screening trial, the response is an event such as death. To build a personalized allocation probability skewed toward patients who get more benefit, the allocation probability for A is proportional to √(1 − p_{k−1,A}(x)). This is a Bayesian modification of the optimal allocation probability suggested by Rosenberger et al. [19]. The personalized allocation probabilities (2) and (3) are updated throughout the trial based on the accumulating data. They change the treatment allocation probability and adaptively randomize more patients to the treatment arm that is superior according to the patients' characteristics. Returning to the motivating trial, using the risk prediction model in Section 2.3, we are able to perform data-driven personalized randomization. It builds the personalized risk-based allocation probability and individually randomizes more patients to the superior screening modality. The personalized randomization makes the trial more ethically reasonable and helps clinicians and clinical trialists get more out of randomized clinical trials.
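Equation (3) is garbled in our copy of the text; the sketch below follows one common reading of the Rosenberger et al. failure-minimizing optimal allocation, with the estimated non-event rate under each arm entering through its square root. The function name is ours, and this form should be treated as an assumption rather than the paper's exact formula.

```python
import numpy as np


def alloc_prob_failure_min(p_A_hat, p_B_hat):
    """Allocation probability in the style of (3): with Y = 1 an event
    (failure), the probability of assigning arm A is proportional to the
    square root of the estimated non-event rate on A, so the arm with
    fewer expected failures is favored."""
    s_A = np.sqrt(1.0 - p_A_hat)
    s_B = np.sqrt(1.0 - p_B_hat)
    return s_A / (s_A + s_B)


# Arm A has the smaller event rate, so it receives more than half the patients.
print(alloc_prob_failure_min(0.3, 0.5))  # > 0.5
```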

Group Sequential Test in Personalized Randomization
To effectively use personalized randomization in group sequential designs allowing early stopping, it is critical to preserve the overall type I error rate. As response-adaptive randomization (RAR), including CARA, is based on the observed data, potential selection bias can occur. Moreover, the bias would be more serious if CARA is used when there exists an effect of informative covariates. Park [33] shows that group sequential designs using CARA are influenced by prognostic covariates and that the overall type I error rate is not controlled. To address the type I error rate inflation from using the personalized allocation and to accommodate possible changes in the eligibility of patients during the trial, an elaborate test statistic that preserves the error rates is required.
At the kth analysis, the trial has enrolled the patients of k cohorts sequentially. Based on the accumulated data D_k from the k successive cohorts, which might be heterogeneous, the kth interim monitoring determines go or no-go of the trial. Let ∆_k be the expected subgroup-averaged treatment effect based on the kth cohort. Assuming that x determines the subgroups, we suppose there are I_k subgroups in the kth cohort, denoted by S_i, i = 1, ..., I_k. In the case where x is continuous, dichotomization can be considered to define the subgroups, e.g., young and old groups for the age variable. A comparative treatment effect of the kth cohort is obtained by

∆_k = ∑_{i=1}^{I_k} [ (1/n_k) ∑_{j=1}^{n_k} I(x_j ∈ S_i) ] { Φ(x_i'β + x_i'γ) − Φ(x_i'β) }, (4)

where I(·) denotes the indicator function and x_i is the covariate profile of subgroup S_i. It is a function of the parameter θ and indicates the expected difference in the response probability with respect to x over the kth cohort. Then, a group sequential test statistic is proposed as the weighted sum of the comparative treatment effects of the k cohorts, i.e.,

T_k = ∑_{l=1}^{k} ( n_l / ∑_{m=1}^{k} n_m ) ∆_l. (5)

As the comparative treatment effect ∆_k is calculated by marginalizing the difference in response probability with respect to x, the test statistic T_k does not indicate the treatment effect of an individual patient. It indicates the overall treatment effect based on the accumulating data at the kth analysis.
When there are a few covariates, all possible combinations of subgroups are considered to obtain the comparative treatment effect ∆ k . However, with more covariates, to avoid any computational burden or complexity, we suggest identifying the covariates whose main effect is significant so that they determine the subgroups in the kth cohort for the calculation of ∆ k .
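The statistics (4) and (5) can be computed from a cohort's covariate matrix and a parameter value as in this sketch. Averaging p_A(x) − p_B(x) over the cohort's empirical covariate distribution is equivalent to weighting subgroups by their observed frequencies; function names are ours.

```python
import numpy as np
from scipy.stats import norm


def delta_k(X, beta, gamma):
    """Comparative treatment effect (4) for one cohort: the difference
    p_A(x) - p_B(x) averaged over the cohort's empirical covariate
    distribution (equivalently, weighted by subgroup frequencies)."""
    Xt = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return float(np.mean(norm.cdf(Xt @ (beta + gamma)) - norm.cdf(Xt @ beta)))


def T_k(deltas, cohort_sizes):
    """Group sequential statistic (5): cohort-size-weighted sum of the
    comparative treatment effects of the cohorts observed so far."""
    w = np.asarray(cohort_sizes, dtype=float)
    return float(np.sum(w / w.sum() * np.asarray(deltas)))
```

For a fixed θ drawn from the posterior, `delta_k` gives one posterior draw of ∆_k, and `T_k` combines the cohort-level draws into a draw of the test statistic.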
Let δ_1 denote the minimal improvement for the experimental treatment to be deemed superior to the control and δ_2 denote the minimal improvement for the experimental treatment to be considered worthy of further investigation. The values of δ_1 and δ_2 are prespecified by clinicians or the study hypothesis. Let ϵ_i, i = 1, 2, 3, be the pre-specified probability cutoffs for the superiority and futility monitoring rules. They are design parameters obtained by preliminary simulation-based calibration, where ϵ_1 and ϵ_3 control the type I error rate α and ϵ_2 controls the type II error rate β. To save several rounds of calibration, the initial values of ϵ_1 and ϵ_3 were set to one minus the target type I error rate, and the initial value of ϵ_2 to one minus the target type II error rate. To align with experts' experience and knowledge, survey results can be used to determine the level of evidence and calibrate the monitoring rules [34]. If the calculated type I error rate is lower/higher than the desired level, we decrease/increase the values of ϵ_1 and ϵ_3, and if the calculated type II error rate is lower/higher than the desired level, we decrease/increase the value of ϵ_2. We repeat this calibration process until the desired type I and II error rates are obtained. The calibration procedure thus determines the cutoffs carefully to adjust for the multiplicity of repeated testing over time and maintains the overall type I and II error rates at their nominal levels. It is widely used in Bayesian sequential designs [13,[35][36][37]]. Shi and Yin [38] provide a unified framework for searching the cutoffs effectively.
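The calibration loop described above can be sketched as follows. Here `simulate_error_rates` is a placeholder for running the full design under null and alternative scenarios and returning the empirical (type I, type II) error rates; the step size and tolerance are illustrative choices, not values from the paper.

```python
def calibrate_cutoffs(simulate_error_rates, alpha=0.05, beta=0.20,
                      tol=0.005, step=0.005, max_rounds=100):
    """Simulation-based calibration of the probability cutoffs eps1/eps3
    (type I error) and eps2 (type II error), following the adjustment
    direction described in the text."""
    eps1 = eps3 = 1.0 - alpha  # initial superiority cutoffs
    eps2 = 1.0 - beta          # initial futility cutoff
    for _ in range(max_rounds):
        t1, t2 = simulate_error_rates(eps1, eps2, eps3)
        if abs(t1 - alpha) <= tol and abs(t2 - beta) <= tol:
            break
        if abs(t1 - alpha) > tol:  # type I too high -> raise eps1 and eps3
            d = step if t1 > alpha else -step
            eps1 += d
            eps3 += d
        if abs(t2 - beta) > tol:   # type II too high -> raise eps2
            eps2 += step if t2 > beta else -step
    return eps1, eps2, eps3
```

In practice each call to `simulate_error_rates` is itself a large simulation, so the step size trades off calibration precision against computing time.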
Then, the Bayesian sequential monitoring rule is described as follows.
• When k < K (i.e., at an interim analysis), the trial stops early for superiority of A if Pr(T_k < δ_1 | D_k) > ϵ_1 and stops early for futility if Pr(T_k > δ_2 | D_k) > ϵ_2; otherwise, the trial continues to the next cohort.
• When k = K (i.e., at the final analysis), we argue that A is superior to B if Pr(T_K < δ_1 | D_K) > ϵ_3, and otherwise, A is not superior to B.
The posterior probabilities Pr(T_k < δ_1 | D_k) and Pr(T_k > δ_2 | D_k) are computed by Bayesian inference (see Appendix A). The values of δ_1 and δ_2 need not be the same in the decision rules; allowing unequal values of δ_1 and δ_2 increases the flexibility of the study.
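The monitoring rule can be sketched from posterior draws of T_k as follows, assuming the interim decisions compare Pr(T_k < δ_1 | D_k) and Pr(T_k > δ_2 | D_k) to the cutoffs ϵ_1 and ϵ_2 as in the bullets above; function names and return strings are ours.

```python
import numpy as np


def sequential_decision(Tk_draws, k, K, delta1=0.0, delta2=0.0,
                        eps1=0.95, eps2=0.80, eps3=0.95):
    """Bayesian sequential monitoring: a smaller event rate on A is better,
    so a high posterior probability of T_k < delta1 supports superiority,
    and a high posterior probability of T_k > delta2 supports futility."""
    Tk_draws = np.asarray(Tk_draws)
    p_sup = np.mean(Tk_draws < delta1)
    if k == K:  # final analysis uses the cutoff eps3
        return "A superior" if p_sup > eps3 else "A not superior"
    if p_sup > eps1:
        return "stop for superiority"
    if np.mean(Tk_draws > delta2) > eps2:
        return "stop for futility"
    return "continue"
```

The default cutoffs here are only the uncalibrated starting values (one minus the target error rates); in the design they are replaced by the calibrated ϵ_1, ϵ_2, ϵ_3.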

Simulation Study
We assumed a maximum sample size of 210, which yielded 80% power to detect a response rate of 0.3 versus a null response rate of 0.5 based on a two-sample t-test with one-sided significance level α = 0.05 under a traditional randomized clinical trial using fixed equal randomization. Each patient was randomized to either experimental treatment A or control B. Two interim analyses were performed when the first 70 and 140 enrolled patients had completed the evaluation of the response. At the interims, we monitored the superiority or futility of treatment A against B. A final analysis was performed after the last patient completed follow-up to determine whether the experimental treatment A is superior to B.
In the following, we first identify the challenging issues in personalized allocation based on the conventional group sequential test. Next, we investigate the performance of the proposed design and verify whether the issues are addressed.

Type I Error Rate Inflation
We considered four group sequential clinical trial designs: traditional randomization with a 1:1 ratio (Trad), response-adaptive randomization without incorporating covariates (RAR), and covariate-adjusted response-adaptive randomization using (2) and (3) (CARA1 and CARA2, respectively). For all designs, we used fixed equal randomization for the first cohort of 70 patients but changed the randomization scheme at the first interim according to the design. Trad kept the fixed equal randomization throughout the trial, while the other designs updated the allocation probability at each interim to randomize the patients in the next cohorts. RAR used the allocation probability proposed by Rosenberger et al. [19]. CARA1 and CARA2 used the personalized allocation probabilities described in (2) and (3), respectively. To make the designs comparable, all four performed the conventional group sequential test based on a chi-square test. We set the overall type I error rate to 0.05 for the group sequential test. The O'Brien-Fleming alpha spending function was used to specify the stopping boundaries for the sequential test in Trad, RAR, CARA1, and CARA2. To estimate the personalized allocation probability in CARA1 and CARA2, we fitted the Bayesian probit regression model assuming normal priors with the mean vector set to the maximum likelihood estimate and a diagonal covariance matrix with diagonal elements 4. This choice of prior avoids a vague prior and helps keep the error rates from inflating [33]. When we implemented the Bayesian inference, we ran 10,000 iterations and discarded the first 5000 as burn-in.
We considered two binary covariates x = (x_1, x_2)', generated independently from a Bernoulli distribution with success probability 0.5. There were four possible subgroups of patients determined by the two covariates, i.e., patients with x = (1, 1), (1, 0), (0, 1), or (0, 0). The response Y was then generated from a Bernoulli distribution with probability

Pr(Y = 1 | G, x) = Φ(β_0 + β_1 x_1 + β_2 x_2 + G(γ_0 + γ_1 x_1 + γ_2 x_2)). (6)

We considered twenty scenarios; the true parameters generating the response in (6) are described in Table 1.
Table 1. Simulation scenarios: true model parameters when x_1 and x_2 are independently generated from a Bernoulli distribution with success probability 0.5. Note that "sc" denotes scenarios.
Table 1 also provides the summary of the response rates p_A = Pr(Y = 1 | G = 1, x) and p_B = Pr(Y = 1 | G = 0, x) for the overall group and the four subgroups. Scenarios 1-9 are null scenarios in which the experimental treatment A and the control B show no difference in response. Scenarios 10-20 are alternative scenarios in which the main experimental versus control effect exists. Scenario 1 has a response rate of 0.5 for both A and B regardless of patients' characteristics or treatment assignment, i.e., there is no main covariate effect and no main experimental versus control effect. Scenarios 2 and 3 have a main effect of the first covariate (i.e., β_1 ≠ 0), while Scenarios 4 and 5 have a main effect of the second covariate (i.e., β_2 ≠ 0). Scenarios 6-9 have nonzero coefficients β_1 and β_2, implying that both covariates x_1 and x_2 have main effects on the response. Thus, in Scenarios 2-9, the response rate depends on the covariates but not on the treatment assignment; these are the cases where there is an effect of prognostic covariates. Scenario 10 has no covariate effects but has the main experimental versus control effect; it is the case where the experimental treatment A has better efficacy (i.e., a smaller response rate) than the control B.
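Data generation under model (6) can be sketched as follows, assuming Python with NumPy/SciPy; the function name, argument names, and seeding are ours.

```python
import numpy as np
from scipy.stats import norm


def simulate_cohort(n, beta, gamma, alloc=0.5, seed=None):
    """Generate one cohort under model (6): two independent Bernoulli(0.5)
    covariates, assignment to arm A with probability `alloc`, and a binary
    event drawn from the probit model."""
    rng = np.random.default_rng(seed)
    X = rng.binomial(1, 0.5, size=(n, 2)).astype(float)  # covariates x1, x2
    G = rng.binomial(1, alloc, size=n)                   # 1 = arm A, 0 = arm B
    Xt = np.column_stack([np.ones(n), X])                # add intercept
    p = norm.cdf(Xt @ beta + G * (Xt @ gamma))           # per-patient event rate
    Y = rng.binomial(1, p)                               # binary event
    return X, G, Y


# Scenario-1-like setting: all coefficients zero, event rate 0.5 on both arms.
X, G, Y = simulate_cohort(70, np.zeros(3), np.zeros(3), seed=0)
```

Passing a covariate-dependent `alloc` instead of the fixed 0.5 turns this into one step of the CARA schemes compared in the simulation study.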
Compared to Scenario 10, Scenarios 11 and 12 add the effect of the prognostic covariate x_1 (i.e., β_1 ≠ 0) to the main experimental versus control effect, while Scenarios 13 and 14 add the effect of the predictive covariate x_1 (i.e., γ_1 ≠ 0). Furthermore, in Scenario 15, the first covariate x_1 has both prognostic and predictive effects. Scenarios 16 and 17 have an effect of the prognostic covariate x_2 and an effect of treatment assignment. In Scenario 18, the second covariate x_2 has both prognostic and predictive effects. In Scenarios 19 and 20, both x_1 and x_2 have prognostic and predictive effects. Depending on the effects of the prognostic or predictive covariates, in Scenarios 10-20, particular subgroups with a given covariate profile are more likely to benefit from one treatment than from the other. To describe these subgroups, we call patients A (or B)-sensitive if patients with the covariate profile x are expected to respond better to A (or B) but not to B (or A). The better treatment for A-sensitive patients is A, and the better treatment for B-sensitive patients is B. For example, in Scenario 14, patients with x_1 = 0 are A-sensitive; in Scenario 18, patients with x_2 = 0 are A-sensitive; in Scenario 19, all patients except those with x = (1, 0) are A-sensitive; and in Scenario 20, patients with x = (1, 1) are B-sensitive and patients with x = (0, 0) are A-sensitive. Table 2 shows the estimated rejection probability for detecting the difference in response rate between treatments A and B based on 1000 simulated trials. The rejection probability under the null scenarios (Scenarios 1-9) indicates the overall type I error rate, and the rejection probability under the alternative scenarios (Scenarios 10-20) indicates the power. Trad and RAR preserved the type I error rate at the target level of 0.05 for all null scenarios.
In addition, CARA2 worked well to control the overall type I error rate except in Scenarios 5 and 7, where the estimated type I error rates under the personalized allocation probability (3) were inflated to 10-17%. However, CARA1 failed in most null scenarios where there exists an effect of the prognostic covariate(s). Specifically, CARA1 using the personalized allocation probability (2) led to serious error inflation of 25-40% due to the prognostic covariates in Scenarios 5 and 7. To investigate this inflation, we examined the distribution of the subgroups in each treatment arm for all designs. The mean and standard deviation of the allocation probability of the treatment for each subgroup are reported in Table 3. We observed that designs using personalized randomization, e.g., CARA1, CARA2, and BaCARA, led to larger variability of the distributions compared to Trad and RAR, which controlled the overall type I error rate. Under CARA1 and CARA2, the conventional group sequential test did not work properly in the presence of the effect of prognostic covariate(s), and we observed large inflations of the overall type I error rate. Under BaCARA, however, the overall type I error rates were less likely to be inflated, which resulted from the proposed group sequential test statistic accounting for the differences in treatment effect within subgroups. Depending on the difference in the response rate for each covariate profile and the prevalence of the subgroups, the outcomes were influenced by the covariates.
Table 2. Simulation results: estimated rejection probability of the designs when x_1 and x_2 are independently generated from a Bernoulli distribution with success probability 0.5. Note that "sc" denotes scenarios, and bold indicates inflation of error rates.
Under the alternative scenarios, Trad yielded power ranging from 0.10 to 0.98 depending on the overall difference between p_A and p_B. As the 80% power was justified by a difference of 0.2 from the null response probability p_B = 0.5, the power for each scenario varied with the treatment effect difference and the null response probability p_B. RAR generally yielded similar or slightly smaller power than Trad. CARA2 showed similar or larger power compared to Trad and RAR in most scenarios (except Scenario 17). In most scenarios where the treatment effect difference or subgroup effect difference was less than 0.2 (i.e., Scenarios 10-16), CARA1 yielded similar or smaller power compared to Trad and RAR. However, when the treatment effect difference or subgroup effect difference became larger (i.e., Scenarios 17-20), CARA1 led to much larger power than Trad and RAR. We also provide boxplots of the estimated difference between p_A and p_B at the final analysis for all designs in Figure 2. Therefore, CARA1 was more sensitive to the prognostic covariates than CARA2 and was more likely to inflate the error rates. Table 4 shows other operating characteristics of the designs, such as the average difference in the number of patients assigned to A and B and the average number of failures (i.e., events) across the 1000 simulated trials. Compared to Trad, RAR and CARA changed the allocation ratio and randomized more patients to the superior treatment. Under Trad, the average difference in the number of patients assigned to A and B ranged from −1.176 to 0.924 with an average of 0.065 across the alternative scenarios. Under RAR, it ranged from 0.482 to 3.082 with an average of 2.023. Under CARA1, it ranged from 16.808 to 48.808 with an average of 36.466.
Under CARA2, the average difference in the number of patients assigned to A and B ranged from 0.316 to 17.598 with an average of 6.654 across the alternative scenarios. CARA1 and CARA2 assigned a larger number of patients to the superior treatment A than Trad and RAR. The gain was much larger under CARA1, which resulted from the effective use of the personalized allocation probability based on the accumulating data. It also resulted in a smaller number of failures under CARA1 than under the other designs. The simulation study tells us that effective use of the personalized allocation probability can inflate the overall type I error rate but is more ethical, assigning more patients to the superior treatment and yielding a smaller number of failures. Therefore, it is critical to maintain the overall type I error rate under personalized allocation and improve clinical benefit while inheriting the advantages of CARA designs.

Evaluation of the Proposed Design: Preservation of Type I Error Rate
We observed in Table 2 the inflation of the overall type I error rate under CARA1, which came from the prognostic covariates' effect and the sequential personalized allocation. A conventional group sequential test for the overall treatment difference in the response probability did not work well when the randomization depended on patients' characteristics. Patients receiving a given treatment might not be homogeneous and may respond differently to the treatment. To accommodate this heterogeneity and control the type I error rate, the group sequential test under personalized allocation was proposed in Section 2.5.
For convenience, we call the proposed design BaCARA; it uses the personalized allocation probability (2) to randomize the patients and monitors the treatment effect based on the proposed group sequential test statistic (5) through the Bayesian sequential monitoring rule. We evaluated the operating characteristics of BaCARA through simulations, following the same settings as in Tables 2-4. To compare with Trad, RAR, CARA1, and CARA2, we included the results of BaCARA in the last column of Tables 2-4. Assuming the minimal improvements δ_1 = δ_2 = 0, we calibrated the probability cutoffs as described in Section 2.5. We observed in Table 2 that BaCARA preserved the overall type I error rate at the target level of 0.05, implying that the inflation issue of CARA1 was addressed. Compared with the Trad and RAR designs, the overall type II error rates under BaCARA also seemed to be well controlled. Similar to CARA1 and CARA2, BaCARA was more powerful than Trad and RAR in Scenarios 18-20, where patients' response to the treatment was more heterogeneous, i.e., the treatment effect difference or subgroup effect difference was relatively larger. In addition, in Scenario 17, where CARA2 yielded a large inflation of the type II error rate, BaCARA showed large power compared to the other designs. Thus, BaCARA improved on CARA1 and CARA2 using the personalized allocation probability in that it preserved the overall type I and II error rates. BaCARA is appropriate for group sequential clinical trials incorporating patients' characteristics into the adaptive randomization.
In Table 4, BaCARA showed a larger difference in the number of patients assigned to A and B than Trad, RAR, and CARA2, but a smaller difference than CARA1 in most scenarios. Under BaCARA, the difference in the number of patients assigned to A and B ranged from 17.718 to 33.606 with an average of 26.634. However, BaCARA yielded a smaller number of failures across scenarios than Trad, RAR, CARA1, and CARA2; the number of failures ranged from 25.82 to 116.36 with an average of 57.84. This improvement came from the effective group sequential test as well as the personalized allocation. The proposed design improves clinical benefit and provides better guidance for the effective use of personalized randomization in personalized medicine.

Discussion
We proposed a personalized risk-based screening design using Bayesian covariate-adjusted response-adaptive randomization for comparative two-arm clinical trials. Following the group sequential procedure, we adaptively built the personalized allocation probability using the risk factors to randomize more patients to the most desirable individualized intervention and minimize the number of events. We also proposed a new group sequential test to address the challenging issues arising from the personalized allocation. The proposed Bayesian monitoring rule determined go or no-go of the trial at interims based on accumulating data, and the proposed design preserved the type I error rate through the calibrated cutoffs for the Bayesian monitoring rule.
We compared the performance of the proposed design to randomized controlled trial designs such as the traditional, RAR, and CARA designs. Even though the RAR design assigned more patients to the better performing intervention, and thus was more ethical than the traditional randomized controlled trial design, the expected number of failures was not different, and the improvement in clinical benefit was not clear. In addition, in RAR, all eligible patients were enrolled and randomized without any restriction based on patients' characteristics, which is not appropriate in personalized medicine. By incorporating patients' characteristics into the randomization, the CARA design allocated more patients to the better performing intervention than the RAR design. However, CARA designs could be sensitive to the prognostic covariate effect and inflate the overall type I error rate. Furthermore, in our simulations, the CARA designs did not clearly achieve a significant improvement in clinical benefit (i.e., a smaller number of events) compared to the traditional and RAR designs. Taking all of the above into account, the proposed design was the most appropriate for two-arm personalized screening clinical trials.
The proposed design is flexible and can be extended in the following ways. First, if informative covariates are not specified at the beginning of the trial, covariate selection methods can be carried out in the burn-in stage; the selected covariates with significant effects are then used in the remaining stages to randomize patients and test the screening effect. Second, our Bayesian sequential monitoring rule is flexible and can be modified according to the study objectives. For example, additional monitoring rules based on surrogate or safety endpoints can be included to make data-driven decisions throughout the trial. This also supports learning health systems alongside the trials. Third, the personalized randomization can be generalized to multi-arm trials, with each arm compared to the control using the proposed test. For example, to calculate the allocation probability (2) of randomizing a patient with covariates x to treatment A, the posterior probability of p_A(x) < p_B(x) for the comparison with control B is replaced with the posterior probability that treatment A offers the minimum response rate among all treatment arms [39]. Villar et al. [40], Ryan et al. [41], and Viele et al. [42] provide some directions for the multi-arm setting.
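The multi-arm generalization of the allocation probability can be sketched as follows. This is an illustrative Python fragment in which Beta samples stand in for the model-based posterior draws of the per-arm rates p_a(x) at a fixed covariate profile x; the arm names and Beta parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical posterior draws of p_a(x) for three arms at a fixed x;
# in the actual design these come from the Bayesian probit model.
draws = {
    "A": rng.beta(20, 80, size=5000),  # posterior sample of p_A(x)
    "B": rng.beta(30, 70, size=5000),  # posterior sample of p_B(x)
    "C": rng.beta(25, 75, size=5000),  # posterior sample of p_C(x)
}

arms = list(draws)
mat = np.column_stack([draws[a] for a in arms])
winner = np.argmin(mat, axis=1)  # arm with the minimum rate in each draw
# Posterior probability that each arm attains the minimum rate.
prob_min = {a: np.mean(winner == i) for i, a in enumerate(arms)}
```

These posterior probabilities would then play the role that Pr(p_A(x) < p_B(x) | D) plays in the two-arm allocation probability (2).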

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Calculation of Posterior Probability
We used Bayesian inference for the personalized randomization (Section 2.4) and the monitoring rule (Section 2.5). In the Bayesian probit regression model, we compute the posterior distribution of the parameter instead of its maximum likelihood estimator. As mentioned in Section 2.3, we assume that the regression coefficient vector θ = (β, γ) follows a normal prior distribution. Given the observed data D, the posterior distribution of the parameter θ is proportional to the prior distribution of the parameter times the likelihood function. The posterior sample is then drawn from the posterior distribution using data augmentation and Gibbs sampling [43,44].
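The data augmentation and Gibbs sampling step can be sketched as follows. This is a minimal Python illustration of the Albert-Chib sampler (the paper's computations use R), assuming a N(0, prior_var · I) prior on the coefficient vector; the function name and defaults are illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, prior_var=100.0, seed=1):
    """Albert-Chib data augmentation Gibbs sampler for Bayesian probit
    regression with a N(0, prior_var * I) prior on the coefficients."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Posterior covariance of theta given the latent z (constant over sweeps).
    V = np.linalg.inv(X.T @ X + np.eye(p) / prior_var)
    theta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ theta
        # Latent z_i ~ N(mu_i, 1), truncated to (0, inf) if y_i = 1 and
        # to (-inf, 0) if y_i = 0; bounds are standardized for truncnorm.
        lower = np.where(y == 1, -mu, -np.inf)
        upper = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # Conjugate multivariate normal draw for theta given z.
        theta = rng.multivariate_normal(V @ (X.T @ z), V)
        draws[t] = theta
    return draws
```

A burn-in portion of `draws` would be discarded before computing the posterior probabilities used in the allocation probability and the monitoring rule.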
Similarly, for a given value of δ, Pr(T_k < δ | D_k) is approximated by the average of I(T_k < δ) over the posterior sample based on the data D_k, where T_k is the weighted sum of ∆_j, j = 1, . . . , k. In both cases, the posterior samples are easily obtained using the R package LearnBayes.
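This Monte Carlo approximation can be sketched as follows; the posterior draws of ∆_j and the weights w_j below are hypothetical stand-ins for the model output at an interim with k = 3 stages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of the stagewise effects Delta_j, j = 1..3,
# stacked as columns (one row per posterior draw), with illustrative weights.
delta_draws = rng.normal(loc=[0.10, 0.15, 0.12], scale=0.05, size=(4000, 3))
w = np.array([0.2, 0.3, 0.5])

T_k = delta_draws @ w        # posterior sample of the weighted sum T_k
delta0 = 0.0
# Pr(T_k < delta | D_k) approximated by the average of the indicator.
prob = np.mean(T_k < delta0)
```

The same averaging of an indicator over posterior draws is used for the comparison of p_A(x) and p_B(x) in the allocation probability.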