Design and Analysis of Cancer Clinical Trials for Personalized Medicine

Biomarkers play a key role in the development of personalized medicine. Cancer clinical trials with biomarker should be appropriately designed and analyzed reflecting the various factors, such as the phase of trials, the type of biomarker, the study objectives, and whether the used biomarker is already validated or not. In this paper, we demonstrate design and analysis of two phase II cancer clinical trials, one with a predictive biomarker and the other with a prognostic biomarker. A statistical testing method and its sample size calculation method are presented for each of the trials. We assume that the primary endpoint of these trials is a time to event variable, but this concept can be used for any type of endpoint with associated testing methods. The test statistics and their sample size formulas are derived using the large sample approximation based on the martingale central limit theorem. Using simulations, we find that the test statistics control the type I error rate accurately and the sample sizes calculated using the formulas maintain the statistical power specified at the design stage.


Introduction
In many cancer clinical trials, different types of biomarkers are measured from the tumor, blood or urine using molecular, biochemical, physiological, anatomical, or imaging method at the baseline or during treatment. The observed biomarkers are used for various purposes during the diagnosis and treatment of the diseases. For example, cancer biomarkers are used to diagnose diseases (diagnostic biomarker), to predict the response to a specific treatment (predictive biomarker), to measure the aggressiveness of a disease for patients with no or a non-targeted treatment(prognostic biomarker), to monitor the recurrence of a disease, and so on.
These biomarkers can be used to select a treatment of cancer patients. However, biomarkers should be validated before being used to select a treatment in clinical trials. If a biomarker has not been validated yet, it can be used as a stratification factor of a randomized clinical trial. In such a trial, the biomarker is used for its validation, rather than for treatment selection.
The design and analysis method of a clinical trial with a biomarker-guided treatment can be very different depending on the type of the used biomarker, the biomarker's development stage, the study objective, and so on. Various design issues of randomized clinical trials with biomarkers have been widely discussed [1]. A series of statistical testing has been proposed for a randomized phase II trial with a potentially predictive biomarker which has not been strictly validated yet [2]. The efficacy of enrichment trials and stratified randomization trials with a time to event variable as the primary endpoint has been compared assuming that the treatment effect reverses between biomarker positive and negative groups and considering subset analysis within each biomarker status group [3].
Phase II trials are to screen out inefficacious treatments before proceeding to a largescale studies, such as a phase III trial. As such, phase II trials should be completed in a short time period, so that we must choose a small sample size and a short-term surrogate endpoint, such as tumor response or progression-free survival, as the primary endpoint, rather than a confirmatory endpoint, such as overall survival.
In this paper, we demonstrate two phase II cancer clinical trials, one with a predictive biomarker and the other with a prognostic biomarker, and present analysis and sample size calculation methods for these trials. We use a survival variable as the primary endpoint in this paper, but the same concept can be used for any kind of variables including a binary variable, such as tumor response. For the purpose of sample size calculation, we assume exponential survival distributions which are most popularly used in real trial designs, although the statistical testing does not depend on any specific survival distribution. This is a review article of a biostatistical paper [4] with some modifications.

Materials and Methods
We consider a time to event (or survival) endpoint, progression-free survival (PFS). We use a generalized log-rank test for a trial with an imaging prognostic biomarker and a Cox proportional hazards model for a trial with a predictive biomarker, and derive their sample size formulas. To account for relatively small sample sizes of phase II trials, exact statistical methods are used for binary outcomes, but in general no exact methods are available for survival analysis. Therefore, using simulations on two real trial examples, we evaluate the small sample performance of the discussed statistical tests and their sample size formulas that are derived based on large sample approximation.

A Phase II Trial with a Predictive Biomarker
Predictive biomarkers help provide information on the likelihood of response to a specific chemotherapy. For example, tumors expressing high thymidylate synthase (TS) levels were shown to be resistant to pemetrexed in a preclinical study [5], but it was not validated by a clinical study yet. Suppose that we want to investigate whether TS expression is a predictive marker for the clinical outcome of pemetrexed/cisplatin (PC) in patients with nonsquamous non-small-cell lung cancer (NSCLC) through a phase II trial. The control non-targeted treatment is gemcitabine/cisplatin (GC). Compared to GC, PC is expected to be similarly efficacious for TS-positive group, but to be more efficacious in TS-negative group.
To investigate this hypothesis, we want to randomize patients between the two treatment arms stratifying by TS-positivity vs. TS-negativity. This trial was designed and published with overall response as the primary endpoint [6], but in this paper, we demonstrate how to design and analyze a trial using PFS as the primary endpoint based on the estimates from the trial.

Sample Size Calculation
For sample size calculation of this study, we need to specify following design parameters.

•
Type I error rate and power, (α, 1 − β) • Allocation proportion for GC arm, p 0 , and for PC arm, p 1 (p 0 + p 1 = 1) • TS-negativity q 0 and TS-positivity q 1 based on the prevalence in the study population Assuming exponential distributions for PFS, the hazard rates, λ z 1 z 2 , of the four patient groups, λ 00 , λ 01 , λ 10 , and λ 11 • Accrual period a (or accrual rate r) and additional follow-up period b Assuming exponential distributions for PFS with hazard rates λ z 1 z 2 , model (1) is By solving these equations with respect to (λ 0 , β 1 , β 2 , β 3 ), we have Hence, we can calculate the values of β 3 under H 1 in terms of the hazard ratios that are specified as design parameters above.
To derive a sample size formula, we need to calculate the limit ofσ 2 3 or I −1 (0) as n → ∞ in terms of the design parameters. Let p kl = P(z 1 = k, z 2 = l) for k, l = 0 or 1 denote the relative frequency of each cell of the 2 × 2 table defined by treatment and TS status. Under the stratified randomization scheme, we have p kl = p k q l for k, l = 0 or 1.
Appendix A shows that I(0) converges to DA, where D = nd denotes the expected number of events (or number of patients with tumor progression), d = p 00 d 00 + p 10 d 10 + p 01 d 01 + p 11 d 11 denotes the probability that a patient has a progression during the study period, d kl = 1 − exp(−λ kl b){1 − exp(−λ kl a)} denotes the probability that a patient in group (z 1 , z 2 ) = (k, l) has a progression for k, l = 0, 1 as derived based on an exponential PFS distribution and U(b, a + b) censoring distribution, and Please note that d z 1 z 2 is derived from an exponential PFS distribution with hazard ratio λ z 1 z 2 and a censoring distribution of U(b, a + b). Hence, the limit ofσ 2 3 From (2),β 3 = log λ 11 − log λ 10 − log λ 01 + log λ 00 is the β 3 value specified under H 1 . Since (β 3 −β 3 )/σ 3 has the standard normal distribution under H 1 , the power for a local alternative hypothesis H 1 : β 3 =β 3 is given as whereΦ(·) is the survivor function of the standard normal distribution. Noting that σ 2 3 = DA (3,3) = ndA (3,3) , we obtain the required number of events or the required sample size by solving Equation (3). Formula (4) requires specification of accrual period a together with (α, 1 − β),β 3 , b, p 0 and q 0 . In designing a clinical trial, however, we can estimate the accrual pattern, rather than an accrual period. Suppose that patients are expected to be enrolled to the study at a rate of r during an accrual period based on the number of patients treated by the study member sites recently. Assuming uniform patient accrual during period a, we have n ≈ a × r. Noting that d = d(a) is a function of a, (4) is expressed as By solving (5) with respect to a using a numerical method, such as the bisection method, we obtain the required accrual period, say a * , and the required sample size n = a * r.

Example 1
We demonstrate our sample size calculation method with the NSCLC trial that is introduced above. We will randomize patients between the two treatment arms with 1-to-1 fashion, i.e., p 0 = p 1 = 1/2 stratified by TS status. The expected TS-positivity is 50% (i.e., q 0 = q 1 = 0.5) because the median TS level was selected as the cutoff value for TS-positivity from a previous study [9]. Hence, we have p z 1 z 2 = p z 1 q z 2 = 1/4 for z 1 , z 2 = 0 or 1. The 6-month PFS is expected to be about 35% for GC arm regardless of TS level and for PC arm with TS-positivity, and 55% for PC arm with TS-negativity. For an exponential distribution, t-year survival probability S(t) is associated with its hazard rate λ by S(t) = exp(−λt). Therefore, the annual hazard rates under the alternative hypothesis are given as λ 00 = λ 01 = λ 11 = 2.100 and λ 10 = 1.196 under the exponential PFS assumption. For these hazard rates, we have the baseline hazard rate λ 0 = 2.100, β 2 = 0, and β 3 = −β 1 = 0.563 from (2). Suppose that about 10 patients per month are expected to be entered to the study, i.e., an annual accrual rate of r = 120. We plan to follow the patients for additional b = 1 year after the last patient enters. Then, the 1-sided α = 0.1 test for H 0 : β 3 = 0 against H 1 : β 3 = 0.563 in model (1) requires n = 345 patients for a power of 1 − β = 0.9. The expected number of events (i.e., number of patients with a disease progression) at the analysis will be D = 333. As an effort to lower the sample size for this phase II trial, we use a large α level compared to the standard two-sided α = 0.05. We observe an empirical power of 0.897 from 10,000 simulation samples of size n = 345 that are generated at the design setting. This trial recruited 321 patients using overall response as the primary endpoint [6].
A stratified randomized trial of a treatment with a predictive biomarker requires a large sample size for testing on the interaction term. Sample size of a trial for testing the interaction term with 50% of biomarker positivity may be compared to that of a trial for an arm-to-arm comparison with 1-to-3 randomization in the setting of the NSCLC trial expecting a higher efficacy of PC only for TS-negative group. Let us consider a randomized trial to compare two treatment arms with (α, 1 − β, r, b) = (0.1, 0.9, 120, 1) as above and 6-month PFS of 35% for the control treatment and 55% for the experimental treatment. In this case, we need only n = 122 (D = 102) patients by 1-to-3 randomization by a sample size formula for the standard log-rank test [10]. If TS had been already validated to be a predictive biomarker of pemetrexed before this trial, then we could have chosen an enrichment trial for TS-negative patients that would require a much smaller sample size. The efficiency has been compared between an enrichment design and a stratified randomization design has been for predictive biomarker in terms of a continuous outcome [11].

A Phase II Trial with a Prognostic Biomarker
Prognostic biomarkers provide information on the overall cancer outcome in patients to facilitate cancer diagnosis regardless of selected treatments. In this section, we consider a phase II trial with an imaging prognostic biomarker. Chemotherapy B has been a standard regimen for patients with non-bulky stage I and II Hodgkin lymphoma. In a previous study on 6 cycles of B, each patient had a FDG-PET (fluorodeoxyglucose positron-emission tomography) imaging after 2 cycles of B. It was found that the patients with a negative PET image (group 1) and those with a positive PET image (group 2) had a 3-year PFS of S 1 (3) = 0.86 and S 2 (3) = 0.52, respectively, and the hazard ratio, ∆ = λ 2 /λ 1 , was estimated as ∆ 0 = 4.3.
In a new single-arm phase II trial, the patients with a negative PET image after 2 cycles of B will be treated by additional 4 cycles of the chemotherapy B as in the previous study, whereas those with a positive PET image after 2 cycles of B will be treated by 4 cycles of a more aggressive chemotherapy C plus radiation therapy (C+RT).
In this trial, we want to show that by treating PET positive patients with the more aggressive therapy C+RT, their PFS will become closer to that of PET negative patients who are treated by the standard chemotherapy B. To this end, we test H 0 : ∆ = ∆ 0 against H 1 : ∆ < ∆ 0 . Although the PFS of group 2 will be different between H 0 and H 1 , that of group 1 is expected to be identical since PET negative patients receive the same treatment as that of the previous study.

Statistical Testing
Let n k denote the sample size in group k, n = n 1 + n 2 the total sample size, and T ki the time to progression for subject i in group k (1 ≤ i ≤ n k ; k = 1, 2). We observe (X ki , δ ki ), where X ki is the minimum of T ki and the censoring time and δ ki is an event (or progression) indicator taking 1 if the subject had a tumor progression and 0 otherwise. For group k, T k1 , . . . , T k,n k are distributed with hazard function λ k (t). Under the proportional hazards assumption, ∆ = λ 2 (t)/λ 1 (t) denotes the hazard ratio between the two patient groups.

Sample Size Calculation
We want to estimate the sample size n under a local alternative hypothesis H 1 : ∆ = ∆ 1 (<∆ 0 ) with a desired power. For sample size calculation of this trial, we need to specify following design parameters. Using the specified hazard rates, we have ∆ 0 = λ 20 /λ 1 and ∆ 1 = λ 21 /λ 1 . A sample size formula with ∆ 0 > 1 and ∆ 1 = 1 for designing non-inferiority trials was proposed [14]. This sample size formula was further extended for general ∆ 0 and ∆ 1 with ∆ 1 < ∆ 0 [16].
Appendix B derives a sample size formula by adapting Jung and Chow's formula [16] for this trial, where is the survivor function of the U(b, a + b) censoring distribution with and p k = n k /n. The integrals for ω, σ 2 0 , and σ 2 1 are calculated using a numerical method.
The number of events D expected at the analysis time under H 1 is calculated by Sample size formula (6) assumes that the accrual period a is specified. Suppose that accrual rate r is specified instead of accrual period a. Given (λ 1 , ∆ 0 , ∆ 1 , α, 1 − β, p 1 , b), ω = ω(a) and σ h = σ h (a) for h = 0, 1 are functions of a. Under uniform accrual assumption, we have n = a × r. Hence, (6) is expressed as By solving (7) with respect to a, using a numerical method such as the bisection method, we obtain the required accrual period a * and the required sample size n = a * × r.

Example 2
We consider the PET-guided Hodgkin lymphoma trial introduced in the beginning of this section. Under H 0 , we assume a 3-year PFS of 86% and 52% for PET negative and positive groups, respectively, which correspond to annual hazard rates of (λ 1 , λ 20 ) = (0.050, 0.218) under an exponential PFS model, resulting in a hazard ratio of ∆ 0 = 4.3. By treating the PET positive patients with an aggressive treatment C+RT, we expect to increase their 3-year PFS up to 74% (from 52%), resulting in an annual hazard rate of λ 21 = 0.100 and hazard rate of ∆ 1 = 2.0. The previous study observed about p 2 = 20% of PET-positivity. Assuming an annual accrual rate of r = 60 patients and b = 3 years of additional follow-up after completion of accrual, we need n = 191 patients for 1 − β = 90% power for detecting H 1 : ∆ 1 = 2 by the generalized log-rank test W(∆ 0 )/σ n (∆ 0 ) with onesided α = 10% under H 0 : ∆ 0 = 4.3. Under this specific alternative hypothesis, we expect about 46 events (progressions or deaths) at the data analysis. This trial was conducted with this study objective as a second objective [17]. Simulation studies are conducted to evaluate the performance of the calculated sample size under the above design settings under H 0 and H 1 . Using 10,000 simulation samples of size n = 191 under each hypothesis, the empirical type I error rate and power are observed as 0.0984 (to be compared to α = 0.1) and 0.8749 (to be compared to 1 − β = 0.9), respectively.

Discussion
We have presented design and analysis methods of two phase II trials for biomarkerguided treatments.
The power of the statistical tests of the trials discussed above depends on the prevalence of the biomarker positivity. Therefore, we need to check the observed prevalence during the patient accrual, and to recalculate the sample size if the observed prevalence is very different from the one specified at the design stage. For both of our example trials, the initial sample size will be under-powered if the observed prevalence is farther from 1/2 than the specified one at the design stage. In this case, we may plan a sample size recalculation reflecting the observed prevalence in the middle of the trial, and modify the sample size of the trial if necessary.
For the sample size calculations, we have assumed exponential survival distributions and an accrual pattern with a constant accrual rate, but we can easily extend the formulas for any survival distributions and any accrual pattern [18]. We have considered a survival endpoint as the primary endpoint in this paper, but the concept can be used to design and analysis for biomarker-driven phase II trials with other type of endpoint, such as a binary outcome for tumor response.
As an effort to lower the sample size of a phase II trial from that of a phase III trial, we use a high type I error rate [19,20], such as 1-sided α = 5% or 10% (compared to 2-sided α = 5%), a surrogate short-term outcome, such as tumor response or progression-free survival (compared to a confirmatory endpoint such as overall survival), a larger treatment effect, and a single-arm design (compared to a randomized design). Despite these efforts, we observe that a randomized trial stratified by a predictive biomarker requires a relatively large sample size for a phase II trial. This fact is pointed out by literature [3,11,21].

Conclusions
Through simulations on the two real study examples, we find that the proposed statistical tests control the type I error rate accurately and the calculated sample sizes maintain the appropriate power. The sample size calculations require some numerical methods for integration and solving equations. The author developed Fortran programs to implement the sample size formulas, which are available upon request.
its variance under H 1 iŝ Therefore, n −1σ2 1 converges to On the other hand, under H 1 , n −1 σ n (∆ 0 ) converges to