
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Biomarkers are becoming increasingly important for streamlining drug discovery and development. In addition, biomarkers are widely expected to be used as a tool for disease diagnosis, personalized medication, and surrogate endpoints in clinical research. In this paper, we highlight several important aspects related to study design and statistical analysis for clinical research incorporating biomarkers. We describe the typical and current study designs for exploring, detecting, and utilizing biomarkers. Furthermore, we introduce statistical issues such as confounding and multiplicity for statistical tests in biomarker research.

In recent years, biomarkers have played an increasingly important role in drug discovery, in understanding the mechanism of action of a drug, in investigating efficacy and toxicity signals at an early stage of pharmaceutical development, and in identifying patients likely to respond to treatment. In addition, several potentially powerful tools to decipher such intricacies are emerging in various fields of science, and the translation of such knowledge to personalized medicine has been promoted and has raised strong expectations in almost every sector of health care. Biomarkers have therefore been used to personalize medication and healthcare and in the safety assessment of drugs in clinical practice. However, few valid biomarkers at present can predict which group of patients will respond positively, which patients are non-responders, and who might experience adverse reactions to the same medication and dose. Consequently, a vast number of clinical biomarker studies are conducted and reported.

In practice, however, highly cited biomarker studies often substantially overestimate the associations they report, as revealed by subsequent meta-analyses of the same questions. Many of these studies were relatively small and were among the first to report on the association of interest, and discoveries made in small studies are prone to overestimating or underestimating the actual association. Ioannidis and Panagiotou [

In this paper, we first introduce the definition, classification, and some examples of biomarkers in clinical research. Second, we review the typical and current study designs of clinical research using biomarkers in practical studies. Furthermore, we describe statistical issues such as confounding and multiplicity for statistical tests in biomarker research. The final section is a brief summary.

An expert working group at the National Institutes of Health (NIH) has defined a biological marker or biomarker as ‘a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention’ [

Biomarkers can be broadly classified into prognostic biomarkers, predictive biomarkers, pharmacodynamic biomarkers, and surrogate endpoints [

A prognostic biomarker classically identifies patients with differing risks of a specific outcome, such as progression or death [

A prognostic biomarker can separate a population into groups whose outcomes will be poor or good under either the test or the standard treatment, but it cannot guide the choice of a particular treatment. The preliminary knowledge necessary to propose a validation study of a prognostic biomarker is the subject of considerable previous work [

A predictive biomarker predicts the differential outcome of a particular therapy or treatment (e.g., only biomarker-positive patients will respond to the specific treatment, or they will respond to a greater degree than biomarker-negative patients) [

In this case, for example, biomarker-positive patients perform moderately better than do biomarker-negative patients when standard treatment is administered, whereas test treatment may be more effective in the biomarker-positive group (

As a further remark, many biomarkers have both prognostic and predictive features. For example, in breast cancer, patients diagnosed as estrogen receptor (ER)-negative have a higher risk of relapse than ER-positive patients with a similar disease stage; in this case, ER is ‘prognostic.’ On the other hand, the antiestrogen tamoxifen is more effective in preventing breast cancer recurrences in ER-positive patients than in ER-negative patients; in this case, ER is ‘predictive’ of benefit from tamoxifen [

We also note a common error in non-randomized studies. Suppose a test treatment was administered to both biomarker-positive and biomarker-negative patients in a non-randomized study, and the outcome for the biomarker-positive patients was superior to that for the biomarker-negative patients. If the biomarker were prognostic, biomarker-positive patients would have better outcomes regardless of the treatment; if the biomarker were predictive, biomarker-positive patients would be more likely to benefit from the test treatment. Unfortunately, a non-randomized study cannot distinguish between these two explanations.

According to Jenkins

For example, inflammatory markers such as C-reactive protein (CRP) or erythrocyte sedimentation rate (ESR) may be used to select a dose in rheumatoid arthritis treatment or can form part of a clinical composite such as disease activity score (DAS) 28 and be used for the same purpose [

A surrogate endpoint is intended to be a substitute for a clinical endpoint. It is expected to predict clinical benefit (lack of benefit or harm) based on epidemiologic, therapeutic, pathophysiologic, or other scientific evidence (

According to the Biomarker Working Group [

The terms ‘biomarker’ and ‘surrogate endpoint’ are often used interchangeably. However, there is a subtle difference. Surrogate endpoints may not merely be biomarkers and could include imaging measurements (such as CT, MRI, and PET); therefore, it is likely that only a few biomarkers would be considered for use as surrogate endpoints. For the concept of a surrogate endpoint to be useful, one must specify the clinical endpoint, class of intervention, and population in which the substitution of the biomarker for a clinical endpoint is considered reasonable [

A pharmacodynamic biomarker which correlates well with a widely accepted clinical outcome at both individual and group levels could potentially act as a surrogate endpoint and a substitute for a recognized clinical endpoint, such as the manner in which low density lipoprotein (LDL) cholesterol acts as a surrogate for major cardiovascular events in the licensing of statins [

In this section, we review the typical and current study designs that use biomarkers. For simplicity, we assume clinical studies involving a 2-treatment comparison (standard treatment versus test treatment) and one biomarker with two status levels (positive or negative).

In general, a well-controlled, randomized parallel group design can be useful to identify biomarker(s) in clinical studies that include patients with both high and low values or levels of the biomarker(s). Retrospective analyses of data from these RCTs may be used to identify candidate biomarkers. Buyse

For a prognostic biomarker, an association must be demonstrated between the value of the biomarker at baseline, or changes in the biomarker over time, and clinical outcome regardless of treatment. For a putative prognostic biomarker to be validated, its association with the clinical endpoint of interest should be demonstrated repeatedly in independent studies, preferably across a range of clinical situations. Retrospective studies may be sufficient for the initial identification and statistical validation of prognostic biomarkers, although the clinical utility or validity of the biomarkers may need to be confirmed in prospective studies.

For a predictive biomarker, the baseline value or changes in the values of the biomarker over time must be shown to predict the efficacy or toxicity of a test treatment, as assessed by a defined clinical outcome. For a putative predictive biomarker to be validated, its ability to predict the effects of treatment (or lack thereof) should be demonstrated repeatedly in multiple studies. Identification of a predictive biomarker requires statistical data from RCTs that include patients with both high and low levels of the biomarker. Retrospective analyses may be sufficient to identify candidate predictive biomarkers; however, prospective research and analyses may be required to validate the biomarkers. The statistical issues for identification of prognostic and predictive biomarkers are discussed in Section 4.

This design is applied to confirm a treatment effect by using biomarker status as a stratification factor (

An actual application of this design is a Cancer and Leukemia Group B trial (CALGB-30506) to investigate benefit of adjuvant chemotherapy in stage I non-small cell lung cancer (NSCLC) patients [

Biomarker-strategy design addresses the clinical utility of a biomarker and falls into two classifications: one is biomarker-strategy design with a standard control and the other is that with a randomized control [

However, there is scientific concern regarding this design, as shown in

Biomarker-strategy design was applied to the GILT docetaxel trial [

In some settings, sufficiently convincing evidence indicates that the potential treatment benefit is limited to a certain biomarker-defined patient subgroup. Whether or not such evidence exists, there may be a widely held perception that equipoise for the best treatment choice is present only in patients with certain biomarker values. In either case, a biomarker-stratified design is not feasible; instead, the clinical utility of the biomarker can be partially assessed by an enrichment trial design [

This design involves a prescreening step whereby patients are selected for the study based on a prespecified biomarker status [

As a similar design strategy, Mandrekar and Sargent [

In fact, this design was applied to a clinical trial to evaluate a prognostic and possibly predictive biomarker for breast cancer, Oncotype Dx^{®}, in breast cancer patients treated with tamoxifen [

An adaptive signature design is a 2-stage design for randomized clinical trials of targeted agents in settings where an assay or signature that identifies sensitive patients is not available at the outset of the study [. In the first stage, the outcome of the test treatment is compared with that of the standard treatment for all patients with a statistical test at significance level α_{1}. If the test treatment is significantly superior to the standard treatment, the analysis is complete. If there is no significant difference between the two treatments, the procedure proceeds to the second stage. In the second stage, a predictive biomarker is identified by a statistical classification method using a subset of N_{1} patients in the study. Patients positive for the identified biomarker are then selected from the remaining N_{2} (= total sample size N − N_{1}) patients, and the outcome of the test treatment is compared with that of the standard treatment for these biomarker-positive patients with a statistical test at significance level α_{2}.

This design involves a multiplicity problem for statistical testing, since a statistical test is conducted twice. Hence, to control the overall significance level at a nominal value α, the level is split as α = α_{1} + α_{2}. This multiplicity problem is discussed in detail in Section 4. In addition, Freidlin and Simon [ recommended that α_{1} = 0.04 and α_{2} = 0.01, and that the ratio of N_{1} to N_{2} be 1. Furthermore, the procedure to identify the predictive biomarker should be prespecified if this design is to be applied.
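The split of the significance level described above can be sketched as a simple decision rule. This is only an illustrative sketch: the function name and return labels are not from the original design papers, and the p-values are assumed to come from the two stage-wise tests.

```python
# Sketch of the 2-stage decision rule in an adaptive signature design.
# Splitting the significance level (alpha = alpha1 + alpha2) keeps the
# overall Type I error at alpha by a Bonferroni-style argument.

def adaptive_signature_decision(p_overall, p_subgroup, alpha1=0.04, alpha2=0.01):
    """Return which null hypothesis (if any) is rejected.

    p_overall  -- p-value comparing test vs. standard treatment in all patients
    p_subgroup -- p-value in the biomarker-positive subset identified in the
                  second stage (only consulted when the overall test fails)
    """
    if p_overall <= alpha1:
        return "overall effect confirmed"
    if p_subgroup <= alpha2:
        return "effect confirmed in biomarker-positive subgroup"
    return "no effect demonstrated"
```

For example, an overall p-value of 0.20 with a subgroup p-value of 0.004 would lead to a claim only for the biomarker-positive subgroup.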

Freidlin and Simon [

Jiang et al. proposed the biomarker-adaptive threshold design, in which the outcome of the test treatment is first compared with that of the standard treatment for all patients at significance level α_{1}. If the test is statistically significant, the procedure is done, and the outcome difference between the two treatments for all patients is confirmed. If the test is not significant, a second-stage analysis is conducted to identify an optimal cut-off point for the predictive biomarker using the remaining significance level (α_{2} = α − α_{1}). This procedure (hereafter referred to as ‘analysis plan A’) controls the probability of making any false-positive finding at the prespecified level α; for example, α_{1} = 0.04 and α_{2} = 0.01 correspond to an overall significance level α = 0.05.

Otherwise, ‘analysis plan B’ combines the two statistical tests for overall and subgroup patients by incorporating the correlation structure of the two test statistics, and is a generalization of analysis plan A. For example, if analysis plan B demonstrates a difference between the test and standard treatments by the statistical test, the next step is to identify the biomarker threshold above which the test treatment is more effective than the standard treatment.

Additionally, in both analysis plans A and B, a point estimate and a confidence interval for the cut-off point are estimated by using a bootstrap re-sampling approach. However, the cut-off value should not be estimated if analysis plan B does not demonstrate a statistical difference between the two treatments, as the estimation is inexplicable [
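As a sketch of the bootstrap step, the following shows a generic percentile bootstrap confidence interval. The `estimator` argument is a placeholder for whatever procedure estimates the biomarker cut-off from a resampled data set; a toy median is used here, and all names and values are illustrative.

```python
import random
import statistics

def percentile_bootstrap_ci(data, estimator, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a point estimator.

    In the biomarker-adaptive threshold design, `estimator` would return
    the estimated cut-off from each resampled data set; it is generic here.
    """
    rng = random.Random(seed)
    boot_stats = sorted(
        estimator([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = boot_stats[int((alpha / 2) * n_boot)]          # 2.5th percentile
    hi = boot_stats[int((1 - alpha / 2) * n_boot) - 1]  # 97.5th percentile
    return lo, hi

# Illustrative use: 95% CI for the median of a toy data set (true median 50.5)
ci_low, ci_high = percentile_bootstrap_ci(list(range(1, 101)), statistics.median)
```

The percentile method is the simplest bootstrap interval; more refined variants (e.g., BCa) exist but follow the same resampling logic.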

Wang

According to a simulation study investigating the performance characteristic of this design in Wang

Zhou

As an application, this design was used in the phase II Biomarker-Integrated Approaches of Targeted Therapy of Lung Cancer Elimination (BATTLE) trial, which consists of an umbrella screening trial and four parallel phase II targeted therapy trials (patients are adaptively randomized to one of four treatments: erlotinib, sorafenib, vandetanib, or the combination of erlotinib and bexarotene) in advanced NSCLC patients who had received prior chemotherapy.

Bayesian statistical methods are increasingly used in clinical research because the Bayesian approach is ideally suited to adapting to information that accrues during a trial, potentially allowing for smaller, more informative trials and for patients to receive better treatment. Bayesian designs can provide advantages over non-Bayesian designs under certain conditions and have been the topic of a recent FDA guidance publication [

In this section, we introduce the important issues of confounding and multiplicity that arise quite often in biomarker studies.

In clinical studies to evaluate the treatment effect, many sources of variation have an impact on the evaluation of the treatment. If these variations are not identified and properly controlled, then they may be combined with the treatment effect that the studies are intended to demonstrate. In this case, the treatment is said to be confounded with the effects due to these variations [

To provide a better understanding, consider the following example. A test treatment was compared to a standard treatment for a clinical outcome in an RCT. All patients allocated to the two treatment groups in the study were biomarker negative. As a result, the outcome with the test treatment was not superior to the standard treatment, and the test treatment effect for those patients was not confirmed. However, as mentioned earlier, all patients were biomarker negative. Thus, it is not clear whether the insufficient treatment effect was due to the use of the test treatment or the effect of being biomarker negative. In this example, the biomarker is termed a potential confounding factor of the treatment effect.

Furthermore, laboratory batch effects due to assay runs, reagent lots, and shifts in instrument calibration often pose significant risks of confounding. For instance, consider a study in which blood samples from the test treatment group were processed with Reagent I and blood samples from the standard treatment group were processed with Reagent II. If the reagent effect was associated with the outcome, the reagent was a confounding variable, i.e., a variable associated with both the treatment and the outcome. The study therefore cannot separate the effect of the confounding variable from the treatment effect.

Careful consideration at the planning stage of a biomarker study is very important to avoid this issue. Randomization and selection of the study population (through inclusion and exclusion criteria) are useful tools to prevent confounding. At the statistical analysis stage, the confounding effect can also be addressed through subgroup analysis and model-based analysis, which are typical approaches in biomarker studies.

A subgroup analysis is the simplest approach and an important part of the analyses in a comparative clinical study. Separate comparison of the test treatment to the standard treatment would be conducted in the subgroups (e.g., biomarker positive and biomarker negative) for a specific clinical outcome. When multiple subgroup analyses are performed, the results are commonly over-interpreted and can lead to further research that is misguided, or worse, lead to suboptimal patient care due to substantial inflation of the probability of a false positive result [

In cases where,

The traditional statistical approach in which cases are classified by treatment and by a biomarker as a covariate that may affect treatment efficacy is to first test whether there is a significant interaction between the treatment (test versus standard treatment) and the covariate (biomarker negative or biomarker positive). If the interaction test is not significant, then the treatment effect can be evaluated overall, and not within the levels of the biomarker. If the interaction test is significant, the biomarker may be regarded as a predictive biomarker. The treatment effect differs with each biomarker status and is evaluated separately within the levels of the biomarker (

In practice, the interaction test is often performed using statistical models, which are an integral component of any data analysis and describe the relationship between the clinical outcome and one or more explanatory variables (such as treatment group and biomarker status) in the form of mathematical equations. For example, suppose an RCT was conducted to identify a predictive biomarker for progression-free survival. The Cox proportional hazards model [ can be written as λ(t) = λ_{0}(t)exp(β_{1}Treat + β_{2}Bio + β_{3}Treat × Bio), where λ_{0}(t) is a baseline hazard treated as a nuisance parameter, and β_{1} and β_{2} are the effects of treatment and biomarker status, respectively. The interaction effect of treatment and biomarker status is β_{3}. Treat = 0, 1 indicates whether the patients were allocated the standard or test treatment, respectively; Bio = 0, 1 denotes negative or positive biomarker status, respectively; and Treat × Bio indicates the interaction term. To determine whether the biomarker is predictive, the interaction effect β_{3} is statistically tested. In addition, the hazard ratio (HR) is a useful statistic for quantitative interpretation of the treatment and interaction effects. In particular, the HR of the test treatment to the standard treatment in biomarker-negative patients is estimated as exp(β_{1}), and the HR for biomarker-positive patients is estimated as exp(β_{1} + β_{3}), as derived from
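As a minimal numeric illustration of how these hazard ratios follow from the fitted coefficients of a Cox model with treatment, biomarker, and interaction terms; the coefficient values below are hypothetical, not from any real trial.

```python
import math

# Hypothetical coefficients from a fitted Cox model of the form
# lambda(t) = lambda0(t) * exp(b1*Treat + b2*Bio + b3*Treat*Bio)
b1, b3 = -0.10, -0.60   # illustrative values only (b2 cancels in the HRs below)

def hr_test_vs_standard(bio):
    """Hazard ratio of test vs. standard treatment at biomarker status bio (0/1).

    The baseline hazard lambda0(t) and the main biomarker effect b2 cancel
    when the two hazards are divided, so only b1 and b3 remain.
    """
    return math.exp(b1 + b3 * bio)

hr_negative = hr_test_vs_standard(0)  # exp(b1): HR in biomarker-negative patients
hr_positive = hr_test_vs_standard(1)  # exp(b1 + b3): HR in biomarker-positive patients
```

With these illustrative values the test treatment roughly halves the hazard in biomarker-positive patients while barely changing it in biomarker-negative patients, the signature of a predictive biomarker.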

Statistical tests of the interaction effect should be used instead of inspection of subgroup p-values

Simultaneous considerations of a set of statistical inferences are common in clinical studies. Clinical studies frequently incorporate one or more of the following design features: multiple outcome measures, repeated tests of significance as the study progresses (interim analyses) to ensure early detection of effective treatments, subgroup analysis to address particular concerns on the efficacy and safety of the drug in specific patient subgroups (e.g., biomarker positive/negative), and various combinations of these features.

Multiplicity is an important issue for multiple testing in the planning, data analysis, and interpretation of clinical studies. In this section, we first introduce the framework and principle of statistical tests. Next, we present the multiplicity issue of multiple testing.

A ‘statistical hypothesis test’ is a formal scientific method to examine the plausibility of a specific statement regarding the comparison of an outcome between one group and a fixed value or between two or more groups. We adopt the 2-treatment group comparison in this section.

The statement regarding the comparison is typically formulated as a ‘null hypothesis’ stating that there is no difference in outcome between the test and standard treatments. An ‘alternative hypothesis’ is set for the study objective to be proved, such as that the mean outcome differs between the two treatments. The test procedure computes a p-value, the probability of obtaining a result at least as extreme as the one observed under the assumption that the null hypothesis is true, and the null hypothesis is rejected when the p-value falls below a prespecified significance level.

When a statistical test is performed, one of four outcomes will occur, depending on whether the null hypothesis is true or false and whether the statistical test rejects or does not reject the null hypothesis: (1) the procedure rejects a true null hypothesis (a false-positive type I error), (2) the procedure does not reject a true null hypothesis (a true negative), (3) the procedure rejects a false null hypothesis (a true positive), or (4) the procedure does not reject a false null hypothesis (a false-negative type II error). The true state and the decision to accept or reject a null hypothesis are summarized in
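The frequency of outcome (1) can be checked by simulation. The sketch below repeatedly performs a two-sample z-test in a setting where the null hypothesis is true by construction (both groups are drawn from the same distribution), so every rejection is a Type I error; the observed rejection rate should be close to the nominal α = 0.05. All parameter values are illustrative.

```python
import random

def simulate_type1_rate(n=50, n_trials=2000, seed=1):
    """Estimate the Type I error rate of a two-sided two-sample z-test.

    Both groups are sampled from the SAME normal distribution, so the null
    hypothesis of no difference is true and every rejection is a false positive.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
        var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
        z = (mean_a - mean_b) / ((var_a / n + var_b / n) ** 0.5)
        if abs(z) > 1.96:          # two-sided test at alpha = 0.05
            rejections += 1
    return rejections / n_trials
```

Running the simulation yields a rejection rate near 0.05, matching the nominal significance level.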

Multiplicity is an important issue because the Type I error rate inflates when multiple simultaneous hypotheses are each tested at significance level α. For example, if 10 independent tests are each conducted at α = 0.05, the probability of at least one false-positive result is 1 − (1 − 0.05)^{10} ≈ 0.40, far above the nominal 0.05. This phenomenon is termed Type I error inflation, or the multiplicity problem [
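For independent tests, the inflation follows directly from the complement rule, as the short helper below shows; the function name is illustrative.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive among m INDEPENDENT tests,
    each conducted at significance level alpha: 1 - (1 - alpha)**m.

    With a single test this reduces to alpha; it grows quickly with m.
    """
    return 1 - (1 - alpha) ** m
```

For instance, `family_wise_error_rate(0.05, 10)` is about 0.40, so a study running ten unadjusted tests is more likely than not to be protected only by luck.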

In practical settings, multiplicity would arise in the following situations: testing for multiple endpoints; exploration of multiple prognostic biomarkers; comparison between more than two treatment groups; adaptive design (see Section 3) and interim analyses; basic subgroup analyses; and so on.

The basic ideas of multiple testing have been outlined and the problem of how to control the Type I error has been discussed [

One of the simplest approaches to account for multiplicity is to adjust the significance level to account for the number of tests [
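A minimal sketch of two such adjustments: the Bonferroni correction, which tests each of m hypotheses at level α/m, and Holm's step-down refinement. Holm's procedure is a standard method that is uniformly more powerful than Bonferroni while still controlling the family-wise error rate; it is not named explicitly in the text above, so it is included here as a common example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (m = number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the ordered p-values to
    successively less strict thresholds alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break   # once one ordered test fails, all larger p-values fail too
    return reject
```

With p-values (0.012, 0.015, 0.6, 0.7) at α = 0.05, Bonferroni rejects only the first hypothesis (threshold 0.0125), whereas Holm also rejects the second.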

For pharmacogenomic studies and genome-wide association studies (GWAS) that focus on finding sets of predictive genes, an alternative approach to multiple testing considers the false discovery rate (FDR), which is the probability that a given gene identified as differentially expressed is a false positive. The FDR is typically computed after a list of differentially expressed genes has been generated [
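A minimal sketch of the Benjamini-Hochberg step-up procedure, the most widely used FDR method, converting raw p-values into FDR-adjusted ones; the implementation below is a generic illustration, not tied to any particular GWAS pipeline.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR control).

    A gene whose adjusted p-value is <= q can be declared differentially
    expressed with the false discovery rate controlled at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):         # walk from the largest p downward
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min              # enforce monotonicity
    return adjusted
```

For example, raw p-values (0.01, 0.02, 0.03, 0.5) adjust to (0.04, 0.04, 0.04, 0.5): the first three genes survive an FDR threshold of q = 0.05 even though only the first would survive a Bonferroni correction.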

We cannot introduce other methods for multiplicity adjustment in detail because a large number of methods have been proposed. The interested reader can refer to the papers published by Bauer [

Statistical tests are among the most popular tools not only in clinical trials but also in scientific research in general. However, many researchers have a confused interpretation of p-values

In general, statistical significance is meaningful in confirmatory trials, but not necessary in exploratory trials. In contrast to confirmatory trials, the objectives of exploratory trials may not always lead to simple statistical tests of pre-defined hypotheses [

The importance of biomarkers in medical diagnosis, prevention, and therapy of diseases is increasing. In fact, studies have identified an impressive number of clinical biomarkers. This article provides an overview on the study designs for biomarker research. In addition, we introduce confounding and multiplicity of statistical tests, which are important statistical issues in biomarker studies. From the viewpoint of evidence-based medicine, appropriate study design and statistical analysis are absolutely necessary for conducting valid biomarker studies.

We are grateful to Nan M. Laird at Department of Biostatistics, Harvard School of Public Health, and Isao Yoshimura at Tokyo University of Science for their valuable advice and suggestions. This study was supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science.


Biomarker types. (

Biomarker by treatment interaction design.

Biomarker-strategy design. (

(

General procedure for adaptive signature and biomarker-adaptive threshold designs. ‘S’ and ‘T’ denote the standard and test treatments, respectively.

General procedure for adaptive accrual design. ‘S’ and ‘T’ denote the standard and test treatments, respectively.

Framework for Bayesian adaptive design. Patients are assigned to biomarker groups 1–4 in sequential order according to the characteristics of the three biomarker categories. ‘+’ and ‘−’ correspond to the respective positive and negative biomarker statuses. Patients are adaptively randomized to one of the three treatments according to their biomarker groups. The dashed arrows indicate the putative effective treatment for each of the biomarker groups.

Examples of biomarker use.

| Biomarker | Use | Biomarker type |
| --- | --- | --- |
| Human epidermal growth factor receptor 2 (HER2), epidermal growth factor receptor (EGFR), V-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog (KRAS) mutations [ | Directing treatment | Predictive biomarker |
| BCR-ABL ( | Directing treatment of imatinib | Predictive biomarker |
| Cytochrome P450 enzymes (CYP2D6, CYP2C9, CYP2C19 polymorphisms) [ | Known to affect drug metabolism | Predictive biomarker |
| Estrogen receptor (ER) and progesterone receptor (PR) [ | Selection for hormonal therapy | Predictive biomarker |
| Promyelocytic leukemia-retinoic acid receptor α (PML/RARα) translocation [ | Prescribing arsenic trioxide for acute promyelocytic leukaemia | Predictive biomarker |
| Uridine diphosphate glucuronosyltransferase (UGT1A1), thiopurine methyltransferase (TPMT), major histocompatibility complex, class I, B (HLA-B*5701), dihydropyrimidine dehydrogenase (DPYD) polymorphisms [ | Predisposition to certain toxicities | Predictive biomarker |
| Amyloid β peptide (Aβ) 1-42 [ | Diagnosis of prodromal Alzheimer's disease | Prognostic biomarker |
| Gene signature chips (e.g., Oncotype, MammaPrint) [ | Prognosis prediction in oncology | Prognostic biomarker (also predictive in certain cases) |
| B-type natriuretic peptide (BNP) [ | Screening and diagnosis in heart failure | Prognostic biomarker |
| C-reactive protein (CRP), interleukin-6 (IL-6), tumor necrosis factor (TNF-α) in blood samples [ | Proof of principle in inflammatory diseases | Pharmacodynamic biomarker |
| FDG-PET (SUVmax) functional imaging [ | Proof of concept (e.g., in tumour metabolism) | Pharmacodynamic biomarker |
| Low density lipoprotein (LDL) cholesterol [ | Confirmatory trials in coronary heart disease | Surrogate endpoint |
| Hemoglobin A1c (HbA1c) [ | Represents glycaemic control in diabetics | Surrogate endpoint |
| Prostate-specific antigen (PSA) [ | Screening and monitoring in prostate cancer | Surrogate endpoint |
| Carcinoembryonic antigen (CEA) and cancer antigen (e.g., CA 19-9) [ | Monitoring in cancers | Surrogate endpoint |

True state and hypothesis test.

| Null hypothesis is true | Rejected | Not rejected |
| --- | --- | --- |
| Yes | (1) False positive (Type I error) | (2) True negative |
| No | (3) True positive (Power) | (4) False negative (Type II error) |