Incorporating Covariates into Measures of Surrogate Paradox Risk

Clinical trials often collect intermediate or surrogate endpoints other than their true endpoint of interest. It is important that the treatment effect on the surrogate endpoint accurately predicts the treatment effect on the true endpoint. There are settings in which the proposed surrogate endpoint is positively correlated with the true endpoint, but the treatment has opposite effects on the surrogate and true endpoints, a phenomenon labeled “surrogate paradox”. Covariate information may be useful in predicting an individual’s risk of surrogate paradox. In this work, we propose methods for incorporating covariates into measures of assessing the risk of surrogate paradox using the meta-analytic causal association framework. The measures calculate the probability that a treatment will have opposite effects on the surrogate and true endpoints and determine the size of a positive treatment effect on the surrogate endpoint that would reduce the risk of a negative treatment effect on the true endpoint as a function of covariates, allowing the effects of covariates on the surrogate and true endpoint to vary across trials.


Introduction
Clinical trials often collect intermediate, or surrogate, endpoints other than their true endpoint of interest. Surrogate endpoints are chosen because they occur more frequently, are easier to measure, or occur more proximally to the treatment time. The use of surrogate endpoints can result in a reduction in the required sample size for a trial, leading to shorter trial duration, as well as reduced costs of conducting clinical trials. A good surrogate endpoint is one that accurately reflects the effect of a given treatment on the true endpoint of interest while incurring lower cost or taking less time to measure. Some examples of surrogate endpoints include tumor progression as a surrogate endpoint for cancer-specific mortality, or CD4 counts in blood as a surrogate endpoint for AIDS mortality.
There exist several approaches for evaluating the strength of proposed surrogate endpoints. The first formalized approach for surrogate endpoint validation was presented by Prentice in 1989, who suggested that, among other criteria, a good surrogate should be highly correlated with the true endpoint [1]. He provided a method to test the surrogate by including it in a regression model of the true endpoint with the treatment and checking if it would eliminate the coefficient of the treatment association with the true endpoint of interest [1]. Later work pointed out that this approach does not allow for causal claims about surrogate efficacy since it ignores the potential of confounders between the surrogate endpoint and true endpoint. Confounding is possible despite randomization, since the surrogate endpoint is measured after treatment [2].
Since then, there have been several approaches proposed to evaluate surrogates in a causal inference framework when data are available on a single trial in which both outcomes are measured. These methods can be categorized into two major types: "causal effects" and "causal association" [2][3][4]. The causal effects paradigm uses the potential outcomes framework, which considers all the outcomes that would be potentially observed if the treatment and placebo were both applied to each subject (a combination of the observed outcomes and the counterfactual outcomes that would have occurred had a subject been assigned to the treatment opposite to the one they actually received) [5]. Once the potential outcomes are defined, we consider both treatment and surrogate endpoints to be separately manipulable and create potential outcomes based on all possible combinations of potential outcomes [2]. This allows the estimation of the total effect of treatment as the sum of direct effects of the treatment on the true endpoint and indirect effects of the treatment that go through the surrogate endpoint. An ideal surrogate would capture the majority of the indirect effect of the treatment on the true outcome of interest, leaving little direct effect of the treatment. In the causal association framework, only the treatment and not the surrogate is considered manipulable. To account for the fact that the surrogate endpoint is measured after treatment, the causal association framework conditions on the joint counterfactual values of the surrogate endpoint under both the treatment and the control. Both the causal effects and causal association approaches use models that are not entirely identifiable, since we never completely observe the counterfactual distribution. There is an alternate causal association approach, presented by Buyse et al. in 2000, in the meta-analytic setting, where data are available on multiple trials of the same treatment and surrogate combination [6]. This approach leverages data from multiple randomized trials to assess the effectiveness of a surrogate endpoint, allowing all parameters to be identified from the observed data [6]. This is the setting we consider in this paper.
The goal of measuring the validity of a surrogate is to make sure that a surrogate endpoint accurately captures the effect of the treatment on the true endpoint of interest. There have been several examples of surrogate endpoints that are positively associated with both the treatment and the true endpoint of interest but have not accurately predicted the treatment effect on the true endpoint. One notable example is in the development of a drug to fight ventricular arrhythmias, which were considered to be a surrogate for cardiac-related deaths. The drug was found to lower ventricular arrhythmias, and ventricular arrhythmias were positively associated with cardiac deaths, leading to the approval of the drug in clinical trials. Subsequent follow-up trials found that the drug was associated with a significantly increased risk of cardiac death [7]. The phenomenon is labeled the "surrogate paradox" [8]. The surrogate paradox occurs when the treatment has beneficial effects on the surrogate outcome, and the surrogate outcome is positively associated with the true outcome, yet the overall effect of the treatment on the true outcome is negative, leading to incorrect conclusions that can be potentially dangerous to public health. It has been shown that testing the efficacy of a surrogate endpoint under either the causal association or causal effects framework is not enough to fully preclude the risk of observing the surrogate paradox [8]. There are several situations in which the surrogate paradox may be observed [9]. The first is when a direct effect between the treatment and the true outcome runs in the opposite direction of the indirect effect of the treatment through the surrogate. The second is when there is uncaptured confounding between the surrogate and true endpoints. The third is when the effects of the treatment on the surrogate and true endpoints differ at the individual level, meaning that the positive effect of the treatment is experienced on the surrogate endpoint for some patients and on the true endpoint for a different set of patients. In his paper, VanderWeele discusses means of assessing the risk of surrogate paradox and concludes that the meta-analytic approach [6] is the most effective, since it studies the efficacy of a surrogate measure over multiple trials. Elliott et al. proposed measures to assess the risk of surrogate paradox in the meta-analytic causal association framework [10].
Treatments may have different effects on different patient subpopulations, and some subpopulations in a study may be at different risk of experiencing the surrogate paradox. To consider this possibility, in this paper, we propose extensions to the measures of surrogate paradox risk proposed by Elliott et al. [10] that incorporate covariate information. If covariate information is ignored when measuring the risk of surrogate paradox, a new trial in a population with a different covariate distribution than past studies could expose those patients to a higher risk of surrogate paradox than expected. Incorporating covariate information may allow us to identify groups that are at particular risk of experiencing the surrogate paradox and help design future trials that make use of that surrogate. In the following sections, we describe the Buyse et al. meta-analytic causal association setting [11], the proposed surrogate paradox risk measures from Elliott et al. [10], and then propose methods for incorporating covariate information.

Background
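For surrogate marker $S_{ij}$ and outcome measure $T_{ij}$, where $i = 1, \ldots, N$ indexes the trials and $j = 1, \ldots, n_i$ indexes the subjects in the $i$th trial, Buyse et al. [6] considered the following distributions:

$$S_{ij} = \alpha_S + a_{Si} + (\beta_S + b_{Si}) Z_{ij} + \varepsilon_{Sij} \qquad (1)$$

$$T_{ij} = \alpha_T + a_{Ti} + (\beta_T + b_{Ti}) Z_{ij} + \varepsilon_{Tij} \qquad (2)$$

where $Z_{ij} \in \{0, 1\}$ is an indicator of treatment assignment, the residuals satisfy $(\varepsilon_{Sij}, \varepsilon_{Tij})^T \sim N_2(0, \sigma)$ with $\sigma = \begin{pmatrix} \sigma_{ss} & \sigma_{st} \\ \sigma_{st} & \sigma_{tt} \end{pmatrix}$, and the random effects satisfy $(a_{Si}, a_{Ti}, b_{Si}, b_{Ti})^T \sim N_4(0, D)$. From this distribution, we can calculate the causal effect of a treatment $Z$ on the surrogate marker in the $i$th trial as $\Delta_{Si} = E(S_{ij} \mid Z_{ij} = 1) - E(S_{ij} \mid Z_{ij} = 0) = \beta_S + b_{Si}$. Similarly, the causal effect of a treatment $Z$ on the outcome measure in the $i$th trial is $\Delta_{Ti} = \beta_T + b_{Ti}$. The treatment effects therefore have the joint distribution

$$\begin{pmatrix} \Delta_{Si} \\ \Delta_{Ti} \end{pmatrix} \sim N_2\left( \begin{pmatrix} \beta_S \\ \beta_T \end{pmatrix}, \begin{pmatrix} d_{aa} & d_{ab} \\ d_{ab} & d_{bb} \end{pmatrix} \right),$$

where $d_{aa}$, $d_{ab}$, and $d_{bb}$ are the elements of $D$ corresponding to $(b_{Si}, b_{Ti})$. The trial-level measure of surrogacy, $R^2_{\text{trial}}$, is the proportion of variance explained by the trial-level random effects associated with the surrogate [6].

The surrogate paradox risk measures of Elliott et al. [10] depend on both the level of correlation between $\Delta_S$ and $\Delta_T$ and the size of the treatment effect on both outcomes. For example, in Scenario 1, although there is a strong correlation between the treatment effect on the surrogate and true outcomes, there is still some risk of surrogate paradox because of the relatively small treatment effect on the true outcome. In Scenario 2, there is some risk that the treatment effect on the surrogate outcome is negative, while the true treatment effect is positive; however, the increased true treatment effect size means that there is a lower risk of experiencing the more dangerous surrogate paradox (i.e., the treatment effect on the surrogate is positive while the true treatment effect is negative). In Scenario 3, despite the very strong correlation between the treatment effects on the two outcomes, there is some risk of surrogate paradox because of the low treatment effect sizes. Finally, in Scenario 4, there is low correlation between the two outcomes, but the risk of surrogate paradox is precluded because of the large treatment effect size on both outcomes.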
In the remainder of this section, we describe Elliott et al.'s measures of surrogate paradox risk using this joint distribution [10].

Direction of Treatment Effects in a New Trial
The first surrogate paradox measure considers the probability that the $(N+1)$th trial will yield treatment effects on the marker and the outcome in the same direction. This probability is given by

$$\psi_{SP13} = P(\Delta_{S,N+1} > 0, \Delta_{T,N+1} > 0) + P(\Delta_{S,N+1} < 0, \Delta_{T,N+1} < 0).$$

A second measure of surrogate paradox risk considers the particularly dangerous situation where the surrogate marker suggests a beneficial treatment effect but the treatment effect on the outcome measure is harmful. The probability of avoiding this situation is given by

$$\psi_{SP123} = 1 - P(\Delta_{S,N+1} > 0, \Delta_{T,N+1} < 0).$$

This measure estimates the probability that the $(N+1)$th trial lies outside of the fourth quadrant of the Cartesian plane (see Figure 1). It is the probability that a future trial will not result in a setting where the surrogate marker suggests the treatment will be helpful when, in fact, it is harmful.

The first two measures can be considered when drawing inferences about a future trial that has not yet collected data, based on $N$ historic trials that have already completed data collection. In practice, a trial may have already begun data collection, and investigators may be interested in the risk of observing the surrogate paradox in their ongoing trial, conditional on the data from historic trials. In particular, they may have collected data on the surrogate outcome and no or very limited data on the true outcome of interest. We consider the situation where we have collected partial data for the $N$th trial and want to estimate the measures of surrogate paradox risk in the ongoing trial conditioned on the previously collected data from the first $N - 1$ trials.
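Let $\Upsilon_{ij} = (S_{ij}, T_{ij})^T$ constitute the surrogate marker and outcome for each subject, let

$$M_{ij} = \begin{pmatrix} 1 & 0 & Z_{ij} & 0 \\ 0 & 1 & 0 & Z_{ij} \end{pmatrix}$$

be the fixed effect matrix associated with the parameters $\mu = (\alpha_S, \alpha_T, \beta_S, \beta_T)^T$, and let $W_{ij} = M_{ij}$ be the random effect matrix associated with $\gamma_i = (a_{Si}, a_{Ti}, b_{Si}, b_{Ti})^T$. Let $\Upsilon_i$, $M_i$, and $W_i$ represent the stacked elements of $\Upsilon_{ij}$, $M_{ij}$, and $W_{ij}$. Then, $\Upsilon_N$, $M_N$, and $W_N$ represent the stacked individual-level data for each subject ($j = 1, \ldots, n_N$) in the $N$th trial (where $n_N$ is the total sample of the $N$th trial so far), and $\gamma_N = (a_{SN}, a_{TN}, b_{SN}, b_{TN})^T$ represents the trial-level random effects.

The conditional distribution of $\gamma_N \mid \Upsilon_N$ can be found by considering the joint distribution of $\Upsilon_N \sim N_{2n_N}(M_N \mu, V_N)$ and $\gamma_N \sim N_4(0, D)$, with $\text{cov}(\Upsilon_N, \gamma_N) = W_N D$ and $V_N = W_N D W_N^T + R$, where $R$ is a $2n_N \times 2n_N$ block diagonal matrix with blocks $\sigma$ representing the individual-level residual variance. Then, we have

$$\gamma_N \mid \Upsilon_N \sim N_4(\mu_{\gamma \mid \Upsilon}, \Sigma_{\gamma \mid \Upsilon}),$$

where

$$\mu_{\gamma \mid \Upsilon} = D W_N^T V_N^{-1} (\Upsilon_N - M_N \mu), \qquad \Sigma_{\gamma \mid \Upsilon} = D - D W_N^T V_N^{-1} W_N D.$$

From here, with $\Delta_{SN} = \beta_S + b_{SN}$ and $\Delta_{TN} = \beta_T + b_{TN}$, the measure of surrogate paradox risk is given by

$$\psi^N_{SP13} = P(\Delta_{SN} > 0, \Delta_{TN} > 0 \mid \Upsilon_N) + P(\Delta_{SN} < 0, \Delta_{TN} < 0 \mid \Upsilon_N).$$

This measure allows measurement of surrogate paradox risk after some data have been collected in the trial. This could be useful after the surrogate outcome has been collected on some of the patients, but there are not yet many (or any) measurements of the true endpoint that might occur later in the study. When $T_{Nj}$ is missing, $\Upsilon_{Nj}$ can be replaced with $S_{Nj}$ in the above calculations, while leaving the placeholder $M_{Nj}$ rows for the missing $T_{Nj}$.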
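The conditioning step above is the standard conditional distribution of a multivariate normal, with mean $D W_N^T V_N^{-1}(\Upsilon_N - M_N\mu)$ and covariance $D - D W_N^T V_N^{-1} W_N D$. The following is a minimal sketch of that calculation in plain Python, assuming the no-covariate design in which the fixed and random effect matrices coincide and assuming $\mu$, $D$, and $\sigma$ are known (in practice they would be replaced by estimates or posterior draws). The function names and the small linear algebra helpers are illustrative and do not come from the original work.

```python
def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def conditional_random_effects(y, z, mu, D, sigma):
    """Mean and covariance of gamma_N given the data of the ongoing trial.

    y: list of (S, T) pairs; z: list of 0/1 treatment indicators;
    mu = (alpha_S, alpha_T, beta_S, beta_T); D: 4x4; sigma: 2x2.
    """
    # Two design rows per subject; here the fixed and random designs coincide.
    W = []
    for zj in z:
        W.append([1.0, 0.0, float(zj), 0.0])  # row for S_Nj
        W.append([0.0, 1.0, 0.0, float(zj)])  # row for T_Nj
    n2 = len(W)
    # R is block diagonal with one copy of sigma per subject.
    R = [[0.0] * n2 for _ in range(n2)]
    for j in range(len(z)):
        for r in range(2):
            for c in range(2):
                R[2 * j + r][2 * j + c] = sigma[r][c]
    WD = matmul(W, D)  # cov(Upsilon_N, gamma_N)
    WDWt = matmul(WD, transpose(W))
    V = [[v + r for v, r in zip(vrow, rrow)] for vrow, rrow in zip(WDWt, R)]
    flat = [v for pair in y for v in pair]
    resid = [[flat[i] - sum(w * m for w, m in zip(W[i], mu))] for i in range(n2)]
    DWt = transpose(WD)  # equals D W^T because D is symmetric
    mean = [row[0] for row in matmul(DWt, solve(V, resid))]
    reduction = matmul(DWt, solve(V, WD))
    cov = [[D[i][k] - reduction[i][k] for k in range(4)] for i in range(4)]
    return mean, cov
```

The last two elements of the returned mean, added to $(\beta_S, \beta_T)$, and the lower-right $2 \times 2$ block of the returned covariance describe the conditional distribution of $(\Delta_{SN}, \Delta_{TN})$, from which the quadrant probabilities can be computed.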

Preclude a Harmful Treatment Effect on the Outcome
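In the fourth surrogate paradox measure, Elliott et al. consider the minimum observed beneficial treatment effect for a marker that can reduce the probability that the true treatment effect for the outcome is harmful. Let $O_{Si} = \bar{S}_{1i} - \bar{S}_{0i}$ represent the difference between the observed surrogate marker means under treatment and control. Note that for some value of $s$, $O_{Si}$ will coincide with the true $\Delta_{Si}$. Then, the joint distribution of the true treatment effect on the outcome and the observed treatment effect on the surrogate marker is given by

$$\begin{pmatrix} \Delta_{Ti} \\ O_{Si} \end{pmatrix} \sim N_2\left( \begin{pmatrix} \beta_T \\ \beta_S \end{pmatrix}, \begin{pmatrix} d_{bb} & d_{ab} \\ d_{ab} & \tilde{d}_{aa} \end{pmatrix} \right),$$

where $\tilde{d}_{aa} = d_{aa} + \sigma_{ss}(1/n_{1i} + 1/n_{0i})$, and $n_{1i}$ and $n_{0i}$ are the numbers of treated and control subjects in the $i$th trial. From here, they find that the distribution of the true treatment effect on the outcome $\Delta_{Ti}$ conditional on a given observed treatment effect $O_{Si} = s$ is

$$\Delta_{Ti} \mid O_{Si} = s \sim N(\mu_{T \mid s}, \sigma^2_{T \mid s}), \qquad (3)$$

where

$$\mu_{T \mid s} = \beta_T + \frac{d_{ab}}{\tilde{d}_{aa}}(s - \beta_S) \quad \text{and} \quad \sigma^2_{T \mid s} = d_{bb} - \frac{d_{ab}^2}{\tilde{d}_{aa}}.$$

The authors propose two different ways to move forward from here. If data are collected to determine $s$, we can calculate the probability that the true effect on the outcome for the trial will be non-negative by replacing the parameters in (3) by their estimates from the data. Alternatively, we can determine the value of $s$ that will ensure that the probability that $\Delta_{Ti}$ is negative is less than or equal to a preset level $\alpha$, that is, $\Phi(-\mu_{T \mid s} / \sigma_{T \mid s}) \le \alpha$. When $d_{ab} > 0$, this condition is equivalent to

$$s \ge \beta_S + \frac{\tilde{d}_{aa}}{d_{ab}}\left( z_{1-\alpha}\, \sigma_{T \mid s} - \beta_T \right),$$

where $z_{1-\alpha}$ is the $(1 - \alpha)$ quantile of the standard normal distribution.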
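The threshold calculation can be sketched numerically. The block below is a minimal illustration, assuming that the observed surrogate effect has variance $d_{aa} + \sigma_{ss}(1/n_1 + 1/n_0)$, that the trial-level covariance $d_{ab}$ is positive, and that all parameters are known; the function names and the numeric values are hypothetical and are not taken from the original work.

```python
from math import sqrt
from statistics import NormalDist


def observed_surrogate_variance(d_aa, sigma_ss, n1, n0):
    """Variance of the observed difference in surrogate means for one trial."""
    return d_aa + sigma_ss * (1.0 / n1 + 1.0 / n0)


def prob_harm_given_observed(s, beta_s, beta_t, d_aa, d_ab, d_bb, sigma_ss, n1, n0):
    """P(Delta_T < 0 | O_S = s) from the bivariate normal conditional."""
    d_obs = observed_surrogate_variance(d_aa, sigma_ss, n1, n0)
    cond_mean = beta_t + d_ab / d_obs * (s - beta_s)
    cond_sd = sqrt(d_bb - d_ab ** 2 / d_obs)
    return NormalDist(cond_mean, cond_sd).cdf(0.0)


def minimum_surrogate_effect(alpha, beta_s, beta_t, d_aa, d_ab, d_bb, sigma_ss, n1, n0):
    """Smallest observed surrogate effect s with P(Delta_T < 0 | O_S = s) <= alpha.

    Assumes d_ab > 0, so that larger observed surrogate effects lower the risk.
    """
    if d_ab <= 0:
        raise ValueError("this sketch assumes a positive trial-level covariance")
    d_obs = observed_surrogate_variance(d_aa, sigma_ss, n1, n0)
    cond_sd = sqrt(d_bb - d_ab ** 2 / d_obs)
    z = NormalDist().inv_cdf(1.0 - alpha)
    return beta_s + d_obs / d_ab * (z * cond_sd - beta_t)


# Hypothetical parameter values, for illustration only.
params = dict(beta_s=2.0, beta_t=1.0, d_aa=1.0, d_ab=0.5, d_bb=1.0,
              sigma_ss=1.0, n1=50, n0=50)
s_min = minimum_surrogate_effect(0.05, **params)
```

Because the conditional mean is linear in $s$, the threshold has a closed form, and no root finding is needed in the normal model.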

Incorporating Covariates
Treatments may have heterogeneous effects on surrogate and true endpoints in different patient populations, exposing some subpopulations to increased risk of surrogate paradox. Therefore, it is important that measures of surrogate paradox risk allow consideration of patient-level factors. To address this concern, a natural extension to Elliott et al. [10] is to incorporate covariate information by conditioning on a set of covariates and making the measures above (Sections 2.1-2.4) functions of covariates $X$. We can consider a situation where the surrogate and outcome measures depend on a set of covariates in addition to the treatment and extend (1) and (2) by adding, for each covariate $X_k$, a main effect and a treatment-by-covariate interaction, each with a fixed component and a trial-level random component, where $k = 1, \ldots, p$ indexes the covariates.
This model may be difficult to fit once $p$ gets large, since each additional covariate increases the number of random effects required. We consider two simplified scenarios that can be extended to a larger number of covariates if enough data are available:
• Scenario 1: The effects of covariates on surrogate and outcome are constant across trials (i.e., no random effects related to the covariates X).
• Scenario 2: The effects of covariates on surrogate and outcome are not constant across trials. In order to not overly complicate notation, we focus on the setting with only one scalar or binary covariate X (i.e., p = 1, and all random effects related to the covariate X are included), but the approach can easily be extended to higher dimensions of covariates.
Although it is theoretically possible to consider a larger number of covariates, it is often not practical or computationally feasible if the effect of the covariates is expected to differ by study, since that would rapidly increase the size of the random effect variance matrix.
In the following two sections, we recreate the surrogate paradox measures from Elliott et al. under each of the above scenarios.

Scenario 1
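Under scenario 1, we assume the effects of covariates on the surrogate and outcome measures are constant across trials:

$$S_{ij} = \alpha_S + a_{Si} + (\beta_S + b_{Si}) Z_{ij} + \sum_{k=1}^{p} \gamma_{Sk} X_{ijk} + \sum_{k=1}^{p} \delta_{Sk} X_{ijk} Z_{ij} + \varepsilon_{Sij}$$

$$T_{ij} = \alpha_T + a_{Ti} + (\beta_T + b_{Ti}) Z_{ij} + \sum_{k=1}^{p} \gamma_{Tk} X_{ijk} + \sum_{k=1}^{p} \delta_{Tk} X_{ijk} Z_{ij} + \varepsilon_{Tij}$$

Then, we can choose a level $x_k$ for each $X_k$ in $X$ and calculate the causal effect of a treatment $Z$ on the surrogate marker among subjects with $X_k = x_k$ in the $i$th trial as $\Delta_{Si}(x) = \beta_S + b_{Si} + \sum_{k} \delta_{Sk} x_k$. Similarly, the causal effect of a treatment $Z$ on the outcome measure among subjects with $X_k = x_k$ in the $i$th trial is $\Delta_{Ti}(x) = \beta_T + b_{Ti} + \sum_{k} \delta_{Tk} x_k$. Thus, $\Delta_{Si}(x)$ and $\Delta_{Ti}(x)$ have the joint distribution:

$$\begin{pmatrix} \Delta_{Si}(x) \\ \Delta_{Ti}(x) \end{pmatrix} \sim N_2\left( \begin{pmatrix} \beta_S + \sum_k \delta_{Sk} x_k \\ \beta_T + \sum_k \delta_{Tk} x_k \end{pmatrix}, \begin{pmatrix} d_{aa} & d_{ab} \\ d_{ab} & d_{bb} \end{pmatrix} \right)$$

This distribution is a mean shift of the non-covariate-adjusted distribution; the variance remains the same as in the original, no-subgroup distribution. To visualize this, refer to Scenario 1 in Figure 2. The risk of surrogate paradox may be different in the two groups and can be identified by calculating the probabilities of falling into each quadrant at the different covariate levels.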
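Scenario 1: $\psi_{SP13}(x)$
Using the new joint distribution, the probability that the $(N+1)$th trial will yield treatment effects on the marker and outcome in the same direction is given by

$$\psi_{SP13}(x) = \Phi_2(0; \Theta_x, \Psi_\Delta) + \Phi_2(0; -\Theta_x, \Psi_\Delta),$$

where $\Theta_x$ and $\Psi_\Delta$ are the mean and covariance matrix of the joint distribution of $(\Delta_{S,N+1}(x), \Delta_{T,N+1}(x))$, and $\Phi_k(u; \Theta, \Psi)$ is the cumulative distribution function of a $k$-variate normal distribution with mean $\Theta$ and variance $\Psi$, evaluated at $u$. The first term is the probability that both effects are negative, and the second is the probability that both are positive.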

Scenario 1: $\psi_{SP123}(x)$
Under the new joint distribution, the probability of avoiding the situation in which the treatment effect on the outcome is harmful while the treatment effect on the marker is beneficial is given by

$$\psi_{SP123}(x) = 1 - P(\Delta_{S,N+1}(x) > 0, \Delta_{T,N+1}(x) < 0).$$

This measure estimates the probability that a future trial will not result in a setting where the surrogate marker suggests the treatment will be helpful when it is, in fact, harmful.
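As a rough numerical illustration of the two covariate-specific measures, the sketch below evaluates the quadrant probabilities for a single binary or scalar covariate, assuming the covariate shifts only the mean of the joint distribution through interaction coefficients `delta_s` and `delta_t` and that the parameters are known. The bivariate normal probability is computed by one-dimensional Simpson integration using only the standard library; the function names are illustrative and the approach is one of several possible ways to evaluate the bivariate normal CDF.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def prob_both_negative(mean_s, mean_t, var_s, var_t, cov_st, steps=4000):
    """P(Delta_S < 0, Delta_T < 0) for a bivariate normal (requires |rho| < 1)."""
    sd_s, sd_t = sqrt(var_s), sqrt(var_t)
    rho = cov_st / (sd_s * sd_t)
    upper = -mean_s / sd_s           # Delta_S < 0 on the standardized scale
    lower = min(upper, 0.0) - 10.0   # effectively minus infinity
    h = (upper - lower) / steps      # steps must be even for Simpson's rule

    def integrand(z):
        # density of standardized Delta_S times P(Delta_T < 0 | Delta_S)
        cond_mean = mean_t + rho * sd_t * z
        cond_sd = sd_t * sqrt(1.0 - rho ** 2)
        return STD.pdf(z) * STD.cdf(-cond_mean / cond_sd)

    total = integrand(lower) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


def psi_measures(x, beta_s, beta_t, delta_s, delta_t, d_aa, d_ab, d_bb):
    """Return (psi_SP13(x), psi_SP123(x)) under a mean shift at covariate level x."""
    mean_s = beta_s + delta_s * x
    mean_t = beta_t + delta_t * x
    both_neg = prob_both_negative(mean_s, mean_t, d_aa, d_bb, d_ab)
    # P(both > 0) equals P(both < 0) after flipping the sign of the means.
    both_pos = prob_both_negative(-mean_s, -mean_t, d_aa, d_bb, d_ab)
    p_t_neg = NormalDist(mean_t, sqrt(d_bb)).cdf(0.0)
    psi_13 = both_neg + both_pos
    # Fourth quadrant: surrogate effect positive, outcome effect negative.
    psi_123 = 1.0 - (p_t_neg - both_neg)
    return psi_13, psi_123
```

Calling `psi_measures` at two covariate levels with the same variance components shows how a shift in the means alone changes the probability of landing in each quadrant.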

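Scenario 1: $\psi^N_{SP13}(x)$
For this section, we consider the simplest case of one covariate for illustrative purposes. This can easily be extended to multiple covariates by extending the $M_N$ and $W_N$ matrices and the $\mu$ and $\gamma_i$ vectors.

Let $\Upsilon_{ij} = (S_{ij}, T_{ij})^T$ constitute the surrogate marker and outcome for each subject, let

$$M_{ij} = \begin{pmatrix} 1 & 0 & Z_{ij} & 0 & X_{ij} & 0 & X_{ij} Z_{ij} & 0 \\ 0 & 1 & 0 & Z_{ij} & 0 & X_{ij} & 0 & X_{ij} Z_{ij} \end{pmatrix}$$

be the fixed effect matrix associated with the parameters $\mu = (\alpha_S, \alpha_T, \beta_S, \beta_T, \gamma_S, \gamma_T, \delta_S, \delta_T)^T$, and let

$$W_{ij} = \begin{pmatrix} 1 & 0 & Z_{ij} & 0 \\ 0 & 1 & 0 & Z_{ij} \end{pmatrix}$$

be the random effect matrix associated with $\gamma_i = (a_{Si}, a_{Ti}, b_{Si}, b_{Ti})^T$. As before, $\Upsilon_N \sim N_{2n_N}(M_N \mu, V_N)$ with $V_N = W_N D W_N^T + R$, and $R$ representing a $2n_N \times 2n_N$ matrix with block diagonals of $\sigma$. Then, $\gamma_N \mid \Upsilon_N \sim N_4(\mu_{\gamma \mid \Upsilon}, \Sigma_{\gamma \mid \Upsilon})$, where

$$\mu_{\gamma \mid \Upsilon} = D W_N^T V_N^{-1} (\Upsilon_N - M_N \mu), \qquad \Sigma_{\gamma \mid \Upsilon} = D - D W_N^T V_N^{-1} W_N D.$$

From here, with $\Delta_{SN}(x) = \beta_S + \delta_S x + b_{SN}$ and $\Delta_{TN}(x) = \beta_T + \delta_T x + b_{TN}$, the measure of surrogate paradox risk is given by

$$\psi^N_{SP13}(x) = P(\Delta_{SN}(x) > 0, \Delta_{TN}(x) > 0 \mid \Upsilon_N) + P(\Delta_{SN}(x) < 0, \Delta_{TN}(x) < 0 \mid \Upsilon_N).$$

Elliott et al. consider the minimum observed beneficial treatment effect for a marker that can reduce the probability that the true treatment effect for the outcome is harmful [10]. When considering covariate subgroups, we can compute $O_{Si}$ for each covariate level and call it $O_{Si}(x)$: $O_{Si}(x)$ represents the difference between the observed surrogate marker means under treatment and control within a fixed level of $X$. Then, the joint distribution of the true treatment effect on the outcome and the observed treatment effect on the surrogate marker is given by

$$\begin{pmatrix} \Delta_{Ti}(x) \\ O_{Si}(x) \end{pmatrix} \sim N_2\left( \begin{pmatrix} \beta_T + \delta_T x \\ \beta_S + \delta_S x \end{pmatrix}, \begin{pmatrix} d_{bb} & d_{ab} \\ d_{ab} & \tilde{d}_{aa}(x) \end{pmatrix} \right),$$

where $\tilde{d}_{aa}(x) = d_{aa} + \sigma_{ss}(1/n_{1ix} + 1/n_{0ix})$, $n_{1ix} = \sum_{j: X_{ij} = x} Z_{ij}$, and $n_{0ix} = \sum_{j: X_{ij} = x} (1 - Z_{ij})$. So, the distribution of the true treatment effect on the outcome $\Delta_{Ti}(x)$ conditional on a given observed treatment effect $O_{Si}(x) = s$ within the group having $X = x$ is

$$\Delta_{Ti}(x) \mid O_{Si}(x) = s \sim N\left( \beta_T + \delta_T x + \frac{d_{ab}}{\tilde{d}_{aa}(x)}(s - \beta_S - \delta_S x), \; d_{bb} - \frac{d_{ab}^2}{\tilde{d}_{aa}(x)} \right).$$

The value of $s$ that will ensure that the probability that $\Delta_{Ti}(x)$ is negative is less than or equal to a preset level $\alpha$ is the smallest $s$ for which $P(\Delta_{Ti}(x) < 0 \mid O_{Si}(x) = s) \le \alpha$ under this conditional distribution.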

Scenario 2
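Under scenario 2, we assume the effects of the covariates on the surrogate and outcome are not constant across trials. For simplicity, we consider only one scalar or binary covariate $X$, so that the main effect of $X$ and the treatment-by-covariate interaction each have a trial-level random effect in addition to the fixed effects $\gamma$ and $\delta$. Now, we can choose a level $x$ for the covariate $X$ and calculate the causal effect of a treatment $Z$ on the surrogate marker among subjects with $X = x$ in the $i$th trial, $\Delta_{Si}(x)$, as the sum of the fixed part $\beta_S + \delta_S x$, the trial-level random treatment effect $b_{Si}$, and $x$ times the trial-level random interaction effect. Similarly, the causal effect of a treatment $Z$ on the outcome measure among subjects with $X = x$ in the $i$th trial, $\Delta_{Ti}(x)$, is the sum of $\beta_T + \delta_T x$ and the corresponding random effects for the outcome. The joint distribution of $\Delta_{Si}(x)$ and $\Delta_{Ti}(x)$ is then bivariate normal with mean $(\beta_S + \delta_S x, \beta_T + \delta_T x)^T$ and a covariance matrix that, unlike in scenario 1, depends on $x$ through the variances and covariances of the treatment and interaction random effects.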
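To make the dependence on $x$ concrete, the following sketch computes the mean and covariance of $(\Delta_{Si}(x), \Delta_{Ti}(x))$ under scenario 2 with one covariate. It assumes, purely for illustration, that the eight trial-level random effects are ordered in the same way as $\mu = (\alpha_S, \alpha_T, \beta_S, \beta_T, \gamma_S, \gamma_T, \delta_S, \delta_T)^T$, so that positions 3-4 hold the random treatment effects and positions 7-8 hold the random interaction effects; this ordering and the function name are assumptions rather than notation from the original work.

```python
def treatment_effect_distribution(x, mu, D):
    """Mean vector and 2x2 covariance of (Delta_S(x), Delta_T(x)).

    mu: 8 fixed effects; D: 8x8 covariance of the trial-level random effects,
    assumed to follow the same ordering as mu.
    """
    # Each row picks the treatment term plus x times the interaction term.
    selector = [
        [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, float(x), 0.0],  # surrogate
        [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, float(x)],  # true outcome
    ]
    mean = [sum(a * m for a, m in zip(row, mu)) for row in selector]
    cov = [
        [
            sum(selector[i][p] * D[p][q] * selector[k][q]
                for p in range(8) for q in range(8))
            for k in range(2)
        ]
        for i in range(2)
    ]
    return mean, cov


# Hypothetical inputs: fixed effects and an identity random effect covariance.
mu_example = [1.0, 1.0, 2.0, 1.0, 0.0, 0.0, -1.0, 1.0]
D_example = [[1.0 if r == c else 0.0 for c in range(8)] for r in range(8)]
mean_x0, cov_x0 = treatment_effect_distribution(0, mu_example, D_example)
mean_x1, cov_x1 = treatment_effect_distribution(1, mu_example, D_example)
```

At $x = 0$ the covariance reduces to the block of $D$ for the random treatment effects, which matches the scenario 1 form; at other covariate levels the interaction random effects inflate or reshape it.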

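Scenario 2: $\psi_{SP123}(x)$
Under the new joint distribution, the probability of avoiding the situation in which the treatment effect on the outcome is harmful while the treatment effect on the marker is beneficial is given by

$$\psi_{SP123}(x) = 1 - P(\Delta_{S,N+1}(x) > 0, \Delta_{T,N+1}(x) < 0).$$

For an ongoing $N$th trial, let $\Upsilon_{ij} = (S_{ij}, T_{ij})^T$ constitute the surrogate marker and outcome for each subject, let $M_{ij}$ be the fixed effect matrix associated with the parameters $\mu = (\alpha_S, \alpha_T, \beta_S, \beta_T, \gamma_S, \gamma_T, \delta_S, \delta_T)^T$, and let $W_{ij}$ be the random effect matrix associated with the trial-level random effects $\gamma_i$, which under scenario 2 contains the same columns as $M_{ij}$, with $R$ representing a $2n_N \times 2n_N$ matrix with block diagonals of $\sigma$ as before. The conditional distribution of $\gamma_N \mid \Upsilon_N$ is again multivariate normal with mean $D W_N^T V_N^{-1}(\Upsilon_N - M_N \mu)$ and covariance $D - D W_N^T V_N^{-1} W_N D$, where $V_N = W_N D W_N^T + R$. From here, the measure of surrogate paradox risk is given by

$$\psi^N_{SP13}(x) = P(\Delta_{SN}(x) > 0, \Delta_{TN}(x) > 0 \mid \Upsilon_N) + P(\Delta_{SN}(x) < 0, \Delta_{TN}(x) < 0 \mid \Upsilon_N).$$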

Bayesian Estimation
In this section, we describe how to obtain estimates and inferences for the proposed measures using a Bayesian framework for scenario 2, which is a generalization of scenario 1 that allows for covariate effects and interactions to differ by study. It is also possible to estimate the measures using a maximum likelihood (ML) or restricted maximum likelihood (REML) approach, although it is often not computationally feasible in practice without large sample sizes, so we focus on a Bayesian estimation approach in this paper. Details of the ML/REML estimation approach are provided in Appendix A.
The estimation can be conducted using a fully Bayesian approach, with priors placed on $\mu$, $D$, and $\sigma$. We obtain draws of the parameters from a Markov chain Monte Carlo (MCMC) sampler and transform them to obtain $p(\psi_{SP13} \mid \Upsilon)$ and $p(\psi_{SP123} \mid \Upsilon)$, the posterior distributions of $\psi_{SP13}$ and $\psi_{SP123}$. We place a multivariate normal prior on the fixed effects, $\mu = (\alpha_S, \alpha_T, \beta_S, \beta_T, \gamma_S, \gamma_T, \delta_S, \delta_T)^T$, such that $\mu \sim N_8(0, \Sigma_0)$. We place Wishart priors on the inverses of the variance parameters $D$ and $\sigma$ such that $\sigma^{-1} \sim W(v_\sigma, G)$ and $D^{-1} \sim W(v_D, F)$. With these priors, the conditional posterior distribution of each parameter of interest can be derived. Using the conditional posterior distributions and a Gibbs sampling routine, we can obtain draws from the posterior distributions of each of the parameters of interest.
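The transformation of parameter draws into a posterior summary can be sketched generically. The block below assumes the MCMC draws are already available as a list of dictionaries and applies an arbitrary function of the parameters to each draw, then reports the posterior mean and an equal-tailed 95% credible interval. To keep the sketch short, the example quantity is the univariate probability that the treatment effect on the outcome is harmful at covariate level `x`, a simpler stand-in for the quadrant measures; the dictionary keys and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, quantiles


def posterior_summary(draws, measure):
    """Posterior mean and equal-tailed 95% interval of measure(draw)."""
    values = [measure(draw) for draw in draws]
    # 39 cut points at 2.5% steps; the first and last bound the central 95%.
    cuts = quantiles(values, n=40, method="inclusive")
    return mean(values), cuts[0], cuts[-1]


def prob_outcome_harm(draw, x=1.0):
    """P(Delta_T(x) < 0) implied by one draw of the parameters."""
    effect_mean = draw["beta_t"] + draw["delta_t"] * x
    return NormalDist(effect_mean, sqrt(draw["var_t"])).cdf(0.0)


# Hypothetical draws standing in for Gibbs sampler output.
example_draws = [
    {"beta_t": 0.8 + 0.01 * k, "delta_t": 1.0, "var_t": 1.0} for k in range(40)
]
summary_x0 = posterior_summary(example_draws, lambda d: prob_outcome_harm(d, x=0.0))
summary_x1 = posterior_summary(example_draws, lambda d: prob_outcome_harm(d, x=1.0))
```

Passing a function that evaluates a quadrant probability in place of `prob_outcome_harm` would summarize the corresponding risk measure in the same way, since the summary step does not depend on which quantity is computed per draw.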

Testing
In order to determine which scenario is the best fit for a particular analysis, we would need some intuition as to whether the effect of a covariate $X$ on the outcome differs by study and whether that effect also differs by treatment. If there is no intuition as to whether the covariate effect differs by study, it may be of interest to test which scenario is the most appropriate for the observed meta-analytic data. This amounts to jointly testing the null hypothesis that all of the variances and covariances associated with the covariate random effects are equal to zero. A test statistic based on the variance least squares estimator of variance components has been proposed, along with a permutation test to approximate its finite sample distribution [12]. Under the Bayesian framework, Ariyo et al. recommend using the marginal deviance information criterion (DIC) or the marginal widely applicable information criterion (WAIC) to evaluate the need for random effects [13] by comparing the criterion value between the model including the random effects and a model excluding all the covariate-related random effects.
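For readers unfamiliar with WAIC, the sketch below shows the usual computation from a matrix of pointwise log-likelihood values (posterior draws by units), using the log pointwise predictive density and the variance-based penalty. It is a generic illustration, not the implementation used in this paper; for the marginal version discussed above, the entries would need to be log-likelihoods with the random effects integrated out, typically one value per trial, and the function name is hypothetical.

```python
from math import exp, log


def waic(log_lik):
    """WAIC on the deviance scale.

    log_lik[s][i] is the log-likelihood of unit i under posterior draw s;
    at least two draws are required for the variance penalty.
    """
    n_draws = len(log_lik)
    n_units = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_units):
        column = [log_lik[s][i] for s in range(n_draws)]
        peak = max(column)
        # log of the posterior mean likelihood, computed stably
        lppd += peak + log(sum(exp(v - peak) for v in column) / n_draws)
        center = sum(column) / n_draws
        # penalty: posterior variance of the pointwise log-likelihood
        p_waic += sum((v - center) ** 2 for v in column) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)


def waic_difference(log_lik_reduced, log_lik_full):
    """Positive values favor the model with the additional random effects."""
    return waic(log_lik_reduced) - waic(log_lik_full)
```

A lower WAIC indicates better expected predictive fit, so a large positive difference between the reduced and full models supports keeping the covariate-related random effects.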

Simulations
We perform simulations under several surrogacy scenarios to examine the properties of the proposed estimators as a function of a binary covariate X.We generate data under scenario 1 (the effect of X on the surrogate and outcome is constant across trials) and scenario 2 (the effect of X on the surrogate and outcome is not constant across trials).
For scenario 1, we generate data assuming $\alpha_S = \alpha_T = 1$, $\beta_S = 2$, $\beta_T = 1$, $\gamma_S = \gamma_T = 0$, $\delta_S = -1$, and $\delta_T = 1$, with the variance components $D$ and $\sigma$ held at fixed values. We used a Gibbs sampling routine, as described in Section 4, with a multivariate normal prior for the fixed effects, such that $(\alpha_S, \alpha_T, \beta_S, \beta_T, \gamma_S, \gamma_T, \delta_S, \delta_T) \sim N_8(0, 10^6 I_8)$, and Wishart priors for the inverse of the covariance matrices of the form $W(q + 1, (1/(q + 2)) I_q)$, where $q$ is the length of the associated vector of covariance effects. We sample from the derived conditional posterior distributions to obtain draws of the proposed estimators. Tables 1 and 2 contain the point estimates, standard errors, bias, and coverage rates for $\psi_{SP13}(x)$, $\psi_{SP123}(x)$, and $s$, with 30 and 100 trials, respectively. The true value of $s$ assumes that there is equal distribution of subjects between each of the treatment and covariate categories. To estimate $\psi^N_{SP13}$, we considered the final study to have only half of the data of the other trials.
Although it is also possible to conduct this analysis with a ML/REML estimation approach, as described in the Appendix A, we ran into computation issues when estimating the large number of random effects using reasonable sample sizes and have therefore presented only the simulation results for the Bayesian approach.
We observed some minimal bias in estimating ψ SP 13 , ψ SP 123 , and ψ SP 13 N with either 30 or 100 trials, each of size 20, 50, or 500 subjects.However, with the estimate of s, we found that the lower number of trials and lower number of subjects resulted in unstable estimates with very large bias and variance.The observed coverage rates of the credible intervals were below the nominal level for some estimates of ψ SP 13 and Ψ SP 123 in both scenarios, demonstrating the need for large numbers of trials and subjects per trial when there is a desire to identify the risk of surrogate paradox in subpopulations.
As a sensitivity analysis, we also considered two simulation settings with data that were not normally distributed to assess the robustness of our proposed method to model misspecification. We generated data using a t distribution with 15 degrees of freedom, as well as a skew normal distribution with α equal to 0.1 times the location and scale parameters and centered at 0. The data generated under the t distribution allow us to assess whether the method is robust to a situation in which the normality assumption is violated in the tails of the distribution [14]. The data generated under the skew normal distribution consider a situation in which the data are distributed asymmetrically, as carried out in prior similar sensitivity analyses [15]. For each sensitivity analysis, we generated 30 trials, each with 50 subjects, and considered the bias, standard error, and coverage of $\psi_{SP13}$ and $\psi_{SP123}$.
The true value of each of the parameters of interest was estimated empirically by taking one million draws of $\Delta_S$ and $\Delta_T$ and computing $\psi_{SP13}$ and $\psi_{SP123}$ from the proportion of draws that fell into each of the relevant quadrants. The results of the sensitivity analysis are shown in Table 3. Under these deviations from normality, we had small increases in bias and standard error but still maintained high coverage rates. As the number of required parameters increased in scenario 2, the coverage rates also decreased, as we would expect.
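The empirical calculation of the true values can be illustrated with a short Monte Carlo sketch. It draws $(\Delta_S, \Delta_T)$ from a bivariate normal distribution and counts the proportion of draws in the relevant quadrants. The means use the scenario 1 fixed effects stated earlier ($\beta_S = 2$, $\beta_T = 1$, $\delta_S = -1$, $\delta_T = 1$), while the variance components, the seed, and the function name are hypothetical choices made only for illustration; non-normal settings would require a different sampling step.

```python
import random
from math import sqrt


def empirical_quadrant_measures(mean_s, mean_t, d_aa, d_ab, d_bb,
                                draws=1_000_000, seed=2023):
    """Monte Carlo estimates of (psi_SP13, psi_SP123) from bivariate normal draws."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = sqrt(d_aa)
    l21 = d_ab / l11
    l22 = sqrt(d_bb - l21 ** 2)
    same_sign = 0
    fourth_quadrant = 0
    for _ in range(draws):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        delta_s = mean_s + l11 * e1
        delta_t = mean_t + l21 * e1 + l22 * e2
        if (delta_s > 0) == (delta_t > 0):
            same_sign += 1
        if delta_s > 0 and delta_t < 0:
            fourth_quadrant += 1
    return same_sign / draws, 1.0 - fourth_quadrant / draws


def subgroup_means(x, beta_s=2.0, beta_t=1.0, delta_s=-1.0, delta_t=1.0):
    """Mean treatment effects at covariate level x under a mean shift."""
    return beta_s + delta_s * x, beta_t + delta_t * x
```

With a fixed seed the estimates are reproducible, and the Monte Carlo error shrinks at the usual rate of one over the square root of the number of draws.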

Collaborative Initial Glaucoma Treatment Study
We apply the proposed method to data from the Collaborative Initial Glaucoma Treatment Study (CIGTS) [16]. The CIGTS trial was a multicenter randomized clinical trial that contrasted initial surgical therapy versus initial medical therapy to treat glaucoma, with reduction in intraocular pressure (IOP) as one of its outcome measures. A total of 607 patients were enrolled in the study, and 307 were randomized to the drug arm. IOP was recorded in mmHg at baseline, 3 months, 6 months, and every 6 months thereafter. We consider the measurement of IOP at 18 months after beginning treatment as a surrogate for the true endpoint of interest: IOP at 96 months. We consider the 14 centers at which the study was conducted to be the trial-level replicates. Missing data were imputed using single imputation with a linear mixed model with a random effect for trial, a quadratic trend for time, an effect for treatment, and an interaction between time and treatment. The estimates of the between-trial covariance matrix, $D$, are not positive definite, so only the results (estimates and 95% credible intervals (CIs)) from the Bayesian estimation procedure are presented. As in the simulation study, we used a Gibbs sampling routine, as described in Section 4, with a multivariate normal prior for the fixed effects, such that $(\alpha_S, \alpha_T, \beta_S, \beta_T, \gamma_S, \gamma_T, \delta_S, \delta_T) \sim N_8(0, 10^6 I_8)$, and Wishart priors for the inverse of the covariance matrices of the form $W(q + 1, (1/(q + 2)) I_q)$, where $q$ is the length of the associated vector of covariance effects. The $R^2_{\text{trial}}$ measure of surrogacy is 0.49, indicating a moderate quality surrogate by the Buyse criteria [6].
In order to illustrate our proposed methods, we consider two covariates: sex (female, male) and age (<60, ≥60), and compute $\psi_{SP13}$ and $\psi_{SP123}$ for each variable category under both proposed scenarios. The results are shown in Table 4.
In scenario 1, we exclude all of the random effects for the included covariates.As we can see, overall, there is a small probability of experiencing the surrogate paradox when using early IOP as a surrogate for later IOP in this trial, since the 95% credible intervals of the measures are close to 1.This does not change significantly when comparing the overall Ψ SP 13 and Ψ SP 123 with the covariate adjustments, implying that there is no evidence of a significant difference between the risk of surrogate paradox by age or gender.In scenario 2, we estimate all of the random effects for the included covariates, allowing the effect of the covariate and the interaction between the covariate and treatment to differ by study center.In this scenario, we observe some differences between the risk of surrogate paradox by subgroup.Notably, it seems as though males and people aged 60 or over are at a higher risk of experiencing the surrogate paradox in a new trial compared with females and people under the age of 60, respectively.However, the difference in their risk of dangerous surrogate paradox is minimal.In both scenarios, the measure of s is too unstable to provide useful inference.
Using WAIC as a model selection tool, we find that there is a WAIC difference of 380 between the models for scenarios 1 and 2 for the model including sex as a covariate, and a WAIC difference of 815 for the model including age as a covariate, and conclude that the models including the additional random effects (scenario 2) are a better fit in this data example.The data for this trial are not publicly available.

Trial of Preventing Hypertension
Our second illustrative example comes from the Trial of Preventing Hypertension (TROPHY) [17]. This multicenter randomized trial compared the effects of two years of treatment with candesartan versus the standard of care on the incidence of hypertension in patients with prehypertension. Blood pressure and hypertension status were collected at baseline, 1 month and 3 months post randomization, and then every 3 months for a total of two years of follow-up. To illustrate our proposed methods, we consider the average of systolic and diastolic pressure at 1 month as a surrogate for the average of systolic and diastolic pressure at 12 months. Although the primary endpoint of interest in the original trial was a binary indicator of developing hypertension, we used the endpoint of average systolic and diastolic pressure at 12 months, since our method has currently only been developed for normally distributed outcomes. After developing hypertension, patients were switched to a new treatment regimen, resulting in some missing data in both the surrogate measured at 1 month and the true endpoint measured at 12 months. These missing data were imputed using a model that was stratified by treatment and gender and included the following baseline covariates: age, race, weight, body mass index, systolic blood pressure, diastolic blood pressure, total cholesterol, high-density lipoprotein cholesterol (HDL), low-density lipoprotein (LDL), HDL:LDL ratio, triglycerides, fasting glucose, total insulin, and creatinine. For missing outcome values at 12 months, the imputation model also included the blood pressure measurements up to the 12th month. We consider the 69 centers at which the study was conducted to be the trial-level replicates. There were a total of 772 patients included in the original analysis. After removing centers with patients in only one treatment arm, there were a remaining 62 centers and 764 patients, 389 of whom received the treatment. The size of the remaining centers ranged from 2 patients to 46 patients. When applying the REML estimation method, the covariance matrix was non-positive-definite (likely due to the small sample size at some centers), so we only present the results (estimates and 95% credible intervals (CIs)) from the Bayesian estimation procedure.
To illustrate our proposed methods, we consider two covariates, sex (female, male) and age (<50, ≥50), and compute Ψ SP 13 and Ψ SP 123 for each covariate category under both proposed scenarios. The results are shown in Table 5.
The results indicate that, overall, there is very little risk of the surrogate paradox when considering the effect of Candesartan on the average of systolic and diastolic blood pressure at 1 month as a surrogate for the average of systolic and diastolic blood pressure at 12 months. Although there are minor differences in the risk of surrogate paradox (measured through both Ψ SP 13 and Ψ SP 123 ) by gender and age, the credible intervals overlap between the groups, suggesting no clear difference in their risk of surrogate paradox. As in the previous example, the measure of s is too unstable to provide useful inference, consistent with our simulation study, which indicated that a large number of trials would be required to obtain useful inference for this quantity.
Using WAIC as a model selection tool, we find a WAIC difference of 120 between the models for scenarios 1 and 2 when sex is the covariate, and a WAIC difference of 83 when age is the covariate. We conclude that the models including the additional random effects (scenario 2) fit better in this data example. However, qualitatively, the results from the two scenarios are quite similar, and a simpler model may be preferred. The data for this trial are not publicly available.
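For readers unfamiliar with the criterion, the following is a minimal sketch of the standard WAIC computation on the deviance scale, where lower values indicate better expected predictive fit. It assumes a matrix of pointwise log-likelihood values evaluated at each posterior draw; the function name `waic` and the input layout are illustrative assumptions, not the implementation used for the analysis above.

```python
import math


def waic(loglik_draws):
    """WAIC on the deviance scale.

    loglik_draws[s][i] is the log-likelihood of observation i under
    posterior draw s (hypothetical layout; at least two draws are needed).
    """
    n_draws = len(loglik_draws)
    n_obs = len(loglik_draws[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [loglik_draws[s][i] for s in range(n_draws)]
        peak = max(column)
        # log of the posterior-mean likelihood, computed stably (log-sum-exp)
        lppd += peak + math.log(sum(math.exp(v - peak) for v in column) / n_draws)
        mean = sum(column) / n_draws
        # sample variance of the log-likelihood across posterior draws
        p_waic += sum((v - mean) ** 2 for v in column) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)
```

Two candidate models fitted to the same data can then be compared through the difference of their `waic` values.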

Discussion
Surrogate outcomes are commonly used in clinical trials, and their prevalence has led to the development of innovative trial designs that aim to efficiently use the additional information provided by surrogate outcomes [18][19][20]. Despite the valuable additional information that surrogate outcomes provide, their use also comes with risk. Evaluating the quality of a chosen surrogate to prevent the surrogate paradox should be an important step in both the design and analysis of clinical trials.
There are several existing approaches for evaluating surrogate outcome efficacy, but some apparently "good" surrogates under these methods may still experience the "surrogate paradox", in which the treatment has a positive effect on the surrogate endpoint but a negative effect on the true endpoint. The meta-analytic causal association approach to surrogate validation is particularly useful in assessing the risk of surrogate paradox. In this paper, we develop methods to measure the risk of the surrogate paradox in subpopulations when there are data available on multiple trials of similar treatments on the same surrogate and outcome. Using measures of surrogate paradox risk can prevent the occurrence of the surrogate paradox in new trials and protect the health of study participants.
Incorporating covariate information can provide valuable insights into the mechanism of the surrogate paradox and identify groups that are particularly vulnerable to the paradox. This additional information can tell us about the transferability of surrogates from one trial to the next, depending on their study population. It can also help assess the risk of using a proposed surrogate in a new trial depending on the demographic distribution of the new study population. Researchers can incorporate their understanding of whether certain subpopulations are at a higher risk of experiencing the surrogate paradox into the design of new clinical trials of similar treatments that plan to use the same surrogate and true endpoints.
Both our simulations and examples focused on exploring whether the surrogate paradox risk varied with a single scalar covariate. While in principle this could easily be extended to a multiple-covariate setting, in practice this would typically require a fairly large number of trials to obtain stable estimates, especially for the "scenario 2" setting, where both the fixed and random effects are associated with multiple covariates. Our simulation study showed that the estimation of some measures can be unstable when there is a small number of trials and subjects. We also considered simulations under mild deviations from normality and were able to retain relatively high coverage rates. The proposed method derives the probabilities of interest assuming normally distributed variables, an assumption that may not hold in practice. Future work will consider further violations of the normality assumption, as well as how to account for them when estimating the risk of surrogate paradox.
This work has the potential to be extended to non-normal surrogate and true endpoints. By using a copula model instead of the bivariate normal assumption in this paper, we may be able to consider a larger range of distributions for the surrogate and true endpoints, including binary or time-to-event distributions. We may also be able to consider the situation in which the proposed surrogate and true endpoints have differing distributional forms (e.g., an indicator of hypertension as a surrogate for time to cardiac death). Another potential extension is to apply meta-analytic methods to estimate the risk of surrogate paradox when individual-level data on the prior studies are not available. One example would be if we only have the parameter estimates from a series of published papers on the same treatment and endpoint combination and want to use them to estimate the risk of surrogate paradox in a newly designed study.
Finally, we note that while we focused on conditional surrogate paradox estimates (that is, interactions with covariates), this method can also be used to deal with non-normality in the multiple-trials setting, with the conditional surrogate paradox measures averaged to obtain marginal results, using the sample distribution of the covariates to approximate the population density. Variance estimates could then be obtained by bootstrapping for the REML approaches, or from the posterior distribution of Ψ SP 13 , with each posterior draw of Ψ SP 13 obtained by averaging the corresponding draws of Ψ SP 13 (x i ).
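The averaging step described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a matrix of posterior draws of the conditional measure evaluated at each observed covariate value, every observed value receives equal weight (the sample distribution), and the function names are hypothetical rather than taken from the released code.

```python
import statistics


def marginal_draws(conditional_draws):
    """Average each posterior draw of the conditional measure over the sample.

    conditional_draws[s][i] is draw s of the measure at observed covariate
    value x_i (hypothetical layout); equal weights approximate the
    population covariate density by the sample distribution.
    """
    return [sum(row) / len(row) for row in conditional_draws]


def summarize(draws, level=0.95):
    """Posterior median and an approximate equal-tailed interval."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    lower = ordered[int((1.0 - level) / 2.0 * last)]
    upper = ordered[min(last, int((1.0 + level) / 2.0 * last) + 1)]
    return statistics.median(ordered), lower, upper
```

Calling `summarize(marginal_draws(draws))` then gives a point estimate and interval for the marginal measure.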
The code for implementing these methods is available at github.com/fatemashafie.

Funding:
This research was funded in part by the US National Institutes of Health grant CA83654 and by the National Cancer Institute Award Number T32CA083654.

ψ SP 13 = 0.997, ψ SP 123 = 0.999. ψ SP 13 is defined as the probability that an outcome and marker will have the same direction of treatment effects in a new trial and is introduced in Section 2.1. ψ SP 123 is defined as the probability of avoiding the dangerous surrogate paradox, that is, the situation in which the surrogate marker suggests a beneficial treatment effect but the outcome suggests a harmful treatment effect, and it is introduced in Section 2.2.

2.3. Ψ SP 13 N : Estimating the Probability That an Outcome and Marker Will Have the Same Direction of Treatment Effects in a New Trial When Partial Data Have Been Collected

$$\Psi^{SP}_{13N} = 1 - \Phi_1\left(0;\, \beta_{SN},\, d_{33N}\right) - \Phi_1\left(0;\, \beta_{TN},\, d_{44N}\right) + 2\,\Phi_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix};\, \begin{pmatrix} \beta_{SN} \\ \beta_{TN} \end{pmatrix},\, \begin{pmatrix} d_{33N} & d_{34N} \\ d_{34N} & d_{44N} \end{pmatrix}\right)$$

where $\beta_{SN} = \beta_S + b_{SN}$, with $\beta_S$ the third element of the maximum likelihood (ML) or restricted maximum likelihood (REML) estimate of $\mu$ and $b_{SN}$ the third element of the ML/REML estimate of $\gamma_N$; $\beta_{TN} = \beta_T + b_{TN}$, with $\beta_T$ the fourth element of the ML/REML estimate of $\mu$ and $b_{TN}$ the fourth element of the ML/REML estimate of $\gamma_N$; and $d_{klN}$ is the $(k, l)$ element of the ML/REML estimator of $D_N$. Similarly, we can derive $\Psi^{SP}_{123N}$.

$M_i$, $\Upsilon_i$, and $W_i$ represent the stacked elements of $M_{ij}$, $\Upsilon_{ij}$, and $W_{ij}$. Consider the vector of random effects $\gamma_N = (a_{SN}, a_{TN}, b_{SN}, b_{TN})^T$; then the conditional distribution of $\gamma_N \mid \Upsilon_N$ can be found by considering the joint distribution of $\Upsilon_N$ and $\gamma_N$, and

$$\Psi^{SP}_{13N}(x) = 1 - \Phi_1\left(0;\, \beta_{SN} + \delta_{SN} x,\, d_{33N}\right) - \Phi_1\left(0;\, \beta_{TN} + \delta_{TN} x,\, d_{44N}\right) + 2\,\Phi_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix};\, \begin{pmatrix} \beta_{SN} + \delta_{SN} x \\ \beta_{TN} + \delta_{TN} x \end{pmatrix},\, \begin{pmatrix} d_{33N} & d_{34N} \\ d_{34N} & d_{44N} \end{pmatrix}\right)$$

where $\beta_{SN} = \beta_S + \delta_S + b_{SN}$, with $\beta_S$ and $\delta_S$ the third and seventh elements of the estimate of $\mu$ and $b_{SN}$ the third element of the estimate of $\gamma_N$; $\beta_{TN} = \beta_T + \delta_T + b_{TN}$, with $\beta_T$ and $\delta_T$ the fourth and eighth elements of the estimate of $\mu$ and $b_{TN}$ the fourth element of the estimate of $\gamma_N$; and $d_{klN}$ is the $(k, l)$ element of the estimate of $D_N$. Similarly, we can derive $\Psi^{SP}_{123N}(x)$.

The covariance matrix of the random effects (symmetric; upper triangle shown) is

$$\begin{pmatrix} d_{ss} & d_{st} & d_{sa} & d_{sb} & d_{scs} & d_{sct} & d_{sds} & d_{sdt} \\ & d_{tt} & d_{ta} & d_{tb} & d_{tcs} & d_{tct} & d_{tds} & d_{tdt} \\ & & d_{aa} & d_{ab} & d_{acs} & d_{act} & d_{ads} & d_{adt} \\ & & & d_{bb} & d_{bcs} & d_{bct} & d_{bds} & d_{bdt} \\ & & & & d_{cs} & d_{csct} & d_{csds} & d_{csdt} \\ & & & & & d_{ct} & d_{ctds} & d_{ctdt} \\ & & & & & & d_{ds} & d_{dsdt} \\ & & & & & & & d_{dt} \end{pmatrix}$$

and $M_i$, $\Upsilon_i$, and $W_i$ represent the stacked elements of $M_{ij}$, $\Upsilon_{ij}$, and $W_{ij}$. Consider the vector of random effects $\gamma_N = (a_{SN}, a_{TN}, b_{SN}, b_{TN}, c_{SN}, c_{TN}, d_{SN}, d_{TN})^T$; then the conditional distribution of $\gamma_N \mid \Upsilon_N$ can be found by considering the joint distribution of $\Upsilon_N \sim MVN(X_N \mu, V_N)$ and $\gamma_N \sim MVN(0, D^*)$, with $\mathrm{cov}(\Upsilon_N, \gamma_N) = W_N D^*$ for $V_N = W_N D^* W_N^T + R$.
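The same-direction probability is the chance that the pair of treatment effects falls in the first or third quadrant of a bivariate normal distribution, and the probability of avoiding the dangerous paradox excludes only the quadrant with a positive surrogate effect and a negative true effect. A minimal numerical sketch is given below, assuming the means, variances, and covariance have already been estimated and that the correlation is strictly between -1 and 1. The function names and the use of Simpson's rule for the bivariate term are illustrative choices, not the implementation in the released code.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def prob_both_negative(mu_s, mu_t, var_s, var_t, cov_st, n=2000):
    """P(Delta_S < 0, Delta_T < 0) for a bivariate normal (n must be even).

    Integrates the density of Delta_S times the conditional probability
    that Delta_T < 0, using Simpson's rule on a truncated range.
    """
    sd_s, sd_t = math.sqrt(var_s), math.sqrt(var_t)
    rho = cov_st / (sd_s * sd_t)
    cond_sd = sd_t * math.sqrt(1.0 - rho * rho)
    lower = mu_s - 10.0 * sd_s
    upper = min(0.0, mu_s + 10.0 * sd_s)
    if upper <= lower:
        return 0.0
    h = (upper - lower) / n
    total = 0.0
    for k in range(n + 1):
        s = lower + k * h
        cond_mean = mu_t + rho * sd_t * (s - mu_s) / sd_s
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        density = norm_pdf((s - mu_s) / sd_s) / sd_s
        total += weight * density * norm_cdf(-cond_mean / cond_sd)
    return total * h / 3.0


def psi_13(mu_s, mu_t, var_s, var_t, cov_st):
    """Probability that both treatment effects have the same sign."""
    p_s_neg = norm_cdf(-mu_s / math.sqrt(var_s))
    p_t_neg = norm_cdf(-mu_t / math.sqrt(var_t))
    both_neg = prob_both_negative(mu_s, mu_t, var_s, var_t, cov_st)
    return 1.0 - p_s_neg - p_t_neg + 2.0 * both_neg


def psi_123(mu_s, mu_t, var_s, var_t, cov_st):
    """Probability of avoiding Delta_S > 0 together with Delta_T < 0."""
    p_t_neg = norm_cdf(-mu_t / math.sqrt(var_t))
    both_neg = prob_both_negative(mu_s, mu_t, var_s, var_t, cov_st)
    return 1.0 - (p_t_neg - both_neg)
```

With zero means, unit variances, and covariance 0.5, the known orthant result gives 2/3 for `psi_13` and 5/6 for `psi_123`, which is a convenient check on the numerical integration.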
Since variances are non-negative, testing whether they are equal to zero means we are testing a null hypothesis on the boundary of the parameter space, and the usual chi-square distribution of the likelihood ratio statistic under this null hypothesis is incorrect. Drikvandi et al.
d ss = d tt = d aa = d bb = 1, d ab = 0.5, and d st = d sa = d sb = d ta = d tb = d ab = 0.3. For scenario 2, we generate data using the same parameters as scenario 1 and set the new variance components d cs = d ct = d ds = d dt = 1 and all the new off-diagonal components, d scs through d dsdt , to 0.3. Under each scenario, we simulate 200 studies with 30 or 100 clusters, each of size 20, 50, or 500, representing 30 or 100 repeated trials of the same treatment, surrogate, and true endpoint combination, each with 20, 50, or 500 participants. Half of the participants in each trial are randomly assigned to treatment and half to control.
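As a rough illustration of the scenario 1 design, the sketch below builds a 4 x 4 covariance matrix from the stated values and draws trial-level random effects for 30 trials. Several points are assumptions: the ordering (a_S, a_T, b_S, b_T) with indices s, t, a, b, the choice of d ab = 0.5 (the text gives both 0.5 and 0.3 for this entry), the seed, and all function names. It is not the simulation code used for the reported results.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - acc)
            else:
                lower[i][j] = (matrix[i][j] - acc) / lower[j][j]
    return lower


def draw_random_effects(lower, rng):
    """One multivariate normal draw with mean zero and covariance L L^T."""
    z = [rng.gauss(0.0, 1.0) for _ in lower]
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(len(lower))]


# Assumed ordering (a_S, a_T, b_S, b_T); d_ab = 0.5 is an assumption.
D = [
    [1.0, 0.3, 0.3, 0.3],
    [0.3, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.5],
    [0.3, 0.3, 0.5, 1.0],
]
L_FACTOR = cholesky(D)
rng = random.Random(2023)  # arbitrary seed for reproducibility
effects = [draw_random_effects(L_FACTOR, rng) for _ in range(30)]  # 30 trials
```

Each row of `effects` holds one trial's random intercepts and random treatment effects, to which the fixed effects and participant-level errors would then be added.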

Figure 2. Changes to the joint distribution of Δ S and Δ T dependent on X: (1) Scenario 1: The effects of a binary covariate X on surrogate and outcome are constant across trials, resulting in a mean shift of the overall distribution for different levels of X. (2) Scenario 2: The effects of a binary covariate X on surrogate and outcome differ across trials, resulting in both a mean shift and a variance change for different levels of X.

Figure 3. Changes to the joint distribution of Δ S and Δ T dependent on a continuous covariate X: (1) Scenario 1: The effects of a continuous covariate X on surrogate and outcome are constant across trials, resulting in a mean shift of the overall distribution based on the value of X. (2) Scenario 2: The effects of a continuous covariate X on surrogate and outcome differ across trials, resulting in both a mean shift and a variance change for different values of X.
Elliott et al. use the joint distribution of Δ S i and Δ T i to develop several measures of surrogate paradox risk [10]. To do this, consider the contour plots of the joint distribution in Figure 1. Throughout the paper, we assume, without loss of generality, that the qualitative effects of the treatment on the surrogate marker and true outcome are in the same direction, with positive effects beneficial and negative effects harmful. Each scenario shows the joint distribution of a different set of trials. Based on the location of the joint distribution on the Cartesian plane, we can infer the risk of surrogate paradox occurring. If the distribution falls mostly in the first or third quadrants, there is little risk of surrogate paradox, since Δ S and Δ T give the same qualitative conclusion. However, if the distribution falls in the second or fourth quadrants, the treatment effects on the surrogate and true outcomes are in opposite directions. By calculating the probabilities of the joint distribution falling in each quadrant, Elliott et al. present measures of the risk of surrogate paradox.

2.2. Ψ SP 123 : Estimating the Probability of Avoiding Dangerous Surrogate Paradox
Here, Φ k (x; Θ, Ψ) is the cumulative distribution function of a k-variate normal distribution with mean Θ and variance Ψ. The subscript 13 in Ψ SP 13 refers to the first and third quadrants of the Cartesian plane, the region in which the marker gives a qualitatively correct prediction of the treatment effect.
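$$\mathrm{Cov}(\Delta_{Si}, \Delta_{Ti}) = d_{ab} + x\,d_{adt} + x\,d_{bds} + x^2 d_{dsdt}$$

Thus, Δ S i and Δ T i have the joint distribution:

$$\begin{pmatrix} \Delta_{Si} \\ \Delta_{Ti} \end{pmatrix} \sim N\left(\begin{pmatrix} \beta_S + \delta_S x \\ \beta_T + \delta_T x \end{pmatrix},\, \begin{pmatrix} d_{aa} + 2x\,d_{ads} + x^2 d_{ds} & d_{ab} + x\,d_{adt} + x\,d_{bds} + x^2 d_{dsdt} \\ d_{ab} + x\,d_{adt} + x\,d_{bds} + x^2 d_{dsdt} & d_{bb} + 2x\,d_{bdt} + x^2 d_{dt} \end{pmatrix}\right)$$

This distribution consists of both a mean shift and a change in variance compared with the original, no-subgroup distribution. To visualize this, refer to Scenario 2 in Figures 2 and 3.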
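The covariance expression above follows from writing each trial-level treatment effect as a random intercept plus x times a random slope. A small sketch of how the variances and covariance change with x is given below. The variance formulas are derived under that assumption rather than quoted from the text, and the dictionary keys, function name, and numerical values are purely illustrative.

```python
def treatment_effect_cov(x, d):
    """Variances and covariance of (Delta_S, Delta_T) at covariate value x.

    d maps index labels (e.g. "ab", "adt") to covariance components.
    Assumes Delta_S = beta_S + b_S + (delta_S + d_S) * x and likewise
    for Delta_T, with b and d denoting trial-level random effects.
    """
    var_s = d["aa"] + 2.0 * x * d["ads"] + x * x * d["ds"]
    var_t = d["bb"] + 2.0 * x * d["bdt"] + x * x * d["dt"]
    cov_st = d["ab"] + x * d["adt"] + x * d["bds"] + x * x * d["dsdt"]
    return var_s, var_t, cov_st


# Illustrative values only (not estimates from either data example).
components = {
    "aa": 1.0, "bb": 1.0, "ab": 0.5,
    "ads": 0.3, "bdt": 0.3, "adt": 0.3, "bds": 0.3,
    "ds": 1.0, "dt": 1.0, "dsdt": 0.3,
}
at_reference = treatment_effect_cov(0.0, components)
at_x_one = treatment_effect_cov(1.0, components)
```

At x = 0 the result reduces to the no-covariate components, while at x = 1 both the variances and the covariance grow, which is the variance change depicted for Scenario 2.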
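The change in risk arises from both a mean shift and a change in variance of the overall joint distribution by covariate level. We can use this distribution to construct the four surrogate paradox measures proposed by Elliott et al.

3.2.1. Scenario 2: Ψ SP 13 (x)
Using the new joint distribution, the probability that the (N + 1)th trial will yield treatment effects on the marker and outcome in the same direction is given by

$$\Psi^{SP}_{13}(x) = 1 - \Phi_1\left(0;\, \beta_S + \delta_S x,\, \sigma^2_S(x)\right) - \Phi_1\left(0;\, \beta_T + \delta_T x,\, \sigma^2_T(x)\right) + 2\,\Phi_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix};\, \begin{pmatrix} \beta_S + \delta_S x \\ \beta_T + \delta_T x \end{pmatrix},\, \begin{pmatrix} \sigma^2_S(x) & \sigma_{ST}(x) \\ \sigma_{ST}(x) & \sigma^2_T(x) \end{pmatrix}\right)$$

where Φ k (x; Θ, Ψ) is the cumulative distribution function of a k-variate normal distribution with mean Θ and variance Ψ, and $\sigma^2_S(x)$, $\sigma^2_T(x)$, and $\sigma_{ST}(x)$ denote the variances and covariance of the joint distribution of Δ S i and Δ T i at covariate value x.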
3.2.4. Scenario 2: s Value
In the fourth surrogate paradox measure, Elliott et al. consider the minimum observed beneficial treatment effect for a marker that can reduce the probability that the true treatment effect for the outcome is harmful. When considering covariate subgroups, we can compute O S i for each covariate level and call it O S i (x):

Table 1. Simulation results for 30 trials.

Table 2. Simulation results for 100 trials.

Table 3. Sensitivity to model misspecification: each sensitivity analysis considered 30 simulated trials, each with 50 subjects.

Table 4. Results of application to Collaborative Initial Glaucoma Treatment Study dataset.