Generalized Linear Models with Covariate Measurement Error and Zero-Inflated Surrogates

Abstract: Epidemiological studies often encounter a challenge due to exposure measurement error when estimating an exposure–disease association. A surrogate variable may be available for the true unobserved exposure variable. However, zero-inflated data are encountered frequently in the surrogate variables. For example, many nutrient or physical activity measures may have a zero value (or a low detectable value) among a group of individuals. In this paper, we investigate regression analysis when the observed surrogates may have zero values among some individuals of the whole study cohort. A naive regression calibration that does not take into account the probability mass of the surrogate variable at 0 (or a low detectable value) will be biased. We develop a regression calibration estimator which typically has smaller biases than the naive regression calibration estimator. We also propose an expected estimating equation estimator which is consistent under the zero-inflated surrogate regression model. Extensive simulations show that the proposed estimator performs well in terms of bias correction. These methods are applied to a physical activity intervention study.


Introduction
In biomedical research, regression analysis is an important tool for understanding associations between disease outcomes and risk factors. In practice, however, a risk factor may not be measured precisely. This problem is often called covariate measurement error [1][2][3]. Consider an example in which a biomarker is a risk factor for a disease outcome. The biomarker may have seasonal, daily, or even hourly variation, and a single measurement is prone to measurement error from instrumentation or human error. Hence, the average of an infinite number of biomarker measurements during a specified period of time is a more meaningful covariate than the average of a few observed measurements. However, it is not feasible to make such measurements, and thus studies often rely on single measures at a specific time point with associated measurement error.
Physical activity and nutrient intake are important risk factors for disease incidence and mortality. However, physical activity and nutrient intake data may be measured with error since they are generally self-reported. This issue is important since measurement error in diet or physical activity may attenuate the regression coefficients of exposures by approximately 20% to 50% [4][5][6]. That is, an odds ratio of 1.5 from diet or physical activity may be reduced to the range of 1.22 to 1.38 due to measurement errors in these measures. In addition, an important challenge in this research is that some physical activity or dietary data may have a zero value, such as 0 metabolic equivalent (MET) hours per week from moderate or vigorous physical activity, or 0 alcohol intake. One MET is defined as the amount of oxygen consumed while at rest per kilogram of body weight [7]. A 3 MET activity expends three times the energy used by the body at rest. Hence, if a person does a 3 MET activity for 4 h in a week, he or she has done 12 MET hours of physical activity in that week. A naive method that does not take measurement error into account may lead to biased effect estimation in regression analysis, and the bias is attenuation in most (but not all) cases [8]. A standard bias correction for measurement error that ignores the subset of individuals with a zero exposure value may itself yield biased effect estimates.
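The attenuation arithmetic in the preceding paragraph can be checked directly. The sketch below assumes that the 20% to 50% attenuation acts on the log odds ratio (the regression coefficient); the function name `attenuated_odds_ratio` is illustrative and not taken from the paper.

```python
import math


def attenuated_odds_ratio(true_or: float, attenuation: float) -> float:
    """Odds ratio observed when the log odds ratio shrinks by `attenuation`.

    Assumption: attenuation acts multiplicatively on the regression
    coefficient, i.e. beta_observed = (1 - attenuation) * beta_true.
    """
    return math.exp((1.0 - attenuation) * math.log(true_or))


# 50% and 20% attenuation of a true odds ratio of 1.5
or_low = attenuated_odds_ratio(1.5, 0.50)
or_high = attenuated_odds_ratio(1.5, 0.20)

# MET-hours: activity intensity (in METs) times duration (in hours)
met_hours_per_week = 3 * 4

print(f"attenuated OR range: {or_low:.2f} to {or_high:.2f}")
print(f"MET-hours/week: {met_hours_per_week}")
```

Under this assumption the computed range agrees with the 1.22 to 1.38 quoted in the text.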
One motivating example for our methodology research is covariate measurement error associated with the measurement of physical activity in the APPEAL study (A Program Promoting Exercise and Active Lifestyles; APPEAL: ClinicalTrials.gov NCT00668161) [9]. APPEAL was a year-long randomized controlled trial of moderate-to-vigorous intensity exercise vs. control (no exercise) among 202 healthy, sedentary adults recruited between 2001 and 2004 primarily through physician practices, and randomized to an exercise program (n = 100) or a control group (n = 102). The trial was designed to test the effects of exercise on biomarkers of colon cancer and other physiologic and psychosocial outcomes. Numerous case-control and cohort studies have found an inverse association between physical activity and risk of colon cancer [10]. Physical activities are commonly quantified by determining the energy expenditure in kilocalories or by using the MET of the activity. A question of interest is whether physical activity, measured in MET-hours/week, is associated with C-reactive protein (CRP), a biomarker of inflammation; elevated levels of CRP are associated with the risk of developing colon cancer. The true average of MET-hours/week is an unobserved variable that is the average of an infinite number of MET-hours/week scores. In practice it is not possible to obtain this measure and, thus, the true average of MET-hours/week scores cannot be observed.
In the motivating example given above, two methodological challenges are involved. The first challenge is regression analysis with covariate measurement error, which arises from physical activity (MET-hours/week). The observed error-prone variable is typically called a surrogate variable for the true but unobserved exposure. The second challenge is the zero-inflated surrogate model, because some individuals may have zero MET-hours/week. In some similar research examples, the zero-inflated surrogate issue is also called truncation of the observed surrogates. In our problem, the second challenge (zero-inflated surrogate modeling) is added to the first challenge (covariate measurement error). Methods for covariate measurement error have been well developed. For example, regression calibration (RC) replaces an error-prone covariate with its conditional expectation given the observed covariates [11]. In linear regression, the RC estimator is a consistent estimator for regression coefficients (Buonaccorsi, 2010, Chapter 5) [12]. However, for logistic and Cox regression, it is known that it is not consistent (Carroll, et al., 2006, Chapter 4) [2]. There is further research on refinement of RC for logistic and Cox regression [13,14]. Another general approximation approach for covariate measurement error is the simulation extrapolation (SIMEX) approach [15,16]. An advantage of SIMEX is that it is easy to implement. There are also methods to address the situation when the surrogate variables may be truncated (which is in general the same as zero-inflated surrogate modeling). Tooze et al. investigated a likelihood approach for repeated measures data with clumping at zero [17]. When the observed exposure variables are truncated by a lower limit, the bias in the estimated disease-exposure association due to measurement error and truncation is not always attenuation [18].
As discussed above, there is relatively limited research that addresses the issue of measurement error when some individuals may have a zero value (or lower limit) in the observed surrogates. The main objective of the paper is to develop and apply methods to adjust for measurement error in generalized linear models when the observed surrogates may be truncated at a low value (such as 0) among some individuals. The paper is organized as follows. In Section 2, we describe the statistical models for the problem of interest, and discuss the bias that arises when we apply a naive RC estimator without taking into account the zero-inflated surrogates. In Section 3, we study a regression calibration estimator for this problem. In Section 4, we propose a maximum likelihood estimator via expected estimating equations for this problem. In Section 5, the results from simulation studies are presented. In Section 6, we apply the methods to the APPEAL study data. We discuss the advantages and limitations of the proposed EEE estimator in Section 7. Concluding remarks are given in Section 8.

Statistical Models and Naive RC Estimator
We assume that the total sample size of the study cohort is $n$. The regression model of interest is the generalized linear model. Let $Y_i$ be the response variable, $X_i$ be the unobserved true covariate (dietary intake or physical activity) that cannot be precisely measured, and $Z_i$ be the vector of covariates which is available for all individuals $i = 1, \ldots, n$. For simplicity of presentation, the true unobserved exposure $X$ is assumed to be a scalar throughout this paper. The main interest is to estimate the vector of regression coefficients $\beta \equiv (\beta_0, \beta_1, \beta_2')'$ in the following regression model:
$$E(Y_i \mid X_i, Z_i) = g(\beta_0 + \beta_1 X_i + \beta_2' Z_i), \qquad (1)$$
where $g(\cdot)$ is a specified function. Model (1) contains many important regression models. For example, $g(u) = u$ in linear regression, while $g(u) = (1 + e^{-u})^{-1}$ in logistic regression. The goal of the research is to develop valid estimation methods for the regression coefficients $\beta$. For the true unobserved covariate $X_i$, we assume that there are $k_i$ non-negative surrogate variables $W_{ij} = \eta_{ij} W^*_{ij}$, $j = 1, \ldots, k_i$, where $W^*_{ij} = X_i + U_{ij}$ is the untruncated surrogate, $U_{ij}$ is a mean-zero measurement error with variance $\sigma_u^2$ that is independent of $X_i$, and $\eta_{ij} = I(W^*_{ij} > 0)$ indicates that the surrogate is not truncated at 0. In a covariate measurement error problem when the surrogates are not truncated, replicates $W_{ij}$, $j = 1, \ldots, k_i$, are used to estimate the measurement error variance, where $k_i$ is the number of replicates. We use the notation $\tilde W_i$ for $(W_{i1}, \ldots, W_{ik_i})$, $\tilde W^*_i$ for $(W^*_{i1}, \ldots, W^*_{ik_i})$, and $\tilde\eta_i$ for $(\eta_{i1}, \ldots, \eta_{ik_i})$.
To understand the RC estimator, we consider the special linear regression case $Y_i = \beta_0 + \beta_1 X_i + e_i$, where $e_i$ is a mean-zero random residual term. Assume that $e_i$ is independent of $\tilde W^*_i$. Then
$$E(Y_i \mid \tilde W^*_i) = \beta_0 + \beta_1 E(X_i \mid \tilde W^*_i).$$
From this argument, it is seen that under the special linear regression case above, replacing an unobserved true $X_i$ with $E(X_i \mid \tilde W^*_i)$ will lead to a consistent estimator. This method is often called the RC estimator [2]. In this case, $E(X_i \mid \tilde W^*_i)$ is the calibration function. An alternative calibration function is described in [14]. Let $\mu_x$ and $\sigma_x$ denote the mean and standard deviation of any random variable $X$, respectively. The conditional expectation of the unobserved exposure given the surrogates can be obtained based on a bivariate normal assumption such that
$$E(X_i \mid \bar W^*_i) = \mu_x + \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2 / k_i}\,(\bar W^*_i - \mu_x),$$
where $\bar W^*_i$ is the mean of the $k_i$ replicates. Because the slope in this formula is less than 1, a naive estimator using $\bar W^*_i$ as a replacement for $X_i$ will have an attenuation effect. When $Z$ is in the model, a standard RC estimator replaces $X_i$ with $E(X_i \mid \tilde W^*_i, Z_i)$. This can be done by a multivariate-normal assumption with a conditional mean formula similar to the formula given above. However, a more practical approach is a semiparametric RC approach that assumes a working regression model of
$$E(W^*_{ij} \mid W^*_{ij'}, Z_i) = \alpha_0 + \alpha_1 W^*_{ij'} + \alpha_2' Z_i,$$
where $j \neq j' = 1, \ldots, k$, and $(\alpha_0, \alpha_1, \alpha_2')'$ is the vector of regression coefficients. This semiparametric RC estimator does not require multivariate normality of the observed surrogates and covariates [19,20].
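The attenuation induced by replacing the true exposure with the mean of untruncated replicates can be illustrated numerically. The sketch below assumes a classical additive error model ($W^*_{ij} = X_i + U_{ij}$ with independent normal $X_i$ and $U_{ij}$) and no truncation; the function names are illustrative, and the parameter values are borrowed from the simulation settings reported later in the paper.

```python
def reliability_ratio(sigma_x: float, sigma_u: float, k: int) -> float:
    """Slope of E(X | mean of k replicates) under classical normal error.

    Assumption: W*_j = X + U_j, with X ~ N(mu_x, sigma_x^2) independent
    of the errors U_j ~ N(0, sigma_u^2), and no truncation.
    """
    return sigma_x**2 / (sigma_x**2 + sigma_u**2 / k)


def rc_calibration(w_bar: float, mu_x: float, sigma_x: float,
                   sigma_u: float, k: int) -> float:
    """Calibration function E(X | W-bar) for the mean of k replicates."""
    lam = reliability_ratio(sigma_x, sigma_u, k)
    return mu_x + lam * (w_bar - mu_x)


# Two replicates with sigma_x = 1 and sigma_u = 0.707
lam = reliability_ratio(sigma_x=1.0, sigma_u=0.707, k=2)

# In linear regression the naive slope converges to lam * beta_1
beta_1 = 1.0
naive_slope_limit = lam * beta_1

print(f"reliability ratio: {lam:.3f}")
print(f"limit of the naive slope when beta_1 = 1: {naive_slope_limit:.3f}")
```

With these values the reliability ratio is about 0.8, so in the absence of truncation the naive slope would be attenuated by roughly 20%.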
However, in our problem, the observed surrogates are the truncated $W_{ij}$ rather than $W^*_{ij}$. Using the $W_{ij}$ data will likely overestimate $\mu_x$ but underestimate $\sigma_x$. For linear regression with truncated surrogates, standard RC may be biased because $E(X_i \mid \tilde W_i)$ will be different from $E(X_i \mid \tilde W^*_i)$. One naive approach is to use the observed $W_{ij}$ as $W^*_{ij}$, without taking into consideration the truncated surrogates, to calculate the RC estimator. We call this estimator a naive RC (NRC) estimator. As discussed above, the NRC estimator is biased even when the main regression model is linear. The asymptotic variance of the NRC estimator can be obtained by a sandwich variance estimator, where the vector of the estimating equations is obtained by stacking the estimating equations for $\beta$ and for the nuisance parameters involved in the calculation of the calibration function $E(X_i \mid \tilde W^*_i, Z_i)$ (noting that the NRC estimator assumes $\tilde W_i$ is the same as $\tilde W^*_i$). However, if there are many covariates in the model for the calibration function, then it is computationally easier to use bootstrap variance estimation to obtain the standard errors.
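The direction of the distortion caused by truncation can be seen from the moments of a single truncated surrogate. The sketch below assumes $W^* = X + U$ with independent normal $X$ and $U$, and $W = \max(W^*, 0)$, and uses the standard closed-form moments of a normal variable censored at zero. The function name and parameter values are illustrative.

```python
import math
from statistics import NormalDist


def truncated_surrogate_moments(mu_x: float, sigma_x: float, sigma_u: float):
    """Mean and variance of W = max(W*, 0) with W* ~ N(mu_x, sigma_x^2 + sigma_u^2).

    Assumption: classical additive normal error and truncation at 0.
    """
    std = NormalDist()
    s = math.sqrt(sigma_x**2 + sigma_u**2)  # standard deviation of W*
    a = mu_x / s
    p_nonzero = std.cdf(a)                  # pr(W* > 0)
    mean_w = mu_x * p_nonzero + s * std.pdf(a)
    second_moment = (mu_x**2 + s**2) * p_nonzero + mu_x * s * std.pdf(a)
    return mean_w, second_moment - mean_w**2


mu_x, sigma_x, sigma_u = 1.5, 1.0, 0.707
mean_w, var_w = truncated_surrogate_moments(mu_x, sigma_x, sigma_u)
var_w_star = sigma_x**2 + sigma_u**2

print(f"E(W)   = {mean_w:.3f}  versus  E(W*)   = {mu_x:.3f}")
print(f"var(W) = {var_w:.3f}  versus  var(W*) = {var_w_star:.3f}")
```

Under these assumptions the truncated surrogate has a larger mean and a smaller variance than the untruncated one, which is the direction of distortion described above.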

Regression Calibration for Zero-Inflated Surrogates
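The NRC estimator described in the previous section does not take into account zero values due to truncation. Now, we consider calibration based on truncated surrogates due to zero values. To understand the method, we first consider a linear regression model $Y_i = \beta_0 + \beta_1 X_i + \beta_2' Z_i + e_i$, where $e_i$ has mean 0 and is independent of $(X_i, Z_i, \tilde W_i)$. Then $E(Y_i \mid \tilde W_i, Z_i) = \beta_0 + \beta_1 E(X_i \mid \tilde W_i, Z_i) + \beta_2' Z_i$, so replacing $X_i$ with $E(X_i \mid \tilde W_i, Z_i)$ in the regression analysis may be a valid approach. Let $\hat X_i \equiv E(X_i \mid \tilde W_i, Z_i)$. The estimating equation for the RC estimator can be expressed as
$$\sum_{i=1}^{n} (1, \hat X_i, Z_i')' \{Y_i - g(\beta_0 + \beta_1 \hat X_i + \beta_2' Z_i)\} = 0. \qquad (2)$$
Hence, when the mean of $Y_i$ given $(X_i, Z_i)$ is linear, we have the following result:
Proposition 1. Assume the surrogate variables $W^*_{ij}$, $j = 1, \ldots, k_i$, may be truncated by a lower limit, and the truncation indicator $\tilde\eta_i$ is independent of $Y_i$ given $(X_i, Z_i)$. Assume $Y_i = \beta_0 + \beta_1 X_i + \beta_2' Z_i + e_i$, where $e_i$ has mean 0 and is independent of $X_i$ and $Z_i$. Then the RC estimator solving (2) is a consistent estimator of $\beta$.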
The proof of Proposition 1 is given in Appendix A. We note that because of the surrogate assumption, the measurement errors $U_{ij}$ and $e_i$ are independent, which is needed to ensure that estimating Equation (2) is unbiased. Hence, for linear regression with zero-inflated surrogates, the RC estimator is consistent. However, when the mean function of $Y_i$ given $(X_i, Z_i)$ is not linear, the RC estimator may be biased since the expectation of the estimating score will no longer be zero. For logistic regression, $\mathrm{pr}(Y_i = 1 \mid \tilde W_i, Z_i) = E\{H(\beta_0 + \beta_1 X_i + \beta_2' Z_i) \mid \tilde W_i, Z_i\}$, where $H(u) = (1 + e^{-u})^{-1}$, and this in general differs from $H\{\beta_0 + \beta_1 E(X_i \mid \tilde W_i, Z_i) + \beta_2' Z_i\}$. Although the RC estimator is not consistent, it can be considered an improvement over the NRC estimator described in the last section. The calibration function can be calculated based on the likelihood function. We use the notation $L(X)$ to denote a likelihood function for any random variable $X$, and $L(Y \mid X)$ to denote a conditional likelihood function of $Y$ given $X$, for any two random variables $X$ and $Y$. Generally, the conditional calibration function can be calculated by the following:
$$E(X_i \mid \tilde W_i, Z_i) = \frac{\int x \, L(\tilde W_i \mid x, Z_i) \, L(x \mid Z_i) \, dx}{\int L(\tilde W_i \mid x, Z_i) \, L(x \mid Z_i) \, dx}. \qquad (3)$$
From the argument given above, the RC estimator can be obtained by replacing an unobserved $X_i$ with $E(X_i \mid \tilde W_i, Z_i)$ based on (3). The asymptotic variance of the RC estimator can be obtained by a stacked sandwich estimator that is similar to the one for the NRC estimator described in the last section, or by bootstrap variance estimation.
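A likelihood-based calibration function for a truncated surrogate can be computed by numerical integration. The sketch below is a simplified instance under stated assumptions: one surrogate per individual, no covariate $Z$, $X \sim N(\mu_x, \sigma_x^2)$, $W^* = X + U$ with $U \sim N(0, \sigma_u^2)$, and $W = \max(W^*, 0)$. The grid-based integration and all function names are illustrative choices, not the authors' implementation.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def calibration(w: float, mu_x: float = 1.5, sigma_x: float = 1.0,
                sigma_u: float = 0.707, n_grid: int = 4001) -> float:
    """E(X | W = w) when the surrogate is truncated at 0, via a Riemann sum."""
    lo = mu_x - 8.0 * sigma_x
    step = 16.0 * sigma_x / (n_grid - 1)
    num = den = 0.0
    for m in range(n_grid):
        x = lo + m * step
        prior = STD.pdf((x - mu_x) / sigma_x) / sigma_x
        if w > 0:
            # density of an untruncated surrogate at w given X = x
            lik = STD.pdf((w - x) / sigma_u) / sigma_u
        else:
            # probability mass at zero: pr(W* <= 0 | X = x)
            lik = STD.cdf(-x / sigma_u)
        weight = lik * prior * step
        num += x * weight
        den += weight
    return num / den


mu_x, sigma_x, sigma_u = 1.5, 1.0, 0.707
s = math.sqrt(sigma_x**2 + sigma_u**2)
lam = sigma_x**2 / s**2

# Closed forms under the same normal assumptions, used as a check
closed_form_zero = mu_x - (sigma_x**2 / s) * STD.pdf(mu_x / s) / STD.cdf(-mu_x / s)
naive_at_zero = mu_x + lam * (0.0 - mu_x)  # treats W = 0 as an exact W* = 0

print(f"E(X | W = 0), truncation-aware: {calibration(0.0):.3f}")
print(f"E(X | W* = 0), ignoring truncation: {naive_at_zero:.3f}")
```

For a positive surrogate the result coincides with the usual linear calibration, while at zero the truncation-aware value is noticeably lower than the value obtained by treating the zero as an exact measurement.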

Expected Estimating Equation Estimator
We now develop another approach to this problem via maximum likelihood (ML) estimation. We first take a different viewpoint linking ML estimation and the conditional expectation of the full data estimating equation, namely, the estimating equation when there is no measurement error. The full data likelihood, $L(Y_i \mid X_i, Z_i)$, is the likelihood function of $Y_i$ given $(X_i, Z_i)$. The full data estimating equation for $\beta$ can be expressed as
$$\sum_{i=1}^{n} \phi(Y_i, X_i, Z_i, \beta) \equiv \sum_{i=1}^{n} \frac{\partial}{\partial \beta} \log L(Y_i \mid X_i, Z_i) = 0.$$
Because the true $X_i$ is not observed, the full data estimating equation cannot be directly applied to the data. With the observed data, the estimating score will be from the likelihood of $Y_i$ given $Z_i$ and $\tilde W_i$, denoted by $L(Y_i \mid \tilde W_i, Z_i) = \int L(Y_i \mid x, Z_i) \, L(x \mid \tilde W_i, Z_i) \, dx$, for which
$$\frac{\partial}{\partial \beta} \log L(Y_i \mid \tilde W_i, Z_i) = E\left\{ \frac{\partial}{\partial \beta} \log L(Y_i \mid X_i, Z_i) \,\Big|\, Y_i, \tilde W_i, Z_i \right\}.$$
From the equations given above, the likelihood-based score of the observed data can be obtained by the conditional expectation of the likelihood-based score of the full data given the observed data. That is, the estimating score for an individual can be expressed as $E\{\phi(Y_i, X_i, Z_i, \beta) \mid Y_i, \tilde W_i, Z_i\}$, which is the observed data estimating score. The ML estimator can be obtained from the idea of expected estimating equations [21]. Therefore, the ML estimator can be obtained by solving
$$\sum_{i=1}^{n} E\{\phi(Y_i, X_i, Z_i, \beta) \mid Y_i, \tilde W_i, Z_i\} = 0. \qquad (4)$$
In general, $\phi(Y_i, X_i, Z_i, \beta)$ does not need to be the full data likelihood-based estimating score. It can be any estimating equation that satisfies $E\{\phi(Y_i, X_i, Z_i, \beta)\} = 0$. For example, it can be a weighted estimating equation of the ML estimator. The estimator solving (4) is the expected estimating equation (EEE) estimator for $\beta$. Let Equation (4) be denoted by $S(\beta, X, Z) = 0$. Let the EEE estimator be denoted by $\hat\beta_{eee}$. The asymptotic distribution of $\hat\beta_{eee}$ can be presented as the following result:
Proposition 2. Assume $E\{\phi(Y_i, X_i, Z_i, \beta)\} = 0$, and the surrogate variables $W^*_{ij}$, $j = 1, \ldots, k_i$, may be truncated by a lower limit, and the truncation indicator $\tilde\eta_i$ is conditionally independent of $Y_i$ given $(X_i, Z_i)$. The EEE estimator solving (4) is consistent for $\beta$. Furthermore, $n^{1/2}(\hat\beta_{eee} - \beta)$ is asymptotically normal with mean 0 and asymptotic variance given in Appendix A.
The proof of Proposition 2 is given in Appendix A. The EEE in (4) can be calculated by the following:
$$E\{\phi(Y_i, X_i, Z_i, \beta) \mid Y_i, \tilde W_i, Z_i\} = \frac{\int \phi(Y_i, x, Z_i, \beta) \, L(Y_i \mid x, Z_i) \, L(\tilde W_i \mid x, Z_i) \, L(x \mid Z_i) \, dx}{\int L(Y_i \mid x, Z_i) \, L(\tilde W_i \mid x, Z_i) \, L(x \mid Z_i) \, dx}.$$
The asymptotic variance of the EEE estimator solving (4) for $\beta$ can be obtained by a sandwich variance estimator. The vector of the estimating equations is obtained by stacking two sets of estimating equations. The first set is the estimating equations for $\beta$, and the second set is for the nuisance parameters involved in the conditional distribution of $Y_i$ given $(Z_i, \tilde W_i)$. Alternatively, bootstrap variance estimation can be used to obtain the standard errors of the EEE estimator.
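The conditional expectation of the full data score for one individual can be approximated on a grid. The sketch below is a simplified illustration, not the authors' implementation. It assumes logistic regression with a scalar exposure and no $Z$, a single surrogate $W = \max(X + U, 0)$, normal $X$ and $U$ with known nuisance parameters, and the logistic likelihood score $(1, x)'\{y - H(\beta_0 + \beta_1 x)\}$ as the full data estimating function. All names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def expit(u: float) -> float:
    """Logistic distribution function H(u)."""
    return 1.0 / (1.0 + math.exp(-u))


def expected_score(y: int, w: float, beta, mu_x: float = 1.5,
                   sigma_x: float = 1.0, sigma_u: float = 0.707,
                   n_grid: int = 4001):
    """E{phi(Y, X, beta) | Y = y, W = w}, approximated by a Riemann sum."""
    b0, b1 = beta
    lo = mu_x - 8.0 * sigma_x
    step = 16.0 * sigma_x / (n_grid - 1)
    num0 = num1 = den = 0.0
    for m in range(n_grid):
        x = lo + m * step
        p = expit(b0 + b1 * x)
        lik_y = p if y == 1 else 1.0 - p               # L(y | x)
        if w > 0:
            lik_w = STD.pdf((w - x) / sigma_u) / sigma_u  # density of W given x
        else:
            lik_w = STD.cdf(-x / sigma_u)              # mass at zero given x
        # the constant factor step / sigma_x cancels in the ratio
        weight = lik_y * lik_w * STD.pdf((x - mu_x) / sigma_x)
        resid = y - p
        num0 += resid * weight
        num1 += x * resid * weight
        den += weight
    return num0 / den, num1 / den


beta = (0.0, math.log(2.0))
print(expected_score(1, 2.0, beta))   # positive surrogate
print(expected_score(0, 0.0, beta))   # surrogate truncated at zero
```

Summing such terms over individuals and solving for the root (for example by Newton's method) would give an EEE-type estimate under these assumptions. As a sanity check, when the measurement error is very small and the surrogate is positive, the expected score approaches the full data score evaluated at $x = w$.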

Simulation Study
We conducted a simulation study to examine the finite sample performance of the NRC, RC, and EEE estimators compared with the naive estimator that used $W_i$ in place of $X_i$. In Table 1, we illustrate the situation when the regression model is linear and the observed surrogates may have a zero value among some individuals. That is, the observed surrogates were truncated at $c = 0$ in the simulations. In this table, each individual's true covariate is $X_i$. We first generated $X_i$, $i = 1, \ldots, n$, from a normal distribution with $\mu_x = 1.5$ and $\sigma_x = 1$, where the sample size was $n = 500$ or $n = 1000$. We generated two replicates $W^*_{i1}$ and $W^*_{i2}$ for the unobserved $X_i$ with $\sigma_u = 0.707$. The percentage of non-zero $W_{ij}$ was $\eta = 89\%$; that is, 11% of the $W_{ij}$ were truncated at 0. We also considered the situations when $\sigma_u = 1$, $1.5$, and $\sqrt{3}$, in which the percentages of non-zero surrogates were $\eta = 86\%$, $80\%$, and $77\%$, respectively. The outcomes were generated based on linear regression with coefficients $\beta_0 = 0.5$ and $\beta_1 = 1$, and the residuals were from a standard normal distribution. In Tables 1-4, "bias" was the average of the biases of the regression coefficient estimates over the 500 simulation replicates, "SD" was the sample standard deviation of the estimates, and "ASE" was the average of the estimated standard errors of the estimates. The 95% confidence interval coverage probabilities (CP) were also obtained. The standard errors of the estimates were obtained from sandwich variance estimation. From the results in Table 1, the NRC estimator was not much better than the naive estimator. The NRC improved little over the naive estimator because of the truncated $W$ values. The RC and EEE estimators were consistent with limited biases under this setting, and hence, they were better than the naive and NRC estimators. Under this setting, the RC and EEE estimators were very comparable.
We considered non-normal $X$ in Table 2 to investigate whether the estimators were sensitive to the normality assumption in the calculation. We also examined the sensitivity of the estimators to misspecification of the measurement error distribution. In the upper portion of Table 2, the unobserved $X$ was generated from a mixture of two normal distributions, one with mean 2.5 and variance 1 and the other with mean 1 and variance 0.25, with mixture proportions (1/3, 2/3). The results from the upper portion of the table were similar to those of Table 1, except that there were small biases from the RC and EEE estimators. We found that the RC and EEE estimators showed small biases when the unobserved exposure had a skewed distribution, but the bias was not too large in general. Nevertheless, the RC and EEE estimators were still better than the NRC and naive estimators in this situation. In the lower portion of Table 2, we considered the situation when $X$ was normal but the measurement error was from a location/scale-transformed chi-squared distribution or a mixture of two normal distributions. The specification of the mixture of two normal distributions was the same as the mixture of normal distributions given above. The location/scale-transformed chi-squared distribution had mean 0 and variance $\sigma_u^2$ after a chi-squared random variable was location/scale-transformed. From the sensitivity analysis, the RC and EEE estimators were not sensitive to a mild violation due to a mixture of normal distributions since the biases were small. However, the estimators may be sensitive to violation of the normality assumption when the true distribution is very skewed, as for chi-squared distributions. The biases were moderate, rather than small, when the errors were from chi-squared distributions.
In Table 3, the data were generated similarly to those in Table 1, but the main model was logistic regression such that $\mathrm{pr}(Y_i = 1 \mid X_i) = H(\beta_0 + \beta_1 X_i)$, where $H(u) = (1 + e^{-u})^{-1}$ and the regression coefficients were $\beta = (0, \ln(2))$ and $\beta = (0, \ln(3))$, respectively. The findings were similar to those from Table 1 for the situation when $\beta = (0, \ln(2))$. The biases of the RC and EEE estimators were very small. Although RC is not consistent, it may have limited biases if the relative risk parameter is small to moderate, such as $\beta_1 = \ln(1.5)$ or $\beta_1 = \ln(2)$ when the exposure's standard deviation is about 1. However, when $\beta_1 = \ln(3)$, the biases of the RC estimator were larger than those of the EEE estimator. The reason is that the RC estimator's bias increases as the relative risk parameter becomes large. The findings are typically similar to those for measurement error in longitudinal data and survival analysis with covariate measurement error [20,21].
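The non-zero percentages quoted for the simulation settings can be checked in closed form. The sketch below assumes that each untruncated replicate is $W^* = X + U$ with independent normal components, so that $\mathrm{pr}(W^* > 0) = \Phi\{\mu_x / (\sigma_x^2 + \sigma_u^2)^{1/2}\}$; for the normal mixture exposure the probability is averaged over the two components. The function name is illustrative.

```python
import math
from statistics import NormalDist


def prob_nonzero(mu_x: float, sigma_x: float, sigma_u: float) -> float:
    """pr(X + U > 0) for independent X ~ N(mu_x, sigma_x^2), U ~ N(0, sigma_u^2)."""
    return NormalDist().cdf(mu_x / math.sqrt(sigma_x**2 + sigma_u**2))


# Normal exposure with mu_x = 1.5, sigma_x = 1 and increasing error SD
error_sds = (0.707, 1.0, 1.5, math.sqrt(3.0))
normal_case = [round(100 * prob_nonzero(1.5, 1.0, su)) for su in error_sds]

# Mixture exposure: 1/3 N(2.5, 1) + 2/3 N(1, 0.25), with sigma_u = 0.707
mixture_case = round(100 * ((1 / 3) * prob_nonzero(2.5, 1.0, 0.707)
                            + (2 / 3) * prob_nonzero(1.0, 0.5, 0.707)))

print("normal X, % non-zero:", normal_case)
print("mixture X, % non-zero:", mixture_case)
```

Under these assumptions the computed percentages match the reported values of 89%, 86%, 80%, and 77% for normal $X$, and 91% for the mixture setting.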
In Table 4, we investigated the situation when both $X$ and $Z$ were included in a linear regression model. We first generated $X_i$, $i = 1, \ldots, n$, and two replicates $W_{i1}$ and $W_{i2}$ in the same way as those in Table 1. The covariates $Z_i$, $i = 1, \ldots, n$, were generated via $Z_i = \rho X_i / \sigma_x + \sqrt{1 - \rho^2}\, V_i / \sigma_z$, where the $V_i$ were from $N(0, \sigma_z^2)$ and independent of $X_i$, with $\sigma_z^2 = 1$ and $\rho = 0.2$. The outcomes were generated via $Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + e_i$, where $\beta_0 = 0.5$, $\beta_1 = 1$, and $\beta_2 = -1$. The residuals $e_i$, $i = 1, \ldots, n$, were generated from a standard normal distribution and were independent of $X_i$ and $Z_i$. The findings were mostly similar to those from Table 1. That is, the naive and NRC estimators had large biases while the RC and EEE estimators were consistent with limited biases.
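The covariate construction used for Table 4 can be checked by simulation. The sketch below reads the generating formula as $Z_i = \rho X_i/\sigma_x + \sqrt{1-\rho^2}\,V_i/\sigma_z$ (the square root is an assumption about the intended formula), which gives $Z$ unit variance and correlation $\rho$ with $X$. The function name, seed, and sample size are illustrative.

```python
import math
import random


def simulate_xz(n: int, rho: float = 0.2, mu_x: float = 1.5,
                sigma_x: float = 1.0, sigma_z: float = 1.0, seed: int = 0):
    """Generate (X, Z) pairs with corr(X, Z) = rho under the assumed formula."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        x = rng.gauss(mu_x, sigma_x)
        v = rng.gauss(0.0, sigma_z)
        z = rho * x / sigma_x + math.sqrt(1.0 - rho**2) * v / sigma_z
        xs.append(x)
        zs.append(z)
    return xs, zs


def sample_corr_and_var(xs, zs):
    """Sample correlation of (X, Z) and sample variance of Z."""
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    return sxz / math.sqrt(sxx * szz), szz / (n - 1)


xs, zs = simulate_xz(20000)
corr_xz, var_z = sample_corr_and_var(xs, zs)
print(f"sample corr(X, Z) = {corr_xz:.3f}, sample var(Z) = {var_z:.3f}")
```

The sample correlation should be close to 0.2 and the sample variance of $Z$ close to 1, up to Monte Carlo error.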

Analysis of APPEAL Data
The design of the APPEAL study was briefly reviewed in the Introduction. In this section, we investigate the association between physical activity, measured via MET hours per week, and CRP. The outcome variable of interest is the CRP value at baseline. In the APPEAL study, MET hours per week and other data including biomarkers were collected at both baseline and 12 months (end of study). In the control group, who did not receive the exercise intervention, physical activity levels did not change significantly between baseline and 12 months. Hence, it seems reasonable to assume that the two MET-hours/week scores at baseline and 12 months in the control group (n = 102) can be treated as replicates. The MET-hours/week data for the exercise intervention group at 12 months were not included in the analysis, as the MET-hours/week value changed significantly between baseline and 12 months for study participants randomized to the exercise intervention. As such, these values cannot be treated as replicates. The MET-hours/week scores at baseline and 12 months are surrogate variables (replicates, control arm only) for an unobserved true MET-hours/week score of an individual (the unobserved underlying average over a period of time). This true average measures the actual physical activity and cannot be observed. In addition to MET-hours/week, age at baseline was another covariate in the regression analysis.
We first investigated the association between MET-hours/week and CRP at baseline. A scatterplot and a fitted kernel smoother of MET-hours/week and CRP at baseline are shown in the upper portion of Figure 1. The lower portion of Figure 1 shows the scatterplot and a fitted kernel smoother of log(MET+1) and log(CRP) at baseline. We excluded 26 individuals with missing data or outliers (defined as values larger than median + 3 × interquartile range) for CRP. Hence, a total of 176 individuals were included in the data analysis. The percentage of non-zero log(MET+1) was 67% at baseline and 68% at 12 months. In our regression analysis, we used the log-transformed data since the transformed data were less skewed. In this section, the data analysis involved applying our methods to the regression of CRP on physical activity (MET-hours/week) and age. The data application here is primarily for the purpose of demonstrating our new methods. The regression coefficients were estimated based on the naive, NRC, RC, and EEE estimators. The results are given in Table 5. All four estimators showed that MET was negatively associated with the inflammatory marker CRP, but the association was not statistically significant.
From the naive estimator, when log(MET+1) increased by 1 unit, log(CRP), on average, decreased by about 0.07. From the NRC, RC, and EEE estimates, when log(MET+1) increased by 1 unit, log(CRP), on average, decreased by about 0.1. The standard errors from the NRC, RC, and EEE estimates were larger than those from the naive estimates. This reflects the general bias-efficiency trade-off that has been reported in the measurement error literature, and is consistent with the findings from our simulations. Furthermore, all four estimates demonstrated a significant effect of age on CRP. On average, an increase of 10 years in age was associated with an increase of approximately 0.15 in log(CRP).
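Because the analysis is on the log scale, the coefficients can be translated into multiplicative changes in CRP. The sketch below assumes natural logarithms, an outcome of log(CRP), and a predictor of log(MET+1), and uses the approximate coefficient values quoted in the text rather than the exact entries of Table 5. The variable names are illustrative.

```python
import math

# Approximate coefficients quoted in the text (assumed natural-log scale)
coef_log_met = -0.1        # change in log(CRP) per unit of log(MET + 1)
coef_age_per_10y = 0.15    # change in log(CRP) per 10 years of age

# Doubling (MET + 1) adds log(2) to log(MET + 1), so CRP is multiplied by
# exp(coef * log(2)) = 2 ** coef.
crp_ratio_doubling_met = 2.0 ** coef_log_met

# Ten additional years of age multiply CRP by exp(coef_age_per_10y).
crp_ratio_10y_age = math.exp(coef_age_per_10y)

print(f"CRP ratio when (MET + 1) doubles: {crp_ratio_doubling_met:.3f}")
print(f"CRP ratio per 10 years of age:    {crp_ratio_10y_age:.3f}")
```

Under these assumptions, doubling MET+1 corresponds to roughly a 7% lower CRP, and 10 additional years of age to roughly a 16% higher CRP.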

Discussion
In this paper, we propose an EEE estimator for generalized linear models with covariate measurement error when the surrogate variables may have zero values among a subset of individuals. Our work is also applicable to other settings in which an exposure may be truncated. Our numerical studies show that RC is better than the naive and NRC estimators in general, but it may be biased in some situations. Overall, the EEE estimator has smaller biases. There is a trade-off between bias and efficiency: the EEE estimator has a larger standard error. One limitation of the proposed EEE estimator is that it may be biased if the likelihood function of the exposure variable is misspecified. Our simulation results demonstrate that the biases are moderate if the exposure distribution is not too skewed. Future research is needed to develop a non-parametric approach that does not require specifying the exposure variable distribution [22].
In addition to physical activity or dietary data, biomarker measurements are important for the early detection and monitoring of disease progression. The methods developed in this paper can be applied to biomarker data. When a biomarker is truncated due to a detection limit, decisions are required concerning how to handle values at or below the threshold in order to avoid biasing the parameter estimates. In addition, biomarkers are often measured with error for many reasons, such as imperfect laboratory conditions, analytic variability of the assay, or temporal variability within individuals. The statistical modeling of zero-inflated surrogates in this paper can be applied to the situation when biomarker data are truncated due to a detection limit. Further research is needed for the situation when longitudinal biomarker, physical activity, or dietary data are available over time [23][24][25].

Conclusions
We have developed an EEE approach for regression analysis with covariate measurement error when the surrogates may be truncated. One limitation of our proposed EEE estimator is that it is not consistent if the covariate distribution or the measurement error distribution is misspecified. In our simulations, the covariates and measurement errors are from normal distributions. Our simulation results demonstrate that if the misspecification is not too extreme, then the bias is typically small. Hence, if the covariates are skewed, then an appropriate transformation of the data (such as a logarithmic transformation) may reduce the skewness. Then the proposed EEE estimator may work well with likely minimal biases.

Table 1. Simulation study for linear regression with truncated surrogates.

Table 2. Simulation study for linear regression with truncated surrogates; misspecified distribution for covariate X or measurement error.
Upper panel: X is from a mixture of two normal distributions and the error is normal; µx = 1.5, σx = 1, σu = 0.707, η = 91%.

Table 3. Simulation study for logistic regression with truncated surrogates.

Table 4. Simulation study for linear regression model with truncated surrogates; covariates are X and Z.
NOTE: Naive is an estimator that uses the average of two replicates as the covariate, RC is the usual RC estimator that uses E(X | W, Z) as the covariate, CRC is a conditional RC estimator that uses E(X | W, Z, η) as the covariate, and EEE is the expected estimating equation estimator described.

Table 5. Analysis results of data from the APPEAL study.
Note: See the footnote of Table 1 for notation. The percentages of non-zero log(1+MET) were 66.7% and 67.8% at baseline and 12 months among the participants in the control group, respectively. The total sample size in the analysis was 176.