A Discriminant Function Approach to Adjust for Processing and Measurement Error When a Biomarker is Assayed in Pooled Samples

Pooling biological specimens prior to performing expensive laboratory assays has been shown to be a cost effective approach for estimating parameters of interest. In addition to requiring specialized statistical techniques, however, the pooling of samples can introduce assay errors due to processing, possibly in addition to measurement error that may be present when the assay is applied to individual samples. Failure to account for these sources of error can result in biased parameter estimates and ultimately faulty inference. Prior research addressing biomarker mean and variance estimation advocates hybrid designs consisting of individual as well as pooled samples to account for measurement and processing (or pooling) error. We consider adapting this approach to the problem of estimating a covariate-adjusted odds ratio (OR) relating a binary outcome to a continuous exposure or biomarker level assessed in pools. In particular, we explore the applicability of a discriminant function-based analysis that assumes normal residual, processing, and measurement errors. A potential advantage of this method is that maximum likelihood estimation of the desired adjusted log OR is straightforward and computationally convenient. Moreover, in the absence of measurement and processing error, the method yields an efficient unbiased estimator for the parameter of interest assuming normal residual errors. We illustrate the approach using real data from an ancillary study of the Collaborative Perinatal Project, and we use simulations to demonstrate the ability of the proposed estimators to alleviate bias due to measurement and processing error.


Introduction
Epidemiological studies in general and environmental health-oriented research in particular often require conducting expensive laboratory assays of biospecimens to measure continuous biomarker concentrations that may be associated with adverse outcomes. In such studies, the physical pooling of samples prior to performing assays can be an effective design strategy aimed at reducing lab assay costs, ensuring or preserving sufficient specimen volumes, or minimizing the potential for measurements below a limit of detection. Specialized statistical techniques are required in order to extract the pertinent information from pooled specimens in a valid and efficient manner [1][2][3][4][5][6][7].
To effectively promote the concept of pooling in epidemiology, it is necessary to consider its application in the context of regression analysis (e.g., [8][9][10][11][12][13][14][15]). In this direction, Weinberg and Umbach [8] considered logistic regression with the goal of associating continuous levels of an exposure variable (measured in pools) with a binary outcome, adjusting for covariates. In discussing practical aspects of the approach, they anticipated the potential need to consider assay measurement errors in studies that utilize pooling. Along these lines, Schisterman et al. [7] proposed a framework for modeling measurement error as a feature of individual and pooled assay results, as well as pool processing error. The latter was assumed to apply only to assays made on pooled samples, resulting from the physical manipulations of individual specimens required to allocate them to pools. The prior considerations in [7] were limited to settings in which the objective is to estimate the unadjusted mean and variance of a biomarker assessed in pools.
In this paper, we focus on estimating the adjusted log odds ratio (log OR) relating a continuous exposure variable (X, measured in pools) to a binary outcome (Y), adjusting for covariates (C). As in prior related work, we assume each member of a given pool contributes the same sample aliquot volume and the lab assay is expected to return the arithmetic mean (equivalently, the sum) of biomarker concentrations across members of each pool [8,13]. The complication is that we seek to formally account for the fact that measurement error, processing error, or both may be incurred when applying the lab assay to measure X. Our approach relies upon discriminant function analysis (e.g., [16,17]), together with a prior paradigm for modeling sources of error [7]. We note that there is precedent for adopting the discriminant function approach to covariate measurement error problems [18,19]; however, this is to our knowledge the first attempt to apply it to analyses involving bioassay data obtained on pools. Our specific strategy utilizes a variant on classical discriminant function analysis, in which one assumes normal errors in a multiple linear regression model as opposed to multivariate normality of the exposure variable and any covariates [20]. Section 2 details the proposed methods, which are then applied in Section 3 to a motivating dataset stemming from a substudy of the Collaborative Perinatal Project [21,22] where the goal is to estimate the covariate-adjusted association between cytokine levels and the risk of spontaneous abortion. We study empirical properties of the proposed estimators using simulated data in Section 4, and discuss implications and future work in Section 5.

Models for Individual-Level Data without Measurement or Processing Error
Initially, assume the standard scenario in which individual-level data are available for a binary outcome (Y), a continuous exposure of interest (X), and for a set of covariates (C). The parameter of primary interest is the adjusted exposure log odds ratio (OR), commonly captured by the coefficient β in the following standard logistic regression model (Equation (1)): (i = 1,…, k; j = 1,…, gi; t = 1,…, T). Here pij = Pr (Yij = 1) where Yij is the binary outcome for the jth member of the ith of k eventual pools, and gi is the number of specimens included in the ith pool (we discuss pooling further in the next section).
Prior to the widespread use of logistic regression, a discriminant function approach provided a convenient method to estimate the desired adjusted log OR (e.g., [16,17]). While this method has historically been less widely applied due to a requirement for additional distributional assumptions on (X, C | Y), Lyles, Guo and Hill [20] recently revisited this approach and demonstrated that the adjusted exposure log OR of interest can be efficiently estimated. In addition, their adaptation of the discriminant function approach required only a univariate distributional assumption for the errors in the following standard multiple linear regression (MLR) model (Equation (2)): In particular, assuming that ) σ N(0, ε 2 iid ij , the same adjusted log OR targeted by β in Equation (1) is defined as the ratio β*/σ 2 under Equation (2). When normality of the errors in Equation (2) holds, a uniformly minimum variance (UMVU) estimator for the log OR is available [20]. Further, simulation results [20] demonstrated that such a discriminant function-based estimator can be more accurate and precise in small samples than the standard maximum likelihood estimator (MLE) based on Equation (1).

Models for Pooled Data without Measurement or Processing Error
The first methodology designed to estimate the adjusted OR of interest when the exposure (X) is measured in pooled samples was introduced by Weinberg and Umbach [8]. They showed that if pools are homogeneous with respect to Y (i.e., all members of pool i have yij = 0 or yij = 1), and if pooling is random within y strata, then model (1) implies the following poolwise logistic model (Equation (3)): where Yi = 1 or 0 for a "case" or "control" pool (all members positive or negative, respectively), pi = Pr (Yi = 1 | Xi = xi, Ci1 = ci1,…, CiT = ciT ), gi is the size of (i.e., number of specimens in) the i-th is gi times the average exposure across pool members assumed returned by the assay, is the sum of the values of the t-th covariate across the members of pool i, and ln(rgi) is an offset with rgi being the ratio of the number of case pools of size gi to the number of control pools of size gi. Model (3) is fit with gi as a covariate and with no intercept, applying the offset. This can be done using standard software, e.g., the LOGISTIC procedure in SAS [23], to obtain an MLE for β and its corresponding standard error. While the Weinberg-Umbach approach provides a convenient method to estimate the log OR for a pooled exposure, it is not directly generalizable to incorporate information on pooling or measurement error. Thus, in the empirical studies that follow, our primary focus is on poolwise extensions of the model in Equation (2). Specifically, the MLR model for the summed exposure variable (Xi) stemming from model (2) is: where Xi is a random variable corresponding to the observed sum (xi) of exposure levels across pool members, xi and cit are the same exposure and covariate sums that appear in Equation (3), Note that model (4), like model (3), is fit without an intercept and with the pool size (gi) as a covariate. If pool sizes are not equal, model (4) must be fit using weighted least squares (WLS) with weights wi = 1/gi. This yields the WLS estimate * β , along with the residual variance estimate Estimated variances also take the same form as those given in [20], i.e., (Equation (6)): As the subscript implies, umvû ) OR ln( is a minimum variance unbiased estimator of the adjusted log OR that accounts for pooling to assess exposure (X). This provides an alternative to the MLE of β based on the Weinberg-Umbach poolwise logistic model in Equation (3), if one is comfortable with the assumption of normal errors in model (4). In Section 4, we use simulation studies to investigate some of the statistical properties of these estimators.

Models for Pooled Data with Measurement and/or Processing Error
Schisterman et al. [7] proposed a framework for modeling measurement and processing errors when estimating the mean and variance of a continuous biomarker concentration based on pooled samples. We adapt their framework to address such errors when estimating the adjusted log OR of interest by augmenting the poolwise model in (4) for discriminant function analysis as follows (Equation (7)): In model (7), the tilde ("~") indicates that one observes an error-prone version of the desired sum of the exposures across members of pool i, while the new terms m i ε and p i ε represent measurement and processing errors, respectively. Following [7], we assume these errors are mutually independent ; we further assume these to be independent of the residual errors (εi's). Note that model (7) implies the assumptions that each laboratory assay result is subject to measurement error with a constant variance regardless of whether it is performed on a pooled or individual specimen, and the indicator function makes clear that each pooled assay (i.e., where gi > 1) is assumed subject to processing error with a constant variance regardless of the pool size.
Although iteratively reweighted least squares (IRWLS) could be an option, we take an ML approach to analysis based on model (7). Specifying the likelihood is straightforward, because the k poolwise outcomes i X are mutually independent such that where (Equation (8)): The discriminant function-based MLE for the adjusted log OR then follows as the ratio of the MLEs of the respective parameters in (8), i.e., (Equation (9)) 2 ml with its delta method-based estimated variance given by (Equation (10)): For computations, we use built-in SAS IML functions (NLPQN and NLPFDD) for numeric quasi-newton optimization and to approximate the observed information matrix [24].

Design Considerations and Bias Adjustment
From Equation (8), it is clear that the proposed discriminant function estimator in Equation (9) can only be used when the study design permits identifying the three separate variance components , or at least under the condition that the residual variance (σ 2 ) can be estimated uniquely.
For this purpose, we recommend a "hybrid" design [7], in which individual exposure assay measurements (gi = 1) are combined with pools (gi > 1) of at least two different sizes. The pools of two or more sizes should permit estimation of σ 2 , while the individual assays provide observations devoid of processing error and should permit identification of the other two components. This requirement can be relaxed if one expects only measurement or processing error (not both). In that case one variance is eliminated when specifying 2 i σ in Equation (8), and a design featuring pools of any two or more sizes (including 'pools' of size 1) would theoretically be adequate. We return to such considerations in Section 3 when introducing the real data example, and we include "measurement error only" and "processing error only" models in simulation studies described in Section 4.
Given a design that allows estimation of a unique σ 2 , the stability of the discriminant function-based estimator in Equation (9) may remain an issue. We note that in the absence of measurement and processing error, the UMVU estimator in Section 2.2 improves stability by eliminating small-sample bias entirely. While we have not developed a UMVU estimator in the presence of measurement and/or processing error, a second-order Taylor series expansion leads to the following bias-adjusted alternative to the MLE in Equation (9): We would not recommend the use of Equation (11), for example, if the directionality of the log OR estimate changes relative to ml ) OR ln( . Otherwise, Equation (11) acts as a shrinkage estimator that tends to have lower variability. While one could defend using the unadjusted standard error based on Equation (10) in conjunction with the bias-adjusted estimate (e.g., [25]), one could also contemplate multiplying that standard error by the ratio ml adĵ ) OR /ln( ) OR ln( . Multiplying by this ratio has no effect asymptotically, but may better reflect the variability of the adjusted estimator in small samples assuming that the ratio is approximately constant. Clearly, one issue is that rare point estimates of σ 2 close to 0 can correspond to exceedingly large log OR estimates. This instability makes the theoretical bias associated with both Equation (9) and Equation (11) infinite whenever there is a positive probability that 2 σ equals 0, and can also produce occasional "blow ups" in estimated standard errors based on Equation (10). For this reason, our empirical studies in Section 4 include a discussion of practical strategies to reduce such problems. This includes consideration of Akaike's information criterion (AIC; [26]) to select a model accounting solely for measurement or processing error if the model accounting for both is subject to instability in the estimated log OR and/or its accompanying standard error.

Collaborative Perinatal Project Data
The Collaborative Perinatal Project (CPP) was originally conducted during the years 1959-1974 to study exposures and outcomes related to pregnancy [21]. In a subsequent nested case-control study using stored serum, investigators assayed cytokine concentrations in controls and in cases who experienced spontaneous abortion (SA) [22]. As part of this substudy, cytokines were measured in individual samples as well as in pools of size 2. We analyze data from 666 women, for whom the variables SA status (Y; 0 or 1), race (C1; black vs. white), and smoking status (C2; yes vs. no) were measured individually. The cytokine concentration (X) of interest is that of monocyte chemotactic protein 1 (MCP1; X). We use MCP1 assay results from 251 pools of size 2 (involving 502 women), along with individual MCP1 assays from the other 164 women who were not included in pools. Women paired in pools were matched on SA (Y) status.
Basic descriptive statistics for the 666 participants include the following: 305 (46%) had SA = 1 (including 85 individuals and 110 pairs whose serum was pooled), 189 (28%) were black, and 313 (47%) smoked. We consider four models based on Equation (7), distinguished by the extent to which they account for sources of error: (a) both measurement error (ME) and pool processing error (PE); (b) ME only; (c) PE only; (d) neither ME nor PE. The parameter of interest is the log OR associating individual-level SA status (Y) with a unit increase in MCP1 concentration (X), adjusting for race (C1) and smoking status (C2). Table 1 summarizes analyses carried out for the CPP ancillary study. The first thing to note is that when both measurement and processing error are accounted for in model (8), the parameter β* is identifiable, but the residual variance (σ 2 ) and hence the log OR of interest is not. This stems from the fact that only a single pool size (ki = 2) was utilized in the study (see Section 2.4). The other three models (ME only, PE only, and neither ME nor PE) all agree with regard to a positive but non-significant estimated log OR characterizing the adjusted association between SA status and MCP1 levels. For the ME only model, the estimated measurement error variance ) (σ 2 m attained a lower bound (0.001) that was set for each variance component in the numerical ML optimization process. As such, results for the ME only and the "neither ME nor PE" models are extremely similar to each other. Those results are also qualitatively similar to an analysis based on the Weinberg-Umbach [8] model in Equation (3), results of which we provide in the last row for comparison.

Results
While the analysis suffers from lack of identifiability with respect to the most general model, note that AIC strongly favors the PE only model over the other two viable choices. The MLE for the log (OR) is also noticeably higher (0.39 vs. 0.31) under this model, although the accompanying standard error is also larger and the inferential result remains non-significant. As a final note, Table 1 illustrates that the proposed bias-corrected log OR estimates and adjusted standard errors (Section 2.4) are very similar to the uncorrected MLEs for this example, given the relatively small effect and large sample size.

Simulations
We conducted simulations to study empirical properties of the estimators discussed in Section 2, with data generated to closely mimic the CPP ancillary study. Specifically, a Bernoulli variable C2 was first generated to match the observed prevalence of smoking status. A second Bernoulli variable C1 was then drawn with prevalence matching that observed within the corresponding smoking group. Then, SA status (Y) was generated conditional on C1 and C2, according to a logistic regression with parameters matching estimates obtained from the observed CPP data. Finally, MCP1 concentration (X) was generated according to model (2), again with parameters closely mimicking estimates obtained from the corresponding model fit to individual-level CPP data. We note that this means of generating the data can also be shown to imply the validity of model (1). The true values of β* and σ 2 used in the simulations were 0.035 and 0.08, respectively, for a true adjusted log OR of 0.4375.
Unlike the actual CPP substudy which involved only one pool size together with individual samples (ki = 1,2), we generated data with three pool sizes (ki = 1,2,3) so that all variance components in Equation (8) would theoretically be identifiable. Specifically, 50% of the total number (N) of individual "subjects" were allocated to pools of size 4, 50% of the remainder to pools of size 2, and the rest were treated individually. True poolwise exposures (Xi) were calculated as the sum of individual exposures for those in the pool. For simulations involving ME and/or PE, normal errors were randomly generated with mean 0 and variance 0.08 σ 2 m = and/or 0.08 σ 2 p = , respectively.

Results of Simulations with Neither ME nor PE
We conducted initial simulations to study empirical properties of the estimators discussed in Section 2.2, for use in the absence of measurement and pool processing error. Two overall sample sizes (N = 2000 and N = 200) were considered. begins to show empirically. In this case, the logistic regression-based estimator suffers in terms of bias and precision relative to the discriminant function-based log OR estimators that take advantage of the assumption of normal residual errors. Similar findings in the standard setting without pooling were discussed by Lyles, Guo and Hill [20].

Results of Simulations with ME and/or PE
We considered three overall sample sizes (N = 2000, 1000, 500) in simulations accounting for measurement and/or processing error. While the inclusion of three pool sizes (ki = 1,2,3) permits identifying all variance components in the general model that requires estimating both 2 m σ and 2 p σ in addition to the residual variance σ 2 , some numerical instabilities were still observed. Specifically, the MLE for σ 2 occasionally met the lower boundary of 0.001, and/or the estimated standard error accompanying ml ) OR ln( in Equation (9) was implausibly large. In such boundary cases or if the estimated standard error under the general model was more than 10 times the standard error under a model ignoring ME and PE, we used the AIC criterion to select the best fitting alternative model (ME only or PE only) to estimate the log OR. Such model selection adjustments were fairly common with N = 500 when generating data under the most general model, but were almost never necessary under the ME only or PE only models and were much less frequent for larger sample sizes. If a different model than the one that generated the data was selected under the specified criteria, standard errors as well as the point estimate of the log OR were based on the selected model. Table 3 provides results for data simulated under the general model with = = 2 m 2 σ σ 0.08 σ 2 p = .
We focus mainly on the summary for the MLE of the log OR, in which we include the mean and median across 2500 simulations to get a sense of both mean and median bias. As indicated in the footnotes, instability in the estimate of σ 2 or in the standard error accompanying ml ) OR ln( led to an AIC-based decision to base estimation on the ME only or PE only model in 19.1% of simulations with N = 500. This percentage reduced to 8.5% and 3.4% with N = 1000 and 2000, respectively. Ultimately, the estimator is characterized by acceptable mean and median bias and accompanied by adequate (if a bit anticonservative) CI coverages. As expected, the Weinberg-Umbach model (Section 2.2) produces a markedly attenuated log OR estimate with sub-nominal coverage, since it does not account for measurement or processing errors. We do not summarize the bias-corrected estimator adĵ ) OR ln( in Table 3, since numerical issues affecting ml ) OR ln( also tended to impact stability of the estimated correction factor under the most general model. Table 4 provides results for data simulated under the ME only model with = = 2 m 2 σ σ 0.08. Mean log OR estimates were much closer to their medians than under the general model (Table 3), so we report mean estimated standard errors in place of median estimates. These match empirical SDs closely, and translate to reasonable CI coverages. Here we see the value of the bias-corrected estimator adĵ ) OR ln( in Equation (11), with respect to accuracy as well as precision (particularly for smaller overall sample sizes). In this case we continue to see marked bias in the logistic regression-based estimator that ignores measurement error.    This model encountered no numerical difficulties at any of the three sample sizes, and in this case we see very little bias in either of the two discriminant function-based estimators. Attenuation remains present in the logistic regression-based estimator, but is less prominent given that processing error only impacts pooled (and not individual) simulated assay values.

Discussion
Despite the potential cost benefits and facilitation of lab assay procedures afforded by pooling, possible measurement and processing errors associated with individual and/or pooled measurements may limit its use in practice. In proposing set-based logistic regression with a continuous exposure variable measured in pools, Weinberg and Umbach [8] anticipated and provided insights into the potential effects of measurement errors. However, to our knowledge, this report offers the first demonstration of a comprehensive and implementable approach to adjust for ME, PE, or both when the goal is to estimate an adjusted exposure odds ratio. We used a discriminant function-based strategy (e.g., [16,20]) to convert the problem to a multiple linear regression framework. For modeling measurement and processing errors in the context of pooled exposures, we then followed the paradigm proposed by Schisterman et al. [7]. Combining these two methodologies allows us to target the individual-level adjusted log OR associating exposure with a binary outcome in the presence of pooling, ME and PE.
Our method can be further extended to account for more sources of error (e.g., attributable to technician, variations in temperature, etc.), and to allow the processing error variance 2 p σ to vary with pool size. However, this will require a more complex hybrid design, where different pool sizes are available as a function of the variance components. We also note in passing that inclusion of replicate assay results taken on individual specimens could offer an alternate means of identifying the measurement error variance, 2 m σ .
Mitchell et al. [14] provided ML methodology applicable to model (4) when the exposure variable X is right-skewed, as is common in environmental epidemiologic studies. We believe that ongoing research in the direction of extensions to permit adjustment for ME and/or PE in that scenario could be highly valuable. Further work could also focus more specifically on study design considerations (see Section 2.4), e.g., determining an optimal allocation of pool sizes. Such an effort may benefit from past work demonstrating efficiency gains possible in the context of model (4) when pool allocations are made with respect to other covariates (C) in addition to the outcome (Y) [15]. Nevertheless, we expect that pooling with respect to Y alone will typically be highly efficient when targeting the parameter β* in model (4), and thus also for the purpose of estimating the adjusted log OR of interest. It is worth noting that the discriminant function approach (with and without accounting for measurement or processing errors) readily accommodates pool allocations that are completely random (i.e., not made within separate strata of Y or any other variable), as well as allocations that are dependent on Y in addition to other covariates. We caution that the former strategy is less statistically efficient than y-dependent pooling, which is a requirement for use of the Weinberg-Umbach poolwise logistic regression model [8]. Even ignoring ME and/or PE, the latter model requires offset adjustments if pooling is dependent on both Y and covariates [13,27].
Finally, we are currently exploring methods to adjust for ME and PE rooted in the logistic regression, as opposed to the discriminant function framework. Although the discriminant function method is much simpler computationally, an advantage of the logistic regression approach would be the ability to estimate adjusted ORs associated with other covariates in addition to the exposure OR. We anticipate building naturally off of considerations given here, in addition to prior related references [7,8,13].
The Appendix contains SAS/IML code that was used in fitting model (7) to the CPP example data. Only minor adjustments are required to alter this code to account for ME or PE only. Readers interested in implementing the approach described are encouraged to direct any remaining computational questions to the first author.