Statistical Methods for Assessing the Explained Variation of a Health Outcome by a Mixture of Exposures

Exposures to environmental pollutants are often composed of mixtures of chemicals that can be highly correlated because of similar sources and/or chemical structures. The effect of an individual chemical on a health outcome can be weak and difficult to detect because of the relatively low level of exposures to many environmental pollutants. To tackle the challenging problem of assessing the health risk of exposure to a mixture of environmental pollutants, we propose a statistical approach to assessing the proportion of the variation of an outcome explained by a mixture of pollutants. The proposed approach avoids the difficult task of identifying specific pollutants that are responsible for the effects and may also be used to assess interactions among exposures. Extensive simulation results demonstrate that the proposed approach has very good performance. Application of the proposed approach is illustrated by investigating the main and interaction effects of the chemical pollutants on systolic and diastolic blood pressure in participants from the National Health and Nutrition Examination Survey.


Introduction
Environmental pollutants are a major source of risk to public health. Evaluating the risks from environmental pollutants is challenging because the pollutants are always mixtures of chemicals and can be highly correlated due to similar exposure pathways and/or chemical structures [1]. In addition, the effect of an individual chemical on a health outcome can be weak and difficult to detect because of the relatively low level of exposures to many environmental pollutants. Recent technological advances allow for measuring a large number of environmental chemicals in biologic and environmental samples. Conventional statistical methods encounter substantial difficulties in analyzing such data where high-dimensional covariates can be highly correlated and the effects of the covariates on the outcome can be weak [2].
One approach to dealing with highly correlated covariates in regression analysis is to apply principal component regression, which uses linear combinations of covariates that explain a large portion of covariate variation as input to the regression model. The principal components method weights each covariate and can be hard to interpret because such combinations are not unique. Factor analysis is traditionally used to improve the interpretation. When data are of high dimension, Zou et al. [3] proposed a sparse principal component analysis that restricts each principal component to sparse non-zero weights by shrinkage approaches such as Lasso [4] or elastic net [5]. One problem with the principal components regression approach is that components accounting for a large portion of the covariate variation do not necessarily explain a substantial proportion of the variation of the outcome. The partial least squares approach [6][7][8] addresses this problem by considering both the covariates and the outcome in forming the components. Chun and Leles [9] proposed a sparse version of the partial least squares that generates linear combinations of sparse covariates for better prediction and interpretation.
When the covariate effects are weak and numerous, it is more attractive to first estimate the variation of the outcome explained by the covariates [10]. Both principal component regression and partial least squares involve choosing the number of components to be included in the regression model. Neither is guaranteed to obtain a stable sparse representation of the components of the regression model even if the linear regression model holds. This prevents us from obtaining an unbiased estimation of the total variation explained by the measured covariates in general. When the sparsity does hold in the linear regression model, Cai and Guo [11], Verzelen and Gassiat [12], and Cai and Guo [13] proposed optimal estimators of the variation explained by the measured covariates. In practice, we may not know if the sparsity assumption holds. In this case, a more attractive estimator of the variation explained by the covariates is to use the normal random-effects model [14]. The method of Yang et al. [14], termed genetic complex trait analysis (GCTA), was proposed for estimating the narrow-sense heritability in the genome-wide association study of single-nucleotide polymorphism (SNPs) effects on a complex trait. The estimator is consistent when the SNPs are independent and the number of SNPs is in the comparable order of the sample size [14]. Dicker [15] studied the asymptotic distribution of a similar estimator under the normality assumption for the independent covariates. Janson et al. [16] proposed an alternative approach termed EigenPrism to construct confidence intervals for the total variation explained by the covariates also under the normality assumption for the independent covariates and the residual errors.
The methods for inference in Yang et al. [14], Dicker [15], and Janson et al. [16] all rely on the normality assumption on data generation. Chen [17] proposed an estimating equation approach that relies neither on the normality assumption of the covariates nor the residual errors. This estimator was shown to be consistent and asymptotically normally distributed under some reasonable conditions. One key assumption is the independence of the covariates in the model. Yet, in the study of environmental pollutants, it is often the case that the pollutants are correlated. To address this problem, we propose to use a special weighting matrix along with strategies for estimating the correlation matrix with possible supplemental data. The proposed approach enables us to estimate the explained variation by the environmental pollutants and to examine interaction effects of those pollutants.
We apply this approach to the analysis of the association of chemical pollutants, in particular persistent organic pollutants (POPs), such as polycholorinated biphenyls (PCBs), on systolic and diastolic blood pressure in a cross-sectional dataset from the National Health and Nutrition Examination Survey (NHANES), 1999-2004. The selection of this health outcome was motivated by previous investigations that demonstrated associations of POPs with prevalent or incident hypertension [18][19][20] and increases in systolic and diastolic blood pressure [21,22] in populations with general exposures, such as NHANES, and those living near highly contaminated areas.

Estimation of the Total Effects of Exposures
Denote the health outcome by Y and the exposures to the mixtures of pollutants by X, composed of p individual chemicals: X 1 , · · · , X p , i.e., X = X 1 , · · · , X p t . Let the mean prediction of the health outcome Y given the exposures X be E Y X 1 , · · · , X p . The proportion of the variation of the health outcome explained by the collection of the environmental exposures is defined as If there is no association between the health outcome and the environmental exposures, the mean prediction is a constant, and thus the proportion of the explained variation is zero. When the dimension of the environmental exposures is low, it may be feasible to first estimate the regression model and then assess the proportion of the variation explained. When the dimension of the exposures is high, the approach of estimating the regression model first becomes difficult to impossible to carry out. We propose an alternative approach that directly estimates the proportion of the explained variation without the need to estimate the regression model first. Assume the linear model linking the health outcome and the pollutants as where E( ) = 0, var( ) = σ 2 . Let β = β 1 , · · · , β p t and ∑ = var(X). The proportion of explained variation under the linear model (2) becomes If the covariance matrix ∑ is known, a decorrelation transformation can be applied to transform X 1 , · · · , X p into uncorrelated variables Z 1 , · · · , Z p through Z 1 , · · · , Z p = X 1 , · · · , X p ∑ −1/2 . After the decorrelation transformation, the linear model becomes where α = α 1 , · · · , α p t and var Z 1 , · · · , Z p = I p , the identity matrix. Furthermore, The proportion of outcome variation explained by covariates Z 1 , · · · , Z p under model (2) can be rewritten as Under model (3), let the observed outcomes be Y i , i = 1, · · · , n, and the observed exposures be Z i1 , · · · , Z ip , i = 1, · · · , n. The proportion of the explained variation r 2 can be directly estimated byr and M = (M ik ) n×n with and Z +j = 1 n ∑ n i=1 Z ij , and W = (I + λM) −1 (M − I)(I + λM) −1 for fixed λ ≥ 0. One major advantage of using (4) to estimate the proportion of the explained variation is that it allows for high-dimensional exposures in model (3). The dimension of the exposures p can be as large as or even greater than the sample size n, as long as the increase in dimension is approximately a linear function of the sample size. Chen [17] showed that the estimator for r 2 behaves well in the high-dimensional setting.

Decorrelation of the Exposures with Possibly Supplementary Exposure Data
The estimation of the proportion of the explained variation by (4) is based on model (3), which requires the exposures be de-correlated before being used. In practice, the covariance matrix may not be known a priori. One idea is to use the estimated covariance matrix for the decorrelation. However, there are potentially two problems in implementing this idea when the exposures are of high dimension. First, the covariance matrix may not be directly estimable when the dimension is high. For example, the empirical estimate of the covariance matrix does not yield a good estimate of the covariance matrix if p is close to or larger than n. Second, even if the covariance matrix can be estimated in some ways, the resulting estimator of the proportion of the explained variation may not behave well. We propose the following approaches to address these issues.
When n > p, the covariance matrix can be directly estimated by the empirical estimator as∑ = ∑ jk , where∑ When n ≤ p, to continue the use of the empirical estimator for the decorrelation, it is required to have additional supplementary covariate data. Suppose a separate sample of covariates only is available to use. Denote the supplementary data by X i1 , · · · , X ip , i = n + 1, · · · , n + N, where n + N > p. The empirical estimator of the variance matrix∑ = ∑ jk , wherê Note also that if supplementary covariate data are available, the supplementary data can also be used for the covariance matrix estimation when n > p. With the covariance matrix estimated, decorrelation can be carried out by the following approach, The estimation approach in (4) can then be carried out using the decorrelated covariate data Z in M and W, denoted, respectively, byM andŴ. The estimator of the proportion with estimated weight matrix iŝ It has been shown that the empirical estimate of the covariance when n > p and the empirical estimate when n ≤ p with supplementary exposure data can yield well-behaved estimator of r 2 by the method in (4) and (5), respectively. More details can be found in Chen (2021). Note, however, current theory does not support the use of other types of covariance estimators in place of the empirical estimator, including the frequently used sparse matrix approaches in high-dimensional settings.

Confounder Adjustment and the Variation Explained by a Subset of Exposures
In estimating the proportion of the variation of a health outcome explained by a set of exposures, it is often the case that some confounders need to be adjusted. Another relevant question is estimating the proportion of the variation of a health outcome explained by a subset of the exposures. In this case, we need to account for the fact that the subset of exposures may be correlated with the rest of the exposures and that the rest of the exposures may also explain a proportion of the variation of the health outcome. These two problems are similar from a statistical point of view. We treat them in a single framework, where we label the confounders and the exposures together by covariates.
Let the set of covariates be divided into two sets, denoted by X A = X j , j ∈ A t and where A and B are two disjoint index sets. For example, X A can be the collection of confounders and X B be the collection of exposures. Suppose that we are interested in estimating the proportion of the additional variation of the health outcome explained by the set of covariates X B after the subset of covariates X A are already included in the model. The proportional of additional variation explained is Under the linear model in (6), where the covariance matrix for (X A , X B ) is The proportions of the variation of Y explained by (X A , X B ) and X A alone are, respectively They are related through r 2 B|A = r 2 AB − r 2 A . Estimation of r 2 B|A can therefore be obtained bŷ wherer 2 AB andr 2 A are, respectively, the proportions of variation estimated by data on (Y, X A , X B ) and on (Y, X A ). The estimation approach can be either the direct estimating equation approach assuming covariate independence, or the supplementary covariate approach allowing correlated covariates.

Estimation of the Total Interaction Effects
One important practical question in environmental health research is whether interactions among pollutants exist. To answer this question, linear model (1) can be expanded to include interactions as follows, For estimating the total variation explained by both the main and interaction effects, the estimatorr 2 orr 2 e may be used depending on whether the covariance matrix needs to be estimated. For estimating the proportion of variation explained by the interaction effects only, the problem can be cast into the covariate adjustment formulation discussed in the previous section. Let X A = X 1 , · · · , X p and X B = X 1 X 2 , · · · , X p−1 X p . The proportion of variation explained by the interaction terms given the inclusion of the main effect terms is r 2 B|A . The parameter may be estimated using the method described in the last section.

Inference on the Explained Variation
One important question to ask in practice is whether the explained variation by a mixture of exposures is non-zero. To answer this question, we propose to use the permutation test. Specifically, we first permute the outcomes among the subjects and recalculate the estimated value for r 2 . By comparing the r 2 estimator based on the permuted data with that from the original data, and repeating the process for many times, we can estimate the p-value for testing the hypothesis of no explained variation. Specifically, we test the hypothesis, H 0 : The following permutation algorithm is used.
Repeat step 2 for N times. The estimated p-value is the frequency ofr 2 τ >r 2 o , i.e., To take into consideration the accuracy of the Monte Carlo simulation approach in computing the p-values, we obtain an upper bound of the true p-value bŷ The second term is obtained from the upper bound of the 95% confidence interval for p-value equals 0.05. The p-value bound is more appropriate to use when the estimated p-value is small in comparison to the accuracy that can be achieved by the Monte Carlo simulation approach.
Another question often asked in practice is whether the explained variation by a mixture of exposures is non-zero after adjusting for confounders. This also includes the additional explained variation of a group of exposures after other groups of exposures are already explained. One such example is the additional variation explained by interactions. The hypothesis to be tested can in general be stated as The following permutation algorithm can be used after the decorrelation transformation: 1.
Compute r 2 B|A estimator using one of the proposed methods based on data (y 1 , X A1 , X B1 ), · · · , (y n , X An , X Bn ). Denote the estimate byr 2 B|Ao .
Repeat step 2 for N times. The estimated p-value is the frequency ofr 2 B|Aτ >r 2 B|Ao , i.e., Similarly, we can compute an upper bound of the true p-value bŷ to account for the error in using the Monte Carlo approach to computing the p-value. Aside from the permutation tests for no effects, a confidence interval for the explained variation can be constructed based on the asymptotic analysis of the proposed estimators. Such asymptotic results based on the random matrix theory [23] are very complex to obtain. Interested readers may check the results in Chen [17] for more details.

R Package: TEV
We developed an R package called the total explained variation (TEV) for carrying out the computation of the proposed methods. The package includes two main functions for computing the proportion of the explained variations and produces confidence intervals for the explained variation. The first function is named R2ee for computingr 2 in (4), the second is named R2eesd for computingr e 2 in (5) when supplemental covariate data are available. Althoughr e 2 appears simply as replacing M and W in (2), respectively, byM andŴ, this replacement affects the inference on r 2 . As a result, a different function is used. A third function is the least squares approach for the case n > p, named as R2eels. These functions are modified to perform hypothesis tests based on permutation. The modifications for the unconditional test have PMT attached to their names-for example, R2ee to R2eePMT. The modifications for the performance conditional permutation test have PMTca attached to their names-for example, R2eePMTca. In addition, two existing approaches are also included in the R package for the convenience of comparison. The first is the EigenPrism approach with the function name EigenPrismFull, which is an adaptation of the R function EigenPrism by Jansen et al. [16]. The second is the GCTA approach [14] with the option of bootstrap variance computation [24]. This function is named R2GCTA. The code for TEV is available in the Github site: https://github.com/hychen-uic/TEV (accessed on 30 December 2021).

Simulation Study Design
We simulated exposure data in three different scenarios: independent, mildly correlated, and highly correlated. The distributions of the exposures are either normally or non-normally distributed. The outcome data are simulated following a linear model with random error following normal or non-normal distributions. The simulated data are generated under a combination of parameters, including p, n, independent or correlated exposures with different distributions. The detailed data generation algorithm is given in Appendix A. The first set of simulated data are used to validate the correct size of the permutation tests under the null hypothesis and the adequate power under the alternative hypothesis. The permutation sample size of the Monte Carlo simulations for computing the p-values is set to 10,000 except stated otherwise. In the second set of simulations, we compute the confidence intervals and their coverages and lengths for the explained variations when the true values of r 2 are not zero. Comparison with other existing approaches as well as the variations of the proposed approach is carried out. All the simulation results are based on 1000 replicates except stated otherwise.

Data Analysis Approach
We analyze a National Health and Nutrition Examination Survey (NHANES) dataset to demonstrate the use of the proposed methods for data analysis. NHANES is a weighted sample representative of the U.S. population. Persistent organic pollutants (POPs) were measured in serum from a subgroup of the population in the period 1999-2004. The dataset was downloaded from the NHANES website. The original data have 31,126 records. A total of 75 POPs, including 11 brominated flame retardants, 34 polychlorinated biphenyl (PCB), 13 organochlorine pesticides, and 17 dioxins and furans, are treated as exposure variables in the analysis. When interactions are also considered, an additional 2775 covariate items are included in the analysis, for a total of 2862 covariate items. Analytes with measurements below the limit of detection (LOD) are imputed at the LOD/ √ 2. Since these chemicals are lipophilic, the chemical measurements are all adjusted for serum lipid levels (Phillips et al. 1989). As the POPs were measured only on a subset of the subjects by design, we excluded subjects who were not measured by design. For subjects who were selected for measurement by design, those with missing POPs and confounders were included and missing values were imputed using the R package MICE [25] before the analysis. The final sample used for the illustration has 3261 subjects.
The outcomes we analyze are systolic and diastolic blood pressures. Average systolic and diastolic blood pressures (SBP and DBP, respectively) reported to participants were used for analysis. Blood pressure measures have been adjusted for hypertensive medication use to account for the size of potential treatment effect. For participants currently taking anti-hypertensive medications, we added 10 and 5 mmHg to observed systolic and diastolic blood pressures, respectively [26,27]. Possible confounders, including age, body mass index (kg/m 2 ), sex, race/ethnicity (non-Hispanic white, non-Hispanic black, Mexican American, other Hispanic, other race), family poverty income level (<1.3, 1.4-3.4, >3.5), education (less than high school, high school, more than high school), alcohol drinks per year, smoking status (never, former, current), drugs taken last month including hormones modifying drugs (yes, no), adrenal cortical steroids (yes, no), antidiabetic drugs (yes, no), and immunosuppressant drugs (yes, no), are adjusted in the analysis.
The distributional summary of the exposures and the outcomes will be examined first. The proportions of explained variation by all the covariates, including confounders and exposure variables, with/without interactions between the exposures are then estimated and inferred. The proportions of the explained variation by the exposures with/without interactions after adjusting for the confounders are estimated and inferred next. The inference included confidence intervals for the proportion of explained variation and p-values for the permutation tests. Table 1 lists the type I errors of the permutation test for the hypothesis H 0 : r 2 = 0 with α = 0.05 under varying data generation mechanisms. It can be seen from the table that the type I errors are mostly close to the nominal level. There is a slight inflation of type I error for the test based on the estimating equations assuming independent covariates. For other tests, the type I errors are well under control. Figures 1 and 2 display the power of the permutation tests for the hypothesis H 0 : r 2 = 0 at the significance level α = 0.05 with a range of r 2 values, respectively, for independent covariates and highly correlated covariates. It can be clearly seen that the test that assumes independent covariates has the highest power. The permutation test that uses the supplementary data is a little more powerful than the test that does not use the supplementary data when n > p. The power difference between the test assuming covariate independence and tests without the assumption is large when the ratio of the sample size to the number of covariates is small. For the case with independent covariates (Figure 1), the power of the different tests is similar. In contrast, the power can be substantially different with highly correlated covariates, especially when n < p.  Table 2 lists the estimated proportion of the explained variation by different methods when the proportion is non-zero. From the table, it can be seen that the EigenPrism approach yields reasonably good estimates in terms of bias and variance, and the confidence intervals have good coverage except in the non-normal case with highly correlated covariates. In the latter case, the EigenPrism estimates are subject to large bias and the coverage rates of the confidence intervals are very low. The GCTA approach yields good estimates and confidence intervals when the normality assumption on the random error holds. The GCTA confidence intervals have a low coverage rate when the normality does not hold. Because of the high computation cost in using the bootstrap method for estimating the variance, the GCTA estimator was not computed in the comparison for p > n. Some limited simulation results not shown here suggest the performance is similar to the case of n > p. The confidence intervals of the proposed estimators maintain good coverage for all the simulated cases. The estimating equation approach (R2ee) assuming covariate independence has the shortest length of the confidence intervals. The supplementary data approach (R2eesd) has comparable or wider confidence intervals. This is because the latter approach pays a cost for accounting for the dependence similar to the least squares approach (R2eels) when n > p. It avoids potential bias that may occur in using the R2ee approach for correlated covariates.        Table 3 lists the type I errors of the conditional permutation test for the hypothesis H 0 : r 2 B|A = 0 with α = 0.05. This type of test can be used for confounder adjustment, subset exposures, and interactions. Because the simulation is much more time-consuming, the simulation results are obtained with 1000 permutations and 200 replicates. In this set of simulations, X A is set to half of the X variables and r 2 A is set to either 0 or non-zero. Note in particular that X A is also of high dimension. The estimated type I errors for the proposed test are all under control and close to the nominal level except in some cases of using the R2ee approach.  Table 4 lists the summary statistics for the demographic variables and confounders to be adjusted in the analysis. The ranges of the continuous variables appear to be reasonable. The other race categories in the data are small. The percentages of medication use among the subjects are low. Figure 3 shows the box plots of 75 chemical exposures and the distribution of the pairwise correlation coefficients. From the figure, we see that the distributions of the POPs have longer tails than the normal distributions would display. However, the deviations are in general not extreme. The correlation can be high; however, most of the correlation coefficients are approximately 0. Overall, the deviation appears to be not far from independent normal.  Table 5 lists the results of the estimation and inference on the proportions of explained variations. The analysis was performed for the SBP and the DBP, respectively, with/without confounder adjustment and with/without interactions. For the unadjusted analysis without interactions, the proportion of the SBP variation (approximately 35%) explained by the chemical exposures is much higher than that of the DBP (6~7%). Both are significantly different from 0 as the 95% confidence intervals suggest. After the adjustment for the confounders, the proportions of the variation of both the SBP and DBP explained by the POPs is close to 3%, with the DBP slightly less, but significantly different from 0 as the permutation results suggest. Different methods yield very similar results in these cases.   Table 5 lists the results of the estimation and inference on the proportions of explained variations. The analysis was performed for the SBP and the DBP, respectively, with/without confounder adjustment and with/without interactions. For the unadjusted analysis without interactions, the proportion of the SBP variation (approximately 35%) explained by the chemical exposures is much higher than that of the DBP (6~7%). Both are significantly different from 0 as the 95% confidence intervals suggest. After the adjustment for the confounders, the proportions of the variation of both the SBP and DBP explained by the POPs is close to 3%, with the DBP slightly less, but significantly different from 0 as the permutation results suggest. Different methods yield very similar results in these cases.  Table 4). ** Permutation tests for no interaction effects adjusted for both confounders and main exposures.

Data Analysis
For the model with interactions, the unadjusted estimates from the estimating equation approach assuming independent covariates (R2ee) and the GCTA method are markedly different from the EigenPrism and the least squares approaches. When compared with the estimates without interaction, the R2ee and GCTA approaches do not yield any evidence against no interaction assumptions among the chemical exposures for either the SBP or the DBP. However, both the R2eesd and R2eels approaches yield strong evidence (p-value < 0.031) against no interaction assumptions for both SBP and DBP. The least squares type of estimates are considered more reliable because they account for the possible correlations among covariates while the former methods do not. The conditional permutation tests suggest the interactions explain an additional proportion of variation for both the SBP (estimate = 13.2%) and the DBP (estimate = 21.6%).

Discussion
The proposed methods allow us to infer the proportion of the variation of an outcome explained by a set of exposures collectively with possible adjustment for confounders. One prominent feature of the approaches is that they do not need to first estimate the effects of individual exposures for estimating the overall effects. This feature is useful because it can be hard to estimate the individual effects in high-dimensional exposures. The proposed methods can be applied to situations where either or both the exposures and the confounders are of high dimension where traditional regression approaches do not yield a good estimate of the effects.
There are a number of subtle points which need some attention in applying the proposed methods. Strictly speaking, the proposed methods are applicable for situations with independent covariates before or after a decorrelation transformation. There is a subtle difference between the independence of covariates and the uncorrelated covariates. Theory guarantees for the former but not necessarily the latter. When n ≤ p and no supplementary covariate data are available, we may impose structures on the covariance matrix so that the covariance matrix can still be reasonably estimated. One such an assumption is the sparsity assumption on the off-diagonal elements for the precision matrix ∑ −1 [28]. However, current theory does not directly support application to cases with an estimated structured covariance matrix. Confounder adjustment in low-dimensional cases can be carried out by regressing on both the outcome and the exposures. In the high-dimensional cases, this adjustment can be impractical. The proposed methods are not subject to the problem.
Data analysis is presented as an illustrative example for applying the proposed methods, rather than an in-depth analysis of the specific dataset. A number of limitations exist with respect to interpreting the results as a substantive analysis of the dataset. First, the missing values are imputed. We performed the analysis only on a single imputed dataset and did not account for the variation in the imputed values. In addition, the imputation model may also be subject to questions and bias may occur for using inappropriate models. Second, the confounder adjustment does not exhaust all possible available information. Additional variables may need to be included, such as use of anti-lipid medications that may impact blood pressure and lipophilic chemical exposures. Furthermore, there may be residual confounding due to additional chemical exposures that may affect blood pressure, such as heavy metals. However, we excluded the heavy metals from the analysis because of a sampling design issue in which some metals and POPs were measured in non-overlapping subsamples. Third, we did not account for the survey sampling design and weights in the analysis.
The proposed approach is currently limited to the continuous outcome under the linear model formulation. For categorical outcomes, the generalized linear model is often used instead, and adaptation of TEV to the more general outcomes is not currently developed. The detailed theory for supporting the use of the proposed methods in this article is not presented in this paper. However, such theory has been developed by the leading author and his collaborators. Interested readers may find it in the reference. One remaining theoretical gap is the lack of a variance formula for the conditional effects estimator for constructing confidence intervals for the proportion of the explained variation after adjustment for covariates. As a result, inference on the total effects relies on the conditional permutation tests only.
We have developed the R code for performing the analysis presented in this paper. We are currently packaging the code into an R package. The package will be posted online for free access. The computations are mostly fast for the proposed methods except for the conditional permutation tests which can be slow when the dimension of the covariates is very high and an accurate p-value is to be computed by the simulation.

Conclusions
The main achievement of the proposed methods is to provide a tool for inferring the overall effects of a large number of exposures on an outcome, with possible adjustment for a large number of confounders, without the need to estimate individual effects of the exposures. Unlike many ad hoc approaches for estimating the overall effects, the proposed approaches have a solid theoretical foundation to guarantee their performance under the condition that covariates can be linearly transformed into independent variables. R code is also available for other researchers to use the proposed methods.

Acknowledgments:
We acknowledge and appreciate the input from the following co-investigators and colleagues: Saria Awadalla and Sanjib Basu.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
To generate the simulated data, a correlation matrix is first generated as follows. A p × p matrix with independent normal N(a, 1) entries and a p × p matrix with independent uniform U[−0.5, 0.5] entries are generated. Denote them, respectively, by A and B. Let D = (AB) t AB.
A correlation matrix C is extracted from the covariance matrix D using the singular value decomposition approach. The level of correlation can be adjusted by changing the values of a. In the simulation, independence corresponds to D = I and mild high correlations corresponding, respectively, to a = 0 and a = 2. Covariates at different levels of correlation are then generated in the following way. Standard normal random numbers are generated first. Power transformation is then applied to generate deviation from the correlated normal, where power transformation maps a random variable u into u 1 through where γ > 0. In the case of χ 2 1 covariates, the transformation drops sign(u). The covariates are rescaled to a standard deviation of 1 before being transformed by C 1/2 . The covariate effects are set to a fixed constant for half of the variables and 0 for the other half. The constant is chosen along with the error variance so that the proportion of explained variation is close to 0.2, 0.5, 0.8, respectively. Random errors are generated as mean 0 normal random numbers with a possible power transformation to yield deviation from the normal distributed random error. Supplementary covariates with sample sizes equal to the number of covariates are simulated for each case.