Estimating the Relative Contribution of Environmental and Genetic Risk Factors to Different Aging Traits by Combining Correlated Variables into Weighted Risk Scores

Genetic and exposomal factors contribute to the development of human aging. For example, genetic polymorphisms and exposure to environmental factors (air pollution, tobacco smoke, etc.) influence lung and skin aging traits. For prevention purposes it is highly desirable to know the extent to which each category of the exposome and genetic factors contribute to their development. Estimating such extents, however, is methodologically challenging, mainly because the predictors are often highly correlated. Tackling this challenge, this article proposes to use weighted risk scores to assess combined effects of categories of such predictors, and a measure of relative importance to quantify their relative contribution. The risk score weights are determined via regularized regression and the relative contributions are estimated by the proportion of explained variance in linear regression. This approach is applied to data from a cohort of elderly Caucasian women investigated in 2007–2010 by estimating the relative contribution of genetic and exposomal factors to skin and lung aging. Overall, the models explain 17% (95% CI: [9%, 28%]) of the outcome’s variance for skin aging and 23% ([11%, 34%]) for lung function parameters. For both aging traits, genetic factors make up the largest contribution. The proposed approach enables us to quantify and rank contributions of categories of exposomal and genetic factors to human aging traits and facilitates risk assessment related to common human diseases in general. Obtained rankings can aid political decision making, for example, by prioritizing protective measures such as limit values for certain exposures.


Introduction
Human phenotypes in general and health outcomes such as aging traits in particular result from genetic and non-genetic influences. For the latter the term exposome has been coined, which according to Christopher Wild is the totality of all non-genetic factors a human individual is exposed to from conception to death [1]. The exposome not only encompasses environmental factors but also lifestyle and behaviors, social environment and social status as well as the biological response [2].
The exposome concept is a step towards an all-embracing assessment of environment and health. It is potentially very interesting with regards to risk assessment, because knowledge about the relative contribution of distinct exposomal and genetic factors to a specific health outcome or aging trait would allow for more a precise and efficient prevention.
However, to date it is still challenging to measure all exposures of the exposome continuously over the entire lifetime for each individual [3]. Apart from the difficulty in measuring the exposome, the statistical analysis is also not straightforward [3]. Even a Here, data from the German SALIA cohort (Study on air pollution, lung function, inflammation and aging) is analyzed to prove the concept of applying risk scores to estimate relative contributions of environmental and genetic risk factors. The concept is applied to two aging traits, namely skin aging and aging-associated decreased lung function, to evaluate the relative contributions. The focus is on these two aging traits because for both it is well established that genetic and a number of specific exposomal factors contribute to their development [19,20]. Accordingly, important exposomal factors for lung aging are tobacco smoke and air pollution [20][21][22]; for skin aging these include exposure to ultraviolet (UV) radiation, air pollution and tobacco smoke, as well as nutritional factors [19].

Materials and Methods
The proposed methodology was applied to data from the SALIA cohort study of elderly German women by analyzing three aging traits: a z-score of facial pigment spots and z-scores of the lung function parameters Forced Expiratory Volume in 1 s (FEV 1 ) and Forced Vital Capacity (FVC). Details on the study population [23][24][25] as well as outcome [26][27][28][29], genetic [30][31][32][33][34][35][36], exposure [37][38][39][40] and confounder variables can be found in the Supplementary Methods and Supplementary Tables S1-S3 (Supplementary File S1). An overview of the variables is given in Figure 1 and descriptive statistics can be found in Table 1. The de-correlating effect of building the risk scores is demonstrated with correlation plots in Supplementary Figures S1-S5 (Supplementary File S1).

Choice of Predictors
The risk scores combinepredefined predictors. In addition to the risk scores, further predictors are included as single predictors in the linear regression models. Figure 1 gives an overview for both the skin aging and the lung function analyses.
All predictors except the SNP variables have been standardized to mean zero and standard deviation one to achieve a fair penalization of all regressors in the following.

Bootstrap Data Sets and Division into Training and Test Sample
The following analysis steps are repeated B times using the bootstrap principle. That means that the analysis is not only conducted on the original data set, but also on B-1 data sets created from the original data set by randomly sampling participants with replacement. Each bootstrap data set is then divided randomly into training and test samples in the relation 60% to 40%, as recommended in [12]. The number of bootstrap data sets is set to B = 500 for the skin aging outcome and B = 200 for the lung function outcomes, since in the latter case 278 SNP variables are included in the genetic RS resulting in considerably longer computation times.

Learning Risk Score Weights on Training Sample
The weights for the risk scores were learned on 60% of the participants in the training sample using ridge regression (implemented by elastic net regression with parameter α = 0), where the regularization parameter lambda was chosen via tenfold cross-validation with the R function cv.glmnet from R package glmnet [41]. Ridge regression was chosen, since variable selection was not the ultimate goal and shrinking all coefficients towards zero should retain the relations between the variables and lead to meaningful RS weights. The model formula, exemplary for the lung function analyses, is given by  The single predictors are included in this regression model to account for their effects on the outcome, but only the predictors belonging to the risk scores are regularized, since they are highly correlated and their coefficients will be used as weights in the risk scores. Since the folds for the cross-validation in cv.glmnet are selected at random, the results are random as well. To reduce this randomness, the estimation of the coefficients γ k,j is repeated twenty times and the resulting estimates are averaged. For each RS the respective subset of coefficients are normalized so that the resulting weights lie in [−1, 1] and sum to one for each RS. Explicitly, the weights of the risk scores are (exemplary for the lung function analyses) calculated as whereγ k,j are the estimates averaged over the twenty repetitions and m k is the number of predictors included in RS k (compare Figure 1).

Risk Scores, Linear Model and Relative Importance in the Test Sample
The RS weights derived from the training sample are used to calculate the values of each RS k in the test sample as the weighted average of the respective predictors: The risk score values are then scaled by their interquartile ranges: z k = z k IQR(z k ) , k = 2, · · · , 5. Afterwards, the risk scores and the fixed single predictors are used as independent variables in a multiple linear regression model for each outcome (here: a lung function index): Air pollution RS Finally, the relative importance of all independent variables in this linear model is calculated with the R function calc.relimp [42], where the relative importance of socioeconomic status (SES) is assessed using the two binary dummy variables as one group. The relative importance metric Lindeman-Merenda-Gold is used, which decomposes the coefficient of determination of the model, R 2 , by averaging sequential R 2 s over orderings of regressors [18, chapter 4.7]). The relative contributions given by this metric sum to the overall R 2 [42].
All analysis steps are carried out for each bootstrap sample. The results presented in the following section are thus based on medians and percentiles across the bootstrap samples. The regression coefficients of the risk scores reflect the change in the outcome for an increase of one interquartile range of the respective RS and are used to determine significance of the association. Since the relative contributions of the risk scores are given as percentages, the corresponding bootstrap confidence intervals lie by definition above zero and cannot determine significance. All calculations and figures were done using R version 4.0.3 [43], except for Figure 1, which was produced using Microsoft Office Professional Plus 2019.

Descriptive Analyses of the Outcomes and the Predictors
Descriptive statistics for all outcome and predictor variables are given in Table 1 and have been calculated for all 547 participants with available genetic and skin aging data and for 510 participants with available genetic and lung function data.
The two analysis samples differ only slightly in their characteristics. The women were on average 73 years old at the time of the second follow-up, were on average mildly overweight (mean BMI: 27 kg/m 2 ) and according to the mean MeDi score of 28.5 their nutritional habits moderately followed the Mediterranean diet. Only a few women were current or former smokers (2% and 16%, respectively), but about 33% were exposed to ETS at home and 42% at work.

Skin Aging Outcome
For the outcome of facial pigment spots all predictors combined explain 16.90% of the variance (median total R 2 , Table 2). As can be seen from the regression coefficients in Figure    Regression coefficients (bootstrap medians and 95% percentile confidence intervals) of the predictors for pigment spots. The coefficients reflect the change in the z-score for one unit increase in the single predictors and one interquartile range increase in the risk scores. MeDi: Mediterranean diet; HRT: hormone replacement therapy; RS: risk score; SESmed: medium socio-economic status (reference: low socio-economic status); SEShigh: high socio-economic status (reference: low socioeconomic status); SPF: sun protection factor; * p < 0.05.

Lung Function Outcomes
All single predictors and risk scores combined explain in median 22.32% of the variance in FEV1 and 23.36% of the variance in FVC (see Tables 3 and 4). The genetic RS is associated with both lung function parameters according to its regression coefficients (see   Regression coefficients (bootstrap medians and 95% percentile confidence intervals) of the predictors for pigment spots. The coefficients reflect the change in the z-score for one unit increase in the single predictors and one interquartile range increase in the risk scores. MeDi: Mediterranean diet; HRT: hormone replacement therapy; RS: risk score; SESmed: medium socio-economic status (reference: low socio-economic status); SEShigh: high socio-economic status (reference: low socioeconomic status); SPF: sun protection factor; * p < 0.05.

Lung Function Outcomes
All single predictors and risk scores combined explain in median 22.32% of the variance in FEV 1 and 23.36% of the variance in FVC (see Tables 3 and 4). The genetic RS is associated with both lung function parameters according to its regression coefficients (see It might seem as if the effects of obesity and smoking are beneficial due to the positive regression coefficients, but the risk scores' weights are mostly negative so that a larger RS refers to less obesity and less tobacco smoke exposure.     Regression coefficients (bootstrap median and 95% confidence interval) of the predictors for FEV1. The coefficients reflect the change in the z-score for one unit increase in the single predictors and one interquartile range increase in the risk scores. FEV1: forced expiratory volume in 1 s; RS: risk score; SESmed: medium socio-economic status (reference: low socio-economic status); SEShigh: high socio-economic status (reference: low socio-economic status); * p < 0.05.

Discussion
The proposed approach enables us to quantify the contributions of genetic factors and various categories of the exposome to a certain outcome in terms of percentages of explained variance, where each category has been assessed by several (correlated) variables and combined into one weighted RS.
In these examples, the highest contribution to all aging traits was achieved by the genetic risk scores comprising the considered SNPs. Apart from the genetic and the obesity RS and in parts the smoking RS, the environmental risk scores were not associated with the outcomes. That the lower limit of the confidence interval of the air pollution risk score's coefficient is very close to zero in the skin aging example is in line with previous analyses in the SALIA study, which found associations of air pollution with skin aging [44] when using single-pollutant models with no need to split off a training sample. One might have expected to see associations of the aging traits with age (at least for the skin aging outcome, where the z-score does not account for age). This is probably not only due to the small sample size, but also due to a small age range of the study participants.
The magnitudes and rankings of the estimated relative importance are quite similar compared between the two lung function parameters, while there are distinct differences when comparing the same predictors and risk scores between skin aging and lung function. The percentage of variance explained by the genetic RS in the skin aging outcome is only less than half of that in the lung function outcomes. In addition, the smoking risk score's relative contribution, for example, ranks very low for skin aging, while it has a top three ranking for lung function. Thus, awareness campaigns and other measures to reduce the number of smokers seem more important for reducing lung aging than for delaying skin aging in the population. Against skin aging it would be more effective to inform about the negative effects of sunbed use.
Overall, the models were able to explain between 17% and 23% of the different outcomes' variances, which is noticeable considering the complexity of the aging processes, the limited sample size and the fact that further components of the exposome such as stress or lack of sleep which were not collected in the SALIA study could not be incorporated. Though the presented example is far from a complete exposome analysis, this investigation shows that (i) in principle the proposed approach can quantify the extent to which each of the various categories of the exposome contribute, and that (ii) these relative contributions vary for different health traits and thus can be ranked.
The approach has some limitations. First, it relies on the calculation of relative contributions via shares of explained variance in a linear regression model and is as a consequence limited to linear regression. Generalized linear models such as logistic regression for binary traits are not applicable, since the concept of relative contribution is not (easily) generalizable. This direction is interesting for future research since binary health outcomes are very common. Second, splitting the available data set into training and test samples does not only reduce statistical power in the linear regression analysis, but also requires repeated execution of the fitting process (here: several bootstrap samples) to reduce the randomness of the partition and the results. This, however, complicates reporting of the results which have to be averaged. In particular, residual diagnostics are not easily applicable since they would have to be examined for each bootstrap sample. Yet, there is usually no alternative to internal weights for the risk scores, since external weights from published studies with several covariates belonging to the same risk factor category are typically not available precisely because they are often highly correlated. Third, combining several covariates in one (weighted) RS certainly leads to loss of information. An alternative would be to directly report the results of a regularized regression without the formation of risk scores. However, the aim of this study was to show to which extent each risk factor category contributes to the outcome. To the best of our knowledge, the concept of relative contribution is not (yet) applicable to regularized regression models. Such an extension is interesting for future research. In addition, this study is limited to a certain configuration of model parameters. For example, there might be choices other than α = 0 (ridge regression) in the elastic net or other regularized regression or machine learning methods, which yield more appropriate RS weights for our purposes, but a comparison is beyond the scope of this work. Nevertheless, the presented application and results provide a proof of concept for the proposed methodology.

Conclusions
The combination of risk scores with a measure of relative contribution is suitable to assess the extent to which various categories of the exposome and genetic factors contribute to a certain health outcome, and the contributions can not only be compared between different outcomes, but also between, for example, different ethnic or age groups. In addition, the exposome's categories can be ranked according to their relative contribution. The proposed approach might thus have the potential to improve risk assessment relevant for human aging traits and beyond, i.e., common human diseases. In this regard it could be of interest not only to health scientists, but also to governmental institutions, because it might help to prioritize regulatory decisions limiting exposure to selected environmental factors and put them on a more solid scientific basis.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/ijerph192416746/s1, Supplementary File S1: Table S1: Scoring of the food frequencies for the Mediterranean diet score; Table S2: Single nucleotide polymorphisms used in the genetic risk score for the skin aging trait; Table S3: Single nucleotide polymorphisms used in the genetic risk score for the lung function traits; Figure S1: Correlation plot of the original variables used in the skin aging analysis (excluding the SNP variables); Figure S2: Correlation plot of the original variables used in the lung function analysis (excluding the SNP variables); Figure S3: Correlation plot of the risk scores and single predictors used in the skin aging analysis; Figure S4: Correlation plot of the risk scores and single predictors used in the analysis of FEV1; Figure S5: Correlation plot of the risk scores and single predictors used in the analysis of FVC. Refs. [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][45][46][47]   Data Availability Statement: Due to privacy laws in Germany the data of the cohort study are not available. The R code can be obtained by contacting the authors of the paper.