A Metabolomic Severity Score for Airflow Obstruction and Emphysema

Chronic obstructive pulmonary disease (COPD) is a disease with marked metabolic disturbance. Previous studies have shown the association between single metabolites and lung function for COPD, but whether a combination of metabolites could predict phenotype is unknown. We developed metabolomic severity scores using plasma metabolomics from the Metabolon platform from two US cohorts of ever-smokers: the Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS) (n = 648; training/testing cohort; 72% non-Hispanic, white; average age 63 years) and the COPDGene Study (n = 1120; validation cohort; 92% non-Hispanic, white; average age 67 years). Separate adaptive LASSO (adaLASSO) models were used to model forced expiratory volume at one second (FEV1) and MESA-adjusted lung density using 762 metabolites common between studies. Metabolite coefficients selected by the adaLASSO procedure were used to create a metabolomic severity score (metSS) for each outcome. A total of 132 metabolites were selected to create a metSS for FEV1. The metSS-only models explained 64.8% and 31.7% of the variability in FEV1 in the training and validation cohorts, respectively. For MESA-adjusted lung density, 129 metabolites were selected, and metSS-only models explained 59.0% of the variability in the training cohort and 17.4% in the validation cohort. Regression models including both clinical covariates and the metSS explained more variability than either the clinical covariate or metSS-only models (53.4% vs. 46.4% and 31.6%) in the validation dataset. The metabolomic pathways for arginine biosynthesis; aminoacyl-tRNA biosynthesis; and glycine, serine, and threonine pathway were enriched by adaLASSO metabolites for FEV1. This is the first demonstration of a respiratory metabolomic severity score, which shows how a metSS can add explanation of variance to clinical predictors of FEV1 and MESA-adjusted lung density. The advantage of a comprehensive metSS is that it explains more disease than individual metabolites and can account for substantial collinearity among classes of metabolites. Future studies should be performed to determine whether metSSs are similar in younger, and more racially and ethnically diverse populations as well as whether a metabolomic severity score can predict disease development in individuals who do not yet have COPD.


Introduction
Chronic obstructive pulmonary disease (COPD) is characterized by persistent respiratory symptoms and airflow limitation that is due to airway and/or alveolar abnormalities usually caused by significant exposure to noxious particles or gases [1]. COPD is one of the leading causes of death and hospitalizations worldwide and in the United States [2,3]. However, many adults with abnormal pulmonary function are not aware of having any obstructive lung disease [2]. Although the lung is the main affected organ, there is strong evidence for systemic effects of COPD as evidenced by muscle wasting, cardiovascular disease, osteoporosis, and depression, which suggests a generalized metabolic disturbance in affected individuals [2]. Metabolomic profiling of blood may help to assess these metabolic disturbances.
Several studies have examined the association between metabolites and lung function measures. In investigating forced expiratory volume in 1 s (FEV 1 ), FEV 1 as a percentage of predicted and the FEV 1 /Forced Vital Capacity (FVC) ratio, Cruickshank-Quinn et. al found 32 and 269 significantly associated metabolites, respectively, in plasma samples from 131 participants from a cohort with COPD [4]. They found significant associations between glycerophospholipid metabolism and FEV 1 percent predicted, and between sphingolipids and FEV 1 /FVC [4]. In a large general population study (n = 4742), Yu et al. reported 30 novel metabolites associated with FEV 1 , out of a total of 95 associated metabolites with FEV 1 using plasma samples [5]. Additionally, they found 100 metabolites associated with FVC. Yu et al. also showed associations between FEV 1 and four metabolic pathways, including aminoacyl-tRNA biosynthesis; phenylalanine metabolism; nitrogen metabolism; and alanine, aspartate, and glutamate metabolism [5]. Kelly et al. found 156 metabolites associated with FEV 1 in a general population study of 10,460 participants with a validation cohort of 437 participants using blood and plasma samples [6].
The association between COPD, metabolic pathways, and clusters of metabolites indicates a need for multivariable metabolite models, instead of single metabolite models, to elucidate important metabolic profiles among patients with COPD. One approach to multivariable metabolite modeling entails creating a score. Metabolomic scores are useful in predicting a variety of chronic diseases and disease risk markers, including incident coronary heart disease [7,8], weight gain [9], and type 2 diabetes [10]. The methods used to develop these scores range from least absolute shrinkage and selection operator (LASSO) to random forest modeling. These scores can also sometimes include clinical markers. However, a metabolomic score has not been developed for COPD. Pinto-Plata et al. evaluated a panel of untargeted metabolomics using random forest and support vector machines to classify controls, surviving patients, and nonsurviving patients with COPD, but they did not develop a score metric [11]. Furthermore, none of these studies used an independent population to validate the performance of a COPD score.
A challenge in developing scores is determining which variables to use, particularly when the number of variables is larger than the number of subjects and when there is collinearity among variables. As with LASSO, the adaptive LASSO (adaLASSO) model shrinks beta parameters to exactly 0 to drop unimportant metabolites, and additionally, adaLASSO adds individual weights for each variable to control bias in estimators, which allows for more consistent variable selection. The coefficients from the adaLASSO can then be used as weights to develop a final score and are interpretable in terms of the linear relationship between the metabolite and the outcome. In this analysis, we use adaLASSO to develop separate metabolomic severity scores (metSSs) for two important clinical manifestations of COPD-airflow limitation and emphysema-using two independent cohorts. Since airflow limitation is relatively easy and inexpensive to assess with spirometry, the severity scores may demonstrate the mechanistic associations between the clinical manifestations and metabolic disturbances.

Demographic Characteristics
Demographics and clinical characteristics for both the training and validation cohorts are presented in Table 1. Cohorts show significant differences in all characteristics except for sex and BMI. The SPIROMICS (training) cohort had a greater proportion of self-identified Black/African American participants (18.8% vs. 8.5%, Chi-squared p-value < 0.001), and a higher percentage of current smokers (33.5%, vs. 23.9%, Chi-squared p-value < 0.001). However, the COPDGene (validation) cohort had more participants with Global Initiative for Chronic Obstructive Lung Disease (GOLD) stage 4 (5.6% vs. 2.3%, p-value < 0.001), a lower postbronchodilator FEV 1 (2.2 vs. 2.3 L, t-test p-value = 0.003), and lower MESAadjusted lung density (g/L) (81.7 vs. 86.1 g/L, t-test p-value < 0.001) indicating more severe COPD.

Adaptive LASSO Results
For FEV 1 , the adaLASSO procedure selected a total of 132 metabolites. The top 25 metabolites are shown in Table 2 (all metabolites and weights are shown in Table S1). Additionally, to depict the association of the adaLASSO-selected metabolites and FEV 1 , scatterplots of the four largest-coefficient metabolites are shown in Figure S1. In the validation data set, the metSS-only model explained 31.7% of the variability in FEV 1 . The combined metSS and covariate model explained 53.4% of the variability, which was more than the covariate-only model (46.4%) (p < 0.001). In the training data set, the metSS-only model explained almost 1.5 times the variability in FEV 1 compared with the covariate-only model (64.8% vs. 42.1%) ( Table 3). When combined, the clinical covariates and metSS explained 4.1% more variability compared with the metSS-only model (p < 0.001) in the training cohort. Mean squared error (MSE) showed similar patterns in the error for each model. Figure 1 shows the relationship between the metSS-predicted FEV 1 values and the observed FEV 1 values in the training and validation data sets. The figure shows a bias in the predicted scores with predictions for the highest and lowest FEV 1 values being underand over-predicted, respectively.    Linear regression models of covariates-only, metSS-only, and metSS with covariates for MESA-adjusted lung density showed a similar pattern to the corresponding models for FEV 1 (Table S2). The adaLASSO model selected 129 metabolites for MESA-adjusted lung density. In the validation data set, the metSS explained 17.4% of the variability; however, the combined metSS and covariate model outperformed the covariate-only model (adjusted R 2 : 42.2% vs. 38.2%). In the training data set, the metSS-only models explained 59% variability in MESA-adjusted lung density, while combined metSS and covariate models explained 63% of the variability, similar patterns were seen for MSE in both the validation and training cohorts.

Pathway Analysis
For FEV 1 , metabolites in three KEGG pathways were identified by the MetaboAnalyst 5.0 pathway analysis tool as over-represented (hypergeometric FDR < 0.10): arginine biosynthesis; aminoacyl-tRNA biosynthesis; and glycine, serine, and threonine metabolism, as shown in Figure 2. The specific metabolites in each of the significant pathways and their coefficients (weights) in the adaLASSO are shown in Table 4.

Sensitivity Analysis
As a sensitivity analysis, the training and validation data sets were exchanged and metSSs were developed in the COPDGene cohort and validated in the SPIROMICS cohort. Results are presented in Tables S3 and S4. In brief, the results were similar with the combined metSS and covariate models having the highest explained variability for both FEV 1 and MESA-adjusted lung density in both the training and validation data sets; however, almost double the number of metabolites were chosen by the adaLASSO procedure. Additionally, to correct for the bias depicted in Figure 1, the adaLASSO procedure was run with the highest and lowest quintiles of the training data weighted by a factor of 5. This incurs a higher penalty in the procedure for incorrectly predicting these subjects. A smaller number of metabolites were selected by adaLASSO using higher weights for extreme values. The linear models using the metSS from the extreme value weighted adaLASSO produced similar results to the original model, with the combined metSS and covariate model explaining the most variability in FEV 1 and MESA-adjusted lung density; however, the metSS-only models underperformed compared with the original analysis.

Discussion
This is the first publication of a metabolomic severity score for a respiratory disease. The major advantage of the metSS is that, similar to a genetic risk score, it combines the predictive power of many variables that individually, typically, explain only a small percentage (<5%) of the variability of a phenotype. For instance, in the absence of clinical covariates, our metSS was able to explain 32% of the variability in the independent validation cohort, similar to the SNP-based polygenic risk scores developed on hundreds of thousands of individuals [12]. Indeed, when clinical covariates (sex, age, height, race/ethnicity, BMI, smoking status, smoking pack-years, and clinical center) were added to the metSS, there was a 7% increase in explanation of variance over the covariate-only model. These findings support a key role of the blood metabolome in understanding COPD, as the metabolome reflects various physiological processes important in COPD pathogenesis, from immune regulation and energy homeostasis to protein synthesis/degradation and skeletal muscle dysfunction.
We considered two approaches to generating a metSS: with clinical covariates and without clinical covariates. The major advantage of a metSS without clinical covariates is that it is a standalone blood test and does not need separate interpretation based on age, sex, race, etc. An alternative would be to develop a score that included those covariates; however, that would limit the application of metSS to only subjects who have all covariates measured. Thus, using only metabolites in the adaLASSO procedure to derive a severity score allows for a broadly useful score for the population.
Studies have used a variety of methods to develop risk scores for different diseases. For this metSS, our goal was to create a parsimonious model for FEV 1 using only metabolite data. Since the number of metabolites exceeds the sample size (p > n), and metabolites can be highly correlated, adaLASSO was our preferred method for variable selection. As with LASSO, adaLASSO performs both variable selection and effect estimation simultaneously. However, adaLASSO has an advantage over LASSO in that it uses individual weights for each variable to reduce the bias in large coefficients found in LASSO [13]. For this analysis, the inverse of the absolute value of ridge regression coefficients were used as individual penalties for each metabolite, which allows for correlated metabolites to be included in the final model. Additionally, adaLASSO has been shown to select the true subset of variables and estimate the weight of the true variables as if only the true variables were included in the model in simulated datasets, which is called the oracle property [13,14]. One drawback to adaLASSO is that it lacks the ability to model nonlinear effects. However, metabolites selected by adaLASSO are interpretable in the same way as effects from multivariable regression models. Other methods such as random forests and support vector regression may be able to model nonlinear effects, but they can be difficult to interpret.
As adaLASSO has the ability to select only signal metabolomic features (i.e., the oracle property), it is worth investigating which metabolomic features are selected in the metSS. Approximately 70% of selected metabolites consisted of amino acid and lipid super pathway metabolites, split evenly between the two categories. The five metabolites with the strongest weights were vanillylmandelate (VMA), N1-methyladenosine, glutamine, 2hydroxypalmitate, and choline phosphate. While this is the first report of these metabolites and lung function or COPD, other metabolites selected by adaLASSO have been reported to be associated with COPD (Table S1). For instance, dimethylarginine and, specifically, asymmetric dimethylarginine (ADMA) results in a "functionally relevant shift" in l-arginine breakdown and has been associated with airflow obstruction [15]. Sphingomyelin has been associated with progression of percent emphysema and with worsening lung function [4,16]. Finally, 12,13 DiHOME has been associated with sex-specific effects with increased levels found in female smokers compared with female non-smokers [17].
The KEGG pathways that were over-represented in the metabolites selected in the adaLASSO procedure were arginine biosynthesis; aminoacyl-tRNA biosynthesis; and glycine, serine, and threonine metabolism. Glutamine and arginine, important amino acids in both the arginine biosynthesis and aminoacyl-t-RNA biosynthesis pathways, were inversely associated with FEV 1 . These findings are concordant with prior studies that found serum glutamine to be elevated in individuals with COPD compared with controls, and both serum glutamine and arginine to be elevated specifically in individuals with GOLD 4 COPD compared with controls [18,19].
AdaLASSO selection of dimethylarginine (DMA) further supports dysregulation of the arginine pathway in COPD. DMA is produced when methylated arginine residue proteins are degraded. DMA can be stimulated by hypoxia and is important in inflammation because it inhibits nitric oxide synthases (NOS) [20]. Our findings are supported by several smaller publications. In a study of 44 COPD patients and 30 healthy subjects, DMA was higher in patients with COPD [21]. These findings were supported by another study of 58 COPD patients and 30 healthy subjects [20]. In another study of 23 moderate-tosevere COPD patients and 19 healthy older controls, whole-body arginine was higher in COPD patients and related to de novo arginine production [22]. Additionally, the arginine pathway was also implicated in smoke-mediated emphysema in mice through its role in oxidant/antioxidant balance [23]. However, in a study of 25 COPD patients and 21 controls, a negative association was found between arginine and COPD status [24], while in another study comparing healthy smokers and smokers with COPD, Naz et al. found differences by sex with COPD women having lower ratios of arginine/(citrulline + ornithine) and higher ratios of asymmetric (ADMA) and symmetric dimethylarginine (SDMA) to arginine compared with healthy female smokers, but no difference in men [25]. KEGG analysis also identified aminoacyl-tRNA biosynthesis and glycine, serine, and threonine pathway overrepresentation in the metSS, which suggests disturbances in general amino acid metabolism. Aminoacyl-tRNAs are vital for protein synthesis and have been associated with oxidative stress [5,6,26]. Similarly, multiple studies have found enrichment of the glycine, serine, and threonine metabolism pathways, which have been associated with COPD exacerbation severity [4,27]. The reason for amino acid metabolism dysfunction in COPD is unclear but has been speculated to be a result of systemic inflammation and skeletal muscle energy metabolism dysfunction [28].
There are several limitations to our metSS approach. For instance, our metSS was developed in a population enriched with COPD, which could limit the generalizability and the deconvolution of those metabolomic pathways related to COPD progression versus those that regulate lung development and variability in the general population. This is supported by the overall low, but not negligible, overlap of metabolites associated with FEV 1 between our analysis and those involving general population cohorts that included a significant proportion of individuals who never smoked and/or who had normal FEV 1 [5,6]. Beyond disease progression versus lung-development-related metabolites, differences in methodological approaches (adaLASSO vs. multivariable regression models) between analyses might also explain these results and warrant further investigation. Additionally, the severity score was developed with a cross-sectional sample and the findings may not to apply to the longitudinal progression of COPD within an individual. The utility of the metSS should be tested with a longitudinal sample to determine the viability of the metSS in predicting progression of COPD. We chose the SPIROMICS data as our training set, even though it was smaller than COPDGene, as the population was more racial/ethnically diverse, had a higher proportion of current smokers and had less severe disease status, which should improve the utility of the score in undiagnosed populations and in early COPD patients where disease identification and prevention is most relevant. Age is also an important factor for the metabolome; the cohorts used to derive and validate the severity score were older adults, which again limits the generalizability of the score to younger populations. Finally, due to the low cost and accessibility of spirometry, the metSS, currently, should be used in research settings to better understand the pathobiology of COPD rather than a clinical tool.
In summary, we show that we can use adaLASSO to generate a metSS, similar to polygenic risk scores, which is highly associated with the variability of FEV 1 in two independent cohorts of individuals with a smoking history and with or at risk for COPD. The FEV 1 metSS is significantly enriched in amino acid pathways (particularly, arginine metabolism), suggesting the importance of these pathways in COPD pathogenesis. Further work in younger subjects without disease who have multiple long-term spirometry assessments should be undertaken to assess whether a metSS can predict disease development or progression.

Training Cohort
The Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS) (ClinicalTrials.gov Identifier: NCT01969344) provided metabolomic data for training in the development of our metSS. In brief, this study recruited 2771 participants between 40-80 years old with at least 20 pack-years of smoking and 202 participants who were never smokers; 73% of participants self-identified as non-Hispanic white. Of participants returning for their 5-7 year follow-up visit, the first 649 were selected for metabolomic profiling of baseline fasting blood samples. Details of the cohort are provided elsewhere [29,30]. Data from 648 participants were used in this analysis, as one subject was excluded. Median standard deviation scores (z-scores) were calculated across metabolites at the subject level.
Subjects with aggregate metabolite median z-scores > 3 SD from the mean of the cohort were removed.

Validation Cohort
The Genetic Epidemiology of COPD (COPDGene) (ClinicalTrials.gov Identifier: NCT00608764), another NIH-sponsored multicenter cohort, provided metabolomic data for validation of the metSS. Details of the COPDGene study are provided elsewhere [31]. In brief, this study enrolled 10,198 non-Hispanic white and African American participants between 40-80 years with at least 10 pack-years of smoking and no exacerbations for more than 30 days, and 465 individuals with no smoking history. At the in-person, 5-year visit, 1136 participants from the National Jewish Health and University of Iowa clinical centers participated in an ancillary study in which nonfasting blood samples were collected and processed for metabolomic profiling [29,31,32]. Data from 1125 participants were used in this analysis and six subjects were excluded based on median standard deviation score (z-scores), as described above, and another five had missing values of covariates.
Informed consent was obtained from all subjects involved in the study.

Clinical Data and Definitions
COPD severity and interindividual differences are best measured by FEV 1 and emphysema, respectively; due to this heterogeneity, both postbronchodilator forced expiratory volume in 1 s (FEV 1 ) and quantitative emphysema were used to generate separate metSSs. Emphysema was quantified using MESA-adjusted lung density from a computed tomography scan of the chest and adjusted according to reference equations from the Multi-Ethnic Study of Atherosclerosis [33]. Clinical covariates included age at time of visit, sex, height, self-identified race/ethnicity, BMI (kg/m 2 ), current smoking status, smoking pack-years, and clinical center. Additionally, for MESA-adjusted lung density, scanner model was included as a covariate. We tested for differences between the two cohorts using t-tests for continuous variables, and chi-squared tests and Fisher's Exact test for categorical variables.

Metabolomic Profiling and Processing
Plasma samples from both cohorts were profiled using Metabolon (Durham, NC, USA) Global Metabolomics Platform, as described previously, although profiling for each cohort occurred approximately 1 year apart [34][35][36]. Metabolite values were batch normalized, within each study, by dividing by the median metabolite value for each metabolite within a batch. After batch normalization, metabolite PCs showed a significant reduction in association with batch, so no further normalization was needed [32].
Metabolites were excluded if >80% of samples were missing values (COPDGene: 149; SPIROMICS: 197). For metabolites missing in 20-80% of samples, a present/absent (1/0) indicator variable was created and used for all analyses (COPDGene: 248, SPIROMICS: 192). For metabolites missing in <20% of samples, missing values were imputed with knearest neighbor imputation (kNN; k = 10) using the R package 'impute' (COPDGene: 995; SPRIOMICS: 785). In the SPIROMICS study, a total of 1174 metabolites were characterized by Metabolon, whereas 1392 metabolites were characterized in the COPDGene study. For the 7 tobacco metabolites identified by Metabolon and common between studies, 1 (nicotine) was excluded since >80% of samples were missing values, and the other six were converted to present/absent indicator variables.

Statistical and Bioinformatics Analysis
Continuous metabolite values were natural log-transformed for all analyses. Of the 946 metabolites identified in both studies, 73 had <20% missing samples in one study and 20-80% missing in the other, and 111 had at least one cohort with ≥80% missing samples; 673 had kNN imputed missing values in both cohorts and 89 were dichotomized to present/absent variables in both studies, resulting in 762 metabolites used to develop the metSSs.
Separate adaptive least absolute shrinkage and selection operator (adaLASSO) analyses were performed to select metabolites and their weights for the metSSs for FEV 1 and MESA-adjusted lung density. Inverted absolute value ridge regression estimates were used as initial weights for adaLASSO, and a 10-fold cross validation was performed in the training data set to obtain the penalty parameter (λ), which corresponded to the lowest mean-squared error across the folds. The adaLASSO procedure did not include clinical covariates. The metSSs were created with the coefficients and set of metabolites selected by the adaLASSO procedure.
Separate linear regression models were run to assess the variability explained (adjusted R 2 using the Wherry formula) by (a) the clinical covariates, (b) the metSS, and (c) the metSS in concert with clinical covariates associated with FEV 1 and MESA-adjusted lung density. Both training and validation data sets were modeled using linear regression. Additionally, mean squared error (MSE) was calculated to assess fit. Sensitivity analyses were performed to test the effect of switching the training and validation cohorts and by adding a higher penalty to errors in the highest and lowest quintiles of the training cohort.

Pathway Analysis
Pathway analysis was conducted using MetaboAnalyst 5.0 web server (accessed 2 November 2021) for KEGG pathways. Details on pathway analysis were previously published [37]. In brief, pathway analysis used a hypergeometric test to determine overrepresentation of pathways based on metabolites selected by adaLASSO and, additionally, used measures of metabolite centrality, including relative betweenness centrality and outdegree centrality, to calculate importance in the pathway and determine pathway impact. Pathway analysis was conducted on metabolites selected by the adaLASSO procedure.

Software
The statistical software R Version 4.0.2 was used for all analyses. The R packages glmnet and stats were used for adaLASSO and linear regression models, respectively. The MetaboAnalyst webserver (https://www.metaboanalyst.ca/, accessed on 2 November 2021) was used for pathway analysis. Analysis code can be accessed on GitHub at https://github.com/sunigodbole/netco-metRS, accessed on 2 November 2021.
Supplementary Materials: The following supporting information can be downloaded at https:// www.mdpi.com/article/10.3390/metabo12050368/s1, Figure S1: Scatterplots of the FEV1 postbronchodilator (L) and four strongest effect metabolites from the adaLASSO selection in the training and validation cohort.; Table S1: All metabolites selected by the adaLASSO procedure, the adaLASSO coefficient, super pathway, and sub pathway from Metabolon processing;   Takeda Pharmaceutical Company; and Theravance Biopharma and Mylan. The project described was supported by Award Number U01 HL089897 and Award Number U01 HL089856 from the National Heart, Lung, and Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. COPDGene is also supported by the COPD Foundation through contributions made to an Industry Advisory Board that has included AstraZeneca, Bayer Pharmaceuticals, Boehringer-Ingelheim, Genentech, GlaxoSmithKline, Novartis, Pfizer, and Sunovion. SG, KAP, AH, KK, and RPB were supported by R01HL152735 and R01HL137995. MHC was supported by R01HL153248, R01HL149861, R01HL135142, R01HL137927, R01HL147148, and R01HL089856.
Institutional Review Board Statement: COPDGene and SPIROMICS were conducted according to the guidelines of the Declaration of Helsinki and approved by the institutional review boards at each center (ClinicalTrials.gov: NCT00608764 (COPDGene) and NCT01969344 (SPIROMICS)).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.