Identifying Protein–metabolite Networks Associated with COPD Phenotypes

Chronic obstructive pulmonary disease (COPD) is a disease in which airflow obstruction in the lung makes it difficult for patients to breathe. Although COPD occurs predominantly in smokers, there are still deficits in our understanding of the additional risk factors in smokers. To gain a deeper understanding of the COPD molecular signatures, we used Sparse Multiple Canonical Correlation Network (SmCCNet), a recently developed tool that uses sparse multiple canonical correlation analysis, to integrate proteomic and metabolomic data from the blood of 1008 participants of the COPDGene study to identify novel protein–metabolite networks associated with lung function and emphysema. Our aim was to integrate -omic data through SmCCNet to build interpretable networks that could assist in the discovery of novel biomarkers that may have been overlooked in alternative biomarker discovery methods. We found a protein–metabolite network consisting of 13 proteins and 7 metabolites which had a −0.34 correlation (p-value = 2.5 × 10−28) to lung function. We also found a network of 13 proteins and 10 metabolites that had a −0.27 correlation (p-value = 2.6 × 10−17) to percent emphysema. Protein–metabolite networks can provide additional information on the progression of COPD that complements single biomarker or single -omic analyses.

. Range of correlations between adjusted proteomic data, adjusted metabolomic data, and FEV1% (A) and range of correlations between adjusted proteomic data, adjusted metabolomic data, and percent emphysema (B).

S.2. Parameter Selection for FEV1% Modules using Adjusted -Omic Datasets
We tried different scaling constants for b and c where b = c. After analyzing the results from each scaling constant change, a final value for the scaling constants was selected. This selection was made by a variety of diagnostics including the correlation of each resulting network module's first principal component to FEV1%, ratio of protein to metabolite nodes, and strength of module edges. We aimed for a network with a high correlation to FEV1% (|rho| ≥ 0.20), close to a 1:1 ratio between proteins and metabolites, and the strongest edges amongst all derived modules. We initially applied scaling constant values of 5, 10, 15, and 20 as a first pass to decrease computational time. The diagnostics listed above indicated that we should further analyze scaling constant values between 10 and 15 to determine the scaling constant for which the network module results ceased to have a substantial change.
The different scaling constants can have a strong effect on module results and the number of modules correlated with FEV1% (Table S1). However, after scaling constant of 12, the number of proteins, metabolites and modules do not change much. The scaling constant of 11 was selected because it results in a module that has a high correlation to FEV1% (rho = 0.33, p-value = 3.9 × 10 -26 ) and a balanced ratio between proteins and metabolites in the module. Results with scaling constants below 11 have lower correlation to FEV1% (e.g., scaling constant of 10 gives a maximum correlation between a module and FEV1% of rho = -0.18, p-value = 2.3 × 10 -8 ). Because not all proteins and metabolites will contribute to the canonical correlation of SmCCNet, sparsity is imposed by adding penalties for the number of metabolites and proteins influencing the canonical weights (see Methods). The optimal metabolite and protein penalty parameters are (0.05, 0.05) when the scaling constant of 11 was applied to the SmCCNet.
Finalizing the use of scaling constant 11 (penalty parameters = (0.05, 0.05)) and applying SmCCNet to adjusted proteomic and metabolomic data and FEV1% results in one protein-metabolite module with seven metabolites and fifteen proteins. It is important to examine not only the overall correlation of the module with FEV1% (based on the first principal component summary), but also the individual correlations of the metabolites and proteins in the module with FEV1% ( Figure S2). . Figure S2: Module node correlations with FEV1%. All proteins and metabolites not included in Module 1 are represented as Module 0. The proteins are blue and the metabolites are red. Module 0 has a 0.08 (pvalue = 0.011) correlation with FEV1%. Module 1 has a 0.33 (p-value = 3.9 × 10 -26 ) correlation with FEV1%.

S.3. Identified Network Associated with FEV1%
The initial module from SmCCNet provides a candidate network but may contain some weak edges between metabolites and proteins. By removing weaker edges, we focus on the strongest connections in the network and most informative proteins and metabolites, as they may be removed if they are weakly connected to other features. All edges are scaled based on the most heavily weighted edge on a 0 to 1 scale. The edge between troponin T and phosphocholine was so heavily weighted compared to the other edge weights, most of the edge weights were very small, or less than 0.1. Therefore, we evaluated different edge thresholds to keep the strongest connections within the network and tried values between 0 (no edges cut, keeping original module) to 0.01 in increments of 0.001. The resulting networks were summarized by the number of proteins and metabolites, in addition to the network's first principal component correlation with the phenotype ( Figure S3). Edge threshold value of 0.004 was chosen because it resulted in a network with the strongest edges that still yielded the highest correlation to FEV1% (rho = -0.34, p-value = 2.5 × 10 -28 ).

S.4. Correlations between Unadjusted -Omic Data and Phenotype of Interest
Similar to adjusted -omic data analysis, we explored the range of correlations between unadjusted -omic data sets and between -omic data sets and phenotype of interest (e.g., protein to percent emphysema) before applying SmCCNet to proteomic and metabolomic data ( Figure S4). The range of correlations between the proteomic data and the metabolomic data was -0.69 to 0.74, but the range of correlations between unadjusted proteomic or metabolomic data with FEV1% or percent emphysema was smaller: in the -0.31 to 0.24 range. Therefore, scaled SmCCNet was applied to the proteomic and metabolomic data. A.
B. Figure S4: Range of correlations between proteomic data, metabolomic data, and FEV1% (A) and range of correlations between proteomic data, metabolomic data, and percent emphysema (B).

S.5. FEV1% Modules (Unadjusted -Omic Data)
We resulted in a network with a higher correlation to FEV1%, but the number of proteins in module 1 was small compared to the number of metabolites (e.g., scaling constant of 18 resulted in 131 proteins and 352 metabolites, scaling constant of 19 resulted in 9 proteins and 115 metabolites).
Finalizing the use of scaling constant 18 and applying SmCCNet to unadjusted proteomic and metabolomic data resulted in two protein-metabolite modules. The optimal penalty parameters were (0.45, 0.25) which favors protein-metabolite networks with more metabolites and less proteins.

S.6. Identified Network Associated with FEV1% (Unadjusted -Omic Data)
The initial Module 1 provides a candidate network, but we want to focus on the proteins and metabolites connected by the strongest edges. Therefore, weaker edges and metabolites and proteins which are connected by weaker edges are removed. To keep the strongest edges, we evaluated edge threshold values between 0.4 and 0.7 at increments of 0.05. Resulting networks were summarized by network correlation to FEV1% and number of proteins and metabolites ( Figure S6). Module 1 was the only module that passed edge cuts. Edge threshold value of 0.55 was chosen as the final edge threshold because it resulted in an interpretable network with 27 nodes. It also had a high correlation (rho = -0.28, p-value = 2.7 × 10 -20 ) and balanced ratio of proteins to metabolites. Edge thresholds higher than 0.55 resulted in a large reduction of metabolites from the network. The identified trimmed network has 16 proteins and 11 metabolites with varying range of individual feature correlations to FEV1% ( Figure S7, Table S3). Network hubs include troponin T and epidermal growth factor receptor with the most heavily weighted edge connecting the two hubs. All other metabolites and proteins in the network are either only connected to troponin T or connected to troponin T and epidermal growth factor. Unlike the FEV1% network created on adjusted proteomic and metabolomic data, phosphocholine is not a hub, but it is still connected to troponin T.

S.7. Network Comparison between Adjusted and Unadjusted -Omic Data with FEV1%
When comparing the FEV1% protein-metabolite network constructed on the adjusted -omic data with the network from the unadjusted -omic data, there was a significant number of protein and metabolite nodes consistent between the two networks (Fisher's exact test p-value = 8.2 × 10 -20 ) (Table S4). Table S4. FEV1% Network Nodes using Adjusted and Unadjusted Data.

S.8. Parameter Selection for Percent Emphysema Modules using Adjusted -Omic Datasets
Similar to methods used to construct protein-metabolite networks correlated for FEV1%, we tried different scaling constants for b and c where b = c when applying scaled SmCCNet to adjusted proteomic and metabolomic data and percent emphysema. First, scaling constant values of 5, 10, 15, and 20 were applied to SmCCNet. Next, further analysis of scaling constant values between 10 and 20 was performed to determine the scaling constant for which module results ceased to have substantial changes in the diagnostics mentioned above. After analyzing the results from each scaling constant change, a final value was selected based on each module's first principal component correlation to percent emphysema, number of metabolites and proteins and strength of module edges for each module.
While all scaling constant changes resulted in a similar maximum module correlation to percent emphysema (rho = -0.29), after the scaling constant value of 15, the number of proteins and metabolites do not change much (Table S5). Scaling constant value of 15 was selected because it had the same correlation to percent emphysema as other modules from different scaling constants. However, it was the largest scaling constant that provided a large number of nodes for which to build a network before a sharp decrease in number of nodes. The largest module (Module 1) derived from SmCCNet with scaling constant of 15 is comprised of 258 nodes, while scaling constant of 16 gives a single module with 17 nodes.
Finalizing the use of scaling constant 15 and applying scaled SmCCNet to adjusted proteomic and metabolomic data resulted in three protein-metabolite modules correlated with percent emphysema. The optimal penalty parameters were (0.35, 0.15) which favors networks with more metabolites and fewer proteins. Not only did we examine the full module's correlation to percent emphysema, but it was important to inspect the individual correlations of proteins and metabolites in each module to percent emphysema ( Figure S8). All proteins and metabolites not included in Module 1, 2 or 3 are represented as Module 0. Module 0 has a lower correlation with percent emphysema (rho = -0.068, p-value = 0.038) compared to Module 1 (rho = -0.29, p-value = 5.6 × 10 -22 , 42 proteins, 216 metabolites), Module 2 (rho = -0.13, p-value = 0.00011, two proteins, one metabolite), or Module 3 (rho = -0.18, p-value = 3 × 10 -8 , 23 proteins, one metabolite) correlation to percent emphysema.

S.9. Identified Network Associated with Percent Emphysema
While the initial Module 1 provides a candidate network, it is very large (42 proteins, 216 metabolites) and may contain weak edges between features. To focus on the strongest connections within the network, edges that do not meet a specified edge threshold were removed from the network thereby removing metabolites and proteins that are weakly connected to other proteins and metabolites. Edge thresholds between 0.3 and 0.6 were evaluated. The resulting networks were summarized by the network's correlation to percent emphysema and the number of proteins and metabolites ( Figure S9).
Module 1 was the only module that passed edge cuts. Edge threshold value of 0.5 was chosen as the final edge threshold because it resulted in an interpretable network with 23 nodes. It also had the strongest edges and a balanced metabolite to protein ratio while still yielding a high correlation to percent emphysema (rho = -0.27, p-value = 2.6 × 10 -17 ).

S.10 Percent Emphysema Modules (Unadjusted -Omic Data)
Similar to methods used to construct previous protein-metabolite networks, we tried different scaling constants for b and c when applying scaled SmCCNet to unadjusted -omic data sets and percent emphysema (see Methods and the main manuscript about diagnostics for selecting constants). First, scaling constant values of 5, 10, 15, and 20 were applied to SmCCNet. Next, further analysis of scaling constant values between 2 and 10 was performed to determine the scaling constant for which module results ceased to have substantial changes in previously mentioned diagnostics.
After scaling constant value of 7, the number of proteins and metabolites and the module correlation to percent emphysema did not change much (Table S6). Scaling constant value of 7 was selected as the final scaling constant for SmCCNet applied to percent emphysema and unadjusted proteomic and metabolomic data. Although the resulting Module 1 has a small number of proteins compared to number of metabolites (six proteins, 193 metabolites), the correlation to percent emphysema is high (rho = -0.30, p-value = 8.6 × 10 -22 ). Scaling constant values between 2 and 6 did yield modules with high correlations to percent emphysema and a more equal ratio between number of proteins and number of metabolites (e.g., scaling constant = 5, module 2). However, after further investigation, the modules' edges were weak and did not pass a high enough edge threshold.

S.11. Identified Network associated with Percent Emphysema (Unadjusted -Omic Data)
While Module 1 is a candidate network, it contains weak edges and features only connected by weak edges. To focus on the strongest edges, we evaluated different edge thresholds values between 0.2 and 0.45. Resulting networks were summarized by network correlation to percent emphysema and number of proteins and metabolites ( Figure S11). Module 1 was the only network that passed edge cuts. Edge threshold value of 0.4 was chosen as the final edge threshold because it resulted in a network with an interpretable number of 24 nodes while maintaining a high correlation to percent emphysema (rho = -0.26, p-value = 3.9 × 10 -16 ). The identified network has three proteins and 21 metabolites with a range of individual feature correlations to percent emphysema ( Figure S12, Table S7). Although there are a low number of proteins in this network, the proteins are connected by highly weighted edges and have high connectivity. Growth hormone receptor, a protein, is the largest network hub with every other metabolite and protein connected to growth hormone receptor. The two most heavily weighted edges connect growth hormone receptor to apolipoprotein E and to proto-oncogene tyrosine-protein kinase receptor, both of which are proteins. -0.20 (5.1 × 10 -10 ) 1 11 12 5 -0.11 (0.00093) 2 11 13 6 -0.14 (2.3 × 10 -5 ) 1 1 2 (1, 10, 10) 1 -0.31 (3.8 × 10 -22 ) 6 203 209 2 0.15 (5.0 × 10 -6 ) 2 5 7 3 -0.14 (7.8 × 10 -6 ) 1 1 2 4 -0.21 (3.6 × 10 -11 ) 3 10 13 5 0.13 (6.3 × 10 -5 ) 1 1 2 (1, 13, 13)     Pearson correlations between percent emphysema and individual metabolites and proteins in identified network associated with percent emphysema.

S.12. Network Comparison between Adjusted and Unadjusted -Omic Data with Percent Emphysema
When comparing the percent emphysema protein-metabolite network constructed on the adjustedomic data with the network from the unadjusted -omic data, there was a significant number of protein and metabolite nodes consistent between the two networks (Fisher's exact test p-value = 7.8 × 10 -20 ) (Table S8).       [2] †  [1] Diacylglycerol

S.13. Secondary Network Analysis
Secondary analysis was calculated on the subjects to look at trends with each network constructed on adjusted -omic data by different clinical subsets ( Figure S13, Figure S14). There was a significant difference (p-value < 0.0001) between subjects who had zero or at least one exacerbation for the FEV1% network, but there was no significant difference for the percent emphysema network ( Figure S13). There was a significant difference (p-value < 0.0001) between subjects with and without cardiovascular comorbidity for the FEV1% network but not within the network associated with percent emphysema ( Figure S14). Figure S13.Association of network first principal component (PC1) with exacerbations. Subjects were separated by those who had zero and those who has at least one exacerbation and evaluated for trends within the A) FEV1% or B) percent emphysema networks. Figure S14. Association of network PC1 with cardiovascular comorbidity. Subjects were separated by those with or without cardiovascular comorbidities and evaluated for trends within the A) FEV1% or B) percent emphysema networks.

S.14. Methods Supplement
Protein-metabolite networks correlated to FEV1% and percent emphysema were constructed using SmCCNet ( Figure S15), a technique developed by Shi et al [15] that uses multiple canonical correlation network analysis to integrate multiple -omics data types with a phenotype of interest. The objective function is (w1, w2) = max [ aw1 X1 T X2 w2 + bw1 X1 Y + cw2 X2 Y] subject to ||ws|| 2 = 1, Ps(ws) ≤ cs, s = 1, 2 where X1 and X2 are -omic data sets (metabolites and proteins respectively), Y is a phenotype of interest (FEV1% or percent emphysema), w1 and w2 are canonical weights, a, b, and c are scaling constants, and c1, c2 are penalty parameters under the l1 -norm function. The sparse penalty parameters were chosen through a 5-fold cross validation ( Figure S15, Step 1) to find the penalty pair that minimized prediction error. All penalty pairs from the set (0.05, 0.15, 0.25, 0.35, 0.45, 0.55) were tested in a grid search to find the optimal (c1, c2). The c1 penalty is imposed on the metabolic data while the c2 penalty is imposed on the proteomic data. Small penalties correspond with less of that particular -omic type contributing to the network. Conversely, a high penalty parameter should result in more of that -omic type included in the network. Canonical weights indicate the contribution of metabolites and proteins to the canonical correlations and were generated through sparse multiple canonical correlation analysis ( Figure S15, Step 2).
An important step in SmCCNet is the feature subsampling step to create robust network construction. The proteomic and metabolomic data was subsampled 500 times and a relationship matrix was created for each subsample based on the canonical weights. The proportions of subsampled protein and metabolite features were 70% for both feature sets. To find the similarity matrix, all 500 relationship matrices were averaged and rescaled to have a maximum relatedness of one. A hierarchical tree was constructed based on the averaged and rescaled relationship matrix ( Figure S15, Step 3). A height threshold of one was applied to the tree, resulting in protein-metabolite modules correlated to the phenotype of interest. A more complete description of SmCCNet can be found in Shi et al 2019 [15]. Step 1: Find the best sparse penalty parameters through a 5-Fold cross validation.
Step 2: Subsample the proteomic and metabolomic data, find the canonical weights using the penalty pair found in step 1, create a relationship matrix, repeat 500 times. Compute similarity matrix by averaging and scaling all 500 relationship matrices.
Step 3: Find protein-metabolite modules by applying a hierarchical tree cut to similarity matrix.

S.15. Metabolon and Data Processing
Samples were extracted with methanol under vigorous shaking for two minutes (Glen Mills GenoGrinder 2000) followed by centrifugation to remove protein, dissociate small molecules bound to protein or trapped in the precipitated protein matrix, and to recover chemically diverse metabolites. The resulting extract is divided into five fractions: two for analysis by two separate reverse phase (RP)/UPLC-MS/MS methods with positive ion mode electrospray ionization (ESI), one for analysis by RP/UPLC-MS/MS with negative ion mode ESI, one for analysis by HILIC/UPLC-MS/MS with negative ion mode ESI, and one sample is reserved for backup. Metabolon has developed peak detection and integration software to generate a list of m/z ratios, retention indices and area under the curve (AUC) values for each detected metabolite, as described in detail [49][50][51]. User specified criteria for peak detection included thresholds for signal to noise ratio, area and width. Relative standard deviations (RSDs) of peak area were determined for internal and recovery standards to confirm extraction efficiency, instrument performance, column integrity, chromatography and mass calibration. The biological data sets, including QC samples, were chromatographically aligned based on a retention index that utilized internal standards assigned a fixed RI value. The RI of the experimental peak was determined by assuming a linear fit between flanking RI markers whose RI values are set. Peaks were matched against an in-house library of authentic standards and routinely detected unknown compounds specific to the respective method. Identifications were based on retention index values, experimental precursor mass match to the library authentic standard within 10 ppm, and quality of MS/MS match. All proposed identifications were then manually reviewed and curated by an analyst who approved or rejected each identification based on the criteria above. The platform reported 1,392 features, including 1,064 annotated features which were grouped by Metabolon into "Super Pathways" including: 436 lipids, 261 xenobiotics, 207 amino acids, 40 peptides, 38 cofactors and enzymes, 35 nucleotides, 25 carbohydrates, 11 energy pathway compounds, and 11 partially characterized molecules. All compounds are further annotated by "Sub Pathway" (e.g. "sphingomyelins", "carnitine metabolism", "lysine metabolism").