Non-Invasive Lung Cancer Diagnostics through Metabolites in Exhaled Breath: Influence of the Disease Variability and Comorbidities

Non-invasive, simple, and fast tests for lung cancer diagnostics are an urgent need in clinical practice. This work describes the results of exhaled breath analysis of 112 lung cancer patients and 120 healthy individuals using gas chromatography-mass spectrometry (GC-MS). Volatile organic compound (VOC) peak areas and their ratios were considered for data analysis. VOC profiles of patients with various histological types, tumor localizations, TNM stages, and treatment statuses were compared. The effect of non-pulmonary comorbidities (chronic heart failure, hypertension, anemia, acute cerebrovascular accident, obesity, diabetes) on the exhaled breath composition of lung cancer patients was studied for the first time. Significant correlations between these factors and some VOC peak areas and their ratios were found. Diagnostic models were created using gradient boosted decision trees (GBDT) and an artificial neural network (ANN), and their performance was compared. The ANN model was the most accurate: 82–88% sensitivity and 80–86% specificity on the test data.


Introduction
The development of new non-invasive and comfortable methods to diagnose various diseases is an urgent task in modern medicine. Exhaled breath [1], exhaled breath condensate [2], saliva [1,3], skin [1,4,5], and urine [1,6] are intensively studied to develop new diagnostic approaches. Exhaled breath is especially interesting for diagnostic purposes since it can be obtained without any discomfort for patients [7].
A few non-invasive tests have already been implemented in clinical practice: the 13C-urea breath test for the diagnosis of Helicobacter pylori infection [8] and the nitric oxide breath test for the management of asthma and allergic airway inflammation [9]. However, many diseases with a high mortality rate are still diagnosed using complex and invasive procedures. Lung cancer remains the leading cause of cancer death [10], since the disease develops rapidly and asymptomatically at the initial stage and can be diagnosed only by procedures that are invasive (biopsy) or involve radiation exposure (low dose computed tomography, LDCT). As such, new accurate, simple to use, and non-invasive methods for lung cancer diagnostics are urgently needed.

Human Subjects
The study involved 2 groups of participants: lung cancer patients and healthy volunteers. A volunteer was defined as healthy based on a yearly physical exam report. The inclusion criterion was the absence of pathologies and inflammatory processes in the lungs, which was verified by fluorography. The diagnosis of lung cancer patients was confirmed by biopsy. Patients with other lung comorbidities in addition to lung cancer were excluded. Most patients were treated with chemotherapy (88 patients), immunotherapy (7 patients), or targeted therapy (1 patient). The remaining individuals provided samples before a treatment course. Information on the volunteers is given in Table 1. Each participant provided informed consent.

Exhaled Breath Collection
Mixed expiratory breath samples were collected in 5-L Tedlar (Supelco, Bellefonte, PA, USA) sampling bags pre-cleaned by flushing with nitrogen. Possible contamination of samples by compounds released from the sampling bag was studied earlier [36]. The intensities of phenol and N,N-dimethylacetamide increased after 2 h of sample storage in the bag; therefore, these compounds were omitted from the list of putative biomarkers. The samples of lung cancer patients and some healthy volunteers were collected in the hospital. The samples of the other healthy volunteers were collected in a room free of solvents. Ambient air was sampled on the day of exhaled breath sampling to account for the influence of exogenous compounds. The subjects fasted overnight before breath sampling. Exhaled breath of active smokers was sampled no earlier than 2.5 h after smoking. Anatomic dead space, breath hold, and flow rate have been reported to affect the results for healthy volunteers but not for lung cancer patients [37], so it was essential to provide the same sampling conditions for both groups of participants. However, enforcing a fixed flow rate during sampling can cause discomfort and pain for patients, so these parameters were not controlled; instead, the same sampling procedure was used for both cohorts. It was conducted as follows: after a 10-min rest in the sampling room, volunteers were asked to breathe in deeply, hold their breath for 10 s, and breathe out calmly into the sampling bag, repeating the procedure until the bag was full. Breath samples were stored in the sampling bags for no longer than 6 h after sampling.

GC-MS Analysis of Exhaled Breath
The samples of exhaled breath were analyzed by GC-MS. The system consisted of a gas chromatograph (Chromatec crystal 5000.2, Yoshkar-Ola, Russia) coupled with a quadrupole mass spectrometer equipped with an electron ionization source (Chromatec MSD, Yoshkar-Ola, Russia) and combined with a two-stage thermal desorber TD2 (Chromatec, Yoshkar-Ola, Russia). The Chromatec Analytic (Chromatec, Yoshkar-Ola, Russia) software and the mass spectral library NIST 2017, Version 2.3 (Gaithersburg, MD, USA) were used for data acquisition and processing. Sorbent tubes with an external diameter of 6.2 mm and a length of 115 mm, filled with 0.4 g of Tenax TA sorbent (60-80 mesh, Chromatec, Yoshkar-Ola, Russia) with a surface area of 35 m²/g, were used to preconcentrate VOCs. Exhaled breath VOCs were preconcentrated by passing a 0.5-L sample through a Tenax TA sorbent tube at a rate of 200 mL/min using a PV-2 aspirator (Chromatec, Yoshkar-Ola, Russia). A Supelco Supel-Q PLOT column (30 m × 0.32 mm × 15 µm) was used to separate the compounds. The flow rate of the carrier gas was 1.30 mL/min. The oven temperature program was as follows: initial 50 °C, ramped at 10 °C/min to 150 °C, then ramped at 6 °C/min to 220 °C, and finally ramped at 4 °C/min to 250 °C. GC-MS analysis conditions were optimized earlier (Table 2) [36]. VOCs were identified using analytical standards: gaseous compounds were introduced into the sorbent tube, followed by thermal desorption and GC-MS analysis. The remaining VOCs were identified by comparing the obtained mass spectra with library spectra. All VOCs showing mass spectra with a match factor ≥ 85% were considered.
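As a quick check of the timings implied by these conditions, the following Python sketch computes the preconcentration time and the time spent in the oven ramps. It assumes no isothermal hold times, since none are stated in this section (Table 2 may specify them); the function names are illustrative only.

```python
# Oven ramp segments taken from the text: (start °C, end °C, rate °C/min).
# Assumption: no isothermal holds, because none are given in this section.
RAMPS = [(50, 150, 10), (150, 220, 6), (220, 250, 4)]


def ramp_minutes(ramps):
    """Total time spent ramping, in minutes, ignoring any hold times."""
    return sum((end - start) / rate for start, end, rate in ramps)


def sampling_minutes(volume_l, flow_ml_per_min):
    """Time needed to pass a sample volume through the sorbent tube."""
    return volume_l * 1000 / flow_ml_per_min


total_ramp = ramp_minutes(RAMPS)               # 10 + 11.67 + 7.5 min
preconcentration = sampling_minutes(0.5, 200)  # 0.5 L at 200 mL/min

print(f"ramp time: {total_ramp:.1f} min")
print(f"preconcentration time: {preconcentration:.1f} min")
```

Under these assumptions, one sample takes 2.5 min to preconcentrate and roughly 29 min of oven ramping, which is relevant when planning analysis of bags within the 6 h storage limit.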

Statistical Analysis
The chromatograms of exhaled breath samples were recorded in full scan mode for quantification purposes. Extracted ion chromatogram (EIC) mode was applied to calculate the peak areas. The influence of room air was eliminated by subtracting the room air peak areas from those of the exhaled breath. Negative results were set to zero. To ensure reliable results, statistical analysis was conducted only for VOCs with peak areas at least 20% greater than in ambient air and occurring in more than 50% of samples. The ratios of compound peak areas to those of the main VOCs (those occurring in more than 86% of the samples), as well as ratios among the main VOCs, were considered for statistical analysis.
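The background correction and VOC filtering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' processing script: the text does not say whether the 20% criterion is applied per sample or on averaged values, so the sketch applies it per sample, and all peak areas are made-up numbers.

```python
def corrected_area(breath_area, ambient_area):
    """Subtract the ambient-air peak area; negative results are set to zero."""
    return max(breath_area - ambient_area, 0.0)


def is_detected(breath_area, ambient_area, margin=0.20):
    """Assumed per-sample criterion: breath area at least 20% above ambient."""
    return breath_area > 0 and breath_area >= (1.0 + margin) * ambient_area


def select_vocs(samples, min_fraction=0.50):
    """Keep VOCs detected in more than `min_fraction` of the samples.

    `samples` is a list of dicts mapping VOC name -> (breath_area, ambient_area).
    """
    vocs = {voc for sample in samples for voc in sample}
    kept = []
    for voc in sorted(vocs):
        hits = sum(1 for s in samples if voc in s and is_detected(*s[voc]))
        if hits / len(samples) > min_fraction:
            kept.append(voc)
    return kept


# Hypothetical peak areas (arbitrary units), for illustration only.
samples = [
    {"acetone": (900.0, 100.0), "toluene": (50.0, 60.0)},
    {"acetone": (800.0, 120.0), "toluene": (90.0, 40.0)},
    {"acetone": (950.0, 90.0)},
]

kept = select_vocs(samples)
corrected = [{voc: corrected_area(*pair) for voc, pair in s.items()} for s in samples]
print(kept)
print(corrected[0])
```

In this toy data, toluene in the first sample falls below the ambient level, so its corrected area becomes zero, and it is dropped from the analysis because it passes the criterion in only one of three samples.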
The statistical analysis was conducted using StatSoft STATISTICA (version 10). A preliminary power analysis was conducted to determine the minimum sample size required. For the correlation analysis, the sample size required to achieve 85% power for detecting a correlation of 0.2 at a significance criterion of α = 0.05 was N = 221. The study included 232 samples, which exceeds this minimum. The normality of the distributions was assessed using the Kolmogorov-Smirnov test. Because the distributions were non-normal, the nonparametric Spearman's rank correlation test (p = 0.05) was applied to identify statistically significant correlations between the peak areas of VOCs, their ratios, and disease status.
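The reported minimum sample size can be approximated with the standard Fisher z-transformation formula, n ≈ ((z₁₋α/₂ + z_power) / atanh(r))² + 3. The Python sketch below uses this approximation for a two-sided test; it is not necessarily the routine that STATISTICA implements, so small differences from the reported value are to be expected.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.85):
    """Approximate sample size to detect correlation r (two-sided test).

    Uses the Fisher z-transformation approximation; this is an assumed
    stand-in for the power analysis routine used in the study.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ((z_alpha + z_power) / math.atanh(r)) ** 2 + 3


n_required = n_for_correlation(0.2)
print(f"approximate required sample size: {n_required:.1f}")
```

The approximation gives a value between 221 and 222, in line with the reported N = 221 and below the 232 samples available.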
Spearman's rank correlation test (p = 0.05) was used to find statistically significant correlations between the parameters and tumor localization (central or peripheral). Histological tumor types (adenocarcinoma, squamous cell carcinoma, and small cell carcinoma) were ranked by malignant course, from least to most malignant, in the following order: squamous cell carcinoma, adenocarcinoma, and small cell carcinoma. Correlation analysis was then applied to estimate the correlation between the parameters and malignant course as well as TNM stage.
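The ordinal coding of histological type and the rank correlation can be illustrated with a small pure-Python sketch. The ranking order comes from the text; the numeric codes 1 to 3, the helper names, and the ratio values are assumptions made for illustration, and the significance test is omitted.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Ordinal codes follow the order given in the text (least to most malignant).
MALIGNANCY_RANK = {
    "squamous cell carcinoma": 1,
    "adenocarcinoma": 2,
    "small cell carcinoma": 3,
}

# Hypothetical patients and VOC ratio values, for illustration only.
histology = [
    "squamous cell carcinoma", "adenocarcinoma", "adenocarcinoma",
    "small cell carcinoma", "squamous cell carcinoma", "small cell carcinoma",
]
ratio = [0.10, 0.25, 0.20, 0.60, 0.15, 0.45]

codes = [MALIGNANCY_RANK[h] for h in histology]
rho = spearman(codes, ratio)
print(f"Spearman rho = {rho:.3f}")
```

Because many patients share the same histological code, ties are frequent, which is why the sketch averages ranks rather than using the tie-free shortcut formula.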
The influence of chemotherapy on the VOC profile was evaluated by comparing, by means of correlation analysis, the VOC profiles of patients before the start of any treatment with those of patients under treatment. Correlation analysis was also used to estimate the effect of comorbidities on the exhaled breath VOC profile.
To create a diagnostic model, the dataset was randomly divided into a training set (70%) and a test set (30%). Gradient boosted decision trees (GBDT) and an artificial neural network (ANN) were applied to create the diagnostic models. Sensitivity and specificity on both training and test data were calculated for each model. The ANN was a multilayer perceptron with one hidden layer, trained using the Broyden-Fletcher-Goldfarb-Shanno algorithm. The hidden layer included 5 neurons, and the output layer contained 2 neurons, which indicated whether the input data belonged to the healthy or the lung cancer group.
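The 70/30 split and the two performance metrics can be sketched in Python as below. This is a minimal illustration: the seed, the rounding of the split size, and the toy labels are assumptions, and the text does not state whether the split was stratified by group.

```python
import random


def split_indices(n, train_fraction=0.70, seed=0):
    """Randomly split sample indices into training and test subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]


def sensitivity_specificity(y_true, y_pred):
    """Labels: 1 = lung cancer, 0 = healthy.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# 232 samples in total, as in the study.
train_idx, test_idx = split_indices(232)

# Toy labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(len(train_idx), len(test_idx), sens, spec)
```

With 232 samples, a 70/30 split leaves about 70 samples for testing, so each misclassified test sample shifts sensitivity or specificity by a few percentage points.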

Results
Exhaled breath samples of 112 lung cancer patients and 120 healthy volunteers were analyzed by GC-MS. Typical GC-MS chromatograms of exhaled breath samples from a lung cancer patient and a healthy volunteer are shown in Figure 1. A total of 205 VOCs were identified in the study. Table 3 lists the most frequently occurring compounds. The VOCs occurring in more than 50% of samples were used for statistical analysis. Statistical analysis was conducted using VOC peak areas and their ratios. For ratios, to avoid division by zero, it is reasonable to use as a denominator only VOCs with a frequency of occurrence of 100%, which was observed for acetone, isoprene, and dimethyl sulfide (Table 3). To consider a wider list of ratios, the VOCs occurring most frequently in the samples of both groups were also used as denominators; these were the first 10 VOCs in Table 3, among which acetonitrile had the lowest frequency. The frequency of occurrence of the remaining compounds was lower and differed between the studied groups, so these VOCs were used only as numerators.
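The choice of denominators can be illustrated with a short Python sketch that builds ratios only over VOCs present in every sample. The sample values and function names are hypothetical; the sketch shows the strict 100% rule, whereas the study also admitted the ten most frequent VOCs as denominators.

```python
def occurrence_frequency(samples, voc):
    """Fraction of samples in which the VOC has a non-zero peak area."""
    return sum(1 for s in samples if s.get(voc, 0.0) > 0) / len(samples)


def build_ratios(sample, numerators, denominators):
    """Peak-area ratios numerator/denominator for a single sample.

    Denominators with a zero or missing area are skipped to avoid
    division by zero.
    """
    ratios = {}
    for num in numerators:
        for den in denominators:
            if num != den and sample.get(den, 0.0) > 0:
                ratios[f"{num}/{den}"] = sample.get(num, 0.0) / sample[den]
    return ratios


# Hypothetical corrected peak areas (arbitrary units), for illustration only.
samples = [
    {"acetone": 800.0, "isoprene": 400.0, "toluene": 40.0},
    {"acetone": 600.0, "isoprene": 300.0, "toluene": 0.0},
]
all_vocs = ["acetone", "isoprene", "toluene"]

# Strict rule: only VOCs found in 100% of samples may be denominators.
denominators = [v for v in all_vocs if occurrence_frequency(samples, v) == 1.0]
ratios_first = build_ratios(samples[0], all_vocs, denominators)
ratios_second = build_ratios(samples[1], all_vocs, denominators)
print(denominators)
print(ratios_first)
```

Here toluene is missing from one sample, so it appears only as a numerator, and its ratio is simply zero for that sample rather than undefined.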
Statistically significant correlations were found between TNM stage and some VOC peak areas and their ratios (Table 4). For tumor histological type ranked by malignant course, statistically significant correlations were found only for some VOC peak area ratios (Table 5). Correlation analysis was also applied to find differences between the parameters of lung cancer patients and healthy individuals. Significant correlations between disease status and the peak areas of several VOCs were found, i.e., acetone (−0.163), 1-methylthiopropene (0.140), 2-pentanone (0.244), hexane (−0.287), toluene (0.249), pentanal (−0.254), and dimethyl trisulfide (0.260). Additionally, many ratios were significantly different between lung cancer patients and healthy volunteers. The ratios with the highest correlation coefficients were selected for the development of diagnostic models (Table 6).
The group of healthy volunteers was significantly younger than the lung cancer patients. To avoid a confounding influence of age, the correlation between age and all ratios selected for the creation of diagnostic models was estimated separately in the groups of lung cancer patients and healthy volunteers (Table 6). None of the selected ratios had a statistically significant correlation with age.
Diagnostic models were created using GBDT and ANN. The input values of each model were the same set of 12 ratios (Table 6). To ensure the reliability of the diagnostic models, 3-fold cross validation was applied. The performance of the models created using the 3 datasets is shown in Table 7. The highest sensitivity on the training dataset was observed for GBDT. However, the ANN diagnostic model had the best accuracy on the test dataset.
Table 7. Accuracy of diagnostic models.

| Machine learning algorithm | Dataset | Training sensitivity, % | Training specificity, % | Test sensitivity, % | Test specificity, % |
|---|---|---|---|---|---|
| GBDT | 1 | 92 | 82 | 88 | 77 |
| GBDT | 2 | 94 | 85 | 78 | 68 |
| GBDT | 3 | 96 | 92 | 77 | 81 |
| ANN | 1 | 89 | 85 | 88 | 86 |
| ANN | 2 | 88 | 85 | 85 | 80 |
| ANN | 3 | 87 | 75 | 82 | 81 |

GBDT makes it possible to estimate the importance of each variable in the model relative to the most important one. Bar plots illustrate the importance of the variables for each dataset (Figure 2). As can be seen, the hexane/2-pentanone ratio contributed the least to the prediction in all datasets, whereas the dimethyl trisulfide/dimethyl disulfide and isoprene/acetone ratios were the most significant for distinguishing the two groups.
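The per-model ranges quoted for Table 7 can be summarized with a few lines of Python. The numbers are copied from Table 7; the dictionary layout and the choice of mean, minimum, and maximum as summary statistics are illustrative choices, not part of the original analysis.

```python
# Values from Table 7, ordered by dataset 1, 2, 3 (all in %).
TABLE_7 = {
    "GBDT": {
        "train_sens": [92, 94, 96], "train_spec": [82, 85, 92],
        "test_sens": [88, 78, 77], "test_spec": [77, 68, 81],
    },
    "ANN": {
        "train_sens": [89, 88, 87], "train_spec": [85, 85, 75],
        "test_sens": [88, 85, 82], "test_spec": [86, 80, 81],
    },
}


def summarize(values):
    """Return (mean, min, max) over the three datasets."""
    return sum(values) / len(values), min(values), max(values)


summary = {
    model: {metric: summarize(vals) for metric, vals in metrics.items()}
    for model, metrics in TABLE_7.items()
}

for model, metrics in summary.items():
    for metric in ("test_sens", "test_spec"):
        mean_value, low, high = metrics[metric]
        print(f"{model} {metric}: mean {mean_value:.1f}%, range {low}-{high}%")
```

On the test data, the ANN ranges are 82–88% for sensitivity and 80–86% for specificity, with means of 85.0% and about 82.3%, compared with means of 81.0% and about 75.3% for GBDT.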
Metabolites 2023, 13

Discussion
Different research groups have shown the ability of exhaled breath analysis to diagnose lung cancer [17][18][19]22,31]. However, the results obtained by these groups were inconsistent: they used different analysis conditions, groups of volunteers, putative biomarker sets, and learning algorithms for the creation of diagnostic models, and obtained different model performances.

The inconsistency of the results can partially be explained by the variability of lung cancer groups across studies. One of the aims of this study was to evaluate possible variations in the VOC profile within the lung cancer group depending on different factors. Since some of the lung cancer patients involved in the study were under treatment, it was essential to evaluate the influence of treatment on exhaled breath composition. For this purpose, correlations between the treatment status (before or during a chemotherapy course) and the VOC profile were calculated. Several ratios of 2-heptanone significantly correlated with the treatment status. Most other research groups have studied the effect of treatment only in the case of surgery [16,38]. The use of exhaled breath analysis for monitoring the response to treatment in lung cancer was demonstrated in [39], where the effect of different kinds of treatment was studied and alterations in the concentrations of dodecane, styrene, 4-methyldodecane, and α-phellandrene were observed after treatment. We did not consider these VOCs because of their low frequency of occurrence in our samples. In our study, the majority of lung cancer patients were under treatment; therefore, it was essential to exclude the VOCs and ratios affected by the treatment status. Only a small number of ratios were affected by the treatment status. However, the issue should be examined in more detail in further studies.
Another issue that can influence the results is comorbidities. Most researchers prefer to exclude patients with lung comorbidities when recruiting volunteers for a study. We also excluded patients with other lung comorbidities. However, the metabolic pathways of other pathologies can also lead to alterations in the exhaled breath profile. In this study, the effect of non-pulmonary comorbidities on the exhaled breath profile of lung cancer patients was studied for the first time. Hypertension and diabetes affected the VOC profile the most. The parameters correlating with other pathologies should be excluded to avoid their effect on the discrimination of lung cancer patients and healthy volunteers. The current study has several limitations regarding comorbidities. First, we did not study the effect of other lung comorbidities, which can influence other parameters. Second, not all possible comorbidities were considered. Third, the proportion of patients with comorbidities was low compared with that of patients without them. Therefore, the list of parameters correlating with other comorbidities may grow as the cohort of participants increases. Nevertheless, the analysis of the effect of comorbidities allows us to exclude parameters correlating with other diseases and eliminate their influence.
Exhaled breath composition may vary depending on tumor localization. A lung tumor localized in the central part is closer to the airways than a peripheral one, which can affect the VOC profile of patients differently. In this study, 1-pentanol and some of its ratios, as well as dimethyl disulfide/acetonitrile and 2-butanone/isoprene, significantly correlated with tumor localization. Differences in the VOC profiles of lung cancer patients in relation to tumor localization have not been investigated by other researchers. This issue should be studied further to confirm the results.
Peak areas of 2-butanone and 1-methylthiopropene and some ratios (Table 4) significantly correlated with TNM stage. The concentration of 2-butanone was previously found to be higher in advanced stages [29], which is consistent with our findings, since a significant positive correlation with TNM stage was observed. However, in other studies, no differences were found in the VOC profiles of patients with different stages of lung cancer [15,29,40]. In this study, VOC profiles were considered based on the detailed TNM diagnosis instead of the lung cancer stage (I, II, III, IV), which makes it possible to reveal general trends that are unclear when the groups are considered separately. Nevertheless, the agreement with other work supports an effect of disease stage on 2-butanone levels.
The results of different research groups concerning the effect of histological type are inconsistent. In one study [15], no differences were found in the exhaled breath composition of patients with different histological types. In another study [29], statistically significant differences in 1-butanol and 3-hydroxy-2-butanone concentrations were found between samples of patients with adenocarcinoma and squamous cell carcinoma. We studied the VOC profiles of patients with different histological types ranked by tumor malignancy (squamous cell carcinoma, adenocarcinoma, and small cell carcinoma). Statistical analysis showed that no individual VOC peak area significantly correlated with the tumor histological type. However, several ratios, predominantly those of dimethyl trisulfide and 3-heptanone (Table 5), significantly correlated with the histological type. Considering the inconsistency of the results obtained by different research groups, it is worthwhile to continue the study with larger cohorts of participants.
Previously, we optimized the analysis conditions, proposed a new approach to data analysis based on VOC ratios instead of VOC peak area values, and demonstrated the efficiency of the proposed approach using different analytical methods (GC-FID and GC-MS) and different cohorts of participants [36,41]. In this work, we analyzed the exhaled breath of a significantly larger cohort of volunteers by GC-MS. A wider range of VOC ratios was also considered for the creation of diagnostic models. We used GBDT to estimate the contribution of the ratios to the predictive power of the model. The dimethyl trisulfide/dimethyl disulfide ratio contributed the most to the classification of the groups. The same result was obtained earlier on a smaller cohort [41], which supports the reliability of this ratio as a lung cancer biomarker. The ANN outperformed GBDT on the test dataset, with 82–88% sensitivity and 80–86% specificity. The performance of the previous model [36] was higher (more than 90% for both sensitivity and specificity), which may have several explanations: first, a larger cohort was involved in the present research, and 30% of the samples instead of 15% were assigned to the test dataset, which increases the reliability of the present models; second, in this research, the participants fasted overnight before sampling, which can additionally decrease interfering effects; third, the number of smoking participants was equal in both groups, so the influence of smoking was eliminated.
Most lung cancer patients are active smokers, which does not allow us to consider only patients who do not smoke. However, cigarette smoking significantly influences exhaled breath composition [42]. Thus, to account for the smoking factor, we had to involve a comparable number of smokers in both the lung cancer patient and healthy volunteer groups. The group of healthy volunteers was significantly younger than the lung cancer patients. However, none of the parameters selected for the creation of diagnostic models had a statistically significant correlation with age. Further, it should be noted that young participants (from 21 years old) were also present in the group of lung cancer patients. Unfortunately, the increasing number of young people among cancer patients should be taken into account.
Notably, significant correlations with the disease status were observed for the ratios of toluene/acetonitrile, hexane/acetonitrile, and pentanal/isoprene. The same results were obtained in the previous research (correlation coefficients of 0.248, −0.307, and −0.296, respectively). This supports the robustness of these parameters as potential lung cancer biomarkers. Further study is required to confirm the reliability of the other biomarkers found in this study.

Conclusions
The study has revealed that not only pulmonary but also non-pulmonary comorbidities can influence the exhaled breath VOC profile. Among them, chronic heart failure and hypertension had the greatest effect. Variations in exhaled breath VOC profiles among lung cancer patients with different histological types, TNM stages, tumor localizations, and treatment statuses have been observed, and these can influence the performance of diagnostic models. These factors should be considered before creating a lung cancer diagnostic model, which would help to build a useful test for the timely diagnosis of this dangerous disease.