Exploring Volatile Organic Compounds in Breath for High-Accuracy Prediction of Lung Cancer

Simple Summary Human-exhaled volatile organic compounds (VOCs) can be altered by lung cancer and become identifiable biomarkers. We used selected ion flow tube mass spectrometry (SIFT-MS) to quantitatively analyze 116 kinds of VOCs, which were exhaled by 148 lung cancer patients and 168 healthy individuals and collected from the environment to obtain a group of comprehensive data. A predictive model yielding 0.92 accuracy, 0.96 sensitivity, 0.88 specificity, and 0.98 area under the curve (AUC) was established using an advanced machine learning eXtreme Gradient Boosting (XGBoost) algorithm that considered the influences of exhaled and environmental VOCs. Abstract (1) Background: Lung cancer is silent in its early stages and fatal in its advanced stages. The current examinations for lung cancer are usually based on imaging. Conventional chest X-rays lack accuracy, and chest computed tomography (CT) is associated with radiation exposure and cost, limiting screening effectiveness. Breathomics, a noninvasive strategy, has recently been studied extensively. Volatile organic compounds (VOCs) derived from human breath can reflect metabolic changes caused by diseases and possibly serve as biomarkers of lung cancer. (2) Methods: The selected ion flow tube mass spectrometry (SIFT-MS) technique was used to quantitatively analyze 116 VOCs in breath samples from 148 patients with histologically confirmed lung cancers and 168 healthy volunteers. We used eXtreme Gradient Boosting (XGBoost), a machine learning method, to build a model for predicting lung cancer occurrence based on quantitative VOC measurements. (3) Results: The proposed prediction model achieved better performance than other previous approaches, with an accuracy, sensitivity, specificity, and area under the curve (AUC) of 0.89, 0.82, 0.94, and 0.95, respectively. When we further adjusted the confounding effect of environmental VOCs on the relationship between participants’ exhaled VOCs and lung cancer occurrence, our model was improved to reach 0.92 accuracy, 0.96 sensitivity, 0.88 specificity, and 0.98 AUC. (4) Conclusion: A quantitative VOCs databank integrated with the application of an XGBoost classifier provides a persuasive platform for lung cancer prediction.


Introduction
Scientists have been interested in volatile organic compounds (VOCs) released by human bodies for over five decades. In 1971, Nobel Prize winner Linus Pauling revealed that breath is a complex mixture comprised of approximately 250 VOCs [1]. In 1999, Phillips et al. [2] detected over 3400 different volatile compounds in exhaled human breath. Metabolic processes of the human body produce these compounds; they enter the lungs via blood and are exhaled. Therefore, variations in exhaled breath compounds' concentrations can be directly linked to a disease, such as cancer [3]. Of all cancer types, there are 1.6 million lung cancer deaths per year, which is more than the sum of the next three most common cancers, i.e., prostate, breast, and colon cancer [4]. Lung cancer is usually quiet in the early stages; patients frequently experience coughing, chest pain, weight loss, etc. These symptoms early tend to be ignored until advanced disease development to be taken seriously. Generally, the 5-year survival related to late diagnosis is around 10-15%. Using conventional diagnostic procedures such as computer tomography (CT), sputum cytology, and biopsy, 85% of lung cancer cases are detected at a phase at which therapy is ineffective for curing the disease [5]. Low-dose thoracic CT can detect tumors at early stages and reduce mortality from lung cancer [6]. With the incidences of lung cancer rising worldwide, early detection techniques are an essential and immediate need. Since the cost and concern regarding radiation exposure mainly impair the applications of currently available examinations, an effective, radiation-free, and less invasive approach for lung cancer screening needs to be established. Some of the current and developing methods used for lung cancer noninvasive detection methods are summarized in Table 1.
Studying VOCs is one of the most interesting strategies and has many advantages (Table 1). Researchers, commonly using gas chromatography-mass spectrometry (GC-MS), have demonstrated the presence of lung-cancer-specific profiles of VOCs [7]. Though GC-MS is an established technique for VOC analysis, compared with selected ion flow tube mass spectrometry (SIFT-MS), GC-MS requires precise calibrations of standard compounds if highly specific and reliable quantification is needed [8]. Though GC-MS analysis integrated with library search is a powerful strategy for compound identification, the quantitative assessment of VOC using GC-MS is far more complicated. On the contrary, SIFT-MS design benefits the quantitative analysis and real-time study of VOCs [9,10]. The research using SIFT-MS has been primarily centered on the esophagus and colorectal cancers [11,12], whereas the breath profile of lung cancer analysis by SIFT-MS has rarely been reported. The features of direct sampling and quantitative VOC estimation of SIFT-MS can provide a large quantity of VOC data useful for additional statistical modeling. For this kind of large data, multivariate and machine learning tools for chemometric applications seem to be the rational way to define, project, model, and interpret the results [7].

Characteristics of Patients with Lung Cancer and Healthy Volunteers
In this study, we enrolled 168 health volunteers (101 women) aged 20 to 74 years (controls) and 148 lung cancer patients (73 men) aged 37 to 90 years ( Table 2). Lung cancer patients were older (p < 0.001) than the controls. Most histological types of lung cancer were adenocarcinoma (72.9%). The most elevated target driver mutation is exon 19 deletion (22.3%) and exon 21 point mutation (20.3%). Most patients had a nonresectable disease at clinical stage IIIB and C (18.2%), IVA (43.9%), or IVB (27%).  The VOC data are used for model construction, machine learning, and prediction of lung cancer.

Characteristics of Patients with Lung Cancer and Healthy Volunteers
In this study, we enrolled 168 health volunteers (101 women) aged 20 to 74 years (controls) and 148 lung cancer patients (73 men) aged 37 to 90 years ( Table 2). Lung cancer patients were older (p < 0.001) than the controls. Most histological types of lung cancer were adenocarcinoma (72.9%). The most elevated target driver mutation is exon 19 deletion (22.3%) and exon 21 point mutation (20.3%). Most patients had a nonresectable disease at clinical stage IIIB and C (18.2%), IVA (43.9%), or IVB (27%).

VOCs for SIFT-MS Analysis
This study investigated 116 specific VOCs previously reported as human breath biomarkers, shown in Table A1. Fifty VOCs showed significant differences between lung cancer patients and healthy controls in all three statistical hypothesis tests adopted (* in Table A1), which can be used as biomarkers for lung cancer. When analyzing the collected background air samples, we identified 57 environmental VOCs whose concentrations were not significantly different between the National Yang Ming Chiao Tung University (NCTU) and the National Taiwan University Hospital Hsin-Chu Branch (NTUH) in all statistical hypothesis tests ( † in Table A1). The heat map for VOCs of different participants is shown in Figure 2. Fifty-seven percent (n = 84) of cancer patients were clustered tightly together (top red in the color bar on the left). Another 24% (n = 36) of cancer patients and 38% (n = 43) of healthy volunteers of NCTU were grouped closely (middle red and green in the color bar on the left). Healthy volunteers of NTUH were more spread out (blue in the color bar on the left). The hierarchical clustering identified several VOC groups (the dendrogram on the top), where two groups contained dominant features for distinguishing between cancer cases and healthy controls. Ethanol, formic acid, ethanedial, methanol, acetone, butane, and hexane (the far-left brown in the first color bar at the top) had higher values in cancer cases than in healthy controls. Another group of VOCs, including benzoic acid and beta-caryophyllene, (the far-right brown in the first color bar on the top), showed an extremely low concentration for most healthy controls. All these dominant VOCs, except hexane, were significantly different between lung cancer cases and healthy controls. Hexane and beta-caryophyllene were not significantly different between the NCTU and NTUH.

XGBoost Prediction Model
For prediction modeling, we first applied XGBoost to all VOC measurements for lung cancer disease state prediction, and its accuracy, sensitivity, specificity, and area under the curve (AUC) were 0.89, 0.82, 0.94, and 0.95, respectively. Using XGBoost with 50 significantly different VOCs between lung cancer cases and healthy controls, we obtained accuracy, sensitivity, specificity, and AUC of 0.90, 0.84, 0.94, and 0.94, respectively, demonstrating the efficacy of our list of potential lung cancer biomarkers. Notably, our sensitivity and specificity were different, indicating the model's differential prediction ability for lung cancer cases and healthy controls. This might be due to the confounding effect from environmental VOCs, where all cases' breath samples were taken in the hospital, whereas those from controls were taken either in the hospital or in the academic campus. In the color matrix at the center, each row represents a participant, and each column represents a single VOC with the diverging color scheme of red (high concentration) and blue (low concentration). VOCs and participants are clustered using the agglomerative hierarchical clustering method. In the color bar on the left, red indicates cancer patients, green indicates healthy volunteers of the National Chiao Tung University (NCTU), and blue indicates healthy volunteers of the National Taiwan University Hospital Hsin-Chu Branch (NTUH). The two-color bars at the top represent the significance of VOCs. The first bar indicates whether VOCs showed significant differences between lung cancer patients and healthy controls (brown for significant and black for nonsignificant). The second bar indicates whether VOCs were significantly different between NCTU and NTUH (green for significantly different and blue for not significantly different).

XGBoost Prediction Model
For prediction modeling, we first applied XGBoost to all VOC measurements for lung cancer disease state prediction, and its accuracy, sensitivity, specificity, and area under the curve (AUC) were 0.89, 0.82, 0.94, and 0.95, respectively. Using XGBoost with 50 significantly different VOCs between lung cancer cases and healthy controls, we obtained accuracy, sensitivity, specificity, and AUC of 0.90, 0.84, 0.94, and 0.94, respectively, demonstrating the efficacy of our list of potential lung cancer biomarkers. Notably, our sensitivity and specificity were different, indicating the model's differential prediction ability for lung cancer cases and healthy controls. This might be due to the confounding effect from environmental VOCs, where all cases' breath samples were taken in the hospital, whereas those from controls were taken either in the hospital or in the academic campus. In the color matrix at the center, each row represents a participant, and each column represents a single VOC with the diverging color scheme of red (high concentration) and blue (low concentration). VOCs and participants are clustered using the agglomerative hierarchical clustering method. In the color bar on the left, red indicates cancer patients, green indicates healthy volunteers of the National Yang Ming Chiao Tung University (NCTU), and blue indicates healthy volunteers of the National Taiwan University Hospital Hsin-Chu Branch (NTUH). The two-color bars at the top represent the significance of VOCs. The first bar indicates whether VOCs showed significant differences between lung cancer patients and healthy controls (brown for significant and black for nonsignificant). The second bar indicates whether VOCs were significantly different between NCTU and NTUH (green for significantly different and blue for not significantly different).

Adjust Algorithm for Environmental VOCs
When XGBoost was built on environmentally nondifferential VOCs to eliminate the potential confounding effect from environmental VOCs, the accuracy, sensitivity, specificity, and AUC were 0.88, 0.84, 0.90, and 0.92, respectively, representing a slight improvement in sensitivity but a deterioration in specificity. We further applied SMOTE [33] to the VOC values of the collected background air samples to create synthetic environmental VOCs for each participant. The XGBoost prediction model incorporating participants' exhaled VOCs and these simulated environmental VOCs can account for nonendogenous VOCs present in the environment. This model achieved better performance with 0.92 accuracy, 0.96 sensitivity, 0.88 specificity, and 0.98 AUC. These results are consistent with our speculation concerning the confounding effect of environmental VOCs. The approaches adopted here which consider environmental VOC effects can also improve prediction accuracy.

Discussion
Each whole breath can be divided into three parts according to the pressure of CO 2 in the exhalation. The first and second parts are dead space from the oropharynx and upper respiratory tract. The third part is the air from the alveoli deep inside the lungs that can exchange gases with the blood [34]. Previous breath analyses of lung cancer were performed by collecting the whole breath [35,36]. Some studies collected end-tidal breath by discarding the front of the breath [37,38] or filling the dead space air into other bags [39]. Studies have shown that the concentration of VOC is different in whole breath and end-tidal breath [40]. Our study used three-way connectors to manually fill Tedlar bags with air from dead areas of the mouth and upper respiratory tract and subsequently collect alveolar air from deep in the lungs into aluminum bags. Since VOCs in the alveoli are derived from the blood and gas exchange within the alveoli, this approach can better reflect VOCs' relationship with metabolic state changes caused by disease physiology.
The influence of environmental VOCs at the time of sampling can be considerable in the breathomics of lung cancer. More than 1000 exogenous VOCs are known to be detected in human respiration [41]. The relationship between environmental VOCs and the human body is complex and involves the processes of mixing, diffusion, and distribution in the blood and the metabolism in adipose tissue [42]. The concentration, exposure time, and solubility of the environmental VOCs in the human body and the individual physiology are the important factors that significantly affect the VOC contents of exhaled breath [43]. In past studies, there were no generally applicable rules for considering the influence of environmental VOCs. In addition to using the alveolar gradient concept [44], researchers solved this problem using inspiration filters [45] or having patients spend some time in the ventilation room before collection [38]. Although these methods are effective and widely accepted, we take one step further to eliminate the possible variances caused by environmental factors. Herein, we successfully introduced new algorithms to simulate environmental VOCs at the sampling time and incorporated them into the model to improve prediction accuracy. We showed that our approach could further abrogate the perturbation from environmental VOCs. For further applicability in various environments in the future, we suggest collecting and analyzing VOCs from participants and the environment simultaneously. The XGBoost model can further proceed with the process of learning and tuning, thus minimizing confounding effects. We significantly improved lung cancer prediction accuracy by selecting the phase of breath and calibrating environmental VOCs' impact.
This study describes an innovative machine-learning-based approach that uses SIFT-MS quantitative data to accurately distinguish respiratory samples of lung cancer patients from healthy controls. SIFT-MS quantitative analysis is achieved by applying precisely controlled ultra-soft chemical ionization combined with mass spectrometry detection [8]. The advantage of direct SIFT-MS is that it simplifies the needs for sample preparation, preconcentration, and chromatography. Our study shows that this machine-learningbased breath test is 0.96 sensitive, 0.88 specific, and highly accurate (0.98 area under the curve (AUC)) for identifying lung cancer, whereas the previous studies using multivariate classifiers based on VOCs' chemometrics analyzed by GC-MS demonstrated moderate to high accuracy (AUCs of 0.63-0.9, Table 3). The models reported herein show several advantages. The quantitative characteristics of SIFT-MS are attractive because many quantitative data can enhance model development and fine-tune the XGBoost model to improve prediction accuracy. The adopted XGBoost has been well-recognized and successfully applied in big data analytics. More importantly, our strategy of incorporating environmental VOC factors has unequivocally enhanced the power of XGBoost modeling for prediction.
Limitations of this study include the fact that it was a single-center and case-control study, and most of the patients were elderly and with advanced lung cancer. Our lung cancer patients were significantly older than healthy controls, leading to age mismatches and bias in case-control studies. With aging, a higher degree of oxidative stress occurs, and levels of VOCs in the breath increase, such as isoprene, alkanes, and methylated alkanes [46,47]. Our prediction model built on older and late-stage patients may fail in early detection of the disease. A multicenter study is currently planned to collect young, early, and operable lung cancer patients, aiming to provide an effective approach for early detection of lung cancer and a definitive answer to the test's accuracy.
The clinical application of breathomics in lung cancer remains challenging up to the present. There are considerable differences in respiration sampling procedures, study designs, and data analysis methods implemented by studies for breathomics of lung cancer, which lead to inconsistent results. The effect of nutritional habits on breath VOCs can be complicated. By modifying metabolism, inflammation, or redox status, or communicating with gut flora, food influences breath VOCs. However, how long it takes for the dietary VOCs to be removed from the breath is not known. The dietary style also has a sustained influence that fasting could not remove [31]. There is no consensus on how to eliminate these dietary effects. We thus did not strictly screen participants' nutritional status because we wanted to collect data of different dietary habits to establish a big data model that can be universally applied to the general population in the future. Another issue is that there is no validated list of VOC lung cancer biomarkers in the literature [31,48]. The lung cancer biomarkers found in these studies are mostly inconsonant [49]. The mechanism of most VOCs exhaled by the human body remains unclear. The following factors affect the concentration and composition of lung cancer VOCs in the human body: oxidative stress, cytochrome P450, liver enzymes, metabolic carbohydrates (glycolysis/gluconeogenesis pathways), and lipid metabolism [32]. These possible biochemical pathways vary from person to person, resulting in increased or decreased volatile organic compounds concentrations. Under these complex mechanisms, biochemical pathways of VOCs research cannot provide a definite answer. These VOCs that are considered possible and focused can be classified into the following families: hydrocarbons, primary and secondary alcohols, aldehydes and branched aldehydes, ketones, esters, nitriles, and aromatic compounds [32]. Our list of the targeted 116 VOCs was mainly derived from a literature search [32,49]. In this paper, we detail our approaches for VOC breathomics and provide suggested guidelines for further studies. Based on VOC analysis, this is by far the most comprehensive study (Table 3). Our research outcome provides a potentially useful model and platform for nonradiative and noninvasive lung cancer diagnosis. Abbreviations: AUC, area under the curve; GC-MS, gas chromatography-mass spectrometry; NR, not reported; SIFT-MS, selected ion flow tube mass spectrometry; * Accuracy: 89%, † Classify adenocarcinoma and squamous cell carcinoma patients.

Study Participants and Data Collection
Between May 2019 and June 2020, we obtained breath samples from 148 patients with histologically confirmed lung cancers and 168 healthy volunteer staff. All breath samples of lung cancer patients were collected in the National Taiwan University Hospital Hsin-Chu Branch (NTUH), and 112 and 56 healthy volunteers had their breath samples collected in the National Yang Ming Chiao Tung University (NCTU) and NTUH, respectively. Although we collected multiple breath samples from each healthy volunteer of the NCTU, only one randomly selected sample of each volunteer was included in the flowing analyses. The healthy volunteers had no history of significant pulmonary disease in any other organ and were free of disease. The patients' demographic profiles, history of cigarette smoking, staging, pathological finding, and cancer mutation testing were retrospectively collected from their medical records. The clinical cancer stage was based upon the American Joint Committee on Cancer (AJCC) TNM staging system 8th edition [54]. To evaluate the confounding effect from environmental VOCs, we also collected 18 and 29 background air samples from the NCTU and NTUH, respectively. These environmental VOCs were collected in the same place on the same day as participants' VOCs were obtained. All collected VOCs of the environment and participants were then analyzed. Each patient and volunteer provided written informed consent. This research was approved by the National Taiwan University Hospital Hsin-Chu Branch institutional review board (108-023-E).

Breath Sampling Methodology
All participants orally rinsed with water before sampling for breathing and stayed in the same place for more than 30 min before collecting the gas. To collect the alveolar breath and remove dead space air, each participant breathed normally through a disposable mouthpiece and into the device. We collected 0.2 L of the front portion of the exhaled breath flow through Exit 1 into a Tedlar bag (SKC Inc., Eighty Four, PA, USA), controlled by a 3-way valve. The remaining part of the exhaled breath, alveolar air, was collected in a 1.0 L aluminum bag, as shown in Figure 1a. A bag of room air was collected concurrently with some breath samples to account for nonendogenous VOCs present in the environment. Before collection, aluminum breath analysis bags were flushed with nitrogen gas at least ten times to remove background VOCs associated with the bags. The sealed samples were kept at room temperature (25 • C) and analyzed within 6 h. (Figure 1b). In some reports [55][56][57], VOC storage in bags potentially causes VOC content changes in breath samples due to storage and transportation conditions. In our case, we collected the samples from the location within a 15-min driving distance. To confirm our storage condition's feasibility, we performed a time-dependent analysis (twice a day for three days) on ten breath samples. By comparing the quantitative data from each sample, we conclude that most VOCs are considered stable.

Measurements of VOCs in Exhaled Air
The SIFT-MS theory is based on direct mass spectrometric analysis of VOCs in air or vapor samples by chemical ionization. Selected precursor ions (H 3 O + , NO + , and O 2 + ) are injected into the helium carrier gas and ionize the VOCs in the breath samples, generating characteristic productions detected by the downstream quadrupole mass spectrometer. Real-time quantification is achieved by measuring the count rate of both precursor ions and the characteristic product ions in the downstream detection system. The concentration of trace and volatile compounds is achieved at the parts-per-billion or parts-per-million by volume. It enables the simultaneous quantification within a gaseous mixture of several VOCs. A SIFT-MS instrument (VOICE200 ultra, Syft Technologies, Christchurch, New Zealand) was used to analyze the exhaled breath samples applying the selective ion mode (SIM). The 116 compounds shown in Table A1 were separated into seven categories: alkanes, ketones, aldehydes, alcohols, amines, thiols, and others for quantitative analysis. Among these VOCs, some of the product ions from different VOCs, e.g., MH + ions at No. 43,  47, 57, 59, 60, 61, 69, 71, 75, 85, 87, 89, 91, 97, 99, 101, and 103, overlap with others. The quantitative estimations of the selected VOCs were performed based on the pre-set protocol of SIFT-MS. This setting combines several product ions derived from three different reagent ions, H 3 O + , O 2 + , and NO + , with a tolerance feature setting at 20%. The tolerance feature is employed to deal with product ion interferences for every single compound. The final analyte concentration is calculated as an average of the lowest product ion concentration and anything within 20%. Any product ion that falls beyond the 20% tolerance range will not be accounted for in the calculation. Note that part of VOCs with the interference of product ions cannot be resolved by tolerance feature set can only be quantified as a relative scale. Both the accurate measurements and the VOCs' relative scales are collected and employed to construct statistical models.

Statistical Analysis
All analyses were performed using the R software (version 4.0.2; R Foundation for Statistical Computing). Participants' characteristics were analyzed and compared between lung cancer patients and healthy controls using the t-test or Wilcoxon rank-sum test for continuous variables and the Chi-square test or Fisher's exact test for categorical variables.
The heat map [58] was used to visualize VOCs' changes in different participants, helping establish the initial research hypothesis. We used the Wilcox rank-sum test and two-sample t-test with or without equal variance assumption to determine whether VOCs' differences between lung cancer patients and healthy controls were significant. To correct for multiple comparisons, the significance of the difference for each of the 116 VOC measurements was assessed at Bonferroni-corrected p-value = 0.0004 (0.05/116) [59]. We used the XGBoost [13], a machine learning method, to build a prediction model that used VOC measurements to predict lung cancer's disease state. We used 70% of the collected VOC data as the training set to build the prediction model and then used the other 30% to test its performance. The prediction model was established based on either all VOC measurements or those with a significant difference between the two study groups. To eliminate the confounding effect from environmental VOCs, we tried out two approaches: the prediction model built on VOCs whose concentrations were not significantly different among different environments and the model incorporating participants' exhaled VOCs and corresponding environmental VOCs simulated via SMOTE [33] from collected background air samples. The performance of various approaches was evaluated in terms of accuracy (the proportion of correctly classified participants), sensitivity, specificity, and AUC (the area under the ROC curve) on the test data.

Conclusions
To summarize, the machine learning model proposed in this study can accurately identify lung cancer using participants' exhaled breath. It is a non-invasive and radiation-free system, which can accelerate the diagnosis of lung cancer. We successfully demonstrated a new approach for disease diagnosis by integrating several techniques, including comprehensive and quantitative VOC analysis and deep learning algorithms for minimizing the interference of environmental factors, which resulted in an accurate prediction model. The development of standardized and automatic breath sampling protocols is an ongoing project that we expect will vastly simplify the process of sample collection and guarantee sample quality. We are confident that these efforts will ultimately unlock the potential and importance of human breathomics.