Comparison of Targeted and Untargeted Approaches in Breath Analysis for the Discrimination of Lung Cancer from Benign Pulmonary Diseases and Healthy Persons

The aim of the present study was to compare the efficiency of targeted and untargeted breath analysis in the discrimination of lung cancer (Ca+) patients from healthy people (HC) and patients with benign pulmonary diseases (Ca−). Exhaled breath samples from 49 Ca+ patients, 36 Ca− patients and 52 healthy controls (HC) were analyzed by an SPME–GC–MS method. Untargeted treatment of the acquired data was performed with the use of the web-based platform XCMS Online combined with manual reprocessing of raw chromatographic data. Machine learning methods were applied to estimate the efficiency of breath analysis in the classification of the participants. Results: Untargeted analysis revealed 29 informative VOCs, from which 17 were identified by mass spectra and retention time/retention index evaluation. The untargeted analysis yielded slightly better results in discriminating Ca+ patients from HC (accuracy: 91.0%, AUC: 0.96 and accuracy 89.1%, AUC: 0.97 for untargeted and targeted analysis, respectively) but significantly improved the efficiency of discrimination between Ca+ and Ca− patients, increasing the accuracy of the classification from 52.9 to 75.3% and the AUC from 0.55 to 0.82. Conclusions: The untargeted breath analysis through the inclusion and utilization of newly identified compounds that were not considered in targeted analysis allowed the discrimination of the Ca+ from Ca− patients, which was not achieved by the targeted approach.


Introduction
Human breath contains volatile organic compounds (VOCs) that originate either from endogenous biochemical processes (endogenous VOCs) or from environmental exposures such as inhalation, ingestion and dermal absorption (exogenous VOCs). In case of disease, the biochemical pathways can be dysregulated or altered [1], and this will change the endogenous VOC composition of exhaled breath. Moreover, disease can also affect the absorption, distribution, metabolism and excretion of exogenous compounds. These alterations can be detected and used for disease detection and diagnosis. The analysis of exhaled breath is currently an area of intensive research aiming at the development of new non-invasive tests for preliminary screening and diagnosis of various pathological conditions. Particular attention is given to cancer, where early diagnosis is critical for successful treatment but where the disease is often diagnosed at late stages and diagnostic procedures are invasive, time consuming or costly. Mass spectrometry (MS)-based breath analysis is currently the mainstream choice in disease diagnosis research and can follow two strategies, classified as targeted or non-targeted (also referred to as untargeted). The former is based on quantification of an a priori defined set of VOCs known or hypothesized as disease biomarkers and is thus a hypothesis-driven approach. In contrast, the non-targeted strategy is a (qualitative) hypothesis-generating approach that investigates the whole VOC profile in a breath sample without any a priori information about the chemical composition of the sample and aims to identify a maximum number of VOCs. By non-targeted breath analysis, novel biomarkers and disturbed metabolic pathways can be discovered, or a characteristic breath VOC profile of the disease can be defined and further used for disease detection and diagnosis.
However, the non-targeted approach yields a huge amount of complex data, and its application would be impossible without bioinformatics software designed for the treatment and statistical analysis of raw chromatography-mass spectrometry data and for the identification of detected unknown compounds. Such software has been developed mostly in the last decade, and there is currently a variety of commercial and open source tools for the treatment and analysis of chromatography-mass spectrometry data and the extraction of the relevant biological information [2]. This has given great impetus to the development of non-targeted analysis in metabolomics in general [3] and opens new perspectives in breath research in particular [4]. One of the most widely used metabolomic software platforms is XCMS Online, which is freely available [5].
However, the non-targeted approach has long-standing reproducibility issues [6,7] and is never truly unbiased since the acquired data are significantly affected by experimental design and instrumental parameters. In contrast to the targeted strategy, the lack of absolute quantification makes it difficult to assess variations in metabolite levels between groups, to normalize the acquired data and even to make interlaboratory comparisons of the results [7,8]. These weaknesses of the non-targeted approach are, at the same time, the strengths of the targeted approach and, recently, hybrid approaches bridging them have been developed [8,9]. In this study, we perform a retrospective non-targeted analysis of full scan data previously acquired [10] in the targeted analysis of breath samples from lung cancer (Ca+) patients, benign pulmonary disease (Ca−) patients and healthy controls (HC). The targeted analysis was based on the quantitation of 19 pre-determined VOCs [10]. While Ca+ patients were satisfactorily discriminated from healthy controls, the analysis failed to discriminate Ca+ patients from Ca− patients (without LC but with pathological computed tomography findings). The aim of the present study is to compare the efficiency of the targeted and untargeted approaches in discriminating lung cancer patients from healthy people and from patients with other pulmonary diseases, and to record the strengths and limitations of each approach on the same raw GC-MS data pool. Additionally, by combining the targeted and untargeted approaches, we sought to improve the discrimination ability of the breath analysis.

Characteristics of Study Participants
From the 85 patients with pathological computed tomography (CT) findings who underwent bronchoscopy, lung cancer was diagnosed in 49 patients (43 males/6 females). The mean age of Ca+ patients was 71.1 years (SD: 8.2). The majority of LC patients (n = 40) were diagnosed with non-small cell lung carcinoma, while 8 were diagnosed with small cell lung carcinoma (for one patient, the type was not available). Thirty-six patients (30 males/6 females, mean age 66.8 (SD: 10.8)) were not diagnosed with LC by histological/cytological examination. The possible pathological origins for this group include sarcoidosis, hypersensitivity pneumonitis, interstitial lung diseases or pulmonary infections such as tuberculosis. The control group consisted of 52 persons (35 males/17 females) with a mean age of 66.8 (SD: 10.8).
In regard to smoking habit, most of the LC patients (81.6%) were former smokers with a mean time from cessation of 9.4 years, while 12.2% were active smokers and 6.1% reported that they had never smoked. Patients that were not diagnosed with LC had slightly different frequencies of smoking habit, with 55.6% being former smokers (mean time from cessation: 10.6 years), 27.7% being active smokers and 16.7% never smokers. In the HC group, the percentage of active smokers was significantly higher (38.4%), as was the percentage of individuals that had never smoked (28.9%).

Data Pre-Processing, Selection and Identification of Candidate Features
The processing of raw files with the use of the XCMS Online platform identified 358 informative features (ions) meeting the criteria defined in the Materials and Methods (Section 4.3) after peak identification, alignment, retention time correction and preliminary online statistical analysis. Figure 1 presents the metabolomic cloud plots obtained from XCMS Online, concerning the pairwise analysis of Ca+ vs. HC and Ca+ vs. Ca− groups. Features identified as differentiated between subgroups by XCMS Online were automatically grouped into 110 corresponding chromatographic peaks. These peaks were manually evaluated and verified in the acquired chromatograms. This process resulted in the exclusion of 28 peaks from further analysis due to unacceptable chromatographic characteristics such as low signal-to-noise ratio and co-elution with other substances. The mass spectra corresponding to the 82 remaining peaks were compared with those stored in the NIST library after subtracting mass spectra corresponding to noise. These procedures led to the exclusion of additional peaks with spectra indicating silanes and silicon compounds that were considered interferences from the SPME fiber, the chromatography column or septum materials. In addition, peaks with mass spectra corresponding to known contaminants from Tedlar® bag materials (phenol, N,N-dimethylacetamide) were also excluded [11]. In total, 53 compounds were not considered for further analysis. Thus, the remaining 29 peaks were considered for further investigation. For these, comparisons of mass spectra with those contained in the NIST library identified 12 compounds with a probability higher than 75%. Four monoaromatic compounds (benzene, styrene, ethylbenzene and toluene) were also verified with analytical standards.
In addition, seven compounds were verified by retention time (RT) by comparing actual RTs with simulated RTs determined with the use of the Pro EZGC Chromatogram Modeler (Restek Corporation, Bellefonte, PA, USA). For 5 peaks, the NIST probability was 50-75%, indicating a considerable degree of uncertainty in compound identification, while 12 compounds (probability < 50%) were designated as unknowns. Moreover, experimentally determined retention indices (RIs) were compared with those stored in the NIST library. Small deviations were observed (<10%) for most compounds, while the RI values were in agreement with the order of elution of identified VOCs, with the exceptions of propionic acid and methylacetamide. Figure 2 presents the flow chart of the process applied for selecting and identifying informative compounds. In Table 1, the compounds are presented along with NIST probability and spectra match scores, actual and simulated retention times, experimentally determined RIs and RIs derived from the NIST workbook. The 17 identified compounds were further investigated by searching for their presence in the KEGG pathway database [12] and in the scientific literature to determine their putative origins and the involved metabolic pathways. For twelve compounds, no evidence of endogenous origin was found. These include monoaromatic hydrocarbons and furans, which are carcinogens contained in tobacco smoke and produced by industrial sources and commercial uses; sulfur-containing compounds (methyl propyl sulfide, 1-methylthio-(E)-1-propene), which are used as flavor agents and contained in garlic and onion; and eucalyptol, which is used as an asthma/COPD drug. Eight substances could be of both endogenous and exogenous origin. Most of the identified metabolic pathways concerned the degradation/metabolism of xenobiotic substances such as ethylbenzene, benzene and dimethylacetamide.
Propionic acid is involved in multiple pathways of lipid biosynthesis, propanoate metabolism and vitamin K metabolism. P-benzoquinone can be formed from benzene metabolism [13], but also participates in other pathways, and acetic acid is involved in the formation of glycogen, cholesterol synthesis, fatty acid degradation and acetylation of amines [14].
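The selection funnel and the identification tiers described in this section can be tallied in a few lines. The following is a minimal sketch rather than the authors' workflow: the function name `identification_tier`, the tier label "tentative" and the handling of the exact boundary values (50% and 75%) are assumptions, since the text only states "higher than 75%", "50-75%" and "< 50%".

```python
def identification_tier(nist_probability: float) -> str:
    """Map a NIST library match probability (in %) to a confidence tier.

    Boundary handling (exactly 50 or 75) is an assumption for illustration.
    """
    if nist_probability > 75:
        return "identified"
    if nist_probability >= 50:
        return "tentative"
    return "unknown"


# Peak counts reported in the text.
peaks_grouped = 110                              # peaks grouped by XCMS Online
peaks_after_manual_check = peaks_grouped - 28    # low S/N or co-elution
peaks_retained = peaks_after_manual_check - 53   # silicon compounds, bag contaminants

# Number of retained peaks per tier, as reported in the text.
tier_counts = {"identified": 12, "tentative": 5, "unknown": 12}

print(peaks_after_manual_check, peaks_retained, sum(tier_counts.values()))
```

The three printed values (82, 29 and 29) reproduce the counts given in the text, which confirms that the reported exclusions and the three probability tiers account for every retained peak.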

Reprocessing of Raw Chromatographs and Statistical Analysis of Identified/Verified Associations
Following the identification of the compounds, all raw files were reprocessed with Thermo Xcalibur™ software to obtain more valid data. This procedure allowed manual retention time correction, more accurate integration of chromatographic peaks and exclusion of false (noise) peaks. The areas of the chromatographic peaks were determined for each compound in exhaled breath samples and also in ambient air samples. Chromatographic peak areas were normalized with the use of an external standard mixture (see Section 4.4). Regarding ambient air levels, for 6 out of 29 compounds, the relative levels in ambient air were considered insignificant, for 5 compounds low, for 14 compounds moderate and for 4 compounds high (Table 2). Comparative statistical analysis confirmed the significant difference in breath levels between Ca+ patients and healthy controls for 18 out of 29 compounds, while two were found to differ between Ca+ and Ca− patients. Lung cancer patients had significantly elevated levels of ethylbenzene, styrene, toluene, xylene, eucalyptol and four unknown compounds compared to healthy controls. Lower levels were observed for acetaldoxime, methyl propyl sulfide, 1-methylthio-(E)-1-propene, propionic acid, methylacetamide and three unknown compounds. Results concerning the comparative analysis of areas of chromatographic peaks between patient groups are summarized in Table 2.

Application of Machine Learning Methods to Estimate the Diagnostic Efficiency of the Breath Analysis
In our previous work, based on 19 selected VOCs, we identified subsets of features (VOCs) that were capable of efficiently discriminating healthy individuals from cancer patients, but not Ca+ from Ca− patients. In this section, we present the results of machine learning methods based on combinations of the 29 features, identified as differentiated between population subgroups by the untargeted approach. When all 29 features were included, correct classification of Ca+ and HC was 86% (AUC: 0.94) (Table 3, Analysis no. 9). After the two steps of feature selection, using a subset of eight features, the correct classification improved to 91% (AUC: 0.96) (Table 3, Analysis no. 10), which was higher than that of targeted analysis. Similarly, discrimination between Ca− patients and HC was also very efficient. The correct classification of datapoints ranged from 90% (AUC: 0.94), when using all 29 features (Table 3, Analysis no. 11), to 94% (AUC: 0.97) after the two steps of feature selection, using a subset of seven compounds (Table 3, Analysis no. 12). Not surprisingly, discrimination between pooled cancer-positive and non-cancer patients (Ca+ and Ca−) and HC was again very efficient. Overall, machine learning models based on compounds identified as differentiated by the untargeted approach achieved a very comparable if not marginally better accuracy than the targeted approach, when trying to discriminate healthy individuals from any of the three types of patients (cancer, non-cancer, pooled).
Subsequently, we tested the potential for discrimination between Ca+ and Ca− patients, with the three machine learning algorithms, by using normalized peak areas of compounds from breath. The set of 29 VOCs was not capable of efficiently discriminating between cancer and non-cancer patients, irrespective of the machine learning algorithm applied. The best-performing algorithm (random forest) correctly predicted only 53% of datapoints (AUC: 0.54) (Table 3, Analysis no. 15) when using all 29 VOCs. However, when two successive steps of feature selection were implemented, the random forest's accuracy significantly increased to 75% (AUC: 0.82), by using a set of only three metabolites (Table 3, Analysis no. 16). We repeated the analysis to discriminate Ca+ from Ca− patients, by incorporating normalized levels after subtracting ambient air levels, in the hope that removal of any noise from the air would increase the discriminatory power of the random forests. However, the performance did not increase as much as it did when we used only normalized concentrations of breath. More specifically, by using all 29 VOCs, random forests achieved an accuracy of 58% (AUC: 0.54) (Table 3, Analysis no. 17), whereas, after two steps of feature selection, the performance was increased to an accuracy of 72% (AUC: 0.78) by using eight features (Table 3, Analysis no. 18).
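The two performance measures reported throughout Table 3 capture different things: accuracy depends on a decision threshold, whereas the AUC only depends on how well the classifier ranks Ca+ samples above Ca− samples. The sketch below computes both from classifier scores using only the Python standard library. The labels and scores are hypothetical and purely illustrative, the 0.5 threshold is an assumption, and the code does not reproduce the authors' random forest or their validation scheme.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for illustration only (1 = Ca+, 0 = Ca-).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

print(accuracy(labels, scores), roc_auc(labels, scores))  # prints 0.75 0.875
```

Because the two measures answer different questions, a model can reach a similar accuracy with a noticeably different AUC, which is why both are reported for every analysis.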
We also examined whether the combination of the 19 VOCs measured by the targeted approach together with the 29 VOCs identified as differentiated by the untargeted approach would increase the discriminatory power of the machine learning models in Ca+ vs. Ca− patients. In this set, the concentrations of 19 VOCs in breath were used together with 29 VOCs selected as informative by the untargeted approach. By using all 48 variables, random forests achieved an accuracy of 45% (AUC: 0.44) (Table 3, Analysis no. 19), whereas, after two steps of feature selection, the performance was increased to an accuracy of 73% (AUC: 0.72) (Table 3, Analysis no. 20), using three features (thiophene from the targeted approach and acetaldoxime and N-methyl acetamide from the untargeted approach). Thus, the inclusion of the 19 targeted metabolites did not increase the discriminatory performance of random forests that were based only on untargeted metabolites.
Finally, we tested if smoking was a confounding factor for the discrimination (with random forests) of cancer vs. non-cancer patients, using normalized breath measurements of VOCs selected as informative by the untargeted approach. In these analyses, we retained 43 cancer patients and 26 non-cancer patients that never smoked or had quit smoking. The best-performing algorithm (random forest) correctly predicted only 59% of datapoints (AUC: 0.57) when using all 29 untargeted VOCs (Table 3, Analysis no. 21). When we used the three untargeted VOCs that had yielded the best performance in the previous cancer vs. non-cancer patients analysis, random forests of the non-smokers achieved an accuracy of 72.5%, but with a significantly lower AUC of 0.68 (Table 3, Analysis no. 22). Thus, we also performed two rounds of feature selection specifically for the non-smokers and, this time, random forests achieved an accuracy of 77%, with an AUC of 0.85, by using five VOCs (Table 3, Analysis no. 23).
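One way to put the Ca+ vs. Ca− accuracies in context is the majority-class baseline, i.e., the accuracy obtained by always predicting the larger group. This baseline is not reported by the authors; it is added here as a rough sanity check, computed from the group sizes stated in the text (49 Ca+ and 36 Ca− patients overall; 43 and 26 among never or former smokers). The function name is hypothetical.

```python
def majority_baseline(n_group_a: int, n_group_b: int) -> float:
    """Accuracy obtained by always predicting the larger of two groups."""
    return max(n_group_a, n_group_b) / (n_group_a + n_group_b)


all_patients = majority_baseline(49, 36)         # Ca+ vs. Ca-, all patients
not_current_smokers = majority_baseline(43, 26)  # never or former smokers only

print(f"{all_patients:.3f} {not_current_smokers:.3f}")  # prints 0.576 0.623
```

Under this simple yardstick, the accuracies obtained with all 29 VOCs (53% and 59%) do not exceed what a trivial classifier would achieve, whereas the accuracies obtained after feature selection (75% and 77%) are clearly above it.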
In summary, based on all the above analyses, we conclude that the best-performing algorithm is again random forests, and that the normalized breath data from the untargeted approach are sufficient to help the algorithm achieve a very high performance in all comparisons. Furthermore, the two successive rounds of feature selection significantly improved the performance of the random forests, especially in the case of Ca+ vs. Ca− patients. This was not possible in a previous study that had used a limited set of 19 selected VOCs. Furthermore, smoking was not a confounding factor for the untargeted analysis, an observation that is in agreement with the results of targeted analysis. It is very clear that the given untargeted approach, in combination with machine learning algorithms and feature selection, identified sets of compounds with sufficient discriminatory power (accuracy of 91-94%) to help us understand if a sample comes from a healthy person or from a person with a pulmonary disease. This was achievable with only seven to nine metabolites. Furthermore, it is also possible to discriminate, with satisfactory accuracy (75-77%), cancer from non-cancer patients, by using only three to five untargeted metabolites.

Discussion
In this study, we performed analyses based on non-targeted screening of the raw chromatographic data obtained from breath analysis, for three population groups (Ca+, Ca− and HC) and compared the discriminatory power of this approach to that achieved by targeted analysis. In the targeted analysis, 19 pre-selected compounds were measured, which were selected based on literature indicating that they might be potential biomarkers of lung cancer. Seven of these pre-selected compounds were found to differ significantly between Ca+ and HC, and between pooled patient (Ca+ and Ca−) and HC groups, and none differed significantly between Ca+ and Ca− groups [10].
The non-targeted analysis was performed with the use of the XCMS Online data processing platform combined with manual processing of the raw chromatograms to select the informative compounds and develop a dataset containing the areas of chromatographic peaks of differentiated compounds. Processing of the raw files with XCMS Online was conducted to determine the subset of chromatographic peaks and corresponding ions (m/z) to focus on, and narrow the investigated peaks to those only identified as significantly differentiated between population subgroups (Figure 2: Step 1). Next, we manually crosschecked (Figure 2: Step 2) and reprocessed (Figure 2: Step 5) the identified peaks in the raw data, by integrating extracted ion chromatograms (EICs). This task was performed to confirm and, when necessary, correct the results obtained from automated online data processing, and increase the reliability of the developed dataset, before proceeding to statistical analyses and the application of machine learning methods. We considered this stage necessary since peak misalignment or identification of "false peaks" by preprocessing software has been reported as a potential limitation of this approach due to the variance and complexity of raw chromatograms [15-17]. Indeed, a number of peaks identified by XCMS as informative could not be satisfactorily processed in the raw chromatograms and had to be excluded from the analysis, due to noise interferences or co-elution issues. It was interesting that two compounds (1-propanol and 2-propanol) identified as differentiated between population groups by the targeted approach were filtered out by the selection criteria applied in the untargeted workflow.
By searching for 2-propanol and 1-propanol in the XCMS results, we observed that the corresponding peaks were correctly identified and their levels were found to differ between Ca+ and HC, while fold changes in LC patients were in agreement with those observed when concentrations determined by calibration curves (targeted analysis) were compared. However, the level of statistical significance of non-normalized values (determined by a t-test) was 0.0185 for 2-propanol and 0.0198 for 1-propanol, which was marginally higher than the selection criterion (p < 0.01) set for Ca+ vs. Ca− pairwise (online automated) analysis. It should also be mentioned that the t-test is not the appropriate significance criterion for non-normally distributed data.
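For data that are not normally distributed, a rank-based test is a common alternative to the t-test. The sketch below implements the Mann-Whitney U test with a normal approximation using only the Python standard library. It is an illustration, not the authors' procedure: the text does not state which test was used in the offline analysis, the peak areas are hypothetical, and the approximation omits tie and continuity corrections, so an established statistics package would be preferable in practice.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value) via the normal approximation.

    Tied values receive average ranks; the variance is not tie-corrected and
    no continuity correction is applied.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p_value


# Hypothetical peak areas for two groups (illustration only).
group_a = [1.2, 3.4, 2.2, 5.1, 4.0]
group_b = [0.8, 1.1, 2.0, 1.5, 0.9]

u_stat, p = mann_whitney_u(group_a, group_b)
passes_strict_criterion = p < 0.01  # the p < 0.01 selection criterion from the text
print(u_stat, round(p, 3), passes_strict_criterion)
```

In this toy example the groups differ at the conventional 0.05 level but not at the stricter 0.01 criterion, which mirrors how 1-propanol and 2-propanol were filtered out of the untargeted workflow despite showing a consistent group difference.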
It is also noteworthy that 53 compounds identified as informative by the analysis with XCMS Online were at a later stage excluded as they corresponded to silicon-based compounds and presumably derived from the SPME fiber and chromatographic column bleed (Figure 2: Step 3). The vast majority of these compounds were selected based on the Ca+ vs. HC pairwise analysis and the associations can be attributed to different experimental conditions during the time periods of the collection and analysis of the population subgroups. It is therefore assumed that these compounds were selected due to systematic variations in experimental conditions. This effect is often corrected through normalization processes where signal intensity is adjusted by the total intensity, the highest value or by an external or internal standard [18]. In untargeted metabolomics, the use of pooled samples as external standards is often applied [19] but this practice would be extremely complicated in exhaled air samples. In the present study, external standard normalization was conducted by incorporating spiked standard mixtures with known concentrations that were used in targeted analysis (Figure 2: Step 6). Moreover, after manual processing of the detected peaks and external standard normalization, a few associations that were determined as significant from XCMS Online analysis were not confirmed by offline statistical analysis of reprocessed data.
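The three normalization strategies mentioned above (total intensity, highest value, external standard) can be compared side by side. The sketch below is illustrative only: the compound areas and the standard area are hypothetical, and dividing each peak area by the area of a spiked standard measured in the same batch is an assumption about how external standard normalization might be implemented, since the details of Section 4.4 are not reproduced here.

```python
from typing import Dict


def normalize_by_total(areas: Dict[str, float]) -> Dict[str, float]:
    """Divide each peak area by the summed area of all peaks in the sample."""
    total = sum(areas.values())
    return {name: area / total for name, area in areas.items()}


def normalize_by_max(areas: Dict[str, float]) -> Dict[str, float]:
    """Divide each peak area by the largest peak area in the sample."""
    largest = max(areas.values())
    return {name: area / largest for name, area in areas.items()}


def normalize_by_external_standard(areas: Dict[str, float], standard_area: float) -> Dict[str, float]:
    """Divide each peak area by the area of an external standard (assumed scheme)."""
    return {name: area / standard_area for name, area in areas.items()}


# Hypothetical peak areas for one breath sample (arbitrary units).
sample = {"toluene": 200.0, "styrene": 50.0, "eucalyptol": 150.0}

by_total = normalize_by_total(sample)
by_max = normalize_by_max(sample)
by_standard = normalize_by_external_standard(sample, standard_area=100.0)

print(by_total["toluene"], by_max["styrene"], by_standard["eucalyptol"])  # prints 0.5 0.25 1.5
```

The first two schemes rescale each sample by its own content, so a strong contaminant peak changes every normalized value in that sample; an external standard avoids this dependence, which may be one reason to prefer it when interferences such as column bleed vary between measurement periods.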
Some of the identified compounds have been reported previously to differ in the breath of LC patients and other pulmonary diseases. In particular, monoaromatics are reported by numerous publications. A very recent review by Ratiu identified 21 aromatic hydrocarbons differentiated in lung cancer [20]. Furans, such as 3-methylfuran and 2,5-dimethylfuran, have also been identified by previous studies but these compounds are considered biomarkers of both active and passive exposure to tobacco smoke [21]. Allyl methyl sulfide and methyl propyl sulfide (an isomer of 1-methylthio-(E)-1-propene), which were found in lower levels in LC patients, are known to suppress the proliferation of human lung tumor cells and possess anti-carcinogenic properties [22,23]. Moreover, similar structures, such as dimethyl sulfide and methionol, are involved in the metabolism of methionine [24]. Differences in the exhaled breath levels of acetic acid and propionic acid have also been reported by previous studies, albeit less frequently [25,26]. Exhaled p-benzoquinone has been proposed as a marker of malignant pleural mesothelioma [27]. For other identified substances (N-2-aminoethyl acetamide, 1-methoxy-propanol, methylacetamide, acetaldoxime, eucalyptol), we did not find any references in the scientific literature concerning the potential association of the exhaled breath concentrations with lung cancer. It should be noted that for some compounds (e.g., propionic acid, acetic acid), we report lower levels in the exhaled breath of LC patients, a finding which apparently contradicts existing evidence. The lack of reproducibility between independent research groups is a known obstacle in breath research. It should also be mentioned that for a few compounds, the identification is questionable. This statement is based on the observation that deviations in RTs and RIs (Figure 2: Step 4) for these compounds do not follow the trend established by known compounds.
These include propionic acid, methylacetamide, acetaldoxime and 1-methoxy-propanol. The utilization of retention indices in compound identification confirmation through the comparison with available retention data can be of great importance, especially when mass spectral matches are derived from multiple candidate compounds with similar spectra (e.g., isomer compounds) [28]. In our investigation, the use of RIs assisted in the confirmation of mass spectra matches and in distinguishing which isomer compound corresponds to the chromatographic peak (1-methylthio-(E)-1-propene, p-xylene). The small deviations between calculated and library-derived RIs were expected since RIs were experimentally determined with a DB-624 column (6% cyanopropyl/phenyl, 94% polydimethylsiloxane (PDMS)) and retrieved RIs were related to a 100% PDMS column. Naturally, the RI is dependent on the kind of stationary phase and different stationary phases give rise to different RIs of the same compound. However, the same trend in the abovementioned deviation was observed in the vast majority of the identified compounds. The combination of mass spectra and RI data has been proposed in both targeted and untargeted GC-MS data processing protocols [29].
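To make the RI comparison concrete, the sketch below computes a linear (temperature-programmed) retention index by interpolating between the retention times of the two n-alkanes that bracket the analyte, and then expresses the deviation from a library value as a percentage. The text does not state which RI formula was used, so the choice of the linear formula is an assumption; the alkane retention times, the analyte retention time and the library RI are hypothetical.

```python
from typing import Dict


def linear_retention_index(rt: float, alkane_rts: Dict[int, float]) -> float:
    """Linear retention index from bracketing n-alkanes.

    alkane_rts maps carbon number -> retention time (same units as rt).
    """
    carbons = sorted(alkane_rts)
    for n, n_next in zip(carbons, carbons[1:]):
        t_n, t_next = alkane_rts[n], alkane_rts[n_next]
        if t_n <= rt <= t_next:
            return 100 * (n + (n_next - n) * (rt - t_n) / (t_next - t_n))
    raise ValueError("retention time is outside the alkane ladder")


def relative_deviation_percent(experimental: float, library: float) -> float:
    """Absolute deviation of the experimental RI from the library RI, in %."""
    return 100 * abs(experimental - library) / library


# Hypothetical alkane ladder (C6 to C8) and analyte retention time, in minutes.
alkanes = {6: 5.0, 7: 8.0, 8: 11.5}
ri = linear_retention_index(9.75, alkanes)
deviation = relative_deviation_percent(ri, library=700.0)

print(ri, round(deviation, 2))  # prints 750.0 7.14
```

A systematic offset of this size between a mid-polarity column and a 100% PDMS library value would be consistent with the "<10%" deviations described in the text, whereas a compound whose deviation departs from the common trend would deserve a closer look at its identification.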
By searching for the identified compounds in metabolic pathway databases and in the scientific literature, we found no direct evidence linking these VOCs to biochemical alterations that occur in cancer and therefore the biochemical interpretation of the results is not straightforward. While instrumental techniques, sampling methods and informatics approaches for studying diseases through the analysis of exhaled breath are constantly evolving [30-32], it is critical for future research to advance the knowledge concerning the understanding of underlying mechanisms that result in alteration of VOC breath composition. Current scientific knowledge provides some evidence and hypotheses concerning the biochemical background of endogenous VOCs [33], but the origin of the majority of these compounds is largely uncertain. Further research on endogenous products is of great importance not only for diagnostic purposes but also for targeting treatment [34].
It is evident that most of the compounds identified as differentiated in population groups in the present study are of exogenous origin or are produced endogenously during the metabolism of exogenous compounds. This observation enhances the findings of our previous publication, where it was hypothesized that alterations in pulmonary function and in the metabolism and excretion of exogenous compounds in disease can have an effect on the concentrations measured in exhaled breath. This hypothesis is also supported by several clinical tests and recent research that use exogenous VOCs (EVOCs) as probes to "measure the activity of metabolic enzymes in vivo, as well as the function of organs, through breath analysis" [35]. Future research should further elucidate the potential of the administration of harmless exogenous compounds as probes to study diseases.
In accordance with acquired data, the discrimination of LC patients from patients with abnormal CT findings was substantially increased by the untargeted approach and subsequent feature selection/machine learning in comparison to a previously conducted targeted approach. The correct classification was 75-77% for Ca+ vs. Ca− in the untargeted analysis compared to approximately 50% in the targeted analysis. Additionally, we report 91% accuracy for the discrimination of LC patients from healthy controls based on the investigation of 29 VOCs selected as informative by a non-targeted approach. The discriminatory power was slightly increased compared to the targeted analysis focusing on the quantification of a set of 19 pre-determined VOCs. Although the targeted approach has the advantage of the absolute determination of VOC levels and is less prone to biases, untargeted screening allowed us to detect new distinctive features and incorporate a larger compound set into the classification analysis, thus resulting in better discrimination. Previous studies investigating VOC profiles by gas chromatography-mass spectrometry also reported high discriminatory power in distinguishing LC patients from healthy controls [36-47]. However, the major concerns are the limited reproducibility regarding the compounds identified by different research groups and the uncertainties regarding the origins of VOCs that differentiate lung cancer. The lower discriminant power between Ca+ and Ca− patients underlines the importance of evaluating the interference of other pulmonary diseases in the identification of LC biomarkers [46,47]. The combination of the datasets developed by the targeted and untargeted approaches did not significantly improve the discrimination, an observation that underlines that the information provided by targeted analysis is contained to a large extent in the data obtained by the untargeted approach.
Untargeted VOC screening detected four (toluene, benzene, styrene, ethylbenzene) out of seven compounds that were found to differ significantly in targeted analysis, and exploited numerous features that could not be identified by the targeted approach. In agreement with targeted analysis, incorporating breath subtracts (ambient air was subtracted from breath measurements) slightly decreased the discriminatory power of the analysis. This can be explained by the fact that for some VOCs with high concentrations in ambient air, the information contained in breath measurements was not exploited. Including breath subtracts (also referred to as the alveolar gradient) in the analysis is a double-edged decision. On the one hand, not considering the ambient air chemical composition may introduce environmental interferences; on the other hand, subtracting air levels from breath may result in the exclusion of valuable information.
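The trade-off described above can be illustrated with a minimal sketch of the breath subtract (alveolar gradient), computed here as the breath level minus the ambient air level for each compound. The compound names and values are hypothetical, and the simple difference is an assumption about how the subtraction was performed.

```python
from typing import Dict, Tuple


def alveolar_gradients(samples: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Subtract the ambient air level from the breath level for each compound."""
    return {name: breath - ambient for name, (breath, ambient) in samples.items()}


# Hypothetical normalized peak areas: (breath, ambient air).
samples = {
    "compound_a": (4.0, 0.5),  # low ambient level: gradient stays close to breath
    "compound_b": (3.0, 3.5),  # high ambient level: gradient is small or negative
}

gradients = alveolar_gradients(samples)
print(gradients)  # prints {'compound_a': 3.5, 'compound_b': -0.5}
```

For a compound such as the hypothetical `compound_b`, the gradient is dominated by room air, so any disease-related variation in the breath measurement is largely cancelled by the subtraction. This is the loss of information referred to in the text.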
Some further issues should be considered when interpreting the results of the present study. Although SPME has many advantages as a solvent-free and versatile pre-concentration method, it is not without limitations. During SPME, VOCs compete for the active sites of the fiber, and molecules with higher molecular weight may displace smaller ones. Thus, variation in sample composition may influence the amounts of VOCs extracted [48]. Moreover, different fiber coatings are suitable for different classes of analytes [49]. The fiber used in this study (CAR/PDMS) is suitable for VOCs with low molecular weight and a Kovats index of less than 980 [50]. According to a study conducted to evaluate the performance of different fiber coatings in the isolation of VOCs from feces, the particular fiber used isolated 60% of the total examined VOCs [51]. Concerning sampling, pre-concentration and instrumental procedures, we adopted a mixed expiratory breath sampling/SPME/GC-MS approach, but a variety of alternative methods are available. In brief, sampling can also focus on late or end-tidal expiratory breath, pre-concentration can be achieved with thermal desorption (TD) and needle trap devices (NTDs) [52], and instrumental analysis can also be performed with proton transfer reaction MS (PTR-MS) and selective ion flow tube MS (SIFT-MS) [18]. Cross-reactive sensors have also been developed and tested by numerous research groups [53].
Another limitation of this study is that the participants who formed the HC group did not undergo clinical examination or diagnostic tests to exclude the possibility of undiagnosed cancer or serious pulmonary diseases; instead, they were recruited on the basis of personal interviews. Thus, the possibility that a few individuals were falsely classified as controls cannot be entirely excluded.
In summary, untargeted VOC profiling captured, to a large extent, the information provided by targeted analysis and performed more efficiently in discriminating lung cancer patients from patients with benign pulmonary diseases, through the utilization of new compounds that were not previously considered. However, uncertainties in compound identification and automated processing of raw data should be carefully addressed. Subsequent steps for the verification and manual correction of automatically identified peaks in the raw chromatographic files can increase the reliability of the acquired datasets.

Participant Recruitment and Breath Sampling
A detailed description of the procedures followed for participant recruitment and sampling of exhaled breath can be found in a previous publication [10]. In brief, the study population consisted of 85 patients from the General University Hospital of Larissa (Greece) who underwent bronchoscopy due to abnormal CT findings, and a control group of 52 individuals of similar age recruited from local health centers. Samples were collected from October 2018 to October 2019. After bronchoscopy, patients were categorized by the presence of LC on the basis of the results of the cytological/histological examination. The control group (referred to in the text as healthy controls (HC)) was selected on the basis of the absence of self-reported pulmonary diseases and cancer, as established during the personal interviews conducted on the day of sampling.
Breath samples were collected in Tedlar® bags (Sigma-Aldrich, St. Louis, MO, USA). Participants were asked to inhale deeply and hold their breath for 30 s, then exhale through a disposable mouthpiece into the 1 L Tedlar® bag until it was filled. Two breath samples were collected approximately two minutes apart. Ambient air samples were also collected with the use of a portable Laboport® UN 86 KTP (KNF Neuberger GmbH, Freiburg, Germany) pump.

Materials, Solid Phase Microextraction and GC-MS Analysis
A detailed description of the materials and methods used in the present study can be found in our previous publication [10]. In brief, extraction and pre-concentration of the analytes from breath samples were achieved by solid phase microextraction (SPME) using a 75 µm carboxen-polydimethylsiloxane (CAR/PDMS)-coated fused silica fiber assembly (Sigma-Aldrich, St. Louis, MO, USA), and desorption of analytes from the fiber was performed for 5 min at 270 °C. Instrumental analysis was performed with a Finnigan Trace GC Ultra/Polaris Ion Trap GC/MSn system equipped with a DB-624 GC capillary column (inner diameter: 0.25 mm, length: 30 m, film: 1.4 µm, 6% cyanopropylphenyl/94% dimethylpolysiloxane, Agilent, Santa Clara, CA, USA). GC-MS chromatograms were acquired in total ion current (TIC) mode of the mass analyzer, and then extracted at one or two specific m/z values for analyte quantification. Data acquisition and processing were carried out using Xcalibur™ 3.0 software (ThermoFisher Scientific, San Francisco, CA, USA). Furthermore, for the determination of retention indices (RIs), SAK-100-1 and SMA-200-1 (Agilent, Santa Clara, CA, USA) analytical standards containing C5 to C12 alkanes were used. Gas samples were prepared and spiked with a methanolic solution of C5-C12 alkanes, and the retention time of each alkane was determined.

Identification of Candidate Features and Raw Data Reprocessing
All features identified as differentiated between population groups by the XCMS analysis were searched for in the raw chromatograms and the corresponding peaks were identified. The mass spectra of the identified peaks were compared with the National Institute of Standards and Technology (NIST) spectrometric library. Peaks of compounds corresponding to technical interferences (siloxanes, Tedlar® bag compounds) were excluded from further analysis. Extracted ion chromatograms were obtained for the ions identified as significantly differentiated between population subgroups by XCMS analysis, and were reprocessed by calculation of the areas of the chromatographic peaks in SIM mode using Thermo Xcalibur™ software. The most discriminatory features were assigned based on mass spectral similarities to the NIST 2011 mass spectral library. Compounds were categorized as "probable" (probability > 75%), "possible" (probability 50-75%) and "unknown" (probability < 50%). To further confirm the identification of compounds, retention characteristics were examined. Retention times (RTs) were simulated by using the Pro EZGC Chromatogram Modeler (Restek Corporation, Bellefonte, PA, USA), introducing an equivalent chromatographic column and an identical temperature program. Simulated RTs were compared to actual RTs for substances contained in the Restek database. Retention indices of these compounds were retrieved from the NIST webbook and related to a fully non-polar column (100% polydimethylsiloxane). Moreover, retention indices for each compound were experimentally determined. SAK-100-1 and SMA-200-1 (Agilent) analytical standards with C5 to C12 alkanes were used to calculate the retention indices of the unknown compounds.
Experimental retention indices of these compounds were calculated according to the following formula:

$$I = 100\left[n + \frac{t_i - t_n}{t_{n+1} - t_n}\right]$$

where $I$ is the retention index, $n$ is the number of carbon atoms of the heading n-alkane (the n-alkane eluting before compound $i$), $t_i$ is the retention time of the specific compound $i$ (minutes), and $t_n$ and $t_{n+1}$ are the retention times of the heading and trailing n-alkanes, respectively.

Normalization of chromatographic peak areas was performed with an external standard, by dividing the instrument response by the geometric mean of the peak areas of three monoaromatic compounds (benzene, toluene and ethylbenzene) of a standard mixture (≈20 ng/L air each) analyzed on the same day.
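As an illustration only, the retention index formula and the external-standard normalization described above can be sketched in a few lines of Python. The function names and all numerical values in the usage note are hypothetical and are not taken from the study data; the sketch assumes that the compound elutes between the two bracketing n-alkanes.

```python
from statistics import geometric_mean


def retention_index(t_i, t_n, t_n1, n):
    """Retention index I = 100 * [n + (t_i - t_n) / (t_n1 - t_n)].

    t_i  : retention time of the compound (minutes)
    t_n  : retention time of the heading n-alkane (n carbon atoms)
    t_n1 : retention time of the trailing n-alkane (n + 1 carbon atoms)
    """
    if not t_n <= t_i <= t_n1 or t_n == t_n1:
        raise ValueError("compound must elute between the two n-alkanes")
    return 100.0 * (n + (t_i - t_n) / (t_n1 - t_n))


def normalize_area(peak_area, standard_areas):
    """Divide a peak area by the geometric mean of the same-day
    external-standard peak areas (e.g. benzene, toluene, ethylbenzene)."""
    return peak_area / geometric_mean(standard_areas)
```

For example, a hypothetical compound eluting at 6.0 min, midway between heptane at 5.0 min and octane at 7.0 min, would receive `retention_index(6.0, 5.0, 7.0, 7)`, i.e. an index of 750.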

Machine Learning Methods
The machine learning analyses were performed with the Waikato Environment for Knowledge Analysis (Weka). Each comparison (group 1 vs. group 2, or cases vs. controls) was analyzed using naive Bayes, logistic regression and random forest methods, with 10-fold cross-validation. Random forests consistently outperformed the other algorithms; therefore, all results are shown for this type of algorithm. Feature selection within the appropriate Weka module was also performed, in order to detect subsets of informative metabolites that could more efficiently separate the groups from each other. In particular, feature selection was performed in two steps with a wrapper that evaluates various subsets of the features (WrapperSubsetEval), using the Best_First method in order to maximize the performance of the random forest, based on the metric of the area under the curve (AUC). In the first step, the wrapper runs in a feature selection mode that performs 10-fold cross-validation. The output of this first step reports how many times each feature was selected across the 10 cross-validation folds. The features selected in at least 50% of the folds form a subset that is fed into the second step. In the second step, feature selection is repeated starting from this informative subset, and this time the wrapper runs in a feature selection mode that uses the full training set and selects a final subset of features.

Data Availability Statement: The data are not publicly available because they contain sensitive information at an individual level.
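The hand-over between the two feature selection steps described under Machine Learning Methods can be illustrated with a minimal sketch. It does not reproduce Weka's WrapperSubsetEval or the random forest evaluation; it only shows, under the assumption that the first step yields one list of selected features per cross-validation fold, how features chosen in at least 50% of the folds could be retained for the second step. The function name and the feature labels in the usage note are hypothetical.

```python
from collections import Counter


def stable_features(selected_per_fold, min_fraction=0.5):
    """Return the features selected in at least `min_fraction` of the folds.

    selected_per_fold : one collection of selected feature names per
                        cross-validation fold (output of the first step)
    """
    if not selected_per_fold:
        raise ValueError("at least one fold is required")
    # Count each feature at most once per fold.
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    threshold = min_fraction * len(selected_per_fold)
    return sorted(f for f, c in counts.items() if c >= threshold)
```

With four hypothetical folds selecting `["a", "b"]`, `["a"]`, `["a", "c"]` and `["b"]`, features `a` (3 of 4 folds) and `b` (2 of 4 folds) meet the 50% threshold, while `c` (1 of 4 folds) is dropped before the second step.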