A Novel Framework with High Diagnostic Sensitivity for Lung Cancer Detection by Electronic Nose

The electronic nose (e-nose) system is an emerging detection technology valued for its non-invasiveness, simple operation, and low cost. However, lung cancer screening through e-nose requires effective pattern recognition frameworks. Existing frameworks rely heavily on hand-crafted features and have relatively low diagnostic sensitivity. To handle these problems, a gated recurrent unit based autoencoder (GRU-AE) is adopted to automatically extract features from temporal and high-dimensional e-nose data. Moreover, we propose a novel margin and sensitivity based ordering ensemble pruning (MSEP) model for effective classification. The proposed heuristic model aims to reduce the missed diagnosis rate of lung cancer patients while maintaining a high rate of overall identification. In the experiments, five state-of-the-art classification models and two popular dimensionality reduction methods were used for comparison to demonstrate the validity of the proposed GRU-AE-MSEP framework on 214 collected breath samples measured by e-nose. Experimental results indicated that the proposed intelligent framework achieved a high sensitivity of 94.22%, an accuracy of 93.55%, and a specificity of 92.80%, thereby providing a new practical means for wide disease screening by e-nose in medical scenarios.


Introduction
Lung cancer was estimated to be responsible for close to 1 in 5 cancer deaths in 2018 and remains the leading cause of cancer death [1]. According to the eighth edition of the TNM classification, the five-year average survival rate of stage IVA patients is 10%, and that of stage IVB patients is as low as 0% [2]. Despite the high mortality rate, early diagnosis can increase the chance of efficient treatment [3] and the survival rate for lung cancer patients [4]. Radiological detection, such as computed tomography or positron-emission tomography, has enabled the lungs to be imaged for diagnosis of cancer [5]. However, these conventional detection methods are expensive and occasionally miss tumors (low sensitivity), and therefore cannot be used as widespread screening tools [6]. Moreover, radiation from medical imaging may cause adverse health effects on the human body [7]. Therefore, it is crucial to develop an effective diagnosis method for lung cancer that is also feasible for wide screening with high sensitivity, especially for high-risk patients [8].
Human volatilome analysis is a new and promising area in disease detection [9]. As a non-invasive tool for lung cancer detection [10,11], breath analysis becomes a fast-growing research field [12,13]. More than 3000 volatile organic compounds (VOCs) are found in human exhaled breath, which are directly or indirectly related to internal biochemical processes in the human body [14]. Breath print, interpreted as VOCs inside exhaled breath [15], can be analyzed by different instruments such as gas chromatography in combination with mass spectrometry (GC-MS), proton-transfer-reaction mass spectrometry, ion mobility spectrometry, and electronic nose (e-nose) [16]. E-noses are sensor arrays that consist of non-selective chemical sensors and each sensor is sensitive to a large number of VOCs with different sensitivity [17]. E-noses have been widely used in food analysis [18], environment control [19], and disease diagnosis [20]. As a promising non-invasive detection device, e-noses can identify different diseases such as lung cancer [21], prostate cancer [22], urinary tract infections [23], urinary pathogens [24], and gut bacterial populations [25]. Different from those expensive, time-consuming and complicated analysis methods by compounds identification, e-nose is popular as a simple, inexpensive, and portable sensing technology in lung cancer detection, but it relies heavily on computer analysis [26].
Although new computer-assisted diagnosis (CAD) methods emerge continuously and rapidly, effective algorithms for analyzing e-nose data for lung cancer remain far from perfect. Since e-nose cannot directly distinguish between specific VOCs [26], in addition to effective sample acquisition, another key procedure is the follow-up signal processing by computer methods. In e-nose detection, feature extraction and classification are two basic and essential steps. Feature extraction methods are applied for analyzing high-dimensional signal data, which is the prerequisite for subsequent detection. Classification models aim to study the difference of the sensor features under different physiological conditions to achieve the final diagnosis. There are many pattern recognition frameworks for diagnosing diseases by e-nose, as shown in Table 1. It can be concluded that data processing is a pivotal step in developing an effective e-nose diagnosis system, and one that requires further improvement.

Table 1. Pattern recognition frameworks for disease diagnosis by electronic nose (e-nose).

First Author | Disease | Samples | Feature Extraction | Classification | Comments
Fens [27] | COPD and Asthma | 90 | PCA | CDA | The raw data were reduced to four principal components by PCA.
van Velzen [28] | COPD | 68 | PCA | LR | Breath profiles obtained by GC-MS as well as e-nose proved the non-invasive biomarker for the diagnosis.
Dragonieri [29] | Asthma | 40 | PCA | LDA | It was the first study in the field of asthma to use pattern analysis to analyze exhaled VOC mixtures collected by e-nose.

As an unsupervised learning method, autoencoder demonstrated strength in extracting relevant information from high-dimensional signal data [35]. Meanwhile, gated recurrent unit (GRU) [36] has been shown to be one of the state-of-the-art architectures in extracting temporal features. Compared with long short-term memory (LSTM) [37], GRU has no cell state and directly employs hidden state for the transmission of signal information, thus possessing rapid training time. Thus far, deep learning algorithms have only been sparsely applied for feature extraction on e-nose data. Gated recurrent unit based autoencoder (GRU-AE) integrates GRU with the autoencoder, which leverages GRU cells to discover the dependency and temporality among multi-dimensional time series signal [38]. By introducing GRU-AE into the field of e-nose analysis for lung cancer detection, the effort to manually engineer complex features is minimized, which tremendously simplifies data processing procedures for e-noses.
As for classification models, ensemble learning has been a popular and desirable learning paradigm for the analysis of e-nose data [39]. The basic idea of ensemble learning is to build multiple component learners whose predictions are aggregated with the aim of outperforming the constituent members [40]. Typically, ensemble learning algorithms consist of two stages: the production of diverse base learners and their combination [41]. High precision and diversity are two key requirements for individual learners to guarantee the performance of the final ensemble [42]. However, combining all the individual learners requires massive storage and computing resources. Even worse, the larger size of the ensemble model cannot constantly guarantee the better performance [43]. For these reasons, ensemble pruning has arisen as an intermediate stage prior to combination, which is also termed as ensemble thinning, selective ensemble, or ensemble selection [41]. Ensemble pruning searches a good subset of base learners to form the sub-ensemble that can reduce the ensemble size and resource consumption while maintaining or even enhancing the performance of the complete ensemble. However, the complexity of finding the best sub-ensemble is an NP-complete problem [44], and therefore the optimal solution by global search is infeasible for large or even medium ensemble size [45]. Alternatively, it is more appropriate to use approximation techniques that guarantee the near-optimal sub-ensembles.
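To give a sense of why exhaustive search is impractical, the short sketch below counts candidate sub-ensembles. The sizes 101 and 11 are the ensemble sizes used later in the experiments. The counting itself is only an illustration of the search-space size and is not part of the proposed method.

```python
import math

M = 101  # size of the original ensemble (value used later in the experiments)
T = 11   # size of the pruned ensemble (value used later in the experiments)

# Number of distinct sub-ensembles with exactly T members
fixed_size_subsets = math.comb(M, T)

# Number of distinct non-empty sub-ensembles of any size
all_nonempty_subsets = 2 ** M - 1

print(f"{fixed_size_subsets:.3e} candidate sub-ensembles of size {T}")
print(f"{all_nonempty_subsets:.3e} candidate sub-ensembles of any size")
```

Even with the sub-ensemble size fixed in advance, evaluating every candidate on a pruning set would be far more costly than ranking the M learners once, which is what ordering-based pruning does.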
Many ensemble pruning strategies have been proposed to obtain the optimal or near-optimal sub-ensembles, which can be mainly categorized into ordering-based techniques [46,47], clustering-based techniques [48], and optimization-based techniques [43,49]. Ordering-based techniques attempt to rank individual classifiers based on the evaluation measures, and only the first few classifiers are selected in the pruned ensemble. Since the ranking mechanism tends to consume less time and storage resources, ordering-based ensemble pruning is the simplest and fastest one among all the ensemble pruning techniques, which is widely applied as CAD models with high accuracy [50]. Therefore, in this paper, a novel gated recurrent unit based autoencoder combined with margin and sensitivity based ordering ensemble pruning (GRU-AE-MSEP) framework is proposed. This framework consists of three major steps. (1) The GRU-AE is adopted to extract principal features from high-dimensional and complex signal data. (2) The compressed features are used to train classification and regression trees (CARTs). (3) MSEP is employed to order and select well-trained CARTs to form the final sub-ensemble for lung cancer classification. Correspondingly, the main contributions of this study are listed as follows:
1. For the first time in the field of lung cancer screening, GRU-AE is introduced into the feature extraction of e-nose signal data. As far as we know, this fills the gap of applying deep learning methods to automatically extract principal features from temporal and high-dimensional data in the e-nose system.
2. Based on the gained insight through theoretical analysis of three other ensemble pruning measures, we design and propose a heuristic margin and sensitivity based measure (MSM) for explicitly evaluating the contribution of each component classifier, which considers both instance importance and classification sensitivity. Previous studies only focused on improving the recognition accuracy of the model. To our knowledge, this is the first time that sensitivity is introduced into ensemble pruning to meet the needs of medical fields.
3. A novel MSEP is established for lung cancer detection. The proposed ensemble pruning model contributes to increasing the survival rate by decreasing missed diagnosis of lung cancer patients while guaranteeing overall performance.
4. Compared with other state-of-the-art frameworks, we demonstrate the feasibility and effectiveness of the proposed framework on collected breath samples by e-nose and three open source datasets.
Therefore, the proposed intelligent framework provides a new insight into machine learning algorithms and lung cancer detection.
The remainder of this paper is organized as follows. In Section 2, the acquisition process and pre-processing of the collected data are explained and summarized. Section 3 proposes the feature extraction method of GRU-AE and classification models based on ensemble pruning techniques. In Section 4, the performance of the proposed framework is tested and further validated by comparison with other algorithms. Discussion is shown in Section 5. Finally, Section 6 draws some conclusions of this study.

Data Collection
In this study, a total of 214 breath samples were collected from 98 patients with lung cancer and 116 healthy controls. Lung cancer patients were from the in-patient department of the Chongqing Cancer Hospital and Chongqing Red Cross Hospital. Healthy volunteers were doctors and nurses in the Chongqing Cancer Hospital and researchers from Chongqing University. All participants confirmed that they had no metabolic comorbidities and none of the patients had their tumors removed. After a detailed introduction of the purpose and plan of this experiment, all subjects gave their informed consent for inclusion before participating in the study. This study was conducted in accordance with the Declaration of Helsinki. Protocols including any relevant details of this study were carried out in accordance with the relevant guidelines and approved by the Medical Ethics Committee of Chongqing Cancer Hospital as well as the Medical Ethics Committee of Chongqing Red Cross Hospital. Table 2 provides the overall information of the volunteers participating in this study.
The breath collection process was standardized and based on a validated study published previously [51]. In brief, during the process of collection, all the volunteers blew the gas into the bag after deep breathing. To reduce the interference in the breath composition on account of different lifestyles, different variables were controlled such as the time interval, temperature, oral hygiene, etc. Sampling experiments were conducted in well-ventilated rooms to avoid interference by other odors. The detection process was carried out immediately after sample acquisition. The data used for classification were response signals from 13 sensors, including TGS2620, TGS2602, TGS2600, TGS826, TGS822, TGS8669, WSP2110, NAP-55A, MR516, ME3-C7H8, CO-B4, a temperature sensor, and a humidity sensor.

Breath Preprocessing
The miniature e-nose system in this study is composed of a lower computer system, upper computer software, and a data processing system. The lower system consists of a gas chamber, a sensor array, and a signal processing circuit. The upper computer stores user information and detection data in a MySQL (Oracle, CA, USA) relational database. The upper computer and the lower computer of the e-nose system work together to obtain the sample data and store them in the database.
The overall scheme of the e-nose detection system consists of the eight major steps shown in Figure 1a. Firstly, breath samples of volunteers were gathered, and then samples containing VOCs were sent into the gas chamber one by one. The sample gas diffused and eventually reached a uniform distribution. After the gas reacted with the sensor arrays in the gas chamber, the system output electrical signals. In the third step, electrical signals were amplified, filtered, and converted to digital signals. Then, digital signals were uploaded to the upper computer system via serial asynchronous communication. The upper computer system displayed the real-time response of the sensors and saved data to the local database. The sixth step was the pre-processing of saved sensor data, including baseline processing, filtering, and data standardization. Among these, baseline processing was used for purposes such as drift compensation and contrast enhancement. For the filtering of sensor signals, wavelet filtering was applied owing to its fast computation and wide adaptability. The reaction time for each sensor was 90 s and each sensor collected 675 points in this time interval. Therefore, every sample had a dimensionality of 8775, i.e., 13 sensors multiplied by 675 time steps. Ultimately, the pre-processed signal data were analyzed through the pattern recognition framework.
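The dimensionality arithmetic above can be checked with a minimal sketch. It assumes, purely for illustration, that baseline processing subtracts each sensor's first reading and that standardization is a per-sensor z-score. The text does not specify either formula, wavelet filtering is omitted, and the function names and synthetic values are hypothetical.

```python
import statistics

N_SENSORS = 13        # number of sensors, from the text
N_STEPS = 675         # points collected per sensor, from the text
REACTION_TIME_S = 90  # reaction time in seconds, from the text

def preprocess_sensor(signal):
    """Baseline-subtract and z-score one sensor trace (assumed formulas)."""
    baseline = signal[0]                      # assumption: first reading is the baseline
    shifted = [v - baseline for v in signal]
    mean = statistics.fmean(shifted)
    std = statistics.pstdev(shifted)
    if std == 0:                              # flat trace: avoid division by zero
        return [0.0 for _ in shifted]
    return [(v - mean) / std for v in shifted]

def flatten_sample(sample):
    """Concatenate the preprocessed sensor traces into one feature vector."""
    return [v for trace in sample for v in preprocess_sensor(trace)]

# Synthetic placeholder sample (13 sensors x 675 time steps), not real e-nose data
sample = [[0.01 * t * (s + 1) for t in range(N_STEPS)] for s in range(N_SENSORS)]
vector = flatten_sample(sample)
sampling_rate_hz = N_STEPS / REACTION_TIME_S

print(len(vector), sampling_rate_hz)
```

The flattened length matches the stated dimensionality of 13 × 675, and 675 points over 90 s corresponds to 7.5 samples per second per sensor.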

Methodology
The algorithms used in the proposed framework are explained and interpreted below. The pipeline of the whole detection framework is shown in Figure 1b. Firstly, pre-processed data were inputted into the GRU-AE-MSEP framework. GRU-AE was then trained to extract principal features from each sample. After being trained on the training set, CARTs were ordered and selected by the MSEP on the pruning set step by step. Finally, the selected classifiers formed the pruned ensemble to make predictions and obtained classification results through simple voting on the testing set.

Feature Extraction
In this study, GRU-AE was applied to form an elaborate feature representation for effective subsequent classification. The schematic diagram of GRU-AE for feature extraction is illustrated in Figure 2. Generally, the encoder module and the decoder module are the two fundamental components of an autoencoder. The encoder transforms the high-dimensional data x_i, which consist of multichannel signals, into a compressed representation z_i. The decoder module then implements the conversion from compressed features back to the original high-dimensional data, denoted as the output x̂_i. The autoencoder attempts to minimize the reconstruction error in Equation (1), which is defined as the difference between x_i and x̂_i, where D is the dimensionality of the input. Finally, z_i can be regarded as a valid representation of the input signal.
The GRU-AE model integrates GRU cells with the autoencoder, which means the encoding and decoding processes are implemented by GRU [36]. In GRU-AE, GRU cells are leveraged to discover the dependency and correlations among multi-dimensional time series signals. As shown in Figure 2, GRU can process the responses of multiple sensors simultaneously at each time step, and then generates sequence information in the encoder module. After GRU-AE is trained by the back propagation algorithm, the low-dimensional z_i serves as the temporal features extracted by the autoencoder and can appropriately represent the input signal x_i. A more detailed description of the principle of GRU-AE can be found in [38].
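Equation (1) is not reproduced in this text. One common form consistent with the description, assuming a mean squared error over the D input dimensions (the squared-error choice and the per-dimension index d are assumptions), with \hat{x}_i denoting the decoder output for input x_i, would be:

```latex
\mathcal{L}(x_i, \hat{x}_i) = \frac{1}{D} \sum_{d=1}^{D} \left( x_{i,d} - \hat{x}_{i,d} \right)^{2}
```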

Ensemble Pruning for Classification
In this section, the margin theory of ensemble method is interpreted and applied to investigate the relationship between samples and classifiers. Then, the advantages and shortcomings of three different ensemble pruning measures are analyzed and evaluated. By analysis and comparison, we propose a heuristic measure based on margin theory to assess the importance of each individual classifier, which can effectively rank and prune the base classifiers to construct a near-optimal sub-ensemble.
First, all the notations used in this section are introduced, which helps to comprehend the measures mentioned in this paper. Let D = {(x_i, y_i) | i = 1, 2, ..., N} be the total dataset constituted by each sample x_i with the corresponding label y_i ∈ {0, 1}, which can be divided into D_Tr with size N_Tr for training, D_Pr with size N_Pr for pruning, and D_Te with size N_Te for testing. The base classifier is denoted as h_i, which is used to compose the original ensemble set H with M classifiers and the pruned ensemble set S with T classifiers. Let I(·) be the indicator function, where I(true) = 1 and I(false) = 0.

Margin Theory
The margin theory was originally proposed to analyze the upper bound of the generalization error for ensemble methods with voting classification rules [52]. To further support the margin theory, the kth margin bound was proposed to tighten the upper bound of the generalization error with respect to the margin distribution [53,54]. According to the margin theory, the larger the margin over the training samples, the better the generalization performance of the ensemble model on the testing set. Consider a binary classification problem whose prediction is the result of majority voting. The margin of the sample x_i is defined in Equation (2) and takes values in the range [−1, 1].
Margin is a measure of the confidence for ensemble prediction [52]. From Equation (2), the larger positive (or negative) value of margin indicates the more confident correct (or incorrect) prediction.
Since better generalization performance can be achieved by larger margin on the whole training samples, the individual classifiers that make correct predictions are more important than those that make incorrect predictions. Intuitively, the larger is the negative margin of the sample, the more important are the base classifiers who can correctly classify it, since such classifiers have the potential to guide the ensemble to make the correct prediction. Based on those insights, margin-based measures can be applied to selecting appropriate individual classifiers.
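As a small illustration of the margin, the sketch below assumes the commonly used voting margin for binary classification: the number of correct votes minus the number of incorrect votes, divided by the total number of votes. Equation (2) is not reproduced in this text, so this form is an assumption that matches the stated range [−1, 1]. The function name is hypothetical.

```python
def margin(votes, label):
    """Voting margin of one sample: (correct - incorrect) / total votes."""
    correct = sum(1 for v in votes if v == label)
    incorrect = len(votes) - correct
    return (correct - incorrect) / len(votes)

# 100 base classifiers voting on samples whose true label is 1
confident_right = margin([1] * 80 + [0] * 20, 1)   # most classifiers are correct
hard_sample = margin([1] * 45 + [0] * 55, 1)       # most classifiers are wrong
unanimous_wrong = margin([0] * 100, 1)             # every classifier is wrong

print(confident_right, hard_sample, unanimous_wrong)
```

A positive value means the majority vote is correct, a negative value means it is wrong, and the magnitude reflects how one-sided the vote is.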

Reviews and Analyses of Three Ensemble Pruning Measures
Before introducing proposed margin-based ensemble pruning algorithm, we first illustrate three different measures for ensemble pruning as guidance: simultaneous diversity and accuracy measure for ensemble pruning (SDAcc) [55], margin and diversity based ordering ensemble pruning (MDEP) [47], and unsupervised margin based ordering ensemble pruning (UMEP) [46]. For clarity and coherence, without altering the original meaning of the above three methods, the following formulas are based on the notations defined in this study.
To improve the error-correction ability and ensure the effectiveness of the pruned ensemble, both the accuracy and diversity of an individual classifier should be considered [55]. SDAcc, shown in Equation (3), combines different weights for four events and is primarily concerned with the accuracy and diversity of the sub-ensemble. The four events in the measure are defined in Equation (4), where h is an individual classifier making predictions in the pruned ensemble S. In Equation (5), NT_i^S denotes the correct classification ratio on the pruning dataset D_Pr, and NF_i^S is equal to 1 − NT_i^S. The measure in SDAcc gives marks to classifiers with correct predictions and deducts corresponding marks from incorrect classifiers. e_10 and e_11 indicate two cases where the base classifier makes the correct decision, and the base classifier can be rewarded with different high marks, i.e., NF_i^S > 0.5 in e_10 and NF_i^S < 0.5 in e_11. In event e_00, since the results of the base classifier and the ensemble are the same, the base classifier lacks diversity. At the same time, the result of the base classifier is wrong, which makes it lack accuracy. Classifiers in e_00 have both low accuracy and low diversity, and therefore should be deducted more marks than those in e_01. Through SDAcc, the candidates with high accuracy and diversity can be selected for the final sub-ensemble. However, this measure was designed for the optimization process in greedy ensemble pruning, thus possessing higher complexity than ordering-based ensemble pruning. Moreover, the incorrect classifiers in the cases e_01 and e_00 have overlapping mark intervals, which means two samples with different importance could be considered equally important.
For instance, a base classifier makes a wrong prediction on a sample with 80 correct votes and 20 incorrect votes (belonging to e_01), while another base classifier incorrectly classifies a sample with 80 incorrect votes and 20 correct votes (belonging to e_00). However, the marks for the incorrect classifiers in the above two different events are both −0.2. Hence, it is hard to distinguish the importance of each classifier by its mark values in SDAcc. Moreover, diversity cannot guarantee the generalization capacity of the final pruned ensemble [56]. The following two ensemble pruning measures use margin theory to evaluate the importance of base classifiers in a relatively reliable manner.
MDEP is an ordering-based ensemble pruning model which relies on the margin and diversity based measure (MDM) [47]. Since a large margin can guarantee high generalization capacity, base classifiers that have the ability to increase the instance margin should be considered first. The article states that the importance of each sample increases as the absolute margin value decreases, and therefore the logarithmic function is used to reveal this tendency. MDM, shown in Equation (6), linearly combines the margin measure shown in Equation (7) and the diversity measure shown in Equation (8) with an adjustable parameter α. However, MDM deliberately favors the candidates that can make correct decisions on samples with low (positive or negative) margin. Samples that have a large negative margin and samples that have a large positive margin are treated equally in MDM, and both are given little importance. However, in our opinion, since every sample is unique and considerable, hard samples should not be totally neglected, especially in medical scenarios. Those difficult samples with a large negative margin must be valued, which is the key to further improving the accuracy and sensitivity of the ensemble pruning model. Moreover, classifiers that can correctly classify the samples with mostly incorrect votes (margin < 0) should be more important. For instance, a sample x_p has 55 incorrect votes and 45 correct votes (margin(x_p) < 0), while another sample x_q has 45 incorrect votes and 55 correct votes (margin(x_q) > 0). Classifiers that can correctly classify x_p should be more important than classifiers that can correctly classify x_q. However, in MDEP, the above two cases are of equal importance.
The UMEP [46] model highlights the main impact of low margin samples on the performance of pruning tasks. The logarithmic function was also applied to represent the inverse relation between the importance of the classifier and the margin of samples, as shown in Equation (9). The lower the margin of sample x_i, the higher the information quantity in x_i, and therefore the more significant the classifier that makes the correct decision on x_i. The article emphasizes that the margin-based ordering classifiers are less likely to make coincident errors, and therefore sufficient diversity can be ensured compared to other ordering-based methods with the same complexity. Nevertheless, the logarithmic function can only deal with positive values, and samples with negative margin are entirely neglected. As mentioned above, those samples misclassified by most classifiers (margin < 0) should be taken seriously rather than discarded.

Proposed Margin and Sensitivity Based Measure
Different from the three measures for ensemble pruning mentioned above, i.e., SDAcc, MDEP, and UMEP, this study takes a distinctive standpoint on the margin distribution of different cases in medical scenarios. We propose herein a novel measure for base classifiers called the margin and sensitivity based measure (MSM), defined in Equation (10). The motivation of designing MSM is to take into account the importance of and difference between candidate classifiers and, simultaneously, to consider the classification sensitivity while maintaining overall performance. To obtain a reasonable evaluation based on margin theory, the fourth term, i.e., e^(−margin(x_i)), is introduced for the following three reasons: (1) Instead of the logarithmic function, which is not defined over the whole margin interval [−1, 1] and can take infinite values, the exponential function is utilized to depict the importance of different base classifiers. (2) The fourth term covers the interval [−1, 0], which means that very hard samples are considered as well. Moreover, samples misclassified by most base classifiers are given more attention. Therefore, the importance of the sample can be reflected precisely in each situation. (3) For samples that have more incorrect votes (smaller margin), the classifiers that can correctly classify them deserve higher marks, while for samples with more correct votes (larger margin), the correct classifiers should get lower marks. Therefore, by following the three rules above, the fourth term is monotonically decreasing, and sufficient diversity of the sub-ensemble can be achieved by a marking mechanism that distinguishes between different situations.
For the detection of serious diseases such as cancer, the rate of missed diagnosis should be reduced to increase the possibility of cure, which motivates the third term in the MSM. e^(y_i · NF_i^H), referred to as the bonus term, aims to lower the rate of false negative identification, and the definition of NF_i^H is shown in Equation (11). Only classifiers that can correctly detect lung cancer samples (positive cases, y_i = 1) are rewarded with bonus marks, namely e^(NF_i^H). In the case that more than half of the classifiers misclassify sample x_i and x_i happens to be a positive sample, the correct classifiers can get much higher marks. By introducing the bonus term, classifiers that can correctly identify lung cancer samples are more likely to be favored. By modifying the bonus term, MSM can be extended to multi-class problems, allowing the classifiers that successfully distinguish the most important categories to obtain additional marks. Therefore, the bonus term makes MSM better suited to increasing the sensitivity of the pruning model.
The first term, i.e., I(h(x_i) = y_i), determines that only the correct classifiers can earn marks. The second term, i.e., I(margin(x_i) > θ), called the threshold, is created specially to eliminate abnormal samples. In extreme cases, if all classifiers are wrong except one or two, then the sample is very likely to be abnormal and should be ignored. Therefore, θ takes values in [−1, 0] and serves as a parameter to reduce the adverse impact of outliers and elusive samples.
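Equation (10) is not reproduced in this text. Reading the four terms in the order in which they are described, and assuming that they are multiplied together and summed over the pruning set, MSM for a base classifier h could be written as below. The product form is an assumption, suggested by the bonus term reducing to 1 when y_i = 0. NF_i^H is taken here to be the fraction of the M classifiers in H that misclassify x_i, which is also an assumption about Equation (11).

```latex
\mathrm{MSM}(h) = \sum_{i=1}^{N_{Pr}}
  I\big(h(x_i) = y_i\big)\,
  I\big(\mathrm{margin}(x_i) > \theta\big)\,
  e^{\,y_i \cdot NF_i^{H}}\,
  e^{-\mathrm{margin}(x_i)},
\qquad
NF_i^{H} = \frac{1}{M} \sum_{j=1}^{M} I\big(h_j(x_i) \neq y_i\big)
```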
In contrast to SDAcc, MDEP, and UMEP, MSM aims to improve the classification sensitivity while maintaining high accuracy and specificity, and therefore can be widely used in the diagnosis of cancer and other serious diseases. Compared with SDAcc, MSM defines a more rational evaluation for different situations through margin theory, which can consider the difference and generalization ability simultaneously. The marking mechanism of SDAcc linearly depends on the voting results, i.e., NF_i^S and NT_i^S. However, the marking mechanism of MSM has varying slopes according to the importance of samples. Classifiers that can correctly predict samples with smaller margin have a greater tendency to obtain marks. Additionally, instead of abandoning hard samples as in MDEP and UMEP, MSM attempts to classify samples as correctly as possible, especially the difficult ones, and therefore can further boost the performance of the ensemble pruning. Furthermore, the proposed heuristic ensemble pruning measure based on margin theory is demonstrated and verified by reasonable and exhaustive experiments in Section 4.

Algorithm 1. MSEP (excerpt)
4:     Extract (x_i, y_i) ∈ D_Tr with replacement as E_Tr with size of 30% × N_Tr;
5:     Train h_j with E_Tr;
6: end for
7: // Pruning procedures
8: for each h_j ∈ H do
9:     MSM = 0;
10:    for each x_i ∈ D_Pr do
11:        margin(x_i); refer to Equation (2)
12:        if h_j(x_i) = y_i && margin(x_i) > θ then; refer to Equation (10)
13:            refer to Equation (11)

Margin and Sensitivity Based Ordering Ensemble Pruning
In this study, MSM is applied to create an ordering-based ensemble pruning model, i.e., MSEP. Sampling with replacement is employed to produce diversity among the sub-training datasets. Overproduced CARTs are trained as base classifiers by using the sub-training sets from the above sampling. Finally, the trained CARTs are ranked and selected by MSM from the original ensemble to form the sub-optimal ensemble. The algorithm of MSEP is provided in Algorithm 1, which can be implemented by the following five steps:
1. MSEP starts by generating M CARTs from sub-training sets E_Tr extracted with replacement. Each sub-training set is different and has a size of 30% of the total training set D_Tr, and therefore each well-trained CART is unique and diverse.
2. Classify each sample in the pruning set D_Pr by each well-trained CART and compute the margin value of each sample through Equation (2).
3. Only classifiers that properly predict the samples with margin larger than the threshold θ can get positive marks through Equation (10). Then, all M CARTs are sorted by their marks into an ordered sequence.
4. Select the first T ordered CARTs to compose a pruned ensemble S to achieve the best overall performance including accuracy, sensitivity, and specificity.
5. Evaluate the sub-ensemble S over the testing set D_Te by the required metrics.
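The ordering and selection steps can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the authors' implementation. A matrix of precomputed 0/1 predictions stands in for the trained CARTs. The score assumes a product form of MSM, in which each correctly classified sample with margin above θ contributes exp(y · NF) · exp(−margin). The value θ = −0.8 is arbitrary, and all function names are hypothetical.

```python
import math

def margins(preds, labels):
    """Voting margin of each sample: (correct - incorrect) / number of classifiers."""
    m = len(preds)
    out = []
    for i, y in enumerate(labels):
        correct = sum(1 for p in preds if p[i] == y)
        out.append((2 * correct - m) / m)
    return out

def msm_scores(preds, labels, theta=-0.8):
    """Score every classifier with the assumed product form of MSM."""
    marg = margins(preds, labels)
    scores = []
    for p in preds:
        score = 0.0
        for i, y in enumerate(labels):
            if p[i] == y and marg[i] > theta:
                nf = (1 - marg[i]) / 2  # fraction of classifiers that are wrong on x_i
                score += math.exp(y * nf) * math.exp(-marg[i])
        scores.append(score)
    return scores

def msep_select(preds, labels, t, theta=-0.8):
    """Return the indices of the t highest-scoring classifiers, best first."""
    scores = msm_scores(preds, labels, theta)
    order = sorted(range(len(preds)), key=lambda j: scores[j], reverse=True)
    return order[:t]

def majority_vote(preds, selected, i):
    """Predict sample i by simple voting among the selected classifiers."""
    positive_votes = sum(preds[j][i] for j in selected)
    return 1 if 2 * positive_votes > len(selected) else 0

# Toy pruning set: 4 samples (1 = lung cancer, 0 = healthy), 5 classifiers
labels = [1, 1, 0, 0]
preds = [
    [1, 1, 0, 0],  # h0: correct on every sample
    [0, 1, 0, 0],  # h1: misses one positive sample
    [0, 0, 0, 1],  # h2: correct on one healthy sample only
    [1, 0, 1, 0],  # h3: one positive and one healthy sample correct
    [0, 0, 1, 1],  # h4: wrong on every sample
]
selected = msep_select(preds, labels, t=3)
ensemble_pred = [majority_vote(preds, selected, i) for i in range(len(labels))]
print(selected, ensemble_pred)
```

In this toy case the two positive samples are misclassified by most classifiers, so classifiers that get them right gain the largest marks. Using an odd t avoids ties in the vote.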

Evaluation Metrics
The accuracy (Acc) of the classification is the proportion of correctly classified samples to the total number of samples. The classification accuracy defined in Equation (12) measures the overall classification results. TN is the number of healthy samples correctly predicted as healthy; FN is the number of lung cancer samples incorrectly predicted as healthy; TP is the number of lung cancer samples correctly predicted as lung cancer; and FP is the number of healthy samples incorrectly predicted as lung cancer.
Sensitivity (Sen), defined in Equation (13), measures the proportion of real lung cancer patients who are correctly classified. In contrast, specificity (Spe), defined in Equation (14), measures the proportion of real healthy people who are correctly predicted. High sensitivity indicates a low rate of missed diagnosis, i.e., few lung cancer patients are classified as healthy individuals, which is particularly vital for lung cancer detection. High specificity indicates a low rate of misdiagnosis, i.e., few healthy individuals are deemed lung cancer patients.
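The three metrics can be computed from the confusion counts as sketched below, using their standard definitions: Acc = (TP + TN) / (TP + TN + FP + FN), Sen = TP / (TP + FN), and Spe = TN / (TN + FP). The label encoding (1 = lung cancer, 0 = healthy) follows the notation used earlier. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with 1 = lung cancer and 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def acc_sen_spe(tp, tn, fp, fn):
    """Return accuracy, sensitivity, and specificity."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)   # share of lung cancer samples that are detected
    spe = tn / (tn + fp)   # share of healthy samples that are cleared
    return acc, sen, spe

# Toy example: 4 lung cancer samples and 6 healthy samples (not real data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc, sen, spe = acc_sen_spe(*counts)
print(counts, acc, sen, spe)
```

One missed patient out of four lowers sensitivity to 0.75 even though accuracy stays at 0.8, which is why sensitivity is reported separately.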

Experimental Methodology
In this study, the main purpose was to verify the proposed framework and explore the effect of different pruning measures on the Acc, Sen, Spe, and area under the curve (AUC) in lung cancer classification, especially on the Sen. Since the size of an ensemble should be an odd number in binary classification to avoid ties in which both classes receive equal votes, the size of the original ensemble, i.e., M, was set to 101 and the size of the pruned ensemble, i.e., T, was set to 11 (about 10% of the original ensemble size). In the experiment, we divided the dataset in a 7:2:1 ratio for training, pruning, and testing. All experimental results were obtained by 50-fold cross-validation. The program was implemented in Python 3.6.5 with Keras 2.2.4 on the Windows 10 operating system with an Intel(R) Core(TM) (Palo Alto, CA, USA) i7-7700HQ CPU @ 2.80 GHz and 8 GB RAM.
Comparative experiments were conducted on three feature extraction methods combined with seven different classification models. In the field of e-nose systems for lung cancer screening, deep learning methods have rarely been applied to extract features. Principal component analysis (PCA) [57] and kernel principal component analysis (KPCA) [58] are the most commonly used feature extraction methods in this field. Therefore, these two dimensionality reduction methods were adopted for comparison with the GRU-AE introduced in this study. As for classification, seven different models, i.e., MSEP, MEP, MDEP, UMEP, SDAcc, complete ensemble, and adaboost [59], were tested and compared. MEP is a variant of the proposed method, namely MSEP without the bonus term. Complete ensemble and adaboost are two widely used and successful models in machine learning. A grid search combined with cross-validation was employed to optimize the parameters of each method over a given parameter grid, which is shown in Table A1. The result tables in the following sections report the results of the different methods under their optimal parameters. Additionally, besides the collected dataset, three open source datasets were used to further validate the proposed framework.
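The 7:2:1 partition can be sketched as a shuffled index split. This is a minimal sketch under stated assumptions: the text does not say how fractional sizes were rounded or how the split interacts with cross-validation, so the integer rounding below and the helper name `split_indices` are choices made only for this example.

```python
import random
from typing import List, Tuple


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them 7:2:1 (train, prune, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 7 // 10  # assumed rounding: floor
    n_prune = n * 2 // 10  # assumed rounding: floor
    train = idx[:n_train]
    prune = idx[n_train:n_train + n_prune]
    test = idx[n_train + n_prune:]  # remainder goes to the testing set
    return train, prune, test


# 214 is the number of breath samples reported for the binary experiment.
train, prune, test = split_indices(214)
```

Under this rounding, 214 samples give 149 training, 42 pruning, and 23 testing samples; a different rounding rule would shift these counts by one or two.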

Experiments on the Lung Cancer Dataset
Firstly, binary classification experiments were carried out on the collected samples, and the proposed method was compared with other frameworks to test its performance. This part was to verify that the portable e-nose combined with the proposed framework can properly and effectively differentiate between lung cancer and healthy controls. In the second part, we investigated additional categories: (i) clinical stages; (ii) lung cancer versus chronic obstructive pulmonary disease (COPD); and (iii) smoking history by GRU-AE-MSEP to make this study more exhaustive.

Lung Cancer versus Healthy Controls
In the experiment on the collected data for binary classification, a total of 214 samples comprising 98 lung cancer patients and 116 healthy controls were utilized. Seven classification models combined with three different dimensionality reduction methods were evaluated and documented. The mean values and standard deviation (std) of all the metrics obtained by 50-fold cross-validation are shown in Table 3. Figure 3 presents the comparison between different frameworks. The extensive search process of θ in MSEP is shown in Table A2.
The results show that, among all the methods, the proposed GRU-AE-MSEP framework achieved the three highest metrics, i.e., Acc of 93.55%, Sen of 94.22%, and AUC of 0.92. On the original high-dimensional dataset, MSEP achieved the highest Acc, Sen, and AUC, while complete ensemble obtained the highest Spe. As for data reduced by PCA, MSEP obtained the highest Acc and Sen. The highest Spe was obtained by MEP, and adaboost achieved the highest AUC with the largest std. On the dataset after feature extraction based on KPCA and GRU-AE, MSEP achieved the highest Acc, Sen, and AUC with small std, and MEP demonstrated the highest Spe.
To analyze the effectiveness of each feature extraction method, we compared those methods under the same classification models. In the experimental results, classification models based on GRU-AE consistently demonstrated better performance than those based on PCA and KPCA. Figure 3a-d exhibits a stable ascending trend of the classification performance based on GRU-AE in every metric. In Figure 3b,c, PCA and KPCA are substantially unstable since they fail to improve the metric of all models simultaneously and the improvements are relatively small or even negative.
As for the analysis of different classification models, each of them was compared under the same feature extraction methods, as shown in Figure 3e,f. In fact, when comparing two or more algorithms, a more reasonable way is to compare ranks or average ranks of different models on the same dataset [47]. Therefore, we defined a scoring rule, where the model with the highest metric gets seven points (there are seven classification models in total), the second ranked model gets six points, and so on. The scores in Figure 3e refer to the average scores on four metrics (Acc, Sen, Spe, and AUC) in four groups (Original, PCA, KPCA, and GRU-AE). For instance, MSEP combined with GRU-AE ranked first in Acc (93.55%), Sen (94.22%), and AUC (0.92) and second in Spe (92.80%). Therefore, MSEP obtained seven, seven, six, and seven points for Acc, Sen, Spe, and AUC, respectively, and the average score of MSEP based on GRU-AE was 6.75. Through our scoring rule, Figure 3e directly and clearly shows the overall performance of each framework. According to the results, the proposed MSEP achieved the best overall performance among the four groups. MEP and MSEP tied for first place on the dataset based on PCA, and MEP ranked second in the other three groups. The overall performance of UMEP, SDAcc, complete ensemble, and adaboost fluctuated greatly in different groups. Figure 3f demonstrates the scores of each model on sensitivity in the four groups. The results illustrated that MSEP consistently achieved the highest sensitivity in different groups, while MEP had the second-best sensitivity. The sensitivity performance of other methods varied greatly among different groups.
To verify whether the e-nose system could recognize the staging of lung cancer, the proposed framework was tested with samples at different clinical stages. In this study, a total of 98 lung cancer samples (2 stage I, 6 stage II, 44 stage III, and 46 stage IV) were collected. Since there were only two samples for stage I, three sets of samples were employed (6 stage II, 44 stage III, and 46 stage IV) during the experiment. The results are shown in Table 4.
Lung cancer and COPD were also studied to make this research more convincing and complete. The COPD patients were in-patients of Chongqing Red Cross Hospital, none of whom had lung cancer or suspected lung cancer. To avoid the influence of smoking factors on the experiment, all COPD patients were non-smokers. In total, 96 samples were selected, including 35 healthy non-smokers (with no lung diseases), 33 lung cancer patients (with no other lung diseases), and 28 COPD patients. The results are shown in Table 5.
In addition, a preliminary study was conducted on high-risk groups of lung cancer. In this experiment, healthy people who had been smoking for 30 years or more (1 pack or more per day) were selected as subjects. We excluded samples of long-term smokers with interfering factors such as lung diseases. Finally, 95 samples were selected, including 30 lung cancer patients (15 smokers and 15 non-smokers), 30 healthy long-term smokers, and 35 healthy non-smokers (with no smoking history). The results are shown in Table 6.
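The rank-based scoring rule used for Figure 3e,f can be illustrated with a short sketch. It assumes that ties are broken arbitrarily by input order, since the text does not describe tie handling; the function names `rank_points` and `average_score` and the metric values in the usage example are hypothetical.

```python
from typing import Dict, Sequence


def rank_points(metric_by_model: Dict[str, float]) -> Dict[str, int]:
    """Give n points to the best model, n - 1 to the second best, and so on."""
    ordered = sorted(metric_by_model, key=metric_by_model.get, reverse=True)
    n = len(ordered)
    return {model: n - position for position, model in enumerate(ordered)}


def average_score(points: Sequence[int]) -> float:
    """Average the points a model earned across several metrics."""
    return sum(points) / len(points)


# Hypothetical sensitivity values for three placeholder models.
points = rank_points({"model_a": 0.90, "model_b": 0.95, "model_c": 0.80})

# Worked example from the text: points of 7, 7, 6, and 7 average to 6.75.
msep_gru_ae_score = average_score([7, 7, 6, 7])
```

With seven models, the same function assigns seven points to the top-ranked model, matching the rule described above.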

Experiment on Validation Datasets
In general, different datasets of sufficient size are required to test a new framework, and only convincing results can demonstrate its stability and generalization ability. However, due to the difficulty of the acquisition process, VOC datasets are relatively small, as shown in Table 1. Worse still, in terms of disease detection, there are few publicly available e-nose datasets, let alone e-nose data for lung cancer. Since e-nose responses are high-dimensional temporal data collected by a chemical sensor array, we employed three related open source datasets of considerable size, i.e., the Diabetes dataset [60], the gas sensors for home activity monitoring dataset (GSHAM dataset) [61], and the gas sensor array drift dataset at different concentrations (GSAD dataset) [62,63]. This approach of verifying a proposed model on other datasets has been applied in similar studies (e.g., [64,65]).

Description of Validation Datasets
Human urinary VOCs are used to diagnose diabetes in the Diabetes dataset [60]. High-dimensional time series data of VOCs in human urine were collected by field asymmetric ion mobility spectrometry. The dataset contains the urinary VOCs from two groups of people, including 72 patients with type II diabetes (set as positive samples) and 43 healthy volunteers (set as negative samples). The GSHAM dataset contains high-dimensional time series data collected by eight gas sensors [61]. The sensors detected different objects by reacting with volatile gases and generating signals, which is an essential part of an e-nose detection system. There were 33 samples of banana, 36 samples of wine, and 31 samples of the blank control group. In the binary classification experiment, we employed the two larger classes, i.e., banana (positive samples) and wine (negative samples). The GSAD dataset was collected by 16 chemical sensors that reacted with pure gaseous substances [62,63]. There are 183 samples of ethanol, 209 samples of ethylene, 115 samples of ammonia, 138 samples of acetaldehyde, 214 samples of acetone, and 130 samples of toluene. Likewise, two pairs of classes were selected for binary classification, i.e., acetone (positive samples) versus ethylene (negative samples), and acetaldehyde (positive samples) versus toluene (negative samples).

Results on Validation Datasets
The results show that, among all the frameworks, the proposed GRU-AE-MSEP achieved the highest Acc of 82.06%, Sen of 85.83%, and AUC of 0.73 on the Diabetes dataset, and the highest Acc of 89.71%, Sen of 92.87%, and AUC of 0.77 on the GSHAM dataset. Since the GSAD dataset did not contain time series data, GRU-AE was not applied. MSEP obtained the highest Sen of 98.79% and AUC of 0.98 on the Acetone and Ethylene category. Meanwhile, MSEP achieved the highest Acc of 98.96% and Sen of 98.74% on the Acetaldehyde and Toluene category. Furthermore, the proposed framework was stable and achieved a relatively small std compared with other methods. Detailed results are shown in Table A3 for the Diabetes dataset, Table A4 for the GSHAM dataset, and Table A5 for the GSAD dataset.
The overall evaluation and sensitivity performance of each framework are shown in Figure 4. As shown in Figure 4b,d,f, the proposed MSEP achieved the best sensitivity performance in all the groups. As for average scores, MSEP ranked first in all situations except the original group in Figure 4a and the Acetone and Ethylene category in Figure 4e. In these two situations, MEP obtained better average scores than MSEP, but its sensitivity scores were lower than those of MSEP. The overall performance and sensitivity of UMEP, MDEP, SDAcc, complete ensemble, and adaboost fluctuated greatly in different situations.

Discussion
This paper presents a novel and reliable GRU-AE-MSEP framework for non-invasive lung cancer detection by the e-nose system. The proposed framework especially contributes to enhancing sensitivity and reducing missed diagnosis rate. The proposed framework was compared with the widely adopted feature extraction methods and existing ordering ensemble pruning techniques. Meanwhile, elaborate ablation experiments based on MSEP and MEP were carried out, which aimed to explore the role of the bonus term in improving sensitivity. To confirm the effectiveness of the proposed framework, all methods were examined under a set of standard metrics, i.e., Acc, Sen, Spe, and AUC. Moreover, all listed methods were experimented on the same dataset collected from patients with different kinds of lung cancer and diverse healthy controls. To further verify the portability of the proposed framework to other signal data, three open source datasets were tested based on the above metrics.
In the experiments presented in Section 4.3.1, GRU-AE-MSEP performed best by comparing different feature extractors and classifiers on the collected lung cancer dataset, and the sensitivity achieved by the proposed framework was high and stable. Additionally, the proposed framework had effective classification performance on distinguishing between clinical stages, lung diseases and smoking status.
In the experiments presented in Section 4.4, GRU-AE-MSEP was further validated on three open source datasets to test its portability and it outperformed other methods as well.
Dimensionality reduction methods are essential in the analysis of sensor signals, and the extracted principal features perform as a prerequisite for subsequent classification. From the experimental results, metrics of classifiers varied unstably based on PCA and KPCA, while the application of GRU-AE generally improved the performance of classifiers. Since PCA only extracts linear features and cannot deal with nonlinear information, PCA-based frameworks were inferior to those based on original data in several situations. Compared with the original data, the features extracted by KPCA improved the performance of classifiers slightly but were far less effective than the features extracted by GRU-AE. Since conventional feature extraction methods are hand-crafted and require heavy computation as well as domain knowledge, it is hard to judge the impact of the feature extraction process on the final classification results. Moreover, the signal data from e-nose were rather complex, which consisted of linear, nonlinear, and redundant information. As a method based on deep learning training, GRU-AE can process high-dimensional nonlinear data by virtue of automatic feature extraction, especially to process temporal data, which was further verified in Section 4.4.
Ensemble learning is popular in enhancing performance, while ensemble pruning models are developed as efficient improvement techniques by reducing redundant costs in the complete ensemble. Among 56 situations in four datasets, complete ensemble only achieved two highest values in total, i.e., the highest specificity in original lung cancer dataset and in Acetaldehyde and Toluene dataset. It indicated that there existed classifiers with little or negative contribution to the complete ensemble. UMEP, MDEP, and SDAcc are three existing ensemble pruning methods and have been proved to be effective in their original papers. Compared with the complete ensemble and adaboost, experimental results indicate that pruning models were better in the evaluation of overall performance and sensitivity. Therefore, it is reasonable to aggregate classifiers with better performance, and pruning techniques are deemed to be effective for lung cancer non-invasive detection as the results verified.
However, sensitivity and specificity formed a trade-off when the accuracy was stable and high enough. The aim of calculating average scores was to ensure that the overall performance was not sacrificed as the sensitivity improved. Among the seven classification models, the proposed MSEP exhibited the most robust performance, which was consistent with the theoretical analysis in Section 3.2.3. By giving up hard samples, UMEP and MDEP were capable of increasing accuracy, but their room for improvement was also limited by those hard samples, leading to mediocre performance. The vague and overlapped marking mechanism of SDAcc resulted in fluctuating and unstable ranks in both average scores and sensitivity scores. By adjusting the threshold term, i.e., θ, MSEP can determine what proportion of hard samples is retained. Instead of abandoning all difficult samples as UMEP and MDEP do, the proposed method achieved strong results by taking them into account. Since frameworks based on MSEP achieved the highest average scores and sensitivity scores in every group in the lung cancer dataset, MSEP improved not only the sensitivity but also the other three metrics. Therefore, the proposed MSEP can achieve as high a sensitivity as possible while ensuring excellent overall performance. In most situations, MEP ranked second only to MSEP in average scores but had unstable ranks in sensitivity on the Diabetes and GSHAM datasets, which illustrated the capability of the proposed margin-based method in improving classification performance and the effectiveness of the bonus term in sensitivity enhancement.
To make the experiments more exhaustive, we investigated as many categories as possible in the experiments presented in Section 4.3.2. For the detection of different clinical stages, stage II had the highest accuracy and sensitivity, which could suggest the valuable prospect of the proposed system for early lung cancer diagnosis. In distinguishing COPD from lung cancer, the results were competitive and may point to a further application area. Never-smokers and long-term smokers were distinguished from lung cancer patients with high accuracy and sensitivity, which may indicate that smoking strongly influences VOC alteration in human breath.
When evaluated on the three open source datasets, the proposed framework performed well too. MSEP ranked first in every group in terms of sensitivity, and the proposed GRU-AE-MSEP framework obtained the highest accuracy and sensitivity on the Diabetes and GSHAM datasets, where GRU-AE was applicable, which supports the portability and robustness of the framework. Classification is one of the most popular topics in bioinformatics and disease detection. It is reasonable that one classifier cannot always achieve both the highest sensitivity and the highest specificity at a given accuracy, but sensitivity is what we valued most. Our practical and transferable framework demonstrated the ability to improve classification sensitivity in various scenarios.
In the literature, many studies have focused on the detection of lung cancer based on e-nose system, as illustrated in Table 7. The feasibility and effectiveness of the machine learning classifiers were demonstrated on small datasets [66][67][68]. With the development of deep learning methods, the neural network was used by van de Goor [69] and Chang [11]. By contrast, this study provides a new perspective for the non-destructive screening of lung cancer, aiming to design an algorithm to improve the detection sensitivity. In addition to the innovation of feature extraction and classification methods, compared with other studies, the proposed GRU-AE-MSEP framework based on a larger sample size demonstrated superior overall performance and higher sensitivity.
Although the proposed GRU-AE-MSEP framework performed well, there is still room for improvement. First, the size of the dataset was still limited. To achieve expert-level diagnostic detection, the framework requires more abundant and diverse data. Second, the study of clinical stages, lung diseases, and smoking status is worth delving into in the future. In future research, these limitations in automatic detection of lung cancer could be addressed by multi-class classification training on much larger datasets collected from different types of machines.

Conclusions
In this paper, a novel intelligent lung cancer diagnosis framework, GRU-AE-MSEP, is presented. In the process of feature extraction, GRU-AE was introduced to effectively extract principal features from temporal and high-dimensional e-nose signal data. Meanwhile, in the classification process, a heuristic ensemble pruning model was proposed, which enhanced the classification sensitivity while maintaining the overall identification performance. A CAD system based on GRU-AE-MSEP is conducive to reducing missed diagnosis of lung cancer and improving the survival rate through timely treatment. In the experiment on the collected data, comparative and ablation experiments were conducted under a set of standard metrics to confirm the effectiveness of the proposed framework in lung cancer detection. Additionally, the detection of different stages, diseases, and smoking statuses was implemented to explore the medical application prospects of the proposed framework. Furthermore, three open source datasets were tested, which extended the applicable scenarios and further supported the robustness and adaptability of the framework. Compared with five state-of-the-art classification models and two popular dimensionality reduction methods, the proposed framework achieved superior overall performance with particularly high sensitivity. Therefore, this research can serve as an important step in exploring the use of deep learning methods for feature extraction, as well as the use of ensemble pruning techniques for classification, in lung cancer diagnosis and other medical detection fields.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, writing-original draft preparation, writing-review and editing, and visualization, B.L. and L.F.; validation, investigation, and writing-review and editing, B.N.; validation, formal analysis, and writing-review and editing, Z.P.; and conceptualization, resources, funding acquisition, and writing-review and editing, H.L.
Funding: This research was funded by the National Natural Science Foundation of China (grant number 81671850).

Acknowledgments:
The authors wish to thank Ke Chen, Ziru Jia, Xitian Pi, and Zichun He for their work of data curation and all the volunteers for their participation in this study.

Conflicts of Interest:
The authors declare no conflict of interest.