Globally, lung cancer is the leading cause of cancer-related death in men and the second-leading cause in women. In 2018, an estimated 1.8 million lung cancer deaths occurred, with 1.2 million in men and over 576,000 in women, accounting for 1 in 5 cancer-related deaths worldwide [1]. Advances in precision medicine and genomic analyses have resulted in a paradigm shift whereby lung tumors are characterized and classified by biomarkers and genetic alterations (e.g., gene expression, mutations, amplifications, and rearrangements) that are critical to tumor growth and can be exploited with specific targeted agents or immune checkpoint inhibitors. However, tissue-based biomarkers have several limitations: they are subject to sampling bias due to the heterogeneous nature of tumors, they require tumor specimens for testing, and the assays can be slow and expensive [2]. As such, high-throughput and minimally invasive methods that can improve current precision medicine are a critical need.
Liquid biopsy is a non-invasive alternative for detecting EGFR and KRAS mutations. The use of surrogate sources of DNA, such as blood, serum, and plasma samples, which often contain circulating free tumor DNA (cftDNA) or circulating tumor cells (CTCs), is emerging as a new strategy for tumor genotyping [3]. However, this technique is relatively recent and still has some disadvantages. Different studies have shown that the amount of cftDNA is correlated with disease stage, which may make it difficult to detect in early stages of cancer. Moreover, non-tumor cfDNA might derive from different processes, including necrosis of normal tissues surrounding the tumor cells or lysis of leukocytes after blood collection, which may make mutations difficult to detect. Even though recent versions of liquid biopsy techniques have been approved for clinical use, the sensitivity (or true positive rate) of this test is still a weak point [3]. All these concerns leave room for other non-invasive techniques that may be more effective in early stages of cancer and may provide higher sensitivity.
Quantitative image features, or radiomics, have the potential to complement and improve current precision medicine. Radiomic features are non-invasive, are extracted from standard-of-care images, and do not require time-consuming and often expensive laboratory testing. Additionally, radiomic features are not subject to sampling bias, since the entire tumor is analyzed in 3D and the features represent the phenotype of the whole tumor rather than just the portion that was subjected to biomarker testing. They can also be applied at all stages of cancer.
Radiogenomics is an emerging and important field because it utilizes radiomics to predict genetic mutations, gene expression, and protein expression [4]. In lung cancer, there has been particular interest in predicting EGFR and KRAS mutations [5]. Epidermal Growth Factor Receptor (EGFR) is a protein on the surface of cells that regulates signaling pathways to control cellular proliferation. According to Bethune et al. ([6], p. 1), “Overexpression of EGFR has been reported and implicated in the pathogenesis of many human malignancies, including Non-Small Cell Lung Cancer (NSCLC). Some studies have shown that EGFR expression in NSCLC is associated with reduced survival, frequent lymph node metastasis and poor chemosensitivity”. Lung adenocarcinomas with mutated EGFR have a significant response to tyrosine kinase inhibitors [6], which makes the detection of this mutation important in determining patient treatment. Kirsten Rat Sarcoma viral oncogene (KRAS) is also a well-known tumor driver. Mutations of this gene have proven to be a useful biomarker to predict resistance to EGFR-based therapeutics [7]. Furthermore, some studies have shown that KRAS can be targeted, with promising results in phase III trials in NSCLC [8].
Other authors have previously tried to predict EGFR and KRAS mutations in Non-Small Cell Lung Cancer (NSCLC) from image features. In the work presented by Gevaert et al. [5], the authors attempted to predict these mutations from semantic image features provided by radiologists. A predictive model for the EGFR mutation was proposed that achieved an AUC of 0.89; however, conclusive results for the KRAS mutation were not obtained. Pinheiro et al. [10] also found a correlation between imaging features and mutation status for the EGFR mutation (AUC of 0.745) but could not find the same for the KRAS mutation. In contrast, Wang et al. [11] utilized semantic features to predict EGFR and KRAS mutations and found a significant correlation between EGFR and KRAS mutations and lesions with a low ground glass opacity (GGO). In particular, the authors found that L858R point mutations, exon 19 deletions, and KRAS mutations were more common in lesions with a lower GGO proportion (p = 0.029, 0.027 and 0.018, respectively). Mei et al. [12] utilized texture features to predict mutations in EGFR at exon 19 and exon 21. The authors reported an AUC of 0.66 for predicting EGFR exon 21 mutation using a model that included sex, non-smoking status, and the Size Zone Non-Uniformity Normalized radiomic feature. Shiri et al. [13] created machine learning models from PET and CT image features to predict both EGFR and KRAS mutations. These authors obtained an AUC of 0.75 for both mutations in CT images by applying a combination of K-Best and a variance threshold feature selector with logistic regression. Incorporating PET kept AUC values around 0.74. Koyasu et al. [14] also utilized features from PET and CT. These authors applied Random Forest and Gradient Tree Boosting to predict EGFR mutation, and obtained an AUC of 0.659 with the latter algorithm and seven types of imaging features. Liu et al. [15] utilized radiomic features and clinical data to predict EGFR status and found an AUC of 0.647 with a model based on five radiomic features, which improved to 0.709 by combining radiomic features and clinical data. Deep learning has recently been applied in the diagnosis of different types of cancer [16], and other authors such as Wang et al. [17] have applied these techniques to mutation prediction. These authors applied deep learning to the prediction of EGFR mutational status by training on 14,926 CT images and obtained an AUC of 0.81 on an independent validation cohort. Other recent studies have applied clinical nomograms to predict EGFR mutation status. In the work presented by Zhang et al. [18], the authors combined CT features and clinical risk factors and used them to build a prediction nomogram. They obtained an AUC of 0.74 on the validation cohort.
Separately, previous studies have demonstrated that applying ensembles to predictive models tends to improve the performance of predictions [19]. An ensemble model is created by generating multiple models and combining them to produce an output classification. To combine the different models, a voting process is performed among them to determine the final result. There are different types of voting. In average voting, the probabilities for each class are averaged over all the models, and the classification is based on the average probability. In maximum probability voting, the base model with the highest pseudo-probability is selected for each case, and the case is classified based on the pseudo-probabilities of this classifier alone.
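To make the two standard voting rules concrete, the following Python sketch applies them to the positive-class pseudo-probabilities that several base models assign to a single case. It is only an illustration: the function names are hypothetical, and measuring a model's confidence in a binary problem as max(p, 1 - p) is an assumption about how "highest pseudo-probability" is read.

```python
def average_vote(probs):
    """Average the positive-class probabilities from all base models."""
    return sum(probs) / len(probs)


def maximum_vote(probs):
    """Return the positive-class probability of the most confident model.

    Assumption: in a binary problem, a model's confidence is max(p, 1 - p).
    """
    return max(probs, key=lambda p: max(p, 1.0 - p))


def to_label(score, cutoff=0.5):
    """Map a pseudo-probability to a class label (1 = positive)."""
    return 1 if score >= cutoff else 0


# Hypothetical outputs of three base models for one case.
case = [0.05, 0.60, 0.90]
avg_label = to_label(average_vote(case))  # decided by the mean of all three
max_label = to_label(maximum_vote(case))  # decided by the p = 0.05 model alone
```

In this made-up case the two rules disagree, which shows how a single very confident model can dominate maximum voting.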
In this paper, a novel voting scheme for ensembles of machine learning or deep learning models is proposed, and its effectiveness in predicting EGFR and KRAS mutations in CT images taken from the TCIA NSCLC Radiogenomics dataset [20] is shown to be state of the art. Two experiments were performed: first, prediction with radiomic features and machine learning models, and second, prediction through Convolutional Neural Networks (CNNs). In both cases, base models are tested first, and then an ensemble of the best models with the new voting scheme is applied to observe whether prediction performance improves. Our approach shows that performance can be improved by this scheme and that good results are possible even with a small dataset where only a few cases present mutations. With more data becoming available in the future, it is expected that this type of approach will add to the tools available to clinicians.
2. Materials and Methods
For this study, a cohort of 99 patients was obtained from the TCIA [20]. Their data included CT images with tumor segmentation, genomic data (KRAS and EGFR mutational status), and clinical data (age, sex, smoking status, pathological T stage, pathological N stage, pathological M stage, and histology type). Details of the cohort and corresponding data are published in a previous study [5]. Patients with unknown mutational status were excluded, which left 83 patients for the analysis. The list of the exact cases that were used in the study can be found in the Supplementary Material (Table S7 Features Transpose EGFR, Table S8 Features Transpose KRAS). This type of curated data is difficult to obtain. This set, while small, allows for comparisons. The summary of the study cohort is presented in Table 1. Table 2 summarizes the clinical features of the study cohort.
For the EGFR mutation case, there is not a significant difference observed between the mutant and wildtype statuses in terms of age. In terms of gender, for the mutant status there seems to be a more balanced distribution between the genders, while the wildtype status seems to be significantly more common among men. In terms of smoking history, the EGFR mutant status seems to be found more often among former smokers, and non-smokers in second place, while the wildtype status seems to be more common among former and current smokers. There is no significant difference between the groups in terms of T cancer stage, although wildtype status seems to be more common for patients with stage T1a. Cases with stages N1 and N2 seem to more frequently present wildtype status, as well as patients with M1b stage. In terms of histology type, none of the Squamous Cell Carcinoma patients present the EGFR mutation; this is only present in Adenocarcinoma cases.
For the KRAS mutation, there are no significant differences in terms of age and gender between the mutant and wildtype cases. In terms of smoking history, it can be observed that none of the non-smokers presented the KRAS mutation. For the pathological stage, it seems that most patients with stage N1 and N2 are wildtype cases. Moreover, as seen with the EGFR mutation, mutant status is only found in Adenocarcinoma. For more information about the distribution of the clinical variables in the Train and Test datasets, please refer to the Supplementary Material (Table S1. Clinical Variables Training Dataset, Table S2. Clinical Variables Test Dataset).
With this dataset, two experiments were conducted: first, with traditional radiomic features and machine learning models, and second, with Convolutional Neural Networks (CNNs). Both experiments consisted of a base classifier performance assessment, after which ensembles of several models were tested with three types of voting: average, maximum, and the method proposed here, Selective Class Average Voting (SCAV). SCAV is a voting technique that is particularly useful when dealing with an unbalanced dataset, where one class (the majority class) is much more frequent than the other (the minority class). In SCAV, we first count how many models predicted the minority class (in our case, the mutant status). If this count is above a threshold value, the final outcome is the minority class, and the pseudo-probability of the case is computed by averaging the scores of all the models whose result was the minority class. If the count is below the chosen threshold, the final outcome is the majority class (in this case, the wildtype status), and the class pseudo-probability is computed by averaging the probabilities of all the models whose result was the majority class. Once the probabilities are averaged in this way, a threshold of 0.5 is applied to the final score to determine whether the sample belongs to the minority (mutant) or to the majority (wildtype) class. To select the best threshold for SCAV, that is, the number of models that must vote for the minority class, the performance of the ensemble on the Training set was assessed, and the threshold that gave the highest AUC on this data was selected and applied to the Test data. Figure 1 describes the algorithm used by SCAV.
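The voting rule described above can be sketched in a few lines of Python. This is a minimal reading of the text, not the authors' implementation: the names `scav` and `min_votes` are hypothetical, scores are taken to be pseudo-probabilities of the minority (mutant) class, and treating a vote count exactly equal to the threshold as a minority outcome is an assumption.

```python
def scav(probs, min_votes, cutoff=0.5):
    """Selective Class Average Voting for one case (illustrative sketch).

    probs     : minority-class pseudo-probabilities, one per base model
    min_votes : number of minority votes needed to select the minority class
                (assumption: a count equal to min_votes is enough)
    Returns (label, score), where label 1 is the minority class.
    """
    if not 1 <= min_votes <= len(probs):
        raise ValueError("min_votes must be between 1 and the number of models")

    # Split the base models by the class each one predicts on its own.
    minority = [p for p in probs if p >= cutoff]
    majority = [p for p in probs if p < cutoff]

    if len(minority) >= min_votes:
        # Enough minority votes: average only the models that voted minority.
        score = sum(minority) / len(minority)
    else:
        # Otherwise average only the models that voted majority.
        score = sum(majority) / len(majority)

    label = 1 if score >= cutoff else 0
    return label, score


# Hypothetical scores from five base models for one patient.
votes = [0.9, 0.8, 0.1, 0.2, 0.3]
label_2, score_2 = scav(votes, min_votes=2)  # two minority votes suffice
label_3, score_3 = scav(votes, min_votes=3)  # three minority votes required
```

With these made-up scores, plain average voting would give 0.46 and select the majority class, whereas SCAV with a threshold of two votes selects the minority class. The example shows how the rule can raise sensitivity on an unbalanced dataset.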
The use of ensembles increases the probability of obtaining better results, since several diverse models are used as inputs and their individual errors tend to be outvoted by the full set of classifiers. This tends to lower the generalization error. However, there are some disadvantages to this approach. First, it consumes more time, because several models have to be trained before an ensemble can be attempted. Second, it requires more computing power and resources, since several classifiers are running at the same time. This last point limits how many models in total can be used in the ensemble.
2.1. Experiment 1: Radiomic Features and Machine Learning Classifiers
Quantitative image features (N = 266) presented in [22], comprising texture and non-texture features, were extracted from the segmented 3D regions. These features were computed using the segmented volumes publicly available in the NSCLC Radiogenomics dataset. Non-texture features include tumor size, tumor shape, and tumor location categories, and texture features include pixel histogram, run length, co-occurrence, Laws, and Wavelet features. To extract these features, Definiens Developer XD© (Munich, Germany) was used [23]. Definiens is based on the Cognition Network Technology, which allows the development and execution of image analysis applications. Here, the Lung Tumor Analysis application was used. Most of the features were implemented within the Definiens platform, whereas some were computed with a C/C++ implementation of the algorithms developed in a previous work by some of the authors of this paper [22].
For stage 1, the following experimental workflow was applied to predict mutation status from image features. First, the data was divided into Train and Test sets as part of a 10-fold cross validation. Second, on the Training set, feature selection was applied to select the image features with the most predictive power. Third, the SMOTE algorithm [24] was applied to balance the number of examples in each class of the dataset. Fourth, a classifier was trained with the previously selected features as inputs on the balanced Training data. Finally, the resulting model was applied to the Test set. Figure 2 summarizes the presented workflow.
Feature selection was used to determine the image features with the most predictive power. Sets of 5, 10, 15, and 20 features were tested. Two feature selection methods were applied separately: the Mann–Whitney test [25] and ReliefF [26]. Since this is an unbalanced dataset (one class is much more abundant than the other), an optional application of the SMOTE algorithm [24] was performed to create synthetic samples of the minority class. The SMOTE algorithm was applied with the default settings. These settings make the dataset approximately balanced by class.
Finally, a classifier was trained with the selected features. Four machine learning classifiers were used: Random Forests [27], Support Vector Machines [28], Stochastic Gradient Boosting [29], and Neural Networks [30]. For every experiment, standard metrics were computed, including accuracy, sensitivity, and specificity (taking the mutant status as the positive class), and the Area Under the ROC Curve (AUC) [31]. The workflow was applied in a ten-fold cross-validation scheme, where iteratively nine folds were used to select features and train the classifier, and the left-out fold was used for testing the model.
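For reference, the metrics named above can be computed as in the following sketch, with the mutant status coded as 1. The function names are hypothetical, both classes are assumed to be present, and the AUC is computed through its rank interpretation (the fraction of mutant/wildtype pairs in which the mutant case receives the higher score, with ties counted as one half).

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test fold: two mutant and three wildtype cases.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]
metrics = confusion_metrics(labels, predicted)
fold_auc = auc(labels, scores)
```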
The whole process was coded and executed in R 3.5.1 using the package FSelector [32] for the ReliefF feature selection, the package DMwR [33] for the SMOTE algorithm, and the package caret [34] to test the four different classifiers. These classifiers were executed with the default hyperparameters of the caret package. The Mann–Whitney feature selector was coded in R 3.5.1 using the wilcox.test function to compute the p-value of every feature (every column of the dataset), after which the features were sorted by this value in increasing order.
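The Mann–Whitney ranking step can be illustrated in Python as follows. This is a sketch rather than a port of the R code: it uses a normal approximation with a tie correction instead of calling wilcox.test, so p-values may differ slightly from those R reports for small samples, and the function names and toy data are hypothetical.

```python
import math


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 share the average of ranks i+1..j.
        for idx in range(i, j):
            ranks[idx] = (i + 1 + j) / 2.0
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))


def rank_features(columns, labels):
    """Sort feature names by increasing p-value between the two classes."""
    p_values = {}
    for name, values in columns.items():
        wildtype = [v for v, y in zip(values, labels) if y == 0]
        mutant = [v for v, y in zip(values, labels) if y == 1]
        p_values[name] = mann_whitney_p(wildtype, mutant)
    return sorted(p_values, key=p_values.get)


# Hypothetical data: feature "a" separates the classes, feature "b" does not.
status = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [1, 2, 3, 4, 10, 11, 12, 13],
    "b": [1, 10, 2, 11, 3, 12, 4, 13],
}
ranking = rank_features(features, status)
```

Taking the first 5, 10, 15, or 20 names from such a ranking would give the feature sets described in the text.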
In the second stage of the experiment, an ensemble of base models was applied using a subset of the 32 learned models (obtained from 4 feature sets, 4 machine learning models, and 2 feature rankers). Sets of the top ranked 5, 10, and 20 existing models were tested, based on computational restrictions and the desire to have larger ensembles for typically better accuracy. To select which base models would be part of the ensemble, the average performance on the Training set was considered. The base models were sorted according to their average AUC on the Training set, and the top 5, 10, and 20 were selected.
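The count of 32 base models and the selection of ensemble members by Training AUC can be sketched as below. The short labels for rankers and classifiers and the AUC values in the usage example are placeholders for illustration, not results from the study.

```python
from itertools import product

# Factors described in the text; the string labels are hypothetical.
rankers = ["relieff", "mann_whitney"]
feature_counts = [5, 10, 15, 20]
classifiers = ["random_forest", "svm", "gradient_boosting", "neural_network"]

# Every combination of ranker, feature-set size, and classifier: 2 * 4 * 4 = 32.
configs = list(product(rankers, feature_counts, classifiers))


def top_k(train_auc, k):
    """Return the k model identifiers with the highest average Training AUC."""
    return sorted(train_auc, key=train_auc.get, reverse=True)[:k]


# Placeholder Training AUCs for three of the configurations.
example_auc = {configs[0]: 0.70, configs[1]: 0.90, configs[2]: 0.80}
best_two = top_k(example_auc, 2)
```

Ranking on the Training set only, as the text describes, keeps the Test data out of the choice of ensemble members.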
2.2. Experiment 2: Convolutional Neural Networks
In the second experiment, CNNs were applied to the problem of predicting EGFR and KRAS mutations. CNNs are a type of deep neural network that has proven to be useful in detecting patterns in images [35]. From the same TCIA dataset, the CT images from the 83 patients that had both tumor segmentation and mutation information were selected and processed to obtain a volume containing only the tumor. Then images of the Region of Interest (ROI) with a uniform size of 128 × 128 pixels per slice were extracted. From the whole volume, up to three slices per patient were selected to be part of the final dataset. The slice that had the largest tumor area was selected by the lead author through manual visual inspection of the segmented images. Then, we skipped one slice in each direction along the z-axis and selected the next slice above and the next slice below. The immediately adjacent slices were not used, on the assumption that they were too similar to the central image. A slice without a clear piece of tumor in it was discarded. The dataset was then split into three: Training (65%), Validation (15%), and Test (20%) datasets. Since there was more than one image from each patient, we verified that images from the same patient were assigned to the same dataset.
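The slice selection and patient-level split can be illustrated with the sketch below. It rests on several assumptions: the central slice is taken as the one with the largest segmented area (in the study it was chosen by visual inspection), a zero area stands in for "no clear piece of tumor", and the 65/15/20 proportions are applied to patients rather than to images. All names are hypothetical.

```python
import random


def pick_slices(areas):
    """Return up to three slice indices: the central slice and the slices two
    positions above and below it, skipping the immediately adjacent ones.

    areas: segmented tumor area per slice (0 means no visible tumor).
    """
    center = max(range(len(areas)), key=lambda i: areas[i])
    candidates = [center - 2, center, center + 2]
    return [i for i in candidates if 0 <= i < len(areas) and areas[i] > 0]


def split_patients(patient_ids, fractions=(0.65, 0.15, 0.20), seed=0):
    """Split at the patient level so all slices of a patient stay together.

    The third fraction is implied: the test set receives the remaining patients.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical tumor areas for an eight-slice volume.
chosen = pick_slices([0, 0, 3, 5, 9, 6, 2, 0])
train_ids, val_ids, test_ids = split_patients(range(83))
```

Splitting on patient identifiers before assigning images is one simple way to guarantee the property the text describes, namely that no patient contributes images to more than one subset.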
In the first stage of the second experiment, several CNN models were applied to predict EGFR and KRAS mutations, varying conditions such as the CNN architecture, data augmentation, the optimizer, the learning rate, and the number of training epochs. Since this is a very small dataset, small CNN architectures were tested. Specifically, we varied the CNN architecture (3 architectures), the optimizer (SGD and Adam), the initial learning rate (0.01, 0.005, and 0.0005), and the number of epochs (10, 20, and 30). Other numbers of epochs were also tested, based on the performance observed when the first three options were assessed. Furthermore, other architectures were tested (up to 10), but not with all the combinations. Figure 3, Figure 4, and Figure 5 show the three best CNN architectures used in this experiment. The others can be found in the Supplementary Material (Figure S1 CNN ARCHITECTURES).
More than 54 models were trained with different combinations of the aforementioned parameters. Once sufficiently good results were obtained with the base CNN models, a second stage of ensembles of CNN models was performed. Combinations of several models from the ones trained in the previous stage were tested. The models were ranked according to their performance on the Training set, and the best ones were selected for the ensembles. Different types of voting were applied: average, maximum, and SCAV.
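The figure of 54 corresponds to the full grid of the four settings listed for the CNN experiments (3 × 2 × 3 × 3). The following sketch enumerates that grid; the architecture labels are placeholders, and the additional architectures and epoch counts mentioned in the text are not included.

```python
from itertools import product

# Values taken from the text; the architecture labels are hypothetical.
architectures = ["arch_a", "arch_b", "arch_c"]
optimizers = ["sgd", "adam"]
learning_rates = [0.01, 0.005, 0.0005]
epoch_options = [10, 20, 30]

# One dictionary per training run in the full grid.
grid = [
    {"architecture": a, "optimizer": o, "learning_rate": lr, "epochs": e}
    for a, o, lr, e in product(architectures, optimizers, learning_rates, epoch_options)
]
n_runs = len(grid)  # 3 * 2 * 3 * 3 combinations
```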
The second experiment was coded in Python 3, and the library OpenCV was used for the image processing tasks. For the CNN generation, the library Keras with TensorFlow backend was utilized.
4.1. EGFR Mutation
Several observations can be made from these results. Our first observation is that for the machine learning models, the use of SMOTE greatly improves the performance of the models when dealing with unbalanced datasets. Without the SMOTE algorithm, the sensitivity was zero, but while using it several of the mutant cases were properly detected. For the EGFR mutation, good results could be obtained with the machine learning approach; however, despite the small training dataset, better results could be obtained with CNNs. In both cases, the base models could be improved by using ensembles. For the machine learning models, the best base performance was with ReliefF as feature selector, 15 features, and SVM as the classifier.
For the machine learning approach, better results in terms of accuracy, sensitivity, and AUC could be obtained with different combinations of ensembles and SCAV. The best features in the sense that they were more commonly selected in the best machine learning models are: 3D Wavelet features, 3D Laws features, GLSZM Grey level variance, 90th percentile, GLSZM Small zone low grey level emphasis, Flatness, Asymmetry, Orientation, and Surface to volume ratio.
In the tests applying CNNs, the best performance of the base models was with Architecture 4 and SGD as the optimizer. After applying ensembles, we got an improvement in accuracy. This result was obtained using our proposed voting scheme, SCAV. In general, we obtained better results working with CNNs than with the machine learning models. We also observed that the ensembles with SCAV outperformed the ones with average and maximum voting. This suggests that the proposed scheme can obtain better performance when applying ensembles, even if the performance of the base models is not optimal.
Our results, where features are extracted automatically, are slightly lower than the AUC of 0.89 obtained by Gevaert et al. [5] with the same dataset, but they did not require medical experts to produce semantic features. Moreover, our AUC is superior to the ones obtained by other previous works with an automated approach. Other metrics such as accuracy were not reported in those works and cannot be compared.
4.2. KRAS Mutation
For the base classifiers of the machine learning approach, good results could be obtained both with the ReliefF feature selector and the Mann–Whitney test. The best AUC was obtained with a model of ReliefF’s best five features and SVM as the classifier. The best features in the sense that they were more commonly selected in this model were 3D Laws features, 3D Wavelet features, average GLN Grey level non-uniformity, and GLSZM Grey level non-uniformity. With the ensemble approach, an important increase in the performance was obtained by applying SCAV as the voting scheme. The best results with ensembles and machine learning models were always obtained with SCAV as the voting scheme. This shows that ensembles can significantly improve the performance of classifiers.
For the CNN models, there were no models that achieved high scores in all the three metrics (accuracy, sensitivity, and AUC). The best AUC was obtained with Architecture 1 and SGD as the optimizer, and the best accuracy was with Architecture 1 and Adam as the optimizer. After applying ensembles of CNN models, the performance improved in AUC and the accuracy was maintained. In this case, the best result was obtained with average voting; however, the second best result was obtained with SCAV. It can be observed that in terms of AUC, better results could be obtained with CNNs over machine learning models. In both cases, an improvement was observed with the ensemble approach.
Gevaert et al. [5], working with the same dataset, could not find a conclusive model for the KRAS mutation with semantic features (AUC of 0.55). In comparison, we did find a good predictive model for the KRAS mutation on the same dataset, with an AUC of 0.778, using a deep learning approach that does not need human input to generate features.
A limitation of this study is the small dataset. Furthermore, even if separate training, validation, and test datasets are used, it would be more conclusive if the trained model could be tested on a dataset from a different source, which would prove the generalization of the model. Both these limitations will be addressed as future work, when more data that fits the requirements of the study is available. Finally, for the machine learning approach, because we chose features on each fold of a cross validation, the very best set is the one that occurred most often and could differ when more data is available.
In this study, we analyzed the effectiveness of using ensembles in the prediction of EGFR and KRAS mutations using a small dataset; in particular, we assessed the performance of a novel voting scheme SCAV. We tested this scheme with both ensembles of machine learning models and ensembles of CNNs and a significant improvement from the base classifiers was observed.
For the EGFR mutation, the performance of our model was similar to that obtained by Gevaert et al. with the same dataset, and our model did not require semantic features manually specified by a radiologist. Further, our best model obtained a higher AUC than the ones presented by the most recent works that used deep learning [17] and nomograms [18].
For the KRAS mutation, the results are much better than the ones obtained in [5], where a conclusive model for KRAS mutation could not be found for this same dataset. Moreover, this is probably the best result for the KRAS mutation prediction that can be found in the literature, since most works only focus on the EGFR mutation.
In general, for both mutations, better results were obtained by applying ensembles with SCAV as the voting method, rather than average and maximum voting; however, a more rigorous method to determine the best threshold is still necessary. This work indicates that applying ensembles and SCAV for voting may lead to a significant increase in the performance of the base models, both for machine learning and deep learning models, which offers a good strategy to handle small datasets when no more data is available. Furthermore, higher sensitivity was obtained when applying the SMOTE algorithm for the machine learning models, which is an effective strategy to handle unbalanced classification datasets.
This work showed novel ways to use ensembles of CNNs and non-neural classifiers on small data to achieve state-of-the-art results. Our proposed approach, which is to use ensembles with SCAV, shows in this study that the performance of classifiers can be improved, even when the base models do not perform that well, and this is an important contribution from this paper. Since larger datasets will enable better models, we firmly believe that if our approach is applied with more data it will yield outstanding performance and may generate models that can be used in clinical practice. This indicates a promising future for detecting these mutations in a non-invasive way.