A Radiogenomics Ensemble to Predict EGFR and KRAS Mutations in NSCLC

Lung cancer causes more deaths globally than any other type of cancer. Detecting EGFR and KRAS mutations is of interest for determining the best treatment, but non-invasive ways to obtain this information are not available. Furthermore, relevant public datasets are often too small, so the performance of single classifiers is not outstanding. In this paper, an ensemble approach is applied to improve EGFR and KRAS mutation prediction using a small dataset. A new voting scheme, Selective Class Average Voting (SCAV), is proposed, and its performance is assessed both for machine learning models and CNNs. For the EGFR mutation, the machine learning approach increased sensitivity from 0.66 to 0.75 and AUC from 0.68 to 0.70. With the deep learning approach, an AUC of 0.846 was obtained, and with SCAV, the accuracy of the model increased from 0.80 to 0.857. For the KRAS mutation, a significant increase in performance was found both in the machine learning models (AUC from 0.65 to 0.71) and the deep learning models (AUC from 0.739 to 0.778). The results obtained in this work show how to effectively learn from small image datasets to predict EGFR and KRAS mutations, and that using ensembles with SCAV increases the performance of machine learning classifiers and CNNs. They provide confidence that, as large datasets become available, tools to augment clinical capabilities can be fielded.


Introduction
Globally, lung cancer is the leading cause of cancer-related death in men and the second-leading cause in women. In 2018, an estimated 1.8 million lung cancer deaths occurred, with 1.2 million in men and over 576,000 in women, accounting for 1 in 5 cancer-related deaths worldwide [1]. Advances in precision medicine and genomic analyses have resulted in a paradigm shift whereby lung tumors are characterized and classified by biomarkers and genetic alterations (e.g., gene expression, mutations, amplifications, and rearrangements) that are critical to tumor growth and can be exploited with specific targeted agents or immune checkpoint inhibitors. However, tissue-based biomarkers have several limitations: they can be subject to sampling bias due to the heterogeneous nature of tumors, they require tumor specimens for testing, and the assays can be time-consuming and expensive [2]. As such, high-throughput and minimally invasive methods that can improve current precision medicine are a critical need.
Liquid biopsy is a good alternative for a non-invasive way to detect EGFR and KRAS mutations, through the use of surrogate sources of DNA, such as blood, serum, and plasma samples, which often contain circulating free tumor (cft) DNA or circulating tumor cells. Previous radiomics work to predict EGFR status found an AUC of 0.647 with a model based on five radiomic features, which improved to 0.709 by combining radiomic features and clinical data. Deep learning has recently been applied in the diagnosis of different types of cancer [16], and other authors such as Wang et al. [17] have applied these techniques to mutation prediction. These authors utilized deep learning for the prediction of EGFR mutational status by training on 14,926 CT images and obtained an AUC of 0.81 on an independent validation cohort. Other recent studies have applied clinical nomograms to predict EGFR mutation status. In the work presented by Zhang et al. [18], the authors combined CT features and clinical risk factors to build a prediction nomogram, obtaining a 0.74 AUC on the validation cohort.
On the other hand, previous studies have demonstrated that applying ensembles to predictive models tends to improve the performance of predictions [19]. An ensemble model is created by generating multiple models and combining them to produce an output classification. To combine the different models, a voting process is performed among them to determine the final result. There are different types of voting; for example, average voting, in which the average of the probabilities for each class over all the models is computed, and a classification is then performed based on the average probability. Another type is maximum-probability voting, in which for each case the base model with the highest pseudo-probability is selected, and the classification of the case is based on the pseudo-probabilities of that classifier alone.
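The two baseline voting schemes can be sketched as follows, assuming each base model outputs a pseudo-probability for the positive (mutant) class; the function names are illustrative, not from the paper's code.

```python
def average_voting(probs, threshold=0.5):
    """Classify by the mean of the models' positive-class pseudo-probabilities."""
    avg = sum(probs) / len(probs)
    return ("mutant" if avg >= threshold else "wildtype"), avg

def maximum_voting(probs, threshold=0.5):
    """Classify using only the most confident base model, i.e. the one whose
    pseudo-probability lies farthest from the 0.5 decision boundary."""
    best = max(probs, key=lambda p: abs(p - 0.5))
    return ("mutant" if best >= threshold else "wildtype"), best
```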
In this paper, a novel voting scheme for ensembles of machine learning or deep learning models is proposed, and its effectiveness in predicting EGFR and KRAS mutations in CT images from the TCIA NSCLC Radiogenomics dataset [20] is shown to be state of the art. Two experiments were performed: first, prediction with radiomic features and machine learning models, and second, prediction with Convolutional Neural Networks (CNNs). In both cases, base models are tested first, and then an ensemble of the best models with the new voting scheme is applied to observe whether prediction performance improves. Our approach shows that performance can be improved by this scheme and that good results are possible even with a small dataset in which only a few cases present mutations. With more data becoming available in the future, it is expected that this type of approach will add to the tools available to clinicians.

Materials and Methods
For this study, a cohort of 99 patients from the TCIA was obtained [20,21], whose data included CT images with tumor segmentation, genomic data (KRAS mutational status and EGFR mutational status), and clinical data (age, sex, smoking status, pathological T stage, pathological N stage, pathological M stage, and histology type). Details of the cohort and corresponding data are published in a previous study [5]. Patients with unknown mutational status were eliminated from the analysis, which left 83 patients. The list of the exact cases used in the study can be found in the Supplementary Material (Table S7 Features Transpose EGFR, Table S8 Features Transpose KRAS). This type of curated data is difficult to obtain, and this set, while small, allows for comparisons. The summary of the study cohort is presented in Table 1, and Table 2 summarizes the clinical features of the study cohort.
For the EGFR mutation case, there is not a significant difference observed between the mutant and wildtype statuses in terms of age. In terms of gender, for the mutant status there seems to be a more balanced distribution between the genders, while the wildtype status seems to be significantly more common among men. In terms of smoking history, the EGFR mutant status seems to be found more often among former smokers, and non-smokers in second place, while the wildtype status seems to be more common among former and current smokers. There is no significant difference between the groups in terms of T cancer stage, although wildtype status seems to be more common for patients with stage T1a. Cases with stages N1 and N2 seem to more frequently present wildtype status, as well as patients with M1b stage. In terms of histology type, none of the Squamous Cell Carcinoma patients present the EGFR mutation; this is only present in Adenocarcinoma cases.
For the KRAS mutation, there are no significant differences in terms of age and gender between the mutant and wildtype cases. In terms of smoking history, it can be observed that none of the non-smokers presented the KRAS mutation. For the pathological stage, it seems that most patients with stage N1 and N2 are wildtype cases. Moreover, as seen with the EGFR mutation, mutant status is only found in Adenocarcinoma. For more information about the distribution of the clinical variables in the Train and Test datasets, please refer to the Supplementary Material (Table S1. Clinical Variables Training Dataset, Table S2. Clinical Variables Test Dataset).
With this dataset, two experiments were conducted: first, with traditional radiomic features and machine learning models, and second, with Convolutional Neural Networks (CNNs). Both experiments consisted of a base classifier performance assessment, after which ensembles of several models were tested with three types of voting: average, maximum, and the method proposed here, Selective Class Average Voting (SCAV). SCAV is a voting technique that is particularly useful when dealing with an unbalanced dataset, where one class (the majority class) is much more frequent than the other (the minority class). In SCAV, we first count how many models predicted the minority class (in our case, the mutant status); if this quantity is above a threshold value, the final outcome is the minority class, and the pseudo-probability of the case is computed by averaging the scores of all the models whose result was the minority class. If the count is below the chosen threshold, the final outcome is the majority class (in this case, the wildtype status), and the class pseudo-probability is computed by averaging the probabilities of all the models whose result was the majority class. Once the probabilities are averaged according to this process, a threshold of 0.5 is applied to the final score to determine whether the sample belongs to the minority (mutant) or the majority (wildtype) class. To select the best thresholds for SCAV, that is, the threshold on how many models must vote for the minority class, the performance of the ensemble on the Training set was assessed, and the thresholds that yielded the highest AUC on this data were selected and applied to the Test data. Figure 1 describes the algorithm used by SCAV.
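A minimal sketch of the SCAV procedure described above, assuming each base model outputs a pseudo-probability for the minority (mutant) class and votes for that class when the probability is at least 0.5:

```python
def scav(probs, min_votes):
    """Selective Class Average Voting (sketch). `probs` holds each base model's
    pseudo-probability for the minority (mutant) class; `min_votes` is the
    tuned threshold on how many models must vote for the minority class."""
    minority = [p for p in probs if p >= 0.5]   # models voting mutant
    majority = [p for p in probs if p < 0.5]    # models voting wildtype
    if len(minority) >= min_votes:
        score = sum(minority) / len(minority)   # average over mutant voters only
    else:
        pool = majority or probs                # guard: every model voted mutant
        score = sum(pool) / len(pool)           # average over wildtype voters only
    # Final 0.5 threshold on the averaged score, as described above.
    return ("mutant" if score >= 0.5 else "wildtype"), score
```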
The use of ensembles increases the probability of obtaining better results, since we have several diverse models as inputs and their errors tend to be outvoted by the full set of classifiers. This enables a lower generalization error.
However, there are some disadvantages to this approach. First, it consumes more time: several models have to be trained before an ensemble can be attempted. It also requires more computing power and resources, since several classifiers run at the same time. This last item limits how many models can be used in the ensemble.

Experiment 1: Radiomic Features and Machine Learning Classifiers
Quantitative image features (N = 266) presented in [22] were extracted from the segmented 3D regions which included texture and non-texture features. These features were computed using the segmented volumes publicly available in the NSCLC Radiogenomics dataset. Non-texture features include tumor size, tumor shape, and tumor location categories, and texture features include pixel histogram, run length, co-occurrence, Laws, and Wavelet features. To extract these features, Definiens Developer XD© (Munich, Germany) was used [23]. Definiens is based on the Cognition Network Technology that allows the development and execution of image analysis applications. Here, the Lung Tumor Analysis application was used. Most of the features were implemented within the Definiens platform, whereas some were computed with an implementation of the algorithms in C/C++ developed in a previous work by some of the authors of this paper [22].
For stage 1, the following experimental workflow was applied to predict mutation status from image features. First, the data was divided into Train and Test sets as part of a 10-fold cross validation. Second, on the Training set, feature selection was applied to select the image features with the most predictive power; third, the SMOTE algorithm [24] was applied to balance the number of examples in each class of the dataset; fourth, a classifier was trained with the previously selected features as inputs on the balanced Training data, and finally, the resulting model was applied to the Test set. Figure 2 summarizes the presented workflow.

Feature selection was used to determine the image features with the most predictive power. Sets of 5, 10, 15, and 20 features were tested. Two selection methods were separately applied: the Mann-Whitney test [25] and ReliefF [26]. Since this is an unbalanced dataset (one class is much more abundant than the other), an optional application of the SMOTE algorithm [24] was performed to create synthetic samples of the minority class. SMOTE was applied with its default settings, which make the dataset approximately balanced by class.
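To make the balancing step concrete, the following is a toy SMOTE-style oversampler (a simplified sketch, not the DMwR implementation used in the paper): each synthetic sample interpolates between a minority case and one of its k nearest minority neighbours.

```python
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: dist2(x, m))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + lam * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic
```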
Finally, a classifier was trained with the selected features. Four machine learning classifiers were used: Random Forests [27], Support Vector Machines [28], Stochastic Gradient Boosting [29], and Neural Networks [30]. For every experiment, standard metrics were computed, including accuracy, sensitivity, and specificity (with the mutant status as the positive case), and the Area Under the ROC Curve (AUC) [31]. The workflow was applied in a ten-fold cross-validation scheme, where iteratively nine folds were used to select features and train the classifier, and the left-out fold was used for testing the model.
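As an illustration, the three threshold-based metrics can be computed from confusion-matrix counts as follows (a minimal helper, not the caret implementation used in the paper):

```python
def binary_metrics(y_true, y_pred, positive="mutant"):
    """Accuracy, sensitivity, and specificity with mutant as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # recall on mutants
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # recall on wildtype
    }
```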
The whole process was coded and executed in R 3.5.1 using the package FSelector [32] for the ReliefF feature selection, package DMwR [33] for the SMOTE algorithm, and package caret [34] to test the four different classifiers. These classifiers were executed with the default hyperparameters of the caret package. The Mann-Whitney feature selector was coded in R 3.5.1 using the wilcox.test function to compute the p-value of every feature (every column of the dataset), and features were then sorted by this value in increasing order.
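A pure-Python analogue of this ranking step might look as follows. It scores each feature by how far its Mann-Whitney U statistic deviates from the null mean, which, for equal sample sizes, orders features similarly to sorting by p-value (a sketch, not the R wilcox.test code actually used):

```python
def u_statistic(a, b):
    """Mann-Whitney U for one feature, pairwise-comparison form with ties at 0.5."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

def rank_features(mutant_rows, wildtype_rows):
    """Return feature column indices ordered from most to least separating,
    scoring each feature by |U - n1*n2/2| (distance from the null mean)."""
    n_feat = len(mutant_rows[0])
    half = len(mutant_rows) * len(wildtype_rows) / 2
    scores = []
    for j in range(n_feat):
        u = u_statistic([r[j] for r in mutant_rows],
                        [r[j] for r in wildtype_rows])
        scores.append((abs(u - half), j))
    return [j for _, j in sorted(scores, reverse=True)]
```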
In the second stage of the experiment, an ensemble of base models was applied using a subset of the 32 learned models (obtained from 4 feature sets, 4 machine learning models, and 2 feature rankers). Sets of the top ranked 5, 10, and 20 existing models were tested, based on computational restrictions and the desire to have larger ensembles for typically better accuracy. To select which base models would be part of the ensemble, the average performance on the Training set was considered. The base models were sorted according to their average AUC on the Training set, and the top 5, 10, and 20 were selected.
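This selection step amounts to ranking the base models by Training-set AUC and keeping the top k; the model names in the example below are hypothetical:

```python
def select_top_models(models_with_auc, k):
    """Keep the k base models with the highest average Training-set AUC,
    as done before building each ensemble (illustrative helper)."""
    ranked = sorted(models_with_auc, key=lambda m: m[1], reverse=True)
    return [name for name, _auc in ranked[:k]]
```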

Experiment 2: Convolutional Neural Networks
In the second experiment, CNNs were applied to the problem of predicting EGFR and KRAS mutations. CNNs are a type of deep neural network that have proven useful in detecting patterns in images [35]. From the same TCIA dataset, the CT images from the 83 patients that had both tumor segmentation and mutation information were selected and processed to obtain a volume containing only the tumor. Then, images of the Region of Interest (ROI) with a uniform size of 128 × 128 pixels per slice were extracted. From the whole volume, up to three slices per patient were selected to be part of the final dataset. The slice with the largest tumor area was selected by manual visual inspection of the segmented images by the lead author. Then, leaving one slice out in each direction of the z-axis, the two slices that were closest to the chosen slice above and below were selected. The immediately consecutive slices were not used, on the assumption that they were too similar to the central image. A slice without a clear piece of tumor in it was discarded. The dataset was then split into three: Training (65%), Validation (15%), and Test (20%) datasets. Since there was more than one image from each patient, we verified that images from the same patient were assigned to the same dataset.
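The slice-selection rule can be sketched as follows, assuming slices are indexed along the z-axis and the central index was chosen by visual inspection:

```python
def pick_slices(n_slices, central, gap=2):
    """Given the index of the slice with the largest tumor area, return up to
    three slice indices: the central one plus the slices `gap` positions above
    and below it (skipping immediate neighbours), clipped to the volume."""
    chosen = [central]
    for idx in (central - gap, central + gap):
        if 0 <= idx < n_slices:
            chosen.append(idx)
    return sorted(chosen)
```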
In the first stage of the second experiment, several CNN models were applied to predict EGFR and KRAS mutations, varying conditions such as the CNN architecture, data augmentation, the optimizer, the learning rate, and the number of training epochs. Since this is a very small dataset, small CNN architectures were tested. For the CNN experiments, we varied the CNN architecture (3 architectures), the optimizer (SGD and Adam), the initial learning rate (0.01, 0.005, and 0.0005), and the number of epochs (10, 20, and 30). Other numbers of epochs were also tested, based on the performance observed when the first three options were assessed. Furthermore, other architectures were tested (up to 10), but not with all the combinations.
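The core hyperparameter sweep described above can be enumerated directly; the architecture names below are placeholders (the paper's exact architectures are not reproduced here):

```python
from itertools import product

# 3 architectures x 2 optimizers x 3 initial learning rates x 3 epoch counts
architectures = ["arch_1", "arch_2", "arch_3"]   # placeholder names
optimizers = ["SGD", "Adam"]
learning_rates = [0.01, 0.005, 0.0005]
epochs = [10, 20, 30]

grid = list(product(architectures, optimizers, learning_rates, epochs))
```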
In the first stage of the second experiment, several CNN models were applied to predict EGFR and KRAS mutations, varying conditions such as the CNN architecture, data augmentation, the optimizer, the learning rate, and the number of epochs of training. Since this is a very small dataset, small CNN architectures were tested. For the CNN experiments, we varied the CNN architecture (3 architectures), the optimizer (SGD and Adam), the Initial Learning rate (0.01, 0.005, and 0.0005), and the number of epochs (10, 20, and 30). Other numbers of epochs were also tested, based on the performance observed when the first three options were assessed. Furthermore, other architectures were tested (up to 10), but not with all the combinations.    More than 54 models were trained with different combinations of the before mentioned parameters. When enough good results were obtained using the base CNN models, a second stage of ensembles of CNN models was performed. Combinations of several models from the ones trained in the previous stage were tested. The models were ranked More than 54 models were trained with different combinations of the before mentioned parameters. When enough good results were obtained using the base CNN models, a second stage of ensembles of CNN models was performed. Combinations of several models from the ones trained in the previous stage were tested. The models were ranked according to their performance on the Training set, and the best ones were selected for the ensembles. Different types of voting were applied: average, maximum, and SCAV.
The second experiment was coded in Python 3, and the library OpenCV was used for the image processing tasks. For the CNN generation, the library Keras with TensorFlow backend was utilized.

Machine Learning Models: EGFR Mutation
The ten best results of the base classifiers for the EGFR mutation on the Test dataset are presented in Table 3, sorted by their AUC. The results on the Training set are included in the Supplementary Material (Table S3. EGFR Mutation Prediction Results Base Classifiers). The classifiers are Stochastic Gradient Boosting (gbm), Random Forest (RF), Support Vector Machine (SVM), and Neural Network (nnet). The highest AUC for EGFR mutation prediction was 0.68, with an SVM classifier. For this mutation, much better results were obtained with ReliefF as the feature selector.
Then, ensembles of different numbers of models with three types of voting were tested. Table 4 presents the best ensemble results. The best AUC was 0.70, obtained with SCAV; this model also had the highest sensitivity (0.75). Moreover, in another model, an accuracy of 80% was obtained with a 0.68 AUC. For the machine learning experiment, higher accuracy, sensitivity, specificity, and AUC can all be obtained by applying ensembles and SCAV, and different ensemble combinations can be used to favor certain metrics.

Machine Learning Models: KRAS Mutation
The results of the ten best base classifiers for the KRAS mutation on the Test dataset are presented in Table 5. The results on the Training set are included in the Supplementary Material (Table S4. KRAS Mutation Prediction Results, Base Classifiers). The best AUC was 0.65. For this mutation, similar results were obtained with both feature selection methods, though ReliefF was still the best. Then, ensembles of different numbers of models with three types of voting were tested. Table 6 presents the best ensemble results. The ensemble approach improved the AUC to 0.71 with 72% accuracy using SCAV. Again, the models with the best accuracy and the best AUC were obtained with the proposed voting scheme. This was the best AUC that could be obtained for the KRAS mutation with the machine learning models.

Convolutional Neural Networks: EGFR Mutation
The best results of EGFR mutation prediction applying CNNs on the Test set are presented in Table 7. Please refer to the Supplementary Material (Table S5. EGFR Mutation Best Results, CNNs) for the results on the Train set. All the best results were obtained with SGD as the optimizer, which suggests that SGD can be a good choice when dealing with small datasets and small CNN architectures. The best result was obtained with Architecture 4, which is presented in Figure 4. This model had an AUC of 0.846 and an accuracy of 0.800; this was the best AUC that could be obtained for the EGFR mutation. After the base CNN models were obtained, an ensemble of the best CNN models was created. Table 8 presents the results of the best ensembles of CNN models. The best result in terms of AUC was 0.820, with an accuracy of 0.828, obtained with a combination of the three best models and SCAV. An even better accuracy (0.857), the best for the EGFR mutation, was obtained with the combination of the five best models. Although in this case there was no increase in performance in terms of AUC, a better accuracy was obtained by applying SCAV.

Convolutional Neural Networks: KRAS Mutation
The best results of KRAS mutation prediction using CNNs on the Test set are presented in Table 9. The results on the Training set can be found in the Supplementary Material (Table S6. KRAS Mutation Best Results, CNNs). Analyzing the results for KRAS, no single model performs well according to all three metrics. The best AUC is 0.739; however, the sensitivity of this model is zero, so none of the mutant cases were detected. The model with the best accuracy reaches 72.2% with a sensitivity of 0.25, a more balanced result, but its AUC is only 0.566. To improve these results, an ensemble of the best CNN models was created. Table 10 shows the best ensemble results. The best AUC obtained at this stage was 0.778, achieved with an ensemble of the three best models and average voting; this was the best AUC obtained for the KRAS mutation. This was the only stage where the best AUC was not obtained with SCAV; however, an equal accuracy could be obtained by applying SCAV with the best three models.

EGFR Mutation
Several observations can be made from these results. First, for the machine learning models, the use of SMOTE greatly improves performance on unbalanced datasets. Without the SMOTE algorithm, the sensitivity was zero, whereas with it several of the mutant cases were correctly detected.
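In practice one would use a library implementation of SMOTE (e.g., in imbalanced-learn); the minimal numpy sketch below illustrates only the core idea behind it — synthesizing new minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. Function and variable names here are illustrative, not the paper's.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise distances within the minority class; a point is never
    # its own neighbour
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # random minority sample
        j = neighbours[i, rng.integers(k)]       # one of its k neighbours
        gap = rng.random()                       # interpolation factor
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy 2-D minority class; synthetic points stay inside its convex hull
X_minority = np.array([[0., 0.], [1., 0.], [0., 1.],
                       [1., 1.], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like(X_minority, n_new=4, k=3)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class fills in its own region of feature space rather than simply duplicating points.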
For the EGFR mutation, good results could be obtained with the machine learning approach; however, despite the small training dataset, better results were obtained with CNNs. In both cases, the base models were improved by using ensembles. For the machine learning models, the best base performance was obtained with ReliefF as the feature selector, 15 features, and an SVM classifier.
For the machine learning approach, better results in terms of accuracy, sensitivity, and AUC could be obtained with different combinations of ensembles and SCAV. The features most commonly selected by the best machine learning models were: 3D Wavelet features, 3D Laws features, GLSZM Grey level variance, 90th percentile, GLSZM Small zone low grey level emphasis, Flatness, Asymmetry, Orientation, and Surface to volume ratio.
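The selector-plus-SVM pipeline described above can be sketched with scikit-learn. Note that ReliefF itself is not in scikit-learn (implementations exist in, e.g., the skrebate package), so an ANOVA F-score selector stands in here purely to show the pipeline shape; the synthetic data is likewise a stand-in for the radiomic feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the radiomic feature matrix
X, y = make_classification(n_samples=120, n_features=50,
                           n_informative=8, random_state=0)

# SelectKBest(f_classif) is a placeholder for ReliefF; k=15 mirrors the
# 15-feature configuration reported for the best EGFR base model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=15)),
    ("svm", SVC(kernel="rbf")),
])

# evaluate with 5-fold cross-validation on AUC, the paper's main metric
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```

Keeping selection inside the pipeline ensures features are re-selected on each training fold, which is exactly why, as noted in the Limitations, the "best" feature set is the one chosen most often across folds.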
In the tests applying CNNs, the best performance of the base models was obtained with Architecture 4 and SGD as the optimizer. Applying ensembles yielded an improvement in accuracy, obtained using our proposed voting scheme, SCAV. In general, we obtained better results with CNNs than with the machine learning models. We also observed that ensembles with SCAV outperformed those with average and maximum voting. This suggests that the proposed scheme can improve performance when applying ensembles, even if the performance of the base models is not optimal.
Our results, obtained with automatically extracted features, are slightly below the AUC of 0.89 reported by Gevaert et al. [5] on the same dataset, but they did not require medical experts to produce semantic features. Moreover, our AUC is superior to those obtained by previous works using automated approaches. Other metrics, such as accuracy, were not reported in those works and cannot be compared.

KRAS Mutation
For the base classifiers of the machine learning approach, good results were obtained both with the ReliefF feature selector and with the Mann-Whitney test. The best AUC was obtained with a model using ReliefF's best five features and an SVM classifier. The features most commonly selected in this model were 3D Laws features, 3D Wavelet features, average GLN Grey level non-uniformity, and GLSZM Grey level non-uniformity. With the ensemble approach, a substantial increase in performance was obtained by applying SCAV as the voting scheme; the best ensemble results with machine learning models were always obtained with SCAV. This shows that ensembles can significantly improve the performance of classifiers.
For the CNN models, no model achieved high scores on all three metrics (accuracy, sensitivity, and AUC). The best AUC was obtained with Architecture 1 and SGD as the optimizer, and the best accuracy with Architecture 1 and Adam as the optimizer. After applying ensembles of CNN models, the AUC improved and the accuracy was maintained. In this case, the best result was obtained with average voting; however, the second-best result was obtained with SCAV. In terms of AUC, better results were obtained with CNNs than with machine learning models, and in both cases the ensemble approach brought an improvement.
Comparing these results with those of Gevaert et al. [5] on the same dataset, the authors could not find a conclusive model for the KRAS mutation with semantic features (AUC of 0.55). In contrast, we found a good predictive model for the KRAS mutation on the same dataset, with an AUC of 0.778, using a deep learning approach that requires no human input to generate features.

Limitations
A limitation of this study is the small dataset. Furthermore, even though separate training, validation, and test datasets were used, it would be more conclusive to test the trained model on a dataset from a different source, which would demonstrate the generalization of the model. Both limitations will be addressed in future work, when more data that fits the requirements of the study becomes available. Finally, for the machine learning approach, because features were selected on each fold of the cross-validation, the reported best feature set is the one selected most often across folds and could differ when more data is available.

Conclusions
In this study, we analyzed the effectiveness of ensembles for predicting EGFR and KRAS mutations using a small dataset; in particular, we assessed the performance of a novel voting scheme, Selective Class Average Voting (SCAV). We tested this scheme with both ensembles of machine learning models and ensembles of CNNs, and observed a significant improvement over the base classifiers.
For the EGFR mutation, the performance of our model was similar to that obtained by Gevaert et al. with the same dataset, and our model did not require semantic features manually specified by a radiologist. Further, our best model obtained a higher AUC than the ones presented by the most recent works that used deep learning [17] and nomograms [18].
For the KRAS mutation, the results are much better than those obtained in [5], where a conclusive model for the KRAS mutation could not be found on this same dataset. Moreover, to the best of our knowledge, this is the best result for KRAS mutation prediction reported in the literature, since most works focus only on the EGFR mutation.
In general, for both mutations, better results were obtained by applying ensembles with SCAV as the voting method rather than average or maximum voting; however, a more rigorous method to determine the best threshold is still needed. This work indicates that applying ensembles with SCAV voting can lead to a significant increase in the performance of the base models, for both machine learning and deep learning models, offering a good strategy for handling small datasets when no more data is available. Furthermore, higher sensitivity was obtained when applying the SMOTE algorithm to the machine learning models, which is an effective strategy for handling unbalanced classification datasets.
This work showed novel ways to use ensembles of CNNs and non-neural classifiers on small datasets to achieve state-of-the-art results. Our proposed approach, using ensembles with SCAV, shows that the performance of classifiers can be improved even when the base models do not perform particularly well, which is an important contribution of this paper. Since larger datasets will enable better models, we believe that applying our approach to more data will yield strong performance and may produce models that can be used in clinical practice. This indicates a promising future for detecting these mutations in a non-invasive way.