Machine Learning and Feature Selection Methods for EGFR Mutation Status Prediction in Lung Cancer

: The evolution of personalized medicine has changed the therapeutic strategy from classical chemotherapy and radiotherapy to a genetic modiﬁcation targeted therapy, and although biopsy is the traditional method to genetically characterize lung cancer tumor, it is an invasive and painful procedure for the patient. Nodule image features extracted from computed tomography (CT) scans have been used to create machine learning models that predict gene mutation status in a noninvasive, fast, and easy-to-use manner. However, recent studies have shown that radiomic features extracted from an extended region of interest (ROI) beyond the tumor, might be more relevant to predict the mutation status in lung cancer, and consequently may be used to signiﬁcantly decrease the mortality rate of patients battling this condition. In this work, we investigated the relation between image phenotypes and the mutation status of Epidermal Growth Factor Receptor ( EGFR ), the most frequently mutated gene in lung cancer with several approved targeted-therapies, using radiomic features extracted from the lung containing the nodule. A variety of linear, nonlinear, and ensemble predictive classiﬁcation models, along with several feature selection methods, were used to classify the binary outcome of wild-type or mutant EGFR mutation status. The results show that a comprehensive approach using a ROI that included the lung with nodule can capture relevant information and successfully predict the EGFR mutation status with increased performance compared to local nodule analyses. Linear Support Vector Machine, Elastic Net, and Logistic Regression, combined with the Principal Component Analysis feature selection method implemented with 70% of variance in the feature set, were the best-performing classiﬁers, reaching Area Under the Curve (AUC) values ranging from 0.725 to 0.737. This approach that exploits a holistic analysis indicates that information from more extensive regions of the lung containing the nodule allows a more complete lung cancer characterization and should be considered in future radiogenomic studies.


Introduction
Lung cancer is the foremost determinant cancer death amongst men and women, killing a vaster number of people than colon, breast, and prostate cancers combined [1]. This is linked to the fact that it is often diagnosed in an advanced stage, with 5% or less chance of a 5-year survival [2]. Non-Small-Cell Lung Carcinoma (NSCLC) is the most prevalent histological type of lung cancer, covering about 85% of all lung cancer cases [3], and Epidermal Growth Factor Receptor (EGFR) is the most frequently mutated gene that springs lung cancer of type Adenocarcinoma [4]. The cell surface receptor EGFR is responsible for cell growth and survival and its mutations promote EGFR permanent activation, which contributes to uncontrolled cell division [5,6]. This genomic biomarker with clinically approved therapies is now considered a strong prognostic indicator in lung cancer, rising opportunities to explore treatment strategies that rely on the individual's genetic profile [7]. Currently, biopsy is the primary method for characterizing lung cancer and identifying EGFR mutation status, using an extracted tumor tissue sample for molecular analysis. Nevertheless, this invasive procedure can lead to some associated side effects, as it can be painful and risky for the patient. Recently, blood-based screening has been used to detect early lung cancer diagnostic biomarkers [8]. However, despite being less invasive than biopsy, this technique is still bothersome for the patient.
Computed tomography (CT) scans provide a reliable lung cancer characterization, offering a faster and less invasive approach compared to traditional tissue biopsy [9]. Thereby, foretelling gene mutation status by CT can help determine the most appropriate treatment for each subject, while decreasing medical complications [10]. Extracting quantitative features from CT images that are then used as inputs within a predictive model that can directly classify the EGFR mutation status for lung cancer patients composes the essence of radiogenomics, a field that correlates the radiomic features (image phenotype) and genetic information (genotype) [11].
There are a few studies that used traditional statistical analysis and machine learning (ML)-based approaches to demonstrate that the EGFR mutation status is correlated with CT scan imaging phenotypes. Regarding the works that employed logistic regression models, Digumarthy et al. [12] obtained an Area Under the Curve (AUC) of 0.73 applying only radiomic features, which increased to 0.79 when clinical data was added. The study by Mei et al. [13] resulted in an AUC of 0.58 and 0.66 implementing only radiomic features and combination of radiomic and clinical features, respectively. Similarly, Liu et al. [14] showed that adding radiomic features to a clinical model resulted in a significant improvement of predicting power, as the AUC increased from 0.67 to 0.71, and Liu et al. [15] showed that using clinical variables combined with CT features (AUC = 0.78) resulted in higher AUC values, compared to using clinical variables alone (AUC = 0.69). Concerning the ML domain, a decision tree was built to predict the presence of EGFR mutations using a combination of four image features (emphysema, airway abnormality, the percentage of ground glass component and the type of tumor margin), resulting in a test set performance of 0.89 AUC [5]. A work based on a deep learning technique with automatic nodule radiomic feature-learning ability achieved an AUC of 0.69 and showed a significant improvement when hand-crafted CT features were combined with clinical characteristics (AUC = 0.81) [10]. Velazquez et al. [16] found a radiomic signature related to radiographic heterogeneity that successfully discriminated between mutant and wild-type EGFR cases (AUC = 0.69). Combining this signature with a clinical model of EGFR status (AUC = 0.70) significantly improved prediction performance (AUC = 0.75). A XGBoost classifier using only radiomic and semantic features from the nodule obtained an AUC of 0.58 and 0.65, respectively [4].
Previous studies have suggested that combining radiomic and semantic data allows the development of integrated predictors that exhibit improved performance compared to learning models that use only one type of data. However, there is an urgent need to create learning models that use only radiomic features that are automatically extracted from images, as these features are more objective than semantic data and allow to reduce the dependency on data annotation by experts in assessing mutational status while reducing human error. Furthermore, the automatic detection of radiomic features from CT images is an important tool to help radiologists with the extensive and exhaustive work of CT annotation and overcome the lack of mass annotated datasets. Generally, the nodule is the main focus for lung cancer malignancy assessment and follow-up based on the well-established Fleischner Society and Lung-RADs Guidelines [17,18], and previous studies have analyzed only this important cluster of tumor cells for EGFR mutation status prediction [10,14,16]. Exploratory studies have recently shown that there is a correlation of EGFR mutation status with other lung diseases, such as emphysema and fibrosis, which seems to indicate that cancer development is related to multiple physiological changes not restricted to the nodule region [19,20], and that the inclusion of extratumor features allows a significant increase in EGFR mutation predictive performance [4,5,21].
Therefore, the current work applies a comprehensive approach for predicting EGFR mutation status, using only objective radiomic features directly extracted from a region of interest (ROI) containing the entire lung where the nodule is located. The hypothesis that it might be possible to develop predictive models with enhanced performance by combining nodule-related features with features from other lung structures is the main motivation for this work. This study includes combinations of six ML classification models-Logistic Regression, Elastic Net, Support Vector Machine (SVM) with linear and radial basis function (RBF) kernels, Random Forest, and Extreme Gradient Boosting (XGBoost)-and nine feature selection methods-Pairwise correlation filter, Principal component analysis (PCA) with feature set variance ranging from 65% to 95%, and QR decomposition, along with a baseline where no feature selection is used.

Materials and Methods
This section presents the dataset used in the current study, the feature extraction techniques and the classification methods employed. The pipeline implemented to predict the EGFR mutation status is represented in Figure 1.

Dataset
In order to fulfill the objectives set for this work, the NSCLC-Radiogenomics Dataset [23] was identified and used, as it comprises three fundamental requirements that were crucial for the development of the present study: thoracic CT scans from NSCLC patients, tumor binary segmentations, and EGFR mutation status labels. This dataset is a publicity available collection of CT images from a NSCLC cohort of 211 subjects collected retrospectively between 2008 and 2012 at Stanford University School of Medicine and Palo Alto Veterans Affairs Healthcare System. The CT scans are paired with patient clinical history and were obtained using different scanner models and scanning protocols, presenting variations in slice thickness from 0.625 to 3 mm (median: 1.5 mm) and X-ray tube current from 124 to 699 mA (mean: 220 mA) at 80-140 kVp (mean: 120 kVp). Additionally, binary nodule segmentation masks are also stored in this database, as well as semantic tumor annotations. This dataset is the only public dataset that comprises paired information on lung-cancer-related gene mutation status and CT data; however, the present study focuses only on EGFR due to its clinical relevance based on approved target therapies. Out of 211 subjects, only 116 lung cancer patients from this database were considered since only these owned tumor binary masks and a EGFR mutation test result (mutant: 20%, wild-type: 80% (see Figure 2)).

Pre-Processing
Firstly, the CT image pixel values were converted from radiodensity values to Hounsfield Units (HU) using the Rescale Slope and Rescale Intercept attributes stored in the metadata of the scans. Then, the entire dataset (including the tumor masks) was resampled to standardize image representations. The space between slices and pixel spacing were set to 1 mm and [1.0, 1.0] mm, respectively, and each slice dimension was calculated to match this new spacing, obtaining the resampled image by interpolation [24]. Additionally, all CT images were normalized between −1000 HU and 400 HU using the min-max normalization method. CT values below −1000 HU, corresponding to the radiodensity of air, were set to 0; values above 400 HU, representing hard tissues, were fixed to 1. A linear transformation was computed to map all the intermediate HU values to the [0,1] range. Lung binary masks were segmented using a 2D lung segmentation model based on the U-Net architecture [25]. These lung regions underwent the same resampling operation to match the dimensions of the corresponding CT, and all images were rescaled to a size of N × 256 × 256 pixels, with N representing the number of slices of the correspondent CT scan. To obtain the CT images with only the lung containing the nodule, the mask of the nodule was superimposed on the lung segmentation in order to identify the lung where the nodule is located in the selected axial CT slice. Then, the image was transformed to consider half of the opposite side as background.

Feature Extraction
The 1316 radiomic features available for this study quantified lung characteristics from CT images and were extracted using the open source package Pyradiomics [22]. Lung voxels were used in the extraction of seven classes of features: shape-based (14 features), first-order (18 features . Shape features examine the size and shape of the ROI, employing only the lung segmentation masks in the calculations. First-order features describe the distribution of voxel intensities within the ROI using basic metrics such as mean, median, range, and standard deviation. GLCM features describe the secondorder joint probability function of the ROI, while GLDM features quantify gray level dependencies in the image. GLRLM features quantify the length of consecutive pixels that have the same gray level intensity and GLSZM features quantify the number of connected voxels that share the same gray level value. Finally, NGTDM features characterize the difference between the gray value of a pixel and the average gray value of the neighboring pixels within a defined distance [26]. The radiomic features were computed both on the original image (107 features) and on images obtained after application of wavelet (744 features) and Laplacian of Gaussian (LoG) (465 features) filters. The wavelet transform applies a wavelet filter to each CT image, which is then decomposed in low and high frequencies into 8 different images. It applies either a high-pass filter (represented as H) or a low-pass filter (represented as L) in each one of the x, y, and z directions: LLH, LHL, LHH, HLL, HLH, HHL, HHH and LLL [27]. The LoG filter yields a derived image for each applied sigma value in order to emphasize areas of gray level change, where sigma defines how coarse the emphasized texture should be [28]. In these studies, five filters with different sigma values were applied (sigma = 1.0 mm, 2.0 mm, 3.0 mm, 4.0 mm, 5.0 mm).

Feature Selection Methods
The radiomic features were subjected to a feature selection process in order to prevent overfitting, improve learning accuracy, and reduce computation time [29]. To this end, we considered three feature selection techniques widely used in the radiogenomics field: pairwise correlation filter, PCA, and QR decomposition. Additionally, a baseline was also implemented where no feature selection method was used to compare the results obtained with and without feature selection.
The pairwise correlation filter removes variables whose pairwise correlation is greater than a specific cutoff. First, a correlation matrix is created with values representing the pairwise correlations for all feature combinations. Then, features that have an absolute pairwise correlation equal to or greater than the cutoff are excluded [30]. After investigating multiple cutoffs, the cutoff value was set to 0.95.
PCA is a method that projects high-dimensional data into a new lower dimensional representation while keeping all relevant linear structure intact. This method generates new, uncorrelated variables that explain a large proportion of the variance in the original feature space [31]. PCA was implemented at seven different cutoffs, where the number of components accounted for percentages of variance in the feature set ranging from 65% to 95%.
The QR decomposition along with an iterative procedure is used to remove features that are linear combinations of others. The feature matrix is decomposed into two matrices: the orthogonal matrix Q and the upper-triangular matrix R. The latter is used to determine which features are linearly dependent so that they can be sequentially removed [32].

Learning Models
Since we intended to explore different types of ML models, the predictive classifying models used in this study are from three different families: linear, nonlinear, and ensemble. The linear classifiers used include the Logistic Regression, Elastic Net, and Linear SVM models, while the nonlinear classifier includes the SVM algorithm with an RBF kernel. Of the ensemble models, the Random Forest and the XGBoost were implemented.
Logistic Regression is a classification model that employs the sigmoid function as a cost function in order to return a probability value that can be mapped to discrete classes [33]. It is one of the simplest ML algorithms and is easy to implement, interpret, and very efficient to train. Elastic Net regression is a penalized linear regression model that imposes a linear combination of regularization penalties to the loss function during training. These norm regularizations include both the L1 and L2 penalties, which are based on the sum of the absolute coefficient values and the sum of the squared coefficient values, respectively [34].
SVM is a discriminative classifier that transforms the original feature space into a higher-dimensional space based on a user-defined kernel function and then finds support vectors to maximize the margin separating the classes. New unlabeled samples are classified according to the side of the hyperplane they lie on [35]. Linear SVMs can only find a decision boundary to classify linearly separable features. On the other hand, when the dataset is separable by a nonlinear boundary, SVM uses nonlinear kernel functions, such as the RBF, to overcome the curse of dimensionality and properly transform the feature space. SVMs are powerful yet flexible supervised ML algorithms that are used in a variety of applications, such as the diagnosis and prognosis of cancer and other diseases [36][37][38][39][40].
Random Forest classifier is a bagging-type ensemble method of decision trees that trains several trees in parallel and aggregates the decisions of individual trees to reach the final prediction. This classifier tends to outperform most other classification methods in terms of accuracy, variance, and bias, without overfitting issues [41]. XGBoost is a decision-tree-based ensemble algorithm that uses a gradient boosting framework and has been widely used in lung cancer studies [4,[42][43][44]. This classifier has been shown to yield superior predictive results using less computing resources in the shortest amount of time compared to other models due to its parallel processing, tree-pruning, sparse data handling, and regularization to present overfitting [45].

Training
Data was randomly split into a training set (80%) and a test set (20%). The training and testing processes were repeated for 50 random splits of the original dataset to investigate data variance and for better performance robustness.
In the NSCLC-Radiogenomic dataset, the EGFR mutant is under-represented, resulting in classifiers with poor predictive performance for this minority class. To overcome the class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied for each fold. This oversampling approach creates new random synthetic minority class instances between the lines that connect each one of the k nearest neighbors of each minority class sample [46]. This data augmentation technique made it possible to obtain a class-balanced training set (consisting of the same number of mutant and wild-type samples), which was then used to train the classifiers.
The classifier's hyperparameters were tuned through a 5-fold cross-validation randomized search on the training data. The range of hyperparameters considered for the six classifiers are presented in Tables S1-S6 of the supplementary material.
Since Elastic Net is a regression model and not a classification model, predictions for this algorithm were performed considering a decision threshold equal to 0.5. This means that normalized predicted probabilities less than 0.5 were assigned to the wild-type class EGFR and probability values greater than or equal to 0.5 were mapped to the mutant class EGFR. On the other hand, for the classification models, the optimal threshold was calculated from the Receiver Operating Characteristic (ROC) curve using the thresholdmoving method. This method uses the original training set to train the model and then moves the decision threshold such that the minority class examples can be more easily correctly predicted.

Performance Metrics
For a better understanding of the predictive performance of each classifier and feature selection method, different evaluation metrics were averaged over the 50 random train-test splits. The AUC was computed, along with precision, sensitivity, and specificity.

Results
The number of radiomic features removed and retained after the implementation of the nine feature selection methods is shown in Table 1.
The hyperparameters that achieved the best performance on the test set for each classifier and feature selection method are shown in Tables S7-S12 of the supplementary material. Furthermore, in the supplementary material, the AUC, precision, sensitivity, and specificity metrics, for each classifier-feature selection method combination, can be found in Tables S13-S18.  Figure 3 gives the mean AUC (average value of all 50 results) for each classifier across the various feature selection methods in a heatmap form, whereas Table 2 summarizes the best performance results obtained in the present study for each one of the six classifiers. SVM with linear kernel, Elastic Net, and Logistic Regression classifiers had the best overall predictive performance and showed the best results when combined with the PCA feature selection method with 70% of variance, presenting AUC values greater than 0.7. It can thus be concluded that these classifiers performed well in predicting the EGFR mutation status. In fact, the highest AUC was obtained with the Linear SVM classifier and the PCA 70% (AUC = 0.737 ± 0.018). Furthermore, the implementation of the Logistic Regression classifier with PCA 70% resulted in the highest precision (Precision = 0.682 ± 0.099) and sensitivity (Sensitivity = 0.699 ± 0.039) values. On the other hand, the XGBoost classifier with PCA 70% achieved the highest specificity result (Specificity = 0.767 ± 0.017). Among all feature selection methods, PCA yielded the highest AUC results on average in five of the six classifiers studied, whereas QR decomposition held lower AUC values. The standard deviations of all results are fairly similar, which shows that no metric had a significantly higher or lower variance than the others. Table 2. Performance results of each classifier with the best AUC. For each combination of classifier and feature selection method, the evaluation metrics AUC, precision, sensitivity, and specificity are presented as mean ± standard deviation. AUC-Area Under the Curve; SVM-Support Vector Machine; XGBoost-Extreme Gradient Boosting; RBF-Radial Basis Function.

Classifier
Feature  Figure 3. Heatmap with the AUC of each classifier/feature selection combination. Dark colors stand for the best results, while light colors represent the worst outcomes.

Discussion
In this study, we assessed EGFR mutation status, using radiomic features extracted from the entire lung volume containing the nodule on CT images. Six ML algorithms were trained and used in conjunction with pairwise correlation, PCA, and QR decomposition feature selection methods. These filtering techniques were employed since using too many features in the classification algorithm can lead to overfitting, in which noise or irrelevant features may exert undue influence on classification decisions. In addition, a baseline without feature selection method was also implemented to compare results obtained with and without feature selection.
The results of this study suggest that the linear learning models-SVM with linear kernel, Elastic Net, and Logistic Regression-perform well with quantitative imaging features as their predictors, whereas the SVM classifier based on the RBF kernel performs poorly. Additionally, the commonly used classifiers, Random Forest and XGBoost, show acceptable performance results. Furthermore, we found that the model less used in radiomics and radiogenomics, but commonly used in genomics-Elastic Net-was one of the best performing classifiers, as shown in a previous study that investigated the ability of various ML classifiers to accurately predict lung cancer nodule status [47]. The present work also indicates that the Linear SVM classifier in conjunction with PCA feature selection with 70% variance should be considered in future EGFR mutation status classification studies. In our opinion, it is crucial to highlight the fact that linear models obtained the best results in the present study, as it may change the direction and extend the use of simpler and more interpretable models in seemingly complex problems.
We also show that, in general, feature selection methods that reduce the number of features prior to model training appear to improve predictive performance compared to previous radiomic analyses [47][48][49][50]. In the radiogenomics field, a large number of features tend not to provide additional information because they are highly correlated and are linear combinations of others. However, in ensemble learning models, such as Random Forest and XGBoost, which perform automatic feature selection, additional feature filtering does not seem to have a significant impact on model performance. Considering the feature selection methods, it was not possible to identify the one that gives better results for all classifiers since the models are defined with different mathematical principles and optimize different parameters; as a consequence, different classifiers provide better results when combined with different numbers of radiomic features. Nevertheless, based on the best results per feature selection method, we recommend considering the PCA method with 70% and 75% of variance over the pairwise correlation filter and QR decomposition, as these two methods seem to remove very important features for model predictions.
Considering approaches that employ only radiomic features, the outcomes of this work show that radiomic characteristics from the entire lung containing the nodule provided better results in EGFR mutation status assessment compared to traditional nodule-based approaches and to other methodologies that consider other lung structures [4,5,14,51]. However, direct performance comparisons between models trained and tested with data from different datasets would not bring a fair discussion point to this study. Nevertheless, the results of the current study indicate that the CT features with the highest correlation with EGFR mutation are from the lung that has the nodule and these are therefore the main contributors to the model decision. It is crucial to highlight these results and further investigate the importance of holistic lung cancer characterization studies, as there are many complex combinations of morphological, molecular, and genetic alterations that occur during lung cancer development that, when taken into account, would allow the development of more accurate classifiers for EGFR mutation status prediction [21].
The biggest limitation of this work is the reduced size of the used dataset, which is unrepresentative of the population characteristics, making it difficult to find a relevant pattern for such a complex problem. Furthermore, deep learning approaches, such as convolutional neural networks (e.g., CNN and 2D-CNN) and recurrent neural networks (e.g., LSTM and BiLSTM), are powerful methods for EGFR mutation status assessment.
However, data limitation does not allow the use of these end-to-end techniques based on deep learning. To increase the chances of better EGFR mutation status predictions, it is necessary to develop reproducible, clinically viable, and robust predictive models that can handle population heterogeneities. For this, a large and heterogeneous cohort of patients affected by lung cancer is crucial, as well as methods capable of coping with data heterogeneity. However, data access and uniform acquisition, along with privacy issues, fees, and data management requirements are the main limitations in clinical and imaging data access. Another limitation relies on the fact that only the most frequent oncogene in lung cancer was studied. As future work, it is important to analyze other feature selection techniques, such as Linear discriminant analysis, which, in addition to being a prediction model, can be used as a dimensionality reduction technique and autoencoder-based architecture. Additionally, it is also important to analyze whether the feature selection methods implemented in the studies can be used, for example, as information gain or mutual information in order to better understand how the selection techniques affect the features and consequently the performance of the learning models. It is also relevant in the future to consider other lung-cancer-related genes in order to obtain a more complete characterization, which would have an important impact on new targeted personalized therapies.

Conclusions
Predicting EGFR mutation status by CT imaging can improve the determination of the most appropriate treatment for each lung cancer patient and is a less invasive alternative compared to the traditional biopsy. This study proposed a comprehensive approach for the classification of EGFR mutation status using only radiomic features extracted from a ROI containing the entire lung where the nodule is located, changing the direction of traditional approaches, which until now have been mainly focused on the nodule. The results obtained with this novel and holistic approach showed that information from more extensive regions of the lung containing the nodule allows for a more complete lung cancer characterization. Our work suggested that Linear SVM, Elastic Net, and Logistic Regression are the most robust models for EGFR mutation status prediction and supports their use by others in future radiogenomics studies. Furthermore, we recommend the application of methods that reduce the number of features prior to model training, in particular, PCA methods, as they seem to improve predictive performance. We also encourage the study and comparison of various features and modeling approaches for predicting EGFR mutation status since improvements in prediction are often achieved when different combinations are utilized. Funding: This work is financed by the ERDF-European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation-COMPETE 2020 Programme and by National Funds through the Portuguese funding agency, FCT-Fundação para a Ciência e a Tecnologia within project POCI-01-0145-FEDER-030263.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data was obtained from the open-access NSCLC-Radiogenomics dataset publicly available at The Cancer Imaging Archive (TCIA) database [23,52]. Imaging and clinical data have been de-identified by TCIA and approved by the Institutional Review Board of the TCIA hosting institution. Ethical approval was reviewed and approved by the Washington University Institutional Review Board protocols. Informed consent was obtained from all individual participants included in the study.

Conflicts of Interest:
The authors declare no conflict of interest.