1. Introduction
Lung cancer continues to represent the deadliest form of cancer worldwide, accounting for nearly 2.2 million newly diagnosed cases and 1.8 million deaths in 2020. It remains the primary cause of cancer-related mortality across both sexes [1,2]. Non-small cell lung cancer (NSCLC) constitutes approximately 85% of all lung cancer diagnoses, with adenocarcinoma being the most frequently encountered histological subtype [3,4].
Within the molecular landscape of NSCLC, a considerable number of tumors harbor oncogenic driver mutations. Among these, alterations in the epidermal growth factor receptor (EGFR) gene play a pivotal role in treatment stratification [5,6,7]. These mutations are identified in 10–15% of NSCLC cases in Western populations and in up to half of patients in East Asia [8]. Two common activating mutations (exon 19 deletions and the L858R point mutation in exon 21) are predictive of tumor sensitivity to tyrosine kinase inhibitors (TKIs), including gefitinib, erlotinib, afatinib, and osimertinib [9,10,11]. The clinical benefits of these agents over traditional chemotherapy have been confirmed through several randomized clinical trials, thereby justifying their integration into global treatment guidelines [12,13,14,15,16].
Despite the availability of targeted treatments, detection of EGFR mutations typically requires molecular testing on tissue samples, often obtained via invasive biopsy. These procedures are not always feasible in patients with comorbidities or advanced disease and can be compromised by tumor heterogeneity, resulting in sampling errors [17,18,19]. Moreover, molecular profiles may evolve throughout therapy, sometimes requiring additional tissue sampling, which poses logistical and clinical challenges [20]. Although liquid biopsy using circulating tumor DNA (ctDNA) has emerged as a minimally invasive alternative, its diagnostic performance may be limited, particularly in early-stage disease or in cases with low tumor burden [21,22].
Radiomics has emerged as a promising tool for extracting high-dimensional, quantitative data from medical images, potentially revealing hidden imaging biomarkers associated with underlying genomic profiles such as EGFR mutation status [23,24,25,26,27,28]. When combined with clinical information and CT-derived morphological indicators, radiomics offers a valuable pathway toward non-invasive tumor genotyping strategies [29].
Numerous prior investigations have explored the application of machine learning algorithms for predicting EGFR mutation status, often combining radiomic features with clinical data to enhance predictive accuracy [30,31,32,33]. While these approaches have shown encouraging results, many remain limited by challenges such as lack of transparency, insufficient external validation, or the use of unenhanced CT images, which may compromise generalizability [34,35,36,37]. With the increasing demand for transparent AI in medicine, combining powerful classifiers (e.g., Random Forests, SVMs, and neural networks) with SHAP-based interpretation has shown promise in delivering explainable and effective diagnostic predictions. However, previous studies leveraging explainable machine learning in oncology have faced several limitations. Many were restricted to either handcrafted radiomic features or purely deep-learning-based pipelines, often requiring large training cohorts and providing limited interpretability in small or heterogeneous datasets. In particular, works such as [38,39,40] focused primarily on demonstrating proof-of-concept explainability without explicitly integrating multimodal clinical, morphological, and radiomic variables into a unified model. Our study addresses this gap by combining these complementary feature domains within an interpretable Random Forest framework, specifically tailored to a modest cohort size. This integration not only enhances predictive robustness but also ensures clinical interpretability through SHAP-based analysis, thereby improving the translational potential of the approach. SHAP assigns a measurable impact to each input feature, helping to demystify how complex models arrive at their decisions, thus improving clinician confidence and regulatory compliance [23].
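To make the notion of a per-feature contribution concrete, the following minimal sketch computes exact Shapley values for a toy scoring function by enumerating all feature coalitions. The function `toy_model`, the zero baseline, and the convention of replacing absent features with their baseline value are illustrative assumptions and are not part of the study's pipeline; SHAP estimates the same quantity far more efficiently for tree ensembles.

```python
from itertools import combinations
from math import factorial

import numpy as np


def toy_model(x):
    # Hypothetical scoring function with one interaction term (not the study's model).
    return 2.0 * x[0] + 1.0 * x[1] * x[2]


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)
    phi = np.zeros(n)

    def value(coalition):
        z = np.array(baseline, dtype=float)
        for j in coalition:
            z[j] = x[j]
        return model(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the model output for the sample and for the baseline, which is the additivity property that makes such attributions readable at the level of an individual patient.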
Studies that combine radiomic features with conventional clinical and imaging variables (such as patient age, smoking history, sex, enhancement characteristics, and pleural retraction) have demonstrated superior prediction performance compared to models based on a single modality [37,41]. In situations where biopsy is unfeasible or molecular diagnostics are not readily accessible, such integrated models could help to identify patients likely to benefit from early TKI intervention and support equitable deployment of personalized treatment strategies in oncology [42,43].
Our work, in this regard, presents an explainable and high-performing machine learning strategy, designed to harness the predictive strength of selected clinical data, CT scan features, and radiomic markers for determining EGFR mutation status in NSCLC patients. In this study, we specifically focused on patients with lung adenocarcinoma, as this NSCLC subtype is most strongly associated with EGFR mutations and represents the standard population for the clinical implementation of targeted therapies. Feature selection was performed using a three-step process involving mutual information, Spearman rank correlation, and the FeatureWiz algorithm, allowing for effective dimensionality reduction while preserving key predictive variables. Random Forest classifiers were employed due to their ability to capture non-linear interactions and their relative robustness to overfitting. Model interpretability was further enhanced using SHAP analysis to visualize and understand the contribution of each feature to the model’s output.
To our knowledge, this study is one of the earliest to implement a validated radiomics-driven machine learning approach for EGFR mutation prediction in NSCLC within a North African cohort. The findings support the practical integration of artificial intelligence into oncology workflows, even in environments with constrained resources, and align with international efforts to broaden equitable access to precision medicine on a global scale.
2. Materials and Methods
2.1. Study Population and Methodological Framework
Data were retrospectively collected from 521 patients with histologically confirmed NSCLC treated at the Hassan II University Hospital. Eligibility was contingent upon the availability of a pre-therapeutic chest CT conducted within 3 months prior to biopsy or surgical intervention, a confirmed EGFR mutation status, and complete clinical data. Only patients with histologically confirmed lung adenocarcinoma were included, as this histological subtype accounts for the majority of EGFR mutations in NSCLC and ensures the biological relevance of the predictive modeling. Patients were excluded if CT data were incomplete, if imaging was compromised by artifacts, or if histology was inconsistent with adenocarcinoma. Based on these criteria, 138 patients were included, of whom 98 had wild-type EGFR and 40 harbored activating EGFR mutations. Model evaluation was conducted exclusively via stratified five-fold cross-validation on the full dataset (n = 138: EGFR-WT = 98, EGFR-Mutant = 40). All preprocessing steps and hyperparameter tuning were performed within the training portion of each fold only, with the validation portion kept unseen to prevent information leakage. The overall patient selection and allocation process is depicted in Figure 1.
A detailed summary of the dataset characteristics, including source, histology, patient distribution, and acquisition period, is presented in Table 1.
The methodological workflow followed in this study consisted of several integrated steps: (i) acquisition of standardized high-resolution chest CT scans (120 kVp, 100–200 mAs, slice thickness 1–1.25 mm, lung and mediastinal kernels, venous contrast phase at ~70 s); (ii) tumor segmentation using semi-automatic tools with expert validation in 3D Slicer (version 5.6.2; an open-source platform for medical image segmentation and radiomics analysis) and ITK-SNAP (version 4.0.1); (iii) extraction of radiomic features with PyRadiomics (version 3.1.0), including shape, intensity, and texture metrics such as the Gray-Level Co-occurrence Matrix (GLCM), Gray-Level Run Length Matrix (GLRLM), Gray-Level Size Zone Matrix (GLSZM), Gray-Level Dependence Matrix (GLDM), and Neighborhood Gray-Tone Difference Matrix (NGTDM); (iv) feature selection via mutual information, Spearman correlation filtering (where pairs of features with r > 0.85 were considered redundant and one feature of each pair was removed to reduce collinearity), and the FeatureWiz algorithm; and (v) training and evaluation of five Random Forest classifiers using different feature subsets. The Spearman threshold (r > 0.85) was applied only for redundancy filtering among radiomic variables; the correlation coefficients reported later in the Results section reflect the individual association of selected features with EGFR mutation status and are naturally more moderate. Performance was assessed with AUC, accuracy, precision, recall, and F1-score. Model interpretability was ensured using SHAP. An overview of the workflow is provided in Figure 2.
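The first two feature selection steps can be sketched as follows, under stated assumptions: features are ranked by mutual information with the label, and a feature is then dropped when its absolute Spearman correlation with an already retained, higher-ranked feature exceeds 0.85. The function name `select_features`, the `top_k` cut-off, the use of the absolute correlation, and the synthetic data are hypothetical; the FeatureWiz step is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def select_features(X, y, top_k=20, corr_threshold=0.85, random_state=0):
    """Rank features by mutual information, then remove redundant ones.

    A feature is dropped when its absolute Spearman correlation with an
    already kept, higher-ranked feature exceeds corr_threshold.
    """
    mi = mutual_info_classif(X.values, y, random_state=random_state)
    ranked = pd.Series(mi, index=X.columns).sort_values(ascending=False)
    candidates = list(ranked.index[:top_k])

    corr = X[candidates].corr(method="spearman").abs()
    kept = []
    for name in candidates:
        if all(corr.loc[name, other] <= corr_threshold for other in kept):
            kept.append(name)
    return kept


# Synthetic stand-in for a radiomic feature table (138 cases, 98 vs. 40).
rng = np.random.default_rng(0)
y = np.array([0] * 98 + [1] * 40)
signal = y + rng.normal(scale=0.5, size=y.size)
X = pd.DataFrame({
    "f_signal": signal,
    "f_duplicate": signal + rng.normal(scale=0.01, size=y.size),
    "f_noise_1": rng.normal(size=y.size),
    "f_noise_2": rng.normal(size=y.size),
})
kept = select_features(X, y, top_k=4)
print(kept)
```

In this toy table, only one of the two nearly identical features survives the redundancy filter, whichever ranks higher by mutual information.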
2.2. EGFR Mutation Analysis
Formalin-fixed, paraffin-embedded (FFPE) tumor tissue specimens were obtained from all patients and deemed suitable for downstream molecular analysis. Histopathological assessment was initially conducted to estimate the percentage of tumor cells within each sample. Subsequently, DNA was extracted from regions enriched in tumor cells using the QIAamp DNA FFPE Tissue Kit (Qiagen, Hilden, Germany).
Depending on tumor cellularity, one of two detection strategies was adopted. Low-cellularity samples were examined via PCR, while those with ≥30% tumor content underwent Sanger sequencing using the BigDye Terminator v3.1 sequencing system (Applied Biosystems, Foster City, CA, USA). EGFR mutations were subsequently classified into three categories: wild-type tumors, which showed no detectable mutations; frequently observed alterations, which included in-frame deletions in exon 19 and L858R substitutions within exon 21; and uncommon or complex mutations, such as G719X in exon 18, S768I, T790M, insertions in exons 19 or 20, and compound variants.
2.3. CT Imaging Protocol and Preprocessing Workflow
A multidetector CT system from Siemens (SOMATOM Definition, Erlangen, Germany) was used to acquire all thoracic CT images, adhering strictly to institutional standard imaging procedures. Raw data were stored in DICOM format and processed using open-source Python libraries, notably pydicom for metadata handling and scipy.ndimage for initial volume manipulations [44,45].
To ensure spatial uniformity, all CT images were resampled to a resolution of 226 × 226 pixels using the Elastix registration module embedded in the 3D Slicer platform (version 5.6.2). This resampling process mitigated variability stemming from acquisition parameters and enabled consistency in radiomic analysis.
Segmentation of the primary tumors was achieved through a hybrid methodology combining manual delineation and semi-automatic approaches. The segmentation was carried out using ITK-SNAP and 3D Slicer, with manual adjustments applied in regions where semi-automatic tools failed to define clear tumor boundaries, especially in peripheral or low-contrast lesions. To enhance reliability and reduce interobserver bias, all segmentations were independently reviewed and validated by two experienced thoracic radiologists, and only those reaching full agreement were retained for analysis.
This rigorous image preprocessing ensured the standardization and reproducibility of the dataset prior to radiomic feature extraction.
2.4. Tumor ROI Delineation and Extraction
Tumor contouring was achieved using three distinct segmentation strategies: a manual delineation method, a semi-automated segmentation with ITK-SNAP, and a supplementary semi-automatic protocol integrated into 3D Slicer [23,39,40]. Each tumor was segmented independently using these three methods. Only ROIs showing consistent boundaries across all approaches were retained as final segmentations. This multi-method validation strategy was crucial for minimizing variability and ensuring the robustness of subsequent radiomic analysis (Figure 3).
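One way to quantify whether three segmentations have consistent boundaries is the pairwise Dice similarity coefficient. The sketch below is an illustration only: the text does not state which agreement criterion was used, so the Dice metric, the 0.90 threshold, and the synthetic cubic masks are all assumptions.

```python
from itertools import combinations

import numpy as np


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * np.logical_and(a, b).sum() / total


def masks_agree(masks, threshold=0.90):
    """Return (agreement flag, pairwise scores); threshold is a hypothetical choice."""
    scores = [dice(a, b) for a, b in combinations(masks, 2)]
    return min(scores) >= threshold, scores


# Synthetic 3D masks standing in for the manual, ITK-SNAP, and 3D Slicer contours.
manual = np.zeros((20, 20, 20), dtype=bool)
manual[5:15, 5:15, 5:15] = True
itk_snap = np.zeros_like(manual)
itk_snap[5:15, 5:15, 5:16] = True
slicer = np.zeros_like(manual)
slicer[5:15, 5:15, 4:15] = True

agree, scores = masks_agree([manual, itk_snap, slicer])
print(agree, [round(s, 3) for s in scores])
```

A volume-overlap criterion of this kind is easy to audit, although boundary-distance measures could be preferable for small or irregular lesions.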
For each patient, the region of interest was defined as the entire primary lung tumor volume, excluding adjacent atelectasis, vascular structures, or pleural tissue. This ensured that the radiomic features extracted were specific to tumor tissue and avoided contamination by surrounding anatomical structures.
2.5. Description of Clinical, Morphological, and Radiomic Characteristics
2.5.1. Demographic and Clinical Characteristics
Patient electronic medical records were retrospectively reviewed to extract clinical and demographic variables such as age, sex, smoking history, and EGFR mutation status. Prior to analysis, all variables were thoroughly reviewed to ensure data completeness, accuracy, and internal consistency across the dataset.
2.5.2. CT Morphological Features
Radiomic feature extraction was performed using PyRadiomics, an open-source and extensively validated library for high-throughput quantitative medical image analysis. The process was implemented through a custom script within the SlicerRadiomics™ interface (v2.10; accessed on 25 May 2023), ensuring consistent and reproducible workflows across the dataset.
The extracted features included shape-based descriptors quantifying tumor geometry and spatial properties, first-order statistics summarizing voxel intensity distributions, and texture features derived from GLCM, GLRLM, GLSZM, and GLDM. These metrics capture complex patterns of intra-tumoral heterogeneity, reflecting spatial voxel arrangements, edge sharpness, and structural irregularities. All features were extracted from validated ROIs using standardized settings and organized into structured matrices for subsequent machine learning analysis.
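As an illustration of what a texture feature measures, the following sketch builds a gray-level co-occurrence matrix for a single in-plane offset and derives the contrast feature from it. It is a simplified stand-in rather than the PyRadiomics implementation: it assumes a 2D image already quantized to integer gray levels, uses one offset only, and ignores the ROI mask.

```python
import numpy as np


def glcm(image, levels, offset=(0, 1)):
    """Symmetric, normalized co-occurrence matrix for one pixel offset."""
    d_row, d_col = offset
    rows, cols = image.shape
    matrix = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r, c], image[r2, c2]
                matrix[i, j] += 1
                matrix[j, i] += 1  # count each pair in both directions
    return matrix / matrix.sum()


def glcm_contrast(p):
    """Contrast: co-occurrence probabilities weighted by squared gray-level difference."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))


uniform = np.zeros((4, 4), dtype=int)
checkerboard = np.indices((4, 4)).sum(axis=0) % 2

print(glcm_contrast(glcm(uniform, levels=2)))       # 0.0: no local variation
print(glcm_contrast(glcm(checkerboard, levels=2)))  # 1.0: every neighbor pair differs
```

A homogeneous region yields zero contrast, whereas a region in which adjacent pixels always differ yields the maximum for two gray levels, which is the sense in which such metrics reflect intra-tumoral heterogeneity.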
2.5.3. Extraction of Radiomic Features
To evaluate whether limiting the training dataset to only the most informative features could improve the predictive performance of the model, we designed five distinct datasets. These datasets were generated by systematically combining clinical, CT, and radiomic variables (either in full or after applying relevance-based feature selection).
Subsequently, a total of five Random Forest (RF) classifiers [46,47,48] were developed. Each model was trained using a unique feature combination derived from one of the five constructed datasets.
Random Forests were chosen as the core classification method owing to their robustness in managing mixed-type and high-dimensional datasets, as well as their ability to mitigate overfitting in relatively small and imbalanced cohorts. In addition, they provide transparent feature importance measures, which align with the study’s objective of developing an interpretable and clinically applicable predictive model.
All five models were trained independently to predict EGFR mutation status. The specific feature configurations used for each classifier are summarized in Table 2.
To further ensure robustness and reproducibility, all Random Forest models underwent hyperparameter tuning via a five-fold stratified cross-validated grid search. The search space included n_estimators (100 to 1000), max_depth (3 to 20), min_samples_split (2 to 10), and min_samples_leaf (1 to 5). The AUC metric was used as the optimization criterion. The final selected hyperparameters for each of the five models are summarized in Supplementary Table S1.
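A grid search of this kind can be sketched with scikit-learn as shown below. The synthetic dataset and the coarse grid (a small subset of the stated ranges, chosen to keep the run short) are assumptions for illustration; the actual grid points used in the study are not given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in with a similar size and class ratio (about 71% vs. 29%).
X, y = make_classification(
    n_samples=138, n_features=11, n_informative=5,
    weights=[0.71, 0.29], random_state=0,
)

# Coarse illustrative grid taken from within the ranges stated above.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 20],
    "min_samples_split": [2],
    "min_samples_leaf": [1, 5],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # AUC as the optimization criterion
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

To keep the tuning free of leakage, a search of this form would be run on the training portion of each outer fold rather than on the full dataset.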
This comparative modeling approach allowed us to assess the relative contribution of each feature type (clinical, morphological (CT), and radiomic), as well as the impact of dimensionality reduction, through selection of the most relevant predictors.
2.6. Cross-Validation Framework for Model Training and Evaluation
To maintain balanced class representation and ensure reproducibility in model evaluation, we adopted a stratified five-fold cross-validation strategy. The full cohort (n = 138: EGFR-WT = 98; EGFR-Mutant = 40) was partitioned into five folds, each preserving the relative distribution of mutation status. In each iteration, four folds (≈80% of the data) were used for model training and one fold (≈20%) was reserved for validation. This process was repeated until every fold had served once as the validation set.
All preprocessing steps (feature selection, correlation filtering, and FeatureWiz selection) and hyperparameter tuning were performed within the training portion of each fold only, ensuring that the corresponding validation fold remained unseen, thus preventing information leakage. Performance metrics are reported as mean ± standard deviation across folds, providing more stable and reproducible estimates under class imbalance. Robustness was further verified through repeated cross-validation with different random seeds and bootstrap confidence intervals. The aggregated confusion matrices are provided in Supplementary Table S2.
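The principle of fitting every data-dependent step on the training folds only can be illustrated with a scikit-learn pipeline, in which the selector is refitted inside each fold. This is a minimal sketch on synthetic data: `SelectKBest` with mutual information stands in for the full three-step selection, and the values `k=10`, 200 trees, and three repeats are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(
    n_samples=138, n_features=50, n_informative=6,
    weights=[0.71, 0.29], random_state=0,
)

# The selector is part of the pipeline, so it only ever sees training folds.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# Stratified five-fold cross-validation, repeated with different shuffles.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=cv)
print(f"AUC: {scores.mean():.3f} ± {scores.std():.3f} over {len(scores)} folds")
```

Selecting features on the full dataset before splitting would let validation labels influence the selected subset and would tend to inflate the reported AUC.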
2.6.1. Addressing Class Imbalance and Data Limitations
Because the training dataset exhibited a marked imbalance, with a lower proportion of EGFR-mutated cases, the Synthetic Minority Over-Sampling Technique (SMOTE) was applied during preprocessing. This algorithm generates synthetic minority samples through interpolation, thereby balancing the dataset and improving the generalizability of the models. SMOTE has been widely adopted in biomedical machine learning for mitigating class imbalance, and its use here ensured a more reliable learning process.
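The interpolation step at the core of SMOTE can be sketched as follows. This is a simplified, hand-written illustration for a binary problem and not the implementation used in the study; the function name and parameters are hypothetical, and in practice an established implementation such as the SMOTE class in the imbalanced-learn package would normally be used. Consistent with the leakage precautions described above, the sketch is applied to a training fold only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, minority_label=1, k=5, random_state=0):
    """Balance a binary dataset by interpolating between minority neighbors."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y

    # Ask for k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_needed)
    chosen = neighbors[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))  # position along the segment, in [0, 1)
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])

    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out


# Synthetic training fold: 78 wild-type and 32 mutant cases, 4 features.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(110, 4))
y_train = np.array([0] * 78 + [1] * 32)
X_bal, y_bal = smote_like_oversample(X_train, y_train)
print(X_bal.shape, np.bincount(y_bal))
```

Each synthetic sample lies on the segment between a minority case and one of its nearest minority neighbors, so no value outside the observed minority range is created.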
2.6.2. Data Augmentation Strategies
To further strengthen model robustness and reduce overfitting, classical image augmentation techniques were employed. These included random horizontal flipping, rotations of up to 25°, horizontal and vertical translations to simulate spatial shifts, random occlusion to introduce noise, and Gaussian perturbations to mimic intensity variability.
These augmentation strategies were deliberately chosen over more advanced approaches such as CutMix, MixUp, or CutOut. While such methods have demonstrated effectiveness in natural image classification, they may generate anatomically implausible CT slices and distort radiomic texture patterns, thereby reducing model interpretability. The selected classical transformations preserve anatomical realism and maintain the integrity of radiomic descriptors, ensuring both robustness and the clinical relevance of the predictive models.
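The classical transformations listed above can be sketched with scipy.ndimage as shown below. Only the 25° rotation limit comes from the text; the shift range, patch size, noise level, flip probability, and the synthetic input slice are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage


def augment_slice(image, rng, max_angle=25.0, max_shift=5.0,
                  patch_size=8, noise_sd=0.01):
    """Apply flip, rotation, translation, occlusion, and Gaussian noise to a 2D slice."""
    out = image.astype(float)

    # Random horizontal flip.
    if rng.random() < 0.5:
        out = np.fliplr(out)

    # Rotation of up to max_angle degrees, keeping the original matrix size.
    angle = rng.uniform(-max_angle, max_angle)
    out = ndimage.rotate(out, angle, reshape=False, order=1, mode="nearest")

    # Horizontal and vertical translation (in pixels).
    shift = rng.uniform(-max_shift, max_shift, size=2)
    out = ndimage.shift(out, shift, order=1, mode="nearest")

    # Random occlusion: overwrite a small square patch with the minimum intensity.
    r = rng.integers(0, out.shape[0] - patch_size)
    c = rng.integers(0, out.shape[1] - patch_size)
    out[r:r + patch_size, c:c + patch_size] = out.min()

    # Gaussian intensity perturbation.
    return out + rng.normal(scale=noise_sd, size=out.shape)


rng = np.random.default_rng(0)
slice_2d = rng.random((64, 64))  # synthetic stand-in for a normalized CT slice
augmented = augment_slice(slice_2d, rng)
print(augmented.shape)
```

All operations preserve the matrix size, so augmented slices can pass through the same downstream processing as the originals.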
2.6.3. Summary of the Training Approach
The final training strategy therefore combined three main elements: balanced data through SMOTE, expansion of the dataset using slice-based modeling, and generalization enhancement via data augmentation. This integrated approach was specifically designed to maximize classifier performance despite the relatively limited number of available patients.
To support the feasibility of clinical translation, we also assessed the computational cost of all Random Forest models. Training and evaluation were performed on a workstation equipped with an Intel Core i7-12700 CPU, 32 GB RAM, and an NVIDIA RTX 3060 GPU. Training time ranged from 20 to 45 s depending on the model, with inference requiring less than 0.1 s per patient. Memory usage remained below 3 GB across all models. These results indicate that the approach is efficient relative to more resource-intensive deep learning methods, reinforcing its suitability for clinical translation.
4. Discussion
4.1. Key Outcomes and Relevance for Clinical Practice
This study presents a comprehensive machine learning approach that integrates clinical data, CT-derived morphological indicators, and radiomic signatures to enable the non-invasive prediction of EGFR mutation status in individuals diagnosed with non-small cell lung cancer (NSCLC). Leveraging a Random Forest (RF) classification approach, five predictive models were evaluated based on distinct combinations of features. Feature selection played a central role in optimizing performance, using a three-step pipeline that combined mutual information analysis, Spearman correlation filtering to reduce redundancy, and the FeatureWiz algorithm, which retained only the most discriminative variables among the 1034 initially extracted radiomic features.
Among the models evaluated, Model 5, which was trained solely on the most predictive feature subset, including two clinical variables (Age, Tobacco use), two CT morphological characteristics (Enhancement pattern, Presence of nodules in the same lobe), and seven selected radiomic features, achieved the strongest predictive performance, reflected by an AUC of 0.91 (95% CI: 0.81–1.00). It also delivered high values in other classification metrics, including precision, recall, and F1-score. To improve model transparency, SHapley Additive exPlanations (SHAP) analysis was applied, offering insight into both global and patient-specific feature contributions and revealing interactions among key variables.
This approach highlights the potential of non-invasive imaging biomarkers to serve as surrogates for molecular testing, particularly in clinical scenarios where tissue sampling is limited or risky [27,28,44,45]. Early and accurate identification of patients with EGFR mutations may support the timely initiation of tyrosine kinase inhibitors (TKIs), which have demonstrated improved outcomes in mutation-positive NSCLC patients [50,51,52].
4.2. Current Methodological Constraints and Future Research Directions
Although the outcomes are encouraging, it is important to recognize certain methodological constraints and unresolved issues that may affect the model’s generalizability and clinical applicability. First, the study design is retrospective and single-center, which limits the external validity of the model. To establish generalizability and clinical utility, prospective validation in multi-institutional cohorts is required. Additionally, although SMOTE-based oversampling and data augmentation techniques were used to mitigate class imbalance and limited sample size, these strategies may not fully capture the variability found in real-world populations.
A further limitation is the modest size of the EGFR-Mutant subgroup (n = 40). This reflects the real-world difficulty of prospectively collecting large balanced datasets in oncology, particularly in North Africa. While this small subgroup size makes single-split evaluations unstable, the use of stratified cross-validation allowed us to obtain more robust and reproducible estimates. Future multi-center studies with larger populations will be essential to confirm reproducibility.
An additional limitation concerns the SHAP analysis of tobacco use, which paradoxically emerged as a positive contributor to predicting EGFR mutation. This finding contradicts established clinical evidence, as EGFR mutations are more prevalent among non-smokers. After verifying the dataset, no variable misencoding was identified, suggesting that this apparent contradiction reflects cohort-specific bias, given the relatively small number of mutant cases (n = 40). This emphasizes that SHAP contributions in small datasets may capture sample-specific distributions rather than general biological associations. Future validation on larger and more diverse cohorts will be required to determine whether this effect persists or diminishes.
The integration of radiomics into clinical workflows remains a challenge due to the lack of standardization in radiomic feature extraction and preprocessing protocols, which affects reproducibility across platforms and scanners [52]. Furthermore, clinical integration would require seamless interoperability with existing hospital information systems and radiology workflows, which may involve significant technological and organizational adaptations.
Future work should aim to meet the following objectives: (i) validate the model in diverse populations with different genetic and demographic backgrounds; (ii) assess the longitudinal predictive performance across time points and treatment phases; and (iii) explore the integration of deep-learning-based radiomics or hybrid approaches combining handcrafted and deep features.
Furthermore, while SHAP was used in this study primarily for post hoc model interpretability, its integration into the feature selection pipeline could provide an additional validation layer for identifying the most predictive variables. Comparing SHAP-based feature selection with our current statistical and algorithmic methods may enhance robustness and offer deeper insights into the biological relevance of selected radiomic descriptors. We recognize this as an important avenue for future work.
In addition, future research should compare the performance of Random Forest classifiers with other machine learning algorithms, such as support vector machines, gradient boosting, and deep learning architectures, particularly when larger multicenter datasets become available. This would allow a more comprehensive benchmarking of classification strategies and enhance the generalizability of the proposed framework.
Another methodological consideration concerns the choice of learning framework. Although deep learning approaches, such as convolutional neural networks (CNNs), have demonstrated excellent performance in image-based predictive tasks, their application typically requires large-scale datasets to avoid overfitting. Given the relatively modest cohort size in the present study (n = 138), we opted for Random Forest classifiers, which are more suitable for small and heterogeneous datasets while offering interpretable outputs for clinical translation. Nevertheless, future work with larger, multi-institutional cohorts will investigate CNN-based architectures and hybrid models, which combine handcrafted radiomic features with deep feature representations, in order to benchmark their performance against our current machine learning pipeline.
4.3. Contribution, Relevance, and Context
To our knowledge, this is the first study conducted in Morocco that investigates the non-invasive prediction of EGFR mutation status in NSCLC patients through the application of machine learning techniques combined with multimodal-imaging-derived biomarkers. Our findings are particularly relevant in a context where biopsy access is often limited, and where genetic profiling may not be routinely available.
While our cohort’s EGFR mutation prevalence (28.99%; 40 of 138 patients) is lower than rates reported in East Asian populations, it remains aligned with those observed in Middle Eastern, North African, and European cohorts [53]. This reinforces the relevance of developing region-specific predictive models that account for population-specific clinical and genetic profiles.
Importantly, the study provides a foundational framework for future research and clinical implementation of AI-assisted decision support tools in Moroccan oncology settings. By combining CT-derived radiomics with accessible clinical data, this model offers a promising direction toward personalized, non-invasive diagnostics in lung cancer care, particularly for patients for whom tissue sampling is contraindicated or unfeasible.
5. Conclusions
This work presents an automated and robust approach for the non-invasive prediction of EGFR mutation status in patients with non-small cell lung cancer (NSCLC). By combining clinical data, CT-based morphological characteristics, and quantitative radiomic features, we constructed a machine learning model capable of reliably identifying patients with a high probability of carrying EGFR mutations.
Such a predictive tool may serve as a valuable adjunct in clinical workflows, especially in scenarios where histopathological confirmation is not feasible or molecular testing is delayed. By facilitating the early identification of patients eligible for tyrosine kinase inhibitor (TKI) therapy, this approach holds promise for improving the timeliness and personalization of NSCLC treatment.
Among the five models assessed, the best performance was achieved by the one trained on a targeted selection of features, comprising two clinical variables, two CT morphological attributes, and seven radiomic descriptors. This optimized feature combination led to notable improvements across all performance indicators, including enhanced precision, recall, and overall classification accuracy.
Overall, the method shows promise as a clinical decision support tool wherever tissue sampling is constrained or molecular diagnostics are inaccessible.
Recommendations and Future Research: To strengthen clinical translation, future studies should focus on multi-center validation with larger and more diverse patient cohorts, particularly including non-adenocarcinoma NSCLC subtypes. Prospective studies are also warranted to evaluate real-world performance and integration into routine oncology workflows. Additionally, the combination of radiomics with liquid biopsy markers and advanced deep learning approaches could further improve predictive accuracy and generalizability. Complementary comparisons with alternative machine learning algorithms such as XGBoost and artificial neural networks (ANNs) are also recommended to benchmark performance and to enhance model robustness.