A Machine Learning and Radiomics Approach in Lung Cancer for Predicting Histological Subtype

Lung cancer is one of the deadliest diseases worldwide. Computed Tomography (CT) images are a powerful tool for investigating the structure and texture of lung nodules. For a long time, trained radiologists have performed the grading and staging of cancer severity by relying on radiographic images. Recently, radiomics has been changing the traditional workflow for lung cancer staging by providing the technical and methodological means to analytically quantify lesions so that more accurate predictions could be performed while reducing the time required from each specialist to perform such tasks. In this work, we implemented a pipeline for identifying a radiomic signature composed of a reduced number of features to discriminate between adenocarcinomas and other cancer types. In addition, we also investigated the reproducibility of this radiomic study analysing the performances of the classification models on external validation data. In detail, we first considered two publicly available datasets, namely D1 and D2, composed of n = 262 and n = 89 samples, respectively. Ten significant features, according to univariate AUC evaluated on D1, were retained. Mann–Whitney U tests recognised three of these features to have a statistically different distribution, with a p-value < 0.05. Then, we collected n = 51 CT images from patients with lung nodules at the Azienda Ospedaliero—Universitaria “Policlinico Riuniti” in Foggia. Resident radiologists manually annotated the lung lesions in images to allow the subsequent analysis of the malignancy regions. We designed a pipeline for feature extraction from the Volumes of Interest in order to generate a third dataset, i.e., D3. Several experiments have been performed showing that the selected radiomic signature not only allowed the discrimination of lung adenocarcinoma from other cancer types independently from the input dataset used for training the models, but also allowed reaching good classification performances also on external validation data; in fact, the radiomic signature computed on D1 and evaluated on the local cohort allowed reaching an AUC of 0.70 (p < 0.001) for the task of predicting the histological subtype.


Introduction
The Global Cancer Observatory estimated that lung cancer was the leading cause of cancer death in the world's population [1]. Phenotyping lung cancer, or cancer in general, has been demonstrated to be crucial for clinical practice and medical research; in fact, in recent years, a profound effort in designing and employing computational solutions for phenotyping pathologies has been made [2][3][4].
Nowadays, such characterisation of pathologies is necessary to move toward the so-called Precision Medicine. It is a modern paradigm for diagnosing or treating cancers, or diseases in general, based on the identification and characterisation of pathology-specific characteristics, taking into account the individual variability, i.e., the genetic factors of each subject, their lifestyle and living environment. Made possible also by technical advances in computational sciences, Precision Medicine is definitely revolutionising the healthcare domain. This approach to cancer handling, in fact, aims to precisely target diseases, including lung carcinoma, with a per-subject approach [5].
Medical imaging methodologies, including Computed Tomography (CT), Positron Emission Tomography (PET), or Magnetic Resonance (MR), are crucial for performing clinical tasks related to cancer diagnosis, treatment, or follow up [6][7][8]. Medical imaging has already been demonstrated to be essential in the Precision Medicine framework thanks to its capability to improve the knowledge about the clinical phenomenon under consideration (regardless of its nature) [9]. In addition, a considerable amount of recent literature has already shown the capabilities of functional imaging methods to build an in-vivo representation of tumour processes from a biological perspective [10]. This characteristic made imaging methods good candidates for allowing the identification of biosignatures for phenotyping the tumour.
Radiomics is an emerging method based on algorithms for data characterization, which allows the extraction of a large number of features from medical images. By exploiting the information carried out by such features, radiomics approaches aim to uncover and quantitatively describe tumoral patterns and characteristics, otherwise not observable through traditional algorithms for image analysis [11][12][13][14][15]. This new perspective for extracting phenotypic information from imaging, which is already performed for biomedical signals, such as electromyography or electroencephalography [16,17], may thus provide valuable information for setting up personalised approaches for therapies.
Several authors carried out radiomics-based studies in the context of characterising solid cancers [18,19] or lesions [20] in the lung region. It also allowed the design and implementation of several applications in oncology, such as solid cancer, glioblastoma [21], hepatocellular carcinoma [22] and breast cancer [23] classification, among many others.
While the description of images by a large number of features may help in the comprehension of the underlying phenomena, it may also lead to several drawbacks. In fact, due to the high dimensionality of radiomic features, dimensionality reduction and clustering techniques may be needed to improve the classification and generalisation capabilities of automatic systems for supporting decisions [24].
Parmar et al. performed analyses to extract clusters of radiomic features and prognostic signatures specific for lung and head and neck (H&N) cancers [25]. In their work, the authors performed consensus clustering before classifying tumour phenotypes, revealing eleven stable clusters of radiomic features for lung tumour classification, also showing associations with clinical parameters.
Besides the high dimensionality of data, classification approaches based on radiomics suffer from other data-related problems, including the "big-p, little-n" problem, feature redundancy, and class unbalancing [24]. Zhang et al., trying to address such problems, compared many feature selection techniques and predictive models to improve the radiomicsbased prognosis of patients with non-small cell lung cancer (NSCLC) [26]. With respect to the problem of feature redundancy, their analysis showed that Random Forests (RF) were the optimal predictive model, whereas Principal Component Analysis (PCA) was the optimal feature selection method. The Synthetic Minority Over-sampling (SMOTE) technique was employed to mitigate the problem class unbalancing, significantly increasing the predictive accuracy.
In this work, we designed and implemented a quantitative approach based on a radiomics pipeline to classify lung nodules between adenocarcinoma (LUAD) cases and other histological classes from unenhanced CT images. The overall radiomics pipeline consists of the following stages, as reported in Figure 1. First, we collected CT images of patients with lung cancer in a cohort from the Azienda Ospedaliero-Universitaria "Policlinico Riuniti" in Foggia, Italy. Images were manually segmented by expert radiologists to identify and extract the Regions of Interest (ROIs) to process. Radiomic features were then extracted from the ROIs. To do this, we used PyRadiomics, an open-source python package for the extraction of radiomics features from medical images [27], which also has the advantage of increasing reproducibility among different studies, thanks to the adoption of the Imaging Biomarker Standardization Initiative (IBSI) [13], to define and compute the features. Statistical analyses for selecting features based on their discriminative power were accomplished, and predictive models were set up on the extracted feature data to make the decision on the histological subtype. Finally, external data were considered to validate the predictive model. Concerning the task of lung cancer phenotyping, several works could be found in the literature. For example, Ferreira-Junior et al. studied which quantitative features coming from contrast-enhanced CT scans would be more helpful in performing associations with histopathological data, such as the histological subtype classification [28]. However, contrast-enhanced imaging analyses include, albeit minimal, risk factors also related to the injection of the contrast medium. To overcome this, in this work, we analyse radiomic features on unenhanced CT scans for making decisions.
Linning et al. also included unenhanced CT scans in their work [29]. To demonstrate the validity of their approach, the authors performed a ten-fold cross-validation. However, validating the models on external and independent datasets is crucial in radiomics studies [30]. Instead, a similar work, which validated predictive models considering external data, was the one by Wu et al. [31]. The authors, in fact, developed a Naïve Bayes classifier and validated it with external datasets. However, in their work, Wu et al. considered images acquired in the Netherlands only. In our work, instead, external validation is performed using image data obtained in different countries.
To do this, we included in our analysis data from two cohorts employed in the work by Grossmann et al., who made it publicly available in terms of clinical data, radiomics features and genetic characteristics [32]. Grossmann et al. discovered a relationship between imaging features, immune response, survival and inflammation. They found that imaging features have predictive value for specific pathways; also, they concluded that a combination of clinical information, radiomics and genetic biomarkers improve the prognostic predictive performance, showing the complementary value of these data [32]. In this work, we considered only the radiomics features.
The rest of this paper is organised as follows: Section 2 describes materials, i.e., the characteristics of the considered CT images. Section 3 describes the methodology adopted in this work, including feature extraction, a feature reduction strategy, and the experiments' description for training the optimal classifiers. Section 4 discusses the experimental results. Finally, Section 5 draws the final remarks about the conducted study and delineates ideas for future works.

Materials
To implement and validate the radiomics approach detailed in Section 3, we used three different datasets of radiomic features. We first considered the two datasets analysed in a previous study by Grossman et al. and which have been made publicly available [32]. They consist of two independent cohorts of North American and European patients with lung cancer. These datasets have been considered in order to assess the validity of the radiomic approach for phenotyping lung cancer. Specifically, Dataset1 (D1) contains data from 262 patients treated within the Thoracic Oncology Program at the H. Lee Moffitt Cancer Center, Tampa, FL, USA. The histological analysis was available for 224 patients; 129 subjects (57.6%) had adenocarcinoma, whereas the others (42.4%) suffered from cancer forms other than adenocarcinoma (i.e., 61 patients had squamous carcinoma, whereas the remaining were not further categorized). Dataset2 (D2) includes data from 89 patients treated at the MAASTRO Clinic in the Netherlands. The analysis for revealing the histological cancer subtype was available for 87 patients; it showed that 42 subjects (48.3%) had adenocarcinoma, whereas 45 subjects (51.7%) experienced other cancer types, of which 33 (37.9%) were squamous carcinoma.
Lastly, we collected 51 CT scans from patients with lung nodules provided by the Azienda Ospedaliera-Universitaria "Policlinico Riuniti" in Foggia. This cohort included 29 patients (56.9%) affected by adenocarcinoma and 22 patients (43.1%) having other cancers, including squamous carcinoma, large cell lung carcinoma, chronic benign inflammation, hepatocarcinoma metastasis, intestinal adenocarcinoma metastasis and endocrine small cell carcinoma). This cohort included only unenhanced CT scans. This choice was made to verify whether the discrimination of lung cancer was also possible from this kind of image, also considering that, in this way, the patient is exposed to a lower amount of radiation. All patients were 18 to 85 years old and accidentally discovered solitary pulmonary nodules 10 to 50 mm in size diagnosed with Computed Tomography. The reports of the histological examinations were collected and the malignancy of the nodules was documented with the relative genetic panel. Subjects with incomplete clinical data and an absence of histological examination in nodules with highly suspected CT criteria of malignancy were excluded.
These CT images were segmented manually by a radiology resident from Azienda Ospedaliera-Universitaria "Policlinico Riuniti" in Foggia using the ITK-Snap Software [33], and radiomic features have been subsequently extracted following the pipeline detailed in Section 3. Figure 2 shows nine slices randomly selected from the CT scans included in D3, with the relative masks. Dataset1 and Dataset2, instead, have been shared by Grossman et al. in terms of radiomic features [32], thus no image processing steps were required. Table 1, instead, summarises the characteristics of the datasets in terms of sample size, imaging modality and type of available data.

Methods
The workflow designed and implemented in this work included three steps, namely features extraction, features selection and classification. Specifically, the features extraction step was performed only for creating Dataset3, since Dataset1 and Dataset2 already included features. Based on Dataset1, univariate statistical methods were implemented in order to select a reduced number of features that allowed the discrimination between lung adenocarcinoma and other cancer types. Then, classification models were evaluated in different training and validation conditions in order to classify LUAD and other cancer types, and investigate the reproducibility of this study on external validation data. The following paragraphs describe in detail the radiomic features constituting the three datasets, the features selection procedure and the classification approaches.

Radiomic Features
Radiomic features are quantitative descriptors extracted from images using several algorithms. They express different levels of complexity, from local characteristics to global ones. Gillies et al. distinguished radiomic features into two main categories, i.e., "semantics" and "agnostics" [34]. Semantics are those features commonly used by radiologists to describe ROIs visually. Agnostic features, instead, are quantitative descriptors obtained by mathematical operations on the data, which do not necessarily have a meaning in terms of imaging characteristics. These descriptors include first-, second-, or higher-order statistical indicators and shape features.
Radiomic features can be divided into different classes, i.e., intensity-, morphologicaland textural-based characteristics. The intensity-based features, also known as first-order features, describe the distribution of the intensity values in the ROIs based on the computation of the intensities histogram; such features do not take into consideration the spatial relationship between pixels, or voxels in the case of 3D Volumes of Interest (VOIs). Morphological features, instead, describe the geometric characteristics of the region. Textural-based features, also known as second-order statistics, describe the spatial composition of the intensity levels; they are defined starting from several data structures, such as the Gray Level Co-occurrence Matrix (GLCM) [35], the Gray Level Size Zone Matrix (GLSZM) [36,37], the Gray Level Run Length Matrix (GLRLM) [38], the Neighboring Gray Tone Difference Matrix (NGTDM) [39] and the Gray Level Dependence Matrix (GLDM) [40].
In this work, we used only agnostic features; in particular, we considered only 3D features extracted from VOIs. After having been segmented manually in CT images, VOIs of the tumoral area were reconstructed as a 3D volume before the features extraction step. To extract the radiomic features, we exploited the open-source Python framework PyRadiomics, which implements methods for extracting radiomic features in compliance with the definition present in IBSI [13]. The image processing and features extraction steps were performed in accordance with the analogous procedures described in Grossman et al. [32]. We extracted features from both the original VOIs and the filtered VOIs. In particular, we considered both the image after having applied a wavelet transform, where eight decompositions were obtained with Coiflets from 3D volumes, and the image after filtering with the Laplacian of Gaussian (LoG). The sigma parameter for the LoG filter varied from 0.5 to 5, with ticks spaced by 0.5 each.

Feature Selection
Before setting up the classification stage, data were preliminarily processed. In the first phase, we applied z-score normalisation. Then, we considered techniques for reducing the dataset dimensionality.
We exploited the algorithm presented in Bevilacqua et al. [20] to eliminate the features of D1 mostly correlated among them. This algorithm iterates over each couple of features and discards those with a correlation value higher than a threshold, set to 0.5. The procedure allowed us to retain 29 uncorrelated features. Figure 3 shows the correlation matrices before and after the aforementioned features reduction phase. In particular, it should be noted that clusters of correlated features, evident in Figure 3a, are not present anymore in Figure 3b.
In order to determine the discriminating relevance of each feature, we performed on D1 a cross-validation procedure with a univariate logistic regressor, assessing the mean AUC obtained. We then selected the features which satisfied the following equation: where m is the number of features. Figure 4 shows the results of the univariate logistic regression analysis. The choice of employing 1/2 standard deviations as features inclusion criterion allowed us to not discard too many of them in the univariate analysis, since they could result in being helpful in the subsequent multivariate model for the classification task. This analysis allowed us to retain 10 features in D1. The same features were also retained from D2 and D3.

Statistical Analysis
A statistical analysis was conducted in order to investigate how the features distribute in D1 between patients with adenocarcinoma and patients characterised by other histological subtypes. The Mann-Whitney U statistical test was used for unpaired comparisons between the features of subjects with adenocarcinoma and subjects with other histological types of cancer. A correction for multiple testing, using the Benjamini-Hochberg method was conducted for all the resulting p-values. Corrected p-values lower than 0.05 were considered significant. The statistical analysis revealed three significant features, namely the original_glcm_Autocorrelation (p = 0.001), the wavelet-LHH_firstorder_Median (p = 0.003) and the original_firstorder_Mean (p = 0.016). Figure 5 shows the box plots for the features selected in the features reduction step. In particular, the distributions of the statistically significant features are different among patients with adenocarcinoma with respect to subjects with other histological types of cancer. For visualisation purposes, normalised values have been clipped in the range [−4, 4], so that outliers do not alter the range of visibility of the boxplots. This difference between the distributions of the statistically significant features can also be seen in Table 2, where summary statistics for the distribution of the features, z-score normalised, is reported. For each feature, Table 2 reports the mean, the median, the standard deviation and the interquartile range for each of the two conditions, and the corrected p-value. The statistical analysis was carried out with Python 3.7, using the scikit-learn 0.22.2, numpy 1.21.5 and scipy 1.7.3 libraries. Visualisation was performed with matplotlib 3.5.1 and seaborn 0.10.0 libraries.

Classification
The main task of this work was to classify the histological subtype of lung nodules using radiomic features. In order to accomplish this task, we trained several classifiers considering a variable number of features. Specifically, features were added to the input pattern of each classifier according to their mean univariate AUC calculated in the features selection phase (as described in the previous section whose results are reported in Figure 4).
We also evaluated an ensemble of five models (En5) composed by the models mentioned above, except for GB. The ensemble model makes decisions following a voting strategy; the classification output by LogReg, SVM, and MLP was weighted 2; RF and AdaBo prediction was weighted as 1. Simpler models tend to be more robust and so they have received a larger weight. GB was not included since its adding to the ensemble did not lead to performance improvements. Moreover, it already suffered from low accuracies in many cases when more features were considered. The hyperparameters of all the models are reported in Table 3.
To assess the performance of the classifiers and the reliability of the radiomic features considered, we performed four experiments: Experiment 1: Internal cross-validation on each dataset (D1, D2, D3) separately; this experiment aimed to investigate if the radiomic signatures found by our approach could be discriminative for this classification task; Experiment 2: To train models on D1 and validate them on D2 and D3; this experiment aimed to investigate if the classifiers trained on a single dataset could perform on different datasets; Experiment 3: To train models on D1 and D2 (as a single dataset) and validate them on D3; this experiment aimed to investigate if increasing the training sample size could have improved the classification performance on the external validation; Experiment 4: Internal cross-validation merging D1, D2 and D3; this experiment aimed to investigate if considering all the datasets together could have led to a better performance than considering datasets separately, even at the cost of lower generalisation capabilities of the models.
In all the experiments, cross-validation was performed with 10-folds and stratified samples, i.e., the distribution of classes remains the same across folds. Figure 6 shows the workflows implemented for the four classification experiments in terms of input data and output results.

Results and Discussion
The results of each experiment are reported in Figure 7. Each matrix in the figure shows the mean AUC per model per number of features of the input pattern.
Experiment 1 aimed to assess if the radiomic signature obtained from Dataset1, thanks to the implemented approach for features reduction, could be discriminative for classifying LUAD and other histological types also on the other datasets. To do this, we trained the classification models on the selected features from D1, D2 and D3, respectively. The performances of these models are reported in Figure 7(I-a, I-b and I-c), respectively.
The classifiers reached the highest robustness in classifying samples from D1, regardless of the number of features and the classifier itself. Specifically, the mean (±standard deviation) AUC for all the considered models and features was 0.64 (±0.04), 0.54 (±0.06), and 0.56 (±0.09), considering data from D1, D2 and D3, respectively. Considering input data from D1, the best model was LogReg, which allowed us to obtain an AUC of 0.70 when trained with the input pattern size of six features. Internal cross-validation with input data from D2, instead, showed an increased variability among models' performances, regardless of the number of features, as can be noticed in the heatmap reported in Figure 7(I-b). In this case, the best model was the SVM trained on only one feature, which allowed us to achieve an AUC of 0.65. Considering the models trained on input data from D3, LogReg and MLP allowed us to achieve the best results in the univariate version; also in this case, with a mean AUC of 0.75. Experiment 2 aimed to assess if models trained on the histotype signature from D1 could generalise to make decisions also on D2 and D3. The results obtained show that GB, trained on two features, and AdaBo, trained on four features, allowed us to achieve an AUC of 0.62 considering data from D2 (Figure 7(II-a)). However, better performances were obtained, validating models on D3, with an AUC of 0.70 for SVM trained with one, five, and six features (Figure 7(II-b)).
In Experiment 3, instead, we investigated if increasing the training sample would allow us to obtain a higher performance on D3 as a validation set with respect to Experiment 2. As shown in Figure 7III, the MLP trained with an input pattern of seven features allowed us to obtain an AUC of 0.69 on data from D3 considering a joint train set composed of D1 and D2. Comparing this experiment to the previous one, a small improvement in terms of mean performances were obtained; in fact, this training configuration allowed us to achieve a mean (±standard deviation) AUC of 0.58 (±0.07), instead of 0.59 (±0.09), on the external validation set D3.
Concerning Experiment 4, it has been designed to evaluate how the models would have performed if D1, D2 and D3 would be merged. The internal cross-validation revealed that merging D1, D2 and D3 led to more robust performance of the classifiers, except for the comparison with the case of internal cross-validation performed on D1 (in Experiment 1). LogReg, RF, and Ensemble-5 models allowed us to achieve an AUC of 0.65 in their best case, i.e., trained considering two, six and seven features, respectively (as shown in Figure 7 IV). The overall mean (±standard deviation) AUC obtained in this experiment is 0.61 (±0.03).
Since the feature set has been chosen considering data in D1, we observed in some cases, i.e., Experiments 1 and 2, that a single feature allowed us to achieve the best performances in some external validation settings. It should be noted that, in the case of input patterns composed of a single feature, MLPs tend to behave very similarly to LogReg, considering that MLP is a set of neurons each with the capability of LogReg; thus, this configuration could not extract more information in the presence of only one feature.
The best ROC curves in external validation settings for cases II-a, II-b and III are reported in Figure 8.

Conclusions
Radiomics is a research field in which several algorithms are exploited to extract quantitative high-dimensional characteristics from images. Several authors have already used this methodology to address classification problems that involve radiological images. The use of algorithms for the processing and reduction of features may allow the design and implementation of more robust classification models that are able to improve the capabilities of automatic intelligent systems to support decisions in the biomedical field. However, the reproducibility of radiomic studies on external validation data is considered a crucial point.
In this study, we extracted a reduced set of radiomic features to classify the histological subtype in lung nodules to discriminate between adenocarcinoma cases from other histological classes. In this work, we used three datasets, namely D1 and D2, which were provided by Grossman et al., and D3, which was built starting from a local cohort of lung CTs opportunely processed to extract radiomic features.
Starting from D1, we designed and implemented a pipeline based on univariate analysis to select the most discriminative features to discriminate between lung adenocarcinoma and other histological types.
Then, four experiments were performed to investigate whether and how the number of features, the input dataset, and the model affect the classification performance and the reproducibility of radiomic studies.
Our analyses revealed that the best configuration for reproducibility on our local dataset D3 is the one using D1 for training the model. However, also combining D1 and D2 as a single training set led to good performances on data from D3.
In any case, internal cross-validations evaluated in Experiment 1 allowed us to reach higher performances, highlighting the importance of considering external validation datasets when performing radiomics studies, so that the real effectiveness of the discovered feature set can be measured. Eventually, the best result for the D3 dataset comes from internal cross-validation, but then this result would be less generalisable on unseen data. Such a conclusion is also endorsed by other authors, such as Wu et al., who also performed external validations in a similar case study, but with a reduced cohort for external validations, obtaining similar results (AUC of 0.72 with Naïve Bayes' classifier) [31].
Future works may include the comparison between different kinds of enhancement modalities from CT scans before performing the radiomics feature extraction and the correlation of radiomic signatures with genomics data. The collection of new data may also help in the direction of designing a robust radiomics pipeline for histological and genomic classification. Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki. Ethical review and approval were waived for this study since this was a retrospective observational study with anonymised data.
Informed Consent Statement: Patient consent was waived due to the fact that this was a retrospective observational study with anonymised data, already acquired for medical diagnostic purposes.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.