1. Introduction
Lung cancer is the leading cause of cancer-related death in the United States [1,2]. Computed tomography (CT) imaging remains one of the standard-of-care diagnostic tools for staging lung cancers. However, the conventional interpretation of radiological images can be, to some degree, affected by radiologists’ training and experience, and is therefore somewhat subjective and largely qualitative in nature. While radiological images provide key information on the dimensions and extent of a tumor, they are unsuitable for assessing clinical–pathological information (e.g., histological features, levels of differentiation, or molecular characteristics) that is critical for the treatment selection process. Thus, the process of diagnosing cancer patients often requires invasive and sometimes risky medical procedures, such as the collection of tissue biopsies. Finding new solutions for collecting critical microscopic and molecular features with non-invasive and operator-independent approaches remains a high priority in oncology. The development of reliable, non-invasive computer-aided diagnostic (CAD) tools may provide novel means to address these problems.
Image digitization coupled with artificial intelligence is emerging as a powerful tool for generating large-scale quantitative data from high-resolution medical images and for identifying patterns that can predict biological processes and pathophysiological changes [3]. Preliminary studies have suggested that objective and quantitative features that go beyond conventional image examination can predict the histopathological and molecular characteristics of a tumor in a non-invasive way [4]. Several new tools are now available for image analysis, and the machine learning processing pipeline enables automatic segmentation, feature extraction, and model building (Figure 1).
Segmentation is a critical part of the radiomic process, but it is also known to be challenging. Manual segmentation is labor-intensive and can be subject to inter-reader variability [5,6]. To improve segmentation efficiency and accuracy, the development of automated or semi-automated segmentation methods has become an active area of research [7]. Several deep learning models have been used to segment lung tumors from CT scans. However, validation of the reproducibility of these proposed methods using large datasets is still limited [8,9,10,11], and this has hindered their application in clinical settings. Deep learning tools such as U-Net and E-Net have been previously used to automatically segment non-small cell lung tumors and nodules in CT images, but these models were not specifically trained using lung cancer patient data [8].
The incremental multiple resolution residual network (iMRRN) is one of the best performing deep learning methods developed for volumetric lung tumor segmentation from CT images [8,9,10,12]. The iMRRN extends the full-resolution residual neural network by combining features at multiple image resolutions and feature levels. It is composed of multiple blocks of sequentially connected residual connection units (RCUs), which in turn are convolutional filters used at each network layer. Due to its enhanced capability in recovering input image resolution, the iMRRN has been shown to outperform other neural networks commonly used for segmentation, such as SegNet and U-Net, in terms of segmentation accuracy and localization, regardless of tumor size [7,8,13,14]. Additionally, the iMRRN has been shown to produce accurate segmentations and three-dimensional tumor volumes when compared with manual tracing by trained radiologists [8]. Its strong segmentation performance and public availability make the iMRRN an excellent candidate for other researchers to use.
In this study, we assessed the extent to which the iMRRN coupled with supervised classification can predict lung tumor subtypes based on CT images acquired from lung cancer patients. Segmentation performance was used to inform improvements to the tumor delineation process when deep learning models were used. The automated segmentation of lung tumors yielded quantifiable radiomic features to train classification algorithms. Seven machine learning classifiers were compared for their accuracy in differentiating three histological subtypes of lung cancer using CT image features. The most discriminating features and most accurate classification learners were ranked.
Our study was prompted by the persistent disconnect between AI research and clinical applications. These gaps are not due to a lack of advanced and sophisticated AI models, but to insufficient validation of the integration of existing methods using new datasets and new clinical questions. Our study validated a framework to close one such gap. Our key contributions are summarized as follows. First, we demonstrated the feasibility of directly applying a publicly available, trained DL model to a completely new CT dataset collected for other purposes with minimal input from radiologists. We recognized the limitations of applying trained DL models to new datasets and proposed incorporating radiologists’ input on approximate tumor locations for reliable and targeted segmentation. Second, we systematically examined the accuracy of this integrated approach through lung cancer subtype prediction using the segmentation results from the DL model. Third, we addressed the practical issue of unbalanced data and demonstrated that an over-sampling approach such as SMOTE (synthetic minority oversampling technique) can effectively improve the accuracy with which real clinical data are classified. Fourth, for the first time, we demonstrated that radiomic analysis is able to classify three subtypes of lung cancer with accuracy comparable to that of two-subtype classification.
We believe that our study outcomes are of more use to researchers in the applied AI and/or biomedicine communities than to those whose expertise is novel AI methodology development. This is because our objective is not to build new deep learning or machine learning approaches. Instead, we identified important, practical limitations of existing AI methods when applied to new clinical data. Clinical data need to be carefully processed to be more specific and balanced so that the performance of AI methods can be maximized.
2. Materials and Methods
2.1. Data Description
We used the previously collected and publicly available CT images in the dataset named “A Large-Scale CT and PET/CT Dataset for Lung Cancer Diagnosis (Lung-PET-CT-Dx)”, obtained from The Cancer Imaging Archive (TCIA) [15]. TCIA is an open-access information platform created to support cancer research initiatives, where open-access cancer-related images are made available to the scientific community (https://www.cancerimagingarchive.net/, accessed on 30 May 2021) [15]. The Lung-PET-CT-Dx dataset contains 251,135 de-identified CT and PET-CT images from lung cancer patients [16]. These data were collected by the Second Affiliated Hospital of Harbin Medical University in Harbin, Heilongjiang Province, China. Lung cancer images were acquired from patients diagnosed with one of the four major lung cancer histopathological subtypes by biopsy. Radiologist annotations of the tumor locations were also provided for each CT/PET-CT image. Each image was manually annotated with a rectangular bounding box of similar length and width to the tumor lesion using the LabelImg tool [17]. Five academic thoracic radiologists completed the annotations: the bounding box was drawn by one radiologist and then verified by the other four.
For our analysis, we only processed CT images with a slice thickness of 1 mm; CT scans with other slice thicknesses were excluded. We made this choice because CT images acquired at different slice intervals may introduce variability in the radiomic features that complicates the interpretation of the results. A thickness of 1 mm is the most commonly acquired slice resolution in clinics [18], and such CT images were indeed the best represented in our dataset. Therefore, focusing on 1 mm thick CT images was the most relevant choice for future clinical utilization. In some cases, a patient had more than one chest CT scan. The anatomical scan taken at the earliest time point for a given patient was included in the analysis; this earliest-timepoint CT scan is referred to as the patient’s primary CT scan. We excluded non-primary scans for the following reasons: non-primary scans, such as contrast-enhanced or respiratory-gated scans, do not provide radiomic features comparable to those of standard anatomical CT scans, and thus are not appropriate for inclusion in our analysis. Furthermore, the non-primary CT scans might have been acquired after treatment had begun, at which point potential tumor necrosis and cell death may affect the radiomic features within the CT image. In such cases, the non-primary CT images would not truly represent the radiomic properties of the tumors, which would have changed in response to treatment. A summary of patient demographic information and tumor TNM stages is provided in Table 1 [19].
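As an illustration, the scan-selection rule described above (keep only 1 mm scans, then keep the earliest scan per patient) can be sketched in Python. The metadata records and field names here are hypothetical; in practice, slice thickness and study date would be read from DICOM headers, for example with a library such as pydicom.

```python
from datetime import date

# Hypothetical scan metadata; in a real pipeline these fields would be
# extracted from DICOM headers (SliceThickness, StudyDate).
scans = [
    {"patient": "P1", "thickness_mm": 1.0, "study_date": date(2019, 3, 1)},
    {"patient": "P1", "thickness_mm": 1.0, "study_date": date(2019, 6, 9)},
    {"patient": "P1", "thickness_mm": 5.0, "study_date": date(2019, 1, 2)},
    {"patient": "P2", "thickness_mm": 1.0, "study_date": date(2020, 2, 7)},
]

def primary_scans(scans, thickness_mm=1.0):
    """Keep only scans of the target slice thickness, then retain the
    earliest (primary) scan for each patient."""
    kept = [s for s in scans if s["thickness_mm"] == thickness_mm]
    primary = {}
    for s in kept:
        p = s["patient"]
        if p not in primary or s["study_date"] < primary[p]["study_date"]:
            primary[p] = s
    return list(primary.values())

selected = primary_scans(scans)
```

Note that the 5 mm scan is excluded before the earliest-date comparison, so a thicker scan acquired earlier does not displace the 1 mm primary scan.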
2.2. Semi-Automated Segmentation and Manual Inspection
To perform machine learning-based radiomic analysis, the Computational Environment for Radiologic Research (CERR, Memorial Sloan Kettering Cancer Center, New York, NY, USA) software platform was used to apply the trained iMRRN to automated segmentation [20]. CERR is an open-source, MATLAB-based (Mathworks Inc., Natick, MA, USA) tool with methods optimized for radiomic analysis. Using CERR, CT images in the DICOM format were converted to the planC format in preparation for segmentation. Deployment of the iMRRN on the planC object enabled the segmentation of tumor regions of interest (ROIs) and the production of a morphological mask over the tumor lesion. The Linux distribution Xubuntu 20.04 (Canonical Ltd., London, UK) was chosen to execute the iMRRN Singularity (Sylabs, Reno, NV, USA) container.
A visual comparison of the iMRRN segmentations and the radiologist annotations is given in Figure 2. One image from each of the four pathologic tumor subtypes is presented. In comparison with the rectangular delineations made by the radiologist, the automated segmentations followed the contours of the tumor lesions more precisely. In these examples, the iMRRN was able to adequately segment the tumors.
After the direct application of the iMRRN segmentation tool, the images were visually examined and compared with the manual box annotations. We found that in some patient CT scans, non-tumor structures, such as the heart, vertebrae, or sections of the patient’s couch, were mistakenly segmented as tumor nodules. Many of these structures were quite distant from where the tumors were located. To deal with such mistakes, we decided to supply the iMRRN only with CT images near the tumor locations. When segmentation was performed within these focused tumor ROIs, the iMRRN no longer erroneously segmented unrelated structures. Using a MATLAB program developed in house, the original CT scans were trimmed by discarding the parts of the images outside the annotation boxes, which were known not to contain the tumor lesion (Figure 3). Since there were cases in which the radiologist annotations did not cover the entire tumor, which may have led to incomplete segmentation, upper and lower buffers were added to enlarge the segmentation ROI (Figure 3, bottom panel).
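The trimming step can be illustrated with a minimal sketch. The bounding-box format `(z0, z1, y0, y1, x0, x1)` is a simplified assumption for illustration, not the LabelImg annotation format itself; the buffer is applied along the slice axis, mirroring the upper and lower buffers described above.

```python
import numpy as np

def crop_to_roi(volume, box, buffer_slices=2):
    """Trim a CT volume (z, y, x) to an annotated bounding box, adding
    upper and lower slice buffers so that tumors extending slightly past
    the annotation are not clipped."""
    z0, z1, y0, y1, x0, x1 = box
    z0 = max(z0 - buffer_slices, 0)           # extend downward, clamp at 0
    z1 = min(z1 + buffer_slices, volume.shape[0])  # extend upward, clamp at top
    return volume[z0:z1, y0:y1, x0:x1]

# Example: a 100-slice, 512x512 volume cropped to a box spanning slices 40-60.
vol = np.zeros((100, 512, 512))
roi = crop_to_roi(vol, (40, 60, 100, 200, 150, 260), buffer_slices=2)
```

Feeding only this cropped region to the segmentation model prevents distant structures such as the couch or vertebrae from being considered at all.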
2.3. Radiomic Feature Extraction
Using CERR methods, histogram intensity features and tumor morphology features were extracted from the segmented ROIs. These features included 17 shape features, 22 first-order features, and 80 texture features from the tumor regions in each CT scan. Texture features were further broken down into five subgroups: gray level co-occurrence matrix (26), gray level run length matrix (16), gray level size zone matrix (16), neighborhood gray tone difference matrix (5), and neighborhood gray level dependence matrix (26). Each of these features is defined mathematically in the Image Biomarker Standardization Initiative reference manual [21]. A list of extracted radiomic features is given in Supplementary Table S1. All attributes were continuous variables, and each feature was normalized to a range between zero and one so that the scales of the feature values did not affect the results. Observations containing missing or infinite values were removed.
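The cleaning and scaling described above amount to removing invalid observations and applying min-max normalization per feature. A minimal sketch (the feature matrix here is illustrative; the study performed these steps in MATLAB):

```python
import numpy as np

def clean_and_normalize(features):
    """Drop observations (rows) containing NaN or infinite values, then
    min-max scale each feature (column) to the range [0, 1]."""
    X = np.asarray(features, dtype=float)
    X = X[np.isfinite(X).all(axis=1)]               # remove invalid observations
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant features
    return (X - mins) / span

# One observation contains a missing value and is dropped before scaling.
X = clean_and_normalize([[1.0, 10.0], [2.0, float("nan")], [3.0, 30.0]])
```

Scaling each feature to [0, 1] ensures that features measured on large numeric ranges (e.g., volumes) do not dominate those on small ranges (e.g., elongation) during classifier training.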
To evaluate the predictive role of 2D and 3D CT features in determining tumor subtype, we initially narrowed the analysis down to the central transverse plane of the ROI for the 2D examination, and subsequently expanded it to encompass the entire tumor mask for the 3D examination. To examine how well the center CT slice represents other slices in the same CT volume in terms of radiomic analysis, we trained several classifiers using only center slices from the CT volumes and then tested the classification accuracy on off-center slices that were 4 mm away from the center slices. No test slices were acquired if there was no tumor lesion 4 mm away from the center.
When the 2D CT images were analyzed, only those shape features applicable in 2D (i.e., major axis, minor axis, least axis, elongation, max2dDiameterAxialPlane, and surfArea) were included. We hypothesized that these shape features would be the most robust against CT scanner variation and, therefore, the most important when identifying lung tumors by histological subtype.
2.4. Radiomic Model Building
To examine the effectiveness of supervised learning methods in classifying lung cancer subtypes using radiomic features extracted from the segmented tumor CT data, we trained and tested seven classifiers. The MATLAB Classification Learner App (Statistics and Machine Learning Toolbox version 12.3) was used to perform the classification and evaluation. The training data contained the extracted radiomic features as well as the confirmed lung cancer subtypes. The following classification algorithms were considered and compared: decision tree, discriminant analysis, naïve Bayes, support vector machine, k-nearest neighbors, ensemble, and a narrow neural network. Each of these models was trained over fifty iterations. Five-fold cross-validation (CV) was used to evaluate the performance of each model. Five-fold CV divides the whole dataset into five subsets of equal size; each model was trained using four subsets and then tested on the fifth, the process was repeated five times, and the averaged results were reported.
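The five-fold splitting scheme described above can be sketched as follows (a generic k-fold illustration, not the MATLAB Classification Learner internals):

```python
import random

def five_fold_cv(n_samples, k=5, seed=0):
    """Shuffle sample indices, split them into k folds of (nearly) equal
    size, and yield (train, test) index lists: each fold serves as the
    test set exactly once while the remaining folds form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# For 100 samples, each of the 5 rounds trains on 80 and tests on 20.
splits = list(five_fold_cv(100))
```

The reported accuracy is then the average of the five per-fold test accuracies, which reduces the variance introduced by any single train/test split.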
Three lung cancer subtypes, namely adenocarcinomas (group A), small cell carcinomas (group B), and squamous cell carcinomas (group C), were used as response variables for our analysis. Large-cell carcinomas were not included in the analysis as they were poorly represented in the dataset (only five instances). Clinically, large-cell carcinomas account for less than 10% of all lung cancer types, so omitting this particular type did not impact our study objectives.
A principal component analysis (PCA) was used to reduce data complexity [22]. PCA transforms the original dataset into a new set of variables (principal components) that are linear combinations of the original features. It is a widely used technique in machine learning and is especially useful when analyzing data with many features. We used the synthetic minority over-sampling technique (SMOTE) to address the problem of class imbalance in our dataset, in which adenocarcinoma patients (n = 251) greatly outnumbered small cell carcinoma patients (n = 38) and squamous cell carcinoma patients (n = 61). SMOTE synthesizes new observations using a k-nearest neighbors approach to balance the number of training observations for each histotype group [23]. The MATLAB implementation of SMOTE we used created a more balanced dataset for radiomic modeling and feature analysis [24].
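The core idea of SMOTE can be illustrated with a minimal sketch (not the MATLAB implementation used in the study): each synthetic observation is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbors.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Minimal SMOTE-style over-sampling sketch. For each synthetic point,
    pick a random minority sample x, find its k nearest minority neighbors
    (squared Euclidean distance), pick one neighbor, and interpolate at a
    random position between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Four minority samples in the unit square, over-sampled to add ten more.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_pts = smote_like(minority, n_new=10)
```

Because synthetic points are interpolated between existing minority samples rather than duplicated, the classifier sees a denser but still plausible minority-class region instead of repeated identical observations.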
Chi-square tests have been used in machine learning to select features [25]. Although chi-square tests are restricted to categorical data, discretization enables the examination of continuous variables [26]. In our study, we used chi-square tests to obtain a chi-square feature ranking. This ranking describes the degree of association between each feature and the response variable, which is the class label for classification. Using the feature ranking, we determined which were the most important shape, texture, and first-order histogram intensity features for classifying lung cancer histological subtypes.
4. Discussion
Radiomic analyses provide valuable quantitative descriptions of medical images and have great potential to be used clinically for the improved management of cancer patients [27]. One bottleneck obstructing the clinical application of radiomics in cancer diagnosis and treatment is the need for manual tracing of the tumor by a certified radiologist. In this study, we showed that a pre-trained deep learning segmentation model with minimal input from radiologists on tumor locations can be used to replace the tedious manual segmentation of lung tumors.
We examined three primary types of lung cancer in our analysis. Lung adenocarcinoma, lung squamous cell carcinoma, and small cell lung cancer exhibit distinct physical characteristics. Adenocarcinoma is the most common type and appears as irregular glands or clusters of cells, resembling glandular tissue. It typically develops in the outer regions of the lungs and is more common in non-smokers and in women. In contrast, squamous cell carcinoma is characterized by cancerous cells resembling flat, thin squamous cells arranged in layers. It commonly arises in the central airways, such as the bronchi, and is strongly associated with smoking, particularly in male smokers. Small cell lung cancer is characterized by small, round cancer cells with minimal cytoplasm that grow in clusters [28]. By carefully analyzing CT images, radiologists can identify specific patterns associated with each type of lung cancer.
To our knowledge, our study is the first to attempt to classify three histological subtypes of lung cancer using clinical CT/PET images. Li et al. attempted to classify the same three subtypes; however, their analyses were primarily binary in nature, as their results focused solely on comparing classification accuracies between two of the three subtypes without testing the accuracy of distinguishing all three subtypes from each other [29]. Every other study has classified only two subtypes (either adenocarcinoma versus squamous cell carcinoma [30,31,32] or small cell lung cancer versus non-small cell lung cancer). Our best performing model was the support vector machine, which achieved a classification accuracy of 92.7% with an AUC of 0.97 when the three lung cancer subtypes were distinguished. The SVM and ensemble models performed best when two classes (small cell lung cancer versus non-small cell lung cancer) were considered, both achieving an accuracy of 92.6% with an AUC of 0.98. Our models outperformed those used in previous studies [32].
Our analysis provides important insights into how the proposed framework can be contextualized and used for radiomic analysis. First, although automated segmentation algorithms such as the iMRRN are designed to operate without prior information concerning the location of the tumor lesion, we found that segmentation accuracy was improved when the general location of the tumor was provided (Table 3). The annotation can be as simple as the index of the slice containing the tumor; a deep learning (DL) model can then be applied to the tumor-adjacent slices, removing the need to search the entire stack for the tumor. Our data showed that searching the entire stack led to a higher rate of misidentification of the tumor lesion. As this process only requires labeling a single tumor slice, the approach demands very limited effort from the radiologist, and may therefore boost the use of automated DL segmentation and radiomics in oncology. From a clinical perspective, developing radiomics-based tools that can predict tumor histology may spare patients from invasive procedures and help physicians capture histological changes that may emerge in response to targeted treatments [33,34].
A second important issue that emerged from our analysis is the role of an unbalanced dataset, which poses real challenges in radiomic analysis. When working with retrospective clinical samples, it is common for a dataset to contain unequal numbers of subjects across comparison groups. However, many machine learning methods are known to be sensitive to unbalanced data, as the minority classes may not be learned as well as the dominant classes. One should carefully examine the distribution of the data before applying machine learning-based radiomic analysis, because unsatisfying results could partially stem from the underrepresentation of some classes. This is especially true for multi-class classifications, in which samples can be significantly skewed. As Table 6 shows, over-sampling increased the accuracy of the two-class classification by a few percentage points (up to 4%), while more significant improvements were seen in the three-class classification (up to 16.2%). These results are comparable to other lung radiomic studies that have demonstrated increased classification performance after applying re-sampling techniques [35].
Lastly, effective classification methods rely on features that are informative and discriminating across the compared groups. Radiomic features are divided into distinct classes: shape features, first-order (histogram-based) features, and second-order (texture) features. Shape features include geometric and spatial characteristics such as size, sphericity, and the compactness of the tumor; sphericity and compactness in particular are known to have strong tumor classification reliability [36]. First-order features describe pixel intensity values and may be expressed as histogram values; histogram-based features have been shown to have a high degree of reliability in radiomic studies [36]. Second-order features, or texture features, rely on statistical relationships between patterns of gray levels in the image. The gray level run length matrix and gray level zone length matrix features each describe homogeneity between pixels and have been shown to be reliable second-order features [37]. Our study showed that 3D CT data outperform single 2D CT slices by up to 5% when their radiomic features are used in classification. Several factors might explain this. First, 3D CT data provide a richer set of radiomic features, including true three-dimensional shape and size information. Second, tumor morphometric and texture characteristics are subject to spatial heterogeneity, which can only be captured by 3D features; two-dimensional texture features may not be sufficient to accurately describe spatial heterogeneity. However, if clinical 3D CT data are unavailable, 2D radiomic analysis can still achieve useful classification with decent accuracy.
Improving the accuracy of such classifications will rely on the selection of discriminating features. This study utilized each of the shape, first-order, and texture features available in CERR, as these have been shown to be robust against differences in image acquisition techniques [38]. Incorporating clinical features such as age, sex, weight, and smoking history is likely to improve classification accuracy, as these have been shown to correlate with lung cancer risk [39].