1. Introduction
In recent years, advancements in medical technology and oncology have contributed to a decline in annual cancer mortality rates. Nevertheless, lung cancer (LC) remains a significant public health challenge, consistently ranking highest in both incidence and mortality, as indicated in the 2024 World Cancer Statistics [
1]. In Taiwan, LC has been the leading cause of cancer-related deaths for more than a decade [
2]. Projections for 2024 anticipate that mortality from LC will be approximately threefold higher than that from colorectal cancer, which ranks third, underscoring the considerable lethality of this malignancy [
1].
One widely utilized lung cancer screening modality is low-dose computed tomography (LDCT) [
3]. Lung nodules are typically detected in LDCT images when they exceed 3 mm in diameter. Nodules of this size and larger are generally visible and can be flagged for further assessment, although smaller nodules may also be detected under optimal imaging conditions. In specific clinical contexts, guidelines may vary, and follow-up imaging may be recommended for nodules below this size.
In lung cancer studies, the “nodule spectrum” refers to the variety of lung nodules identified in imaging studies. This spectrum includes differences in size, shape, margins, and density. Nodules can range from very small to large, may exhibit smooth or irregular edges, and vary in density from ground-glass to solid. The growth patterns of nodules over time are also important for evaluation. Understanding this spectrum helps clinicians assess the risk of lung cancer and determine appropriate management strategies. For instance, along the density axis, nodules can be classified into solid, part-solid, and ground-glass nodules (GGN) [
4].
Several critical factors are assessed in the evaluation of lung nodules for potential malignancy. Nodule size is paramount; those exceeding 3 cm in diameter exhibit a higher probability of being malignant. Margin characteristics are also significant, as smooth, well-defined edges generally indicate benign lesions, while spiculated or irregular margins heighten suspicion of malignancy. The growth rate of the nodule is a crucial consideration; rapid enlargement is typically associated with malignancy, whereas stable nodules are often benign. Additionally, the density of the nodule plays a role, with solid nodules generally carrying a greater risk of cancer compared to ground-glass opacities. The pattern of calcification can aid in risk stratification, and patient-specific factors—including age, smoking history, and the presence of symptoms such as cough or weight loss—further inform the diagnostic process [
4]. However, combining these factors into a single estimate of malignancy likelihood is non-trivial.
Despite the recognized utility of these biomarkers, they often do not enable radiologists to reach unequivocal diagnostic conclusions [
5]. Additionally, the diagnostic process is considerably impacted by the subjective interpretation of radiologists, leading to variability in diagnostic outcomes among practitioners with varying experience levels. Such diagnostic ambiguity often leads radiologists to recommend a biopsy to obtain definitive results. However, this approach carries the drawback of a significant false positive rate, increasing unnecessary invasive procedures [
6,
7]. Nonetheless, ongoing advances in research offer optimism that more reliable and precise diagnostic tools can be developed to enhance diagnostic accuracy and consistency.
Significant research has been conducted on developing computer-aided diagnosis (CAD) systems, aiming to improve the accuracy of diagnosis in two main areas of lung cancer screening: nodule detection [
8] and malignancy classification. Since this study focuses on nodule malignancy classification, we do not discuss nodule detection here.
The LIDC-IDRI dataset [
9] is a popular dataset that provides detailed annotations of lung nodules, covering their presence, size, and characteristics as assessed by radiologists. While it provides information on nodule assessment, it does not explicitly label nodules as malignant or benign. Instead, the dataset includes radiologist ratings of the likelihood of malignancy, which are based on subjective interpretation and consensus rather than a definitive classification of each nodule’s malignancy status. Researchers typically use this information to develop and evaluate predictive models for nodule malignancy, such as [
10,
11,
12,
13,
14].
Relying exclusively on the LIDC-IDRI dataset for diagnosing nodule malignancy presents certain limitations. This dataset lacks biopsy-confirmed pathological verification; instead, malignancy labels are derived from subjective assessments by radiologists, which may introduce bias in the training and evaluation of CAD models. To overcome this challenge, our study gathers LDCT data from patients with suspected lung cancer at local institutions, ensuring that all subjects have undergone biopsies for definitive pathological confirmation.
Another common challenge in relevant research utilizing open datasets is the reliance on unprocessed CT images for developing CAD systems. In real-world clinical settings [
15], CT acquisition protocols and equipment settings vary, producing discrepancies in pixel spacing and slice thickness among patients. The resulting non-uniform voxel sizes can negatively affect the comparability of extracted radiomic features and the generalization ability of machine learning (ML) models. To enhance the applicability and effectiveness of these systems, regardless of whether deep learning (DL) or ML techniques are employed, it is crucial to use normalized images with uniform geometric parameters. Previous studies have addressed this issue, reaching a consensus on the importance of geometrically normalizing images before their use [
16,
17,
18,
19].
Moreover, studies utilizing the LIDC-IDRI database for nodule classification face another challenge: managing contours delineated by four radiologists. Some studies use contours from only one radiologist [
20], potentially missing insights from others. In contrast, others employ Generative Adversarial Networks (GANs) [
21] to integrate all four, which can make models less interpretable and reduce clinical trust. This, in turn, affects the consistency of classification outcomes. In this study, we tackle this issue and demonstrate our solution to the problem.
In summary, this study has four primary objectives:
To collect two-center data with pathological verification for a reliable CAD system to differentiate between benign and malignant nodules.
To examine how voxel normalization affects nodule classification.
To create an intuitive contour fusion method for clinicians to merge contours from different radiologists.
To investigate distinct diagnostic paradigms—including feature-driven machine learning (radiomics) and data-driven deep learning methods—as the basis for the CAD system.
2. Materials and Methods
LDCT image data and their corresponding pathological confirmations are collected from two local centers: (a) Kaohsiung Veterans General Hospital (KVGH) and (b) Chia-Yi Christian Hospital (CYCH).
- (a)
The dataset includes 160 malignant and 81 benign pulmonary nodules (PNs) from 241 patients; the study was approved by the hospital’s Institutional Review Board (IRB number VGHKS18-CT5-09, date: 3 January 2018). Each nodule’s malignancy status was confirmed through pathological biopsy. Imaging was conducted using scanners from Toshiba, Siemens, GE Medical Systems, and Philips. LIFEx software (version 6.2.0, 2018, [
22]) facilitated DICOM image reading and annotation in various planes. The recorded ROIs were saved as near-raw raster data (NRRD) files. Due to technical issues, 16 cases with annotation errors (3 benign and 13 malignant) were excluded. Feature extraction was performed on reconstructed images rather than raw scans [
18].
- (b)
The dataset comprises 174 patients, including 78 benign cases (101 benign nodules) and 96 malignant cases (120 malignant nodules); the study was approved by the hospital’s Institutional Review Board (IRB number IRB2022096, date: 16 November 2022). Each nodule’s malignancy status was confirmed through pathological biopsy. Imaging was conducted using various scanners: 104 patients with Siemens, 8 with GE Medical Systems, 11 with Toshiba, 2 with Canon Medical Systems, and 49 with Philips. All participants underwent imaging in helical mode at 120 kVp. A total of 122 patients received iodine-based contrast media, while 52 were scanned without contrast agents. LIFEx software (version 6.2.0, 2018) was used to annotate regions of interest (ROIs), which were delineated independently by a skilled radiologist and a radiographer. The recorded ROIs were saved as NRRD files, and the original CT files were converted to NRRD format using SimpleITK.
The data formats are as follows: CT images are stored as int16, computations are performed in float64, and masks are binary. For radiomics, intensities are discretized with a bin width of 25 CT numbers. For deep learning, each CT image is transformed into a positive integer format within the range of [0, 255] using a lung window (window width: 1500, window level: −600).
2.1. General Frameworks for Dataset (A)
Figure 1 illustrates the flowchart for analyzing LDCT images and extracting radiomics features. The detailed steps are described as follows.
Raw LDCT Data: Collect LDCT images from KVGH.
Voxel Reconstruction for Geometric Normalization: Standardize images to defined isotropic voxel sizes (side-length 1.5 mm).
Radiomics Feature Extraction: Obtain various features from the LDCT images while excluding shape-based features.
Statistical Testing:
- (1)
Conduct tests for Gaussian distribution and equal variances.
- (2)
If the criteria are met, perform an independent t-test; otherwise, use the Wilcoxon rank-sum test.
- (3)
Determine statistical significance with a p-value threshold of less than 10⁻²⁰.
Feature Selection with LASSO: Apply LASSO (Least Absolute Shrinkage and Selection Operator) for feature selection on the dataset (a).
This methodology allows for efficient processing of LDCT data, with a focus on rigorous statistical validation and feature selection.
Figure 1.
Flowchart (a) for LDCT Image Analysis and Radiomics Feature Extraction (Dataset A).
2.2. General Frameworks for Dataset (B)
Figure 2 illustrates the general framework for dataset (B). The detailed procedures are described as follows.
Raw LDCT Data: Collect LDCT images from CYCH.
Voxel Reconstruction for Geometric Normalization: Standardize images to defined isotropic voxel sizes (side-length 1.5 mm).
Radiomics Feature Extraction: Obtain various features from the LDCT images while excluding shape-based features. Before feature extraction, several filters and transformations are applied to generate additional derived images, details of which are described in
Section 2.5.
Feature Selection and Classification
- (1)
With LightGBM: Apply Light Gradient Boosting Machine (LightGBM) for feature selection on radiomics features of dataset (b), followed by different ML classifiers.
- (2)
With neural networks: Apply different NNs.
Note: (1) and (2) are independent; they generate different results.
Figure 2.
Flowchart (b) for LDCT Image Analysis and Radiomics Feature Extraction (Dataset B).
2.3. Isotropic Voxel Normalization on CT Images
To improve the consistency and effectiveness of machine learning models for identifying the malignancy of PNs, the study standardizes pixel size (PS) and slice thickness (ST) across all LDCT images. This is performed using isotropic voxel normalization with bicubic interpolation [
23], ensuring uniform spatial resolution and achieving three-dimensional isotropy. This research also examines the effects of different voxel sizes (side lengths from 0.5 to 2 mm) on model performance and the characteristics of extracted features. To rigorously evaluate the stability of the models, we performed statistical significance tests using the Wilcoxon rank-sum test. Specifically, we compared the performance metrics of each isotropic voxel size against a baseline derived from the original, non-normalized images. This analysis verifies whether the performance variations observed at different spatial resolutions are statistically significant relative to the unadjusted baseline.
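As a concrete illustration of this normalization step, the following sketch resamples a CT volume to 1.5 mm isotropic voxels with SimpleITK (already used for NRRD conversion in dataset (b)); the cubic B-spline interpolator stands in for the bicubic interpolation cited above, and the padding value of −1024 HU is an assumption.

```python
import SimpleITK as sitk

def resample_isotropic(image: sitk.Image, voxel_mm: float = 1.5) -> sitk.Image:
    """Resample a CT volume to isotropic voxels of the given side length."""
    old_spacing = image.GetSpacing()
    old_size = image.GetSize()
    new_size = [int(round(sz * sp / voxel_mm)) for sz, sp in zip(old_size, old_spacing)]

    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing((voxel_mm, voxel_mm, voxel_mm))
    resampler.SetSize(new_size)
    resampler.SetOutputOrigin(image.GetOrigin())
    resampler.SetOutputDirection(image.GetDirection())
    resampler.SetInterpolator(sitk.sitkBSpline)   # cubic interpolation for CT intensities
    resampler.SetDefaultPixelValue(-1024)         # assumed padding value (air) outside the original grid
    return resampler.Execute(image)

# Nodule masks would instead be resampled with nearest-neighbour interpolation
# (sitk.sitkNearestNeighbor) so that they remain strictly binary after normalization.
```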
2.4. Nodule Contour Fusion
The Fast Fourier Transform (FFT) is crucial for contour fusion of nodules from different expert annotations. This process is conducted on 2D binary images (0 for absence and 1 for presence of a nodule). The 2D FFT converts these binary images into their spectral forms, which are then averaged to create a merged spectrum. The inverse Fast Fourier Transform (IFFT) reverts the average spectrum to the spatial domain. Finally, the absolute values of the reversed images are processed. Instead of selecting an arbitrary threshold, we conducted a sensitivity analysis to determine the optimal threshold value. We evaluated threshold parameters ranging from 0.4 to 0.8 to identify the value that best balances the inclusion of nodule features with the exclusion of non-tumorous background tissues. Based on the performance metrics (detailed in
Section 3), a threshold of 0.7 was identified as the optimal operating point to finalize the fusion. The contour fusion process is detailed in Equations (1) and (2) for the case of two experts.
The fused contour is then used for feature extraction. Notably, this process can also be conducted in 1D form, where the coordinates of nodule contours are the inputs and outputs.
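The following minimal sketch illustrates the 2D fusion for an arbitrary number of readers; it assumes the binary masks share the same image grid, and the function and variable names are ours rather than part of the published method.

```python
import numpy as np

def fuse_contours(masks: list, threshold: float = 0.7) -> np.ndarray:
    """Fuse binary nodule masks (0 = background, 1 = nodule) from several readers.

    Each mask is transformed with a 2D FFT, the spectra are averaged, and the
    inverse FFT of the mean spectrum is thresholded to yield the fused contour.
    """
    spectra = [np.fft.fft2(m.astype(float)) for m in masks]   # spectral form of each mask
    mean_spectrum = np.mean(spectra, axis=0)                  # merged spectrum
    fused = np.abs(np.fft.ifft2(mean_spectrum))               # back to the spatial domain
    return (fused >= threshold).astype(np.uint8)              # 0.7 chosen by the sensitivity analysis
```

With two readers, this reduces to the averaging formalized in Equations (1) and (2); the same recipe is applied slice by slice before the fused ROI is passed to feature extraction.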
2.5. Feature Extraction and Selection
Radiomics [
24] is a well-known technique focused on extracting diverse features from medical images, encompassing first-order, shape-based, and texture features. Our study employs the pyradiomics library [
25] to extract these features from contour-defined ROIs. Before the extraction process, we apply various filters to the raw ROIs, including Wavelet filters [
26], the Laplacian of Gaussian filter [
27], and several transformations such as Square, Square Root, Logarithm, Exponential, and Gradient, along with the Local Binary Pattern technique [
28]. These preprocessing steps aim to enhance the dataset by generating additional images for more comprehensive feature extraction.
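A condensed sketch of this extraction pipeline, assuming the pyradiomics API, is shown below; the LoG sigma values and the file names are illustrative placeholders rather than the exact settings of the study.

```python
from radiomics import featureextractor

# Bin width of 25 CT numbers, as stated in Section 2; shape features are excluded.
extractor = featureextractor.RadiomicsFeatureExtractor(binWidth=25)
extractor.disableAllFeatures()
for feature_class in ["firstorder", "glcm", "glrlm", "glszm", "gldm", "ngtdm"]:
    extractor.enableFeatureClassByName(feature_class)

# Filters and transformations applied before extraction (LoG sigmas are assumed values).
extractor.enableImageTypes(
    Original={}, Wavelet={}, LoG={"sigma": [1.0, 3.0, 5.0]},
    Square={}, SquareRoot={}, Logarithm={}, Exponential={},
    Gradient={}, LBP2D={},
)

# Hypothetical file names: the normalized CT volume and the fused nodule mask.
features = extractor.execute("ct_normalized.nrrd", "nodule_mask_fused.nrrd")
```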
The extensive process results in 2120 features. To address the challenges of high dimensionality, we utilize the LightGBM [
29] for feature selection, which differs from our previous method [
18]. LightGBM is an efficient gradient-boosting framework known for its speed and effectiveness in handling large datasets, making it ideal for balancing limited computational resources and training time. Another significant advantage of LightGBM is its ability to rank features by their importance during training, based on each feature’s contribution to decision tree construction and its effect on model accuracy. Specifically, we first compute Spearman’s rank correlation to evaluate monotonic relationships among all features, removing one feature from each pair whose correlation exceeds 0.9. The remaining features are then used to train the LightGBM model, allowing us to identify and select the 38 most effective features for further model development.
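The two-stage selection can be sketched as follows; this is a rough outline assuming pandas and LightGBM, and the hyperparameters are illustrative defaults rather than the tuned values used in the study.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

def select_radiomics_features(X: pd.DataFrame, y: pd.Series,
                              corr_cut: float = 0.9, top_k: int = 38) -> list:
    # Stage 1: drop one feature from every pair whose |Spearman correlation| exceeds 0.9.
    rho = X.corr(method="spearman").abs()
    upper = rho.where(np.triu(np.ones(rho.shape, dtype=bool), k=1))
    collinear = [col for col in upper.columns if (upper[col] > corr_cut).any()]
    X_kept = X.drop(columns=collinear)

    # Stage 2: rank the surviving features by LightGBM importance and keep the top 38.
    model = lgb.LGBMClassifier(n_estimators=500, random_state=0)
    model.fit(X_kept, y)
    ranking = pd.Series(model.feature_importances_, index=X_kept.columns)
    return ranking.sort_values(ascending=False).head(top_k).index.tolist()
```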
2.6. Classifiers in Machine Learning (ML)
Six ML algorithms are investigated, including Logistic Regression, Multilayer Perceptron (MLP), Random Forest, Linear Discriminant Analysis (LDA), LightGBM, and CatBoost [
30].
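For reference, the six classifiers can be instantiated as below using the scikit-learn, LightGBM, and CatBoost implementations; the hyperparameters shown are illustrative defaults, not the values tuned in this study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=300),
    "LDA": LinearDiscriminantAnalysis(),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
}
```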
2.7. CT Image Preparation for Neural Networks
To prepare input images for convolutional neural networks (CNNs), we employ a windowing technique on all CT images using a lung window [
31], characterized by a window width of 1500 and a window level of −600. This adjustment normalizes the CT image values to the range of [0, 255], enhancing the visibility of lung nodules. The transformation function is given in Equation (3), where WW denotes the window width and WL the window level. Subsequently, we extract patches with dimensions of 64 × 64 × 9 from each CT image sequence, ensuring that the nodules are centered within these patches. These nine patches are then flattened into a single image with dimensions of 192 × 192 × 1, which serves as the input to the 2D neural networks.
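A minimal sketch of this preparation step is given below; the clipping-and-scaling form of Equation (3) and the 3 × 3 arrangement of the nine slices are our assumptions, and boundary handling near the image edges is omitted.

```python
import numpy as np

WW, WL = 1500, -600  # lung window width and level

def apply_lung_window(hu: np.ndarray) -> np.ndarray:
    """Map CT numbers (HU) to [0, 255] with the lung window (assumed linear form of Equation (3))."""
    lower, upper = WL - WW / 2, WL + WW / 2
    clipped = np.clip(hu, lower, upper)
    return np.round((clipped - lower) / WW * 255).astype(np.uint8)

def nodule_patch_to_2d_input(volume_hu: np.ndarray, center_zyx: tuple) -> np.ndarray:
    """Extract a 64 x 64 x 9 patch centred on the nodule and tile it into a 192 x 192 x 1 image."""
    z, y, x = center_zyx
    patch = apply_lung_window(volume_hu[z - 4:z + 5, y - 32:y + 32, x - 32:x + 32])  # shape (9, 64, 64)
    rows = [np.hstack(patch[3 * i:3 * i + 3]) for i in range(3)]                     # three rows of three slices
    return np.vstack(rows)[..., np.newaxis]                                          # shape (192, 192, 1)
```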
We utilize 2D input rather than 3D input in our deep learning models. This decision is based on the fact that 3D input requires significantly higher computational resources than its 2D counterparts [
32]. The 3D format entails longer training time and necessitates a larger memory capacity for the model. Consequently, this research does not employ 3D images as the input. However, we test 3D deep learning models to compare performance between 2D and 3D inputs. For the 3D model, we use a matrix of dimensions 9 × 64 × 64 × 1 as the input, and the model is ‘3D ResNet101’, which is described in
Section 2.9.
2.8. 2D Neural Networks and DL Algorithms
In our research, we employ various DL algorithms, exploring a range of CNN architectures, including VGG16 [
33], ResNet101 [
34], InceptionNet [
35], and ConvNeXt [
36]. Additionally, we investigate two multi-modal models: EVA02 [
37] and Meta Transformer [
38]. Furthermore, we comprehensively evaluate seven contemporary self-attention models, which include the Dual Attention Vision Transformers (DaViTs) [
39], Vision Outlooker for Visual Recognition (VOLO) [
40], Swin Transformer V2 [
41], Phase-Aware Vision MLP (Wave-MLP) [
42], LeViT [
43], Dilated Neighborhood Attention Transformer (DINAT) [
44], and Masked Image Modeling with Vector-Quantized Visual Tokenizers (BEIT v2) [
45]. These models have been selected for their innovative contributions to processing complex visual data through advanced attention mechanisms and architectural designs. Our objective in testing such a diverse set of self-attention models is to assess their efficacy in classifying lung nodule malignancy.
Each DL model is initially pre-trained on the ImageNet1000 dataset [
46] and subsequently fine-tuned through transfer learning [
47] using our in-house dataset. All training processes adhere to the same early stopping criterion [
48], which halts training if the validation loss does not improve for more than ten epochs, indicating model convergence. All models utilize the Adadelta optimizer [
49], maintain a cyclic learning rate [
50] (with a maximum learning rate of 0.1 and a minimum learning rate of 0.00001), and employ Binary Focal Loss with an alpha value of 2 [
51]. All experiments are run on an NVIDIA RTX 3090 GPU paired with an Intel i9-7900X CPU, with a software setup that includes PyTorch 2.1.0 and Ultralytics 8.2.8.
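The shared training configuration can be sketched roughly as follows. The backbone name, the data loaders, and the evaluate helper are hypothetical placeholders, and the focal-loss value of 2 reported above is mapped to the gamma argument of torchvision's implementation, which is an assumption about the convention used.

```python
import timm
import torch
from torchvision.ops import sigmoid_focal_loss

# Assumed timm backbone pre-trained on ImageNet-1k, adapted to one input channel and one logit.
model = timm.create_model("davit_base", pretrained=True, in_chans=1, num_classes=1)

optimizer = torch.optim.Adadelta(model.parameters())
# cycle_momentum is disabled because Adadelta has no momentum parameter.
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-5, max_lr=0.1,
                                              cycle_momentum=False)

best_val_loss, patience, stale = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    for images, labels in train_loader:        # train_loader: hypothetical DataLoader of 192x192 patches
        optimizer.zero_grad()
        logits = model(images).squeeze(1)
        loss = sigmoid_focal_loss(logits, labels.float(), alpha=0.25, gamma=2.0,
                                  reduction="mean")
        loss.backward()
        optimizer.step()
        scheduler.step()                        # cyclic learning rate updated every batch

    val_loss = evaluate(model, val_loader)      # evaluate: hypothetical validation helper
    if val_loss < best_val_loss:
        best_val_loss, stale = val_loss, 0
    else:
        stale += 1
        if stale > patience:                    # early stopping after 10 epochs without improvement
            break
```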
2.9. 3D ResNet101
The 3D ResNet101 [
52,
53] represents an evolution of the traditional ResNet architecture, tailored explicitly for processing 3D data. In contrast to the standard 2D ResNet, the 3D ResNet101 is optimized for handling volumetric data such as video frames or medical imaging scans [
52]. This model maintains the deep residual network structure, connecting layers with identity mappings to address the vanishing gradient problem that often occurs in very deep networks. The “101” denotes the network’s depth, which comprises 101 layers. In the 3D variant, the 2D convolutional layers are replaced with 3D convolutions, allowing the model to capture three-dimensional information. This makes it particularly effective for tasks such as action recognition in videos, 3D object recognition, and volumetric medical image analysis. We selected this model for comparison because it has also been used for pulmonary micronodule malignancy risk classification [
52].
2.10. Model Performance Validation
We implemented ten-fold cross-validation to rigorously assess the performance of the various models. Crucially, to prevent data leakage and ensure the generalizability of our results, the dataset partitioning was performed strictly at the patient level rather than the nodule level. This ensures that all nodules belonging to the same patient are assigned exclusively to either the training or the testing set in each iteration, thereby preventing the model from recognizing patient-specific features across splits. The patient cohort was divided into ten subsets; nine were used for training (employing transfer learning), and one was reserved for testing. This procedure was repeated ten times, with each subset serving as the test set once. The final performance is reported as the average across all ten folds. We evaluated the models using Balanced Accuracy, Weighted Sensitivity, Weighted Precision, Weighted F-score, and Weighted AUC, as defined in Equations (4)–(7), ensuring a comprehensive and robust assessment of effectiveness.
where TP, FP, and FN denote true positive, false positive, and false negative, respectively.
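A schematic of the patient-level split and the weighted metrics is given below, using scikit-learn's GroupKFold as a stand-in for the exact partitioning procedure; X and y are assumed to be NumPy arrays, and model_factory is a hypothetical callable returning a fresh, untrained classifier.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import (balanced_accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

def patient_level_cv(model_factory, X, y, patient_ids, n_splits=10) -> dict:
    """Ten-fold CV in which every nodule of a given patient falls into exactly one fold."""
    fold_scores = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=patient_ids):
        model = model_factory()                       # fresh classifier for each fold
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        pred = (prob >= 0.5).astype(int)
        precision, sensitivity, f_score, _ = precision_recall_fscore_support(
            y[test_idx], pred, average="weighted", zero_division=0)
        fold_scores.append({
            "balanced_accuracy": balanced_accuracy_score(y[test_idx], pred),
            "weighted_sensitivity": sensitivity,
            "weighted_precision": precision,
            "weighted_f_score": f_score,
            "weighted_auc": roc_auc_score(y[test_idx], prob, average="weighted"),
        })
    # Report the mean of each metric over the ten folds.
    return {k: float(np.mean([s[k] for s in fold_scores])) for k in fold_scores[0]}
```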
2.11. Understanding Object Recognition: Model Interpretability and Visibility
Gradient-weighted Class Activation Mapping (Grad-CAM) [
54] is a prominent technique designed to enhance the interpretability of deep learning models by generating heatmaps that highlight the regions of an input image that most influence the model’s class predictions. The method computes the gradients of the class score with respect to the feature maps of the model’s final convolutional layer. By weighting the feature maps with these gradients and combining them, Grad-CAM produces an activation map that indicates the model’s focus areas within the input. This map can be overlaid on the original image to highlight the regions that capture the model’s attention, thereby facilitating an understanding of whether it relies on relevant image features for classification. This transparency not only assists in validating the model’s decision-making process but also exposes potential biases or shortcomings. In this study, Grad-CAM is utilized to elucidate the models’ attention mechanisms, confirming their accurate identification of nodule characteristics in medical imaging.
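The mechanism can be sketched in a few lines of PyTorch for a CNN-style backbone (transformer models require adapted target layers); the hook-based implementation below is our own minimal version, not the reference code of [54].

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Return a [0, 1] heatmap of the regions that drive the chosen class score."""
    feats, grads = {}, {}

    def save_feats(module, inputs, output):
        feats["value"] = output                      # activations of the target (last conv) layer

    def save_grads(module, grad_input, grad_output):
        grads["value"] = grad_output[0]              # gradients of the class score w.r.t. those activations

    h1 = target_layer.register_forward_hook(save_feats)
    h2 = target_layer.register_full_backward_hook(save_grads)

    logits = model(image)                            # image: (1, C, H, W)
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()

    weights = grads["value"].mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
    cam = F.relu((weights * feats["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalize for overlay

    h1.remove()
    h2.remove()
    return cam[0, 0]
```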
4. Discussion
Figure 9 illustrates the differences in nodule size distribution between KVGH and CYCH, revealing that KVGH nodules are generally larger, while those from CYCH are smaller. Additionally, the benign and malignant nodules in CYCH exhibit a greater degree of overlap in size, further complicating classification. Nevertheless, 64% of the selected features overlap between the two datasets, underscoring the significance of voxel normalization in achieving consistent feature selection across varying distributions. This discrepancy also helps explain why the ML models in this study performed slightly less effectively than in our previous research [
18]. Since CYCH serves as the primary dataset and contains smaller nodules with greater overlap between benign and malignant cases, the classification task is inherently more challenging.
Furthermore,
Figure 9 emphasizes the limitations of using the LIDC-IDRI dataset for training nodule classification models. Although the nodules in LIDC-IDRI exhibit considerable diameter overlap with those in CYCH and KVGH, they lack pathology-proven diagnoses and rely solely on the subjective malignancy ratings provided by radiologists. This reliance increases the risk of erroneous ground truth labels, which could lead to unreliable model performance when applied to datasets with confirmed diagnoses.
It is important to acknowledge that the comparison between ML and DL models reflects a fundamental difference in input modalities: ROI-based radiomics versus patch-based raw images. For the ML analysis, we deliberately focused on texture-based stability, explicitly excluding 14 geometric features (e.g., volume, size) and employing a coarser intensity discretization (bin width of 25 CT numbers). While this approach ensured a rigorous evaluation of texture features across centers, it inherently limited the information available to the ML classifiers (e.g., CatBoost). In contrast, the patch-based DL models utilized inputs with full 256 grayscale levels and preserved spatial contexts. Consequently, the superior performance of the DL models should be attributed to their ability to leverage high-resolution, pixel-level information, whereas the ML models were restricted to discretized, aggregated texture descriptors within the ROIs.
While numerous studies have advocated for 3D ResNet architectures to capture volumetric spatial information, we demonstrate that 2D Transformer-based models (specifically DaViT and BEIT v2) achieve equivalent diagnostic performance (AUC ~0.999) without the prohibitive computational overhead associated with 3D networks. Traditional 2D CNNs often struggle with nodule classification due to their limited receptive fields; however, the self-attention mechanisms in Transformers effectively capture long-range dependencies across the entire image, mimicking the contextual understanding usually attributed to 3D models.
From a practical clinical perspective, selecting 2D Transformers over 3D architectures offers significant advantages for routine LDCT workflows. Although 3D ResNet101 contains fewer parameters, it necessitates processing volumetric data (9 × 64 × 64), which demands substantially higher video memory and floating-point operations (FLOPs) during inference. In contrast, 2D models require significantly lower computational resources, allowing for faster inference times and reducing the dependency on high-end, expensive GPU hardware. This efficiency is critical for integrating CAD systems into standard clinical workstations or PACS (Picture Archiving and Communication Systems), where hardware resources are often constrained. Therefore, the 2D Transformer-based approach provides a more scalable and cost-effective solution for large-scale lung cancer screening programs.
The challenges associated with ML-based approaches are evident. Manual contour delineation of nodules is time-consuming, and the feature selection can vary between different runs, leading to inconsistencies among researchers. Our findings emphasize that voxel normalization is essential in processing LDCT data, as 64% of the features selected in a previous study reemerged in this research, highlighting its significance in maintaining feature stability.
To further elucidate the impact of voxel normalization on deep learning architectures, our ablation study using BEIT v2 provides critical insights into the trade-off between spatial resolution and model stability. The observed performance plateau between 1.0 mm and 1.5 mm suggests that this resolution range optimally preserves the semantic features requisite for malignancy classification without introducing redundant information. Conversely, the degradation in accuracy coupled with a drastic increase in standard deviation at finer resolutions (specifically 0.625 mm) indicates that excessive spatial detail likely introduces high-frequency noise and reconstruction artifacts. Crucially, the loss of statistical significance at 0.625 mm (p = 0.062), despite a substantial drop in mean accuracy, statistically corroborates the model’s severe instability and susceptibility to overfitting. This finding parallels our observations in the ML analysis, reinforcing that 1.5 mm isotropic voxel normalization is a pivotal preprocessing step for ensuring robustness in both radiomics and deep learning pipelines.
One limitation of this research is the relatively small dataset, comprising only 415 patients from two medical centers. While utilizing pathology-confirmed LDCT data represents a significant strength, a more extensive dataset obtained from multiple centers would validate the findings more robustly. Additionally, merging annotations from various experts for nodule boundaries poses a considerable challenge. Some studies have explored using GANs for contour fusion; however, these models can be complex for clinicians to interpret. As an alternative, we propose an approach based on the FFT that mathematically integrates boundaries while preserving shape details. This method achieves performance comparable to that of GANs while being more interpretable. Furthermore, the computational speed of FFT is significantly greater than that of GANs, and it does not require expensive hardware such as GPUs, making it a more cost-effective solution that is better suited for clinical applications.
To determine the optimal error tolerance for this contour fusion technique, we performed a sensitivity analysis on the FFT thresholding parameter, ranging from 0.4 to 0.8. As presented in
Table 6, the threshold of 0.7 was identified as the optimal operating point, achieving the highest accuracy of 0.926 and an F1-Score of 0.926. Our analysis reveals a clear trade-off: thresholds below 0.7 (0.4–0.6) exhibit a slight degradation in performance (accuracy ~0.912–0.918). While these lower thresholds yield higher sensitivity by generating broader contours, they inadvertently incorporate non-tumorous background tissues, thereby reducing precision. Conversely, increasing the threshold to 0.8 results in a sharp decline in performance, with accuracy dropping to 0.865 and sensitivity to 0.784. This degradation suggests that an overly strict threshold excludes peripheral nodule textures essential for malignancy characterization. Therefore, the 0.7 threshold represents the critical balance point that maximizes feature stability and classification power.
Our findings underscore the pivotal role of voxel normalization, emphasize the limitations of the LIDC-IDRI dataset, and illustrate the effectiveness of FFT-based annotation fusion. Future research should investigate voxel normalization’s impact on deep learning models and compare various contour integration techniques to refine pulmonary nodule classification further.