1. Introduction
Lung cancer is one of the most aggressive and deadly forms of cancer, accounting for some of the highest rates of cancer-related cases and deaths worldwide. According to the World Health Organization (WHO), more than 1.8 million people die from this disease each year. The primary reason for the high mortality rate is that lung cancer is often diagnosed only after it has progressed to an advanced stage. In its early stages—particularly when it manifests as pulmonary nodules—the disease is usually asymptomatic but can be successfully treated if detected and evaluated promptly. This underscores the urgent need for advanced diagnostic tools capable of identifying and classifying pulmonary nodules at an early stage.
Pulmonary nodules are small, rounded opacities in the lungs that may be either benign or malignant. Their morphology, size, margins, and location are critical diagnostic indicators for distinguishing malignant nodules from benign ones. Currently, the standard diagnostic process involves manual interpretation of CT scans by radiologists, which can be time-consuming and subject to both inter-observer and intra-observer variability. Due to this variability, along with the need for faster and more accurate diagnosis and treatment planning, there is an increasing demand for computer-based systems that can reliably classify pulmonary nodules and assist clinicians in a timely manner.
The landscape of medical imaging has evolved rapidly due to advancements in machine learning (ML) and artificial intelligence (AI). Modern systems can analyze large volumes of imaging data and identify patterns that may not be readily apparent to human observers. Deep learning techniques, particularly Convolutional Neural Networks (CNNs), have demonstrated significant promise in medical image analysis and classification. These models offer powerful feature extraction capabilities and are particularly suitable for pulmonary nodule analysis when properly validated and trained.
The development of computer-aided diagnosis (CAD) systems for lung nodule classification typically includes phases such as data collection, image preprocessing, feature extraction, model training, classification, and performance evaluation. The primary objective of this study is to classify pulmonary nodules by type and malignancy using a comprehensive CAD framework. Several deep learning architectures were evaluated, including a traditional CNN, a hybrid CNN-SVM model, ResNet101, and a custom Attention-Based CNN model. All models were trained and tested on publicly available benchmark datasets, IQ-OTH/NCCD and LIDC-IDRI, containing CT scans annotated by experienced radiologists.
The CNN model served as the baseline architecture. CNNs utilize convolutional and pooling layers to learn spatial hierarchies of features. Although they demonstrate strong performance, standard CNNs may lose high-level spatial relationships, particularly when distinguishing between medical image classes that exhibit subtle differences. To address this limitation, a hybrid CNN-SVM model was implemented, combining CNN-based feature extraction with a Support Vector Machine (SVM) classifier. SVMs are effective classifiers that generalize well even with limited training data and perform efficiently in high-dimensional feature spaces.
Building upon this foundation, the study further explored ResNet101, a deep residual network comprising 101 layers designed to mitigate the vanishing gradient problem associated with deep architectures. ResNet101 facilitates deeper learning of feature representations, which is essential for identifying complex patterns in pulmonary nodule images. However, even deep residual networks may overlook fine-grained details if attention is not directed toward the most relevant image regions.
To overcome this challenge, this research introduces an innovative Attention-Based CNN that incorporates both channel-wise and spatial attention mechanisms. These attention modules enable the network to focus on diagnostically important regions of CT scans, thereby improving its ability to distinguish between visually similar nodules.
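The idea of channel-wise and spatial gating can be illustrated with a deliberately simplified, parameter-free sketch. It is not the architecture proposed in this study: real attention modules use learned weights, which are omitted here, and the function names and the toy feature map are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    """Rescale each channel by a gate computed from its global average."""
    out = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        out.append([[v * gate for v in row] for row in channel])
    return out

def spatial_attention(fmap):
    """Rescale each pixel by a gate computed from its mean across channels."""
    n_ch = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    gates = [[sigmoid(sum(fmap[c][i][j] for c in range(n_ch)) / n_ch)
              for j in range(width)] for i in range(height)]
    return [[[fmap[c][i][j] * gates[i][j] for j in range(width)]
             for i in range(height)] for c in range(n_ch)]

# Toy feature map (hypothetical values): 2 channels, 2 x 2 pixels each.
fmap = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[2.0, 0.0], [0.0, 0.0]],
]
attended = spatial_attention(channel_attention(fmap))
```

Both gates lie strictly between 0 and 1, so the output keeps the shape of the input while re-weighting channels and locations relative to one another.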
The classification tasks addressed in this study include two critical challenges: (i) differentiating benign from malignant nodules and (ii) identifying shape-based nodule categories to support differential diagnosis. These tasks involve varying levels of complexity. For example, juxta-vascular nodules occur near blood vessels, making them difficult to distinguish from surrounding tissues. Similarly, juxta-pleural nodules are located adjacent to the pleural lining and may generate artifacts that complicate analysis. Additionally, benign nodules can closely resemble malignant nodules in terms of density and size, necessitating advanced feature discrimination to reduce false positives and false negatives.
Model performance was evaluated using standard metrics, including accuracy, precision, recall, F1-score, and confusion matrix analysis. These metrics provide a comprehensive assessment, particularly when class imbalance results in underrepresentation of certain nodule categories. The proposed Attention-Based CNN outperformed CNN (92.5%), CNN-SVM (94.2%), and ResNet101, achieving a classification accuracy of 98.5% on the LIDC-IDRI dataset and 96.1% on the IQ-OTH/NCCD dataset. Through its attention mechanisms, the model effectively focuses on the most relevant features, enhancing its ability to differentiate nodule types and malignancy levels. These findings highlight the effectiveness of adaptive learning approaches in improving the precision and reliability of medical image classification.
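As a minimal sketch of how these metrics relate to the confusion matrix, the following function derives them from paired label lists. The function name and the toy labels are illustrative assumptions and are not results from this study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion counts and standard metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": [[tn, fp], [fn, tp]],  # rows: true class, cols: predicted
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy imbalanced example (hypothetical labels): 3 malignant, 7 benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.8 while recall is only one third, which illustrates why accuracy alone can be misleading when one class is underrepresented.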
Beyond quantitative performance, qualitative evaluation of model outputs provides important insights into interpretability. Visualization techniques such as heat maps and activation maps identify image regions that contribute most strongly to the model’s predictions. Such transparency is essential for clinical applications, where explainability and trust are critical for adoption.
Despite the favorable results, certain limitations remain. Classification accuracy depends heavily on the diversity and quality of training data. Although LIDC-IDRI and IQ-OTH/NCCD are widely used datasets, they may not fully represent the heterogeneity encountered in real-world clinical settings. Furthermore, inter-observer variability in ground-truth annotations can introduce label noise, affecting model training. Future research should consider incorporating multi-institutional datasets and semi-supervised learning techniques to enhance generalizability and robustness.
Healthy lung structures in CT images typically exhibit bilateral symmetry, while the presence of nodules disrupts this symmetry through localized variations in shape, texture, and intensity. These asymmetrical patterns provide valuable cues for lung cancer detection. The proposed attention-based deep learning framework leverages this principle by learning to focus on regions where normal symmetry is disrupted, thereby enhancing discrimination between benign and malignant nodules.
Finally, successful integration of CAD systems into clinical practice requires consideration of processing time, usability, and compatibility with modern radiology workflows. A lightweight and efficient Attention-Based CNN architecture can be deployed on edge or cloud platforms to enable rapid and practical diagnostic support. This study presents a robust framework for lung nodule classification using computed tomography imaging and demonstrates that an Attention-Based CNN architecture delivers superior performance for this application. As part of a comprehensive CAD system, this framework has the potential to provide significant clinical value by supporting radiologists in early detection and improving diagnostic confidence for patients with pulmonary nodules.
2. Literature Survey
Mehr Kashyap et al. [1] presented an ensemble three-dimensional U-Net model trained on 1504 radiation therapy CT scans for lung tumor detection and segmentation. On a test set of 150 CT scans, the model achieved 92% sensitivity and 82% specificity. The segmentation performance reached a median Dice similarity coefficient of 0.77, which was comparable to inter-physician agreement. Moreover, the automated approach significantly reduced segmentation time compared to manual contouring, highlighting its clinical efficiency and potential for generalizability.
Hongfeng Wang et al. [2] proposed MResNet, which achieved strong performance in lung nodule classification by effectively integrating multi-scale features. A Pyramid Pooling Module (PPM) was used for feature selection. On the training set, the model achieved 99.12% accuracy, 98.64% sensitivity, 97.87% specificity, and an AUC of 0.9998. On the testing set, it maintained reliable performance with 85.23% accuracy, 92.79% sensitivity, 72.89% specificity, and an AUC of 0.9275. These results demonstrate its capability to estimate malignancy risk from CT images with good generalization ability.
Chang et al. [3] proposed a Multiview Residual Selective Kernel Network (MRSKNet) for pulmonary nodule classification using CT images. Their method integrates residual learning with selective kernel mechanisms to capture multi-scale nodule characteristics and employs axial, coronal, and sagittal views to enhance feature representation. By combining deep features with handcrafted texture features, the model achieved improved discrimination between benign and malignant nodules, reaching an AUC of 0.9711 on the LIDC-IDRI dataset. The study demonstrated that multiview and multi-scale feature fusion effectively enhances lung nodule classification performance.
Hussein et al. [4] introduced a 3D CNN-based multi-task learning framework for lung nodule risk stratification that leverages volumetric CT information and transfer learning. The model jointly learned six nodule attributes along with malignancy scores using graph-regularized learning to model inter-task relationships and mitigate reader variability. Evaluated on 1018 CT scans, the approach achieved an accuracy of 91.26% with a low mean score difference of 0.4593, demonstrating robust and reliable malignancy prediction performance.
Aparna Harale et al. [5] presented a comprehensive approach for the detection and classification of pulmonary nodules in CT scans using advanced deep learning algorithms. They proposed an optimized YOLOv5 model for lung nodule detection to address limitations in sensitivity and false negatives observed in existing detection methods. For classification, the authors employed a hybrid CNN-SVM model in which deep features extracted by a CNN were fed into an SVM classifier to categorize segmented nodules. Experimental results demonstrated highly reliable detection performance, reduced false positives, and improved classification accuracy compared to many existing methods. The optimized YOLOv5 model was able to efficiently identify nodule locations while requiring fewer computational resources. However, the study acknowledged challenges in detecting small or irregularly shaped nodules and emphasized the need for larger and more diverse datasets to further enhance model robustness. Overall, the study provided a comprehensive perspective on combining object detection and classification paradigms, highlighting the potential of multimodal data and transfer learning for advancing computer-aided diagnosis (CAD) systems in lung cancer detection.
Qin Wang et al. [6] proposed a raw patch-based CNN framework for lung nodule detection in low-dose CT images, eliminating the need for handcrafted feature extraction and traditional candidate generation. By dividing CT scans into multiple categories of nodule and non-nodule patches and employing a ResNet-based architecture, the method achieved high detection sensitivity (92.8%) with controlled false positives on the LIDC-IDRI dataset. These results demonstrate the effectiveness of directly learning deep features from raw CT data.
Xu et al. [7] introduced a confidence-aware semi-supervised segmentation framework designed to reduce reliance on extensive manual annotations in medical imaging. The method adaptively weights pixel-level supervision by distinguishing between uncertain and reliable regions. Entropy-based consistency learning emphasizes difficult pixels, while dynamic pseudo-label weighting reinforces trustworthy predictions. This strategy enables effective utilization of limited labeled data alongside abundant unlabeled samples. Experimental validation on benchmark datasets (BraTS 2019, LA, and Pancreas-CT) demonstrated improved segmentation performance, achieving a Dice score of 86.33% on BraTS 2019 using only 20% labeled data. These findings indicate strong robustness and practical potential for tumor and organ segmentation tasks.
Kido et al. [8] proposed a nested three-dimensional fully convolutional network with residual connections for automatic lung nodule segmentation from CT images. The model integrates multi-level encoder features through a nested decoder architecture and employs a hybrid loss function combining Dice loss and binary cross-entropy to enhance learning stability. The approach demonstrated strong segmentation performance on 332 lung nodules, achieving a Dice score of 0.845 and an Intersection over Union (IoU) of 0.738, outperforming architectures such as 3D U-Net and 3D SegNet, as well as traditional segmentation techniques including watershed and graph cut methods. The model showed particular effectiveness in segmenting challenging nodules with ground-glass opacity and chest-wall attachment, indicating its suitability for robust computer-aided lung cancer diagnosis.
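For readers less familiar with the overlap measures quoted above, the following minimal sketch computes the Dice coefficient and IoU for a pair of binary masks. The flattened toy masks and the function name are hypothetical and are not taken from the cited study.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally long sequences of 0/1 labels."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    # Two empty masks are treated as a perfect match (a convention assumed here).
    dice = 2 * intersection / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou

# Flattened toy masks (hypothetical): 1 marks a nodule pixel.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two measures are linked by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU. Values averaged over many nodules need not satisfy this identity exactly.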
Aparna Harale et al. [9] presented a computer-aided detection (CAD) approach for early lung cancer diagnosis through pulmonary nodule detection and classification. The proposed method incorporates image preprocessing, segmentation, feature extraction, and machine learning techniques to enhance diagnostic accuracy. Using advanced image-processing techniques, the CAD system detects and segments nodules, extracts significant features, and trains classifiers to differentiate between benign and malignant nodules. The results demonstrate improved diagnostic sensitivity and a reduction in human errors compared to traditional radiological examinations. The study further suggests that the system can assist radiologists in selecting appropriate clinical interventions, particularly in high-volume imaging environments. However, the authors acknowledge limitations related to dataset size and variability across labeled datasets, emphasizing the need for validation using multi-institutional data. Future work focuses on improved feature extraction techniques and deeper learning architectures for enhanced classification performance.
Gupta and Shukla [10] presented a deep learning-based approach for lung nodule classification using a modified AlexNet convolutional neural network trained on CT images from the LIDC-IDRI (LUNA-16) dataset. The model automatically learned discriminative features from CT slices without the need for manual feature engineering. Using binary cross-entropy loss, the proposed network achieved 99% training accuracy and 97% validation accuracy, demonstrating the effectiveness of transfer learning with AlexNet for distinguishing between benign and malignant lung nodules.
Aparna Harale et al. [11] developed an innovative computer-aided detection (CAD) method for identifying pulmonary nodules in CT scans. The system performs nodule detection through image processing techniques involving segmentation and feature extraction based on tumor shape and texture. The performance of the CAD system was evaluated using CT scans from a well-curated dataset, and results indicated a significant improvement in detection sensitivity along with a reduction in radiologists’ workload. However, the authors noted that false positives, primarily caused by blood vessels and image noise, remain a major limitation restricting the broader application of CAD systems. While the study demonstrated the effectiveness of handcrafted features and traditional classifiers, it also highlighted the need to transition toward more advanced deep learning architectures. The authors suggested that future research should address limitations related to smaller datasets, limited validation strategies, and the absence of deep neural networks. Incorporating larger multi-institutional datasets, implementing rigorous validation protocols, and integrating deep learning frameworks were identified as essential steps for improving CAD-assisted diagnosis in clinical settings.
Donga et al. [12] proposed a pulmonary nodule classification framework using CT images from the LIDC-IDRI dataset. Their approach included image enhancement through anisotropic diffusion filtering, semi-automatic nodule segmentation using the random walker algorithm, and texture feature extraction using Local Binary Patterns (LBP) and steerable Riesz wavelets. A modified gradient boosting classifier was then employed for classification. The framework achieved a validation accuracy of 95.67%, along with high precision and F1-score, outperforming several conventional machine learning and certain deep learning approaches reported in prior studies.
Aparna Harale et al. [13] conducted an extensive investigation into CT-based lung cancer detection and classification. The authors proposed a novel multistage framework incorporating preprocessing, segmentation, feature extraction, and machine learning–based classification. The approach systematically addressed all major components of pulmonary nodule detection, employing enhancement and denoising techniques to improve image quality and segmentation accuracy. Various feature extraction methods were utilized, focusing primarily on shape, texture, and intensity-based descriptors. These extracted features were subsequently fed into machine learning classifiers to differentiate between benign and malignant nodules. Validation using reference datasets demonstrated that the proposed approach was robust and competitive with other conventional methods reported in the literature. The study emphasized the importance of precise preprocessing and appropriate classifier selection in enhancing the overall diagnostic accuracy of CAD systems. Furthermore, the authors highlighted that incorporating deep learning techniques and larger, more diverse datasets could further improve system performance. Although not presenting a fully autonomous solution, the study contributes toward the advancement of data-driven intelligent healthcare systems and lays the groundwork for future research in developing robust, automated CAD systems for early lung cancer detection. A summary of state-of-the-art methods for lung cancer recognition is presented in Table 1.
Despite notable progress in lung nodule classification using machine learning and deep learning models, several critical challenges remain. Much of the recent research has focused primarily on improving classification accuracy through CNNs, hybrid architectures, and attention-based networks. However, accuracy alone does not guarantee clinical applicability. A closer examination reveals several important gaps.
First, many models are constrained by the quality and diversity of publicly available datasets, such as LIDC-IDRI and IQ-OTH/NCCD. Although widely used, these datasets do not capture the full spectrum of nodule types, anatomical variations, or imaging artifacts encountered in real-world clinical practice. This limitation restricts model generalizability across diverse populations, scanners, and imaging protocols.
Second, most existing approaches focus solely on binary classification (benign vs. malignant) without simultaneously addressing morphological subtype classification, which is clinically important for differential diagnosis and treatment planning.
Third, although some studies have incorporated attention mechanisms, hybrid CNN-SVM structures, or deep residual networks, the interpretability of these models remains limited. In high-stakes medical applications, the lack of explainable outputs reduces clinician trust and may hinder adoption in practice.
Furthermore, many state-of-the-art methods suffer from computational inefficiency, making deployment challenging in time-sensitive or resource-constrained environments. Very deep architectures or multimodal fusion networks (e.g., integrating CT with PET or MRI) may improve diagnostic performance; however, they increase training complexity and computational cost while depending on imaging modalities that are not always routinely available.
Finally, most existing systems are not designed with real-time performance and streamlined clinical workflow integration in mind. Collectively, these limitations highlight the need for classification approaches that strike a better balance between accuracy, generalizability, interpretability, computational efficiency, and practical applicability across diverse healthcare environments.
3. Materials and Methods
In this section, we describe the datasets, preprocessing techniques, and deep learning architectures used for pulmonary nodule classification. The primary objective of the classification task was to categorize nodules as either benign or malignant. We evaluated several deep learning models, including a conventional CNN, a hybrid CNN-SVM model, ResNet101, and a custom Attention-Based CNN architecture.
3.1. Dataset Description
Deep learning algorithms offer significant advantages in handling large volumes of diverse medical imaging data, contributing to improved accuracy and robustness in lung nodule classification. This study utilizes two well-known public datasets: LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative) and IQ-OTH/NCCD (Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases). These datasets provide comprehensive and reliable resources for training, testing, and validating deep learning models for pulmonary nodule classification.
LIDC-IDRI: The LIDC-IDRI dataset is one of the largest and most widely used resources for lung nodule detection and characterization. It contains 1018 thoracic CT scans, each independently reviewed by four experienced radiologists. The annotations provide detailed information regarding the location, size, shape, texture, and malignancy likelihood of each identified nodule. Radiologists assigned subjective malignancy ratings on a scale of 1 to 5, where scores of 1 and 2 indicate benign characteristics and scores of 4 and 5 indicate malignant characteristics. The indeterminate class (score = 3) was not modeled as a separate category, and ordinal learning approaches were not considered in this study; future work may explore uncertainty-aware or ordinal regression methods to handle such cases. All CT scans are stored in DICOM format, including metadata such as slice thickness, pixel spacing, and resolution parameters. This allows researchers to reconstruct three-dimensional lung volumes or extract two-dimensional slices for analysis. In this study, two-dimensional (2D) CT slices were extracted from the original volumetric CT scans, and nodule regions were extracted from them using the annotated coordinates. The resulting regions of interest (ROIs) were resized to 224 × 224 pixels, normalized to ensure consistency across samples, and used as input to the CNN models. Although 3D CNNs can capture volumetric spatial context, a 2D approach was adopted due to its lower computational complexity, faster training, compatibility with slice-level annotations, and feasibility for large-scale evaluation. The LIDC-IDRI dataset is widely used for binary classification tasks differentiating benign and malignant nodules. Its diversity in scanner types, imaging settings, and nodule characteristics makes it particularly suitable for training and validating robust deep learning models for lung nodule classification.
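The rating-to-label rule described above can be sketched as follows. The function name and the sample records are hypothetical. How the ratings of several radiologists are combined into one score per nodule is not specified here, so the sketch operates on a single score, and skipping score-3 nodules is one possible reading of "not modeled as a separate category".

```python
def malignancy_label(score):
    """Map a 1-5 malignancy rating to a binary label, or None if indeterminate."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be an integer from 1 to 5, got {score!r}")
    if score <= 2:
        return "benign"
    if score >= 4:
        return "malignant"
    return None  # score 3: indeterminate, no binary label assigned

# Hypothetical (nodule_id, rating) records.
records = [("n1", 1), ("n2", 3), ("n3", 5), ("n4", 4), ("n5", 2)]
labelled = [(nid, malignancy_label(s)) for nid, s in records
            if malignancy_label(s) is not None]
```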
IQ-OTH/NCCD: The IQ-OTH/NCCD dataset is a carefully curated collection of thoracic CT scans designed for lung nodule classification into three categories: normal, benign, and malignant. The dataset comprises 110 patient cases, with labels confirmed through both radiological assessment and pathological verification to ensure high diagnostic reliability. Unlike LIDC-IDRI, the IQ-OTH/NCCD dataset provides CT images in high-resolution 2D formats such as PNG and JPEG. These image formats allow direct utilization in 2D convolutional neural networks without the need for extensive DICOM preprocessing, making the dataset particularly suitable for efficient model training and evaluation. In this study, the dataset was used primarily for binary classification by distinguishing benign from malignant nodules. The images were resized to 224 × 224 pixels, and pixel intensities were normalized to ensure consistency in input dimensions and image quality. Due to class imbalance and the relatively small sample size, data augmentation techniques—including rotation, flipping, and zooming—were applied to enhance generalization and reduce overfitting. Since the models were also evaluated on the LIDC-IDRI dataset, the IQ-OTH/NCCD dataset served as an additional benchmark to assess robustness across varying imaging sources. This cross-dataset validation provides valuable insights into the generalizability of the proposed models under different clinical conditions, imaging devices, and dataset characteristics. All CT images were resized to 224 × 224 during preprocessing to ensure uniform input dimensions. Subsequently, within the CNN architecture, the spatial dimensions are progressively reduced through convolution and pooling operations, resulting in 28 × 28 feature maps before the classification stage.
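The augmentation operations mentioned for IQ-OTH/NCCD (rotation, flipping, and zooming) can be illustrated on a small image stored as nested lists. This is a sketch under stated assumptions: the rotation angle (90 degrees), the zoom factor, and nearest-neighbour resampling are choices made for this example, and the function names are hypothetical.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def zoom_center(img, factor=2):
    """Crop the central 1/factor region and upsample it back to the
    original size with nearest-neighbour interpolation."""
    h, w = len(img), len(img[0])
    crop_h, crop_w = h // factor, w // factor
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    return [[img[top + i * crop_h // h][left + j * crop_w // w]
             for j in range(w)] for i in range(h)]

# Toy 4 x 4 "image" (hypothetical intensities).
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
augmented = [flip_horizontal(img), rotate_90(img), zoom_center(img)]
```

Each transform returns an image of the same size as its input, so augmented samples can be fed to the network without further resizing.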
3.2. Classification Models
Classifying lung nodules into morphological and malignancy categories is a critical component of computer-aided diagnosis (CAD) pipelines. In this study, four classification models were implemented and evaluated: a conventional Convolutional Neural Network (CNN), a hybrid CNN-SVM model, a deep residual network (ResNet101), and a novel Attention-Based CNN. Each model offers distinct architectural advantages and varying levels of complexity, with the shared objective of improving classification accuracy, robustness, and generalization performance.
3.2.1. Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) are capable of learning hierarchical spatial features, making them highly effective for image classification tasks. The implemented CNN architecture consists of multiple sequential layers, including convolutional layers with ReLU activation functions, max-pooling layers for spatial dimension reduction, a flattening layer, and two fully connected (dense) layers. For binary classification, a sigmoid activation function was applied in the final output layer, while Softmax can be used in multi-class configurations to improve class separability. The proposed CNN model serves as a baseline architecture against which more complex models are compared to evaluate improvements in performance metrics.
The overall structure of the CNN architecture is illustrated in Figure 1.
After successive convolution and max-pooling layers, the feature map resolution decreases from 224 × 224 to 28 × 28, representing the intermediate feature representation used for classification. An input image of size 28 × 28 pixels was processed using the CNN model. The architecture consists primarily of convolutional blocks followed by fully connected layers. To learn local image features such as edges and textures, the first convolutional block applies 32 kernels of size 5 × 5 to the input image. This is followed by a Rectified Linear Unit (ReLU) activation function, which introduces non-linearity and enables the network to learn complex feature representations. Subsequently, a 2 × 2 max-pooling operation is performed to downsample the feature maps from 28 × 28 to 14 × 14 pixels. This operation enhances spatial invariance while reducing computational complexity. The second convolutional block consists of 64 filters of size 5 × 5, again followed by ReLU activation and a 2 × 2 max-pooling layer. This further reduces the feature map size to 7 × 7 while extracting higher-level features. The resulting feature maps are then flattened into a one-dimensional vector of size 3136 (7 × 7 × 64) and passed to the fully connected (dense) layers. In this stage, each neuron is fully connected to all neurons in the previous layer, allowing the network to integrate the spatial features learned during convolution into a global representation suitable for classification. The final output layer generates class probabilities. For multi-class classification tasks, a Softmax activation function is typically used, whereas for binary classification tasks, a sigmoid activation function is applied. This hierarchical architecture progressively extracts low-level and high-level features, making it well-suited for lung nodule classification tasks such as predicting morphological type or malignancy status.
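The feature-map sizes quoted above can be checked with a short shape trace. The sketch assumes stride-1 convolutions with zero padding of 2 ("same" padding for a 5 × 5 kernel) and a single-channel input; neither is stated explicitly in the text, but the 28 → 14 → 7 progression is consistent with that padding. The function names are hypothetical.

```python
def conv_out(size, kernel, padding, stride=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window

def trace_shapes(input_size=28, filters=(32, 64), kernel=5, padding=2):
    """Return (height, width, channels) after the input and each conv block."""
    shapes = [(input_size, input_size, 1)]  # single input channel assumed
    size = input_size
    for n_filters in filters:
        size = pool_out(conv_out(size, kernel, padding))
        shapes.append((size, size, n_filters))
    return shapes

shapes = trace_shapes()
height, width, channels = shapes[-1]
flattened = height * width * channels
print(shapes, flattened)
```

With `padding=0` the same trace ends at 4 × 4 × 64 = 1024 features, so the flattened size of 3136 implies that the convolutions preserve spatial size.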
3.2.2. CNN-SVM Hybrid Model
To enhance classification performance, particularly in high-dimensional feature spaces, a hybrid CNN-SVM model was implemented. In this configuration, the CNN functions solely as a feature extractor, with the final dense and Softmax layers removed. The deep features extracted from the final convolutional or flattening layer are directly fed into a Support Vector Machine (SVM) classifier with a radial basis function (RBF) kernel. The SVM was selected due to its strong theoretical foundation and effectiveness in separating nonlinear data by constructing optimal hyperplanes. The overall architecture of the hybrid CNN-SVM model is illustrated in Figure 2. In this study, the CNN-SVM framework combines the deep feature learning capability of CNNs with the robust classification performance of SVMs. The model processes input images of size 28 × 28 pixels. The first convolutional layer applies 32 filters of size 5 × 5, followed by a Rectified Linear Unit (ReLU) activation function to introduce nonlinearity. A 2 × 2 max-pooling layer is then applied, reducing the spatial dimensions from 28 × 28 to 14 × 14. The second convolutional layer consists of 64 filters of size 5 × 5, again followed by ReLU activation and 2 × 2 max pooling, further reducing the feature map size to 7 × 7. Additional convolutional processing may be applied before flattening the feature maps into a one-dimensional vector of size 3136 (7 × 7 × 64). This feature vector is then passed through fully connected layers to capture higher-level abstract representations. The first dense layer contains 2000 neurons with ReLU activation. Instead of using a final Softmax output layer, the extracted deep features are input into the SVM classifier for the final decision regarding nodule classification.
This hybrid architecture leverages the representational strength of deep neural networks for feature extraction while utilizing the discriminative power of SVMs to achieve robust classification, particularly in scenarios with limited training samples.
Instead of terminating with a conventional Softmax or sigmoid output layer, as in standard CNN architectures, the final classification in this hybrid model is performed by an SVM. The SVM utilizes the output of the second fully connected layer as input features and classifies the data into respective categories by constructing optimal separating hyperplanes. Applying an SVM on top of a neural network-based feature extractor can enhance performance compared to using a neural network classifier alone, particularly when the dataset is limited or when classes are not linearly separable in the original feature space. The RBF kernel further enables nonlinear decision boundary formation, improving classification capability in complex feature spaces.
By combining the CNN’s deep feature extraction capability with the SVM’s strong discriminative power, the hybrid architecture achieves improved accuracy and generalization performance for image classification tasks, specifically in lung nodule classification.
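As a quick check on the layer dimensions described above, the following sketch traces the feature-map size through the two convolution and pooling stages. It assumes stride 1 and "same" padding (2 pixels for a 5 × 5 kernel); the text does not state these settings, but they are consistent with the reported 28 → 14 → 7 sizes. The helper names are hypothetical.

```python
def conv_out(size, kernel, padding, stride=1):
    """Spatial size after a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window


def trace_cnn_svm_shapes(input_size=28):
    """Return the (H, W, C) shape after each conv+pool stage and the flattened length."""
    size = input_size
    shapes = []
    for filters in (32, 64):                        # filter counts from the text
        size = conv_out(size, kernel=5, padding=2)  # assumed 'same' padding
        size = pool_out(size)                       # 2 x 2 max pooling
        shapes.append((size, size, filters))
    height, width, channels = shapes[-1]
    return shapes, height * width * channels


shapes, flat = trace_cnn_svm_shapes()
print(shapes, flat)
```

Under these assumptions the flattened vector has 7 × 7 × 64 = 3136 entries, matching the size quoted in the text.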
3.2.3. ResNet101
ResNet101, a 101-layer deep residual network, was employed to mitigate the vanishing gradient problem and performance degradation commonly observed in very deep neural networks. The key contribution of ResNet lies in its residual (skip) connections, which enable the network to learn identity mappings. These skip connections facilitate gradient propagation across layers, allowing effective training of significantly deeper architectures. In this study, the ResNet101 model was fine-tuned for lung nodule classification. Pretrained ImageNet weights were used as the baseline initialization, and the final fully connected classification layers were replaced to match the number of output classes required for the lung nodule classification task. The adapted ResNet101 architecture demonstrated strong capability in learning deep abstract representations of nodule features, which is particularly important for distinguishing visually similar classes such as benign and malignant nodules. The overall architecture of ResNet101 is illustrated in
Figure 3. The network begins with an input image that passes through an initial convolutional layer, followed by a max-pooling layer to reduce spatial dimensions while preserving dominant features. The core of ResNet101 consists of 33 stacked residual blocks (also referred to as residual units), forming a 101-layer deep architecture. Each residual block contains skip connections that add the input of the block directly to its output, thereby enabling stable deep feature learning and improved convergence during training.
Each residual block consists of a shortcut (identity) connection and three convolutional layers that are bypassed by the skip connection. This skip connection links the input and output of a block, allowing the network to learn residual mappings instead of direct mappings. This design mitigates the vanishing gradient problem and permits the training of deeper networks by maintaining gradient flow through the identity connection. Following the 33 residual blocks, the architecture proceeds to an average pooling layer, which averages each feature channel over its spatial dimensions and thereby reduces the feature maps to a feature vector. A fully connected (FC) layer then maps this vector to the output classes. In the final layer, the network provides class probability estimates for multiclass classification using a softmax activation function. ResNet101 is a robust choice for learning hierarchical, complex features through deep layers and residual learning. It is particularly well suited for medical imaging applications, such as classifying lung nodules, because it can extract and learn subtle patterns of texture and shape that are highly useful for determining whether a lesion is benign or malignant.
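The relationship between the 33 residual blocks and the 101-layer depth can be verified with a short calculation. The per-stage block counts (3, 4, 23, 3) are taken from the standard ResNet-101 definition rather than from this paper, and the function name is hypothetical.

```python
# Bottleneck blocks per stage in the standard ResNet-101 configuration
# (an assumption about the architecture used; the paper states only the total of 33).
BLOCKS_PER_STAGE = (3, 4, 23, 3)
CONVS_PER_BLOCK = 3  # three convolutional layers bypassed by each skip connection


def count_resnet_layers(blocks_per_stage=BLOCKS_PER_STAGE):
    """Return (number of residual blocks, number of weighted layers)."""
    n_blocks = sum(blocks_per_stage)
    # initial convolution + convolutions inside the blocks + final FC layer
    n_layers = 1 + n_blocks * CONVS_PER_BLOCK + 1
    return n_blocks, n_layers


n_blocks, n_layers = count_resnet_layers()
print(n_blocks, n_layers)
```

Pooling layers carry no learned weights and are therefore not counted toward the depth.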
3.2.4. Proposed Attention-Based CNN
The proposed Attention-Based CNN is designed not only to classify nodules but also to focus attention on the most important areas of CT images. The model utilizes a CNN framework integrated with both channel-wise and spatial attention mechanisms. The channel attention mechanism allows the network to rank the importance of each feature map relative to the others. The spatial attention mechanism identifies areas in lung images that may indicate the presence of nodular abnormalities. This dual attention mechanism permits the model to focus on discriminative nodule representations and suppress nonspecific background information that may confound prediction. The Attention-Based CNN performed well in distinguishing morphological and malignancy classes and also demonstrated strong performance when nodule boundaries were fused or intermingled.
Figure 4 illustrates the architecture of the proposed attention-based CNN algorithm. This hierarchical attention design strengthens the interpretability of the model by highlighting clinically relevant regions. Furthermore, it facilitates more reliable decision-making, especially in ambiguous diagnostic scenarios encountered in real-world clinical practice.
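To make the dual-attention idea concrete, the following NumPy sketch applies channel attention followed by spatial attention to a single feature map. It uses the reduction ratio of 16 and the 7 × 7 spatial kernel reported with the training parameters in Section 3.3.3. The remaining details are assumptions made for illustration and may differ from the authors' implementation: the channel gate follows the squeeze-and-excitation pattern, the spatial gate pools channel-wise mean and maximum maps (CBAM-style), the two gates are applied sequentially, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style gating for a (C, H, W) feature map."""
    squeezed = x.mean(axis=(1, 2))             # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck with ReLU -> (C // r,)
    gate = sigmoid(w2 @ hidden)                # per-channel weight in (0, 1)
    return x * gate[:, None, None]


def spatial_attention(x, kernel):
    """Spatial gating; kernel has shape (2, k, k) for the mean and max maps."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    height, width = x.shape[1:]
    logits = np.zeros((height, width))
    for i in range(height):                    # explicit k x k convolution
        for j in range(width):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(logits)[None, :, :]     # per-pixel weight in (0, 1)


def dual_attention(x, w1, w2, kernel):
    """Channel attention followed by spatial attention (assumed ordering)."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)


rng = np.random.default_rng(0)
C, H, W, r = 64, 14, 14, 16                    # r = 16 as reported in the paper
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1    # random stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1  # 7 x 7 spatial kernel
out = dual_attention(x, w1, w2, kernel)
print(out.shape)
```

Because both gates lie in (0, 1), the module can only rescale activations; it emphasizes informative channels and locations by attenuating the rest.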
These classification models were chosen and strategically developed to address various challenges in lung nodule classification. The CNN provided a solid baseline, the CNN-SVM hybrid enhanced decision boundaries, ResNet101 enabled deep feature extraction, and the attention-based CNN provided accurate focus on important image regions. A comparison of these models enables a proper understanding of their strengths and weaknesses in clinical lung CT imaging classification. This comparative evaluation guided the selection of suitable models for specific diagnostic requirements. Moreover, it highlights the importance of combining deep learning with traditional classifiers and attention mechanisms to improve accuracy. The algorithm for the proposed attention-based CNN is presented below in Algorithm 1.
Algorithm 1: Evaluation Process of Proposed Attention-Based CNN Model
1: Load and preprocess the image dataset()
2: Perform data augmentation()
3: Split the dataset into training, validation, and test sets()
4: Initialize the Attention-Based CNN model()
5: for each epoch in Number of Epochs do
6:     for each batch in the training set do
7:         Extract features using convolutional layers()
8:         Apply attention mechanism to enhance important features()
9:         ŷ ← model(features)
10:         loss ← crossentropy(y, ŷ)
11:         Optimize model parameters using loss()
12:     end for
13:     Evaluate on the validation set and compute accuracy()
14:     Update learning rate (if using a scheduler)
15: end for
16: Test the model on the test set
17: Compute accuracy(), precision(), recall(), and F1-score()
18: Generate and display confusion matrix(y, ŷ)
19: return Trained Attention-Based CNN Model and Evaluation Metrics
The layers used in the proposed attention-based CNN algorithm are listed in
Table 2.
3.3. Methodology
The proposed system is intended to enable the development of a reliable and effective computer-aided diagnosis (CAD) system for classifying lung nodules obtained from CT scans. Unlike manual assessments performed by radiologists, which are subjective and time-consuming, the proposed system uses deep learning approaches to automatically classify nodules and assist clinicians with the preliminary and accurate diagnosis of lung cancer. The goal of this system is to determine the malignancy status of pulmonary nodules by classifying whether a nodule is benign or malignant. A block diagram illustrating the proposed system is shown in
Figure 5.
The methodology employed in this study consists of preprocessing methods designed to standardize all input data and improve the quality of CT images. Features from the CT images were extracted using deep convolutional layers capable of learning complex nodular patterns. An attention mechanism was included in the methodology to highlight relevant regions and improve the interpretability and diagnostic accuracy. The classification layer provides transparency in the model’s decision-making process, as it outputs the predicted label along with a confidence score. The system was trained using annotated CT datasets to ensure applicability across a range of patients. The common evaluation metrics used in the validation phase include accuracy, precision, recall, and F1-score.
This automated pipeline reduces the workload of radiologists and minimizes human error in identifying lung cancer in its early stages. Ultimately, the system improves clinical workflow efficiency while enhancing patient outcomes through timely diagnosis. In addition, the modular structure of the system allows for future integration with hospital information systems (HIS) to enable seamless data exchange and reporting without manual intervention. The attention-based visualization component not only aids in classification but also provides intuitive heatmaps for medically trained professionals, offering support as a second opinion. In the end, the system’s scalability enables its use across a wide range of clinical settings, from urban hospitals to rural healthcare facilities with limited access to specialists. The adoption of such AI-driven tools ensures consistency in diagnostic standards and supports continuous medical education by providing explainable results.
3.3.1. Input Acquisition
The procedure begins by gathering lung CT scans from two publicly accessible datasets: LIDC-IDRI [14] and IQ-OTH/NCCD [15]. The datasets comprise annotated CT images with nodule types and malignancy labels confirmed by senior radiologists. They represent various nodule appearances and patient demographics, enabling the models to learn from diverse clinical scenarios. The distribution of the LIDC-IDRI dataset is presented in
Table 3.
The distribution of the IQ-OTH/NCCD dataset is presented in
Table 3. For the LIDC-IDRI dataset, a total of 1358 benign and 573 malignant CT images were included in this study after applying the defined inclusion and exclusion criteria. The dataset was divided into training and validation subsets while maintaining a balanced class distribution as shown in
Table 4. To ensure robust and unbiased performance evaluation, patient-level stratified five-fold cross-validation was employed. In each fold, images from the same patient were confined to a single subset to prevent data leakage. Performance metrics were calculated for each fold and reported as the mean ± standard deviation across the five folds.
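The patient-level stratified splitting described above can be sketched in plain Python as follows. The sketch assumes each patient carries a single label, and the patient identifiers, function names, and seed are hypothetical; it illustrates the constraint that all slices of a patient stay in one fold and is not the authors' code.

```python
import random
from collections import defaultdict


def patient_stratified_folds(patient_labels, n_folds=5, seed=42):
    """Assign patients to folds so each fold keeps a similar class mix.

    patient_labels maps a patient id to one label (assumed one label per patient).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_label[label].append(pid)
    folds = [set() for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_label):
        pids = by_label[label]
        rng.shuffle(pids)
        for i, pid in enumerate(pids):          # deal patients out round-robin
            folds[(offset + i) % n_folds].add(pid)
        offset += len(pids)
    return folds


def split_slices(slices, folds, test_fold):
    """Split (patient_id, slice_id) pairs by patient, never by slice."""
    held_out = folds[test_fold]
    train = [s for s in slices if s[0] not in held_out]
    test = [s for s in slices if s[0] in held_out]
    return train, test


# Hypothetical cohort: 20 benign and 10 malignant patients, 3 slices each.
patients = {f"B{i:02d}": "benign" for i in range(20)}
patients.update({f"M{i:02d}": "malignant" for i in range(10)})
slices = [(pid, k) for pid in patients for k in range(3)]

folds = patient_stratified_folds(patients)
train, test = split_slices(slices, folds, test_fold=0)
print(len(train), len(test))
```

Splitting at the patient level before expanding to slices is what prevents near-identical neighboring slices from appearing on both sides of a split.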
3.3.2. Preprocessing Module
Preprocessing is a critical stage that prepares raw CT images for input into deep learning models. It involves:
Lung Region Segmentation: Morphological techniques were employed to separate the lung region from adjacent thoracic structures.
Resizing: To ensure uniformity in model inputs, all images were resized to 224 × 224 pixels.
Normalization: To enhance training speed and maintain stability, pixel values were scaled to the range of 0 to 1.
Data Augmentation: Techniques such as rotation, horizontal and vertical flipping, shifting, and zooming were utilized to expand the training dataset, reduce overfitting, and improve generalization capability.
Following resizing and intensity normalization, morphological preprocessing was performed to enhance lung region segmentation. Binary thresholding was first applied for initial lung separation. Morphological opening was used to eliminate small noise components, while morphological closing filled minor discontinuities within the segmented regions. Subsequently, connected-component analysis retained the largest anatomically relevant lung components and removed non-lung structures. This preprocessing pipeline ensured cleaner region extraction and improved the robustness of subsequent feature learning.
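The morphological steps above (thresholding, opening, closing, and connected-component filtering) can be sketched with SciPy as follows. The threshold of 0.3 on normalized intensities, the 3 × 3 structuring element, and keeping the two largest components are illustrative assumptions; the paper does not report these values. Real CT slices also contain dark air outside the body, which this toy example does not handle.

```python
import numpy as np
from scipy import ndimage


def segment_lungs(image, threshold=0.3, keep=2):
    """Return a boolean lung mask for one normalized slice (values in [0, 1])."""
    structure = np.ones((3, 3), dtype=bool)
    binary = image < threshold                                    # dark, air-filled pixels
    opened = ndimage.binary_opening(binary, structure=structure)  # remove small specks
    closed = ndimage.binary_closing(opened, structure=structure)  # fill small gaps
    labels, n_components = ndimage.label(closed)
    if n_components == 0:
        return np.zeros(image.shape, dtype=bool)
    sizes = np.bincount(labels.ravel())[1:]                       # skip background label 0
    largest = np.argsort(sizes)[::-1][:keep] + 1                  # labels of largest regions
    return np.isin(labels, largest)


# Synthetic slice: bright tissue, two dark "lungs", one noise speck, one small hole.
slice_ = np.full((64, 64), 0.6)
slice_[20:40, 10:25] = 0.1      # left lung, 20 x 15 pixels
slice_[20:40, 38:53] = 0.1      # right lung, 20 x 15 pixels
slice_[5, 5] = 0.1              # isolated dark pixel (noise)
slice_[30, 15] = 0.6            # bright pixel inside the left lung

mask = segment_lungs(slice_)
print(mask.sum())
```

In this toy case the opening removes the isolated speck and the closing fills the single-pixel hole, leaving only the two rectangular regions.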
3.3.3. Deep Learning Models for Classification
This CAD system incorporates four different classification models, each chosen for its unique strengths in feature representation and learning behavior.
Convolutional Neural Network (CNN)
The foundational CNN model employs convolutional, pooling, and fully connected layers to identify spatial hierarchies in CT images. This network autonomously learns features ranging from basic to advanced, such as edges, gradients, and intricate textures and shapes that facilitate nodule classification. Despite their effectiveness, CNNs are often limited in their ability to extract fine-grained differences between visually similar nodules.
CNN-SVM Hybrid Model
To enhance classification accuracy, a hybrid CNN–SVM model was proposed. In this approach, features are first extracted using a CNN and then passed to a Support Vector Machine (SVM) for classification. The SVM is effective in separating nonlinear data and is widely recognized for its strong generalization capabilities, particularly when training samples are limited. This model represents a two-tier approach that combines deep feature encoding with robust margin-based classification.
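The nonlinear separation performed by the SVM rests on the RBF kernel, K(x, x′) = exp(−γ‖x − x′‖²). A minimal NumPy sketch is shown below, using γ = 1 × 10⁻³ as reported with the training parameters; the input vectors are toy values and the function name is hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1e-3):
    """Pairwise RBF kernel between the rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


features = np.array([[0.0, 0.0], [3.0, 4.0]])   # toy stand-ins for CNN embeddings
K = rbf_kernel(features, features)
print(K)
```

With a small γ the kernel decays slowly with distance, giving a smoother decision boundary.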
ResNet101 (Residual Network)
ResNet101 is a 101-layer residual network designed to address the vanishing gradient problem in very deep architectures. It utilizes residual (skip) connections that enable the network to learn identity mappings, allowing deep models to be trained without degradation in accuracy. ResNet101 can extract deep semantic features and effectively identify subtle distinctions between malignant and benign nodules, as well as among different morphological types. To facilitate transfer learning and accelerate convergence, the model was first pretrained on ImageNet and then fine-tuned on the lung CT dataset.
Attention-Based CNN (Proposed Model)
The core innovation of this study is the integration of an attention mechanism into the CNN architecture. The attention-based model includes the following:
Channel attention: Assigns different weights to feature maps to highlight the most informative features across channels.
Spatial attention: Suppresses extraneous background information to concentrate on specific regions of the image where nodule characteristics are more prominent.
This model employs a dual-attention mechanism that enables it to focus on essential regions of the image, similar to how radiologists concentrate on areas of concern. Consequently, the model demonstrated higher classification accuracy and improved interpretability compared to the other models. The algorithms were trained using the following parameters:
CNN: 224 × 224 input; Adam optimizer (lr = 1 × 10⁻⁴, weight decay = 1 × 10⁻⁵), ReduceLROnPlateau (factor = 0.5, patience = 5), batch size = 32, epochs = 100 with early stopping (patience = 10), Binary Cross-Entropy with class weighting; augmentations included random rotation (±15°), horizontal/vertical flips, random zoom (0.9–1.1), and brightness/contrast jitter (±10%).
CNN–SVM: CNN used as a feature extractor (Global Average Pooling → 2048-dimensional embedding), followed by RBF-SVM (C = 10, γ = 1 × 10⁻³) with Platt scaling to obtain calibrated probabilities.
ResNet101: Initialized with ImageNet weights; fine-tuned on the last two stages and the classifier head; Adam optimizer (lr = 1 × 10⁻⁴), batch size = 16; same augmentation strategy as above.
Attention-Based CNN (proposed): Dual attention modules inserted after mid- and high-level blocks (channel attention via squeeze-and-excitation with a reduction ratio of 16; spatial attention using a 7 × 7 convolution); dropout = 0.5; Adam optimizer (lr = 1 × 10⁻⁴); other training settings remained the same as above. Patient-level stratified 5-fold cross-validation (without slice leakage) was used, and fixed random seeds, framework versions, and hardware specifications were documented to support reproducibility.
The proposed CNN architecture consists of convolutional blocks with 3 × 3 kernels, batch normalization, and ReLU activation, followed by max-pooling layers for spatial downsampling. Dual-attention modules are inserted after the mid-level and high-level convolutional blocks to enhance feature discrimination. The channel-attention mechanism uses a reduction ratio of r = 16. Model training was performed using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵. The batch size was set to 16 or 32, depending on model complexity. Training was conducted for up to 100 epochs with early stopping (patience = 10). A ReduceLROnPlateau scheduler was applied to dynamically adjust the learning rate. A strict patient-level stratified five-fold cross-validation strategy was employed. All slices from a single patient were assigned to the same fold to prevent data leakage. No slice-wise splitting was used.
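The interaction between the plateau-based learning-rate schedule (factor = 0.5, patience = 5) and early stopping (patience = 10) can be illustrated with a framework-free sketch. The validation losses are made up, and the exact counting conventions (the learning rate is halved once the number of non-improving epochs exceeds the patience; training stops once it reaches the stopping patience) are assumptions that may differ slightly from a specific framework.

```python
def run_schedule(val_losses, lr=1e-4, factor=0.5, lr_patience=5,
                 stop_patience=10, max_epochs=100):
    """Return [(epoch, learning_rate)] for each epoch that is actually run."""
    best = float("inf")
    since_best = 0       # non-improving epochs since the best loss (early stopping)
    since_lr_event = 0   # non-improving epochs since last improvement or LR cut
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best, since_lr_event = loss, 0, 0
        else:
            since_best += 1
            since_lr_event += 1
            if since_lr_event > lr_patience:   # plateau detected: reduce LR
                lr *= factor
                since_lr_event = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:        # early stopping
            break
    return history


# Hypothetical validation losses: three improving epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.7] * 20
history = run_schedule(losses)
print(history[-1])
```

In this toy run the learning rate is halved once during the plateau, and training stops ten epochs after the last improvement.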
3.3.4. Performance Evaluation
We evaluated the lung nodule classification framework using standard evaluation measures that indicate how accurate, robust, and generalizable the model is across datasets and classes. These measures included k-fold cross-validation, accuracy, precision, recall, F1-score, and the confusion matrix.
Accuracy: Accuracy indicates how correct the model was overall, measured as the number of correctly predicted cases divided by the total number of cases. Although it gives an overall sense of performance, it can be misleading on imbalanced datasets, so additional measures are needed to assess how well the model actually performs.
Precision: Precision is the ratio of true positive predictions to the total number of positive predictions. It reflects the model’s ability to accurately identify true positive instances. A high precision indicates that the model produces few false positives, which is particularly important in clinical settings to avoid unnecessary treatment for patients who do not require it.
Recall: Recall or sensitivity reflects the proportion of actual positive cases that were correctly identified by the model. This is an assessment of the model’s ability to identify all relevant events. In the classification of lung nodules, recall should be high to avoid missing malignant nodules and minimize the number of false negatives.
These evaluation metrics offer a reliable and multifaceted assessment of the deep learning models used in the proposed system, ensuring both effectiveness and clinical relevance in lung nodule classification.
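For reference, the metrics described above correspond to the following standard definitions, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. The F1-score, which is reported in the results but not defined above, is the harmonic mean of precision and recall.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```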
4. Results and Discussion
This section presents the experimental results demonstrating the capability of multiple deep learning models to classify lung nodules. The objective of this study was to determine whether lung nodules were benign or malignant based on their characteristics. The CNN, CNN–SVM, ResNet101, and the proposed Attention-Based CNN were trained using the publicly available IQ-OTH/NCCD and LIDC-IDRI datasets. Model performance was evaluated using metrics such as confusion matrices, F1-score, recall, precision, and accuracy. A comparative analysis was conducted, which showed that the proposed method outperformed the other models.
4.1. Lung Nodule Classification to Malignant and Benign Using Deep Learning Algorithms
Accurate identification of benign versus malignant lung nodules is crucial for the early diagnosis and successful treatment of lung cancer. Radiologists face a challenging and often subjective task when manually assessing these nodules, as their appearances on CT scans are not markedly distinct. To overcome these challenges, deep learning approaches offer reliable and consistent solutions. In this study, we employed CNN, CNN–SVM, ResNet101, and an attention-based CNN model and compared their performance in classifying lung nodules as benign or malignant using two benchmark datasets: IQ-OTH/NCCD and LIDC-IDRI. The models were trained on standardized CT images with fixed input dimensions.
4.2. Experimental Results for the IQ-OTH/NCCD Lung Cancer Dataset
The primary objective of the experiments on the IQ-OTH/NCCD lung cancer dataset was to distinguish between benign and malignant lung nodules. The preprocessed images were used to train and test the CNN, CNN–SVM, ResNet101, and the proposed Attention-Based CNN model. Model performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrix analysis. The performance results of the CNN, CNN–SVM, ResNet101, and Attention-Based CNN models are presented in
Figure 6,
Figure 7,
Figure 8 and
Figure 9, respectively.
Figure 6 depicts the performance of the baseline CNN model, where the training accuracy curve indicates consistent convergence; however, the confusion matrix reveals some misclassifications between the benign and malignant categories.
Figure 7 presents the performance of the hybrid CNN–SVM model, which outperforms both the CNN and ResNet101 models.
Figure 8 demonstrates the improved performance of the ResNet101 model, which, owing to its deeper architecture and residual connections, achieves superior feature representation and fewer misclassifications.
Figure 9 illustrates the exceptional performance of the proposed Attention-Based CNN, which attained the highest classification accuracy and the fastest convergence. The evaluation metrics included precision, recall, F1-score, and accuracy.
Table 5 presents a comparative performance analysis in terms of the confusion matrices of the CNN, ResNet101, and the proposed Attention-Based CNN models.
In multiclass classification scenarios, particularly when distinguishing among three categories (Class 0, Class 1, and Class 2), the confusion matrix is often represented using specific symbols to denote correct and incorrect predictions. TP0 refers to the number of instances that truly belong to Class 0 and are correctly predicted as Class 0; these are the true positives for Class 0. Similarly, TP1 and TP2 represent the true positives for Classes 1 and 2, respectively. Misclassifications are denoted in the format Exy, where an instance belonging to Class x is incorrectly classified as Class y. For example, E10 indicates that an instance of Class 1 has been misclassified as Class 0, whereas E20 indicates that a Class 2 instance was incorrectly predicted as Class 0. In contrast, E01 indicates that Class 0 was incorrectly classified as Class 1, E21 signifies that Class 2 was wrongly labeled as Class 1, E02 denotes Class 0 predicted as Class 2, and E12 refers to Class 1 being misclassified as Class 2.
For the CNN–SVM model, the confusion matrix values were TP0 = 22, TP1 = 111, and TP2 = 80, with misclassifications distributed as E02 = 2, E12 = 1, E20 = 2, and E21 = 1, whereas E01 and E10 were both zero, indicating that no such errors occurred.
When taken as a whole, these labels make it easier to quantify how well a classification model performs by emphasizing both its correct predictions and specific types of errors.
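As a worked example of this notation, the sketch below arranges the CNN–SVM values quoted above into a 3 × 3 matrix (rows are true classes, columns are predicted classes, so Exy sits in row x, column y) and derives accuracy and per-class precision, recall, and F1-score. The function name is hypothetical.

```python
# Rows: true class; columns: predicted class (CNN-SVM values quoted in the text).
CM = [
    [22, 0, 2],    # TP0, E01, E02
    [0, 111, 1],   # E10, TP1, E12
    [2, 1, 80],    # E20, E21, TP2
]


def per_class_metrics(cm):
    """Return overall accuracy and a (precision, recall, F1) tuple per class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    rows = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1))
    return accuracy, rows


accuracy, rows = per_class_metrics(CM)
print(round(accuracy, 4))
for k, (p, r, f) in enumerate(rows):
    print(k, round(p, 4), round(r, 4), round(f, 4))
```

The matrix contains 219 instances, of which 213 lie on the diagonal, giving an accuracy of about 97.3%.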
Figure 6 shows the CNN model’s training accuracy and loss curves, which exhibit consistent convergence. However, the confusion matrix reveals some incorrect classifications between benign and malignant nodules, illustrating the difficulty in distinguishing similar appearances.
Figure 7 shows the performance of the CNN–SVM model, demonstrating improved accuracy and precision compared to the baseline CNN.
Figure 8 presents the performance of the ResNet101 model, which shows smoother loss curves and higher accuracy. The confusion matrix also indicates fewer misclassifications than the baseline CNN. Finally,
Figure 9 presents the results of the proposed Attention-Based CNN model, which achieved the highest training accuracy, the fastest loss convergence, and the most accurate confusion matrix with fewer misclassifications. These results collectively emphasize that adding an attention mechanism improves feature focus, resulting in more precise and reliable lung nodule classification compared with the standard CNN and ResNet101 models.
Table 6 compares the precision, recall, F1-score, and accuracy of the proposed Attention-Based CNN model with those of CNN, CNN–SVM, and ResNet101 for benign and malignant lung nodule classification using the IQ-OTH/NCCD database. The CNN model achieved 95% on all four metrics, reflecting strong baseline performance. ResNet101 delivered slightly higher precision (96%) but the same 95% recall, F1-score, and accuracy, indicating that the deeper architecture yielded only marginal improvements. The CNN–SVM hybrid model further enhanced classification results, achieving 97% across all evaluation metrics, demonstrating the effectiveness of integrating SVM as a classifier to improve decision boundaries. The Attention-Based CNN achieved 98% precision, recall, F1-score, and accuracy, outperforming all other models. These results demonstrate that incorporating attention mechanisms enhances the model’s ability to focus on distinguishing lung nodule characteristics, thereby improving classification robustness and accuracy. Overall, the findings in
Table 6 confirm that the proposed Attention-Based CNN is the most effective and efficient method for classifying lung nodules on the IQ-OTH/NCCD dataset.
4.3. Experimental Results for Classification of Lung Nodules into Malignant and Benign (On LIDC Dataset)
The performance of lung nodule classification into malignant and benign categories was also evaluated using the clinically diverse and well-annotated LIDC-IDRI dataset. For effective model learning, the CNN, ResNet101, and the proposed Attention-Based CNN were trained and validated on preprocessed and augmented images from the LIDC dataset. The experimental results for distinguishing between malignant and benign nodules are illustrated in
Figure 10,
Figure 11,
Figure 12 and
Figure 13.
The evaluation criteria included precision, recall, F1-score, and accuracy.
Table 7 illustrates the comparative analysis based on the confusion matrices of the CNN, ResNet101, and the newly proposed Attention-Based CNN model.
The CNN model achieved an accuracy of 98%, and the confusion matrix reflected similar performance with negligible misclassification. An accuracy of 98%, along with improved recall and precision values, was achieved using the ResNet101 model due to its deeper feature extraction and residual connections. The proposed Attention-Based CNN model performed exceptionally well, attaining 98% accuracy, precision, recall, and F1-score. The attention mechanism enables enhanced feature localization, reducing the likelihood of misclassifying benign and malignant nodules and allowing the model to focus on clinically significant regions. These results validate the effectiveness of the attention-based deep learning approach in improving diagnostic precision on the complex and large LIDC dataset.
Table 8 presents a comparison of the performance of the CNN, CNN-SVM, ResNet101, and the proposed Attention-Based CNN models in classifying lung nodules into benign and malignant categories using the LIDC dataset.
The CNN model achieved a precision, recall, F1-score, and accuracy of 95%, indicating strong baseline performance but with some limitations in correctly identifying all malignant and benign cases. ResNet101, although deeper in architecture, demonstrated comparatively lower performance, achieving 94% across all metrics, suggesting that its residual connections alone were insufficient to optimally capture the discriminative features in CT images of lung nodules. In contrast, the proposed Attention-Based CNN attained the highest precision, recall, F1-score, and accuracy of 98%, clearly outperforming both CNN and ResNet101.
These superior results highlight the effectiveness of incorporating both channel and spatial attention mechanisms, enabling the model to focus on the most informative regions in nodule images. This targeted feature refinement significantly improves classification outcomes. Accordingly,
Table 8 confirms that the proposed Attention-Based CNN is the most robust and best-performing model for malignancy classification on the LIDC dataset.
4.4. AUC-ROC Analysis
To provide a more comprehensive evaluation beyond conventional accuracy and F1-score, Receiver Operating Characteristic (ROC) analysis was conducted for each model.
Figure 14 and
Figure 15 illustrate the ROC curves for CNN, CNN–SVM, ResNet101, and the proposed Attention-Based CNN on the LIDC-IDRI and IQ-OTH/NCCD datasets, respectively. Based on the ROC comparisons, the proposed Attention-Based CNN consistently achieved the highest True Positive Rate (TPR) and one of the lowest False Positive Rates (FPR), indicating superior capability in distinguishing between benign and malignant nodules. On the LIDC-IDRI dataset, the proposed Attention-Based CNN attained the most favorable ROC point, followed closely by CNN–SVM, whereas ResNet101 demonstrated comparatively weaker performance. The trends observed for the IQ-OTH/NCCD dataset were similar to those of the LIDC-IDRI dataset, where both the proposed Attention-Based CNN and ResNet101 exhibited nearly optimal ROC positions, while the baseline CNN and CNN–SVM models were less optimal. These evaluations provide clear evidence that integrating an attention mechanism into the feature extraction process enhances the classification capability of CNN architectures. By enabling the proposed model to focus on the most informative regions of lung nodule images, diagnostic performance across both datasets became more reliable and consistent.
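The quantities underlying this analysis can be computed directly from predicted scores. The sketch below shows a single ROC operating point (TPR and FPR at a chosen threshold) and the AUC, computed as the probability that a randomly chosen malignant case scores higher than a randomly chosen benign case. The labels and scores are toy values, not data from this study, and the function names are hypothetical.

```python
def roc_point(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg


def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]            # 1 = malignant, 0 = benign (toy data)
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
print(roc_point(labels, scores, threshold=0.5), roc_auc(labels, scores))
```

The pairwise formulation is quadratic in the number of cases, which is acceptable for illustration; sorting-based implementations are preferable for large test sets.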
4.5. Statistical Significance and Robustness Analysis
On the LIDC-IDRI and IQ-OTH/NCCD datasets, we employed 5-fold stratified cross-validation to ensure that the observed performance improvements were not due to random variation. The t-distribution was used to compute the 95% confidence intervals (CIs), and the metrics are reported as mean ± standard deviation (SD). Using fold-wise results, pairwise comparisons were conducted between the baseline models (CNN, CNN–SVM, and ResNet101) and the proposed Attention-Based CNN. The normality of differences was assessed using the Shapiro–Wilk test; if normality was satisfied, paired t-tests were applied; otherwise, Wilcoxon signed-rank tests were used. Multiple comparisons were addressed using the Holm–Bonferroni correction. McNemar’s test was employed to evaluate discordance in classification errors, and the DeLong method was used to compute AUC confidence intervals.
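The Holm–Bonferroni step-down correction mentioned above can be sketched as follows for the three pairwise comparisons (proposed model versus each baseline). The p-values are hypothetical placeholders, not results from this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return Holm-adjusted p-values and rejection decisions, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)         # enforce monotonicity
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Hypothetical raw p-values for: vs CNN, vs CNN-SVM, vs ResNet101.
raw = [0.01, 0.04, 0.03]
adjusted, reject = holm_bonferroni(raw)
print(adjusted, reject)
```

In this toy example only the first comparison remains significant after correction, which shows how the procedure guards against false positives when several baselines are compared.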
The findings (
Table 9) demonstrate that the Attention-Based CNN consistently outperformed all baseline models on both datasets in a statistically significant manner. On the LIDC-IDRI dataset, the proposed model achieved an accuracy of 97.7 ± 0.4% (95% CI: 97.3–98.1%), with precision, recall, and F1-score values all close to 0.98. This performance substantially exceeded that of CNN (95.0 ± 0.6%), CNN–SVM (96.0 ± 0.5%), and ResNet101 (94.5 ± 0.7%). Pairwise comparisons confirmed that these differences were statistically significant (all p < 0.01, Holm-corrected), with large effect sizes (Cohen’s d > 0.8).
On the IQ-OTH/NCCD dataset, the Attention-Based CNN again achieved the highest performance (98.0 ± 0.4%), compared with CNN (95.0 ± 0.6%), CNN–SVM (97.0 ± 0.5%), and ResNet101 (95.0 ± 0.6%). These improvements were consistent across folds and statistically significant (all p < 0.01). Furthermore, AUC analysis supported these findings: the proposed model achieved an AUC of 0.986 on LIDC-IDRI and 0.981 on IQ-OTH/NCCD, both significantly higher than the baseline models, as confirmed by the DeLong tests.
4.6. Ablation Study
To investigate the individual contributions of the channel and spatial attention modules, an ablation study was conducted on both datasets. Four configurations were evaluated:
Baseline CNN (without attention)
CNN + Channel Attention only
CNN + Spatial Attention only
CNN + Channel + Spatial Attention (proposed model)
As shown in
Table 10, the inclusion of either channel or spatial attention improves the performance of the baseline CNN. The best results were achieved when both mechanisms were integrated, thereby confirming the complementary benefits of spatial and channel feature refinement.
4.7. Real-Time Performance on Google Colab (T4 GPU)
We evaluated the runtime performance on Google Colab using a single NVIDIA T4 (16 GB) GPU, a batch size of 1, and 224 × 224 inputs per CT slice. With light preprocessing (windowing, resizing, normalization), end-to-end inference on the T4 measured approximately 4–5 ms per slice (≈200–250 slices/s), so a typical 300-slice CT scan was completed in roughly 1.3–1.6 s, including preprocessing. The peak GPU memory usage during inference was approximately 0.8 GB in FP16 for batch size 1.
In this Colab T4 setup, FP16 inference achieved essentially the same accuracy as FP32, with negligible differences in AUC and F1-score (ΔAUC ≤ 0.001, ΔF1 ≤ 0.001), while reducing latency by approximately 30–40%. Enabling INT8 post-training quantization further reduced latency by approximately 30–45% and memory usage by approximately 50–60%, with only a marginal decrease in accuracy (ΔAUC ≈ 0.002–0.005; ΔF1 ≈ 0.002–0.004).
These results, obtained in the Colab T4 environment, confirm that the proposed Attention-Based CNN can operate in near real time, with inference times well below 2 s per scan, while maintaining high performance (Accuracy ≈ 98%, F1-score ≈ 0.98, AUC = 0.986/0.981 on LIDC-IDRI and IQ-OTH/NCCD). This highlights the practicality of the proposed model for real-world deployment under constrained hardware conditions.
To complement the quantitative results and assess interpretability, we visualized class-discriminative attention using Grad-CAM on representative held-out cases (Methods). Grad-CAM heatmaps were computed from the last convolutional block and overlaid onto the corresponding CT slices to highlight the regions that most strongly influenced the predictions.
Figure 16 summarizes four outcomes—malignant true positive (TP), malignant false negative (FN), benign true negative (TN), and benign false positive (FP)—arranged in rows. The columns display the raw CT slice, the Grad-CAM heatmap, and the overlay.
Figure 16 illustrates how the model’s spatial focus aligns with clinical cues in correct predictions and how it degrades in error cases. In malignant TP cases, Grad-CAM localizes to the nodule core and irregular margins, consistent with the expected malignant appearance. In FN examples, the heatmaps are broader and of lower intensity or shifted away from the lesion, suggesting insufficient contrast or contextual information at the selected scale. For benign TN cases, attention remains low and localized, whereas FP cases highlight structures such as vessels or pleural interfaces that may resemble spiculation. These qualitative findings support the interpretability of the dual (channel + spatial) attention design and motivate targeted improvements, such as incorporating stronger multi-scale or 3D contextual information to reduce FNs and implementing vessel- or pleura-aware regularization to reduce FPs.
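For reference, the core Grad-CAM computation can be sketched in a few lines. The example assumes that the activations of the last convolutional block and the gradients of the class score with respect to them have already been extracted from the network; the function name `grad_cam` and the toy arrays are illustrative and are not taken from our implementation.

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: arrays of shape (K, H, W) for K feature maps.
    # 1) Channel weights: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))                  # shape (K,)
    # 2) Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)       # shape (H, W)
    # 3) ReLU keeps only regions that increase the class score.
    cam = np.maximum(cam, 0.0)
    # 4) Normalize to [0, 1] for display (skipped if the map is all zeros).
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example (synthetic values, not CT data): channel 0 fires at the
# center and supports the class; channel 1 fires at a corner and opposes it.
acts = np.zeros((2, 3, 3))
acts[0, 1, 1] = 2.0
acts[1, 0, 0] = 3.0
grads = np.stack([np.full((3, 3), 0.5), np.full((3, 3), -0.25)])

heatmap = grad_cam(acts, grads)
```

In practice the resulting low-resolution map is upsampled to the input size before being overlaid on the CT slice; that step is omitted here.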
4.8. Qualitative Analysis
The test results of the proposed system on the IQ-OTH/NCCD dataset are shown in Figure 17.
Although metrics such as accuracy, precision, recall, and F1-score provide a clear understanding of classification performance, it is essential to determine the statistical significance of the observed improvements. Most existing studies report only point estimates of model performance, which may not fully capture variability due to differences in training data splits or random initializations. To address this limitation, we extended our evaluation using k-fold cross-validation on both the IQ-OTH/NCCD and LIDC-IDRI datasets. The results are reported as mean ± standard deviation, reflecting consistency across multiple runs. Additionally, paired t-tests were conducted between the proposed Attention-Based CNN and the baseline models (CNN, CNN–SVM, and ResNet101). The results indicated that the improvements in accuracy and AUC were statistically significant (p < 0.05), confirming that the gains were unlikely to have occurred by chance. By incorporating both cross-validation statistics and hypothesis testing, the credibility of the findings was strengthened, and the robustness of the proposed approach was demonstrated across diverse data splits. This ensures that the improvements are not only numerically higher but also statistically meaningful, enhancing the reliability of the results for clinical translation.
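The reporting and testing procedure can be illustrated with a minimal sketch. The fold accuracies below are hypothetical placeholders chosen for illustration (they are not the values obtained in this study), and the use of five folds is likewise an assumption of the example. The paired t statistic is compared against the two-sided 5% critical value for four degrees of freedom (2.776).

```python
import math
import statistics

def paired_t_statistic(a, b):
    # Paired t-test statistic on per-fold differences a[i] - b[i].
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# HYPOTHETICAL per-fold accuracies (5 folds), for illustration only.
proposed = [0.975, 0.982, 0.979, 0.984, 0.980]
baseline = [0.948, 0.951, 0.955, 0.946, 0.950]

# Mean and standard deviation across folds, as reported in the tables.
proposed_mean = statistics.mean(proposed)
proposed_std = statistics.stdev(proposed)

t_stat, dof = paired_t_statistic(proposed, baseline)

# Two-sided critical value of Student's t at alpha = 0.05 for dof = 4.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4

print(f"proposed: {proposed_mean:.3f} +/- {proposed_std:.3f}")
print(f"t = {t_stat:.2f}, dof = {dof}, significant at 0.05: {significant}")
```

Because the same folds are used for both models, the test operates on per-fold differences rather than on the two sets of accuracies independently.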
The qualitative analysis presented in Figure 18 shows that the model accurately distinguished among benign, malignant, and normal cases in the dataset.
The model also distinguished effectively between benign and malignant nodules in the LIDC-IDRI dataset. These qualitative findings illustrate the robust classification capability of the proposed Attention-Based CNN across both datasets and corroborate the quantitative results presented above.
Table 11 presents a comparative analysis of the proposed system with existing approaches for lung nodule classification. Among previous studies, Hu He Xuan et al. [16] achieved a high accuracy of 94.61% using DenseNet integrated with a hybrid attention mechanism, while Wang, Xi, Chen, et al. [17] reported strong results using semi-supervised learning and feature aggregation on the TCGA and SUCC datasets. Jie Jiang et al. [18] attained 91.60% accuracy with their Multiple Resolution Residually Connected Networks (MRRN), whereas Sarfaraz Hussein et al. [19] achieved a comparatively lower accuracy of 78.06% using a 3D CNN with a sparse graph model. N. Mehnapriya et al. [20] and Mastouri et al. [23] demonstrated moderate performance, reporting an accuracy of 82.55% and a precision of 90.68%, respectively. Additionally, Ju et al. [21] and Song et al. [22] explored multimodal PET-CT approaches and reported strong Dice Similarity Coefficient values, highlighting the advantages of combining imaging modalities.
In contrast, the proposed Attention-Based CNN achieved higher accuracy than the methods listed in Table 11 and than all baseline models on both datasets. On the LIDC-IDRI dataset, it achieved an accuracy of 98%, surpassing the baseline CNN (95%), CNN–SVM (96%), and ResNet101 (94%). Similarly, on the IQ-OTH/NCCD dataset, the proposed model attained the highest accuracy of 98%, compared with CNN (95%), CNN–SVM (97%), and ResNet101 (95%). These results demonstrate the consistent advantage of the attention-enhanced model across multiple datasets and highlight the effectiveness of the dual (channel and spatial) attention mechanisms in localizing features and enhancing the representation of subtle lung nodule patterns. The performance gains suggest improved generalization and a reduced risk of overfitting compared with standard CNN-based architectures. Moreover, the incorporation of attention modules enhances interpretability, which is essential for clinical decision support. Overall, the proposed model advances the state of the art by combining high diagnostic performance with practical feasibility for deployment in medical imaging applications.
To further contextualize the proposed model within recent advancements in deep learning, we compared our Attention-Based CNN with lightweight Vision Transformer (ViT) architectures such as Swin Transformer Tiny and MedViT, as reported in the literature. Lightweight ViT models typically contain approximately 28–30 million parameters and involve higher computational complexity due to self-attention mechanisms, whose cost scales quadratically with the number of tokens when attention is computed globally. In contrast, the proposed Attention-Based CNN introduces only a marginal parameter increase (approximately 3–5%) over the ResNet101 backbone, so the attention modules themselves add little computational overhead.
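The quadratic scaling mentioned above can be made concrete with a short worked example. It assumes plain global self-attention with a patch size of 16, both chosen for illustration; windowed designs such as Swin Transformer restrict attention to local windows precisely to avoid this growth. The helper names are hypothetical.

```python
def num_tokens(image_size, patch_size=16):
    # A square image is split into non-overlapping square patches;
    # each patch becomes one token.
    return (image_size // patch_size) ** 2

def attention_matrix_entries(n_tokens):
    # Global self-attention compares every token with every other token,
    # so one attention map has n_tokens * n_tokens entries per head.
    return n_tokens * n_tokens

tokens_224 = num_tokens(224)   # 14 x 14 patches
tokens_448 = num_tokens(448)   # 28 x 28 patches

entries_224 = attention_matrix_entries(tokens_224)
entries_448 = attention_matrix_entries(tokens_448)

# Doubling the image side quadruples the tokens and multiplies the
# attention-map size by sixteen.
growth = entries_448 / entries_224
```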
Reported FLOPs for Swin Transformer Tiny are typically higher than those of conventional CNN backbones of comparable size at similar input resolutions, and transformer-based models generally demand greater GPU memory during inference. By comparison, our model achieved an end-to-end inference latency of approximately 4–5 ms per CT slice on an NVIDIA T4 GPU, with peak memory consumption below 1 GB (FP16), demonstrating its practicality for real-time clinical deployment.
Although transformer-based models have shown strong performance on large-scale natural image datasets, their effectiveness on limited medical imaging datasets often depends on extensive pretraining and careful regularization. The proposed dual-attention CNN maintains a strong inductive bias for local feature extraction while enhancing global contextual awareness through channel and spatial refinement. This design achieves competitive accuracy (98%) and AUC (0.986/0.981) while maintaining computational efficiency and robustness across datasets. These considerations indicate that the proposed method provides an effective trade-off between diagnostic performance, interpretability, computational cost, and deployment feasibility, making it particularly suitable for resource-constrained medical environments.
Despite the promising performance achieved by the proposed attention-based framework, several limitations should be acknowledged. First, the study was conducted on publicly available datasets with limited sample sizes and potential class imbalance, which may affect generalizability. Second, the LIDC-IDRI dataset is subject to inter-observer variability, as malignancy ratings are based on assessments from multiple radiologists, potentially introducing annotation inconsistencies. Third, the proposed model was evaluated without external or multi-institutional validation, which is essential for assessing robustness across diverse clinical settings. Therefore, although the framework demonstrates strong research performance, further large-scale, multi-center, and prospective clinical validation studies are necessary before real-world clinical deployment.
5. Conclusions
This research introduces a deep learning–based computer-aided diagnosis (CAD) system for the classification of benign and malignant lung nodules in CT images. Several deep learning models, including CNN, CNN–SVM, ResNet101, and the proposed Attention-Based CNN, were evaluated using the LIDC-IDRI and IQ-OTH/NCCD benchmark datasets. The comparative experiments established CNN and CNN–SVM as baselines, while ResNet101 offered deeper feature extraction but produced mixed results across the two datasets. The proposed Attention-Based CNN significantly outperformed all baseline models across all evaluation metrics, achieving a classification accuracy of 98%, an average F1-score close to 0.98, and an area under the receiver operating characteristic curve (AUC) of 0.986 (LIDC-IDRI) and 0.981 (IQ-OTH/NCCD). The study highlights the positive impact of channel and spatial attention mechanisms, which enable the identification of salient features, enhance discriminative learning, and improve overall classification performance. The experimental findings provide strong evidence of the proposed model’s effectiveness and robustness, as well as its potential for clinical application to improve early lung cancer detection and reduce overdiagnosis of lung nodules. The results also indicate that the model effectively identifies asymmetrical regions introduced by lung nodules that deviate from the symmetric structure of healthy lungs, which underscores the importance of capturing symmetry-breaking patterns for accurate lung nodule classification.
Although the system demonstrated high accuracy and robustness, several directions for future work remain. First, integrating additional explainable AI (XAI) modules would enhance interpretability by providing visual and textual justifications for model decisions, thereby fostering clinician trust. Second, expanding the dataset to include multi-institutional imaging data and incorporating additional imaging modalities such as PET/CT or MRI could strengthen the model’s generalization across diverse healthcare settings, clinical practices, and patient populations. Third, semi-supervised or unsupervised learning approaches could reduce reliance on costly manual annotations by leveraging large volumes of unlabeled data. Fourth, the model could be optimized for lightweight, real-time deployment on edge or mobile devices to enable rapid diagnostic support in resource-constrained environments. Finally, future studies should explore the predictive capabilities of the system, such as estimating nodule growth rates or the likelihood of malignant progression, to support comprehensive and proactive lung cancer management strategies. Overall, the proposed system demonstrates strong potential as a supportive decision-making tool; however, further validation on larger and multi-institutional datasets is required before clinical integration.