Systematic Review

Machine Learning and Deep Learning in Lung Cancer Diagnostics: A Systematic Review of Technical Breakthroughs, Clinical Barriers, and Ethical Imperatives

by Mobarak Abumohsen 1, Enrique Costa-Montenegro 1, Silvia García-Méndez 1, Amani Yousef Owda 2,* and Majdi Owda 3
1 Information Technologies Group, School of Telecommunication Engineering, University of Vigo, 36310 Vigo, Spain
2 Department of Natural, Engineering and Technology Sciences, Arab American University, Ramallah P600, Palestine
3 Faculty of Artificial Intelligence and Data Science, UNESCO Chair in Data Science for Sustainable Development, Arab American University, Ramallah P600, Palestine
* Author to whom correspondence should be addressed.
Submission received: 27 November 2025 / Revised: 27 December 2025 / Accepted: 8 January 2026 / Published: 11 January 2026

Abstract

The use of machine learning (ML) and deep learning (DL) in lung cancer detection and classification offers great promise for improving early diagnosis and reducing death rates. Despite major advances in research, there is still a significant gap between successful model development and clinical use. This review identifies the main obstacles preventing ML/DL tools from being adopted in real healthcare settings and offers practical recommendations for addressing them. Using PRISMA guidelines, we examined over 100 studies published between 2022 and 2024, focusing on technical accuracy, clinical relevance, and ethical aspects. Most of the reviewed studies rely on computed tomography (CT) imaging, reflecting its dominant role in current lung cancer screening workflows. While many models achieve high performance on public datasets (e.g., >95% sensitivity on LUNA16), they often perform poorly on real clinical data due to issues like domain shift and bias, especially toward underrepresented groups. Promising solutions include federated learning for data privacy, synthetic data to support rare subtypes, and explainable AI to build trust. We also present a checklist to guide the development of clinically applicable tools, emphasizing generalizability, transparency, and workflow integration. The study recommends early collaboration between developers, clinicians, and policymakers to ensure practical adoption. Ultimately, for ML/DL solutions to gain clinical acceptance, they must be designed with healthcare professionals from the beginning.

1. Introduction

Lung cancer (LC) is the leading cause of cancer-related deaths worldwide, and survival rates depend largely on early detection of the disease. Around 2.2 million new LC cases are identified annually, resulting in approximately 1.8 million deaths globally [1,2]. LC often presents with a variety of common signs and symptoms, including fatigue, loss of appetite, weight loss, chest pain, coughing, shortness of breath (dyspnea), and coughing up blood (hemoptysis) [3]. Additionally, several risk factors are linked to LC, such as smoking, air pollution, exposure to radon, second-hand smoke, and chemical exposure [3]. Figure 1 shows the common signs and symptoms as well as the risk factors of LC. While advances in imaging technologies such as computed tomography have improved diagnosis, the workload placed on radiologists, the expertise required to recognize the disease, and variability in decision-making between specialists can delay an accurate diagnosis. Machine learning (ML) and deep learning (DL) techniques can therefore assist in identifying and classifying LC more quickly, with less effort and high accuracy [4,5]. Yet despite achieving sensitivities above 95% on individual datasets, these systems have not been integrated into clinical practice. There is thus a great need to bridge the gap between technological innovation and real-world healthcare delivery.
Although existing reviews highlight algorithmic accuracy or dataset benchmarks (e.g., LUNA16), they tend to neglect systemic barriers to clinical adoption. This study instead comprehensively analyzes why state-of-the-art algorithms often fail to translate into clinical applications. For instance:
  • Technical barriers, such as dataset bias and domain shift when applying a model to a different dataset, degrade model generalization.
  • Clinical challenges include regulatory barriers (strict requirements for validating and approving ML/DL tools for clinical use) and radiologists’ distrust of black-box systems, which lack transparency because clinicians cannot examine how decisions were made.
  • Ethical concerns, such as demographic bias, which occurs when an ML/DL model is trained on data skewed toward particular demographic groups (e.g., by race, gender, or age).
This survey assesses emerging solutions such as human-in-the-loop frameworks, federated learning to enhance data diversity, and artificial data augmentation techniques. It also provides guidelines to bridge the research–clinic gap, prioritizing high-performance models that are also interpretable.
This systematic review addresses these gaps by analyzing the failure of ML and DL models for LC detection and classification to translate into clinical practice. It proposes actionable suggestions to bridge this gap. Unlike previous studies, this research:
  • Critically analyzes the contributions and classifies the challenges.
  • Evaluates emerging solutions, including federated learning for data diversity, synthetic data generation for rare subcategories, and interpretable ML algorithms to build clinician trust.
  • Redefines evaluation metrics by prioritizing clinical benefit (reducing time to diagnosis) over technical criteria.
Five core research questions steer this survey:
  • RQ1: What technical, clinical, and societal barriers prevent high-performing ML/DL models from being deployed in LC screening workflows?
  • RQ2: How might new approaches, such as human-in-the-loop systems, lightweight architectures, and synthetic data, handle these challenges?
  • RQ3: Which evaluation structures are required to ensure ML/DL tools are equitable, repeatable, and clinically impactful?
  • RQ4: How does the heavy reliance on public datasets (e.g., LIDC-IDRI, LUNA16) in LC ML and DL research contribute to dataset imbalance, and what obstacles does this create for model generalizability and clinical applicability in real-world scenarios?
  • RQ5: Which public datasets have been most frequently used in previous studies for LC detection and classification using ML and DL techniques?

1.1. Methodology

This review was conducted in accordance with the PRISMA 2020 guidelines for reporting systematic reviews. To ensure transparent reporting of the search, selection, and synthesis procedures, we also adhered to the PRISMA 2020 checklist (Supplementary Materials) where applicable. The review protocol is registered with the Open Science Framework (OSF) Registration DOI https://doi.org/10.17605/OSF.IO/A5BHW.

1.1.1. Research Design and Protocols

This systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) guidelines [6,7,8]. The aim was to evaluate the current trends, challenges, and clinical importance of ML and DL methodologies for LC detection and classification. A PRISMA flow diagram (Figure 2) illustrates the selection process.

1.1.2. Search Strategy

A comprehensive literature review was conducted on four primary databases: Google Scholar, MDPI, ScienceDirect, and IEEE Xplore. The literature review included research works that were published in the English language between January 2022 and December 2024.
The following Boolean combinations and keywords were applied to enhance search specificity:
  • Keywords: “Lung cancer detection”, “Lung cancer classification”, “Machine learning”, “Deep learning”, “Medical image processing”, “Clinical validation”.
  • Boolean Operators: (“Lung cancer detection” OR “Lung tumor classification” OR “Lung cancer segmentation”) AND (“Deep learning” OR “Machine learning”).
This strategy initially retrieved 452 articles across all databases.
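Where database APIs accept Boolean query strings, the strategy above can also be scripted. A minimal Python sketch (the `build_query` helper is illustrative and was not part of the review's workflow; the keyword groups are those listed above):

```python
def build_query(or_groups):
    """OR keywords within each group, then AND the groups together."""
    clauses = ["(" + " OR ".join(f'"{kw}"' for kw in group) + ")"
               for group in or_groups]
    return " AND ".join(clauses)

task_terms = ["Lung cancer detection", "Lung tumor classification",
              "Lung cancer segmentation"]
method_terms = ["Deep learning", "Machine learning"]

# Reproduces the Boolean combination given in the search strategy
query = build_query([task_terms, method_terms])
print(query)
```

Running this prints the AND/OR combination shown above, which could then be pasted into a database's advanced-search field.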

1.1.3. Study Selection Method

After excluding 252 papers that did not meet the minimum inclusion criteria (published before 2022, written in a language other than English, or not pertinent to ML/DL-based LC detection), the remaining studies underwent two-stage screening. In the first stage, titles and abstracts were screened for relevance. In the second stage, full texts of shortlisted studies were reviewed for eligibility against predefined criteria. Studies outside the scope of clinical or near-clinical applications of ML/DL in LC detection and classification were excluded at this stage.

1.1.4. Inclusion and Exclusion Criteria

The final selection of 65 studies was made based on the following criteria:
  • Inclusion Criteria:
    • Studies that developed or experimented with ML/DL-based models for the classification or detection of LC.
    • Studies with near-clinical or clinical model validation.
    • Studies that had reported performance in terms of clinically relevant measures (e.g., subtype generalization, sensitivity/specificity).
    • Studies published in 2022–2024 in English.
In this systematic review, quasi-clinical or clinical validation refers to previous studies that evaluated ML/DL algorithms on real patient data under conditions that mimic clinical practice. Examples include validation of models on external or multicenter datasets and testing on retrospective clinical data. Evaluation is based on clinically relevant outcomes (e.g., diagnostic accuracy for lung cancer detection/diagnosis, support for staging, and clinically relevant classification/segmentation) or the reporting of clinically interpretable performance metrics (e.g., sensitivity and specificity). Studies lacking a clinical context, such as those limited to purely technical experiments, simulations, non-clinical surrogate tasks, or purely normative assessments with no patient-level relevance, were excluded, because our goal is to analyze barriers to realistic clinical translation, which cannot be adequately assessed without clinically grounded validation.
  • Exclusion Criteria:
    • Non-English language articles.
    • Studies published before 2022.
    • Studies that were not ML/DL-based method-focused.
    • Articles without clinical context or validation.
    • Duplicate or redundant studies.
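For illustration only, the inclusion/exclusion rules above can be mirrored as a programmatic filter over bibliographic records. The record fields below (`year`, `language`, `ml_dl_based`, `clinical_validation`) are hypothetical; the actual screening was performed manually in two stages, as described in Section 1.1.3:

```python
def passes_screening(study):
    """Apply the review's inclusion/exclusion criteria to one record."""
    return (study["language"] == "English"
            and 2022 <= study["year"] <= 2024
            and study["ml_dl_based"]
            and study["clinical_validation"])

studies = [
    {"id": 1, "year": 2023, "language": "English",
     "ml_dl_based": True, "clinical_validation": True},
    {"id": 2, "year": 2021, "language": "English",   # published before 2022
     "ml_dl_based": True, "clinical_validation": True},
    {"id": 3, "year": 2024, "language": "English",   # no clinical context
     "ml_dl_based": True, "clinical_validation": False},
]

included = [s["id"] for s in studies if passes_screening(s)]
print(included)  # [1]
```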

1.1.5. Data Extraction

To ensure consistency and transparency in the review process, data from each included study were systematically extracted using a predefined data-charting structure designed according to the objectives of this review. The data-charting form was developed to align with the three core analytical dimensions of the study: (i) technical aspects of ML/DL models, (ii) clinical relevance and validation, and (iii) ethical and practical considerations for real-world implementation. For every eligible study, the following information was carefully extracted:
  • Bibliographic details: including the authors, publication year, and country of origin.
  • Dataset characteristics: dataset name, modality (e.g., CT, PET-CT, histopathology, X-ray), number of patients or images, dataset type (public/private), data balance, class distribution, strengths, and known limitations. These elements were further summarized and presented in tabulated form in Section 2 (Technical Section).
  • Study objective and task type: such as lung cancer detection, binary or multiclass classification, or segmentation.
  • Technical methodology: including the implemented ML or DL model architecture, any transfer learning or hybrid strategies, preprocessing techniques, feature engineering (if applicable), and optimization/training procedures.
  • Performance evaluation: based on commonly reported metrics such as accuracy, AUC, sensitivity, specificity, precision, recall, and F1-score.
  • Clinical validation details: type of validation strategy (internal/external), dataset representativeness in clinical settings, use of real patient data, radiologist involvement, or comparison with current diagnostic practice.
Data extraction was performed collaboratively by the review team. To minimize potential extraction bias, the extracted information was thoroughly cross-checked against different sections of each publication (e.g., methodology, results). No automated software was used during this process. The extracted data were then organized according to dataset type, classification task, ML/DL architecture, and clinical applicability, serving as the basis for the comparative analysis presented in later sections.
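As a point of reference for the performance metrics listed above, sensitivity, specificity, precision, and F1-score all derive from the same confusion-matrix counts. A minimal pure-Python sketch with toy labels (not data from any reviewed study):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def clinical_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # recall: malignant cases caught
    specificity = tn / (tn + fp)   # benign cases correctly cleared
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# 1 = malignant, 0 = benign (illustrative labels only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
m = clinical_metrics(y_true, y_pred)
print(m["sensitivity"], m["specificity"])  # 0.75 0.75
```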
This study is organized around the research questions; each section addresses specific research questions and concludes with the final answers provided in Section Answers to Research Questions. This review has the following organization: Section 2 describes recent advances in ML/DL LC detection and classification methods. Section 3 discusses the challenges of clinical validation for the introduction of these models in real-world applications. Section 4 discusses ethical and societal aspects, such as bias in algorithms and data governance. Section 5 provides pragmatic recommendations for constructing clinically relevant ML/DL systems and a synthesis of the most important findings and a proposed translational agenda. Finally, Section 6 concludes the review and presents future directions for advancing the field toward safe and impactful clinical translation.

2. Technical Advances in LC Detection

This section describes the key ML/DL models applied to LC detection and classification, highlighting their technical mechanisms, performance, and suitability for clinical workflows. Figure 3 shows a structured overview of the advances in LC detection and classification with ML and DL techniques. This figure categorizes methods into two broad branches: classical ML and DL models. On the left side, classical ML models are depicted as being based primarily on manually crafted features from medical images and applying classical classification algorithms. A follow-up route consists of hybrid ML–DL systems with DL models employed for feature learning and ML models for classification, and depicts a trend toward combining the two frameworks to enhance diagnostic performance. On the right side, DL models are subdivided according to dataset type. The section highlights DL techniques’ predominance in current work owing to the potential for automated feature learning and end-to-end fine-tuning. Both routes lead ultimately to a list of common evaluation parameters: ML/DL model architecture, dataset employed, performance metrics, and limitations. These factors collectively define the success and applicability of ML/DL solutions in real-world clinical settings.
This section mainly addresses research questions RQ1, RQ4, and RQ5 by analyzing the technical performance of ML/DL models and the datasets most commonly used for LC detection, classification, and analysis. It also investigates which technical limitations affect the generalizability of these models to real-world clinical applications.

2.1. Datasets

One of the most important issues in LC detection and classification research is dataset availability and quality. The development, validation, and generalization of ML and DL models depend largely on dataset selection. This section provides an overview of the common types of datasets used in recent studies, along with their characteristics, strengths, and limitations.
An important characteristic of these datasets is the imaging modality; we briefly explain those encountered in our research. CT scans, the most common modality, use thoracic computed tomography to provide detailed anatomical images of the lungs. Other modalities include positron emission tomography-computed tomography (PET-CT), which combines metabolic and structural imaging; magnetic resonance imaging (MRI), known for its high soft-tissue contrast; and histopathological images, which assist in the early diagnosis of LC. The microscopic images of tissue biopsies in histopathological datasets are necessary for LC subtype classification into adenocarcinoma (ADC), squamous cell carcinoma (SCC), and benign cases. Such datasets help train DL models capable of interpreting the morphology and architecture of cells and tissues at very high resolution.
The predominance of CT imaging in LC ML/DL studies is mainly due to its widespread clinical use and practical advantages. CT is relatively low-cost, widely available, and routinely used as the first imaging test to detect lung cancer. CT scans are fast to acquire and are less affected by patient breathing during scanning, resulting in fewer motion-related artifacts compared with other modalities. Moreover, CT provides clear anatomical information of the lungs, allowing reliable identification of suspicious nodules and assessment of cancer presence. These factors make CT data easier to collect, more consistent, and highly suitable for ML and DL employment.
Various publicly available CT datasets facilitate radiomic analysis by offering segmentation masks or metadata used to extract quantitative features. Radiomic datasets denote collections of CT images in which structured features such as tumor shape, intensity, and texture are extracted computationally and used to characterize the tumor phenotype. These datasets facilitate the construction of predictive models beyond image classification. While such collections have come to be commonly known as “radiomic datasets” because of the support they provide for extracting radiomic features, they are, by nature, CT datasets with added annotations or metadata.
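To make the idea of computed radiomic features concrete, the sketch below extracts a few first-order intensity statistics from a hypothetical segmented region of interest. Real radiomics pipelines rely on dedicated toolkits and compute many more feature classes (shape, texture, wavelet), so this is only a schematic illustration:

```python
import math

def first_order_features(pixels):
    """First-order intensity statistics over a segmented region of interest."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((v - mean) ** 2 for v in pixels) / n
    energy = sum(v * v for v in pixels)
    return {"mean": mean, "std": math.sqrt(var),
            "min": min(pixels), "max": max(pixels), "energy": energy}

# Toy Hounsfield-unit values inside a tumor mask (illustrative only)
roi = [-30, -10, 5, 20, 40, 55, 60, 35]
feats = first_order_features(roi)
print(round(feats["mean"], 2))  # 21.88
```

In a real pipeline, vectors of such features per lesion feed the predictive models described in the radiomics studies cited below.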
Multimodal datasets combine various types of data—e.g., CT scans, histopathology slides, genomic data, and clinical data—to facilitate the construction of richer and clinically important models. These datasets are characterized by the variety of modalities present.
All these datasets are used to detect and classify different types of cancer, which include adenocarcinoma (ADC), squamous cell carcinoma (SCC), large cell carcinoma (LCC), and small cell lung carcinoma (SCLC), as well as benign (non-cancerous) tumors and normal, healthy tissue.

2.1.1. Public Datasets

Table 1 shows the summary of key public datasets used in LC detection and classification. Public datasets are a foundation upon which the advancement of LC research can be fostered through enabling the reproducibility and benchmarking of algorithms. Public datasets are typically curated, annotated, and released to the general research community, fostering collaborative innovation and openness [9]. Examples of publicly available, frequently utilized datasets include the LIDC-IDRI dataset [10], which features over 1000 thoracic computed tomography (CT) scans and expert annotations, and the National Lung Screening Trial (NLST) [11], providing longitudinal imaging and clinical data from thousands of patients. Another common public CT database is LUNA16 [12], which has 888 patient CT scans with over 551,000 DICOM images, marked for nodule detection (549,715 regular slices and 1352 abnormal slices) with a pixel resolution of 512 × 512. One of the histopathological datasets used most commonly is the LC25000 dataset [13], consisting of 25,000 augmented instances in five categories with 15,000 instances of lung tissue (ADC, SCC, and benign). Examples of radiomic datasets include The Cancer Imaging Archive’s (TCIA) NSCLC-Radiomics dataset [14] with pretreatment CT scans and expert-annotated tumors and clinical follow-up in 422 NSCLC cases. There are other such examples, like RIDER [15], with an emphasis on feature constancy across recurrent scanning. As an example, a multimodal dataset could be NSCLC-Radiomics-Genomics “Internal NSCLC cohort” [16], which brings together pretreatment CT scans and gene expression levels, as well as clinical data from 26 NSCLC patients, and has been used to investigate the correlation between imaging descriptors and expression levels of the genes.
Public datasets tend to be large, well-annotated, and supplemented with rich metadata, often created from multicenter studies or national and international collaborations, ensuring broad and representative case distribution. These data are typically well-structured and suitable for benchmarking algorithms; however, they may lack the variability observed in real-world clinical settings.
These datasets offer several advantages: they enhance reproducibility by enabling the evaluation of different algorithms under consistent conditions, foster transparency through open access to methods and results, and facilitate collaboration by promoting the sharing of resources and expertise across institutions. Yet certain limitations must be taken into account. Heterogeneity resulting from variations in imaging protocols and annotation standards across contributing centers can introduce inconsistencies. Furthermore, strict anonymization protocols, implemented to uphold patient privacy and ethical compliance, may limit access to valuable clinical information. Finally, publicly available datasets can lack the variability present in real-world clinical practice, particularly in reflecting rare or complex cases, and thus may limit the generalizability of models trained on them.
Histopathology image datasets provide high spatial and textural resolution, important for precise subtype identification, but present challenges in the form of stain variation, patch extraction methods, and the need for expert annotation. These datasets also tend to be artificially balanced and may lack real-world diversity. Radiomic datasets have the strength of allowing quantitative evaluation of tumor heterogeneity by means of shape, intensity, and texture features [17]. However, like other datasets, they come with shortcomings such as inter-observer variability in manually segmented data and sensitivity to the imaging protocol and preprocessing [18].
Multimodal datasets provide opportunities for patient-centric models with enhanced predictive ability and interpretability. Nevertheless, these pose a number of challenges in terms of data integration complexity, the scarcity of publicly available large multimodal resources, and the necessity for multidisciplinary groups for annotation.
Table 1. Summary of key public datasets.
Dataset Name | Modality | Size (Patients) | Strengths | Limitations
LIDC-IDRI [10] | CT | 1018 | Well-annotated, multicenter CT data | Limited to CT, no histopathology
NLST [11] | CT | 53,452 | Very large, longitudinal | Limited access, specific criteria
Duke LC Screening Dataset [19] | CT | 2061 | Large, well-curated | Limited clinical annotations
LUNA16 [12] | CT (nodules, non-nodules) | 888 | High-resolution, nodule-focused | No histopathology, weak labeling
LC25000 [13] | Histopathological (ADC, SCC, Benign) | 750 | Balanced classes, widely used | Synthetic generation concerns
IQ-OTH/NCCD [20] | CT (normal, benign, malignant) | 1190 | Includes benign and malignant categories | Limited diversity
Chest CT-Scan [21] | CT (ADC, LCC, Normal, SCC) | 1653 | Balanced classes, useful for multi-class | Limited sample size
Lung-PET-CT-Dx [22] | CT, PET-CT (ADC, SCC, LCC, SCLC) | 355 | Multimodal (CT + PET), rich annotations | Requires pre-processing expertise, fewer patients
Lung Cancer Alliance (LCA) [23] | CT (NSCLC and SCLC) | 76 | Real-world lung cancer cases | Limited dataset size
SPIE-AAPM-NCI Lung Nodule Classification Challenge [24] | CT | 70 | High-quality expert annotations | Limited sample size
Chest X-Ray [25] | X-ray (Normal and Pneumonia) | 5856 | Balanced and labeled X-ray data useful for binary classification | Only two classes (Normal vs. Pneumonia), not specific to lung cancer
Lung Tumor Segmentation [26] | CT | 63 | High-quality manual tumor segmentation in full CT volumes | Limited sample size; focus on segmentation tasks only
Lung Nodule Segmentation [27] | CT (nodule, cancer, and adenocarcinoma) | ~1650 | Rich instance-level annotations (pixel-wise masks per lesion) | No patient-level metadata (e.g., age, sex, diagnosis context)
NSCLC-Radiomics [14] | CT | 422 | Multi-institutional cohort, supports radiomics standardization | Manual delineations, outcome data
RIDER [15] | CT | ~100 | Intra-patient test–retest variability, ideal for robustness testing of radiomic features | Repeat scans, stability studies
Radiomic Features (various) [28] | CT-derived | Variable | Public/private multi-dataset aggregation, useful for feature reproducibility and benchmarking ML models | Quantitative features, diverse applications
NSCLC-Radiomics-Genomics [29] | CT, Genomics, Clinical | 89 | Multimodal, gene-expression data | Limited size, complex integration
NSCLC Radiogenomics (TCIA) [30] | CT, Genomics, Clinical | 211 | Combines imaging, gene expression, and survival data | Limited cohort; requires complex preprocessing
Internal NSCLC cohort [16] | CT and PET/CT + gene expression | 26 | Includes matched imaging and gene expression; radiogenomic analysis performed directly | Small sample size
UniToChest [31] | CT | 623 | Largest public lung nodule segmentation dataset; manual annotations by radiologists | Focused on segmentation, not classification
SIMBA Public [32] | CT (Low-dose CT) | ~1000 | Includes low-dose CT scans used in CAD systems | Limited clinical metadata, not all scans labeled for malignancy

2.1.2. Private Datasets

Private datasets are collected by individual research groups or institutions and are not publicly accessible. Private datasets are often used to validate models trained with public data or to address specific clinical questions. For example, studies have documented the use of private CT and PET datasets from hospital repositories for multimodal learning tasks [33]. Table 2 shows the summary of key private datasets.
Private datasets often incorporate proprietary imaging protocols, specific patient populations, and clinical information not otherwise available in public datasets. They offer notable advantages, including enhanced clinical relevance, as they more closely resemble real-world clinical practice and local populations. They can also contain additional information, such as other imaging modalities or longitudinal follow-up, that can enhance model development and testing. However, private datasets have significant limitations. Restricted access can hinder reproducibility and external validation. Moreover, the absence of standard annotation guidelines can affect data reliability and consistency. Lastly, private datasets are typically smaller than public datasets, reducing statistical power and result robustness.
Despite the importance of proprietary clinical data, limited access to such data significantly restricts standardized comparison and reproducibility of results. Because these data are not publicly available, independent researchers and developers cannot re-evaluate reported results or compare them under the same experimental conditions. Therefore, performance reported on proprietary datasets cannot be directly compared with other methods, making it difficult to assess robustness, generalizability, and true clinical value across studies.

2.2. Traditional ML Models

Classic ML models combined with handcrafted features extracted from medical images or clinical data have played a pioneering role in detecting and classifying LC by identifying lesions associated with malignant or benign nodules. Moreover, several studies combine ML and DL, exploiting DL models for feature extraction and ML models for classification.
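The hybrid pattern, DL for feature extraction and classical ML for the final decision, can be sketched abstractly. Both stages below are deliberately toy stand-ins (summary statistics instead of a CNN, a nearest-centroid rule instead of an SVM) chosen only to show the two-stage structure; they do not reproduce any reviewed study's method:

```python
import math

def extract_features(image):
    """Stand-in for a DL feature extractor: summary stats of pixel values."""
    flat = [v for row in image for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return [mean, math.sqrt(var), max(flat) - min(flat)]

class NearestCentroid:
    """Stand-in for the classical ML classifier stage."""
    def fit(self, feats, labels):
        self.centroids = {}
        for lab in set(labels):
            rows = [f for f, l in zip(feats, labels) if l == lab]
            self.centroids[lab] = [sum(c) / len(rows) for c in zip(*rows)]
        return self

    def predict(self, feat):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(feat, c))
        return min(self.centroids, key=lambda lab: dist(self.centroids[lab]))

# Toy 2x2 "scans": bright/high-intensity vs dark/flat (labels illustrative)
images = [[[200, 210], [190, 205]], [[198, 202], [207, 195]],
          [[10, 15], [12, 8]], [[9, 11], [14, 10]]]
labels = ["malignant", "malignant", "benign", "benign"]

clf = NearestCentroid().fit([extract_features(im) for im in images], labels)
print(clf.predict(extract_features([[195, 200], [205, 210]])))  # malignant
```

The design point is the interface: the classifier never sees raw pixels, only the fixed-length feature vector, which is why a CNN or transformer extractor can be swapped in front of an SVM, RF, or KNN, as in the studies below.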

2.2.1. ML on Public CT Datasets

Traditional ML models applied to publicly available CT datasets and Radiomic repositories have significantly contributed to early LC detection and classification. Publicly available CT datasets such as LUNA16, LIDC-IDRI, IQ-OTH/NCCD, and Chest CT-Scan are widely utilized to train and validate traditional ML algorithms for LC detection. The datasets contain various annotated imaging data used to support binary and multiclass classification tasks. Table 3 shows the ML performance on public datasets.
For binary classification, several studies have leveraged these datasets to distinguish between cancerous and non-cancerous cases. In [38], the authors built a new ML model to identify LC based on the LIDC-IDRI [10] dataset. They proposed a radiomics-based ensemble ML framework that used two techniques, boosted and bagged ensemble classification trees, to select an appropriate set of features, with SVM, DT, and Ensemble Subspace KNN used for classification. The proposed model achieved an accuracy of 88.3%, an AUC of 93.4%, a sensitivity of 97.1%, and a specificity of 83.1%. However, the study was limited to a single dataset, making it difficult to assess the model’s generalizability. Similarly, the authors in [39] designed a new ML model to detect LC and classify it as cancerous or non-cancerous based on the public CT dataset LIDC-IDRI [10]. They used a CNN model for feature extraction and SVM, random forest (RF), Naive Bayes (NB), decision tree (DT), Wide ANN, and Capsule Network models for classification. The SVM achieved the best result, obtaining an accuracy of 94%. Nevertheless, the study relied on a single dataset and omitted key performance metrics such as sensitivity and specificity, limiting a full evaluation of model performance.
Researchers in [40] introduced an automatic lung nodule detection method from CT images of the LUNA16 dataset [12] based on a modified AlexNet architecture and the SVM algorithm, namely LungNet-SVM. They utilized AlexNet for feature extraction and the SVM for classification. The suggested model obtained 97.64% accuracy, 96.37% sensitivity, and 99.08% specificity. While these results are impressive, the study was conducted on an imbalanced and single dataset, and lacks a detailed discussion on validation or external testing, which poses challenges to assessing model robustness. In addition, in [41], a hybrid model was suggested to classify NSCLC from histopathological images from the LC25000 dataset [13]. EfficientNet-B0, local binary pattern (LBP), and vision transformer (ViT) encoders were used in the model for feature extraction, and the features extracted were input into four various classifiers: SVM, logistic regression (LR), light gradient boosting machine (LightGBM), and XGBoost. The hybrid approach using EfficientNet-B0, LBP, ViT encoders, and an SVM classifier achieved the best accuracy of 99.87%, demonstrating its effectiveness for NSCLC classification. While these results are impressive, the study was conducted on a single dataset, and the histopathological image analysis needs high-quality samples and expert explanations, which may not be widely available and pose challenges to assessing model robustness. Last but not least, the authors in [42] proposed a novel ML technique for LC diagnosis using a SPIE-AAPM-NCI Lung Nodule Classification Challenge dataset [24]. Authors employed image-processing techniques and supervised ML algorithms (SVM, KNN, and RF) for the classification of whether DICOM CT scan images of lungs were normal or pathological. The highest accuracy of 88.57% was obtained by the KNN algorithm. 
Despite demonstrating acceptable performance, the study used only one dataset and did not provide in-depth descriptions of dataset characteristics or additional performance metrics, which makes it difficult to interpret the broader significance of the results.
In the context of multiclass classification, ML models have also demonstrated notable performance. For instance, in [43], the authors described an ML solution for LC diagnosis using DL methods and the publicly accessible LUNA16 CT dataset [12]. It began with the removal of noise from input images via Butterworth filtering, followed by feature selection using the Chaotic Crow Search Algorithm and Random Forest (CCSA-RF). Feature extraction was performed using the Multi-space Image Reconstruction (MIR) method combined with the Gray Level Co-occurrence Matrix (GLCM). Extracted features were input into a hybrid Sparse Convolutional Neural Network (SCNN) and Probabilistic Neural Network (PNN) model to classify LC as benign, regular, or malignant. The suggested SCNN-PNN model recorded 97.5% accuracy, outperforming models such as CNN + SVM and DenseNet201. However, the study reported results only on a single dataset, with limited details on specific evaluation metrics or validation procedures, thereby restricting the ability to judge the model’s general effectiveness. Radiomics-based studies using public sources also show promise. Reference [44] applied ML to radiomic features extracted from the Lung-PET-CT-Dx dataset (TCIA) using an iMRRN segmentation model and an SVM classifier. The model achieved an AUC of 97% in classifying ADC, small cell carcinoma, and SCC. The researchers in [45] introduced a new ML approach for the detection of NSCLC based on the CT imaging dataset LIDC-IDRI [10]. They employed classical ML models to detect lung nodules (SVM, RF, KNN, DT) and other techniques to classify them, such as the VGG19, EfficientNet-V2-L, Wide-ResNet-50-2, and EfficientNet-B7 models. The VGG19 model achieved the highest accuracy of 99.70%, followed by EfficientNet-V2-L with 99.31%, while the classical ML models performed comparatively poorly in terms of accuracy.
While these results highlight the promise of ML and DL pipelines in multiclass settings, the findings are based on single datasets without external validation, which may limit their reproducibility in clinical contexts.
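Among the handcrafted descriptors used in [43], GLCM texture features are representative. The sketch below, assuming a tiny quantized “CT patch”, computes a co-occurrence matrix and two classic Haralick-style statistics in plain numpy; real studies would typically use a library implementation such as scikit-image’s `graycomatrix`, over many offsets and angles.

```python
# Minimal gray-level co-occurrence matrix (GLCM) and two texture
# features (contrast, homogeneity). Illustration only: the 4x4,
# 4-gray-level "patch" is a toy stand-in for a CT nodule patch.
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence counts for pixel pairs offset by (dy, dx)."""
    m = np.zeros((levels, levels))
    h, w = image.shape
    for i in range(h - dy):
        for j in range(w - dx):
            m[image[i, j], image[i + dy, j + dx]] += 1
    return m / m.sum()

def contrast(p):
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

def homogeneity(p):
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + np.abs(i - j))))

# A 4-level toy patch: large uniform blocks -> mostly diagonal GLCM mass
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [2, 2, 3, 3],
                  [2, 2, 3, 3]])
p = glcm(patch, levels=4)
print("contrast:", contrast(p), "homogeneity:", homogeneity(p))
```

Feature vectors of such statistics, concatenated over offsets, are what pipelines like MIR + GLCM feed to a downstream classifier.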
Table 3. ML performance on public datasets.
| Study | Dataset | Task | Feature Extraction | Classifiers | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
| [42] | SPIE-AAPM-NCI Lung Nodule Classification Challenge | Binary | - | KNN | 88.57 | - | - | - |
| [38] | LIDC-IDRI [10] | Binary | Boosted and bagged ensemble classification trees | Ensemble Subspace KNN | 88.3 | 93.4 | 97.1 | 83.1 |
| [39] | LIDC-IDRI | Binary | CNN | SVM | 94.0 | - | - | - |
| [45] | LIDC-IDRI | Multiclass | - | VGG19 | 99.70 | - | - | - |
| [41] | LC25000 | Binary | EfficientNet-B0, LBP, and ViT | SVM | 99.87 | - | - | - |
| [40] | LUNA16 | Binary | AlexNet | SVM | 97.64 | - | 96.37 | 99.08 |
| [43] | LUNA16 | Multiclass | MIR + GLCM | SCNN + PNN | 97.50 | - | - | - |
| [44] | Lung-PET-CT-Dx | Multiclass | Radiomics | SVM | - | 97 | - | - |

2.2.2. ML on Private Datasets

In addition to publicly available sources, several studies have employed radiomic features extracted from CT scans held by private institutions to classify LC. Such studies utilize quantitative imaging biomarkers of tumor intensity, texture, and shape, offering a non-invasive alternative to traditional histological assessment. These private collections usually provide richer clinical metadata, localized image variation, and varied patient populations, but their limited availability and small sizes compromise reproducibility and generalizability. ML approaches based on these datasets have been applied to both multiclass classification of LC subtypes and binary tasks such as subtype prediction or survival analysis. Table 4 summarizes the performance of various ML models on private datasets.
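The intensity, texture, and shape biomarkers described above can be illustrated with toy first-order and shape descriptors. Everything below is a hypothetical stand-in: the 8×8 “slice”, the square “nodule” mask, and the crude 4-neighbour perimeter only sketch what a clinical radiomics toolkit (e.g., pyradiomics) computes in 3D with many more features.

```python
# Toy radiomic descriptors from a 2D "CT slice" and a binary tumor mask.
import numpy as np

def first_order_features(image, mask):
    """Simple intensity statistics inside the masked region."""
    roi = image[mask > 0]
    return {"mean": float(roi.mean()),
            "std": float(roi.std()),
            "energy": float(np.sum(roi.astype(float) ** 2))}

def shape_features(mask):
    """Area plus a crude perimeter: mask pixels with a background 4-neighbour."""
    area = int(mask.sum())
    padded = np.pad(mask, 1)
    neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:])
    perimeter = int(np.sum((mask > 0) & (neighbours < 4)))
    return {"area": area, "perimeter": perimeter}

image = np.arange(64).reshape(8, 8)   # toy intensity ramp
mask = np.zeros((8, 8), dtype=int)
mask[2:6, 2:6] = 1                    # a 4x4 "nodule"

feats = {**first_order_features(image, mask), **shape_features(mask)}
print(feats)
```

Vectors of such descriptors, optionally concatenated with clinical attributes, form the inputs to the RF, LR, and SVM models discussed next.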
In binary classification or subtype-specific prediction tasks, several studies have demonstrated the effectiveness of combining radiomic and clinical features. In [46], the authors used classical ML to detect LC based on a dataset combining radiomic features from CT scans with clinical features from 500 patients. They used RF, LR, and SVM to predict ADC. The SVM model achieved the best result, with an AUC of 94.2%. However, the reliance on clinical metadata, combined with low performance in some settings, limits its standalone diagnostic utility. In addition, the authors in [47] developed a new classical ML model to identify pathological subtypes of lung ADC presenting as ground-glass nodules, based on a radiomics dataset of 224 CT images from a private hospital. They used an RF model to detect lung ADC. Although the model achieved an AUC of 92%, the findings are dataset-specific and may not generalize across different imaging protocols. Moreover, the authors in [37] built a new classical ML model to predict LC outcomes (alive vs. dead) based on radiomic features from CT scans of 59 patients with primary lung ADC. They used LR and the Dual Coordinate Descent method for Logistic Regression (DCD-LR) for prediction; LR achieved the best result, with a sensitivity of 80% and an accuracy of 64.4%. Yet, the small sample size and low performance raise questions about model robustness and generalizability. In addition, researchers in [48] developed a new traditional ML model to detect LC based on a dataset that includes attributes such as age, gender, smoking status, and chronic disease.
They used SVM, NB, KNN, DT, and RF to detect LC; the DT model achieved the best result, with an accuracy of 100% and an F1-score of 100%. Such perfect performance likely reflects overfitting, particularly since the study provided limited discussion of dataset size, class distribution, or external testing, reducing confidence in its clinical applicability.
For multiclass classification, the study in [35] used classical ML to detect and classify LC based on radiomic features extracted from unenhanced CT images of 920 patients from Zhongshan Hospital, Fudan University. They used an RF model to classify LC into ADC, SCC, and small cell LC (SCLC) based on the extracted radiomic features. The proposed model achieved AUCs of 74%, 77%, and 88% for ADC, SCC, and SCLC, respectively. However, the performance for ADC and SCC was relatively low, and the study lacked external validation, raising concerns about potential overfitting due to the limited dataset size. The study also relied on a single dataset with high-quality annotations, which may limit reproducibility and applicability across diverse clinical environments.
Despite promising results from various ML models on private datasets, the reviewed studies have limitations affecting their clinical implications. The most pervasive are reliance on single, usually small and homogeneous datasets, the absence of external validation, and poor reporting of important metrics such as sensitivity, specificity, and AUC. These issues should be mitigated through multi-center collaborations and rigorous cross-validation to ensure clinical robustness.
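One concrete form of the rigorous cross-validation recommended here is stratified k-fold evaluation that reports the mean and spread of AUC across folds rather than a single hold-out score. The sketch below uses synthetic, imbalanced data as a stand-in for radiomic feature vectors; it illustrates the protocol, not any reviewed study.

```python
# Stratified 5-fold cross-validation reporting per-fold AUC.
# Synthetic imbalanced data stand in for radiomic feature vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

aucs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X, y):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], model.predict_proba(X[te])[:, 1]))

print(f"AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f} over {len(aucs)} folds")
```

Stratification preserves the class ratio in every fold, which matters for the small, imbalanced cohorts typical of private datasets; external validation on an independent cohort remains a separate, additional requirement.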

2.3. DL Models

DL, a dedicated area of ML, has transformed medical imaging, particularly the identification and classification of LC. In contrast to standard ML models, which depend on handcrafted features and user input, DL models, particularly CNNs, can learn complicated patterns and hierarchical representations directly from raw image data [49]. This characteristic is fundamental in LC detection, because minor variations between benign and malignant nodules can be difficult to distinguish using conventional approaches [50]. DL models excel at processing high-dimensional data such as CT scans and at capturing the spatial, textural, and contextual information required for enhanced detection and classification [51]. In addition, DL has shown remarkable performance in distinguishing LC subtypes such as ADC, SCC, and LCC, significantly increasing diagnostic accuracy [52]. Integrating sophisticated architectures such as VGG, ResNet, DenseNet, and Inception has enhanced diagnostic efficiency and reduced dependence on human expertise, opening the door for more powerful, automated, and precise LC screening systems [53].

2.3.1. DL on Public Datasets

Most of these studies utilized public CT databases, histopathological datasets, and multimodal datasets to detect and classify LC. These models leverage large-scale annotated datasets such as LUNA16 [12], LIDC-IDRI [10], LC25000 [13], and Chest CT-Scan [21]. Structured CT datasets, such as LUNA16 and LIDC-IDRI, are commonly employed in lung nodule detection tasks due to their standardized annotations and imaging format (see Section 2.1.1 for details).
Table 5 shows the DL models using the LUNA16 dataset for binary classification. In [54], the authors proposed a method for categorizing malignant lung nodules from CT scans, developed using 3D convolutional neural network (3D-CNN) and Recurrent Neural Network (RNN) algorithms. The proposed method attained 95% accuracy, 90% selectivity, and 87% sensitivity. Similarly, in [55], researchers built algorithms to identify early-stage LC and its classes. This study uses a CNN to integrate the CT images into the improved dial’s loading algorithm (IDLA). It applies preprocessing techniques to the CT images, including noise removal and grayscale conversion, and extracts features using CNN operations to identify LC, achieving 92.81% accuracy and 92.85% sensitivity. In addition, researchers in [56] developed a Deep Ensemble 2D CNN that combines the outputs of two or more CNNs to predict results more accurately; the proposed model classifies scans as abnormal or normal with 95% accuracy. Furthermore, researchers in [57] proposed a new algorithm to classify LC at early stages from CT imaging, improving efficiency by combining weakly supervised dense instance-level lung segmentation (WSDI), which avoids laborious pixel-level annotations, with a deep continuous learning-based deep neural network (SS-CL) trained on labeled and unlabeled data. The combination of WSDI and LOS segmentation makes it possible to employ lightweight, low-memory deep neural network (DNN) models in image processing. The suggested model achieved 98.2% precision. Further advancements were demonstrated in [58], where the researchers proposed a DL model for LC classification, implementing a Deep Convolutional Neural Network (DCNN) with an RPN-based Faster R-CNN model.
The proposed model performed better than alternative networks such as ResNet, DenseNet, MobileNet, and MixNet, achieving 95.32% precision. In [59], scientists suggested the Hybridized Faster R-CNN (HFR-CNN) model and compared its performance to other approaches, including CNN, a fusion algorithm (FA), two-step module DL (TS-DL), and inspired snake swarm optimization paired with bat-based emulated chaotic atom search (ISSO-B + CAS). The HFR-CNN model achieved the highest accuracy of 97%, reflecting its efficiency in classifying LC. Lastly, in [60], the authors introduced AtCNN-DenseNet-201 TL-NBOA-CT, which uses Modified Sage-Husa Kalman Filtering (MSHKF) for preprocessing, an improved empirical wavelet transform (IEWT) for feature extraction, and an attention-based CNN with DenseNet-201 for classification. Their proposed model achieved an accuracy of 99.43%. While these studies demonstrate strong classification across diverse DL architectures and new preprocessing techniques, they share a common weakness: single-dataset training with no external validation, which puts model robustness, generalizability, and deployability in a clinical environment into question.
Several studies have employed the public LIDC-IDRI dataset, as shown in Table 6. The authors in [61] built models on DBN, CNN, and Stacked Denoising Autoencoders (SDAE), with accuracies of 79.76%, 81.19%, and 79.29%, respectively. Similarly, the authors of [62] introduced new algorithms based on DBN models to differentiate LC; applying DBN and CNN to lung nodule classification, they obtained sensitivities of 73.4% and 73.3%, respectively. Another study [63] developed a new DL pipeline comprising preprocessing, lung region segmentation, cancer tumor segmentation, data augmentation, and LC classification. The WSLO-trained Shepard Convolutional Neural Network (ShCNN) correctly classified LC with an accuracy of 90.91%. Moreover, the authors of [64] introduced a Semantic Characteristic Convolutional Neural Network (SCCNN), which analyzes 3D multi-view samples of lung nodules extracted through a spatial sampling method to determine malignancy. The proposed model achieved an accuracy of 95.45%, demonstrating its effectiveness in classifying lung nodules. Advanced transformer-based and generative approaches have also been tested. In [65], the authors proposed a new Multi-Granularity Dilated Transformer (MGDFormer) model to enhance lung nodule classification. The approach has two primary components, a Deformable Dilated Transformer (DDT) and a Local Focus Scheme (LFS), used to capture global context and extract local fine-grained features to improve the classification rate. The proposed model attained a 98.50% AUC and 96.10% accuracy. Likewise, study [66] developed a novel DL model comparing GP-WGAN, lung rods, and Brock approaches, of which GP-WGAN performed best with an AUC of 86.2%.
In [67], researchers designed a new hybrid model to identify ADC, consisting of, first, a Convolutional Auto-Encoder Transformer (CAET) path to extract and capture informative features and, second, a Shifted Window (SWin) path to extract nodule-related spatial features from volumetric CT scans, finally fusing the features extracted by CAET and SWin. The proposed hybrid model (CAET-SWin) achieved an accuracy of 82.65%. Multiclass classification and nodule detection tasks have also been explored. In [68], the authors utilized a DL model to implement automatic LC nodule detection, using the Jackknife Free-Response Receiver Operating Characteristic (JAFROC) in experimentation. With the R-CNN DL model, they achieved a sensitivity of 98%. Study [69] offered a multi-task network (MT-Net) with a prediction distillation structure to perform concurrent segmentation and classification of lung nodules. The model comprises three components: a coarse segmentation subnetwork (Coarse Seg-net), a cooperative classification subnetwork (Class-net), and a fine segmentation subnetwork (Fine Seg-net). The proposed model yielded solid performance, as reflected by a Dice similarity coefficient (DSC) of 83.20% for segmentation and 91.90% accuracy in classifying nodules. Study [70] introduced the Grey Wolf Optimizer (GWO) to reduce noise and improve segmentation accuracy. InceptionNet-V3, VGGNet, GoogLeNet, AlexNet, and ResNet DL models were used to discriminate LC as normal, benign, or malignant. Optimal discrimination results were obtained using the introduced GWO with the InceptionNet-V3 (GWO-IV3) model, with a precision of 98.96%, a specificity of 94.74%, and a sensitivity of 100%.
Despite the diverse modeling methods and strong reported results, all of the previous studies employed only LIDC-IDRI and lacked external validation, raising common concerns regarding overfitting, model fragility, and limited generalizability to real clinical imaging situations.
CT image datasets from Kaggle (https://www.kaggle.com/, accessed on 26 July 2025) (IQ-OTH/NCCD [20], Chest CT-Scans [21], Lung Tumor Segmentation [26], Lung Nodule Segmentation [27], Chest X-Ray Scans [25], and UniToChest [31]) have become a frequently used resource for training and validating DL models for LC classification and detection. Despite the availability and sufficient numbers of images in these datasets, the majority of the studies using them suffer from major drawbacks, such as single-dataset utilization, the absence of external validation, and underreporting of class-specific or comparative performance measures. Table 7 shows the DL models using Kaggle datasets. These studies vary in task type (binary or multiclass classification) and employ a range of architectures.
For binary classification tasks, [71] developed a new DL model on the Chest CT-Scans dataset, employing a CNN for classification and allowing end users to submit their own CT images through an application. The findings reveal that the proposed CNN achieves an accuracy of 97.10%. The ARSGNet in [72] obtained a binary classification accuracy of 98.17% on the Chest X-Ray Scans dataset. Moreover, in [73], the authors developed an optimized pipeline for lung nodule detection and segmentation using the UniToChest CT dataset and 3D nnUNet models. The proposed model demonstrated a sensitivity of 68.4%, a precision of 71.3%, and an F1-score of 69.8%. Additionally, [74] experimented with various DL architectures such as MobileNetV2, ResNet152V2, Inception-ResNetV2, and UNet on the Lung Tumor Segmentation dataset, concluding that Inception-ResNetV2 performed best for classification (accuracy: 98.5%) and that UNet worked best for segmentation (Jaccard index: 95.3%).
For multiclass classification, several studies used the Chest CT-Scans dataset. In [75], the authors proposed the VER-Net model for LC classification. VER-Net stacks three different transfer-learning backbones, namely VGG19, EfficientNetB0, and ResNet101. The proposed model achieved an accuracy of 91%, a precision of 92%, a recall of 91%, and an F1-score of 91.3%. Researchers in [76] built a new DL model to classify NSCLC. They proposed a dual-state transfer learning (DSTL) method using a deep CNN-based approach to develop an efficient model that can classify the type of LC by employing DCNN, VGG16, InceptionV3, and ResNet50; this model achieved an accuracy of 92.57%. Also, study [77] suggested a DL model for classifying four different types of LC (normal, SCC, LCC, and ADC), obtaining 96% accuracy using EfficientNet-B3. Furthermore, study [78] developed a pre-trained DL model for the same four classes, obtaining 98.95% accuracy using DenseNet201. Additionally, [79] utilized the Lung Nodule Segmentation dataset to develop a DL model for the detection, localization, and segmentation of generic nodules, cancer, and ADC. To enhance performance, the authors utilized a You-Only-Look-Once (YOLO) approach for lung nodule segmentation, data preprocessing, and data augmentation. The proposed model achieved an average precision of 75.70% and an average recall of 73.80% on classification tasks, along with an average mask precision of 75.00% and an average mask recall of 73.30% in segmentation, indicating its effectiveness in detecting and classifying lung abnormalities. Lastly, ref. [80] used the IQ-OTH/NCCD dataset to design a new DL method to classify LC as malignant, benign, or normal.
They used a CNN whose best configuration comprised ReLU activations for nonlinearity, max-pooling layers to reduce overfitting, and a SoftMax output layer to distinguish the three LC categories (benign, malignant, and normal). The accuracy of the suggested model was 99%.
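The stacking idea behind models such as VER-Net can be approximated, in its simplest form, by soft voting: averaging class probabilities from several independently trained models. In the sketch below, three lightweight scikit-learn classifiers stand in for transfer-learning backbones such as VGG19, EfficientNetB0, and ResNet101; only the combination rule is the point of the example, and it is not the cited authors' exact stacking scheme.

```python
# Soft-voting ensemble over three stand-in "backbones" on a
# synthetic 3-class problem (mimicking normal/benign/malignant).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=30, n_classes=3,
                           n_informative=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=1)

models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=100, random_state=1),
          SVC(probability=True, random_state=1)]

# Average the per-class probabilities, then take the argmax
probs = np.mean([m.fit(X_tr, y_tr).predict_proba(X_te) for m in models],
                axis=0)
ensemble_acc = accuracy_score(y_te, probs.argmax(axis=1))
print(f"soft-voting accuracy: {ensemble_acc:.3f}")
```

Averaging probabilities tends to smooth out the individual models’ errors when those errors are not strongly correlated, which is the intuition behind combining heterogeneous backbones.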
Finally, despite the optimistic reported outcomes, the models in these studies share some common limitations. Most relied entirely on one publicly available dataset, without cross-dataset or external validation, limiting their clinical value. Class-wise sensitivity, specificity, and AUC were underreported in most instances, hindering fair comparison. High accuracy scores may represent dataset-specific optima, raising doubts about overfitting and generalizability to the outside world.
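The class-wise metrics flagged as underreported are inexpensive to derive from predictions. The pure-numpy sketch below shows the arithmetic, with AUC computed via its rank-statistic (Mann-Whitney) formulation; the labels and scores are a small hypothetical example, not data from any reviewed study.

```python
# Sensitivity, specificity, and AUC from scratch in numpy.
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = sum(np.sum(p > neg) + 0.5 * np.sum(p == neg) for p in pos)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions, and predicted scores
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7])

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"AUC={auc(y_true, scores):.2f}")
```

Reporting all three (per class, for multiclass tasks) would make the fair cross-study comparisons called for above possible.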
Several studies applied hybrid DL approaches to public CT datasets for LC detection. These models often integrate multiple architectures or modalities to enhance classification accuracy. Table 8 shows the hybrid DL models.
For binary classification, in [81], the authors introduced a novel DL algorithm to differentiate NSCLC subtypes from a CT imaging dataset. Dense neural networks (ResNet-50 and VGG-16) and a sparse neural network (Inception v3) were applied to images of 60 ADC patients and 60 SCC patients from the public “lung PET/CT” dataset, with DL used to extract features from the CT images. The findings indicate that the Inception v3 model performed best, with an accuracy of 98.29%. Another binary classification approach was proposed in [82], where researchers developed a DL model for LC classification from 3D CT images. They introduced a hybrid model named HLFFF-SRNN, which combined high-low frequency feature fusion (HLFFF) with a sequential recurrent neural network (SRNN) for classification, leveraging ResNet-50 as a feature extractor. On two public databases (the Cancer Imaging Archive and the China Consortium of Chest CT Image Investigation), the model achieved accuracies of 99.20% and 99.40%, respectively. In [83], the authors introduced CADLC-WWPADL, which employs three key operations: feature extraction using the MobileNet model, hyperparameter tuning using the Waterwheel Plant Algorithm (WWPA), and classification using a symmetrical autoencoder (SAE) model on the SIMBA CT dataset. The suggested model achieved an accuracy of 99.05%. Moreover, in [84], a multimodal feature fusion approach for LC detection and classification (MFFOTL-LCDC) on CT images is described. To derive feature vectors, the authors utilized three transfer learning models (SqueezeNet, CapsNet, and Inception v3). They employed the remora optimization algorithm (ROA) for hyperparameter selection of these DL models and used the deep extreme learning machine (DELM) algorithm for classification on the SIMBA CT dataset. The model attained an accuracy of 97.78%.
Despite the strong reported performance of hybrid DL models for detecting LC from CT datasets, the reviewed papers share some limitations. Most of the models were trained and validated on a single dataset, limiting their applicability to real-world practice in varied clinical environments. For example, studies [81,83,84] lacked external validation, a source of overfitting and poor robustness. Furthermore, class imbalance was not well addressed in experiments such as [82], which could have skewed models toward majority classes. Some articles failed to present essential performance metrics, including sensitivity, specificity, or AUC, making a comprehensive evaluation of real diagnostic usefulness challenging. In addition, small dataset sizes and insufficient reporting of demographic or acquisition heterogeneity reduce confidence that such models generalize across diverse patient populations. Lastly, even sophisticated models, such as multimodal fusion or optimization-based models, paid little attention to real-world deployment problems such as integration into clinical workflows, interpretability, and computational requirements.
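A standard first-line mitigation for the class imbalance noted above is inverse-frequency class weighting, which up-weights the minority (e.g., malignant) class in the training loss instead of letting the majority class dominate. The helper below reproduces scikit-learn’s “balanced” weighting rule in numpy; most DL frameworks accept such weights directly in a weighted cross-entropy loss. The 90/10 label split is a hypothetical example.

```python
# Inverse-frequency class weights (scikit-learn's 'balanced' rule):
#   weight[c] = n_samples / (n_classes * count[c])
import numpy as np

def balanced_class_weights(labels):
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical cohort: 90 benign vs 10 malignant scans
labels = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(labels)
print(weights)  # minority class weighted 9x heavier than majority
```

Resampling (over- or under-sampling) is the complementary alternative; either way, the imbalance handling should be reported alongside the results, which few of the reviewed studies did.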
Public histopathological datasets, like LC25000 [13], have been widely used in DL studies. Table 9 provides an overview of DL models across histopathological datasets. For instance, in [85], a binary classification study built a new hybrid DL model to classify lung and colon cancer using the LC25000 dataset. The MobileNetV2 and EfficientNetB3 models were used for feature extraction, and GWO was used for feature selection and classification. The technique, named MEGWO-LCCHC, was also compared with ML models such as XGBoost, LightGBM, and CatBoost. The results show that the new technique achieves high accuracy, with the lightweight DNN model reaching 94.80%, compared with LightGBM at 93.90%, XGBoost at 93.50%, and CatBoost at 93.30%. In another binary classification study [86], researchers used a DL model to detect LC from histopathological whole-slide images of a public dataset acquired by the Pulmonary Department of the Greater Poland Center. The model used a CNN and a separable CNN with residual blocks, employing SepCNNs for feature extraction, and achieved an excellent accuracy of 97%. For multiclass classification, in [87], the authors introduced a new lightweight multi-scale (LW-MS) end-to-end CNN model for detecting LC based on the LC25000 dataset. The model was very effective, with an accuracy of 99.20%. In addition, the authors of [88] also used the LC25000 histological dataset with several CNN architectures, including MobileNet, AlexNet, VGG-19, Standard_CNN, and VGG-16, of which VGG-16 achieved the highest accuracy of 99.20%. Moreover, researchers in [89] built a DL model to detect and classify lung and colon cancer from the public LC25000 histopathological image database using digital image processing. Their approach incorporated feature fusion of several cutting-edge architectures, namely ResNet-101V2, NASNet-Mobile, and EfficientNet-B0, to enable multiclass classification of these two cancers.
The developed model achieved excellent performance, with 99.80% precision, 99.80% recall, and an overall accuracy of 99.94%, demonstrating its effectiveness in cancer detection and classification. Also, in [90], researchers introduced CroReLU, a new plug-and-play visual activation function (AF) to enhance DL models for LC detection. The authors experimented with models on a pathology image dataset of 766 cases from the Hospital of Baiqiu’en, Jilin University, and used the LC25000 dataset to validate the model. They evaluated different models: SENet50_CroReLU, SENet50, MobileNet, and MobileNet_CroReLU. The SENet50_CroReLU model was the best, achieving 98.33% diagnostic performance. Finally, the authors of study [91] created a DL model to detect LC (lymph node involvement) from public CT scans and histopathological images. The DL models used in this study were CNN, Convolutional Neural Network with Gradient Descent (CNN GD), Inception V3, ResNet-50, VGG-16, and VGG-19; CNN GD achieved a 99.84% accuracy rate. Despite the superior performance of DL models on histopathological datasets, substantial limitations reduce their utility in real-world implementations. Most research relied on the LC25000 dataset alone, which, though well balanced, lacks clinical diversity and variability in acquisition environments. Most of the works did not evaluate their models on external data or cross-modal tasks, which may lead to overfitting. Even the best-performing studies did not report sensitivity or specificity or consider real-world deployment, leaving their clinical generalizability questionable.
DL models utilizing multimodal datasets combining CT, histopathology, and PET-CT data have been proposed to improve diagnostic robustness. Table 10 shows the DL models using multimodal datasets. For instance, a multiclass classification approach was proposed in [92] with a DL model using the public Lung-PET-CT-Dx dataset, comprising CT and PET-CT images. The authors employed CNN and DenseNet for feature extraction and the MobileNet V3-Small model for classification. The proposed model achieved an accuracy of 98.6%, demonstrating high efficacy in LC detection. Similarly, in [93], scientists developed a DL algorithm to classify LC based on CT (Chest CT-Scan dataset) and histopathological images (LC25000 dataset). They applied DenseNet for detection and modified it with an attention mechanism (ATT-DenseNet) to focus on the vital areas of an image. Compared with DenseNet, AlexNet, and SqueezeNet, the ATT-DenseNet model achieved the highest accuracies of 95.40% and 94.00% on histopathological and CT images, respectively. Another multiclass classification framework was presented in [94], where the researchers developed a DL method for LC detection based on public CT scan images (Chest CT-Scan dataset) and histopathological images (LC25000 dataset). They employed CNN, CNN GD, VGG-16, VGG-19, Inception V3, and ResNet-50; the CNN GD model achieved the highest accuracy of 97.86% and a sensitivity of 96.79%. In the binary classification study [95], researchers proposed a new model to identify and classify NSCLC based on DL models and the public LUNA16 dataset. They used the deep convolutional neural network VGG-19 combined with an LSTM on 2351 input CT and X-ray images. The proposed VGG-19-LSTM model achieved an accuracy of 99.4%.
Finally, in [96], the authors suggested a novel hybrid DL and quantum computing model for binary LC detection using the public ChestX-ray and LIDC-IDRI datasets, which contain chest radiographs (CXR) and CT images. The model used CT and CXR data with hybrid quantum layers to enable transfer learning and utilized VGG16, ResNet50-V2, and DenseNet201 for classification. The proposed model achieved 92.12% accuracy, with a sensitivity of 94.00%, a specificity of 90.00%, an F1-score of 93.00%, and a precision of 92.00%, showing its potential for LC detection.
While promising results were reported across these studies, several limitations should be noted. Most models were trained and validated on single-source data, which constrains their ability to generalize across multiple clinical settings. External validation was absent in most instances, alongside insufficient reporting of dataset heterogeneity, e.g., patient population or imaging protocols. Also, issues of class imbalance, limited sample sizes, and the absence of modality-specific performance metrics were not adequately addressed. These limitations collectively lower the clinical applicability and stability of the proposed models in actual healthcare practice.

2.3.2. DL on Private Datasets

In addition to public datasets, some researchers have trained DL models to detect and classify LC on fully private datasets, typically procured from individual hospitals or regional healthcare facilities. These models offer the advantage of being optimized on actual clinical populations and imaging conditions. However, they are typically hampered by small dataset sizes, internal-only validation, and limited methodological comparison, all of which compromise their generalizability and reliability. Table 11 shows the DL models using private datasets.
A binary classification study in [97] developed a novel DL algorithm to diagnose stage I-IIIA NSCLC using CT scan image data. Stages I to IIIA are early to locally advanced stages of NSCLC, in which the cancer is still localized in the lungs and surrounding lymph nodes and has not reached distant organs. The suggested hybrid model, constructed using a DL 3D CNN, combines images and clinical data and obtained a median AUC of 76%. Another binary classification model was proposed in [34], which built a hybrid metaheuristic and CNN algorithm to classify LC from CT images, combining the Ebola optimization search algorithm (EOSA) with a CNN on the CT dataset from the Iraq-Oncology Teaching Hospital in Iraq. The proposed hybrid model achieved an accuracy of 93.21%. Study [98] introduces another binary classification framework, designed as a dual-stage model to detect and stage LC using a combination of advanced DL techniques applied to a CT dataset. The authors first developed a modified U-Net incorporating dual attention and pyramid atrous pooling to improve detection precision, extracting texture, color, and shape features from the segmented target area. They then employed a hybrid Xception and custom CNN model (XC-CNN) to separate normal and abnormal cases for tumor detection, and extracted additional locational features from the abnormal cases as input for an innovative hybrid adaptive learning neural network (ALNN) to achieve accurate LC staging. The proposed ALNN model achieved an accuracy of 93.30%. A multiclass classification approach was explored in [99], where the authors built a new DL model to detect and segment malignant lung tumors based on CT images from the Imperial College Hamlyn Centre, London.
They used a CNN model together with preprocessing techniques to remove noise and enhance the input images; the images were then passed to a segmentation stage based on edge segmentation and Region of Interest (ROI) extraction. This approach achieved an accuracy of 99.80%.
Although DL models trained on private CT datasets demonstrate promising accuracy, they have important drawbacks. Most used small institution-specific datasets with no external validation, raising concerns about overfitting and poor generalizability. For example, studies such as [97,98] employed advanced DL architectures but still showed only modest performance, particularly in AUC or robustness under real-world conditions. Others, including [34,99], lacked adequate evaluation measures such as sensitivity and specificity, or were trained on data of limited diversity. Overall, the absence of external benchmarking and the narrow clinical representation reduce confidence in these models' generalizability to varied healthcare settings.

2.4. Performance Variability Across Open and Private Datasets

One of the most critical findings of this review is the evident performance gap between models trained on publicly available (open) datasets and those developed on private (non-open) clinical datasets. As shown in Table 12, open datasets such as LIDC-IDRI, LUNA16, LC25000, and part of the Kaggle CT dataset are used very commonly in the literature because they are freely accessible and offer standardized imaging protocols, consistent annotations, and typically balanced class distributions. These properties create a controlled scenario in which models consistently report exceptionally high accuracy and AUC, often exceeding 95%. For example, deep learning models trained on the LC25000 dataset have recorded accuracy scores of nearly 99.9%, while models trained on the LUNA16 dataset have exceeded 98% on nodule detection tasks.
Conversely, models trained on private datasets—usually acquired from single medical centers or local radiology departments—show lower performance, with AUC scores of 74% to 93% in studies such as [34,35,47,97]. These datasets exhibit greater clinical variability in patient populations, imaging hardware, acquisition protocols, and tumor presentation. Furthermore, they are subject to smaller sample sizes, class imbalance, and non-standardized acquisition, challenges that are more typical of real clinical conditions.
This discrepancy illustrates a significant flaw in the design of ML/DL-based LC detection systems: excellent performance on public datasets does not directly imply clinical reliability or generalizability. Overfitting to meticulously curated datasets can lead to inflated expectations and poor real-world performance. To bridge this gap, future research must focus on external validation, cross-institutional benchmarking, and the use of multi-modal or hybrid datasets integrating open and private imaging sources. These practices are crucial to advance ML/DL models from theoretical prototypes to clinically deployable, dependable diagnostic tools.

2.5. Analysis of Models Used and Datasets

This subsection summarizes key tendencies in dataset usage and model architectures across LC detection studies, based on the analysis of recent literature. Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 highlight several consistent trends in LC ML/DL research. Figure 4 shows a strong reliance on CT imaging, reflecting its routine use in clinical practice. Figure 5 indicates that most studies depend on publicly available datasets, which support reproducibility but also introduce bias toward a limited number of well-known benchmarks. Figure 6 further emphasizes this concentration on a small set of popular datasets, raising concerns about generalizability to real-world clinical populations. Figure 7 and Figure 8 show that DL models, particularly CNN approaches, dominate current research and often report high performance on public datasets, while performance is more variable on private or multimodal data. Overall, these findings illustrate the gap between strong benchmark performance and reliable clinical translation.
Figure 4 shows that the majority of datasets used in LC research were CT scans, accounting for 74.3% of the total. This reflects CT's widespread clinical use and its high resolution for detecting pulmonary nodules. Histopathological datasets are in second place at 11.4%, used predominantly for cancer subtype classification (e.g., ADC vs. SCC). Multimodal datasets account for 7.1%, reflecting the emerging trend of combining complementary imaging data. PET-CT and X-ray datasets are underrepresented, each accounting for 1.4%, indicating limited adoption in standalone lung cancer studies. The prevalence of CT solidifies its role as the gold standard in LC screening workflows.
Figure 5 shows that general (public) datasets were used in 90.6% of previous studies, allowing reproducibility and comparison of results against standardized benchmarks. In contrast, specific (private) datasets, used in 9.4% of previous studies, came from individual clinical centers and were designed to address particular diagnostic questions, limiting the generalizability of the findings.
Figure 6 illustrates the distribution of imaging modalities and datasets among ML/DL-based LC research. The most dominant is CT imaging, with publicly available datasets such as LIDC-IDRI and LUNA16 being the most utilized. The LC25000 histopathology dataset has moderate representation, suggesting growing but limited interest in non-radiological imaging approaches. Conversely, multi-modal datasets (e.g., CT-PET-CT-Dx, CT-X-Ray, and institution-specific collections) are less frequently utilized, pointing to the community's continued reliance on curated CT-based datasets. This distribution reveals a heavy bias toward popular, publicly available datasets, which, while helpful for consistency, may fail to represent the variability of real-world clinical data.
Figure 7 shows that DL models (73.6%) are more commonly used for LC detection and classification than ML models (26.4%). CNN-based architectures (such as ResNet, VGG, and MobileNet) serve as the backbone of modern systems due to their effectiveness in automated feature extraction. RF and SVM are often integrated into hybrid frameworks (e.g., CNN-SVM) to enhance interpretability. ResNet-50 and ResNet-101 achieve high accuracy in LC classification, while MobileNet is preferred for lightweight deployment on edge devices.
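The CNN-SVM hybrid pattern mentioned above can be sketched as a pipeline in which a pretrained CNN acts as a feature extractor and an SVM performs the final classification. The example below is a minimal illustration of that pattern only: random clustered vectors stand in for CNN embeddings (an assumption made so the sketch is self-contained); in practice the features would come from, e.g., a ResNet's penultimate layer applied to CT patches.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n, dim = 400, 64  # 64-d stand-ins for CNN "deep features"

# Two synthetic class centers simulate benign vs. malignant embeddings.
centers = rng.normal(size=(2, dim))
labels = rng.integers(0, 2, size=n)
feats = centers[labels] + rng.normal(0.0, 1.2, size=(n, dim))

X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, random_state=0)

# The SVM stage of the hybrid: scale features, then classify.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm.fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
print(f"SVM-on-deep-features accuracy: {acc:.3f}")
```

Keeping the classifier separate from the feature extractor is what lets such hybrids trade some end-to-end optimization for a simpler, more inspectable decision stage.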
Overall, these trends and methodological gaps reveal that current research overemphasizes CT scans (83.9%) at the cost of multi-modal fusion possibilities (e.g., PET-CT for metabolic activity), which limits holistic diagnostic information. Further, while DL models dominate in performance, classical ML methods (e.g., SVM, RF) remain vital for interpretability in high-stakes clinical decisions, pointing to the need for hybrid models that trade off accuracy and transparency. In addition, hospital-specific private datasets (9.4%) remain underexplored, even though they could address real-world challenges such as scanner heterogeneity and population diversity. These findings call for balanced dataset curation—diverse modalities and real-world data—and hybrid models integrating the computational strength of DL with the interpretability of traditional ML to advance fair and clinically meaningful ML/DL tools.
Figure 8 shows a radar chart of accuracy performance for ML, CNN-based DL, and Hybrid DL models across multiple common LC datasets. Each axis corresponds to a widely used dataset (LUNA16, LIDC-IDRI, LC25000, Kaggle CT, Multimodal, and Private). Although CNN-based DL models are generally accurate on datasets like LUNA16 and LC25000, their clinical usefulness remains hampered by a lack of interpretability and limited generalizability across diverse patient populations. ML models show good accuracy on datasets like LIDC-IDRI and LC25000, with better interpretability, but can be limited when handling complex data structures. Hybrid DL models show robust performance across various datasets, but their practical deployment is hindered by model complexity and computational cost. The chart highlights the trade-off among accuracy, interpretability, and adaptability to various clinical settings that must be balanced for effective real-world implementation.
The performance comparison shows that ML model accuracy ranges from 88.57% (KNN on the SPIE-AAPM-NCI dataset) to 99.87% (SVM on LC25000) on publicly available datasets. DL models achieve even higher accuracy, e.g., 99.43% on LUNA16 (AtCNN + DenseNet201) and 96.10% on LIDC-IDRI (MGDFormer). These results reflect technical competence, but translating them into clinical practice faces major hurdles. Kaggle CT datasets, for example, yield up to 98.50% (Inception-ResNetV2 + U-Net) for binary classification and 96.00% (EfficientNet-B3) for multiclass classification, but they do not capture the heterogeneity and variability characteristic of real-world clinical settings. Hybrid DL models, while achieving up to 99.05% accuracy with blended architectures (e.g., MobileNet + WWPA + SAE), face challenges of complexity and interpretability when deployed in clinics.
DL methods applied to histopathology data offer excellent accuracy (99.84% with CNN GD) and show the technical potential of these approaches. Similarly, multimodal approaches based on CNN, DenseNet, and MobileNet offer up to 98.60% accuracy, suggesting enormous but still untapped potential. Although offering promising technical performance, their practical use is limited by poor access to multimodal clinical data and concerns over model generalizability and robustness over wide clinical populations.
Private datasets show varying results, with ML accuracy ranging from 64.40% (LR model) to 100% (Decision Tree) and DL methods achieving a maximum accuracy of 99.80%. Nevertheless, the inherent variability and opacity of private datasets pose substantial challenges for universal clinical validation and reproducibility.
In summary, although rigorous performance evaluation evidences the high technical competence of hybrid and DL models, real-world clinical adoption is constrained by dataset limitations, heterogeneity, interpretability problems, and insufficient validation on diverse, realistic clinical data.

3. Clinical Validation Challenges

Integrating ML/DL algorithms into lung cancer (LC) detection and diagnosis has been shown to enhance accuracy and efficiency and to support early detection of LC. However, the successful implementation of ML/DL algorithms in real clinical settings remains a significant challenge due to numerous clinical validation barriers. This section therefore focuses on key issues such as regulatory barriers, the opaque nature of ML/DL algorithms, and the practical limitations of clinical deployment. Figure 9 presents the four main challenges for clinical validation. This section also presents case studies of success and failure in LC detection and classification.
This section primarily addresses research questions RQ1 and RQ3 by examining clinical validation strategies, real-world challenges, and the assurance requirements needed to achieve clinically meaningful, reproducible, and non-invasive performance from ML/DL models.

3.1. Regulatory Barriers and Approval Processes

ML/DL-based diagnostic models must undergo stringent regulatory approval before being integrated into clinical workflows. Regulatory bodies such as the US Food and Drug Administration (FDA) [100], the European Medicines Agency (EMA) [101], the World Health Organization (WHO) [102], the CONSORT-AI Working Group [103], and other regional health authorities impose strict verification requirements [85,86]. These include:
  • Clinical Trial Rules: ML/DL models must prove safety, efficacy, and generalizability through comprehensive clinical trials, which can be time-consuming and expensive [70].
  • Dataset Standardization and Bias Mitigation: Regulators require that ML/DL algorithms be trained on varied datasets to avoid bias that could disadvantage different patient populations [96].
  • Explainability: Many ML/DL models lack interpretability, making regulatory approval more complex as authorities require clarity on decision-making processes [88].

3.2. Radiologists’ Suspicion Toward ML/DL-Based Systems

Radiologists play a central role in implementing ML/DL-based LC detection systems. However, there is skepticism due to several concerns:
  • Black Box Problem: ML/DL algorithms tend to work as “black boxes”, i.e., their decisions are not interpretable. This lack of interpretability makes it difficult for radiologists to trust ML/DL recommendations without knowing the reasons behind them. The work in [104] discusses the need to open the black box of ML in radiology, suggesting that the neighborhood of annotated cases may be one solution.
  • Fear of Diagnostic Errors: While ML/DL models achieve high performance in controlled research settings, real clinical environments introduce variability that can lead to misdiagnosis, raising legal and ethical responsibility issues. The study in [105] highlighted the need to balance “black box” systems and explainable ML/DL in radiology to dispel such fears.

3.3. Integration Challenges in Clinical Practice

Beyond regulatory and trust-related issues, integrating ML/DL into daily clinical workflows presents a practical barrier:
  • ML/DL systems must smoothly integrate with current hospital systems; compatibility is considered a significant challenge [106].
  • ML/DL models need continuous updates to maintain performance as new data becomes available [107,108].
  • Implementing ML/DL diagnostic tools demands significant investments in infrastructure and staff training, which can be an obstacle for many healthcare facilities.
Addressing these challenges requires a collaborative effort among IT developers, healthcare providers, and policymakers to ensure that ML/DL integration improves patient care without affecting safety or efficiency.

3.4. Ethical and Legal Considerations

ML/DL technologies for LC detection also raise ethical and legal concerns that must be addressed before clinical approval:
  • Protecting compliance with patient data privacy is critical to safeguarding patient confidentiality [109].
  • Liability for a diagnostic mistake, whether it falls on the ML developers or the healthcare providers, must be clarified; balancing black-box and explainable models is central to addressing such ethical and legal concerns [110].
  • Patients must be informed about the ML/DL-assisted diagnosis to ensure transparency and ethical standards in patient care [111].
Adopting a balanced, transparent, and ethical approach is essential for the success of ML/DL algorithms in clinical applications, particularly in the identification and classification of LC. Therefore, prioritizing patient confidentiality, developing an ethical framework related to accountability, and promoting informed patient consent and bias control are crucial for building patient trust in ML/DL-assisted diagnostic processes. Furthermore, a collaborative framework among developers, experts, and relevant stakeholders is essential for achieving a balance between legal and ethical considerations, ultimately leading to better patient outcomes and the protection of their rights.

3.5. Case Studies of Success and Failure

Applying ML and DL to LC detection has produced both major advances and serious challenges. These case studies—drawn from the available literature and real-world deployments—highlight the bimodal nature of ML/DL applications in clinical environments. Several studies reviewed in Section 2 report promising results, particularly those using public datasets. For example, [89] presented a fusion-based DL model trained on LC25000 histopathological images with 99.94% accuracy, while [60] used the LUNA16 CT dataset and obtained 99.43% accuracy with a DenseNet-201 architecture. Similarly, [59] used the LUNA16 and LIDC-IDRI datasets to improve generalizability, achieving 97% accuracy with their HFRCNN model. These results confirm the robust diagnostic ability of DL models, especially on structured, high-quality datasets. Beyond academic settings, practical applications of ML/DL systems in healthcare centers offer further success stories. A UK startup [112] developed a model trained on electronic medical records and symptom data from 122,193 patients, achieving 99.3% sensitivity for predicting early LC. In another highly successful example, Harvard Medical School researchers developed a DL model named Chief [113], trained on millions of whole-slide pathology images, that reached up to 94% accuracy in cancer detection, further proving the ability of ML/DL in multicancer diagnosis and therapy planning.
However, the review also reveals less successful attempts, particularly in studies that used private data or small clinical samples. For instance, study [97] trained a 3D CNN on an in-house CT dataset for early-stage LC detection, yet the model achieved an AUC of just 76%, reflecting poor generalizability and data scarcity. Similarly, Ref. [35] applied RF classifiers to radiomic features of non-contrast CT images, and performance was not uniform across cancer subtypes: AUC ranged from 74% to 88%, with the worst results for ADC classification. Even hybrid model analyses failed on small datasets. For example, Ref. [47] used radiomic CT features from 224 patients to classify tumor invasiveness with a 92% AUC, but lacked external validation and suffered from dataset-specific biases. Such instances illustrate the pitfalls of overfitting, poor reproducibility, and clinical underperformance when models are trained on narrowly defined, non-diverse datasets.
One typical pattern of successful research is the presence of large, heterogeneous, and well-annotated data along with real-world integration (e.g., electronic health records or pathology workflows). Conversely, the more limited studies underscore the need for cross-institutional verification, high-quality generalization techniques, and open data-sharing practices. In summary, though ML/DL methods continue to transform LC detection and classification, clinical success remains tightly tied to data quality, diversity, and deployment environment. As illustrated in Table 12, the public-vs-private dataset performance disparity is both a challenge and a cautionary note, highlighting the value of cautious optimism and rigorous evaluation before clinical application.
These contrasting case studies underscore that the success of ML/DL algorithms in LC detection depends not only on technological advances but also on integration into clinical workflows and rigorous validation across diverse real-world settings.

4. Ethical and Societal Implications

Integration of ML/DL into medical care brings significant ethical and societal considerations. This section investigates key issues, including algorithmic bias, data privacy, global health equity, patient autonomy, and societal trust. It also addresses research questions RQ1 and RQ3 by examining the ethical, societal, and organizational boundaries affecting the responsible development and clinical application of ML/DL models for the diagnosis and classification of LC. Figure 10 shows the core challenges with ethical and societal implications.

4.1. Algorithmic Bias and Health Disparities

Blending ML and DL into LC detection and classification has revolutionized diagnostic procedures. With these advances, however, comes growing concern about algorithmic bias, where trained models systematically perform differently across populations, potentially widening existing healthcare inequalities.
This systematic review revealed wide variation in model performance across studies, much of which can be explained by the type, heterogeneity, and source of the training and test sets used. Models trained on open-access datasets such as LUNA16, LIDC-IDRI, and LC25000 consistently displayed high performance. Study [60] achieved 99.43% accuracy on LUNA16 using DenseNet-201, Ref. [59] achieved 97% using a hybrid model trained on LUNA16 and LIDC-IDRI, and Ref. [89] achieved a remarkable 99.94% accuracy on LC25000 using a DL fusion strategy. Likewise, Refs. [54,56,71,72] all achieved accuracy levels above 95% using various public CT datasets.
While such results are technically encouraging, these public datasets lack demographic diversity. They often omit race, age, socioeconomic status, and geographic metadata, allowing potential biases to go undetected. These datasets are typically sourced from well-financed institutions and reflect homogeneous imaging protocols and limited clinical diversity rather than the general population.
In contrast, evaluations performed on private or hospital-specific data exposed the difficulties of applying ML/DL in real-world, heterogeneous clinical environments. For example, Ref. [35] used unenhanced CT scan radiomics features, where AUC varied from 74% to 88% across LC subtypes. Also, Ref. [47] used ML-based models to forecast LC invasiveness from private hospital CT scans (224 patients), reaching an AUC of 92% but without external validation. Furthermore, Ref. [37] built prognostic models on CT scans of only 59 patients, hampered by small sample size and demographic specificity. In addition, the study in [97] employed a 3D CNN for early-stage LC detection with a 76% AUC, far below typical public-dataset results. Ref. [34] evaluated a model on the Iraq-Oncology Teaching Hospital CT images, reporting 93.21% accuracy, but with a dataset tailored to one region and open questions regarding broader applicability. Furthermore, Ref. [98] proposed a hybrid DL model on private CT data with moderate precision, at risk of overfitting and bias from dataset characteristics. The reviewed studies, for instance [34,35,37,47,97,98], illustrate that variability in performance metrics is usually driven by specific forms of algorithmic bias rather than consistent diagnostic capability. In these studies, model performance depends on LC subtype imbalance, heterogeneous imaging characteristics, and hospital- or cohort-specific data distributions: for example, models trained on datasets dominated by a single LC subtype achieve their best accuracy on that subtype but show reduced or unstable performance on underrepresented subtypes. Differences in imaging protocols, image quality, and sample size further contribute to the performance variability observed across studies.
These findings suggest that high reported metrics may reflect dataset-specific advantages rather than robust generalization, identifying subtype bias and data heterogeneity as crucial barriers to the reliable clinical deployment of ML/DL models in LC diagnostics.
External studies provide further evidence of similar issues: Ref. [114] observed that ML models trained on SEER data for NSCLC patients made inconsistent survival predictions for non-Hispanic Black patients even though race/ethnicity was excluded as a variable. Ref. [115] reported that chest X-ray-based ML algorithms exhibited underdiagnosis biases that specifically affected certain population groups. Ref. [116] presented a detailed model explaining how annotation errors, demographic bias, and hospital-based imaging artifacts lead to systemic bias in DL-based LC detection.
Even some of the highest-performing models (e.g., those in [57,70,74,91]) were validated on homogeneous datasets devoid of demographic or clinical variability, which subjects them to dataset-induced bias when applied beyond their training environments.
To address these challenges, the following approaches are essential:
  • Training on demographically diverse, multi-institutional datasets to preserve population representativeness.
  • Evaluating fairness and standard performance metrics to monitor sensitivity, specificity, or AUC variation across subgroups.
  • Adopting interpretable model designs (e.g., attention maps, SHAP) so clinicians can inspect model behavior and detect bias.
  • Utilizing external validation pipelines, in which models trained on one dataset are tested on independent, demographically distinct cohorts.
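The second recommendation above, monitoring metric variation across subgroups, amounts to a simple stratified audit. The sketch below is a minimal illustration using synthetic scores and a hypothetical "site" attribute (both assumptions); a real audit would use held-out clinical predictions with true demographic or institutional metadata.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1000
group = rng.choice(["site_A", "site_B"], size=n)  # hypothetical subgroups
y_true = rng.integers(0, 2, size=n)

# Simulate a model that separates classes better at site_A than site_B.
noise = np.where(group == "site_A", 0.5, 1.5)
y_score = y_true + rng.normal(0.0, noise)

report = {}
for g in ["site_A", "site_B"]:
    mask = group == g
    report[g] = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{g}: AUC = {report[g]:.3f}")

gap = abs(report["site_A"] - report["site_B"])
print(f"subgroup AUC gap: {gap:.3f}")  # large gaps flag potential bias
```

The same loop extends directly to sensitivity and specificity per subgroup; a single aggregate AUC would have hidden the disparity entirely.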
In short, algorithmic bias in LC diagnosis is not solely a technical concern; it is also a clinical and social one. As shown by at least 15 studies in this review, performance tends to be overestimated on carefully curated public databases and underestimated in real-world settings. Ensuring fairness, explainability, and robustness in model design is imperative for the equitable deployment of ML/DL in healthcare applications.
Evidence from the reviewed literature indicates that variations in ML/DL model performance for LC diagnosis are largely driven by dataset curation, imbalance, and institutional heterogeneity, rather than by direct causal effects of demographic attributes themselves. Several studies report that models achieving very high accuracy on curated public datasets show marked performance degradation when evaluated on heterogeneous or private clinical cohorts, even in the absence of explicit demographic variables, highlighting the influence of subtype imbalance and imaging protocol differences [35,37,47,97,114,115]. Accordingly, equity-related concerns in lung cancer AI should be understood as data- and evaluation-driven limitations, emphasizing the need for stratified validation, transparent reporting, and diverse clinical datasets, rather than as established causal relationships between demographic factors and model discrimination [116].

4.2. Impact of ML/DL on Ethical and Societal Dimensions

The combination of ML and DL in LC detection has also quickly yielded significant advances in diagnostic effectiveness and accuracy. However, such technological progress must be matched by equally thoughtful attention to ethical and societal implications, which determine the acceptability, fairness, and accountability of ML/DL medical decisions. As echoed across this review, several studies raise data privacy, patient autonomy, explainability, and health equity issues that call for a multilateral response to responsible ML/DL development.
  • Data privacy and ethical use of patient information. ML and DL models require large datasets for training and validation, which often include sensitive clinical data. Several studies [54,59,60,89] used publicly available CT and histopathological datasets such as LUNA16, LIDC-IDRI, and LC25000. Although these datasets were anonymized before release, they originally came from hospital environments, which raises questions about whether patients were fully informed and if their data can continue to be used for secondary research. Studies [117,118] emphasize the importance of Institutional Review Board (IRB) approval and recommend privacy-preserving methods such as federated learning. This technique allows training models across multiple sites without sharing raw data, helping balance data utility with confidentiality.
  • Respecting patient autonomy and informed consent. Most ML/DL systems reviewed operate as black-box models with little opportunity for patient involvement, which may limit autonomy. For instance, in [47,97], models were trained using private clinical data without clarification on whether patients were notified about the use of their information in downstream AI applications. Another study [119] reports that many patients do not realize ML/DL systems are involved in their care, raising concerns about transparency. Future solutions should clearly inform patients, give them the option to decline or question AI-based decisions, and promote shared decision-making.
  • Human oversight and model explainability in clinical decision-making. Although models such as [59,60,89] achieved very high accuracy (up to 99.94%), their use in clinical diagnosis requires more than just performance. In high-risk settings, clinicians must understand how the model reaches its conclusions. Studies [70,93,94] on integrated methods such as Grad-CAM, SHAP, attention-weight analysis, and attention-based visualization help radiologists interpret results, which is essential for building trust and ensuring patient safety. As highlighted in [120], model transparency should be considered not only a technical requirement but also an ethical one. Even so, current explainability methods—whether based on images or model features—are still limited and cannot explain true cause and effect relationships. For this reason, explainability in clinical ML/DL should be seen as a way to support clinician understanding and trust, rather than a replacement for clinical judgment or medical reasoning.
  • Algorithmic bias and health equity. As discussed in Section 4.1, several studies ([34,35,37,47,97,98]) that relied on private datasets with limited demographic diversity reported lower AUC scores (74–93%), raising concerns about model fairness and generalizability. In addition, Refs. [114,115,116] found evidence of racial and demographic bias in cancer prediction models, showing how ML/DL systems can unintentionally disadvantage underrepresented groups. These findings highlight the need to ensure diverse training data and routinely evaluate model outcomes across different population groups, particularly in multiethnic clinical environments.
  • Accountability and responsibility in ML/DL-based diagnosis. Implementing ML/DL systems in multimodal or real-time diagnostic workflows (e.g., [92,95,96]) introduces legal and ethical challenges, especially regarding liability in cases of misdiagnosis. When models are trained on complex multimodal datasets, the responsibility between developers, clinicians, and healthcare institutions becomes unclear [121]. This issue is further complicated by the use of black-box architectures (such as the 3D CNN in [97] or hybrid DL models in [98]), reinforcing the need for traceability and reliable post hoc error analysis.
  • Societal trust and public acceptance. Long-term adoption of ML/DL tools in healthcare relies heavily on building societal and clinical trust. Models developed with transparent architectures and explainability features, such as in [93,94], are more likely to gain clinician support. In contrast, highly accurate yet non-transparent systems like those in [89] or [91], may face resistance if users do not understand their decision-making process. As argued in [122], earning trust requires not just high accuracy, but also fairness, accountability, and transparency.
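The federated learning approach recommended above for privacy preservation can be sketched with federated averaging (FedAvg) on a toy linear model: each "hospital" trains locally, and only model weights, never raw patient data, travel to a central server for weighted averaging. Everything here is illustrative, including the site sizes and the NumPy-only model; production systems would use a framework and secure aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])  # ground-truth weights for the toy task

def local_data(n):
    """Generate one site's private cohort (never shared)."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(0.0, 0.1, size=n)
    return X, y

# Three "hospitals" with different cohort sizes; raw data stays local.
sites = [local_data(n) for n in (50, 80, 120)]

def local_step(w, X, y, lr=0.1, epochs=20):
    """Local gradient descent on one site's data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(2)
for _ in range(5):  # communication rounds
    local_ws, sizes = [], []
    for X, y in sites:
        local_ws.append(local_step(w_global.copy(), X, y))
        sizes.append(len(y))
    # Server aggregates: size-weighted average of local weights only.
    w_global = np.average(local_ws, axis=0, weights=sizes)

print("global weights:", np.round(w_global, 2))
```

The global model converges close to the true weights even though no site ever exposes its patients' records, which is the privacy property the reviewed studies cite as the main appeal of federated training.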
The ethical and social implications of ML/DL in LC diagnosis cannot be dissociated from these systems' technical innovation and clinical use. Worldwide guidelines, such as those issued by the World Health Organization [102], the European Commission [101], and the National Institute of Standards and Technology (NIST) [122], highlight the importance of ethical governance, accountability, and human oversight in clinical AI systems. As shown across the reviewed studies, ML/DL-based models can be highly accurate in diagnosis; however, such accuracy can come at the expense of equity, transparency, and accountability. To ensure that these technologies benefit patients and public health, future work should concentrate on ethically sourced and demographically representative data, open consent practices, interpretable model structures, systematic bias auditing, and well-established frameworks for clinical responsibility and accountability. Embedding these ethical principles and regulations at every development and deployment stage is required to turn ML and DL systems from experimental tools into reliable, equitable, and clinically trustworthy solutions for LC detection and diagnosis.

5. Critical Analysis and Recommendations

This section provides a roadmap for designing clinic-ready ML/DL models and practical steps for stakeholders (researchers, clinicians, and policymakers). Moreover, it synthesizes the findings for RQ1–RQ5 from the previous sections through a critical analysis of the obstacles that hinder the transition to real-world clinical application, and it proposes recommendations to bridge the gap between ML and DL research and clinical practice. Table 13 illustrates the checklist.
The development from research-oriented ML/DL algorithms to clinically meaningful LC detection and diagnosis devices involves a rigorous, fair, trustworthy, and regulatory-compliant process. As shown in Figure 11, the process begins with (1) Data Collection and Curation, where heterogeneous, high-quality imaging datasets are obtained. This involves demographic representation across variables such as age, gender, and ethnicity to maximize generalizability. Data must also be anonymized in accordance with regulatory frameworks such as HIPAA and GDPR. Following data collection, (2) Preprocessing and Artifact Reduction is performed to normalize, denoise, and enhance the quality of the imaging. Artifact removal in this step is critical for reducing noise bias and improving diagnostic accuracy. Then, in (3) Model Development, the ML/DL models are trained on the cleaned data. Predictive performance is not the sole concern here; model interpretability and fairness are also important considerations. Techniques such as SHAP values and fairness constraints are built in to expose model decision-making processes and to correct for bias, particularly in underrepresented subpopulations. (4) Validation and Robustness Testing involves systematic testing of model performance against internal and external datasets to confirm that the developed models not only work effectively but also generalize across varied clinical populations. Geographically or demographically stratified cross-validation increases model robustness and reduces the risk of overfitting. The subsequent stage, (5) Clinical Integration and Regulatory Approval, concerns making the models operational within existing clinical infrastructures such as hospital information systems (HIS), electronic health records (EHRs), and Picture Archiving and Communication Systems (PACS). Regulatory clearances from the U.S. Food and Drug Administration (FDA) or the European Medicines Agency (EMA) must be obtained.
Both retrospective and prospective clinical trials may be required to validate safety and efficacy in the real world. After regulatory clearance, (6) Deployment and Continuous Monitoring is initiated. This phase includes deploying ML/DL models in actual clinical workflows and then continuously monitoring their performance in real time. Model retraining or fine-tuning is performed on the basis of new patient data to keep pace with evolving data patterns, ensuring sustained performance and clinical value over time. Finally, (7) Ethical Oversight and Explainability is necessary to facilitate responsible and transparent implementation. Explainable AI (XAI) tools must be used to enable clinicians to understand and trust model output. Institution-level ethics committees must also be established to govern ongoing ML/DL oversight, ensuring that issues of data misuse, algorithmic bias, and patient consent are addressed proactively.
Together, these seven steps provide a complete roadmap for translating ML/DL breakthroughs into efficient, safe, and ethically sound clinical tools for the diagnosis of lung cancer.
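The continuous-monitoring phase of the roadmap above can be sketched in a few lines. The following minimal Python example is an illustration, not a method from any reviewed system: the class name, window size, and alert threshold are assumptions. It tracks a deployed detector's sensitivity over the most recent confirmed-positive cases and flags when performance degrades enough to warrant clinical review or retraining.

```python
from collections import deque


class SensitivityMonitor:
    """Illustrative rolling monitor for a deployed nodule detector.

    Tracks sensitivity (recall on confirmed-positive cases) over the most
    recent `window` cases and raises an alert when it falls below
    `threshold`. All names and default values are assumptions made for
    illustration only.
    """

    def __init__(self, window=200, threshold=0.90):
        self.threshold = threshold
        # One boolean per confirmed-positive case: did the model flag it?
        self.positives = deque(maxlen=window)

    def record(self, model_flagged: bool) -> None:
        """Log one confirmed-positive case and whether the model flagged it."""
        self.positives.append(bool(model_flagged))

    def sensitivity(self) -> float:
        if not self.positives:
            return 1.0  # no evidence of degradation yet
        return sum(self.positives) / len(self.positives)

    def needs_review(self) -> bool:
        """True once the window is full and recent sensitivity is below threshold."""
        return (len(self.positives) == self.positives.maxlen
                and self.sensitivity() < self.threshold)


# Synthetic stream: 150 detected positives followed by 60 missed ones.
# The window keeps the last 200 cases, so sensitivity drops to 140/200 = 0.7.
m = SensitivityMonitor(window=200, threshold=0.90)
for _ in range(150):
    m.record(True)
for _ in range(60):
    m.record(False)
print(round(m.sensitivity(), 3), m.needs_review())  # → 0.7 True
```

In a real deployment, the same alert would trigger the retraining or fine-tuning step described above, with the retrained model re-entering validation before redeployment.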

6. Conclusions

This systematic review examines the transformative potential of ML/DL models in LC detection and classification while rigorously exploring the technical, clinical, and ethical challenges limiting real-world deployment. While models frequently achieve over 95% accuracy on standard benchmark datasets, they often fall short when confronted with clinical diversity due to domain shift and overfitting to controlled datasets such as LIDC-IDRI or LUNA16. Most of the studies reviewed rely on CT imaging, as it is the main tool used in LC screening and is supported by the wide availability of public CT datasets. Technically, the difficulties are performance deterioration on out-of-sample validation and limited generalizability to uncurated data. Clinically, adoption of ML/DL is constrained by regulatory hurdles, scarce multi-center validation, and low interpretability, reinforcing clinician resistance. Ethically, algorithmic bias and demographic variation in model predictions raise concerns around fairness and equity. Emerging approaches—human-in-the-loop solutions, lightweight designs, synthetic data creation, and federated learning—show promise for easing these issues by enhancing transparency, personalization, and privacy preservation. Moreover, the review tackled an important issue: the need for robust evaluation frameworks that prioritize fairness audits, external reproducibility, and integration with real-world clinical procedures. Our findings also show that the dominance of a few public datasets has led to model bias, data skewness, and poor validation in heterogeneous populations, undermining clinical transferability. The review identifies LIDC-IDRI as the most utilized dataset in the ML/DL literature for LC detection and classification, followed by LUNA16 and LC25000.
In summary, bridging the gap between technical novelty and clinical feasibility requires collaboration among policymakers, clinicians, and AI researchers. For future work, developing a generalizable framework to link ML/DL novelty with clinical impact, institutionalizing bias audits, and conducting longitudinal research to track ML/DL’s impact on survival rates and clinical processes are of utmost importance.

Answers to Research Questions

This section returns to the five research questions (RQs) that are listed at the start of this review. Each question is directly answered from results reported in previous sections.
  • RQ1: What technical, clinical, and societal barriers prevent high-performing ML/DL models from being deployed in LC screening workflows, limiting their widespread clinical adoption?
Although they achieve high accuracy on benchmark datasets, ML/DL models face a range of implementation challenges. Technically, overfitting to manually curated data, performance degradation on raw clinical inputs, and domain shift reduce their reliability for real-world use. Clinically, limited multi-center validation, regulatory delays, and low interpretability impede clinician uptake. Societally, concerns around algorithmic bias, demographic disparities, and model opacity generate ethical issues.
  • RQ2: How might new models, such as human-in-the-loop systems, lightweight architectures, and synthetic data, handle these challenges?
Emerging strategies attempt to mitigate the aforementioned limitations. Human-in-the-loop systems bring expert oversight into model decision-making, making models more trustworthy and safe. Lightweight models are designed to run in real time in resource-limited environments and support edge deployment. Synthetic data and federated learning offer privacy-preserving mechanisms and help expand training diversity, which supports model generalizability.
  • RQ3: What evaluation structures are required to ensure ML/DL tools are equitable, repeatable, and clinically impactful?
Improved evaluation frameworks should incorporate fairness audits, external validation across diverse populations, reproducibility assessments, and interpretability metrics. Explainability techniques (e.g., SHAP, Grad-CAM, attention-weight analysis) are crucial but still only sporadically applied. Figure 11 illustrates a complete roadmap with emphasis on ethical oversight, clinical adoption, and post-deployment monitoring to ensure ongoing performance and safety.
  • RQ4: How does the heavy reliance on public datasets (e.g., LIDC-IDRI, LUNA16) in LC ML and DL research contribute to dataset imbalance, and what obstacles does this create for model generalizability and clinical applicability in real-world scenarios?
Over-reliance on hand-curated public datasets such as LIDC-IDRI, LUNA16, and LC25000 has led to data imbalance and limited model exposure to real-world diversity. As a result, most models lack robustness on heterogeneous clinical data, weakening external validation and diminishing practitioners’ confidence. This reliance encourages biased learning and hinders the transfer of research findings to clinical practice.
  • RQ5: Which public datasets have been most frequently used in previous studies for LC detection and classification using ML and DL techniques?
LIDC-IDRI is employed most frequently in the reviewed studies, followed by LUNA16 and LC25000 (Figure 6). These datasets provide standardized benchmarks for comparison but fall short of real-world clinical richness. They have shaped the course of LC research to date, but they also underscore the need for larger, more representative datasets.
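A fairness audit of the kind called for in RQ3 can start from something as simple as per-subgroup sensitivity. The sketch below uses entirely synthetic records and an arbitrary grouping label; it is a hedged illustration of the auditing idea, not a procedure taken from any reviewed study.

```python
def subgroup_sensitivity(records):
    """Per-subgroup sensitivity from (group, y_true, y_pred) triples.

    Illustrative fairness-audit helper: counts, for each subgroup, how many
    true-positive cases the model flagged out of all positive cases. The
    group labels and data below are synthetic, for demonstration only.
    """
    tp, pos = {}, {}
    for group, y_true, y_pred in records:
        if y_true == 1:  # only positive cases contribute to sensitivity
            pos[group] = pos.get(group, 0) + 1
            if y_pred == 1:
                tp[group] = tp.get(group, 0) + 1
    return {g: tp.get(g, 0) / n for g, n in pos.items()}


# Synthetic audit: two demographic subgroups with unequal detection rates.
records = (
    [("A", 1, 1)] * 90 + [("A", 1, 0)] * 10    # group A: 90/100 detected
    + [("B", 1, 1)] * 70 + [("B", 1, 0)] * 30  # group B: 70/100 detected
)
sens = subgroup_sensitivity(records)
gap = max(sens.values()) - min(sens.values())
print(sens, round(gap, 2))  # a 0.20 sensitivity gap would fail a parity check
```

In practice, the same disaggregated computation would be run on held-out clinical data stratified by age, sex, or ethnicity, with the gaps reported alongside overall accuracy rather than hidden inside it.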

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7010023/s1, Table S1: PRISMA 2020 Checklist [6].

Author Contributions

M.A.: Conceptualization, data curation, formal analysis, investigation, methodology, resources, software, validation, visualization, writing—original draft, and writing—review and editing. E.C.-M.: Conceptualization, investigation, supervision, validation, visualization, and writing—review and editing. S.G.-M.: Conceptualization, funding acquisition, investigation, supervision, validation, visualization, and writing—review and editing. A.Y.O.: Conceptualization, investigation, validation, visualization, and writing—review and editing. M.O.: Conceptualization, investigation, validation, visualization, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are openly available in the literature.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2021, 71, 209–249. [Google Scholar] [CrossRef]
  2. Mapanga, W.; Norris, S.A.; Chen, W.C.; Blanchard, C.; Graham, A.; Baldwin-Ragaven, L.; Boyles, T.; Donde, B.; Greef, L.; Huddle, K.; et al. Consensus study on the health system and patient-related barriers for lung cancer management in South Africa. PLoS ONE 2021, 16, e0246716. [Google Scholar] [CrossRef]
  3. Collins, L.G.; Haines, C.; Perkel, R.; Enck, R.E. Lung cancer: Diagnosis and management. Am. Fam. Physician 2007, 75, 56–63. [Google Scholar]
  4. Kumar, V.; Prabha, C.; Sharma, P.; Mittal, N.; Askar, S.S.; Abouhawwash, M. Unified deep learning models for enhanced lung cancer prediction with ResNet-50–101 and EfficientNet-B3 using DICOM images. BMC Med. Imaging 2024, 24, 63. [Google Scholar] [CrossRef]
  5. Yadlapalli, S.; Bhavana, P.; Gunnam, K. Intelligent classification of lung malignancies using deep learning techniques. Int. J. Intell. Comput. Cybern. 2021, 14, 1147–1162. [Google Scholar] [CrossRef]
  6. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Int. J. Surg. 2021, 88, 105906. [Google Scholar] [CrossRef] [PubMed]
  7. Barghouthi, E.D.; Owda, A.Y.; Asia, M.; Owda, M. Systematic Review for Risks of Pressure Injury and Prediction Models Using Machine Learning Algorithms. Diagnostics 2023, 13, 2739. [Google Scholar] [CrossRef] [PubMed]
  8. García-Méndez, S.; de Arriba-Pérez, F.; del Carmen, M. A Review on the Use of Large Language Models as Virtual Tutors. Sci. Educ. 2025, 34, 877–892. [Google Scholar] [CrossRef]
  9. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar] [CrossRef]
  10. Armato, S.G., III; McLennan, G.; Bidaut, L.; McNitt-Gray, M.F.; Meyer, C.R.; Reeves, A.P.; Zhao, B.; Aberle, D.R.; Henschke, C.I.; Hoffman, E.A.; et al. LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative) [Data set]. The Cancer Imaging Archive (TCIA), 2015. Available online: https://www.cancerimagingarchive.net/collection/lidc-idri/ (accessed on 1 January 2026). [CrossRef]
  11. National Cancer Institute. National Lung Screening Trial (NLST)—Cancer Data Access System (CDAS); National Cancer Institute: Bethesda, MD, USA, 2024. Available online: https://cdas.cancer.gov/nlst/ (accessed on 1 January 2026).
  12. LUNA16: Lung Nodule Analysis 2016. Available online: https://www.kaggle.com/datasets/fanbyprinciple/luna-lung-cancer-dataset (accessed on 1 January 2026).
  13. Benhammou, Y.; Acharya, U.; Mittal, M.; Javed, S.; Hage-Ali, A. LC25000: Lung and Colon Cancer Histopathological Image Dataset. 2020. Available online: https://academictorrents.com/details/7a638ed187a6180fd6e464b3666a6ea0499af4af (accessed on 1 January 2026).
  14. Aerts, H.J.W.L.; Velazquez, E.R.; Leijenaar, R.T.H.; Parmar, C.; Grossmann, P.; Carvalho, S.; Bussink, J.; Monshouwer, R.; Haibe-Kains, B.; Rietveld, D.; et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 2014, 5, 4006. [Google Scholar] [CrossRef]
  15. Zhao, B.; Tan, Y.; Tsai, W.-Y.; Schwartz, L.H.; Lu, L. Lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans. Med. Phys. 2009, 36, 4400. [Google Scholar]
  16. Gevaert, O.; Xu, J.; Hoang, C.D.; Leung, A.N.; Xu, Y.; Quon, A.; Rubin, D.L.; Napel, S.; Plevritis, S.K. Non–small cell lung cancer: Identifying prognostic imaging biomarkers by leveraging public gene expression microarray data—Methods and preliminary results. Radiology 2012, 264, 387–396. [Google Scholar] [CrossRef]
  17. Park, B.W.; Kim, J.K.; Heo, C.; Park, K.J. Reliability of CT radiomic features reflecting tumour heterogeneity according to image quality and image processing parameters. Sci. Rep. 2020, 10, 3852. [Google Scholar] [CrossRef]
  18. Van Timmeren, J.E.; Cester, D.; Tanadini-Lang, S.; Alkadhi, H.; Baessler, B. Radiomics in medical imaging—‘how-to’ guide and critical reflection. Insights Imaging 2020, 11, 91. [Google Scholar] [CrossRef] [PubMed]
  19. Duke Lung Cancer Screening Dataset 2024 (DLCS). 2024. Available online: https://doi.org/10.5281/zenodo.13799069 (accessed on 1 January 2026).
  20. IQ-OTH/NCCD: Lung Cancer Dataset. Available online: https://www.kaggle.com/datasets/hamdallak/the-iqothnccd-lung-cancer-dataset (accessed on 26 July 2025).
  21. Chest CT-Scan Images Dataset. Available online: https://www.kaggle.com/datasets/mohamedhanyyy/chest-ctscan-images (accessed on 26 July 2025).
  22. Lung-PET-CT-Dx: PET/CT Lung Cancer Diagnosis Dataset. Available online: https://www.cancerimagingarchive.net/collection/lung-pet-ct-dx/ (accessed on 1 January 2026).
  23. Alliance, L.C. Early Lung Cancer Detection—The Lifesaving Scan Many Smokers Skip. 2023. Available online: https://wgntv.com/news/medical-watch/early-lung-cancer-detection-the-lifesaving-scan-many-smokers-skip/ (accessed on 1 January 2026).
  24. Armato, S.G., III; Hadjiiski, L.; Tourassi, G.D.; Drukker, K.; Giger, M.L.; Li, F.; Redmond, G.; Farahani, K.; Kirby, J.S.; Clarke, L.P. SPIE-AAPM-NCI Lung Nodule Classification Challenge Dataset. 2015. Available online: https://doi.org/10.7937/K9/TCIA.2015.A6V7JIWX (accessed on 1 January 2026).
  25. Chest X-Ray Images (Normal and Pneumonia). 2019. Available online: https://cir.nii.ac.jp/crid/1881428067966525440 (accessed on 1 January 2026).
  26. Lung Tumor Segmentation Dataset (CT Scan). Available online: https://www.kaggle.com/datasets/rasoulisaeid/lung-cancer-segment/data (accessed on 26 July 2025).
  27. Lung Nodule Segmentation Study Dataset. 2024. Available online: https://universe.roboflow.com/varun-18tlk/lung-nodule-segmentation-study/dataset/3 (accessed on 1 January 2026).
  28. Parmar, C.; Grossmann, P.; Bussink, J.; Lambin, P.; Aerts, H.J.W.L. Machine learning methods for quantitative radiomic biomarkers. Sci. Rep. 2015, 5, 13087. [Google Scholar] [CrossRef]
  29. Aerts, H.J.W.L.; Leijenaar, R.; Hoebers, F.; Dekker, A.; Lambin, P. NSCLC-Radiomics-Genomics; The Cancer Imaging Archive: Little Rock, AR, USA, 2014. [Google Scholar]
  30. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): NSCLC-Radiogenomics Collection. 2013. Available online: https://www.cancerimagingarchive.net/collection/nsclc-radiogenomics/ (accessed on 1 January 2026).
  31. Chaudhry, H.A.H.; Renzulli, R.; Perlo, D.; Santinelli, F.; Tibaldi, S.; Cristiano, C.; Grosso, M.; Limerutti, G.; Fiandrotti, A.; Grangetto, M.; et al. Unitochest: A lung image dataset for segmentation of cancerous nodules on ct scans. In Proceedings of the International Conference on Image Analysis and Processing, Lecce, Italy, 23–27 May 2022; pp. 185–196. [Google Scholar]
  32. SIMBA Lung Nodule Database. Available online: http://www.via.cornell.edu/lungdb.html (accessed on 1 January 2026).
  33. Marchetti, M.; Barbu, A.; Kalra, M.K.; Dreyer, K.J.; Rubin, D.L. Multi-stage intermediate fusion for multimodal learning to classify lung cancer histology from CT and PET scans. arXiv 2020, arXiv:2012.00318. [Google Scholar]
  34. Mohamed, T.I.A.; Oyelade, O.N.; Ezugwu, A.E. Automatic detection and classification of lung cancer CT scans based on deep learning and ebola optimization search algorithm. PLoS ONE 2023, 18, e0285796. [Google Scholar] [CrossRef] [PubMed]
  35. Huang, J.; He, W.; Xu, H.; Yang, S.; Dai, J.; Guo, W.; Zeng, M. Evaluating histological subtypes classification of primary lung cancers on unenhanced computed tomography based on random Forest model. J. Healthc. Eng. 2023, 2023, 8964676. [Google Scholar] [CrossRef]
  36. Memorial Sloan Kettering Cancer Center (MSKCC). “Lung Adenocarcinoma (MSKCC, 2020): Targeted Sequencing of 604 Lung Adenocarcinoma Tumor/Normal Pairs via MSK-IMPACT [Data set],” 2020. Available online: https://www.cbioportal.org/study/summary?id=luad_mskcc_2020 (accessed on 1 January 2026).
  37. Shayesteh, S.P.; Shiri, I.; Karami, A.H.; Hashemian, R.; Kooranifar, S.; Ghaznavi, H.; Shakeri-Zadeh, A. Predicting lung cancer patients’ survival time via logistic regression-based models in a quantitative radiomic framework. J. Biomed. Phys. Eng. 2020, 10, 479. [Google Scholar] [CrossRef]
  38. Nissar, A.; Mir, A.H. Predictive Radiomics Based Ensemble Machine Learning Approach in CT Lung Nodule Diagnosis. J. Egypt. Natl. Cancer Inst. 2025, 37, 68. [Google Scholar] [CrossRef]
  39. Shafi, I.; Din, S.; Khan, A.; Díez, I.D.L.T.; Casanova, R.d.J.P.; Pifarre, K.T.; Ashraf, I. An Effective Method for Lung Cancer Diagnosis from CT Scan Using Deep Learning-Based Support Vector Network. Cancers 2022, 14, 5457. [Google Scholar] [CrossRef] [PubMed]
  40. Naseer, I.; Masood, T.; Akram, S.; Jaffar, A.; Rashid, M.; Iqbal, M.A. Lung Cancer Detection Using Modified AlexNet Architecture and Support Vector Machine. Comput. Mater. Contin. 2023, 74, 2039–2054. [Google Scholar] [CrossRef]
  41. Katar, O.; Yildirim, O.; Tan, R.-S.; Acharya, U.R. A Novel Hybrid Model for Automatic Non-Small Cell Lung Cancer Classification Using Histopathological Images. Diagnostics 2024, 14, 2497. [Google Scholar] [CrossRef]
  42. Ramkumar, K.; Natarajan, M. Performance Analysis For Detection And Classification Of Lung Cancer Using Machine Learning Approaches. Educ. Adm. Theory Pract. 2024, 30, 2174–2181. [Google Scholar]
  43. Gharaibeh, N.Y.; De Fazio, R.; Al-Naami, B.; Al-Hinnawi, A.-R.; Visconti, P. Automated Lung Cancer Diagnosis Applying Butterworth Filtering, Bi-Level Feature Extraction, and Sparce Convolutional Neural Network to Luna 16 CT Images. J. Imaging 2024, 10, 168. [Google Scholar] [CrossRef]
  44. Dunn, B.; Pierobon, M.; Wei, Q. Automated classification of lung cancer subtypes using deep learning and CT-scan based radiomic analysis. Bioengineering 2023, 10, 690. [Google Scholar] [CrossRef]
  45. Bouamrane, A.; Derdour, M. Enhancing Lung Cancer Detection and Classification Using Machine Learning and Deep Learning Techniques: A Comparative Study. In Proceedings of the 2023 International Conference on Networking and Advanced Systems (ICNAS), Algiers, Algeria, 21–23 October 2023; pp. 1–6. [Google Scholar] [CrossRef]
  46. Sun, H.; Zhang, C.; Ouyang, A.; Dai, Z.; Song, P.; Yao, J. Multi-classification model incorporating radiomics and clinic-radiological features for predicting invasiveness and differentiation of pulmonary adenocarcinoma nodules. Biomed. Eng. Online 2023, 22, 112. [Google Scholar] [CrossRef]
  47. Zhao, F.-H.; Fan, H.-J.; Shan, K.-F.; Zhou, L.; Pang, Z.-Z.; Fu, C.-L.; Yang, Z.-B.; Wu, M.-K.; Sun, J.-H.; Yang, X.-M.; et al. Predictive efficacy of a radiomics random forest model for identifying pathological subtypes of lung adenocarcinoma presenting as ground-glass nodules. Front. Oncol. 2022, 12, 872503. [Google Scholar] [CrossRef]
  48. Mishra, A.; Gangwar, S. Lung cancer detection and classification using machine learning algorithms. Int. J. Recent. Innov. Trends Comput. Commun 2023, 11, 277–282. [Google Scholar] [CrossRef]
  49. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  50. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. Available online: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 1 January 2026).
  51. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef]
  52. Samek, W. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv 2017, arXiv:1708.08296. [Google Scholar] [CrossRef]
  53. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  54. Wankhade, S.; Vigneshwari, S. A novel hybrid deep learning method for early detection of lung cancer using neural networks. Healthc. Anal. 2023, 3, 100195. [Google Scholar] [CrossRef]
  55. Reddy, N.S.; Khanaa, V. Intelligent deep learning algorithm for lung cancer detection and classification. Bull. Electr. Eng. Informatics 2023, 12, 1747–1754. [Google Scholar] [CrossRef]
  56. Shah, A.A.; Malik, H.A.M.; Muhammad, A.; Alourani, A.; Butt, Z.A. Deep learning ensemble 2D CNN approach towards the detection of lung cancer. Sci. Rep. 2023, 13, 2987. [Google Scholar] [CrossRef]
  57. Bhatia, I.; Aarti; Ansarullah, S.I.; Amin, F.; Alabrah, A. Lightweight Advanced Deep Neural Network (DNN) Model for Early-Stage Lung Cancer Detection. Diagnostics 2024, 14, 2356. [Google Scholar] [CrossRef]
  58. Lee, J.-D.; Hsu, Y.-T.; Chien, J.-C. Study of a Deep Convolution Network with Enhanced Region Proposal Network in the Detection of Cancerous Lung Tumors. Bioengineering 2024, 11, 511. [Google Scholar] [CrossRef]
  59. Srivastava, D.; Srivastava, S.K.; Khan, S.B.; Singh, H.R.; Maakar, S.K.; Agarwal, A.K.; Malibari, A.A.; Albalawi, E. Early Detection of Lung Nodules Using a Revolutionized Deep Learning Model. Diagnostics 2023, 13, 3485. [Google Scholar] [CrossRef]
  60. Mohandass, G.; Krishnan, G.H.; Selvaraj, D.; Sridhathan, C. Lung Cancer Classification using Optimized Attention-based Convolutional Neural Network with DenseNet-201 Transfer Learning Model on CT image. Biomed. Signal Process. Control 2024, 95, 106330. [Google Scholar] [CrossRef]
  61. Sivasankaran, P.; Dhanaraj, K.R. Lung Cancer Detection Using Image Processing Technique Through Deep Learning Algorithm. Rev. d’Intelligence Artif. 2024, 38, 297–302. [Google Scholar] [CrossRef]
  62. Hua, K.-L.; Hsu, C.-H.; Hidayati, S.C.; Cheng, W.-H.; Chen, Y.-J. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther. 2015, 8, 2015–2022. [Google Scholar] [CrossRef] [PubMed]
  63. Shetty, M.V.; Jayadevappa, D.; Tunga, S. Optimized Deformable Model-based Segmentation and Deep Learning for Lung Cancer Classification. J. Med. Investig. 2022, 69, 244–255. [Google Scholar] [CrossRef]
  64. Dong, Y.; Li, X.; Yang, Y.; Wang, M.; Gao, B. A Synthesizing Semantic Characteristics Lung Nodules Classification Method Based on 3D Convolutional Neural Network. Bioengineering 2023, 10, 1245. [Google Scholar] [CrossRef] [PubMed]
  65. Wu, K.; Peng, B.; Zhai, D. Multi-Granularity Dilated Transformer for Lung Nodule Classification via Local Focus Scheme. Appl. Sci. 2023, 13, 377. [Google Scholar] [CrossRef]
  66. Wang, Y.; Zhou, C.; Ying, L.; Chan, H.-P.; Lee, E.; Chughtai, A.; Hadjiiski, L.M.; Kazerooni, E.A. Enhancing Early Lung Cancer Diagnosis: Predicting Lung Nodule Progression in Follow-Up Low-Dose CT Scan with Deep Generative Model. Cancers 2024, 16, 2229. [Google Scholar] [CrossRef]
  67. Khademi, S.; Heidarian, S.; Afshar, P.; Naderkhani, F.; Oikonomou, A.; Plataniotis, K.N.; Mohammadi, A. Spatio-Temporal Hybrid Fusion of CAE and SWin Transformers for Lung Cancer Malignancy Prediction. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  68. Katase, S.; Ichinose, A.; Hayashi, M.; Watanabe, M.; Chin, K.; Takeshita, Y.; Shiga, H.; Tateishi, H.; Onozawa, S.; Shirakawa, Y.; et al. Development and performance evaluation of a deep learning lung nodule detection system. BMC Med. Imaging 2022, 22, 203. [Google Scholar] [CrossRef]
  69. Tang, T.; Zhang, R. A Multi-Task Model for Pulmonary Nodule Segmentation and Classification. J. Imaging 2024, 10, 234. [Google Scholar] [CrossRef]
  70. Bilal, A.; Shafiq, M.; Fang, F.; Waqar, M.; Ullah, I.; Ghadi, Y.Y.; Long, H.; Zeng, R. IGWO-IVNet3: DL-Based Automatic Diagnosis of Lung Nodules Using an Improved Gray Wolf Optimization and InceptionNet-V3. Sensors 2022, 22, 9603. [Google Scholar] [CrossRef]
  71. Swaroop, S.; Sharma, S.; Janarthanan, S. Lung Cancer Classification and Through Deep Learning Model and Localization of Tumor. In Proceedings of the 2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 17–18 March 2023; pp. 1530–1534. [Google Scholar] [CrossRef]
  72. Oumlaz, M.; Oumlaz, Y.; Oukaira, A.; Benelhaouare, A.Z.; Lakhssassi, A. Advancing Pulmonary Nodule Detection with ARSGNet: EfficientNet and Transformer Synergy. Electronics 2024, 13, 4369. [Google Scholar] [CrossRef]
  73. Mathai, T.S.; Hou, B.; Summers, R.M. Longitudinal assessment of lung lesion burden in CT. In Proceedings of the Medical Imaging 2025: Computer-Aided Diagnosis, London, UK, 19–21 November 2025; pp. 669–678. [Google Scholar] [CrossRef]
  74. Kalkan, M.; Guzel, M.S.; Ekinci, F.; Sezer, E.A.; Asuroglu, T. Comparative Analysis of Deep Learning Methods on CT Images for Lung Cancer Specification. Cancers 2024, 16, 3321. [Google Scholar] [CrossRef]
  75. Saha, A.; Ganie, S.M.; Pramanik, P.K.D.; Yadav, R.K.; Mallik, S.; Zhao, Z. VER-Net: A hybrid transfer learning model for lung cancer detection using CT scan images. BMC Med. Imaging 2024, 24, 120. [Google Scholar] [CrossRef]
  76. Atiya, S.U.; Ramesh, N.V.K.; Reddy, B.N.K. Classification of non-small cell lung cancers using deep convolutional neural networks. Multimed. Tools Appl. 2024, 83, 13261–13290. [Google Scholar] [CrossRef]
  77. Nafea, A.A.; Ibrahim, M.S.; Shwaysh, M.M.; Abdul-Kadhim, K.; Almamoori, H.R.; AL-Ani, M.M. A Deep Learning Algorithm for Lung Cancer Detection Using EfficientNet-B3. Wasit J. Comput. Math. Sci. 2023, 2, 68–76. [Google Scholar] [CrossRef]
  78. Abumohsen, M.; Costa-Montenegro, E.; García-Méndez, S.; Owda, A.Y.; Owda, M. Advanced Deep Learning Techniques for Accurate Lung Cancer Detection and Classification. In Proceedings of the 2025 12th International Conference on Information Technology (ICIT), Amman, Jordan, 27–30 May 2025; pp. 7–12. [Google Scholar] [CrossRef]
  79. Santone, A.; Mercaldo, F.; Brunese, L. A Method for Real-Time Lung Nodule Instance Segmentation Using Deep Learning. Life 2024, 14, 1192. [Google Scholar] [CrossRef]
  80. Abunajm, S.; Elsayed, N.; ElSayed, Z.; Ozer, M. Deep learning approach for early stage lung cancer detection. arXiv 2023, arXiv:2302.02456. [Google Scholar] [CrossRef]
  81. Swain, A.K.; Swetapadma, A.; Rout, J.K.; Balabantaray, B.K. Classification of non-small cell lung cancer types using sparse deep neural network features. Biomed. Signal Process. Control 2024, 87, 105485. [Google Scholar] [CrossRef]
  82. Chen, H.; Zhang, X. HLFSRNN-MIL: A Hybrid Multi-Instance Learning Model for 3D CT Image Classification. Appl. Sci. 2024, 14, 6186. [Google Scholar] [CrossRef]
  83. Alazwari, S.; Alsamri, J.; Asiri, M.M.; Maashi, M.; Asklany, S.A.; Mahmud, A. Computer-aided diagnosis for lung cancer using waterwheel plant algorithm with deep learning. Sci. Rep. 2024, 14, 20647. [Google Scholar] [CrossRef]
  84. Karthikeyan, B.; Seethalakshmi, N.; Nandhini, V.; Vinoth, D.; Muthusamy, P.; Bellam, K. Multimodal Feature Fusion Using Optimal Transfer Learning Approach for Lung Cancer Detection and Classification on CT Images. Full Length Artic. 2024, 12, 84. [Google Scholar] [CrossRef]
  85. Ochoa-Ornelas, R.; Gudiño-Ochoa, A.; García-Rodríguez, J.A. A Hybrid Deep Learning and Machine Learning Approach with Mobile-EfficientNet and Grey Wolf Optimizer for Lung and Colon Cancer Histopathology Classification. Cancers 2024, 16, 3791. [Google Scholar] [CrossRef] [PubMed]
  86. Ahmed, A.A.; Fawi, M.; Brychcy, A.; Abouzid, M.; Witt, M.; Kaczmarek, E. Development and Validation of a Deep Learning Model for Histopathological Slide Analysis in Lung Cancer Diagnosis. Cancers 2024, 16, 1506. [Google Scholar] [CrossRef] [PubMed]
  87. Hasan, M.A.; Haque, F.; Sabuj, S.R.; Sarker, H.; Goni, O.F.; Rahman, F.; Rashid, M. An End-to-End Lightweight Multi-Scale CNN for the Classification of Lung and Colon Cancer with XAI Integration. Technologies 2024, 12, 56. [Google Scholar] [CrossRef]
  88. Mercaldo, F.; Tibaldi, M.G.; Lombardi, L.; Brunese, L.; Santone, A.; Cesarelli, M. An Explainable Method for Lung Cancer Detection and Localisation from Tissue Images through Convolutional Neural Networks. Electronics 2024, 13, 1393. [Google Scholar] [CrossRef]
  89. El-Aziz, A.A.A.; Mahmood, M.A.; El-Ghany, S.A. Advanced Deep Learning Fusion Model for Early Multi-Classification of Lung and Colon Cancer Using Histopathological Images. Diagnostics 2024, 14, 2274. [Google Scholar] [CrossRef]
  90. Liu, Y.; Wang, H.; Song, K.; Sun, M.; Shao, Y.; Xue, S.; Li, L.; Li, Y.; Cai, H.; Jiao, Y.; et al. CroReLU: Cross-Crossing Space-Based Visual Activation Function for Lung Cancer Pathology Image Recognition. Cancers 2022, 14, 5181. [Google Scholar] [CrossRef]
  91. Mamatha, B.; Rashmi, D.; Tiwari, K.S.; Sikrant, P.A.; Jovith, A.A.; Reddy, P.C.S. Lung Cancer Prediction from CT Images and using Deep Learning Techniques. In Proceedings of the 2023 Second International Conference on Trends in Electrical, Electronics, and Computer Engineering (TEECCON), Bengaluru, India, 23–24 August 2023; pp. 263–267. [Google Scholar] [CrossRef]
  92. Sait, A.R.W. Lung Cancer Detection Model Using Deep Learning Technique. Appl. Sci. 2023, 13, 12510. [Google Scholar] [CrossRef]
  93. Uddin, J. Attention-Based DenseNet for Lung Cancer Classification Using CT Scan and Histopathological Images. Designs 2024, 8, 27. [Google Scholar] [CrossRef]
  94. Rajasekar, V.; Vaishnnave, M.P.; Premkumar, S.; Sarveshwaran, V.; Rangaraaj, V. Lung cancer disease prediction with CT scan and histopathological images feature analysis using deep learning techniques. Results Eng. 2023, 18, 101111. [Google Scholar] [CrossRef]
  95. Alsheikhy, A.A.; Said, Y.; Shawly, T.; Alzahrani, A.K.; Lahza, H. A CAD system for lung cancer detection using hybrid deep learning techniques. Diagnostics 2023, 13, 1174. [Google Scholar] [CrossRef] [PubMed]
  96. Martis, J.E.; Sannidhan, M.S.; Balasubramani, R.; Mutawa, A.M.; Murugappan, M. Novel Hybrid Quantum Architecture-Based Lung Cancer Detection Using Chest Radiograph and Computerized Tomography Images. Bioengineering 2024, 11, 799. [Google Scholar] [CrossRef]
  97. Zheng, S.; Guo, J.; Langendijk, J.A.; Both, S.; Veldhuis, R.N.; Oudkerk, M.; van Ooijen, P.M.; Wijsman, R.; Sijtsema, N.M. Survival prediction for stage I-IIIA non-small cell lung cancer using deep learning. Radiother. Oncol. 2023, 180, 109483. [Google Scholar] [CrossRef]
  98. Subash, J.; Kalaivani, S. Dual-stage classification for lung cancer detection and staging using hybrid deep learning techniques. Neural Comput. Appl. 2024, 36, 8141–8161. [Google Scholar] [CrossRef]
  99. Devi, M.M.Y.; Jeyabharathi, J.; Kirubakaran, S.; Narayanan, S.; Srikanth, T.; Chakrabarti, P. Efficient segmentation and classification of the lung carcinoma via deep learning. Multimed. Tools Appl. 2024, 83, 41981–41995. [Google Scholar] [CrossRef]
  100. U.S. Food and Drug Administration. Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning-Based Software as a Medical Device (SaMD); U.S. Food and Drug Administration: Silver Spring, MD, USA, 2019. Available online: https://www.fda.gov (accessed on 1 January 2026).
  101. European Medicines Agency. Artificial Intelligence in the Medicinal Product Lifecycle; European Medicines Agency: Amsterdam, The Netherlands, 2023. Available online: https://www.ema.europa.eu (accessed on 1 January 2026).
  102. World Health Organization. Ethics and Governance of Artificial Intelligence for Health; World Health Organization: Geneva, Switzerland, 2021; Available online: https://www.who.int (accessed on 1 January 2026).
  103. Liu, X.; Rivera, S.C.; Moher, D.; Calvert, M.J.; Denniston, A.K. CONSORT-AI Extension: Reporting Guidelines for Clinical Trials of Artificial Intelligence Interventions. Nat. Med. 2020, 26, 1364–1374. [Google Scholar] [CrossRef] [PubMed]
  104. Baselli, G.; Codari, M.; Sardanelli, F. Opening the black box of machine learning in radiology: Can the proximity of annotated cases be a way? Eur. Radiol. Exp. 2020, 4, 30. [Google Scholar] [CrossRef]
  105. De-Giorgio, F.; Benedetti, B.; Mancino, M.; Sala, E.; Pascali, V.L. The need for balancing ‘black box’ systems and explainable artificial intelligence: A necessary implementation in radiology. Eur. J. Radiol. 2025, 185, 20–45. [Google Scholar] [CrossRef]
  106. Nagendran, M.; Chen, Y.; Lovejoy, C.A.; Gordon, A.C.; Komorowski, M.; Harvey, H.; Topol, E.J.; Ioannidis, J.P.A.; Collins, G.S.; Maruthappu, M. Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020, 368, m689. [Google Scholar] [CrossRef]
  107. Abumihsan, A.; Owda, M.; Owda, A.Y.; Gasir, F.; Abumohsen, M.; Stergioulas, L. A Novel Deep Learning Approach for Enhanced Ischemic Brain Stroke Detection from CT Images Using Deep Feature Extraction and Optimized Feature Selection. In Proceedings of the 2025 12th International Conference on Information Technology (ICIT), Amman, Jordan, 27–30 May 2025; pp. 1–6. [Google Scholar]
  108. San Francisco Chronicle. AI Was Meant to Cut Health Care Costs. It Turns Out to Need Expensive Human Support; San Francisco Chronicle: San Francisco, CA, USA, 2024. Available online: https://www.sfchronicle.com/health/article/ai-health-care-needs-costly-human-oversight-20028092.php (accessed on 1 January 2026).
  109. Herington, J.; McCradden, M.D.; Creel, K.; Boellaard, R.; Jones, E.C.; Jha, A.K.; Rahmim, A.; Scott, P.J.; Sunderland, J.J.; Wahl, R.L.; et al. Ethical Considerations for Artificial Intelligence in Medical Imaging: Data Collection, Development, and Evaluation. J. Nucl. Med. 2023, 64, 1848–1854. [Google Scholar] [CrossRef] [PubMed]
  110. Hantel, A.; Walsh, T.P.; Marron, J.M.; Kehl, K.L.; Sharp, R.; Van Allen, E.; Abel, G.A. Perspectives of Oncologists on the Ethical Implications of Using Artificial Intelligence for Cancer Care. JAMA Netw. Open 2024, 7, e244077. [Google Scholar] [CrossRef]
  111. Wall Street Journal. Khosla Ventures Backs U.K. Startup’s Plan to Bring Cancer AI Tool to U.S.; Wall Street Journal: New York, NY, USA, 2024; Available online: https://www.wsj.com/articles/khosla-ventures-backs-u-k-startups-plan-to-bring-cancer-ai-tool-to-u-s-556455b5 (accessed on 1 January 2026).
  112. Financial Times. Harvard’s ‘Chief’ AI Model Detects Multiple Cancer Types with High Accuracy; Financial Times: London, UK, 2024. Available online: https://www.ft.com/content/0a8f2c61-77f4-43ce-87d2-a7b421bbda85 (accessed on 1 January 2026).
  113. Trentz, C.; Engelbart, J.; Semprini, J.; Kahl, A.; Anyimadu, E.; Buatti, J.; Casavant, T.; Charlton, M.; Canahuate, G. Evaluating machine learning model bias and racial disparities in non-small cell lung cancer using SEER registry data. Health Care Manag. Sci. 2024, 27, 631–649. [Google Scholar] [CrossRef]
  114. Seyyed-Kalantari, L.; Zhang, H.; McDermott, M.B.A.; Chen, I.Y.; Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 2021, 27, 2176–2182. [Google Scholar] [CrossRef] [PubMed]
  115. Sourlos, N.; Wang, J.; Nagaraj, Y.; van Ooijen, P.; Vliegenthart, R. Possible Bias in Supervised Deep Learning Algorithms for CT Lung Nodule Detection and Classification. Cancers 2022, 14, 3867. [Google Scholar] [CrossRef] [PubMed]
  116. Chassagnon, G.; Vakalopoulou, M.; Paragios, N.; Revel, M.P. Artificial intelligence applications for COVID-19 pandemic management in imaging. Lancet Digit. Health 2021, 3, e235–e243. [Google Scholar]
  117. Kaissis, G.A.; Makowski, M.R.; Rückert, D.; Braren, R.F. Federated learning for privacy-preserving artificial intelligence in medicine. Nat. Mach. Intell. 2021, 3, 473–484. [Google Scholar] [CrossRef]
  118. Mittelstadt, B. Principles alone cannot guarantee ethical AI. Nat. Mach. Intell. 2022, 4, 104–110. [Google Scholar] [CrossRef]
  119. Holzinger, A.; Langs, G.; Denk, H.; Zatloukal, K.; Müller, H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2021, 11, e1405. [Google Scholar] [CrossRef]
  120. Beede, E.; Baylor, E.; Hersch, F.; Iurchenko, A.; Wilcox, L.; Ruamviboonsuk, P.; Vardoulakis, L.M. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. Proc. ACM Human-Computer Interact. 2020, 4, 1–12. [Google Scholar] [CrossRef]
  121. Banerjee, I.; Ding, Y.; Kim, J.; Shah, N.H. Fairness and algorithmic bias in machine learning for healthcare. J. Biomed. Inform. 2022, 128, 104036. [Google Scholar] [CrossRef]
  122. National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0); National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023. Available online: https://www.nist.gov (accessed on 1 January 2026).
Figure 1. Common signs and symptoms of LC.
Figure 2. PRISMA diagram that depicts the systematic selection process for the literature review.
Figure 3. Workflow of technical advances in LC detection.
Figure 4. Distribution of dataset types in LC studies.
Figure 5. Distribution of studies using public vs. private LC datasets.
Figure 6. Distribution of datasets used in LC studies.
Figure 7. ML/DL models used in LC detection and classification.
Figure 8. Comparison of ML, CNN, and hybrid DL model accuracies across LC datasets.
Figure 9. The four core challenges in clinical validation.
Figure 10. Challenges with ethical and societal implications in ML/DL for LC diagnosis.
Figure 11. A roadmap to clinic-ready ML/DL in lung cancer detection.
Table 2. Summary of key private datasets.

| Dataset Name | Modality | Size (Patients) | Strengths | Limitations |
|---|---|---|---|---|
| Iraq-Oncology Teaching Hospital Dataset [34] | CT (normal, benign, and malignant) | 110 | Clinical data from real patients | Limited annotation detail; lacks multicenter diversity |
| Zhongshan Hospital [35] | CT—radiomic (ADC, SCC, SCLC) | 852 | Institutional clinical imaging data | Limited demographic and pathological variation |
| Structural and Functional Radiomics [17] | CT—radiomic | 83 | Focused histological subtype classification | No raw images; limited generalizability |
| MSKCC Lung [36] | CT, clinical | ~200 | Rich clinical-genomic data | Restricted access |
| Moffitt-Maastricht Lung Adenocarcinoma [37] | CT (contrast-enhanced; adenocarcinoma with survival labels) | 59 | Real patient survival data; pre-treatment radiomics; clinical features | Small cohort size; lacks diversity and public annotations |
Table 4. ML performance on private datasets.

| Study | Dataset | Task | Feature Extraction | Classifiers | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|---|
| [46] | Not mentioned | Binary | Radiomics + Clinical | SVM | - | 94.20 | - | - |
| [47] | Not mentioned | Binary | Radiomics | RF | - | 92.0 | - | - |
| [37] | Moffitt-Maastricht Lung Adenocarcinoma | Binary | Radiomics | LR | 64.40 | NR | 80.0 | - |
| [48] | Not mentioned | Binary | Clinical attributes | DT | 100.0 | - | - | 100.0 |
| [35] | Zhongshan Hospital | Multiclass | Radiomics | RF | - | 74.0, 77.0, 88.0 | - | - |
Table 5. DL models using LUNA16 dataset.

| Study | Model | Task | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| [54] | 3D-CNN + RNN | Binary | 95.00 | - | 87.00 | - |
| [55] | CNN + IDLA | Binary | 92.81 | - | 92.85 | - |
| [56] | Deep Ensemble 2D CNN | Binary | 95.00 | - | - | - |
| [57] | WSDI + SS-CL | Binary | - | - | - | - |
| [58] | Faster R-CNN + DCNN | Binary | 95.32 | - | - | - |
| [59] | HFR-CNN | Binary | 97.00 | - | - | - |
| [60] | AtCNN + DenseNet201 | Binary | 99.43 | - | - | - |
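The accuracy, AUC, sensitivity, and specificity columns reported across Tables 4–11 follow the standard binary-classification definitions. As a point of reference, a minimal pure-Python sketch (with toy labels and scores, not data from any reviewed study) shows how each value is derived:

```python
# Illustrative computation of the metric columns used in the tables.
# Convention: label 1 = malignant (positive class), 0 = benign/normal.

def confusion_counts(y_true, y_pred):
    """Count (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def binary_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true-positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true-negative rate
    }

def auc(y_true, scores):
    """AUC via the Mann-Whitney statistic: P(score of a random positive
    exceeds score of a random negative), ties counted as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]   # toy model outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]      # 0.5 decision threshold
m = binary_metrics(y_true, y_pred)   # accuracy, sensitivity, specificity all 0.75
a = auc(y_true, scores)              # 0.9375
```

Note that accuracy and sensitivity/specificity depend on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why some studies in the tables report only one family of metrics.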
Table 11. DL models using private datasets.

| Study | Model | Task Type | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| [97] | 3D CNN | Binary | - | 76 | - | - |
| [34] | EOSA with CNN | Binary | 93.21 | - | - | - |
| [98] | Hybrid Xception and custom CNN model (XC–CNN), ALNN | Binary | 93.3 | - | - | - |
| [99] | CNN | Multiclass | 99.8 | - | - | - |
Table 12. Comparison of open vs. private datasets in LC detection studies.

| Aspect | Open (Public) Datasets | Private (Non-Open) Datasets |
|---|---|---|
| Dataset Type | Public/Open-source [54,59,60,68,85,89] | Institution-specific/Not publicly shared [34,35,37,47,97,98] |
| Common Datasets Used | LUNA16, LIDC-IDRI, LC25000, Kaggle CT [59,60,68,85,89] | Hospital CT archives, radiomics from local studies [34,35,37,47,97] |
| Availability | Freely accessible (public sources) | Restricted access or not shared [34,35,37,47] |
| Size | Large (e.g., >500 patients or >5000 images) [10,12,13] | Small to medium (e.g., <300 patients) [35,37,47] |
| Diversity | Low to moderate [54,59,60] | High (real-world variability) [34,35,97,98] |
| Image Format | Mostly DICOM or JPEG (LUNA16, LIDC: DICOM; LC25000: JPEG) | Mostly DICOM [34,35,97] |
| Annotation Quality | High (multi-reader, standardized) [10,12] | Moderate (institution-dependent) [35,47] |
| Reported Model Performance | Often very high (ideal conditions): [60] 99.43%, [89] 99.94% | Moderate to low (more realistic): [35] AUC 74–88%; [97] AUC 76% |
| Typical Accuracy/AUC | >95% accuracy/AUC in many studies [60,77,89] | Typically 74–93% accuracy/AUC [34,35,47,97] |
| Generalizability | Low (dataset-specific tuning) [54,57] | High potential (clinically reflective) [34,35] |
| Reproducibility | High (easy to reproduce with public datasets) | Low (non-reproducible) [35,47] |
| Clinical Relevance | Limited (controlled, ideal conditions) [59,60] | High (represents real clinical scenarios) [34,97,98] |
Table 6. DL models using LIDC-IDRI dataset.

| Study | Model | Task | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| [61] | DBN, CNN, SDAE | Binary | 79.76–81.19 | - | - | - |
| [62] | DBN, CNN | Binary | - | - | 73.30–73.40 | - |
| [63] | WSLO + ShCNN | Binary | 90.91 | - | - | - |
| [65] | SCCNN | Binary | 95.45 | - | - | - |
| [66] | MGDFormer | Binary | 96.10 | 98.50 | - | - |
| [67] | GP-WGAN | Binary | - | 86.20 | - | - |
| [64] | CAET-SWin | Binary | 82.65 | - | - | - |
| [68] | R-CNN | Multiclass | - | - | 98.00 | - |
| [69] | MT-Net | Multiclass | 91.90 | - | - | - |
| [70] | GWO + InceptionNet-V3 | Multiclass | - | - | 100.00 | 94.74 |
Table 7. DL models using Kaggle CT datasets.

| Study | Model | Task | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| [71] | CNN | Binary | 97.10 | - | - | - |
| [72] | ARSGNet | Binary | 98.17 | - | - | - |
| [73] | 3D nnU-Net | Binary | - | - | 68.40 | 71.30 |
| [74] | Inception-ResNetV2 (best), UNet (segm.) | Binary + Segm. | 98.50 | - | - | - |
| [75] | VER-Net (VGG19 + EffNetB0 + ResNet101) | Multiclass (4-class) | 91.00 | - | 91.00 | - |
| [76] | DSTL (DCNN + VGG16, InceptionV3, ResNet50) | Multiclass (4-class) | 92.57 | - | - | - |
| [77] | EfficientNet-B3 | Multiclass (4-class) | 96.00 | - | - | - |
| [78] | DenseNet201 | Multiclass (4-class) | 98.95 | 99.00 | 99.00 | 99.00 |
| [79] | YOLO-based model | Detection, classification, segmentation | 75.70 | - | 73.80 | - |
| [80] | CNN (IQ-OTH/NCCD) | Multiclass (3-class) | 99.00 | - | - | - |
Table 8. Hybrid DL models using CT datasets.

| Study | Model | Task Type | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | AUC (%) |
|---|---|---|---|---|---|---|---|
| [81] | ResNet-50, VGG-16, Inception v3 | Binary | 98.29 | - | - | - | - |
| [82] | HLFFF + SRNN (ResNet-50) | Binary | 99.20/99.40 | - | - | - | - |
| [83] | CADLC-WWPADL (MobileNet + WWPA + SAE) | Binary | 99.05 | - | - | - | - |
| [84] | MFFOTL-LCDC (SqueezeNet + CapsNet + ROA) | Binary | 97.78 | - | - | - | - |
Table 9. DL models using histopathological datasets.

| Study | Model | Task Type | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| [85] | MEGWO-LCCHC | Binary | 94.80 | - | - | - |
| [86] | CNN + SepCNN | Binary | 97.00 | - | - | - |
| [87] | LW-MS CNN | Multiclass | 99.20 | - | - | - |
| [88] | VGG-16 (best among CNNs) | Multiclass | 99.20 | - | - | - |
| [89] | ResNet101V2 + NASNet + EfficientNet-B0 | Multiclass | 99.94 | - | 99.80 | - |
| [90] | SENet50_CroReLU | Multiclass | 98.33 | - | - | - |
| [91] | CNN GD (best among multiple models) | Multiclass | 99.84 | - | - | - |
Table 10. DL models using multimodal datasets.

| Study | Model | Task Type | Accuracy (%) | AUC (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|
| [92] | CNN + DenseNet + MobileNet V3 | Multiclass | 98.60 | - | - | - |
| [93] | ATT-DenseNet | Multiclass | 94.00–95.40 | - | - | - |
| [94] | CNN GD (best among CNN, VGG, Inception) | Multiclass | 97.86 | - | 96.79 | - |
| [95] | VGG-19 + LSTM | Binary | 99.40 | - | - | - |
| [96] | Hybrid DL + Quantum (VGG16, ResNet50-V2, DenseNet201) | Binary | 92.12 | - | 94.00 | 90.00 |
Table 13. Checklist for clinic-ready ML/DL models.

| Criterion | Requirements | Validation Method |
|---|---|---|
| Generalizability | The ML/DL model should perform well across diverse populations and institutions and be tested on more than two external datasets. | Multi-center trials, external dataset examination. |
| Bias Mitigation | Ensure fairness by addressing demographic imbalances (performance disparities across age, gender, and ethnicity). | Bias assessment metrics, subgroup analysis. |
| Interpretability | ML/DL decisions should be explainable to earn clinical trust. | Explainable ML/DL techniques, heatmaps, and SHAP values. |
| Robustness | ML/DL must handle noise and variations in clinical data. | Stress testing, adversarial robustness testing. |
| Data Privacy | Protect patient confidentiality and comply with regulations. | Federated learning, differential privacy. |
| Clinical Workflow Integration | Seamless compatibility with HIS, PACS, and EHR systems. | Pilot studies, specialist feedback. |
| Continuous Learning | ML/DL should adapt to evolving medical knowledge. | Periodic model retraining, real-world monitoring. |
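The subgroup analysis named in the Bias Mitigation row can be as simple as computing sensitivity per demographic group and flagging large gaps. A minimal sketch; the record fields (`label`, `pred`, `sex`) and the 10-point gap threshold are illustrative assumptions, not a method from any reviewed study:

```python
# Hedged sketch of subgroup analysis for bias assessment: per-group
# sensitivity plus a disparity flag. All data below are toy values.

def subgroup_sensitivity(records, group_key):
    """Sensitivity (TP / (TP + FN)) per subgroup, computed over the
    malignant cases only (label == 1)."""
    counts = {}  # group -> [TP, FN]
    for r in records:
        if r["label"] != 1:
            continue  # true negatives/false positives do not affect sensitivity
        tp_fn = counts.setdefault(r[group_key], [0, 0])
        tp_fn[0 if r["pred"] == 1 else 1] += 1
    return {g: tp / (tp + fn) for g, (tp, fn) in counts.items()}

def flag_disparity(per_group, max_gap=0.10):
    """True if best-to-worst subgroup sensitivity differs by more than max_gap."""
    vals = list(per_group.values())
    return max(vals) - min(vals) > max_gap

records = [
    {"label": 1, "pred": 1, "sex": "F"},
    {"label": 1, "pred": 1, "sex": "F"},
    {"label": 1, "pred": 0, "sex": "F"},   # one miss in group F
    {"label": 1, "pred": 1, "sex": "M"},
    {"label": 1, "pred": 0, "sex": "M"},   # group M: 1 of 2 detected
    {"label": 0, "pred": 0, "sex": "M"},   # true negative, ignored here
]
per_group = subgroup_sensitivity(records, "sex")   # {"F": 2/3, "M": 0.5}
```

Here `flag_disparity(per_group)` is True, since the 16.7-point sensitivity gap exceeds the assumed 10-point tolerance; a real audit would repeat this across age, ethnicity, scanner type, and site, with confidence intervals for small subgroups.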

Abumohsen, M.; Costa-Montenegro, E.; García-Méndez, S.; Owda, A.Y.; Owda, M. Machine Learning and Deep Learning in Lung Cancer Diagnostics: A Systematic Review of Technical Breakthroughs, Clinical Barriers, and Ethical Imperatives. AI 2026, 7, 23. https://doi.org/10.3390/ai7010023