1. Introduction
Lung cancer remains one of the leading causes of cancer-related mortality globally, with more than two million new cases diagnosed annually [1]. Survival outcomes depend heavily on early detection, as five-year survival is markedly higher when tumors are identified early than in advanced stages [2,3]. Consequently, low-dose computed tomography (LDCT) has been established as the gold-standard screening method for detecting small pulmonary nodules before symptoms appear [3]. However, LDCT introduces noise and intensity inconsistencies due to dose reduction, and these variations complicate both radiologist interpretation and automated analysis [4].
Computed tomography (CT) images are quantified in Hounsfield Units (HU), and reconstruction kernels, slice thickness, radiation dose, and vendor-specific algorithms strongly influence image texture, contrast, and noise patterns [5]. Even scans from the same patient may exhibit marked differences when acquired using different scanners or protocols, leading to shifts in intensity distributions and nodule appearance [6]. These variations create substantial challenges for artificial intelligence (AI) systems, which typically assume consistent input characteristics during training.
While LDCT remains the gold standard for structural lung screening, it is strictly limited to anatomical assessment and often lacks the functional specificity required for complex oncological cases. In scenarios such as cancer of unknown primary (CUP), standard anatomical imaging frequently fails to localize the lesion. In such contexts, complementary strategies such as molecular imaging (e.g., PET/CT) have demonstrated superior detection rates compared with conventional methods. Similarly, MRI-based radiomics has shown promise in soft-tissue characterization by extracting quantitative texture features (e.g., GLCM inverse variance) that escape human perception, enabling precise differential diagnosis in complex head and neck tumors. Although this review focuses on CT-based deep learning, acknowledging these multimodal diagnostic pathways is critical for a holistic understanding of pulmonary and general oncology [7,8].
Deep-learning models, including convolutional neural networks (CNNs), U-Net-based segmentation systems, and Transformer architectures, have shown strong performance on benchmark datasets [9]. Yet their accuracy decreases by 10–20% when exposed to unseen scanners, reconstruction kernels, or dose levels [10]. This domain shift increases false negatives for small or low-contrast nodules and reduces clinical reliability [11]. The issue is further compounded by the widespread use of enhancement techniques such as histogram equalization or CLAHE, which improve visual contrast but often distort HU values and undermine radiomic reproducibility [12]. In contrast, harmonization methods such as ComBat, kernel matching, and physics-informed denoising aim to standardize intensity distributions while preserving quantitative integrity [13].
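To make this distinction concrete, the following minimal sketch (illustrative only; the lung window of [-1000, 400] HU, the CLAHE parameters, and the use of NumPy and scikit-image are our assumptions rather than settings reported in the cited studies) contrasts an HU-faithful clipping-and-scaling step with CLAHE, whose output no longer maps back to Hounsfield Units:

import numpy as np
from skimage import exposure  # scikit-image, used here only to illustrate CLAHE

def normalize_hu(volume_hu, hu_min=-1000.0, hu_max=400.0):
    """HU-faithful preprocessing: clip to a fixed window and scale linearly to [0, 1].
    The mapping is linear and invertible, so quantitative HU information is retained."""
    clipped = np.clip(volume_hu, hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)

def clahe_enhance(slice_hu):
    """Perceptual enhancement: CLAHE boosts local contrast, but the resulting
    intensities no longer correspond to Hounsfield Units."""
    lo, hi = float(slice_hu.min()), float(slice_hu.max())
    rescaled = (slice_hu - lo) / (hi - lo + 1e-8)        # equalize_adapthist expects [0, 1]
    return exposure.equalize_adapthist(rescaled, kernel_size=16, clip_limit=0.02)

# Synthetic 64 x 64 "slice" spanning the lung HU range
rng = np.random.default_rng(0)
slice_hu = rng.uniform(-1000.0, 400.0, size=(64, 64))

hu_norm = normalize_hu(slice_hu)
print(hu_norm * 1400.0 - 1000.0 - slice_hu)  # ~0 everywhere: HU values are recoverable
print(clahe_enhance(slice_hu))               # contrast-stretched values; HU scale is lost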
Although publicly available datasets such as the Lung Image Database Consortium (LIDC-IDRI) and LUng Nodule Analysis 2016 (LUNA16) have facilitated algorithm development, they represent limited scanner diversity and do not reflect real-world multi-institutional variability [14]. As a result, AI models trained exclusively on such datasets may generalize poorly across clinical environments [15]. Improving robustness requires a structured understanding of how acquisition variability affects AI performance and which preprocessing and harmonization strategies effectively mitigate these effects.
This systematic review addresses these challenges by synthesizing evidence published between 2020 and 2025 on (1) how acquisition- and reconstruction-related variability impacts AI-based lung-nodule detection, (2) the effectiveness of preprocessing and harmonization strategies in managing intensity and contrast heterogeneity, and (3) the robustness of different deep-learning architectures under variable imaging conditions. The review further highlights dataset limitations, gaps in external validation practices, and the need for standardized, HU-faithful workflows to support the clinically reliable deployment of AI-assisted lung-cancer screening.
To date, however, existing reviews remain fragmented in their treatment of CT acquisition variability and provide limited insight into how reconstruction kernels, slice thickness, radiation dose, vendor-specific reconstruction algorithms, and HU calibration inconsistencies collectively influence the robustness of AI-based lung-cancer detection systems. Most prior surveys focus primarily on model architectures or overall diagnostic performance, offering little systematic comparison of HU-preserving preprocessing techniques, perceptual enhancement methods, physics-informed harmonization strategies, and deep-learning-based reconstruction approaches within a unified framework. The limited adoption of multi-center external validation and the widespread absence of standardized reporting for essential acquisition parameters further obscure the true generalizability of published AI systems. These gaps underscore the need for a comprehensive, methodologically focused synthesis that directly examines how acquisition variability and preprocessing workflows influence model stability, reproducibility, and clinical applicability.
Addressing these limitations, the present systematic review contributes a consolidated and quantitative assessment of the field by mapping the impact of CT acquisition variability, including differences in reconstruction kernels, slice thicknesses, dose categories, and vendor-specific characteristics, on segmentation, detection, and malignancy-classification performance [16,17] across 100 studies published between 2020 and 2025. It provides a comparative evaluation of preprocessing and harmonization techniques, clearly distinguishing HU-faithful normalization approaches from perceptual contrast-enhancement methods and examining their effects on AUC, Dice scores, radiomic stability, and cross-scanner generalization. The review further analyzes the robustness of modern AI architectures, including CNNs [18,19], attention-based networks, Transformers, hybrid segmentation–classification pipelines, and radiomics–deep-learning fusion models, under heterogeneous imaging conditions. Finally, it identifies key methodological gaps, such as inconsistently reported acquisition metadata, the lack of standardized preprocessing pipelines, minimal external validation, and risks associated with generative adversarial network (GAN)-based harmonization, and outlines practical recommendations for building clinically reliable, vendor-agnostic CT-based AI workflows. Together, these contributions provide a cohesive and rigorous foundation for advancing robust, reproducible, and clinically deployable AI systems for lung-cancer screening.
2. Methodology
This systematic review was conducted in full accordance with the PRISMA 2020 guidelines [20], and the reporting follows all items outlined in the PRISMA checklist. The review protocol was not preregistered in PROSPERO or any other registry; however, all methodological steps including search strategy, screening, eligibility assessment, data extraction, and synthesis were performed following PRISMA standards to ensure transparency and reproducibility. A completed PRISMA checklist is provided in the Supplementary Materials, and the PRISMA flow diagram summarizing identification, screening, eligibility, and inclusion is presented in Figure 1.
2.1. Research Questions and Objectives
The review was designed to address key research questions (RQs) regarding the impact of CT variability on AI performance and to establish corresponding objectives for synthesis. Specifically, the study sought to answer:
RQ1: How do variations in CT acquisition parameters (e.g., kernel, dose, vendor) affect AI diagnostic performance?
RQ2: Which preprocessing and harmonization methods effectively reduce intensity variability while preserving HU fidelity?
RQ3: Which deep-learning architectures show the greatest robustness to cross-scanner variability?
RQ4: How representative are widely used CT datasets in terms of scanner and dose diversity?
RQ5: What methodological limitations exist in reporting acquisition parameters, dataset composition, preprocessing, and robustness evaluation?
RQ6: To what extent do studies include external or multi-center validation, and how does this influence reported generalizability?
Correspondingly, the primary objectives were to quantify the impact of these variations, evaluate the efficacy of preprocessing strategies, identify robust architectures, and propose improvements for standardized reporting and multi-center validation.
2.2. Search Strategy
A systematic search was conducted across PubMed, IEEE Xplore, Scopus, Web of Science, and ACM Digital Library for the literature published from 2020 to 2025. To broaden coverage, the first 500 Google Scholar results were also screened manually, with irrelevant items (patents, theses, duplicates) removed. Although a preregistered protocol was not utilized, this review strictly adhered to all methodological procedures outlined in the PRISMA 2020 guidelines.
The search strategy was structured using a PICOC-style logic, focusing on CT acquisition variability [6], AI-based detection, and diagnostic performance. Table 1 outlines the PICOC components and the corresponding terminology used during query formulation.
To operationalize the PICOC framework, the search queries combined controlled vocabulary and free-text keywords relating to CT imaging, acquisition variability, preprocessing methods, and AI-based lung-cancer detection. Boolean operators ensured both sensitivity and precision across databases.
Search Queries Used:
(“lung cancer” OR “pulmonary nodule”) AND (“computed tomography” OR “CT” OR “LDCT”) AND (“Hounsfield” OR “reconstruction kernel” OR “slice thickness” OR “vendor” OR “radiation dose”) AND (“deep learning” OR “CNN” OR “Transformer” OR “AI detection”)
(“CT variability” OR “intensity harmonization” OR “ComBat” OR “kernel matching” OR “HU normalization”) AND (“lung nodule detection”) AND (“classification” OR “segmentation”)
Database-specific filters were applied to restrict searches to titles, abstracts, and metadata where possible. Google Scholar results were manually curated to exclude non-peer-reviewed sources and redundant citations.
2.3. Search Outcomes
The combined search retrieved 16,451 records, including 15,900 from Google Scholar. Given the large volume of Google Scholar results, only the top 500 most relevant records were screened, and the remainder were excluded by this screening threshold. After removing duplicates and applying this threshold, 1000 unique records underwent title and abstract screening. Of these, 825 were excluded as irrelevant (non-CT imaging, non-AI, or non-peer-reviewed sources). Full texts of 175 studies were assessed for eligibility, and 100 met all inclusion criteria (Table 2).
The complete workflow is illustrated in the PRISMA 2020 flow diagram (Figure 1), which visualizes all stages of identification, screening, eligibility, and inclusion. This diagram confirms that the selection pipeline adhered strictly to PRISMA standards, showing transparent tracking of excluded studies and reasons for exclusion.
2.4. Eligibility Criteria and Study Selection Process
Eligibility was defined according to PRISMA guidelines to ensure a focused analysis of CT variability. Studies were included if they were peer-reviewed, published in English between 2020 and 2025, and explicitly addressed CT-acquisition variability in AI-based lung-nodule analysis. Gray literature and studies missing essential methodological details were excluded to maintain quality standards. The detailed inclusion and exclusion criteria utilized for screening are summarized in Table 3.
Applying these criteria, the study selection process was executed through a structured multi-stage workflow consistent with PRISMA 2020 standards:
Title & Abstract Screening: Initial removal of studies unrelated to CT, AI, or imaging variability to filter out the irrelevant literature.
Full-Text Review: A detailed assessment of the remaining articles to verify the reporting of acquisition parameters, model transparency, and performance metrics.
Backward & Forward Reference Checking: A final supplementary search was conducted to ensure no relevant studies were missed beyond the initial database pool.
2.5. Data Extraction
A standardized data-extraction form was developed to ensure consistency and reproducibility across all included studies. To ensure quality assurance, the data-extraction process was structured as follows: the primary reviewer (S.K.) extracted all relevant data fields from the 100 included studies, and a second reviewer (M.N.N.) then performed an independent validation on a random sample of 20% of the entries to verify accuracy. Agreement between reviewers was high, and minor discrepancies (<5%) were resolved through discussion and re-examination of the full texts until consensus was reached.
For each study, the following data fields were collected to map methodological practices:
Bibliographic metadata: First author, year, publication venue, and country.
Task type: Lung-nodule detection, segmentation, or malignancy classification.
Datasets used: Public benchmarks (e.g., LIDC-IDRI, LUNA16, NLST) or private institutional CT collections, including sample sizes and acquisition diversity [21].
Acquisition and reconstruction parameters: Scanner vendor, reconstruction kernel, slice thickness, radiation dose (LDCT vs. standard), and window/level settings.
Preprocessing and harmonization methods: HU-normalization strategies, resampling, kernel/MTF matching, ComBat harmonization, physics-informed methods, and whether the pipeline preserved HU integrity [22].
Modeling approach: DL architecture (CNN, Transformer, Hybrid), loss functions, augmentation strategies, and domain generalization techniques [23].
Evaluation procedure: Validation setup (internal cross-validation, held-out testing, or external multi-site validation) [24].
Performance metrics: AUC, FROC sensitivity (at specified FP/scan rates), Dice similarity coefficient (DSC), accuracy, and confidence intervals.
Robustness indicators: Reported performance degradation across different scanners, kernels, or dose levels, specifically for small nodules (≤6 mm).
Due to the substantial heterogeneity in datasets, preprocessing pipelines, and validation strategies across the selected literature, a quantitative meta-analysis was not feasible. Instead, the findings were synthesized qualitatively, grouped by task type and validation setting [25].
A representative sample of extracted variables from five included studies is provided in Table 4, illustrating the granularity of data captured during the extraction process.
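For illustration, the structure of such an extraction record can be sketched as a simple data class; the field names below are our own shorthand for the items listed above, and the example values are placeholders rather than data from any included study:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExtractionRecord:
    """Hypothetical structured record mirroring the extraction fields listed above."""
    first_author: str
    year: int
    task: str                                   # "detection" | "segmentation" | "classification"
    datasets: List[str]                          # e.g., ["LIDC-IDRI", "LUNA16"]
    vendors: List[str]                           # scanner vendors, if reported
    kernels: List[str]                           # reconstruction kernels, if reported
    slice_thickness_mm: Optional[float] = None
    dose_setting: Optional[str] = None           # "LDCT" | "standard"
    preprocessing: List[str] = field(default_factory=list)  # e.g., ["HU clipping", "ComBat"]
    hu_preserving: Optional[bool] = None
    architecture: Optional[str] = None           # "CNN" | "Transformer" | "Hybrid"
    validation: Optional[str] = None             # "internal CV" | "held-out" | "external multi-site"
    auc: Optional[float] = None
    dice: Optional[float] = None
    robustness_notes: Optional[str] = None       # reported degradation across scanners/kernels/doses

# Example usage with placeholder values:
record = ExtractionRecord(
    first_author="Doe", year=2023, task="detection",
    datasets=["LIDC-IDRI"], vendors=["GE", "Siemens"], kernels=["B30f", "STANDARD"],
    slice_thickness_mm=1.25, dose_setting="LDCT",
    preprocessing=["HU clipping", "isotropic resampling"], hu_preserving=True,
    architecture="CNN", validation="internal CV", auc=0.91,
)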
2.6. Categorization of Preprocessing and Harmonization Methods
This review identified four major categories of preprocessing and harmonization strategies applied across the included studies. These methods differ in their underlying principles, their ability to preserve Hounsfield Unit (HU) integrity, and their suitability for quantitative or diagnostic AI pipelines.
Perceptual enhancement methods such as CLAHE and AHE were frequently used (18% of studies), mainly due to their simplicity and ability to improve visual conspicuity; however, they do not preserve HU values and are therefore unsuitable for radiomics or quantitative AI pipelines [12,30]. Statistical harmonization techniques, including ComBat, Z-score normalization, and histogram matching, accounted for 6% of studies and offered strong cross-scanner robustness while maintaining HU fidelity, although they rely on the assumption of consistent batch effects [31]. Physics-informed harmonization strategies such as kernel/MTF matching and spectral re-projection (5%) preserved HU distributions most accurately but required detailed acquisition metadata that many public datasets lack [32]. Deep-learning-based normalization and denoising approaches (10%), including DLIR, GAN-based mappings, and self-supervised harmonizers, provided joint noise reduction and harmonization but carried a risk of hallucinated structures when not externally validated [33]. Overall, the distribution illustrated in Figure 2 highlights a critical trend: although perceptual enhancement methods remain common, only 21% of studies employed HU-preserving harmonization strategies, underscoring the need for broader adoption to achieve reliable cross-scanner generalization in clinical AI workflows.
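To illustrate the principle behind the statistical harmonization category, the following simplified sketch aligns per-scanner feature means and variances to the pooled distribution. It is a stripped-down stand-in for ComBat, which additionally applies empirical Bayes shrinkage of batch parameters and can preserve biological covariates; all names and values here are illustrative assumptions.

import numpy as np

def location_scale_harmonize(features, batch_ids):
    """Simplified location-scale harmonization of radiomic features across scanners.
    features : (n_samples, n_features) array
    batch_ids: (n_samples,) array of scanner/site labels
    """
    features = np.asarray(features, dtype=float)
    batch_ids = np.asarray(batch_ids)
    grand_mean = features.mean(axis=0)
    grand_std = features.std(axis=0) + 1e-8
    harmonized = np.empty_like(features)
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        b_mean = features[mask].mean(axis=0)
        b_std = features[mask].std(axis=0) + 1e-8
        # standardize within each batch, then map onto the pooled distribution
        harmonized[mask] = (features[mask] - b_mean) / b_std * grand_std + grand_mean
    return harmonized

# Toy example: two scanners with systematically shifted feature distributions
rng = np.random.default_rng(42)
scanner_a = rng.normal(loc=0.0, scale=1.0, size=(20, 3))
scanner_b = rng.normal(loc=2.0, scale=1.5, size=(20, 3))
feats = np.vstack([scanner_a, scanner_b])
batches = np.array(["A"] * 20 + ["B"] * 20)
print(location_scale_harmonize(feats, batches).mean(axis=0))  # batch offsets removed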
2.7. Datasets Used in Selected Studies
The selected studies employed a range of publicly available and institutional CT datasets for lung-nodule detection, segmentation, and malignancy prediction [34]. Table 5 summarizes the major datasets, including the number of CT volumes, scanner vendors, reconstruction kernels, and availability of acquisition metadata. Figure 3 further illustrates the distribution of dataset usage across the included studies.
As shown in Figure 3, the LIDC-IDRI dataset was the most widely used resource, appearing in more than half of the studies. Its detailed radiologist annotations and slice-level labels make it highly suitable for training and validating AI models [35]. LUNA16, derived from LIDC-IDRI but standardized for nodule-detection benchmarking, was also frequently used due to its preprocessing consistency and fixed train–test splits.
The NLST (National Lung Screening Trial) dataset, although used less frequently due to restricted access, provided the most realistic LDCT screening data and supported multi-vendor, multi-kernel variability analysis. Commercial or competition datasets such as Tianchi and Kaggle Lung CT Challenge appeared in several studies but lacked complete acquisition metadata, limiting harmonization research.
A subset of studies used multi-center institutional datasets, which often offered rich variability (different kernels, vendors, and doses) but were not publicly accessible. This creates a reproducibility gap, as noted in several reviews. Smaller datasets were occasionally used in segmentation-focused studies.
2.8. Quality Assessment and Risk of Bias
To evaluate the methodological quality of the included studies, we applied the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, tailored for deep-learning applications. This tool assesses risk across four key domains: (1) Patient Selection (checking for bias in dataset usage, e.g., use of public vs. private data), (2) Index Test (assessing if the AI-model methodology was transparent and reproducible), (3) Reference Standard (evaluating the reliability of ground truth labels, e.g., radiologist consensus), and (4) Flow and Timing (checking for data leakage and appropriate train–test splitting).
Each study was graded as “Low Risk,” “High Risk,” or “Unclear Risk” for each domain. As illustrated in Figure 4, the majority of studies demonstrated low risk of bias in the ‘Index Test’ and ‘Reference Standard’ domains, reflecting the widespread use of high-quality public benchmarks like LIDC-IDRI. However, the ‘Patient Selection’ domain exhibited higher risk (30%), primarily due to the use of non-randomized institutional datasets without clear exclusion criteria.
5. Conclusions
This systematic review synthesized evidence from 100 studies published between 2020 and 2025 to evaluate the impact of CT-acquisition variability and preprocessing on AI-based lung-nodule analysis, providing direct answers to the research questions posed in Section 2.1.
Addressing RQ1, the evidence demonstrates that variations in reconstruction kernels, slice thickness, and radiation dose are the primary drivers of AI performance degradation, causing substantial reductions in AUC and Dice scores through severe Hounsfield Unit (HU) and textural shifts. For RQ2, HU-preserving approaches such as HU clipping, ComBat harmonization, and physics-informed kernel matching emerged as the most effective strategies for mitigating this variability, offering stable cross-scanner performance while maintaining quantitative fidelity. Concerning RQ3, Transformer-based architectures and hybrid segmentation–classification models consistently showed superior robustness compared with conventional CNNs, maintaining higher AUC values (0.90–0.92 vs. 0.85–0.88) under heterogeneous imaging conditions. Regarding RQ4, public datasets such as LIDC-IDRI remain dominant but are unrepresentative of real-world clinical diversity, lacking sufficient scanner, kernel, and low-dose variability. Addressing RQ5, significant methodological gaps persist, most notably the inconsistent reporting of essential acquisition metadata (e.g., kernel, vendor) and the lack of standardized preprocessing guidelines. Finally, for RQ6, external multi-center validation is alarmingly rare, present in only a small fraction of studies, leading to widespread overestimation of model generalizability.
Overall, the findings indicate that achieving clinically reliable and vendor-agnostic lung-CT AI systems requires a unified, HU-faithful preprocessing framework. Such a pipeline should integrate HU-preserving normalization, DLIR- or N4ITK-based denoising, kernel-consistent resampling, selective harmonization, and spatial normalization through segmentation [9,23]. When paired with modern architectures, particularly Transformers, and validated across multi-center datasets, these standardized workflows substantially improve robustness and generalizability [15,49]. The consolidation of this evidence provides a foundation for moving toward harmonized, reproducible, and clinically deployable AI solutions for lung-cancer screening and diagnosis.
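As a minimal illustration of such an HU-faithful workflow, the sketch below combines isotropic resampling with HU clipping and linear scaling using SimpleITK; the 1 mm spacing, the [-1000, 400] HU window, and the omission of the denoising, harmonization, and segmentation stages are simplifying assumptions rather than a prescribed implementation.

import SimpleITK as sitk
import numpy as np

def resample_isotropic(image, spacing=(1.0, 1.0, 1.0)):
    """Resample a CT volume to isotropic voxel spacing with linear interpolation."""
    original_spacing = image.GetSpacing()
    original_size = image.GetSize()
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(original_size, original_spacing, spacing)]
    return sitk.Resample(
        image, new_size, sitk.Transform(), sitk.sitkLinear,
        image.GetOrigin(), spacing, image.GetDirection(),
        -1000, image.GetPixelID())  # pad with air (-1000 HU)

def preprocess(path, hu_min=-1000.0, hu_max=400.0):
    """HU-faithful preprocessing: isotropic resampling, then HU clipping and scaling."""
    image = sitk.ReadImage(path)                  # expects an HU-calibrated CT volume
    image = resample_isotropic(image)
    hu = sitk.GetArrayFromImage(image).astype(np.float32)
    hu = np.clip(hu, hu_min, hu_max)
    return (hu - hu_min) / (hu_max - hu_min)      # linear, invertible mapping to [0, 1]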
6. Limitations and Future Directions
Although this systematic review provides a comprehensive synthesis of CT-acquisition variability, preprocessing, and harmonization strategies in AI-based lung-nodule analysis, several methodological limitations must be acknowledged. First, the review was not based on a preregistered protocol, which may introduce a minor risk of selection bias despite strict adherence to PRISMA 2020 guidelines [20,53]. Second, the included studies demonstrated substantial heterogeneity in dataset composition, acquisition protocols, reporting completeness, and evaluation strategies [4,59]. This variability prevented the application of meta-analytic pooling and required a qualitative synthesis, which, although rigorous, lacked the statistical precision obtainable from standardized effect-size aggregation.
A third limitation is the widespread absence of detailed acquisition metadata in the primary literature. Many studies did not report reconstruction kernels, slice thickness, dose levels, or vendor specifications, parameters that are essential for assessing the true impact of variability on AI robustness [21,22]. As a result, certain quantitative relationships (e.g., the exact influence of kernel mismatch on Dice loss) may be underestimated or inconsistently represented across studies [5,15]. Additionally, the over-reliance on public datasets such as LIDC-IDRI and LUNA16, which have limited scanner diversity, may bias findings toward optimistic performance estimates compared with real-world multi-center clinical environments [31,36].
Fourth, although GAN-based harmonization and DL-based reconstruction techniques show promising results, many published studies lacked rigorous safeguards against hallucinated textures or altered diagnostic cues [33,65]. Only a small fraction incorporated radiologist review or uncertainty quantification to evaluate the clinical fidelity of harmonized outputs [24,40]. Thus, conclusions regarding the safety and reproducibility of these methods should be interpreted cautiously.
Future research should prioritize standardized reporting and harmonized protocols. At minimum, CT-acquisition metadata, including kernel type, slice thickness, radiation dose, tube current, and vendor, should be mandatory in AI publications to support reproducibility and external benchmarking [6,64].
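As an example of how such metadata could be captured automatically at publication time, the sketch below reads the relevant attributes from a DICOM header with pydicom; attribute availability varies by vendor and protocol, so missing fields are returned as None rather than guessed.

import pydicom

def acquisition_metadata(dicom_path):
    """Collect the acquisition parameters recommended for reporting from one CT slice."""
    ds = pydicom.dcmread(dicom_path, stop_before_pixels=True)
    fields = {
        "manufacturer": "Manufacturer",
        "model": "ManufacturerModelName",
        "convolution_kernel": "ConvolutionKernel",
        "slice_thickness_mm": "SliceThickness",
        "kvp": "KVP",
        "tube_current_mA": "XRayTubeCurrent",
        "exposure_mAs": "Exposure",
    }
    return {name: getattr(ds, keyword, None) for name, keyword in fields.items()}

# Example usage (path is a placeholder):
# print(acquisition_metadata("ct_slice_0001.dcm"))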
Furthermore, the field would benefit from widely accepted preprocessing standards, including recommended HU windows, normalization formulas, and denoising settings tailored to lung-CT analysis [23,42].
There is a pressing need for large-scale, multi-center, multi-vendor datasets with fully annotated acquisition metadata to evaluate AI generalizability under realistic clinical variability. Such datasets would also facilitate controlled cross-protocol experiments, enabling a deeper understanding of the sources of acquisition-related variability [14,48].