Article

Leveraging Large-Scale Public Data for Artificial Intelligence-Driven Chest X-Ray Analysis and Diagnosis

1 AI Laboratory, HealthHub Co., Ltd., Seoul 06524, Republic of Korea
2 Department of Radiology, Keimyung University Dongsan Hospital, Daegu 24601, Republic of Korea
3 Division of AI and Computer Engineering, Kyonggi University, Suwon 16227, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Diagnostics 2026, 16(1), 146; https://doi.org/10.3390/diagnostics16010146
Submission received: 3 November 2025 / Revised: 19 December 2025 / Accepted: 27 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Machine-Learning-Based Disease Diagnosis and Prediction)

Abstract

Background: Chest X-ray (CXR) imaging is crucial for diagnosing thoracic abnormalities; however, the rising demand burdens radiologists, particularly in resource-limited settings. Methods: We used large-scale, diverse public CXR datasets with noisy labels to train general-purpose deep learning models (ResNet, DenseNet, EfficientNet, and DLAD-10) for multi-label classification of thoracic conditions. Uncertainty quantification was incorporated to assess model reliability. Performance was evaluated on both internal and external validation sets, with analyses of data scale, diversity, and fine-tuning effects. Results: EfficientNet achieved the highest overall area under the receiver operating characteristic curve (0.8944) with improved sensitivity and F1-score. Moreover, as training data volume increased—particularly using multi-source datasets—both diagnostic performance and generalizability were enhanced. Although larger datasets reduced predictive uncertainty, conditions such as tuberculosis remained challenging due to limited high-quality samples. Conclusions: General-purpose deep learning models can achieve robust CXR diagnostic performance when trained on large-scale, diverse public datasets despite noisy labels. However, further targeted strategies are needed for underrepresented conditions.

1. Introduction

Chest X-ray (CXR) is an essential diagnostic tool in healthcare, providing a cost-effective, non-invasive, and widely accessible method for detecting thoracic diseases [1,2,3]. However, the increasing demand for CXR studies has placed a burden on radiologists, leading to diagnostic delays and highlighting the need for scalable automated diagnostic solutions to enhance efficiency and accuracy [4]. Although deep learning has achieved radiologist-level performance in CXR interpretation [5,6,7,8], its clinical integration is hampered by the reliance on private, expert-annotated datasets that limit reproducibility—especially in resource-constrained settings. Public datasets, including National Institutes of Health (NIH) Chest X-ray-14 [9], CheXpert [10], and Medical Information Mart for Intensive Care-CXR (MIMIC-CXR) [11], offer diverse training data; however, noisy labels pose challenges for model reliability [12].
In this study, we aimed to demonstrate that the scale and diversity of training data can offset imperfections in label quality and model specificity. We leveraged large-scale, diverse public CXR datasets to train general-purpose image classification models that robustly classify multiple thoracic conditions without the need for specialized architectures. Moreover, we incorporated uncertainty quantification (UQ) to assess predictive performance and model reliability, offering insights into their real-world clinical applicability. Our findings indicate that, despite noisy labels, general-purpose deep learning models can achieve acceptable diagnostic performance when trained on expansive, varied datasets, thereby supporting the development of accessible, scalable artificial intelligence (AI) diagnostic tools for resource-limited settings.
While our study leverages publicly available datasets and standard deep learning models, its novelty lies in the systematic integration and analysis of these components. Specifically, we:
  • Aggregate and harmonize 17 public CXR datasets, addressing label inconsistencies and cross-dataset duplication to enable robust generalization analysis and broaden coverage of rare diseases.
  • Perform comprehensive UQ across multiple dataset scales, providing new insights into model stability, calibration, and reliability beyond what prior CXR studies have reported.
  • Conduct class-specific uncertainty analysis for clinically important but underexplored categories, including tuberculosis, pneumothorax, masses/nodules, and the heterogeneous “Other” class, offering a granular understanding of class-dependent behavior.
  • Characterize scaling-law behavior in both performance and uncertainty, demonstrating how increasing dataset size affects diagnostic accuracy and epistemic uncertainty, and offering actionable guidance for data collection and clinical deployment strategies.
These contributions underscore that our research extends beyond incremental improvements, providing new methodological and practical insights for scalable, uncertainty-aware CXR analysis.

2. Materials and Methods

The Institutional Review Board of Keimyung University Dongsan Medical Center approved this retrospective study and waived the requirement for written informed consent. All procedures followed the approved protocol.

2.1. Target Conditions

To maximize clinical relevance and impact, this study focused on the following six prevalent and clinically significant thoracic conditions: pneumonia, pleural effusion, tuberculosis, mass, consolidation, and pneumothorax. These conditions include both specific diseases (e.g., pneumonia and tuberculosis) and radiological findings (e.g., pleural effusion, mass, and consolidation) that frequently indicate an underlying pathology. The selection was guided by global disease burden, potential for early intervention, and the diagnostic importance of CXR in routine clinical workflows. Pneumonia and tuberculosis remain leading causes of morbidity and mortality, particularly in low-resource settings with limited diagnostic tools [13]. Pleural effusion, masses, and consolidation are common thoracic abnormalities that, if detected early, can alter patient management and improve outcomes. Pneumothorax, although less prevalent, is a life-threatening condition requiring immediate identification and intervention [14]. For cases involving abnormal findings not otherwise classified among these six target diseases, an “Other” disease category was included to ensure broad diagnostic coverage without compromising specificity. This strategy enhances clinical utility while maintaining focus on conditions with the greatest potential impact on patient care.

2.2. Data Collection and Preparation

2.2.1. Development Dataset

To construct a comprehensive training dataset, we aggregated 17 public CXR datasets (Table 1) into 894,373 unique frontal-view images, excluding non-frontal projections to ensure clinical consistency. This strategy enhanced model generalizability by incorporating diverse patient demographics, imaging protocols, and disease prevalence. Five major datasets (NIH Chest X-Ray-14 [9], CheXpert [10], MIMIC-CXR [11], PadChest [15], and BRAX [16]) provided a substantial portion of the development dataset. These datasets employed natural language processing (NLP) tools or deep learning algorithms to automatically extract labels from radiological reports. These methods facilitate large-scale label generation but introduce variability in label quality, particularly in disease categories requiring subjective interpretation. To address this, we harmonized labels across datasets, treated low-confidence labels consistently, and ensured mutual exclusivity in training and test sets (see Appendix B for full details).
Table 1 highlights class imbalance—a pervasive challenge in public CXR datasets—across target diseases. To address this, we strategically adjusted per-disease sample counts and augmented underrepresented classes (tuberculosis, mass, consolidation, pneumothorax) via image duplication and online augmentation techniques. Augmentation was restricted to training and validation sets to preserve the integrity of the uniquely composed test set, thereby preventing data leakage. Table 2 provides the final class distribution across all dataset splits.
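As a rough illustration of this balancing step (not the exact pipeline used in the study), per-class duplication can be expressed as follows, assuming the training metadata are held in a pandas DataFrame with hypothetical one-hot disease columns:

```python
import pandas as pd

def oversample_by_duplication(df: pd.DataFrame, label_cols, target: int, seed: int = 0) -> pd.DataFrame:
    """Duplicate positive samples of underrepresented classes until each class
    reaches `target` rows. Images are copied verbatim; variety is introduced
    later by online augmentation during training."""
    extra = []
    for col in label_cols:
        positives = df[df[col] == 1]
        deficit = target - len(positives)
        if deficit > 0 and len(positives) > 0:
            extra.append(positives.sample(n=deficit, replace=True, random_state=seed))
    return pd.concat([df, *extra], ignore_index=True) if extra else df

# Example: bring each underrepresented target disease up to 40,000 training rows.
# balanced = oversample_by_duplication(train_df, ["tuberculosis", "mass", "consolidation", "pneumothorax"], target=40_000)
```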

2.2.2. External Validation Dataset

To assess generalization and fine-tuning efficacy, we constructed an independent validation dataset of 3031 unique-patient, posterior–anterior CXR images (July 2017–December 2019) from our hospital’s picture archiving and communication system (Table 2), ensuring no overlap with training data. This dataset enabled (1) an unbiased assessment of baseline model performance and (2) evaluation of domain-specific fine-tuning. To evaluate domain-specific fine-tuning, we used 10-fold cross-validation (training on k−1 folds and validating on the remaining fold), whereas non-fine-tuned models were directly tested on the external set to isolate domain adaptation effects.

2.3. Deep Learning Models

The three deep learning models used in this study—ResNet-50 [28], DenseNet-121 [29], and EfficientNet-B5 [30]—were selected for their proven effectiveness in natural image classification tasks and served as robust baselines for comparison. In addition, the Deep Learning-based Algorithm for Detecting 10 Abnormalities (DLAD-10) [5], a high-performance model designed specifically for CXR classification and adopted as the base model of a commercial service, was also included in the experiments.
All classification models were initialized with ImageNet-pretrained weights, and their inputs were normalized using the standard ImageNet mean and standard deviation. This preprocessing preserves the activation scaling expected by pretrained backbones and contributes to stable optimization during fine-tuning. In contrast, the U-Net segmentation model used to extract lung masks was trained entirely from scratch, without ImageNet initialization; therefore, segmentation inputs were processed using only min–max intensity normalization, ensuring consistency between preprocessing and initialization strategies.
For classification experiments, all models were trained and evaluated on the same pooled dataset to ensure consistent comparisons. Training was performed using the Adam optimizer with an initial learning rate of 1 × 10⁻³ and a batch size of 32. A comprehensive list of all hyperparameters, ensuring full reproducibility, is provided in Appendix A.
Although more recent architectures—including Vision Transformers [31], ConvNeXt [32], and large foundation models [33]—have demonstrated strong performance in medical image analysis, they require substantially greater computational resources and specialized pretraining pipelines. Consistent with our objective of evaluating scalable and broadly deployable diagnostic models, we focused on representative convolutional architectures that are widely used, computationally accessible, and well supported in open-source frameworks. This design choice is particularly relevant for resource-limited clinical environments, where training or deploying heavy transformer-based models or large foundation models may not be feasible. Accordingly, our evaluation centers on general-purpose convolutional neural network-based classifiers, including DLAD-10, an early commercial model, to assess whether large-scale public datasets alone can enable clinically acceptable performance without reliance on computationally intensive architectures.

2.4. UQ Using Monte Carlo (MC) Dropout

We employed MC dropout [34] to estimate model uncertainty by enabling stochastic dropout during inference, thereby generating multiple predictions for each input. This approach captures both aleatoric uncertainty (inherent data noise) and epistemic uncertainty (model or data limitations). Multiple forward passes were performed, and their predictions were aggregated to compute predictive entropy (PE) [35]—reflecting total uncertainty—and variance of predictions (VP) [36]—a direct measure of epistemic uncertainty. For robust estimation, we used a dropout rate of 0.4 and 20 stochastic passes, balancing computational efficiency with reliable metrics. The final class prediction was selected by averaging the probabilities across all passes. The dropout rate follows the standard configuration in previous studies [30,37], and the number of stochastic passes was chosen based on sensitivity analysis on the test set (see Appendix C for details).
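A minimal sketch of this inference procedure is given below, assuming a PyTorch multi-label classifier with one sigmoid output per class; the entropy and variance formulas follow a common convention for multi-label MC dropout and may differ in detail from the exact aggregation used in the study:

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_passes: int = 20):
    """Run MC dropout inference: keep dropout active at test time, average the
    per-pass probabilities, and summarize uncertainty as predictive entropy (PE)
    and variance of predictions (VP)."""
    model.eval()
    for m in model.modules():                   # re-enable dropout layers only,
        if isinstance(m, torch.nn.Dropout):     # leaving batch-norm statistics frozen
            m.train()

    probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_passes)])  # (N, B, C)
    mean_p = probs.mean(dim=0)                                               # (B, C)

    eps = 1e-8
    # PE: entropy of the averaged per-class Bernoulli outputs, summed over classes.
    pe = -(mean_p * (mean_p + eps).log() + (1 - mean_p) * (1 - mean_p + eps).log()).sum(dim=1)
    # VP: variance of the per-pass probabilities, averaged over classes.
    vp = probs.var(dim=0).mean(dim=1)
    return mean_p, pe, vp
```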

2.5. Statistical Analysis

Diagnostic performance was evaluated using precision, sensitivity, specificity, F1-score, and area under the receiver operating characteristic curve (AUROC). These metrics provided a comprehensive assessment of the ability of the model to accurately classify disease states in chest radiographs, offering insights into discriminative capability and clinical utility. A 10-fold cross-validation was conducted, with performance reported as the average values across all folds, and 95% confidence intervals were calculated to assess variability and reliability. Stratified splitting was applied to ensure that each fold was representative of all target diseases, and multiple random seeds were evaluated to prevent folds with zero samples in low-prevalence classes; a seed was ultimately selected that preserved at least two samples per disease in every fold.
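The seed-selection step can be sketched as follows, using plain shuffled K-fold splits as a simplified stand-in for the stratified splitter, with `labels` as a hypothetical multi-hot matrix of shape (samples, diseases):

```python
import numpy as np
from sklearn.model_selection import KFold

def find_valid_seed(labels: np.ndarray, n_splits: int = 10, min_per_fold: int = 2, max_tries: int = 100) -> int:
    """Return the first shuffling seed whose folds all contain at least
    `min_per_fold` positive samples for every disease column."""
    for seed in range(max_tries):
        folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(labels)
        if all(labels[val_idx].sum(axis=0).min() >= min_per_fold for _, val_idx in folds):
            return seed
    raise RuntimeError("No seed satisfied the per-class minimum; a stratified splitter may be required.")
```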
Class-specific decision thresholds (provided in Table 3 and Table 4) were optimized on the validation set by selecting operating points that favor sensitivity while avoiding excessive loss of specificity. Consistent with the clinical imperative to minimize false-negative findings—particularly for conditions such as tuberculosis, pneumonia, or pneumothorax, where missed diagnoses carry substantial risk—sensitivity was modestly prioritized. This approach acknowledges that while false positives necessitate additional confirmatory evaluation, they present significantly lower clinical risk than delayed diagnosis and treatment.

3. Results

3.1. Model Performance on Internal Validation Sets

Figure 1 and Table 3 summarize the performance of four deep learning models—ResNet, DenseNet, EfficientNet, and DLAD-10—on the internal validation sets. EfficientNet achieved the best overall performance, with the highest average AUROC (0.8944) among all models. This model consistently outperformed the other models in sensitivity (0.81) and F1-score (0.61), reflecting its ability to accurately detect diseases while maintaining a balance between precision and recall (Table 3). DLAD-10 closely followed with an average AUROC of 0.8810, and although EfficientNet had a slight edge overall, disease-specific analysis revealed instances where DLAD-10 outperformed EfficientNet in certain categories, indicating that the relative advantage can vary by disease. In contrast, ResNet and DenseNet had slightly lower average AUROCs (0.8654 and 0.8626, respectively), with DenseNet exhibiting a marginally better F1-score than ResNet.

3.2. Model Performance on External Validation Set

For the external validation set, analysis focused on EfficientNet, which demonstrated the best performance using the development dataset. As shown in Figure 2 and Table 4, EfficientNet demonstrated comparable or superior performance to DLAD-10 across most disease categories, consistent with its performance on the development dataset. Fine-tuning with a portion of the external validation dataset consistently improved the results of both models across most target diseases. However, performance variations were observed between internal and external datasets. Pneumonia, tuberculosis, and mass showed higher AUROC scores internally, whereas the external dataset yielded better performance for the remaining diseases. These differences likely stemmed from the smaller number of samples for certain diseases in the external dataset, affecting generalizability. Overall, EfficientNet maintained strong diagnostic accuracy on the external validation set, demonstrating adaptability to diverse clinical scenarios.

3.3. Impact of Training Data Scale and Diversity on Model Effectiveness

3.3.1. Diagnostic Performance Across Data Scales

To evaluate the impact of training data scale, we incrementally increased the dataset size (20%, 40%, 60%, and 100% of 40,000 CXR images per disease) while maintaining proportional representation from source datasets. Experiments were conducted under controlled conditions using a development dataset, with MC dropout applied during inference to estimate uncertainty.
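One simple way to realize such proportional subsampling, assuming a hypothetical `source` column that records which public dataset each image came from, is sketched below:

```python
import pandas as pd

def proportional_subset(df: pd.DataFrame, fraction: float, source_col: str = "source", seed: int = 0) -> pd.DataFrame:
    """Draw `fraction` of the training pool while preserving each source dataset's
    share, as in the 20%/40%/60%/100% scale experiments."""
    if fraction >= 1.0:
        return df
    return (
        df.groupby(source_col, group_keys=False)
          .apply(lambda g: g.sample(frac=fraction, random_state=seed))
          .reset_index(drop=True)
    )

# subsets = {f: proportional_subset(train_df, f) for f in (0.2, 0.4, 0.6, 1.0)}
```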
As presented in Table 5, model performance consistently improved with larger datasets: the mean AUROC rose from 0.8680 (20%) to 0.8937 (100%), and precision (0.47 to 0.51), sensitivity (0.80 to 0.82), specificity (0.78 to 0.81), and F1-score (0.57 to 0.61) all increased. Disease-specific results revealed substantial gains for pneumothorax (from 0.9115 to 0.9437), pleural effusion (from 0.8657 to 0.8978), mass (from 0.9416 to 0.9520), and pneumonia (from 0.8414 to 0.8905), underscoring the benefit of expanded training data. In contrast, tuberculosis showed a slight AUROC decline (from 0.9937 to 0.9926). Given the relatively small number of unique tuberculosis cases (3929 studies) and the substantial reliance on duplicated samples to reach the target dataset size, this trend may reflect the limited diversity of underlying image features. However, other factors—including label variability across datasets or differences in acquisition protocols—may also contribute, and therefore this behavior should be interpreted with caution. Consolidation and the “Other” disease category also improved, although less dramatically. Overall, these findings confirm that larger, more diverse training datasets enhance model performance for well-represented diseases but underscore the need for targeted strategies when addressing rare or underrepresented conditions, including tuberculosis.

3.3.2. Diagnostic Performance Across Data Diversity

We trained models using two configurations to assess the impact of data diversity. In the single-source setup, each disease category comprised 5000 samples drawn from one public dataset (Chest X-Ray-14, CheXpert, MIMIC-CXR, or PadChest), with duplicates added as needed. In the multiple-source setup, 1250 samples per disease were selected from each dataset to form a combined set of 5000 images, thereby increasing diversity. An independent test dataset (Table 2) ensured consistent evaluation, and all models were trained under uniform conditions.
Table 6 indicates that diagnostic performance generally improved with the multiple-source dataset: mean AUROC increased from 0.7061 (Chest X-Ray-14-only) and 0.7523 (MIMIC-CXR-only) to 0.7708, with sensitivity and specificity rising to 0.63 and 0.75, respectively. However, the degree of improvement varied across diseases and metrics. For example, the AUROC for mass increased substantially from 0.6765 to 0.8850, and similar trends were observed for pneumonia and pleural effusion, whereas the “Other” disease category exhibited only modest gains (AUROC from 0.5324 to 0.5669). In some cases, single-source datasets outperformed the multiple-source configuration for specific metrics (e.g., MIMIC-CXR-only achieved a higher specificity for pneumonia: 0.72 vs. 0.65).
Tuberculosis exhibited unique behavior because its samples were exclusively derived from the PadChest dataset, rendering its data effectively single-source even in the multiple-source configuration. The AUROC improved from 0.3891 (PadChest-only) to 0.6440 with the multiple-source dataset, while precision and specificity showed inconsistent trends. This discrepancy likely reflects the influence of diverse non-tuberculosis samples on overall model calibration, while limited variability in tuberculosis-specific data constrained further improvements. Overall, these results demonstrate that data diversity enhances performance for common thoracic conditions, while highlighting the need for targeted data collection strategies to improve generalization for rare or heterogeneous diseases.

3.3.3. Uncertainty and Data Scale Relationship

As presented in Table 7, a descriptive analysis of PE and VP across increasing training data scales (20%, 40%, 60%, and 100%) reveals that VP generally decreases or stabilizes across all disease categories. Pleural effusion, consolidation, other diseases, and no finding exhibited clear decreases followed by stable low VP values (e.g., consolidation: 0.0007 at 20% to 0.0004 from 40% onward). Mass showed the largest reduction (0.0009 to 0.0005 and then to 0.0004), while pneumothorax remained unchanged through 60% (0.0004) and then decreased at 100% (0.0003). Pneumonia displayed a mild non-monotonic pattern, and tuberculosis remained constant at 0.0001 across all data scales.
Conversely, PE varied substantially by class and did not follow a monotonic trend. Most diseases showed a noticeable reduction between 20% and 40% data—such as pneumonia (1.9592 to 1.616) and consolidation (2.2179 to 1.893)—but increased again at 60% or 100%. Additionally, pneumothorax decreased sharply at 40% and partially rebounded at higher scales, while tuberculosis demonstrated the largest fluctuation relative to its scale (0.1858 → 0.0812 → 0.3547 → 0.2172) despite stable VP. The “Other” disease category exhibited a distinct pattern, with PE increasing from 20% to 60% (1.0883 to 1.3439) but decreasing sharply at 100% (0.6445). This substantial drop is attributable to the broad heterogeneity of the class: with limited data, the model is highly uncertain because it has insufficient exposure to the diverse subpatterns within this category. As data volume increases, the model encounters a wider and more representative range of these patterns, reducing epistemic uncertainty and yielding more stable calibration across this heterogeneous class.
Overall, these results demonstrate that scaling the training data reliably reduces VP across diseases, whereas PE is driven predominantly by class-specific factors including heterogeneity, rarity, and label consistency. Therefore, total predictive uncertainty does not necessarily decrease with more data, even when epistemic uncertainty becomes stable.

4. Discussion

Our study demonstrates that large-scale, diverse public CXR datasets with noisy labels can effectively train deep learning models for disease classification, overcoming limitations of imperfect data through scalability and diversity. However, rare and underrepresented conditions remain challenging, underscoring the need for targeted strategies, including higher-quality data collection and advanced model architectures tailored to their characteristics.
A key contribution of our study is the integration of UQ to analyze how predictive certainty evolves with increasing data sizes and across disease categories. As shown in our results, VP consistently decreased or stabilized as the training data expanded from 20% to 40%, 60%, and 100%, indicating improved parameter certainty under more balanced data regimes. In contrast, PE displayed heterogeneous and frequently non-monotonic trajectories, demonstrating that total predictive uncertainty is shaped not only by dataset size but also by inherent class characteristics, including heterogeneity, annotation quality, and imaging variability. For example, pneumonia, pleural effusion, and consolidation showed an initial reduction in PE but exhibited secondary increases at larger data scales, suggesting the presence of persistent aleatoric uncertainty that cannot be resolved solely by increasing training samples. Pneumothorax exhibited a similar pattern, with PE decreasing substantially at 40% data and subsequently increasing at higher scales despite stable VP.
The tuberculosis category exhibited a distinctive uncertainty pattern: VP remained constant at 0.0001 across all data scales, whereas PE fluctuated substantially. This behavior likely reflects the limited diversity of tuberculosis images in the original datasets and the extensive oversampling required to match the training volume of other diseases. Because the model repeatedly encounters a narrow set of visual patterns during training, epistemic variability remains minimal, resulting in stable VP values. However, PE remains sensitive to residual data-level ambiguities—such as heterogeneous imaging conditions, variations in disease presentation, and inconsistencies in text-derived labels—leading to the observed fluctuations. This contrast highlights how VP and PE capture different aspects of model uncertainty: while VP reflects stability in parameter estimates under repeated sampling, PE is influenced more strongly by intrinsic variability within the underlying data distribution.
These class-specific uncertainty profiles can be further interpreted by examining the dataset composition and adjusted training counts presented in Table 1 and Table 2, respectively. Several rare diseases—most notably tuberculosis—were severely underrepresented in the original datasets (~3.9 k of ~1.08 M) and required extensive oversampling and augmentation to balance the training distribution. Under these conditions, VP naturally stabilized due to repeated sampling of similar examples, whereas PE remained sensitive to unresolved label inconsistencies and limited intra-class diversity. Mass and pneumothorax, both of which rely heavily on NLP-derived labels, also showed PE fluctuations consistent with known variability in automated labeling pipelines and cross-dataset domain heterogeneity. Conversely, the substantial PE reduction observed for the “Other” diseases category at 100% data reflects the stabilizing effect of its exceptionally large and diverse original sample pool; only when exposed to the full dataset could the model approximate the underlying distribution sufficiently to reduce total uncertainty.
Direct comparisons with previous studies are challenging because of differences in dataset composition, labeling quality, and model architectures. Many previous studies relied on private or mixed private-public datasets, whereas our approach used only public datasets, enhancing accessibility but introducing variability and noise. Moreover, our study incorporated UQ to assess predictive certainty, a dimension frequently overlooked in similar research. Tang et al. [38] demonstrated robust binary classification using well-established models, while our study extends this approach to multiclass classification using an even larger dataset and highlights the benefits of fine-tuning for domain adaptation. Wu et al. [6] developed a custom model to classify 72 thoracic diseases from 353,818 CXR images from the NIH and MIMIC datasets using NLP-based automated labeling, achieving a mean AUROC of 0.807—comparable to a third-year radiology resident. Similarly, Cid et al. [7] utilized 1,896,034 images from three UK hospitals to classify 37 diseases, attaining a mean AUROC of 0.864, underscoring the impact of dataset scale on diagnostic accuracy. More recently, Seah et al. [8] trained EfficientNet on 821,681 images from five public datasets (including NIH Chest X-ray-14 and MIMIC) with radiologist-labeled high-quality data, achieving an outstanding mean AUROC of 0.957. A key distinction of our study is the methodological focus imposed by our open-source approach. While these large-scale studies [7,8] utilized private or highly curated institutional data and concentrated on maximizing accuracy within custom, complex taxonomies, our use of entirely public datasets necessitates and introduces multi-dataset harmonization and a focus on cross-dataset generalization. This methodological foundation, combined with our uncertainty-aware analysis, demonstrates high diagnostic performance using fully open-source resources across clinically significant multiclass conditions, broadening the applicability of automated diagnostic solutions, particularly in resource-constrained settings.
This study had some limitations. First, the external validation dataset was derived from a single institution, and the uneven distribution of cases for different diseases may not reflect real-world prevalence, potentially limiting generalizability. Second, although we aggregated 17 public datasets, substantial upsampling was required for several rare disease classes to maintain controlled comparisons across categories. While all unique and duplicated sample counts are transparently reported, reliance on duplicated images may constrain feature diversity for these underrepresented classes. Third, the “Other” diseases category encompasses a broad and heterogeneous set of abnormalities, and although we provide detailed label mappings, such heterogeneity inevitably reduces class-specific interpretability compared with well-defined disease categories. Fourth, although labels were harmonized and cross-dataset duplicates were removed, residual domain shifts arising from differences in imaging protocols, patient demographics, and disease prevalence across the 17 public datasets may still affect model generalizability, particularly for conditions with inconsistent labeling strategies. Fifth, we focused on six thoracic conditions, which limits the applicability of our findings to other diseases. Sixth, we employed MC dropout as the sole UQ method, leaving alternative techniques unexplored. Finally, we utilized a subset of state-of-the-art deep learning models, which may not fully leverage more recent architectures, including Vision Transformers or large foundation models; while our findings on scalability and uncertainty are expected to generalize, evaluating these newer models remains an avenue for future investigation.

5. Conclusions

This study demonstrated the feasibility of leveraging public datasets and open-source deep learning models to achieve robust diagnostic performance in CXR analysis. Using data scale and diversity, our approach achieved results comparable to those obtained with proprietary datasets while providing a fully reproducible and accessible framework. Our uncertainty analysis further highlighted that, although epistemic uncertainty decreases with increasing data, total predictive uncertainty remains strongly influenced by class-specific characteristics, label quality, and inherent data heterogeneity. These findings underscore the need for more diverse and high-quality datasets—particularly for rare or underrepresented conditions—to further improve model reliability and stability. Despite these challenges, our study advances the democratization of AI tools for medical imaging by providing a scalable, uncertainty-aware framework that facilitates broader adoption in diverse and resource-constrained clinical settings.

Author Contributions

Conceptualization, B.-D.L.; Data curation, F.K.K., W.B.T., M.S.L., J.Y.K. and B.-D.L.; Formal analysis, F.K.K., W.B.T., M.S.L., and B.-D.L.; Investigation, all authors; Methodology, F.K.K., W.B.T., M.S.L. and B.-D.L.; Project administration, S.S.B. and B.-D.L.; Resources, S.S.B. and B.-D.L.; Software, F.K.K., W.B.T. and S.-W.P.; Validation, all authors; writing—original draft, all authors; writing—review and editing, all authors. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This retrospective study was approved by the Institutional Review Board of Keimyung University Dongsan Medical Center (DSMC-2021-02-021, date of approval 21 February 2021). All procedures were performed in accordance with the ethical standards of the Declaration of Helsinki.

Informed Consent Statement

Informed consent was waived due to the retrospective nature of this study. The data used in this study were fully de-identified to protect patient confidentiality.

Data Availability Statement

Public datasets used in this study are described in the text and tables and are available from the corresponding author upon request. The in-house dataset supporting these findings is maintained at Keimyung University Dongsan Hospital. Due to licensing restrictions, these data are not publicly available; however, access may be granted upon reasonable request and subject to approval by Keimyung University Dongsan Hospital.

Conflicts of Interest

Authors Farzeen Khalid Khan, Waleed Bin Tahir and Shi Sub Byon were employed by the company HealthHub Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AUROC: Area Under the Receiver Operating Characteristic Curve
CXR: Chest X-ray
DLAD-10: Deep Learning-based Algorithm for Detecting 10 Abnormalities
MC: Monte Carlo
MIMIC-CXR: Medical Information Mart for Intensive Care-Chest X-ray
NIH: National Institutes of Health
NLP: Natural Language Processing
PA: Posterior–Anterior
PE: Predictive Entropy
TB: Tuberculosis
UQ: Uncertainty Quantification
VP: Variance of Predictions

Appendix A

We implemented four deep learning architectures for CXR classification—ResNet-50, DenseNet-121, EfficientNet-B5, and DLAD-10—chosen for their demonstrated effectiveness in natural image tasks and established performance in chest radiograph analysis. The original classification layer of each model was replaced with a linear layer corresponding to the dataset labels. The models were trained as multi-label classifiers, producing independent sigmoid outputs for each class, using a binary cross-entropy-with-logits loss. This loss is suitable for multi-label classification, where each class is treated independently. The model parameters were optimized using the Adam optimizer. Class-specific decision thresholds were determined by evaluating performance metrics across a range of thresholds (0.0–1.0) on the validation set. Thresholds were selected at points where sensitivity and specificity were approximately balanced, with slight prioritization of sensitivity to minimize false negatives. The resulting disease-specific thresholds were subsequently applied unchanged to the test set to ensure unbiased evaluation.
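For illustration, the head replacement, loss, and threshold search described above could look as follows; the eight-class output (six target diseases plus the “Other” and “No finding” categories) and the 0.6/0.4 sensitivity weighting are assumptions for this sketch rather than the study's exact settings:

```python
import numpy as np
import torch
import torchvision

NUM_CLASSES = 8  # six target diseases + "Other" + "No finding" (assumed)

# Replace the ImageNet classification head with a multi-label output layer.
model = torchvision.models.efficientnet_b5(weights="IMAGENET1K_V1")
model.classifier[-1] = torch.nn.Linear(model.classifier[-1].in_features, NUM_CLASSES)

criterion = torch.nn.BCEWithLogitsLoss()                    # independent sigmoid per class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def pick_threshold(y_true: np.ndarray, y_prob: np.ndarray, grid=np.linspace(0.0, 1.0, 101)) -> float:
    """Scan candidate thresholds for one class and keep the operating point where
    sensitivity and specificity are roughly balanced, mildly favoring sensitivity."""
    best_t, best_score = 0.5, -1.0
    for t in grid:
        pred = (y_prob >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum()); fn = int(((pred == 0) & (y_true == 1)).sum())
        tn = int(((pred == 0) & (y_true == 0)).sum()); fp = int(((pred == 1) & (y_true == 0)).sum())
        sens = tp / max(tp + fn, 1); spec = tn / max(tn + fp, 1)
        score = 0.6 * sens + 0.4 * spec                     # illustrative weighting only
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```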
Before input into the models, each CXR image was subjected to lung segmentation using a U-Net [39] architecture trained specifically for this study. The model was initialized with random weights and trained on a combined dataset of 7951 images from the Mendeley Chest X-ray Lung Segmentation dataset [40] and the MIMIC-CXR segmentation subset [41], split 70/20/10 into training, validation, and test sets. During segmentation model training, data augmentation included horizontal flipping, color jittering (brightness, contrast, and saturation ±0.5; hue ±0.3), conversion to grayscale, and resizing to 512 × 512 pixels. The model achieved a Dice coefficient of 0.9251 on an independent test set (n = 796), indicating accurate delineation of lung fields (see Table A1). The resulting segmentation masks were used to crop each image to the lung region, removing non-informative background areas. The cropped images were subsequently resized to 448 × 448 pixels.
For classification model training, online data augmentation was applied using PyTorch 2.3.0, deliberately employing milder transformations than those used for segmentation (random rotations up to ±20° and color jittering with brightness, contrast, and saturation ±0.2 and hue ±0.1) to preserve critical anatomical detail while still promoting generalization. To reduce overfitting, each image was uniquely augmented at every iteration. Subsequently, input images were resized and normalized using the standard ImageNet mean and standard deviation to ensure compatibility with the ImageNet-pretrained backbone used for initialization, preventing a distribution mismatch that could degrade performance. While ImageNet statistics originate from natural RGB images, their use for chest X-rays is widely adopted and empirically validated when using pretrained models. Chest X-rays are inherently single-channel; they are duplicated across three channels to match the input requirements of ImageNet-pretrained models. Therefore, normalization primarily standardizes the input range and stabilizes gradients without distorting diagnostic structures. Furthermore, widely used medical imaging benchmarks—including the CheXpert [10], MIMIC-CXR [11], RSNA [18], and SIIM-ACR pneumothorax [19] datasets—commonly adopt ImageNet normalization with pretrained backbones, demonstrating its practicality and suitability for radiography tasks. Model training was configured for up to 30 epochs with early stopping and a batch size of 32. The initial learning rate of 0.001 was reduced upon a validation loss plateau. Fine-tuning employed the same parameters with a reduced learning rate of 0.0001.
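A sketch of the two preprocessing pipelines (classification versus segmentation) using torchvision transforms is shown below; joint handling of image–mask pairs during the segmentation augmentations is omitted for brevity:

```python
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# Classification: applied to lung-cropped radiographs, replicated to three channels.
train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Segmentation (U-Net trained from scratch): stronger jitter, 512 x 512, no ImageNet statistics.
seg_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.3),
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((512, 512)),
    transforms.ToTensor(),  # scales 8-bit pixels to [0, 1]; a per-image min-max rescaling could replace this step
])
```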
Table A1. Performance metrics of the U-Net architecture used for lung segmentation (preprocessing) on the independent test set (n = 796).
Metric | Value
Dice Coefficient | 0.9251
IoU | 0.8573
Precision | 0.9396
Recall | 0.9073

Appendix B

To ensure consistent and high-quality annotations across the 17 aggregated CXR datasets, several harmonization steps were applied. Only frontal-view CXRs (AP/PA) were included to maintain clinical consistency. Labels with low confidence or uncertainty (e.g., CheXpert’s “Uncertain”) were excluded from the training set to maintain label reliability. Annotation protocols varied across datasets in both naming conventions and label granularity. To ensure consistency, labels referring to related abnormalities (e.g., “mass” and “nodule”) were consolidated into a single Mass category. Findings outside the primary target disease set, including cardiomegaly, atelectasis, and fibrosis, were grouped into an “Other” disease category to maintain a unified label structure. PadChest [15] provides 174 radiographic findings extracted via NLP from full radiology reports; for this dataset, a target disease was marked positive whenever its corresponding condition appeared within PadChest’s annotated label list. Duplicate studies across datasets were identified and removed to prevent overlap between the training and evaluation sets, particularly for large public datasets, including CheXpert [10] and MIMIC-CXR [11].
To mitigate class imbalance, images from underrepresented classes were duplicated in the training set. Each image was copied without alteration to preserve its original characteristics. The distribution of original images, duplicates added, and total images after duplication is summarized in Table A2.
All harmonization rules and label mappings are mentioned in Table A3, enabling full reproducibility and transparency in dataset construction.
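As an illustration of the mapping rules summarized in Table A3, a hypothetical per-dataset dictionary (shown here for a few NIH Chest X-ray-14 labels) can drive the harmonization:

```python
# Hypothetical excerpt of the NIH Chest X-ray-14 mapping from Table A3: related findings
# are merged into "Mass", and anything outside the target set becomes "Other".
NIH_TO_TARGET = {
    "Effusion": "Pleural Effusion",
    "Mass": "Mass",
    "Nodule": "Mass",
    "Pneumonia": "Pneumonia",
    "Atelectasis": "Other",
    "Cardiomegaly": "Other",
    "Infiltration": "Other",
}

def harmonize(original_labels, mapping=NIH_TO_TARGET):
    """Map a study's source-specific labels onto the harmonized target classes."""
    mapped = {mapping.get(lbl, "Other") for lbl in original_labels}
    return sorted(mapped) if mapped else ["No finding"]

# harmonize(["Nodule", "Effusion"]) -> ["Mass", "Pleural Effusion"]
```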
Table A2. Class distribution before and after image duplication: (a) Training Set, (b) Validation Set, and (c) Test Set.
(a) Training Set
Class | Unique Images | Duplicates Added | Total Images
Pneumonia | 35,987 | 4013 | 40,000
Pleural Effusion | 33,360 | 6640 | 40,000
Tuberculosis | 2301 | 37,699 | 40,000
Mass | 12,156 | 27,844 | 40,000
Consolidation | 31,002 | 8998 | 40,000
Pneumothorax | 35,160 | 4840 | 40,000
Other diseases | 67,576 | 949 | 68,525
No finding | 39,999 | 0 | 39,999
(b) Validation Set
Class | Unique Images | Duplicates Added | Total Images
Pneumonia | 1479 | 7758 | 9237
Pleural Effusion | 2996 | 6725 | 9721
Tuberculosis | 391 | 7429 | 7820
Mass | 506 | 9257 | 9763
Consolidation | 530 | 9354 | 9884
Pneumothorax | 553 | 9429 | 9982
Other diseases | 1420 | 10,857 | 12,277
No finding | 9807 | 0 | 9807
(c) Test Set
Class | Unique Images | Duplicates Added | Total Images
Pneumonia | 2421 | 0 | 2421
Pleural Effusion | 2990 | 0 | 2990
Tuberculosis | 1237 | 0 | 1237
Mass | 2945 | 0 | 2945
Consolidation | 2988 | 0 | 2988
Pneumothorax | 2998 | 0 | 2998
Other diseases | 3024 | 0 | 3024
No finding | 2936 | 0 | 2936
Table A3. Label Mapping Across Datasets.
Original Dataset Label | Harmonized Label | Notes (rows grouped by source dataset)
NIH Chest X-ray-14 [9]:
  Atelectasis | Other | Not part of target diseases
  Cardiomegaly | Other | Not part of target diseases
  Effusion | Pleural Effusion | Retained as-is
  Infiltration | Other | Not part of target diseases
  Mass | Mass | Combined for consistency
  Nodule | Mass | Combined for consistency
  Pneumonia | Pneumonia | Retained as-is
CheXpert [10], MIMIC-CXR [11], BRAX [16]:
  Enlarged Cardiomediastinum | Other | Not part of target diseases
  Cardiomegaly | Other | Not part of target diseases
  Lung Lesion | Other | Not part of target diseases
  Lung Opacity | Other | Not part of target diseases
  Edema | Other | Not part of target diseases
  Consolidation | Consolidation | Retained as-is
  Pneumonia | Pneumonia | Retained as-is
  Atelectasis | Other | Not part of target diseases
  Pneumothorax | Pneumothorax | Retained as-is
  Pleural Effusion | Pleural Effusion | Retained as-is
  Pleural Other | Other | Not part of target diseases
  Fracture | Other | Not part of target diseases
  Support Devices | Support Devices | Retained as-is
PadChest [15] *:
  effusion | Pleural Effusion | Target disease assigned positive if present in PadChest label list
  nodule | Mass
  mass | Mass
  consolidation | Consolidation
  pneumonia | Pneumonia
  tuberculosis | Tuberculosis
VinDr-CXR [17]:
  Aortic enlargement | Other | Not part of target diseases
  Atelectasis | Other | Not part of target diseases
  Cardiomegaly | Other | Not part of target diseases
  Calcification | Other | Not part of target diseases
  Clavicle fracture | Other | Not part of target diseases
  Consolidation | Consolidation | Retained as-is
  Edema | Other | Not part of target diseases
  Emphysema | Other | Not part of target diseases
  Enlarged PA | Other | Not part of target diseases
  Interstitial lung disease (ILD) | Other | Not part of target diseases
  Infiltration | Other | Not part of target diseases
  Lung cavity | Other | Not part of target diseases
  Lung cyst | Other | Not part of target diseases
  Lung opacity | Other | Not part of target diseases
  Mediastinal shift | Other | Not part of target diseases
  Nodule/Mass | Mass | Renamed for consistency
  Pulmonary fibrosis | Other | Not part of target diseases
  Pneumothorax | Pneumothorax | Retained as-is
  Pleural thickening | Other | Not part of target diseases
  Pleural effusion | Pleural Effusion | Retained as-is
  Rib fracture | Other | Not part of target diseases
  Other lesion | Other | Not part of target diseases
  Lung tumor | Mass | Combined for consistency
  Pneumonia | Pneumonia | Retained as-is
  Tuberculosis | Tuberculosis | Retained as-is
  Other diseases | Other | Not part of target diseases
  Chronic obstructive pulmonary disease (COPD) | Other | Not part of target diseases
RSNA Pneumonia Detection Challenge [18]:
  Pneumonia | Pneumonia | Retained as-is
SIIM-ACR Pneumothorax Segmentation [19]:
  Pneumothorax | Pneumothorax | Retained as-is
JSRT [20]:
  Lung Nodule | Mass | Renamed for consistency
Shenzhen Hospital CXR Set [21] **:
  Tuberculosis | Tuberculosis | Retained as-is
  Effusion | Pleural Effusion | Extracted from ‘findings’
  Mass | Mass | Extracted from ‘findings’
  Consolidation | Consolidation | Extracted from ‘findings’
  Pleural thickening | Other | Not part of target diseases
  Fibrous lesions | Other | Not part of target diseases
Montgomery County chest X-ray set (MC) [21]:
  Tuberculosis | Tuberculosis | Retained as-is
COVID-19, Pneumonia and Normal Chest X-ray PA Dataset [22]:
  Pneumonia | Pneumonia | Retained as-is
  COVID-19 | Other | Not part of target diseases
Chest X-Ray Images (Pneumonia) [23]:
  Pneumonia | Pneumonia | Retained as-is
Tuberculosis (TB) Chest X-ray Database [24], TBX11K [25], Belarus Dataset [26], Chest X-rays tuberculosis from India [27]:
  Tuberculosis | Tuberculosis | Retained as-is
* PadChest annotations comprise 174 NLP-derived radiographic findings; target disease labels in this study were assigned as positive if the matching condition was present in the PadChest findings. ** Shenzhen annotations are recorded in the findings column, including descriptors for conditions other than tuberculosis.

Appendix C

To ensure the reliability of the uncertainty estimation reported in this study, we performed a justification and sensitivity analysis for the two key hyperparameters of the Monte Carlo Dropout (MCD) method: the dropout rate and the number of stochastic forward passes.
The dropout rate was set to 0.4. This selection strictly adheres to the original architectural design of EfficientNet-B5 proposed by Tan and Le [30] to balance model capacity and regularization. This value is also the established default in standard deep learning framework implementations of this architecture [37], ensuring that feature distributions remain consistent with the pre-trained weights used for transfer learning.
We determined the optimal number of stochastic forward passes (N) for inference by analyzing the convergence of the uncertainty metrics (predictive entropy and variance of predictions). We evaluated the model predictions with 5, 10, 15, and 20 passes to identify the point where the estimation of epistemic uncertainty stabilized. As presented in Table A4, the mean variance of predictions converged by 10 passes (3.75 × 10⁻⁴) and remained constant through 15 and 20 passes. Similarly, predictive entropy showed asymptotic stability between 15 and 20 passes. Based on this diminishing-returns analysis, 20 stochastic passes were selected as the operating point. This ensures that the uncertainty trends reported in our results are attributable to data properties rather than stochastic estimation noise.
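The convergence check can be reproduced with a routine of the following form, assuming the per-pass sigmoid probabilities have been collected once with the maximum number of passes:

```python
import numpy as np

def pass_count_sensitivity(per_pass_probs: np.ndarray, candidates=(5, 10, 15, 20)):
    """per_pass_probs: (max_passes, n_samples, n_classes) array from MC dropout.
    Returns mean predictive entropy, mean variance, and the relative change in
    variance for each candidate number of passes."""
    rows, prev_vp = [], None
    for n in candidates:
        p = per_pass_probs[:n]
        mean_p = p.mean(axis=0)
        eps = 1e-8
        pe = -(mean_p * np.log(mean_p + eps) + (1 - mean_p) * np.log(1 - mean_p + eps)).sum(axis=1).mean()
        vp = p.var(axis=0).mean()
        rel = None if prev_vp is None else 100.0 * (vp - prev_vp) / prev_vp  # percent change
        rows.append((n, float(pe), float(vp), rel))
        prev_vp = vp
    return rows
```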
Table A4. Sensitivity analysis of uncertainty metrics with respect to the number of stochastic passes (N).
Stochastic Passes (N) | Mean Predictive Entropy | Mean Variance of Predictions | Relative Change (Variance)
5 | 1.408 | 3.13 × 10⁻⁴ | -
10 | 1.142 | 3.75 × 10⁻⁴ | +19.8%
15 | 1.146 | 3.75 × 10⁻⁴ | 0.0% (Converged)
20 (Selected) | 1.146 | 3.75 × 10⁻⁴ | 0.0% (Stable)

References

  1. Raoof, S.; Feigin, D.; Sung, A.; Raoof, S.; Irugulpati, L.; Rosenow, E.C. Interpretation of plain chest roentgenogram. Chest 2012, 141, 545–558. [Google Scholar] [CrossRef] [PubMed]
  2. WHO (World Health Organization). Chest Radiography in Tuberculosis Detection: Summary of Current WHO Recommendations and Guidance on Programmatic Approaches; World Health Organization: Geneva, Switzerland, 2021; Available online: https://www.who.int/publications/i/item/9789241511506 (accessed on 30 October 2025).
  3. Ellis, S.; Aziz, Z. Radiology as an aid to diagnosis in lung disease. Postgrad. Med. J. 2016, 92, 620–623. [Google Scholar] [CrossRef]
  4. Rimmer, A. Radiologist shortage leaves patient care at risk, warns royal college. BMJ 2017, 359, j4683. [Google Scholar] [CrossRef]
  5. Nam, J.G.; Kim, M.; Park, J.; Hwang, E.J.; Lee, J.H.; Hong, J.H.; Goo, J.M.; Park, C.M. Development and validation of a deep learning algorithm detecting 10 common abnormalities on chest radiographs. Eur. Respir. J. 2021, 57, 2003061. [Google Scholar] [CrossRef] [PubMed]
  6. Wu, J.T.; Wong, K.C.L.; Gur, Y.; Ansari, N.; Karargyris, A.; Sharma, A.; Morris, M.; Saboury, B.; Ahmad, H.; Boyko, O.; et al. Comparison of chest radiograph interpretations by artificial intelligence algorithm vs radiology residents. JAMA Netw. Open 2020, 3, e2022779. [Google Scholar] [CrossRef] [PubMed]
  7. Cid, Y.D.; Macpherson, M.; Gervais-Andre, L.; Zhu, Y.; Franco, G.; Santeramo, R.; Lim, C.; Selby, I.; Muthuswamy, K.; Amlani, A.; et al. Development and validation of open-source deep neural networks for comprehensive chest x-ray reading: A retrospective, multicentre study. Lancet Digit. Health 2024, 6, e44–e57. [Google Scholar] [CrossRef]
  8. Seah, J.C.Y.; Tang, C.H.M.; Buchlak, Q.D.; Holt, X.G.; Wardman, J.B.; Aimoldin, A.; Esmaili, N.; Ahmad, H.; Pham, H.; Lambert, J.F.; et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: A retrospective, multireader multicase study. Lancet Digit. Health 2021, 3, e496–e506. [Google Scholar] [CrossRef]
  9. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3462–3471. [Google Scholar] [CrossRef]
  10. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 590–597. [Google Scholar] [CrossRef]
  11. Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef]
  12. Khan, R.F.; Lee, B.D.; Lee, M.S. Transformers in medical image segmentation: A narrative review. Quant. Imaging Med. Surg. 2023, 13, 8747–8767. [Google Scholar] [CrossRef]
  13. World Health Organisation. Global Tuberculosis Report 2020; World Health Organization: Geneva, Switzerland, 2020; Available online: https://www.who.int/publications/i/item/9789240013131 (accessed on 30 October 2025).
  14. Rueckel, J.; Huemmer, C.; Fieselmann, A.; Ghesu, F.C.; Mansoor, A.; Schachtner, B.; Wesp, P.; Trappmann, L.; Munawwar, B.; Ricke, J.; et al. Pneumothorax detection in chest radiographs: Optimizing artificial intelligence system for accuracy and confounding bias reduction using in-image annotations in algorithm training. Eur. Radiol. 2021, 31, 7888–7900. [Google Scholar] [CrossRef]
  15. Bustos, A.; Pertusa, A.; Salinas, J.M.; de la Iglesia-Vayá, M. PadChest: A large chest x-ray image dataset with multi-label annotated reports. Med. Image Anal. 2020, 66, 101797. [Google Scholar] [CrossRef]
  16. Reis, E.P.; Paiva, J.; Bueno da Silva, M.C.; Sousa Ribeiro, G.A.; Fornasiero Paiva, V.; Bulgarelli, L.; Lee, H.; Dos Santos, P.V.; Brito, V.; Amaral, L.; Beraldo, G.; et al. BRAX, a Brazilian Labeled Chest X-Ray Dataset (version 1.1.0). PhysioNet, 17 June 2022. [Google Scholar] [CrossRef]
  17. Nguyen, H.Q.; Pham, H.H.; Linh, T.; Dao, M.; Khanh, L. VinDr-CXR: An Open Dataset of Chest X-Rays with Radiologist Annotations (version 1.0.0). PhysioNet, 22 June 2021. [Google Scholar] [CrossRef]
  18. Stein, A.; Wu, C.; Carr, C.; Shih, G.; Dulkowski, J.; Kalpathy, L.C.; Prevedello, L.; Kohli, M.; McDonald, M.; Peter, P.C.; et al. RSNA Pneumonia Detection Challenge. Kaggle. 2018. Available online: https://kaggle.com/competitions/rsna-pneumonia-detection-challenge (accessed on 30 October 2025).
  19. Zawacki, A.; Wu, C.; Shih, G.; Elliott, J.; Fomitchev, M.; Hussain, M.; Lakhani, P.; Culliton, P.; Bao, S. SIIM-ACR Pneumothorax Segmentation. Kaggle. 2019. Available online: https://kaggle.com/competitions/siim-acr-pneumothorax-segmentation (accessed on 30 October 2025).
  20. Shiraishi, J.; Katsuragawa, S.; Ikezoe, J.; Matsumoto, T.; Kobayashi, T.; Komatsu, K.; Matsui, M.; Fujita, H.; Kodera, Y.; Doi, K. Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. AJR Am. J. Roentgenol. 2000, 174, 71–74. [Google Scholar] [CrossRef] [PubMed]
  21. Jaeger, S.; Candemir, S.; Antani, S.; Wáng, Y.X.J.; Lu, P.X.; Thoma, G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant. Imaging Med. Surg. 2014, 4, 475–477. [Google Scholar] [CrossRef] [PubMed]
  22. Asraf, A.; Islam, Z. COVID19, Pneumonia and Normal Chest X-Ray PA Dataset. 2021. Version 1. Available online: https://data.mendeley.com/datasets/jctsfj2sfn/1 (accessed on 30 October 2025).
  23. Mooney, P.T. Chest X-Ray Images (Pneumonia). Kaggle. 2018. Available online: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia (accessed on 30 October 2025).
  24. Rahman, T.; Khandakar, A.; Chowdhury, M.E.H. Tuberculosis (TB) chest X-ray database. IEEE Dataport, 2020. [Google Scholar] [CrossRef]
  25. Liu, Y.; Wu, Y.-H.; Ban, Y.; Wang, H.; Cheng, M.-M. Rethinking computer-aided tuberculosis diagnosis. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2643–2652. [Google Scholar] [CrossRef]
  26. Pasa, F.; Golkov, V.; Pfeiffer, F.; Cremers, D.; Pfeiffer, D. Efficient deep network architectures for fast chest X-ray tuberculosis screening and visualization. Sci. Rep. 2019, 9, 6268. [Google Scholar] [CrossRef] [PubMed]
  27. Raddar. Chest X-Rays Tuberculosis from India. Kaggle. 2021. Available online: https://www.kaggle.com/datasets/raddar/chest-xrays-tuberculosis-from-india/data (accessed on 30 October 2025).
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  29. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  30. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Available online: https://proceedings.mlr.press/v97/tan19a.html (accessed on 30 October 2025).
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  32. Todi, A.; Narula, N.; Sharma, M.; Gupta, U. ConvNext: A contemporary architecture for convolutional neural networks for image classification. In Proceedings of the 2023 3rd International Conference on Innovative Sustainable Computational Technologies (CISCT), Dehradun, India, 8–9 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
  33. Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image–text pairs. arXiv 2023, arXiv:2303.00915. [Google Scholar]
  34. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the ICML’16: Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  35. Tyralis, H.; Papacharalampous, G. A review of predictive uncertainty estimation with machine learning. Artif. Intell. Rev. 2024, 57, 94. [Google Scholar] [CrossRef]
  36. Choubineh, A.; Chen, J.; Coenen, F.; Ma, F. Applying Monte Carlo dropout to quantify the uncertainty of skip connection-based convolutional neural networks optimized by big data. Electronics 2023, 12, 1453. [Google Scholar] [CrossRef]
  37. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Vancouver, BC, Canada, 2019; Volume 32, pp. 8024–8035. [Google Scholar]
  38. Tang, Y.X.; Tang, Y.B.; Peng, Y.; Yan, K.; Bagheri, M.; Redd, B.A.; Brandon, C.J.; Lu, Z.; Han, M.; Xiao, J.; et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digit. Med. 2020, 3, 70. [Google Scholar] [CrossRef] [PubMed]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W., Frangi, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9351. [Google Scholar] [CrossRef]
  40. Danilov, V.; Proutski, A.; Kirpich, A.; Litmanovich, D.; Gankin, Y. Chest X-Ray Dataset for Lung Segmentation (version 2). Mendeley Data, 3 October 2022. [Google Scholar] [CrossRef]
  41. Chen, L.; Kuo, P.; Wang, R.; Gichoya, J.; Celi, L.A. Chest X-Ray Segmentation Images Based on MIMIC-CXR (version 1.0.0). PhysioNet, 18 August 2022; RRID:SCR_007345. [Google Scholar] [CrossRef]
Figure 1. Comparison of AUROCs for deep learning models across different diseases using the development dataset. (a) Pneumonia, (b) Pleural effusion, (c) Tuberculosis, (d) Mass, (e) Consolidation, (f) Pneumothorax, (g) Other diseases, and (h) No finding (AUROC: area under the receiver operating characteristic curve).
Figure 2. Comparison of AUROCs for deep learning models across different diseases using the external validation dataset. (a) Pneumonia, (b) Pleural effusion, (c) Tuberculosis, (d) Mass, (e) Pneumothorax, (f) Other diseases, and (g) No finding (AUROC: area under the receiver operating characteristic curve).
Table 1. Summary of publicly available CXR datasets used in the study (NLP: Natural Language Processing, NN: Neural Network).
Datasets | No. of Labels | Annotation Method | Pneumonia | Pleural Effusion | Tuberculosis | Mass | Consolidation | Pneumothorax | Other Diseases | No Finding | Total No. of CXRs
NIH Chest X-ray-14 [9] | 14 | An NLP Tool | 1431 | 13,317 | 0 | 11,207 | 4667 | 5302 | 36,251 | 60,361 | 112,120
CheXpert [10] | 14 | An NLP Tool | 3738 | 79,713 | 0 | 0 | 12,170 | 16,257 | 144,789 | 15,892 | 224,316
MIMIC-CXR [11] | 14 | An NLP Tool | 26,221 | 76,954 | 0 | 0 | 14,675 | 14,257 | 180,251 | 143,351 | 377,110
PadChest [15] | 19 | 27% of reports were manually annotated and the rest using a supervised NN | 4138 | 5441 | 647 | 3034 | 1426 | 345 | 52,515 | 28,999 | 160,000
VinDr-CXR [17] | 28 | Radiologists | 1229 | 1430 | 1006 | 1080 | 444 | 133 | 6809 | 10,606 | 100,000
RSNA Pneumonia Detection Challenge [18] | 2 | Radiologists | 6012 | 0 | 0 | 0 | 0 | 0 | 0 | 20,672 | 26,684
SIIM-ACR Pneumothorax Segmentation [19] | 2 | Radiologists | 0 | 0 | 0 | 0 | 0 | 2379 | 0 | 8296 | 10,675
JSRT [20] | 2 | Radiologists | 0 | 0 | 0 | 154 | 0 | 0 | 93 | 0 | 247
Shenzhen Hospital CXR Set [21] | 4 | Radiologists | 0 | 34 | 336 | 132 | 18 | 0 | 272 | 326 | 662
Montgomery County chest X-ray set (MC) [21] | 2 | Radiologists | 0 | 0 | 58 | 0 | 0 | 0 | 0 | 80 | 138
BRAX [16] | 14 | An NLP Tool | 264 | 532 | 0 | 0 | 1120 | 38 | 3612 | 14,782 | 40,967
COVID-19, Pneumonia and Normal Chest X-ray PA Dataset [22] | 3 | Radiologists | 1525 | 0 | 0 | 0 | 0 | 0 | 0 | 1525 | 4575
Chest X-Ray Images (Pneumonia) [23] | 2 | Radiologists | 3875 | 0 | 0 | 0 | 0 | 0 | 0 | 1341 | 5856
Tuberculosis (TB) Chest X-ray Database [24] | 2 | Radiologists | 0 | 0 | 700 | 0 | 0 | 0 | 0 | 3500 | 4200
TBX11K [25] | 5 | Radiologists | 0 | 0 | 800 | 0 | 0 | 0 | 3800 | 3800 | 11,200
Belarus Dataset [26] | 1 | Radiologists | 0 | 0 | 304 | 0 | 0 | 0 | 0 | 0 | 304
Chest X-rays tuberculosis from India [27] | 2 | Radiologists | 0 | 0 | 78 | 0 | 0 | 0 | 0 | 77 | 156
Total | | | 48,433 | 177,421 | 3929 | 15,607 | 34,520 | 38,711 | 428,392 | 313,608 | 1,079,210
Table 2. Data distribution per target disease for the training, validation, and testing sets of the deep learning models.
Target Diseases | No. of Total CXR Images (Development) | Training | Validation | Testing | No. of Total CXR Images (External Validation)
Pneumonia | 48,433 | 40,000 | 9237 | 2421 | 10
Pleural Effusion | 177,421 | 40,000 | 9721 | 2990 | 422
Tuberculosis | 3929 | 40,000 | 7820 | 1237 | 318
Mass | 15,607 | 40,000 | 9763 | 2945 | 69
Consolidation | 34,520 | 40,000 | 9884 | 2988 | 0
Pneumothorax | 38,711 | 40,000 | 9982 | 2998 | 1312
Other diseases | 428,299 | 68,525 | 12,277 | 3024 | 843
No finding | 313,701 | 39,999 | 9807 | 2936 | 240
Table 3. Performance of deep learning models and selected thresholds for each disease using the development dataset.
Target Diseases | Threshold (ResNet / DenseNet / EfficientNet / DLAD-10) | ResNet | DenseNet | EfficientNet | DLAD-10
Pneumonia | 0.28 / 0.24 / 0.255 / 0.20 | (0.31, 0.74, 0.74, 0.43) | (0.30, 0.74, 0.74, 0.43) | (0.37, 0.80, 0.79, 0.51) | (0.33, 0.74, 0.76, 0.46)
Pleural Effusion | 0.25 / 0.145 / 0.215 / 0.16 | (0.45, 0.80, 0.80, 0.57) | (0.45, 0.78, 0.81, 0.57) | (0.47, 0.83, 0.81, 0.60) | (0.46, 0.81, 0.81, 0.59)
Tuberculosis | 0.50 / 0.50 / 0.50 / 0.50 | (0.98, 0.73, 1.00, 0.84) | (0.98, 0.81, 1.00, 0.89) | (0.98, 0.89, 1.00, 0.93) | (0.99, 0.86, 1.00, 0.92)
Mass | 0.35 / 0.29 / 0.10 / 0.29 | (0.51, 0.82, 0.84, 0.63) | (0.51, 0.84, 0.84, 0.64) | (0.57, 0.87, 0.87, 0.69) | (0.56, 0.86, 0.87, 0.68)
Consolidation | 0.18 / 0.19 / 0.20 / 0.20 | (0.36, 0.75, 0.74, 0.49) | (0.36, 0.74, 0.74, 0.48) | (0.40, 0.76, 0.77, 0.52) | (0.38, 0.76, 0.76, 0.51)
Pneumothorax | 0.18 / 0.15 / 0.18 / 0.18 | (0.45, 0.81, 0.81, 0.58) | (0.47, 0.81, 0.82, 0.60) | (0.58, 0.86, 0.88, 0.70) | (0.59, 0.86, 0.88, 0.70)
Other diseases | 0.35 / 0.30 / 0.38 / 0.38 | (0.29, 0.66, 0.67, 0.40) | (0.27, 0.65, 0.65, 0.38) | (0.28, 0.65, 0.67, 0.39) | (0.27, 0.64, 0.66, 0.38)
No finding | 0.23 / 0.24 / 0.30 / 0.24 | (0.40, 0.77, 0.78, 0.53) | (0.40, 0.77, 0.77, 0.53) | (0.45, 0.80, 0.81, 0.58) | (0.44, 0.80, 0.80, 0.57)
Data in parentheses indicate precision, sensitivity, specificity, and F1-score.
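For clarity, the sketch below shows how the four threshold-dependent metrics reported in parentheses (precision, sensitivity, specificity, and F1-score) can be derived from binary predictions at a fixed operating threshold. The labels and probabilities are illustrative placeholders, not values from the study.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold):
    """Precision, sensitivity, specificity, and F1-score for one target
    condition, with predictions binarized at a fixed probability threshold."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))

    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return precision, sensitivity, specificity, f1

# Example at the EfficientNet pneumonia threshold of 0.255
# (placeholder labels/probabilities, not study data).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_prob = [0.61, 0.12, 0.30, 0.44, 0.05, 0.20, 0.27, 0.09]
print(binary_metrics(y_true, y_prob, threshold=0.255))
```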
Table 4. Performance of deep learning models and selected thresholds for each disease using the external validation dataset.
Target Diseases | Thresholds (EfficientNet / DLAD-10) | With Fine-Tuning: EfficientNet | With Fine-Tuning: DLAD-10 | Without Fine-Tuning: EfficientNet | Without Fine-Tuning: DLAD-10
Pneumonia | 0.255 / 0.20 | (0.00, 0.00, 1.00, 0.00) | (0.00, 0.00, 1.00, 0.00) | (0.01, 0.40, 0.76, 0.01) | (0.01, 0.70, 0.68, 0.01)
Pleural Effusion | 0.215 / 0.16 | (0.59, 0.78, 0.91, 0.67) | (0.55, 0.83, 0.89, 0.66) | (0.30, 0.70, 0.74, 0.42) | (0.31, 0.74, 0.73, 0.44)
Tuberculosis | 0.50 / 0.50 | (0.82, 0.88, 0.98, 0.84) | (0.84, 0.80, 0.98, 0.80) | (0.15, 0.08, 0.95, 0.10) | (0.23, 0.08, 0.97, 0.12)
Mass | 0.10 / 0.29 | (0.23, 0.48, 0.96, 0.31) | (0.40, 0.19, 0.99, 0.21) | (0.04, 0.41, 0.74, 0.07) | (0.05, 0.39, 0.84, 0.09)
Consolidation | — | — | — | — | —
Pneumothorax | 0.18 / 0.18 | (0.94, 0.97, 0.96, 0.96) | (0.90, 0.98, 0.92, 0.94) | (0.88, 0.84, 0.91, 0.86) | (0.88, 0.86, 0.91, 0.87)
Other diseases | 0.38 / 0.38 | (0.73, 0.87, 0.87, 0.79) | (0.73, 0.84, 0.88, 0.78) | (0.31, 0.68, 0.42, 0.43) | (0.32, 0.63, 0.48, 0.42)
No finding | 0.30 / 0.24 | (0.61, 0.77, 0.96, 0.68) | (0.60, 0.73, 0.96, 0.65) | (0.24, 0.87, 0.76, 0.37) | (0.23, 0.92, 0.74, 0.37)
Data in parentheses indicate precision, sensitivity, specificity, and F1-score. Dashes (—) indicate that no results are reported for consolidation because the external validation dataset contained no consolidation cases (see Table 2).
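Table 4 shows that adapting the models to the external institution's data substantially improves performance. The snippet below is a minimal, hypothetical sketch of such adaptation using a pretrained torchvision EfficientNet-B0: the classification head is replaced for the eight target conditions and only the head is updated. The architecture variant, optimizer, learning rate, and schedule are placeholders and may differ from the configuration actually used in the study.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # eight target conditions, as in the tables above

# Pretrained backbone with a new multi-label classification head.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)

# Freeze the feature extractor; fine-tune only the new head.
for param in model.features.parameters():
    param.requires_grad = False

criterion = nn.BCEWithLogitsLoss()  # multi-label objective
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One illustrative optimization step on a random batch (placeholder data).
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4, NUM_CLASSES)).float()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```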
Table 5. Diagnostic performance for each target condition across different training data scales using the external validation dataset (N: number of samples used for training, AUROC: area under the receiver operating characteristic curve).
Target Diseases | Case 1 (N; Performance; AUROC) | Case 2 (N; Performance; AUROC) | Case 3 (N; Performance; AUROC) | Case 4 (N; Performance; AUROC)
Pneumonia | 7913; (0.32, 0.75, 0.75, 0.45); 0.8414 | 16,077; (0.34, 0.76, 0.79, 0.48); 0.8689 | 24,017; (0.37, 0.79, 0.79, 0.50); 0.8819 | 40,000; (0.37, 0.79, 0.80, 0.51); 0.8905
Pleural Effusion | 8005; (0.42, 0.78, 0.78, 0.55); 0.8657 | 15,908; (0.45, 0.81, 0.81, 0.58); 0.8881 | 24,031; (0.46, 0.81, 0.82, 0.59); 0.8900 | 40,000; (0.47, 0.81, 0.82, 0.57); 0.8978
Tuberculosis | 7977; (0.97, 0.99, 0.87, 0.92); 0.9937 | 16,071; (0.96, 0.99, 0.90, 0.93); 0.9948 | 24,029; (0.96, 0.99, 0.90, 0.93); 0.9930 | 40,000; (0.98, 0.99, 0.88, 0.93); 0.9926
Mass | 7964; (0.54, 0.86, 0.86, 0.66); 0.9416 | 15,993; (0.55, 0.87, 0.86, 0.67); 0.9494 | 24,039; (0.58, 0.88, 0.88, 0.70); 0.9580 | 40,000; (0.57, 0.87, 0.87, 0.69); 0.9520
Consolidation | 8090; (0.36, 0.74, 0.73, 0.48); 0.8139 | 16,109; (0.38, 0.76, 0.76, 0.51); 0.8365 | 24,029; (0.39, 0.77, 0.77, 0.52); 0.8488 | 40,000; (0.40, 0.77, 0.76, 0.52); 0.8485
Pneumothorax | 8097; (0.49, 0.83, 0.82, 0.61); 0.9115 | 15,958; (0.53, 0.85, 0.84, 0.65); 0.9267 | 23,928; (0.54, 0.85, 0.86, 0.67); 0.9375 | 40,000; (0.58, 0.88, 0.86, 0.70); 0.9437
Other diseases | 13,777; (0.27, 0.63, 0.67, 0.38); 0.7141 | 27,447; (0.26, 0.64, 0.64, 0.37); 0.7098 | 41,072; (0.27, 0.64, 0.65, 0.38); 0.7186 | 68,525; (0.28, 0.67, 0.65, 0.39); 0.7354
No finding | 7933; (0.41, 0.78, 0.79, 0.54); 0.8623 | 15,972; (0.43, 0.80, 0.79, 0.56); 0.8745 | 23,995; (0.43, 0.79, 0.80, 0.56); 0.8817 | 39,999; (0.44, 0.80, 0.80, 0.57); 0.8891
Data in parentheses indicate precision, sensitivity, specificity, and F1-score.
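The AUROC values in Tables 5 and 6 are computed independently for each target condition from the model's predicted probabilities. A minimal sketch using scikit-learn's roc_auc_score is shown below; the label and probability arrays are illustrative placeholders rather than study data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-condition ground truth and predicted probabilities
# for a multi-label CXR classifier (one column per target condition).
conditions = ["Pneumonia", "Pleural Effusion", "Tuberculosis"]
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 1, 0],
                   [0, 0, 0]])
y_prob = np.array([[0.72, 0.10, 0.05],
                   [0.20, 0.81, 0.02],
                   [0.15, 0.05, 0.66],
                   [0.55, 0.60, 0.10],
                   [0.08, 0.12, 0.03]])

# AUROC is threshold-independent and evaluated per condition.
for j, name in enumerate(conditions):
    print(f"{name}: AUROC = {roc_auc_score(y_true[:, j], y_prob[:, j]):.4f}")
```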
Table 6. Diagnostic performance for each target condition across datasets with varying diversity (AUROC: area under the receiver operating characteristic curve).
Target Diseases | NIH Chest X-ray-14 (Performance; AUROC) | MIMIC-CXR (Performance; AUROC) | PadChest (Performance; AUROC) | CheXpert (Performance; AUROC) | Multiple-Source (Performance; AUROC)
Pneumonia | (0.21, 0.62, 0.63, 0.31); 0.6695 | (0.21, 0.48, 0.72, 0.29); 0.6593 | (0.13, 0.48, 0.51, 0.21); 0.4829 | (0.21, 0.62, 0.65, 0.32); 0.6723 | (0.22, 0.66, 0.65, 0.33); 0.7184
Pleural Effusion | (0.32, 0.72, 0.69, 0.44); 0.7915 | (0.40, 0.78, 0.77, 0.53); 0.8479 | (0.19, 0.40, 0.66, 0.26); 0.5511 | (0.37, 0.75, 0.75, 0.50); 0.8209 | (0.38, 0.83, 0.73, 0.52); 0.8574
Tuberculosis | — | — | (0.05, 0.59, 0.26, 0.10); 0.3891 | — | (0.05, 0.01, 0.99, 0.01); 0.6440
Mass | (0.23, 0.62, 0.60, 0.34); 0.6765 | — | (0.08, 0.14, 0.68, 0.10); 0.3969 | — | (0.41, 0.79, 0.78, 0.54); 0.8850
Consolidation | (0.24, 0.63, 0.60, 0.34); 0.6656 | (0.27, 0.66, 0.64, 0.38); 0.7059 | (0.15, 0.40, 0.56, 0.22); 0.4804 | (0.27, 0.67, 0.64, 0.38); 0.7131 | (0.31, 0.70, 0.70, 0.43); 0.7636
Pneumothorax | (0.39, 0.75, 0.77, 0.52); 0.8393 | (0.44, 0.77, 0.80, 0.56); 0.8625 | (0.15, 0.28, 0.69, 0.20); 0.4997 | (0.44, 0.76, 0.80, 0.56); 0.8514 | (0.46, 0.83, 0.81, 0.60); 0.9002
Other diseases | (0.17, 0.52, 0.51, 0.26); 0.5324 | (0.22, 0.60, 0.57, 0.32); 0.6214 | (0.16, 0.31, 0.66, 0.21); 0.4560 | (0.18, 0.52, 0.51, 0.26); 0.5538 | (0.20, 0.51, 0.59, 0.29); 0.5669
No finding | (0.30, 0.72, 0.67, 0.42); 0.7680 | (0.36, 0.74, 0.74, 0.49); 0.8167 | (0.16, 0.42, 0.56, 0.23); 0.4943 | (0.31, 0.70, 0.70, 0.43); 0.7687 | (0.37, 0.76, 0.75, 0.50); 0.8312
Dashes (—) indicate conditions that could not be evaluated: the models trained on NIH Chest X-ray-14, MIMIC-CXR, and CheXpert lacked tuberculosis classification capability, and those trained on MIMIC-CXR and CheXpert lacked mass classification capability, because these datasets contain no corresponding samples. Data in parentheses indicate precision, sensitivity, specificity, and F1-score.
Table 7. Predictive entropy (PE) and variance of predictions (VP) for each target condition across different training data scales using the external validation dataset (N: number of samples used for training).
Target Diseases | Case 1 (N; PE; VP) | Case 2 (N; PE; VP) | Case 3 (N; PE; VP) | Case 4 (N; PE; VP)
Pneumonia | 7913; 1.9592; 0.0007 | 16,077; 1.616; 0.0004 | 24,017; 1.9135; 0.0005 | 40,000; 1.772; 0.0004
Pleural Effusion | 8005; 1.8092; 0.0006 | 15,908; 1.506; 0.0005 | 24,031; 1.5154; 0.0004 | 40,000; 1.7032; 0.0004
Tuberculosis | 7977; 0.1858; 0.0001 | 16,071; 0.0812; 0.0001 | 24,029; 0.3547; 0.0001 | 40,000; 0.2172; 0.0001
Mass | 7964; 2.1449; 0.0009 | 15,993; 2.0338; 0.0005 | 24,039; 1.9871; 0.0004 | 40,000; 2.228; 0.0004
Consolidation | 8090; 2.2179; 0.0007 | 16,109; 1.893; 0.0004 | 24,029; 1.6253; 0.0004 | 40,000; 1.9582; 0.0004
Pneumothorax | 8097; 2.2859; 0.0004 | 15,958; 1.8744; 0.0004 | 23,928; 2.0114; 0.0004 | 40,000; 2.0566; 0.0003
Other diseases | 13,777; 1.0883; 0.0009 | 27,447; 1.1631; 0.0006 | 41,072; 1.3439; 0.0005 | 68,525; 0.6445; 0.0005
No finding | 7933; 1.4209; 0.0007 | 15,972; 0.6132; 0.0005 | 23,995; 1.446; 0.0005 | 39,999; 0.7514; 0.0005
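Predictive entropy (PE) and variance of predictions (VP) are standard Monte Carlo dropout summaries [34,36]: the network is run several times with dropout kept active at inference, and the spread of the resulting probabilities is aggregated. The sketch below shows one common binary-case formulation with synthetic numbers; the exact aggregation used in the study (for example, summing the entropy over all target labels) may differ.

```python
import numpy as np

def mc_dropout_uncertainty(prob_samples, eps=1e-12):
    """Summarize uncertainty from T stochastic forward passes (MC dropout).

    prob_samples: array of shape (T, N) holding predicted probabilities of the
    positive class for N images. Returns the mean predictive entropy (PE) and
    the mean variance of predictions (VP) over the N images.
    """
    prob_samples = np.asarray(prob_samples, dtype=float)
    p_mean = prob_samples.mean(axis=0)  # mean probability per image
    entropy = -(p_mean * np.log(p_mean + eps)
                + (1.0 - p_mean) * np.log(1.0 - p_mean + eps))
    variance = prob_samples.var(axis=0)  # spread across stochastic passes
    return entropy.mean(), variance.mean()

# Toy example with T = 20 stochastic passes over N = 100 images
# (random numbers, not values from the study).
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(20, 100))
pe, vp = mc_dropout_uncertainty(samples)
print(f"PE = {pe:.4f}, VP = {vp:.4f}")
```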