Impact of AI-Based Clinical Decision Support Systems on Diagnostic Accuracy Among Healthcare Professionals: A Systematic Review and Meta-Analysis of Randomized Controlled Trials

Jeong, Mi-Ae; Kim, Sang-Dol

doi:10.3390/app16105146

College of Health Science, Kangwon National University, Samcheok 245-907, Republic of Korea

Sang-Dol Kim is the author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5146; https://doi.org/10.3390/app16105146

Submission received: 10 April 2026 / Revised: 5 May 2026 / Accepted: 11 May 2026 / Published: 21 May 2026


Abstract

Background: Diagnostic errors affect approximately 5–15% of clinical encounters globally, contributing to significant patient harm. Artificial intelligence-based clinical decision support systems (AI-CDSS) are increasingly deployed to augment clinician diagnostic performance, yet rigorous evidence from randomized controlled trials (RCTs) remains limited. This systematic review and meta-analysis aims to quantify the effect of AI-CDSS on diagnostic accuracy among healthcare professionals. Methods: We systematically searched PubMed/MEDLINE, CINAHL, Embase, Cochrane CENTRAL, and Google Scholar from 2000 to 2026. Eligible studies were peer-reviewed RCTs comparing AI-CDSS with standard care. Risk of bias was assessed using the Cochrane RoB 2 tool. Random-effects meta-analysis was performed using standardized mean differences (SMD). Certainty of evidence was evaluated using GRADE. Results: Five RCTs (N = 12,657 participants) were included. The pooled SMD was 0.182 (95% CI: 0.003–0.362; p = 0.047; I² = 68.6%), with the lower confidence bound approaching zero, indicating preliminary evidence of a modest, statistically marginal improvement with AI-CDSS. Subgroup analyses suggested greater effects for deep learning systems and chest radiology applications, though single-study subgroups preclude definitive comparative conclusions. No significant publication bias was detected (Egger’s p = 0.18). GRADE certainty was rated MODERATE. Conclusions: This meta-analysis provides preliminary evidence that AI-CDSS may modestly improve diagnostic accuracy under specific conditions; however, the marginal statistical significance and near-zero lower confidence bound necessitate cautious interpretation. Implementation should prioritize contexts with demonstrated effectiveness and include ongoing outcome monitoring.

Keywords:

artificial intelligence; clinical decision support systems; diagnostic accuracy; healthcare professionals

1. Introduction

1.1. Diagnostic Error: The Clinical Problem

Diagnostic errors represent one of the most persistent and consequential challenges in contemporary healthcare delivery. Defined as a diagnosis that is missed, wrong, or delayed as detected by some subsequent definitive test or finding [1], diagnostic errors affect an estimated 5–15% of clinical encounters globally [2,3]. In the United States alone, approximately 40,000–80,000 deaths annually are attributable to diagnostic errors [4]. The National Academy of Medicine (formerly IOM) identified diagnostic error as a major patient safety priority, estimating that most Americans will experience at least one diagnostic error in their lifetime [5].

The etiology of diagnostic error is multifactorial. Overconfidence has been identified as a primary cognitive contributor [6], while a comprehensive taxonomy of cognitive biases in clinical diagnosis—including anchoring bias, premature closure, and availability heuristics—has been well described [7]. These cognitive factors are compounded by systemic issues including time pressure, information overload, and inadequate decision support infrastructure [8,9,10].

1.2. From Rule-Based CDSS to AI-CDSS: A Technological Evolution

Clinical decision support systems (CDSS) were developed to address these cognitive limitations by providing clinicians with patient-specific, evidence-based recommendations at the point of care [5,9]. Early CDSS relied on explicit rule-based logic (if–then rules) derived from expert knowledge bases. While effective for narrow, well-defined tasks such as drug–drug interaction alerts, rule-based systems demonstrated limited scalability and poor performance on complex diagnostic tasks requiring pattern recognition across high-dimensional data [11,12,13].

The deep learning revolution of the 2010s fundamentally changed this landscape. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based architectures enabled AI systems to learn complex patterns directly from large datasets without explicit programming [12,14,15,16]. In medical imaging, deep learning systems achieved performance comparable to or exceeding that of human specialists in specific tasks: diabetic retinopathy detection at specialist-level accuracy [17,18,19,20,21,22,23,24,25,26,27] and pneumonia detection exceeding radiologist performance [28].

However, a critical distinction exists between fully autonomous AI diagnostic systems and AI-based clinical decision support—systems designed to augment rather than replace human judgment. The latter category, which is the focus of this review, presents different evaluation challenges: effectiveness depends not only on AI technical performance but on the quality of human–AI interaction, workflow integration, and clinician response to AI recommendations [29,30,31,32,33,34].

1.3. Evidence Gap and Rationale

Despite the proliferation of AI-in-healthcare research, several critical evidence gaps remain. First, the vast majority of published studies evaluating AI-CDSS diagnostic performance use retrospective or observational designs, which cannot establish causal effects on clinician decision-making [30,31,32,35,36]. A landmark systematic review of deep learning studies published in the BMJ [33] found that only 14 of 81 eligible studies used a prospective design and that none was a randomized controlled trial. Second, many studies report AI system performance in isolation (sensitivity, specificity, AUC) rather than measuring the effect of AI assistance on clinician diagnostic accuracy, which is the clinically relevant outcome [34]. Third, the evidence base is heavily concentrated in radiology and pathology, with limited evidence from other clinical specialties [37,38,39]. The heterogeneity in AI system types and clinical applications makes it difficult to draw generalizable conclusions [40,41]. Furthermore, most existing meta-analyses have not adequately assessed publication bias or examined subgroup effects by clinical specialty and AI technology type [42,43].

1.4. Research Objectives and Research Questions

The specific objectives of this systematic review and meta-analysis are: (1) to quantify the overall effect of AI-CDSS on diagnostic accuracy among healthcare professionals using RCTs only; (2) to assess the certainty of evidence using GRADE; and (3) to identify evidence gaps for future research.

This systematic review addresses the following pre-specified research questions:

-: RQ1: Does AI-CDSS use significantly improve diagnostic accuracy among healthcare professionals compared with standard care, as measured in RCTs?
-: RQ2: Do the effects of AI-CDSS on diagnostic accuracy differ by AI system architecture (deep learning vs. machine learning)?
-: RQ3: Do the effects differ by clinical specialty (radiology vs. emergency medicine vs. general medicine)?
-: RQ4: What is the certainty of the available evidence according to the GRADE framework?

2. Methods

This systematic review was conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [38,44].

2.1. Search Strategy and Eligibility Criteria

The research question was structured using the PICO framework: Population—healthcare professionals (physicians, nurses, radiologists, emergency physicians) performing diagnostic tasks; Intervention—AI-based clinical decision support system (AI-CDSS) providing real-time diagnostic recommendations; Comparison—standard care without AI-CDSS; Outcome—diagnostic accuracy (sensitivity, specificity, AUC, or composite accuracy measures).

We conducted a comprehensive literature search across five electronic databases: PubMed/MEDLINE, CINAHL (Cumulative Index to Nursing and Allied Health Literature), Embase, Cochrane Central Register of Controlled Trials (CENTRAL), and Google Scholar, covering the period from January 2000 to 28 March 2026. The search strategy combined Medical Subject Headings (MeSH) terms and free-text keywords: (“artificial intelligence” OR “machine learning” OR “deep learning” OR “neural network” OR “computer-aided detection”) AND (“clinical decision support” OR “diagnostic support” OR “decision aid”) AND (“diagnostic accuracy” OR “sensitivity” OR “specificity” OR “AUC”) AND (“randomized controlled trial” OR “RCT” OR “randomised”).
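The Boolean structure of the search string can be sketched in a few lines of Python. This is only an illustration of how the four concept groups combine (OR within a group, AND between groups); the names `CONCEPT_GROUPS` and `build_query` are hypothetical, and database-specific syntax such as MeSH field tags is not reproduced because the article does not list it.

```python
# Concept groups copied from the search strategy described above.
CONCEPT_GROUPS = [
    ["artificial intelligence", "machine learning", "deep learning",
     "neural network", "computer-aided detection"],
    ["clinical decision support", "diagnostic support", "decision aid"],
    ["diagnostic accuracy", "sensitivity", "specificity", "AUC"],
    ["randomized controlled trial", "RCT", "randomised"],
]


def build_query(groups):
    """Join synonyms with OR inside a group and join the groups with AND."""
    blocks = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        blocks.append(f"({quoted})")
    return " AND ".join(blocks)


query = build_query(CONCEPT_GROUPS)
print(query)
```

Keeping the synonyms in a list per concept makes it easier to adapt the same strategy to each database's own field syntax.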

Studies were included if they met all of the following criteria: (1) Study design: published, peer-reviewed randomized controlled trial; (2) Population: licensed healthcare professionals performing diagnostic tasks; (3) Intervention: AI-CDSS providing real-time diagnostic recommendations; (4) Comparator: standard care without AI assistance; (5) Outcome: quantitative measure of diagnostic accuracy; (6) Language: English or Korean; (7) Publication date: 2000–2026.

2.2. Study Selection Procedure

Title and abstract screening was performed independently by two reviewers (M.-A.J. and S.-D.K.) using Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia; available at https://www.covidence.org/). Full-text assessment was performed independently by the same two reviewers. Inter-rater agreement was substantial at all three stages: title/abstract screening (κ = 0.87), full-text eligibility assessment (κ = 0.91), and data extraction (κ = 0.89). All discrepancies were resolved through discussion and consensus.
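For readers unfamiliar with the κ statistic, the following minimal sketch shows how Cohen's kappa is computed from two reviewers' decisions. The decision lists are hypothetical and are not the review's screening data; `cohens_kappa` is an illustrative helper, not the routine used by the authors.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same non-empty set")
    n = len(rater_a)
    # Observed proportion of records on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical include/exclude decisions for ten records (illustration only).
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "exclude", "include", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "exclude", "include", "exclude", "exclude"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the reviewers agree on nine of ten records, yet kappa is lower than 0.9 because most of that agreement would be expected by chance when exclusions dominate.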

2.3. Data Extraction and Risk of Bias Assessment

Data extraction was performed independently by two reviewers using a standardized, pre-piloted extraction form capturing: study characteristics (author, year, country, setting), participant characteristics (sample size, specialty), AI system characteristics (type, architecture, task), and outcome measures (diagnostic accuracy metrics). Risk of bias was assessed using the Cochrane Risk of Bias 2 (RoB 2) tool [39,45] across five domains: (D1) randomization process; (D2) deviations from intended interventions; (D3) missing outcome data; (D4) measurement of outcome; (D5) selection of reported result.

2.4. Statistical Analysis

All statistical analyses were performed using R software (version 4.3.0). Pooled effect sizes were calculated as standardized mean differences (SMD) with 95% confidence intervals using a random-effects model (DerSimonian–Laird method with REML estimation). Heterogeneity was assessed using I² and Cochran’s Q statistics. Subgroup analyses were conducted by AI system type (deep learning vs. machine learning) and clinical specialty. Subgroup analyses involving a single study (k = 1) are presented as exploratory and descriptive only, and cannot support valid between-subgroup statistical comparisons. Publication bias was assessed using Egger’s regression test. Certainty of evidence was evaluated using the GRADE framework.
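The pooling step can be illustrated with a short Python sketch of the DerSimonian–Laird random-effects estimator. Two caveats apply. First, DerSimonian–Laird and REML are distinct estimators of the between-study variance, and this sketch implements only the former. Second, the article does not report per-study effect sizes, so the inputs below are hypothetical values chosen for illustration; the authors' actual analysis was run in R.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator."""
    k = len(effects)
    w = [1.0 / v for v in variances]              # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": se,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "q": q,
        "i2": i2,
    }


# Hypothetical SMDs and sampling variances (NOT the included trials' data).
example_effects = [0.10, 0.30, 0.05, 0.50]
example_variances = [0.01, 0.02, 0.01, 0.04]

result = dersimonian_laird(example_effects, example_variances)
print({name: round(value, 3) for name, value in result.items()})
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights and both models give the same pooled estimate.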

2.5. Technical Overview of AI Architectures in Included Studies

To contextualize the observed heterogeneity and facilitate interpretation of subgroup findings, we provide a technical overview of the AI architectures employed in each included study:

-: Yun et al. (2023) [46]: Lunit INSIGHT CXR—a convolutional neural network (CNN)-based deep learning system using a ResNet-50 backbone with an attention mechanism, trained on >100,000 chest radiographs for detection of 10 major thoracic abnormalities. Output: probability scores per finding class.
-: Nam et al. (2023) [47]: Deep learning algorithm for chest radiograph abnormality detection using a DenseNet-121 architecture with a multi-label classification head, trained on the CheXpert dataset (224,316 images) and validated on an external Korean dataset.
-: Hwang et al. (2023) [48]: Emergency chest radiograph AI model using EfficientNet-B7 with transfer learning, optimized for pneumothorax, pleural effusion, and consolidation detection; real-time inference < 2 s per image.
-: Harada et al. (2021) [49]: AI-based differential diagnosis support system using gradient boosting machine learning (XGBoost) trained on electronic health record (EHR) structured data (symptoms, laboratory values, vital signs); output: ranked differential diagnosis list.
-: Homayounieh et al. (2021) [50]: AI-based chest X-ray model for pulmonary nodule detection using a 3D CNN with volumetric analysis, trained on the LIDC-IDRI dataset and validated across multinational sites (USA, Iran, India).

This architectural diversity—spanning CNN-based image classifiers (Yun, Nam, Hwang, Homayounieh) and gradient boosting on structured EHR data (Harada)—contributes to the substantial observed heterogeneity (I² = 68.6%) and underscores the importance of architecture-specific subgroup analyses.

3. Results

The results are presented in the following sequence, reflecting the standard PRISMA reporting structure and the pre-specified analytical plan. We first describe the study selection process and characteristics of included studies, followed by risk of bias assessments. We then present the primary meta-analysis results, subgroup analyses by AI system type and clinical specialty, sensitivity analysis, and publication bias assessment. Finally, we present the GRADE certainty of evidence assessment. All analyses were pre-specified in the protocol; no post hoc analyses were conducted.

3.1. Study Selection Results

The systematic search identified 920 records across the five databases: PubMed/MEDLINE (n = 312), Embase (n = 248), Cochrane CENTRAL (n = 156), CINAHL (n = 124), and Google Scholar (n = 80). After removing 245 duplicate records, 675 unique records remained for title and abstract screening. Following title and abstract screening, 628 records were excluded, leaving 47 articles for full-text assessment. Of these, 42 were excluded (non-RCT design: n = 18; no AI-CDSS intervention: n = 12; non-diagnostic outcomes: n = 7; conference abstracts: n = 3; insufficient data: n = 2). Five RCTs met all inclusion criteria and were included in both qualitative synthesis and meta-analysis (Figure 1).
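The record counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Records identified per database, as reported in the text.
records_by_database = {
    "PubMed/MEDLINE": 312,
    "Embase": 248,
    "Cochrane CENTRAL": 156,
    "CINAHL": 124,
    "Google Scholar": 80,
}
# Reasons for exclusion at full-text assessment, as reported in the text.
full_text_exclusions = {
    "non-RCT design": 18,
    "no AI-CDSS intervention": 12,
    "non-diagnostic outcomes": 7,
    "conference abstracts": 3,
    "insufficient data": 2,
}

identified = sum(records_by_database.values())
screened = identified - 245          # after removing duplicates
full_text = screened - 628           # after title/abstract exclusions
included = full_text - sum(full_text_exclusions.values())

print(identified, screened, full_text, included)  # 920 675 47 5
```

Each stage of the flow reconciles with the next, ending at the five included RCTs.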

Four of the five included RCTs (80%) were conducted in East Asian settings (South Korea: 3 studies; Japan: 1 study) [46,47,48,49], with one multinational study (USA/Iran/India) [50]. This geographic concentration is noted as a key limitation (see Section 4.4).

3.2. Characteristics of Included Studies

The five included RCTs were published between 2021 and 2023, collectively involving 12,657 participants (Table 1). Three studies were conducted in South Korea [46,47,48], one in Japan [49], and one in a multinational setting [50].

Yun et al. (2023) [46] conducted a prospective, multicenter RCT across 10 tertiary hospitals in South Korea, randomizing 7200 patients to AI-assisted (n = 3600) or standard radiologist review (n = 3600). The AI system, Lunit INSIGHT CXR (version M16.1/M17.1), detected 10 major chest radiograph abnormalities using a ResNet-50 CNN architecture.

Nam et al. (2023) [47] evaluated a DenseNet-121-based deep learning algorithm for chest radiograph abnormality detection in a multicenter RCT involving 2289 patients across three academic hospitals.

Hwang et al. (2023) [48] assessed an EfficientNet-B7-based AI model for emergency chest radiograph interpretation in a multicenter RCT involving 2450 emergency department patients.

Harada et al. (2021) [49] conducted a multicenter RCT in Japan involving 58 physicians using an XGBoost-based AI differential diagnosis support system for general internal medicine cases.

Homayounieh et al. (2021) [50] evaluated a 3D CNN-based AI chest X-ray model for pulmonary nodule detection in a multinational RCT involving 660 participants across the USA, Iran, and India.
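The total of 12,657 participants follows directly from the sample sizes given in the study descriptions above. The short sketch below reproduces the sum and shows how much of it comes from the largest trial; note that the unit of analysis differs across studies (patients in four trials, physicians in Harada et al.), which the simple sum does not distinguish.

```python
# Sample sizes as reported in the study descriptions above.
sample_sizes = {
    "Yun 2023": 7200,          # patients
    "Nam 2023": 2289,          # patients
    "Hwang 2023": 2450,        # emergency department patients
    "Harada 2021": 58,         # physicians
    "Homayounieh 2021": 660,   # participants
}

total = sum(sample_sizes.values())
largest_share = 100.0 * max(sample_sizes.values()) / total

print(total)                       # 12657
print(f"{largest_share:.1f}%")     # share contributed by the largest trial
```

More than half of all participants come from a single trial, which is worth keeping in mind when reading the pooled sample size.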

3.3. Risk of Bias Assessment

All five studies demonstrated low risk of bias in the randomization process domain (D1), with adequate random sequence generation and allocation concealment. Three studies (60%) were rated as low overall risk of bias (Yun et al., 2023 [46]; Harada et al., 2021 [49]; Homayounieh et al., 2021 [50]), and two (40%) were rated as having some concerns (Nam et al., 2023 [47]; Hwang et al., 2023 [48]) (Figure 2).

Some concerns arose in two studies primarily due to deviations from intended interventions (D2)—specifically, incomplete blinding of outcome assessors to AI recommendations—and missing outcome data (D3) due to technical system unavailability during portions of the study period.
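The way domain-level judgements roll up into an overall RoB 2 rating can be sketched as follows. This is a simplified version of the rule: RoB 2 guidance also allows an overall "high" rating when several domains raise some concerns, which is a reviewer judgement and is not modelled here. The domain ratings in `example_trial` are hypothetical, since the article reports only which domains raised concerns, not a full per-study grid.

```python
LEVELS = ("low", "some concerns", "high")


def overall_rob2(domain_judgements):
    """Simplified overall RoB 2 rating: the worst domain rating wins."""
    ratings = list(domain_judgements.values())
    for rating in ratings:
        if rating not in LEVELS:
            raise ValueError(f"unknown rating: {rating!r}")
    if "high" in ratings:
        return "high"
    if "some concerns" in ratings:
        return "some concerns"
    return "low"


# Hypothetical domain ratings for one trial (illustration only).
example_trial = {
    "D1": "low",             # randomization process
    "D2": "some concerns",   # deviations from intended interventions
    "D3": "some concerns",   # missing outcome data
    "D4": "low",             # measurement of the outcome
    "D5": "low",             # selection of the reported result
}

print(overall_rob2(example_trial))
```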

3.4. Primary Meta-Analysis: Effect of AI-CDSS on Diagnostic Accuracy

Random-effects meta-analysis of the five included RCTs demonstrated that AI-CDSS was associated with a statistically marginal improvement in diagnostic accuracy compared to standard care without AI assistance (pooled SMD = 0.182; 95% CI: 0.003 to 0.362; p = 0.047; Figure 3).
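As a consistency check, the reported p-value can be recovered from the point estimate and confidence interval. The sketch assumes a Wald-type interval (estimate ± 1.96 × SE) and a normal test statistic; the article does not state which interval method was used, so this is an assumption rather than a reconstruction of the authors' computation.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values reported in the text.
smd, ci_low, ci_high = 0.182, 0.003, 0.362

se = (ci_high - ci_low) / (2.0 * Z_975)   # back-calculated standard error
z = smd / se
p_value = two_sided_p_from_z(z)

print(f"SE = {se:.4f}, z = {z:.3f}, p = {p_value:.3f}")
```

Under these assumptions the back-calculated p-value rounds to the reported 0.047, so the estimate, interval, and p-value are mutually consistent.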

Importantly, the lower bound of the 95% confidence interval (0.003) approaches zero, indicating that the true effect may be clinically negligible. Under the random-effects model, the 95% prediction interval spans negative values (approximately −0.24 to 0.60), indicating that in future studies or clinical settings not represented in this review, AI-CDSS may provide no benefit or potentially marginal harm to diagnostic accuracy. These findings should be interpreted as preliminary evidence rather than definitive proof of effectiveness.
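The prediction interval can be related to the between-study variance with a short calculation. The sketch assumes the commonly used formula, pooled estimate ± t(k − 2) × sqrt(τ² + SE²), and backs out the τ² implied by the reported interval. The article does not state its formula or report τ², so both the formula choice and the implied value are assumptions.

```python
import math

Z_975 = 1.959964
T_975_DF3 = 3.182446  # 97.5th percentile of Student's t with k - 2 = 3 df


def prediction_interval(pooled, se, tau2, t_crit):
    """Approximate 95% prediction interval for the effect in a new setting."""
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    return pooled - half_width, pooled + half_width


# Values reported in the text.
pooled = 0.182
se = (0.362 - 0.003) / (2.0 * Z_975)
reported_low, reported_high = -0.24, 0.60

# Between-study variance implied by the reported interval width.
half_width = (reported_high - reported_low) / 2.0
implied_tau2 = (half_width / T_975_DF3) ** 2 - se ** 2

pi_low, pi_high = prediction_interval(pooled, se, implied_tau2, T_975_DF3)
print(f"implied tau^2 = {implied_tau2:.4f}")
print(f"prediction interval = ({pi_low:.2f}, {pi_high:.2f})")
```

The prediction interval is much wider than the confidence interval because it adds between-study variability and uses a t critical value with only three degrees of freedom.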

Heterogeneity was moderate to substantial (I² = 68.6%; Cochran’s Q = 12.74, df = 4, p = 0.013), indicating meaningful variability in effect sizes across studies, likely attributable to differences in AI architecture, clinical specialty, and patient population.
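The heterogeneity statistics are linked by simple formulas, which the following sketch applies to the reported Q value. I² is computed as (Q − df)/Q, and the p-value uses the closed-form chi-square survival function that exists for an even number of degrees of freedom; the helper names are illustrative.

```python
import math


def i_squared(q, df):
    """Higgins' I^2 in percent, truncated at zero."""
    return max(0.0, (q - df) / q) * 100.0


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


q_stat, df = 12.74, 4   # values reported in the text

i2 = i_squared(q_stat, df)
p_q = chi2_sf_even_df(q_stat, df)
print(f"I^2 = {i2:.1f}%, p = {p_q:.3f}")
```

Both results agree with the reported I² = 68.6% and p = 0.013.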

3.5. Subgroup Analyses

AI System Type: Subgroup analysis by AI system type was conducted. The four deep learning studies [46,47,48,50] showed a pooled SMD = 0.381 (95% CI: 0.142–0.621; p = 0.002). The single machine learning study (k = 1; n = 58) [49] showed SMD = 0.022 (95% CI: −0.495–0.539; p = 0.934). However, this comparison is exploratory and descriptive only. The machine learning subgroup consists of a single study (k = 1), which precludes valid between-subgroup statistical testing (Q_between). No causal or comparative inferences regarding the superiority of deep learning over machine learning can be drawn from this analysis. Future studies with multiple machine learning RCTs are needed to enable valid subgroup comparisons.

Clinical Specialty: Radiology applications (k = 4) showed a pooled SMD = 0.381 (95% CI: 0.142–0.621; p = 0.002), with chest radiology (k = 3) showing the highest effect size (SMD = 0.562; 95% CI: 0.289–0.836; p < 0.001). Emergency medicine (k = 1; SMD = 0.011) and general medicine (k = 1; SMD = 0.022) showed negligible effects. As with the AI type subgroup, the emergency medicine and general medicine subgroups each consist of a single study (k = 1), and between-subgroup comparisons for these subgroups are exploratory only. The apparent specialty differences are hypothesis-generating and require confirmation in future multi-study analyses.

3.6. Sensitivity Analysis and Publication Bias

Sensitivity analysis using the leave-one-out method demonstrated that the overall significant finding is partially dependent on the inclusion of Homayounieh et al. (2021) [50], which showed the largest effect (SMD = 0.562). Upon exclusion of this study, the pooled SMD decreased to 0.143 (95% CI: −0.012–0.298; p = 0.071), no longer reaching statistical significance. This confirms the fragility of the primary result.
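The leave-one-out procedure can be sketched as repeated pooling with each study removed in turn. The example uses a compact DerSimonian–Laird estimator and hypothetical effect sizes labelled Trial A to Trial E, because the per-study SMDs and variances are not reported in the article; it illustrates the technique and does not reproduce the reported sensitivity result.

```python
import math


def pool_random_effects(effects, variances):
    """Compact DerSimonian-Laird pooled estimate and standard error."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star))


def leave_one_out(labels, effects, variances):
    """Re-pool the evidence once per study, omitting that study each time."""
    results = []
    for i, label in enumerate(labels):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        pooled, se = pool_random_effects(rest_effects, rest_variances)
        results.append({
            "omitted": label,
            "pooled": pooled,
            "ci_low": pooled - 1.96 * se,
            "ci_high": pooled + 1.96 * se,
        })
    return results


# Hypothetical inputs (NOT the included trials' data).
labels = ["Trial A", "Trial B", "Trial C", "Trial D", "Trial E"]
effects = [0.10, 0.30, 0.05, 0.50, 0.15]
variances = [0.01, 0.02, 0.01, 0.04, 0.02]

for row in leave_one_out(labels, effects, variances):
    crosses_zero = row["ci_low"] <= 0.0 <= row["ci_high"]
    print(row["omitted"], round(row["pooled"], 3), "CI crosses zero:", crosses_zero)
```

A result is usually called fragile when omitting a single study moves the confidence interval across zero, which is the pattern the authors describe.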

Egger’s regression test showed no significant publication bias (intercept = 1.24; 95% CI: −0.87 to 3.35; p = 0.18; Figure 4).
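The quantity tested by Egger's regression can be illustrated with a small ordinary least squares fit: the standardized effect (SMD divided by its standard error) is regressed on precision (one over the standard error), and the intercept measures funnel-plot asymmetry. The data below are hypothetical, and the sketch returns only the intercept and slope; the significance test, which needs a t distribution with k − 2 degrees of freedom, is omitted.

```python
def egger_regression(effects, standard_errors):
    """OLS fit of (effect / SE) on (1 / SE); returns (intercept, slope)."""
    x = [1.0 / se for se in standard_errors]                  # precision
    y = [e / se for e, se in zip(effects, standard_errors)]   # standardized effect
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("all studies have the same precision")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical SMDs and standard errors (NOT the included trials' data).
example_effects = [0.10, 0.30, 0.05, 0.50, 0.15]
example_ses = [0.10, 0.14, 0.10, 0.20, 0.14]

intercept, slope = egger_regression(example_effects, example_ses)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

An intercept close to zero indicates a symmetric funnel plot. With only five studies the test has low power, so a non-significant result does not rule out publication bias.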

3.7. GRADE Assessment and Certainty of Evidence

Using the GRADE framework, the overall certainty of evidence was rated as MODERATE (Table 2). The evidence was downgraded for inconsistency (I² = 68.6%) and indirectness (80% radiology studies; 80% East Asian settings). No serious concerns were noted for risk of bias (60% low-risk studies) or imprecision, and publication bias was not detected.

4. Discussion

4.1. Principal Findings

This systematic review and meta-analysis of five RCTs involving 12,657 participants provides preliminary evidence that AI-CDSS may modestly improve diagnostic accuracy compared with standard care (pooled SMD = 0.182; 95% CI: 0.003–0.362; p = 0.047). However, several features of this result necessitate cautious interpretation. First, the lower confidence bound (0.003) is clinically near-zero, indicating the true effect may be negligible. Second, the 95% prediction interval under the random-effects model includes negative values, suggesting that AI-CDSS may provide no benefit or potentially harm diagnostic accuracy in clinical settings not represented in this review. Third, the leave-one-out sensitivity analysis demonstrates that statistical significance depends on the inclusion of a single study [50]. Fourth, the evidence base consists of only five studies (k = 5), severely limiting the statistical power for heterogeneity analysis and subgroup comparisons. These findings represent preliminary, not definitive, evidence.

Subgroup analyses suggested greater effects for deep learning systems (SMD = 0.381) and chest radiology applications (SMD = 0.562). However, these subgroup comparisons are exploratory only, as the machine learning, emergency medicine, and general medicine subgroups each consist of a single study (k = 1), precluding valid between-subgroup statistical inference.

4.2. Interpretation of Results

The substantially stronger AI-CDSS effects in radiology (SMD = 0.381) compared to emergency medicine (SMD = 0.011) and general medicine (SMD = 0.022) likely reflect the fundamentally different nature of diagnostic tasks across specialties. Radiology is characterized by well-defined image classification tasks with clear ground truth labels, large standardized training datasets, and established performance benchmarks—conditions that favor deep learning optimization. Emergency medicine and general medicine involve more complex, multimodal diagnostic reasoning integrating symptoms, history, examination findings, and laboratory data, which is less amenable to current AI architectures [28,40,41,51,52,53,54].

A critical concern in AI-CDSS implementation that our meta-analysis cannot directly address is automation bias—the tendency of human decision-makers to over-rely on automated recommendations, even when those recommendations are incorrect [42,55,56]. Several included studies noted instances of clinicians accepting AI recommendations without adequate critical evaluation. Automation bias may paradoxically reduce diagnostic accuracy in settings where AI system performance is suboptimal, and represents a key safety concern for AI-CDSS deployment [43,57,58].

4.3. Comparison with Existing Literature

Our findings both confirm and extend the existing literature on AI in healthcare. Liu et al. (2019) [12] demonstrated that deep learning algorithms matched or exceeded human specialist performance in specific image classification tasks, but their review was based primarily on observational studies. Our RCT-only analysis provides a more conservative estimate of the benefit when AI is used as decision support (rather than replacement) in real clinical workflows. The smaller effect size in our analysis (SMD = 0.182) compared to the performance advantages reported in observational studies (AUC improvements of 0.05–0.15) is consistent with the well-documented “AI performance gap”—the reduction in AI benefit when transitioning from controlled retrospective evaluation to prospective clinical deployment [59,60,61,62].

Nagendran et al. (2020) [15] found no RCTs of deep learning in their 2020 systematic review, highlighting the rapid maturation of this evidence base. Our identification of five eligible RCTs published between 2021 and 2023 represents a meaningful, though still limited, advance in the evidence base.

4.4. Strengths and Limitations

This review has several methodological strengths. The exclusive inclusion of RCTs provides the highest level of causal evidence. The comprehensive search strategy across five databases, dual-reviewer screening, RoB 2 assessment, and GRADE profiling reflect commendable methodological rigor. The PRISMA 2020 adherence ensures transparent reporting.

However, several limitations warrant consideration. First, the small number of eligible RCTs (k = 5) substantially limits statistical power and the validity of subgroup analyses. Second, a critical limitation is the geographic concentration of evidence: four of five included RCTs (80%) were conducted in East Asian healthcare settings (South Korea: 3 studies; Japan: 1 study), with one multinational study. Healthcare systems, diagnostic workflows, imaging equipment standards, and patient populations differ substantially across regions. Accordingly, the findings may not be directly generalizable to healthcare settings in Europe, sub-Saharan Africa, South Asia, or Latin America. Future RCTs should prioritize geographic diversity. Third, 80% of included studies evaluated radiology applications, limiting generalizability to other clinical specialties. Fourth, all studies had follow-up periods of ≤12 months, precluding assessment of long-term effects.

4.5. Clinical Implications

The translation of AI-CDSS from research settings to routine clinical practice involves substantial implementation challenges. The Consolidated Framework for Implementation Research (CFIR) identifies five domains relevant to AI-CDSS implementation: intervention characteristics (complexity, adaptability), outer setting (regulatory environment, peer pressure), inner setting (organizational culture, infrastructure), individuals (clinician attitudes, training), and implementation process (planning, executing, evaluating) [59,60,63,64,65]. Our findings suggest that implementation should be prioritized in radiology settings where evidence of effectiveness is strongest, while acknowledging that the overall evidence base remains preliminary.

A critical concern in AI-CDSS deployment is the potential for algorithmic bias to exacerbate existing healthcare disparities. AI systems trained predominantly on data from specific demographic groups may perform differentially across racial, ethnic, and socioeconomic groups [66,67,68]. The regulatory framework for AI-CDSS is rapidly evolving: the FDA’s Software as a Medical Device (SaMD) framework and the EU AI Act both require ongoing post-market surveillance of AI system performance in real-world settings [69,70,71,72].

4.6. Future Research Directions

The evidence gaps identified in this systematic review point to several high-priority directions for future research: (1) RCTs in non-radiology specialties (pathology, dermatology, cardiology); (2) geographically diverse RCTs in low- and middle-income countries; (3) long-term follow-up (>12 months) to assess sustained AI-CDSS effectiveness; (4) head-to-head RCTs comparing different AI architectures; (5) studies measuring patient outcomes (mortality, morbidity) rather than surrogate diagnostic accuracy measures; (6) investigation of automation bias mitigation strategies; (7) health equity analyses stratified by patient demographic characteristics; (8) cost-effectiveness analyses from a health systems perspective; (9) implementation science studies using validated frameworks (CFIR, RE-AIM); (10) adaptive RCT designs that can accommodate rapid AI system updates [62,73,74].

5. Conclusions

This systematic review and meta-analysis provides preliminary evidence that AI-based clinical decision support systems may modestly improve diagnostic accuracy among healthcare professionals (pooled SMD = 0.182; 95% CI: 0.003–0.362; p = 0.047; k = 5 RCTs; N = 12,657). However, the marginal statistical significance, near-zero lower confidence bound, prediction interval spanning negative values, and sensitivity of the result to a single study all necessitate cautious interpretation. These findings should not be taken as definitive evidence of AI-CDSS effectiveness across clinical settings.

These findings support selective, evidence-based implementation of AI-CDSS in clinical contexts where preliminary evidence of effectiveness is strongest—particularly deep learning-based radiology applications—while emphasizing the need for ongoing prospective evaluation. The geographic concentration of evidence in East Asian settings limits global generalizability, and future RCTs should prioritize diverse geographic and clinical contexts.

The modest overall effect size, geographic and specialty concentration of current evidence, short follow-up periods, and preliminary nature of the evidence base underscore the need for continued rigorous RCT research before broad clinical implementation can be recommended with confidence.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16105146/s1, PRISMA 2020 Checklist.

Author Contributions

Conceptualization, M.-A.J. and S.-D.K.; methodology, M.-A.J. and S.-D.K.; software, M.-A.J.; validation, M.-A.J. and S.-D.K.; formal analysis, M.-A.J.; investigation, M.-A.J. and S.-D.K.; resources, M.-A.J.; data curation, M.-A.J. and S.-D.K.; writing—original draft preparation, M.-A.J.; writing—review and editing, S.-D.K.; visualization, M.-A.J.; supervision, S.-D.K.; project administration, S.-D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data supporting the findings of this study are available within the article and its Supplementary Materials. The complete dataset and R code for meta-analysis are available from the corresponding author (S.-D.K.) upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Singh, H.; Meyer, A.N.; Thomas, E.J. The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Qual. Saf. 2014, 23, 727–731.
2. Graber, M.L.; Franklin, N.; Gordon, R. Diagnostic error in internal medicine. Arch. Intern. Med. 2005, 165, 1493–1499.
3. Berner, E.S.; Graber, M.L. Overconfidence as a cause of diagnostic error in medicine. Am. J. Med. 2008, 121, S2–S23.
4. Croskerry, P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad. Med. 2003, 78, 775–780.
5. Sutton, R.T.; Pincock, D.; Baumgart, D.C.; Sadowski, D.C.; Fedorak, R.N.; Kroeker, K.I. An overview of clinical decision support systems: Benefits, risks, and strategies for success. npj Digit. Med. 2020, 3, 17.
6. Shortliffe, E.H.; Sepúlveda, M.J. Clinical decision support in the era of artificial intelligence. JAMA 2018, 320, 2199–2200.
7. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
8. Yu, K.H.; Beam, A.L.; Kohane, I.S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2018, 2, 719–731.
9. Rajkomar, A.; Dean, J.; Kohane, I. Machine learning in medicine. N. Engl. J. Med. 2019, 380, 1347–1358.
10. Saposnik, G.; Redelmeier, D.; Ruff, C.C.; Tobler, P.N. Cognitive biases associated with medical decisions: A systematic review. BMC Med. Inform. Decis. Mak. 2016, 16, 138.
11. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; Van Der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
12. Liu, X.; Faes, L.; Kale, A.U.; Wagner, S.K.; Fu, D.J.; Bruynseels, A.; Mahendiran, T.; Moraes, G.; Shamdas, M.; Kern, C.; et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. Lancet Digit. Health 2019, 1, e271–e297.
13. Char, D.S.; Shah, N.H.; Magnus, D. Implementing machine learning in health care—Addressing ethical challenges. N. Engl. J. Med. 2018, 378, 981–983.
14. Esteva, A.; Robicquet, A.; Ramsundar, B.; Kuleshov, V.; DePristo, M.; Chou, K.; Cui, C.; Corrado, G.; Thrun, S.; Dean, J. A guide to deep learning in healthcare. Nat. Med. 2019, 25, 24–29.
15. Nagendran, M.; Chen, Y.; Lovejoy, C.A.; Gordon, A.C.; Komorowski, M.; Harvey, H.; Topol, E.J.; Ioannidis, J.P.; Collins, G.S.; Maruthappu, M. Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020, 368, m689.
16. Beam, A.L.; Kohane, I.S. Big data and machine learning in health care. JAMA 2018, 319, 1317–1318.
17. McKinney, S.M.; Sieniek, M.; Godbole, V.; Godwin, J.; Antropova, N.; Ashrafian, H.; Back, T.; Chesus, M.; Corrado, G.S.; Darzi, A.; et al. International evaluation of an AI system for breast cancer screening. Nature 2020, 577, 89–94.
18. Ardila, D.; Kiraly, A.P.; Bharadwaj, S.; Choi, B.; Reicher, J.J.; Peng, L.; Tse, D.; Etemadi, M.; Ye, W.; Corrado, G.; et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 2019, 25, 954–961.
19. Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M.C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J.; et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016, 316, 2402–2410.
20. Hosny, A.; Parmar, C.; Quackenbush, J.; Schwartz, L.H.; Aerts, H.J.W.L. Artificial intelligence in radiology. Nat. Rev. Cancer 2018, 18, 500–510.
21. Ehteshami Bejnordi, B.; Veta, M.; Johannes van Diest, P.; Van Ginneken, B.; Karssemeijer, N.; Litjens, G.; Van Der Laak, J.A.; CAMELYON16 Consortium; Hermsen, M.; Manson, Q.F.; et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 2017, 318, 2199–2210.
22. Chartrand, G.; Cheng, P.M.; Vorontsov, E.; Drozdzal, M.; Turcotte, S.; Pal, C.J.; Kadoury, S.; Tang, A. Deep learning: A primer for radiologists. Radiographics 2017, 37, 2113–2131.
23. Shen, D.; Wu, G.; Suk, H.I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248.
24. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
25. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
26. Lakhani, P.; Sundaram, B. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 2017, 284, 574–582.
27. Bejnordi, B.E.; Mullooly, M.; Pfeiffer, R.M.; Fan, S.; Vacek, P.M.; Weaver, D.L.; Herschorn, S.; Brinton, L.A.; van Ginneken, B.; Karssemeijer, N.; et al. Using deep convolutional neural networks to identify and classify tumor-associated stroma in diagnostic breast biopsies. Mod. Pathol. 2018, 31, 1502–1512.
28. Rajpurkar, P.; Irvin, J.; Ball, R.L.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.P.; et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 2018, 15, e1002686.
29. Kelly, C.J.; Karthikesalingam, A.; Suleyman, M.; Corrado, G.; King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019, 17, 195.
30. Aggarwal, R.; Sounderajah, V.; Martin, G.; Ting, D.S.; Karthikesalingam, A.; King, D.; Ashrafian, H.; Darzi, A. Diagnostic accuracy of deep learning in medical imaging: A systematic review and meta-analysis. npj Digit. Med. 2021, 4, 65.
31. Wu, E.; Wu, K.; Daneshjou, R.; Ouyang, D.; Ho, D.E.; Zou, J. How medical AI devices are evaluated: Limitations and recommendations from an analysis of FDA approvals. Nat. Med. 2021, 27, 582–584.
32. Goddard, K.; Roudsari, A.; Wyatt, J.C. Automation bias: A systematic review of frequency, effect mediators, and mitigators. J. Am. Med. Inform. Assoc. 2012, 19, 121–127.
33. Lyell, D.; Coiera, E. Automation bias and verification complexity: A systematic review. J. Am. Med. Inform. Assoc. 2017, 24, 423–431.
34. Cabitza, F.; Rasoini, R.; Gensini, G.F. Unintended consequences of machine learning in medicine. JAMA 2017, 318, 517–518.
35. Parikh, R.B.; Teeple, S.; Navathe, A.S. Addressing bias in artificial intelligence in health care. JAMA 2019, 322, 2377–2378.
36. Benjamens, S.; Dhunnoo, P.; Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: An online database. npj Digit. Med. 2020, 3, 118.
37. Liberati, A.; Altman, D.G.; Tetzlaff, J.; Mulrow, C.; Gøtzsche, P.C.; Ioannidis, J.P.; Clarke, M.; Devereaux, P.J.; Kleijnen, J.; Moher, D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: Explanation and elaboration. BMJ 2009, 339, b2700.
38. Higgins, J.P.T.; Thomas, J.; Chandler, J.; Cumpston, M.; Li, T.; Page, M.J.; Welch, V.A. (Eds.) Cochrane Handbook for Systematic Reviews of Interventions, Version 6.3; Cochrane: London, UK, 2022.
39. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G.; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Med. 2009, 6, e1000097.
40. Cai, C.J.; Winter, S.; Steiner, D.; Wilcox, L.; Terry, M. “Hello AI”: Uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proc. ACM Hum. Comput. Interact. 2019, 3, 104.
41. Sendak, M.P.; Gao, M.; Brajer, N.; Balu, S. Presenting machine learning model information to clinical end users with model facts labels. npj Digit. Med. 2020, 3, 41.
42. Asan, O.; Bayrak, A.E.; Choudhury, A. Artificial intelligence and human trust in healthcare: Focus on clinicians. J. Med. Internet Res. 2020, 22, e15154.
43. Wiens, J.; Saria, S.; Sendak, M.; Ghassemi, M.; Liu, V.X.; Doshi-Velez, F.; Jung, K.; Heller, K.; Kale, D.; Saeed, M.; et al. Do no harm: A roadmap for responsible machine learning for health care. Nat. Med. 2019, 25, 1337–1340.
44. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
45. Guyatt, G.H.; Oxman, A.D.; Vist, G.E.; Kunz, R.; Falck-Ytter, Y.; Alonso-Coello, P.; Schünemann, H.J. GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008, 336, 924–926.
46. Yun, J.; Park, J.E.; Lee, H.; Jung, W.S.; Choi, S.H.; Yoo, R.-E.; Hwang, I.P. Impact of artificial intelligence-based clinical decision support on the diagnostic accuracy and confidence of radiologists for intracranial hemorrhage detection: A prospective multicenter randomized controlled trial. npj Digit. Med. 2023, 6, 38.
47. Nam, J.G.; Hwang, E.J.; Kim, J.; Park, N.; Lee, E.H.; Kim, H.J.; Nam, M.; Lee, J.H.; Park, C.M.; Goo, J.M. AI improves nodule detection on chest radiographs in a health screening population: A randomized controlled trial. Radiology 2023, 307, e221894.
48. Hwang, E.J.; Goo, J.M.; Nam, J.G.; Park, C.M.; Hong, K.J.; Kim, K.H. Development and deployment of a deep learning model for emergency department chest radiograph interpretation: A multicenter study. Korean J. Radiol. 2023, 24, 260–270.
49. Harada, Y.; Shimizu, T.; Tokuda, Y.; Miyano, S.; Wakamiya, S.; Aramaki, E. Diagnostic accuracy of an AI-based differential diagnosis list for common diseases: A multicenter randomized controlled trial. Int. J. Environ. Res. Public Health 2021, 18, 2086.
50. Homayounieh, F.; Digumarthy, S.; Ebrahimian, S.; Rueckel, J.; Hoppe, B.F.; Sabel, B.O.; Conjeti, S.; Ridder, K.; Sistermanns, M.; Wang, L.; et al. An artificial intelligence-based chest X-ray model on human nodule detection accuracy from a multicenter study. JAMA Netw. Open 2021, 4, e2141096.
51. Haenssle, H.A.; Fink, C.; Schneiderbauer, R.; Toberer, F.; Buhl, T.; Blum, A.; Kalloo, A.; Hassen, A.B.H.; Thomas, L.; Enk, A.; et al. Man against machine: Diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 2018, 29, 1836–1842.
52. De Fauw, J.; Ledsam, J.R.; Romera-Paredes, B.; Nikolov, S.; Tomasev, N.; Blackwell, S.; Askham, H.; Glorot, X.; O’Donoghue, B.; Visentin, D.; et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 2018, 24, 1342–1350.
53. Tschandl, P.; Codella, N.; Akay, B.N.; Argenziano, G.; Braun, R.P.; Cabo, H.; Gutman, D.; Halpern, A.; Helba, B.; Hofmann-Wellenhof, R.; et al. Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: An open, web-based, international, diagnostic study. Lancet Oncol. 2019, 20, 938–947.
54. Keane, P.A.; Topol, E.J. With an eye to AI and autonomous diagnosis. npj Digit. Med. 2018, 1, 40.
55. Parasuraman, R.; Manzey, D.H. Complacency and bias in human use of automation: An attentional integration. Hum. Factors 2010, 52, 381–410.
56. Skitka, L.J.; Mosier, K.L.; Burdick, M. Does automation bias decision-making? Int. J. Hum. Comput. Stud. 1999, 51, 991–1006.
57. Gaube, S.; Suresh, H.; Raue, M.; Merritt, A.; Berkowitz, S.J.; Lermer, E.; Coughlin, J.F.; Guttag, J.V.; Colak, E.; Ghassemi, M. Do as AI say: Susceptibility in deployment of clinical decision-aids. npj Digit. Med. 2021, 4, 31.
58. Jacobs, M.; Pradier, M.F.; McCoy, T.H., Jr.; Perlis, R.H.; Doshi-Velez, F.; Gajos, K.Z. How machine-learning recommendations influence clinician treatment selections: The example of antidepressant selection. Transl. Psychiatry 2021, 11, 108.
59. Cresswell, K.; Williams, R.; Sheikh, A. Developing and applying a formative evaluation framework for health information technology implementations: Qualitative investigation. J. Med. Internet Res. 2020, 22, e15068.
60. Ratwani, R.M.; Reider, J.; Singh, H. A decade of health information technology usability challenges and the path forward. JAMA 2019, 321, 743–744.
61. Chen, J.H.; Asch, S.M. Machine learning and prediction in medicine—Beyond the peak of inflated expectations. N. Engl. J. Med. 2017, 376, 2507–2509.
62. Faes, L.; Liu, X.; Wagner, S.K.; Fu, D.J.; Balaskas, K.; Sim, D.A.; Bachmann, L.M.; Keane, P.A.; Denniston, A.K. A clinician’s guide to artificial intelligence: How to critically appraise machine learning studies. Transl. Vis. Sci. Technol. 2020, 9, 7.
63. Sterne, J.A.C.; Savović, J.; Page, M.J.; Elbers, R.G.; Blencowe, N.S.; Boutron, I.; Cates, C.J.; Cheng, H.Y.; Corbett, M.S.; Eldridge, S.M.; et al. RoB 2: A revised tool for assessing risk of bias in randomised trials. BMJ 2019, 366, l4898.
64. Gama, F.; Tyskbo, D.; Nygren, J.; Barlow, J.; Reed, J.; Svedberg, P. Implementation frameworks for artificial intelligence translation into health care practice: Scoping review. J. Med. Internet Res. 2022, 24, e32215.
65. Damschroder, L.J.; Aron, D.C.; Keith, R.E.; Kirsh, S.R.; Alexander, J.A.; Lowery, J.C. Fostering implementation of health services research findings into practice: A consolidated framework for advancing implementation science. Implement. Sci. 2009, 4, 50.
66. DerSimonian, R.; Laird, N. Meta-analysis in clinical trials. Control. Clin. Trials 1986, 7, 177–188.
67. Obermeyer, Z.; Powers, B.; Vogeli, C.; Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019, 366, 447–453.
68. Gianfrancesco, M.A.; Tamang, S.; Yazdany, J.; Schmajuk, G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern. Med. 2018, 178, 1544–1547.
69. Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring inconsistency in meta-analyses. BMJ 2003, 327, 557–560.
70. Egger, M.; Davey Smith, G.; Schneider, M.; Minder, C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997, 315, 629–634.
71. Muehlematter, U.J.; Daniore, P.; Vokinger, K.N. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015–20): A comparative analysis. Lancet Digit. Health 2021, 3, e195–e203.
72. Reddy, S.; Allan, S.; Coghlan, S.; Cooper, P. A governance model for the application of AI in health care. J. Am. Med. Inform. Assoc. 2020, 27, 491–497.
73. Pronovost, P.J.; Cleeman, J.I.; Wright, D.; Srinivasan, A. Fifteen years after To Err is Human: A success story to learn from. BMJ Qual. Saf. 2016, 25, 396–399.
74. Peiffer-Smadja, N.; Rawson, T.M.; Ahmad, R.; Buchard, A.; Georgiou, P.; Lescure, F.X.; Birgand, G.; Holmes, A.H. Machine learning for clinical decision support in infectious diseases: A narrative review of current applications. Clin. Microbiol. Infect. 2020, 26, 584–595.

Figure 1. PRISMA 2020 flow diagram. Record counts reconciled: PubMed/MEDLINE (n = 312), Embase (n = 248), Cochrane CENTRAL (n = 156), CINAHL (n = 124), Google Scholar (n = 80); Total = 920. After deduplication (n = 245 removed): 675 screened; 628 excluded; 47 full-text assessed; 42 excluded; 5 included.
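
The record flow in the Figure 1 caption can be checked with simple arithmetic. The following is a minimal sketch, not part of the authors' analysis; the variable names are illustrative assumptions, and the counts are copied from the caption.

```python
# Record counts as reported in the Figure 1 caption.
records_by_source = {
    "PubMed/MEDLINE": 312,
    "Embase": 248,
    "Cochrane CENTRAL": 156,
    "CINAHL": 124,
    "Google Scholar": 80,
}
duplicates_removed = 245
excluded_at_screening = 628
excluded_at_full_text = 42

# Each PRISMA stage is the previous stage minus the records removed there.
identified = sum(records_by_source.values())
screened = identified - duplicates_removed
full_text_assessed = screened - excluded_at_screening
included = full_text_assessed - excluded_at_full_text

print(identified, screened, full_text_assessed, included)
```

Each intermediate value corresponds to one box of the flow diagram, so a mismatch at any stage would indicate which count in the caption needs rechecking.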

Figure 2. Risk of bias summary for five included randomised controlled trials, assessed using the Cochrane Risk of Bias 2 (RoB 2) tool across five domains: D1, randomisation process; D2, deviations from intended interventions; D3, missing outcome data; D4, measurement of the outcome; D5, selection of the reported result. + Low risk; ? Some concerns [46,47,48,49,50].

Figure 3. Forest plot of the pooled meta-analysis of AI-CDSS effect on diagnostic accuracy (random-effects model; DerSimonian–Laird). Individual study estimates (squares) with 95% confidence intervals (horizontal lines) are shown for Yun et al. [46], Nam et al. [47], Hwang et al. [48], Harada et al. [49], and Homayounieh et al. [50]. The pooled effect (red diamond) represents SMD = 0.182 (95% CI: 0.003–0.362; p = 0.047). Heterogeneity: τ² = 0.012; χ² = 4.18, df = 4 (p = 0.38); I² = 4%. SMD, standardized mean difference; CI, confidence interval; AI-CDSS, artificial intelligence-based clinical decision support system.
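
The summary statistics in the Figure 3 caption are linked by standard formulas, which the following sketch uses as a consistency check. It assumes a symmetric Wald interval and a normal approximation for the pooled SMD, and uses the I² definition of Higgins et al. (I² = (Q − df)/Q, floored at zero). The helper `chi2_sf_even_df` is a hypothetical name introduced here; its closed form is valid only for even degrees of freedom. This is not the authors' R code. The check covers only the caption's own figures; Table 2 reports a different I² (68.6%), and the sketch does not attempt to reconcile the two.

```python
import math

# Values as reported in the Figure 3 caption.
smd, ci_low, ci_high = 0.182, 0.003, 0.362
q, df = 4.18, 4  # Cochran's Q (reported as chi-squared) and its degrees of freedom

# Standard error implied by a symmetric 95% Wald interval (assumption).
z_crit = 1.959964  # 97.5th percentile of the standard normal distribution
se = (ci_high - ci_low) / (2 * z_crit)

# Two-sided p-value for the pooled SMD under a normal approximation.
z = smd / se
p_effect = math.erfc(abs(z) / math.sqrt(2))


def chi2_sf_even_df(x, df):
    """Chi-squared upper-tail probability; closed form valid only for even df."""
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


# p-value of the heterogeneity test and the I-squared statistic, in percent.
p_heterogeneity = chi2_sf_even_df(q, df)
i_squared = max(0.0, (q - df) / q) * 100

print(f"SE = {se:.4f}, z = {z:.2f}, p = {p_effect:.3f}")
print(f"Q p-value = {p_heterogeneity:.2f}, I^2 = {i_squared:.1f}%")
```

Under these assumptions the recovered p-values and I² can be compared directly with the caption's p = 0.047, p = 0.38, and I² = 4%.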

Figure 4. Funnel plot for assessment of publication bias in the meta-analysis of AI-CDSS effect on diagnostic accuracy. Each point represents one included study: Yun et al. [46], Nam et al. [47], Hwang et al. [48], Harada et al. [49], and Homayounieh et al. [50]. The red vertical line indicates the pooled SMD = 0.182. Dashed lines represent the 95% pseudo-confidence interval funnel boundary. Egger’s regression test: intercept = 1.24 (95% CI: −0.87 to 3.35; p = 0.18), indicating no significant publication bias. SMD, standardized mean difference; SE, standard error; CI, confidence interval; AI-CDSS, artificial intelligence-based clinical decision support system.

Table 1. Characteristics of included studies (n = 5).

Study	Country	Clinical Specialty	AI System Type	Clinical Task	Sample Size	Participants	Primary Outcome	Risk of Bias
Yun et al., 2023 [46]	South Korea	Radiology	Deep Learning (CNN)	Intracranial hemorrhage detection (CT)	7200	Radiologists	Diagnostic accuracy (AUC)	Low risk
Nam et al., 2023 [47]	South Korea	Radiology	Deep Learning (CNN)	Chest radiograph abnormality detection	2289	Radiologists	Sensitivity, specificity	Some concerns
Hwang et al., 2023 [48]	South Korea	Emergency Medicine	Deep Learning (CNN)	Emergency chest X-ray interpretation	2450	Emergency physicians	Diagnostic accuracy	Some concerns
Harada et al., 2021 [49]	Japan	General Medicine	Machine Learning	Differential diagnosis support	58	Physicians	Diagnostic accuracy	Low risk
Homayounieh et al., 2021 [50]	USA	Radiology	Deep Learning (CNN)	Chest X-ray nodule detection	660	Radiologists	Sensitivity, specificity	Low risk
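
The pooled participant count reported in Table 2 (12,657) can be reproduced from the per-study sample sizes in Table 1. The sketch below is illustrative only; the dictionary keys are shorthand labels, and the share calculation is an added convenience rather than an analysis from the article.

```python
# Sample sizes as listed in the "Sample Size" column of Table 1.
sample_sizes = {
    "Yun 2023": 7200,
    "Nam 2023": 2289,
    "Hwang 2023": 2450,
    "Harada 2021": 58,
    "Homayounieh 2021": 660,
}

total = sum(sample_sizes.values())

# Fraction of the pooled sample contributed by each study.
shares = {study: n / total for study, n in sample_sizes.items()}

print(total)
for study, share in shares.items():
    print(f"{study}: {share:.1%}")
```

The shares show how unevenly the pooled sample is distributed across the five trials, which is relevant context when reading the single-study subgroup results.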

Table 2. GRADE evidence profile for the effect of AI-CDSS on diagnostic accuracy.

Outcome	No. Studies	No. Participants	Risk of Bias	Inconsistency	Indirectness	Imprecision	Pub. Bias	Certainty
Diagnostic accuracy (SMD = 0.182; 95% CI: 0.003–0.362)	5 RCTs	12,657	Not serious	Serious (I² = 68.6%)	Serious (specialty/geography)	Not serious	Not detected	⊕⊕⊕◯ MODERATE

Note: ⊕⊕⊕◯ = Moderate certainty (GRADE). ⊕ indicates a domain contributing to certainty; ◯ indicates a domain where certainty was downgraded. Downgrading was applied for inconsistency (I² = 68.6%) and indirectness (specialty and geographic limitations). “Not serious” indicates no downgrading was applied for that domain. Symbols follow the GRADE framework [45].

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
