Meta-Analysis on Comparison of Diagnostic Accuracy Between Artificial Intelligence and Healthcare Professionals

Kumar, Prem; Alnaimi, Nouf A.; Soman, Sumi; Suansing, Leda; Ryan Arriola, Daniel; Jamea, Lamiaa Al

doi:10.3390/sci8040073

Open Access Review


by Prem Kumar ¹,*, Nouf A. Alnaimi ², Sumi Soman ¹, Leda Suansing ¹, Daniel Ryan Arriola II ¹ and Lamiaa Al Jamea ¹

¹ King Fahad Military Medical Complex, Ministry of Defence Health Services, Dhahran 31932, Saudi Arabia
² Al Mouwasat Hospital, Alkhobar 34234, Saudi Arabia
* Author to whom correspondence should be addressed.

Sci 2026, 8(4), 73; https://doi.org/10.3390/sci8040073

Submission received: 19 February 2026 / Revised: 19 March 2026 / Accepted: 23 March 2026 / Published: 31 March 2026


Abstract

Background: Artificial intelligence (AI) can significantly enhance the efficient allocation of healthcare resources. The use of AI-driven diagnostic tests in healthcare settings supports healthcare professionals (HCPs) in diagnosis, treatment, and the prediction of patient outcomes. Methods: Relevant research studies published between 1 January 2015 and 30 August 2025 were included in this review. Randomized, retrospective, prospective, observational, comparative, and cross-sectional studies were incorporated. The PROBAST + AI tool was used to assess the risk of bias (ROB) and applicability concerns across the included studies. Results: The overall average diagnostic accuracy for AI vs. general HCPs was 81% vs. 71%. In comparisons of AI vs. non-expert HCPs, the accuracy was 95% vs. 82%. AI achieved significantly higher diagnostic accuracy than general and non-expert HCPs with odds ratios (OR) of 1.51 (95% CI: 1.17–1.96, p = 0.002) and 3.34 (95% CI: 1.13–9.86, p = 0.03), respectively. Diagnostic accuracy between AI and expert HCPs was 91% vs. 86%; AI achieved similar diagnostic accuracy to expert HCPs with an odds ratio (OR) of 0.72 (95% CI: 0.25–2.07, p = 0.54). Additionally, high levels of burden or burnout were significantly lower among healthcare professionals supported by AI compared with those working without AI. The pooled estimate yielded an OR of 1.77 (95% CI: 1.40–2.24, p < 0.00001), indicating a meaningful reduction in workload-related stress when AI tools were integrated into clinical practice. Conclusions: Based on the findings, AI demonstrates a positive impact on diagnostic accuracy and contributes to reducing the workload of healthcare professionals.

Keywords:

AI—artificial intelligence; HCP—healthcare professionals; DA—diagnostic accuracy

1. Introduction

The adoption of artificial intelligence (AI) in healthcare settings continues to rise, with most facilities utilizing AI to enhance the quality of patient care. More than 80% of healthcare settings employ AI to improve patient outcomes and increase workflow efficiency [1]. In the United States, nearly 46% of hospitals are in the beginning stages of AI implementation, though reports indicate many institutions are actively working toward enterprise-level deployment [2]. Recently, the AI healthcare market increased to $32.34 billion in 2024 and is projected to reach $431.05 billion by 2032. This growth reflects rapid investment in AI projects within healthcare and research sectors to bolster the quality of care [3].

In the clinical environment, achieving accurate and early diagnosis of diseases remains a challenge for many healthcare professionals (HCPs). These professionals often face difficulties in identifying precise disease conditions and symptoms. AI may assist HCPs by saving time and improving diagnostic accuracy through advanced algorithm processing capabilities. By analyzing large-scale electronic health records, AI supports HCPs in reaching correct diagnoses. Furthermore, AI can assist in decision-making during urgent situations, where algorithms help prioritize serious cases and reduce patient waiting times [4]. AI algorithms have also been reported to improve diagnostic accuracy and risk-stratified patient care [5].

Generative AI is evolving rapidly and is increasingly utilized across various healthcare settings. Diverse professionals, including physicians, nurses, and laboratory technicians, utilize AI to examine clinical cases and gain a deeper understanding of medical conditions. Evidence suggests AI technologies are being used to train HCPs at higher medical educational institutions. A vital section of current research involves addressing barriers to AI application, such as ethical and regulatory issues, and the integration of AI into existing medical information systems. Further research should focus on the implementation and use of AI technologies within medical and educational establishments [6]. Future developments in multidisciplinary collaboration will likely emphasize improving algorithm accuracy, strengthening resilience against bias, and ensuring the safe application of AI for patient benefit [7].

While much of the current literature focuses on clinical diagnostic accuracy, the role of AI extends to optimizing the operational infrastructure supporting patient care. For instance, advanced models combining kernel Fisher discriminant analysis with improved graph convolutional neural networks have been utilized for the fault diagnosis of air handling units, ensuring a stable and safe environment for healthcare delivery [8].

Currently, few published studies evaluate the diagnostic accuracy of AI in comparison to general, expert, and non-expert healthcare professionals [9,10,11]. To address this knowledge gap, the present study evaluates the diagnostic accuracy between artificial intelligence and general, expert, and non-expert healthcare professionals. Additionally, this review explores the potential of AI in optimizing physician tasks and reducing professional burden. Through this approach, the review contributes to the global implementation of artificial intelligence in healthcare.

The remainder of this review is organized as follows: Section 2 details the materials and methods, including the PRISMA-compliant search strategy, inclusion and exclusion criteria, and the use of the PROBAST + AI tool for risk-of-bias assessment. Section 3 presents the study characteristics, ROB of included studies, and results of meta-analysis comparing diagnostic accuracy across different healthcare professional expertise levels and AI. Section 4 provides a discussion of these findings, addressing the implications for clinical practice and professional workload. Finally, Section 5 concludes the paper by summarizing the key insights and suggesting directions for future research.

2. Materials and Methods

This section describes the methodological approach, which outlines the criteria for inclusion and exclusion, the literature searches, data selection, data extraction, quality/risk of bias assessment of included studies, and data analysis.

This review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [12].

2.1. Criteria for Inclusion and Exclusion

Inclusion criteria:
- Journal articles published in peer-reviewed journals between 1 January 2015 and 30 August 2025.
- Research studies focused on using AI to enhance the quality of healthcare.
- Studies presenting data on healthcare outcomes such as diagnostic accuracy, streamlining of HCP tasks, and reduction of HCP workload.
Exclusion criteria:
- Abstracts, editorials, reviews, discussion articles, and conference papers lacking original data.
- Research that does not specifically address the application of AI in healthcare.

2.1.1. Study Types

Studies employing a variety of designs were included, such as randomized controlled trials, prospective and retrospective studies, comparative studies, observational studies, and cross-sectional studies. Each of these evaluated the diagnostic accuracy of artificial intelligence (AI) in comparison with healthcare professionals (HCPs). This inclusive approach facilitated the capture of a broad spectrum of evidence across different clinical settings, AI systems, and levels of clinician expertise, thereby providing a comprehensive assessment of AI’s diagnostic performance (Supplementary File S2).

Studies evaluating the diagnostic accuracy of artificial intelligence without comparison groups were excluded (Supplementary File S3).

2.1.2. Participant Types

The participants in this review consisted of artificial intelligence (AI) systems, including ChatGPT, GPT-3, GPT-3.5, Maya-MD, the Ada app, Cascade-RCNN models, and various machine learning and deep learning algorithms. Healthcare professionals (HCPs) evaluated for comparison included radiologists, dermatologists, cardiologists, emergency physicians, general practitioners, neurologists, retinal specialists, rheumatologists, and endoscopists.

2.1.3. Intervention Types and Controls

Comparisons were planned between artificial intelligence (AI) and three distinct healthcare professional (HCP) groups: expert, general, and non-expert. Expert healthcare professionals consist of specialists and sub-specialists who possess advanced training in specific organ systems or diagnostic modalities. General healthcare professionals are defined as physicians or clinicians who provide primary care or broad-spectrum medical services. Non-expert healthcare professionals include individuals currently in the process of specialization, such as those in training (Table 1).

2.1.4. Outcome Measures

The primary outcome of this review was to compare the diagnostic accuracy of artificial intelligence (AI) with that of healthcare professionals (HCPs), and to evaluate the potential of AI in optimizing clinical tasks and reducing the workload of physicians. Outcome measures included a comparison of correct diagnoses between AI and HCP groups (General, Expert, and Non-expert), as well as assessments of workflow efficiency, time savings, and the impact of AI on clinical decision-making and task delegation.

2.2. Literature Searches

The literature searches followed a standard template prepared by the authors, as described in the subsections below.

2.2.1. Electronic Searches

The authors searched electronic databases on 30 September 2025. No language restrictions were applied. The following electronic databases were searched (Supplementary File S1):

PubMed (from 2015 to 2025).
Google Scholar (from 2015 to 2025).
Embase (from 2015 to 2025).
Scopus (from 2015 to 2025).
Web of Science (from 2015 to 2025).
Science.gov beta (from 2015 to 2025).
ClinicalTrials.gov (on 30 September 2025).
Saudi Clinical Registry (on 3 October 2025).
Cumulative Index to Nursing and Allied Health Literature (CINAHL).

2.2.2. Searching Other Resources

Relevant systematic reviews and meta-analyses identified through the search process were screened. Studies not captured in the initial search but included in those reviews were also incorporated to ensure a comprehensive assessment of the available evidence.

2.3. Data Selection

Two authors (PK and SS) performed a blinded screening of all studies identified through the electronic search. During the title screening, clearly irrelevant studies were excluded (Supplementary File S3). The full texts of the remaining records were then assessed against predefined inclusion and exclusion criteria (Supplementary File S2). Disagreements were resolved through discussion or, when necessary, by consulting additional review authors.

2.4. Data Extraction

Two blinded authors extracted data from the selected articles using a standard data collection template (Table 2 and Table 3) covering the following domains:

Reference.
Year of Publication.
Place of Study.
Specialty.
Same dataset for AI and HCPs.
Comparison Type.
AI Model.
Samples.
Correct diagnosis by AI and HCPs.

The first author entered the study characteristics and risk-of-bias assessments into Review Manager [45], while the other authors verified and transferred the study data for analysis.

2.5. Quality/Risk of Bias Assessment of Included Studies

2.5.1. Risk of Bias

Two authors independently assessed the quality of studies using PROBAST + AI for all outcomes. The PROBAST + AI tool has four assessment domains: 1. Participants, 2. Predictors, 3. Outcomes, 4. Analysis. Each domain has signaling questions to assess the risk of bias. The tool also assesses applicability for the first three domains.

Risk of Bias Judgment

Participants

1.1 Were appropriate data sources used?

1.2 Was an appropriate study design used?

1.3 Did the inclusion and exclusion of study participants result in a representative dataset?

Predictors

2.1 Were predictors defined and assessed similarly for all participants?

2.2 Was any pre-processing of predictors similar for all participants?

2.3 Were predictor assessments made without knowledge of outcome data?

2.4 Were the predictors included in the model available at the time the model was intended to be used?

Outcomes

3.1 Were outcomes defined and assessed appropriately?

3.2 Were outcomes defined and assessed similarly for all participants?

3.3 Were outcome assessments made without the use or knowledge of predictor data?

3.4 Was the time interval between predictor assessment and outcome assessment appropriate?

Analysis

4.1 Was model evaluation based on only apparent performance avoided?

4.2 Was there evidence that the sample size was reasonable?

4.3 Were participants with missing or censored data handled appropriately in the analysis?

4.4 If methods to address class imbalance were used, was the evaluation done in a dataset without imbalance correction?

4.5 If data splitting was done to create training and test datasets, was there evidence that data leakage was avoided?

4.6 If resampling methods were used to evaluate model performance, were all model development steps replicated in the resampling process?

4.7 Was the predictive performance of the model evaluated appropriately, e.g., calibration, discrimination, and net benefit?

Interpretation

Each domain contains signaling questions that help authors determine whether the study design and methods are appropriate and free from bias. Responses to these questions are categorized as “Yes,” “Probably Yes,” “Probably No,” “No,” or “No Information,” leading to an overall judgment of risk of bias as low, high, or unclear.

2.5.2. Applicability Concern

Two authors independently assessed applicability concerns of the selected studies using PROBAST + AI for all outcomes. The PROBAST + AI tool has three applicability domains. Participants: concern that the data of the included participants do not match the review question or the assessor’s intended use of the prediction model. Predictors: concern that the definition, pre-processing, assessment, or timing of assessment of the predictors in the model does not match the review question or the assessor’s intended use. Outcomes: concern that the outcome, its definition, assessment, or timing of assessment does not match the review question or the assessor’s intended use.

Interpretation

Applicability was rated as low concern if all three domains were rated low concern, as high concern if at least one domain was rated high concern, and as unclear concern if at least one domain was rated unclear and no domain was rated high concern. We specifically used the PROBAST + AI tool because it has been revised for application to AI studies.
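The decision rule above can be written as a short function. The following is a minimal sketch only; the function name `overall_applicability` and the lowercase rating strings are illustrative assumptions, not part of the PROBAST + AI tool itself.

```python
def overall_applicability(domain_ratings):
    """Combine per-domain applicability ratings into one overall judgment.

    domain_ratings: ratings for the Participants, Predictors, and Outcomes
    domains, each one of 'low', 'high', or 'unclear' (hypothetical encoding).
    """
    ratings = [r.strip().lower() for r in domain_ratings]
    # Any high-concern domain makes the overall judgment high concern.
    if any(r == "high" for r in ratings):
        return "high"
    # Low concern requires every domain to be rated low.
    if all(r == "low" for r in ratings):
        return "low"
    # Otherwise at least one domain is unclear and none is high.
    return "unclear"


print(overall_applicability(["low", "low", "low"]))
print(overall_applicability(["low", "unclear", "low"]))
print(overall_applicability(["unclear", "high", "low"]))
```

Checking for a high rating first mirrors the rule that an unclear domain only determines the overall judgment when no domain is rated high.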

2.6. Data Analysis

2.6.1. Synthesis Methods

All quantitative analyses were conducted using Review Manager [45] (RevMan-Version-1.0.95), and between-study heterogeneity was evaluated using the I² statistic. The primary synthesis method was meta-analysis, which produces a pooled effect estimate by calculating a weighted average of the effect sizes reported in the included studies. Each study’s effect estimate (e.g., odds ratio or mean difference) was weighted by the inverse of its variance, ensuring that greater influence was attributed to studies with higher precision, such as those with larger sample sizes or a higher number of events.

Two modeling approaches were considered: the Fixed-Effect Model (FEM), which assumes that all studies estimate a single underlying true effect, and the Random-Effects Model (REM), which assumes that true effects vary across studies and estimates the mean effect across this distribution. Meta-analysis results were presented using Forest Plots, with the pooled estimate displayed as a diamond representing the combined effect and its corresponding 95% confidence interval.
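To make the weighting concrete, the sketch below computes a per-study log odds ratio from counts of correct diagnoses and then pools studies with inverse-variance weights under both models. It is an illustration under stated assumptions, not the exact RevMan procedure: the between-study variance is estimated with the DerSimonian-Laird method (the review does not name its estimator), a 0.5 continuity correction is assumed for zero cells, and the study counts are hypothetical.

```python
import math


def log_odds_ratio(correct_ai, n_ai, correct_hcp, n_hcp):
    """Return (log OR, variance) for one study; event = correct diagnosis."""
    a, b = correct_ai, n_ai - correct_ai
    c, d = correct_hcp, n_hcp - correct_hcp
    if min(a, b, c, d) == 0:
        # Assumed continuity correction to avoid division by zero.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool(studies):
    """Pool (effect, variance) pairs with fixed- and random-effects models."""
    k = len(studies)
    w = [1.0 / v for _, v in studies]  # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * y for wi, (y, _) in zip(w, studies)) / sum_w
    # Cochran's Q measures dispersion around the fixed-effect estimate.
    q = sum(wi * (y - fixed) ** 2 for wi, (y, _) in zip(w, studies))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # DerSimonian-Laird between-study variance (assumed estimator).
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for _, v in studies]
    random_effect = sum(wi * y for wi, (y, _) in zip(w_re, studies)) / sum(w_re)
    se_random = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return {"fixed": fixed, "random": random_effect, "se_random": se_random,
            "tau2": tau2, "q": q, "i2": i2}


# Hypothetical counts: (correct AI, total AI, correct HCP, total HCP).
counts = [(80, 100, 70, 100), (45, 50, 40, 50), (150, 200, 160, 200)]
result = pool([log_odds_ratio(*row) for row in counts])
low = math.exp(result["random"] - 1.96 * result["se_random"])
high = math.exp(result["random"] + 1.96 * result["se_random"])
print(f"Random-effects OR = {math.exp(result['random']):.2f} "
      f"(95% CI {low:.2f} to {high:.2f}), I2 = {result['i2']:.0f}%")
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights and the two pooled estimates coincide.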

2.6.2. Investigation of Heterogeneity and Subgroup Analysis

Subgroup analyses were planned to compare diagnostic accuracy across general, expert, and non-expert healthcare professionals, as well as across different medical specialties and studies where AI performance was compared. These analyses aimed to determine whether diagnostic performance varied according to the level of clinical expertise or the specific clinical domain of the diagnostic task.

Subgroup analyses also provided a strategy for exploring potential sources of heterogeneity. Variability in study populations, clinical settings, AI model types, or diagnostic tasks could contribute to differences in effect estimates. By examining results within more homogeneous subgroups, the factors underlying between-study heterogeneity could be better understood, thereby enhancing the interpretability of the pooled findings.

2.6.3. Sensitivity Analysis

To assess the stability of the pooled diagnostic accuracy estimates, a sensitivity analysis was conducted by isolating studies in which healthcare professionals (HCPs) demonstrated superior performance over AI. This approach determines the extent to which these specific studies contribute to overall heterogeneity and whether their inclusion significantly alters the primary meta-analysis conclusions.

2.6.4. Certainty of the Evidence Assessment

Two authors independently assessed the certainty of evidence as high, moderate, low, or very low according to the five GRADE domains: risk of bias, inconsistency, imprecision, indirectness, and publication bias. Assessments were conducted in accordance with the guidance in the Cochrane Handbook for Systematic Reviews of Interventions, utilizing EPOC worksheets and GRADEpro GDT software to support the process. Disagreements were resolved through discussion. Clear justifications for all decisions to downgrade or upgrade the certainty of evidence were provided in footnotes within the summary of findings tables, with additional explanatory comments included as necessary to facilitate reader understanding. Plain-language statements were utilized to present the certainty assessments in an accessible and transparent manner.

3. Results

This section presents the study findings, including a summary of the characteristics of the included studies and an assessment of their quality and risk of bias. Additionally, this section reports the pooled results comparing the diagnostic accuracy of artificial intelligence (AI) and healthcare professionals. The findings are presented through descriptive summaries, relevant figures, and tables.

3.1. Study Characteristics

A total of 22,566 records were identified through comprehensive electronic database searches. Following the removal of duplicates, 2595 unique studies remained for further evaluation. From these, 2501 studies were excluded during the title and abstract screening phase due to irrelevance, study population, incompatible outcomes, or study type. Consequently, 94 articles were shortlisted for full-text eligibility assessment.

Subsequently, 55 full-text articles underwent a detailed eligibility assessment. During this stage, 23 studies were excluded for failing to meet predefined inclusion criteria; specific reasons included inappropriate study design, insufficient methodological details, absence of key outcome measures, or population mismatch. Ultimately, 32 studies satisfied all eligibility criteria and were included in the quantitative synthesis for data extraction and statistical analysis (Figure 1).

To ensure methodological rigor, study selection was performed independently by two authors, with disagreements resolved through discussion or consultation with the third author. Additionally, the reference lists of included studies were screened to identify further eligible research. A PRISMA flow chart was utilized to document each stage of the study selection process, ensuring transparency and reproducibility.

3.1.1. Results of the Search

The searches spanned the period from 1 January 2015 to 30 September 2025. Through comprehensive database searches, a total of 22,566 records were identified. The screening process for this update, including the number of studies brought forward, is illustrated in Figure 1.

3.1.2. Included Studies

Of the 32 included studies, 13 (40%) were randomized controlled trials (Baek 2025 [41]; Boginskis 2023 [13]; Camkıran 2025 [35]; Choi 2020 [33]; Faqar-Uz-Zaman 2022 [37]; Han 2022 [29]; Harada 2021 [24]; Homayounieh 2021 [21]; Keenan 2020 [44]; Liu 2022 [25]; Liu 2023 [27]; Luo 2021 [43]; Wang 2021 [22]), seven (21%) were observational studies (David M. Levine 2024 [14]; Delshad 2021 [16]; Lisa Herzog 2023 [32]; Michael Gottlieb 2024 [18]; Misurac 2025 [36]; Olson 2025 [38]; Fonseca 2024 [34]), six (18%) were retrospective studies (Cohen 2023 [19]; Guermazi 2022 [40]; Rauschecker 2020 [15]; Twinprai 2022 [20]; van Doorn 2021 [28]; Yamamura 2025 [31]), three (9%) were comparative studies (Gan 2019 [17]; Gräf 2022 [26]; Krusche 2024 [39]), two (6%) were cross-sectional studies (Lyons 2024 [42]; Tamai 2023 [23]), and one (3%) was a prospective study (Garcia 2024 [30]). We extracted only the diagnostic accuracy data for the current review (Table 3).

To improve transparency regarding the comparability of diagnostic evaluations, the included studies were classified according to whether artificial intelligence (AI) and healthcare professionals (HCPs) assessed the same dataset (paired comparison) or different datasets. Most studies in this meta-analysis utilized paired diagnostic designs, where AI and HCPs evaluated identical imaging datasets or clinical cases, enabling a direct comparison of diagnostic accuracy. However, a small number of studies utilized alternative designs, such as vignette-based assessments or workload comparisons involving AI-assisted versus non-assisted cohorts. The classification of study designs and comparison conditions for each study is presented in Table 2 and Table 3, providing methodological transparency regarding the comparability of AI and HCP diagnostic performance.

The yearly distribution of the included studies is as follows: one study (3%) in 2019, three studies (9%) in 2020, six studies (18%) in 2021, six studies (18%) in 2022, five studies (15%) in 2023, six studies (18%) in 2024, and five studies (15%) in 2025 (Figure 2). The diagnostic cases spanned multiple clinical domains, including radiology (13 studies, 40%), cardiology (four studies, 12%), emergency medicine (four studies, 12%), pathology (two studies, 6%), neurology (two studies, 6%), dermatology (two studies, 6%), ophthalmology (two studies, 6%), and clinician burnout assessment (four studies, 12%). Further details regarding the included studies are provided in Supplementary File S2.

3.1.3. Excluded Studies

A total of 22 studies (Chang 2025 [46]; Edström 2025 [47]; Eng 2021 [48]; Geneş 2024 [49]; Gertz 2024 [50]; Gottlieb 2024 [51]; Hoppe 2024 [52]; Janik 2024 [53]; Jiao 2025 [54]; Johnson 2024 [55]; Kucking 2025 [56]; Li 2022 [57]; Luna 2021 [58]; Park 2019 [59]; Pluym 2021 [60]; Prinster 2024 [61]; Richard 2024 [62]; Surya 2023 [63]; Taloni 2023 [64]; Turan 2025 [65]; Yang 2024 [66]; Yu 2022 [67]) were excluded from the review. The reasons for exclusion were the absence of a comparison group or failure to meet the inclusion criteria.

3.2. Quality/Risk of Bias of Included Studies

Figure 3 summarizes the overall methodological quality of the included studies. The included articles were evaluated using the PROBAST + AI tool. The domains include participants, predictors, outcomes, and analysis.

3.2.1. Risk of Bias Judgment

Participants: The majority of studies (26 of 32) had a low risk of bias. Predictors: The majority of studies (29 of 32) had a low risk of bias. Outcomes: The majority of studies (29 of 32) had a low risk of bias. Analysis: Half of the studies (16 of 32) had an unclear risk of bias. Regarding applicability, half of the studies (16/32) had low applicability concerns, and 13/32 had high applicability concerns. Overall, 13/32 were at low risk of bias, 2/32 were at high risk of bias, and 16/32 were at unclear risk of bias.

The risk of bias for participants was relatively low, indicating that the selection and characteristics of participants are likely to be appropriate and unbiased. The predictors show a mixed risk profile, suggesting that certain predictors have a high risk of bias. The outcomes category also reflects a significant number of instances categorized as high risk, highlighting potential concerns regarding the reliability of the outcomes measured. The analysis itself appears to have a low risk of bias, indicating that the methods used for analysis are sound and reliable. The overall risk of bias is concerning, with a notable number of instances categorized as high risk, which could affect the validity of the findings.

3.2.2. Applicability Concern

Overall, 17 of the 32 included studies (53%) were rated as having low applicability concern, while 13 (41%) were rated as high concern and two (6%) as unclear. The data indicate a high level of applicability, despite a significant percentage of studies falling into the ‘High’ category for certain domains. As with the Participants domain, the Predictors domain demonstrates strong applicability, suggesting that the variables utilized in the analysis are relevant and well-suited to the clinical context. The Outcomes category shows a varied distribution, with some assessments falling into the ‘Unclear’ range, indicating potential uncertainty regarding the applicability of certain outcome measures. While overall applicability appears high, instances of unclear applicability suggest that further evaluation may be necessary.

The PROBAST + AI framework demonstrates strong applicability across most categories, particularly for participants and predictors. However, concerns regarding the outcomes and the overall risk of bias warrant further investigation.

3.3. Intervention Effects

The analysis includes various studies, each contributing to the overall understanding of AI performance relative to healthcare professionals. The forest plot visually summarizes the odds ratios (OR) and the corresponding 95% confidence intervals (CIs) for each study, along with an overall pooled effect estimate.

3.3.1. Artificial Intelligence vs. General Healthcare Professionals

Figure 4 shows the summary of the pooled odds ratio, OR = 1.51 (95% CI: 1.17–1.96; p = 0.002). Since the event represents the number of correct diagnostic classifications, this result indicates that artificial intelligence had significantly higher odds of making a correct diagnosis compared with general healthcare professionals across the included studies. The 95% prediction interval ranged from 0.58 to 3.95, suggesting that the effect may vary across future studies and clinical settings, potentially favoring either AI or healthcare professionals. Substantial heterogeneity was observed (Tau² = 0.22; Chi² = 78.62, df = 20, p < 0.00001; I² = 75%), indicating considerable variability between studies. Several studies favoring AI (OR > 1), including Boginskis 2023 [13], Faqar-Uz-Zaman 2022 [37], Gräf 2022 [26], Guermazi 2022 [40], Han 2022 [29], and Wang 2021 [22], showed high odds ratios, indicating strong support for AI’s effectiveness in those contexts. Studies favoring general HCPs (OR < 1) include David M. Levine 2024 [14], Keenan 2020 [44], and Lisa Herzog 2023 [32].
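A prediction interval of this kind can be approximated from the pooled estimate, its confidence interval, and Tau². The sketch below is illustrative only: the review does not state which formula was used, so a normal quantile (z = 1.96) is assumed here, and the function name `prediction_interval` is hypothetical. A variant using a t quantile with k − 2 degrees of freedom would give a somewhat wider interval.

```python
import math


def prediction_interval(or_pooled, ci_low, ci_high, tau2, z=1.96):
    """Approximate 95% prediction interval for an odds ratio.

    Works on the log scale: the standard error of the pooled log OR is
    recovered from the reported confidence interval, then combined with
    the between-study variance tau2. The normal quantile z is an assumption.
    """
    mu = math.log(or_pooled)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    half_width = z * math.sqrt(tau2 + se ** 2)
    return math.exp(mu - half_width), math.exp(mu + half_width)


# Values reported for AI vs. general HCPs: OR 1.51 (1.17 to 1.96), Tau2 = 0.22.
low, high = prediction_interval(1.51, 1.17, 1.96, 0.22)
print(f"Approximate 95% prediction interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the result lands close to the reported interval of 0.58 to 3.95; the small difference in the upper bound is consistent with rounding of the published inputs.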

3.3.2. Artificial Intelligence vs. Expert Healthcare Professionals

Figure 5 presents the findings of a comparison between artificial intelligence and expert healthcare professionals. The pooled estimate showed OR = 0.72 (95% CI: 0.25–2.07; p = 0.54). As the event represents correct diagnosis, an OR below one suggests that expert healthcare professionals demonstrated slightly higher diagnostic accuracy than AI, although the wide confidence interval and non-significant p-value indicate no statistically significant difference between AI and expert clinicians. Substantial heterogeneity was observed (Tau² = 1.75; Chi² = 115.53, df = 6, p < 0.00001; I² = 95%), suggesting variability across studies.

3.3.3. Artificial Intelligence vs. Non-Expert Healthcare Professionals

Figure 6 shows the comparison with non-expert healthcare professionals, which included outcomes from five studies (Choi 2020 [33]; Gan 2019 [17]; Homayounieh 2021 [21]; Rauschecker 2020 [15]; Twinprai 2022 [20]). The pooled result showed OR = 3.34 (95% CI: 1.13–9.86; p = 0.03). Because the event represents a correct diagnosis, the finding indicates that AI demonstrated significantly higher odds of correct diagnosis compared with non-expert healthcare professionals. However, very high heterogeneity was present (Tau² = 1.44; Chi² = 124.17, df = 4, p < 0.00001; I² = 97%), reflecting substantial variability across studies.
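The I² values reported in Sections 3.3.1 to 3.3.3 follow from the Chi² (Cochran's Q) statistics and their degrees of freedom through the standard relation I² = (Q − df)/Q. The short sketch below recomputes them as a consistency check; the function name `i_squared` is an illustrative choice.

```python
def i_squared(q, df):
    """Return I-squared (%) from Cochran's Q and its degrees of freedom."""
    if q <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100


# Chi-squared and df values as reported in the three comparisons.
reported = [
    ("AI vs. general HCPs", 78.62, 20),
    ("AI vs. expert HCPs", 115.53, 6),
    ("AI vs. non-expert HCPs", 124.17, 4),
]
for label, q, df in reported:
    print(f"{label}: I2 = {i_squared(q, df):.0f}%")
```

Rounded to whole percentages, these agree with the reported 75%, 95%, and 97%.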

3.3.4. Use of AI Reducing Work Burden

Figure 7 shows the effect of AI implementation on clinician workload. In this analysis, the event was defined as reporting a lower level of work burden or burnout when using AI-supported systems. The pooled result showed OR = 1.77 (95% CI: 1.40–2.24; p < 0.00001), indicating that AI-assisted workflows were associated with significantly greater odds of reduced workload or burnout compared with clinicians using non-AI-supported systems. No heterogeneity was observed (I² = 0%).

3.3.5. Diagnostic Accuracy of Different Specialties and Overall Accuracy

Subgroup analyses were performed across different clinical specialties (Figure 8). In this analysis, the event again represents a correct diagnosis. AI demonstrated significantly higher diagnostic accuracy in radiology (OR = 1.93, p = 0.002) and dermatology (OR = 1.57, p = 0.005) compared with general healthcare professionals. In cardiology, emergency medicine, neurology, and pathology, the pooled estimates were not statistically significant. In ophthalmology, the pooled estimate favored healthcare professionals (OR = 0.75, p = 0.01), indicating higher diagnostic accuracy among clinicians in that specialty. Overall, the combined analysis across specialties yielded OR = 1.51 (95% CI: 1.17–1.96; p = 0.002), suggesting that AI systems, on average, demonstrated higher odds of correct diagnostic classification compared with general healthcare professionals. The overall findings suggest that while AI can enhance effectiveness in some areas, its application should be carefully considered based on departmental needs and contexts.

The overall average diagnostic accuracy for AI vs. general HCPs is 81% vs. 71%, AI vs. non-expert HCPs is 95% vs. 82%, and between AI and expert HCPs is 91% vs. 86% (Table 4).

3.3.6. Interpretation of Sensitivity Analysis

Substantial heterogeneity was observed across the included studies, with an I² value of 74% (Figure 9). This indicates considerable variation among the studies in terms of the number of datasets used, study design, study populations, and reported outcomes. Therefore, a random-effects model was applied to account for the observed variability and to generate a more reliable pooled estimate. The Z-test demonstrated a statistically significant overall effect (p = 0.00001), and the results favored the diagnostic accuracy of artificial intelligence over general and non-expert healthcare professionals.

4. Discussion

This section interprets the key findings of the study in the context of the existing literature. It highlights the strengths, weaknesses, and limitations of the review, the implications of the results for clinical practice, and the potential role of AI in supporting healthcare professionals.

4.1. Summary of Evidence

Quantitative results were based on 32 studies that either compared diagnostic accuracy between artificial intelligence and healthcare professionals or evaluated the use of artificial intelligence to optimize physician tasks and reduce their burden.

Overall, AI models showed a higher pooled diagnostic accuracy (81%) than general HCPs (71%). Most individual studies favored AI: 19 out of 21 studies had an odds ratio (OR) greater than one, indicating higher odds of a correct diagnosis with AI. Some studies (Wang 2021 [22]; van Doorn 2021 [28]) reported high odds ratios (10.14 and 4.47, respectively), strongly favoring AI. However, other studies showed that expert physicians outperformed AI models (OR < 1), for example, in detecting pulmonary nodules on chest radiographs, in detecting arrhythmia and structural disease patterns, and on certain other diagnostic accuracy measures. AI systems are highly effective in fracture detection, providing support to radiologists and attending physicians (Boginskis 2023 [13]).

The implementation of AI can enhance diagnostic accuracy, reduce errors, and serve as a valuable tool for clinical decision-making. Four studies evaluated the use of artificial intelligence in optimizing physician tasks and reducing their burden (Garcia 2024 [30]; Misurac 2025 [36]; Olson 2025 [38]; Baek 2025 [41]). Their pooled result (OR = 1.77, 95% CI 1.40–2.24, p < 0.00001) indicated a positive relationship between AI technology and the reduction in administrative burden and burnout levels.

In addition to the pooled effect estimate, we calculated the 95% prediction interval (PI) to assess the potential range of effects that might be observed in future studies. Although the pooled analysis showed that artificial intelligence demonstrated higher diagnostic performance compared with healthcare professionals (OR = 1.57, 95% CI: 1.18–2.07), the 95% prediction interval ranged from 0.42 to 5.85. This wide interval reflects the substantial between-study heterogeneity (I² = 87%) and suggests that the true effect in a new study could vary considerably. In some settings, AI may perform substantially better than healthcare professionals, whereas in others, the performance may be comparable or even favor healthcare professionals. Therefore, these findings should be interpreted cautiously. Future well-designed studies across diverse clinical settings are needed to better define the conditions under which AI provides the greatest diagnostic benefit.
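A 95% prediction interval of this kind is commonly computed on the log scale as ln(OR) ± t(k−2) × sqrt(tau² + SE²), where tau² is the between-study variance and SE is the standard error of the pooled log OR. The sketch below applies that formula to the pooled OR and CI reported above. The values of tau² and the t critical value are assumptions: tau² is not reported in the text and is chosen here only for illustration, and t = 2.093 corresponds to 19 degrees of freedom (that is, assuming k = 21 studies).

```python
import math

def prediction_interval(pooled_or, ci_low, ci_high, tau2, t_crit, z_crit=1.959964):
    """Approximate 95% prediction interval for the OR in a new study."""
    # Standard error of the pooled log OR, recovered from the reported 95% CI
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    mu = math.log(pooled_or)
    return math.exp(mu - half_width), math.exp(mu + half_width)

# Reported: OR = 1.57 (95% CI: 1.18-2.07).
# Assumed (hypothetical): tau2 = 0.375; t_crit = 2.093 for 19 degrees of freedom.
pi_low, pi_high = prediction_interval(1.57, 1.18, 2.07, tau2=0.375, t_crit=2.093)
print(f"95% PI: {pi_low:.2f} to {pi_high:.2f}")
```

Because tau² enters the interval directly, the prediction interval is always at least as wide as the confidence interval, and it widens quickly as between-study heterogeneity grows.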

4.2. Agreements and Disagreements with Other Reviews

4.2.1. Strengths of the Review

Multiple major databases were searched, including Cochrane, PubMed, Embase, CINAHL, Web of Science, and Google Scholar, supporting the quality of the included studies. No initial restrictions were placed on study design, which maximized the capture of relevant evidence. Well-defined eligibility criteria focused on the diagnostic accuracy of AI vs. healthcare professionals (HCPs), and studies without comparison groups were specifically excluded. Use of a structured risk-of-bias tool (PROBAST + AI) improved methodological quality. A total of 32 studies across multiple specialties were included, and the diversity of clinical settings enhanced the generalizability of the findings. The review used random-effects meta-analysis (Mantel–Haenszel), which was appropriate given the heterogeneity, and provided pooled ORs, prediction intervals, and subgroup analyses. It differentiates AI vs. experts from AI vs. non-experts, which is an essential distinction. The integration of workload and burnout outcomes adds further value to the review. Findings consistently favor AI over general and non-expert HCPs.

4.2.2. Weaknesses of the Review

The weaknesses include a high proportion of studies originating from China, Korea, and Japan. Significant heterogeneity (I² = 75–78%) across studies reduces confidence in the pooled effects. Sources of heterogeneity (e.g., type of AI, diagnostic modality, training dataset size) were not fully explored. Mixed study designs affect validity, as combining RCTs, retrospective studies, and cross-sectional studies risks inconsistent findings. Although publication bias assessment is discouraged for diagnostic accuracy reviews, AI performance studies are prone to positive reporting bias. Workload evidence is underrepresented, as only four studies assessed workload. Treating multiple datasets or readers as single comparisons may artificially compress variance. AI models were included broadly without stratified discussion: GPT models, CNNs, ML algorithms, radiology-specific systems, and diagnostic apps operate differently, and mixing them obscures which types of AI are most effective. Results may also be overinterpreted: although the overall OR > 1 favors AI, the prediction intervals are wide (0.58–3.95), indicating highly variable real-world performance that warrants more cautious interpretation.

4.3. Limitations of Review

This review has several notable limitations. Significant heterogeneity existed among the included studies regarding design, sample size, AI models, and outcome measures (I² = 78%), which may affect the comparability and interpretation of pooled results. Restricting the search to English-language publications could have introduced language and publication bias. Many AI models lacked external validation or were evaluated on limited datasets, potentially limiting the generalizability of the findings. Additionally, variations in clinician expertise, healthcare settings, and patient populations may have influenced diagnostic performance comparisons. Finally, the review focused primarily on diagnostic accuracy and workload reduction, without assessing other critical dimensions such as cost-effectiveness, clinical workflow integration, and ethical implications.

4.4. Clinical and Research Implications

4.4.1. Implications for Practice

AI can support clinicians in detecting conditions more accurately, particularly in radiology, dermatology, cardiology, and emergency care. AI-assisted tools reduce time spent on repetitive tasks, enabling clinicians to focus on complex patient care. AI interventions can lower administrative workload, contributing to higher job satisfaction and staff retention. AI can serve as a decision-support system for trainees and non-specialist staff, improving clinical practice. Successful integration requires staff training, ongoing model validation, and clear guidelines to ensure safety, reliability, and ethical use.

4.4.2. Implications for Research

Future studies should adhere to established reporting guidelines for AI diagnostic studies (e.g., CONSORT-AI, SPIRIT-AI, STARD-AI, PROBAST + AI). Standardized reporting will improve reproducibility, reduce bias, and enable more precise meta-analyses. Many included studies were retrospective and based on secondary datasets. Prospective, real-world, multi-center studies are required to assess AI performance in clinically relevant settings. Most current datasets reflect limited demographic, ethnic, and geographic diversity. Studies should ensure equitable representation to avoid algorithmic bias and improve generalizability. Future trials should compare AI, clinicians, and combined AI–clinician workflows to determine which approach provides the highest accuracy and safety. Rather than comparing AI and humans as competitors, future research should assess how AI can support clinical decision-making and reduce workload without compromising patient safety. Studies should report detailed confusion matrices, thresholds, and performance metrics (sensitivity, specificity, AUC, PPV, and NPV) to enable accurate pooling of results. Many AI tools were validated on the same or similar datasets. Research should prioritize external validation on fully independent cohorts and evaluate performance drift over time. Very few studies assessed safety outcomes, user errors, over-reliance, or misinterpretation of AI recommendations. Future research should incorporate safety endpoints and workflow analyses. Studies should include economic evaluations, feasibility assessments, and implementation frameworks to understand practical adoption in healthcare systems. Evidence on AI reducing clinicians’ workload is still limited. Future trials should explicitly quantify time savings, workflow efficiency, burnout reduction, and task redistribution.
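To illustrate why reporting full confusion matrices matters for pooling, the following minimal sketch derives the standard threshold-dependent metrics from the four cell counts. The function name and the counts are hypothetical; AUC is omitted because it requires continuous scores across thresholds, not a single 2 × 2 table.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Threshold-dependent accuracy metrics from one 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts (illustration only)
metrics = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the raw counts available, a meta-analyst can recompute any of these metrics consistently across studies without relying on each study's choice of summary statistic.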

4.4.3. Certainty of Evidence and Generalizability

The certainty of evidence for the primary outcome, the superior diagnostic performance of AI over healthcare professionals (OR 1.51), was rated as moderate. While the large number of included studies and the significant Z-score (6.01, p < 0.00001) provide strength, the certainty was downgraded due to high statistical heterogeneity (I² = 87%) and the ‘Unclear’ risk of bias observed in the analysis domain of several studies. This suggests that while the trend is positive, future high-quality randomized controlled trials are needed to increase the precision and stability of these estimates.

A significant limitation identified in the current body of evidence is the geographical concentration of research. The majority of included studies were conducted in high-income regions with advanced digital healthcare infrastructures, such as North America, Europe, and parts of East Asia. This raises concerns regarding generalizability to underrepresented regions, including the Middle East, Africa, and Latin America. AI models trained predominantly on Western datasets may suffer from ‘algorithmic bias’ or reduced accuracy when applied to diverse populations with different genetic profiles, disease prevalence rates, or clinical settings. Future research must prioritize multi-center validation in these regions to ensure that AI-driven diagnostic tools are globally equitable and effective.

5. Conclusions

Based on the findings, artificial intelligence (AI) demonstrates a positive impact on diagnostic accuracy and contributes to reducing the workload of healthcare professionals. Nevertheless, some studies indicate that expert clinicians may still outperform AI in specific tasks, highlighting the complementary rather than replacement role of AI. Evidence suggests that integrating AI with healthcare professionals enhances diagnostic performance and improves the overall quality of care.

The authors conclude that while AI holds significant promise, its optimal utility lies in collaborative human–AI workflows, where clinical expertise and artificial intelligence are leveraged together. Future research conducted in real-world clinical settings is needed to strengthen the credibility, generalizability, and safety of AI applications. Looking ahead, the integration of AI in healthcare is likely to evolve toward more advanced hybrid decision-making models, enhanced interpretability of AI outputs, rigorous validation across diverse populations, and seamless incorporation into routine clinical workflows. Such advancements have the potential to deliver safer, more efficient, and equitable diagnostic care, reduce clinician burnout, and support evidence-based decision-making across healthcare systems.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/sci8040073/s1, File S1: Search strategy; File S2: Characteristics of included studies; File S3: Characteristics of excluded studies; File S4: Analysis.

Author Contributions

Conceptualization: P.K. and S.S.; methodology: P.K. and S.S.; software: not applicable (NA); validation: L.A.J. and D.R.A.II; formal analysis: P.K. and S.S.; investigation: P.K., L.S., and N.A.A.; resources: L.A.J.; data curation: P.K.; writing—original draft preparation: P.K. and D.R.A.II; writing—review and editing: L.A.J., D.R.A.II, L.S., and N.A.A.; visualization: P.K.; supervision: P.K.; project administration: P.K.; funding acquisition: N.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
HCP	Healthcare Professionals
DA	Diagnostic Accuracy
ML	Machine Learning
DL	Deep Learning
GPT	Generative Pre-Trained Transformer
ECG	Electrocardiogram
CNN	Convolutional Neural Network
AI-CDSS	Artificial Intelligence-Based Clinical Decision Support System
R-CNN	Region-based Convolutional Neural Network
ChatGPT	Chat Generative Pre-Trained Transformer

References

Staggering U.S. Misdiagnosis Statistics in Healthcare. Daniel Harwin. Freedland Harwin Valori. Available online: https://www.fhvlegal.com/blog/staggering-u-s-diagnostic-error-statistics-july-2024/ (accessed on 6 October 2025).
Wheeler, T.; Barnes, J.; Shurland, T.; Shukla, M.; Malhotra, R.; Wagh, M. How CFOs Can Help Transform Healthcare Organizations Amid an Uncertain Economic Environment. Deloitte Insights. Available online: https://www.deloitte.com/us/en/insights/industry/health-care/health-care-cfos-help-transform-organizations.html (accessed on 14 September 2023).
Shannon Germain Farraher. Generative AI Impact on Clinicians: Bringing the Fever Down. Forrester. Available online: https://www.forrester.com/blogs/generative-ai-impact-on-clinicians-bringing-the-fever-down/ (accessed on 16 August 2024).
Markets and Markets. Artificial Intelligence in Health Care Markets. Growth, Size, Share, Trends. Markets and Markets May-2025. Available online: https://www.marketsandmarkets.com/Market-Reports/artificial-intelligence-healthcare-market-54679303.html (accessed on 5 September 2025).
Wang, H.; Zu, Q.; Chen, J.; Yang, Z.; Ahmed, M.A. Application of Artificial Intelligence in Acute Coronary Syndrome: A Brief Literature Review. Adv. Ther. 2021, 38, 5078–5086.
Bidenko, N.V.; Stuchynska, N.V.; Palamarchuk, Y.V.; Matviienko, M.M. Integrating artificial intelligence in healthcare practice: Challenges and future prospects. Wiad. Lek. 2025, 78, 1199–1205.
Milan, T.; Jan, K. Artificial intelligence in healthcare. Klin. Mikrobiol. Infekcni Lek. 2025, 31, 22–26.
Zhang, H.; Zhao, H.; Fu, Y.; Ma, J.; Xiang, Y. The class labels and spatial information based fault diagnosis of air handling unit via combining kernel Fischer discriminant analysis with an improved graph convolutional neural network. Measurement 2026, 257, 118622.
Al-Obeidat, F.; Hafez, W.; Gador, M.; Ahmed, N.; Abdeljawad, M.M.; Yadav, A.; Rashed, A. Diagnostic performance of AI-based models versus physicians among patients with hepatocellular carcinoma: A systematic review and meta-analysis. Front. Artif. Intell. 2024, 7, 1398205.
Shen, J.; Zhang, C.J.P.; Jiang, B.; Chen, J.; Song, J.; Liu, Z.; He, Z.; Wong, S.Y.; Fang, P.H.; Ming, W.K. Artificial intelligence versus clinicians in disease diagnosis: Systematic review. JMIR Med. Inform. 2019, 7, e10010.
Takita, H.; Kabata, D.; Walston, S.L.; Tatekawa, H.; Saito, K.; Tsujimoto, Y.; Miki, Y.; Ueda, D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit. Med. 2025, 8, 175.
Page, M.J.; Moher, D.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 2021, 372, n160.
Boginskis, V.; Zadoroznijs, S.; Cernavska, I.; Beikmane, D.; Sauka, J. Artificial intelligence effectivity in fracture detection. Med. Perspect. 2023, 28, 68–78.
Levine, D.M.; Tuwani, R.; Kompa, B.; Varma, A.; Finlayson, S.G.; Mehrotra, A.; Beam, A. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: An observational study. Lancet Digit. Health 2024, 6, e555–e561.
Rauschecker, A.M.; Rudie, J.D.; Xie, L.; Wang, J.; Duong, M.T.; Botzolakis, E.J.; Kovalovich, A.M.; Egan, J.; Cook, T.C.; Bryan, R.N.; et al. Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology 2020, 295, 626–637.
Delshad, S.; Dontaraju, V.S.; Chengat, V. Artificial intelligence-based application provides accurate medical triage advice when compared to consensus decisions of healthcare providers. Cureus 2021, 13, e16956.
Gan, K.; Xu, D.; Lin, Y.; Shen, Y.; Zhang, T.; Hu, K.; Zhou, K.; Bi, M.; Pan, L.; Wu, W.; et al. Artificial intelligence detection of distal radius fractures: A comparison between the convolutional neural network and professional assessments. Acta Orthop. 2019, 90, 394–400.
Gottlieb, M.; Patel, D.; Viars, M.; Tsintolas, J.; Peksa, G.D.; Bailitz, J. Comparison of artificial intelligence versus real-time physician assessment of pulmonary edema with lung ultrasound. Am. J. Emerg. Med. 2023, 70, 109–112.
Cohen, M.; Puntonet, J.; Sanchez, J.; Kierszbaum, E.; Crema, M.; Soyer, P.; Dion, E. Artificial intelligence vs. radiologist: Accuracy of wrist fracture detection on radiographs. Eur. Radiol. 2023, 33, 3974–3983.
Twinprai, N.; Boonrod, A.; Boonrod, A.; Chindaprasirt, J.; Sirithanaphol, W.; Chindaprasirt, P.; Twinprai, P. Artificial intelligence (AI) vs. human in hip fracture detection. Heliyon 2022, 8, e11266.
Homayounieh, F.; Digumarthy, S.; Ebrahimian, S.; Rueckel, J.; Hoppe, B.F.; Sabel, B.O.; Conjeti, S.; Ridder, K.; Sistermanns, M.; Wang, L.; et al. An artificial intelligence–based chest X-ray model on human nodule detection accuracy from a multicenter study. JAMA Netw. Open 2021, 4, e2141096.
Wang, L.; Song, H.; Wang, M.; Wang, H.; Ge, R.; Shen, Y.; Yu, Y. Utilization of Ultrasonic Image Characteristics Combined with Endoscopic Detection on the Basis of Artificial Intelligence Algorithm in Diagnosis of Early Upper Gastrointestinal Cancer. J. Healthc. Eng. 2021, 2021, 2773022.
Tamai, K.; Terai, H.; Hoshino, M.; Tabuchi, H.; Kato, M.; Toyoda, H.; Suzuki, A.; Takahashi, S.; Yabu, A.; Sawada, Y.; et al. Deep Learning Algorithm for Identifying Cervical Cord Compression Due to Degenerative Canal Stenosis on Radiography. Spine 2023, 48, 519–525.
Harada, Y.; Katsukura, S.; Kawamura, R.; Shimizu, T. Efficacy of artificial-intelligence-driven differential-diagnosis list on the diagnostic accuracy of physicians: An open-label randomized controlled study. Int. J. Environ. Res. Public Health 2021, 18, 2086.
Liu, P.; Lu, L.; Chen, Y.; Huo, T.; Xue, M.; Wang, H.; Fang, Y.; Xie, Y.; Xie, M.; Ye, Z. Artificial intelligence to detect the femoral intertrochanteric fracture: The arrival of the intelligent-medicine era. Front. Bioeng. Biotechnol. 2022, 10, 927926.
Gräf, M.; Knitza, J.; Leipe, J.; Krusche, M.; Welcker, M.; Kuhn, S.; Mucke, J.; Hueber, A.J.; Hornig, J.; Klemm, P.; et al. Comparison of physician and artificial intelligence-based symptom checker diagnostic accuracy. Rheumatol. Int. 2022, 42, 2167–2176.
Liu, Y.; Liu, W.; Chen, H.; Xie, S.; Wang, C.; Liang, T.; Yu, Y.; Liu, X. Artificial intelligence versus radiologist in the accuracy of fracture detection based on computed tomography images: A multi-dimensional, multi-region analysis. Quant. Imaging Med. Surg. 2023, 13, 6424.
van Doorn, W.P.T.M.; Stassen, P.M.; Borggreve, H.F.; Schalkwijk, M.J.; Stoffers, J.; Bekers, O.; Meex, S.J.R. A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLoS ONE 2021, 16, e0245157.
Han, S.S.; Kim, Y.J.; Moon, I.J.; Jung, J.M.; Lee, M.Y.; Lee, W.J.; Won, C.H.; Lee, M.W.; Kim, S.H.; Navarrete-Dechent, C.; et al. Evaluation of Artificial Intelligence-Assisted Diagnosis of Skin Neoplasms: A Single-Center, Paralleled, Unmasked, Randomized Controlled Trial. J. Investig. Dermatol. 2022, 142, 2353–2362.e2.
Garcia, P.; Ma, S.P.; Shah, S.; Smith, M.; Jeong, Y.; Devon-Sand, A.; Tai-Seale, M.; Takazawa, K.; Clutter, D.; Vogt, K.; et al. Artificial intelligence–generated draft replies to patient inbox messages. JAMA Netw. Open 2024, 7, e243201.
Yamamura, Y.; Fujii, K.; Nakashima, C.; Otsuka, A. Evaluation of the accuracy of artificial intelligence (AI) models in dermatological diagnosis and comparison with dermatology specialists. Cureus 2025, 17, e77067.
Herzog, L.; Kook, L.; Hamann, J.; Globas, C.; Heldner, M.R.; Seiffge, D.; Antonenko, K.; Dobrocky, T.; Panos, L.; Kaesmacher, J.; et al. Deep Learning Versus Neurologists: Functional Outcome Prediction in LVO Stroke Patients Undergoing Mechanical Thrombectomy. Stroke 2023, 54, 7.
Choi, D.J.; Park, J.J.; Ali, T.; Lee, S. Artificial intelligence for the diagnosis of heart failure. npj Digit. Med. 2020, 3, 54.
Fonseca, Â.; Ferreira, A.; Ribeiro, L.; Moreira, S.; Duque, C. Embracing the future-is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision-making. Eur. J. Neurol. Off. J. Eur. Fed. Neurol. Soc. 2024, 31, e16195.
Çamkıran, V.; Tunç, H.; Achmar, B.; Ürker, T.S.; Kutlu, İ.; Torun, A. Artificial intelligence (ChatGPT) ready to evaluate ECG in real life? Not yet! Digit. Health 2025, 11, 20552076251325279.
Misurac, J.; Knake, L.A.; Blum, J.M. The effect of ambient artificial intelligence notes on provider burnout. Appl. Clin. Inform. 2025, 16, 252–258.
Faqar-Uz-Zaman, S.F.; Anantharajah, L.; Baumartz, P.; Sobotta, P.; Filmann, N.; Zmuc, D.; von Wagner, M.; Detemble, C.; Sliwinski, S.; Marschall, U.; et al. The Diagnostic Efficacy of an App-based Diagnostic Health Care Application in the Emergency Room: eRadaR-Trial. A prospective, Double-blinded, Observational Study. Ann. Surg. 2022, 276, 935–942.
Olson, K.D.; Meeker, D.; Troup, M.; Barker, T.D.; Nguyen, V.H.; Manders, J.B.; Stults, C.D.; Jones, V.G.; Shah, S.D.; Shah, T.; et al. Use of ambient AI scribes to reduce administrative burden and professional burnout. JAMA Netw. Open 2025, 8, e2534976.
Krusche, M.; Callhoff, J.; Knitza, J.; Ruffer, N. Diagnostic accuracy of a large language model in rheumatology: Comparison of physician and ChatGPT-4. Rheumatol. Int. 2024, 44, 303–306.
Guermazi, A.; Tannoury, C.; Kompel, A.J.; Murakami, A.M.; Ducarouge, A.; Gillibert, A.; Li, X.; Tournier, A.; Lahoud, Y.; Jarraya, M.; et al. Improving radiographic fracture recognition performance and efficiency using artificial intelligence. Radiology 2022, 302, 627–636.
Baek, G.; Cha, C. AI-Assisted Tailored Intervention for Nurse Burnout: A Three-Group Randomized Controlled Trial. Worldviews Evid.-Based Nurs. 2025, 22, e70003.
Lyons, R.J.; Arepalli, S.R.; Fromal, O.; Choi, J.D.; Jain, N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can. J. Ophthalmol. 2024, 59, e301–e308.
Luo, Y.; Zhang, Y.; Liu, M.; Lai, Y.; Liu, P.; Wang, Z.; Xing, T.; Huang, Y.; Li, Y.; Li, A.; et al. Artificial intelligence-assisted colonoscopy for detection of colon polyps: A prospective, randomized cohort study. J. Gastrointest. Surg. Off. J. Soc. Surg. Aliment. Tract 2021, 25, 2011–2018.
Keenan, T.D.L.; Clemons, T.E.; Domalpally, A.; Elman, M.J.; Havilio, M.; Agrón, E.; Benyamini, G.; Chew, E.Y. Retinal Specialist versus Artificial Intelligence Detection of Retinal Fluid from OCT: Age-Related Eye Disease Study 2: 10-Year Follow-On Study. Ophthalmology 2020, 128, 100–109.
Review Manager, Version 9.14.0; The Cochrane Collaboration: London, UK, 2025. Available online: https://revman.cochrane.org (accessed on 16 October 2025).
Chang, T.Y.; Chou, T.Y.; Jen, I.A.; Yuh, Y.S. Artificial intelligence algorithm improves radiologists’ bone age assessment accuracy. J. Chin. Med. Assoc. 2025, 88, 10–1097.
Edström, A.B.; Makouei, F.; Wennervaldt, K.; Lomholt, A.F.; Kaltoft, M.; Melchiors, J.; Hvilsom, G.B.; Bech, M.; Tolsgaard, M.; Todsen, T. Human-AI collaboration for ultrasound diagnosis of thyroid nodules: A clinical trial. Eur. Arch. Oto-Rhino-Laryngol. 2025, 282, 3221–3231.
Eng, D.K.; Khandwala, N.B.; Long, J.; Fefferman, N.R.; Lala, S.V.; Strubel, N.A.; Milla, S.S.; Filice, R.W.; Sharp, S.E.; Towbin, A.J.; et al. Artificial intelligence algorithm improves radiologist performance in skeletal age assessment: A prospective multicenter randomized controlled trial. Radiology 2021, 301, 692–699.
Geneş, M.; Deveci, B. A clinical evaluation of cardiovascular emergencies: A comparison of responses from ChatGPT, emergency physicians, and cardiologists. Diagnostics 2024, 14, 2731.
Gertz, R.J.; Dratsch, T.; Bunck, A.C.; Lennartz, S.; Iuga, A.I.; Hellmich, M.G.; Persigehl, T.; Pennig, L.; Gietzen, C.H.; Fervers, P.; et al. Potential of GPT-4 for detecting errors in radiology reports: Implications for reporting accuracy. Radiology 2024, 311, e232714.
Gottlieb, M.; Schraft, E.; O’Brien, J.; Patel, D. Diagnostic accuracy of artificial intelligence for identifying systolic and diastolic cardiac dysfunction in the emergency department. Am. J. Emerg. Med. 2024, 86, 115–119.
Hoppe, J.M.; Auer, M.K.; Strüven, A.; Massberg, S.; Stremmel, C. ChatGPT with GPT-4 outperforms emergency department physicians in diagnostic accuracy: Retrospective analysis. J. Med. Internet Res. 2024, 26, e56110.
Janik, M.; Raad, G.; Nijmeh, G.; O’Steen, M.; Rasmussen, J. Diagnostic accuracy for detecting atrial fibrillation using a novel machine learning algorithm in a blood pressure monitor. Heart Rhythm. 2024, 21, 2023–2027.
Jiao, C.; Rosas, E.; Asadigandomani, H.; Delsoz, M.; Madadi, Y.; Raja, H.; Munir, W.M.; Tamm, B.; Mehravaran, S.; Djalilian, A.R.; et al. Diagnostic performance of publicly available large language models in corneal diseases: A comparison with human specialists. Diagnostics 2025, 15, 1221.
Johnson, S.; Kantartjis, M.; Severson, J.; Dorsey, R.; Adams, J.L.; Kangarloo, T.; Kostrzebski, M.A.; Best, A.; Merickel, M.; Amato, D.; et al. Wearable sensor-based assessments for remotely screening early-stage Parkinson’s disease. Sensors 2024, 24, 5637.
Kücking, F.; Hübner, U.H.; Busch, D. Diagnostic accuracy differences in detecting wound maceration between humans and artificial intelligence: The role of human expertise revisited. J. Am. Med. Inform. Assoc. JAMIA 2025, 32, 1425–1433.
Li, T.; Liu, Y.; Guo, J.; Wang, Y. Prediction of the activity of Crohn’s disease based on CT radiomics combined with machine learning models. J. X-Ray Sci. Technol. 2022, 30, 1155–1168.
Luna, A.; Casertano, L.; Timmerberg, J.; O’Neil, M.; Machowsky, J.; Leu, C.-S.; Lin, J.; Fang, Z.; Douglas, W.; Agrawal, S. Artificial intelligence application versus physical therapist for squat evaluation: A randomized controlled trial. Sci. Rep. 2021, 11, 18109.
Park, A.; Chute, C.; Rajpurkar, P.; Lou, J.; Ball, R.L.; Shpanskaya, K.; Jabarkheel, R.; Kim, L.H.; McKenna, E.; Tseng, J.; et al. Deep learning–assisted diagnosis of cerebral aneurysms using the HeadXNet model. JAMA Netw. Open 2019, 2, e195600.
Pluym, I.D.; Afshar, Y.; Holliman, K.; Kwan, L.; Bolagani, A.; Mok, T.; Silver, B.; Ramirez, E.; Han, C.S.; Platt, L.D. Accuracy of automated three-dimensional ultrasound imaging technique for fetal head biometry. Ultrasound Obstet. Gynecol. 2021, 57, 798–803.
Prinster, D.; Mahmood, A.; Saria, S.; Jeudy, J.; Lin, C.T.; Yi, P.H.; Huang, C.M. Care to explain? AI explanation types differentially impact chest radiograph diagnostic performance and physician trust in AI. Radiology 2024, 313, e233261.
Richard, C.; Schriger, D.; Weingrow, D. Rapid Electroencephalography and Artificial Intelligence in the Detection and Management of Nonconvulsive Seizures. Ann. Emerg. Med. 2024, 84, 422–427.
Surya, J.; Garima Pandy, N.; Hyungtaek Rim, T.; Lee, G.; Priya, M.N.S.; Subramanian, B.; Raman, R. Efficacy of deep learning-based artificial intelligence models in screening and referring patients with diabetic retinopathy and glaucoma. Indian J. Ophthalmol. 2023, 71, 3039–3045.
Taloni, A.; Borselli, M.; Scarsi, V.; Rossi, C.; Coco, G.; Scorcia, V.; Giannaccare, G. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci. Rep. 2023, 13, 18562.
Turan, E.İ.; Baydemir, A.E.; Balıtatlı, A.B.; Şahin, A.S. Assessing the accuracy of ChatGPT in interpreting blood gas analysis results ChatGPT-4 in blood gas analysis. J. Clin. Anesth. 2025, 102, 111787.
Yang, C.; Zhao, H.; Wang, A.; Li, J.; Gao, J. Comparison of lung ultrasound assisted by artificial intelligence to radiology examination in pneumothorax. J. Clin. Ultrasound 2024, 52, 1051–1055.
Yu, G.; Liu, X.; Li, Y.; Zhang, Y.; Yan, R.; Zhu, L.; Wang, Z. The nomograms for predicting overall and cancer-specific survival in elderly patients with early-stage lung cancer: A population-based study using SEER database. Front. Public Health 2022, 10, 946299.

Figure 1. PRISMA flow diagram.

Figure 2. Yearly distribution of included studies.

Figure 3. Risk of bias and applicability concerns summary: methodological quality assessment of the included studies across four domains (Participants, Predictor, Outcomes, and Analysis). Green indicates low risk, yellow indicates unclear risk, and red indicates high risk or applicability concerns.

Figure 4. Forest plot 1.1 (comparing overall diagnostic accuracy between artificial intelligence and general HCPs) demonstrates that AI outperformed general HCPs with a pooled odds ratio (OR) of 1.51 (95% CI: 1.17–1.96), a statistically significant difference in favor of AI (p = 0.002).

Figure 5. Forest plot 2.1 (comparing diagnostic accuracy between AI and experts) shows no statistically significant difference (p = 0.54) in performance between AI and expert HCPs. The pooled odds ratio (OR) was 0.72 (95% CI: 0.25–2.07); the point estimate favors experts, but the confidence interval includes 1, so neither group can be said to outperform the other.

Figure 6. Forest plot 3.1 (comparing diagnostic accuracy between AI and non-experts) demonstrates a significant (p = 0.03) performance advantage for AI over non-expert HCPs. The pooled odds ratio (OR) was 3.34 (95% CI: 1.13–9.86).

Figure 7. Forest plot 5.1 (analysis on reduction in the level of workload with and without using AI) shows that a high level of burden (burnout) was significantly lower among HCPs using AI compared with those not using AI (OR = 1.77, 95% CI 1.40–2.24, p < 0.00001).

Figure 8. Forest plot 4.1 (comparing diagnostic accuracy between artificial intelligence and physicians across different specialties) shows that AI significantly outperformed general HCPs in radiology (p = 0.002) and dermatology (p = 0.005). In contrast, AI performance was significantly lower than that of HCPs in ophthalmology (p = 0.01).

Figure 9. Forest plot 6.1 (sensitivity analysis for AI vs. HCPs performance).

Table 1. The comparison between artificial intelligence and healthcare professionals.

AI	HCPs	AI	HCPs
GLEAMER BoneView-AI v2.0.2a [13]	Radiologist	GPT-3 [14]	General Physician
AI Software [15]	Radiologist	Maya-MD [16]	General Physician
CNNs [17]	Radiologist	AI Software [18]	General Physician
AI Software [19]	Radiologist	AI Software [20]	General Physician
AI Software [21]	Radiologist	Cascade-RCNN [22]	General Physician
Deep Learning [23]	Radiologist	AI Software [24]	General Physician
Faster-RCNN algorithm [25]	Radiologist	Ada App [26]	General Physician
AI Software [27]	Radiologist	Machine Learning [28]	General Physician
AI Software [29]	Dermatologist	AI Software [30]	General Physician
ChatGPT-4o, Claude 3.5, and Gemini 1.5 Pro [31]	Dermatologist	Deep Learning [32]	Neurologist
AI-CDSS [33]	Heart Specialist	GPT-3.5 [34]	Neurologist
GPT-ECG Reader, Analyzer, Interpreter [35]	Cardiologist	AI Software [36]	HealthCare Practitioner
Ada App [37]	ER Physician	AI Software [38]	HealthCare Practitioner
CHATGPT-4 [39]	Rheumatologist	Detectron2 [40]	HealthCare Practitioner
AI-assisted tailored intervention [41]	Nurses	GPT-4 [42]	Ophthalmology trainees
AI Software [43]	Endoscopists	Notal OCT Analyzer [44]	Retinal Specialist

AI: artificial intelligence; HCP: healthcare professional; GP: general physician; OT: ophthalmology trainee; ML: machine learning; GPT: generative pre-trained transformer; RCNN: region-based convolutional neural network; CHATGPT: chat generative pre-trained transformer; RS: retinal specialist; AI-CDSS: artificial intelligence-based clinical decision support system; NOA: Notal OCT Analyzer; GPT-ECG: generative pre-trained transformer-electrocardiogram.

Table 2. Summary of the design of included studies.

Reference	Year	Place	Specialty	Same Dataset for AI and HCP	Comparison Type	AI Model
Radiology
Boginskis 2023 [13]	2023	Europe	Radiology	Yes	Paired	GLEAMER BoneView
Cohen 2023 [19]	2023	France	Radiology	Yes	Paired	AI Software
Gan 2019 [17]	2019	China	Radiology	Yes	Paired	CNN
Guermazi 2022 [40]	2022	Boston	Radiology	Yes	Paired	Detectron2
Homayounieh 2021 [21]	2021	USA	Radiology	Yes	Paired	AI Software
Liu 2022 [25]	2022	China	Radiology	Yes	Paired	Faster-RCNN algorithm
Liu 2023 [27]	2023	China	Radiology	Yes	Paired	AI Software
Luo 2021 [43]	2021	China	Radiology	Yes	Paired	AI Software
Michael Gottlieb 2024 [18]	2024	USA	Radiology	Yes	Paired	AI Software
Rauschecker 2020 [15]	2020	San Francisco	Radiology	Yes	Paired	AI Software
Tamai 2023 [23]	2023	Japan	Radiology	Yes	Paired	Deep Learning
Twinprai 2022 [20]	2022	Thailand	Radiology	Yes	Paired	AI Software
Wang 2021 [22]	2021	China	Radiology	Yes	Paired	Cascade-RCNN
Cardiology
Camkıran 2025 [35]	2025	Turkey	Cardiology	Yes	Paired	GPT
Choi 2020 [33]	2020	Korea	Cardiology	Yes	Paired	AI-CDSS
Graf 2022 [26]	2022	Germany	Cardiology	Yes	Paired	Ada App
Krusche 2024 [39]	2024	Germany	Cardiology	Yes	Paired	CHATGPT-4
Emergency
David M. Levine 2024 [14]	2024	USA	Emergency	No	Unpaired	GPT-3
Delshad 2021 [16]	2021	USA	Emergency	Yes	Paired	Maya-MD
Faqar-Uz-Zaman 2022 [37]	2022	Germany	Emergency	Yes	Paired	Ada App
Van Doorn 2021 [28]	2021	Netherlands	Emergency	Yes	Paired	Machine Learning
Neurology
Fonseca 2024 [34]	2024	Portugal	Neurology	Yes	Paired	GPT-3.5
Lisa Herzog 2023 [32]	2023	Switzerland	Neurology	Yes	Paired	Deep Learning
Dermatology
Han 2022 [29]	2022	Korea	Dermatology	Yes	Paired	AI Software
Yamamura 2025 [31]	2025	Japan	Dermatology	Yes	Paired	ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro
Pathology
Harada 2021 [24]	2021	Japan	Pathology	Yes	Paired	AI Software
Ophthalmology
Keenan 2020 [44]	2020	Maryland	Ophthalmology	Yes	Paired	Notal OCT Analyzer (NOA)
Lyons 2024 [42]	2024	Atlanta	Ophthalmology	Yes	Paired	GPT-4
Burnout
Baek 2025 [41]	2025	South Korea	Burnout			AI-assisted tailored intervention
Gracia 2024 [30]	2024	California	Burnout			AI Software
Misurac 2025 [36]	2025	USA	Burnout			AI Software
Olson 2025 [38]	2025	Connecticut	Burnout			AI Software

AI: artificial intelligence; CNN: convolutional neural networks; AI-CDSS: artificial intelligence-based clinical decision support system; DL: deep learning; GPT: generative pre-trained transformer; RCNN: region-based convolutional neural network; ML: machine learning; CHATGPT: chat generative pre-trained transformer.

Table 3. Summary of correct diagnosis by artificial intelligence and healthcare professionals of included studies.

Reference	Samples	Correct Diagnosis by AI and HCP
		AI	Expert	Non-Expert	General
Boginskis 2023 [13]	Radiographs-100	85			78
Cohen 2023 [19]	Radiographs-318	285			273
Gan 2019 [17]	Radiographs-2340	2246	1989	2223
Guermazi 2022 [40]	Radiographs-480	451			422
Liu 2022 [25]	Radiographs-57	50	47
Liu 2023 [27]	Radiographs-191	165			159
Twinprai 2022 [20]	Radiographs-1000	950	960	820
Krusche 2024 [39]	Patients-100	60			65
Homayounieh 2021 [21]	Patients-100	80	86	86
Luo 2021 [43]	Patients-150	59			51
Michael Gottlieb 2024 [18]	Patients-71	69			68
Rauschecker 2020 [15]	Patients-86	78	73	48
Tamai 2023 [23]	Patients-42	34			28
Choi 2020 [33]	Patients-1198	1174	1198	910
Wang 2021 [22]	Patients-80	71			35
Faqar-Uz-Zaman 2022 [37]	Patients-450	391			364
Van Doorn 2021 [28]	Patients-100	92			72
Lisa Herzog 2023 [32]	Patients-50	36			32
Han 2022 [29]	Patients-295	159			124
Yamamura 2025 [31]	Patients-30	21			20
Graf 2022 [26]	Vignettes-132	93			70
David M. Levine 2024 [14]	Vignettes-48	42			46
Delshad 2021 [16]	Vignettes-50	44			39
Fonseca 2024 [34]	Vignettes-188	134			130
Harada 2021 [24]	Vignettes-16	9			9
Lyons 2024 [42]	Vignettes-44	41			42
Camkıran 2025 [35]	ECG-107	63	93
Keenan 2020 [44]	Eyes-1127	913			958

AI: artificial intelligence; HCP: healthcare professional; ECG: electrocardiogram.
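
As an illustration of how the counts in Table 3 translate into an accuracy and an odds ratio for a single study, the following Python sketch works through the Boginskis 2023 [13] row (100 radiographs, 85 correct by AI, 78 correct by general HCPs). It is not the authors' analysis code. The function names `accuracy` and `odds_ratio_ci` are hypothetical, the confidence interval uses the standard log (Woolf) method, and the two reader groups are treated as independent samples, which is a simplifying assumption given that Table 2 lists most comparisons as paired.

```python
import math


def accuracy(correct, total):
    """Proportion of cases diagnosed correctly."""
    return correct / total


def odds_ratio_ci(ai_correct, hcp_correct, n, z=1.96):
    """Odds ratio (AI vs. HCP) with a Woolf-type confidence interval.

    Assumes both groups read the same number of cases `n` and treats the
    groups as independent (an assumption made for this illustration).
    """
    a, b = ai_correct, n - ai_correct      # AI: correct, incorrect
    c, d = hcp_correct, n - hcp_correct    # HCP: correct, incorrect
    if 0 in (a, b, c, d):
        # Continuity correction so the odds ratio stays finite.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Boginskis 2023 [13]: 100 radiographs, AI 85 correct, general HCPs 78 correct.
print(f"AI accuracy:  {accuracy(85, 100):.0%}")
print(f"HCP accuracy: {accuracy(78, 100):.0%}")

# OR = (85 * 22) / (15 * 78), roughly 1.60
or_, lo, hi = odds_ratio_ci(85, 78, 100)
print(f"OR = {or_:.2f} (95% CI: {lo:.2f} to {hi:.2f})")
```

Under these assumptions the interval for this single study includes 1, which is one reason the review relies on pooled estimates across studies rather than on individual comparisons.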

Table 4. Overall diagnostic accuracy for artificial intelligence vs. general HCPs vs. expert HCPs vs. non-expert HCPs.

Reference	Diagnostic Accuracy
	A	B	C	D
Boginskis 2023 [13]	85%			78%
Cohen 2023 [19]	83%			76%
Gan 2019 [17]	96%	85%	95%
Guermazi 2022 [40]	86%			78%
Liu 2022 [25]	88%	84%
Liu 2023 [27]	86%			71%
Twinprai 2022 [20]	95%	96%	82%
Krusche 2024 [39]	60%			55%
Homayounieh 2021 [21]	80%	86%	86%
Luo 2021 [43]	39%			34%
Michael Gottlieb 2024 [18]	97%			96%
Rauschecker 2020 [15]	91%	86%	56%
Tamai 2023 [23]	81%			66%
Choi 2020 [33]	98%	100%	76%
Wang 2021 [22]	89%			44%
Faqar-Uz-Zaman 2022 [37]	52%			81%
Van Doorn 2021 [28]	80%			73%
Lisa Herzog 2023 [32]	72%			64%
Han 2022 [29]	54%			44%
Yamamura 2025 [31]	70%			65%
Graf 2022 [26]	70%			54%
David M. Levine 2024 [14]	88%			96%
Delshad 2021 [16]	92%			80%
Fonseca 2024 [34]	71%			69%
Harada 2021 [24]	57%			56%
Lyons 2024 [42]	93%			95%
Camkıran 2025 [35]	63%	93%
Keenan 2020 [44]	85%			81%

A: artificial intelligence; B: expert healthcare professionals; C: non-expert healthcare professionals; D: general healthcare professionals.
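
The pooled odds ratios quoted in the forest-plot captions combine study-level estimates into one summary value. The sketch below shows one common way to do this, a DerSimonian-Laird random-effects model on log odds ratios. This technique is chosen here for illustration and the exact pooling procedure used in the review may differ. The function names `log_or_and_var` and `pool_random_effects` are hypothetical, the three input rows are taken from the AI and General columns of Table 3, and the groups are again treated as independent, so the output is not expected to reproduce the published pooled OR of 1.51.

```python
import math


def log_or_and_var(ai_correct, hcp_correct, n):
    """Log odds ratio (AI vs. HCP) and its approximate variance."""
    a, b = ai_correct, n - ai_correct
    c, d = hcp_correct, n - hcp_correct
    if 0 in (a, b, c, d):
        # Continuity correction for zero cells.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool_random_effects(studies, z=1.96):
    """DerSimonian-Laird pooled odds ratio.

    `studies` is a list of (ai_correct, hcp_correct, n) tuples.
    Returns (pooled OR, lower CI bound, upper CI bound, tau squared).
    """
    ys, vs = zip(*(log_or_and_var(*s) for s in studies))
    w = [1 / v for v in vs]                       # fixed-effect weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))
    df = len(ys) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if df > 0 and c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in vs]         # random-effects weights
    y_re = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return math.exp(y_re), math.exp(y_re - z * se), math.exp(y_re + z * se), tau2


# Three rows from Table 3 (AI correct, general HCP correct, sample size).
example = [
    (85, 78, 100),    # Boginskis 2023 [13]
    (451, 422, 480),  # Guermazi 2022 [40]
    (92, 72, 100),    # Van Doorn 2021 [28]
]
pooled, lo, hi, tau2 = pool_random_effects(example)
print(f"Pooled OR = {pooled:.2f} (95% CI: {lo:.2f} to {hi:.2f}), tau^2 = {tau2:.3f}")
```

A pooled OR above 1 with a confidence interval that excludes 1 would favor AI; an interval that includes 1, as in the expert comparison of Figure 5, indicates no statistically significant difference.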

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kumar, P.; Alnaimi, N.A.; Soman, S.; Suansing, L.; Ryan Arriola, D., II; Jamea, L.A. Meta-Analysis on Comparison of Diagnostic Accuracy Between Artificial Intelligence and Healthcare Professionals. Sci 2026, 8, 73. https://doi.org/10.3390/sci8040073

