Diagnostic Accuracy of the Artificial Intelligence Methods in Medical Imaging for Pulmonary Tuberculosis: A Systematic Review and Meta-Analysis

Tuberculosis (TB) remains one of the leading causes of death among infectious diseases worldwide. Early screening and diagnosis of pulmonary tuberculosis (PTB) are crucial for TB control and stand to benefit from artificial intelligence. Here, we aimed to evaluate the diagnostic efficacy of a variety of artificial intelligence methods applied to medical imaging for PTB. We searched MEDLINE and Embase through the OVID platform to identify trials published up to November 2022 that evaluated the effectiveness of artificial-intelligence-based software in medical imaging of patients with PTB. After data extraction, the quality of the studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool. Pooled sensitivity and specificity were estimated using a bivariate random-effects model. In total, 3987 references were initially identified and 61 studies were finally included, covering 124,959 individuals. The pooled sensitivity and specificity were 91% (95% confidence interval (CI), 89-93%) and 65% (54-75%), respectively, in clinical trials, and 94% (89-96%) and 95% (91-97%), respectively, in model-development studies. These findings demonstrate that artificial-intelligence-based software could serve as an accurate tool for diagnosing PTB in medical imaging. However, standardized reporting guidance for AI-specific trials and multicenter clinical trials is urgently needed to truly translate this cutting-edge technology into clinical practice.


Introduction
Tuberculosis (TB) is one of the major communicable diseases that seriously endanger human health, primarily in developing countries [1], and at least 5.8 million people were estimated to have contracted tuberculosis in 2020. However, around one-sixth of people with active tuberculosis go undetected or officially unreported each year, which may delay progress toward eliminating the disease by 2035 [2]. Timely diagnosis and treatment could benefit a wide range of tuberculosis patients and minimize transmission of the pathogen in the population.
Mycobacterium tuberculosis culture on solid and/or liquid media is still the gold standard for diagnosis. However, the efficiency of culture-based diagnosis in clinical practice is diminished by long turnaround times and a lack of laboratory infrastructure, especially in resource-limited countries. To address this, the Xpert MTB/RIF assay, a semiautomated rapid molecular method that detects Mycobacterium tuberculosis DNA and resistance to rifampicin [3], has emerged as a mature tool in many countries heavily burdened by TB. Nevertheless, the application of such rapid tests remains far too limited, with only 1.9 million (33%) people receiving one as an initial diagnostic test in 2022. Simultaneously, the World Health Organization (WHO) has recommended using chest X-ray (CXR) images as a screening technique to better target individuals needing further microbiological testing, which has proven relatively easy to operate, low-cost, and highly sensitive [4]. However, accurate diagnosis with CXRs depends heavily on the clinical experience of radiologists, which poses a huge challenge in the aforementioned countries. As such, there has been increasing interest in using artificial-intelligence-based (AI-based) software in medical imaging for pulmonary tuberculosis (PTB) detection, improving diagnostic accuracy and reducing costs at the same time. Currently, more than 40 AI-based software programs certified for CXR or computed tomography (CT) examination are available, of which only five are certified for CXR tuberculosis detection [5]. In 2021, Creswell and colleagues tested the five certified software programs (CAD4TB (v6), InferRead® DR (v2), Lunit INSIGHT CXR (v4.9.0), JF CXR-1 (v2), and qXR (v3)) on cohorts in Bangladesh and found that AI-based software significantly outperformed radiologists in TB detection [6].
However, poor reporting and wide variations in design and methodology limit the reliable interpretation of reported diagnostic accuracy [7]. Furthermore, systematic reviews [8,9] of the diagnostic accuracy of this software also identified several limitations in the available evidence, and uncertainty remains regarding its performance in PTB diagnosis.
Hence, we conducted a systematic review and meta-analysis to synthesize evidence of the accuracy of AI-based software in medical imaging for PTB and to provide new insights for future research.

Data Source and Search Strategy
A MEDLINE and Embase search through the OVID platform was performed for records published up to November 2022, without any country restriction. The search terms combined four concepts: 'artificial intelligence' (deep learning, machine learning, computer assisted, or cnn), 'imaging' (radiography, computed tomography, CT, photograph, or X-ray), 'diagnostic accuracy metrics' (sensitivity or specificity), and 'pulmonary tuberculosis' (tuberculosis or tb). The full search strategy is laid out in Supplementary Materials File S2. This systematic review was registered in PROSPERO under the number CRD42022379114 and followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Supplementary Materials File S3).

Study Selection
Two researchers independently assessed candidate studies for inclusion by screening titles and abstracts, followed by the full text. Any discrepancy between the two researchers was resolved by a third researcher to reach a consensus. Studies meeting the following criteria were included: (1) any study that analyzed medical imaging for PTB diagnosis with AI-based software; (2) studies that provided raw diagnostic accuracy data: sensitivity, specificity, area under the curve (AUC), accuracy, negative predictive values (NPVs), or positive predictive values (PPVs). Studies were excluded if they met any of the following criteria: (1) case reports, conference reports, reviews, meta-analyses, abstracts without full articles, commentaries/editorials, mathematical modeling studies, and economic analyses; (2) studies that investigated the accuracy of image segmentation or disease prediction; (3) triage studies; (4) studies without outcomes or separate data; (5) studies that failed to report the source of the included population.

Data Extraction
Two researchers independently extracted demographic and diagnostic-accuracy data from the included studies using a standardized extraction form. When disagreements could not be resolved, we consulted a third researcher. Extracted data included study characteristics (first author name, country, year, study design, patient selection methods), demographic information (gender, age, human immunodeficiency virus (HIV) status, drug resistance, history of TB, treatment, imaging modality), AI-based software descriptions (type of artificial intelligence, model, data set, validation methods, threshold score), reference standards, and diagnostic accuracy measures (true and false positives and negatives (TP, FP, FN, TN), AUC, accuracy, sensitivity, specificity, PPV, NPV, and other reported metrics). If more than one accuracy data set was reported for the same software under otherwise identical conditions except for the threshold, the data set with the highest summed sensitivity and specificity was extracted.
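The rule for choosing among multiple reported operating points can be sketched as follows. This is an illustrative snippet, not code from the study; the reported sensitivity/specificity pairs are made up, and the selection simply maximizes the sum of sensitivity and specificity, as described above.

```python
# Illustrative sketch (not from the study's code): when a study reports
# several operating points for the same software at different thresholds,
# extract the one with the highest summed sensitivity and specificity.

def pick_operating_point(points):
    """points: list of (sensitivity, specificity) tuples."""
    return max(points, key=lambda p: p[0] + p[1])

# Hypothetical accuracy data reported at three different thresholds:
reported = [(0.95, 0.60), (0.91, 0.70), (0.82, 0.78)]
print(pick_operating_point(reported))  # -> (0.91, 0.70): 0.91 + 0.70 = 1.61 is the largest sum
```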

Quality Assessment
The risk of bias and applicability concerns of the included studies were assessed by two researchers separately, with a revised tool developed for diagnostic studies: QUADAS-2. Any disagreement between the two researchers was resolved through discussion with a third researcher.

Data Synthesis and Analysis
Data from development studies and clinical studies were analyzed separately. We first obtained the accuracy data corresponding to TP, FP, FN, and TN in each included study and calculated the estimated pooled sensitivity, specificity, and AUC with their 95% CIs, using bivariate random-effects models. Additionally, forest plots of sensitivity and specificity were generated for each study. We also used the model to create a summary receiver operating characteristic (SROC) curve. The I² index was used to assess heterogeneity between the studies; values greater than 50% were indicative of substantial heterogeneity [10]. We subsequently chose different study designs, software, reference standards, and AI types as potential sources of heterogeneity, using subgroup analyses to explore the results. A sensitivity analysis was also performed to assess the robustness of the results and identify possible sources of heterogeneity. According to the PRISMA-DTA statement, neither a systematic review nor a meta-analysis of diagnostic accuracy studies is required to assess publication bias [11]. Analyses were conducted in Review Manager version 5.7 and Stata version 17.0 (Stata Corp., College Station, TX, USA), with the midas and metaninf command packages.
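As a rough illustration of the pooling idea: the analysis itself used a bivariate random-effects model (Stata's midas), which jointly models sensitivity and specificity, but the core logic of random-effects pooling and I² estimation can be sketched with a simplified univariate DerSimonian-Laird procedure on the logit scale. The 2x2 counts below are hypothetical, and this sketch is a pedagogical stand-in, not a replacement for the bivariate model.

```python
# Simplified, illustrative random-effects pooling of sensitivities on the
# logit scale (univariate DerSimonian-Laird), with a basic I^2 estimate.
# The review's actual analysis used a bivariate model via Stata's midas.
import math

def pool_logit(events, totals):
    """Pool proportions (e.g. sensitivities = TP / (TP + FN)) across studies."""
    y, v = [], []
    for e, n in zip(events, totals):
        e, n = e + 0.5, n + 1.0                 # continuity correction
        p = e / n
        y.append(math.log(p / (1 - p)))          # logit transform
        v.append(1 / e + 1 / (n - e))            # variance of the logit
    w = [1 / vi for vi in v]                     # fixed-effect weights
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0   # I^2 (%)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]         # random-effects weights
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    pooled = 1 / (1 + math.exp(-y_re))           # back-transform to a proportion
    return pooled, i2

# Hypothetical TP counts and (TP + FN) totals from three studies:
sens, i2 = pool_logit([90, 180, 45], [100, 200, 60])
print(round(sens, 3), round(i2, 1))
```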
Within the model-development studies, thirty reported diagnostic accuracy for PTB identification with deep-learning-based algorithms, compared with eight studies [34,35,43,50-53,67] that used machine-learning models. Altogether, twenty-seven out of thirty-eight of the available studies were based on public data sets. Several data sets (Montgomery (NIH), Shenzhen (NIH), and Belarus) were analyzed in most studies, but data set demographic details were not described in most of the studies. Only one article explicitly described the use of semiautomatic lesion delineation for training data. To validate model performance, nine studies [44,46,48,49,59-61,68,70] validated their algorithms on external data, while the remainder implemented only internal validation. Considering the economics of practical use, thirty-two out of the thirty-eight studies used CXRs as the diagnostic modality, with CT-based approaches remaining to be further developed. In addition, eleven studies [36,39-42,48,52,53,57,60,70] made all of the code used in their implementations freely available to the public. As an important step in the radiomic pipeline, feature extraction played a decisive role in the whole process. Hogeweg, L. et al. [53] combined the results of shape analysis, texture analysis, and focal lesion detection into one combined TB score.

Quality Assessment of Studies
The overall results of the methodological-quality assessment of the included clinical and development studies are summarized respectively in Figures 2 and 3. For clinical studies, the main sources of bias included index tests, flow, and timing. Most development studies were classified as high-risk, particularly with deficiencies in their methods of patient selection, the reference standards used, and their index tests. For the patient-selection domain, a high or unclear risk of bias was observed in 84% (thirty-two out of thirty-eight) of the development studies, which was mainly related to missing information in the CXR/CT databases. For the index test, a prespecified threshold was reported only in 30% (seven out of twenty-three) of the clinical studies, and 18% (seven out of thirty-eight) of the development studies had a prespecified threshold, while the other studies had a high risk of bias, since the threshold was determined after the analysis in each. For the reference standard domain, a high or unclear risk of bias was seen in 76% (twenty-nine out of thirty-eight) of the development studies, with regards to assessment by radiologists as the reference standard. For flow and timing, there was a high or unclear risk of bias in 39% (nine out of twenty-three) of the clinical studies and 50% (nineteen out of thirty-eight) of the development studies due to the inconsistency of the reference standards and a lack of inclusion of all patients.

Diagnostic Accuracy Reported in AI-Based Software Assay for PTB
We found that only 13 development studies reported TP, FP, FN, and TN for index tests. Of all the 38 articles that included accuracy assessments, the sensitivity ranged from 0.580 to 0.993 and the specificity from 0.570 to 0.996. It is worth noting that CT showed a higher sensitivity in AI-assisted diagnosis (0.750-0.993 for CT vs. 0.580-0.993 for CXR). The reported performance is summarized in Figure 4. The pooled sensitivity of all included studies was 94% (95% CI 89-96%), with I² = 93.22 (95% CI 91.07-95.37), and the pooled specificity was 95% (95% CI 91-97%), with I² = 97.52 (95% CI 96.94-98.09). After excluding the CT-based study, we obtained pooled sensitivity and specificity values of 93% (95% CI 87-96%) and 94% (95% CI 90-97%), respectively.

There was significant heterogeneity in both sensitivity and specificity. We also constructed SROC curves and calculated the AUC for the included studies. The overall diagnostic performance of the clinical studies and the development studies was comparable [AUC 0.91 (95% CI 0.89-0.94) and 0.98 (95% CI 0.97-0.99), respectively] (Supplementary Materials File S1).
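The AUCs reported above derive from the bivariate-model SROC curve. Purely as a crude illustration of summarizing (1 - specificity, sensitivity) operating points into a single area, the snippet below computes a trapezoidal area over hypothetical points; it is not the model-based SROC used in the analysis.

```python
# Crude illustration only: trapezoidal area under a piecewise-linear curve
# through hypothetical (1 - specificity, sensitivity) operating points,
# anchored at (0, 0) and (1, 1). The review's AUCs come from a
# bivariate-model SROC curve, which this sketch does not reproduce.

def trapezoid_auc(points):
    """points: (fpr, tpr) pairs; returns area under the piecewise-linear curve."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0      # trapezoid rule per segment
    return area

# Three hypothetical operating points with high sensitivity at low FPR:
print(round(trapezoid_auc([(0.05, 0.80), (0.20, 0.92), (0.40, 0.97)]), 3))  # -> 0.929
```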

Subgroup and Sensitivity Analyses
Considering the variability of the methods and models tested in the development studies, we only performed a subgroup analysis in the clinical studies, based on predefined parameters, including study design, software, reference standard, and AI type. Some studies were excluded from the relevant subgroup analyses due to missing information or not being categorized into specific groups.

Comparing study designs, the pooled specificity was 48% (95% CI 34-62%, I² = 99.87; 99.86-99.88) in prospective studies versus 75% (95% CI 53-89%, I² = 99.94; 99.93-99.94) in nonprospective studies. When Xpert MTB/RIF was used as the reference standard, the pooled specificity [36% (95% CI 24-50%, I² = 99.93; 99.93-99.94)] was much lower than that of studies that used human readers [90% (95% CI 80-95%, I² = 98.70; 98.32-99.08)]. Furthermore, the sensitivity and specificity of the various AI-based software programs (CAD4TB, qXR, Lunit INSIGHT CXR) differed markedly. The results of the subgroup analyses are summarized in detail in Table 2. A substantial level of heterogeneity remained within each subgroup. We subsequently performed sensitivity analyses on the clinical and development studies, respectively; the results are provided in Supplementary Materials File S1. In the clinical studies, we found three articles that had a large influence on the overall results. Even after removing these articles, heterogeneity remained high (I² = 92.97, 91.55-94.39 for sensitivity; I² = 99.83, 99.82-99.84 for specificity).
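The leave-one-out logic behind such a sensitivity analysis (cf. Stata's metaninf command) can be sketched as follows. The per-study estimates and the simple-mean pooler are hypothetical stand-ins for illustration; the actual analysis re-pooled with the random-effects model.

```python
# Illustrative leave-one-out sensitivity analysis: re-pool after dropping
# each study in turn and report how far the pooled estimate shifts.
# `pool` can be any function mapping a list of study estimates to a summary.

def leave_one_out(estimates, pool):
    full = pool(estimates)
    results = []
    for i in range(len(estimates)):
        pooled = pool(estimates[:i] + estimates[i + 1:])
        results.append((i, pooled, pooled - full))   # (dropped index, new pooled, shift)
    return results

# Hypothetical per-study sensitivities, pooled here with a simple mean;
# the outlying 0.58 study should produce the largest shift when dropped.
studies = [0.91, 0.93, 0.90, 0.58, 0.94]
for idx, pooled, shift in leave_one_out(studies, lambda xs: sum(xs) / len(xs)):
    print(idx, round(pooled, 3), round(shift, 3))
```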

Discussion
This study sought to (1) evaluate the diagnostic efficacy of AI-based software for PTB and (2) describe the study characteristics and evaluate the methodology and quality of reporting of AI-based software for PTB diagnosis, as well as provide advice for future software development and clinical applications. The meta-analysis demonstrated that AI-based software has high accuracy in both clinical applications and development studies, indicating that it can assist physicians in improving the accuracy of PTB diagnosis. However, because of the high heterogeneity and variability between studies, the results must be treated with caution when the output of AI-based software is used as a reference standard.
In this systematic review and meta-analysis, we included 23 clinical studies and 38 development studies of PTB diagnosis. Because some studies had missing data, 13 development studies and 23 clinical studies were ultimately eligible for quantitative synthesis. Our results show that AI-based software has an excellent ability to diagnose PTB in medical imaging, with pooled sensitivities greater than 0.9 [clinical studies: 91% (95% CI 89-93%); development studies: 94% (95% CI 89-96%)]. Additionally, the pooled specificity of the software in the clinical studies was only modest [65% (95% CI 54-75%)], while that in the development studies was relatively high [95% (95% CI 91-97%)], which may have been caused by the use of the same test-data set for diagnostic performance assessment. However, a high level of heterogeneity was observed in all the results. Subgroup analysis revealed that nonprospective studies had significantly higher specificity and lower sensitivity than prospective studies, which might have been due to the inclusion of already-identified PTB patients in the nonprospective studies. Additionally, studies that used Xpert MTB/RIF as a reference standard had much lower specificity than studies that used human readers, possibly because human readers were weaker than Xpert MTB/RIF at correctly identifying negative patients. Furthermore, all commercially available software (CAD4TB, Lunit INSIGHT CXR, and qXR) showed advantages in improving diagnostic accuracy, but we found evident differences in sensitivity and specificity among the various AI-based software programs. The level of heterogeneity within the subgroups remained high, suggesting that study design, software type, AI type, and reference standard might not be the sources of heterogeneity. Our follow-up sensitivity analysis indicated that the type of medical imaging might be a source of heterogeneity, as CT could offer enhanced sensitivity [72].
A number of methodological limitations in the existing evidence were identified, as were study-level factors associated with the reported accuracy, which should all be taken into consideration.
In the development studies, most of the current AI-based software was developed for CXR, and only six studies applied AI to CT. Because of insufficient accuracy data, we performed no subgroup analysis of CT versus CXR. In addition, specific accuracy results, threshold establishment, and inclusion criteria may not have been described well enough to allow replication for further comparison, which may cause greater clinical and methodological heterogeneity. A large proportion of the articles used human readers as the reference standard, which may lead to systematic overestimation of the diagnostic accuracy of the software. Furthermore, the lack of external validation made it very difficult to formally evaluate algorithm performance. Although most of the experiments used publicly available data sets for model training, few fully disclosed their model details and code. In addition, almost all of the development articles used manually delineated lesion data sets. Semiautomated approaches are known to have greater advantages in lesion delineation, as has been demonstrated for other lung diseases [73], so we encourage more studies to adopt this approach in the future. Together, the issues mentioned above make the reproducibility of these experiments difficult to guarantee. Much of the existing work focuses on multiparametric classification models, ignoring the influence of individual features. Accumulating evidence has confirmed the important role of individual features in discriminating benign from malignant lung lesions [74,75]; this has great potential for improving accuracy and disease identification, and is also informative for research on automated classification models for PTB.
All of the clinical studies evaluated commercially available software developed for CXR. A total of 11 software types were tested, but the version and threshold reported varied among studies. There were varying methodologies of threshold determination and population inclusion, potentially resulting in a high level of heterogeneity. It is worth noting that 13 articles also compared the diagnostic accuracy of AI-based software with human clinicians, which would provide a more objective criterion allowing for a better comparison of models between studies.
Our study had several limitations. Although we searched the relevant literature as comprehensively as possible, some studies might have been missed. In addition, some studies failed to report demographic information in detail, so the corresponding subgroup analyses could not be performed. Furthermore, the limited number of studies for each software version precluded further analysis by version. When AI-based software was used to diagnose PTB, there was significant heterogeneity among studies, so it is difficult to determine whether the software is clinically applicable. Lastly, because current clinical software requires the inclusion of patients over 15 years of age, the diagnostic efficiency for children remains to be determined.
To improve the future clinical applicability of AI-based software, we recommend that studies report demographic information in detail, and we hope that existing reporting guidelines for diagnostic accuracy studies (STARD) [76] and prediction models (TRIPOD) [77] will soon incorporate AI-specific amendments. In addition, some model training and validation was performed on CXRs from the same data sets or sites, potentially resulting in an overestimation of diagnostic power. As such, we suggest that different data sets be used for model training and testing. Moreover, research teams can collaborate with multiple clinical centers on clinical trials and external validation to strengthen their results and investigate the stability and heterogeneity of their performance in clinical scenarios. Furthermore, we call for large, open, multi-source, and anonymized databases, along with detailed reporting of all of the information needed, such as reference standard, age, HIV status, etc., to fulfill the need for an adequate amount of high-quality data. At the same time, we recommend that development studies make their model details and all of the code used in their experiments freely available to the public so that these studies can be reproduced. It is also noteworthy that the diagnostic accuracy of AI-based software should be evaluated against a microbiological reference standard. Lastly, we found a lack of AI-based software for CT, and more studies may be needed to explore its potential in early diagnosis of PTB. In addition, the influence of parameters such as intensity quantization, on imaging and final diagnosis in particular, could be considered.

Conclusions
In summary, AI-based software achieved relatively high pooled sensitivity and specificity, indicating that it has the potential to facilitate diagnosis of PTB in medical imaging, especially in large-scale screening. However, heterogeneity was significantly high, and extensive variation in reporting, design, and methodology was observed. Thus, standardized reporting guidance for AI-specific trials and multicenter clinical trials is urgently needed to confirm the stability of such software across populations and settings. In the future, we expect more AI-based software with high accuracy to be comprehensively applied for early clinical detection of PTB.
Author Contributions: Conceptualization and design, C.W. and B.Y.; data curation and data analysis, Y.Z., Y.W. and W.Z.; manuscript editing and manuscript review, C.W., B.Y., Y.Z., Y.W. and W.Z. All authors have read and agreed to the published version of the manuscript.