Clinical Validation of a Deep Learning Algorithm for Detection of Pneumonia on Chest Radiographs in Emergency Department Patients with Acute Febrile Respiratory Illness

Early identification of pneumonia is essential in patients with acute febrile respiratory illness (FRI). We evaluated the performance and added value of a commercial deep learning (DL) algorithm in detecting pneumonia on chest radiographs (CRs) of patients visiting the emergency department (ED) with acute FRI. This single-centre, retrospective study included 377 consecutive patients who visited the ED between August 2018 and January 2019 and their 387 CRs. The performance of the DL algorithm in detecting pneumonia on CRs was evaluated using the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). Three ED physicians independently reviewed the CRs for pneumonia in an observer performance test and re-evaluated them with the assistance of the algorithm eight weeks later. AUROC, sensitivity, and specificity were compared between “DL algorithm” vs. “physicians-only” and between “physicians-only” vs. “physicians aided with the algorithm”. Among 377 patients, 83 (22.0%) had pneumonia. AUROC, sensitivity, specificity, PPV, and NPV of the algorithm for detection of pneumonia on CRs were 0.861, 58.3%, 94.4%, 74.2%, and 89.1%, respectively. For the detection of ‘visible pneumonia on CR’ (60 CRs from 59 patients), AUROC, sensitivity, specificity, PPV, and NPV were 0.940, 81.7%, 94.4%, 74.2%, and 96.3%, respectively. In the observer performance test, the algorithm performed better than the physicians for pneumonia (AUROC, 0.861 vs. 0.788, p = 0.017; specificity, 94.4% vs. 88.7%, p < 0.0001) and visible pneumonia (AUROC, 0.940 vs. 0.871, p = 0.007; sensitivity, 81.7% vs. 73.9%, p = 0.034; specificity, 94.4% vs. 88.7%, p < 0.0001). Detection of pneumonia (sensitivity, 82.2% vs. 53.2%, p = 0.008; specificity, 98.1% vs. 88.7%; p < 0.0001) and ‘visible pneumonia’ (sensitivity, 82.2% vs. 73.9%, p = 0.014; specificity, 98.1% vs. 88.7%, p < 0.0001) significantly improved when the algorithm was used by the physicians. Mean total reading time for the physicians decreased from 165 to 101 min with the assistance of the algorithm. Thus, the DL algorithm detected pneumonia, particularly visible pneumonia on CR, more accurately than the ED physicians and improved diagnosis by ED physicians in patients with acute FRI.


Introduction
Acute respiratory infections (ARIs), which typically present as acute febrile respiratory illnesses (FRIs), cause approximately 4 million deaths worldwide each year [1]. In addition, ARIs were the second most common reason for emergency department (ED) visits in the United States in 2014 (18.2 per 1000 persons) [2], and chest radiographs (CRs) have been the first-line imaging modality for diagnosing or excluding pneumonia [3]. Diagnosing pneumonia in ARI patients is very important because simple upper respiratory infections are usually self-limiting, whereas pneumonia without appropriate treatment can lead to respiratory failure and intensive care unit admission [4]. However, it is challenging for ED physicians to distinguish pneumonia from simple upper respiratory tract infections, mainly because CRs are difficult to interpret. Several previous reports show substantial discrepancies in CR interpretation between ED physicians and expert radiologists [5][6][7][8]. Unfortunately, it is not always possible to have full-time expert radiologists in every ED, especially on nights and weekends. Furthermore, CR interpretation in the ED should be timely for patient management [9], which is often difficult to achieve in practice.
Recently, deep-learning (DL) technology has been successfully applied in the medical field, particularly for the analysis of medical images [10] such as retinal photographs [11,12], pathology slides [13], and radiology images [14,15]. Hwang et al. developed and validated a DL algorithm for detection of major thoracic diseases including pneumonia on CRs [16], and it demonstrated excellent diagnostic performance with conveniently-collected datasets, surpassing expert radiologists. However, whether the DL algorithm can improve the CR interpretation of physicians in real-world clinical settings remains to be seen.
The purpose of our study was to evaluate the performance and added value of a commercially-available DL algorithm for detecting pneumonia on CRs from ED patients with acute FRI.

Materials and Methods
This retrospective study was approved by the ethics committee of the Armed Forces Medical Command (AFMC-18028-IRB-18-025), which waived the requirement for patients' informed consent.

Patients and CR Collection
A total of 377 consecutive patients (375 men and 2 women; median age, 20.0 years; interquartile range, 20.0-21.0) with acute FRI (a new or worsening episode of cough and a fever of 38 °C or higher in the previous 24 h) who underwent chest radiography (387 CRs) in the ED of a tertiary military hospital in South Korea from August 2018 to January 2019 were studied. Among these 377 patients (387 CRs), 218 patients (222 CRs) also underwent chest computed tomography (CT) within 24 h of the CRs. One author (J.H.K., with 6 years' experience in CR interpretation) retrospectively reviewed all available medical records to select patients with acute FRI and to identify the available CRs and chest CT images of these patients.
All acute FRI patients in the present study underwent posteroanterior chest radiographs, acquired with a single dedicated radiography unit (GC85A, Samsung Healthcare, Seoul, Korea).

Laboratory Testing and Pathogen Detection
Bacterial culture was performed with the use of standard techniques on sputum samples. In addition, a real-time polymerase chain reaction (RT-PCR) assay was performed on throat swabs for the detection of adenovirus, influenza A and B viruses, human metapneumovirus (HMPV), parainfluenza virus types 1, 2, and 3, respiratory syncytial virus (RSV) A and B, human rhinovirus A, coronaviruses 229E, OC43, and NL63, human bocavirus 1/2/3/4, and human enterovirus. A bacterial pathogen was considered to be present if Gram-positive or Gram-negative bacteria were detected in the sputum sample in the culture. A viral pathogen was considered to be present if the RT-PCR assay for the virus tested positive.

DL Algorithm
We utilised a commercially available DL algorithm (Lunit INSIGHT for Chest Radiography, version 4.7.2; Lunit; accessible at https://insight.lunit.io). The algorithm was developed to detect major thoracic diseases including pulmonary malignancy, active pulmonary tuberculosis, pneumonia, and pneumothorax. It was developed with an image database consisting of 54,221 normal CRs and 35,613 CRs with major thoracic diseases (prevalence, 39.6%) [16]. The algorithm provides a probability score between 0 and 1 for the presence of the aforementioned thoracic diseases and creates a heat map of the input CR to facilitate the localisation of the lesion. Of the two predefined cut-off values of the probability score (high-sensitivity and high-specificity cut-offs), we used the high-sensitivity cut-off (probability score of 0.16) for the binary classification of pneumonia in the present study. Although a high-sensitivity cut-off yields more false-positive results, which could lead to unnecessary antibiotic use, we chose it because maintaining a high sensitivity is more important than a high specificity in clinical practice, especially in the ED.
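As a rough illustration of the binary classification step, the sketch below dichotomises a probability score at the high-sensitivity cut-off of 0.16 and re-derives the 39.6% prevalence of the development database from the two counts given above. The function name, the sample scores, and the choice to treat a score exactly equal to the cut-off as positive are assumptions made for illustration; they do not describe the vendor's software.

```python
# Minimal sketch: dichotomise the algorithm's probability score.
# The cut-off value comes from the text; all names and sample scores are hypothetical.
HIGH_SENSITIVITY_CUTOFF = 0.16


def classify_cr(probability_score: float,
                cutoff: float = HIGH_SENSITIVITY_CUTOFF) -> bool:
    """Return True (algorithm-positive) when the score reaches the cut-off.

    Assumption: a score exactly equal to the cut-off counts as positive.
    """
    if not 0.0 <= probability_score <= 1.0:
        raise ValueError("probability score must lie between 0 and 1")
    return probability_score >= cutoff


# Hypothetical scores for four CRs.
example_scores = [0.03, 0.16, 0.42, 0.91]
example_calls = [classify_cr(score) for score in example_scores]

# Prevalence of abnormal CRs in the development database (counts from the text).
normal_crs = 54_221
abnormal_crs = 35_613
development_prevalence_pct = round(100 * abnormal_crs / (normal_crs + abnormal_crs), 1)
```

Lowering the cut-off moves more CRs into the positive class, which is the trade-off between sensitivity and false-positive calls described in the text.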

Reference Standards
The diagnosis of pneumonia in the present study was based on clinical, microbiological, and radiological information. Three radiologists (J.H.K., J.Y.K., and K.H.K., each with 5-8 years' experience in CR interpretation) independently determined whether patients had radiological evidence of pneumonia or not by retrospective review of CRs, and/or CT imaging along with any available clinical information and laboratory tests.
In addition, patients were classified as having "visible pneumonia on CR" if radiologists identified consolidation or other infiltration (linear or patchy alveolar or interstitial densities) on CR. Therefore, patients with evidence of pneumonia on CT scans but not CRs were excluded from "visible pneumonia on CR". In case of discordant interpretation among the three radiologists, they re-evaluated the CRs and/or CTs, and came to a consensus.
Evaluation of the lesion localisation accuracy was done by a board-certified radiologist (J.H.K.), who reviewed all heat map images and determined if the DL algorithm was correct. Classifications made by the DL algorithm were only considered correct when the lesion locations were accurate.

Observer Performance Assessment
ED CRs were routinely read by physicians (board-certified internists) in our hospital; therefore, we decided to conduct an observer performance test for ED physicians to simulate clinical practice. The performance assessment included 2 sessions, and in both, the observers read the CRs in the radiologist's reading room with a high-resolution radiology monitor (MS53i2; Totoku, Tokyo, Japan) without any time limit. In session 1, three ED physicians with 6-7 years of experience in interpretation of ED CRs were asked to independently grade all the CRs on a 5-point scale for the presence of pneumonia, as follows: 1 = definitely normal, 2 = probably normal, 3 = indeterminate, 4 = probably pneumonia, and 5 = definitely pneumonia. The physicians were aware that each patient had acute FRI, and that the CRs were acquired for that purpose. Eight weeks after session 1, the three physicians independently reassessed every CR with the assistance of the DL algorithm to assign a grade (according to the 5-point scale) corresponding to the presence of pneumonia (second session). The probability scores of pneumonia and heat map images of the DL algorithm were provided on each CR interpretation in session 2.
The total observer reading time at each session was recorded.

Statistical Analysis
We calculated the diagnostic performance of the DL algorithm and the physicians for the following two tasks: (a) detection of pneumonia on CRs irrespective of its visibility on CRs, and (b) detection of visible pneumonia on CR.
Receiver operating characteristic curves were constructed, and the areas under the curves (AUROCs) were calculated with 95% confidence intervals (CIs) by using the method of DeLong et al. [17]. The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the DL algorithm were calculated at the high-sensitivity cut-off value (probability score of 0.16). Observer scores ≥3 were regarded as positive; this threshold was chosen by maximizing the F1 score on the pooled session 1 data of the three observers. The McNemar test was used to compare the sensitivity and specificity of the different methods.
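The two rating-based steps described above can be sketched as follows: an empirical AUROC computed by pairwise comparison of positive and negative cases, and a search for the 5-point cut-off that maximises F1 on pooled ratings. This is only an illustrative sketch under stated assumptions. The ratings and labels are invented, the function names are hypothetical, and the DeLong confidence interval used in the study is not implemented here.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Empirical AUROC: share of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def f1_at_threshold(labels: Sequence[int], grades: Sequence[int],
                    threshold: int) -> float:
    """F1 score when grades >= threshold are read as 'pneumonia'."""
    tp = sum(1 for y, g in zip(labels, grades) if y == 1 and g >= threshold)
    fp = sum(1 for y, g in zip(labels, grades) if y == 0 and g >= threshold)
    fn = sum(1 for y, g in zip(labels, grades) if y == 1 and g < threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(labels: Sequence[int], grades: Sequence[int],
                   candidates: Sequence[int] = (2, 3, 4, 5)) -> int:
    """Candidate cut-off with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, grades, t))


# Hypothetical pooled data: 1 = pneumonia, 0 = no pneumonia; grades on the 5-point scale.
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
demo_grades = [5, 4, 3, 2, 3, 2, 2, 1, 1, 1]

demo_auroc = auroc(demo_labels, demo_grades)
demo_cutoff = best_threshold(demo_labels, demo_grades)
```

With these invented ratings the F1-optimal cut-off happens to be 3, but that is a property of the toy data and says nothing about the study's observers.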
To evaluate clinical characteristics data, distribution normality was assessed using the Kolmogorov-Smirnov test. Non-normally distributed data were presented as median (interquartile range) and categorical variables as frequency (%). Differences between pneumonia and non-pneumonia groups were analyzed by Fisher's exact test (for categorical data) or Mann-Whitney U test (for continuous data, but not normally distributed).
Statistical analyses were performed with statistical software (MedCalc, version 19.0.3; MedCalc Software, Mariakerke, Belgium). p values were two-sided, and p < 0.05 indicated a statistically significant difference.
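For readers unfamiliar with the McNemar test used for the paired comparisons, the sketch below computes a two-sided exact (binomial) p-value from the two discordant counts. It is an assumption that the exact form matches what the statistical software computed, since the text does not say which variant was used; the function name and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: paired cases called positive by method A only.
    c: paired cases called positive by method B only.
    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical example: 3 cases detected only without assistance,
# 12 detected only with assistance.
example_p = mcnemar_exact(3, 12)
```

Only the discordant pairs enter the calculation; cases on which both methods agree do not affect the p-value.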

Results
The clinical characteristics of acute FRI patients are summarized in Table 1.
Table 1 footnotes: Data are median (IQR) or n (%). NA = not available. * Difference between pneumonia and non-pneumonia groups.

Performance Comparison between Deep-Learning Algorithm and Physicians
There was a statistically significant difference between the AUROC of the DL algorithm and the pooled AUROC of the three observers for the detection of pneumonia (0.861 vs. 0.788 [95% CI: 0.763-0.811]; p = 0.017) (Figure 2, Table 2). The specificity of the algorithm was significantly higher than that of the observers (94.4% vs. 88.7%; p < 0.0001); the algorithm's sensitivity was also higher, but the difference did not reach statistical significance (58.3% vs. 53.2%; p = 0.053) (Table 2).
The diagnostic performances of the algorithm and the individual physicians are summarized in Tables 2 and 3.
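As a worked check of how the algorithm's sensitivity, specificity, PPV, and NPV relate to one another, the sketch below computes them from 2 × 2 counts. The counts are an assumption: they were back-calculated so that they reproduce the percentages reported for the algorithm in the abstract (387 CRs in total for any pneumonia, with CRs of pneumonia not visible on CR left out of the second task), and they are not taken from the study's tables. The class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2 x 2 table of algorithm calls against the reference standard."""
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def summary_pct(self) -> tuple:
        """(sensitivity, specificity, PPV, NPV) as percentages, one decimal."""
        values = (self.sensitivity(), self.specificity(), self.ppv(), self.npv())
        return tuple(round(100 * v, 1) for v in values)


# ASSUMED counts, back-calculated from the reported percentages (not from the tables).
any_pneumonia = Confusion(tp=49, fn=35, fp=17, tn=286)      # 387 CRs
visible_pneumonia = Confusion(tp=49, fn=11, fp=17, tn=286)  # 60 positive CRs

any_summary = any_pneumonia.summary_pct()
visible_summary = visible_pneumonia.summary_pct()
```

Under these assumed counts the true positives, false positives, and true negatives are identical in both tasks, so specificity and PPV do not change; only the number of false negatives differs, which moves sensitivity and NPV.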

Performance Comparison between Physicians-only and Physicians Aided by the Algorithm
With regard to the detection of pneumonia, the AUROC of physicians assisted by the algorithm was higher than that of physicians-only (0.816 [95% CI: 0.793-0.838] vs. 0.788), but the difference was not statistically significant (p = 0.068) (Figure 2, Table 2). The pooled sensitivity and specificity of physicians assisted by the algorithm were significantly higher than those of physicians-only (sensitivity, 0.599 [95% CI: 0.536-0.660] vs. 0.532, p = 0.008; specificity, 0.981 [95% CI: 0.970-0.989] vs. 0.887, p < 0.0001).
The mean total reading time of the physicians decreased by 64 min, from 165 to 101 min, with the assistance of the algorithm (Table 2).

Discussion
In the present study, the DL algorithm demonstrated fair diagnostic performance in detecting pneumonia on CRs (AUROC, 0.861) in a consecutive cohort of patients with acute FRI. However, the sensitivity of the DL algorithm was only 58.3%, a result that is largely explained by the fact that 24 of the 83 pneumonia patients had pneumonia that was not visible on the concurrent CRs. With respect to detecting 'visible pneumonia on CR', the DL algorithm demonstrated excellent diagnostic performance (AUROC, 0.936). This is comparable to the diagnostic performance of thoracic radiologists in detecting major thoracic diseases in a previous study (AUROC, 0.932) [16] and to the diagnostic performance of the DL algorithm (AUROC, 0.95) in detecting clinically relevant abnormalities in the ED of a general tertiary hospital [9]. In addition, the algorithm improved the diagnostic performance of ED physicians on CRs: for detecting pneumonia, the pooled sensitivity and specificity of the ED physicians improved significantly with the assistance of the algorithm, and for detecting 'visible pneumonia on CR', the pooled AUROC, sensitivity, and specificity were all significantly enhanced. These results are similar to those of previous studies [9,16].
Interestingly, the DL algorithm not only improved the diagnostic performance, but also substantially reduced the time ED physicians needed to interpret the CRs (mean total reading time, 165 min vs. 101 min). Consequently, with the assistance of the DL algorithm, ED physicians could detect pneumonia on CRs more quickly and accurately. Furthermore, if the DL algorithm provisionally analyzed ED CRs and an alerting system flagged clinically critical or relevant diseases, ED physicians could prioritize CRs with a high probability score for clinically relevant abnormalities (such as pneumonia). This would shorten the turnaround time from acquisition to interpretation and enable timely treatment of these patients. Therefore, we believe that the DL algorithm in the present study could improve the quality of pneumonia care in patients with acute FRIs such as COVID-19 [18] by improving the diagnostic accuracy and reducing the time to diagnosis.

The performance-boosting effect of the algorithm varied across the three ED physicians. The specificity of detection of pneumonia and 'visible pneumonia on CR' improved significantly in observers 1 and 2, but not in observer 3, with the assistance of the DL algorithm. In addition, for the detection of 'visible pneumonia on CR', only observer 1 showed a significant improvement in AUROC and only observer 3 showed a significant improvement in sensitivity with the assistance of the DL algorithm. Although the diagnostic performance of ED physicians generally improved and the reading time decreased with the DL algorithm, the variability in the effectiveness of assistance across individual physicians should be considered when using it in clinical practice.
It is noteworthy that the DL algorithm showed several unexpected false-positives. Specifically, the algorithm misinterpreted the radio-opaque letters on a shirt (n = 6) and abdominal shield (n = 1) as abnormal lesions, which would have been easily ignored by physicians; none of the observers considered those foreign materials as abnormal lesions. Physicians should be aware of this problem when utilising the DL algorithm in their clinical practice and the developers should correct this shortcoming.
The present study had several limitations. First, the majority of our study cohort consisted of young men without underlying disease (375 men and 2 women; median age, 20 years [interquartile range, 20.0-21.0]); moreover, military hospital patients constitute a specialised population. Further investigations are needed to validate the diagnostic performance of the DL algorithm in acute FRI patients from the general population. Second, we performed the observer performance assessment with only three physicians. Because the effect of DL algorithm assistance varied between observers, further performance tests with more observers are needed to validate the results of the present study.
In conclusion, the DL algorithm showed fair diagnostic performance for detecting pneumonia, particularly visible pneumonia on CR, and improved the diagnostic performance of ED physicians in patients with acute FRI.