Diagnostic Performance in Differentiating COVID-19 from Other Viral Pneumonias on CT Imaging: Multi-Reader Analysis Compared with an Artificial Intelligence-Based Model

Growing evidence suggests that artificial intelligence tools could help radiologists differentiate COVID-19 pneumonia from other types of viral (non-COVID-19) pneumonia. To test this hypothesis, an R-AI classifier capable of discriminating between COVID-19 and non-COVID-19 pneumonia was developed using chest CT scans of 1031 patients with a positive swab for SARS-CoV-2 (n = 647) or other respiratory viruses (n = 384). The model was trained with 811 CT scans, while 220 CT scans (n = 151 COVID-19; n = 69 non-COVID-19) were used for independent validation. Four readers were enrolled to blindly evaluate the validation dataset using the CO-RADS score. A pandemic-like high suspicion scenario (CO-RADS 3 considered as COVID-19) and a low suspicion scenario (CO-RADS 3 considered as non-COVID-19) were simulated. Inter-reader agreement and performance metrics were calculated for the human readers and the R-AI classifier. The readers showed good agreement in assigning the CO-RADS score (Gwet's AC2 = 0.71, p < 0.001). The human readers achieved accuracy = 78% and accuracy = 74% in the high and low suspicion scenarios, respectively, while the R-AI classifier achieved accuracy = 79% in distinguishing COVID-19 from non-COVID-19 pneumonia on the independent validation dataset. The R-AI classifier performance was equivalent or superior to that of the human readers in all comparisons. Therefore, an R-AI classifier may support human readers in the difficult task of distinguishing COVID-19 from other types of viral pneumonia on CT imaging.


Introduction
Coronavirus Disease 2019 (COVID-19) is a complex infectious disease caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which has caused more than half a billion cases and 6 million deaths since it was first reported in late 2019 [1].
From a radiological point of view, CT findings of SARS-CoV-2 pulmonary infection include ground-glass opacities, areas of crazy-paving pattern, and consolidations. Such findings, however, largely overlap with those of other viral pneumonias, making the differential diagnosis challenging.

Study Design and Imaging Data
This study was retrospectively conducted in a single high-volume referral hospital for the management of the COVID-19 pandemic. The Local Ethics Committee (decision number 188-22042020) approved the study and waived informed consent since data were collected retrospectively and processed anonymously.
The CT scans of COVID-19 patients were performed between March 2020 and April 2021, while CT scans of non-COVID-19 patients were performed between January 2015 and October 2019 (i.e., before SARS-CoV-2 started circulating). For both groups, the CT scans were acquired within 15 days of serological evidence of infection.
Chest CT examinations were performed with different CT scanners (Somatom Definition Edge-Siemens, Somatom Sensation 64-Siemens, Brilliance 64-Philips) and with the same patient set-up (supine position with arms raised above the head during a single breath-hold, depending on patient compliance). The main acquisition parameters were: tube voltage = 80-140 kV; automatic tube current modulation; pitch = 1; matrix = 512 × 512. All acquisitions were reconstructed with high-resolution thorax kernels and a slice thickness of 3 mm.

Artificial Intelligence-Based Model
The collected CT images were used to develop a radiomic-based Neural Network (R-AI) classifier exploiting a Multi-Layer Perceptron architecture to discriminate between COVID-19 and non-COVID-19 pneumonia. In particular, the classifier was trained with 811 CT images (n = 496 COVID-19, n = 315 non-COVID-19), while the remaining 220 CT images (n = 151 COVID-19, n = 69 non-COVID-19) were used as an independent validation dataset, applying a threshold of 0.5 to the predicted values. Details about the R-AI classifier, including development and tuning, were previously described [18].
The R-AI classifier provided as output the probability (0.00-1.00) that the analyzed CT scan belonged to a COVID-19 patient.
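As an illustration of this setup only, the pipeline above can be sketched with a scikit-learn multi-layer perceptron; this is not the authors' actual implementation, and the feature matrix, architecture, and hyperparameters below are placeholders (synthetic data with the study's sample sizes):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder radiomic feature matrix: 811 training scans x 20 features,
# with labels 1 = COVID-19, 0 = non-COVID-19 (synthetic data, illustration only).
X_train = rng.normal(size=(811, 20))
y_train = (rng.random(811) < 496 / 811).astype(int)

# Hypothetical MLP architecture; the study's actual model and tuning are in [18].
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# The classifier outputs the probability (0.00-1.00) that a scan belongs to a
# COVID-19 patient; a threshold of 0.5 converts it to a binary prediction.
X_val = rng.normal(size=(220, 20))
prob_covid = clf.predict_proba(X_val)[:, 1]
pred_covid = (prob_covid >= 0.5).astype(int)
```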

Reader Evaluation
Three radiologists with >10 years of experience (Readers 1-3) and one radiology resident with 3 years of experience (Reader 4), all employed at a high-volume COVID-19 referral hospital, were enrolled to evaluate the 220 CT scans of the independent validation dataset. The four readers were blinded to the original radiologic report and all non-imaging data, including the acquisition date of the CT scans. They were asked to assign each case a CO-RADS score [8] (1 to 5) expressing increasing suspicion of COVID-19. To properly simulate a realistic clinical scenario, the readers were instructed to interpret the CT findings assuming that the patients had an acute condition (e.g., presentation at the Emergency Department).
Additionally, as an estimate of disease severity, for each patient, the readers visually assessed the extent of pulmonary involvement expressed as a percentage of the total lung volume, rounded to the nearest 10%.
The test was performed using a program developed in JavaScript that automatically presented the anonymized CT series to the reader in random order. After the reader had assigned the CO-RADS score through a dialog box, the program automatically loaded the CT scan of the next patient.

Data Analysis
Continuous variables were reported as median values with 25th and 75th percentiles (Q1-Q3) of their distribution; categorical variables were expressed as counts and percentages, with the corresponding 95% confidence interval (95%CI) using the Wilson method [19].
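For instance, the Wilson score interval [19] for a binomial proportion can be computed as follows (a generic sketch, independent of the Real Statistics software used in the study; the example numbers are the validation-set sizes from above):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Example: proportion of COVID-19 cases (151 of 220) in the validation dataset.
lo, hi = wilson_ci(151, 220)
print(f"{151/220:.1%} (95%CI: {lo:.1%}-{hi:.1%})")
```

Unlike the naive Wald interval, the Wilson interval remains sensible for proportions near 0% or 100%, which is why it is commonly preferred for reporting rates.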
The chance-corrected inter-reader agreement for the assigned CO-RADS score was tested using Gwet's second-order agreement coefficient (AC2) with ordinal weights [20]. AC2 was chosen to correct for the partial agreement occurring when comparing ordinal variables with multiple readers and because it is less affected by prevalence and marginal distribution [21][22][23]. The level of agreement was interpreted following Altman's guidelines [24]. Weighted percentage agreement was reported as well [25].
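As a simplified illustration of Gwet's chance-corrected weighted agreement, the AC2 coefficient can be sketched as below; note that this uses linear weights as a stand-in for the ordinal weighting scheme of [20], and the study itself relied on dedicated statistical software:

```python
import numpy as np

def gwet_ac2(ratings: np.ndarray, q: int) -> float:
    """Gwet's weighted agreement coefficient (AC2) for an n-subjects x r-raters
    matrix of categories coded 1..q, using linear weights (a simplification
    of the ordinal weights applied in the study)."""
    n, r = ratings.shape
    cats = np.arange(1, q + 1)
    w = 1 - np.abs(cats[:, None] - cats[None, :]) / (q - 1)  # linear weights
    # counts[i, k] = number of raters assigning category k to subject i
    counts = (ratings[:, :, None] == cats[None, None, :]).sum(axis=1)
    wcounts = counts @ w.T                                   # weighted counts
    # Observed weighted agreement, averaged over subjects
    pa = ((counts * (wcounts - 1)).sum(axis=1) / (r * (r - 1))).mean()
    # Chance agreement (Gwet's formulation)
    pi = (counts / r).mean(axis=0)
    pe = w.sum() / (q * (q - 1)) * (pi * (1 - pi)).sum()
    return (pa - pe) / (1 - pe)
```

With identity weights this reduces to the unweighted AC1, and perfect agreement always yields a coefficient of 1 regardless of the marginal distribution, which is the property that makes the statistic robust to prevalence effects.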
To account for equivocal results (i.e., CO-RADS 3), two different scenarios were simulated: a high suspicion scenario, where CO-RADS 3 results were considered as COVID-19 patients, and a low suspicion scenario, where CO-RADS 3 results were considered as non-COVID-19 patients together with CO-RADS 1 and 2.
Sensitivity (SE), specificity (SP), accuracy (ACC), positive likelihood ratio (PLR), and negative likelihood ratio (NLR) of human readers in discriminating COVID-19 patients from non-COVID-19 patients were calculated for both high and low suspicion scenarios. The same metrics of diagnostic performance were also calculated for the R-AI classifier.
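The two dichotomization rules and the derived performance metrics can be sketched as follows (a generic helper with illustrative values, not the study's actual spreadsheet computation):

```python
def dichotomize(co_rads: int, high_suspicion: bool) -> bool:
    """Map a CO-RADS score (1-5) to a binary COVID-19 call.
    CO-RADS 3 counts as positive in the high suspicion scenario
    and as negative in the low suspicion scenario."""
    cutoff = 3 if high_suspicion else 4
    return co_rads >= cutoff

def diagnostic_metrics(pred: list[bool], truth: list[bool]) -> dict[str, float]:
    """SE, SP, ACC, PLR, NLR from paired predictions and ground truth.
    Note: PLR is undefined when SP = 1 (division by zero) in this sketch."""
    tp = sum(p and t for p, t in zip(pred, truth))
    tn = sum(not p and not t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(not p and t for p, t in zip(pred, truth))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return {
        "SE": se,
        "SP": sp,
        "ACC": (tp + tn) / len(truth),
        "PLR": se / (1 - sp),
        "NLR": (1 - se) / sp,
    }
```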
Moreover, a further subanalysis was conducted to compare the performance of the human readers and the R-AI classifier in challenging cases, i.e., those in which two or more readers had assigned a CO-RADS 3 score.
Significant differences in the diagnostic performance of the readers and the R-AI classifier were tested using Cochran's Q test with a post-hoc pairwise McNemar test.
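These two tests can be sketched in Python as below (an illustration on a 0/1 matrix of correct/incorrect classifications, not the study's Real Statistics implementation; the exact binomial form of McNemar's test is used here):

```python
import numpy as np
from scipy.stats import chi2, binom

def cochrans_q(correct: np.ndarray) -> tuple[float, float]:
    """Cochran's Q test on an n_subjects x k_raters matrix of
    correct (1) / incorrect (0) classifications."""
    n, k = correct.shape
    col = correct.sum(axis=0)  # per-rater totals
    row = correct.sum(axis=1)  # per-subject totals
    q = (k - 1) * (k * (col**2).sum() - col.sum()**2) \
        / (k * row.sum() - (row**2).sum())
    return q, chi2.sf(q, df=k - 1)

def mcnemar_exact(a: np.ndarray, b: np.ndarray) -> float:
    """Exact two-sided McNemar p-value based on the discordant pairs
    between two raters' 0/1 correctness vectors."""
    b01 = int(((a == 0) & (b == 1)).sum())  # a wrong, b right
    b10 = int(((a == 1) & (b == 0)).sum())  # a right, b wrong
    n_disc = b01 + b10
    if n_disc == 0:
        return 1.0
    return min(1.0, 2 * binom.cdf(min(b01, b10), n_disc, 0.5))
```

For the post-hoc step, each pairwise McNemar p-value would then be adjusted by multiplying by the number of comparisons (Bonferroni's correction) and capping at 1.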
The data analysis was generated using the Real Statistics Resource Pack software (Release 6.8) (www.real-statistics.com (accessed on 1 October 2022)) for Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and GraphPad Prism 8.4.0 (GraphPad Software, La Jolla, CA, USA).
Statistical significance was established at the p < 0.050 level, applying Bonferroni's correction for multiple comparisons when appropriate.
Results
The rate of patients classified as CO-RADS 1 (normal/non-infectious) was 10% (95%CI: 8-12%), while the rate of CO-RADS 3 (equivocal cases) was 17% (95%CI: 15-20%). Specifically, 43 (20%) cases received a CO-RADS 3 score from two or more readers; of these, 26 (60%) were COVID-19 patients and 17 (40%) were non-COVID-19 patients. On the other hand, the R-AI classifier misclassified 21% (95%CI: 15-28%) of the COVID-19 patients and 22% (95%CI: 14-33%) of the non-COVID-19 patients. Exemplary cases are shown in Figure 1.
Regarding the diagnostic performance in identifying COVID-19 pneumonia, full results are provided in Table 3 and Figure 2. According to Cochran's Q test, only the performance of Reader 3 significantly changed between the high and low suspicion scenarios, decreasing in the latter (accuracy 70% vs. 78%, p = 0.008); no significant changes were found for the other readers (p > 0.999). No significant differences in performance were observed between the readers and the R-AI classifier in the high suspicion scenario (p = 0.369); on the contrary, a statistically significant result was obtained for the low suspicion scenario (p = 0.003). However, the post-hoc pairwise McNemar test revealed that the R-AI classifier still had diagnostic performance comparable to that of the human readers (lowest p = 0.256), whereas Reader 3 had a significantly lower performance than Reader 2 (p = 0.039) and Reader 4 (p = 0.041). Full statistical results of the comparative analysis are provided in Table 4.

Table 3. Diagnostic performance of the radiomic-based artificial intelligence classifier (R-AI) and human readers in classifying the patients in the high and low suspicion scenarios.

Table 4. Comparative analysis of the diagnostic performance of the radiomic-based artificial intelligence classifier (R-AI) and human readers in both high and low suspicion scenarios. The p-values adjusted after Bonferroni's correction are reported ("*" = statistically significant values).

Finally, considering the subset of 43 CT scans to which two or more radiologists assigned a CO-RADS 3 score, the readers obtained a global accuracy of 55% (95%CI: 47-62%) in the high suspicion scenario and 45% (95%CI: 38-53%) in the low suspicion scenario, whereas the R-AI classifier showed an accuracy of 74% (95%CI: 59-86%). Cochran's Q test was significant in both scenarios (p < 0.001); however, the post-hoc pairwise McNemar test was significant only for the comparison of the R-AI classifier with Reader 1 (p = 0.023 for both scenarios) and with Reader 3 in the low suspicion scenario (p = 0.035). Full details are reported in Tables 5 and 6 and Figure 4.

Table 5. Diagnostic performance of the radiomic-based artificial intelligence classifier (R-AI) and human readers in classifying the subset (n = 43) of patients who were assigned a CO-RADS 3 score by two or more readers.

Table 6. Comparative analysis of the diagnostic performance of the radiomic-based artificial intelligence classifier (R-AI) and human readers in classifying the subset (n = 43) of patients who were assigned a CO-RADS 3 score by two or more readers. The p-values adjusted after Bonferroni's correction are reported ("*" = statistically significant values).

Discussion
In this study, the diagnostic performance of multiple readers in distinguishing between COVID-19 and non-COVID-19 pneumonia was evaluated in two different risk scenarios and compared with a radiomic-based artificial intelligence classifier.
Given the well-known complexity of the task, inter-reader agreement in assigning the CO-RADS score was assessed and found to be good, in line with the currently available literature on the reproducibility of this reporting system. Prokop et al. [8] initially observed an overall Fleiss' kappa of 0.47, but subsequent studies reported a moderate-to-good level of agreement, comparable to that observed in our study [9,26]. Moreover, the absence of significant differences in the diagnostic performance of the three high-experience readers compared to the low-experience reader using the CO-RADS score confirmed the observations by Bellini et al. [10]. In contrast, in our study, some inconsistency in the CO-RADS evaluation was observed for one of the high-experience readers, whose diagnostic accuracy was slightly inferior in the low suspicion scenario.
At the very beginning of the COVID-19 pandemic, a study [7] on 424 patients with COVID-19 and non-COVID-19 viral pneumonia yielded a classification accuracy ranging between 60% and 83% when considering radiologists with direct experience of SARS-CoV-2 infection. Such a wide range of accuracy was reported in subsequent multi-reader analyses [5,27,28], and the results of our study fell within it. The simulation of two different suspicion scenarios allowed us to account for diverse epidemiological conditions, thus providing a more complete picture of the diagnostic performance of the readers.
When applied to the same dataset, the R-AI classifier obtained an accuracy of 79%, comparable to the performance of the human readers in both high and low suspicion scenarios. This result was similar to that reported by Cardobi et al. [29], who developed a radiomic-based model to distinguish COVID-19 from other types of interstitial pneumonia at chest CT. As we used a tenfold larger dataset and applied the R-AI classifier to an independent validation set, our study provided stronger evidence that quantitative imaging and AI models can support this diagnostic task.
Notably, when considering only the subset of patients who were assigned a CO-RADS 3 score by two or more radiologists, the global accuracy of the human readers dropped to 45-55% (depending on the scenario), while the accuracy of the R-AI classifier was almost unchanged (74%). This suggested a more stable performance for the AI, probably based on the extraction of quantitative information from the images that is not perceivable by the human eye, even though the result was only partially confirmed by the post-hoc pairwise McNemar test. However, it is reasonable to believe that the smaller sample size, resulting in larger confidence intervals for the performance metrics, together with the correction for multiple comparisons, reduced the statistical power and thus increased the risk of type II errors. Nevertheless, the result bolsters the concept of AI models helping with equivocal cases, for example, as a second-opinion tool to improve diagnostic performance.
AI models with higher performance than our classifier in differentiating between COVID-19 and non-COVID-19 viral pneumonia were also reported, as in the study by Wang et al. [14,30]. However, these authors proposed a method based on single-slice manual segmentation of pulmonary lesions, which is a time-consuming approach hardly feasible in everyday clinical practice compared with our fully automatic approach. Zhou et al. [13] provided another example of an automatic deep learning-based algorithm with very good performance but limited to patients with SARS-CoV-2 and influenza virus infections.
In this regard, contrary to many other similar studies on AI models [16,17], we decided to focus only on the differential diagnosis between COVID-19 and non-COVID-19 viral pneumonia, rather than a broader spectrum of pulmonary diseases. On the one hand, this choice was meant to stress the difficulty posed by the highly overlapping CT findings of these entities; on the other hand, the recognition of typical signs of bacterial infections, such as lobar consolidation, would most likely not require the support of an AI tool. In addition, even if rapid COVID-19 tests are currently widespread and help guide the clinical suspicion, they may be unavailable in some contexts (e.g., night shifts) or provide equivocal results. On the other hand, we envisioned our R-AI classifier as a tool for the radiologist to be used for pneumonia cases whose infectious nature is recognized but with ambiguous or discordant findings compared to clinical history or laboratory results. Nevertheless, in the future, it would be possible to further train the classifier with other lung diseases that mimic COVID-19, such as organizing pneumonia or drug-induced interstitial pneumonia, thus extending its applications.
The main limitation of this study is its retrospective, single-institution design, which entails a potential selection bias. For example, the COVID-19 and non-COVID-19 groups had different sample sizes, although the impact of this imbalance was limited by the fact that the readers were unaware of the case proportions. The R-AI classifier was also trained and tested on a COVID-predominant dataset. Additionally, the study population mainly included patients with moderate-to-severe pulmonary involvement based on the readers' visual evaluation. The underrepresentation of cases with mild disease could represent a bias, even if the sample reflected the actual population for whom a chest CT scan is recommended [31]. In addition, the CO-RADS score was developed specifically for use in patients with moderate-to-severe disease [8]. Another limitation was that chest CT scans acquired within 15 days of molecular evidence of infection were used, so the causal relationship between the documented infection and the imaging findings could not be guaranteed; indeed, some of the selected patients may have had mixed pneumonia or other diseases. However, the large dataset used should have minimized the impact of this occurrence. Finally, the radiologists were not given clinical information during the evaluation, which could have further improved their performance. In the future, the generalizability of our results should be assessed with a prospective design in a multicenter setting, possibly incorporating clinical information into the AI model.
In conclusion, this work confirmed that distinguishing COVID-19 from other types of viral pneumonia is challenging, even for expert radiologists. Nevertheless, we showed that an artificial intelligence classifier based on radiomic features can provide diagnostic performance in this task comparable to that of human readers, and probably even superior for equivocal cases. Once implemented in the clinical workflow, such a tool could support radiological activity, for example, by providing a second opinion in case of ambiguous chest CT findings of pulmonary infection.