Artificial Intelligence Applied to Chest X-ray for Differential Diagnosis of COVID-19 Pneumonia

We assessed the role of artificial intelligence applied to chest X-rays (CXRs) in supporting the diagnosis of COVID-19. We trained and cross-validated a model with an ensemble of 10 convolutional neural networks with CXRs of 98 COVID-19 patients, 88 community-acquired pneumonia (CAP) patients, and 98 subjects without either COVID-19 or CAP, collected in two Italian hospitals. The system was tested on two independent cohorts, namely, 148 patients (COVID-19, CAP, or negative) collected by one of the two hospitals (independent testing I) and 820 COVID-19 patients collected by a multicenter study (independent testing II). On the training and cross-validation dataset, sensitivity, specificity, and area under the curve (AUC) were 0.91, 0.87, and 0.93 for COVID-19 versus negative subjects, 0.85, 0.82, and 0.94 for COVID-19 versus CAP. On the independent testing I, sensitivity, specificity, and AUC were 0.98, 0.88, and 0.98 for COVID-19 versus negative subjects, 0.97, 0.96, and 0.98 for COVID-19 versus CAP. On the independent testing II, the system correctly diagnosed 652 COVID-19 patients versus negative subjects (0.80 sensitivity) and correctly differentiated 674 COVID-19 versus CAP patients (0.82 sensitivity). This system appears promising for the diagnosis and differential diagnosis of COVID-19, showing its potential as a second opinion tool in conditions of the variable prevalence of different types of infectious pneumonia.


Introduction
Differential diagnosis of COVID-19 from other types of pneumonia has been a high-priority research topic and clinical aim since the early stages of the current pandemic [1,2].
Prompt identification of COVID-19 cases is paramount to ensure proper management and better patient outcomes [3][4][5]. Moreover, any tool applied for this aim should have a good cost-benefit ratio for the healthcare service, be able to adapt to heterogeneous settings, and also be useful outside COVID-19 pandemic peaks, enabling accurate differential diagnosis from other types of pneumonia, such as non-COVID-19 community-acquired pneumonia (CAP) [2,5,6,7,8].
The current reference standard for the detection of COVID-19 is the detection of SARS-CoV-2 by reverse transcription polymerase chain reaction (RT-PCR) [3,9]. However, due to intrinsic shortcomings of this diagnostic modality and to the high prevalence and clinical impact of COVID-19, chest imaging has been widely used to triage suspect cases [10][11][12]. A meta-analysis on the diagnostic performance of computed tomography (CT) showed a 94% pooled sensitivity, specificity being however under 40% [13]. Moreover, the use of CT implies higher healthcare costs since CT scanners have relatively limited availability, even in high-income countries, and CT equipment and rooms need sanitization after each use involving suspected or confirmed cases unless a continuous series of confirmed cases has to be studied [14][15][16]. In this context, the use of chest X-ray imaging (CXR) has become increasingly commonplace to evaluate patients presenting with symptoms potentially associated with COVID-19 such as fever, cough, or dyspnea [17][18][19][20][21]. Typical COVID-19 abnormal findings reported at CXR are portions of the lungs appearing as a "hazy" shade of grey instead of normal well-aerated parenchyma, representing pneumonia foci, with fine linear structures representing blood vessels [18], the so-called ground-glass opacities, which are also well-detected at CT [22]. Since these findings are among the first radiological manifestations of COVID-19 pneumonia, it could be hypothesized that an accurate CXR reading could aid the early diagnosis of COVID-19 pneumonia, also providing the additional benefit of differential diagnosis from CAP.
Recently, artificial intelligence (AI) and deep learning, in particular convolutional neural networks (CNNs), have proven to be effective and reliable tools to both automate and improve the diagnosis and prognosis of various diseases, including pneumonia, as shown by competitors at the 2018 Kaggle Challenge for Chest X-ray images [23]. Furthermore, AI approaches have shown potential in performing differential diagnoses between different types of pneumonia, namely bacterial and viral CAP [24,25]. Early in the pandemic, AI was employed in COVID-19 diagnosis by various teams, showing sensitivities and specificities well over 0.90 with both machine learning [26][27][28] and deep learning [29][30][31][32][33][34][35][36][37][38][39][40] techniques. As already pointed out by Farhat et al. [41], Shi et al. [42], and López-Cabrera et al. [43], pre-trained CNN-based systems emerge as the most popular and powerful approaches for the automatic classification of images of suspected COVID-19 patients. However, most of the studies published so far have focused on CT, with only some of them using CXR [26,28,31,36,40]. Moreover, only the studies by Kana et al. [26] and Ucar et al. [40] demonstrated the applicability of their models also to the differential diagnosis of COVID-19 versus other types of pneumonia. Specifically, Kana et al. [26] implemented a transfer learning model based on CXRs to differentiate healthy individuals and patients with bacterial or viral pneumonia from patients with COVID-19 pneumonia, obtaining near-100% accuracy. Ucar et al. [40] fine-tuned a SqueezeNet using a Bayesian optimization approach, reaching 98% accuracy in classifying normal subjects, patients with non-COVID-19 CAP, and COVID-19 patients. However, none of the aforementioned studies used an independent testing set (either temporally or spatially independent) that would allow for an unbiased evaluation of model performance.
The aim of our study was to evaluate the two-sided role of AI applied to CXR in patients suspected to be affected by COVID-19 pneumonia, i.e., outright COVID-19 diagnosis and differential diagnosis from other CAPs. The general purpose was to present an effective tool supporting the diagnosis of COVID-19 pneumonia in the perspective of offering a second opinion to radiologists or a preliminary assessment when a radiologist is not immediately available. With these aims, taking into consideration the strengths and limitations of the current literature, which points to a consolidated superior performance of CNNs for image classification tasks [44], we trained and cross-validated a ResNet-50 architecture. The model was applied to CXRs for supporting the differential diagnosis of COVID-19. Model performance was evaluated using two independent testing sets.

Study Protocol
The local Ethics Committee approved this retrospective study on 8 April 2020, and informed consent was waived due to the retrospective nature of the study.

Training, Validation, and First Independent Testing
For the training/validation phase and the first independent testing (independent testing I), consecutive patients referred to the emergency department (ED) of two hospitals in Lombardy, Italy (IRCCS Policlinico San Donato, San Donato Milanese, Center 1; ASST Monza-Ospedale San Gerardo, Monza, Center 2) were included in the study. From these centers, two groups of patients were assessed according to two different timeframes: the first group was referred to the EDs between mid-February 2020 and mid-March 2020, and the second in the same period of 2019.
The first group included patients with RT-PCR-confirmed SARS-CoV-2 infection undergoing CXR on ED admission. Digital CXR was performed in two projections (posteroanterior and lateral) in the radiology unit or in one anteroposterior projection at bedside in the ED. Whenever both the posteroanterior and lateral projections were available, only the former was included for further analysis.
In the second group, we included patients with suspected CAP undergoing CXR on ED admission. As for patients of the first group, digital CXR was performed in two projections (posteroanterior and lateral) in the radiology unit or in one anteroposterior projection at bedside in the ED. Again, only the posteroanterior projection was considered when two projections were available.
Radiological labels for both groups were attributed by two radiologists of the two centers involved in this study (with 8 years and 13 years of experience in CXR interpretation).

Second Independent Testing
For the external testing (independent testing II) of our AI model, a third group of patients was retrieved from the publicly available dataset "AIforCOVID" [45]. Ethics Committee approval was also obtained for this study. The AIforCOVID dataset, collected between March 2020 and June 2020, includes posteroanterior CXRs of RT-PCR-confirmed COVID-19 patients from six other hospitals in Italy. Patients from this dataset are categorized as having "mild" or "severe" disease according to their clinical outcome: patients assigned to the "mild" group were either sent to domiciliary isolation or hospitalized in ordinary wards without the need for ventilatory support, while the "severe" group included all hospitalized patients who required ventilatory support or intensive care and/or died during hospitalization.

AI System
The TRACE4© radiomic platform (DeepTrace Technologies S.R.L., Milano, Italy) [46] was used to classify CXRs of the different groups of patients. This platform allows training, validation, and testing of different AI systems combined with different feature-extraction methods applied to medical images for classification purposes.
The TRACE4© platform includes a full workflow for radiomic analysis (compliant with the guidelines of the Image Biomarker Standardization Initiative [47]), different feature extraction and selection methods, and different ensembles of machine-learning techniques such as support vector machines, random forests, deep learning, and transfer learning of neural networks.
The classification tasks of interest were binary (COVID-19 versus negative, COVID-19 versus CAP), considering the following cases: COVID-19, all patients with positive RT-PCR; negative, all patients from the second group without any CXR finding; and CAP, all patients from the second group with positive CXR findings for CAP. The deep-neural-network classifier proposed in this work was implemented on the ResNet-50 architecture, i.e., a convolutional neural network composed of 50 layers. The network was pre-trained on more than a million images from the ImageNet database [48], from which it learned a rich feature representation of the input classes. This feature representation is then used to classify new samples (new images, in this case) into one of the input classes. In order to specialize the pre-trained ResNet-50 network to the binary-discrimination tasks of interest in our study (i.e., COVID-19 versus negative and COVID-19 versus CAP), we applied a fine-tuning process to the last layers of the original ResNet-50 architecture [48].
To train and test an ensemble of 10 convolutional neural networks (based on the ResNet-50 architecture), a 10-fold cross-validation procedure was used. The classification outputs of each of the 10 models contributing to the ensemble were then merged by sum rule (ensemble-averaging of class probability) to obtain the final classification output of the ensemble of classifiers.
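As an illustration of the 10-fold cross-validation scheme, the following minimal Python sketch generates ten train/validation index splits for the 196 CXRs of the COVID-19 versus negative task (98 + 98). The function name `k_fold_indices`, the fixed random seed, and the non-stratified shuffling are assumptions made for this example; the text does not state how the TRACE4© platform assigns images to folds, and the network training itself is omitted.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index splits covering n_samples items."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed (assumption) for reproducibility
    # Striding by k yields k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


# 98 COVID-19 + 98 negative CXRs, as in the COVID-19 versus negative task.
splits = k_fold_indices(n_samples=196, k=10)
for fold_number, (train, validation) in enumerate(splits, start=1):
    # One network of the ensemble would be fine-tuned on `train`
    # and evaluated on `validation` at this point (training code omitted).
    print(f"fold {fold_number}: {len(train)} training, {len(validation)} validation images")
```

Each of the ten splits would yield one fine-tuned network, so that every image is used for validation exactly once and the ten resulting networks form the ensemble.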
CXR images were resized to the input size of the architecture (i.e., 224 pixels by 224 pixels) before being fed into the deep neural network. Automatic data-augmentation techniques were applied to the resized CXR set during the training of the classifier. This operation, which includes image rotation, shear, and reflection, aims at increasing CXR image diversity among different training phases (epochs), thus improving the performance of the training procedure. No further data processing was applied to the CXR images used as input to the deep-learning network or during the training-and-classification process.
To obtain further estimates of the performance of our AI system, the model resulting from the training procedure was tested on the two independent sets of patients (independent testing I and II); none of these two sets included patients from the cross-validation procedure.
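When the ensemble is applied to a new CXR, the sum rule (ensemble-averaging of class probability) used to merge the ten networks can be sketched as follows. This is a minimal illustration, not the TRACE4© implementation: the function names, the class order, and the ten probability pairs are hypothetical values invented for the example.

```python
from typing import List, Sequence, Tuple


def sum_rule(model_probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities produced by the ensemble members."""
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    return [
        sum(model[c] for model in model_probabilities) / n_models
        for c in range(n_classes)
    ]


def ensemble_predict(
    model_probabilities: Sequence[Sequence[float]],
    class_names: Sequence[str] = ("negative", "COVID-19"),
) -> Tuple[str, List[float]]:
    """Return the class with the highest averaged probability and the averages."""
    averaged = sum_rule(model_probabilities)
    best = max(range(len(averaged)), key=lambda c: averaged[c])
    return class_names[best], averaged


# Hypothetical outputs [P(negative), P(COVID-19)] of the 10 networks for one CXR.
example_outputs = [
    [0.30, 0.70], [0.45, 0.55], [0.20, 0.80], [0.60, 0.40], [0.35, 0.65],
    [0.25, 0.75], [0.55, 0.45], [0.10, 0.90], [0.40, 0.60], [0.30, 0.70],
]
label, probabilities = ensemble_predict(example_outputs)
print(label, [round(p, 2) for p in probabilities])
```

In this invented example two of the ten networks favor the negative class, but the averaged probability still favors COVID-19; averaging probabilities rather than counting votes lets confident networks weigh more than uncertain ones.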
Performance metrics for both cross-validation and independent testing I are presented in terms of sensitivity, specificity, and area under the receiver operating characteristic curve (ROC AUC), with 95% confidence intervals (95% CIs) reported only for cross-validation performance. For independent testing II, since no negative controls or CAP patients are present in this independent cohort, only sensitivity was calculated. In this second independent testing, subgroup analysis according to COVID-19 severity ("mild" or "severe" group) was also performed.
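The metrics named above can be computed as in the following minimal Python sketch. The toy labels and scores are invented for illustration, the 0.5 decision threshold is an assumption, and the Wilson score interval is only one common way to obtain a 95% CI for a proportion; the text does not state which CI method was actually used.

```python
import math
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case scores higher than a negative one."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95% coverage)."""
    p = successes / total
    denominator = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denominator
    return centre - half_width, centre + half_width


# Invented toy data: 1 = COVID-19, 0 = negative; scores are ensemble probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
low, high = wilson_interval(successes=3, total=4)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.4f} CI=({low:.2f}, {high:.2f})")
```

For a cohort containing only COVID-19 patients, as in independent testing II, only the sensitivity part of this computation applies, because specificity and AUC both require negative cases.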

Results
A total of 162 patients who underwent CXR and tested positive for SARS-CoV-2 infection at RT-PCR in Centers 1 and 2 from 21 February 2020 to 16 March 2020 were included in our retrospective evaluation (first group of patients). From all patients admitted to the two hospitals in roughly the same period of the previous year, 112 patients with CAP and 158 negative controls were included, accounting for a total of 270 patients in the second group. For the third group, 820 RT-PCR-confirmed COVID-19 patients were retrieved for our retrospective evaluation from the AIforCOVID database, 384/820 (47%) being categorized in the "mild" subgroup of this database and 436/820 (53%) in the "severe" subgroup. Table 1 shows the distribution of the 1252 included patients, while Figure 1 shows examples of CXRs, including typical COVID-19 pneumonia (Figure 1a).
Independent testing II conducted on the CXRs of the 820 COVID-19 patients from the AIforCOVID dataset (third group of patients) showed 652 COVID-19 patients correctly classified versus negative subjects (sensitivity 0.80) and 674 COVID-19 patients correctly classified versus CAP (sensitivity 0.82). Subgroup analysis on the "mild" and "severe" COVID-19 patients showed that the proposed AI system for the discrimination between COVID-19 and negative subjects correctly classified 264 out of 384 patients in the "mild" subgroup (sensitivity 0.69) and 388 out of 436 patients in the "severe" subgroup (sensitivity 0.89). Similarly, the proposed AI system for the discrimination between COVID-19 and CAP patients correctly classified 284 out of 384 patients in the "mild" subgroup (sensitivity 0.74) and 390 out of 436 patients in the "severe" subgroup (sensitivity 0.89).
Tables 2-4 detail the performance of the proposed AI system, with the corresponding receiver operating characteristic (ROC) curves shown in Figure 2.
Table 2. Training and validation (cross-validation) performance of the proposed AI system.
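The sensitivities reported for independent testing II follow directly from the counts of correctly classified patients. The short Python sketch below recomputes them and checks that the subgroup counts add up to the overall counts; the dictionary layout and variable names are choices made for this example, while all counts are those given in the text.

```python
# (correctly classified, total) for each task and severity subgroup, from the text.
counts = {
    "COVID-19 vs negative": {"mild": (264, 384), "severe": (388, 436), "all": (652, 820)},
    "COVID-19 vs CAP": {"mild": (284, 384), "severe": (390, 436), "all": (674, 820)},
}

sensitivities = {
    task: {group: round(correct / total, 2) for group, (correct, total) in groups.items()}
    for task, groups in counts.items()
}

for task, groups in counts.items():
    # The mild and severe subgroups should sum to the overall figures.
    assert groups["mild"][0] + groups["severe"][0] == groups["all"][0]
    assert groups["mild"][1] + groups["severe"][1] == groups["all"][1]
    print(task, sensitivities[task])
```

The gap between the "mild" and "severe" subgroups (0.69 versus 0.89 and 0.74 versus 0.89) is what drives the overall sensitivities of 0.80 and 0.82.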

Discussion
Over the last year, the COVID-19 pandemic showed an ever-shifting time- and space-related epidemiological profile worldwide [49,50]. However, in all phases of pandemic waves, quick and accurate diagnosis remains of utmost importance [51,52], in particular when facing viral variants [53,54] and the need to take advantage of the effect of vaccination campaigns [54,55]. In this context, CXR has emerged as a crucial first-line diagnostic tool for the detection of COVID-19 pneumonia in the ED setting [18] and beyond [56,57], given its high availability, low associated costs, and accuracy in pneumonia diagnosis [7,10,18,58,59].
Although COVID-19 pneumonia appears on CXR with characteristic features, many of them are also shared by other viral types of pneumonia [7,18,60]. The improvement of CXR diagnostic performance would be paramount to ameliorate decision making regarding patient management, which strongly relies on the initial assessment, considering both the intrinsic shortcomings of RT-PCR testing and the difficulties of implementing a CT strategy for early diagnosis [10,[14][15][16].
For the purpose of diagnosis, we trained and tested an ensemble of ten convolutional neural networks with CXRs of 98 COVID-19 patients referred to the EDs of two university hospitals in northern Italy (Center 1 and Center 2) during the first 2020 pandemic wave and 98 negative subjects from approximately the same period of 2019. We then tested the proposed AI system on an independent cohort of 148 patients not used during training, coming from one of these two centers (independent testing I). The AI model was able to automatically classify COVID-19 and negative subjects with a sensitivity of 0.98, a 0.88 specificity, and a 0.94 AUC. Furthermore, another independent testing (independent testing II) on a public dataset of 820 COVID-19 patients showed good generalization abilities of our AI tool, yielding an average sensitivity of 0.80 for the task of diagnosis versus negative subjects, with even higher performance when considering COVID-19 patients with severe disease, as expected.
For the purpose of differential diagnosis, we trained and tested the ensemble of convolutional neural networks with CXRs of 98 COVID-19 patients and 88 patients with CAP (collected in 2019) from Center 1 and Center 2. The temporal selection criterion for CAP patients was enforced to ensure that no patient could carry undetected SARS-CoV-2 infection. In independent testing I, this AI model was able to classify COVID-19 and CAP patients with a 0.97 sensitivity, a 0.96 specificity, and a 0.94 AUC. Of note, independent testing II confirmed the good generalization abilities of our AI tool, performing with an average sensitivity of 0.82 for the task of COVID-19 versus CAP patients, with better performance for COVID-19 patients with severe disease, as expected again.
The presented AI tool yielded an interesting performance for the detection of COVID-19 compared to negative subjects and for the differential diagnosis with CAP. These performances open promising perspectives for our AI system to be used in clinical practice, thanks to a relatively high sensitivity (ranging 0.80-0.98 in the independent testing) with an interesting specificity (0.88). Especially in conditions of variable prevalence [50,61,62], due to local viral waves, effects of vaccination, and the appearance of viral variants [54,55], the availability of an AI tool as a second opinion support system may be useful to increase the diagnostic performance. We imagine a practical possibility of combining human and AI reading according to the rule of double reading. When one reading (human or AI) is positive and the other one is negative, in the case of high prevalence, the overall result will be given as positive; hence, maximizing sensitivity and negative predictive value. Conversely, in the same combination of contradictory results but in a low prevalence setting, a third (human) reader will be asked to take the decision, trying to maximize specificity and positive predictive value. The next challenge will be to appraise how the human grasp on different settings, for instance, a pandemic context versus relative normalcy, may interact with the performance of our AI tool [63].
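The double-reading rule described above can be written as a small decision function. This is a sketch of the proposed logic only: the function name, argument names, and returned strings are hypothetical, and how "high" versus "low" prevalence would be determined in practice is left outside the example.

```python
from typing import Optional


def double_reading(
    human_positive: bool,
    ai_positive: bool,
    high_prevalence: bool,
    third_reader_positive: Optional[bool] = None,
) -> str:
    """Combine a human and an AI reading according to the prevalence setting."""
    # Concordant readings are accepted as they are.
    if human_positive == ai_positive:
        return "positive" if human_positive else "negative"
    # Discordant readings in a high-prevalence setting are called positive,
    # favoring sensitivity and negative predictive value.
    if high_prevalence:
        return "positive"
    # Discordant readings in a low-prevalence setting go to a third (human)
    # reader, favoring specificity and positive predictive value.
    if third_reader_positive is None:
        return "refer to third reader"
    return "positive" if third_reader_positive else "negative"


print(double_reading(human_positive=False, ai_positive=True, high_prevalence=True))
print(double_reading(human_positive=False, ai_positive=True, high_prevalence=False))
```

Keeping the third reading as an optional argument makes the pending state explicit: a discordant low-prevalence case has no final result until the arbitrating reader has decided.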
Our research has some important limitations. First, concerning the performance achieved by our AI system, in this study, we report the sensitivity, specificity, and AUC obtained by training the network with only 98 COVID-19 patients. Overall, our system was able to detect both COVID-19 subjects (sensitivity ranging 0.80-0.97) and non-infected subjects (negative and CAP, specificity 0.88) with high and quite balanced performance. Nevertheless, the performance of our AI system could improve when trained with more CXRs in a larger multicenter setting. Moreover, even though a two-center CXR set was used to train and validate the AI system (cross-validation), the first test set of CXRs (independent testing I), although separate from the set used to train and validate the AI, came from one of these centers; only the second test set of CXRs (independent testing II) came from different centers and a different timeframe with respect to the training dataset.
Second, our AI model is currently not able to define the stage or to predict the progression or the prognosis of COVID-19 patients. Other implementations could be derived from the integration in our system of the "mild" and "severe" classes of COVID-19 images that could stratify the population according to disease severity or patient outcome (e.g., the need for mechanical ventilation, disease duration, and short-term survival). In order to develop these models, but also to improve our current model, it will be important to integrate biological and clinical data in the AI system.
Third, for various reasons, the prevalence of bedside CXR was higher in the COVID-19 group than in negative cases. While this may appear to be a potential source of bias, the use of bedside CXR does not necessarily indicate more severe pneumonia, which would otherwise lead to a potentially incorrect association between higher severity and COVID-19. In fact, portable equipment allows for easier management and a reduction of infections because it does not require infectious subjects to move around the hospital facilities and is easier to sanitize [56,57].
Finally, it is important to recognize that the role of CXR in evaluating patients depends on the severity of the infection in the individual patient and on COVID-19 prevalence in the community [61,62]. In individuals who are asymptomatic, the sensitivity of CXR could fall steeply, in particular in the first 48 h after symptom onset, since asymptomatic individuals could test positive at RT-PCR and negative at CXR. Moreover, CXR may prove less useful in areas with very little circulating SARS-CoV-2. Conversely, CXR is most useful in patients who are acutely ill and symptomatic in areas with relatively high prevalence. In this scenario, patients with CXR findings attributable to COVID-19 could be considered as presumptively infected by the virus when the first RT-PCR test result is still rarely negative. For the purpose of differential diagnosis, disease severity ought to be considered as a potential source of bias because COVID-19 pneumonia cases may potentially present as more severe, on average, than CAP.
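The dependence on prevalence can be made concrete with Bayes' rule. The Python sketch below computes positive and negative predictive values from the sensitivity (0.98) and specificity (0.88) reported for COVID-19 versus negative subjects in independent testing I; the two prevalence values (5% and 50%) are hypothetical settings chosen for illustration and do not come from the study.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at the given disease prevalence."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    true_negative = specificity * (1 - prevalence)
    false_negative = (1 - sensitivity) * prevalence
    ppv = true_positive / (true_positive + false_positive)
    npv = true_negative / (true_negative + false_negative)
    return ppv, npv


# Sensitivity and specificity from independent testing I (COVID-19 versus negative);
# the prevalence values are hypothetical low- and high-prevalence scenarios.
for prevalence in (0.05, 0.50):
    ppv, npv = predictive_values(sensitivity=0.98, specificity=0.88, prevalence=prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.3f}")
```

With fixed sensitivity and specificity, the positive predictive value drops markedly as prevalence falls, which is consistent with the statement that CXR is most useful in areas with relatively high prevalence.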

Conclusions
In subjects with suspected COVID-19, an AI reader applied to CXR achieved a sensitivity ranging 0.80-0.98 and a specificity of 0.88 in the diagnosis of COVID-19, also attaining a sensitivity ranging 0.82-0.97 and a specificity ranging 0.82-0.96 in the differential diagnosis of COVID-19 versus CAP. This system may prove a cost-sustainable and efficient tool as a second opinion to radiologists in a variable spectrum of clinical and epidemiological contexts. It will be kept in continuous training with new CXR images to increase its performance and provide a larger external validation.

Funding:
This research did not receive any specific grant from funding agencies in the public, commercial or not-for-profit sectors. This study was partially supported by funding from the Italian Ministry of Health to IRCCS Policlinico San Donato.

Institutional Review Board Statement:
The local Ethics Committee approved this retrospective study on 8 April 2020, protocol code COVID19-RXretro, approve code 37/int/2020.

Informed Consent Statement:
Study-specific patient informed consent was waived by the Ethics Committee due to the retrospective nature of this study.

Data Availability Statement:
Datasets from Center 1 (IRCCS Policlinico San Donato, San Donato Milanese, Italy) and Center 2 (ASST Monza-Ospedale San Gerardo, Monza, Italy) used and/or analyzed during the current study are available from the corresponding author on reasonable request. The AIforCOVID dataset is publicly available at https://aiforcovid.radiomica.it/ (accessed 14 March 2021).

Acknowledgments:
The authors wish to thank Teresa Giandola and Maria Ragusi, who made it possible to obtain patient data from ASST Monza-Ospedale San Gerardo, and Lorenzo Blandi, Saverio Chiaravalle, Laurenzia Ferraris, Andrea Giachi and Moreno Zanardo, who made it possible to obtain patient data from IRCCS Policlinico San Donato.

Conflicts of Interest:
Christian Salvatore declares to be CEO of DeepTrace Technologies S.R.L., a spinoff of Scuola Universitaria Superiore IUSS, Pavia, Italy. Matteo Interlenghi, Annalisa Polidori, and Isabella Castiglioni declare to own DeepTrace Technologies S.R.L. shares. Simone Schiaffino declares to have received travel support from Bracco Imaging and to be a member of the speakers' bureau for General Electric Healthcare. Marco Alì declares to be a scientific advisor for Bracco Imaging. Francesco Sardanelli declares to have received grants from, or to be a member of, the speakers' bureau/advisory board for Bayer Healthcare, the Bracco Group, and General Electric Healthcare. All other authors have nothing to disclose.