Unveiling COVID-19 from CHEST X-Ray with Deep Learning: A Hurdles Race with Small Data

The possibility of using widespread and simple chest X-ray (CXR) imaging for early screening of COVID-19 patients is attracting much interest from both the clinical and the AI communities. In this study we provide insights and also raise warnings on what is reasonable to expect when applying deep learning to COVID classification of CXR images. We provide a methodological guide and a critical reading of an extensive set of statistical results that can be obtained using currently available datasets. In particular, we take on the challenge posed by the current small size of COVID data and show how significant the bias introduced by transfer learning from larger public non-COVID CXR datasets can be. We also contribute results on a medium-size COVID CXR dataset, recently collected by one of the major emergency hospitals in Northern Italy during the peak of the COVID pandemic. These novel data allow us to help validate the generalization capacity of preliminary results circulating in the scientific community. Our conclusions shed some light on the possibility of effectively discriminating COVID using CXR.


I. INTRODUCTION
COVID-19 virus has rapidly spread in mainland China and into multiple countries worldwide [1]. As of April 7th 2020 in Italy, one of the most severely affected countries, 135,586 patients with COVID-19 were recorded, and 17,127 of them died; at the time of writing Piedmont is the 3rd most affected region in Italy, with 13,343 recorded cases [2].
Early diagnosis is a key element for proper treatment of the patients and prevention of the spread of the disease. Given the high tropism of COVID-19 for respiratory airways and lung epithelium, identification of lung involvement in infected patients can be relevant for treatment and monitoring of the disease.
Virus testing is currently considered the only specific method of diagnosis. The Centers for Disease Control and Prevention (CDC) in the US recommends collecting and testing specimens from the upper respiratory tract (nasopharyngeal and oropharyngeal swabs) or from the lower respiratory tract when available (bronchoalveolar lavage, BAL) for viral testing with reverse transcription polymerase chain reaction (RT-PCR) assay [3]. Testing on BAL samples provides higher accuracy; however, this test is uncomfortable for the patient, possibly dangerous for the operator due to aerosol emission during the procedure, and cannot be performed routinely. Nasopharyngeal swabs are instead easy to perform and affordable, and are the current standard in the diagnostic setting; their reported accuracy depends on the severity of the disease and the time from symptom onset, and reaches up to 73.3% [4].
However, it has been widely demonstrated that, even at early stages of the disease, chest X-rays (CXR) and computed tomography (CT) scans can show pathological findings. It should be noted that these findings are non-specific and overlap with those of other viral infections (such as influenza, H1N1, SARS and MERS): most authors report peripheral bilateral ill-defined and ground-glass opacities, mainly involving the lower lobes, progressively increasing in extension as the disease becomes more severe and leading to diffuse parenchymal consolidation [7], [8]. CT is a sensitive tool for early detection of peripheral ground-glass opacities; however, routine use of CT imaging in these patients is logistically challenging in terms of safety for health professionals and other patients, and can overwhelm available resources [9].
Chest X-ray can be a useful tool, especially in emergency settings: it can help exclude other possible lung "noxa", allow a first rough evaluation of the extent of lung involvement and, most importantly, can be obtained at the patient's bed using portable devices, limiting possible exposure of health care workers and other patients. Furthermore, CXR can be repeated over time to monitor the evolution of lung disease [5].
Wong et al., in a study recently published in Radiology, reported that X-ray has a sensitivity of 69% and that the severity of CXR findings peaked at 10-12 days from the date of symptom onset [8].
Because of their mostly peripheral distribution, subtle early findings on CXRs may be a diagnostic challenge even for an experienced thoracic radiologist: in fact, there are many factors that should be taken into account in image interpretation and that could alter diagnostic performance (such as patient body type, compliance in breath-holding and positioning, the type of projection that can be executed, i.e. antero-posterior in more critical patients examined at bedside, postero-anterior if the patient can be moved to the radiology unit and is more cooperative, the presence of other medical devices on the thorax, especially in X-rays performed in intensive care units, etc.). In the challenging and never-before-seen scenario that has emerged in recent months, radiologists may look at Artificial Intelligence and deep learning applications as a possible aid for daily activity, in particular for the identification of the more subtle findings that could escape the human eye (i.e. to reduce false-negative X-rays) or, on the other hand, to prompt swab repetition or further diagnostic examinations when the first virus test is negative (considering its sub-optimal sensitivity).
Given the intrinsic limits of CXR but at the same time its potentially relevant role in the fight against COVID-19, in this work we set up a state-of-the-art deep learning pipeline to investigate whether computer vision can unveil some COVID fingerprints. It is evident that the answer will be given only when large publicly available image datasets empower scientists to train complex neural models, to provide reproducible and statistically solid results and to contribute to the clinical discussion. Unfortunately, to date, we are stuck with few labelled images. Thanks to the collaboration with the radiology unit of Città della Salute e della Scienza di Torino (CDSS) hospital in Turin in the last days of March (at the peak of the epidemic in Italy), we managed to collect the Covid Radiographic images Data-set for AI (CORDA), currently comprising images from 386 patients that underwent COVID screening. The data are still limited but, using them to train and test Convolutional Neural Network (CNN) architectures such as ResNet [10], we help shed some light on the problem. In this work we do not mean to answer whether and how CXR can be used in the early diagnosis of COVID, but to provide a methodological guide and a critical reading of the statistical results that can be obtained using currently available datasets and learning mechanisms. Our main contribution is an extensive experimental evaluation of different combinations of existing datasets for pre-training and transfer learning of standard CNN models. Such analysis allows us to raise some warnings on how to build datasets, pre-process data and train deep models for COVID classification of X-ray images. We show that, given that datasets are still small and geographically local, subtle biases in the pre-trained models used for transfer learning can emerge, dramatically affecting the significance of the performance one achieves.

II. RELATED WORKS
It is evident that there is not yet a significant amount of work devoted to automatic detection of COVID from medical imaging. Nonetheless, one can refer to previous epidemics caused by novel strains of coronavirus such as severe acute respiratory syndrome (SARS), first recognized in Canada in March 2003 and characterised by a similar lung condition, i.e. interstitial pneumonia [11]. Most results leverage high-resolution CT scans. As an example, in [12] CNNs are investigated for classification of interstitial lung disease (ILD). Also [13], [14] show that deep learning can be used to detect and classify ILD tissue. The authors of [14] focus on designing a CNN tailored to match the ILD CT texture features, e.g. small filters and no pooling to guarantee spatial locality.
Fewer contributions focus on classification of chest X-ray images to help SARS diagnosis: in [15], lung segmentation is followed by feature extraction, and three classification algorithms (decision tree, shallow neural network, and classification and regression tree) are compared, the latter yielding the highest accuracy on the SARS detection task. However, on the pneumonia classification task, NN-based approaches show encouraging results. In [16] texture features for SARS identification in radiographic images are proposed and designed using signal processing tools.
In the last days a number of preprints targeting COVID classification with CNNs on radiographic images have begun to circulate thanks to open access archives. Many approaches have been taken to tackle the problem of classifying chest X-ray scans to discriminate COVID-positive cases. For example, Sethy et al. compare the classification performance of some of the most popular convolutional architectures [17]. In particular, they use a transfer learning-based approach: they take pre-trained deep networks and use these models to extract features from images. Then, they train an SVM on these "deep features" for the COVID classification task. A similar approach is also used by Apostolopoulos et al.: they pre-train a neural network on a similar task, and then use the trained convolutional filters to extract features, on top of which a classifier attempts to select COVID features [18]. Narin et al. make use of ResNet-based architectures and the recent Inception v3, and then use a 5-fold cross-validation strategy [19]. Finally, Wang et al. propose a new neural network architecture to be trained on the COVID classification task [20]. All of these approaches use a very small dataset, COVID-ChestXRay [21], consisting of approximately 100 COVID cases considering CXR only. Furthermore, in order to build COVID-negative cases, data are typically sampled from other datasets (mostly from ChestXRay). However, this introduces a potential issue: if any bias is present in the dataset (a label in the corners, a medical device, or other contingent factors like similar age, same sex etc.) the deep model could learn to recognize these dataset biases, instead of focusing on COVID-related features. These works present some potential issues to be investigated:
• Transfer learning: in the literature it is widely recognized that transfer learning-based approaches prove to be effective, also for medical imaging [22]. However, it is very important to pay attention to the particular task the feature extractor is trained on: if such a task is very specific, or contains biases, then the transfer learning approach should be carried out with care.
• Hidden biases in the dataset: most of the current works rely on very small datasets, due to the limited availability of public data on COVID-positive cases. These few data, moreover, contain little or even no metadata on age, gender, other pathologies present in these subjects, and other information necessary to spot this kind of bias. Besides these, there are other biases we can try to correct. For example, every CXR has its own image windowing parameters or other acquisition settings that a deep model could potentially learn to discriminate. For example, a model may cluster images according to the scan tool used for the exam; if some scan settings correspond to all COVID examples, these will generate a spurious correlation that the model can exploit to yield apparently optimal classification accuracy. Another example is given by textual labeling in images: if all the negative examples are sampled from the same dataset, the deep model could learn to recognize such a feature instead of focusing on the lung content.
• Very small test sets: as a further consequence of having very little data, test set sizes are extremely small and do not provide any statistical certainty on learning.

III. METHODOLOGY
In this section we describe the proposed deep learning approach, based on a quite standard pipeline, namely chest image pre-processing and lung segmentation followed by a classification model obtained with transfer learning. As we will see in this section, data pre-processing is fundamental to remove any bias present in the data. In particular, we will show that it is easy for a deep model to recognize these biases, which then drive the learning process. Given the small size of COVID datasets, a key role is played by the larger datasets used for pre-training. Therefore, we first discuss which datasets can be used for our goals.

A. Datasets
For the experiments we are going to show, three different datasets will be used. Each of these contains CXR images, but their purpose is different:

B. Pre-processing
For our simulations we propose a pre-processing strategy aimed at removing bias in the data. This step is very important in a setting in which we train a model to discriminate different classes belonging to different datasets: a neural network-based model might learn the distinction between the different dataset biases and from them "learn" the classification task. The proposed pre-processing chain is summarized in Fig. 1 and is based on the following steps:
factors, typically depending on subject contrast, receptor contrast or other factors like scatter radiations [26]. Hence, the raw acquisition has to be filtered through a Value Of Interest transformation. However, due to different calibrations, different dynamic ranges can be covered, and this is potentially a bias. Histogram equalization is a simple means to guarantee fairly uniform image dynamics in the data.
Being able to segment the lungs only, discarding all the rest of the CXR, potentially prunes away possible bias sources, like for example the presence of medical devices (typically correlated with sick patients), various text which might be embedded in the scan etc. In order to address this task, we train a U-Net [30] on the Montgomery County X-ray Set and the Shenzhen Hospital X-ray Set. The lung masks obtained are then blurred using a 3 pixel radius to avoid sharp edges. An example of the segmentation outcome is shown in Fig.
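The two pre-processing ideas described above, equalizing the intensity histogram and applying a softened lung mask, can be illustrated with a minimal sketch. It is written in plain Python on lists of rows so that it is self-contained; a real pipeline would more likely use an array library. The function names are hypothetical, and the box filter is an assumption: the text only states that a 3 pixel radius is used, not which blur kernel.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image (list of rows of ints in [0, levels))."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of the intensities.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in image]
    scale = (levels - 1) / (n - cdf_min)
    # Map each intensity through the normalized CDF (round half up).
    return [[int((cdf[p] - cdf_min) * scale + 0.5) for p in row] for row in image]


def box_blur(mask, radius=3):
    """Soften a binary mask with a box filter (assumed kernel; borders use available pixels)."""
    h, w = len(mask), len(mask[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [mask[j][i] for j in ys for i in xs]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def apply_mask(image, soft_mask):
    """Weight every pixel by the soft mask value, attenuating everything outside the lungs."""
    return [[p * m for p, m in zip(irow, mrow)]
            for irow, mrow in zip(image, soft_mask)]
```

In this sketch the binary mask would come from the segmentation network; `apply_mask(equalize_histogram(image), box_blur(mask))` then yields an image in which text overlays and devices outside the lung region are suppressed.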

C. Training
After the data have been pre-processed, a deep model is trained. To this end, the following choices have been made:
• Pre-training of the feature extractor, i.e. the convolutional layers of the CNN, is performed. In particular, the pre-training is performed on a related task, like pneumonia classification for CXRs. It has been shown that such an approach can be effective for medical imaging [12], in particular when the amount of available data is limited as in our classification task. Clearly, pre-training the feature extractor on a larger dataset containing related features may allow us to exploit deeper models, potentially extracting richer image features.
• The feature extractor is fine-tuned on COVID data. Freezing it would certainly prevent over-fitting the small COVID data; however, we have no guarantee that COVID-related features can be extracted at the output of a feature extractor trained on a similar task. Of course, its initialization on a similar task helps in the training process, but in any case fine-tuning is still necessary [31].
• Proper sizing of the encoder to be used is an issue to be addressed. Although many recent works use deeper architectures to extract features for the COVID classification task, larger models are prone to over-fit the data. Considering the minimal amount of data available, the choice of the appropriate deep network complexity significantly affects the performance.
• Balancing the training data is yet another extremely important issue to be considered. Unbalanced data favor biases in the learning process. The balancing issue can be addressed in a number of ways: the most common and simple one is adding data to or removing data from the training-set. Removing data from a tiny dataset is not a viable approach; considering that the COVID datasets are built mainly of positive cases, one solution is to augment them with negative cases from publicly available datasets. However, this is a very delicate operation and needs to be done very carefully: if all the negative cases are made of non-pathological patients, the deep model will not necessarily learn COVID features. It may simply discriminate between healthy and unhealthy lungs. Providing a good variety of conditions in the negative data is not an easy task. The choice of the images may turn out to be critical and, just like in the pre-training phase, one can include unwanted biases: again the model can end up classifying new images (that are positive to COVID) by exploiting discriminative biases present in different datasets.
• Testing with data different from those used at training time is also fundamental. Excluding from the test-set exams taken from patients already present in the training-set is important to correctly evaluate the performance and to rule out that the deep model has learned a "patient's lung shape" feature.
• Of course many other issues have to be taken into account at training time, like the use of a validation-set to tune the hyper-parameters, using a good regularization policy etc., but these very general issues have been exhaustively discussed in many other works [32]-[34].
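Two of the points in the list above, keeping patients disjoint between training and test sets and balancing the classes by subsampling the more populated one, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the reported experiments: the sample layout `(patient_id, image_id, label)`, the function names and the default `test_fraction` are all hypothetical.

```python
import random


def split_by_patient(samples, test_fraction=0.3, seed=0):
    """Split (patient_id, image_id, label) tuples so that no patient is in both sets."""
    patients = sorted({pid for pid, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(patients)
    # Assumed split ratio; at least one patient always goes to the test set.
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [s for s in samples if s[0] not in test_patients]
    test = [s for s in samples if s[0] in test_patients]
    return train, test


def subsample_to_balance(samples, seed=0):
    """Randomly subsample the more populated class so that all labels are equally represented."""
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s[2], []).append(s)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    return balanced
```

Splitting on patient identifiers rather than on images is what prevents several exams of the same subject from leaking across the two sets.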
IV. DISCUSSION
In this section we present and comment on the experimental results obtained on a combination of different datasets (introduced in Sec. III-A). All the simulations have been run on a Tesla T4 GPU using PyTorch 1.4. The performance obtained on a comprehensive number of experiments is presented in Tab. II and Tab. III. In these tables, in particular, three factors are evaluated:
• Pre-training of the feature extractor: the feature extractor can be pre-trained on large generic CXR datasets, or not pre-trained at all.
where not possible we subsampled the more populated class. These percentages were not used for the COVID-ChestXRay dataset: in this case only 15 samples are used for testing in order to compare with other works [17]-[19] that use the same partitioning.
• Testing on different datasets: in order to observe the possible presence of hidden biases, testing on different, qualitatively similar datasets is a necessary step.
For all of these trained models, a number of metrics [35] are evaluated:
• Accuracy.
• AUC (area under the ROC curve), which provides an aggregate measure of performance across all possible classification thresholds. For every other metric, the classification threshold is set to 0.5.
• Sensitivity.
• BA (balanced accuracy), since the test-set might be unbalanced.
• DOR (diagnostic odds ratio).
Results are shown in Table II, Table III and Table IV.
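For reference, the metrics listed above can be computed from labels and predicted scores as in the following sketch. It uses the standard textbook definitions; the function names are hypothetical, both classes are assumed to be present in the test set, and returning infinity for the DOR when there are no false positives or no false negatives is a simplifying choice of this sketch.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Thresholded metrics for binary labels (1 = COVID+, 0 = COVID-)."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ba": (sensitivity + specificity) / 2,
        # DOR = (TP/FN) / (FP/TN); undefined when FP or FN is zero.
        "dor": (tp * tn) / (fp * fn) if fp and fn else float("inf"),
    }


def auc(y_true, y_score):
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike the other metrics, `auc` does not depend on the 0.5 threshold, which is why it is reported alongside them.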
A. To pre-train or not to pre-train?
One very important issue is whether to pre-train the feature extractor or not. Given the large availability of public data for pneumonia classification (for example, here we used ChestXRay and RSNA), it could be a good move to pre-train the encoder, and this is indeed what we observe in Table II. For example, if we focus on the results obtained training on the CORDA dataset, without a pre-trained encoder, BA and DOR are lower than with pre-training on ChestXRay or RSNA. Although the sensitivity remains very similar, pre-training the encoder helps improve the specificity: on the test-set extracted from CORDA, using an encoder pre-trained on RSNA, the specificity is 0.80, while it is only 0.58 with no pre-trained feature extractor. Similar improvements in the specificity can be observed also on test-sets extracted from all the other datasets, except for ChestXRay. In general, a similar behavior can be observed when comparing results for differently pre-trained encoders trained on the same dataset. Pre-training is important; however, we cannot just "freeze" the encoder on the pre-trained values. Since the encoder is pre-trained on a similar, but different task, there is no guarantee that the output features are optimal for the given classification task, and a fine-tuning step is typically required [36].

B. Pre-training on different datasets
Focusing on pre-trained encoders, we show results for encoders pre-trained on two different datasets: ChestXRay and RSNA. While RSNA is a more generic pneumonia-segmentation dataset, ChestXRay also contains information about the type of pneumonia (bacterial or viral); so, at first glance it looks like a better fit for the pre-training. However, if we look at training on the CORDA dataset, we see that for the same sensitivity value, we typically get higher specificity scores for RSNA pre-training. This is not what we observe when we compare results on the publicly available COVID-ChestXRay: in this case, sensitivity and specificity are higher when we pre-train on ChestXRay. Looking at the same pre-trained encoder, say ChestXRay, we can compare results training on CORDA and on COVID-ChestXRay, which are the two COVID datasets: CORDA shows a lower sensitivity, but in general a higher specificity, except for the ChestXRay dataset. With very little data at training time, pre-training introduces some priors in the choice of the features to be used, and depending on the final classification task, performance changes, yielding very good metrics in some cases. Pre-training on more general datasets, like RSNA, in general looks like a slightly better choice than using a more specific dataset like ChestXRay.

C. Augmenting COVID-data with different datasets
For each simulation, performance on different test-sets is evaluated. This gives us hints on possible biases introduced by the different datasets used at training time. A general trend can be observed for many COVID-augmented training-sets: the BA and DOR scores measured on the test-set built from the same dataset used at training time are typically very high. Let us focus on the ChestXRay pre-trained encoder. When we train on CORDA&ChestXRay, the BA score measured on the test-set from the same dataset is 0.9 and the DOR is 122.67. However, its generalization capability for a different composition of the test-set, say CORDA&RSNA, is much lower: the BA is 0.56 and the DOR only 2.26. The same scenario can be observed when we train on CORDA&RSNA: on its test-set the BA is 0.90 and the DOR 122.64, while on the test-set of CORDA&ChestXRay the BA is 0.59 and the DOR 2.47. The key to understanding these results lies again in the specificity score: this score is extremely high for the test-set of the same dataset the training is performed on (for example, 0.95 for CORDA&RSNA and 0.94 for CORDA&ChestXRay) while for the others it is extremely low. Such behavior is due to the presence of some common features in all the data belonging to the same augmenting dataset. This can be observed, for example, in Fig. 3a, where the features extracted by an encoder pre-trained on ChestXRay and trained on CORDA&ChestXRay are clustered using t-SNE [37] (blue and orange dots represent ChestXRay and CORDA data samples respectively, regardless of the COVID label). It can be noted that CORDA samples, regardless of the COVID+ or COVID- label, are clearly separable from ChestXRay data. Of course, all ChestXRay images have a COVID- label, so one could argue that the COVID feature has been captured. Unfortunately we have a counterexample: in Fig. 3b we compare CORDA vs. RSNA samples, using the same ChestXRay pre-trained encoder, and now RSNA and CORDA samples no longer form clear clusters. Hence, the deep model specializes not in recognizing COVID features, but in learning the features common to the dataset used at training time. We would like to remark that for all the data used at training or at test time, all the pre-processing presented in Sec. III-B has been used. We ran the same experiments without that pre-processing and performance on datasets other than the one used at training time gets even worse. For example,

D. How deep should we go?
After reviewing results on ResNet-18, we move to similar experiments run on the deeper ResNet-50, shown in Tab. III. The hope is that a deeper network could extract more representative features for the classification task. Given the discussion in Sec. IV-A, we show only the cases with pre-training of the feature extractor. Using this deeper architecture, we can observe that all the observations made for ResNet-18 still hold. In some cases performance degrades slightly: for example, the DOR score on CORDA&ChestXRay was 122.67 for ResNet-18, while for ResNet-50 it drops to 73.35. This is a sign of over-fitting: given the very small quantity of data currently available, using a small convolutional neural network is sufficient and safer. Taking the opposite approach, we tried a smaller artificial neural network, made of 8 convolutional layers and a final fully-connected layer, which takes inspiration from the ALL-CNN-C architecture [38]. We call this architecture Conv8. The results on this smaller architecture are similar to those observed in Table II. For example, training the model on the CORDA dataset, on Conv8 we have a BA of 0.61 and a DOR of 2.38, while for ResNet-18 with the encoder pre-trained on RSNA we have a BA of 0.67 and a DOR of 4.78. We can conclude that using a smaller architecture than ResNet-18 does not give relevant training advantages, while by using larger architectures we might over-fit the data.
All the observations on train and test data made above are also valid for the recently published results on COVID classification from CXR [17]-[20]. One very promising approach is COVID-Net [20], whose authors also share the source code and the trained model. In Tab. IV we compare the classification metrics obtained with COVID-Net and our ResNet-18 model: both models have been trained using COVID-ChestXRay, and tested on both CORDA and COVID-ChestXRay. In line with the discussion above, we can note that both COVID-Net and ResNet-18 yield surprising results when the same dataset is used for training and testing: the performance of COVID-Net on the COVID-ChestXRay test-set (the same dataset used at training time) is very high (BA of 0.85 and DOR of 36.0) while it drops significantly when tested on CORDA, where the BA is only 0.55 and the DOR is 6.68. This drop is explained by looking at sensitivity and specificity: it is evident that the model classifies almost all the data as COVID-. A similar behavior can be observed also in the ResNet-18 model: the observed performance looks incredible (since the BA on the test-set is 1.0), and in fact similar numbers are also claimed in the other works on ResNet-like architectures [17]-[19]. However, testing on CORDA reveals that the deep model is likely to have learned some hidden biases in COVID-ChestXRay and tends to mis-classify COVID- samples as COVID+ (given that the specificity here is 0.20).
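One simple way to probe whether a feature extractor encodes the dataset of origin, the kind of hidden bias discussed in this section, is to check how well the source dataset can be predicted from the extracted features. The sketch below is not the t-SNE analysis used in this work; it is a simpler quantitative proxy, a leave-one-out nearest-centroid classifier, and all names in it are hypothetical. An accuracy close to 1.0 would suggest that samples are trivially separable by dataset, regardless of their COVID label.

```python
def centroid(vectors):
    """Component-wise mean of a list of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def source_probe_accuracy(features, sources):
    """Leave-one-out nearest-centroid accuracy for predicting the dataset of origin."""
    correct = 0
    for i, (feat, src) in enumerate(zip(features, sources)):
        # Build per-dataset groups from every sample except the held-out one.
        groups = {}
        for j, (other, s) in enumerate(zip(features, sources)):
            if j != i:
                groups.setdefault(s, []).append(other)
        pred = min(groups, key=lambda s: sq_dist(feat, centroid(groups[s])))
        correct += pred == src
    return correct / len(features)
```

Here `features` would be the encoder outputs for a set of images and `sources` the name of the dataset each image was drawn from; the probe ignores the COVID labels entirely.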

V. CONCLUSIONS
One of the very recent challenges for both the clinical and the AI community is to use deep learning to discriminate COVID from CXR. Some recent works have highlighted the possibility of successfully tackling this problem, despite the currently small quantity of publicly available data. In this work we have highlighted possible obstacles to successfully training a deep model, ranging from the proper choice of the architecture to be trained to handling removable biases in medical datasets. Extensive experiments show that extracting a "COVID" feature from CXR is not an easy task. Such a problem should be addressed very carefully: it is very easy to misinterpret very good results on test data that still show poor generalization on new data in the same domain. We could perform such a test thanks to the possibility of using CORDA, a larger dataset comprising COVID cases. Of course, the quantity of available data is still limited, but it allowed us to find some promising seminal classification results. The ongoing collection and sharing of large amounts of CXR data is the only way to further investigate whether promising CNN results can aid in the fight against the COVID pandemic.

Fig. 2: Original image (a) and extracted lung segmented image (b). Many possible bias sources, like all the writings and medical equipment, are naturally removed.

TABLE I: Datasets composition. The datasets used at training and test time are in the rows, and the total data are in the last two columns.

TABLE III: Results obtained training a ResNet-50 model.