Accelerating COVID-19 Differential Diagnosis with Explainable Ultrasound Image Analysis

Controlling the COVID-19 pandemic largely hinges upon the existence of fast, safe, and highly available diagnostic tools. Ultrasound, in contrast to CT or X-ray, has many practical advantages and can serve as a globally applicable first-line examination technique. We provide the largest publicly available lung ultrasound (US) dataset for COVID-19, consisting of 106 videos from three classes (COVID-19, bacterial pneumonia, and healthy controls), curated and approved by medical experts. On this dataset, we perform an in-depth study of the value of deep learning methods for the differential diagnosis of COVID-19. We propose a frame-based convolutional neural network that correctly classifies COVID-19 US videos with a sensitivity of 0.98 ± 0.04 and a specificity of 0.91 ± 0.08 (frame-based sensitivity 0.93 ± 0.05, specificity 0.87 ± 0.07). We further employ class activation maps for the spatio-temporal localization of pulmonary biomarkers, which we subsequently validate for human-in-the-loop scenarios in a blindfolded study with medical experts. Aiming for scalability and robustness, we perform ablation studies comparing mobile-friendly, frame-based, and video-based architectures and show the reliability of the best model via aleatoric and epistemic uncertainty estimates. We hope to pave the way for a community effort toward an accessible, efficient, and interpretable screening method, and we have started to work on a clinical validation of the proposed method. Data and code are publicly available.


Introduction
To date, SARS-CoV-2 has infected several million people and COVID-19 has killed hundreds of thousands around the globe. Its long incubation time calls for fast, accurate, and reliable techniques for early disease diagnosis to successfully fight the spread [29]. The standard genetic test (RT-PCR) suffers from a processing time of up to 2 days [33], several publications reported sensitivity as low as 70% [2,25], and a recent meta-analysis estimated the false negative rate to be at least 20% over the course of the infection [26]. Medical imaging has great potential to complement the diagnostic process as a fast assessment tool that guides further PCR testing, especially in triage situations [16]. Currently, CT scans are the gold standard for pneumonia [8] and are considered relatively reliable for COVID-19 diagnosis [2,5,18], although a significant fraction of patients exhibit normal CT scans [52]. However, performing CT is expensive and highly irradiating, poses risks of infection for patients and staff, requires extensive sterilization [34], and is largely reserved for developed countries; there are only ∼30k CT scanners globally [13]. Chest X-ray (CXR) scans are still the first-line examination, despite some reports of low specificity and sensitivity for COVID-19 (for example, Weinstock et al. [51] found 89% normal CXR in 493 COVID-19 patients). Ultrasound (US), by contrast, is a cheap, safe, non-invasive, and repeatable technique that can be performed with portable devices at the patient's bedside and is ubiquitously available around the globe. Over the last two decades, ultrasound became an established tool to diagnose pulmonary diseases [14,30,36], has been forecast to replace radiographic techniques [9], was demonstrated to be superior to CXR for pulmonary diseases [15,38], and started to replace X-ray as first-line examination [1,10].
In the COVID-19 outbreak, a growing body of evidence for disease-specific patterns in US (e.g. B-lines and subpleural consolidations) has led to advocacy for an amplified role of US from the research community [11,27,44,46], and radiologists reported strong agreement between US and CT findings for COVID-19 infections [19,37]. Moreover, in triage situations or in third-world countries, where CT, PCR, and CXR are not available, US was demonstrated to be a valuable patient stratification technique for pneumonia [3,17]. This gives US, in contrast to other imaging techniques, the potential to become a globally applicable first-line examination method [45]. However, the relevant patterns are hard to discern for humans [35,47], calling into play medical image analysis based on machine learning techniques as a decision support tool for physicians. Here, we provide the first study of automatic lung ultrasound analysis for differential diagnosis of bacterial and viral pneumonia, aiming to develop a medical decision support tool.
Related work. Literature on exploiting medical image analysis and computer vision techniques to classify or segment CT or CXR data of COVID-19 patients has recently exploded (for reviews, see Shi et al. [42], Ulhaq et al. [48]; for a list of public data sources see Kalkreuth and Kaufmann [24]). For example, in an early study, Butt et al. [12] reported a sensitivity of 98% (specificity 92%) in a binary classification on CT scans from 110 COVID-19 patients, while Mei et al. [33] very recently achieved equal sensitivity (but lower specificity) compared to senior radiologists in detecting COVID-19 from CT and clinical information of 279 patients. US, instead, has been heavily neglected by the ML community [6]; only the Italian COVID-19 Lung Ultrasound (ICLUS) project has proposed a deep learning approach for a severity assessment of COVID-19 from ultrasound data [39]. The work convincingly predicts disease severity and segments COVID-19-specific patterns, building on their previous work on localizing B-lines [49]. The paper claims to release a dataset of annotated COVID-19 cases, but to date, no annotations are available. While this effort is highly relevant for disease monitoring, it is not directly applicable for first-line diagnosis, where the main problem lies in distinguishing COVID-19 from other pneumonia. We aim to close this gap with our approach to classify COVID-19, healthy, and pneumonia point-of-care ultrasound (POCUS) images.
Our contributions. Figure 1 depicts a graphical overview of our contributions. We provide the largest publicly available dataset of lung US recordings, consisting of 106 videos. This dataset is heterogeneous and mostly from public sources, but was curated manually and approved by a medical doctor. We further take a first step towards a tool for differential diagnosis of pulmonary diseases, here especially focused on bacterial and viral pneumonia such as COVID-19. We presented an earlier version of our dataset alongside some preliminary results in [7]. Going beyond that work, we here demonstrate that competitive performance can be achieved from raw US recordings, thereby challenging the current focus on irradiating imaging techniques. Moreover, we employ explainability techniques such as class activation maps and uncertainty estimates and present a roadmap towards an automatic detection system that can segment and highlight relevant spatio-temporal patterns. Such a system could not only lead to superior diagnostic performance, as was partially shown for CT [33], but can also reduce the time doctors require to make a diagnosis [41]. Our approach addresses an evident need: because physicians must be trained thoroughly to reliably differentiate COVID-19 from pneumonia [35], powerful deep learning can help develop a system that complements the work of physicians in a timely manner.

A lung ultrasound dataset for COVID-19 detection
We provide the to-date largest pre-processed and publicly available lung POCUS dataset¹, comprising samples of COVID-19 patients, pneumonia-infected lungs, and healthy patients. As shown in Table 1, we collected and gathered 139 recordings (106 videos + 33 images) recorded with either convex or linear probes, where the latter is a higher-frequency probe yielding more superficial images.

Differential diagnosis of COVID-19 with lung ultrasound

Experimental setup

Data processing. All experiments are conducted on data recorded with convex ultrasound probes, the standard probe for lung assessment that allows one to see deeply into the lung [31]. We manually processed all convex ultrasound recordings and split them into images at a frame rate of 3 Hz (with a maximum of 30 frames per video), leading to a database of 693 COVID-19, 377 bacterial pneumonia, and 295 healthy control images. For examples, see Figure 1A. All images were cropped to a quadratic window, excluding measure bars, text, and artifacts on the borders, before being resized to 224 × 224 pixels. Apart from the independent test data, all reported results were obtained in a 5-fold stratified cross-validation. It was ensured that the frames of a single video are present within a single fold only, and that the number of samples per class is similar in all folds. All models were trained to classify images as COVID-19, pneumonia, healthy, or uninformative. The latter class consists of ImageNet pictures as well as neck ultrasound data; we added these pictures for the purpose of detecting out-of-distribution data (thus making the model more robust), which is particularly relevant for public web-based inference services. In this paper, we present all results omitting the uninformative class, as it is not relevant for the analysis of differential diagnosis performance and would bias the results (please refer to Appendix A.4.1 for results including uninformative data).
Furthermore, we use data augmentation techniques (horizontal and vertical flips, rotations of up to 10°, and translations of up to 10%) to diversify the dataset and prevent overfitting.
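The video-level cross-validation constraint described above (frames of one video never cross fold boundaries, with per-class frame counts kept roughly balanced) can be sketched as follows. This is our own illustrative sketch, not the released code; the function name and input format are assumptions.

```python
import random
from collections import defaultdict

def video_level_folds(frame_labels, n_folds=5, seed=0):
    """Assign whole videos to folds so that no video's frames are split
    across folds, while keeping per-class frame counts roughly balanced.

    frame_labels: dict mapping video_id -> (class_label, n_frames)
    Returns: dict mapping video_id -> fold index.
    """
    rng = random.Random(seed)
    # Group videos by class so each fold receives a similar class mix.
    by_class = defaultdict(list)
    for vid, (label, n_frames) in frame_labels.items():
        by_class[label].append((vid, n_frames))

    fold_of = {}
    frames_per_fold = [0] * n_folds
    for label, videos in by_class.items():
        rng.shuffle(videos)
        # Greedily place each video (largest first) into the lightest fold.
        for vid, n_frames in sorted(videos, key=lambda v: -v[1]):
            fold = min(range(n_folds), key=lambda f: frames_per_fold[f])
            fold_of[vid] = fold
            frames_per_fold[fold] += n_frames
    return fold_of
```

Because whole videos are assigned, highly similar consecutive frames can never leak between training and validation folds.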
Frame-based models. Our backbone neural architecture is a VGG-16 [43] that is compared to NASNetMobile, a lightweight alternative [55] that uses less than 1/3 of the parameters of VGG and was optimized for applications on portable devices. Both models are pre-trained on ImageNet and fine-tuned on the frames sampled from the videos. Specifically, we use two variants of VGG-16 that we name VGG and VGG-CAM. VGG-CAM has a single dense layer following the convolutions, thus enabling the use of plain class activation maps (CAMs) [53], whereas VGG has an additional dense layer with ReLU activation and batch normalization.
Considering the recent work of Roy et al. [39] on lung US segmentation and severity prediction for COVID-19, we investigated whether a segmentation-targeted network can also add value to the prediction in differential diagnosis. We implemented two approaches building upon the pre-trained model of Roy et al. [39], an ensemble of three separate U-Net-based models (U-Net, U-Net++, and DeepLabv3+, with a total of ∼19.5M parameters). First, VGG-Segment is identical to VGG; however, instead of training on the raw US data, we train on the segmented images from the ensemble (see the example in Appendix A.2). Although it might seem unconventional, we hypothesized that the colouring carries additional information that might simplify classification. Second, in Segment-Enc, the bottleneck layer of each of the three models is used as a feature encoding of the images, resulting in 560 filter maps that are fed through two dense layers of size 512 and 256, respectively. The encoding weights are fixed during training. Both settings are compared to the other models, which directly utilize the raw images. For more details on the architectures and the training procedure, please refer to Appendix A.1.
Video-based model. In comparison to a naïve, frame-based video classifier (obtained by averaging the scores of all frames), we also investigate Models Genesis, a generic model for 3D medical image analysis pretrained on lung CT scans [54]. For Models Genesis, the videos are split into chunks of 5 frames each, sampled at a frame rate of 5 Hz. 5-fold cross-validation is performed using the same split as for the frame-based classifiers. Individual images were excluded, leaving 86 videos (from which 10 were excluded due to too many frames with artifacts such as moving pointers), which were split into 292 video chunks.

Results. Table 2 shows a detailed comparison of the three best models in terms of recall, precision, specificity, and F1-score, as well as MCC. Overall, both VGG and VGG-CAM achieve promising performance, with an accuracy of 90 ± 2% and 90 ± 5%, respectively, in a 5-fold CV on 1,365 frames. Concerning per-class prediction accuracies, it is evident that bacterial pneumonia infections are distinguished best, with recall, precision, and specificity above 0.93 for VGG and VGG-CAM, indicating the models' ability to recognize strong irregularities in lung images. Although VGG slightly outperforms VGG-CAM, we explored the latter in more detail, due to its higher sensitivity for COVID-19 and its better performance when taking into account the class activation maps. Figure 2a visualizes the results of the VGG-CAM model for each binary detection task as a ROC curve, showing ROC-AUC scores of 0.94 and above for COVID-19 and the other two classes, while depicting the point where the accuracy is maximal for each class. The false positive rate at the maximal-accuracy point is larger for COVID-19 than for pneumonia and healthy patients. In a clinical setting, where false positives are less problematic than false negatives, this property is highly desirable.
Since the data is imbalanced, we also plot the precision-recall curves in Figure 2b, which confirm that pneumonia is the class predicted most easily. In addition, the confusion matrices in Figure 2c and Figure 2d further detail the predictions of VGG-CAM; we observe that the high sensitivity for COVID-19 (0.93, 642 out of 693 frames) comes at the cost of 22% false positives from the healthy class. For further results, including the ROC and precision-recall curves of all three models, see Appendix A.4.

Frame-based experiments
Ablation study with segmentation models. Lung US recordings are noisy and operator-dependent, posing difficulties for the classification of raw data. Hence, we compare VGG and VGG-CAM to VGG-Segment, where all frames are segmented (i.e. classified on a pixel level into pathological patterns) with the model from Roy et al. [39]; see Appendix A.2 for an example input. The relevant rows in Table 2 exhibit mixed results: while training on segmented images slightly improves most relevant performance metrics (higher accuracy, COVID-19 sensitivity, and MCC scores), balanced accuracy is inferior compared to VGG. Since this small increase in predictive performance comes at the cost of a large increase in model size (due to the ensemble of three independent models; selecting only one of the models resulted in inferior performance), we considered Segment-Enc, i.e. a dense model classifying the 560-dimensional encoding produced by the pre-trained segmentation models. Segment-Enc achieved comparable performance on most metrics, apart from lower scores for pneumonia detection. Since the difference in performance is only marginal, and the architectures of VGG-Segment and Segment-Enc prohibit the computation of class activation maps, we prefer to focus on the analysis of VGG-CAM in the following.
Ablation study on other architectures. Initially, we had tested further models proposed for medical image analysis, such as COVID-Net (previously used for the classification of X-ray images [50]) and an architecture following [28] based on a ResNet [22], but the experiments on our data yielded significantly worse results. Lastly, we tested several smaller networks such as MobileNet [23] as an additional ablation study, with NASNetMobile [55] performing best. As most ultrasound devices are portable and real-time inference on the devices is technically feasible, resource-efficient networks are highly relevant and could supersede web-based inference. Due to low precision and recall on healthy data, our fine-tuned NASNetMobile is less performant than VGG-CAM, but it also requires less than a third of the parameters, thus providing a first step towards real-time on-device inference.

Table 3: Video classification results. The frame-based model VGG-CAM outperforms the 3D CNN Models Genesis, showing high accuracy (94%), recall, and precision for COVID-19 and pneumonia detection.

Video-based experiments
To investigate the need for a model with the ability to detect spatio-temporal patterns in lung US, we explored Models Genesis, a pretrained 3D CNN designed for 3D medical image analysis [54]. Table 3 contrasts the frame-based performance of the VGG-CAM model with Models Genesis. The video classifier is outperformed by VGG-CAM, with a video accuracy of 87% compared to 94%. Note that all videos of pneumonia infections are classified correctly, while Models Genesis in particular struggles with the prediction of healthy patients. Considering that only 292 video chunks were available for training Models Genesis, while 1,365 images (further extended through data augmentation) were used to train VGG-CAM, it is likely that video-based classification will improve with increasing data availability.
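The chunking used to feed Models Genesis (subsample to 5 Hz, then split into non-overlapping chunks of 5 frames) can be sketched as follows. The function name and signature are illustrative assumptions, not the authors' implementation.

```python
def make_chunks(frames, native_fps, target_fps=5, chunk_len=5):
    """Subsample a frame sequence from native_fps to roughly target_fps,
    then split it into non-overlapping chunks of chunk_len frames;
    a trailing remainder shorter than chunk_len is dropped."""
    step = max(1, round(native_fps / target_fps))
    sampled = frames[::step]
    return [sampled[i:i + chunk_len]
            for i in range(0, len(sampled) - chunk_len + 1, chunk_len)]
```

For a 100-frame clip recorded at 30 fps, this keeps every 6th frame and yields three 5-frame chunks, discarding the short remainder.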

Evaluation on independent test data
Very recently, the ICLUS initiative released 60 COVID-19 lung US recordings from Italian patients² [39]. The data is not annotated but was initially assumed to contain only COVID-19 videos, based on its general description. We evaluated the performance of the VGG-CAM model on all 40 convex-probe recordings from ICLUS, alongside 24 recordings from healthy controls (6 viewpoints each) and 2 videos from public sources (healthy), jointly comprising an independent test dataset of 66 videos.
We predicted all frames with an average of the five VGG-CAM models trained in cross-validation. The model achieves a frame-prediction accuracy of 83.3%, divided into 89.5% on healthy-patient data and 74% on COVID-19 videos. Furthermore, averaging the class probabilities over all frames, VGG-CAM achieves video classification accuracies of 92.2% and 77.5%, respectively. Notably, all four healthy patients are classified correctly when summarized across viewpoints. Combining both datasets, the sensitivity of detecting COVID-19 corresponds to the accuracy (0.775), with a precision score of 0.94 (no video was classified as bacterial pneumonia).
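The video-level prediction used above reduces to a simple aggregation: average the per-frame class probabilities over the five cross-validation models and over all frames, then take the winning class. A minimal numpy sketch with hypothetical names:

```python
import numpy as np

def video_prediction(frame_probs):
    """Average class probabilities of shape (n_models, n_frames, n_classes)
    over models and frames; return the winning class index and the mean
    probability vector for the whole video."""
    mean_probs = np.asarray(frame_probs).mean(axis=(0, 1))
    return int(mean_probs.argmax()), mean_probs
```

Averaging probabilities before the argmax lets many moderately confident frames outvote a few confidently wrong ones, which is why the video accuracy can exceed the frame accuracy.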
Evaluation by domain experts. We further investigated the comparatively low sensitivity on the COVID-19 data (ICLUS) with the help of two medical experts. When asked for their unbiased diagnosis of the incorrectly predicted videos, they independently reported for 6 out of 9 videos that no disease-specific patterns could be observed ("A-lines, normal lung"). While these findings support the performance of our model, the true label of the data remains unclear. In addition, the dataset may contain further healthy-patient data which was incorrectly predicted as COVID-19. At this point, we can safely conclude that test-data performance is highly promising, in particular considering the high accuracy for healthy patients, but requires further validation with independent and labeled data.

Class activation maps
Class activation maps (CAMs) are a popular technique for model explainability that exploits global average pooling to compute class-specific heatmaps, indicating the discriminative regions of the image that caused the particular class activity of interest [53]. For healthcare applications, CAMs, or their generalization Grad-CAMs [40], can provide valuable decision support by unravelling whether a model's prediction was based on visible pathological patterns. Moreover, CAMs can guide doctors and point to informative patterns, which is especially relevant in time-sensitive (triage) or knowledge-sensitive (third-world countries) situations. Figure 3 shows representative CAMs in the three rightmost panels. They highlight the most frequent US patterns for the three classes: COVID-19 (vertical B-lines), bacterial pneumonia (consolidations), and healthy (horizontal A-lines). For a more quantitative estimate, we computed the points of maximal activation of the CAMs for each class (abbreviated as C, P, and H) and all samples of the dataset (see Figure 3, left). While, in general, the heatmaps are fairly distributed across the probe, pneumonia-related features were found rather in the center and bottom part, especially compared to COVID-19 and healthy patterns³. Please refer to Appendix A.5 for a density plot. To assess to what extent the differences between the individual distributions are significant, we employed the maximum mean discrepancy (MMD), a metric between statistical distributions [21] that enables the comparison of distributions via kernels, i.e. generic similarity functions. Given two coordinates x, y ∈ R² and a smoothing parameter σ ∈ R, we use a Gaussian kernel k(x, y) := exp(−‖x − y‖² / σ²) to assess the similarity between x and y. Following Gretton et al. [21], we set σ to the median distance in the aggregated samples (i.e. all samples, without considering labels). We then calculate MMD values for the distances between the three classes, i.e.
MMD(C, P) ≈ 0.0051, MMD(C, H) ≈ 0.0061, and MMD(P, H) ≈ 0.0065. Repeating this calculation for 5,000 bootstrap samples per class (see Figure 9 for the resulting histograms), we find that the observed inter-class MMD values achieve significance levels well below α = 0.05.
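The MMD computation with the median-heuristic bandwidth described above can be sketched as follows. This is a minimal sketch (a biased empirical estimate of the squared MMD); the function names are ours, not from the paper's code.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # Pairwise squared Euclidean distances between rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def mmd(X, Y, sigma):
    """Biased empirical estimate of the squared MMD between samples X, Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

def median_heuristic(Z):
    """Median pairwise distance over the pooled samples (Gretton et al.)."""
    d = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])
```

A permutation or bootstrap test, as used for Figure 9, then compares the observed MMD against values computed after shuffling the class labels.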

Expert validation of CAMs for human-in-the-loop settings. A potential application of our framework is a human-in-the-loop (HITL) setting with CAMs as a core component of the decision support tool that highlights pulmonary biomarkers and guides the decision makers. Since the performance of qualitative methods like CAMs can only be validated with the help of doctors, we conducted a blindfolded study with two medical experts experienced in the diagnostic process with ultrasound recordings. The experts were shown 50 videos (14 COVID-19, 21 pneumonia, 14 regular) comprising all non-proprietary video data which was correctly classified by the model. The class activation map for the respective class was computed twice: first as an average of all five trained models, and second with only the model that did not see any frame of the video during training (referred to as train- and test-CAMs in the following). Both experts were asked to compare the two activation maps for all 50 videos and to score them on a scale from −3 ("the heatmap is only distracting") to 3 ("the heatmap is very helpful for diagnosis").
First, the CAMs were overall perceived as useful: the train- and test-CAMs were assigned average scores of 0.45 and 0.81, respectively. Second, disagreeing in only 8% of the cases, both experts independently preferred the test-CAM in 56% of the cases. Hence, the test-CAMs are not inferior to the train-CAMs, although the difference is not significant in a Wilcoxon signed-rank test. However, train- and test-CAMs both scored best for videos of bacterial pneumonia, while performing worse for videos of healthy and COVID-19 patients. Specifically, the test-CAMs received an average score of 0.81, divided into −0.25 for COVID-19, 2.05 for pneumonia, and 0 for healthy patients. Third, the experts were asked to name the pathological patterns visible in general, as well as the patterns that were highlighted by the heatmap. Figure 4 shows the average ratio of patterns that were correctly highlighted by the CAM model, where the patterns listed by the more senior expert are taken as the ground truth for each video. Interestingly, the high performance of our model in classifying videos of bacterial pneumonia is probably explained by the model's ability to detect consolidated areas, of which 17 out of 18 are correctly highlighted. Moreover, A-lines are highlighted in ∼60% of the normal lung recordings. Problematically, in 13 videos mostly fat, muscles, or skin is highlighted, which has to be studied and improved in future work.

Confidence estimates
The ability to quantify states of high uncertainty is of crucial importance for medical image analysis and computer vision applications in healthcare. We assessed this via independent measures of epistemic (model) uncertainty (by drawing Monte Carlo samples from the approximate predictive posterior [20]) and aleatoric (data) uncertainty (by means of test-time data augmentation [4]). The sample standard deviation of 10 forward passes is interpreted as an inverse, empirical confidence score in [0, 1] (for details, see the appendix). The epistemic confidence estimate was found to be highly correlated with the correctness of the predictions (ρ = 0.41, p < 4e−124; mean confidence of 0.75 and 0.26 for correct and wrong predictions), while the aleatoric confidence was correlated to a lesser extent (ρ = 0.29, p < 6e−35; mean confidence of 0.88 and 0.73, respectively). Across the entire dataset, both scores are highly correlated (ρ = 0.52), suggesting that they can be exploited jointly to detect and remove low-confidence predictions in a possible application.

Discussion
Ultrasound, an established diagnostic tool that is both safe and highly available, constitutes a method with potentially huge impact that has nevertheless been neglected by the machine learning community. This work presents methods and analyses that pave the way towards computer-vision-assisted differential diagnosis of COVID-19 from US, providing an extensive analysis of (interpretable) methods that are relevant not only in the context of COVID-19, but in general for the diagnosis of viral and bacterial pneumonia.
We provide strong evidence that automatic detection of COVID-19 is a promising future endeavour and competitive with CT- and CXR-based models, with a sensitivity of 98% and a specificity of 91% for COVID-19, achieved on our dataset of 106 lung US videos. In comparison, sensitivities up to 98% and specificities up to 92% were reported for CT [12,33]. We verified our results on independent test data, studied model uncertainty, and found a significant ability of our model to recognize low-confidence situations. We combined our approach with the only available related work, the lung US segmentation models from Roy et al. [39], and found mild performance improvements on most metrics. It however remains unclear whether this gain can be attributed to the segmentation itself or is a side effect of the increased parametrization. Certainly, many approaches remain to be explored to improve on the results presented here, including further work on video classification, but also exploiting the higher availability of CT or X-ray scans with transfer learning, or adapting generative models to complement the scarce data on COVID-19, as proposed in [32]. Furthermore, we investigated the value of interpretable methods in a quantitative manner with the implementation and validation of class activation mapping in a study involving medical experts. While the analysis provides excellent evidence for the successful detection of pathological patterns like consolidations, A-lines, and effusion, it reveals problems in the model's "focal point" (e.g. missing B-lines and sometimes highlighting muscles instead of the lung), which should be further addressed using ultrasound segmentation techniques [49].
Our published database is constantly updated and verified by medical experts; researchers are invited to contribute to our initiative. We envision the proposed tool as a decision support system that accelerates diagnosis or provides a "second opinion" to increase reliability. We have started to collaborate with radiologists and an intensive care unit and are currently designing a controlled clinical study to investigate the predictive power of US for automatic detection of COVID-19, especially in comparison to CT and CXR. As a preliminary demonstration, we have built a web service (link not anonymized) where users can screen ultrasound images by querying our averaged prediction model. We aim to extend the functionality of the website to offer interpretable video inference, aiming for an accessible and validated tool that enables medical doctors to draw inference from their US images with unprecedented ease, convenience, and speed.

A.1 Model architectures and hyperparameters
As a base, we use the convolutional part of the established VGG-16 [43], pre-trained on ImageNet. The model we call VGG is followed by one hidden layer of 64 neurons with ReLU activation, dropout of 0.5, batch normalization, and the output layer with softmax activation. The CAMs for this model were computed with Grad-CAM [40]. To compare Grad-CAMs with regular CAMs [53], we also tested VGG-CAM, a CAM-compatible VGG with a single dense layer following the global average pooling after the last convolutional layer. For both models, only the weights of the last three layers were fine-tuned during training, while the others were frozen to the values from pre-training. This results in a total of ∼2.4M trainable and ∼12.4M non-trainable parameters. The models are trained with a cross-entropy loss on the softmax outputs and optimized with Adam at an initial learning rate of 1e−4. All models were implemented in TensorFlow and trained for 40 epochs with a batch size of 8 and early stopping enabled.
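The plain CAM enabled by the VGG-CAM architecture reduces to weighting the final convolutional feature maps by the dense-layer weights of the target class and summing over channels [53]. A minimal numpy sketch (our own illustration; upsampling to the input resolution and the exact layer names are omitted, and the ReLU follows the Grad-CAM convention of keeping only positive evidence):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Compute a plain CAM: weight the last conv feature maps (H x W x K)
    by the dense-layer weights of the target class (K,), sum over
    channels, clip negatives, and normalise to [0, 1] for display."""
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))
    cam = np.maximum(cam, 0)      # keep positive class evidence only
    if cam.max() > 0:
        cam = cam / cam.max()     # min-max-style normalisation for overlay
    return cam
```

In practice the resulting H × W map is bilinearly upsampled to 224 × 224 and overlaid on the input frame as a heatmap.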
A.2 Pretrained segmentation models

Figure 5 gives an example of an ultrasound image segmented with the model from Roy et al. [39]: the left side shows the raw US recording, while the right side shows the segmentation highlighting the B-line. In our work, the segmented images shown on the right serve as input to the VGG-Segment model.

A.3 Uncertainty estimation
For both aleatoric and epistemic uncertainty, the confidence estimate c_i of sample i is computed by scaling the sample's standard deviation to [0, 1] and interpreting it as an inverse precision:

c_i := 1 − (σ_{i,j} − σ_min) / (σ_max − σ_min),

where σ_{i,j} is the sample standard deviation of the ten class probabilities of the winning class j, σ_min is the minimal standard deviation (0, i.e. all probabilities for the winning class are identical), and σ_max is the maximal standard deviation, i.e. 0.5. Practically, for epistemic uncertainty, dropout was set to 0.5 across the VGG model, and for aleatoric uncertainty the same transformations as during training were employed.
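With σ_min = 0 and σ_max = 0.5, the scaling above can be sketched as follows; the function name and the stacked-probabilities input format are illustrative assumptions.

```python
import numpy as np

def confidence(prob_samples):
    """Scale the std of the winning class's probabilities across the
    stochastic forward passes into an inverse confidence in [0, 1]:
    c = 1 - (sigma - sigma_min) / (sigma_max - sigma_min),
    with sigma_min = 0 (all passes agree) and sigma_max = 0.5."""
    probs = np.asarray(prob_samples)      # shape: (n_passes, n_classes)
    winner = probs.mean(axis=0).argmax()  # winning class j
    sigma = probs[:, winner].std()
    return 1.0 - (sigma - 0.0) / (0.5 - 0.0)
```

Ten identical passes yield a confidence of 1, while passes that split evenly between probability 0 and 1 for the winning class yield a confidence of 0.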

A.4 Results
Re-formulating the classification as binary tasks, ROC and precision-recall curves can be computed for each class. Figure 6 and Figure 7 depict the performance per class, comparing all proposed models. While pneumonia is distinguished well by all models, NASNetMobile has difficulties with the correct classification of normal lung images. Figure 7a and Figure 7b show that COVID-19 is predicted better than healthy lung images, but not as distinctly as pneumonia infections. Furthermore, in addition to the normalized confusion matrices, we provide the absolute values in Figure 7c (referring to VGG-CAM). Note that most of our data shows COVID-19-infected lungs, despite the novelty of the disease. Problematically, healthy and COVID-19 patients are confused in 100 images, whereas bacterial pneumonia is predicted rather reliably.

A.4.1 Uninformative class
Although the main task is defined as differentiating COVID-19, bacterial pneumonia, and healthy, we actually trained the model with a fourth, "uninformative" class in order to identify out-of-distribution samples. This concerns both entirely different pictures (no ultrasound) and ultrasound images not showing the lung. Thus, we added 200 images from Tiny ImageNet (one per class, taken from the test set) together with 200 neck ultrasound scans taken from the Kaggle ultrasound nerve segmentation challenge. Note that the latter were recorded with linear ultrasound probes, leading to very different ultrasound images.

A.5 Class activation maps
In addition to the scatter plot in Figure 3, we present the corresponding density plot in Figure 8, showing the area of the ultrasound image where the class activation is maximal for each class. It can be observed that the activation on healthy and COVID-19 videos is located further in the upper part of the image, where usually only muscles and skin are observed. Further work is thus necessary to analyze and improve the qualitative results of the model. However, with respect to the pathological patterns visible, the model does in many cases focus on the patterns that are interesting to medical experts. Table 5 breaks down the results presented in Figure 4 in more detail, in particular separately for both medical experts. Note that with respect to the pleural line, we only consider the opinion of expert 2, since expert 1 did not mention it. With the exception of consolidations, the difference in responses is quite large, which is however unsurprising for such a qualitative task. Besides the patterns already named in Figure 4, the heatmaps also correctly highlighted air bronchograms (2 cases according to expert 1) and a pleural effusion in 1 out of 7 cases.

The observed MMD values, i.e. the ones we obtain by looking at the labels, are indicated as dashed lines in each histogram of Figure 9. We observe that these values are highly infrequent under the null distribution, indicating that the differences between the three classes are significant. Notably, the statistical distance between patients suffering from bacterial pneumonia and healthy patients (rightmost histogram) achieves a slightly lower empirical significance of ≈ 0.04. We speculate that this might be related to other pre-existing conditions in the healthy patients that are not pertinent to this study.