Automation of Lung Ultrasound Interpretation via Deep Learning for the Classification of Normal versus Abnormal Lung Parenchyma: A Multicenter Study

Lung ultrasound (LUS) is an accurate thoracic imaging technique distinguished by its handheld size, low cost, and lack of ionizing radiation. User dependence and poor access to training have limited the impact and dissemination of LUS outside of acute care hospital environments. Automated interpretation of LUS using deep learning can overcome these barriers by increasing accuracy while allowing point-of-care use by non-experts. In this multicenter study, we seek to automate the clinically vital distinction between A lines (normal parenchyma) and B lines (abnormal parenchyma) on LUS by training a customized neural network using 272,891 labelled LUS images. After external validation on 23,393 frames, pragmatic clinical application at the clip level was performed on 1162 videos. The trained classifier demonstrated an area under the receiver operating characteristic curve (AUC) of 0.96 (±0.02) through 10-fold cross-validation on local frames and an AUC of 0.93 on the external validation dataset. Clip-level inference yielded sensitivities and specificities of 90% and 92% (local) and 83% and 82% (external), respectively, for detecting the B line pattern. This study demonstrates accurate deep-learning-enabled LUS interpretation between normal and abnormal lung parenchyma on ultrasound frames while rendering diagnostically important sensitivity and specificity at the video clip level.


Introduction
Lung ultrasound (LUS) is a versatile thoracic imaging method that offers the diagnostic accuracy of a CT scan for many common clinical findings, with all the advantages of portable, handheld technology [1][2][3][4]. Recent reports have highlighted that the potential for LUS dissemination is near-limitless, extending, for example, to primary care, community settings, developing countries, and outer space [5][6][7]; accordingly, it has been praised as a worthy upgrade to auscultation [8]. With experts in its use in persistently short supply [9][10][11][12],

External Data
Exams labelled as LUS studies within the University of Ottawa ultrasound archiving system (Qpath, Port Coquitlam, BC, Canada) were exported from 50 patients to a shared drive.

Data Labelling
LUS clips were uploaded to an online platform (Labelbox, San Francisco, CA, USA) for vetting and labelling. Clips that had text over the image, cardiac structures in view, or included abdominal organs and/or the diaphragm (pleural views) were excluded from this analysis. All other clips were labelled with either an A line or a B line identity. The B line label contained the following 3 sublabels: (1) mild, fewer than 3 B lines; (2) moderate, occupying less than 50% of the pleural line; and (3) severe, occupying >50% of the pleural line. The clip-level B line label was additionally stratified as either homogeneous (all frames of the clip contained B lines) or heterogeneous (B lines seen in some frames but not others, coming in and out of view with tidal respiration). This distinction would allow homogeneous clips to be the source of the frame-based data for our classifier training, because the overall clip label ("B lines") was valid across all individual frames. Heterogeneous B line clips would be used in clip-level inference and validation, as outlined below. See Figure 2 and Videos S1-S5 in Supplementary Materials for examples of each label.

Local data labels were generated by clinical members of our team (labelling training methods in Table S1 in the Supplementary Materials) and reviewed by an expert in LUS (R.A.), while external data were subjected to dual expert (R.A. and S.M.), independent, blinded labelling. This latter approach was taken given the importance of the external data serving as a validation set. Figure 3 offers a schematic representation of our data volumes and how they were used.
Figure 3. Schematic breakdown of the data sources, data volume, hetero/homogeneity, and how data were allocated in our frame-based classifier development and clip-based clinical metric. For the frame-based classifier, the number of clips used for the training, validation, and test sets are presented as the mean ± standard deviation of the ten-fold cross-validation partitions.


Clip-Based Inference Data
The local clip inference data were generated from a combination of all heterogeneous A or B line clips and all homogeneous clips produced by our labelling team after the frame-based classifier was already trained, thus avoiding data leakage between the frame-based training data and the clip-inference data (as such leakage may inflate performance). Locally, there were 523 A line clips and 350 B line clips. Among the B line clips, 153 were heterogeneous. The external clip inference dataset was screened similarly, yielding 92 A line clips and 197 B line clips. Among these B line clips, 89 were heterogeneous. Details regarding these datasets are in Table 2.

Dataset Split
Prior to a training experiment, the dataset was randomly split into training, validation, and test sets by patient ID. Therefore, all the clips obtained from each unique patient were confined to a single set (i.e., training, validation, or test) without overlap. A summary of the splits used in K-fold cross-validation is outlined in Table 3.
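A patient-level split of this kind can be sketched in a few lines of Python (a minimal illustration; the function name and the (patient_id, clip) tuple schema are our assumptions, not the study's code):

```python
import random

def split_by_patient(clips, train_frac=0.8, val_frac=0.1, seed=42):
    """Split (patient_id, clip) pairs into train/val/test sets such that
    every clip from a given patient lands in exactly one set."""
    patients = sorted({pid for pid, _ in clips})
    random.Random(seed).shuffle(patients)
    n_train = int(len(patients) * train_frac)
    n_val = int(len(patients) * val_frac)
    buckets = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {name: [(pid, c) for pid, c in clips if pid in ids]
            for name, ids in buckets.items()}
```

Shuffling patient IDs rather than clips guarantees that no patient's anatomy appears in more than one partition, which is the leakage the text describes.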

Data Preprocessing
All ultrasound clips were deconstructed into their constituent frames. Following this, the frames were scrubbed of all on-screen information (e.g., vendor logos, battery indicators, index mark, and depth markers) extraneous to the ultrasound beam itself (see Figure 4). This was done using a dedicated deep learning masking software for ultrasound (AutoMask, WaveBase Inc., Waterloo, ON, Canada).

Transformations were stochastically applied to training batches as a means of data augmentation. Possible transformations included rotation up to 45° clockwise or counterclockwise, vertical or horizontal shifting up to 10%, magnification up to 10% inwards or outwards, shear up to 10° counterclockwise, horizontal reflection, and brightness increase/decrease up to 30%. These methods were applied to increase the heterogeneity of the training dataset because, despite the large quantity of frames, the number of distinct clips and individual patients was comparatively lower.
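A subset of these transformations can be sketched with NumPy alone (horizontal reflection, width shift, and brightness scaling; rotation, zoom, and shear would in practice come from an augmentation library — the function below is an illustrative assumption, not the study's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(frame):
    """Apply random simple augmentations to a 2D grayscale frame (0-255)."""
    out = frame.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1]                              # horizontal reflection
    shift = int(out.shape[1] * rng.uniform(-0.1, 0.1))  # width shift up to 10%
    out = np.roll(out, shift, axis=1)
    out = out * rng.uniform(0.7, 1.3)                   # brightness +/- 30%
    return np.clip(out, 0.0, 255.0)
```

Applying such transforms stochastically per batch means the network rarely sees the exact same pixels twice, partially compensating for the limited number of distinct patients.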


Frame-Based Deep Learning Classifier Model Architecture
After iterative experiments with a subset of our data on feedforward convolutional neural networks (CNNs), residual CNNs, and benchmark CNN architectures pretrained on ImageNet [30], we chose a model comprising the first 3 blocks of VGG16 as our network architecture (Figure 5, Table S2 in the Supplementary Materials) [31]. This architecture exploits the pretrained, earlier layers of VGG16 for low-level features (e.g., edges and lines) while avoiding the more sophisticated feature detection that is likely unhelpful for interpreting lower-complexity LUS images. Additionally, this approach afforded a lighter computational demand and may be less prone to overfitting the training data than the full VGG16 architecture.
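A truncated VGG16 of this kind can be sketched in Keras (a hedged illustration: the cut point at `block3_pool`, the input size, and the pooling-plus-softmax head are our assumptions, as the text does not specify them):

```python
import tensorflow as tf

def build_frame_classifier(input_shape=(128, 128, 3)):
    """Build a 2-class (A line vs. B line) classifier from the first 3
    convolutional blocks of VGG16. weights=None keeps the sketch
    self-contained; the study used ImageNet-pretrained weights."""
    base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_shape=input_shape)
    trunk = tf.keras.Model(base.input, base.get_layer("block3_pool").output)
    x = tf.keras.layers.GlobalAveragePooling2D()(trunk.output)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)  # [p_A, p_B]
    return tf.keras.Model(trunk.input, outputs)
```

Truncating after the third block discards the deeper, high-level feature detectors and sharply reduces the parameter count, consistent with the overfitting rationale above.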
The model's prediction is a probability distribution indicating its confidence that an input lung ultrasound frame exhibits A lines or B lines. We elected to focus on frame-based predictions, as single LUS frames are able to convey A vs. B line patterns and represent the building-block unit of clips. Therefore, a classifier at the frame level has the greatest agility to be applied to clips of varying compositions, as is typical of point-of-care imaging.
The prediction for a single frame is the probability distribution p = [p_A, p_B] obtained from the output of the final softmax layer, and the predicted class is the one with the greatest probability (i.e., argmax(p)) (full details of the classifier training and evaluation are provided in the Methods section and Table S3 of the Supplementary Materials).
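Concretely, the frame-level decision rule reduces to a softmax followed by an argmax (a toy NumPy illustration with made-up logits):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return z / z.sum()

# Hypothetical final-layer logits for one frame, ordered [A line, B line]
p = softmax(np.array([1.2, 2.9]))
predicted_class = int(np.argmax(p))      # 1, i.e., B lines
```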

Clip-Based Clinical Metric
As LUS is not experienced and interpreted by clinicians in a static, frame-based fashion, but rather in a dynamic (series of frames/video clip) fashion, mapping the classifier performance against clips offers the most realistic appraisal of eventual clinical utility. Regarding this inference as a kind of diagnostic test, sensitivity and specificity formed the basis of our performance evaluation [32].
We considered and applied multiple approaches to evaluate and maximize the performance of a frame-based classifier at the clip level. For clips where the ground truth is homogeneously represented across all frames (e.g., a series of all A line frames or a series of all B line frames), a clip averaging method would be most appropriate. However, with many LUS clips having heterogeneous findings (where the pathological B lines come in and out of view and the majority of the frames show A lines), clip averaging would lead to a falsely negative prediction of a normal/A line lung (see the Supplementary Materials for the methods and results of clip averaging on our dataset, Figures S1-S4 and Table S6).
To address this heterogeneity problem, we devised a novel clip classification algorithm which received the model's frame-based predictions as input. Under this classification strategy, a clip is considered to contain B lines if there is at least one instance of τ contiguous frames for which the model predicted B lines. The two hyperparameters defining this approach are as follows: (1) the classification threshold (t), the minimum prediction probability for B lines required to identify a frame's predicted class as B lines; and (2) the contiguity threshold (τ), the minimum number of consecutive frames for which the predicted class must be B lines.
Equation (1) formally expresses how the clip's predicted class ŷ ∈ {0, 1} is obtained under this strategy, given the set of frame-wise prediction probabilities for the B line class, P_B = {p_B,1, p_B,2, . . . , p_B,n}, for an n-frame clip:

ŷ = 1 if there exists i ∈ {1, . . . , n − τ + 1} such that p_B,j ≥ t for all j ∈ {i, . . . , i + τ − 1}, and ŷ = 0 otherwise. (1)

Further details regarding the advantages of this algorithm are in the Methods section of the Supplementary Materials. We carried out a series of validation experiments on unseen internal and external datasets, varying both of these thresholds. The resultant metrics guided the subsequent exploration of the clinical utility of this algorithm.
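This contiguity rule reduces to a single pass over the frame-wise probabilities; a minimal Python sketch (function and variable names are ours, not the authors'):

```python
def classify_clip(p_b, t=0.7, tau=3):
    """Return 1 (B lines) if the clip contains at least one run of `tau`
    consecutive frames whose predicted B line probability is >= t, else 0."""
    run = 0
    for p in p_b:
        run = run + 1 if p >= t else 0
        if run >= tau:
            return 1
    return 0

# A heterogeneous clip: a short burst of confident B line frames inside a
# majority of A line frames. Clip averaging would call this A lines (mean
# probability ~0.16), while contiguity thresholding detects the B line run.
probs = [0.1] * 20 + [0.9, 0.95, 0.9] + [0.1] * 20
assert classify_clip(probs) == 1
```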

Explainability
We applied the Grad-CAM method [33] to visualize which components of the input image were most contributory to the model's predictions. The results are conveyed by color on a heatmap, overlaid on the original input images. Blue and red regions correspond to the highest and lowest prediction importance, respectively.


Frame-Based Performance and K-Fold Cross-Validation
Our K-fold cross-validation yielded a mean area under the receiver operating characteristic curve (AUC) of 0.964 for the frame-based classifier on our local data (Figure 6, Panel A). The confusion matrix of frame-wise predictions exhibits a strong diagonal pattern (Figure 6, Panel B). A summary of the results is shown in Table 4 (full results in Table S5 in the Supplementary Materials).

Frame-Based Performance on External Data
The AUC obtained from the external data at the frame level was 0.926 (Figure 6, Panel C). The confusion matrix of frame-wise predictions (Figure 6, Panel D) exhibits a strong diagonal pattern, supporting the results of the individual class performance.
A summary of the results is shown in Table 4.

Explainability
The Grad-CAM explainability algorithm was applied to the output from the model on our local test set data and the external data. Example heatmaps with associated predictions are seen for our internal data and external data in Figures 7 and 8, respectively. The correctly predicted A line frames demonstrate strong activations on the horizontal markings, indicating the correct areas where a clinician would assess for this specific pattern. Similarly, there are strong activations along the vertically oriented B lines on the correctly identified clips for this pattern. The incorrectly predicted frames show activations taking on a similar morphology for the predicted class (i.e., horizontal shapes for predicted A lines, vertical shapes for predicted B lines).

Clip-Based Clinical Metric
The relationship of a contiguity threshold (T) from 1 to 40 and/or a frame classification threshold (t) from 0.5 to 0.9 to diagnostic sensitivity and specificity was fully explored on both the internal and external clip-based inference datasets (Figure 9). In both datasets, increasing t led to incremental upward translation of the specificity curve with modest downward translation of the sensitivity curve. With T > 1 for any given t, specificity could be further augmented with modest reductions in sensitivity. Across all thresholds, peak diagnostic performance, as determined by optimum combined sensitivity and specificity, was found at a t of 0.7 and a T of 3 (sensitivity of 90.0% and specificity of 92.0%) for the internal data, and at a t of 0.8 and a T of 3 (sensitivity of 83.2% and specificity of 81.5%) for the external data.

Figure 9. Influence of varying the contiguity threshold (T) and the classification threshold (t) across the internal and external clip-based data. While absolute diagnostic performance (intersection of sensitivity and specificity, dashed line) differed between the internal and external sets, common trends with increasing both T and t were seen. Increases in T at lower levels of t are useful to increase performance at all levels for the external data, while this effect plateaus at a t of 0.8 for the local data.
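The threshold exploration described above amounts to a grid search over t and T given frame-wise predictions and clip-level ground truth; a sketch under hypothetical names (not the study's code):

```python
def classify_clip(p_b, t, tau):
    """1 (B lines) if any run of `tau` consecutive frames has probability >= t."""
    run = 0
    for p in p_b:
        run = run + 1 if p >= t else 0
        if run >= tau:
            return 1
    return 0

def sweep_thresholds(clips, labels, t_values, tau_values):
    """Return {(t, T): (sensitivity, specificity)} for B line detection, where
    `clips` holds frame-wise B line probability lists and `labels` use
    1 for B lines (positive) and 0 for A lines."""
    results = {}
    for t in t_values:
        for tau in tau_values:
            preds = [classify_clip(c, t, tau) for c in clips]
            tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
            tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
            fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
            fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
            results[(t, tau)] = (tp / max(tp + fn, 1), tn / max(tn + fp, 1))
    return results
```

The resulting grid can then be scanned for the (t, T) pair with the best combined sensitivity and specificity, mirroring the optimum operating points reported in the text.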


Discussion
In this study, we developed a deep learning solution for accurate distinction between the A line and B line pattern on lung ultrasound. Since this classification, between normal and abnormal parenchymal patterns, is among the most impactful and well-studied applications of LUS, our results form an important step toward the automation of LUS interpretation.
With reliable frame-level classification (local AUC of 0.96, external AUC of 0.93) and explainability figures that show appropriate pixel activation regions, our results support generalized learning of the A line and B line patterns. Clip-level application of this model was carried out to mimic the more difficult clinical task of interpreting LUS in a real-time, continuous fashion at a given location on the chest.
A challenge of classifying B lines at the clip level is ensuring sufficient responsiveness that low-burden B line clips (whether because of flickering, heterogeneous frames, or a low number of B lines) are accurately identified, while still preserving the classifier's specificity. The thresholding techniques we devised around frame prediction strength and the contiguity of such predictions were successful in addressing this challenge, while also providing insight into how an A vs. B line classifier may be customized for a variety of clinical environments. Through adjustment of these thresholds (Figure 9), varying clinical use cases may be matched with an appropriate emphasis on either greater sensitivity or specificity. Further considerations, such as disease prevalence, the presence of disease-specific risk factors, and the results and/or availability of ancillary tests and expert oversight, would also influence how automated interpretation should be deployed [34].
Among the many DL approaches to be considered for medical imaging, our frame-based foundation was chosen deliberately for the advantages it may offer for eventual real-time automation of LUS interpretation. Larger, three-dimensional or temporal DL models that might be applied to perform clip-level inference would be too bulky for eventual front-line deployment on the edge and also lack the semantic clinical knowledge that our clip-based inference approach is intended to mimic.
The automation of LUS delivery implied by this study may seem futuristic amid some public trepidation about deploying artificial intelligence (AI) in medicine [35]. Deep learning solutions for dermatology [36] and for ocular health [37], however, have shown tolerance exists for non-expert and/or patient-directed assessments of common medical concerns [38]. As acceptance for AI in medicine grows, automated LUS may be anticipated to satisfy the consistently exceptional demand for lung imaging [39], especially where access to standard imaging may not be convenient or possible. The recently announced reimbursement for DL-enhanced imaging in the United States will, by offsetting the costs of developing such solutions, accelerate interest in the DL-imaging interface [40]. Beyond A and B lines, LUS automation priorities can be expected to include lung sliding, pleural effusion, and consolidation. Additionally, multicenter validation of automated diagnosis [19] or prognosis [18] with LUS offers promising research avenues.
Real world deployment of a classifier as we have developed will require further progress before it can be realized. Firstly, since LUS is user dependent, a method of standardizing acquisition, as has recently been proposed, can only enhance the opportunities for both DL development and implementation in LUS [41]. Anticipating that technical standards take significant time to be adopted, however, a more realistic approach may be to pair automated interpretation with image guidance systems that assure standards that meet the needs of the image classifier. Such an approach has recently been described with some success in the domain of AI-assisted echocardiography [42]. The other barrier to deployment is how to run the DL technology "on the edge" at the patient's bedside with a portable machine capable of LUS. Eventual integration of high-performance GPUs with ultrasound devices will address this; however, in the interim, portable "middleware" devices capable of interacting directly with ultrasound machines and running AI models in real time have been developed and are commercially available [43].
Despite the rarity of DL work with LUS, there have been some recent studies that have addressed LUS [20][21][22]44]. These studies, with a wide array of different DL approaches, all share a non-clinical emphasis and small datasets. Our work differs significantly through a comparatively much larger LUS data volume from multiple centers, rigorous curation and labelling methods that resemble reference standards [45], and a pragmatic, clinical emphasis on diagnostic performance. In addition, while medical DL classifiers have struggled notoriously with generalization [46,47], our model performed well on an external dataset with reasonably distinct acquisition features as compared with our data.
There are important limitations to our work. The implicit heterogeneity of point-of-care data can contribute unseen learning points for our model that could unduly increase performance. We have sought to mitigate these effects through rigorous preprocessing as well as through our K-fold validation methods, external validation, and explainability. Despite generalizable results on the external dataset, a performance gap at the frame and clip levels was seen. False positive B line predictions (B line predictions for ground truth A line clips; Figure 9 and Figure S2 in the Supplementary Materials) provided the greatest challenge to our model and were driven largely by dataset imbalances relative to the training data: images generated with either a curved linear probe, a cardiac preset, or the Philips machine. This understanding will inform future iterations of this classifier. While we have designed our classifier as a "normal vs. abnormal" model, there is an opportunity for greater granularity within the B line class. Features such as subpleural consolidations and pleural line irregularity [48] were not addressed by this classifier. Combining the present model with other published models devoted to disease-specific diagnoses within the B line class seems desirable [19].

Conclusions
The information presented here supports an eventual goal of automated LUS through deep learning. We describe the development of an accurate A vs. B line, frame-based classifier validated at the clip level. Novel techniques to both maximize and adjust diagnostic performance to suit the priorities of differing clinical environments have further been established. Future work will rely on broader data representation and evaluating the feasibility and accuracy of real-time clinical deployment.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/diagnostics11112049/s1, Figure S1: Effect of A vs. B line class and B line heterogeneity on B line prediction certainty, Figure S2: Effect of probe type, exam preset, and ultrasound vendor on A and B line prediction certainties for dataset 3 (external data, including homogenous and heterogenous clips) at the clip level, Figure S3: Effect of B line severity and heterogeneity on B line prediction false negative rate (FNR) using a clip level, average prediction of >0.5 for B lines, Figure S4: Influence of varying average clip prediction threshold on sensitivity and specificity for internal (A) and external data (B), Table S1: Graded qualification system for independent data labelling used to label local lung ultrasound data, Table S2: Visual summary of the model's architecture, Table S3: Details of the runs comprising the Bayesian hyperparameter optimization, Table S4: K-fold cross validation experiment data distribution by patients, clips, and frames, Table S5: Full internal k fold validation results, Table S6: Clip-wise performance for B line detection across local and external data, Video S1: Video example of a homogeneous A line label on our lung ultrasound dataset. Horizontal reverberation artifacts from the pleural line (A lines) connote normal lung parenchyma and can be seen throughout the clip, uninterrupted, Video S2: Video example of a homogeneous fewer than 3 B lines label on our lung ultrasound dataset. A solitary, bright, ring-down artifact from the pleural line (B line) can be seen throughout the clip, Video S3: Video example of homogeneous moderate B line label on our lung ultrasound dataset. Multiple ring-down artifacts (B lines) can be seen throughout, occupying less than 50% of the total pleural line surface, Video S4: Video example of homogeneous severe B line label on our lung ultrasound dataset. 
Multiple ring-down artifacts (B lines) can be seen throughout, occupying more than 50% of the total pleural line surface, Video S5: Video example of a heterogeneous moderate B line label on our lung ultrasound dataset. Multiple ring-down artifacts (B lines) can be seen moving in and out of the field of view of the ultrasound image.