Ensemble of Deep Convolutional Neural Networks for Classification of Early Barrett’s Neoplasia Using Volumetric Laser Endomicroscopy

Barrett’s esophagus (BE) is a known precursor of esophageal adenocarcinoma (EAC). Patients with BE undergo regular surveillance to detect early stages of EAC. Volumetric laser endomicroscopy (VLE) is a novel technology incorporating a second-generation form of optical coherence tomography and is capable of imaging the inner tissue layers of the esophagus over a 6 cm length scan. However, interpretation of full VLE scans is still a challenge for human observers. In this work, we train an ensemble of deep convolutional neural networks to detect neoplasia in 45 BE patients, using a dataset of images acquired with VLE in a multi-center study. We achieve an area under the receiver operating characteristic curve (AUC) of 0.96 on the unseen test dataset and we compare our results with previous work on VLE analysis, where an AUC of only 0.90 was achieved via cross-validation on 18 BE patients. Our method for detecting neoplasia in BE patients facilitates future advances in patient treatment and provides clinicians with new assisting solutions to process and better understand VLE data.


Introduction
Esophageal adenocarcinoma (EAC) is among the most common and lethal cancers in the world. The incidence of EAC has increased rapidly since the late 1980s, and it is estimated that the number of new esophageal cancer cases will double by 2030 [1]. Barrett's esophagus (BE) is a condition in which normal squamous epithelium at the distal end of the esophagus is replaced by metaplastic columnar epithelium due to overexposure to gastric acid, and it is associated with an increased risk of developing EAC [2]. For this reason, patients diagnosed with BE currently undergo regular surveillance with white-light endoscopy (WLE) with the aim of detecting early high-grade dysplasia (HGD) and intramucosal adenocarcinoma. It is important to detect these lesions early, as curative treatment is still possible at this stage by a minor endoscopic intervention. However, early neoplastic lesions are regularly missed because of their subtle appearance, or as a result of sampling errors during biopsy [3][4][5].
Volumetric laser endomicroscopy (VLE) is a novel advanced imaging system with the potential for early detection of suspicious areas containing BE, which may be regularly missed with current white-light endoscopy. VLE incorporates a second-generation form of optical coherence tomography (OCT) technology. Improvements in image acquisition speed enable this balloon-based system to perform a quick circumferential scan of the entire distal esophagus. VLE provides a three-dimensional map of near-microscopic resolution of the subsurface layers of the esophagus over a length of 6 cm and a depth of 3 mm into the tissue. However, a VLE scan generates a large amount of gray-shaded data (i.e., typically 1200 cross-sectional images or frames of 4096 × 2048 pixels) that need to be analyzed in real time during endoscopy.
In recent studies [6,7], clinical VLE prediction models were developed for BE neoplasia and achieved promising accuracy. Several visual VLE features (lack of layering, high surface signal intensity, and irregular glandular architecture) were identified as possible indicators of BE neoplasia by comparing VLE-histology correlated images from 25 ex-vivo specimens. However, follow-up studies using ex-vivo [8] and in-vivo [9] VLE data suggested that full-scan VLE interpretation by experts remains a challenge. Computer-aided diagnosis (CADx) systems offer relevant assistance in clinical decision making. In van der Sommen et al. [10], a CADx system was developed for the detection of early BE neoplasia on WLE images. Early work in endoscopic optical coherence tomography (EOCT) was shown in Qi et al. [11,12], where a CADx system based on multiple feature extraction methods was developed to detect neoplasia in BE patients in a dataset of EOCT biopsies. Related work was performed by Ughi et al. [13], where a system was developed to automatically characterize and segment the esophageal wall of patients with BE using tethered capsule endomicroscopy (TCE).
The first related work performed on VLE was presented in Swager et al. [14], where a CADx system was developed to detect early BE neoplasia on 60 VLE images from a database of high-quality ex-vivo VLE-histology correlations (30 non-dysplastic BE (NDBE) and 30 neoplastic images, containing HGD or early EAC). In their work, two novel clinically-inspired quantitative image features specific for VLE were developed based on the VLE surface signal and the intensity histogram of several layers. Recently, additional work was performed by Scheeve et al. [15], where another novel clinically-inspired quantitative image feature was developed based on the glandular architecture, and analyzed in a dataset of 18 BE patients with and without early BE neoplasia (88 NDBE and 34 HGD/EAC). Both studies investigated the features using several machine learning methods, such as support vector machine, random forest or AdaBoost, and showed promising results towards BE neoplasia assessment. However, the results were only obtained in a small patient population (29 endoscopic resections and 18 VLE laser-marked regions of interest (ROIs), respectively), suggesting that a larger dataset might be needed to further validate and possibly improve the results.
In this study, we extend the work of Scheeve et al. [15] by using a larger dataset of 45 patients, and we incorporate current state-of-the-art deep learning techniques to improve the classification of non-dysplastic and neoplastic BE patients. To the best of our knowledge, this is the first study applying deep convolutional neural networks (DCNNs) for early BE neoplasia classification in patients imaged with in-vivo VLE. The results are compared with previously developed clinically-inspired features specific for VLE, and validated on a test dataset that is unseen during training time.

VLE Imaging System
The Nvision VLE Imaging System (NinePoint Medical, Inc., Bedford, MA, USA) integrates a second-generation form of OCT technology, termed optical frequency-domain imaging (OFDI) [16][17][18][19]. The VLE system was composed of a disposable optical probe with an inflation system and an imaging console, which incorporates a swept light source, optical receiver, interferometer, and a data-acquisition computer. The light source consisted of a near-infrared light (λ = 1250-1350 nm) that was transmitted into the catheter. At the distal end of the optical probe, a non-compliant balloon allowed a correct alignment in the esophagus for in-vivo imaging. During an automatic pullback of the optical probe, a 6 cm circumferential segment of the esophagus is scanned in 90 seconds. A VLE pullback acquires 1200 cross-sectional images with a sampling density of 50 µm (voxel dimensions, 5.9 µm × 16.2 µm × 50 µm). The axial and lateral resolution of a VLE image were approximately 7 µm and 40 µm, respectively. The penetration depth reached approximately 3 mm into tissue. For more comprehensive technical details we refer to previous publications [16][17][18][19].

Data Collection and Description
In a prospective multi-center clinical study for the PREDICTion of BE neoplasia (PREDICT study), VLE data was acquired in vivo from 45 patients undergoing BE surveillance at the Amsterdam UMC (AMC; Amsterdam, The Netherlands), the Catharina Hospital (CZE; Eindhoven, The Netherlands), and the St. Antonius Hospital (ANZ; Nieuwegein, The Netherlands) from October 2017 to November 2018, using a commercial VLE system (NinePoint Medical, Inc., Bedford, MA, USA). Patients undergoing surveillance of NDBE, or patients referred for work-up and treatment of BE with early neoplasia (HGD and/or EAC), were eligible for this study. The study was approved by the institutional review boards at AMC, CZE, and ANZ. Written informed consent was obtained from all patients prior to VLE imaging.
For each patient, one or several regions of interest (ROIs) were extracted in the following manner. First, four-quadrant laser-mark pairs were placed at 2 cm intervals using the VLE system, according to the Seattle biopsy protocol [20,21]. Next, a full VLE scan was performed, after which the VLE balloon was retracted from the esophagus. Then, regular endoscopy was used to obtain biopsies in between the laser-mark pairs. Finally, ROIs were cropped from the full scan in between the same laser-mark pairs, and were labeled according to pathology outcome, ensuring histology-correlation [22] of the extracted ROIs. The histopathological correlation was assumed to apply over 1.25 mm in both vertical directions (i.e., distal and proximal), consistent with the size of a small biopsy specimen. This distance comprises 25 cross-sectional images in each direction, thus resulting in 51 images per ROI.
In total, 233 NDBE and 80 neoplastic (HGD/EAC) ROIs were laser-marked under VLE guidance and subsequently biopsied for histological evaluation by an expert pathologist for BE. Out of the total cohort of patients, the first 22 patients were used as the training dataset (134 NDBE and 38 HGD/EAC ROIs, totalling 8,772 VLE images) and the remaining 23 were treated as the unseen test dataset (99 NDBE and 42 HGD/EAC ROIs, totalling 7,191 VLE images).
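The ROI size and image totals quoted above follow from the 50 µm frame spacing. The sketch below reproduces that arithmetic. The helper `roi_frame_indices` and the idea of centring the ROI on a single frame index are illustrative assumptions, not the study's extraction code.

```python
# Frame spacing and histology extent, in micrometres (values from the text).
FRAME_SPACING_UM = 50
HALF_EXTENT_UM = 1250

frames_per_side = HALF_EXTENT_UM // FRAME_SPACING_UM   # 25 frames distal, 25 proximal
frames_per_roi = 2 * frames_per_side + 1               # plus the central frame: 51


def roi_frame_indices(center_frame, n_side=frames_per_side, n_frames=1200):
    """Hypothetical helper: frame indices of an ROI centred on `center_frame`,
    clipped to the 1200 frames of a full pullback."""
    start = max(0, center_frame - n_side)
    stop = min(n_frames - 1, center_frame + n_side)
    return list(range(start, stop + 1))


# ROI counts per class as reported for the two datasets.
train_rois = {"NDBE": 134, "HGD/EAC": 38}
test_rois = {"NDBE": 99, "HGD/EAC": 42}

train_images = sum(train_rois.values()) * frames_per_roi
test_images = sum(test_rois.values()) * frames_per_roi
print(train_images, test_images)  # 8772 7191
```

The totals match the image counts reported for the training and test datasets.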

Clinically-Inspired Features for Multi-Frame Analysis
In the previous works of Klomp et al. [23] and Scheeve et al. [15], clinically-inspired quantitative image features, namely the layer histogram (LH) and gland statistics (GS), were developed to detect BE neoplasia in single frames. We refer to analysing one VLE image in an ROI to predict BE neoplasia as single-frame analysis. In Scheeve et al. [15], a single VLE image per ROI was used to compute the resulting prediction for each ROI. For a fair comparison with these studies, and per our request to the authors, we present our results by extending the single-frame analysis to 51 VLE images per ROI, further referred to as multi-frame analysis. The development of the clinically-inspired features has been described previously [15,23,24]. A summary of the methodology for the multi-frame analysis is given in the following sections.

Preprocessing
Relevant tissue regions were segmented from VLE images, removing regions not suitable for analysis, such as regions of air and deeper tissue with a low signal-to-noise ratio. The tissue segmentation masks were obtained using FusionNet with a domain-specific loss function [25]. Using these segmentation masks, tissues of interest (TOIs) were segmented and flattened, by extracting the first 200 pixels (i.e., approximately 1 mm of tissue) from the top of each column that is indicated by the tissue segmentation mask [24].
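The flattening step can be illustrated with a short NumPy sketch. It assumes a binary tissue mask is already available (the FusionNet segmentation itself is not shown), and the function name `flatten_toi` and the zero-padding of short columns are illustrative choices rather than details from the cited work.

```python
import numpy as np


def flatten_toi(image, mask, depth=200):
    """For each column, copy the first `depth` pixels below the topmost
    tissue pixel indicated by `mask` into a (depth, width) array."""
    height, width = image.shape
    flat = np.zeros((depth, width), dtype=image.dtype)
    for col in range(width):
        tissue_rows = np.flatnonzero(mask[:, col])
        if tissue_rows.size == 0:
            continue  # no tissue detected in this column: leave zeros
        top = tissue_rows[0]
        segment = image[top:top + depth, col]
        flat[:segment.size, col] = segment  # zero-padded if the image ends early
    return flat
```

Aligning every column to its own tissue surface removes the curvature of the esophageal wall, so that a given row of the flattened array corresponds to roughly the same tissue depth.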

Layer Histogram and Gland Statistics
From the flattened TOI data, the LH feature was computed. The LH feature captures the (lack of) layering in the VLE data by computing the N-bin histograms of the first M layers of d pixels, starting from the top of the flattened TOI data [23,24]. To detect glandular structures in the flattened TOI data, gland segmentation masks were computed using a simple segmentation algorithm, involving local adaptive thresholding and basic morphological operations. After glands were detected, the GS feature was computed, which captures the characteristics of glandular structures in the VLE data. This GS feature comprises features for (1) texture analysis, (2) geometry analysis, as well as (3) VLE-specific information [15]. Both LH and GS capture characteristics in the VLE data that are indicative of dysplasia [7,26].
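A minimal sketch of a layer-histogram style feature is given below. The values of N, M, and d are not stated in this text, so the defaults (`n_bins=8`, `n_layers=4`, `layer_px=10`) are placeholders, and the per-layer normalisation is an assumption rather than the published definition.

```python
import numpy as np


def layer_histogram(flat_toi, n_bins=8, n_layers=4, layer_px=10,
                    value_range=(0, 256)):
    """Concatenate the normalised n_bins-bin histograms of the first
    n_layers layers, each layer spanning layer_px rows of the flattened TOI."""
    features = []
    for layer in range(n_layers):
        rows = flat_toi[layer * layer_px:(layer + 1) * layer_px, :]
        counts, _ = np.histogram(rows, bins=n_bins, range=value_range)
        features.append(counts / max(counts.sum(), 1))  # guard against empty layers
    return np.concatenate(features)
```

A well-layered tissue produces clearly different histograms from one layer to the next, whereas a loss of layering makes the consecutive histograms similar.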

Ensemble of Deep Convolutional Neural Networks
In recent years, deep convolutional neural networks (DCNNs) have been shown to be highly effective for segmentation, classification, and learning specific patterns in images. In the field of optical coherence tomography (OCT), DCNNs have been successfully applied to tasks such as segmentation of retinal layers [27] and macular edema [28], detection of macular fluid [29], or treatment of age-related macular degeneration [30,31]. In this section we present our approach with DCNNs in a dataset of in-vivo VLE images for the classification of neoplasia in patients with BE.

Preprocessing VLE
In order to guide the network towards a better convergence, and similar to single-frame analysis (Section 2.3.1), we performed some specific cleaning by removing non-informative areas from the original VLE images. This removal involved two main sources of less important information: (1) the pixel background information, and (2) the balloon pixel information. Our aim was to maximize the useful information that the network can learn and exclude any additional learning that degrades the classification task. For this reason each image was cropped to occlude the balloon. For each VLE image, the balloon region (Figure 1, red curve) was removed following the steps below:
1. For each image the average intensity curve was computed along the vertical dimension, thus allowing us to obtain the profile of changing intensity (Figure 1, cyan curve).
2. The first derivative of the average intensity curve was computed to obtain a quantitative measurement of slope differences (Figure 1, yellow curve).
3. Given the first derivative, the balloon end location was defined as the local maximum value found after the minimum point of the first derivative (Figure 1, green star).
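The three steps above can be sketched in NumPy as follows. This is one possible reading of the description: it assumes that depth runs along the image rows, that the intensity profile is the per-row mean, and that the "local maximum after the minimum" is the first local maximum of the derivative following its global minimum. The function names are illustrative.

```python
import numpy as np


def balloon_end_row(image):
    """Estimate the row at which the balloon ends (assumed reading of steps 1-3)."""
    profile = image.mean(axis=1)        # step 1: average intensity per row
    slope = np.gradient(profile)        # step 2: first derivative of the profile
    start = int(np.argmin(slope))       # steepest drop, taken as the balloon's trailing edge
    for i in range(start + 1, len(slope) - 1):
        # step 3: first local maximum of the derivative after its minimum
        if slope[i] > slope[i - 1] and slope[i] >= slope[i + 1]:
            return i
    return len(slope) - 1               # fallback when no local maximum exists


def crop_below_balloon(image):
    """Keep only the rows from the estimated balloon end downwards."""
    return image[balloon_end_row(image):, :]
```

On real data the intensity profile is noisy, so a smoothing step before differentiation would probably be needed; none is described in the text, and none is included here.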

Preprocessing DCNN
Each of the networks was initialized using pre-trained ImageNet weights [32]. One limitation of using a pre-trained model is that the associated architecture cannot be changed, since the weights are originally trained for a specific input configuration. Hence, to match the requirements of the pre-trained ImageNet weights, each image was resized to 224 × 224 pixels. In addition, the dataset was normalized by subtracting the mean and dividing by the standard deviation specified by the pre-trained ImageNet weights. As a final step, the gray-scale channel of each VLE image was triplicated to simulate the RGB input requirement of the pre-trained model (Figure 2).
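A NumPy-only sketch of this input preparation is shown below. The mean and standard deviation are the values commonly distributed with ImageNet-pretrained models and are an assumption here, since the text does not list them. Nearest-neighbour resizing stands in for whatever interpolation was actually used.

```python
import numpy as np

# Per-channel statistics commonly used with ImageNet-pretrained weights (assumed).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def resize_nearest(image, size=224):
    """Nearest-neighbour resize of a 2-D array to (size, size)."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols[None, :]]


def to_network_input(gray_uint8, size=224):
    """Resize, scale to [0, 1], triplicate the gray channel, and normalise."""
    small = resize_nearest(gray_uint8, size).astype(np.float64) / 255.0
    rgb = np.repeat(small[:, :, None], 3, axis=2)   # gray -> three identical channels
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD     # per-channel normalisation
```

The channel is triplicated before normalisation in this sketch so that the per-channel statistics can be applied by broadcasting; the result is the same as normalising three copies separately.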


Data Split of the Training Dataset
For separating training and validation datasets, we split the data on a patient neoplasia-basis, rather than on a frame-basis. Splitting on a frame-basis cannot be done naively, since the frames are highly correlated in the following way. Each (multi-frame) ROI of the training dataset (22 patients of the PREDICT study) contains a total of 51 frames comprising roughly 2.5 mm of the total 6 cm length of the scanned esophagus. Succeeding frames are therefore highly correlated and contain nearly the same data. The only remaining option is therefore to split the data on a patient neoplasia-basis. To avoid overfitting, the patient distribution for each neoplasia grade was analyzed.
In Table 1, we show the number of patients that belong to each class, as well as the number of ROIs pathologically confirmed as non-dysplastic and dysplastic. We can observe that three patients share the NDBE and HGD class label, and we therefore decided to split our dataset into three groups. To avoid patient bias during training, we select only ROIs from NDBE and HGD patients, together with the data from one of the patients that have both. This leads to three permuted datasets. Given this permutation, we have trained three individual DCNNs, which together form an ensemble of networks. When evaluated on the external test dataset, the resulting ensemble of the three networks is used to obtain the probability of an ROI belonging to NDBE or HGD, as further explained in Section 3.
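One possible reading of this permutation scheme is sketched below: each of the three datasets keeps all ROIs from patients with a single class label and adds the ROIs of exactly one of the patients that have both labels. The data layout (a list of dictionaries with `patient` and `label` keys) and the function name are hypothetical, and how each dataset was further divided into training and validation parts is not specified in the text.

```python
def build_permutations(rois):
    """Return one ROI list per patient that carries both class labels.

    rois: list of dicts with 'patient' and 'label' keys (assumed layout).
    """
    labels_by_patient = {}
    for roi in rois:
        labels_by_patient.setdefault(roi["patient"], set()).add(roi["label"])

    # Patients whose ROIs include more than one class label.
    mixed = sorted(p for p, labels in labels_by_patient.items() if len(labels) > 1)

    permutations = []
    for keep in mixed:
        # All single-label patients plus exactly one mixed-label patient.
        subset = [r for r in rois
                  if r["patient"] == keep or r["patient"] not in mixed]
        permutations.append(subset)
    return permutations
```

With three mixed-label patients this yields three datasets, one per network of the ensemble.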

DCNN Description
In this section we present the choice of the network architecture and motivate why it best fits our data. In this work we utilized an ensemble of networks, each of them based on the VGG16 architecture proposed by Simonyan et al. [33]. We have found that due to the unbalanced nature of the data, a simple yet deep model was the most effective way to classify our VLE images. Deep models are useful to learn complex features, but at the cost of a large number of parameters to train and slow inference time.
Figure 3 depicts the architecture of the DCNN, based on the VGG16 network. VGG16 is composed of five groups of convolutional blocks with a max-pooling layer at the end of each block. In the original architecture, the classifier consisted of two fully connected (FC) layers followed by the classification layer with a softmax activation. In our work, we added a global average pooling (GAP) layer (Figure 3, blue block) after the last convolutional block, which reduced the feature map passed to the classifier from 7 × 7 × 512 to 1 × 1 × 512, and thereby the number of learnable parameters in the subsequent layer. The GAP layer was followed by a fully connected layer of 512 hidden units instead of the 4096 proposed in the original VGG16 network. Subsequently, a dropout layer (p = 0.5) was added to the architecture, followed by the final classification layer with a softmax activation.
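The effect of the GAP layer on the classifier size can be checked with a short calculation. The sketch below counts weights and biases of a dense layer; the two-class output is an assumption based on the NDBE versus HGD task.

```python
def dense_params(n_in, n_out):
    """Number of weights plus biases in a fully connected layer."""
    return n_in * n_out + n_out


# Original VGG16: the flattened 7 x 7 x 512 feature map feeds a 4096-unit FC layer.
original_first_fc = dense_params(7 * 7 * 512, 4096)

# Modified head: GAP reduces the feature map to 512 values, followed by a 512-unit FC layer.
modified_first_fc = dense_params(512, 512)

# Final classification layer, assuming two output classes.
modified_output = dense_params(512, 2)

print(original_first_fc, modified_first_fc, modified_output)
# 102764544 262656 1026
```

Under these assumptions, the first FC layer shrinks by a factor of almost 400, which lowers the risk of overfitting on a dataset of this size.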

Training
We have fine-tuned the VGG16 of each ensemble-network by freezing the first four convolutional blocks. Stochastic gradient descent (SGD) was chosen as optimizer with momentum (m = 0.99). We opted for using an adaptive learning rate via cosine annealing with restarts. The learning rate fluctuated between 1e-3 and 1e-6. The learning cycle was repeated every two epochs. Furthermore, to alleviate imbalanced data between both classes, we have enforced an equal distribution of classes for each epoch. We have chosen a batch size of 100, equally sampling both classes. For each image, several data augmentations were applied, to enrich the generalization of the network and to avoid overfitting. We chose to use a combination of horizontal flip, motion blur and optical grid distortion, which in our opinion best represents the behaviour of VLE images with in-vivo data exploration. The three DCNNs were trained until convergence was achieved.
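The learning-rate schedule and the class-balanced batches can be sketched in plain Python as below. The learning-rate bounds, cycle length, and batch size come from the text. The exact cosine form, the per-step update, and sampling with replacement are assumptions, and the function names are illustrative.

```python
import math
import random


def cosine_restart_lr(step, steps_per_epoch, lr_max=1e-3, lr_min=1e-6,
                      cycle_epochs=2):
    """Cosine annealing from lr_max to lr_min, restarting every cycle_epochs epochs."""
    cycle_len = cycle_epochs * steps_per_epoch
    t = (step % cycle_len) / cycle_len          # position in the current cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))


def balanced_batch(ndbe_items, hgd_items, batch_size=100, rng=random):
    """Draw half of the batch from each class, sampling with replacement."""
    half = batch_size // 2
    batch = rng.choices(ndbe_items, k=half) + rng.choices(hgd_items, k=half)
    rng.shuffle(batch)
    return batch
```

Sampling with replacement lets the minority class fill its half of every batch even when it has far fewer images than the majority class.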

Results
The results in this section are divided into three parts: (A) clinical feature comparison of single-frame analysis and multi-frame performance, (B) performance of the three DCNNs and their ensemble, and (C) comparison of our work with the literature.
A. Clinical feature comparison. In this section we report the results obtained in the multi-frame analysis provided at our request by the authors of Scheeve et al. [15]. We compared the multi-frame classification results with the single-frame work. The multi-frame analysis for the LH feature achieved an average area under the receiver operating characteristic curve (AUC) of 0.90 ± 0.07 compared to the single-frame analysis, which achieved an AUC of 0.86 ± 0.02. In the same manner, the GS feature achieved an average AUC of 0.83 ± 0.01 for the multi-frame analysis, in comparison with an average AUC of 0.84 ± 0.02 for the single-frame methodology. Overall, we observe that for the GS feature the multi-frame analysis does not achieve better performance than the single-frame technique. In contrast, we saw an increased performance for the LH feature when the multi-frame analysis is applied (AUC 0.90 vs. 0.86).
B. Performance of the three DCNNs and their ensemble. In order to validate the ensemble of DCNNs, we have evaluated our approach using the unseen test dataset. The posterior probabilities were computed for each VLE frame using the three trained DCNNs. For each ROI, a total of 51 predictions were obtained for each DCNN. The final probability of an ROI belonging to NDBE or HGD was computed by averaging the frame probabilities over all DCNNs, reported as the multi-frame probability. In Equation (1), the multi-frame probability can be explained as the decision of an ROI belonging to a certain class A, computed by averaging the total probabilities of M frames for a number of N networks, where in our case M = 51 and N = 3:
P(A | ROI) = \frac{1}{N M} \sum_{n=1}^{N} \sum_{m=1}^{M} p_n(A | x_m)    (1)
Here p_n(A | x_m) denotes the probability that network n assigns to class A for frame x_m of the ROI.
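The averaging over frames and networks described above can be written in a few lines. The nested-list layout `frame_probs[n][m]` (network index first, frame index second) and the function name are assumptions made for this sketch.

```python
def multi_frame_probability(frame_probs):
    """Average the class probability over all frames and all networks.

    frame_probs[n][m] is the probability of the class of interest given by
    network n for frame m of one ROI (assumed layout).
    """
    total = sum(sum(per_network) for per_network in frame_probs)
    count = sum(len(per_network) for per_network in frame_probs)
    return total / count


# Example: three networks, 51 frames each, constant per-network outputs.
example = [[0.2] * 51, [0.4] * 51, [0.6] * 51]
roi_probability = multi_frame_probability(example)
```

Because every network sees the same number of frames, the joint average equals the mean of the three per-network averages.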
Given a multi-frame probability, we computed accuracy metrics for both the training dataset and the test dataset. Additionally, we computed the sensitivity, defined as the proportion of HGD ROIs that are correctly identified as such (true positives), and the specificity, defined as the proportion of NDBE ROIs that are correctly identified as such (true negatives). The receiver operating characteristic (ROC) curve was computed as well for both datasets for each trained DCNN. In Table 2, we present the results of the ensemble of DCNNs. To showcase the efficacy of our method, we report the classification results using the single-frame and multi-frame analysis for both the training dataset and the test dataset. All metrics are reported with confidence intervals (CIs) at 95%. All presented values are reported at the optimal threshold setting, which was calculated from the training dataset. We observe that, similar to the clinical-feature results presented above, we obtain an increased performance when using the multi-frame probability for both the training dataset (AUC, 0.98 vs. 0.95) and the test dataset (AUC, 0.96 vs. 0.90). Figure 4 portrays the computed ROC curve for the test dataset and the associated confusion matrix. The ROC curve was computed for each trained DCNN and the total ensemble. It shows that the combination of the three DCNNs improves the performance of our model (AUC, 0.96 vs. 0.92-0.96) (Figure 4a).
C. Comparison of our work with the literature. Previous work done on VLE images [14,15] focused on extracting features to detect neoplasia in BE patients. To compare with the literature, we have extended the work of Scheeve et al. [15] for multi-frame analysis. In Table 3, we compare our results with recent work on VLE, as well as highlight the nature of each of the studies. Our study is based on experiments with a more robust dataset, provided by increasing the number of BE patients to 45, which allows the DCNN to learn a wider range of features. Moreover, we show that using a multi-frame approach for classifying an ROI increases the confidence of our algorithm, compared to single-frame analysis (0.90 vs 0.86 with the method of Scheeve et al. [15], and 0.96 vs 0.91 with our proposed DCNN).
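For readers who want to reproduce these metrics from ROI-level probabilities, a dependency-free sketch follows. It treats HGD as the positive class (label 1) and NDBE as the negative class (label 0), and it computes the AUC through the equivalent rank (Mann-Whitney) formulation. The function names and the toy numbers are illustrative only and are not study data.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity at a given probability threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(probs, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    ranked correctly, counting ties as half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (not study data): three positive and three negative ROIs.
toy_probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(toy_probs, toy_labels, threshold=0.5)
toy_auc = auc(toy_probs, toy_labels)
```

In the toy example, two of three positives and two of three negatives fall on the correct side of the threshold, and eight of the nine positive-negative pairs are ranked correctly.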

Discussion
DCNNs can be seen as black-box models that output probabilities based on features learned during the training process, where it is difficult to control which features or patterns in the image are relevant. In contrast with previous work on VLE images [14,15], where handcrafted features were selected based on visual properties with a clinical explanation, we have trained several DCNNs that provide a decision based on learnable features from VLE images. Therefore, to observe the decisions of our trained DCNN, we computed the class activation maps (CAMs) [34], allowing us to observe which regions of the image were chosen as most important and discriminative. Examples of CAMs can be seen in Figure 5. We observe that for the HGD class, the activation maps mainly focus on concentrations of glands that are located around the first layers of the esophagus. Similar conclusions were presented in the analysis reported by Wang et al. [35]. In contrast, we consider that the activation maps of the NDBE class indicate the homogeneity of the esophagus layers. These findings suggest a possible link between clinically-inspired features and the decisions learned by our DCNNs. More validation is needed to confirm these findings.
Table 3 shows the differences between the reported studies. In comparison with Swager et al. [14], and similar to Scheeve et al. [15], we have obtained a dataset of in-vivo VLE images, but with the following differences: (1) our dataset is different from Swager et al. but closer to Scheeve et al. and (2) our dataset is larger than those of the other two references (Scheeve et al. compares only 18 patients). Our work improves on both the data acquisition and the evaluation method, by using a dataset of 22 in-vivo patients exclusively to train three DCNNs and evaluating the results on a separate set of 23 unseen in-vivo patients. We use the AUC as the comparison metric because it best quantifies the work done in the aforementioned studies. Our work shows that we achieve the best reported results by utilizing a larger dataset in combination with state-of-the-art DCNNs. Moreover, in comparison with previous studies we take advantage of adjacent VLE frames to improve the classification results for a more robust prediction of neoplasia. Although our results are a positive step towards VLE interpretation, we are of the opinion that an even larger dataset will provide more insight into understanding VLE and will pave the way towards further improving early detection of EAC.

Conclusions
Barrett's esophagus (BE) is a precursor of esophageal cancer, where detection of neoplasia in patients with this condition can enable early prevention and avoid further complications. We show that deep convolutional neural networks are capable of classifying a VLE region of interest as non-dysplastic BE (NDBE) or high-grade dysplasia (HGD). In this work we have trained several DCNNs to classify neoplasia in patients with BE and we evaluated them in a dataset of 45 patients. We obtained a specificity of 0.85, a sensitivity of 0.95 and an AUC of 0.96 on the test dataset, which clearly outperforms earlier work. Compared to earlier work, we took advantage of multiple frames of the in-vivo endoscopic acquisition to improve the confidence of our algorithm in predicting neoplasia. To the best of our knowledge, our work is the first to train DCNNs on in-vivo VLE images and validate them using an unseen dataset. In our opinion, the presented study will improve the treatment of patients with BE by providing assistance to endoscopists. Our work could potentially aid clinicians towards a more accurate localization of regions of interest during biopsy extraction and provide an assessment to replace costly histopathological examinations.

Figure 1. (Best viewed in color) Example of preprocessing applied to each volumetric laser endomicroscopy (VLE) frame. At the left side of the image the balloon line (red line) is located by calculating the average intensity of the whole image (cyan line) and then using the first derivative (yellow line) to extract the end point of the balloon pixel (green star). Scale bars: 0.5 mm.

Figure 2. Automatic workflow of the preprocessing applied to each VLE frame. First, the balloon line is detected in the VLE frame. Secondly, the image is cropped below the detected balloon line. The VLE image is then resized to 224 × 224. The mean (µ) is subtracted and the result is divided by the standard deviation (σ), according to the pre-trained ImageNet weights. Finally, the gray-scale channel is triplicated to simulate the RGB requirements of the pre-trained network.

Figure 4. (a) Receiver operating characteristic curve for the three trained networks and the resulting ensemble validated on the external dataset. AUC denotes the area under the receiver operating characteristic curve. (b) Resulting confusion matrix for the external dataset.

Figure 5. (Best viewed in color) Several VLE frames and their corresponding class activation maps (CAMs). Images (a,b) belong to regions of interest (ROIs) with high-grade dysplasia (HGD), with their CAMs shown in images (e,f). Images (c,d) correspond to ROIs with non-dysplastic Barrett's esophagus (NDBE), with their CAMs in images (g,h). Scale bars: 0.5 mm. Color bar (a-d): pixel intensity. Color bar (e-h): class activation intensity.

Table 1. Class distribution based on the regions of interest (ROIs) of the training dataset of the first 22 patients.

Table 2. Comparison of single-frame and multi-frame analysis using the ensemble deep convolutional neural network (DCNN) for both the training and the test dataset. Confidence intervals (CIs) reported between brackets.

Table 3. Evaluation of literature studies using volumetric laser endomicroscopy for detection of neoplasia in Barrett's esophagus patients.