Deep Learning for Whole-Slide Tissue Histopathology Classification: A Comparative Study in the Identification of Dysplastic and Non-Dysplastic Barrett’s Esophagus

The gold standard of histopathology for the diagnosis of Barrett’s esophagus (BE) is hindered by inter-observer variability among gastrointestinal pathologists. Deep learning-based approaches have shown promising results in the analysis of whole-slide tissue histopathology images (WSIs). We performed a comparative study to elucidate the characteristics and behaviors of different deep learning-based feature representation approaches for the WSI-based diagnosis of diseased esophageal architectures, namely, dysplastic and non-dysplastic BE. The results showed that if appropriate settings are chosen, the unsupervised feature representation approach is capable of extracting more relevant image features from WSIs to classify and locate the precursors of esophageal cancer compared to weakly supervised and fully supervised approaches.


Background
Barrett's esophagus (BE) is a precancerous condition that results from damage to the lining of the squamous esophageal mucosa. BE diagnosis is based on the endoscopic and histologic findings of the columnar epithelium lining the distal esophagus [1]. In order to increase sensitivity for dysplasia, guidelines recommend the Seattle protocol, which involves taking four-quadrant random biopsies at 1-2 cm intervals [2]. However, this protocol does not permit real-time diagnosis or therapy and is labor-intensive, leading to low adherence [3,4]. Additionally, numerous studies have documented poor inter-observer agreement among pathologists when diagnosing both low-grade [5][6][7] and high-grade dysplasia [8], suggesting significant room still exists for improvement in even the gold standard of histopathologic diagnosis. As dysplastic and non-dysplastic BE can progress to esophageal cancer, the need for an accurate and efficient diagnostic tool is evident. At the time of diagnosis, this knowledge could radically improve our clinical care by altering disease management and preventing disease complications. Therefore, there is a major need to develop innovative computational methods to translate heterogeneous histopathological images into accurate and precise diagnostics. The development of such a methodology in high-dimensional clinical research will support precision medicine, with improved diagnostics, predictions, treatments, and patient clinical outcomes. The success of these approaches relies on the appropriateness of extracted morphological features for characterizing the images.
There are some problems associated with conventional feature engineering approaches: First, the combinatorial nature of the feature extraction process makes it expensive to hand-craft features. In addition, the development of these features commonly relies on task/domain-specific expertise, preventing them from adapting to new tasks or domains. Furthermore, human bias is an inseparable part of hand-crafted features. In recent years, deep learning approaches have revolutionized the process of feature extraction tasks [9]. However, dealing with whole-slide tissue histopathology images (WSIs) offers new challenges, which demands more effective representation-learning approaches. Some challenges associated with these images include image size (typically 100,000 × 100,000 RGB pixels), high complexity, high morphological variance, and uncertainty associated with the pathology level. Automatic analysis of WSIs with typical deep learning approaches is impractical or impossible due to the above hurdles. Most recent researches has considered feature acquisition from high-resolution tissue tiles sampled from WSIs as a potential solution to these hurdles. In this approach, the final label of a given WSI is predicted based on image features extracted from sampled tissue tiles [10][11][12][13][14]. There is a rich body of literature investigating feature representation in the form of three primary approaches: fully supervised (FS) [15][16][17], weakly supervised [18][19][20][21][22][23][24], and unsupervised feature learning [25][26][27][28][29][30].
Of these, the fully supervised feature learning approach requires a large amount of accurately annotated data, which can be a labor-intensive, time-consuming, and error-prone process. These challenges are abundantly clear in the classification and segmentation of histopathology images as accurate and complete annotations can be difficult even for expert pathologists. On the opposite end of the annotation spectrum, unsupervised feature representation approaches aim to learn a discriminative representation of WSIs from annotation-free histopathology images according to the application domain. These methods extract the salient features from WSIs without requiring any image-level diagnosis as an image label or region of interest annotated by experts. Finally, weakly supervised methods have the advantages of both the fully supervised and unsupervised approaches for feature learning [31]. These approaches are not as highly dependent on annotated images as fully supervised approaches, nor are they as prior free as unsupervised approaches. Weakly supervised approaches exploit coarsely grained annotated WSIs to simultaneously classify histology images and yield pixel-wise localization scores, thereby identifying the corresponding regions of interest. Weakly supervised approaches in this way address the challenges related to the scarcity of densely annotated images.
This paper provides a comparative study that sheds light on the characteristics and behavior of different representation learning approaches through the sliding window approach, and their aggregation, by using a histogram-based method [32] for whole-slide inference, and identifying dysplastic and non-dysplastic BE on high-resolution histopathological images.

Data Collection
This study utilized previously published preliminary data to apply deep learning techniques for detecting BE and dysplasia in Hematoxylin and Eosin (H&E) stained biopsies. All patients in the study conducted by Shah et al. [6] (years 2014-2016) underwent targeted biopsy or mucosal resection, and Seattle protocol biopsies. In order to increase the sample size, a retrospective chart review was conducted to identify and retrieve biopsy slides of patients who had undergone upper endoscopies for BE surveillance (years 2016-2019). These patients all underwent high-definition white-light endoscopy (HD-WLE), narrow-band imaging (NBI), and acetic acid chromoendoscopy followed by targeted biopsies/mucosal resection, and Seattle protocol biopsies. All biopsy specimens were fixed in formalin. Samples were embedded to exhibit the full mucosal thickness. The paraffin blocks were sectioned at three microns to create biopsy slides that were stained with hematoxylin and eosin. All suspected diagnoses of dysplasia or malignancy required a consensus of two or more pathologists. For patients included in Shah et al.'s study [6], a blinded expert pathologist also reviewed all biopsy specimens. Blinded and unblinded pathology results were prospectively recorded.
This study was approved by the Hunter Holmes McGuire Veterans Affairs Medical Center Institutional Review Board and the University of Virginia Institutional Review Board for Health Science Research (IRB-HSR #21328).

Esophageal Biopsy Datasets
Tissue images were digitized at 40× magnification via scanning of biopsy slides using a Hamamatsu NanoZoomer S360 Digital slide scanner C13220 [33]. A total of 387 whole-slide images from 130 unique patients were collected. WSIs increased to 650 after pre-processing and cropping; 115 whole-slide images from 10 patients were selected to train deep models to extract patch-level image features in all three feature learning approaches, and the rest of the dataset was used for model evaluation. To train deep models in fully supervised approaches, these WSIs were manually pixel-wise annotated to highlight each class' examples within each whole-slide image (see Figure 1).

Figure 1.
An example of the annotation process on a typical whole-slide image (WSI). Red, green, and yellow highlighted areas indicate areas that were annotated and from which labeled patches were taken. Squamous tissue (green arrowhead), non-dysplastic Barrett's with Goblet cells (yellow arrowhead), and dysplastic tissue with crowding and hyperchromasia (lower zoomed section) were all present within the same whole-slide image.

Deep Learning-Based Feature Representation
As a result of using the histogram-based method, each image is represented as a histogram; the mapping mechanism and histogram structure depend on the feature extraction approach (i.e., fully supervised, weakly supervised, or unsupervised). To extract histological features from tissue tiles, each WSI X i , i = 1, ..., N is considered as a set of tissue tiles X i = {x 1 , ..., x n i }. This method uses a function M to map each image tile x belonging to WSI X i to a concept c k , k = 1, ..., K. In practice, instead of x ∈ R D , its representation learned by neural network f ψ (.) with parameter ψ is considered, f ψ (x) ∈ R E in which E << D. All concepts can be organized in as a vocabulary V = {c 1 , ..., c K }. Consequently, WSI X i is mapped to histogram H i = (h 1 , ..., h K ) as follows [32]: The k-th bin in H i is calculated as follows: Here, p(c k | f ψ (x)) is the likelihood that embedding vector of tissue tile x belongs to c k . In other words, the image-level histogram is the normalized frequency of each concept c k in the corresponding image.

Fully Supervised Feature Learning
In this approach, a convolutional neural network (CNN) is trained on high-resolution tissue tiles sampled from annotated regions of WSIs in the training set. Here, vocabulary V = {c 1 , ..., c K } is the set of K classes and mapping function M is the fully connected classifier network of the CNN which outputs class probabilities )) for each input tissue tile x. The summation over patch-level class probabilities given by the CNN for all image patches belonging to a single image generates its image-level histogram; the value of its k-th bin is derived from Equation (2). Figure 2 represents the overview of fully supervised feature representation approach.  Figure 2. Overview of the fully supervised feature extraction framework. (A) A convolutional neural network (CNN) is trained on high-resolution tissue tiles sampled from annotated regions of WSIs in the training set. (B) Next, the trained model is employed to output the class' probability distributions for each high-resolution tissue tile generated from new WSIs. The patch-level probabilities corresponding to all patches derived from a WSI are aggregated into WSI-level probabilities histogram.

Unsupervised Feature Learning
One of the main unsupervised approaches for feature representation is the bag-of-features framework. This approach was inspired by the bag-of-words scheme used for text categorization and text retrieval [27]. It consists of two main stages: In the first stage, an appropriate codebook is learned for representing the images of interest. A codebook is a visual vocabulary V = {c 1 , ..., c K } including K representative local descriptors codified as visual words; in the second stage, each image is encoded based on each codeword's frequencies in the image. Thus, the resulting representation of the image is a histogram of the codewords. In more detail, image classification using the bag-of-feature approach can be described in the following steps:

Codebook Learning
In this step, salient features of an image are identified. In the literature, different strategies have been proposed for local feature extraction from histopathology images. Popovici et al. employed the Gabor wavelets [30], and Caicedo et al. used the scale-invariant feature transform (SIFT) to extract features [27]. In this study, we employed a convolutional auto-encoder (CAE) trained in an unsupervised fashion to map each tissue tile into a low-dimensional embedding space. We then used a Gaussian mixture model (GMM) to cluster extracted features from tissue tiles into several clusters. Each cluster is considered a visual descriptor or codeblock, which are components of a codebook. Selection of the number of clusters (codebook size) is an important decision in codebook construction. This parameter should be guessed/optimized and then imported to the model as an input.

WSI Encoding
After codebook learning, the histogram of codeblocks' occurrences in the set of local features of an image is considered as image representation. This concept was inspired by term frequencies (TF) in text applications [27]. The hard assignment or soft assignment of patches to the clusters can be considered, depending on which clustering algorithm is used. In the case of employing k-means (KM) clustering, which gives a hard assignment of instances to clusters, image-level histogram values are calculated based on Equation (3).
where h k is the value of codeblock k-th in the generated histogram, and c p is the cluster that image patch p-th belongs to. If soft assignment is considered, a WSI-level histogram would be calculated according to Equation (2), where c k is the k-th cluster and p(c k | f ψ (x)) is the posterior probability of cluster k-th given embedding vector f ψ (x), which in GMM is derived from the following equation [34]: where π m is the probability of component m, and N is the multivariate Gaussian distribution with mean µ m and covariance matrix Σ m . Figure 3 represents the overview of the unsupervised feature representation approach.

Weakly Supervised Feature Learning
An overview of the weakly supervised feature representation approach is shown in Figure 4. In this section, multiple instance learning (MIL) [32] as a particular form of weakly supervised learning and an expectation-maximization (EM)-based method [22], which was developed as an improvement over MIL, are investigated in more detail.

Multiple Instance Learning
In MIL, a classifier is trained on a training set of labeled bags; each contains multiple instances. Here in the training phase, unlike the fully supervised approach in which a CNN is trained on tissue tiles sampled from pixel-wise annotated regions of WSIs, a CNN is trained on tissue tiles that are sampled from labeled WSIs. The training instances are not individually labeled, and it is assumed that each instance's label is the same as the corresponding WSI. In this setting, some instances inside one bag might be more related to other classes of bags. These instances do not convey any relevant information about the class, providing some confusing information.
Similar to the fully supervised, in this approach, vocabulary V is a set of K classes, and mapping function M is the fully connected classifier network of CNN which outputs class probabilities )) for each input tissue tile x. Finally, each WSI is encoded as an image-level histogram by aggregation over patch-level class probabilities given by the trained CNN for its tissue tiles.

Expectation-Maximization Model
In the EM method, which was proposed by [22], the main goal is to train the model over discriminative patches, i.e., those more likely to have the same labels as their corresponding WSIs. As the patch-level labels do not match the WSI labels, to avoid misleading the model by training over patches with incorrect labels, the model will be trained over patches, which are more likely to have the correct labels. It is assumed that there are hidden binary variables z corresponding to each tissue tile extracted from the labeled WSIs to detect discriminative patches. The hidden variable's value is equal to 1 if and only if this tissue tile is discriminative for the corresponding WSI. In this method, at the initial E-step, it is assumed that all tissue tiles are discriminative (i.e., for all images and all tissue tiles, z = 1). In the M-step, a CNN is trained over discriminative tissue tiles to maximize data likelihood. Here, vocabulary V = {c 1 , ..., c K } is a set of K classes and the fully connected classifier network of CNN is a mapping function M which outputs class probabilities M( f ψ (x), V) = (p(c 1 | f ψ (x)), ..., p(c k | f ψ (x)), ..., p(c K | f ψ (x))) for each input patch x. Then, in the E-step, the hidden variables z are estimated by applying Gaussian smoothing on p(c k | f ψ (x)) for all patches belong to a single image. Those patches will be considered discriminative if and only if p(z|X) is above a certain threshold. These tissue tiles are selected to continue training the CNN. Expectation and maximization steps are repeated until convergence is reached.

Slide-Level Inference
After encoding of WSIs using aggregation of patch-level labels, the image-level histograms are employed to train a classifier to predict the WSI-level labels. Since this decision-level fusion scheme [22] considers different patterns of the combination of patch-level labels to predict image-level class, it is more robust than the max-pooling method, which only considers the class with more tiles to predict the label of WSI. Additionally, because this image representation is generated by aggregating lots of tiles, it is very robust to misclassified tiles [22].

Feature Importance
Regions of interest (ROIs) detected by the models are considered to evaluate each model's interpretability. A softmax function fit in the CNN architecture outputs each class' probability distribution for each tissue tile in fully supervised and weakly supervised approaches. Visualizing these probabilities for every patch in a WSI generates a heatmap that highlights the attention areas associated with each class. Nevertheless, for the unsupervised approach, there is not a one-to-one correspondence between classes and codeblocks c k , k = 1, ..., K. In this setting, each tissue tile's importance for class C, I C x is calculated as follows: where p(c m | f ψ (x)) is the posterior probability of codeblock m-th given f ψ (x) (embedding vector of tissue tile x), and I C c m is the importance of the same codeblock for class C. We use the permutation feature importance to calculate per-class importance of each feature. In this method, each feature's importance for a specific class is defined to be the increase in the models' prediction error when values of that feature are randomly shuffled [35].

Patch Extraction
We employed a sliding window method on each WSI at 40× magnification to generate patches of size 128 × 128 pixels. Tissue tiles with less than 50% tissue sections were discarded. Image augmentation was also performed by horizontal flipping and random 90-degree rotations during training to prevent CNNs from over-fitting.
A common issue that causes bias while training the model on histopathological images is color variation. This issue, which originates from various sources, including differences in raw materials, staining protocols, and digital scanners [36], should be addressed and resolved as an essential pre-processing step before any analyses. Various solutions, such as color balancing [37], gray-scale, and stain normalization, have been proposed in the published literature to address the color variation issue. In this study, we used gray-scale images, and before converting the RGB patches to gray-scale, the stain normalization approach proposed by Vahadane et al. [36] was applied to make sure that the effect of variation of color intensity was significantly reduced.

Deep Models Architecture
For fully supervised and weakly supervised feature representation, we used the ResNet34 [38] architecture as our baseline architecture. We removed fully connected layers from the original network and employed the ResNet backbone as feature extractor followed by a dense layer with 1024 neurons that received the flattened output of the feature extractor. Finally, our softmax output layer was added to deliver the probability of each class. We used dropout on the fully connected layers with p = 0.5 as the regularizer.
For unsupervised feature extraction, ResNet18 was employed as a convolutional encoder. The decoder comprised convolutional and up-sampling layers to increase the size of the feature map and get back the original size of input image.

Experimental Setup
We used 115 WSIs from 13 patients (around 18% of the dataset) to train deep models to extract patch-level image features for WSIs encoding to associated histograms. After encoding another 535 WSIs, we employed 10-fold cross-validation, and all the models used the same groups of training and test sets for a fair evaluation. By changing some components of the approaches such as classifier type (SVM or random forest (RF)) in the decision fusion method and clustering algorithm (KM or GMM) for codebook learning in unsupervised feature learning, 10 models were evaluated in total. These models were as follows:

1.
FS-RF: Image features were extracted using the fully supervised approach, and the random forest was employed for image-level decision fusion;

2.
FS-SVM: Image features were extracted using the fully supervised approach, and SVM was employed for image-level decision fusion; 3.
MIL-RF: Image features were extracted using the MIL approach, and the random forest was employed for image-level decision fusion; 4.
MIL-SVM: Image features were extracted using the MIL approach, and SVM was employed for image-level decision fusion; 5.
EM-RF: Image features were extracted using the EM approach, and the random forest was employed for image-level decision fusion; 6.
EM-SVM: Image features were extracted using the EM approach, and SVM was employed for image-level decision fusion; 7.
KM-RF: Image features were extracted using an unsupervised approach that applies the k-means clustering algorithm to learn codewords. This model employs the random forest for image-level decision fusion; 8.
KM-SVM: Image features were extracted using an unsupervised approach that applies the k-means clustering algorithm to learn codewords. This model employs SVM for image-level decision fusion; 9.
GMM-RF: Image features were extracted using an unsupervised approach that applies the GMM clustering algorithm to learn codewords. This model employs the random forest for image-level decision fusion; 10.
GMM-SVM: Image features were extracted using an unsupervised approach that applies the GMM clustering algorithm to learn codewords. This model employs SVM for image-level decision fusion.
These models are summarized in Table 1. Accuracy, the area under the ROC curve (AUC), precision, recall, and F1 score were considered as evaluation metrics for assessing the models' performance.

Classification Results and Statistical Analysis
The performance of a classification model was highly correlated with the degree of separability between different classes. Before applying classification algorithms on encoded WSIs, we visualized features extracted by different methods using the principal component analysis (PCA) method to understand better how well each method characterizes the histopathology images' visual contents. Figure 5 shows the results of PCA. What can be deduced from the graphs is that in all models, squamous WSIs were encoded relatively separately from dysplastic and non-dysplastic BE. However, there is considerable confusion between these two classes. Unexpectedly, the degree of separation between images encoded by the fully supervised approach was not better than those of other methods, and the confusion between BE with dysplasia and without dysplasia and even the confusion between squamous and two other classes was higher compared to some other methods. The unsupervised approach with GMM clustering demonstrated a considerable performance in the extraction of image features, which contributed to high separability between different classes. k-means was an exception-its graph does not show acceptable separability performance, at least between two hard-to-separate classes: dysplastic and non-dysplastic BE. As the diagrams show, the MIL and EM algorithms' performances for discriminating between classes were not much different in the domain of weakly supervised approaches. In this setting, applying the EM algorithm to detect discriminative patches to improve the training process did not lead to a significant improvement over MIL.
Classification results can further refine our findings from the PCA plots. We used 10-fold cross-validation to evaluate the models' performances, randomly dividing 120 patients in the test set into 10-folds, using all WSIs belonging to a single patient in a fold for testing and other WSIs for training the classifier. The classification results of esophageal WSIs in three classes of squamous, dysplastic BE, and non-dysplastic BE for different models are summarized in Table 2. The reported values are averages with 95% confidence intervals. For computing confidence intervals, numbers greater than one were truncated to 1.
As the number of clusters is a key parameter in the performance of clustering methods, we evaluated different numbers of clusters (codeblocks) for both k-means and GMM to pick a decent number of codeblocks given our dataset and type of classifier. We used weighted AUC as an evaluation metric. Table 3 summarizes the results of the evaluation of different numbers of clusters for k-means and GMM. These results demonstrated that the optimal feature number was as follows: 200 features for the combination of k-means and SVM, 100 features for the combination of k-means and random forest, and 150 features for GMM. Figure 6 shows boxplots of the 10-fold cross-validation for different metrics. As shown, GMM-RF and GMM-SVM performed better than the other models, and as PCA plots show and boxplots confirm, the performances of classification models on WSIs encoded by the fully supervised approach and also WSIs encoded by employing k-means clustering were poorer than those of the other models.
In terms of the classification model, results show that there is not a significant difference between the performances of SVM and random forest given the same set of encoded WSIs. In other words, choosing the type of classifier is not a critical decision as opposed to deciding on the feature extraction approach.

Visualization of ROIs on WSIs
For more scrutiny about the models' performances, heatmaps using different feature extraction approaches on WSIs with dysplastic BE and non-dysplastic BE were considered. In this evaluation, the performances of weakly supervised (MIL and EM) and unsupervised (GMM) approaches, which have better performances in the classification of WSIs, were examined. Some heatmaps for WSIs from both classes generated by the three approaches mentioned above are presented in Figure 7. In each heatmap, image regions that are more important given the associated classes are in red. As can be seen, the performance of the unsupervised approach in ROI detection is far better than the performances of the weakly supervised approaches.

Discussion
In this comparative study, the performances of different deep learning-based approaches for extracting image features from tissue tiles for identification of dysplastic and non-dysplastic BE on whole-slide tissue histopathology images were evaluated. The results demonstrated the capability of the unsupervised approach in extracting relevant image features from WSIs if an appropriate setting is chosen. This is important because the unsupervised approach without utilizing any manually annotated image or even image-level diagnosis learned a discriminative representation of esophageal WSIs.
By contrast, the fully supervised feature extraction approach did not achieve as high a performance as the other two approaches. Although the fully supervised approach used the same deep learning model as the weakly supervised approach, difficulties in obtaining accurate annotations on WSIs may explain the relatively poor performance of the fully supervised approach. Specifically, distinguishing non-dysplastic BE from low-grade dysplasia can be challenging even for human pathologists, and annotations involving confusing areas may muddle the inputs for the fully supervised model.
In the unsupervised approach, a CAE was applied to map tissue tiles to an embedding space to reduce the dimensionality of raw image features. Then, the embedding vectors were clustered into using a clustering algorithm independently of the first step. Although employing GMM for the soft assignment of embedded features to clusters and encoding WSIs generated promising results, the assumptions underlying the dimensionality reduction techniques are generally independent of the clustering techniques' assumptions. Thus, there is no theoretical guarantee that the network would learn feasible representations. End-to-end training of such a model can be an improvement over the unsupervised approach applied in this study. Applying such a model ensures that features are learned that lead to the best results in the WSI classification phase.
A deep learning-based model for detecting and locating dysplastic and non-dysplastic BE patterns on histopathologic images has a wide variety of applications in clinical settings. Such a model can be integrated into clinical information management systems as a decision support system. These support systems can provide clinicians with "possible" diagnoses or improve confidence in their assessments via providing second opinions for prognostic decision-making of more challenging histopathological patterns. Successful implementation of this system can support a more accurate classification of pre-malignant diseases of the esophagus.
This study has some limitations. First, all biopsy images used for this study were collected from a single center and scanned with the same equipment. Thus, such data might not be representative of the entire range of histopathological patterns in patients worldwide. Collaboration with other medical centers and collecting more images would allow us to refine our model using a more diverse dataset. Furthermore, in this study, we applied the gray-scale images to reduce color variation to prevent the model from being misled. Repeating this study with color images while at the same time applying an approach to mitigate the color variation problem could be another potential area of future work.

Conclusions
In this paper, we performed a comparative study on different feature representation approaches, including unsupervised, weakly supervised, and fully supervised approaches to classify precursors to esophageal cancer using whole-slide histopathology images. We used a two-step process, in which, in the first step, a feature representation of WSIs was learned as an image-level histogram, and a decision fusion model was trained on histograms in the second step to output the final labels of new WSIs. Considering different feature extraction approaches in the first step and two different classification models (i.e., SVM and random forest) in the second step, 10 models were evaluated on an independent test set of 535 WSIs from 120 patients. The results showed that applying the unsupervised approach can lead to extracting more relevant image features from WSIs, and consequently, better identifying dysplastic and non-dysplastic BE. Employing a CAE for mapping tissue tiles in an embedding space while aiming for dimensionality reduction, and then their soft assignment to a relatively large number of clusters using GMM can generate a discriminative representation of esophageal WSIs. Applying a classification algorithm in such representations accurately predicted the labels of WSIs. Highlighting the histopathological patterns identified by the models that contributed to the WSI classification and comparing them with the same images annotated by our pathologist annotators demonstrated a comparable diagnostic performance of the unsupervised feature representation approach. Nevertheless, despite having relatively good classification results, weakly supervised approaches showed a poor performance in locating regions of interest on WSIs. Thus, the model trained on histopathological features learned by the unsupervised approach, if confirmed in clinical trials, could be employed to improve diagnostic procedures of precursors to esophageal cancer.