Ensembling EfficientNets for the Classification and Interpretation of Histopathology Images

Abstract: The extended utilization of digitized Whole Slide Images is transforming the workflow of traditional clinical histopathology to the digital era. The ongoing transformation has demonstrated major potential for the exploitation of Machine Learning and Deep Learning techniques as assistive tools for specialized medical personnel. While the performance of the implemented algorithms is continually boosted by the mass production of Whole Slide Images and the development of state-of-the-art deep convolutional architectures, ensemble models provide an additional means of improving prediction accuracy. Despite the earlier belief that deep convolutional networks are black boxes, important steps towards the interpretation of such predictive models have also been proposed recently. However, this trend has not been fully explored for ensemble models. This paper investigates the application of an explanation scheme for ensemble classifiers, while providing satisfactory classification results on histopathology breast and colon cancer images in terms of accuracy. The results can be interpreted through the hidden layers' activations of the included subnetworks and are more accurate than those of single-network implementations.


Introduction
Machine learning techniques, with a dedicated emphasis on deep learning methodologies, have been applied successfully in the field of health informatics as assistive tools that relieve the workload of specialized medical personnel [1,2] and serve educational purposes [3]. The iterative evolution of these algorithms has brought to light more effective implementations that exceed the discriminative capability of the human eye [4][5][6] and enhance objectivity by quantifying visual patterns. These improved implementations are, therefore, applied for the reliable and precise prognosis and diagnosis of pathologic cases.
The processing of traditional medical imaging material, such as MRI, X-ray, ultrasound, endoscopy, thermography, tomography, microscopy, and dermoscopy, has been transformed into its digital version, providing numerous benefits in a variety of tasks that were earlier performed manually [2,7-13]. The abovementioned tasks fall under the umbrella of well-known computer vision tasks, namely, semantic segmentation [14,15], generation [16], registration [17,18], image classification [15], and object detection [19]. In the last decade, the documented ability of deep convolutional networks to identify visual patterns beyond human perception has been gaining popularity in the field of digital pathology as well. Driven by the rise of digital scanners that produce whole slide images, the assessment of human tissue in histopathology images can be conducted by means of a virtual microscope. A whole slide image, occupying on average 10 GB, can satisfy the needs of data-hungry deep convolutional networks and alleviate issues concerning the creation, handling, and preservation of glass slides. In this framework, patches extracted from whole slide images are fed as inputs to deep convolutional networks in a supervised or unsupervised manner, exploiting the latest developments in the field of deep learning, such as transfer learning with pretrained models and unlabeled training via autoencoders or Generative Adversarial Networks (GANs) [20,21]. Apart from deep learning techniques, machine learning algorithms have been utilized in the field of digital pathology for content-based image retrieval and classification of histopathology images. Although first introduced for text classification, the Bag of Words technique is utilized in [22,23] to describe dense imagery content and exploit it in the designated tasks.
However, whole slide imaging confronts the scientific community with a new breed of challenges that need to be addressed, mainly related to the polymorphism of the data formats, big data management, the standardization of staining, and the transparency and explainability of predictions.
In this work, we focus on breast and colon cancer, two of the most lethal among the different kinds of cancer that cause high rates of mortality worldwide. Breast cancer is the leading cancer in terms of incidence for women [24], whereas colon cancer ranks second for women and third for men [25]. Utilizing automated machine learning techniques for prognosis and diagnosis is vital for the early detection of malignancies in both cases, aiming at total healing and the avoidance of metastasis [26,27]. Towards this direction, researchers in the field of digital pathology have systematically studied these specific forms of cancer. Although the availability of datasets is immense and the reported results of deep learning techniques are high [28], the need to explain the connection between the input and the result is overlooked. This need is compelling, especially for predictive models in healthcare information systems, where the responsibility for high-stake decisions is heavy. "In order to build trust in intelligent systems and move towards their meaningful integration into our everyday lives, it is clear that we must build 'transparent' models that have the ability to explain why they predict what they predict" [29].
Ensemble classifiers existed before the rise of deep learning and were utilized in machine learning methods with the main purpose of increasing the performance of the classifiers that they consist of. Dating back to ancient Greece and the foundation of Democracy, the idea of ensemble classifiers derives from the human best practice of seeking the opinions of different experts before taking high-risk decisions. The expert's opinion in the domain of machine learning is represented by the prediction of a classifier. In an ensemble classifier, the input is analyzed by a set of classifiers, each implementing its own algorithmic logic, resulting in a set of corresponding predictions that need to be combined in order to reach a final total prediction. Ensemble models have shown remarkable performance and the capability of correcting the faulty predictions of the individual predictive models they include [30]. An example of exploiting the benefits of ensemble classifiers in the field of medical imaging can be found in [31], where the authors employ a new weighted voting procedure in a self-supervised scheme to improve the classification of medical X-ray and computed tomography images. Apart from boosting the performance metrics, their simple implementation, which relies on different architectural combinations, makes it possible to impose explainability modules on top of existing architectures. In [32], the authors presented a weighted patch ensemble method that requires the modification of the ensemble classifier for the integration of the explainability scheme. In this work, the proposed methodology maintains the classification scheme without modifications. This is an important feature to consider, since the alteration (removal or addition) of layers may significantly influence the performance of the classifier.
Therefore, leaving the neural network intact when integrating an explainability scheme is an important advantage. This integration is also made possible by the nature of the well-known gradient-weighted class activation mapping (Grad-CAM) technique [29], which can be applied effortlessly to the last convolutional layer of existing deep learning schemes without interfering with the functionality of the predictive model.
In this paper, we propose an ensemble classification scheme that is based on implementations of state-of-the-art deep convolutional networks, namely, EfficientNets [33]. Our contribution lies in the combination of this ensemble classifier with a Grad-CAM explanation scheme that can highlight the visual patterns responsible for each class prediction, while providing promising results. Furthermore, a standalone application that follows the principles of distributed computing is available for online validation and experimentation, providing its functionality (classification and explainability) as a web service. The remainder of the paper is organized as follows. In Section 2, the utilized datasets, hardware, and deep convolutional neural network (CNN) architectures and methods are described in detail, and in Section 3 the performance and explainability results are shown. In Section 4, the provided results are discussed in a broader context and future work directions are indicated, whereas Section 5 concludes the paper.

Deep Learning Methods
The methodology of ensembling involves the combination of well-established classifiers in reaching a final decision. For the purposes of this study, deep convolutional neural networks are employed as the main 'ingredients' of an ensemble classifier. Starting from the recently developed group of CNNs called EfficientNets, we explore the potential of combining state-of-the-art approaches for classifying histopathology images, with an emphasis on providing explainable results by means of the Grad-CAM technique. Other types of deep architectures that are utilized herein are the InceptionNet, XceptionNet, and the ResNet. When combined in an ensemble classifier and extended with the Grad-CAM explainability scheme, the final configuration achieves higher performance and provides plausible connections between the input and the result.

EfficientNets
EfficientNets are a group of deep convolutional networks that match or surpass state-of-the-art accuracy in different classification tasks with up to ten times better efficiency (smaller and faster), hence the name. Their main novelty lies in the latest achievements of AutoML and, specifically, in the intelligent and controlled expansion of the three dimensions (width, depth, resolution) of a neural network by means of a compound coefficient. Throughout years of research, the basic concern has been to grow a neural network's dimensions in such a way that accuracy is improved with the minimum number of operations under given resource constraints. Even when minimizing operations is not a primary goal, increasing the dimensions of a neural network in a greedy manner does not yield the expected results due to the vanishing gradient phenomenon. EfficientNets address this issue by exploring the relation between the increases in each dimension and applying a grid search under a fixed resource constraint instead of changing these dimensions arbitrarily. The compound scaling method is summarized in the set of Equations (1), where φ is a global scaling factor that controls how many resources are available and α, β, γ determine how to allocate these resources to network depth, width, and resolution, respectively. By assigning φ = 1 and applying grid search, α, β, and γ can be determined for a given convolutional architecture to achieve better accuracy. Once α, β, and γ have been defined, φ can be gradually increased to augment the dimensions of the network towards better accuracy. The scaling method is applicable to any convolutional architecture that consists of a repeated pattern of layers. However, the authors of the EfficientNets paper proposed a specific architecture where the main building block is the mobile inverted bottleneck convolution (MBConv), shown in its three basic configurations in Figure 1.
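The set of Equations (1) is not reproduced in the text above. Assuming that it follows the compound scaling rule as formulated in the EfficientNets paper [33], it can be written as follows, where d, w, and r denote the depth, width, and resolution multipliers (these three symbol names are taken from [33], not from this text):

```latex
\begin{equation}
\begin{aligned}
\text{depth: }      & d = \alpha^{\phi}, \\
\text{width: }      & w = \beta^{\phi}, \\
\text{resolution: } & r = \gamma^{\phi}, \\
\text{s.t. }        & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad
                      \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
\tag{1}
\end{equation}
```

Under this constraint, each unit increase of φ roughly doubles the number of floating-point operations, which is why φ can be read as the knob that controls the available resources.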
The base model of the EfficientNets group is EfficientNet B0, and its architecture is shown in Table 1, consisting mainly of MBConv1 and MBConv6 blocks. By utilizing MBConv blocks and increasing the value of φ, the EfficientNet group reaches its most complex form, B7. At the heart of these building blocks lie two important innovations: the depthwise separable convolution [34], which performs the functionality of a normal convolution with fewer resources, and the squeeze and excitation unit, which enables the network to perform dynamic channelwise feature recalibration [35]. In the depthwise separable convolution, the convolution operation is divided into two parts. First, the convolution is conducted depthwise, meaning that the convolution kernel is applied to each channel individually in order to learn channel-dependent features; second, it is conducted pointwise, meaning that a 1 × 1 kernel is applied to each point in order to combine the channel-dependent learned features. The squeeze and excitation unit also consists of two parts. In the squeeze part, global average pooling is applied to each channel, leading to the formation of a 1 × 1 × C vector (where C is the number of channels), followed by a fully connected → ReLU → fully connected → sigmoid block (excitation part). In this manner, each channel is enhanced with additional information concerning the other channels and captures the interactions between them. Finally, the output of the excitation part is multiplied with the original input. The above-mentioned building blocks and architectures are lessons learned through months of development and experience in the ever-evolving domain of deep learning, and they encapsulate notions that have been partially tested and evaluated in earlier deep learning architectures such as ResNet [36], XceptionNet, and InceptionNet [37].
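The two building blocks described above can be illustrated with a small, framework-free sketch. It is not the implementation used in this work: the function names, the nested-list tensor layout (channels × height × width), and the omission of biases are assumptions made purely for illustration.

```python
import math


def conv_params(kernel, c_in, c_out):
    """Weights of a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depthwise (one kernel per channel) plus pointwise (1 x 1) convolution."""
    return kernel * kernel * c_in + c_in * c_out


def squeeze_excite(feature_maps, w1, w2):
    """Channelwise recalibration: pool -> FC -> ReLU -> FC -> sigmoid -> scale.

    feature_maps: list of C channels, each an H x W list of rows.
    w1: C x R weight matrix (reduction); w2: R x C weight matrix (expansion).
    """
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation, first fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(s * w1[c][r] for c, s in enumerate(squeezed)))
        for r in range(len(w1[0]))
    ]
    # Excitation, second fully connected layer followed by sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(h * w2[r][c] for r, h in enumerate(hidden))))
        for c in range(len(w2[0]))
    ]
    # Scale: multiply every value of a channel by that channel's gate.
    return [[[v * gates[c] for v in row] for row in ch] for c, ch in enumerate(feature_maps)]
```

For a 3 × 3 kernel with 32 input and 64 output channels, `conv_params(3, 32, 64)` counts 18432 weights, whereas `separable_conv_params(3, 32, 64)` counts 2336, which illustrates the resource saving attributed to the depthwise separable convolution.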
These approaches achieved state-of-the-art results in computer vision tasks because they have incorporated these blocks partially. Once combined in a structured manner by means of a controlled augmentation mechanism such as in the EfficientNets, the performance is further improved.
ResNets are driven by the intuitive need for neural networks to grow deeper in order to understand and quantify more complex features while simultaneously compensating for the vanishing gradient issue. The authors discovered that, by adding the identity function between layers, the network can reach deeper architectures and cope with the vanishing gradient issue, since the layers where the gradients diminish rapidly are bypassed. Since its publication, the idea has spread quickly and is being utilized in different deep CNN architectures, including EfficientNets.
Rather than investing in deeper architectures, the authors of InceptionNet prioritized the creation of wider approaches, meaning filters of multiple sizes, and balanced these two dimensions in order to capture salient patterns that appear in the image at different sizes. The initial version V1 was improved in terms of accuracy and speed by adding an auxiliary classifier during the training process, factorizing convolution operations, and placing them on a wider grid. Through further improvement of the initial proposal, the InceptionNet has now reached its fourth version. A combined approach of ResNet and Inception has also been proposed by enhancing the latter with residual blocks (Inception-ResNet). Moving a step forward, an extreme version of the InceptionNet, called XceptionNet, managed to achieve even better results, inspired by the inverse sequence of operations in the depthwise convolution (first proposed in InceptionNet) and the removal of nonlinearity between convolutional layers.
In order to select the best performing DCNN architectures, multiple tests were performed with the two datasets and each of the above-mentioned approaches. The results verified the superiority of EfficientNets over the other approaches. Due to these preliminary tests, the ensemble scheme proposed later in this paper consists only of different ranks of EfficientNet.

Ensemble Classifiers
The notion of ensemble classifiers rests on the founding principles of democracy as it was first established in ancient Greece. The Greeks did not need much to realize that the best decision is reached only when many opinions (the opinions of the people) are heard and processed. This simple yet efficient idea has become an intuitive action for modern humans, since, on the verge of taking an important decision, they ask for the opinion of several experts. Leaving the empirical and intuitive evidence aside, the literature in the health informatics domain clearly shows that classifiers produce more accurate results when they are gathered together and their predictions (opinions) are combined in different ways to reach a final result [38][39][40][41][42]. The manner in which the different base classifiers are combined is one of the basic criteria for characterizing ensemble classifiers. The basic classification of ensemble classifiers consists of the following three major categories: bagging, boosting, and stacking. Bagging is based on a parallel and independent learning procedure of base classifiers that are in turn combined as dictated by a deterministic averaging process, while boosting corresponds to a sequential learning method that adaptively modifies the distribution of the training set based on the accuracy of previously trained classifiers [41]. Stacking refers to a parallel learning algorithm that results in the training of a meta-model. This meta-model is responsible for the combination of the base learners' predictions. Another way of categorizing ensembling methods is related to the input patterns. Utilizing different classifiers, where one is trained with the original input and others with modified input versions, is common practice [42].
A further criterion separates ensemble classifiers into those that utilize different classifiers to solve the same task and those that break the original task into subtasks and employ a different classifier for each decomposed problem [43]. Ensemble classifiers can also be distinguished by the manner in which diversity between base classifiers is achieved. There exist randomized methods that populate an ensemble classifier with other classifiers, as well as metrics-based techniques whose main concern is to increase diversity to an extent that does not harm performance [44,45].
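As a generic illustration of how the predictions of base classifiers can be combined, the following sketch shows averaging of class probabilities (the deterministic averaging used in bagging-style ensembles) and a simple majority vote. It is only an example of prediction-level combination under assumed function names and toy inputs; the scheme proposed in this paper instead combines the base classifiers at the feature level, as described in the System Architecture and Methodology section.

```python
from collections import Counter


def soft_vote(prob_lists, weights=None):
    """Average the class-probability vectors of several classifiers.

    prob_lists: one probability vector per classifier, all of equal length.
    weights: optional per-classifier weights; uniform weights if omitted.
    Returns (index of the winning class, averaged probabilities).
    """
    n = len(prob_lists)
    weights = weights or [1.0 / n] * n
    num_classes = len(prob_lists[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists))
        for c in range(num_classes)
    ]
    winner = max(range(num_classes), key=averaged.__getitem__)
    return winner, averaged


def majority_vote(labels):
    """Return the label predicted by most classifiers."""
    return Counter(labels).most_common(1)[0][0]


# Three hypothetical classifiers scoring one image as (benign, malignant).
winner, averaged = soft_vote([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```

In this toy case, the first classifier alone would predict the benign class, but the averaged probabilities favor the malignant class, which shows how an ensemble can correct the faulty prediction of a single member.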

Explainability
Ensemble classifiers are widely utilized in classification tasks for their well-recognized ability to improve performance metrics in terms of accuracy. However, when dealing with high-stake predictive models such as those in healthcare applications, there are also major concerns related to the explanation of decision-making and the avoidance of erroneous decisions. In our effort to construct models that can decipher the uncertainty of real-world problems, we have created black box mechanisms that produce accurate results but are not transparent and trustworthy [46]. For experts to embrace AI in the healthcare domain, the provided predictions should be retraceable and reliable. In this framework, the efforts of computer vision researchers are directed towards the discovery of methods that can highlight the relationships and interactions between the visual patterns included in an input image and the final prediction. Unveiling these connections is of crucial importance [47], since humans demand that health-threatening decisions are thoroughly justified.
Especially in the domain of computer vision and deep learning XAI (Explainable Artificial Intelligence), attempts to extract localization information of important visual patterns for decision-making have been widely witnessed. One way to achieve this goal is the construction of class activation maps [48]. Class activation mapping is a method that indicates the discriminative regions of an image that influenced the predictive model in reaching its final decision. Initially, the predictive model had to follow a certain architecture for the technique to provide plausible results, meaning that the output of the convolutional layers had to be directed to a global average pooling layer and then directly to a SoftMax activation function. This architecture, as discussed earlier, demands retraining of the predictive model and sacrifices complexity (added by the insertion of fully connected layers) for explainability. A generalization of this method (Grad-CAM) is proposed in [29]. In the same paper, the combination of Grad-CAM with the guided backpropagation technique is proposed to provide fine-grained pixel-to-pixel visualizations. This approach fits better the visual characteristics of digital pathology images, where the patterns correspond to small cellular structures as opposed to larger structures. By computing the gradients of the score of each class with respect to the feature maps from the last convolutional layer and performing global average pooling on them, the importance weights for each feature map are obtained. In this fashion, the architecture of the predictive model remains intact.
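The weighting step described above can be sketched without any deep learning framework. The function below assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been obtained (for example, from a framework's automatic differentiation); the function name and the nested-list layout (K maps of H × W values) are assumptions made for illustration.

```python
def grad_cam(feature_maps, gradients):
    """Compute a coarse class activation heatmap.

    feature_maps: K x H x W activations of the last convolutional layer.
    gradients:    K x H x W gradients of the class score w.r.t. those activations.
    Returns an H x W heatmap.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Importance weight of each feature map: global average pooling of its gradients.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    # Weighted sum of the feature maps, followed by ReLU to keep positive evidence only.
    return [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
```

Because the function only reads activations and gradients, nothing in the predictive model has to be altered or retrained, which is the property emphasized in the paragraph above.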

System Architecture and Methodology
The system is developed with two main purposes:
• Image classification;
• Result explainability.
Two integrated subsystems in the whole architecture interact seamlessly and are responsible for the fulfilment of each purpose (Figure 2). Concerning the image classification task, an ensemble classifier consisting of three different pretrained implementations of the EfficientNets group is employed in a parallel configuration that results in the concatenation of three different groups of feature maps. The models are pretrained on the ImageNet dataset [49] to classify 1000 general classes, which results in a generalized ability to distinguish visual patterns in more specific tasks. In our method, the pretrained models are utilized without modification for feature extraction. Although fine-tuning was also performed by unfreezing a variety of top layers of the base classifiers and tuning the weights of the remaining neural network structure to the specific task, the best classification results are obtained with the unmodified feature-extraction configuration. Prior to inserting an input image into the ensemble architecture, the images are resized and the pixels normalized according to the guidelines recommended by the authors of each DCNN architecture, and the dataset is split into two parts, 60% (training) and 40% (validation). The training set is augmented to three times its initial size by means of three randomized operations: flip, rotation, and zoom. The final concatenated set of features is fed into a fully connected layer that acts as a classifier, following typical best practices of deep CNNs. For the selection of the pretrained models, a preliminary examination of the individual performance on the two datasets led to the selection of the best performing models in terms of accuracy. The best individually performing deep CNNs are the InceptionNet, XceptionNet, and the EfficientNets group. Consequently, an ablation study is conducted between these selections in groups of three to determine the best selection.
Upon removing a CNN, the influence of this removal is measured in terms of the difference in accuracy. The final selection results in the EfficientNets B1, B2, and B3. Although the basic building blocks of the three networks are the same, the required diversity among the base classifiers of the ensemble is achieved by the different values provided by the compound scaling method. Regarding the explainability task, the relevant modules are attached to the architecture of the classification scheme, providing feedback for the localization of the important visual patterns that influence the outcome of the classifier without interfering with its functionality. When utilizing the Grad-CAM technique in a single-classifier environment, the feature maps of the last convolutional layer and the gradients of the score of each class with respect to these feature maps are necessary to produce a heatmap with the explainability visualizations. As explained in [29], the technique can be divided into three steps. The first step is the calculation of the gradient G (Equation (2)): $G = \frac{\partial Y^{c}}{\partial A^{k}}$ (2) where $Y^{c}$ is the raw output of the CNN before applying softmax to turn it into a probability and $A^{k}$ are the generated feature map activations. The index c is the class for which the heatmap is generated, since the technique is class dependent, and k indexes the utilized convolutional filters. An important requirement for validating the results is the differentiability of the part of the network between the final convolutional layer and the softmax layer (Figure 3). The second step is the calculation of the alpha values (Equation (3)): $\alpha_{k}^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial Y^{c}}{\partial A_{ij}^{k}}$ (3) This operation is performed by applying global average pooling on the gradients G, where Z is the number of pixels in the feature map and i, j run over its spatial positions.
The third step is the application of ReLU to the sum, over all feature maps, of the product of each feature map with the corresponding alpha value (Equation (4)): $L^{c} = \mathrm{ReLU}\left(\sum_{k} \alpha_{k}^{c} A^{k}\right)$ (4) where $L^{c}$ denotes the resulting Grad-CAM heatmap for class c. In the ensemble environment, all the information necessary for the calculation of the Grad-CAMs exists but requires the addition of a concatenation layer that brings together all extracted feature maps. This concatenation layer is placed after the last convolutional layer of each base classifier. This minor modification enables the integration of the Grad-CAM explanation module into the ensemble classifier. Apart from the calculation of Grad-CAMs, an independent procedure is conducted in parallel, namely guided backpropagation. Guided backpropagation is the combination of two distinct operations. The first is the backpropagation at ReLU activation functions. This backward pass ensures that values that were greater than zero during the forward pass in the preceding filter are passed as is one step backwards. The second operation is the deconvolution at ReLU, in which values that are greater than zero in the current filter are passed as is one step backwards. To reach the final heatmap, the results of guided backpropagation and Grad-CAM are multiplied.
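The two ensemble-specific operations described above, namely the concatenation of the base classifiers' feature maps and the multiplication of the guided backpropagation result with the Grad-CAM heatmap, can be sketched as follows. This is a simplified illustration, not the implementation used in this work: the function names are hypothetical, all base classifiers are assumed to produce feature maps of the same spatial size, and the coarse heatmap is enlarged with nearest-neighbor repetition by an integer factor (other interpolation schemes are equally possible).

```python
def concat_channels(*feature_map_groups):
    """Concatenate several K_i x H x W groups of feature maps along the channel axis."""
    shapes = {(len(group[0]), len(group[0][0])) for group in feature_map_groups}
    if len(shapes) != 1:
        raise ValueError("all groups must share the same H x W size")
    return [channel for group in feature_map_groups for channel in group]


def upsample_nearest(heatmap, factor):
    """Enlarge an H x W heatmap by repeating every value factor times in both directions."""
    return [
        [value for value in row for _ in range(factor)]
        for row in heatmap
        for _ in range(factor)
    ]


def guided_grad_cam(guided_backprop, cam):
    """Multiply a full-resolution guided backpropagation map with an upsampled Grad-CAM map."""
    factor = len(guided_backprop) // len(cam)
    enlarged = upsample_nearest(cam, factor)
    return [
        [g * c for g, c in zip(g_row, c_row)]
        for g_row, c_row in zip(guided_backprop, enlarged)
    ]
```

The concatenated maps returned by `concat_channels` can then be treated like the feature maps of a single classifier when computing the alpha values and the heatmap of Equations (3) and (4).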

Datasets and Hardware
Two widely utilized and publicly available datasets from the breast and colon cancer domains are the main sources of visual information exploited in this paper for the training and validation of the deep convolutional networks. The first dataset, named Breast Cancer Histopathological Image Classification (BreakHis), consists of 7909 microscopic breast tumor tissue images collected from 82 patients using different magnifying factors [50]. The images are: [51]. Samples of the second dataset are shown in Figure 5. The class distribution of the CRC dataset is depicted in Table 3.
Training and validation of the developed implementations take place on a remote server equipped with two GPUs. The GPUs are the TITAN Xp (11 GB, corecount: 30, coreClock: 1.582 GHz) and the GeForce GTX 970 (4 GB, corecount: 13, coreClock: 1.392 GHz). All of the basic algorithmic operations concerning the deep neural network approaches and the Grad-CAM technique are implemented using the TensorFlow 2.3 framework for the Python programming language.
Figure: Overview of the BreakHis dataset. Each row depicts a specific tissue type: adenosis is indicated as (a), fibroadenoma as (f), phyllodes tumor as (pt), tubular adenoma as (ta), ductal carcinoma as (dc), lobular carcinoma as (lc), mucinous carcinoma as (mc), and papillary carcinoma as (pc). Each number stands for a specific magnification factor: 1 for 40×, 2 for 100×, 3 for 200×, and 4 for 400× (e.g., the pc2 image depicts a papillary carcinoma at 100× magnification).

Evaluation Metrics
In terms of classification performance, the two datasets analyzed in Section 3.1 are split in a 60-40% train-validation ratio for the colon cancer dataset and 70-30% for the breast cancer dataset. Although the 60-40% split in the first case is considered rather strict, this choice supports the purposes of this study concerning the trade-off between performance and explainability. Single EfficientNets achieve near-perfect accuracy on the colon cancer dataset. The chosen split (60-40%) lowers the accuracy of single EfficientNets and, therefore, makes the improvement in performance obtained with ensemble classifiers visible. The performance metrics utilized for the binary and multiclass classification tasks are described hereafter, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:
• Accuracy is defined as the number of correctly classified instances divided by the total number of instances, as shown in Equation (5): $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (5)
• Recall is defined as the number of true positives divided by the sum of true positives and false negatives, as shown in Equation (7): $\text{Recall} = \frac{TP}{TP + FN}$ (7)
• Area under Curve (AUC) is defined as the area under the receiver operating curve. The receiver operating curve is drawn by plotting the true positive rate (TPR) versus the false positive rate (FPR) at different classification thresholds. TPR is another word for recall, whereas FPR is the number of false positives divided by the sum of true negatives and false positives, as shown in Equation (8): $\text{FPR} = \frac{FP}{FP + TN}$ (8)
Although balanced accuracy is the appropriate performance metric when dealing with imbalanced datasets such as BreakHis, accuracy is chosen in order to allow comparison with the state of the art. To measure the performance of the explanation scheme, an evaluation tool runs online for specialists to test and review the results of the explanation schemes. The results of this evaluation are reported in the following section.
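The metrics defined above can be computed from confusion-matrix counts with a few lines of code. The sketch below is a minimal illustration for the binary case; the function names are hypothetical, the balanced accuracy is included only because it is mentioned in the text, and the AUC is approximated with the trapezoidal rule over thresholds taken from the distinct scores (an assumed but common choice).

```python
def accuracy(tp, tn, fp, fn):
    """Correctly classified instances divided by all instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """FP / (FP + TN)."""
    return fp / (fp + tn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the true positive rate and the true negative rate (binary case)."""
    return (recall(tp, fn) + (1.0 - false_positive_rate(fp, tn))) / 2.0


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, starting with the strictest threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    # Trapezoidal rule over consecutive ROC points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, with 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives, the accuracy and the balanced accuracy both equal 0.85 because the two classes are equally represented; on an imbalanced dataset the two values would diverge.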

Results
In order to determine which pretrained deep convolutional neural networks perform better on the specific datasets, a preliminary experiment is conducted with single classifiers. We choose a set of well-established architectures from the pool of the TensorFlow 2.3 API (https://www.tensorflow.org/, accessed on 26 September 2021). The hyperparameters for the deep convolutional architectures were set after experimentation to the values shown in Table 4. To further improve the performance of each classification scheme, experiments are conducted with different custom learning rate schedulers, resulting in the learning rate scheduler expressed by Equation (9): $Lr(\text{epoch}) = Lr_{start} + \frac{Lr_{max} - Lr_{start}}{k \times \text{epoch}}$ (9) where Lr is a function of the epoch, Lrmax is set to 0.00005, Lrstart to 0.0001, and k is a hyperparameter that is computed by heuristic methods. In the case of EfficientNet B0, accuracy increases by 1.6% when utilizing the above learning rate scheduler compared with using a plain Adam optimizer. In Table 5, the corresponding results for the binary (benign vs. malignant) breast cancer task and for the multiclass colon cancer classification task (adipose vs. background vs. debris vs. lymphocytes vs. mucus vs. smooth muscle vs. normal colon mucosa vs. cancer associated stroma vs. colorectal adenocarcinoma epithelium) are depicted. By forming different groups of three baseline classifiers and removing one at a time, two ensemble architectures were formed. Each architecture contains the baseline implementations that had the greatest impact on the performance metrics when removed. The two qualified architectures are the EfficientNet group consisting of B0, B1, B2 and the group consisting of B1, B2, B3. In order to evaluate the effect of utilizing ensemble architectures against the baselines, Table 6 demonstrates the performance metrics for each configuration.
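The learning rate schedule of Equation (9) can be written as a short function. This is a sketch under stated assumptions: the value of k is not reported in the text, so k = 1.0 below is a placeholder, and epochs are assumed to be counted from 1 so that the denominator never becomes zero.

```python
def learning_rate(epoch, lr_start=1e-4, lr_max=5e-5, k=1.0):
    """Learning rate of Equation (9) for a one-based epoch index.

    lr_start and lr_max are the values reported in the text;
    k = 1.0 is a placeholder, since the heuristically chosen value is not given.
    """
    if epoch < 1:
        raise ValueError("epoch is assumed to start at 1")
    return lr_start + (lr_max - lr_start) / (k * epoch)


# Learning rates of the first five epochs under the assumptions above.
schedule = [learning_rate(epoch) for epoch in range(1, 6)]
```

With the reported values and k = 1, the formula yields Lrmax at the first epoch and approaches Lrstart as the number of epochs grows. In a Keras training loop, such a function could be passed to `tf.keras.callbacks.LearningRateScheduler`; that callback supplies zero-based epoch indices, so an offset of one would be needed.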
The performance of the baseline architectures leaves little room for improvement, even when the dataset is split in a 60-40% ratio. Even so, the ensemble architecture managed a minor improvement in some cases, and the EfficientNet B0-2 ensemble is on par with the baselines for the colon cancer dataset. In the worst-case scenario, the proposed ensemble architectures are on par with the baseline implementations. The task of classification is made more difficult by splitting the dataset 40-60% (training-validation) and 30-70%. Each experiment is conducted by splitting the dataset into two subsets at the beginning of the study to avoid introducing bias. Consequently, each validation process is conducted without receiving any information about the images used for training. The splits are bootstrapped 10 times to enhance randomness. In Table 7, the results from these two extreme splits are demonstrated. The difference in performance metrics is not significant even as the classification problem becomes more difficult. Returning to the BreakHis dataset, four datasets are generated by partitioning the initial dataset into subsets based on the magnification factor. The four datasets correspond to the magnification factors 40×, 100×, 200×, and 400×. Two classification tasks are addressed depending on the assigned labels. The first classification task is binary, where the classes are benign and malignant, whereas the second classification task is multiclass, where the classes are adenosis, fibroadenoma, tubular adenoma, phyllodes tumor, ductal carcinoma, lobular carcinoma, mucinous carcinoma, and papillary carcinoma. The training-validation split is set to 70-30%.
As shown in Tables 8 and 9, ensemble classifiers achieve better performance at all magnification factors in both tasks, apart from one binary classification case at 100×, where the classifiers perform equally.
Regarding the explainability task of the proposed methodology, a test bench application was developed for visual inspection and verification of the produced results by specialized medical personnel. The web interface (Figure 6) is available at the URL http://83.212.75.102:3005/ (accessed on 26 September 2021). Upon uploading a histopathology image, the sample is sent to the back end, where the best performing ensemble architecture returns the classification result along with a heatmap of the original image. The visual patterns of the image that are characterized as highly related to the result are painted red, whereas the irrelevant ones are painted blue. The explainability capability of the different deep frameworks or ensemble classifiers is evaluated on a qualitative basis by expert pathologists in the respective field. The specialists inspect the highly related visual patterns and assess the results according to their prior experience in histopathology image-based diagnosis. The initial qualitative results show significant agreement between the specialists and the ensemble classifier concerning the areas responsible for the characterization of results. The images are selected randomly from the validation set of the BreakHis dataset and from the Bachs dataset [52] and are processed by both the Grad-CAM and Guided Grad-CAM explainability techniques. The visualization and classification results are analyzed by specialized personnel, who comment on the classification into the benign or malignant class and on the localization of the important visual patterns responsible for the classification result. In Figure 7, a benign adenosis is depicted at 400× magnification.
The ensemble classifier classifies the image as probably benign with high confidence, in contrast to the experienced physician, who refers to this image as not being totally representative of the benign class in terms of morphological patterns. The red highlighted regions are localized mostly, though not entirely, on epithelial tissue. Humans tend to focus their attention on this kind of tissue because carcinomas are malignant neoplasms of epithelial tissue. On the other hand, nearby stromal and epithelial areas are colored yellow, as they are in the vicinity of the most important regions. Concerning the Guided Grad-CAM algorithm, the coloring of the respective areas is fuzzier but still more intense on the epithelial patterns.
The next image, presented in Figure 8, is taken from the Bachs dataset and depicts an in situ carcinoma; the depicted patterns are visually representative of the malignant class. The classifier correctly predicts the class with high confidence and manages to generalize well on an unknown dataset with several variances owing to different production and staining procedures. Concerning the Grad-CAM technique, the highly important regions, colored red, correspond to epithelial cells, whereas in the Guided Grad-CAM case the coloring of the respective regions is fuzzy. Some yellow regions are considered of less importance to the classifier and are highlighted due to their vicinity to the most important regions, while other yellow regions are colored for no reason that is obvious to experienced physicians. In other cases, both algorithms fail to highlight the regions that experienced physicians consider significant. In Figure 9, drawn from the BreakHis dataset, a benign fibroadenoma is depicted. Fibroadenomas are benign tumors of the epithelial and stromal tissue.
The Grad-CAM algorithm highlights mostly epithelial and stromal regions but ignores epithelial tissue on the lower left part of the image, which is also indicative of the disease. Nevertheless, as physicians state, the depicted patterns are not highly indicative of the disease in terms of morphology. A special case arises when images contain uniform patterns of malignant or benign tissue, as shown in Figure 10. In the figure, the depicted patterns are all indicative of a malignancy. Since there is no specific area of the image that the algorithm individually detects as highly responsible for the outcome, it returns medium values for all areas of the image, while some artifacts might be the cause of the high values assigned on the edges.
To compare the explainability properties of single and ensemble classifiers, experiments were conducted with images from the BreakHis and Bachs datasets. In Figure 11, the interpretability results for an adenosis (BreakHis) are depicted along with the respective heatmaps, whereas, in Figure 12, the corresponding outcome for an in situ carcinoma (Bachs) is shown for single and ensemble classification schemes. A closer look at the results for all images and the heatmaps generated by the base classifiers reveals that each classifier focuses on regions of interest (ROIs) that differ from and/or overlap each other. More specifically, the single classifiers EfficientNet B1 and B2 in Figure 11 have highlighted the bottom right tile of the image with high importance values, corresponding to orange and dark red colors, whereas the EfficientNet B3 classifier shows no interest in this tile. In the same figure, the tiles situated in the upper left corner are considered important by the B1 and B3 classifiers, but not by B2.
On the other hand, the results of the ensemble classifier B1-2-3 incorporate the ROIs of its constituent base classifiers on a weighted scheme in order to support the diversity of the base classifiers. This weighted aggregation of the designated ROIs, instead of their partial selection, leads to increased accuracy in the case of the ensemble classifier. In Figure 11, the ensemble classifier focuses its attention on the tiles situated in the lower and upper right corners as well as on the upper left area of the image, highlighting each area according to the weighted classification scheme. Taking into consideration all the tiles that the base classifiers deem important leads to improved classification results. The same behavior is observed on the in situ sample in Figure 12, although the image derives from a different dataset.
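The weighted aggregation of ROIs described above can be illustrated with a small sketch. The text does not specify how the weights are obtained or how the maps are merged, so the weighted average used here, the function name `combine_heatmaps`, and the toy 2×2 maps are all assumptions chosen only to mirror the Figure 11 narrative.

```python
def combine_heatmaps(heatmaps, weights):
    """Weighted average of equally sized 2-D heatmaps, rescaled to [0, 1].

    heatmaps: list of 2-D lists, one per base classifier.
    weights:  one non-negative weight per base classifier (hypothetical values).
    """
    if not heatmaps or len(heatmaps) != len(weights):
        raise ValueError("need exactly one weight per heatmap")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    combined = [[sum(w * h[r][c] for h, w in zip(heatmaps, weights)) / total
                 for c in range(cols)] for r in range(rows)]
    # Rescale so the strongest region maps to 1 (red in a typical colormap).
    peak = max(max(row) for row in combined)
    if peak > 0:
        combined = [[v / peak for v in row] for row in combined]
    return combined


if __name__ == "__main__":
    # Made-up maps: each base classifier attends to different tiles.
    b1 = [[0.9, 0.0], [0.0, 0.8]]  # upper left and bottom right
    b2 = [[0.0, 0.0], [0.0, 0.9]]  # bottom right only
    b3 = [[0.8, 0.0], [0.0, 0.0]]  # upper left only
    weights = [1.0, 1.0, 1.0]      # hypothetical equal weights
    for row in combine_heatmaps([b1, b2, b3], weights):
        print([round(v, 2) for v in row])
```

In this toy case the combined map keeps both the upper left and the bottom right tiles, whereas any single base map would miss one of them or weight them differently.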

Discussion
The main goal of the article is the proposal and evaluation of an explainability scheme in an ensemble environment; classification performance is therefore treated as a secondary feature of the proposed methodology. In the proposed framework, the experimental results are produced by applying the presented methodology to two well-known datasets, BreakHis and Bachs. The utilization of different datasets enables the exploration of generalization properties.
Evaluating the classification accuracy with images belonging to the same dataset shows that the task is trivial even for the plain (non-ensemble) architectures. The EfficientNet series surpasses other well-established architectures (VGG, InceptionNet, ResNet, XceptionNet) and achieves higher performance in both accuracy and AUC for the breast and colon datasets, even when the training-validation split is 60-40%. The results leave little room for improvement when applying the ensemble architecture. However, in some cases, such improvement occurs. The signs of better performance are more evident when splitting the datasets in a 40-60% or a 30-70% ratio. These extreme setups make it more difficult for the plain architectures to perform as well as the ensemble configurations and, therefore, underline that the added complexity of ensemble classifiers is useful in further improving accuracy.
Utilizing ensemble architectures to achieve better results generally hinders explainability due to the added complexity. However, that is not the case for the Grad-CAM and Guided Grad-CAM techniques, which are seamlessly integrated in the network's architecture. The quality of highlighting and correctly detecting the most important regions for the final prediction is evaluated by experienced physicians. For images of the same dataset, the explainability module manages in most cases to highlight in red (highly significant) the regions that are indicative of the presence or absence of the respective pathology. The red highlighted regions are usually epithelial cells and, in the case of malignancies, usually atypical cells with hyperchromatic (dark colored) nuclei, which is in accordance with the common practice of physicians. However, the highlighting is not performed for all similar regions in an image, which would be desirable, and, in some cases, it is localized on dark colored artifacts. Therefore, the implementation of an artifact removal methodology would further enhance the generated results. Yellow regions (less important regions) are generated by the explainability module of the Grad-CAM technique in the vicinity of red highlighted regions. A positive aspect of the method, as shown in Figure 8, a representative sample of the cases deriving from the Bachs dataset, is the fact that it generalizes well on unseen data. An important drawback of the proposed explainability methodology is the failure to highlight important regions when the morphological characteristics of the disease are uniform.
To a certain extent, this is acceptable, since there is no particular region that stands out to be highlighted and the granularity of the proposed methodology is coarse. Although the Guided Grad-CAM technique was intended to solve the issue of granularity, its visualizations are fuzzier and less expressive than the ones produced by Grad-CAM.
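The coarse granularity mentioned above follows from how a Grad-CAM map is built: it has the spatial size of the chosen convolutional layer, which is typically far smaller than the input image, and is only upsampled afterwards. The sketch below shows the core computation on plain nested lists. It is a dependency-free illustration of the standard Grad-CAM formulation, not the code used in this work, and it assumes that the feature maps and the gradients of the class score with respect to them have already been extracted from the network.

```python
def grad_cam(activations, gradients):
    """Core Grad-CAM computation for one convolutional layer.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]:   d(class score) / d(activations[k][i][j]).
    Returns a rows x cols map after ReLU, scaled to [0, 1].
    """
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weights: global average pooling of the gradients.
    alphas = [sum(sum(row) for row in g) / (rows * cols) for g in gradients]
    # Weighted sum of the feature maps, followed by ReLU.
    cam = [[max(0.0, sum(a * fmap[i][j] for a, fmap in zip(alphas, activations)))
            for j in range(cols)] for i in range(rows)]
    peak = max(max(row) for row in cam)
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


if __name__ == "__main__":
    # Two made-up 2x2 feature maps: the first supports the class, the second opposes it.
    acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    for row in grad_cam(acts, grads):
        print(row)
```

Every cell of the returned map covers a whole block of input pixels once it is resized to the image, which is why the highlighted areas appear as tiles rather than as outlines of individual morphological structures.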
Concerning the comparison of explainability properties between baseline and ensemble classifiers, it has been noted that taking into consideration all the visual patterns that the baseline classifiers individually consider important can be beneficial, in the same way that ensemble classifiers perform better by combining the decisions of single classifiers on a weighted scheme.

Conclusions
In this work, we have investigated the application of the Grad-CAM and Guided Grad-CAM explainability techniques on ensemble classification schemes based on pretrained deep convolutional network architectures. It has been shown that the combination of different architectures improves the performance of the designated classifiers in two different use case scenarios. Concerning the explainability results generated by the standalone web application, the initial feedback is promising in many cases, but the method fails to distinguish important patterns where the depicted malignancy is visually uniform. Another drawback is the inability to localize on specific depicted morphological findings, since the Grad-CAM technique can only highlight certain rectangular regions, whereas Guided Grad-CAM is fine-grained and focuses on specific pixels. Therefore, future work should be directed towards the combination of these techniques with complementary ones that manage to distinguish morphological entities in histopathology images. In addition, future effort should be directed towards the exploration of explainability techniques that can combine the coarse-grained properties of the Grad-CAM approach with the strong discrimination abilities of the morphological patterns depicted in histopathology images.

Institutional Review Board Statement: This work did not require approval from a research ethics board because only computational data analysis was performed, and no animal or human experimentation was involved.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.